The main objective of this project is to learn the compute shader in OpenGL. First, following Reference 7 by Mike Bailey, I implemented a simple particle system using a compute shader. This tutorial helped me understand the basic idea of compute shaders and provided a basic implementation of a particle system. I had fun implementing it and learned a lot about compute shaders.

The idea of tile-based deferred shading is that after the g-buffer passes, a light culling pass is added so that lights that are not inside the view frustum are discarded during shading. The shading itself is usually done in the same compute shader.

My implementation is different from the original one: I decided to pass the visible-light-index data calculated during light culling from the GPU back to the CPU.

// MAX_POINT_LIGHT_COUNT == 80, in total 320 bytes
// For example, if only the 4th and 7th lights are visible, the visibility array looks like:
// 0, 0, 0, 1, 0, 0, 1, 0, 0, .....
int g_disablePointLightIndices[MAX_POINT_LIGHT_COUNT];

In the next frame, according to this data, I disable all lights that are not visible. In this way, a light outside the view frustum will neither generate shadow maps nor shade any fragment.

This is a random scene with multiple lights.

This is the Per-tile visible light count image
(black = 0 lights, white >= 10 lights, resolution: 80 x 45)

Pros: speeds up the shadow map generation process.
The left image is a scene I captured randomly. The right image tells us that out of the 32 lights in the scene, only 23 are visible; that is, in this view, this method saves the time of generating 9 point-light shadow maps.

(You can click the image to see a clearer version.)

Cons: fragments outside the light volume will still be shaded during the deferred lighting pass.
The right image demonstrates which fragments are illuminated by the point light in the left image. With the original tile-based deferred shading method, none of the black pixels in the right image would be shaded by this point light, but my implementation does not discard those black pixels.

(You can click the image to see a clearer version.)



First of all, we need to compile the compute shader and attach it to a program. The process is the same as compiling and linking a vertex/fragment shader pair, and setting up uniform variables also works the same way.

The following macros define my work groups. The size of a tile is 16×16, so num_of_group.x == (1280 / 16) == 80 and num_of_group.y == (720 / 16) == 45. That is why the resolution of my Per-tile visible light count image is 80 x 45.

// screen resolution: 1280 * 720
#define WORK_GROUP_SIZE 16

We will use GL_SHADER_STORAGE_BUFFER to modify or retrieve data from the compute shader. The first step is to generate a shader storage buffer and upload initial data to it.

// 1. Generate and bind buffer.
glGenBuffers(1, &g_lightVisibilitiesID);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, g_lightVisibilitiesID);

// 2. allocate memory with certain size of data. Here is 320 bytes
const GLsizei bufferSize = sizeof(GLuint) * MAX_POINT_LIGHT_COUNT;

// 3. write initial data to the buffer
// GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT tells glMapBufferRange that I am going to overwrite the data;
// after glUnmapBuffer() is called, the updated data from the CPU becomes visible to the GPU
GLuint cleanData[MAX_POINT_LIGHT_COUNT] = { 0 };
// Modifying data where this pointer point to
GLuint* _data = (GLuint *)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, bufferSize, GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(_data, cleanData, bufferSize);

// 4. unmap the buffer, clean up the binding and check for errors
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
assert(GL_NO_ERROR == glGetError());

// by now the buffer holds its initial data and is ready to use

(The initialization of the image2D in the compute shader and how to bind it to a texture will not be covered here. If you are interested, please read Reference 6.)

Using the compute shader on the CPU side

Then we can start to use the compute shader. But before that, we should know what to do on the CPU and the pipeline of the whole light culling process. The pipeline of my implementation looks like this:

  1. Update uniform data (matrices, light data…). The inverse projection matrix and the view matrix are necessary
  2. Clear the light visibilities in the GPU (set the whole array to zero)
  3. glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, g_lightVisibilitiesID);
  4. Use the compute shader program
  5. (glBindImageTexture(…), for storing the Per-tile visible light count image. This is optional)
  6. Read the depth texture from the G-Buffer of the previous stage
  7. glDispatchComputeGroupSizeARB(…)
  8. glMemoryBarrier(…); // Wait until all work groups have been processed
  9. Get the data back from the GPU through the shader storage buffer
  10. Pass the data to the application thread and set the visibilities for the next draw call
  11. (In another mode, draw the Per-tile visible light count image if desired)
// 1. pass in uniform data; the inverse projection matrix and the view matrix are necessary
	// .......some omitted codes here .......
// 2. clear light visibilities
	// This part is similar to how we initialized this buffer
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, g_lightVisibilitiesID);
	const GLsizei bufferSize = sizeof(GLuint) * MAX_POINT_LIGHT_COUNT;
	const GLbitfield bufMask = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
	GLuint cleanData[MAX_POINT_LIGHT_COUNT] = { 0 };
	GLuint* _data = (GLuint *)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, bufferSize, bufMask);
	memcpy(_data, cleanData, bufferSize);
	glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

// 3. bind buffer base, tell shader the binding order
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, g_lightVisibilitiesID);
// 4. Use compute shader program something like glUseProgram(id)
	// .......some omitted codes here .......
// 5. bind the Per-tile visible light count image, optional
	// .......some omitted codes here .......

// 6. Read the depth texture from G-Buffer in the previous state

	/** 7. Call the compute shader! 1280/16 = 80 by 720/16 = 45 groups of 16 x 16 threads */
	glDispatchComputeGroupSizeARB(80, 45, 1, WORK_GROUP_SIZE, WORK_GROUP_SIZE, 1);

	/** 8. Wait until all work groups have been processed */
	glMemoryBarrier(GL_ALL_BARRIER_BITS); // conservative; the buffer is read back with glMapBuffer next

// 9. Get buffer data
GLuint visiblePointLightCount = 0;
GLuint visibilities[MAX_POINT_LIGHT_COUNT];
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, g_lightVisibilitiesID);
	const GLsizei bufferSize = sizeof(GLuint) * MAX_POINT_LIGHT_COUNT;
	// the function here is no longer glMapBufferRange, and the last parameter becomes GL_READ_ONLY
	// This tells the GPU that I only need to read the buffer
	GLuint* _dataOut = (GLuint *)glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_READ_ONLY);
	// copy the data out
	memcpy(visibilities, _dataOut, bufferSize);
	// clean up the binding
	glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
	glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

// 10. Pass the data to the application thread and set visibilities for the next draw call
	int pLightCount = g_dataRenderingByGraphicThread->g_pointLights.size();
	int invisiblePointLightCount = 0;
	for (int i = 0; i < pLightCount; ++i)
	{
		// if the point light is not visible in this view, disable it so it neither
		// generates a shadow map nor shades any fragment
		if (visibilities[i] == 0)
			g_dataGetFromRenderThread->g_disablePointLightIndices[invisiblePointLightCount++] = i;
	}
	// update global lighting data
	visiblePointLightCount = pLightCount - invisiblePointLightCount;
	g_dataGetFromRenderThread->g_visiblePointLightCount = visiblePointLightCount;

In step 7, glDispatchComputeGroupSizeARB is different from the traditional glDispatchCompute: it specifies both the number of groups and the local group size in CPU code, so you do not have to hard-code the local size in the shader again. Instead, you write layout( local_size_variable ) in; in the shader code. This comes from the GL_ARB_compute_variable_group_size extension.

Tile-based light culling Compute shader.glsl

You can find the complete code here:

Top part: compatibility, version, and extensions.

#version 430 compatibility
#extension GL_ARB_compute_shader: 			enable
#extension GL_ARB_shader_storage_buffer_object: 	enable
#extension GL_ARB_compute_variable_group_size : 	enable

Parameter definitions.

// texture unit 0, depth buffer
uniform sampler2D gDepth; 

// shared memory
// note: GLSL does not allow initializers on shared variables,
// so these must be initialized inside main() before use
shared uint minDepthInt;
shared uint maxDepthInt;
shared uint visibleLightCount;

// std430 removes the std140 restriction of rounding array strides up to a multiple of 16 bytes; 320 bytes in total
// this buffer is the shader storage buffer
layout(std430, binding = 3) buffer lightVisibility
{
	uint g_lightVisibilities[MAX_LIGHT_COUNT];
};

// image2D for Per-tile visible light count image
layout(rgba32f, binding = 1) writeonly uniform image2D img_output;

layout( local_size_variable ) in;

Helper function definitions. They are from Reference 1, and I have changed a lot because I am using OpenGL while the reference uses DirectX.

// Helper functions

// this creates the standard Hessian-normal-form plane equation from three points, 
// except it is simplified for the case where the first point is the origin
vec4 CreatePlaneEquation( vec4 b, vec4 c )
{
    vec4 n;

    // normalize(cross( b.xyz, c.xyz )), except we know "a" is the origin;
    // the normal direction should point to the inside of the frustum,
    // so use the right-hand rule to determine the cross product order
    n.xyz = normalize(cross( b.xyz, c.xyz ));

    // -(n dot a), except we know "a" is the origin
    n.w = 0;

    return n;
}

// point-plane distance, simplified for the case where 
// the plane passes through the origin
float GetSignedDistanceFromPlane( vec4 p, vec4 eqn )
{
    // dot( p.xyz, eqn.xyz ) + eqn.w, except we know eqn.w is zero
    // (see CreatePlaneEquation above)
    return dot( p.xyz, eqn.xyz );
}

// convert a point from post-projection space into view space
vec4 ConvertProjToView( vec4 p )
{
    // InvProj: inverse projection matrix, passed in at the first step of the CPU pipeline
    p = InvProj * p;
    p /= p.w;
    return p;
}

// convert a depth value from post-projection space into view space
float ConvertProjDepthToView( float z )
{
    // depth buffer values are in [0, 1]; remap to NDC [-1, 1] first
    float newZ = 2.0f * z - 1.0f;
    newZ = 1.f / (newZ * InvProj[2][3] + InvProj[3][3]);
    return newZ;
}

Calculate the min and max depth of each tile.

// initialize the shared values once per tile (shared variables cannot have initializers)
if (gl_LocalInvocationIndex == 0)
{
	minDepthInt = 0xFFFFFFFF;
	maxDepthInt = 0;
	visibleLightCount = 0;
}
barrier();

float depthMinFloat = 100000; // an arbitrary large number
float depthMaxFloat = 0;
// Get the texture coordinate by using the pre-set built-in variables of the compute shader
// Read Reference 7 for more information about these variables
vec2 texCoord = vec2(
	(gl_WorkGroupID.x * WORK_GROUP_SIZE + gl_LocalInvocationID.x) / float(SCREEN_WIDTH), 
	(gl_WorkGroupID.y * WORK_GROUP_SIZE + gl_LocalInvocationID.y) / float(SCREEN_HEIGHT)); 
// get the depth as a float
// depthFloat is non-linear
float depthFloat = texture(gDepth, texCoord).r;
// convert the depth to view-space z; it is always positive
float viewPosZ = ConvertProjDepthToView( depthFloat );

// convert the depth from float to uint, since atomics only work on integers
uint depthuInt = floatBitsToUint(viewPosZ);
// calculate the min and max depth of this tile
atomicMin(minDepthInt, depthuInt);
atomicMax(maxDepthInt, depthuInt);

// wait until the min and max depth have been calculated by all threads of this tile
barrier();

Construct frustums of each tile.

vec4 frustumEqn[4];
	// construct the frustum for this tile, getting the rect coordinates of the tile in pixels
	uint minX = WORK_GROUP_SIZE * gl_WorkGroupID.x;
	uint minY = WORK_GROUP_SIZE * gl_WorkGroupID.y;
	uint maxX = WORK_GROUP_SIZE * (gl_WorkGroupID.x + 1);
	uint maxY = WORK_GROUP_SIZE * (gl_WorkGroupID.y + 1);

	vec4 corners[4];
	// create a clockwise-ordered square
	corners[0] = ConvertProjToView(vec4( (float(minX)/SCREEN_WIDTH) * 2.0f - 1.0f, 	(float(minY)/SCREEN_HEIGHT) * 2.0f - 1.0f, 1.0f, 1.0f));
	corners[1] = ConvertProjToView(vec4( (float(maxX)/SCREEN_WIDTH) * 2.0f - 1.0f, 	(float(minY)/SCREEN_HEIGHT) * 2.0f - 1.0f, 1.0f, 1.0f));
	corners[2] = ConvertProjToView(vec4( (float(maxX)/SCREEN_WIDTH) * 2.0f - 1.0f, 	(float(maxY)/SCREEN_HEIGHT) * 2.0f - 1.0f, 1.0f, 1.0f));
	corners[3] = ConvertProjToView(vec4( (float(minX)/SCREEN_WIDTH) * 2.0f - 1.0f, 	(float(maxY)/SCREEN_HEIGHT) * 2.0f - 1.0f, 1.0f, 1.0f));

	// create plane equations using the four corners of the tile
	// a plane is ax + by + cz + d = 0, so a vec4 can represent a plane
	// two consecutive corners plus the origin define a plane
	for(uint i = 0; i < 4; i++)
		frustumEqn[i] = CreatePlaneEquation( corners[i], corners[(i+1)&3] );

	// wait until all threads finish creating the frustums
	barrier();

Do light-frustums intersection check and record the data.

depthMinFloat = uintBitsToFloat(minDepthInt); // nearest to the camera, positive
depthMaxFloat = uintBitsToFloat(maxDepthInt); // farthest from the camera, positive

uint threadPerTile = WORK_GROUP_SIZE * WORK_GROUP_SIZE;
uint localIdxFlattened = gl_LocalInvocationID.x + gl_LocalInvocationID.y * WORK_GROUP_SIZE;
	// loop over the lights and do a sphere vs. frustum intersection test;
	// each thread processes one point light in parallel, so with at most 80
	// point lights and 256 threads per tile the loop runs only one iteration
	for (uint i = 0; i < g_pointLightCount; i += threadPerTile)
	{
		uint lightIndex = i + localIdxFlattened;
		if(lightIndex < g_pointLightCount)
		{
			vec4 pLightLocation = vec4(g_pointLights[lightIndex].position, 1.0);
			float r = g_pointLights[lightIndex].radius;
			// things in front of the camera have negative z in view space
			vec4 pLightLoc_ViewSpace = ViewMatrix * pLightLocation;
			// z test: check whether the sphere overlaps the [min, max] depth range of the tile;
			// since all visible z values are negative in view space, negate the two depth values
			if(pLightLoc_ViewSpace.z - r < -depthMinFloat && pLightLoc_ViewSpace.z + r > -depthMaxFloat)
			{
				bool bInFrustum = true;
				for(int j = 0; j < 4; ++j)
				{
					float dist = GetSignedDistanceFromPlane( pLightLoc_ViewSpace, frustumEqn[j] );
					if(dist >= r || dist <= -r)
						bInFrustum = false;
				}
				if(bInFrustum)
				{
					// do a thread-safe increment of the list counter
					// and mark this light as visible
					atomicAdd(visibleLightCount, 1);
					g_lightVisibilities[g_pointLights[lightIndex].base.uniqueID] = 1;
				}
			}
		}
	}


Write the Per-tile visible light count image if necessary.

// wait until all threads of this tile have finished the intersection tests
barrier();

vec4 pixel = vec4(visibleLightCount / 10.f, visibleLightCount / 10.f, visibleLightCount / 10.f, 1.0);

ivec2 pixel_coords = ivec2(gl_WorkGroupID.xy);
imageStore(img_output, pixel_coords, pixel);


The frustum-sphere intersection does not work as well as I expected. I have seen others' output images, like the one in Reference 2; their per-tile visible light counts are very precise but mine are not. Especially when the camera is inside a light volume, almost the whole scene is considered illuminated by that light, which is incorrect.

This is my first time getting in touch with compute shaders, and I have gained a lot of knowledge. I will fix the problem if I find a solution and post an update here. If any reader knows what the problem is, please let me know; it would be much appreciated.


  1. DirectX version from AMD used for tiled forward shading
  2. Parallel Graphics in Frostbite – Current & Future by Johan Andersson
  3. Efficient Tile-Based Deferred Shading Pipeline by Denis Ishmukhametov, Ufa State Aviation Technical University, June 2011
  4. Mapping between HLSL and GLSL
  5. Generate random number in shader
  6. It’s More Fun to Compute, An Introduction to Compute Shaders
  7. How to Use and Teach OpenGL Compute Shaders by Mike Bailey
  8. How to recover view space position given view space depth value and ndc xy
  9. Point-Plane Distance