Cg Programming/Unity/Computing the Brightest Pixel

A dancer wearing a motion capture suit with reflective markers that lead to very bright pixels.

This tutorial shows how to compute the position of the brightest pixel in an image with the help of compute shaders in Unity. In particular, it shows how threads in a thread group can use “groupshared” data and how the execution of these threads can be synchronized. If you are not familiar with compute shaders in Unity, you should first read Section “Computing Image Effects” and Section “Computing Color Histograms”. Note that compute shaders are not supported on macOS.

Why Would Anyone Want to Do This?

Finding the brightest pixel in photographed images is useful for some applications of optical motion capture. Another application is a template matching algorithm that is applied at the positions of all pixels of an image and that stores the likelihood of a match at each pixel in an intermediate image. In this case, the “brightest” pixel of this intermediate image represents the best match with the template. Finding this best match is useful for template-based feature detection and tracking.

Also, the problem of finding the brightest pixel is closely related to many other problems, e.g., finding the darkest pixel or the two (or more) brightest pixels or the two (or more) brightest pixels with a certain distance between them or the sum or average of all pixels of an image, etc. In fact, by solving the problem of finding the brightest pixel, one is very close to solving several related problems.

Finding the Brightest Pixel with a Compute Shader

To find the brightest pixel in an image, one has to look at all pixels of the image; thus, the problem can benefit a lot from parallelization.

In this tutorial, we implement a compute shader that first finds the brightest pixel in one pixel row of an image − simply by going through all pixels of the row in a loop and keeping track of the brightest pixel it encounters. We call this compute shader for all rows of the image in parallel. The result is an array of the brightest pixels of each row, which might be a relatively large array (depending on the height of the image). Therefore, we reduce the size of this array by computing the brightest pixel of each thread group at the end of the shader. Since we use thread groups of 64 threads, this reduces the dimension of the resulting array by factor 64, and the new result is an array of the brightest pixels of each thread group. One could try to further reduce this array in parallel but since this array is already relatively small, we simply transfer the data to the CPU and find the brightest pixel of the whole image by a linear search on the CPU. Note: for any tool, it is not only important to know when to use it, but also when not to use it.

Here is a first version of the compute shader:

#pragma kernel MaximumMain

Texture2D<float4> InputTexture;
int InputTextureWidth;

struct maxStruct 
{
   uint xMax; // column of maximum
   uint yMax; // row of maximum
   uint lMax; // luminance of maximum (0, ..., 1023)
};

RWStructuredBuffer<maxStruct> GroupMaxBuffer;

groupshared maxStruct rowMaxData[64];

[numthreads(64,1,1)]
void MaximumMain (uint3 groupID : SV_GroupID, 
      // 3D ID of thread group; range depends on Dispatch call
   uint3 groupThreadID : SV_GroupThreadID, 
      // 3D ID of thread in a thread group; range depends on numthreads
   uint groupIndex : SV_GroupIndex, 
      // flattened/linearized SV_GroupThreadID. 
      // groupIndex specifies the index within the group (0 to 63)
   uint3 id : SV_DispatchThreadID) 
      // = SV_GroupID * numthreads + SV_GroupThreadID
      // id.x specifies the row in the input texture image
{
   int column;

   // find the maximum of this row 
   // and store its data in rowMaxData[groupIndex]
   rowMaxData[groupIndex].xMax = 0; 
   rowMaxData[groupIndex].yMax = id.x; 
   rowMaxData[groupIndex].lMax = 0;
   for (column = 0; column < InputTextureWidth; column++) 
   {
      float4 color = InputTexture[uint2(column, id.x)];
      uint luminance = (uint)(1023.0 * 
         (0.21 * color.r + 0.72 * color.g + 0.07 * color.b));
      if (luminance > rowMaxData[groupIndex].lMax) 
      {
         rowMaxData[groupIndex].xMax = column;
         rowMaxData[groupIndex].lMax = luminance;
      }
   }

   // find the maximum of this group 
   // and store its data in GroupMaxBuffer[groupID.x]
   GroupMemoryBarrierWithGroupSync(); 
   if (0 == groupIndex) 
   {
      int row; 
      int rowMax = 0;
      for (row = 1; row < 64; row++) 
      { 
         if (rowMaxData[row].lMax > rowMaxData[rowMax].lMax) 
         {
            rowMax = row;
         }
      }
      GroupMaxBuffer[groupID.x] = rowMaxData[rowMax];
   }
}

The first (Unity-specific) line #pragma kernel MaximumMain specifies that the function MaximumMain() is a compute shader function that can be called from a script.

Texture2D<float4> InputTexture is a uniform variable to access the RGBA input texture, while int InputTextureWidth is a uniform variable to get its width, i.e., the length of a row of pixels.

The next lines define a struct to store the data about candidates for the brightest pixel. xMax and yMax are its coordinates while lMax is its relative luminance from 0 to 1023:

struct maxStruct 
{
   uint xMax; // column of maximum
   uint yMax; // row of maximum
   uint lMax; // luminance of maximum (0, ..., 1023)
};

The definition RWStructuredBuffer<maxStruct> GroupMaxBuffer uses this struct to define a RWStructuredBuffer (corresponding to a compute buffer in Unity) to store the information about the brightest pixel of each thread group.

The definition groupshared maxStruct rowMaxData[64] uses the same struct to define a groupshared array to store the information about the brightest pixel of each thread (i.e., of each row) within the current thread group. Note that the total size of the groupshared data in Direct3D 11 is limited to 32 KB. Assuming that an unsigned int requires at most 8 bytes, the rowMaxData array requires at most 64 × 3 × 8 = 1536 bytes, which is well below the limit of 32 KB.

We define the dimensions of a thread group with [numthreads(64, 1, 1)] instead of [numthreads(1, 64, 1)] since the thread group is assumed to work on an “1D array” of 64 rows and it is usually more straightforward to use the x dimension for a one-dimensional group.

The compute shader function MaximumMain() asks for all thread-related indices that are available (although it doesn't use groupThreadID). The index groupID.x of the thread group is used to index the GroupMaxBuffer, the thread index groupIndex within the thread group is used to index rowMaxData, and the overall dispatch index id.x specifies the row of the whole image.

The function MaximumMain() then runs a loop over all pixels of the thread's row by counting the variable column from 0 to InputTextureWidth - 1. It computes the relative luminance (scaled with 1023 in order to work with unsigned ints) of each pixel, compares this luminance to the greatest luminance so far, and if the new luminance is greater, it updates the data in rowMaxData[groupIndex], which at the end of the loop contains the information about the brightest pixel of the row.

After computing the brightest pixel of a row, the function computes the brightest pixel of the thread group. Since we need to compare the data of different threads, we first have to make sure that all threads have determined the brightest pixel of their row. This is achieved with GroupMemoryBarrierWithGroupSync() which not only makes sure that all memory writes of the thread group before this line are completed but also waits until all threads in the thread group reach this line. The code then checks whether groupIndex is 0, i.e., whether this is the zeroth thread of the thread group. Only this thread determines the brightest pixel of the pixels in rowMaxData and writes it to GroupMaxBuffer[groupID.x]. While this solution works (and is straightforward to implement), it is somewhat wasteful because the 63 other threads in the same thread group have nothing to do while the zeroth thread is working in this loop.

A more efficient alternative is given in the second version of the compute shader below. It implements a reduce operation (or fold function) similar to a knockout tournament: In the first step, each thread with an even number compares its brightest pixel with the brightest pixel of the next thread. In the second step, each thread with a number that is divisible by 4 compares its best candidate pixel with the next thread but one, etc. In the sixth (and last) step, the zeroth thread compares its best candidate with the best candidate of the 32nd thread. The “winner” of this last comparison is then the brightest pixel of the group. There are still many idle threads in this version but it only takes 6 steps instead of a loop of 64 iterations, which is a worthwhile improvement. Avoiding any idle threads would require multiple dispatch calls which come with some overhead and therefore might not save any time.

Here is the improved shader:

#pragma kernel MaximumMain

Texture2D<float4> InputTexture;
int InputTextureWidth;

struct maxStruct 
{
   uint xMax; // column of maximum
   uint yMax; // row of maximum
   uint lMax; // luminance of maximum (0, ..., 1023)
};

RWStructuredBuffer<maxStruct> GroupMaxBuffer;

groupshared maxStruct rowMaxData[64];

[numthreads(64,1,1)]
void MaximumMain (uint3 groupID : SV_GroupID, 
      // 3D ID of thread group; range depends on Dispatch call
   uint3 groupThreadID : SV_GroupThreadID, 
      // 3D ID of thread in a thread group; range depends on numthreads
   uint groupIndex : SV_GroupIndex, 
      // flattened/linearized SV_GroupThreadID. 
      // groupIndex specifies the index within the group (0 to 63)
   uint3 id : SV_DispatchThreadID) 
      // = SV_GroupID * numthreads + SV_GroupThreadID
      // id.x specifies the row in the input texture image
{
   int column;

   // find the maximum of this row 
   // and store its data in rowMaxData[groupIndex]
   rowMaxData[groupIndex].xMax = 0; 
   rowMaxData[groupIndex].yMax = id.x; 
   rowMaxData[groupIndex].lMax = 0;
   for (column = 0; column < InputTextureWidth; column++) 
   {
      float4 color = InputTexture[uint2(column, id.x)];
      uint luminance = (uint)(1023.0 * 
         (0.21 * color.r + 0.72 * color.g + 0.07 * color.b));
      if (luminance > rowMaxData[groupIndex].lMax) 
      {
         rowMaxData[groupIndex].xMax = column;
         rowMaxData[groupIndex].lMax = luminance;
      }
   }

   // find the maximum of this group
   // and store its data in GroupMaxBuffer[groupID.x]
   GroupMemoryBarrierWithGroupSync(); 
      // we have to wait for all writes to rowMaxData by the group's threads
   if (0 == (groupIndex & 1)) { // is groupIndex even?
      if (rowMaxData[groupIndex + 1].lMax > rowMaxData[groupIndex].lMax) {
         rowMaxData[groupIndex] = rowMaxData[groupIndex + 1];
      }
   }
   GroupMemoryBarrierWithGroupSync(); 
   if (0 == (groupIndex & 3)) { // is groupIndex divisible by 4?
      if (rowMaxData[groupIndex + 2].lMax > rowMaxData[groupIndex].lMax) {
         rowMaxData[groupIndex] = rowMaxData[groupIndex + 2];
      }
   }
   GroupMemoryBarrierWithGroupSync(); 
   if (0 == (groupIndex & 7)) { // is groupIndex divisible by 8?
      if (rowMaxData[groupIndex + 4].lMax > rowMaxData[groupIndex].lMax) {
         rowMaxData[groupIndex] = rowMaxData[groupIndex + 4];
      }
   }
   GroupMemoryBarrierWithGroupSync();
   if (0 == (groupIndex & 15)) { // is groupIndex divisible by 16?
      if (rowMaxData[groupIndex + 8].lMax > rowMaxData[groupIndex].lMax) {
         rowMaxData[groupIndex] = rowMaxData[groupIndex + 8];
      }
   }
   GroupMemoryBarrierWithGroupSync(); 
   if (0 == (groupIndex & 31)) { // is groupIndex divisible by 32?
      if (rowMaxData[groupIndex + 16].lMax > rowMaxData[groupIndex].lMax) {
         rowMaxData[groupIndex] = rowMaxData[groupIndex + 16];
      }
   }
   GroupMemoryBarrierWithGroupSync();
   if (0 == (groupIndex & 63)) { // is groupIndex divisible by 64?
      if (rowMaxData[groupIndex + 32].lMax > rowMaxData[groupIndex].lMax) {
         rowMaxData[groupIndex] = rowMaxData[groupIndex + 32];
      }
      GroupMaxBuffer[groupID.x] = rowMaxData[groupIndex];
         // copy maximum of group to buffer
   }
}

Note that the code uses the bitwise and operator & with powers of 2 minus 1 to test whether groupIndex is divisble by the power of 2. We could also use the modulus operator % with the power of 2 instead.

Calling the Compute Shader

The C# script to call the compute shader is relatively straightforward:

using UnityEngine;

public class maximumScript : MonoBehaviour 
{
   public ComputeShader shader;
   public Texture2D inputTexture;

   public uint[] groupMaxData;
   public int groupMax;

   private ComputeBuffer groupMaxBuffer;
   
   private int handleMaximumMain;

   void Start () 
   {
      if (null == shader || null == inputTexture) 
      {
         Debug.Log("Shader or input texture missing.");
         return;
      }

      handleMaximumMain = shader.FindKernel("MaximumMain");
      groupMaxBuffer = new ComputeBuffer((inputTexture.height + 63) / 64, sizeof(uint) * 3);
      groupMaxData = new uint[((inputTexture.height + 63) / 64) * 3];

      if (handleMaximumMain < 0 || null == groupMaxBuffer || null == groupMaxData) 
      {
         Debug.Log("Initialization failed.");
         return;
      }
      
      shader.SetTexture(handleMaximumMain, "InputTexture", inputTexture);
      shader.SetInt("InputTextureWidth", inputTexture.width);
      shader.SetBuffer(handleMaximumMain, "GroupMaxBuffer", groupMaxBuffer);
   }

   void OnDestroy() 
   {
      if (null != groupMaxBuffer) 
      {
         groupMaxBuffer.Release();
      }
   }

   void Update()
   {
      shader.Dispatch(handleMaximumMain, (inputTexture.height + 63) / 64, 1, 1);
         // divided by 64 in x because of [numthreads(64,1,1)] in the compute shader code
         // added 63 to make sure that there is a group for all rows
      
      // get maxima of groups
      groupMaxBuffer.GetData(groupMaxData);
      
      // find maximum of all groups
      groupMax = 0;
      for (int group = 1; group < (inputTexture.height + 63) / 64; group++) 
      {
         if (groupMaxData[3 * group + 2] > groupMaxData[3 * groupMax + 2]) 
         {
            groupMax = group;
         }
      }
   }
}

The script has public variables for the compute shader and the input texture image that you have to set. It returns its result in the array uint[] groupMaxData at a position determined by groupMax.

The RWStructuredBuffer of the compute shader corresponds to the compute buffer groupMaxBuffer. Note that this is an array of elements that contain 3 unsigned ints. The array groupMaxData has the same memory layout but consists of unsigned ints; thus, it has three times as many elements as groupMaxBuffer.

The Start() function does some error checking, finds the handle for the compute shader function, creates groupMaxBuffer and groupMaxData, and sets the uniform variables of the compute shader.

The OnDestroy() function released the compute buffer because it is not released by the garbage collector.

The Update() function simply calls the compute shader function, where the number of thread groups is determined by the number of rows of the image (i.e., its height) divided by the number of threads in one thread group (i.e., 64 in our case). We add 63 to the number of rows before the division in order to make sure that we have enough thread groups for image heights that are not divisible by 64.

groupMaxBuffer.GetData(groupMaxData) copies the data from the compute buffer to the groupMaxData array. Then the code finds the brightest pixel in that array by a loop over all groups. Note that the relative luminance of the group with index group is at groupMaxData[3 * group + 2] because groupMaxData is a “flattened” array of unsigned ints instead of an array of structs with 3 unsigned ints.

At the end, the relative luminance of the brightest pixel is at groupMaxData[3 * groupMax + 2]. Its x coordinate is at groupMaxData[3 * groupMax + 0] and its y coordinate is at groupMaxData[3 * groupMax + 1].

Summary

You have reached the end of this tutorial! A few of the things that you have learned are:

How to parallelize a search over all pixels of an image.
How to synchronize the execution of threads in a thread group.
How to communicate data between threads in a thread group.
How to use reduce operations to speed up a search in a “groupshared” array.

Why Would Anyone Want to Do This?

Finding the Brightest Pixel with a Compute Shader

Calling the Compute Shader

Summary

Further reading