# Cg Programming/Unity/Computing the Brightest Pixel A dancer wearing a motion capture suit with reflective markers that lead to very bright pixels.

This tutorial shows how to compute the position of the brightest pixel in an image with the help of compute shaders in Unity. In particular, it shows how threads in a thread group can use “groupshared” data and how the execution of these threads can be synchronized. If you are not familiar with compute shaders in Unity, you should first read Section “Computing Image Effects” and Section “Computing Color Histograms”. Note that compute shaders are not supported on macOS.

## Why Would Anyone Want to Do This?

Finding the brightest pixel in photographed images is useful for some applications of optical motion capture. Another application is a template matching algorithm that is applied at the positions of all pixels of an image and that stores the likelihood of a match at each pixel in an intermediate image. In this case, the “brightest” pixel of this intermediate image represents the best match with the template. Finding this best match is useful for template-based feature detection and tracking.

Also, the problem of finding the brightest pixel is closely related to many other problems, e.g., finding the darkest pixel or the two (or more) brightest pixels or the two (or more) brightest pixels with a certain distance between them or the sum or average of all pixels of an image, etc. In fact, by solving the problem of finding the brightest pixel, one is very close to solving several related problems.

## Finding the Brightest Pixel with a Compute Shader

To find the brightest pixel in an image, one has to look at all pixels of the image; thus, the problem can benefit a lot from parallelization.

In this tutorial, we implement a compute shader that first finds the brightest pixel in one pixel row of an image − simply by going through all pixels of the row in a loop and keeping track of the brightest pixel it encounters. We call this compute shader for all rows of the image in parallel. The result is an array of the brightest pixels of each row, which might be a relatively large array (depending on the height of the image). Therefore, we reduce the size of this array by computing the brightest pixel of each thread group at the end of the shader. Since we use thread groups of 64 threads, this reduces the dimension of the resulting array by factor 64, and the new result is an array of the brightest pixels of each thread group. One could try to further reduce this array in parallel but since this array is already relatively small, we simply transfer the data to the CPU and find the brightest pixel of the whole image by a linear search on the CPU. Note: for any tool, it is not only important to know when to use it, but also when not to use it.

Here is a first version of the compute shader:

```#pragma kernel MaximumMain

Texture2D<float4> InputTexture;
int InputTextureWidth;

struct maxStruct
{
uint xMax; // column of maximum
uint yMax; // row of maximum
uint lMax; // luminance of maximum (0, ..., 1023)
};

RWStructuredBuffer<maxStruct> GroupMaxBuffer;

groupshared maxStruct rowMaxData;

void MaximumMain (uint3 groupID : SV_GroupID,
// 3D ID of thread group; range depends on Dispatch call
uint groupIndex : SV_GroupIndex,
// groupIndex specifies the index within the group (0 to 63)
// id.x specifies the row in the input texture image
{
int column;

// find the maximum of this row
// and store its data in rowMaxData[groupIndex]
rowMaxData[groupIndex].xMax = 0;
rowMaxData[groupIndex].yMax = id.x;
rowMaxData[groupIndex].lMax = 0;
for (column = 0; column < InputTextureWidth; column++)
{
float4 color = InputTexture[uint2(column, id.x)];
uint luminance = (uint)(1023.0 *
(0.21 * color.r + 0.72 * color.g + 0.07 * color.b));
if (luminance > rowMaxData[groupIndex].lMax)
{
rowMaxData[groupIndex].xMax = column;
rowMaxData[groupIndex].lMax = luminance;
}
}

// find the maximum of this group
// and store its data in GroupMaxBuffer[groupID.x]
GroupMemoryBarrierWithGroupSync();
if (0 == groupIndex)
{
int row;
int rowMax = 0;
for (row = 1; row < 64; row++)
{
if (rowMaxData[row].lMax > rowMaxData[rowMax].lMax)
{
rowMax = row;
}
}
GroupMaxBuffer[groupID.x] = rowMaxData[rowMax];
}
}
```

The first (Unity-specific) line `#pragma kernel MaximumMain` specifies that the function `MaximumMain()` is a compute shader function that can be called from a script.

`Texture2D<float4> InputTexture` is a uniform variable to access the RGBA input texture, while `int InputTextureWidth` is a uniform variable to get its width, i.e., the length of a row of pixels.

The next lines define a struct to store the data about candidates for the brightest pixel. `xMax` and `yMax` are its coordinates while `lMax` is its relative luminance from 0 to 1023:

```struct maxStruct
{
uint xMax; // column of maximum
uint yMax; // row of maximum
uint lMax; // luminance of maximum (0, ..., 1023)
};
```

The definition `RWStructuredBuffer<maxStruct> GroupMaxBuffer` uses this struct to define a `RWStructuredBuffer` (corresponding to a compute buffer in Unity) to store the information about the brightest pixel of each thread group.

The definition `groupshared maxStruct rowMaxData` uses the same struct to define a `groupshared` array to store the information about the brightest pixel of each thread (i.e., of each row) within the current thread group. Note that the total size of the `groupshared` data in Direct3D 11 is limited to 32 KB. Assuming that an unsigned int requires at most 8 bytes, the `rowMaxData` array requires at most 64 × 3 × 8 = 1536 bytes, which is well below the limit of 32 KB.

We define the dimensions of a thread group with `[numthreads(64, 1, 1)]` instead of `[numthreads(1, 64, 1)]` since the thread group is assumed to work on an “1D array” of 64 rows and it is usually more straightforward to use the x dimension for a one-dimensional group.

The compute shader function `MaximumMain()` asks for all thread-related indices that are available (although it doesn't use `groupThreadID`). The index `groupID.x` of the thread group is used to index the `GroupMaxBuffer`, the thread index `groupIndex` within the thread group is used to index `rowMaxData`, and the overall dispatch index `id.x` specifies the row of the whole image.

The function `MaximumMain()` then runs a loop over all pixels of the thread's row by counting the variable `column` from 0 to `InputTextureWidth - 1`. It computes the relative luminance (scaled with 1023 in order to work with unsigned ints) of each pixel, compares this luminance to the greatest luminance so far, and if the new luminance is greater, it updates the data in `rowMaxData[groupIndex]`, which at the end of the loop contains the information about the brightest pixel of the row.

After computing the brightest pixel of a row, the function computes the brightest pixel of the thread group. Since we need to compare the data of different threads, we first have to make sure that all threads have determined the brightest pixel of their row. This is achieved with `GroupMemoryBarrierWithGroupSync()` which not only makes sure that all memory writes of the thread group before this line are completed but also waits until all threads in the thread group reach this line. The code then checks whether `groupIndex` is 0, i.e., whether this is the zeroth thread of the thread group. Only this thread determines the brightest pixel of the pixels in `rowMaxData` and writes it to `GroupMaxBuffer[groupID.x]`. While this solution works (and is straightforward to implement), it is somewhat wasteful because the 63 other threads in the same thread group have nothing to do while the zeroth thread is working in this loop.

```#pragma kernel MaximumMain

Texture2D<float4> InputTexture;
int InputTextureWidth;

struct maxStruct
{
uint xMax; // column of maximum
uint yMax; // row of maximum
uint lMax; // luminance of maximum (0, ..., 1023)
};

RWStructuredBuffer<maxStruct> GroupMaxBuffer;

groupshared maxStruct rowMaxData;

void MaximumMain (uint3 groupID : SV_GroupID,
// 3D ID of thread group; range depends on Dispatch call
uint groupIndex : SV_GroupIndex,
// groupIndex specifies the index within the group (0 to 63)
// id.x specifies the row in the input texture image
{
int column;

// find the maximum of this row
// and store its data in rowMaxData[groupIndex]
rowMaxData[groupIndex].xMax = 0;
rowMaxData[groupIndex].yMax = id.x;
rowMaxData[groupIndex].lMax = 0;
for (column = 0; column < InputTextureWidth; column++)
{
float4 color = InputTexture[uint2(column, id.x)];
uint luminance = (uint)(1023.0 *
(0.21 * color.r + 0.72 * color.g + 0.07 * color.b));
if (luminance > rowMaxData[groupIndex].lMax)
{
rowMaxData[groupIndex].xMax = column;
rowMaxData[groupIndex].lMax = luminance;
}
}

// find the maximum of this group
// and store its data in GroupMaxBuffer[groupID.x]
GroupMemoryBarrierWithGroupSync();
// we have to wait for all writes to rowMaxData by the group's threads
if (0 == (groupIndex & 1)) { // is groupIndex even?
if (rowMaxData[groupIndex + 1].lMax > rowMaxData[groupIndex].lMax) {
rowMaxData[groupIndex] = rowMaxData[groupIndex + 1];
}
}
GroupMemoryBarrierWithGroupSync();
if (0 == (groupIndex & 3)) { // is groupIndex divisible by 4?
if (rowMaxData[groupIndex + 2].lMax > rowMaxData[groupIndex].lMax) {
rowMaxData[groupIndex] = rowMaxData[groupIndex + 2];
}
}
GroupMemoryBarrierWithGroupSync();
if (0 == (groupIndex & 7)) { // is groupIndex divisible by 8?
if (rowMaxData[groupIndex + 4].lMax > rowMaxData[groupIndex].lMax) {
rowMaxData[groupIndex] = rowMaxData[groupIndex + 4];
}
}
GroupMemoryBarrierWithGroupSync();
if (0 == (groupIndex & 15)) { // is groupIndex divisible by 16?
if (rowMaxData[groupIndex + 8].lMax > rowMaxData[groupIndex].lMax) {
rowMaxData[groupIndex] = rowMaxData[groupIndex + 8];
}
}
GroupMemoryBarrierWithGroupSync();
if (0 == (groupIndex & 31)) { // is groupIndex divisible by 32?
if (rowMaxData[groupIndex + 16].lMax > rowMaxData[groupIndex].lMax) {
rowMaxData[groupIndex] = rowMaxData[groupIndex + 16];
}
}
GroupMemoryBarrierWithGroupSync();
if (0 == (groupIndex & 63)) { // is groupIndex divisible by 64?
if (rowMaxData[groupIndex + 32].lMax > rowMaxData[groupIndex].lMax) {
rowMaxData[groupIndex] = rowMaxData[groupIndex + 32];
}
GroupMaxBuffer[groupID.x] = rowMaxData[groupIndex];
// copy maximum of group to buffer
}
}
```

Note that the code uses the bitwise and operator & with powers of 2 minus 1 to test whether `groupIndex` is divisble by the power of 2. We could also use the modulus operator % with the power of 2 instead.

The C# script to call the compute shader is relatively straightforward:

```using UnityEngine;

public class maximumScript : MonoBehaviour
{
public Texture2D inputTexture;

public uint[] groupMaxData;
public int groupMax;

private ComputeBuffer groupMaxBuffer;

private int handleMaximumMain;

void Start ()
{
if (null == shader || null == inputTexture)
{
return;
}

groupMaxBuffer = new ComputeBuffer((inputTexture.height + 63) / 64, sizeof(uint) * 3);
groupMaxData = new uint[((inputTexture.height + 63) / 64) * 3];

if (handleMaximumMain < 0 || null == groupMaxBuffer || null == groupMaxData)
{
Debug.Log("Initialization failed.");
return;
}

}

void OnDestroy()
{
if (null != groupMaxBuffer)
{
groupMaxBuffer.Release();
}
}

void Update()
{
shader.Dispatch(handleMaximumMain, (inputTexture.height + 63) / 64, 1, 1);
// divided by 64 in x because of [numthreads(64,1,1)] in the compute shader code
// added 63 to make sure that there is a group for all rows

// get maxima of groups
groupMaxBuffer.GetData(groupMaxData);

// find maximum of all groups
groupMax = 0;
for (int group = 1; group < (inputTexture.height + 63) / 64; group++)
{
if (groupMaxData[3 * group + 2] > groupMaxData[3 * groupMax + 2])
{
groupMax = group;
}
}
}
}
```

The script has public variables for the compute shader and the input texture image that you have to set. It returns its result in the array `uint[] groupMaxData` at a position determined by `groupMax`.

The `RWStructuredBuffer` of the compute shader corresponds to the compute buffer `groupMaxBuffer`. Note that this is an array of elements that contain 3 unsigned ints. The array `groupMaxData` has the same memory layout but consists of unsigned ints; thus, it has three times as many elements as `groupMaxBuffer`.

The `Start()` function does some error checking, finds the handle for the compute shader function, creates `groupMaxBuffer` and `groupMaxData`, and sets the uniform variables of the compute shader.

The `OnDestroy()` function released the compute buffer because it is not released by the garbage collector.

The `Update()` function simply calls the compute shader function, where the number of thread groups is determined by the number of rows of the image (i.e., its height) divided by the number of threads in one thread group (i.e., 64 in our case). We add 63 to the number of rows before the division in order to make sure that we have enough thread groups for image heights that are not divisible by 64.

`groupMaxBuffer.GetData(groupMaxData)` copies the data from the compute buffer to the `groupMaxData` array. Then the code finds the brightest pixel in that array by a loop over all groups. Note that the relative luminance of the group with index `group` is at `groupMaxData[3 * group + 2]` because `groupMaxData` is a “flattened” array of unsigned ints instead of an array of structs with 3 unsigned ints.

At the end, the relative luminance of the brightest pixel is at `groupMaxData[3 * groupMax + 2]`. Its x coordinate is at `groupMaxData[3 * groupMax + 0]` and its y coordinate is at `groupMaxData[3 * groupMax + 1]`.

## Summary

You have reached the end of this tutorial! A few of the things that you have learned are:

• How to parallelize a search over all pixels of an image.
• How to synchronize the execution of threads in a thread group.