Optimizing C++/Code optimization/Memory access

When an application accesses main memory, it implicitly uses both the various processor caches and the disk swapping mechanism of the operating system's virtual memory manager.

Both the processor caches and the virtual memory manager process data block-wise; therefore, software is faster when few memory blocks contain the code and the data used by a single command. The principle that the data and the code processed by a command should reside in nearby regions of memory is called locality of reference.
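For example (an illustrative sketch, not part of the original guideline), traversing a two-dimensional array in the order its elements are laid out in memory touches far fewer memory blocks than jumping across rows:

#include <cstddef>

// Summing a matrix stored in row-major order.
// The first loop visits consecutive addresses (good locality);
// the second jumps by `cols` elements at every step, touching a
// different cache block almost every iteration (poor locality).
double sum_row_major(const double* m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

double sum_column_major(const double* m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}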

This principle becomes even more important for performance in multi-threaded applications on multi-core systems, because if several threads running on different cores access the same cache block, the contention causes a performance degradation.

This section proposes techniques to optimize usage of the processor caches and of the virtual memory, by increasing the locality of reference of code and data.

Nearing the code

Put all the function definitions belonging to the same bottleneck close together in the same compilation unit.

In this way, the machine code generated by compiling those functions will have nearby addresses, and so greater code locality of reference.

Another positive consequence is that the local static data declared and used by those functions will have nearby addresses, and so greater data locality of reference.
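For example (a minimal sketch with hypothetical function names), the functions of one bottleneck can be defined next to each other in a single source file:

// hot_path.cpp -- hypothetical compilation unit grouping the functions
// of a single bottleneck, so that their machine code ends up at nearby
// addresses, and so does the local static data they declare.

namespace hot_path {

int transform_sample(int s) {
    static int last_value = 0;   // local static data of the bottleneck
    last_value = s * 2 + last_value / 2;
    return last_value;
}

int accumulate_sample(int s) {
    static int total = 0;        // declared in the same unit, so typically placed near last_value
    total += transform_sample(s);
    return total;
}

} // namespace hot_path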

However, data used by different threads should be kept on different cache lines to avoid false sharing (i.e. cache-line bouncing). Gerber, Bik, Smith, and Tian, in The Software Optimization Cookbook (page 262), recommend keeping this type of data at least 128 bytes apart, which is larger than any L1/L2/L3 cache line.
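As a sketch of this recommendation (the names are hypothetical; the value 128 follows the figure cited above and should match the cache-line size of the target), per-thread data can be over-aligned so that two threads never write to the same cache line:

#include <cstdint>

// Each per-thread slot is over-aligned to 128 bytes, so counters updated
// by different threads cannot fall into the same cache line.
struct alignas(128) PaddedCounter {
    std::uint64_t value = 0;
};

PaddedCounter counters[4]; // one slot per worker thread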

unions

In medium or large arrays or collections, use unions.

unions make it possible to save memory space in structures whose contents may be of varying types, and therefore to make them more compact.
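For example (an illustrative sketch with hypothetical field names), a large collection of variant records can store the alternatives in a union discriminated by a tag, instead of keeping a separate member for each alternative:

// Without the union, every element would carry both the readings and the
// label (plus padding); with the union the two alternatives overlap.
struct Sample {
    enum class Kind { reading, label } kind;
    union {
        double readings[2]; // valid when kind == Kind::reading
        char   label[16];   // valid when kind == Kind::label
    };
};

Sample samples[100000]; // medium/large collection where the saving matters

On a typical implementation, sizeof(Sample) is 24 bytes with the union, while storing both members side by side would take 40 bytes (exact sizes are implementation-dependent).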

However, don't use them for small or tiny objects, as there are no significant space savings, and with some compilers the objects placed in a union are not kept in processor registers.

Bit-fields

If a medium or large object contains several integer numbers with a small range, transform them into bit-fields.

Bit-fields decrease the object size.

For example, instead of the following structure:

struct {
    bool b;
    unsigned short ui1, ui2, ui3; // range: [0, 1000]
};

that takes up 8 bytes, you can define the following structure:

struct {
    unsigned b: 1;
    unsigned ui1: 10, ui2: 10, ui3: 10; // range: [0, 1000]
};

that takes up only 4 bytes (1 + 10 + 10 + 10 = 31 bits, which fit in 32 bits).

For another example, instead of the following array:

signed char a[5]; // range: [-20, +20]

that takes up 5 bytes, you can define the following structure:

struct {
    signed a1: 6, a2: 6, a3: 6, a4: 6, a5: 6; // range: [-20, +20]
};

that takes up only 4 bytes (6 + 6 + 6 + 6 + 6 = 30 bits, which fit in 32 bits).

However, there is a performance penalty for packing and unpacking the fields. In addition, in the last example, the fields can no longer be accessed by index.

It should be noted that bit-field layout is implementation-defined, so different compilers may give different results. In addition, because of memory access granularity, adjacent bit-fields of the same object share a memory location; accessing them concurrently from different threads is a data race, with undefined behavior on multi-processor platforms, which can result in hard-to-detect faults. As a consequence, some compilers may not pack bit-fields as tightly as requested.

This optimization is best used on microcontrollers. As these devices are usually single-processor devices, there is no problem with concurrent access. They also have very limited memory, so they cannot afford wasted bits. In addition, many microcontrollers have special instructions designed to improve the performance of bit manipulation, e.g. an instruction to test whether a single bit is set within a byte at a given offset.
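As a sketch (the register layout here is hypothetical), bit-fields are commonly used on microcontrollers to name the individual flags of a packed status byte:

// Hypothetical status register packed into a single byte; on many
// microcontrollers the compiler can map accesses to such fields onto
// single-bit test/set instructions.
struct StatusRegister {
    unsigned char ready    : 1;
    unsigned char error    : 1;
    unsigned char overflow : 1;
    unsigned char channel  : 3; // range: [0, 7]
    unsigned char reserved : 2;
};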

Template code independent of parameters

If in a class template a non-trivial member function does not depend on any template parameter, define a non-member function having the same body, and replace the original function body with a call to the new function.

Let's assume you wrote the following code:

template <typename T>
class C {
public:
    C(): x_(0) { }
    int f(int i) { body(); return i; } // body() stands for the parameter-independent code
private:
    T x_;
};

Try to replace the above code with the following:

int f_(int i); // declared before the template, so the non-dependent call below resolves

template <typename T>
class C {
public:
    C(): x_(0) { }
    int f(int i) { return f_(i); }
private:
    T x_;
};

int f_(int i) { body(); return i; } // body() as in the original version

For every instantiation of a class template that uses a member function of that class template, the whole code of the function is instantiated. If a function of that class template does not depend on any template parameters, its machine code is duplicated at every instantiation. Such code replication bloats the program.

In a class template or in a function template, a big function could have a large part that does not depend on any template parameters. In such a case, first factor out that code portion as a distinct function, and then apply this guideline.
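For example (an illustrative sketch with hypothetical names), a member function that both writes a log line and stores a value of type T can be split so that only the T-dependent part remains in the template:

#include <iostream>
#include <string>

// Non-dependent part, factored out: its machine code is generated only once,
// regardless of how many instantiations of Logger<T> exist.
void write_log_line(const std::string& message) {
    std::cout << "[log] " << message << '\n';
}

template <typename T>
class Logger {
public:
    // Only this thin, T-dependent part is duplicated for every instantiation.
    void log(const T& value) {
        write_log_line("value received"); // parameter-independent work
        last_ = value;                    // parameter-dependent work
    }
private:
    T last_{};
};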