Optimizing C++/Writing efficient code/Performance improving features

From Wikibooks, open books for an open world
Jump to: navigation, search

Some features of the C++ language, if properly used, allow to increase the speed of the resulting software.

In this section guidelines to exploit such features are presented.

Contents

[edit] The most efficient types

When defining an object to store an integer number, use the int or the unsigned int type, except when a longer type is needed; when defining an object to store a character, use the char type, except when the wchar_t type is needed; and when defining an object to store a floating point number, use the double type, except when the long double type is needed. If the resulting aggregate object is of medium or large size, replace all the integer types with the smallest integer type that is long enough to contain it, but without using bit-fields, and replace the floating point types with the float type, except when greater precision is needed.

The int and unsigned int types are by definition the most efficient ones on any platform.

The double type is two to three times less efficient than the float type, but it has greater precision.

Some processor types handle more efficiently signed char objects, while others handle more efficiently unsigned char objects. Therefore, both in C and in C++, the char type, different from signed char type, has been introduced as the most efficient character type for the target processor.

The char type can contain only small character sets; typically up to a maximum of 255 distinct characters. To handle bigger character sets, you should use the wchar_t type, although that is obviously less efficient.

In case of numbers contained in a medium or large aggregate object, or in a collection that will be probably be of medium or large size, it is better to minimize the size in bytes of the aggregate object or collection. This can be done by replacing the ints with shorts or signed chars, replacing the unsigned ints with unsigned shorts or unsigned chars, and replacing the doubles with floats. For example, to store an integer number with a range of 0 to 1000, you can use an unsigned short, while to store an integer number with a range of -100 to 100, you can use a signed char.

Bit-fields could contribute to minimize the size in bytes of the aggregate object, but their handling causes a slow down that could be counterproductive; therefore, postpone their introduction to the optimization stage.

[edit] Function-objects

Instead of passing a function pointer as an argument to a function, pass a function-object (or, if using the C++0x standard, a lambda expression).

For example, if you have the following array of structures:

struct S {
    int a, b;
};
S arr[n_items];

… and you want to sort it by the b field, you could define the following comparison function:

bool compare(const S& s1, const S& s2) {
    return s1.b < s2.b;
}

… and pass it to the standard sort algorithm:

std::sort(arr, arr + n_items, compare);

However, it is probably more efficient to define the following function-object class (aka functor):

struct Comparator {
    bool operator()(const S& s1, const S& s2) const {
        return s1.b < s2.b;
    }
};

… and pass a temporary instance of it to the standard sort algorithm:

std::sort(arr, arr + n_items, Comparator());

Function-objects functions are usually expanded inline, and therefore they are just as efficient as in-place code, while functions passed by pointers are rarely inlined. Lambda expressions are implemented as function-objects, so they have the same performance.

[edit] qsort and bsearch functions

Instead of the qsort and bsearch C standard library functions, use the std::sort and std::lower_bound C++ standard library functions.

The former two functions require a pointer to function as argument, as the latter two may get a function-object argument (or, using the C++0x standard, a lambda expression). Pointers to function often are not expanded inline, and therefore they are less efficient than function-objects that are almost always inlined.

[edit] Encapsulated collections

Encapsulate in specific classes the collections that are accessible from several compilation units, to make easily interchangeable the implementation.

At design time, it is difficult to decide which data structure will have optimal performance in the actual use of the software. At optimization time, you can measure whether by changing the type of a container, for example passing from std::vector to std::list, you can improve the performance. Though, such a change of implementation causes the change of most source code that uses directly the collection whose type has been changed.

If a collection is private to one compilation unit, such change will impact only the source code contained in this unit, and therefore it is not necessary to encapsulate that collection. If instead that collection isn't private, that is it is directly accessible from other compilation units, in the future there could be a huge amount of code to change for such a collection type change. Therefore, to make feasible such optimization, you should encapsulate that collection in a class whose interface does not change when the container implementation is changed.

STL containers already pursue this principle, but some operations are still available only for some containers (like operator[], existing for std::vector, but not for std::list).

[edit] STL containers usage

When using an STL container, if several equivalent expressions have the same performance, choose the more general expression.

For instance, call a.empty() instead of a.size() == 0, call iter != a.end() instead of iter < a.end(), and call distance(iter1, iter2) instead of iter2 - iter1. The former expressions are valid for every container kind, while the latter are valid only for some kinds, and the former are no less efficient than the latter.

Unfortunately, it is not always possible to write code that is equally correct and efficient for every type of container. Nevertheless, decreasing the number of statements that are dependent on the container type will decrease the number of statements have to be changed if the type of the container is later changed.

[edit] Choice of the default container

When choosing a variable-length container, if in doubt, choose a vector.

For set up to 8 elements, vector is the most efficient variable-length container for any operation.

For larger collections, other containers may become more efficient for certain operations, but vector remains the one that has the least space overhead (as long as there is no excess capacity), and the greatest locality of reference.

[edit] Inlined functions

If you use compilers that allow whole program optimization and the automatic inline expansion of functions, use such options, and do not declare any functions inline. If such compiler features are not available, declare, in header files only, functions inline that contain no more than three lines of code and have no loops.

Functions that are expanded inline avoid the overhead of function calls, which grows as the number of function arguments. In addition, since inline code is near caller code, it has better locality of reference. And lastly, as the intermediate code generated by the compiler for inlined functions is merged with the caller code, it can be more easily optimized by the compiler.

Expanding inline a tiny function, such as a function containing only a simple assignment or a simple return statement, can result in a decrease in the size of the generated machine code.

Conversely, every time a function containing substantial code is inlined the machine code is duplicated, and therefore the total size of the program is increased.

Inlined code is more difficult to profile. If a non-inlined function is a bottleneck, it can be found by the profiler. But if that function is inlined in all the places where it is called, its running time is scattered among many functions and the bottleneck situation cannot be detected by a profiler.

For functions containing substantial amounts of code, only the performance critical ones should be be declared inline at the optimization stage.

[edit] Symbols representation

To represent internal symbols, use enumerations instead of strings.

For example, instead of the following code:

const char* directions[] = { "North", "South", "East", "West" };

use the following code:

enum directions { North, South, East, West };

An enumeration is implemented as an integer. Compared to an integer, a string occupies more space and is slower to copy and compare. (In addition, using strings instead of integers for representation of internal state may introduce string comparison errors in code that deals with multiple locales.)

[edit] if and switch statements

If you have to compare an integer value with a set of constant values, instead of a sequence of if statements, use a switch statement.

For example, instead of the following code:

if (a[i] == 1) f();
else if (a[i] == 2) g();
else if (a[i] == 5) h();

write the following code:

switch (a[i]) {
case 1: f(); break;
case 2: g(); break;
case 5: h(); break;
}

Compilers may exploit the regularity of switch statements to apply some optimizations, in particular if the guideline "Case values for switch statements" in this section is applied.

[edit] Case values of switch statements

As constants for switch statements cases, use compact sequences of values, that is sequences with no gaps or with few small gaps.

Optimizing compilers, when compiling a switch statement whose case values comprise most of the values in an integer interval, instead of generating a sequence of if statements, generate a jump-table, that is an array of the start addresses of the code for every case, and when executing the switch statement, they use the table to jump to the code associated to the case number.

For example, the following C++ code:

switch (i) {
case 10:
case 13:
    func_a();
    break;
case 11:
    func_b();
    break;
}

probably generates machine code corresponding to the following pseudo-code:

// N.B.: This is not C++ code
static address jump_table[] = { case_a, case_b, end, case_a };
unsigned int index = i - 10;
if (index > 3) goto end;
goto jump_table[index];
case_a: func_a(); goto end;
case_b: func_b();
end:

Instead, the following C++ code:

switch (i) {
case 100:
case 130:
    func_a();
    break;
case 110:
    func_b();
    break;
}

probably generates machine code corresponding to the following code:

if (i == 100) goto case_a;
if (i == 130) goto case_a;
if (i == 110) goto case_b;
goto end;
case_a: func_a(); goto end;
case_b: func_b();
end:

For so few cases, probably there is little difference between the two situations, but as the case count increases, the former code becomes more efficient, as it performs only one computed goto instead of a sequence of branches.

[edit] Cases order in switch statement

In switch statements, put the most typical cases before.

If the compiler does not use a jump-table, the cases are evaluated in order of appearance; therefore, fewer comparisons are performed for the more frequent cases.

[edit] Grouping several arrays into a single array of structures

Instead of processing in parallel two or more arrays having the same length, process a single array of aggregate objects.

For example, instead of the following code:

const int n = 10000;
double a[n], b[n], c[n];
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
}

write the following code:

const int n = 10000;
struct { double a, b, c; } s[n];
for (int i = 0; i < n; ++i) {
    s[i].a = s[i].b + s[i].c;
}

Using this optimization a, b, and c are stored sequentially. For a compiler this means where a, b, and c are in relation to each other can be more accurately predicted and fewer steps may be needed to read or modify the data. If a cache is used, more efficient use of the cache may result and less throttling of the cache may be required, but only if all the data can fit into the cache and there isn't other data that needs to be stored in the cache in the mean time.

[edit] Grouping function arguments

Instead of calling a function in a loop which uses more variables than there are registers, use a function that is passed a struct or object.

For example, instead of the following code:

for (int i = 0; i < 1000; ++i) {
    f(i, a1, a2, a3, a4, a5, a6, a7, a8);
}

write the following:

struct {
    int i;
    type a1, a2, a3, a4, a5, a6, a7, a8;
} s;
s.a1 = a1; s.a2 = a2; s.a3 = a3; s.a4 = a4;
s.a5 = a5; s.a6 = a6; s.a7 = a7; s.a8 = a8;
for (int i = 0; i < 1000; ++i) {
    s.i = i;
    f(s);
}

If all the function's arguments can be placed directly into a processor's registers, the arguments can be passed and manipulated quickly. If there are more arguments then registers, all the arguments or the arguments which couldn't be placed into registers may require being pushed onto a stack at the start of every function call and popped off a stack at the end of every function call, even if the arguments have not changed. If a structure or object is passed, a register may be used and after initialization of the structure or object, only the parts of the structure or object which have changed between successive calls must be updated.

[edit] Use of containers member functions

To search for an element in a container, use a member function of the container, instead of an STL algorithm.

If such a specific member function has been created, when a generic STL algorithm was already existing, it is only because such member functions is more efficient.

For example, to search a std::set object, you can use the std::find generic algorithm, or the std::set::find member function; but the former has linear complexity (O(n)), while the latter has logarithmic complexity (O(log(n))).

[edit] Search in sorted sequences

To search a sorted sequence, use the std::lower_bound, std::upper_bound, std::equal_range, or std::binary_search generic algorithms.

Given that all the cited algorithms use a logarithmic complexity (O(log(n))) binary search, they are faster than the std::find algorithm, that uses the linear complexity (O(n)) sequential scan.

[edit] static member functions

In every class, declare every member function that does not access the non-static members of the class as static .

In other words, declare all the member functions that you can as static.

In this way, the implicit this argument is not passed.

Personal tools
Namespaces
Variants
Actions
Navigation
Community
Toolbox
In other languages
Sister projects
Print/export