Optimizing C++/Optimization life cycle

The construction of an efficient application should adhere to the following development process:

Design. Firstly, the algorithms and data structures are designed in a way that makes sense for the application logic and that is reasonably efficient, but without considering optimization. When designing a widely used data structure the optimal implementation of which is not obvious (for example, it is debated whether it is better to use an array or a linked list), an abstract structure is defined for which the implementation may be changed at the optimization stage.
Implementation. Secondly, the code that implements the designed algorithms is written, following guidelines to avoid inefficient operations and encapsulating operations that are likely to require optimization.
Functional testing. The resulting software is then tested to increase the probability that it doesn't have crippling defects.
Optimization (aka Tuning). After having completed the development of a correctly working application, the optimization stage begins, with the following sub-stages:
1. Performance testing. Commands with inadequate performance are detected. These are commands that, when processing typical data, require more resources (CPU time, storage space, etc) than are available.
2. Profiling (aka Performance analysis). For every command with inadequate performance, a profiler is used to determine which portions of code are the so-called bottlenecks for that command. Bottlenecks are parts of code where a disproportionate amount of time is spent or memory space is allocated.
3. Algorithmic optimization. For each bottleneck, optimization techniques are applied that are largely independent of the programming language and totally independent of the platform. Such techniques can be found in algorithm theory textbooks. This optimization involves attempting to reduce the number of executed machine cycles. In particular it involves reducing the number of calls to costly routines or transforming expensive operations into equivalent but less costly operations. For example, the quick sort sorting algorithm is chosen instead of the selection sort algorithm. If this makes the program fast enough, the optimization stage is complete.
4. Platform independent optimization. For each bottleneck, optimization techniques are applied that depend upon the programming language and its standard library but that are independent of both the software and hardware platforms. For example, integer operations are used instead of floating point operations or a more appropriate container class is chosen from those available in the standard library. If this makes the program fast enough, the optimization stage is complete.
5. Software platform dependent optimization. For each bottleneck, optimization techniques are applied that depend upon both the programming language and the software platform but that are independent of the hardware platform. For example, compiler options, pragma compiler directives, language extensions, non-standard libraries, direct calls to the operating system are exploited. If this makes the program fast enough, the optimization stage is complete.
6. Hardware platform dependent optimization. For each bottleneck, optimization techniques that depend upon the hardware platform are applied. This can involve using machine instructions that are specific to a processor family or using more generally available high level features that are particularly advantageous on some processor types.

This development process follows two criteria:

Principle of diminishing returns. Optimizations that yield big results with little effort should be applied first, as this minimizes the time needed to reach the performance goals.
Principle of diminishing portability. It is better to apply optimizations applicable to several platforms first, as they remain applicable on changing platform and are more understandable to other programmers.

In the rare case of software that must be used with several compilers and several operating systems but just one processor architecture, the stages 4.5 and 4.6 should be swapped.

This stage sequence is not meant to be a one-way sequence, in which once one stage is reached, the preceding stage is no longer used. In fact, every stage may succeed or fail. If a stage succeeds, the next stage is applied, while if a stage fails, the previous stage is repeated, in a sort of backtracking algorithm.

In addition, a partial performance test should be performed after every optimization attempt, just to check whether the attempt was useful and, if so, to check whether more optimizations are needed.

Finally, after having completed the optimization, both the functional testing and the complete performance testing must be repeated to guarantee that the newly optimized version of the software is still functionally correct and has suitable performance.

This book is about only three of the above stages:

Stage 2, specifically the usage of the C++ language, in the chapter "Writing efficient code".
Some general techniques for stage 4.3, with examples in C++, in chapter "General optimization techniques".
Stage 4.4, specifically the usage of the C++ language, in the chapter "Code optimization".

Conventions

By object, it is meant an allocated region of memory. In particular, a piece of data associated to a variable of a fundamental type (like bool, double, unsigned long, or a pointer) is an object, as it is such the data structure associated to an instance of a class. With every variable an object is associated, whose size is given by the sizeof C++ operator, but an object could have no variable associated with it, or several variables associated with it. For example, a pointer is an object, but it can point to another object; this pointed object is not associated with any variable. On the other hand, in the following code, both the variable a and the variable b are associated with the same object:

int a;
int& b = a;

Arrays, structs, and class instances are objects which, if not empty, contain sub-objects. Therefore, they will be called aggregate objects.

We say that an object owns another object when the destruction of the former object causes the destruction of the latter. For example, a non-empty vector object typically contains a pointer to a buffer containing all the elements; the destruction of the vector causes the destruction of such buffer, and therefore we say that this buffer is owned by the vector object.

Some optimizations are useful only for short data sequences, others for longer sequences. Later on, the following classification will be used for objects sizes:

Tiny: No more than 8 bytes. It fits in one or two 32-bit registers or in one 64-bit register.
Small: More than 8 bytes, but no more than 64 bytes. It doesn't fit in a processor register, but it fits in a processor data cache line, and it can be wholly referenced by very compact machine instructions using an offset from the start address.
Medium: More than 64 bytes, but no more than 4096 bytes. It doesn't fit in a processor data cache line, and it cannot be wholly referenced by compact machine instructions, but it fits in the processor first-level data cache, it fits in a virtual memory page, and it fits in a mass storage cluster of blocks.
Large: More than 4096 bytes. It doesn't fit in the processor first-level data cache, it doesn't fit in a single virtual memory page, and it doesn't fit in a single mass storage cluster of blocks.

For example, an array of doubles is considered tiny only if it contains exactly one element, small if it has 2 to 8 elements, medium if it has 9 to 512 numbers, large if it has more than 512 of them.

Because there are very different hardware architectures, the given numbers are only an indication. Though, such numbers are rather realistic, and can be taken as serious criteria to develop software covering the main architectures in a rather efficient way.