C++ Programming/Compiler

From Wikibooks, the open-content textbooks collection

The Compiler

A compiler is a program that translates a computer program written in one computer language (the source code) into an equivalent program in another language, most often the native machine language of the target computer. This process of translation is called compilation.

Compilation

The output of a compiler is the result of translating (compiling) a program. The most important part of that output is saved to a file called an object file. As we have seen in the Code Section of the book, compilation consists of transforming source files into object files.

NOTE:
A successful compilation may create or need additional files. That data is not part of the C++ language. It may result from the compilation of external code (a library, for example), it may depend on the specific compiler you use (MS Visual Studio, for example, adds several extra files to a project), in which case you should check the compiler's documentation, or it may be part of a specific framework that needs to be accessed. Be aware that some of these constructs may limit the portability of the code.

The instructions of this compiled program can then be run (executed) by the computer if the object file is in an executable format. Often, however, there are additional steps that may be required to create an executable program: preprocessing and linking.

Compile Time

Compile time is the period during which a compiler processes a program, together with the operations it performs in that period (i.e., compile-time operations), during a build (creation) of a program (executable or not).

The operations performed at compile time usually include lexical analysis, syntax analysis, various kinds of semantic analysis (e.g., type checks and instantiation of templates), and code generation.
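
As a small illustration of work done at compile time, the sketch below uses a class template whose value is computed while the compiler instantiates it. The names Factorial and table are made up for this example and are not part of the standard library.

```cpp
// Factorial<N>::value is computed by the compiler while it instantiates
// the template; no multiplication is left to do at run time.
template <unsigned N>
struct Factorial {
    static const unsigned value = N * Factorial<N - 1>::value;
};

// Explicit specialization that ends the recursion.
template <>
struct Factorial<0> {
    static const unsigned value = 1;
};

// An array bound must be a compile-time constant, so this declaration is
// only accepted because Factorial<4>::value is known during compilation.
int table[Factorial<4>::value];
```

If the type check or the template instantiation failed, the error would be reported during the build, before any program exists to run.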

The definition of a programming language will specify compile time requirements that source code must meet to be successfully compiled.

Compile time occurs before link time (when the output of one or more compiled files is joined together) and runtime (when a program is executed). In some programming languages it may be necessary for some compilation and linking to occur at runtime. The concept of runtime will be introduced later.

Lexical analysis

This is alternatively known as scanning or tokenisation. It happens before syntax analysis and converts the source code, expressed as characters arranged on lines, into a sequence of tokens, which are the units of the code that the later stages actually work with. There are tokens for each reserved keyword and operator, and tokens for identifiers and literal values. The lexical analyzer is also the part of the compiler that discards whitespace and other characters that carry no meaning for compilation, such as comments. Whitespace serves only to separate tokens and is otherwise ignored.

To give a simple illustration of the process:

int main()
{
    std::cout << "hello world" << std::endl;
    return 0;
}

Depending on the lexical rules used it might be tokenized as:

1 = string "int"
2 = string "main"
3 = opening parenthesis
4 = closing parenthesis
5 = opening brace
6 = string "std"
7 = scope resolution operator (::)
8 = string "cout"
9 = << operator
10 = string literal "hello world"
11 = string "endl"
12 = semicolon
13 = string "return"
14 = number 0
15 = closing brace

And so for this program the lexical analyzer might send something like:

1 2 3 4 5 6 7 8 9 10 9 6 7 11 12 13 14 12 15

This sequence is passed to the syntactical analyzer, discussed next, to be parsed. It is easier for the syntactical analyzer to apply the rules of the language when it can work with numerical values, can distinguish between language syntax (such as the semicolon) and everything else, and knows the category of each token.

Syntax Analysis

This step (sometimes also called syntax checking) ensures that the code is valid and can be translated into an executable program. The syntactical analyzer applies the rules of the language to the tokens, checking, for example, that each opening brace has a corresponding closing brace, that each declaration has a type, and that the type exists. Syntax analysis is more complicated than lexical analysis. As an example:

int main()
{
    std::cout << "hello world" << std::endl;
    return 0;
}

The syntax analyzer would first look at the token "int", check it against the defined keywords, and find that it names the integer type. It would then treat the next token as an identifier and check that it is a valid identifier name. Next comes an opening parenthesis, so the analyzer treats "main" as a function; had it found a semicolon, this would be the declaration of a variable, and had it found an equals sign, the initialization of an integer variable. After the opening parenthesis it finds a closing parenthesis, meaning that the function has no parameters. The next token is an opening brace, so this is the definition of the function main rather than a mere declaration, which would have ended with a semicolon. The analyzer would probably also keep a counter to track the nesting level of the statement blocks and make sure that the braces come in pairs.

Inside the function body it reads the token "std" and, on seeing the :: operator that follows, checks that "std" is a valid namespace. It then takes "cout" as the name of an identifier in the namespace "std" and finds that it is an output stream object. The << operator comes next, so the analyzer checks that << can be used with cout and that the following token, the string literal "hello world", can be used with the << operator. The same check happens for the second << operator. Then it reaches the "std" token again, looks past it to the :: operator, checks the namespace once more, and checks that "endl" is in that namespace. The semicolon marks the end of the statement.

Next it sees the keyword "return" and expects an integer value to follow, because main returns an integer; it finds 0, which is an integer. The following semicolon ends the statement, and the closing brace ends the function. There are no more tokens, so if the syntax analyzer found no errors in the code, it passes its result on to the later stages of the compiler so that the program can be converted to machine language. This is a simplified view: real syntax analyzers do not work exactly this way, and checks such as whether a name exists or whether types match are usually classed as semantic analysis, but the idea is the same.

The syntax analyzer also relies on the keywords of the language, listed below, to make sure that you are not using any of them as identifier names, to know what type you are giving your variables, and to recognize the constructs that are built into the C++ language.

Compile Speed

There are several factors that dictate how fast a compilation proceeds, like:

  • Hardware
    • Resources (Slow CPU, low memory and even a slow HDD can have an influence)
  • Software
    • The compiler itself: newer versions are often faster or optimize better, but the choice may depend on how portable you want the project to be.
    • The design selected for the program (structure of object dependencies, includes) will also factor in.

Experience shows that slow compile times usually point to a poorly structured program, so take the time to structure your code in a way that minimizes recompilation after changes. Large projects will inevitably take longer to compile. Precompiled headers and include guards (including external include guards) can help. We will discuss ways to reduce compile time in the Optimization Section of this book.
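
Two common ways of limiting recompilation can be sketched in a single hypothetical header. The file name car.h, the macro CAR_H and the classes Car and Engine are assumptions made for this illustration only.

```cpp
// car.h (hypothetical header)
#ifndef CAR_H   // include guard: the body below is processed only once
#define CAR_H   // per translation unit, however often the file is included

class Engine;   // forward declaration instead of #include "engine.h"

class Car {
public:
    explicit Car(Engine* engine) : engine_(engine) {}
    bool hasEngine() const { return engine_ != 0; }
private:
    Engine* engine_;  // a pointer needs only the declaration of Engine
};

#endif // CAR_H
```

Because this header never includes the definition of Engine, source files that include it do not have to be recompiled when the Engine class changes, unless they use Engine directly.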

ISO C++ (C++98) Keywords

  • and
  • and_eq
  • asm
  • auto
  • bitand
  • bitor
  • bool
  • break
  • case
  • catch
  • char
  • class
  • compl
  • const
  • const_cast
  • continue
  • default
  • delete
  • do
  • double
  • dynamic_cast
  • else
  • enum
  • explicit
  • export
  • extern
  • false
  • float
  • for
  • friend
  • goto
  • if
  • inline
  • int
  • long
  • mutable
  • namespace
  • new
  • not
  • not_eq
  • operator
  • or
  • or_eq
  • private
  • protected
  • public
  • register
  • reinterpret_cast
  • return
  • short
  • signed
  • sizeof
  • static
  • static_cast
  • struct
  • switch
  • template
  • this
  • throw
  • true
  • try
  • typedef
  • typeid
  • typename
  • union
  • unsigned
  • using
  • virtual
  • void
  • volatile
  • wchar_t
  • while
  • xor
  • xor_eq

Specific compilers may (in a non-standard compliant mode) also treat some other words as keywords, including cdecl, far, fortran, huge, interrupt, near, pascal, typeof. Old compilers may recognize the overload keyword, an anachronism that has been removed from the language.

C++11, the revision of the language known during its development as C++0x, added several keywords, including:

  • static_assert
  • decltype
  • nullptr

(These were chosen carefully to minimize breakage to existing code; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2105.html for some details.)
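
On a compiler that implements these additions (they were standardized in C++11), the three keywords can be tried out with a sketch like the one below. The overloaded function takesPointer is a made-up name used only to show how nullptr differs from the literal 0.

```cpp
#include <type_traits>

// static_assert is evaluated by the compiler; a false condition stops
// the build and prints the message.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// decltype names the type of an expression without evaluating it.
static_assert(std::is_same<decltype(1 + 2.0), double>::value,
              "int + double yields double");

// nullptr is a null pointer constant that never converts to int, so it
// selects the pointer overload, whereas the literal 0 selects the int one.
bool takesPointer(int)         { return false; }
bool takesPointer(const char*) { return true; }
```

Calling takesPointer(nullptr) picks the second overload, while takesPointer(0) picks the first.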

Old compilers may not recognize some or all of the following keywords:

  • and
  • and_eq
  • bitand
  • bitor
  • bool
  • catch
  • compl
  • const_cast
  • dynamic_cast
  • explicit
  • export
  • false
  • mutable
  • namespace
  • not
  • not_eq
  • or
  • or_eq
  • reinterpret_cast
  • static_cast
  • template
  • throw
  • true
  • try
  • typeid
  • typename
  • using
  • wchar_t
  • xor
  • xor_eq

C++ Reserved Identifiers

Some identifiers that are not keywords are nevertheless reserved for distinct uses, to avoid naming conflicts among vendors, library creators and users in general.

Reserved identifiers include any name containing two consecutive underscores (__), any name that starts with an underscore followed by an uppercase letter, and some other categories of reserved identifiers carried over from the C library specification.

A list of C reserved identifiers can be found at the Internet Wayback Machine archived page: http://web.archive.org/web/20040209031039/http://oakroadsystems.com/tech/c-predef.htm#ReservedIdentifiers

Compiler Keywords

A limited set of keywords exists to directly influence the compiler's behavior. These keywords are very powerful and must be used with care, since they may make a huge difference to the program's compile time and running speed.

In the C++ Standard, these keywords are called specifiers.

auto

NOTE:

This functionality became part of the C++ Standard with C++11 (known as C++0x during its development); older compilers may not support it.

The auto keyword used to have a different behavior, but since C++11 it allows one to omit the type of a variable and let the compiler deduce it from the initializer. This is particularly useful for generic programming in which the return type of a function may depend on the type of its arguments. Thus, rather than this:

#include <iostream>
#include <vector>

int main() {
  int x = 42;
  std::vector<double> numbers;
  numbers.push_back(1.0);
  numbers.push_back(2.0);
  for (std::vector<double>::iterator i = numbers.begin();
       i != numbers.end(); ++i) {
    std::cout << *i << " ";
  }
}

we could write this:

#include <iostream>
#include <vector>

int main() {
  auto x = 42; // We can use auto on base types...
  std::vector<double> numbers;
  numbers.push_back(1.0);
  numbers.push_back(2.0);
  // But auto is most useful for complicated types.
  for (auto i = numbers.begin(); i != numbers.end(); ++i) {
    std::cout << *i << " ";
  }
}
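
The generic programming case, where the return type depends on the argument types, might look like the following sketch. It assumes a C++11 compiler, and the function name add is chosen only for this illustration.

```cpp
#include <type_traits>

// The return type depends on both argument types, so it is written with
// auto and a trailing decltype, which lets the compiler work it out.
template <typename A, typename B>
auto add(A a, B b) -> decltype(a + b) {
    return a + b;
}

// Checked at compile time: adding an int and a double yields a double.
static_assert(std::is_same<decltype(add(1, 2.5)), double>::value,
              "add(int, double) returns double");
```

Without auto, the author of add would have to name the result type by hand for every combination of argument types.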

inline

A function declaration with the inline keyword declares an inline function. The keyword suggests to the compiler that the function be subjected to in-line expansion; that is, that the compiler insert the complete body of the function in every context where the function is used. This avoids the overhead of making the CPU jump from one place in the code to another and back again to execute a subroutine, as is done in naive implementations of subroutines.

Example:

inline void swap(int& a, int& b) { int const tmp(b); b = a; a = tmp; }

Marking a function as inline is a (non-binding) request to the compiler to consider inlining the function, i.e., expanding its code at the call site. A member function defined inside a class/struct definition is implicitly inline; it is legal, but redundant, to add the inline keyword in that context, and good style is to omit it.

Example:

struct length
{
  explicit length(int metres) : m_metres(metres) {}
  operator int&() { return m_metres; }
  private:
  int m_metres;
};

Inlining can be an optimization, or a pessimization. It can increase code size (by duplicating the code for a function at multiple call sites) or can decrease it (if the code for the function, after optimization, is less than the size of the code needed to call a non-inline function). It can increase speed (by allowing for more optimization and by avoiding jumps) or can decrease speed (by increasing code size and hence cache misses).

One important side-effect of inlining is that more code is then accessible to the optimizer.

Marking a function as inline also has an effect on linking: multiple definitions of an inline function are permitted, provided that each is in a different translation unit and that they are all identical. This allows inline function definitions to appear in header files; defining non-inline functions in header files is almost always an error (though function templates can also be defined in header files, and often are).

Mainstream C++ compilers like Microsoft Visual C++ and GCC support an option that lets the compilers automatically inline any suitable function, even those that are not marked as inline functions. A compiler is often in a better position than a human to decide whether a particular function should be inlined; in particular, the compiler may not be willing or able to inline many functions that the human asks it to.

Excessive use of inline functions can greatly increase coupling/dependencies and compilation time, as well as making header files less useful as documentation of interfaces.
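
A header that relies on the linking rule for inline functions might look like this sketch. The file name geometry.h and the names square and Point are hypothetical.

```cpp
// geometry.h (hypothetical header)
#ifndef GEOMETRY_H
#define GEOMETRY_H

// Defined in the header: inline lets every translation unit that
// includes this file carry its own identical definition without the
// linker complaining about duplicates.
inline int square(int x) { return x * x; }

struct Point {
    int x, y;
    // Defined inside the class, so implicitly inline; no keyword needed.
    int squaredLength() const { return square(x) + square(y); }
};

#endif // GEOMETRY_H
```

Removing the inline keyword from square and including the header from two source files would normally produce a multiple definition error at link time.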

extern

The extern keyword tells the compiler that a variable is defined elsewhere, typically in another source module. The linker then finds the actual definition and resolves every use of the extern variable to that location. If a variable is declared extern and the linker finds no definition of it, the linker reports an error such as "unresolved external symbol".

Examples:

extern int i;
declares that there is a variable named i of type int, defined somewhere in the program.
extern int j = 0;
defines a variable j with external linkage; the extern keyword is redundant here.
extern void f();
declares that there is a function f taking no arguments and with no return value defined somewhere in the program; extern is redundant, but sometimes considered good style.
extern void f() {;}
defines the function f() declared above; again, the extern keyword is technically redundant here as external linkage is default.
extern const int k = 1;
defines a constant int k with value 1 and external linkage; extern is required because const variables have internal linkage by default.
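
The split between declaration and definition can be sketched in one listing. In a real project the two parts would live in different files; the names requestCount and recordRequest are invented for this example.

```cpp
// --- what a shared header would contain ---
extern int requestCount;   // declaration only: no storage is reserved here
void recordRequest();      // functions have external linkage by default

// --- what exactly one source file would contain ---
int requestCount = 0;      // the single definition the linker resolves to

void recordRequest() {
    ++requestCount;        // every caller updates the same variable
}
```

If the definition of requestCount were missing from every source file, the program would compile but fail to link.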

Storage Class Specifiers

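  • register - A hint to the compiler that the specified variable will be heavily used; therefore the compiler should consider allocating a CPU register to the variable. The compiler may ignore this hint. (The keyword was deprecated in C++11, and this use was removed in C++17.)
  • static - Gives a variable static storage duration: it occupies a single memory location for the whole run of the program. A static local variable keeps its value between calls to its function, and a static data member is shared by all objects of its class.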
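
Both uses of static described above can be seen in a short sketch. The names nextTicket and Session are hypothetical.

```cpp
// A static local variable is initialized once and keeps its value
// between calls to the function.
int nextTicket() {
    static int last = 0;
    return ++last;
}

// A static data member is shared by every object of the class.
struct Session {
    static int liveCount;
    Session()  { ++liveCount; }
    ~Session() { --liveCount; }
};

int Session::liveCount = 0;  // the one definition of the shared member
```

Successive calls to nextTicket return 1, 2, 3 and so on, and Session::liveCount always reflects how many Session objects created this way are currently alive.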