GNU C Compiler Internals/GNU C Compiler Architecture

From Wikibooks, the open-content textbooks collection

Jump to: navigation, search

Contents

[edit] An Overview of GCC Architecture

This section is based on a Red Hat magazine article [1].

GNU Compiler Collection (GCC) comprises a number of compilers for different programming languages. The main GCC executable gcc processes source files written in C, C++, Objective-C, Objective-C++, Java, Fortran, or Ada and produces an assembly file for each source file. It is a driver program that invokes the appropriate compilation programs depending on the language of the source file. For a C source file they are the preprocessor and compiler cc1, the assembler as, and the linker collect2. The first and the third programs come with a GCC distribution, the assembler is a part of the GNU binutils package. This book describes the internals of the preprocessor and compiler cc1.

Each compiler includes the following three components: a front end, a middle end, and a back end. GCC compiles one file at a time. A source file goes through all three components one after another. Figure 1 illustrates the components and the source file representations associated with each component.

GCC front end, middle end, and back end with source file representations.

The purpose of the front end is to read the source file, parse it, and convert it into the standard abstract syntax tree (AST) representation. There is one front end for each programming language. Because of the differences in languages, the format of the generated ASTs is slightly different for each language. The next step after AST generation is the unification step in which the AST tree is converted into a unified form called generic. After this, the middle end part of the compiler takes control. First, the tree is converted into another representation called GIMPLE. In this form, each expression contains no more than three operands, all control flow constructs are represented as combinations of conditional statements and goto operators, arguments of a function call can only be variables, etc. Figure 2 illustrates the differences between a tree in generic form and a tree in GIMPLE form. GIMPLE is a convenient representations for optimizing the source code.


After GIMPLE, the source code is converted into the static single assignment (SSA) representation. The central idea of this form is the fact that each variable is assigned to only once, but can be used at the right hand side of an expression many times. Every time the same variable of a tree in the GIMPLE form is reassigned, the compiler creates a new version of that variable and stores the new value into it. When the same variable is assigned to in both branches of a conditional expression, one needs to merge the two possible values of the variable into a single variable. This operation is denoted as PHI function in the SSA form.


The SSA form is also used for optimizations. GCC performs more than 20 different optimizations on SSA trees. After the SSA optimization pass, the tree is converted back to the GIMPLE form which is then used to generate a register-transfer language (RTL) form of a tree. RTL is a hardware-based representation that corresponds to an abstract target architecture with an infinite number of registers. An RTL optimization pass optimizes the tree in the RTL form. Finally, a GCC back-end generates the assembly code for the target architecture using the RTL representation. Examples of back-ends are x86 back end, mips back end, etc.

In the next sections we describe the internals of the C front end and the x86 back end. The compiler starts with its initialization and command line options processing. After that the C front end preprocesses the source file, parses it and performs a number of optimizations. The back end then generates the assembly code for the target platform and saves it to a file.

Take home: GCC is a compiler collection that consists of a front end for each programming language, a middle end, and a back end for each architecture. The main representations that each source file goes through are AST in front end, RTL in middle end, and the assembly representation in back end. GCC compiles one file at a time.

[edit] GCC Initialization

The C front end includes the C/C++ preprocessor and the C compiler. Program cc1 includes both the preprocessor and C compiler. It compiles a C source file and generates an assembly (.S) file.

The compiler frontend and backend interact with each other using callback functions called language hooks. All hooks are included into a global variable struct lang_hooks lang_hooks that is defined in file langhooks.h. There are the following types of hooks: hooks for tree inlining, hooks for call graph, hooks for functions, hooks for tree dump, hooks for types, hooks for declarations, and language-specific hooks. The default values for the hooks are defined in file langhooks-def.h.

GCC initialization consists of command line option parsing, initializing the back end, creating the global scope, and initializing the built-in data types and functions.

Each declaration is associated with a scope. For example, a local variable is associated with its function's scope. A global declaration is associated with the global scope.

Function {{{4}}} is the function that processes the command-line options, initializes the compiler, compiles a file, and frees up the allocated resources. Function {{{4}}} processes the command-line options and sets the corresponding variables within the compiler.

Following the command line option parsing function {{{4}}} is called. It performs the back-end initialization by calling function {{{4}}}. After that, function {{{4}}} performs language-dependent initialization which includes the initialization of a front-end and a back-end. The C initialization function {{{4}}} creates built-in data types, initializes the global scope and performs other initialization tasks. Function {{{4}}} creates pre-defined types described in file builtin-types.def.

The standard C types are created at the initialization time. The following table presents several types:

GCC builtin types
Variable Name C type
char_type_node char
integer_type_node int
unsigned_type_node unsigned int
void_type_node void
ptr_type_node void*

GCC built-in functions are the functions that are evaluated at compile time. For example, if the size argument of a strcpy() function is a constant then GCC replaces the function call with the required number of assignments. The compiler replaces standard library calls with built-in functions and then evaluates them once the function's AST is constructed. In case of strcpy(), the compiler checks the size argument and uses the optimized built-in version of strcpy() if the argument is constant. Builtin {{{4}}} allows to find out if the value of its argument is known at compile time. GCC builtins are used beyond GCC. For example, the string processing library of Linux kernel uses builtin_constant_p() to invoke the optimized version of a string processing function if the string size is known at compile time.

GCC evaluates each builtin function using a corresponding expand_builtin() function. For example, builtin_strcmp() is evaluated using {{{4}}}. The following table gives examples of GCC builtins:

GCC builtin functions
Builtin Name Explanation
builtin_constant_p returns true if the argument is a constant
builtin_memcpy equivalent to memcpy()
builtin_strlen equivalent to strlen()


Take home: GCC initialization consists of command line option parsing, initializing the back end, creating the global scope, and initializing the built-in data types and functions.

[edit] C Preprocessor

Following the initialization, function {{{4}}} calls function {{{4}}}. This function invokes parse_file() front-end language hook which is set to function {{{4}}} for C language. The latter function invokes function {{{4}}} which initializes the preprocessor and handles -D, -U, and -A command line options (which are equivalent to #define, #undef, and #assert respectively). The C preprocessor handles the preprocessor directives such as #define, #include in the source code.

Following the preprocessor initialization, {{{4}}} function is invoked. This function uses the standard lex/bison tools to parse the file. The preprocessor is implemented as a part of the lexer. The C language lexer function {{{4}}} calls the libcpp function {{{4}}} which handles the preprocessor keywords. The state of the preprocessor is defined by variable *parse_in {{{4}}}. Type cpp_reader {{{4}}} contains most importantly the list of text buffers being processed. Each buffer corresponds to a source file (.c or .h). Function {{{4}}} calls appropriate functions for legitimate preprocessor keywords. For example, when #include is encountered, function {{{4}}} is invoked. It allocates a new buffer and places it on top of the buffer stack making it the current buffer. When a new file is compiled the buffer is popped off the stack and the compilation of the old file continues.

Function {{{4}}} is called whenever a new macro is defined using #define keyword.

Take home: The preprocessor takes care of the preprocessor directives, for example #include and #ifdef.

[edit] Source Code Parsing

After running the preprocessor GCC constructs an abstract syntax tree (AST) for each function of the source file. An AST is a number of connected nodes of type tree {{{4}}}. Each node has a tree code that defines the type of the tree. Macro {{{4}}} is used to refer to the code. Tree codes are defined in files {{{4}}} and {{{4}}}. Trees with different tree codes are grouped into tree classes. The following tree classes are defined among others in GCC:

GCC tree classes
Tree Class Explanation
'd' declarations
'<' comparison
'2' binary arithmetic


There are two types of trees: statements and expressions. Statements correspond to the C constructs, for example an expression followed by a ';', a conditional statement, a loop statement, etc. Expressions are what the statements are build up from. Examples of expressions are an assignment expression, an arithmetic expression, a function call, etc. Examples of tree codes are given in the following table:

GCC tree codes
Tree Code Tree Class Explanation Operands
MODIFY_EXPR 'e' an assignment expression TREE_OPERAND(t,0) - left hand side; TREE_OPERAND(t,1) - right hand side;
CALL_EXPR 'e' function call TREE_OPERAND(t,0) - function definition; TREE_OPERAND(t,1) - function arguments;
FUNCTION_DECL 'd' variable declaration DECL_SOURCE_FILE(t) - source file; DECL_NAME(t) - variable name;
ARRAY_TYPE 't' array type TREE_TYPE(t) - type of an array element; TYPE_DOMAIN(t) - type of index;
DECL_STMT 'e' variable declaration TREE_OPERAND(t,0) - variable; DECL_INITIAL(TREE_OPERAND(t,1)) - initial value;

In addition to the tree code that defines the type of the tree, a number of operands different for each tree type are available. For example, an assignment expression has two operands which correspond to the left and right sides of the expression. Macro {{{4}}} is used to refer to the operands. Macro {{{4}}} is used to find the name of an identifier that an {{{4}}} tree represents. The following table presents several tree nodes, their purpose, and their operands.

Each tree has a type that corresponds to the type of C expression that it represents For example, the type of the {{{4}}} node is the type of the left-hand side operand. {{{4}}} and {{{4}}} trees are used for type casting.

Tree NULL_TREE is equivalent to NULL. Function {{{4}}} prints out the tree to stderr.

The parser is implemented as a bison grammar. The grammar invokes the GCC functions that construct the AST. When a new identifier is lexed, it is added to the pool of strings that GCC maintains. The tree code of an identifier is {{{4}}}. When the same identifier is lexed again, the same tree node is returned. Function {{{4}}} returns the tree node for an identifier.

A new variable declaration is processed in a number of function calls. First, function {{{4}}} is called with the name of the declaration, its type as returned by the lexer, and its attributes. The function calls {{{4}}} that checks the type and argument nodes and returns a tree with the code appropriate for the declaration: {{{4}}} for variables, {{{4}}} for types, etc. The declarartion is then added to the scope. A scope contains all declarations created within a function, but does not contain global declarations. There is also the global scope that contains global declarations. When a function is parsed, its declarations are attached to its body as a {{{4}}} node. When a declaration is created, the identifier node is associated with the declaration node using {{{4}}}. Function {{{4}}} returns the declaration for a given identifier. When a declaration leaves the scope the tree attributes {{{4}}} is asserted.

A symbol table is not maintained in GCC. Instead, the compiler uses the pool of identifiers and {{{4}}} attribute. Language hook lang_hooks.decls.getdecls() returns the variables in the scope chained together.

Functions {{{4}}} and {{{4}}} are called for an initialized declaration. Function {{{4}}} finalizes the declaration. For an array declaration it computes the size of an initialized array. Then function {{{4}}} is called. It computes the size and alignment of the declaration.

Parsing a function depends on the presense of its body. A function declaration is parsed using the same functions as those used for variable declarations. For a function definition, function {{{4}}} is called. Then the compiler parses the body of the function. When the function ends function {{{4}}} is called.

Functions {{{4}}} and {{{4}}} take declaration's attributes parameter as one of their arguments. The attributes are described in the GCC Manual. Attributes are extensions of GNU implementation of C. The following table presents a few of them and explains their purpose:

Function attributes
Attribute Explanation
constructor call function automatically before main()
destructor call function automatically after main()
alias alias to another function

For each type of C statements there is a function that constructs a tree node of the corresponding type. For example, function {{{4}}} builds a CALL_EXPR node for a function call. It takes the identifier of the function name and the arguments as its parameters. The function finds the function declaration using {{{4}}} and type casts the arguments using {{{4}}}.

The ASTs are then converted into the SSA and eventually to the RTL representations. Whether the conversion occurs after parsing each function or when the whole file is parsed is controlled by the compiler option -funit-at-a-time. It is false by default.

Take home: GCC parser is written as a bison grammar. It creates the AST representation of the source file. ASTs are built up of tree nodes. Each node has a code. Tree nodes correspond to the statements and expressions of C. Function debug_tree() prints out the tree.

[edit] Assembly Code Generation

The backend generates the assembly code for the specified target platform. Function {{{4}}} is called for each instruction written to the assembly file. Function {{{4}}} generates the prolog of the function before it is saved to the assembly file.

Take home: The backend generates the assembly code for the specified target platform.
Prev: Table of Contents GNU C Compiler Architecture Next: GEM Framework
  1. ^ http://www.redhat.com/magazine/002dec04/features/gcc/