GNU C Compiler Internals/GNU C Compiler Architecture

From Wikibooks, open books for an open world
Jump to navigation Jump to search

An Overview of GCC Architecture[edit | edit source]

This section is based on a Red Hat magazine article [1].

GNU Compiler Collection (GCC) comprises a number of compilers for different programming languages. The main GCC executable gcc processes source files written in C, C++, Objective-C, Objective-C++, Java, Fortran, or Ada and produces an assembly file for each source file. It is a driver program that invokes the appropriate compilation programs depending on the language of the source file. For a C source file they are the preprocessor and compiler cc1, the assembler as, and the linker collect2. The first and the third programs come with a GCC distribution, the assembler is a part of the GNU binutils package. This book describes the internals of the preprocessor and compiler cc1.

Each compiler includes the following three components: a front end, a middle end, and a back end. GCC compiles one file at a time. A source file goes through all three components one after another. Figure 1 illustrates the components and the source file representations associated with each component.

GCC front end, middle end, and back end with source file representations.

The purpose of the front end is to read the source file, parse it, and convert it into the standard abstract syntax tree (AST) representation. There is one front end for each programming language. Because of the differences in languages, the format of the generated ASTs is slightly different for each language. The next step after AST generation is the unification step in which the AST tree is converted into a unified form called generic. After this, the middle end part of the compiler takes control. First, the tree is converted into another representation called GIMPLE. In this form, each expression contains no more than three operands, all control flow constructs are represented as combinations of conditional statements and goto operators, arguments of a function call can only be variables, etc. Figure 2 illustrates the differences between a tree in generic form and a tree in GIMPLE form. GIMPLE is a convenient representation for optimizing the source code.

After GIMPLE, the source code is converted into the static single assignment (SSA) representation. The central idea of this form is the fact that each variable is assigned to only once, but can be used at the right hand side of an expression many times. Every time the same variable of a tree in the GIMPLE form is reassigned, the compiler creates a new version of that variable and stores the new value into it. When the same variable is assigned to in both branches of a conditional expression, one needs to merge the two possible values of the variable into a single variable. This operation is denoted as PHI function in the SSA form.

The SSA form is also used for optimizations. GCC performs more than 20 different optimizations on SSA trees. After the SSA optimization pass, the tree is converted back to the GIMPLE form which is then used to generate a register-transfer language (RTL) form of a tree. RTL is a hardware-based representation that corresponds to an abstract target architecture with an infinite number of registers. An RTL optimization pass optimizes the tree in the RTL form. Finally, a GCC back-end generates the assembly code for the target architecture using the RTL representation. Examples of back-ends are x86 back end, mips back end, etc.

In the next sections we describe the internals of the C front end and the x86 back end. The compiler starts with its initialization and command line options processing. After that the C front end preprocesses the source file, parses it and performs a number of optimizations. The back end then generates the assembly code for the target platform and saves it to a file.

Take home: GCC is a compiler collection that consists of a front end for each programming language, a middle end, and a back end for each architecture. The main representations that each source file goes through are AST in front end, RTL in middle end, and the assembly representation in back end. GCC compiles one file at a time.

GCC Initialization[edit | edit source]

The C front end includes the C/C++ preprocessor and the C compiler. Program cc1 includes both the preprocessor and C compiler. It compiles a C source file and generates an assembly (.S) file.

The compiler frontend and backend interact with each other using callback functions called language hooks. All hooks are included into a global variable struct lang_hooks lang_hooks that is defined in file langhooks.h. There are the following types of hooks: hooks for tree inlining, hooks for call graph, hooks for functions, hooks for tree dump, hooks for types, hooks for declarations, and language-specific hooks. The default values for the hooks are defined in file langhooks-def.h.

GCC initialization consists of command line option parsing, initializing the back end, creating the global scope, and initializing the built-in data types and functions.

Each declaration is associated with a scope. For example, a local variable is associated with its function's scope. A global declaration is associated with the global scope.

Function toplev_main() is the function that processes the command-line options, initializes the compiler, compiles a file, and frees up the allocated resources. Function decode_options() processes the command-line options and sets the corresponding variables within the compiler.

Following the command line option parsing function do_compile() is called. It performs the back-end initialization by calling function backend_init(). After that, function lang_dependent_init() performs language-dependent initialization which includes the initialization of a front-end and a back-end. The C initialization function c_objc_common_init() creates built-in data types, initializes the global scope and performs other initialization tasks. Function c_common_nodes_and_builtins() creates pre-defined types described in file builtin-types.def.

The standard C types are created at the initialization time. The following table presents several types:

GCC builtin types
Variable Name C type
char_type_node char
integer_type_node int
unsigned_type_node unsigned int
void_type_node void
ptr_type_node void*

GCC built-in functions are the functions that are evaluated at compile time. For example, if the size argument of a strcpy() function is a constant then GCC replaces the function call with the required number of assignments. The compiler replaces standard library calls with built-in functions and then evaluates them once the function's AST is constructed. In case of strcpy(), the compiler checks the size argument and uses the optimized built-in version of strcpy() if the argument is constant. Builtin builtin_constant_p() allows to find out if the value of its argument is known at compile time. GCC builtins are used beyond GCC. For example, the string processing library of Linux kernel uses builtin_constant_p() to invoke the optimized version of a string processing function if the string size is known at compile time.

GCC evaluates each builtin function using a corresponding expand_builtin() function. For example, builtin_strcmp() is evaluated using expand_builtin_strcmp(). The following table gives examples of GCC builtins:

GCC builtin functions
Builtin Name Explanation
builtin_constant_p returns true if the argument is a constant
builtin_memcpy equivalent to memcpy()
builtin_strlen equivalent to strlen()

Take home: GCC initialization consists of command line option parsing, initializing the back end, creating the global scope, and initializing the built-in data types and functions.

C Preprocessor[edit | edit source]

Following the initialization, function do_compile() calls function compile_file(). This function invokes parse_file() front-end language hook which is set to function c_common_parse_file() for C language. The latter function invokes function finish_options() which initializes the preprocessor and handles -D, -U, and -A command line options (which are equivalent to #define, #undef, and #assert respectively). The C preprocessor handles the preprocessor directives such as #define, #include in the source code.

Following the preprocessor initialization, c_parse_file() function is invoked. This function uses the standard lex/bison tools to parse the file. The preprocessor is implemented as a part of the lexer. The C language lexer function c_lex() calls the libcpp function cpp_get_token() which handles the preprocessor keywords. The state of the preprocessor is defined by variable cpp_reader *parse_in. Type struct cpp_reader contains most importantly the list of text buffers being processed. Each buffer corresponds to a source file (.c or .h). Function cpp_get_token() calls appropriate functions for legitimate preprocessor keywords. For example, when #include is encountered, function do_include_common() is invoked. It allocates a new buffer and places it on top of the buffer stack making it the current buffer. When a new file is compiled the buffer is popped off the stack and the compilation of the old file continues.

Function do_define() is called whenever a new macro is defined using #define keyword.

Take home: The preprocessor takes care of the preprocessor directives, for example #include and #ifdef.

Source Code Parsing[edit | edit source]

After running the preprocessor GCC constructs an abstract syntax tree (AST) for each function of the source file. An AST is a number of connected nodes of type struct tree. Each node has a tree code that defines the type of the tree. Macro TREE_CODE() is used to refer to the code. Tree codes are defined in files tree.def and c-common.def. Trees with different tree codes are grouped into tree classes. The following tree classes are defined among others in GCC:

GCC tree classes
Tree Class Explanation
'd' declarations
'<' comparison
'2' binary arithmetic

There are two types of trees: statements and expressions. Statements correspond to the C constructs, for example an expression followed by a ';', a conditional statement, a loop statement, etc. Expressions are what the statements are build up from. Examples of expressions are an assignment expression, an arithmetic expression, a function call, etc. Examples of tree codes are given in the following table:

GCC tree codes
Tree Code Tree Class Explanation Operands
MODIFY_EXPR 'e' an assignment expression TREE_OPERAND(t,0) - left hand side; TREE_OPERAND(t,1) - right hand side;
CALL_EXPR 'e' function call TREE_OPERAND(t,0) - function definition; TREE_OPERAND(t,1) - function arguments;
FUNCTION_DECL 'd' variable declaration DECL_SOURCE_FILE(t) - source file; DECL_NAME(t) - variable name;
ARRAY_TYPE 't' array type TREE_TYPE(t) - type of an array element; TYPE_DOMAIN(t) - type of index;
DECL_STMT 'e' variable declaration TREE_OPERAND(t,0) - variable; DECL_INITIAL(TREE_OPERAND(t,1)) - initial value;

In addition to the tree code that defines the type of the tree, a number of operands different for each tree type are available. For example, an assignment expression has two operands which correspond to the left and right sides of the expression. Macro TREE_OPERAND is used to refer to the operands. Macro IDENTIFIER_POINTER is used to find the name of an identifier that an IDENTIFIER_NODE tree represents. The following table presents several tree nodes, their purpose, and their operands.

Each tree has a type that corresponds to the type of C expression that it represents For example, the type of the MODIFY_EXPR node is the type of the left-hand side operand. NOP_EXPR and CONVERT_EXPR trees are used for type casting.

Tree NULL_TREE is equivalent to NULL. Function debug_tree() prints out the tree to stderr.

The parser is implemented as a bison grammar. The grammar invokes the GCC functions that construct the AST. When a new identifier is lexed, it is added to the pool of strings that GCC maintains. The tree code of an identifier is IDENTIFIER_NODE. When the same identifier is lexed again, the same tree node is returned. Function get_identifier() returns the tree node for an identifier.

A new variable declaration is processed in a number of function calls. First, function start_decl() is called with the name of the declaration, its type as returned by the lexer, and its attributes. The function calls grokdeclarator() that checks the type and argument nodes and returns a tree with the code appropriate for the declaration: VAR_DECL for variables, TYPE_DECL for types, etc. The declarartion is then added to the scope. A scope contains all declarations created within a function, but does not contain global declarations. There is also the global scope that contains global declarations. When a function is parsed, its declarations are attached to its body as a BLOCK node. When a declaration is created, the identifier node is associated with the declaration node using IDENTIFIER_SYMBOL_VALUE. Function lookup_name() returns the declaration for a given identifier. When a declaration leaves the scope the tree attributes C_DECL_INVISIBLE is asserted.

A symbol table is not maintained in GCC. Instead, the compiler uses the pool of identifiers and C_DECL_INVISIBLE attribute. Language hook lang_hooks.decls.getdecls() returns the variables in the scope chained together.

Functions start_init() and finish_init() are called for an initialized declaration. Function finish_decl() finalizes the declaration. For an array declaration it computes the size of an initialized array. Then function layout_decl() is called. It computes the size and alignment of the declaration.

Parsing a function depends on the presence of its body. A function declaration is parsed using the same functions as those used for variable declarations. For a function definition, function start_function() is called. Then the compiler parses the body of the function. When the function ends function finish_function() is called.

Functions start_decl() and start_function() take declaration's attributes parameter as one of their arguments. The attributes are described in the GCC Manual. Attributes are extensions of GNU implementation of C. The following table presents a few of them and explains their purpose:

Function attributes
Attribute Explanation
constructor call function automatically before main()
destructor call function automatically after main()
alias alias to another function

For each type of C statements there is a function that constructs a tree node of the corresponding type. For example, function build_function_call() builds a CALL_EXPR node for a function call. It takes the identifier of the function name and the arguments as its parameters. The function finds the function declaration using lookup_name() and type casts the arguments using default_conversion().

The ASTs are then converted into the SSA and eventually to the RTL representations. Whether the conversion occurs after parsing each function or when the whole file is parsed is controlled by the compiler option -funit-at-a-time. It is false by default.

Take home: GCC parser is written as a bison grammar. It creates the AST representation of the source file. ASTs are built up of tree nodes. Each node has a code. Tree nodes correspond to the statements and expressions of C. Function debug_tree() prints out the tree.

Assembly Code Generation[edit | edit source]

The backend generates the assembly code for the specified target platform. Function output_asm_insn() is called for each instruction written to the assembly file. Function final_start_function() generates the prolog of the function before it is saved to the assembly file.

Take home: The backend generates the assembly code for the specified target platform.
  1. ^