Compiler Construction/Intermediate Representation

Intermediate Representation

The form of the internal representation among different compilers varies widely. If the back end is called as a subroutine by the front end then the intermediate representation is likely to be some form of annotated parse tree, possibly with supplementary tables. If the back end operates as a separate program then the intermediate representation is likely to be some low-level pseudo assembly language or some register transfer language (it could be just numbers, but debugging is easier if it is human-readable).

Stack-based representation

In this chapter, we discuss the stack-based representation of intermediate code. It has a number of advantages, some of which are:

An interpreter for the stack-based language tends to be more compact and straightforward.
A syntax of the language tends to be simple.

But the representation also has the following disadvantages, which make it unsuitable for manipulating and improving code:

It is not trivial to change the order of instructions.
Little research has been done to the stack-based code.

Complications with the stack-based code arises often with control flows.

Conversion algorithms

It is usually trivial to convert representations like three-address code to the stack-based code, so the case is left as an exercise. It is the inverse of this that is a challenging problem.

The main task behind the algorithm converting the stack-based code is to identify dependencies among operations. And it is conditional and unconditional jumps that make hard to figure these dependencies. So the code without them can be transformed into the three-address code in a straightforward way, as follows:

push 2
push 3
push 4
add
mul

We can see each stack position has a corresponding temporary variable. Put in another way, store and load are done only by push and pop, respectively, and a temporary variable that can be accessed at a time is limited to only the top as opposed to a usual case in which variables are specified freely.

s0 = 2
s1 = 3
s2 = 4
s1 = s1 + s2
s0 = s0 * s1

When a variable is typed, it may be beneficial to adopt SSA form. This dispenses with the need to analyze what type each variable holds at a moment, which, as illustrated below, can be quite tricky. The adaptation can be done after the conversion or simultaneously as the code is being converted.

Now, suppose the execution may not go from top to bottom. In that case, we basically have to analyze the control flow before translating the code. More specifically, we calculate how each instruction contributes to the depth of the stack. For example,

...
goto A // unconditionally jump to label A
...
A:    // a label
add  // push the sum of two values popped from the stack.
...

As we can see, at label A, the status of the stack depends on the operation before instruction "goto A".

One conservative approach is to annotate the byte-code before converting it. The basic idea is that when we interpret the code, we know both where we are and how tall the stack is. So by pretending as if we were evaluating the code, we can calculate the height of the stack for each position. An algorithm for this would be like (in actual writing one has to arrange the code so that it will terminate.):

procedure eval(start, depth)
{
  for i from start to code.length
  {
    depth_at[i] = depth
    case code[i]
    {
      'push': depth = depth + 1
      'pop':  depth = depth - 1
      'goto': i = code[i].target
      'if_so_goto': eval(code[i].target, depth)
      ...
    }
  }
}

eval(0, 0) // start the calculation

Coding the above solution may be tedious in practice, especially when a number of instructions in the language is large. Java byte code is a notable example of this. So a radical alternative below is to convert the stack-based code not in the usual sequential way but per basic block (i.e., a block that has no jumps). To see this, consider the following:

0 (A): push 10
1 (A): push 13
2 (A): less_than // pop < pop
3 (A): if_not_goto 6
4 (B): push '10 < 13'
5 (B): goto 7
6 (C): push 'it is not 10 < 13; your computer is broken!'
7 (C): print

In the above we can identify three basic blocks: the first (A) from 0 to 3, the second (B) from 4 to 5 and the third (C) from 6 to 7. We first compile A, then we know the height of the stack with which either B or C begins. After each block is compiled, we output blocks in the order they appear in the source code.

Astute readers would notice that throughout this section we are assuming the depth of the stack is fixed at each instruction position and thus can be determined at compiler time. If the assumption does not hold, then we have to have some kind of stack at runtime.

Exercises

Write the stack-based code for each of following high-level expressions:
- 10 * (20 + 30);
- If a < b then -a else -b;
- case a % 3 { 0: x; 1: y; 2: z; }
Write a piece of stack-based code so that the depth of the stack may vary after the piece, depending on an execution path.
Sketch the algorithm for converting three-address code to the stack-based code, assuming no jumps. Hint: view each position in the stack has a corresponding temporary variable.
Write a stack-based code such that the height of the stack at each position cannot be determined at a compiler time.

Further readings

Raja Vallee-Rai The Jimple Framework. 1998. [1]
expr.c from GCC.
CFG.java from Soot framework.
Stack caching for interpreters by M. Anton Ertl

Low-Level Pseudo Code

To do:
Complete

Annotated Parse Tree

To do:
Complete