Creating a Virtual Machine/Printable version

From Wikibooks, open books for an open world
Jump to navigation Jump to search


Creating a Virtual Machine

The current, editable version of this book is available in Wikibooks, the open-content textbooks collection, at
https://en.wikibooks.org/wiki/Creating_a_Virtual_Machine

Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-ShareAlike 3.0 License.

Introduction

Introduction[edit | edit source]

Scope[edit | edit source]

This book is a tutorial on creating several different types of virtual machines. The intent is to show the basics of virtual machine design and implementation. The authors will do their best to keep the program code simple and easy to read and understand.

What this book is not about:

Audience[edit | edit source]

This book assumes that you have some programming experience and understand the basics of binary numbers and hexadecimal notation. Familiarity with machine code or assembly language programming for one CPU or other would also be beneficial, but is not strictly necessary. In fact, I have found that studying a virtual machine to be an effective means of learning an assembly language.

Required Software[edit | edit source]

In order to compile and run the examples you will need to have certain software installed on your computer.

For the Register VM in C you will need to have a basic C compiler (a C++ compiler will also work fine).

  • Windows users: I recommend Quincy. It is free and easy to use.
  • Linux users: Install GCC on your system if it's not there already. Code::Blocks also works fine.
    • Debian (& derivative) users open a command shell, become root, and type apt-get install gcc.
  • Mac users: Install the Xcode package from the extra disk that came with your computer or via the AppStore, for free.

For the Register VM in Swift you will need to have a Swift interpreter.

  • Mac users: Install Xcode (free). The easiest way to use the interpreter is to paste the code into a new "playground."
  • Other platforms: In June, 2015 Apple announced it would "making Swift open source later this year." Since none of Foundation, Cocoa or UIKit are used by Register VM in Swift, it should work in any compliant port of the language.

For the Stack-Register VM in Java you will need to have a fairly recent Java compiler.

  • Windows users: Download and install the Java SE JDK from Sun's Java download page.
  • Linux users: You may use OpenJDK, but I recommend downloading and installing the Java SE JDK from Sun's Java download page.
  • Mac users: You should have Java installed already, or download it from java.com.

For the Register VM in Erlang you will need to install Erlang.

  • Windows users:
  • Linux users:
  • Mac users: (1) install MacPorts, (2) install Erlang

Terminology and Definitions[edit | edit source]

Run and execute[edit | edit source]

When a computer follows a program's instructions it is said to be running the program. Another term for run in this context is execute. This use of the word execute means to carry out, to make happen, or to put into effect. Computer programmers also talk about killing or terminating a program, but in this context the word execute is never used.

We count from 0[edit | edit source]

If you have done any Java, C, or C++ programming you will have noticed that when a program does any sort of counting, it usually starts counting from 0. We will do the same in these program examples, not by convention, but by necessity. Some high level languages encourage (or require) counting from 0 because they are based on machine languages which require counting from 0. The reason for this has to do with register indexing, something that should become clear to you as you follow through this book.

Words are numbers[edit | edit source]

When writing programs at a low level such as this, programmers often speak of words. A word in this context is a number, but one that lacks both cardinality and ordinality in the normal numeric sense. Instead a numeric word contains lexicographic information. Words are often written in hexadecimal notation.

Other terms[edit | edit source]

Before we proceed I thought it would be useful to agree on some terms that tend to be confused or misused in the field. This is not so much to impose my will upon you as it is to establish a context.

abstract machine
the definition of some machine, usually just 'on paper'
virtual machine
a software implementation of an abstract machine
emulator
a software implementation of a real machine

(Note to self: I think this should probably be moved to a glossary and expanded.)


Register VM in C

Design[edit | edit source]

The first example will be one of the simplest possible architectures, and it will consist of the following components:

  1. A set of registers (I will arbitrarily choose to have 4 of them, numbered 0 to 3). The registers serve as a group of read/write memory cells.
  2. A program, which is a read-only sequence of VM instructions
  3. An execution unit to run the program. The execution unit will read each instruction in order and execute it.

Addressing modes[edit | edit source]

In any assembly language numbers serve multiple uses. Aside from representing scalar values, they also represent register numbers and memory locations. In this example I will follow the convention of using prefix characters to denote the role a number plays.

  • Register numbers begin with the letter r, like r0, r1, r2.
  • Immediate (scalar) values begin with the hash mark #, like #100, #200.
  • Memory addresses begin with the at sign @, like @1000, @1004.

Instruction set[edit | edit source]

After designing the VM, it is necessary to design the VM's instruction set. The instruction set is simply a reflection of the kinds of things we would like to do (or allow others to do) with this VM. Here are some things it should be possible to do with this VM:

  • Load an immediate number (a constant) into a register.
  • Perform an arithmetic sum on two registers (i.e., add two numbers).
  • Halt the machine.

Let's keep it just that simple for now. After we have a working implementation we can go back and add more instructions.

First let's write a short program using these three instructions to see what they might look like:

loadi r0 #100
loadi r1 #200
add r2 r0 r1
halt

If the VM were to run the program, this is a description of what would happen on each line:

  1. Load the immediate value 100 into the register r0. Note that placing a value into a register is commonly called a load operation.
  2. Load the immediate value 200 into register r1.
  3. Add the contents of registers r0 and r1 and place the sum into register r2.
  4. End the program and halt the VM.

Note that in keeping with common convention I have chosen to place the destination registers first in the operand list.

Instruction codes[edit | edit source]

After the assembly language is created it is necessary to determine how to represent each instruction as a number. This establishes a one-to-one correspondence between each instruction in the assembly language and each instruction code in the set of instruction codes. Converting a program from assembly language to instruction codes is called assembling, and conversion from instruction codes back into assembly language is called disassembling.

Several choices we must make at this point are:

  • What number is used to represent each assembly language instruction?
  • How are instruction operands encoded?
  • Are operands part of the instruction word (remember, by word I mean number), or are they separate words (numbers)?

First, to answer the last question, since there are only small numbers of instructions and registers in this VM it should not be very difficult to encode all operands in a single instruction word, even if (for the sake of simplicity) I were to use a 16-bit instruction word. Thus, a 16-bit number written in hexadecimal has 4 digits, giving us easy access to 4 information fields, each containing 16 variations (0-9 and A-F).

  1. The first digit of a machine word will be the instruction number. This gives our VM the potential for having up to 16 different instructions. This is a small amount by contemporary standards, but it is plenty for our example virtual machine.
  2. The next three digits will be used for the operands. These can be used as three 1-digit operands, two operands of 1 and 2 digits, or a single 3-digit operand.

Having made these decisions, let us now establish the encoding. Recall that we have 16 instruction numbers available.

The halt instruction will be instruction 0, and there is an important reason for choosing 0 for this instruction. Since empty space in the computer's memory will most likely be filled with 0s, any run-away program will eventually encounter a 0 and attempt to execute this instruction, immediately halting the program.

The remaining two instructions can be assigned arbitrarily: the loadi instruction can be instruction 1, and the add instruction can be instruction 2. This is our current instruction encoding list:

0 = halt
1 = loadi
2 = add

Examining the first program instruction, we see that we must now encode the register and the immediate value:

loadi r0 #100

There are three hexadecimal digits available for operands, so we will use the first of those three as the register number, and the second and third together as the immediate value. Now we have just determined the complete encoding of the loadi instruction:

bits 15-12 = 1
bits 11- 8 = register number
bits  7- 0 = immediate value

The register number for this instruction is 0, and the immediate value is 100, which in hexadecimal is 64 (6 x 16 + 4). Thus, the complete 16-bit hexadecimal instruction code for the first instruction is 1064.

The second instruction is assembled in a similar way. The immediate value is 200, which in hexadecimal is C8. The second instruction assembles to 11C8.

The third instruction is 2201.

The last instruction is 0.

Putting the instructions together, we get these 4 16-bit hexadecimal numbers as the complete program:

1064
11C8
2201
0000

Implementation[edit | edit source]

It is now time to implement this design. I have chosen to write this program in C for a number of reasons:

  • It's a very simple language that allows the writing of concise programs.
  • It's a low level language with some high-level features. This makes it a good choice for creating a VM.

The primary function of the VM will be a run function that does the following steps in order inside a loop. The loop repeats until the halt instruction is executed.

  1. Fetch the next instruction from the program.
  2. Decode the instruction into its constituent parts.
  3. Execute the decoded instruction.

Let's begin writing the program.

Registers[edit | edit source]

The first and simplest part to implement is the set of 4 registers.

# define NUM_REGS 4
int regs[ NUM_REGS ];

The registers in this VM are signed integers that are, depending on your computer and operating system, either 16, 32, or 64 bits wide. The exact size is irrelevant for this example.

Program instructions[edit | edit source]

The fully assembled program can easily be stored in an array of integers.

int prog[] = { 0x1064, 0x11C8, 0x2201, 0x0000 };

Fetch[edit | edit source]

The fetch function retrieves the next word from the program. In order to do this we must introduce a variable called the program counter (also sometimes called the instruction pointer). Before the instruction word is returned from the function the program counter must be incremented so that it refers to the next instruction in the program array.

int pc = 0; /* program counter */

int fetch()
{
  return prog[ pc++ ];
}

Decode[edit | edit source]

The decode function will decode each instruction completely. It will determine the three operand registers and immediate value for each instruction, even if an instruction doesn't use that part. This actually makes the function simpler. The instruction number and operands are stored in variables.

int instrNum = 0;
/* operands: */
int reg1     = 0;
int reg2     = 0;
int reg3     = 0;
int imm      = 0;

/* decode a word */
void decode( int instr )
{
  instrNum = (instr & 0xF000) >> 12;
  reg1     = (instr & 0xF00 ) >>  8;
  reg2     = (instr & 0xF0  ) >>  4;
  reg3     = (instr & 0xF   );
  imm      = (instr & 0xFF  );
}

Execute[edit | edit source]

The execute function actually performs the instruction. First there needs to be a running flag that the halt instruction can set to 0.

/* the VM runs until this flag becomes 0 */
int running = 1;

Now the eval function. It's just a switch ladder. This function uses the instruction code and operand variables that the decode function stored values into. Notice that in each instruction case I display the instruction. This makes it easy to follow the program as it runs.

/* evaluate the last decoded instruction */
void eval()
{
  switch( instrNum )
  {
    case 0:
      /* halt */
      printf( "halt\n" );
      running = 0;
      break;
    case 1:
      /* loadi */
      printf( "loadi r%d #%d\n", reg1, imm );
      regs[ reg1 ] = imm;
      break;
    case 2:
      /* add */
      printf( "add r%d r%d r%d\n", reg1, reg2, reg3 );
      regs[ reg1 ] = regs[ reg2 ] + regs[ reg3 ];
      break;
  }
}

Run[edit | edit source]

The run function is rather simple: fetch, then decode, then execute.

void run()
{
  while( running )
  {
    int instr = fetch();
    decode( instr );
    eval();
  }
}

Main[edit | edit source]

The main function simply calls the run function:

int main( int argc, const char * argv[] )
{
  run();
  return 0;
}

Complete program source[edit | edit source]

For your convenience I have included the complete program source in a single block. Note that I have added a showRegs function in order to see the values of all the registers between instructions. This allowed me to verify that both the VM and the assembled program are working correctly.

# include <stdio.h>

# define NUM_REGS 4
unsigned regs[ NUM_REGS ];

unsigned program[] = { 0x1064, 0x11C8, 0x2201, 0x0000 };

/* program counter */
int pc = 0;

/* fetch the next word from the program */
int fetch()
{
  return program[ pc++ ];
}

/* instruction fields */
int instrNum = 0;
int reg1     = 0;
int reg2     = 0;
int reg3     = 0;
int imm      = 0;

/* decode a word */
void decode( int instr )
{
  instrNum = (instr & 0xF000) >> 12;
  reg1     = (instr & 0xF00 ) >>  8;
  reg2     = (instr & 0xF0  ) >>  4;
  reg3     = (instr & 0xF   );
  imm      = (instr & 0xFF  );
}

/* the VM runs until this flag becomes 0 */
int running = 1;

/* evaluate the last decoded instruction */
void eval()
{
  switch( instrNum )
  {
    case 0:
      /* halt */
      printf( "halt\n" );
      running = 0;
      break;
    case 1:
      /* loadi */
      printf( "loadi r%d #%d\n", reg1, imm );
      regs[ reg1 ] = imm;
      break;
    case 2:
      /* add */
      printf( "add r%d r%d r%d\n", reg1, reg2, reg3 );
      regs[ reg1 ] = regs[ reg2 ] + regs[ reg3 ];
      break;
  }
}

/* display all registers as 4-digit hexadecimal words */
void showRegs()
{
  int i;
  printf( "regs = " );
  for( i=0; i<NUM_REGS; i++ )
    printf( "%04X ", regs[ i ] );
  printf( "\n" );
}

void run()
{
  while( running )
  {
    showRegs();
    int instr = fetch();
    decode( instr );
    eval();
  }
  showRegs();
}

int main( int argc, const char * argv[] )
{
  run();
  return 0;
}

Sample run[edit | edit source]

Here is the output from the program:

regs = 0000 0000 0000 0000
loadi r0 #100
regs = 0064 0000 0000 0000
loadi r1 #200
regs = 0064 00C8 0000 0000
add r2 r0 r1
regs = 0064 00C8 012C 0000
halt
regs = 0064 00C8 012C 0000


Register VM in Swift

Design[edit | edit source]

The first example will be one of the simplest possible architectures, and it will consist of the following components:

  1. A set of registers (I will arbitrarily choose to have 4 of them, numbered 0 to 3). The registers serve as a group of read/write memory cells.
  2. A program, which is a read-only sequence of VM instructions
  3. An execution unit to run the program. The execution unit will read each instruction in order and execute it.

Addressing modes[edit | edit source]

In any assembly language numbers serve multiple uses. Aside from representing scalar values, they also represent register numbers and memory locations. In this example I will follow the convention of using prefix characters to denote the role a number plays.

  • Register numbers begin with the letter r, like r0, r1, r2.
  • Immediate (scalar) values begin with the hash mark #, like #100, #200.
  • Memory addresses begin with the at sign @, like @1000, @1004.

Instruction set[edit | edit source]

After designing the VM, it is necessary to design the VM's instruction set. The instruction set is simply a reflection of the kinds of things we would like to do (or allow others to do) with this VM. Here are some things it should be possible to do with this VM:

  • Load an immediate number (a constant) into a register.
  • Perform an arithmetic sum on two registers (i.e., add two numbers).
  • Halt the machine.

Let's keep it just that simple for now. After we have a working implementation we can go back and add more instructions.

First let's write a short program using these three instructions to see what they might look like:

loadi r0 #100
loadi r1 #200
add r2 r0 r1
halt

If the VM were to run the program, this is a description of what would happen on each line:

  1. Load the immediate value 100 into the register r0. Note that placing a value into a register is commonly called a load operation.
  2. Load the immediate value 200 into register r1.
  3. Add the contents of registers r0 and r1 and place the sum into register r2.
  4. End the program and halt the VM.

Note that in keeping with common convention I have chosen to place the destination registers first in the operand list.

Instruction codes[edit | edit source]

After the assembly language is created it is necessary to determine how to represent each instruction as a number. This establishes a one-to-one correspondence between each instruction in the assembly language and each instruction code in the set of instruction codes. Converting a program from assembly language to instruction codes is called assembling, and conversion from instruction codes back into assembly language is called disassembling.

Several choices we must make at this point are:

  • What number is used to represent each assembly language instruction?
  • How are instruction operands encoded?
  • Are operands part of the instruction word (remember, by word I mean number), or are they separate words (numbers)?

First, to answer the last question, since there are only small numbers of instructions and registers in this VM it should not be very difficult to encode all operands in a single instruction word, even if (for the sake of simplicity) I were to use a 16-bit instruction word. Thus, a 16-bit number written in hexadecimal has 4 digits, giving us easy access to 4 information fields, each containing 16 variations (0-9 and A-F).

  1. The first digit of a machine word will be the instruction number. This gives our VM the potential for having up to 16 different instructions. This is a small amount by contemporary standards, but it is plenty for our example virtual machine.
  2. The next three digits will be used for the operands. These can be used as three 1-digit operands, two operands of 1 and 2 digits, or a single 3-digit operand.

Having made these decisions, let us now establish the encoding. Recall that we have 16 instruction numbers available.

The halt instruction will be instruction 0, and there is an important reason for choosing 0 for this instruction. Since empty space in the computer's memory will most likely be filled with 0s, any run-away program will eventually encounter a 0 and attempt to execute this instruction, immediately halting the program.

The remaining two instructions can be assigned arbitrarily: the loadi instruction can be instruction 1, and the add instruction can be instruction 2. This is our current instruction encoding list:

0 = halt
1 = loadi
2 = add

Examining the first program instruction, we see that we must now encode the register and the immediate value:

loadi r0 #100

There are three hexadecimal digits available for operands, so we will use the first of those three as the register number, and the second and third together as the immediate value. Now we have just determined the complete encoding of the loadi instruction:

bits 15-12 = 1
bits 11- 8 = register number
bits  7- 0 = immediate value

The register number for this instruction is 0, and the immediate value is 100, which in hexadecimal is 64 (6 x 16 + 4). Thus, the complete 16-bit hexadecimal instruction code for the first instruction is 1064.

The second instruction is assembled in a similar way. The immediate value is 200, which in hexadecimal is C8. The second instruction assembles to 11C8.

The third instruction is 2201.

The last instruction is 0.

Putting the instructions together, we get these 4 16-bit hexadecimal numbers as the complete program:

1064
11C8
2201
0000

Implementation[edit | edit source]

It is now time to implement this design. This is an implementation of swift:

  • It's a very simple, modern and concise language.
  • It has both low-level and high-level features.

The primary function of the VM will be a run function that does the following steps in order inside a loop. The loop repeats until the halt instruction is executed.

  1. Fetch the next instruction from the program.
  2. Decode the instruction into its constituent parts.
  3. Execute the decoded instruction.

Let's begin writing the program.

Registers[edit | edit source]

The first and simplest part to implement is the set of 4 registers.

let NUM_REGS = 4
var regs = [Int](repeating: 0, count: NUM_REGS)

The registers in this VM are signed integers that are, depending on your computer and operating system, either 16, 32, or 64 bits wide. The exact size is irrelevant for this example.

Program instructions[edit | edit source]

The fully assembled program can easily be stored in an array of integers.

let prog = [0x1064, 0x11c8, 0x2201, 0x0000]

Fetch[edit | edit source]

The fetch function retrieves the next word from the program. In order to do this we must introduce a variable called the program counter (also sometimes called the instruction pointer). Before the instruction word is returned from the function the program counter must be incremented so that it refers to the next instruction in the program array.

var pc = 0  // program counter

func fetch() -> Int {
    let instruction = prog[pc]
    pc += 1
    return instruction
}

Decode[edit | edit source]

The decode function will decode each instruction completely. It will determine the three operand registers and immediate value for each instruction, even if an instruction doesn't use that part. This actually makes the function simpler. The instruction number and operands are stored in variables.

var instrNum = 0
// operands:
var reg1     = 0
var reg2     = 0
var reg3     = 0
var imm      = 0

// decode a word
func decode(instr:Int) -> Void {
    instrNum = (instr & 0xF000) >> 12
    reg1     = (instr & 0xF00 ) >>  8
    reg2     = (instr & 0xF0  ) >>  4
    reg3     = (instr & 0xF   )
    imm      = (instr & 0xFF  )
}

Execute[edit | edit source]

The execute function actually performs the instruction. First there needs to be a running flag that the halt instruction can set to 0.

// the VM runs until this flag becomes 0
var running = 1

Now the eval function. It's just a switch ladder. This function uses the instruction code and operand variables that the decode function stored values into. Notice that in each instruction case I display the instruction. This makes it easy to follow the program as it runs.

// evaluate the last decoded instruction
func eval() -> Void {
    switch instrNum {
    case 0:
        // halt
        print("halt")
        running = 0
    case 1:
        // loadi
        print("loadi r\(reg1) #\(imm)")
        regs[reg1] = imm;
    case 2:
        // add
        print("add r\(reg1) r\(reg2) r\(reg3)");
        regs[reg1] = regs[reg2] + regs[reg3]
    default:
        print("oh-oh");
    }
}

Run[edit | edit source]

The run function is rather simple: fetch, then decode, then execute.

func run() -> Void {
    while running != 0 {
        let instr = fetch()
        decode(instr: instr)
        eval()
    }
}

Entry point[edit | edit source]

The entry point is the run function:

run()

Complete program source[edit | edit source]

For your convenience I have included the complete program source in a single block. Note that I have added the formatRegister and showRegs functions in order to see the values of all the registers between instructions. This allowed me to verify that both the VM and the assembled program are working correctly.

let NUM_REGS = 4
var regs = [Int](repeating: 0, count: NUM_REGS)

let prog = [0x1064, 0x11c8, 0x2201, 0x0000]

var pc = 0  // program counter

func fetch() -> Int {
    let instruction = prog[pc]
    pc += 1
    return instruction
}

var instrNum = 0
// operands:
var reg1     = 0
var reg2     = 0
var reg3     = 0
var imm      = 0

// decode a word
func decode(instr:Int) -> Void {
    instrNum = (instr & 0xF000) >> 12
    reg1     = (instr & 0xF00 ) >>  8
    reg2     = (instr & 0xF0  ) >>  4
    reg3     = (instr & 0xF   )
    imm      = (instr & 0xFF  )
}

// the VM runs until this flag becomes 0
var running = 1

// evaluate the last decoded instruction
func eval() -> Void {
    switch instrNum {
    case 0:
        // halt
        print("halt")
        running = 0
    case 1:
        // loadi
        print("loadi r\(reg1) #\(imm)")
        regs[reg1] = imm;
    case 2:
        // add
        print("add r\(reg1) r\(reg2) r\(reg3)");
        regs[reg1] = regs[reg2] + regs[reg3]
    default:
        print("credit-card #");
    }
}

func formatRegister(reg:Int) -> String {
    let hex = String(reg, radix: 16, uppercase: true)
    let paddingCharacter = "0" as Character
    let padding = String(repeating: paddingCharacter, count: 4 - hex.count)
    return padding + hex
}

func showRegs() -> Void {
    print("regs = ", terminator:"")
    print(regs.map(formatRegister).joined(separator: " "))
}

func run() -> Void {
    while running != 0 {
        showRegs()
        let instr = fetch()
        decode(instr: instr)
        eval()
    }
    showRegs()
}

run()

Sample run[edit | edit source]

Here is the output from the program:

regs = 0000 0000 0000 0000 
loadi r0 #100
regs = 0064 0000 0000 0000 
loadi r1 #200
regs = 0064 00C8 0000 0000 
add r2 r0 r1
regs = 0064 00C8 012C 0000 
halt
regs = 0064 00C8 012C 0000


Symbolic Stack VM in Erlang - Compiler

This is not to be a compiler in the canonical sense of lexer, parser, code generator, optimizer, but we will concentrate on the code generation phase, and do some subsequent optimization.