# X86 Assembly/Print Version

# Introduction

## Why Learn Assembly?

Assembly is among the oldest tools in a computer programmer's toolbox. Entire software projects can be written without ever looking at a single line of assembly code. So the question arises: why learn assembly? Assembly language is one of the closest forms of communication that humans can engage in with a computer. With assembly, the programmer can precisely track the flow of data and execution in a program in a mostly human-readable form. Once a program has been compiled, it is difficult (and at times nearly impossible) to reverse-engineer the code into its original form. As a result, if you wish to examine a program that is already compiled but would rather not stare at hexadecimal or binary, you will need to examine it in assembly language. Debuggers also frequently show program code only in assembly language, which is one more of the many benefits of learning the language.

Assembly language is also the preferred tool, if not the only tool, for implementing some low-level tasks, such as bootloaders and low-level kernel components. Code written in assembly has less overhead than code written in high-level languages, so carefully written assembly code can run faster than equivalent programs written in other languages. Also, code that is written in a high-level language can be compiled into assembly and "hand optimized" to squeeze every last bit of speed out of it. As hardware manufacturers such as Intel and AMD add new features and new instructions to their processors, often the only way to access those features is to use assembly routines, at least until the major compiler vendors add support for them.

Developing a program in assembly can be a very time-consuming process, however. While it might not be a good idea to write new projects in assembly language, it is certainly valuable to know a little bit about it.

## Who is This Book For?

This book will serve as an introduction to assembly language and a good resource for people who already know about the topic, but need some more information on x86 system architecture. It will also describe some of the more advanced uses of x86 assembly language. All readers are encouraged to read (and contribute to) this book, although prior knowledge of programming fundamentals would definitely be beneficial.

## How is This Book Organized?

The first section will discuss the x86 family of chips and introduce the basic instruction set. The second section will explain the differences between the syntax of different assemblers. The third section will go over some of the additional instruction sets available, including the floating point, MMX, and SSE operations.

The fourth section will cover some advanced topics in x86 assembly, including some low-level programming tasks such as writing bootloaders. There are many tasks that cannot be easily implemented in a higher-level language such as C or C++. For example, enabling and disabling interrupts, enabling protected mode, accessing the Control Registers, creating a Global Descriptor Table, and other tasks all need to be handled in assembly. The fourth section will also deal with interfacing assembly language with C and other high-level languages. Once a function is written in assembly (a function to enable protected mode, for instance), we can interface that function to a larger, C-based (or even C++ based) kernel. The fifth section will discuss the standard x86 chipset, cover the basic x86 computer architecture, and generally deal with the hardware side of things.

The current layout of the book is designed to give readers as much information as they need without going overboard. Readers who want to learn assembly language on a given assembler only need to read the first section and the chapter in the second section that directly relates to their assembler. Programmers looking to implement the MMX or SSE instructions for different algorithms only really need to read section 3. Programmers looking to implement bootloaders, kernels, or other low-level tasks, can read section 4. People who really want to get to the nitty-gritty of the x86 hardware design can continue reading on through section 5.

# Basic FAQ

This page is going to serve as a basic FAQ for people who are new to assembly language programming.

## How Does the Computer Read/Understand Assembly?

The computer doesn't really "read" or "understand" anything per se, since a computer has no awareness or consciousness, but that's beside the point. The fact is that the computer cannot read the assembly language that you write. Your assembler converts the assembly language into a form of binary information called "machine code" that your computer uses to perform its operations. If you don't assemble the code, it's complete gibberish to the computer.

That said, assembly is important because each assembly instruction usually corresponds to a single machine instruction, so it is possible for "mere mortals" to do the assembler's job by hand with nothing but a blank sheet of paper, a pencil, and an assembly instruction reference book. Indeed, in the early days of computers, "hand assembling" machine instructions was common and, in some instances, even required for basic computer programs. A classic example is Steve Wozniak, who hand assembled the entire Integer BASIC interpreter into 6502 machine code for use on his initial Apple I computer. It should be noted, however, that such feats for commercially distributed software are so rare that they deserve special mention for that fact alone. Very few programmers have actually done this for more than a few instructions, and even then only for a classroom assignment.
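
To give a feel for what hand assembly involves, here is a minimal Python sketch (Python is used purely for illustration and is not part of any assembler). It encodes a single instruction form, `mov r16, imm16` in 16-bit code, which consists of the opcode byte B8h plus the register number, followed by the immediate value stored least significant byte first. The function name `assemble_mov_imm16` and the `REGISTER_NUMBERS` table are made up for this example.

```python
# Register numbers used in x86 instruction encodings (16-bit registers).
REGISTER_NUMBERS = {
    "ax": 0, "cx": 1, "dx": 2, "bx": 3,
    "sp": 4, "bp": 5, "si": 6, "di": 7,
}

def assemble_mov_imm16(register, value):
    """Encode `mov register, value` for 16-bit code as raw machine code bytes."""
    opcode = 0xB8 + REGISTER_NUMBERS[register.lower()]
    # The 16-bit immediate follows the opcode, least significant byte first.
    return bytes([opcode]) + (value & 0xFFFF).to_bytes(2, byteorder="little")

machine_code = assemble_mov_imm16("ax", 0x010C)
print(" ".join(f"{byte:02X}" for byte in machine_code))  # prints: B8 0C 01
```

A real assembler does the same kind of table lookup for every instruction form, which is why doing it by hand for a whole program is so laborious.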

## Is it the Same On Windows/DOS/Linux?

The answers to this question are yes and no. The basic x86 machine code is dependent only on the processor. The x86 versions of Windows and Linux are obviously built on the x86 machine code. There are a few differences between Linux and Windows programming in x86 Assembly:

1. On a Linux computer, the most popular assemblers are GAS (the GNU Assembler), which uses the AT&T syntax for writing code, and the Netwide Assembler, also known as NASM, which uses a syntax similar to MASM.
2. On a Windows computer, the most popular assembler is MASM, which uses the Intel syntax, although many Windows users also use NASM.
3. The available software interrupts, and their functions, are different on Windows and Linux.
4. The available code libraries are different on Windows and Linux.

Using the same assembler, the basic assembly code written on each operating system is largely the same, except that you interact with Windows differently than you interact with Linux.

## Which Assembler is Best?

The short answer is that none of the assemblers is better than any other; it's a matter of personal preference.

The long answer is that different assemblers have different capabilities, drawbacks, etc. If you only know GAS syntax, then you will probably want to use GAS. If you know Intel syntax and are working on a Windows machine, you might want to use MASM. If you don't like some of the quirks or complexities of MASM and GAS, you might want to try FASM or NASM. We will cover the differences between the different assemblers in section 2.

## Do I Need to Know Assembly?

You don't need to know assembly for most computer tasks, but it can definitely be useful. Learning assembly is not about learning a new programming language. If you are going to start a new programming project (unless that project is a bootloader, a device driver, or a kernel), then you will probably want to avoid assembly like the plague. An exception to this could be if you absolutely need to squeeze the last bits of performance out of a congested inner loop and your compiler is producing suboptimal code. Keep in mind, though, that premature optimization is the root of all evil, although some computing-intense realtime tasks can only be optimized sufficiently if optimization techniques are understood and planned for from the start.

However, learning assembly gives you a particular insight into how your computer works on the inside. When you program in a higher-level language like C or Ada, all your code will eventually need to be converted into machine code instructions so your computer can execute them. Understanding the limits of exactly what the processor can do, at the most basic level, will also help when programming a higher-level language.

## How Should I Format my Code?

Most assemblers require that each assembly instruction appear on its own line, separated from the next by a line break. Most assemblers also allow whitespace to appear between instructions, operands, etc. Exactly how you format code is up to you, although there are some common ways:

One way keeps everything lined up:

```
Label1:
mov ax, bx
add ax, bx
jmp Label3
Label2:
mov ax, cx
...
```

Another way keeps all the labels in one column and all the instructions in another column:

```
Label1: mov ax, bx
        add ax, bx
        jmp Label3
Label2: mov ax, cx
...
```

Another way puts labels on their own lines and indents instructions slightly:

```
Label1:
   mov ax, bx
   add ax, bx
   jmp Label3
Label2:
   mov ax, cx
...
```

Yet another way separates labels and instructions into separate columns AND keeps labels on their own lines:

```
Label1:
        mov ax, bx
        add ax, bx
        jmp Label3
Label2:
        mov ax, cx
...
```

So there are different ways to do it, but there are some general rules that assembly programmers usually follow:

1. Make your labels obvious, so other programmers can see where they are.
2. More structure (indents) will make your code easier to read.
3. Use comments to explain what you are doing. The meaning of a piece of assembly code is often not immediately clear.

# X86 Family

The term "x86" can refer both to an instruction set architecture and to microprocessors which implement it. The name x86 is derived from the fact that many of Intel's early processors had names ending in "86".

The x86 instruction set architecture originated at Intel and has evolved over time through the addition of new instructions as well as the expansion to 64 bits. As of 2009, x86 primarily refers to IA-32 (Intel Architecture, 32-bit) and/or x86-64, the extension to 64-bit computing.

Versions of the x86 instruction set architecture have been implemented by Intel, AMD and several other vendors, with each vendor having its own family of x86 processors.

## Intel x86 Microprocessors

8086/8087 (1978)
The 8086 was the original x86 microprocessor, with the 8087 as its floating-point coprocessor. The 8086 was Intel's first 16-bit microprocessor with a 20-bit address bus, thus enabling it to address up to 1 Megabyte, although the architecture of the original IBM PC imposed a limit of 640 Kilobytes of RAM, with the remainder reserved for ROM and memory-mapped expansion cards, such as video memory. This limitation is still present in modern CPUs, since they all support the backward-compatible "Real Mode" and boot into it.
8088 (1979)
After the development of the 8086, Intel also created the lower-cost 8088. The 8088 was similar to the 8086, but with an 8-bit data bus instead of a 16-bit bus. The address bus was left untouched.
80186/80187 (1982)
The 186 was the second Intel chip in the family; the 80187 was its floating point coprocessor. Except for the addition of some new instructions, optimization of some old ones, and an increase in the clock speed, this processor was identical to the 8086.
80286/80287 (1982)
The 286 was the third model in the family; the 80287 was its floating point coprocessor. The 286 introduced the “Protected Mode” of operation, as opposed to the “Real Mode” that the earlier models used. All subsequent x86 chips can also be made to run in real mode or in protected mode. Switching back from protected mode to real mode was initially not supported, but was found to be possible (although relatively slow) by resetting the CPU, then continuing in real mode. Although the processor featured an address bus with 24 lines (24 bits, allowing it to address up to 16 Mebibytes), these could only be used in protected mode. In real mode, the processor was still limited to 20-bit addresses.
80386 (1985)
The 386 was the fourth model in the family. It was the first Intel microprocessor with a 32-bit word. The 386DX model was the original 386 chip, and the 386SX model was an economy model that used the same instruction set, but which only had a 16-bit data bus. Both used 32-bit addresses, making it possible to avoid the segmented addressing methods used in the previous models and enabling a "flat" memory model, where one register can hold an entire address, instead of relying on two 16-bit registers to create a 20-bit/24-bit address. The flat memory layout was only supported in protected mode. Also, unlike the 286, it featured a "virtual 8086 mode" in which protected-mode software could switch to perform real-mode operations (although this backward compatibility was not complete, as the physical memory was still protected). The 386EX model is still used today in embedded systems.
80486 (1989)
The 486 was the fifth model in the family. It had an integrated floating point unit for the first time in x86 history. Early model 80486 DX chips were found to have defective FPUs. They were physically modified to disconnect the FPU portion of the chip and sold as the 486SX (486-SX15, 486-SX20, and 486-SX25). A 487 "math coprocessor" was available to 486SX users and was essentially a 486DX with a working FPU and an extra pin added. The arrival of the 486DX-50 processor saw the widespread introduction of fanless heat-sinks being used to keep the processors from overheating.
Pentium (1993)
Intel called it the “Pentium” because they couldn't trademark the code number “80586”. The original Pentium was a faster chip than the 486 with a few other enhancements; later models also integrated the MMX instruction set.
Pentium Pro (1995)
The Pentium Pro was the sixth-generation architecture microprocessor, originally intended to replace the original Pentium in a full range of applications, but later reduced to a more narrow role as a server and high-end desktop chip.
Pentium II (1997)
The Pentium II was based on a modified version of the P6 core first used for the Pentium Pro, but with improved 16-bit performance and the addition of the MMX SIMD instruction set, which had already been introduced on the Pentium MMX.
Pentium III (1999)
Initial versions of the Pentium III were very similar to the earlier Pentium II, the most notable difference being the addition of SSE instructions.
Pentium 4 (2000)
The Pentium 4 had a new 7th generation "NetBurst" architecture. Pentium 4 chips also introduced the notions of “Hyper-Threading” and “Multi-Core” chips.
Core (2006)
The architecture of the Core processors was actually an even more advanced version of the 6th generation architecture dating back to the 1995 Pentium Pro. The limitations of the NetBurst architecture, especially in mobile applications, were too great to justify creation of more NetBurst processors. The Core processors were designed to operate more efficiently with a lower clock speed. All Core branded processors had two processing cores; the Core Solos had one core disabled, while the Core Duos used both cores.
Core 2 (2006)
An upgraded, 64-bit version of the Core architecture. All desktop versions are multi-core.
i Series (2008)
The successor to Core 2 processors, with the i7 line featuring Hyper-Threading.
Celeron (first model 1998)
The Celeron chip is actually a large number of different chip designs, depending on price. Celeron chips are the economy line of chips, and are frequently cheaper than the Pentium chips, even if the Celeron model in question is based on a Pentium architecture.
Xeon (first model 1998)
The Xeon processors are modern Intel processors made for servers, which have a much larger cache (measured in megabytes in comparison to other chips' kilobyte-sized cache) than the Pentium microprocessors.

## AMD x86 Compatible Microprocessors

Athlon
Athlon is the brand name applied to a series of different x86 processors designed and manufactured by AMD. The original Athlon, or Athlon Classic, was the first seventh-generation x86 processor and, in a first, retained the initial performance lead it had over Intel's competing processors for a significant period of time.
Turion
Turion 64 is the brand name AMD applies to its 64-bit low-power (mobile) processors. Turion 64 processors (but not Turion 64 X2 processors) are compatible with AMD's Socket 754 and are equipped with 512 or 1024 KiB of L2 cache, a 64-bit single channel on-die memory controller, and an 800 MHz HyperTransport bus.
Duron
The AMD Duron was an x86-compatible computer processor manufactured by AMD. It was released as a low-cost alternative to AMD's own Athlon processor and the Pentium III and Celeron processor lines from rival Intel.
Sempron
Sempron is, as of 2006, AMD's entry-level desktop CPU, replacing the Duron processor and competing against Intel's Celeron D processor.
Opteron
The AMD Opteron is the first eighth-generation x86 processor (K8 core), and the first of AMD's AMD64 (x86-64) processors. It is intended to compete in the server market, particularly in the same segment as the Intel Xeon processor.

# X86 Architecture

## x86 Architecture

The x86 architecture has 8 General-Purpose Registers (GPR), 6 Segment Registers, 1 Flags Register and an Instruction Pointer. 64-bit x86 has additional registers.

### General-Purpose Registers (GPR) - 16-bit naming conventions

The 8 GPRs are:

1. Accumulator register (AX). Used in arithmetic operations.
2. Counter register (CX). Used in shift/rotate instructions and loops.
3. Data register (DX). Used in arithmetic operations and I/O operations.
4. Base register (BX). Used as a pointer to data (located in segment register DS, when in segmented mode).
5. Stack Pointer register (SP). Pointer to the top of the stack.
6. Stack Base Pointer register (BP). Used to point to the base of the stack.
7. Source Index register (SI). Used as a pointer to a source in stream operations.
8. Destination Index register (DI). Used as a pointer to a destination in stream operations.

The order in which they are listed here is for a reason: it is the same order that is used in a push-to-stack operation, which will be covered later.

All registers can be accessed in 16-bit and 32-bit modes. In 16-bit mode, the register is identified by its two-letter abbreviation from the list above. In 32-bit mode, this two-letter abbreviation is prefixed with an 'E' (extended). For example, 'EAX' is the accumulator register as a 32-bit value.

Similarly, in the 64-bit version, the 'E' is replaced with an 'R', so the 64-bit version of 'EAX' is called 'RAX'.

It is also possible to address the first four registers (AX, CX, DX and BX), in their 16-bit form, as two 8-bit halves. The least significant byte (LSB), or low half, is identified by replacing the 'X' with an 'L'. The most significant byte (MSB), or high half, uses an 'H' instead. For example, CL is the LSB of the counter register, whereas CH is its MSB.

In total, this gives us five ways to access the accumulator, counter, data and base registers: 64-bit, 32-bit, 16-bit, 8-bit LSB, and 8-bit MSB. The other four are accessed in only three ways: 64-bit, 32-bit and 16-bit. The following table summarises this:

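| Register | Accumulator | Counter | Data | Base | Stack Pointer | Stack Base Pointer | Source | Destination |
|---|---|---|---|---|---|---|---|---|
| 64-bit | RAX | RCX | RDX | RBX | RSP | RBP | RSI | RDI |
| 32-bit | EAX | ECX | EDX | EBX | ESP | EBP | ESI | EDI |
| 16-bit | AX | CX | DX | BX | SP | BP | SI | DI |
| 8-bit | AH, AL | CH, CL | DH, DL | BH, BL | | | | |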
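
As a rough illustration of how these names overlap, the following Python sketch treats the accumulator as one 64-bit integer and derives the narrower views with masks and shifts. The function name `sub_registers` and the sample value are invented for this example; on real hardware the names simply refer to parts of the same physical register.

```python
def sub_registers(rax):
    """Return the narrower views of a 64-bit accumulator value."""
    eax = rax & 0xFFFFFFFF   # low 32 bits
    ax = rax & 0xFFFF        # low 16 bits
    ah = (rax >> 8) & 0xFF   # high byte of AX
    al = rax & 0xFF          # low byte of AX
    return {"RAX": rax, "EAX": eax, "AX": ax, "AH": ah, "AL": al}

views = sub_registers(0x1122334455667788)
for name, value in views.items():
    print(f"{name} = {value:#x}")
```

Note that AH and AL together always make up AX, so changing either one changes AX (and therefore EAX and RAX) as well.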

### Segment Registers

The 6 Segment Registers are:

• Stack Segment (SS). Pointer to the stack.
• Code Segment (CS). Pointer to the code.
• Data Segment (DS). Pointer to the data.
• Extra Segment (ES). Pointer to extra data ('E' stands for 'Extra').
• F Segment (FS). Pointer to more extra data ('F' comes after 'E').
• G Segment (GS). Pointer to still more extra data ('G' comes after 'F').

Most applications on most modern operating systems (like FreeBSD, Linux or Microsoft Windows) use a memory model that points nearly all segment registers to the same place (and uses paging instead), effectively disabling their use. Typically the use of FS or GS is an exception to this rule, instead being used to point at thread-specific data.

### EFLAGS Register

The EFLAGS is a 32-bit register used as a collection of bits representing Boolean values to store the results of operations and the state of the processor.

The names of these bits are:

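| Bit | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flag | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ID | VIP | VIF | AC | VM | RF |

| Bit | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flag | 0 | NT | IOPL | IOPL | OF | DF | IF | TF | SF | ZF | 0 | AF | 0 | PF | 1 | CF |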

The bits named 0 and 1 are reserved bits and shouldn't be modified.

The uses of these flags are as follows (each entry is numbered by its bit position):

0. CF : Carry Flag. Set if the last arithmetic operation carried (addition) or borrowed (subtraction) a bit beyond the size of the register. This is then checked when the operation is followed with an add-with-carry or subtract-with-borrow to deal with values too large for just one register to contain.
2. PF : Parity Flag. Set if the number of set bits in the least significant byte is even.
4. AF : Adjust Flag. Carry for Binary Coded Decimal (BCD) arithmetic operations.
6. ZF : Zero Flag. Set if the result of an operation is zero (0).
7. SF : Sign Flag. Set if the result of an operation is negative.
8. TF : Trap Flag. Set to enable single-step debugging.
9. IF : Interrupt Flag. Set if interrupts are enabled.
10. DF : Direction Flag. Stream direction. If set, string operations will decrement their pointer rather than incrementing it, reading memory backwards.
11. OF : Overflow Flag. Set if a signed arithmetic operation results in a value too large for the register to contain.
12-13. IOPL : I/O Privilege Level field (2 bits). I/O Privilege Level of the current process.
14. NT : Nested Task flag. Controls chaining of interrupts. Set if the current process is linked to the next process.
16. RF : Resume Flag. Controls the response to debug exceptions.
17. VM : Virtual-8086 Mode. Set if in 8086 compatibility mode.
18. AC : Alignment Check. Set if alignment checking of memory references is enabled.
19. VIF : Virtual Interrupt Flag. Virtual image of IF.
20. VIP : Virtual Interrupt Pending flag. Set if an interrupt is pending.
21. ID : Identification Flag. If this flag can be set, the CPUID instruction is supported.
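
To make the arithmetic flags more concrete, here is a small Python sketch that models an 8-bit addition and works out CF, ZF, SF, PF and OF from the result, following the descriptions above. The helper name `add8` is hypothetical, and the model covers only these five flags for one operation; it is not a complete description of how the processor updates EFLAGS.

```python
def add8(a, b):
    """Add two 8-bit values; return (result, flags) as a processor would see them."""
    a &= 0xFF
    b &= 0xFF
    full = a + b                 # unbounded sum, may need a ninth bit
    result = full & 0xFF         # what actually fits in the 8-bit register
    flags = {
        "CF": full > 0xFF,                              # carry out of bit 7
        "ZF": result == 0,                              # result is zero
        "SF": bool(result & 0x80),                      # top bit set: negative if signed
        "PF": bin(result).count("1") % 2 == 0,          # even number of set bits
        # Signed overflow: both inputs share a sign that the result does not.
        "OF": bool(~(a ^ b) & (a ^ result) & 0x80),
    }
    return result, flags

print(add8(0xFF, 0x01))  # wraps around to zero: CF and ZF are set
print(add8(0x7F, 0x01))  # 127 + 1 overflows the signed range: OF and SF are set
```

The two calls show why CF and OF are separate flags: the first addition overflows only when the operands are read as unsigned, the second only when they are read as signed.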

### Instruction Pointer

The EIP register contains the address of the next instruction to be executed if no branching is done.

EIP cannot be named directly as an operand; it can only be read through the stack after a `call` instruction.

### Memory

The x86 architecture is little-endian, meaning that multi-byte values are written least significant byte first. (This refers only to the ordering of the bytes, not to the bits.)

So the 32-bit value B3B2B1B0 (hexadecimal, with B3 the most significant byte) on an x86 would be represented in memory as:

 `B0` `B1` `B2` `B3`

For example, the 32-bit double word 0x1BA583D4 (the 0x denotes hexadecimal) would be written in memory as:

 `D4` `83` `A5` `1B`

This will be seen as `0xD4 0x83 0xA5 0x1B` when doing a memory dump.
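
The same byte order can be reproduced outside of assembly. The following Python sketch is only an illustration (the variable names are arbitrary): it uses the standard `int.to_bytes` method to lay out the example value the way it would appear in a memory dump.

```python
value = 0x1BA583D4

# Store the 32-bit value the way an x86 processor lays it out in memory.
in_memory = value.to_bytes(4, byteorder="little")
print(" ".join(f"{byte:02X}" for byte in in_memory))  # prints: D4 83 A5 1B

# Reading the bytes back with the same byte order recovers the value.
assert int.from_bytes(in_memory, byteorder="little") == value
```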

### Two's Complement Representation

Two's complement is the standard way of representing negative integers in binary. The sign is changed by inverting all of the bits and adding one.

 Start: `0001` Invert: `1110` Add One: `1111`

0001 represents decimal 1

1111 represents decimal -1
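
The invert-and-add-one rule can be sketched in a few lines of Python. The function names `negate` and `as_signed` and the 4-bit default width are assumptions chosen to match the example above; a real register would be 8, 16, 32 or 64 bits wide.

```python
def negate(value, bits=4):
    """Two's complement negation within a fixed register width."""
    mask = (1 << bits) - 1
    inverted = value ^ mask          # invert all of the bits
    return (inverted + 1) & mask     # add one, discarding any carry out

def as_signed(value, bits=4):
    """Interpret a raw bit pattern as a signed two's complement integer."""
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value

pattern = negate(0b0001)
print(f"{pattern:04b} -> {as_signed(pattern)}")  # prints: 1111 -> -1
```

Applying `negate` twice returns the original pattern, which is what makes the same rule work in both directions.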

### Addressing modes

The addressing mode indicates the manner in which the operand is presented.

Register Addressing
(the operand is held in the register named in the address field)
```
mov ax, bx  ; moves contents of register bx into ax
```
Immediate
(actual value is in the field)
```
mov ax, 1   ; moves value of 1 into register ax
```

or

```
mov ax, 010Ch ; moves value of 0x010C into register ax
```
Direct memory addressing
(operand address is in the address field)
```
.data
my_var dw 0abcdh ; my_var = 0xabcd
.code
mov ax, [my_var] ; copy my_var content into ax (ax=0xabcd)
```
Direct offset addressing
(uses arithmetic to modify the address)
```
byte_tbl db 12,15,16,22,..... ; Table of bytes
mov al,[byte_tbl+2] ; load the third byte of the table (16) into al
mov al,byte_tbl[2] ; same as the former
```
Register Indirect
(field points to a register that contains the operand address)
```
mov ax,[di] ; load the word at the address held in di into ax
```
In 16-bit code, the registers used for indirect addressing are BX, BP, SI and DI.

### General-purpose registers (64-bit naming conventions)

Main page: X86 Assembly/16 32 and 64 Bits
Main page: X86 Assembly/SSE

64-bit x86 adds 8 more general-purpose registers, named R8, R9, R10 and so on up to R15. It also introduces a new naming convention that must be used for these new registers and can also be used for the old ones (except that AH, CH, DH and BH have no equivalents). In the new convention:

• R0 is RAX.
• R1 is RCX.
• R2 is RDX.
• R3 is RBX.
• R4 is RSP.
• R5 is RBP.
• R6 is RSI.
• R7 is RDI.
• R8,R9,R10,R11,R12,R13,R14,R15 are the new registers and have no other names.
• R0D~R15D are the lowermost 32 bits of each register. For example, R0D is EAX.
• R0W~R15W are the lowermost 16 bits of each register. For example, R0W is AX.
• R0L~R15L are the lowermost 8 bits of each register. For example, R0L is AL.

As well, 64-bit x86 includes SSE2, so each 64-bit x86 CPU has at least 8 registers (named XMM0~XMM7) that are 128 bits wide, but only accessible through SSE instructions. They cannot be used for quadruple-precision (128-bit) floating-point arithmetic, but they can each hold 2 double-precision or 4 single-precision floating-point values for a SIMD parallel instruction. They can also be operated on as 128-bit integers or vectors of shorter integers. If the processor supports AVX, as newer Intel and AMD desktop CPUs do, then each of these registers is actually the lower half of a 256-bit register (named YMM0~YMM7), the whole of which can be accessed with AVX instructions for further parallelization.

## Stack

The stack is a Last In First Out (LIFO) data structure; data is pushed onto it and popped off of it in the reverse order.

```
mov ax, 006Ah
mov bx, 0F79Ah   ; a hex literal must begin with a digit, hence the leading 0
mov cx, 1124h
push ax
```

You push the value in AX onto the top of the stack, which now holds the value \$006A.

```
push bx
```

You do the same thing to the value in BX; the stack now has \$006A and \$F79A.

```
push cx
```

Now the stack has \$006A, \$F79A, and \$1124.

```
call do_stuff
```

Do some stuff. The function is not required to preserve the registers it uses, which is why we saved them.

```
pop cx
```

Pop the last element pushed onto the stack into CX, \$1124; the stack now has \$006A and \$F79A.

```
pop bx
```

Pop the last element pushed onto the stack into BX, \$F79A; the stack now has just \$006A.

```
pop ax
```

Pop the last element pushed onto the stack into AX, \$006A; the stack is empty.

The Stack is usually used to pass arguments to functions or procedures and also to keep track of control flow when the `call` instruction is used. The other common use of the Stack is temporarily saving registers.
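
The save-and-restore pattern walked through above can be mimicked with an ordinary Python list, as in the sketch below. This is only a model: the `registers` dictionary and the `do_stuff` function that overwrites them are stand-ins invented for the example, and a real stack lives in memory addressed through SP rather than in a list.

```python
stack = []   # the end of the list plays the role of the top of the stack
registers = {"ax": 0x006A, "bx": 0xF79A, "cx": 0x1124}

def do_stuff(regs):
    """Hypothetical callee that overwrites the registers it uses."""
    regs["ax"] = regs["bx"] = regs["cx"] = 0

# push ax / push bx / push cx
for name in ("ax", "bx", "cx"):
    stack.append(registers[name])

do_stuff(registers)   # call do_stuff

# pop cx / pop bx / pop ax: the reverse order of the pushes
for name in ("cx", "bx", "ax"):
    registers[name] = stack.pop()

print({name: hex(value) for name, value in registers.items()})
```

If the pops were done in the same order as the pushes, AX and CX would end up with each other's values, which is the practical consequence of the stack being last in, first out.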

## CPU Operation Modes

### Real Mode

Real Mode is a holdover from the original Intel 8086. You generally won't need to know anything about it (unless you are programming for a DOS-based system or, more likely, writing a boot loader that is directly called by the BIOS).

The Intel 8086 accessed memory using 20-bit addresses. But, as the processor itself was 16-bit, Intel invented an addressing scheme that provided a way of mapping a 20-bit addressing space into 16-bit words. Today's x86 processors start in the so-called Real Mode, which is an operating mode that mimics the behavior of the 8086, with some very tiny differences, for backwards compatibility.

In Real Mode, a segment and an offset register are used together to yield a final memory address. The value in the segment register is multiplied by 16 (shifted 4 bits to the left) and the offset is added to the result. This provides a usable address space of 1 MB. However, a quirk in the addressing scheme allows access past the 1 MB limit if a segment address of 0xFFFF (the highest possible) is used; on the 8086 and 8088, all accesses to this area wrapped around to the low end of memory, but on the 80286 and later, up to 65520 bytes past the 1 MB mark can be addressed this way if the A20 address line is enabled. See: The A20 Gate Saga.
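
The address calculation described above can be written out as a short Python sketch. The function name `physical_address` and its `a20_enabled` parameter are made up for this illustration; masking the result to 20 bits models the wrap-around behaviour of the 8086 and 8088 (or of a later processor with the A20 line disabled).

```python
def physical_address(segment, offset, a20_enabled=True):
    """Combine a 16-bit segment and a 16-bit offset into a Real Mode address."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)  # segment * 16 + offset
    if not a20_enabled:
        linear &= 0xFFFFF   # only 20 address lines: wrap around at 1 MB
    return linear

print(hex(physical_address(0x1234, 0x0010)))                     # 0x12350
print(hex(physical_address(0xFFFF, 0xFFFF)))                     # 0x10ffef
print(hex(physical_address(0xFFFF, 0xFFFF, a20_enabled=False)))  # 0xffef
```

The highest reachable address, 0x10FFEF, lies 65520 bytes beyond the last address below the 1 MB mark (0xFFFFF), which is where the figure quoted above comes from. The sketch also shows that many different segment:offset pairs refer to the same physical byte.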

One benefit shared by Real Mode segmentation and by Protected Mode Multi-Segment Memory Model is that all addresses must be given relative to another address (that is, the segment base address). A program can have its own address space and completely ignore the segment registers, and thus no pointers have to be relocated to run the program. Programs can perform near calls and jumps within the same segment, and data is always relative to segment base addresses (which in the Real Mode addressing scheme are computed from the values loaded in the Segment Registers).

This is what the DOS *.COM format does; the contents of the file are loaded into memory and blindly run. However, because Real Mode segments are always 64 KB long, COM files could not be larger than that (in fact, they had to fit into 65280 bytes, since DOS used the first 256 bytes of a segment for housekeeping data); for many years this wasn't a problem.

### Protected Mode

#### Flat Memory Model

If you are programming under a modern operating system (such as Linux or Windows), you are basically programming in flat 32-bit mode. Any register can be used in addressing, and it is generally more efficient to use a full 32-bit register instead of a 16-bit register part. Additionally, segment registers are generally unused in flat mode, and it is generally a bad idea to touch them.

#### Multi-Segmented Memory Model

Using a 32-bit register to address memory, the program can access (almost) all of the memory in a modern computer. For earlier processors (with only 16-bit registers) the segmented memory model was used. The 'CS', 'DS', and 'ES' registers are used to point to the different chunks of memory. For a small program (small model), CS, DS, and ES all hold the same value (CS=DS=ES). For larger memory models, these 'segments' can point to different locations.

# Comments

## Comments

When writing code, it is very helpful to use some comments explaining what is going on. A comment is a section of regular text that the assembler ignores when turning the assembly code into the machine code. In assembly, comments are usually denoted with a semicolon ";", although GAS uses "#" for single line comments and "/* ... */" for multi-line comments.

Here is an example:

```
Label1:
   mov ax, bx    ;move contents of bx into ax
   add ax, bx    ;add the contents of bx to ax
   ...
```

Everything after the semicolon, on the same line, is ignored. Let's show another example:

```
Label1:
   mov ax, bx
   ;mov cx, ax
   ...
```

Here, the assembler never sees the second instruction "mov cx, ax", because it ignores everything after the semicolon. When someone reads the code in the future they will find the comments and hopefully try to figure out what the programmer intended.
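
What "ignoring everything after the semicolon" means can be imitated with a few lines of Python. This is a deliberately naive sketch: the function name `strip_comment` is invented for the example, and unlike a real assembler it does not check whether a semicolon appears inside a quoted string.

```python
def strip_comment(line):
    """Drop everything from the first semicolon to the end of the line."""
    code, _, _comment = line.partition(";")
    return code.rstrip()

source = [
    "Label1:",
    "   mov ax, bx    ;move contents of bx into ax",
    "   ;mov cx, ax",
]
for line in source:
    print(repr(strip_comment(line)))
```

The third line comes out empty, which is exactly why the commented-out `mov cx, ax` never reaches the machine code.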

## HLA Comments

HLA (High Level Assembly) lets you write comments in C or C++ style, but semicolons cannot be used to start a comment. This is because in HLA, a semicolon ends every instruction:

```
mov(ax, bx); //This is a C++ comment.
/*mov(cx, ax);  everything between the slash-stars is commented out.
This is a C comment*/
```

C++ comments go all the way to the end of the line, but C comments go on for many lines from the "/*" all the way until the "*/". For a better understanding of C and C++ comments in HLA, see Programming:C or the C++ Wikibooks.

# 16 32 and 64 Bits

When using x86 assembly, it is important to consider the differences between architectures that are 16, 32, and 64 bits. This page will talk about some of the basic differences between architectures with different bit widths.

## The 8086 Registers

The 8086 registers are the following: AX, BX, CX, DX, BP, SP, DI, SI, CS, SS, ES, DS, IP and FLAGS. They are all 16 bits wide.

On any Windows-based system (except 64 bit versions), you can run a very handy program called "debug.exe" from a DOS shell, which is very useful for learning about 8086. If you are using DOSBox or FreeDOS, you can use "debug.exe" as provided by FreeDOS.

AX, BX, CX, DX
These general purpose registers can also be addressed as 8-bit registers. So AX = AH (high 8-bit) and AL (low 8-bit).
SI, DI
These registers are usually used as offsets into data space. By default, SI is offset from the DS data segment, DI is offset from the ES extra segment, but either or both of these can be overridden.
SP
This is the stack pointer, offset usually from the stack segment SS. Data is pushed onto the stack for temporary storage, and popped off the stack when it is needed again.
BP
The stack frame, usually treated as an offset from the stack segment SS. Parameters for subroutines are commonly pushed onto the stack when the subroutine is called, and BP is set to the value of SP when a subroutine starts. BP can then be used to find the parameters on the stack, no matter how much the stack is used in the meanwhile.
CS, DS, ES, SS
The segment registers. Each holds the starting address, divided by 16, of the current code segment, data segment, extra segment, and stack segment respectively.
IP
The instruction pointer. Offset from the code segment CS, this points at the next instruction to be executed.
FLAGS (F)
A number of single-bit flags that indicate (or sometimes set) the current status of the processor.
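
The relationship between AX and its two halves can be modelled with a little bit arithmetic. The following is a minimal sketch in Python rather than assembly; the helper names `split_ax` and `join_ax` are made up for this illustration.

```python
def split_ax(ax):
    """Return (AH, AL) for a 16-bit AX value."""
    ax &= 0xFFFF
    return ax >> 8, ax & 0xFF


def join_ax(ah, al):
    """Rebuild AX from its high and low 8-bit halves."""
    return ((ah & 0xFF) << 8) | (al & 0xFF)


ah, al = split_ax(0x12AB)
print(hex(ah), hex(al))          # 0x12 0xab
print(hex(join_ax(0x12, 0xFF)))  # 0x12ff: writing AL leaves AH untouched
```

The same split applies to BX, CX and DX.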

The original 8086 only had registers that were 16 bits in size, so a single register could hold one value in the range [0 - (2^16 - 1)] (put more simply: it could address up to 65536 different bytes, or 64 Kilobytes). The address bus (the connection to the memory controller, which receives an address, loads the content at that address, and returns the data on the data bus to the CPU) was 20 bits wide, however, allowing up to 1 Megabyte of memory to be addressed. No register by itself was therefore large enough to make use of the entire width of the address bus: 4 bits were left unused, reducing the usable address range by a factor of 16 (1024 Kilobytes / 64 Kilobytes = 16).

The problem was this: how can a 20-bit address space be referred to by the 16-bit registers? To solve this problem, the engineers of Intel came up with segment registers CS (Code Segment), DS (Data Segment), ES (Extra Segment), and SS (Stack Segment). To convert from 20-bit address, one would first divide it by 16 and place the quotient in the segment register and remainder in the offset register. This was represented as CS:IP (this means, CS is the segment and IP is the offset). Likewise, when an address is written SS:SP it means SS is the segment and SP is the offset.

The conversion also works in reverse. To create a 20-bit address, the processor takes the 16-bit value of a segment register, shifts it left by 4 bits (effectively multiplying it by 16), and adds the unmodified offset from another register, producing a full 20-bit address.

### Example

If CS = 0x258C and IP = 0x0012 (the "0x" prefix denotes hexadecimal notation), then CS:IP will point to a 20 bit address equivalent to "CS * 16 + IP" which will be = 0x258C * 0x10 + 0x0012 = 0x258C0 + 0x0012 = 0x258D2 (Remember: 16 decimal = 0x10). The 20-bit address is known as an absolute (or linear) address and the Segment:Offset representation (CS:IP) is known as a segmented address. This separation was necessary, as the register itself could not hold values that required more than 16 bits encoding. When programming in protected mode on a 32-bit or 64-bit processor, the registers are big enough to fill the address bus entirely, thus eliminating segmented addresses - only linear/logical addresses are generally used in this "flat addressing" mode, although the segment:offset architecture is still supported for backwards compatibility.
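
The arithmetic of the example can be checked with a short model. This is a sketch in Python, not assembly, and the function names `to_linear` and `to_segmented` are hypothetical; it only assumes the "segment times 16 plus offset" rule described above.

```python
def to_linear(segment, offset):
    """Real-mode translation: shift the segment left by 4 bits, add the offset."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


def to_segmented(linear):
    """One possible segment:offset pair for a 20-bit address:
    quotient and remainder of the division by 16."""
    return linear >> 4, linear & 0xF


print(hex(to_linear(0x258C, 0x0012)))  # 0x258d2
seg, off = to_segmented(0x258D2)
print(hex(seg), hex(off))              # 0x258d 0x2
```

Note that `to_segmented` returns 258D:0002 rather than the original 258C:0012; both pairs name the same linear address.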

It is important to note that there is not a one-to-one mapping of physical addresses to segmented addresses; for any physical address, there is more than one possible segmented address. For example: consider the segmented representations B000:8000 and B200:6000. Evaluated, they both map to physical address B8000 (B000:8000 = B000x10+8000 = B0000+8000 = B8000 and B200:6000 = B200x10+6000 = B2000+6000 = B8000). However, using an appropriate mapping scheme avoids this problem: such a map applies a linear transformation to the physical addresses to create precisely one segmented address for each. To reverse the translation, the map [f(x)] is simply inverted.

For example, if the segment portion is equal to the physical address divided by 0x10 and the offset is equal to the remainder, only one segmented address will be generated. (No offset will be greater than 0x0f.) Physical address B8000 maps to (B8000/10):(B8000%10) or B800:0. This segmented representation is given a special name: such addresses are said to be "normalized addresses".
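
The aliasing and its normalized form can be sketched as follows. This is a Python model under the real-mode rule given earlier; `linear` and `normalize` are illustrative names, not part of any assembler.

```python
def linear(segment, offset):
    """Physical address of a real-mode segment:offset pair."""
    return (segment << 4) + offset


def normalize(segment, offset):
    """Return the unique pair for the same address whose offset is at most 0xF."""
    address = linear(segment, offset)
    return address >> 4, address & 0xF


# Two different segmented addresses that name the same byte:
print(hex(linear(0xB000, 0x8000)), hex(linear(0xB200, 0x6000)))  # 0xb8000 0xb8000

# Both collapse to the same normalized address B800:0
print([hex(x) for x in normalize(0xB000, 0x8000)])  # ['0xb800', '0x0']
print([hex(x) for x in normalize(0xB200, 0x6000)])  # ['0xb800', '0x0']
```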

CS:IP (Code Segment: Instruction Pointer) represents the 20 bit address of the physical memory from where the next instruction for execution will be picked up. Likewise, SS:SP (Stack Segment: Stack Pointer) points to a 20 bit absolute address which will be treated as the top of the stack (the 8086 uses this for pushing and popping values).

### Protected Mode

As ugly as this may seem, it was in fact a step towards the protected addressing scheme used in later chips. The 80286 had a protected mode of operation, in which all 24 of its address lines were available, allowing for addressing of up to 16MB of memory. In protected mode, the CS, DS, ES, and SS registers were not segments but selectors, pointing into a table that provided information about the blocks of physical memory that the program was then using. In this mode, the pointer value CS:IP = 0x0010:2400 is used as follows:

The CS value 0x0010 is a selector: it picks out one entry, called a descriptor, in a descriptor table. This descriptor would have a 24-bit value to indicate the start of a memory block, a 16-bit value to indicate how long the block is, and flags to specify whether the block can be written, whether it is currently physically in memory, and other information. Let's say that the memory block pointed to actually starts at the 24-bit address 0x164400; the actual address referred to is then 0x164400 + 0x2400 = 0x166800. If the descriptor also includes information that the block is 0x2400 bytes long, the reference would be to the byte immediately following that block, which would cause an exception: the operating system should not allow a program to read memory that it does not own. And if the block is marked as read-only, which code segment memory should be so that programs don't overwrite themselves, an attempt to write to that address would similarly cause an exception.
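
The lookup and the two checks described above can be sketched as a small model. This is a deliberately simplified Python illustration, not the real 80286 table layout: the table is a plain dictionary keyed by the selector value, the length is stored as a byte count, and the names `Descriptor`, `ProtectionFault` and `translate` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    base: int              # 24-bit start address of the memory block
    length: int            # size of the block in bytes (simplified)
    writable: bool
    present: bool = True   # is the block currently in physical memory?


class ProtectionFault(Exception):
    """Stands in for the exception the processor would raise."""


def translate(table, selector, offset, write=False):
    """Turn selector:offset into a physical address, enforcing the checks."""
    entry = table[selector]
    if not entry.present:
        raise ProtectionFault("block not present")
    if offset >= entry.length:
        raise ProtectionFault("offset outside the block")
    if write and not entry.writable:
        raise ProtectionFault("block is read-only")
    return entry.base + offset


table = {0x0010: Descriptor(base=0x164400, length=0x4000, writable=False)}
print(hex(translate(table, 0x0010, 0x2400)))  # 0x166800

# The same reference faults if the block is only 0x2400 bytes long.
short = {0x0010: Descriptor(base=0x164400, length=0x2400, writable=False)}
try:
    translate(short, 0x0010, 0x2400)
except ProtectionFault as fault:
    print("fault:", fault)  # fault: offset outside the block
```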

With IP being expanded to 32 bits (as EIP) in the 386, this scheme became unnecessary for reaching memory; with a selector whose block starts at physical address 0x00000000, a 32-bit offset could address up to 4GB of memory. However, selectors are still used to protect memory from rogue programs. If a program in Windows tries to read or write memory that it doesn't own, for instance, it will violate the rules set by the selectors, triggering an exception, and Windows will shut it down with the "General protection fault" message.

### 32-bit registers

With the chips beginning to support a 32-bit data bus, the registers were widened to 32 bits as well. The names for the 32-bit registers are simply the 16-bit names with an 'E' prepended.

EAX, EBX, ECX, EDX
These are the 32-bit versions of the registers shown above.

### 64-bit registers

The names of the 64-bit registers are the same as those of the 16-bit registers, but with an 'R' prepended.

RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP
These are the 64-bit versions of the registers shown above.
RIP
This is the full instruction pointer and should be used instead of EIP (which will be inaccurate if the address space is larger than 4 GiB, which may happen even with 4 GiB or less of RAM).
R8~15
These are new extra registers for 64-bit. They are counted as if the registers above are registers zero through seven, inclusively, rather than one through eight.

### 128-bit registers

64-bit x86 includes SSE2 (an extension to 32-bit x86), which provides 128-bit registers for specific instructions.

XMM0~7
Available with SSE and newer.
XMM8~15
Available only in 64-bit mode.

Most CPUs made since 2011 also have AVX, a further extension that lengthens these registers to 256 bits.

## The A20 Gate Saga

As was said earlier, the 8086 processor had 20 address lines (from A0 to A19), so the total memory addressable by it was 1 MB (or "2 to the power 20"). But since it had only 16-bit registers, a single register could not address more than 64 KB (or 2 to the power 16) of memory, so Intel came up with the segment:offset scheme. This made it possible for a program to access the whole of 1 MB of memory.

But with this segmentation scheme also came a side effect. Not only could your code refer to the whole of 1 MB with this scheme, but actually a little more than that. Let's see how...

Let's keep in mind how we convert from a Segment:Offset representation to Linear 20 bit representation.

The conversion:

```
   Segment:Offset = Segment x 16 + Offset
```

Now to see the maximum amount of memory that can be addressed, let's fill in both Segment and Offset to their maximum values and then convert that value to its 20-bit absolute physical address.

So, max value for Segment = FFFF and max value for Offset = FFFF

Now, let's convert FFFF:FFFF into its linear address, bearing in mind that 16 (decimal) is represented as 10h in hexadecimal:

So we get, FFFF:FFFF -> FFFF x 10h + FFFF = FFFF0 + FFFF = 10FFEF

• Note: FFFF0 is equal to 1 MB minus 16 bytes and FFFF is equal to 64 KB minus 1 byte, so the highest address is 10FFEF = 1 MB + 64 KB - 17. Counting address 0 as well, that gives 10FFF0 (hexadecimal) addressable bytes, which is 1 MB + 64 KB - 16 bytes.

Moral of the story: From Real mode a program can actually refer to (1 MB + 64 KB - 16) bytes of memory.

Notice the use of the word "refer" and not "access". A program can refer to this much memory, but whether it can access it depends on the number of address lines actually present. With the 8086 this was definitely not possible: when programs made references to memory at 1 MB and above, the address that was put on the address lines was actually more than 20 bits wide, and this resulted in the addresses wrapping around.

For example, if a code is referring to 1 MB, this will get wrapped around and point to location 0 in memory, likewise 1 MB + 1 will wrap around to address 1 (or 0000:0001).
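
The wrap-around can be modelled by masking the computed address to 20 bits. The sketch below is Python rather than assembly, and `physical_address` with its `a20_enabled` parameter is a hypothetical helper: with the flag off it behaves like an 8086 with only lines A0 to A19, with the flag on it keeps the extra address bit.

```python
def physical_address(segment, offset, a20_enabled=False):
    """Address placed on the bus for a real-mode segment:offset reference."""
    address = (segment << 4) + offset  # can reach 0x10FFEF
    if not a20_enabled:
        address &= 0xFFFFF             # only 20 address lines: bit 20 is lost
    return address


print(hex(physical_address(0xFFFF, 0x0010)))        # 0x0  (1 MB wraps to 0)
print(hex(physical_address(0xFFFF, 0x0011)))        # 0x1  (1 MB + 1 wraps to 1)
print(hex(physical_address(0xFFFF, 0x0010, True)))  # 0x100000
print(hex(physical_address(0xFFFF, 0xFFFF, True)))  # 0x10ffef
```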

Now there were some super funky programmers around that time who exploited this feature in their code (the fact that addresses wrap around) to make it a little faster and a few bytes shorter. Using this technique it was possible for them to access 32 KB of the top memory area (that is, the 32 KB touching the 1 MB boundary) and 32 KB of the bottom memory area, without actually reloading their segment registers!

Simple maths you see, if in Segment:Offset representation you make Segment constant, then since Offset is a 16-bit value therefore you can roam around in a 64 KB (or 2 to the power 16) area of memory. Now if you make your segment register point to 32 KB below the 1 MB mark you can access 32 KB upwards to touch 1 MB boundary and then 32 KB further which will ultimately get wrapped to the bottom most 32 KB.

Now these super funky programmers overlooked the fact that processors with more address lines would be created. (Note: Bill Gates has been attributed with saying, "Who would need more than 640 KB memory?"; these programmers were probably thinking similarly.) In 1982, just 4 years after the 8086, Intel released the 80286 processor with 24 address lines. Though it was theoretically backward compatible with legacy 8086 programs, since it also supported Real Mode, many 8086 programs did not function correctly because they depended on out-of-bounds addresses getting wrapped around to lower memory segments. So for the sake of compatibility IBM engineers routed the A20 address line (the 8086 had lines A0 - A19) through the keyboard controller and provided a mechanism to enable/disable the A20 compatibility mode. Now if you are wondering why the keyboard controller, the answer is that it had an unused pin. Since the 80286 was marketed as fully compatible with the 8086, customers who upgraded would have been furious if it were not bug-for-bug compatible, so that code designed for the 8086 would operate just as well on the 80286, only faster.

## 32-Bit Addressing

32-bit addresses can cover memory up to 4 GB in size. This means that we don't need segment:offset addresses on 32-bit processors. Instead, we use what is called the "flat addressing" scheme, where the address in the register points directly to a memory location. The segment registers are used to define different segments, so that programs don't try to execute the stack section, and they don't try to perform stack operations on the data section accidentally.

# X86 Instructions

These pages will discuss, in detail, the different instructions available in the basic x86 instruction set. For ease, and to decrease the page size, the different instructions will be broken up into groups, and discussed individually.

For more info, see the resources section.

## Conventions

The following template will be used for instructions that take no operands:

Instr

The following template will be used for instructions that take 1 operand:

Instr arg

The following template will be used for instructions that take 2 operands. Notice how the format of the instruction is different for different assemblers.

| Syntax | Template |
| --- | --- |
| GAS | `Instr src, dest` |
| Intel | `Instr dest, src` |

The following template will be used for instructions that take 3 operands. Notice how the format of the instruction is different for different assemblers.

| Syntax | Template |
| --- | --- |
| GAS | `Instr aux, src, dest` |
| Intel | `Instr dest, src, aux` |

### Suffixes

Some instructions, especially when written in GAS (AT&T) syntax as is common on non-Windows platforms (Unix, Linux, etc.), require the use of suffixes to specify the size of the data which will be the subject of the operation. Some possible suffixes are:

• b (byte) = 8 bits
• w (word) = 16 bits
• l (long) = 32 bits
• q (quad) = 64 bits

An example of the usage with the mov instruction on a 32-bit architecture, GAS syntax:

```
    movl $0x000F, %eax          # Store the value F into the eax register
```

In Intel syntax you don't have to use the suffix. Based on the register name and the immediate value used, the assembler knows which data size to use.

```
    MOV EAX, 0x000F
```

# Data Transfer

Some of the most important and most frequently used instructions are those that move data. Without them, there would be no way for registers or memory to even have anything in them to operate on.

# Data transfer instructions

## Move

| Syntax | Template |
| --- | --- |
| GAS | `mov src, dest` |
| Intel | `mov dest, src` |

Move

The `mov` instruction copies the `src` operand into the `dest` operand.

Operands

src

• Immediate
• Register
• Memory

dest

• Register
• Memory

Modified flags

• No FLAGS are modified by this instruction

Example

```
.data

value:
    .long   2

.text
.global _start

_start:
    movl    $6, %eax
    # %eax is now 6

    movw    %ax, value
    # value is now 6

    movl    $0, %ebx
    # %ebx is now 0

    movb    %al, %bl
    # %ebx is now 6

    movl    value, %ebx
    # %ebx is now 6

    movl    $value, %esi
    # %esi is now the address of value

    xorl    %ebx, %ebx
    # %ebx is now 0

    movw    value(, %ebx, 1), %bx
    # %ebx is now 6

    # Linux sys_exit
    mov     $1, %eax
    xorl    %ebx, %ebx
    int     $0x80
```

## Data swap

| Syntax | Template |
| --- | --- |
| GAS | `xchg src, dest` |
| Intel | `xchg dest, src` |

Exchange.

The `xchg` instruction swaps the `src` operand with the `dest` operand. It's like doing three move operations: from dest to a temporary (another register), then from src to dest, then from the temporary to src, except that no register needs to be reserved for temporary storage.

If one of the operands is a memory address, then the operation has an implicit `LOCK` prefix, that is, the exchange operation is atomic. This can have a large performance penalty.

It's also worth noting that the common `NOP` (no op) instruction, `0x90`, is the opcode for `xchgl %eax, %eax`.

Operands

src

• Register
• Memory

dest

• Register
• Memory

However, only one operand can be in memory: the other must be a register.

Modified flags

• No FLAGS are modified by this instruction

Example

```
.data

value:
    .long   2

.text
.global _start

_start:
    movl    $54, %ebx
    xorl    %eax, %eax

    xchgl   value, %ebx
    # %ebx is now 2
    # value is now 54

    xchgw   %ax, value
    # value is now 0
    # %eax is now 54

    xchgb   %al, %bl
    # %ebx is now 54
    # %eax is now 2

    xchgw   value(%eax), %ax
    # value is now 0x00020000 = 131072
    # %eax is now 0

    # Linux sys_exit
    mov     $1, %eax
    xorl    %ebx, %ebx
    int     $0x80
```

| Syntax | Template |
| --- | --- |
| GAS | `cmpxchg arg2, arg1` |
| Intel | `cmpxchg arg1, arg2` |

Compare and exchange.

The `cmpxchg` instruction has two implicit operands `AL/AX/EAX`(depending on the size of `arg1`) and `ZF`(zero) flag. The instruction compares `arg1` to `AL/AX/EAX` and if they are equal sets `arg1` to `arg2` and sets the zero flag, otherwise it sets `AL/AX/EAX` to `arg1` and clears the zero flag. Unlike `xchg` there is not an implicit `lock` prefix and if the instruction is required to be atomic then `lock` must be prefixed.

Operands

arg1

• Register
• Memory

arg2

• Register

Modified flags

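• `ZF` is set if the compared values were equal and cleared otherwise; `CF`, `PF`, `AF`, `SF` and `OF` are set according to the result of the comparison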
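
The compare-and-exchange rule described above can be written out as a tiny model. This Python sketch only captures the data movement and the zero flag; it does not model atomicity or the `lock` prefix, and the function name and tuple layout are assumptions made for the illustration.

```python
def cmpxchg(arg1, arg2, accumulator):
    """Model `cmpxchg arg1, arg2` (Intel operand order).

    Returns (new_arg1, new_accumulator, zf).
    """
    if arg1 == accumulator:
        return arg2, accumulator, 1  # equal: arg1 <- arg2, ZF set
    return arg1, arg1, 0             # different: accumulator <- arg1, ZF cleared


# A free lock (0) is taken by writing 1 while the accumulator holds 0.
print(cmpxchg(0, 1, 0))  # (1, 0, 1)

# A lock that is already held (1) is left alone; the accumulator receives 1.
print(cmpxchg(1, 1, 0))  # (1, 1, 0)
```

The second case is why a spin lock can test the accumulator after the instruction: a non-zero value there means someone else already owned the lock.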

Example

The following example shows how to use the cmpxchg instruction to create a spin lock which will be used to protect the `result` variable. The last thread to grab the spin lock will get to set the final value of `result`:

```
global main

extern printf
extern pthread_create
extern pthread_exit
extern pthread_join

section .data
align 4
sLock:		dd 0	; The lock, values are:
; 0	unlocked
; 1	locked
tID1:		dd 0
tID2:		dd 0
fmtStr1:	db "In thread %d with ID: %02x", 0x0A, 0
fmtStr2:	db "Result %d", 0x0A, 0

section .bss
align 4
result:		resd 1

section .text
main:			; Using main since we are using gcc to link

;
; Call pthread_create(pthread_t *thread, const pthread_attr_t *attr,
;			void *(*start_routine) (void *), void *arg);
;
push	dword 0		; Arg Four: argument pointer
push	thread1		; Arg Three: Address of routine
push	dword 0		; Arg Two: Attributes
push	tID1		; Arg One: pointer to the thread ID
call	pthread_create

push	dword 0		; Arg Four: argument pointer
push	thread2		; Arg Three: Address of routine
push	dword 0		; Arg Two: Attributes
push	tID2		; Arg One: pointer to the thread ID
call	pthread_create

;
; Call int pthread_join(pthread_t thread, void **retval) ;
;
push	dword 0		; Arg Two: retval
push	dword [tID1]	; Arg One: Thread ID to wait on
call	pthread_join
push	dword 0		; Arg Two: retval
push	dword [tID2]	; Arg One: Thread ID to wait on
call	pthread_join

push	dword [result]
push	dword fmtStr2
call	printf
add	esp, 8		; Pop stack 2 times 4 bytes

call exit

thread1:
pause
push	dword [tID1]
push	dword 1
push	dword fmtStr1
call	printf
add	esp, 12		; Pop stack 3 times 4 bytes

call	spinLock

mov	[result], dword 1
call	spinUnlock

push	dword 0		; Arg one: retval
call	pthread_exit

thread2:
pause
push	dword [tID2]
push	dword 2
push	dword fmtStr1
call	printf
add	esp, 12		; Pop stack 3 times 4 bytes

call	spinLock

mov	[result], dword 2
call	spinUnlock

push	dword 0		; Arg one: retval
call	pthread_exit

spinLock:
push	ebp
mov	ebp, esp
mov	edx, 1		; Value to set sLock to
spin:	mov	eax, [sLock]	; Check sLock
test	eax, eax	; If it was zero, maybe we have the lock
jnz	spin		; If not try again
;
; Attempt atomic compare and exchange:
; if (sLock == eax):
;	sLock		<- edx
;	zero flag	<- 1
; else:
;	eax		<- sLock
;	zero flag	<- 0
;
; If sLock is still zero then it will have the same value as eax and
; sLock will be set to edx which is one and therefore we acquire the
; lock. If the lock was acquired between the first test and the
; cmpxchg then eax will not be zero and we will spin again.
;
lock	cmpxchg [sLock], edx
test	eax, eax
jnz	spin
pop	ebp
ret

spinUnlock:
push	ebp
mov	ebp, esp
mov	eax, 0
xchg	eax, [sLock]
pop	ebp
ret

exit:
;
; Call exit(3) syscall
;	void exit(int status)
;
mov	ebx, 0		; Arg one: the status
mov	eax, 1		; Syscall number:
int 	0x80
```

In order to assemble, link and run the program we need to do the following:

```
$ nasm -felf32 -g cmpxchgSpinLock.asm
$ gcc -o cmpxchgSpinLock cmpxchgSpinLock.o -lpthread
$ ./cmpxchgSpinLock
```

## Move with zero extend

| Syntax | Template |
| --- | --- |
| GAS | `movz src, dest` |
| Intel | `movzx dest, src` |

Move zero extend

The `movz` instruction copies the `src` operand into the `dest` operand and pads the remaining bits not provided by `src` with zeros (0).

This instruction is useful for copying a small, unsigned value to a bigger register.

Operands

src

• Register
• Memory

dest

• Register

Modified flags

• No FLAGS are modified by this instruction

Example

```
.data

byteval:
    .byte   204

.text
.global _start

_start:
    movzbw  byteval, %ax
    # %ax is now 204 (the upper half of %eax is unchanged)

    movzwl  %ax, %ebx
    # %ebx is now 204

    movzbl  byteval, %esi
    # %esi is now 204

    # Linux sys_exit
    mov     $1, %eax
    xorl    %ebx, %ebx
    int     $0x80
```

## Sign Extend

| Syntax | Template |
| --- | --- |
| GAS | `movs src, dest` |
| Intel | `movsx dest, src` |

Move sign extend.

The `movs` instruction copies the `src` operand into the `dest` operand and pads the remaining bits not provided by `src` with the sign bit (the MSB) of `src`.

This instruction is useful for copying a signed small value to a bigger register.

Operands

src

• Register
• Memory

dest

• Register

Modified flags

• No FLAGS are modified by this instruction

Example

```
.data

byteval:
    .byte   -24 # = 0xe8

.text
.global _start

_start:
    movsbw  byteval, %ax
    # %ax is now -24 = 0xffe8

    movswl  %ax, %ebx
    # %ebx is now -24 = 0xffffffe8

    movsbl  byteval, %esi
    # %esi is now -24 = 0xffffffe8

    # Linux sys_exit
    mov     $1, %eax
    xorl    %ebx, %ebx
    int     $0x80
```
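
The difference between zero extension and sign extension comes down to what fills the new upper bits. The following Python sketch models both on plain integers; `zero_extend` and `sign_extend` are illustrative helper names, and the values are the ones used in the `movz` and `movs` examples.

```python
def zero_extend(value, from_bits):
    """Keep the low from_bits bits; every higher bit becomes 0."""
    return value & ((1 << from_bits) - 1)


def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit of a from_bits value into all bits up to to_bits."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    if value >> (from_bits - 1):                    # sign bit is set
        value |= ((1 << to_bits) - 1) & ~low_mask   # fill the upper bits with 1s
    return value


print(zero_extend(204, 8))             # 204
print(hex(sign_extend(0xE8, 8, 16)))   # 0xffe8      (-24 as a 16-bit value)
print(hex(sign_extend(0xE8, 8, 32)))   # 0xffffffe8  (-24 as a 32-bit value)
print(hex(sign_extend(0x18, 8, 32)))   # 0x18        (positive values are unchanged)
```

Note that 204 and -52 share the byte 0xCC; which extension is correct depends on whether the program treats the byte as unsigned or signed.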

## Move String

movsb

Move byte

The `movsb` instruction copies one byte from the memory location specified in `esi` to the location specified in `edi`. If the direction flag is cleared, then `esi` and `edi` are incremented after the operation. Otherwise, if the direction flag is set, then the pointers are decremented. When the instruction is repeated with the `rep` prefix, it runs until `ecx` is zero; with the direction flag set, the copy then happens in the reverse direction, starting at the highest address and moving toward lower addresses.

Operands

None.

Modified flags

• No FLAGS are modified by this instruction

Example

```
section .text
; copy mystr into mystr2
mov esi, mystr    ; loads address of mystr into esi
mov edi, mystr2   ; loads address of mystr2 into edi
cld               ; clear direction flag (forward)
mov ecx,6
rep movsb         ; copy six times

section .bss
mystr2: resb 6

section .data
mystr db "Hello", 0x0
```
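
What `rep movsb` does to the registers and to memory can be sketched step by step. The model below is Python, with memory as a `bytearray` and the registers as plain integers; the function `rep_movsb` and its argument layout are assumptions made for this illustration.

```python
def rep_movsb(memory, esi, edi, ecx, direction_flag=0):
    """Copy ecx bytes inside memory; return the final (esi, edi, ecx)."""
    step = -1 if direction_flag else 1   # DF = 0 moves forward, DF = 1 backward
    while ecx != 0:
        memory[edi] = memory[esi]        # one movsb: copy a single byte
        esi += step
        edi += step
        ecx -= 1                         # rep decrements ecx after each copy
    return esi, edi, ecx


# mystr lives at offset 0, mystr2 at offset 6
memory = bytearray(b"Hello\x00" + bytes(6))
print(rep_movsb(memory, 0, 6, 6))  # (6, 12, 0)
print(bytes(memory[6:12]))         # b'Hello\x00'
```

With the direction flag set, the caller would start `esi` and `edi` at the last byte of each buffer instead of the first.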

movsw

Move word

The `movsw` instruction copies one word (two bytes) from the location specified in `esi` to the location specified in `edi`. It basically does the same thing as `movsb`, except with words instead of bytes.

Operands

None.

Modified flags

• No FLAGS are modified by this instruction

Example

```
section .text
; copy mystr into mystr2
mov esi, mystr
mov edi, mystr2
cld
mov ecx,4
rep movsw
; mystr2 is now AaBbCca\0

section .bss
mystr2: resb 8

section .data
mystr db "AaBbCca", 0x0
```

## Load Effective Address

| Syntax | Template |
| --- | --- |
| GAS | `lea src, dest` |
| Intel | `lea dest, src` |

Load Effective Address

The `lea` instruction calculates the address of the `src` operand and loads it into the `dest` operand.

Operands

src

• Memory

dest

• Register

Modified flags

• No FLAGS are modified by this instruction

Note Load Effective Address calculates its `src` operand in the same way as the `mov` instruction does, but rather than loading the contents of that address into the `dest` operand, it loads the address itself.

`lea` can be used not only for calculating addresses, but also for general-purpose unsigned integer arithmetic (with the caveat and possible advantage that FLAGS are unmodified). This can be quite powerful, since the `src` operand can take up to 4 parameters: base register, index register, scale factor and displacement, e.g. `[eax + edx*4 -4]` (Intel syntax) or `-4(%eax, %edx, 4)` (GAS syntax). The scale factor is limited to the constant values 1, 2, 4, or 8 for byte, word, double word or quad word offsets respectively. This by itself allows for multiplication of a general register by the constant values 2, 3, 4, 5, 8 and 9, as shown below (using NASM syntax):

```
lea ebx, [ebx*2]      ; Multiply ebx by 2
lea ebx, [ebx*8+ebx]  ; Multiply ebx by 9, which totals ebx*18
```
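
The address calculation that `lea` exposes is simply base + index * scale + displacement. The sketch below models it in Python for a 32-bit register; the function name `lea` and its keyword arguments are chosen for this illustration only.

```python
def lea(base=0, index=0, scale=1, disp=0):
    """Effective address base + index*scale + disp, truncated to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("the scale factor must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF


ebx = 7
print(lea(index=ebx, scale=2))                      # 14    [ebx*2]
print(lea(base=ebx, index=ebx, scale=2))            # 21    [ebx*2+ebx], times 3
print(lea(base=ebx, index=ebx, scale=8))            # 63    [ebx*8+ebx], times 9
print(lea(base=0x1000, index=3, scale=4, disp=-4))  # 4104  [eax + edx*4 - 4]
```

Multiplying by 3, 5 or 9 works because the same register can serve as both base and index.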

# Data transfer instructions of 8086 microprocessor

General purpose byte or word transfer instructions:

• MOV: copy byte or word from specified source to specified destination
• PUSH: copy specified word to top of stack.
• POP: copy word from top of stack to specified location
• PUSHA: copy all registers to stack (80186 and later)
• POPA: copy words from stack to all registers (80186 and later)
• XCHG: Exchange bytes or exchange words
• XLAT: translate a byte in AL using a table in memory.

These are I/O port transfer instructions:

• IN: copy a byte or word from specific port to accumulator
• OUT: copy a byte or word from accumulator to specific port

Special address transfer Instructions:

• LEA: load effective address of operand into specified register
• LDS: load DS register and other specified register from memory
• LES: load ES register and other specified register from memory

Flag transfer instructions:

• LAHF: load AH with the low byte of flag register
• SAHF: Stores AH register to low byte of flag register
• PUSHF: copy flag register to top of stack
• POPF: copy top of stack word to flag register

# Control Flow

Almost all programming languages have the ability to change the order in which statements are evaluated, and assembly is no exception. The instruction pointer (EIP) register contains the address of the next instruction to be executed. To change the flow of control, the programmer must be able to modify the value of EIP. This is where control flow instructions come in.

```
mov eip, label   ; wrong
jmp label        ; right
```

## Comparison Instructions

| Syntax | Template |
| --- | --- |
| GAS | `test arg1, arg2` |
| Intel | `test arg2, arg1` |

Performs a bit-wise logical AND on `arg1` and `arg2` the result of which we will refer to as `Temp` and sets the `ZF`(zero), `SF`(sign) and `PF`(parity) flags based on `Temp`. `Temp` is then discarded.

Operands

arg1

• Register
• Immediate

arg2

• `AL/AX/EAX` (only if arg1 is immediate)
• Register
• Memory

Modified flags

• `SF` = MostSignificantBit(`Temp`)
• If (`Temp` == 0) `ZF` = 1 else `ZF` = 0
• `PF` = BitWiseXorNor(`Temp`[7:0]) (only the low byte is considered)
• `CF` = 0
• `OF` = 0
• `AF` is undefined
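
The flag rules for `test` can be sketched as a small model. This is Python, not assembly; `flags_after_test` is a hypothetical helper that returns the flags as a dictionary, and it computes the parity flag from the low byte of the result only, which is how `PF` behaves on x86.

```python
def flags_after_test(arg1, arg2, bits=32):
    """Flags left behind by `test`: Temp = arg1 AND arg2, then Temp is discarded."""
    temp = arg1 & arg2 & ((1 << bits) - 1)
    return {
        "ZF": int(temp == 0),
        "SF": temp >> (bits - 1),                         # most significant bit
        "PF": int(bin(temp & 0xFF).count("1") % 2 == 0),  # even parity of low byte
        "CF": 0,
        "OF": 0,
    }


print(flags_after_test(0, 0))
# {'ZF': 1, 'SF': 0, 'PF': 1, 'CF': 0, 'OF': 0}
print(flags_after_test(0x80000001, 0xFFFFFFFF))
# {'ZF': 0, 'SF': 1, 'PF': 0, 'CF': 0, 'OF': 0}
```

Testing a register against itself, as in `test eax, eax`, therefore sets `ZF` exactly when the register is zero.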

| Syntax | Template |
| --- | --- |
| GAS | `cmp arg2, arg1` |
| Intel | `cmp arg1, arg2` |

Performs a comparison operation between `arg1` and `arg2`. The comparison is performed by a (signed) subtraction of `arg2` from `arg1`, the results of which can be called `Temp`. `Temp` is then discarded. If `arg2` is an immediate value it will be sign extended to the length of `arg1`. The `EFLAGS` register is set in the same manner as a `sub` instruction.

Note that the GAS/AT&T syntax can be rather confusing, as for example `cmp $0, %rax` followed by `jl branch` will branch if `%rax < 0` (and not the opposite as might be expected from the order of the operands).

Operands

arg1

• `AL/AX/EAX` (only if arg2 is immediate)
• Register
• Memory

arg2

• Register
• Immediate
• Memory

Modified flags

• `SF` = MostSignificantBit(`Temp`)
• If (`Temp` == 0) `ZF` = 1 else `ZF` = 0
• `PF` = BitWiseXorNor(`Temp`[7:0]) (only the low byte is considered)
• `CF`, `OF` and `AF` are set according to the result of the subtraction, as for `sub`

## Jump Instructions

The jump instructions allow the programmer to (indirectly) set the value of the EIP register. The location passed as the argument is usually a label. The first instruction executed after the jump is the instruction immediately following the label. All of the jump instructions, with the exception of `jmp`, are conditional jumps, meaning that program flow is diverted only if a condition is true. These instructions are often used after a comparison instruction (see above), but since many other instructions set flags, this order is not required.

See X86_Assembly/X86_Architecture#EFLAGS_Register for more information about the flags and their meaning.

### Unconditional Jumps

jmp loc

Loads EIP with the specified address (i.e. the next instruction executed will be the one specified by jmp).

### Jump if Equal

je loc

ZF = 1

Loads EIP with the specified address, if operands of previous CMP instruction are equal. For example:

```
mov ecx, 5
mov edx, 5
cmp ecx, edx
je equal
; if it did not jump to the label equal, then ecx and edx are not equal
equal:
; execution arrives here directly when ecx and edx are equal
```

### Jump if Not Equal

jne loc

ZF = 0

Loads EIP with the specified address, if operands of previous CMP instruction are not equal.

### Jump if Greater

jg loc

SF = OF and ZF = 0

Loads EIP with the specified address, if first operand of previous CMP instruction is greater than the second (performs signed comparison).

### Jump if Greater or Equal

jge loc

SF = OF

Loads EIP with the specified address, if first operand of previous CMP instruction is greater than or equal to the second (performs signed comparison).

### Jump if Greater (unsigned comparison)

ja loc

CF = 0 and ZF = 0

Loads EIP with the specified address, if first operand of previous CMP instruction is greater than the second. `ja` is the same as `jg`, except that it performs an unsigned comparison.

### Jump if Greater or Equal (unsigned comparison)

jae loc

CF = 0

Loads EIP with the specified address, if first operand of previous CMP instruction is greater than or equal to the second. `jae` is the same as `jge`, except that it performs an unsigned comparison.

### Jump if Lesser

jl loc

The criterion for `JL` is `SF <> OF`: EIP is loaded with the specified address if exactly one of `SF` and `OF` is set. `CMP` performs the same subtraction as `SUB` and then discards the result, so take as an example:

`arg1` - `arg2`

With respect to `SUB` and `CMP`, two cases fulfill this criterion:

1. `arg1 < arg2` and the operation does not overflow
2. `arg1 < arg2` and the operation overflows

In case 1) `SF` will be set but not `OF`, and in case 2) `OF` will be set but not `SF`, since the overflow flips the most significant bit to zero and thus prevents `SF` from being set. The `SF <> OF` criterion excludes the cases where:

1. `arg1 > arg2` and the operation does not overflow
2. `arg1 > arg2` and the operation overflows
3. `arg1 == arg2`

In case 1) neither `SF` nor `OF` is set, in case 2) `OF` will be set and `SF` will also be set, since the overflow flips the most significant bit to one, and in case 3) neither `SF` nor `OF` will be set.
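
The claim that `SF <> OF` identifies exactly the "less than" cases can be checked exhaustively on a small word size. The Python sketch below models the flags of an 8-bit `cmp arg1, arg2` (Intel operand order, computing `arg1 - arg2`); the function names are illustrative, and 8 bits are used only to keep the check small.

```python
def cmp_flags(arg1, arg2, bits=8):
    """Flags after computing arg1 - arg2 on bits-wide operands."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a, b = arg1 & mask, arg2 & mask
    temp = (a - b) & mask
    return {
        "ZF": int(temp == 0),
        "SF": int(bool(temp & sign)),
        "CF": int(a < b),  # unsigned borrow
        # signed overflow: operands differ in sign and the result's sign differs from arg1
        "OF": int(bool((a ^ b) & (a ^ temp) & sign)),
    }


def jl_taken(arg1, arg2):
    flags = cmp_flags(arg1, arg2)
    return flags["SF"] != flags["OF"]


print(cmp_flags(-128, 1))  # {'ZF': 0, 'SF': 0, 'CF': 0, 'OF': 1}  less, with overflow
print(cmp_flags(127, -1))  # {'ZF': 0, 'SF': 1, 'CF': 1, 'OF': 1}  greater, with overflow

# Every pair of signed 8-bit values agrees with the ordinary comparison.
print(all(jl_taken(x, y) == (x < y)
          for x in range(-128, 128) for y in range(-128, 128)))  # True
```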

### Jump if Less or Equal

jle loc

`SF <> OF` or `ZF = 1`.

Loads EIP with the specified address, if the first operand of the previous CMP instruction is less than or equal to the second. See the `JL` section for a more detailed description of the criteria.

### Jump if Lesser (unsigned comparison)

jb loc

CF = 1

Loads EIP with the specified address, if the first operand of the previous CMP instruction is less than the second. `jb` is the same as `jl`, except that it performs an unsigned comparison.

### Jump if Lesser or Equal (unsigned comparison)

jbe loc

CF = 1 or ZF = 1

Loads EIP with the specified address, if the first operand of the previous CMP instruction is less than or equal to the second. `jbe` is the same as `jle`, except that it performs an unsigned comparison.
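The unsigned jumps only look at `CF` and `ZF`. The following C sketch models that; the names `ucmp32` and `ja_taken` through `jbe_taken` are hypothetical, and `a` stands for the operand that is subtracted from (so the comparison computes `a - b`).

```c
#include <stdbool.h>
#include <stdint.h>

/* The two flags the unsigned conditional jumps depend on. */
typedef struct {
    bool cf; /* carry flag: set when a - b needs a borrow, i.e. a < b unsigned */
    bool zf; /* zero flag: set when a == b */
} ucmp_flags;

static ucmp_flags ucmp32(uint32_t a, uint32_t b)
{
    ucmp_flags f;
    f.cf = a < b;
    f.zf = a == b;
    return f;
}

static bool ja_taken(ucmp_flags f)  { return !f.cf && !f.zf; } /* above          */
static bool jae_taken(ucmp_flags f) { return !f.cf; }          /* above or equal */
static bool jb_taken(ucmp_flags f)  { return f.cf; }           /* below          */
static bool jbe_taken(ucmp_flags f) { return f.cf || f.zf; }   /* below or equal */
```

The difference from the signed jumps shows up with a value such as `0xFFFFFFFF`: as an unsigned number it is above `1`, so `ja_taken(ucmp32(0xFFFFFFFF, 1))` is true, whereas a signed comparison would treat the same bit pattern as -1.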

### Jump if Overflow

jo loc

OF = 1

Loads EIP with the specified address, if the overflow bit is set on a previous arithmetic expression.

### Jump if Not Overflow

jno loc

OF = 0

Loads EIP with the specified address, if the overflow bit is not set on a previous arithmetic expression.

### Jump if Zero

jz loc

ZF = 1

Loads EIP with the specified address, if the zero bit is set from a previous arithmetic expression. `jz` is identical to `je`.

### Jump if Not Zero

jnz loc

ZF = 0

Loads EIP with the specified address, if the zero bit is not set from a previous arithmetic expression. `jnz` is identical to `jne`.

### Jump if Signed

js loc

SF = 1

Loads EIP with the specified address, if the sign bit is set from a previous arithmetic expression.

### Jump if Not Signed

jns loc

SF = 0

Loads EIP with the specified address, if the sign bit is not set from a previous arithmetic expression.

## Function Calls

call proc

Pushes the address of the next instruction (the return address) onto the top of the stack, and jumps to the specified location. This is used mostly for subroutines.

ret [val]

Loads the next value on the stack into EIP, and then pops the specified number of bytes off the stack. If val is not supplied, the instruction will not pop any values off the stack after returning.

## Loop Instructions

loop arg

The `loop` instruction decrements ECX and jumps to the address specified by `arg` unless decrementing ECX caused its value to become zero. For example:

```
mov ecx, 5
start_loop:
; the code here would be executed 5 times
loop start_loop
```

`loop` does not set any flags.
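As a rough C model of the control flow (the function name `loop_iterations` is made up for this sketch), `loop` behaves like the bottom of a `do ... while` loop that decrements the counter before testing it:

```c
#include <stdint.h>

/* Counts how often the body between `start_loop:` and `loop start_loop`
 * runs when ECX holds initial_ecx on entry. */
static uint64_t loop_iterations(uint32_t initial_ecx)
{
    uint32_t ecx = initial_ecx;
    uint64_t runs = 0;
    do {
        runs++;          /* the loop body executes */
        ecx--;           /* LOOP decrements ECX first (0 wraps to 0xFFFFFFFF) */
    } while (ecx != 0);  /* and jumps back unless ECX has just become zero */
    return runs;
}
```

One consequence of decrementing before testing: if ECX is 0 on entry, the body does not run zero times but 2^32 times, because the first decrement wraps around to 0xFFFFFFFF.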

loopx arg

These loop instructions decrement ECX and jump to the address specified by `arg` if their condition is satisfied (that is, a specific flag is set), unless decrementing ECX caused its value to become zero.

• `loope` loop if equal (`ZF = 1`)
• `loopne` loop if not equal (`ZF = 0`)
• `loopnz` loop if not zero (same instruction as `loopne`)
• `loopz` loop if zero (same instruction as `loope`)

## Enter and Leave

enter arg

Creates a stack frame with the specified amount of space allocated on the stack.

leave

Destroys the current stack frame, and restores the previous frame. Using Intel syntax this is equivalent to:

```
mov esp, ebp
pop ebp
```

This will set `EBP` and `ESP` to their respective value before the function prologue began therefore reversing any modification to the stack that took place during the prologue.

## Other Control Instructions

hlt

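Halts the processor. Execution resumes when the next hardware interrupt arrives. If IF is cleared, maskable interrupts are ignored, so only a non-maskable event such as an NMI or a reset can wake the processor.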

nop

No operation. This instruction doesn't do anything, but wastes an instruction cycle in the processor. This instruction is often represented as an XCHG operation with the operands EAX and EAX.

lock

Asserts #LOCK prefix on next instruction.

wait

Waits for the FPU to finish its last calculation.

# Arithmetic

## Arithmetic instructions

Arithmetic instructions take two operands: a destination and a source. The destination must be a register or a memory location. The source may be either a memory location, a register, or a constant value. Note that at most one of the two may be a memory location, because operations may not use memory for both the source and the destination.

 add src, dest GAS Syntax add dest, src Intel syntax

This adds `src` to `dest`. If you are using the MASM syntax, then the result is stored in the first argument, if you are using the GAS syntax, it is stored in the second argument.

 sub src, dest GAS Syntax sub dest, src Intel syntax

Like ADD, only it subtracts source from destination instead. In C: dest -= src;

mul arg

This performs an unsigned multiplication of "arg" by the part of the accumulator that matches its size (AL, AX or EAX).

| operand size | 1 byte | 2 bytes | 4 bytes |
|---|---|---|---|
| other operand | AL | AX | EAX |
| higher part of result stored in: | AH | DX | EDX |
| lower part of result stored in: | AL | AX | EAX |

In the second case (2-byte operands), the result is split across DX:AX rather than placed in EAX, for backward compatibility with code written for older 16-bit processors.
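The register pairs in the table can be read as the two halves of a double-width product. A minimal C sketch of the 4-byte and 2-byte cases follows; the struct and function names (`edx_eax`, `mul32`, `dx_ax`, `mul16`) are invented for illustration.

```c
#include <stdint.h>

typedef struct { uint32_t edx; uint32_t eax; } edx_eax;
typedef struct { uint16_t dx;  uint16_t ax;  } dx_ax;

/* `mul arg` with a 4-byte operand: EDX:EAX = EAX * arg (unsigned). */
static edx_eax mul32(uint32_t eax, uint32_t arg)
{
    uint64_t product = (uint64_t)eax * arg;
    edx_eax r;
    r.edx = (uint32_t)(product >> 32); /* higher part of the result */
    r.eax = (uint32_t)product;         /* lower part of the result  */
    return r;
}

/* `mul arg` with a 2-byte operand: DX:AX = AX * arg (unsigned). */
static dx_ax mul16(uint16_t ax, uint16_t arg)
{
    uint32_t product = (uint32_t)ax * arg;
    dx_ax r;
    r.dx = (uint16_t)(product >> 16);
    r.ax = (uint16_t)product;
    return r;
}
```

For example, multiplying `0xFFFFFFFF` by 2 gives the 64-bit value `0x1FFFFFFFE`, so the model returns `edx = 1` and `eax = 0xFFFFFFFE`.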

imul arg

As MUL, only signed. The IMUL instruction has the same format as MUL, but also accepts two other formats like so:

 imul src, dest GAS Syntax imul dest, src Intel syntax

This multiplies `src` by `dest`. If you are using the NASM syntax, then the result is stored in the first argument, if you are using the GAS syntax, it is stored in the second argument.

 imul aux, src, dest GAS Syntax imul dest, src, aux Intel syntax

This multiplies `src` by `aux` and places it into `dest`. If you are using the NASM syntax, then the result is stored in the first argument, if you are using the GAS syntax, it is stored in the third argument.

div arg

This divides the value in the dividend register(s) by "arg", see table below.

| divisor size | 1 byte | 2 bytes | 4 bytes |
|---|---|---|---|
| dividend | AX | DX:AX | EDX:EAX |
| remainder stored in: | AH | DX | EDX |
| quotient stored in: | AL | AX | EAX |

The colon (:) means concatenation. With divisor size 4, this means that EDX holds bits 32-63 and EAX holds bits 0-31 of the input number (with lower bit numbers being less significant, in this example).

As you typically have 32-bit input values for division, you usually need to clear EDX (for example with `xor edx, edx`) just before an unsigned `div`. The signed counterpart `idiv` instead needs EAX sign-extended into EDX, which is what CDQ does.

If the quotient does not fit into the quotient register, or if the divisor is zero, a divide error exception (#DE) is raised. All flags are in an undefined state after the operation.
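The 4-byte case, including the condition under which the quotient does not fit, can be sketched in C as follows. The names `div_result` and `div32` are hypothetical, and the `fault` field merely stands in for the processor raising an exception.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     fault; /* true where the real instruction would raise an exception */
    uint32_t eax;   /* quotient  */
    uint32_t edx;   /* remainder */
} div_result;

/* `div arg` with a 4-byte divisor: divides EDX:EAX by arg (unsigned). */
static div_result div32(uint32_t edx, uint32_t eax, uint32_t arg)
{
    div_result r = { true, eax, edx };      /* registers unchanged on a fault */
    if (arg == 0)
        return r;                           /* division by zero */
    uint64_t dividend = ((uint64_t)edx << 32) | eax;
    uint64_t quotient = dividend / arg;
    if (quotient > UINT32_MAX)
        return r;                           /* quotient does not fit in EAX */
    r.fault = false;
    r.eax = (uint32_t)quotient;
    r.edx = (uint32_t)(dividend % arg);
    return r;
}
```

The overflow case is easy to hit by accident: with a stale nonzero value in EDX, `div32(1, 0, 1)` tries to produce the quotient 2^32, which cannot be stored in EAX.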

idiv arg

As DIV, only signed.

neg arg

Arithmetically negates the argument (i.e. two's complement negation).

## Carry Arithmetic Instructions

 adc src, dest GAS Syntax adc dest, src Intel syntax

Add with carry. Adds `src` + `carry flag` to `dest`, storing result in `dest`. Usually follows a normal add instruction to deal with values twice as large as the size of the register. In the following example, source contains a 64-bit number which will be added to destination.

```
mov eax, [source] ; read low 32 bits
mov edx, [source+4] ; read high 32 bits
add [destination], eax ; add low 32 bits
adc [destination+4], edx ; add high 32 bits, plus carry
```

 sbb src, dest GAS Syntax sbb dest, src Intel syntax

Subtract with borrow. Subtracts `src` + `carry flag` from `dest`, storing result in `dest`. Usually follows a normal sub instruction to deal with values twice as large as the size of the register.
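To make the role of the carry flag concrete, here is a C sketch that performs 64-bit addition and subtraction using only 32-bit operations, mirroring the `add`/`adc` and `sub`/`sbb` pairs. The function names `add64_via_adc` and `sub64_via_sbb` are made up for this example.

```c
#include <stdint.h>

/* 64-bit addition built from two 32-bit additions, like add followed by adc. */
static uint64_t add64_via_adc(uint64_t dest, uint64_t src)
{
    uint32_t dest_lo = (uint32_t)dest, dest_hi = (uint32_t)(dest >> 32);
    uint32_t src_lo  = (uint32_t)src,  src_hi  = (uint32_t)(src >> 32);

    uint32_t lo = dest_lo + src_lo;       /* add: low halves, wraps modulo 2^32 */
    uint32_t cf = lo < src_lo;            /* CF = 1 if the low add carried out  */
    uint32_t hi = dest_hi + src_hi + cf;  /* adc: high halves plus the carry    */
    return ((uint64_t)hi << 32) | lo;
}

/* 64-bit subtraction built from two 32-bit subtractions, like sub then sbb. */
static uint64_t sub64_via_sbb(uint64_t dest, uint64_t src)
{
    uint32_t dest_lo = (uint32_t)dest, dest_hi = (uint32_t)(dest >> 32);
    uint32_t src_lo  = (uint32_t)src,  src_hi  = (uint32_t)(src >> 32);

    uint32_t cf = dest_lo < src_lo;       /* CF = 1 if the low sub borrows      */
    uint32_t lo = dest_lo - src_lo;       /* sub: low halves                    */
    uint32_t hi = dest_hi - src_hi - cf;  /* sbb: high halves minus the borrow  */
    return ((uint64_t)hi << 32) | lo;
}
```

Adding 1 to `0x00000000FFFFFFFF` shows the carry at work: the low half wraps to 0 and the carry bumps the high half to 1.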

## Increment and Decrement

inc arg

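Increments the register or memory operand in the argument by 1. INC has a shorter encoding than ADD arg, 1 and, unlike ADD, leaves the carry flag unchanged; on current processors the speed difference between the two is usually negligible.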

dec arg

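Decrements the register or memory operand in the argument by 1. DEC has a shorter encoding than SUB arg, 1 and, unlike SUB, leaves the carry flag unchanged; on current processors the speed difference between the two is usually negligible.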

## Pointer arithmetic

The `lea` instruction can be used for arithmetic, especially on pointers. See X86 Assembly/Data Transfer#Load Effective Address.

# Logic

## Logical instructions

The instructions on this page deal with bit-wise logical instructions. For more information about bit-wise logic, see Digital Circuits/Logic Operations.

 and src, dest GAS Syntax and dest, src Intel syntax

Performs a bit-wise AND of the two operands, and stores the result in dest. For example:

```
movl $0x1, %edx
movl $0x0, %ecx
andl %edx, %ecx
# here ecx would be 0 because 1 AND 0 = 0
```

 or src, dest GAS Syntax or dest, src Intel syntax

Performs a bit-wise OR of the two operands, and stores the result in dest. For example:

```
movl $0x1, %edx
movl $0x0, %ecx
orl  %edx, %ecx
# here ecx would be 1 because 1 OR 0 = 1
```

 xor src, dest GAS Syntax xor dest, src Intel syntax

Performs a bit-wise XOR of the two operands, and stores the result in dest. For example:

```
movl $0x1, %edx
movl $0x0, %ecx
xorl %edx, %ecx
# here ecx would be 1 because 1 XOR 0 = 1
```

not arg

Performs a bit-wise inversion of arg. For example:

```
movl $0x1, %edx
notl %edx
# here edx would be 0xFFFFFFFE because a bitwise NOT 0x00000001 = 0xFFFFFFFE
```

# Shift and Rotate

## Logical Shift Instructions

In a logical shift instruction (also referred to as unsigned shift), the bits that slide off the end disappear (except for the last, which goes into the carry flag), and the spaces are always filled with zeros. Logical shifts are best used with unsigned numbers.

 shr src, dest GAS Syntax shr dest, src Intel syntax

Logical shift `dest` to the right by `src` bits.

 shl src, dest GAS Syntax shl dest, src Intel syntax

Logical shift `dest` to the left by `src` bits.

Examples (GAS Syntax):

```
movw   $0xff00,%ax       # ax=1111.1111.0000.0000 (0xff00, unsigned 65280, signed -256)
shrw   $3,%ax            # ax=0001.1111.1110.0000 (0x1fe0, signed and unsigned 8160)
# (logical shifting unsigned numbers right by 3
#   is like integer division by 8)
shlw   $1,%ax            # ax=0011.1111.1100.0000 (0x3fc0, signed and unsigned 16320)
# (logical shifting unsigned numbers left by 1
#   is like multiplication by 2)
```
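The same 16-bit shifts, including the bit that lands in the carry flag, can be modelled in C. This is a sketch under two assumptions: the names `shift16`, `shr16` and `shl16` are invented here, and the count is restricted to 1 through 15 so that the model stays within the range where the result and carry are straightforward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t value; /* shifted result */
    bool     cf;    /* carry flag: the last bit shifted out */
} shift16;

/* Logical shift right: zeros enter at the top. */
static shift16 shr16(uint16_t v, unsigned count)
{
    assert(count >= 1 && count <= 15);
    shift16 r;
    r.cf = ((v >> (count - 1)) & 1u) != 0;
    r.value = (uint16_t)(v >> count);
    return r;
}

/* Logical shift left: zeros enter at the bottom. */
static shift16 shl16(uint16_t v, unsigned count)
{
    assert(count >= 1 && count <= 15);
    uint32_t wide = (uint32_t)v << count; /* keep the bits that fall off the top */
    shift16 r;
    r.cf = ((wide >> 16) & 1u) != 0;
    r.value = (uint16_t)wide;
    return r;
}
```

With the values from the example above, `shr16(0xff00, 3).value` is `0x1fe0` and `shl16(0x1fe0, 1).value` is `0x3fc0`.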

## Arithmetic Shift Instructions

In an arithmetic shift (also referred to as signed shift), like a logical shift, the bits that slide off the end disappear (except for the last, which goes into the carry flag). But in an arithmetic shift, the spaces are filled in such a way to preserve the sign of the number being slid. For this reason, arithmetic shifts are better suited for signed numbers in two's complement format.

 sar src, dest GAS Syntax sar dest, src Intel syntax

Arithmetic shift `dest` to the right by `src` bits. Spaces are filled with sign bit (to maintain sign of original value), which is the original highest bit.

 sal src, dest GAS Syntax sal dest, src Intel syntax

Arithmetic shift `dest` to the left by `src` bits. The bottom bits do not affect the sign, so the bottom bits are filled with zeros. This instruction is synonymous with SHL.

Examples (GAS Syntax):

```
movw   $0xff00,%ax       # ax=1111.1111.0000.0000 (0xff00, unsigned 65280, signed -256)
salw   $2,%ax            # ax=1111.1100.0000.0000 (0xfc00, unsigned 64512, signed -1024)
# (arithmetic shifting left by 2 is like multiplication by 4 for
#   negative numbers, but has an impact on positives with most
#   significant bit set (i.e. set bits shifted out))
sarw   $5,%ax            # ax=1111.1111.1110.0000 (0xffe0, unsigned 65504, signed -32)
# (arithmetic shifting right by 5 is like integer division by 32
#   for negative numbers)
```
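A portable C sketch of a 16-bit `sar` is given below. The name `sar16` is made up, and the count is limited to 1 through 15 for simplicity. The model works on the bit pattern explicitly because right-shifting a negative value with C's `>>` operator is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right: vacated top bits are filled with the sign bit. */
static int16_t sar16(int16_t value, unsigned count)
{
    assert(count >= 1 && count <= 15);
    uint16_t bits = (uint16_t)value;              /* two's complement pattern */
    uint16_t shifted = (uint16_t)(bits >> count); /* zero-filled shift first  */
    if (bits & 0x8000u)                           /* negative input: copy the */
        shifted |= (uint16_t)(0xFFFFu << (16 - count)); /* sign into the gap  */
    /* Convert the pattern back to a signed value without relying on
     * implementation-defined narrowing. */
    return (int16_t)((int)shifted - ((shifted & 0x8000u) ? 0x10000 : 0));
}
```

Note that the shift rounds toward negative infinity, so it matches integer division only when the value divides evenly: `sar16(-1024, 5)` is -32, but `sar16(-7, 1)` is -4, whereas `-7 / 2` in C (and `idiv`) gives -3.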

## Extended Shift Instructions

The names of the double precision shift operations are somewhat misleading, hence they are listed as extended shift instructions on this page.

They are available for use with 16- and 32-bit data entities (registers/memory locations). The `src` operand is always a register, the `dest` operand can be a register or memory location, the `cnt` operand is an immediate byte value or the CL register. In 64-bit mode it is possible to address 64-bit data as well.

 shld cnt, src, dest GAS Syntax shld dest, src, cnt Intel syntax

The operation performed by `shld` is to shift the most significant `cnt` bits out of `dest`, but instead of filling up the least significant bits with zeros, they are filled with the most significant `cnt` bits of `src`.

 shrd cnt, src, dest GAS Syntax shrd dest, src, cnt Intel syntax

Likewise, the `shrd` operation shifts the least significant `cnt` bits out of `dest`, and fills up the most significant `cnt` bits with the least significant bits of the `src` operand.

Intel's nomenclature is misleading, in that the shift does not operate on double the basic operand size (i.e. specifying 32-bit operands doesn't make it a 64-bit shift): the `src` operand always remains unchanged.

Also, Intel's manual[1] states that the results are undefined when `cnt` is greater than the operand size, but at least for 32- and 64-bit data sizes it has been observed that shift operations are performed by (`cnt mod n`), with n being the data size.

Examples (GAS Syntax):

```
xorw   %ax,%ax          # ax=0000.0000.0000.0000 (0x0000)
notw   %ax              # ax=1111.1111.1111.1111 (0xffff)
movw   $0x5500,%bx      # bx=0101.0101.0000.0000
shrdw  $4,%ax,%bx       # bx=1111.0101.0101.0000 (0xf550), ax is still 0xffff
shldw  $8,%bx,%ax       # ax=1111.1111.1111.0101 (0xfff5), bx is still 0xf550
```

Other examples (decimal digits are used instead of bits to explain the concept, so the count of 3 shifts by three digits):

```
# ax = 1234 5678
# bx = 8765 4321
shrd   $3, %ax, %bx     # ax = 1234 5678 bx = 6788 7654
```

```
# ax = 1234 5678
# bx = 8765 4321
shld   $3, %ax, %bx     # bx = 5432 1123 ax = 1234 5678
```
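In C, the two extended shifts on 16-bit operands can be sketched as a shift of `dest` combined with the bits borrowed from `src`. The names `shld16` and `shrd16` are hypothetical, and the count is restricted to 1 through 15, the range in which the behaviour is defined.

```c
#include <assert.h>
#include <stdint.h>

/* shld: shift dest left, filling the low bits from the top bits of src. */
static uint16_t shld16(uint16_t dest, uint16_t src, unsigned cnt)
{
    assert(cnt >= 1 && cnt <= 15);
    return (uint16_t)(((uint32_t)dest << cnt) | ((uint32_t)src >> (16 - cnt)));
}

/* shrd: shift dest right, filling the high bits from the low bits of src. */
static uint16_t shrd16(uint16_t dest, uint16_t src, unsigned cnt)
{
    assert(cnt >= 1 && cnt <= 15);
    return (uint16_t)(((uint32_t)dest >> cnt) | ((uint32_t)src << (16 - cnt)));
}
```

These reproduce the binary example above: `shrd16(0x5500, 0xffff, 4)` yields `0xf550`, and `shld16(0xffff, 0xf550, 8)` yields `0xfff5`. In both calls `src` is passed by value, reflecting that the instruction leaves it unchanged.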

## Rotate Instructions

In a rotate instruction, the bits that slide off the end of the register are fed back into the spaces.

 ror src, dest GAS Syntax ror dest, src Intel syntax

Rotate `dest` to the right by `src` bits.

 rol src, dest GAS Syntax rol dest, src Intel syntax

Rotate `dest` to the left by `src` bits.

## Rotate With Carry Instructions

Like with shifts, the rotate can use the carry bit as the "extra" bit that it shifts through.

 rcr src, dest GAS Syntax rcr dest, src Intel syntax

Rotate `dest` to the right by `src` bits with carry.

 rcl src, dest GAS Syntax rcl dest, src Intel syntax

Rotate `dest` to the left by `src` bits with carry.

### Number of arguments

Unless stated, these instructions can take either one or two arguments. If only one is supplied, it is assumed to be a register or memory location and the number of bits to shift/rotate is one (this may be dependent on the assembler in use, however). `shrl $1, %eax` is equivalent to `shrl %eax` (GAS syntax).
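The difference between a plain rotate and a rotate through carry is easiest to see on a small operand. The following C sketch uses 8-bit values; all names (`rol8`, `ror8`, `rot8`, `rcl8_by1`, `rcr8_by1`) are invented for illustration, and the carry variants model a rotation by one bit only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Plain rotates: bits leaving one end re-enter at the other end. */
static uint8_t rol8(uint8_t v, unsigned count)
{
    count %= 8;
    if (count == 0)
        return v;
    return (uint8_t)((v << count) | (v >> (8 - count)));
}

static uint8_t ror8(uint8_t v, unsigned count)
{
    count %= 8;
    if (count == 0)
        return v;
    return (uint8_t)((v >> count) | (v << (8 - count)));
}

/* Rotates through carry treat CF as a ninth bit of the operand. */
typedef struct { uint8_t value; bool cf; } rot8;

static rot8 rcl8_by1(uint8_t v, bool cf_in)
{
    rot8 r;
    r.cf = (v & 0x80u) != 0;                          /* top bit moves into CF  */
    r.value = (uint8_t)((v << 1) | (cf_in ? 1u : 0u)); /* old CF enters at bit 0 */
    return r;
}

static rot8 rcr8_by1(uint8_t v, bool cf_in)
{
    rot8 r;
    r.cf = (v & 1u) != 0;                                /* bit 0 moves into CF    */
    r.value = (uint8_t)((v >> 1) | (cf_in ? 0x80u : 0u)); /* old CF enters at bit 7 */
    return r;
}
```

For `0x81` (1000.0001), `rol8` by one gives `0x03` because the top bit wraps around directly, while `rcl8_by1` with a clear carry gives `0x02` and leaves that top bit in CF instead.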

# Other Instructions

## Stack Instructions

push arg

This instruction decrements the stack pointer and stores the data specified as the argument into the location pointed to by the stack pointer.

pop arg

This instruction loads the data stored in the location pointed to by the stack pointer into the argument specified and then increments the stack pointer. For example:

```
mov eax, 5
mov ebx, 6
push eax        ; The stack is now: [5]
push ebx        ; The stack is now: [6] [5]
pop eax         ; The topmost item (which is 6) is now stored in eax. The stack is now: [5]
pop ebx         ; ebx is now equal to 5. The stack is now empty.
```
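The order of operations (decrement then store for `push`, load then increment for `pop`) can be sketched in C with a small byte array standing in for stack memory. Everything here is an assumption made for illustration: the `machine` struct, the 64-byte size and the function names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { STACK_BYTES = 64 };

typedef struct {
    uint8_t  memory[STACK_BYTES]; /* tiny stand-in for stack memory      */
    uint32_t esp;                 /* offset of the current top of stack  */
} machine;

static void machine_init(machine *m)
{
    memset(m->memory, 0, sizeof m->memory);
    m->esp = STACK_BYTES;         /* empty stack: ESP is just past the end */
}

static void push32(machine *m, uint32_t value)
{
    assert(m->esp >= 4);                   /* the model refuses to overflow */
    m->esp -= 4;                           /* decrement the stack pointer... */
    memcpy(&m->memory[m->esp], &value, 4); /* ...then store at the new top   */
}

static uint32_t pop32(machine *m)
{
    uint32_t value;
    assert(m->esp + 4 <= STACK_BYTES);     /* nothing to pop from an empty stack */
    memcpy(&value, &m->memory[m->esp], 4); /* load from the top...               */
    m->esp += 4;                           /* ...then increment the pointer      */
    return value;
}
```

Pushing 5 and then 6 moves `esp` from 64 down to 56; the two pops return 6 and then 5 and bring `esp` back to 64, which is the last-in, first-out behaviour shown in the example above.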

pushf

This instruction decrements the stack pointer and then loads the location pointed to by the stack pointer with the contents of the flag register.

popf

This instruction loads the flag register with the contents of the memory location pointed to by the stack pointer and then increments the contents of the stack pointer.

pusha

This instruction pushes all the general purpose registers onto the stack in the following order: AX, CX, DX, BX, SP, BP, SI, DI. The value of SP pushed is the value before the instruction is executed. It is useful for saving state before an operation that could potentially change these registers.

popa

This instruction pops all the general purpose registers off the stack in the reverse order of PUSHA. That is, DI, SI, BP, SP, BX, DX, CX, AX. The saved SP value is discarded rather than loaded into SP. Used to restore state after a PUSHA.

pushad

This instruction works similarly to pusha, but pushes the 32-bit general purpose registers onto the stack instead of their 16-bit counterparts.

popad

This instruction works similarly to popa, but pops the 32-bit general purpose registers off of the stack instead of their 16-bit counterparts.

## Flags instructions

While the flags register is used to report on results of executed instructions (overflow, carry, etc.), it also contains flags that affect the operation of the processor. These flags are set and cleared with special instructions.

### Interrupt Flag

The IF flag tells a processor if it should accept hardware interrupts. It should be kept set under normal execution. In fact, in protected mode, neither of these instructions can be executed by user-level programs.

sti

Sets the interrupt flag. If set, the processor can accept interrupts from peripheral hardware.

cli

Clears the interrupt flag. Hardware interrupts cannot interrupt execution. Programs can still generate interrupts, called software interrupts, and change the flow of execution. Non-maskable interrupts (NMI) cannot be blocked using this instruction.

### Direction Flag

The DF flag tells the processor which way to read data when using string instructions. That is, whether to decrement or increment the `esi` and `edi` registers after a `movs` instruction.

std

Sets the direction flag. Registers will decrement, reading backwards.

cld

Clears the direction flag. Registers will increment, reading forwards.

### Carry Flag

The CF flag is often modified after arithmetic instructions, but it can be set or cleared manually as well.

stc

Sets the carry flag.

clc

Clears the carry flag.

cmc

Complements (inverts) the carry flag.

### Other

sahf

Stores the content of AH register into the lower byte of the flag register.

lahf

Loads the AH register with the contents of the lower byte of the flag register.

## I/O Instructions

 in src, dest GAS Syntax in dest, src Intel syntax

The IN instruction almost always has the operands AX and DX (or EAX and EDX) associated with it. DX (src) frequently holds the port address to read, and AX (dest) receives the data from the port. In Protected Mode operating systems, the IN instruction is frequently locked, and normal users can't use it in their programs.

 out src, dest GAS Syntax out dest, src Intel syntax

The OUT instruction is very similar to the IN instruction. OUT outputs data from a given register (src) to a given output port (dest). In protected mode, the OUT instruction is frequently locked so normal users can't use it.

## System Instructions

These instructions were added with the Pentium II.

sysenter

This instruction causes the processor to enter protected system mode (supervisor mode or "kernel mode").

sysexit

This instruction causes the processor to leave protected system mode, and enter user mode.

## Misc Instructions

RDTSC

RDTSC was introduced with the Pentium processor. The instruction reads the number of clock cycles since reset and returns the value in EDX:EAX. This can be used to obtain low-overhead, high-resolution CPU timing. However, on modern CPU microarchitectures (multi-core, hyperthreading) and multi-CPU machines, the cycle counters are not guaranteed to be synchronized between cores and CPUs. The CPU frequency may also vary due to power saving or dynamic overclocking. The instruction may therefore be less reliable than when it was first introduced and should be used with care for performance measurements.

It is possible to use just the lower 32 bits of the result, but it should be noted that on a 600 MHz processor the register would overflow every 7.16 seconds:

$2^{32}\ \text{cycles} \times \frac{1\ \text{second}}{600{,}000{,}000\ \text{cycles}} \approx 7.16\ \text{seconds}$

Using the full 64 bits allows for 974.9 years between overflows:

$2^{64}\ \text{cycles} \times \frac{1\ \text{second}}{600{,}000{,}000\ \text{cycles}} \times \frac{1\ \text{year}}{86{,}400 \times 365\ \text{seconds}} \approx 974.9\ \text{years}$
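The two figures can be reproduced for any clock rate with a couple of lines of C. The function names are made up for this sketch, and a year is taken as 365 days, as in the calculation above.

```c
/* Seconds until a 32-bit cycle counter wraps at the given clock rate (Hz). */
static double wrap_seconds_32(double hz)
{
    return 4294967296.0 / hz;                 /* 2^32 cycles */
}

/* Years (of 365 days) until a 64-bit cycle counter wraps. */
static double wrap_years_64(double hz)
{
    double seconds = 18446744073709551616.0 / hz; /* 2^64 cycles */
    return seconds / (86400.0 * 365.0);
}
```

At 600 MHz these give roughly 7.16 seconds and 974.9 years. At a faster clock the 32-bit window shrinks proportionally, for example to about 1.4 seconds at 3 GHz.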

The following program (using NASM syntax) is an example of using RDTSC to measure the number of cycles a small block takes to execute:

```
global main

extern printf

section .data
align 4
a:	dd 10.0
b:	dd 5.0
c:	dd 2.0
fmtStr:	db "edx:eax = %llu edx = %d eax = %d", 0x0A, 0

section .bss
align 4
cycleLow:	resd 1
cycleHigh:	resd 1
result:		resd 1

section .text
main:			; Using main since we are using gcc to link

;
;	op	dst,  src
;
xor	eax, eax
cpuid
rdtsc
mov	[cycleLow], eax
mov	[cycleHigh], edx

;
; Do some work to be measured
;
fld	dword [a]
fld	dword [c]
fmulp	st1
fmulp	st1
fld	dword [b]
fld	dword [b]
fmulp	st1
faddp	st1
fsqrt
fstp	dword [result]
;
; Done work
;

cpuid
rdtsc
;
; break points so we can examine the values
; before we alter the data in edx:eax and
; before we print out the results.
;
break1:
sub	eax, [cycleLow]
sbb	edx, [cycleHigh]
break2:
push	eax
push	edx
push 	edx
push	eax
push	dword fmtStr
call	printf
add	esp, 20		; Pop stack 5 times 4 bytes

;
; Call exit(3) syscall
;	void exit(int status)
;
mov	ebx, 0		; Arg one: the status
mov	eax, 1		; Syscall number:
int 	0x80
```

In order to assemble, link and run the program we need to do the following:

```
$ nasm -felf -g rdtsc.asm -l rdtsc.lst
$ gcc -m32 -o rdtsc rdtsc.o
$ ./rdtsc
```

# X86 Interrupts

Interrupts are special routines that are defined on a per-system basis. This means that the interrupts on one system might be different from the interrupts on another system. Therefore, it is usually a bad idea to rely heavily on interrupts when you are writing code that needs to be portable.

## What is an Interrupt?

In modern operating systems, the programmer often doesn't need to use interrupts. In Windows, for example, the programmer conducts business with the Win32 API. However, these API calls interface with the kernel, and the kernel will often trigger interrupts to perform different tasks. In older operating systems (specifically DOS), the programmer didn't have an API to use, and so they had to do all their work through interrupts.

## Interrupt Instruction

int arg

This instruction issues the specified interrupt. For instance:

```
int 0x0A
```

Calls interrupt 10 (0x0A (hex) = 10 (decimal)).

## Types of Interrupts

There are 3 types of interrupts: Hardware Interrupts, Software Interrupts and Exceptions.

### Hardware Interrupts

Hardware interrupts are triggered by hardware devices. For instance, when you type on your keyboard, the keyboard triggers a hardware interrupt. The processor stops what it is doing, and executes the code that handles keyboard input (typically reading the key you pressed into a buffer in memory). Hardware interrupts are typically asynchronous - their occurrence is unrelated to the instructions being executed at the time they are raised.

### Software Interrupts

There is also a series of software interrupts that are usually used to transfer control to a function in the operating system kernel. Software interrupts are triggered by the instruction int. For example, the instruction "int 14h" triggers interrupt 0x14. The processor then stops the current program, and jumps to the code that handles interrupt 0x14. When interrupt handling is complete, the processor returns flow to the original program.

### Exceptions

Exceptions are caused by exceptional conditions in the code which is executing, for example an attempt to divide by zero or access a protected memory area. The processor will detect this problem, and transfer control to a handler to service the exception. This handler may re-execute the offending code after changing some value (for example, the zero divisor), or if this cannot be done, the program causing the exception may be terminated.

## Further Reading

A great list of interrupts for DOS and related systems is at Ralf Brown's Interrupt List.

# x86 Assemblers

There are a number of different assemblers available for x86 architectures. This page will list some of them, and will discuss where to get the assemblers, what they are good for, and where they are used the most.

## GNU Assembler (GAS)

The GNU assembler is most common as the assembly back-end to the GCC compiler. One of the most compelling reasons to learn to program GAS (as it is frequently abbreviated) is that inline assembly instructions (assembly code embedded in C source code) are written in GAS syntax by default when compiled by gcc. GAS uses the AT&T syntax for writing the assembly language, which some people claim is more complicated, but other people say is more informative.

## Microsoft Macro Assembler (MASM)

Microsoft's Macro Assembler, MASM, has been in constant production for many years. Many people claim that MASM isn't being supported or improved anymore, but Microsoft denies this: MASM is maintained, but is currently in a bug-fixing mode. No new features are currently being added. However, Microsoft is shipping a 64-bit version of MASM with new 64-bit compiler suites. MASM is available from Microsoft as part of Visual C++, as a download from MSDN, or as part of the Microsoft DDK. The latest available version of MASM is version 11.x (ref.: www.masm32.com).

MASM uses the Intel syntax for its instructions, which stands in stark contrast to the AT&T syntax used by the GAS assembler. Most notably, MASM instructions take their operands in reverse order from GAS. This one fact is perhaps the biggest stumbling block for people trying to transition between the two assemblers.

MASM also has a very powerful macro engine, which many programmers use to implement a high-level feel in MASM programs.