Microprocessor Design/Print Version
From Wikibooks, the open-content textbooks collection
This book serves as an introduction to the field of microprocessor design and implementation. It is intended for students in computer science or computer or electrical engineering who are in the third or fourth years of an undergraduate degree. While the focus of this book will be on Microprocessors, many of the concepts will apply to other ASIC design tasks as well.
The reader should have prior knowledge of Digital Circuits and possibly some background in Semiconductors, although the latter isn't strictly necessary. The reader should also know at least one Assembly Language. Knowledge of higher-level languages such as C or C++ may be useful as well, but is not required. Sections about soft-core design will require prior knowledge of Programmable Logic and of at least one HDL.
Introduction
About This Book
Computers and computer systems are a pervasive part of the modern world. Aside from the common desktop PC, there are a number of other types of specialized computer systems that pop up in many different places. The central component of these computers and computer systems is the microprocessor, or the CPU. The CPU (short for "Central Processing Unit") is essentially the brains behind the computer system: it is the component that "computes". This book is going to discuss what microprocessor units do, how they do it, and how they are designed.
This book is going to discuss the design of microprocessor units, but it will not discuss the design of complete computer systems nor the design of other computer components or peripherals. Some microprocessor designs will be implemented and synthesized in Hardware Description Languages, such as Verilog or VHDL. The book will be organized to discuss simple designs and concepts first, and expand the initial designs to include more complicated concepts as the book progresses.
This book will attempt to discuss the basic concepts and theory of microprocessor design from an abstract level, and give real-world examples as necessary. This book will not focus on studying any particular processor architecture, although several of the most common architectures will appear frequently in examples and notes.
How Will This Book Be Organized?
The first section of the book will review computer architecture, and will give a brief overview of the components of a computer, the components of a microprocessor, and some of the basic architectures of modern microprocessors.
The second section will discuss in some detail the individual components of a microcontroller, what they do, and how they are designed.
The third section will focus in on the ALU and FPU, and will discuss implementation of particular mathematical operations.
The fourth section will discuss the various design paradigms, ranging from the simplest single-cycle machine to more complicated, exotic architectures such as vector and VLIW machines.
Additional chapters will serve as extensions and support chapters for concepts discussed in the first four sections.
Prerequisites
This book will rely on some important background information that is currently covered in a number of other local wikibooks. Readers of this book will find the following prerequisites important to understand the material in this book:
All readers must be familiar with binary numbers and also hexadecimal numbers. These notations will be used throughout the book without any prior explanation. Readers of this book should be familiar with at least one assembly language, and should also be familiar with a hardware description language. This book will use both types of languages in the main narrative of the text without offering explanation beforehand. Appendices might be included that contain primers on this material.
Readers of this book will also find some pieces of software helpful in examples. Specifically, assemblers and assembly language simulators will help with many of the examples. Likewise, HDL compilers and simulators will be useful in the design examples. If free versions of these software programs can be found, links will be added in an appendix.
Who Is This Book For?
This book is designed to accompany an advanced undergraduate or graduate study in the field of microprocessor design. Students in the areas of Electrical Engineering, Computer Engineering, or Computer Science will likely find this book to be the most useful. The basic subjects in this field will be covered, and more advanced topics will be included depending on the proficiencies of the authors. Many of the topics considered in this book will apply to the design of many different types of digital hardware, including ASICs. However, the main narrative of the book, and the ultimate goals of the book will be focused on microcontrollers and microprocessors, not other ASICs.
What This Book Will Not Cover
This book will not cover the following topics in any detail, although some mention might be made of them as a matter of interest. For each of these topics, there either is a current wikibook on that subject, or a wikibook being planned on that subject.
- Transistor mechanics or semiconductors -- Semiconductors
- Digital Logic, Digital Circuit Design, or Digital Circuit Layout -- Digital Circuits
- Microprocessor or Integrated Circuit Fabrication -- Microtechnology
- Design or interfacing with other computer components or peripherals -- Embedded Systems
- Design or implementation of communication protocols, even protocols used typically to communicate between computer components -- Serial Programming, Embedded Systems/Serial and Parallel IO
- Design or creation of computer software -- Assembly Language and other Category:Programming wikibooks
- Design of System-on-a-Chip hardware, or the design of any device with an integrated microcontroller -- Programmable Logic
This book is about the design of microcontrollers and microprocessors only.
Terminology
Throughout the book, the words "Microprocessor", "Microcontroller", "Processor", and "CPU" will all generally be used interchangeably to denote a digital processing element capable of performing arithmetic and quantitative comparisons. We may differentiate between these terms in individual sections, but an explanation of the differences will always be provided.
Microprocessor Basics
Microprocessors
Microprocessors are the devices in a computer that make things happen. Microprocessors are capable of performing basic arithmetic operations, moving data from place to place, and making basic decisions based on the quantity of certain values.
Types of Processors
The vast majority of microprocessors are embedded microcontrollers. The second most common type of processor is the common desktop processor, such as Intel's Pentium or AMD's Athlon. Less common are the extremely powerful processors used in high-end servers.
Historically, microprocessors and microcontrollers have come in "standard sizes" of 8 bits, 16 bits, 32 bits, and 64 bits. These sizes are common, but other sizes are available: some microcontrollers (usually specially designed embedded chips) come in "non-standard" sizes such as 4 bits, 12 bits, 18 bits, or 24 bits. The number of bits represents the amount of data that can be read or written by one read/write operation. In the simplest case, where addresses are the same width as data, it also represents how much physical memory can be directly addressed by the CPU, although many processors use an address width that differs from their data width (many 8 bit processors, for example, use 16 bit addresses).
- 8 bit processors can read/write 1 byte at a time, and with 8 bit addresses can directly address 256 bytes
- 16 bit processors can read/write 2 bytes at a time, and with 16 bit addresses can address 65,536 bytes (64 Kilobytes)
- 32 bit processors can read/write 4 bytes at a time, and with 32 bit addresses can address 4,294,967,296 bytes (4 Gigabytes)
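The figures in the list above follow from two simple formulas, sketched here in Python. The function names are illustrative only, and the sketch assumes the simplest case in which the address width equals the data width.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits


def bytes_per_transfer(data_bits: int) -> int:
    """Number of bytes moved by one read/write on a data path of the given width."""
    return data_bits // 8


# Assumption for illustration: address width equals data width.
for width in (8, 16, 32):
    print(width, "bits:", bytes_per_transfer(width), "byte(s) per transfer,",
          addressable_bytes(width), "addressable bytes")
```

Because n address bits can form 2^n distinct addresses, every additional address bit doubles the addressable memory.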
General Purpose Versus Specific Use
Microprocessors that are capable of performing a wide range of tasks are called general purpose Microprocessors. General purpose chips are typically the kind of chips found in desktop computer systems. These chips typically are capable of a wide range of tasks (integer and floating point arithmetic, external memory interface, general I/O, etc). We will discuss some of the other types of processor units available:
- General Purpose
- A general purpose processing unit, typically referred to as a "microprocessor" is a chip that is designed to be integrated into a larger system with peripherals and external RAM. These chips can typically be used with a very wide array of software.
- DSP
- A Digital Signal Processor, or DSP for short, is a chip that is specifically designed for fast arithmetic operations, especially addition and multiplication. These chips are designed with processing speed in mind, and don't typically have the same flexibility as a general purpose microprocessor. DSPs also have special address generation units that can manage circular buffers, perform bit-reversed addressing, and simultaneously access multiple memory spaces with little to no overhead. They also support zero-overhead looping and a single-cycle multiply-accumulate instruction. They are not typically more powerful than general purpose microprocessors, but can perform signal processing tasks using far less power (as in watts).
- Embedded Controller
- Embedded controllers, or "microcontrollers" are microprocessors with additional hardware integrated into the single chip. Many microcontrollers have RAM, ROM, A/D and D/A converters, interrupt controllers, timers, and even oscillators built into the chip itself. These controllers are designed to be used in situations where a whole computer system isn't available, and only a small amount of simple processing needs to be performed.
- Programmable State Machines
- The smallest of the bunch, programmable state machines (PSMs) are minimalist microprocessors designed for very small and simple operations. PSMs typically have a very small amount of program ROM available and limited scratch-pad RAM, and they are also typically limited in the type and number of instructions that they can perform. PSMs can either be used stand-alone, or (more frequently) they are embedded directly into the design of a larger chip.
- Graphics Processing Units
- Computer graphics are so complicated that the functions that process the visuals of video and game applications have been offloaded to a special type of processor known as a GPU. GPUs typically require specialized hardware to implement matrix multiplications and vector arithmetic. GPUs are typically also highly parallelized, performing shading calculations on multiple pixels and surfaces simultaneously.
Types of Use
Microcontrollers and Microprocessors are used for a number of different types of applications. People may be the most familiar with the desktop PC, but the fact is that desktop PCs make up only a small fraction of all microprocessors in use today. We will list here some of the basic uses for microprocessors:
- Signal Processing
- Signal processing is an area that demands high performance from microcontroller chips to perform complex mathematical tasks. Signal processing systems typically need to have low latency, and are very deadline driven. An example of a signal processing application is the decoding of digital television and radio signals.
- Real Time Applications
- Some tasks need to be performed so quickly that even the slightest delay or inefficiency can be detrimental. These applications are known as "real time systems", and timing is of the utmost importance. An example of a real-time system is the anti-lock brake controller in modern automobiles.
- Throughput and Routing
- Throughput and routing is the use of a processor where data is moved from one particular input to an output, without necessarily requiring any processing. An example is an internet router, which reads in data packets and sends them out on a different port.
- Sensor monitoring
- Many processors, especially small embedded processors, are used to monitor sensors. The microprocessor will either digitize and filter the sensor signals, or it will read the signals and produce status outputs (the sensor is good, the sensor is bad). An example of a sensor monitoring processor is the processor inside an anti-lock brake system: this processor reads the brake sensor to determine when the brakes have locked up, and then outputs a control signal to activate the rest of the system.
- General Computing
- A general purpose processor is like the kind of processor that is typically found inside a desktop PC. Names such as Intel and AMD are typically associated with this type of processor, and this is also the kind of processor that the public is most familiar with.
- Graphics
- Processing of digital graphics is an area where specialized processor units are frequently employed. With the advent of digital television, graphics processors are becoming more common. Graphics processors need to be able to perform multiple simultaneous operations. In digital video, for instance, a million pixels or more will need to be processed for every single frame, and a particular signal may have 60 frames per second! To the benefit of graphics processors, the color value of a pixel is typically not dependent on the values of surrounding pixels, and therefore many pixels can typically be computed in parallel.
Abstraction Layers
Computer systems are developed in layers known as layers of abstraction. Layers of abstraction allow people to develop computer components (hardware and software) without having to worry about the internal design of the other layers in the system. At the highest level are the user-interface programs that people use on their computers. At the lowest level are the transistor layouts of the individual computer components. Some of the layers in a computer system are (listed from highest to lowest):
- Application
- Operating System
- Firmware
- Instruction Set Architecture
- Microprocessor Control Logic
- Physical Circuit Layout
This book will be mostly concerned with the Instruction Set Architecture (ISA) and the Microprocessor Control Logic. Topics above these are typically the realm of computer programmers. The bottom layer, the Physical Circuit Layout, is the job of hardware and VLSI engineers.
ISA
The Instruction Set Architecture is a long name for the assembly language of a particular machine, and the associated machine code for that assembly language. We will discuss this below.
Assembly Language
An assembly language is a small language that contains a short word or "mnemonic" for each individual command that a microcontroller can follow. Each command gets a single mnemonic, and each mnemonic corresponds to a single machine command. Assembly language gets converted (by a program called an "assembler") into the binary machine code. The machine code is specific to each different type of machine.
Common ISAs
Some of the most common ISAs, listed in order of popularity (most popular first) are:
- ARM
- IA-32 (Intel x86)
- MIPS
- Motorola 68K
- PowerPC
- Hitachi SH
- SPARC
Moore's Law
A common law that governs the world of microprocessors is Moore's Law. The observation was made by Intel co-founder Gordon Moore, and the name "Moore's Law" was popularized by Dr. Carver Mead at Caltech. Moore's Law states that the number of transistors on a single chip at the same price will double every 18 to 24 months. This law has held remarkably well since it was originally stated in 1965. Current microprocessor chips contain millions of transistors or more, and the number is growing rapidly. Here is Moore's summarization of the law from Electronics Magazine in 1965:
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year...Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.
— Gordon Moore
Moore's Law has been used incorrectly to calculate the speed of an integrated circuit, or even to calculate its power consumption, but neither of these interpretations is correct. Also, Moore's law is talking about the number of transistors on a chip for a "minimum component cost", which means that the number of transistors on a chip, for the same price, will double. It follows that cheaper chips can have fewer transistors, and that more expensive chips can have more transistors. On an economic note, a consequence of Moore's Law is that companies need to continue to innovate and integrate more transistors onto a single chip, without being able to increase prices.
Moore's Law does not require that the speed of the chip increase along with the number of transistors on the chip. However, the two measurements are typically related. Some points to keep in mind about transistors and Moore's Law are:
- Smaller transistors typically switch faster than larger transistors.
- To get more transistors on a single chip, the chip needs to be made larger, or the transistors need to be made smaller. Typically, the transistors get smaller.
- Transistors tend to leak more electrical current as they get smaller. This leakage wastes power and generates heat, even when the transistor is not switching.
- Transistors tend to generate heat as a function of frequencies. Higher clock rates tend to generate more heat.
Moore's law is occasionally misinterpreted to mean that the speed of processors, in hertz, will double every 18 months. This is not strictly true, although the speed of processors does tend to increase as transistors are made smaller and more compact. With the advent of multi-core processors, some people have used Moore's law to mean that processor throughput increases with time, which is not strictly the case either (although it is a likely side effect of Moore's law).
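The arithmetic behind a doubling law can be sketched in a few lines of Python. The function name and the starting count of 64 components in 1965 are assumptions chosen for illustration; with the one-year doubling period from Moore's 1965 quotation, they reproduce a projection of roughly 65,000 components by 1975.

```python
def projected_components(start_count, start_year, year, doubling_period_years=1.0):
    """Project a component count that doubles once every doubling period."""
    doublings = (year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Assumed starting point for illustration: 64 components in 1965.
print(projected_components(64, 1965, 1975))                           # doubling every year
print(projected_components(64, 1965, 1975, doubling_period_years=2))  # doubling every 2 years
```

Ten doublings multiply the count by 1,024, while five doublings multiply it by only 32, which shows how sensitive such projections are to the assumed doubling period.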
Clock Rates
Microprocessors are typically discussed in terms of their clock speed. The clock speed is measured in hertz (or megahertz, or gigahertz). A hertz is one "cycle per second". Each cycle, a microprocessor will perform certain tasks, although the amount of work performed in a single cycle will be different for different types of processors. The number of cycles that a processor needs to complete an instruction is measured in "cycles per instruction" (CPI). Some systems, such as simple MIPS designs, aim for 1 cycle per instruction. On other systems, such as x86 chips, an individual instruction may take many cycles.
The clock rate $f$ and the clock time $T$ (the duration of one cycle) are related as such:
$$f = \frac{1}{T}$$
This means that the amount of time for a cycle is inversely proportional to the clock rate. A computer with a 1 MHz clock rate will have a clock time of 1 microsecond. A modern desktop computer with a 3.2 GHz processor will have a clock time of approximately 3.1 × 10^-10 seconds, or about 0.31 nanoseconds. That is an incredibly small amount of time, and there is a lot that needs to happen inside the processor in each clock cycle.
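The inverse relationship between clock rate and clock time can be checked numerically. This is a minimal sketch; the function name is illustrative.

```python
def clock_period_seconds(frequency_hz: float) -> float:
    """Duration of one clock cycle, in seconds, for a clock rate given in hertz."""
    return 1.0 / frequency_hz


for label, frequency_hz in (("1 MHz", 1e6), ("3.2 GHz", 3.2e9)):
    period = clock_period_seconds(frequency_hz)
    print(f"{label}: {period * 1e9:.4f} ns per cycle")
```

A 1 MHz clock gives 1000 ns (1 microsecond) per cycle, and a 3.2 GHz clock gives about 0.3125 ns per cycle.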
Basic Elements of a Computer
There are a few basic elements that are common to all computers. These elements are:
- CPU
- Memory
- Input Devices
- Output Devices
Depending on the particular computer architecture, these elements may be available in various sizes, and they may be accompanied by additional elements.
Computer Architecture
Von Neumann Architecture
Early computer programs were hard wired. Reprogramming a computer meant changing the hardware switches manually, which took a long time and was prone to errors. Computer memory was only used for storing data.
John von Neumann suggested that data and programs be stored together in memory; this arrangement is now called the Von Neumann architecture. Programs are fetched from memory for execution by a central unit, which is what we call the CPU. Programs and data are represented in memory in the same way: the program is just data coded with a special meaning.
A Von Neumann microprocessor is a processor that follows this pattern:
- Fetch
- An instruction and the necessary data are obtained from memory.
- Decode
- The instruction and data are separated, and the necessary components and pathways to execute the instruction are activated.
- Execute
- The instruction is performed, the data is manipulated, and the results are stored.
This pattern is typically implemented by separating the task into two components: the control and the datapath.
Control
The control unit reads the instruction, and activates the appropriate parts of the datapath.
Datapath
The datapath is the pathway that the data takes through the microprocessor. As the data travels to different parts of the datapath, the command signals from the control unit cause the data to be manipulated in specific ways, according to the instruction.
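The fetch, decode, and execute pattern, and the split between control and datapath, can be sketched as a small simulator. Everything here is a hypothetical toy: the four opcodes, the single accumulator, and the `(opcode, address)` instruction format are assumptions made for illustration, not a real ISA. Instructions and data share one memory, as in a Von Neumann machine.

```python
# Hypothetical opcodes for a toy accumulator machine (not a real ISA).
HALT, LOAD, ADD, STORE = 0, 1, 2, 3


def control_unit(instruction):
    """Decode: split an instruction word into an opcode and an operand address."""
    opcode, address = instruction
    return opcode, address


def datapath(opcode, address, acc, memory):
    """Execute: manipulate data as selected by the opcode; return the new accumulator."""
    if opcode == LOAD:
        return memory[address]
    if opcode == ADD:
        return acc + memory[address]
    if opcode == STORE:
        memory[address] = acc
        return acc
    raise ValueError(f"unknown opcode: {opcode}")


def run(memory):
    """Repeat fetch, decode, execute until a HALT instruction is reached."""
    pc, acc = 0, 0  # program counter and accumulator
    while True:
        instruction = memory[pc]  # fetch the instruction at the program counter
        pc += 1
        opcode, address = control_unit(instruction)  # decode
        if opcode == HALT:
            return memory
        acc = datapath(opcode, address, acc, memory)  # execute


# Addresses 0-3 hold the program; addresses 4-6 hold the data.
memory = [(LOAD, 4), (ADD, 5), (STORE, 6), (HALT, 0), 2, 3, 0]
run(memory)
print(memory[6])  # prints 5
```

In real hardware the control unit produces electrical control signals rather than return values, but the division of labor is the same: the control decides what happens, and the datapath does it.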
Harvard Architecture
In a Harvard Architecture machine, the memory of the computer system is separated into two distinct parts: the data and the instructions. In a pure Harvard system, the two different memories will occupy separate memory modules, and instructions can only be executed from the instruction memory.
In a "Princeton Architecture" machine, the memory of the computer system is one uniform address space: data and instructions may be placed anywhere in the one memory.
Many DSPs are modified Harvard Architectures, designed to access 3 distinct memory areas simultaneously: the program instructions, the signal data samples, and the filter coefficients (often called the P, X, and Y memories).
In theory, such 3-way Harvard architectures can be 3 times as fast as a Princeton architecture that is forced to read each of these things one at a time.
In modern systems, the instructions and data are stored in the same physical memory module, but are stored in different locations in memory. Several common computer security problems arise because modern processors are not pure Harvard systems, and are allowed to manipulate instructions as if they were data, and vice-versa.
RISC and CISC
There are two different types of ISA: the reduced instruction set computers (RISC) and the complex instruction set computers (CISC). It is a common misunderstanding that RISC systems typically have a small ISA (fewer instructions) but make up for it with faster hardware. RISC systems actually have "reduced instructions", in the sense that each instruction does so little that it takes very little time to execute it. It is a common misunderstanding that CISC systems have more instructions, but typically pay a steep performance penalty for the added versatility. CISC systems actually have "complex instructions", in the sense that at least one instruction takes a long time to execute -- for example, the "double indirect" addressing mode inherently requires two memory cycles to execute, and a few CPUs have a "string copy" instruction that may require hundreds of memory cycles to execute. MIPS and SPARC are examples of RISC computers. Intel x86 is an example of a CISC computer.
Some people group stack machines with the RISC machines; others [1] group stack machines with the CISC machines; some people [2] describe stack machines as neither RISC nor CISC.
We will discuss these terms and concepts in more detail later.
Microprocessor Components
Some of the common components of a microprocessor are:
- Control Unit
- I/O Units
- Arithmetic Logic Unit (ALU)
- Registers
- Cache
We will give a brief introduction to these components below.
Control Unit
The control unit, as described above, reads the instructions, and generates the necessary digital signals to operate the other components. An instruction to add two numbers together would cause the Control Unit to activate the addition module, for instance.
I/O Units
A processor needs to be able to communicate with the rest of the computer system. This communication occurs through the I/O ports. The I/O ports will interface with the system memory (RAM), and also the other peripherals of a computer.
Arithmetic Logic Unit
The Arithmetic Logic Unit, or ALU, is the part of the microprocessor that performs arithmetic operations. ALUs can typically add, subtract, divide, multiply, and perform logical operations on two numbers (and, or, nor, not, etc).
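The behavior of a simple ALU can be modeled as a function that takes an operation selector and two operands. This is a minimal sketch under stated assumptions: the operation names, the default 8-bit width, and the choice to cover only a subset of the operations mentioned above are all illustrative.

```python
def alu(op: str, a: int, b: int, width: int = 8) -> int:
    """Model a fixed-width ALU: apply the selected operation, then wrap to `width` bits."""
    mask = (1 << width) - 1  # e.g. 0xFF for an 8-bit ALU
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "nor":
        result = ~(a | b)
    elif op == "not":
        result = ~a  # single-operand operation; b is ignored
    else:
        raise ValueError(f"unsupported operation: {op}")
    return result & mask  # discard any bits that do not fit in the register width


print(alu("add", 200, 100))  # prints 44: the sum 300 wraps around in 8 bits
print(alu("sub", 1, 2))      # prints 255: -1 as an 8-bit pattern
```

The masking step stands in for the fixed register width of real hardware, where results that do not fit simply lose their upper bits.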
Registers
In this book, we talk about lots of different kinds of registers. Hopefully it will be obvious which kind of register we are talking about from the context.
The most general meaning is a "hardware register": anything that can be used to store bits of information, in a way that all the bits of the register can be written to or read out simultaneously. Since registers outside of a CPU are also outside the scope of the book, this book will only discuss processor registers, which are hardware registers that happen to be inside a CPU. But usually we will refer to a more specific kind of register.
Programmer-Visible Registers
The programmer-visible registers, also called the user-accessible registers, also called the architectural registers, often simply called "the registers", are the registers that are directly encoded as part of at least one instruction in the instruction set.
The registers are the fastest accessible memory locations, and because they are so fast, there are typically very few of them. In most processors, there are fewer than 32 registers. The size of the registers defines the size of the computer. For instance, a "32 bit computer" has registers that are 32 bits long. The length of a register is known as the word length of the computer.
There are several factors limiting the number of registers, including:
- It is very convenient for a new CPU to be software-compatible with an old CPU. This requires the new chip to have exactly the same number of programmer-visible registers as the old chip.
- Doubling the number of general-purpose registers requires adding another bit to each instruction field that selects a particular register. Each 3-operand instruction (one that specifies 2 source operands and a destination operand) would expand by 3 bits. Modern chip manufacturing processes could put a million registers on a chip; that would make each and every 3-operand instruction require 60 bits just to select the registers, not counting the bits required to specify what to do with those operands.
- Adding more registers adds more wires to the critical path, adding capacitance, which reduces the maximum clock speed of the CPU.
- Historically, CPUs were designed with few registers, because each additional register increased the cost of the CPU significantly. But now that modern chip manufacturing can put tens of millions of bits of storage on a single commodity CPU chip, this is less of an issue.
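The instruction-encoding cost described in the list above can be computed directly: selecting one of N registers takes the smallest number of bits b such that 2^b is at least N. The function name below is illustrative.

```python
def register_select_bits(num_registers: int, operands: int = 3) -> int:
    """Bits an instruction needs just to name its register operands."""
    # (n - 1).bit_length() is the smallest b with 2**b >= n, for n >= 1.
    bits_per_operand = (num_registers - 1).bit_length()
    return operands * bits_per_operand


for count in (16, 32, 64, 1_000_000):
    print(count, "registers:", register_select_bits(count), "bits per 3-operand instruction")
```

Going from 32 to 64 registers grows a 3-operand instruction from 15 to 18 selector bits, and a million registers would need 60 bits, matching the figure quoted above.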
Microprocessors typically contain a large number of registers, but only a small number of them are accessible by the programmer. The registers that can be used by the programmer to store arbitrary data, as needed, are called general purpose registers. Registers that cannot be accessed by the programmer directly are known as reserved registers[citation needed].
Some computers have highly specialized registers -- memory addresses always come from the program counter or "the" index register or "the" stack pointer; one ALU input is always hooked to data coming from memory, and the other ALU input is always hooked to "the" accumulator; etc.
Other computers have more general-purpose registers -- any instruction that accesses memory can use any address register as an index register or as a stack pointer; any instruction that uses the ALU can use any data register.
Other computers have completely general-purpose registers -- any register can be used as data or an address in any instruction, without restriction.
Microarchitectural Registers
Besides the programmer-visible registers, all CPUs have other registers that are not programmer-visible, called "microarchitectural registers" or "physical registers".
These registers include:
- memory address register
- memory data register
- instruction register
- microinstruction register
- microprogram counter
- pipeline registers
- extra physical registers to support register renaming
- the prefetch input queue
- writable control stores (We will discuss the control store in the Microprocessor Design/Control Unit and Microprocessor Design/Microcode chapters.)
- Some people consider on-chip cache to be part of the microarchitectural registers; others consider it "outside" the CPU.
There are a wide variety of ways to implement any one instruction set. The vast majority of these microarchitectural registers are technically not "necessary". A designer could choose to design a CPU that had almost no physical registers other than the programmer-visible registers. However, many designers choose to design a CPU with lots of physical registers, using them in ways that make the CPU execute the same given instruction set much faster than a CPU that lacks those registers.
Cache
Most CPUs manufactured do not have any cache.
Cache is memory that is located on the chip, but that is not considered registers. The cache is used because reading external memory is very slow (compared to the speed of the processor), and reading a local cache is much faster. In modern processors, the cache can take up as much as 50% or more of the total area of the chip. The following table shows the relationship between different types of memory:
| | Registers | Cache | RAM |
| Size | smallest | intermediate | largest |
| Speed | fastest | intermediate | slowest |
Cache typically comes in 2 or 3 "levels", depending on the chip. Level 1 (L1) cache is smaller and faster than Level 2 (L2) cache, which is larger and slower. Some chips have Level 3 (L3) cache as well, which is larger still than the L2 cache (although L3 cache is still much faster than external RAM).
Endian
Different computers order their multi-byte data words (i.e., 16-, 32-, or 64-bit words) in different ways in RAM. Each individual byte in a multi-byte word is still separately addressable. Some computers order their data with the most significant byte of a word in the lowest address, while others order their data with the most significant byte of a word in the highest address. There is logic behind both approaches, and this was formerly a topic of heated debate.
This distinction is known as endianness. Computers that order data with the least significant byte in the lowest address are known as "Little Endian", and computers that order the data with the most significant byte in the lowest address are known as "Big Endian". It is easier for a human (typically a programmer) to view multi-byte data dumped to a screen one byte at a time if it is ordered as Big Endian. However, it makes more sense to others to store the least significant (LS) data at the least significant address.
When using a computer this distinction is typically transparent; that is that the user cannot tell the difference between computers that use the different formats. However, difficulty arises when different types of computers attempt to communicate with one another over a network.
With a big-endian 68K sort of machine,
address increases > ------ >
data : 74 65 73 74 00 00 00 05
is the string "test" followed by the 32-bit integer 5. The little-endian x86 sort of machine would interpret the last part as the integer 0x0500_0000.
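The two interpretations of the byte sequence above can be reproduced with Python's built-in `int.from_bytes`, which takes the byte order as an explicit argument. The variable names are illustrative.

```python
# The eight bytes from the memory dump above, in order of increasing address.
data = bytes([0x74, 0x65, 0x73, 0x74, 0x00, 0x00, 0x00, 0x05])

text = data[:4].decode("ascii")                 # single bytes have no endianness
as_big = int.from_bytes(data[4:], "big")        # most significant byte first
as_little = int.from_bytes(data[4:], "little")  # least significant byte first

print(text)            # prints test
print(as_big)          # prints 5
print(hex(as_little))  # prints 0x5000000
```

The text reads the same either way because each character occupies a single byte; only the multi-byte integer changes meaning.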
When communicating over a network composed of both big-endian and little-endian machines, the network hardware should apply the Address Invariance principle, to avoid scrambling text (avoiding the NUXI problem). High-level software should format packets of data to be transmitted over the network in Network Byte Order (big-endian). High-level software should also be written as "endian clean" -- always reading and writing 16 bit integers as whole 16 bit integers, 32 bit integers as whole 32 bit integers, etc. -- so no changes are needed to re-compile it for big-endian or little-endian machines. Software that is not "endian clean" -- software that writes integers, but then reads them out as 8 bit octets or integers of some other length -- usually fails when re-compiled for another computer.
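One way to write packets in Network Byte Order regardless of the host machine is Python's standard `struct` module, where the `!` prefix selects network (big-endian) byte order. The packet layout here, a single 32-bit unsigned integer, and the helper names are assumptions made for illustration.

```python
import struct


def encode_u32(value: int) -> bytes:
    """Pack a 32-bit unsigned integer in network (big-endian) byte order."""
    return struct.pack("!I", value)


def decode_u32(packet: bytes) -> int:
    """Unpack a whole 32-bit unsigned integer from network byte order."""
    return struct.unpack("!I", packet)[0]


packet = encode_u32(5)
print(packet.hex())        # prints 00000005 on any host
print(decode_u32(packet))  # prints 5
```

Because the integer is always written and read as a whole 32-bit value through an explicit byte order, the same code behaves identically on big-endian and little-endian hosts, which is the sense of "endian clean" used above.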
A few computers -- including nearly all DSPs -- are "neither-endian". They always read and write complete aligned words, and don't have any hardware for dealing with individual bytes. Systems build on top of such computers often *do* have a particular endianness -- but that endianness is written into the software, and can be switched by re-compiling for the opposite endianness.
Stack
A stack is a block of memory that is used as a scratchpad area. The stack is a sequential set of memory locations that is set to act like a LIFO (last in, first out) buffer. Data is added to the top of the stack in a "push" operation, and the top data item is removed from the stack during a "pop" operation. Most computer architectures include at least a register that is usually reserved for the stack pointer.
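As a rough software model of the push and pop operations, the sketch below keeps a fixed block of memory and a stack pointer. The class name, the upward growth direction, and the "pointer to next free slot" convention are assumptions made for illustration; many real architectures grow the stack downward instead.

```python
class Stack:
    """A LIFO buffer held in a fixed block of memory, with a stack pointer."""

    def __init__(self, size):
        self.memory = [0] * size  # block of memory reserved for the stack
        self.sp = 0               # stack pointer: index of the next free slot

    def push(self, value):
        if self.sp == len(self.memory):
            raise OverflowError("stack overflow")
        self.memory[self.sp] = value
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("stack underflow")
        self.sp -= 1
        return self.memory[self.sp]


stack = Stack(4)
stack.push(10)
stack.push(20)
print(stack.pop())  # 20 (last in, first out)
print(stack.pop())  # 10
```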
Some microprocessors include a small hardware stack built into the CPU, independent from the rest of the RAM.
Modern Computers
Modern desktop computers, especially computers based on the Intel x86 ISA, are not Harvard computers, although the newer variants have features that are "Harvard-like". All information, both program instructions and data, is stored in the same RAM. However, a modern feature called "paging" allows the physical memory to be divided into large blocks called "pages". Each page of memory can hold either instructions or data, but not both. Desktop computers also
Modern embedded computers, however, are typically based on a Harvard architecture. Instructions are stored in a different addressable memory block than the data is, and there is no way for the microprocessor to interchange data and instructions.
Instruction Set Architectures
ISAs
The instruction set or the instruction set architecture (ISA) is the set of basic instructions that a processor understands. The instruction set is a portion of what makes up an architecture.
There are two basic philosophies to instruction sets: reduced (RISC) and complex (CISC). The merits and argued performance gains by each philosophy are and have been thoroughly debated.
CISC
Complex Instruction Set Computer (CISC) is rooted in the history of computing. Originally there were no compilers, and programs had to be coded by hand one instruction at a time. To ease programming, more and more instructions were added. Many of these instructions are complicated combination instructions such as loops. In general, more complicated or specialized instructions are inefficient in hardware, and in a typical CISC architecture the best performance can be obtained by using only the simplest instructions from the ISA.
The most well-known and commoditized CISC ISAs are the Motorola 68k and Intel x86 architectures.
RISC
Reduced Instruction Set Computer (RISC) was realized in the late 1970s by IBM. Researchers discovered that most programs did not take advantage of all the various address modes that could be used with the instructions. By reducing the number of address modes and breaking down multi-cycle instructions into multiple single-cycle instructions several advantages were realized:
- compilers became easier to write (and easier to optimize)
- performance increased for programs that performed simple operations
- the clock rate could be increased, since the minimum cycle time is determined by the longest-running instruction
The most well-known and commoditized RISC ISAs are the ARM, MIPS and SPARC architectures.
Memory Arrangement
Instructions are typically arranged sequentially in memory. Each instruction occupies one or more computer words. The Program Counter is a register inside the microprocessor that contains the address of the current instruction. During the fetch cycle, the instruction at that address is read from memory, and the program counter is incremented by n, where n is the word length of the machine (in bytes).
Memory can be addressed relatively based on the Program Counter or directly by specifying the absolute address.
Common Instructions
Arithmetic Instructions
The ALU typically performs basic arithmetic and logical instructions. The average ALU will perform addition, subtraction, multiplication, division, NOT, AND, OR, and XOR. Not all ALUs support all of these instructions, and some ALUs can perform more advanced operations as well.
Branch and Jump
Branching and Jumping is the ability to load the PC register with a new address that is not the next sequential address. In general, a "jump" occurs unconditionally, and a "branch" occurs on a given condition. In this book we will generally refer to both as being branches, with a "jump" being an unconditional branch.
Move, Load, Store
Move instructions cause data from one register to be moved or copied to another register. Load instructions put data from an external source, such as memory, into a register. Store instructions move data from a register to an external destination.
NOP
NOP, short for "no operation", is an instruction that can be executed like normal, except that it produces no result and causes no side effects. NOPs are useful for timing and for preventing hazards.
Memory
Memory is a fundamental aspect of microcontroller design, and a good understanding of memory is necessary to discuss any processor system.
Memory Hierarchy
Memory suffers from the dichotomy that it can be either large or fast. As memory becomes larger, it becomes slower, and vice versa. Because of this trade-off, computer systems typically have a hierarchy of memory types, where faster (and smaller) memories are closer to the processor, and slower (but larger) memories are further from the processor.
Hard Disk Drives
Hard Disk Drives (HDDs) are occasionally known as secondary memory or nonvolatile memory. HDDs typically store data magnetically (although some newer models use FLASH), and data is maintained even when the computer is turned off or removed from power. An HDD is several orders of magnitude slower than all other memory devices, and a computer system will be more efficient when the number of interactions with the HDD is minimized.
Because most HDDs are mechanical and have moving parts, they tend to wear out and fail after time.
RAM
Random Access Memory (RAM), also known as main memory, is volatile storage that holds data for the processor. Unlike HDD storage, RAM typically only has a capacity of a few megabytes to a few gigabytes. There are two primary forms of RAM, and many variants on these.
SRAM
Static RAM (SRAM) is a type of memory storage that typically uses 6 transistors to store each bit of data. These transistors store data so long as power is supplied to the RAM and do not need to be refreshed.
SRAM is typically used in processor cache, not main memory, because it is faster, despite its larger area.
DRAM
Dynamic RAM (DRAM) is a type of RAM that stores each bit using a single transistor and a capacitor. DRAM is smaller than SRAM, and therefore can store more data in a smaller area. Because of the charge and discharge times of the capacitor, however, DRAM tends to be slower than SRAM. Many modern types of main memory are based on DRAM design because of the high memory densities. Because DRAM is simpler than SRAM, it is typically cheaper to produce.
A popular type of RAM, SDRAM, is a variant of DRAM and is not related to SRAM.
As digital circuits continue to grow smaller and faster as per Moore's Law, the speed of DRAM is not increasing as rapidly. This means that as time goes on, the speed difference between the processor and the RAM units (so long as the RAM is based on DRAM or variants) will continue to increase, and communications between the two units becomes more inefficient.
Other RAM
Cache
Cache is memory that is smaller and faster than main memory and resides closer to the processor. RAM runs on the system bus clock, but cache typically runs at the processor clock speed, which can be 10 times faster or more. Cache is frequently divided into multiple levels: L1, L2, and L3, with L1 being the smallest and fastest, and L3 being the largest and slowest. We will discuss cache in more detail in a later chapter.
A computer system may have from several kilobytes to a few megabytes of available cache.
Registers
Registers are the smallest and fastest memory storage elements. A modern processor may have anywhere from 4 to 256 registers.
Control and Datapath
Most processors and other complicated hardware circuits are divided into two components: a datapath and a control unit. The datapath contains all the hardware needed to perform the required operations. In many cases, these hardware modules operate in parallel with one another, and the final result is determined by multiplexing all the partial results.
The control unit determines the operation of the datapath, by activating switches and passing control signals to the various multiplexers. In this way, the control unit can specify how the data flows through the datapath.
Performance
Clock Cycles
The clock signal is a 1-bit signal that oscillates between a "1" and a "0" with a certain frequency. When the clock transitions from a "0" to a "1" it is called the positive edge, and when the clock transitions from a "1" to a "0" it is called the negative edge.
The time it takes to go from one positive edge to the next positive edge is known as the clock period, and represents one clock cycle.
The number of clock cycles that can fit in 1 second is called the clock frequency. To get the clock frequency, we can use the following formula:
$$\text{Clock Frequency} = \frac{1}{\text{Clock Period}}$$
Clock frequency is measured in units of cycles per second.
Cycles per Instruction
In many microprocessor designs, it is common for multiple clock cycles to transpire while performing a single instruction. For this reason, it is frequently useful to keep a count of how many cycles are required to perform a single instruction. This number is known as the cycles per instruction, or CPI of the processor.
Because different processors may operate with different CPIs, it is not possible to accurately compare multiple processors simply by comparing the clock frequencies. It is more useful to compare the number of instructions per second, which can be calculated as such:
$$\text{Instructions per Second} = \frac{\text{Clock Frequency}}{\text{CPI}}$$
One of the most common units of measure in modern processors is the "MIPS", which stands for millions of instructions per second. A processor with 5 MIPS can perform 5 million instructions every second. Another common metric is "FLOPS", which stands for floating point operations per second. MFLOPS is a million FLOPS, GFLOPS (sometimes written BFLOPS) is a billion FLOPS, and TFLOPS is a trillion FLOPS.
CPU Time
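CPU Time is the amount of time it takes the CPU to complete a particular program. CPU time is a function of the amount of time it takes to complete instructions, and the number of instructions in the program:
$$\text{CPU Time} = \text{Instruction Count} \times \text{Time per Instruction} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Period}$$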
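The performance measures in this chapter can be tied together in a few lines of Python. This is a minimal sketch using the standard relationships frequency = 1 / period, instructions per second = frequency / CPI, and CPU time = instruction count × CPI / frequency. The function names and the example processor (100 MHz, CPI of 4, a 50-million-instruction program) are hypothetical.

```python
def clock_frequency(clock_period_s):
    """Clock frequency in Hz, given the clock period in seconds."""
    return 1.0 / clock_period_s


def instructions_per_second(frequency_hz, cpi):
    """Average instruction throughput for a given clock frequency and CPI."""
    return frequency_hz / cpi


def cpu_time(instruction_count, cpi, frequency_hz):
    """Seconds needed to run a program of instruction_count instructions."""
    return instruction_count * cpi / frequency_hz


frequency = 100e6   # hypothetical 100 MHz clock
cpi = 4             # hypothetical average cycles per instruction

print(instructions_per_second(frequency, cpi) / 1e6)  # 25.0 (MIPS)
print(cpu_time(50e6, cpi, frequency))                 # 2.0 (seconds)
```

Note that a processor with a faster clock but a higher CPI can still come out slower, which is why the clock frequency alone is a poor basis for comparison.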
Performance
Amdahl's Law
Amdahl's Law is a law concerned with computer performance and optimization. Amdahl's law states that an improvement in the speed of a single processor component will have a comparatively small effect on the performance of the overall processor unit.
In the most general sense, Amdahl's Law can be stated mathematically as follows:
$$\Delta = \frac{1}{\sum_{k=0}^{n} \frac{P_k}{S_k}}$$
where:
- Δ is the factor by which the program is sped up or slowed down,
- Pk is a percentage of the instructions that can be improved (or slowed),
- Sk is the speed-up multiplier (where 1 is no speed-up and no slowing),
- k represents a label for each different percentage and speed-up, and
- n is the number of different speed-up/slow-downs resulting from the system change.
For instance, if we make a speed improvement in the memory module, only the instructions that deal directly with the memory module will experience a speedup. In this case, the percentage of load and store instructions in our program will be P0, and the factor by which those instructions are sped up will be S0. All other instructions, which are not affected by the memory unit, will be P1, and their speed-up will be S1, where:
- P1 = 1 − P0
- S1 = 1
We set S1 to 1 because those instructions are not sped up or slowed down by the change to the memory unit.
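The memory example can be worked numerically. The sketch below assumes the usual form of Amdahl's Law, Δ = 1 / Σ (P_k / S_k), with the fractions P_k summing to 1. The function name and the figures (30% loads and stores, memory made twice as fast) are hypothetical.

```python
def amdahl_speedup(fractions, speedups):
    """Overall speed-up: 1 / sum(P_k / S_k) over all instruction classes."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(p / s for p, s in zip(fractions, speedups))


# Hypothetical workload: 30% of instructions are loads/stores (P0),
# and the memory module is made twice as fast (S0 = 2).
p0, s0 = 0.3, 2.0
p1, s1 = 1 - p0, 1.0   # everything else is unaffected

overall = amdahl_speedup([p0, p1], [s0, s1])
print(round(overall, 3))  # 1.176
```

Even though the memory module became twice as fast, the program as a whole runs less than 1.2 times faster, which is the "comparatively small effect" that the law describes.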
Benchmarking
- SpecInt
- SpecFP
- "Maxim/Dallas APPLICATION NOTE 3593" benchmarking
- "Mod51 Benchmarks"
- EEMBC, the Embedded Microprocessor Benchmark Consortium
Assembly Language
Assemblers
Assembly Language Constructs
There are a number of different assembly languages in existence, but all of them have a few things in common. They all map directly to the underlying hardware CPU instruction sets.
- CPU instruction set
- is the set of binary codes (instructions) that the CPU understands. Depending on the CPU, an instruction can be one byte, two bytes, or longer. The instruction code is usually followed by one or two operands.
| Instruction Code | operand 1 | operand 2 |
How many instructions there are depends on the CPU.
Because binary code is difficult to remember, each instruction is given a name, called a mnemonic. For example, 'MOV' can be used for move instructions.
MOV A, 0x0020
The above instruction moves the value of register A to the specified address.
A simple assembler will translate the 'MOV A' to its CPU's instruction code.
Assembly languages cannot be assumed to be directly portable to other CPUs. Each CPU has its own assembly language, though CPUs within the same family may support limited portability.
Load and Store
These instructions tell the CPU to move data from memory to one of the CPU's registers, or to move data from one of the CPU's registers to memory.
- register
- is a special memory located inside the CPU, where arithmetic operations can be performed.
Arithmetic
Arithmetic operations can be performed using the CPU's registers:
- Increment the value of one of the CPU's registers
- Decrement the value of one of the CPU's registers
- Add a value to the register
- Subtract value from the register
- Multiply the register value
- Divide the register value
- Shift the register value
- Rotate the register value
Jumping
During a jump instruction, the program counter is loaded with a new address that is not necessarily the address of the next sequential instruction. After a jump, the program execution continues from the new location in memory.
- Relative jump
- the instruction's operand tells how many bytes the program counter should be increased or decreased.
- Absolute jump
- the instruction's operand is copied to the program counter; the operand is an absolute memory address where the execution should continue.
Branching
During a branch, the program counter is loaded with one of multiple new values, depending on some specified condition. A branch can be viewed as a series of conditional jumps.
Some CPUs have skip instructions. If a register is zero, the following instruction is skipped; otherwise the following instruction, which can be a jump instruction, is executed. Branching can therefore be done by using skip and jump instructions together.
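To make the interplay of skip and jump instructions concrete, here is a small interpreter for a hypothetical instruction set. The mnemonics (SKIPZ, JMP, JR, DEC, INC, HALT) are invented for this sketch and do not belong to any real CPU. Addresses count instructions rather than bytes, and the relative jump offset is applied to the already-incremented program counter; both are simplifying assumptions.

```python
def run(program, registers, max_steps=1000):
    """Execute a list of (mnemonic, operand) pairs and return the registers."""
    pc = 0
    for _ in range(max_steps):
        op, arg = program[pc]
        pc += 1                      # default: continue with the next instruction
        if op == "HALT":
            return registers
        elif op == "DEC":
            registers[arg] -= 1
        elif op == "INC":
            registers[arg] += 1
        elif op == "JMP":            # absolute jump: operand replaces the PC
            pc = arg
        elif op == "JR":             # relative jump: operand is added to the PC
            pc += arg
        elif op == "SKIPZ":          # skip the next instruction if register is zero
            if registers[arg] == 0:
                pc += 1
        else:
            raise ValueError("unknown instruction: " + op)
    raise RuntimeError("step limit reached")


# Loop that moves the count in A over to B, one unit at a time.
program = [
    ("SKIPZ", "A"),   # 0: if A == 0, skip the jump below
    ("JMP", 3),       # 1: A != 0, so jump to the loop body
    ("HALT", None),   # 2: reached only when A == 0
    ("DEC", "A"),     # 3: loop body
    ("INC", "B"),     # 4
    ("JR", -6),       # 5: relative jump back to address 0
]

print(run(program, {"A": 3, "B": 0}))  # {'A': 0, 'B': 3}
```

The pair of instructions at addresses 0 and 1 together behave like a single "branch if A is not zero" instruction.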
Design Steps
When designing a new microprocessor or microcontroller unit, there are a few general steps that can be followed to make the process flow more logically. These few steps can be further sub-divided into smaller tasks that can be tackled more easily. The general steps to designing a new microprocessor are:
- Determine the capabilities the new processor should have.
- Lay out the datapath to handle the necessary capabilities.
- Define the machine code instruction format (ISA).
- Construct the necessary logic to control the datapath.
We will discuss each of these steps below:
Determine Machine Capabilities
Before you start to design a new processor element, it is important to first ask why you are designing it at all. What new thing will your processor do that existing processors cannot? Keep in mind that it is always less expensive to utilize an existing chip than to design and manufacture a new one.
Some questions to start:
- Is this chip an embedded chip, a general-purpose chip, or a different type entirely?
- What, if any, are the limitations in terms of resources, price, power, or speed?
With that in mind, we need to ask what our chip will do:
- Does it have integer, floating-point, or fixed point arithmetic, or a combination of all three?
- Does it have scalar or vector operation abilities?
- Is it self-contained, or must it interface with a number of external peripherals?
- Will it support interrupts? If so, how much interrupt latency is tolerable? How much interrupt-response jitter is tolerable?
We also need to ask ourselves whether the machine will support a wide array of instructions, or if it will have a limited set of instructions. More instructions make the design more difficult, but make programming and using the chip easier. On the other hand, having fewer instructions is easier to design, but can be harder and more costly to program.
Lay out the basic arithmetic operations you want your chip to have:
- Addition/Subtraction
- Multiplication
- Division
- Shifting and Rotating
- Logical Operations: AND, OR, XOR, NOR, NOT, etc.
List other capabilities that your machine has:
- Unconditional jumps
- Conditional Jumps (and what conditions?)
- Stack operations (Push, pop)
Once we know what our chip is supposed to do, it is easier to lay out the framework for our datapath.
Design the Datapath
Right off the bat, we need to determine which ALU architecture our processor will use:
- Accumulator
- Stack
- Register
- A combination of the above 3
This decision, more than any other, is going to have the largest effect on your final design. Do not proceed in the design process until you have made this decision. Once you have your ALU architecture, you create your memory element (stack or register file), and you can lay out your ALU.
Create ISA
Once we have our basic datapath, we can start to design our ISA. There are a few things that we need to consider:
- Is this processor RISC, CISC, or VLIW?
- How long is a machine word?
- How do you deal with immediate values? What kinds of instructions can accept immediate values?
Once we have our machine code basics, we frequently need to determine whether our processor will be compatible with higher-level languages. Specifically, are there any instructions that can be used for function call and return?
Determining the length of the instruction word in a RISC is a very important matter, and one that is worth a considerable amount of thought. For additional flexibility you can utilize a CISC machine instead, at the expense of additional—and more complicated—instruction decode logic. If the instruction word is too long, programmers will be able to fit fewer instructions into memory. If the instruction word is too small, there will not be enough room for all the necessary information. On a desktop PC with several megabytes or even gigabytes of RAM, large instruction words are not a big problem. On an embedded system however, with limited program ROM, the length of the instruction word will have a direct effect on the size of potential programs, and the usefulness of the chips.
Each instruction should have an associated opcode, and typically the length of the opcode field should be constant for all instructions, to reduce complexity of the decoder. The length of the opcode field will directly impact the number of distinct instructions that can be implemented. If the opcode field is too small, you won't have enough room to designate all your instructions. If your opcode is too large, you will be wasting precious bits in your instruction word.
Some instructions will need to be larger than others. For instance, instructions that deal with an immediate value, a memory location, or a jump address are typically larger than instructions that only deal with registers. Instructions that deal only with registers, therefore, will have additional space left over that can be used as an extension to the opcode field.
Example: MIPS R-Type
In the MIPS architecture, instructions that only deal with registers are called R type instructions. With 32 registers, a register address is only 5 bits wide. The MIPS opcode is 6 bits wide. With the opcode and the three register addresses (two source and 1 destination register), an R-type instruction only uses 21 out of the 32 bits available.
The additional 11 bits are broken into two additional fields: Shamt, a 5-bit immediate value that controls the number of places shifted by a shift or rotate instruction, and Func. Func is a 6-bit field that contains additional information about R-Type instructions. Because of the availability of the Func field, all R-Type instructions share an opcode of 0.
Build Control Logic
Once we have our datapath and our ISA, we can start to construct the logic of our primary control unit. These units are typically implemented as a finite state machine, and we can try to map the ISA to the control unit in a logical way.
Microprocessor Components
Basic Components
There are a number of components in a common microprocessor that designers should be familiar with before attempting a design. For an overview of these components, see the Digital Circuits wikibook.
Registers
A register is a storage element typically composed of an array of flip-flops. A 1-bit register can store 1 bit, and a 32-bit register can hold 32 bits, etc. Registers can be any length.
A register has two inputs, a data input and a clock input. The clock input is typically called the "enable". When the enable signal is high, the register stores the data input. When the enable signal is low, the register value stays the same.
Register File
A register file is a whole collection of registers, typically all of which are the same length. A register file takes three inputs, an index address value, a data value, and an enable signal. A signal decoder is used to pass the data value from the register file input to the particular register with the specified address.
Multiplexers
A multiplexer is an input selector. A multiplexer has 1 output, a control input, and several data inputs. For ease, we number multiplexer inputs from zero, at the top. If the control signal is "0", the 0th input is moved to the output. If the control signal is "3", the 3rd input is moved to the output.
A multiplexer with N control signal bits can support 2^N inputs. For example, a multiplexer with 3 control signals can support 2^3 = 8 inputs.
Multiplexers are typically abbreviated as "MUX", and will be abbreviated as such throughout the rest of this book.
[Figures: a 4-input multiplexer with 2 control signal wires; an 8-input multiplexer with 3 control signal wires; a 16-input multiplexer with 4 control wires]
Decoders can also be implemented within these components.
Decoder
A decoder (the inverse of an encoder) can have multiple inputs; depending on the value of the inputs, exactly one of the output signals goes high.
For a 2 input decoder there will be 4 output signals.
      /|- O0
i0---| |- O1
i1---| |- O2
      \|- O3
Suppose the input i has the value 00; then output signal O0 will go high and the remaining three lines, O1 to O3, will be low. In the same fashion, if i has the value 2 (binary 10), then output O2 will be high and the remaining three lines will be low.
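The behaviour of the 2-input decoder can be modelled in a few lines of Python. This is only a behavioural sketch, not a gate-level design; the function name `decode` is an illustrative choice.

```python
def decode(value, num_inputs):
    """Return the output lines [O0, O1, ...] of a decoder as a list of 0s and 1s."""
    num_outputs = 2 ** num_inputs
    if not 0 <= value < num_outputs:
        raise ValueError("input value does not fit in the given number of inputs")
    # Exactly one output, the one whose index equals the input value, goes high.
    return [1 if index == value else 0 for index in range(num_outputs)]


print(decode(0b00, 2))  # [1, 0, 0, 0]
print(decode(0b10, 2))  # [0, 0, 1, 0]
```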
Adder
Program Counter
The Program Counter (PC) is a register structure that contains the address pointer value of the current instruction. Each cycle, the value at the pointer is read into the instruction decoder and the program counter is updated to point to the next instruction. For RISC computers updating the PC register is as simple as adding the machine word length (in bytes) to the PC. In a CISC machine, however, the length of the current instruction needs to be calculated, and that length value needs to be added to the PC.
Updating the PC
The PC, like any other register, can be updated by making the enable signal high. Each instruction cycle the PC needs to be updated to point to the next instruction in memory. It is important to know how the memory is arranged before constructing your PC update circuit.
Harvard-based systems tend to store one machine word per memory location. This means that every cycle the PC needs to be incremented by 1. Computers that share data and instruction memory together typically are byte addressable, which is to say that each byte has its own address, as opposed to each machine word having its own address. In these situations, the PC needs to be incremented by the number of bytes in the machine word.
In this image, the letter M is being used as the amount by which to update the PC each cycle. This might be a variable in the case of a CISC machine.
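The effect of the update amount M on the sequence of instruction addresses can be sketched as follows. The function name and the instruction lengths in the last call are hypothetical; they stand in for whatever lengths a variable-length decoder would report.

```python
def instruction_addresses(start, lengths):
    """Return the address of each instruction, given each instruction's size M."""
    addresses = []
    pc = start
    for m in lengths:
        addresses.append(pc)
        pc += m   # PC update: add M, the size of the current instruction
    return addresses


# Word-addressed memory, one instruction per location: M is always 1.
print(instruction_addresses(0, [1, 1, 1, 1]))  # [0, 1, 2, 3]

# Byte-addressed memory with fixed 32-bit instructions: M is always 4.
print(instruction_addresses(0, [4, 4, 4, 4]))  # [0, 4, 8, 12]

# Byte-addressed memory with variable-length instructions: M varies.
print(instruction_addresses(0, [1, 3, 2, 5]))  # [0, 1, 4, 6]
```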
Example: MIPS
The MIPS architecture uses a byte-addressable instruction memory unit. MIPS is a RISC computer, and that means that all the instructions are the same length: 32 bits. Every cycle, therefore, the PC needs to be incremented by 4 (32 bits = 4 bytes).
Example: Intel IA32
The Intel IA32 (better known by some as "x86") is a CISC architecture, which means that each instruction can be a different length. The Intel memory is byte-addressable. Each cycle the instruction decoder needs to determine the length of the instruction, in bytes, and it needs to output that value to the PC. The PC unit increments itself by the value received from the instruction decoder.
Branching
Branching occurs at one of a set of special instructions known collectively as "branch" or "jump" instructions. In a branch or a jump, control is moved to a different instruction at a different location in instruction memory.
During a branch, a new address for the PC is loaded, typically from the instruction or from a register. This new value is loaded into the PC, and future instructions are loaded from that location.
Non-Offset Branching
A non-offset branch, frequently referred to as a "jump", is a branch where the previous PC value is discarded and a new PC value is loaded from an external source.
In this image, the PC value is either loaded with an updated version of itself, or else it is loaded with a new Branch Address. For simplification we do not show the control signals to the MUX.
Offset Branching
An offset branch is a branch where a value is added to (or subtracted from) the current PC value to produce the new value. This is typically used in systems where the PC value is larger than a register value or an immediate value, and it is not possible to load a complete value into the PC.
In this image there is a second ALU unit. Notice that we could simplify this circuit and remove the second ALU unit if we use the configuration below:
These are just two possible configurations for this circuit.
Offset and Non-Offset Branching
Many systems have capabilities to use both offset and non-offset branching. Some systems may differentiate between the two as "far jump" and "near jump", respectively, although this terminology is archaic.
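A PC unit that supports both kinds of branch has to choose between three candidate values each cycle. The sketch below models that selection in Python. The function and parameter names are hypothetical, and the offset is added to the current PC as described above; some real architectures add it to the already-incremented PC instead.

```python
def next_pc(pc, m, branch_taken=False, target=None, offset=None):
    """Select the next PC value, as the MUX in front of the PC register would."""
    if not branch_taken:
        return pc + m        # normal case: next sequential instruction
    if target is not None:
        return target        # non-offset branch: the old PC value is discarded
    return pc + offset       # offset branch: a signed value is added to the PC


print(next_pc(100, 4))                                   # 104
print(next_pc(100, 4, branch_taken=True, target=0x400))  # 1024
print(next_pc(100, 4, branch_taken=True, offset=-20))    # 80
```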
Instruction Decoder
The Instruction Decoder reads the next instruction in from memory, and sends the component pieces of that instruction to the necessary destinations.
RISC Instruction Decoder
The RISC instruction decoder is typically a very simple device. Because RISC instruction words are a fixed length, the positions of the fields are fixed. We can decode an instruction, therefore, by simply separating the machine word into small parts using wire slices.
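In software, a wire slice corresponds to a shift followed by a mask. The sketch below splits a 32-bit MIPS R-type word into its fields: a 6-bit opcode, three 5-bit register fields, the 5-bit Shamt, and the 6-bit Func. The function name and the dictionary keys (`rs`, `rt`, `rd`, `shamt`, `funct`) follow common MIPS naming but are otherwise just choices made for this example.

```python
def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs":     (word >> 21) & 0x1F,  # bits 25..21, first source register
        "rt":     (word >> 16) & 0x1F,  # bits 20..16, second source register
        "rd":     (word >> 11) & 0x1F,  # bits 15..11, destination register
        "shamt":  (word >> 6) & 0x1F,   # bits 10..6, shift amount
        "funct":  word & 0x3F,          # bits 5..0, function code
    }


# 0x012A4020 encodes "add $t0, $t1, $t2" (registers 8, 9 and 10).
fields = decode_r_type(0x012A4020)
print(fields)
# {'opcode': 0, 'rs': 9, 'rt': 10, 'rd': 8, 'shamt': 0, 'funct': 32}
```

Because every field sits at a fixed position, no decision logic is needed before the fields are extracted; in hardware each line of the dictionary is simply a bundle of wires.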
CISC Instruction Decoder
Decoding a CISC instruction word is much more difficult than the RISC case, and the increased complexity of the decoder is a common reason that people cite when they choose to use RISC over CISC in their designs.
A CISC decoder is typically set up as a state machine. The machine reads the opcode field to determine what type of instruction it is, and where the other data values are. The instruction word is read in piece by piece, and decisions are made at each stage as to how the remainder of the instruction word will be read.
Register File
The register file is the component that contains all the general purpose registers of the microprocessor. The register file may not contain some of the reserved registers, such as the PC, the status register, or other special registers.
Register File
A simple register file is a set of registers and a decoder. The register file requires an address and a data input.
However, this simple register file isn't useful in a modern processor design, because there are some occasions when we don't want to write a new value to a register. Also, we typically want to read two values at once and write one value back in a single cycle. Consider the following equation:
- C = A + B
To perform this operation, we want to read two values from the register file, A and B. We also have one result that we want to write back to the register file when the operation has completed. For cases where we do not want to write any value to the register file, we add a control signal called Read/Write. When the control signal is high, the data is written to a register, and when the control signal is low, no new values are written.
In this case, it is likely advantageous for us to specify a third address port for the write address:
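A register file with two read ports, one write port, and a write-control signal might be modelled as below. This is a behavioural sketch only. The class and parameter names are hypothetical, the Read/Write signal is called `write_enable` here, and reads are assumed to return the values held before the write in the same cycle.

```python
class RegisterFile:
    """Two read ports and one write port, with a write-enable control signal."""

    def __init__(self, count=32):
        self.regs = [0] * count

    def cycle(self, read_addr_a, read_addr_b,
              write_addr=0, write_data=0, write_enable=False):
        """Model one clock cycle: read two registers, optionally write one."""
        a = self.regs[read_addr_a]
        b = self.regs[read_addr_b]
        if write_enable:                 # when low, no register changes
            self.regs[write_addr] = write_data
        return a, b


rf = RegisterFile(8)
rf.cycle(0, 0, write_addr=1, write_data=5, write_enable=True)  # A lives in r1
rf.cycle(0, 0, write_addr=2, write_data=7, write_enable=True)  # B lives in r2

a, b = rf.cycle(1, 2)                                          # read A and B
rf.cycle(0, 0, write_addr=3, write_data=a + b, write_enable=True)  # C = A + B
print(rf.regs[3])  # 12
```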
Register Bank
Consider a situation where the machine word is very small, and therefore the available address space for registers is very limited. If we have a machine word that can only accommodate 2 bits of register address, we can only address 4 registers. However, register files are small to implement, so we have enough space for 32 registers. The solution to this dilemma is to utilize a register bank, which consists of a series of register files combined together.
A register bank contains a number of register files or pages. Only one page can be active at a time, and there are additional instructions added to the ISA to switch between the available register pages. Data values can only be written to and read from the currently active register page, but instructions can exist to move data from one page to another.
As can be seen in this image, the gray box represents the current page, and the page can be moved up and down on the register bank.
If the register bank has N registers, and a page can only show M registers (with N > M), we can address registers with two values, n and m respectively. We can define these values as:
- n = log2(N)
- m = log2(M)
In other words, n and m are the number of bits required to address N and M registers, respectively. We can break down the address into a single value as such:
Where p is the number of bits reserved to specify the current register page. As we can see from this graphic, the current register address is simply the concatenation of the page address and the register address.
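The concatenation can be written out as a shift and an OR. In this sketch the page number is assumed to occupy the upper p = n − m bits of the full address and the in-page register number the lower m bits; the function and parameter names are hypothetical.

```python
def bank_address(page, reg, total_regs, page_regs):
    """Concatenate a page number and an in-page register number."""
    n = (total_regs - 1).bit_length()  # bits needed to address N registers
    m = (page_regs - 1).bit_length()   # bits needed to address M registers
    p = n - m                          # bits needed to select a page
    if not 0 <= page < 2 ** p:
        raise ValueError("page number out of range")
    if not 0 <= reg < 2 ** m:
        raise ValueError("register number out of range")
    return (page << m) | reg


# 32 registers in the bank, 4 visible per page: n = 5, m = 2, p = 3.
print(bank_address(page=0b101, reg=0b10, total_regs=32, page_regs=4))  # 22
```

Page 5 (binary 101) and register 2 (binary 10) concatenate to binary 10110, which is physical register 22.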
Memory Unit
Microprocessors rely on memory for storing the instructions and the data used by software programs. The memory unit is responsible for communicating with the system memory.
Memory Unit
Actions of the Memory Unit
In a Harvard architecture, the data memory unit and the instruction memory unit are separate. However, in a non-Harvard architecture the two memory units are combined into a single module. Most modern PC computer systems are not Harvard, so the memory unit must handle all instruction and data transactions. This can serve as a bottleneck in the design.
Timing Issues
The memory unit is typically one of the slowest components of a microcontroller, because the external interface with RAM is typically much slower than the speed of the processor.
ALU
Microprocessors tend to have a single module that performs arithmetic operations on integer values. This is because many of the different arithmetic and logical operations can be performed using similar (if not identical) hardware. The component that performs the arithmetic and logical operations is known as the Arithmetic Logic Unit, or ALU.
The ALU is one of the most important components in a microprocessor, and is typically the part of the processor that is designed first. Once the ALU is designed, the rest of the microprocessor is implemented to feed operands and control codes to the ALU.
Tasks of an ALU
ALUs typically need to be able to perform the basic logical operations (AND, OR) as well as addition. Including inverters on the inputs enables the same ALU hardware to perform subtraction (by adding an inverted operand with a carry-in of 1, in two's complement arithmetic) and the NAND and NOR operations.
A basic ALU design involves a collection of "ALU Slices", which each can perform the specified operation on a single bit. There is one ALU slice for every bit in the operand.
ALU Slice
Example: 2-Bit ALU
This is an example of a basic 2-bit ALU. The boxes on the right hand side of the image are multiplexers and are used to select between various operations: OR, AND, XOR, and addition.
Notice that all the operations are performed in parallel, and the select signal ("OP") determines which result is passed on to the rest of the datapath. Notice also that the carry signal, which is only meaningful for addition, is generated and passed out of the ALU for every operation, so the carry flag must be ignored whenever the operation is not addition.
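A small behavioral model may make the slice structure more concrete. This is a sketch only: the numeric values of the `OP_*` select codes and the function names are hypothetical, since the encoding used in the original diagram is not given.

```python
# Hypothetical encoding of the select signal "OP".
OP_OR, OP_AND, OP_XOR, OP_ADD = 0, 1, 2, 3


def alu_slice(a, b, carry_in, op):
    """One-bit ALU slice: compute every result in parallel, then select one."""
    or_out = a | b
    and_out = a & b
    xor_out = a ^ b
    sum_out = a ^ b ^ carry_in
    # The carry is generated for every operation, not just for addition.
    carry_out = (a & b) | (carry_in & (a ^ b))
    result = (or_out, and_out, xor_out, sum_out)[op]  # models the multiplexer
    return result, carry_out


def alu(a, b, op, width=2):
    """Chain `width` slices together, passing the carry from bit to bit."""
    result = 0
    carry = 0
    for i in range(width):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    return result, carry
```

For example, `alu(0b11, 0b11, OP_AND)` returns the AND result together with a carry of 1, which shows why the carry output has to be ignored for non-addition operations.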
Example: 4-Bit ALU
Here is a circuit diagram of a 4 bit ALU.
Additional Operations
Logic and addition are some of the easiest, but also the most common operations. For this reason, typical ALUs are designed to handle these operations specially, and other operations, such as multiplication and division, are handled in a separate module.
Notice also that the ALU units that we are discussing here are only for integer datatypes, not floating-point data. Luckily, once integer ALU and multiplier units have been designed, those units can be used to create floating-point units (FPU).
ALU Configurations
Once an ALU is designed, we need to define how it interacts with the rest of the processor. There are a number of different configurations that we can choose, all with benefits and problems. In all images below, the orange represents memory structures internal to the CPU (registers), and the purple represents external memory (RAM).
Accumulator
An accumulator is a register that stores the result of every ALU operation, and is also one of the operands to every instruction. This means that our ISA can be less complicated, because instructions only need to specify one operand, instead of two operands and a destination. Accumulator architectures have simple ISAs and are typically very fast, but additional software needs to be written to load the accumulator with the proper values.
One example of a type of computer system that is likely to use an accumulator is a common desk calculator.
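As a rough illustration of the accumulator model, the sketch below interprets a tiny program in which every instruction names at most one operand. The instruction names `LOAD`, `ADD`, and `STORE` and the function `run_accumulator` are hypothetical and do not correspond to any particular ISA.

```python
def run_accumulator(program, memory):
    """Execute (opcode, address) pairs against a single accumulator register."""
    acc = 0
    for opcode, address in program:
        if opcode == "LOAD":       # extra instruction needed to fill the accumulator
            acc = memory[address]
        elif opcode == "ADD":      # the accumulator is the implicit second operand
            acc = acc + memory[address]
        elif opcode == "STORE":    # ...and the implicit source when storing
            memory[address] = acc
        else:
            raise ValueError("unknown opcode: " + opcode)
    return acc


memory = {0: 2, 1: 3, 2: 0}
program = [("LOAD", 0), ("ADD", 1), ("STORE", 2)]
final_acc = run_accumulator(program, memory)
```

Note that computing a single sum takes three instructions here, which is the software overhead mentioned above.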
Register-to-Register
One of the more common architectures is a Register-to-Register architecture. In this configuration, the programmer can specify both source operands and a destination register. Unfortunately, the ISA needs to be expanded to include fields for both source operands and the destination operand. This requires longer instruction words, and it also requires additional effort (compared to the accumulator) to write results back to the register file after execution. This write-back step can cause synchronization issues in pipelined processors (we will discuss pipelining later).
Register Stack
A register stack is like a combination of the Register-to-Register and the accumulator structures. In a register stack, the ALU reads the operands from the top of the stack, and the result is pushed onto the top of the stack. Complicated mathematical operations require decomposition into Reverse Polish notation, which can be difficult for programmers to use. However, many compilers can produce Reverse Polish notation easily, because they represent expressions internally as binary trees. Hardware also needs to be created to implement the register stack, including PUSH and POP operations, in addition to hardware to detect and handle stack errors (pushing onto a full stack, or popping an empty stack).
The benefit comes from a highly simplified ISA. Operands don't need to be specified, because all operations act on fixed stack locations at the top of the stack.
In the diagram at right, "SP" is the pointer to the top of the stack. This is just one way to implement a stack structure, although it might be one of the easiest.
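The following sketch models a small register stack and uses it to evaluate an expression in Reverse Polish notation. The class name `RegisterStack`, the default depth of 8, and the choice of Python exceptions for the stack errors are all assumptions made for illustration.

```python
class RegisterStack:
    """Fixed-depth register stack; `sp` counts the valid entries."""

    def __init__(self, depth=8):
        self.regs = [0] * depth
        self.sp = 0  # the top of the stack is regs[sp - 1]

    def push(self, value):
        if self.sp == len(self.regs):
            raise OverflowError("push onto a full stack")
        self.regs[self.sp] = value
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("pop from an empty stack")
        self.sp -= 1
        return self.regs[self.sp]


def run_rpn(tokens, depth=8):
    """Evaluate RPN tokens; operators name no operands, they use the stack top."""
    stack = RegisterStack(depth)
    for token in tokens:
        if token in ("+", "*"):
            b = stack.pop()
            a = stack.pop()
            stack.push(a + b if token == "+" else a * b)
        else:
            stack.push(int(token))
    return stack.pop()


# (2 + 3) * 4 written in Reverse Polish notation.
value = run_rpn("2 3 + 4 *".split())
```

The operator tokens carry no operand fields at all, which is the ISA simplification described above.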
Register-and-Memory
One complicated structure is a Register-and-Memory structure, like that shown at right. In this structure, one operand comes from a register file, and the other comes from external memory. In this structure, the ISA is complicated because each instruction word needs to be able to store a complete memory address, which can be very long. In practice, this scheme is not used directly, but is typically integrated into another scheme, such as a Register-to-Register scheme, for flexibility.
Some CISC architectures have the option of specifying one of the operands to an instruction as a memory address, although they are typically specified as a register address.
Complicated Structures
There are a number of other structures available, some of which are novel, and others are combinations of the types listed above. It is up to the designer to decide exactly how to structure the microprocessor, and feed data into the ALU.
Example: IA-32
The Intel IA-32 ISA (x86 processors) uses a register stack architecture for the floating point unit, but a modified Register-to-Register structure for integer operations. Most integer operations can specify a register as the first operand, and a register or memory location as the second operand. The first operand acts as an accumulator, so that the result is stored in the first operand register. The downside to this is that the instruction words are not uniform in length, which means that the instruction fetch and decode modules of the processor need to be very complex.
A typical IA-32 instruction is written as:
ADD AX, BX
Where AX and BX are the names of the registers. The resulting equation produces AX = AX + BX, so the result is stored back into AX.
Example: MIPS
MIPS uses a Register-to-Register structure. Each operation can specify two register operands and a third destination register. The downside is that memory reads and writes need to be made in separate load and store operations, and the fixed 32-bit instruction word means that space is at a premium, so some tasks are difficult to perform.
An example of a MIPS instruction is:
ADD R1, R2, R3
Where R1, R2 and R3 are the names of registers. The resulting equation looks like: R1 = R2 + R3.
FPU
Similar to the ALU is the Floating-Point Unit, or FPU. The FPU performs arithmetic operations on floating point numbers.
An FPU is complicated to design, although the IEEE 754 standard helps to answer some of the specific questions about implementation. It isn't always necessary to follow the IEEE standard when designing an FPU, but it certainly does help.
Floating point numbers
This section is just going to serve as a brief refresher on floating point numbers. For more information, see the Floating Point book.
Floating point numbers are specified in two parts: the exponent (e), and the mantissa (m). The value of a floating point number, v, is generally calculated as:
$$v = m \times 2^{e}$$
IEEE 754
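IEEE 754 format numbers are calculated as:
$$v = (-1)^{s} \times m \times 2^{e - \mathrm{bias}}$$
Here s is the sign bit, e is the stored exponent field, and bias is a fixed offset (127 for single precision) that allows negative exponents to be stored as unsigned values.
The mantissa, m, is "normalized" in this standard, so that it satisfies 1.0 ≤ m < 2.0.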
Floating Point Multiplication
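Multiplying two floating point numbers is done as such:
$$v_1 \times v_2 = (m_1 \times m_2) \times 2^{e_1 + e_2}$$
Likewise, division can be performed by:
$$\frac{v_1}{v_2} = \frac{m_1}{m_2} \times 2^{e_1 - e_2}$$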
To perform floating point multiplication then, we can follow these steps:
- Separate out the mantissa from the exponent
- Multiply (or divide) the mantissa parts together
- Add (or subtract) the exponents together
- Combine the two results into the new value
- Normalize the result value (optional).
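The steps above can be sketched as follows. This is a simplified model, not an IEEE 754 implementation: it assumes each number is held as a hypothetical (mantissa, exponent) pair with value m × 2^e, and it ignores rounding, exponent bias, and special values.

```python
def normalize(m, e):
    """Scale m into [1.0, 2.0) and adjust e so that m * 2**e is unchanged."""
    if m == 0:
        return 0.0, 0
    while abs(m) >= 2.0:
        m /= 2.0
        e += 1
    while abs(m) < 1.0:
        m *= 2.0
        e -= 1
    return m, e


def fp_multiply(x, y):
    """Multiply the mantissas, add the exponents, then normalize."""
    (m1, e1), (m2, e2) = x, y
    return normalize(m1 * m2, e1 + e2)


def fp_divide(x, y):
    """Divide the mantissas, subtract the exponents, then normalize."""
    (m1, e1), (m2, e2) = x, y
    return normalize(m1 / m2, e1 - e2)


three = (1.5, 1)   # 1.5 * 2**1 = 3.0
six = (1.5, 2)     # 1.5 * 2**2 = 6.0
product = fp_multiply(three, six)
quotient = fp_divide(three, six)
```

In this example the raw mantissa product is 2.25, which falls outside the normalized range, so the normalization step halves it and increments the exponent.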
Floating Point Addition
Floating point addition (and, by extension, subtraction) is more difficult than multiplication. Floating point numbers can only be added directly if both numbers have the same exponent. This means that when we add two numbers together, we first need to scale them so that they have the same exponent. Here is the algorithm:
- Separate the mantissa from the exponent
- Compare the two exponents, and determine the difference between them.
- Add the difference to the smaller exponent, so that both exponents are the same.
- Logically right-shift the mantissa of the number that had the smaller exponent by a number of bit positions equal to the difference.
- Add the two mantissas together
- Normalize the result value (optional).
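The addition algorithm above can be modeled with integer mantissas, which makes the right shift explicit. This sketch makes several assumptions: `MANT_BITS` is a hypothetical 8-bit mantissa width, a number is an (integer mantissa, exponent) pair with value m × 2^e, both operands are positive, and bits shifted out are simply truncated rather than rounded.

```python
MANT_BITS = 8  # hypothetical mantissa width, chosen only for illustration


def fp_add(x, y):
    """Add two positive numbers given as (integer mantissa, exponent) pairs."""
    (m1, e1), (m2, e2) = x, y
    # Arrange the operands so that (m1, e1) has the larger exponent.
    if e1 < e2:
        (m1, e1), (m2, e2) = (m2, e2), (m1, e1)
    diff = e1 - e2
    # Raising the smaller exponent by diff means shifting its mantissa right.
    m2 >>= diff
    m = m1 + m2
    e = e1
    # Normalize: if the sum no longer fits in the mantissa, shift it back.
    while m >= (1 << MANT_BITS):
        m >>= 1
        e += 1
    return m, e


three = (0b11000000, -6)  # 192 * 2**-6 = 3.0
one = (0b10000000, -7)    # 128 * 2**-7 = 1.0
total = fp_add(three, one)
```

Because the shift discards low-order bits, adding a very small number to a very large one can leave the large number unchanged, which is a well-known property of floating point addition.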
Floating Point Unit Design
As we have seen from the two algorithms above, an FPU needs the following components:
- For addition/subtraction
  - A comparator (subtracter) to determine the difference between exponents, and to determine the smaller of the two exponents.
  - An adder unit to add that difference to the smaller exponent.
  - A shift unit, to shift the mantissa the specified number of spaces.
  - An adder to add the mantissas together.
- For multiplication/division
  - A multiplier (or a divider) for the mantissa part.
  - An adder for the exponent parts.
Both operation types require a complex control unit.
Both algorithms require some kind of addition/subtraction unit for the exponent part, so it seems likely that we can use just one component to perform both tasks (since both addition and multiplication won't be happening at the same time in the same unit). Because the exponent is typically a smaller field than the mantissa, we will call this the "Small ALU". We also need an ALU and a multiplier unit to handle the operations on the mantissa. If we combine the two together, we can call this unit the "Large ALU". We can also integrate the fast shifter for the mantissa into the large ALU.
Once we have an integer ALU designed, we can copy those components almost directly into our FPU design.
Control Unit
The control unit reads the opcode and instruction bits from the machine code instruction, and creates a series of control codes to activate and operate the various components to perform the desired task.
Simple Control Unit
In its simplest form, a control unit can take the form of a lookup table. The opcode of the machine word is used as the index into the table, and the table entry supplies the control signals that are sent to their respective destinations.
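A lookup-table control unit can be modeled as a simple dictionary. Everything specific in this sketch is hypothetical: the 8-bit instruction word, the 2-bit opcode in the top bits, and the signal names `alu_op`, `reg_write`, `mem_read`, and `mem_write` are chosen only for illustration.

```python
# Hypothetical table: a 2-bit opcode indexes a set of control signals.
CONTROL_TABLE = {
    0b00: {"alu_op": "ADD", "reg_write": 1, "mem_read": 0, "mem_write": 0},  # add
    0b01: {"alu_op": "AND", "reg_write": 1, "mem_read": 0, "mem_write": 0},  # and
    0b10: {"alu_op": "ADD", "reg_write": 1, "mem_read": 1, "mem_write": 0},  # load
    0b11: {"alu_op": "ADD", "reg_write": 0, "mem_read": 0, "mem_write": 1},  # store
}


def decode(instruction):
    """Extract the opcode (top 2 bits of an 8-bit word) and look up its signals."""
    opcode = (instruction >> 6) & 0b11
    return CONTROL_TABLE[opcode]


signals = decode(0b10_000101)  # opcode 10 selects the "load" row
```

In hardware, the same table would typically be a small ROM or a block of combinational logic, with one output wire per control signal.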
Complex Control Unit
A more complex control unit is implemented as a finite state machine (FSM). Multi-cycle, pipelined, and other advanced processor designs require an FSM-based control unit.
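As a rough sketch of an FSM-based control unit for a multi-cycle design, the model below steps through one instruction at a time. The state names (`FETCH`, `DECODE`, `EXECUTE`, `MEMORY`, `WRITEBACK`) and the opcode names are assumptions for illustration, not a description of any specific processor.

```python
def next_state(state, opcode):
    """Return the next control state, given the current state and the opcode."""
    if state == "FETCH":
        return "DECODE"
    if state == "DECODE":
        return "EXECUTE"
    if state == "EXECUTE":
        # Only loads and stores need a memory-access cycle.
        return "MEMORY" if opcode in ("LOAD", "STORE") else "WRITEBACK"
    if state == "MEMORY":
        # A load still has to write its data to a register; a store is done.
        return "WRITEBACK" if opcode == "LOAD" else "FETCH"
    if state == "WRITEBACK":
        return "FETCH"
    raise ValueError("unknown state: " + state)


def trace(opcode):
    """List the states visited while executing one instruction."""
    states = ["FETCH"]
    while True:
        following = next_state(states[-1], opcode)
        if following == "FETCH":
            return states
        states.append(following)
```

Unlike the lookup table, the outputs here depend on the current state as well as the opcode, so different instructions can take different numbers of cycles.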
ALU Design
Add and Subtract Blocks
Addition and Subtraction
Addition and subtraction are similar algorithms. Taking a look at subtraction, we can see that:
- a − b = a + ( − b)
Using this simple relationship, we can see that addition and subtraction can be performed using the same hardware. With this setup, however, the second operand must be negated when we are performing subtraction. Note that in two's-complement arithmetic, negating a value requires not only inverting its bits, but also adding 1 to it. For this reason, when performing subtraction, the carry input into the LSB should be a 1 and not a 0.
Our goal on this page, then, is to find suitable hardware for performing addition.
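The shared add/subtract datapath described above can be checked with a short behavioral model. The function name `add_sub` and the default 8-bit width are assumptions made for this sketch.

```python
def add_sub(a, b, subtract, width=8):
    """Compute a + b or a - b with one adder, in two's-complement arithmetic."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    if subtract:
        b = ~b & mask              # invert every bit of the second operand
    carry_in = 1 if subtract else 0  # the "+1" enters as the LSB carry input
    return (a + b + carry_in) & mask


difference = add_sub(7, 5, subtract=True)
negative = add_sub(5, 7, subtract=True)  # wraps to the 8-bit pattern for -2
```

Only the inverters and the carry input change between the two operations; the adder itself is untouched.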
Bit Adders
Half Adder
A half adder is a circuit that performs binary addition on two bits. It produces a sum bit and a carry-out bit, but it has no carry-in input, so it cannot by itself be chained to add multi-bit numbers.
In Verilog, a half adder can be implemented as follows:
module half_adder(a, b, c, s);
  input a, b;
  output s, c;
  assign s = a ^ b; // sum bit
  assign c = a & b; // carry-out bit
endmodule













