Chip Architect: Detailed Architecture of AMD's Opteron

The die photo of the Integer core is dominated by the 64 bit data busses running North-South. At some points the density may reach up to twenty busses. The busses carry all the source and result operand data between the Integer units. The layout of the busses is bit-interleaved, meaning that equal bit-numbers are grouped together: The bits 0 of all the busses are next to each other at one side of this Integer super highway while all bits 63 can be found at the other side. The separation into individual bytes is also clearly visible.

1.2 Three way Super Scalar CISC architecture

The Opteron is a 3 way super-scalar processor. It can decode, execute and retire three x86-instructions per cycle. These instructions can be quite complex (CISC) operations involving multiple (>2) source operands. The Pentium 4 handles 3 so-called uOps per cycle, where several of these uOps may be needed to implement a single x86 instruction. It may be that the Prescott, the follow-up of the Pentium 4, can handle four uOps per cycle as we revealed here.

In general an x86 instruction can be expressed as:

F( reg,reg ), F( reg,mem ) or F( mem,reg ) where the first operand is both source and destination. The first two forms are general for Integer, MMX and SSE(2). The last form is found mainly in Integer instructions: One source operand is loaded from memory and the result is written back to the same location. The Integer Pipeline handles Loads and Stores for all operations including those for Floating Point and Multimedia instructions.

Overview of Opteron's Processor Core

1.3 A third class of Instructions: Double Dispatch Operations

The original Athlon ( which we'll refer to as the Athlon 32) classifies instructions either as Direct Path or Vector Path.

To the first class belong all the less complex instructions that can be handled by the hardware as a single operation. The more complex instructions (Vector Path) invoke the micro sequencer that executes a micro code program. Instructions are read from micro code ROM and inserted into the 3-way pipeline.

The Opteron introduces a third instruction class: The Double Dispatch instructions, or simply "Doubles". The doubles are generated near the end of the decoding pipeline. The instructions, which either followed the "Direct Path" or were generated by the Micro Code sequencer, are split into two independent instructions. The 3-way pipeline can thus generate up to six instructions per cycle. The instructions are "re-packed" back to three again in the PACK-stage. This extra pipeline stage has often been the subject of speculation since Opteron's introduction at the 2001 Micro Processor Forum. The six-fold symmetry of the "doubling stage" is clearly visible on the Die plot above.

1.4 128 bit SSE(2) instructions are split into Doubles

Appendix C of Opteron's Optimization Guide specifies to which class each and every instruction belongs. Most 128 bit SSE and SSE2 instructions are implemented as double dispatch instructions. Only those that can not be split into two independent 64 bit operations are handled as Vector Path (Micro Code) instructions. Those SSE2 instructions that operate on only one half of a 128 bit register are implemented as a single (Direct Path) instruction.

There are both advantages and disadvantages performance-wise here. A disadvantage may be that the decode rate of 128 bit SSE2 instructions is limited to 1.5 per cycle. In general however this is not a performance limiter because the maximum throughput is limited by the FP units and the retirement hardware to a single 128 bit SSE instruction per cycle. More important is that the extra cycle of latency that a Pentium 4 style implementation would bring is avoided.
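The 1.5 per cycle figure follows directly from the repacking step. Below is a minimal sketch, assuming a simple greedy packer that ignores any pairing restrictions the real PACK-stage may apply; the function names and the `.lo`/`.hi` operation labels are hypothetical and only serve the illustration.

```python
def pack_lines(instructions, width=3):
    """Expand instructions into operations and pack them into lines.

    `instructions` is a list of (name, is_double) pairs. A Double yields two
    operations, any other instruction yields one. Each returned line stands
    for one cycle of the 3-way pipeline.
    """
    ops = []
    for name, is_double in instructions:
        if is_double:
            ops.extend([f"{name}.lo", f"{name}.hi"])
        else:
            ops.append(name)
    # Greedy repacking: fill every line with up to `width` operations.
    return [ops[i:i + width] for i in range(0, len(ops), width)]


def decode_rate(instructions, width=3):
    """Average number of x86 instructions that leave the packer per cycle."""
    lines = pack_lines(instructions, width)
    return len(instructions) / len(lines)


# Six 128 bit SSE2 instructions, each one a Double.
sse2_stream = [(f"addpd{i}", True) for i in range(6)]
print(pack_lines(sse2_stream))
print(decode_rate(sse2_stream))
```

Six Doubles expand into twelve operations, which fill four lines of three, so the stream advances at 6 / 4 = 1.5 instructions per cycle.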

1.5 Using Doubles for 128 bit SSE(2) instructions avoids a 25% latency penalty

In the Pentium 4 an SSE2 instruction is split at a later stage, in the Floating Point unit itself. The Floating Point units accept 128 bit source data at their first stage. They then split the operation in two and combine the two results at the end into one single 128 bit result. This effectively adds one extra cycle to the total latency. For instance: The x87 FADD and FMUL take 5 and 7 cycles while the 128 bit (2x64) SSE2 equivalents need 6 and 8 cycles.

The Opteron, like the Athlon 32, handles both FADD and FMUL in 4 cycles. The SSE2 equivalents are handled with the same 4 cycle latency. An extra cycle would mean a latency increase of 25%, a serious performance limiter, so the correct decision has been made here. If you would look at a highly pipelined FP unit in action then you would see mostly bubbles and few instructions: instructions waiting for the results of others that have yet to finish. Latency is more important here than bandwidth.

The next Pentium, code-named Prescott, has an extra Floating Point Multiplier and Adder as we could reveal to you here. We now think that the extra FP units are used for single port but full 128 bit operation. This would bring back the SSE2 latencies for Add and Multiply to 5 and 7 cycles, beneficial for single thread programs. It would also double the Floating Point bandwidth, which is mainly interesting for Hyper Threading performance.

1.6 Doubles used for some Integer and x87 instructions as well

The Double Dispatch instructions are not only used for SSE and SSE2 instructions. Appendix C of Opteron's Optimization Guide also lists classic x86 instructions like POP and PUSH, some of the integer multiplications and the LEAVE instruction. All these instructions are handled by micro code on the Athlon 32, which is a lot slower. A number of classical x87 instructions are now handled by doubles as well, for instance those FP instructions that have an integer as one of the source operands that first needs to be converted to floating point.

1.7 Doubles handle 128 bit memory accesses

The 128 bit memory references used for SSE and SSE2 are likewise split up into two independent 64 bit accesses which are handled by the integer core. The results are snooped from the Load Data busses of the Data Cache by the Floating Point Core.

The decision to extend the Integer Registers from 32 bit to 64 bit and to split the 128 bit SSE(2) instructions into two 64 bit ones results in an elegant all 64 bit Micro Architecture.

There is a significant advantage in having an L1 Data Cache that can handle 128 bit loads or stores as two independent 64 bit loads or stores per cycle. Two 64 bit loads from different addresses into a single 128 bit SSE2 register with two moves are just as fast as loading a single 128 bit word from memory. Apple had a decent argument for introducing a 128 bit data type containing four 32 bit floating point values, which is as such usable for high quality ARGB color image data (given its customer base). Two 64 bit floating point numbers in a 128 bit word don't seem to serve any practical commercial application (other than making life miserable for compiler builders...). Providing separate 64 bit loads and stores at a two per cycle rate gives a compiler a better chance to combine unrelated 64 bit operations into a single 128 bit one.

1.8 Address Additions before Scheduling US Patent 6,457,115.

A single x86 instruction may need many source operands when memory is involved:

address = base + index << scale + displacement + segment

Up to four arguments are needed to calculate the address (ignoring the 2 bit "scale-field" hard-coded in the instruction). This means that a typical x86 instruction of the format F(reg,mem) needs no fewer than 5 input operands! Now one of the parameters (displacement) is a constant given by the instruction itself. Another parameter (segment) is a "semi-constant" and is typically zero in modern code with a non-segmented flat memory space.

The Athlon 32 adds the segment to the address only when needed after one of the three AGU's (Address Generator Units) has calculated the linear address. It does so during the Data Cache Access which causes an extra cycle of cache load latency. The Opteron has a different implementation. The displacement and segment are summed together before the actual address calculation.

The segment value is considered a constant and thus, just like the displacement, known during decoding. The addition is made during decoding/dispatch and the result is passed on together with the rest of the instruction bits as a new "displacement field" of the instruction.

An exception is generated whenever the segment value does change. The results of operations depending on it are cancelled and the pipeline is restarted from the right point.

The "Decode-Time" address adder might be used for other address additions as well, since the 64 bit mode gets rid of most of the segmentation. One example is the new Relative Address mode, which adds the 64 bit Instruction Pointer (RIP) and the 32 bit displacement from the instruction together.

By reducing the number of input parameters as much as possible during decoding we end up with a maximum of four input parameters for each instruction. Three of them are register variables and the fourth one is a constant.
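The two-step address calculation of this section can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, addresses are modelled as plain 64 bit integers, and the parentheses around the shift make explicit that the index is scaled before anything is added.

```python
MASK64 = (1 << 64) - 1


def fold_constant(displacement, segment_base=0):
    """Decode-time step: merge displacement and segment into one constant."""
    return (displacement + segment_base) & MASK64


def fold_rip_relative(rip, displacement):
    """Decode-time step for RIP-relative addressing: RIP plus displacement."""
    return (rip + displacement) & MASK64


def agu_address(base, index, scale, const):
    """AGU step: two register inputs plus the pre-added constant.

    `scale` is the 2 bit field from the instruction (0..3), so the index is
    multiplied by 1, 2, 4 or 8.
    """
    return (base + (index << scale) + const) & MASK64


# The constant is produced once during decoding and the AGU never sees the
# segment or the displacement separately.
const = fold_constant(displacement=0x10, segment_base=0x1000)
addr = agu_address(base=0x2000, index=3, scale=2, const=const)
print(hex(addr))
```

With the constant folded early, the AGU needs only the base register, the index register and one constant, which matches the three data inputs an address calculation takes from a reservation station.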

1.9 Register Renaming and Out-Of-Order Processing

The Athlon (and Opteron) uses some clever tricks to handle Register Renaming and OOO processing (Out-Of-Order) which allow it to shave some 25% off the integer pipeline. The design allows for a simple and fast scheduler that doesn't need special hardware to handle mis-scheduling caused by cache misses.

Register renaming is used to eliminate "False Dependencies" which limit the number of Instructions Per Cycle (IPC) that a processor can execute. False Dependencies are the result of a limited number of registers. A register that holds an intermediate result needs to be re-used soon for another, maybe unrelated, calculation. Its value is then overwritten and not available anymore. The instruction that overwrites it must always wait for the instruction that needs the previous result.

This serializes the execution of the instructions and limits the IPC. This is especially true for an architecture like x86 which has a very small number of registers. The example below shows how register renaming can eliminate false data dependencies: Register rC is overwritten by the 3rd instruction, so the 3rd instruction has to wait for the 2nd instruction: a False Dependency. With register renaming we can use an "arbitrarily" large register file. There is no need to re-use rC (r3). We can simply use another available register instead, register r7 in this case. The basic rule is that all of the instructions that are "in-flight" are given a different destination register (single assignment).

Non Renamed: rC=rA+rB; rF=rC&rD; rC=rA-rB;

Renamed: r3=r1+r2; r6=r3&r4; r7=r1-r2;
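The single assignment rule can be sketched in a few lines. This is a generic renamer written for illustration, not a model of the Opteron hardware: the physical register names `p0`, `p1`, ... and the numbering are our own and do not match the r-numbers of the example above.

```python
def rename(program, arch_regs):
    """Rename destinations so every in-flight result gets its own register.

    `program` is a list of (dest, op, src1, src2) tuples using architectural
    register names. The result uses physical register names instead.
    """
    # Start with one physical register per architectural register.
    mapping = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}
    next_free = len(arch_regs)
    renamed = []
    for dest, op, src1, src2 in program:
        # Sources read whichever physical register currently holds the value.
        s1, s2 = mapping[src1], mapping[src2]
        # The destination always gets a fresh register (single assignment).
        mapping[dest] = f"p{next_free}"
        next_free += 1
        renamed.append((mapping[dest], op, s1, s2))
    return renamed


program = [("rC", "+", "rA", "rB"),
           ("rF", "&", "rC", "rD"),
           ("rC", "-", "rA", "rB")]
renamed = rename(program, ["rA", "rB", "rC", "rD", "rF"])
for dest, op, s1, s2 in renamed:
    print(f"{dest} = {s1} {op} {s2}")
```

The second instruction still reads the result of the first one (a true dependency), while the third instruction writes a different physical register and no longer has to wait for the second.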

1.10 Renaming the Integer Registers

Opteron has sixteen 64 bit architectural integer registers. Not visible to the programmer are eight more 64 bit scratch registers used to store intermediate results for micro code routines that handle more complex x86 instructions. The Athlon family of processors handles Register Renaming in the simplest possible way, which is a compliment because it often takes a lot of smart thinking to figure out how to do things in the simplest way! People only rarely succeed in this ...

As we said, each instruction in flight needs a different destination register. The total number of renamed registers must be equal to or larger than the sum of all instructions-in-flight plus the architectural-registers. The maximum number of instructions in flight is 72. Add the 16 architectural registers and the 8 scratch registers and you need 96 "renamed registers". Two different structures are used to maintain these registers. The instructions-in-flight results are maintained by the result fields of the 72 entry Re-Order Buffer ( ROB ) and the architectural-registers are maintained by the "Integer Future File and Register File" ( IFFRF ).

Re-Order-Buffer Tag definition

Field	Bits
wrap bit	bit 7
Instruction In Flight Number: re-order buffer index 0...23	bits 6 to 2
Instruction In Flight Number: sub-index 0..2	bits 1 to 0

This configuration allows for a very simple renaming scheme which takes -zero- cycles... Each instruction dispatched from one of the three decode lanes gets a "Re-Order Buffer Tag" or "Instruction In Flight Tag" consisting of:

1) A sub-index 0,1 or 2 which identifies from which of the three lanes the instruction was dispatched.

2) A value 0..23 that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23.

3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.
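The three tag fields can be packed into one byte and compared for age as sketched below. The bit positions (wrap bit in bit 7, index in bits 6 to 2, sub-index in bits 1 to 0) follow the tag definition above; the function names are hypothetical, and we assume that within one cycle lane 0 holds the oldest instruction.

```python
ROB_LINES = 24   # the cycle counter runs 0..23
LANES = 3        # sub-index 0..2


def make_tag(wrap, index, lane):
    """Pack wrap bit, cycle index and lane into an 8 bit tag."""
    if wrap not in (0, 1) or not 0 <= index < ROB_LINES or not 0 <= lane < LANES:
        raise ValueError("field out of range")
    return (wrap << 7) | (index << 2) | lane


def split_tag(tag):
    """Return (wrap, index, lane) for an 8 bit tag."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3


def next_line(wrap, index):
    """Advance the cycle counter and flip the wrap bit when it rolls over."""
    if index == ROB_LINES - 1:
        return wrap ^ 1, 0
    return wrap, index + 1


def is_older(tag_a, tag_b):
    """True if tag_a was dispatched before tag_b (both still in flight)."""
    wrap_a, idx_a, lane_a = split_tag(tag_a)
    wrap_b, idx_b, lane_b = split_tag(tag_b)
    if wrap_a == wrap_b:
        return (idx_a, lane_a) < (idx_b, lane_b)
    # Different wrap bits: the counter rolled over between the dispatches,
    # so the instruction with the larger index belongs to the earlier pass.
    return idx_a > idx_b


last = make_tag(0, 23, 2)
first_after_wrap = make_tag(*next_line(0, 23), 0)
print(is_older(last, first_after_wrap))
```

Because the tag is just the position in the dispatch stream, no lookup or allocation step is needed, which is why the renaming itself costs no extra cycle.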

1.11 The IFFRF: Integer Future File and Register File

This register file is used to maintain the 16 architectural registers and the 8 temporary scratch registers. It has two entries for each of the 16 architectural registers. One of the two can be viewed as the actual register as seen by the programmer. It gets its value when the instruction that produced it has "retired". An instruction is retired when it is sure that no exception or branch-miss-prediction has occurred and all preceding instructions have been retired as well. The value of the register is said to be "non-speculative".

40 entry Integer Future File and Register File: IFFRF

16 entries	Retired Architectural Register Values
16 entries	Speculative Register Values: "Future File"
8 entries	Temporary Registers

Instructions-In-Flight and their results may be cancelled and discarded as long as they have not been retired. Cancellation can be the result of a preceding instruction that caused an exception or of a branch-miss-prediction. Instructions-In-Flight are in principle always speculative. The results stay speculative even if the instruction has finished. The results only become non-speculative at retirement when the retirement logic determines that no exception has occurred.

1.12 The Future File section of the IFFRF

The second entry for each Architectural Register holds the so-called "Future" value. The 16 of them together constitute the so-called Future File. These entries contain the most recent value produced for a certain architectural register by any instruction ( retired or non retired ). The contents of a future file register are speculative as long as the producing instruction has not yet retired. The value becomes non-speculative once the producing instruction successfully retires.

The Future File origins go back to 1985

Instructions write into the Future File as soon as their result is produced. The Future File however does not accept the result if it's not the very latest result for a certain register. If a later instruction has managed to finish earlier and has written its result already into the Future File then it will not accept results anymore for that register from older instructions. Finished Instructions address the Future File with the instruction code register number, a number from 0 to 15 for the 16 architectural registers. The "Re-Order Buffer Tag" is used to determine if a result may be overwritten. Each Future File entry has a corresponding Tag. We will see that an instruction may only write into the Future File entry if it still owns the entry: If the Tags match.

1.13 Exception or Branch Miss-prediction: Overwrite Speculative values with Retired values

All speculative results are cancelled by copying the retired values of the IFFRF over the Future File values of the IFFRF.

The speculative results must be cancelled whenever the retirement logic detects that an exception occurred when the instruction or an earlier one was executed. There are many types of exceptions. Memory accesses can encounter a Page Miss or they can erroneously access a memory area which they are not entitled to access. The divide by zero is another well known exception.

( It shouldn't be for Floating Point numbers because +/- infinity are perfectly valid IEEE Floating Point values)

When we say Speculative Results here we mean more specifically the results that may need to be canceled because of erroneous Control Flow Speculation: The program flow went in a different direction than predicted. Now:

- A branch miss prediction is basically the same as any other exception, but...

- All exceptions are also branch miss-predictions.

An exception causes a change in the program flow much like a conditional call. All instructions that can cause exceptions are thus in fact conditional control flow instructions. Exceptions are however always predicted as not taken and ignored by the branch prediction hardware.

1.14 The Reorder Buffer

We mentioned retirement a number of times now. Retirement is handled with the aid of the reorder buffer. This unit does what its name suggests: It re-orders the instructions back into the original program flow. The Schedulers are responsible for Out-Of-Order execution. They launch instructions to execution units whenever all their source operands are available and the needed execution unit is free. It's the reorder buffer that brings the instructions back into order again.

Operation of the Reorder Buffer

[Figure: reorder buffer entries for index 1 to 12 in lanes 0, 1 and 2, color-coded into three groups: instructions finished Out Of Order whose results are still speculative, instructions being retired now, and retired instructions that are not speculative anymore.]

The reorder buffer itself is split into three identical lanes. Each lane has 24 entries. The lanes and entries correspond to the reorder buffer Tag assigned to each instruction. Each Instruction that finishes writes its result into the reorder buffer using the reorder buffer Tag as address. The instructions also store any events that happened during execution that will require an exception.

In particular conditional branch instructions may report that the branch address they calculated does not correspond with the address that was predicted.

The instruction will leave further information needed for the reorder buffer to do its work. It leaves some info on what kind of instruction it is. It leaves the architectural register address (0..15) that corresponds with its destination register. It also leaves the address where it is located in 'instruction' memory. Some of this info may already have been left in the reorder buffer earlier, when the instruction received its reorder buffer Tag.

The reorder buffer is shared by all instructions. It's also used to reorder Floating Point, SSE(2) and MMX instructions. These instructions however do not write their result data in the reorder buffer. They use the 120 entry renamed floating point register file for that purpose. The reorder buffer however is still used to track the instruction code info and address, exception flags, ready status and retirement status. All instructions are retired with the aid of the reorder buffer.

1.15 Retirement and Exception Processing

The image shows how Instructions can be retired at the moment when all previous instructions are retired. Three instructions can be retired per cycle. The Instruction Control Unit (ICU) accesses the reorder buffer contents for the three instructions. The instructions are retired if there are no exception flags set. The result data is written to the Retired Entries of the IFFRF. The latter is basically a write-only process. These values are only used in case of an exception. In this case they are used to overwrite the speculative values of the Future File.

The ICU will handle a branch miss prediction by forwarding the instruction's address to the Instruction Fetch Unit at the beginning of the pipeline. The branch will then be re-executed, now with the right prediction. Other exceptions require an exception routine call. The ICU can for instance save some relevant data in the temporary registers of the IFFRF and invoke the exception call or a micro code function.
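In-order retirement and the recovery step can be sketched as follows. This is a simplified model under our own assumptions: the reorder buffer is a flat list in program order rather than three circular 24 entry lanes, and the entry fields (`done`, `exception`, `dest`, `result`) and function names are hypothetical.

```python
def retire_cycle(rob, head, retired_regs, width=3):
    """Retire up to `width` finished instructions in program order.

    Returns (new_head, exception_index). `exception_index` is None unless the
    instruction at the head of the buffer carries an exception flag.
    """
    for _ in range(width):
        if head >= len(rob) or not rob[head]["done"]:
            break                      # older work not finished yet: wait
        entry = rob[head]
        if entry["exception"]:
            return head, head          # this instruction must not retire
        # Write-only update of the retired half of the register file.
        retired_regs[entry["dest"]] = entry["result"]
        head += 1
    return head, None


def recover(retired_regs, future_file):
    """Cancel all speculative results: copy retired values over the Future File."""
    future_file[:] = retired_regs


rob = [
    {"done": True, "exception": False, "dest": 1, "result": 10},
    {"done": True, "exception": False, "dest": 2, "result": 20},
    {"done": False, "exception": False, "dest": 3, "result": None},
    {"done": True, "exception": False, "dest": 4, "result": 40},
]
retired = [0] * 16
head, exc = retire_cycle(rob, 0, retired)
print(head, exc, retired[:5])
```

The fourth instruction has finished but cannot retire, because the third one in front of it is still executing. Its result stays speculative until the head of the buffer reaches it.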

1.16 Exception Processing is always delayed until Retirement

Exception processing must always be delayed until the instruction that caused the exception is not speculative anymore:

A memory access exception for instance may be caused by accessing an Array with an Index that is out of bounds. The program may have a test for out-of-bound indices and code to handle it. The branch prediction hardware however will most likely predict that the out-of-bound test is Not True because the Index is OK most of the time. The processor will thus access the array with an out-of-bounds Index anyway and may well cause a memory access exception. Exception handling is delayed until retirement, where the instruction plus its exception flags are discarded because of the branch-miss-prediction.

It's for the same reason that speculative Stores to memory are delayed and held in the LSU (Load Store Unit) until retirement.

1.17 Retirement of Vector Path and Double Dispatch Instructions

A single Vector Path instruction may produce many instructions. The Micro sequencer inserts these instructions into the 3 way pipeline, three per cycle. They do not mix with Direct Path instructions during decoding and retirement. The actual Retirement takes place when the last line of 3 instructions is ready. The retirement hardware scans from the first line of 3 micro code generated instructions to the last line, accumulating all possible exceptions that occurred. Retirement follows if no exception has occurred, otherwise the appropriate exception call is made.

Instructions generated by Doubles can mix with other (Direct Path) instructions during decoding and retirement. The two instructions generated by a Double must however retire simultaneously: imagine a PUSH that does retire the memory store but doesn't retire the Stack Pointer update. This leads to the limitation that both instructions generated by a Double must be in the same 3 instruction line during retirement.

1.18 Out-Of-Order processing: Instruction Dispatch

We are now ready to describe in greater detail how Out Of Order processing is handled. We go back to the Instruction Dispatch Stage. Instruction Dispatch means here that Instructions are sent to the Schedulers. They are not sent to the execution units yet. The Instructions do access the Register File though. Three Instructions can look up a total of nine register values from the IFFRF each cycle. The Future File entries are accessed. The Future File contains the latest speculative results for each of the 16 architectural registers. The three Instructions are then placed together with these register values into the three Schedulers.

Now it's highly likely that not all previous instructions were finished, so many of the register values are older values from previous instructions. Each Instruction that is dispatched clears the valid flag for the architectural register it will modify. It also leaves its Tag. Succeeding instructions now know the Future File entry is not valid anymore but they also know the Tag of the instruction that will provide the data they need. They will use this Tag later to pick up the result directly from the result busses. The instruction that invalidated the register will later finish, write its result to the Future File ( if it still owns the entry ) and then set the valid bit again.

The instruction has ownership over an entry in the Future File if the Tags match. It acquires ownership when the instruction is dispatched and it loses ownership if another instruction is dispatched that has the same destination register. An instruction writes its result in the Future File only if it is still owner. Subsequent instructions can pick up the result from there. If the instruction is not owner anymore then it won't modify the Future File entry. Any other instruction that needed this result was already dispatched and picked up the result directly from the result-busses.
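The ownership rule can be sketched with a small class. This is an illustrative model only: the class and method names are hypothetical, and tags are plain integers standing in for the Re-Order Buffer Tags.

```python
class FutureFile:
    """Toy model of the Future File: value, valid flag and owner tag per register."""

    def __init__(self, num_regs=16):
        self.value = [0] * num_regs
        self.valid = [True] * num_regs
        self.tag = [None] * num_regs

    def read(self, reg):
        """Dispatch-time source lookup: the value if valid, else the producer's tag."""
        if self.valid[reg]:
            return ("value", self.value[reg])
        return ("tag", self.tag[reg])

    def dispatch(self, dest, tag):
        """A newly dispatched instruction takes ownership of its destination."""
        self.valid[dest] = False
        self.tag[dest] = tag

    def writeback(self, dest, tag, result):
        """Accept the result only if the finishing instruction still owns the entry."""
        if self.tag[dest] != tag:
            return False
        self.value[dest] = result
        self.valid[dest] = True
        return True


ff = FutureFile()
ff.dispatch(3, tag=10)             # older instruction writing register 3
ff.dispatch(3, tag=11)             # younger instruction takes over ownership
waiting_on = ff.read(3)            # a consumer learns which tag to watch for
print(waiting_on)
print(ff.writeback(3, 11, 222))    # the younger one finishes first: accepted
print(ff.writeback(3, 10, 111))    # the older one finishes late: rejected
print(ff.read(3))
```

The late result of the older instruction is not lost. Its consumers were dispatched before ownership changed and hold tag 10, so they pick the value up from the result busses.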

1.19 The Schedulers / Reservation Stations

Each of the (up to) three Instructions that are Dispatched gets assigned to a Reservation Station within the Scheduler they are sent to. Each scheduler has eight Reservation Stations. That's up from six in the Athlon 32 and up from five in the first Athlon prototypes. The Reservation Station gathers all remaining input data needed for the instruction from the result busses. It monitors the Tags of these busses to see if the instructions from which data is needed are about to produce their results. (Register File Bypass)

The Tag busses run one cycle in advance of the result data-busses. The Reservation Station does not need to look at all the busses. The Tag's sub-index identifies which of the three ALU's will produce the result. It also knows if the data will come from one of the two cache read ports. It can select the Tag bus in advance rather than having to test all the Tags.
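How a Reservation Station gathers its missing operands can be sketched as follows. The model is deliberately simplified and the names are hypothetical: there is a single broadcast per result, and we ignore that the real Tag busses run a cycle ahead of the data and that a station selects one bus in advance.

```python
class ReservationStation:
    """Holds the operands of one instruction; missing ones are tracked by tag."""

    def __init__(self, operands):
        # `operands` maps a slot name to ("value", v) or ("tag", t), which is
        # what a register lookup at dispatch time would have delivered.
        self.values = {}
        self.waiting = {}
        for slot, (kind, payload) in operands.items():
            if kind == "value":
                self.values[slot] = payload
            else:
                self.waiting[slot] = payload

    def snoop(self, tag, result):
        """Capture a broadcast result for every slot that waits on `tag`."""
        for slot in [s for s, t in self.waiting.items() if t == tag]:
            self.values[slot] = result
            del self.waiting[slot]

    def ready(self):
        """True once no operand is outstanding."""
        return not self.waiting


rs = ReservationStation({"A": ("value", 5), "B": ("tag", 42)})
rs.snoop(7, 99)      # a result for somebody else: ignored
print(rs.ready())
rs.snoop(42, 8)      # the producer of operand B broadcasts its result
print(rs.ready(), rs.values)
```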

The Scheduler's Reservation Station Entries

Instruction Data
"CONST" 64 bit Displacement + Segment or Instruction Pointer

Input Data
VALUE: 64 bit register 'A'	TAG: reg. 'A'
VALUE: 64 bit register 'B' or 64 bit Index register	TAG: reg. 'B' / Index
VALUE: 64 bit Base register	TAG: Base reg.

Input Status
VALUE: 4 bit ZAPS flags: Zero, Aux, Parity, Sign	TAG: ZAPS flags
VALUE: 1 bit OF/C flag: either OverFlow or Carry	TAG: OF/C flag

Until now we've neglected the x86 status flags. Many x86 instructions use one or more of the six x86 status flags as an input. An instruction may change some, all or none of the status flags. This means that different flags may be produced by different instructions. Luckily there are two rules that help to simplify the scheduler.

rule 1: An instruction that modifies any of the ZAPS flags ( Zero, Aux, Parity, Sign ) modifies all of them. This means that these can be handled by a single 4 bit entry in the reservation station.

rule 2: An instruction that uses the OverFlow flag (signed integer) does not use the Carry flag (unsigned integer). A single 1-bit reservation station entry can be used for the one which is needed.

1.20 Each x86 instruction can launch both an ALU and an AGU operation

A single x86 instruction waiting in a Reservation Station of one of the Schedulers can launch up to two operations. It can launch an integer operation to its associated ALU and it can launch a memory operation to its AGU (Address Generator Unit). The simplest integer instructions of type F( reg,reg ) do not access memory and launch an ALU operation only. Integer instructions of type F( reg,mem ) launch a memory load first and subsequently launch an ALU operation when the load data arrives.

Integer instructions of type F( mem,reg ) are implemented in the same way. The Load is now a Load/Store operation. The Load/Store keeps hanging in the LSU (Load Store Unit), where it waits for the result of the ALU operation to be stored in memory. Non Integer instructions such as Floating Point and Multi Media instructions specifying a memory access will launch an AGU operation only.

The Floating Point / MMX operation itself is then handled by the Floating Point Unit itself.

Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.
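The mapping from instruction form to launched operations, together with the one ALU plus one AGU limit per Scheduler and cycle, can be written down as a small sketch. The form labels and function names are our own shorthand, and `pending` is assumed to be listed oldest first.

```python
OPERATIONS = {
    "int reg,reg": ["ALU"],                     # no memory access
    "int reg,mem": ["AGU load", "ALU"],         # load first, then compute
    "int mem,reg": ["AGU load/store", "ALU"],   # store waits in the LSU for the ALU result
    "fp/mmx mem": ["AGU"],                      # the FP unit does the arithmetic itself
}


def operations_for(form):
    """Operations that one x86 instruction of the given form launches."""
    return OPERATIONS[form]


def launch_cycle(pending):
    """Pick at most one ALU and one AGU operation from `pending`.

    `pending` is a list of (instruction, operation) pairs in age order. The
    two launched operations may belong to different instructions.
    """
    launched, used = [], set()
    for instr, op in pending:
        unit = "AGU" if op.startswith("AGU") else "ALU"
        if unit not in used:
            used.add(unit)
            launched.append((instr, op))
    return launched


print(operations_for("int mem,reg"))
print(launch_cycle([("i1", "ALU"), ("i2", "ALU"), ("i3", "AGU load")]))
```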

1.21 The Scheduling of an ALU operation US Patent 6,535,972.

An ALU operation generally needs two register operands and optionally some status bits. An x86 instruction that accesses memory will leave the Load value in register 'B'. The Reservation Station waits until it has all needed input operands (data and status). The Scheduler observes all eight reservation stations and will launch the ALU operation if it's the oldest instruction that is ready to launch. The Scheduler sends all operands plus instruction information to the ALU that is associated with it.
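The "oldest ready instruction first" selection can be sketched in a few lines. The representation is an assumption made for the example: each occupied station is a dict with a hypothetical `age` field (smaller means dispatched earlier) and a `ready` flag, and an empty station is `None`.

```python
def pick_oldest_ready(stations):
    """Return the oldest station whose operands are all present, or None."""
    ready = [s for s in stations if s is not None and s["ready"]]
    return min(ready, key=lambda s: s["age"], default=None)


# Eight reservation stations of one scheduler, three of them empty.
stations = [
    {"name": "add", "age": 4, "ready": True},
    None,
    {"name": "sub", "age": 1, "ready": False},   # oldest, but still waiting
    {"name": "xor", "age": 2, "ready": True},
    None,
    {"name": "and", "age": 7, "ready": True},
    None,
    {"name": "or", "age": 3, "ready": False},
]
print(pick_oldest_ready(stations)["name"])
```

The oldest instruction is skipped because it still waits for an operand, so a younger but ready instruction is launched instead. This is what makes the execution order differ from the program order.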

Reservation Station entries typically involved in an ALU operation:

Instruction Data
"CONST" 64 bit Displacement + Segment or Instruction Pointer

Input Data
VALUE: 64 bit register 'A'	TAG: reg. 'A'
VALUE: 64 bit register 'B' or 64 bit Index register	TAG: reg. 'B' / Index
VALUE: 64 bit Base register	TAG: Base reg.

Input Status
VALUE: 4 bit ZAPS flags: Zero, Aux, Parity, Sign	TAG: ZAPS flags
VALUE: 1 bit OF/C flag: either OverFlow or Carry	TAG: OF/C flag

The Reservation Station does not actually need to catch the last operand(s) itself. The Reservation Station can be bypassed. The ALU may receive the bus number which will carry the last result so it can catch the operand itself. If you take a look at the Die photo then you see that all three ALU's are next to each other, even though each receives only operations from its own scheduler. The bypass mechanism lets them exchange data directly without the need of going back and forth to the schedulers.

1.22 The Scheduling of an AGU operation for memory access US Patent 6,457,115.

We saw how a single x86 instruction may need up to four arguments to calculate the memory address ( ignoring the 2 bit scale field hard-coded in the instruction ). This includes up to two register variables (base and index). We also saw how displacement and segment could be added together already during instruction decoding. Segment is considered a semi-constant; a restore mechanism is provided for the rare case that it is changed.

Reservation Station entries typically involved in an AGU operation:

( address = base + index << scale + displacement + segment )

Instruction Data
"CONST" 64 bit Displacement + Segment or Instruction Pointer

Input Data
VALUE: 64 bit register 'A'	TAG: reg. 'A'
VALUE: 64 bit Index register or 64 bit register 'B'	TAG: Index reg. / 'B'
VALUE: 64 bit Base register	TAG: Base reg.

Input Status
VALUE: 4 bit ZAPS flags: Zero, Aux, Parity, Sign	TAG: ZAPS flags
VALUE: 1 bit OF/C flag: either OverFlow or Carry	TAG: OF/C flag

The Reservation Station waits until it has all needed input operands. The Scheduler observes all eight reservation stations and will launch the AGU operation if it's the oldest instruction that is ready to launch. The Scheduler sends all operands plus instruction information to the AGU that is associated with it. The Reservation Station can be bypassed for AGU operations as well. The AGU may receive the bus number which will carry the last result so it can catch the operand itself.

1.23 Micro Architectural Advantages of Opteron's Integer Core

Extremely fast Schedule-Execute loop eliminates the need for cache hit/miss speculation hardware.

One of the most elegant features of the Opteron Integer core is its extremely fast Schedule-Execute loop, which is significantly faster than that of any other architecture. It can schedule instructions in one cycle and execute them in the next, and it does so at a very high frequency. This effectively eliminates the need for wasteful logic to correct the pipeline in case of an L1 Data Cache miss.

The scheduler sees one cycle in advance that load data is coming from the L1 Data Cache by checking the Tag of the result bus. It uses this information to schedule the instruction(s) that use the load data. At the very end of the schedule cycle it then gets the hit/miss signal. A miss prevents the instruction from being removed from the Scheduler in the next cycle. The instruction still went to the execution units (ALU or AGU) but the miss flag will invalidate it. Other instructions will carry on normally and the Scheduler will continue in the next cycle with the victim instructions still on board, now waiting for data to become available from further on in the memory hierarchy.

The Pentium 4 in contrast will Replay all dependent instructions issued in up to 7 cycles after the missed load. Non dependent instructions do not need to be replayed. The abundance of Double Pumped ALU capacity is mainly used to add extra capacity for all the replayed instructions. The Alpha EV6 "pulls back" and invalidates all the instructions that were scheduled in the two cycles after a load that misses even though it has a short seven cycle pipeline running at a significantly lower frequency.

The latter two architectures also use these mechanisms to support another type of data speculation which is not yet supported by the Opteron: Speculative Load/Store reordering. Such a mechanism may well be supported by a successor, however, and could then have the smallest miss-prediction penalty possible because of its extremely short Schedule-Execute loop.

Avoids a large multi port register file between scheduler and execution unit

The micro-architectural feature which makes the short Schedule-Execute loop possible is the functional split of the classical large Renamed Register File in two sub-structures. One is the small, low latency IFFRF. This is exactly the smallest possible subset needed during Out-Of-Order execution. The other subset is the 72 entry Re-Order buffer. This subset is the one that is needed for In-Order retiring and recovery from exceptions and branch-miss-predictions.

The IFFRF is small and has a low single cycle latency. It has many read and write ports (9 read, 8 write) for wide super-scalar operation. The 72 entry Re-Order buffer is larger but much simpler. It uses three simple independent 24 entry sub-arrays. Each of the three basically needs only one read and one write port. The four (1+3) units replace a single large Renamed Register file which would have needed all the read and write ports of the much smaller IFFRF. Such a large Renamed Register File would probably add two cycles of extra latency right in the middle of the Schedule-Execute loop.

Chapter 2, Opteron's Floating Point Units

2.1 The Floating Point Renamed Register File

Opteron's Floating Point renamed register file has been increased from 88 to 120 entries. It is a renamed register file in the classical meaning of the word. It's a single entity that must contain all architectural (non-speculative) and speculative values for the registers defined by the instruction set.

The Opteron restores the support for 72 speculative instructions. This support had been decreased from 72 to 56 with the introduction of the Athlon XP core, which included the eight 128 bit XMM registers for SSE but did not increase the size of the 88 entry renamed register file.

Each 128 bit XMM register uses two entries in the renamed register file. The Opteron thus uses 32 entries to hold the architectural (retired) state of the now 16 XMM registers, which explains the increase: 88 + 32 makes 120 entries.

40 of the 120 entries are used to hold the architectural (non-speculative) state of the registers defined by the instruction set. 32 are used for the sixteen XMM registers. 8 are used for the eight x87/MMX registers.

A further 8 register entries are used for micro code scratch registers, sometimes called micro-architectural registers. These registers are not defined by the instruction set and are not directly visible to the programmer. They are used by micro code to carry out complex floating point calculations such as sine or log instructions.

The 48 (40+8) entries that define the architectural state of the processor are defined by the 48 entry Architectural Tag Array. The entries that hold the very latest speculative values for the 48 architectural register entries are identified with the 48 entry Future File Tag Array.
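The register budget described above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the text; the function name and its parameters are made up for illustration, and the assumption that the Athlon XP also reserved eight x87/MMX and eight scratch entries is inferred from the 72 to 56 drop mentioned earlier.

```python
def speculative_entries(total_entries, xmm_regs, x87_regs=8, scratch_regs=8):
    """Entries left over for speculative results in the FP renamed register file.

    Each 128 bit XMM register occupies two entries; x87/MMX and micro code
    scratch registers occupy one entry each.
    """
    architectural = x87_regs + scratch_regs + 2 * xmm_regs
    return total_entries - architectural


# Figures quoted in the text (the labels are informal):
athlon = speculative_entries(88, xmm_regs=0)      # no XMM registers
athlon_xp = speculative_entries(88, xmm_regs=8)   # eight XMM registers added
opteron = speculative_entries(120, xmm_regs=16)   # sixteen XMM registers

print(athlon, athlon_xp, opteron)
```

Under these assumptions the three results line up with the 72, 56 and 72 speculative instructions mentioned in this section.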

The speculative state of the processor needs to be discarded in case of a branch-miss-prediction or exception. This is handled by overwriting the 48 entries of the Future File Tag Array with those of the Architectural Tag Array.

Each entry of the renamed register file is 90 bits wide. Floating Point values are expanded to a total of 90 bits (68 mantissa, 18 exponent, 1 sign bit and 3 class bits). The three class bits contain extra information about the floating point number. The class bits also identify non floating point numbers (integers), which are not expanded when written into the renamed register file.

The 120 registers

Register group	Entries
Non-speculative registers	48
FP/MMX registers (arch.)	8
SSE/SSE2 registers (arch.)	32
Micro Code Scratch registers (arch.)	8
Speculative registers	72
FP/MMX registers (latest)	8
SSE/SSE2 registers (latest)	32
Micro Code Scratch registers (latest)	8
Remaining speculative	24

The 90 bit registers: subdivision of the 90 bits for FP

Field	Bits
Mantissa bits	68
Exponent bits	18
Sign bit	1
Class Code bits	3

Definition of the 3 bit Class Code

Zero
Infinity
Quiet NaN (Not a Number)
Signaling NaN (Not a Number)
Denormal (very small FP number)
MMX / XMM (non FP contents)
Normal (FP number, not very small)
Unsupported

2.2 Floating Point rename stage 1: x87 stack to absolute FP register mapping

The "stack features" of the legacy x87 are undone in this first stage of the Floating Point pipeline. The x87 instructions access the eight architectural 80 bit registers via a 3 bit Top Of Stack (TOS) pointer. Instructions use the TOS as both source and destination. The second argument can be another value on the stack relative to the TOS register or a memory operand. The 3 bit TOS pointer is maintained in the 16 bit x87 FP status register.

The x87 TOS register relative references are replaced by absolute references which directly identify the x87 registers involved in the operation. A speculative version of the TOS pointer is used for the translations. The 3 bit pointer can be updated by the actions of up to three instructions per cycle. Instructions can be speculative but are still in order at this stage. They've not yet been scheduled by the Floating Point Out-Of-Order scheduler.

If an exception or a branch-miss-prediction occurs then the speculative TOS pointer is replaced with the non-speculative retired one which is retrieved from the reorder buffer. The retired version reflects the value of the TOS during the instruction just prior to the one that caused the exception or branch miss prediction.
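The translation done in this first rename stage amounts to modulo-8 arithmetic on the TOS pointer. The sketch below illustrates it; the function names are hypothetical, and the push/pop behaviour is the standard x87 convention (a load decrements TOS, a pop increments it) rather than something taken from AMD documentation.

```python
NUM_X87_REGS = 8


def st_to_absolute(tos, i):
    """Map the stack-relative name ST(i) to an absolute x87 register number."""
    return (tos + i) % NUM_X87_REGS


def push(tos):
    """A load such as FLD decrements TOS before writing the new ST(0)."""
    return (tos - 1) % NUM_X87_REGS


def pop(tos):
    """A popping instruction such as FADDP increments TOS after executing."""
    return (tos + 1) % NUM_X87_REGS


# Up to three instructions per cycle may update the speculative TOS in order,
# for example FLD, FLD, FADDP starting from TOS = 6:
speculative_tos = 6
for update in (push, push, pop):
    speculative_tos = update(speculative_tos)

print(speculative_tos, st_to_absolute(speculative_tos, 3))
```

On an exception or branch-miss-prediction the model would simply overwrite `speculative_tos` with the retired value.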

2.3 Floating Point rename stage 2: Regular Register Renaming

The actual register renaming takes place in this stage. Each instruction that needs a destination register gets one assigned here. The destination registers must be unique with respect to all other instructions in flight: no two instructions may write to the same renamed register.

Up to three free register entries are obtained from the register free list. There are 120 registers available in total. The free-list can have a maximum of 72 free entries, equal to the maximum number of instructions in flight.

The remaining 48 entries hold the values of the (non-speculative) architectural registers: the eight x87/MMX registers, the eight scratch registers (accessible by micro code only) and the sixteen 128 bit XMM registers for SSE and SSE2, each using two entries. These registers are not at a fixed location but may occupy any of the 120 entries. This is what makes the free-list necessary. The 48 entries occupied by the architectural registers mentioned above are identified by the 48 entry Architectural Tag Array. It has an entry for each architectural register with a value that points to one of the 120 renamed registers.

Up to three instructions can thus be renamed per cycle. The data dependencies are handled with the aid of another structure, the 48 entry Future File Tag Array. This array contains pointers to the 48 renamed registers that contain the very latest speculative values for each of the architectural registers. The instructions that are being renamed access this structure to obtain the renamed registers where they can find their source operands. The instructions then store the renamed register that was allocated to them in the Future File Tag Array so that subsequent instructions know where to find the result data.

Example: An instruction uses architectural registers 3 and 5 as input data and writes its result back into register 3. It will first read entries 3 and 5 to obtain the pointers to the renamed registers that contain or will contain the latest values for registers 3 and 5, say renamed registers 93 and 12. The instruction now knows its source registers, 93 and 12, and can overwrite entry 3 of the Future File Tag Array with the renamed register it was assigned to store its result, say 97. A subsequent instruction that needs architectural register 3 will now use renamed register 97.

If an exception or branch-miss-prediction occurs then the 48 entries of the Future File Tag Array are overwritten with the 48 entries from the Architectural Tag Array. All speculative results are thereby discarded. The pointers in the Architectural Tag Array were written there by the retirement logic. Up to three values can be written per cycle for each line of instructions that retires. The values are taken from the Reorder Buffer. The Reorder Buffer is shared by all instructions.

Floating Point instructions that finish write certain information, such as exception status and the TOS used, into the Reorder Buffer. This information also includes the destination register they modify: both the number of the architectural register and the number of the renamed register are stored in the Reorder Buffer. The two are used to update the Architectural Tag Array at retirement, one as the data and the other as the entry number of the Architectural Tag Array.
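The interplay of free-list, Future File Tag Array and Architectural Tag Array can be modelled in a few lines. This is a toy sketch, not the hardware algorithm: the class and method names are invented, the reset mapping (architectural register r in entry r) is arbitrary, and the way freed entries return to the free-list is an assumption, since the text does not describe it.

```python
class FpRenamer:
    """Toy model of the second rename stage (sizes taken from the text)."""

    def __init__(self, num_physical=120, num_arch=48):
        self.num_physical = num_physical
        # Assumption: at reset, architectural register r lives in entry r.
        self.arch_tags = list(range(num_arch))      # Architectural Tag Array
        self.future_tags = list(self.arch_tags)     # Future File Tag Array
        self.free_list = list(range(num_arch, num_physical))

    def rename(self, dest, sources):
        """Look up the sources, then allocate a fresh entry for the result."""
        src_phys = [self.future_tags[s] for s in sources]
        new_phys = self.free_list.pop(0)
        self.future_tags[dest] = new_phys
        return src_phys, new_phys

    def retire(self, dest, new_phys):
        """Retirement makes the result architectural and frees the old entry."""
        old_phys = self.arch_tags[dest]
        self.arch_tags[dest] = new_phys
        self.free_list.append(old_phys)

    def recover(self):
        """Exception or branch-miss-prediction: drop all speculative state."""
        self.future_tags = list(self.arch_tags)
        in_use = set(self.arch_tags)
        self.free_list = [p for p in range(self.num_physical) if p not in in_use]


renamer = FpRenamer()
srcs, dst = renamer.rename(dest=3, sources=[3, 5])   # reads entries 3 and 5
srcs2, dst2 = renamer.rename(dest=7, sources=[3])    # now sees the new entry
renamer.retire(3, dst)
renamer.recover()                                     # discards the write to 7
```

After `recover()` the Future File Tag Array is again identical to the Architectural Tag Array, which is the overwrite described above.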

2.4 Floating Point instruction scheduler

The Floating Point scheduler uses the following three criteria to determine if it may dispatch an instruction to the execution pipeline it has been assigned to (FPMUL, FPADD or FPMISC):

1) The instruction's source registers and/or memory operands will be available.

2) The instruction Pipeline to which the instruction has been assigned will be available.

3) The result bus for that instruction pipe will be available on the clock cycle in which the instruction will complete.

The scheduler will always dispatch the oldest instruction that is ready for each of the three pipelines. When we say "will be available" we mean two cycles from the current cycle. It takes two cycles to get an instruction into execution: one to schedule and another to read the 120 entry renamed register file. An instruction first checks if its source registers are available when it is placed in the scheduler. After that it continuously monitors the Tag busses of the result busses for all source data still missing.

The Tag busses run two cycles ahead of the result busses. The scheduler can thus see two cycles in advance which results will become ready. A dispatched instruction will arrive two cycles later at its execution unit, where it grabs the incoming result data from the selected result bus. The execution pipelines are 4 stages deep. Instructions with lower latencies may leave the pipeline earlier, after two or three cycles. Two cycles, however, is the shortest execution latency.

Instructions that need load data from memory wait until the data arrives from the L1 Data Cache or from further away in the Memory Hierarchy. The scheduler knows two cycles in advance that data is coming. This is one cycle more than for integer loads. The extra cycle stems from the Data Convert and Classify unit that pre-processes Floating Point data from memory.

A load miss prevents the instruction that needed the load data from being removed from the scheduler. The instruction stays in the scheduler until the data arrives with a load hit. Any instruction that was scheduled depending on a load that missed is invalidated and its results are not written to the register file.

2.5 The 5 read and 5 write ports of the floating point renamed register file

The renamed register file is accessed directly after the instructions are dispatched Out Of Order by the Scheduler.

Up to three instructions can access the register file simultaneously. One instruction for each of the three functional units. The FPMUL and FPADD instructions obtain two source operands each while instructions for the FPMISC unit only need a single operand.

Three write ports are available to write results from the floating point units back to the register file. The write addresses arrive earlier than the result data. This is used to decode the write address in the cycle before the write occurs. All three units can have memory data as a source operand. The reorder buffer tags that accompany the data coming from memory are translated to renamed register locations by the load mapper. Two 64 bit loads can be handled per cycle.

The new 120 entry register file shows bypass logic at both sides. The bypasses are used to pass result and/or load data directly to succeeding dependent instructions, thereby avoiding any extra delay that would result from actually writing to and reading from the register file.

2.6 The Floating Point processing units

There is a range of processing units connected to the FPMUL, FPADD and FPMISC register file ports. The ports determine to which of the three floating point pipelines a particular unit belongs.

The x87 and SSE2 floating point multiplier handles 64 bit and 80 bit extended length multiplications. The large Wallace tree, which handles the 64 bit multiplications for 80 bit extended floating point and 64 bit integer multiplications, can be split into two independent Wallace trees that handle the dual 32 bit SIMD multiplications used for SSE and 3Dnow! functions (US Patent 6,490,607). This unit can also autonomously handle floating point divide and square root functions. These instructions are not implemented with micro code but are handled entirely by this unit itself with a single direct path instruction. The unit contains bi-partite lookup tables for this purpose (US Patent 6,256,653). These tables contain base values and differential values for rapid reciprocal and reciprocal square root approximations, which are then used as a starting point for the divide and square root instructions. This unit is connected to the FPMUL ports of the register file.

The x87 and SSE2 floating point adder handles 64 bit and 80 bit extended length additions and subtractions. It is connected to the FPADD ports of the register file.

The 3Dnow! and SSE dual 32 bit floating point unit handles the single length SIMD floating point instructions as introduced in 3Dnow! by AMD and SSE by Intel (the latter is called 3Dnow! professional in the Athlon XP). This unit is connected to both the FPMUL and FPADD ports and can handle one 64 bit (2x32) instruction of each group per cycle, so one MUL type and one ADD type instruction per cycle. 128 bit instructions of either type have a throughput of one per two cycles.

The 2x64 bit MMX/SSE ALU unit is a dual unit that can handle certain packed integer 128 bit SSE instructions at a throughput of 1 per cycle. It is connected to both the FPMUL and FPADD ports. The FPMUL ports are used even though the instructions aren't multiplications but rather adds, subtracts and logic functions. The idea is to double the size of the operands that can be read from and written to the register file to a full 128 bits. The 128 bit SSE instructions are still handled as two individual 64 bit operations. The throughput is increased to one per cycle because they can be executed by both the FPMUL and the FPADD pipelines.

The 1x64 bit MMX/SSE Multiplier unit handles MMX and SSE integer multiplies. It is connected to the FPMUL ports of the register file. It can handle a single 64 bit MMX instruction per cycle or 128 bit SSE instruction with a 2 cycle throughput using two 64 bit operations.

The FP Store unit, more recently called the FP Miscellaneous unit, handles not only the stores but also a number of other single operand functions such as Integer to Float and Float to Integer conversions. It further provides many functions used by Vector Path generated micro code to handle more complex x87 operations. It contains the Floating Point Constant ROM, which holds a range of floating point constants such as pi, e, log2 et cetera.

2.7 The Convert and Classify units

Load data that arrives from the L1 Data Cache or from further on in the Memory Hierarchy goes through the Convert and Classify unit first. The load data is converted, if appropriate, to the internal 87 bit floating point format (1 sign bit, 18 exponent and 68 mantissa bits). The floating point values are also classified into a three bit Class code. The 87+3=90 bits are then stored into the 90 bit register file. The Class code can subsequently be used to speed up floating point operations. For example: only the class code needs to be tested to find out if a number is zero, instead of all 86 mantissa plus exponent bits.
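To make the classification step concrete, here is a small sketch that classifies a 64 bit IEEE 754 double from its raw bit pattern. It only illustrates the idea of pre-computing a class once at load time; the class names are plain strings rather than the actual 3 bit hardware codes (which are not given here), and the function names are hypothetical.

```python
import struct


def double_to_bits(x):
    """Return the raw 64 bit pattern of a Python float (IEEE 754 double)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def classify_double(bits):
    """Classify a double by inspecting its exponent and fraction fields once."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0:
        return "zero" if fraction == 0 else "denormal"
    if exponent == 0x7FF:
        if fraction == 0:
            return "infinity"
        # The top fraction bit distinguishes quiet from signaling NaNs.
        return "quiet NaN" if fraction >> 51 else "signaling NaN"
    return "normal"


for value in (0.0, -0.0, 1.5, 5e-324, float("inf")):
    print(value, classify_double(double_to_bits(value)))
```

Once the class is stored next to the value, a later "is this zero?" test reduces to comparing the small class field instead of examining every exponent and mantissa bit.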

We've seen that the Floating Point Scheduler runs two cycles ahead of the actual execution units, one cycle more than the Integer Scheduler. It observes the Tag busses, which identify two cycles in advance which results will become ready on a certain result bus. The Tag busses also indicate in advance which data will come from memory. However, the hit/miss signal may later indicate that the data was erroneous because of a Cache Miss. The Convert and Classify units add an extra cycle with at least somewhat useful work in order to give the scheduler the time to take the Hit/Miss signal into account.

The Optimization manual has a whole appendix (E) dedicated to SSE and SSE2 optimizations related to the classification of the contents of the SSE registers. Instructions that operate on another data type than expected should be avoided. Revision C does not need these optimizations anymore. It is likely that Revision C can perform these format translations itself without the intervention of microcode after an exception.

2.8 X87 Status handling: FCOMI / FCMOV and FCOM / FSTSW pairs US Patents 6,393,555 & 6,425,074

AMD has managed to eliminate much of the x87 legacy overhead and has sped up some important but problematic functions, more specifically those involving the x87 status register. Early Athlons used a large area to handle the processing of the 16 bit floating point status register. This has all gone, some of it already in the Athlon XP.

Program code with a conditional test on x87 floating point values used to kill Out-Of-Order advantages because of the serializing nature of the instructions that make the floating point status code available to the Integer Pipeline which handles the conditional branches. The Opteron has special hardware to avoid this serialization and to preserve Out Of Order processing.

x87 Floating Point Status register

Bit	Field
15	x87 Busy
14	Cond. Code 3
13-11	Top of Stack (TOS)
10	Cond. Code 2
9	Cond. Code 1
8	Cond. Code 0
7	Exception Status
6	Stack Fault
5	Precision exception
4	Underflow exception
3	Overflow exception
2	Zero Divide exception
1	Denormal Operand exception
0	Invalid Operation exception

Different parts of the x87 floating point status register are handled in different ways. The register is a bit of a mixture of different things. It contains for example the 3 bit TOS pointer that indicates which of the eight x87 registers is the current top of stack. The first Rename Stage holds the speculative version of this pointer. It is used there to translate the TOS relative register addresses to absolute x87 register addresses. All finishing instructions preserve their copy of this value in the Re-Order buffer when they finish. These copies then become the non-speculative versions of TOS at the moment that the instructions are retired out of the Re-Order buffer.

The Retirement Logic may detect that an exception or branch-miss-prediction has occurred. It then replaces the speculative version of the TOS in the first rename stage with the latest retired, non-speculative version. The speculative 3 bit TOS value is used before the instructions are scheduled Out Of Order. The only later use is during Retirement, which is handled In-Order again. This means that special Out-Of-Order hardware for the TOS can be, and is, eliminated.

The execution of a Floating Point instruction may itself cause an exception. Most bits of the x87 status register are dedicated flags that identify exceptions. Exceptions are always handled In-Order at retirement time. This again means that any special Out-Of-Order hardware for these bits can be, and is, eliminated.

The tricky part is in the CC (Condition Code) bits. These bits contain exception data most of the time but may sometimes contain information which is the result of a Floating Point compare and which must be processed in a full Out-Of-Order fashion. The Opteron has special new hardware to handle these cases. This hardware detects combinations of instructions that need special handling.

Condition Code bits after a x87 Floating Point compare

Cond. Code 3	Cond. Code 2	Cond. Code 1	Cond. Code 0	Compare Result
0	0	0	0	ST(0) > source
0	0	0	1	ST(0) < source
1	0	0	0	ST(0) = source
1	1	0	1	Operands were unordered

The first combination is an FCOMI with an FCMOV. The FCOMI does a compare and sets the CC bits according to the result. It then moves the compare result to the Integer Status Register. The FCMOV then does a conditional floating point move depending on the Integer Status bits. Opteron's hardware allows full speed processing here by implementing an Out-Of-Order bypass so that the FCMOV does not have to wait for the actual Integer Status Flags.

The second combination is the FCOM and FSTSW pair. The first instruction is identical to the FCOMI instruction with the exception that it does not copy the CC bits to the Integer Status bits. It's the FSTSW (Floating point Store Status Word) instruction that copies the 16 floating point status bits to the AX register or to a Memory Location, from where they can be used for conditional operations. The latter is a serializing operation because all floating point instructions need to finish first before the 16 status flags are known. The Opteron has special hardware that does allow maximum speed Out-Of-Order processing without the serializing disadvantage. It also provides a way to recover from any (rare) miss-predictions.
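What software does with the stored status word can be sketched as follows. The bit positions (C0 = bit 8, C2 = bit 10, C3 = bit 14) and the SAHF mapping of AH onto CF, PF and ZF are standard x86 behaviour; the function names are hypothetical and the code only decodes the compare outcomes listed in the condition code table above.

```python
C0_BIT, C2_BIT, C3_BIT = 8, 10, 14

# Keys are (C3, C2, C0); C1 is not used to encode the compare outcome.
COMPARE_RESULTS = {
    (0, 0, 0): "ST(0) > source",
    (0, 0, 1): "ST(0) < source",
    (1, 0, 0): "ST(0) = source",
    (1, 1, 1): "operands were unordered",
}


def decode_compare(status_word):
    """Decode the outcome of an FCOM from a 16 bit x87 status word."""
    c0 = (status_word >> C0_BIT) & 1
    c2 = (status_word >> C2_BIT) & 1
    c3 = (status_word >> C3_BIT) & 1
    return COMPARE_RESULTS.get((c3, c2, c0), "not a compare result")


def flags_after_sahf(status_word):
    """Integer flags after FSTSW AX followed by SAHF (AH is bits 15..8)."""
    ah = (status_word >> 8) & 0xFF
    return {"CF": ah & 1, "PF": (ah >> 2) & 1, "ZF": (ah >> 6) & 1}


print(decode_compare(0x0100), flags_after_sahf(0x0100))
```

The serializing cost discussed above comes from having to know the whole status word before a sequence like this can proceed.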

The result of all AMD's x87 optimizations is that the Opteron runs circles around the Pentium 4 when it comes to x87 processing. It has removed large special purpose circuits for status processing and implemented a few small ones that handle the cases mentioned. The shift to SSE2 floating point, however, will make the removed area overhead more important than the speed-ups.

Chapter 3, Opteron's Data Cache and Load / Store units

3.1 Data Cache: 64 kByte with three cycle data load latency

The Opteron's relatively large L1 Data Cache supports a three cycle Load-Use latency. Actually only the second and third cycles are used to access the Cache memory itself. The first cycle is spent in the Integer Pipeline for the x86 memory address calculation using one of the three available AGU's. The address calculated by the AGU is sent to the memory array in the second cycle, where it is decoded. This means that at the end of the second cycle it is known at which word line the data can be found.

The right data word is activated at the beginning of the third cycle. Data is accessed in the memory array, selected and sent forward to the Integer Pipeline or the Floating Point pipeline. Below is the more detailed timing of a typical Integer x86 instruction F( reg,mem ). This type of instruction first loads data from memory and then performs an operation on it.

We see that in the same cycle in which the instruction is dispatched to the Scheduler it is also dispatched to the so-called "Pre-Cache Load/Store unit" or simply LS1. Instructions in this unit compete for cache access together with those in LS2. The instructions in LS1 first need to wait for their effective memory address. They monitor the result busses of the AGU's. An instruction in LS1 knows from which AGU it can expect its address. Instructions check the re-order buffer Tag which identifies the address one clock-cycle in advance. In general, an instruction in LS1 will fetch its address and wait for its turn to probe the cache.

Typical timing of an F ( reg, mem ) x86 operation.

Cycle	Integer Scheduler	Load / Store Unit (LS1)	ALU's and AGU's	Cache Address Decode	Cache Data Access
0	Dispatched to Scheduler	Dispatched to LS1
1	AGU Scheduled
2		Load Scheduled	Address Generation
3				Cache Address Decode
4	ALU Scheduled				Cache Data Access
5			Dependent Operation

Instructions may also route the address immediately to the cache if there are no other (older) instructions waiting. This is the case in our example above. In any case, each instruction will keep the address for possible follow-on actions. In our case the address is sent directly from the AGU result bus to the Data Cache's address decoders. Data comes back from memory one cycle later and is routed to the Integer Pipeline. LS1 places the re-order buffer Tag one cycle in advance on the Data Cache result Tag bus so that the Integer ALU schedulers can schedule any instruction depending on the load data.

3.2 Two accesses per cycle, read or write: 8 way bank interleaved, two way set associative

The Opteron's cache has two 64 bit ports. Two accesses can occur each cycle. Any combination of loads and stores is possible. The dual port mechanism is implemented by a banking mechanism: the cache consists of 8 individual banks, each with a single port. Two accesses can occur simultaneously if they are to different banks.

Virtual Address bit used to access the L1 data Cache

Cache Line Index									Bank			Byte
14	13	12	11	10	9	8	7	6	5	4	3	2	1	0

A single 64 byte cache line is subdivided into 8 independent 64 bit banks. Two accesses are to two different banks if their addresses have a different bank-field, address bits 3 to 5. These are the lowest possible address bits that can be used for this purpose. This scheme effectively maps adjacent 64 bit words to different banks. The principle of data locality makes these bits the most suitable choice.
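The bank selection rule is simple enough to express directly. The sketch below extracts the bank field from address bits 3 to 5 and checks whether two accesses could proceed in the same cycle; the function names are hypothetical and the model ignores everything except the bank field.

```python
def bank(address):
    """Bank number of an access: address bits 5..3 (one of 8 banks)."""
    return (address >> 3) & 0x7


def can_access_together(addr_a, addr_b):
    """Two accesses can share a cycle only if they hit different banks."""
    return bank(addr_a) != bank(addr_b)


# Adjacent 64 bit words fall in different banks ...
print(can_access_together(0x1000, 0x1008))
# ... but words one full cache line (64 bytes) apart share a bank.
print(can_access_together(0x1000, 0x1040))
```

Code that walks through memory sequentially therefore rarely collides, which is the data locality argument made above.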

The 64 kByte Cache is two way set-associative. The cache is split into two 32 kByte ways accessed with Virtual Address bits [14:0].

A hit into any of the two ways is detected if the Physical Address Tag, bits [39:12], which is stored alongside with each cache line, is identical to bits [39:12] of the Physical Address. Virtual to Physical address translation is performed with the help of the TLB's (Translation Look aside Buffers). A port accesses 2 ways and compares 2 tags with the translated address. Each port has its own TLB to do the address translation.

The two 64 bit ports are used simultaneously when exchanging cache-lines with the rest of the memory hierarchy. This means that the memory bus from the unified L2 cache to the L1 data cache is now 128 bit wide. The event where a new cache line is needed will take first 4 cycles to evict the old cache-line and then 4 cycles more to load the new cache-line when it arrives.

3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLB's US Patent 6,453,387.

The L1 Data Cache has room to store 1024 cache lines out of the total of 17,179,869,184 cache lines that fit within the 40 bit physical address space. Accesses need to check if the stored cache line corresponds with the actual memory location they want to access. It is for this purpose that the Tag rams store the higher physical address bits belonging to each of the 1024 cache-lines. There are two copies of the Tag ram to allow the simultaneous operation of two access ports.

The Tag rams are accessed with bits [14:6] of the virtual address. Each Tag ram outputs 2 Tags, one for each way of the 2 way set-associative cache. The wanted cache-line can be in either way. The Tag rams contain physical addresses. A physical address uniquely defines a memory position throughout the entire distributed system memory.
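The tag check can be sketched as a virtually indexed, physically tagged lookup. The field widths follow the text (set index from virtual bits [14:6], tag from physical bits [39:12]); the data structure and function names are illustrative only, and the translated physical address is simply passed in as a parameter.

```python
LINE_BYTES = 64
SETS_PER_WAY = 512          # 32 kByte per way / 64 byte lines
WAYS = 2


def set_index(virtual_address):
    """Virtual address bits [14:6] select one of 512 sets."""
    return (virtual_address >> 6) & (SETS_PER_WAY - 1)


def physical_tag(physical_address):
    """Physical address bits [39:12] form the 28 bit tag."""
    return (physical_address >> 12) & ((1 << 28) - 1)


def lookup(tag_ram, virtual_address, physical_address):
    """Return the way that hits, or None on a miss."""
    wanted = physical_tag(physical_address)
    for way, stored in enumerate(tag_ram[set_index(virtual_address)]):
        if stored == wanted:
            return way
    return None


tag_ram = [[None] * WAYS for _ in range(SETS_PER_WAY)]
va, pa = 0x7654, 0x12345654            # same page offset, as required
tag_ram[set_index(va)][1] = physical_tag(pa)
print(lookup(tag_ram, va, pa))

# Sanity checks on the counts quoted in the text.
print(64 * 1024 // LINE_BYTES, 2**40 // LINE_BYTES)
```

The last line reproduces the 1024 lines held by the cache and the 17,179,869,184 lines that fit in a 40 bit physical address space.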

The cache is however accessed with the virtual addresses as defined by the program. Virtual addresses only have a meaning within a process context. This means that a virtual-to-physical address translation is needed to be able to check the physical Tags. This translation is handled by a lengthy sequence of four table lookups in memory: the virtual address field [47:12] is divided into four equal sub-fields that each index into one of the four tables. Each table points to the start of the next table. The last table, the page table, then finally contains the translated address.

Virtual Address to Physical Address Translation: The Table Walk.

virtual address

page map level 4 table offset	page directory pointer offset	page directory offset	page table offset	page offset
47-39	38-30	29-21	20-12	11-0

page map level 4 ==> page dir. pointer ==> page directory ==> page table

physical address	page offset
39-12	11-0
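The four lookups of the table walk can be sketched with nested dictionaries standing in for the tables in memory. The field boundaries are those listed above; everything else (the dictionary representation, the function names, the absence of permission bits and large pages) is a simplification for illustration.

```python
def split_virtual_address(va):
    """Split a 48 bit virtual address into four 9 bit indices and an offset."""
    return (
        (va >> 39) & 0x1FF,   # page map level 4 table offset
        (va >> 30) & 0x1FF,   # page directory pointer offset
        (va >> 21) & 0x1FF,   # page directory offset
        (va >> 12) & 0x1FF,   # page table offset
        va & 0xFFF,           # page offset
    )


def table_walk(page_map_level_4, va):
    """Follow the four tables; return the physical address or None."""
    *indices, offset = split_virtual_address(va)
    entry = page_map_level_4
    for index in indices:
        entry = entry.get(index)
        if entry is None:
            return None           # translation not present
    return (entry << 12) | offset  # entry is physical address bits [39:12]


# One mapped 4k page: indices 1, 2, 3, 4 lead to physical frame 0xABCDE.
tables = {1: {2: {3: {4: 0xABCDE}}}}
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x005
print(hex(table_walk(tables, va)))
```

Each level of the loop corresponds to one memory access in the real walk, which is why the procedure is so lengthy.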

This so-called Table-Walk is a very lengthy procedure indeed. The Opteron uses so-called Translation Look aside Buffers (TLB's) to remember the 40 most recently used address translations. 32 of these remember 4k page translations using the scheme above. The remaining 8 are used for so-called 2M / 4M page translations which skip the last table and define the translations for large 2 Megabyte pages. ( The 4M pages are only used for backwards compatibility )

The virtual address bits [47:12] are compared with all 40 entries of the TLB's in the second cycle of the three cycle access. At the end of the second cycle we know if any one of them matches. Each entry also contains the associated physical address bits [39:12]. These are selected in the third cycle and compared with the physical Tags to test if we have a cache hit.

3.4 The 512 entry second level TLB

If the necessary translation is not found within the 40 entries of the primary TLB's, then there is a second chance that it is available in the level-2 TLB which is shared by both ports. This table contains 512 address translations. This larger table can be used to update the primary TLB's with a minor delay. It is organized in a different way: It is 512 entry 4-way set-associative.

This means that it has 128 sets of 4 translations each. Virtual address bits [18:12] are used to select one of the 128 sets. We get four translations giving us four chances that we have the translation we need. Each translation contains the rest of the virtual address bits [47:19]. We can check if we have the right translation by comparing these bits with our address. The matching entry then contains the associated physical address field [39:12] we need.
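A set-associative TLB lookup of this shape can be sketched as below. The set index and tag fields follow the text (bits [18:12] and [47:19]); the list-of-lists representation, the function names and the naive fill policy are assumptions made for illustration.

```python
SETS = 128
WAYS = 4


def tlb_set(va):
    """Virtual address bits [18:12] select one of the 128 sets."""
    return (va >> 12) & (SETS - 1)


def tlb_tag(va):
    """Virtual address bits [47:19] identify the translation within a set."""
    return (va >> 19) & ((1 << 29) - 1)


def l2_tlb_lookup(tlb, va):
    """Return physical address bits [39:12] on a hit, or None on a miss."""
    for tag, frame in tlb[tlb_set(va)]:
        if tag == tlb_tag(va):
            return frame
    return None


def l2_tlb_fill(tlb, va, frame):
    """Install a translation; evicts the oldest entry of a full set."""
    entries = tlb[tlb_set(va)]
    if len(entries) == WAYS:
        entries.pop(0)
    entries.append((tlb_tag(va), frame))


tlb = [[] for _ in range(SETS)]
l2_tlb_fill(tlb, 0x0000_7F12_3456_7000, 0xABCDE)
print(l2_tlb_lookup(tlb, 0x0000_7F12_3456_7ABC))
```

Only four tags are compared per lookup, instead of all 512, which is what makes the larger table practical.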

3.5 Error Coding and Correction

The L1 Data Cache is ECC protected (Error Checking and Correcting). Eight bits are used for each 64 bits to be able to correct single bit errors and to detect dual bit errors with the help of a 64 bit Hamming SEC/DED scheme (Single Error Correction / Double Error Detection). Six parity bits are needed to retrieve the position of the error bit.

E C C	64 bit Hamming SEC/DED error location retrieval
E C C	bit 63 bit 0
	0	1	1	0	1	0	1	0	1	1	1	0	1	1	0	0	1	1	1	1	0	1	x	0	1	0	1	0	1	1	1	0	0	0	1	0	1	1	0	0	0	1	0	1	1	0	1	1	0	1	0	1	0	1	1	0	1	0	1	0	1	1	1	0
0																							x
1																							x
1																							x
0																							x
1																							x
0																							x

The six bits are shown in the column at the left. A one means that a parity error was detected. The six bits represent the parity of the 32 purple bits in each row. The parity errors together now represent a 6 bit index that points to the error position. Additional parity bits are used to detect double bit errors and errors in the parity bits themselves. (Thanks to Collin for bringing this to my attention)
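The way six parity checks point at the failing bit can be demonstrated with a simplified model. In this sketch parity bit k covers the 32 data positions whose index has bit k set, so the pattern of failing checks spells out the index of a single flipped bit. This is an illustration of the principle only, not the exact code used in the Opteron; the function names are hypothetical.

```python
def parity_bits(word):
    """Six parity bits over a 64 bit word.

    Bit k is the parity of the 32 data bits whose position has bit k set.
    """
    result = 0
    for k in range(6):
        parity = 0
        for position in range(64):
            if (position >> k) & 1:
                parity ^= (word >> position) & 1
        result |= parity << k
    return result


def locate_error(stored_parity, word):
    """Index of a single flipped data bit (0 also means 'no error seen')."""
    return stored_parity ^ parity_bits(word)


word = 0x0123456789ABCDEF
stored = parity_bits(word)

corrupted = word ^ (1 << 22)            # flip bit 22
index = locate_error(stored, corrupted)
repaired = corrupted ^ (1 << index)
print(index, repaired == word)
```

Note that a flip of bit 0 is invisible to these six checks alone, which is one reason additional parity bits are needed, as mentioned above.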

3.6 The Load / Store Unit, LS1 and LS2

The Load/Store units handle the accesses to the Data Cache. This type of unit plays an increasingly important role in modern speculative out-of-order processors. They are expected to grow significantly in size and complexity in newer architectures on the horizon, an extra reason to give the Opteron's Load/Store units a closer look. The split into LS1 and LS2 is sometimes described as LS1 being for the L1 Data Cache and LS2 for the L2 Cache. This is far too simplified, however, and even incorrect. We'll go into more detail here.

3.7 The "Pre Cache" Load / Store unit: LS1

The Pre-Cache Load/Store unit (LS1) is the place where dispatched memory accesses wait for the addresses generated by the AGU's (Address Generator Units). LS1 has 12 entries. Whenever a memory access is dispatched to the Integer Scheduler it is also dispatched to an entry in LS1. The re-order tag bus belonging to the AGU indicates if the required address is being calculated and will be available on the result bus of the AGU in the next cycle. An access waiting in LS1 knows at which AGU to look for the address.

When an instruction has its address coming, or has already received it, it may probe the cache. There are two access ports. The two oldest accesses in LS1 will be allowed to probe the Cache. Both load and store instructions probe the cache. A load will actually access the cache to obtain the load data. The store presents its address but will never write from LS1 to the Cache. Store instructions will only write after they've received the data to be written and when they are retired.

Stores must be retired first because the store instruction may be speculative and may be discarded later. Imagine that Microsoft patches a buffer overflow exploit by adding a test for the overflow. This test becomes a conditional branch that prevents the write to the buffer in case of an overflow. The overflow tends to never happen so the branch predictor will predict the branch as not-taken. It will do so also in the case that the overflow finally does happen. The write to the buffer will now be executed speculatively.

So the actual writes to the cache must be delayed until after retirement when it's verified that the branch predictions were correct.

These deferred stores do not introduce any real delays however. Loads that access the cache also check LS1 and LS2 to see if there are any pending writes to the memory location they are about to read. If so then they catch the data directly from LS1 or LS2 without delay.

The stores in LS1 however do present their address to the cache hit/miss detection logic. If it turns out that the cache-line is not present then it may be loaded as soon as possible from the Level 2 cache or from system memory. This can be a good policy since there is a significant chance that following loads will need the same cache-line. Stores may receive the data they have to write to memory while waiting in LS1 as long as the data comes in time, otherwise they move on to LS2 to receive the data there.

3.8 Entering LS2: The Cache Probe Response

All accesses in LS1 probe the Cache and then move on to the Post-Cache Load/Store unit ( LS2 ).

An access can either be a Load, a Store or a Load-Store. (The latter reads first and then writes the result back to the same location.)

All accesses which came from LS1 first wait to see the results from the cache probe: whether it was a cache hit or a miss, and whether there was a cache parity error. They also receive the physical address which was translated from virtual into physical by the TLB's. Together with the physical address come the page attribute bits which determine for instance if the memory is cacheable or not.

Then in the following cycle, in case there was a cache miss, the instructions receive a so-called MAB tag ( Missed Address Buffer Tag ). This tag will later be used to see if a missed cache-line arrives from the L2 cache or from system memory. The MAB tag needs to be used instead of the generally used re-order buffer tags. Multiple Loads and Stores may depend on the same cache line and thus on the same MAB tag. All these accesses miss and they'll all receive the same MAB tag.

The Bus Interface Unit (BIU) will load missed cache-lines from the unified L2 cache or system memory to fill the data-cache. It also presents the so-called Fill-tag to LS2. This fill-tag is compared to the MAB-tag of all accesses that missed. The accesses that match the fill-tag are changed from miss to hit.
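The MAB tag bookkeeping can be illustrated with a minimal sketch. It assumes, purely for illustration, that a tag is just a small number handed out per missed cache line and never recycled. The names Ls2Entry, MissAddressBuffer, record_miss and fill are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Ls2Entry:
    name: str
    line_addr: int                  # physical address of the 64 byte cache line
    hit: bool = False
    mab_tag: Optional[int] = None

class MissAddressBuffer:
    def __init__(self) -> None:
        self.tag_of_line: Dict[int, int] = {}   # line address -> MAB tag

    def tag_for(self, line_addr: int) -> int:
        # All misses to the same cache line share one tag.
        return self.tag_of_line.setdefault(line_addr, len(self.tag_of_line))

def record_miss(entry: Ls2Entry, mab: MissAddressBuffer) -> None:
    entry.hit = False
    entry.mab_tag = mab.tag_for(entry.line_addr)

def fill(entries: List[Ls2Entry], fill_tag: int) -> None:
    # The fill tag is compared against the MAB tag of every access that missed.
    for e in entries:
        if not e.hit and e.mab_tag == fill_tag:
            e.hit = True            # changed from miss to hit
```

One arriving cache line therefore wakes up every load and store that was waiting for it, however many there are.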

3.9 The "Post Cache" Load Store unit: LS2

The so-called Post-Cache Load Store unit ( LS2 ) has 32 entries. It is organized in a somewhat "shift register" like way so that the oldest outstanding access ends up in entry 0. Each of the 32 entries has numerous fields. Many of these fields are accompanied by a comparator and other logic to see if the field matches a certain condition. All accesses stay in LS2 at least until retirement. Accesses that missed the cache will wait in LS2 until the cache-line arrives from the memory hierarchy. All Stores wait in LS2 for their retirement first before actually writing data to memory.

Various fields in an LS2 buffer entry

Field groups:	Type	Address & Data	Tags	Status Flags	Action Flags	....
Fields:	Valid Flags	Acc. Type	Store Data 64 bit	Virtual Address	Physical Address	mem Type	Instruction Tag	Write Data Tag	Missed Address Buffer Tag	Cache Hit / Miss	Retired access	Last Store in Buff (LIB)	Self Mod. Code Flag	Snoop Re-Sync Flag	Store Load Forward	....

Retired Stores in LS2 that have the hit/miss flag set to hit may use a cache port simultaneously with a probing store in LS1. The retired store from LS2 writes to the data cache itself but does not use the cache hit/miss logic. The probing store from LS1 only uses the hit/miss logic but doesn't access the data cache itself. This shared use is important performance wise because each store would occupy a cache port twice otherwise, first while probing from LS1 and secondly when writing from LS2 after retirement. This would halve the store bandwidth of the L1 Data Cache.

3.10 Retiring instructions in the Load Store unit and Exception Handling

All access instructions, Loads as well as Stores, stay in LS2 until they are retired. Loads may be removed directly from LS2 when they are retired to make place for new instructions. Stores must still write their data to memory. They wait to do so until retirement when it is determined that no exception or branch misprediction occurred. Writes are removed from LS2 after they have committed their data to memory.

LS2 has a retirement interface with the re-order buffer. The re-order buffer presents the Tag of the line that is being retired to LS2.

It only needs to present a single Tag for up to three instructions in a line since these all have the same tag except for the 'sub-index' which identifies the lane (0, 1 or 2). LS2 compares all instruction tags with the retirement tag and sets the Retired flag of those that match. Retired loads may be deleted directly from LS2.

If the retirement logic of the re-order buffer has detected a branch misprediction or exception then all instructions matching the retirement tag and all those with succeeding tags are discarded from LS2. The only ones left in LS2 are the retired stores that are waiting to commit their data to memory.

3.11 Store to Load forwarding, The Dependency Link File US Patent 6,266,744.

A Load probing the data cache will also check the Load Store units to see if there are any outstanding stores to the same address as the load. If it finds such a store ( and the store is before the load in program order ) then there are two possibilities. If the store has already obtained the write data from one of the result busses then this data can be directly forwarded to the load. If the store has not yet obtained its data then the load misses and moves to LS2.

An entry is created in a unit called the Dependency Link File. This unit now registers both the tag of the write data ( which tells that the data-to-be-stored is coming in the next cycle ) and the Load tag, which is to be used to tell a following instruction that the load data will be available. The Dependency Link File keeps monitoring the write data tag and, as soon as it detects it, puts the load instruction tag on one of the Cache Load tag busses.

It does the same with the actual data when it comes one cycle later. The result data from instruction 1 can be directly forwarded to the consuming instruction 4 in the example below. Instructions 2 and 3 (the store and the load) are bypassed in this case.

1) F( regA,regD ); // register A is a function of register A and register D

2) store ( mem, regA ); // store register A to memory

3) load ( regB, mem ); // load register B from the same memory location

4) F( regD, regB ); // uses register B and register D to calculate new value of register D

Mismatched store to loads: Stores that only modify part of the load data are not supported. The load must first wait until the store is retired and written to memory. The load may then access the cache to get its data, which is a combination of the stored data and the original contents of the cache. The optimization manual describes all possible mismatch cases since they can lead to a considerable performance penalty.

Multiple Stores to the same address are handled with the so-called LIB flag ( Last In Buffer ). This flag identifies the most recent store to a certain address. A newer load accessing the same address will choose this one. Multiple partial stores to the same word, where each modifies only a part of the word, are not supported by the Load Store buffer. They are not merged in the Load Store buffer. They will be merged later on in the cache after all stores are retired and written.
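The forwarding rules of this section can be summarized in a small decision sketch. It makes several simplifying assumptions that are not from the patent: the buffer only holds stores older than the probing load, a "match" means identical address and size, and any other overlap counts as a mismatch. The names Store, add_store and probe are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Store:
    addr: int
    size: int                       # bytes written
    data: Optional[int]             # None until the write data has arrived
    lib: bool = True                # Last In Buffer: newest store to this address

def add_store(buffer: List[Store], store: Store) -> None:
    for s in buffer:
        if s.addr == store.addr:
            s.lib = False           # an older store to the same address loses LIB
    buffer.append(store)

def probe(buffer: List[Store], addr: int, size: int) -> Tuple[str, Optional[int]]:
    # A store that only partly covers the load blocks forwarding.
    for s in buffer:
        overlaps = s.addr < addr + size and addr < s.addr + s.size
        if overlaps and (s.addr, s.size) != (addr, size):
            return ("wait_for_retire", None)
    # Otherwise pick the most recent matching store via its LIB flag.
    for s in buffer:
        if s.addr == addr and s.size == size and s.lib:
            if s.data is None:
                return ("wait_for_data", None)   # Dependency Link File case
            return ("forward", s.data)
    return ("read_cache", None)
```

The "wait_for_data" outcome is where the Dependency Link File takes over: the load waits in LS2 and is woken up by the write data tag.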

3.12 Self Modifying Code checks: Mutual exclusive L1 DCache and L1 ICache US Patent 6,415,360.

Self Modifying Code (SMC) checks must in principle be performed for each store. It must be tested if the store does not modify any of the instructions in the Instruction Cache or any following Instruction in flight in any stage of execution. A significant simplification is made by making the L1 Data Cache and L1 Instruction Cache exclusive to each other: A cache-line can only exist in either one, not in both at the same time. When a cache line is loaded in the L1 Data cache then it will be evicted from the L1 Instruction cache.

The first advantage is that the contents of the Instruction Cache do not need to be tested any further for SMC. The second advantage is that SMC checks may be limited to Data Cache misses. Stores to un-cacheable memory must always be checked ( they always "miss" ).

The store's write-address is sent from LS2 to the SMC test unit which is close to the Instruction Cache. This unit holds the cache-line addresses of all the instructions in flight. If there is a conflict then it marks the store that caused the conflict. The reorder buffer will discard all instructions which follow the store when the store is retired.

3.13 Handling multi processing deadlocks: exponential back-off US Patent 6,427,193.

Deadlocks can occur when multiple processors fight for the ownership of the same cache-line. They do so for instance if they both want to write to the same line. A cache-line is generally loaded as soon as possible in case of a cache-miss. This will cause the cache-line to be invalidated in other caches in case of a store. Two processors get in a deadlock if they keep invalidating each other's cache-lines before they are able to finish the stores.

An example given is the case where two processors try to complete a store to an unaligned address, so that part of the store data goes to cache line A1 and part of the store data goes to cache line A2. Unaligned stores of this type are typically split into two stores by the hardware. An exponential back-off mechanism is provided to handle this kind of deadlock situation. When the memory access remains unsuccessful, a back-off time is introduced before the processor retries to become owner of the cache-line. This time grows exponentially after each unsuccessful try until one of the processors finally succeeds.
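The back-off principle itself is simple enough to sketch. The base delay of 16 cycles, the doubling factor and the retry limit below are placeholder assumptions, not values from the patent, and acquire_line and try_acquire are hypothetical names.

```python
from typing import Callable, Tuple

def acquire_line(try_acquire: Callable[[], bool],
                 base_cycles: int = 16,
                 max_tries: int = 10) -> Tuple[int, int]:
    """Retry try_acquire() with a wait that doubles after each failure.

    Returns (number of attempts, total cycles spent backing off).
    """
    waited = 0
    for attempt in range(max_tries):
        if try_acquire():
            return attempt + 1, waited
        waited += base_cycles << attempt    # 16, 32, 64, ... cycles
    raise RuntimeError("cache line ownership not obtained")
```

Because the two processors quickly end up with different waiting times, one of them gets a long enough window to finish both halves of its store.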

3.14 Improvements for multi processing and multi threading

The Opteron's micro architecture has a large number of improvements related to multi processing and multi threading. These are very important improvements for the desktop market as well. Multi-processor on a chip solutions are just around the corner and hyper-threading may take a significant step forward in the near future with Intel's Prescott.

The ability to run multi processing and multi threaded applications efficiently becomes essential. Switching contexts, starting and ending processes and threads, as well as inter-process and inter-thread communication are traditionally associated with large overheads. Significant improvements have been made to reduce these overheads to a minimum.

3.15 Address Space Number (ASN) and Global flag US Patent 6,604,187.

Different processes can have different contexts. That is: different translations from virtual to physical addresses. A process switch will cause the Translation Look Aside buffers to be invalidated ( flushed ). Large Translation buffers won't help you a lot if they are frequently flushed, which can then lead to significant performance degradation. The Opteron introduces a new mechanism to avoid flushing of the TLB's. An Address Space Number (ASN) register is added together with an enable bit (ASNE).

The Address Space Number is used to uniquely identify a process. Each entry in the TLB now includes the ASN of the process. An address can be successfully translated if the address matches the Virtual Address Tag in the TLB and the ASN register matches the ASN field in the TLB. The ASN field can be seen as an "extension" of the Virtual Address. This now means that different translations of different processes can coexist in the TLB, avoiding the need to flush the TLB's for context switches.

A global flag is available for data and code that is preferably accessible for all processes, typically operating system related. Global translations do not require the ASN fields to match. This means that many processes can share a single entry in the TLB to access global data. Another advantage of the ASN and global flag is that flushing can be limited to specific entries whenever an invalidation of the TLB is needed. Only the entries which have a certain ASN or have the global bit set are flushed.
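The matching rule with ASN and global flag can be written down as a small sketch. The entry layout and the names Translation, lookup and flush are hypothetical, and the flush function shows just one possible reading of the selective flush described above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Translation:
    vtag: int           # virtual address tag
    asn: int            # Address Space Number of the owning process
    is_global: bool     # global entries ignore the ASN comparison
    pfn: int            # physical page number

def lookup(entries: List[Translation], vtag: int, current_asn: int) -> Optional[int]:
    for e in entries:
        # The ASN acts as an "extension" of the virtual address tag.
        if e.vtag == vtag and (e.is_global or e.asn == current_asn):
            return e.pfn
    return None

def flush(entries: List[Translation], asn: Optional[int] = None,
          include_global: bool = False) -> List[Translation]:
    """Drop only the selected entries and keep everything else."""
    def selected(e: Translation) -> bool:
        if e.is_global:
            return include_global
        return asn is not None and e.asn == asn
    return [e for e in entries if not selected(e)]
```

Two processes can thus use the same virtual address with different translations, while an operating system page marked global is found by both.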

3.16 The TLB Flush Filter CAM US Patent 6,510,508.

The TLB's can be seen as caches containing the translation information stored in the address translation tables in memory. The actual translation requires several levels of indirection through the tables stored in main memory. This is the so-called "table walk", a very time consuming process which may take many hundreds of cycles for a single TLB entry. The Opteron attempts to speed up the table walk with a 24 entry Page Descriptor Cache.

Even so, it remains important to avoid the table walk whenever possible in a multi-tasking multi-threaded environment. A table walk becomes necessary whenever entries in the TLB do not correspond to the memory resident translations anymore because somebody has modified the latter.

Until now there was only one way to guarantee TLB coherency: Flush the TLB's if it may be possible that any of the entries is not identical anymore to the memory resident tables. Many actions in the x86 architecture result in an automatic flush of the TLB's, often unnecessarily. A new feature in the Opteron, the TLB flush filter, can avoid these costly flushes on many occasions.

The TLB Flush filter is implemented as a 32 entry, Content Addressable Memory ( CAM ). It remembers the addresses of regions in memory that were accessed when the TLB's were loaded. These regions thus belong to the Page Translation Tables. The Filter then keeps monitoring all accesses to memory to see if any of these regions are accessed again. If not then it may disable the flushing of the TLB's because coherency is guaranteed.

3.17 Data Cache Snoop Interface

The Snoop interface is used for a wide variety of purposes. It's used to maintain Cache Coherency in a multiprocessor system. It is used for conserving strict memory ordering in shared memory, for Self Modifying Code detection, for TLB coherency, et cetera.

The snoop interface uses the physical addresses from other processor accesses, as well as from accesses issued on behalf of the instruction cache, to probe various memories and buffers for data that is in some way related to that particular address.

3.18 Snooping the Data Cache for Cache Coherency, The MOESI protocol

The Opteron can maintain cache coherency in systems of up to 8 processors. It uses the so-called MOESI protocol for this purpose. The snoop interface plays a central role in the effectuation of the protocol.

If a cache line is read from system memory ( which may be connected to any of the eight processors ), then the read has to snoop all the caches of all processors. Snoop accesses are much smaller than normal memory accesses because they do not carry the 64 byte cache line data. Many snoops may therefore be active without overloading the distributed memory system throughput. A snoop may find the cache-line in one of the caches of another processor.

If a processor does not find the cache-line in someone else's cache then it loads it from system memory into its cache and marks it as Exclusive. Now whenever it writes something in the cache-line then it becomes Modified. It does in general not write the modified cache-line back to memory. It only does so if a special memory-page-attribute tells it to do so (write through). The cache line will be evicted only later on if another cache-line comes in which competes for the same place in the cache.

If a processor needs to read from memory and it finds the cache line in someone else's cache then it will mark the cache line as Shared. If the cache-line it finds in the other processor's cache is Modified then it will load it directly from there instead of reading it from the memory, which may not be up to date. Cache to cache transfers are generally faster than memory accesses.

The status of the cache-line in the other cache goes from Modified to Owner. This cache-line still isn't written back to memory. Any other (third) processor that needs this cache-line from memory will find a Shared version and an Owner version in the caches of the first two processors. It will obtain the Owner version instead of reading it from system memory. The owner is the last one that modified the cache-line and stays responsible for updating the system memory later on.

A cache-line stays shared as long as nobody modifies the cache-line again. If one of the processors modifies it then it must notify the other processors by sending an invalidate probe throughout the system. The state becomes Modified in this processor and Invalid in the other ones. If it continues to write to the cache line then it does not have to send any more invalidate probes because the cache line isn't shared anymore. It has taken over the responsibility to update the system memory with the modified cache line whenever it must evict the cache-line later on.
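The state changes walked through in this section can be replayed with a toy model of a single cache line. It only tracks the MOESI state letter per processor and ignores data movement, evictions and write-through pages. The functions read and write are hypothetical names for this sketch.

```python
from typing import Dict

def read(states: Dict[int, str], cpu: int) -> None:
    """Processor `cpu` reads the line; states maps cpu -> one of M, O, E, S, I."""
    if states[cpu] != "I":
        return                              # already present: plain cache hit
    holders = [c for c in states if c != cpu and states[c] != "I"]
    if not holders:
        states[cpu] = "E"                   # nobody else has it: Exclusive
        return
    for c in holders:
        if states[c] == "M":
            states[c] = "O"                 # supplier stays responsible for memory
        elif states[c] == "E":
            states[c] = "S"
    states[cpu] = "S"

def write(states: Dict[int, str], cpu: int) -> None:
    """Processor `cpu` writes the line; all other copies are invalidated."""
    for c in states:
        if c != cpu:
            states[c] = "I"                 # effect of the invalidate probe
    states[cpu] = "M"
```

Replaying the story: processor 0 reads (Exclusive) and writes (Modified), processor 1 reads (0 becomes Owner, 1 Shared), processor 2 reads (Shared), and a write by processor 1 leaves it Modified with the others Invalid.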

3.19 Snooping the Data Cache for Cache Coherency, The Snoop Tag RAM

Other processors that access system memory need to snoop the Data Cache to maintain cache coherency using the MOESI protocol. We saw that there were two kinds of snoops. Read and Invalidate snoops. The basic task of a snoop is first to establish if the Data Cache contains the cache-line in question. There is a third set of Tags available specially for the snoop interface. ( The other two are used for the two regular ports of the data cache ). The snoop-Tag ram has 1024 entries, one for each cache line. It holds the Physical address bits [39:12] belonging to each cache line.

Virtual Address bit used to access the L1 data Cache

virtual page address				offset in page						offset in cache line
W	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0

Physical Address used to snoop the L1 data Cache

physical page address				offset in page						offset in cache line
15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0

The regular Tag rams are accessed with the virtual address. The Snoop Tag ram however must deal with the physical address! Fortunately many of the virtual address bits needed are identical to the physical address bits. Only bits [15:12] are different and thus useless. This means that we must read out the Tags of all 16 possible cache-lines in parallel and then test if any one of them matches. Luckily enough this doesn't present too much of a burden. The total bus width (in bit-lines) of for instance the cache rams is 512 bit. Sixteen times a 28 bit Tag is less (448) so there's space left for some extra bits like the state info for each cache-line.

Once we know which of the 16 possible cache-lines hits then we know also the remaining virtual address bits needed to access the cache plus the Way (0 or 1) which holds the cache-line. The position itself, ( 1 out of 16 ) directly provides the 3 extra address bits plus the Way bit. This means we can now access the cache if needed in case of a Read Snoop hit.
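The 16-way search can be sketched as follows, assuming the organization implied above: 1024 lines of 64 bytes in 2 ways, so 512 sets indexed by virtual bits [14:6], of which only bits [11:6] are known from the physical address. The position encoding (way in the top bit, the three unknown index bits below it) and the name snoop are assumptions for illustration.

```python
from typing import List, Optional, Tuple

WAYS, SETS = 2, 512     # 1024 lines of 64 bytes each

def snoop(snoop_tags: List[List[Optional[int]]],
          paddr: int) -> Optional[Tuple[int, int]]:
    """Return (way, set index) of the line holding paddr, or None.

    snoop_tags[way][set] holds physical address bits [39:12] or None.
    """
    low_index = (paddr >> 6) & 0x3F             # bits [11:6]: same in virtual and physical
    ptag = (paddr >> 12) & ((1 << 28) - 1)      # bits [39:12]
    for position in range(16):                  # done in parallel in hardware
        way = position >> 3                     # assumed encoding: top bit is the Way
        upper = position & 0x7                  # the 3 unknown virtual index bits [14:12]
        set_index = (upper << 6) | low_index
        if snoop_tags[way][set_index] == ptag:
            return way, set_index
    return None

# Sixteen 28 bit tags fit within a 512 bit wide read-out.
TAG_BITS_READ = 16 * 28
```

The matching position hands back exactly what the physical address could not supply: the three missing index bits and the Way.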

3.20 Snooping the L1 Data Cache and outstanding stores in LS2

It is not necessary for snoop reads from other processors that want to read a cache-line from the L1 data cache to check for retired stores in LS2 that will write to the cache-line they are about to read. This is so even though the data these stores will write is already considered to be part of the memory by the processor that issued the writes. It is OK for other processors to see these writes occur at a later stage. The only effect externally is that it looks as if the processor is slightly slower.

An external processor that writes to a shared cache line must send snoop invalidates around. The snoop interface will invalidate the local cache-line if it receives such a snoop invalidate that hits the cache. The snoop interface must also set the hit/miss flag to miss for all stores in the Load Store unit that want to write to the cache-line that was hit. The latter is not a specific snoop operation however. It is needed in all cases in which a cache-line is evicted or invalidated. These stores that originally did hit but are set back to miss will need to probe the cache again.

3.21 Snooping LS2 for loads to recover strict memory ordering in shared memory US Patent 6,473,837.

An interesting trick allows the Opterons to handle speculative out of order loads from shared memory and still observe the strict memory access ordering required for shared memory multiprocessing. The hardware detects violations and can restore strict memory ordering when needed. A communicating processor may for instance first write a new command for another processor to A1 in memory and then increment a value A2 to notify that it has issued the next command. The processor which is supposed to handle the commands may find the value A2 incremented but still reads the old command from A1 if it executes loads out of order.

The ability to handle loads out of order can significantly speed up processing. Most notable is the example where a first load misses the cache. An out of order processor may issue another load which may hit the cache without waiting for the result of the first load. It would be beneficial to maintain out-of-order loads in a multiprocessing environment.

Another important speed improvement is speculative processing. The first load that missed may have been the counter A2 in our example. The new command must be fetched if A2 has been increased. A conditional call is made based on a test of the value of A2. A speculative processor attempts to predict the outcome of the branch at the beginning of the pipeline. It may predict that the counter has been incremented if it generally takes more time to execute the command than it takes to provide a new command.

That is: The new command is generally sitting waiting to be executed by the time the previous command has been executed.

The speculative out of order processor may first attempt to load the counter A2. It may miss, but the branch predictor has predicted that it was increased and the command from A1 will be loaded for execution. The load from A1 may hit the cache. We actually do not know if this is a new command or not. Let's say it is the old one. The counter A2 still has to be loaded from memory. If A2 is increased in the meantime then the load that missed will cause the modified cache-line to be loaded in the local data cache with the incremented counter included. The processor will conclude that the branch prediction was correct and erroneously carry on with the old command.

The Opteron has a snoop mechanism that allows this kind of fully speculative out-of-order processing for high performance multi-processing. The mechanism detects cases which may go wrong and consequently restores memory ordering. We'll illustrate the mechanism with the use of our example. When the first processor writes a new command into A1 then it will send a snoop-invalidate around to invalidate the cache-line in all other caches. This snoop invalidate will also reach the snoop interface of the Load Store unit:

The snoop interface first checks the entries for a load that did hit the cache-line-to-be-invalidated. This load would then be the "old command" from A1 in our example. When it finds a load hit then it continues by checking all older loads to see if any of them is marked as a miss. This would then be the load of the A2 counter value in our example. It marks the Snoop ReSync flag of all the load misses it finds. This flag will cause any succeeding instructions to be canceled when the load is retired, including the instruction that loads A1. The load of A1 will be re-executed and will now correctly read the new command from memory.
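The marking step can be sketched with the A1/A2 example. The list below stands for the loads in LS2 ordered from oldest to youngest. The names Load and snoop_invalidate are hypothetical, and the model leaves out the actual cancel and replay at retirement.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Load:
    name: str
    line_addr: int          # cache line the load accesses
    hit: bool               # True if the load already got its data from the cache
    resync: bool = False    # Snoop ReSync flag

def snoop_invalidate(loads_oldest_first: List[Load], line_addr: int) -> None:
    for i, load in enumerate(loads_oldest_first):
        if load.hit and load.line_addr == line_addr:
            # A completed load read the line that is now being invalidated.
            # Every older load that is still a miss gets the ReSync flag.
            for older in loads_oldest_first[:i]:
                if not older.hit:
                    older.resync = True
```

When the flagged A2 load retires, everything after it is thrown away and redone, so the A1 load is repeated and picks up the new command.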

3.22 Snooping the TLB Flush Filter CAM

Snooping is used to preserve memory coherency. The function of the TLB flush filter is to prevent unnecessary flushes of the TLB's. It does so by monitoring up to 32 areas in memory that are known to contain page table translation information which is cached in the TLB's. These entries must be snooped also by snoop invalidates from other processors that may write to the page tables of our processor. If any of the snoops hits a TLB flush filter entry then we know that a TLB may have invalid entries and that the TLB flush filter may not prevent the flushing of the TLB's anymore.

The snoop-invalidates are not sent if a processor is sure that a cache-line is not shared with other processors. This suggests that the TLB's (being caches in their own right) participate in the MOESI protocol for cache coherency via the TLB flush filter. The memory page translation tables ( PML4, PDP, PDE and PTE entries ) may be in cacheable memory. A special flag has to be set in the Opteron if the Operating System decides to put the tables in un-cacheable memory. ( TLBCACHEDIS in HWCR )

Chapter 4, Opteron's Instruction Cache and Decoding

4.1 Instruction Cache: More than instructions alone

Access to the Instruction cache is 128 bit wide. 16 bytes of instructions can be loaded from the cache each cycle. The instruction bytes are accompanied by 76 bits of extra information. This extends the total width of the Instruction cache port to 204 bits. We're still counting only the bits that cover the full Instruction Cache. That is: Each of the 1024 cache lines has its own set of these extra bits. There are several more fields that have fewer than 1024 entries and are valid only for a subset of the cache lines.

	Instruction only	Total size
Instruction Cache size:	64 kByte	102 kByte
Cache Line size	64 Byte	102 Byte
One Read Port	128 bit	204 bit
One Write Port	128 bit	204 bit

Well known are the three so-called pre-decode bits attached to each byte. They mark the start and end points of the complex variable length x86 instructions and provide some functional information. The other two fields are the parity bits, 1 parity bit for each 16 data bits, and the so-called branch selectors. ( eight times 2 bit for each 16 byte line of instruction code ).

	Ram Size	Bus Size	Comments
Instruction Code:	64 kByte	128 bit	16 bytes instruction code
Parity bits	4 kByte	8 bit	One parity bit for each 16 bit
Pre-decode	26 kByte	52 bit	3 bits per byte (start, end, function) + 4 bit per 16 byte line
Branch Selectors	8 kByte	16 bit	2 bits for each 2 bytes of instruction code
TOTAL	102 kByte	204 bit
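The numbers in the table follow directly from the per-fetch bit counts, which a few lines of arithmetic can confirm. The dictionary keys and the helper ram_kbytes are just names chosen for this check.

```python
LINES = 1024            # cache lines in the Instruction Cache
LINE_BYTES = 64         # instruction bytes per cache line
FETCH_BYTES = 16        # instruction bytes per access

# Bits delivered per 16 byte access, per field.
bits_per_fetch = {
    "instruction code": FETCH_BYTES * 8,            # 128
    "parity": (FETCH_BYTES * 8) // 16,              # one bit per 16 data bits
    "pre-decode": 3 * FETCH_BYTES + 4,              # 3 per byte + 4 per 16 byte line
    "branch selectors": 2 * (FETCH_BYTES // 2),     # 2 bits per 2 bytes of code
}
total_bits = sum(bits_per_fetch.values())

def ram_kbytes(bits: int) -> float:
    """RAM size in kByte for a field that is `bits` wide per 16 byte access."""
    fetches_per_line = LINE_BYTES // FETCH_BYTES
    return bits * fetches_per_line * LINES / 8 / 1024

sizes = {name: ram_kbytes(bits) for name, bits in bits_per_fetch.items()}
```

The extra 76 bits per access are the 8 parity bits, 52 pre-decode bits and 16 branch selector bits added together.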

The Opteron's branch selectors are different from those of the Athlon (32) and they now cover all 1024 cache-lines of the Instruction Cache. The branch selectors contain local branch prediction information which can not be retrieved as readily as for instance the pre-decode information. A piece of code has to be executed multiple times before the branch-selectors become meaningful.

This is the reason that the branch selector bits are saved together with the instruction data in the unified level 2 cache whenever a cache-line is evicted from the instruction cache. The branch selectors represent one bit extra for each byte. The level 2 cache has this bit already for ECC ( Error Coding and Correction ) information. ECC is only used for data cache lines and not for instruction cache lines. The latter do not need ECC, a few parity bits per cache line is sufficient. Instruction cache lines that are corrupted can always be retrieved from external DRAM memory.

4.2 The General Instruction Format

A short overview of the 64 bit instruction format:

A series of prefixes can precede the actual instructions. At the start we have the legacy prefixes. The most important legacy prefixes are the operand size override prefix (hex 66) and the address size override prefix (hex 67). These prefixes can change the length of the entire instruction because they change the length of the displacement and immediate fields which can be 1, 2 or 4 bytes long.

The REX prefix (hex 4X) is the new 64 bit prefix which brings us 64 bit processing. The value of X is used to extend the number of General Purpose registers and SSE registers from 8 to 16. Three bits are used for this purpose because x86 can specify up to three registers per instruction for data and address calculations. The fourth bit is used as operand size override (64 bit or default size)

AMD64 Instruction Format

The Escape prefix (hex 0F) is used to identify SSE instructions. The Opcode is the actual start of the instruction after the prefixes. It can be either one or two bytes and may have an optional MODRM byte and SIB byte. The optional displacement and immediate fields can contain constants used for address and data calculations and can be 1, 2 or 4 bytes. The total length of the instruction is limited to 15 bytes.

4.3 The Pre-decode bits

Each byte in the instruction cache is accompanied by 3 pre-decode bits generated by the pre-decoder. These bits accelerate the decoding of the variable length instructions. Each instruction byte has a start bit that is set when the byte is the start of a variable length instruction and a similar end bit. Both bits are set in case of a single byte instruction. More information is given with the third bit, the function bit. The decoders look first at the function bit at the last byte of the variable length instruction. If the function bit is 0 then the instruction is a so-called direct path instruction which can be handled directly by the functional units. Otherwise, if the function bit is 1 at the end byte, then the instruction is a so-called vector path instruction: a more complex operation that needs to be handled by a microcode program.

Definition of the Instruction Pre-decode bits

START bit

1 indicates first byte of an instruction

END bit

1 indicates last byte of an instruction

FUNCTION bit

rule 1: Direct Path instruction if 0 on the last byte

Vector Path instruction if 1 on the last byte

rule 2: 1 indicates Prefix byte of Direct Path (except last byte)

0 indicates Prefix byte of Vector Path (except last byte)

rule 3: For vector-path instructions only:

if the function bit of the MODRM byte is set then

the instruction contains a SIB byte.

Then, secondly, the function bits identify the prefix bytes. Ones identify prefix bytes of direct path instructions and zeroes define the prefix bytes of vector-path instructions. Then, finally, in case of vector-path instructions only: if the function bit of the MODRM byte is set then the instruction also contains a SIB byte.
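Rules 1 and 2 can be applied mechanically to a run of pre-decoded bytes, as the following sketch shows. It treats only the leading run of marked bytes as prefixes and leaves rule 3 (the SIB indication) out, both of which are simplifying assumptions. The function name decode_predecode is hypothetical.

```python
from typing import List, Tuple

def decode_predecode(start: List[int], end: List[int],
                     func: List[int]) -> List[Tuple[int, int, str, int]]:
    """Return (first byte, last byte, path, prefix count) per instruction."""
    found = []
    begin = None
    for i, (s, e) in enumerate(zip(start, end)):
        if s:
            begin = i                               # START bit: first byte
        if e and begin is not None:                 # END bit: last byte
            # Rule 1: function bit on the last byte selects the path.
            path = "vector" if func[i] else "direct"
            # Rule 2: prefixes are marked 1 for direct path, 0 for vector path.
            prefix_marker = 0 if path == "vector" else 1
            prefixes = 0
            for j in range(begin, i):               # the last byte is excluded
                if func[j] != prefix_marker:
                    break
                prefixes += 1
            found.append((begin, i, path, prefixes))
            begin = None
    return found
```

For example, a single byte instruction has both START and END set on the same byte, and its function bit alone decides between direct path and vector path.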

4.4 Massively Parallel Pre-decoding US Patents 6,460,132 & 6,260,134

We find a very large block of logic with fourfold symmetry right next to the position where the 16 byte blocks of data are read from and written to the instruction cache.

We'll discuss the most likely candidate here: a fourfold incarnation of an earlier pre-decoder described in gate level detail in US Patent 6,260,134.

This fourfold version can, according to the patent which describes it, pre-decode an entire line of 16 bytes in only two cycles by means of what it calls massively parallel pre-decoding. This circumvents a basic problem in variable length pre-decoding and decoding in general:

A second instruction cannot be decoded until the length of the first instruction is known. The start position of the second instruction depends on the length of the first instruction.

The massively parallel pre-decoder avoids this problem by first pre-decoding the 16 possible instructions in parallel. Each instruction starts at one of the 16 byte locations of the 16 byte line. It then filters out the real instructions with the help of the program-counter which points to the start byte of the next instruction, depending on where we jump into the 16 byte line.

16 bytes of instructions can be fetched per cycle from the instruction cache to be fed to the decoders. It may be that the line is not yet pre-decoded or wrongly pre-decoded. (Data bytes between instructions can mislead the pre-decoder.)

If a branch is made to an address which does not have its pre-decode start bit set then we know that something is wrong. The instruction pipeline may invoke the pre-decoding hardware in this case to initialize or correct the pre-decoding bits within only two cycles.

The massively parallel pre-decoder uses four blocks, each an adapted version of an earlier pre-decoder. A single block pre-decodes four possible instructions in parallel, each instruction starting at one of four subsequent byte positions. The old single block was capable of stepping through a 16 byte line in four cycles. The massively parallel pre-decoder combines four of them and uses a second stage to resolve the relations between the four: the start / end fixer / sorter.
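The two stage idea can be sketched in a few lines. This is a software analogy only, under our own assumptions: toy_length is a made-up length rule (real x86 length decoding is far more involved), instructions that spill over into the next line are ignored, and the second stage here is a plain loop rather than the fixer / sorter logic of the patent.

```python
from typing import Callable, List


def toy_length(line: bytes, pos: int) -> int:
    """Hypothetical length rule: the low two bits of the byte give length - 1."""
    return (line[pos] & 0b11) + 1


def predecode_line(line: bytes, entry: int,
                   length_of: Callable[[bytes, int], int] = toy_length) -> List[int]:
    """Return the start positions of the real instructions in one line."""
    # Stage 1: treat every byte position as a possible instruction start and
    # compute a length for it. No position depends on another, so hardware
    # can do all 16 at once.
    lengths = [length_of(line, pos) for pos in range(len(line))]

    # Stage 2: starting from the entry point given by the program counter,
    # keep only the candidates that are actually reached.
    starts = []
    pos = entry
    while pos < len(line):
        starts.append(pos)
        pos += lengths[pos]
    return starts


line = bytes([2, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

With this example line, entering at byte 0 and entering at byte 1 select different subsets of the same 16 candidates, which is why the program counter is needed to filter out the real instructions.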

4.5 Large Workload Branch Prediction

Branch Prediction is the technique that makes it possible to design pipelined processors. The outcome of a conditional branch is generally only known at the very end of the pipeline while we need to have this information at the very beginning of the pipeline. We need the branch outcome to know which line of instructions to load next.

The loading of a line of instructions already takes two cycles. If we don't want to lose any more cycles then we must have decided on a new instruction pointer at the end of the cycle in which the 16 byte instruction line arrives from the instruction cache.

This means that there is no time at all to even look at the instruction bytes, to try to identify conditional branches, and then to look up what the behavior was of these branches in recent history in order to make a prediction. Doing this alone would cost us several cycles.

4.6 Improved Branch Prediction

The Branch prediction hardware does not make any attempt to look at the fetched instruction bytes at all. It uses several data structures instead to rapidly select a new address. It has a 2048 entry Branch Target Buffer and a 12 entry Return Stack to select a next Program Counter address. It further uses two branch history structures, one for local and one for global history, to predict the outcome of the branches. The so-called branch selectors are used for local history while the global history counters are used for global history.

4.7 The Branch Selectors

The branch selectors embody the local history. Local means that the prediction is based on the history of the branch itself alone. Conditional branches that almost always go the same way can be predicted with the branch selectors. Unconditional branches are also handled by the branch selectors. Remember that there is no time to look at the actual code. What a branch selector says is that history has shown that a branch will be encountered that is almost certainly taken, conditional or unconditional.

What if it's not so certain that a branch will be taken? The branch selectors may leave the prediction in this case to the global branch prediction. The branch selectors will predict the branch as taken to identify the branch but leave the final decision to the global history counters by setting the global flag.

Branch Selectors

16 byte line of instruction code
0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
BS0		BS1		BS2		BS3		BS4		BS5		BS6		BS7

Branch Selection	K7 Athlon 32	K8 Athlon 64
3	take branch 2	take branch 3 (or return)
2	take branch 1	take branch 2 (or return)
1	return	take branch 1 (or return)
0	continue to next line	continue to next line

Each 16 byte line of instruction code is accompanied by eight 2 bit branch-selectors (some patents talk about 9). The branch selector within the line is selected with bits [3:1] of the Instruction Fetch address. The branch selector answers the question: I entered this 16 byte line at this particular address, now which 16 byte line should I load in the next cycle? A line can have multiple jumps, calls and returns. They can be conditional or unconditional. We may have jumped anywhere in the middle of all these branches. The branch selectors tell us what to do depending on where we entered the line.

The K7 can predict two branches per line plus one return. The new 64 bit core can predict up to three branches per line and any one of them may be a return according to Opteron's optimization manual. (There are no patents yet so the table above is our own extrapolation.) The branch selectors are saved together with the instruction code in the large One Megabyte L2 cache whenever a cache-line is evicted from the instruction cache. The most useful data to save there is the information which can't be easily retrieved from the instruction code: the branch history. Information like the actual branch target address or the fact that the branch is a return is retrieved relatively fast in most cases by the processor.
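The selector lookup itself is simple enough to sketch. The function names are our own, and the action table is the K8 column of the table above, which is itself an extrapolation rather than documented behavior.

```python
from typing import Sequence

# K8 column of the branch selection table (our own extrapolation).
K8_ACTIONS = {
    0: "continue to next line",
    1: "take branch 1 (or return)",
    2: "take branch 2 (or return)",
    3: "take branch 3 (or return)",
}


def branch_selector_index(fetch_address: int) -> int:
    """Bits [3:1] of the fetch address pick one of the eight selectors."""
    return (fetch_address >> 1) & 0b111


def select_action(selectors: Sequence[int], fetch_address: int) -> str:
    """Look up what to do when entering the line at fetch_address."""
    return K8_ACTIONS[selectors[branch_selector_index(fetch_address)]]


# Eight 2 bit selectors for one line: entering at bytes 0..5 leads to
# branch 1, entering at bytes 6..11 to branch 2, later entries fall through.
selectors = [1, 1, 1, 2, 2, 2, 0, 0]
```

Each selector covers two adjacent byte positions, so entering the line at byte 4 or byte 5 gives the same answer.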

4.8 The Branch Target Buffer

The BTB ( Branch Target Buffer ) contains 2048 addresses from which the branch selectors can choose the next cycle's Instruction Fetch address. Fred Weber's MPF2001 Hammer presentation shows us that each 16 byte line can now have up to four branch target addresses to choose from ( up from two in the case of the Athlon 32 ). Each branch target entry is shared between eight lines. From the branch selectors we know that any single line may not use more than 3 of these. We assume that when a branch selector says "select the 2nd branch", this means the second branch available for the current line.

Most important Branch Target Buffer fields

Line Tag ( 3 bit )	15 bit cache index	Cache Way Select, 0 or 1	Return Instruction	Use Global Prediction	Offset in Instr. Code

Field	Description
Line Tag ( 3 bit )	A branch target buffer entry is shared between 8 lines. The line tag tells us if this entry belongs to the current line.
15 bit cache index	These 15 bits are sufficient to access the two 32 kB ways of the 64 kB 2 way set-associative Instruction Cache.
Cache way select ( 0 or 1 )	Used to check the way of the cache.
Return Instruction	This bit tells us to use the address from the return stack instead to access the next line in the instruction cache.
Use Global Prediction	The Global Flag leaves the final Branch Prediction to the Global History Counters.
Offset in Instruction code	This tells us where the end of the branch is located in the 16 byte line of instruction code.

Each branch target entry needs a 3 bit tag to identify to which of the 8 possible lines of instructions it belongs. Sharing branch target entries strongly reduces the number of branch target addresses needed. 2048 entries would still represent 12 kByte if the full 48 bit addresses were stored in the BTB. This would be a relatively large memory which you won't find on Opteron's die. The trick used here is to store only the 16 bits which are actually needed to access the 64 kByte instruction cache. The higher address bits are retrieved later on. The Opteron has a new unit called the BTAC ( Branch Target Address Calculator ) to support this.
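The storage argument is plain arithmetic, worked out below. The helper name is ours, and the figures count only the address bits: the tag, flag and offset fields of each entry would come on top of this.

```python
BTB_ENTRIES = 2048


def btb_address_bytes(address_bits: int, entries: int = BTB_ENTRIES) -> int:
    """Bytes needed to hold only the address field of every BTB entry."""
    return entries * address_bits // 8


full_address = btb_address_bytes(48)   # full 48 bit addresses
cache_index = btb_address_bytes(16)    # 15 bit cache index plus way select
```

Storing 16 bits instead of 48 cuts the address storage from 12 kByte to 4 kByte, a factor of three.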

4.9 The Global History Bimodal Counters

The Athlon 64 has 16,384 branch history counters, four times as many as its 32 bit predecessor. The counters describe the likelihood that a branch is taken. They count up to a maximum of 3 when branches are taken and count down to a minimum of 0 when not taken. Counter values 3 and 2 predict a branch as taken, see the table.

Definition of the 2 bit Branch History Counters

Counter Value	Branch Prediction
counter = 3	Strongly Taken
counter = 2	Weakly Taken
counter = 1	Weakly not Taken
counter = 0	Strongly not Taken
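The behavior of a single 2 bit counter can be sketched as follows. The function names are our own; the rules are the ones from the table above.

```python
def update_counter(counter: int, taken: bool) -> int:
    """Count up to at most 3 when taken, down to at least 0 when not taken."""
    if taken:
        return min(counter + 1, 3)
    return max(counter - 1, 0)


def predicts_taken(counter: int) -> bool:
    """Counter values 3 and 2 predict the branch as taken."""
    return counter >= 2


def run(counter: int, outcomes) -> int:
    """Apply a sequence of branch outcomes to a counter."""
    for taken in outcomes:
        counter = update_counter(counter, taken)
    return counter
```

A counter that has reached "Strongly Taken" needs two not taken outcomes in a row before its prediction flips, so a single exception such as a loop exit does not disturb the prediction.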

The GHBC is accessed by using four bits of the Program Counter and the outcome (taken or not taken) of the last eight branches. This is basically the same as in the Athlon 32. The fact that we now have four times as many counters means that we have four branch predictors per 16 byte instruction line. This corresponds with the four branch target addresses per line. This would be an improvement over the Athlon 32, where the two branches per line could interfere with each other's branch predictions.

Addressing the Branch History Counters

Instruction Address bits 7:4 and the branch outcome of the eight previous branches together address the 16,384 Branch History Counters, which deliver four predictions:

Branch prediction 0	Branch prediction 1	Branch prediction 2	Branch prediction 3
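The addressing can be sketched as below. Four address bits plus eight history bits give 12 index bits, so 4096 rows of four counters account for all 16,384 counters. Placing the address bits above the history bits is our own assumption; the real hardware may combine them differently.

```python
ROWS = 1 << 12          # 4 address bits + 8 history bits
COUNTERS_PER_ROW = 4    # one per branch prediction 0..3


def ghbc_row(fetch_address: int, history: int) -> int:
    """Row index built from address bits [7:4] and the last eight outcomes."""
    address_bits = (fetch_address >> 4) & 0xF
    return (address_bits << 8) | (history & 0xFF)


total_counters = ROWS * COUNTERS_PER_ROW
```

For example, fetch address 0x1234 with history 10110001 selects row 0x3B1.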

Another improvement is that only branches whose global bit was set participate in the global branch prediction. This prevents branches with a static behavior from polluting the global branch history. ( US Patent 6,502,188 describes this in the context of the Athlon 32. ) The global bit is set whenever a branch has a variable outcome. The GHBC table allows the processor to predict global branch patterns of up to eight branches.

4.10 Combined Local and Global Branch Prediction with three branches per line

A single 16 byte line with up to three conditional branches represents a complex situation. If we predict a first branch as not taken then we encounter the next conditional branch which must be predicted as well, et cetera. Does the Opteron handle this in multiple steps, or does it handle the whole multiple branch prediction at once?

Local and Global Branch Prediction with three Branches per Line

Branch Selector	AND	THEN
Selects Branch 1	Branch 1 is local, or global and predicted taken	TAKE BRANCH 1
Selects Branch 1	Branch 1 is global and predicted not taken; Branch 2 is local, or global and predicted taken	TAKE BRANCH 2
Selects Branch 1	Branch 1 is global and predicted not taken; Branch 2 is global and predicted not taken; Branch 3 is local, or global and predicted taken	TAKE BRANCH 3
Selects Branch 1	Branch 1 is global and predicted not taken; Branch 2 is global and predicted not taken; Branch 3 is global and predicted not taken	GO TO NEXT LINE
Selects Branch 2	Branch 2 is local, or global and predicted taken	TAKE BRANCH 2
Selects Branch 2	Branch 2 is global and predicted not taken; Branch 3 is local, or global and predicted taken	TAKE BRANCH 3
Selects Branch 2	Branch 2 is global and predicted not taken; Branch 3 is global and predicted not taken	GO TO NEXT LINE
Selects Branch 3	Branch 3 is local, or global and predicted taken	TAKE BRANCH 3
Selects Branch 3	Branch 3 is global and predicted not taken	GO TO NEXT LINE
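The rules of this table reduce to one loop: start at the branch the selector points to and walk forward until a branch is found that is either local or predicted taken by the global counters. The sketch below uses our own names and returns the number of the branch that is taken, numbered 1 to 3 as in the conditions, or None for "go to next line". It assumes all three branches exist in the line.

```python
from typing import List, Optional, Tuple

# (uses_global, global_predicts_taken) for branches 1, 2 and 3 of the line
BranchState = Tuple[bool, bool]


def resolve_line(selected: int, branches: List[BranchState]) -> Optional[int]:
    """Return the branch to take (1..3), or None to continue to the next line."""
    if selected == 0:
        return None  # the selector itself says: continue to next line
    for number in range(selected, len(branches) + 1):
        uses_global, global_taken = branches[number - 1]
        # A local branch is taken as selected; a global branch only when
        # the global history counters predict it taken.
        if not uses_global or global_taken:
            return number
    return None


branches = [(True, False), (False, False), (True, True)]
```

With these example states, a selector pointing at branch 1 skips it (global, predicted not taken) and ends up taking branch 2, which is local.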

If we may take Fred Weber's MPF2001 presentation as an indication here then we guess that it takes the branches one step at a time. ( The presentation shows a single GHBC prediction per cycle. ) A potential bottleneck may indeed be the GHBC. A second and a third branch need a different "8 bit branch outcome" index into the table. The 8 bit value should be shifted 1 and 2 positions further for the 2nd and the 3rd branch, with zeroes inserted to indicate "not taken", in order to operate according to the rules.
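The shifted history can be sketched as follows. We assume, for illustration only, that the newest outcome sits in bit 0 and that 0 encodes "not taken"; the text does not specify the bit order.

```python
def history_for_branch(history: int, skipped: int) -> int:
    """8 bit history as seen by a branch after `skipped` earlier branches
    in the same line were predicted not taken (zeroes shifted in)."""
    return (history << skipped) & 0xFF


base = 0b10110001
second = history_for_branch(base, 1)  # one not taken branch before it
third = history_for_branch(base, 2)   # two not taken branches before it
```

Each of the three values indexes a different counter, which is why three predictions from a single table access would need three different indices.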

4.11 The Branch Target Address Calculator, Backup for the Branch Target Buffer

Another new improvement is the BTAC, the Branch Target Address Calculator. This unit is very useful for several purposes. It can generate full (48 bit) branch addresses two cycles after the 16 byte line of code has been loaded from the cache. It works for most branches, which typically use an 8 or 32 bit displacement in the instruction to jump or call to code relative to the program counter. The BTAC can probably identify return instructions as well.

One task of the BTAC is a backup function of the BTB (Branch Target Buffer). The BTB shares each branch address with eight lines. We may find that the branch selectors are OK but the branch target they select has been overwritten by another branch. The branch selectors are maintained for all cache-lines in the 64 kByte I Cache. They are also preserved together with instruction cache-lines which are evicted from L1 to the large 1 MegaByte L2 cache. It is unlikely that branch-selectors which are reloaded from L2 into L1 still find their branch target addresses in the BTB. On the contrary, the BTB entries should be cleared whenever a cache-line is evicted from L1 to L2.

A cache-line that returns from L2 to L1 can restore the pre-decode bits rapidly (in two cycles with a massively parallel pre-decoder). It has to restore the BTB entries as well but this can take much more time. The Athlon 32 fills the BTB with instruction-addresses that come back from the re-order buffer when the branch is retired. This procedure would be repeated for each branch in the 16 byte line when it is taken. It may well be that the Athlon 64 still works this way. The BTAC can take over the functionality of the BTB until the BTB entries are restored.

The BTAC can use the lowest Instruction Fetch address bits to see where we enter the 16 byte line. It can then scan from that position to the first branch and calculate the full 48 bit address by adding the 8 or 32 bit displacement from the code. Now we have a calculated value which can be used to index the cache. It is still a guessed address. The certain address only comes when the branch instruction retires. The BTAC may have picked the wrong branch for example.
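The address calculation for a relative branch can be sketched as below. For x86 relative jumps and calls the displacement is sign extended and added to the address of the instruction that follows the branch. The function names are ours, and masking the result to 48 bits is a simplification of how the upper address bits are really handled.

```python
MASK_48 = (1 << 48) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(next_instruction: int, displacement: int, disp_bits: int) -> int:
    """Target of a relative branch with an 8 or 32 bit displacement."""
    return (next_instruction + sign_extend(displacement, disp_bits)) & MASK_48


# A two byte short jump at 0x401000 with displacement 0xFE (-2) jumps to itself.
loop_target = branch_target(0x401002, 0xFE, 8)
# A five byte near call at 0x401000 with displacement 0x100.
call_target = branch_target(0x401005, 0x100, 32)
```

Note that the calculation needs to know where the branch instruction ends, which is the kind of information the "Offset in Instruction code" field of the BTB entry provides.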

We believe that the BTAC calculates the full 48 bit address. We believe so because it can be made to maintain the full 48 bits, which has several advantages. The 48 bits would be lost whenever the BTB is used to predict an address because it stores only a small portion of the address. The BTAC can be used to maintain 48 bits because the BTB identifies the location in the 16 byte line of the branch it uses. The BTAC can use this to find the right branch and subsequently add the displacement to keep the address at 48 bits.

There are two important tasks that need the full 48 bit address. First: the branch misprediction test hardware has to compare the full 48 bit "guess" address with the actual 48 bit address as calculated by the branch instruction. Secondly: the cache hit/miss test hardware needs the full 48 bit "guess" address (virtual) to translate and compare it with the (physical) address tag stored together with each cache-line.

There are some patents without BTAC that use a scheme of reversed TLB lookup to recover the full 48 bit (virtual) "guess" address from the (physical) cache tag and use this for the branch misprediction test. However, such an address is not useful for the cache hit/miss test ( it always hits! ).

4.12 Instruction Cache Hit / Miss detection, The Current Page and BTAC

The components for the Instruction Cache hit/miss detection are basically the same as those for the data cache. See section 3.3: "The Data Cache Hit / Miss Detection: The cache tags and the primary TLB's". The single port Instruction cache only needs a single tag ram and a single TLB. The instruction cache also has a second level TLB ( see section 3.4 ) and it has its own snoop tag ram ( section 3.19 ). All these structures are relatively simple to recognize on the die-photo.

The current page register holds address bits [47:15] of the "guessed" Instruction Fetch address. The BTB only stores the lower 15 Instruction Fetch address bits. The Fetch logic speculates that the next 16 byte instruction line will be fetched from the same 32 kB page and that the upper address bits [47:15] remain the same. Jumps and calls that cross the 32 kB border are mispredicted. The higher bits of the fetch address [47:12] are needed for the cache hit/miss logic. The virtual page address [47:12] is translated to a physical page address [39:12]. This page address is then compared to the two physical address tags read from the two way set associative instruction cache to see if there is a hit in either way.

The new BTAC ( Branch Target Address Calculator ) can recover the full 48 bit address from the displacement field in the instruction code two cycles after the code is fetched from the cache. This address can then be compared with the current page register to check if the assumption that the branch would not cross the 32 kB boundary was right. The cache hit/miss logic in the meantime has translated and compared the guessed address with the two instruction cache tags and produced the hit/miss result.
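The current page speculation and its later check can be sketched as follows. The function names are our own; the bit positions follow the text: bits [47:15] come from the current page register and the lower 15 bits come from the BTB.

```python
PAGE_SHIFT = 15                  # 32 kB "current page"
LOW_MASK = (1 << PAGE_SHIFT) - 1


def guessed_fetch_address(current_page: int, btb_low_bits: int) -> int:
    """Combine the current page register with the 15 bits stored in the BTB."""
    return (current_page << PAGE_SHIFT) | (btb_low_bits & LOW_MASK)


def current_page_ok(current_page: int, calculated_target: int) -> bool:
    """Check the guess against the full address calculated by the BTAC."""
    return (calculated_target >> PAGE_SHIFT) == current_page


page = 0x401000 >> PAGE_SHIFT
guess = guessed_fetch_address(page, 0x1234)
```

A branch from 0x401000 to 0x401234 stays within the same 32 kB page, so the guess is right. A branch to 0x409234 has the same lower 15 bits but a different page, so the guess would be 0x401234 and the check fails.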

Cache Hit / Miss and Current Page Test
Cache Hit	Current Page OK	Continue with the Instruction Line Fetched from the Instruction Cache
Cache Hit	Current Page not OK	Re-access the cache / TLB with the corrected Current Page
Cache Miss	Current Page OK	Real Cache Miss. Reload Cache-line from L2 or memory.
Cache Miss	Current Page not OK	Re-access the cache / TLB with the corrected Current Page

The processor continues with the 16 instruction bytes fetched from the cache if there was a cache hit and the 32 kB border was not crossed. The Fetch logic will re-access the cache if the 32 kB border was crossed and will ignore the hit/miss result in this case. If the 32 kB border was not crossed and the TLB thus translated the right fetch address and there was a cache miss then we may conclude that the cache miss was real and that we have to reload the line from memory or L2. The BTAC does not help in case of indirect branches. These still have to wait until the correct address becomes available from the retired branch instruction.
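The four cases of the table can be written as a small decision function. This is only a restatement of the table in code; the function name and the returned strings are ours.

```python
def fetch_action(cache_hit: bool, current_page_ok: bool) -> str:
    """Decide what the fetch logic does with the hit/miss and page results."""
    if not current_page_ok:
        # The TLB translated the wrong address, so hit or miss means nothing.
        return "re-access the cache / TLB with the corrected current page"
    if cache_hit:
        return "continue with the fetched instruction line"
    return "real cache miss: reload the cache-line from L2 or memory"
```

The current page test is checked first because the hit/miss result is only meaningful when the guessed address was right.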

4.13 Instruction Cache Snooping

The Snoop interface of the Instruction Cache is used to maintain Cache Coherency in a multiprocessor environment and for Self Modifying Code detection. Another processor that shares a cache-line with the Instruction cache sends snoop-invalidates throughout the system when it writes into the shared cache-line. The snoop interface checks if the snoop invalidate hits a cache line in the instruction cache. It will invalidate the line upon a hit. The snoop interface works with physical addresses as described in section 3.19.

The instruction cache can share cache-lines with other processors. It cannot share a cache-line with its own data cache however. The latter is forbidden because the processor must correctly handle Self Modifying Code programs. The Instruction and data cache are exclusive to each other as well as to the unified level 2 cache. The snoop interface detects if a cache-line load for the data cache hits a cache-line in the instruction cache and invalidates the cache-line upon a hit.

The instruction cache may share a cache-line with a data-cache on another processor. This so-called Cross Modifying Code case is less stringent. The exact moment at which the other processor overwrites the instruction code is uncertain. The only effect of a shared cache-line which is modified by another processor is that we see the modification somewhat later, as if the other processor was slightly slower.

Interestingly, the new ASN (Address Space Number) could make it possible for the instruction cache and data cache to share cache lines as long as they are assigned to different processes with different ASNs. This would be similar to the cross modifying case mentioned above. The hardware however does not support it because the ASNs are not stored together with the cache lines. It would not be worth the trouble anyway from a performance point of view.

Athlon 64, Bringing 64 bits to the x86 Universe

Regards, Hans
