Understanding
the detailed Architecture
of AMD's 64 bit Core
For
those who really want to know what it takes to build a world
class 64 bit
speculative
out of order processor core for a multi-processing
environment.
|
|
Index
|
|
Index
Chapter 1, The
Integer Core: The
Integer Super Highway
|
|
1.1 The
Integer Super Highway
1.2 Three way Super Scalar CISC architecture
1.3 A Third class of Instructions: Double Dispatch
Operations
1.4 128
bit SSE(2) instructions are split into Doubles
1.5 Using Doubles for 128
bit SSE(2) instructions avoids a 25% latency penalty
1.6 Doubles used for some Integer and x87 instructions as well
1.7 Doubles handle 128 bit memory accesses
1.8 Address
Additions before Scheduling
1.9 Register Renaming and Out-Of-Order Processing
1.10 Renaming
the Integer Registers
1.11 The IFFRF:
Integer Future
File and Register File
1.12 The "Future
File" section of the IFFRF
1.13 Exception or Branch Misprediction: Overwrite Speculative values with Retired values
1.14 The Reorder Buffer
1.15 Retirement and Exception Processing
1.16 Exception Processing
is always delayed until Retirement
1.17 Retirement of Vector Path and Double Dispatch Instructions
1.18 Out-Of-Order
processing: Instruction
Dispatch
1.19 The Schedulers / Reservation
Stations
1.20 Each x86 instruction can launch both an ALU and an AGU operation
1.21 The Scheduling of an ALU operation
1.22 The
Scheduling of an AGU operation for memory access
1.23 Micro Architectural Advantages of Opteron's Integer
Core
|
|
Index
Chapter 2, Opteron's
Floating Point Units
|
|
2.1 The Floating Point
Renamed Register File
2.2 Floating Point rename stage 1: x87
stack to absolute FP register mapping
2.3 Floating Point
rename stage 2: Regular Register Renaming
2.4 Floating Point
instruction scheduler
2.5 The 5 read and 5 write ports of the floating point renamed register file
2.6 The Floating Point processing units
2.7 The Convert and Classify units
2.8 X87 Status handling:
FCOMI /
FCMOV and FCOM / FSTSW pairs
|
|
Index
Chapter 3, Opteron's
Data Cache and Load / Store units
|
|
3.1 Data Cache: 64
kByte with three cycle data load latency
3.2 Two accesses per cycle, read or write: 8 way bank
interleaved, two way set associative
3.3 The Data Cache Hit / Miss Detection: The cache tags and the primary TLBs
3.4 The 512 entry second level TLB
3.5 Error
Coding and Correction
3.6 The Load / Store Unit, LS1 and LS2
3.7 The
"Pre Cache" Load / Store unit: LS1
3.8 Entering LS2: The Cache Probe Response
3.9 The
"Post Cache" Load Store unit: LS2
3.10 Retiring instructions in the Load Store unit and Exception Handling
3.11 Store to Load forwarding, The Dependency Link File
3.12 Self Modifying Code checks: Mutual exclusive L1 DCache and L1 I Cache
3.13 Handling multi processing deadlocks: exponential back-off
3.14 Improvements for multi processing and multi threading
3.15 Address Space Number (ASN) and Global flag
3.16 The
TLB Flush Filter CAM
3.17 Data Cache Snoop
Interface
3.18 Snooping
the Data Cache for Cache Coherency, The MOESI protocol
3.19 Snooping
the Data Cache for Cache Coherency, The Snoop Tag RAM
3.20 Snooping
the L1 Data Cache and outstanding stores in LS2
3.21 Snooping
LS2 for loads to recover strict memory ordering in shared memory
3.22 Snooping
the TLB Flush Filter CAM
|
|
Index
Chapter 4,
Opteron's
Instruction Cache and Decoding
|
|
4.1 Instruction Cache: More than instructions alone
4.2 The General Instruction Format
4.3 The
Pre-decode bits
4.4 Massively Parallel
Pre-decoding
4.5 Large
Workload Branch Prediction
4.6 Improved Branch Prediction
4.7 The Branch Selectors
4.8 The
Branch Target Buffer
4.9 The Global
History Bimodal Counters
4.10 Combined Local and Global Branch Prediction with
three branches per line
4.11 The
Branch Target Address Calculator, Backup for the Branch Target
Buffer
4.12 Instruction Cache Hit / Miss detection,
The Current Page and the BTAC
4.13 Instruction
Cache Snooping
|
|
Chapter
1, The
Integer Core: The
Integer Super Highway
|
|
|
|
|
|
1.1 The
Integer Super Highway
|
The Die photo of the Integer core is dominated by the 64 bit data busses running North-South. At some points the density may reach up to twenty busses. The busses carry all the source and result operand data between the Integer units. The layout of the busses is bit-interleaved, meaning that equal bit numbers are grouped together: the bits 0 of all the busses are next to each other at one side of this Integer super highway, while all bits 63 can be found at the other side. The separation into individual bytes is also clearly visible.
|
|
1.2
Three way Super Scalar CISC architecture
|
|
The Opteron is a 3 way super-scalar processor. It can decode, execute and retire three x86 instructions per cycle. These instructions can be quite complex (CISC) operations involving more than two source operands. The Pentium 4 handles three so-called uOps per cycle, where several of these uOps may be needed to implement a single x86 instruction. It may be that Prescott, the follow-up of the Pentium 4, can handle four uOps per cycle as we revealed here.
In general an x86 instruction can be expressed as:
F( reg,reg ),
F( reg,mem ) or
F( mem,reg ) where the first operand is both source and destination. The first two forms are general for Integer, MMX and SSE(2). The latter form is found mainly in Integer instructions: one source operand is loaded from memory and the result is written back to the same location. The Integer Pipeline handles Loads and Stores for all operations, including those for Floating Point and Multimedia instructions.
|
|
Overview
of Opteron's Processor
Core
|
|
|
1.3
A third class of Instructions: Double Dispatch Operations
|
The original Athlon (which we'll refer to as the Athlon 32) classifies instructions either as Direct Path or Vector Path. The first class contains all the less complex instructions that the hardware can handle as a single operation. The more complex instructions (Vector Path) invoke the micro sequencer, which executes a micro code program: instructions are read from micro code ROM and inserted into the 3-way pipeline.
The Opteron introduces a third instruction class: the Double Dispatch instructions, or simply "Doubles". The Doubles are generated near the end of the decoding pipeline. The instructions, which either followed the "Direct Path" or were generated by the Micro Code sequencer, are split into two independent instructions. The 3-way pipeline can thus generate up to six instructions per cycle. The instructions are "re-packed" back to three again in the PACK-stage. This extra pipeline stage has often been the subject of speculation since Opteron's introduction at the 2001 Micro Processor Forum. The six-fold symmetry of the "doubling stage" is clearly visible on the Die plot above.
|
1.4
128
bit SSE(2) instructions are split into Doubles
|
|
Appendix C of Opteron's Optimization Guide specifies to which class each and every instruction belongs. Most 128 bit SSE and SSE2 instructions are implemented as double dispatch instructions. Only those that cannot be split into two independent 64 bit operations are handled as Vector Path (Micro Code) instructions. Those SSE2 instructions that operate on only one half of a 128 bit register are implemented as a single (Direct Path) instruction. There are both advantages and disadvantages performance-wise here. A disadvantage may be that the decode rate of 128 bit SSE2 instructions is limited to 1.5 per cycle. In general, however, this is not a performance limiter because the maximum throughput is limited by the FP units and the retirement hardware to a single 128 bit SSE instruction per cycle. More important is that the extra cycle of latency that a Pentium 4 style implementation would bring is avoided.
|
|
1.5
Using Doubles for 128
bit SSE(2) instructions avoids a 25% latency penalty
|
|
In the Pentium 4 an SSE2 instruction is split at a later stage, in the Floating Point unit itself. The Floating Point unit accepts 128 bit source data at its first stage. It then splits the operation in two and combines the two results at the end into one single 128 bit result. This effectively adds one extra cycle to the total latency. For instance: the x87 FADD and FMUL take 5 and 7 cycles while the 128 bit (2x64) SSE2 equivalents need 6 and 8 cycles. The Opteron, like the Athlon 32, handles both FADD and FMUL in 4 cycles. The SSE2 equivalents are handled with the same 4 cycle latency. An extra cycle would mean a latency increase of 25% (5 cycles instead of 4), a serious performance limiter, so the correct decision has been made here. If you looked at a highly pipelined FP unit in action you would see mostly bubbles and few instructions: instructions are waiting for the results of others that have yet to finish. Latency is more important here than bandwidth.
The next Pentium, code-named Prescott, has an extra Floating Point Multiplier and Adder as we could reveal to you here. We now think that the extra FP units are used for single port but full 128 bit operation. This would bring the SSE2 latencies for Add and Multiply back to 5 and 7 cycles, beneficial for single thread programs. It would double the Floating Point bandwidth, which is mainly interesting for Hyper Threading performance.
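The 25% figure, and the remark that latency matters more than bandwidth, can be illustrated with a deliberately crude model. The sketch below is not a description of either processor: it assumes a fully pipelined unit that can start one operation per cycle, and the function names are made up for this illustration.

```python
def chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops operations where each one needs the previous result."""
    return n_ops * latency


def independent_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops unrelated operations on a fully pipelined unit
    (assumption: one new operation can start every cycle)."""
    return latency + (n_ops - 1) if n_ops else 0


def latency_increase(base: int, extra: int = 1) -> float:
    """Relative cost of adding `extra` cycles to a `base` cycle latency."""
    return extra / base


if __name__ == "__main__":
    for latency in (4, 5):
        print(latency, chain_cycles(100, latency), independent_cycles(100, latency))
    print(latency_increase(4))
```

In this model a chain of dependent operations pays the extra cycle on every single operation, while a stream of independent operations pays it only once. That is the sense in which a pipeline full of bubbles is limited by latency rather than by bandwidth.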
|
|
1.6
Doubles used for some Integer and x87 instructions as well
|
|
The Double Dispatch instructions are not only used for SSE and SSE2 instructions. Appendix C of Opteron's Optimization Guide also lists classic x86 instructions like POP and PUSH, some of the integer multiplications and the LEAVE instruction. All these instructions are handled by micro code on the Athlon 32, which is a lot slower. A number of classical x87 instructions are now handled by Doubles as well, for instance those FP instructions that have an integer as one of the source operands, which first needs to be converted to floating point.
|
|
1.7 Doubles handle 128 bit memory accesses
|
|
The 128 bit memory references used for SSE and SSE2 are likewise split up into two independent 64 bit accesses which are handled by the integer core. The results are snooped from the Load Data busses of the Data Cache by the Floating Point Core.
The decision to extend the Integer Registers from 32 bit to 64 bit and to split the 128 bit SSE(2) instructions into two 64 bit ones results in an elegant all 64 bit Micro Architecture.
There is a significant advantage in having an L1 Data Cache that can handle 128 bit loads or stores as two independent 64 bit loads or stores per cycle. Two 64 bit loads from different addresses into a single 128 bit SSE2 register with two moves are just as fast as loading a single 128 bit word from memory. Apple had a decent argument for introducing a 128 bit data type containing four 32 bit floating point values, which as such is usable for high quality ARGB color image data (given its customer base). Two 64 bit floating point numbers in a 128 bit word don't seem to serve any practical commercial application (other than making life miserable for compiler builders...). Providing separate 64 bit loads and stores at a two per cycle rate gives a compiler a better chance to combine unrelated 64 bit operations into a single 128 bit one.
|
|
1.8 Address
Additions before Scheduling
US
Patent 6,457,115.
|
|
A single x86 instruction may need many source operands when memory is involved:
address = base + index << scale + displacement + segment
Up to four arguments are needed to calculate the address (ignoring the 2 bit "scale-field" hard-coded in the instruction). This means that a typical x86 instruction of the format F(reg,mem) needs no fewer than 5 input operands! Now one of the parameters is a constant given by the instruction itself (displacement). Another parameter (segment) is a "semi-constant" and is typically zero in modern code with a non-segmented flat memory space. The Athlon 32 adds the segment to the address only when needed, after one of the three AGUs (Address Generator Units) has calculated the linear address. It does so during the Data Cache Access, which causes an extra cycle of cache load latency. The Opteron has a different implementation. The displacement and segment are summed together before the actual address calculation.
The segment value is considered a constant and thus, just like the displacement, known during decoding. The addition is made during decoding/dispatch and the result is passed on together with the rest of the instruction bits as a new "displacement field" of the instruction.
An exception is generated whenever the segment value does change. The results of operations depending on it are cancelled and the pipeline is restarted from the right point.
The "Decode-Time" address adder might be used for other address additions as well. (The 64 bit mode gets rid of most of the segmentation.)
|
|

|
One example is the new Relative Address mode, which adds the 64 bit Instruction Pointer (RIP) and the 32 bit displacement from the instruction together. By reducing the number of input parameters as much as possible during decoding we end up with a maximum of four input parameters for each instruction. Three of them are register variables and the fourth one is a constant.
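The effect of the decode-time addition can be sketched in a few lines. This is only an illustration of the arithmetic, not of the hardware: the function names and the 64 bit wrap-around mask are assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits (assumption for the sketch)


def full_address(base, index, scale, displacement, segment):
    """All inputs at once: base + (index << scale) + displacement + segment."""
    return (base + (index << scale) + displacement + segment) & MASK64


def fold_constant(displacement, segment):
    """Decode-time step: merge the constant and the semi-constant into one field."""
    return (displacement + segment) & MASK64


def agu_address(base, index, scale, const):
    """What remains for the AGU: two registers, the scale and a single constant."""
    return (base + (index << scale) + const) & MASK64


if __name__ == "__main__":
    const = fold_constant(0x20, 0)  # flat memory model: segment is zero
    print(hex(agu_address(0x1000, 3, 2, const)))
```

Both routes give the same address. The difference is only where the work is done: the folded constant is ready before the instruction reaches the scheduler, so the AGU never sees the segment as a separate operand.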
|
|
|
|
|
|
|
|
|
1.9
Register Renaming and Out-Of-Order Processing
|
|
The Athlon (and Opteron) uses some clever tricks to handle Register Renaming and OOO (Out-Of-Order) processing, which allow it to shave some 25% off the integer pipeline. The design allows for a simple and fast scheduler that doesn't need special hardware to handle mis-scheduling caused by cache misses.
Register renaming is used to eliminate "False Dependencies", which limit the number of Instructions Per Cycle (IPC) that a processor can execute. False Dependencies are the result of a limited number of registers. A register that holds an intermediate result needs to be re-used soon for another, maybe unrelated, calculation. Its value is then overwritten and not available anymore. The instruction that overwrites it must always wait for the instruction that needs the previous result. This serializes the execution of the instructions and limits the IPC. This is especially true for an architecture like x86, which has a very small number of registers. The example below shows how register renaming can eliminate false data dependencies: register rC is overwritten by the 3rd instruction, so the 3rd instruction has to wait for the 2nd instruction: a False Dependency. With register renaming we can use an "arbitrarily" large register file. There is no need to re-use rC (r3). We can simply use another available register instead, register r7 in this case. The basic rule is that all of the instructions that are "in-flight" are given a different destination register (single assignment).
Non Renamed: rC=rA+rB; rF=rC&rD; rC=rA-rB;
Renamed: r3=r1+r2; r6=r3&r4; r7=r1-r2;
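The three-instruction example can be reproduced with a toy renamer. This is a sketch under stated assumptions, not the Opteron's mechanism (which is described in the following sections): the initial mapping rA..rF to r1..r6, the free list starting at r7 and the function name are all invented for the illustration, and no earlier instructions are assumed to be in flight.

```python
def rename(program, initial_map, free_regs):
    """Give every in-flight instruction its own destination register.

    program:     list of (dest, src1, op, src2) with architectural names
    initial_map: architectural name -> register currently holding its value
    free_regs:   registers that are still unused (hypothetical free list)
    """
    mapping = dict(initial_map)
    free = list(free_regs)
    written = set()  # architectural registers that already have an in-flight writer
    renamed = []
    for dest, src1, op, src2 in program:
        # Sources are looked up before the destination is remapped.
        p1, p2 = mapping[src1], mapping[src2]
        if dest in written:
            # Re-use of a destination: take a fresh register instead of waiting.
            mapping[dest] = free.pop(0)
        written.add(dest)
        renamed.append((mapping[dest], p1, op, p2))
    return renamed


if __name__ == "__main__":
    program = [("rC", "rA", "+", "rB"), ("rF", "rC", "&", "rD"), ("rC", "rA", "-", "rB")]
    start = {"rA": "r1", "rB": "r2", "rC": "r3", "rD": "r4", "rE": "r5", "rF": "r6"}
    for dest, a, op, b in rename(program, start, ["r7", "r8"]):
        print(f"{dest}={a}{op}{b};")
```

The third instruction no longer writes r3, so it does not have to wait until the second instruction has read r3.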
|
|
1.10 Renaming
the Integer Registers
|
|
Opteron has sixteen 64 bit architectural integer registers. Not visible to the programmer are eight more 64 bit scratch registers used to store intermediate results for micro code routines that handle more complex x86 instructions. The Athlon family of processors handles Register Renaming in the simplest possible way. That is a compliment, because it often takes a lot of smart thinking to figure out how to do things in the simplest way! People only rarely succeed in this ...
As we said, each instruction in flight needs a different destination register. The total number of renamed registers must be equal to or larger than the sum of all instructions-in-flight plus the architectural registers. The maximum number of instructions in flight is 72; add the 16 architectural and 8 scratch registers and you need 96 "renamed registers". Two different structures are used to maintain these registers. The instructions-in-flight results are maintained by the result fields of the 72 entry Re-Order Buffer (ROB) and the architectural registers are maintained by the "Integer Future File and Register File" (IFFRF).
|
|
Re-Order-Buffer Tag definition |
field | wrap bit | Instruction In Flight Number: re-order buffer index 0...23 | Instruction In Flight Number: sub-index 0..2 |
bits | bit 7 | bit 6 to bit 2 | bit 1 to bit 0 |
|
This configuration allows for a very simple renaming scheme which takes -zero- cycles... Each instruction dispatched from one of the three decode lanes gets a "Re-Order Buffer Tag" or "Instruction In Flight Tag" consisting of:
1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched.
2) A value 0..23 that identifies the "cycle" in which the instruction was dispatched. The "cycle counter" wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.
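The tag can be modelled as a small bit field. The sketch below follows the layout of the tag definition above (wrap bit in bit 7, index in bits 6..2, sub-index in bits 1..0). The age comparison in `is_older` is our own assumption about how the wrap bit could be used, as is treating lane 0 as the oldest instruction within a line; the helper names are invented.

```python
ROB_LINES = 24  # reorder buffer index 0..23
LANES = 3       # sub-index 0..2


def make_tag(wrap: int, index: int, lane: int) -> int:
    """Pack wrap bit, reorder buffer index and sub-index into an 8 bit tag."""
    assert wrap in (0, 1) and 0 <= index < ROB_LINES and 0 <= lane < LANES
    return (wrap << 7) | (index << 2) | lane


def split_tag(tag: int):
    """Return (wrap, index, lane) for an 8 bit tag."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3


def next_line(wrap: int, index: int):
    """Advance the 'cycle counter': it wraps to 0 after 23 and flips the wrap bit."""
    index += 1
    if index == ROB_LINES:
        return wrap ^ 1, 0
    return wrap, index


def is_older(tag_a: int, tag_b: int) -> bool:
    """True if tag_a was dispatched before tag_b (both assumed to be in flight)."""
    wrap_a, idx_a, lane_a = split_tag(tag_a)
    wrap_b, idx_b, lane_b = split_tag(tag_b)
    if wrap_a == wrap_b:
        return (idx_a, lane_a) < (idx_b, lane_b)
    # Different wrap bits: the counter wrapped in between, so the larger index is older.
    return (idx_a, lane_a) > (idx_b, lane_b)


if __name__ == "__main__":
    print(LANES * ROB_LINES)                    # number of instructions in flight
    print(split_tag(make_tag(1, 23, 2)))
```

Because the tag is nothing more than the position the instruction will occupy in the reorder buffer, handing it out costs no lookup at all, which is why the renaming takes no extra cycle.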
|
|
1.11
The IFFRF: Integer Future
File and Register File
|
|
This register file is used to maintain the 16 architectural registers and the 8 temporary scratch registers. It has two entries for each of the 16 architectural registers. One of the two can be viewed as the actual register as seen by the programmer. It gets its value when the instruction that produced it has "retired". An instruction is retired when it is certain that no exception or branch misprediction has occurred and all preceding instructions have been retired as well. The value of the register is then said to be "non-speculative".
|
|
40 entry Integer Future File and Register File: IFFRF |
16 entries | Retired Architectural Register Values |
16 entries | Speculative Register Values: "Future File" |
8 entries | Temporary Registers |
|
Instructions-In-Flight and their results may be cancelled and discarded as long as they have not been retired. Cancellation can be the result of a preceding instruction that caused an exception, or of a branch misprediction. Instructions-In-Flight are in principle always speculative. The results stay speculative even if the instruction has finished. The results only become non-speculative at retirement, when the retirement logic determines that no exception has occurred.
|
|
1.12 The
Future
File section of the IFFRF
|
|
The second entry for each Architectural Register holds the so-called "Future" value. The 16 of them together constitute the so-called Future File. These entries contain the most recent value produced for a certain architectural register by any instruction (retired or not retired). The content of a future file register is speculative as long as the producing instruction has not yet retired. The value becomes non-speculative after a while if the producing instruction successfully retires.

The
Future File origins go back to 1985
Instructions write into the Future File as soon as their result is produced. The Future File however does not accept the result if it's not the very latest result for a certain register. If a later instruction has managed to finish earlier and has already written its result into the Future File, then the Future File will not accept results anymore for that register from older instructions. Finished Instructions address the Future File with the register number from the instruction code, a number from 0 to 15 for the 16 architectural registers. The "Re-Order Buffer Tag" is used to determine if a result may be overwritten. Each Future File entry has a corresponding Tag. We will see that an instruction may only write into the Future File entry if it still owns the entry: if the Tags match.
|
|
1.13
Exception
or Branch Misprediction: Overwrite Speculative values with Retired values
|
|
All speculative results are cancelled by copying the retired values of the IFFRF over the Future File values of the IFFRF.
The speculative results must be cancelled whenever the retirement logic detects that an exception occurred when the instruction or an earlier one was executed. There are many types of exceptions. Memory accesses can encounter a Page Miss or they can erroneously access a memory area which they are not entitled to access. The divide by zero is another well known exception. (It shouldn't be for Floating Point numbers because +/- infinity are perfectly valid IEEE Floating Point values.)
When we say Speculative Results here, we mean more specifically the results that may need to be cancelled because of erroneous Control Flow Speculation: the program flow went in a different direction than predicted. Now:
- A branch misprediction is basically the same as any other exception, but...
- All exceptions are also branch mispredictions.
An exception causes a change in the program flow much like a conditional call. All instructions that can cause exceptions are thus in fact conditional control flow instructions. Exceptions are however always predicted as not taken and ignored by the branch prediction hardware.
|
|
1.14 The Reorder Buffer
|
|
We mentioned retirement a number of times now. Retirement is handled with the aid of the reorder buffer. This unit does what its name suggests: it re-orders the instructions, putting them back into the original program order. The Schedulers are responsible for Out-Of-Order execution. They do so by launching instructions to the execution units whenever all their source operands are available and the needed execution unit is free. It's the reorder buffer that brings the instructions back into order again.
|
|
Operation of the Reorder Buffer |
Figure: the reorder buffer drawn as a grid, with index 1 to 12 across and lanes 0, 1 and 2 down. Legend: Out Of Order finished Instructions, results still speculative | Instructions being retired now | Retired Instructions, not speculative anymore.
|
The reorder buffer itself is split into three identical lanes. Each lane has 24 entries. The lanes and entries correspond to the reorder buffer Tag assigned to each instruction. Each Instruction that finishes writes its result into the reorder buffer using the reorder buffer Tag as address. The instructions also store any events that happened during execution that will require an exception. In particular, conditional branch instructions may report that the branch address they calculated does not correspond with the address that was predicted.
The instruction leaves further information needed for the reorder buffer to do its work. It leaves some info on what kind of instruction it is. It leaves the architectural register address (0..15) that corresponds with its destination register. The instruction also leaves the address where it is located in 'instruction' memory. Some of this info may already have been left in the reorder buffer earlier, when the instruction received its reorder buffer Tag.
The reorder buffer is shared by all instructions. It's also used to reorder Floating Point, SSE(2) and MMX instructions. These instructions however do not write their result data into the reorder buffer. They use the 120 entry renamed floating point register file for that purpose. The reorder buffer however is still used to track the instruction code info and address, exception flags, ready status and retirement status. All instructions are retired with the aid of the reorder buffer.
|
|
1.15 Retirement and Exception Processing
|
|
The image shows how Instructions can be retired at the moment when all previous instructions are retired. Three instructions can be retired per cycle. The Instruction Control Unit (ICU) accesses the reorder buffer contents for the three instructions. The instructions are retired if there are no exception flags set. The result data is written to the Retired Entries of the IFFRF. The latter is basically a write-only process. These values are only used in case of an exception. In this case they are used to overwrite the speculative values of the Future File. The ICU will handle a branch misprediction by forwarding the instruction's address to the Instruction Fetch Unit at the beginning of the pipeline. The branch will then be re-executed, now with the right prediction. Other exceptions require an exception routine call. The ICU can for instance save some relevant data in the temporary registers of the IFFRF and invoke the exception call or a micro code function.
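In-order retirement and the recovery after an exception can be sketched as follows. This is a simplified model and not the ICU's actual logic: the reorder buffer is a plain list with the oldest instruction first (rather than 24 lines of three lanes), every exception simply discards all instructions in flight, and the class and field names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RobEntry:
    dest: int                # architectural register number 0..15
    value: int = 0
    done: bool = False       # finished executing, possibly out of order
    exception: bool = False  # event recorded during execution


@dataclass
class Core:
    retired: list = field(default_factory=lambda: [0] * 16)  # retired register values
    future: list = field(default_factory=lambda: [0] * 16)   # speculative "Future File"
    rob: list = field(default_factory=list)                  # oldest instruction first

    def retire_cycle(self, width: int = 3) -> int:
        """Retire up to `width` instructions in program order; return how many retired."""
        count = 0
        while self.rob and count < width:
            head = self.rob[0]
            if not head.done:
                break  # younger finished instructions have to wait for this one
            if head.exception:
                # Cancel speculative state: retired values overwrite the Future File.
                self.future = list(self.retired)
                self.rob.clear()
                break
            self.retired[head.dest] = head.value  # result becomes non-speculative
            self.rob.pop(0)
            count += 1
        return count
```

Note that an instruction which finished early (out of order) stays in the list untouched until everything older has retired, and that the exception flag is only looked at when the instruction reaches the head.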
|
|
1.16
Exception Processing
is always delayed until Retirement
|
|
Exception processing must always be delayed until the instruction that caused the exception is not speculative anymore. A memory access exception for instance may be caused by accessing an Array with an Index that is out of bounds. The program may have a test for out-of-bound indices and code to handle it. The branch prediction hardware however will most likely predict that the out-of-bound test is Not True because the Index is OK most of the time. The processor will thus access the array with an out-of-bounds Index anyway and quite possibly cause a memory access exception. Exception handling is delayed until retirement, where the instruction plus its exception flags is discarded because of the branch misprediction. It's for the same reason that speculative Stores to memory are delayed and held in the LSU (Load Store Unit) until retirement.
|
|
1.17
Retirement of Vector Path and Double Dispatch Instructions
|
|
A single Vector Path instruction may produce many instructions. The Micro sequencer inserts these instructions into the 3 way pipeline, three per cycle. They do not mix with Direct Path instructions during decoding and retirement. The actual Retirement takes place when the last line of 3 instructions is ready. The retirement hardware scans from the first line of 3 micro code generated instructions to the last line, accumulating all possible exceptions that occurred. Retirement follows if no exception has occurred, otherwise the appropriate exception call is made.
Instructions generated by Doubles can mix with other (Direct Path) instructions during decoding and retirement. The two instructions generated by a Double must however retire simultaneously: imagine a PUSH that retires the memory store but doesn't retire the Stack Pointer update. This leads to the limitation that both instructions generated by a Double must be in the same 3 instruction line during retirement.
|
|
1.18
Out-Of-Order
processing: Instruction
Dispatch
|
|
We are now ready to describe in greater detail how Out Of Order processing is handled. We go back to the Instruction Dispatch Stage. Instruction Dispatch means here that Instructions are sent to the Schedulers. They are not sent to the execution units yet. The Instructions do access the Register File though. Three Instructions can look up a total of nine register values from the IFFRF each cycle. The Future File entries are accessed. The Future File contains the latest speculative results for each of the 16 architectural registers. The three Instructions are then placed together with these register values into the three Schedulers.
Now it's highly likely that not all previous instructions were finished, so many of the register values are older values from previous instructions. Each Instruction that is dispatched clears the valid flag for the architectural register it will modify. It also leaves its Tag. Succeeding instructions now know the Future File entry is not valid anymore, but they also know the Tag of the instruction that will provide the data they need. They will use this Tag later to pick up the result directly from the result busses. The instruction that invalidated the register will later finish, write its result to the Future File (if it still owns the entry) and then set the valid bit again. The instruction has ownership over an entry in the Future File if the Tags match. It acquires ownership when it is dispatched and it loses ownership if another instruction is dispatched that has the same destination register. An instruction writes its result into the Future File only if it is still the owner. Subsequent instructions can pick up the result from there. If the instruction is not the owner anymore then it won't modify the Future File entry. Any other instruction that needed this result was already dispatched and picks up the result directly from the result busses.
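The ownership rule is easy to get wrong when reading it in prose, so here is a minimal sketch of it. The class and method names are invented, tags are plain integers rather than real Re-Order Buffer Tags, and the result busses are left out: only the valid flag, the tag and the conditional write are modelled.

```python
class FutureFile:
    """Toy model of the Future File: one value, valid flag and owner tag per register."""

    def __init__(self, registers: int = 16):
        self.value = [0] * registers
        self.valid = [True] * registers
        self.tag = [None] * registers  # tag of the instruction that owns the entry

    def read(self, reg: int):
        """Dispatch-time source lookup: the value if valid, else the tag to wait for."""
        if self.valid[reg]:
            return ("value", self.value[reg])
        return ("tag", self.tag[reg])

    def dispatch(self, dest: int, tag: int) -> None:
        """A newly dispatched instruction clears the valid flag and leaves its tag."""
        self.valid[dest] = False
        self.tag[dest] = tag

    def finish(self, dest: int, tag: int, result: int) -> bool:
        """Write the result only if this instruction still owns the entry."""
        if self.tag[dest] != tag:
            return False  # a later instruction took over; its consumers use the busses
        self.value[dest] = result
        self.valid[dest] = True
        return True


if __name__ == "__main__":
    ff = FutureFile()
    ff.dispatch(3, tag=5)           # first writer of register 3
    ff.dispatch(3, tag=9)           # a later writer takes ownership
    print(ff.finish(3, 5, 111))     # the older result is refused
    print(ff.finish(3, 9, 222), ff.read(3))
```

The refused write is harmless: every instruction that wanted the older result was dispatched before the later writer, so it already holds the older tag and waits for it on the result busses.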
|
|
1.19 The Schedulers / Reservation
Stations
|
|
Each of the (up to) three Instructions that are Dispatched gets assigned to a Reservation Station within the Scheduler it is sent to. Each scheduler has eight Reservation Stations. That's up from six in the Athlon 32 and up from five in the first Athlon prototypes. The Reservation Station gathers all remaining input data needed for the instruction from the result busses. It monitors the Tags of these busses to see if the instructions from which data is needed are about to produce their results (Register File Bypass). The Tag busses run one cycle in advance of the result data busses. The Reservation Station does not need to look at all the busses. The Tag's sub-index identifies which of the three ALUs will produce the result. It also knows if the data will come from one of the two cache read ports. It can select the Tag bus in advance rather than having to test all the Tags.
|
|
The Scheduler's Reservation Station Entries
|
Instruction Data | "CONST": 64 bit Displacement + Segment or Instruction Pointer |
Input Data | VALUE: 64 bit register 'A' | TAG: reg. 'A' |
Input Data | VALUE: 64 bit register 'B' or 64 bit Index register | TAG: reg. 'B' / Index |
Input Data | VALUE: 64 bit Base register | TAG: Base reg. |
Input Status | VALUE: 4 bit ZAPS flags: Zero, Aux, Parity, Sign | TAG: ZAPS flags |
Input Status | VALUE: 1 bit OF/C flag: either OverFlow or Carry | TAG: OF/C flag |
|
|
|
|
Until now we've neglected the x86 status flags. Many x86 instructions use one or more of the six x86 status flags as an input. An x86 instruction may change some, all or none of the status flags. This means that different flags may be produced by different instructions. Luckily there are two rules that help to simplify the scheduler.
Rule 1: An instruction that modifies any of the ZAPS flags (Zero, Aux, Parity, Sign) modifies all of them. This means that these can be handled by a single 4 bit entry in the reservation station.
Rule 2: An instruction that uses the OverFlow flag (signed integer) does not use the Carry flag (unsigned integer). A single 1-bit reservation station entry can be used for the one which is needed.
|
|
1.20
Each x86 instruction can launch both an ALU and an AGU operation
|
|
A single x86 instruction waiting in a Reservation Station of one of the Schedulers can launch up to two operations. It can launch an integer operation to its associated ALU and it can launch a memory operation to its AGU (Address Generator Unit). The simplest integer instructions of type F( reg,reg ) do not access memory and launch an ALU operation only. Integer instructions of type F( reg,mem ) launch a memory load first and subsequently launch an ALU operation when the load data arrives. Integer instructions of type F( mem,reg ) are implemented in the same way. The Load is now a Load/Store operation. The Load/Store stays in the LSU (Load Store Unit), where it waits for the result of the ALU operation to be stored in memory. Non Integer instructions such as Floating Point and Multi Media instructions specifying a memory access will launch an AGU operation only. The Floating Point / MMX operation itself is then handled by the Floating Point Unit. Each Scheduler can launch one ALU and one AGU operation per cycle. The ALU operation may come from one x86 instruction while the AGU operation may come from another.
|
|
1.21 The Scheduling of an ALU operation
US
Patent 6,535,972.
|
|
An ALU operation generally needs two register operands and optionally
some status bits. An x86 instruction that accesses memory will leave the
Load value in register 'B'. The Reservation Station waits until it has
all needed input operands (data and status). The Scheduler observes all
eight reservation stations and will launch the ALU operation if it is
the oldest instruction that is ready to launch. The Scheduler sends all
operands plus instruction information to the ALU that is associated with
it.
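The "oldest ready instruction wins" rule can be sketched as follows. This is a minimal model under stated assumptions: the `Station` record and its `age` encoding (larger means older) are hypothetical, and the real hardware does this selection with parallel logic rather than a loop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Station:
    age: int     # hypothetical encoding: larger value = older instruction
    ready: bool  # True once all data and status operands are captured


def pick_oldest_ready(stations: List[Optional[Station]]) -> Optional[int]:
    """Return the slot number of the oldest ready entry, or None."""
    best: Optional[int] = None
    for slot, station in enumerate(stations):
        if station is None or not station.ready:
            continue  # empty slot, or still waiting for an operand
        if best is None or station.age > stations[best].age:
            best = slot
    return best
```

With eight slots per scheduler, an older entry that is still waiting for an operand does not block a younger entry that is ready.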
Reservation
Station entries typically involved in an ALU operation:
|
Instruction
Data |
|
"CONST"
64 bit Displacement + Segment
or Instruction Pointer |
|
|
Input
Data |
|
VALUE: 64 bit register 'A'
|
TAG: reg. 'A'
|
|
VALUE: 64 bit register
'B' or 64 bit Index register
|
TAG: reg. 'B' / Index
|
|
VALUE: 64 bit Base register
|
TAG: Base reg.
|
|
|
Input
Status |
|
VALUE: 4 bit
ZAPS flags: Zero, Aux, Parity,
Sign
|
TAG: ZAPS flags
|
|
VALUE: 1 bit
OF/C flag: either OverFlow or Carry
|
TAG: OF/C flag
|
|
|
The Reservation Station does not actually need to catch the last
operand(s) itself. The Reservation Station can be bypassed: the ALU may
receive the number of the bus which will carry the last result so it can
catch the operand itself. If you take a look at the die photo you will
see that all three ALU's are next to each other, even though each
receives operations only from its own scheduler. The bypass mechanism
lets them exchange data directly without the need to go back and forth
to the schedulers.
|
|
1.22
The
Scheduling of an AGU operation for memory access
US
Patent 6,457,115.
|
|
We saw how a single x86 instruction may need up to four arguments to
calculate the memory address (ignoring the 2 bit scale field hard-coded
in the instruction). This includes up to two register variables (base
and index). We also saw how displacement and segment could already be
added together during instruction decoding. Segment is considered a
semi-constant; a restore mechanism is provided for the rare case that it
is changed.
|
Reservation
Station entries typically involved in an AGU operation:
( address = base + (index << scale) + displacement + segment )
|
Instruction
Data |
|
"CONST"
64 bit Displacement + Segment
or Instruction Pointer |
|
|
Input
Data |
|
VALUE: 64 bit register 'A'
|
TAG: reg. 'A'
|
|
VALUE: 64 bit Index register or 64 bit register
'B'
|
TAG: Index reg. /
'B'
|
|
VALUE: 64 bit Base register
|
TAG: Base reg.
|
|
|
Input
Status |
|
VALUE: 4 bit
ZAPS flags: Zero, Aux, Parity,
Sign
|
TAG: ZAPS flags
|
|
VALUE: 1 bit
OF/C flag: either OverFlow or Carry
|
TAG: OF/C flag
|
|
|
The Reservation Station waits until it has all needed input operands.
The Scheduler observes all eight reservation stations and will launch
the AGU operation if it is the oldest instruction that is ready to
launch. The Scheduler sends all operands plus instruction information to
the AGU that is associated with it. The Reservation Station can be
bypassed for AGU operations as well: the AGU may receive the number of
the bus which will carry the last result so it can catch the operand
itself.
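The address calculation described in this section can be written out as a short sketch. The function name `agu_address` and the 64 bit wrap-around mask are assumptions made for illustration; the pre-added "CONST" term stands for the displacement plus segment value formed during decoding.

```python
MASK64 = (1 << 64) - 1  # results wrap around at 64 bits in this sketch


def agu_address(base: int, index: int, scale: int,
                displacement: int, segment: int) -> int:
    """address = base + (index << scale) + displacement + segment."""
    if scale not in (0, 1, 2, 3):
        raise ValueError("the 2 bit scale field encodes shifts 0..3")
    # Displacement and segment are added at decode time and travel
    # through the reservation station as one constant.
    const = (displacement + segment) & MASK64
    return (base + (index << scale) + const) & MASK64
```

For example, `agu_address(0x1000, 4, 3, 0x10, 0)` adds 0x1000, 4 * 8 and 0x10.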
|
|
1.23
Micro Architectural Advantages of Opteron's Integer Core
|
|
Extremely fast Schedule-Execute loop eliminates the need for
cache hit/miss speculation hardware.
One of the most elegant features of the Opteron Integer core is its
extremely fast Schedule-Execute loop, which is significantly faster than
that of any other architecture. It can schedule instructions in one
cycle and execute them in the next, and it does so at a very high
frequency. This effectively eliminates the need for wasteful logic to
correct the pipeline in case of an L1 Data Cache miss.
The scheduler sees one cycle in advance that load data is coming from
the L1 Data Cache by checking the Tag of the result bus. It uses this
information to schedule the instruction(s) that use the load data. At
the very end of the schedule cycle it then gets the hit/miss signal. A
miss prevents the instruction from being removed from the Scheduler in
the next cycle. The instruction still goes to the execution units (ALU
or AGU) but the miss flag invalidates it. Other instructions carry on
normally and the Scheduler continues in the next cycle with the victim
instructions still on board, now waiting for data to become available
from further on in the memory hierarchy.
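The hit/miss handling can be modelled roughly as below. This is a simplified sketch, not the actual control logic: the scheduler is reduced to a dictionary from a hypothetical instruction name to the result tag it waits for, and `schedule_on_load_tag` is an invented name.

```python
from typing import Dict, List, Tuple


def schedule_on_load_tag(waiting: Dict[str, int], load_tag: int,
                         cache_hit: bool) -> Tuple[List[str], Dict[str, int]]:
    """Launch the consumers of a load; keep them on a cache miss.

    Returns (launched, still_in_scheduler). Consumers are launched as
    soon as the tag appears, before hit/miss is known. On a miss their
    results are invalid, so they stay in the scheduler.
    """
    launched = [name for name, tag in waiting.items() if tag == load_tag]
    if cache_hit:
        remaining = {n: t for n, t in waiting.items() if n not in launched}
    else:
        remaining = dict(waiting)  # victims stay on board and wait again
    return launched, remaining
```

Instructions that do not depend on the load are untouched in both cases, which is the point made in the text.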
The
Pentium 4 in contrast will Replay
all dependent instructions issued in up to 7 cycles after the missed
load. Non dependent instructions do not need to be replayed. The
abundance of Double Pumped ALU capacity is mainly used to add extra
capacity for all the replayed instructions. The Alpha EV6
"pulls back" and invalidates all the instructions that were
scheduled in the two cycles after a load that misses even though it has
a short seven cycle pipeline running at a significantly lower
frequency.
The latter two architectures also use these mechanisms to support
another type of data speculation which is not yet supported by the
Opteron: Speculative Load/Store reordering. However, such a mechanism
may well be supported by a successor, and may do so with the smallest
miss-prediction penalty possible because of its extremely short
Schedule-Execute loop.
Avoids a large multi port register file between scheduler and
execution unit.
The micro-architectural feature which makes the short Schedule-Execute
loop possible is the functional split of the classical large Renamed
Register File into two sub-structures. One is the small, low latency
IFFRF. This is exactly the smallest possible subset needed during
Out-Of-Order execution. The other is the 72 entry Re-Order buffer. This
subset is the one that is needed for In-Order retiring and recovery from
exceptions and branch-miss-predictions.
The IFFRF is small and has a low single cycle latency. It has many read
and write ports (9 read, 8 write) for wide super-scalar operation. The
72 entry Re-Order buffer is larger but much simpler. It uses three
simple independent 24 entry sub-arrays. Each of the three basically
needs only one read and one write port. The four (1+3) units replace a
single large Renamed Register file which would have needed all the read
and write ports of the much smaller IFFRF.
Such a large Renamed Register File would probably add two cycles of
extra latency right in the middle of the Schedule-Execute loop.
|
|
Chapter
2, Opteron's
Floating Point Units
|
|
|
|
|
|
2.1
The Floating Point
Renamed Register File
|
|
Opteron's Floating Point renamed register file has been increased from
88 to 120 entries. It is a renamed register file in the classical
meaning of the word: a single entity that must contain all architectural
(non-speculative) and speculative values for the registers defined by
the instruction set.
The Opteron restores the support for 72 speculative instructions. This
support was decreased from 72 to 56 with the introduction of the Athlon
XP core, which included the eight 128 bit XMM registers for SSE but did
not increase the size of the 88 entry renamed register file.
Each 128 bit XMM register uses two entries in the renamed register file.
The Opteron thus uses 32 entries to hold the architectural (retired)
state of the now 16 XMM registers, which explains the increase: 88 + 32
makes 120 entries.
40 of the 120 entries are used to hold the architectural
(non-speculative) state of the registers defined by the instruction set:
32 are used for the sixteen XMM registers and 8 for the eight x87/MMX
registers.
A further 8 register entries are used for micro code scratch registers,
sometimes called micro-architectural registers. These registers are not
defined by the instruction set and are not directly visible to the
programmer. They are used by micro code to perform complex floating
point calculations such as the sine or log instructions.
The 48 (40+8) entries that define the architectural state of the
processor are identified by the 48 entry Architectural Tag Array. The
entries that hold the very latest speculative values for the 48
architectural register entries are identified by the 48 entry Future
File Tag Array.
The speculative state of the processor needs to be discarded in case of
a branch-miss-prediction or exception. This is handled by overwriting
the 48 entries of the Future File Tag Array with those of the
Architectural Tag Array.
Each entry of the renamed register file is 90 bits wide. Floating Point
values are expanded to a total of 90 bits (68 mantissa bits, 18 exponent
bits, 1 sign bit and 3 class bits). The three class bits contain extra
information about the floating point number. The class bits also
identify non floating point numbers (integers), which are not expanded
when written to the renamed register file.
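The entry counts quoted above can be checked with a few lines of arithmetic. The helper names are made up for this sketch; the numbers are the ones given in the text.

```python
ENTRIES_PER_XMM = 2  # each 128 bit XMM register takes two entries


def architectural_entries(x87_regs: int, scratch_regs: int,
                          xmm_regs: int) -> int:
    """Entries needed to hold the retired (non-speculative) state."""
    return x87_regs + scratch_regs + ENTRIES_PER_XMM * xmm_regs


def speculative_entries(total: int, x87_regs: int, scratch_regs: int,
                        xmm_regs: int) -> int:
    """Entries left over for speculative (in flight) results."""
    return total - architectural_entries(x87_regs, scratch_regs, xmm_regs)


# Width of one register file entry: mantissa + exponent + sign + class.
ENTRY_BITS = 68 + 18 + 1 + 3
```

With 8 x87/MMX and 8 scratch registers, an 88 entry file leaves 72 speculative entries without XMM registers and 56 with eight of them, while the 120 entry file with sixteen XMM registers is back at 72.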
|
|
|
The 120 registers |
|
Non-speculative registers:
 8  FP/MMX registers (arch.)
32  SSE/SSE2 registers (arch.)
 8  Micro Code Scratch registers (arch.)
|
Speculative registers:
 8  FP/MMX registers (latest)
32  SSE/SSE2 registers (latest)
 8  Micro Code Scratch registers (latest)
24  Remaining speculative
|
|
|
The 90 bit registers |
|
Subdivision of the 90 bits for FP:
68  Mantissa bits
18  Exponent bits
 1  Sign bit
 3  Class Code bits
|
|
|
Definition of the 3 bit Class Code |
|
0  Zero
1  Infinity
2  Quiet NaN (Not A Number)
3  Signaling NaN (Not A Number)
4  Denormal (very small FP number)
5  MMX / XMM (non FP contents)
6  Normal (FP number, not very small)
7  Unsupported
|
|
|
|
|
|
|
|
|
|
|
2.2
Floating Point rename stage 1: x87
stack to absolute FP register mapping
|
|
The
"stack features" of the legacy x87 are undone in this first
stage of the Floating Point pipeline. The x87 instructions access
the eight architectural 80 bit registers via a 3 bit Top Of Stack (TOS)
pointer. Instructions use the TOS as both source and destination. The
second argument can be another value on the stack relative to the TOS
register or a memory operand. The 3 bit TOS pointer is maintained in the
16 bit x87 FP status register.
The
x87 TOS register relative references are replaced by absolute references
which directly identify the x87 registers involved in the operation. A
speculative version of the TOS pointer is used for the translations. The
3 bit pointer can be updated by the actions of up to three instructions
per cycle. Instructions
can be speculative but are still in order at this stage. They've not yet
been scheduled by the Floating Point Out-Of-Order scheduler.
If
an exception or a branch-miss-prediction occurs then the speculative TOS
pointer is replaced with the non-speculative retired one which is retrieved
from the reorder buffer. The retired version reflects the value of
the TOS during the instruction just prior to the one that caused the
exception or branch miss prediction.
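The stack to absolute register translation amounts to modulo-8 arithmetic on the TOS pointer. The sketch below uses the standard x87 convention that ST(i) is register (TOS + i) mod 8 and that a push decrements TOS; the function names are invented for the example.

```python
def st_to_absolute(tos: int, i: int) -> int:
    """Map the stack relative reference ST(i) to an absolute register."""
    return (tos + i) & 0x7


def tos_after_push(tos: int) -> int:
    """A push (for example a load) decrements the 3 bit TOS pointer."""
    return (tos - 1) & 0x7


def tos_after_pop(tos: int) -> int:
    """A pop increments the 3 bit TOS pointer."""
    return (tos + 1) & 0x7
```

Because up to three instructions are translated per cycle, the hardware has to apply up to three such updates to the speculative pointer in one cycle; the sketch shows only a single step.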
|
|
2.3
Floating Point
rename stage 2: Regular Register Renaming
|
|
The actual register renaming takes place in this stage. Each instruction
that needs a destination register gets one assigned here. The
destination registers must be unique with respect to all other
instructions in flight: no two instructions may write to the same
renamed register.
Up
to three free register entries are obtained from the register free
list. There are 120 registers available in total. The free-list
can have a maximum of 72 free entries, equal to the maximum number of
instructions in flight.
The remaining 48 entries hold the values of the (non-speculative)
architectural registers: the eight x87/MMX registers, the eight scratch
registers (accessible by micro code only) and the sixteen 128 bit XMM
registers for SSE and SSE2, each using two entries. These registers are
not at a fixed location but may occupy any of the 120 entries. This is
what makes the free-list necessary.
The 48 entries occupied by the architectural registers mentioned above
are identified by the 48 entry Architectural
Tag
Array. It
has an entry for each architectural register with a value that points to
one of the 120 renamed registers.
Up to three instructions can thus be renamed per cycle. The data
dependencies are handled with the aid of another structure, the 48 entry
Future File Tag Array. This array contains pointers to the 48 renamed
registers that contain the very latest speculative values for each of
the architectural registers. The instructions that are being renamed
access this structure to obtain the renamed registers where they can
find their source operands. The instructions will then store the renamed
register which was allocated to them in the Future File Tag Array so
that subsequent instructions know where to find the result data.
Example: An instruction uses architectural registers 3 and 5 as input
data and writes its result back into register 3. It will first read
entries 3 and 5 to obtain the pointers to the renamed registers that
contain or will contain the latest values for registers 3 and 5, say
renamed registers 93 and 12. The instruction now knows its source
registers, 93 and 12, and can overwrite entry 3 of the Future File Tag
Array with the renamed register it was assigned to store its result, say
97. A subsequent instruction that needs architectural register 3 will
now use renamed register 97.
If
an
exception or branch-miss-prediction occurs then the 48 entries of the
Future File Tag Array
are overwritten with the 48 entries from the Architectural Tag
Array. All
speculative results are thereby discarded. The pointers in the
Architectural Tag Array were written there by the retirement logic. Up
to three values can be written per cycle for each line of instructions
that retires. The values are taken from the Reorder
Buffer. The Reorder
Buffer is shared by all instructions.
Floating Point instructions that finish write certain information, such
as exception status and the TOS used, into the Reorder Buffer. This
information also includes the destination register they modify. Both
the number of the architectural register and the number of the renamed
register are stored in the Reorder Buffer. The two of them are used to
update the Architectural Tag Array at retirement: the renamed register
number as the data and the architectural register number as the entry
number of the Architectural Tag Array.
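The interplay of the two tag arrays can be sketched with plain lists. This is a simplified model, not the hardware: the function names are invented, only one instruction is renamed at a time, and returning registers to the free list is left out.

```python
from typing import List, Tuple


def rename(future_tags: List[int], free_list: List[int],
           sources: List[int], dest: int) -> Tuple[List[int], int]:
    """Rename one instruction; returns (source tags, destination tag)."""
    # Read the source tags first, so that an instruction which reads and
    # writes the same architectural register sees the previous producer.
    source_tags = [future_tags[s] for s in sources]
    dest_tag = free_list.pop(0)    # take a free renamed register
    future_tags[dest] = dest_tag   # later instructions will read this
    return source_tags, dest_tag


def retire(arch_tags: List[int], dest: int, dest_tag: int) -> None:
    """At retirement the renamed register becomes architectural state."""
    arch_tags[dest] = dest_tag


def recover(future_tags: List[int], arch_tags: List[int]) -> None:
    """Exception or miss-prediction: discard all speculative mappings."""
    future_tags[:] = arch_tags
```

Replaying the example from the text: with entries 3 and 5 of the Future File Tag Array holding 93 and 12 and register 97 first on the free list, `rename(future_tags, free_list, [3, 5], 3)` yields the source tags 93 and 12 and leaves 97 in entry 3.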
|
|
2.4 Floating Point
instruction scheduler
|
|
The
Floating Point scheduler uses the following three criteria to determine
if it may dispatch an instruction to the execution pipeline it has been
assigned to ( FPMUL, FPADD, FPMISC )
1)
The instruction's source registers and/or memory operands will be
available.
2)
The instruction Pipeline to which the instruction has been assigned will
be available.
3)
The result bus for that instruction pipe will be available on the clock
cycle in which the instruction will complete.
The
scheduler will always dispatch the oldest instruction that is ready for
each of the three pipelines. When
we say will be
available then we mean
in two
cycles from the current cycle. It takes two
cycles to get an instruction into execution, one to schedule and another
to read the 120 entry renamed register file. An instruction checks
if its source registers are available first when it is placed in the
scheduler. After that it will continuously monitor the Tag busses of the
result busses for all source data still missing.
The Tag busses run two cycles ahead of the result busses. The scheduler
can thus see two cycles in advance which results will become ready. A
dispatched instruction will arrive two cycles later at its execution
unit, where it grabs the incoming result data from the selected result
bus.
The execution pipelines are 4 stages deep. Instructions with lower
latencies may leave the pipeline earlier, after two or three cycles. Two
cycles, however, is the shortest execution latency.
Instructions
that need load data from memory wait until the data arrives from the L1
Data Cache or from further away in the Memory Hierarchy. The scheduler
knows two cycles in advance that data is coming. This is one cycle more
than for integer loads. The extra cycle stems from the Data
Convert and Classify unit
that pre-processes Floating Point data from memory.
A load miss prevents the instruction which needed the load data from
being removed from the scheduler. The instruction stays in the scheduler
until the data arrives with a load hit. Any instruction that was
scheduled depending on a load that missed is invalidated and its results
are not written to the register file.
|
|
2.5
The 5 read and 5 write ports of the floating point renamed register file
|
|
The renamed register file is accessed directly after the instructions
are dispatched Out-Of-Order by the Scheduler. Up to three instructions
can access the register file simultaneously, one for each of the three
functional units. The FPMUL and FPADD instructions obtain two source
operands each, while instructions for the FPMISC unit need only a single
operand (2 + 2 + 1 = 5 read ports).
Three write ports are available to write results from the floating point
units back to the register file. The write addresses arrive earlier than
the result data. This is used to decode the write address in the cycle
before the write occurs.
All three units can have memory data as a source operand. The reorder
buffer tags that accompany the data coming from memory are translated to
renamed register locations by the load mapper. Two 64 bit loads can be
handled per cycle.
The new 120 entry register file shows bypass logic at both sides. The
bypasses are used to pass result and/or load data directly to succeeding
dependent instructions, thereby avoiding any extra delay that would
result from actually writing to and reading from the register file.
|
|
2.6 The Floating Point processing units
|
|
There
is a range of processing units connected to the FPMUL, FPADD and FPMISC
register file ports. The ports determine to which of the three floating
point pipelines a particular unit belongs.
The x87 and SSE2 floating point multiplier handles 64 bit and 80 bit
extended length multiplications. The large Wallace tree which handles
the 64 bit multiplications for 80 bit extended floating point and 64 bit
integer multiplications can be split into two independent Wallace trees
that handle the dual 32 bit SIMD multiplications used for SSE and 3DNow!
functions (US Patent 6,490,607). This unit can also autonomously handle
floating point divide and square root functions. These instructions are
not implemented with micro code but are handled entirely by this unit
itself with a single direct path instruction. The unit contains
bi-partite lookup tables for this purpose (US Patent 6,256,653). These
tables contain base values and differential values for rapid reciprocal
and reciprocal square root approximations, which are then used as a
starting point for the divide and square root instructions. This unit is
connected to the FPMUL ports of the register file.
The x87 and SSE2 floating point adder handles 64 bit and extended length
additions and subtractions. It is connected to the FPADD ports of the
register file.
The 3DNow! and SSE dual 32 bit floating point unit handles the single
length SIMD floating point instructions as introduced in 3DNow! by AMD
and SSE by Intel (the latter is called 3DNow! Professional in the Athlon
XP). This unit is connected to both the FPMUL and FPADD ports and can
handle one 64 bit (2x32) instruction of each group per cycle, so one MUL
type and one ADD type instruction per cycle. 128 bit instructions of
either type have a throughput of one per two cycles.
The 2x64 bit MMX/SSE ALU unit is a dual unit that can handle certain
packed integer 128 bit SSE instructions at a throughput of one per
cycle. It is connected to both the FPMUL and FPADD ports. The FPMUL
ports are used even though the instructions aren't multiplications but
rather adds, subtracts and logic functions. The idea is to double the
size of the operands that can be read from and written to the register
file to a full 128 bits. The 128 bit SSE instructions are still handled
as two individual 64 bit operations. The throughput is increased to one
per cycle because they can be executed by both the FPMUL and the FPADD
pipelines.
The
1x64 bit MMX/SSE
Multiplier unit handles
MMX and SSE integer multiplies. It is connected to the FPMUL ports of
the register file. It can handle a single 64 bit MMX instruction per
cycle or 128 bit SSE instruction with a 2 cycle throughput using two 64
bit operations.
The
FP Store unit,
more recently called the FP
Miscellaneous unit
handles not only the stores but also a number of other single operand
functions such as Integer to Float and Float to Integer conversions. It
further provides a lot of functions used by Vector Path generated micro
code to handle more complex x87 operations. It contains the Floating
Point Constant ROM that contains a range of floating point constants
such as pi, e, log2 et-cetera.
|
|
2.7 The Convert and Classify
units
|
|
Load data that arrives from the L1 Data Cache or from further on in the
Memory Hierarchy goes through the Convert and Classify unit first. The
Load data is converted, if appropriate, to the internal 87 bit floating
point format (1 sign bit, 18 exponent bits and 68 mantissa bits). The
floating point values are also classified into a three bit Class code.
The 87+3=90 bits are then stored in the 90 bit register file. The Class
code can subsequently be used to speed up floating point operations. For
example, only the class code needs to be tested to find out if a number
is zero, instead of all 86 mantissa plus exponent bits.
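To make the classification step concrete, here is a sketch that assigns a class code to an IEEE 754 double precision bit pattern. The code numbers follow the Class Code table in section 2.1; everything else is an assumption for illustration, in particular the function names and the restriction to 64 bit inputs. The real unit also converts the value to the internal 87 bit format, which is not modelled here.

```python
import struct

# Class codes as listed in the Class Code table of section 2.1.
(ZERO, INFINITY, QUIET_NAN, SIGNALING_NAN,
 DENORMAL, MMX_XMM, NORMAL, UNSUPPORTED) = range(8)


def double_bits(value: float) -> int:
    """Return the raw 64 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def classify_double(bits: int) -> int:
    """Classify an IEEE 754 double given as a 64 bit integer."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0:
        return ZERO if fraction == 0 else DENORMAL
    if exponent == 0x7FF:
        if fraction == 0:
            return INFINITY
        # The top fraction bit distinguishes quiet from signaling NaNs.
        return QUIET_NAN if (fraction >> 51) & 1 else SIGNALING_NAN
    return NORMAL
```

Once the class is stored next to the value, a test for zero becomes a 3 bit comparison, for example `classify_double(double_bits(0.0)) == ZERO`.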
We've seen that the Floating Point Scheduler runs two cycles ahead of
the actual execution units, one cycle more than the Integer Scheduler.
It observes the Tag busses that identify two cycles in advance which
results will become ready on a certain result bus. The Tag busses also
indicate in advance which data will come from memory. However, the
hit/miss signal may later indicate that the data was erroneous because
of a Cache Miss. The Convert and Classify units add an extra cycle with
at least somewhat useful work in order to give the scheduler the time to
take the Hit/Miss signal into account.
The Optimization manual has a whole appendix (E) dedicated to SSE and
SSE2 optimizations related to the classification of the contents of the
SSE registers. Instructions that operate on another data type than
expected should be avoided. Revision C does not need these optimizations
anymore. It is likely that Revision C can perform these format
translations itself without the intervention of microcode after an
exception.
|
|
2.8
X87 Status handling: FCOMI /
FCMOV and FCOM / FSTSW pairs US
Patents 6,393,555
&
6,425,074
|
|
AMD has managed to eliminate much of the x87 legacy overhead and has
sped up some important but problematic functions, more specifically
those involving the x87 status register. Early Athlons used a large
area to handle the processing of the 16 bit floating point status
register. This has all gone, some of it already in the Athlon XP.
Program
code with a conditional test on x87 floating point values used to kill
Out-Of-Order advantages because of the serializing nature of the
instructions that make the floating point status code available to the
Integer Pipeline which handles the conditional branches. The Opteron has
special hardware to avoid this serialization and to preserve Out Of
Order processing.
|
x87 Floating Point Status register
|
Bit     | Name | Description
15      | B    | x87 FP Busy
14      | C3   | Condition Code 3
13 - 11 | TOS  | Top of Stack
10      | C2   | Condition Code 2
9       | C1   | Condition Code 1
8       | C0   | Condition Code 0
7       | ES   | Exception Status
6       | SF   | Stack Fault
5       | PE   | Precision exception
4       | UE   | Underflow exception
3       | OE   | Overflow exception
2       | ZE   | Zero Divide exception
1       | DE   | Denormal Operand exception
0       | IE   | Invalid Operation exception
|
Different parts of the x87 floating point status register are handled in
different ways. The register is a bit of a mixture of different things.
It contains for example the 3 bit TOS pointer that indicates which of
the eight x87 registers is the current top of stack. The first Rename
Stage holds the speculative version of this pointer. It is used there to
translate the TOS relative register addresses to absolute x87 register
addresses. All finishing instructions preserve their copy of this value
in the Re-Order buffer when they finish. These copies then become the
non-speculative versions of TOS at the moment that the instructions are
retired out of the Re-Order buffer.
The Retirement Logic may detect that an exception or
branch-miss-prediction did occur. It then replaces the speculative
version of the TOS in the first rename stage with the latest retired,
non-speculative version. The speculative 3 bit TOS value is used before
the instructions are scheduled Out-Of-Order. The only time it is used
later on is during Retirement, which is handled In-Order again. This
means that special Out-Of-Order hardware for the TOS can be, and is,
eliminated.
The execution of a Floating Point instruction may itself cause an
exception. Most bits of the x87 status register are dedicated flags that
identify exceptions. Exceptions are always handled In-Order at
retirement time. This again means that any special Out-Of-Order hardware
for these bits can be, and is, eliminated.
The tricky part is in the CC (Condition Code) bits. These bits contain
exception data most of the time but may sometimes contain information
which is the result of a Floating Point compare and which must be
processed in a full Out-Of-Order fashion. The Opteron has special new
hardware to handle these cases. This hardware detects combinations of
instructions that need special handling.
Condition Code bits after an x87 Floating Point compare
|
C3 | C2 | C1 | C0 | Compare Result
0  | 0  | 0  | 0  | ST(0) > source
0  | 0  | 0  | 1  | ST(0) < source
1  | 0  | 0  | 0  | ST(0) = source
1  | 1  | 0  | 1  | Operands were unordered
|
|
The first combination is an FCOMI with an FCMOV. The first does a
compare and sets the CC bits according to the result. It then moves the
compare result to the Integer Status Register. The FCMOV then does a
conditional floating point move depending on the Integer Status bits.
Opteron's hardware allows full speed processing here by implementing an
Out-Of-Order bypass so that the FCMOV does not have to wait for the
actual Integer Status Flags.
The second combination is the FCOM and FSTSW pair. The first instruction
is identical to the FCOMI instruction with the exception that it does
not copy the CC bits to the Integer Status bits. It's the FSTSW
(Floating point Store Status Word) instruction that copies the 16
floating point status bits to the AX register or to a Memory Location
from where they can be used for conditional operations. The latter is a
serializing operation because all floating point instructions need to
finish first before the 16 status flags are known. The Opteron has
special hardware that does allow maximum speed Out-Of-Order processing
without the serializing disadvantage. It also provides a way to recover
from any (rare) miss-predictions.
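What software does with the status word after an FCOM / FSTSW pair can be sketched as follows. The bit positions follow the status register layout and the compare table shown earlier in this section; the function names are invented for the example.

```python
from typing import Optional, Tuple


def top_of_stack(status_word: int) -> int:
    """Extract the 3 bit TOS pointer from bits 13..11."""
    return (status_word >> 11) & 0x7


def condition_codes(status_word: int) -> Tuple[int, int, int, int]:
    """Return (C3, C2, C1, C0) from bits 14, 10, 9 and 8."""
    return ((status_word >> 14) & 1, (status_word >> 10) & 1,
            (status_word >> 9) & 1, (status_word >> 8) & 1)


COMPARE_RESULTS = {
    (0, 0, 0): "ST(0) > source",
    (0, 0, 1): "ST(0) < source",
    (1, 0, 0): "ST(0) = source",
    (1, 1, 1): "unordered",
}


def compare_result(status_word: int) -> Optional[str]:
    """Decode the outcome of a compare from C3, C2 and C0."""
    c3, c2, _c1, c0 = condition_codes(status_word)
    return COMPARE_RESULTS.get((c3, c2, c0))
```

A conditional branch on such a result needs all three bits, which is why the serializing FSTSW used to be so costly.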
The result of all AMD's x87 optimizations is that the Opteron literally
runs circles around the Pentium 4 when it comes to x87 processing. It
has removed large special purpose circuits for status processing and
implemented a few small ones that handle the cases mentioned. The shift
to SSE2 floating point, however, will make the removed area overhead
more important than the speed-ups.
|
|
|
|
|
|
Chapter
3, Opteron's
Data Cache and Load / Store units
|
|
|
|
|
|
3.1
Data Cache: 64
kByte with three cycle data load latency
|
|
The Opteron's relatively large L1 Data Cache supports a three cycle
Load-Use latency. Actually only the second and third cycle are used to
access the Cache memory itself. The first cycle is spent in the Integer
Pipeline for the x86 memory address calculation using one of the three
available AGU's. The address calculated by the AGU is sent to the memory
array in the second cycle, where it is decoded. This means that at the
end of the second cycle it is known at which word line the data can be
found.
The right data word is activated at the beginning of the third cycle.
Data is accessed in the memory array, selected and sent forward to the
Integer Pipeline or the Floating Point pipeline. Below is the more
detailed timing of a typical Integer x86 instruction F( reg,mem ). This
type of instruction first loads data from memory and then performs an
operation on it.
We see that in the same cycle in which the instruction is dispatched to
the Scheduler it is also dispatched to the so-called "Pre-Cache
Load/Store unit" or simply LS1. Instructions in this unit compete for
cache access together with those in LS2. The instructions in LS1 first
need to wait for their effective memory address. They monitor the result
busses of the AGU's. An instruction in LS1 knows from which AGU it can
expect its address. Instructions check the re-order buffer Tag which
identifies the address one clock-cycle in advance. In general, an
instruction in LS1 will fetch its address and wait for its turn to probe
the cache.
Typical timing of an F( reg,mem ) x86 operation.
|
Cycle | Integer Scheduler       | Load / Store Unit (LS1) | ALU's and AGU's     | Cache Address Decode | Cache Data Access
0     | Dispatched to Scheduler | Dispatched to LS1       |                     |                      |
1     | AGU Scheduled           |                         |                     |                      |
2     |                         | Load Scheduled          | Address Generation  |                      |
3     |                         |                         |                     | Cache Address Decode |
4     | ALU Scheduled           |                         |                     |                      | Cache Data Access
5     |                         |                         | Dependent Operation |                      |
|
Instructions may also route the address immediately to the cache if
there are no other (older) instructions waiting. This is the case in our
example above. In any case, each instruction will keep the address for
possible follow-on actions. In our case here the address is sent
directly from the AGU result bus to the Data Cache's address decoders.
Data comes back from memory one cycle later and is routed to the Integer
Pipeline. LS1 places the re-order buffer Tag one cycle in advance on the
Data Cache result Tag bus so that the Integer ALU schedulers can
schedule any instruction depending on the load data.
|
|
3.2
Two accesses per cycle, read or write: 8 way bank
interleaved, two way set associative
|
|
The Opteron's cache has two 64 bit ports. Two accesses can occur each
cycle. Any combination of loads and stores is possible. The dual port
mechanism is implemented by a banking mechanism: the cache consists of
8 individual banks, each with a single port. Two accesses can occur
simultaneously if they are to different banks.
|
Virtual Address bits used to access the L1 Data Cache
|
Cache Line Index | Bank        | Byte
bits 14 to 6     | bits 5 to 3 | bits 2 to 0
|
A single 64 byte cache line is subdivided in 8 independent 64 bit banks.
Two accesses are to two different banks if their addresses have a
different bank field, address bits 3 to 5. These are the lowest possible
address bits that can be used for this purpose. This scheme effectively
maps adjacent 64 bit words to different banks. The principle of data
locality makes these bits the most suitable choice.
The 64 kByte Cache is two way set-associative. The cache is split in two
32 kByte ways accessed with Virtual Address bits [14:0]. A hit in either
of the two ways is detected if the Physical Address Tag, bits [39:12],
which is stored alongside each cache line, is identical to bits [39:12]
of the Physical Address. Virtual to Physical address translation is
performed with the help of the TLB's (Translation Look aside Buffers).
A port accesses 2 ways and compares 2 tags with the translated address.
Each port has its own TLB to do the address translation.
The two 64 bit ports are used simultaneously when exchanging cache-lines
with the rest of the memory hierarchy. This means that the memory bus
from the unified L2 cache to the L1 data cache is now 128 bit wide. When
a new cache line is needed it will first take 4 cycles to evict the old
cache-line and then 4 more cycles to load the new cache-line when it
arrives.
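The address fields and the bank rule can be expressed in a few lines. This sketch follows the bit assignments given above (byte in bits 2..0, bank in bits 5..3, cache line index in bits 14..6); the function names are illustrative, and the check models only the "different banks" condition stated in the text.

```python
CACHE_BYTES = 64 * 1024
WAYS = 2
LINE_BYTES = 64


def byte_in_bank(addr: int) -> int:
    """Bits 2..0: byte within a 64 bit bank word."""
    return addr & 0x7


def bank(addr: int) -> int:
    """Bits 5..3: which of the 8 single ported banks is accessed."""
    return (addr >> 3) & 0x7


def line_index(addr: int) -> int:
    """Bits 14..6: cache line index within a 32 kByte way."""
    return (addr >> 6) & 0x1FF


def can_access_together(addr_a: int, addr_b: int) -> bool:
    """Two accesses can proceed in one cycle if their banks differ."""
    return bank(addr_a) != bank(addr_b)
```

Two adjacent 64 bit words, such as addresses 0x00 and 0x08, land in different banks, whereas 0x00 and 0x40 both map to bank 0 of different cache lines and would conflict under this rule.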
|
|
|
|
|
|
|
|
|
|
3.3
The
Data Cache Hit / Miss Detection: The cache tags and the
primary TLB's
US
Patent 6,453,387.
|
|
The L1 Data Cache has room to store 1024 cache lines out of the total of
17,179,869,184 cache lines that fit within the 40 bit physical address
space. Each access needs to check whether the stored cache line corresponds
to the memory location it wants to access. It is for this
purpose that the Tag rams store the
higher physical address bits belonging to each of the 1024 cache-lines.
There are two copies of the Tag ram to allow the simultaneous operation of
two access ports.
The Tag rams are accessed with bits [14:6] of the virtual address. Each Tag
ram outputs 2 Tags, one for each way
of the 2-way set-associative cache. The wanted cache-line can be in
either way. The Tag rams contain physical addresses. A physical address uniquely defines a memory position throughout the entire
distributed system memory.
The cache is however accessed with the virtual addresses as defined by the
program. Virtual addresses only have meaning within a process
context. This means that a virtual-to-physical address translation is needed to be
able to check the physical Tags. This translation is handled by a lengthy
sequence of four table lookups in memory: The
virtual address field [47:12] is divided into four equal sub-fields that
each index into one of the four tables. Each table points to the start
of the next table. The last table, the page table, finally contains the
translated address.
|
Virtual Address to Physical Address Translation: The Table Walk.
virtual address
| page map level 4 table offset | page directory pointer offset | page directory offset | page table offset | page offset |
| 47 to 39 | 38 to 30 | 29 to 21 | 20 to 12 | 11 to 0 |
Each of the four table offsets is 9 bits wide and indexes one table. The page offset is copied unchanged.
physical address
| physical address | page offset |
| 39 to 12 | 11 to 0 |
|
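The field split used by the table walk can be sketched as follows. This is a toy model, not the hardware: the tables are represented by a hypothetical Python dictionary, and all names are chosen here for illustration only.

```python
def split_virtual_address(vaddr):
    """Return the four 9 bit table offsets and the 12 bit page offset."""
    return {
        "pml4_offset": (vaddr >> 39) & 0x1FF,   # bits [47:39]
        "pdp_offset": (vaddr >> 30) & 0x1FF,    # bits [38:30]
        "pd_offset": (vaddr >> 21) & 0x1FF,     # bits [29:21]
        "pt_offset": (vaddr >> 12) & 0x1FF,     # bits [20:12]
        "page_offset": vaddr & 0xFFF,           # bits [11:0]
    }


def table_walk(vaddr, tables, root):
    """Walk four levels of tables, modelled as: table id -> {offset: next}."""
    fields = split_virtual_address(vaddr)
    node = root
    for key in ("pml4_offset", "pdp_offset", "pd_offset", "pt_offset"):
        node = tables[node][fields[key]]   # each entry points at the next table
    # After the last level, node is the physical page number, bits [39:12].
    return (node << 12) | fields["page_offset"]
```

Four dependent lookups are needed before the physical address is known, which is why the walk is so slow when the tables have to come from memory.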
This so-called Table-Walk is a very lengthy procedure indeed. The Opteron uses so-called Translation
Look aside Buffers (TLB's) to remember the 40 most recently used address
translations. 32 of these remember 4k page translations using
the scheme above. The remaining 8 are used for so-called 2M / 4M
page translations, which skip the last table and define the translations
for large 2 Megabyte pages. (The 4M pages are only used for backwards
compatibility.)
The virtual address bits
[47:12] are compared with all 40 entries of the TLB's in the second cycle of
the three-cycle cache access. At the end of the second cycle we know if any
one of
them matches. Each entry also contains the associated physical address bits
[39:12]. These are selected in the third cycle and compared with
the physical Tags to test if we have a cache hit.
|
|
3.4 The 512 entry second level TLB
|
|
If the necessary translation is not found within the 40 entries of the primary
TLB's, then there is a second chance that it is available in the level-2
TLB, which is shared
by both ports. This table contains 512 address translations. This larger
table can be used to update the primary TLB's with
a minor delay. It is organized
in a different way:
it has 512 entries and is 4-way set-associative.
This means that it
has 128 sets of 4 translations each. Virtual address bits [18:12] are
used to select one of the 128 sets. We get four translations,
giving us four chances that we have the translation we need. Each translation contains the rest of the
virtual address bits, [47:19]. We can check if we have the
right translation by comparing these bits with our address. The matching
entry then contains the associated
physical address field [39:12] we need.
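A minimal sketch of this set-associative lookup is given below. The data layout (a list of 128 sets holding tag and page pairs) and the replacement of the oldest entry on insertion are assumptions made for the example; the text does not describe the replacement policy.

```python
NUM_SETS = 128   # 512 entries / 4 ways


def l2_tlb_lookup(l2_tlb, vaddr):
    """Return the physical page bits [39:12] on a hit, or None on a miss.

    l2_tlb is a list of 128 sets; each set is a list of up to 4
    (virtual_tag, physical_page) pairs.
    """
    set_index = (vaddr >> 12) & 0x7F            # virtual address bits [18:12]
    virtual_tag = (vaddr >> 19) & 0x1FFFFFFF    # virtual address bits [47:19]
    for tag, physical_page in l2_tlb[set_index]:
        if tag == virtual_tag:
            return physical_page
    return None


def l2_tlb_insert(l2_tlb, vaddr, physical_page):
    """Store a translation; evicting the oldest entry is an assumed policy."""
    set_index = (vaddr >> 12) & 0x7F
    virtual_tag = (vaddr >> 19) & 0x1FFFFFFF
    ways = l2_tlb[set_index]
    if len(ways) == 4:
        ways.pop(0)
    ways.append((virtual_tag, physical_page))
```

Only four comparisons are needed per lookup, instead of the 40 parallel comparisons of the fully associative primary TLB's.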
|
|
3.5 Error Coding and Correction
|
|
The L1 Data Cache is ECC protected (Error Coding and Correction). Eight
check bits are stored for each 64 data bits to be able to correct single bit errors
and to detect dual bit errors with the help of a 64 bit Hamming SEC/DED
scheme (Single Error Correction / Double Error Detection). Six
parity bits are needed to retrieve the position of the error bit.
|
|
[Figure: ECC, 64 bit Hamming SEC/DED error location retrieval. A row of 64 data bits (bit 63 ... bit 0) with the erroneous bit marked x, above six rows that each cover 32 of the data bits. The column at the left holds the six parity check results, top to bottom: 0 1 1 0 1 0.]
|
The six bits are shown in the column at the left. A one means that a parity
error was detected. The six bits represent the parity of the 32 purple
bits in each row. The parity errors together now represent a 6 bit
index that points to the error position. Additional parity bits are used
to detect double bit errors and errors in the parity bits themselves.
(Thanks to Collin for bringing this to my attention)
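The way six parity results form an index can be tried out with a small Python sketch. Note the assumptions: the choice of which 32 data bits each parity covers (all bit positions that have a given index bit set) is a textbook layout picked for illustration, not the Opteron's documented layout, and a single extra overall parity bit stands in for the additional check bits. The check bits themselves are assumed to be intact.

```python
def position_parities(data):
    """Six parity bits; parity k covers the 32 data bits whose index has bit k set."""
    parities = 0
    for k in range(6):
        p = 0
        for i in range(64):
            if (i >> k) & 1:
                p ^= (data >> i) & 1
        parities |= p << k
    return parities


def overall_parity(data):
    """One extra parity bit over all 64 data bits."""
    return bin(data).count("1") & 1


def locate_single_error(received, stored_parities, stored_overall):
    """Return the index of a single flipped data bit, or None if no error is seen.

    Assumes at most one data bit flipped and intact check bits.
    """
    if overall_parity(received) == stored_overall:
        return None
    # Each parity that no longer matches contributes one bit of the index.
    return position_parities(received) ^ stored_parities
```

The extra overall parity is what distinguishes "no error" from an error in bit 0, since both give six matching parities in this layout.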
|
|
3.6 The Load / Store Unit, LS1 and LS2
|
|
The Load Store unit handles the accesses to the Data Cache. This type of unit
plays an increasingly important role in modern speculative
out-of-order processors. They are expected to grow significantly in size
and complexity in newer architectures on the horizon, which is an extra
reason to give the Opteron's Load Store units a closer look. The split
into LS1 and LS2 is sometimes described as LS1 being for the L1 Data Cache
and LS2 for the L2 Cache. This is far too simplistic, however, and even
incorrect. We'll go into more detail here.
|
|
3.7 The "Pre Cache" Load / Store unit: LS1
|
|
The Pre-Cache Load/Store unit (LS1) is the place where dispatched memory
accesses wait for the addresses generated by the AGU's (Address
Generator Units). LS1 has 12 entries. Whenever a memory
access is dispatched to the Integer Scheduler it is also dispatched to
an entry in LS1. The re-order tag bus belonging to the AGU indicates if
the required address is being calculated and will be available on the result bus of the
AGU in the next cycle. An access waiting in LS1 knows at which AGU to
look for the address.
When an instruction's address is about to arrive, or has already been received, it may
probe the cache. There are two access ports. The two oldest accesses
in LS1 will be allowed to probe the Cache. Both load and store
instructions probe the cache. A load will actually access the cache to
obtain the load data. The store presents its address but will never
write from LS1 to the Cache. Store instructions will only write after
they've received the data to be written and when they are retired.
Stores must be retired first because the store instruction may be speculative
and may be discarded later.
Imagine that Microsoft patches a buffer overflow exploit by adding a
test for the overflow. This test becomes a conditional branch that
prevents the write to the buffer in case of an overflow. The overflow tends to
never happen so the branch predictor will predict the branch as not-taken. It
will also do so
in the case that the overflow finally does happen. The write to the buffer will
now be executed speculatively.
So the actual writes to the cache
must be delayed until after retirement, when it's verified that the branch
predictions were correct. These
deferred stores do not introduce any real delays however. Loads that access the cache also
check LS1 and LS2 to see if there are any pending writes to the memory location
they are about to read. If so, then they catch the data directly from LS1
or LS2 without delay. The stores
in LS1 however do present their address to the
cache hit/miss detection logic. If it turns out that the cache-line is not present then it may be loaded as soon as possible from the Level 2 cache or from system
memory. This can be a good policy since there is a significant chance
that following loads will need the same cache-line. Stores may receive
the data they have to write to memory while waiting in LS1 as long as
the data comes in time;
otherwise they move on to LS2 to receive the data there.
|
|
3.8 Entering LS2: The Cache Probe Response
|
|
All accesses in LS1 probe the Cache and then move on to
the Post-Cache Load/Store unit (LS2).
An access
can either be a Load, a Store or a Load-Store. (The latter reads first
and then writes the result back to the same location.) All
accesses which came from LS1 first wait to see the results of the
cache probe: whether it was a cache hit or
a miss, and whether there was a cache parity error. They also receive the
physical address which was translated from the virtual address by
the TLB's. Together with the physical
address come the page attribute bits which determine for instance if the
memory is cacheable or not. Then,
in the following cycle, in case there was a cache miss, the instructions receive a so-called
MAB tag (Missed
Address Buffer Tag). This tag will later be used to see if a missed
cache-line arrives from the L2 cache or from system memory. The MAB tag
needs to be used instead of the generally used re-order buffer tags because
multiple Loads and Stores may depend on the same cache line and thus on
the same MAB tag: all these accesses miss and
they'll all receive the same MAB tag. The
Bus Interface Unit (BIU) will load missed cache-lines from the
unified L2 cache or system memory to fill the data-cache. It also
presents the so-called Fill-tag
to LS2. This fill-tag is compared to the MAB-tag of all accesses that
missed. The accesses that match the fill-tag are changed from miss to
hit.
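The MAB tag and fill-tag matching can be sketched as follows. The class, the function names and the simple "next free number" tag allocation are hypothetical and only serve the illustration; in hardware the comparison happens in all LS2 entries at once.

```python
class MissedAccess:
    """An access waiting in LS2; field names are illustrative."""

    def __init__(self, name, line_address, mab_tag):
        self.name = name
        self.line_address = line_address
        self.mab_tag = mab_tag
        self.hit = False          # it missed the cache when it probed


def assign_mab_tag(mab, line_address):
    """Give accesses to the same missing cache line the same MAB tag."""
    if line_address not in mab:
        mab[line_address] = len(mab)      # next free tag, assumed allocation
    return mab[line_address]


def broadcast_fill_tag(ls2_entries, fill_tag):
    """Change every waiting access whose MAB tag matches the fill tag to hit."""
    woken = []
    for entry in ls2_entries:
        if not entry.hit and entry.mab_tag == fill_tag:
            entry.hit = True
            woken.append(entry.name)
    return woken
```

One arriving cache line thus wakes up every load and store that was waiting for it, regardless of their re-order buffer tags.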
|
|
3.9 The "Post Cache" Load Store unit: LS2
|
|
The so-called Post-Cache Load Store unit (LS2) has 32 entries. It is
organized in a somewhat "shift register" like way so that the
oldest outstanding access ends up in entry 0. Each of the 32 entries has
numerous fields. Many of these fields are accompanied by a comparator
and other logic to see if the field matches a certain condition. All
accesses stay in LS2 at least until retirement. Accesses that missed the
cache will wait in LS2 until the cache-line arrives from the memory
hierarchy. All Stores wait in LS2 for their retirement before
actually writing data to memory.
|
Various fields in an LS2 buffer entry
Field groups:
| Type | Address & Data | Tags | Status Flags | Action Flags | .... |
Fields:
| Valid Flags | Acc. Type | Store Data (64 bit) | Virtual Address | Physical Address | mem Type | Instruction Tag | Write Data Tag | Missed Address Buffer Tag | Cache Hit / Miss | Retired access | Last Store in Buff (LIB) | Self Mod. Code Flag | Snoop Re-Sync Flag | Store Load Forward | .... |
|
|
|
Retired
Stores in LS2 that have the hit/miss flag set to hit may use a cache port simultaneously
with a probing store in LS1. The retired store from LS2 writes to the
data cache itself but does not use the cache hit/miss logic. The probing
store from LS1 only uses the hit/miss logic but doesn't access the data
cache itself. This shared use is important for performance because otherwise each store would
occupy a cache port twice, first while probing from LS1 and
again when writing from LS2 after retirement. This would halve
the store bandwidth of the L1 Data Cache.
|
|
3.10 Retiring instructions in the Load Store unit and Exception Handling
|
|
All access instructions, Loads as well as Stores, stay in LS2 until they are retired.
Loads may be removed directly from LS2 when they are retired to make
place for new instructions. Stores must still write their data to
memory. They wait to do so until retirement, when it is determined that
no exception or branch misprediction occurred. Stores are
removed from LS2 after they have committed their data to memory.
LS2 has a retirement interface with the re-order buffer. The re-order buffer
presents the Tag of the line that is being retired to LS2. It
only needs to present a single Tag for up to three instructions in a
line since these all have the same tag except for the 'sub-index' which
identifies the lane (0, 1 or 2). LS2 compares all instruction tags
with the retirement tag and sets the Retired
flag of those that match.
Retired loads may be deleted directly from LS2. If
the retirement logic of the re-order buffer has detected a branch
misprediction or exception then all instructions matching the retirement
tag and all those with succeeding tags are discarded from LS2. The only ones
left in LS2 are the retired stores that are waiting to commit their data to
memory.
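The retirement rules can be summarised in a short Python sketch. It is a simplification under stated assumptions: the entries are plain dictionaries with made-up field names, and tags are assumed to grow with program order, ignoring the wrap-around that real re-order buffer tags have.

```python
def retire_line(ls2, line_tag, flush=False):
    """Apply one retirement step to LS2 and return the surviving entries.

    ls2 is a list of dicts ordered oldest first, each with keys
    "line_tag", "lane", "kind" ("load" or "store") and "retired".
    Tags are assumed to increase with program order (no wrap-around).
    """
    survivors = []
    for entry in ls2:
        if flush and not entry["retired"] and entry["line_tag"] >= line_tag:
            continue                    # discarded: matching or succeeding tag
        if not flush and entry["line_tag"] == line_tag:
            if entry["kind"] == "load":
                continue                # retired loads leave LS2 directly
            entry = dict(entry, retired=True)   # stores wait to commit
        survivors.append(entry)
    return survivors
```

After a flush only the stores that were already retired remain, which matches the description above.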
|
|
3.11 Store to Load forwarding, The Dependency Link File
US Patent 6,266,744.
|
|
A Load probing the data cache will also check the Load Store units to see
if there are any outstanding stores to the same address as the load. If
it finds such a store (and the store is before the load in program order)
then there are two possibilities. If the store has already obtained the
write data from one of the result busses then this data can be directly
forwarded to the load. If the store has not yet obtained its data then
the load misses and moves to LS2. An
entry is created in a unit called the Dependency Link
File. This unit
now registers both the tag of the write data (which tells that the data-to-be-stored is coming in the next
cycle) and the Load tag, which is to
be used to tell a following instruction that the load data will be
available. The Dependency Link File keeps monitoring the write data tag and, as soon
as it detects it, puts the load instruction tag on one of the Cache
Load tag busses. It
does the same with the actual data when it comes one cycle later. The result data from
instruction 1 can be directly forwarded to the consuming instruction 4 in
the example below. Instructions 2 and 3 (the store and the load) are
bypassed in this case.
|
1) F( regA, regD );     // register A is a function of register A and register D
2) store ( mem, regA ); // store register A to memory
3) load ( regB, mem );  // load register B from the same memory location
4) F( regD, regB );     // uses register B and register D to calculate new value of register D
|
Mismatched
store to loads: Forwarding is not supported for stores that only modify part of the load data.
The load must first wait until the store is retired and written to memory. The load may then access the cache to get its data,
which is a combination of the stored data and the original contents of
the cache. The optimization manual describes all possible mismatch cases since
they can lead to a considerable performance penalty. Multiple
Stores to the same address are handled with the so-called LIB flag
(Last In Buffer). This flag identifies the most recent store to a certain
address. A newer load accessing the same address will choose this one.
Multiple partial stores to the same word, where each modifies only a part
of the word, are not supported by the Load Store buffer. They are not
merged in the Load Store buffer. They will be merged later on in the
cache after all stores are retired and written.
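The decisions a load has to make when it checks the outstanding stores can be sketched as follows. This is a simplified model under assumptions: stores are plain dictionaries, only an exact address and size match counts as forwardable, and picking the newest matching store plays the role of the LIB flag. The real set of supported cases is the one listed in the optimization manual.

```python
def probe_store_buffer(older_stores, load_addr, load_size):
    """Decide how a load is served, given the older stores (oldest first).

    Each store is a dict with "addr", "size" and "data" (None while the
    write data has not arrived yet). Returns a (decision, data) pair.
    """
    match = None          # newest store that exactly covers the load
    mismatch = False      # newest overlapping store only covers part of it
    for store in older_stores:
        start, end = store["addr"], store["addr"] + store["size"]
        if start == load_addr and store["size"] == load_size:
            match, mismatch = store, False     # newer store replaces older ones
        elif start < load_addr + load_size and load_addr < end:
            match, mismatch = None, True
    if mismatch:
        return ("wait for retirement", None)
    if match is None:
        return ("read cache", None)
    if match["data"] is None:
        return ("wait in dependency link file", None)
    return ("forward", match["data"])
```

With two stores to the same address the load receives the data of the second one, and a store that covers only half of the load forces the slow path through the cache.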
|
|
3.12 Self Modifying Code checks: Mutually exclusive L1 DCache and L1 ICache
US Patent 6,415,360.
|
|
Self Modifying Code (SMC) checks must in principle be performed for each
store. It must be tested whether the store modifies any of the
instructions in the Instruction Cache or any following instruction in
flight in any stage of execution. A significant simplification is made
by making the L1 Data Cache and L1 Instruction Cache exclusive to each
other: A cache-line can only exist in either one, not in both at the
same time. When a cache line is loaded into the L1 Data cache, it will
be evicted from the L1 Instruction cache.
The first advantage is that the contents of the Instruction Cache do not
need to be tested any further for SMC. The second advantage is that SMC
checks may be limited to Data Cache misses. Stores to un-cacheable
memory must always be checked.
(They always "miss".) The store's write-address is sent from
LS2 to the
SMC test unit which is close to the Instruction Cache. This unit holds
the cache-line addresses of all the instructions in flight. If there is
a conflict then it marks the store that caused the conflict. The
reorder buffer will discard all instructions which follow the store when
the store is retired.
|
|
3.13 Handling multi processing deadlocks: exponential back-off
US Patent 6,427,193.
|
|
Deadlocks
can occur when multiple processors fight for the ownership of the same
cache-line. They do so for instance if they both want to write to the
same line. A cache-line is generally loaded as soon as possible in case
of a cache-miss. In case of a store this will cause the cache-line to be invalidated in
other caches. Two processors get into a deadlock if
they keep invalidating each other's cache-lines before they are able to
finish their stores.
An example given is the case where two processors try to complete a store to
an unaligned address so that part of the store data goes to cache line
A1 and part of the store data goes to cache line A2. Unaligned stores of
this type are typically split into two stores by the hardware.
An exponential back-off
mechanism is provided to handle this kind of deadlock situation. When the memory access remains unsuccessful, a back-off
time is introduced before
retrying to become owner of the cache-line. This time grows exponentially
after each unsuccessful try until one of the processors finally succeeds.
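The back-off idea itself is easy to show in code. The sketch below is generic: the starting delay, the doubling factor and the retry limit are illustrative values chosen here, not figures from the patent, and `try_acquire` is a hypothetical callback standing in for the attempt to obtain the cache-line.

```python
def acquire_with_backoff(try_acquire, base_delay=1, max_tries=16):
    """Retry try_acquire() with a wait that doubles after every failure.

    Returns the list of back-off times (in cycles) that were waited.
    base_delay and max_tries are illustrative values.
    """
    waits = []
    delay = base_delay
    for _ in range(max_tries):
        if try_acquire():
            return waits
        waits.append(delay)     # a real core would stall for `delay` cycles here
        delay *= 2              # exponential growth after each failed try
    raise RuntimeError("ownership not obtained")
```

Because the two competing processors quickly end up with very different waiting times, one of them gets a window that is long enough to finish both halves of its store.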
|
|
3.14 Improvements for multi processing and multi threading
|
|
The Opteron's micro architecture has a large number of improvements
related to multi processing and multi threading. These are very important
improvements for the desktop market as well. Multi-processor on a
chip solutions are just around the corner and hyper-threading
may take a significant step forward in the near future with
Intel's Prescott.
The ability to run multi processing and multi threaded
applications efficiently becomes essential. Switching contexts,
starting and ending processes and threads, as well as
inter-process and inter-thread communication, are traditionally
associated with large overheads. Significant improvements have
been made to reduce these overheads to a minimum.
|
|
3.15 Address Space Number (ASN) and Global flag
US Patent 6,604,187.
|
|
Different processes can have different contexts, that is: different translations from
virtual to physical addresses. A process switch will cause the
Translation Look Aside buffers to be invalidated (flushed). Large
Translation buffers won't help you a lot if they are frequently flushed,
which can lead to significant performance degradation. The Opteron
introduces a new mechanism to avoid flushing of the TLB's. An
Address Space Number (ASN)
register is added together with an enable bit (ASNE).
The Address Space Number is used
to uniquely identify a process. Each entry in the TLB now includes the
ASN of the process. An address can be successfully translated if the
address matches the Virtual Address Tag in the TLB
and the ASN register
matches the ASN field in the TLB. The ASN field can be seen as an
"extension" of the Virtual Address. This means that
different translations of different processes can coexist in the TLB,
avoiding the need to flush the TLB's on context switches.
A global flag
is available for data and code that should be accessible to all
processes, typically operating system related. Global translations do
not require the ASN fields to match. This means that many processes can
share a single entry in the TLB to access global data. Another
advantage of the ASN and global flag is that flushing can be limited to
specific entries whenever an invalidation of the TLB is needed.
Only the entries which have a certain ASN or have the global bit set are
flushed.
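The matching rule with the ASN and the global flag can be sketched as follows. The entry layout and field names are invented for this example; only the matching condition itself is taken from the text.

```python
def tlb_lookup(tlb, virtual_page, current_asn):
    """Return the physical page for a matching entry, or None on a TLB miss.

    Each entry is a dict with "virtual_page", "physical_page", "asn" and
    "is_global".
    """
    for entry in tlb:
        if entry["virtual_page"] != virtual_page:
            continue
        # Global entries match for every process; others need the same ASN.
        if entry["is_global"] or entry["asn"] == current_asn:
            return entry["physical_page"]
    return None
```

Two processes can now keep different translations for the same virtual page in the TLB at the same time, and a context switch only has to change the ASN register.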
|
|
3.16 The TLB Flush Filter CAM
US Patent 6,510,508.
|
|
The TLB's can be seen as caches containing the translation information
stored in the address translation tables in memory. The actual
translation requires several levels of indirection through the tables
stored in main memory. This is the so-called "table walk",
a very time consuming process which may take many hundreds of cycles for
a single TLB entry. The Opteron attempts to speed up the table walk with a 24
entry Page Descriptor Cache.
Even so, it remains important to avoid the table walk whenever possible in a
multi-tasking multi-threaded environment. A table walk becomes necessary
whenever entries in the TLB do not correspond to the memory resident
translations anymore because somebody has modified the
latter.
Until now there was only one way to guarantee TLB coherency: Flush the
TLB's if it is possible that any of the entries is no longer identical to the
memory resident tables. Many actions in the x86 architecture result in
an automatic flush of the TLB's, often unnecessarily. A new feature in the
Opteron, the TLB flush filter, can avoid these costly flushes on many
occasions. The
TLB Flush filter is implemented as a 32 entry Content Addressable
Memory (CAM). It remembers the addresses of regions in memory that were
accessed when the TLB's were loaded. These regions thus belong to the
Page Translation Tables. The Filter then keeps monitoring all accesses
to memory to see if any of these regions are accessed again. If not, then
it may disable the flushing of the TLB's because coherency is guaranteed.
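A toy model of the flush filter is given below. Several points are assumptions made for the sketch: the monitored accesses are modelled as writes, a full CAM is treated as "coherency can no longer be guaranteed", and the class and method names are invented. Only the 32 entry size and the general idea come from the text.

```python
class TlbFlushFilter:
    """Toy model of the flush filter; the 32 entry size is from the text."""

    CAPACITY = 32

    def __init__(self):
        self.regions = set()       # page table regions read while loading the TLB
        self.must_flush = False    # set once coherency can no longer be guaranteed

    def note_table_walk(self, region_address):
        """Remember a memory region that supplied translation data."""
        if len(self.regions) >= self.CAPACITY and region_address not in self.regions:
            self.must_flush = True     # assumed behaviour when the CAM is full
        else:
            self.regions.add(region_address)

    def observe_write(self, region_address):
        """Monitor a modifying access; a hit means the TLB may be stale."""
        if region_address in self.regions:
            self.must_flush = True

    def flush_requested(self):
        """Return True if the TLB really has to be flushed, then start over."""
        needed = self.must_flush
        if needed:
            self.regions.clear()
            self.must_flush = False
        return needed
```

As long as nobody touches the remembered page table regions, a flush request from the x86 architecture can be ignored and the TLB contents survive.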
|
|
3.17 Data Cache Snoop Interface
|
|
The Snoop interface is used for a wide variety of purposes. It is used to
maintain Cache Coherency in a multiprocessor system. It is used for
conserving strict memory ordering in shared memory, for Self Modifying
Code detection, for TLB coherency, et cetera.
The snoop interface uses the physical addresses from other processors'
accesses, as well as from accesses issued on behalf of the instruction
cache, to probe various memories and buffers for data that has
something to do with that particular address.
|
|
3.18 Snooping the Data Cache for Cache Coherency, The MOESI protocol
|
|
The Opteron can maintain cache coherency in systems of up to 8 processors.
It uses the so-called MOESI
protocol for this purpose.
The snoop interface plays a central role in the implementation of the
protocol.
If a cache line is read from system memory (which may be connected to any
of the eight processors), then the read has to snoop all the caches of
all processors. Snoop accesses are much smaller than normal memory
accesses because they do not carry the 64 byte cache line data. Many
snoops may therefore be active without overloading the distributed memory system
throughput. A snoop may find the cache-line in one of the caches of
another processor.
If a processor does not find the cache-line in someone else's cache then it
loads it from system memory into its cache and marks it as Exclusive.
Whenever it then writes something in the cache-line, the line becomes Modified.
It does in general not write the modified cache-line back to memory. It
only does so if a special memory-page-attribute tells it to do so (write
through). The
cache line will be evicted only later on if another cache-line comes
in which competes for the same place in the cache.
If a processor needs to read from memory and
it finds the cache line in someone else's cache then it will mark the
cache line as Shared.
If the cache-line it finds in the other processor's cache is Modified
then it will load it directly from there instead of reading it from the
memory, which may not be up to date. Cache to cache transfers are generally
faster than memory accesses. The
status of the cache-line in the other cache goes from Modified
to Owned.
This cache-line still isn't written back to memory. Any other (third)
processor that needs this cache-line from memory will find a Shared
version and an Owned
version in the caches of the first two processors. It will obtain the Owned
version instead of reading it from system memory. The owner is the
last one that modified the cache-line and stays responsible for updating the
system memory later on.
A cache-line stays shared as long as nobody modifies the cache-line again. If one of the
processors modifies it then it must notify the other
processors by sending an
invalidate probe
throughout the system.
The state becomes Modified
in this processor and Invalid
in the other ones. If it continues to write to the cache line then it
does not have to send any more invalidate probes because the cache line
isn't shared anymore. It has taken over the responsibility to update the
system memory with the modified cache line whenever it must evict the
cache-line later on.
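The state changes described above can be replayed with a small Python model of a single cache line. It is a sketch under assumptions: the state of the line in each processor is kept in a plain dictionary, the step from Exclusive to Shared in the other cache follows the usual MOESI rule (the text does not spell it out), and data is assumed to come from memory unless an Owned or Modified copy exists.

```python
def read_line(states, reader):
    """Processor `reader` reads a line it does not hold; update all states.

    states maps processor name -> one of "M", "O", "E", "S", "I".
    Returns where the data came from: "memory" or the supplying processor.
    """
    holders = [p for p, s in states.items() if p != reader and s != "I"]
    if not holders:
        states[reader] = "E"
        return "memory"
    source = "memory"
    for p in holders:
        if states[p] == "M":
            states[p] = "O"        # the modifier stays responsible for write back
        elif states[p] == "E":
            states[p] = "S"        # assumed: standard MOESI behaviour
        if states[p] == "O":
            source = p             # cache to cache transfer from the owner
    states[reader] = "S"
    return source


def write_line(states, writer):
    """Processor `writer` writes a line it already holds."""
    if states[writer] in ("S", "O"):
        for p in states:
            if p != writer:
                states[p] = "I"    # effect of the invalidate probe
    states[writer] = "M"
```

Running the scenario from the text (a first processor reads and writes, a second and third read, then the second writes) walks the line through Exclusive, Modified, Owned, Shared and Invalid.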
|
|
3.19 Snooping the Data Cache for Cache Coherency, The Snoop Tag RAM
|
|
Other processors
that access system memory need to snoop the Data Cache to maintain cache
coherency using the MOESI protocol. We saw that there are two kinds of
snoops: Read snoops and Invalidate snoops. The basic task of a snoop is first
to establish if the Data Cache contains the cache-line in question.
There is a third set of Tags available specially for the snoop
interface. (The other two are used for the two regular ports of the data
cache.) The snoop-Tag ram has 1024 entries, one for each cache
line. It holds the Physical address bits [39:12] belonging to each cache line.
|
Virtual Address bits used to access the L1 data Cache
| virtual page address | offset in page | offset in cache line |
| W (Way), 14 to 12 | 11 to 6 | 5 to 0 |
Physical Address bits used to snoop the L1 data Cache
| physical page address | offset in page | offset in cache line |
| 15 to 12 | 11 to 6 | 5 to 0 |
|
The regular Tag rams are accessed with the virtual address. The Snoop Tag
ram however must deal with the physical address! Fortunately many of the
virtual address bits needed are identical to the physical address bits.
Only the upper bits differ and are thus useless: the three virtual index bits [14:12]
and the Way bit cannot be derived from the physical address. This means that we
must read out the Tags of all 16 possible cache-lines in parallel and
then test if any one of them matches. Luckily enough this doesn't present
too much of a burden. The total bus width (in bit-lines) of for instance
the cache rams is 512 bit. Sixteen times a 28 bit Tag is less (448) so
there's space left for some extra bits like the state info for each
cache-line.
Once we know which of the 16 possible cache-lines hits, we also know the
remaining virtual address bits needed to access the cache plus the Way
(0 or 1) which holds the cache-line. The position itself (1 out of 16)
directly provides the 3 extra address bits plus the Way
bit. This means we can now access the
cache if needed in case of a Read Snoop hit.
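The search over the 16 candidate locations can be sketched like this. The dictionary keyed by (way, virtual bits [14:12], bits [11:6]) is a hypothetical stand-in for the snoop Tag ram, and the loop does sequentially what the hardware does in parallel.

```python
def snoop_lookup(snoop_tags, physical_address):
    """Find the cache location of a physical line, or None if it is absent.

    snoop_tags maps (way, virtual_bits_14_12, bits_11_6) -> physical tag [39:12].
    """
    bits_11_6 = (physical_address >> 6) & 0x3F     # same in virtual and physical
    wanted_tag = physical_address >> 12            # physical address bits [39:12]
    for position in range(16):                     # all 16 candidates
        way = position >> 3
        virtual_bits_14_12 = position & 0x7
        if snoop_tags.get((way, virtual_bits_14_12, bits_11_6)) == wanted_tag:
            index = (virtual_bits_14_12 << 6) | bits_11_6   # cache line index, bits [14:6]
            return way, index
    return None
```

The position of the matching tag is all that is needed to rebuild the missing index bits and the Way, so a Read Snoop hit can go straight on to the data rams.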
|
|
3.20 Snooping the L1 Data Cache and outstanding stores in LS2
|
|
It is not necessary for snoop reads from other
processors that want to read a cache-line from the L1 data cache to
check for retired stores in LS2 that will write to the cache-line they
are about to read. This is so even though the data these stores will write is
already considered to be part of the memory by the processor that issued
the writes. It is OK for other processors to see these writes occur at
a later stage. The only effect externally is that it looks as if the
processor is slightly slower.
An external processor that writes to a shared cache line must send snoop
invalidates around.
The snoop interface will invalidate the local cache-line if it receives such
a snoop invalidate that hits the cache. The snoop interface must also
set the hit/miss flag to miss for all stores in the Load Store
unit that want to write to the cache-line that was hit. The latter is not a
specific snoop operation however. It is needed in all cases in which a
cache-line is evicted or invalidated. These stores that originally did
hit but are set back to miss will need to probe the cache
again.
|
|
3.21 Snooping LS2 for loads to recover strict memory ordering in shared memory
US Patent 6,473,837.
|
|
An interesting trick allows the Opterons to handle speculative out of order
loads from shared memory and still observe the strict memory access
ordering required for shared memory multiprocessing. The hardware
detects violations and can restore strict memory ordering when
needed.
A communicating processor may for instance first write a new command for
another processor to A1 in memory and then increment a value A2 to
signal that it has issued the next command. The processor which is
supposed to handle the commands may find the value A2 incremented but
still read the old command from A1 if it executes loads out of order.
The ability to handle loads out of order can
significantly speed up processing. Most notable is the example where a
first load misses the cache. An out of order processor may issue another
load which may hit the cache without waiting for the result of the first
load. It would be beneficial to maintain out-of-order loads in a
multiprocessing environment.
Another important speed improvement is speculative
processing. The first
load that missed may have been the counter A2 in our example. The new
command must be fetched if A2 has been increased. A conditional call is
made based on a test of the value of A2. A speculative processor
attempts to predict the outcome of the branch at the beginning of the
pipeline. It may predict that the counter has been incremented if it
generally takes more time to execute the command than it takes to
provide a new command. That
is: The new command is generally sitting waiting to be executed by the
time the previous command has been executed.
The speculative out of order processor may first attempt to load the counter
A2. It may miss, but the branch predictor has predicted that it was
increased and the command from A1 will be loaded for execution. The load
from A1 may hit the cache. We actually do not know if this is a new
command or not. Let's say it is the old one. The counter A2 still has to
be loaded from memory. If A2 is increased in the meantime then the load
that missed will cause the modified cache-line to be loaded into the local
data cache with the incremented counter included. The processor will
conclude that the branch prediction was correct and erroneously carry on
with the old command.
The Opteron has a snoop mechanism that allows this kind of fully speculative
out-of-order processing for high performance multi-processing. The mechanism
detects cases which may go wrong and then restores memory
ordering. We'll illustrate the mechanism with the use of our example.
When the first processor
writes a new command into A1 then it will send a snoop-invalidate around to
invalidate the cache-line in all other caches. This snoop invalidate
will also reach the snoop interface of the Load Store unit:
The snoop interface first checks the entries for a load that did hit the
cache-line-to-be-invalidated. This load would then be the "old
command" from A1 in our example. When it finds a load hit
then it continues by checking all older loads to see if any of them is
marked as a miss. This would then be the load of the A2 counter
value in our example. It marks the
Snoop ReSync flag
of all the load misses it
finds. This flag will cause any succeeding instructions to be
canceled when the load is retired, including the instruction that loads
A1. The load of A1 will be re-executed and will now correctly read
the new command from memory.
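The check performed by the snoop interface can be written down in a few lines. The representation of LS2 as a list of dictionaries, ordered oldest first, and the field names are assumptions made for this sketch.

```python
def snoop_invalidate(ls2_loads, invalidated_line):
    """Mark loads that must trigger a resync; returns the names that were marked.

    ls2_loads is ordered oldest first; each load is a dict with "name",
    "line", "hit" and "resync".
    """
    marked = []
    for position, load in enumerate(ls2_loads):
        if load["hit"] and load["line"] == invalidated_line:
            # A younger load already read the line; look for older misses.
            for older in ls2_loads[:position]:
                if not older["hit"] and not older["resync"]:
                    older["resync"] = True     # the Snoop ReSync flag
                    marked.append(older["name"])
    return marked
```

In the example from the text the load of A2 (older, missed) gets its flag set when the line holding A1 (younger, hit) is invalidated, so everything after the load of A2 is redone once that load retires.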
|
|
3.22 Snooping the TLB Flush Filter CAM
|
|
Snooping
is used to preserve memory coherency. The function of the TLB flush
filter is to prevent unnecessary flushes of the TLB's. It does so
by monitoring up to 32 areas in memory that are known to contain page
table translation information which is cached in the TLB's. These
entries must be snooped also by snoop invalidates from other processors
that may write to the page tables of our processor. If any of the
snoops hits a TLB flush filter entry then we know that a TLB may have
invalid entries and that the TLB flush filter may not prevent the
flushing of the TLB's anymore.
The
snoop-invalidates are not sent if a processor is sure that a cache-line
is not shared with other processors. This suggests that the TLB's
(being caches in their own right) participate in the MOESI protocol for
cache coherency via the
TLB flush filter. The
memory page translation tables ( PML4, PDP, PDE and PTE entries) may be
in cacheable memory. A special flag has to be set in the Opteron
if the Operating System decides to put the tables in un-cacheable
memory. (
TLBCACHEDIS in HWCR )
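A minimal sketch of the flush filter idea, assuming a simple set of tracked regions. The class and method names are hypothetical, and the behavior when all 32 entries are in use is our own guess, not something the text above specifies.

```python
class TlbFlushFilter:
    """Toy model: tracks up to 32 memory regions holding page table data."""
    MAX_ENTRIES = 32

    def __init__(self):
        self.regions = set()     # tracked page table region addresses
        self.may_filter = True   # False once the TLB may hold stale entries

    def track(self, region):
        """Remember a region whose translations are cached in the TLB."""
        if len(self.regions) < self.MAX_ENTRIES:
            self.regions.add(region)
        else:
            # Assumption: a full CAM means the filter has to give up.
            self.may_filter = False

    def snoop(self, region):
        """A snoop-invalidate that hits a tracked region disables filtering."""
        if region in self.regions:
            self.may_filter = False

    def flush_needed(self):
        """A requested TLB flush may only be skipped while may_filter holds."""
        return not self.may_filter

flt = TlbFlushFilter()
flt.track(0x1000)
flt.snoop(0x5000)            # unrelated line: flushes can still be filtered
print(flt.flush_needed())
flt.snoop(0x1000)            # page table region was written elsewhere
print(flt.flush_needed())
```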
|
|
Chapter
4, Opteron's
Instruction Cache and Decoding
|
|
|
|
|
|
4.1
Instruction
Cache: More than instructions alone
|
|
|
Access
to the Instruction cache is 128 bit wide. 16 bytes of
instructions can be loaded from the cache each cycle. The
instruction bytes are
accompanied by 76 bits of extra information. This
extends the total width of the Instruction cache port
to 204 bits.
We're still counting only
the bits that cover the full Instruction Cache. That is:
Each of the 1024 cache lines has its own set of these extra
bits. There are several more fields that have fewer than 1024
entries and are valid only for a subset of the cache lines.
|
|
|
|
Instruction only |
Total size |
|
Instruction Cache size: |
64 kByte |
102 kByte |
|
Cache Line size |
64 Byte |
102 Byte |
|
One Read Port |
128 bit |
204 bit |
|
One Write Port |
128 bit |
204 bit |
|
Well
known are the three so-called pre-decode
bits attached to each byte.
They mark the start and end points of the complex variable
length x86 instructions and provide some functional information. The other two
fields are the
parity bits,
1 parity bit for each 16 data bits, and the so-called
branch selectors
( eight times 2 bits for each 16 byte line of instruction code ).
|
|
|
Ram Size |
Bus Size |
Comments |
|
Instruction Code: |
64 kByte |
128 bit |
16 bytes instruction code |
|
Parity bits |
4 kByte |
8 bit |
One parity bit for each 16 bit |
|
Pre-decode |
26 kByte |
52 bit |
3 bits per byte (start, end, function) + 4 bit per 16 byte line |
|
Branch Selectors |
8 kByte |
16 bit |
2 bits for each 2 bytes of instruction code |
|
TOTAL |
102 kByte |
204 bit |
|
|
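The numbers in the table can be checked with a little arithmetic. The sketch below only recomputes the figures given above; the variable names are ours.

```python
LINE_BYTES = 16                      # bytes delivered per cache access
ACCESSES = 64 * 1024 // LINE_BYTES   # 4096 sixteen byte blocks in 64 kByte

bits_per_access = {
    "instruction code": LINE_BYTES * 8,         # 128 bits
    "parity": LINE_BYTES * 8 // 16,             # 1 bit per 16 data bits: 8
    "pre-decode": 3 * LINE_BYTES + 4,           # 3 per byte + 4 per line: 52
    "branch selectors": 2 * (LINE_BYTES // 2),  # 2 bits per 2 bytes: 16
}

total_bits = sum(bits_per_access.values())
kbytes = {name: bits * ACCESSES / 8 / 1024
          for name, bits in bits_per_access.items()}
print(total_bits, kbytes, sum(kbytes.values()))
```

The port width comes out at 204 bits and the four RAM sizes add up to 102 kByte, in agreement with the TOTAL row.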
The
Opteron's branch selectors are different from those of the Athlon (32)
and they now cover all 1024 cache-lines of the Instruction Cache. The branch
selectors contain local
branch prediction
information which can not be retrieved as readily as for instance the pre-decode
information. A piece of code has to be executed multiple times before
the branch-selectors become meaningful. This
is the reason that the branch selector bits are saved together with the instruction data in
the unified level 2 cache whenever a cache-line is evicted from the
instruction cache. The branch selectors represent one bit extra for each
byte. The level 2 cache has this bit already for ECC ( Error Checking
and Correction ) information. ECC is only used for data cache lines and
not for instruction cache lines. The latter do not need ECC; a
few parity bits per cache line are sufficient. Instruction cache
lines that are corrupted
can always be retrieved from external DRAM memory.
|
|
4.2 The General Instruction Format
|
|
A
short overview of the 64 bit instruction format:
A
series of prefixes can precede the actual instructions. At the start we
have the legacy prefixes. The most important legacy prefixes are
the operand size override prefix (hex 66) and the address size override
prefix (hex 67). These prefixes can change the length of the entire
instruction because they change the length of the displacement and
immediate fields which can be 1, 2 or 4 bytes long.
The
REX prefix (hex 4X)
is the new 64 bit prefix which brings us 64 bit processing. The value of
X is
used to extend the number of General Purpose registers and SSE registers
from 8 to 16. Three bits are used for this purpose because x86 can
specify up to three registers per instruction for data and address calculations.
The fourth bit is used as operand size override (64 bit or default size).
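As an illustration, the four bits of the REX prefix can be pulled apart as follows. The field names W, R, X and B are the ones used in the AMD64 manuals; the function name `decode_rex` is our own.

```python
def decode_rex(byte):
    """Split a REX prefix byte (hex 40 to 4F) into its four flag bits."""
    if byte & 0xF0 != 0x40:
        return None            # not a REX prefix
    return {
        "W": (byte >> 3) & 1,  # 1 = 64 bit operand size, 0 = default size
        "R": (byte >> 2) & 1,  # extends the MODRM reg field
        "X": (byte >> 1) & 1,  # extends the SIB index field
        "B": byte & 1,         # extends the MODRM r/m or SIB base field
    }

print(decode_rex(0x48))  # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
```

Each of R, X and B supplies the fourth bit of one 3 bit register number, which is how 8 registers become 16.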
|
|
AMD64
Instruction Format

|
|
The
Escape prefix (hex 0F) is used to identify SSE instructions. The Opcode
is the actual start of the instruction after the prefixes. It can be
either one or two bytes and may have an optional MODRM byte and SIB
byte. The optional displacement and immediate fields can contain
constants used for address and data calculations and can be 1, 2 or 4 bytes.
The total length of the instruction is limited to 15 bytes.
|
|
4.3 The
Pre-decode bits
|
|
Each
byte in the instruction cache is accompanied by 3 pre-decode bits
generated by the pre-decoder. These bits accelerate the decoding of the
variable length instructions. Each instruction byte has a start
bit that is set when
the byte is the start of a variable length instruction and a similar end
bit. Both bits are set
in case of a single byte instruction. More information is given with the
third bit, the function
bit. The decoders look
first at the function bit at the last byte of the variable length
instruction. If the function bit is 0 then the instruction is a
so-called direct path
instruction which can
be handled directly by the functional units. Otherwise if the function
bit is 1 at the end byte then the instruction is a so-called vector
path instruction, a
more complex operation that needs to be handled by a microcode program.
|
|
Definition of the Instruction Pre-decode bits
|
|
START bit
|
1 indicates first byte of an instruction |
|
END
bit
|
1 indicates last byte of an instruction |
|
FUNCTION bit
|
rule 1: Direct Path
instruction if 0 on the last byte
Vector Path instruction if 1 on the last byte
rule 2: 1 indicates Prefix byte of Direct
Path (except last byte)
0 indicates Prefix byte of Vector Path
(except last byte)
rule 3: For vector-path instructions
only:
if the function bit of the MODRM byte is set then
the instruction contains a SIB byte. |
|
Then,
secondly, the function bits identify the prefix bytes. Ones
identify prefix bytes of direct path instructions and zeroes define the
prefix bytes of vector-path instructions. Then, finally, in case of
vector-path instructions only: if the function bit of the MODRM byte is set then
the instruction also contains a SIB byte.
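A small sketch of how the start, end and function bits can be used. It only applies rule 1 (direct path versus vector path) and the instruction boundaries; the function name and the example bit patterns are hypothetical.

```python
def split_line(bits):
    """bits: list of (start, end, function) triples, one per instruction byte.
    Returns (first_byte, last_byte, path) for every complete instruction."""
    instructions = []
    first = None
    for pos, (start, end, function) in enumerate(bits):
        if start:
            first = pos
        if end and first is not None:
            # Rule 1: the function bit on the last byte selects the path.
            path = "vector" if function else "direct"
            instructions.append((first, pos, path))
            first = None
    return instructions

# A 1 byte direct path instruction followed by a 3 byte vector path one.
line = [(1, 1, 0), (1, 0, 0), (0, 0, 0), (0, 1, 1)]
print(split_line(line))  # [(0, 0, 'direct'), (1, 3, 'vector')]
```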
|
|
4.4 Massively Parallel
Pre-decoding
US
Patents 6,460,132
&
6,260,134
|
|
|
We
find a very large block of logic with fourfold symmetry right next to
the position where the 16 byte blocks of data are read from and written
to the instruction cache.
We'll
discuss the most likely candidate here: a fourfold incarnation of an
earlier pre-decoder described in gate level detail in US
Patent 6,260,134.
This
fourfold version can, according to the patent which describes it,
pre-decode an entire line of 16 bytes in only two cycles by means of what it
calls: massively parallel pre-decoding. This circumvents a basic
problem in variable length pre-decoding and decoding in general, being:
A
second instruction can not be decoded until the length of the first
instruction is known. The start position of the second instruction
depends on the length of the first instruction.
The
massively parallel pre-decoder avoids this problem by first pre-decoding
the 16 possible instructions in parallel. Each instruction starts at one of the
16
byte locations of the 16 byte line. It then filters out the real
instructions with the help of the program-counter which points to the
start byte of the next instruction, depending on where we jump into the
16 byte line.
16
bytes of instructions can be fetched per cycle from the instruction
cache to be fed to the decoders. It may be that the line is not yet
pre-decoded or wrongly pre-decoded. (Data bytes between instructions can
mislead the pre-decoder).
If
a branch is made to an address which does not have its pre-decode start
bit set then we know that something is wrong. The instruction pipeline
may invoke the pre-decoding hardware in this case to initialize or
correct the pre-decoding bits within only two cycles.
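The two step idea (decode at every byte position first, filter afterwards) can be sketched as below. The length rule in `toy_length` is a hypothetical stand-in, since real x86 length decoding is far more involved; only the structure of the algorithm is meant to match the description.

```python
def toy_length(line, pos):
    """Hypothetical length rule: the low two bits of the first byte
    give the instruction length (1 to 4 bytes)."""
    return (line[pos] & 0x3) + 1

def predecode(line, entry):
    """Return the start positions of the real instructions in a 16 byte line."""
    # Step 1: decode a candidate instruction at all 16 positions
    # (a loop here, independent hardware blocks in the processor).
    lengths = [toy_length(line, pos) for pos in range(16)]
    # Step 2: keep only the candidates reachable from the entry point.
    starts, pos = [], entry
    while pos < 16:
        starts.append(pos)
        pos += lengths[pos]
    return starts

line = bytes([0x02, 0xFF, 0xFF, 0x00, 0x01, 0xFF, 0x03, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(predecode(line, 0))
print(predecode(line, 1))
```

Entering the same line at byte 0 or at byte 1 gives a different set of start bits, which is why the program counter is needed to filter out the real instructions.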
|
|

|
The
massively parallel pre-decoder uses four blocks. These blocks are an
adapted version of an earlier pre-decoder. A single block
pre-decodes four possible instructions in parallel, each instruction starting at
one of four subsequent byte positions. The old single block was capable of
stepping through a 16 byte line in four cycles. The massively parallel
pre-decoder combines four of them and uses a second stage to resolve the
relations between the four: the start
/ end fixer / sorter.
|
|
4.5 Large
Workload Branch Prediction
|
|
|
Branch
Prediction is the
technique that makes it possible to design pipelined processors.
The outcome of a conditional branch is generally only known at
the very end of the pipeline while we need to have this
information at the very beginning of the pipeline. We need the
branch outcome to know which line of instructions to load next.
The
loading of a line of instructions already takes two cycles. If
we don't want to lose any more cycles then we must have decided
on a new instruction pointer by the end of the cycle in which the 16 byte
instruction line arrives from the instruction cache.
This
means that there is no time at all to even look at the
instruction bytes, to try to identify conditional branches, and
then to look up what the behavior was of these branches in recent
history in order to make a prediction. Doing this alone would
cost us several cycles.
|
|
|
|
|
4.6 Improved Branch Prediction
|
|
The
Branch prediction hardware does not make any attempt to look at
the fetched instruction bytes at all. It uses several data structures
instead to rapidly select a new address. It has a
2048 entry
Branch Target Buffer
and a 12 entry
Return Stack
to select a next Program
Counter address. It further uses two branch history structures, one for local and
one for
global history. It uses these branch history structures to predict the
outcome of the branches. The so-called branch
selectors are used for
local history while the global
history counters are
used for global history.
|
|
4.7 The Branch Selectors
|
|
The
branch selectors embody
the local history. Local means that the prediction is based on the
history of the branch itself alone. Conditional branches that are taken
almost always in the same way can be predicted with the branch selectors.
Unconditional branches are also handled by the branch selectors.
Remember that there is no time to look at the actual code. What a branch
selector says is that history has shown that a branch will be
encountered that is almost certainly taken, conditional or
unconditional. But what
if it's not so certain that a branch will be taken? The branch
selectors may leave the prediction in this case to the global
branch prediction. The
branch selectors will predict the branch as taken to identify the branch
but leave the final decision to the global
history counters by
setting the global flag.
|
|
16 byte line of
instruction code
|
|
0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9
|
10
|
11
|
12
|
13
|
14
|
15
|
|
BS0
|
BS1
|
BS2
|
BS3
|
BS4
|
BS5
|
BS6
|
BS7
|
|
Branch
Selection
|
K7 Athlon 32 |
K8
Athlon 64 |
|
3 |
take branch 2 |
take branch 3 (or return) |
|
2 |
take branch 1 |
take branch 2 (or return) |
|
1 |
return |
take branch 1 (or return) |
|
0 |
continue to next line |
continue to next line |
|
Each
16 byte line of instruction code is accompanied by eight 2 bit
branch-selectors (some patents talk about 9). The branch selector within
the line is selected with bits [3:1] of the Instruction Fetch address. The branch
selector answers the question: I entered this 16 byte line at this
particular address, now which 16 byte line should I load in the next
cycle? A line can
have multiple jumps, calls, returns. They can be conditional or
unconditional. We may have jumped anywhere in the middle of all these
branches. The branch selectors tell us what to do depending on where we
entered the line. The
K7 can predict two branches per line plus one return. The new 64 bit
core can predict up to three branches per line and any one of them may be
a return according to Opteron's optimization manual. ( There are no
patents yet so the table above is our own extrapolation ). The branch
selectors are saved together with the instruction code in the large One
Megabyte L2 cache whenever a cache-line is evicted from the instruction
cache. The most useful data to save there is the information which can't
be easily retrieved from the instruction code: the branch history.
Information like the actual branch
target address or the
fact that the branch is a return
is retrieved relatively fast in most cases by the processor.
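A minimal sketch of the selector lookup, assuming the K8 column of the table (which is itself an extrapolation). The function name and the example selector values are hypothetical.

```python
K8_ACTIONS = {0: "continue to next line",
              1: "take branch 1 (or return)",
              2: "take branch 2 (or return)",
              3: "take branch 3 (or return)"}

def select_action(selectors, fetch_address):
    """selectors: the eight 2 bit branch selectors of one 16 byte line."""
    index = (fetch_address >> 1) & 0x7   # bits [3:1] of the fetch address
    return index, K8_ACTIONS[selectors[index]]

# Hypothetical line: entering at bytes 0..5 leads to branch 1,
# entering at bytes 6..11 leads to branch 2, later entries fall through.
selectors = [1, 1, 1, 2, 2, 2, 0, 0]
print(select_action(selectors, 0x4008))  # (4, 'take branch 2 (or return)')
```

The same line thus gives a different answer depending on where we jump into it, without any instruction byte being inspected.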
|
|
4.8 The
Branch Target Buffer
|
|
The
BTB ( Branch Target Buffer ) contains 2048 addresses from which the
branch selectors can choose the next cycle's Instruction Fetch address. Fred
Weber's MPF2001 Hammer presentation shows us that each 16 byte line can
now have up to four branch target addresses to choose from ( Up from two in
the case of the Athlon 32 ). Each branch target entry is
shared between eight lines. From the branch selectors we know that any
single line may not use more than 3 of these. We assume that when a
branch selector says "select the 2nd branch", this means the second
branch available for the current line.
|
|
Most important Branch Target
Buffer fields
|
|
Line
Tag
(
3 bit )
|
15
bit
cache
index
|
Cache
Way
Select,
0 or 1 |
Return
Instruction |
Use
Global
Prediction |
Offset
in
Instr.
Code |
|
Field
|
Description
|
|
Line
Tag ( 3 bit )
|
A branch target buffer entry is shared between 8 lines.
The line tag tells us if this entry belongs to the current
line. |
|
15
bit cache index
|
These 15 bits are sufficient to access the two 32 kB ways
of the 64 kB 2 way set-associative Instruction Cache |
|
cache
way select ( 0 or 1 )
|
Used to check the way of the cache. |
|
Return
Instruction
|
This bit tells us to use the address from the return stack
instead to access the next line in the instruction cache. |
|
Use
Global Prediction
|
The Global Flag leaves the final Branch Prediction
to the Global History Counters. |
|
Offset
in Instruction
code
|
This tells us where the end of the branch is located in the 16
byte line of instruction code. |
|
Each
branch target entry needs a 3 bit tag to identify to which of the 8
possible lines of
instructions it belongs. Sharing branch target entries strongly reduces
the number of branch target addresses needed. 2048 entries still would
represent 12 kByte if the full 48 bit addresses were stored in the BTB.
This would be a relatively large memory which you won't find on Opteron's die. The trick
used here is to store only
the 16 bits which are actually needed to access the 64 kByte instruction
cache. The higher address bits are retrieved later on. The Opteron has a
new unit
called the BTAC ( Branch Target Address Calculator ) to support
this.
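The saving is easy to quantify. The sketch below counts only the address bits; the tag, flag and offset fields of each entry are left out.

```python
ENTRIES = 2048
full_bytes = ENTRIES * 48 // 8    # full 48 bit addresses
short_bytes = ENTRIES * 16 // 8   # 15 bit cache index + 1 bit way select
print(full_bytes // 1024, short_bytes // 1024)   # 12 4
```

Storing only the 16 cache access bits brings the address storage down from 12 kByte to 4 kByte.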
|
|
4.9 The
Global History Bimodal Counters
|
|
The
Athlon 64 has 16,384 branch history counters, four times as many as its
32 bit predecessor. The counters describe the likelihood that a branch
is taken. They count up to a maximum of 3 when branches are taken and
count down to a minimum of 0 when not taken. Counter values 3 and
2 predict a branch as taken, see the table.
|
|
Definition of the 2 bit Branch History
Counters
|
|
Counter
Value
|
Branch
Prediction |
|
counter
= 3
|
Strongly Taken
|
|
counter
= 2
|
Weakly Taken |
|
counter
= 1
|
Weakly not Taken |
|
counter
= 0
|
Strongly not Taken |
|
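The behavior of one such 2 bit counter can be sketched as follows. The function names are ours and the starting value of 0 is just an example.

```python
def update(counter, taken):
    """Move a 2 bit counter one step, saturating at 0 and 3."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

def predict_taken(counter):
    """Values 3 and 2 predict taken, 1 and 0 predict not taken."""
    return counter >= 2

counter = 0
predictions = []
for outcome in [True, True, True, False, True]:
    predictions.append(predict_taken(counter))
    counter = update(counter, outcome)
print(predictions, counter)   # [False, False, True, True, True] 3
```

Note that the single not taken outcome only moves the counter from 3 to 2, so the next prediction is still "taken". This hysteresis is the point of using two bits instead of one.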
The
GHBC is accessed by using four bits of the Program Counter and the
outcome (taken or not taken) from the last eight branches. This is
basically the same as in the Athlon 32. The fact that we now have four
times as many counters means that we have four branch predictors per 16
byte instruction line. This corresponds with the four branch target
addresses per line. This would be an improvement over the Athlon 32 where
the two branches per line could interfere with each other's branch
predictions.
|
|
Addressing the Branch History
Counters
|
|
Instruction
Address
bits 7:4
|
|
|
Branch outcome of the
eight
previous branches
|
|
|
( 4 bits )
|
( 8 bits )
|
|
16,384
Branch History Counters
|
|
Branch
prediction 0
|
Branch
prediction 1
|
Branch
prediction 2
|
Branch
prediction 3
|
|
Another
improvement is that only branches whose global bit was set participate in the global branch prediction. This
prevents branches
with a static behavior from polluting the global branch history. ( US
Patent 6,502,188 describes this in the context of the
Athlon 32 ) The global bit is set whenever a branch has a variable
outcome. The
GHBC table allows the processors to predict
global branch patterns of up to eight branches.
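A toy model of the addressing scheme and of the global bit filter. It assumes that the four address bits and the eight history bits are simply concatenated into a 12 bit index, and that the counters start at 1; neither is confirmed by the sources, and the class and method names are hypothetical.

```python
class GlobalPredictor:
    def __init__(self):
        self.history = 0                                # last eight outcomes
        self.counters = [[1] * 4 for _ in range(4096)]  # 4096 x 4 = 16,384

    def index(self, fetch_address):
        pc_bits = (fetch_address >> 4) & 0xF            # address bits 7:4
        return (pc_bits << 8) | self.history            # assumed concatenation

    def predict(self, fetch_address, slot):
        """slot 0..3 selects one of the four predictions for the line."""
        return self.counters[self.index(fetch_address)][slot] >= 2

    def retire(self, fetch_address, slot, taken, is_global):
        if not is_global:
            return        # static branches stay out of the global history
        row = self.counters[self.index(fetch_address)]
        row[slot] = min(row[slot] + 1, 3) if taken else max(row[slot] - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & 0xFF

p = GlobalPredictor()
p.retire(0x1234, 0, taken=True, is_global=False)  # ignored: global bit not set
p.retire(0x1234, 0, taken=True, is_global=True)   # updates counter and history
print(hex(p.index(0x1234)), p.history)
```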
|
|
4.10
Combined Local and Global Branch Prediction with three branches per line
|
|
A
single 16 byte line with up to three conditional branches represents a
complex situation. If we predict a first branch as not taken then we
encounter the next conditional branch, which must be predicted as well,
et cetera. Does the Opteron handle this in multiple steps, or does it
handle the whole multiple branch prediction at once?
|
|
Local
and Global Branch Prediction with
three Branches per Line
|
|
IF
|
AND
|
THEN |
|
Branch
Selector
Selects
Branch 1
|
Branch
1 is local, or global and predicted taken
|
TAKE
BRANCH 0
|
|
Branch
Selector
Selects
Branch 1
|
Branch 1 is global and predicted not taken and
Branch 2 is local, or global and predicted taken
|
TAKE
BRANCH 1
|
|
Branch
Selector
Selects
Branch 1
|
Branch 1 is global and predicted not taken and
Branch 2 is global and predicted not taken and
Branch 3 is local, or global and predicted taken
|
TAKE
BRANCH 2
|
|
Branch
Selector
Selects
Branch 1
|
Branch 1 is global and predicted not taken and
Branch 2 is global and predicted not taken and
Branch 3 is global and predicted not
taken
|
GO
TO NEXT LINE |
|
Branch
Selector
Selects
Branch 2
|
Branch 2 is local, or global and predicted taken
|
TAKE
BRANCH 1
|
|
Branch
Selector
Selects
Branch 2
|
Branch 2 is global and predicted not taken and
Branch 3 is local, or global and predicted taken
|
TAKE
BRANCH 2
|
|
Branch
Selector
Selects
Branch 2
|
Branch 2 is global and predicted not taken and
Branch 3 is global and predicted not taken
|
GO
TO NEXT LINE
|
|
Branch
Selector
Selects
Branch 3
|
Branch 3 is local, or global and predicted taken
|
TAKE
BRANCH 2
|
|
Branch
Selector
Selects
Branch 3
|
Branch 3 is global and predicted not taken
|
GO
TO NEXT LINE
|
|
If
we may take Fred Weber's MPF2001 presentation as an indication here then we
guess that it takes the branches one step at a time. (The presentation
shows a single GHBC prediction per cycle ). A potential bottleneck
may indeed be the GHBC. A second and a third branch need a different
"8 bit branch outcome" index into the table. The 8 bit value
should be shifted 1 and 2 positions further for the 2nd and the 3rd
branch with zeroes inserted to indicate "not taken" in order
to operate according to the rules.
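The one step at a time reading of the table can be written down as a short loop. This is a sketch of our extrapolated rules, not of confirmed hardware behavior; the function name is hypothetical and the returned number uses the 1 to 3 numbering of the "AND" column.

```python
def resolve(selector, branches):
    """selector: 0 (no branch) or 1..3, the first branch to consider.
    branches: dict mapping branch number to (is_global, predicted_taken),
    where predicted_taken is what the global history counter says.
    Returns the number of the branch to take, or None for the next line."""
    if selector == 0:
        return None
    for number in range(selector, 4):
        if number not in branches:
            break                      # no further branches in this line
        is_global, predicted_taken = branches[number]
        # A local branch is simply taken; a global one asks its counter.
        if not is_global or predicted_taken:
            return number
    return None

line = {1: (True, False), 2: (True, False), 3: (False, False)}
print(resolve(1, line))   # 3: branches 1 and 2 fall through, 3 is local
```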
|
|
4.11 The Branch Target Address Calculator, Backup for
the Branch Target Buffer
|
|
Another
new improvement is the BTAC, the Branch Target Address Calculator. This
unit is very useful for several purposes. It can generate full (48 bit)
branch addresses two cycles after the 16 byte line of code has been
loaded from the cache. It works for most branches which typically use an
8
or 32 bit displacement in the instruction to jump or call to code
relative to the program counter. The BTAC can probably identify return
instructions as well.
One
task of the BTAC is a backup function of the BTB (Branch Target
Buffer). The BTB shares each branch address with eight lines. We
may find that the branch selectors are OK but the branch target they
select has been overwritten by another branch. The branch selectors are
maintained for all cache-lines in the 64 kByte I Cache. They are also
preserved together with instruction cache-lines which are evicted from
L1 to the large 1 MegaByte L2 cache. It is unlikely that
branch-selectors which are reloaded from L2 into L1 still find their
branch target addresses in the BTB. On the contrary, the BTB entries
should be cleared whenever a cache-line is evicted from L1 to L2.
A
cache-line that returns from L2 to L1 can restore the pre-decode bits
rapidly (in two cycles with a massively parallel pre-decoder). It has to
restore the BTB entries as well but this can take much more time. The
Athlon 32 fills the BTB with instruction-addresses that come back from
the re-order buffer when the branch is retired. This procedure would be
repeated for each branch in the 16 byte line when it is taken. It may
well be that the Athlon 64 still works this way. The BTAC can take over
the functionality of the BTB until the BTB entries are restored.
The
BTAC can use the lowest Instruction Fetch address bits to see where we
enter the 16 byte line. It can then scan from that position to the first
branch and calculate the full 48 bit address by adding the 8 or 32 bit
displacement from the code. Now we have a calculated value which can be
used to index the cache. It is still a guessed address. The certain
address only comes when the branch instruction retires. The BTAC may
have picked the wrong branch for example.
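The address calculation itself is a sign extension and an addition. The sketch below follows the x86 rule that relative displacements count from the end of the branch instruction; the function names and the example addresses are ours.

```python
MASK48 = (1 << 48) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def branch_target(branch_address, instruction_length, displacement, disp_bits):
    """Full 48 bit target of a branch with an 8 or 32 bit displacement."""
    next_address = branch_address + instruction_length
    return (next_address + sign_extend(displacement, disp_bits)) & MASK48

# A 2 byte short jump at 0x401000 with displacement 0xFE jumps to itself.
print(hex(branch_target(0x401000, 2, 0xFE, 8)))   # 0x401000
```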
We
believe that the BTAC calculates the full 48 bit address. We believe so
because it can be made to maintain the full 48 bits, which has several
advantages. The 48 bits would be lost whenever the BTB is used to
predict an address because it stores only a small portion of the
address. The BTAC can be used to maintain 48 bits because the BTB
identifies the location in the 16 byte line of the branch it uses. The
BTAC can use this to find the right branch and subsequently add the
displacement to keep the address at
48 bits.
There
are two important tasks that need the full 48 bit address. First: The
branch misprediction test hardware has to compare the full 48 bit
"guess" address with the actual 48 bit address as calculated
by the branch instruction. Secondly: The cache hit/miss test hardware
needs the full 48 bit "guess" address (virtual) to translate
and compare it with the (physical) address tag stored together with each
cache-line.
There
are some patents without BTAC that use a scheme of reversed TLB lookup
to recover the full 48 bit (virtual) "guess" address from the
(physical) cache tag and use this for the branch misprediction test.
However such an address is not useful for the cache hit-miss test ( It
hits always! ).
|
|
4.12 Instruction Cache Hit / Miss detection,
The Current Page and BTAC
|
|
The
components for the Instruction Cache hit/miss detection are
basically the same as those for the data cache. See section-3.3:
"The
Data Cache Hit / Miss Detection: The cache tags and the
primary TLB's".
The single port Instruction cache only needs a single tag ram and
a single TLB. The instruction cache also has a second
level TLB ( see
section-3.4) and it has its snoop
tag ram (section-3.19). All these structures are relatively simple to
recognize on the die-photo.
The
current page
register holds address
bits [47:15] of the "guessed" Instruction Fetch address. The BTB only stores the
lower 15 Instruction Fetch address bits. The Fetch logic speculates that
the next 16 byte instruction line will be fetched from the same 32 kB
page and that the upper address bits [47:15] remain the same. Jumps and
calls that cross the 32 kB border are mispredicted. The higher bits of
the fetch address [47:12] are needed for the cache hit/miss logic. The
virtual page address [47:12] is translated to a physical page address
[39:12]. This page address is then compared to the two physical address
tags read from the two way set associative instruction cache to see if
there is a hit in either way.
The
new BTAC ( Branch Target Address Calculator) can recover the full 48 bit
address from the displacement field in the instruction code two cycles
after the code is fetched from the cache. This address can then be
compared with the current page register to check if the assumption that
the branch would not cross the 32 kB border was right. The cache
hit/miss logic in the mean time has translated and compared the guessed
address with the two instruction cache tags and produced the hit/miss
result.
|
|
Cache
Hit / Miss and Current Page Test
|
|
Cache
Hit
|
Current
Page OK
|
Continue with the Instruction Line
Fetched from the Instruction Cache
|
|
Cache
Hit
|
Current
Page not OK
|
Re-access the cache / TLB with
the corrected Current Page |
|
Cache
Miss
|
Current
Page OK
|
Real Cache Miss. Reload Cache-line
from L2 or memory. |
|
Cache
Miss
|
Current
Page not OK
|
Re-access the cache /
TLB with
the corrected Current Page
|
|
The
processor continues with the 16 instruction bytes fetched from the cache
if there was a cache hit and the 32 kB border was not crossed. The Fetch
logic will re-access the cache if the 32 kB border was crossed and will
ignore the hit/miss result in this case. If the 32 kB border was not
crossed and the TLB thus translated the right fetch address and there
was a cache miss then we may conclude that the cache miss was real and
that we have to reload the line from memory or L2. The BTAC does
not help in case of indirect branches. These still have to wait until
the correct address becomes available from the retired branch
instruction.
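The four cases of the table reduce to a simple decision, sketched below. The function name and the result strings are ours; the current page is taken to be address bits [47:15] as described above.

```python
PAGE_SHIFT = 15   # the BTB supplies bits [14:0]; bits [47:15] are guessed

def fetch_outcome(current_page, btac_target, cache_hit):
    """current_page: guessed bits [47:15]; btac_target: full 48 bit address."""
    page_ok = (btac_target >> PAGE_SHIFT) == current_page
    if not page_ok:
        # The hit/miss result belongs to the wrong address: ignore it.
        return "re-access cache / TLB with corrected current page"
    if cache_hit:
        return "continue with fetched line"
    return "real cache miss: reload line from L2 or memory"

page = 0x401000 >> PAGE_SHIFT
print(fetch_outcome(page, 0x401234, cache_hit=True))
print(fetch_outcome(page, 0x409234, cache_hit=True))
```

The second call crosses the 32 kB border, so the cache hit is ignored and the fetch is repeated with the corrected page.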
|
|
4.13 Instruction
Cache Snooping
|
|
The
Snoop interface of the Instruction Cache is used to
maintain Cache Coherency in a multiprocessor environment and for Self Modifying
Code detection. Another processor that shares a cache-line with a
cache line in the Instruction cache sends snoop-invalidates throughout
the system when it writes into the shared cache-line. The snoop
interface checks if the snoop invalidate hits with a cache line in the
instruction cache. It will invalidate the line upon a hit. The snoop
interface works with physical addresses as described in section 3.19.
The
instruction cache can share cache-lines with other processors. It cannot
share a cache-line with its own data cache, however. The latter is
forbidden because the processor must correctly handle Self Modifying
Code programs. The Instruction and data cache are exclusive to each
other as well as to the unified level 2 cache. The snoop interface
detects if a cache-line load for the data cache hits a cache-line in the
instruction cache and invalidates the cache-line upon a hit.
The
instruction cache may share a cache-line with a data-cache on another
processor. This so-called Cross Modifying Code case is less stringent.
The exact moment at which the other processor overwrites the instruction
code is uncertain. The only effect of a shared cache-line which is
modified by another processor is that we see the modification somewhat
later, as if the other processor was slightly slower. Interestingly,
the new ASN (Address Space Number) could make it possible for
the instruction cache and data cache to share cache lines as long as
they are assigned to different processes with different ASN's. This
would be similar to the cross modifying case mentioned above. The
hardware however does not support it because the ASN's are not stored
together with the cache lines. It would not be worth the trouble anyway
from a performance point of view.
|
|
Athlon
64, Bringing 64 bits to the x86 Universe
|
|
Regards,
Hans
|
|
|