April 20, 2003

Looking at Intel's Prescott die, part II 

( by Hans de Vries )

 

 

 

        Article Index 

 

      Yamhill comes out of the "blue"

      Code names of 64 bit enabled processors and chipsets.

      Our detailed overview of Intel's Pentium 5 / Pentium 6 processor.

      Our detailed overview of Intel's Pentium 4 processor.

      The need for 64 bit processing: Closer than you think.

      The Second integer core is for 64 bit processing  (not for multithreading)

      No double frequency building blocks used yet.

      Improvements found so far

      48 bit virtual addresses:  The Instruction TLB

      48 bit virtual addresses:  The virtual Trace Cache tags

      48 bit virtual addresses:  The Front End and Trace Cache Branch Target Buffers

      Re-examining the Register Alias History Table: 128 uOps in flight in total.

      Faster SSE Floating Point separated from legacy Floating Point.

      La Grande:  A tiny embedded processor for (micro)-code decryption and other purposes?

 

 

Yamhill comes out of the "blue"

Googlers: this article looks for Yamhill in Prescott's.
If you are looking for Prescott's in Yamhill, go here:
http://www.prescottbluebird.com/portland.html

 

We've now found plenty of proof for 64 bit processing in Prescott's blue die image. Yamhill is real, that much we can say. Virtual addressing has been extended from 32 to 48 bit, just like in the AMD Hammer family. We will demonstrate it with illustrated images of three different locations: the instruction TLB, the Trace Cache and the Front End Branch Target Buffers. Other examples of the increased address space are scattered all over the chip. The processor, however, won't be called Prescott or Nocona anymore once the 64 bit features are enabled: new code names are used for the 64 bit enabled processors. The latter two will stay with a 32 bit virtual address space but already have their physical address space extended from 36 to 40 bit, as seems to be mentioned in manuals still under NDA.

 

Code names of 64 bit enabled processors and chipsets

 

The name that appeared for the 64 bit enabled 4P server version is Potomac, a processor matched to the Twin Castle server chipset scheduled to hit the market in the second half of 2004. A new code name, Jayhawk, recently appeared as the dual processor version, matched with the Lindenhorst chipset and slated for the beginning of 2005. Furthermore, the name of a desktop and workstation chipset has appeared: Copper River, coupled with an unnamed processor coming after Prescott but before Tejas. The Copper River chipset is planned to arrive in the second half of 2004, more or less at the same time as Potomac and Twin Castle. The only thing missing now is the code name for this 64 bit desktop version. Or... maybe its name was Yamhill in the first place? Well... it would be difficult to put that one on a roadmap now and then still deny 64 bit  :^)

 

It seems that the processors come in generations. Prescott and Nocona would be Pentium 5's, while Potomac, Jayhawk and "Yamhill" could be Pentium 6's. Then, as generally expected, the Tejas line would be the Pentium 7's and the Nehalems the Pentium 8's.

 

Only on Chip-Architect:

 Our detailed overview of Intel's Pentium 5 / Pentium 6 processor. version 1.0

 

 Instructions run clockwise. Click here for a large version (1600x1200) 

 

To compare:

Our detailed overview of Intel's Pentium 4 processor. version 1.0

 

Instructions run counter-clockwise. Click here for a large (1600x1200) and here for a huge version (3200 x 2400)

 

The need for 64 bit processing:  Closer than you think.

 

64 bit virtual addressing for the desktop will be needed many years before 4 GByte of physical memory becomes commonplace on your motherboard. The point is that the whole idea of virtual memory with page management only works if the virtual memory space is much larger than your physical memory. I actually had to save my work and restart the image software many times while working on the large images of this article, edited in uncompressed mode. Often I had to kill by hand as many tasks as possible with Task Manager to free virtual memory, just to get my work saved. The problem is not the 1 Gigabyte of DDR in my Dell Inspiron 8200. The whole problem is the 4 Gigabyte virtual memory limitation, which becomes so polluted with scattered bits and pieces of allocated memory that it's no longer possible to find a decent stretch of contiguous memory. All of this is of course the result of languages like C with explicit pointer handling, and of processors that do not have specific pointer registers. There is no way to defragment virtual memory like a hard disk to open up larger contiguous areas. The only way to "defragment" virtual memory is to save your work in time, shut down the program and start again. I suspect that those mission critical 32 bit bank transaction servers only work because they start up and kill small processes all the time to avoid virtual memory pollution. Imagine having to kill all kinds of tasks by hand in the hope that you can still save a few hours worth of bank transactions...
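To make the fragmentation argument concrete, here is a minimal C sketch. All sizes are made up and scaled way down, with a 4 kByte "address space" standing in for the 4 GByte one; it merely shows how half of a space can be free while no single free region is bigger than one small chunk:

/* Scaled-down model of a polluted virtual address space: scattered
   small allocations leave plenty of free memory in total, but no
   contiguous region large enough for one big allocation. */
#include <stdio.h>

#define SPACE 4096u          /* stand-in for the 4 GByte virtual space */
#define CHUNK 64u            /* small allocations polluting the space  */

int main(void)
{
    unsigned used[SPACE / CHUNK] = {0};
    unsigned free_total = 0, largest = 0, run = 0;

    for (unsigned i = 0; i < SPACE / CHUNK; i += 2)
        used[i] = 1;                      /* every other chunk is taken */

    for (unsigned i = 0; i < SPACE / CHUNK; i++) {
        if (!used[i]) { free_total += CHUNK; run += CHUNK; }
        else          { if (run > largest) largest = run; run = 0; }
    }
    if (run > largest) largest = run;

    printf("free in total:       %u bytes\n", free_total); /* 2048     */
    printf("largest free region: %u bytes\n", largest);    /* only 64  */
    return 0;
}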

 

Second integer core for 64 bit processing  (not for multithreading)

 

It is as good as certain that the second 32 bit core is used exclusively for 64 bit processing, in a way similar to the good old bit slices: the 4 bit AMD 2901 could be used to build 16, 32 or 64 bit processors. What makes this possible is that the core is limited mainly to additive and logic functions. A 64 bit staggered addition takes a total of four 1/2 cycles, but two of them can be started back to back at 1/2 cycle intervals. The latency to access the cache also does not need to increase because of the extension to 64 (48) bit addresses: the higher part of the address is only used several cycles later, to check the address tags against the TLB entries, and not to access the data cache itself. What will increase by one cycle is the latency from an ALU instruction to a normal speed integer instruction. This delay will increase from 2 to 3 cycles. One extra pipeline stage is needed as well, resulting in a minor increase in the branch misprediction penalty.
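Here is a minimal C sketch of that bit-slice idea, modelling only the data flow and none of the staggered half-cycle timing: a 64 bit addition performed as two 32 bit additions, with the carry of the low slice handed to the high slice the way a second 32 bit core can be chained behind the first:

/* 64 bit addition from two 32 bit slices, AMD 2901 style. */
#include <stdint.h>
#include <stdio.h>

static uint64_t add64_from_32bit_slices(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a,         b_lo = (uint32_t)b;
    uint32_t a_hi = (uint32_t)(a >> 32), b_hi = (uint32_t)(b >> 32);

    uint32_t sum_lo = a_lo + b_lo;           /* first slice           */
    uint32_t carry  = (sum_lo < a_lo);       /* carry out of bit 31   */
    uint32_t sum_hi = a_hi + b_hi + carry;   /* second slice          */

    return ((uint64_t)sum_hi << 32) | sum_lo;
}

int main(void)
{
    uint64_t x = 0x00000001ffffffffULL;
    printf("%016llx\n", (unsigned long long)add64_from_32bit_slices(x, 1));
    /* prints 0000000200000000: the carry crossed the slice boundary  */
    return 0;
}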

 

The reason we can be so sure that the second core is not used to boost the 32 bit Hyper-Threading capabilities is the scheduler. This unit is by far the biggest entity on the Pentium 4 die, larger than all the Floating Point, MMX and SSE hardware together. It is not only big, it also consists mostly of very timing critical, optimized macro cells laid out by hand. It takes a lot of time and effort to change the scheduler. We've looked at it in detail and concluded that it has remained largely unchanged on Prescott's die. This means that the maximum uOp throughput remains six per cycle, using the same dispatch ports as the Pentium 4.

 

No double frequency building blocks used yet

We found that none of the fancy, very high performance building blocks demonstrated at the VLSI 2002 conference are used in Prescott. They will have to wait for a Prescott successor; the scheduler needs to be modified to support them. Some of these building blocks are very impressive indeed. The 32 kByte cache block shown could for instance be used to implement something like a 128 kByte L1 cache with 2 read operations per 1/2 cycle, all within the same load latency as the current Pentium 4. So: 16 times the size (128k/8k) and four times the number of reads!

 

It is now clear that both the schedulers for the Rapid Execution Engine and the Integer Register File do not operate at double frequency, as was suggested in an Intel presentation from 2000, shown at the right.

 

These units do fully support the double pumped ALU's, but they do so by doing things in parallel, not by operating at a double frequency. That is of course fine in itself, but a number of things I wrote based on this assumption just don't make much sense anymore, such as my claim in the first Prescott article a year ago that a 4 GHz Prescott equipped with a double speed Data Cache should be called an 8 GHz processor...

There's too much yellow in here!

Now when will we see building blocks like the ones demonstrated at the VLSI 2002 conference actually being used? The only future frequency roadmap for Tejas and Nehalem we saw came from Mike Magee's The Inquirer, here:

 

" The immediate successor to Prescott after it tops out at 5.20GHz will be the "Tejas" core, also produced on a 90 nanometer process and delivering 5.60GHz using a 1066MHz system bus. That's slated to start appearing towards the end of 2004. Tejas will increase in steady increments which appear to be 6GHz, 6.40GHz, 6.80GHz, 7.20GHz, 7.60GHz, 8GHz, 8.40GHz, 8.80GHz and topping out at 9.20GHz. The first Nehalem is supposed to appear at 9.60GHz before Intel succeeds in its goal to produce a 10GHz+ chip, the Nehalem, using a 1200MHz front side bus. "

 

Also interesting is the paper "Increasing Processor Performance by Implementing Deeper Pipelines" from Intel's Eric Sprangle and Doug Carmean. The paper looks at a theoretical double frequency version of the current Pentium 4. Both now work on the Nehalem, with Doug as its principal architect. They maintain that the study should not be interpreted as a roadmap document.

Improvements found so far

 

A list of the improvements we have found so far on the Prescott / Nocona / Yamhill / Jayhawk / Potomac die.
Only a few of them have been officially disclosed by Intel (so consider it all unofficial).

 

 

Specifications             | Pentium 4              | Pentium 5               | Pentium 6?
and Enabled Features       | (current)              | (Q4 - 2003)             | (H2 - 2004)
                           |                        |                         |
                           | Northwood, SP          | Prescott, SP            | "Yamhill", SP
                           | Prestonia, DP          | Nocona,   DP            | Jayhawk,   DP
                           | Gallatin,  MP          |                         | Potomac,   MP
---------------------------+------------------------+-------------------------+------------------------
Data Width                 | 32 bit                 | 32 bit                  | 64 bit
---------------------------+------------------------+-------------------------+------------------------
Virtual Address            | 32 bit                 | 32 bit                  | 48 bit
Physical Address           | 36 bit                 | 40 bit                  | 40 bit
---------------------------+------------------------+-------------------------+------------------------
Architectural Registers    | 8                      | 8                       | 16?
---------------------------+------------------------+-------------------------+------------------------
Logical Processors         | Northwood: 1, 2        | Prescott: 2             | Jayhawk: 4
(number of threads)        | Prestonia: 2           | Nocona:   4?            | Potomac: 4
---------------------------+------------------------+-------------------------+------------------------
Frequency and estimated    | up to 3.2 GHz          | start at 3.4 GHz        | start at 4.0 GHz?
Spec Int 2000              | 1250                   | 1500                    | 1900 (16 regs)
---------------------------+------------------------+-------------------------+------------------------
Chipsets:                  |                        |                         |
  Desktop Processor        | Canterwood             | Canterwood              | Copper River
  Server Single Processor  | Brookdale              | Canterwood ES           | Copper River
  Workst. Dual Processor   | Placer 533             | Placer 533, Tumwater    | Tumwater
  Server Dual Processor    | Plumas 533             | Plumas 533, Lindenhorst | Lindenhorst
  Server Four Processor    | -                      | -                       | Twin Castle
---------------------------+------------------------+-------------------------+------------------------
ALU Throughput (max)       | four 32 bit ops/cycle  | four 32 bit ops/cycle   | four 64 bit ops/cycle
---------------------------+------------------------+-------------------------+------------------------
ALU Latencies:             |                        |                         |
  ALU to ALU instruction   | 1/2 cycle              | 1/2 cycle               | 1/2 cycle
  ALU to Cache address     | 1/2 cycle (32b)        | 1/2 cycle (32b)         | 1/2 cycle (64b)
  ALU to Other instruction | 2 cycles               | 3 cycles?               | 3 cycles
---------------------------+------------------------+-------------------------+------------------------
L1 Data Cache              | 8 kByte                | Prescott: 16 kByte      | 32 kByte
                           |                        | Nocona:   16 kByte?     |
---------------------------+------------------------+-------------------------+------------------------
L1 Bandwidth (Integer):    |                        |                         |
  L1 Data Cache Reads      | one 32 bit word/cycle  | one 32 bit word/cycle   | one 64 bit word/cycle
  L1 Data Cache Writes     | one 32 bit word/cycle  | one 32 bit word/cycle   | one 64 bit word/cycle
---------------------------+------------------------+-------------------------+------------------------
Instruction Trace Cache    | 12 k uOps / 80 kByte   | 16 k uOps / 128 kByte   | 16 k uOps / 128 kByte
                           | 256 sets, 8 ways       | 512 sets, 8 ways        | 512 sets, 8 ways
                           | 6 uOps per trace-line  | 4 uOps per trace-line   | 4 uOps per trace-line
                           | 53 bit per uOp         | 64 bit per uOp          | 64 bit per uOp
---------------------------+------------------------+-------------------------+------------------------
Trace Cache Bandwidth      | 3 uOps/cycle           | 4 uOps/cycle            | 4 uOps/cycle
---------------------------+------------------------+-------------------------+------------------------
L2 Unified Cache           | 512 kByte              | 1024 kByte              | 1024 kByte
---------------------------+------------------------+-------------------------+------------------------
Branch Prediction:         |                        |                         |
  Trace Cache BTB          | 512 entries            | 1024 entries            | 1024 entries
  Front End BTB            | 4096 entries           | 4096 entries            | 4096 entries
---------------------------+------------------------+-------------------------+------------------------
Instructions in Flight     | 126                    | 128                     | 128
---------------------------+------------------------+-------------------------+------------------------
Integer Register File      | 128 x 32 bit           | 256 x 64 bit            | 256 x 64 bit
---------------------------+------------------------+-------------------------+------------------------
Floating Point             | 128 x 128 bit          | 256 x 128 bit           | 256 x 128 bit
Register File              |                        |                         |
---------------------------+------------------------+-------------------------+------------------------
Load Buffer                | 48 entries             | 96 entries              | 96 entries
---------------------------+------------------------+-------------------------+------------------------
Store Buffer               | 24 entries             | 48 entries              | 48 entries
---------------------------+------------------------+-------------------------+------------------------
Micro Code:                |                        |                         |
  Relative ROM size        | 1.0 X                  | 2.1 X (0.71 mm2)        | 2.1 X (0.71 mm2)
  Relative Flash size      | 1.0 X                  | 4.3 X (0.27 mm2)        | 4.3 X (0.27 mm2)
  Secure encrypted downld. | no?                    | yes?                    | yes?

(updated April 18, 2003)

 

48 bit virtual addresses: The Instruction TLB

We'll now go after some proof of 48 bit virtual addressing. Calculations are 64 bit but, as in the Hammer, only the first 48 bits are used for virtual addressing (the address as the programmer sees it) and 40 bits are used for accessing physical memory (the memory DIMMs on your motherboard). As said, the virtual address range must be much larger than the physical one in order for page based memory management to work. This is the main reason why 64 bit processors are needed well before you can afford 4 Gigabyte of memory.

The TLB (Translation Lookaside Buffer) is responsible for translating virtual addresses into physical addresses, so it's indeed the obvious place to start looking! The TLB is a little cache that contains recent translations. If a translation is not in the TLB, it must be loaded from the memory hierarchy, possibly all the way down from main memory. The TLB's here are fully associative and are thus basically Content Addressable Memories. They consist of two parts. The upper rectangle holds the latest virtual addresses that were translated, and each stored virtual address has its own comparator that checks whether it is equal to the new virtual address that must be translated. The comparators are organized as columns in the upper rectangle. A stored virtual address that matches sends an enable signal downwards to the lower rectangle, where the physical addresses are stored. The corresponding physical address is selected and the translation is complete. We know that the height of the upper rectangle is proportional to the size of the virtual address, while the height of the lower rectangle is proportional to the physical address. As you can see in the image above: there is a very good correspondence between the actual size of Prescott's TLB (the vague white rectangle) and the calculated size (the black one).

Northwood has two 64 entry TLB's, one for each thread, while Prescott has a single 128 entry instruction TLB shared among all threads.
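As an illustration of what such a fully associative TLB does logically, here is a small C model. The 128 entries and the 48/40 bit split follow what we found above; the 4 kByte page size and the missing permission bits are simplifications, and the loop of course stands in for comparators that all work in parallel in hardware:

/* Fully associative instruction TLB: every entry compares its stored
   virtual page number against the incoming address at once (CAM). */
#include <stdint.h>

#define TLB_ENTRIES 128
#define PAGE_SHIFT  12                  /* 4 kByte pages, illustrative */

typedef struct {
    int      valid;
    uint64_t vpn;    /* bits 47..12 of the virtual address  */
    uint64_t pfn;    /* bits 39..12 of the physical address */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* returns 1 and fills *phys on a hit, 0 on a miss (page walk needed) */
static int tlb_lookup(uint64_t vaddr, uint64_t *phys)
{
    uint64_t vpn = (vaddr & ((1ULL << 48) - 1)) >> PAGE_SHIFT;

    for (int i = 0; i < TLB_ENTRIES; i++) {    /* parallel comparators */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *phys = (tlb[i].pfn << PAGE_SHIFT)
                  | (vaddr & ((1ULL << PAGE_SHIFT) - 1));
            return 1;
        }
    }
    return 0;
}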

 

48 bit virtual addresses: The virtual tags of the Trace Cache

We've already seen earlier that Prescott's Trace Cache is significantly larger than the one in Northwood. Knowing the right scaling, we can hold the two next to each other to see the exact difference. We found that the Willamette / Pentium 4 trace cache is built from 20 of the same 4 kByte memory tiles that were used in the L2 cache of Willamette. This not only tells us a lot about the size (80 kByte) but also about how it is accessed. The Prescott Trace Cache is constructed from 8 memory tiles. These tiles have the same height as Prescott's L2 cache tiles; they are however some 30% wider because 256 extra word lines run vertically through them. Northwood's Trace Cache has a 160 bit bus to read 3 micro operations (uOps) per cycle, so each uOp takes about 53 bits. A single trace has 6 uOps and is read in two cycles.

The 160 bit bus is built from thick copper global bit lines that get their information from one of sixteen local bit lines. Each 6 uOp trace can come from any of 8 "ways"; the number 8 is there because the Trace Cache is 8 way set associative.

 

It works like this: the cache has 256 "sets", which means that 8 bits from the address are used to select a set from the cache. Each set has 8 "ways", and each way remembers the remaining 24 bits of the instruction address in an "address tag" field. These tags must be compared with the corresponding 24 bits of the requested address. If one of them fits, we have a "cache hit" and the right way is selected to provide the uOps. Northwood's Trace Cache is constructed from four big columns. In each column you can see, at the bottom, the storage space for the 24 bit address tags; above that, the eight 24 bit comparators; and above that, the large green column that stores the uOps.

 

We now move to Prescott's Trace Cache. It's significantly taller, which corresponds to a 256 bit bus, enough for four uOps of 64 bit each; about 5 bits more per uOp are needed to support something like AMD's x86-64. An issue now arises with the trace-line size. Trace lines with 8 uOps are less effective because the positions in the line after a branch are left empty, the following uOps being stored at the beginning of a new trace-line instead. It is better to have trace-lines of only 4 uOps. This also has its advantages for hyper-threading: instead of needing two cycles to read out one trace-line, we can now read out a new trace-line every cycle and can alternate between threads on a per-cycle basis.

 

Prescott's Trace Cache becomes "512 set, 8 way associative" as a consequence. The 512 sets mean that the number of address tags must be doubled. Furthermore, with 48 bit virtual addressing we now need 39 address bits for the tags, that is: 48 - 9. The remaining 9 bits are used to select between the 512 sets.
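The set/tag arithmetic fits in a few lines of C. This little model uses the same simplified split as above, ignoring line-offset bits: 9 of the 48 virtual address bits select one of 512 sets and the remaining 39 form the tag that the 8 ways compare against (for Northwood substitute 256 sets, 8 set bits and 24 tag bits out of 32):

/* Set-associative lookup with the 512-set / 39-bit-tag split. */
#include <stdint.h>

#define SETS     512
#define WAYS     8
#define SET_BITS 9                   /* log2(SETS)        */
#define VA_BITS  48                  /* virtual addresses */

typedef struct {
    int      valid;
    uint64_t tag;                    /* upper 39 address bits */
} tc_way;

static tc_way trace_cache[SETS][WAYS];

/* returns the matching way on a hit, -1 on a trace cache miss */
static int trace_cache_lookup(uint64_t vaddr)
{
    uint64_t addr = vaddr & ((1ULL << VA_BITS) - 1);
    unsigned set  = (unsigned)(addr & (SETS - 1)); /* 9 bits pick the set */
    uint64_t tag  = addr >> SET_BITS;              /* 39 bits to compare  */

    for (int way = 0; way < WAYS; way++)           /* 8 tag comparators   */
        if (trace_cache[set][way].valid && trace_cache[set][way].tag == tag)
            return way;
    return -1;
}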

 

We are reassured by a hint dropped to c't's Andreas Stiller here, where a "finer" Trace Cache access was reported in order to better support hyper-threading.

 

48 bit virtual addresses: The Front End Branch Target Buffers

This unit is used for branch prediction and is located at the edge of the L2 unified cache, where the instruction pre-fetcher loads the raw instruction bytes that are to be decoded and then stored as uOps in the instruction trace cache. You can see two columns that are very similar to the single column found on the Pentium Pro, Pentium II and III. The (smaller) upper rectangle contains what we want to know: the branch target address. The lower rectangle contains the address tags needed to select between the different ways and to detect a cache hit or cache miss. It is known that the previous Pentiums stored the entire 32 bit address as a tag, so including the bits used to select between the sets. There is no real use for this, except maybe to detect "entry valid": an entry would be valid if the extra tag bits are identical to the set they are stored in.

We can see that Prescott's BTB's are 50% wider, which corresponds nicely to 48 bit addressing. The lower rectangle in Prescott however has almost exactly the same area as in Northwood. The lower area also contains the prediction information. Each entry maintains 16 "bimodal counters" of 2 bit each. The 2 bit values mean: 0 = likely not taken, 1 = probably not taken, 2 = probably taken, 3 = likely taken. The selection between the sixteen counters is based on the outcome of the four previous times that the branch was executed (local branch history).
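To see how such an entry predicts, here is a minimal C sketch of one BTB entry with sixteen 2 bit counters indexed by a 4 bit local history. The counter meanings follow the description above; the update policy is the plain textbook one, not necessarily exactly what Intel implemented:

/* One BTB entry: 16 bimodal counters selected by 4 bits of history. */
#include <stdint.h>

typedef struct {
    uint8_t counters[16];  /* 2 bit saturating counters              */
    uint8_t history;       /* outcomes of the last 4 runs, 1 = taken */
} btb_entry;

/* predict taken if the selected counter is 2 or 3 */
static int predict(const btb_entry *e)
{
    return e->counters[e->history & 0xF] >= 2;
}

/* update with the real outcome: saturate the counter, shift history */
static void update(btb_entry *e, int taken)
{
    uint8_t *c = &e->counters[e->history & 0xF];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    e->history = (uint8_t)(((e->history << 1) | (taken ? 1 : 0)) & 0xF);
}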

Now, branch prediction is an imprecise process, and incorrectly predicted branches can always be corrected. There can however be some nasty side effects of erratic branch prediction. The whole Trace Cache may be invalidated at once if it turns out that uOps were decoded from a data memory page that is written to shortly thereafter: the "self-modifying-code" detector will sound the alarm and invalidate the entire trace cache. So some attention is needed here! We think that some of the redundant address tag bits may have been replaced with address bits from 32 and higher.

So full 48 bit address targets avoid the "self-modifying-code" issue, while a few more relevant bits in the tags will reduce conflicts between branches that are "terabytes" apart.

 

Re-examining the Register Alias History Table, 128 uOps in flight in total.

 

We wrote about the Register Alias History Table a while ago, where we concluded that the number of in-flight uOps had doubled in Prescott. This was based on the fact that the RAHT (Register Alias History Table) had more than doubled in size. This table must remember information for all instructions in flight, so that the processor can go back in time and discard the results of instructions that were executed in error because of a branch misprediction. There is no absolute need to store information for more instructions, since they will all be retired after the branch prediction has been checked.

 

However, Intel's implementation is such that it maintains a maximum sized table for each processor thread, as one can see in an MPF 2002 presentation. Northwood's RAHT thus has 252 entries for two times 126 instructions in flight.

Now what do we make of the fact that Prescott's RAHT has more than doubled?

We do not expect 256 instructions in flight, since we've concluded that the second integer core cannot execute instructions independently. Four threads is one possibility: it would double the RAHT, but it would not "more than double" it. The second option has to do with its contents. The table remembers the mapping of each of the eight architectural registers into the 128 entry Register Files. An increase of the number of architectural registers from 8 to 16 would double the storage needed. On top of that, we may need 8 index bits instead of 7 because the Register Files have increased from 128 to 256 entries.
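Here is a minimal C sketch of what such a table must hold under this assumption: one complete architectural-to-physical mapping per in-flight uOp, so that a mispredicted branch can be undone in one step. The 16 registers and the 256 entry register file are the assumptions discussed above, and the snapshot-per-uOp organization is our reading of the MPF 2002 material, not a confirmed implementation. Note how the storage per entry scales with both the register count (8 to 16) and the index width (7 to 8 bits):

/* Register Alias History Table: a mapping snapshot per in-flight uOp. */
#include <stdint.h>
#include <string.h>

#define ARCH_REGS 16     /* 8 on Northwood; 16 is the assumption here */
#define IN_FLIGHT 128    /* uOps in flight, per the table above       */

static uint16_t map[ARCH_REGS];              /* current alias table */
static uint16_t raht[IN_FLIGHT][ARCH_REGS];  /* snapshot per uOp    */

/* save the mapping as it is when uOp 'slot' enters the machine */
static void raht_save(int slot)
{
    memcpy(raht[slot], map, sizeof map);
}

/* uOp 'slot' turned out to be a mispredicted branch: go back in time */
static void raht_restore(int slot)
{
    memcpy(map, raht[slot], sizeof map);
}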

The image above details an and-and implementation (both four threads and 16 architectural registers). Closer scrutiny of the Northwood RAHT shows that the actual storage areas, the black rectangles, are only a fraction of the total size; the rest of the area may be bypasses. New register indices are produced for 3 uOps per cycle in Northwood and 4 uOps in Prescott, the last uOp (number 3 or 4) being dependent on the other ones. The data busses run horizontally, so the height of the black rectangles is proportional to the amount of data per entry. The implementation example above provides a "perfect fit", but one can always find perfect fits in more complicated cases like this one.

Still, we rate the chances of an "and-and" implementation the highest: four threads and 16 architectural registers. Are four threads useful in Prescott, given that the maximum uOp dispatch rate is still 6 uOps per cycle? I think certainly yes, in server implementations. Latency from main memory is the reason: a thread may need to wait hundreds of cycles for data from main memory. Prescott has all its out-of-order buffers doubled to support two threads; if one thread is waiting for main memory, the other can run just as fast as a single thread would. I can imagine that a 4P server with a shared data bus will see much longer latencies to memory, with a significant chance that both threads are waiting hundreds of cycles for data from the non-distributed memory. Two extra threads may fill up some of these lost cycles.

 

Faster SSE Floating Point separated from legacy Floating Point  

 

The Pentium 4 handles all multiplies with a single unit: the legacy x87, the integer, the MMX and the SSE ones. This results in longer latencies for integer multiplies and for SSE floating point multiplies, which use the hardware of the 80 bit x87 multiplier. We've shown in a previous article that the unit that handles all Floating Point, MMX and SSE instructions has moved to a different location on the die (to make room for new, undisclosed hardware?).

You can see the new location in the illustration above. What was not disclosed is that this unit has a "tail". Somewhat to our surprise we found that the macro cells in this area are identical to the Floating Point adder and the Floating Point multiplier (minus MMX multiply). The "Prescott Floating Point" hardware is supposed to be allocated in the rectangle shown! Now what has happened here? It seems that the older legacy FP hardware has been isolated in order to design faster SSE2 floating point units, no longer hindered by legacy hardware. This also results in lower multiply latencies, as already disclosed by Intel.

 

La Grande:  A tiny embedded processor for micro-code decryption and other purposes?  

 

Maybe I'm fooling myself; after all, what led me to the ideas below was an Intel patent granted on April 1st, 2003! US patent 6549821. Not an Oregon patent however, but La Grande is to be supported by all processors. It talks among other things about downloading encrypted micro-code for security reasons. This would need a little decryption engine next to the micro-code instruction sequencer. The patent also mentions how downloaded micro-code could enable undisclosed processor functions. In fact it could even change the x86 instruction codes for these functions every time, so that what seems to be just random data may actually be an executable x86 program, one that works only until the next time the micro-code is changed. It soon started to appear to me that this could be the way to realize a lot of those vague ideas hanging around the announced (but never explained) La Grande security technology. And yes, the amount of micro-code flash memory has more than quadrupled on Prescott compared to the Pentium 4, and the micro-code ROM has doubled. All indications that something may be going on there. And yes, there are these three closely coupled and partly overlapping macro cells that look just like the classical uP / ROM / RAM trinity, all within a space of less than a square mm.
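Purely as an illustration of the flow the patent hints at, and nothing more, here is a hypothetical C sketch: a downloaded micro-code blob is decrypted by a tiny engine before being written into the patch flash. The keystream below is a placeholder (a simple LCG), not any real cipher, and every name in it is invented:

/* Hypothetical micro-code download decryption, placeholder cipher. */
#include <stdint.h>
#include <stddef.h>

/* stand-in keystream; a real engine would use a proper cipher */
static uint8_t keystream_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;   /* LCG, illustrative */
    return (uint8_t)(*state >> 24);
}

/* decrypt the downloaded blob in place before loading it as micro-code */
static void microcode_decrypt(uint8_t *blob, size_t len, uint32_t key)
{
    uint32_t state = key;
    for (size_t i = 0; i < len; i++)
        blob[i] ^= keystream_next(&state);
}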

It's surrounded by an area that has been puzzling us, because it was left over after all known Pentium 4 functionality was accounted for at other locations. The Floating Point unit was moved from here to another location on the die. It does look like the Micro Code Sequencer (a tiny processor by itself) has gotten a little brother; they would be sitting just 400 um apart. The La Grande processor would measure no more than 320 um by 360 um, compared to 260 um by 300 um for the even smaller Micro Sequencer. Something that seems to be part of La Grande is the option to allocate a protected area in the level 2 cache. Encrypted instructions must be securely decrypted and stored as executable instructions in the L2 cache; that may be another job for a little encryption engine. It's all highly speculative of course, speculation that stems partly from the fact that the Floating Point hardware was moved away from this area and not replaced with other known hardware, so something new must be there...

 

Regards, Hans

 

Related articles 

 

March   6, 2003:     Looking at Intel's Prescott die, part I

March 26, 2003:     Clues for Yamhill
