April 16, 2002: Intel's Prescott Prospects
(by Hans de Vries)
VLSI 2002 symposia.
Prescott is Intel's 90 nm version of the Pentium 4 architecture, with architectural enhancements such as simultaneous multithreading for the desktop, a large L2 cache (probably 1 Megabyte) and, according to a number of rumors, a 64 bit extension codenamed Yamhill. The advance program of the VLSI 2002 symposia, to be held in Honolulu this summer, shows about 18 presentations related to the Pentium 4 architecture. We've compiled a list of the relevant abstracts here. Just by looking at the abstracts, which are limited to 75 words, one can see that considerable advances have been made in a number of areas: very high speed L1 caches of 32 kbyte and 16 kbyte, larger than the current 8 kbyte and much faster as well, and a larger register file of 256 words of 32 bit compared to 128 words now. There will also be presentations on a new Integer Execution Unit and a new Address Generator. Some of these units were disclosed earlier: the 5 GHz ALU at the fall 2001 IDF and the register file at last year's VLSI symposia. The speed of the building blocks is impressive:
In a 130 nm process at a low 1.2 Volt:
- 4.0 GHz 32 bit Address Generator
- 5.0 GHz Integer Execution Core
- 4.5 GHz 32 kbyte Cache
In a 100 nm process:
- 6.0 GHz 16 kbyte Cache
These speeds are especially impressive if you compare them with the current Northwood in 130 nm with 60 nm gate lengths. This processor runs at 'only' 1.5 GHz on the relatively low voltage of 1.2 V (see the shmoo plot here). The Northwood Address Generators and Integer Execution Units run at double that speed (3 GHz) but are basically 16 bit, while the units here are supposed to be truly 32 bit. The register file was reported as running at 6 GHz last year (voltage?) and AnandTech has some pictures of a 10 GHz 32 bit ALU at a high 1.8 V here.
One can truly appreciate the speed of these caches when comparing them with the new "One step forward, two steps back" JEDEC DDR II standard. The access frequency of the much smaller 2 kB to 4 kB row buffers in these chips is only 83 MHz and 100 MHz for DDRII-333 and DDRII-400 respectively. That is an astonishing 100 times slower than the larger caches presented here would be when implemented in Prescott's 90 nm process at a decent voltage. The fact that this standard is designed by a JEDEC committee seems to result in something that can be supported by all members, including the weakest of the weak DRAM companies.
64 bit Yamhill implementation may cost less than 2% extra die space.
The Yamhill rumor
To start with the rumor: the 64 bit Yamhill extension is supposed to be Intel's answer to AMD's Hammer family. One can imagine that the Pentium architects are thankful to AMD for extending the x86 architecture to 64 bit, an architecture that was supposed to reach its end-of-life stage with the introduction of Itanium and its descendants. Hammer however is up and running now, outperforming the fastest members of the Itanium family, at least in integer performance. Intel management has to make U-turns once in a while, as it did lately with its Rambus-only policy.
Basic 64 bit integer operations
A 64 bit extension by itself does not imply that the Integer Execution Unit and the Integer Register File have to be extended to 64 bit. A minimal implementation would simply use the 32 bit integer pipeline for 64 bit integer operations. The Floating Point/MMX/SSE pipelines are already 64 bit. No need for changes here.
The dual 'Rapid Execution' Units and the 32 bit register file run at twice the frequency and are together able to handle two 64 bit operations per cycle. (The Hammer is able to do 3 per cycle, but its 64 bit additions might have twice the latency.) The mechanisms to decode an operation into 2 sub-operations are already available in the pipeline. The 128 bit XMM/SSE operations, for example, are handled in two 64 bit pieces.
It would be advantageous if the basic functional timing of the Rapid Execution Engines could remain the same. The current ones handle 32 bit additions as two skewed 16 bit ones: the second addition starts half a cycle after the first, when the carry bit is available. The newer integer ALUs seem to be full 32 bit ALUs. The same trick may thus be used to handle a 64 bit addition as two skewed 32 bit ones. Hardware for a full 32 bit addition takes about 15-20% longer than that for a 16 bit addition. It seems that Intel's circuit designers have closed this gap with novel design techniques like 'forward body biasing' et cetera.
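The skewed addition described above can be illustrated in software. This is only a rough functional sketch in Python, not a model of Intel's circuit: the function name skewed_add and the half_bits parameter are made up for illustration, and the half-cycle skew is represented simply by computing the low half first and passing its carry into the high half.

```python
def skewed_add(a: int, b: int, half_bits: int = 32) -> int:
    """Add two (2 * half_bits)-bit values as two half-width additions.

    The low halves are added first. The carry out of that addition is then
    fed into the addition of the high halves, which stands in for the second,
    half-a-cycle-later step. The result wraps around at 2 * half_bits bits.
    """
    mask = (1 << half_bits) - 1

    # First step: add the low halves and extract the carry bit.
    low = (a & mask) + (b & mask)
    carry = low >> half_bits

    # Second step: add the high halves plus the carry from the first step.
    high = ((a >> half_bits) & mask) + ((b >> half_bits) & mask) + carry

    return ((high & mask) << half_bits) | (low & mask)
```

With half_bits=16 this mirrors the current 32 bit addition built from two 16 bit pieces; with the default of 32 it mirrors the speculated 64 bit addition built from two 32 bit pieces.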
More 64 bit operations
The Rapid Execution Engine handles 32 bit logic and additive operations, which are easy to extend to 64 bit. Other integer operations are more complicated. 32 bit integer multiplies are currently handled by the floating point multiplier. This unit can handle 64 bit multiplications as a result of the 80 bit floating point format, which uses a 64 bit mantissa. The shifts and divides probably remain in the lower priority legacy area. This hardware is designed without the extreme efforts in circuit design and layout used for the ultra high frequency integer ALUs. The integer divide typically uses only a fraction of the transistors used for the multiply.
A simple shift and subtract state machine generates 2 result bits per cycle. It has to be adapted for 64 bit operation. The remaining legacy integer operations, invented long ago, are probably left untouched and not extended to 64 bit.
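To make the "2 result bits per cycle" idea concrete, here is a minimal sketch of an unsigned shift and subtract divider that retires two quotient bits per step. It is a textbook-style restoring division written in Python under our own assumptions; the function name and the width parameter are hypothetical, and nothing here is taken from the actual Pentium 4 divider.

```python
def divide_two_bits_per_step(dividend: int, divisor: int, width: int = 64):
    """Unsigned shift-and-subtract division, two quotient bits per step.

    Returns (quotient, remainder, steps). The dividend is assumed to fit
    in `width` bits.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    if width % 2:
        raise ValueError("width must be even")

    remainder = 0
    quotient = 0
    steps = 0
    for shift in range(width - 2, -1, -2):
        # Shift the next two dividend bits into the partial remainder.
        remainder = (remainder << 2) | ((dividend >> shift) & 0b11)

        # Pick the largest digit 0..3 whose multiple of the divisor fits.
        digit = 0
        for candidate in (3, 2, 1):
            if remainder >= candidate * divisor:
                digit = candidate
                break

        remainder -= digit * divisor
        quotient = (quotient << 2) | digit
        steps += 1
    return quotient, remainder, steps
```

In this sketch a 32 bit divide takes 16 steps and a 64 bit divide takes 32, which is the sense in which such a state machine would have to be adapted for 64 bit operation.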
64 bit general purpose registers
A minimal implementation would add eight 32 bit registers to the basic set of eight to extend them to 64 bit. This is in fact similar to adding the architectural (register) state for an extra thread. A processor supporting two 64 bit threads would need 32 general purpose registers of 32 bit. The integer register file is likely to support 256 words. The architectural state would need 32 of them, leaving 2x112 for the renamed registers, which means that each of the two threads can use almost all the resources of the pipeline when the other is waiting or restarting after a misprediction.
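The register budget in this paragraph is simple arithmetic, sketched below in Python. The function name and parameter names are ours; the figures (256 words, eight registers, two 32 bit words per 64 bit register, two threads) are the ones assumed in the text.

```python
def register_file_budget(file_words: int = 256, arch_regs: int = 8,
                         threads: int = 2, words_per_reg: int = 2):
    """Split a register file into architectural state and rename registers.

    Returns (architectural_words, renamed_words_total, renamed_words_per_thread).
    """
    # Each thread keeps arch_regs registers, each stored as words_per_reg words.
    architectural = arch_regs * words_per_reg * threads
    renamed_total = file_words - architectural
    return architectural, renamed_total, renamed_total // threads


architectural, renamed_total, renamed_per_thread = register_file_budget()
print(architectural, renamed_total, renamed_per_thread)  # 32 224 112
```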
The first level caches.
The 2nd and 3rd level caches as well as the memory interface do not need any changes specific to Yamhill: the virtual address, which is expanded to anywhere between 32 and 64 bit (Hammer uses 48 bit), has already been translated to the physical address here. Physical addresses broke the 32 bit barrier quite a while ago. The Level 1 data cache however, and to a lesser extent the Instruction Trace Cache, do need modifications for Yamhill. A lot of the extra die area would come from the enlarged TLBs (Translation Look-aside Buffers) that translate the higher address bits of the virtual address into a physical address. The current SMT capable Xeon has a combined L1 Data TLB for both threads and individual Instruction TLBs for each of the two threads. The current Data Cache Address Generators first calculate the lower 16 bits to index into the cache, followed half a cycle later by the higher 16 bits to access the TLB. A solution similar to that in the Integer Execution Unit is needed here: the lower 32 bits are needed in the first half cycle, with the remaining higher bits arriving in the next half cycle to address the TLB. The fact that the Address Generator now works on 32 bit might be an indication of Yamhill.
Improved Simultaneous Multithreading.
Doubling the L1 Data Cache Frequency
The most important contribution to improved Hyper-Threading comes from doubling the L1 Data Cache frequency. In the first SMT Xeon the two threads have to compete for a single read port that runs at half the speed of the rest of the circuits. The 32 kbyte cache abstract mentions dual ports, but I presume that dual means 1 read and 1 write port, just like in the current 8 kbyte L1 cache.
Doubling the Integer Register File
Expanding the register file from 128 to 256 words means that each of the two threads has about enough renamed registers to cover all operations-in-flight of the entire pipeline. It's not so likely that the desktop version of Prescott would support more than 2 threads. The PC market is basically an upgrade market.
Steps that are too big first make people postpone buying until the new systems become available, and then make them wait longer before buying their next system, because that next system needs to be so much better again. The server 'Xeon' version however might well support four threads, a feature which would be disabled on the desktop version. The increased SMT support provided by new building blocks like the ones to be presented at the VLSI 2002 symposia would make that a worthwhile step.
Further and Future Improvements
With SMT firmly on track, a whole range of further improvements becomes interesting.
An obvious next step would be to bring the rest of the pipelines into the same clock domain as the Integer Execution Units: the Floating Point units and the MMX/XMM/SSE units. Optimal support for 4 threads would also need a full speed pipeline from the Instruction Trace Cache to the execution units.
The consequence of doubling the number of pipeline stages in all these units is that the number of instructions-in-flight in these units also doubles, resulting in the need for more renamed registers and thus a larger register file.
Splitting Cycles to stay on the Cutting Edge
SMT, once mastered, really is the Great Enabler. This is probably the reason why Intel bought Compaq's Alpha patents in a (non-exclusive) deal, maybe more for x86 than for the Itanium. The latter's big architectural register file puts it at a disadvantage for SMT, where a copy of the entire register file is needed for each additional thread. SMT makes hyper-pipelining the name of the game. Intel may bring the whole pipeline to the double frequency in a number of steps. We would not be surprised if, after achieving this goal, the architects and circuit designers set their minds on doubling the pipeline frequency again.
Multi Threading: A way to Speed Up Single Thread Applications
Another good reason for multithreading: if you want a fast single thread processor, use multithreading tricks! A processor with optimal support for its threads, meaning that each of them can run close to maximal speed, can be used to implement a number of methods to overcome the bottlenecks that cannot be solved by hyper-pipelining alone:
(1) "Thread based Speculative Pre-Computation" for Memory Latency as a result of Cache Misses
Presented by John P. Shen, Director of Intel's Micro Architecture Lab at the MPF 2001:
A cache pre-fetching method with extra threads added to the original binary of programs. Able to
pre-fetch unpredictable memory access patterns (like pointer intensive code). The extra parallel
SP threads (Speculative Pre-computation) may look like stripped down versions of the original
binary code with only those instructions left that are needed for the memory address calculations.
These threads would progress faster than the original code and the loads would pre-fetch the
cache-lines from memory into the caches thereby reducing access latency time for the real program.
(2) "Branch Threading" for Branch Mispredictions.
If the branch prediction hardware concludes that a certain conditional branch is very hard to predict then
it can decide to spawn a second thread. The original thread follows one path and the second thread
follows the alternative path. Once the condition is finally known at the end of the pipeline, the wrong
path is discarded.
(3) "Load Threading" for Data Load Mispredictions.
Some architectures like the EV6 and the Pentium 4 predict whether a data load from memory can
be scheduled ahead of preceding stores before any of the addresses are known. If a store later
overwrites the load data then the pipeline has to be restarted, in much the same way as after a
branch misprediction. Multithreading can allow both choices to be executed simultaneously until
the right choice is known.
The latter two methods need a fine-grained form of multithreading that is probably not available in the Hyper-Threading Pentiums, at least not in the current implementation, which seems to need a pipeline flush while forking. The use of these kinds of tricks really needs a lot of multithreading capability in a processor, especially if more than one is used at the same time. A wide scale use of these tricks is still a bit beyond Prescott's capabilities.
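The first method, speculative pre-computation, can be illustrated with a toy simulation in Python. This is a heavily idealized sketch of our own and not Intel's mechanism: the cache is an unbounded set, a pre-fetch completes instantly, the helper thread always stays a fixed number of nodes ahead, and all names (build_pointer_chain, count_misses, run_ahead) are invented for the example. The helper plays the role of the stripped-down SP thread that only follows the pointers.

```python
import random


def build_pointer_chain(n: int, seed: int = 0):
    """Return (next_ptr, start): a circular linked list at scattered addresses."""
    rng = random.Random(seed)
    addresses = rng.sample(range(0, n * 64, 64), n)  # one node per 64 byte line
    next_ptr = {addresses[i]: addresses[(i + 1) % n] for i in range(n)}
    return next_ptr, addresses[0]


def count_misses(next_ptr, start, length: int, run_ahead: int) -> int:
    """Walk `length` nodes and count cache misses seen by the main thread.

    run_ahead is the number of nodes the helper thread stays ahead of the
    main thread; 0 means there is no helper thread at all.
    """
    cache = set()

    # Head start: the helper executes only the pointer-chasing loads.
    helper = start
    for _ in range(run_ahead):
        cache.add(helper)
        helper = next_ptr[helper]

    misses = 0
    node = start
    for _ in range(length):
        if run_ahead:
            cache.add(helper)          # helper pre-fetches one more node
            helper = next_ptr[helper]
        if node not in cache:          # main thread does the "real" access
            misses += 1
            cache.add(node)
        node = next_ptr[node]
    return misses
```

In this model a walk over 100 nodes misses on every node without the helper and on none with a helper running four nodes ahead. Real hardware would sit somewhere in between, since pre-fetches take time and caches are finite.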
4 GHz Pentium 4 or an 8 GHz Pentium 5?
Mega Hertz... what Mega Hertz?
The 90 nm Prescott is expected to reach speeds of 4 GHz and beyond. The Integer Execution Units however run at 8 GHz, as do the integer register file, the address generators and now, as we may presume, also the L1 data cache. So why call it a 4 GHz processor? Technically speaking it is not a 4 GHz processor but an 8 GHz processor...
A Chance for a Change
Such a sudden jump in Giga Hertz needs to be accompanied by a significant increase in performance to make it marketable to the average customer. The 50% to 60% extra performance brought by improved Simultaneous Multi Threading does offer this as a one-time-only opportunity. If Intel ever wants to use the real frequency of the Integer Pipeline then it has to make the transition with the introduction of the Prescott.
A name change to Pentium 5 would be appropriate to signal a major architecture change.
Marketing and Metaphors
It falls to marketing to explain the term simultaneous multithreading to the general public, most likely ending up with a number of metaphors that give people the illusion that they understand something while in fact totally confusing reality. We've heard a few nice ones from AMD when it had to explain that Mega Hertz is not the same as performance. Something like "Animals with little legs having to run like crazy just to keep up with the larger (Athlon) species...." The classical combustion engine may help here: "A four cylinder engine with twice the RPM produces the same amount of horsepower as an eight cylinder does...." The extra complication is to explain how the second logical processor is a result of the much higher frequency: "The processor is so incredibly fast that it can work like two", something like the energetic modern women who have a job and take care of their children at the same time... I would not be surprised if, as a side effect of such a campaign, we see some psychiatric researchers proposing that raised brain wave frequencies can induce schizophrenia...
Well, so much for the hope that marketing can produce some decent consumer education...
An 8 GHz processor in a 90 nm process would be consistent with Intel's statements that its 70 nm processors will run at more than 10 GHz. These predictions were already made one and a half years ago.
Intel versus AMD.
A single thread 90 nm Prescott is likely to be on par with a 90 nm version of the Sledgehammer: a new, larger L1 cache with double the access rate, a similarly sized L2 cache (1 Megabyte) and a similar memory bandwidth of 6.4 GB/s (Prescott: 800 MHz x 64 bit, Sledge: 400 MHz x 128 bit).
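The two bandwidth figures follow directly from transfer rate times bus width. A small Python check, with a function name of our own choosing:

```python
def peak_bandwidth_gb_s(transfer_rate_mhz: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s (10^9 bytes per second) of a memory interface."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mhz * 1e6 * bytes_per_transfer / 1e9


prescott = peak_bandwidth_gb_s(800, 64)   # 800 MHz x 64 bit
sledge = peak_bandwidth_gb_s(400, 128)    # 400 MHz x 128 bit
print(prescott, sledge)
```

Both interfaces come out at the same 6.4 GB/s peak: one is narrow and fast, the other wide and slower.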
Meanwhile, applications and compilers are becoming better optimized for the hyper-pipelined Pentium, and that is without any of the new tricks we discussed above: speculative pre-computation, branch threading and load threading. The application of speculative pre-computation may give Sledge a hard time.
The Die Size Advantage
Prescott may mark the end of an era in which AMD could erode Intel's market share as a result of the Athlon's much smaller die size compared to the Pentium 4 core. Fab capacity is the second hurdle (after processor performance) on the road to a bigger piece of the x86 processor market, a market worth a few dozen billion dollars. The good news for AMD is that its architects did a dual-processor-on-a-chip (CMP) version of Sledgehammer. This Hammer version may well turn out to be AMD's mainstream processor at the 90 nm process node. Two Hammer cores together with a 1 Megabyte L2 cache would consume something like 95 square millimeters at the 90 nm process node. That is smaller than its current smallest processor, the Duron, which has something like 106 mm2, but larger than the projections for Prescott, which are on the order of 80 mm2.
Multiplying Model Numbers
It would be justifiable if AMD multiplied its model number by 2 for the "dual-processor-on-a-chip" version of the Sledgehammer to get into the 8000+ range, something that may become the only option from a marketing viewpoint. The performance of a 2 processor-on-a-chip Sledgehammer is likely to be higher again than that of a 2 thread Prescott, making multi-processing very important for AMD even in the desktop segment. Microsoft's licensing model will play a crucial role here. The current Intel brokered distinction between logical and physical processors does not benefit AMD: AMD users would have to pay significantly more for their version of Windows than Intel users. A more reasonable definition is needed to make a distinction between a desktop PC and the various forms of server PCs. A definition where a desktop would have only one chip containing processors and a server would have two or more chips containing processors may be a solution.