April 16, 2002: Intel's Prescott Prospects
Intel's Prescott Prospects
(by Hans de Vries)
VLSI 2002 symposia.
The Prescott is Intel's 90 nm version of the Pentium 4 architecture, with architectural enhancements such as simultaneous multithreading for the desktop, a large L2 cache (probably 1 Megabyte) and, according to a number of rumors, a 64 bit extension codenamed Yamhill. The advance program of the VLSI 2002 symposia, to be held in Honolulu this summer, shows about 18 presentations related to the Pentium 4 architecture. We've compiled a list of the relevant abstracts here. Just by looking at the abstracts, which are limited to 75 words, one can see that considerable advances are being made in a number of areas: very high speed L1 caches of 32 kbyte and 16 kbyte, larger than the current 8 kbyte and much faster as well, and a larger register file of 256 words of 32 bit compared to 128 words now. There will also be presentations on a new Integer Execution Unit and a new Address Generator. Some of these units were already disclosed earlier: the 5 GHz ALU at the fall 2001 IDF and the register file at last year's VLSI symposia. The speed of the building blocks is impressive:
In a 130 nm process at a low 1.2 Volt:
- 4.0 GHz 32 bit Address Generator
- 5.0 GHz Integer Execution Core
- 4.5 GHz 32 k byte Cache
In a 100 nm process
- 6.0 GHz 16 k byte Cache
These speeds are especially impressive when compared with the current Northwood in 130 nm with 60 nm gate lengths. This processor runs at 'only' 1.5 GHz at the relatively low voltage of 1.2V (see the shmoo plot here). The Northwood Address Generators and Integer Execution Units run at double that speed (3 GHz) but are basically 16 bit, while the units here are supposed to be truly 32 bit. The register file was reported as running at 6 GHz last year (voltage?) and AnandTech has some pictures of a 10 GHz 32 bit ALU at a high 1.8V here.
One can truly appreciate the speed of these caches when comparing them with the new "One step forward, two steps back" JEDEC DDR II standard. The access frequency of the much smaller 2 kB to 4 kB row buffers in these chips is only 83 MHz and 100 MHz for DDRII-333 and DDRII-400 respectively. That is an astonishing 100 times slower than the larger caches presented here would be when implemented in Prescott's 90 nm process and run at a decent voltage. The fact that this standard is designed by a JEDEC committee seems to result in something that can be supported by all members, including the weakest of the weak DRAM companies.
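To put rough numbers on this comparison, the small Python sketch below divides the cache clock rates quoted above by the row buffer access rates. The names are ours and purely illustrative; the only inputs are figures already mentioned in the text.

```python
# Clock figures quoted in the article, in Hz.
CACHE_HZ = {"32 kB cache, 130 nm": 4.5e9, "16 kB cache, 100 nm": 6.0e9}
ROW_BUFFER_HZ = {"DDRII-333": 83e6, "DDRII-400": 100e6}


def speed_ratio(cache_hz, dram_hz):
    """How many cache accesses fit in one row buffer access period."""
    return cache_hz / dram_hz


def implied_cache_hz(dram_hz, factor=100):
    """Cache clock needed for the cache to be `factor` times faster."""
    return factor * dram_hz


if __name__ == "__main__":
    for cache_name, cache_hz in CACHE_HZ.items():
        for dram_name, dram_hz in ROW_BUFFER_HZ.items():
            ratio = speed_ratio(cache_hz, dram_hz)
            print(f"{cache_name} vs {dram_name}: {ratio:.0f}x")
```

With the 130 nm and 100 nm test chip figures the ratio comes out between roughly 45 and 72, so the factor of 100 presumes a 90 nm cache running somewhere between 8.3 and 10 GHz.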
A 64 bit Yamhill implementation may cost less than 2% extra die space.
The Yamhill rumor
To start with the rumor: the 64 bit Yamhill extension is supposed to be Intel's answer to AMD's Hammer family. One can imagine that the Pentium architects are thankful to AMD for extending the x86 architecture to 64 bit, an architecture that was supposed to reach its end-of-life stage with the introduction of Itanium and its descendants. Hammer, however, is up and running now, outperforming the fastest members of the Itanium family, at least in integer performance. Intel management has to make U-turns once in a while, as with its Rambus-only policy lately.
Basic 64 bit integer operations
A 64 bit extension by itself does not imply that the Integer Execution Unit and the Integer Register File have to be extended to 64 bit. A minimal implementation would simply use the 32 bit integer pipeline for 64 bit integer operations. The Floating Point/MMX/SSE pipelines are already 64 bit. No need for changes here.
The dual 'Rapid Execution' Units and the 32 bit register file run at twice the frequency and are together able to handle two 64 bit operations per cycle. (The Hammer is able to do 3 per cycle, but its 64 bit additions might have twice the latency.) The mechanisms to decode an operation into 2 sub-operations are already available in the pipeline. The 128 bit XMM/SSE operations, for example, are handled in two 64 bit pieces.
It would be advantageous if the basic functional timing of the rapid execution engines could remain the same. The current ones handle a 32 bit addition as two skewed 16 bit ones: the 2nd addition starts 1/2 a cycle after the first, when the carry bit is available. The newer integer ALUs seem to be fully 32 bit, so the same trick may be used to handle a 64 bit addition as two skewed 32 bit ones. Hardware for a full 32 bit addition takes about 15-20% longer than that for a 16 bit addition. It seems that Intel's circuit designers have closed this gap with novel design techniques like 'forward body biasing' et cetera.
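The skewed addition can be illustrated with a few lines of Python. This is only a functional sketch of the idea, not a description of Intel's circuit: the function names are made up, and the two "half cycles" are simply two sequential calls where the second waits for the carry of the first.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """One pass through a 32 bit adder: returns (32 bit sum, carry out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64_skewed(a, b):
    """64 bit addition built from two 32 bit additions.

    The upper half can only start once the carry out of the lower
    half is known, which is the half cycle skew described above.
    """
    low, carry = add32(a, b)                        # first half cycle: bits 0..31
    high, carry_out = add32(a >> 32, b >> 32, carry)  # next half cycle: bits 32..63
    return (high << 32) | low, carry_out


if __name__ == "__main__":
    result, carry_out = add64_skewed(0x00000000FFFFFFFF, 1)
    print(hex(result), carry_out)
```

A dependent operation that only needs the lower 32 bits can start right after the first half cycle, which is why the skew costs little in practice.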
More 64 bit operations
The Rapid Execution Engine handles 32 bit logic and additive operations, which are easy to extend to 64 bit. Other integer operations are more complicated. 32 bit integer multiplies are currently handled by the floating point multiplier. This unit can handle 64 bit multiplications as a result of the 80 bit floating point format, which uses a 64 bit mantissa. The shifts and divides probably remain in the lower priority legacy area. This hardware is designed without the extreme efforts in circuit design and layout used for the ultra high frequency integer ALUs. The integer divide typically uses only a fraction of the transistors used for the multiply.
A simple shift and subtract state machine generates 2 result bits per cycle. It has to be adapted for 64 bit operation. The remaining legacy integer operations, invented long ago, are probably left untouched and not extended to 64 bit.
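A shift and subtract divider that produces 2 result bits per step can be modeled as follows. This is a generic textbook style sketch, assuming unsigned operands and an even bit width. It is not taken from any Intel documentation, and the function name is hypothetical. Real hardware would compare against 1x, 2x and 3x the divisor in parallel; the inner loop here stands in for that selection.

```python
def divide_2bits_per_cycle(dividend, divisor, width=64):
    """Unsigned restoring division that yields 2 quotient bits per step.

    Assumes 0 <= dividend < 2**width and an even width.
    Returns (quotient, remainder, number of steps).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    cycles = 0
    for shift in range(width - 2, -1, -2):
        # Shift the next two dividend bits into the partial remainder.
        remainder = (remainder << 2) | ((dividend >> shift) & 0b11)
        # Pick the quotient digit (0..3) by repeated trial subtraction.
        digit = 0
        while remainder >= divisor:
            remainder -= divisor
            digit += 1
        quotient = (quotient << 2) | digit
        cycles += 1
    return quotient, remainder, cycles


if __name__ == "__main__":
    print(divide_2bits_per_cycle(100, 7, width=32))
```

Going from 32 to 64 bit doubles the number of steps from 16 to 32, which is the adaptation the state machine would need.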
64 bit general purpose registers
A minimal implementation would add eight 32 bit registers to the basic set of eight to extend them to 64 bit. This is in fact similar to adding the architectural (register) state for an extra thread. A processor supporting two 64 bit threads would need 32 general purpose registers of 32 bit. The integer register file is likely to support 256 words. The architectural state would need 32 of them, leaving 2x112 for the renamed registers, which means that each of the two threads can use almost all the resources of the pipeline when the other is waiting or restarting after a misprediction.
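The register budget above is simple arithmetic, sketched here in Python. The function and parameter names are ours; the defaults are the article's assumptions (eight general purpose registers per thread, a 256 word file of 32 bit words), not confirmed Prescott specifications.

```python
def register_budget(file_words=256, threads=2, gprs=8, gpr_bits=64, word_bits=32):
    """Split a physical register file into architectural and rename words.

    Returns (architectural words, rename words in total, rename words per thread).
    """
    words_per_gpr = gpr_bits // word_bits
    architectural = threads * gprs * words_per_gpr
    rename_total = file_words - architectural
    return architectural, rename_total, rename_total // threads


if __name__ == "__main__":
    # Two 64 bit threads on the presumed 256 word file.
    print(register_budget())
    # For comparison: two 32 bit threads on a 128 word file.
    print(register_budget(file_words=128, gpr_bits=32))
```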
The first level caches.
The 2nd and 3rd level caches as well as the memory interface do not need any changes specific to Yamhill: the virtual address, which is expanded to anywhere between 32 and 64 bit (Hammer uses 48 bit), has already been translated to the physical address at that point. Physical addresses broke the 32 bit barrier quite a while ago. The Level 1 data cache, however, and to a lesser extent the Instruction Trace Cache do need modifications for Yamhill. A lot of the extra die area would come from the enlarged TLBs (Translation Look-aside Buffers) that translate the higher address bits of the virtual address into a physical address. The current SMT capable Xeon L1 Data cache has a combined TLB for both threads and individual Instruction TLBs for each of the two threads. The current Data Cache Address Generators first calculate the lower 16 bits to index into the cache, followed 1/2 a cycle later by the higher 16 bits to access the TLB. A solution similar to that in the Integer Execution Unit is needed here: the lower 32 bits are needed in the first half cycle, with the remaining higher bits coming in the next 1/2 cycle to address the TLB. The fact that the Address Generator now works on 32 bit might be an indication of Yamhill.
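The two stage address generation can be sketched functionally as below. This is only an illustration under our own assumptions: a base plus offset address, a 32 bit first stage and a 48 bit virtual address as mentioned for Hammer. The function name is hypothetical.

```python
def staged_address(base, offset, low_bits=32, va_bits=48):
    """Compute base + offset in two stages.

    Stage 1 yields the low bits, enough to start indexing the cache.
    Stage 2 adds the carry into the upper bits, which go to the TLB.
    Returns (low part, high part).
    """
    low_mask = (1 << low_bits) - 1
    high_mask = (1 << (va_bits - low_bits)) - 1

    # Stage 1 (first half cycle): low bits and carry out.
    low_sum = (base & low_mask) + (offset & low_mask)
    low = low_sum & low_mask
    carry = low_sum >> low_bits

    # Stage 2 (next half cycle): high bits, using the carry from stage 1.
    high = ((base >> low_bits) + (offset >> low_bits) + carry) & high_mask
    return low, high


if __name__ == "__main__":
    low, high = staged_address(0x7FFFFFFFF000, 0x2000)
    print(hex(low), hex(high))
```

The cache index is taken from the low part, so the lookup can begin while the upper bits are still being computed for the TLB.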
Improved Simultaneous Multithreading.
Doubling the L1 Data Cache Frequency
The most important contribution to improved Hyper Threading comes from doubling the L1 Data Cache Frequency. In the first SMT Xeon the two threads have to compete for a single read port that runs at half the speed of the rest of the circuits. The 32 kB cache abstract mentions dual ports but I presume that dual means: 1 read and 1 write port, just like in the current 8 k byte L1 cache.
Doubling the Integer Register File
Expanding the register file from 128 to 256 words means that each of the two threads has about enough renamed registers to cover all operations in flight in the entire pipeline. It is not very likely that the desktop version of Prescott would support more than 2 threads. The PC market is basically an upgrade market.
Steps that are too big first make people postpone buying until the systems become available, and then make them wait longer before buying their next system, because that next system needs to be so much better again. The server 'Xeon' version, however, might well support four threads, a feature which would be disabled on the desktop version. The increased SMT support provided by new building blocks like the ones to be presented at the VLSI 2002 symposia would make that a worthwhile step.
Further and Future Improvements
Having SMT firmly on the tracks opens a whole range of further improvements that become interesting.
An obvious next step would be to bring the rest of the pipelines into the same clock domain as the Integer Execution Units: the Floating Point Units and the MMX/XMM/SSE units. Optimal support for 4 threads would also need a full speed pipeline from the Instruction Trace Cache to the execution units.
The consequence of doubling the number of pipeline stages in all these units is that the number of instructions in flight in these units also doubles, resulting in the need for more renamed registers and thus a larger register file.
Splitting Cycles to stay on the Cutting Edge
SMT, once mastered, really is the Great Enabler. This is probably the reason why Intel bought Compaq's Alpha patents in a (non-exclusive) deal, maybe more for x86 than for the Itanium. The latter's big architectural register file gives it a disadvantage for SMT, where a copy of the entire register file is needed for each additional thread. SMT makes hyper-pipelining the name of the game. Intel may bring the whole pipeline to the double frequency in a number of steps. We would not be surprised if, after achieving this goal, the architects and circuit designers set their minds on trying to double the pipeline frequency again.
Multi Threading: A way to Speed Up Single Thread Applications
Another good reason for multi-threading: if you want to have a fast single thread processor, use multi-threading tricks! A processor with optimal support for its threads, meaning that each of them can run close to maximal speed, can be used to implement a number of methods to overcome the bottlenecks that cannot be solved by hyper-pipelining alone:
(1) "Thread based Speculative Pre-Computation" for Memory Latency as a result of Cache Misses
Presented by John P. Shen, Director of Intel's Micro Architecture Lab at the MPF 2001:
A cache pre-fetching method with extra threads added to the original binary of programs. Able to
pre-fetch unpredictable memory access patterns (like pointer intensive code). The extra parallel
SP threads (Speculative Pre-computation) may look like stripped down versions of the original
binary code with only those instructions left that are needed for the memory address calculations.
These threads would progress faster than the original code and the loads would pre-fetch the
cache-lines from memory into the caches thereby reducing access latency time for the real program.
(2) "Branch Threading" for Branch Miss Prediction.
If the branch prediction hardware concludes that a certain conditional branch is very hard to predict then
it can decide to spawn a second thread. The original thread follows one path and the second thread
follows the alternative path. Once the condition is finally known at the end of the pipeline, the wrong
path is discarded.
(3) "Load Threading" for Data Load Miss Predictions.
Some architectures like the EV6 and the Pentium 4 predict whether a data load from memory can
be scheduled ahead of preceding stores before any of the addresses are known. If the store overwrites
the load data later then the pipeline has to be restarted in much the same way as after a branch
misprediction. Multi-threading can allow both choices to be executed simultaneously until the right choice is known.
The latter two methods need a fine-grained form of multi-threading that is probably not available in the Hyper-Threading Pentiums, at least not in the current implementation, which seems to need a pipeline flush while forking. The use of these kinds of tricks really needs a lot of multi-threading capability in a processor, especially if more than one is used at the same time. A wide scale use of these tricks is still a bit beyond Prescott's capabilities.
4 GHz Pentium 4 or an 8 GHz Pentium 5 ?
Mega Hertz... what Mega Hertz ?
The 90 nm Prescott is expected to reach speeds of 4 GHz and beyond. The Integer Execution Units, however, run at 8 GHz, and so do the integer register file, the address generators and now, as we may presume, also the L1 data cache. So why call it a 4 GHz processor? Technically speaking it is not a 4 GHz processor but an 8 GHz processor...
A Chance for a Change
Such a sudden jump in Giga Hertz needs to be accompanied by a significant increase in performance to make it marketable to the average customer. The 50% to 60% extra performance brought by improved Simultaneous Multi Threading offers this as a one-time-only opportunity. If Intel ever wants to use the real frequency of the Integer Pipeline, then it has to make the transition with the introduction of the Prescott.
A name change to Pentium 5 would be appropriate to signal a major architecture change.
Marketing and Metaphors
Marketing gets the task of explaining the term simultaneous multi-threading to the general public, most likely ending up with a number of metaphors that give people the illusion that they understand something while in fact totally confusing reality. We've heard a few nice ones from AMD when it had to explain that Mega Hertz is not the same as performance. Something like "Animals with little legs having to run like crazy just to keep up with the larger (Athlon) species...." The classical combustion engine may help here: "A four cylinder engine with twice the RPM produces the same amount of horsepower as an eight cylinder does...." The extra complication is to explain how the second logical processor is the result of the much higher frequency: "The processor is so incredibly fast that it can work like two", something like the energetic modern women who have a job and take care of their children at the same time... I would not be surprised if, as a side effect of such a campaign, we see some psychiatric researchers proposing that raised brain wave frequencies can induce schizophrenia...
Well, so much for the hope that marketing can produce some decent consumer education...
An 8 GHz processor in a 90 nm process would be consistent with Intel's statements that its 70 nm processors will run at more than 10 GHz. These predictions were already made one and a half years ago.
Intel versus AMD.
A single thread 90 nm Prescott is likely to be on par with a 90 nm version of the Sledgehammer. It gets a new, larger L1 cache with double the access rate, a similarly sized L2 cache (1 Megabyte) and a similar memory bandwidth of 6.4 GB/s (Prescott: 800 MHz x 64 bit, Sledge: 400 MHz x 128 bit).
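The two bandwidth figures can be checked with a one line formula: transfers per second times bus width in bytes. The helper name below is ours; the inputs are the numbers quoted above.

```python
def peak_bandwidth_bytes(transfers_per_second, bus_bits):
    """Peak bandwidth in bytes per second for a given transfer rate and bus width."""
    return transfers_per_second * bus_bits // 8


if __name__ == "__main__":
    prescott = peak_bandwidth_bytes(800_000_000, 64)   # 800 MHz x 64 bit
    sledge = peak_bandwidth_bytes(400_000_000, 128)    # 400 MHz x 128 bit
    print(prescott / 1e9, "GB/s", sledge / 1e9, "GB/s")
```

Both configurations reach the same peak figure: one with a fast, narrow bus, the other with a slower bus that is twice as wide.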
Applications and compilers are becoming better optimized for the hyper-pipelined Pentium. All this is without any of the new tricks we discussed above: speculative pre-computation, branch threading and load threading. The application of speculative pre-computation may give Sledge a hard time.
The Die Size Advantage
Prescott may mark the end of an era in which AMD could erode Intel's market share as a result of the Athlon's much smaller die size compared to the Pentium 4 core. Fab capacity is the second hurdle (after processor performance) on the road to a bigger piece of the x86 processor market, a market good for a few dozen billion dollars. The good news for AMD is that its architects did a dual-processor-on-a-chip (CMP) version of Sledgehammer. This Hammer version may well turn out to be AMD's mainstream processor at the 90 nm process node. Two Hammer cores together with 1 Megabyte of L2 cache would consume something like 95 square millimeters at the 90 nm process node. That is smaller than its current smallest processor, the Duron, which has something like 106 mm2, but larger than the projections for Prescott, which are in the order of 80 mm2.
Multiplying Model Numbers
It would be justifiable if AMD multiplied its model number by 2 for the "dual-processor-on-a-chip" version of the Sledgehammer to get into the 8000+ range, something that may become the only option from a marketing viewpoint. The performance of a 2 processor-on-a-chip Sledgehammer is likely to be higher again than that of a 2 thread Prescott, making multi-processing very important for AMD even in the desktop segment. Microsoft's licensing model will play a crucial role here. The current Intel brokered distinction between logical and physical processors does not benefit AMD: AMD users would have to pay significantly more for their version of Windows than Intel users. A more reasonable definition is needed to distinguish between a desktop PC and the various forms of server PCs. A definition where a desktop would have only one chip containing processors and a server would have two or more chips containing processors may be a solution.
Abstracts from the Intel presentations.
4GHz 130nm Address Generation Unit with 32-bit Sparse-tree Adder Core
Mathew, Mark Anders, Ram K. Krishnamurthy and Shekhar Borkar
Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
This paper describes a 32-bit Address Generation Unit (AGU) designed for 4GHz operation in 1.2V, 130nm technology. The AGU utilizes a 152ps dual-Vt adder core to achieve 20% delay reduction, 80% lower interconnect density and a low (1%) active energy leakage component. The semi-dynamic implementation enables an average energy profile similar to static CMOS, with good sub-130nm
Supply Voltage Clocking for 5GHz 130nm Integer Execution Core
K. Krishnamurthy, Steven Hsu, Mark Anders, Brad Bloechel, Bhaskar Chatterjee*, Manoj Sachdev*, Shekhar Borkar
Circuits Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
This paper describes dual-Vcc clocking on a 1.2V, 5GHz integer execution core fabricated in 130nm CMOS to achieve up to 71% measured clock power (including 15% active leakage) reduction. A write-port style pass-transistor latch and split-output level-converting local clock buffer are described for robust, DC power free low-Vcc
4.5GHz 130nm 32KB L0 Cache with a Self Reverse Bias Scheme
K. Hsu, Atila Alvandpour, Sanu Mathew, Shih-Lien Lu, Ram K. Krishnamurthy,
Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
This paper describes a 32KB dual-ported L0 cache for 4.5GHz operation in 1.2V, 130nm CMOS. The local bitline uses a Self Reverse Bias scheme to achieve ?220mV access transistor underdrive without external bias voltage or gate-oxide overstress. 11% faster read delay and 104% higher DC robustness (including 7x measured active leakage reduction) is achieved over optimized high-performance dual-Vt scheme.
a 3GHz, 130nm, Intel® Pentium® 4 Processor
Deleganes, Jonathon Douglas, Badari Kommandur, Marek Patyra
Intel Architecture Group, 2501 NW 229th Ave., MS RA2-401, Hillsboro, OR, 97124, USA
The design of an IA32 processor fabricated on state-of-the art 130nm CMOS process with improved six layers of dual-damascene copper metallization is described. Engineering an IA32 processor for server, desktop, and mobile platforms, particularly meeting diverse power & thermal constraints, poses numerous challenges. This presentation focuses on methods applied to achieve high frequency and low power on the same chip, particularly, the use of Dual Vt process, clock skew design, and thermal management techniques.
Body Bias for Microprocessors in 130nm Technology Generation and Beyond
Ali Keshavarzi, Siva Narendra, Bradley Bloechel, Shekhar Borkar and Vivek De Microprocessor Research, Intel Labs, Hillsboro, OR, USA
and testchip measurements show that forward body bias (FBB) can be used effectively to improve performance and reduce complexity of a 130nm dual-VT technology, reduce leakage power during burn-in and standby, improve circuit delay and robustness, and reduce active power. FBB allows performance advantages of low temperature operation to be realized fully without requiring transistor redesign, and also improves VT variations, mismatch, and gm x ro
A 6GHz, 16Kbytes L1 Cache in a 100nm Dual-VT Technology Using a Bitline Leakage Reduction (BLR) Technique
Yibin Ye, Muhammad Khellah, Dinesh Somasekhar, Ali Farhang and Vivek De Microprocessor Research, Intel Labs, Hillsboro, OR, USA
L1 cache testchip with dual-VT cell and a bitline leakage reduction (BLR) technique has been implemented in a 100nm dual-VT technology. Area of a 2KBytes array is 263µm x 204µm, which is virtually the same as the best conventional design with high-VT cell. BLR eliminates impacts of bitline leakage on performance and noise margin with minimal area overhead. Bitline delay improves by 23%, thus enabling 6GHz operation. Energy consumption per cycle is 15% higher.
Leakage-Tolerant Dynamic Register File Using Leakage Bypass with Stack Forcing (LBSF) and Source Follower NMOS (SFN) Techniques
Tang, Steven Hsu, Yibin Ye, James Tschanz, Dinesh Somasekhar, Siva Narendra, Shih-Lien Lu, Ram Krishnamurthy and Vivek De Microprocessor Research, Intel Labs, Hillsboro, OR, USA
and SFN leakage-tolerant techniques improve robustness of leakage-sensitive and performance-critical wide dynamic circuits in the local and global bitlines of a 256X32b register file in a 100nm dual-VT technology. The full LBSF design improves clock frequency by 50% or reduces energy by 37%, compared to the best dual-VT (DVT) design. Performance advantages of LBSF and SFN become more significant as
Processor 800 MT/s Front Side Bus with Ground Referenced Voltage Source I/O
P. Thomas, Ian A. Young Intel Corporation Portland Technology Development RA1-309, 5200 NE Elam Young Parkway Hillsboro OR 97124, USA
40cm multi-drop bus shared by 5 test chips to emulate 4 processors and a chipset runs error free at 800MT/s with 130mV margin using Ground Referenced Voltage Source (GRVS) I/O scheme. For comparison, when the same test chip is programmed to use Gunning Transceiver Logic (GTL), the bus speed is 500 MT/s for the same 130mV margin under identical conditions.
Pulsed Bus for On-Chip Interconnects
Muhammad Khellah, James Tschanz, Yibin Ye, Siva Narendra and Vivek De Circuit Research, Intel Labs, Hillsboro, OR, USA
Static Pulsed Bus (SPB) technique offers significant advantages over conventional static bus (SB) in delay, energy, total device width and peak VCC current for 1500mm to 4500mm long M4 buses in a 100nm technology. These improvements are due to reduction in effective coupling capacitance and repeater skewing enabled by monotonic signal transition. Unlike dynamic schemes, energy savings of SPB are maintained across all activity factors without any clock power or routing
Transition-Encoded Dynamic Bus Technique for High-Performance Interconnects
Anders, Nivruti Rai*, Ram Krishnamurthy, Shekhar Borkar
Circuit Research, Intel Labs, Intel Corporation, Hillsboro, OR 97124, USA
*Desktop Products Group, Intel Corporation, Hillsboro, OR 97124, USA
transition-encoded dynamic bus technique enables interconnect delay reduction while maintaining the robustness and switching energy behavior of a static bus. Efficient circuits, designed for a drop-in replacement, enable significant delay and peak-current reduction even for short buses, while obtaining energy savings at aggressive delay targets. In a 180nm 32-bit microprocessor, 79% of all global buses exhibit 10%-35% performance
Accurate and Efficient Analysis Method for Multi-Gb/s Chip-to-chip Signaling
K. Casper, Matthew Haycock, Randy Mooney Circuit Research, Intel Labs
This paper introduces an accurate method of modeling the performance of high-speed chip-to-chip signaling systems. Implemented in a simulation tool, it precisely accounts for intersymbol interference, and echos as well as circuit related effects such as thermal noise, power supply noise and jitter. We correlated the simulation tool to actual measurements of a high-speed signaling system then used this tool to make tradeoffs between different methods of chip-to-chip signaling with and
present a technique to enable the integration of sensitive analog circuits with a high performance microprocessor (Pentium® 4), on a lossy substrate. We show that by exploiting the spectral content of substrate noise, and the use appropriately tuned analog amplification it is possible to limit the isolation requirements to 70dB. By using a combination of measurement and field solver results, we show that a minimal process enhancement (i.e. a deep nwell) will yield 50 dB of isolation, and the remainder can be achieved by layout and differential circuit techniques.
Node Engineering for Chip-Level Soft Error Rate Improvement
Karnik, Sriram Vangal, V. Veeramachaneni, Peter Hazucha, Vasantha Erraguntla,
Shekhar Borkar Circuit Research, Intel Labs, Hillsboro, OR, U.S.A.
This paper presents a technique to selectively engineer sequential or domino nodes in high performance circuits to improve soft error rate (SER) induced by cosmic rays or alpha particles. In 0.18 µm process, the SER improvement is as much as 3X at the cell-level, 1.8X at the block-level and 1.3X at the chip-level without any penalty in performance or area, and <3% power penalty. The node selection, hardening and SER quantification steps are fully
Optimizations of a High Performance Microprocessor Using Combinations of Dual-VT and Transistor Sizing
Tschanz, Yibin Ye, Liqiong Wei (1), Venkatesh Govindarajulu, Nitin Borkar, Steven Burns (2), Tanay Karnik, Shekhar Borkar and Vivek De
Microprocessor Research, (1) Mobile Architecture, (2) CAD, Intel Labs, Hillsboro, OR, USA
optimizations of dual-VT allocation and transistor sizing for a high performance microprocessor reduce low-VT usage by 36%-64%, compared to a design where only dual-VT allocation is optimized. Designs optimized for minimum power (DVT+S) and minimum area (L-SDVT) reduce leakage power by 20%, with minimal impact on total power and die area. An enhancement of the optimum DVT+S design allows processor frequency to be increased efficiently during manufacturing through low-VT device leakage push only.
& Validation of the Pentium® III and Pentium® 4 Processors
Rahal-Arabi, Greg Taylor, Matthew Ma, and Clair Webb Intel Corporation / Logic Technology Development 5200 NE Elam Young Parkway Hillsboro, Oregon, 97124
In this paper, we present an empirical approach for the validation of the power supply impedance model. The model is widely used to design the power delivery for high performance systems. For this purpose, several silicon wafers of the Pentium® III and Pentium® 4 processors were built with various amounts of decoupling. The measured data showed significant discrepancies with the model predictions and provided useful insights in investigating the model regions of
of Adaptive Supply Voltage and Body Bias for Reducing Impact of Parameter Variations in Low Power and High Performance Microprocessors
Tschanz, James Kao (1), Siva Narendra, Raj Nair and Vivek De
Microprocessor Research, Intel Labs, Hillsboro, OR, USA; (1) Massachusetts Institute of Technology
Testchip measurements show that adaptive VCC is useful for reducing impacts of die-to-die and WID parameter variations on frequency, active power and leakage power distributions of both low power and high performance microprocessors. Using adaptive VCC together with adaptive VBS or WID-VBS is much more effective than using any of them individually. Adaptive VCC+WID-VBS increases the number of dies accepted in the highest two frequency bins to 80%