The Battle in 64 bit Land, 2003 and Beyond

By: Paul DeMone (pdemone@realworldtech.com)
Updated: 01-26-2003

This is the third article in a series that I started in 2000 with The Looming Battle in 64 bit Land and updated with The Battle in 64 bit Land Revisited in 2001. In the year and a half since there have been a number of important events in this esoteric but important market segment. These events are reviewed and then attention is turned to 2003, a crucial year for determining the future leaders in the high end 64 bit MPU market.

Winners and Losers under a Darkening Sky

The most important events in the 64 bit universe in the past year and a half have been outside the technical arena. Dominating everything was a major industry downturn from a weakened economy, reduced capital spending due to the events of September 11th and recent U.S. corporate scandals. Especially hard hit was Sun Microsystems, which suffered a significant reduction in revenue, down 32% from two years ago in the face of both the downturn in the economy and increasing competitive pressure, mainly from Dell and IBM [1]. Sun was squeezed hard between low cost and powerful x86 based servers on the low end and IBM’s successful introduction of POWER4 based systems on the high end.

Troubled computer giant Compaq surprised the computer world in June 2001 by announcing all Alpha development would cease after the EV79. The former EV8 development team was recruited by Intel essentially intact and was immediately put to work learning the ins and outs of IA64. Ironically Alpha itself would outlive its short sighted owner as Compaq was acquired by Hewlett Packard in a controversial deal which included a long and bitter fight between HP management and a faction of stockholders which included members of the Hewlett and Packard families. HP now faces the tricky problem of growing a product line based on a new 64 bit architecture (IA64) while managing the winding down of two others (PA-RISC and Alpha).

Last July, Intel started commercial shipments of the IA64 processor known during development as McKinley, under the trade name Itanium 2. The unexpectedly robust performance of this device was somewhat overshadowed by Intel’s failure to deliver an accompanying chipset in a timely manner. This likely factored in a long, and what some saw as ominous, delay by Dell in announcing it would adopt the Itanium 2 and its successors. What wasn’t surprising was HP’s ability to hit the ground running with Itanium 2, having participated heavily in its development. Indeed, for several months the only Itanium 2 based hardware shipping was 1 to 4 CPU systems based on HP’s ZX1 chipset.

Also hard hit over the last year and a half was aspiring 64 bit player Advanced Micro Devices. It racked up a string of money losing quarters during which it gave up most of the x86 market share gains obtained during its period of peak competitive advantage when it wielded a superior product against a stumbling and bumbling Intel. Despite serious difficulty with both of its primary businesses, x86 MPUs and flash memory, AMD determinedly pushed forward with a scheme to bridge the gap between the popular mass market, 32 bit world of PC hardware and the corporate and technical world of 64 bit computing by extending x86 to 64 bits (x86-64). AMD plans to introduce two versions of its x86-64 based 0.13 µm K8 generation devices this year, the Athlon 64 for PCs, and the more powerful Opteron for workstations and small to mid scale servers.

The Competitive Landscape

The performance of current 64 bit microprocessors as measured by the SPEC CPU 2000 industry standard benchmark suite is shown below in Figure 1. The performance of the fastest Intel and AMD 32 bit x86 processors currently available is also included for reference purposes.


Figure 1 Current 64 Bit MPUs (x86 MPUs shown for reference)

What is particularly striking about Figure 1 is the prominent bimodal division of high end processors into "have" and "have not" camps of uniprocessor performance. It is apparent that only the most technically competent RISC processors can keep up with the blistering performance pace set by 32 bit x86 MPUs. A pace compelled by competitive pressure between Intel and AMD for market share and fueled by bleeding edge semiconductor processes and large, skilled, and well funded design teams.

In the "have not" camp are the sagging "house brand" RISC processor families of SGI, Sun, and HP, namely MIPS, UltraSPARC, and PA-RISC respectively. Although both SGI and HP promise to introduce newer and faster RISC products, the huge and rapidly growing performance gulf between their proprietary processor lines and Intel’s IA64 line (which both have adopted as their long term platform) makes one wonder how much effort will actually be expended, and to what result, given the escalating costs of designing competitive high end processors. In contrast Sun has vowed to continue to fight on using its UltraSPARC line of processors. Its inability to bring 0.13 µm devices to market as quickly as IBM has put Sun in an unusually poor competitive position in uniprocessor performance.

In the "have" camp sit Alpha, Itanium 2, POWER4/POWER4+, mainstream desktop x86 processors, and a surprising newcomer, the SPARC64 V. Fujitsu firmly stuck a finger in Sun’s eye by announcing the SPARC64 V at last October’s Microprocessor Forum. This 0.13 µm, out-of-order execution implementation of the 64 bit SPARC architecture easily outperforms the fastest UltraSPARC-III on the SPEC CPU 2k benchmark suite while sporting power dissipation a fraction of that of devices with comparable performance. IBM’s gamble on a highly automated MPU development process paid off to the extent that it beat its RISC rivals, as well as IA64, to the 0.13 µm process node with its POWER4+.

Despite shipping 0.13 µm x86 devices for about a year, Intel’s first 0.13 µm IA64 MPU, code named Madison, won’t be introduced for another 5 or 6 months. The EV79, a 0.13 µm shrink of the 0.18 µm EV7, will be even later, shipping in about a year.

Alpha: Haunting Rivals from Beyond the Grave

Despite being kept on minimum life support and locked in a dark crawl space, HP’s unwanted bastard stepchild Alpha continues to confound friends and foes alike. The last major revamp of the venerable EV68, to 1250 MHz, has proven surprisingly formidable for this small, 15 million transistor device. The ability of the latest EV68-based hardware to hang tough with expensive MCM packaged POWER4 systems bulging with custom silicon L3 cache is a stark testament to the architectural design and silicon engineering that went into the EV6x processor core so many years ago. This core will live on for years to come in the form of the newly introduced EV7, effectively a single chip supercomputer compute node.

Due to the decision to accept a relatively slow, low bandwidth on-chip L2 cache rather than modify the EV6 core (which was optimized around off-chip cache) the EV7 is not appreciably faster than the EV68 on single threaded applications or in uniprocessor systems. But EV7’s performance within large scale systems will likely be beyond peer for years to come. A comparison of a prototype 16 processor EV7 system to a 4 processor, cross-bar based ES45 system indicates the EV7 gear has twice the per CPU memory bandwidth and much better inter-processor communication (MPI) performance [2]. The results are summarized in Table 1.

 

Table 1 ES45 Versus Prototype EV7 System

                                        ES45             Prototype
                                        4 x EV68/1000    16 x EV7/1200
Memory, local, read bandwidth (GB/s)    2.27             4.58
Memory, remote, read bandwidth (GB/s)   N/A              3.60
MPI, unidirectional, latency (µs)       4.9              1.7
MPI, unidirectional, bandwidth (MB/s)   792              1080
MPI, unidirectional, latency (µs)       8.9              2.2
MPI, unidirectional, bandwidth (MB/s)   379              485
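As a quick check on the "twice the per CPU memory bandwidth" claim, the ratios implied by the table above can be worked out directly. The short Python sketch below uses only the local read bandwidth row and the first pair of MPI rows; the variable and function names are illustrative, not taken from the benchmark report.

```python
# Figures copied from the ES45 vs. prototype EV7 comparison above.
# Each tuple is (ES45 value, prototype EV7 value).
local_read_gbs = (2.27, 4.58)       # local memory read bandwidth, GB/s
mpi_latency_us = (4.9, 1.7)         # first MPI latency row, microseconds
mpi_bandwidth_mbs = (792, 1080)     # first MPI bandwidth row, MB/s


def ratio(pair):
    """Return the EV7 value divided by the ES45 value."""
    es45, ev7 = pair
    return ev7 / es45


# Higher is better for bandwidth, lower is better for latency,
# so the latency ratio is inverted to express it as a speedup.
bandwidth_gain = ratio(local_read_gbs)
latency_gain = 1 / ratio(mpi_latency_us)
mpi_bandwidth_gain = ratio(mpi_bandwidth_mbs)

print(f"local read bandwidth: {bandwidth_gain:.2f}x")
print(f"MPI latency:          {latency_gain:.2f}x lower")
print(f"MPI bandwidth:        {mpi_bandwidth_gain:.2f}x")
```

The local bandwidth ratio comes out at almost exactly 2x, while the MPI latency improvement is closer to 3x.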

As if the EV7 weren’t bad enough for competitors in the high performance computing (HPC) market who wish Alpha would just quietly accept its death, HP has recently confirmed that it will keep Compaq’s commitment to bring the 0.13 µm shrink of the EV7, the EV79, to market in about a year. No doubt airtight contracts with various government labs inherited from Compaq figured prominently in this decision. The EV79 will incorporate a larger L2 cache than EV7 (probably 3 MB) as well as support faster (PC1066) memory.

Large scale EV7x systems will be relatively inexpensive to build (look ma, no chipset!) but expect HP to charge top dollar regardless. Since the acquisition of Compaq, HP’s official policy has been to direct all new customers to IA64 hardware and limit sales of Alpha systems to the remaining customer base. But with PA-RISC long toothless for technical computing and a number of vendors like SGI offering huge Itanium 2 based systems optimized for HPC, sheer pragmatism will likely force HP to sell EV7x gear to anyone with approved credit. Obscenely high prices will ensure the Alpha tail doesn’t wag the IA64 dog.

AMD: Borrowing Intel’s Formula for Success

Designing and selling generation after generation of backwardly compatible x86 processors has been a tried and true formula for success that has propelled Intel to clear supremacy in the semiconductor industry. Yet Intel seems so smitten by the siren call of VLIW/EPIC architecture that it has abandoned its legacy and instead pushed into the 64 bit world with IA64. AMD has quite rightly recognized that Intel’s decision has left in its wake AMD’s best, and perhaps only, opportunity to enter the market for general purpose 64 bit microprocessors - pick up where Intel left off.

To take advantage of this opportunity AMD has designed a backwardly compatible 64 bit extension to the x86 instruction set architecture called x86-64. In addition to support for 64 bit integer operations and 64 bit flat logical addressing, x86-64 doubles the number of general purpose and SSE registers to 16. Traditional x86 is so register poor that x86-64 may be the first and only 64 bit extension of a 32 bit ISA where compiling an application that otherwise fits into a 4 GB address space into 64 bit code can in theory produce a faster executable. Whether this actually occurs in practice will depend on how close x86-64 compilers come to existing x86 compilers in code quality. AMD also claims that 64 bit coding increases the average length of instructions from about 3.4 bytes to 3.8 bytes but dynamic instruction count falls by 10% [3].
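The two AMD figures pull in opposite directions, so it is worth working out their net effect on the number of instruction bytes the processor has to fetch. The Python sketch below is back-of-the-envelope arithmetic only: it treats the "about 3.4" and "3.8" byte averages and the 10% reduction as exact, and assumes the average lengths apply to the dynamic instruction stream, which AMD's claim as quoted here does not spell out.

```python
# AMD's quoted figures, treated as exact for the sake of illustration.
avg_len_32bit = 3.4      # average instruction length in 32 bit code, bytes
avg_len_64bit = 3.8      # average instruction length in 64 bit code, bytes
count_ratio = 1 - 0.10   # dynamic instruction count falls by 10%

# Relative number of instruction bytes fetched: (bytes/instr) * (instr count),
# normalized to the 32 bit build. Assumes the averages hold dynamically.
fetched_bytes_ratio = (avg_len_64bit * count_ratio) / avg_len_32bit

print(f"64 bit code fetches {fetched_bytes_ratio:.3f}x the instruction bytes")
```

Under these assumptions the longer encodings and the lower instruction count almost exactly cancel, leaving instruction fetch traffic within about 1% of the 32 bit case.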

The first implementation of x86-64 is a family of 0.13 µm SOI devices known as K8 or Hammer which will be introduced by AMD later this year. Hammer MPUs are also the first AMD processors to support SSE2, the x86 instruction set extension Intel introduced in the Pentium 4. SSE2 includes enhanced support for scalar and 2-way SIMD double precision FP operations and is an important step in deprecating the x87 programming model which has hindered x86 FP performance from its inception. AMD originally envisioned a RISC-like FP model for Hammer called TFP but wisely decided it was easier and safer to ride Intel’s slipstream instead. The programmer’s model of the x86-64 processor state is shown in Figure 2.


Figure 2. Programmer’s Model of x86-64 Processor State

In a decision reminiscent of DEC’s strategy for the Alpha EV7, AMD decided to focus much of its K8 development efforts on the architecture surrounding the processor [4]. To improve performance and reduce system level costs, important functions that have traditionally been implemented at the northbridge chipset level have been brought onto the Hammer processor device. These include the main memory controller and high speed interfaces for direct connection to southbridge style I/O bridges and linking processors together in multiprocessor systems. The downside of this strategy is that the extra circuitry brings with it higher power consumption and higher thermal load on the MPU package. In this case the extra power is partially offset by use of an SOI process which allows a given level of circuit performance to be achieved with somewhat reduced power consumption compared to bulk CMOS. The basic organization of the Opteron processor is shown in Figure 3.


Figure 3 Organization of the Opteron Processor

The Opteron, the high end version of the Hammer family for workstations and servers, includes three so-called HyperTransport high speed links and can be easily configured into systems with up to 8 CPUs. System level cache coherency is implemented using a broadcast style protocol. This scheme is simpler and less expensive to implement than the distributed directory system used by the EV7, but doesn’t scale nearly as well, a reflection of the class of system the two processors were designed to address. The Opteron directly supports a 128 bit wide DDR memory system. With support for memory speeds as high as PC2700, this gives a peak memory bandwidth of 5.3 GB/s although in practice it will often fall well short of this due to the inefficiency of using DDR to service short burst length transactions.
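The 5.3 GB/s figure follows directly from the interface width and the DDR transfer rate. A minimal Python sketch of the arithmetic is shown below; it assumes PC2700 means DDR333 (roughly 333 million transfers per second), and the function name is purely illustrative.

```python
def peak_bandwidth_gbs(bus_width_bits, transfers_per_sec):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9


# PC2700 is DDR333: a 166.67 MHz clock with data on both edges.
DDR333_TRANSFERS_PER_SEC = 333.33e6

opteron_peak = peak_bandwidth_gbs(128, DDR333_TRANSFERS_PER_SEC)
half_width_peak = peak_bandwidth_gbs(64, DDR333_TRANSFERS_PER_SEC)

print(f"128 bit interface: {opteron_peak:.2f} GB/s")
print(f" 64 bit interface: {half_width_peak:.2f} GB/s")
```

This is a theoretical ceiling only. As noted above, short burst transactions keep delivered bandwidth well below it.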

The desktop version of the Hammer family, known as Athlon 64, will be differentiated from the Opteron by a narrower 64 bit memory interface and fewer high speed links. Although the peak memory bandwidth of the Athlon 64 is half that of Opteron, the Athlon 64’s DDR memory system will work at a higher efficiency so the difference in effective bandwidth will likely be significantly less than 2x. AMD’s original plan for the Hammer family likely revolved around two different mask sets - the Athlon 64 with 256 KB of on-chip L2 cache and Opteron with 1 MB of L2 cache. This would give the Athlon 64 a comparatively modest die size of 104 mm² while maximizing differentiation with the workstation and server oriented Opteron [5].

But with Intel intending to introduce mainstream desktop chipsets supporting 128 bit wide DDR memory systems and higher speed system busses for its Pentium 4 desktop processor line this year AMD will probably be forced to release a version of Athlon 64 with 1 MB L2 for the top end of its desktop product line. Such a device will likely be a hybrid device with an Opteron die bonded out in the Athlon 64 package. The upside in doing this is that AMD can potentially drive up wafer volumes and drive down cost for Opteron and achieve greater flexibility and lower risk in planning Opteron wafer starts. The drawback is that this will increase manufacturing cost well beyond the 256 KB version that will already be more expensive than K7 Athlons due to the SOI processing.

In terms of performance the Opteron should prove to be a top performer in integer and commercial workloads. This will be in large part due to its efficient and well balanced processor core, relatively high clock rate (for a server class processor), and low latency memory system. For FP intensive applications Opteron will likely be competitive but nowhere near the leaders, Itanium 2 and Alpha. The biggest challenge Opteron will face isn’t the credibility of the silicon but rather the company behind it. AMD made very little headway in establishing Athlon within the corporate world, despite its high performance and low cost. This is a reflection of AMD’s low profile, uncertain future, and inability to sustain interest among the major computer OEMs. Unfortunately for AMD, as the price of a computer system rises from thousands for a PC to hundreds of thousands for server class hardware, the conservatism of corporate IT decision makers and the OEMs that sell to them rises in proportion. AMD’s best hope of establishing Opteron is in academic and government research type establishments which are often more open to buying new and unproven computing hardware. Another possibility is that low cost Opteron based systems could infiltrate the business world via departmental level purchases beyond the direct control of corporate IT policy makers. This bottom up approach was highly successful in the early spread of the Linux operating system.

Itanium: The Juggernaut Picks up Speed

Hardly a week goes by without a major computer OEM announcing a shiny new product line based on the Itanium 2: first HP, then NEC, followed by Unisys and SGI. IBM is close to announcing an IA64 based mid range system while Dell seems to be held back by delays in Intel’s 8870 chipset rather than a lack of intention. The Itanium 2 is an impressive product, with leading edge FP performance as well as respectable integer performance for a server class processor. Early indications are that it also does very well on commercial workloads. This is not surprising given the huge capacity, bandwidth, and support for parallelism within its three level on-chip cache hierarchy as well as the fact that Itanium 2’s non-FP functionality was optimized for commercial workloads rather than SPECint2k [6].

Although the selection of hardware and OEMs to choose from is impressive for a new instruction set architecture, the speed at which IA64 can penetrate the workstation and server markets is limited primarily by the availability of application software. Although the Itanium 2 offers compatibility for 32 bit x86 applications (in hardware, not by emulation, as is often erroneously reported) the performance penalty incurred is substantial enough to render the feature useless in terms of driving hardware sales ahead of the availability of native 64 bit software. Especially valuable would be a native version of the Windows operating system. Although this is in the works Microsoft has never been known for timely support of a new architecture. However, the rapidly growing presence of Linux across the breadth of the server market, and the serious effort by Intel and its hardware partners to support Linux on the full range of IA64 systems will be a strong impetus to Microsoft not to waste any time in joining the party.

The near term road map of the Itanium family seems quite clear. Like the 0.5 µm P6, the 0.35 µm EV6, and 0.18 µm Willamette, the Itanium 2’s McKinley core has multiple process shrinks ahead of it. The results of the first shrink, to 0.13 µm, are Madison and Deerfield which will be introduced later this year. Both are based on the same core and are primarily differentiated by the size of the L3 cache. Madison is the high end part and will double the Itanium 2’s 3 MB L3 cache size to a massive 6 MB and increase associativity from 12-way to 24-way. Despite the extra cache the Madison will be somewhat smaller than the Itanium 2, reportedly 374 mm² compared to 421 mm². In contrast, the Deerfield will keep the 3 MB L3 size of the Itanium 2 and ride the process shrink to a much smaller die size, ~266 mm² or about the size of the POWER4+.
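One detail worth drawing out of these numbers is that doubling both the capacity and the associativity of the L3 leaves the size of each cache way unchanged, which hints that Madison may simply add ways to the existing organization. The Python sketch below works through that arithmetic and the die size reductions. It is a rough illustration only: the idea that the two L3 caches are otherwise organized identically is an assumption, not something the disclosed figures confirm.

```python
KB = 1024
MB = 1024 * KB


def way_size_kb(capacity_bytes, ways):
    """Size of a single cache way in KB."""
    return capacity_bytes / ways / KB


def shrink_pct(new_mm2, old_mm2):
    """Percentage reduction in die area relative to the older device."""
    return (1 - new_mm2 / old_mm2) * 100


itanium2_way = way_size_kb(3 * MB, 12)   # Itanium 2 (McKinley) L3
madison_way = way_size_kb(6 * MB, 24)    # Madison L3

print(f"Itanium 2 L3 way size: {itanium2_way:.0f} KB")
print(f"Madison L3 way size:   {madison_way:.0f} KB")
print(f"Madison die area:   {shrink_pct(374, 421):.0f}% smaller than Itanium 2")
print(f"Deerfield die area: {shrink_pct(266, 421):.0f}% smaller than Itanium 2")
```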

The clock speed of the McKinley core in 0.13 µm has been disclosed as 1.5 GHz in the abstract of an associated paper in the preliminary program for ISSCC 2003. Given the difficulty with interconnect delay and signal integrity in the 0.18 µm aluminum Itanium 2 device, and the fact that the Pentium 4 core has already seen over 50% higher clock rates in the 0.18 to 0.13 µm shrink, one might expect the Madison would clock at a faster rate than 1.5 GHz in a 0.13 µm copper process. Intel is likely once bitten, twice shy over disclosing aggressive clock rate targets for unreleased IA64 products in technical papers. The abstract of the McKinley paper famously withdrawn from ISSCC 2001 described it as a 1.2 GHz device. Many of the papers presented at ISSCC 2002 described critical blocks of McKinley and indicated they were fully qualified up to 1.2 GHz. Seeing as there is no Itanium 2 faster than 1.0 GHz one could theorize that the top clock rate was cut by 200 MHz to hit a thermal design target that OEMs refused to budge on [7]. There is a strong possibility that Madison will be available at clock rates higher than the 1.5 GHz disclosed once the 0.13 µm device is fully characterized.

Whether Madison and Deerfield clock 50% or 60% faster than McKinley or more, one thing seems clear: performance as measured by SPEC CPU 2000 will scale upwards very closely with any clock frequency increase. Detailed processor CPI component breakdown and memory access latency contribution breakdown data for Itanium 2 was presented in [8]. Analysis of the data presented suggests performance scaling factors of 0.95 and 0.89 for SPECintbase2k and SPECfpbase2k respectively for Itanium 2. Increasing the size of the L3 cache to 6 MB, while keeping latency constant, is estimated to raise the SPEC scaling factors to about 0.96 and 0.92 respectively. Another obvious lever for increasing performance is the system interface. The Itanium 2 has a 128 bit wide data bus effectively clocked at 400 MHz. It uses similar signaling technology to the Pentium 4 front side bus which is targeted to hit 800 MHz later this year. Even a conservative increase to 533 MHz would increase the memory and I/O bandwidth available to the McKinley core by a third. Inside the processor core a number of minor changes could have been made during the port to 0.13 µm to enhance performance. One relatively simple change would be to increase the number of stacked general purpose registers from 96 to 128. Although the change would be transparent to software the data presented in [9] suggests this would cut the number of register stack engine spills and fills associated with procedure calls and returns by more than half on average.
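To put numbers on these scaling factors, the Python sketch below applies them to a 1.0 GHz to 1.5 GHz clock increase and also works out the bus bandwidth figures. It rests on one loudly stated assumption: a scaling factor s is read here as "a clock increase of X% yields a performance increase of s times X%". Other interpretations of the term are possible, so treat the output as illustrative rather than as a prediction.

```python
def scaled_performance(clock_ratio, scaling_factor):
    """Relative performance after a clock change (1.0 means unchanged).

    Assumes a linear reading of the scaling factor: the fractional
    performance gain is the factor times the fractional clock gain.
    """
    return 1.0 + scaling_factor * (clock_ratio - 1.0)


def bus_bandwidth_gbs(width_bits, effective_mhz):
    """Peak bus bandwidth in GB/s for a given width and effective clock."""
    return width_bits / 8 * effective_mhz * 1e6 / 1e9


clock_ratio = 1.5 / 1.0  # disclosed 1.5 GHz vs. today's 1.0 GHz Itanium 2

for label, factor in [
    ("SPECint, 3 MB L3", 0.95),
    ("SPECfp,  3 MB L3", 0.89),
    ("SPECint, 6 MB L3", 0.96),
    ("SPECfp,  6 MB L3", 0.92),
]:
    print(f"{label}: {scaled_performance(clock_ratio, factor):.3f}x")

current_bus = bus_bandwidth_gbs(128, 400)
faster_bus = bus_bandwidth_gbs(128, 533)
print(f"bus at 400 MHz: {current_bus:.1f} GB/s")
print(f"bus at 533 MHz: {faster_bus:.1f} GB/s ({faster_bus / current_bus:.2f}x)")
```

Note that the 6 MB rows reflect only the improved scaling with clock rate; any direct benefit from the larger cache at a fixed clock is not modeled.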

Beyond Madison and Deerfield, further shrinks of the McKinley core are planned. The next process node is 90 nm. One rumor that keeps recurring is that Intel will incorporate two McKinley cores in a high end device in the 90 nm generation. This 2 way chip level multi processor (CMP) is similar in concept to the existing IBM POWER4 design and HP’s future PA-8800. The ultimate in down-the-road speculation for the Itanium family is what the former EV8 design team is cooking up in Intel’s Shrewsbury design center. Sources within Intel indicate excitement about the new design concepts incorporated in this brand new IA64 processor core which will likely target 65 nm. The possibilities include out-of-order execution, hardware multithreading, and vector style computing extensions like the "Tarantula" extension to EV8. Whatever path was chosen, vendors of 64 bit MPUs that compete with IA64 are faced with the scary prospect that the former Alpha design team, which created the highest performance processors at virtually every process technology node it tackled, is largely intact and hard at work with Intel’s process development and manufacturing capabilities at its disposal.

SGI MIPS and PA-RISC: Running out the Clock

Although shrinking in size in recent years, the customer bases of MIPS and PA-RISC, the house brand RISC architectures of SGI and HP respectively, are still economically important to these companies. That is why they continue to develop and introduce new RISC products even though both companies are committed to migrating their users to IA64. Imposing major architectural change on customers is a dangerous activity for any computer vendor. Such changes are expensive and disruptive for customers and are natural opportunities for customers to change vendors, especially if the reason for the change is unconvincing. The safest tactic is to offer an overlap period extending for at least several years during which the vendor offers products based on both the old and the new architecture.

The problem with a dual track approach is that designing a competitive new high end processor is quite expensive and time consuming. The solution both HP and SGI employ comes straight from the environmental movement - reduce, re-use, and recycle. That is, they recycle their existing CPU core design by porting it to a more advanced process to reduce the feature size, and re-use system infrastructure (buses, chipsets etc). Refreshing an old design with a process shrink provides higher clock rate and larger on-chip cache even while shrinking die size, power, and cost. While staying with a tried and true design saves the time and resources needed to create and verify a new microarchitecture it is still a costly and time consuming exercise to bring out a new MPU. Sections of circuitry often need to be redesigned for performance or sometimes just to function reliably and the physical design and pre and post silicon verification tasks are essentially the same.

In the case of SGI, it has been recycling the MIPS R10000 (R10k) quad issue, out-of-order execution processor core for half a decade. Although the processor gets renamed with each shrink (R12k, R14k, and R16k) the microarchitecture is largely unchanged from the original design. Although SGI has hinted that the future R18k device on its road map will represent a significant improvement, reports indicate it will still heavily leverage the R10k. The primary differences are that the FP hardware will be changed from one multiply and one add pipeline to two multiply-add pipelines (thus doubling the chip’s peak GFLOP rating) and the system bus will be improved [10]. Given the huge performance gulf between its MIPS-based products and its new Itanium 2 based product line it is hard to imagine SGI seriously contemplating MIPS development beyond the R18k. But SGI's questionable plans for MIPS must be distinguished from an independent and highly successful market for 32 and 64 bit MIPS architecture processors in embedded control type applications, a topic out of the scope of this article.

In the case of PA-RISC, it is even more obvious that HP is simply running out the clock. The company hasn’t introduced a new CPU core since the PA-8000 which is about as old as the MIPS R10k. Instead, HP simply shrinks the PA-8000 core and fills up the available space with increasingly large L1 instruction and data cache as feature size shrinks. The next PA-RISC on HP’s road map is slightly more interesting, the PA-8800 "Mako". It consists of two PA-8000 style cores on one device. This decision forced HP to cut the data cache capacity per processor in half compared to the PA-8700 despite the smaller feature size (0.13 µm vs 0.18 µm). To help make up for the smaller data cache and the growing gap between processor and memory performance, the Mako will be packaged in an MCM with 32 MB of DRAM-based L2 cache [11]. Despite its high capacity, the very high latency (40 CPU cycles) of the L2 in combination with a modest processor clock frequency target (1 GHz) for PA-8800 will severely limit improvement in uniprocessor performance over the PA-8700. The purpose of PA-8800 seems to be to allow HP to double the number of CPUs in its Superdome large scale server without a major physical redesign.

POWER4: Big Blue CPU Gets a "Mini-me" Sidekick

The conservative, highly automated design approach IBM chose for its POWER4 high end server processor has paid off twice now. The POWER4 product line was rolled out on schedule more than a year ago and now Big Blue is the first 64 bit MPU vendor to deliver a 0.13 µm based product, POWER4+. The down side of IBM’s microprocessor design methodology is that compared to a more traditional full custom approach, some of the potential performance of the process is left on the table in a trade-off for faster and cheaper development. For example, the top clock rate of POWER4 is only 50 MHz higher than the Alpha EV68 despite having a basic execution pipeline roughly twice as long and intrinsically faster logic from its SOI process.

The POWER4 packs two CPUs along with 1.4 MB of L2 cache (1.5 MB in POWER4+) as well as L3 cache controller, memory controller, and interprocessor communications functionality onto a 415 mm² die. To accomplish this something had to give and that something was processor efficiency. Unlike other out-of-order RISC processors like PA-8x00, EV6x, and MIPS R1xk, the POWER4 doesn’t issue and track individual instructions. Instead it collects together up to 5 instructions at a time and then issues, tracks, and retires them as a group, kind of like a VLIW instruction. The processor only preserves machine state at group boundaries so an exception causes the machine to be backed up to the oldest group prior to the exception [12]. This is less cumbersome and complex than tracking individual instructions but comes at a significant cost. This cost can be divided into three main components: group formation overhead, grouping restrictions, and lost opportunities for parallelism.

The group formation overhead is quite easily observed. The POWER4 uses 6 pipeline stages to decode instructions, crack them (break down complex PowerPC instructions like load with update into separate, simpler primitive load and add instructions), and form them into groups. In comparison, the EV6x performs decoding in a single pipe stage. The extra pipe stages increase the branch misprediction penalty, and reduce performance. A POWER4 CPU can issue and/or retire one group per clock cycle. A group may have up to 5 instructions, but in practice restrictions on the way a group can be assembled mean that groups will on average carry fewer than 5 instructions. For example, the fifth slot in a group is reserved for branches. If a sequence of instructions is being packed into a group and the third instruction is a branch then the group is padded with NOPs in the third and fourth slot and the branch is inserted into the fifth. If CPU primitive instructions from a cracked native instruction can’t all fit in a group then the group is padded out with NOPs and a new group is started. If an instruction sets a non-renamed architectural register then the group is padded out with NOPs. Instructions that cannot be executed speculatively are executed serially as single instruction groups. Before a group can be dispatched all the resources to support all the instructions in that group must be available. This is more restrictive than in other out-of-order execution RISC processors in which dependencies are tracked instruction by instruction.
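The padding rules are easier to appreciate with a toy model. The Python sketch below is a deliberately simplified illustration of group formation, not a description of IBM's actual hardware. It models only two of the rules described above: the fifth slot being reserved for a branch, and cracked instructions having to fit entirely within one group. The non-renamed register rule and serialized single instruction groups are left out, and the `Instr` class and its fields are hypothetical names invented for this example.

```python
from dataclasses import dataclass

GROUP_SLOTS = 5                       # slots per dispatch group
NON_BRANCH_SLOTS = GROUP_SLOTS - 1    # the fifth slot is reserved for a branch
NOP = "nop"


@dataclass
class Instr:
    """Hypothetical instruction record used only by this sketch."""
    name: str
    is_branch: bool = False
    primitives: int = 1   # number of internal ops after cracking


def form_groups(instrs):
    """Pack instructions into 5 slot groups using the simplified rules."""
    groups = []
    current = []

    def close(branch=None):
        # Pad the four non-branch slots with NOPs, then fill the branch slot.
        nonlocal current
        padded = current + [NOP] * (NON_BRANCH_SLOTS - len(current))
        padded.append(branch if branch is not None else NOP)
        groups.append(padded)
        current = []

    for ins in instrs:
        if ins.is_branch:
            close(branch=ins.name)    # a branch always ends the group
            continue
        if ins.primitives > NON_BRANCH_SLOTS:
            raise ValueError(f"{ins.name} has too many primitives for this sketch")
        if len(current) + ins.primitives > NON_BRANCH_SLOTS:
            close()                   # cracked ops must not straddle groups
        if ins.primitives == 1:
            current.append(ins.name)
        else:
            current.extend(f"{ins.name}.{i}" for i in range(ins.primitives))
    if current:
        close()
    return groups


def slot_utilization(groups):
    """Fraction of group slots holding real work rather than NOPs."""
    if not groups:
        return 0.0
    used = sum(op != NOP for group in groups for op in group)
    return used / (GROUP_SLOTS * len(groups))


# The example from the text: the third instruction is a branch.
groups = form_groups([Instr("add"), Instr("ld"), Instr("beq", is_branch=True)])
print(groups)                    # [['add', 'ld', 'nop', 'nop', 'beq']]
print(slot_utilization(groups))  # 0.6
```

Even in this toy model a short basic block leaves two of the five slots empty, which gives a feel for why real groups average fewer than 5 instructions.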

To get an indication of how much performance the POWER4 loses due to its distended pipeline and restrictive instruction dispatch and tracking system its integer performance is compared to the Alpha EV68 in Table 2. Despite out-fetching, out-dispatching, and out-issuing the EV68; despite having a 12:1 advantage in low latency on-chip cache, a 2.5:1 advantage in the size of its out-of-order execution window, and a 4% clock rate advantage, the POWER4 can’t outperform it on SPECint2k, base or peak.

 

Table 2 POWER4 Versus Alpha EV68

                                    POWER4              Alpha EV68
Fetch width (instructions)          8                   4
Dispatch width (instructions)       5                   4
non-FP Issue width (instructions)   6                   4
L1 cache (I/D)                      64 KB / 32 KB       64 KB / 64 KB
L2 cache                            1.4 MB              16 MB (off-chip)
L3 cache                            32 MB (off-chip)    -
Max non-FP instructions in flight   100                 40
Clock Frequency (MHz)               1300                1250
SPECint2k (base/peak)               804 / 839           845 / 928
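The efficiency gap shows up clearly once the SPECint2k scores are normalized by clock rate. The Python sketch below does this with the figures from the comparison above. SPECint per MHz is not an official SPEC metric and is used here only as a rough proxy for work done per clock; the dictionary keys are illustrative names.

```python
# Figures taken from the POWER4 vs. Alpha EV68 comparison above.
power4 = {"mhz": 1300, "base": 804, "peak": 839, "in_flight": 100}
ev68 = {"mhz": 1250, "base": 845, "peak": 928, "in_flight": 40}

clock_advantage = power4["mhz"] / ev68["mhz"]
window_advantage = power4["in_flight"] / ev68["in_flight"]

# Rough per-clock efficiency proxy: SPECint2k base score per MHz.
power4_per_mhz = power4["base"] / power4["mhz"]
ev68_per_mhz = ev68["base"] / ev68["mhz"]
ev68_per_clock_lead = ev68_per_mhz / power4_per_mhz

print(f"POWER4 clock advantage:       {clock_advantage:.2f}x")
print(f"POWER4 window advantage:      {window_advantage:.1f}x")
print(f"POWER4 SPECint2k base / MHz:  {power4_per_mhz:.3f}")
print(f"EV68 SPECint2k base / MHz:    {ev68_per_mhz:.3f}")
print(f"EV68 per-clock lead (base):   {ev68_per_clock_lead:.2f}x")
```

By this crude measure the EV68 gets roughly 9% more integer work done per clock than the POWER4 despite its much smaller instruction window.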

Despite its architectural inefficiencies, the POWER4 is one of the most competitive MPUs to come out of IBM labs in a long time thanks to the use of 2-way CMP and mainframe class high bandwidth data paths and system packaging technology. If IBM can continue to scale up clock frequency with each process shrink as fast as its full custom designed competitors it should be in good competitive shape for its intended market.

However the same cannot be said for all users of the PowerPC instruction set. The G4 and G4+ processors Apple currently uses in its Macintosh line of desktop and laptop computers are hopelessly out-muscled by the latest x86 processors from Intel and AMD. Worse yet, the growing use of SSE in multimedia and content creation software has put a slow leak in Apple’s competitive life preserver, the Altivec SIMD PowerPC instruction set extension. By some strange coincidence IBM has announced it is developing the PowerPC 970, a desktop class processor based on the POWER4 microarchitecture and extended with Altivec support. The relative die size and basic floorplans of the POWER4, POWER4+, and PowerPC 970 are shown in Figure 4.


Figure 4 Relative Size of POWER4, POWER4+, and PowerPC 970

The PowerPC 970 is a 0.13 µm SOI single CPU device with 512 KB of on-chip L2 cache. IBM estimates its performance at 937 SPECint2k and 1051 SPECfp2k at 1.8 GHz [13]. This performance level is a bit shy of the fastest desktop Pentium 4 processors shipping today but is nevertheless quite remarkable when you consider it would be achieved by a 118 mm² device with 42 W typical power dissipation. Given the way IBM designs its MPUs the 970 is remarkably compact and power efficient compared to heavily engineered products like the Pentium 4 Northwood. Semi-custom MPU design methodologies have obviously come a long way since Sun and TI rolled out the bloated, power hungry, and slow MicroSPARC a decade ago [14]. Even more intriguing for Apple is that the 970’s typical power consumption drops to 19 W at 1.2 GHz, which makes it a natural competitor to Intel’s Banias processor for high end mobile applications and very small form factor and/or silent desktop PCs. Given the reduced design margin and greater market emphasis on clock frequency for desktop processors, it is also conceivable that IBM could turn out limited numbers of 970 MPUs that clock at 2 GHz or higher for high end desktop Macs, an important psychological milestone in Apple’s struggle for survival in an increasingly x86 dominated PC world.
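The two power figures quoted for the 970 invite a quick sanity check. The Python sketch below assumes, as a first-order approximation that is not stated in the source, that typical power is dominated by dynamic power and scales as P ∝ V²·f; leakage and any fixed overheads are ignored. Under that assumption the drop from 42 W at 1.8 GHz to 19 W at 1.2 GHz cannot come from clock scaling alone and implies a lower supply voltage at the slower speed.

```python
import math

# Typical power and clock figures quoted in the text for the PowerPC 970.
p_high_w, f_high_ghz = 42.0, 1.8
p_low_w, f_low_ghz = 19.0, 1.2

power_ratio = p_low_w / p_high_w      # fraction of power remaining at 1.2 GHz
freq_ratio = f_low_ghz / f_high_ghz   # fraction of clock remaining at 1.2 GHz

# Assumed model: P ~ V^2 * f, so (V_low / V_high)^2 = power_ratio / freq_ratio.
voltage_ratio = math.sqrt(power_ratio / freq_ratio)

print(f"power ratio:           {power_ratio:.2f}")
print(f"frequency ratio:       {freq_ratio:.2f}")
print(f"implied voltage ratio: {voltage_ratio:.2f}")
```

If the model holds, the 1.2 GHz operating point would run at something like 80 to 85 percent of the 1.8 GHz supply voltage. This is an inference from the assumed scaling law, not a figure IBM has published.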

Sun: Too Little but is it Also Too Late?

Sun Microsystems has been in tight spots before but never like this. Its large SPARC based servers were once touted as engines powering a digital new age, an internet based economy. This notion was captured in its marketing slogan about Sun "putting the dot in dot com". Bruised and battered by the current economic uncertainty and hurt by a product line that grows increasingly stale by the day, some have cruelly suggested Sun’s new slogan should be "putting the bank in bankruptcy".

While its current situation is not anywhere near that dire, it is obvious that the substantial MPU development effort Sun engages in is not delivering the goods as it should. The company stumbled badly with the UltraSPARC-III (US-III), shelving the initial 0.25 µm version altogether. Yet it still has to rely on this microarchitecture for years to come. The next major refresh, the UltraSPARC-IV, is basically a 0.13 µm shrink of the 0.18 µm US-III core. Serious relief won’t come until 2005 in the form of the UltraSPARC-V, a completely new design that reportedly incorporates dynamic scheduling like the Fujitsu SPARC64 V. But the track record of MPU design teams tackling out-of-order execution for the first time suggests the 2005 delivery time scale should be considered a best case scenario. If the rumor that US-V also incorporates SMT is true then a 2005 delivery is even more unlikely.

Over the next year Sun will likely introduce products based on the US-IV and the US-IIIi. Like the US-IV, the US-IIIi is based on a 0.13 µm implementation of the US-III CPU core. The two processors are differentiated primarily by the logic that surrounds the CPU core, which in turn is shaped by the intended role. The US-IIIi will replace the hopelessly outdated US-IIi in Sun’s workstation line and also be used in 1 to 4 CPU entry level servers. The US-IIIi includes a 1.0 MB on-chip L2 cache, an integrated 128 bit wide DDR memory interface and controller, and SMP system bus interface. The big machine oriented US-IV will likely include a large L2 cache and higher bandwidth system and memory interfaces than the US-IIIi. A more expensive, higher I/O count package will likely be used for the faster clocked US-IV to accommodate wider data paths, a greater number of power and ground connections, and higher heat dissipation. There are reports that the US-IV is a two way CMP device like the POWER4. If true then it would represent a major change for Sun which has traditionally emphasized manufacturability and low cost. The US-III core is relatively large (nearly twice as big as EV68 in 0.18 µm) so a dual core US-IV would be quite substantial even in 0.13 µm.

The immediate future aside, Sun is faced with a long term MPU credibility problem. The admission by computer giants like HP and Compaq that they couldn’t justify or sustain the increasingly expensive effort of designing house brand processors and building systems around them increases the pressure on Sun management to justify further investment in SPARC. The more SPARC performance lags behind openly available merchant processor families like x86, x86-64, and IA64, the greater this pressure becomes. The more that revenue from SPARC-based hardware falls the greater this pressure becomes. Sun’s recent move to develop a Linux system product line based on Intel x86 processors is a sure sign that its once highly successful strategy of "putting all its wood behind one arrow" is showing the first signs of giving way to a more pragmatic and realistic approach.

Looking Forward - The 64 bit Landscape A Year From Now

Over the next twelve months, powerful new 0.13 µm 64 bit MPUs will ship, or approach shipping. These include Intel’s Madison, AMD’s Opteron, IBM’s PowerPC 970, HP’s Alpha EV79 and PA-8800, and Sun’s US-IV. In addition, existing processor designs like POWER4+ will likely see incremental clock frequency and performance increases. The characteristics and performance of these processors are provided below in Table 3. Performance figures and undisclosed processor characteristics are based on estimates and/or extrapolations from public data. The performance projections for IA64 assume no boost from improvements in EPIC compiler technology over the next 12 months. The performance projection for Opteron assumes native x86-64 compiler technology will not outperform current IA32 compiler technology over the next 12 months. The die area estimate for US-IV assumes it is a single CPU device.

 

                         Alpha      IA64       IA64        MIPS       PA-RISC
                         EV79       Madison    Deerfield   R18K       PA-8800
Clock (MHz)              1600       1600       1500        800        1000
Max Power (W)            120        130        100         50         90
Die Size (mm²)           300        374        266         130        366
Transistors (millions)   200        410        230         80         300
L1 cache (KB)            64 / 64    16 / 16    16 / 16     32 / 32    768 / 768
L2 cache (MB)            3.0        0.25       0.25        1.0        32
L3 cache (MB)            -          6.0        3.0         16         -
SPECint_base2k           1100       1250       1150        600        850
SPECfp_base2k            1600       2150       2000        750        900

 

                         PowerPC    PowerPC    SPARC       SPARC      x86-64
                         970        POWER4+    US-IIIi     US-IV      Opteron
Clock (MHz)              1800       1800       1200        1500       2400
Max Power (W)            60         120        60          90         80
Die Size (mm²)           118        267        179         240*       160
Transistors (millions)   52         184        88          160        100
L1 cache (KB)            64 / 32    64 / 32    32 / 64     32 / 64    64 / 64
L2 cache (MB)            0.5        1.5        1.0         2.0        1.0
L3 cache (MB)            -          128        -           16         -
SPECint_base2k           900        1100       650         800        1400
SPECfp_base2k            1000       1500       800         1000       1350

The performance projections in Table 3 suggest the current division of competing 64 bit MPUs into distinct performance "have" and "have not" camps will continue. The top integer performers as measured by SPECint2k will likely be Opteron and Madison. The top floating point performers as measured by SPECfp2k will likely be Madison and EV79.
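The rankings above can be reproduced from the Table 3 projections if one takes the best projected score per architecture. The Python sketch below is only an illustration: the tuple layout and function names are assumptions made here, and the scores are the estimates from Table 3, not measured results.

```python
# (architecture, processor, SPECint_base2k, SPECfp_base2k) from Table 3.
TABLE3 = [
    ("Alpha",   "EV79",      1100, 1600),
    ("IA64",    "Madison",   1250, 2150),
    ("IA64",    "Deerfield", 1150, 2000),
    ("MIPS",    "R18K",       600,  750),
    ("PA-RISC", "PA-8800",    850,  900),
    ("PowerPC", "970",        900, 1000),
    ("PowerPC", "POWER4+",   1100, 1500),
    ("SPARC",   "US-IIIi",    650,  800),
    ("SPARC",   "US-IV",      800, 1000),
    ("x86-64",  "Opteron",   1400, 1350),
]


def best_per_architecture(metric):
    """Map each architecture to its highest scoring (processor, score)."""
    best = {}
    for arch, chip, specint, specfp in TABLE3:
        score = specint if metric == "int" else specfp
        if arch not in best or score > best[arch][1]:
            best[arch] = (chip, score)
    return best


def top_processors(metric, n=2):
    """Return the n leading processors, one candidate per architecture."""
    ranked = sorted(best_per_architecture(metric).values(),
                    key=lambda chip_score: chip_score[1], reverse=True)
    return [chip for chip, _ in ranked[:n]]


print(top_processors("int"))  # ['Opteron', 'Madison']
print(top_processors("fp"))   # ['Madison', 'EV79']
```

Grouping by architecture matters for the floating point result: ranked by individual processor, Deerfield's projected 2000 SPECfp_base2k would place it second, ahead of the EV79.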

Summary and Conclusion

From a state of confusion and turmoil a new reality in the 64 bit MPU market is starting to emerge, a three way race for technical leadership between AMD’s x86-64, IBM’s PowerPC, and Intel’s IA64 architectures. After years of delay and missteps, IA64 is here and real. Falling well short of the architectural revolution in computer design promised by early EPIC enthusiasts, IA64 is nevertheless a powerful presence due to a very aggressive implementation and broad OEM support. The two factors that seem to stand in the way of IA64 quickly grabbing significant market share are the inevitable shortage of production ready applications in the infancy of any new architecture and the low rates of capital investment in the current economic climate. Intel is doing all it can to alleviate the first stumbling block but the second is beyond even its purview.

AMD is following a very clever strategy by making Opteron and Athlon 64 compelling products on the basis of their performance on the existing 32 bit x86 code base. This avoids a head on confrontation with Intel for the attention of ISVs in developing 64 bit applications, a battle it cannot win. Instead, by piggybacking x86-64 sales on top of the huge existing market for x86 hardware, AMD hopes to attract ISV interest by establishing a sizable installed base of 64 bit ready hardware. It is aided in this effort, especially with regard to "mom and pop" and open source software operations, by targeting system price points and form factors that Intel and its partners cannot yet meet.

IBM is in the enviable position of being able to place a bet on every contender in the 64 bit horse race. It has an entry of its own, PowerPC, as well as a long standing public commitment to IA64 soon to be realized in a line of mid range servers. It is rumored that it will also be the first major OEM to support x86-64. These rumors are fueled largely by speculation based on IBM’s early port of DB2 to x86-64, and its recently announced joint process technology development agreement with AMD. However, Big Blue is not the monolith of old, and its separate divisions are independent business units responsible for their own bottom lines. Even so, IBM’s existing architectural diversity suggests it doesn’t value a unified platform strategy nearly as much as other OEMs do and would strongly consider testing market interest in x86-64 with re-badged entry level Opteron gear sourced from Newisys.

Unlike IBM, Sun is not in a position to take a wait and see approach. It needs to quickly take some hard decisions to turn its hardware business around. Solaris on SPARC is growing less competitive in cost and performance every day and the only remaining question is whether Sun will port Solaris to another architecture, or try to sustain it on SPARC as a legacy product line while growing a non-SPARC product line based on Linux. One factor that could affect Sun’s choice of an alternative architecture is the cooperation between IBM and AMD for process development, especially if IBM is forced to take an equity position in AMD to prop it up. If it appears that IBM has co-opted x86-64 it could drive uncommitted OEMs like Sun into Intel’s arms. On the other hand, Sun might regard Intel’s close relationship with HP in IA64 MPU development with suspicion, a factor that also seems to be giving Dell some pause. Another factor that might influence Sun’s decision is the similarity between IA64 and SPARC. Common architectural characteristics, like overlapping stacked register windows, three address instructions, and large register sets, might make it easier for Sun to port Solaris and its associated compilers to IA64 than to x86-64.

References

[1] Kerstetter, J. and Greene, J., "Will Sun Rise Again?", Business Week, November 25, 2002, pp. 120-130.

[2] Kerbyson, D. et al., "Performance Evaluation of an EV7 AlphaServer Machine", Los Alamos National Labs report LA-UR-02-4850, October 2002.

[3] McGrath, K., and Christie, D., "The AMD x86-64 Architecture", Hot Chips 14, August 2002.

[4] Keltcher, C., "The AMD Hammer Processor Core", Hot Chips 14, August 2002.

[5] Krewell, K., "AMD Takes Hammer to Itanium", Microprocessor Report, Vol. 15, No. 11, November 2001, p. 1.

[6] Naffziger, S. and Hammond, G., "The Implementation of the Next Generation 64b Itanium Microprocessor", Digest of Technical Papers, ISSCC 2002, pp. 344-345.

[7] Ascierto, J. and Clendenin, M., "Itanium in hot seat as power issues boil over", Electronic Engineering Times, August 13, 2001.

[8] McCormick, J. and Knies, A., "A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel Itanium 2 Processor", Hot Chips 14, August 2002.

[9] Rakvic, R., et al., "Performance Advantage of the Register Stack in Intel Itanium Processors", Workshop Proceedings, EPIC-2, November 18, 2002.

[10] Fu, T., et al., "R18000 The Latest SGI Superscalar Microprocessor", Hot Chips 13, August 2001.

[11] Johnson, D., "HP’s Mako Processor", Technical Proceedings, Microprocessor Forum 2001, October 16, 2001.

[12] Tendler, J., et al., "POWER4 System Microarchitecture", IBM Journal of Research and Development, Vol. 46, No. 1, January 2002.

[13] Sandon, P., "PowerPC 970: First in a New Family of 64-bit High Performance PowerPC Processors", Technical Proceedings, Microprocessor Forum 2002, October 2002.

[14] Case, B., "SPARC Hits Low End with TI’s microSPARC", Microprocessor Report, Vol. 6, No. 14, October 12, 1992, p. 11.

 

  Copyright © 1996-2001, Real World Technologies - All Rights Reserved