Nehalem arrives with a splash
by Scott Wasson — 3:38 AM on November 3, 2008
Those of us who are conversant with technology are more or less conditioned to accept and even expect change as a natural part of the course of things. New gadgets and gizmos debut regularly, each one offering some set of advantages or refinements over the prior generation. As a result, well, you folks are a rather difficult lot to impress, frankly speaking. But today is a day when one should sit up and take notice. I've been reviewing processors for nearly ten years now, and the Core i7 processors we're examining here represent one of the most consequential shifts in the industry during that entire span.
Intel, as you know, has been leading its smaller rival AMD in the performance sweeps for some time now, with a virtually unbroken lead since the debut of the first Core 2 processors more than two years ago. Even so, AMD has retained a theoretical (and sometimes practical) advantage in terms of basic system architecture throughout that time, thanks to the changes it introduced with its original K8 (Athlon 64 and Opteron) processors five years back. Those changes included the integration of the memory controller onto the CPU die, the elimination of the front-side bus, and its replacement with a fast, narrow chip-to-chip interconnect known as HyperTransport. This system architecture has served AMD quite well, particularly in multi-socket servers, where the Opteron became a formidable player in very short order and has retained a foothold even with AMD's recent struggles.
Now, Intel aims to rob AMD of that advantage by introducing a new system architecture of its own, one that mirrors AMD's in key respects but is intended to be newer, faster, and better. At the heart of this project is a new microprocessor, code-named Nehalem during its development and now officially christened as the Core i7.
Yeah, I dunno about the name, either. Let's just roll with it.
The Core i7 design is based on current Core 2 processors but has been widely revised, from its front end to its memory and I/O interfaces and nearly everywhere in between. The Core i7 integrates four cores into a single chip, brings the memory controller onboard, and introduces a low-latency point-to-point interconnect called QuickPath to replace the front-side bus. Intel has modified the chip to take advantage of this new system infrastructure, tweaking it throughout to accommodate the increased flow of data and instructions through its four cores. The memory subsystem and cache hierarchy have been redesigned, and simultaneous multithreading—better known by its marketing name, Hyper-Threading—makes its return, as well. The end result blurs the line between an evolutionary new product and a revolutionary one, with vastly more bandwidth and performance potential than we've ever seen in a single CPU socket.
How well does the Core i7 deliver on that potential? Let's find out.
An overview of the Core i7
The Core i7 modifies the landscape quite a bit, but much of what you need to know about it is apparent in the picture of the processor die below, with the major components labeled.
What you're seeing, incidentally, is a pretty good-sized chip—an estimated 731 million transistors arranged into a 263 mm² area via the same 45nm, high-k fabrication process used to produce "Penryn" Core 2 chips. Penryn has roughly 410 million transistors and a die area of 107 mm², but of course, it takes two Penryn dies to make one quad-core product. Meanwhile, AMD's native quad-core Phenom chips have 463 million transistors but occupy a larger die area of 283 mm² because they're made on a 65nm process and have a higher ratio of (less dense) logic to (denser) cache transistors. Then again, size is to some degree relative; the GeForce GTX 280 GPU is over twice the size of a Core i7 or Phenom.
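To put those die sizes in perspective, here is a rough sketch that turns the transistor counts and areas quoted above into densities. The figures come straight from the paragraph; the dictionary keys and the function name are just illustrative choices.

```python
# Transistor counts and die areas as quoted in the text above.
chips = {
    "Core i7 (45nm)": (731e6, 263.0),
    "Penryn, one die (45nm)": (410e6, 107.0),
    "Phenom (65nm)": (463e6, 283.0),
}

def density_m_per_mm2(transistors, area_mm2):
    """Return millions of transistors per square millimeter."""
    return transistors / area_mm2 / 1e6

densities = {name: density_m_per_mm2(t, a) for name, (t, a) in chips.items()}

for name, d in densities.items():
    print(f"{name}: {d:.2f} M transistors/mm^2")

# It takes two Penryn dies to make one quad-core Core 2 product.
penryn_quad_transistors = 2 * 410e6
penryn_quad_area_mm2 = 2 * 107.0
print(f"Quad-core Penryn package: {penryn_quad_transistors / 1e6:.0f} M "
      f"transistors over {penryn_quad_area_mm2:.0f} mm^2 of silicon")
```

The Phenom's lower density lines up with the explanation above: a 65nm process and proportionally more logic than cache.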
Nehalem's four cores are readily apparent across the center of the chip in the image above, as are the other components (Intel calls these, collectively, the "uncore") around the periphery. The uncore occupies a substantial portion of the die area, most of which goes to the large, shared L3 cache.
This L3 cache is the last level of a fundamentally reworked cache hierarchy. Although not clearly marked in the image above, inside of each core is a 32 kB L1 instruction cache, a 32 kB L1 data cache (it's 8-way set associative), and a dedicated 256 kB L2 cache (also 8-way set associative). Outside of the cores is the L3, which is much larger at 8 MB and smarter (16-way associative) than the L2s. This basic arrangement may be familiar from AMD's native quad-core Phenom processors, and as with the Phenom, the Core i7's L3 cache serves as the primary means of passing data between its four cores. The Core i7's cache setup differs from the Phenom's in key respects, though, including the fact that it's inclusive—that is, it replicates the contents of the higher level caches—and runs at higher clock frequencies. As a result of these and other design differences, including a revamped TLB hierarchy, the Core i7's cache latencies are much lower than the Phenom's, even though its L3 cache is four times the size.
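As a back-of-the-envelope illustration of what those associativity figures mean, the sketch below works out how many sets each cache would have and how much of the inclusive L3 could be taken up by copies of the per-core caches. The 64-byte line size is an assumption on our part, since the text above doesn't state it, and the L1 instruction cache is left out of the set calculation because its associativity isn't given here.

```python
KB = 1024
MB = 1024 * KB
LINE_SIZE = 64  # bytes per cache line; assumed, not stated in the article

def num_sets(size_bytes, ways, line_size=LINE_SIZE):
    """Sets in a set-associative cache: capacity / (ways * line size)."""
    return size_bytes // (ways * line_size)

l1d_sets = num_sets(32 * KB, 8)    # 32 kB L1 data cache, 8-way
l2_sets = num_sets(256 * KB, 8)    # 256 kB L2 cache, 8-way
l3_sets = num_sets(8 * MB, 16)     # 8 MB shared L3 cache, 16-way

print(f"L1D sets: {l1d_sets}, L2 sets: {l2_sets}, L3 sets: {l3_sets}")

# An inclusive L3 replicates the contents of each core's L1 and L2 caches.
per_core_private = (32 + 32 + 256) * KB   # L1I + L1D + L2
inclusive_fraction = 4 * per_core_private / (8 * MB)
print(f"Up to {inclusive_fraction:.1%} of the L3 may duplicate per-core caches")
```

Under that assumption, at most 1.25 MB of the 8 MB L3 is spent on duplicates, which is the price of letting the L3 answer for what the cores hold.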
One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that's likely to be requested soon. Intel claims the Core i7's prefetching algorithm is both more efficient than Penryn's—some server admins wound up disabling hardware prefetch in Xeons because it harmed performance with certain workloads, a measure Intel says should no longer be needed—and more aggressive, as well.
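For readers who want a feel for the general idea, here is a toy stride prefetcher. To be clear, this is not Intel's algorithm, whose details aren't described here; it is a minimal sketch of the textbook technique, with an unbounded cache and made-up parameters (`line_size`, `degree`) chosen purely for illustration.

```python
def simulate(addresses, prefetch=True, line_size=64, degree=2):
    """Count cache hits over an access trace, with an optional stride prefetcher.

    The cache is modeled as an unbounded set of line numbers, so evictions
    and cache pollution are not modeled. This is a toy, not real hardware.
    """
    cached = set()
    hits = 0
    last_line = None
    last_stride = None
    for addr in addresses:
        line = addr // line_size
        if line in cached:
            hits += 1
        cached.add(line)
        if prefetch and last_line is not None:
            stride = line - last_line
            # Same nonzero stride twice in a row: assume the pattern
            # continues and speculatively fetch the next few lines.
            if stride != 0 and stride == last_stride:
                for k in range(1, degree + 1):
                    cached.add(line + k * stride)
            last_stride = stride
        last_line = line
    return hits

# A sequential walk over 16 cache lines, touching each line once.
trace = list(range(0, 16 * 64, 64))
hits_without = simulate(trace, prefetch=False)
hits_with = simulate(trace, prefetch=True)
print(f"hits without prefetch: {hits_without}, with prefetch: {hits_with}")
```

Because the toy cache never evicts anything, it can't show the downside mentioned above, where speculative fetches hurt certain workloads; a bounded cache would be needed for that.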
The Core i7 can get to main memory very quickly, too, thanks to its integrated memory controller, which eliminates the chip-to-chip "hop" required when going over a front-side bus to an external north bridge. Again, this is a familiar page from AMD's template, but Intel has raised the stakes by incorporating support for three channels of DDR3 memory. Officially, the maximum memory speed supported by the first Core i7 processors is 1066 MHz, which is a little conservative for DDR3, but frequencies of 1333, 1600, and 2000 MHz are possible with the most expensive Core i7, the 965 Extreme Edition. In fact, we tested it with 1600 MHz memory, since this is a more likely configuration for a thousand-dollar processor.
For a CPU, the bandwidth numbers involved here are considerable. Three channels of memory at 1066 MHz can achieve an aggregate of 25.6 GB/s of bandwidth. At 1333 MHz, you're looking at 32 GB/s. At 1600 MHz, the peak would be 38.4 GB/s, and at 2000 MHz, 48 GB/s. By contrast, the peak effective memory bandwidth on a Core 2 system would be 12.8 GB/s, limited by the throughput of a 1600 MHz front-side bus. With dual channels of DDR2 memory at 1066 MHz, the Phenom's peak would be 17.1 GB/s. The Core i7 is simply in another league. In fact, our Core i7-965 Extreme test rig with 1600 MHz memory has the same total bus width (192 bits) and theoretical memory bandwidth as a GeForce 9600 GSO graphics card.
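The arithmetic behind those peak figures is simple enough to sketch: channels times bytes per transfer times transfer rate. The 64-bit channel width follows from the 192-bit total quoted above for three channels; treating the front-side bus as a single 64-bit channel is our assumption, and the function name is purely illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000

# Core i7 with three channels of DDR3 at the speeds mentioned above.
core_i7 = {speed: round(peak_bandwidth_gbs(3, speed), 1)
           for speed in (1066, 1333, 1600, 2000)}
for speed, gbs in core_i7.items():
    print(f"Core i7, 3 x DDR3-{speed}: {gbs} GB/s")

# Core 2: limited by a 1600 MHz front-side bus (assumed 64 bits wide).
core2_fsb = round(peak_bandwidth_gbs(1, 1600), 1)
# Phenom: dual channels of DDR2 at 1066 MHz.
phenom = round(peak_bandwidth_gbs(2, 1066), 1)
print(f"Core 2 FSB: {core2_fsb} GB/s, Phenom: {phenom} GB/s")
```

These are theoretical peaks only; measured throughput will come in lower.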
With the memory controller onboard and the front-side bus gone, the Core i7 communicates with the rest of the system via the QuickPath interconnect, or QPI. QuickPath is Intel's answer to HyperTransport: a high-speed, narrow, packet-based, point-to-point interconnect between the processor and the I/O chip (or other CPUs in multi-socket systems). The QPI link on the Core i7-965 Extreme operates at 6.4 GT/s. At 16 bits per transfer, that adds up to 12.8 GB/s in each direction, and since a QPI link consists of a dedicated pair of unidirectional connections, one per direction, the total bandwidth is 25.6 GB/s. Lower-end Core i7 processors have 4.8 GT/s QPI links with up to 19.2 GB/s of total bandwidth. Obviously, these are both just starting points, and Intel will likely ramp up QPI speeds from here in successive product generations. Still, both are somewhat faster than the HyperTransport 3 interconnects in today's Phenoms, which peak at either 16 or 14.4 GB/s, depending on the chip.
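The QPI numbers fall out of the same kind of arithmetic. This short sketch uses only the figures given above (transfer rate, 16 bits per transfer, two directions); the function name is our own.

```python
def qpi_bandwidth_gbs(giga_transfers_per_s, bits_per_transfer=16, directions=2):
    """Return (per-direction, total) QPI bandwidth in GB/s."""
    per_direction = giga_transfers_per_s * bits_per_transfer / 8
    return round(per_direction, 1), round(per_direction * directions, 1)

extreme_one_way, extreme_total = qpi_bandwidth_gbs(6.4)   # Core i7-965 Extreme
lower_one_way, lower_total = qpi_bandwidth_gbs(4.8)       # lower-end Core i7s

print(f"6.4 GT/s: {extreme_one_way} GB/s each way, {extreme_total} GB/s total")
print(f"4.8 GT/s: {lower_one_way} GB/s each way, {lower_total} GB/s total")
```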
The Core i7 die and major components. Source: Intel.