
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001

An Embedded 32-b Microprocessor Core for Low-Power and High-Performance Applications

Lawrence T. Clark, Member, IEEE, Eric J. Hoffman, Jay Miller, Manish Biyani, Yuyun Liao, Associate Member, IEEE, Stephen Strazdus, Michael Morrow, Kimberley E. Velarde, and Mark A. Yarch

Abstract: An embedded RISC microprocessor core fabricated in a six-layer metal µm CMOS process implementing the ARM V.5TE instruction set is described. The core described is the first implementation of the Intel XScale Microarchitecture. (ARM is a registered trademark of Advanced RISC Machines, Ltd.) The microprocessor core, which includes caches, memory management units, and a bus controller, comprises a hard-embedded block mm² in size. The implementation is primarily custom logic in a variety of circuit styles. The processor dissipates 450 mW at 1.3 V, 600 MHz, and scales between 55 mW at 0.7 V, 200 MHz, and 900 mW at 1.65 V, 800 MHz. Architectural performance is 1000 MIPS at 800 MHz, with efficiency ranging from over 850 MIPS/W at 1.65 V to over 4500 MIPS/W at 0.75 V. Architectural and circuit design approaches for low power and high performance are described, and measured results from the initial implementation are shown. The first implementation VLSI chip has a 3.3-V pin interface and supports a V core voltage range.

Index Terms: Cache memories, CMOS integrated circuits, microprocessors.

I. INTRODUCTION

Embedded microprocessor applications encompass a wide range, from high performance in system-on-a-chip (SOC) devices that supply networking, I/O processing, and modem banks, to power-consumption-limited personal digital assistants and cell phones. While the latter require increasing performance for increased functionality such as handwriting and voice recognition, low active and standby power consumption is the primary consideration for adequate battery life.
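As a quick cross-check of the figures quoted in the abstract, the following Python sketch computes MIPS/W at the three quoted operating points. It assumes that throughput scales linearly with clock frequency from the quoted 1000 MIPS at 800 MHz, which ignores memory-latency effects; the function and constant names are illustrative only.

```python
# Operating points quoted in the abstract:
# (core voltage in V, clock in MHz, power in W).
OPERATING_POINTS = [
    (0.7, 200, 0.055),
    (1.3, 600, 0.450),
    (1.65, 800, 0.900),
]

# The abstract quotes 1000 MIPS at 800 MHz.
MIPS_PER_MHZ = 1000 / 800


def mips(freq_mhz):
    """ASSUMPTION: throughput scales linearly with clock frequency."""
    return MIPS_PER_MHZ * freq_mhz


def mips_per_watt(freq_mhz, power_w):
    """Energy efficiency at one operating point."""
    return mips(freq_mhz) / power_w


for vdd, freq_mhz, power_w in OPERATING_POINTS:
    print(f"{vdd:.2f} V, {freq_mhz} MHz: {mips(freq_mhz):.0f} MIPS, "
          f"{mips_per_watt(freq_mhz, power_w):.0f} MIPS/W")
```

Under that assumption the 200-MHz point works out to roughly 4500 MIPS/W and the 800-MHz point to about 1100 MIPS/W, which is consistent with the "over 4500" and "over 850" figures in the abstract.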
The former applications, which are tethered or nonbattery powered, still desire low power but demand the highest possible performance. In these applications, low power enables greater integration and lower package cost due to improved thermal characteristics.

The first embodiment of the Intel XScale Microarchitecture described here was developed to enable application specific standard product (ASSP) SOC devices which provide up to 1000 MIPS of processing power in tethered applications while allowing up to 4500 MIPS/W under battery power. The microprocessor delivers the highest currently available performance under 0.5 W, when measured running Dhrystone 2.1. High absolute performance at the process maximum voltage enables compelling performance at low voltage levels and provides high MIPS/W due to the well-known quadratic dependence of power on operating voltage. In this manner, the same design can meet what appear initially to be conflicting goals of low power and high performance. Circuit and process techniques allow low average power as well as standby current of 100 µA at 1 V and room temperature.

Manuscript received March 17, 2001; revised June 18. The authors are with Intel Corporation, Chandler, AZ USA. Publisher Item Identifier S (01)

TABLE I. DEVICE PARAMETERS

In this paper, performance of the implementation is reviewed and details of the architectural and circuit approaches are presented. Section II briefly describes the process technology utilized. Section III describes the processor microarchitecture, while the circuit implementation details, focusing on clocking and caches, comprise Section IV. The physical implementation is shown in Section V, and the simulated and measured performance is discussed in Section VI. That section also describes the power-down modes and the use of dynamic power supply voltage scaling.

II. PROCESS TECHNOLOGY

The core is implemented in an n-well on P-epi µm lithography process similar to that described in [1].
This process implements a 5% optical shrink from that process, as well as numerous changes to both the transistors and interconnect to support SOC applications. Process characteristics are shown in Table I. The additional 8-nm gate oxide provides 5-V tolerance to allow interfacing to standard memory and I/O such as SDRAM and PCI. The SRAM cell is 5.05 µm² in size. The six layers of interconnect are aluminum with SiOF dielectric material to limit capacitance. Metals 2 through 5 support the same pitch to provide high routing density on standard-cell-based autoplace and route-designed blocks, as shown in Fig. 1.

Fig. 1. Process SEM cross section.

The process threshold voltage was raised from that of [1] to limit standby power. Circuit design and architectural pipelining ensure low-voltage performance and functionality. To further limit standby current in handheld ASSPs, a longer poly target takes advantage of the threshold-voltage versus channel-length dependence, and source-to-body bias is used to electrically limit transistor leakage in standby mode. All core NMOS and PMOS transistors utilize separate source and bulk connections to support this. The process includes cobalt disilicide gates and diffusions. Low source and drain capacitance, as well as 3-nm gate-oxide thickness, allow high performance and low-voltage operation.

Fig. 2. Microprocessor pipeline organization.

III. ARCHITECTURE

The microprocessor contains 32-kB instruction and data caches as well as an eight-entry coalescing writeback buffer. The instruction and data cache fill buffers have two and four entries, respectively. The data cache supports hit-under-miss operation, and lines may be locked to allow SRAM-like operation. Thirty-two-entry fully associative translation lookaside buffers (TLBs) that support multiple page sizes are provided for both caches. TLB entries may also be locked.
A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].

A. Pipeline Organization

To obtain high performance, the microprocessor core utilizes a simple scalar pipeline and a high-frequency clock. In addition to avoiding the potential power waste of a superscalar approach, functional design and validation complexity is decreased at the expense of circuit design effort. To avoid circuit design issues, the pipeline partitioning balances the workload and ensures that no one pipeline stage is tight. The main integer pipeline is seven stages, memory operations follow an eight-stage pipeline, and when operating in Thumb mode an extra pipe stage is inserted after the last fetch stage to convert Thumb instructions into ARM instructions. Since Thumb mode instructions [11] are 16 b, two instructions are fetched in parallel while executing Thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray. Features that allow the microarchitecture to achieve high speed are as follows.

Separate shifter and ALU stages. The shifter and ALU reside in separate stages. The ARM instruction set allows a shift followed by an ALU operation in a single instruction. Previous implementations limited frequency by having the shift and ALU in a single stage. Splitting this operation reduces the critical ALU bypass path by approximately 1/3. The extra pipeline hazard introduced when an instruction is immediately followed by one requiring that the result be shifted is infrequent.

Decoupled instruction fetch. A two-instruction deep queue is implemented between the second fetch and instruction decode pipe stages. This allows stalls generated later in the pipe to be deferred by one or more cycles in the earlier pipe stages, thereby allowing instruction fetches to proceed when the pipe is stalled, and also relieves stall speed paths in the instruction fetch and branch prediction units.
Deferred register dependency stalls. While register dependencies are checked in the RF stage, stalls due to these hazards are deferred until the X1 stage. All the necessary operands are then captured from result-forwarding busses as the results are returned to the register file.

One of the major goals of the design was to minimize the energy consumed to complete a given task. Conventional wisdom has been that shorter pipelines are more efficient due to the reduced number of clocked elements and speculative operations in the core. However, lengthening the pipeline allows power reduction at a given frequency through a combination of the greater frequency that can be achieved at the same voltage (or lower voltage at the same frequency) and limiting the need for high-speed, i.e., high-power, circuit design techniques.

As frequency is increased, the effect of memory latency on overall performance is also increased. The microarchitecture decouples execution from external memory to avoid this by including the ability to buffer up to eight external memory read requests, a nonblocking data cache, an eight-entry write buffer that supports coalescing of multiple requests, writeback caching, configurable data cache allocation policies, and cache locking.

B. Cache Architecture

The cache design utilizes high set associativity content-addressable memory (CAM)-based virtually addressed tags [3], [4] that eliminate the x address decoder and provide low power in a single-cycle cache. While potentially shortening the pipeline by allowing concurrent TLB and cache lookup, virtual addressing invites unique challenges in the data cache, as entries which are replaced by writes must be unreplaced upon a TLB miss or permission violation, which is known only after the fact. The pipelined 32-kB instruction and 32-kB data caches are divided into banks, which are 32-way set associative.
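To make the banked, 32-way organization concrete, the sketch below derives the number of banks and one possible address split. The 32-byte line size, the 32-b virtual address width, and the choice of which address bits select the bank are assumptions made for illustration and are not stated in the text up to this point; the function names are hypothetical.

```python
CACHE_BYTES = 32 * 1024  # from the text: 32-kB instruction or data cache
WAYS = 32                # from the text: each bank is 32-way set associative
LINE_BYTES = 32          # ASSUMPTION: line size, not stated in the text so far
ADDR_BITS = 32           # ASSUMPTION: 32-b virtual addresses


def cache_geometry(cache_bytes=CACHE_BYTES, ways=WAYS, line_bytes=LINE_BYTES):
    """Return (banks, offset_bits, bank_bits, tag_bits) for a banked cache."""
    lines = cache_bytes // line_bytes
    banks = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    bank_bits = banks.bit_length() - 1
    tag_bits = ADDR_BITS - offset_bits - bank_bits
    return banks, offset_bits, bank_bits, tag_bits


def split_address(vaddr):
    """Split a virtual address into (tag, bank, offset).

    ASSUMPTION: the bank is selected by the bits just above the line offset.
    """
    banks, offset_bits, bank_bits, _ = cache_geometry()
    offset = vaddr & ((1 << offset_bits) - 1)
    bank = (vaddr >> offset_bits) & (banks - 1)
    tag = vaddr >> (offset_bits + bank_bits)
    return tag, bank, offset


banks, offset_bits, bank_bits, tag_bits = cache_geometry()
print(f"{banks} banks x {WAYS} ways, tag/bank/offset bits = "
      f"{tag_bits}/{bank_bits}/{offset_bits}")
```

Under these assumptions, a lookup compares one 22-b tag against the 32 tags held in a single bank.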
The high associativity is more important to power than to architectural performance, where 32 ways is past the point of diminishing returns [5]. Comparison to similar speed designs showed that the power savings achieved versus a conventional 4-way set associative design is approximately 4.

All cache operations (load, store, fill, and replace) can issue on each cycle. A read or write yields one 32-b word or one to four written bytes, respectively. Fill and flush operations are 64 bits wide. Line fill operations begin with a tag write operation, which can occur concurrently with an eviction from the data array. The tag valid bit is then set. Fills are completed when the data is made available from the bus unit by writing two words (as well as one fourth of the physical address) in four subsequent cycles. Subsequently, including the tag valid bit in the match operation validates fills. The data valid bit is not set until the line data is complete, which allows hit-under-miss operation. A round-robin replacement scheme is used due to the impracticality of implementing a least recently used replacement scheme.

Line-based locking allows predictable response time for critical accesses, e.g., commonly used data or interrupt handler code. This is implemented with a move-to-coprocessor (MCR) instruction to lock the line upon loading it [6]. A data minicache which is 2 kB in size and has an independent round-robin replacement mechanism is provided. This allows large data sets with high spatial locality, e.g., graphics buffers, to be cached without evicting data with more temporal locality from the main data cache.

C. Multiply Accumulator

The multiply accumulator (MAC) supports single-cycle throughput for 16-b × 32-b operations and 16-b SIMD operations for audio processing. In the latter case, a 32-b register is treated as two 16-b values.

Fig. 3. Multiplier accumulator architecture.
The MAC leverages the advantage of a 16-b encoding scheme without adding extra delay to the faster four-stage Wallace tree of a 12-b encoding scheme. Whereas a conventional 16-b encoding requires five stages of 3-to-2 carry save adders (CSA), and a 12-b encoding scheme requires four stages of CSA, the lack of feedback on the Wallace tree in the first cycle allows improvement. The MAC encodes 16 b of the multiplier in the first cycle (the upper MUX inputs in Fig. 3) and encodes 12 b of the multiplier for the rest of the cycles with four stages (the lower MUX inputs in the figure). Eight partial products are generated in the first cycle, and six partial products along with the intermediate feedback carry and sum vectors are generated in the other cycles to fill eight slots. A[31:0] is a 32-b multiplicand, B[31:0] is a 32-b multiplier, and C[31:0] is 32-b accumulate data.

Two 40-b accumulators increase the multiply instruction throughput by avoiding data dependencies without requiring high circuit speed, thus limiting power. Forty-bit accumulators increase performance and precision of audio coding algorithms by allowing infrequent overflow checking. The accumulators are implemented as a conventional carry lookahead design. The accumulator results are combined upon the writeback to the register file by the adder shown at the lower right in Fig. 3.

To support 16-b single-instruction multiple data (SIMD) operations, multiply and load double word (LDRD) instructions are added, the former to the coprocessor instruction space. Both require two cycles to complete and, when issued in alternating cycles, utilize the full issue bandwidth of the microprocessor to allow up to 0.85 MACs/MHz on some DSP algorithms. These 16-b DSP extensions include a SIMD instruction and multiply with implicit accumulate (MIA) instructions. The basic SIMD operation, MIAPH [6], treats two 32-b registers as two pairs of 16-b values.
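The packed multiply-accumulate that MIAPH performs on the two 16-b pairs can be sketched in Python as follows. This is a behavioral illustration only, not a model of the hardware datapath: treating the 16-b halves as signed values and wrapping the result modulo 2^40 are assumptions, and the helper names are hypothetical.

```python
ACC_BITS = 40                      # from the text: 40-b accumulators
ACC_MASK = (1 << ACC_BITS) - 1


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number.

    ASSUMPTION: the 16-b operands are signed.
    """
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def miaph(acc, rm, rs):
    """Add the products of the upper halves and of the lower halves
    of two 32-b registers to the accumulator.

    ASSUMPTION: the accumulator wraps modulo 2**40 on overflow.
    """
    upper = to_signed16(rm >> 16) * to_signed16(rs >> 16)
    lower = to_signed16(rm) * to_signed16(rs)
    return (acc + upper + lower) & ACC_MASK


# Upper halves 3 and 5, lower halves 2 and 4: 3*5 + 2*4 = 23.
print(miaph(0, 0x00030002, 0x00050004))
```

One call consumes two 16-b sample pairs, which is why pairing this operation with a double-word load suits streaming audio kernels.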
The upper 16 bits of each register are multiplied together and the lower 16 bits are also multiplied together. The results are then added to the contents of the 40-b accumulator. The MIAxy MAC instruction multiplies two 16-b values, taken from either the upper or lower two bytes of the two source registers. The combination of double word load, MAC, and SIMD MAC instructions allows efficient code for handling media streams.

Fig. 4. Clock distribution.

Fig. 5. Cache simulated waveforms.

IV. CIRCUIT IMPLEMENTATION

A. Circuit Design for Low Power

The design is primarily implemented in static CMOS logic and supports full clock stop (pseudostatic operation). Three power-down modes are supported: Idle, which stops the internal clocks; Standby, which stops the phase-locked loop (PLL) and drives the core to a low-leakage (reverse body biased) configuration that retains state; and Sleep, which is not state retentive. The use of pulse-clocked latches as master-slave flip-flops reduces the clock power by approximately 1/3 while minimizing delay and setup time. Portions of the design are implemented in domino logic, while the ALU, parity generators, and multiply accumulator (MAC) utilize CMOS pass-gate logic. Circuits were simulated at nominal and low voltage to ensure low-voltage performance.

B. Adder and Bypass Loop

The ALU bypass loop is the primary speed path in a RISC microprocessor, frequently forcing designs to a domino implementation. Here, a static CMOS pass-gate logic conditional-sum adder was utilized instead. The static adder, standalone, is 14% slower at 1.1 V than a single-rail domino version. However, when used in conjunction with pulse-clocked latches in the ALU bypass loop, it is slightly faster than a domino adder, due to the elimination of one latch setup time, clock skew, and delay.
Separate latches limit ALU logic switching by function to limit power dissipation [7], a technique that was used throughout the design, e.g., on the cache busses.

C. Clocks and PLL

Pulse clocking obtains flip-flop functionality from transparent latches, saving power due to less clock load and fewer toggling nodes in sequential elements. Pulse-clocked latches also have better delay characteristics than full master-slave designs, i.e., the total time wasted in the latch elements, comprising the setup time and the clock-to-output delay, is lower. The penalty is increased risk of race-through, as the hold time required is a function of the pulsewidth. To mitigate this disadvantage, two things were done: the minimum pulsewidth providing reliable latch writeability across process, voltage, and temperature corners was used, and compact power-efficient min-delay buffers were constructed. The standard library transparent latches were utilized.

The pulses are generated locally in the local clock buffer (LCB). A simple three-inversion one-shot circuit is used to generate the pulse [7]. The LCB is the last clock buffer level and directly drives the sequential elements. Two enables encourage clock gating and ease logic constraints on its use. Local pulse generation diminishes degradation by RC effects, as well as filtering by downstream buffers. Both the maximum and minimum delay aspects of the pulse-clocked latches were included in the timing analysis, which utilized commercially available and proprietary timing analysis software. Since pulse generation and granular clock gating do carry overhead in terms of complexity, at least five latches were driven with each LCB.

The clock network is represented in Fig. 4. A deskewed early global clock (EGCLK) is produced in the shape of a T. It is generated by two balanced binary trees that fan out from a common point. The final drivers in each tree feed into wide M6 and M5 nodes labeled EGCLK.
EGCLK is buffered through two levels of inversions to produce GCLKs that in turn drive the LCBs. To minimize skew, the RC component of the route from the output of the clock spine to the input of the LCB is matched. Typically, in a spine-based clock distribution scheme, this is done by having a fixed load at the input to each LCB and having a fixed route length and width to the input of each LCB. Route doublebacks are used to match the route length, and dummy capacitive loading is used to equalize the input capacitance on all LCBs. This approach is simple and guaranteed accurate, but wastes power since most clock nodes have a considerable amount of dummy load.

Fig. 6. Cache clock gating. Circuitry outside the dashed boxes is in the clock spine.

Here, the RC was matched without dummy loading by modulating the route width from the output of the clock spine to the inputs of the LCBs. If the minimum acceptable width was hit, a doubleback routing was employed, as shown for signal GCLK_RF1 in the figure. If the width of the data path could not be driven from a single end under the maximum wire RC limit, a T-shaped route was allowed, as shown for route GCLK_IC1. This allows two LCBs to drive a single clock signal from each end of the data path. Early global clock gating is represented for the caches and for the signal GCLK_MA2. Transparent latches were used to take advantage of time borrowing, as were enable-low latches. In the figure, signal CLK_IA1 is representative of that topology. Total analytical clock skew at the local clock level is less than 100 ps. A-0 silicon had measured cycle-to-cycle jitter of 55 ps.

Clock gating is implemented on three levels, allowing gating as early in the clock spine as feasible to limit the power dissipated by the clocks.
Since the enable path is significantly shortened, this exerted a significant effect on clock buffer positioning. The three gating levels are: first, at the PLL, implementing idle mode, which eliminates all clock activity; second, at the global clock (GCLK) level, where 83 unique enable signals are implemented; and finally, at the LCB level, where 400 unique enable signals exist.

D. Cache Organization and Design

A cache access is performed in three phases, one for each of tag lookup, data access, and delivery with alignment (including sign extend in the case of the data cache). Entries replaced by stores must be restored upon a TLB miss or permission violation, requiring a unique read/write cycle rather than the simpler write. This necessitated relatively short 68-ce