Timing Constraints for High Speed Counterflow-Clocked Pipelining

Jae-tack Yoo, Ganesh Gopalakrishnan and Kent F. Smith
UUCS-95-019
Department of Computer Science, MEB 3190, University of Utah, Salt Lake City, UT 84112
October 30, 1995

Abstract

With the escalation of clock frequencies and the increasing ratio of wire- to gate-delays, clock skew is a major problem to be overcome in tomorrow's high-speed VLSI chips. Also, with an increasing number of stages switching simultaneously comes the problem of higher peak power consumption. In our past work, we have proposed a novel scheme called Counterflow-Clocked (C^2) Pipelining to combat these problems, and discussed methods for composing C^2 pipelined stages. In this paper, we analyze, in great detail, the timing constraints to be obeyed in designing basic C^2 pipelined stages as well as in composing C^2 pipelined stages. C^2 pipelining is well suited for systems that exhibit mostly uni-directional data flows as well as possess mostly nearest-neighbor connections. We illustrate C^2 pipelining on such a design with several design examples. C^2 pipelining eases the distribution of high speed clocks, shortens the clock period by eliminating global clock signals, allows natural use of level-sensitive dynamic latches, and generates less internal switching noise due to the uniformly distributed latch operation. By applying C^2 pipelining and its composition methods to build a system, VLSI designers can substitute the global clock skew problem with many local one-sided delay constraints.

I. Introduction to a high speed system

With the escalation of clock frequencies and the increasing ratio of wire- to gate-delays, clock skew is a major problem to be overcome in today's high-speed VLSI chips. Clock skew should ideally be less than 5-10% of the system clock cycle time [1]; this is a difficult figure to attain in many modern chips [2] and will become more so with the impending GHz rate of clocking [3].
The effect of shrinking VLSI feature sizes will increase this disparity [4] in the future, especially in the light of the fact that in submicron CMOS, interconnection delays are going to be larger than gate-propagation delays [5]. Consequently, an increased percentage of the clock period will be devoted to clock skew margins [6, 7]. The faster the clock and the bigger the die size, the worse the clock skew effects will be.

A major concern when building high performance VLSI systems is to build an effective clock distribution network. Many clock distribution methods for large high-speed VLSI chips have been developed [1] to achieve rigid synchronization (tight skew control) over the chip. Clock distribution networks of high-speed systems are normally comprised of binary trees of clock buffers [2, 8], which are expensive to produce in terms of area and design time. Network implementations such as H-tree methods [7] have been commonly exploited to reduce the clock skew. The effort to limit skews has an unfortunate side-effect: it causes the latches to switch almost simultaneously, causing ground-bounce and power-supply-droop, both of which can lead to chip malfunction. This often necessitates on-chip and off-chip decoupling capacitors [1], both of which add to the design cost.

Rigidly clocked synchronous systems are often those that support a variety of data movements between their computational blocks. These systems have embedded bus structures that permit communication between physically distant modules. In these cases, the assumption that all the modules are rigidly synchronized to a global clock makes design easier, and hence is almost always made. However, for systems that have a VLSI realization with mostly uni-directional data flows as well as possessing mostly nearest-neighbor connections, the assumption of rigidly synchronized clocking is not necessary, and can result in lost performance when enforced.
Examples of such chips are digital signal processing (DSP) chips, floating point units (FPU), graphics engines, asynchronous transfer mode (ATM) switches, etc. As we will show, higher performance and simpler clock distribution will result in these systems if we stick to local clocking constraints, much the same way the data dependencies in these systems are local. This is the main idea behind clock distribution in C^2 pipelined realizations.

Another major concern when building high performance VLSI systems is to employ high performance pipelined structures in conjunction with high speed clocks. Pipelining is a technique for reducing the clock period as well as increasing the amount of parallel circuit activity by splitting deep logic structures into shallower structures that are separated by pipelined latches. Although design methods for conventionally pipelined systems are well known [9], serious problems due to rigid clock synchronization may arise in very high speed pipelined designs. Strictly speaking, however, pipelining and clocking are orthogonal concepts. One can build asynchronous pipelines known as micropipelines [10] that do not employ clocks. However, the time penalty paid for generating the completion signals, as well as for handshaking [11], has prevented micropipelines from finding widespread use in high-performance VLSI systems. One can also implement wavepipelining [12], where the "latches" can be realized by the inherent combinational delays of logic structures. Despite their inherent performance advantages, wavepipelined systems require considerably more design effort to balance combinational delays, and consequently have received only limited usage. C^2 pipelining is a synchronous design scheme that (as pointed out before) comes with clock-distribution methods as well as pipeline design- and composition-methods.

A feature of C^2 pipelining is that the clock signals travel opposite to the direction of data movement.
Back-propagating clock signals have been considered previously [2, 13], but never widely used in actual circuits. These previous back-propagating circuits were rigidly clocked, and hence offered no real advantages over H-tree distributed clocks; in fact, they actually increased the clock period. Another clocking method is buffered clocking, mentioned in El-Amawy [14], and originally described as pipelined clocking by Fisher et al. [7] (who do not assign any particular direction to pipelined clocks). This method also suffers from an increased clock period.

In C^2 pipelined systems, every pipeline stage employs clock buffers, as shown in Figure 1 (a), a detailed explanation of which will be given in succeeding sections. These inverter buffers not only deliberately skew the clock (the exact one-sided constraints will be presented later) but also restore the clock-edge. This scheme achieves temporally distributed clocking. Clock amplification is also carried out in a distributed fashion. Conventional two-phase clocked pipelining is also illustrated in the figure for comparison. The C^2 pipelining idea was first introduced in [15], where we presented many actual uses in the context of a subband vector quantizer (SB/VQ) chip. In this paper, we will focus on analyzing the timing constraints of C^2 pipelining. In Section V we will review the results of a C^2 pipelining network for the SB filtering chip.

Another feature of C^2 pipelined systems is that they enable one to use simple and efficient dynamic latches, which offer extremely low latch delays and areas, and avoid special latch designs [1, 2]. The C^2 pipelining method also staggers the switching activities of the latches, thus reducing the peak power consumption. This, in turn, reduces internal switching noise and also simplifies power-line routing, making it easier to distribute high speed clocks.
[Figure 1: Circuit C2 and circuit Conventional. (a) Circuit C2; (b) Circuit Conventional.]

The pipeline interconnection methods to be described actually make the idea of C^2 pipelining more useful than pipelines with only nearest neighbor connections. In [15], we introduced such methods for 1) data forwarding, in which data skips a few pipeline stages in the direction of the dataflow, 2) data backwarding, in which data skips a few pipeline stages backwards (commonly used for iterative computations), 3) sequential connection of different pipelines, 4) pipeline fork and join methods to combine pipeline functionality in parallel, and 5) synchronization methods to synchronize incoming data and outgoing data to a clock signal. Timing constraints involved in these methods will also be discussed in detail in this paper.

In Section II, basic C^2 pipelining architectures are described and analyzed. Basic composition methods of data forwarding and data backwarding are analyzed in Section III. Section IV shows extended composition methods of sequential connection, pipeline fork and join, and synchronization. These methods are explained using the analysis results shown in Section III. Section V gives a practical assessment of C^2 pipelining with a design and layout example. Conclusions are given in the final section.

II. Basic C^2 Pipelining Architectures

This section shows basic C^2 pipelining architectures with an analysis of timing constraints.

A. Principles of C^2 Pipelining Architecture

Figure 1 shows the difference between C^2 pipelining and a conventional clock distribution for a pipeline. Circuit C2 in Figure 1 (a) employs a chain of inverters to provide local clock signals. Local buffers attached to the chain provide appropriate output power to control local latches. Figure 1 (b) shows the conventional method, in which a non-overlapping two-phase clock generator is located at the center of the clock distribution network.
This clock generator is designed to cope with clock loads of the entire clock network.

C^2 pipelining can be realized in several ways, as shown in Figure 2. Figure 2 (a) shows the basic architecture with back-propagating and inverting delays in a clock distribution line. Figure 2 (b) shows a version with computational components in the data path. Figure 2 (c) shows an alternative with noninverting buffers used in place of inverting buffers. Although they are illustrated differently, Figure 2 (b) can represent all three cases for the purpose of timing constraint analysis.

Figure 3 (a) shows a portion of the C^2 pipeline of Figure 2 (b), and Figure 3 (b) illustrates clock waveforms for its level sensitive dynamic latches. These latches are transparent while the clock signal is high and opaque while it is low. Latch i is controlled by a clock which is inverted and delayed by a clock line delay ($d_c$) from $clk_{i+1}$, the clock for latch i+1. Similarly, latch i-1 receives an inverted and delayed clock signal from $clk_i$.

[Figure 2: C^2 pipelining architectures. (a) Basic architecture; (b) General architecture; (c) Alternate general architecture. Legend: inverting delay, non-inverting delay, latches, data paths. A latch with a bubble operates at a different phase.]

Clock timing analysis pertaining to a particular latch i with respect to its neighboring latches will now be discussed. First, the pipelining involves "go-throughs" during clock periods I and III shown in Figure 3 (b) (due to the fact that C^2 pipelining implements overlapping clocks). For instance, during period III, stage i-1 output can "go through" to stage i+1 because the i-1 latch is in hold while i and i+1 are transparent. Go-through should be avoided in a rigidly clocked synchronous system with a non-overlapping clock. However, this go-through does not make stage i+1 produce a wrong output in a C^2-pipelined system.
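The clocking relation described above (each latch sees its downstream neighbor's clock, inverted and delayed by $d_c$) can be explored with a small numerical sketch. This is only an illustration under stated assumptions, not the authors' analysis: it assumes an ideal square-wave clock with a 50% duty cycle, ignores wire delays, and uses hypothetical names (`source_clock`, `latch_is_transparent`, `go_through_window`) and an arbitrary time unit. It estimates how long two adjacent latches are transparent at the same time, which is the window in which a go-through can occur.

```python
def source_clock(t, phase):
    """Ideal 50% duty-cycle clock (an assumption): high for `phase`, then low for `phase`."""
    return (t % (2 * phase)) < phase


def latch_is_transparent(t, stage, last_stage, phase, dc):
    """True if the latch at `stage` is transparent (local clock high) at time t.

    The clock is assumed to enter at `last_stage` and travel against the data
    flow; every hop toward an earlier stage inverts it and delays it by dc.
    """
    hops = last_stage - stage
    level = source_clock(t - hops * dc, phase)
    return level if hops % 2 == 0 else not level


def go_through_window(phase, dc, steps=20000):
    """Estimate, per clock period, how long two adjacent latches are both transparent."""
    dt = 2 * phase / steps
    both = 0
    for n in range(steps):
        t = (n + 0.5) * dt  # sample at midpoints to stay off the clock edges
        if (latch_is_transparent(t, 0, 1, phase, dc)
                and latch_is_transparent(t, 1, 1, phase, dc)):
            both += 1
    return both * dt


if __name__ == "__main__":
    # Hypothetical numbers: clock phase of 1.0 time unit, buffer delay of 0.25.
    print(go_through_window(phase=1.0, dc=0.25))
```

Under these idealized assumptions the overlap per clock period comes out close to the buffer delay itself, so a larger $d_c$ widens the go-through window while the two latches on either side of it remain out of phase.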
A possible scenario involving a go-through is the following: stage-latch i-1 stabilizes its output by period II; however, stage i-1 delays this output, which reaches the input of latch i only during period III (note the distinction between stage and stage-latch); the output of stage i (not stage-latch) can be generated early during period III and be sent to stage-latch i+1, which is also transparent.

[Figure 3: A part of a C^2 pipeline. (a) Part of a C^2 pipeline, showing latches i-1 to i+2, stages i-1 to i+1, and the clock line delay $d_c$ between successive clocks $clk_{i+2}$, $clk_{i+1}$, $clk_i$ and $clk_{i-1}$; (b) Clocks for latches over periods I to VI, marking the transparent window and the latch output window.]

In this scenario, the output generated by latch i-1 gets processed by stages i-1 and i and is applied to the input of stage i+1, all during period III. This go-through is not harmful because it causes stage i+1 output to tend towards the same value as it will evaluate to in the absence of go-through (much like chaining [16]). The go-through possible in period I can also be analyzed in the same way. In fact, go-throughs can actually help shorten the clock period by allowing a stage to absorb a fraction of the long-path delays associated with the stage preceding it. This can potentially be an advantage if the stage delays are not exactly balanced. The other periods involved (II, IV, V and VI) do not allow go-throughs to happen.

Figure 4 illustrates the overall latch operations for a C^2 pipeline. This figure shows staggered latch operations, where each latch alternates between transparent and opaque states. The vertical bold lines emanating from one period of the latch i operation mark a sending window, involving a transparent state and the succeeding opaque state of a latch, and a matching receiving window of the following latch. The latter latch i+1 is in the transparent state between the two bold lines.
This shows that the latches are operating as described in previous paragraphs.

The novelty of C^2 pipelining results from the use of intentionally inserted delays on clock lines. These delays not only provide the pipeline speed-up described above, but also partition the clock line into many small pieces, enabling one to avoid global clock skew problems. This leads to a locality property of the timing constraints for the whole pipeline: i.e., the whole pipeline works properly by assuring local delay constraints for all stages.

[Figure 4: Latch operations for a C^2 pipeline. Latches i to i+4 alternate between transparent and opaque states over time as data items d1 to d6 move from input to output; the clock propagates in the opposite direction.]

B. Timing Constraints for the Basic Architecture

In very high speed designs, the delays associated with segments of wires cannot be ignored. These are taken into consideration in the following calculations. Figure 5 shows stage i with its associated delays for clock wires, a clock buffer and a data path. Specifically, let the shortest wire delay for the first latch be $d_{fsw}$, the longest wire delay for the first latch be $d_{flw}$, the shortest wire delay for the second latch be $d_{ssw}$, the longest wire delay for the second latch be $d_{slw}$, the inserted inverting buffer delay for the clock be $d_c$, the shortest data path delay for the stage be $d_{ds}$ and the longest data path delay for the stage be $d_{dl}$. Figure 6 shows the detailed timing diagram for the stage shown in Figure 5, including latch set-up time (S) and hold time (H). This figure emphasizes (slowest and fastest) data validation timing for latch i+1, and (slowest and fastest) clocks for latch i+1.
[Figure 5: Delays associated with stage i, between latch i and latch i+1, with clock nodes i and i+1.]

The local timing constraint can be derived as follows (refer to Figure 6):

The earliest available output time ($t_a$) of stage i is $d_c + d_{fsw} + d_{ds}$ after the falling edge of clock clk at node i+1.

The latest activation time of latch i+1, $t_b$, is $d_{slw}$ after the falling edge of the clock clk at node i+1.

[Figure 6: Detail timing diagram for Figure 5. Marked instants: earliest next data validation time ($t_a$); latest next data validation time ($t_c$); next data should be valid before $t_d$; current data should be stable until $t_b + H$.]

$t_a$ should be greater than $t_b + H$ (H being the latch hold-time) to avoid violating the hold-time requirement for latch i+1 (i.e., to avoid changing the data being latched by latch i+1 during the hold-time period). Thus, the condition for the local timing constraint is $d_c + d_{fsw} + d_{ds} \ge d_{slw} + H$, which results in the smallest inserted-delay value, $d_c$, of the clock line buffer(s) being:

$d_c \ge d_{slw} + H - d_{fsw} - d_{ds}$ (1).

To calculate the minimum allowable clock-phase duration, P, assume 50% duty-cycle clocks, which results in the following:

The latest data validation time at the input of latch i+1 occurs when the incoming data to stage i was validated late. This time instant ($t_c$) will be $d_c + d_{flw} + d_{dl}$ after the rising edge of the clock clk, as shown in Figure 6.

The earliest latch i+1 opening time (the latch is opened by the rise of clk) is $d_{ssw}$ after the rising edge. Therefore, the earliest latch i+1 closing time is $d_{ssw} + P$.

$t_c$ should be before $t_d = d_{ssw} + P - S$ to satisfy the latch i+1 setup time. This will result in the following inequality: $d_c + d_{flw} + d_{dl} \le P - S + d_{ssw}$, which results in the clock phase duration lower bound

$P \ge d_c + d_{flw} + d_{dl} + S - d_{ssw}$ (2).
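As a worked illustration of inequalities (1) and (2), the sketch below evaluates both bounds for a single stage. It is a minimal example under stated assumptions: the class and function names are hypothetical, the delay values are invented for illustration (in picoseconds) rather than taken from the report, and the non-strict form of each inequality is used. Clamping the bound in (1) at zero is also an assumption, reflecting that an inserted delay cannot be negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageDelays:
    """Delays of one stage, named after the symbols in the text (any consistent unit)."""
    d_fsw: float  # shortest clock-wire delay to the first latch (latch i)
    d_flw: float  # longest clock-wire delay to the first latch
    d_ssw: float  # shortest clock-wire delay to the second latch (latch i+1)
    d_slw: float  # longest clock-wire delay to the second latch
    d_ds: float   # shortest data-path delay of the stage
    d_dl: float   # longest data-path delay of the stage


def min_clock_buffer_delay(s, hold):
    """Bound from (1): d_c >= d_slw + H - d_fsw - d_ds, clamped at zero (assumption)."""
    return max(0, s.d_slw + hold - s.d_fsw - s.d_ds)


def min_clock_phase(s, d_c, setup):
    """Bound from (2): P >= d_c + d_flw + d_dl + S - d_ssw."""
    return d_c + s.d_flw + s.d_dl + setup - s.d_ssw


def stage_is_safe(s, d_c, phase, setup, hold):
    """Check the hold-side condition behind (1) and the setup-side condition behind (2)."""
    hold_ok = d_c + s.d_fsw + s.d_ds >= s.d_slw + hold
    setup_ok = d_c + s.d_flw + s.d_dl <= phase - setup + s.d_ssw
    return hold_ok and setup_ok


if __name__ == "__main__":
    # Invented example values in picoseconds; not measurements from the report.
    stage = StageDelays(d_fsw=50, d_flw=100, d_ssw=50, d_slw=120, d_ds=200, d_dl=1500)
    d_c = min_clock_buffer_delay(stage, hold=300)
    phase = min_clock_phase(stage, d_c, setup=200)
    print(d_c, phase, stage_is_safe(stage, d_c, phase, setup=200, hold=300))
```

With these invented numbers the smallest admissible buffer delay is 170 ps and the corresponding minimum clock phase is 1920 ps. Because each check uses only one stage's own delays, a whole pipeline could be validated by applying `stage_is_safe` stage by stage.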
The inequality in (2) can always be satisfied because the clock period is externally controllable, as in conventional synchronous clocking. The inequality in (1) is the condition that is most important. Hence, C^2 pipelining results in one-sided timing constraints. Also, notice that (2) is independent of clock skew, which confirms the observation that C^2 pipelining is an attractive method for GHz clocked circuits, where skew is expected to become a major problem using conventional rigid-clocking methods.

III. Basic composition methods

During the composition of C^2 pipeline blocks, there will arise situations in which the data needs to (1) move downstream (with respect to the data movement) to be consumed by a functional block with typically several inputs (Figure 7 (a)), and/or (2) move upstream to be consumed by a functional block with several inputs (typically in iteration structures) (Figure 7 (b)). As we expect such "stage skipping" connections to be infrequent as well as to skip only a small number of stages, we do not provide any special circuits to resynchronize the data; instead, we obtain timing constraints to be obeyed. Skips over longer distances have to proceed as several short skips in sequence with corresponding adjustments in the data timing. Timing constraints required for data forwarding and backwarding are now analyzed in the sections to follow.

[Figure 7: Data forwarding and data backwarding. (a) Data forwarding, where k is odd; (b) Data backwarding, where k is even.]

A. Data Forwarding

Figure 8 (a) shows data forwarding ignoring wire delays. The simplest form of data forwarding is feeding data from latch i to latch i+1 with a blank data path (waveforms (1) and (2) of Figure 8 (a) respectively). Notice that in this case latch i+1 is controlled by a clock that is inverted and leading with respect to the latch i clock.
This case of forwarding represents an empty C^2 pipeline stage with zero data-path delays and was analyzed in the previous section. When the destination is a latch with $clk_{i+2j+1}$ as shown in waveform (3), the clock for this latch is inverted and leading by $(2j+1) d_c$, where $d_c$ is a clock buffer delay. This leading duration can be extended up to $P - S$ as shown by waveform (4) (i.e., the latest data sent by latch i must fall before the set-up time window, marked S, of a latch with $clk_{i+2k+1}$ on waveform (4)). Note that data forwarding by whole cycles is possible. However, such extended forwarding needs to be avoided since the amount of delay in a long chain of inverters can significantly vary with temperature, operating voltage and fabrication process parameters, and hence may not reliably track the cycle time. Waveforms (1) and (5) give an example of an incorrect