EE382A Lecture 3: Superscalar and Out-of-Order Processor Basics

John Shen, Department of Electrical Engineering, Stanford University

Announcements
- HW is due today. Hand it to Davide at the end of lecture or send it ASAP; you will be contacted about the results within 1-2 days.
- A required paper is assigned for Lecture 3. Submit your summary by Wed 9/30; check the instructions on the class webpage.

Lecture 3 Outline
1. From Scalar to Superscalar Pipelines
2. Limits of Instruction-Level Parallelism
3. Superscalar Microprocessor Landscapes

Dynamic-Static Interface
- DSI = ISA = a contract between the program and the machine.

1. From Scalar to Superscalar Pipelines

Instruction Pipeline Design
- Uniform sub-computations... NOT! Balancing pipeline stages:
  - Stage quantization to yield balanced pipe stages.
  - Minimize internal fragmentation (some waiting stages).
- Identical computations... NOT! Unifying instruction types:
  - Coalescing instruction types into one multi-function pipe.
  - Minimize external fragmentation (some idling stages).
- Independent computations... NOT! Resolving pipeline hazards:
  - Inter-instruction dependence detection and resolution.
  - Minimize performance loss due to pipeline stalls.

Scalar Pipelined Processors
- The 6-stage TYPICAL (TYP) pipeline: IF (I-cache access), ID (decode), OF (register read / operand fetch), EX (execute or address generation), MEM (D-cache access), OS (register write / operand store).
- [Figure: per-stage activity of ALU, LOAD, STORE, and BRANCH instructions in the 6-stage TYP pipeline, and the corresponding datapath: PC update, I-cache, register file, ALU, D-cache.]

6-Stage TYP Pipeline Operation / Pipeline Interface to the Register File
- [Figure: cycle-by-cycle example with "add R1 = R2 + R3" and "load R3 = M[R4 + R5]"; the register file supplies two read ports (RAdd1/RAdd2) and one write port (WAdd/WData).]

Major Penalty Loops of Pipelining
- The ALU penalty, load penalty, and branch penalty loops.
- Performance objective: reduce CPI to 1.

Limitations of Scalar Pipelined Processors
- Upper bound on scalar pipeline throughput: limited by IPC = 1 -> parallel pipelines.
- Inefficient unification into a single pipeline: long latency for each instruction; hazards and associated stalls -> diversified pipelines.
- Performance lost due to the in-order pipeline: unnecessary stalls -> dynamic pipelines.

Parallel Pipelines
- [Figure: (a) no parallelism, (b) temporal parallelism, (c) spatial parallelism, (d) parallel pipeline.]

Intel Pentium Parallel Pipeline
- [Figure: shared IF and D1 stages feeding two pipes, the U-pipe and the V-pipe, each with its own D2, EX, and WB stages.]

Diversified Pipelines
- [Figure: an in-order front end dispatching into separate EX, MEM1-MEM2, FP1-FP2-FP3, and BR execution pipes.]

Power4 Diversified Pipelines
- [Figure: fetch queue, I-cache, branch scan and branch predictor; FP, FX/LD 1, FX/LD 2, and BR/CR issue queues; FX1, FX2, LD1, LD2, FP1, FP2, CR, and BR units; store queue and D-cache; reorder buffer.]

Diversified Pipelines (cont.)
- Separate execution pipelines: simple integer, memory, FP, ...
- Advantages:
  - Reduce instruction latency: each instruction advances as soon as possible.
  - Eliminate the need for forwarding paths.
  - Eliminate some unnecessary stalls, e.g. a slow FP instruction does not block independent integer instructions.
- Disadvantages??

In-order Issue into Diversified Pipelines
- [Figure: an in-order instruction stream, each entry carrying its function Fn(RS, RT), destination register, functional unit, and source registers, issuing into INT, Fadd1-Fadd2, Fmult1-Fmult3, and LD/ST pipes.]
- The issue stage needs to check (see the sketch below):
  1. Structural dependence
  2. RAW hazard
  3. WAW hazard
  4. WAR hazard
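The four checks above can be made concrete with a small sketch. This is an illustrative scoreboard-style model, not code from the lecture; the names (Inst, fu_busy, pending_write, pending_read) and the single-issue framing are assumptions made here for clarity.

    /* Issue-stage hazard checks for in-order issue into diversified pipelines.
     * fu_busy[] models structural hazards (a functional unit that cannot accept
     * a new instruction); pending_write[]/pending_read[] track registers that
     * older, still in-flight instructions will write or still need to read. */
    #include <stdbool.h>

    #define NUM_REGS 32
    #define NUM_FUS   8

    typedef struct {
        int fu;            /* functional unit this instruction needs */
        int src1, src2;    /* source registers (-1 if unused)        */
        int dest;          /* destination register (-1 if none)      */
    } Inst;

    bool fu_busy[NUM_FUS];
    bool pending_write[NUM_REGS];
    bool pending_read[NUM_REGS];

    /* Returns true if the instruction may issue this cycle. */
    bool can_issue(const Inst *i)
    {
        if (fu_busy[i->fu])                         return false; /* 1. structural */
        if (i->src1 >= 0 && pending_write[i->src1]) return false; /* 2. RAW        */
        if (i->src2 >= 0 && pending_write[i->src2]) return false; /* 2. RAW        */
        if (i->dest >= 0 && pending_write[i->dest]) return false; /* 3. WAW        */
        if (i->dest >= 0 && pending_read[i->dest])  return false; /* 4. WAR        */
        return true;
    }

When an instruction does issue, it would set fu_busy for its unit, mark pending_write for its destination, and mark pending_read for its sources until the operands are actually read; later pipeline events clear those bits again.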
Dynamic Pipelines
- [Figure: a dispatch buffer (filled in order, drained out of order) feeding the diversified EX, MEM1-MEM2, FP1-FP2-FP3, and BR pipes, which complete into a reorder buffer (filled out of order, drained in order).]

Designs of Inter-stage Buffers
- Scalar pipeline buffer: a single entry (a simple register); in order in, in order out.
- In-order parallel buffers: n entries (a wide register or FIFO); in order in, in order out.
- Out-of-order pipeline stages: a buffer of >= n entries (multiported SRAM and CAM); any order in, any order out.

The Challenges of Out-of-Order
- Program order:
    Ia: F1 <- F2 x F3
    Ib: F1 <- F4 + F5
- Out-of-order completion (the Fadd pipe is shorter than the Fmult pipe, so Ib finishes first):
    Ib: F1 <- F4 + F5
    Ia: F1 <- F2 x F3
- What is the value of F1? WAW hazard!!!

Dynamic-Static Interface
- DSI = ISA: the contract between the program and the machine, separating architectural state from microarchitecture state.
- Architectural state requirements:
  - Support sequential instruction execution semantics.
  - Support precise servicing of exceptions and interrupts.
- Buffering is needed between the architectural and microarchitectural states: the reorder buffer (ROB).
  - Allow the uarch state to deviate from the arch state.
  - Be able to undo speculative uarch state if needed.
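To make the ROB's role concrete, here is a minimal circular-buffer sketch. It is illustrative only, not the lecture's (or any real machine's) design: it assumes each instruction writes at most one register and ignores renaming, memory, and exceptions; all names (RobEntry, rob_dispatch, rob_complete, rob_retire, rob_squash) are invented for this sketch.

    /* Minimal reorder buffer: allocate in program order at the tail, complete
     * out of order, retire in order from the head. Architectural registers are
     * written only at retirement, so squashing the ROB discards speculative
     * microarchitectural state without corrupting architectural state. */
    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 32
    #define NUM_REGS 32

    typedef struct {
        bool    valid;      /* entry is occupied                  */
        bool    completed;  /* result has been produced           */
        int     dest;       /* architectural destination register */
        int64_t value;      /* result value, once completed       */
    } RobEntry;

    static RobEntry rob[ROB_SIZE];
    static int head, tail;              /* oldest entry / next free slot */
    static int64_t arch_reg[NUM_REGS];  /* architectural state           */

    /* Dispatch: allocate a ROB entry in program order; returns its index, or -1 if full. */
    int rob_dispatch(int dest)
    {
        if (rob[tail].valid) return -1;
        rob[tail] = (RobEntry){ .valid = true, .completed = false, .dest = dest };
        int idx = tail;
        tail = (tail + 1) % ROB_SIZE;
        return idx;
    }

    /* Finish: a functional unit delivers its result, possibly out of order. */
    void rob_complete(int idx, int64_t value)
    {
        rob[idx].value = value;
        rob[idx].completed = true;
    }

    /* Retire: commit completed entries to architectural state, strictly in order. */
    void rob_retire(void)
    {
        while (rob[head].valid && rob[head].completed) {
            arch_reg[rob[head].dest] = rob[head].value;  /* arch state updated only here */
            rob[head].valid = false;
            head = (head + 1) % ROB_SIZE;
        }
    }

    /* Squash: undo all speculative (uncommitted) state, e.g. after a misprediction. */
    void rob_squash(void)
    {
        for (int i = 0; i < ROB_SIZE; i++) rob[i].valid = false;
        tail = head;
    }

In the WAW example above, Ia and Ib would occupy successive ROB entries; even if Ib's add completes first, rob_retire() writes F1 with Ia's product before Ib's sum, so the architectural value of F1 is the one program order demands.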
Modern Superscalar Processor
- [Figure: fetch -> decode -> dispatch -> issue -> execute -> finish -> complete -> retire. Fetch through dispatch is in order, issue through finish is out of order, and complete/retire is in order again; the regions are decoupled by an instruction buffer, a dispatch buffer, reservation stations, a reorder/completion buffer, and a store buffer.]

Impediments to Superscalar Performance
- [Figure: the same penalty loops reappear: the ALU penalty, load penalty, and branch penalty, organized around instruction flow (branch predictor, I-cache, FETCH, DECODE, instruction buffer), register data flow (integer, floating-point, media, and memory units in EXECUTE), and memory data flow (reorder buffer (ROB), store queue, D-cache, COMMIT).]

2. Limits of Instruction-Level Parallelism

Amdahl's Law
- [Figure: execution time split between a fraction 1 - h spent on one processor and a fraction h spent on N processors.]
- Speedup = 1 / ((1 - h) + h/N), where N is the number of processors and h is the parallelizable (vectorizable) fraction.

Revisit Amdahl's Law
- As N -> infinity, Speedup -> 1 / (1 - h): the sequential fraction becomes the bottleneck.

Pipelined Processor Performance Model
- [Figure: execution time split between a fraction 1 - g with the pipeline stalled and a fraction g with the pipeline filled.]
- With pipeline depth k and g the fraction of time the pipeline is filled: Speedup = 1 / ((1 - g) + g/k).

Motivation for Superscalar [Agerwala and Cocke]
- [Figure: speedup vs. vectorizability for several machine configurations (n = 100, n = 6 with s = 2, n = 6, n = 4, ...), with the typical range of program vectorizability marked.]

Superscalar Proposal / Limits on Instruction-Level Parallelism (ILP)
- Reported ILP speedups S:
  - Weiss and Smith [1984]: 1.58
  - Sohi and Vajapeyam [1987]: 1.8
  - Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
  - Tjaden and Flynn [1973]: 1.96
  - Uht [1986]: 2.00
  - Smith et al. [1989]: 2.00
  - Jouppi and Wall [1988]: 2.40
  - Johnson [1991]: 2.50
  - Acosta et al. [1986]: 2.79
  - Wedig [1982]: 3.00
  - Butler et al. [1991]: 5.8
  - Melvin and Patt [1991]: 6
  - Wall [1991]: 7 (Jouppi disagreed)
  - Kuck et al. [1972]: 8
  - Riseman and Foster [1972]: 51 (no control dependences)
  - Nicolau and Fisher [1984]: 90 (Fisher's optimism)

The Ideas Behind Modern Processors
- Superscalar, or wide, instruction issue: ideal IPC = n (CPI = 1/n).
- Diversified pipelines: different instructions go through different pipe stages; instructions go through only the stages they need.
- Out-of-order, or data-flow, execution: stall only on RAW hazards and structural hazards.
- Speculation: overcome (some) RAW hazards through prediction.
- And it all relies on instruction-level parallelism (ILP): independent instructions within sequential programs.

Architectures for Instruction-Level Parallelism
- Scalar pipeline (baseline): instruction parallelism = D, operation latency = 1, peak IPC = 1.
- [Figure: successive instructions (IF, DE, EX, WB) plotted against time in cycles of the baseline machine.]

Superpipelined Processors
- Superpipelined execution: IP = D x M; operation latency = M minor cycles (1 major cycle = M minor cycles); peak IPC = 1 per minor cycle (M per baseline cycle).

Superscalar Processors
- Superscalar (pipelined) execution: IP = D x N; operation latency = 1 baseline cycle; peak IPC = N per baseline cycle.

Superscalar and Superpipelined
- Superscalar parallelism: operation latency = 1, issuing rate = N, superscalar degree SSD = N (determined by the issue rate).
- Superpipeline parallelism: operation latency = M, issuing rate = 1, superpipelined degree SPD = M (determined by the operation latency).
- [Figure: superscalar vs. superpipelined execution of fetch, decode, execute, and writeback over time in cycles of the base machine.]
- Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if n = m then both have about the same IPC.

3. Superscalar Microprocessor Landscapes

Iron Law of Processor Performance
- 1 / Performance = Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle) = (inst. count) x (CPI) x (cycle time).
- Equivalently, Performance = (IPC x GHz) / inst. count.
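As a quick numeric illustration of the Iron Law, the sketch below compares two hypothetical designs running the same program; the instruction count, CPIs, and frequencies are made-up numbers, not data from the lecture.

    /* Iron Law: Time = (instructions) x (CPI) x (cycle time) = IC x CPI / f. */
    #include <stdio.h>

    int main(void)
    {
        double insts = 1e9;                      /* same program on both designs */

        double cpi_a = 1.2, freq_a = 2.0e9;      /* A: shallower pipe, 2.0 GHz   */
        double cpi_b = 1.6, freq_b = 3.0e9;      /* B: deeper pipe, 3.0 GHz, worse
                                                    CPI (bigger branch penalty)  */

        double time_a = insts * cpi_a / freq_a;  /* 0.600 s */
        double time_b = insts * cpi_b / freq_b;  /* 0.533 s */

        printf("A: %.3f s   B: %.3f s\n", time_a, time_b);
        return 0;
    }

Design B wins here despite its worse CPI because the shorter cycle time more than compensates; the landscape plots that follow show exactly this IPC-versus-frequency trade-off across real processor families.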
Landscape of Microprocessor Families (SPECint92)
- [Figure: SPECint92 per MHz vs. frequency (MHz) for several microprocessor families.]

Landscape of Microprocessor Families (SPECint95)
- [Figure: SPECint95 per MHz vs. frequency (MHz); data points include the Pentium, Pentium Pro, PII, PIII, Athlon, and Alpha parts, grouped into Intel x86, AMD x86, and Alpha families.]

Landscape of Microprocessor Families (SPECint2000)
- [Figure: SPECint2000 per MHz vs. frequency (MHz); data points include the PIII-Xeon, Pentium 4, Athlon, Alpha 21264A/21264C, PowerPC 604e, Itanium, and SPARC-III, grouped into Intel x86, AMD x86, Alpha, Power, SPARC, and IPF families.]

Frequency vs. Parallelism
- Increase frequency (GHz): deeper pipelines -> increased overall latency and lower IPC.
- Increase instruction parallelism (IPC): wider pipelines -> increased complexity and lower GHz.

Deeper and Wider Pipelines
- [Figure: the Fetch - Decode - Dispatch - Execute - Memory - Retire pipeline drawn deeper and wider; the branch mispredict penalty spans the stages from fetch to execute.]

Front-End Pipe-Depth Penalty
- [Figure: the front-end stages of the deep pipeline (fetch through execute) set the branch mispredict penalty; the remedies are front-end contraction and back-end optimization.]

Alleviate Pipe-Depth Penalty
- Front-end contraction:
  - Code re-mapping and caching.
  - Trace construction, caching, and optimization.
  - Leverage back-end optimizations.
- Back-end optimization:
  - Multiple-branch, trace, and stream prediction.
  - Code reordering, alignment, and optimization.
  - Pre-decode, pre-rename, pre-scheduling.
  - Memory prefetch prediction and control.

Execution Core Improvement
- Super-pipelined design.
- Very high-speed arithmetic units.
- Speculative out-of-order execution.
- Criticality-based data caching.
- Aggressive data prefetching.

Next Lecture
- Superscalar pipeline implementation: instruction fetch, instruction decode, instruction dispatch, instruction execute, instruction complete and retire.
- Instruction flow techniques.