Lab 3 - 3/21/2007 - Jonathan Ragan-Kelley (jrk@mit.edu)


1. Baseline processor

(Initial place+route areas are given for the default configuration with 5ns timing.)

* Total area: 18010.0 abstract units

* Post-synthesis critical paths:
    instRespQ_valid_reg -[regfile read]-> wbQ_f0_reg
    instRespQ_valid_reg -[regfile read]-> pc_reg
    Data required time:          11.81 ns
    Slack (MET):                  0.04 ns
    Clock network delay (ideal): 12.00 ns

* Post-place+route total logic area:
    (no fp): 352012.9 um^2    (w/ fp): 327232.2 um^2    (Effective clock: 5.313 ns)

* Post-place+route total square area:
    (no fp): 474513.4 um^2    (w/ fp): 411784.1 um^2    (Effective clock: 5.313 ns)

* Post-place+route critical path (no fp): instRespQ_valid_reg -[regfile read]-> wbQ_f0_reg
    Effective clock: 14 ns, 244322.6 um^2
    Effective clock: 13 ns, 250851.8 um^2
    12 ns: timing violation

* Post-place+route critical path (w/ fp): instRespQ_valid_reg -[regfile read]-> wbQ_f0_reg
    Effective clock: 5.800 ns
    (I have been unable to re-activate floorplanning after turning it off for the
    first time, so I cannot generate results with the new 13ns clock.)

I predicted that the combinational path from regfile read through the ALUs to regfile write would be the critical path. Though this 3-stage pipeline (with a separate writeback stage) differs from the 2-stage pipeline for which I made my initial prediction, the critical path corresponds to the same basic path.

Module area (w/ fp), in um^2 (area%):

  Level 0 Module mkProc       327232.2  (100.0%)
  Level 1 Module dataRespQ      6676.5  (  2.0%)
  Level 1 Module pcQ            7921.5  (  2.4%)
  Level 1 Module rf_rfile     157079.1  ( 48.0%)
  Level 1 Module add_2006       6350.4  (  1.9%)
  Level 1 Module sub_2037       6541.7  (  2.0%)
  Level 1 Module add_2034       6152.8  (  1.9%)
  Level 1 Module add_881        1423.7  (  0.4%)
  Level 1 Module add_2053       3797.7  (  1.2%)
  Level 1 Module lt_2103        2374.0  (  0.7%)
  Level 1 Module lt_2099        2442.9  (  0.7%)
  Level 1 Module lt_2091        2295.6  (  0.7%)
  Level 1 Module lt_2086        2590.3  (  0.8%)
  = Total level 1             205174.6  ( 62.7%)
  + overhead                  122057.6  ( 37.3%)

The register file dominates, occupying effectively half of the entire design (w/ fp). The behavioral ALU elements and the level-1 queues make up only an additional 15% of the design, leaving what would seem to be "level-0" state/logic, routing/wiring overhead, clock distribution, etc. (or dead space) occupying more than a third of the chip area.

I iteratively relaxed and then tightened the clock to find the minimum period with no setup violations: I relaxed up to 14ns, then was able to tighten back to 13ns. The corresponding areas are reported above. Area generally *decreases* as the clock is relaxed, because fewer drivers need to be added to reduce delay on long wires when the delay constraints are less extreme.


2. Evaluate a simple branch-predictor

I first (accidentally) ran through place+route with my experimental large BHT from lab 2. This clearly produced a very large chip, but most of it was the bht_rfile which stored the large table:

  Level 0 Module mkProc     2985732.3
  Level 1 Module bht_rfile  2538454.0

With the standard 4-element BHT, the notable area results are:

  Level 0 Module mkProc      399121.9
  Level 1 Module bht_rfile    31363.1

vs. 352012.9 for the original processor with no BHT and no floorplanning. The significant majority of this ~13% area increase is directly attributable to the bht_rfile itself.
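As a sanity check on that attribution, the increase can be computed directly from the areas quoted above (a minimal Python sketch; only numbers already reported here are used):

    # Sanity check on the 4-entry-BHT area increase, using only the
    # post-place+route areas quoted above (no floorplanning, um^2).
    base_area = 352012.9   # original processor, no BHT
    bht_area  = 399121.9   # processor with the 4-element BHT
    bht_rfile =  31363.1   # the bht_rfile module alone

    increase = bht_area - base_area
    print("area increase: %.1f um^2 (%.1f%% of the baseline)"
          % (increase, 100.0 * increase / base_area))
    print("bht_rfile accounts for %.0f%% of that increase"
          % (100.0 * bht_rfile / increase))

This gives a ~47000 um^2 (13.4%) increase, of which the bht_rfile itself is roughly two thirds.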
Same critical path, but a faster clock:

  13ns: works (as without the BHT)
    Data required time: 13.996  (skew: -0.004, adjusted 1 cycle)
    Data arrival time:  13.870
    Slack:               0.126
  12ns: works (better than without the BHT)
    Data required time: 12.001  (skew: 0.001, adjusted 1 cycle)
    Data arrival time:  11.963
    Slack:               0.038
    (290431.2 um^2)

Though the BHT increases area and only provides modest IPC gains in this poorly-pipelined processor, it actually happens to *decrease* the effective cycle time from 13ns to 12ns, improving performance in both IPC and cycles/sec. The area increases by 15.7% relative to the 13ns build with no BHT (250851.8 um^2, reported above). In lab 2, I observed up to ~5% IPC improvements. Combined with the 8.3% clock speed improvement, this provides a theoretical throughput improvement of roughly 13% in the best-case benchmarks for a 15.7% area increase. This definitely seems worthwhile, but it hinges on the (strange) increase in clock speed with the BHT.

My IPC improvements were greater than some others', so I did observe some benefit from the branch predictor even without much rule concurrency. However, this was mostly because of the relatively long mispredict penalty from pc_gen, out to instruction memory, and back into execute. The benefit is much larger when the mispredict propagates further through execution.


3. Refining using Ephemeral History Registers

lab2 bpred (no parallelism):
  median.smips.out:   ipc = 0.555164
  multiply.smips.out: ipc = 0.566079
  qsort.smips.out:    ipc = 0.528172
  towers.smips.out:   ipc = 0.502844
  vvadd.smips.out:    ipc = 0.524191

  Level 0 Module mkProc  290431.2 um^2

EHR parallelism (no bpred):
  median.smips.out:   ipc = 0.729952
  multiply.smips.out: ipc = 0.731256
  qsort.smips.out:    ipc = 0.775608
  towers.smips.out:   ipc = 0.806893
  vvadd.smips.out:    ipc = 0.763885

  Still comfortably meets the 13ns clock:
    Data arrival time: 12.860
    Slack:              0.167

  Level 0 Module mkProc     338945.2 um^2
  Level 1 Module dataRespQ    6767.5 um^2
  Level 1 Module epoch_r      1113.3 um^2
  Level 1 Module pcQ          6617.0 um^2
  Level 1 Module pc_r         5337.5 um^2
  Level 1 Module rf_r_1       4829.4 um^2
  Level 1 Module rf_r_10      4911.0 um^2
  Level 1 Module rf_r_11      4879.6 um^2
  ...
  ~5000 um^2 per register-file EHR ~= 160000 um^2 total

This is a net level-0 mkProc increase of ~48000 um^2 due to the addition of EHR parallelism.

EHR parallelism with bpred:
  median.smips.out:   ipc = 0.734015
  multiply.smips.out: ipc = 0.731256
  qsort.smips.out:    ipc = 0.844618
  towers.smips.out:   ipc = 0.806893
  vvadd.smips.out:    ipc = 0.826111

  Still comfortably meets the 13ns clock:
    Data arrival time: 12.871
    Slack:              0.083
    Startpoint: wbQ_data0_r/r_reg0_reg[37]/Q
                (clocked by ideal_clock1 R, latency: 0.650)
    Endpoint:   pc_r/r_reg0_reg[29]/D
                (Setup time: 0.164, clocked by ideal_clock1 R, latency: 0.603)

Cycle time doesn't change, but the critical path shifts from the main paths through the execute stage to updating the PC EHR, which is somehow dependent on the writeback queue and runs through register file writing in the middle. This is a dramatic change from the non-EHR design, which is entirely execute-bound, and it is likely due to the even longer paths introduced by the effective bypassing of values through the EHR registers and regfile.

  Level 0 Module mkProc       392693.1 um^2
  Level 1 Module bht_rfile     27502.7 um^2
  Level 1 Module pcQ_data0_r    4647.6 um^2
  Level 1 Module pcQ_data1_r    3606.4 um^2
  Level 1 Module pc_r           7059.1 um^2

The area increase with the BHT EHR implementation is due not only to the BHT, but also to the conversion of pcQ (and takenQ) from conventional FIFOs to EHR SFIFOs.
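The per-benchmark effect of EHR parallelism and of the BHT can be read off the IPC tables above; a small Python sketch makes the comparison explicit (the dictionary keys are shortened from the .smips.out names, and only the IPC values listed above are used) and is referenced in the discussion that follows:

    # Per-benchmark IPC speedups implied by the tables above:
    # EHR parallelism vs. the lab-2 baseline, and the BHT's additional
    # contribution on top of EHR parallelism.
    lab2   = {"median": 0.555164, "multiply": 0.566079, "qsort": 0.528172,
              "towers": 0.502844, "vvadd": 0.524191}
    ehr    = {"median": 0.729952, "multiply": 0.731256, "qsort": 0.775608,
              "towers": 0.806893, "vvadd": 0.763885}
    ehr_bp = {"median": 0.734015, "multiply": 0.731256, "qsort": 0.844618,
              "towers": 0.806893, "vvadd": 0.826111}

    for b in sorted(lab2):
        print("%-9s EHR vs lab2: %+5.1f%%   +BHT vs EHR: %+5.1f%%"
              % (b,
                 100.0 * (ehr[b] / lab2[b] - 1.0),
                 100.0 * (ehr_bp[b] / ehr[b] - 1.0)))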
The pc EHR also grows in this example, likely because it is now the end of the critical path (in writeback) and so needs more drivers to offset delay.

However, the BHT increases performance substantially in two benchmarks -- almost 9% in qsort and about 8% in vvadd -- and more modestly in median. The branch predictor is indeed most significant when there is actual parallel execution in the design. In the effectively multi-stage design in lab 2, the predictor mostly serves to avoid mispredict penalties between the first and second stages (through potentially multiple FIFO stages). Here it also increases useful concurrency throughout the pipeline by avoiding unnecessary stalls and bubbles through the entire execution pipe.

In principle, this EHR technique is much more effective than resizing FIFOs because:
1) it provides much more parallelism than could even exist with a single, non-EHR regfile, and
2) it does not unnecessarily lengthen the pipeline (increasing mispredict penalties, etc.).


4. Using RC modeling to design a register file write bit-line driver

None of the paths recorded in my postroute_setup_timing.rpt for the non-EHR builds include the third stage/register writeback -- the report stops after 200 paths, all dominated by the execute stage. The EHR+BHT design is made more complex by the introduction of EHRs, which shifts more load to the writeback stage. The load problem is worsened there because the multi-stage nature of the EHR can mean that the write bit line drives not just a single flip-flop per register, but effectively several, for the _0, _1, etc. stages.

  net    wireCap   pinCap    totalCap  netLen   wireCapPerLen  nrFanout
  n4557  0.117304  0.071000  0.188304  871.360  1.346e-04      9

The net driving the regfile write bit line in this critical path, n4557, has a total capacitance of 0.188 pF.

The critical path of my 3-stage design without EHRs fares much better than the 2-stage design because the write bit-line driver delay is moved into a separate writeback stage, independent of the critical path from pcQ through register reads and the ALUs. That path only needs to drive a *single* element (the wbQ register), rather than a 32-bit line of registers in a register file.

We can create a distributed RC model of the given 32-bit line (of DENRQ1 flip-flops) as follows. The cell is 7 gates wide = 15.7um, the gate height is given as 5.6um, and the D pin capacitance is 0.003pF. The model runs from the first to the last D pin, giving a chain of 32 DENRQ1 gates connected by 31 15.7um wires (since the gates are tightly packed, each wire segment is exactly as long as a gate is wide), plus a half-gate-wide wire connecting the left edge to the first D pin. Each gate is represented as a 0.003pF capacitor, and each wire as a 6.27ohm resistor with 2 parallel parasitic capacitors of 0.003pF/2 = 0.0015pF each (based on the provided wire parameters).

[diagram]

Using a lumped model for the capacitors gives: 0.0015pF + 32*0.003pF + 31*(2*0.0015pF) ~= 0.191pF.

We can design a reasonable inverter chain based on the given minimal input inverter, with input capacitance (0.36um + 0.72um)*1.5fF/um = 0.0016pF, and a gate scale factor of 4 per stage (from the FO4 rule). We size this chain to drive the ~0.191pF bit-line load: 3 stages can drive 0.104pF, while 4 can drive 0.415pF and does not require logical inversion of the input signal at the destination.
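As a quick check of those two figures, here is a minimal Python sketch of the lumped-capacitance sum and of the FO4-scaled load each candidate chain length can drive (all constants are the ones quoted above):

    # Lumped capacitance of the 32-flip-flop write bit line, and the load a
    # 4x-scaled (FO4) inverter chain of a given length can drive.
    GATE_CAP = 0.003        # pF, DENRQ1 D-pin capacitance
    WIRE_CAP = 2 * 0.0015   # pF per 15.7um wire segment
    N_GATES, N_WIRES = 32, 31

    c_bitline = 0.5 * WIRE_CAP + N_GATES * GATE_CAP + N_WIRES * WIRE_CAP
    print("lumped bit-line capacitance: %.4f pF" % c_bitline)   # ~0.191 pF

    c_in = (0.36 + 0.72) * 1.5e-3   # pF, minimal input inverter (~0.0016 pF)
    for stages in (3, 4):
        # With a scale factor of 4, the last stage can drive 4x its own input cap.
        print("%d stages can drive ~%.3f pF" % (stages, c_in * 4.0 ** stages))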
4 inverters, each scaled 4x from the given minimal input inverter, give sizes of:

  NMOS: 0.36um, 1.44um,  5.76um, 23.04um
  PMOS: 0.72um, 2.88um, 11.52um, 46.08um

The driver resistance is 144 ohms, and using the pi model to evaluate the entire system:

  wire        = 31.5 cells = 0.098pF, 197.6 ohms
  gate load   = 0.003pF
  total delay = 0.025ns
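A minimal Python sketch of this pi-model (Elmore) estimate, using only the values above, reproduces the ~0.025ns figure; it assumes the raw RC time constant is what was reported (no 50%-swing scaling factor applied):

    # Pi-model (Elmore) delay estimate for the sized driver plus bit-line wire,
    # using the values quoted above.  This computes the raw RC time constant
    # (no 50%-swing factor), which matches the 0.025ns figure reported.
    R_DRV  = 144.0       # ohms, final-stage driver resistance
    R_WIRE = 197.6       # ohms, 31.5 cells * 6.27 ohm
    C_WIRE = 0.098e-12   # F, distributed wire capacitance
    C_GATE = 0.003e-12   # F, far-end gate load

    # Pi model: half the wire cap lumped at each end of the wire resistance.
    # The driver resistance charges everything; the wire resistance only
    # charges the far half of the wire cap plus the gate load.
    delay = R_DRV * (C_WIRE + C_GATE) + R_WIRE * (C_WIRE / 2 + C_GATE)
    print("estimated write bit-line delay: %.3f ns" % (delay * 1e9))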