Lab 3 - 3/21/2007 - Jonathan Ragan-Kelley (jrk@mit.edu)


1. Baseline processor

(Initial place+route areas are given for the default configuration with 5ns timing.)

* Total area: 18010.0 abstract units

* Post-synthesis critical paths:
    instRespQ_valid_reg -[regfile read]-> wbQ_f0_reg
    instRespQ_valid_reg -[regfile read]-> pc_reg
    Data required time:          11.81 ns
    Slack (MET):                  0.04 ns
    Clock network delay (ideal): 12.00 ns

* Post-place+route total logic area:
    (no fp): 352012.9 um^2    (w/ fp): 327232.2 um^2    (Effective clock: 5.313 ns)

* Post-place+route total square area:
    (no fp): 474513.4 um^2    (w/ fp): 411784.1 um^2    (Effective clock: 5.313 ns)

* Post-place+route critical path (no fp): instRespQ_valid_reg -[regfile read]-> wbQ_f0_reg
    Effective clock: 14 ns, 244322.6 um^2
    Effective clock: 13 ns, 250851.8 um^2
    12 ns: timing violation

* Post-place+route critical path (w/ fp): instRespQ_valid_reg -[regfile read]-> wbQ_f0_reg
    Effective clock: 5.800 ns
    (I have been unable to re-activate floorplanning after turning it off for the
    first time, so I cannot generate results with the new 13ns clock.)

I predicted that the combinational path from regfile read through the ALUs to regfile write would be the critical path. Though this 3-stage pipeline (with a separate writeback stage) differs from the 2-stage pipeline for which I made my initial prediction, the critical path corresponds to the same basic path.

Module area (w/ fp), in um^2 (area%):

  Level 0 Module mkProc       327232.2  (100.0%)
  Level 1 Module dataRespQ      6676.5  (  2.0%)
  Level 1 Module pcQ            7921.5  (  2.4%)
  Level 1 Module rf_rfile     157079.1  ( 48.0%)
  Level 1 Module add_2006       6350.4  (  1.9%)
  Level 1 Module sub_2037       6541.7  (  2.0%)
  Level 1 Module add_2034       6152.8  (  1.9%)
  Level 1 Module add_881        1423.7  (  0.4%)
  Level 1 Module add_2053       3797.7  (  1.2%)
  Level 1 Module lt_2103        2374.0  (  0.7%)
  Level 1 Module lt_2099        2442.9  (  0.7%)
  Level 1 Module lt_2091        2295.6  (  0.7%)
  Level 1 Module lt_2086        2590.3  (  0.8%)
  = Total level 1             205174.6  ( 62.7%)
  + overhead                  122057.6  ( 37.3%)

The register file dominates, occupying effectively half of the entire design (w/ fp). The behavioral ALU elements and the level-1 queues make up only an additional 15% of the design, leaving what would seem to be "level-0" state/logic, routing/wiring overhead, clock distribution, etc. (or dead space) occupying more than a third of the chip area.

I iteratively relaxed and then tightened the clock to find the minimum period with no setup violations: I relaxed up to 14ns, then was able to tighten back to 13ns. The corresponding areas are reported above. Area generally *decreases* as the clock is relaxed, because fewer drivers need to be added to reduce delay on long wires when the delay constraints are less extreme.


2. Evaluate a simple branch-predictor

I first (accidentally) ran through place+route with my experimental large BHT from lab 2. This clearly produced a very large chip, but most of it was the bht_rfile which stored the large table:

  Level 0 Module mkProc     2985732.3
  Level 1 Module bht_rfile  2538454.0

With the standard 4-element BHT, the notable area results are:

  Level 0 Module mkProc      399121.9
  Level 1 Module bht_rfile    31363.1

vs. 352012.9 for the original processor with no BHT and no floorplanning. The significant majority of this ~13% area increase is directly attributable to the bht_rfile itself.
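As a sanity check on that attribution, the increase can be computed directly from the areas quoted above (a minimal Python sketch; only numbers already reported here are used):

    # Sanity check on the 4-entry-BHT area increase, using only the
    # post-place+route areas quoted above (no floorplanning, um^2).
    base_area = 352012.9   # original processor, no BHT
    bht_area  = 399121.9   # processor with the 4-element BHT
    bht_rfile =  31363.1   # the bht_rfile module alone

    increase = bht_area - base_area
    print("area increase: %.1f um^2 (%.1f%% of the baseline)"
          % (increase, 100.0 * increase / base_area))
    print("bht_rfile accounts for %.0f%% of that increase"
          % (100.0 * bht_rfile / increase))

This gives a ~47000 um^2 (13.4%) increase, of which the bht_rfile itself is roughly two thirds.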
Same critical path, but a faster clock:

  13ns: works (as without the BHT)
    Data required time: 13.996  (skew: -0.004, adjusted 1 cycle)
    Data arrival time:  13.870
    Slack:               0.126
  12ns: works (better than without the BHT)
    Data required time: 12.001  (skew: 0.001, adjusted 1 cycle)
    Data arrival time:  11.963
    Slack:               0.038
    (290431.2 um^2)

Though the BHT increases area and only provides modest IPC gains in this poorly-pipelined processor, it actually happens to *decrease* the effective cycle time from 13ns to 12ns, improving performance in both IPC and cycles/sec. The area increases by 15.7% relative to the 13ns build with no BHT (250851.8 um^2, reported above). In lab 2, I observed up to ~5% IPC improvements. Combined with the 8.3% clock speed improvement, this provides a theoretical throughput improvement of roughly 13% in the best-case benchmarks for a 15.7% area increase. This definitely seems worthwhile, but it hinges on the (strange) increase in clock speed with the BHT.

My IPC improvements were greater than some others', so I did observe some benefit from the branch predictor even without much rule concurrency. However, this was mostly because of the relatively long mispredict penalty from pc_gen, out to instruction memory, and back into execute. The benefit is much larger when the mispredict propagates further through execution.


3. Refining using Ephemeral History Registers

lab2 bpred (no parallelism):
  median.smips.out:   ipc = 0.555164
  multiply.smips.out: ipc = 0.566079
  qsort.smips.out:    ipc = 0.528172
  towers.smips.out:   ipc = 0.502844
  vvadd.smips.out:    ipc = 0.524191

  Level 0 Module mkProc  290431.2 um^2

EHR parallelism (no bpred):
  median.smips.out:   ipc = 0.729952
  multiply.smips.out: ipc = 0.731256
  qsort.smips.out:    ipc = 0.775608
  towers.smips.out:   ipc = 0.806893
  vvadd.smips.out:    ipc = 0.763885

  Still comfortably meets the 13ns clock:
    Data arrival time: 12.860
    Slack:              0.167

  Level 0 Module mkProc     338945.2 um^2
  Level 1 Module dataRespQ    6767.5 um^2
  Level 1 Module epoch_r      1113.3 um^2
  Level 1 Module pcQ          6617.0 um^2
  Level 1 Module pc_r         5337.5 um^2
  Level 1 Module rf_r_1       4829.4 um^2
  Level 1 Module rf_r_10      4911.0 um^2
  Level 1 Module rf_r_11      4879.6 um^2
  ...
  ~5000 um^2 per register-file EHR ~= 160000 um^2 total

This is a net level-0 mkProc increase of ~48000 um^2 due to the addition of EHR parallelism.

EHR parallelism with bpred:
  median.smips.out:   ipc = 0.734015
  multiply.smips.out: ipc = 0.731256
  qsort.smips.out:    ipc = 0.844618
  towers.smips.out:   ipc = 0.806893
  vvadd.smips.out:    ipc = 0.826111

  Still comfortably meets the 13ns clock:
    Data arrival time: 12.871
    Slack:              0.083
    Startpoint: wbQ_data0_r/r_reg0_reg[37]/Q
                (clocked by ideal_clock1 R, latency: 0.650)
    Endpoint:   pc_r/r_reg0_reg[29]/D
                (Setup time: 0.164, clocked by ideal_clock1 R, latency: 0.603)

Cycle time doesn't change, but the critical path shifts from the main paths through the execute stage to updating the PC EHR, which is somehow dependent on the writeback queue and runs through register file writing in the middle. This is a dramatic change from the non-EHR design, which is entirely execute-bound, and it is likely due to the even longer paths introduced by the effective bypassing of values through the EHR registers and regfile.

  Level 0 Module mkProc       392693.1 um^2
  Level 1 Module bht_rfile     27502.7 um^2
  Level 1 Module pcQ_data0_r    4647.6 um^2
  Level 1 Module pcQ_data1_r    3606.4 um^2
  Level 1 Module pc_r           7059.1 um^2

The area increase with the BHT EHR implementation is due not only to the BHT, but also to the conversion of pcQ (and takenQ) from conventional FIFOs to EHR SFIFOs.
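The per-benchmark effect of EHR parallelism and of the BHT can be read off the IPC tables above; a small Python sketch makes the comparison explicit (the dictionary keys are shortened from the .smips.out names, and only the IPC values listed above are used) and is referenced in the discussion that follows:

    # Per-benchmark IPC speedups implied by the tables above:
    # EHR parallelism vs. the lab-2 baseline, and the BHT's additional
    # contribution on top of EHR parallelism.
    lab2   = {"median": 0.555164, "multiply": 0.566079, "qsort": 0.528172,
              "towers": 0.502844, "vvadd": 0.524191}
    ehr    = {"median": 0.729952, "multiply": 0.731256, "qsort": 0.775608,
              "towers": 0.806893, "vvadd": 0.763885}
    ehr_bp = {"median": 0.734015, "multiply": 0.731256, "qsort": 0.844618,
              "towers": 0.806893, "vvadd": 0.826111}

    for b in sorted(lab2):
        print("%-9s EHR vs lab2: %+5.1f%%   +BHT vs EHR: %+5.1f%%"
              % (b,
                 100.0 * (ehr[b] / lab2[b] - 1.0),
                 100.0 * (ehr_bp[b] / ehr[b] - 1.0)))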
The pc EHR also grows in this example, likely because it is now the end of the critical path (in writeback) and so needs more drivers to offset delay.

However, the BHT increases performance substantially in two benchmarks -- almost 9% in qsort and about 8% in vvadd -- and more modestly in median. The branch predictor is indeed most significant when there is actual parallel execution in the design. In the effectively multi-stage design in lab 2, the predictor mostly serves to avoid mispredict penalties between the first and second stages (through potentially multiple FIFO stages). Here it also increases useful concurrency throughout the pipeline by avoiding unnecessary stalls and bubbles through the entire execution pipe.

In principle, this EHR technique is much more effective than resizing FIFOs because:
1) it provides much more parallelism than could even exist with a single, non-EHR regfile, and
2) it does not unnecessarily lengthen the pipeline (increasing mispredict penalties, etc.).


4. Using RC modeling to design a register file write bit-line driver

None of the paths recorded in my postroute_setup_timing.rpt for the non-EHR builds include the third stage/register writeback -- the report stops after 200 paths, all dominated by the execute stage. The EHR+BHT design is made more complex by the introduction of EHRs, which shifts more load to the writeback stage. The load problem is worsened there because the multi-stage nature of the EHR can mean that the write bit line drives not just a single flip-flop per register, but effectively several, for the _0, _1, etc. stages.

  net    wireCap   pinCap    totalCap  netLen   wireCapPerLen  nrFanout
  n4557  0.117304  0.071000  0.188304  871.360  1.346e-04      9

The net driving the regfile write bit line in this critical path, n4557, has a total capacitance of 0.188 pF.

The critical path of my 3-stage design without EHRs fares much better than the 2-stage design because the write bit-line driver delay is moved into a separate writeback stage, independent of the critical path from pcQ through register reads and the ALUs. That path only needs to drive a *single* element (the wbQ register), rather than a 32-bit line of registers in a register file.

We can create a distributed RC model of the given 32-bit line (of DENRQ1 flip-flops) as follows. The cell is 7 gates wide = 15.7um, the gate height is given as 5.6um, and the D pin capacitance is 0.003pF. The model runs from the first to the last D pin, giving a chain of 32 DENRQ1 gates connected by 31 15.7um wires (since the gates are tightly packed, each wire segment is exactly as long as a gate is wide), plus a half-gate-wide wire connecting the left edge to the first D pin. Each gate is represented as a 0.003pF capacitor, and each wire as a 6.27ohm resistor with 2 parallel parasitic capacitors of 0.003pF/2 = 0.0015pF each (based on the provided wire parameters).

[diagram]

Using a lumped model for the capacitors gives: 0.0015pF + 32*0.003pF + 31*(2*0.0015pF) ~= 0.191pF.

We can design a reasonable inverter chain based on the given minimal input inverter, with input capacitance (0.36um + 0.72um)*1.5fF/um = 0.0016pF, and a gate scale factor of 4 per stage (from the FO4 rule). We size this chain to drive the ~0.191pF bit-line load: 3 stages can drive 0.104pF, while 4 can drive 0.415pF and does not require logical inversion of the input signal at the destination.
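As a quick check of those two figures, here is a minimal Python sketch of the lumped-capacitance sum and of the FO4-scaled load each candidate chain length can drive (all constants are the ones quoted above):

    # Lumped capacitance of the 32-flip-flop write bit line, and the load a
    # 4x-scaled (FO4) inverter chain of a given length can drive.
    GATE_CAP = 0.003        # pF, DENRQ1 D-pin capacitance
    WIRE_CAP = 2 * 0.0015   # pF per 15.7um wire segment
    N_GATES, N_WIRES = 32, 31

    c_bitline = 0.5 * WIRE_CAP + N_GATES * GATE_CAP + N_WIRES * WIRE_CAP
    print("lumped bit-line capacitance: %.4f pF" % c_bitline)   # ~0.191 pF

    c_in = (0.36 + 0.72) * 1.5e-3   # pF, minimal input inverter (~0.0016 pF)
    for stages in (3, 4):
        # With a scale factor of 4, the last stage can drive 4x its own input cap.
        print("%d stages can drive ~%.3f pF" % (stages, c_in * 4.0 ** stages))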
4 inverters, each scaled 4x from the given minimal input inverter, give sizes of:

  NMOS: 0.36um, 1.44um,  5.76um, 23.04um
  PMOS: 0.72um, 2.88um, 11.52um, 46.08um

The driver resistance is 144 ohms, and using the pi model to evaluate the entire system:

  wire        = 31.5 cells = 0.098pF, 197.6 ohms
  gate load   = 0.003pF
  total delay = 0.025ns
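A minimal Python sketch of this pi-model (Elmore) estimate, using only the values above, reproduces the ~0.025ns figure; it assumes the raw RC time constant is what was reported (no 50%-swing scaling factor applied):

    # Pi-model (Elmore) delay estimate for the sized driver plus bit-line wire,
    # using the values quoted above.  This computes the raw RC time constant
    # (no 50%-swing factor), which matches the 0.025ns figure reported.
    R_DRV  = 144.0       # ohms, final-stage driver resistance
    R_WIRE = 197.6       # ohms, 31.5 cells * 6.27 ohm
    C_WIRE = 0.098e-12   # F, distributed wire capacitance
    C_GATE = 0.003e-12   # F, far-end gate load

    # Pi model: half the wire cap lumped at each end of the wire resistance.
    # The driver resistance charges everything; the wire resistance only
    # charges the far half of the wire cap plus the gate load.
    delay = R_DRV * (C_WIRE + C_GATE) + R_WIRE * (C_WIRE / 2 + C_GATE)
    print("estimated write bit-line delay: %.3f ns" % (delay * 1e9))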