Motivation

Pipelining allows for a higher speedup over pure sequential processing.

Pipelining doesn’t help latency for a single task, but helps the throughput of the entire workload - by having multiple tasks operate simultaneously using different resources.

Possible delays:

  • Pipeline rate, limited by slowest pipeline stage
  • Stall for dependencies

MIPS Pipeline Stages

  1. IF Instruction Fetch
  2. ID Instruction Decode and Register Read
  3. EX Execute an operation, or calculate an address
  4. MEM Access an operand in data memory
  5. WB Write back the result into a register

(related to the Datapath stages)

Each execution stage will take 1 clock cycle.

PCAddressInstructionInstruction MemoryMUXAdd4RR1RR2WRWDRegistersRD1RD2Sign extendShift leftMUXAddALUisZeroresultAddressWrite dataData memoryRead DataMUXIFIDEXMEMWBIF/IDID/EXEX/MEMMEM/WBtwo register numbers16-bit offsetPC + 4two data values32-bit sign-extendedimmediatePC + 4ALU resultisZero? signaldata read 2PC + 4 + immediate x 4ALU resultMemory read datawrite registerwrite registerwrite registerRegDstALUSrcALUOpMemReadMemWriteBranchMemToRegRegWriteMemReadMemWriteBranchMemToRegRegWriteMemToRegRegWritedatacontrol signals IFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWBProgram flowTime

Implementation

  • One cycle per pipeline stage
  • Data required for each stage needs to be stored separately.

There are two types of data that needs to be stored:

  • data used by subsequent instructions
    • stored in programmer-visible elements
      • PC
      • registers
      • memory
  • data used by same instruction in later pipeline stages
    • stored in pipeline registers
      • IF/ID
      • ID/EX
      • EX/MEM
      • MEM/WB

Stages

In the ID stage:

  • IF/ID supplies:
    • register numbers for reading two registers
    • 16-bit offset to be sign-extended
    • PC + 4
  • ID/EX receives:
    • data values read from register file
    • 32-bit immediate value
    • PC + 4
    • write register

Here, the signals are generated.

In the EX stage:

  • ID/EX supplies:
    • data values read from register file
    • 32-bit immediate value
    • PC + 4
    • write register
    • signals
      • RegDst
      • ALUSrc
      • ALUop
      • MemRead
      • MemWrite
      • Branch
      • MemToReg
      • RegWrite
  • EX/MEM receives:
    • PC+4 + (IMM X 4)
    • ALU result
    • isZero?
    • data read 2
    • write register
    • signals
      • MemRead
      • MemWrite
      • Branch
      • MemToReg
      • RegWrite

In the WB stage:

  • EX/MEM supplies:
    • PC+4 + (IMM X 4)
    • ALU result
    • isZero?
    • data read 2
    • write register
    • signals
      • MemRead
      • MemWrite
      • Branch
      • MemToReg
      • RegWrite
  • MEM/WB receives:
    • ALU result
    • memory read data
    • write register
    • signals
      • MemToReg
      • RegWrite

Performance comparison

IFIDEXMEMWBIFIDEXMEMWBCycle 1Cycle 2Single-CycleIFIDEXMEMWBIFIDEXMEMWBMulti-CycleCycle 1Cycle 2instruction 2instruction 1instruction 2instruction 1PipelineIFIDEXMEMWBIFIDEXMEMWBIFIDEXMEMWB

Single-cycle processor

Cycle time:

where time for operation in stage , number of stages.

Execution time:

Since all the stages are in one single cycle, the time for operation is the maximum time of an instruction as a bottleneck.

Multicycle processor

Cycle time:

where is the time for operation in stage .

The cycle time is based on the stage with the most time taken for it.

Execution time:

Average CPI needed because each instruction takes different number of cycles.

Pipelining processor

Cycle time:

where is the time for operation in stage , and refers to the overhead for pipelining.

Since there is a pipeline register, there is overhead here.

Execution time:

where refers to the cycles needed for instructions.

Ideal speedup

Assumptions for ideal case

  • every stage takes same amount of time
  • no pipeline overhead
  • (number of instructions significantly larger than number of stages)

Pipeline processor can gain N times speedup where N is the number of pipeline stages.

Hazards

Pipeline hazards

Hazards that prevent next instruction from immediately following previous instruction

  • Structural hazards
    • Simultaneous use of a hardware resource
  • Data hazards
    • Data dependencies between instructions
  • Control hazards
    • Change in program flow
IMRegDMRegALUCC1CC2CC3CC4CC5CC6IMRegDMRegALUStagesInstructions

Structural Hazards

If there is only one single memory module

There might be a conflict if two instructions access memory in the same cycle.

IMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUMemory is accessedin the same cycleProblemIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUSTALLDelay 4th instructionby 1 cycleStall pipelineIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUSeparate memorySeparate the memory intoData Memory and Instruction MemorySOLUTIONS

Conflict for registers?

Registers are very fast memory, allowing for the cycle to be split into half, one for writing and one for reading.

IMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUWrites in first halfRead in second half

Instruction Dependencies

Instruction Dependencies

Instructions can have relationships that prevent pipeline execution.

Data dependency

When different instructions access the same register

Control dependency

An instruction j is control dependent on i if i controls whether or not j executes

i is generally a branch instruction

RAW: Read-After-Write

RAW

Occurs when a later instruction reads from destination reigster written by an earlier instruction (True data dependency)

i1: add $1, $2, $3
i2: sub $4, $1, $5

Stale result

An old result.

If i2 reads register $1 before i1 writes back, i2 gets a stale result.

There are two possible solutions:

  • Forwarding
    • Forward the result to any later instructions before reflecting it in register file and replace the data read from register file.
  • Stalling
    • Useful when forwarding does not work (data is not even loaded yet).
IMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUand $12, $2, $5sub $2, $1, $3or $13, $6, $2add $14, $2, $2sw $15, 100($2)ForwardingIMRegDMRegALUIMRegDMRegALUand $12, $2, $5lw $2, 20($1)StallingBubble

WAR: Write-After-Read

Does not cause pipeline hazards.

Only affects processor only when instructions are executed out of program order.

WAW: Write-After-Write

Does not cause pipeline hazards

Only affects processor only when instructions are executed out of program order.

Control Dependency

Why can't we just stall the pipeline?

Introduces large clock cycle delay - branching being common, a 3-cycle stall penalty is too heavy.

To minimise the control hazard penalty:

  • Early branch resolution
    • Move branch decision calculation to earlier pipeline stage
  • Branch prediction
    • Guess outcome before produced
  • Delayed branching
    • Do something useful while waiting for the outcome

Early branch resolution

Make the decision in ID stage instead of MEM.

Move register comparison to ID stage.

IMRegDMRegALUisZeroIMRegDMRegALUisZeroSTALLEarly Branch ResolutionIMRegDMRegALUisZeroIMRegDMRegALUisZeroBUBBLEBUBBLEIf register involved in comparison is produced by preceding instruction,FURTHER STALL IS NEEDED.IMRegDMRegALUisZeroIMRegDMRegALUisZeroBUBBLEBUBBLEIf load is used to produce register involved.NO IMPROVEMENT

Problems

If instruction producing value into comparison register is before, extra clock cycles are needed

  • 1 more if load instruction
    • results in no performance improvement.

Branch Prediction

There are other branch prediction schemes. Only one is covered here.

Simple prediction: All braNches are assumed to not be taken.

After outcome of branch:

  • Not taken
    • No pipeline stall
  • Taken
    • Wrong instructions in pipeline
    • Successor instructions are flushed
IMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUbeqnew locIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUIMRegDMRegALUbeqnew locCORRECT PREDICTIONBranch NOT takenIMRegDMRegALUIMIMRegDMRegALUbeqnew locWRONG PREDICTIONBranch takenBUBBLEINSTRUCTIONS ARE FLUSHED

Delayed Branch

Motivation

Branch outcome takes cycles to be known. Thus, these cycles can be used to execute non-control dependent instructions into the slot following a branch. These instructions are executed regardless of the branch outcome.

or $8, $9, $10
add $1, $2, $3
sub $4, $5, $6
beq $1, $4, Exit
xor $10, $1, $11

Exit:

Since the or instruction does not affect the remaining instructions, and gets executed regardless, it can be moved:

add $1, $2, $3
sub $4, $5, $6
beq $1, $4, Exit
or $8, $9, $10
xor $10, $1, $11

Exit:

Possible scenarios

Best case scenario

Instruction preceding branch which can be moved into delayed slot

Worst case scenario

  • Add nop instruction

Multiple Issue Processors

This is optional reading.

Multiple issue processors

Multiple instructions in every pipeline stage

Static multiple issue

EPIC or VLIW

  • Compiler specifies the set of instructions that execute together in given clock cycle.
  • Simple hardware, complex compiler.

Dynamic multiple issue

Superscalar processor - dominant design of modern processes

  • Hardware decides which instructions to execute together.
  • Complex hardware, simpler compiler

2-wide superscalar pipeline

Fetch and dispatch two instructions at a time, allowing for up to 2 instructions to complete per cycle.