FTB Branch Predictor

Introduction to FTB

FTB is the third sub-predictor of the Xiangshan BPU, and it receives the outputs of both uFTB and TAGE-SC. On the FTB input interface, the s1 channel carries the basic prediction result from uFTB, while in the s2 and s3 channels TAGE-SC has filled in only one group of signals, br_taken_mask; these channels do not yet contain the basic prediction result derived from an FTB entry. The job of FTB is to supply that basic prediction result for the s2 and s3 channels.

In terms of functionality and structure, FTB is similar to uFTB. The main differences are that FTB can hold more FTB entries and that its prediction results are output in the s2 and s3 channels. Because of its large capacity, FTB reads out more slowly than uFTB and cannot produce a prediction result in the first cycle. In exchange, the larger capacity lets it produce more accurate prediction results.

Function of FTB

  • Cache more FTB entries and provide basic prediction results for the s2 and s3 channels. The FTB predictor is essentially a large-capacity storage. It reads the FTB entry corresponding to the current predicted PC and outputs it in the s2 stage. The same FTB entry is held for one more cycle to generate the s3 stage prediction result. Take care to preserve the br_taken_mask field supplied by the preceding predictor so that it is not lost when the result is generated.
  • Update FTB entries based on update requests.

FTB Storage Structure

FTB entries in the FTB predictor are placed in a dedicated storage structure called FTBBank. Before further examining the structure of FTBBank, let’s first see how FTBBank is used.

FTB Read Request

The read request interface of FTBBank is as follows:

  • req_pc Requested PC
    • Interface type: Flipped(DecoupledIO(UInt(VAddrBits.W)))
  • read_resp Read out FTB entry
    • Interface type: FTBEntry
  • read_hits Which way (row) is hit
    • Interface type: Valid(UInt(log2Ceil(numWays).W))

The req_pc interface is Decoupled, meaning it carries valid and ready signals. FTB needs the PC before the s1 stage starts, so s0_pc is sent to the req_pc interface, the s0_fire signal drives the valid signal of req_pc, and the ready signal is connected to the pipeline control signal s1_ready.

When s0_fire is asserted, the request enters the s1 stage in the next cycle. By the time s1_fire is asserted in that cycle, FTBBank has already placed the FTB entry on the read_resp interface and computed read_hits. However, the readout consumes too much of the cycle for the result to be output in the s1 stage. The result is therefore saved in an internal register and read back from that register in the s2 and s3 stages to generate the prediction result.
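The read timing described above can be illustrated with a small cycle-level model. This is only a sketch: the class name FtbReadPipe, the dictionary standing in for FTBBank, and the choice to advance the s3 copy on s2_fire are assumptions made for illustration, not details taken from the RTL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FtbReadPipe:
    """Toy cycle-level model of the FTB read path (names are illustrative)."""
    storage: dict                      # pc -> entry, stands in for FTBBank
    s1_pc: Optional[int] = None        # request latched when s0 fires
    s2_entry: Optional[object] = None  # readout saved when s1 fires
    s3_entry: Optional[object] = None  # same entry kept one more cycle

    def step(self, s0_pc, s0_fire, s1_fire, s2_fire):
        """Advance one clock edge; later stages update first, like registers."""
        if s2_fire:
            # Assumption: the s2 copy moves to s3 when s2 fires.
            self.s3_entry = self.s2_entry
        if s1_fire:
            # FTBBank has produced read_resp for the pc requested last cycle.
            self.s2_entry = self.storage.get(self.s1_pc)
        if s0_fire:
            self.s1_pc = s0_pc
```

Calling step once per cycle shows that an entry requested in s0 becomes available to the s2 logic two calls later and to the s3 logic one call after that.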

FTBBank

FTBBank defines a storage that holds all FTB entries. The storage is set-associative, with 512 sets of 4 ways each, so it can store up to 2048 FTB entries. Alongside each FTB entry it also stores the corresponding tag for matching.

Specifically, the tag is defined as pc[29:10], which takes 20 bits from the PC to identify the FTB entry. The PC is divided as follows:

  pc: | ... |<-- tag(20 bits) -->|<-- idx(9 bits) -->|<-- instOffset(1 bit) -->|

When reading, the set index (idx) is supplied to the storage, all ways in that set are read out, and each way's tag is compared with the tag of the current PC. If one matches, the read is a hit: the FTB entry is sent out through the read_resp interface, and the number of the hit way is sent out through the read_hits interface.
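The lookup can be sketched in Python as a behavioural model. The field positions follow the PC layout above (idx taken as pc[9:1], tag as pc[29:10]); the class FtbBankModel, its write helper, and the representation of an entry as an arbitrary Python object are hypothetical and only serve the illustration.

```python
NUM_SETS = 512        # from the text: 512 sets
NUM_WAYS = 4          # from the text: 4 ways per set
INST_OFFSET_BITS = 1  # instOffset = pc[0]
IDX_BITS = 9          # idx = pc[9:1]
TAG_BITS = 20         # tag = pc[29:10]


def split_pc(pc):
    """Return (tag, idx) using the field layout shown above."""
    idx = (pc >> INST_OFFSET_BITS) & ((1 << IDX_BITS) - 1)
    tag = (pc >> (INST_OFFSET_BITS + IDX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, idx


class FtbBankModel:
    """Behavioural stand-in for FTBBank reads; not the RTL implementation."""

    def __init__(self):
        # Each way holds None (empty) or a (tag, entry) pair.
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def write(self, pc, way, entry):
        tag, idx = split_pc(pc)
        self.sets[idx][way] = (tag, entry)

    def read(self, pc):
        """Return (hit, way, entry), mirroring read_hits and read_resp."""
        tag, idx = split_pc(pc)
        for way, slot in enumerate(self.sets[idx]):
            if slot is not None and slot[0] == tag:
                return True, way, slot[1]
        return False, None, None
```

Because only 20 bits of the PC are kept as the tag, two PCs that differ only above bit 29 select the same entry in this model. This is one way a PC can index an entry that belongs to a different block.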

Generation of Prediction Results

As mentioned earlier, the FTB predictor must provide the s2 and s3 channels with basic prediction results derived from FTB entries. The FTB entry has already been read and saved in the s1 stage, so in the s2 and s3 stages it only needs to be read back to generate the prediction results. Take care, however, to preserve the br_taken_mask field generated by TAGE-SC in the s2 and s3 prediction results, since it provides precise predictions for conditional branch instructions. The FTB predictor makes no changes to the s1 channel.

The signals in the s2 and s3 prediction results are generated as follows:

  • hit Whether the FTB entry is hit
    • Generation method: Asserted when the valid bit of the read_hits signal from FTBBank is set.
  • slot_valids Slot valid bit, indicating whether each slot in the ftb entry is valid
  • targets Jump target address corresponding to each slot
  • offsets Instruction offset relative to the start address of the predicted block in each slot
  • is_jal Whether the predicted block contains a jal instruction
  • is_jalr Whether the predicted block contains a jalr instruction
  • is_call Whether the predicted block contains a call instruction
  • is_ret Whether the predicted block contains a ret instruction
  • last_may_be_rvi_call Signal indicating that the end of the predicted block may be an RVI type call instruction
  • is_br_sharing Whether the last slot (tailSlot) stores a conditional branch instruction signal
    • Generation method: Export from the corresponding field in the FTB entry
  • fallThroughErr Error in the pftAddr recorded in the FTB entry
    • Generation method: Check whether the address represented by pftAddr is greater than the start address of the predicted block. If it is smaller instead, this indicates an error and the signal is asserted. This situation may occur when the PC indexes an incorrect FTB entry.
  • fallThroughAddr End address of the predicted block
    • Generation method: If fallThroughErr is not asserted, it is generated from pftAddr. Otherwise, it is set to the start address + prediction width.
  • br_taken_mask Branch prediction result, each branch (slot) corresponds to a bit, indicating whether the branch is predicted as taken
    • Generation method: Generated from the always_taken field in the FTB entry and the result indicated by the two-bit saturating counter.
  • jalr_target Jump target of jalr in this predicted block
    • Generation method: Jump target in the tailSlot of the FTB entry.
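The fallThroughErr and fallThroughAddr rules in the list above can be sketched as follows. Several details are assumptions rather than facts from this document: the address is rebuilt by replacing the low PC bits with the 4-bit pftAddr and adding the carry bit to the upper part, the prediction width is taken as 32 bytes (16 two-byte positions), and an end address equal to the start address is treated as an error.

```python
INST_OFFSET_BITS = 1      # 2-byte instruction granularity (from the pc layout)
PFT_BITS = 4              # width of pftAddr in the interface list
PREDICT_WIDTH_BYTES = 32  # assumption: 16 positions of 2 bytes each


def fall_through(start_pc, pft_addr, carry):
    """Return (fallThroughErr, fallThroughAddr) for one FTB entry."""
    low_bits = PFT_BITS + INST_OFFSET_BITS
    # Assumed reconstruction: upper pc bits (+1 on carry) joined with pftAddr.
    high = (start_pc >> low_bits) + (1 if carry else 0)
    candidate = (high << low_bits) | (pft_addr << INST_OFFSET_BITS)
    # The end of the block must lie beyond its start; anything else suggests
    # the pc matched an entry that belongs to another block.
    err = candidate <= start_pc
    addr = start_pc + PREDICT_WIDTH_BYTES if err else candidate
    return err, addr
```

For example, fall_through(0x1004, 0x8, False) gives an end address of 0x1010 with no error, while fall_through(0x1010, 0x2, False) reports an error and falls back to 0x1030.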

FTB meta

In the third cycle of prediction, the FTB predictor outputs some auxiliary information about this prediction to last_stage_meta and also sends the FTB entry that was read to the last_stage_ftb_entry interface.

The FTB meta contains two pieces of information, hit and writeWay, indicating whether the prediction hits and in which way it is read. Subsequently, the update channel generates the update information for this prediction, and these two pieces of information are also sent to guide the writing of the updated FTB entry.

FTB Update

In the update channel, the pc and the new FTB entry are provided, along with hit and writeWay from the meta information. If hit in the meta is valid, the FTB entry corresponding to this pc was already present in the storage at prediction time, and the new entry only needs to be written to the corresponding way.

If hit is invalid, the entry was not present at prediction time, but it may be present now: another update request may have written the FTB entry for this pc in the meantime. A read request is therefore still sent to FTBBank to check whether a matching FTB entry exists. If it does, the new entry can be written directly to that position in the next cycle; otherwise, FTBBank is told to allocate a new position.

Therefore, the number of cycles required for updating FTB entries depends on the hit situation.

Let’s first look at how FTBBank handles updates.

FTBBank Update

FTBBank’s update interface is divided into two parts, the update read interface and the update write interface.

  • u_req_pc: Update read request pc
    • Flipped(DecoupledIO(UInt(VAddrBits.W)))
  • update_hits: Hit information read out
    • Valid(UInt(log2Ceil(numWays).W))
  • update_access: There is an update request but the meta information indicates a miss
    • Bool()
  • update_pc: Update write request pc
    • UInt(VAddrBits.W)
  • update_write_data: Data to be written in the update request, write when valid
    • Flipped(Valid(new FTBEntryWithTag))
  • update_write_way: Way index to write in the update request
    • UInt(log2Ceil(numWays).W)
  • update_write_alloc: Whether a new FTB entry needs to be allocated (missed before)
    • Bool()

For the update read interface, FTBBank obtains the update read request through u_req_pc signal. This request has a higher priority than the read request during prediction. In the next cycle, FTBBank will output the hit information through the update_hits interface. update_access is only used for some internal status judgments of FTBBank.
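The priority rule between the two read sources can be written as a tiny selection function. This is a sketch only; the function name and the returned labels are hypothetical, and the real arbitration is done in hardware.

```python
def select_read_pc(u_req_valid, u_req_pc, req_valid, req_pc):
    """Pick the pc that drives the FTBBank read port in this cycle."""
    if u_req_valid:
        return "update", u_req_pc    # the update read has higher priority
    if req_valid:
        return "predict", req_pc     # otherwise serve the prediction read
    return None, None                # no read this cycle
```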

For the update write interface, FTBBank obtains the pc of the update write request through the update_pc signal, and when update_write_data is valid, it writes the data into the corresponding position specified by update_write_way. If update_write_alloc is valid, it means that it cannot be directly written to the position specified in the request, but a new position needs to be allocated.

The allocation strategy is as follows:

  • If all ways are filled, use the pseudo LRU replacement algorithm to select the way to replace
  • If there is an empty way, select the empty way.
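The allocation strategy can be sketched with a tree-based pseudo-LRU over 4 ways. The text only says "pseudo LRU", so the 3-bit tree variant used here is an assumption, as are the names TreePlru4 and pick_way and the choice of the lowest-numbered empty way when several are empty.

```python
class TreePlru4:
    """Tree pseudo-LRU for 4 ways using 3 state bits (illustrative only)."""

    def __init__(self):
        # bits[0] picks the half (0 -> ways 0/1, 1 -> ways 2/3);
        # bits[1] picks inside ways 0/1, bits[2] inside ways 2/3.
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Point the tree away from the way that was just used."""
        self.bits[0] = 0 if way >= 2 else 1
        if way < 2:
            self.bits[1] = 0 if way == 1 else 1
        else:
            self.bits[2] = 0 if way == 3 else 1

    def victim(self):
        """Follow the bits to a way that was not used recently."""
        if self.bits[0] == 0:
            return 0 + self.bits[1]
        return 2 + self.bits[2]


def pick_way(valid, plru):
    """Prefer an empty way; fall back to the pseudo-LRU victim."""
    for way, is_valid in enumerate(valid):
        if not is_valid:
            return way
    return plru.victim()
```

Here valid is a list of four booleans for the indexed set, and touch would be called whenever a way is read or written.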

Update Request Timing

  • Meta hit is valid: If hit in the update request meta is valid, then we only need to specify the address and data to be written according to the information in the update request, and the writing only takes one cycle.
  • Meta hit is invalid: In this case, after receiving the update request, we connect the pc in the request to the read port of FTBBank. The read port will return the result in the next cycle. Due to timing issues, we save this result and use it in the next cycle. Depending on the hit status in the result, we decide whether to set update_write_alloc and send a write request. The entire update process takes three cycles.
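The two timing cases can be summarised as a per-cycle schedule. The sketch below is a simplification: update_schedule and the bank_lookup callback are hypothetical names, and each step is a tuple (action, way, alloc) rather than real interface signals.

```python
def update_schedule(meta_hit, meta_write_way, bank_lookup):
    """Return the list of per-cycle actions for one update request.

    bank_lookup() models the FTBBank update read and returns (hit, way).
    """
    if meta_hit:
        # Entry was present at prediction time: overwrite the same way.
        return [("write", meta_write_way, False)]

    hit, way = bank_lookup()
    target = way if hit else None  # way is chosen by FTBBank on allocation
    return [
        ("read", None, False),       # cycle 1: send pc on u_req_pc
        ("latch", target, False),    # cycle 2: save update_hits
        ("write", target, not hit),  # cycle 3: write, allocating on a miss
    ]
```

A request whose meta reports a hit produces a one-step schedule, while a miss always produces three steps, with update_write_alloc set only when the re-read also misses.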

Interface List

Signal type Bit width Signal name Description
input clock Input clock
input reset Reset signal
input [35:0] io_reset_vector Value loaded into s1_pc_dup_0 on reset
input [40:0] io_in_bits_s0_pc_0 Copy 0 of the input s0_pc
input [40:0] io_in_bits_s0_pc_1 Same as above, copy 1
input [40:0] io_in_bits_s0_pc_2 Same as above, copy 2
input [40:0] io_in_bits_s0_pc_3 Same as above, copy 3
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_0 Prediction result input
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_0_br_taken_mask_0 Full prediction result output in the s2 stage
output io_out_s2_full_pred_0_br_taken_mask_1
output io_out_s2_full_pred_0_slot_valids_0
output io_out_s2_full_pred_0_slot_valids_1
output [40:0] io_out_s2_full_pred_0_targets_0
output [40:0] io_out_s2_full_pred_0_targets_1
output [40:0] io_out_s2_full_pred_0_jalr_target
output [3:0] io_out_s2_full_pred_0_offsets_0
output [3:0] io_out_s2_full_pred_0_offsets_1
output [40:0] io_out_s2_full_pred_0_fallThroughAddr
output io_out_s2_full_pred_0_is_br_sharing
output io_out_s2_full_pred_0_hit
output io_out_s2_full_pred_1_br_taken_mask_0 Same as above
output io_out_s2_full_pred_1_br_taken_mask_1
output io_out_s2_full_pred_1_slot_valids_0
output io_out_s2_full_pred_1_slot_valids_1
output [40:0] io_out_s2_full_pred_1_targets_0
output [40:0] io_out_s2_full_pred_1_targets_1
output [40:0] io_out_s2_full_pred_1_jalr_target
output [3:0] io_out_s2_full_pred_1_offsets_0
output [3:0] io_out_s2_full_pred_1_offsets_1
output [40:0] io_out_s2_full_pred_1_fallThroughAddr
output io_out_s2_full_pred_1_is_br_sharing
output io_out_s2_full_pred_1_hit
output io_out_s2_full_pred_2_br_taken_mask_0 Same as above
output io_out_s2_full_pred_2_br_taken_mask_1
output io_out_s2_full_pred_2_slot_valids_0
output io_out_s2_full_pred_2_slot_valids_1
output [40:0] io_out_s2_full_pred_2_targets_0
output [40:0] io_out_s2_full_pred_2_targets_1
output [40:0] io_out_s2_full_pred_2_jalr_target
output [3:0] io_out_s2_full_pred_2_offsets_0
output [3:0] io_out_s2_full_pred_2_offsets_1
output [40:0] io_out_s2_full_pred_2_fallThroughAddr
output io_out_s2_full_pred_2_is_jalr
output io_out_s2_full_pred_2_is_call
output io_out_s2_full_pred_2_is_ret
output io_out_s2_full_pred_2_last_may_be_rvi_call
output io_out_s2_full_pred_2_is_br_sharing
output io_out_s2_full_pred_2_hit
output io_out_s2_full_pred_3_br_taken_mask_0 Same as above
output io_out_s2_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_3_slot_valids_0
output io_out_s2_full_pred_3_slot_valids_1
output [40:0] io_out_s2_full_pred_3_targets_0
output [40:0] io_out_s2_full_pred_3_targets_1
output [40:0] io_out_s2_full_pred_3_jalr_target
output [3:0] io_out_s2_full_pred_3_offsets_0
output [3:0] io_out_s2_full_pred_3_offsets_1
output [40:0] io_out_s2_full_pred_3_fallThroughAddr
output io_out_s2_full_pred_3_fallThroughErr
output io_out_s2_full_pred_3_is_br_sharing
output io_out_s2_full_pred_3_hit
output io_out_s3_full_pred_0_br_taken_mask_0 Full prediction result output in the s3 stage
output io_out_s3_full_pred_0_br_taken_mask_1
output io_out_s3_full_pred_0_slot_valids_0
output io_out_s3_full_pred_0_slot_valids_1
output [40:0] io_out_s3_full_pred_0_targets_0
output [40:0] io_out_s3_full_pred_0_targets_1
output [40:0] io_out_s3_full_pred_0_jalr_target
output [40:0] io_out_s3_full_pred_0_fallThroughAddr
output io_out_s3_full_pred_0_fallThroughErr
output io_out_s3_full_pred_0_is_br_sharing
output io_out_s3_full_pred_0_hit
output io_out_s3_full_pred_1_br_taken_mask_0 Same as above
output io_out_s3_full_pred_1_br_taken_mask_1
output io_out_s3_full_pred_1_slot_valids_0
output io_out_s3_full_pred_1_slot_valids_1
output [40:0] io_out_s3_full_pred_1_targets_0
output [40:0] io_out_s3_full_pred_1_targets_1
output [40:0] io_out_s3_full_pred_1_jalr_target
output [40:0] io_out_s3_full_pred_1_fallThroughAddr
output io_out_s3_full_pred_1_fallThroughErr
output io_out_s3_full_pred_1_is_br_sharing
output io_out_s3_full_pred_1_hit
output io_out_s3_full_pred_2_br_taken_mask_0 Same as above
output io_out_s3_full_pred_2_br_taken_mask_1
output io_out_s3_full_pred_2_slot_valids_0
output io_out_s3_full_pred_2_slot_valids_1
output [40:0] io_out_s3_full_pred_2_targets_0
output [40:0] io_out_s3_full_pred_2_targets_1
output [40:0] io_out_s3_full_pred_2_jalr_target
output [40:0] io_out_s3_full_pred_2_fallThroughAddr
output io_out_s3_full_pred_2_fallThroughErr
output io_out_s3_full_pred_2_is_jalr
output io_out_s3_full_pred_2_is_call
output io_out_s3_full_pred_2_is_ret
output io_out_s3_full_pred_2_is_br_sharing
output io_out_s3_full_pred_2_hit
output io_out_s3_full_pred_3_br_taken_mask_0 Same as above
output io_out_s3_full_pred_3_br_taken_mask_1
output io_out_s3_full_pred_3_slot_valids_0
output io_out_s3_full_pred_3_slot_valids_1
output [40:0] io_out_s3_full_pred_3_targets_0
output [40:0] io_out_s3_full_pred_3_targets_1
output [40:0] io_out_s3_full_pred_3_jalr_target
output [3:0] io_out_s3_full_pred_3_offsets_0
output [3:0] io_out_s3_full_pred_3_offsets_1
output [40:0] io_out_s3_full_pred_3_fallThroughAddr
output io_out_s3_full_pred_3_fallThroughErr
output io_out_s3_full_pred_3_is_br_sharing
output io_out_s3_full_pred_3_hit
output [222:0] io_out_last_stage_meta Meta information output in the last stage
output io_out_last_stage_ftb_entry_valid FTB entry output in the last stage
output [3:0] io_out_last_stage_ftb_entry_brSlots_0_offset
output [11:0] io_out_last_stage_ftb_entry_brSlots_0_lower
output [1:0] io_out_last_stage_ftb_entry_brSlots_0_tarStat
output io_out_last_stage_ftb_entry_brSlots_0_sharing
output io_out_last_stage_ftb_entry_brSlots_0_valid
output [3:0] io_out_last_stage_ftb_entry_tailSlot_offset
output [19:0] io_out_last_stage_ftb_entry_tailSlot_lower
output [1:0] io_out_last_stage_ftb_entry_tailSlot_tarStat
output io_out_last_stage_ftb_entry_tailSlot_sharing
output io_out_last_stage_ftb_entry_tailSlot_valid
output [3:0] io_out_last_stage_ftb_entry_pftAddr
output io_out_last_stage_ftb_entry_carry
output io_out_last_stage_ftb_entry_isCall
output io_out_last_stage_ftb_entry_isRet
output io_out_last_stage_ftb_entry_isJalr
output io_out_last_stage_ftb_entry_last_may_be_rvi_call
output io_out_last_stage_ftb_entry_always_taken_0
output io_out_last_stage_ftb_entry_always_taken_1
input io_ctrl_btb_enable Enable signal
input io_s0_fire_0 s0 stage pipeline control signal
input io_s0_fire_1
input io_s0_fire_2
input io_s0_fire_3
output io_s1_ready s1 stage pipeline control signal
input io_s1_fire_0
input io_s1_fire_1
input io_s1_fire_2
input io_s1_fire_3
input io_s2_fire_0 s2 stage pipeline control signal
input io_s2_fire_1
input io_s2_fire_2
input io_s2_fire_3
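input io_update_valid Update request valid
input [40:0] io_update_bits_pc PC of the returned predicted block (identifies the predicted block to update)
input io_update_bits_ftb_entry_valid Whether the entry is valid
input [3:0] io_update_bits_ftb_entry_brSlots_0_offset Offset of the branch instruction in slot 0 relative to the start pc of the predicted block
input [11:0] io_update_bits_ftb_entry_brSlots_0_lower Lower bits of the jump target address
input [1:0] io_update_bits_ftb_entry_brSlots_0_tarStat Whether the upper bits of the pc carry or borrow after the jump
input io_update_bits_ftb_entry_brSlots_0_sharing The unconditional jump slot stores a conditional branch instruction
input io_update_bits_ftb_entry_brSlots_0_valid Whether the slot is valid
input [3:0] io_update_bits_ftb_entry_tailSlot_offset Offset of the branch instruction in slot 1 relative to the start pc of the predicted block
input [19:0] io_update_bits_ftb_entry_tailSlot_lower Lower bits of the jump target address
input [1:0] io_update_bits_ftb_entry_tailSlot_tarStat Whether the upper bits of the pc carry or borrow after the jump
input io_update_bits_ftb_entry_tailSlot_sharing The unconditional jump slot stores a conditional branch instruction
input io_update_bits_ftb_entry_tailSlot_valid Whether the slot is valid
input [3:0] io_update_bits_ftb_entry_pftAddr Partial Fallthrough Addr: the address reached by sequential execution if no jump is taken in the predicted block, i.e. the end address of the predicted block
input io_update_bits_ftb_entry_carry Whether pc + pft produces a carry
input io_update_bits_ftb_entry_isCall Whether it is a function call
input io_update_bits_ftb_entry_isRet Whether it is a function return
input io_update_bits_ftb_entry_isJalr Whether it is a jalr instruction
input io_update_bits_ftb_entry_last_may_be_rvi_call The last instruction slot may store an RVI call instruction
input io_update_bits_ftb_entry_always_taken_0 Whether the branch is predicted as always taken
input io_update_bits_ftb_entry_always_taken_1 Whether the branch is predicted as always taken
input io_update_bits_old_entry Whether it is an old FTB entry
input [222:0] io_update_bits_meta Meta information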
Last modified September 13, 2024: Update the picture of BPU Top. (431c050)