FTB Branch Predictor
Introduction to FTB
FTB is the third sub-predictor of the Xiangshan BPU, and it receives the outputs of both uFTB and TAGE-SC. On the FTB input interface, the s1 channel carries the basic prediction result from uFTB, while the s2 and s3 channels carry only one group of signals, `br_taken_mask`, filled in by TAGE-SC; they do not yet contain the basic prediction result derived from an FTB entry. The job of FTB is to provide that basic prediction result for the s2 and s3 channels.
In terms of functionality and structure, FTB is similar to uFTB. The main difference is that FTB can accommodate more FTB entries, and the prediction results of FTB are output in the s2 and s3 channels. Due to its large capacity, the readout speed of FTB is slower than that of uFTB, and it cannot be placed in the first cycle to generate prediction results. However, the large capacity enables it to obtain more accurate prediction results.
Function of FTB
- Cache more FTB entries and provide basic prediction results for the s2 and s3 channels. The FTB predictor is essentially a large-capacity storage. It reads the FTB entry corresponding to the current predicted PC and outputs it in the s2 stage. The same FTB entry is held for one more cycle to generate the s3 stage prediction result. Take care to carry over the `br_taken_mask` field supplied by the previous predictor so that it is not lost when the result is generated.
- Update FTB entries based on update requests.
FTB Storage Structure
FTB entries in the FTB predictor are placed in a dedicated storage structure called `FTBBank`. Before examining the structure of `FTBBank`, let's first see how it is used.
FTB Read Request
The read request interface of `FTBBank` is as follows:
- req_pc Requested PC
  - Interface type: `Flipped(DecoupledIO(UInt(VAddrBits.W)))`
- read_resp FTB entry read out
  - Interface type: `FTBEntry`
- read_hits Which way (row) is hit
  - Interface type: `Valid(UInt(log2Ceil(numWays).W))`
The `req_pc` interface is Decoupled, meaning it carries valid and ready signals. FTB needs the PC before the s1 stage starts, so `s0_pc` is sent to the `req_pc` interface, the `s0_fire` signal is connected to the valid signal of `req_pc`, and the `ready` signal is connected to the pipeline control signal `s1_ready`.
One cycle after `s0_fire`, when the request has entered the s1 stage (that is, in the cycle where `s1_fire` is asserted), FTBBank has already driven the FTB entry it read onto the `read_resp` interface and computed `read_hits`. However, the read consumes too much of the cycle for the result to be output in the s1 stage. The result is therefore saved in an internal register and read back from that register in the s2 and s3 stages to generate the prediction result.
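The register staging described above can be sketched as a small behavioral model in Python. This is only an illustration: the class name `FTBStageRegs` and its method are hypothetical, and the sketch assumes that the s2 register samples `read_resp` when `s1_fire` is asserted and that the s3 register samples the s2 register when `s2_fire` is asserted.

```python
class FTBStageRegs:
    """Holds the FTB entry read in s1 so that s2 and s3 can use it.

    Hypothetical model: the enable conditions (s1_fire, s2_fire) are an
    assumption of this sketch, not taken from the text.
    """

    def __init__(self):
        self.s2_entry = None  # entry visible to the s2 stage
        self.s3_entry = None  # entry visible to the s3 stage

    def tick(self, read_resp, s1_fire, s2_fire):
        """Advance one clock edge; both registers sample their old inputs."""
        next_s3 = self.s2_entry if s2_fire else self.s3_entry
        next_s2 = read_resp if s1_fire else self.s2_entry
        self.s2_entry, self.s3_entry = next_s2, next_s3


regs = FTBStageRegs()
regs.tick(read_resp="entry_A", s1_fire=True, s2_fire=False)
s2_after_first = regs.s2_entry
regs.tick(read_resp="entry_B", s1_fire=True, s2_fire=True)
s2_after_second, s3_after_second = regs.s2_entry, regs.s3_entry
```

After the second tick, the s3 register holds the entry that s2 saw one cycle earlier, which matches the "saved for one more cycle" behavior described for the s3 result.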
FTBBank
FTBBank defines a storage that holds all FTB entries. The storage is set-associative, with 512 sets of 4 ways each, so it can hold up to 2048 FTB entries. Besides the FTB entries themselves, it also stores the tag of each entry for matching.
Specifically, the tag is defined as `pc[29:10]`, that is, 20 bits taken from the PC to identify the FTB entry. The PC is divided as follows:
pc: | ... |<-- tag(20 bits) -->|<-- idx(9 bits) -->|<-- instOffset(1 bit) -->|
On a read, the set index (idx) is supplied to the storage, all ways in that set are read out, and each way's tag is compared against the tag of the current PC. A match is a hit: the FTB entry that was read is sent through the `read_resp` interface, and the number of the hit way is sent through the `read_hits` interface.
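The lookup described above can be sketched as a small Python reference model. The bit layout (1-bit instOffset, 9-bit idx, 20-bit tag) and the 512 x 4 organization come from the text; the class and function names are hypothetical, and entries are represented by arbitrary Python values rather than real FTB entries.

```python
NUM_SETS = 512        # from the text
NUM_WAYS = 4          # from the text
INST_OFFSET_BITS = 1  # pc[0]
IDX_BITS = 9          # pc[9:1]
TAG_BITS = 20         # pc[29:10]


def split_pc(pc):
    """Return (tag, idx) using the layout tag = pc[29:10], idx = pc[9:1]."""
    idx = (pc >> INST_OFFSET_BITS) & ((1 << IDX_BITS) - 1)
    tag = (pc >> (INST_OFFSET_BITS + IDX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, idx


class FTBBankModel:
    """Hypothetical behavioral model of the set-associative storage."""

    def __init__(self):
        # Each way holds None (empty) or a (tag, entry) pair.
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def write(self, pc, way, entry):
        tag, idx = split_pc(pc)
        self.sets[idx][way] = (tag, entry)

    def read(self, pc):
        """Return (hit, way, entry), mirroring read_hits and read_resp."""
        tag, idx = split_pc(pc)
        for way, slot in enumerate(self.sets[idx]):
            if slot is not None and slot[0] == tag:
                return True, way, slot[1]
        return False, None, None


bank = FTBBankModel()
bank.write(0x80000104, way=2, entry="entry_A")
hit_result = bank.read(0x80000104)   # same set, same tag
miss_result = bank.read(0x80000504)  # same set, different tag
```

Because the tag covers only `pc[29:10]`, two PCs that differ only above bit 29 map to the same set and tag in this model, which is one way a PC can index an FTB entry that belongs to a different block.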
Generation of Prediction Results
As mentioned earlier, the FTB predictor must provide basic prediction results derived from FTB entries to the s2 and s3 channels. The FTB entry has already been read and saved in the s1 stage, so in the s2 and s3 stages it only needs to be read back to generate the prediction results. Take care, however, to preserve the `br_taken_mask` field generated by TAGE-SC in the s2 and s3 prediction results, since it provides the precise prediction for conditional branch instructions. The FTB predictor leaves the s1 channel unchanged.
The signals in the s2 and s3 prediction results are generated as follows:
- hit Whether the FTB entry is hit
  - Generation method: The valid bit of the `read_hits` signal from `FTBBank` is set.
- slot_valids Slot valid bits, indicating whether each slot in the FTB entry is valid
- targets Jump target address corresponding to each slot
- offsets Instruction offset of each slot relative to the start address of the predicted block
- is_jal Whether the predicted block contains a jal instruction
- is_jalr Whether the predicted block contains a jalr instruction
- is_call Whether the predicted block contains a call instruction
- is_ret Whether the predicted block contains a ret instruction
- last_may_be_rvi_call Indicates that the end of the predicted block may be an RVI type call instruction
- is_br_sharing Whether the last slot (tailSlot) stores a conditional branch instruction
  - Generation method (for `slot_valids` through `is_br_sharing`): Taken from the corresponding fields in the FTB entry.
- fallThroughErr The `pftAddr` recorded in the FTB entry is wrong
  - Generation method: Check whether the address represented by `pftAddr` is greater than the start address of the predicted block. If it is smaller, the entry is in error and this signal is asserted. This can happen when the PC indexes an incorrect FTB entry.
- fallThroughAddr End address of the predicted block
  - Generation method: If `fallThroughErr` is not asserted, it is generated from `pftAddr`. Otherwise, it is set to the start address + prediction width.
- br_taken_mask Branch prediction result; each branch (slot) corresponds to one bit, indicating whether the branch is predicted as taken
  - Generation method: Generated from the `always_taken` field in the FTB entry and the result indicated by the two-bit saturating counter.
- jalr_target Jump target of the jalr in this predicted block
  - Generation method: The jump target in the tailSlot of the FTB entry.
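Two of the rules in this list, the fall-through check and the taken mask, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `PREDICT_WIDTH_BYTES`, both function names, and the idea of passing in an already reconstructed full fall-through address are hypothetical. The sketch also assumes that the direction bit arriving in `br_taken_mask` is combined with `always_taken` by a logical OR, and that an end address equal to the start address counts as an error.

```python
PREDICT_WIDTH_BYTES = 32  # assumption: prediction width expressed in bytes


def fall_through(start_pc, pft_full_addr):
    """Return (fallThroughErr, fallThroughAddr).

    pft_full_addr is the full address represented by pftAddr; rebuilding it
    from the partial bits stored in the entry is outside this sketch.
    """
    # The text checks "greater than"; treating equality as an error as well
    # is an assumption of this sketch.
    err = not (pft_full_addr > start_pc)
    addr = start_pc + PREDICT_WIDTH_BYTES if err else pft_full_addr
    return err, addr


def merge_br_taken_mask(upstream_mask, always_taken):
    """Keep the upstream direction bits and force always-taken branches.

    Assumption: one bit per slot, combined with a logical OR.
    """
    return [up or at for up, at in zip(upstream_mask, always_taken)]


ok_case = fall_through(0x1000, 0x1010)
err_case = fall_through(0x1000, 0x0FF0)
mask = merge_br_taken_mask([False, True], [True, False])
```

In the error case the sketch falls back to the start address plus the prediction width, so the predicted block still has a usable end address even when the indexed entry belongs to a different block.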
FTB meta
In the third cycle of prediction, the FTB predictor outputs some auxiliary information about this prediction to `last_stage_meta` and also sends the FTB entry that was read to the `last_stage_ftb_entry` interface.
The FTB meta contains two pieces of information, `hit` and `writeWay`, indicating whether the prediction hit and which way the entry was read from. Later, when the update channel produces the update information for this prediction, these two fields are sent back with it to guide the writing of the updated FTB entry.
FTB Update
In the update channel, the pc and the new FTB entry are already provided, along with the `hit` and `writeWay` from the meta information. If `hit` in the meta is valid, the FTB entry for this pc was already in the storage at prediction time, and we only need to write the new entry to the corresponding way.
If it is invalid, the entry was not in the storage at prediction time, but we do not know whether it is there now: another update request may have written the FTB entry for this pc in the meantime. We therefore still need to send a read request to FTBBank to check whether a matching FTB entry exists. If it does, the new entry can be written directly to that position in the next cycle; otherwise, FTBBank is told to allocate a new position.
Therefore, the number of cycles required for updating FTB entries depends on the hit situation.
Let’s first look at how FTBBank handles updates.
FTBBank Update
FTBBank’s update interface is divided into two parts, the update read interface and the update write interface.
- u_req_pc: Update read request pc
  - Interface type: `Flipped(DecoupledIO(UInt(VAddrBits.W)))`
- update_hits: Hit information read out
  - Interface type: `Valid(UInt(log2Ceil(numWays).W))`
- update_access: There is an update request but the meta information indicates a miss
  - Interface type: `Bool()`
- update_pc: Update write request pc
  - Interface type: `UInt(VAddrBits.W)`
- update_write_data: Data to be written in the update request, written when valid
  - Interface type: `Flipped(Valid(new FTBEntryWithTag))`
- update_write_way: Way index to write in the update request
  - Interface type: `UInt(log2Ceil(numWays).W)`
- update_write_alloc: Whether a new FTB entry needs to be allocated (missed before)
  - Interface type: `Bool()`
For the update read interface, FTBBank receives the update read request through the `u_req_pc` signal. This request has higher priority than the read request issued during prediction. In the next cycle, FTBBank outputs the hit information through the `update_hits` interface. `update_access` is only used for some internal state decisions inside FTBBank.
For the update write interface, FTBBank receives the pc of the update write request through the `update_pc` signal, and when `update_write_data` is valid, it writes the data to the position specified by `update_write_way`. If `update_write_alloc` is valid, the data cannot be written to the position specified in the request; a new position must be allocated instead.
The allocation strategy is as follows:
- If all ways are filled, use the pseudo LRU replacement algorithm to select the way to replace
- If there is an empty way, select the empty way.
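The allocation strategy can be sketched in Python as shown below. The text only says "pseudo LRU"; this sketch assumes the common tree-based variant for 4 ways (3 state bits per set), and the names `TreePLRU4` and `allocate_way` are hypothetical.

```python
class TreePLRU4:
    """Tree pseudo-LRU for 4 ways; each bit points at the colder half.

    Assumption: the tree-based variant is used here for illustration only.
    """

    def __init__(self):
        self.root = 0   # 0: victim is in ways 0-1, 1: victim is in ways 2-3
        self.left = 0   # 0: victim is way 0, 1: victim is way 1
        self.right = 0  # 0: victim is way 2, 1: victim is way 3

    def touch(self, way):
        """Record a use of `way` by pointing the bits on its path away from it."""
        if way < 2:
            self.root = 1
            self.left = 1 - way
        else:
            self.root = 0
            self.right = 1 - (way - 2)

    def victim(self):
        """Follow the bits from the root to the way to replace."""
        if self.root == 0:
            return self.left
        return 2 + self.right


def allocate_way(valid_bits, plru):
    """Pick the way for a new entry: an empty way first, else the PLRU victim."""
    for way, valid in enumerate(valid_bits):
        if not valid:
            return way
    return plru.victim()


plru = TreePLRU4()
for w in range(4):
    plru.touch(w)
with_empty_way = allocate_way([True, False, True, True], plru)
all_full = allocate_way([True, True, True, True], plru)
```

Tree pseudo-LRU only approximates true LRU: after ways 0 to 3 are touched in order and way 0 is touched again, this model picks way 2 rather than way 1 as the next victim.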
Update Request Timing
- Meta hit is valid: If hit in the update request meta is valid, then we only need to specify the address and data to be written according to the information in the update request, and the writing only takes one cycle.
- Meta hit is invalid: In this case, after receiving the update request, we connect the pc in the request to the read port of FTBBank. The read port returns the result in the next cycle. Due to timing constraints, we save this result and use it in the following cycle. Depending on the hit status in the result, we decide whether to set `update_write_alloc`, and then send the write request. The entire update process takes three cycles.
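The two timing cases can be summarized with a short Python sketch that lists the action taken in each cycle. The function name, the action labels, and the dictionary layout are hypothetical; `reread_hit` and `reread_way` stand for the result returned on `update_hits`.

```python
def update_schedule(meta_hit, meta_way, reread_hit, reread_way):
    """Return the per-cycle actions for one update request.

    reread_hit / reread_way are only consulted when the meta reports a miss.
    """
    if meta_hit:
        # One cycle: write straight to the way recorded in the meta.
        return [("write", {"way": meta_way, "alloc": False})]
    # Cycle 1: send the pc to the FTBBank read port.
    # Cycle 2: the result comes back and is saved in a register.
    # Cycle 3: write; allocate a new way only if the re-read also missed.
    write = {"way": reread_way if reread_hit else None, "alloc": not reread_hit}
    return [("read", None), ("latch", None), ("write", write)]


fast_path = update_schedule(True, 1, False, 0)
reread_found = update_schedule(False, 0, True, 3)
reread_missed = update_schedule(False, 0, False, 0)
```

When `alloc` is set, the way is left as `None` in this sketch because FTBBank chooses the position itself according to its allocation strategy.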
Interface List
Signal type | Width | Signal name | Description |
---|---|---|---|
input | | clock | Input clock |
input | | reset | Reset signal |
input | [35:0] | io_reset_vector | Value used to reset s1_pc_dup_0 on reset |
input | [40:0] | io_in_bits_s0_pc_0 | Copy 0 of the input s0_pc |
input | [40:0] | io_in_bits_s0_pc_1 | Same as above, copy 1 |
input | [40:0] | io_in_bits_s0_pc_2 | Same as above, copy 2 |
input | [40:0] | io_in_bits_s0_pc_3 | Same as above, copy 3 |
input | | io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_0 | Prediction result input |
input | | io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_1 | |
input | | io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_0 | |
input | | io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_1 | |
output | | io_out_s2_full_pred_0_br_taken_mask_0 | Full prediction result output in the s2 stage |
output | | io_out_s2_full_pred_0_br_taken_mask_1 | |
output | | io_out_s2_full_pred_0_slot_valids_0 | |
output | | io_out_s2_full_pred_0_slot_valids_1 | |
output | [40:0] | io_out_s2_full_pred_0_targets_0 | |
output | [40:0] | io_out_s2_full_pred_0_targets_1 | |
output | [40:0] | io_out_s2_full_pred_0_jalr_target | |
output | [3:0] | io_out_s2_full_pred_0_offsets_0 | |
output | [3:0] | io_out_s2_full_pred_0_offsets_1 | |
output | [40:0] | io_out_s2_full_pred_0_fallThroughAddr | |
output | | io_out_s2_full_pred_0_is_br_sharing | |
output | | io_out_s2_full_pred_0_hit | |
output | | io_out_s2_full_pred_1_br_taken_mask_0 | Same as above |
output | | io_out_s2_full_pred_1_br_taken_mask_1 | |
output | | io_out_s2_full_pred_1_slot_valids_0 | |
output | | io_out_s2_full_pred_1_slot_valids_1 | |
output | [40:0] | io_out_s2_full_pred_1_targets_0 | |
output | [40:0] | io_out_s2_full_pred_1_targets_1 | |
output | [40:0] | io_out_s2_full_pred_1_jalr_target | |
output | [3:0] | io_out_s2_full_pred_1_offsets_0 | |
output | [3:0] | io_out_s2_full_pred_1_offsets_1 | |
output | [40:0] | io_out_s2_full_pred_1_fallThroughAddr | |
output | | io_out_s2_full_pred_1_is_br_sharing | |
output | | io_out_s2_full_pred_1_hit | |
output | | io_out_s2_full_pred_2_br_taken_mask_0 | Same as above |
output | | io_out_s2_full_pred_2_br_taken_mask_1 | |
output | | io_out_s2_full_pred_2_slot_valids_0 | |
output | | io_out_s2_full_pred_2_slot_valids_1 | |
output | [40:0] | io_out_s2_full_pred_2_targets_0 | |
output | [40:0] | io_out_s2_full_pred_2_targets_1 | |
output | [40:0] | io_out_s2_full_pred_2_jalr_target | |
output | [3:0] | io_out_s2_full_pred_2_offsets_0 | |
output | [3:0] | io_out_s2_full_pred_2_offsets_1 | |
output | [40:0] | io_out_s2_full_pred_2_fallThroughAddr | |
output | | io_out_s2_full_pred_2_is_jalr | |
output | | io_out_s2_full_pred_2_is_call | |
output | | io_out_s2_full_pred_2_is_ret | |
output | | io_out_s2_full_pred_2_last_may_be_rvi_call | |
output | | io_out_s2_full_pred_2_is_br_sharing | |
output | | io_out_s2_full_pred_2_hit | |
output | | io_out_s2_full_pred_3_br_taken_mask_0 | Same as above |
output | | io_out_s2_full_pred_3_br_taken_mask_1 | |
output | | io_out_s2_full_pred_3_slot_valids_0 | |
output | | io_out_s2_full_pred_3_slot_valids_1 | |
output | [40:0] | io_out_s2_full_pred_3_targets_0 | |
output | [40:0] | io_out_s2_full_pred_3_targets_1 | |
output | [40:0] | io_out_s2_full_pred_3_jalr_target | |
output | [3:0] | io_out_s2_full_pred_3_offsets_0 | |
output | [3:0] | io_out_s2_full_pred_3_offsets_1 | |
output | [40:0] | io_out_s2_full_pred_3_fallThroughAddr | |
output | | io_out_s2_full_pred_3_fallThroughErr | |
output | | io_out_s2_full_pred_3_is_br_sharing | |
output | | io_out_s2_full_pred_3_hit | |
output | | io_out_s3_full_pred_0_br_taken_mask_0 | Full prediction result output in the s3 stage |
output | | io_out_s3_full_pred_0_br_taken_mask_1 | |
output | | io_out_s3_full_pred_0_slot_valids_0 | |
output | | io_out_s3_full_pred_0_slot_valids_1 | |
output | [40:0] | io_out_s3_full_pred_0_targets_0 | |
output | [40:0] | io_out_s3_full_pred_0_targets_1 | |
output | [40:0] | io_out_s3_full_pred_0_jalr_target | |
output | [40:0] | io_out_s3_full_pred_0_fallThroughAddr | |
output | | io_out_s3_full_pred_0_fallThroughErr | |
output | | io_out_s3_full_pred_0_is_br_sharing | |
output | | io_out_s3_full_pred_0_hit | |
output | | io_out_s3_full_pred_1_br_taken_mask_0 | Same as above |
output | | io_out_s3_full_pred_1_br_taken_mask_1 | |
output | | io_out_s3_full_pred_1_slot_valids_0 | |
output | | io_out_s3_full_pred_1_slot_valids_1 | |
output | [40:0] | io_out_s3_full_pred_1_targets_0 | |
output | [40:0] | io_out_s3_full_pred_1_targets_1 | |
output | [40:0] | io_out_s3_full_pred_1_jalr_target | |
output | [40:0] | io_out_s3_full_pred_1_fallThroughAddr | |
output | | io_out_s3_full_pred_1_fallThroughErr | |
output | | io_out_s3_full_pred_1_is_br_sharing | |
output | | io_out_s3_full_pred_1_hit | |
output | | io_out_s3_full_pred_2_br_taken_mask_0 | Same as above |
output | | io_out_s3_full_pred_2_br_taken_mask_1 | |
output | | io_out_s3_full_pred_2_slot_valids_0 | |
output | | io_out_s3_full_pred_2_slot_valids_1 | |
output | [40:0] | io_out_s3_full_pred_2_targets_0 | |
output | [40:0] | io_out_s3_full_pred_2_targets_1 | |
output | [40:0] | io_out_s3_full_pred_2_jalr_target | |
output | [40:0] | io_out_s3_full_pred_2_fallThroughAddr | |
output | | io_out_s3_full_pred_2_fallThroughErr | |
output | | io_out_s3_full_pred_2_is_jalr | |
output | | io_out_s3_full_pred_2_is_call | |
output | | io_out_s3_full_pred_2_is_ret | |
output | | io_out_s3_full_pred_2_is_br_sharing | |
output | | io_out_s3_full_pred_2_hit | |
output | | io_out_s3_full_pred_3_br_taken_mask_0 | Same as above |
output | | io_out_s3_full_pred_3_br_taken_mask_1 | |
output | | io_out_s3_full_pred_3_slot_valids_0 | |
output | | io_out_s3_full_pred_3_slot_valids_1 | |
output | [40:0] | io_out_s3_full_pred_3_targets_0 | |
output | [40:0] | io_out_s3_full_pred_3_targets_1 | |
output | [40:0] | io_out_s3_full_pred_3_jalr_target | |
output | [3:0] | io_out_s3_full_pred_3_offsets_0 | |
output | [3:0] | io_out_s3_full_pred_3_offsets_1 | |
output | [40:0] | io_out_s3_full_pred_3_fallThroughAddr | |
output | | io_out_s3_full_pred_3_fallThroughErr | |
output | | io_out_s3_full_pred_3_is_br_sharing | |
output | | io_out_s3_full_pred_3_hit | |
output | [222:0] | io_out_last_stage_meta | Meta information output in the last stage |
output | | io_out_last_stage_ftb_entry_valid | FTB entry output in the last stage |
output | [3:0] | io_out_last_stage_ftb_entry_brSlots_0_offset | |
output | [11:0] | io_out_last_stage_ftb_entry_brSlots_0_lower | |
output | [1:0] | io_out_last_stage_ftb_entry_brSlots_0_tarStat | |
output | | io_out_last_stage_ftb_entry_brSlots_0_sharing | |
output | | io_out_last_stage_ftb_entry_brSlots_0_valid | |
output | [3:0] | io_out_last_stage_ftb_entry_tailSlot_offset | |
output | [19:0] | io_out_last_stage_ftb_entry_tailSlot_lower | |
output | [1:0] | io_out_last_stage_ftb_entry_tailSlot_tarStat | |
output | | io_out_last_stage_ftb_entry_tailSlot_sharing | |
output | | io_out_last_stage_ftb_entry_tailSlot_valid | |
output | [3:0] | io_out_last_stage_ftb_entry_pftAddr | |
output | | io_out_last_stage_ftb_entry_carry | |
output | | io_out_last_stage_ftb_entry_isCall | |
output | | io_out_last_stage_ftb_entry_isRet | |
output | | io_out_last_stage_ftb_entry_isJalr | |
output | | io_out_last_stage_ftb_entry_last_may_be_rvi_call | |
output | | io_out_last_stage_ftb_entry_always_taken_0 | |
output | | io_out_last_stage_ftb_entry_always_taken_1 | |
input | | io_ctrl_btb_enable | Enable signal |
input | | io_s0_fire_0 | s0 stage pipeline control signal |
input | | io_s0_fire_1 | |
input | | io_s0_fire_2 | |
input | | io_s0_fire_3 | |
output | | io_s1_ready | s1 stage pipeline control signal |
input | | io_s1_fire_0 | |
input | | io_s1_fire_1 | |
input | | io_s1_fire_2 | |
input | | io_s1_fire_3 | |
input | | io_s2_fire_0 | s2 stage pipeline control signal |
input | | io_s2_fire_1 | |
input | | io_s2_fire_2 | |
input | | io_s2_fire_3 | |
input | | io_update_valid | Update valid |
input | [40:0] | io_update_bits_pc | PC of the returned predicted block (identifies the predicted block to update) |
input | | io_update_bits_ftb_entry_valid | Whether the entry is valid |
input | [3:0] | io_update_bits_ftb_entry_brSlots_0_offset | Offset of the branch instruction in slot 0 relative to the start pc of the block |
input | [11:0] | io_update_bits_ftb_entry_brSlots_0_lower | Low bits of the jump target address |
input | [1:0] | io_update_bits_ftb_entry_brSlots_0_tarStat | Whether the high bits of the pc carry or borrow after the jump |
input | | io_update_bits_ftb_entry_brSlots_0_sharing | A conditional branch instruction is stored in the unconditional jump slot |
input | | io_update_bits_ftb_entry_brSlots_0_valid | Whether the slot is valid |
input | [3:0] | io_update_bits_ftb_entry_tailSlot_offset | Offset of the branch instruction in slot 1 relative to the start pc of the block |
input | [19:0] | io_update_bits_ftb_entry_tailSlot_lower | Low bits of the jump target address |
input | [1:0] | io_update_bits_ftb_entry_tailSlot_tarStat | Whether the high bits of the pc carry or borrow after the jump |
input | | io_update_bits_ftb_entry_tailSlot_sharing | A conditional branch instruction is stored in the unconditional jump slot |
input | | io_update_bits_ftb_entry_tailSlot_valid | Whether the slot is valid |
input | [3:0] | io_update_bits_ftb_entry_pftAddr | Partial fall-through address: the address reached by sequential execution when no jump is taken in the predicted block, i.e. the end address of the predicted block |
input | | io_update_bits_ftb_entry_carry | Whether pc + pft produces a carry |
input | | io_update_bits_ftb_entry_isCall | Whether it is a function call |
input | | io_update_bits_ftb_entry_isRet | Whether it is a function return |
input | | io_update_bits_ftb_entry_isJalr | Whether it is a jalr instruction |
input | | io_update_bits_ftb_entry_last_may_be_rvi_call | The last instruction slot may hold an RVI call instruction |
input | | io_update_bits_ftb_entry_always_taken_0 | Whether the branch is predicted as always taken |
input | | io_update_bits_ftb_entry_always_taken_1 | Whether the branch is predicted as always taken |
input | | io_update_bits_old_entry | Whether it is an old FTB entry |
input | [222:0] | io_update_bits_meta | Meta information |