uFTB Branch Predictor

Introduction to uFTB

uFTB is the first predictor among all the predictors in Xiangshan, and it serves as the cornerstone for other predictors to generate prediction results. uFTB works in the s1 stage. It can generate prediction results within the current cycle after obtaining s1_pc and output them in the s1 channel, without modifying other channels. It provides the position of the branch instruction and the target of the instruction. Subsequent predictors will further predict based on this result.

Its essence is an FTB item cache, which stores FTB items, and the basic prediction result will be directly generated from the read-out FTB item.

Therefore, before you start reading the document, make sure you understand the FTB items and their meanings, as well as the specific details of the prediction result interface.

Functionality of uFTB

  • Cache FTB items and generate one-cycle prediction results: uFTB maintains a small FTB item cache. After receiving PC, it reads out the FTB item corresponding to the PC within one cycle and generates an s1 stage prediction result from the FTB item.
  • Maintain two-bit saturating counters to provide basic conditional branch results: uFTB maintains two-bit saturating counters for each line of the FTB item cache. The direction prediction result is reflected in the prediction result output of uFTB.
  • Update the FTB cache and two-bit saturating counters based on update requests

uFTB Cache Structure

As mentioned above, uFTB is essentially a small cache that stores FTB items. Its approximate structure is shown in the figure below.

In the current version of Xiangshan, uFTB has a total of 32 cache lines, each cache line is called FauFTBWay, and one FTB item can be stored in each cache line.

When s1 pipeline is valid, uFTB will use s1_pc to determine which item of the uFTB cache to read out. The cache is indexed based on the tag field in PC, which is defined as pc[16:1], i.e., taking 16 bits from PC as an identifier to match a certain line in the cache.

Each line in the cache, i.e., the data request interface in FauFTBWay, has three items:

  • req_tag: Input tag identifier extracted from pc
  • resp: Output the FTB item stored in this line
  • resp_hit: Output indicates whether the FTB item in this line matches req_tag uFTB connects the tag to the data request interface of each line in the cache and selects the hit FTB item based on the resp_hit signal. Subsequent steps will generate a complete prediction result based on this FTB item.

Two-Bit Saturating Counters

uFTB maintains two-bit saturating counters for each line of the FTB item cache. As we know, an FTB item can store up to two conditional branch instructions, so each line’s two-bit saturating counters also have two, responsible for providing rough prediction results for the conditional branch instructions in them.

When indexing the FTB item, uFTB also indexes the corresponding two-bit saturating counters.

When the prediction result is generated, it will be based on the FTB item and the contents of the two two-bit saturating counters corresponding to it.

Prediction Result Generation

After indexing the corresponding FTB item and two two-bit saturating counters based on s1_pc, uFTB needs to generate a prediction result based on them. The prediction result generated by uFTB will be outputted through the s1 channel when s1 pipeline is valid, and will not be modified for s2 and s3 channels.

The signal generation method in the s1 prediction result can refer to the following list:

  • hitWhether the FTB item hits
    • Generation method: resp_hit signal in FauFTBWay, one of them is valid
  • slot_valids: Slot valid bit, indicating whether each slot in the ftb item is valid
  • targets: Jump target address corresponding to each slot
  • offsets: The offset of each instruction in the slot relative to the starting address of the prediction block
  • is_jal: Whether the prediction block contains a jal instruction
  • is_jalr: Whether the prediction block contains a jalr instruction
  • is_call: Whether the prediction block contains a call instruction
  • is_ret: Whether the prediction block contains a ret instruction
  • last_may_be_rvi_call: Signal indicating that the last slot in the prediction block may be an RVI type call instruction
  • is_br_sharing: The signal stored in the last slot (tailSlot) indicates a conditional branch instruction.
    • Generation: Exported from the corresponding field in the FTB entry.
  • fallThroughErr: The pftAddr recorded in the FTB entry is incorrect.
    • Generation: Compare whether the end address represented by pftAddr is greater than the start address of the predicted block. If it is smaller, an error has occurred, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
  • fallThroughAddr: The end address of the predicted block.
    • Generation: If fallThroughErr is invalid, it is generated based on pftAddr. Otherwise, it is set to the start address + prediction width.
  • br_taken_mask: Branch prediction result, with each branch (slot) corresponding to a bit indicating whether the branch is predicted as taken.
    • Generation: Generated based on the always_taken field in the FTB entry and the result indicated by the two-bit saturating counter.
  • jalr_target: The jump target of the jalr in this predicted block. -Generation: From the jump target in the tailSlot of the FTB entry.

uFTB Update

The update of uFTB involves updating the FTB entry cache and the two-bit saturating counter, with the update content obtained through the update interface.

In the uFTB predictor, the reading and writing of the cache and the two-bit saturating counter do not conflict, so we do not need to consider timing conflicts between reading and updating and can consider them as two independent parts.

FTB Cache Update

The update process of the FTB cache is simple. The update channel has already specified the PC and the newly generated FTB entry, so it only needs to be written to the specified position in the cache.

FTB cache updating requires two cycles:

  • In the first cycle, calculate the following based on the signals in the update:
    • Which row in the cache corresponds to the update request The PC extracted from the update request is sent to the update request channel of FauFTBWay, and the hit signal returned by each row is calculated.
  • In the second cycle, update according to the position calculated in the first cycle. If no row is hit, uFTB will use a pseudo-LRU replacement algorithm to select the row to be written. Specifically, the ftb_entry signal group in the update channel contains the complete information of the new FTB entry, which is sent to the cache row that needs to be updated.

Two-Bit Saturating Counter Update

The update of the two-bit saturating counter needs to be combined with the actual execution of the subsequent program and the branch instruction information recorded in the FTB entry, which can be obtained from the update channel.

The update of the two-bit saturating counter also requires two cycles:

  • In the first cycle, calculate which two-bit saturating counter corresponding to the conditional branch instruction in the slot needs to be updated, which needs to meet the following conditions:
    • The current branch instruction slot is valid and contains a conditional branch instruction.
    • The current branch instruction slot is not marked as always_taken. (Because after being marked as always_taken, the result of the two-bit saturating counter will not be used.)
    • The current branch instruction slot is not after the branch instruction slot where an actual jump occurred.
  • In the second cycle, update the saturating counter based on the mask generated in the first cycle and the br_taken_mask information in the update channel.

Interface List

信号类型 信号位 信号名 信号描述
input clock 输入时钟
input reset 复位信号
input [35:0] io_reset_vector 用于reset时,reset s1_pc_dup_0 提供的值
input [40:0] io_in_bits_s0_pc_0 输入位s0_pc 的 第0个复制
input [40:0] io_in_bits_s0_pc_1 同上 第1个
input [40:0] io_in_bits_s0_pc_2 同上 第2个
input [40:0] io_in_bits_s0_pc_3 同上 第3个
output [40:0] io_out_s1_pc_0 输出s1_pc 的 第0个复制
output [40:0] io_out_s1_pc_1 同上 第1个
output [40:0] io_out_s1_pc_2 同上 第2个
output [40:0] io_out_s1_pc_3 同上 第3个
output io_out_s1_full_pred_0_br_taken_mask_0 solt 0 是否被预测为 always taken
output io_out_s1_full_pred_0_br_taken_mask_1 solt 1 是否被预测为 always taken
output io_out_s1_full_pred_0_slot_valids_0 solt 0 是否启用
output io_out_s1_full_pred_0_slot_valids_1 solt 1 是否启用
output [40:0] io_out_s1_full_pred_0_targets_0 solt 0 对应的跳转目标地址
output [40:0] io_out_s1_full_pred_0_targets_1 solt 1 对应的跳转目标地址
output [3:0] io_out_s1_full_pred_0_offsets_0 solt 0 中分支指令相对于地址块起始pc的偏移
output [3:0] io_out_s1_full_pred_0_offsets_1 solt 1 中分支指令相对于地址块起始pc的偏移
output [40:0] io_out_s1_full_pred_0_fallThroughAddr 预测块的结束地址
output io_out_s1_full_pred_0_is_br_sharing solt 1(无条件跳转)是否被共享为有条件跳转指令
output io_out_s1_full_pred_0_hit
output io_out_s1_full_pred_1_br_taken_mask_0 类似 io_out_s1_pc_1 io_out_s1_full_pred_0的复制
output io_out_s1_full_pred_1_br_taken_mask_1
output io_out_s1_full_pred_1_slot_valids_0
output io_out_s1_full_pred_1_slot_valids_1
output [40:0] io_out_s1_full_pred_1_targets_0
output [40:0] io_out_s1_full_pred_1_targets_1
output [3:0] io_out_s1_full_pred_1_offsets_0
output [3:0] io_out_s1_full_pred_1_offsets_1
output [40:0] io_out_s1_full_pred_1_fallThroughAddr
output io_out_s1_full_pred_1_is_br_sharing
output io_out_s1_full_pred_1_hit
output io_out_s1_full_pred_2_br_taken_mask_0 同上
output io_out_s1_full_pred_2_br_taken_mask_1
output io_out_s1_full_pred_2_slot_valids_0
output io_out_s1_full_pred_2_slot_valids_1
output [40:0] io_out_s1_full_pred_2_targets_0
output [40:0] io_out_s1_full_pred_2_targets_1
output [3:0] io_out_s1_full_pred_2_offsets_0
output [3:0] io_out_s1_full_pred_2_offsets_1
output [40:0] io_out_s1_full_pred_2_fallThroughAddr
output io_out_s1_full_pred_2_is_br_sharing
output io_out_s1_full_pred_2_hit
output io_out_s1_full_pred_3_br_taken_mask_0 同上
output io_out_s1_full_pred_3_br_taken_mask_1
output io_out_s1_full_pred_3_slot_valids_0
output io_out_s1_full_pred_3_slot_valids_1
output [40:0] io_out_s1_full_pred_3_targets_0
output [40:0] io_out_s1_full_pred_3_targets_1
output [3:0] io_out_s1_full_pred_3_offsets_0
output [3:0] io_out_s1_full_pred_3_offsets_1
output [40:0] io_out_s1_full_pred_3_fallThroughAddr
output io_out_s1_full_pred_3_fallThroughErr
output io_out_s1_full_pred_3_is_br_sharing
output io_out_s1_full_pred_3_hit
output [222:0] io_out_last_stage_meta 输出最后阶段的元信息 io_out_last_stage_meta = {213’h0, resp_meta_pred_way_r_1, resp_meta_hit_r_1}
input io_ctrl_ubtb_enable 控制ubtb是否启用
input io_s0_fire_0 输入s0_fire_0,与 io_out_s1_pc_0 <= io_in_bits_s0_pc_0 的时钟门控相关
input io_s0_fire_1 输入s0_fire_1
input io_s0_fire_2 输入s0_fire_2
input io_s0_fire_3 输入s0_fire_3
input io_s1_fire_0 输入s1_fire_0
input io_s2_fire_0 输入s2_fire_0
input io_update_valid 更新有效性
input [40:0] io_update_bits_pc 传回的预测块pc(用于指示更新的预测块)
input [3:0] io_update_bits_ftb_entry_brSlots_0_offset solt 0 中分支指令相对于地址块起始pc的偏移
input [11:0] io_update_bits_ftb_entry_brSlots_0_lower 跳转目标地址的低位
input [1:0] io_update_bits_ftb_entry_brSlots_0_tarStat 跳转后的 pc 高位是否进退位
input io_update_bits_ftb_entry_brSlots_0_valid 是否启用
input [3:0] io_update_bits_ftb_entry_tailSlot_offset solt 1 中分支指令相对于地址块起始pc的偏移
input [19:0] io_update_bits_ftb_entry_tailSlot_lower 跳转目标地址的低位
input [1:0] io_update_bits_ftb_entry_tailSlot_tarStat 跳转后的 pc 高位是否进退位
input io_update_bits_ftb_entry_tailSlot_sharing 无条件跳转指令槽中存储条件分支指令
input io_update_bits_ftb_entry_tailSlot_valid 是否启用
input [3:0] io_update_bits_ftb_entry_pftAddr Partial Fallthrough Addr 如果预测块中没有跳转,那么程序将会顺序执行到达的地址,预测块的结束地址。
input io_update_bits_ftb_entry_carry pc+pft时是否产生进位
input io_update_bits_ftb_entry_always_taken_0 是否预测为总是跳转
input io_update_bits_ftb_entry_always_taken_1 是否预测为总是跳转
input io_update_bits_br_taken_mask_0 是否跳转
input io_update_bits_br_taken_mask_1 是否跳转
Last modified September 13, 2024: Update the picture of BPU Top. (431c050)