uFTB Branch Predictor
Categories:
Introduction to uFTB
uFTB is the first predictor among all the predictors in Xiangshan, and it serves as the cornerstone for other predictors to generate prediction results. uFTB works in the s1 stage. It can generate prediction results within the current cycle after obtaining s1_pc and output them in the s1 channel, without modifying other channels. It provides the position of the branch instruction and the target of the instruction. Subsequent predictors will further predict based on this result.
Its essence is an FTB item cache, which stores FTB items, and the basic prediction result will be directly generated from the read-out FTB item.
Therefore, before you start reading the document, make sure you understand the FTB items and their meanings, as well as the specific details of the prediction result interface.
Functionality of uFTB
- Cache FTB items and generate one-cycle prediction results: uFTB maintains a small FTB item cache. After receiving PC, it reads out the FTB item corresponding to the PC within one cycle and generates an s1 stage prediction result from the FTB item.
- Maintain two-bit saturating counters to provide basic conditional branch results: uFTB maintains two-bit saturating counters for each line of the FTB item cache. The direction prediction result is reflected in the prediction result output of uFTB.
- Update the FTB cache and two-bit saturating counters based on update requests
uFTB Cache Structure
As mentioned above, uFTB is essentially a small cache that stores FTB items. Its approximate structure is shown in the figure below.
In the current version of Xiangshan, uFTB has a total of 32 cache lines, each cache line is called FauFTBWay
, and one FTB item can be stored in each cache line.
When s1 pipeline is valid, uFTB will use s1_pc
to determine which item of the uFTB cache to read out. The cache is indexed based on the tag field in PC, which is defined as pc[16:1], i.e., taking 16 bits from PC as an identifier to match a certain line in the cache.
Each line in the cache, i.e., the data request interface in FauFTBWay
, has three items:
- req_tag: Input tag identifier extracted from pc
- resp: Output the FTB item stored in this line
- resp_hit: Output indicates whether the FTB item in this line matches req_tag
uFTB connects the tag to the data request interface of each line in the cache and selects the hit FTB item based on the
resp_hit
signal. Subsequent steps will generate a complete prediction result based on this FTB item.
Two-Bit Saturating Counters
uFTB maintains two-bit saturating counters for each line of the FTB item cache. As we know, an FTB item can store up to two conditional branch instructions, so each line’s two-bit saturating counters also have two, responsible for providing rough prediction results for the conditional branch instructions in them.
When indexing the FTB item, uFTB also indexes the corresponding two-bit saturating counters.
When the prediction result is generated, it will be based on the FTB item and the contents of the two two-bit saturating counters corresponding to it.
Prediction Result Generation
After indexing the corresponding FTB item and two two-bit saturating counters based on s1_pc, uFTB needs to generate a prediction result based on them. The prediction result generated by uFTB will be outputted through the s1 channel when s1 pipeline is valid, and will not be modified for s2 and s3 channels.
The signal generation method in the s1 prediction result can refer to the following list:
- hitWhether the FTB item hits
- Generation method:
resp_hit
signal inFauFTBWay
, one of them is valid
- Generation method:
- slot_valids: Slot valid bit, indicating whether each slot in the ftb item is valid
- targets: Jump target address corresponding to each slot
- offsets: The offset of each instruction in the slot relative to the starting address of the prediction block
- is_jal: Whether the prediction block contains a jal instruction
- is_jalr: Whether the prediction block contains a jalr instruction
- is_call: Whether the prediction block contains a call instruction
- is_ret: Whether the prediction block contains a ret instruction
- last_may_be_rvi_call: Signal indicating that the last slot in the prediction block may be an RVI type call instruction
- is_br_sharing: The signal stored in the last slot (tailSlot) indicates a conditional branch instruction.
- Generation: Exported from the corresponding field in the FTB entry.
- fallThroughErr: The
pftAddr
recorded in the FTB entry is incorrect.- Generation: Compare whether the end address represented by
pftAddr
is greater than the start address of the predicted block. If it is smaller, an error has occurred, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
- Generation: Compare whether the end address represented by
- fallThroughAddr: The end address of the predicted block.
- Generation: If
fallThroughErr
is invalid, it is generated based onpftAddr
. Otherwise, it is set to the start address + prediction width.
- Generation: If
- br_taken_mask: Branch prediction result, with each branch (slot) corresponding to a bit indicating whether the branch is predicted as taken.
- Generation: Generated based on the
always_taken
field in the FTB entry and the result indicated by the two-bit saturating counter.
- Generation: Generated based on the
- jalr_target: The jump target of the jalr in this predicted block. -Generation: From the jump target in the tailSlot of the FTB entry.
uFTB Update
The update of uFTB involves updating the FTB entry cache and the two-bit saturating counter, with the update content obtained through the update interface.
In the uFTB predictor, the reading and writing of the cache and the two-bit saturating counter do not conflict, so we do not need to consider timing conflicts between reading and updating and can consider them as two independent parts.
FTB Cache Update
The update process of the FTB cache is simple. The update channel has already specified the PC and the newly generated FTB entry, so it only needs to be written to the specified position in the cache.
FTB cache updating requires two cycles:
- In the first cycle, calculate the following based on the signals in the update:
- Which row in the cache corresponds to the update request The PC extracted from the update request is sent to the update request channel of FauFTBWay, and the hit signal returned by each row is calculated.
- In the second cycle, update according to the position calculated in the first cycle. If no row is hit, uFTB will use a pseudo-LRU replacement algorithm to select the row to be written. Specifically, the ftb_entry signal group in the update channel contains the complete information of the new FTB entry, which is sent to the cache row that needs to be updated.
Two-Bit Saturating Counter Update
The update of the two-bit saturating counter needs to be combined with the actual execution of the subsequent program and the branch instruction information recorded in the FTB entry, which can be obtained from the update channel.
The update of the two-bit saturating counter also requires two cycles:
- In the first cycle, calculate which two-bit saturating counter corresponding to the conditional branch instruction in the slot needs to be updated, which needs to meet the following conditions:
- The current branch instruction slot is valid and contains a conditional branch instruction.
- The current branch instruction slot is not marked as always_taken. (Because after being marked as always_taken, the result of the two-bit saturating counter will not be used.)
- The current branch instruction slot is not after the branch instruction slot where an actual jump occurred.
- In the second cycle, update the saturating counter based on the mask generated in the first cycle and the
br_taken_mask
information in the update channel.
Interface List
信号类型 | 信号位 | 信号名 | 信号描述 |
---|---|---|---|
input | clock | 输入时钟 | |
input | reset | 复位信号 | |
input | [35:0] | io_reset_vector | 用于reset时,reset s1_pc_dup_0 提供的值 |
input | [40:0] | io_in_bits_s0_pc_0 | 输入位s0_pc 的 第0个复制 |
input | [40:0] | io_in_bits_s0_pc_1 | 同上 第1个 |
input | [40:0] | io_in_bits_s0_pc_2 | 同上 第2个 |
input | [40:0] | io_in_bits_s0_pc_3 | 同上 第3个 |
output | [40:0] | io_out_s1_pc_0 | 输出s1_pc 的 第0个复制 |
output | [40:0] | io_out_s1_pc_1 | 同上 第1个 |
output | [40:0] | io_out_s1_pc_2 | 同上 第2个 |
output | [40:0] | io_out_s1_pc_3 | 同上 第3个 |
output | io_out_s1_full_pred_0_br_taken_mask_0 | solt 0 是否被预测为 always taken | |
output | io_out_s1_full_pred_0_br_taken_mask_1 | solt 1 是否被预测为 always taken | |
output | io_out_s1_full_pred_0_slot_valids_0 | solt 0 是否启用 | |
output | io_out_s1_full_pred_0_slot_valids_1 | solt 1 是否启用 | |
output | [40:0] | io_out_s1_full_pred_0_targets_0 | solt 0 对应的跳转目标地址 |
output | [40:0] | io_out_s1_full_pred_0_targets_1 | solt 1 对应的跳转目标地址 |
output | [3:0] | io_out_s1_full_pred_0_offsets_0 | solt 0 中分支指令相对于地址块起始pc的偏移 |
output | [3:0] | io_out_s1_full_pred_0_offsets_1 | solt 1 中分支指令相对于地址块起始pc的偏移 |
output | [40:0] | io_out_s1_full_pred_0_fallThroughAddr | 预测块的结束地址 |
output | io_out_s1_full_pred_0_is_br_sharing | solt 1(无条件跳转)是否被共享为有条件跳转指令 | |
output | io_out_s1_full_pred_0_hit | ||
output | io_out_s1_full_pred_1_br_taken_mask_0 | 类似 io_out_s1_pc_1 io_out_s1_full_pred_0的复制 | |
output | io_out_s1_full_pred_1_br_taken_mask_1 | ||
output | io_out_s1_full_pred_1_slot_valids_0 | ||
output | io_out_s1_full_pred_1_slot_valids_1 | ||
output | [40:0] | io_out_s1_full_pred_1_targets_0 | |
output | [40:0] | io_out_s1_full_pred_1_targets_1 | |
output | [3:0] | io_out_s1_full_pred_1_offsets_0 | |
output | [3:0] | io_out_s1_full_pred_1_offsets_1 | |
output | [40:0] | io_out_s1_full_pred_1_fallThroughAddr | |
output | io_out_s1_full_pred_1_is_br_sharing | ||
output | io_out_s1_full_pred_1_hit | ||
output | io_out_s1_full_pred_2_br_taken_mask_0 | 同上 | |
output | io_out_s1_full_pred_2_br_taken_mask_1 | ||
output | io_out_s1_full_pred_2_slot_valids_0 | ||
output | io_out_s1_full_pred_2_slot_valids_1 | ||
output | [40:0] | io_out_s1_full_pred_2_targets_0 | |
output | [40:0] | io_out_s1_full_pred_2_targets_1 | |
output | [3:0] | io_out_s1_full_pred_2_offsets_0 | |
output | [3:0] | io_out_s1_full_pred_2_offsets_1 | |
output | [40:0] | io_out_s1_full_pred_2_fallThroughAddr | |
output | io_out_s1_full_pred_2_is_br_sharing | ||
output | io_out_s1_full_pred_2_hit | ||
output | io_out_s1_full_pred_3_br_taken_mask_0 | 同上 | |
output | io_out_s1_full_pred_3_br_taken_mask_1 | ||
output | io_out_s1_full_pred_3_slot_valids_0 | ||
output | io_out_s1_full_pred_3_slot_valids_1 | ||
output | [40:0] | io_out_s1_full_pred_3_targets_0 | |
output | [40:0] | io_out_s1_full_pred_3_targets_1 | |
output | [3:0] | io_out_s1_full_pred_3_offsets_0 | |
output | [3:0] | io_out_s1_full_pred_3_offsets_1 | |
output | [40:0] | io_out_s1_full_pred_3_fallThroughAddr | |
output | io_out_s1_full_pred_3_fallThroughErr | ||
output | io_out_s1_full_pred_3_is_br_sharing | ||
output | io_out_s1_full_pred_3_hit | ||
output | [222:0] | io_out_last_stage_meta | 输出最后阶段的元信息 io_out_last_stage_meta = {213’h0, resp_meta_pred_way_r_1, resp_meta_hit_r_1} |
input | io_ctrl_ubtb_enable | 控制ubtb是否启用 | |
input | io_s0_fire_0 | 输入s0_fire_0,与 io_out_s1_pc_0 <= io_in_bits_s0_pc_0 的时钟门控相关 | |
input | io_s0_fire_1 | 输入s0_fire_1 | |
input | io_s0_fire_2 | 输入s0_fire_2 | |
input | io_s0_fire_3 | 输入s0_fire_3 | |
input | io_s1_fire_0 | 输入s1_fire_0 | |
input | io_s2_fire_0 | 输入s2_fire_0 | |
input | io_update_valid | 更新有效性 | |
input | [40:0] | io_update_bits_pc | 传回的预测块pc(用于指示更新的预测块) |
input | [3:0] | io_update_bits_ftb_entry_brSlots_0_offset | solt 0 中分支指令相对于地址块起始pc的偏移 |
input | [11:0] | io_update_bits_ftb_entry_brSlots_0_lower | 跳转目标地址的低位 |
input | [1:0] | io_update_bits_ftb_entry_brSlots_0_tarStat | 跳转后的 pc 高位是否进退位 |
input | io_update_bits_ftb_entry_brSlots_0_valid | 是否启用 | |
input | [3:0] | io_update_bits_ftb_entry_tailSlot_offset | solt 1 中分支指令相对于地址块起始pc的偏移 |
input | [19:0] | io_update_bits_ftb_entry_tailSlot_lower | 跳转目标地址的低位 |
input | [1:0] | io_update_bits_ftb_entry_tailSlot_tarStat | 跳转后的 pc 高位是否进退位 |
input | io_update_bits_ftb_entry_tailSlot_sharing | 无条件跳转指令槽中存储条件分支指令 | |
input | io_update_bits_ftb_entry_tailSlot_valid | 是否启用 | |
input | [3:0] | io_update_bits_ftb_entry_pftAddr | Partial Fallthrough Addr 如果预测块中没有跳转,那么程序将会顺序执行到达的地址,预测块的结束地址。 |
input | io_update_bits_ftb_entry_carry | pc+pft时是否产生进位 | |
input | io_update_bits_ftb_entry_always_taken_0 | 是否预测为总是跳转 | |
input | io_update_bits_ftb_entry_always_taken_1 | 是否预测为总是跳转 | |
input | io_update_bits_br_taken_mask_0 | 是否跳转 | |
input | io_update_bits_br_taken_mask_1 | 是否跳转 |