This document will describe the important structures and external interfaces in the BPU, with a description granularity that delves into the code level. By reading this document, you can understand the role of each signal in the Xiangshan branch prediction unit, comprehend the specific implementation methods of various requests, and gain a functional understanding in conjunction with the code.
This document will describe the important structures and external interfaces in the BPU, with a description granularity that delves into the code level. The structures described in the document will be consistent with the Chisel version of the Xiangshan branch prediction unit code, and the signal structures and names also come from the Chisel version code.
This document is intended for those who have already understood the basic design of the Xiangshan branch prediction unit and want to delve into the signal details. You can selectively read according to the content needed for verification or refer to it at any time.
Among them, the FTB entry and the full prediction result interface involve the way the BPU generates prediction results, which is recommended for every reader to read.
1 - FTB Item and Complete Prediction Result Interface
The basic prediction unit of BPU is the branch prediction block, but what is the structure of a branch prediction block? Why can a branch prediction block guide the execution of so many subsequent instructions? This section will describe the core data structure for generating branch prediction blocks—the FTB item, and the complete prediction result structure generated from the FTB item.
FTB item
The FTB item is the core data structure of branch prediction blocks in Xiangshan. It stores the information needed to generate a branch prediction block. When BPU performs predictions, the initial branch prediction block is first generated from a read-out FTB item. Then, this branch prediction block is passed to subsequent predictors, which read and modify the information to generate the final prediction result.
Therefore, to understand the structure of a branch prediction block, we first need to understand the structure of an FTB item. An FTB item corresponds to a branch prediction block, and its general structure is as follows:
First, it is necessary to clarify one piece of information: whether it is a branch prediction block or an FTB item, the number of instructions they can contain is set to a specific limit (16 RVC instructions in the current version of Xiangshan), called the maximum prediction length. This means that if we need to record the position of an instruction within the branch prediction block, we can use a fixed-length bit vector to specify the offset of this instruction relative to the starting address of the prediction block.
The determinant of the branch prediction block’s execution process is the information about the branch instructions. The other instructions are considered ordinary instructions and do not affect the program’s execution flow. Therefore, in a branch prediction block, we only need to record the positions of the branch instructions, while the positions of ordinary instructions are not our concern.
Therefore, the FTB item defines two types of branch instruction slots—brSlots and tailSlot, used to store the branch instructions within the branch prediction block. In the current version of Xiangshan, brSlots contains only one slot, while tailSlot is a separate slot, totaling two slots.
Within the instructions of the maximum prediction length, if a branch instruction appears, the FTB item will record it in the corresponding slot and mark the slot as valid. If too many branch instructions appear, reaching the capacity limit of the FTB item, the excess branch instructions will be handed over to the next FTB item for storage. In each slot, we record the offset of a branch instruction relative to the starting address of the prediction block and information such as its jump target address.
The Unique tailSlot
In RISC-V, branch instructions are mainly divided into two types: conditional branches and unconditional jumps. Therefore, for a branch prediction block, it will contain at most one unconditional jump instruction. Because once this instruction is executed, the program’s execution flow will change, and subsequent instructions will no longer be executed. Hence, we define a type of slot called tailSlot specifically for storing this unconditional jump instruction. As for conditional branch instructions, they are stored in brSlots.
As its name suggests, tailSlot is located in the last slot of the entire prediction block. This is also because once the unconditional jump instruction is filled, the program will definitely jump, and subsequent instructions will be handled by other prediction blocks, so we do not need to care about the subsequent instructions. However, among the instructions before the unconditional jump instruction, we need to care about whether there are conditional branch instructions, because conditional branch instructions may or may not jump. Therefore, we need to record the relevant information of the conditional branch instructions.
tailSlot Sharing
Consider a situation: if no unconditional jump instructions appear from the starting PC of the prediction block to the maximum prediction length, but two conditional branch instructions appear instead, the tailSlot will be idle, and the second conditional branch instruction cannot be stored, causing space waste.
To solve this problem, Xiangshan adopts a method of setting a sharing mark. We can directly store the second branch instruction into the tailSlot and set the sharing mark to true, indicating that the second conditional branch instruction shares the tailSlot of the unconditional jump instruction. This way, the space of the tailSlot is effectively utilized.
The isCall, isRet, and isJalr fields in the prediction block serve the tailSlot. If the tailSlot records an unconditional jump instruction, these fields will further indicate the type of the jump instruction. There is also a field in the FTB item called always_taken, which records whether each conditional branch instruction stored in each slot is always predicted to jump. If so, subsequent predictors can directly adopt this prediction result.
Through the FTB item, we can know the instruction situation in a branch prediction block, including the position and type of branch instructions. This information will be handed over to subsequent predictors, which will predict more accurate jump targets, whether to jump, and other information.
This section describes the complete structural definition of the FTB item, containing the following signals:
valid: Whether the FTB entry is valid.
Interface Type: Bool
brSlots: Conditional branch instruction slots.
Interface Type: Vec(numBrSlot, new FtbSlot(BR_OFFSET_LEN)) (see FtbSlot for interface list)
tailSlot: Unconditional jump instruction slot.
Interface Type: new FtbSlot(JMP_OFFSET_LEN, Some(BR_OFFSET_LEN)) (see FtbSlot for interface list)
pftAddr: Unconditional jump instruction slot
Interface Type: UInt(log2Up(PredictWidth).W)
carry: Unconditional jump instruction slot.
Interface Type: Bool()
isCall: The instruction in the unconditional jump instruction slot is a call instruction.
Interface Type: Bool()
isRet: The instruction in the unconditional jump instruction slot is a ret instruction.
Interface Type: Bool()
isJalr: The instruction in the unconditional jump instruction slot is a jalr instruction.
Interface Type: Bool()
last_may_be_rvi_call: The instruction in the unconditional jump instruction slot may be an RVI type call instruction signal.
Interface Type: Bool()
always_taken: Whether each branch instruction in the prediction block is always predicted as Taken.
Interface Type: Vec(numBr, Bool())
Explanation: pftAddr and carry
Here, pftAddr stands for Partial Fallthrough Address. Fallthrough Address means that if there is no jump in the prediction block, the program will sequentially execute to the address reached. In other words, if the offset of an unconditional jump instruction is 5, then the offset corresponding to the Fallthrough Address is 6. This signal is mainly used to get the return address of the program after a function call, and this concept can be understood as the end address of the prediction block.
Partial means part of the address, which is determined by the address representation method. Here, the address representation method is as follows:
pftAddr only records the middle offset part (the part with a length of log(predictWidth)), and a complete PC can be generated by combining it with the current PC. However, a carry might occur, so a carry bit is recorded separately. carryPos is the position in the instruction address within the prediction block where a carry might occur.
Additionally, last_may_be_rvi_call is an auxiliary signal for this address, indicating that the instruction in the unconditional jump instruction slot is an RVI type call instruction. Since pftAddr assumes the instruction length as the compressed instruction length by default when calculating, the end address is increased by only 2 bytes. If the actual call instruction is not a compressed instruction, it will lead to an incorrect return address calculation. RAS will correct this error based on this signal.
offset: The offset of the instruction in the slot relative to the start address of the prediction block.
Interface Type: UInt(log2Ceil(PredictWidth).W)
lower: The lower bits of the jump target address.
Interface Type: UInt(offsetLen.W)
Note: The lower is set to 12 or 20 bits because the addressing capability of branch instructions is 12 bits, while the addressing capability of jump instructions is 20 bits.
tarStat: Whether the high bits of the PC after the jump are incremented or decremented.
Note: The jump target address is calculated from the high bits of the current PC, the tarStat, and the lower field. The lower field directly stores the lower bits of the jump target address. The high bits of the current PC are adjusted according to the tarStat, then concatenated with the lower to get the actual jump target address. The tarStat can take three values: 0 - no increment or decrement, 1 - increment, 2 - decrement.
sharing: Indicates that a conditional branch instruction is stored in an unconditional jump instruction slot.
This interface defines the complete branch prediction results, included in the prediction results of each pipeline stage.
The full branch prediction result interface is similar to the FTB entry and is initially generated from the FTB entry. Two slots are split into individual signals: slot_valids, targets, offsets, is_br_sharing, etc. Additionally, several fields are added such as br_taken_mask, jalr_target to facilitate the provision of precise prediction results by subsequent predictors. The hit indicates whether an FTB entry is hit, i.e., the PC in the current prediction round indexed to an FTB entry.
Full Interface List:
hit: Indicates whether the FTB entry is hit.
Interface Type: Bool()
slot_valids: Indicates whether each slot in the FTB entry is valid.
Interface Type: Vec(totalSlot, Bool())
targets: The jump target address corresponding to each slot.
Interface Type: Vec(totalSlot, UInt(VAddrBits.W))
offsets: The offset of the instruction in each slot relative to the start address of the prediction block.
fallThroughAddr: The end address of the prediction block.
Interface Type: UInt(VAddrBits.W)
fallThroughErr: Indicates that the pftAddr recorded in the FTB entry is incorrect.
Interface Type: Bool()
is_jal: Indicates that the prediction block contains a jal instruction.
Interface Type: Bool()
is_jalr: Indicates that the prediction block contains a jalr instruction.
Interface Type: Bool()
is_call: Indicates that the prediction block contains a call instruction.
Interface Type: Bool()
is_ret: Indicates that the prediction block contains a ret instruction.
Interface Type: Bool()
last_may_be_rvi_call: Indicates that the instruction in the unconditional jump instruction slot might be an RVI type call instruction.
Interface Type: Bool()
is_br_sharing: Indicates that the last slot (tailSlot) stores a conditional branch instruction.
Interface Type: Bool()
br_taken_mask: The branch prediction result, with each bit corresponding to a branch (slot), indicating whether the branch is predicted to be taken.
Interface Type: Vec(numBr, Bool())
jalr_target: The jump target of the jalr instruction in the prediction block.
Interface Type: UInt(VAddrBits.W)
2 - Redirection and Update Interfaces
Redirection and update requests are the two main types of information sent from the FTQ to the BPU. This section details the specific structure of these request interfaces.
This interface defines the redirection requests from the branch prediction unit, mainly used to redirect the state of the branch predictor.
The BranchPredictionRedirect interface inherits from the Redirect interface and includes many signals, only a subset of which are used by the BPU redirection. The documented structure contains only the interfaces used by the BPU.
Redirection requests have two sources: those generated by comparing IFU pre-decode information and those generated during the backend execution process.
In redirection requests, the key information is cfiUpdate, which corresponds to a control flow instruction. This information pertains to an instruction where the BPU made a prediction error. For example, if the BPU indicates that the third instruction in the prediction block is a normal instruction with no jump, but the pre-decode shows it is an unconditional jump instruction, a prediction error has occurred. In this case, the FTQ generates a redirection request. The cfiUpdate in the redirection request corresponds to this unconditional jump instruction.
The information in cfiUpdate can be broadly classified into three types:
Information about the corresponding instruction and its execution status. This includes the slot (shift) and PC of the instruction in the prediction block, the type-related information of the instruction (pd), and the execution status, such as jump target and whether it jumps.
History maintenance related information. The redirection request contains branch history information corresponding to the prediction block of the instruction to help the BPU restore branch history. folded_hist represents the global folded history, histPtr represents the global history pointer, and other information assists in maintaining branch history. For detailed information, refer to the BPU Top-Level Module.
RAS maintenance related information. For detailed meaning, refer to the RAS sub-predictor documentation.
The meaning of level is whether the redirection includes this instruction. If not included, the redirection request receiver will assume the instruction has been executed, and the next prediction will start from the next instruction. The BPU top-level will default to not including this instruction, and upon receiving a redirection request, it will include the execution status of this instruction in the branch history.
The detailed signal list for the redirection interface is as follows:
level: Indicates whether the redirection request includes the current position. Low means redirection after this position, high means redirection at this position.
Interface Type: UInt(1.W)
cfiUpdate: Control flow instruction update information
Interface Type: CfiUpdateInfo
Interface List
pc: The PC of the instruction corresponding to the redirection request
Interface Type: UInt(VaddrBits.W)
pd: Pre-decode information of the redirection instruction
isRVC: Whether it is a compressed instruction
Interface Type: Bool
isCall: Whether it is a function call
Interface Type: Bool
isRet: Whether it is a function return
Interface Type: Bool
target: Target address of the redirection request instruction
Interface Type: UInt(VaddrBits.W)
taken: Whether the redirection request instruction is taken
Interface Type: Bool
shift: Slot in which the redirection request instruction is located, 0 if it is a normal instruction.
Interface Type: UInt((log2Ceil(numBr)+1).W)
addIntoHist: Whether to include the execution information of the redirection request instruction in the branch history.
Interface Type: Bool
folded_hist: Folded history corresponding to the redirection request
This interface defines the update requests for the branch predictor, mainly used to update the state of the branch predictor. The document lists only the interfaces used in the BPU.
Update requests correspond to each branch prediction block. When a branch prediction block in the FTQ has been executed, the FTQ generates an update request for this prediction block to train the predictor. Thus, an important role of the update request is to feed back the actual execution status of the instructions to the BPU. Additionally, in the Xiangshan branch prediction unit, the update request is responsible for updating FTB entries.
The information in the update request can be broadly classified into four categories:
PC: Indicates the starting address of the prediction block, specifying which prediction block the update request corresponds to
FTB Entry Update Information: The update channel contains an FTB entry structure (ftb_entry), outputs the newly generated FTB entry from the FTQ, and indicates whether it is the same as the old FTB entry (old_entry)
Actual Execution Status Information of Instructions: The update channel indicates the execution status of branch and unconditional jump instructions in the prediction block, and provides the address and final target of control flow instructions (i.e., instructions that jump)
Predictor-Related Data Corresponding to the Prediction Block: Contains spec_info and meta information (refer to the BPU Global Interface Documentation for details)
The interface list of the update request is as follows:
pc PC of the update request (starting address of the prediction block)
Interface Type:UInt(VAddrBits.W)
ftb_entry Updated FTB entry
Interface Type:new FTBEntry()
Interface List:See(FTBEntry)
old_entry Whether the updated FTB entry is the same as the old FTB entry
Interface Type:Bool()
br_taken_mask Mask indicating whether each slot instruction in the prediction block jumps
Interface Type:Vec(numBr, Bool())
mispred_mask Mask indicating prediction errors in the prediction block. The first and second bits indicate whether the two conditional branch instructions were mispredicted, and the third bit indicates whether the unconditional jump instruction was mispredicted.
Interface Type:Vec(numBr+1, Bool())
jmp_taken Unconditional jump instruction triggered in the prediction block
Interface Type:Bool()
cfi_idx Index of the control flow instruction in the prediction block
meta Last stage meta information corresponding to the prediction block
Interface Type:UInt(MaxMetaLength.W)
3 - BPU Global Interface
This section introduces the definition of the overall external interaction interface of the Xiangshan Branch Prediction Unit, including the presentation of global branch prediction results and single pipeline stage prediction results.
PredirectIO is the overall external interface of the branch predictor (BPU). It mainly handles the interaction between the branch predictor (BPU) and the fetch target queue (FTQ), which includes the following parts:
bpu_to_ftq: Information sent from BPU to FTQ, mainly for sending branch prediction results from BPU to FTQ
Interface type: BpuToFtqIO
Interface type:
resp: Global branch prediction information sent from BPU to FTQ
Interface type:DecoupledIO(new BpuToFtqBundle())
BpuToFtqBundle inherits from BranchPredictionResp,without additional signals
Interface type:See (BranchPredictionResp)
ftq_to_bpu: Information sent from FTQ to BPU, mainly for handling redirect and update requests
This interface defines all the prediction result information of the branch predictor, including the prediction results of each stage and the related information output by the last pipeline stage.
s1 Branch prediction result of the s1 pipeline stage
s2 Branch prediction result of the s2 pipeline stage
s3 Branch prediction result of the s3 pipeline stage
Interface type:BranchPredictionBundle
Interface type:See (BranchPredictionBundle)
last_stage_meta Metadata of the prediction result output by the last pipeline stage. It is a bit vector output by each predictor and combined by the Composer.
Interface type:UInt(MaxMetaLength.W)
last_stage_spec_info Related information of the prediction result output by the last pipeline stage
This interface defines the branch prediction result information output by each pipeline stage,
pc Starting PC of the predicted block
Interface type: Vec(numDup, UInt(VAddrBits.W)) numDup is only for register replication, with identical signals
valid Whether the prediction result is valid
Interface type: Vec(numDup, Bool())
hasRedirect Whether a redirect is needed
Interface description: Only the s2 and s3 stages will redirect, and the prediction result of this stage will override the previous pipeline stage’s result when a redirect occurs
Interface type: Vec(numDup, Bool())
ftq_idx FTQ pointer, pointing to the FTQ entry corresponding to the prediction information of this stage
Interface type: new FtqPtr
full_pred Complete branch prediction result
Interface type: Vec(numDup, new FullBranchPrediction)
Interface list: See (FullBranchPrediction)
4 - Base Predictor Class and Sub Predictor Interface
This document introduces the sub-predictor interface in the Xiangshan BPU, as well as the use of the sub-predictor base class. Reading this document can help you understand the external interactions of sub-predictors and the use of signals in the sub-predictor base class.
In the Xiangshan branch prediction unit, all its sub-predictors and the class implementations of Composer are inherited from the sub-predictor base class (BasePredictor). The sub-predictor interface (BasePredictorIO) is also defined in the sub-predictor base class. Therefore, we can consider that Composer and all sub-predictors have the same interface.
In the understanding and verification of sub-prediction, our most direct external interaction occurs in the sub-predictor interface and some variables defined in the sub-predictor base class. Therefore, before verifying the sub-predictor, it is strongly recommended that you read and understand this section of the document.
The general content and usage of the sub-branch predictor interface have been introduced in the Xiangshan Branch Prediction Unit (BPU) Basic Design section. This document will focus on the signal details of the interface.
The interface information of the global folding history consists of only one FoldedHistory list.
hist List of folded histories
Interface Type:MixedVec(gen.map{case (l, cl) => new FoldedHistory(l, cl, numBr)})
The interface information of FoldedHistory also has only one item.
folded_hist Single folded history, with a bit width equal to the compressed history length.
Interface Type:UInt(compLen.W)
This means that the interface of the global folding history is actually a list that stores folded histories, where each item is a folded history of a specific length.
Base Predictor Class
The base predictor class defines several signals, which can be accessed in each sub-predictor, and several connections are made within it.
Most of the signals are relatively easy to understand. We need to pay particular attention to the pc of each pipeline, which also involves your understanding of pipeline control signals. Therefore, next, we will introduce the meanings of pipeline control signals that need to be paid attention to in sub-predictors, as well as the meanings of the s1_pc, s2_pc, s3_pc signals.
There are three groups of pipeline control signals:
fire signals (s0, s1, s2, s3)
redirect signals (s2, s3)
ready signals (s1, s2, s3)
The pc signals in the base predictor class are divided into four groups, s0_pc_dup, s1_pc_dup, s2_pc_dup, s3_pc_dup. Each group of signals contains multiple pc signals, which are exactly the same and are duplicated for some other reasons. Therefore, we can simply consider them as s0_pc, s1_pc, s2_pc, s3_pc.
Their usage can be seen in the following diagram:
Their relationship with the pipeline control signals is as follows:
s0_pc is directly connected from the in.s0_pc in the input interface.
When s0_fire is active, the next cycle s1_pc will output the value of s0_pc.
When s1_fire is active, the next cycle s2_pc will output the value of s1_pc.
When s2_fire is active, the next cycle s3_pc will output the value of s2_pc.
In other words, the fire signal affects whether the data of the next cycle is valid. For example, the s0_fire signal affects whether the data of the s1 pipeline is valid, and the s1_fire signal affects whether the data of the s2 pipeline is valid.
Whether the fire signal is valid depends on whether the data of this pipeline stage is valid and whether the next pipeline stage is ready. For example, the s1_fire signal is valid only if the data of the s1 stage is valid and the s2_ready signal from the sub-predictor output is valid. At this point, it can be considered that the data processing of the s1 stage is completed, the s2 stage is ready, and the data of the next cycle will be directly sent to the s2 stage.
Therefore, in the sub-predictor, taking the s1 stage as an example, s1_ready can block data from entering the s1 stage. When s1_ready is active, the data for the s1 stage will be valid in the next cycle. When s1_fire is active, it indicates that the data in the s1 stage is already valid and the predictor has generated the result for the s1 stage. The data will then be directly sent to the s2 stage in the next cycle.
The redirect signal is relatively clear. For example, in the s2 stage, when s2_redirect is valid, it indicates that when s2_fire is valid, the s2 prediction result is different from the s1 prediction result in the previous cycle.