Basic Design of the Xiangshan Branch Prediction Unit (BPU)
This document introduces the basic design principles of the Xiangshan branch prediction unit. By reading this document, you can understand the approximate workflow of the Xiangshan BPU without needing to know specific signal names and code details.
In processor design, a well-designed branch prediction unit (BPU) is a key component for improving processor performance. It guides the processor's instruction fetch, determining where the next instruction should be fetched from and executed. The BPU is the starting point of an instruction's lifecycle, which makes it a good place to begin exploring a high-performance processor.
This is also true for Xiangshan, a high-performance six-issue out-of-order processor, which naturally requires a branch prediction unit with high accuracy and efficiency. The design of a branch prediction unit must weigh many factors, such as timing, structural complexity, silicon area, prediction accuracy, and the speed of recovery from mispredictions. Through many clever design choices, the Xiangshan BPU strikes a good balance among these factors, giving it high branch prediction efficiency and accuracy and providing a solid foundation for supplying instructions to the backend.
In this section, we will introduce the basic design of the Xiangshan branch prediction unit. By reading this section, you can learn the following:
Basic concepts of branch prediction
The basic prediction unit of the Xiangshan BPU: the branch prediction block
External interfaces of the Xiangshan branch prediction unit
Basic structure of the Xiangshan branch prediction unit
Basic timing of the Xiangshan branch prediction unit
Anyone participating in the BPU crowdsourced verification work should read this section first to gain a basic understanding of the Xiangshan branch prediction unit.
1 - What is Branch Prediction
The branch prediction unit, as the name suggests, performs one basic task: branch prediction. Before delving into the branch prediction unit, it is necessary to understand what branch prediction is.
Why Do We Need Branch Prediction?
There are two main reasons why branch prediction is needed: first, a program's execution flow contains branch instructions; second, high-performance processors use pipelined designs.
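The Program's Execution Flow Contains Branch Instructions
Consider a simple piece of C code along the following lines (a minimal example reconstructed to be consistent with the assembly shown further below):

```c
int x = 10;
int y = 20;
int result = 0;

if (x >= y) {
    result = x + y;
} else {
    result = x - y;
}
```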
The above is a piece of C code. It first defines three variables x, y, and result, and then assigns a value to result based on the comparison of x and y. Observe that the program assigns values to the variables in sequence in the first three lines. In the 5th line, however, the if statement causes the program to branch: because x < y here, execution jumps directly from the 5th line to the 8th line, creating a branch in the program's execution.
After translating into RISC-V assembly code, the code is as follows:
```
li  a0, 10               # x = 10
li  a1, 20               # y = 20
li  a2, 0                # result = 0
blt a0, a1, else_branch  # Jump to else_branch if x < y
add a2, a0, a1           # Execute result = x + y
j   end                  # Jump to end
else_branch: sub a2, a0, a1  # Execute result = x - y
end:
```
It can be seen that the assembly preserves the program's branching behavior. The first three instructions execute in sequence. Then a special instruction, blt, appears, which we call a branch instruction: it determines whether the following instruction is executed based on the relationship between x and y, and its presence creates a branch in the program's execution.
High-performance Processors Use Pipeline Design
In a pipelined processor, instruction fetch happens several stages before a branch's condition and target are actually resolved by execution. If the fetch unit simply waited for each branch to resolve before fetching the next instruction, the pipeline would stall for several cycles on every branch. Therefore, the concept of branch prediction arises: if we can accurately predict the address of the next instruction before the execution result is produced, the processor can continue to run efficiently.
Feasibility of Branch Prediction
Why can branch prediction be done? Let’s look at an example:
if (data >= 128) sum += data;
Assuming that this instruction will be executed repeatedly, and data is incremented from 0, i.e., data = 0, 1, 2, 3 … 128, 129…, let’s analyze the behavior of executing this instruction each time.
```
T = branch taken
N = branch not taken

data   = 0, 1, 2, 3, 4, ..., 126, 127, 128, 129, 130, ..., 250, 251, 252, ...
branch = N, N, N, N, N, ...,   N,   N,   T,   T,   T, ...,   T,   T,   T, ...
       = NNNNNNNNNN...NNNNNNN TTTTTTTTT...TTTTTTTTTT   (easy to predict)
```
It can be seen that in the first 128 times, the branch is always Not Taken (the condition is not met), but after 128 times, the branch is always Taken (the condition is met). If we predict whether it is Taken based on whether it was Taken last time, we will only make one prediction error throughout the prediction process.
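This claim is easy to check with a toy simulation of the "predict the same as last time" strategy (a software model for illustration, not a hardware design):

```c
#include <stdbool.h>
#include <stdio.h>

/* Replay the data = 0, 1, 2, ... example with a predictor that always
 * predicts the outcome of the previous execution. */
int main(void) {
    bool last = false;   /* start out predicting not-taken */
    int mispredicts = 0;
    for (int data = 0; data < 256; data++) {
        bool taken = (data >= 128);  /* actual outcome of "data >= 128" */
        if (taken != last) mispredicts++;
        last = taken;                /* remember the latest outcome */
    }
    printf("mispredictions: %d\n", mispredicts);  /* prints 1 (at data == 128) */
    return 0;
}
```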
The occurrence of the above phenomenon is due to a basic fact—whether a branch instruction jumps is related to the past jumping behavior of that instruction. By summarizing the past jumping rules, we can make a relatively accurate prediction for this jump, which also makes branch prediction possible.
In fact, the jump of branch instructions is also related to factors such as the jumping situation of other branch instructions. Fully exploiting effective information to produce accurate prediction results is one of the main tasks of branch prediction.
Basic Types of Branch Prediction
In RISC-V, branch instructions include two types:
Conditional Branch Instructions (beq, bne, blt, bge, bltu, bgeu) For these instructions, whether to jump is determined by the condition in the instruction, and the jump target can be obtained from the instruction. Therefore, we need to predict whether the instruction will jump.
Unconditional Jump Instructions (jal, jalr) For these instructions, the jump is always executed when encountered, but the jump target may be specified by a register. Therefore, we need to predict the jump target of the instruction.
Fortunately, thanks to the concise design of the RISC-V architecture, we do not need to handle conditional jump instructions: every jump instruction we need to predict is unconditional, and every conditional instruction is a branch. This is also convenient for our design.
From the above analysis, we can summarize the two basic types of branch prediction—direction prediction and target address prediction.
Direction Prediction of Branch Instructions
Direction prediction corresponds to the conditional branch instructions in RISC-V: we need to predict whether the branch will jump. This is called direction prediction.
Two-Bit Saturating Counters
Direction prediction has a very simple and efficient method called the two-bit saturating counter. The basic idea is shown in the figure below.
The two-bit saturating counter is regarded as a state machine, and we maintain such a state machine for each branch instruction. When a branch instruction is taken, the corresponding state in the diagram moves to the right; otherwise, it moves to the left. So, the next time we encounter this branch instruction, we first look up its two-bit saturating counter. If the state is more biased to the right, we predict it to be taken; otherwise, we predict it not to be taken.
Of course, it’s impractical to maintain a two-bit saturating counter for each branch instruction. Therefore, in practice, we usually use part of the PC or a hash method to index the two-bit saturating counter, as shown in the diagram below.
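As a concrete sketch, the following C code models a table of two-bit saturating counters indexed by PC bits (the table size, index function, and names are illustrative assumptions, not Xiangshan's actual design):

```c
#include <stdint.h>
#include <stdbool.h>

#define PHT_BITS 10
#define PHT_SIZE (1 << PHT_BITS)

/* One two-bit saturating counter per entry:
 * 0 = strongly not-taken, 1 = weakly not-taken,
 * 2 = weakly taken,       3 = strongly taken.  */
static uint8_t pht[PHT_SIZE];

/* Index with low PC bits (assuming 4-byte-aligned PCs for simplicity). */
static inline uint32_t pht_index(uint32_t pc) {
    return (pc >> 2) & (PHT_SIZE - 1);
}

/* Predict taken when the state leans to the "taken" side. */
bool predict(uint32_t pc) {
    return pht[pht_index(pc)] >= 2;
}

/* Train: move toward "taken" when taken, toward "not-taken" otherwise,
 * saturating at 3 and 0. */
void train_direction(uint32_t pc, bool taken) {
    uint8_t *ctr = &pht[pht_index(pc)];
    if (taken && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
}
```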
Branch History
Branch history is among the most commonly used data in branch prediction and the basis of most branch prediction algorithms, because it directly records the past jumping behavior of instructions.
There are two basic types of branch history:
Local Branch History Maintain a set of registers for each branch instruction, recording the historical jumping behavior of that instruction.
For example: 0101000000101 (0 means not taken, 1 means taken)
Global Branch History All instructions share a set of registers, recording the branching behavior during program execution.
For example:
```
   beq a0, a1, label1   not taken   record 0
   bne a1, a2, label2   not taken   record 0
-> beq a2, a3, label4   taken       record 1
```
After executing these three different branch instructions, the global branch history becomes 001.
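A minimal sketch of a global history register update in C (the register width and shift direction are arbitrary choices for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

/* Global history register (GHR): one bit per resolved branch,
 * with the newest outcome shifted in at the least-significant end. */
static uint16_t ghr = 0;   /* 16 bits of global history, initially empty */

void ghr_update(bool taken) {
    ghr = (uint16_t)((ghr << 1) | (taken ? 1u : 0u));
}
```

Replaying the three branches above, ghr_update(false), ghr_update(false), ghr_update(true) leaves the low three bits of ghr at 001.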
Branch Target Address Prediction
In the RISC-V architecture, branch target address prediction refers to predicting the target address of unconditional jump instructions (e.g., jal, jalr). Since these instructions always perform a jump operation, we need to predict their target address.
Branch Target Buffer (BTB)
BTB is a common method for predicting target addresses. Its core idea is to use a cache to store the target addresses of past unconditional jump instructions. When encountering the same unconditional jump instruction again, the BTB can be checked to see if there is a record for that instruction. If so, the recorded target address is used as the predicted target address for the current execution.
The diagram below illustrates this:
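A direct-mapped BTB can be sketched in C as follows (entry count, tag layout, and names are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_BITS 9
#define BTB_SIZE (1 << BTB_BITS)

/* A direct-mapped BTB entry: a tag to recognize the instruction,
 * the last-seen jump target, and a valid bit. */
typedef struct {
    uint32_t tag;
    uint32_t target;
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[BTB_SIZE];

static inline uint32_t btb_index(uint32_t pc) { return (pc >> 2) & (BTB_SIZE - 1); }
static inline uint32_t btb_tag(uint32_t pc)   { return pc >> (2 + BTB_BITS); }

/* Lookup: on a hit, return the recorded target as the prediction. */
bool btb_lookup(uint32_t pc, uint32_t *target) {
    const btb_entry_t *e = &btb[btb_index(pc)];
    if (e->valid && e->tag == btb_tag(pc)) {
        *target = e->target;
        return true;
    }
    return false;
}

/* Update: record the actual target once the jump has resolved. */
void btb_update(uint32_t pc, uint32_t actual_target) {
    btb_entry_t *e = &btb[btb_index(pc)];
    e->valid  = true;
    e->tag    = btb_tag(pc);
    e->target = actual_target;
}
```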
Predicting Instruction Types
As we know, in branch prediction, for conditional branch instructions, we need to predict their direction, and for unconditional jump instructions, we need to predict their target. However, there is a problem: when we get a PC that needs to be predicted, we don’t know whether the corresponding instruction is a normal instruction or a branch instruction. Therefore, we cannot predict it.
How to solve this? One way is to predict the behavior of the instruction after fetching it. But fetching from ICache or Memory may take several cycles, which is a major drawback of this method.
A better way is to directly predict the type of instruction. After getting a PC, we can directly predict whether this instruction is a branch instruction and predict its behavior. In this way, we don’t have to wait for fetching to complete, and the predicted result can also guide the CPU to fetch from the correct location.
Type prediction can be done in a way similar to the BTB: a field in the cache entry records the instruction's type for use in the next prediction.
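For instance, the BTB entry from the sketch above can be extended with a type field (the enum values are illustrative, not Xiangshan's actual encoding):

```c
/* Reusing BTB_SIZE, btb_index, and btb_tag from the BTB sketch above. */
typedef enum { TYPE_COND_BRANCH, TYPE_JAL, TYPE_JALR } inst_type_t;

typedef struct {
    uint32_t    tag;
    uint32_t    target;
    inst_type_t type;    /* recorded when the instruction first resolved */
    bool        valid;
} typed_btb_entry_t;

static typed_btb_entry_t typed_btb[BTB_SIZE];

/* A hit means the PC was seen before, so both its type and its
 * target can be predicted before the instruction is fetched. */
bool typed_btb_lookup(uint32_t pc, uint32_t *target, inst_type_t *type) {
    const typed_btb_entry_t *e = &typed_btb[btb_index(pc)];
    if (e->valid && e->tag == btb_tag(pc)) {
        *target = e->target;
        *type   = e->type;
        return true;
    }
    return false;
}
```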
General Steps of Branch Prediction
Through the introduction in this section, we can summarize the general steps of branch prediction:
Get the PC.
Predict whether it is a branch instruction.
If it is a conditional branch instruction, predict its direction and target.
If it is an unconditional jump instruction, predict its target.
Note that since prediction requires predicting the instruction's type, and we have not yet obtained the instruction's actual content, predicting the target of a conditional branch instruction also becomes our task.
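These steps can be tied together in a toy next-PC function, reusing the typed BTB and the two-bit-counter table from the sketches above (illustrative only, not Xiangshan's implementation):

```c
uint32_t predict_next_pc(uint32_t pc) {
    uint32_t target;
    inst_type_t type;
    if (!typed_btb_lookup(pc, &target, &type))  /* step 2: known branch?      */
        return pc + 4;                          /* no: assume fall-through    */
    if (type == TYPE_COND_BRANCH)               /* step 3: direction + target */
        return predict(pc) ? target : pc + 4;
    return target;                              /* step 4: jal/jalr target    */
}
```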
2 - Basics of the Xiangshan Branch Prediction Unit (BPU)
This section introduces the basic ideas and working principles of the Xiangshan Branch Prediction Unit (BPU), including the use of branch prediction block concepts, multiple predictors, multiple pipeline structures, and the role of the Fetch Target Queue (FTQ), explaining the main interfaces of BPU for external interaction.
Branch Prediction Block Concept
For a general branch predictor, it usually predicts the relevant information of an instruction corresponding to a given PC, such as whether it is a conditional branch instruction or a jump instruction. For conditional branch instructions, it predicts whether it will jump, while for jump instructions, it predicts the jump target. However, predicting instructions one by one is inefficient, leading to slow instruction supply in the frontend.
In contrast, the prediction method used in Xiangshan is to predict a block of instructions each time. That is to say, given a PC, Xiangshan will predict a branch prediction block starting from this PC, including the situation of several subsequent instructions, such as whether there is a branch instruction, the position of the branch instruction, whether there is a jump, and the jump target.
This prediction method predicts multiple instructions at once and sends the prediction results to the instruction fetch unit (IFU) to guide fetching. In addition, since the IFU fetches at cache-line granularity for performance, it can fetch multiple instructions at once based on the prediction block, thereby improving throughput.
After a prediction block is generated, the BPU also determines the PC that execution will jump to after this block, and then continues generating the next branch prediction block starting from that PC.
Here’s a simple example:
As shown in the above figure, when the PC reaches 0x20000118, the BPU goes through the following steps:
The BPU learns that the PC is 0x20000118.
The BPU generates a branch prediction block starting from 0x20000118, with the following approximate contents:
In the next several instructions,
The third instruction is a conditional branch instruction.
For this conditional branch instruction, it predicts that it will be taken.
The address to which it jumps is 0x20000110.
The BPU sets the PC to 0x20000110 and continues to generate the next branch prediction block.
This is the basic prediction process of the Xiangshan BPU using branch prediction blocks.
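In software terms, a branch prediction block conveys roughly the following information (field names are illustrative; Xiangshan's actual structures differ):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t start_pc;       /* block starts here, e.g. 0x20000118          */
    bool     has_branch;     /* is there a branch among the next few insts? */
    uint8_t  branch_offset;  /* which slot it occupies, e.g. 2 (the third)  */
    bool     taken;          /* predicted direction                         */
    uint32_t target;         /* predicted jump target, e.g. 0x20000110     */
} pred_block_t;
```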
Multiple Predictors, Multiple Pipeline Structure
The figure below shows the overall architecture of the Xiangshan BPU, where we need to focus on two main aspects:
Multiple Predictors
To ensure prediction accuracy, Xiangshan BPU uses multiple predictors, and these predictors collectively generate the BPU’s prediction results. For example, FTB generates basic prediction results for subsequent predictors to use, while TAGE produces more accurate prediction results for conditional branch instructions, and so on.
Multiple Pipelines
To meet the requirements of high performance, Xiangshan BPU adopts a pipeline design. Various predictors are at different pipeline levels. Among them, the uFTB (also known as uBTB in the figure) predictor is at the first pipeline level, capable of generating prediction results in one cycle. The other predictors need 2-3 cycles to generate prediction results. Although the prediction time is longer, the prediction results are relatively more accurate.
However, if the BPU had to wait three cycles for the final prediction result before it could start predicting from the new result, each prediction would take three clock cycles, and this design would inevitably lose performance.
To make the prediction results that some predictors produce in the first and second cycles available early, the BPU provides three prediction result channels, outputting the results of the three pipeline stages simultaneously, as shown in the figure below.
Fetch Target Queue (FTQ)
Storing Branch Prediction Results
Although the BPU provides prediction results in the form of branch prediction blocks and the IFU can fetch multiple instructions at once, there is still a rate mismatch between them: in general, the BPU generates prediction results faster than the IFU consumes them.
Therefore, a Fetch Target Queue (FTQ) is added between the BPU and the IFU as a buffer. The FTQ is essentially a queue used to store individual data items. The prediction results generated by the BPU are first stored in the FTQ, and then fetched by the IFU from the FTQ, as shown in the figure below.
Whenever the BPU generates a prediction block, the prediction block is placed at the head of the FTQ. When the IFU is idle, it will fetch the next prediction block from the tail of the FTQ. The diagram below illustrates this process.
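Conceptually, the FTQ behaves like a ring buffer between a producer (the BPU) and a consumer (the IFU). A minimal sketch, reusing pred_block_t from the earlier sketch (the size and pointer names are illustrative):

```c
#define FTQ_SIZE 64

typedef struct {
    pred_block_t entries[FTQ_SIZE];
    unsigned     bpu_ptr;   /* where the BPU writes the next block */
    unsigned     ifu_ptr;   /* where the IFU reads the next block  */
} ftq_t;

static bool ftq_full(const ftq_t *q)  { return (q->bpu_ptr + 1) % FTQ_SIZE == q->ifu_ptr; }
static bool ftq_empty(const ftq_t *q) { return q->bpu_ptr == q->ifu_ptr; }

/* BPU side: enqueue a freshly generated prediction block. */
bool ftq_enqueue(ftq_t *q, pred_block_t blk) {
    if (ftq_full(q)) return false;            /* BPU must stall */
    q->entries[q->bpu_ptr] = blk;
    q->bpu_ptr = (q->bpu_ptr + 1) % FTQ_SIZE;
    return true;
}

/* IFU side: dequeue the next block to fetch from. */
bool ftq_dequeue(ftq_t *q, pred_block_t *blk) {
    if (ftq_empty(q)) return false;           /* nothing to fetch */
    *blk = q->entries[q->ifu_ptr];
    q->ifu_ptr = (q->ifu_ptr + 1) % FTQ_SIZE;
    return true;
}
```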
In Xiangshan, the FTQ's functionality goes far beyond this. As the FTQ's external interfaces in the figure above suggest, it is also responsible for sending prefetch information to the ICache, storing the BPU's training information, analyzing the redirect and update information sent from the fetch module and the backend execution modules, sending update requests to the BPU, and even maintaining the FTB predictor's basic data structure inside the FTQ.
BPU Prediction Result Redirection
As mentioned earlier, the Xiangshan branch prediction results have three channels, which simultaneously output the prediction results of stages s1, s2, and s3. How does the FTQ use the prediction results of the three stages?
Let’s start from exploring the timing of the pipeline, as shown in the figure below.
In the first cycle, a new PC 0x4 is fetched, and the predictor (called uFTB) that can produce a prediction result within one cycle outputs its prediction result at the s1 interface, indicating the next PC as 0xf, with no output from other interfaces yet.
In the second cycle, the PC is set to 0xf, and uFTB also generates a prediction result of 0xf, which is sent out from the s1 channel. At the same time, the two-cycle predictor generates the prediction result for the previous address 0x4, which is sent out from the s2 channel.
However, a problem arises here: the prediction result that s2 outputs in the second cycle is for 0x4, but a prediction for 0x4 was already output by s1 in the previous cycle and placed into an FTQ entry. In other words, s2's result duplicates one that s1 has already produced. The difference is that s2's result comes from the two-cycle predictor and is therefore more accurate.
Therefore, what we need to do is not allocate a new FTQ entry for the s2 result, but compare the s2 result with the s1 result from the previous cycle. If they differ, the FTQ entry placed earlier via the s1 interface is overwritten.
So we add an extra signal line to each of the s2 and s3 channels, called the redirect signal. When this signal is valid, it indicates that this stage's prediction differs from the earlier prediction and that an existing FTQ entry must be overwritten. The structure is shown in the diagram below.
At the time corresponding to the second cycle of the pipeline in the structural diagram, the s1 channel has already placed a branch prediction block result with an address of 0x4 into the FTQ. At this time, the s2 prediction result is generated, and the BPU finds that the s2 prediction result is different from s1, so the redirect interface for this cycle is set to valid. The FTQ will use the s2 channel’s prediction result to overwrite the FTQ entry previously storing the 0x4 prediction result.
Meanwhile, although the s1 channel has also generated a branch prediction block starting at 0xf, it is clearly a wrong-path result that s1 produced from the previous cycle's incorrect PC, so it can simply be discarded.
In the third cycle, s1 starts a new round of prediction from the correct PC indicated by s2, the new PC 0x8. After that, if no prediction differences are detected on the s2 and s3 channels, the pipeline continues to run at full speed.
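The override decision itself is just a comparison. A sketch of the idea, using the pred_block_t fields from earlier rather than Xiangshan's actual signals:

```c
/* Compare what s2 now predicts for a PC against what s1 predicted for
 * the same PC last cycle; a mismatch raises the s2 redirect. */
bool s2_redirect_valid(const pred_block_t *s1_prev, const pred_block_t *s2_now) {
    return s1_prev->taken  != s2_now->taken ||
           s1_prev->target != s2_now->target;
}
```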
BPU Redirect Requests
No matter how accurate a branch predictor is, it is not always correct, and its mispredictions cause incorrect instructions to fill the subsequent pipeline. There must therefore be a mechanism to correct this, and that mechanism is redirection. When the backend execution module executes an instruction, the instruction's true behavior is determined. If the backend detects a branch misprediction at this point, it issues a redirect request to restore the processor to its state before the incorrect instructions were executed. For our purposes, we only need to pay attention to how the BPU and FTQ restore their state on a redirect.
In addition to redirect requests from the backend, the Xiangshan processor performs a simple analysis of instructions right after the IFU fetches them, to catch the most basic prediction errors. The process is as follows: after the FTQ sends a fetch request to the IFU, it waits for the IFU to return pre-decode information (pre-decoding is the IFU's simple decoding of an instruction, e.g., whether it is a jump instruction and what its target is). The FTQ writes the pre-decode information back into a field of the corresponding FTQ entry and also analyzes it; if it detects a prediction error, it generates an IFU redirect request.
Redirect requests from the backend execution module do not need to be generated by the FTQ but are directly sent from the backend to the FTQ for processing. The FTQ will forward the generated IFU redirect request and the backend redirect request to the BPU’s redirect interface. If both are valid in the same cycle, the FTQ will choose to forward the backend redirect request.
The BPU with the added redirect interface is shown in the diagram below.
BPU Update Requests
The current BPU can already correct errors, but one problem remains: the data inside the BPU is never updated. If the BPU cannot obtain information such as the location and type of executed branch instructions, whether they jumped, and their jump targets, it will never be trained, and its accuracy will be greatly reduced.
To obtain this information, we still need to rely on the Fetch Target Queue (FTQ) because it can not only interact with the IFU to obtain instruction-related information but also interact with the backend to obtain execution-related information. Therefore, there will be an update request channel directly connecting the FTQ to the BPU.
When the backend completes the execution of an entry in the FTQ, the entry is marked as committed. Next, the FTQ forwards the update information of this entry to the BPU through the Update channel.
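An update request carries roughly the following information back to the predictors (a sketch whose fields mirror the training information listed above, not Xiangshan's exact format):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint32_t pc;             /* start PC of the committed prediction block  */
    bool     has_branch;     /* was there a branch in the block?            */
    uint8_t  branch_offset;  /* where the branch sits in the block          */
    bool     actually_taken; /* the branch's real direction                 */
    uint32_t actual_target;  /* the branch's real target                    */
    bool     mispredicted;   /* did the original prediction turn out wrong? */
} update_req_t;
```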
Summary
Through this section, we have learned about all the main interfaces required for the BPU's external interaction and the FTQ's role in supporting the BPU. Equipped with prediction result interfaces, redirect interfaces, and update interfaces, the BPU can support all of its external interactions. Next, we will delve deeper into the internals of the BPU.
3 - Introduction to the Xiangshan Branch Prediction Unit Structure
This section introduces the structure of the Xiangshan Branch Prediction Unit (BPU), including the integration of multiple predictors and multiple pipeline schemes, as well as the organization structure and interface design of internal sub-predictors, demonstrating how the BPU interacts with the Composer, and explaining the connection methods between sub-predictors.
How Does the BPU Integrate Internal Sub-predictors?
We already know that the Xiangshan BPU adopts multiple predictors and multiple pipeline schemes. To adapt to multiple pipelines, the BPU uses a three-channel result output interface. But how does it adapt to multiple predictors? This requires us to further explore the internal structure of the BPU.
The above figure is the BPU architecture diagram from the Xiangshan documentation. Currently, we only need to focus on one piece of information: all internal sub-predictors are encapsulated in a structure called Composer. The BPU only needs to interact with Composer.
What is Composer? Let's first look at the relevant definitions in the Xiangshan code.
It can be seen that Composer and the five sub-predictors have a common characteristic: they all inherit from the BasePredictor base class. And the interface has been defined in the BasePredictor class. In other words, Composer and the five sub-predictors all have the same interface! The top-level BPU can directly regard Composer as a sub-predictor, without worrying about how the internal sub-predictors are connected.
Sub-predictor Interface
Next, we will look at what the sub-predictor interface looks like. This interface will involve the interaction between Composer and the top-level BPU, as well as the interaction between each sub-predictor and Composer.
Let’s take Composer as an example to illustrate the structure of the sub-predictor interface.
As shown in the above figure, the three-channel prediction results of Composer are directly output to the outside of the BPU. There is also a set of three-channel prediction results connected from the inside of the BPU to Composer. However, since the prediction results are generated by Composer, the BPU will pass an empty prediction result to Composer. The significance of this is to make the sub-predictor act as a “processor.” The sub-predictor will process the input prediction results and then output the processed prediction results.
Next, the top-level BPU provides the information needed for prediction: first the PC and the branch history (including the global history and the global folded history). The BPU then connects its pipeline control signals to Composer. Finally, the BPU wires the externally supplied redirect request interface and update interface directly to Composer.
In the end, a simple definition of the sub-predictor interface can be given (for detailed definitions, please refer to the interface documentation):
in
  (s1, s2, s3)  Prediction information input
  s0_pc         PC to be predicted
  ghist         Global branch history
  folded_hist   Global folded history
out
  (s1, s2, s3)  Prediction information output
Pipeline control signals
  s0_fire, s1_fire, s2_fire, s3_fire  Whether the corresponding pipeline stage is working
  s2_redirect, s3_redirect            Redirect signals raised when a later pipeline stage discovers a prediction error
  s1_ready, s2_ready, s3_ready        Whether the sub-predictor's corresponding pipeline stage is ready
update    Update request
redirect  Redirect request
Connection Between Sub-predictors
We now know that the interfaces between each sub-predictor and Composer are the same, and we also know how Composer is connected to the top-level BPU. This section will explain how sub-predictors are connected within Composer.
The above figure shows the connection structure of sub-predictors in Composer. It can be seen that after the three-channel prediction results are input into Composer, they are first processed by uFTB and then output. They are then successively processed by TAGE-SC, FTB, ITTAGE, and RAS, and finally connected to the prediction result output of Composer, which is then directly connected to the outside of the BPU by Composer.
For other signals, because the interfaces between Composer and each sub-predictor are the same, they are directly connected to the corresponding interfaces of each predictor by Composer, without much additional processing.
Prediction Result Interface Connection
For the sub-predictors, the prediction result interfaces are chained: the prediction result output of one predictor is the input of the next. Note, however, that this connection is purely combinational; it does not add pipeline stages.
As shown in the above figure, taking the s1 channel as an example, the prediction result is modified only by combinational logic all the way from the input to the output of the last predictor; registers exist only between the s1, s2, and s3 stages.
Therefore, adding more sub-predictors does not increase the number of cycles a prediction takes; it only increases the combinational delay within each cycle.
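The chaining can be pictured as function composition over the prediction result (a software analogy using pred_block_t from earlier, not the actual Chisel wiring):

```c
/* Each sub-predictor takes the running prediction for a PC and returns
 * a (possibly) refined one; Composer chains them combinationally. */
typedef pred_block_t (*subpredictor_fn)(pred_block_t in, uint32_t pc);

pred_block_t compose(const subpredictor_fn preds[], int n, uint32_t pc) {
    pred_block_t result = {0};          /* the BPU passes in an empty result */
    for (int i = 0; i < n; i++)
        result = preds[i](result, pc);  /* each predictor refines its input  */
    return result;
}
```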
4 - Introduction to the Timing of Xiangshan Branch Prediction Unit
The timing design of the three-stage pipeline is the essence of the Xiangshan BPU. This section will introduce how the prediction result redirection signal is generated, how a new PC is generated based on the prediction result, and how the prediction results of the three channels are handled.
Single-Cycle Prediction without Bubble
uFTB is the only predictor in Xiangshan BPU that can generate prediction results in a single cycle. The figure below shows the prediction process of uFTB. The s0_pc is sent from the top level of BPU, and when the s1 stage is active, the s1_pc retains the value of s0_pc from the previous cycle. This means that the value of s0_pc will move down the pipeline.
When the s1 stage is active, uFTB receives the s1_fire signal for the current cycle and generates a prediction result based on the s1_pc address within that same cycle; the new PC value can be obtained from this prediction result.
As shown in the figure, the top level of the BPU extracts the next PC from the s1 prediction result channel and sends it to npc_Gen (the next-PC generator), which generates the s0_pc for the next cycle.
In the next cycle, uFTB gets the new PC value and starts generating the prediction block for the new PC value. Therefore, with only the s1 stage, the prediction block can be generated at a rate of one block per cycle.
Prediction Result Redirection
However, except for uFTB, other predictors require 2-3 cycles to generate prediction results. How to utilize their prediction results? And how to generate the prediction result redirection signal?
As shown in the figure, a predictor that takes two cycles to produce a result (Predictor 2) outputs its prediction on the s2 prediction result channel during the s2 stage. After the top level of the BPU receives this result, it extracts the prediction block's jump target and connects it to npc_Gen.
At this point, npc_Gen receives both the target from s2 (a prediction for the older PC) and the target from s1 (a prediction for the newer PC). Which one should it use for the new PC?
As mentioned earlier, the BPU compares the s2 prediction result with the s1 prediction result from the previous cycle. If they differ, s1 made a wrong prediction, and naturally the prediction s1 generated this cycle, based on that wrong result, is also wrong. Therefore, when a mismatch is found in the current cycle, npc_Gen uses the target provided by s2 as the new s0_pc.
This process is shown in the pipeline structure diagram as follows:
The Diff comparator compares the s2-stage prediction result with the s1 result from the previous cycle to generate the diff signal, which guides npc_Gen in producing the next PC. The diff signal also indicates that the s1-stage prediction was incorrect, so the BPU can use it directly as the s2-stage redirect signal sent to the FTQ, instructing the FTQ to overwrite the earlier prediction result.
The diff signal is also sent to each predictor through the s2_redirect interface to guide the predictors to update their states.
Furthermore, when an s2-stage prediction redirect occurs, indicating that the s1 channel's prediction was incorrect, the s2 stage cannot continue predicting from it: the predictor pipeline's s2_fire signal must be invalidated while the corrected prediction result flows in.
The prediction result redirection of the s3 stage is similar to this. Its pipeline structure diagram is as follows. The specific processing process is left for you to analyze.
Redirection Requests and Other Information Generation
An external redirect request occurs only when the predictions of all three stages are incorrect. In that case, npc_Gen receives the PC carried by the redirect request. Since a redirect request implies that all three stages have predicted incorrectly, all three stages' fire signals must be invalidated; npc_Gen then restarts prediction from the PC that the redirect request restores.
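Putting the priorities together, npc_Gen can be pictured as a priority mux (the signal names are illustrative): an external redirect beats s3, which beats s2, which beats s1.

```c
uint32_t npc_gen(bool redirect_valid, uint32_t redirect_pc,
                 bool s3_diff, uint32_t s3_target,
                 bool s2_diff, uint32_t s2_target,
                 uint32_t s1_target) {
    if (redirect_valid) return redirect_pc;  /* recover from a real mispredict */
    if (s3_diff)        return s3_target;    /* s3 overrides s2 and s1         */
    if (s2_diff)        return s2_target;    /* s2 overrides s1                */
    return s1_target;                        /* normal full-speed path         */
}
```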
Other information, such as the global history and the PC, is maintained on the same principle, based on the prediction information of each stage; for example, the global history generates a new branch history from each stage's prediction results.
Pipeline Control Signals
After learning about the specific process of the pipeline, you should understand the pipeline control signals in the predictor interface, as follows:
s0_fire, s1_fire, s2_fire, s3_fire Indicate whether each stage of the pipeline is working.
s2_redirect, s3_redirect Indicate whether a prediction result redirection has occurred.
s1_ready, s2_ready, s3_ready Sent from the predictor to the top level of BPU, indicating whether each stage of the pipeline is ready.
Conclusion
By now, you should understand the basic design principles, external interaction logic, internal structure, and timing of the Xiangshan Branch Prediction Unit, and have a general grasp of how the BPU works. Xiangshan's BPU should no longer be mysterious to you.
Next, you can read the Important Structures and Interfaces Document and combine it with the source code of Xiangshan BPU to form a more detailed understanding of BPU. When you clearly understand the working principle and signal details of BPU, you can start your verification work!