
Branch Predictor and Kunming Lake Microarchitecture Implementation

In processor microarchitecture, what is branch prediction? Why is it important? How can its function and performance be verified? Through this installment of the Open Verification Learning series, you will gain a deep understanding of the Branch Prediction Unit (BPU) in the high-performance Kunming Lake microarchitecture.

1 - Basic Design of the Xiangshan Branch Prediction Unit (BPU)

This document introduces the basic design principles of the Xiangshan branch prediction unit. By reading this document, you can understand the approximate workflow of the Xiangshan BPU without needing to know specific signal names and code details.

In processor design, a well-designed branch prediction unit (BPU) is a key component for improving processor performance. It is responsible for guiding the processor's instruction fetch, determining where the next instruction should be fetched and executed. The BPU is where an instruction's lifecycle begins, so it is a natural place to start exploring a high-performance processor.

This is especially true for Xiangshan, a high-performance six-issue out-of-order processor, which naturally requires a branch prediction unit with high accuracy and efficiency. The design of a branch prediction unit must balance many factors, such as timing, structural complexity, silicon area, prediction accuracy, and the speed of recovery from mispredictions. Through many clever design choices, the branch prediction unit of the Xiangshan processor strikes a good balance among these factors, giving it high branch prediction efficiency and accuracy and providing a solid guarantee of instruction supply to the backend.

In this section, we will introduce the basic design of the Xiangshan branch prediction unit. By reading this section, you can learn the following:

  • Basic concepts of branch prediction
  • The basic prediction unit of the Xiangshan BPU: the branch prediction block
  • External interfaces of the Xiangshan branch prediction unit
  • Basic structure of the Xiangshan branch prediction unit
  • Basic timing of the Xiangshan branch prediction unit

Anyone participating in the BPU crowdsourced verification work should read this section first to gain a basic understanding of the Xiangshan branch prediction unit.

1.1 - What is Branch Prediction

The branch prediction unit, as the name suggests, needs to perform a basic task—branch prediction. Before delving into the branch prediction unit, it is necessary to understand what branch prediction is.

Why Do We Need Branch Prediction?

There are mainly two reasons for branch prediction: one is that the program’s execution flow contains branch instructions, and the other is that high-performance processors use pipeline design.

Program’s Execution Flow Contains Branch Instructions

int x = 10;
int y = 20;
int result = 0;

if (x >= y) {
    result = x + y;
} else {
    result = x - y;
}

The above is a piece of C code. It first defines three variables x, y, and result, and then assigns a value to result based on the comparison of x and y. Observe that the program assigns values to the variables in sequence in the first three lines. In the 5th line, however, the if statement causes the program to branch: since the condition is false, execution jumps from the 5th line directly to the 8th line, creating a branch in the program's execution flow.

After translating into RISC-V assembly code, the code is as follows:

li  a0, 10               # x = 10
li  a1, 20               # y = 20
li  a2, 0                # result = 0

blt a0, a1, else_branch  # Jump to else_branch if x < y
add a2, a0, a1           # Execute result = x + y
j end                    # Jump to end
else_branch:
sub a2, a0, a1           # Execute result = x - y
end:

It can be seen that the program retains the same branching behavior. In the first three lines of the code, instructions execute in sequence. Then, in the 5th line, a special instruction, blt, appears, which we call a branch instruction. Based on the comparison of x and y, it decides whether to fall through to the next instruction or jump to else_branch, and its presence creates a branch in the program's execution flow.

High-performance Processors Use Pipeline Design

Modern high-performance processors are pipelined: instruction fetch happens several stages (and cycles) before a branch's outcome is computed in the execute stage. When a branch instruction enters the pipeline, the fetch stage must decide where to fetch next before the branch's result is known; simply stalling until the branch resolves would fill the pipeline with bubbles and squander the performance that pipelining provides.

Therefore, the concept of branch prediction arises. If we can accurately predict the address of the next instruction before the execution result is generated, the processor can continue to run efficiently.

Feasibility of Branch Prediction

Why can branch prediction be done? Let’s look at an example:

if (data >= 128)
    sum += data;

Assuming that this instruction will be executed repeatedly, and data is incremented from 0, i.e., data = 0, 1, 2, 3 … 128, 129…, let’s analyze the behavior of executing this instruction each time.

T = branch taken
N = branch not taken

data   = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

It can be seen that in the first 128 times, the branch is always Not Taken (the condition is not met), but after 128 times, the branch is always Taken (the condition is met). If we predict whether it is Taken based on whether it was Taken last time, we will only make one prediction error throughout the prediction process.

The occurrence of the above phenomenon is due to a basic fact—whether a branch instruction jumps is related to the past jumping behavior of that instruction. By summarizing the past jumping rules, we can make a relatively accurate prediction for this jump, which also makes branch prediction possible.

In fact, the jump of branch instructions is also related to factors such as the jumping situation of other branch instructions. Fully exploiting effective information to produce accurate prediction results is one of the main tasks of branch prediction.

Basic Types of Branch Prediction

In RISC-V, branch instructions include two types:

  1. Conditional Branch Instructions (beq, bne, blt, bge, bltu, bgeu): whether these instructions jump is determined by the condition in the instruction, while the jump target can be obtained directly from the instruction. Therefore, we need to predict whether the instruction will jump.
  2. Unconditional Jump Instructions (jal, jalr): these instructions always jump when encountered, but the jump target may be specified by a register. Therefore, we need to predict the jump target of the instruction.

Fortunately, thanks to the concise design of the RISC-V architecture, there are no conditional indirect jumps: every jump instruction whose target we need to predict is unconditional, which also simplifies our design.

From the above analysis, we can summarize the two basic types of branch prediction—direction prediction and target address prediction.

Direction Prediction of Branch Instructions

Direction prediction of branch instructions corresponds to conditional branch instructions in RISC-V instructions. We need to predict whether it needs to jump, which is called direction prediction.

Two-Bit Saturating Counters

Direction prediction has a very simple and efficient method called the two-bit saturating counter. The basic idea can be seen in the figure below.

The two-bit saturating counter is regarded as a state machine, and we maintain such a state machine for each branch instruction. When a branch instruction is taken, the corresponding state in the diagram moves to the right; otherwise, it moves to the left. So, the next time we encounter this branch instruction, we first look up its two-bit saturating counter. If the state is more biased to the right, we predict it to be taken; otherwise, we predict it not to be taken.

Of course, it’s impractical to maintain a two-bit saturating counter for each branch instruction. Therefore, in practice, we usually use part of the PC or a hash method to index the two-bit saturating counter, as shown in the diagram below.
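
To make this concrete, below is a minimal software sketch of a PC-indexed table of two-bit saturating counters. All names and sizes here are illustrative, not taken from the Xiangshan code base:

object TwoBitCounterTable {
  // Sketch: counters indexed by low PC bits. States 0/1 predict
  // not-taken, states 2/3 predict taken; start at "weakly not taken".
  val indexBits = 10
  private val table = Array.fill(1 << indexBits)(1)

  private def index(pc: Long): Int =
    ((pc >> 2) & ((1 << indexBits) - 1)).toInt   // drop byte offset, mask

  def predictTaken(pc: Long): Boolean = table(index(pc)) >= 2

  // Move right (toward 3) when taken, left (toward 0) when not taken.
  def train(pc: Long, taken: Boolean): Unit = {
    val i = index(pc)
    table(i) = if (taken) (table(i) + 1) min 3 else (table(i) - 1) max 0
  }
}

Note that, because many PCs share one counter (aliasing), two different branches may train the same entry; this is the price of the hashed/partial-PC indexing described above.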

Branch History

Branch history is one of the most commonly used pieces of information in branch prediction and the basis of most branch prediction algorithms, because it directly records the past jumping behavior of instructions.

There are two basic types of branch history:

  • Local Branch History: maintain a set of history registers for each branch instruction, recording the past jumping behavior of that instruction.
    • For example: 0101000000101 (0 means not taken, 1 means taken)
  • Global Branch History: all instructions share one set of registers, recording the branching behavior of the program as it executes.
    • For example:

          beq a0, a1, label1          not taken  record 0
          bne a1, a2, label2          not taken  record 0
      ->  beq a2, a3, label4          taken      record 1


      After executing these three different branch instructions, the global branch history becomes 001.
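
As a minimal script-style Scala sketch (with an assumed fixed history length), maintaining a global history register amounts to shifting in one bit per executed branch:

// Sketch: a global history register of `len` bits; the newest bit is bit 0.
class GlobalHistory(len: Int = 8) {
  private var hist = 0
  def update(taken: Boolean): Unit =
    hist = ((hist << 1) | (if (taken) 1 else 0)) & ((1 << len) - 1)
  def value: Int = hist
}

// The three branches from the example above: 0, 0, then 1.
val gh = new GlobalHistory()
gh.update(false); gh.update(false); gh.update(true)
// gh.value is now binary ...001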

Branch Target Address Prediction

In the RISC-V architecture, branch target address prediction refers to predicting the target address of unconditional jump instructions (e.g., jal, jalr). Since these instructions always perform a jump operation, we need to predict their target address.

Branch Target Buffer (BTB)

BTB is a common method for predicting target addresses. Its core idea is to use a cache to store the target addresses of past unconditional jump instructions. When encountering the same unconditional jump instruction again, the BTB can be checked to see if there is a record for that instruction. If so, the recorded target address is used as the predicted target address for the current execution.

The diagram below illustrates this:
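
In software terms, a minimal sketch of this lookup-and-train flow might look like the following direct-mapped table keyed by low PC bits, with a tag check on lookup. Names and sizes are hypothetical; the entry also carries an instruction-type field, anticipating the type prediction discussed below:

import scala.collection.mutable

// Sketch: a direct-mapped BTB. `isUncondJump` doubles as a type field.
case class BtbEntry(tag: Long, target: Long, isUncondJump: Boolean)

class Btb(indexBits: Int = 9) {
  private val table = mutable.Map.empty[Int, BtbEntry]
  private def index(pc: Long): Int = ((pc >> 2) & ((1 << indexBits) - 1)).toInt
  private def tag(pc: Long): Long  = pc >> (2 + indexBits)

  // A hit returns the recorded entry; a miss means "no record for this PC".
  def lookup(pc: Long): Option[BtbEntry] =
    table.get(index(pc)).filter(_.tag == tag(pc))

  // Record the resolved target once the instruction actually executes.
  def train(pc: Long, target: Long, isUncondJump: Boolean): Unit =
    table(index(pc)) = BtbEntry(tag(pc), target, isUncondJump)
}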

Predicting Instruction Types

As we know, in branch prediction, for conditional branch instructions, we need to predict their direction, and for unconditional jump instructions, we need to predict their target. However, there is a problem: when we get a PC that needs to be predicted, we don’t know whether the corresponding instruction is a normal instruction or a branch instruction. Therefore, we cannot predict it.

How to solve this? One way is to predict the behavior of the instruction after fetching it. But fetching from ICache or Memory may take several cycles, which is a major drawback of this method.

A better way is to directly predict the type of instruction. After getting a PC, we can directly predict whether this instruction is a branch instruction and predict its behavior. In this way, we don’t have to wait for fetching to complete, and the predicted result can also guide the CPU to fetch from the correct location.

The method of type prediction can be similar to the BTB: a field in each cache entry records the type of the instruction, for use in the next prediction.

General Steps of Branch Prediction

Through the introduction in this section, we can summarize the general steps of branch prediction:

  1. Get the PC.
  2. Predict whether it is a branch instruction.
    1. If it is a conditional branch instruction, predict its direction and target.
    2. If it is an unconditional jump instruction, predict its target.

Note that since prediction happens before we have obtained the actual instruction bytes, we cannot compute a conditional branch's target from its immediate field; therefore, predicting the target of a conditional branch instruction also becomes our task.
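
Combining the sketches above, one prediction step can be pictured as follows. This is hypothetical glue code reusing the Btb and TwoBitCounterTable sketches from earlier in this section; a real BPU does this in hardware and, as the next section shows, a block at a time:

// Sketch: one prediction step. A BTB miss is treated as an ordinary
// instruction; a conditional branch consults the direction predictor.
val btb = new Btb()

def predictNextPc(pc: Long): Long = btb.lookup(pc) match {
  case Some(e) if e.isUncondJump => e.target        // jal/jalr: predict target
  case Some(e) =>                                   // conditional branch
    if (TwoBitCounterTable.predictTaken(pc)) e.target else pc + 4
  case None => pc + 4                               // predict fall-through
}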

1.2 - Basics of the Xiangshan Branch Prediction Unit (BPU)

This section introduces the basic ideas and working principles of the Xiangshan Branch Prediction Unit (BPU), including the use of branch prediction block concepts, multiple predictors, multiple pipeline structures, and the role of the Fetch Target Queue (FTQ), explaining the main interfaces of BPU for external interaction.

Branch Prediction Block Concept

For a general branch predictor, it usually predicts the relevant information of an instruction corresponding to a given PC, such as whether it is a conditional branch instruction or a jump instruction. For conditional branch instructions, it predicts whether it will jump, while for jump instructions, it predicts the jump target. However, predicting instructions one by one is inefficient, leading to slow instruction supply in the frontend.

In contrast, Xiangshan predicts a block of instructions at a time. That is to say, given a PC, Xiangshan predicts a branch prediction block starting from that PC, covering the next several instructions: whether there is a branch instruction, where it is located, whether it jumps, and the jump target.

This prediction method can predict multiple instructions at once and send the results to the instruction fetch unit (IFU) to guide fetching. In addition, since the IFU needs to take cache-line efficiency into account, it can fetch multiple instructions at once based on the prediction block, thereby improving throughput.

After a prediction block is generated, the BPU also produces the PC that execution reaches after this block (its jump target or fall-through address), and then continues to generate the next branch prediction block starting from that PC.

Here’s a simple example:

As shown in the above figure, when the PC reaches 0x20000118, the BPU goes through the following steps:

  1. The BPU learns that the PC is 0x20000118.
  2. The BPU generates a branch prediction block starting from 0x20000118, whose approximate contents are:
    1. Among the next several instructions, the third instruction is a conditional branch instruction.
    2. This conditional branch instruction is predicted to be taken.
    3. Its jump target is 0x20000110.
  3. The BPU sets the PC to 0x20000110 and continues to generate the next branch prediction block.

This is the basic prediction process of the Xiangshan BPU using branch prediction blocks.
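
In sketch form, the BPU's main loop simply predicts a block and then continues from that block's successor PC (types here are illustrative, not from the Xiangshan code):

// Sketch: the BPU produces one prediction block per step, and each
// block says where the next one starts (jump target or fall-through).
case class PredBlock(startPc: Long, nextPc: Long)

def predictStream(predictBlock: Long => PredBlock, pc0: Long, n: Int): Seq[PredBlock] =
  Iterator.iterate(predictBlock(pc0))(b => predictBlock(b.nextPc)).take(n).toSeq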

Multiple Predictors, Multiple Pipeline Structure

The figure below shows the overall architecture of the Xiangshan BPU, where we need to focus on two main aspects:

Multiple Predictors

To ensure prediction accuracy, the Xiangshan BPU uses multiple predictors, which collectively generate the BPU's prediction results. For example, the FTB generates basic prediction results for subsequent predictors to use, while TAGE produces more accurate direction predictions for conditional branch instructions, and so on.

Multiple Pipelines

To meet the requirements of high performance, the Xiangshan BPU adopts a pipelined design, with the various predictors placed at different pipeline stages. Among them, the uFTB (also called uBTB in the figure) sits at the first stage and can generate a prediction result in one cycle. The other predictors need 2-3 cycles to generate prediction results; although they take longer, their results are relatively more accurate.

However, if the BPU had to wait three cycles for the final prediction result before starting the next prediction based on it, every prediction would take three clock cycles, and this would inevitably cost performance.

To be able to get the prediction results of some predictors in the first and second cycles, we set up three prediction result channels and output the prediction results of the three pipeline levels simultaneously, as shown in the figure below.

Fetch Target Queue (FTQ)

Storing Branch Prediction Results

Although the BPU can provide prediction results in the form of branch prediction blocks and the IFU can fetch multiple instructions at once, there is still a performance gap between them. In general, the BPU generates prediction results faster.

Therefore, a Fetch Target Queue (FTQ) is added between the BPU and the IFU as a buffer. The FTQ is essentially a queue used to store individual data items. The prediction results generated by the BPU are first stored in the FTQ, and then fetched by the IFU from the FTQ, as shown in the figure below.

Whenever the BPU generates a prediction block, the block is enqueued at the tail of the FTQ. When the IFU is idle, it dequeues the next prediction block from the head of the FTQ. The diagram below illustrates this process.

In Xiangshan, the FTQ’s functionality goes far beyond this. Referring to the FTQ’s external interface in the figure above, it is also responsible for sending prefetch information to the ICache, storing various training information of the BPU, analyzing redirection information and update information sent from the fetch module and the backend execution module, sending update requests to the BPU, and even updating the basic data structure of the FTB predictor in the FTQ.

BPU Prediction Result Redirection

As mentioned earlier, the Xiangshan branch prediction results have three channels, which simultaneously output the prediction results of stages s1, s2, and s3. How does the FTQ use the prediction results of the three stages?

Let’s start from exploring the timing of the pipeline, as shown in the figure below.

  • In the first cycle, a new PC 0x4 is fetched, and the predictor (called uFTB) that can produce a prediction result within one cycle outputs its prediction result at the s1 interface, indicating the next PC as 0xf, with no output from other interfaces yet.
  • In the second cycle, the PC is set to 0xf, and uFTB also generates a prediction result of 0xf, which is sent out from the s1 channel. At the same time, the two-cycle predictor generates the prediction result for the previous address 0x4, which is sent out from the s2 channel.

However, a problem arises here: in the second cycle, the prediction result output by s2 is for address 0x4, but a prediction for 0x4 was already output by s1 in the previous cycle and placed into an FTQ entry. In other words, the block that s2 predicts has already been predicted by s1; the difference is that the s2 result comes from the two-cycle predictor and is therefore more accurate.

Therefore, what we need to do is not to place a new FTQ entry based on the prediction result from s2 but to compare the prediction results from s2 and the previous cycle’s s1 prediction results. If there is a difference, then overwrite the FTQ entry placed by the previous stage’s s1 interface.

So we add two additional signal lines to the s2 and s3 channels, which we call redirect signals. If this signal is valid, it indicates that there is a difference between the prediction result of this stage and the previous prediction result, and it is necessary to overwrite an FTQ entry from before. The structure is shown in the diagram below.

At the time corresponding to the second cycle of the pipeline in the structural diagram, the s1 channel has already placed a branch prediction block result with an address of 0x4 into the FTQ. At this time, the s2 prediction result is generated, and the BPU finds that the s2 prediction result is different from s1, so the redirect interface for this cycle is set to valid. The FTQ will use the s2 channel’s prediction result to overwrite the FTQ entry previously storing the 0x4 prediction result.

At this time, although the s1 channel has also generated a branch prediction block with 0xf as the head, it is obviously an incorrect prediction result generated by s1 based on the PC of the first cycle. Therefore, at this time, the s1 result can be directly discarded.

In the third cycle, s1 starts a new round of prediction with the correct prediction result indicated by s2, the new PC 0x8. After that, if no prediction errors are detected by the s2 and s3 channels, the pipeline will continue to run at full capacity.

BPU Redirect Requests

No matter how accurate a branch predictor is, it is not always correct. This inaccuracy can lead to incorrect instructions being filled in the subsequent pipeline. Therefore, there needs to be a mechanism to correct this, and this mechanism is redirection. When an instruction is executed by the backend execution module, the true behavior of this instruction is determined. At this time, if the backend execution module detects a branch prediction error, it will issue a redirect request to restore the processor’s state to the state before executing the incorrect instruction. For us, we only need to pay attention to how the BPU and FTQ restore the state when redirecting.

In addition to redirect requests from the backend, the Xiangshan processor performs a simple analysis of instructions after they are fetched by the IFU, to detect the most basic prediction errors. The specific process is as follows: after the FTQ sends a fetch request to the IFU, it waits for the IFU to return pre-decode information (pre-decoding is the IFU's simple decoding of an instruction, such as whether it is a jump instruction and what its jump target is). The FTQ writes the pre-decode information back into a field of the corresponding FTQ entry and also analyzes it. If a prediction error is detected, it generates an IFU redirect request.

Redirect requests from the backend execution module do not need to be generated by the FTQ but are directly sent from the backend to the FTQ for processing. The FTQ will forward the generated IFU redirect request and the backend redirect request to the BPU’s redirect interface. If both are valid in the same cycle, the FTQ will choose to forward the backend redirect request.
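
A sketch of that arbitration rule (illustrative types; the actual logic lives in the FTQ's Chisel code):

// Sketch: when both redirects are valid in the same cycle, the FTQ
// forwards the backend redirect and drops the IFU one.
def selectRedirect[T](backend: Option[T], ifu: Option[T]): Option[T] =
  backend.orElse(ifu)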

The BPU with the added redirect interface is shown in the diagram below.

BPU Update Requests

The current BPU already has the ability to correct errors, but there is still a problem: the data in the BPU cannot be updated. If it is impossible to obtain information such as the location, type, whether a jump occurred, and the jump address of the instruction, the BPU will not be trained and the accuracy will be greatly reduced.

To obtain this information, we still need to rely on the Fetch Target Queue (FTQ) because it can not only interact with the IFU to obtain instruction-related information but also interact with the backend to obtain execution-related information. Therefore, there will be an update request channel directly connecting the FTQ to the BPU.

When the backend completes the execution of an entry in the FTQ, the entry is marked as committed. Next, the FTQ forwards the update information of this entry to the BPU through the Update channel.

Summary

Through this section, we have learned about all the main interfaces required for BPU external interaction and the role of FTQ in BPU. With the BPU equipped with prediction result interfaces, redirect interfaces, and update interfaces, it can already support all external interactions of the BPU. Next, we will delve deeper into the internals of the BPU.

1.3 - Introduction to the Xiangshan Branch Prediction Unit Structure

This section introduces the structure of the Xiangshan Branch Prediction Unit (BPU), including the integration of multiple predictors and multiple pipeline schemes, as well as the organization structure and interface design of internal sub-predictors, demonstrating how the BPU interacts with the Composer, and explaining the connection methods between sub-predictors.

How Does the BPU Integrate Internal Sub-predictors?

We already know that the Xiangshan BPU adopts multiple predictors and multiple pipeline schemes. To adapt to multiple pipelines, the BPU uses a three-channel result output interface. But how does it adapt to multiple predictors? This requires us to further explore the internal structure of the BPU.

The above figure is the BPU architecture diagram from the Xiangshan documentation. Currently, we only need to focus on one piece of information: all internal sub-predictors are encapsulated in a structure called Composer. The BPU only needs to interact with Composer.

What is the Composer? Let's first look at the relevant definitions in the Xiangshan code.

It can be seen that Composer and the five sub-predictors have a common characteristic: they all inherit from the BasePredictor base class. And the interface has been defined in the BasePredictor class. In other words, Composer and the five sub-predictors all have the same interface! The top-level BPU can directly regard Composer as a sub-predictor, without worrying about how the internal sub-predictors are connected.

Sub-predictor Interface

Next, we will look at what the sub-predictor interface looks like. This interface will involve the interaction between Composer and the top-level BPU, as well as the interaction between each sub-predictor and Composer.

Let’s take Composer as an example to illustrate the structure of the sub-predictor interface.

As shown in the above figure, the three-channel prediction results of Composer are directly output to the outside of the BPU. There is also a set of three-channel prediction results connected from the inside of the BPU to Composer. However, since the prediction results are generated by Composer, the BPU will pass an empty prediction result to Composer. The significance of this is to make the sub-predictor act as a “processor.” The sub-predictor will process the input prediction results and then output the processed prediction results.

Next, the top-level BPU provides the information needed for prediction to the pipeline: first the PC and the branch history (including the global history and the global folded history). The BPU then connects the pipeline control signals between itself and Composer. Finally, the BPU connects the externally supplied redirect request interface and update interface directly to Composer.

In the end, a simple definition of the sub-predictor interface can be given (for detailed definitions, please refer to the interface documentation):

  • in
    • (s1, s2, s3) Prediction information input
    • s0_pc PC to be predicted
    • ghist Global branch history
    • folded_hist Global folding history
  • out (s1, s2, s3) Prediction information output
  • Pipeline control signals
    • s0_fire, s1_fire, s2_fire, s3_fire Whether the corresponding pipeline stage is working
    • s2_redirect, s3_redirect Redirect signals when a prediction error is discovered in the subsequent pipeline stage
    • s1_ready, s2_ready, s3_ready Whether the sub-predictor corresponding pipeline stage is ready
  • update Update request
  • redirect Redirect request

Connection Between Sub-predictors

We now know that the interfaces between each sub-predictor and Composer are the same, and we also know how Composer is connected to the top-level BPU. This section will explain how sub-predictors are connected within Composer.

The above figure shows the connection structure of sub-predictors in Composer. It can be seen that after the three-channel prediction results are input into Composer, they are first processed by uFTB and then output. They are then successively processed by TAGE-SC, FTB, ITTAGE, and RAS, and finally connected to the prediction result output of Composer, which is then directly connected to the outside of the BPU by Composer.

For other signals, because the interfaces between Composer and each sub-predictor are the same, they are directly connected to the corresponding interfaces of each predictor by Composer, without much additional processing.

Prediction Result Interface Connection

For the sub-predictors, the prediction result channels are chained: the prediction result output of one predictor is the input of the next. Note, however, that this connection is purely combinational and does not cross a register boundary.

As shown in the above figure, taking the s1 channel as an example, from the input to the output of the last predictor, the result is modified entirely by combinational logic within the cycle. Registers exist only between the s1, s2, and s3 stages.

Therefore, increasing the number of sub-predictors does not increase the number of cycles required for a prediction; it only lengthens the combinational delay within each cycle.
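
The Composer's connection style can be sketched as function composition: each sub-predictor transforms the in-flight prediction within the same cycle, so adding predictors lengthens the combinational path but adds no cycles. The types below are illustrative, not the actual BasePredictor interface:

// Sketch: chaining sub-predictors combinationally on one channel.
case class Prediction(pc: Long, taken: Boolean, target: Long)

trait SubPredictor { def refine(in: Prediction): Prediction }

class Composer(preds: Seq[SubPredictor]) extends SubPredictor {
  // The output of one predictor feeds the input of the next, in order.
  def refine(in: Prediction): Prediction =
    preds.foldLeft(in)((p, sp) => sp.refine(p))
}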

1.4 - Introduction to the Timing of Xiangshan Branch Prediction Unit

The timing design of the three-stage pipeline is the essence of the Xiangshan BPU. This section will introduce how the prediction result redirection signal is generated, how a new PC is generated based on the prediction result, and how the prediction results of the three channels are handled.

Single-Cycle Prediction without Bubble

uFTB is the only predictor in Xiangshan BPU that can generate prediction results in a single cycle. The figure below shows the prediction process of uFTB. The s0_pc is sent from the top level of BPU, and when the s1 stage is active, the s1_pc retains the value of s0_pc from the previous cycle. This means that the value of s0_pc will move down the pipeline.

When the s1 stage is active, uFTB receives the s1_fire signal for the current cycle and generates a prediction result based on the s1_pc address in the same cycle, from which the new PC value can be obtained.

As shown in the figure, the top level of the BPU extracts the position of the next PC from the s1 prediction result channel and sends it to npc_Gen (new PC generator), which generates the s0_pc for the next cycle.

In the next cycle, uFTB gets the new PC value and starts generating the prediction block for the new PC value. Therefore, with only the s1 stage, the prediction block can be generated at a rate of one block per cycle.

Prediction Result Redirection

However, except for uFTB, other predictors require 2-3 cycles to generate prediction results. How to utilize their prediction results? And how to generate the prediction result redirection signal?

As shown in the figure, a predictor that takes two cycles to generate a result (Predictor 2) can output its prediction to the s2 prediction result channel in the s2 stage. After the BPU top level receives this prediction result, it extracts the jump target address (target) of the prediction block and connects it to npc_Gen.

At this point, the signal connected to npc_Gen contains both the old PC prediction result generated by s2 and the new PC prediction result generated by s1. How to choose which one to use for the new PC?

As mentioned earlier, the BPU compares the s2 prediction result with the s1 prediction result from the previous cycle. If they differ, s1 predicted incorrectly, and naturally the prediction made in the current cycle based on that wrong result is also wrong. Therefore, when a difference is found, npc_Gen uses the target provided by s2 as the new s0_pc.

This process is shown in the pipeline structure diagram as follows:

The Diff comparator compares the s2-stage prediction result with the s1 result from the previous cycle to generate a diff signal that guides npc_Gen in producing the next PC. At the same time, the diff signal indicates that the s1-stage prediction was incorrect, so the BPU can use it directly as the redirect signal on the s2 prediction result channel, instructing the FTQ to overwrite the previous prediction result.

The diff signal is also sent to each predictor through the s2_redirect interface to guide the predictors to update their states.

Furthermore, when the prediction result redirection of the s2 stage occurs, indicating that the prediction result of the s1 channel is incorrect, the s2 stage cannot continue to predict and needs to invalidate the s2_fire signal of the predictor pipeline and wait for the corrected prediction result to flow in.

The prediction result redirection of the s3 stage is similar to this. Its pipeline structure diagram is as follows. The specific processing process is left for you to analyze.

Redirection Requests and Other Information Generation

An external redirect request occurs only when the prediction information of all three stages is incorrect. In that case, npc_Gen receives the PC address carried by the redirect request. Since a redirect request implies that all three stages have predicted incorrectly, the fire signals of all three stages must be invalidated; npc_Gen then restarts prediction from the PC to be restored in the redirect request.
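
Putting these rules together, npc_Gen's choice can be sketched as a priority selection (hypothetical names; the actual selection is written in Chisel in the BPU top level):

// Sketch: highest priority first: external redirect, then s3 and s2
// corrections, then the s1 prediction from the previous cycle.
def nextPc(redirectPc: Option[Long], s3Target: Option[Long],
           s2Target: Option[Long], s1Target: Long): Long =
  redirectPc.orElse(s3Target).orElse(s2Target).getOrElse(s1Target)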

Other information, such as the generation of the global history and the PC, follows the same principle and is maintained based on the prediction information of each stage. The global history generates a new branch history based on the prediction results of each stage.

Pipeline Control Signals

After learning about the specific process of the pipeline, you should understand the pipeline control signals in the predictor interface, as follows:

  • s0_fire, s1_fire, s2_fire, s3_fire Indicate whether each stage of the pipeline is working.
  • s2_redirect, s3_redirect Indicate whether a prediction result redirection has occurred.
  • s1_ready, s2_ready, s3_ready Sent from the predictor to the top level of BPU, indicating whether each stage of the pipeline is ready.

Conclusion

By now, you should understand the basic design principles, external interaction logic, internal structure, timing, etc., of the Xiangshan Branch Prediction Unit, and have a rough understanding of the working principle of BPU. Xiangshan’s BPU is no longer mysterious to you.

Next, you can read the Important Structures and Interfaces Document and combine it with the source code of Xiangshan BPU to form a more detailed understanding of BPU. When you clearly understand the working principle and signal details of BPU, you can start your verification work!

2 - Important Structures and Interface Documentation

This document will describe the important structures and external interfaces in the BPU, with a description granularity that delves into the code level. By reading this document, you can understand the role of each signal in the Xiangshan branch prediction unit, comprehend the specific implementation methods of various requests, and gain a functional understanding in conjunction with the code.

The structures described in this document are consistent with the Chisel version of the Xiangshan branch prediction unit code, and the signal structures and names likewise come from the Chisel code.

This document is intended for those who have already understood the basic design of the Xiangshan branch prediction unit and want to delve into the signal details. You can selectively read according to the content needed for verification or refer to it at any time.

Among them, the FTB entry and the full prediction result interface concern how the BPU generates prediction results, and they are recommended reading for everyone.

2.1 - FTB Entry and Complete Prediction Result Interface

The basic prediction unit of the BPU is the branch prediction block, but what is the structure of a branch prediction block? Why can one branch prediction block guide the execution of so many subsequent instructions? This section describes the core data structure from which branch prediction blocks are generated, the FTB entry, and the complete prediction result structure generated from it.

FTB Entry

The FTB entry is the core data structure behind branch prediction blocks in Xiangshan. It stores the information needed to generate a branch prediction block. When the BPU performs a prediction, an initial branch prediction block is first generated from a read-out FTB entry. This block is then passed to the subsequent predictors, which read and modify its information to produce the final prediction result.

Therefore, to understand the structure of a branch prediction block, we first need to understand the structure of an FTB entry. An FTB entry corresponds to one branch prediction block, and its general structure is as follows:

First, one thing must be made clear: whether for a branch prediction block or an FTB entry, the number of instructions it can cover has a fixed limit (16 RVC instructions in the current version of Xiangshan), called the maximum prediction length. This means that to record the position of an instruction within the branch prediction block, we can use a fixed-width field specifying the instruction's offset relative to the starting address of the prediction block.

What determines the execution flow of a branch prediction block is the information about its branch instructions. The other instructions are ordinary instructions and do not affect the program's execution flow. Therefore, in a branch prediction block we only need to record the positions of the branch instructions; the positions of ordinary instructions are of no concern.

Therefore, the FTB entry defines two kinds of branch instruction slots, brSlots and tailSlot, used to store the branch instructions within the branch prediction block. In the current version of Xiangshan, brSlots contains only one slot, and tailSlot is a separate slot, for a total of two slots.

Within the maximum prediction length, when a branch instruction appears, the FTB entry records it in the corresponding slot and marks the slot as valid. If more branch instructions appear than the FTB entry can hold, the excess branch instructions are left to the next FTB entry to store. Each slot records the offset of a branch instruction relative to the starting address of the prediction block, its jump target address, and other information.

The Unique tailSlot

In RISC-V, branch instructions are mainly divided into two types: conditional branches and unconditional jumps. A branch prediction block will therefore contain at most one unconditional jump instruction, because once that instruction executes, the program's execution flow changes and the subsequent instructions in the block are no longer executed. Hence, we define a slot type called tailSlot specifically for storing this unconditional jump instruction, while conditional branch instructions are stored in brSlots.

As its name suggests, tailSlot is located in the last slot of the entire prediction block. This is also because once the unconditional jump instruction is filled, the program will definitely jump, and subsequent instructions will be handled by other prediction blocks, so we do not need to care about the subsequent instructions. However, among the instructions before the unconditional jump instruction, we need to care about whether there are conditional branch instructions, because conditional branch instructions may or may not jump. Therefore, we need to record the relevant information of the conditional branch instructions.

tailSlot Sharing

Consider a situation: if no unconditional jump instructions appear from the starting PC of the prediction block to the maximum prediction length, but two conditional branch instructions appear instead, the tailSlot will be idle, and the second conditional branch instruction cannot be stored, causing space waste.

To solve this problem, Xiangshan uses a sharing mark. We can store the second conditional branch instruction directly into the tailSlot and set the sharing mark to true, indicating that a conditional branch instruction is sharing the tailSlot normally reserved for an unconditional jump instruction. This way, the tailSlot's space is used effectively.

The isCall, isRet, and isJalr fields in the prediction block serve the tailSlot. If the tailSlot records an unconditional jump instruction, these fields will further indicate the type of the jump instruction. There is also a field in the FTB item called always_taken, which records whether each conditional branch instruction stored in each slot is always predicted to jump. If so, subsequent predictors can directly adopt this prediction result.

Through the FTB item, we can know the instruction situation in a branch prediction block, including the position and type of branch instructions. This information will be handed over to subsequent predictors, which will predict more accurate jump targets, whether to jump, and other information.

Complete Structure of the FTB Entry (FTBEntry)

Interface Definition: src/main/scala/xiangshan/frontend/FTB.scala

This section describes the complete structural definition of the FTB item, containing the following signals:

  • valid: Whether the FTB entry is valid.
    • Interface Type: Bool
  • brSlots: Conditional branch instruction slots.
    • Interface Type: Vec(numBrSlot, new FtbSlot(BR_OFFSET_LEN)) (see FtbSlot for interface list)
  • tailSlot: Unconditional jump instruction slot.
    • Interface Type: new FtbSlot(JMP_OFFSET_LEN, Some(BR_OFFSET_LEN)) (see FtbSlot for interface list)
  • pftAddr: Partial fall-through address of the prediction block (see the explanation below).
    • Interface Type: UInt(log2Up(PredictWidth).W)
  • carry: Carry bit used when reconstructing the fall-through address from pftAddr (see the explanation below).
    • Interface Type: Bool()
  • isCall: The instruction in the unconditional jump instruction slot is a call instruction.
    • Interface Type: Bool()
  • isRet: The instruction in the unconditional jump instruction slot is a ret instruction.
    • Interface Type: Bool()
  • isJalr: The instruction in the unconditional jump instruction slot is a jalr instruction.
    • Interface Type: Bool()
  • last_may_be_rvi_call: The instruction in the unconditional jump instruction slot may be an RVI type call instruction signal.
    • Interface Type: Bool()
  • always_taken: Whether each branch instruction in the prediction block is always predicted as Taken.
    • Interface Type: Vec(numBr, Bool())

Explanation: pftAddr and carry

Here, pftAddr stands for Partial Fallthrough Address. The fall-through address is the address the program reaches if it executes sequentially through the prediction block without jumping; for example, if the offset of an unconditional jump instruction is 5, the offset corresponding to the fall-through address is 6. This signal is mainly used to obtain the program's return address after a function call, and it can be understood as the end address of the prediction block.

Partial means part of the address, which is determined by the address representation method. Here, the address representation method is as follows:

  pc: | ... |<-- log(predictWidth) -->|<-- log(instBytes) -->|
           ^                         ^
           |                         |
           carryPos                  instOffsetBits

pftAddr records only the middle offset portion of the address (the part of length log(predictWidth)); a complete PC can be reconstructed by combining it with the current PC. However, a carry out of this field may occur, so a carry bit is recorded separately. carryPos is the bit position in the prediction block's address where such a carry would land.
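
As a worked sketch, reconstructing the fall-through address from pc, pftAddr, and carry might look like this. The field widths assume 16 RVC slots and 2-byte instruction granularity, which are illustrative values, not quoted from FTB.scala:

// Sketch: splice pftAddr back into the PC, applying the recorded carry.
val instOffsetBits = 1                 // log2(2-byte RVC granularity)
val pftBits        = 4                 // log2(predictWidth = 16)

def fallThrough(pc: Long, pftAddr: Long, carry: Boolean): Long = {
  val higher = pc >>> (instOffsetBits + pftBits)   // bits at and above carryPos
  val hi     = if (carry) higher + 1 else higher   // apply the stored carry
  (hi << (instOffsetBits + pftBits)) | (pftAddr << instOffsetBits)
}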

Additionally, last_may_be_rvi_call is an auxiliary signal for this address, indicating that the instruction in the unconditional jump instruction slot is an RVI type call instruction. Since pftAddr assumes the instruction length as the compressed instruction length by default when calculating, the end address is increased by only 2 bytes. If the actual call instruction is not a compressed instruction, it will lead to an incorrect return address calculation. RAS will correct this error based on this signal.

Branch Prediction Slot (FtbSlot)

Interface Definition: src/main/scala/xiangshan/frontend/FTB.scala

This interface defines the slot in the FTB entry:

  • offset: The offset of the instruction in the slot relative to the start address of the prediction block.
    • Interface Type: UInt(log2Ceil(PredictWidth).W)
  • lower: The lower bits of the jump target address.
    • Interface Type: UInt(offsetLen.W)
    • Note: The lower is set to 12 or 20 bits because the addressing capability of branch instructions is 12 bits, while the addressing capability of jump instructions is 20 bits.
  • tarStat: Whether the high bits of the PC after the jump are incremented or decremented.
    • Interface Type: UInt(TAR_STAT_SZ.W) (TAR_STAT_SZ = 2)
    • Note: The jump target address is computed from the high bits of the current PC, tarStat, and the lower field. The lower field directly stores the low bits of the jump target address; the high bits of the current PC are adjusted according to tarStat and then concatenated with lower to obtain the actual jump target address. tarStat takes three values: 0 - unchanged, 1 - incremented, 2 - decremented. (A sketch follows this list.)
  • sharing: Indicates that a conditional branch instruction is stored in an unconditional jump instruction slot.
    • Interface Type: Bool()
  • valid: Indicates whether the slot is valid.
    • Interface Type: Bool()
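
Following the tarStat note above, the target reconstruction can be sketched as below (the encoding constants and widths are illustrative, not the actual Chisel definitions):

// Sketch: rebuild a slot's target from the current PC's high bits,
// the stored low bits, and the tarStat adjustment.
val TAR_FIT = 0; val TAR_OVF = 1; val TAR_UDF = 2   // same / high+1 / high-1

def slotTarget(pc: Long, lower: Long, offsetLen: Int, tarStat: Int): Long = {
  val hi = pc >>> offsetLen
  val hiAdj = tarStat match {
    case TAR_OVF => hi + 1
    case TAR_UDF => hi - 1
    case _       => hi      // TAR_FIT: high bits unchanged
  }
  (hiAdj << offsetLen) | lower
}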

Full Branch Prediction (FullBranchPrediction)

Interface Definition: src/main/scala/xiangshan/frontend/FrontendBundle.scala

This interface defines the complete branch prediction results, included in the prediction results of each pipeline stage.

The full branch prediction result interface is similar to the FTB entry and is initially generated from it. The two slots are split into individual signals: slot_valids, targets, offsets, is_br_sharing, and so on. Several fields are also added, such as br_taken_mask and jalr_target, so that subsequent predictors can supply more precise prediction results. The hit signal indicates whether an FTB entry was hit, i.e., whether the PC of the current prediction round indexed a matching FTB entry.

Full Interface List:

  • hit: Indicates whether the FTB entry is hit.
    • Interface Type: Bool()
  • slot_valids: Indicates whether each slot in the FTB entry is valid.
    • Interface Type: Vec(totalSlot, Bool())
  • targets: The jump target address corresponding to each slot.
    • Interface Type: Vec(totalSlot, UInt(VAddrBits.W))
  • offsets: The offset of the instruction in each slot relative to the start address of the prediction block.
    • Interface Type: Vec(totalSlot, UInt(log2Ceil(PredictWidth).W))
  • fallThroughAddr: The end address of the prediction block.
    • Interface Type: UInt(VAddrBits.W)
  • fallThroughErr: Indicates that the pftAddr recorded in the FTB entry is incorrect.
    • Interface Type: Bool()
  • is_jal: Indicates that the prediction block contains a jal instruction.
    • Interface Type: Bool()
  • is_jalr: Indicates that the prediction block contains a jalr instruction.
    • Interface Type: Bool()
  • is_call: Indicates that the prediction block contains a call instruction.
    • Interface Type: Bool()
  • is_ret: Indicates that the prediction block contains a ret instruction.
    • Interface Type: Bool()
  • last_may_be_rvi_call: Indicates that the instruction in the unconditional jump instruction slot might be an RVI type call instruction.
    • Interface Type: Bool()
  • is_br_sharing: Indicates that the last slot (tailSlot) stores a conditional branch instruction.
    • Interface Type: Bool()
  • br_taken_mask: The branch prediction result, with each bit corresponding to a branch (slot), indicating whether the branch is predicted to be taken.
    • Interface Type: Vec(numBr, Bool())
  • jalr_target: The jump target of the jalr instruction in the prediction block.
    • Interface Type: UInt(VAddrBits.W)

2.2 - Redirection and Update Interfaces

Redirection and update requests are the two main types of information sent from the FTQ to the BPU. This section details the specific structure of these request interfaces.

Branch Prediction Redirect (BranchPredictionRedirect)

Interface Definition: src/main/scala/xiangshan/frontend/FrontendBundle.scala

Interface Type: BranchPredictionRedirect

This interface defines the redirection requests from the branch prediction unit, mainly used to redirect the state of the branch predictor.

The BranchPredictionRedirect interface inherits from the Redirect interface and includes many signals, only a subset of which are used by the BPU redirection. The documented structure contains only the interfaces used by the BPU.

Redirection requests have two sources: those generated by comparing IFU pre-decode information and those generated during the backend execution process.

In redirection requests, the key information is cfiUpdate, which corresponds to a control flow instruction. This information pertains to an instruction where the BPU made a prediction error. For example, if the BPU indicates that the third instruction in the prediction block is a normal instruction with no jump, but the pre-decode shows it is an unconditional jump instruction, a prediction error has occurred. In this case, the FTQ generates a redirection request. The cfiUpdate in the redirection request corresponds to this unconditional jump instruction.

The information in cfiUpdate can be broadly classified into three types:

  1. Information about the corresponding instruction and its execution status. This includes the slot (shift) and PC of the instruction in the prediction block, the type-related information of the instruction (pd), and the execution status, such as jump target and whether it jumps.
  2. History maintenance related information. The redirection request contains branch history information corresponding to the prediction block of the instruction to help the BPU restore branch history. folded_hist represents the global folded history, histPtr represents the global history pointer, and other information assists in maintaining branch history. For detailed information, refer to the BPU Top-Level Module.
  3. RAS maintenance related information. For detailed meaning, refer to the RAS sub-predictor documentation.

The level field indicates whether the redirect includes the instruction itself. If it is not included, the receiver of the redirect request assumes that the instruction has been executed, and the next prediction starts from the following instruction. The BPU top level defaults to the not-included case: upon receiving a redirect request, it incorporates the execution outcome of this instruction into the branch history.

The detailed signal list for the redirection interface is as follows:

  • level: Indicates whether the redirection request includes the current position. Low means redirection after this position, high means redirection at this position.

    • Interface Type: UInt(1.W)
  • cfiUpdate: Control flow instruction update information

    • Interface Type: CfiUpdateInfo

    • Interface List

      • pc: The PC of the instruction corresponding to the redirection request
        • Interface Type: UInt(VAddrBits.W)
      • pd: Pre-decode information of the redirection instruction
        • isRVC: Whether it is a compressed instruction
          • Interface Type: Bool
        • isCall: Whether it is a function call
          • Interface Type: Bool
        • isRet: Whether it is a function return
          • Interface Type: Bool
      • target: Target address of the redirection request instruction
        • Interface Type: UInt(VAddrBits.W)
      • taken: Whether the redirection request instruction is taken
        • Interface Type: Bool
      • shift: Slot in which the redirection request instruction is located, 0 if it is a normal instruction.
        • Interface Type: UInt((log2Ceil(numBr)+1).W)
      • addIntoHist: Whether to include the execution information of the redirection request instruction in the branch history.
        • Interface Type: Bool

      • folded_hist: Folded history corresponding to the redirection request
        • Interface Type: AllFoldedHistories(foldedGHistInfos)
      • afhob: Oldest bit of the branch history corresponding to the redirection request instruction
        • Interface Type: AllAheadFoldedHistoryOldestBits(foldedGHistInfos)
      • lastBrNumOH: Last branch position corresponding to the redirection request
        • Interface Type: UInt((numBr+1).W)
      • histPtr: Global history pointer to be restored by the redirection request
        • Interface Type: CGHPtr

      • ssp: RAS speculative stack top pointer at the commit stack position corresponding to the redirection request instruction
        • Interface Type: UInt(log2Up(RasSize).W)
      • sctr: RAS speculative stack top recursion counter corresponding to the redirection request instruction
        • Interface Type: UInt(log2Up(RasCtrSize).W)
      • TOSW: RAS speculative stack (queue) write pointer corresponding to the redirection request instruction
        • Interface Type: CGHPtr
      • TOSR: RAS speculative stack (queue) read pointer corresponding to the redirection request instruction
        • Interface Type: CGHPtr
      • NOS: RAS stack top counter corresponding to the redirection request instruction
        • Interface Type: CGHPtr

Branch Prediction Update (BranchPredictionUpdate)

Interface Definition: src/main/scala/xiangshan/frontend/FrontendBundle.scala

This interface defines the update requests for the branch predictor, mainly used to update the state of the branch predictor. The document lists only the interfaces used in the BPU.

Update requests correspond to each branch prediction block. When a branch prediction block in the FTQ has been executed, the FTQ generates an update request for this prediction block to train the predictor. Thus, an important role of the update request is to feed back the actual execution status of the instructions to the BPU. Additionally, in the Xiangshan branch prediction unit, the update request is responsible for updating FTB entries.

The information in the update request can be broadly classified into four categories:

  • PC: Indicates the starting address of the prediction block, specifying which prediction block the update request corresponds to
  • FTB Entry Update Information: The update channel contains an FTB entry structure (ftb_entry), outputs the newly generated FTB entry from the FTQ, and indicates whether it is the same as the old FTB entry (old_entry)
  • Actual Execution Status Information of Instructions: The update channel indicates the execution status of branch and unconditional jump instructions in the prediction block, and provides the address and final target of control flow instructions (i.e., instructions that jump)
  • Predictor-Related Data Corresponding to the Prediction Block: Contains spec_info and meta information (refer to the BPU Global Interface Documentation for details)

The interface list of the update request is as follows:

  • pc: PC of the update request (starting address of the prediction block)
    • Interface Type: UInt(VAddrBits.W)

  • ftb_entry: Updated FTB entry
    • Interface Type: new FTBEntry()
    • Interface List: see FTBEntry
  • old_entry: Whether the updated FTB entry is the same as the old FTB entry
    • Interface Type: Bool()

  • br_taken_mask: Mask indicating whether each slot instruction in the prediction block jumps
    • Interface Type: Vec(numBr, Bool())
  • mispred_mask: Mask indicating prediction errors in the prediction block. The first and second bits indicate whether the two conditional branch instructions were mispredicted, and the third bit indicates whether the unconditional jump instruction was mispredicted.
    • Interface Type: Vec(numBr+1, Bool())
  • jmp_taken: An unconditional jump instruction in the prediction block was taken
    • Interface Type: Bool()
  • cfi_idx: Index of the control flow instruction in the prediction block
    • Interface Type: ValidUndirectioned(UInt(log2Ceil(PredictWidth).W))
  • full_target: Jump target of the prediction block (starting address of the next prediction block)
    • Interface Type: UInt(VAddrBits.W)

  • spec_info: Last-stage speculative information corresponding to the prediction block
    • Interface Type: new SpeculativeInfo
    • Interface List: (only folded_hist is used)
      • folded_hist: Global folded history
        • Interface Type: AllFoldedHistories(foldedGHistInfos)
  • meta: Last-stage meta information corresponding to the prediction block
    • Interface Type: UInt(MaxMetaLength.W)

2.3 - BPU Global Interface

This section introduces the definition of the overall external interaction interface of the Xiangshan Branch Prediction Unit, including the presentation of global branch prediction results and single pipeline stage prediction results.

BPU Module Overall External Interface (PredictorIO)

Interface definition: src/main/scala/xiangshan/frontend/BPU.scala

PredictorIO is the overall external interface of the branch predictor (BPU). It mainly handles the interaction between the BPU and the fetch target queue (FTQ) and includes the following parts:

  • bpu_to_ftq: Information sent from BPU to FTQ, mainly for sending branch prediction results from BPU to FTQ
    • Interface type: BpuToFtqIO
    • Interface list:
      • resp: Global branch prediction information sent from BPU to FTQ
        • Interface type:DecoupledIO(new BpuToFtqBundle())
          • BpuToFtqBundle inherits from BranchPredictionResp, without additional signals
        • Interface list: See (BranchPredictionResp)
  • ftq_to_bpu: Information sent from FTQ to BPU, mainly for handling redirect and update requests
    • Interface type: FtqToBpuIO
    • Interface list:
      • redirect: Redirect request sent from FTQ to BPU
        • Interface type: Valid(new BranchPredictionRedirect)
        • Interface list: See (BranchPredictionRedirect)
      • update: Update request sent from FTQ to BPU
        • Interface type:Valid(new BranchPredictionUpdate)
        • Interface list: See (BranchPredictionUpdate)
      • enq_ptr: FTQ pointer sent to BPU, indicating which FTQ entry to write to next
        • Interface type:FtqPtr
  • ctrl: BPU control signals, mainly for enabling various predictors
    • Interface type:BPUCtrl
    • Interface list:
      • ubtb_enable: UBTB predictor enable
        • Interface type:Bool()
      • btb_enable: BTB predictor enable
        • Interface type:Bool()
      • bim_enable: BIM predictor enable
        • Interface type:Bool()
      • tage_enable: TAGE predictor enable
        • Interface type:Bool()
      • sc_enable: SC predictor enable
        • Interface type:Bool()
      • ras_enable: RAS predictor enable
        • Interface type:Bool()
      • loop_enable: LOOP predictor enable
        • Interface type:Bool()
  • reset_vector: Reset vector, which the BPU’s PC will be reset to upon reset
    • Interface type:UInt(PAddrBits.W)

Global Branch Prediction Information (BranchPredictionResp)

Interface definition: src/main/scala/xiangshan/frontend/FrontendBundle.scala

This interface defines all the prediction result information of the branch predictor, including the prediction results of each stage and the related information output by the last pipeline stage.

  • s1 Branch prediction result of the s1 pipeline stage
  • s2 Branch prediction result of the s2 pipeline stage
  • s3 Branch prediction result of the s3 pipeline stage
    • Interface type:BranchPredictionBundle (shared by the s1, s2, and s3 results)
    • Interface list: See (BranchPredictionBundle)
  • last_stage_meta Metadata of the prediction result output by the last pipeline stage. It is a bit vector output by each predictor and combined by the Composer.
    • Interface type:UInt(MaxMetaLength.W)
  • last_stage_spec_info Related information of the prediction result output by the last pipeline stage
    • Interface type:Ftq_Redirect_SRAMEntry
    • Interface list:
      • folded_hist Global folded history
        • Interface type:AllFoldedHistories(foldedGHistInfos)
      • afhob Global branch history oldest bit
        • Interface type:AllAheadFoldedHistoryOldestBits(foldedGHistInfos)
      • lastBrNumOH Last branch position
        • Interface type:UInt((numBr+1).W)
      • histPtr Global branch history pointer
        • Interface type:CGHPtr
      • ssp RAS speculation stack pointer at commit stack position
        • Interface type:UInt(log2Up(RasSize).W)
      • sctr RAS speculation stack recursion counter
        • Interface type:UInt(log2Up(RasCtrSize).W)
      • TOSW RAS speculation stack (queue) write pointer
        • Interface type:CGHPtr
      • TOSR RAS speculation stack (queue) read pointer
        • Interface type:CGHPtr
      • NOS RAS stack top counter
        • Interface type:CGHPtr
      • topAddr RAS stack top return address
        • Interface type:UInt(VAddrBits.W)
  • last_stage_ftb_entry FTB entry output by the last pipeline stage
    • Interface type:FTBEntry
    • Interface list: See (FTBEntry)

Single Pipeline Stage Prediction Result (BranchPredictionBundle)

Interface definition: src/main/scala/xiangshan/frontend/FrontendBundle.scala

This interface defines the branch prediction result information output by each pipeline stage. Its fields are as follows:

  • pc Starting PC of the predicted block
    • Interface type: Vec(numDup, UInt(VAddrBits.W)) numDup is only for register replication, with identical signals
  • valid Whether the prediction result is valid
    • Interface type: Vec(numDup, Bool())
  • hasRedirect Whether a redirect is needed
    • Interface description: Only the s2 and s3 stages will redirect, and the prediction result of this stage will override the previous pipeline stage’s result when a redirect occurs
    • Interface type: Vec(numDup, Bool())
  • ftq_idx FTQ pointer, pointing to the FTQ entry corresponding to the prediction information of this stage
    • Interface type: new FtqPtr
  • full_pred Complete branch prediction result
    • Interface type: Vec(numDup, new FullBranchPrediction)
    • Interface list: See (FullBranchPrediction)

2.4 - Base Predictor Class and Sub Predictor Interface

This document introduces the sub-predictor interface in the Xiangshan BPU, as well as the use of the sub-predictor base class. Reading this document can help you understand the external interactions of sub-predictors and the use of signals in the sub-predictor base class.

In the Xiangshan branch prediction unit, all its sub-predictors and the class implementations of Composer are inherited from the sub-predictor base class (BasePredictor). The sub-predictor interface (BasePredictorIO) is also defined in the sub-predictor base class. Therefore, we can consider that Composer and all sub-predictors have the same interface.

In understanding and verifying a sub-predictor, our most direct external interactions occur through the sub-predictor interface and some variables defined in the sub-predictor base class. Therefore, before verifying a sub-predictor, it is strongly recommended that you read and understand this section of the document.

The general content and usage of the sub-branch predictor interface have been introduced in the Xiangshan Branch Prediction Unit (BPU) Basic Design section. This document will focus on the signal details of the interface.

Sub-Branch Predictor Interface (BasePredictorIO)

Interface Definition: src/main/scala/xiangshan/frontend/BPU.scala

Each sub-branch predictor needs to implement this interface, which defines the input and output interfaces of the sub-branch predictor.

Note: Some signals are defined with numDup copies, where every copy is exactly the same. The duplication is for register replication in physical design rather than functionality.

The detailed signal list is as follows:

  • reset_vector Reset vector; when reset occurs, the BPU's pc will be reset to this value.

    • Interface Type:UInt(PAddrBits.W)
  • in Information sent from the BPU to the sub-branch predictor

    • Interface Type:DecoupledIO(new BasePredictorInput)
    • Signal List:
      • s0_pc PC of the s0 pipeline stage
        • Interface Type:Vec(numDup, UInt(VAddrBits.W))
      • folded_hist Global folded history information
        • Interface Type:Vec(numDup, new AllFoldedHistories(foldedGHistInfos))
        • Signal List: See (AllFoldedHistories)
      • ghist Global branch history information
        • Interface Type:UInt(HistoryLength.W)
      • resp_in Global branch prediction information (including s1, s2, s3 prediction result information)
        • Interface Type:BranchPredictionResp
        • Signal List: See (BranchPredictionResp)
  • out Information sent from the sub-branch predictor to the BPU (including s1, s2, s3 prediction result information)

    • Interface Type:new BasePredictorOutput (inherits from BranchPredictionResp)
    • Signal List: See (BranchPredictionResp)
  • ctrl BPU sub-predictor enable control signal, mainly used to control whether each predictor is enabled

    • Interface Type:BPUCtrl
    • Interface list:
      • ubtb_enable: UBTB predictor enable
        • Interface Type:Bool()
      • btb_enable: BTB predictor enable
        • Interface Type:Bool()
      • bim_enable: BIM predictor enable
        • Interface Type:Bool()
      • tage_enable: TAGE predictor enable
        • Interface Type:Bool()
      • sc_enable: SC predictor enable
        • Interface Type:Bool()
      • ras_enable: RAS predictor enable
        • Interface Type:Bool()
      • loop_enable: LOOP predictor enable
        • Interface Type:Bool()
  • s0_fire s0 stage handshake success signal

    • Interface Type:Vec(numDup, Bool())
  • s1_fire s1 stage handshake success signal

    • Interface Type:Vec(numDup, Bool())
  • s2_fire s2 stage handshake success signal

    • Interface Type:Vec(numDup, Bool())
  • s3_fire s3 stage handshake success signal

    • Interface Type:Vec(numDup, Bool())
  • s2_redirect s2 stage redirection signal

    • Interface Type:Vec(numDup, Bool())
  • s3_redirect s3 stage redirection signal

    • Interface Type:Vec(numDup, Bool())
  • s1_ready s1 stage ready to receive information (Direction: output from the sub-predictor)

    • Interface Type:Bool()
  • s2_ready s2 stage ready to receive information (Direction: output from the sub-predictor)

    • Interface Type:Bool()
  • s3_ready s3 stage ready to receive information (Direction: output from the sub-predictor)

    • Interface Type:Bool()
  • update Update request forwarded from the BPU to the sub-branch predictor

    • Interface Type:Valid(new BranchPredictionUpdate)
    • Signal List: See (BranchPredictionUpdate)
  • redirect Redirect request forwarded from the BPU to the sub-branch predictor

    • Interface Type:Valid(new BranchPredictionRedirect)
    • Signal List: See (BranchPredictionRedirect)

The pipeline control signals will be further explained in the following content.

Global Folded History (AllFoldedHistories)

Interface Definition:src/main/scala/xiangshan/frontend/FrontendBundle.scala

Interface Type:AllFoldedHistories(foldedGHistInfos)

The interface of the global folded history consists of only one FoldedHistory list.

  • hist List of folded histories
    • Interface Type:MixedVec(gen.map{case (l, cl) => new FoldedHistory(l, cl, numBr)})

The interface information of FoldedHistory also has only one item.

  • folded_hist Single folded history, with a bit width equal to the compressed history length.
    • Interface Type:UInt(compLen.W)

This means that the interface of the global folded history is actually a list of folded histories, where each item is a folded history of a specific length.

Base Predictor Class

The base predictor class defines several signals, which can be accessed in each sub-predictor, and several connections are made within it.

Most of the signals are relatively easy to understand. Particular attention is needed for the pc of each pipeline stage, which requires an understanding of the pipeline control signals. Next, we introduce the pipeline control signals that sub-predictors need to care about, as well as the meanings of the s1_pc, s2_pc, and s3_pc signals.

There are three groups of pipeline control signals:

  • fire signals (s0, s1, s2, s3)
  • redirect signals (s2, s3)
  • ready signals (s1, s2, s3)

The pc signals in the base predictor class are divided into four groups: s0_pc_dup, s1_pc_dup, s2_pc_dup, s3_pc_dup. Each group contains multiple identical pc signals, duplicated for register-replication reasons. Therefore, we can simply treat them as s0_pc, s1_pc, s2_pc, s3_pc.

Their relationship with the pipeline control signals is as follows:

  • s0_pc is directly connected from the in.s0_pc in the input interface.
  • When s0_fire is asserted, s1_pc takes the value of s0_pc in the next cycle.
  • When s1_fire is asserted, s2_pc takes the value of s1_pc in the next cycle.
  • When s2_fire is asserted, s3_pc takes the value of s2_pc in the next cycle.
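A minimal Chisel sketch of this forwarding, assuming the duplicated copies are collapsed into one signal per stage (RegEnable captures its input only when the enable is high):

val s1_pc = RegEnable(s0_pc, io.s0_fire(0)) // s1_pc takes s0_pc one cycle after s0_fire
val s2_pc = RegEnable(s1_pc, io.s1_fire(0))
val s3_pc = RegEnable(s2_pc, io.s2_fire(0))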

In other words, the fire signal affects whether the data of the next cycle is valid. For example, the s0_fire signal affects whether the data of the s1 pipeline is valid, and the s1_fire signal affects whether the data of the s2 pipeline is valid.

Whether the fire signal is valid depends on whether the data of this pipeline stage is valid and whether the next pipeline stage is ready. For example, the s1_fire signal is valid only if the data of the s1 stage is valid and the s2_ready signal from the sub-predictor output is valid. At this point, it can be considered that the data processing of the s1 stage is completed, the s2 stage is ready, and the data of the next cycle will be directly sent to the s2 stage.

Therefore, in the sub-predictor, taking the s1 stage as an example, s1_ready can block data from entering the s1 stage. When s1_ready is asserted, new data may enter, and the s1-stage data becomes valid in the next cycle. When s1_fire is asserted, it indicates that the data in the s1 stage is valid and the predictor has produced its s1-stage result; that data will be passed to the s2 stage in the next cycle.

The redirect signal is relatively clear. Taking the s2 stage as an example, s2_redirect is asserted when s2_fire is valid and the s2 prediction result differs from the s1 prediction result of the previous cycle.

3 - Submodule Documentation

This section of the documentation will provide a detailed introduction to each module of the Xiangshan Branch Prediction Unit, including the BPU top-level and five sub-predictors.

In each module’s documentation, we will provide a detailed explanation of the module’s role in the Xiangshan Branch Prediction Unit, as well as the module’s algorithm principles, structure, and timing.

Students responsible for verifying a specific module should read the corresponding documentation thoroughly and understand it in conjunction with the code. Other documents can also be read to help you understand the overall functionality of the Xiangshan Branch Prediction Unit. During the understanding process, you may need to constantly review the basic design concepts and interface signals described in previous documents.

3.1 - BPU Top Module

The overall function and structure of the BPU top level have been roughly described in previous documents; those verifying the BPU top level may need a more detailed description. Because the BPU top level has many functions, this section divides it into several major functional points. Even so, the top level contains many details that can only be fully understood by reading the code.

Generator Maintenance Method

From the basic design documents of Xiangshan, we know that the BPU top level maintains the s0-stage values of various variables, such as the PC and branch history, through generators. The core idea is to use the redirect signals of the prediction results to decide which pipeline stage's result to adopt.

There are a total of 6 generators in the BPU top level:

  • npcGen maintains the PC
  • ghistPtrGen maintains the global history pointer
  • ghvBitWriteGens maintains global history write data
  • foldedGhGen maintains the folded history
  • lastBrNumOHGen maintains the position of the last effective branch instruction in the previous cycle
  • aheadFhObGen maintains the oldest position of the branch history

Except for npcGen, the rest of the generators will be introduced in this document. In this section, we focus on how the generators produce the value used for the next prediction.

In the code, you can see generators defined in a similar way:

val npcGen = new PhyPriorityMuxGenerator[UInt]

Next, the code registers the data sources for the generators through multiple statements:

npcGen.register(true.B, reg ...)                      // lowest priority: the held s0 value
npcGen.register(s1_valid, s1_target, ...)             // updated from the s1 prediction result
npcGen.register(s2_redirect, s2_target, ...)          // updated on an s2 redirect
npcGen.register(s3_redirect, s3_target, ...)          // updated on an s3 redirect
npcGen.register(do_redirect.valid, do_redirect.bits.cfiUpdate.target, ...)  // highest priority: external redirect

Each line is called a registration. In a registration, the first parameter is the data-valid signal, and the second parameter carries the data itself. Priority is determined by registration order: the later the registration, the higher the priority. Therefore, when several sources are valid in the same cycle, the priority from low to high is:

  • s0 blocked data
  • Data updated based on s1 prediction results
  • Data updated based on s2 prediction results
  • Data updated based on s3 prediction results
  • Data in external redirection of BPU

In this way, when the redirection of the prediction result is valid, we can avoid using the earlier pipeline level’s prediction result and adopt the corrected prediction result. This allows us to handle external redirection requests with the highest priority.

We can summarize how every generator produces its s0 signal: among all sources whose valid signal is asserted, the generator selects the data of the highest-priority one.
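The following is a minimal sketch of this selection rule; it is an illustrative stand-in, not the actual PhyPriorityMuxGenerator implementation:

import chisel3._
import chisel3.util._

class PriorityMuxGeneratorSketch[T <: Data] {
  private var sources: Seq[(Bool, T)] = Seq.empty
  def register(valid: Bool, data: T): Unit = sources :+= ((valid, data))
  // later registrations have higher priority, so reverse before the priority mux
  def apply(): T = PriorityMux(sources.reverse)
}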

Global Branch History

We know that the global branch history is maintained at the BPU top level, and the maintenance strategy is consistent with the PC maintenance strategy. That is, after the prediction result is generated at each stage of the pipeline, the global branch history is updated based on the corresponding signals.

The top level defines two sets of signals to maintain the global branch history:

  • ghv stores the global branch history (maximum length 256)
  • ghist_ptr global branch history pointer, pointing to the current position of the global branch history

Similar to s0_pc, s1_pc, s2_pc, the BPU top level also maintains signals for each stage of the global history pointer: s0_ghist_ptr, s1_ghist_ptr, s2_ghist_ptr. However, the content in ghv is fixed in position, and we only use ghist_ptr to locate where the current global branch history starts.

Calculating the Current Global Branch History with ghist_ptr

The use of ghist_ptr is only visible at the BPU top level. What we pass to the sub-predictors is the global branch history after the data in the global history register is shifted based on ghist_ptr. In the global branch history obtained by the sub-predictor, the least significant bit corresponds to the newest bit of the global branch history, and the most significant bit corresponds to the oldest bit.

So how is the shifting done? First, let’s see how the global history is stored in ghv.

|===== ghist =====>| =======>|
n                  ^         0
                   ghist_ptr

As shown in the figure above, the sequence represents the entire ghv register, and ghist_ptr points to the position in ghv that holds the newest bit of the global branch history. When a new history record needs to be added, ghist_ptr is first decremented by 1, and the new bit is written to the position it then points to. When ghist_ptr is decremented past 0, it wraps around to the highest position, overwriting the oldest previously written history.

No matter what, starting from the position pointed to by ghist_ptr, the pointer increases and the history gets older. Therefore, when we need to calculate the current global branch history, we only need to circularly right-shift the ghv register by ghist_ptr positions.
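A sketch of this computation, using the common trick of concatenating the register with itself to express a rotation (historyLength stands for the ghv width):

// rotate ghv right by the pointer value: bit 0 of the result is the newest history bit
def currentGhist(ghv: UInt, ptrValue: UInt, historyLength: Int): UInt =
  (Cat(ghv, ghv) >> ptrValue)(historyLength - 1, 0)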

Updating Global Branch History

The strategy for updating the global branch history is consistent with the strategy for updating the pc: at each pipeline stage, a pointer for that stage and an update description of ghv are generated based on the stage's prediction result, and all of them are sent to the relevant generators for processing.

The update description of ghv is the information that guides writes to the ghv register. The Xiangshan BPU maintains two signals to fulfill this duty:

  • ghv_wdata the data that needs to be written into ghv
  • ghv_wens the write bit mask

For the final update, only the bits identified by ghv_wens need to be written with the corresponding bits of ghv_wdata.

Therefore, each pipeline stage needs to generate three sets of information: ghist_ptr, ghv_wdata, ghv_wens.

Specifically, the prediction result can contain up to two branch instructions. We only need to set these pieces of information according to the actual situation. Here are some examples:

  • Only the first slot is valid, and the conditional branch instruction in it is predicted as not taken. Then the bit of ghv_wdata at the next write position is set to 0, the corresponding bit of ghv_wens is set to 1, and ghist_ptr is decremented by one.
  • Both slots contain conditional branch instructions, the first is predicted as not taken, and the second is predicted as taken. At this time, ghist_ptr should be decremented by two, and the other two pieces of information should indicate writing 01 to ghv.

Here, only the ghv_wdata information is maintained by a generator (ghvBitWriteGens); ghv_wens is not. A small trick is used: the generator outputs the ghv_wdata of the selected stage, while the final ghv_wens is obtained by bitwise OR-ing the ghv_wens of all stages.

This works because:

  • If a later pipeline stage is valid, the global history pointer is restored to an older position, so it does not matter that newer positions may also be modified by an earlier stage's ghv_wens.
  • If only an earlier pipeline stage is valid, the global history pointer continues advancing to newer positions, and the later stages will not assert ghv_wens because their redirects are inactive.

Branch Folded History

The branch folded history passed to the predictor is also maintained by the BPU top level. To shorten the update delay of the folded history, the BPU maintains many variables to support the fast update of the branch folded history. We will focus on this strategy and introduce the function of each variable.

Before we start, let’s first look at how the branch folded history is defined and its structure.

Branch Folded History

If you have checked the BPU global interface documentation, you will know that the sub-predictor receives an array of bit vectors of different lengths, representing various lengths of folded history, and these folded histories are compressed from the global branch history.

For the global branch history, we have a register storing up to 256 bits of history. Suppose a predictor uses 15 bits of branch history; after the shift described above, we obtain a branch history in which the least significant bit is the newest record and the most significant bit is the oldest.

At this time, if we need to generate a 6-bit folded history from these 15 bits, we will use an XOR strategy for compression. The specific process is as follows:

    h[5]         h[4]       h[3]    h[2]   h[1]   h[0]
    h[11]        h[10]      h[9]    h[8]   h[7]   h[6]
^                                   h[14]  h[13]  h[12]
---------------------------------------------------------------
    h[5]^h[11]   h[4]^h[10]         ...           h[0]^h[6]^h[12]

That is, after arranging the bits as shown above, XOR the values in each column; the result is the folded history of length 6.
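A sketch of this folding (Chisel zero-extends the narrower operand of ^, which matches the shorter final row in the arrangement above):

// fold a histLen-bit history into compLen bits by XOR-ing consecutive compLen-bit chunks
def fold(hist: UInt, histLen: Int, compLen: Int): UInt = {
  val chunks = (0 until histLen by compLen).map { lo =>
    hist(math.min(lo + compLen, histLen) - 1, lo)
  }
  chunks.reduce(_ ^ _)
}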

Method for Updating Branch Folded History

Now we want to update this branch folded history. When a new record is inserted into the global branch history, it enters from the least significant bit, so the original h[0] becomes h[1]. We could obtain the new folded history by simply redoing the whole XOR, but that is inefficient, because the XOR network can become very long. Instead, let us examine what a single insertion does to the folded history.

In the example above, before inserting a new history, the 6-bit folded history is generated in the following arrangement:

h[5]   h[4]   h[3]  h[2]  h[1]  h[0]
h[11]  h[10]  h[9]  h[8]  h[7]  h[6]
                    h[14] h[13] h[12]

After inserting a new history, it becomes like this:

h[4]   h[3]   h[2]  h[1]  h[0]  h[new]
h[10]  h[9]   h[8]  h[7]  h[6]  h[5]
           (h[14])  h[13] h[12] h[11]

We can notice some patterns:

Before insertion:
h[5]   {h[4]   h[3]  h[2]  h[1]  h[0] }
h[11]  {h[10]  h[9]  h[8]  h[7]  h[6] }
       {             h[14] h[13] h[12]}
After insertion:
{h[4]   h[3]   h[2]  h[1]  h[0] } h[new]
{h[10]  h[9]   h[8]  h[7]  h[6] } h[5]
{           (h[14])  h[13] h[12]} h[11]

The content in the curly braces has undergone a complete left shift, with h[5] and h[11] moving from the most significant bit to the least significant bit. In other words, within the compressed history this is simply a cyclic left shift.

However, only two bits have actually changed: the newly inserted h[new] and the discarded h[14]. h[new] always enters at the first position, and the discarded bit always sits at a fixed position. Therefore, to complete an update we only need the value of the newly inserted history bit and the oldest bit of the previous history: after the cyclic shift, patching these two positions according to the actual situation yields the updated folded history.
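A sketch of this O(1) update, derived from the example above (assuming compLen > 1; the real FoldedHistory logic also covers the corner cases):

// rotate left by 1, XOR in the new bit at position 0,
// and XOR out the expired oldest bit, which lands at position histLen % compLen
def foldedUpdate(folded: UInt, newBit: Bool, oldestBit: Bool,
                 histLen: Int, compLen: Int): UInt = {
  val rotated = Cat(folded(compLen - 2, 0), folded(compLen - 1))
  (rotated ^ newBit.asUInt ^ (oldestBit.asUInt << (histLen % compLen)))(compLen - 1, 0)
}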

Implementation of Update Method

To achieve this update in the top-level BPU, two additional variables are maintained:

  • ahead_fh_oldest_bits: the oldest bit of the global branch history, together with several additional, even older bits stored ahead of it
  • last_br_num_oh: the slot number (one-hot) of the last effective branch instruction in the previous prediction

A timing optimization happens here: the global history pointer can only be updated once a stage's prediction result (the branch outcome) is available, and computing the new oldest bit after updating the pointer would lengthen the critical path. Therefore, the branch outcome is recorded, and the oldest bit is updated in the next cycle, when it is actually needed.

The oldest bits must also be kept some distance further back, because after an update using the branch outcome, the relatively newer bits become the new oldest bits.

Thus, there are three generators related to the folded history: foldedGhGen, lastBrNumOHGen, and aheadFhObGen.

Information required for each update of the folded history

  • Folded history information before the update
  • Oldest bit of the global branch history (ahead_fh_oldest_bits)
  • Last prediction’s branch outcome (last_br_num_oh)
  • Whether there is a branch instruction in this update
  • The branch outcome of this update: the slot number of the last effective branch instruction

For each update of the folded history, the true oldest bit is first selected from ahead_fh_oldest_bits using last_br_num_oh. Then, based on this oldest bit and the branch outcome of the current update, a few bits are modified, and finally a cyclic left shift completes the update.

Pipeline Control Method

Pipeline control is the core function of the BPU, with complex logic. All pipeline control signals in the top-level BPU are as follows:

  • s1_valid, s2_valid, s3_valid: indicate the corresponding pipeline data is valid
  • s1_ready, s2_ready, s3_ready: indicate the corresponding pipeline stage is ready to accept the prediction from the previous stage
  • s1_components_ready, s2_components_ready, s3_components_ready: indicate the readiness of the sub-predictors for the corresponding pipeline stage
  • s0_fire, s1_fire, s2_fire, s3_fire: successful handshake signals, indicating that the pipeline data is valid and has been successfully passed to the next pipeline
  • s1_flush, s2_flush, s3_flush: indicate whether the current pipeline needs flushing
  • s2_redirect, s3_redirect: indicate whether the current pipeline needs to redirect due to a different prediction result

valid, ready, and fire

We will introduce the purpose of each signal step by step. First, the fire signal indicates a successful handshake in the pipeline, meaning the data has been successfully passed to the next stage. It marks the end of the current cycle and of this stage's prediction; the next stage's prediction begins in the following cycle.

This requires two conditions:

  1. valid: The data in the current pipeline stage is valid.
  2. ready: Indicates whether the next pipeline stage in the BPU top level and the predictor are ready.

When these two signals are high simultaneously, the fire signal is valid, indicating a successful handshake. For an isolated single prediction, the handshake walks through the stages one cycle at a time (in reality, each pipeline stage is valid in most cycles).

Of the four sets of signals mentioned earlier, component_ready is generated by the predictor, while the rest are maintained by the BPU top level, with only the fire set of signals exposed to the sub-predictor.

Next, let’s take s2 as an example to see how each signal is maintained.

ready Signal

s2_ready := s2_fire || !s2_valid

This assignment statement is a combinational circuit assignment, meaning that s2_ready is directly related to s2_fire and s2_valid for this cycle, with two possible situations:

  • If s2_valid is invalid for this cycle, indicating that the s2 pipeline stage is empty and can accept new data, then s2_ready is valid.
  • If s2_valid is valid for this cycle, indicating that the s2 pipeline stage has data that has not been passed to the next stage yet, but if s2_fire, then the data will be passed in the next cycle. In this case, s2_ready is valid, indicating that the data can flow into the next stage.

valid Signal

Maintaining s2_valid is relatively simple, as it is only related to the s1_fire and s2_fire signals. The relationship is as follows:

  • When s1_fire is valid, data is flowing in, and s2_valid becomes valid in the next cycle.
  • When s2_fire is valid, data is flowing out, and s2_valid becomes invalid in the next cycle.
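A sketch of these two rules (flush handling omitted; see below):

when(s2_fire) { s2_valid := false.B } // data leaves s2
when(s1_fire) { s2_valid := true.B }  // data enters s2; in Chisel the later connection wins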

fire Signal

The fire signal is somewhat special, but for intermediate pipeline stages, its maintenance is straightforward. For example,

s2_fire := s2_valid && s3_components_ready && s3_ready

This simply requires considering the valid of the current pipeline stage and the ready of the next pipeline stage.

However, for s0_fire, since there is no valid signal, it is directly set to s1_components_ready && s1_ready.

For s3_fire, as there is no ready signal for the next stage, it is directly set to s3_valid.
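Written out in the same style, the two boundary cases are:

s0_fire := s1_components_ready && s1_ready // s0 has no valid signal
s3_fire := s3_valid                        // s3 has no next stage to wait for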

Incorporating Flush and Redirect

When there is a different prediction result in the pipeline, a redirection signal is generated, and the pipeline needs to be flushed. flush and redirect handle these two tasks. redirect indicates whether the current pipeline stage needs redirection, while flush indicates whether the current pipeline stage needs flushing.

redirect Signal

The generation of s2_redirect is as follows:

s2_redirect := s2_fire && s2_redirect_s1_last_pred

This means that when s2_fire is valid and the prediction result of s2 is different from the prediction result saved from s1, this signal is valid. Later, this signal will be connected to the input of the sub-predictor and the output of the BPU prediction result, guiding the sub-predictor and FTQ to restore their states.

flush Signal

The flush signal is used to guide the flushing of the pipeline. For example, when s3_redirect is valid, it means that the incorrect prediction result has entered the pipeline, and both s1 and s2 are now predicting based on the wrong result. Therefore, the pipeline needs to be flushed to stop the previous stages and wait for new prediction results to enter.

Specifically, the relationship between them is:

 s2_flush := s3_flush || s3_redirect
 s1_flush := s2_flush || s2_redirect

This means that if a pipeline stage needs redirection, all earlier stages are flushed. The purpose of flush is to guide the valid signal: if valid is high in a cycle but fire is not, the incorrect data has not yet been taken by the next stage; asserting flush forces valid to become invalid in the next cycle, preventing incorrect data from lingering in the pipeline.

However, the effect of flush on the valid signal varies depending on each pipeline stage. For example:

  • For the s1 pipeline stage, even when flush is valid, if s0_fire is valid (new data is flowing in), valid remains valid in the next cycle.
  • For the s2 pipeline stage, if s2_flush is valid, valid will certainly be invalid in the next cycle (because s1 is flushed as well), so valid can simply be cleared. There is a special case, however: s2_redirect can occur without s2_flush being asserted. If s1_fire occurs at that moment, s1's now-incorrect prediction may still flow in, so s2_valid must additionally be determined from the s1_flush signal.

The use of flush is complex; for the finer details, please refer to the code.

Redirect Recovery Logic

When the redirect request from FTQ to BPU takes effect, it indicates that all stages of the pipeline have incorrect predictions, and all stages should be flushed. This can be achieved by setting s3_flush to be valid. Therefore, we have:

 s3_flush := redirect_req.valid

In the BPU, the redirect request is delayed by one cycle before being used. The response of s1_valid to the flush signal therefore changes: when the (undelayed) redirect request is valid, s1_valid is set invalid in the next cycle immediately, without consulting the s0_fire signal.

At this point, generators such as npcGen must also generate directly from the data in the redirect request, which effectively restores the BPU's state to the point before the error occurred. Note, however, that the default redirect level in the BPU is flushAfter: the redirect request corresponds to the mispredicted instruction, and the BPU assumes that, although this instruction was predicted incorrectly, the backend has corrected and executed it. The next prediction can therefore start directly from the instruction after it.

Therefore, when recovering from a redirect, it is necessary not only to restore the information from the redirect interface, but also to update the history with the execution outcome of the mispredicted instruction.

3.2 - uFTB Branch Predictor

Introduction to uFTB

uFTB is the first of Xiangshan's predictors and serves as the cornerstone on which the other predictors build their results. uFTB works in the s1 stage: after obtaining s1_pc, it generates a prediction result within the same cycle and outputs it on the s1 channel, leaving the other channels unmodified. It provides the positions of the branch instructions and their targets; subsequent predictors refine their predictions based on this result.

In essence, uFTB is an FTB entry cache: it stores FTB entries, and the basic prediction result is generated directly from the FTB entry that is read out.

Therefore, before reading this document, make sure you understand FTB entries and their meaning, as well as the specific details of the prediction result interface.

Functionality of uFTB

  • Cache FTB entries and generate one-cycle prediction results: uFTB maintains a small FTB entry cache. After receiving the PC, it reads out the FTB entry corresponding to that PC within one cycle and generates the s1-stage prediction result from it.
  • Maintain two-bit saturating counters to provide basic conditional branch results: uFTB maintains two-bit saturating counters for each line of the FTB entry cache; the direction prediction is reflected in uFTB's prediction result output.
  • Update the FTB entry cache and the two-bit saturating counters based on update requests.

uFTB Cache Structure

As mentioned above, uFTB is essentially a small cache that stores FTB entries. Its approximate structure is described below.

In the current version of Xiangshan, uFTB has 32 cache lines; each cache line is called a FauFTBWay and can store one FTB entry.

When the s1 pipeline stage is valid, uFTB uses s1_pc to decide which entry of the uFTB cache to read. The cache is matched by the tag field of the PC, defined as pc[16:1], i.e., 16 bits taken from the PC are used as an identifier to match a line in the cache.

Each line in the cache, i.e., each FauFTBWay, exposes a data request interface with three items:

  • req_tag: Input tag identifier extracted from pc
  • resp: Output the FTB item stored in this line
  • resp_hit: Output indicating whether the FTB entry in this line matches req_tag

uFTB connects the tag to the data request interface of every line and selects the hit FTB entry according to the resp_hit signals. The subsequent steps generate a complete prediction result from this FTB entry.
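A condensed sketch of one cache way, following the three ports listed above (an illustration, not the verbatim FauFTBWay source):

class FauFTBWaySketch(implicit p: Parameters) extends XSModule {
  val io = IO(new Bundle {
    val req_tag  = Input(UInt(16.W))    // tag extracted from pc[16:1]
    val resp     = Output(new FTBEntry) // the stored FTB entry
    val resp_hit = Output(Bool())       // stored tag matches req_tag
  })
  val valid = RegInit(false.B)
  val tag   = Reg(UInt(16.W))
  val data  = Reg(new FTBEntry)
  io.resp     := data
  io.resp_hit := valid && tag === io.req_tag
}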

Two-Bit Saturating Counters

uFTB maintains two-bit saturating counters for each line of the FTB entry cache. Since an FTB entry can hold up to two conditional branch instructions, each line also has two two-bit saturating counters, providing rough direction predictions for those branches.

When looking up an FTB entry, uFTB also looks up the corresponding two-bit saturating counters.

The prediction result is then generated from the FTB entry together with the contents of its two two-bit saturating counters.

Prediction Result Generation

After the FTB entry and its two two-bit saturating counters are read out using s1_pc, uFTB generates the prediction result from them. The result is output on the s1 channel when the s1 pipeline stage is valid; the s2 and s3 channels are left unmodified.

The signal generation method in the s1 prediction result can refer to the following list:

  • hit Whether the FTB entry hits
    • Generation method: valid when any FauFTBWay's resp_hit signal is asserted
  • slot_valids: Slot valid bit, indicating whether each slot in the ftb item is valid
  • targets: Jump target address corresponding to each slot
  • offsets: The offset of each instruction in the slot relative to the starting address of the prediction block
  • is_jal: Whether the prediction block contains a jal instruction
  • is_jalr: Whether the prediction block contains a jalr instruction
  • is_call: Whether the prediction block contains a call instruction
  • is_ret: Whether the prediction block contains a ret instruction
  • last_may_be_rvi_call: Signal indicating that the last slot in the prediction block may be an RVI type call instruction
  • is_br_sharing: Indicates that the last slot (tailSlot) holds a conditional branch instruction.
    • Generation: Exported from the corresponding field in the FTB entry.
  • fallThroughErr: The pftAddr recorded in the FTB entry is incorrect.
    • Generation: Compare whether the end address represented by pftAddr is greater than the start address of the predicted block. If it is smaller, an error has occurred, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
  • fallThroughAddr: The end address of the predicted block.
    • Generation: If fallThroughErr is invalid, it is generated based on pftAddr. Otherwise, it is set to the start address + prediction width.
  • br_taken_mask: Branch prediction result, with each branch (slot) corresponding to a bit indicating whether the branch is predicted as taken.
    • Generation: Generated based on the always_taken field in the FTB entry and the result indicated by the two-bit saturating counter.
  • jalr_target: The jump target of the jalr in this predicted block.
    • Generation: From the jump target in the tailSlot of the FTB entry.
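As a small illustration of the br_taken_mask rule above (ctrs is an assumed name for the two read-out counters; a two-bit counter's MSB encodes the predicted direction):

// taken if the slot is marked always_taken in the FTB entry,
// or its two-bit saturating counter indicates taken
for (i <- 0 until numBr) {
  br_taken_mask(i) := ftb_entry.always_taken(i) || ctrs(i)(1)
}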

uFTB Update

The update of uFTB involves updating the FTB entry cache and the two-bit saturating counter, with the update content obtained through the update interface.

In the uFTB predictor, the reading and writing of the cache and the two-bit saturating counter do not conflict, so we do not need to consider timing conflicts between reading and updating and can consider them as two independent parts.

FTB Cache Update

The update process of the FTB cache is simple. The update channel has already specified the PC and the newly generated FTB entry, so it only needs to be written to the specified position in the cache.

FTB cache updating requires two cycles:

  • In the first cycle, calculate the following based on the signals in the update:
    • Which row in the cache corresponds to the update request: the PC extracted from the update request is sent to the update-request channel of each FauFTBWay, and the hit signal returned by each row is computed.
  • In the second cycle, write according to the position calculated in the first cycle. If no row hit, uFTB uses a pseudo-LRU replacement algorithm to select the row to write. Specifically, the ftb_entry signal group in the update channel contains the complete information of the new FTB entry, which is sent to the cache row being updated.

Two-Bit Saturating Counter Update

The update of the two-bit saturating counter needs to be combined with the actual execution of the subsequent program and the branch instruction information recorded in the FTB entry, which can be obtained from the update channel.

The update of the two-bit saturating counter also requires two cycles:

  • In the first cycle, calculate which two-bit saturating counter corresponding to the conditional branch instruction in the slot needs to be updated, which needs to meet the following conditions:
    • The current branch instruction slot is valid and contains a conditional branch instruction.
    • The current branch instruction slot is not marked as always_taken. (Because after being marked as always_taken, the result of the two-bit saturating counter will not be used.)
    • The current branch instruction slot is not after the branch instruction slot where an actual jump occurred.
  • In the second cycle, update the saturating counter based on the mask generated in the first cycle and the br_taken_mask information in the update channel.
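The counter update itself follows the standard saturating scheme; a generic sketch (not the verbatim source):

// increment on taken, decrement on not-taken, saturating at both bounds
def satUpdate(ctr: UInt, width: Int, taken: Bool): UInt = {
  val max = ((1 << width) - 1).U
  Mux(taken,
      Mux(ctr === max, ctr, ctr + 1.U),
      Mux(ctr === 0.U, ctr, ctr - 1.U))
}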

Interface List

| Signal Type | Width | Signal Name | Description |
| --- | --- | --- | --- |
| input | | clock | Input clock |
| input | | reset | Reset signal |
| input | [35:0] | io_reset_vector | Value used to reset s1_pc_dup_0 on reset |
| input | [40:0] | io_in_bits_s0_pc_0 | Input s0_pc, copy 0 |
| input | [40:0] | io_in_bits_s0_pc_1 | Same as above, copy 1 |
| input | [40:0] | io_in_bits_s0_pc_2 | Same as above, copy 2 |
| input | [40:0] | io_in_bits_s0_pc_3 | Same as above, copy 3 |
| output | [40:0] | io_out_s1_pc_0 | Output s1_pc, copy 0 |
| output | [40:0] | io_out_s1_pc_1 | Same as above, copy 1 |
| output | [40:0] | io_out_s1_pc_2 | Same as above, copy 2 |
| output | [40:0] | io_out_s1_pc_3 | Same as above, copy 3 |
| output | | io_out_s1_full_pred_0_br_taken_mask_0 | Whether slot 0 is predicted taken |
| output | | io_out_s1_full_pred_0_br_taken_mask_1 | Whether slot 1 is predicted taken |
| output | | io_out_s1_full_pred_0_slot_valids_0 | Whether slot 0 is enabled |
| output | | io_out_s1_full_pred_0_slot_valids_1 | Whether slot 1 is enabled |
| output | [40:0] | io_out_s1_full_pred_0_targets_0 | Jump target address corresponding to slot 0 |
| output | [40:0] | io_out_s1_full_pred_0_targets_1 | Jump target address corresponding to slot 1 |
| output | [3:0] | io_out_s1_full_pred_0_offsets_0 | Offset of the slot-0 branch instruction relative to the start PC of the prediction block |
| output | [3:0] | io_out_s1_full_pred_0_offsets_1 | Offset of the slot-1 branch instruction relative to the start PC of the prediction block |
| output | [40:0] | io_out_s1_full_pred_0_fallThroughAddr | End address of the prediction block |
| output | | io_out_s1_full_pred_0_is_br_sharing | Whether slot 1 (the unconditional-jump slot) is shared by a conditional branch instruction |
| output | | io_out_s1_full_pred_0_hit | Whether the FTB entry hit |
| output | | io_out_s1_full_pred_1_br_taken_mask_0 | Copy 1 of io_out_s1_full_pred_0, analogous to io_out_s1_pc_1 |
| output | | io_out_s1_full_pred_1_br_taken_mask_1 | |
| output | | io_out_s1_full_pred_1_slot_valids_0 | |
| output | | io_out_s1_full_pred_1_slot_valids_1 | |
| output | [40:0] | io_out_s1_full_pred_1_targets_0 | |
| output | [40:0] | io_out_s1_full_pred_1_targets_1 | |
| output | [3:0] | io_out_s1_full_pred_1_offsets_0 | |
| output | [3:0] | io_out_s1_full_pred_1_offsets_1 | |
| output | [40:0] | io_out_s1_full_pred_1_fallThroughAddr | |
| output | | io_out_s1_full_pred_1_is_br_sharing | |
| output | | io_out_s1_full_pred_1_hit | |
| output | | io_out_s1_full_pred_2_br_taken_mask_0 | Same as above, copy 2 |
| output | | io_out_s1_full_pred_2_br_taken_mask_1 | |
| output | | io_out_s1_full_pred_2_slot_valids_0 | |
| output | | io_out_s1_full_pred_2_slot_valids_1 | |
| output | [40:0] | io_out_s1_full_pred_2_targets_0 | |
| output | [40:0] | io_out_s1_full_pred_2_targets_1 | |
| output | [3:0] | io_out_s1_full_pred_2_offsets_0 | |
| output | [3:0] | io_out_s1_full_pred_2_offsets_1 | |
| output | [40:0] | io_out_s1_full_pred_2_fallThroughAddr | |
| output | | io_out_s1_full_pred_2_is_br_sharing | |
| output | | io_out_s1_full_pred_2_hit | |
| output | | io_out_s1_full_pred_3_br_taken_mask_0 | Same as above, copy 3 |
| output | | io_out_s1_full_pred_3_br_taken_mask_1 | |
| output | | io_out_s1_full_pred_3_slot_valids_0 | |
| output | | io_out_s1_full_pred_3_slot_valids_1 | |
| output | [40:0] | io_out_s1_full_pred_3_targets_0 | |
| output | [40:0] | io_out_s1_full_pred_3_targets_1 | |
| output | [3:0] | io_out_s1_full_pred_3_offsets_0 | |
| output | [3:0] | io_out_s1_full_pred_3_offsets_1 | |
| output | [40:0] | io_out_s1_full_pred_3_fallThroughAddr | |
| output | | io_out_s1_full_pred_3_fallThroughErr | The pftAddr recorded in the FTB entry is wrong |
| output | | io_out_s1_full_pred_3_is_br_sharing | |
| output | | io_out_s1_full_pred_3_hit | |
| output | [222:0] | io_out_last_stage_meta | Last-stage meta information; io_out_last_stage_meta = {213'h0, resp_meta_pred_way_r_1, resp_meta_hit_r_1} |
| input | | io_ctrl_ubtb_enable | Controls whether the uBTB is enabled |
| input | | io_s0_fire_0 | Input s0_fire_0; related to the clock gating of io_out_s1_pc_0 <= io_in_bits_s0_pc_0 |
| input | | io_s0_fire_1 | Input s0_fire_1 |
| input | | io_s0_fire_2 | Input s0_fire_2 |
| input | | io_s0_fire_3 | Input s0_fire_3 |
| input | | io_s1_fire_0 | Input s1_fire_0 |
| input | | io_s2_fire_0 | Input s2_fire_0 |
| input | | io_update_valid | Update valid |
| input | [40:0] | io_update_bits_pc | PC of the prediction block being updated (identifies which block to update) |
| input | [3:0] | io_update_bits_ftb_entry_brSlots_0_offset | Offset of the slot-0 branch instruction relative to the start PC of the prediction block |
| input | [11:0] | io_update_bits_ftb_entry_brSlots_0_lower | Low bits of the jump target address |
| input | [1:0] | io_update_bits_ftb_entry_brSlots_0_tarStat | Whether the high bits of the PC carry or borrow after the jump |
| input | | io_update_bits_ftb_entry_brSlots_0_valid | Whether the slot is enabled |
| input | [3:0] | io_update_bits_ftb_entry_tailSlot_offset | Offset of the slot-1 branch instruction relative to the start PC of the prediction block |
| input | [19:0] | io_update_bits_ftb_entry_tailSlot_lower | Low bits of the jump target address |
| input | [1:0] | io_update_bits_ftb_entry_tailSlot_tarStat | Whether the high bits of the PC carry or borrow after the jump |
| input | | io_update_bits_ftb_entry_tailSlot_sharing | A conditional branch instruction is stored in the unconditional-jump slot |
| input | | io_update_bits_ftb_entry_tailSlot_valid | Whether the slot is enabled |
| input | [3:0] | io_update_bits_ftb_entry_pftAddr | Partial fallthrough address: the address reached by sequential execution if no jump occurs in the block, i.e. the end address of the prediction block |
| input | | io_update_bits_ftb_entry_carry | Whether pc + pftAddr produces a carry |
| input | | io_update_bits_ftb_entry_always_taken_0 | Whether predicted as always taken |
| input | | io_update_bits_ftb_entry_always_taken_1 | Whether predicted as always taken |
| input | | io_update_bits_br_taken_mask_0 | Whether the branch is taken |
| input | | io_update_bits_br_taken_mask_1 | Whether the branch is taken |

3.3 - TAGE-SC Branch Predictor

Introduction

TAGE-SC is the primary predictor for conditional branches in the Kunming Lake architecture, classified as an Accurate Predictor (APD). TAGE-SC can be seen as two relatively independent components: the prediction part TAGE and the verification part SC.

The Tagged Geometric History Length Predictor (TAGE) utilizes multiple prediction tables with different history lengths to exploit extensive branch history information. TAGE predicts whether a branch instruction will be taken or not taken. It consists of a base prediction table and multiple history tables. It first predicts using multiple history tables. If there is no prediction, it uses the prediction from the base table. The Statistical Corrector (SC) is a statistical corrector. SC references the prediction results of TAGE and statistical bias results. Based on these two results, it corrects the final prediction result.

In Kunming Lake, each prediction block can have up to 2 branch instructions, so TAGE can predict up to 2 conditional branch instructions simultaneously. When accessing the various history tables in TAGE, the starting address of the prediction block is used as the PC, and two prediction results are retrieved based on the same global history.

TAGE Branch Predictor in Kunming Lake

Basic Functionality


The core idea of the TAGE predictor is to provide prediction results with different history lengths and select the most appropriate result for feedback. In the TAGE predictor, there are a total of 1+N history record tables, where N is a configurable option. In Kunming Lake, N=4.

The T0 table serves as the base predictor. During prediction, it directly looks up the 2-bit saturating counter, which encodes the jump history, corresponding to the address in the T0 table, and then predicts from that history. Each T0 entry has only 2 bits, so the history states it can record are limited.

The other tables are denoted Tn. Looking up a Tn table requires not only the PC but also the global jump history H. On a match, the prediction (taken or not taken) is given by a 3-bit saturating counter. The larger n is, the longer the history the Tn table uses: if Tn uses the low x bits and Tn+1 the low y bits of the history, then x < y.

For each prediction, TAGE selects the table entry with the longest global jump history among all the hit Tn entries.

  1. If the table entry exists and the prediction result confidence is high, it is used as the final prediction result.
  2. If the confidence is low, another internal counter is used to determine whether to select that entry or T0 as the final prediction.

To save space, when retrieving the Tn table, the input jump history information H needs to be compressed, a process also known as history folding.

The table entries of each prediction table include the following elements:

  1. T0 table indexed directly by PC
    1. 2-bit pred unsigned saturated counter (indicating prediction direction and confidence level)
  2. Tn table indexed by XOR of PC and folded global history
    1. 1-bit valid bit
    2. 3-bit pred unsigned saturated counter
    3. 8-bit tag (used for verifying whether the hit is intentional, not coincidental)
    4. 1-bit useful bit for controlling expiration

For a prediction block, all tables may generate prediction results, so a selection is required. Typically, the higher the number of the Tn table, the higher its priority.

Pipeline

TAGE contains two pipeline stages: the first stage calculates the index, and the second stage reads the results from the SRAM table.

  1. Stage 0 (s0): Input to the first pipeline stage, usually the pc and folded history. First pipeline stage operation: Calculate the index. Output to s1 via registers.
  2. Stage 1 (s1): Input to the second pipeline stage, consisting of the index and other calculated data from the first stage. Second pipeline stage operation: Memory access to SRAM, reading the prediction result. Output to s2 via registers.
  3. Stage 2 (s2): Holds the actual prediction result. TAGE thus takes two stages to predict, and the result is available for use in the third stage.

Data Structures

  1. In Kunming Lake's implementation, the T0 and Tn table structures are as follows:

| Predictor | Purpose | Entry composition | Number of entries |
| --- | --- | --- | --- |
| Base predictor T0 | Outputs the prediction result when the results of all other predictors are invalid | 2-bit ctr saturating counter; the MSB decides the jump direction | 2 ways of 2048 entries each; each way corresponds to one branch instruction |
| Prediction tables T1-T4 | For each prediction-block input, every Tn table predicts; among all valid results, the one with the longest history is selected as the final result. The history length is determined by the input H | 1-bit valid; 3-bit ctr saturating counter; 8-bit tag for hit verification; 1-bit us as the usefulness counter | 4096 entries; odd entries correspond to the first branch instruction, even entries to the second |

For each table Tn, the length of its input “global branch history data H” varies during querying. Assuming the total prediction history length is S, Tn and Tn+1 may use the low x, low y bits of S (the lower bits are the newer history) as query inputs. Generally, the larger the n of table Tn, the longer the history information used, i.e., x<y.

During a Tn lookup, because the history data H is compressed, PC1^H1 may collide with PC2^H2 (similar to a hash collision), indexing invalid data (a prediction for PC1 reads PC2's entry). Therefore, TAGE adds a tag to each table entry; the Kunming Lake implementation uses an 8-bit tag to reduce the collision probability. The tag is computed differently from the index, and a lookup result is valid only when the stored tag equals the computed tag.

In the Tn table entry, in addition to the saturation counter ctr and tag, there is also a 1-bit usefulness counter. When this counter is 0, it is a weak entry, indicating that the entry can be reallocated for other uses; when it is not 0, it is a strong entry, indicating that the entry cannot be reallocated for other uses.

To avoid the situation where all usefulness bits are 1 and no new entry can be allocated, TAGE uses the bankTickCtrs counter to periodically clear all usefulness bits to 0.

Retrieval Method for T0 and Tn Tables

  • For table T0, indexing is done using the PC[11:1] bits to index 2048 table entries, so for T0, there is no possibility of not finding a match.
  • For table Tn, in addition to PC[11:1], retrieval also depends on the global branch history. In Kunming Lake, the top-level branch predictor maintains a 256-bit global history record GH, and the most recent n bits of GH are folded down to the x bits a sub-predictor requires: the n bits are divided into ceil(n/x) chunks of length x, which are then XOR-ed together bitwise. The folded history is denoted FH (Folded History); the specific process can be found in the [Branch Folding History section](../00_bpu_top/#Branch Folding History). When the TAGE predictor looks up a Tn entry, it uses an index and a tag, calculated as follows:

Calculation formula:
index = FH ^ (low bits of (pc >> 1))
tag = FH1 ^ FH2 ^ (low bits of (pc >> 1))

Here FH, FH1, and FH2 are the global branch history folded according to different rules; for each Tn, FH, FH1, and FH2 have their own folded lengths, which need not be equal. In the Kunming Lake implementation, the Tn tables are configured as follows:

| Table | FH length | FH1 length | FH2 length | Recent history length (bits of GH used) |
| --- | --- | --- | --- | --- |
| T1 | 8 bits | 8 bits | 7 bits | Low 8 bits: the newest 8 history bits are folded into FH, FH1, FH2 |
| T2 | 11 bits | 8 bits | 7 bits | Low 13 bits: the newest 13 history bits are folded into FH, FH1, FH2 |
| T3 | 11 bits | 8 bits | 7 bits | Low 32 bits: the newest 32 history bits are folded into FH, FH1, FH2 |
| T4 | 11 bits | 8 bits | 7 bits | Low 119 bits: the newest 119 history bits are folded into FH, FH1, FH2 |

Note: pc >> 1 is used because the RISC-V C (compressed) extension makes instructions 2-byte aligned, so the lowest bit of the PC is always 0 and carries no information.
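A sketch of the index/tag computation per the formula above (idxBits, tagBits, and the FH operand widths are illustrative names; the actual implementation may differ in details, e.g. by shifting FH2 before the XOR):

val idx = fh ^ (pc >> 1)(idxBits - 1, 0)
val tag = fh1 ^ fh2 ^ (pc >> 1)(tagBits - 1, 0)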

Alternative Predictor

Since the Tn tables use saturating counters for prediction, the output may be "not confident": in Kunming Lake, for a 3-bit saturating counter, both 100 and 011 indicate a weak prediction. To provide a reference in this state, the TAGE predictor also has an "alternative predictor" mechanism, which decides, when a Tn table predicts with low confidence, whether to select the Tn result or the T0 result.

In the Kunming Lake implementation, the alternative predictor is based on the useAltOnNaCtrs register group: two ways of 128 4-bit saturating counters each, initialized to 0b1000. When TAGE makes a prediction, it uses PC(7,1) to index the corresponding counter. If the counter's value is greater than or equal to the preset threshold and the Tn prediction is not confident, the T0 result is selected; otherwise, the Tn result is selected.

Prediction Process

In summary, the prediction steps of the TAGE predictor in Kunming Lake are as follows:

  1. Parallel indexing of T0 and Tn tables, selecting which table to use based on the hit result:
    1. If a match to the tag of a Tn table is found, the potential prediction result is given by the saturation counter of the longest history Tn table.
    2. If no match to a Tn table is found, the final prediction result is given by the saturation counter of the T0 table.
  2. If the potential prediction result of the matched Tn table is a weak prediction (100,011), and the value of the corresponding 4-bit counter in the alternative prediction for PC is greater than or equal to a threshold, the result of the T0 table is used as the final result; otherwise, the prediction result of the Tn table is used as the final prediction result.
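A sketch of this final selection (tnHit, tnCtr, t0Ctr, and threshold are illustrative names; each counter's MSB encodes the direction):

// fall back to T0 when the longest-history Tn hit is weak and the
// alternative-predictor counter says T0 tends to be more reliable here
val useAlt = useAltOnNaCtrs(pc(7, 1)) >= threshold
val tnWeak = tnCtr === "b100".U || tnCtr === "b011".U
val taken  = Mux(tnHit && !(tnWeak && useAlt), tnCtr(2), t0Ctr(1))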

Training Process

Since the prediction process of TAGE involves many counters and tags, they need to be updated according to certain rules, a process known as training. This training process occurs in the BPU’s update stage, where the PC, branch history, and prediction correctness information are input. The training process for branch prediction in Kunming Lake is divided into several steps based on different conditions:

  1. Update when T0 is the final prediction result: If a jump occurs (i.e., taken), increment the ctr saturation counter indexed by the pc; otherwise, decrement it.
  2. When only T0 is hit, the following operations are performed:
    1. If T0 is predicted correctly, no additional update is performed.
    2. If T0 is predicted incorrectly, attempt to randomly allocate a new table entry in a Tn table. To allocate a new table entry, the original entry’s usefulness at the corresponding index must be 0. The new entry is initialized as a weak prediction with usefulness 0, and its tag is set to the newly calculated tag.
  3. When both T0 and Tn are hit, the following operations are performed:
    1. Tn is always updated: If a jump occurs, increment the ctr saturation counter indexed by the pc; otherwise, decrement it. It is important to note that “hit” means that the tag of the indexed entry matches the calculated tag.
    2. If T0 and Tn produce the same result:
      1. If predicted correctly, no additional update is performed.
      2. If predicted incorrectly, attempt to allocate a new table entry in a table with a longer history than Tn. To allocate a new entry, the usefulness of the original entry at the corresponding index must be 0. The new entry is initialized as a weak prediction with usefulness 0, and its tag is set to the tag calculated using the new history information.
    3. If T0 and Tn produce different results:
      1. If Tn is correct, the entry’s usefulness is incremented.
        1. If the result is still a weak prediction, the counter in the alternative prediction for T0 is decremented.
      2. If Tn is incorrect, the entry’s usefulness is decremented, and a new entry is allocated in a table with a longer history than Tn, as in 3.2.2.
        1. If the result is still a weak prediction, the counter in the alternative prediction for T0 is incremented.
  4. When a new entry needs to be allocated, a dynamic reset of the usefulness flags is performed (see the sketch after this list).
    1. Using a 7-bit bankTickCtrs register, calculate:
      1. a, the number of allocatable tables (history length longer than the current table's, with usefulness 0 at the corresponding index)
      2. b, the number of unallocatable tables (history length longer than the current table's, with usefulness not 0 at the corresponding index)
    2. Update bankTickCtrs += Δ as a saturating counter, where Δ = b - a.
    3. When bankTickCtrs reaches its maximum value, reset all usefulness flags to 0.
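
A minimal C sketch of this dynamic reset, under the assumption that bankTickCtrs behaves as an ordinary saturating counter and, mirroring the ITTAGE tickCtr described later, is cleared together with the usefulness flags:

#include <string.h>
#include <stdint.h>

#define TICK_BITS 7
#define TICK_MAX  ((1 << TICK_BITS) - 1)

static uint8_t bankTickCtrs; /* 7-bit saturating counter */

/* a = number of allocatable longer-history tables (usefulness == 0),
 * b = number of unallocatable ones (usefulness != 0).               */
void update_tick(int a, int b, uint8_t *usefulness, int n_entries) {
    int delta = b - a;                       /* delta = b - a         */
    int v = (int)bankTickCtrs + delta;
    if (v < 0) v = 0;                        /* saturate at both ends */
    if (v > TICK_MAX) v = TICK_MAX;
    bankTickCtrs = (uint8_t)v;
    if (bankTickCtrs == TICK_MAX) {          /* saturated: global reset */
        memset(usefulness, 0, (size_t)n_entries);
        bankTickCtrs = 0;                    /* assumed, as in ITTAGE  */
    }
}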

Kunming Lake SC Branch Predictor

Basic Function Introduction

The SC (Statistical Corrector) branch predictor is based on statistical history information. Similar to TAGE, SC has multiple tables Tn, each corresponding to jump statistics over a different history length. The difference is that when predicting with the PC, SC accesses every table Tn, sums the saturation counters read from the hit entries into a total, and decides whether to jump based on that total. SC uses “signed saturated counters”: a total greater than 0 predicts a jump, less than 0 predicts no jump, and the larger the absolute value, the higher the prediction confidence.

The SC predictor is likewise composed of multiple tables (T1, T2, T3, T4), but unlike TAGE it has no base prediction table T0. The Tn tables in SC hold 6-bit signed saturated counters. SC tables are indexed as follows:

Calculation Method
Index = FH ^ (low bits of (pc >> 1))

For each table, the number of entries and the folded history length used are as follows:

Table Number of Entries FH Length Folded History Range
T1 512 0 no folding
T2 512 4 the low 4 bits of history are folded into FH
T3 512 8 the low 10 bits of history are folded into FH
T4 512 8 the low 16 bits of history are folded into FH

The formula for calculating the total statistical prediction result is as follows:

$$scCtrSum=\sum_{i=0}^{3}((ctr_{sc,i} << 1)+1)$$

Where ctr_{sc,i} is the signed saturated counter read from table i. The left shift and the added one perform weight adjustment. The accumulated scCtrSum is SC's final prediction result: greater than zero predicts a jump, less than zero predicts no jump, and a larger absolute value means higher confidence.

Typical data conversion results are as follows (extended to 9 bits to prevent overflow during the calculation; the short C program after this list reproduces them):

  1. All are 6b100000 (strong no-jump), resulting in 9b100000100, with a value of -252.
  2. All are 6b011111 (strong jump), resulting in 9b011111100, with a value of 252.
  3. All are 6b000000 (weak jump), resulting in 9b000000100, with a value of 4.
  4. All are 6b111111 (weak no-jump), resulting in 9b111111100, with a value of -4.
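
The arithmetic can be checked with a few lines of C. The 6-bit counters are modeled as signed integers in [-32, 31], and multiplication stands in for the hardware left shift (shifting a negative value is implementation-defined in C):

#include <stdio.h>

/* Sum four 6-bit signed SC counters, each weighted as (ctr << 1) + 1. */
int sc_ctr_sum(const int ctr[4]) {
    int sum = 0;
    for (int i = 0; i < 4; i++)
        sum += ctr[i] * 2 + 1;   /* models (ctr << 1) + 1 */
    return sum;                  /* fits in 9 bits: [-252, 252] */
}

int main(void) {
    int strong_nt[4] = {-32, -32, -32, -32}; /* 6b100000 */
    int strong_t[4]  = { 31,  31,  31,  31}; /* 6b011111 */
    int weak_t[4]    = {  0,   0,   0,   0}; /* 6b000000 */
    int weak_nt[4]   = { -1,  -1,  -1,  -1}; /* 6b111111 */
    printf("%d %d %d %d\n", sc_ctr_sum(strong_nt), sc_ctr_sum(strong_t),
           sc_ctr_sum(weak_t), sc_ctr_sum(weak_nt)); /* -252 252 4 -4 */
    return 0;
}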

Prediction Process

  1. Calculate the index of table Tn using the PC and historical information.
  2. Query the index to obtain the saturation counters for all tables.
  3. Sum up all the saturation counters obtained from all tables to get the final prediction result (take a jump for values greater than 0, no jump for values less than 0).

Training Process

Update the saturation counters during the update phase.

  1. If the real instruction corresponding to PC jumps, increment the saturation counters corresponding to all tables.
  2. If the real instruction corresponding to PC does not jump, decrement the saturation counters corresponding to all tables.

Kunming Lake TAGE-SC Branch Predictor

Why SC is Needed with TAGE

In some applications, certain branch behaviors correlate only weakly with branch history or path, and instead exhibit a statistical bias toward one direction. For these branches, capturing the bias with counters is more effective than history-based branch prediction.

TAGE is very effective at predicting branches that correlate strongly with history, but it performs poorly on statistically biased branches, for example branches slightly biased in one direction yet not strongly correlated with the history path. To address this, an SC predictor can be added alongside the traditional TAGE predictor.

TAGE-SC Functionality

In the Kunming Lake TAGE-SC predictor, the TAGE and SC prediction results P1 and P2 are obtained simultaneously and then summed: P = P1 + P2. If the absolute value of P is greater than the 8-bit threshold sc_bank_thres, P is used as the final prediction result; otherwise, P1 is used.

For dynamic adaptation, the threshold sc_bank_thres must change over time. The implementation therefore uses a 5-bit counter sc_bank_ctr to adjust sc_bank_thres. Additionally, since Kunming Lake predicts 2 branch instructions simultaneously, the threshold register and its control counter are duplicated, one per branch slot.

Pipeline

The TAGE-SC predictor contains 3 pipeline stages. The 2-stage pipeline of TAGE has been introduced; the pipeline of the SC part is as follows:

  1. Stage 0: Read the PC and folded history into s0; calculate the index from pc and FH to obtain s0_idx.

  2. Stage 1: Read s0_idx from s0; look up the counter data for s1_idx in SCTable and output it to s1_scResps.

  3. Stage 2: Read s1_scResps from s1; based on s2_scResps, decide whether to invert the prediction result, and output to s2_disagree.

  4. Stage 3: Read the result from s2_disagree as s3_disagree.

Prediction Process

In TAGE-SC prediction, the prediction result P1 of TAGE is represented by tage_ctr, and the prediction result P2 of SC is represented by scCtrSum. The prediction is divided into four steps:

  1. Execute the SC predictor to get the prediction result scCtrSum.

  2. Simultaneously obtain the prediction result tage_ctr of the TAGE predictor.

    1. Since the prediction result of TAGE is an unsigned saturation counter, and the prediction result of SC is a signed saturation counter, if they are added together, data conversion is required.

    2. Kunming Lake adopts a conversion for the result of TAGE. The converted result is represented by tageCtrCentered, and the specific conversion process is as follows:

      $$tageCtrCentered=((((ctr_{tage} -4)<<1)+1)<<3) $$
    3. Conversion of a 3-bit unsigned saturation counter to an 8-bit signed saturation counter result is illustrated as follows:

      • 3b100 Weak jump => 8b00001000 = 8

      • 3b011 Weak non-jump => 8b11111000 = -8

      • 3b111 Strong jump => 8b00111000 = 56

      • 3b000 Strong non-jump => 8b11001000 = -56

  3. Add the prediction results of TAGE and SC to get the final prediction result P, represented by totalSum.

$$totalSum = scCtrSum + tageCtrCentered$$
  4. Determine the final prediction direction based on totalSum and sc_bank_thres (see the sketch after this list):

    1. Predict taken if totalSum > 0 and its absolute value exceeds the threshold: the check scCtrSum > sc_bank_thres - tageCtrCentered is equivalent to totalSum > sc_bank_thres, but reduces the bit width needed to rule out overflow from 10 bits to 9 bits.
    2. Predict not taken if totalSum < 0 and its absolute value exceeds the threshold: the check scCtrSum < -sc_bank_thres - tageCtrCentered is equivalent to totalSum < -sc_bank_thres, i.e., |totalSum| > sc_bank_thres.
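
Under the stated conversion, the whole combination fits in a few lines of C. This is a hedged sketch with a single threshold value, using multiplication in place of the hardware shifts; names follow the text, everything else is assumed.

#include <stdbool.h>

/* Convert a 3-bit unsigned TAGE counter (0..7) to a signed, centered
 * value in [-56, 56]; multiplication models the hardware shifts. */
int tage_ctr_centered(unsigned ctr_tage) {
    return (((int)ctr_tage - 4) * 2 + 1) * 8;
}

/* Final TAGE-SC direction: use the combined sum P only when it clears
 * the threshold; otherwise fall back to the TAGE direction P1. */
bool tage_sc_taken(int sc_ctr_sum, unsigned ctr_tage, int sc_bank_thres) {
    int total = sc_ctr_sum + tage_ctr_centered(ctr_tage); /* totalSum */
    if (total > sc_bank_thres)
        return true;                 /* confidently taken     */
    if (total < -sc_bank_thres)
        return false;                /* confidently not taken */
    return ctr_tage >= 4;            /* below threshold: P1   */
}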

Training Process

After combining TAGE and SC, TAGE-SC adds an sc_bank_ctr counter to control the threshold sc_bank_thres. Therefore, during training, in addition to the training of TAGE and SC themselves, the newly added counter needs to be updated.

During the update phase, the specific update process is as follows:

  1. If TAGE-SC used the combined result P (i.e., the prediction after TAGE + SC): when |totalSum| is in the range [sc_bank_thres - 4, sc_bank_thres - 2], update the threshold-related registers (see the sketch after this list).
    1. Update the saturation counter sc_bank_ctr: if the prediction was correct, sc_bank_ctr += 1; if incorrect, sc_bank_ctr -= 1.
    2. Update sc_bank_thres with bounded saturation: if the updated sc_bank_ctr reaches 0b11111 and sc_bank_thres <= 31, then sc_bank_thres += 2; if the updated sc_bank_ctr is 0 and sc_bank_thres >= 6, then sc_bank_thres -= 2; in all other cases sc_bank_thres is unchanged.
    3. After the sc_bank_thres update is decided, check sc_bank_ctr again: if the updated value is 0b11111 or 0, sc_bank_ctr is reset to the initial value 0b10000.
  2. If TAGE-SC used the result P1 (i.e., the prediction of TAGE alone): no threshold update is performed.
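
The threshold training might be modeled as follows. sc_bank_ctr and sc_bank_thres mirror the names in the text, and the update window and bounds follow the rules above; this is a simplified single-slot model, not the RTL.

#include <stdbool.h>
#include <stdlib.h>

static int sc_bank_ctr   = 0x10; /* 5-bit counter, init 0b10000 */
static int sc_bank_thres = 6;    /* 8-bit threshold             */

void train_threshold(int total_sum, bool prediction_correct) {
    int a = abs(total_sum);
    if (a < sc_bank_thres - 4 || a > sc_bank_thres - 2)
        return;                              /* outside training window */
    /* 1. saturating update of the control counter */
    sc_bank_ctr += prediction_correct ? 1 : -1;
    if (sc_bank_ctr < 0)    sc_bank_ctr = 0;
    if (sc_bank_ctr > 0x1F) sc_bank_ctr = 0x1F;
    /* 2. bounded threshold adjustment */
    if (sc_bank_ctr == 0x1F && sc_bank_thres <= 31)
        sc_bank_thres += 2;
    else if (sc_bank_ctr == 0 && sc_bank_thres >= 6)
        sc_bank_thres -= 2;
    /* 3. re-center the control counter after it saturates */
    if (sc_bank_ctr == 0x1F || sc_bank_ctr == 0)
        sc_bank_ctr = 0x10;
}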

Interface List

TageSC

Signal Type Width Signal Name Description
input * clock
input * reset
input *[35:0] io_reset_vector Value used to reset s1_pc_dup_0 during reset
input *[40:0] io_in_bits_s0_pc_0 1st copy of the duplicated s0_pc array; PC for the top-level BPU
input *[40:0] io_in_bits_s0_pc_1 2nd copy of s0_pc; PC for Tage
input *[40:0] io_in_bits_s0_pc_3 4th copy of s0_pc; PC for SC
input *[10:0] io_in_bits_folded_hist_1_hist_17_folded_hist 11-bit folded history used by TageTable 2 (see the earlier table for the history range folded into 11 bits; note that TageTable indices are offset by 1, so TageTable 2 here is T3 in the text above)
input *[10:0] io_in_bits_folded_hist_1_hist_16_folded_hist 11-bit folded history used by TageTable 3
input *[6:0] io_in_bits_folded_hist_1_hist_15_folded_hist 7-bit folded history used by TageTable 1
input *[7:0] io_in_bits_folded_hist_1_hist_14_folded_hist 8-bit folded history used by TageTable 0
input *[6:0] io_in_bits_folded_hist_1_hist_9_folded_hist 7-bit folded history used by TageTable 2
input *[7:0] io_in_bits_folded_hist_1_hist_8_folded_hist 8-bit folded history used by TageTable 3
input *[6:0] io_in_bits_folded_hist_1_hist_7_folded_hist 7-bit folded history used by TageTable 0
input *[6:0] io_in_bits_folded_hist_1_hist_5_folded_hist 7-bit folded history used by TageTable 3
input *[7:0] io_in_bits_folded_hist_1_hist_4_folded_hist 8-bit folded history used by TageTable 1
input *[7:0] io_in_bits_folded_hist_1_hist_3_folded_hist 8-bit folded history used by TageTable 2
input *[10:0] io_in_bits_folded_hist_1_hist_1_folded_hist 11-bit folded history used by TageTable 1
input *[3:0] io_in_bits_folded_hist_3_hist_12_folded_hist 4-bit folded history used by SCTable 1
input *[7:0] io_in_bits_folded_hist_3_hist_11_folded_hist 8-bit folded history used by SCTable 2
input *[7:0] io_in_bits_folded_hist_3_hist_2_folded_hist 8-bit folded history used by SCTable 3
output * io_out_s2_full_pred_0_br_taken_mask_0 io_out_s2_full_pred_{i}_br_taken_mask_{j}: TAGE prediction result for the j-th branch instruction in the prediction block, output at the s2 stage and duplicated 4 times (arguably misnamed as a mask)
output * io_out_s2_full_pred_0_br_taken_mask_1
output * io_out_s2_full_pred_1_br_taken_mask_0
output * io_out_s2_full_pred_1_br_taken_mask_1
output * io_out_s2_full_pred_2_br_taken_mask_0
output * io_out_s2_full_pred_2_br_taken_mask_1
output * io_out_s2_full_pred_3_br_taken_mask_0
output * io_out_s2_full_pred_3_br_taken_mask_1
output * io_out_s3_full_pred_0_br_taken_mask_0 io_out_s3_full_pred_{i}_br_taken_mask_{j}: SC prediction result for the j-th branch instruction in the prediction block, output at the s3 stage and duplicated 4 times
output * io_out_s3_full_pred_0_br_taken_mask_1
output * io_out_s3_full_pred_1_br_taken_mask_0
output * io_out_s3_full_pred_1_br_taken_mask_1
output * io_out_s3_full_pred_2_br_taken_mask_0
output * io_out_s3_full_pred_2_br_taken_mask_1
output * io_out_s3_full_pred_3_br_taken_mask_0
output * io_out_s3_full_pred_3_br_taken_mask_1
output *[222:0] io_out_last_stage_meta See the table below
input * io_ctrl_tage_enable
input * io_ctrl_sc_enable
input * io_s0_fire_0 s0-stage pipeline control; the same signal duplicated several times: copy 0 for the BPU, 1 for Tage, 3 for SC
input * io_s0_fire_1
input * io_s0_fire_3
input * io_s1_fire_0 s1-stage pipeline control
input * io_s1_fire_1
input * io_s1_fire_2
input * io_s1_fire_3
input * io_s2_fire_0 s2-stage pipeline control
input * io_s2_fire_1
input * io_s2_fire_2
input * io_s2_fire_3
output * io_s1_ready All Tage tables are ready to have their read results retrieved
input * io_update_valid Whether the backend execution result (update signal) sent from the FTQ to the BPU is valid
input *[40:0] io_update_bits_pc PC of the (backend-executed) prediction block
input *[10:0] io_update_bits_spec_info_folded_hist_hist_17_folded_hist 11-bit folded history used by TageTable 2; the branch history used at prediction time, passed back unchanged after the round trip
input *[10:0] io_update_bits_spec_info_folded_hist_hist_16_folded_hist 11-bit folded history used by TageTable 3
input *[6:0] io_update_bits_spec_info_folded_hist_hist_15_folded_hist 7-bit folded history used by TageTable 1
input *[7:0] io_update_bits_spec_info_folded_hist_hist_14_folded_hist 8-bit folded history used by TageTable 0
input *[3:0] io_update_bits_spec_info_folded_hist_hist_12_folded_hist 4-bit folded history used by SCTable 1
input *[7:0] io_update_bits_spec_info_folded_hist_hist_11_folded_hist 8-bit folded history used by SCTable 2
input *[6:0] io_update_bits_spec_info_folded_hist_hist_9_folded_hist 7-bit folded history used by TageTable 2
input *[7:0] io_update_bits_spec_info_folded_hist_hist_8_folded_hist 8-bit folded history used by TageTable 3
input *[6:0] io_update_bits_spec_info_folded_hist_hist_7_folded_hist 7-bit folded history used by TageTable 0
input *[6:0] io_update_bits_spec_info_folded_hist_hist_5_folded_hist 7-bit folded history used by TageTable 3
input *[7:0] io_update_bits_spec_info_folded_hist_hist_4_folded_hist 8-bit folded history used by TageTable 1
input *[7:0] io_update_bits_spec_info_folded_hist_hist_3_folded_hist 8-bit folded history used by TageTable 2
input *[7:0] io_update_bits_spec_info_folded_hist_hist_2_folded_hist 8-bit folded history used by SCTable 3
input *[10:0] io_update_bits_spec_info_folded_hist_hist_1_folded_hist 11-bit folded history used by TageTable 1
input * io_update_bits_ftb_entry_brSlots_0_valid Whether the first slot of the FTB entry is valid (stores a branch instruction)
input * io_update_bits_ftb_entry_tailSlot_sharing Whether the last slot of the FTB entry stores a conditional branch rather than an unconditional jump
input * io_update_bits_ftb_entry_tailSlot_valid Whether the last slot of the FTB entry is valid
input * io_update_bits_ftb_entry_always_taken_0 Historically, the slot 0 instruction has always been taken
input * io_update_bits_ftb_entry_always_taken_1 Historically, the slot 1 instruction has always been taken
input * io_update_bits_br_taken_mask_0 Whether slot 0 is taken
input * io_update_bits_br_taken_mask_1 Whether slot 1 is taken
input * io_update_bits_mispred_mask_0 Whether slot 0 was mispredicted
input * io_update_bits_mispred_mask_1 Whether slot 1 was mispredicted
input *[222:0] io_update_bits_meta See the table below

io_out_last_stage_meta

Design involvement is needed to optimize this part!

Signal Type Bits Signal Name Description
output [218:88] 0 Placeholder, all zeros; ignored when passed to the composer
87 resp_meta_providers_1_valid_r
[86:85] resp_meta_providers_1_bits_r
84 resp_meta_providers_0_valid_r
[83:82] resp_meta_providers_0_bits_r
[81:79] resp_meta_providerResps_1_r_ctr
78 resp_meta_providerResps_1_r_u
77 resp_meta_providerResps_1_r_unconf
[76:74] resp_meta_providerResps_0_r_ctr
73 resp_meta_providerResps_0_r_u
72 resp_meta_providerResps_0_r_unconf
71 resp_meta_altUsed_1_r
70 resp_meta_altUsed_0_r
69 resp_meta_altDiffers_1_r
68 resp_meta_altDiffers_0_r
[67:66] resp_meta_basecnts_1_r
[65:64] resp_meta_basecnts_0_r
[63:60] resp_meta_allocates_1_r
[59:56] resp_meta_allocates_0_r
55 resp_meta_takens_1_r
54 resp_meta_takens_0_r
53 resp_meta_scMeta_tageTakens_1_r
52 resp_meta_scMeta_tageTakens_0_r
51 resp_meta_scMeta_scUsed_1_r
50 resp_meta_scMeta_scUsed_0_r
49 resp_meta_scMeta_scPreds_1_r
48 resp_meta_scMeta_scPreds_0_r
[47:42] r_1_3 Value of the 4th sc_ctr of way 2 in scMeta (the state at prediction time)
[41:36] r_1_2 Value of the 3rd sc_ctr of way 2 in scMeta
[35:30] r_1_1 Value of the 2nd sc_ctr of way 2 in scMeta
[29:24] r_1_0 Value of the 1st sc_ctr of way 2 in scMeta
[23:18] r_3 Value of the 4th sc_ctr of way 1 in scMeta
[17:12] r_2 Value of the 3rd sc_ctr of way 1 in scMeta
[11:6] r_1 Value of the 2nd sc_ctr of way 1 in scMeta
[5:0] r_0 Value of the 1st sc_ctr of way 1 in scMeta

io_update_bits_meta

Signal Type Bits Signal Name Description
input [218:94] META information passed from the FTB, ITTAGE, and RAS modules to the FTQ; ignored here
[93:6] io_out_last_stage_meta[87:0], offset by 6 bits; the META that TAGE outputs to the FTQ
[5:0] META that uFTB outputs to the FTQ

3.4 - FTB Branch Predictor

Introduction to FTB

FTB is the third sub-predictor of the Xiangshan BPU; it also receives the combined outputs of uFTB and TAGE-SC. In FTB's input interface, the s1 channel carries the basic prediction results from uFTB, while the s2 and s3 channels have been filled by TAGE-SC with only one group of signals, br_taken_mask, and lack the basic prediction results generated from an FTB entry. The function of FTB is to provide these basic prediction results for the s2 and s3 channels.

In terms of functionality and structure, FTB is similar to uFTB. The main difference is that FTB can accommodate more FTB entries, and the prediction results of FTB are output in the s2 and s3 channels. Due to its large capacity, the readout speed of FTB is slower than that of uFTB, and it cannot be placed in the first cycle to generate prediction results. However, the large capacity enables it to obtain more accurate prediction results.

Function of FTB

  • Cache more FTB entries and provide basic prediction results for the s2 and s3 channels. The FTB predictor is essentially a large-capacity storage: it reads the FTB entry corresponding to the current predicted PC and outputs it in the s2 stage. The entry is then held for one more cycle to generate the s3-stage prediction result. Note that the br_taken_mask field provided by the upstream predictor must be carried through so it is not lost when the results are generated.
  • Update FTB entries based on update requests.

FTB Storage Structure

FTB entries in the FTB predictor are placed in a dedicated storage structure called FTBBank. Before further examining the structure of FTBBank, let’s first see how FTBBank is used.

FTB Read Request

The read request interface of FTBBank is as follows:

  • req_pc Requested PC
    • Interface type: Flipped(DecoupledIO(UInt(VAddrBits.W)))
  • read_resp Read out FTB entry
    • Interface type: FTBEntry
  • read_hits Which way (row) is hit
    • Interface type: Valid(UInt(log2Ceil(numWays).W))

Among these, the req_pc interface is Decoupled, meaning it carries valid and ready signals. FTB needs the PC before the s1 stage starts, so s0_pc is sent to the req_pc interface, the s0_fire signal is connected to req_pc's valid, and its ready signal is connected to the pipeline control signal s1_ready.

After s0_fire, the request enters the s1 stage. In the next cycle, when s1_fire is asserted, FTBBank has already output the read FTB entry on the read_resp interface and calculated read_hits. However, because the read has consumed too much of the cycle, the result cannot be output in the s1 stage. It is therefore saved in an internal register and read back in the s2 and s3 stages to generate the prediction results.

FTBBank

FTBBank defines a storage to store all FTB entries. The storage adopts a group-associative structure, with 512 groups (Sets) in total, each group has 4 ways, and can store up to 2048 FTB entries. Besides storing FTB entries, it also stores the tag corresponding to each FTB entry for matching.

Specifically, the tag is defined as pc[29:10], which takes 20 bits from the PC to identify the FTB entry. The PC is divided as follows:

  pc: | ... |<-- tag(20 bits) -->|<-- idx(9 bits) -->|<-- instOffset(1 bit) -->|

When reading, provide the group number (idx) to the storage, read out all ways in that group, and then check if there is a way whose tag matches the current tag. If there is a match, it means a hit, and the readout FTB entry is sent out through the read_resp interface, and the hit way number is sent out through the read_hits interface.
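
A minimal C sketch of this tag/idx split and the 4-way lookup, with the FTBEntry payload elided; the set and way counts follow the text, while the function and type names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define FTB_SETS 512
#define FTB_WAYS 4

typedef struct {
    bool     valid;
    uint32_t tag;       /* pc[29:10], 20 bits */
    /* ... FTBEntry payload elided ... */
} FtbWay;

static FtbWay ftb[FTB_SETS][FTB_WAYS];

/* Returns the hit way index, or -1 on a miss. */
int ftb_lookup(uint64_t pc) {
    uint32_t idx = (uint32_t)(pc >> 1) & (FTB_SETS - 1);    /* pc[9:1]   */
    uint32_t tag = (uint32_t)(pc >> 10) & ((1u << 20) - 1); /* pc[29:10] */
    for (int w = 0; w < FTB_WAYS; w++)
        if (ftb[idx][w].valid && ftb[idx][w].tag == tag)
            return w;
    return -1;
}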

Generation of Prediction Results

As mentioned earlier, the FTB predictor needs to provide the s2 and s3 channels with basic prediction results derived from FTB entries. The FTB entry has been read and saved in the s1 stage; in the s2 and s3 stages it only needs to be read out to generate the prediction results. Note, however, that the br_taken_mask field generated by TAGE-SC must be preserved in the s2 and s3 results, since it provides the precise predictions for conditional branch instructions. The FTB predictor does not change the s1 channel.

The generation of signals in the s2 and s3 prediction results can refer to the following list:

  • hit Whether the FTB entry is hit
    • Generation method: The read_hits signal valid bit from FTBBank is valid.
  • slot_valids Slot valid bit, indicating whether each slot in the ftb entry is valid
  • targets Jump target address corresponding to each slot
  • offsets Instruction offset relative to the start address of the predicted block in each slot
  • is_jal Whether the predicted block contains a jal instruction
  • is_jalr Whether the predicted block contains a jalr instruction
  • is_call Whether the predicted block contains a call instruction
  • is_ret Whether the predicted block contains a ret instruction
  • last_may_be_rvi_call Signal indicating that the end of the predicted block may be an RVI type call instruction
  • is_br_sharing Whether the last slot (tailSlot) stores a conditional branch instruction
    • Generation method: the fields above are exported directly from the corresponding fields in the FTB entry
  • fallThroughErr Error in the pftAddr recorded in the FTB entry
    • Generation method: Compare whether the block end address represented by pftAddr is greater than the start address of the predicted block. If it is not, an error is indicated and this signal is set valid. This situation can occur when the PC indexes a wrong FTB entry.
  • fallThroughAddr End address of the predicted block
    • Generation method: If fallThroughErr is invalid, it is generated according to pftAddr. Otherwise, it is set to the start address + prediction width.
  • br_taken_mask Branch prediction result, each branch (slot) corresponds to a bit, indicating whether the branch is predicted as taken
    • Generation method: Generated based on the always_taken field in the FTB entry and the indication result of the two-bit saturation counter.
  • jalr_target Jump target of jalr in this predicted block
    • Generation method: Jump target in the tailSlot of the FTB entry.

FTB meta

In the third cycle of prediction, the FTB predictor outputs auxiliary information about this prediction to last_stage_meta and also sends the read FTB entry to the last_stage_ftb_entry interface.

The FTB meta contains two pieces of information, hit and writeWay, indicating whether the prediction hits and in which way it is read. Subsequently, the update channel generates the update information for this prediction, and these two pieces of information are also sent to guide the writing of the updated FTB entry.

FTB Update

In the update channel, the pc and the new FTB entry are already provided, along with hit and writeWay from the meta information. If hit in the meta is valid, the FTB entry corresponding to this pc was already stored in the memory, and we only need to write the new entry to the corresponding way.

If it is invalid, no entry was stored at prediction time, but we do not know whether one exists now: before this update request, the FTB entry for this pc may have been written by another update request. Therefore, we still send a read request to FTBBank to check whether a corresponding FTB entry exists. If it does, the write can go directly to that position in the next cycle; otherwise, FTBBank is notified to allocate a new position.

Therefore, the number of cycles required for updating FTB entries depends on the hit situation.

Let’s first look at how FTBBank handles updates.

FTBBank Update

FTBBank’s update interface is divided into two parts, the update read interface and the update write interface.

  • u_req_pc: Update read request pc
    • Flipped(DecoupledIO(UInt(VAddrBits.W)))
  • update_hits: Hit information read out
    • Valid(UInt(log2Ceil(numWays).W))
  • update_access: There is an update request but the meta information indicates a miss
    • Bool()
  • update_pc: Update write request pc
    • UInt(VAddrBits.W)
  • update_write_data: Data to be written in the update request, write when valid
    • Flipped(Valid(new FTBEntryWithTag))
  • update_write_way: Way index to write in the update request
    • UInt(log2Ceil(numWays).W)
  • update_write_alloc: Whether a new FTB entry needs to be allocated (missed before)
    • Bool()

For the update read interface, FTBBank obtains the update read request through u_req_pc signal. This request has a higher priority than the read request during prediction. In the next cycle, FTBBank will output the hit information through the update_hits interface. update_access is only used for some internal status judgments of FTBBank.

For the update write interface, FTBBank obtains the pc of the update write request through the update_pc signal, and when update_write_data is valid, it writes the data into the corresponding position specified by update_write_way. If update_write_alloc is valid, it means that it cannot be directly written to the position specified in the request, but a new position needs to be allocated.

The allocation strategy is as follows:

  • If all ways are filled, use the pseudo LRU replacement algorithm to select the way to replace
  • If there is an empty way, select the empty way.

Update Request Timing

  • Meta hit is valid: If hit in the update request meta is valid, then we only need to specify the address and data to be written according to the information in the update request, and the writing only takes one cycle.
  • Meta hit is invalid: In this case, after receiving the update request, we connect the pc in the request to the read port of FTBBank. The read port will return the result in the next cycle. Due to timing issues, we save this result and use it in the next cycle. Depending on the hit status in the result, we decide whether to set update_write_alloc and send a write request. The entire update process takes three cycles.

Interface List

Signal Type Bits Signal Name Description
input clock Input clock
input reset Reset signal
input [35:0] io_reset_vector Value used to reset s1_pc_dup_0 during reset
input [40:0] io_in_bits_s0_pc_0 Copy 0 of the input s0_pc
input [40:0] io_in_bits_s0_pc_1 Same as above, copy 1
input [40:0] io_in_bits_s0_pc_2 Same as above, copy 2
input [40:0] io_in_bits_s0_pc_3 Same as above, copy 3
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_0 Prediction result input
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_0_br_taken_mask_0 Full prediction result output at the s2 stage
output io_out_s2_full_pred_0_br_taken_mask_1
output io_out_s2_full_pred_0_slot_valids_0
output io_out_s2_full_pred_0_slot_valids_1
output [40:0] io_out_s2_full_pred_0_targets_0
output [40:0] io_out_s2_full_pred_0_targets_1
output [40:0] io_out_s2_full_pred_0_jalr_target
output [3:0] io_out_s2_full_pred_0_offsets_0
output [3:0] io_out_s2_full_pred_0_offsets_1
output [40:0] io_out_s2_full_pred_0_fallThroughAddr
output io_out_s2_full_pred_0_is_br_sharing
output io_out_s2_full_pred_0_hit
output io_out_s2_full_pred_1_br_taken_mask_0 Same as above
output io_out_s2_full_pred_1_br_taken_mask_1
output io_out_s2_full_pred_1_slot_valids_0
output io_out_s2_full_pred_1_slot_valids_1
output [40:0] io_out_s2_full_pred_1_targets_0
output [40:0] io_out_s2_full_pred_1_targets_1
output [40:0] io_out_s2_full_pred_1_jalr_target
output [3:0] io_out_s2_full_pred_1_offsets_0
output [3:0] io_out_s2_full_pred_1_offsets_1
output [40:0] io_out_s2_full_pred_1_fallThroughAddr
output io_out_s2_full_pred_1_is_br_sharing
output io_out_s2_full_pred_1_hit
output io_out_s2_full_pred_2_br_taken_mask_0 Same as above
output io_out_s2_full_pred_2_br_taken_mask_1
output io_out_s2_full_pred_2_slot_valids_0
output io_out_s2_full_pred_2_slot_valids_1
output [40:0] io_out_s2_full_pred_2_targets_0
output [40:0] io_out_s2_full_pred_2_targets_1
output [40:0] io_out_s2_full_pred_2_jalr_target
output [3:0] io_out_s2_full_pred_2_offsets_0
output [3:0] io_out_s2_full_pred_2_offsets_1
output [40:0] io_out_s2_full_pred_2_fallThroughAddr
output io_out_s2_full_pred_2_is_jalr
output io_out_s2_full_pred_2_is_call
output io_out_s2_full_pred_2_is_ret
output io_out_s2_full_pred_2_last_may_be_rvi_call
output io_out_s2_full_pred_2_is_br_sharing
output io_out_s2_full_pred_2_hit
output io_out_s2_full_pred_3_br_taken_mask_0 Same as above
output io_out_s2_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_3_slot_valids_0
output io_out_s2_full_pred_3_slot_valids_1
output [40:0] io_out_s2_full_pred_3_targets_0
output [40:0] io_out_s2_full_pred_3_targets_1
output [40:0] io_out_s2_full_pred_3_jalr_target
output [3:0] io_out_s2_full_pred_3_offsets_0
output [3:0] io_out_s2_full_pred_3_offsets_1
output [40:0] io_out_s2_full_pred_3_fallThroughAddr
output io_out_s2_full_pred_3_fallThroughErr
output io_out_s2_full_pred_3_is_br_sharing
output io_out_s2_full_pred_3_hit
output io_out_s3_full_pred_0_br_taken_mask_0 Full prediction result output at the s3 stage
output io_out_s3_full_pred_0_br_taken_mask_1
output io_out_s3_full_pred_0_slot_valids_0
output io_out_s3_full_pred_0_slot_valids_1
output [40:0] io_out_s3_full_pred_0_targets_0
output [40:0] io_out_s3_full_pred_0_targets_1
output [40:0] io_out_s3_full_pred_0_jalr_target
output [40:0] io_out_s3_full_pred_0_fallThroughAddr
output io_out_s3_full_pred_0_fallThroughErr
output io_out_s3_full_pred_0_is_br_sharing
output io_out_s3_full_pred_0_hit
output io_out_s3_full_pred_1_br_taken_mask_0 Same as above
output io_out_s3_full_pred_1_br_taken_mask_1
output io_out_s3_full_pred_1_slot_valids_0
output io_out_s3_full_pred_1_slot_valids_1
output [40:0] io_out_s3_full_pred_1_targets_0
output [40:0] io_out_s3_full_pred_1_targets_1
output [40:0] io_out_s3_full_pred_1_jalr_target
output [40:0] io_out_s3_full_pred_1_fallThroughAddr
output io_out_s3_full_pred_1_fallThroughErr
output io_out_s3_full_pred_1_is_br_sharing
output io_out_s3_full_pred_1_hit
output io_out_s3_full_pred_2_br_taken_mask_0 Same as above
output io_out_s3_full_pred_2_br_taken_mask_1
output io_out_s3_full_pred_2_slot_valids_0
output io_out_s3_full_pred_2_slot_valids_1
output [40:0] io_out_s3_full_pred_2_targets_0
output [40:0] io_out_s3_full_pred_2_targets_1
output [40:0] io_out_s3_full_pred_2_jalr_target
output [40:0] io_out_s3_full_pred_2_fallThroughAddr
output io_out_s3_full_pred_2_fallThroughErr
output io_out_s3_full_pred_2_is_jalr
output io_out_s3_full_pred_2_is_call
output io_out_s3_full_pred_2_is_ret
output io_out_s3_full_pred_2_is_br_sharing
output io_out_s3_full_pred_2_hit
output io_out_s3_full_pred_3_br_taken_mask_0 Same as above
output io_out_s3_full_pred_3_br_taken_mask_1
output io_out_s3_full_pred_3_slot_valids_0
output io_out_s3_full_pred_3_slot_valids_1
output [40:0] io_out_s3_full_pred_3_targets_0
output [40:0] io_out_s3_full_pred_3_targets_1
output [40:0] io_out_s3_full_pred_3_jalr_target
output [3:0] io_out_s3_full_pred_3_offsets_0
output [3:0] io_out_s3_full_pred_3_offsets_1
output [40:0] io_out_s3_full_pred_3_fallThroughAddr
output io_out_s3_full_pred_3_fallThroughErr
output io_out_s3_full_pred_3_is_br_sharing
output io_out_s3_full_pred_3_hit
output [222:0] io_out_last_stage_meta Meta information output at the last stage
output io_out_last_stage_ftb_entry_valid FTB entry output at the last stage
output [3:0] io_out_last_stage_ftb_entry_brSlots_0_offset
output [11:0] io_out_last_stage_ftb_entry_brSlots_0_lower
output [1:0] io_out_last_stage_ftb_entry_brSlots_0_tarStat
output io_out_last_stage_ftb_entry_brSlots_0_sharing
output io_out_last_stage_ftb_entry_brSlots_0_valid
output [3:0] io_out_last_stage_ftb_entry_tailSlot_offset
output [19:0] io_out_last_stage_ftb_entry_tailSlot_lower
output [1:0] io_out_last_stage_ftb_entry_tailSlot_tarStat
output io_out_last_stage_ftb_entry_tailSlot_sharing
output io_out_last_stage_ftb_entry_tailSlot_valid
output [3:0] io_out_last_stage_ftb_entry_pftAddr
output io_out_last_stage_ftb_entry_carry
output io_out_last_stage_ftb_entry_isCall
output io_out_last_stage_ftb_entry_isRet
output io_out_last_stage_ftb_entry_isJalr
output io_out_last_stage_ftb_entry_last_may_be_rvi_call
output io_out_last_stage_ftb_entry_always_taken_0
output io_out_last_stage_ftb_entry_always_taken_1
input io_ctrl_btb_enable Enable signal
input io_s0_fire_0 s0-stage pipeline control signal
input io_s0_fire_1
input io_s0_fire_2
input io_s0_fire_3
output io_s1_ready s1-stage pipeline control signal
input io_s1_fire_0
input io_s1_fire_1
input io_s1_fire_2
input io_s1_fire_3
input io_s2_fire_0 s2-stage pipeline control signal
input io_s2_fire_1
input io_s2_fire_2
input io_s2_fire_3
input io_update_valid Whether the update is valid
input [40:0] io_update_bits_pc PC of the prediction block passed back (identifies the block to update)
input io_update_bits_ftb_entry_valid Whether the entry is valid
input [3:0] io_update_bits_ftb_entry_brSlots_0_offset Offset of the branch instruction in slot 0 relative to the start PC of the address block
input [11:0] io_update_bits_ftb_entry_brSlots_0_lower Low bits of the jump target address
input [1:0] io_update_bits_ftb_entry_brSlots_0_tarStat Whether the high bits of the PC carry or borrow after the jump
input io_update_bits_ftb_entry_brSlots_0_sharing A conditional branch instruction is stored in the unconditional-jump slot
input io_update_bits_ftb_entry_brSlots_0_valid Whether the slot is valid
input [3:0] io_update_bits_ftb_entry_tailSlot_offset Offset of the branch instruction in slot 1 relative to the start PC of the address block
input [19:0] io_update_bits_ftb_entry_tailSlot_lower Low bits of the jump target address
input [1:0] io_update_bits_ftb_entry_tailSlot_tarStat Whether the high bits of the PC carry or borrow after the jump
input io_update_bits_ftb_entry_tailSlot_sharing A conditional branch instruction is stored in the unconditional-jump slot
input io_update_bits_ftb_entry_tailSlot_valid Whether the slot is valid
input [3:0] io_update_bits_ftb_entry_pftAddr Partial fallthrough address: the address sequential execution reaches if no jump is taken within the prediction block, i.e., the block's end address
input io_update_bits_ftb_entry_carry Whether pc + pftAddr produces a carry
input io_update_bits_ftb_entry_isCall Whether it is a function call
input io_update_bits_ftb_entry_isRet Whether it is a function return
input io_update_bits_ftb_entry_isJalr Whether it is a jalr instruction
input io_update_bits_ftb_entry_last_may_be_rvi_call The last instruction slot may hold an RVI call instruction
input io_update_bits_ftb_entry_always_taken_0 Whether predicted as always taken
input io_update_bits_ftb_entry_always_taken_1 Whether predicted as always taken
input io_update_bits_old_entry Whether it is an old FTB entry
input [222:0] io_update_bits_meta Meta information

3.5 - ITTAGE Branch Predictor

Function Introduction

For ordinary conditional branch instructions, only the direction (taken or not taken) needs to be predicted. For indirect jumps, such as indirect calls and jumps via jalr, the target address must also be predicted. To make TAGE support target prediction, ITTAGE (Indirect Target TAGE) was introduced.

The main difference between ITTAGE and TAGE is that in the T0 and Tn tables, Target PC data is added. During prediction, ITTAGE selects the Target from the matched, longest history entry as the prediction result, and uses a 2-bit saturating counter to decide whether to output this result or choose an alternative prediction result. For TAGE predictor details, please refer to TAGE-SC Branch Predictor.

Kunming Lake ITTAGE Branch Predictor

In the BPU design of Kunming Lake, prediction is performed in a cascaded manner across multiple predictors, so the implementation of this sub-predictor differs from the original ITTAGE design, mainly in the default prediction result.

Basic Functionality

ITTAGE’s basic functionality is similar to the TAGE branch predictor, but with the following differences:

  1. The Target is added as a jump target address item in the entry to predict the jump target address.
  2. The saturating counter ctr no longer provides a prediction direction; instead, it encodes confidence in the stored target and decides whether that target is output as the prediction.
  3. Since there is only one indirect jump instruction in each branch prediction block, ITTAGE only considers one instruction.

Pipeline

ITTAGE contains three pipeline stages, the first stage calculates the index, and the second stage reads the result from the SRAM table using the index.

  1. Cycle 0, s0: Input of the first pipeline stage, generally the pc and the folded history. Operation of the first stage: calculate the index, and output it through registers to s1.

  2. Cycle 1, s1: Input of the second pipeline stage, the index and other data calculated in the first stage. Operation of the second stage: access the SRAM, read the prediction information, and output it through registers to s2.

  3. Cycle 2, s2: Input of the third pipeline stage, the raw prediction information read from the SRAM in the second stage. Operation of the third stage: process the raw prediction information and decide whether to output the prediction result.

  4. Cycle 3, s3: The prediction result is ready and can now be used.

Data Structure

In the Kunming Lake implementation, the table structure of T0 and Tn is as follows:

Predictor Role Entry Composition Entries
Base predictor T0 Outputs a prediction when no other predictor's result is valid Virtual table, does not exist; the prediction result of the upstream FTB predictor is used directly as the entry Virtual table, does not exist; the FTB result is used directly as the indexed result
Prediction tables T1-T2 For each prediction block input, every Tn table predicts; among the valid results, the one with the longest history is chosen as the raw prediction information (the history length is determined by the input H) target: 41 bits, valid: 1 bit, tag: 9 bits, ctr: 2 bits, us: 1 bit (usefulness counter) 256 entries
Prediction tables T3-T5 (same as above) (same as above) 512 entries

T0/Tn Table Retrieval Method

The retrieval method is consistent with the TAGE branch predictor, only differing in the configuration options of each table.

Table FH length FH1 length FH2 length Recent history length (bits of GH used)
T1 4 bits 4 bits 4 bits low 4 bits, i.e., the newest 4 bits of history are folded into FH, FH1, FH2
T2 8 bits 8 bits 8 bits low 8 bits, i.e., the newest 8 bits of history are folded into FH, FH1, FH2
T3 9 bits 9 bits 8 bits low 13 bits, i.e., the newest 13 bits of history are folded into FH, FH1, FH2
T4 9 bits 9 bits 8 bits low 16 bits, i.e., the newest 16 bits of history are folded into FH, FH1, FH2
T5 9 bits 9 bits 8 bits low 32 bits, i.e., the newest 32 bits of history are folded into FH, FH1, FH2

Other processes (computation method and computation formula) are similar to the TAGE-SC branch predictor.

Alternate Predictor

When the prediction result given by the Tn table has insufficient confidence, the prediction falls back to an "alternate prediction." This process is similar to TAGE; for details, refer to the corresponding part of the TAGE description. Unlike TAGE, ITTAGE's ctr does not give a prediction direction but only decides whether to output the stored target (it encodes prediction confidence). When ctr is 2b00, confidence is considered weak and the alternate prediction result is chosen:

  1. If multiple tables are hit, output the Target from the second-longest history table entry.
  2. Otherwise, output the T0 Target (FTB Target).

Prediction Process

The prediction process is similar to TAGE, but ITTAGE adds a step that decides, based on ctr, whether to output the stored target. The specific process is as follows (a C sketch follows the list):

  1. When the ctr of the ITTAGE table entry is not 2b00, output Target.
  2. When the ctr of the ITTAGE table entry is 2b00, output the alternate prediction result:
    1. If there is a second-longest history (the second table is also hit), output the Target of the second-longest.
    2. Otherwise, output the FTB Target.
  3. When the ITTAGE table entry is not hit, output the T0 Target (FTB Target).
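
A hedged sketch of this target selection in C, assuming the provider (longest-history hit) and alternate (second-longest hit) lookups have already been performed; the names are illustrative.

#include <stdbool.h>
#include <stdint.h>

/* Choose the predicted indirect target from the longest-history hit
 * (provider), the second-longest hit (alt), and the FTB target (T0). */
uint64_t ittage_target(bool provider_hit, uint8_t provider_ctr /* 2-bit */,
                       uint64_t provider_target,
                       bool alt_hit, uint64_t alt_target,
                       uint64_t ftb_target) {
    if (!provider_hit)
        return ftb_target;             /* no Tn hit: T0 (FTB) target  */
    if (provider_ctr != 0)             /* ctr != 2b00: confident      */
        return provider_target;
    return alt_hit ? alt_target        /* weak: second-longest hit    */
                   : ftb_target;       /* otherwise FTB target        */
}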

Training Process

This process is similar to TAGE, with the following differences:

  1. Table entry updates (original prediction data):
    1. ctr:
      1. If the predicted address matches the actual address, increment the ctr counter of the corresponding provider table entry by 1.
      2. If the predicted address does not match the actual address, decrement the ctr counter of the corresponding provider table entry by 1.
      3. In ITTAGE, ctr determines whether the jump target of this prediction is adopted. If multiple tables hit and the ctr of the longest-history table is 0, the alternate prediction logic is used (the second-longest-history table, or T0). During updates, the longest-history table is always updated, and the alternate prediction table is also updated if the alternate prediction was adopted.
    2. target:
      1. When the ctr of the table entry to be updated is 0 during this prediction, directly store the actual final jump result in the target, overwriting it.
      2. When applying for a new table entry, directly store the actual final jump result in the target.
      3. Otherwise, do not modify the target.
    3. usefulness:
      1. When the provider’s prediction is correct but the alternate prediction is incorrect, set the provider’s usefulness to 1.
      2. If the alternate prediction has weak confidence and is correct, set the provider’s usefulness to 1. If the alternate prediction has weak confidence and is incorrect, set the provider’s usefulness to 0.
    4. New table entry:
      1. Each time the prediction from the longest history table with confidence is incorrect (not due to using the alternate prediction), try to randomly apply for a table entry from a longer history table. The condition for application is that the usefulness of the corresponding entry is 0.
      2. If all longer entries are not 0, the allocation fails.
  2. Reset useful bit:
    1. Each time a prediction error occurs and a new table entry is applied for, if the allocation fails, increment tickCtr (an 8-bit saturated counter used to reset all usefulness). If successful, decrement tickCtr.
    2. When tickCtr reaches its maximum value, set all usefulness in ITTAGE to 0 and reset tickCtr to 0.

Interface List

Interface Type Width Signal Name Notes
input clock
input reset
input [40:0] io_in_bits_s0_pc_3 PC used for prediction
input [7:0] io_in_bits_folded_hist_3_hist_14_folded_hist T2 folded history
input [8:0] io_in_bits_folded_hist_3_hist_13_folded_hist T3 folded history
input [3:0] io_in_bits_folded_hist_3_hist_12_folded_hist T1 folded history
input [8:0] io_in_bits_folded_hist_3_hist_10_folded_hist T5 folded history
input [8:0] io_in_bits_folded_hist_3_hist_6_folded_hist T4 folded history
input [7:0] io_in_bits_folded_hist_3_hist_4_folded_hist T3 folded history
input [7:0] io_in_bits_folded_hist_3_hist_3_folded_hist T5 folded history
input [7:0] io_in_bits_folded_hist_3_hist_2_folded_hist T4 folded history
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_jalr_target
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_jalr_target
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_jalr_target
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_jalr_target
output [40:0] io_out_s3_full_pred_0_jalr_target
output [40:0] io_out_s3_full_pred_1_jalr_target
output [40:0] io_out_s3_full_pred_2_jalr_target
output [40:0] io_out_s3_full_pred_3_jalr_target
output [222:0] io_out_last_stage_meta Bits [100:0] are valid and hold ITTAGE's meta information
input io_s0_fire_3 s0-stage enable signal
input io_s1_fire_3 s1-stage enable signal
input io_s2_fire_0 s2-stage enable signals, identical
input io_s2_fire_1
input io_s2_fire_2
input io_s2_fire_3
input io_update_valid Whether to perform an update
input [40:0] io_update_bits_pc PC index of the prediction block to update
input [7:0] io_update_bits_spec_info_folded_hist_hist_14_folded_hist T2 history passed in at update
input [8:0] io_update_bits_spec_info_folded_hist_hist_13_folded_hist T3 history passed in at update
input [3:0] io_update_bits_spec_info_folded_hist_hist_12_folded_hist T1 history passed in at update
input [8:0] io_update_bits_spec_info_folded_hist_hist_10_folded_hist T5 history passed in at update
input [8:0] io_update_bits_spec_info_folded_hist_hist_6_folded_hist T4 history passed in at update
input [7:0] io_update_bits_spec_info_folded_hist_hist_4_folded_hist T3 history passed in at update
input [7:0] io_update_bits_spec_info_folded_hist_hist_3_folded_hist T5 history passed in at update
input [7:0] io_update_bits_spec_info_folded_hist_hist_2_folded_hist T4 history passed in at update
input [3:0] io_update_bits_ftb_entry_tailSlot_offset Offset of the FTB entry to update
input io_update_bits_ftb_entry_tailSlot_sharing Whether the FTB entry to update is a conditional branch
input io_update_bits_ftb_entry_tailSlot_valid Whether the tailSlot to update is valid
input io_update_bits_ftb_entry_isRet Whether the tailSlot is a ret instruction
input io_update_bits_ftb_entry_isJalr Whether the tailSlot is a jalr instruction
input io_update_bits_cfi_idx_valid Valid signal for the index of the control-flow instruction within the prediction block
input [3:0] io_update_bits_cfi_idx_bits Index of the control-flow instruction within the prediction block
input io_update_bits_jmp_taken An unconditional jump within the prediction block was taken
input io_update_bits_mispred_mask_2 Whether the prediction was wrong
input [222:0] io_update_bits_meta Meta information produced at prediction time, i.e., {25h0, _ubtb_io_out_last_stage_meta[5:0], _tage_io_out_last_stage_meta[87:0], _ftb_io_out_last_stage_meta[2:0], _ittage_io_out_last_stage_meta[100:0]}
input [40:0] io_update_bits_full_target Jump target of the prediction block (start address of the next prediction block)

Pass-through signals that do not have an impact

These signals have no impact on ITTAGE's prediction logic and are passed through unchanged.
Interface Type Width Signal Name Notes
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_0 Input from FTB, passed through completely to the output, including jalr_target
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_0_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_0_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_0_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_0_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_0_is_br_sharing
input io_in_bits_resp_in_0_s2_full_pred_0_hit
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_1_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_1_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_1_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_1_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_1_is_br_sharing
input io_in_bits_resp_in_0_s2_full_pred_1_hit
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_2_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_2_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_2_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_2_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_2_is_jalr Information used by the RAS module, passed through
input io_in_bits_resp_in_0_s2_full_pred_2_is_call
input io_in_bits_resp_in_0_s2_full_pred_2_is_ret
input io_in_bits_resp_in_0_s2_full_pred_2_last_may_be_rvi_call
input io_in_bits_resp_in_0_s2_full_pred_2_is_br_sharing Input from FTB, passed through completely to the output, including jalr_target. fallThroughErr indicates that the pftAddr recorded in the FTB entry is wrong; it is generated by comparing whether the block end address represented by pftAddr is greater than the block start address. If it is less, an error has occurred and the signal is asserted, which can happen when the PC indexes a wrong FTB entry. The FTQ uses this variable; it is unrelated to ITTAGE.
input io_in_bits_resp_in_0_s2_full_pred_2_hit
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_3_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_3_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_3_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_3_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_3_fallThroughErr
input io_in_bits_resp_in_0_s2_full_pred_3_is_br_sharing
input io_in_bits_resp_in_0_s2_full_pred_3_hit
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_0 Everything is passed through except jalr_target, which may be modified
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_0_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_0_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_targets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_0_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_0_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_0_hit
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_0 Same as above
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_1_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_1_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_targets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_1_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_1_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_1_hit
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_0 Same as above
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_2_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_2_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_targets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_2_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_2_is_jalr
input io_in_bits_resp_in_0_s3_full_pred_2_is_call
input io_in_bits_resp_in_0_s3_full_pred_2_is_ret
input io_in_bits_resp_in_0_s3_full_pred_2_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_2_hit
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_0 Same as above
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_3_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_3_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_targets_1
input [3:0] io_in_bits_resp_in_0_s3_full_pred_3_offsets_0
input [3:0] io_in_bits_resp_in_0_s3_full_pred_3_offsets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_3_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_3_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_3_hit
input io_in_bits_resp_in_0_last_stage_ftb_entry_valid Passed through to the output unmodified; sourced from the FTB
input [3:0] io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_offset
input [11:0] io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_lower
input [1:0] io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_tarStat
input io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_sharing
input io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_valid
input [3:0] io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_offset
input [19:0] io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_lower
input [1:0] io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_tarStat
input io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_sharing
input io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_valid
input [3:0] io_in_bits_resp_in_0_last_stage_ftb_entry_pftAddr
input io_in_bits_resp_in_0_last_stage_ftb_entry_carry
input io_in_bits_resp_in_0_last_stage_ftb_entry_isCall
input io_in_bits_resp_in_0_last_stage_ftb_entry_isRet
input io_in_bits_resp_in_0_last_stage_ftb_entry_isJalr
input io_in_bits_resp_in_0_last_stage_ftb_entry_last_may_be_rvi_call
input io_in_bits_resp_in_0_last_stage_ftb_entry_always_taken_0
input io_in_bits_resp_in_0_last_stage_ftb_entry_always_taken_1
output io_out_s2_full_pred_0_br_taken_mask_0 Fully passes through the input value; prefix: io_in_bits_resp_in_
output io_out_s2_full_pred_0_br_taken_mask_1
output io_out_s2_full_pred_0_slot_valids_0
output io_out_s2_full_pred_0_slot_valids_1
output [40:0] io_out_s2_full_pred_0_targets_0
output [40:0] io_out_s2_full_pred_0_targets_1
output [40:0] io_out_s2_full_pred_0_jalr_target
output [3:0] io_out_s2_full_pred_0_offsets_0
output [3:0] io_out_s2_full_pred_0_offsets_1
output [40:0] io_out_s2_full_pred_0_fallThroughAddr
output io_out_s2_full_pred_0_is_br_sharing
output io_out_s2_full_pred_0_hit
output io_out_s2_full_pred_1_br_taken_mask_0
output io_out_s2_full_pred_1_br_taken_mask_1
output io_out_s2_full_pred_1_slot_valids_0
output io_out_s2_full_pred_1_slot_valids_1
output [40:0] io_out_s2_full_pred_1_targets_0
output [40:0] io_out_s2_full_pred_1_targets_1
output [40:0] io_out_s2_full_pred_1_jalr_target
output [3:0] io_out_s2_full_pred_1_offsets_0
output [3:0] io_out_s2_full_pred_1_offsets_1
output [40:0] io_out_s2_full_pred_1_fallThroughAddr
output io_out_s2_full_pred_1_is_br_sharing
output io_out_s2_full_pred_1_hit
output io_out_s2_full_pred_2_br_taken_mask_0
output io_out_s2_full_pred_2_br_taken_mask_1
output io_out_s2_full_pred_2_slot_valids_0
output io_out_s2_full_pred_2_slot_valids_1
output [40:0] io_out_s2_full_pred_2_targets_0
output [40:0] io_out_s2_full_pred_2_targets_1
output [40:0] io_out_s2_full_pred_2_jalr_target
output [3:0] io_out_s2_full_pred_2_offsets_0
output [3:0] io_out_s2_full_pred_2_offsets_1
output [40:0] io_out_s2_full_pred_2_fallThroughAddr
output io_out_s2_full_pred_2_is_jalr
output io_out_s2_full_pred_2_is_call
output io_out_s2_full_pred_2_is_ret
output io_out_s2_full_pred_2_last_may_be_rvi_call
output io_out_s2_full_pred_2_is_br_sharing
output io_out_s2_full_pred_2_hit
output io_out_s2_full_pred_3_br_taken_mask_0
output io_out_s2_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_3_slot_valids_0
output io_out_s2_full_pred_3_slot_valids_1
output [40:0] io_out_s2_full_pred_3_targets_0
output [40:0] io_out_s2_full_pred_3_targets_1
output [40:0] io_out_s2_full_pred_3_jalr_target
output [3:0] io_out_s2_full_pred_3_offsets_0
output [3:0] io_out_s2_full_pred_3_offsets_1
output [40:0] io_out_s2_full_pred_3_fallThroughAddr
output io_out_s2_full_pred_3_fallThroughErr
output io_out_s2_full_pred_3_is_br_sharing
output io_out_s2_full_pred_3_hit
output io_out_s3_full_pred_0_br_taken_mask_0 See the input with the corresponding prefix
output io_out_s3_full_pred_0_br_taken_mask_1
output io_out_s3_full_pred_0_slot_valids_0
output io_out_s3_full_pred_0_slot_valids_1
output [40:0] io_out_s3_full_pred_0_targets_0
output [40:0] io_out_s3_full_pred_0_targets_1
output [40:0] io_out_s3_full_pred_0_fallThroughAddr
output io_out_s3_full_pred_0_fallThroughErr
output io_out_s3_full_pred_0_is_br_sharing
output io_out_s3_full_pred_0_hit
output io_out_s3_full_pred_1_br_taken_mask_0 See the input with the corresponding prefix
output io_out_s3_full_pred_1_br_taken_mask_1
output io_out_s3_full_pred_1_slot_valids_0
output io_out_s3_full_pred_1_slot_valids_1
output [40:0] io_out_s3_full_pred_1_targets_0
output [40:0] io_out_s3_full_pred_1_targets_1
output [40:0] io_out_s3_full_pred_1_fallThroughAddr
output io_out_s3_full_pred_1_fallThroughErr
output io_out_s3_full_pred_1_is_br_sharing
output io_out_s3_full_pred_1_hit
output io_out_s3_full_pred_2_br_taken_mask_0 See the input with the corresponding prefix
output io_out_s3_full_pred_2_br_taken_mask_1
output io_out_s3_full_pred_2_slot_valids_0
output io_out_s3_full_pred_2_slot_valids_1
output [40:0] io_out_s3_full_pred_2_targets_0
output [40:0] io_out_s3_full_pred_2_targets_1
output [40:0] io_out_s3_full_pred_2_fallThroughAddr
output io_out_s3_full_pred_2_fallThroughErr
output io_out_s3_full_pred_2_is_jalr
output io_out_s3_full_pred_2_is_call
output io_out_s3_full_pred_2_is_ret
output io_out_s3_full_pred_2_is_br_sharing
output io_out_s3_full_pred_2_hit
output io_out_s3_full_pred_3_br_taken_mask_0 See the input with the corresponding prefix
output io_out_s3_full_pred_3_br_taken_mask_1
output io_out_s3_full_pred_3_slot_valids_0
output io_out_s3_full_pred_3_slot_valids_1
output [40:0] io_out_s3_full_pred_3_targets_0
output [40:0] io_out_s3_full_pred_3_targets_1
output [3:0] io_out_s3_full_pred_3_offsets_0
output [3:0] io_out_s3_full_pred_3_offsets_1
output [40:0] io_out_s3_full_pred_3_fallThroughAddr
output io_out_s3_full_pred_3_fallThroughErr
output io_out_s3_full_pred_3_is_br_sharing
output io_out_s3_full_pred_3_hit
output io_out_last_stage_ftb_entry_valid Fully passes through the input value
output [3:0] io_out_last_stage_ftb_entry_brSlots_0_offset
output [11:0] io_out_last_stage_ftb_entry_brSlots_0_lower
output [1:0] io_out_last_stage_ftb_entry_brSlots_0_tarStat
output io_out_last_stage_ftb_entry_brSlots_0_sharing
output io_out_last_stage_ftb_entry_brSlots_0_valid
output [3:0] io_out_last_stage_ftb_entry_tailSlot_offset
output [19:0] io_out_last_stage_ftb_entry_tailSlot_lower
output [1:0] io_out_last_stage_ftb_entry_tailSlot_tarStat
output io_out_last_stage_ftb_entry_tailSlot_sharing
output io_out_last_stage_ftb_entry_tailSlot_valid
output [3:0] io_out_last_stage_ftb_entry_pftAddr
output io_out_last_stage_ftb_entry_carry
output io_out_last_stage_ftb_entry_isCall
output io_out_last_stage_ftb_entry_isRet
output io_out_last_stage_ftb_entry_isJalr
output io_out_last_stage_ftb_entry_last_may_be_rvi_call
output io_out_last_stage_ftb_entry_always_taken_0
output io_out_last_stage_ftb_entry_always_taken_1

Other Meta information can be found in the corresponding sub-predictor documentation

_ubtb_io_out_last_stage_meta

_tage_io_out_last_stage_meta

_ftb_io_out_last_stage_meta

ittage_io_out_last_stage_meta[100:0]

| Bits | Signal | Description |
|------|--------|-------------|
| 100 | s3_provided | whether a prediction result is provided |
| [99:97] | s3_provider | the table entry providing the result |
| 96 | s3_altProvided | whether an alternate-prediction entry is provided |
| [95:93] | s3_altProvider | the table entry providing the alternate prediction |
| 92 | resp_meta_altDiffers | whether the alternate prediction is weak-confidence (the FTB does not count) |
| 91 | s3_providerU | the useful bit of the main prediction |
| [90:89] | s3_providerCtr | the confidence counter given by the main prediction |
| [88:87] | s3_altProviderCtr | the confidence counter given by the alternate prediction |
| 86 | resp_meta_allocate_valid_r | a free entry is available for allocation |
| [85:83] | resp_meta_allocate_bits_r | which table's entry to allocate |
| 82 | s3_tageTaken_dup_3 | always true when the FTB is not used, and also true when it is |
| [81:41] | s3_providerTarget | the jump target given by the main prediction |
| [40:0] | s3_altProviderTarget | the jump target given by the alternate prediction |
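
For verification harnesses it can be handy to unpack this meta word into named fields. The following is a minimal C sketch of such an unpacker, assuming the bit layout in the table above; the struct and helper names are illustrative rather than taken from the RTL, and unsigned __int128 is a GCC/Clang extension used here only to hold the 101-bit value.

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;     /* wide enough for the 101-bit meta */

/* Extract bits [hi:lo] of v (hi >= lo, width <= 64). */
static uint64_t bits(u128 v, int hi, int lo) {
    return (uint64_t)((v >> lo) & ((((u128)1) << (hi - lo + 1)) - 1));
}

typedef struct {
    int provided, provider, altProvided, altProvider;
    int altDiffers, providerU, providerCtr, altProviderCtr;
    int allocateValid, allocateBits, tageTaken;
    uint64_t providerTarget, altProviderTarget;
} ittage_meta_t;

static ittage_meta_t unpack(u128 meta) {
    ittage_meta_t m;
    m.provided          = (int)bits(meta, 100, 100);
    m.provider          = (int)bits(meta,  99,  97);
    m.altProvided       = (int)bits(meta,  96,  96);
    m.altProvider       = (int)bits(meta,  95,  93);
    m.altDiffers        = (int)bits(meta,  92,  92);
    m.providerU         = (int)bits(meta,  91,  91);
    m.providerCtr       = (int)bits(meta,  90,  89);
    m.altProviderCtr    = (int)bits(meta,  88,  87);
    m.allocateValid     = (int)bits(meta,  86,  86);
    m.allocateBits      = (int)bits(meta,  85,  83);
    m.tageTaken         = (int)bits(meta,  82,  82);
    m.providerTarget    = bits(meta, 81, 41);
    m.altProviderTarget = bits(meta, 40,  0);
    return m;
}

int main(void) {
    u128 meta = (((u128)1) << 100) | (((u128)5) << 97);  /* provided, provider = 5 */
    ittage_meta_t m = unpack(meta);
    printf("provided=%d provider=%d\n", m.provided, m.provider);
    return 0;
}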

3.6 - RAS Branch Predictor

RAS Introduction

RAS stands for “Return Address Stack.” It helps determine branch behavior in programs by tracking return addresses. As previously mentioned, there are many branches in a program: if/else, switch/case, while/for loop, iteration, call/return, etc. The RAS branch predictor specifically targets call/return types.

function _add(a, b){
    return (a > 0 ? a : 0)  + (b > 0? b : 0);
}

function add(a, b){
    return a + b;
}

function sub(a, b){
    return a - b;
}

function main(){
    a = 1;
    b = 2;
    c = add(a, b);
    d = sub(a, b);
}

As shown above, the main function calls add and sub, and add in turn calls _add. In this process, each call's jump address and return address are fixed, and the return address is already known at the time of the call. A function call/return sequence behaves exactly like stack pushes and pops, so branch prediction can use a "stack" structure: each time a call instruction is encountered, the current PC+4 (PC+2 for compressed instructions, since compressed and regular instructions have different lengths) is pushed onto the stack; when a return instruction is encountered, a pop is performed, and the popped address is the predicted jump target. In the block-based BPU, the RAS cannot tell by itself whether the current block contains a call or a ret, so it relies on the results of the other predictors to drive its stack operations.
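
The paragraph above can be condensed into a few lines of code. Below is a minimal C sketch of this idealized behaviour (all names are illustrative; overflow handling is omitted):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DEPTH 32

static uint64_t ras[DEPTH];
static int sp = 0;                  /* index of the first free slot */

/* A call pushes its return address: pc + 4, or pc + 2 for a compressed call. */
void on_call(uint64_t pc, int is_compressed) {
    ras[sp++] = pc + (is_compressed ? 2 : 4);
}

/* A ret pops the predicted jump target. */
uint64_t on_ret(void) {
    return ras[--sp];
}

int main(void) {
    on_call(0x80000000, 0);         /* main calls add  */
    on_call(0x80000100, 0);         /* add calls _add  */
    printf("ret from _add -> 0x%" PRIx64 "\n", on_ret());
    printf("ret from add  -> 0x%" PRIx64 "\n", on_ret());
    return 0;
}

Running it prints 0x80000104 and then 0x80000004, matching the nesting of the add/_add calls above.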

Specifically, in Xiangshan's RAS predictor, the s2 stage checks whether the previous predictors' s2 output predicts a call or a ret (i.e., whether the input signal io.s2_full_pred.hit_taken_on_call/ret is valid). For a call, the address of the following instruction is pushed onto the stack; for a ret, an address is popped from the stack as the prediction result. Because the BPU assumes that the s3-stage result is better than the s2-stage result, the RAS predictor re-checks at s3: if the s3 prediction from the previous predictors disagrees with s2, the s3 result is adopted, and the s2-stage stack operations must be cancelled or completed as needed. For example, if s2 predicted a call and performed a push, but s3 predicts an ordinary branch that requires no operation, the push must be cancelled; if s2 predicted an ordinary branch and s3 predicts a call, the missing push must be performed.

RAS Stack Operations

In the RAS design, function return addresses are predicted using a stack. For now, assume an idealized RAS that can be backed up at any time, with the stack top pointer written sp and the predicted address written paddr. The basic RAS operations are as follows (a behavioural sketch of all four follows the list):

  1. PUSH

Since predictions can be wrong, the current stack state first needs to be backed up (often called a "snapshot" in software; this term is used in subsequent content as well). When a call instruction is encountered, compute the return address addr = pc + 4 (or addr = pc + 2 for a compressed instruction), then push it: stack[sp] = addr; sp += 1.

  2. POP

For the same reason, first take a snapshot of the current stack, marked s. When a ret instruction is encountered, the predicted jump address is paddr = stack[sp - 1], then pop: sp -= 1.

  3. Redirect Operation

Since the BPU predicts program branches, there are "correct predictions" and "wrong predictions." When the CPU backend detects a branch misprediction, it performs a redirect, informing the BPU where the prediction went wrong and what the correct result is. During redirection, the RAS module receives the correct branch and the RAS stack information saved at prediction time. The snapshot is restored and then, depending on the type of the correct branch instruction:

(1) If the mispredicted instruction is actually a call, a push is executed using the addr provided in the redirect. (2) If it is actually a ret, a pop is executed.

  4. Commit Operation

A commit is the backend informing the frontend that an earlier prediction was correct. Ideally, the RAS predictor needs to perform no operation at this point.
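
Putting the four operations together, here is a hedged C sketch of the idealized model, where a snapshot simply copies the whole stack (all names are illustrative):

#include <stdint.h>
#include <stdio.h>

#define DEPTH 32

typedef struct { uint64_t data[DEPTH]; int sp; } ras_t;  /* sp = first free slot */

static ras_t ras;

/* PUSH: snapshot first, then store the return address at the top. */
ras_t push(uint64_t ret_addr) {
    ras_t snap = ras;                 /* back up the whole stack ("snapshot") */
    ras.data[ras.sp] = ret_addr;      /* stack[sp] = addr */
    ras.sp += 1;                      /* sp += 1 */
    return snap;
}

/* POP: snapshot first; the predicted target is the current top. */
ras_t pop(uint64_t *paddr) {
    ras_t snap = ras;
    *paddr = ras.data[ras.sp - 1];    /* paddr = stack[sp - 1] */
    ras.sp -= 1;                      /* sp -= 1 */
    return snap;
}

/* Redirect: restore the snapshot taken at prediction time, then apply
 * the instruction's real effect. */
void redirect(ras_t snap, int is_call, int is_ret, uint64_t addr) {
    ras = snap;
    if (is_call)     push(addr);
    else if (is_ret) { uint64_t t; pop(&t); }
}

int main(void) {
    push(0x80000004);                           /* correctly predicted call */
    ras_t snap = push(0x80001008);              /* mispredicted "call"      */
    redirect(snap, 0, 1, 0);                    /* it was actually a ret    */
    printf("sp after redirect: %d\n", ras.sp);  /* back to 0                */
    return 0;
}

Copying the entire stack on every prediction is trivial in a model but infeasible in hardware, which is exactly the problem the Kunming Lake implementation addresses below.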

Implementation of RAS in Kunming Lake

In actual circuit design, an infinitely large stack is impossible, and constant backups are not feasible. Therefore, in Kunming Lake’s RAS implementation, the problems and solutions are as follows:

How to obtain RAS stack snapshots for each prediction?

To achieve the function of taking snapshots of the RAS stack, Kunming Lake adopts a linked representation based on a circular array. The design is as follows:

As shown above, a circular array is used for data management. The circular array has a start pointer marked BOS and a tail pointer marked TOSW; the data between them is valid, and the data outside is free. Within the valid data, a linked structure represents the "RAS stack": each stack element records the index of the element below it, so that during stack operations the previous element can be reached through this index. The bottom of the RAS stack shares the BOS pointer.

In the initial state S in the figure, the RAS stack elements are 0, 1, 3, 5: element 5 records the position of element 3 below it, and element 3 records the position of element 1. For a push, the stack top pointer becomes TOSR = TOSW, the new element is written at the new TOSR (position 7) together with the position of the element below it (5), and then TOSW advances (TOSW = TOSW + 1). For a pop, TOSR moves to the position of the element below (3) using the index saved in the top element; TOSW does not move.

As long as the array does not overflow, normal push/pop operations always allocate new entries through TOSW and never overwrite old ones, so every intermediate state is preserved. Restoring state S therefore only requires resetting the stack pointers. For this reason, each prediction also saves its stack pointers (BOS, TOSR, TOSW) in the prediction result, so they can be restored on a later redirect. The advantage of this structure is that it keeps complete history; the cost is that frequent pushes consume a lot of space.
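
The following C sketch models this linked structure on a circular array. BOS/TOSR/TOSW follow the text above, while the nos field name, the array size, and the lack of overflow handling are simplifications of my own:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define N 16                         /* circular array size (illustrative) */

typedef struct {
    uint64_t addr;                   /* return address payload */
    int      nos;                    /* index of the stack element below */
} entry_t;

static entry_t arr[N];
static int BOS  = 0;                 /* start of valid data (kept for completeness) */
static int TOSR = -1;                /* top of the linked RAS stack */
static int TOSW = 0;                 /* next free slot (allocation pointer) */

/* Push: allocate a fresh entry at TOSW and link it to the old top. */
static void push(uint64_t addr) {
    arr[TOSW].addr = addr;
    arr[TOSW].nos  = TOSR;
    TOSR = TOSW;
    TOSW = (TOSW + 1) % N;           /* old entries are never overwritten */
}

/* Pop: follow the top element's link downwards; TOSW stays put, so the
 * popped entry (and every intermediate state) survives. */
static uint64_t pop(void) {
    uint64_t addr = arr[TOSR].addr;
    TOSR = arr[TOSR].nos;
    return addr;
}

int main(void) {
    push(0x1000); push(0x2000);
    int save_TOSR = TOSR, save_TOSW = TOSW;   /* pointers saved per prediction */
    (void)pop();  push(0x3000);               /* speculative activity */
    TOSR = save_TOSR; TOSW = save_TOSW;       /* redirect: just restore pointers */
    printf("top after restore: 0x%" PRIx64 "\n", arr[TOSR].addr);  /* 0x2000 */
    (void)BOS;
    return 0;
}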

How to reduce the storage space wasted by the linked RAS?

Once a prediction block has committed, its result is final and the stack will never roll back past it. After the RAS predictor receives the commit message for a prediction block P, it can no longer receive a redirect for P, so the snapshot taken when pushing for P will never be used again. The RAS stack elements can therefore be split into two parts: uncommitted elements kept in the snapshot-preserving linked structure, and committed elements kept in an ordinary stack. The optimized RAS structure is shown below:

As shown above, driven by normal (speculative) call/ret predictions on one side and committed call/ret results on the other, the original RAS stack is split into two independent stacks: the "speculative stack" (spec_stack, linked structure) and the "commit stack" (commit_stack, ordinary structure). With this split, the specific push/pop operations are as follows (a sketch follows the list):

  1. On a normally predicted call or ret: (1) If the prediction block predicts a taken call, a push is performed on the spec_stack, using the linked-stack push process described above.

(2) If the prediction block predicts a taken ret, the top of the spec_stack is used as the predicted address and a pop is performed. If the spec_stack is empty, the top element of the commit_stack is used as the prediction result instead (without popping it).

  2. Commit operation: (1) If the FTQ execution result confirms a correct call, an ordinary push is performed on the commit_stack.

(2) If the FTQ execution result confirms a correct ret, an ordinary pop is performed on the commit_stack.

  3. Redirect operation: (1) The stack pointers (BOS, TOSR, TOSW, ssp) carried in the redirect message, saved at prediction time, overwrite the current pointer values, completing the speculative-stack rollback.

(2) The commit stack is not affected.
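
A compact C sketch of the split discipline follows. The spec stack is shown as a plain array for brevity (in the real design it is the linked structure above), and all names are illustrative:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DEPTH 32

static uint64_t spec_stack[DEPTH];   static int spec_sp   = 0;
static uint64_t commit_stack[DEPTH]; static int commit_sp = 0;

/* Speculative push for a predicted call. */
static void spec_push(uint64_t addr) { spec_stack[spec_sp++] = addr; }

/* Speculative pop for a predicted ret; if the spec stack is empty, the
 * commit-stack top is used as the prediction without being popped. */
static uint64_t spec_pop(void) {
    if (spec_sp > 0) return spec_stack[--spec_sp];
    return commit_sp > 0 ? commit_stack[commit_sp - 1] : 0;
}

/* Commit operations mirror confirmed FTQ results on the commit stack. */
static void commit_call(uint64_t ret_addr) { commit_stack[commit_sp++] = ret_addr; }
static void commit_ret(void)               { if (commit_sp > 0) commit_sp--; }

int main(void) {
    commit_call(0x80000004);                 /* an older call has committed   */
    spec_push(0x80001008);                   /* a newer call is speculative   */
    printf("0x%" PRIx64 "\n", spec_pop());   /* 0x80001008: spec stack first  */
    printf("0x%" PRIx64 "\n", spec_pop());   /* 0x80000004: commit-stack top  */
    commit_ret();
    return 0;
}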

How to handle an S3 prediction result that is inconsistent with S2?

Since the S3 result is assumed to be better than S2's, the RAS stack must be repaired when the two disagree. The possible inconsistencies and the corresponding repair operations are shown in the table below:

| S2 Pred. Result | S3 Pred. Result | Repair Operation |
|-----------------|-----------------|------------------|
| push | keep | pop |
| keep | pop | pop |
| pop | keep | push |
| keep | push | push |

Note that the pop/push and push/pop combinations between S2 and S3 do not occur (why not?).
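
The table reduces to a small decision function. Here is a hedged C sketch (the enum and function are illustrative; the absent pop/push and push/pop cases are simply not handled, per the note above):

#include <stdio.h>

typedef enum { KEEP, PUSH, POP } op_t;

/* S3 repair per the table: undo an S2 operation that S3 rejects, or
 * perform the operation that S2 missed. */
static op_t repair(op_t s2, op_t s3) {
    if (s2 == PUSH && s3 == KEEP) return POP;   /* undo the S2 push       */
    if (s2 == KEEP && s3 == POP)  return POP;   /* perform the missed pop */
    if (s2 == POP  && s3 == KEEP) return PUSH;  /* undo the S2 pop        */
    if (s2 == KEEP && s3 == PUSH) return PUSH;  /* perform the missed push */
    return KEEP;                                /* s2 == s3: nothing to repair */
}

int main(void) {
    printf("%d\n", repair(PUSH, KEEP));  /* prints 2 (POP) */
    return 0;
}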

Other Optimizations

  1. Each element in the RAS stack carries a counter used to compress repeated values (as produced by recursive calls). For example, if the address pushed the first time is 0xff00 and the address pushed the second time is also 0xff00, only the counter of the stack-top element needs to be incremented; the address does not need to be pushed again (see the sketch below).
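
A minimal C sketch of this counter optimization (names are illustrative; counter overflow is ignored):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t addr; unsigned ctr; } entry_t;

static entry_t stack[32];
static int sp = 0;                   /* index of the first free slot */

/* Pushing the stack-top address again only bumps its counter. */
static void push(uint64_t addr) {
    if (sp > 0 && stack[sp - 1].addr == addr) {
        stack[sp - 1].ctr++;         /* recursive call: count, don't store */
    } else {
        stack[sp].addr = addr;
        stack[sp].ctr  = 0;
        sp++;
    }
}

/* Popping unwinds one counted repetition before removing the entry. */
static uint64_t pop(void) {
    uint64_t addr = stack[sp - 1].addr;
    if (stack[sp - 1].ctr > 0) stack[sp - 1].ctr--;
    else sp--;
    return addr;
}

int main(void) {
    push(0xff00); push(0xff00);      /* recursion: one entry, ctr = 1 */
    printf("entries=%d ctr=%u\n", sp, stack[0].ctr);
    printf("pop -> 0x%" PRIx64 "\n", pop());
    printf("pop -> 0x%" PRIx64 "\n", pop());
    return 0;
}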

Interface Description

In the RAS predictor, the core component is the RASStack, with the following interface:

| Interface | Description |
|-----------|-------------|
| in.spec_push_valid | a call is predicted; push onto the speculative stack |
| in.spec_pop_valid | a ret is predicted; pop from the speculative stack |
| in.spec_push_addr | the return address to push |
| out.spec_pop_addr | the data at the top of the RAS stack |
| out.ssp | commit stack top pointer |
| out.sctr | repeat counter of the commit stack top element |
| out.nsp | commit stack top; overwritten by ssp |
| out.TOSR | speculative stack top pointer |
| out.TOSW | speculative stack allocation pointer |
| out.BOS | speculative stack bottom pointer |
| in.s2_fire | the s2 stage fires |
| in.s3_fire | the s3 stage fires |
| in.s3_cancel | the s3 and s2 predictions differ; the s2 operation must be undone |
| in.s3_meta | the s2-time context information needed at s3 |
| in.s3_missed_pop | s3 determines that a pop must still be performed |
| in.s3_missed_push | s3 determines that a push must still be performed |
| in.s3_pushAddr | the address for that additional push |
| in.redirect_valid | a redirect is required |
| in.redirect_isCall | the actually executed instruction is a call |
| in.redirect_isRet | the actually executed instruction is a return |
| in.redirect_meta_ssp | ssp context saved at prediction time |
| in.redirect_meta_sctr | sctr context saved at prediction time |
| in.redirect_meta_TOSW | TOSW context saved at prediction time |
| in.redirect_meta_TOSR | TOSR context saved at prediction time |
| in.redirect_meta_NOS | NOS context saved at prediction time |
| in.redirect_callAddr | redirect address |
| in.commit_push_valid | the FTQ execution result confirms a correct call; push |
| in.commit_pop_valid | the FTQ execution result confirms a correct ret; pop |
| in.commit_push_addr | the return address in the update information |
| in.commit_meta_TOSW | TOSW in the update information |
| in.commit_meta_TOSR | TOSR in the update information |
| in.commit_meta_ssp | ssp in the update information |
| in.commit_meta_sctr | sctr in the update information |

The relationship between the RASStack module and the BasePredictor interface is as follows:

| Stack interface | Derivation | Description |
|-----------------|------------|-------------|
| s.spec_push_valid | io.s2_fire(2) && s2_full_pred.hit_taken_on_call && !io.s3_redirect(2) | the s2 input is valid and the upstream prediction is a taken call |
| s.spec_pop_valid | io.s2_fire(2) && s2_full_pred.hit_taken_on_ret && !io.s3_redirect(2) | the s2 input is valid and the upstream prediction is a taken ret |
| s.spec_push_addr | s2_full_pred.fallThroughAddr + Mux(s2_full_pred.last_may_be_rvi_call, 2.U, 0.U) | the fallThroughAddr predicted by the upstream s2 stage; whether to add 2 depends on whether the call is a compressed instruction |
| s.redirect_isCall | redirect.bits.level === 0.U && recover_cfi.pd.isCall | |
| s.redirect_isRet | redirect.bits.level === 0.U && recover_cfi.pd.isRet | |
| s.redirect_meta_* | redirect.bits.cfiUpdate.* | |
| s.commit_push_valid | io.update.is_call_taken | the call instruction was predicted correctly |
| s.commit_pop_valid | io.update.is_ret_taken | the ret instruction was predicted correctly |
| s.commit_push_addr | update.ftb_entry.getFallThrough(update.pc) + Mux(update.ftb_entry.last_may_be_rvi_call, 2.U, 0.U) | the address is +2 or +0 depending on whether the instruction is compressed |
| s.commit_meta_* | io.update.bits.meta.asTypeOf(new RASMeta) | |
| io.out.last_stage_spec_info.* | s3_meta.* | s3_meta = RegEnable(s2_meta, io.s2_fire(2)), i.e., s2_meta delayed by one cycle |
| io.out.last_stage_meta | s3_meta | |
| io.out.s2.full_pred.*.jalr_target | := stack.spec_pop_addr | the predicted address (stack-top address; only ret targets are predicted) |
| io.out.s3.full_pred.*.jalr_target | := RegEnable(stack.spec_pop_addr, io.s2_fire(2)) | s2's value delayed by one cycle |
| io.out.s2/3.full_pred.targets.last | := Mux(s2/3_is_jalr, s2/3_jalr_target, s2/3_last_target_in) | if the slot is a jalr, the targets.last result is updated with the jalr target |

Timing Description

In RAS, there are only 2 stages, S2 and S3.

Main tasks in S2

  1. Based on the upstream predictors' S2 result, complete the prediction through the push/pop process and produce the result spec_pop_addr.
  2. Perform update operations based on the commit signal.

Main tasks in S3

  1. Based on the upstream predictors' S3 result and the operations performed in S2, decide whether to undo or complete the S2 push/pop. The predictor assumes the S3 result is better than S2's; if the two disagree, the RAS predictor adopts the S3 result for its stack operation.

  2. The prediction process in S3 is the same as in S2, but the data is different.

  3. Perform redirection operations (redirection information obtained from the previous cycle).

Since the RASStack must appear to complete all of these tasks within a single cycle, data bypassing is needed inside the stack to forward in-flight data before it has been written back.

4 - Feature List

The Feature List in hardware verification lists various functions and characteristics of the design that need to be verified. This section provides a basic feature list for each verification task for reference.

The Feature List in hardware verification lists various functions and characteristics of the design that need to be verified. Before starting verification, we generally list the features of the design to clarify which features need to be verified in this task.

To verify a feature in the feature list, we will build the necessary test points for this feature and write test cases to cover these test points. Therefore, the feature list is not only a description of the features of the design to be verified but also a guide for our verification direction and a simple summary of the verification task.

In this verification, we provide a basic feature list for each verification task (i.e., the top-level BPU and each sub-predictor); it covers only the basic functions of the corresponding module. Your verification must cover these features, and you may additionally supplement features not on the list to improve the final delivery quality.

In this document section, you only need to refer to the document corresponding to the module you will verify.

4.1 - BPU Top Feature List

Feature List

  1. Support uFTB sub-predictor
  2. Support TAGE-SC sub-predictor
  3. Support FTB sub-predictor
  4. Support ITTAGE sub-predictor
  5. Support RAS sub-predictor
  6. Support three-stage prediction result and other information output
  7. Support prediction result redirection signal generation
  8. Support pipeline control signal generation
  9. Support PC generation
  10. Support global branch history maintenance
  11. Support branch folding history maintenance
  12. Support redirection request response, state restoration
  13. Support update request response

4.2 - uFTB Feature List

Feature List

  1. Support prediction based on FTB entry
  2. Support two-bit saturating counter
  3. Support basic prediction result output and meta information output for the s1 channel
  4. Support update request response, updating internal FTB and two-bit saturating counter.

4.3 - FTB Feature List

Feature List

  1. Support FTB entry storage
  2. Support basic prediction result output and meta information output for the s2, s3 channels
  3. Support update request response, updating internal FTB entries

4.4 - TAGE-SC Feature List

Feature List

  1. s2 TAGE outputs prediction results
  2. s3 SC outputs prediction results
  3. s2 TAGE outputs meta information
  4. s3 SC outputs meta information
  5. TAGE performs update training
  6. Check for new table entry requests
  7. Globally reset useful
  8. SC performs update training

4.5 - ITTAGE Feature List

Feature List

  1. s2 ITTAGE determines whether to generate a prediction result
  2. s3 ITTAGE reads the predicted target for a branch
  3. s3 ITTAGE outputs meta information
  4. ITTAGE performs update training
  5. Check for new table entry requests
  6. Globally reset useful
  7. Prediction direction

4.6 - RAS Feature List

Feature List

  1. Supports enabling and disabling of the predictor
  2. Supports s3 prediction result overriding s2 prediction result
  3. Supports push and pop operations of the RAS stack
  4. Supports redirect operation of the RAS stack
  5. Supports update operation of the RAS stack
  6. Supports base predictor interface
  7. Conforms to the standard RAS predictor prediction process