BPU Top Module
The overall function and structure of the BPU top level have been roughly described in previous documents; for those verifying the BPU top level, a more detailed description may be needed. Since the BPU top level has many functions, this section divides it into several major functional points for further description. The top level also contains many details that this section cannot cover; for those, refer to the code.
Generator Maintenance Method
From the basic design documents of Xiangshan, we know that the BPU top level maintains various s0-stage variables, such as the PC and the branch history, through generators. The core idea is to use the redirect signals of the prediction results to decide which pipeline stage's result to adopt.
There are a total of 6 generators in the BPU top level:
- npcGen maintains the PC
- ghistPtrGen maintains the global history pointer
- ghvBitWriteGens maintains global history write data
- foldedGhGen maintains the folded history
- lastBrNumOHGen maintains the position of the last effective branch instruction in the previous cycle
- aheadFhObGen maintains the oldest position of the branch history
Except for npcGen, the rest of the generators will be introduced in this document. In this section, we focus on how the generators produce the value for the next prediction.
In the code, you can see generators defined in a similar way:
val npcGen = new PhyPriorityMuxGenerator[UInt]
Next, the code registers the data sources for the generators through multiple statements:
npcGen.register(true.B, reg, ...)
npcGen.register(s1_valid, s1_target, ...)
npcGen.register(s2_redirect, s2_target, ...)
npcGen.register(s3_redirect, s3_target, ...)
npcGen.register(do_redirect.valid, do_redirect.bits.cfiUpdate.target, ...)
Each line is called a registration. In a registration, the first argument is the data-valid signal, and the second contains the data itself. Priority is determined by registration order: the later the registration, the higher the priority. Therefore, for sources valid in the same cycle, the priority from low to high is:
- s0 blocked data
- Data updated based on s1 prediction results
- Data updated based on s2 prediction results
- Data updated based on s3 prediction results
- Data in external redirection of BPU
In this way, when the redirection of the prediction result is valid, we can avoid using the earlier pipeline level’s prediction result and adopt the corrected prediction result. This allows us to handle external redirection requests with the highest priority.
We can conclude the method by which all generators generate s0 signals: Among all data valid signals, if only one is valid, the corresponding data is selected; if multiple data valid signals are valid, the data with the highest priority is selected.
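This selection rule can be sketched in a few lines of Python (illustrative only; the real PhyPriorityMuxGenerator is Chisel, and the data values here are placeholder strings):

```python
def select(registrations):
    """registrations: list of (valid, data) pairs, in registration order.
    Among all valid entries, the one registered last (highest priority) wins."""
    result = None
    for valid, data in registrations:
        if valid:
            result = data  # later valid entries overwrite earlier ones
    return result

# Example: s2 redirects, so its target overrides the s1 prediction.
regs = [
    (True,  "s0_stalled_pc"),   # always-valid fallback (s0 blocked data)
    (True,  "s1_target"),
    (True,  "s2_target"),       # s2_redirect valid
    (False, "s3_target"),
    (False, "redirect_target"),
]
print(select(regs))  # prints s2_target
```

If only one valid signal is high, its data is selected; if several are high, the highest-priority one wins, matching the rule above.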
Global Branch History
We know that the global branch history is maintained at the BPU top level, and the maintenance strategy is consistent with the PC maintenance strategy. That is, after the prediction result is generated at each stage of the pipeline, the global branch history is updated based on the corresponding signals.
The top level defines two sets of signals to maintain the global branch history:
- ghv stores the global branch history (maximum length 256)
- ghist_ptr global branch history pointer, pointing to the current position of the global branch history
Similar to s0_pc, s1_pc, and s2_pc, the BPU top level also maintains a copy of the global history pointer for each stage: s0_ghist_ptr, s1_ghist_ptr, s2_ghist_ptr. However, the contents of ghv are fixed in position; we only use ghist_ptr to locate where the current global branch history starts.
Calculating the Current Global Branch History with ghist_ptr
ghist_ptr is used only inside the BPU top level. What is passed to the sub-predictors is the global branch history obtained by shifting the data in the global history register according to ghist_ptr. In the history a sub-predictor receives, the least significant bit is the newest bit of the global branch history and the most significant bit is the oldest.
So how is this shifting done? First, let's look at how the global history is stored in ghv.
|===== ghist =====>|=======>|
n                  ^        0
               ghist_ptr
As shown in the figure above, the sequence represents the entire ghv register, and ghist_ptr points to a position in ghv that holds the newest bit of the global branch history. When a new history bit needs to be recorded, ghist_ptr is first decremented by 1, and the bit is then written to the position it points to. When ghist_ptr reaches 0, it wraps around to the highest position, overwriting previously written history.
In any case, starting from the position pointed to by ghist_ptr, the history gets older as the index increases. Therefore, to compute the current global branch history, we only need to rotate the ghv register right by ghist_ptr positions.
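The rotation can be modeled in Python as follows (a behavioral sketch, not the Chisel source; ghv is represented as a list of bits indexed from 0):

```python
def current_ghist(ghv_bits, ghist_ptr, hist_len):
    """Rotate ghv right by ghist_ptr positions, so that the bit at
    ghv[ghist_ptr] (the newest history bit) lands at index 0, and the
    history gets older as the index increases."""
    n = len(ghv_bits)
    return [ghv_bits[(ghist_ptr + i) % n] for i in range(hist_len)]

# With ghist_ptr = 6, the newest bit is ghv[6], then ghv[7], then wrap to ghv[0]...
print(current_ghist([1, 1, 0, 0, 1, 0, 1, 0], 6, 4))  # prints [1, 0, 1, 1]
```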
Updating Global Branch History
The strategy for updating the global branch history is consistent with that for updating the pc: at each pipeline stage, a pointer for that stage and an update description of ghv are generated from the stage's prediction result, and all of them are sent to the relevant generators for processing.
The update description of ghv is the information used to guide the update of the ghv register. The Xiangshan BPU maintains two pieces of information for this purpose:
- ghv_wdata: the data to be written into ghv
- ghv_wens: the write bit mask
For the final update, only the bits selected by ghv_wens are written with the corresponding bits of ghv_wdata.
Therefore, each pipeline stage needs to generate three pieces of information: ghist_ptr, ghv_wdata, and ghv_wens.
Specifically, a prediction result can contain at most two branch instructions, and we set these pieces of information according to the actual situation. Some examples:
- Only the first slot is valid, and the conditional branch in it is predicted not taken. Then the next position of ghv_wdata is set to 0, the corresponding position of ghv_wens is set to 1, and ghist_ptr is decremented by one.
- Both slots contain conditional branches, the first predicted not taken and the second predicted taken. Then ghist_ptr is decremented by two, and the other two pieces of information indicate writing 01 into ghv.
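A Python sketch of how such examples translate into the three pieces of information (illustrative names; slot handling is simplified to a list of executed conditional-branch outcomes, oldest first):

```python
def ghv_update_for_slots(ghist_ptr, n, br_taken):
    """Return (new_ghist_ptr, ghv_wens, ghv_wdata) for an n-bit ghv,
    given the outcomes of the conditional branches in this prediction."""
    wens = wdata = 0
    ptr = ghist_ptr
    for taken in br_taken:
        ptr = (ptr - 1) % n      # ghist_ptr is decremented first...
        wens |= 1 << ptr         # ...then the outcome is written where it points
        if taken:
            wdata |= 1 << ptr
    return ptr, wens, wdata

# Second example above: first branch not taken, second taken (writes 01 into ghv).
ptr, wens, wdata = ghv_update_for_slots(5, 8, [False, True])
print(ptr, bin(wens), bin(wdata))  # prints 3 0b11000 0b1000
```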
Of these, only ghv_wdata is maintained by a generator (the ghvBitWriteGens generator); ghv_wens is not maintained by a generator. A small trick is used here: the final ghv_wdata output is that of the selected stage, while the final ghv_wens is the bitwise OR of the ghv_wens of all stages.
This works because:
- If a later pipeline stage is valid, the global history pointer is restored to an older position, so it does no harm even if newer positions were modified by an earlier stage's ghv_wens.
- If only an earlier pipeline stage is valid, the global history pointer continues to advance to newer positions, and the later stages will not assert ghv_wens, because their redirects are invalid.
Branch Folded History
The branch folded history passed to the predictor is also maintained by the BPU top level. To shorten the update delay of the folded history, the BPU maintains many variables to support the fast update of the branch folded history. We will focus on this strategy and introduce the function of each variable.
Before we start, let’s first look at how the branch folded history is defined and its structure.
Branch Folded History
If you have checked the BPU global interface documentation, you will know that the sub-predictor receives an array of bit vectors of different lengths, representing various lengths of folded history, and these folded histories are compressed from the global branch history.
For the global branch history, we have a register that stores the global branch history with a length of 256. For example, suppose the length of the global branch history is 15 bits, and after shifting, we get a branch history like this: the least significant bit is the newest history record, and the most significant bit is the oldest history record.
At this time, if we need to generate a 6-bit folded history from these 15 bits, we will use an XOR strategy for compression. The specific process is as follows:
  h[5]       h[4]       h[3]      h[2]   h[1]   h[0]
  h[11]      h[10]      h[9]      h[8]   h[7]   h[6]
^                                 h[14]  h[13]  h[12]
-----------------------------------------------------
  h[5]^h[11] h[4]^h[10] h[3]^h[9] ...    ...    h[0]^h[6]^h[12]
That is, after arranging it as shown above, perform an XOR operation on the values at each position, and the result is the folded history of length 6.
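The XOR compression can be written as a short Python model (hist[0] is the newest bit, matching the figure):

```python
def fold_history(hist, folded_len):
    """Compress a global history into folded_len bits:
    folded[i] = XOR of all hist[j] with j % folded_len == i."""
    folded = [0] * folded_len
    for i, bit in enumerate(hist):
        folded[i % folded_len] ^= bit
    return folded

# 15-bit history folded to 6 bits: folded[0] = h[0]^h[6]^h[12], etc.
h = [1, 0, 0, 0, 0, 0, 1] + [0] * 8   # h[0] = 1 and h[6] = 1 cancel in folded[0]
print(fold_history(h, 6))  # prints [0, 0, 0, 0, 0, 0]
```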
Method for Updating Branch Folded History
Now we want to update this branch folded history. When a new bit is inserted into the global branch history, it enters from the least significant end, so the original h[0] becomes h[1]. We could obtain the new folded history by simply redoing the XOR, but that is inefficient: the XOR chains can become very long. Instead, let us examine the effect of a single update on the folded history.
In the example above, before inserting a new history, the 6-bit folded history is generated in the following arrangement:
h[5] h[4] h[3] h[2] h[1] h[0]
h[11] h[10] h[9] h[8] h[7] h[6]
h[14] h[13] h[12]
After inserting a new history, it becomes like this:
h[4] h[3] h[2] h[1] h[0] h[new]
h[10] h[9] h[8] h[7] h[6] h[5]
(h[14]) h[13] h[12] h[11]
We can notice some patterns:
Before insertion:
h[5] {h[4] h[3] h[2] h[1] h[0] }
h[11] {h[10] h[9] h[8] h[7] h[6] }
{ h[14] h[13] h[12]}
After insertion:
{h[4] h[3] h[2] h[1] h[0] } h[new]
{h[10] h[9] h[8] h[7] h[6] } h[5]
{ (h[14]) h[13] h[12]} h[11]
The content in the curly braces has undergone a complete left shift, with h[5] and h[11] moving from the most significant bit to the least significant bit. So, in the compressed history, isn’t this just a typical cyclic left shift that we often encounter!
However, only two bits have changed: one is the newly inserted h[new], and the other is the discarded h[14]. h[new] must be in the first position, and the discarded position is fixed. Therefore, to complete an update, we only need to know the value of the newly inserted history and the oldest bit of the previous history. After the cyclic shift, modifying these two positions according to the actual situation will give us the updated folded history.
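The update rule above can be checked with a small Python model. fold_update rotates left by one, XORs the new bit into position 0, and XORs the discarded oldest bit h[n-1] out at its rotated position ((n-1) % l before the shift, hence n % l after). This is a behavioral sketch, not the Chisel implementation:

```python
def fold_history(hist, l):
    """Direct XOR compression; hist[0] is the newest bit."""
    folded = [0] * l
    for i, b in enumerate(hist):
        folded[i % l] ^= b
    return folded

def fold_update(folded, n, new_bit, oldest_bit):
    """Incremental update of an l-bit folded history of an n-bit global history."""
    l = len(folded)
    rotated = [folded[(i - 1) % l] for i in range(l)]  # cyclic left shift by 1
    rotated[0] ^= new_bit          # newly inserted bit lands at position 0
    rotated[n % l] ^= oldest_bit   # XOR out the discarded h[n-1]
    return rotated

# The incremental update matches a full recomputation:
h = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1]   # n = 15, h[0] newest
for new_bit in (0, 1):
    h2 = [new_bit] + h[:-1]                          # insert new bit, drop h[14]
    assert fold_update(fold_history(h, 6), 15, new_bit, h[-1]) == fold_history(h2, 6)
```

This shows why only the new bit and the oldest bit of the previous history are needed: every other position is covered by the cyclic shift.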
Implementation of Update Method
To achieve this update in the top-level BPU, two additional variables are maintained:
- ahead_fh_oldest_bits: the oldest bit of the global branch history, with additional bits stored before it
- last_br_num_oh: the slot number of the last effective branch instruction in the previous prediction
There is a timing optimization here: the global history pointer can only be updated once the stage's prediction result (the branch outcome) is available, and updating the oldest bit after updating the pointer would add further latency. Therefore, we record the branch outcome and update the oldest bit in the next cycle, when it is actually needed.
Several bits ahead of the oldest bit also need to be kept, because after an update using the branch outcome, relatively newer bits become the oldest bits.
Thus, there are three generators related to the folded history: foldedGhGen, lastBrNumOHGen, and aheadFhObGen.
Information required for each update of the folded history:
- Folded history information before the update
- Oldest bit of the global branch history (ahead_fh_oldest_bits)
- Last prediction’s branch outcome (last_br_num_oh)
- Whether there is a branch instruction in this update
- The branch outcome of this update: the slot number of the last effective branch instruction
For each update of the folded history, the true oldest bit is first determined from last_br_num_oh and ahead_fh_oldest_bits. Then, based on this oldest bit and the branch outcome of the current update, several bits are modified, and finally a cyclic left shift completes the update.
Pipeline Control Method
Pipeline control is the core function of the BPU, with complex logic. All pipeline control signals in the top-level BPU are as follows:
- s1_valid, s2_valid, s3_valid: indicate the corresponding pipeline data is valid
- s1_ready, s2_ready, s3_ready: indicate the corresponding pipeline stage is ready to accept the prediction from the previous stage
- s1_component_ready, s2_component_ready, s3_component_ready: indicate the readiness of the corresponding pipeline sub-predictor
- s0_fire, s1_fire, s2_fire, s3_fire: successful handshake signals, indicating that the pipeline data is valid and has been successfully passed to the next pipeline
- s1_flush, s2_flush, s3_flush: indicate whether the current pipeline needs flushing
- s2_redirect, s3_redirect: indicate whether the current pipeline needs to redirect due to a different prediction result
valid, ready, and fire
We will introduce the purpose of each signal step by step. First, the fire signal indicates a successful handshake in the pipeline, meaning the data has been successfully passed to the next stage. This marks the end of the current stage's prediction, with the next stage's prediction about to begin in the next cycle.
This requires two conditions:
- valid: the data in the current pipeline stage is valid.
- ready: the next pipeline stage in the BPU top level and its sub-predictor are ready.
When both signals are high, the fire signal is asserted, indicating a successful handshake. If we isolate a single prediction, the timing is a stage-by-stage handshake (in reality, most of the time each pipeline stage is valid continuously).
Of the four sets of signals mentioned earlier, component_ready is generated by the sub-predictors, while the rest are maintained by the BPU top level; only the fire set of signals is exposed to the sub-predictors.
Next, let’s take s2 as an example to see how each signal is maintained.
ready Signal
s2_ready := s2_fire || !s2_valid
This assignment is combinational, meaning that s2_ready in a given cycle is directly determined by s2_fire and s2_valid in that cycle, with two possible situations:
- If s2_valid is invalid this cycle, the s2 stage is empty and can accept new data, so s2_ready is valid.
- If s2_valid is valid this cycle, the s2 stage holds data that has not yet been passed on; but if s2_fire holds, that data will be passed on in the next cycle, so s2_ready is valid and new data can flow in.
valid Signal
Maintaining s2_valid is relatively simple: it depends only on the s1_fire and s2_fire signals, as follows:
- When s1_fire is valid, data is flowing in, so s2_valid will be valid in the next cycle.
- When s2_fire is valid, data is flowing out, so s2_valid will be invalid in the next cycle.
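These rules can be summarized in a small cycle-level Python sketch of the s2 stage (sub-predictor component_ready is assumed always true; names are illustrative, not the Chisel source):

```python
def s2_step(s1_fire, s2_valid, s3_ready):
    """One cycle of the s2 handshake: returns (s2_fire, s2_ready, next_s2_valid)."""
    s2_fire = s2_valid and s3_ready       # valid data and the next stage is ready
    s2_ready = s2_fire or not s2_valid    # s2_ready := s2_fire || !s2_valid
    if s1_fire:
        next_valid = True                 # data flows in
    elif s2_fire:
        next_valid = False                # data flows out
    else:
        next_valid = s2_valid
    return s2_fire, s2_ready, next_valid

print(s2_step(False, True, True))   # prints (True, True, False): data passed on
print(s2_step(False, True, False))  # prints (False, False, True): stall, data held
```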
fire Signal
The fire signal is somewhat special, but for intermediate pipeline stages, its maintenance is straightforward. For example,
s2_fire := s2_valid && s3_components_ready && s3_ready
This only requires considering the valid of the current pipeline stage and the ready of the next.
However, for s0_fire, since there is no valid signal, it is directly set to s1_components_ready && s1_ready.
For s3_fire, as there is no next-stage ready signal, it is directly set to s3_valid.
Incorporating Flush and Redirect
When a pipeline stage produces a different prediction result, a redirect signal is generated and the pipeline needs to be flushed. The flush and redirect signals handle these two tasks: redirect indicates whether the current stage needs to redirect, while flush indicates whether the current stage needs to be flushed.
redirect Signal
The generation of s2_redirect is as follows:
s2_redirect := s2_fire && s2_redirect_s1_last_pred
That is, when s2_fire is valid and s2's prediction differs from the prediction saved from s1, this signal is asserted. It is then connected to the sub-predictor inputs and to the BPU's prediction output, guiding the sub-predictors and the FTQ to restore their state.
flush Signal
The flush signal is used to guide the flushing of the pipeline. For example, when s3_redirect is valid, it means that the incorrect prediction result has entered the pipeline, and both s1 and s2 are now predicting based on the wrong result. Therefore, the pipeline needs to be flushed to stop the previous stages and wait for new prediction results to enter.
Specifically, the relationship between them is:
s2_flush := s3_flush || s3_redirect
s1_flush := s2_flush || s2_redirect
This means that if a pipeline stage needs to redirect, all earlier stages are flushed. The purpose of flush is to guide the valid signal: if valid is asserted this cycle but fire is not, the incorrect data has not yet been taken by the next stage; with flush asserted, valid immediately becomes invalid in the next cycle, avoiding keeping incorrect data in the pipeline for a long time.
However, the effect of flush on the valid signal varies depending on each pipeline stage. For example:
- For the s1 stage, even if flush is valid, when s0_fire is valid (new data is flowing in), valid remains valid in the next cycle.
- For the s2 stage, if flush is valid, valid will definitely be invalid in the next cycle (because s1 is flushed as well), so valid can be set directly to invalid. There is, however, a special case: when s2_redirect occurs, s2_flush is not asserted; if s1_fire occurs at the same time, s1's incorrect prediction result may flow in, so s2_valid must be determined from the s1_flush signal.
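The per-stage effect of flush on valid can be sketched as follows (a Python model of the rules in the two bullets above; the signal priorities are assumptions drawn from this description, not the exact Chisel code):

```python
def next_s1_valid(s0_fire, s1_flush, s1_fire, s1_valid):
    """s1 stays valid if new data flows in (s0_fire), even under flush."""
    if s0_fire:
        return True
    if s1_flush or s1_fire:
        return False
    return s1_valid

def next_s2_valid(s1_fire, s1_flush, s2_flush, s2_fire, s2_valid):
    """s2 is cleared by its own flush; incoming s1 data only counts if s1 was not flushed."""
    if s2_flush:
        return False
    if s1_fire:
        return not s1_flush
    if s2_fire:
        return False
    return s2_valid

# s2_redirect case: s2 itself is not flushed, but s1 is, so the incoming data is dropped.
print(next_s2_valid(s1_fire=True, s1_flush=True, s2_flush=False,
                    s2_fire=True, s2_valid=True))  # prints False
```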
The use of flush is complex; for finer details, refer to the code.
Redirect Recovery Logic
When the redirect request from the FTQ to the BPU takes effect, it indicates that all stages of the pipeline hold incorrect predictions, and all stages should be flushed. This is achieved by asserting s3_flush. Therefore, we have:
s3_flush := redirect_req.valid
In the BPU, the redirect request is delayed by one cycle before being used. Therefore, the response of s1_valid to the flush signal must change: when the (pre-delay) redirect request is valid, s1_valid is immediately set to invalid in the next cycle, without consulting the s0_fire signal.
At this point, generators such as npcGen also need to generate directly from the data in the redirect request, which effectively restores the BPU's state to the state before the error occurred. Note, however, that the default redirect level in the BPU is flushAfter, meaning the redirect request corresponds to the mispredicted instruction itself, and the BPU assumes that although this instruction was predicted incorrectly, it has been corrected and executed by the backend, so the next prediction can start directly from the following instruction.
Therefore, when recovering from a redirect, we must not only restore the information from the redirect interface, but also update the history with the execution outcome of the mispredicted instruction.