BPU Top Module

The overall function and structure of the BPU top level have been roughly described in previous documents. For those verifying the BPU top level, a more detailed description might be needed. Due to the many functions of the BPU top level, this section divides the BPU into several major functional points for further description. However, since there are too many details at the BPU top level, further details need to be understood by referring to the code.

Generator Maintenance Method

From the basic design documents of Xiangshan, we know that the BPU top level maintains various variables in the s0 cycle through generators, such as PC, branch history, etc. The core concept is to decide which pipeline level’s result to adopt through the redirection signal of the prediction result.

There are a total of 6 generators in the BPU top level:

  • npcGen maintains the PC
  • ghistPtrGen maintains the global history pointer
  • ghvBitWriteGens maintains global history write data
  • foledGhGen maintains the folded history
  • lastBrNumOHGen maintains the position of the last effective branch instruction in the previous cycle
  • aheadFhObGen maintains the oldest position of the branch history

Except for npcGen, the rest of the generators will be introduced in this document. In this section, we will focus on the method of generating the next prediction for the generators.

In the code, you can see generators defined in a similar way:

val npcGen = new PhyPriorityMuxGenerator[UInt]

Next, the code registers the data sources for the generators through multiple statements:

npcGen.register(true.B, reg ...)
npcGen.register(s1_valid, s1_target, ...)
npcGen.register(s2_redirect, s2_target, ...)
npcGen.register(s3_redirect, s3_target, ...)
npcGen.register(do_redirect.valid, do_redirect.bits.cfiUpdate.target, ...)

Each line is called a registration. In a registration, the first signal parameter is the data valid signal, and the second signal parameter contains the specific data. The priority of the generator is also determined in the order of registration; the later the registration, the higher the priority. Therefore, the priority at the same time, from low to high, is as follows:

  • s0 blocked data
  • Data updated based on s1 prediction results
  • Data updated based on s2 prediction results
  • Data updated based on s3 prediction results
  • Data in external redirection of BPU

In this way, when the redirection of the prediction result is valid, we can avoid using the earlier pipeline level’s prediction result and adopt the corrected prediction result. This allows us to handle external redirection requests with the highest priority.

We can conclude the method by which all generators generate s0 signals: Among all data valid signals, if only one is valid, the corresponding data is selected; if multiple data valid signals are valid, the data with the highest priority is selected.

Global Branch History

We know that the global branch history is maintained at the BPU top level, and the maintenance strategy is consistent with the PC maintenance strategy. That is, after the prediction result is generated at each stage of the pipeline, the global branch history is updated based on the corresponding signals.

The top level defines two sets of signals to maintain the global branch history:

  • ghv stores the global branch history (maximum length 256)
  • ghist_ptr global branch history pointer, pointing to the current position of the global branch history

Similar to s0_pc, s1_pc, s2_pc, the BPU top level also maintains signals for each stage of the global history pointer: s0_ghist_ptr, s1_ghist_ptr, s2_ghist_ptr. However, the content in ghv is fixed in position, and we only use ghist_ptr to locate where the current global branch history starts.

Calculating the Current Global Branch History with ghist_ptr

The use of ghist_ptr is only visible at the BPU top level. What we pass to the sub-predictors is the global branch history after the data in the global history register is shifted based on ghist_ptr. In the global branch history obtained by the sub-predictor, the least significant bit corresponds to the newest bit of the global branch history, and the most significant bit corresponds to the oldest bit.

So how is the shifting done? First, let’s see how the global history is stored in ghv.

|===== ghist =====>| =======>|
n                  ^         0
                   ghist_ptr

As shown in the figure above, the sequence represents the entire ghv register, and ghist_ptr points to a position in ghv. This position represents the newest bit of the global branch history. When a new global history record needs to be added, ghist_ptr is first decremented by 1, and then this bit is written to the position it points to. When ghist_ptr is decremented to 0, it will loop back to point to the highest position, thereby overwriting the previously written global branch history.

No matter what, starting from the position pointed to by ghist_ptr, the pointer increases and the history gets older. Therefore, when we need to calculate the current global branch history, we only need to circularly right-shift the ghv register by ghist_ptr positions.

Updating Global Branch History

The strategy for updating the global branch history is consistent with the strategy for updating the pc. At each pipeline stage, a pointer for the current stage and an update description of ghv are generated based on the prediction result of the current stage, and all are sent to the relevant generator for processing.

The update strategy for the global branch history is consistent with the pc update strategy, requiring the generation of a current stage pointer and ghv update instructions based on the current stage’s prediction results at each pipeline stage. These instructions are ultimately sent to the relevant generators for processing.

The update description of ghv is some information used to guide the update of the ghv register. Xiangshan BPU maintains two pieces of information to fulfill this duty:

  • ghv_wdata the data that needs to be written into ghv
  • ghv_wens the write bit mask

For the final update, only the bits identified by ghv_wens need to be written with the corresponding bits of ghv_wdata.

Therefore, each pipeline stage needs to generate three sets of information: ghist_ptr, ghv_wdata, ghv_wens.

Specifically, the prediction result can contain up to two branch instructions. We only need to set these pieces of information according to the actual situation. Here are some examples:

  • Only the first slot is valid, and the conditional branch instruction in it is predicted as not taken. Then the next position of ghv_wens is set to 0, the corresponding position of ghv_wens is set to 1, and ghist_ptr is decremented by one.
  • Both slots contain conditional branch instructions, the first is predicted as not taken, and the second is predicted as taken. At this time, ghist_ptr should be decremented by two, and the other two pieces of information should indicate writing 01 to ghv.

Here, only one piece of ghv_wdata information is maintained in the generator (maintained by the ghvBitWriteGens generator), and ghv_wens is not maintained by the generator. This is because a small trick is used here, where the final output of the generator’s ghv_wdata is the result of the selected stage, and ghv_wens is used by performing a bitwise OR operation on ghv_wens of all stages.

This consideration is based on:

  • If the later pipeline stage is valid, the global history pointer is restored to an older position, even if the history of newer positions is modified by the earlier pipeline’s ghv_wens.
  • If the earlier pipeline stage is valid, the global history pointer continues to update to newer positions, and the later pipeline will not set ghv_wens due to the ineffective redirect.

Branch Folded History

The branch folded history passed to the predictor is also maintained by the BPU top level. To shorten the update delay of the folded history, the BPU maintains many variables to support the fast update of the branch folded history. We will focus on this strategy and introduce the function of each variable.

Before we start, let’s first look at how the branch folded history is defined and its structure.

Branch Folded History

If you have checked the BPU global interface documentation, you will know that the sub-predictor receives an array of bit vectors of different lengths, representing various lengths of folded history, and these folded histories are compressed from the global branch history.

For the global branch history, we have a register that stores the global branch history with a length of 256. For example, suppose the length of the global branch history is 15 bits, and after shifting, we get a branch history like this: the least significant bit is the newest history record, and the most significant bit is the oldest history record.

At this time, if we need to generate a 6-bit folded history from these 15 bits, we will use an XOR strategy for compression. The specific process is as follows:

    h[5]         h[4]       h[3]    h[2]   h[1]   h[0]
    h[11]        h[10]      h[9]    h[8]   h[7]   h[6]
^                                   h[14]  h[13]  h[12]
---------------------------------------------------------------
    h[5]^h[11]   h[4]^h[10]         ...           h[0]^h[6]^h[12]

That is, after arranging it as shown above, perform an XOR operation on the values at each position, and the result is the folded history of length 6.

Method for Updating Branch Folded History

Now we want to update this branch folded history. When we insert a new history into the global branch history, it is inserted from the least significant bit, meaning the original h[0] becomes h[1]. If we want to obtain the folded history at this time, we only need to perform the XOR operation again. But such efficiency is too low, because the XOR operation may become particularly long. We can explore the impact of one update on the branch folded history.

In the example above, before inserting a new history, the 6-bit folded history is generated in the following arrangement:

h[5]   h[4]   h[3]  h[2]  h[1]  h[0]
h[11]  h[10]  h[9]  h[8]  h[7]  h[6]
                    h[14] h[13] h[12]

After inserting a new history, it becomes like this:

h[4]   h[3]   h[2]  h[1]  h[0]  h[new]
h[10]  h[9]   h[8]  h[7]  h[6]  h[5]
           (h[14])  h[13] h[12] h[11]

We can notice some patterns:

Before insertion:
h[5]   {h[4]   h[3]  h[2]  h[1]  h[0] }
h[11]  {h[10]  h[9]  h[8]  h[7]  h[6] }
       {             h[14] h[13] h[12]}
After insertion:
{h[4]   h[3]   h[2]  h[1]  h[0] } h[new]
{h[10]  h[9]   h[8]  h[7]  h[6] } h[5]
{           (h[14])  h[13] h[12]} h[11]

The content in the curly braces has undergone a complete left shift, with h[5] and h[11] moving from the most significant bit to the least significant bit. So, in the compressed history, isn’t this just a typical cyclic left shift that we often encounter!

However, only two bits have changed: one is the newly inserted h[new], and the other is the discarded h[14]. h[new] must be in the first position, and the discarded position is fixed. Therefore, to complete an update, we only need to know the value of the newly inserted history and the oldest bit of the previous history. After the cyclic shift, modifying these two positions according to the actual situation will give us the updated folded history.

Implementation of Update Method

To achieve this update in the top-level BPU, two additional variables are maintained:

  • ahead_fh_oldest_bits: the oldest bit of the global branch history, with additional bits stored before it
  • last_br_num_oh: the slot number of the last effective branch instruction in the previous prediction

An optimization in the timing of operations occurs here because the global history pointer can only be updated based on the branch outcome when the pipeline-level prediction result is available. Updating the oldest bit after updating the global history pointer would increase the latency. Therefore, we maintain the branch outcome and update the oldest bit when it is needed in the next cycle.

The oldest bit also needs to be maintained further back because after updating using the branch outcome, the relatively newer bits will become the oldest bits.

Thus, there are three generators related to the folded history: foldedGhGen, lastBrNumOhGen, and aheadFhObGen.

Information required for each update of the folded history

  • Folded history information before the update
  • Oldest bit of the global branch history (ahead_fh_oldest_bits)
  • Last prediction’s branch outcome (last_br_num_oh)
  • Whether there is a branch instruction in this update
  • The branch outcome of this update: the slot number of the last effective branch instruction

For each update of the folded history, the true oldest bit needs to be determined based on last_br_num_oh and ahead_fh_oldest_bits. Then, based on the oldest bit and the branch outcome of this update, several bits are modified, and finally, a cyclic left shift completes the update operation.

Pipeline Control Method

Pipeline control is the core function of the BPU, with complex logic. All pipeline control signals in the top-level BPU are as follows:

  • s1_valid, s2_valid, s3_valid: indicate the corresponding pipeline data is valid
  • s1_ready, s2_ready, s3_ready: indicate the corresponding pipeline is ready to continue the prediction of the previous pipeline stage
  • s1_component_ready, s2_component_ready, s3_component_ready: indicate the readiness of the corresponding pipeline sub-predictor
  • s0_fire, s1_fire, s2_fire, s3_fire: successful handshake signals, indicating that the pipeline data is valid and has been successfully passed to the next pipeline
  • s1_flush, s2_flush, s3_flush: indicate whether the current pipeline needs flushing
  • s2_redirect, s3_redirect: indicate whether the current pipeline needs to redirect due to a different prediction result

valid, ready 与 fire

We will introduce the purpose of each signal step by step. First, let’s look at the fire signal, which indicates a successful handshake in the pipeline, meaning that the data has been successfully passed to the next pipeline. This signifies the end of the current cycle and the end of the prediction for this pipeline stage, with the prediction for the next pipeline stage about to begin in the next cycle.

This requires two conditions:

  1. valid: The data in the current pipeline stage is valid.
  2. ready: Indicates whether the next pipeline stage in the BPU top level and the predictor are ready.

When these two signals are high simultaneously, the fire signal is valid, indicating a successful handshake. If we isolate a single prediction, the timing should look like this (in reality, most of the time, each pipeline is valid continuously):

Of the four sets of signals mentioned earlier, component_ready is generated by the predictor, while the rest are maintained by the BPU top level, with only the fire set of signals exposed to the sub-predictor.

Next, let’s take s2 as an example to see how each signal is maintained.

ready Signal

s2_ready := s2_fire || !s2_valid

This assignment statement is a combinational circuit assignment, meaning that s2_ready is directly related to s2_fire and s2_valid for this cycle, with two possible situations:

  • If s2_valid is invalid for this cycle, indicating that the s2 pipeline stage is empty and can accept new data, then s2_ready is valid.
  • If s2_valid is valid for this cycle, indicating that the s2 pipeline stage has data that has not been passed to the next stage yet, but if s2_fire, then the data will be passed in the next cycle. In this case, s2_ready is valid, indicating that the data can flow into the next stage.

valid Signal

Maintaining ·s2_valid· is relatively simple, as it is only related to s1_fire and s2_ready signals. The relationship is as follows:

  • When s1_fire is valid, indicating that data is coming in and s2_valid will be valid in the next cycle.
  • When s2_fire is valid, indicating that data is flowing out and s2_valid will be invalid in the next cycle.

fire Signal

The fire signal is somewhat special, but for intermediate pipeline stages, its maintenance is straightforward. For example,

s2_fire := s2_valid && s3_components_ready && s3_ready

This simply requires considering the valid of the current pipeline stage and the ready of the next pipeline stage.

However, for s0_fire, since there is no valid signal, it is directly set to s1_components_ready && s1_ready.

For s3_fire, as there is no ready signal for the next stage, it is directly set to s3_valid.

Incorporating Flush and Redirect

When there is a different prediction result in the pipeline, a redirection signal is generated, and the pipeline needs to be flushed. flush and redirect handle these two tasks. redirect indicates whether the current pipeline stage needs redirection, while flush indicates whether the current pipeline stage needs flushing.

redirect Signal

The generation of s2_redirect is as follows:

s2_redirect := s2_fire && s2_redirect_s1_last_pred

This means that when s2_fire is valid and the prediction result of s2 is different from the prediction result saved from s1, this signal is valid. Later, this signal will be connected to the input of the sub-predictor and the output of the BPU prediction result, guiding the sub-predictor and FTQ to restore their states.

flush Signal

The flush signal is used to guide the flushing of the pipeline. For example, when s3_redirect is valid, it means that the incorrect prediction result has entered the pipeline, and both s1 and s2 are now predicting based on the wrong result. Therefore, the pipeline needs to be flushed to stop the previous stages and wait for new prediction results to enter.

Specifically, the relationship between them is:

 s2_flush := s3_flush || s3_redirect
 s1_flush := s2_flush || s2_redirect

This means that if a pipeline stage needs redirection, all previous stages will be flushed. The purpose of flush is to guide the valid signal. If the valid signal is valid in this cycle but the fire signal is not, it means that the incorrect data has not been taken by the next pipeline stage. In this case, when flush is valid, valid will immediately become invalid in the next cycle, avoiding storing incorrect data in the pipeline for a long time.

However, the effect of flush on the valid signal varies depending on each pipeline stage. For example:

  • For the s1 pipeline stage, although flush is valid, if s0_fire is valid, indicating that new data is flowing in, valid will remain valid in the next cycle.
  • For the s2 pipeline stage, if flush is valid, valid will definitely be invalid in the next cycle (because s1 is also flushed), indicating that valid can be directly set to invalid. However, there is a special case where s2_redirect occurs but s2_flush is not set to valid. In this case, if s1_fire occurs, the incorrect prediction result of s1 may also flow in. In this case, s2_valid needs to be determined based on the s1_flush signal.

The use of flush is complex, and more detailed details need to be understood by referring to the code.

Redirect Recovery Logic

When the redirect request from FTQ to BPU takes effect, it indicates that all stages of the pipeline have incorrect predictions, and all stages should be flushed. This can be achieved by setting s3_flush to be valid. Therefore, we have:

 s3_flush := redirect_req.valid

In the BPU, the redirect request is delayed by one cycle before being officially used. Therefore, the response of s1_valid to the flush signal needs to be changed. When the redirect request (before delay) is valid, s1_valid in the next cycle is immediately set to invalid, without the need to refer to the s0_fire signal.

At this point, generators such as npcGen also need to directly use the data from the redirect request to generate, which is equivalent to redirecting the BPU’s state to the state before the error occurred. However, it is important to note that the default redirect level in BPU is flushAfter, which means that the redirect request corresponds to a predicted erroneous instruction, and the BPU assumes that although this instruction was predicted incorrectly, it has been corrected and executed by the backend. Therefore, the next prediction can start directly from the next instruction.

Therefore, when recovering from a redirect, it is necessary not only to restore the information from the redirect interface, but also to update the execution status of this predicted erroneous instruction in the history.

Last modified September 13, 2024: Update the picture of BPU Top. (431c050)