This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Submodule Documentation

This section of the documentation will provide a detailed introduction to each module of the Xiangshan Branch Prediction Unit, including the BPU top-level and five sub-predictors.

This section of the documentation will provide a detailed introduction to each module of the Xiangshan Branch Prediction Unit, including the BPU top-level and five sub-predictors.

In each module’s documentation, we will provide a detailed explanation of the module’s role in the Xiangshan Branch Prediction Unit, as well as the module’s algorithm principles, structure, and timing.

Students responsible for verifying a specific module should read the corresponding documentation thoroughly and understand it in conjunction with the code. Other documents can also be read to help you understand the overall functionality of the Xiangshan Branch Prediction Unit. During the understanding process, you may need to constantly review the basic design concepts and interface signals described in previous documents.

1 - BPU Top Module

The overall function and structure of the BPU top level have been roughly described in previous documents. For those verifying the BPU top level, a more detailed description might be needed. Due to the many functions of the BPU top level, this section divides the BPU into several major functional points for further description. However, since there are too many details at the BPU top level, further details need to be understood by referring to the code.

Generator Maintenance Method

From the basic design documents of Xiangshan, we know that the BPU top level maintains various variables in the s0 cycle through generators, such as PC, branch history, etc. The core concept is to decide which pipeline level’s result to adopt through the redirection signal of the prediction result.

There are a total of 6 generators in the BPU top level:

  • npcGen maintains the PC
  • ghistPtrGen maintains the global history pointer
  • ghvBitWriteGens maintains global history write data
  • foledGhGen maintains the folded history
  • lastBrNumOHGen maintains the position of the last effective branch instruction in the previous cycle
  • aheadFhObGen maintains the oldest position of the branch history

Except for npcGen, the rest of the generators will be introduced in this document. In this section, we will focus on the method of generating the next prediction for the generators.

In the code, you can see generators defined in a similar way:

val npcGen = new PhyPriorityMuxGenerator[UInt]

Next, the code registers the data sources for the generators through multiple statements:

npcGen.register(true.B, reg ...)
npcGen.register(s1_valid, s1_target, ...)
npcGen.register(s2_redirect, s2_target, ...)
npcGen.register(s3_redirect, s3_target, ...)
npcGen.register(do_redirect.valid, do_redirect.bits.cfiUpdate.target, ...)

Each line is called a registration. In a registration, the first signal parameter is the data valid signal, and the second signal parameter contains the specific data. The priority of the generator is also determined in the order of registration; the later the registration, the higher the priority. Therefore, the priority at the same time, from low to high, is as follows:

  • s0 blocked data
  • Data updated based on s1 prediction results
  • Data updated based on s2 prediction results
  • Data updated based on s3 prediction results
  • Data in external redirection of BPU

In this way, when the redirection of the prediction result is valid, we can avoid using the earlier pipeline level’s prediction result and adopt the corrected prediction result. This allows us to handle external redirection requests with the highest priority.

We can conclude the method by which all generators generate s0 signals: Among all data valid signals, if only one is valid, the corresponding data is selected; if multiple data valid signals are valid, the data with the highest priority is selected.

Global Branch History

We know that the global branch history is maintained at the BPU top level, and the maintenance strategy is consistent with the PC maintenance strategy. That is, after the prediction result is generated at each stage of the pipeline, the global branch history is updated based on the corresponding signals.

The top level defines two sets of signals to maintain the global branch history:

  • ghv stores the global branch history (maximum length 256)
  • ghist_ptr global branch history pointer, pointing to the current position of the global branch history

Similar to s0_pc, s1_pc, s2_pc, the BPU top level also maintains signals for each stage of the global history pointer: s0_ghist_ptr, s1_ghist_ptr, s2_ghist_ptr. However, the content in ghv is fixed in position, and we only use ghist_ptr to locate where the current global branch history starts.

Calculating the Current Global Branch History with ghist_ptr

The use of ghist_ptr is only visible at the BPU top level. What we pass to the sub-predictors is the global branch history after the data in the global history register is shifted based on ghist_ptr. In the global branch history obtained by the sub-predictor, the least significant bit corresponds to the newest bit of the global branch history, and the most significant bit corresponds to the oldest bit.

So how is the shifting done? First, let’s see how the global history is stored in ghv.

|===== ghist =====>| =======>|
n                  ^         0
                   ghist_ptr

As shown in the figure above, the sequence represents the entire ghv register, and ghist_ptr points to a position in ghv. This position represents the newest bit of the global branch history. When a new global history record needs to be added, ghist_ptr is first decremented by 1, and then this bit is written to the position it points to. When ghist_ptr is decremented to 0, it will loop back to point to the highest position, thereby overwriting the previously written global branch history.

No matter what, starting from the position pointed to by ghist_ptr, the pointer increases and the history gets older. Therefore, when we need to calculate the current global branch history, we only need to circularly right-shift the ghv register by ghist_ptr positions.

Updating Global Branch History

The strategy for updating the global branch history is consistent with the strategy for updating the pc. At each pipeline stage, a pointer for the current stage and an update description of ghv are generated based on the prediction result of the current stage, and all are sent to the relevant generator for processing.

The update strategy for the global branch history is consistent with the pc update strategy, requiring the generation of a current stage pointer and ghv update instructions based on the current stage’s prediction results at each pipeline stage. These instructions are ultimately sent to the relevant generators for processing.

The update description of ghv is some information used to guide the update of the ghv register. Xiangshan BPU maintains two pieces of information to fulfill this duty:

  • ghv_wdata the data that needs to be written into ghv
  • ghv_wens the write bit mask

For the final update, only the bits identified by ghv_wens need to be written with the corresponding bits of ghv_wdata.

Therefore, each pipeline stage needs to generate three sets of information: ghist_ptr, ghv_wdata, ghv_wens.

Specifically, the prediction result can contain up to two branch instructions. We only need to set these pieces of information according to the actual situation. Here are some examples:

  • Only the first slot is valid, and the conditional branch instruction in it is predicted as not taken. Then the next position of ghv_wens is set to 0, the corresponding position of ghv_wens is set to 1, and ghist_ptr is decremented by one.
  • Both slots contain conditional branch instructions, the first is predicted as not taken, and the second is predicted as taken. At this time, ghist_ptr should be decremented by two, and the other two pieces of information should indicate writing 01 to ghv.

Here, only one piece of ghv_wdata information is maintained in the generator (maintained by the ghvBitWriteGens generator), and ghv_wens is not maintained by the generator. This is because a small trick is used here, where the final output of the generator’s ghv_wdata is the result of the selected stage, and ghv_wens is used by performing a bitwise OR operation on ghv_wens of all stages.

This consideration is based on:

  • If the later pipeline stage is valid, the global history pointer is restored to an older position, even if the history of newer positions is modified by the earlier pipeline’s ghv_wens.
  • If the earlier pipeline stage is valid, the global history pointer continues to update to newer positions, and the later pipeline will not set ghv_wens due to the ineffective redirect.

Branch Folded History

The branch folded history passed to the predictor is also maintained by the BPU top level. To shorten the update delay of the folded history, the BPU maintains many variables to support the fast update of the branch folded history. We will focus on this strategy and introduce the function of each variable.

Before we start, let’s first look at how the branch folded history is defined and its structure.

Branch Folded History

If you have checked the BPU global interface documentation, you will know that the sub-predictor receives an array of bit vectors of different lengths, representing various lengths of folded history, and these folded histories are compressed from the global branch history.

For the global branch history, we have a register that stores the global branch history with a length of 256. For example, suppose the length of the global branch history is 15 bits, and after shifting, we get a branch history like this: the least significant bit is the newest history record, and the most significant bit is the oldest history record.

At this time, if we need to generate a 6-bit folded history from these 15 bits, we will use an XOR strategy for compression. The specific process is as follows:

    h[5]         h[4]       h[3]    h[2]   h[1]   h[0]
    h[11]        h[10]      h[9]    h[8]   h[7]   h[6]
^                                   h[14]  h[13]  h[12]
---------------------------------------------------------------
    h[5]^h[11]   h[4]^h[10]         ...           h[0]^h[6]^h[12]

That is, after arranging it as shown above, perform an XOR operation on the values at each position, and the result is the folded history of length 6.

Method for Updating Branch Folded History

Now we want to update this branch folded history. When we insert a new history into the global branch history, it is inserted from the least significant bit, meaning the original h[0] becomes h[1]. If we want to obtain the folded history at this time, we only need to perform the XOR operation again. But such efficiency is too low, because the XOR operation may become particularly long. We can explore the impact of one update on the branch folded history.

In the example above, before inserting a new history, the 6-bit folded history is generated in the following arrangement:

h[5]   h[4]   h[3]  h[2]  h[1]  h[0]
h[11]  h[10]  h[9]  h[8]  h[7]  h[6]
                    h[14] h[13] h[12]

After inserting a new history, it becomes like this:

h[4]   h[3]   h[2]  h[1]  h[0]  h[new]
h[10]  h[9]   h[8]  h[7]  h[6]  h[5]
           (h[14])  h[13] h[12] h[11]

We can notice some patterns:

Before insertion:
h[5]   {h[4]   h[3]  h[2]  h[1]  h[0] }
h[11]  {h[10]  h[9]  h[8]  h[7]  h[6] }
       {             h[14] h[13] h[12]}
After insertion:
{h[4]   h[3]   h[2]  h[1]  h[0] } h[new]
{h[10]  h[9]   h[8]  h[7]  h[6] } h[5]
{           (h[14])  h[13] h[12]} h[11]

The content in the curly braces has undergone a complete left shift, with h[5] and h[11] moving from the most significant bit to the least significant bit. So, in the compressed history, isn’t this just a typical cyclic left shift that we often encounter!

However, only two bits have changed: one is the newly inserted h[new], and the other is the discarded h[14]. h[new] must be in the first position, and the discarded position is fixed. Therefore, to complete an update, we only need to know the value of the newly inserted history and the oldest bit of the previous history. After the cyclic shift, modifying these two positions according to the actual situation will give us the updated folded history.

Implementation of Update Method

To achieve this update in the top-level BPU, two additional variables are maintained:

  • ahead_fh_oldest_bits: the oldest bit of the global branch history, with additional bits stored before it
  • last_br_num_oh: the slot number of the last effective branch instruction in the previous prediction

An optimization in the timing of operations occurs here because the global history pointer can only be updated based on the branch outcome when the pipeline-level prediction result is available. Updating the oldest bit after updating the global history pointer would increase the latency. Therefore, we maintain the branch outcome and update the oldest bit when it is needed in the next cycle.

The oldest bit also needs to be maintained further back because after updating using the branch outcome, the relatively newer bits will become the oldest bits.

Thus, there are three generators related to the folded history: foldedGhGen, lastBrNumOhGen, and aheadFhObGen.

Information required for each update of the folded history

  • Folded history information before the update
  • Oldest bit of the global branch history (ahead_fh_oldest_bits)
  • Last prediction’s branch outcome (last_br_num_oh)
  • Whether there is a branch instruction in this update
  • The branch outcome of this update: the slot number of the last effective branch instruction

For each update of the folded history, the true oldest bit needs to be determined based on last_br_num_oh and ahead_fh_oldest_bits. Then, based on the oldest bit and the branch outcome of this update, several bits are modified, and finally, a cyclic left shift completes the update operation.

Pipeline Control Method

Pipeline control is the core function of the BPU, with complex logic. All pipeline control signals in the top-level BPU are as follows:

  • s1_valid, s2_valid, s3_valid: indicate the corresponding pipeline data is valid
  • s1_ready, s2_ready, s3_ready: indicate the corresponding pipeline is ready to continue the prediction of the previous pipeline stage
  • s1_component_ready, s2_component_ready, s3_component_ready: indicate the readiness of the corresponding pipeline sub-predictor
  • s0_fire, s1_fire, s2_fire, s3_fire: successful handshake signals, indicating that the pipeline data is valid and has been successfully passed to the next pipeline
  • s1_flush, s2_flush, s3_flush: indicate whether the current pipeline needs flushing
  • s2_redirect, s3_redirect: indicate whether the current pipeline needs to redirect due to a different prediction result

valid, ready 与 fire

We will introduce the purpose of each signal step by step. First, let’s look at the fire signal, which indicates a successful handshake in the pipeline, meaning that the data has been successfully passed to the next pipeline. This signifies the end of the current cycle and the end of the prediction for this pipeline stage, with the prediction for the next pipeline stage about to begin in the next cycle.

This requires two conditions:

  1. valid: The data in the current pipeline stage is valid.
  2. ready: Indicates whether the next pipeline stage in the BPU top level and the predictor are ready.

When these two signals are high simultaneously, the fire signal is valid, indicating a successful handshake. If we isolate a single prediction, the timing should look like this (in reality, most of the time, each pipeline is valid continuously):

Of the four sets of signals mentioned earlier, component_ready is generated by the predictor, while the rest are maintained by the BPU top level, with only the fire set of signals exposed to the sub-predictor.

Next, let’s take s2 as an example to see how each signal is maintained.

ready Signal

s2_ready := s2_fire || !s2_valid

This assignment statement is a combinational circuit assignment, meaning that s2_ready is directly related to s2_fire and s2_valid for this cycle, with two possible situations:

  • If s2_valid is invalid for this cycle, indicating that the s2 pipeline stage is empty and can accept new data, then s2_ready is valid.
  • If s2_valid is valid for this cycle, indicating that the s2 pipeline stage has data that has not been passed to the next stage yet, but if s2_fire, then the data will be passed in the next cycle. In this case, s2_ready is valid, indicating that the data can flow into the next stage.

valid Signal

Maintaining ·s2_valid· is relatively simple, as it is only related to s1_fire and s2_ready signals. The relationship is as follows:

  • When s1_fire is valid, indicating that data is coming in and s2_valid will be valid in the next cycle.
  • When s2_fire is valid, indicating that data is flowing out and s2_valid will be invalid in the next cycle.

fire Signal

The fire signal is somewhat special, but for intermediate pipeline stages, its maintenance is straightforward. For example,

s2_fire := s2_valid && s3_components_ready && s3_ready

This simply requires considering the valid of the current pipeline stage and the ready of the next pipeline stage.

However, for s0_fire, since there is no valid signal, it is directly set to s1_components_ready && s1_ready.

For s3_fire, as there is no ready signal for the next stage, it is directly set to s3_valid.

Incorporating Flush and Redirect

When there is a different prediction result in the pipeline, a redirection signal is generated, and the pipeline needs to be flushed. flush and redirect handle these two tasks. redirect indicates whether the current pipeline stage needs redirection, while flush indicates whether the current pipeline stage needs flushing.

redirect Signal

The generation of s2_redirect is as follows:

s2_redirect := s2_fire && s2_redirect_s1_last_pred

This means that when s2_fire is valid and the prediction result of s2 is different from the prediction result saved from s1, this signal is valid. Later, this signal will be connected to the input of the sub-predictor and the output of the BPU prediction result, guiding the sub-predictor and FTQ to restore their states.

flush Signal

The flush signal is used to guide the flushing of the pipeline. For example, when s3_redirect is valid, it means that the incorrect prediction result has entered the pipeline, and both s1 and s2 are now predicting based on the wrong result. Therefore, the pipeline needs to be flushed to stop the previous stages and wait for new prediction results to enter.

Specifically, the relationship between them is:

 s2_flush := s3_flush || s3_redirect
 s1_flush := s2_flush || s2_redirect

This means that if a pipeline stage needs redirection, all previous stages will be flushed. The purpose of flush is to guide the valid signal. If the valid signal is valid in this cycle but the fire signal is not, it means that the incorrect data has not been taken by the next pipeline stage. In this case, when flush is valid, valid will immediately become invalid in the next cycle, avoiding storing incorrect data in the pipeline for a long time.

However, the effect of flush on the valid signal varies depending on each pipeline stage. For example:

  • For the s1 pipeline stage, although flush is valid, if s0_fire is valid, indicating that new data is flowing in, valid will remain valid in the next cycle.
  • For the s2 pipeline stage, if flush is valid, valid will definitely be invalid in the next cycle (because s1 is also flushed), indicating that valid can be directly set to invalid. However, there is a special case where s2_redirect occurs but s2_flush is not set to valid. In this case, if s1_fire occurs, the incorrect prediction result of s1 may also flow in. In this case, s2_valid needs to be determined based on the s1_flush signal.

The use of flush is complex, and more detailed details need to be understood by referring to the code.

Redirect Recovery Logic

When the redirect request from FTQ to BPU takes effect, it indicates that all stages of the pipeline have incorrect predictions, and all stages should be flushed. This can be achieved by setting s3_flush to be valid. Therefore, we have:

 s3_flush := redirect_req.valid

In the BPU, the redirect request is delayed by one cycle before being officially used. Therefore, the response of s1_valid to the flush signal needs to be changed. When the redirect request (before delay) is valid, s1_valid in the next cycle is immediately set to invalid, without the need to refer to the s0_fire signal.

At this point, generators such as npcGen also need to directly use the data from the redirect request to generate, which is equivalent to redirecting the BPU’s state to the state before the error occurred. However, it is important to note that the default redirect level in BPU is flushAfter, which means that the redirect request corresponds to a predicted erroneous instruction, and the BPU assumes that although this instruction was predicted incorrectly, it has been corrected and executed by the backend. Therefore, the next prediction can start directly from the next instruction.

Therefore, when recovering from a redirect, it is necessary not only to restore the information from the redirect interface, but also to update the execution status of this predicted erroneous instruction in the history.

2 - uFTB Branch Predictor

Introduction to uFTB

uFTB is the first predictor among all the predictors in Xiangshan, and it serves as the cornerstone for other predictors to generate prediction results. uFTB works in the s1 stage. It can generate prediction results within the current cycle after obtaining s1_pc and output them in the s1 channel, without modifying other channels. It provides the position of the branch instruction and the target of the instruction. Subsequent predictors will further predict based on this result.

Its essence is an FTB item cache, which stores FTB items, and the basic prediction result will be directly generated from the read-out FTB item.

Therefore, before you start reading the document, make sure you understand the FTB items and their meanings, as well as the specific details of the prediction result interface.

Functionality of uFTB

  • Cache FTB items and generate one-cycle prediction results: uFTB maintains a small FTB item cache. After receiving PC, it reads out the FTB item corresponding to the PC within one cycle and generates an s1 stage prediction result from the FTB item.
  • Maintain two-bit saturating counters to provide basic conditional branch results: uFTB maintains two-bit saturating counters for each line of the FTB item cache. The direction prediction result is reflected in the prediction result output of uFTB.
  • Update the FTB cache and two-bit saturating counters based on update requests

uFTB Cache Structure

As mentioned above, uFTB is essentially a small cache that stores FTB items. Its approximate structure is shown in the figure below.

In the current version of Xiangshan, uFTB has a total of 32 cache lines, each cache line is called FauFTBWay, and one FTB item can be stored in each cache line.

When s1 pipeline is valid, uFTB will use s1_pc to determine which item of the uFTB cache to read out. The cache is indexed based on the tag field in PC, which is defined as pc[16:1], i.e., taking 16 bits from PC as an identifier to match a certain line in the cache.

Each line in the cache, i.e., the data request interface in FauFTBWay, has three items:

  • req_tag: Input tag identifier extracted from pc
  • resp: Output the FTB item stored in this line
  • resp_hit: Output indicates whether the FTB item in this line matches req_tag uFTB connects the tag to the data request interface of each line in the cache and selects the hit FTB item based on the resp_hit signal. Subsequent steps will generate a complete prediction result based on this FTB item.

Two-Bit Saturating Counters

uFTB maintains two-bit saturating counters for each line of the FTB item cache. As we know, an FTB item can store up to two conditional branch instructions, so each line’s two-bit saturating counters also have two, responsible for providing rough prediction results for the conditional branch instructions in them.

When indexing the FTB item, uFTB also indexes the corresponding two-bit saturating counters.

When the prediction result is generated, it will be based on the FTB item and the contents of the two two-bit saturating counters corresponding to it.

Prediction Result Generation

After indexing the corresponding FTB item and two two-bit saturating counters based on s1_pc, uFTB needs to generate a prediction result based on them. The prediction result generated by uFTB will be outputted through the s1 channel when s1 pipeline is valid, and will not be modified for s2 and s3 channels.

The signal generation method in the s1 prediction result can refer to the following list:

  • hitWhether the FTB item hits
    • Generation method: resp_hit signal in FauFTBWay, one of them is valid
  • slot_valids: Slot valid bit, indicating whether each slot in the ftb item is valid
  • targets: Jump target address corresponding to each slot
  • offsets: The offset of each instruction in the slot relative to the starting address of the prediction block
  • is_jal: Whether the prediction block contains a jal instruction
  • is_jalr: Whether the prediction block contains a jalr instruction
  • is_call: Whether the prediction block contains a call instruction
  • is_ret: Whether the prediction block contains a ret instruction
  • last_may_be_rvi_call: Signal indicating that the last slot in the prediction block may be an RVI type call instruction
  • is_br_sharing: The signal stored in the last slot (tailSlot) indicates a conditional branch instruction.
    • Generation: Exported from the corresponding field in the FTB entry.
  • fallThroughErr: The pftAddr recorded in the FTB entry is incorrect.
    • Generation: Compare whether the end address represented by pftAddr is greater than the start address of the predicted block. If it is smaller, an error has occurred, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
  • fallThroughAddr: The end address of the predicted block.
    • Generation: If fallThroughErr is invalid, it is generated based on pftAddr. Otherwise, it is set to the start address + prediction width.
  • br_taken_mask: Branch prediction result, with each branch (slot) corresponding to a bit indicating whether the branch is predicted as taken.
    • Generation: Generated based on the always_taken field in the FTB entry and the result indicated by the two-bit saturating counter.
  • jalr_target: The jump target of the jalr in this predicted block. -Generation: From the jump target in the tailSlot of the FTB entry.

uFTB Update

The update of uFTB involves updating the FTB entry cache and the two-bit saturating counter, with the update content obtained through the update interface.

In the uFTB predictor, the reading and writing of the cache and the two-bit saturating counter do not conflict, so we do not need to consider timing conflicts between reading and updating and can consider them as two independent parts.

FTB Cache Update

The update process of the FTB cache is simple. The update channel has already specified the PC and the newly generated FTB entry, so it only needs to be written to the specified position in the cache.

FTB cache updating requires two cycles:

  • In the first cycle, calculate the following based on the signals in the update:
    • Which row in the cache corresponds to the update request The PC extracted from the update request is sent to the update request channel of FauFTBWay, and the hit signal returned by each row is calculated.
  • In the second cycle, update according to the position calculated in the first cycle. If no row is hit, uFTB will use a pseudo-LRU replacement algorithm to select the row to be written. Specifically, the ftb_entry signal group in the update channel contains the complete information of the new FTB entry, which is sent to the cache row that needs to be updated.

Two-Bit Saturating Counter Update

The update of the two-bit saturating counter needs to be combined with the actual execution of the subsequent program and the branch instruction information recorded in the FTB entry, which can be obtained from the update channel.

The update of the two-bit saturating counter also requires two cycles:

  • In the first cycle, calculate which two-bit saturating counter corresponding to the conditional branch instruction in the slot needs to be updated, which needs to meet the following conditions:
    • The current branch instruction slot is valid and contains a conditional branch instruction.
    • The current branch instruction slot is not marked as always_taken. (Because after being marked as always_taken, the result of the two-bit saturating counter will not be used.)
    • The current branch instruction slot is not after the branch instruction slot where an actual jump occurred.
  • In the second cycle, update the saturating counter based on the mask generated in the first cycle and the br_taken_mask information in the update channel.

Interface List

信号类型 信号位 信号名 信号描述
input clock 输入时钟
input reset 复位信号
input [35:0] io_reset_vector 用于reset时,reset s1_pc_dup_0 提供的值
input [40:0] io_in_bits_s0_pc_0 输入位s0_pc 的 第0个复制
input [40:0] io_in_bits_s0_pc_1 同上 第1个
input [40:0] io_in_bits_s0_pc_2 同上 第2个
input [40:0] io_in_bits_s0_pc_3 同上 第3个
output [40:0] io_out_s1_pc_0 输出s1_pc 的 第0个复制
output [40:0] io_out_s1_pc_1 同上 第1个
output [40:0] io_out_s1_pc_2 同上 第2个
output [40:0] io_out_s1_pc_3 同上 第3个
output io_out_s1_full_pred_0_br_taken_mask_0 solt 0 是否被预测为 always taken
output io_out_s1_full_pred_0_br_taken_mask_1 solt 1 是否被预测为 always taken
output io_out_s1_full_pred_0_slot_valids_0 solt 0 是否启用
output io_out_s1_full_pred_0_slot_valids_1 solt 1 是否启用
output [40:0] io_out_s1_full_pred_0_targets_0 solt 0 对应的跳转目标地址
output [40:0] io_out_s1_full_pred_0_targets_1 solt 1 对应的跳转目标地址
output [3:0] io_out_s1_full_pred_0_offsets_0 solt 0 中分支指令相对于地址块起始pc的偏移
output [3:0] io_out_s1_full_pred_0_offsets_1 solt 1 中分支指令相对于地址块起始pc的偏移
output [40:0] io_out_s1_full_pred_0_fallThroughAddr 预测块的结束地址
output io_out_s1_full_pred_0_is_br_sharing solt 1(无条件跳转)是否被共享为有条件跳转指令
output io_out_s1_full_pred_0_hit
output io_out_s1_full_pred_1_br_taken_mask_0 类似 io_out_s1_pc_1 io_out_s1_full_pred_0的复制
output io_out_s1_full_pred_1_br_taken_mask_1
output io_out_s1_full_pred_1_slot_valids_0
output io_out_s1_full_pred_1_slot_valids_1
output [40:0] io_out_s1_full_pred_1_targets_0
output [40:0] io_out_s1_full_pred_1_targets_1
output [3:0] io_out_s1_full_pred_1_offsets_0
output [3:0] io_out_s1_full_pred_1_offsets_1
output [40:0] io_out_s1_full_pred_1_fallThroughAddr
output io_out_s1_full_pred_1_is_br_sharing
output io_out_s1_full_pred_1_hit
output io_out_s1_full_pred_2_br_taken_mask_0 同上
output io_out_s1_full_pred_2_br_taken_mask_1
output io_out_s1_full_pred_2_slot_valids_0
output io_out_s1_full_pred_2_slot_valids_1
output [40:0] io_out_s1_full_pred_2_targets_0
output [40:0] io_out_s1_full_pred_2_targets_1
output [3:0] io_out_s1_full_pred_2_offsets_0
output [3:0] io_out_s1_full_pred_2_offsets_1
output [40:0] io_out_s1_full_pred_2_fallThroughAddr
output io_out_s1_full_pred_2_is_br_sharing
output io_out_s1_full_pred_2_hit
output io_out_s1_full_pred_3_br_taken_mask_0 同上
output io_out_s1_full_pred_3_br_taken_mask_1
output io_out_s1_full_pred_3_slot_valids_0
output io_out_s1_full_pred_3_slot_valids_1
output [40:0] io_out_s1_full_pred_3_targets_0
output [40:0] io_out_s1_full_pred_3_targets_1
output [3:0] io_out_s1_full_pred_3_offsets_0
output [3:0] io_out_s1_full_pred_3_offsets_1
output [40:0] io_out_s1_full_pred_3_fallThroughAddr
output io_out_s1_full_pred_3_fallThroughErr
output io_out_s1_full_pred_3_is_br_sharing
output io_out_s1_full_pred_3_hit
output [222:0] io_out_last_stage_meta 输出最后阶段的元信息 io_out_last_stage_meta = {213’h0, resp_meta_pred_way_r_1, resp_meta_hit_r_1}
input io_ctrl_ubtb_enable 控制ubtb是否启用
input io_s0_fire_0 输入s0_fire_0,与 io_out_s1_pc_0 <= io_in_bits_s0_pc_0 的时钟门控相关
input io_s0_fire_1 输入s0_fire_1
input io_s0_fire_2 输入s0_fire_2
input io_s0_fire_3 输入s0_fire_3
input io_s1_fire_0 输入s1_fire_0
input io_s2_fire_0 输入s2_fire_0
input io_update_valid 更新有效性
input [40:0] io_update_bits_pc 传回的预测块pc(用于指示更新的预测块)
input [3:0] io_update_bits_ftb_entry_brSlots_0_offset solt 0 中分支指令相对于地址块起始pc的偏移
input [11:0] io_update_bits_ftb_entry_brSlots_0_lower 跳转目标地址的低位
input [1:0] io_update_bits_ftb_entry_brSlots_0_tarStat 跳转后的 pc 高位是否进退位
input io_update_bits_ftb_entry_brSlots_0_valid 是否启用
input [3:0] io_update_bits_ftb_entry_tailSlot_offset solt 1 中分支指令相对于地址块起始pc的偏移
input [19:0] io_update_bits_ftb_entry_tailSlot_lower 跳转目标地址的低位
input [1:0] io_update_bits_ftb_entry_tailSlot_tarStat 跳转后的 pc 高位是否进退位
input io_update_bits_ftb_entry_tailSlot_sharing 无条件跳转指令槽中存储条件分支指令
input io_update_bits_ftb_entry_tailSlot_valid 是否启用
input [3:0] io_update_bits_ftb_entry_pftAddr Partial Fallthrough Addr 如果预测块中没有跳转,那么程序将会顺序执行到达的地址,预测块的结束地址。
input io_update_bits_ftb_entry_carry pc+pft时是否产生进位
input io_update_bits_ftb_entry_always_taken_0 是否预测为总是跳转
input io_update_bits_ftb_entry_always_taken_1 是否预测为总是跳转
input io_update_bits_br_taken_mask_0 是否跳转
input io_update_bits_br_taken_mask_1 是否跳转

3 - TAGE-SC Branch Predictor

Introduction

TAGE-SC is the primary predictor for conditional branches in the Kunming Lake architecture, classified as an Accurate Predictor (APD). TAGE-SC can be seen as two relatively independent components: the prediction part TAGE and the verification part SC.

The Tagged Geometric History Length Predictor (TAGE) utilizes multiple prediction tables with different history lengths to exploit extensive branch history information. TAGE predicts whether a branch instruction will be taken or not taken. It consists of a base prediction table and multiple history tables. It first predicts using multiple history tables. If there is no prediction, it uses the prediction from the base table. The Statistical Corrector (SC) is a statistical corrector. SC references the prediction results of TAGE and statistical bias results. Based on these two results, it corrects the final prediction result.

In Kunming Lake, each prediction block can have up to 2 branch instructions, so TAGE can predict up to 2 conditional branch instructions simultaneously. When accessing the various history tables in TAGE, the starting address of the prediction block is used as the PC, and two prediction results are retrieved based on the same global history.

TAGE Branch Predictor in Kunming Lake

Basic Functionality

img

The core idea of the TAGE predictor is to provide prediction results with different history lengths and select the most appropriate result for feedback. In the TAGE predictor, there are a total of 1+N history record tables, where N is a configurable option. In Kunming Lake, N=4.

The base predictor based on the T0 table is the baseline predictor. During prediction, it directly looks up the “2-bit saturated counter representing the jump history information” corresponding to the address in the T0 table, and then makes a prediction based on the history information. The T0 table has only 2 bits per entry, so the history states it can record are limited.

For tables other than T0, we use Tn to represent them. During table lookup, in addition to the PC, it is also necessary to use the global jump history information H for the lookup. When a match is found, the prediction is made based on the “3-bit saturated predictor” to jump or not to jump. The higher the value of n for the Tn table, the longer the history information it uses, i.e., x<y.

For each prediction, TAGE selects the table entry with the longest global jump history among all the hit Tn entries.

  1. If the table entry exists and the prediction result confidence is high, it is used as the final prediction result.
  2. If the confidence is low, another internal counter is used to determine whether to select that entry or T0 as the final prediction.

To save space, when retrieving the Tn table, the input jump history information H needs to be compressed, a process also known as history folding.

The table entries of each prediction table include the following elements:

  1. T0 table indexed directly by PC
    1. 2-bit pred unsigned saturated counter (indicating prediction direction and confidence level)
  2. Tn table indexed by XOR of PC and folded global history
    1. 1-bit valid bit
    2. 3-bit pred unsigned saturated counter
    3. 8-bit tag (used for verifying whether the hit is intentional, not coincidental)
    4. 1-bit useful bit for controlling expiration For a prediction block, all tables may generate prediction results, requiring a selection process. Typically, the higher the Tn table number, the higher its priority.

Pipeline

TAGE contains two pipeline stages: the first stage calculates the index, and the second stage reads the results from the SRAM table.

  1. Stage 0 (s0): Input to the first pipeline stage, usually the pc and folded history. First pipeline stage operation: Calculate the index. Output to s1 via registers.
  2. Stage 1 (s1): Input to the second pipeline stage, consisting of the index and other calculated data from the first stage. Second pipeline stage operation: Memory access to SRAM, reading the prediction result. Output to s2 via registers.
  3. Stage 2 (s2): Actual prediction result. TAGE uses 2 stages for prediction, with the prediction result ready for use in the third stage after 2 stages.

Data Structures

  1. In Kunming Lake’s implementation, the T0 and Tn table structures are as follows:

    1. 预测器 作用 表项构成 项数
      基准预测器T0 用于在其他预测器的预测结果都无效时输出预测结果 2 bit ctr 饱和计数器最高位决定跳转方向 2路各2048项,每路对于一条分支指令
      预测表T1-T4 对每个预测块的输入,所有Tn表都进行预测,在所有预测有效的结果中,选择历史记录最长的结果作为最后预测结果。历史记录长度由输入的H决定 1 bit valid 有效位 3 bit ctr 饱和计数器8 bit tag 校验命中1 bit us 作为usefulness计数器 4096项、奇数项对应第一条分支指令,偶数项对应第二条分支指令

For each table Tn, the length of its input “global branch history data H” varies during querying. Assuming the total prediction history length is S, Tn and Tn+1 may use the low x, low y bits of S (the lower bits are the newer history) as query inputs. Generally, the larger the n of table Tn, the longer the history information used, i.e., x<y.

During the query of table Tn, since the historical data H is “compressed,” it may lead to a situation where the result of one PC1^H1 matches another PC2^H2 (similar to a Hash collision), resulting in indexing into invalid data (predicting PC1 indexes to predicted PC2’s data). Therefore, TAGE provides a tag identifier for each table, using an 8-bit tag in the Kunming Lake implementation to reduce the probability of collisions. The calculation method and index method for tags are different; only when the tag calculation is the same, the query result is valid.

In the Tn table entry, in addition to the saturation counter ctr and tag, there is also a 1-bit usefulness counter. When this counter is 0, it is a weak entry, indicating that the entry can be reallocated for other uses; when it is not 0, it is a strong entry, indicating that the entry cannot be reallocated for other uses.

To try to avoid the situation where all table entries are 1 and no new table entries can be allocated, TAGE expects to use the counter bankTickCtrs to clear all usefulness to 0.

Retrieval Method for T0 and Tn Tables

  • For table T0, indexing is done using the PC[11:1] bits to index 2048 table entries, so for T0, there is no possibility of not finding a match.
  • For table Tn, in addition to PC[11:1], retrieval also requires searching based on the global branch history. In Kunming Lake, the top-level branch predictor maintains a 256-bit global history record GH, which can fold the GH’s most recent n bits of history information based on the required number of bits x for the sub-predictor. That is, n is divided into ceil(x/n) units of length x, and then XOR is performed bitwise. This folded history is denoted as FH (Folded History), and the specific process can be found in the [Branch Folding History section](../00_bpu_top/#Branch Folding History). When the TAGE predictor searches for a table entry in Tn, it uses the index and tag, calculated as follows:
Calculation Formula
index = FH ^ ((pc»1)低位)
tag = FH1 ^ FH2 ^ ((pc»1)低位)

Where FH, FH1, FH2 represent the folded global branch history according to certain rules. For Tn, FH, FH1, and FH2 each have their own folding bit numbers, which may not be the same. In the Kunming Lake implementation, the configurations of the T0 and Tn tables are as follows:

表名称 FH长度 FH1长度 FH2长度 最近历史长度(用到GH中的位数)
T1 8比特 8比特 7比特 低8位,即把最新8位历史,折叠成FH、FH1、FH2
T2 11比特 8比特 7比特 低13位,即把最新13位历史,折叠成FH、FH1、FH2
T3 11比特 8比特 7比特 低32位,即把最新32位历史,折叠成FH、FH1、FH2
T4 11比特 8比特 7比特 低119位,即把最新119位历史,折叠成FH、FH1、FH2

Note: pc»1 is used because RISC-C extension is used, with 2-byte alignment, and PC itself is already aligned to 1 byte, so only 1 bit is used.

Alternative Predictor

Since the Tn table uses saturation counters for prediction, there may be situations where the output result is “not confident.” For example, in Kunming Lake, for a 3-bit saturation counter, both 100 and 011 indicate a weak prediction. To provide more choices as references for this state, the TAGE predictor also provides an “alternative predictor” mechanism, which determines whether to select the prediction result of Tn or T0 when the Tn table predicts with low confidence.

In the Kunming Lake implementation, the “alternative predictor” is implemented based on the register group useAltOnNaCtrs. It consists of two paths of 128 4-bit saturation counters each, initialized to 0b1000. When TAGE makes a prediction, it uses PC(7,1) to index the corresponding saturation counter. If the value of this counter is greater than or equal to the preset value and the prediction result of Tn is not confident, it selects the result of T0; otherwise, it selects the result of Tn.

Prediction Process

In summary, the prediction steps of the TAGE predictor in Kunming Lake are as follows:

  1. Parallel indexing of T0 and Tn tables, selecting which table to use based on the hit result:
    1. If a match to the tag of a Tn table is found, the potential prediction result is given by the saturation counter of the longest history Tn table.
    2. If no match to a Tn table is found, the final prediction result is given by the saturation counter of the T0 table.
  2. If the potential prediction result of the matched Tn table is a weak prediction (100,011), and the value of the corresponding 4-bit counter in the alternative prediction for PC is greater than or equal to a threshold, the result of the T0 table is used as the final result; otherwise, the prediction result of the Tn table is used as the final prediction result.

Training Process

Since the prediction process of TAGE involves many counters and tags, they need to be updated according to certain rules, a process known as training. This training process occurs in the BPU’s update stage, where the PC, branch history, and prediction correctness information are input. The training process for branch prediction in Kunming Lake is divided into several steps based on different conditions:

  1. Update when T0 is the final prediction result: If a jump occurs (i.e., taken), increment the ctr saturation counter indexed by the pc; otherwise, decrement it.
  2. When only T0 is hit, the following operations are performed:
    1. If T0 is predicted correctly, no additional update is performed.
    2. If T0 is predicted incorrectly, attempt to randomly allocate a new table entry in a Tn table. To allocate a new table entry, the original entry’s usefulness at the corresponding index must be 0. The new entry is initialized as a weak prediction with usefulness 0, and its tag is set to the newly calculated tag.
  3. When both T0 and Tn are hit, the following operations are performed:
    1. Tn is always updated: If a jump occurs, increment the ctr saturation counter indexed by the pc; otherwise, decrement it. It is important to note that “hit” means that the tag of the indexed entry matches the calculated tag.
    2. If T0 and Tn produce the same result:
      1. If predicted correctly, no additional update is performed.
      2. If predicted incorrectly, attempt to allocate a new table entry in a table with a longer history than Tn. To allocate a new entry, the usefulness of the original entry at the corresponding index must be 0. The new entry is initialized as a weak prediction with usefulness 0, and its tag is set to the tag calculated using the new history information.
    3. If T0 and Tn produce different results:
      1. If Tn is correct, the entry’s usefulness is incremented.
        1. If the result is still a weak prediction, the counter in the alternative prediction for T0 is decremented.
      2. If Tn is incorrect, the entry’s usefulness is decremented, and a new entry is allocated in a table with a longer history than Tn, as in 3.2.2.
        1. If the result is still a weak prediction, the counter in the alternative prediction for T0 is incremented.
  4. When a new table needs to be allocated, dynamic reset of the usefulness flag is performed.
    1. Using a 7-bit bankTickCtrs register and calculating:
      1. The number of allocatable tables a (with a longer history length than the current and corresponding index usefulness is 0)
      2. The number of unallocatable tables b (with a longer history length than the current and corresponding index usefulness is not 0)
    2. Update bankTickCtrs += Δ (saturated counter), Δ = b - a,
    3. When bankTickCtrs reaches its maximum value, reset all usefulness to 0.

Kunming Lake SC Branch Predictor

Basic Function Introduction

The SC (Statistics counter) branch predictor is a branch predictor based on historical statistical information. Similar to TAGE, SC typically has multiple tables Tn, each corresponding to different lengths of historical jump statistics. The difference is that in SC, when predicting based on the PC, each table Tn is accessed, and then SC adds up each hit table entry to calculate the total “saturated counter” jump information, and finally determines whether to jump based on the total jump information. Generally, SC uses “signed saturated counters”, where a counter value greater than 0 indicates a jump, and less than 0 indicates no jump. The larger the absolute value of the counter, the higher the prediction confidence.

In the SC predictor, SC is also composed of multiple tables (e.g., T1, T2, T3, T4), but with fewer basic prediction tables T0 compared to the TAGE predictor. The Tn tables in SC have 6-bit signed saturated counters. The indexing method for SC tables is as follows:

Calculation Method
Index = (FH) ^ ((pc»1)低位)

For each table, the number of entries and the folded history length used are as follows:

Table Number of Entries FH Length Folded History Range
T1 512 0 不折叠
T2 512 4 把历史信息的低4位,折叠成FH
T3 512 8 把历史信息的低10位,折叠成FH
T4 512 8 把历史信息的低16位,折叠成FH

The formula for calculating the total statistical prediction result is as follows:

$$scCtrSum=\sum_{i=0}^{i<4}( (ctr_{sc} << 1) +1)$$

Where ctr_sc represents the signed saturated counter for each table. Left-shifting and adding one is for weight adjustment. The accumulated scCtrSum is the final prediction result of SC. If this value is greater than zero, the prediction is a jump; if it is less than zero, the prediction is no jump. The larger the absolute value, the higher the prediction confidence.

Typical data conversion results are as follows (extended to 9 bits to prevent overflow during calculation):

  1. All are 6b100000 (strong no-jump), resulting in 9b100000100, with a value of -252.
  2. All are 6b011111 (strong jump), resulting in 9b011111100, with a value of 252.
  3. All are 6b000000 (weak jump), resulting in 9b000000100, with a value of 4.
  4. All are 6b111111 (weak no-jump), resulting in 9b111111100, with a value of -4.

Prediction Process

  1. Calculate the index of table Tn using the PC and historical information.
  2. Query the index to obtain the saturation counters for all tables.
  3. Sum up all the saturation counters obtained from all tables to get the final prediction result (take a jump for values greater than 0, no jump for values less than 0).

Training Process

Update the saturation counters during the update phase.

  1. If the real instruction corresponding to PC jumps, increment the saturation counters corresponding to all tables.
  2. If the real instruction corresponding to PC does not jump, decrement the saturation counters corresponding to all tables.

Kunming Lake TAGE-SC Branch Predictor

Why SC is Needed with TAGE

In some applications, some branch behaviors have a weak correlation with branch history or paths, showing a statistical prediction bias. For these branches, using counters to capture statistical biases is more effective than history-based branch prediction.

TAGE is very effective in predicting branches that are highly correlated with history, but it performs poorly for branches with statistical biases. For example, branches that have a small bias in one direction but are not strongly correlated with historical paths. To avoid this problem, an SC predictor can be added to the traditional TAGE predictor.

TAGE-SC Functionality

In the Kunming Lake TAGE-SC predictor, both the TAGE and SC prediction results P1 and P2 are obtained simultaneously, and then their results are accumulated P = P1 + P2. If the absolute value of P is greater than the 8-bit threshold sc_bank_thres, the predictor result P is used; otherwise, P1 is used as the final prediction result.

For dynamic adaptation, the threshold sc_thres needs to be dynamically changed. Therefore, in the implementation, TAGE-SC uses a 5-bit sc_bank_ctr counter to adjust the threshold sc_bank_thres. Additionally, since Kunming Lake supports the simultaneous prediction of 2 branch instructions, the threshold register and corresponding control counter are also duplicated.

Pipeline

The TAGE-SC predictor contains 3 pipeline stages, where the 2-stage pipeline of TAGE has been introduced, and the pipeline of the SC part is as follows:

  1. Stage 0: Read PC and folded history into s0. First Stage: Calculate the index from pc and FH to obtain s0_idx.

  2. Stage 1: Read s0_idx from s0. Second Stage: Find the counter data corresponding to s1_idx in SCTable and output to s1_scResps.

  3. Stage 2: Read s1_scResps from s1. Third Stage: Select whether to invert the prediction result based on s2_scResps and output to s2_disagree.

  4. Stage 3: Read the result from s2_disagree as s3_disagree.

Prediction Process

In TAGE-SC prediction, the prediction result P1 of TAGE is represented by tage_ctr, and the prediction result P2 of SC is represented by scCtrSum. The prediction is divided into four steps:

  1. Execute the SC predictor to get the prediction result scCtrSum.

  2. Simultaneously obtain the prediction result tage_ctr of the TAGE predictor.

    1. Since the prediction result of TAGE is an unsigned saturation counter, and the prediction result of SC is a signed saturation counter, if they are added together, data conversion is required.

    2. Kunming Lake adopts a conversion for the result of TAGE. The converted result is represented by tageCtrCentered, and the specific conversion process is as follows:

      $$tageCtrCentered=((((ctr_{tage} -4)<<1)+1)<<3) $$
    3. Conversion of a 3-bit unsigned saturation counter to an 8-bit signed saturation counter result is illustrated as follows:

      • 3b100 Weak jump => 8b00001000 = 8

      • 3b011 Weak non-jump => 8b11111000 = -8

      • 3b111 Strong jump => 8b00111000 = 56

      • 3b000 Strong non-jump => 8b11001000 = -56

  3. Add the prediction results of TAGE and SC to get the final prediction result P, represented by totalSum.

$$totalSum = scCtrSum + tageCtrCentered$$
  1. Determine the final prediction direction based on totalSum and sc_bank_thres

    1. Jump if totalSum > 0 and its absolute value exceeds the threshold: If scCtrSum > sc_bank_thres - tageCtrCentered, it can also be understood as totalSum > sc_bank_thres. The above expression can reduce the maximum bit width (ensuring no overflow requires 10 bits to become 9 bits).
    2. No jump if totalSum < 0 and its absolute value exceeds the threshold: If scCtrSum < -sc_bank_thres - tageCtrCentered, it can also be understood as |totalSum| > sc_bank_thres.

Training Process

After combining TAGE and SC, TAGE-SC adds an sc_bank_ctr counter to control the threshold sc_bank_thres. Therefore, during training, in addition to the training of TAGE and SC themselves, the newly added counter needs to be updated.

During the update phase, the specific update process is as follows:

  1. TAGE-SC uses the prediction result P (i.e., the prediction result after TAGE + SC). If |totalSum| is in the range [sc_bank_thres -4, sc_bank_thres -2], update the threshold-related register group.
    1. Update sc_bank_ctr, the saturation counter: If the prediction is correct, sc_bank_ctr +=1; if the prediction is incorrect, sc_bank_ctr -=1.
    2. Update sc_bank_thres, limited saturation operation: If the updated value of sc_bank_ctr reaches 0b11111 and sc_bank_thres <= 31, then sc_bank_thres +=2; if the updated value of sc_bank_ctr is 0 and sc_bank_thres >=6, then sc_bank_thres -=2. For all other cases, thres remains unchanged.
    3. After the update judgment of sc_bank_thres is completed, another judgment is made on sc_bank_ctr. If the updated sc_bank_ctr is 0b11111 or 0, thres_ctr is reset to the initial value 0b10000.
  2. TAGE-SC uses the prediction result P1 (i.e., the prediction result of TAGE) and does not perform any operations.

Interface List

TageSC

信号类型 信号宽度 信号名 信号描述
input * clock
input * reset
input *[35:0] io_reset_vector 用于reset时,reset s1_pc_dup_0 提供的值
input *[40:0] io_in_bits_s0_pc_0 复制的s0_pc的dup数组的第1个,给顶层BPU的PC
input *[40:0] io_in_bits_s0_pc_1 复制的s0_pc第2个,给Tage的PC
input *[40:0] io_in_bits_s0_pc_3 复制的s0_pc的第4个,给SC的PC
input *[10:0] io_in_bits_folded_hist_1_hist_17_folded_hist TageTable 2 用到的11bits 折叠历史 从多长历史范围折叠到11bit见前文所述的表 注意TageTable下标+1,此处 T2 是前文 T3
input *[10:0] io_in_bits_folded_hist_1_hist_16_folded_hist TageTable 3 用到的11bits 折叠历史
input *[6:0] io_in_bits_folded_hist_1_hist_15_folded_hist TageTable 1 用到的7bits 折叠历史
input *[7:0] io_in_bits_folded_hist_1_hist_14_folded_hist TageTable 0 用到的8bits 折叠历史
input *[6:0] io_in_bits_folded_hist_1_hist_9_folded_hist TageTable 2 用到的7bits 折叠历史
input *[7:0] io_in_bits_folded_hist_1_hist_8_folded_hist TageTable 3 用到的8bits 折叠历史
input *[6:0] io_in_bits_folded_hist_1_hist_7_folded_hist TageTable 0 用到的7bits 折叠历史
input *[6:0] io_in_bits_folded_hist_1_hist_5_folded_hist TageTable 3 用到的7bits 折叠历史
input *[7:0] io_in_bits_folded_hist_1_hist_4_folded_hist TageTable 1 用到的8bits 折叠历史
input *[7:0] io_in_bits_folded_hist_1_hist_3_folded_hist TageTable 2 用到的8bits 折叠历史
input *[10:0] io_in_bits_folded_hist_1_hist_1_folded_hist TageTable 1 用到的11bits 折叠历史
input *[3:0] io_in_bits_folded_hist_3_hist_12_folded_hist SCTable 1 用到的 4bit 折叠历史
input *[7:0] io_in_bits_folded_hist_3_hist_11_folded_hist SCTable 2 用到的 8bit 折叠历史
input *[7:0] io_in_bits_folded_hist_3_hist_2_folded_hist SCTable 3 用到的 8bit 折叠历史
output * io_out_s2_full_pred_0_br_taken_mask_0 io_out_s2_full_pred_{i}br_taken_mask{j} Tage 在 s2流水级输出的,复制4份 预测块中第 j 条分支指令TAGE预测结果 这里不该叫mask吧
output * io_out_s2_full_pred_0_br_taken_mask_1
output * io_out_s2_full_pred_1_br_taken_mask_0
output * io_out_s2_full_pred_1_br_taken_mask_1
output * io_out_s2_full_pred_2_br_taken_mask_0
output * io_out_s2_full_pred_2_br_taken_mask_1
output * io_out_s2_full_pred_3_br_taken_mask_0
output * io_out_s2_full_pred_3_br_taken_mask_1
output * io_out_s3_full_pred_0_br_taken_mask_0 io_out_s3_full_pred_{i}br_taken_mask{j} Tage 在 s3流水级输出的,复制4份 预测块中第 j 条分支指令SC预测结果
output * io_out_s3_full_pred_0_br_taken_mask_1
output * io_out_s3_full_pred_1_br_taken_mask_0
output * io_out_s3_full_pred_1_br_taken_mask_1
output * io_out_s3_full_pred_2_br_taken_mask_0
output * io_out_s3_full_pred_2_br_taken_mask_1
output * io_out_s3_full_pred_3_br_taken_mask_0
output * io_out_s3_full_pred_3_br_taken_mask_1
output *[222:0] io_out_last_stage_meta 见附表
input * io_ctrl_tage_enable
input * io_ctrl_sc_enable
input * io_s0_fire_0 s0 阶段流水线控制 相同信号复制多份,0给BPU,1给Tage,3给SC
input * io_s0_fire_1
input * io_s0_fire_3
input * io_s1_fire_0 s1 阶段流水线控制
input * io_s1_fire_1
input * io_s1_fire_2
input * io_s1_fire_3
input * io_s2_fire_0 s2 阶段流水线控制
input * io_s2_fire_1
input * io_s2_fire_2
input * io_s2_fire_3
output * io_s1_ready tage的所有表,可以执行读取结果的操作
input * io_update_valid 从FTQ发向BPU的后端执行结果(更新信号)是否有效
input *[40:0] io_update_bits_pc (后端执行过的)预测块的PC
input *[10:0] io_update_bits_spec_info_folded_hist_hist_17_folded_hist TageTable 2 用到的11bits 折叠历史 预测时使用的分支历史结果,没有更新,转了一圈回来了
input *[10:0] io_update_bits_spec_info_folded_hist_hist_16_folded_hist TageTable 3 用到的11bits 折叠历史
input *[6:0] io_update_bits_spec_info_folded_hist_hist_15_folded_hist TageTable 1 用到的7bits 折叠历史
input *[7:0] io_update_bits_spec_info_folded_hist_hist_14_folded_hist TageTable 0 用到的8bits 折叠历史
input *[3:0] io_update_bits_spec_info_folded_hist_hist_12_folded_hist SCTable 1 用到的 4bit 折叠历史
input *[7:0] io_update_bits_spec_info_folded_hist_hist_11_folded_hist SCTable 2 用到的 8bit 折叠历史
input *[6:0] io_update_bits_spec_info_folded_hist_hist_9_folded_hist TageTable 2 用到的7bits 折叠历史
input *[7:0] io_update_bits_spec_info_folded_hist_hist_8_folded_hist TageTable 3 用到的8bits 折叠历史
input *[6:0] io_update_bits_spec_info_folded_hist_hist_7_folded_hist TageTable 0 用到的7bits 折叠历史
input *[6:0] io_update_bits_spec_info_folded_hist_hist_5_folded_hist TageTable 3 用到的7bits 折叠历史
input *[7:0] io_update_bits_spec_info_folded_hist_hist_4_folded_hist TageTable 1 用到的8bits 折叠历史
input *[7:0] io_update_bits_spec_info_folded_hist_hist_3_folded_hist TageTable 2 用到的8bits 折叠历史
input *[7:0] io_update_bits_spec_info_folded_hist_hist_2_folded_hist SCTable 3 用到的 8bit 折叠历史
input *[10:0] io_update_bits_spec_info_folded_hist_hist_1_folded_hist TageTable 1 用到的11bits 折叠历史
input * io_update_bits_ftb_entry_brSlots_0_valid FTB 表项的第一个slot是否有效(存储了跳转指令)
input * io_update_bits_ftb_entry_tailSlot_sharing FTB 表项的最后一个slot是否存储了条件分支而非无条件跳转
input * io_update_bits_ftb_entry_tailSlot_valid FTB 表项的最后一个slot是否有效
input * io_update_bits_ftb_entry_always_taken_0 历史上slot 0 指令总是跳转
input * io_update_bits_ftb_entry_always_taken_1 历史上slot 1 指令总是跳转
input * io_update_bits_br_taken_mask_0 solt 0 是否 taken
input * io_update_bits_br_taken_mask_1 solt 1 是否 taken
input * io_update_bits_mispred_mask_0 solt 0 是否预测正确
input * io_update_bits_mispred_mask_1 solt 1 是否预测正确
input *[222:0] io_update_bits_meta 见附表

io_out_last_stage_meta

需要设计参与优化!

信号类型 信号位 信号名 信号描述
output [218:88] 0 占位,全为0,传递到composer时会忽略
87 resp_meta_providers_1_valid_r
[86:85] resp_meta_providers_1_bits_r
84 resp_meta_providers_0_valid_r
[83:82] resp_meta_providers_0_bits_r
[81:79] resp_meta_providerResps_1_r_ctr
78 resp_meta_providerResps_1_r_u
77 resp_meta_providerResps_1_r_unconf
[76:74] resp_meta_providerResps_0_r_ctr
73 resp_meta_providerResps_0_r_u
72 resp_meta_providerResps_0_r_unconf
71 resp_meta_altUsed_1_r
70 resp_meta_altUsed_0_r
69 resp_meta_altDiffers_1_r
68 resp_meta_altDiffers_0_r
[67:66] resp_meta_basecnts_1_r
[65:64] resp_meta_basecnts_0_r
[63:60] resp_meta_allocates_1_r
[59:56] resp_meta_allocates_0_r
55 resp_meta_takens_1_r
54 resp_meta_takens_0_r
53 resp_meta_scMeta_tageTakens_1_r
52 resp_meta_scMeta_tageTakens_0_r
51 resp_meta_scMeta_scUsed_1_r
50 resp_meta_scMeta_scUsed_0_r
49 resp_meta_scMeta_scPreds_1_r
48 resp_meta_scMeta_scPreds_0_r
[47:42] r_1_3 scMeta(预测时的状态)中第2路的第4个sc_ctr的值
[41:36] r_1_2 scMeta中第2路的第3个sc_ctr的值
[35:30] r_1_1 scMeta中第2路的第2个sc_ctr的值
[29:24] r_1_0 scMeta中第2路的第1个sc_ctr的值
[23:18] r_3 scMeta中第1路的第4个sc_ctr的值
[17:12] r_2 scMeta中第1路的第3个sc_ctr的值
[11:6] r_1 scMeta中第1路的第2个sc_ctr的值
[5:0] r_0 scMeta中第1路的第1个sc_ctr的值

io_update_bits_meta

信号类型 信号位 信号名 信号描述
input [218:94] FTB, ITAGE, RAS 模块传给 FTQ 的 META 信息,忽略
[93:6] io_out_last_stage_meta[87:0] 偏移 6bit 后的结果 TAGE 输出给 FTQ 的 META
[5:0] uFTB 输出给 FTQ 的 META

4 - FTB Branch Predictor

Introduction to FTB

FTB is the third sub-predictor of the Xiangshan BPU, and it can also get the outputs of uFTB and TAGE-SC together. In the input interface of FTB, the s1 channel contains the basic prediction results of uFTB, and the s2 and s3 channels are filled with only one group of signals, br_taken_mask, by TAGE-SC, without the basic prediction results generated by the FTB entry. The function of FTB is to provide basic prediction results for the s2 and s3 channels.

In terms of functionality and structure, FTB is similar to uFTB. The main difference is that FTB can accommodate more FTB entries, and the prediction results of FTB are output in the s2 and s3 channels. Due to its large capacity, the readout speed of FTB is slower than that of uFTB, and it cannot be placed in the first cycle to generate prediction results. However, the large capacity enables it to obtain more accurate prediction results.

Function of uFTB

  • Cache more FTB entries and provide basic prediction results for the s2 and s3 channels. The FTB predictor is essentially a storage with a large capacity. It reads the corresponding FTB entry based on the current predicted PC and outputs it in the s2 stage. At the same time, this FTB entry will be saved for one more cycle to generate the s3 stage prediction result. One thing to note is to consider the br_taken_mask field inputted by the previous predictor to avoid losing it during generation.
  • Update FTB entries based on update requests.

FTB Storage Structure

FTB entries in the FTB predictor are placed in a dedicated storage structure called FTBBank. Before further examining the structure of FTBBank, let’s first see how FTBBank is used.

FTB Read Request

The read request interface of FTBBank is as follows:

  • req_pc Requested PC
    • Interface type: Flipped(DecoupledIO(UInt(VAddrBits.W)))
  • read_resp Read out FTB entry
    • Interface type: FTBEntry
  • read_hits Which way (row) is hit
    • Interface type: Valid(UInt(log2Ceil(numWays).W))

Among, req_pc interface is Decoupled, meaning it contains valid and ready signals. FTB needs to get the PC before the s1 stage starts, so s0_pc is sent to the req_pc interface, s0_fire signal is connected to the valid signal of req_pc, and the ready signal is connected to the pipeline control signal s1_ready.

When s0_fire enters the s1 stage, in the next cycle, when s0_fire is at the same time as s1_fire, FTBBank has already outputted the readout FTB entry to the read_resp interface, and calculated read_hits. However, at this time, because the readout has wasted too much delay, it cannot be outputted in the s1 stage. Therefore, this readout result is saved in an internal register. It will be read out from the register in the s2 and s3 stages to generate the prediction result.

FTBBank

FTBBank defines a storage to store all FTB entries. The storage adopts a group-associative structure, with 512 groups (Sets) in total, each group has 4 ways, and can store up to 2048 FTB entries. Besides storing FTB entries, it also stores the tag corresponding to each FTB entry for matching.

Specifically, the tag is defined as pc[29:10], which takes 20 bits from the PC to identify the FTB entry. The PC is divided as follows:

  pc: | ... |<-- tag(20 bits) -->|<-- idx(9 bits) -->|<-- instOffset(1 bit) -->|

When reading, provide the group number (idx) to the storage, read out all ways in that group, and then check if there is a way whose tag matches the current tag. If there is a match, it means a hit, and the readout FTB entry is sent out through the read_resp interface, and the hit way number is sent out through the read_hits interface.

Generation of Prediction Results

As mentioned earlier, for the FTB predictor, it needs to provide basic prediction results derived from FTB entries to the s2 and s3 channels. The FTB entries have been read and saved in the s1 stage. In the s2 and s3 stages, they only need to be read out to generate the prediction results. However, one thing to note is to preserve the br_taken_mask field generated by TAGE-SC in the s2 and s3 prediction results, which provides precise prediction results for conditional branch instructions. For the s1 channel, the FTB predictor does not make any changes.

The generation of signals in the s2 and s3 prediction results can refer to the following list:

  • hit Whether the FTB entry is hit
    • Generation method: The read_hits signal valid bit from FTBBank is valid.
  • slot_valids Slot valid bit, indicating whether each slot in the ftb entry is valid
  • targets Jump target address corresponding to each slot
  • offsets Instruction offset relative to the start address of the predicted block in each slot
  • is_jal Whether the predicted block contains a jal instruction
  • is_jalr Whether the predicted block contains a jalr instruction
  • is_call Whether the predicted block contains a call instruction
  • is_ret Whether the predicted block contains a ret instruction
  • last_may_be_rvi_call Signal indicating that the end of the predicted block may be an RVI type call instruction
  • **is_br_sharing Whether the last slot (tailSlot) stores a conditional branch instruction signal
    • Generation method**: Export from the corresponding field in the FTB entry
  • fallThroughErr Error in the pftAddr recorded in the FTB entry
    • Generation method: Compare whether the address represented by pftAddr is greater than the start address of the predicted block. If it is less than, it indicates an error, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
  • fallThroughAddr End address of the predicted block
    • Generation method: If fallThroughErr is invalid, it is generated according to pftAddr. Otherwise, it is set to the start address + prediction width.
  • br_taken_mask Branch prediction result, each branch (slot) corresponds to a bit, indicating whether the branch is predicted as taken
    • Generation method: Generated based on the always_taken field in the FTB entry and the indication result of the two-bit saturation counter.
  • jalr_target Jump target of jalr in this predicted block
    • Generation method: Jump target in the tailSlot of the FTB entry.

FTB meta

In the third cycle of prediction, the FTB predictor outputs some auxiliary information of this prediction to last_stage_meta and also sends the read FTB entry to the last_stage_ftrb_entry interface.

The FTB meta contains two pieces of information, hit and writeWay, indicating whether the prediction hits and in which way it is read. Subsequently, the update channel generates the update information for this prediction, and these two pieces of information are also sent to guide the writing of the updated FTB entry.

FTB Update

In the update channel, the pc and the new FTB entry are already specified for us, along with the hit and writeWay in the meta information. If hit in the meta is valid, it means that the FTB entry corresponding to this pc was stored in the memory, and we only need to write it to the corresponding way.

If it is invalid, it means that there was no storage before, but we do not know whether it is stored now. It is possible that before this update request, the FTB entry corresponding to this pc was written by another update request. Therefore, we still need to send a read request to FTBBank to check if there is a corresponding FTB entry. If it exists, it can be directly written to this position in the next cycle, otherwise, FTBBank will be notified to allocate a new position.

Therefore, the number of cycles required for updating FTB entries depends on the hit situation.

Let’s first look at how FTBBank handles updates.

FTBBank Update

FTBBank’s update interface is divided into two parts, the update read interface and the update write interface.

  • u_req_pc: Update read request pc
    • Flipped(DecoupledIO(UInt(VAddrBits.W)))
  • update_hits: Hit information read out
    • Valid(UInt(log2Ceil(numWays).W))
  • update_access: There is an update request but the meta information indicates a miss
    • Bool()
  • update_pc: Update write request pc
    • UInt(VAddrBits.W))
  • update_write_data: Data to be written in the update request, write when valid
    • Flipped(Valid(new FTBEntryWithTag))
  • update_write_way: Way index to write in the update request
    • UInt(log2Ceil(numWays).W))
  • update_write_alloc: Whether a new FTB entry needs to be allocated (missed before)
    • Bool()

For the update read interface, FTBBank obtains the update read request through u_req_pc signal. This request has a higher priority than the read request during prediction. In the next cycle, FTBBank will output the hit information through the update_hits interface. update_access is only used for some internal status judgments of FTBBank.

For the update write interface, FTBBank obtains the pc of the update write request through the update_pc signal, and when update_write_data is valid, it writes the data into the corresponding position specified by update_write_way. If update_write_alloc is valid, it means that it cannot be directly written to the position specified in the request, but a new position needs to be allocated.

The allocation strategy is as follows:

  • If all ways are filled, use the pseudo LRU replacement algorithm to select the way to replace
  • If there is an empty way, select the empty way.

Update Request Timing

  • Meta hit is valid: If hit in the update request meta is valid, then we only need to specify the address and data to be written according to the information in the update request, and the writing only takes one cycle.
  • Meta hit is invalid: In this case, after receiving the update request, we connect the pc in the request to the read port of FTBBank. The read port will return the result in the next cycle. Due to timing issues, we save this result and use it in the next cycle. Depending on the hit status in the result, we decide whether to set update_write_alloc and send a write request. The entire update process takes three cycles.

Interface List

信号类型 信号位 信号名 信号描述
input clock 输入时钟
input reset 复位信号
input [35:0] io_reset_vector 用于reset时,reset s1_pc_dup_0 提供的值
input [40:0] io_in_bits_s0_pc_0 输入位s0_pc 的 第0个复制
input [40:0] io_in_bits_s0_pc_1 同上 第1个
input [40:0] io_in_bits_s0_pc_2 同上 第2个
input [40:0] io_in_bits_s0_pc_3 同上 第3个
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_0 预测结果输入
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_0_br_taken_mask_0 s2 阶段输出的完整预测结果
output io_out_s2_full_pred_0_br_taken_mask_1
output io_out_s2_full_pred_0_slot_valids_0
output io_out_s2_full_pred_0_slot_valids_1
output [40:0] io_out_s2_full_pred_0_targets_0
output [40:0] io_out_s2_full_pred_0_targets_1
output [40:0] io_out_s2_full_pred_0_jalr_target
output [3:0] io_out_s2_full_pred_0_offsets_0
output [3:0] io_out_s2_full_pred_0_offsets_1
output [40:0] io_out_s2_full_pred_0_fallThroughAddr
output io_out_s2_full_pred_0_is_br_sharing
output io_out_s2_full_pred_0_hit
output io_out_s2_full_pred_1_br_taken_mask_0 同上
output io_out_s2_full_pred_1_br_taken_mask_1
output io_out_s2_full_pred_1_slot_valids_0
output io_out_s2_full_pred_1_slot_valids_1
output [40:0] io_out_s2_full_pred_1_targets_0
output [40:0] io_out_s2_full_pred_1_targets_1
output [40:0] io_out_s2_full_pred_1_jalr_target
output [3:0] io_out_s2_full_pred_1_offsets_0
output [3:0] io_out_s2_full_pred_1_offsets_1
output [40:0] io_out_s2_full_pred_1_fallThroughAddr
output io_out_s2_full_pred_1_is_br_sharing
output io_out_s2_full_pred_1_hit
output io_out_s2_full_pred_2_br_taken_mask_0 同上
output io_out_s2_full_pred_2_br_taken_mask_1
output io_out_s2_full_pred_2_slot_valids_0
output io_out_s2_full_pred_2_slot_valids_1
output [40:0] io_out_s2_full_pred_2_targets_0
output [40:0] io_out_s2_full_pred_2_targets_1
output [40:0] io_out_s2_full_pred_2_jalr_target
output [3:0] io_out_s2_full_pred_2_offsets_0
output [3:0] io_out_s2_full_pred_2_offsets_1
output [40:0] io_out_s2_full_pred_2_fallThroughAddr
output io_out_s2_full_pred_2_is_jalr
output io_out_s2_full_pred_2_is_call
output io_out_s2_full_pred_2_is_ret
output io_out_s2_full_pred_2_last_may_be_rvi_call
output io_out_s2_full_pred_2_is_br_sharing
output io_out_s2_full_pred_2_hit
output io_out_s2_full_pred_3_br_taken_mask_0 同上
output io_out_s2_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_3_slot_valids_0
output io_out_s2_full_pred_3_slot_valids_1
output [40:0] io_out_s2_full_pred_3_targets_0
output [40:0] io_out_s2_full_pred_3_targets_1
output [40:0] io_out_s2_full_pred_3_jalr_target
output [3:0] io_out_s2_full_pred_3_offsets_0
output [3:0] io_out_s2_full_pred_3_offsets_1
output [40:0] io_out_s2_full_pred_3_fallThroughAddr
output io_out_s2_full_pred_3_fallThroughErr
output io_out_s2_full_pred_3_is_br_sharing
output io_out_s2_full_pred_3_hit
output io_out_s3_full_pred_0_br_taken_mask_0 s3 阶段输出的完整预测结果
output io_out_s3_full_pred_0_br_taken_mask_1
output io_out_s3_full_pred_0_slot_valids_0
output io_out_s3_full_pred_0_slot_valids_1
output [40:0] io_out_s3_full_pred_0_targets_0
output [40:0] io_out_s3_full_pred_0_targets_1
output [40:0] io_out_s3_full_pred_0_jalr_target
output [40:0] io_out_s3_full_pred_0_fallThroughAddr
output io_out_s3_full_pred_0_fallThroughErr
output io_out_s3_full_pred_0_is_br_sharing
output io_out_s3_full_pred_0_hit
output io_out_s3_full_pred_1_br_taken_mask_0 同上
output io_out_s3_full_pred_1_br_taken_mask_1
output io_out_s3_full_pred_1_slot_valids_0
output io_out_s3_full_pred_1_slot_valids_1
output [40:0] io_out_s3_full_pred_1_targets_0
output [40:0] io_out_s3_full_pred_1_targets_1
output [40:0] io_out_s3_full_pred_1_jalr_target
output [40:0] io_out_s3_full_pred_1_fallThroughAddr
output io_out_s3_full_pred_1_fallThroughErr
output io_out_s3_full_pred_1_is_br_sharing
output io_out_s3_full_pred_1_hit
output io_out_s3_full_pred_2_br_taken_mask_0 同上
output io_out_s3_full_pred_2_br_taken_mask_1
output io_out_s3_full_pred_2_slot_valids_0
output io_out_s3_full_pred_2_slot_valids_1
output [40:0] io_out_s3_full_pred_2_targets_0
output [40:0] io_out_s3_full_pred_2_targets_1
output [40:0] io_out_s3_full_pred_2_jalr_target
output [40:0] io_out_s3_full_pred_2_fallThroughAddr
output io_out_s3_full_pred_2_fallThroughErr
output io_out_s3_full_pred_2_is_jalr
output io_out_s3_full_pred_2_is_call
output io_out_s3_full_pred_2_is_ret
output io_out_s3_full_pred_2_is_br_sharing
output io_out_s3_full_pred_2_hit
output io_out_s3_full_pred_3_br_taken_mask_0 同上
output io_out_s3_full_pred_3_br_taken_mask_1
output io_out_s3_full_pred_3_slot_valids_0
output io_out_s3_full_pred_3_slot_valids_1
output [40:0] io_out_s3_full_pred_3_targets_0
output [40:0] io_out_s3_full_pred_3_targets_1
output [40:0] io_out_s3_full_pred_3_jalr_target
output [3:0] io_out_s3_full_pred_3_offsets_0
output [3:0] io_out_s3_full_pred_3_offsets_1
output [40:0] io_out_s3_full_pred_3_fallThroughAddr
output io_out_s3_full_pred_3_fallThroughErr
output io_out_s3_full_pred_3_is_br_sharing
output io_out_s3_full_pred_3_hit
output [222:0] io_out_last_stage_meta 最后一个阶段输出的 meta 信息
output io_out_last_stage_ftb_entry_valid 最后一个阶段输出的 FTB 项
output [3:0] io_out_last_stage_ftb_entry_brSlots_0_offset
output [11:0] io_out_last_stage_ftb_entry_brSlots_0_lower
output [1:0] io_out_last_stage_ftb_entry_brSlots_0_tarStat
output io_out_last_stage_ftb_entry_brSlots_0_sharing
output io_out_last_stage_ftb_entry_brSlots_0_valid
output [3:0] io_out_last_stage_ftb_entry_tailSlot_offset
output [19:0] io_out_last_stage_ftb_entry_tailSlot_lower
output [1:0] io_out_last_stage_ftb_entry_tailSlot_tarStat
output io_out_last_stage_ftb_entry_tailSlot_sharing
output io_out_last_stage_ftb_entry_tailSlot_valid
output [3:0] io_out_last_stage_ftb_entry_pftAddr
output io_out_last_stage_ftb_entry_carry
output io_out_last_stage_ftb_entry_isCall
output io_out_last_stage_ftb_entry_isRet
output io_out_last_stage_ftb_entry_isJalr
output io_out_last_stage_ftb_entry_last_may_be_rvi_call
output io_out_last_stage_ftb_entry_always_taken_0
output io_out_last_stage_ftb_entry_always_taken_1
input io_ctrl_btb_enable 使能信号
input io_s0_fire_0 s0 阶段流水线控制信号
input io_s0_fire_1
input io_s0_fire_2
input io_s0_fire_3
output io_s1_ready s1 阶段流水线控制信号
input io_s1_fire_0
input io_s1_fire_1
input io_s1_fire_2
input io_s1_fire_3
input io_s2_fire_0 s2 阶段流水线控制信号
input io_s2_fire_1
input io_s2_fire_2
input io_s2_fire_3
input io_update_valid 更新有效性
input [40:0] io_update_bits_pc 传回的预测块pc(用于指示更新的预测块)
input io_update_bits_ftb_entry_valid 是否启用
input [3:0] io_update_bits_ftb_entry_brSlots_0_offset solt 0 中分支指令相对于地址块起始pc的偏移
input [11:0] io_update_bits_ftb_entry_brSlots_0_lower 跳转目标地址的低位
input [1:0] io_update_bits_ftb_entry_brSlots_0_tarStat 跳转后的 pc 高位是否进退位
input io_update_bits_ftb_entry_brSlots_0_sharing 无条件跳转指令槽中存储条件分支指令
input io_update_bits_ftb_entry_brSlots_0_valid 是否启用
input [3:0] io_update_bits_ftb_entry_tailSlot_offset solt 1 中分支指令相对于地址块起始pc的偏移
input [19:0] io_update_bits_ftb_entry_tailSlot_lower 跳转目标地址的低位
input [1:0] io_update_bits_ftb_entry_tailSlot_tarStat 跳转后的 pc 高位是否进退位
input io_update_bits_ftb_entry_tailSlot_sharing 无条件跳转指令槽中存储条件分支指令
input io_update_bits_ftb_entry_tailSlot_valid 是否启用
input [3:0] io_update_bits_ftb_entry_pftAddr Partial Fallthrough Addr 如果预测块中没有跳转,那么程序将会顺序执行到达的地址,预测块的结束地址。
input io_update_bits_ftb_entry_carry pc+pft时是否产生进位
input io_update_bits_ftb_entry_isCall 是否是函数调用
input io_update_bits_ftb_entry_isRet 是否是函数返回
input io_update_bits_ftb_entry_isJalr 是否是 jalr 指令
input io_update_bits_ftb_entry_last_may_be_rvi_call 最后一个指令槽存储的可能是 rvi 的 call 指令
input io_update_bits_ftb_entry_always_taken_0 是否预测为总是跳转
input io_update_bits_ftb_entry_always_taken_1 是否预测为总是跳转
input io_update_bits_old_entry 是否是旧的 FTB 项
input [222:0] io_update_bits_meta meta 信息

5 - ITTAGE Branch Predictor

Function Introduction

For general conditional branch instructions, only predicting whether to jump (taken) or not (not taken) is needed. However, for indirect jumps, such as call/jump instructions, it is necessary to predict where to jump to (Target). In order to make TAGE support predicting jump addresses, ITTAGE (Indirect Target TAGE) was introduced.

The main difference between ITTAGE and TAGE is that in the T0 and Tn tables, Target PC data is added. During prediction, ITTAGE selects the Target from the matched, longest history entry as the prediction result, and uses a 2-bit saturating counter to decide whether to output this result or choose an alternative prediction result. For TAGE predictor details, please refer to TAGE-SC Branch Predictor.

Kunming Lake ITTAGE Branch Predictor

In the BPU design of Kunming Lake, prediction is performed in a cascaded manner with multiple predictors, so the implementation of the subpredictor differs from the original predictor, mainly in the default prediction result.

Basic Functionality

ITTAGE’s basic functionality is similar to the TAGE branch predictor, but with the following differences:

  1. The Target is added as a jump target address item in the entry to predict the jump target address.
  2. The saturating counter ctr no longer provides the prediction direction, but instead decides whether to output the result (just the prediction information).
  3. Since there is only one indirect jump instruction in each branch prediction block, ITTAGE only considers one instruction.

Pipeline

ITTAGE contains three pipeline stages, the first stage calculates the index, and the second stage reads the result from the SRAM table using the index.

  1. Cycle 0, s0: Input of the first pipeline stage, generally pc and folded history.

Operation of the first pipeline stage:Calculate the index. Output through registers to s1.

  1. Cycle 1, s1: Input of the second pipeline stage, the index and other data calculated in the first stage.

Operation of the second pipeline stage:Access SRAM, read prediction information. Output through registers to s2.

  1. Cycle 2, s2: Input of the third pipeline stage, the original prediction information read from SRAM in the second stage.

Operation of the third pipeline stage:Process the original prediction information, decide whether to output the prediction result.

  1. Cycle 3, s3: Prediction result ready, the prediction result can now be used.

Data Structure

In the Kunming Lake implementation, the table structure of T0 and Tn is as follows:

预测器 作用 表项构成 项数
基准预测器T0 用于在其他预测器的预测结果都无效时输出预测结果 虚表,不存在。 直接将上级预测器FTB 的预测结果作为表项结果 虚表,不存在。 直接将上级预测器FTB结果作为索引到的结果
预测表T1-T2 对每个预测块的输入,所有Tn表都进行预测,在所有预测有效的结果中,选择历史记录最长的结果作为 原始预测信息。历史记录长度由输入的H决定 target:41 bitsvalid 1bittag 9bitsctr 2bitsus: 1bit(usefulness计数器) 256项
预测表T3-T5 512项

T0,TnTable Retrieval Method

The retrieval method is consistent with the TAGE branch predictor, only differing in the configuration options of each table.

表名称 FH长度 FH1长度 FH2长度 最近历史长度(用到GH中的位数)
T1 4比特 4比特 4比特 低4位,即把最新4位历史,折叠成FH、FH1、FH2
T2 8比特 8比特 8比特 低8位,即把最新8位历史,折叠成FH、FH1、FH2
T3 9比特 9比特 8比特 低13位,即把最新13位历史,折叠成FH、FH1、FH2
T4 9比特 9比特 8比特 低16位,即把最新16位历史,折叠成FH、FH1、FH2
T5 9比特 9比特 8比特 低32位,即把最新32位历史,折叠成FH、FH1、FH2

Other processes (computation method and computation formula) are similar to the TAGE-SC branch predictor.

Alternate Predictor

When the prediction result given by the Tn table has insufficient “prediction confidence,” the prediction result needs to be jumped to become an “alternate predictor.” This process is similar to TAGE. For details, please refer to the corresponding part of TAGE. Unlike TAGE, ITTAGE’s ctr does not give the prediction direction but only determines whether to output the result (prediction confidence). When ctr is 2b00, it is considered weak confidence. Choose the alternate prediction result:

  1. If multiple tables are hit, output the Target from the second-longest history table entry.
  2. Otherwise, output the T0 Target (FTB Target).

Prediction Process

The prediction process is similar to TAGE, but ITTAGE has an additional step to decide whether to output the prediction result based on ctr. The specific process is as follows:

  1. When the ctr of the ITTAGE table entry is not 2b00, output Target.
  2. When the ctr of the ITTAGE table entry is 2b00, output the alternate prediction result:
    1. If there is a second-longest history (the second table is also hit), output the Target of the second-longest.
    2. Otherwise, output the FTB Target.
  3. When the ITTAGE table entry is not hit, output the T0 Target (FTB Target).

Training Process

This process is similar to TAGE, with the following differences:

  1. Table entry updates (original prediction data):
    1. ctr:
      1. If the predicted address matches the actual address, increment the ctr counter of the corresponding provider table entry by 1.
      2. If the predicted address does not match the actual address, decrement the ctr counter of the corresponding provider table entry by 1.
      3. In ITTAGE, it is determined based on ctr whether to adopt the jump target result of this prediction. If multiple tables are hit and the ctr of the longest history table is 0, adopt the alternate prediction logic (the second-longest history table or T0). Always update the longest history table during updates, and also update the alternate prediction table if the alternate prediction is adopted.
    2. target:
      1. When the ctr of the table entry to be updated is 0 during this prediction, directly store the actual final jump result in the target, overwriting it.
      2. When applying for a new table entry, directly store the actual final jump result in the target.
      3. Otherwise, do not modify the target.
    3. usefulness:
      1. When the provider’s prediction is correct but the alternate prediction is incorrect, set the provider’s usefulness to 1.
      2. If the alternate prediction has weak confidence and is correct, set the provider’s usefulness to 1. If the alternate prediction has weak confidence and is incorrect, set the provider’s usefulness to 0.
    4. New table entry:
      1. Each time the prediction from the longest history table with confidence is incorrect (not due to using the alternate prediction), try to randomly apply for a table entry from a longer history table. The condition for application is that the usefulness of the corresponding entry is 0.
      2. If all longer entries are not 0, the allocation fails.
  2. Reset useful bit:
    1. Each time a prediction error occurs and a new table entry is applied for, if the allocation fails, increment tickCtr (an 8-bit saturated counter used to reset all usefulness). If successful, decrement tickCtr.
    2. When tickCtr reaches its maximum value, set all usefulness in ITTAGE to 0 and reset tickCtr to 0.

Interface List

接口类型 位宽 信号名 备注
input clock
input reset
input [40:0] io_in_bits_s0_pc_3 用于预测的PC
input [7:0] io_in_bits_folded_hist_3_hist_14_folded_hist T2 折叠历史
input [8:0] io_in_bits_folded_hist_3_hist_13_folded_hist T3 折叠历史
input [3:0] io_in_bits_folded_hist_3_hist_12_folded_hist T1 折叠历史
input [8:0] io_in_bits_folded_hist_3_hist_10_folded_hist T5 折叠历史
input [8:0] io_in_bits_folded_hist_3_hist_6_folded_hist T4 折叠历史
input [7:0] io_in_bits_folded_hist_3_hist_4_folded_hist T3 折叠历史
input [7:0] io_in_bits_folded_hist_3_hist_3_folded_hist T5 折叠历史
input [7:0] io_in_bits_folded_hist_3_hist_2_folded_hist T4 折叠历史
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_jalr_target
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_jalr_target
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_jalr_target
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_jalr_target
output [40:0] io_out_s3_full_pred_0_jalr_target
output [40:0] io_out_s3_full_pred_1_jalr_target
output [40:0] io_out_s3_full_pred_2_jalr_target
output [40:0] io_out_s3_full_pred_3_jalr_target
output [222:0] io_out_last_stage_meta [100:0] 有效,是ITTAGE的Meta信息
input io_s0_fire_3 s0阶段使能信号
input io_s1_fire_3 s1阶段使能信号
input io_s2_fire_0 s2阶段使能信号,相同
input io_s2_fire_1
input io_s2_fire_2
input io_s2_fire_3
input io_update_valid 是否进行更新
input [40:0] io_update_bits_pc 待更新的预测块pc索引
input [7:0] io_update_bits_spec_info_folded_hist_hist_14_folded_hist T2 更新时传入的历史
input [8:0] io_update_bits_spec_info_folded_hist_hist_13_folded_hist T3 更新时传入的历史
input [3:0] io_update_bits_spec_info_folded_hist_hist_12_folded_hist T1 更新时传入的历史
input [8:0] io_update_bits_spec_info_folded_hist_hist_10_folded_hist T5 更新时传入的历史
input [8:0] io_update_bits_spec_info_folded_hist_hist_6_folded_hist T4 更新时传入的历史
input [7:0] io_update_bits_spec_info_folded_hist_hist_4_folded_hist T3 更新时传入的历史
input [7:0] io_update_bits_spec_info_folded_hist_hist_3_folded_hist T5 更新时传入的历史
input [7:0] io_update_bits_spec_info_folded_hist_hist_2_folded_hist T4 更新时传入的历史
input [3:0] io_update_bits_ftb_entry_tailSlot_offset 待更新的FTB项offset
input io_update_bits_ftb_entry_tailSlot_sharing 待更新的FTB项是否是有条件跳转
input io_update_bits_ftb_entry_tailSlot_valid 待更新的tailSlot是否启用
input io_update_bits_ftb_entry_isRet tailSlot是否是Ret指令
input io_update_bits_ftb_entry_isJalr tailSlot是否是Jalr指令
input io_update_bits_cfi_idx_valid 控制流指令在预测块中的索引.valid信号
input [3:0] io_update_bits_cfi_idx_bits 控制流指令在预测块中的索引
input io_update_bits_jmp_taken 预测块内无条件跳转指令被触发
input io_update_bits_mispred_mask_2 是否预测错误
input [222:0] io_update_bits_meta 预测时传出 meta 信息的[222:25] 即{25h0, _ubtb_io_out_last_stage_meta[5:0] ,_tage_io_out_last_stage_meta[87:0] ,_ftb_io_out_last_stage_meta[2:0], _ittage_io_out_last_stage_meta[100:0]}
input [40:0] io_update_bits_full_target 预测块的跳转目标(下一个预测块的起始地址)

Pass-through signals that do not have an impact

These signals do not have an impact and are not important
接口类型 位宽 信号名 备注
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_0 从FTB输入 完全透传到输出 包括jalr_target
input io_in_bits_resp_in_0_s2_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_0_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_0_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_0_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_0_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_0_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_0_is_br_sharing
input io_in_bits_resp_in_0_s2_full_pred_0_hit
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_1_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_1_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_1_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_1_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_1_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_1_is_br_sharing
input io_in_bits_resp_in_0_s2_full_pred_1_hit
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_2_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_2_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_2_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_2_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_2_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_2_is_jalr RAS 模块使用的信息,透传
input io_in_bits_resp_in_0_s2_full_pred_2_is_call
input io_in_bits_resp_in_0_s2_full_pred_2_is_ret
input io_in_bits_resp_in_0_s2_full_pred_2_last_may_be_rvi_call
input io_in_bits_resp_in_0_s2_full_pred_2_is_br_sharing 从FTB输入 完全透传到输出 包括jalr_target fallThroughErr 表示 FTB项 中记录的 pftAddr 有误 生成方式:比较 pftAddr 代表的预测块结束地址是否大于预测块的起始地址,如果小于,则代表出现错误,此信号置为有效。这种情况可能会发生在 pc 索引到错误的 FTB 项的情况。 FTQ使用这个变量,与ITTAGE无关
input io_in_bits_resp_in_0_s2_full_pred_2_hit
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_0
input io_in_bits_resp_in_0_s2_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s2_full_pred_3_slot_valids_0
input io_in_bits_resp_in_0_s2_full_pred_3_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_targets_0
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_targets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_jalr_target
input [3:0] io_in_bits_resp_in_0_s2_full_pred_3_offsets_0
input [3:0] io_in_bits_resp_in_0_s2_full_pred_3_offsets_1
input [40:0] io_in_bits_resp_in_0_s2_full_pred_3_fallThroughAddr
input io_in_bits_resp_in_0_s2_full_pred_3_fallThroughErr
input io_in_bits_resp_in_0_s2_full_pred_3_is_br_sharing
input io_in_bits_resp_in_0_s2_full_pred_3_hit
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_0 除了 jalr_target 可能被修改,其他都是透传
input io_in_bits_resp_in_0_s3_full_pred_0_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_0_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_0_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_targets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_0_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_0_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_0_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_0_hit
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_0 同上
input io_in_bits_resp_in_0_s3_full_pred_1_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_1_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_1_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_targets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_1_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_1_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_1_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_1_hit
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_0 同上
input io_in_bits_resp_in_0_s3_full_pred_2_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_2_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_2_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_targets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_2_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_2_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_2_is_jalr
input io_in_bits_resp_in_0_s3_full_pred_2_is_call
input io_in_bits_resp_in_0_s3_full_pred_2_is_ret
input io_in_bits_resp_in_0_s3_full_pred_2_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_2_hit
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_0 同上
input io_in_bits_resp_in_0_s3_full_pred_3_br_taken_mask_1
input io_in_bits_resp_in_0_s3_full_pred_3_slot_valids_0
input io_in_bits_resp_in_0_s3_full_pred_3_slot_valids_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_targets_0
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_targets_1
input [3:0] io_in_bits_resp_in_0_s3_full_pred_3_offsets_0
input [3:0] io_in_bits_resp_in_0_s3_full_pred_3_offsets_1
input [40:0] io_in_bits_resp_in_0_s3_full_pred_3_fallThroughAddr
input io_in_bits_resp_in_0_s3_full_pred_3_fallThroughErr
input io_in_bits_resp_in_0_s3_full_pred_3_is_br_sharing
input io_in_bits_resp_in_0_s3_full_pred_3_hit
input io_in_bits_resp_in_0_last_stage_ftb_entry_valid 透传到output,不做修改 来源是FTB
input [3:0] io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_offset
input [11:0] io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_lower
input [1:0] io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_tarStat
input io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_sharing
input io_in_bits_resp_in_0_last_stage_ftb_entry_brSlots_0_valid
input [3:0] io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_offset
input [19:0] io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_lower
input [1:0] io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_tarStat
input io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_sharing
input io_in_bits_resp_in_0_last_stage_ftb_entry_tailSlot_valid
input [3:0] io_in_bits_resp_in_0_last_stage_ftb_entry_pftAddr
input io_in_bits_resp_in_0_last_stage_ftb_entry_carry
input io_in_bits_resp_in_0_last_stage_ftb_entry_isCall
input io_in_bits_resp_in_0_last_stage_ftb_entry_isRet
input io_in_bits_resp_in_0_last_stage_ftb_entry_isJalr
input io_in_bits_resp_in_0_last_stage_ftb_entry_last_may_be_rvi_call
input io_in_bits_resp_in_0_last_stage_ftb_entry_always_taken_0
input io_in_bits_resp_in_0_last_stage_ftb_entry_always_taken_1
output io_out_s2_full_pred_0_br_taken_mask_0 完全透传传入值 prefix: io_in_bits_resp_in_
output io_out_s2_full_pred_0_br_taken_mask_1
output io_out_s2_full_pred_0_slot_valids_0
output io_out_s2_full_pred_0_slot_valids_1
output [40:0] io_out_s2_full_pred_0_targets_0
output [40:0] io_out_s2_full_pred_0_targets_1
output [40:0] io_out_s2_full_pred_0_jalr_target
output [3:0] io_out_s2_full_pred_0_offsets_0
output [3:0] io_out_s2_full_pred_0_offsets_1
output [40:0] io_out_s2_full_pred_0_fallThroughAddr
output io_out_s2_full_pred_0_is_br_sharing
output io_out_s2_full_pred_0_hit
output io_out_s2_full_pred_1_br_taken_mask_0
output io_out_s2_full_pred_1_br_taken_mask_1
output io_out_s2_full_pred_1_slot_valids_0
output io_out_s2_full_pred_1_slot_valids_1
output [40:0] io_out_s2_full_pred_1_targets_0
output [40:0] io_out_s2_full_pred_1_targets_1
output [40:0] io_out_s2_full_pred_1_jalr_target
output [3:0] io_out_s2_full_pred_1_offsets_0
output [3:0] io_out_s2_full_pred_1_offsets_1
output [40:0] io_out_s2_full_pred_1_fallThroughAddr
output io_out_s2_full_pred_1_is_br_sharing
output io_out_s2_full_pred_1_hit
output io_out_s2_full_pred_2_br_taken_mask_0
output io_out_s2_full_pred_2_br_taken_mask_1
output io_out_s2_full_pred_2_slot_valids_0
output io_out_s2_full_pred_2_slot_valids_1
output [40:0] io_out_s2_full_pred_2_targets_0
output [40:0] io_out_s2_full_pred_2_targets_1
output [40:0] io_out_s2_full_pred_2_jalr_target
output [3:0] io_out_s2_full_pred_2_offsets_0
output [3:0] io_out_s2_full_pred_2_offsets_1
output [40:0] io_out_s2_full_pred_2_fallThroughAddr
output io_out_s2_full_pred_2_is_jalr
output io_out_s2_full_pred_2_is_call
output io_out_s2_full_pred_2_is_ret
output io_out_s2_full_pred_2_last_may_be_rvi_call
output io_out_s2_full_pred_2_is_br_sharing
output io_out_s2_full_pred_2_hit
output io_out_s2_full_pred_3_br_taken_mask_0
output io_out_s2_full_pred_3_br_taken_mask_1
output io_out_s2_full_pred_3_slot_valids_0
output io_out_s2_full_pred_3_slot_valids_1
output [40:0] io_out_s2_full_pred_3_targets_0
output [40:0] io_out_s2_full_pred_3_targets_1
output [40:0] io_out_s2_full_pred_3_jalr_target
output [3:0] io_out_s2_full_pred_3_offsets_0
output [3:0] io_out_s2_full_pred_3_offsets_1
output [40:0] io_out_s2_full_pred_3_fallThroughAddr
output io_out_s2_full_pred_3_fallThroughErr
output io_out_s2_full_pred_3_is_br_sharing
output io_out_s2_full_pred_3_hit
output io_out_s3_full_pred_0_br_taken_mask_0 见对应prefix的输入
output io_out_s3_full_pred_0_br_taken_mask_1
output io_out_s3_full_pred_0_slot_valids_0
output io_out_s3_full_pred_0_slot_valids_1
output [40:0] io_out_s3_full_pred_0_targets_0
output [40:0] io_out_s3_full_pred_0_targets_1
output [40:0] io_out_s3_full_pred_0_fallThroughAddr
output io_out_s3_full_pred_0_fallThroughErr
output io_out_s3_full_pred_0_is_br_sharing
output io_out_s3_full_pred_0_hit
output io_out_s3_full_pred_1_br_taken_mask_0 见对应prefix的输入
output io_out_s3_full_pred_1_br_taken_mask_1
output io_out_s3_full_pred_1_slot_valids_0
output io_out_s3_full_pred_1_slot_valids_1
output [40:0] io_out_s3_full_pred_1_targets_0
output [40:0] io_out_s3_full_pred_1_targets_1
output [40:0] io_out_s3_full_pred_1_fallThroughAddr
output io_out_s3_full_pred_1_fallThroughErr
output io_out_s3_full_pred_1_is_br_sharing
output io_out_s3_full_pred_1_hit
output io_out_s3_full_pred_2_br_taken_mask_0 见对应prefix的输入
output io_out_s3_full_pred_2_br_taken_mask_1
output io_out_s3_full_pred_2_slot_valids_0
output io_out_s3_full_pred_2_slot_valids_1
output [40:0] io_out_s3_full_pred_2_targets_0
output [40:0] io_out_s3_full_pred_2_targets_1
output [40:0] io_out_s3_full_pred_2_fallThroughAddr
output io_out_s3_full_pred_2_fallThroughErr
output io_out_s3_full_pred_2_is_jalr
output io_out_s3_full_pred_2_is_call
output io_out_s3_full_pred_2_is_ret
output io_out_s3_full_pred_2_is_br_sharing
output io_out_s3_full_pred_2_hit
output io_out_s3_full_pred_3_br_taken_mask_0 见对应prefix的输入
output io_out_s3_full_pred_3_br_taken_mask_1
output io_out_s3_full_pred_3_slot_valids_0
output io_out_s3_full_pred_3_slot_valids_1
output [40:0] io_out_s3_full_pred_3_targets_0
output [40:0] io_out_s3_full_pred_3_targets_1
output [3:0] io_out_s3_full_pred_3_offsets_0
output [3:0] io_out_s3_full_pred_3_offsets_1
output [40:0] io_out_s3_full_pred_3_fallThroughAddr
output io_out_s3_full_pred_3_fallThroughErr
output io_out_s3_full_pred_3_is_br_sharing
output io_out_s3_full_pred_3_hit
output io_out_last_stage_ftb_entry_valid 完全透传传入的值
output [3:0] io_out_last_stage_ftb_entry_brSlots_0_offset
output [11:0] io_out_last_stage_ftb_entry_brSlots_0_lower
output [1:0] io_out_last_stage_ftb_entry_brSlots_0_tarStat
output io_out_last_stage_ftb_entry_brSlots_0_sharing
output io_out_last_stage_ftb_entry_brSlots_0_valid
output [3:0] io_out_last_stage_ftb_entry_tailSlot_offset
output [19:0] io_out_last_stage_ftb_entry_tailSlot_lower
output [1:0] io_out_last_stage_ftb_entry_tailSlot_tarStat
output io_out_last_stage_ftb_entry_tailSlot_sharing
output io_out_last_stage_ftb_entry_tailSlot_valid
output [3:0] io_out_last_stage_ftb_entry_pftAddr
output io_out_last_stage_ftb_entry_carry
output io_out_last_stage_ftb_entry_isCall
output io_out_last_stage_ftb_entry_isRet
output io_out_last_stage_ftb_entry_isJalr
output io_out_last_stage_ftb_entry_last_may_be_rvi_call
output io_out_last_stage_ftb_entry_always_taken_0
output io_out_last_stage_ftb_entry_always_taken_1

Other Meta information can be found in the corresponding sub-predictor documentation

_ubtb_io_out_last_stage_meta

_tage_io_out_last_stage_meta

_ftb_io_out_last_stage_meta

ittage_io_out_last_stage_meta[100:0]

位宽 信号名 备注
100 s3_provided 是否有结果
[99:97] s3_provider 提供结果的表项
96 s3_altProvided 是否有替代预测表项
[95:93] s3_altProvider 提供结果的替代预测表项
92 resp_meta_altDiffers 替代预测是否是弱信心的(FTB不算)
91 s3_providerU 主预测的useful bit
[90:89] s3_providerCtr 主预测给出的置信度
[88:87] s3_altProviderCtr 替代预测给出的置信度
86 resp_meta_allocate_valid_r 有空余的表项可供申请
[85:83] resp_meta_allocate_bits_r 申请哪个表中的表项
82 s3_tageTaken_dup_3 在不使用FTB的情况下始为true,使用FTB也为true
[81:41] s3_providerTarget 主预测给出的跳转地址
[40:0] s3_altProviderTarget 替代预测给出的跳转地址

6 - RAS Branch Predictor

RAS介绍

RAS stands for “Return Address Stack.” It helps determine branch behavior in programs by tracking return addresses. As previously mentioned, there are many branches in a program: if/else, switch/case, while/for loop, iteration, call/return, etc. The RAS branch predictor specifically targets call/return types.

function _add(a, b){
    return (a > 0 ? a : 0)  + (b > 0? b : 0);
}

function add(a, b){
    return a + b;
}

function sub(a, b){
    return a - b;
}

function main(){
    a = 1;
    b = 2;
    c = add(a, b);
    d = sub(a, b);
}

As shown above, the main function calls add and sub, and add calls the function _add. In this process, each call’s jump address and return address are fixed, and the return address can be obtained at the time of the call. The function call process is a “stack push and pop” process, so branch prediction can be performed using a “stack” structure: each time a call instruction is encountered, the current PC+4 (compressed instructions and regular instructions have different offsets) is pushed onto the stack; when a return instruction is encountered, a pop operation is performed, and the address obtained is the target jump address. In the block-based BPU, RAS cannot know whether the current block is a call or ret, so it relies on other predictors, using the results of previous predictors for RAS operations.

Specifically, in Xiangshan’s RAS predictor, at the s2 stage, it needs to determine whether the previous stage’s s2 output predicts a call or ret (i.e., the input signal io.s2_full_pred.hit_taken_on_call/ret is valid). If it’s a call, it pushes the subsequent instruction address onto the stack; if it’s a ret, it pops the address from the stack as the prediction result. Because in the BPU predictor, the result obtained at the s3 stage is assumed to be better than the s2 stage, the RAS predictor needs to check at the s3 stage. If the previous stage’s s3 prediction result is inconsistent with s2, the s3 result is taken, and it needs to determine whether to cancel or complete the stack operations of the previous s2 stage as needed. For example, if the s2 stage predicted a call instruction and performed a push operation, but s3 predicted a regular branch instruction with no need for any operation, the push must be canceled; if s2 predicted a regular branch instruction and s3 predicted a call, a push operation must be performed to complete.

RAS Stack Operations

In RAS design, function return addresses are predicted using a stack. Ideally, this section assumes that RAS can be backed up at any time, with the stack top pointer represented by sp and the predicted address represented by paddr. The basic RAS operations are as follows:

  1. PUSH

Since predictions can be wrong, the current stack state needs to be backed up (often referred to as a “snapshot” in software; this term is also used in subsequent content). When encountering a call instruction, get the return address of the call instruction addr = current pc + 4 (if it’s a compressed instruction, addr = pc+2), then push onto the stack: sp = addr; sp += 1.

  1. POP

For the same reason, take a snapshot of the current stack, marked as s. When encountering a ret instruction, the predicted jump address is paddr = sp, then pop: sp = sp - 1. Take a snapshot of the current stack, marked as s.

  1. Redirect Operation

Since the BPU predicts program branches, there are “correct predictions” and “wrong predictions.” When the CPU backend detects a branch prediction error, it performs a redirect operation, informing the BPU where the prediction was wrong and what the correct result is. During redirection, the RAS module receives the correct branch and the RAS stack information at the time of prediction. Depending on the type of correct branch instruction, the following snapshot recovery situations arise:

(1) The previously predicted instruction is actually a call instruction, and the push operation is executed based on the addr address provided in the redirect. (2) The previously predicted instruction is actually a ret instruction, and the pop operation is executed.

  1. Commit Operation

The commit operation is when the backend informs the frontend that the previous prediction was correct. Ideally, the RAS predictor doesn’t need to perform any operations at this time.

Implementation of RAS in Kunming Lake

In actual circuit design, an infinitely large stack is impossible, and constant backups are not feasible. Therefore, in Kunming Lake’s RAS implementation, the problems and solutions are as follows:

How to obtain RAS stack snapshots for each prediction?

To achieve the function of taking snapshots of the RAS stack, Kunming Lake adopts a linked representation based on a circular array. The design is as follows:

As shown above, a circular array is used for data management. The circular array has a starting address marked as BOS and a tail pointer marked as TOSW. The data between them are valid, and the data outside are free. Within the valid data, a linked structure represents the “RAS stack,” where each stack element records the number of its previous data. When performing stack operations, the corresponding previous element can be obtained through this number. The RAS stack’s bottom pointer shares the BOS. In the initial state S in the figure, the RAS stack elements are 0, 1, 3, 5. Element 5 records the position of its previous element 3, and element 3 records the position of its previous element 1. When a push operation is needed, the RAS stack top pointer TOSR = TOSW, the new element is stored at the new TOSR position 7, and the position of its previous element 5 is recorded in the new element, then TOSW is moved back (TOSW = TOSW+1). When a pop operation is performed, the RAS stack top pointer TOSR moves to the previous element’s position 3 based on the index saved in the stack top element. Therefore, under the condition that the stack does not overflow, the above RAS stack always allocates new data on the array through TOSW during normal Push/Pop operations, so all process states and intermediate data are saved. So, to restore to the state S, it only needs to reset the corresponding stack pointers. Therefore, in each prediction, the corresponding stack pointers (BOS, TOSR, TOSW) also need to be saved in the prediction result for later restoration in case of redirection. The advantage of this structure is that it can save complete process data, but frequent push operations can lead to large space resource consumption.

Chain RA storage space waste?

Since the prediction result is correct after commit, the stack will not roll back. When the RAS predictor receives the commit message of the prediction block “P,” it will no longer receive the redirect message of block P, so the snapshot taken during the push operation for block P will not be used again. Therefore, the RAS stack elements can be categorized, with uncommitted elements stored in a “linked” structure and committed elements stored in a regular stack structure (the original RAS stack is split into two parts: the uncommitted part stored in a snapshot-saving linked structure and the committed part stored in a regular stack structure). The optimized RAS structure is shown below:

As shown above, based on the normal call/ret predictions and commit call/ret, the original RAS stack can be split into two independent stacks, called the “speculative stack” (spec_stack, linked structure) and the “commit stack” (commit_stack, regular structure). Due to the change in stack structure, the specific Pop/Push operations are as follows:

  1. Encounter normal call and ret: (1) The call predicted by the prediction block is correct, and a push operation is performed on the spec_stack, specifically the linked stack Push process mentioned above.

(2) The ret predicted by the prediction block is correct, and the stack top of the spec_stack is used as the prediction value, then a pop operation is performed. If the spec_stack is empty, the stack top element of the commit_stack is used as the prediction result (no pop operation is performed).

  1. Commit operation: (1) The FTQ execution result is call correct, and a regular push operation is performed on the commit_stack.

(2) The FTQ execution result is ret correct, and a regular pop operation is performed on the commit_stack.

  1. Redirect operation: (1) Obtain the stack pointers (BOS, TOSR, TOSW, ssp) from the redirect message during the previous prediction and cover the current pointer values to complete the speculative stack rollback.

(2) This operation does not affect the commit stack.

How to handle when the S3 prediction result is inconsistent with S2 at the input end?

Since the S3 result is assumed to be better than S2, the RAS stack needs to be repaired again in case of inconsistency. The specific inconsistency and corresponding repair operations are shown in the table below:

S2 Pred. Res. S3 Pred. Res. Repair Operation
push keep pop
keep pop pop
pop keep push
keep push push

S2 and S3 operations do not exist, pop/push or push/pop scenarios (Why not exist?)

Other Optimizations

  1. Each element in the RAS stack has a counter, which is used to save repeated values (for recursive calls). For example, when the address pushed for the first time is 0xff00, and the address pushed for the second time is also 0xff00, only the counter of the top element of the stack needs to be incremented, and there is no need to actually push the address onto the stack.

Interface Description

In the RAS predictor, the core component is the RASStack, with the following interface:

接口名称 功能描述 接口名称 功能描述
in.spec_push_valid 预测有Call指令,Spec压栈 in.s2_fire s2信号有效
in.spec_pop_valid 预测有Ret指令,Spec出栈 in.s3_fire s3信号有效
in.spec_push_addr Ret地址 in.s3_cancel s3和s2的预测结果不一样,需要撤销s2的操作
out.spec_pop_addr RAS的栈顶数据 in.s3_meta s3需要的s2时的现场信息
out.ssp commit栈顶指针 in.s3_missed_pop s3判断需要再次进行pop
out.sctr commit栈顶重复元素计数器 in.s3_missed_push s3判断需要再次进行push
out.nsp commit栈顶,会被ssp覆盖 in.s3_pushAddr 需要再次push时的地址
out.TOSR spec栈栈顶指针 in.redirect_valid 需要重定向
out.TOSW spec栈数据分配指针 in.redirect_isCall 真实执行情况是Call
out.BOS spec栈栈低指针 in.redirect_isRet 真实执行情况是Return
in.commit_push_valid push操作正确 in.redirect_meta_ssp 之前预测时的现场信息ssp
in.commit_pop_valid FTQ执行结果为Call正确 in.redirect_meta_sctr 之前预测时的现场信息sctr
in.commit_push_addr 更新信息中的Ret地址 in.redirect_meta_TOSW 之前预测时的现场信息TOSW
in.commit_meta_TOSW 更新信息中的TOSW
in.redirect_meta_TOSR 之前预测时的现场信息TOSR
in.commit_meta_TOSR 更新信息中的TOSR in.redirect_meta_NOS 之前预测时的现场信息NOS
in.commit_meta_ssp 更新信息中的现场信息SSP in.redirect_callAddr 重定向地址
in.commit_meta_sctr 更新信息中的现场信息SCTR

The relationship between the RASStack module and the BasePredictor interface is as follows:

stack接口 转换过程 描述
s.spec_push_valid io.s2_fire(2) && s2_full_pred.hit_taken_on_call && !io.s3_redirect(2) s2输入有效,且上级预测为call跳转
s.spec_pop_valid io.s2_fire(2) && s2_full_pred.hit_taken_on_ret && !io.s3_redirect(2) s2输入有效,且上级预测为ret跳转
s.spec_push_addr s2_full_pred.fallThroughAddr + Mux(s2_full_pred.last_may_be_rvi_call, 2.U, 0.U) 上级预测器s2预测的fallThroughAddr(即PC+2),判断是否压缩指令是否需要 +2
s.redirect_isCall redirect.bits.level === 0.U && recover_cfi.pd.isCall
s.redirect_isRet redirect.bits.level === 0.U && recover_cfi.pd.isRet
s.redirect_meta_* redirect.bits.cfiUpdate.*
s.commit_push_valid io.update.is_call_taken call指令预测正确
s.commit_push_valid io.update.is_ret_taken ret指令预测正确
s.commit_push_addr update.ftb_entry.getFallThrough(update.pc) + Mux(update.ftb_entry.last_may_be_rvi_call, 2.U, 0.U) 根据是否为压缩指令,进行地址+2或者+0
s.commit_meta_* io.update.bits.meta.asTypeOf(new RASMeta)
io.out.last_stage_spec_info.* s3_meta.* s3_meta = RegEnable(s2_meta, io.s2_fire(2))由s2_meta延迟一怕得到
io.out.last_stage_meta s3_meta
io.out.s2.full_pred.*.jalr_target :=stack.spec_pop_addr 预测地址(栈顶地址,只预测ret)
io.out.s3.full_pred.*.jalr_target :=RegEnable(stack.spec_pop_addr, io.s2_fire(2)) 由s2延迟一拍得到
io.out.s2/3.full_pred.targets.last :=Mux(s2/3_is_jalr, s2/3_jalr_target, s2/3_last_target_in) 如果时call执行,更新targets.last的结果

Timing Description

In RAS, there are only 2 stages, S2 and S3.

Main tasks in S2

  1. Based on the S2 prediction result of the previous predictor, complete the prediction through the push/pop process and obtain the result spec_pop_addr.
  2. Perform update operations based on commit signal.

Main tasks in S3

  1. Based on the prediction result in the previous predictor S3 and the operations in S2, determine whether to undo the Pop/Push operation. The predictor assumes that the S3 prediction result is better than S2. If S2 and S3 predictions are inconsistent, the RAS predictor accepts the S3 result for stack operation.

  2. The prediction process in S3 is the same as in S2, but the data is different.

  3. Perform redirection operations (redirection information obtained from the previous cycle).

Since the RASStack appears to complete its tasks within a cycle, data bypassing is needed inside the stack to cache data for processing.