This section of the documentation will provide a detailed introduction to each module of the Xiangshan Branch Prediction Unit, including the BPU top-level and five sub-predictors.
In each module’s documentation, we will provide a detailed explanation of the module’s role in the Xiangshan Branch Prediction Unit, as well as the module’s algorithm principles, structure, and timing.
Students responsible for verifying a specific module should read the corresponding documentation thoroughly and understand it in conjunction with the code. Other documents can also be read to help you understand the overall functionality of the Xiangshan Branch Prediction Unit. During the understanding process, you may need to constantly review the basic design concepts and interface signals described in previous documents.
1 - BPU Top Module
The overall function and structure of the BPU top level have been roughly described in previous documents. For those verifying the BPU top level, a more detailed description is needed. Because the BPU top level has many functions, this section divides it into several major functional points and describes each in turn. Even so, the top level contains many details that can only be fully understood by reading the code.
Generator Maintenance Method
From the basic design documents of Xiangshan, we know that the BPU top level maintains various variables in the s0 cycle through generators, such as PC, branch history, etc. The core concept is to decide which pipeline level’s result to adopt through the redirection signal of the prediction result.
There are a total of 6 generators in the BPU top level:
npcGen maintains the PC
ghistPtrGen maintains the global history pointer
ghvBitWriteGens maintains global history write data
foldedGhGen maintains the folded history
lastBrNumOHGen maintains the position of the last effective branch instruction in the previous cycle
aheadFhObGen maintains the oldest position of the branch history
Except for npcGen, the rest of the generators will be introduced in this document. In this section, we will focus on the method of generating the next prediction for the generators.
In the code, you can see generators defined in a similar way:
val npcGen = new PhyPriorityMuxGenerator[UInt]
Next, the code registers the data sources for the generators through multiple statements:
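The registration statements look roughly like the following sketch. The signal names and the exact register() signature here are illustrative assumptions, not copied from the RTL:

```scala
// Illustrative: each register() call adds one data source to npcGen.
// Later registrations have higher priority when several valid signals are high at once.
npcGen.register(true.B, s0_pc_reg)                    // s0 blocked (stalled) value, lowest priority
npcGen.register(s1_valid, s1_target)                  // data updated from the s1 prediction result
npcGen.register(s2_redirect, s2_target)               // data updated from the s2 prediction result
npcGen.register(s3_redirect, s3_target)               // data updated from the s3 prediction result
npcGen.register(redirect_req.valid, redirect_target)  // external redirect, highest priority
```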
Each line is called a registration. In a registration, the first signal parameter is the data valid signal, and the second signal parameter carries the data itself. The priority of each data source is determined by registration order: the later it is registered, the higher its priority. Therefore, within the same cycle, the priority from low to high is as follows:
s0 blocked data
Data updated based on s1 prediction results
Data updated based on s2 prediction results
Data updated based on s3 prediction results
Data in external redirection of BPU
In this way, when the redirection of the prediction result is valid, we can avoid using the earlier pipeline level’s prediction result and adopt the corrected prediction result. This allows us to handle external redirection requests with the highest priority.
We can conclude the method by which all generators generate s0 signals: Among all data valid signals, if only one is valid, the corresponding data is selected; if multiple data valid signals are valid, the data with the highest priority is selected.
Global Branch History
We know that the global branch history is maintained at the BPU top level, and the maintenance strategy is consistent with the PC maintenance strategy. That is, after the prediction result is generated at each stage of the pipeline, the global branch history is updated based on the corresponding signals.
The top level defines two sets of signals to maintain the global branch history:
ghv stores the global branch history (maximum length 256)
ghist_ptr global branch history pointer, pointing to the current position of the global branch history
Similar to s0_pc, s1_pc, s2_pc, the BPU top level also maintains signals for each stage of the global history pointer: s0_ghist_ptr, s1_ghist_ptr, s2_ghist_ptr. However, the content in ghv is fixed in position, and we only use ghist_ptr to locate where the current global branch history starts.
Calculating the Current Global Branch History with ghist_ptr
The use of ghist_ptr is only visible at the BPU top level. What we pass to the sub-predictors is the global branch history after the data in the global history register is shifted based on ghist_ptr. In the global branch history obtained by the sub-predictor, the least significant bit corresponds to the newest bit of the global branch history, and the most significant bit corresponds to the oldest bit.
So how is the shifting done? First, let’s see how the global history is stored in ghv.
```
n                                       0
|========== ghist ==========>|========>|
                             ^
                         ghist_ptr
```
As shown in the figure above, the sequence represents the entire ghv register, and ghist_ptr points to a position in ghv. This position represents the newest bit of the global branch history. When a new global history record needs to be added, ghist_ptr is first decremented by 1, and then this bit is written to the position it points to. When ghist_ptr is decremented to 0, it will loop back to point to the highest position, thereby overwriting the previously written global branch history.
In any case, starting from the position pointed to by ghist_ptr, the history gets older as the index increases. Therefore, when we need to compute the current global branch history, we only need to circularly right-shift the ghv register by ghist_ptr positions.
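A minimal sketch of this circular right shift (assuming ghv is the 256-bit history register and ghist_ptr holds the index of the newest bit; not the RTL verbatim):

```scala
import chisel3._
import chisel3.util._

// Recover the current history so that bit 0 is the newest record.
def getHist(ghv: UInt, ghistPtr: UInt, histLen: Int): UInt = {
  val doubled = Cat(ghv, ghv)             // duplicating ghv turns the wrap-around into a plain shift
  (doubled >> ghistPtr)(histLen - 1, 0)   // circular right shift by ghist_ptr positions
}
```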
Updating Global Branch History
The strategy for updating the global branch history is consistent with the strategy for updating the PC: at each pipeline stage, a pointer for that stage and an update description of ghv are generated from the stage's prediction result, and all of them are sent to the relevant generators for processing.
The update description of ghv is some information used to guide the update of the ghv register. Xiangshan BPU maintains two pieces of information to fulfill this duty:
ghv_wdata the data that needs to be written into ghv
ghv_wens the write bit mask
For the final update, only the bits identified by ghv_wens need to be written with the corresponding bits of ghv_wdata.
Therefore, each pipeline stage needs to generate three sets of information: ghist_ptr, ghv_wdata, ghv_wens.
Specifically, the prediction result can contain up to two branch instructions. We only need to set these pieces of information according to the actual situation. Here are some examples:
Only the first slot is valid, and the conditional branch instruction in it is predicted as not taken. Then the next position of ghv_wdata is set to 0, the corresponding position of ghv_wens is set to 1, and ghist_ptr is decremented by one.
Both slots contain conditional branch instructions, the first is predicted as not taken, and the second is predicted as taken. At this time, ghist_ptr should be decremented by two, and the other two pieces of information should indicate writing 01 to ghv.
Here, only ghv_wdata is maintained by a generator (the ghvBitWriteGens generator); ghv_wens is not maintained by a generator. A small trick is used: the final ghv_wdata output by the generator is the value from the selected stage, while the final ghv_wens is the bitwise OR of the ghv_wens of all stages.
This works because:
If a later pipeline stage is valid, the global history pointer is restored to an older position, so even if an earlier stage's ghv_wens modifies bits at newer positions, those bits lie outside the visible history.
If only an earlier pipeline stage is valid, the global history pointer continues to advance to newer positions, and the later stages will not assert ghv_wens because their redirects are not valid.
Branch Folded History
The branch folded history passed to the predictor is also maintained by the BPU top level. To shorten the update delay of the folded history, the BPU maintains many variables to support the fast update of the branch folded history. We will focus on this strategy and introduce the function of each variable.
Before we start, let’s first look at how the branch folded history is defined and its structure.
Branch Folded History
If you have checked the BPU global interface documentation, you will know that the sub-predictor receives an array of bit vectors of different lengths, representing various lengths of folded history, and these folded histories are compressed from the global branch history.
For the global branch history, we have a register that stores the global branch history with a length of 256. For example, suppose the length of the global branch history is 15 bits, and after shifting, we get a branch history like this: the least significant bit is the newest history record, and the most significant bit is the oldest history record.
At this time, if we need to generate a 6-bit folded history from these 15 bits, we will use an XOR strategy for compression. The specific process is as follows:
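(Illustrative arrangement, reconstructed from the description; f[i] denotes bit i of the folded history, and the rightmost column is the newest bit.)

```
h[ 5]  h[ 4]  h[ 3]  h[ 2]  h[ 1]  h[ 0]
h[11]  h[10]  h[ 9]  h[ 8]  h[ 7]  h[ 6]
                     h[14]  h[13]  h[12]
-----------------------------------------  XOR each column
f[ 5]  f[ 4]  f[ 3]  f[ 2]  f[ 1]  f[ 0]
```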
That is, after arranging it as shown above, perform an XOR operation on the values at each position, and the result is the folded history of length 6.
Method for Updating Branch Folded History
Now we want to update this branch folded history. When a new history bit is inserted into the global branch history, it is inserted at the least significant bit, so the original h[0] becomes h[1]. To obtain the new folded history we could simply redo the XOR, but that is inefficient, because the XOR chains can become very long. Instead, let us examine the effect of a single update on the folded history.
In the example above, after a new history bit is inserted, the 6-bit folded history is generated from the following arrangement (old indices are kept so it can be compared with the arrangement before insertion):
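(Illustrative reconstruction; h[new] is the newly inserted bit, and h[14] has fallen out of the 15-bit window.)

```
h[ 4]  h[ 3]  h[ 2]  h[ 1]  h[ 0]  h[new]
h[10]  h[ 9]  h[ 8]  h[ 7]  h[ 6]  h[ 5]
                     h[13]  h[12]  h[11]
```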
Apart from the two changed bits, the content has simply shifted by one position, with h[5] and h[11] moving from the most significant bit of the folded history to the least significant bit. So, for the compressed history, this is just the familiar cyclic left shift!
However, only two bits have changed: one is the newly inserted h[new], and the other is the discarded h[14]. h[new] must be in the first position, and the discarded position is fixed. Therefore, to complete an update, we only need to know the value of the newly inserted history and the oldest bit of the previous history. After the cyclic shift, modifying these two positions according to the actual situation will give us the updated folded history.
Implementation of Update Method
To achieve this update in the top-level BPU, two additional variables are maintained:
ahead_fh_oldest_bits: the oldest bit of the global branch history, with additional bits stored before it
last_br_num_oh: the slot number of the last effective branch instruction in the previous prediction
An optimization in the timing of operations occurs here because the global history pointer can only be updated based on the branch outcome when the pipeline-level prediction result is available. Updating the oldest bit after updating the global history pointer would increase the latency. Therefore, we maintain the branch outcome and update the oldest bit when it is needed in the next cycle.
The oldest bit also needs to be maintained further back because after updating using the branch outcome, the relatively newer bits will become the oldest bits.
Thus, there are three generators related to the folded history: foldedGhGen, lastBrNumOHGen, and aheadFhObGen.
Information required for each update of the folded history:
Folded history information before the update
Oldest bit of the global branch history (ahead_fh_oldest_bits)
Last prediction’s branch outcome (last_br_num_oh)
Whether there is a branch instruction in this update
The branch outcome of this update: the slot number of the last effective branch instruction
For each update of the folded history, the true oldest bit needs to be determined based on last_br_num_oh and ahead_fh_oldest_bits. Then, based on the oldest bit and the branch outcome of this update, several bits are modified, and finally, a cyclic left shift completes the update operation.
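A minimal sketch of this fast update for a single folded history and a single inserted history bit (L is the folded length, H is the history length being compressed; the function and signal names are assumptions, not the RTL):

```scala
import chisel3._
import chisel3.util._

def foldedUpdate(fh: UInt, L: Int, H: Int, newTaken: Bool, oldestBit: Bool): UInt = {
  val rotated   = Cat(fh(L - 2, 0), fh(L - 1))                 // cyclic left shift by one
  val insertNew = rotated ^ newTaken.asUInt                    // inserted bit lands at folded position 0
  val removeOld = insertNew ^ (oldestBit.asUInt << (H % L))    // XOR out the discarded oldest bit
  removeOld(L - 1, 0)
}
```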
Pipeline Control Method
Pipeline control is the core function of the BPU, with complex logic. All pipeline control signals in the top-level BPU are as follows:
s1_valid, s2_valid, s3_valid: indicate the corresponding pipeline data is valid
s1_ready, s2_ready, s3_ready: indicate that the corresponding pipeline stage is ready to accept the prediction from the previous stage
s1_components_ready, s2_components_ready, s3_components_ready: indicate that the sub-predictors of the corresponding pipeline stage are ready
s0_fire, s1_fire, s2_fire, s3_fire: successful handshake signals, indicating that the pipeline data is valid and has been successfully passed to the next pipeline
s1_flush, s2_flush, s3_flush: indicate whether the current pipeline needs flushing
s2_redirect, s3_redirect: indicate whether the current pipeline needs to redirect due to a different prediction result
valid, ready, and fire
We will introduce the purpose of each signal step by step. First, let’s look at the fire signal, which indicates a successful handshake in the pipeline, meaning that the data has been successfully passed to the next pipeline. This signifies the end of the current cycle and the end of the prediction for this pipeline stage, with the prediction for the next pipeline stage about to begin in the next cycle.
This requires two conditions:
valid: The data in the current pipeline stage is valid.
ready: Indicates whether the next pipeline stage in the BPU top level and the predictor are ready.
When these two signals are high simultaneously, the fire signal is valid, indicating a successful handshake. If we isolate a single prediction, the timing should look like this (in reality, most of the time, each pipeline is valid continuously):
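An illustrative waveform for one isolated prediction flowing through the pipeline, assuming every stage is ready (a simplified view reconstructed from the description, not a captured waveform):

```
cycle    : 0  1  2  3
s0_fire  : 1  0  0  0
s1_valid : 0  1  0  0
s1_fire  : 0  1  0  0
s2_valid : 0  0  1  0
s2_fire  : 0  0  1  0
s3_valid : 0  0  0  1
s3_fire  : 0  0  0  1
```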
Of the four sets of signals mentioned earlier, component_ready is generated by the predictor, while the rest are maintained by the BPU top level, with only the fire set of signals exposed to the sub-predictor.
Next, let’s take s2 as an example to see how each signal is maintained.
ready Signal
s2_ready := s2_fire || !s2_valid
This assignment statement is a combinational circuit assignment, meaning that s2_ready is directly related to s2_fire and s2_valid for this cycle, with two possible situations:
If s2_valid is invalid for this cycle, indicating that the s2 pipeline stage is empty and can accept new data, then s2_ready is valid.
If s2_valid is valid for this cycle, the s2 pipeline stage holds data that has not yet been passed to the next stage; but if s2_fire is also valid, that data will be handed over, so s2_ready is valid and new data may flow into s2 in the next cycle.
valid Signal
Maintaining s2_valid is relatively simple, as it is only related to the s1_fire and s2_fire signals. The relationship is as follows:
When s1_fire is valid, data is flowing in, so s2_valid will be valid in the next cycle.
When s2_fire is valid, data is flowing out, so s2_valid will be invalid in the next cycle.
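A minimal sketch of this maintenance (ignoring flush, which is discussed below; not the RTL verbatim):

```scala
when(s1_fire) {
  s2_valid := true.B      // data flows in from s1
}.elsewhen(s2_fire) {
  s2_valid := false.B     // data flows out to s3
}
```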
fire Signal
The fire signal is somewhat special, but for intermediate pipeline stages, its maintenance is straightforward. For example,
s2_fire := s2_valid && s3_components_ready && s3_ready
This simply requires considering the valid of the current pipeline stage and the ready of the next pipeline stage.
However, for s0_fire, since there is no valid signal, it is directly set to s1_components_ready && s1_ready.
For s3_fire, as there is no ready signal for the next stage, it is directly set to s3_valid.
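Putting the above together, the fire chain can be sketched as follows (a summary of the text; the RTL may add further conditions):

```scala
s0_fire := s1_components_ready && s1_ready
s1_fire := s1_valid && s2_components_ready && s2_ready
s2_fire := s2_valid && s3_components_ready && s3_ready
s3_fire := s3_valid
```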
Incorporating Flush and Redirect
When there is a different prediction result in the pipeline, a redirection signal is generated, and the pipeline needs to be flushed. flush and redirect handle these two tasks. redirect indicates whether the current pipeline stage needs redirection, while flush indicates whether the current pipeline stage needs flushing.
redirect Signal
The generation of s2_redirect is as follows:
s2_redirect := s2_fire && s2_redirect_s1_last_pred
This means that when s2_fire is valid and the prediction result of s2 is different from the prediction result saved from s1, this signal is valid. Later, this signal will be connected to the input of the sub-predictor and the output of the BPU prediction result, guiding the sub-predictor and FTQ to restore their states.
flush Signal
The flush signal is used to guide the flushing of the pipeline. For example, when s3_redirect is valid, it means that the incorrect prediction result has entered the pipeline, and both s1 and s2 are now predicting based on the wrong result. Therefore, the pipeline needs to be flushed to stop the previous stages and wait for new prediction results to enter.
This means that if a pipeline stage needs redirection, all previous stages will be flushed. The purpose of flush is to guide the valid signal. If the valid signal is valid in this cycle but the fire signal is not, it means that the incorrect data has not been taken by the next pipeline stage. In this case, when flush is valid, valid will immediately become invalid in the next cycle, avoiding storing incorrect data in the pipeline for a long time.
However, the effect of flush on the valid signal varies depending on each pipeline stage. For example:
For the s1 pipeline stage, although flush is valid, if s0_fire is valid, indicating that new data is flowing in, valid will remain valid in the next cycle.
For the s2 pipeline stage, if flush is valid, valid will definitely be invalid in the next cycle (because s1 is also flushed), indicating that valid can be directly set to invalid. However, there is a special case where s2_redirect occurs but s2_flush is not set to valid. In this case, if s1_fire occurs, the incorrect prediction result of s1 may also flow in. In this case, s2_valid needs to be determined based on the s1_flush signal.
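A sketch of the flush chain described above, together with its effect on s2_valid (illustrative; the RTL handles more cases):

```scala
// Flush chain: a stage's redirect flushes all earlier stages.
s2_flush := s3_flush || s3_redirect
s1_flush := s2_flush || s2_redirect

// Simplified effect on s2_valid.
when(s2_flush) {
  s2_valid := false.B
}.elsewhen(s1_fire) {
  s2_valid := !s1_flush          // only accept s1 data that is not itself being flushed
}.elsewhen(s2_fire) {
  s2_valid := false.B
}
```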
The use of flush is complex; further details need to be understood by referring to the code.
Redirect Recovery Logic
When the redirect request from FTQ to BPU takes effect, it indicates that all stages of the pipeline have incorrect predictions, and all stages should be flushed. This can be achieved by setting s3_flush to be valid. Therefore, we have:
s3_flush := redirect_req.valid
In the BPU, the redirect request is delayed by one cycle before being officially used. Therefore, the response of s1_valid to the flush signal needs to be changed. When the redirect request (before delay) is valid, s1_valid in the next cycle is immediately set to invalid, without the need to refer to the s0_fire signal.
At this point, generators such as npcGen also need to directly use the data from the redirect request to generate, which is equivalent to redirecting the BPU’s state to the state before the error occurred. However, it is important to note that the default redirect level in BPU is flushAfter, which means that the redirect request corresponds to a predicted erroneous instruction, and the BPU assumes that although this instruction was predicted incorrectly, it has been corrected and executed by the backend. Therefore, the next prediction can start directly from the next instruction.
Therefore, when recovering from a redirect, it is necessary not only to restore the information from the redirect interface, but also to update the execution status of this predicted erroneous instruction in the history.
2 - uFTB Branch Predictor
Introduction to uFTB
uFTB is the first predictor among all the predictors in Xiangshan, and it serves as the cornerstone for other predictors to generate prediction results. uFTB works in the s1 stage. It can generate prediction results within the current cycle after obtaining s1_pc and output them in the s1 channel, without modifying other channels. It provides the position of the branch instruction and the target of the instruction. Subsequent predictors will further predict based on this result.
Its essence is an FTB item cache, which stores FTB items, and the basic prediction result will be directly generated from the read-out FTB item.
Therefore, before you start reading the document, make sure you understand the FTB items and their meanings, as well as the specific details of the prediction result interface.
Functionality of uFTB
Cache FTB items and generate one-cycle prediction results: uFTB maintains a small FTB item cache. After receiving PC, it reads out the FTB item corresponding to the PC within one cycle and generates an s1 stage prediction result from the FTB item.
Maintain two-bit saturating counters to provide basic conditional branch results: uFTB maintains two-bit saturating counters for each line of the FTB item cache. The direction prediction result is reflected in the prediction result output of uFTB.
Update the FTB cache and two-bit saturating counters based on update requests
uFTB Cache Structure
As mentioned above, uFTB is essentially a small cache that stores FTB items. Its approximate structure is shown in the figure below.
In the current version of Xiangshan, uFTB has a total of 32 cache lines, each cache line is called FauFTBWay, and one FTB item can be stored in each cache line.
When s1 pipeline is valid, uFTB will use s1_pc to determine which item of the uFTB cache to read out. The cache is indexed based on the tag field in PC, which is defined as pc[16:1], i.e., taking 16 bits from PC as an identifier to match a certain line in the cache.
Each line in the cache, i.e., the data request interface in FauFTBWay, has three items:
req_tag: Input tag identifier extracted from pc
resp: Output the FTB item stored in this line
resp_hit: Output indicates whether the FTB item in this line matches req_tag
uFTB connects the tag to the data request interface of each line in the cache and selects the hit FTB item based on the resp_hit signal. Subsequent steps will generate a complete prediction result based on this FTB item.
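A minimal sketch of this per-way tag match and hit selection (the ways collection and the Mux1H usage are assumptions made for illustration):

```scala
val req_tag   = s1_pc(16, 1)                          // 16-bit tag taken from the PC
ways.foreach(_.io.req_tag := req_tag)                 // broadcast the tag to all 32 FauFTBWays
val hit_oh    = VecInit(ways.map(_.io.resp_hit))      // (at most) one-hot hit vector
val hit       = hit_oh.asUInt.orR                     // any way hit
val hit_entry = Mux1H(hit_oh, ways.map(_.io.resp))    // select the hit FTB item
```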
Two-Bit Saturating Counters
uFTB maintains two-bit saturating counters for each line of the FTB item cache. As we know, an FTB item can store up to two conditional branch instructions, so each line’s two-bit saturating counters also have two, responsible for providing rough prediction results for the conditional branch instructions in them.
When indexing the FTB item, uFTB also indexes the corresponding two-bit saturating counters.
When the prediction result is generated, it will be based on the FTB item and the contents of the two two-bit saturating counters corresponding to it.
Prediction Result Generation
After indexing the corresponding FTB item and two two-bit saturating counters based on s1_pc, uFTB needs to generate a prediction result based on them. The prediction result generated by uFTB will be outputted through the s1 channel when s1 pipeline is valid, and will not be modified for s2 and s3 channels.
The signal generation method in the s1 prediction result can refer to the following list:
hit: Whether the FTB item hits
Generation method: one of the resp_hit signals from the FauFTBWays is valid
slot_valids: Slot valid bit, indicating whether each slot in the ftb item is valid
targets: Jump target address corresponding to each slot
offsets: The offset of each instruction in the slot relative to the starting address of the prediction block
is_jal: Whether the prediction block contains a jal instruction
is_jalr: Whether the prediction block contains a jalr instruction
is_call: Whether the prediction block contains a call instruction
is_ret: Whether the prediction block contains a ret instruction
last_may_be_rvi_call: Signal indicating that the last slot in the prediction block may be an RVI type call instruction
is_br_sharing: Whether the last slot (tailSlot) stores a conditional branch instruction
Generation method: exported from the corresponding fields in the FTB entry.
fallThroughErr: The pftAddr recorded in the FTB entry is incorrect.
Generation: Compare whether the end address represented by pftAddr is greater than the start address of the predicted block. If it is smaller, an error has occurred, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
fallThroughAddr: The end address of the predicted block.
Generation: If fallThroughErr is invalid, it is generated based on pftAddr. Otherwise, it is set to the start address + prediction width.
br_taken_mask: Branch prediction result, with each branch (slot) corresponding to a bit indicating whether the branch is predicted as taken.
Generation: Generated based on the always_taken field in the FTB entry and the result indicated by the two-bit saturating counter.
jalr_target: The jump target of the jalr in this predicted block.
Generation: From the jump target in the tailSlot of the FTB entry.
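As an example of the br_taken_mask item above, a possible composition, assuming the counter's most significant bit indicates taken (signal names are illustrative, not the RTL):

```scala
val br_taken_mask = VecInit((0 until 2).map { i =>
  hit_entry.always_taken(i) ||   // branch marked always-taken in the FTB item
  hit_ctrs(i)(1)                 // MSB of the corresponding two-bit saturating counter
})
```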
uFTB Update
The update of uFTB involves updating the FTB entry cache and the two-bit saturating counter, with the update content obtained through the update interface.
In the uFTB predictor, the reading and writing of the cache and the two-bit saturating counter do not conflict, so we do not need to consider timing conflicts between reading and updating and can consider them as two independent parts.
FTB Cache Update
The update process of the FTB cache is simple. The update channel has already specified the PC and the newly generated FTB entry, so it only needs to be written to the specified position in the cache.
FTB cache updating requires two cycles:
In the first cycle, calculate the following based on the signals in the update:
Which row in the cache corresponds to the update request: the PC extracted from the update request is sent to the update request channel of each FauFTBWay, and the hit signal returned by each row is computed.
In the second cycle, update according to the position calculated in the first cycle. If no row is hit, uFTB will use a pseudo-LRU replacement algorithm to select the row to be written.
Specifically, the ftb_entry signal group in the update channel contains the complete information of the new FTB entry, which is sent to the cache row that needs to be updated.
Two-Bit Saturating Counter Update
The update of the two-bit saturating counter needs to be combined with the actual execution of the subsequent program and the branch instruction information recorded in the FTB entry, which can be obtained from the update channel.
The update of the two-bit saturating counter also requires two cycles:
In the first cycle, determine which slots' two-bit saturating counters need to be updated. A slot's counter is updated only if the following conditions are met:
The current branch instruction slot is valid and contains a conditional branch instruction.
The current branch instruction slot is not marked as always_taken. (Because after being marked as always_taken, the result of the two-bit saturating counter will not be used.)
The current branch instruction slot is not after the branch instruction slot where an actual jump occurred.
In the second cycle, update the saturating counter based on the mask generated in the first cycle and the br_taken_mask information in the update channel.
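A minimal sketch of a two-bit saturating counter update, as performed in the second cycle (an illustrative helper, not the RTL function):

```scala
// Increment on taken, decrement on not taken, saturating at 0 and 3.
def satUpdate2(ctr: UInt, taken: Bool): UInt =
  Mux(taken,
      Mux(ctr === 3.U, ctr, ctr + 1.U),
      Mux(ctr === 0.U, ctr, ctr - 1.U))
```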
TAGE-SC is the primary predictor for conditional branches in the Kunming Lake architecture, classified as an Accurate Predictor (APD). TAGE-SC can be seen as two relatively independent components: the prediction part TAGE and the verification part SC.
The Tagged Geometric History Length Predictor (TAGE) utilizes multiple prediction tables with different history lengths to exploit extensive branch history information. TAGE predicts whether a branch instruction will be taken or not taken. It consists of a base prediction table and multiple history tables. It first predicts using multiple history tables. If there is no prediction, it uses the prediction from the base table.
The Statistical Corrector (SC) is a statistical corrector. SC references the prediction results of TAGE and statistical bias results. Based on these two results, it corrects the final prediction result.
In Kunming Lake, each prediction block can have up to 2 branch instructions, so TAGE can predict up to 2 conditional branch instructions simultaneously. When accessing the various history tables in TAGE, the starting address of the prediction block is used as the PC, and two prediction results are retrieved based on the same global history.
TAGE Branch Predictor in Kunming Lake
Basic Functionality
The core idea of the TAGE predictor is to provide prediction results with different history lengths and select the most appropriate result for feedback. In the TAGE predictor, there are a total of 1+N history record tables, where N is a configurable option. In Kunming Lake, N=4.
The T0 table serves as the base predictor. During prediction, it directly looks up the 2-bit saturated counter corresponding to the address in the T0 table and makes a prediction from the recorded jump history information. Each T0 entry has only 2 bits, so the history states it can record are limited.
For tables other than T0, we use Tn to represent them. During table lookup, in addition to the PC, the global jump history information H is also used. When a match is found, the prediction (taken or not taken) is made from the entry's 3-bit saturated counter. The larger the n of a Tn table, the longer the history information it uses.
For each prediction, TAGE selects the table entry with the longest global jump history among all the hit Tn entries.
If the table entry exists and the prediction result confidence is high, it is used as the final prediction result.
If the confidence is low, another internal counter is used to determine whether to select that entry or T0 as the final prediction.
To save space, when retrieving the Tn table, the input jump history information H needs to be compressed, a process also known as history folding.
The table entries of each prediction table include the following elements:
T0 table (indexed directly by PC):
2-bit pred unsigned saturated counter (indicating prediction direction and confidence level)
Tn tables (indexed by the XOR of PC and folded global history):
1-bit valid bit
3-bit pred unsigned saturated counter
8-bit tag (used to verify that a hit is genuine, not coincidental)
1-bit useful bit for controlling entry expiration
For a prediction block, all tables may generate prediction results, requiring a selection process. Typically, the higher the Tn table number, the higher its priority.
Pipeline
TAGE contains two pipeline stages: the first stage calculates the index, and the second stage reads the results from the SRAM table.
Stage 0 (s0): Input to the first pipeline stage, usually the pc and folded history.
First pipeline stage operation: Calculate the index. Output to s1 via registers.
Stage 1 (s1): Input to the second pipeline stage, consisting of the index and other calculated data from the first stage.
Second pipeline stage operation: Memory access to SRAM, reading the prediction result. Output to s2 via registers.
Stage 2 (s2): Actual prediction result. TAGE uses 2 stages for prediction, with the prediction result ready for use in the third stage after 2 stages.
Data Structures
In Kunming Lake’s implementation, the T0 and Tn table structures are as follows:
Tn entry: 1-bit valid (valid bit), 3-bit ctr (saturated counter), 8-bit tag (hit check), 1-bit us (usefulness counter).
Each table has 4096 entries; odd-numbered entries correspond to the first branch instruction and even-numbered entries to the second.
For each table Tn, the length of its input “global branch history data H” varies during querying. Assuming the total prediction history length is S, Tn and Tn+1 may use the low x, low y bits of S (the lower bits are the newer history) as query inputs. Generally, the larger the n of table Tn, the longer the history information used, i.e., x<y.
During the query of table Tn, since the historical data H is “compressed,” it may lead to a situation where the result of one PC1^H1 matches another PC2^H2 (similar to a Hash collision), resulting in indexing into invalid data (predicting PC1 indexes to predicted PC2’s data). Therefore, TAGE provides a tag identifier for each table, using an 8-bit tag in the Kunming Lake implementation to reduce the probability of collisions. The calculation method and index method for tags are different; only when the tag calculation is the same, the query result is valid.
In the Tn table entry, in addition to the saturation counter ctr and tag, there is also a 1-bit usefulness counter. When this counter is 0, it is a weak entry, indicating that the entry can be reallocated for other uses; when it is not 0, it is a strong entry, indicating that the entry cannot be reallocated for other uses.
To avoid the situation where all usefulness bits are 1 and no new table entries can be allocated, TAGE uses the counter bankTickCtrs to periodically clear all usefulness bits to 0.
Retrieval Method for T0 and Tn Tables
For table T0, indexing is done using the PC[11:1] bits to index 2048 table entries, so for T0, there is no possibility of not finding a match.
For table Tn, in addition to PC[11:1], retrieval also requires the global branch history. In Kunming Lake, the top-level branch predictor maintains a 256-bit global history record GH, and the most recent n bits of GH are folded down to the x bits required by the sub-predictor: the n bits are divided into ceil(n/x) segments of length x, which are then XORed together bitwise. This folded history is denoted FH (Folded History); the specific process can be found in the [Branch Folding History section](../00_bpu_top/#Branch Folding History). When the TAGE predictor searches for a table entry in Tn, it uses an index and a tag, calculated as follows:
Calculation Formula
index = FH ^ ((pc >> 1) low-order bits)
tag = FH1 ^ FH2 ^ ((pc >> 1) low-order bits)
Where FH, FH1, FH2 represent the folded global branch history according to certain rules. For Tn, FH, FH1, and FH2 each have their own folding bit numbers, which may not be the same. In the Kunming Lake implementation, the configurations of the T0 and Tn tables are as follows:
| Table | FH length | FH1 length | FH2 length | Recent history length (GH bits used) |
| ----- | --------- | ---------- | ---------- | ------------------------------------ |
| T1 | 8 bits | 8 bits | 7 bits | lowest 8 bits, i.e., the newest 8 history bits are folded into FH, FH1, FH2 |
| T2 | 11 bits | 8 bits | 7 bits | lowest 13 bits, i.e., the newest 13 history bits are folded into FH, FH1, FH2 |
| T3 | 11 bits | 8 bits | 7 bits | lowest 32 bits, i.e., the newest 32 history bits are folded into FH, FH1, FH2 |
| T4 | 11 bits | 8 bits | 7 bits | lowest 119 bits, i.e., the newest 119 history bits are folded into FH, FH1, FH2 |
Note: pc >> 1 is used because the RISC-V C (compressed) extension is enabled, so instructions are 2-byte aligned and the lowest bit of the PC carries no information; it is therefore dropped before hashing.
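A sketch of the index/tag computation for one Tn table, following the formulas above (names and widths are illustrative; the exact way the RTL combines the folded histories may differ slightly):

```scala
val unhashedIdx = (pc >> 1).asUInt
val idx = (unhashedIdx ^ fh )(log2Ceil(nRows) - 1, 0)   // nRows = 4096 for the Tn tables
val tag = (unhashedIdx ^ fh1 ^ fh2)(7, 0)               // 8-bit tag
```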
Alternative Predictor
Since the Tn table uses saturation counters for prediction, there may be situations where the output result is “not confident.” For example, in Kunming Lake, for a 3-bit saturation counter, both 100 and 011 indicate a weak prediction. To provide more choices as references for this state, the TAGE predictor also provides an “alternative predictor” mechanism, which determines whether to select the prediction result of Tn or T0 when the Tn table predicts with low confidence.
In the Kunming Lake implementation, the “alternative predictor” is implemented based on the register group useAltOnNaCtrs. It consists of two paths of 128 4-bit saturation counters each, initialized to 0b1000. When TAGE makes a prediction, it uses PC(7,1) to index the corresponding saturation counter. If the value of this counter is greater than or equal to the preset value and the prediction result of Tn is not confident, it selects the result of T0; otherwise, it selects the result of Tn.
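A sketch of this selection, under the assumptions that a 3-bit counter value of 3 or 4 counts as a weak prediction and that the check against the preset value is a simple greater-or-equal comparison (names are illustrative):

```scala
val altCtr     = useAltOnNaCtrs(pc(7, 1))                        // 128-entry bank, indexed by PC(7,1)
val tageWeak   = tn_ctr === 3.U || tn_ctr === 4.U                // 3'b011 or 3'b100
val useAlt     = tageWeak && (altCtr >= altThreshold)
val finalTaken = Mux(tn_hit && !useAlt, tn_ctr(2), t0_ctr(1))    // MSB of the chosen counter
```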
Prediction Process
In summary, the prediction steps of the TAGE predictor in Kunming Lake are as follows:
Parallel indexing of T0 and Tn tables, selecting which table to use based on the hit result:
If a match to the tag of a Tn table is found, the potential prediction result is given by the saturation counter of the longest history Tn table.
If no match to a Tn table is found, the final prediction result is given by the saturation counter of the T0 table.
If the potential prediction result of the matched Tn table is a weak prediction (100,011), and the value of the corresponding 4-bit counter in the alternative prediction for PC is greater than or equal to a threshold, the result of the T0 table is used as the final result; otherwise, the prediction result of the Tn table is used as the final prediction result.
Training Process
Since the prediction process of TAGE involves many counters and tags, they need to be updated according to certain rules, a process known as training. This training process occurs in the BPU’s update stage, where the PC, branch history, and prediction correctness information are input. The training process for branch prediction in Kunming Lake is divided into several steps based on different conditions:
Update when T0 is the final prediction result: If a jump occurs (i.e., taken), increment the ctr saturation counter indexed by the pc; otherwise, decrement it.
When only T0 is hit, the following operations are performed:
If T0 is predicted correctly, no additional update is performed.
If T0 is predicted incorrectly, attempt to randomly allocate a new table entry in a Tn table. To allocate a new table entry, the original entry’s usefulness at the corresponding index must be 0. The new entry is initialized as a weak prediction with usefulness 0, and its tag is set to the newly calculated tag.
When both T0 and Tn are hit, the following operations are performed:
Tn is always updated: If a jump occurs, increment the ctr saturation counter indexed by the pc; otherwise, decrement it. It is important to note that “hit” means that the tag of the indexed entry matches the calculated tag.
If T0 and Tn produce the same result:
If predicted correctly, no additional update is performed.
If predicted incorrectly, attempt to allocate a new table entry in a table with a longer history than Tn. To allocate a new entry, the usefulness of the original entry at the corresponding index must be 0. The new entry is initialized as a weak prediction with usefulness 0, and its tag is set to the tag calculated using the new history information.
If T0 and Tn produce different results:
If Tn is correct, the entry’s usefulness is incremented.
If the result is still a weak prediction, the counter in the alternative prediction for T0 is decremented.
If Tn is incorrect, the entry's usefulness is decremented, and a new entry is allocated in a table with a longer history than Tn, in the same way as described above for the T0-only miss case.
If the result is still a weak prediction, the counter in the alternative prediction for T0 is incremented.
When a new table needs to be allocated, dynamic reset of the usefulness flag is performed.
Using a 7-bit bankTickCtrs register and calculating:
The number of allocatable tables a (with a longer history length than the current and corresponding index usefulness is 0)
The number of unallocatable tables b (with a longer history length than the current and corresponding index usefulness is not 0)
Update bankTickCtrs += Δ (saturated counter), where Δ = b - a.
When bankTickCtrs reaches its maximum value, reset all usefulness to 0.
Kunming Lake SC Branch Predictor
Basic Function Introduction
The SC (Statistical Corrector) branch predictor is a branch predictor based on historical statistical information. Similar to TAGE, SC typically has multiple tables Tn, each corresponding to different lengths of historical jump statistics. The difference is that in SC, when predicting based on the PC, each table Tn is accessed, and then SC adds up each hit table entry to calculate the total "saturated counter" jump information, and finally determines whether to jump based on the total jump information. Generally, SC uses "signed saturated counters", where a counter value greater than 0 indicates a jump, and less than 0 indicates no jump. The larger the absolute value of the counter, the higher the prediction confidence.
The SC predictor is also composed of multiple tables (e.g., T1, T2, T3, T4), but unlike the TAGE predictor it has no base prediction table T0. The Tn tables in SC hold 6-bit signed saturated counters. The indexing method for SC tables is as follows:
Calculation Method
index = FH ^ ((pc >> 1) low-order bits)
For each table, the number of entries and the folded history length used are as follows:
| Table | Number of entries | FH length | Folded history range |
| ----- | ----------------- | --------- | -------------------- |
| T1 | 512 | 0 | no folding |
| T2 | 512 | 4 | the lowest 4 bits of the history are folded into FH |
| T3 | 512 | 8 | the lowest 10 bits of the history are folded into FH |
| T4 | 512 | 8 | the lowest 16 bits of the history are folded into FH |
The formula for calculating the total statistical prediction result is as follows:
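(The formula below is reconstructed from the surrounding description and the numeric examples that follow.)

$$scCtrSum = \sum_{i=1}^{4} \left(2 \times ctr\_sc_i + 1\right)$$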
Where ctr_sc represents the signed saturated counter for each table. Left-shifting and adding one is for weight adjustment. The accumulated scCtrSum is the final prediction result of SC. If this value is greater than zero, the prediction is a jump; if it is less than zero, the prediction is no jump. The larger the absolute value, the higher the prediction confidence.
Typical data conversion results are as follows (extended to 9 bits to prevent overflow during calculation):
All are 6b100000 (strong no-jump), resulting in 9b100000100, with a value of -252.
All are 6b011111 (strong jump), resulting in 9b011111100, with a value of 252.
All are 6b000000 (weak jump), resulting in 9b000000100, with a value of 4.
All are 6b111111 (weak no-jump), resulting in 9b111111100, with a value of -4.
Prediction Process
Calculate the index of table Tn using the PC and historical information.
Query the index to obtain the saturation counters for all tables.
Sum up all the saturation counters obtained from all tables to get the final prediction result (take a jump for values greater than 0, no jump for values less than 0).
Training Process
Update the saturation counters during the update phase.
If the real instruction corresponding to PC jumps, increment the saturation counters corresponding to all tables.
If the real instruction corresponding to PC does not jump, decrement the saturation counters corresponding to all tables.
Kunming Lake TAGE-SC Branch Predictor
Why SC is Needed with TAGE
In some applications, some branch behaviors have a weak correlation with branch history or paths, showing a statistical prediction bias. For these branches, using counters to capture statistical biases is more effective than history-based branch prediction.
TAGE is very effective in predicting branches that are highly correlated with history, but it performs poorly for branches with statistical biases. For example, branches that have a small bias in one direction but are not strongly correlated with historical paths. To avoid this problem, an SC predictor can be added to the traditional TAGE predictor.
TAGE-SC Functionality
In the Kunming Lake TAGE-SC predictor, both the TAGE and SC prediction results P1 and P2 are obtained simultaneously, and then their results are accumulated P = P1 + P2. If the absolute value of P is greater than the 8-bit threshold sc_bank_thres, the predictor result P is used; otherwise, P1 is used as the final prediction result.
For dynamic adaptation, the threshold sc_bank_thres needs to change dynamically. Therefore, in the implementation, TAGE-SC uses a 5-bit sc_bank_ctr counter to adjust the threshold sc_bank_thres. Additionally, since Kunming Lake supports the simultaneous prediction of 2 branch instructions, the threshold register and its control counter are duplicated per branch.
Pipeline
The TAGE-SC predictor contains 3 pipeline stages, where the 2-stage pipeline of TAGE has been introduced, and the pipeline of the SC part is as follows:
Stage 0: Read PC and folded history into s0.
First Stage: Calculate the index from pc and FH to obtain s0_idx.
Stage 1: Read s0_idx from s0.
Second Stage: Find the counter data corresponding to s1_idx in SCTable and output to s1_scResps.
Stage 2: Read s1_scResps from s1.
Third Stage: Select whether to invert the prediction result based on s2_scResps and output to s2_disagree.
Stage 3: Read the result from s2_disagree as s3_disagree.
Prediction Process
In TAGE-SC prediction, the prediction result P1 of TAGE is represented by tage_ctr, and the prediction result P2 of SC is represented by scCtrSum. The prediction is divided into four steps:
Execute the SC predictor to get the prediction result scCtrSum.
Simultaneously obtain the prediction result tage_ctr of the TAGE predictor.
Since the prediction result of TAGE is an unsigned saturation counter, and the prediction result of SC is a signed saturation counter, if they are added together, data conversion is required.
Kunming Lake adopts a conversion for the result of TAGE. The converted result is represented by tageCtrCentered, and the specific conversion process is as follows:
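(The conversion below is reconstructed from the examples that follow; tage_ctr is the 3-bit unsigned counter value.)

$$tageCtrCentered = 8 \times \left(2 \times (tage\_ctr - 4) + 1\right)$$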
Conversion of a 3-bit unsigned saturation counter to an 8-bit signed saturation counter result is illustrated as follows:
3b100 Weak jump => 8b00001000 = 8
3b011 Weak non-jump => 8b11111000 = -8
3b111 Strong jump => 8b00111000 = 56
3b000 Strong non-jump => 8b11001000 = -56
Add the prediction results of TAGE and SC to get the final prediction result P, represented by totalSum.
$$totalSum = scCtrSum + tageCtrCentered$$
Determine the final prediction direction based on totalSum and sc_bank_thres
Jump if totalSum > 0 and its absolute value exceeds the threshold: i.e., if scCtrSum > sc_bank_thres - tageCtrCentered, which is equivalent to totalSum > sc_bank_thres. This form reduces the bit width required for the comparison (from the 10 bits needed to guarantee no overflow down to 9 bits).
No jump if totalSum < 0 and its absolute value exceeds the threshold: If scCtrSum < -sc_bank_thres - tageCtrCentered, it can also be understood as |totalSum| > sc_bank_thres.
Training Process
After combining TAGE and SC, TAGE-SC adds an sc_bank_ctr counter to control the threshold sc_bank_thres. Therefore, during training, in addition to the training of TAGE and SC themselves, the newly added counter needs to be updated.
During the update phase, the specific update process is as follows:
When TAGE-SC uses the prediction result P (i.e., the combined TAGE + SC result): if |totalSum| is in the range [sc_bank_thres - 4, sc_bank_thres - 2], the threshold-related register group is updated as follows.
Update sc_bank_ctr, the saturation counter: If the prediction is correct, sc_bank_ctr +=1; if the prediction is incorrect, sc_bank_ctr -=1.
Update sc_bank_thres, limited saturation operation: If the updated value of sc_bank_ctr reaches 0b11111 and sc_bank_thres <= 31, then sc_bank_thres +=2; if the updated value of sc_bank_ctr is 0 and sc_bank_thres >=6, then sc_bank_thres -=2. For all other cases, thres remains unchanged.
After the update judgment of sc_bank_thres is completed, sc_bank_ctr is checked again: if the updated sc_bank_ctr is 0b11111 or 0, it is reset to the initial value 0b10000 (a sketch of this threshold update follows the list).
When TAGE-SC uses the prediction result P1 (i.e., the TAGE result alone), no threshold-related update is performed.
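A sketch of the threshold training described above (signal names follow the text; this is illustrative, not the RTL):

```scala
// Only when the combined result P was used and |totalSum| falls in
// [sc_bank_thres - 4, sc_bank_thres - 2]:
when(thresholdUpdateEn) {
  val newCtr = Mux(predCorrect,
                   Mux(sc_bank_ctr === 31.U, sc_bank_ctr, sc_bank_ctr + 1.U),
                   Mux(sc_bank_ctr === 0.U,  sc_bank_ctr, sc_bank_ctr - 1.U))
  when(newCtr === 31.U && sc_bank_thres <= 31.U)      { sc_bank_thres := sc_bank_thres + 2.U }
    .elsewhen(newCtr === 0.U && sc_bank_thres >= 6.U) { sc_bank_thres := sc_bank_thres - 2.U }
  // If the counter has hit either extreme, reset it to the neutral value 0b10000.
  sc_bank_ctr := Mux(newCtr === 31.U || newCtr === 0.U, "b10000".U, newCtr)
}
```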
Interface List
TageSC
| Signal type | Width | Signal name | Description |
| ----------- | ----- | ----------- | ----------- |
| input | * | clock | |
| input | * | reset | |
| input | *[35:0] | io_reset_vector | Value used for s1_pc_dup_0 during reset |
| input | *[40:0] | io_in_bits_s0_pc_0 | 1st copy in the duplicated s0_pc dup array; the PC for the top-level BPU |
| input | *[40:0] | io_in_bits_s0_pc_1 | 2nd copy of s0_pc; the PC for Tage |
| input | *[40:0] | io_in_bits_s0_pc_3 | 4th copy of s0_pc; the PC for SC |
| input | *[10:0] | io_in_bits_folded_hist_1_hist_17_folded_hist | 11-bit folded history used by TageTable 2. See the earlier table for how long a history range is folded into 11 bits. Note the TageTable index is offset by one: TageTable 2 here is T3 in the text above |
| input | *[10:0] | io_in_bits_folded_hist_1_hist_16_folded_hist | 11-bit folded history used by TageTable 3 |
| input | *[6:0] | io_in_bits_folded_hist_1_hist_15_folded_hist | 7-bit folded history used by TageTable 1 |
| input | *[7:0] | io_in_bits_folded_hist_1_hist_14_folded_hist | 8-bit folded history used by TageTable 0 |
| input | *[6:0] | io_in_bits_folded_hist_1_hist_9_folded_hist | 7-bit folded history used by TageTable 2 |
| input | *[7:0] | io_in_bits_folded_hist_1_hist_8_folded_hist | 8-bit folded history used by TageTable 3 |
| input | *[6:0] | io_in_bits_folded_hist_1_hist_7_folded_hist | 7-bit folded history used by TageTable 0 |
| input | *[6:0] | io_in_bits_folded_hist_1_hist_5_folded_hist | 7-bit folded history used by TageTable 3 |
| input | *[7:0] | io_in_bits_folded_hist_1_hist_4_folded_hist | 8-bit folded history used by TageTable 1 |
| input | *[7:0] | io_in_bits_folded_hist_1_hist_3_folded_hist | 8-bit folded history used by TageTable 2 |
| input | *[10:0] | io_in_bits_folded_hist_1_hist_1_folded_hist | 11-bit folded history used by TageTable 1 |
| input | *[3:0] | io_in_bits_folded_hist_3_hist_12_folded_hist | 4-bit folded history used by SCTable 1 |
| input | *[7:0] | io_in_bits_folded_hist_3_hist_11_folded_hist | 8-bit folded history used by SCTable 2 |
| input | *[7:0] | io_in_bits_folded_hist_3_hist_2_folded_hist | 8-bit folded history used by SCTable 3 |
| output | * | io_out_s2_full_pred_0_br_taken_mask_0 | io_out_s2_full_pred_{i}_br_taken_mask_{j}: output by Tage in the s2 pipeline stage, duplicated 4 times; the TAGE prediction result for the j-th branch instruction in the prediction block (arguably "mask" is a misnomer here) |
| output | * | io_out_s2_full_pred_0_br_taken_mask_1 | |
| output | * | io_out_s2_full_pred_1_br_taken_mask_0 | |
| output | * | io_out_s2_full_pred_1_br_taken_mask_1 | |
| output | * | io_out_s2_full_pred_2_br_taken_mask_0 | |
| output | * | io_out_s2_full_pred_2_br_taken_mask_1 | |
| output | * | io_out_s2_full_pred_3_br_taken_mask_0 | |
| output | * | io_out_s2_full_pred_3_br_taken_mask_1 | |
| output | * | io_out_s3_full_pred_0_br_taken_mask_0 | io_out_s3_full_pred_{i}_br_taken_mask_{j}: output by Tage in the s3 pipeline stage, duplicated 4 times; the SC prediction result for the j-th branch instruction in the prediction block |
FTB is the third sub-predictor of the Xiangshan BPU, so it also receives the outputs of uFTB and TAGE-SC. In the input interface of FTB, the s1 channel carries the basic prediction result of uFTB, while in the s2 and s3 channels TAGE-SC has filled in only one group of signals, br_taken_mask, without the basic prediction result generated from an FTB entry. The function of FTB is to provide the basic prediction results for the s2 and s3 channels.
In terms of functionality and structure, FTB is similar to uFTB. The main difference is that FTB can accommodate more FTB entries, and the prediction results of FTB are output in the s2 and s3 channels. Due to its large capacity, the readout speed of FTB is slower than that of uFTB, and it cannot be placed in the first cycle to generate prediction results. However, the large capacity enables it to obtain more accurate prediction results.
Function of FTB
Cache more FTB entries and provide basic prediction results for the s2 and s3 channels. The FTB predictor is essentially a storage with a large capacity. It reads the FTB entry corresponding to the current predicted PC and outputs it in the s2 stage. At the same time, this FTB entry is held for one more cycle to generate the s3 stage prediction result. One thing to note is that the br_taken_mask field supplied by the previous predictor must be taken into account so that it is not lost when the result is generated.
Update FTB entries based on update requests.
FTB Storage Structure
FTB entries in the FTB predictor are placed in a dedicated storage structure called FTBBank. Before further examining the structure of FTBBank, let’s first see how FTBBank is used.
FTB Read Request
The read request interface of FTBBank is as follows:
Among them, the req_pc interface is DecoupledIO, meaning it contains valid and ready signals. FTB needs the PC before the s1 stage starts, so s0_pc is sent to the req_pc interface, the s0_fire signal is connected to the valid signal of req_pc, and the ready signal is connected to the pipeline control signal s1_ready.
In the cycle after s0_fire, i.e., when s1_fire is asserted, FTBBank has already output the read FTB entry on the read_resp interface and computed read_hits. However, because the read-out has consumed too much of the cycle, the result cannot be output in the s1 stage. It is therefore saved in an internal register and read back in the s2 and s3 stages to generate the prediction results.
FTBBank
FTBBank defines a storage that holds all FTB entries. The storage adopts a set-associative structure with 512 sets, each set containing 4 ways, so it can store up to 2048 FTB entries. Besides the FTB entries themselves, it also stores the tag corresponding to each FTB entry for matching.
Specifically, the tag is defined as pc[29:10], which takes 20 bits from the PC to identify the FTB entry. The PC is divided as follows:
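An illustrative view of this division (the tag field is as stated above; the 9-bit set-index width follows from the 512 sets and is an assumption of this sketch):

```
pc:  | ...       | 29 ... 10      | 9 ... 1         | 0 |
      (ignored)    tag (20 bits)    set index (9 b)   (always 0, 2-byte aligned)
```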
When reading, the set index (idx) is provided to the storage, all ways in that set are read out, and each way's stored tag is compared with the current tag. If one matches, it is a hit: the read FTB entry is sent out through the read_resp interface, and the hit way number through the read_hits interface.
Generation of Prediction Results
As mentioned earlier, for the FTB predictor, it needs to provide basic prediction results derived from FTB entries to the s2 and s3 channels. The FTB entries have been read and saved in the s1 stage. In the s2 and s3 stages, they only need to be read out to generate the prediction results. However, one thing to note is to preserve the br_taken_mask field generated by TAGE-SC in the s2 and s3 prediction results, which provides precise prediction results for conditional branch instructions. For the s1 channel, the FTB predictor does not make any changes.
The generation of signals in the s2 and s3 prediction results can refer to the following list:
hit Whether the FTB entry is hit
Generation method: The read_hits signal valid bit from FTBBank is valid.
slot_valids Slot valid bit, indicating whether each slot in the ftb entry is valid
targets Jump target address corresponding to each slot
offsets Instruction offset relative to the start address of the predicted block in each slot
is_jal Whether the predicted block contains a jal instruction
is_jalr Whether the predicted block contains a jalr instruction
is_call Whether the predicted block contains a call instruction
is_ret Whether the predicted block contains a ret instruction
last_may_be_rvi_call Signal indicating that the end of the predicted block may be an RVI type call instruction
is_br_sharing Whether the last slot (tailSlot) stores a conditional branch instruction
Generation method: exported from the corresponding fields in the FTB entry
fallThroughErr Error in the pftAddr recorded in the FTB entry
Generation method: Compare whether the address represented by pftAddr is greater than the start address of the predicted block. If it is less than, it indicates an error, and this signal is set to valid. This situation may occur when the PC indexes an incorrect FTB entry.
fallThroughAddr End address of the predicted block
Generation method: If fallThroughErr is invalid, it is generated according to pftAddr. Otherwise, it is set to the start address + prediction width.
br_taken_mask Branch prediction result, each branch (slot) corresponds to a bit, indicating whether the branch is predicted as taken
Generation method: Generated from the always_taken field in the FTB entry together with the br_taken_mask passed in from the previous predictor.
jalr_target Jump target of jalr in this predicted block
Generation method: Jump target in the tailSlot of the FTB entry.
FTB meta
In the third cycle of prediction, the FTB predictor outputs auxiliary information about this prediction to last_stage_meta and also sends the read FTB entry to the last_stage_ftb_entry interface.
The FTB meta contains two pieces of information, hit and writeWay, indicating whether the prediction hits and in which way it is read. Subsequently, the update channel generates the update information for this prediction, and these two pieces of information are also sent to guide the writing of the updated FTB entry.
FTB Update
In the update channel, the pc and the new FTB entry are already specified for us, along with the hit and writeWay in the meta information. If hit in the meta is valid, it means that the FTB entry corresponding to this pc was stored in the memory, and we only need to write it to the corresponding way.
If it is invalid, it means that there was no storage before, but we do not know whether it is stored now. It is possible that before this update request, the FTB entry corresponding to this pc was written by another update request. Therefore, we still need to send a read request to FTBBank to check if there is a corresponding FTB entry. If it exists, it can be directly written to this position in the next cycle, otherwise, FTBBank will be notified to allocate a new position.
Therefore, the number of cycles required for updating FTB entries depends on the hit situation.
Let’s first look at how FTBBank handles updates.
FTBBank Update
FTBBank’s update interface is divided into two parts, the update read interface and the update write interface.
u_req_pc: Update read request pc
Flipped(DecoupledIO(UInt(VAddrBits.W)))
update_hits: Hit information read out
Valid(UInt(log2Ceil(numWays).W))
update_access: There is an update request but the meta information indicates a miss
Bool()
update_pc: Update write request pc
UInt(VAddrBits.W)
update_write_data: Data to be written in the update request, write when valid
Flipped(Valid(new FTBEntryWithTag))
update_write_way: Way index to write in the update request
UInt(log2Ceil(numWays).W)
update_write_alloc: Whether a new FTB entry needs to be allocated (missed before)
Bool()
For the update read interface, FTBBank obtains the update read request through the u_req_pc signal. This request has a higher priority than the read requests issued during prediction. In the next cycle, FTBBank outputs the hit information through the update_hits interface. update_access is only used for some internal status judgments inside FTBBank.
For the update write interface, FTBBank obtains the pc of the update write request through the update_pc signal, and when update_write_data is valid, it writes the data into the corresponding position specified by update_write_way. If update_write_alloc is valid, it means that it cannot be directly written to the position specified in the request, but a new position needs to be allocated.
The allocation strategy is as follows:
If all ways are filled, use the pseudo LRU replacement algorithm to select the way to replace
If there is an empty way, select the empty way.
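A behavioral sketch of this way-selection policy (plain Scala; the pseudo-LRU computation itself is abstracted into a pre-computed victim index):

```scala
// Behavioral sketch of way allocation for update_write_alloc (not the RTL).
object FtbAllocSketch {
  // valids(w) == true means way w already holds a valid FTB entry for this set.
  def allocWay(valids: Seq[Boolean], plruVictim: Int): Int =
    valids.indexWhere(v => !v) match {
      case -1  => plruVictim // every way is occupied: replace the pseudo-LRU victim
      case way => way        // otherwise use the first empty way
    }
}
```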
Update Request Timing
Meta hit is valid: If hit in the update request meta is valid, then we only need to specify the address and data to be written according to the information in the update request, and the writing only takes one cycle.
Meta hit is invalid: In this case, after receiving the update request, we connect the pc in the request to the read port of FTBBank. The read port will return the result in the next cycle. Due to timing issues, we save this result and use it in the next cycle. Depending on the hit status in the result, we decide whether to set update_write_alloc and send a write request. The entire update process takes three cycles.
For general conditional branch instructions, it is enough to predict whether the branch is taken or not. However, for indirect jumps (for example, jalr-based calls and jumps), the jump target (Target) must also be predicted. To make TAGE support predicting jump targets, ITTAGE (Indirect Target TAGE) was introduced.
The main difference between ITTAGE and TAGE is that in the T0 and Tn tables, Target PC data is added. During prediction, ITTAGE selects the Target from the matched, longest history entry as the prediction result, and uses a 2-bit saturating counter to decide whether to output this result or choose an alternative prediction result. For TAGE predictor details, please refer to TAGE-SC Branch Predictor.
Kunming Lake ITTAGE Branch Predictor
In the BPU design of Kunming Lake, prediction is performed in a cascaded manner with multiple predictors, so the implementation of the sub-predictor differs from the original predictor, mainly in the default prediction result.
Basic Functionality
ITTAGE’s basic functionality is similar to the TAGE branch predictor, but with the following differences:
The Target is added as a jump target address item in the entry to predict the jump target address.
The saturating counter ctr no longer provides the prediction direction, but instead decides whether to output this entry's result (i.e., it expresses prediction confidence).
Since there is only one indirect jump instruction in each branch prediction block, ITTAGE only considers one instruction.
Pipeline
ITTAGE contains three pipeline stages: the first stage calculates the index, the second stage reads the prediction information from the SRAM table using the index, and the third stage processes that information to produce the prediction result.
Cycle 0, s0: Input of the first pipeline stage, generally pc and folded history.
Operation of the first pipeline stage: Calculate the index. Output through registers to s1.
Cycle 1, s1: Input of the second pipeline stage, the index and other data calculated in the first stage.
Operation of the second pipeline stage: Access SRAM, read the prediction information. Output through registers to s2.
Cycle 2, s2: Input of the third pipeline stage, the original prediction information read from SRAM in the second stage.
Operation of the third pipeline stage: Process the original prediction information and decide whether to output the prediction result.
Cycle 3, s3: Prediction result ready, the prediction result can now be used.
Data Structure
In the Kunming Lake implementation, the table structure of T0 and Tn is as follows:
The retrieval method is consistent with the TAGE branch predictor, only differing in the configuration options of each table.
| Table | FH length | FH1 length | FH2 length | Recent history length (bits of GH used) |
|---|---|---|---|---|
| T1 | 4 bits | 4 bits | 4 bits | Lowest 4 bits: the newest 4 history bits are folded into FH, FH1, and FH2 |
| T2 | 8 bits | 8 bits | 8 bits | Lowest 8 bits: the newest 8 history bits are folded into FH, FH1, and FH2 |
| T3 | 9 bits | 9 bits | 8 bits | Lowest 13 bits: the newest 13 history bits are folded into FH, FH1, and FH2 |
| T4 | 9 bits | 9 bits | 8 bits | Lowest 16 bits: the newest 16 history bits are folded into FH, FH1, and FH2 |
| T5 | 9 bits | 9 bits | 8 bits | Lowest 32 bits: the newest 32 history bits are folded into FH, FH1, and FH2 |
Other processes (computation method and computation formula) are similar to the TAGE-SC branch predictor.
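As in TAGE, each folded history is obtained by XOR-folding the newest bits of the global history down to the configured width. The hardware maintains the folded values incrementally; the from-scratch sketch below (plain Scala) only illustrates the idea:

```scala
// From-scratch XOR folding of the newest `histLen` global-history bits into `foldedWidth` bits.
object FoldedHistorySketch {
  def foldHistory(ghist: BigInt, histLen: Int, foldedWidth: Int): BigInt = {
    val mask      = (BigInt(1) << foldedWidth) - 1
    var acc       = BigInt(0)
    var remaining = ghist & ((BigInt(1) << histLen) - 1) // keep only the newest histLen bits
    while (remaining != BigInt(0)) {
      acc ^= remaining & mask   // XOR one foldedWidth-wide chunk at a time
      remaining >>= foldedWidth
    }
    acc
  }
  // Example: T3 folds the newest 13 history bits into a 9-bit FH: foldHistory(h, 13, 9)
}
```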
Alternate Predictor
When the prediction given by the Tn tables has insufficient “prediction confidence,” the predictor falls back to an “alternate prediction.” This process is similar to TAGE; for details, please refer to the corresponding part of the TAGE documentation. Unlike TAGE, ITTAGE’s ctr does not give the prediction direction but only determines whether to output the result (prediction confidence). When ctr is 2'b00, the confidence is considered weak and the alternate prediction result is chosen:
If multiple tables are hit, output the Target from the second-longest history table entry.
Otherwise, output the T0 Target (FTB Target).
Prediction Process
The prediction process is similar to TAGE, but ITTAGE has an additional step to decide whether to output the prediction result based on ctr. The specific process is as follows:
When the ctr of the ITTAGE table entry is not 2'b00, output Target.
When the ctr of the ITTAGE table entry is 2'b00, output the alternate prediction result:
If there is a second-longest history (the second table is also hit), output the Target of the second-longest.
Otherwise, output the FTB Target.
When the ITTAGE table entry is not hit, output the T0 Target (FTB Target).
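This selection can be summarized by the following behavioral sketch (plain Scala; the case-class fields are illustrative, not the RTL signal names):

```scala
// Behavioral sketch of ITTAGE target selection (not the RTL).
object ITTageSelectSketch {
  case class TableHit(ctr: Int, target: Long) // ctr: 2-bit confidence counter, target: predicted address

  def selectTarget(longest: Option[TableHit],       // hit in the longest matching history table, if any
                   secondLongest: Option[TableHit], // hit in the second-longest matching table, if any
                   ftbTarget: Long): Long =         // T0 / FTB target as the final fallback
    longest match {
      case Some(p) if p.ctr != 0 => p.target                           // confident provider
      case Some(_) => secondLongest.map(_.target).getOrElse(ftbTarget) // weak (ctr == 2'b00): use alt / FTB
      case None    => ftbTarget                                        // no tagged table hit
    }
}
```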
Training Process
This process is similar to TAGE, with the following differences:
Table entry updates (original prediction data):
ctr:
If the predicted address matches the actual address, increment the ctr counter of the corresponding provider table entry by 1.
If the predicted address does not match the actual address, decrement the ctr counter of the corresponding provider table entry by 1.
In ITTAGE, it is determined based on ctr whether to adopt the jump target result of this prediction. If multiple tables are hit and the ctr of the longest history table is 0, adopt the alternate prediction logic (the second-longest history table or T0). Always update the longest history table during updates, and also update the alternate prediction table if the alternate prediction is adopted.
target:
When the ctr of the table entry to be updated is 0 during this prediction, directly store the actual final jump result in the target, overwriting it.
When applying for a new table entry, directly store the actual final jump result in the target.
Otherwise, do not modify the target.
usefulness:
When the provider’s prediction is correct but the alternate prediction is incorrect, set the provider’s usefulness to 1.
If the alternate prediction has weak confidence and is correct, set the provider’s usefulness to 1. If the alternate prediction has weak confidence and is incorrect, set the provider’s usefulness to 0.
New table entry:
Whenever a confident prediction from the longest-history table turns out to be wrong (i.e., the error is not due to using the alternate prediction), randomly try to allocate an entry in one of the tables with a longer history. The allocation condition is that the usefulness of the candidate entry is 0.
If the usefulness of every longer-history candidate is nonzero, the allocation fails.
Reset useful bit:
Each time a prediction error occurs and a new table entry is requested, if the allocation fails, increment tickCtr (an 8-bit saturating counter used to reset all usefulness bits); if it succeeds, decrement tickCtr.
When tickCtr reaches its maximum value, set all usefulness in ITTAGE to 0 and reset tickCtr to 0.
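The ctr and target updates of the provider entry can be summarized by the following simplified sketch (plain Scala; a single-entry view with assumed 2-bit saturation bounds, not the RTL):

```scala
// Behavioral sketch of a single ITTAGE provider-entry update (not the RTL).
object ITTageUpdateSketch {
  case class ITTageEntry(ctr: Int, target: Long)

  def updateProvider(e: ITTageEntry,
                     actualTarget: Long,      // resolved jump target from the backend
                     predictedTarget: Long,   // target this entry provided at prediction time
                     ctrAtPredTime: Int): ITTageEntry = {
    val correct = predictedTarget == actualTarget
    val newCtr  = if (correct) math.min(e.ctr + 1, 3) else math.max(e.ctr - 1, 0) // 2-bit saturating
    // The target is overwritten only when the entry was weak (ctr == 0) at prediction time;
    // a freshly allocated entry also gets the actual target directly (allocation not shown here).
    val newTgt  = if (ctrAtPredTime == 0) actualTarget else e.target
    ITTageEntry(newCtr, newTgt)
  }
}
```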
Other Meta information can be found in the corresponding sub-predictor documentation
_ubtb_io_out_last_stage_meta
_tage_io_out_last_stage_meta
_ftb_io_out_last_stage_meta
ittage_io_out_last_stage_meta[100:0]
| Bits | Signal | Remarks |
|---|---|---|
| 100 | s3_provided | Whether a result is provided |
| [99:97] | s3_provider | The table entry that provides the result |
| 96 | s3_altProvided | Whether there is an alternate-prediction table entry |
| [95:93] | s3_altProvider | The alternate-prediction table entry that provides the result |
| 92 | resp_meta_altDiffers | Whether the alternate prediction is weakly confident (FTB not counted) |
| 91 | s3_providerU | useful bit of the main prediction |
| [90:89] | s3_providerCtr | Confidence given by the main prediction |
| [88:87] | s3_altProviderCtr | Confidence given by the alternate prediction |
| 86 | resp_meta_allocate_valid_r | A free table entry is available for allocation |
| [85:83] | resp_meta_allocate_bits_r | Which table the entry would be allocated from |
| 82 | s3_tageTaken_dup_3 | Always true when the FTB result is not used; also true when it is used |
| [81:41] | s3_providerTarget | Jump target given by the main prediction |
| [40:0] | s3_altProviderTarget | Jump target given by the alternate prediction |
6 - RAS Branch Predictor
RAS Introduction
RAS stands for “Return Address Stack.” It helps determine branch behavior in programs by tracking return addresses. As previously mentioned, there are many branches in a program: if/else, switch/case, while/for loop, iteration, call/return, etc. The RAS branch predictor specifically targets call/return types.
As shown above, the main function calls add and sub, and add calls the function _add. In this process, each call’s jump address and return address are fixed, and the return address can be obtained at the time of the call. The function call process is a “stack push and pop” process, so branch prediction can be performed using a “stack” structure: each time a call instruction is encountered, the current PC+4 (compressed instructions and regular instructions have different offsets) is pushed onto the stack; when a return instruction is encountered, a pop operation is performed, and the address obtained is the target jump address. In the block-based BPU, RAS cannot know whether the current block is a call or ret, so it relies on other predictors, using the results of previous predictors for RAS operations.
Specifically, in Xiangshan’s RAS predictor, the s2 stage determines whether the preceding predictors’ s2 output predicts a call or a ret (i.e., whether the input signal io.s2_full_pred.hit_taken_on_call/ret is valid). If it is a call, the address of the following instruction is pushed onto the stack; if it is a ret, an address is popped from the stack as the prediction result. Because the BPU assumes that the s3 result is better than the s2 result, the RAS predictor must check again at the s3 stage: if the preceding predictors’ s3 result is inconsistent with s2, the s3 result is adopted, and the stack operations performed in s2 must be cancelled or completed as needed. For example, if s2 predicted a call and performed a push, but s3 predicts a regular branch that needs no operation, the push must be cancelled; if s2 predicted a regular branch and s3 predicts a call, the missing push must be performed.
RAS Stack Operations
In RAS design, function return addresses are predicted using a stack. Ideally, this section assumes that RAS can be backed up at any time, with the stack top pointer represented by sp and the predicted address represented by paddr. The basic RAS operations are as follows:
PUSH
Since predictions can be wrong, the current stack state needs to be backed up (often referred to as a “snapshot” in software; this term is also used in subsequent content). When encountering a call instruction, obtain the return address of the call, addr = current pc + 4 (or addr = pc + 2 for a compressed instruction), then push it onto the stack: stack[sp] = addr; sp = sp + 1.
POP
For the same reason, take a snapshot of the current stack before the operation, marked as s. When encountering a ret instruction, the predicted jump address is the current top of the stack, paddr = stack[sp - 1], then pop: sp = sp - 1.
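Under these idealized assumptions, the stack behaves like the following sketch (plain Scala; `isRVC` marks a compressed call instruction, and capacity limits are ignored):

```scala
// Idealized RAS behavior as described above (unbounded stack, snapshots are full copies).
class IdealRAS {
  private var stack: List[Long] = Nil              // head of the list is the top of the stack

  def snapshot(): List[Long] = stack                // an immutable List already acts as a snapshot

  def push(callPc: Long, isRVC: Boolean): Unit = {
    val retAddr = callPc + (if (isRVC) 2 else 4)    // return address of the call instruction
    stack = retAddr :: stack
  }

  def pop(): Option[Long] = stack match {
    case top :: rest => stack = rest; Some(top)     // predicted target of the ret
    case Nil         => None                        // nothing to pop
  }

  def restore(s: List[Long]): Unit = stack = s      // roll back to a saved snapshot after a misprediction
}
```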
Redirect Operation
Since the BPU predicts program branches, there are “correct predictions” and “wrong predictions.” When the CPU backend detects a branch prediction error, it performs a redirect operation, informing the BPU where the prediction was wrong and what the correct result is. During redirection, the RAS module receives the correct branch and the RAS stack information at the time of prediction. Depending on the type of correct branch instruction, the following snapshot recovery situations arise:
(1) The previously predicted instruction is actually a call instruction, and the push operation is executed based on the addr address provided in the redirect.
(2) The previously predicted instruction is actually a ret instruction, and the pop operation is executed.
Commit Operation
The commit operation is when the backend informs the frontend that the previous prediction was correct. Ideally, the RAS predictor doesn’t need to perform any operations at this time.
Implementation of RAS in Kunming Lake
In actual circuit design, an infinitely large stack is impossible, and constant backups are not feasible. Therefore, in Kunming Lake’s RAS implementation, the problems and solutions are as follows:
How to obtain RAS stack snapshots for each prediction?
To achieve the function of taking snapshots of the RAS stack, Kunming Lake adopts a linked representation based on a circular array. The design is as follows:
As shown above, a circular array is used for data management. The circular array has a starting address marked as BOS and a tail pointer marked as TOSW; the data between them is valid, and the data outside is free. Within the valid data, a linked structure represents the “RAS stack”: each stack element records the index of the element below it, and stack operations find the previous element through this index. The bottom pointer of the RAS stack is shared with BOS.

In the initial state S in the figure, the RAS stack elements are 0, 1, 3, 5: element 5 records the position of its previous element 3, and element 3 records the position of its previous element 1. When a push is needed, the stack top pointer TOSR is set to TOSW, the new element is stored at the new TOSR (position 7) with the position of its previous element 5 recorded inside it, and then TOSW is moved back (TOSW = TOSW + 1). When a pop is performed, the stack top pointer TOSR moves to the previous element’s position 3 according to the index saved in the stack top element.

Therefore, as long as the array does not overflow, normal push/pop operations always allocate new data on the array through TOSW, so all intermediate states are preserved; to restore state S, it is enough to reset the corresponding stack pointers. For this reason, every prediction also saves the corresponding stack pointers (BOS, TOSR, TOSW) in the prediction result so that they can be restored later if a redirect occurs. The advantage of this structure is that it preserves the complete process data, but frequent pushes can consume a large amount of storage space.
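A much-simplified behavioral sketch of this linked structure follows (plain Scala; wrap-around and overflow handling are omitted, and the `prev` field plays the role of the “previous element index” described above):

```scala
// Simplified sketch of the linked speculative stack kept in a circular array (not the RTL).
case class SpecEntry(retAddr: Long, prev: Int)      // prev: index of the element below this one

class LinkedSpecStack(size: Int) {
  val buf  = new Array[SpecEntry](size)
  var bos  = 0       // bottom of the valid region of the circular array
  var tosr = -1      // index of the current stack top (-1 means the stack is empty)
  var tosw = 0       // next free slot; every push allocates here, so old data is never overwritten

  def push(retAddr: Long): Unit = {
    buf(tosw) = SpecEntry(retAddr, tosr)  // remember the previous top inside the new element
    tosr = tosw
    tosw = (tosw + 1) % size
  }

  def pop(): Option[Long] =
    if (tosr < 0) None
    else { val e = buf(tosr); tosr = e.prev; Some(e.retAddr) }

  // A "snapshot" is just the pointer triple; restoring it rolls the stack back to that state.
  def checkpoint(): (Int, Int, Int)      = (bos, tosr, tosw)
  def restore(cp: (Int, Int, Int)): Unit = { bos = cp._1; tosr = cp._2; tosw = cp._3 }
}
```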
How to reduce the storage space wasted by the chained RAS?
Since the prediction result is correct after commit, the stack will not roll back. When the RAS predictor receives the commit message of the prediction block “P,” it will no longer receive the redirect message of block P, so the snapshot taken during the push operation for block P will not be used again. Therefore, the RAS stack elements can be categorized, with uncommitted elements stored in a “linked” structure and committed elements stored in a regular stack structure (the original RAS stack is split into two parts: the uncommitted part stored in a snapshot-saving linked structure and the committed part stored in a regular stack structure). The optimized RAS structure is shown below:
As shown above, based on the normal call/ret predictions and commit call/ret, the original RAS stack can be split into two independent stacks, called the “speculative stack” (spec_stack, linked structure) and the “commit stack” (commit_stack, regular structure). Due to the change in stack structure, the specific Pop/Push operations are as follows:
When a normal call or ret is predicted:
(1) The prediction block predicts a taken call: a push operation is performed on the spec_stack, using the linked-stack push process described above.
(2) The prediction block predicts a taken ret: the top of the spec_stack is used as the predicted target, then a pop operation is performed. If the spec_stack is empty, the top element of the commit_stack is used as the prediction result (no pop is performed).
Commit operation:
(1) The FTQ execution result is call correct, and a regular push operation is performed on the commit_stack.
(2) The FTQ execution result is ret correct, and a regular pop operation is performed on the commit_stack.
Redirect operation:
(1) Obtain the stack pointers (BOS, TOSR, TOSW, ssp) that were saved in the redirect message at the time of the earlier prediction and overwrite the current pointer values with them, completing the speculative-stack rollback.
(2) This operation does not affect the commit stack.
How to handle when the S3 prediction result is inconsistent with S2 at the input end?
Since the S3 result is assumed to be better than S2, the RAS stack needs to be repaired again in case of inconsistency. The specific inconsistency and corresponding repair operations are shown in the table below:
| S2 Pred. Res. | S3 Pred. Res. | Repair Operation |
|---|---|---|
| push | keep | pop |
| keep | pop | pop |
| pop | keep | push |
| keep | push | push |
The pop/push and push/pop combinations do not occur between S2 and S3 (think about why they cannot occur).
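A compact sketch of this repair decision (plain Scala; operation names are illustrative):

```scala
// Behavioral sketch of the S3 repair decision shown in the table above (not the RTL).
object RasRepairSketch {
  sealed trait Op
  case object Push extends Op; case object Pop extends Op; case object Keep extends Op

  def repair(s2: Op, s3: Op): Op = (s2, s3) match {
    case (Push, Keep) => Pop   // undo the speculative push made in s2
    case (Keep, Pop)  => Pop   // s3 found a ret that s2 missed
    case (Pop,  Keep) => Push  // undo the speculative pop made in s2
    case (Keep, Push) => Push  // s3 found a call that s2 missed
    case _            => Keep  // s2 and s3 agree; push/pop and pop/push do not occur
  }
}
```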
Other Optimizations
Each element in the RAS stack has a counter, which is used to save repeated values (for recursive calls). For example, when the address pushed for the first time is 0xff00, and the address pushed for the second time is also 0xff00, only the counter of the top element of the stack needs to be incremented, and there is no need to actually push the address onto the stack.
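A small sketch of this counter optimization (plain Scala; field names are illustrative):

```scala
// Sketch of the per-entry repeat counter used for recursive calls (not the RTL).
object RasCounterSketch {
  case class RASEntry(retAddr: Long, ctr: Int)   // ctr counts extra pushes of the same address

  def pushWithCounter(stack: List[RASEntry], retAddr: Long): List[RASEntry] = stack match {
    case top :: rest if top.retAddr == retAddr => top.copy(ctr = top.ctr + 1) :: rest // recursive call
    case _                                     => RASEntry(retAddr, 0) :: stack       // normal push
  }
}
```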
Interface Description
In the RAS predictor, the core component is the RASStack, with the following interface:
| Interface | Description | Interface | Description |
|---|---|---|---|
| in.spec_push_valid | A call is predicted; push onto the speculative stack | in.s2_fire | The s2 signal is valid |
| in.spec_pop_valid | A ret is predicted; pop from the speculative stack | in.s3_fire | The s3 signal is valid |
| in.spec_push_addr | Return address to push | in.s3_cancel | The s3 and s2 prediction results differ; the s2 operation must be undone |
| out.spec_pop_addr | Data at the top of the RAS stack | in.s3_meta | Context information from s2 that s3 needs |
| out.ssp | Commit stack top pointer | in.s3_missed_pop | s3 determines that another pop is required |
| out.sctr | Repeat-element counter of the commit stack top | in.s3_missed_push | s3 determines that another push is required |
| out.nsp | Commit stack top pointer; will be overwritten by ssp | in.s3_pushAddr | Address to push when another push is required |
| out.TOSR | Spec stack top pointer | in.redirect_valid | A redirect is required |
| out.TOSW | Spec stack data-allocation pointer | in.redirect_isCall | The actually executed instruction is a call |
| out.BOS | Spec stack bottom pointer | in.redirect_isRet | The actually executed instruction is a return |
| in.commit_push_valid | FTQ execution result: call correct, push the commit stack | in.redirect_meta_ssp | ssp context saved at the earlier prediction |
| in.commit_pop_valid | FTQ execution result: ret correct, pop the commit stack | in.redirect_meta_sctr | sctr context saved at the earlier prediction |
| in.commit_push_addr | Return address in the update information | in.redirect_meta_TOSW | TOSW context saved at the earlier prediction |
| in.commit_meta_TOSW | TOSW in the update information | in.redirect_meta_TOSR | TOSR context saved at the earlier prediction |
| in.commit_meta_TOSR | TOSR in the update information | in.redirect_meta_NOS | NOS context saved at the earlier prediction |
| in.commit_meta_ssp | ssp context in the update information | in.redirect_callAddr | Redirect address |
| in.commit_meta_sctr | sctr context in the update information | | |
The relationship between the RASStack module and the BasePredictor interface is as follows:
Based on the S2 prediction result of the previous predictor, complete the prediction through the push/pop process and obtain the result spec_pop_addr.
Perform update operations based on the commit signal.
Main tasks in S3
Based on the prediction result in the previous predictor S3 and the operations in S2, determine whether to undo the Pop/Push operation. The predictor assumes that the S3 prediction result is better than S2. If S2 and S3 predictions are inconsistent, the RAS predictor accepts the S3 result for stack operation.
The prediction process in S3 is the same as in S2, but the data is different.
Perform redirection operations (redirection information obtained from the previous cycle).
Since the RASStack must appear to complete its work within a single cycle, data bypassing is needed inside the stack to forward data that has not yet been written.