RAS Branch Predictor

RAS介绍

RAS stands for “Return Address Stack.” It helps determine branch behavior in programs by tracking return addresses. As previously mentioned, there are many branches in a program: if/else, switch/case, while/for loop, iteration, call/return, etc. The RAS branch predictor specifically targets call/return types.

function _add(a, b){
    return (a > 0 ? a : 0)  + (b > 0? b : 0);
}

function add(a, b){
    return a + b;
}

function sub(a, b){
    return a - b;
}

function main(){
    a = 1;
    b = 2;
    c = add(a, b);
    d = sub(a, b);
}

As shown above, the main function calls add and sub, and add calls the function _add. In this process, each call’s jump address and return address are fixed, and the return address can be obtained at the time of the call. The function call process is a “stack push and pop” process, so branch prediction can be performed using a “stack” structure: each time a call instruction is encountered, the current PC+4 (compressed instructions and regular instructions have different offsets) is pushed onto the stack; when a return instruction is encountered, a pop operation is performed, and the address obtained is the target jump address. In the block-based BPU, RAS cannot know whether the current block is a call or ret, so it relies on other predictors, using the results of previous predictors for RAS operations.

Specifically, in Xiangshan’s RAS predictor, at the s2 stage, it needs to determine whether the previous stage’s s2 output predicts a call or ret (i.e., the input signal io.s2_full_pred.hit_taken_on_call/ret is valid). If it’s a call, it pushes the subsequent instruction address onto the stack; if it’s a ret, it pops the address from the stack as the prediction result. Because in the BPU predictor, the result obtained at the s3 stage is assumed to be better than the s2 stage, the RAS predictor needs to check at the s3 stage. If the previous stage’s s3 prediction result is inconsistent with s2, the s3 result is taken, and it needs to determine whether to cancel or complete the stack operations of the previous s2 stage as needed. For example, if the s2 stage predicted a call instruction and performed a push operation, but s3 predicted a regular branch instruction with no need for any operation, the push must be canceled; if s2 predicted a regular branch instruction and s3 predicted a call, a push operation must be performed to complete.

RAS Stack Operations

In RAS design, function return addresses are predicted using a stack. Ideally, this section assumes that RAS can be backed up at any time, with the stack top pointer represented by sp and the predicted address represented by paddr. The basic RAS operations are as follows:

  1. PUSH

Since predictions can be wrong, the current stack state needs to be backed up (often referred to as a “snapshot” in software; this term is also used in subsequent content). When encountering a call instruction, get the return address of the call instruction addr = current pc + 4 (if it’s a compressed instruction, addr = pc+2), then push onto the stack: sp = addr; sp += 1.

  1. POP

For the same reason, take a snapshot of the current stack, marked as s. When encountering a ret instruction, the predicted jump address is paddr = sp, then pop: sp = sp - 1. Take a snapshot of the current stack, marked as s.

  1. Redirect Operation

Since the BPU predicts program branches, there are “correct predictions” and “wrong predictions.” When the CPU backend detects a branch prediction error, it performs a redirect operation, informing the BPU where the prediction was wrong and what the correct result is. During redirection, the RAS module receives the correct branch and the RAS stack information at the time of prediction. Depending on the type of correct branch instruction, the following snapshot recovery situations arise:

(1) The previously predicted instruction is actually a call instruction, and the push operation is executed based on the addr address provided in the redirect. (2) The previously predicted instruction is actually a ret instruction, and the pop operation is executed.

  1. Commit Operation

The commit operation is when the backend informs the frontend that the previous prediction was correct. Ideally, the RAS predictor doesn’t need to perform any operations at this time.

Implementation of RAS in Kunming Lake

In actual circuit design, an infinitely large stack is impossible, and constant backups are not feasible. Therefore, in Kunming Lake’s RAS implementation, the problems and solutions are as follows:

How to obtain RAS stack snapshots for each prediction?

To achieve the function of taking snapshots of the RAS stack, Kunming Lake adopts a linked representation based on a circular array. The design is as follows:

As shown above, a circular array is used for data management. The circular array has a starting address marked as BOS and a tail pointer marked as TOSW. The data between them are valid, and the data outside are free. Within the valid data, a linked structure represents the “RAS stack,” where each stack element records the number of its previous data. When performing stack operations, the corresponding previous element can be obtained through this number. The RAS stack’s bottom pointer shares the BOS. In the initial state S in the figure, the RAS stack elements are 0, 1, 3, 5. Element 5 records the position of its previous element 3, and element 3 records the position of its previous element 1. When a push operation is needed, the RAS stack top pointer TOSR = TOSW, the new element is stored at the new TOSR position 7, and the position of its previous element 5 is recorded in the new element, then TOSW is moved back (TOSW = TOSW+1). When a pop operation is performed, the RAS stack top pointer TOSR moves to the previous element’s position 3 based on the index saved in the stack top element. Therefore, under the condition that the stack does not overflow, the above RAS stack always allocates new data on the array through TOSW during normal Push/Pop operations, so all process states and intermediate data are saved. So, to restore to the state S, it only needs to reset the corresponding stack pointers. Therefore, in each prediction, the corresponding stack pointers (BOS, TOSR, TOSW) also need to be saved in the prediction result for later restoration in case of redirection. The advantage of this structure is that it can save complete process data, but frequent push operations can lead to large space resource consumption.

Chain RA storage space waste?

Since the prediction result is correct after commit, the stack will not roll back. When the RAS predictor receives the commit message of the prediction block “P,” it will no longer receive the redirect message of block P, so the snapshot taken during the push operation for block P will not be used again. Therefore, the RAS stack elements can be categorized, with uncommitted elements stored in a “linked” structure and committed elements stored in a regular stack structure (the original RAS stack is split into two parts: the uncommitted part stored in a snapshot-saving linked structure and the committed part stored in a regular stack structure). The optimized RAS structure is shown below:

As shown above, based on the normal call/ret predictions and commit call/ret, the original RAS stack can be split into two independent stacks, called the “speculative stack” (spec_stack, linked structure) and the “commit stack” (commit_stack, regular structure). Due to the change in stack structure, the specific Pop/Push operations are as follows:

  1. Encounter normal call and ret: (1) The call predicted by the prediction block is correct, and a push operation is performed on the spec_stack, specifically the linked stack Push process mentioned above.

(2) The ret predicted by the prediction block is correct, and the stack top of the spec_stack is used as the prediction value, then a pop operation is performed. If the spec_stack is empty, the stack top element of the commit_stack is used as the prediction result (no pop operation is performed).

  1. Commit operation: (1) The FTQ execution result is call correct, and a regular push operation is performed on the commit_stack.

(2) The FTQ execution result is ret correct, and a regular pop operation is performed on the commit_stack.

  1. Redirect operation: (1) Obtain the stack pointers (BOS, TOSR, TOSW, ssp) from the redirect message during the previous prediction and cover the current pointer values to complete the speculative stack rollback.

(2) This operation does not affect the commit stack.

How to handle when the S3 prediction result is inconsistent with S2 at the input end?

Since the S3 result is assumed to be better than S2, the RAS stack needs to be repaired again in case of inconsistency. The specific inconsistency and corresponding repair operations are shown in the table below:

S2 Pred. Res. S3 Pred. Res. Repair Operation
push keep pop
keep pop pop
pop keep push
keep push push

S2 and S3 operations do not exist, pop/push or push/pop scenarios (Why not exist?)

Other Optimizations

  1. Each element in the RAS stack has a counter, which is used to save repeated values (for recursive calls). For example, when the address pushed for the first time is 0xff00, and the address pushed for the second time is also 0xff00, only the counter of the top element of the stack needs to be incremented, and there is no need to actually push the address onto the stack.

Interface Description

In the RAS predictor, the core component is the RASStack, with the following interface:

接口名称 功能描述 接口名称 功能描述
in.spec_push_valid 预测有Call指令,Spec压栈 in.s2_fire s2信号有效
in.spec_pop_valid 预测有Ret指令,Spec出栈 in.s3_fire s3信号有效
in.spec_push_addr Ret地址 in.s3_cancel s3和s2的预测结果不一样,需要撤销s2的操作
out.spec_pop_addr RAS的栈顶数据 in.s3_meta s3需要的s2时的现场信息
out.ssp commit栈顶指针 in.s3_missed_pop s3判断需要再次进行pop
out.sctr commit栈顶重复元素计数器 in.s3_missed_push s3判断需要再次进行push
out.nsp commit栈顶,会被ssp覆盖 in.s3_pushAddr 需要再次push时的地址
out.TOSR spec栈栈顶指针 in.redirect_valid 需要重定向
out.TOSW spec栈数据分配指针 in.redirect_isCall 真实执行情况是Call
out.BOS spec栈栈低指针 in.redirect_isRet 真实执行情况是Return
in.commit_push_valid push操作正确 in.redirect_meta_ssp 之前预测时的现场信息ssp
in.commit_pop_valid FTQ执行结果为Call正确 in.redirect_meta_sctr 之前预测时的现场信息sctr
in.commit_push_addr 更新信息中的Ret地址 in.redirect_meta_TOSW 之前预测时的现场信息TOSW
in.commit_meta_TOSW 更新信息中的TOSW
in.redirect_meta_TOSR 之前预测时的现场信息TOSR
in.commit_meta_TOSR 更新信息中的TOSR in.redirect_meta_NOS 之前预测时的现场信息NOS
in.commit_meta_ssp 更新信息中的现场信息SSP in.redirect_callAddr 重定向地址
in.commit_meta_sctr 更新信息中的现场信息SCTR

The relationship between the RASStack module and the BasePredictor interface is as follows:

stack接口 转换过程 描述
s.spec_push_valid io.s2_fire(2) && s2_full_pred.hit_taken_on_call && !io.s3_redirect(2) s2输入有效,且上级预测为call跳转
s.spec_pop_valid io.s2_fire(2) && s2_full_pred.hit_taken_on_ret && !io.s3_redirect(2) s2输入有效,且上级预测为ret跳转
s.spec_push_addr s2_full_pred.fallThroughAddr + Mux(s2_full_pred.last_may_be_rvi_call, 2.U, 0.U) 上级预测器s2预测的fallThroughAddr(即PC+2),判断是否压缩指令是否需要 +2
s.redirect_isCall redirect.bits.level === 0.U && recover_cfi.pd.isCall
s.redirect_isRet redirect.bits.level === 0.U && recover_cfi.pd.isRet
s.redirect_meta_* redirect.bits.cfiUpdate.*
s.commit_push_valid io.update.is_call_taken call指令预测正确
s.commit_push_valid io.update.is_ret_taken ret指令预测正确
s.commit_push_addr update.ftb_entry.getFallThrough(update.pc) + Mux(update.ftb_entry.last_may_be_rvi_call, 2.U, 0.U) 根据是否为压缩指令,进行地址+2或者+0
s.commit_meta_* io.update.bits.meta.asTypeOf(new RASMeta)
io.out.last_stage_spec_info.* s3_meta.* s3_meta = RegEnable(s2_meta, io.s2_fire(2))由s2_meta延迟一怕得到
io.out.last_stage_meta s3_meta
io.out.s2.full_pred.*.jalr_target :=stack.spec_pop_addr 预测地址(栈顶地址,只预测ret)
io.out.s3.full_pred.*.jalr_target :=RegEnable(stack.spec_pop_addr, io.s2_fire(2)) 由s2延迟一拍得到
io.out.s2/3.full_pred.targets.last :=Mux(s2/3_is_jalr, s2/3_jalr_target, s2/3_last_target_in) 如果时call执行,更新targets.last的结果

Timing Description

In RAS, there are only 2 stages, S2 and S3.

Main tasks in S2

  1. Based on the S2 prediction result of the previous predictor, complete the prediction through the push/pop process and obtain the result spec_pop_addr.
  2. Perform update operations based on commit signal.

Main tasks in S3

  1. Based on the prediction result in the previous predictor S3 and the operations in S2, determine whether to undo the Pop/Push operation. The predictor assumes that the S3 prediction result is better than S2. If S2 and S3 predictions are inconsistent, the RAS predictor accepts the S3 result for stack operation.

  2. The prediction process in S3 is the same as in S2, but the data is different.

  3. Perform redirection operations (redirection information obtained from the previous cycle).

Since the RASStack appears to complete its tasks within a cycle, data bypassing is needed inside the stack to cache data for processing.

Last modified September 13, 2024: Update the picture of BPU Top. (431c050)