Regression Verification Phase I: Reproduction of Bug Test Cases for the Third Generation Xiangshan (Kunming Lake) (In Progress)
Design regression test cases to reproduce bugs based on the published bug tasks.
Created Jan 11, 2024 - Last updated: Jan 11, 2024
UT (Unit Testing), IT (Integration Testing), and ST (System Testing) are three critical phases in processor verification, each targeting different levels of the verification process. Among them, UT focuses on testing the smallest functional units (modules) within the processor design to ensure they meet the design requirements. However, in practice, some module bugs are not detected during the UT phase but are discovered and reported in subsequent verification stages. Therefore, it is necessary to construct test cases in UT to reproduce these bugs. Additionally, the test cases need to be rerun on the version where the bug has been fixed to ensure the bug has been correctly resolved. This process is also known as regression testing.
Task Overview
This phase provides 18 bugs for regression. Participants must use the specified toolchain (picker, and optionally toffee) to construct the regression test cases.
For most bugs, we will provide detailed descriptions, including the module where the bug occurs, the cause, and how to reproduce it. For some bugs, participants may need to locate the relevant module or set up the verification environment based on the description. Depending on the workload, different levels of rewards will be provided. For each bug regression completed and a Pull Request (PR) submitted to the UnityChipForXiangShan repository, a reward of 100 RMB will be granted. Additional rewards are available for contributing to the bug verification infrastructure. We hope everyone can actively participate and complete the tasks!
Test Content
For each bug, you need to complete the following tasks:
1. Environment Setup
First, based on the test environment provided by UnityChipForXiangShan, construct test cases to perform bug regression.
- If the module is not explicitly mentioned in the description, you need to locate the relevant module yourself.
- If the verification environment for the target module does not exist in the repository, please refer to the Prepare Verification Environment and Add Tests sections in the documentation to set up the verification environment according to the specifications.
2. Test Case Construction
Next, based on the cause of the bug, construct the corresponding test cases. For this task, you will be provided with the RTL where the bug occurred and the commit date that fixed the bug.
Each constructed test case must satisfy both of the following conditions:
- On the RTL that contains the bug: the test case can be executed, and it fails.
- On the RTL downloaded from the repository after the fix commit date: the test case can be executed, and it passes.
3. Commit Message
Finally, write a description specifying which module the bug belongs to, briefly outline the relevant test environment and test case, and explain how to run the test.
After completing the above tasks, submit the PR to the UnityChipForXiangShan repository.
Task Reward
For each completed bug regression and submission of a PR to the UnityChipForXiangShan repository, a reward of 100 RMB will be granted for the first accepted PR. Additional rewards may be provided for contributions to the bug verification infrastructure.
Bug to Be Reproduced
For each bug, the following table lists the corresponding task ID, bug description, root cause, RTL version where the bug occurred, and the commit date of the fix. For more details, please refer to the issue or task details.
Note: In the Commit column, click the entry to open the commit that fixed the bug. In the RTL column, click the entry to download the RTL at the time the bug occurred.
Serial Number | Task ID | Bug Description | Commit | RTL | Cause | Issue | Contributor |
---|---|---|---|---|---|---|---|
1 | KMH22-242 | A misalignment occurred during memory access, and the GPA (Guest Physical Address) was not passed, resulting in an error. | 2024-09-12 | 24080301 | A misalignment occurred during memory access, and the GPA (Guest Physical Address) was not passed, resulting in an error. | Link | - |
2 | KMH22-118 | The stopcount and stoptime functionalities in DCSR (Debug Control and Status Register) have not been implemented. | 2024-11-14 | 24080301 | The stopcount and stoptime functionalities in DCSR (Debug Control and Status Register) have not been implemented. Implement these features in the design code. | Link | - |
3 | KMH22-351 | An error occurred in the calculation of the gpaddr (Guest Physical Address) during L1 TLB refill. | 2024-09-04 | 24082801 | During the first-stage address translation in Bare mode and the second-stage address translation in Sv39x4 mode, when a GPF (Guest Page Fault) occurs, mtval2 is 0. In the "onlys2" case, an error occurs in the calculation of the gpaddr (Guest Physical Address) during L1 TLB refill. The s1 ppn (Physical Page Number) was used instead of the s2 tag, which is incorrect. | Link | - |
4 | KMH22-397 | The RAS (Return Address Stack) misprediction caused speculation stack blockage, leading to front-end stalling and freezing. | 2024-09-10 | 24082801 | There is an issue with the Call and Ret signals sent to the RAS (Return Address Stack) module during redirection. When the RAS implementation is blocked, it misjudges the speculation stack blockage, causing the front-end to stall and freeze. | Link | - |
5 | KMH22-398 | When a miss_req continuously waits for probing and replay in the MSHR (Miss Status Handling Register), the refill_req can block store_req and probe_req, leading to a deadlock. | 2024-09-04 | 24082801 | In the previous design, when a miss_req continuously waited for probing and replay in the MSHR (Miss Status Handling Register), the refill_req would block store_req and probe_req, leading to a deadlock. The unnecessary blocking has been removed to fix this issue. | Link | - |
6 | KMH22-399 | The unit-stride address is hardcoded with 39 bits. After switching to Sv48, the higher bits are lost. | 2024-09-16 | 24082801 | After enabling Sv48, the bit width for vector memory access was not adapted accordingly. | Link | - |
7 | KMH22-400 | The PCredit for MMIO and cacheable spaces are not arbitrated together, resulting in the loss of PCrdGrant. | TBD | 24082801 | The PCredit for MMIO and cacheable spaces should be arbitrated together. The allocation of MMIO's rxrsp to a specific MMIO entry is determined by the TxnID. However, PCrdGrant does not carry a TxnID. According to the previous PCrdGrant allocation logic, if a transaction receives PCrdGrant before receiving RetryAck, this PCrdGrant will be lost. | Link | - |
8 | KMH22-1547 | The PLRU (Pseudo Least Recently Used) replacement algorithm in the TLB replaced a recently accessed TLB entry, disrupting the TLB's GPF (Guest Page Fault) handling process and resulting in a system hang. | 2024-12-02 | 24103001 | The L1TLB does not store guest physical addresses (gpaddr), but gpaddr is required when a guest page fault (GPF) occurs. In such cases, the L1TLB needs to send a page table walk (PTW) request to obtain the gpaddr, which we refer to as getGpa. The getGpa mechanism can only handle one GPF TLB request (i.e., the first request) and assumes that the corresponding TLB entry still resides in the L1TLB. The L1TLB replacement algorithm uses PLRU (Pseudo Least Recently Used), which may replace entries that are not necessarily the least recently used. We observed a scenario where the L1TLB replaced the GPF TLB entry even though it had been accessed recently. This led to a deadlock in the getGpa mechanism, ultimately causing the entire core to freeze. To resolve this issue, we decided to block any unrelated PTW fills while the getGpa mechanism is active (i.e., when gpaddr is needed). After addressing this problem, we found that, in some cases, other PTW responses were not being filled, and other TLB requests continued to trigger PTW requests, occupying the L2TLB request path and preventing the GPF PTW request from being serviced, eventually causing the processor to freeze. To solve this, we decided to block any unrelated PTW requests when gpaddr is required. | Link | - |
9 | KMH22-1572 | Xvisor fails to handle GPF (Guest Page Fault) correctly, resulting in an infinite loop. | 2024-11-14 | 24110801 | The L1TLB has added logic to handle gpaddr for cross-page scenarios, but this logic only considers the DTLB case and fails to account for the fact that the ITLB does not utilize the fullva and related pathways. As a result, during exceptions, a value of 0 is incorrectly passed as htval. Xvisor's exception handling mechanism relies on htval to process GPF (Guest Page Fault) exceptions, and the incorrect htval causes the system to hang. | Link | - |
10 | KMH22-1786 | In the RTL, the vs (vector state) is incorrectly set to dirty for certain instructions that are incapable of modifying the vector state. | 2024-12-01 | 24111901 | Do not set vs.dirty for certain types of vector instructions (vecInsts). | Link | - |
11 | KMH22-1824 | The vset instruction should not respond to clock interrupts, but it has been incorrectly marked as interrupt_safe. | 2024-12-07 | 24111901 | Modify the vset instruction to not respond to clock interrupts. | Link | - |
12 | KMH22-1844 | The frontend incorrectly handles exceptions when fetching instructions across page boundaries. | 2024-12-01 | 24112601 | It is likely an issue with ebsin. | Link | - |
13 | KMH22-1861 | The issue involves reading vsie/vsip in non-V mode and reading sie/sip in V mode. | 2024-12-06 | 24120301 | When a VS interrupt occurs on the host, the VM will experience an S interrupt. Therefore, the interrupt numbers need to be synchronized. The issue of reading vsie/vsip in non-V mode and reading sie/sip in V mode has been fixed. We enable interrupts by writing the corresponding bits in mstatus/sstatus/vsstatus, so we need to update xtopi when writing to these registers. Xu Zefan: Although measures have been taken in the L1TLB to block ptw refill requests when need_gpf is active, it fails to block ptw refill requests that enter at the critical moment when need_gpf is about to be asserted. | Link | - |
14 | KMH22-1872 | vecExcpInfo.valid is incorrectly updated when an interrupt occurs. | 2024-12-06 | 24120301 | In Rob, vecExcpInfo.valid := exceptionHappen && exceptionDataRead.bits.vstartEn && exceptionDataRead.bits.isVecLoad && !exceptionDataRead.bits.isEnqExcp. When this signal is high, it indicates that a vector memory-related exception needs to be handled. At this point, a submodule under backend called vecExcpMod enters a state machine that temporarily blocks instructions from entering Dispatch. However, when an interrupt occurs, exceptionHappen will also be high, and the data in exceptionDataRead will be invalid. If this invalid data coincidentally causes vecExcpInfo.valid to be high, it will incorrectly block instructions from entering Dispatch, leading to a deadlock. Therefore, the assignment for this signal needs to exclude the interrupt scenario, changing it to vecExcpInfo.valid := exceptionHappen && !intrEnable && exceptionDataRead.bits.vstartEn && exceptionDataRead.bits.isVecLoad && !exceptionDataRead.bits.isEnqExcp. | Link | - |
15 | KMH22-1947 | During an interrupt, vecExcpInfo.valid is incorrectly updated, leading to vecExcpInfo.bits potentially containing uninitialized values, which can cause X-state propagation or random deadlocks. | 2024-12-06 | 24101101 | vecExcpInfo.valid := exceptionHappen && exceptionDataRead.bits.vstartEn && exceptionDataRead.bits.isVecLoad && !exceptionDataRead.bits.isEnqExcp. When this signal is high, it indicates that a vector memory-related exception needs to be handled. At this point, a submodule under backend called vecExcpMod enters a state machine that temporarily blocks instructions from entering Dispatch. However, when an interrupt occurs, exceptionHappen will also be high, and the data in exceptionDataRead will be invalid. If this invalid data coincidentally causes vecExcpInfo.valid to be high, it will cause the vecExcpMod module to incorrectly block instructions from entering Dispatch, leading to a deadlock. Therefore, the assignment for this signal needs to exclude the interrupt scenario, changing it to vecExcpInfo.valid := exceptionHappen && !intrEnable && exceptionDataRead.bits.vstartEn && exceptionDataRead.bits.isVecLoad && !exceptionDataRead.bits.isEnqExcp. | Link | - |
16 | KMH22-1957 | If the ITLB is not synchronized during a specific prefetch flush, it can cause the ITLB to enter a need_gpf state, preventing it from receiving previous requests. | 2024-12-09 | 24120301 | If the ITLB is not synchronized during a specific prefetch flush, it can cause the ITLB to enter a need_gpf state, preventing it from receiving previous requests. | Link | - |
17 | KMH22-1968 | An incorrect index was used to select the ECC. | 2024-12-12 | 24120901 | An incorrect index was used to select the ECC. | Link | - |
18 | KMH22-1971 | The vecExceptionFlag marking position condition is incorrect. | 2025-01-13 | 24120901 | The vecExceptionFlag marking position condition is incorrect. | Link | - |
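
As an illustration of the kind of condition a regression test has to hit, the vecExcpInfo.valid expressions quoted for KMH22-1872 and KMH22-1947 can be modeled as plain boolean functions. This is a minimal sketch of the logic only, not the Chisel source or a picker-generated DUT: the signal names come from the table, while the `ExceptionData` class and the two function names are hypothetical helpers introduced for this example.

```python
from dataclasses import dataclass

@dataclass
class ExceptionData:
    """Hypothetical container for the exceptionDataRead.bits fields used below."""
    vstartEn: bool
    isVecLoad: bool
    isEnqExcp: bool

def vec_excp_valid_buggy(exceptionHappen: bool, data: ExceptionData) -> bool:
    # Expression before the fix: the interrupt case is not excluded.
    return (exceptionHappen and data.vstartEn
            and data.isVecLoad and not data.isEnqExcp)

def vec_excp_valid_fixed(exceptionHappen: bool, intrEnable: bool,
                         data: ExceptionData) -> bool:
    # Expression after the fix: !intrEnable masks the interrupt case.
    return (exceptionHappen and not intrEnable and data.vstartEn
            and data.isVecLoad and not data.isEnqExcp)

# Stimulus that separates the two versions: an interrupt arrives while the
# (invalid) exception data happens to look like a vector load exception.
stale = ExceptionData(vstartEn=True, isVecLoad=True, isEnqExcp=False)
interrupt_case_buggy = vec_excp_valid_buggy(True, stale)
interrupt_case_fixed = vec_excp_valid_fixed(True, True, stale)

# A genuine vector memory exception (no interrupt) is still reported.
real_exception_fixed = vec_excp_valid_fixed(True, False, stale)
```

In this model the interrupt stimulus raises the signal only in the pre-fix expression, so a test built around that stimulus would fail on the buggy RTL and pass on the fixed one, while a genuine vector memory exception is still flagged after the fix.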