

# ISO 26262 Dependent Failure Analysis Using PSS

Moonki Jang – Samsung Electronics Co., Ltd.



cādence<sup>®</sup>





#### Agenda

- Introduction to ISO 26262
- ISO 26262 functional safety features for semiconductor
- Using PSS for DFA (Dependent Failure Analysis)
- Results and lessons learned





### Background of ISO 26262

- For a long time, electronics were a comfort feature
  - Now they are a safety feature







# **Functional Safety**

- Functional safety (ISO 26262)
  - Absence of unreasonable risk due to hazards caused by malfunctioning or unintended behavior of E/E (electrical/electronic) systems
  - Possible root causes
    - Specification, implementation or realization errors
    - Failure during operation
    - Reasonably foreseeable misuse / operational errors





#### Overall framework of ISO 26262












#### ISO 26262 for Semiconductor

- The 2<sup>nd</sup> edition of ISO 26262 was released in 2018. It adds Part 11, a guideline on applying ISO 26262 to semiconductors

| Main Agenda | Applicable Items |
|-------------|------------------|
| <ul> <li>Base failure rate estimation <ul> <li>Permanent fault</li> <li>Transient fault</li> <li>Component package failure</li> </ul> </li> <li>Dependent failure analysis (DFA) <ul> <li>Fault injection</li> </ul> </li> </ul> | <ul> <li>Digital components, memories</li> <li>Analogue / Mixed signal components</li> <li>Programmable logic devices</li> <li>Multi-core components</li> <li>Sensors and transducers</li> </ul> |





# Dependent Failure Analysis (DFA)

 The analysis of dependent failures aims to identify the single events or single causes that could bypass or invalidate a required independence or freedom from interference between elements and violate a safety requirement or a safety goal.







# Dependent Failure Initiator (DFI)

- The Dependent Failure Initiator (DFI) represents the root cause of dependent failures in functional safety
- In general, a DFI is defined as an item that can threaten the independence required between elements





# **Defining DFIs**

- Failure Mode and Effects Analysis (FMEA) identifies the possible ways a system component can fail and the effect of each failure on the system. DFIs are selected from the pre-defined FMEA items, as shown below.

| ID | Name / Function (Requirements) | Potential Failure Mode(s) | Potential Effect(s) of Failure | Sev | Potential Cause(s) of Failure | Occur | Current Design Controls (Prevention) | Current Design Controls (Detection) | Det | Recommended Preventive Action(s) | Recommended Detection Action(s) | Responsibility & Target Completion Date | Action Results: Actions Taken | Action Results: Sev | Action Results: Occr | Action Results: Det |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| M001 | Memory scheduler:ECC_logic | ECC error<br>- double bit | Loss of basic functionality | | memory cell defect due to the electrostatic | | Experienced Designer / Review | Simulation | | | interrupt / error response | | system reboot / masking problem area | | | |
| M002 | Memory scheduler:AXI_Interface | SFRs not writeable | Address mapping not correct / Loss of basic functionality | | AXI Slave Interface wrongly implemented / SW fault | | Reuse / Family Concept | Simulation | 8 | | error response | | system reboot / masking problem area | | | |





## Fault Injection

- In our experiment, a fault occurring in a shared memory area is defined as the DFI and implemented through fault injection:
  - Uncorrectable ECC error injection
  - Memory Management Unit (MMU) translation fault generation
  - RAS error injection for the CPU, interrupt controller, and System MMU





## **Coupling Factor**

- A coupling factor is a common characteristic or relationship of elements that leads to dependency in their failures.
- The following coherency interference stimuli on a shared memory region can act as coupling factors:
  - False-sharing coherency access
  - Distributed Virtual Memory (DVM) transaction broadcasting
  - Exclusive access
  - CPU cluster power-down










#### Why PSS?

- For DFA, we need to create hundreds of scenarios that combine, for each DFI, all of the functions that can act as coupling factors
- PSS model reusability and constrained-random test generation made it easy to generate tests covering the various conditions defined in the safety requirements
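
To give a rough sense of why the scenario count grows so quickly, the sketch below enumerates ordered chains of coupling factors for each DFI. It is a Python model, not PSS, and not the flow used in the experiment. The DFI and coupling-factor names follow the slides; the function name `enumerate_scenarios` and the chain-length limit are assumptions made for illustration.

```python
from itertools import permutations

# Names follow the slides; the lists are illustrative, not exhaustive.
DFIS = ["ecc_uncorrectable", "mmu_translation_fault", "ras_error"]
COUPLING_FACTORS = [
    "false_sharing_access",
    "dvm_broadcast",
    "exclusive_access",
    "cluster_power_down",
]


def enumerate_scenarios(dfis, factors, max_len):
    """Return (dfi, ordered tuple of coupling factors) pairs.

    Order matters because interference stimuli are applied in sequence,
    so permutations are used rather than combinations.
    """
    scenarios = []
    for dfi in dfis:
        for length in range(1, max_len + 1):
            for chain in permutations(factors, length):
                scenarios.append((dfi, chain))
    return scenarios


scenarios = enumerate_scenarios(DFIS, COUPLING_FACTORS, max_len=3)
print(len(scenarios))
```

With three DFIs and four coupling factors, chains of up to three stimuli already give 120 scenarios, which is why hand-writing each test does not scale.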





# Dependency of Multi-Core System

- Cache coherence is the discipline which ensures that the changes in the values of shared operands (data) are propagated throughout the system in a timely fashion.
- A fault in a shared resource can therefore affect every element that shares that resource.







### False-Sharing Operation

- Each master uses a unique address range within the same cache line
- Each time a coherent master writes to the block allocated to it, snoop transactions are generated between the coherent masters to invalidate that line in the caches of all other masters
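
The address arithmetic behind false sharing can be sketched as follows. This is a toy Python model under stated assumptions: a 64-byte cache line (the size quoted elsewhere in this deck), a hypothetical base address, four hypothetical masters named `cpu0` to `cpu3`, and one invalidating snoop per other master per write. Real interconnects may behave differently.

```python
CACHE_LINE_BYTES = 64  # line size quoted elsewhere in the deck


def cache_line_index(addr):
    """Index of the cache line that contains byte address `addr`."""
    return addr // CACHE_LINE_BYTES


def snoops_per_write(masters_holding_line, writer):
    """Toy count: one invalidating snoop per other master holding the line."""
    return len(masters_holding_line - {writer})


# Hypothetical layout: four masters, each owning a disjoint 16-byte slice
# of the same 64-byte line starting at 0x8000_0000.
BASE = 0x8000_0000
slices = {f"cpu{i}": range(BASE + 16 * i, BASE + 16 * (i + 1)) for i in range(4)}

# Every address of every slice maps to a single line index.
lines = {name: {cache_line_index(a) for a in rng} for name, rng in slices.items()}
same_line = len(set().union(*lines.values())) == 1
print(same_line, snoops_per_write(set(slices), "cpu0"))
```

No master ever touches another master's bytes, yet each write still disturbs the other three caches, which is what makes the pattern a useful coupling factor.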







## Fault Injection

- A fault occurring in a shared memory area is defined as the DFI and implemented through fault injection as follows:
  - Uncorrectable ECC error injection
    - Main Memory (DRAM)
    - Unified L3 Data Cache
    - L1/L2 Data cache
  - Memory Management Unit (MMU) translation fault generation
  - RAS (Reliability, Availability, and Serviceability) error injection for CPU, Interrupt controller, System MMU
- If a fault is injected into the shared 64-byte cache line, the preceding coherency operations cause a failure in every coherent master participating in the false-sharing scenario
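
As a minimal illustration of that last point, the sketch below lists which masters touch the cache line that contains an injected fault. The address map and the helper name `affected_masters` are hypothetical; only the 64-byte line size comes from the slide.

```python
CACHE_LINE_BYTES = 64  # cache-line size stated in the slide


def affected_masters(fault_addr, master_ranges):
    """Masters whose address range overlaps the line containing `fault_addr`."""
    line_start = fault_addr - (fault_addr % CACHE_LINE_BYTES)
    line_end = line_start + CACHE_LINE_BYTES  # exclusive
    return sorted(
        name
        for name, (start, end) in master_ranges.items()  # end is exclusive
        if start < line_end and end > line_start
    )


# Hypothetical address map: cpu0..cpu2 falsely share one line, cpu3 does not.
master_ranges = {
    "cpu0": (0x1000, 0x1010),
    "cpu1": (0x1010, 0x1020),
    "cpu2": (0x1020, 0x1040),
    "cpu3": (0x1040, 0x1080),
}

print(affected_masters(0x1008, master_ranges))
```

A fault at `0x1008` lies in `cpu0`'s slice, but `cpu1` and `cpu2` are reported as well because they share the line; `cpu3` sits in the next line and is not.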





#### **Fault Generation**

- Using PSS, the fault injection options above are modeled as reusable actions, so the model can generate various DFIs with the desired number of faults at any given time.

```
action activity_selection {
    activity {
        // Randomly select one of the weighted choices:
        select {
            // Valid coherency actions:
            [85]: do read_increment_write;
            // Error injection options:
            // - RAS error injection (library)
            [5]: do cdn_coherency_ops_c::ras_core_error_inject;
            // - MMU translation fault generation
            [5]: do core_remap_ttbr_error_inject;
            // - Uncorrectable ECC error injection
            [5]: do ecc_memory_error_inject;
        }
    }
}
```
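
The effect of the branch weights can be approximated outside a PSS tool. The Python sketch below draws actions with probability proportional to the weights in the `select` statement above; it models the weighting only and is not how a PSS tool solves or schedules a scenario. The helper name `pick_actions` and the seed value are assumptions.

```python
import random
from collections import Counter

# Branch weights copied from the PSS `select` statement above.
BRANCHES = {
    "read_increment_write": 85,
    "ras_core_error_inject": 5,
    "core_remap_ttbr_error_inject": 5,
    "ecc_memory_error_inject": 5,
}


def pick_actions(n, seed):
    """Draw n actions with probability proportional to the branch weights."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    names = list(BRANCHES)
    weights = list(BRANCHES.values())
    return rng.choices(names, weights=weights, k=n)


counts = Counter(pick_actions(10_000, seed=1))
for name in BRANCHES:
    print(f"{name}: {counts[name] / 10_000:.1%}")
```

Over many draws, roughly 85% of the selected actions are valid coherency traffic and about 15% are fault injections, split evenly across the three injection types.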





# Interference Stimulus Generation

- Once the DFI is determined, the PSS model selects interference stimuli, each of which can act as a coupling factor, to create a dependent failure scenario.

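```
action false_sharing_with_err_injection_and_interference {
  activity {
    parallel {
      // Fault-injecting false-sharing traffic
      do false_sharing_with_err_injection;
      // Concurrent interference: ten randomly selected stimuli
      repeat (10) {
        select {
          // Branches visible on the slide; the listing is cut off after these
          do change_frequency;
          do cdn_coherency_ops::power_activity;
          do cdn_coherency_ops::exclusive_cache_access;
        }
      }
    }
  }
}
```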
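
The shape of the resulting scenario can be modeled in a few lines of Python. This sketch only records which stimulus each of the ten iterations picks; it does not execute anything concurrently and is not the PSS tool's algorithm. The stimulus names follow the three branches visible in the snippet above, while `interference_sequence`, `build_scenario`, and the seed value are hypothetical.

```python
import random

# Stimulus names follow the three branches visible in the PSS snippet above.
STIMULI = ["change_frequency", "power_activity", "exclusive_cache_access"]


def interference_sequence(seed, repeats=10):
    """Pick one stimulus per iteration, mirroring `repeat (10) { select {...} }`."""
    rng = random.Random(seed)
    return [rng.choice(STIMULI) for _ in range(repeats)]


def build_scenario(seed):
    """Pair the fault-injecting branch with the interference branch.

    The two entries stand for the two branches of the PSS `parallel` block;
    this model only records them, it does not run them.
    """
    return {
        "seed": seed,
        "parallel": {
            "fault_branch": "false_sharing_with_err_injection",
            "interference_branch": interference_sequence(seed),
        },
    }


scenario = build_scenario(seed=1)  # arbitrary seed, not one from the experiment
print(scenario["parallel"]["interference_branch"])
```

Because the sequence depends only on the seed, re-running with the same seed reproduces the same interference pattern, which is the property that makes a failing scenario debuggable.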





#### **Generated Dependent Failure Scenario**







# Interference Reporting

- Each scenario prints the following information when the simulation completes:
  - Injected fault information
  - Executed interference action information
  - Maximum Fault Tolerant Time Interval (FTTI) information
  - External recovery monitor
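
One possible shape for such a per-scenario record is sketched below in Python. All field names and the log layout are hypothetical, since the slides do not show the actual print format. The example values mirror the `M001_2` entry of the FTR in this deck, and the FTTI unit is left unspecified because the slides do not state it.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioReport:
    """One end-of-simulation record; all field names are hypothetical."""
    scenario_name: str
    seed: int
    injected_fault: str                      # e.g. "ECC error @ DRAM"
    interference_actions: list = field(default_factory=list)
    max_ftti: int = 0                        # unit is not stated in the slides
    recovery_done: bool = False              # from the external recovery monitor

    def to_lines(self):
        """Render the four reported items as plain-text log lines."""
        actions = ", ".join(self.interference_actions) or "none"
        status = "done" if self.recovery_done else "not done"
        return [
            f"[{self.scenario_name} seed={self.seed}]",
            f"  injected fault      : {self.injected_fault}",
            f"  interference actions: {actions}",
            f"  max FTTI            : {self.max_ftti}",
            f"  recovery monitor    : {status}",
        ]


report = ScenarioReport(
    scenario_name="dram_1_ecc_2",
    seed=3475,
    injected_fault="ECC error @ DRAM",
    interference_actions=["false sharing access", "exclusive access"],
    max_ftti=100,
    recovery_done=True,
)
print("\n".join(report.to_lines()))
```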





## Fault Tolerance Report (FTR) Generation

- An FTR is generated automatically from the scenario run results

| FMEA_ID | Fault type | Fault target | Expected failures | FTR_ID | stimulus_1 | stimulus_2 | stimulus_3 | stimulus_4 | stimulus_5 | Recovery result | FTTI | Scenario name | Seed number |
|---------|------------|--------------|-------------------|--------|------------|------------|------------|------------|------------|-----------------|------|---------------|-------------|
| M001 | ECC error | DRAM | error interrupt / error response | M001_1 | false sharing access | | | | | done | 80 | dram_1_ecc_1 | 3523 |
| | | | | M001_2 | false sharing access | exclusive access | | | | done | 100 | dram_1_ecc_2 | 3475 |
| | | | | M001_3 | false sharing access | exclusive access | MMU page remap | | | done | 105 | dram_1_ecc_3 | 2531 |
| | | | | M001_4 | false sharing access | exclusive access | MMU page remap | cluster powerdown | | done | 105 | dram_1_ecc_4 | 3767 |
| | | | | M001_5 | false sharing access | exclusive access | MMU page remap | cluster powerdown | DFS level change | done | 110 | dram_1_ecc_5 | 8236 |
| | | | | M001_6 | exclusive access | | | | | done | 50 | dram_1_ecc_1 | 3257 |
| | | | | M001_7 | exclusive access | MMU page remap | | | | done | 55 | dram_1_ecc_2 | 3278 |
| | | | | M001_8 | exclusive access | MMU page remap | false sharing access | | | done | 90 | dram_1_ecc_3 | 4291 |
| | | | | M001_9 | exclusive access | MMU page remap | false sharing access | DFS level change | | done | 93 | dram_1_ecc_4 | 3982 |
| | | | | M001_10 | exclusive access | MMU page remap | false sharing access | DFS level change | cluster powerdown | done | 97 | dram_1_ecc_5 | 7218 |
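
A minimal sketch of the automatic generation step, assuming the run results are already available as Python dictionaries. The dictionary keys, the CSV layout, and the helper names `write_ftr` and `worst_case_ftti` are assumptions; the three sample rows are copied from the first three FTR entries above. The actual flow used in the experiment is not shown in the slides.

```python
import csv
import io

# Rows copied from the first three FTR entries above.
RESULTS = [
    {"ftr_id": "M001_1", "stimuli": ["false sharing access"],
     "recovery": "done", "ftti": 80, "scenario": "dram_1_ecc_1", "seed": 3523},
    {"ftr_id": "M001_2", "stimuli": ["false sharing access", "exclusive access"],
     "recovery": "done", "ftti": 100, "scenario": "dram_1_ecc_2", "seed": 3475},
    {"ftr_id": "M001_3",
     "stimuli": ["false sharing access", "exclusive access", "MMU page remap"],
     "recovery": "done", "ftti": 105, "scenario": "dram_1_ecc_3", "seed": 2531},
]

MAX_STIMULI = 5  # the FTR above has stimulus_1 .. stimulus_5 columns


def write_ftr(results):
    """Serialize run results as CSV text, padding unused stimulus columns."""
    header = (["FTR_ID"] + [f"stimulus_{i}" for i in range(1, MAX_STIMULI + 1)]
              + ["recovery", "FTTI", "scenario", "seed"])
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for r in results:
        pad = [""] * (MAX_STIMULI - len(r["stimuli"]))
        writer.writerow([r["ftr_id"], *r["stimuli"], *pad,
                         r["recovery"], r["ftti"], r["scenario"], r["seed"]])
    return out.getvalue()


def worst_case_ftti(results):
    """Largest FTTI among the runs that recovered."""
    return max(r["ftti"] for r in results if r["recovery"] == "done")


print(write_ftr(RESULTS))
print("worst-case FTTI:", worst_case_ftti(RESULTS))
```

Keeping the seed in every row means any FTR entry can be traced back to, and re-run from, the exact scenario that produced it.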





#### DFA Result

- The FTRs generated in this way for each error are entered into the DFA result as shown below, as evidence that safety is maintained under the tested error conditions.

| FMEA_ID | Element (short name and description) | Redundant element (short name and description) | Functional dependency (cascading failure): description | DFI (common cause failures): systematic faults | DFI (common cause failures): shared resources | DFI (common cause failures): single physical root cause | Measure for fault (A)voidance or (C)ontrol | Verification method | Responsible person | Status |
|---------|----|----|----|----|----|----|----|----|----|----|
| M001 | Memory scheduler:ECC_logic | CPU cluster 0/1/2/3/4 | False sharing | | Fault injection to generate an ECC error from the shared DRAM region | | interrupt / error response | simulation: FTR_ID M001_1 | | |
| | Memory scheduler:ECC_logic | CPU cluster 0/1/2/3/4 | False sharing / exclusive access | | Fault injection to generate an ECC error from the shared DRAM region | | interrupt / error response | simulation: FTR_ID M001_2 | | |
| | Memory scheduler:ECC_logic | CPU cluster 0/1/2/3/4 | False sharing / exclusive access / MMU page remap | | Fault injection to generate an ECC error from the shared DRAM region | | interrupt / error response | simulation: FTR_ID M001_3 | | |
| | Memory scheduler:ECC_logic | CPU cluster 0/1/2/3/4 | False sharing / exclusive access / MMU page remap / cluster powerdown | | Fault injection to generate an ECC error from the shared DRAM region | | interrupt / error response | simulation: FTR_ID M001_4 | | |
| | Memory scheduler:ECC_logic | CPU cluster 0/1/2/3/4 | False sharing / exclusive access / MMU page remap / cluster powerdown / DFS level change | | Fault injection to generate an ECC error from the shared DRAM region | | interrupt / error response | simulation: FTR_ID M001_5 | | |










#### Conclusion

- Using PSS, we were able to create a number of DFIs and use random fault injection scenarios to reproduce and prevent a number of dependent failure cases
- The DFA results substantially increased the verification coverage of our system
  - Each FMEA item for a shared resource generated 10x additional verification items





#### Lessons learned

- ISO 26262 can be usefully applied to the general SoC verification process as well as to functional safety
- Thanks to the scenario reusability of PSS, the same scenario could be used for SW development as well as HW development








