DESIGN AND VERIFICATION **CONFERENCE AND EXHIBITION**

#### UNITED STATES

SAN JOSE, CA, USA FEBRUARY 27-MARCH 2, 2023

#### Early Detection of Functional Corner Case Bugs using Methodologies of the ISO 26262

Moonki Jang, Samsung Electronics




## Agenda

- Introduction of ISO 26262
- Systematic Failure Analysis
- Systematic failure model generation using Machine Learning
- SFA for requirements-driven verification
- Conclusion





## Requirements of Functional Safety

- For a long time, electronics were a comfort feature
  - Now, they are a safety feature







### What is ISO 26262?

- Functional Safety standard for Road vehicles
  - Aims to address possible hazards caused by the malfunctioning behaviour of electronic and electrical systems in vehicles.
  - The first edition was published on 11 November 2011.
  - The second edition, published in December 2018, added 'Part 11. Guidelines on application of ISO 26262 to semiconductors'
- Based on IEC 61508: Functional Safety of Electrical / Electronic / Programmable Electronic Safety-related Systems
  - Both IEC 61508 and ISO 26262 are risk-based safety standards





#### Overall framework of ISO 26262

- ISO 26262 V diagram (overall structure of the standard)
  - Part 2: Management of functional safety
    - 2-5 Overall safety management
    - 2-6 Safety management during the concept phase and the product development phases
    - 2-7 Safety management during production, operation, service and decommissioning
  - Part 3: Concept phase
    - 3-5 Item definition
    - 3-6 Hazard analysis and risk assessment
    - 3-7 Functional safety concept
  - Part 4: Product development at the system level
    - 4-5 General topics for the product development at the system level
    - 4-6 Technical safety concept
    - 4-7 System architectural design
    - 4-8 System and item integration and verification
    - 4-9 Safety validation
  - Part 5: Product development at the hardware level
    - 5-5 General topics for the product development at the hardware level
    - 5-6 Specification of hardware safety requirements
    - 5-7 Hardware design
    - 5-8 Evaluation of the hardware architectural metrics
    - 5-9 Evaluation of the safety goal violations due to random hardware failures
    - 5-10 Hardware integration and verification
  - Part 6: Product development at the software level
    - 6-5 General topics for the product development at the software level
    - 6-6 Specification of software safety requirements
    - 6-7 Software architectural design
    - 6-8 Software unit design and implementation
    - 6-9 Software unit verification
    - 6-10 Software integration and verification
    - 6-11 Testing of the embedded software
  - Part 7: Production, operation, service and decommissioning
    - 7-5 Planning for production, operation, service and decommissioning
    - 7-6 Production
    - 7-7 Operation, service and decommissioning
  - Part 8: Supporting processes
    - 8-5 Interfaces within distributed developments
    - 8-6 Specification and management of requirements
    - 8-7 Configuration management
    - 8-8 Change management
    - 8-9 Verification
    - 8-10 Documentation management
    - 8-11 Confidence in the use of software tools
    - 8-12 Qualification of software components
    - 8-13 Evaluation of hardware elements
    - 8-14 Proven in use argument
    - 8-15 Interfacing of a base vehicle or item in an application out of scope of ISO 26262
    - 8-16 Integration of safety related systems not developed according to ISO 26262
  - Part 9: ASIL-oriented and safety-oriented analyses
    - 9-5 Requirements decomposition with respect to ASIL tailoring
    - 9-6 Criteria for coexistence of elements
    - 9-7 Analysis of dependent failures
    - 9-8 Safety analyses
  - Part 10: Guidelines on ISO 26262
  - Part 12: Adaptation of ISO 26262 for motorcycles
    - 12-5 Safety culture
    - 12-6 Confirmation measures
    - 12-7 Hazard analysis and risk assessment
    - 12-8 Vehicle integration and testing
    - 12-9 Safety validation





# Foundations of Functional Safety

- Functional Safety
  - Avoidance of Systematic Faults
  - Control of Systematic Faults
  - Control of Random Hardware Faults





# Random Hardware Failures

- Random Hardware Failure
  - Failure that can occur unpredictably during the lifetime of a hardware element and that follows a probability distribution
- Measures against failures
  - Required runtime safety mechanisms (self-tests, diagnostic coverage)
  - Redundancy, safety layer
  - SPFM (Single Point Fault Metric) : shows robustness of the item to single-point faults
    - Single Point Fault : Fault in an element that is not covered by a safety mechanism
  - LFM (Latent Fault Metric) : shows robustness of the item to latent faults
    - Latent Fault : Multi-point fault whose presence is not detected by a safety mechanism
  - PMHF (Probabilistic Metric for random Hardware Failures)
    - Calculates the system failure rates and assesses them against the ASIL for functional safety
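The two architectural metrics can be made concrete with a small calculation. The sketch below assumes the formulation commonly cited from ISO 26262-5: SPFM is one minus the share of single-point and residual fault rates in the total failure rate of the safety-related hardware, and LFM is one minus the share of latent multi-point fault rates in what remains. The function names and the FIT values are illustrative and do not come from the presentation.

```python
def spfm(total_fit: float, single_point_fit: float, residual_fit: float) -> float:
    """Single Point Fault Metric: 1 - (single-point + residual) / total."""
    if total_fit <= 0:
        raise ValueError("total_fit must be positive")
    return 1.0 - (single_point_fit + residual_fit) / total_fit


def lfm(total_fit: float, single_point_fit: float,
        residual_fit: float, latent_fit: float) -> float:
    """Latent Fault Metric: 1 - latent / (total - single-point - residual)."""
    remaining = total_fit - single_point_fit - residual_fit
    if remaining <= 0:
        raise ValueError("no failure rate left after removing SPF and RF")
    return 1.0 - latent_fit / remaining


# Hypothetical failure rates in FIT (failures per 10^9 device hours).
example_spfm = spfm(total_fit=1000.0, single_point_fit=5.0, residual_fit=5.0)
example_lfm = lfm(total_fit=1000.0, single_point_fit=5.0,
                  residual_fit=5.0, latent_fit=49.5)
print(f"SPFM = {example_spfm:.2%}, LFM = {example_lfm:.2%}")
```

With these made-up inputs the item reaches roughly 99% SPFM and 95% LFM; real values come from the failure-rate analysis of the actual design.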





#### Systematic Failures

- A systematic failure arises from the very activities that develop and produce a system.
  - Human error by personnel involved in development and production activities is the biggest cause.
- RTL bugs caused by incorrect design in the semiconductor design process are typical systematic failures.





#### How to prevent these Systematic Failures?

- ISO 26262 relies on traditional design verification methodologies.
- However, as system complexity increases, errors caused by unintended behavior under interaction conditions are difficult to detect with existing verification methods and are often found only at the silicon level.
- Detecting such complex systematic faults requires new, more robust methodologies.





# Example of functional corner case bugs - 1

- Errata case reported by the IP provider

Cluster might not respond to snoop during coherency connection handshake

**Conditions:**

1. The Cluster is in the OFF, MEM\_RET, or DEBUG\_RECOV power mode.
2. The Cluster is powered on by the system requesting a transition on the cluster P-channel to the ON power mode. This causes the Cluster to request to connect to system coherency (SYSCOREQ=1, SYSCOACK=0).
3. The interconnect sends a snoop to the Cluster after it has observed SYSCOREQ HIGH but before it has asserted SYSCOACK.
4. The interconnect has a dependency that causes it to delay asserting SYSCOACK while the snoop transaction is outstanding.
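The hazardous window described by these conditions can be checked on a simulation trace. The following is a minimal sketch, assuming a hypothetical trace format of `(timestamp, event_name)` tuples with invented event names (`SYSCOREQ_HIGH`, `SYSCOACK_HIGH`, `SNOOP_SENT`, and the matching `_LOW` events); it is not part of the presented testbench.

```python
from typing import Iterable, Tuple

Event = Tuple[int, str]  # (timestamp, event name); names are hypothetical


def snoop_in_handshake_window(events: Iterable[Event]) -> bool:
    """Return True if a snoop is sent while SYSCOREQ is high and SYSCOACK is low."""
    req_high = False
    ack_high = False
    # Sort by timestamp only, so events with equal timestamps keep input order.
    for _, name in sorted(events, key=lambda event: event[0]):
        if name == "SYSCOREQ_HIGH":
            req_high = True
        elif name == "SYSCOREQ_LOW":
            req_high = False
        elif name == "SYSCOACK_HIGH":
            ack_high = True
        elif name == "SYSCOACK_LOW":
            ack_high = False
        elif name == "SNOOP_SENT" and req_high and not ack_high:
            return True
    return False


hazard_trace = [(0, "SYSCOREQ_HIGH"), (5, "SNOOP_SENT"), (9, "SYSCOACK_HIGH")]
safe_trace = [(0, "SYSCOREQ_HIGH"), (4, "SYSCOACK_HIGH"), (5, "SNOOP_SENT")]
print(snoop_in_handshake_window(hazard_trace), snoop_in_handshake_window(safe_trace))
```

A check like this only flags that the corner case was exercised; whether the design deadlocks still has to be observed in the simulation itself.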







# Example of functional corner case bugs - 2

Protocol conflict between PCIe and ACE interface



- **Condition 1:** PCIe RC buffer overflowed
- **Condition 2:** CPU generates a Writeback transaction
- **Condition 3:** Snoop generated from a posted write of PCIe










#### Introduction of SFA (Systematic Failure Analysis)

- We created Systematic Failure Analysis (SFA) to expand the functional verification coverage by extracting risk factors from the IP level and predicting risks.
  - Failure mode definition
  - Risk assessment
  - FMEA (Failure Mode and Effect Analysis)
  - DFA (Dependent Failure Analysis)





#### Failure mode definition for SFA

- Failure modes are defined to predict possible failures
  - FM1: Integration issues (connection, configuration, ...)
  - FM2: Accessibility issues (access path, access control, ...)
  - FM3: Functionality issues (wrong output, unintended behavior, ...)
  - FM4: State transition issues (power gating, clock gating, reset, ...)
  - FM5: Absence of independence or FFI (Freedom from Interference)





## Risk factors

- Hazardous functionality
- Proven in use level
- Severity level
- Known issues in another project
- Applicable workaround





#### Definition of SFSL (Systematic Failure Severity Level)

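- SFSL indicates the functional safety level guaranteed by the system.

$$\text{Risk Level} = \frac{P + S + H + K}{W}$$

where $P$ (proven in use level), $S$ (severity level), and $W$ (applicable workaround) each range from 4 down to 1, and $H$ (hazardous functionality) and $K$ (known issues in another project) each take the value 4 or 0.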

| SFSL   | Level Definition     | Description                                                             |
|--------|----------------------|-------------------------------------------------------------------------|
| SFSL_A | Risk Level > 12      | Very high risk of critical failures. Detailed verification is required. |
| SFSL_B | 8 < Risk Level <= 12 | High risk of critical failures. Additional verification is required.    |
| SFSL_C | 4 < Risk Level <= 8  | Medium risk of critical failures. Impact analysis is required.          |
| SFSL_D | Risk Level <= 4      | Low risk of critical failures.                                          |
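The risk level formula and the SFSL thresholds translate directly into code. The sketch below assumes that P, S, and W are integers from 1 to 4 and that H and K are yes/no flags scoring 4 or 0, which matches how the FMEA slide records them (Y/N); the function names are illustrative.

```python
def risk_level(p: int, s: int, hazardous: bool, known_issue: bool, w: int) -> float:
    """Risk Level = (P + S + H + K) / W, with H and K scoring 4 when set."""
    for name, value in (("p", p), ("s", s), ("w", w)):
        if not 1 <= value <= 4:
            raise ValueError(f"{name} must be between 1 and 4, got {value}")
    h = 4 if hazardous else 0
    k = 4 if known_issue else 0
    return (p + s + h + k) / w


def sfsl(level: float) -> str:
    """Map a risk level to its SFSL class using the thresholds in the table."""
    if level > 12:
        return "SFSL_A"
    if level > 8:
        return "SFSL_B"
    if level > 4:
        return "SFSL_C"
    return "SFSL_D"


# Inputs taken from two rows of the FMEA table (CPU_CPD_FM5 and CMU_ACG_FM1).
cpu_cpd_fm5 = risk_level(p=3, s=4, hazardous=True, known_issue=True, w=1)
cmu_acg_fm1 = risk_level(p=1, s=2, hazardous=False, known_issue=False, w=3)
print(cpu_cpd_fm5, sfsl(cpu_cpd_fm5))
print(cmu_acg_fm1, sfsl(cmu_acg_fm1))
```

Both results agree with the SFSL column of the FMEA table: a risk level of 15 maps to A and a risk level of 1 maps to D.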





## FMEA (Failure Mode and Effects Analysis)

Failure Mode and Effects Analysis (FMEA) determines all possible ways a system component can fail and the effect of each failure on the system.

| ID | Requirements | Potential Failure Mode(s) | Potential Cause of Failure | Potential Effect of Failures | SFSL | P | S | H | K | W | Related high level functions | Occurrence Conditions |
|----|--------------|---------------------------|----------------------------|------------------------------|------|---|---|---|---|---|------------------------------|-----------------------|
| CPU_CPD_FM3 | CPUCL_F1 | P-ch handshaking has failed | | Power mode transition does not work | B | 3 | 3 | Y | N | | System idle/sleep mode | Try power mode transition |
| CPU_CPD_FM5 | - | ACE interface stalled after snoop arrived between coherency disconnect and coherency disable | Reported Errata: 1500609 | Deadlock occurred between CPUCL and BUS | A | 3 | 4 | Y | Y | 1 | System sleep mode | Generate snoop between SYSCOREQ and SYSCOACK |
| CMU_ACG_FM1 | CPUCL_F2 | Wrong clock PLL ratio | Wrong PLL configuration | Generates wrong clk out | D | 1 | 2 | N | N | 3 | Normal active mode | Check clk after CMU init |
| CMU_ACG_FM5 | CPUCL_F2 | Unintended clock gating occurred while CPU is in active state | Interference occurred between clk gating sequence | Deadlock occurred due to incomplete transactions | A | 2 | 4 | Y | Y | | System sleep mode | Access CPUCL0 register between cpucl0_clk_gating_en and cpucl0_clk_blocking_ext_en |





#### DFA (Dependent Failure Analysis)

• The analysis of dependent failures aims to identify the single events or single causes that could bypass or invalidate a required independence or freedom from interference between elements and violate a safety requirement or a safety goal







# DFA implementation for DV

- We found DFIs (Dependent Failure Initiators) and coupling factors by:
  - Fault injection
    - Uncorrectable ECC error injection (DRAM, L3 D-cache, L1/L2 D-cache)
    - Memory Management Unit (MMU) translation fault generation
    - RAS (Reliability, Availability, and Serviceability) error injection for the CPU, interrupt controller, and System MMU
  - Generating interference stimulus for a shared memory region
    - False sharing coherency access
    - Distributed Virtual Memory (DVM) transaction broadcasting
    - Exclusive access
    - CPU cluster power down





# Output of DFA

- FTR (Fault Tolerance Report)

All rows below belong to FMEA_ID M001 (fault injection type: ECC error; target: DRAM; expected failures: error interrupt / error response).

| FTR_ID  | stimulus_1           | stimulus_2       | stimulus_3           | stimulus_4        | stimulus_5        | Recovery result | Fault Tolerant Time Interval (FTTI) | Scenario name | Seed number |
|---------|----------------------|------------------|----------------------|-------------------|-------------------|-----------------|-------------------------------------|---------------|-------------|
| M001_1  | false sharing access |                  |                      |                   |                   | done            | 80                                  | dram_1_ecc_1  | 3523        |
| M001_2  | false sharing access | exclusive access |                      |                   |                   | done            | 100                                 | dram_1_ecc_2  | 3475        |
| M001_3  | false sharing access | exclusive access | MMU page remap       |                   |                   | done            | 105                                 | dram_1_ecc_3  | 2531        |
| M001_4  | false sharing access | exclusive access | MMU page remap       | cluster powerdown |                   | done            | 105                                 | dram_1_ecc_4  | 3767        |
| M001_5  | false sharing access | exclusive access | MMU page remap       | cluster powerdown | DFS level change  | done            | 110                                 | dram_1_ecc_5  | 8236        |
| M001_6  | exclusive access     |                  |                      |                   |                   | done            | 50                                  | dram_1_ecc_1  | 3257        |
| M001_7  | exclusive access     | MMU page remap   |                      |                   |                   | done            | 55                                  | dram_1_ecc_2  | 3278        |
| M001_8  | exclusive access     | MMU page remap   | false sharing access |                   |                   | done            | 90                                  | dram_1_ecc_3  | 4291        |
| M001_9  | exclusive access     | MMU page remap   | false sharing access | DFS level change  |                   | done            | 93                                  | dram_1_ecc_4  | 3982        |
| M001_10 | exclusive access     | MMU page remap   | false sharing access | DFS level change  | cluster powerdown | done            | 97                                  | dram_1_ecc_5  | 7218        |
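The FTR rows follow a visible pattern: each scenario adds one more interference stimulus to the previous one, for two different stimulus orderings. A minimal sketch of how such a scenario list could be enumerated is shown below; the function name and dictionary layout are assumptions for illustration and are not the tooling used in the presentation.

```python
from typing import Dict, List, Sequence


def cumulative_scenarios(fmea_id: str,
                         orderings: Sequence[Sequence[str]]) -> List[Dict]:
    """Build one scenario per prefix of each stimulus ordering."""
    scenarios = []
    index = 1
    for ordering in orderings:
        for depth in range(1, len(ordering) + 1):
            scenarios.append({
                "ftr_id": f"{fmea_id}_{index}",
                "stimuli": list(ordering[:depth]),
            })
            index += 1
    return scenarios


# The two stimulus orderings that appear in the FTR table for M001.
orderings = [
    ["false sharing access", "exclusive access", "MMU page remap",
     "cluster powerdown", "DFS level change"],
    ["exclusive access", "MMU page remap", "false sharing access",
     "DFS level change", "cluster powerdown"],
]
scenarios = cumulative_scenarios("M001", orderings)
for scenario in scenarios:
    print(scenario["ftr_id"], scenario["stimuli"])
```

Each generated entry would still need a fault injection, a seed, and a simulation run before recovery result and FTTI can be recorded.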





# Output of DFA

- DFA result

| FMEA_ID | Element | Redundant Element | Functional Dependency | DFI: Systematic fault | DFI: Shared resource | Expected Dependent Failure | Verification Method |
|---------|---------|-------------------|-----------------------|-----------------------|----------------------|----------------------------|---------------------|
| CPU_CPD_FM5 | BLK_CPUCL0 | | CPU should wake up GPU for requested GPU processing | Stalled ACE interface of CPU | | GPU can't wake up and system hang occurs | PSS_ML_fault_model |
| CPU_CPD_FM5 | BLK_CPUCL0 | BLK_PCIe | [...] request and it will generate [...] | Stalled ACE interface of CPU | | PCIe will not be available. Posted write will wait for snoop response from CPU. | PSS_ML_fault_model |
| MIF_FAULT_1 | Memory scheduler: ECC logic | BLK_CPUCL0 | False sharing | | | CPU will access fault address during ECC error state | PSS_fault_injection_model |
| MIF_FAULT_1 | Memory scheduler: ECC logic | BLK_CPUCL0 | Exclusive access | | | CPU will access fault address during ECC error state | PSS_fault_injection_model |










## Requirements of systematic failure model

- Even when SW is executed at the same time, the resulting HW events do not occur at the same time.

- We had to insert delays to synchronize the HW events.
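The idea of inserting delays can be illustrated with simple arithmetic. The sketch below assumes that the latency from each SW trigger to its HW event has already been measured and is stable; the stimulus names and cycle counts are invented, and the presentation does not state how its delays were actually derived.

```python
from typing import Dict


def alignment_delays(latencies: Dict[str, int]) -> Dict[str, int]:
    """Return the delay to add before each SW trigger so all HW events coincide.

    The slowest stimulus gets no delay; every other one waits for the difference.
    """
    if not latencies:
        return {}
    target = max(latencies.values())
    return {name: target - latency for name, latency in latencies.items()}


# Hypothetical SW-to-HW latencies in clock cycles.
measured = {
    "cpu_writeback": 120,
    "pcie_posted_write": 45,
    "cluster_powerdown": 300,
}
delays = alignment_delays(measured)
print(delays)
```

In practice the latencies vary from run to run, which is why a fixed table like this is only a starting point.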



### TB structure of the systematic failure model

- We created a UVM delay counter, an output monitor, and an output repository for machine learning and result analysis.







#### Machine Learning implementation







# ML sequence modeling flow












#### Reusable output of SFA

- SFA output can be reused for various requirements-based standards

#### Architecture and design specification

- Change management and impact analysis report
- Detailed hardware design specification and requirements

#### Verification plan

- Tools, methods, and environments used for verification
- Verification strategy for target design

#### Verification specification

- Risk analysis report
- Function list with correlations for target design
- Test cases, test data and objects

#### Verification report

- FMEA report
- DFA report
- Coverage report










#### Conclusion

- Higher levels of reliability will be required for semiconductors
  - A reinforced HARA (Hazard Analysis and Risk Assessment) process will be required
  - Innovative expansion of the verification coverage is needed
- ISO 26262 is not just a reference. It will be a common requirement for our future development process.





- Questions and Answers
  - Please feel free to contact me (moonki.jang@gmail.com)



