

# Next Generation Verification for the Era of AI/ML and 5G

Frank Schirrmeister, Pete Hardee, Larry Melling, Amit Dua, Moshik Rubin

Cadence Design Systems, Inc.







# Agenda

- The era of 5G and AI/ML
  - Challenges
  - What they mean for verification
  - Typical Designs
- Verification options to \*enable\* 5G/AI/ML
- AI/ML Inside/Outside for Verification
- Summary Outlook





# THE ERA OF 5G AND AI/ML





### **Electronics Innovation**







**Memory Technology** 

(HDD, DRAM, NAND)

### A Data-driven World

#### **Sensors Everywhere**

Internet of Things



Wireless & Wired Infrastructure 5G



ML/AI



### The Promise of 5G – Where to Start?







SYSTEMS INITIATIVE

# 5G — New Bandwidth, New Subsystems





# 5G Requirements 1/2

#### **Enhanced mobile broadband**

#### Needs

- Faster speed, Lower latency, Greater capacity
- On-the-go, ultra-high-definition video, virtual reality, and other advanced applications.

### Design characteristics

- traditional large (>200MG), complex designs
- requiring full-chip execution for SW: emulation

#### **Internet of Things**

#### Needs

- Existing networks struggling
- 5G unlocks (IoT) more connections at once
- Additional monthly revenues for carriers
- IoT revenues smaller because of low usage
- 5G competes against Wi-Fi and Zigbee.

### Design characteristics

- much smaller (<32MG), very power sensitive
- performance system dependent
- multi-device simulation / emulation for QoS and performance validation





# 5G Requirements 2/2

### **Mission-critical & control**

#### Needs

- absolute reliability in medical, vehicle safety
- Latency limiting factor
- 5G delivering lower latency
- New use cases in healthcare, utilities, traffic management, and other time-critical contexts
- Operators expect only incremental revenue

### Design characteristics

- Small-to-medium designs (<200MG)
- Functional safety drives need for system emulation

#### **Fixed wireless access**

#### Needs

- 5G, millimeter wave spectrum, capable of delivering speeds of more than 100 Mbps to the home
- Viable alternative to wired broadband
- New revenue stream for wireless operators in areas with less fiber/cable access

### Design characteristics

- Extension to traditional base-station developers
- More complexity requiring system emulation
- Design size expected to be large (>200MG)





### Design Start Market: Bifurcation





Market drivers

250M+ tablets

**Cloud Computing** 



### **Datacenter Opportunities**

Workload-optimized, high-performance compute, connectivity, accelerators – AI/ML/DL



### **Hyperscale Optimization CPU**

- Workload optimized
- Machine learning
- Deep learning
- Accelerator offloads





#### **Rack-Level Connectivity**

- Leaf /spine
- Memory pool (HBM)
- Connectivity / SiP
- Reduced latency
- Mesh / 3D-torus / fabric



#### **Scale Out Clusters**

- DNN
- SSD / NVMe
- Coherency
- VM / containers
- Mesh/3D-torus



# AI Chip Datacenter Technology

|            | Opportunities in existing market                                                                             | Emerging opportunities                                                                          |
|------------|--------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| Compute    | <ul> <li>Accelerators for parallel processing, such as<br/>GPUs<sup>1</sup> and FPGAs<sup>2</sup></li> </ul> | <ul><li>Workload-specific AI accelerators</li><li>Quantum computing, neuromorphic</li></ul>     |
| Memory     | <ul> <li>High-bandwidth memory</li> <li>On-chip memory (SRAM³)</li> </ul>                                    | <ul> <li>Quasi-volatile memory</li> <li>Non-volatile memory (NVM)</li> <li>PCM, MRAM</li> </ul> |
| Storage    | <ul> <li>Potential growth in demand for existing<br/>storage systems as more data is retained</li> </ul>     | <ul><li>AI-optimized storage systems</li><li>Emerging NVM (as storage device)</li></ul>         |
| Networking | Infrastructure for data centers                                                                              | <ul><li>Silicon photonics</li><li>Programmable switches</li></ul>                               |





# AI/ML Technology Stack



- Complex optimization across hardware, software, platforms and services
- Silicon designed to perform highly parallel operations required by AI enables simultaneous computations

"AI could allow semiconductor companies to capture 40 to 50 percent of total value from the technology stack, representing the best opportunity they've had in decades."

Source: McKinsey analysis and Cadence





## AI Chip Landscape, Key Ecosystems







Source: International Business Strategies, Inc. (IBS) Semiconductor Market Data Centers, April 2019.

AI processors expected growth: 3X faster than total semi market





# Many AI/ML Implementation Options



Source: Microsoft. Hot Chips 2017

- Deep learning algorithms show a high degree of parallelism
  - General-purpose CPUs designed for sequential workloads
  - GPUs with massively parallel execution capability have popularized DNNs over the last few years
  - Proliferation of FPGAs and domain-specific accelerator ASICs for more efficiency





### Decision Making is Shifting from MPC to AI

**Example: Path Planning** 

#### **Traditional**:

Model Predictive Control (MPC)

MPC frameworks running on CPU to generate safe trajectories



#### **Present and Future**:

ChauffeurNet - Google Waymo
Convolutional and recurrent neural networks







# Accelerating AV and ADAS Experiences







### AI, ML and 5G are connected

[Figure: 5G network topology. Labels: 5G mmWave and C-RAN; 4G and 5G; >24GHz femto cells; 'fixed wireless' replaces 'last mile' copper to homes; <6GHz, every 2-3 km; small cells every 200 m; mobile; smart homes; AR/VR; ultra-low latency; **autonomous vehicles**; massive AI; fronthaul (optical); Internet backbone; **edge computing**]



**Datacenter** 

Higher Latency





# Data Drives Edge Computing





Source: Intel



### Location of data is crucial!

#### **Datacenter Training**

- **High complexity NNs for ADAS**
- Analysis to predict our behavior

#### **Datacenter Inferencing**

- More complex data, higher latency
- Train and deploy inferenced NN



**Google Cloud TPUs** 



**Nvidia GPUs** 



Intel Nervana NNP-T

#### **Edge Training**

- Face recognition to unlock phone
- **Update with localized data**



#### **Edge Inferencing**

- Real time decisions (Vision)
- Phone unlock



**Google Coral** 



**Nvidia Jetson Nano** 











### Training & Inference at Datacenter & Edge



- AI in Datacenter
  - Scientists directly craft models for all situations
  - Model training is automated
  - Supervision is required in most cases to label data
  - Full fidelity data must be used for training
- AI at the Edge
  - Automatically adapt, deploy on available infrastructure
  - Performance needs to be achieved within the edge products' power and bandwidth constraints





### Design & Verification Requirements

#### **Datacenter**

- Big chip solutions for GPUs, NOCs, Workload specific AI accelerators
- High capacity verification, Advanced nodes
- Specific models for IP, verification
- Advanced flows: low power with high performance
- Workload optimization server flows (SBSA, ...)

### Edge

- Specialized IP, extendable processing
- Advanced IP models (verification, implementation)
- Comprehensive low power tool set for Digital and Analog IP







# Design and Verification Challenges

### Diverse requirements

- Training: high throughput, big designs
- Inference: flexibility, lowest power

### Verification Challenges

- Significant software content
- Big designs: emulation throughput, debug
- Physical and virtual interfaces, virtualization

### Physical Design Challenges

- Complex SystemVerilog Descriptions
- 1<sup>st</sup> Floorplan uncertainty
- Advanced node foundry limits and closure

### System Design Challenges

- Mix of older and advanced nodes
- New system design innovations with 2.5D/3D





**Example: AI Processor on FPGA Prototype** 





### Unifying Challenge: Verification and Software



VERIFICATION & SOFTWARE

**2**<sup>N</sup>

Source: IBS 2018



### **VERIFICATION OPTIONS TO \*ENABLE\* 5G/AI/ML**





# AI/ML Verification Requirements









#### **Formal**

- SoC Scalability
- Smart Proof technology
- Optimized regressions

### **Simulation**

- Fast simulation for high-activity designs
- UVM randomization
- Fast elaboration for replicated structures
- Coverage metrics

### **Emulation**

- Billion-gate designs
- Parallel Partition Compiler
- INT8 to INT64
- Power / Performance
- Memory models: HBMx
- Sensor model: MIPI CSI

### **Prototyping**

- Scaling to large designs
- Unified frontend
- ICE test suite: faster, data-extended (DL) regression
- SW-driven validation: deep learning data training refinement






# Verification Throughput!

Find and fix the most bugs per \$ invested in bare metal compute





### Formal Verification







# SoC Design Scalability

# Compilation Performance

✓ 2X faster in 2019





### 6 big designs from 4 different customers

- Min. compilation speed: 100M gates/hour
- Peak memory: less than 0.5 Kbyte/gate
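The two figures above support a back-of-the-envelope resource estimate. The sketch below assumes compile time and peak memory scale linearly with gate count, which the slide does not claim; it only states a minimum speed and a memory bound. The function name is hypothetical and 1 Kbyte is taken as 1024 bytes.

```python
def formal_compile_budget(gates: int,
                          gates_per_hour: float = 100e6,
                          bytes_per_gate: float = 0.5 * 1024) -> tuple[float, float]:
    """Return (worst-case compile hours, upper bound on peak memory in GiB).

    Defaults come from the slide: at least 100M gates/hour and
    less than 0.5 Kbyte of peak memory per gate.
    Linear scaling is an assumption made for illustration.
    """
    if gates < 0:
        raise ValueError("gate count must be non-negative")
    hours = gates / gates_per_hour
    peak_gib = gates * bytes_per_gate / 2**30
    return hours, peak_gib


if __name__ == "__main__":
    hours, gib = formal_compile_budget(1_000_000_000)
    print(f"1BG design: <= {hours:.1f} h compile, < {gib:.0f} GiB peak memory")
```

Under these assumptions a billion-gate design would need at most about 10 hours to compile and stay below roughly 477 GiB of peak memory.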



**40%** smaller in 2019








### **Smart Proof Automation Framework**



Component & Data Management

#### **Proof Profiling Data**

- Regular read/write

### **Proof Caching**

- Cache storage in single file
- Automatic cleanup of old cache data

# Multi Advisor Proof Orchestration

- Forced on
- No overwrite on engine mode

# Engine Algorithm Selection

- Automatic training and inference

Learning

**Machine Learning** 

Optimizes subsequent runs/regressions

Optimizes out-of-the-box proofs



Find more bugs



Better convergence



Faster proofs



### Property Performance and Convergence



Avg. **14X** speedup in last 3 years

Avg. **1.6X** speedup in 2019

### Core Engine Performance

- ✓ 10 fully-converging designs
- ✓ 14X speedup last 3 years
- ✓ 1.6X speedup in 2019

### **Core Engine Proof Success**

- ✓ 7 "hard" designs: significant # undetermined properties
- ✓ 2X reduction in undetermined properties in 2019


Avg. **3.2X** nonconverged reduction in last 3 years

Avg. **2X** nonconverged reduction in 2019



© Cadence Design Systems, Inc - All Rights Reserved



### Proof profiling data (PPD) and proof caching

#### Challenge

- Running engines to reproduce previous tool results can often be extremely costly in terms of resources, especially in cases where the design/environment has minimally changed or not changed at all
- Need ability to <u>learn from past runs of a design</u>, to optimize subsequent proofs

#### Approach

- Keep record of previous proof strategies that worked, so they can be tried again
  - prove -save\_ppd -with\_ppd
- Be able to harvest previous proof results, when confirmed that they still apply
  - set\_prove\_cache on

#### Benefits

- Better convergence: faster proofs on properties determined before frees up resources to work on other properties
- Smarter use of machine time
- Helps with reproducibility of proof results
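The "keep a record of what worked" idea can be illustrated with a toy model. This is a sketch only, not JasperGold's implementation or its PPD file format: the engine callables, the `profile` dictionary and the function name are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical engine: takes a property name, returns "proven", "cex",
# or None when the property stays undetermined.
Engine = Callable[[str], Optional[str]]


def prove_with_profile(prop: str,
                       engines: dict[str, Engine],
                       profile: dict[str, str]):
    """Try the engine recorded for `prop` first, then fall back to the rest.

    Returns (result, engine_name, attempts). The profile is updated in
    place so that the next run starts with the engine that worked.
    """
    order = list(engines)
    remembered = profile.get(prop)
    if remembered in engines:
        order.remove(remembered)
        order.insert(0, remembered)

    for attempts, name in enumerate(order, start=1):
        result = engines[name](prop)
        if result is not None:
            profile[prop] = name  # remember the strategy that converged
            return result, name, attempts
    return None, None, len(order)
```

On a first run the fallback order is used; on a second run with the same profile, the engine that converged is tried first, which is where the saved machine time comes from.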





## Proof profiling data (PPD)

- Saves knowledge collected during a prove command to be used as recommendations in subsequent runs

jgproject/sessionLogs/session\_0/sessionPPD.ppd

prove -with\_ppd

INFO (IPF149): Starting PPD exploitation on 36 matching properties

- "Best engines" orchestration
- Regular orchestration kicks in when results in PPD fail to be reproduced

Benefits

- Increases the chances of converging faster on properties determined before, and that did not have their results ...





# **Proof caching**

- Restores proof results based on properties' signatures
- Current version sensitive to small changes to design and environment



- Benefits
  - Save engine time when processing unchanged properties already determined before
  - Focus resources on properties that changed or that were never determined, improving convergence





### A Continuum of Dynamic Engines

Verification and software platforms need to interoperate













### SDK OS Simulation

**Highest speed** 

Earliest in flow Ignores HW

**Easy replication** 

**Cross-compile** 

### Virtual Platform

Almost @ speed
Pre-RTL
Less accurate
TLM HW Debug
Great SW debug
Easy replication

Less HW detail Slower with detail

### HDL Simulation

KHz Range
Early RTL
Golden Reference
Best HW debug
Limited SW Debug
Easy replication

Mixed-abstractions
Slow SW execution

#### Acceleration Emulation

MHz Range
Early RTL
Min RTL mods
Detailed HW debug
Great SW Debug
Harder to replicate

Datacenter access
Contested Resource

#### FPGA Prototype

10's of MHz
Later RTL
Some RTL mods
Some HW debug
Great SW Debug
OK to replicate

**Harder Bring-up** 

### Prototyping Board

Real time speed
Fully accurate
Actual Silicon
Difficult HW debug
OK SW Debug
Easy to replicate

**HW** changes hard
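To make the speed ranges concrete, the sketch below converts a workload measured in clock cycles into wall-clock time per engine. The slides give only ranges, so the single clock rates chosen here are assumptions, as is the one-billion-cycle workload.

```python
# Assumed representative execution rates in Hz; the slides only give ranges.
ASSUMED_RATE_HZ = {
    "HDL simulation": 1e3,    # "KHz range"
    "Emulation": 1e6,         # "MHz range"
    "FPGA prototype": 10e6,   # "10's of MHz"
}


def wall_clock_seconds(cycles: float, rate_hz: float) -> float:
    """Time needed to execute `cycles` design clock cycles at `rate_hz`."""
    if rate_hz <= 0:
        raise ValueError("rate must be positive")
    return cycles / rate_hz


def compare_engines(cycles: float) -> dict[str, float]:
    """Wall-clock seconds per engine for the same workload."""
    return {name: wall_clock_seconds(cycles, rate)
            for name, rate in ASSUMED_RATE_HZ.items()}


if __name__ == "__main__":
    for engine, seconds in compare_engines(1e9).items():
        print(f"{engine:15s} {seconds / 3600:10.2f} h")
```

With these assumed rates, a billion-cycle software workload takes about 11.6 days in HDL simulation, under 17 minutes in emulation, and under 2 minutes on an FPGA prototype.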





### Simulation







# Simulation Performance



- Incremental build, parallel build, hierarchical build with cloning: up to 10X speed-up
- **Full regression throughput**: relentless focus on performance, continuous core performance enhancements
- **Long test latency**: save/restart with dynamic test reload
- Rocketick multi-core technology: MC-Lite 1.2-1.8X, MC-Rocketick 3-5X



X86

Arm

Cloud





# Multi-Core Performance



### Can use on any long running test

- Uses the single-core build
  - Decision to use mclite deferred to runtime
- Same as single-core scheduling
  - Guaranteed congruent results with SC
- 1.2-1.8X gains



### Requires "high activity" DUT-heavy test e.g. Gate Level ATPG

- Highly scalable parallel solution
  - Needs a special build
- Specialized (Rocketick) engine
  - Non-accelerable parts (testbench) on SC
- 3-5X gains
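Since the testbench stays on a single core, the overall gain depends on how much of the runtime is spent in the accelerable DUT. Amdahl's law is used below as an illustrative model; the slides do not present it, and the fractions in the example are assumptions.

```python
def overall_speedup(accelerable_fraction: float, engine_speedup: float) -> float:
    """Amdahl's law: speedup when only part of the runtime is accelerated.

    accelerable_fraction: share of single-core runtime spent in the DUT
                          (the part a multi-core engine can speed up).
    engine_speedup:       speedup of that part alone.
    """
    if not 0.0 <= accelerable_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if engine_speedup <= 0:
        raise ValueError("speedup must be positive")
    serial = 1.0 - accelerable_fraction
    return 1.0 / (serial + accelerable_fraction / engine_speedup)


if __name__ == "__main__":
    # A DUT-heavy gate-level test versus a testbench-heavy test (assumed mixes).
    print(round(overall_speedup(0.9, 5.0), 2))
    print(round(overall_speedup(0.5, 5.0), 2))
```

This is why a "high activity", DUT-heavy test is needed: if half the runtime is testbench, even a 5X engine yields well under 2X overall.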





# Multi-Core and Simulation Regressions







# Dynamic Test Reload for SystemVerilog



- SystemVerilog/UVM Dynamic Test Load
  - Load new SystemVerilog package into saved snapshot
  - Call testbench functions when the snapshot is reloaded
- Dynamic Base Snapshot: Time zero snapshot or saved snapshot
- Dynamic Test Snapshot: Contains the incremental new SystemVerilog package
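The regression-level benefit of reloading tests into a saved snapshot can be estimated with simple arithmetic. The model below is an assumption for illustration: every test shares the same initialization phase, which is run once and saved, and each test then pays only a reload cost plus its own runtime.

```python
def regression_hours(num_tests: int,
                     init_hours: float,
                     test_hours: float,
                     reload_hours: float = 0.0,
                     use_snapshot: bool = False) -> float:
    """Total simulation hours for a regression.

    Without a snapshot every test repeats the shared initialization.
    With a snapshot the initialization runs once; each test then loads
    its incremental package (`reload_hours`) and runs its own stimulus.
    """
    if use_snapshot:
        return init_hours + num_tests * (reload_hours + test_hours)
    return num_tests * (init_hours + test_hours)


if __name__ == "__main__":
    baseline = regression_hours(100, init_hours=2.0, test_hours=0.5)
    reloaded = regression_hours(100, init_hours=2.0, test_hours=0.5,
                                use_snapshot=True)
    print(baseline, reloaded)
```

For 100 tests with a 2-hour shared initialization and 0.5 hours of unique stimulus each, the model gives 250 hours without and 52 hours with the shared snapshot, ignoring reload cost.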





# Hardware Assisted Development







# **Emulation And Prototyping**

**Key care-abouts**

Emulation "Debug your design": fast RTL verification & debug

- Predictable fast build
- SoC-level capacity
- Fast and complete debug

Prototyping "Debug your Software": early SW development & HW/SW regressions

- Build time and debug less important
- Highest performance: software debug
- Lowest cost
- SoC-level capacity





### Processor Based Emulation and FPGA Based Prototyping



### **Emulation Processor**

- 1MHz
- Fast compile
- Predictable compile ("If it compiles it runs")
- "Full Vision" debug

### Xilinx FPGA

- 5-10MHz
- Slower compile
- Compile may need tuning to close timing/routing violations
- Less flexible debug







# Dynamic Duo: Emulation & Prototyping

### **Emulation**

- Optimized HW/SW debug
- SoC acceleration, HW/SW
- Power & Performance Analysis
- Advanced Use Models



### **Prototyping**

- Automated Bring-Up
- Scalable performance
- SW development
- HW/SW regressions



Congruency and common environment



# **Emulation Capabilities**

- Palladium<sup>®</sup> Z1 enterprise emulation platform
  - Up to 5X greater emulation throughput
- Scalability from IP blocks to full systems on chip
  - Capacity of up to 9.2 billion gates with 2304 users
- Best in class total cost of ownership (TCO)
  - 22 use models
- New era of datacenter-class emulation
  - Proven reliability







# HW-assisted verification productivity loop







# Billion Gate Design Examples

# Habana Labs Billion-gate class inference processor: Goya



Source: Hot Chips conference

# **Fujitsu** Billion-gate class HPC/AI Processor: A64FX



Source: Hot Chips conference





# Emulation Model Scaling: Enabling AI/ML/5G



- Scaling from IP verification to multi-core/die emulation
  - 4 million gates to 7 billion gates
- HW/SW co-verification without performance impact
  - FullVision, waveform streaming, dynamic RTL, physical and virtual JTAG debuggers
  - Supports 3<sup>rd</sup> party 5G testers
- Emulation in the cloud eases variability of workload sizes and use models



- Parallel Partition Compiler enables one billion gate emulation model to be compiled in about 4 hours
- 2<sup>nd</sup> generation PPC-based emulation models are fully automated – no need for manual partitioning
- PPC enables practical scaling of emulating billion-gate class AI/ML and 5G designs





# INT64 to INT8 Computational Efficiency

### **Fujitsu**

Billion-gate class HPC/AI Processor: A64FX



- For certain ML applications, accuracy may be traded off for faster and more power efficient implementation methods
- Depending on the target applications, design may scale down (INT8) or scale across (INT8 to INT64) architecturally to ensure computational efficiency
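The accuracy trade-off can be made tangible with a small quantization sketch. Symmetric linear quantization to INT8 is used here as a common, generic technique; the slides do not state which scheme any particular processor uses, and the function names are hypothetical.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric linear quantization of floats to the range [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.9, -0.33, 0.05]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, f"worst-case error {worst:.4f} (bound {scale / 2:.4f})")
```

Each value is stored in one byte instead of eight, at the cost of a rounding error of at most half a quantization step.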

Source: Hot Chips conference





# Power / Performance Trade-offs

# **Fujitsu** Billion-gate class HPC/AI Processor: A64FX

Power Management (Cont.)

- "Power knob" for power optimization
  - A64FX provides power management function called "Power Knob"
  - Applications can change hardware configurations for power optimization
  - Power knobs and Energy monitor/analyzer will help users to optimize power consumption of their applications

[Figure: A64FX Power Knob Diagram. Labels: decode width: 2; EX pipeline usage: EXA only; frequency reduction]

### **Customer Example**



Source: Hot Chips conference







# Sensor Models – MIPI CSI-2





- Emulation with real CSI-2 sensor provides realistic live image capture and processing
- Allows user to conduct visual inspection (e.g. fast forward and replay)
- Video frames can be captured for detailed analysis (e.g. data packet, line graphs, etc.)





### FPGA-based prototyping accelerates time to revenue

- Early, embedded software & firmware development
- Initial systems and/or proof of concept
- Pre-silicon chip (ASIC) verification





2 months faster time to market!





# But ... your Mom's and Dad's prototyping doesn't cut it any more

- Too many gates
- Too much memory
- Too many peripherals
- Too much software...
- And not enough time







# Protium X1 Enterprise Prototyping System



### Performance

- Enabling early firmware and software development, automated bring-up
- Up to 100MHz for single FPGA; up to 5MHz on billion gate designs

### Capacity

- Advanced blade architecture scales to billions of gates
- Ideal for AI, ML, 5G, mobile, and graphics applications

### Fast Bring-up

- Unified Palladium<sup>®</sup> Z1 / Protium<sup>™</sup> X1 compile ensures DUT congruency
- Enables transition from emulation to prototyping in days

### Multi-user

- Single-FPGA granularity assures high utilization and efficiency
- Ideal for storage, automotive, image, consumer and medical applications







# Scalable Capacity

- Blade architecture: scalability and flexibility
  - Each blade (of up to 150M gates) self-contained
  - Can be used as individual desktop system
  - Up to 8 blades mounted into standard 19" rack (1.2BG per rack)
- Racks connected for multi-billion gate prototyping
  - AI, 5G, graphic and mobile designs
- GUI-based, interactive configuration assistant
  - Create optimal configuration for user needs
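The blade and rack figures above give a quick sizing rule. The sketch assumes every blade can be filled to its full 150M-gate capacity, which real partitioning is unlikely to achieve, so the result is a lower bound; the function name is hypothetical.

```python
GATES_PER_BLADE = 150_000_000   # "up to 150M gates" per blade
BLADES_PER_RACK = 8             # "up to 8 blades" per 19-inch rack


def blade_footprint(design_gates: int) -> tuple[int, int]:
    """Return the minimum (blades, racks) for a design of `design_gates`.

    Assumes 100% utilization of each blade, which is optimistic.
    """
    if design_gates <= 0:
        raise ValueError("design size must be positive")
    blades = -(-design_gates // GATES_PER_BLADE)   # integer ceiling division
    racks = -(-blades // BLADES_PER_RACK)
    return blades, racks


if __name__ == "__main__":
    for gates in (100_000_000, 1_200_000_000, 3_000_000_000):
        print(gates, blade_footprint(gates))
```

A 1.2BG design fills exactly one rack of 8 blades, matching the per-rack figure on the slide; anything larger spills into a second rack.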





# Scalable Performance

- New fully-automatic partitioning, technology mapping algorithms for best possible performance regardless of design size
  - New Pathfinder multi-FPGA partitioner
  - New multi-strategy, high-speed pin-multiplexing
- Manual optimization capabilities for higher performance up to 100MHz+
  - Black-boxing: native, high-speed interfaces
  - Manual partition guidance: sub-systems
  - Inter-FPGA hardware optimization to customize connectivity







# Advanced Debug

- **Software debug**: early firmware and software development
  - Memory (backdoor) upload and download
  - Clock control to stop and resume the hardware at any time
- Standard interfaces to industry-leading debuggers and software environments - use familiar tools
  - Joint Test Action Group (JTAG) and Universal Asynchronous Receiver/Transmitter (UART) interfaces
- Transaction interface
  - Directly connect to software models and virtual environments
- Hardware debug: bring-up design quickly, validate functionality
  - Force and release for internal nodes
  - Prototyping full visibility
- Data capture card (DCC): 1000's of signals, millions of cycles
- Assertion checkers: automation of remote test execution









# Advanced Prototyping Debug





- Force/Release: predefined signals to "0" or "1" during runtime
- Real-time monitoring of predefined (at compile time) signals
- Data Capture: thousands of signals for millions of cycles
- Prototyping Full Visibility: probe without recompile
- Assertion Checkers: non-intrusive hardware monitors



- Backdoor memory access: change boot code, software, etc.
- Clock control: start/stop the clock on demand
- Fully scriptable runtime environment

**JTAG** 

- Remote access: network resource anytime from anywhere
- High-performance link to software model



Probes



**Daughter Cards & Peripherals** 



# DeepChip – "Best of 2019"

### The users have spoken!

- The Protium users also gushed a lot about the fact that they could go back to Palladium for fast debug and waves if needed.
  - "We run our design on Protium then go back to Palladium for debug. With Palladium, we can capture waves up and down the hierarchy of every net in our chip. It's a really big advantage of Protium."
  - "With Protium, we get the speed of an FPGA-based system, with the fast ramp-up, debug, and signal traces of Palladium. It only takes seconds for us to see all the waveforms."
- Protium took 1.2 to 2.1 days to recompile vs. Synopsys HAPS taking 3.9 to 6.1 days to recompile.





And that's why Protium (actually the crazy fast incremental Protium compiles with FPGA 8.3 MHz simulation speeds) wins the #1 Best of EDA in 2019 award from the end users this year.







# Test Solutions for Emulation / Prototyping

- Next Gen 5G testers: serial high speed link
  - Due to higher 5G bandwidth requirements
  - Better physical constraints, standards based
  - 25G/40G Ethernet through QSFP/SFP+ fiber



- Off the shelf solution needed: rate adapt, convert to IQ data
  - Testers no longer slow down parallel IQ data
  - Need to packetize/de-packetize Ethernet traffic
  - Need triggering method to rate adapt traffic flow
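The packetize/de-packetize step can be sketched in a few lines. The payload layout below (a 32-bit sequence number, a 16-bit sample count, then signed 16-bit I and Q values in network byte order) is entirely hypothetical and does not describe any tester's or adapter's actual format; rate adaptation and triggering are not modeled.

```python
import struct

# Hypothetical header: 32-bit sequence number + 16-bit sample count.
HEADER = struct.Struct("!IH")
# Hypothetical sample encoding: signed 16-bit I followed by signed 16-bit Q.
SAMPLE = struct.Struct("!hh")


def packetize_iq(samples: list[tuple[int, int]], seq: int) -> bytes:
    """Pack (I, Q) pairs into one payload suitable for an Ethernet frame."""
    body = b"".join(SAMPLE.pack(i, q) for i, q in samples)
    return HEADER.pack(seq, len(samples)) + body


def depacketize_iq(payload: bytes) -> tuple[int, list[tuple[int, int]]]:
    """Recover the sequence number and the (I, Q) pairs from a payload."""
    seq, count = HEADER.unpack_from(payload, 0)
    samples = [SAMPLE.unpack_from(payload, HEADER.size + k * SAMPLE.size)
               for k in range(count)]
    return seq, samples
```

A round trip through `packetize_iq` and `depacketize_iq` returns the original samples, which is the property a rate adapter must preserve while it buffers traffic for a slower DUT.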






# HD 5G Rate Adapter

### Capabilities

- R&S and Keysight
- Palladium Z1, Protium S1/X1
- Slow 5G IQ data to 5G Handset DUT




### Features

- Direct fiber connection (Z1)
- HDIO adapter (Protium S1/X1)
- 1U 19" Rack mountable
- Tester connection through fiber to QSFP/SFP+
- One tester per SpeedBridge
- SpeedBridge View with debug GUI 5G Tester





# Virtual, Emulation & Prototyping





# Virtual Prototyping and Hybrid Environment

Reference and Starter Platforms OS Configurations (Linux, Android)

Platform Assembly

ARM Fast Models Imperas for MIPS & RISC V

- Tensilica
- ARM Processor & Primecells
- MIPS, RISC-V
- UFS, PCIe ... TLM Libraries

Routers, Loaders, UARTs, Terminal...

Model generation for Register I/F CPU Integration Automation Open SystemC TLM 2.0 TLM Libraries

**Base Libraries** 

Modeling Automation Platform Assembly

Create, visualize, integrate, extend

Run Time Environment

Heterogeneous, multiprocess, multi-system

IEEE 1666 SystemC Engine

based on Xcelium

Hybrid Connections to RTL Engines
Xcelium, Palladium Z1, Protium X1

External IF

Virtual Device Models Virtual

3<sup>rd</sup> Party Debuggers IF

Native Debug SystemC, ESW, RTL CSI, DSI, WiFi, ... Virtual 5G Tester Interfaces

Lauterbach TRACE32, Arm DS-5, Green Hills MULTI

Native Integrated Debug of SystemC, Embedded SW Synchronized Debug across multiple engines

Smart Memory, Transactors, TLM-RTL Bridges - AXI, AHB, etc.

Virtual and Hybrid platform and SW Bring-Up Expertise and Services





# Use Models









# Architecture Analysis

Accuracy, Accuracy

Best achieved "RTL-up", auto generation for interconnect

AT modeling requires careful effort-accuracy-performance considerations

# Software Development

Speed, Speed, Speed

Loosely timed, "just enough" accuracy is sufficient

# Mixed Fidelity Hybrids

Speed & Fidelity

Details in hardware

Keep processor sub-system and other peripherals virtual

Smart synchronization between TLM and RTL

# Hardware Verification

SystemC to drive hardware DUT





# **Smart Verification Management**







# The Problem







**Formal** 

Simulation

**Emulation** 

**Prototyping**



# Verification Management: Data-Driven

# Use-case-based

- Define legal operations
- Workload matters: must represent real operation

# Data Collection

- Non-intrusive data collection
- Use the right execution platform

# Analysis

- Correlate, filter, learn, predict
- Anomaly Detection

# Goal-based

- Verification throughput
- Smarter bug hunting





## **Use-case Based Test Generation**

### Accellera Portable Stimulus Standard

- Describe Test Intent and Design Behaviors
  - Use-cases
  - Legal scenario space
- Deliver Test Portability
  - Vertical reuse: From IP to SoC
  - Horizontal reuse: from Simulation to Emulation to Post Silicon
- Across Users
  - Abstract modeling
  - Actions, inputs, outputs, resources
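The notion of a "legal scenario space" can be illustrated outside the PSS language itself. The Python sketch below is an analogy, not PSS syntax: each action declares the data objects it needs and produces, and a scenario is legal when every action's inputs exist before it runs. Resource claims are not modeled, and the example actions are hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Action:
    name: str
    needs: frozenset   # data objects that must already exist
    makes: frozenset   # data objects this action produces


def legal_scenarios(actions: list[Action]) -> list[tuple[str, ...]]:
    """Enumerate orderings in which every action's inputs are produced first."""
    legal = []
    for order in permutations(actions):
        available: set = set()
        for act in order:
            if not act.needs <= available:
                break                      # input missing: illegal ordering
            available |= act.makes
        else:
            legal.append(tuple(a.name for a in order))
    return legal


if __name__ == "__main__":
    flow = [
        Action("capture", frozenset(), frozenset({"frame"})),
        Action("encode", frozenset({"frame"}), frozenset({"stream"})),
        Action("display", frozenset({"frame"}), frozenset()),
    ]
    for scenario in legal_scenarios(flow):
        print(" -> ".join(scenario))
```

Of the six possible orderings of these three actions, only the two that start with `capture` are legal; a generator can then randomize freely within that space.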







# **PSS Impact on Stimulus**







# Renesas Performance Verification with Perspec Generated Use-cases



- Leading industrial and automotive MCUs
  - Number of integrated IPs is increasing
  - Switched interconnect
  - Configuration has big impact on performance
- Interconnect Workbench performance analysis
  - Early performance characterization
  - Interconnect tuning to optimize performance
  - Use case performance validation
- Palladium Z1 with Perspec use cases
  - Bring-up the entire design and software
  - Perspec generating use case tests
  - Reduce from 50 hour simulations to 12 minutes




### Perspec: Automated SoC Test Generation

- 10x productivity gain creating regression suites
- Generates scenario-based randomized tests
- Auto-generates C-based tests
- Portable across all engines



Portable Stimulus Specification (PSS)

Supports Accellera PSS 1.0






# Distributed Data: Centralized Management





## vManager Analytics

- Regression Productivity
  - Manage massive parallel execution
  - Analyze: rank by failures
  - Automate rerun of failures to collect debug data
- Requirements Traceability
  - Requirement import to seed verification plan creation
  - Link requirements and verification plan
  - Visibility of changes to requirements
- Coverage Analysis
  - Multi-engine coverage merge/combine
  - Refinement (Unreachability (UNR), UNR crosses)
  - Analysis of coverage vs. plan
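Multi-engine merge with unreachability refinement reduces to set arithmetic. The sketch below is a conceptual illustration, not the vManager API or its database format; the bin names and the function are hypothetical.

```python
def merge_coverage(per_engine: dict[str, set[str]],
                   all_bins: set[str],
                   unreachable: frozenset = frozenset()) -> tuple[set[str], float]:
    """Merge coverage from several engines and grade it against reachable bins.

    per_engine:  bins hit by each engine (simulation, emulation, formal, ...).
    all_bins:    every bin defined in the verification plan.
    unreachable: bins proven unreachable (e.g. by formal UNR analysis);
                 they are removed from the denominator.
    """
    reachable = all_bins - unreachable
    covered = set().union(*per_engine.values()) & reachable
    percent = 100.0 * len(covered) / len(reachable) if reachable else 100.0
    return covered, percent


if __name__ == "__main__":
    hits = {"sim": {"a", "b"}, "emu": {"b", "c"}, "formal": {"d"}}
    covered, pct = merge_coverage(hits, {"a", "b", "c", "d", "e", "f"},
                                  frozenset({"f"}))
    print(sorted(covered), pct)
```

No single engine in the example exceeds 40% on its own, while the merged, UNR-refined result reaches 80%.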











### vManager: Multi-engine Coverage





- Single click merge across regressions
- Multi-engine coverage
- User-defined grade calculation






## VIP Catalog – Protocol Support





#### Cadence VIP Architecture

- Fast VIP
  - All VIPs implemented in C for the most optimized performance!
  - Uses dynamic memory allocation for all internal memories/registers/data structures
- Portable and Scalable architecture
  - Seamless IP => SoC and Project => Project transition
- Support all languages, simulators, methodologies
  - Native SV, Verilog, OVM, UVM, VMM, C, SystemC, etc.
- Consistent user experience across multiple protocols
  - Common UVM library for all VIPs and Memory models







## AI/ML INSIDE/OUTSIDE FOR VERIFICATION





## Glossary

- Machine learning
  - The study of algorithms and statistical models that computer systems use to perform a specific task effectively without using explicit instructions
- Machine learning model
  - An approximative model for a function, automatically generated from training data, that allows inference over new data
- Reinforcement learning
  - A machine learning task that allows computer systems to automatically and dynamically determine the ideal behavior within a specific context to maximize its performance, based on observation, reward and action
- Supervised learning
  - A machine learning task in which the algorithm builds a model of an unknown function from a subset of its inputs and the desired outputs
  - In this type of learning, one needs to supply labels to the output data





#### ML-enabled Formal Verification









Third-Generation JasperGold® Formal Verification Platform



"We measured 2x faster proofs out-of-the-box, 5x faster regressions and non-converged properties reduced by 50%"

-Mirella Negro Marcigaglia, digital design verification manager, STMicroelectronics





#### Smart Proof automation framework



Component & Data Management

Proof Profiling Data

Regular read/write

**Proof Caching** 

- Cache storage in single file
- Automatic cleanup of old cache data

Multi Advisor Proof Orchestration

- Forced on
- No overwrite on engine mode

Engine Algorithm Selection

- Automatic training and inference

Learning

**Machine Learning** 

Optimizes subsequent runs/regressions

Optimizes out-of-the-box proofs








Better convergence



Faster proofs



## Smart Proof technology: ML-enabled optimizations

ML for algorithm selection and multi-advisor orchestration

[Figure: Smart Proof Technology. Labels: training data, custom solver]

#### **Algorithm Selection**

- Supervised learning: ML uses training data from 500+ customer designs
- Supervised inference: best-fit core engine selected and configured to create custom solver
- Up to 4X (2X avg.) better out-of-the-box proofs

#### **Multi-Advisor Orchestration**

Multiple proof advisors use reinforcement learning to improve proof efficiency

- Uses training data for **better out-of-the-box proofs**
- Adjusts training data using proof profiling for **up to 6X (5X avg.) better subsequent proofs and regressions**





#### What is multi advisor proof orchestration?

- Algorithm that dynamically adjusts engine selection, time limits, license usage, etc, during a proof, enabled by reinforcement learning
  - ➤ Hides complexity from the user: hand technology control over to the tool
  - Focuses on resources (max licenses, max jobs, global time limit), which are still respected in orchestration
  - Uses time slicing to switch engines during run, allowing engine mode diversity
  - Opens the door to future automatic optimizations (partitioning, bug hunting, etc)
  - ➤ No need to deploy every new strategy
- Continuously expanded to include new technology



+ Custom engines dynamically created during proof, to further explore non-default engine settings
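Time-sliced engine selection driven by observed reward can be sketched with a very simple reinforcement-learning scheme. An epsilon-greedy bandit is used below purely as a stand-in; the slides do not say which algorithm the tool uses. Engine names, the reward definition and the function are hypothetical.

```python
import random
from typing import Callable


def orchestrate(engines: dict[str, Callable[[], float]],
                slices: int,
                epsilon: float = 0.1,
                seed: int = 0) -> tuple[list[str], dict[str, int]]:
    """Pick one engine per time slice, favouring the best average reward.

    Each engine callable runs for one slice and returns a reward, e.g. the
    number of properties it determined during that slice.
    """
    rng = random.Random(seed)
    totals = {name: 0.0 for name in engines}
    counts = {name: 0 for name in engines}
    schedule = []
    for _ in range(slices):
        untried = [n for n in engines if counts[n] == 0]
        if untried:
            choice = untried[0]                      # try every engine once
        elif rng.random() < epsilon:
            choice = rng.choice(list(engines))       # explore
        else:                                        # exploit best average
            choice = max(engines, key=lambda n: totals[n] / counts[n])
        totals[choice] += engines[choice]()
        counts[choice] += 1
        schedule.append(choice)
    return schedule, counts
```

With exploration switched off (`epsilon=0.0`), the scheduler samples each engine once and then spends every remaining slice on the engine with the best observed reward; a non-zero `epsilon` keeps engine-mode diversity alive.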





#### Multi advisor proof orchestration overview

- A group of advisors with different weights provides recommendations for different strategies to be run during a proof







#### How to enable multi advisor orchestration?

- Orchestration is on by default as long as the user does not change the engine mode for a proof, and it can be forced on with configuration commands

```
[<embedded>] % prove -task <embedded>
INFO (IPF036): Starting proof on task: "<embedded>"
INFO (IPF031): Settings used for proof thread 1:
    orchestration                 = on (auto)
    time limit                    = 86400s
    per property time limit       = 1s * 10 ^ scan
    engine mode                   = auto
    proofgrid per engine max jobs = 1
    max engine jobs               = auto
    proofgrid mode                = local
    proofgrid restarts            = 10
```

```
[<embedded>] % prove -task <embedded> -engine mode {B R D}
INFO (IPF036): Starting proof on task: "<embedded>", 53 properties
INFO (IPF031): Settings used for proof thread 3:
    orchestration                 = off (auto)
    time limit                    = 86400s
    per property time limit       = 1s * 10 ^ scan
    engine mode                   = B R D
    proofgrid per engine max jobs = 1
    max engine jobs               = B R D, total 3
    proofgrid mode                = local
    proofgrid restarts            = 10
```
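Both logs report a per-property time limit of `1s * 10 ^ scan` under a global limit of 86400s. One plausible reading, which is an assumption here and not documented tool behaviour, is that `scan` is a pass counter starting at 0, so each pass over the properties allows ten times more time per property. The helper `per_property_limits` below is hypothetical and only works out the arithmetic under that reading.

```python
def per_property_limits(global_limit_s, base_s=1, factor=10):
    """List the per-property limits base * factor**scan that fit in the global limit."""
    limits = []
    scan = 0
    while base_s * factor ** scan <= global_limit_s:
        limits.append(base_s * factor ** scan)
        scan += 1
    return limits

print(per_property_limits(86400))  # [1, 10, 100, 1000, 10000]
```

Under this reading, easy properties are settled in the early, cheap passes, and only the properties that remain undetermined get the longer limits.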





#### Algorithm selection concept: example engine B

#### Challenge

- JasperGold has different technologies which are used internally by engine B
- The challenge is to choose the best engine B parameter configuration to run on each given property

#### Approach

- Use <u>supervised learning</u> to build a machine learning model, trained on our internal benchmarks across multiple customer designs, that tries to infer on the fly a better-than-default engine B configuration for each given property
- Wrap the solution into a new engine called B4, which can be added to the engine mode with the set\_engine\_mode command like any other engine





#### Algorithm selection for engine B overview

#### Training @Cadence

- For each property P in the <u>property set</u>: run engine B with a randomly chosen configuration, and extract features
- Measure performance of the selected algorithm
- Store performance measures and features in database
- Use supervised learning to build ML model, mapping features and algorithms to performance measures
- Internal benchmark used to measure performance
- Output: ML models

#### Inference @Customer site

- Extract property P's features
- Consult ML models that map features and algorithms to performance measures
- Infer best engine B configuration for property P
- Run engine B with chosen configuration on property P
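The training and inference flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the model used in the tool: it swaps in a nearest-neighbour regressor as the supervised learner, and the feature vectors, configuration names (`cfg_a`, `cfg_b`), and runtimes are made-up toy data.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(database):
    """Group (features, config, runtime) records by configuration.

    For a nearest-neighbour model, training amounts to storing the samples.
    """
    model = {}
    for features, config, runtime in database:
        model.setdefault(config, []).append((features, runtime))
    return model

def predict_runtime(samples, features, k=1):
    """Average runtime of the k stored samples closest to `features`."""
    nearest = sorted(samples, key=lambda s: distance(s[0], features))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)

def choose_config(model, features, k=1):
    """Inference: return the configuration with the lowest predicted runtime."""
    return min(sorted(model), key=lambda c: predict_runtime(model[c], features, k))

# Toy database: (property features, configuration tried, measured runtime in s).
database = [
    ((10, 2), "cfg_a", 1.0),
    ((10, 2), "cfg_b", 5.0),
    ((1000, 50), "cfg_a", 90.0),
    ((1000, 50), "cfg_b", 12.0),
]
model = train(database)
print(choose_config(model, (12, 3)))    # cfg_a
print(choose_config(model, (900, 40)))  # cfg_b
```

Training and inference are deliberately separate functions: only the stored model needs to ship, which mirrors the split between training on internal benchmarks and inference at the customer site.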





## Algorithm selection for engine B results

Inference evaluation against engine B showed an overall performance boost when using engine B4



Dots below the diagonal line correspond to properties which were proven faster with engine B4, when compared to engine B





## Smart Proof case study

#### A customer scenario

- Three versions of a customer design (processor core)
- Goal: compare performance of proof using Smart Proof vs. proof without it over time
- Resources
  - Dedicated machine
  - 20 jobs, 20 licenses, 24h









## Design drop 1

2340 properties



| Run             | # Determined |
|-----------------|--------------|
| Smart Proof OFF | 1516         |
| Smart Proof ON  | 1543         |





## Design drop 2: no changes to design





| Run             | # Determined |
|-----------------|--------------|
| Smart Proof OFF | 1506         |
| Smart Proof ON  | 1590         |



Speedup: run with Smart Proof ON is 32x faster than proof without it



All properties determined in drop 1 using Smart Proof were <u>reproduced</u> in drop 2 (the same is not true for the runs with Smart Proof off in this case study)





## Design drop 3: design changed





| Run             | # Determined |
|-----------------|--------------|
| Smart Proof OFF | 1510         |
| Smart Proof ON  | 1586         |



Speedup: run with Smart Proof ON is 5x faster than proof without it





## Case study take away



- By learning from previous runs, Smart Proof frees up resources, which are better utilized to solve hard properties
- When the design is stable, Smart Proof is a great tool to help reproduce previous proof results



#### **SUMMARY - OUTLOOK**





## Enabling Verification of AI/ML Designs









#### **Formal**

- High Capacity
- Regression improvements
- SAT Solver Inferencing

#### Simulation

- Fast simulation for high-activity designs
- UVM randomization
- Fast elaboration for replicated structures
- Coverage metrics

#### **Emulation**

- Billion-gate designs
- Parallel Partition Compiler
- INT8 to INT64
- Power / Performance
- Memory models: HBMx
- Sensor model: MIPI CSI

#### **Prototyping**

- Scaling to large designs
- Unified frontend
- ICE test suite: Faster, data-extended (DL) regression
- SW-driven validation deep learning data training refinement





## Verification Throughput!

Find and fix the most bugs per \$ invested in bare metal compute







### Some Examples

- Datacenter / Edge limited by compute power
- Accelerators
  - Custom to the ML application. No one size fits all.
  - HW/SW Co-validation at architectural level performance using up-to-date user models
- Training/Inference in datacenter:
  - Massive # of systems: every milliwatt counts
  - HW/SW Co-validation with power analysis, user models
- Inference at Edge:
  - Very limited power budget
  - Validating power spec required before fabrication











**Palladium Z1** "instrumental for Gaudi"

Source: Cadence Earnings Call Q1'19

#### Palladium Z1

Source: Cadence Earnings Call Q1'19

#### **HW Portfolio**

Source: Cadence Earnings Call Q2'19

#### **Full Verification Suite**

Source: Press Release

**HW Portfolio – Capacity, Debug** 

Source: Video @ DVCON Keynote

