

MUNICH, GERMANY DECEMBER 6 - 7, 2022

#### Verification of Inferencing Algorithm Accelerators

Russell Klein, Petri Solanti

Siemens EDA

accellera



## Agenda

- AI Accelerators
- High Level Synthesis
- Bespoke Accelerator Optimization
  - Neural Network Architecture
  - Quantization
  - Data Movement
- Verification
  - From Python to RTL





# AI Accelerators








#### Inference Execution (on-chip)







## AI Accelerator Verification Challenges

- CPU, GPU, NPU, TPU
  - Verify algorithm implementation runs on IP
  - Verify that IP is correctly integrated
  - IP is assumed to be correct from the IP provider
- Bespoke accelerator
  - Verify the algorithm runs on the accelerator
  - Verify the accelerator is correctly integrated
  - Verify the accelerator functions correctly













#### HLS AI Design Flow




# High-Level Synthesis



#### What is High-Level Synthesis?



## High-Level Synthesis Features

- User architectural control
  - Parallelism, Throughput, Area, Latency (loop unrolling & pipelining)
  - Memories vs Registers (resource allocation)
- Exploration and implementation by applying constraints
  - Not by changing the source code
- Automatic arithmetic optimizations and bit-width trimming
  - Bit-accurate types enable mathematical accuracy to propagate to outputs
- Multi-objective process-aware scheduling for both FPGA and ASIC
  - Area/Latency/blend driven datapath scheduling
  - Eliminates RTL technology penalty of IP reuse



Figure: C/C++ source (a loop that conditionally accumulates `z += a[i]*b[i]`) is synthesized to RTL.





## High-Level Synthesis Benefits

- Faster design
  - Typically, the RTL design phase is 2X faster for novice users and 10X faster for experienced users
  - Project start to tape-out can be 4X faster
- Faster verification
  - Algorithm is verified at the abstract level
  - Formal and dynamic verification can be used to prove equivalence between C++ and HDL
- Easy technology retargeting, retiming
  - RTL can be mapped to new technology library or clock frequency by re-synthesizing
  - Simple transition between FPGA and ASIC implementation



## How does High-Level Synthesis Work?

- HLS automatically meets timing based on the user-specified clock constraints.
- HLS understands the timing and area of the target technology and uses this to insert registers when needed.
  - Using the right HLS target library is very important!
- HLS closes on timing using:
  - Data flow graph analysis
  - Resource allocation
  - Scheduling
  - Resource sharing and timing analysis






## High-Level Synthesis: Bit-Accuracy

- Different datatypes used in HLS tools:
  - Algorithmic C (AC) data types
    - Faster in simulation
  - SystemC data types
    - Slower than AC data types in simulation
  - Arbitrary Precision (AP) data types
- Needed to model true hardware behaviour
  - Bit-accuracy simulated in source
  - Provides a path for automated bit-for-bit comparison of C++ and RTL
- Fixed point types include optional rounding and saturation modes.



## Bit-Accurate AC Data Types

- Allows designers to model a signed or unsigned bit vector representing
  - Arbitrary length integer: ac\_int
  - Arbitrary length fixed-point: ac\_fixed
  - Arbitrary length floating-point: ac\_float, ac\_std\_float
- Saturation and overflow behavior like in RTL
- Decimal or integer numerical values can be used directly, with no need for scaling or conversion

The Algorithmic C fixed point data types are declared as:

ac\_fixed<W,I,S> x;










## AC Types Example – Integer and Fixed point

- 10 bits total, 1 integer bit, signed
  - Range: -1.0 to 0.998
- Accumulator with 3 bits of headroom
  - No rounding or saturation
- Simple unsigned 3-bit "int" representation
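The range in the first bullet follows directly from the type parameters. The Python sketch below models the value range and resolution of a fixed-point format with W total bits and I integer bits, assuming two's complement with the sign bit counted among the integer bits. The helper name `fixed_range` is illustrative and is not part of the AC library.

```python
from fractions import Fraction

def fixed_range(width, int_bits, signed):
    """Return (min, max, lsb) of a fixed-point format as exact fractions.

    Assumption: two's complement, sign bit counted among the integer
    bits, mirroring the W, I, S parameters of ac_fixed<W,I,S>.
    """
    lsb = Fraction(2) ** (int_bits - width)  # weight of the least significant bit
    if signed:
        lo = -(Fraction(2) ** (int_bits - 1))
        hi = Fraction(2) ** (int_bits - 1) - lsb
    else:
        lo = Fraction(0)
        hi = Fraction(2) ** int_bits - lsb
    return lo, hi, lsb

# 10 bits total, 1 integer bit, signed
lo, hi, lsb = fixed_range(10, 1, True)
print(float(lo), float(hi), float(lsb))  # -1.0 0.998046875 0.001953125

# 3-bit unsigned integer: W = I = 3
print(fixed_range(3, 3, False))
```

Each extra integer bit doubles the representable range, which is why an accumulator is usually declared with a few more integer bits than its operands.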





#### Data Flow Graph Analysis

- HLS analyzes the data dependencies between the various steps in the algorithm
  - Analysis leads to Data Flow Graph (DFG) description
  - Each node of the DFG represents an operation defined in the C++ code
    - For this example, all operations use the "add" operator
  - Connections between nodes represent data dependencies and indicate the order of operations



#### **Resource Allocation**

- During DFG analysis each operation is mapped onto a hardware resource which is then used during scheduling.
- Resources correspond to a physical implementation of the operator hardware
  - Implementation is annotated with both timing and area information which is used during scheduling
  - Operations may have multiple hardware resource implementations that each have different area/delay/latency trade-offs
- Resources are selected from a technology specific library








## Scheduling

- HLS adds "time" to the design during the process known as "scheduling"
- Scheduling takes the operations described in the DFG and decides when (in which clock cycle) they are performed
  - Has the effect of adding registers when needed to meet timing
  - Similar to what RTL designers would call pipelining, by which they mean inserting registers to reduce combinational delays
- Scheduling automatically shares resources
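To make the idea of "adding time" concrete, here is a minimal Python sketch of a scheduler that chains operations within a clock cycle until the accumulated combinational delay would exceed the clock period, then starts a new cycle (which corresponds to inserting a register). This is a deliberately simplified model, not the algorithm of any particular HLS tool. The operation names, the 3 ns adder delay, and the clock periods are all hypothetical, and each single operation is assumed to fit within one clock period.

```python
def schedule(ops, deps, delay_ns, clock_ns):
    """Assign each operation to a clock cycle.

    ops      : operation names in topological order
    deps     : dict mapping an operation to the operations it depends on
    delay_ns : dict mapping an operation to its combinational delay
    clock_ns : clock period constraint
    """
    cycle = {}   # clock cycle assigned to each operation
    finish = {}  # time within that cycle at which the result is valid
    for op in ops:
        preds = deps.get(op, [])
        # Earliest cycle in which all inputs are available.
        c = max((cycle[p] for p in preds), default=0)
        # Inputs produced in the same cycle arrive late; registered inputs arrive at 0.
        start = max((finish[p] for p in preds if cycle[p] == c), default=0.0)
        if start + delay_ns[op] > clock_ns:
            c += 1       # does not fit: register the inputs, move to the next cycle
            start = 0.0
        cycle[op] = c
        finish[op] = start + delay_ns[op]
    return cycle

# Data flow graph for a chain of three additions: ((a + b) + c) + d
ops = ["t1", "t2", "t3"]
deps = {"t2": ["t1"], "t3": ["t2"]}
delay = {op: 3.0 for op in ops}  # hypothetical adder delay

fast = schedule(ops, deps, delay, clock_ns=5.0)
slow = schedule(ops, deps, delay, clock_ns=10.0)
print(fast)  # {'t1': 0, 't2': 1, 't3': 2}
print(slow)  # {'t1': 0, 't2': 0, 't3': 0}
```

With the tighter clock each addition lands in its own cycle; with the relaxed clock all three chain combinationally in a single cycle. The same source produces different hardware purely from the constraint.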







#### **Resource and Register Sharing**

## HLS Optimizations for Area and Performance

- Loop optimizations
  - Provide a way to explore many possible micro-architectures
- Loop Unrolling
  - Represents space/parallelism
- Pipelining
  - Represents time/throughput
- Automatic merging
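The trade-off between unrolling and pipelining can be estimated with simple cycle arithmetic. The sketch below uses the common first-order model in which a pipelined loop takes the body latency plus one initiation interval per additional iteration. It assumes that iterations are independent and that unrolling does not lengthen the loop body, which real designs may violate; the function name and the numbers are illustrative only.

```python
import math

def loop_cycles(iterations, body_latency, unroll=1, initiation_interval=None):
    """Estimate the cycle count of a loop.

    unroll              : number of iterations executed in parallel (space)
    initiation_interval : cycles between successive iteration starts when
                          pipelined (time); None means not pipelined
    """
    trips = math.ceil(iterations / unroll)
    if initiation_interval is None:
        return trips * body_latency
    return body_latency + (trips - 1) * initiation_interval

n, latency = 1024, 4
print(loop_cycles(n, latency))                                    # rolled: 4096
print(loop_cycles(n, latency, unroll=4))                          # unrolled by 4: 1024
print(loop_cycles(n, latency, initiation_interval=1))             # pipelined: 1027
print(loop_cycles(n, latency, unroll=4, initiation_interval=1))   # both: 259
```

Unrolling buys speed with replicated hardware, while pipelining buys throughput by overlapping iterations on the same hardware.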







#### Interface Synthesis

- Adding an interface protocol to an untimed C++ design is known as "Interface Synthesis"
- C++ source code does not specify the protocol
- Interface synthesis allows the protocol to be defined using the HLS tool



## Memory Interface Synthesis

- Arrays are automatically mapped to ASIC or FPGA memories or registers
- User control over memory mapping
- Arrays on the design interface can be synthesized as memory interfaces
- Memory interface protocol
  - Address, data, control

Figure: an array in the C++ function (`int mem[1024]`) is mapped to an instantiated memory wrapper alongside the synthesized function RTL.



## Designing Concurrent Clocked Hierarchies

- Multi-block Design
  - Top-down or bottom-up synthesis
  - User specifies design blocks
- Design blocks/processes run in parallel
  - High throughput







# Accelerator Optimization



## Accelerator Optimization

- Neural Network Architecture
  - Modifying layers and channels
- Quantization
  - Changing the representation of numbers
- Data Movement, Storage
  - Alter data caching and access patterns





#### HLS AI Design Flow


### Neural Network Architecture

- Most Neural Networks are architected for accuracy on servers
- Reducing the number of layers and channels in each layer
  - Small impact on accuracy (<1%)
  - Large impact on performance and efficiency (>90%)






#### Impact of Channel Count on Accuracy

Accuracy vs. Channels



Based on MNIST LeNet. The dense layer has 500 channels.



#### Reducing Network Size Example

#### **Original MNIST network**

| Metric                | Value           |
|-----------------------|-----------------|
| MAC operations        | 12,353,000      |
| Number of parameters  | 4,915,080       |
| Minimum data transfer | 4,941,854 words |

Accuracy: 98.75%

#### **Optimized MNIST network**

| Metric                | Value         |
|-----------------------|---------------|
| MAC operations        | 537,410       |
| Number of parameters  | 145,977       |
| Minimum data transfer | 150,728 words |

Accuracy: 98.46%
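The size of the reduction is easier to judge as ratios. The short Python calculation below uses only the figures quoted for the two networks; the dictionary keys are illustrative names.

```python
original = {"macs": 12_353_000, "params": 4_915_080, "words": 4_941_854, "accuracy": 98.75}
optimized = {"macs": 537_410, "params": 145_977, "words": 150_728, "accuracy": 98.46}

def reduction(before, after):
    """Fraction of the original cost that was removed."""
    return 1.0 - after / before

reductions = {
    key: reduction(original[key], optimized[key])
    for key in ("macs", "params", "words")
}
accuracy_drop = original["accuracy"] - optimized["accuracy"]

for key, value in reductions.items():
    print(f"{key}: {value:.1%} fewer, {original[key] / optimized[key]:.1f}x smaller")
print(f"accuracy drop: {accuracy_drop:.2f} percentage points")
```

All three cost metrics fall by more than 95%, while accuracy drops by 0.29 percentage points.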



## Quantization: Data Sizes and Operators

- Fixed-point multipliers are about 1/2 the area of a floating-point multiplier
- Multiplier area is proportional to the square of the input bit width
- A 64-bit floating-point multiplier is about 64 times larger than an 8-bit fixed-point multiplier
- Data storage and movement scale linearly with size
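The scaling rules above can be written as two one-line functions. This is a first-order rule of thumb only: it ignores the fixed versus floating-point factor and every implementation detail, and the function names are illustrative.

```python
def relative_multiplier_area(bits, reference_bits):
    """Area ratio under the assumption that area grows with the square of the width."""
    return (bits / reference_bits) ** 2

def relative_storage(bits, reference_bits):
    """Storage and data movement ratio, which grows linearly with the width."""
    return bits / reference_bits

print(relative_multiplier_area(64, 8))  # 64.0
print(relative_storage(64, 8))          # 8.0
```

Halving the operand width therefore cuts multiplier area to roughly a quarter but cuts storage and bandwidth only in half.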



#### Fixed Point Representation





## Quantizing Neural Networks

- Convert weights and features from floating point to fixed point
- Eliminate unused high-order bits
  - Removes constant 0 values from design
  - Many neural network values are normalized to near 0
    - May only need 4 or 5 integer bits
- Reduce fractional precision and measure impact on accuracy
  - Iterative process
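The iterative process can be sketched in Python as a sweep over the number of fractional bits. The `quantize` helper below rounds to the nearest representable value and saturates at the range limits, with the sign bit counted among the integer bits. The sample weights are hypothetical, and in a real flow the quantity measured at each step would be inference accuracy over a test set rather than this per-value error.

```python
def quantize(x, int_bits, frac_bits):
    """Round x to a signed fixed-point value and saturate on overflow.

    int_bits includes the sign bit. Python's round() resolves exact ties
    to the even neighbour.
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits - 1))      # most negative integer code
    hi = (1 << (int_bits + frac_bits - 1)) - 1   # most positive integer code
    code = max(lo, min(hi, round(x * scale)))
    return code / scale

def max_error(values, int_bits, frac_bits):
    """Largest absolute quantization error over a set of values."""
    return max(abs(v - quantize(v, int_bits, frac_bits)) for v in values)

weights = [0.731, -0.052, 1.9, -3.25, 0.004]  # hypothetical sample values

for frac_bits in range(8, 0, -1):
    print(frac_bits, max_error(weights, 4, frac_bits))
```

For in-range values the error never exceeds half of the least significant bit, so each fractional bit removed roughly doubles the worst-case error.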






#### Bitwidth vs Accuracy





Wake word Algorithm



## Accuracy vs. Bit Width, Post-training Quantization

| Fractional bits ↓ / Integer bits → | 8     | 7     | 6     | 5     | 4     | 3     | 2     | 1     | 0     |
|------------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 8                                  | 98.05 | 98.05 | 98.05 | 97.55 | 76.75 | 28.70 | 18.00 | 16.80 | 14.90 |
| 7                                  | 97.85 | 97.85 | 97.85 | 97.25 | 75.39 | 27.90 | 17.50 | 16.60 | 15.40 |
| 6                                  | 97.13 | 97.95 | 97.91 | 97.45 | 75.15 | 28.30 | 17.30 | 15.90 | 13.90 |
| 5                                  | 97.21 | 98.08 | 98.10 | 97.40 | 72.57 | 24.50 | 16.90 | 15.20 | 14.90 |
| 4                                  | 96.94 | 97.79 | 97.76 | 95.71 | 59.90 | 21.40 | 16.20 | 13.10 | 15.10 |
| 3                                  | 95.56 | 96.37 | 96.35 | 90.08 | 38.83 | 16.70 | 14.00 | 11.50 | 12.70 |
| 2                                  | 82.31 | 83.13 | 83.13 | 64.73 | 22.70 | 14.90 | 12.30 | 10.50 | 8.50  |
| 1                                  | 30.15 | 30.97 | 30.92 | 33.72 | 32.07 | 24.60 | 34.90 | 12.30 | 8.50  |
| 0                                  | 9.53  | 9.33  | 9.50  | 9.37  | 9.37  | 8.50  | 8.50  | 8.50  | 10.00 |

32-bit floating-point accuracy is 98.05%

Area/power for a 32-bit floating-point multiplier is ~20X more than a 10-bit fixed-point multiplier



## Saturating Math

- Floating point representations almost never overflow
  - 64 bit floating point represents up to 10<sup>308</sup>
- Using reduced precision means overflows are more likely
  - Overflow truncation corrupts the result, and all subsequent calculations
- Saturating math stores the maximum value which can be represented when an overflow occurs
- For many neural networks when a number gets large the absolute magnitude is not important, just that the number is "large"
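The difference between overflow truncation and saturation is easy to see on a small accumulator. The Python sketch below models an 8-bit signed two's complement result; the bit width and operand values are chosen only for illustration.

```python
def wrap(value, bits):
    """Keep the low `bits` bits, interpreted as two's complement (overflow truncation)."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >= 1 << (bits - 1) else value

def saturate(value, bits):
    """Clamp to the largest or smallest value representable in `bits` bits."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

total = 100 + 100  # exceeds the 8-bit signed maximum of 127
print(wrap(total, 8), saturate(total, 8))  # -56 127
```

The truncated sum flips sign and poisons every later calculation, whereas the saturated sum is still "large", which is all that many networks need.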



## Saturating Math



Saturating math can reduce required representation size by 1 or 2 bits



## Data Movement and Storage

- Movement and storage of weights and features impacts performance and power
- Reducing numeric representation has a linear effect on storage costs
- For data movement, fully packing the bus with data is optimal
  - Buses are typically sized based on powers of two
  - For example, 16 bit representation is preferred to 17 bits
- While reducing the size of the representation usually negatively impacts accuracy, this can be offset by increasing layers or channels
  - This means changing the architecture of the neural network
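The 16-bit versus 17-bit example can be checked with a few lines of Python. The sketch assumes a 64-bit bus (a hypothetical width) and that values are never split across bus words.

```python
def bus_utilization(value_bits, bus_bits=64):
    """Return (values per bus word, fraction of the bus carrying data)."""
    values_per_word = bus_bits // value_bits
    return values_per_word, values_per_word * value_bits / bus_bits

print(bus_utilization(16))  # (4, 1.0)
print(bus_utilization(17))  # (3, 0.796875)
```

Adding one bit to the representation drops a whole value from each transfer, so the same data needs about a third more bus cycles.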



## Convolution Order of Operations

- Convolution algorithms access the input feature map and output array multiple times
- Early in the network the input data sets are typically smaller
- Later layers typically have larger input arrays
- Coordinating cache size with order of operations can optimize PPA



## Caching and Buffering

- Minimizing accesses to external memory can improve performance and minimize power
- Memories tend to dominate area and power
- Data movement tends to limit performance
- If CNN data sets are too large to fit on-chip, careful data management can significantly improve design characteristics



Inference Accelerator, post P&R



## Accelerator Optimization

- The CNN will undergo significant modification between the ML framework and the hardware design
- This presents unique verification challenges







## Verification Challenges



## Verification of Inferencing Systems

- Need to verify:
  - Individual operators, multipliers, adders, etc.
  - Processing elements, Multiply/Accumulate (MAC) operations
  - Complete inferences
- Neural Networks are robust to failed individual operations
  - A single correct inference does not prove correctness of the implementation
  - A statistically significant number of inferences is required
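One way to reason about how many inferences are enough is the margin of error on the measured accuracy. The sketch below uses the standard normal approximation to the binomial distribution and assumes that inferences are independent trials. The source does not prescribe a particular statistical method, and the 98% accuracy figure is a placeholder.

```python
import math

def accuracy_margin(accuracy, num_inferences, z=1.96):
    """Half-width of the confidence interval on measured accuracy.

    z = 1.96 corresponds to a 95% confidence level under the normal
    approximation.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / num_inferences)

for n in (100, 1_000, 10_000):
    print(n, f"+/- {accuracy_margin(0.98, n):.2%}")
```

With 100 inferences the uncertainty is several percentage points, far larger than the sub-1% accuracy differences that matter when comparing implementations, so thousands of inferences are needed to tell them apart.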







## Verification of Inferencing Systems

- Performance in logic simulation is prohibitively slow (28 hours for one inference in an object recognition algorithm)
  - And hardware acceleration is often not available early in the design cycle
- Verify at the abstract level and prove equivalence between representations at different design stages
  - This can be done between Python and C++, then C++ and RTL



## Traditional UVM Flow





## Traditional UVM Flow

- Verilog implements modified CNN
  - Changes in layers/channels (these can be implemented in the predictor)
  - Changes in numeric representation
    - Float vs. fixed
    - Bit widths
    - Saturation/rounding
- Cannot directly compare outputs
- If there is a problem, debug is very hard










## Python to C++ Consistency



### Python to C++

- Run C++ node in parallel with Python node
- Both nodes use common float types
- Differences should be limited to rounding error caused by the order of computation
- Import C++ function into Python
  - Several ways to do this: ctypes, CFFI, PyBind11, Cython
- Repeat for subsequent nodes, then layers, then complete network
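Because floating-point addition is not associative, two correct implementations of the same node can disagree in the last few bits, so the comparison needs a tolerance rather than exact equality. In the Python sketch below, `dot_reversed` is a stand-in for a C++ node that accumulates in a different order; in a real flow that result would come from the imported C++ function. The tolerances and data are illustrative assumptions.

```python
import math

def dot_forward(weights, features):
    """Reference node: accumulate products from first to last."""
    acc = 0.0
    for w, x in zip(weights, features):
        acc += w * x
    return acc

def dot_reversed(weights, features):
    """Same computation with the accumulation order reversed."""
    acc = 0.0
    for w, x in zip(reversed(weights), reversed(features)):
        acc += w * x
    return acc

def nodes_match(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """True when two node outputs agree to within rounding error."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

weights = [0.1 * (i + 1) for i in range(50)]
features = [1.0 / (i + 3) for i in range(50)]

py_out = dot_forward(weights, features)
cpp_out = dot_reversed(weights, features)  # stand-in for the C++ node's result
print(py_out == cpp_out, nodes_match(py_out, cpp_out))
```

A difference much larger than the tolerance points to a real bug rather than to reordering.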



## C++ to Quantized Model Consistency



- Run C++ node in parallel with the quantized node
- Quantized implementation should be identical to the C++ algorithmic model except for data types
  - Verify/debug one thing at a time
- Nodes use different types
  - Float vs. fixed point, reduced bit-width (ac data types)
- Differences will exist, and may be large
  - When in range, the result of a single operation will be within rounding error
  - Out-of-range values will be saturated
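These two expectations can be turned into an automated check that classifies each difference between the floating-point result and the quantized result. The sketch below assumes a signed format with round-to-nearest and saturation; with truncation instead of rounding the bound would be one full LSB. The function name and the labels it returns are illustrative.

```python
def check_quantized(reference, quantized, int_bits, frac_bits):
    """Classify the difference between a float result and its fixed-point counterpart.

    int_bits includes the sign bit. Returns "rounding", "saturated",
    or "mismatch".
    """
    lsb = 2.0 ** -frac_bits
    hi = 2.0 ** (int_bits - 1) - lsb
    lo = -(2.0 ** (int_bits - 1))
    if reference > hi:
        return "saturated" if quantized == hi else "mismatch"
    if reference < lo:
        return "saturated" if quantized == lo else "mismatch"
    return "rounding" if abs(reference - quantized) <= lsb / 2 else "mismatch"

print(check_quantized(0.731, 0.75, 4, 2))  # rounding
print(check_quantized(12.3, 7.75, 4, 2))   # saturated
print(check_quantized(0.731, 1.0, 4, 2))   # mismatch
```

Only the "mismatch" cases need debugging; the other two are expected consequences of the reduced representation.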





## Quantized Model Must be Verified

- Need to run large number of inferences
  - Predictions will be different from Python or C++ algorithmic model
- Determine if CNN accuracy is acceptable
  - Modify network/layers/channels as needed and repeat
- One day ML frameworks will support quantized numbers
  - QKeras, Larq, and HAWQ are examples of extensions that support quantization
  - Currently, these work for TPUs but are not expressive enough for bespoke accelerators
  - Abstract model must *exactly* match the Verilog to be implemented



## C++ Quantized to C++ Architecture Consistency



#### Architecture



- Run the quantized node in parallel with the architected node
- Quantized and architected nodes should differ only by order-of-operation rounding errors
- Nodes use same types
  - Fixed point, reduced bit-width



## Verification – before HLS

#### C++ Architected CNN



#### Static Design Checks

Static code analysis and synthesis checks. Find coding errors and problem constructs

#### **Coverage Analysis**

Determine completeness of test cases. Statement, branch and expression coverage as well as covergroups, coverpoints, bins and crosses





## C++ to RTL consistency

#### C++ Architected CNN



#### Formal

Using formal techniques, prove as much equivalency as possible

#### UVM

Architected C++ is used as a predictor for RTL verification

#### **RTL Coverage**

Determine remaining verification effectiveness through RTL coverage metrics







## Debug – When Things Go Wrong



- Log all intermediate values to memory or log file
  - This includes output from each layer
- Have scripts that can compare intermediate values from different model representations
  - This identifies the first point of divergence between models
  - Immediately find layer and node where problem resides
- Intermediate values from the Python model can be recorded to a file for comparison
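A comparison script of this kind can be very small. The Python sketch below walks two traces of per-layer intermediate values in execution order and reports the first layer and element where they differ by more than a tolerance. The trace format, layer names, and tolerance are assumptions made for illustration; real traces would be loaded from the log files of each model.

```python
def first_divergence(trace_a, trace_b, tol=1e-6):
    """Return (layer_name, index) of the first differing value, or None.

    Each trace is a list of (layer_name, values) pairs in execution order.
    An index of None means the layers themselves do not line up.
    """
    for (name_a, vals_a), (name_b, vals_b) in zip(trace_a, trace_b):
        if name_a != name_b or len(vals_a) != len(vals_b):
            return name_a, None
        for idx, (a, b) in enumerate(zip(vals_a, vals_b)):
            if abs(a - b) > tol:
                return name_a, idx
    return None

# Hypothetical traces from two model representations
trace_py = [("conv1", [0.5, 1.25]), ("dense1", [2.0, 3.0])]
trace_cpp = [("conv1", [0.5, 1.25]), ("dense1", [2.0, 3.5])]

print(first_divergence(trace_py, trace_cpp))  # ('dense1', 1)
```

Because every later layer consumes the corrupted value, only the first point of divergence is meaningful; everything after it is a consequence.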





## HVL UVM Flow







## Verification in HLS Flow

## Conclusion

- Moving from Python to RTL in a single step introduces a significant verification problem
  - Inferencing algorithms do not produce bit-level equivalency when accelerated
  - Requires many inferences to verify accuracy of implementation
  - Simulation performance is too slow, and emulation or FPGA prototypes are usually not available
- High-Level Synthesis introduces an intermediate C++ model
  - Verify the algorithm at the Python level
  - Prove equivalency between subsequent model stages



# Questions or Comments


Thank You

Petri Solanti, Field Applications Engineer, Petri.Solanti@Siemens.com

Russell Klein, Program Director, Russell.Klein@Siemens.com