SystemC: Focusing on High Level Synthesis and Functional Coverage for SystemC

Organizers: Dragos Dospinescu - AMIQ and Mark Glasser - NVIDIA

- High-Level Synthesis: an Introduction - Frederic Doucet - Facebook
- High Level Synthesis: Model Structure and Data Types - Mike Meredith - Cadence
- High Level Synthesis: Lessons Learned - Bob Condon - Intel
- Functional Coverage for SystemC (FC4SC) - Dragos Dospinescu - AMIQ
- Accellera SystemC Working Group Update - Mike Meredith - Cadence and Martin Barnasconi, NXP
High-level Synthesis: An Introduction

Frederic Doucet,
Facebook,
Menlo Park, CA
High-level Synthesis Overview

• SystemC / C++ based design with HLS
  – Higher level of abstraction than Verilog

• Thousands of tapeouts on a variety of designs
  – from very small to very large!
  – Example of sizes of synthesized SystemC processes
    • Small ~1k - 10k instances
    • Large ~100k instances
    • Very large ~500k instances
  – Large datapaths, control mixed with datapath, etc
  – Significant productivity increases, get to the finish line faster

What does this all mean?
How does it work?
How is it different from RTL?
High-level Synthesis Overview

HLS tool transforms synthesizable C++/SystemC code into RTL Verilog

1. Elaborate C++/SystemC code describing the design
2. Apply designer-specified synthesis directives / constraints
3. Characterize resources for all operations
4. Schedule all operations onto available clock cycles
5. Generate RTL that is “equivalent” to the input
Describing Computation with C++

Datapath functions:

int compute(int val[4], int coef[4])
{
  int sum = 0;
  for (int i=0; i<4; i++) {
    sum += val[i]*coef[i];
  }
  return sum;
}

DSP processing, Image processing, etc.

The HLS tool sees the body of the function as a loop. What does that mean in hardware?
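One way to picture the question is to write the same computation twice in plain C++: once as the rolled loop, and once fully unrolled with a balanced adder tree. This is only an illustrative sketch. The names `compute_rolled` and `compute_unrolled` are hypothetical, and a real HLS tool performs this transformation internally under the control of directives rather than by source rewriting.

```cpp
// Rolled form: one multiply-accumulate per iteration. Taken literally,
// this is a sequential chain of four dependent additions.
int compute_rolled(const int val[4], const int coef[4]) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += val[i] * coef[i];
    }
    return sum;
}

// Unrolled and balanced form: four independent multiplications that could
// map to four multipliers, feeding a two-level adder tree.
int compute_unrolled(const int val[4], const int coef[4]) {
    const int p0 = val[0] * coef[0];
    const int p1 = val[1] * coef[1];
    const int p2 = val[2] * coef[2];
    const int p3 = val[3] * coef[3];
    return (p0 + p1) + (p2 + p3);
}
```

Both functions return the same value for the same inputs; only the shape of the implied hardware differs.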
Hardware Modeling with SystemC

• SystemC: syntax to model hardware in C++
  – modules, ports, signals, processes, clocks, resets, bit-accurate datatypes, channels, etc.

• SystemC module:
  – provides the I/O interface of the design, and clock and reset specifications
  – describes structure of the design: sub-modules, connections, etc.

• SystemC process:
  – Defines I/O behavior and control around calls to datapath functions
  – Specifies the control flow (usually with an implicit FSM)
    • will be “concretized” by HLS tool into FSM/datapath in the RTL
SystemC Module

SC_MODULE(DUT) {
    sc_in<bool>           clk;
    sc_in<bool>           rst_n;
    sc_in<bool>           vld_i;
    sc_in<sc_uint<16> >   vals_i [N];
    sc_in<sc_uint<16> >   coeffs_i[N];
    sc_out<bool>          vld_o;
    sc_out<sc_uint<16> >  sum_o;
    ...
    SC_CTOR(DUT) {
        SC_THREAD(process);
        sensitive << clk.pos();
        reset_signal_is(rst_n,0);
    }
    ...
    void process() { ... } 
};
... void process() {
    vld_o.write(0);
    wait();
    while (1) {
        bool input_vld = vld_i.read();
        sc_uint<16> vals[4], coeffs[4];
        for (int i=0; i<4; i++) {
            vals[i] = vals_i[i].read();
            coeffs[i] = coeffs_i[i].read();
        }
        sc_uint<16> sum = compute(vals, coeffs);
        vld_o.write(input_vld);
        sum_o.write(sum);
        wait();
        vld_o.write(0);
    }
    vld_o.write(0);
}
SystemC I/O Behavior

... void process() {
  vld_o.write(0);
  wait();
  while (1) {
    bool input_vld = vld_i.read();
    sc_uint<16> vals[4], coeffs[4];
    for (int i=0; i<4; i++) {
      vals[i] = vals_i[i].read();
      coeffs[i] = coeffs_i[i].read();
    }
    sc_uint<16> sum = compute(vals, coeffs);
    vld_o.write(input_vld);
    sum_o.write(sum);
    wait();
    vld_o.write(0);
  }
}
...
Synthesis Directives: Provide Hardware Design Intent

Tell the HLS tool how to transform C++ structures into hardware structures

```c++
sc_uint<16> compute(sc_uint<16> val[4], sc_uint<16> coef[4])
{
  sc_uint<16> sum = 0;
  for (int i=0; i<4; i++) {
    UNROLL_LOOP;  // synthesis directive: unroll this loop
    sum += val[i] * coef[i];
  }
  return sum;
}
```

Unroll the loop: all iterations are executed in parallel
SystemC + Directives = Hardware Model Ready for HLS

```
void process() {
    vld_o.write(0);
    wait();
    while (1) {
        bool input_vld = vld_i.read();
        sc_uint<16> vals[4], coeffs[4];
        for (int i=0; i<4; i++) {
            vals[i] = vals_i[i].read();
            coeffs[i] = coeffs_i[i].read();
        }
        sc_uint<16> sum = compute(vals, coeffs);
        vld_o.write(input_vld);
        sum_o.write(sum);
        wait();
        vld_o.write(0);
    }
}
```

With directives:
- unroll loops
- balance expressions
HLS: Cycle-Accurate Design

- Directives / constraints:
  - Unroll loops
  - Balance expressions
  - Clock period: 0.7ns
  - Scheduling: cycle accurate
Resource Characterization

For all operations in the design, HLS tool characterizes resources for delay and area:

<table>
<thead>
<tr>
<th>Resource</th>
<th>Size</th>
<th>Grade</th>
<th>Delay</th>
<th>Area</th>
</tr>
</thead>
<tbody>
<tr>
<td>Multiplier</td>
<td>16x16x16</td>
<td>Fast</td>
<td>0.27</td>
<td>70</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Slow</td>
<td>0.5</td>
<td>36</td>
</tr>
<tr>
<td>Adder</td>
<td>16x16x16</td>
<td>Fast</td>
<td>0.1</td>
<td>15</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Slow</td>
<td>0.3</td>
<td>6</td>
</tr>
<tr>
<td>Mux</td>
<td>16x4-&gt;16</td>
<td>Fast</td>
<td>0.1</td>
<td>4</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Slow</td>
<td>0.05</td>
<td>3</td>
</tr>
<tr>
<td>Register</td>
<td>16</td>
<td></td>
<td>0.04 / 0.03</td>
<td>6</td>
</tr>
</tbody>
</table>

HLS tool will use the combination of resource grades when exploring the different schedules.
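The exploration over resource grades can be sketched as a small enumeration. The sketch below takes the multiplier and adder figures from the table, assumes the datapath is one multiplier followed by a two-level adder tree in a single cycle, and assumes the two register figures (0.04 and 0.03) are both timing overheads on that path. The last two points are assumptions, not something the slides state. Delays are kept in units of 0.01 ns to avoid floating-point comparisons, and all names are hypothetical.

```cpp
#include <vector>

struct Grade { const char* name; int delay; int area; };  // delay in 0.01 ns

struct Candidate { int mul; int add1; int add2; int path_delay; int area; };

// Figures from the resource table.
const Grade kMul[2] = {{"fast", 27, 70}, {"slow", 50, 36}};
const Grade kAdd[2] = {{"fast", 10, 15}, {"slow", 30, 6}};
const int kRegOverhead = 4 + 3;  // ASSUMPTION: both register figures add to the path
const int kClockPeriod = 70;     // 0.7 ns

// Try every grade combination for: multiplier -> adder level 1 -> adder level 2.
// Area counts 4 multipliers, 2 first-level adders and 1 second-level adder.
std::vector<Candidate> feasible_candidates() {
    std::vector<Candidate> out;
    for (int m = 0; m < 2; ++m)
        for (int a1 = 0; a1 < 2; ++a1)
            for (int a2 = 0; a2 < 2; ++a2) {
                const int delay = kMul[m].delay + kAdd[a1].delay +
                                  kAdd[a2].delay + kRegOverhead;
                const int area = 4 * kMul[m].area + 2 * kAdd[a1].area +
                                 kAdd[a2].area;
                if (delay <= kClockPeriod)
                    out.push_back(Candidate{m, a1, a2, delay, area});
            }
    return out;
}
```

Under these assumptions only the all-fast combination fits in one 0.7 ns cycle, with a datapath area of 325. Registers and control logic are not modeled here, so the number is not expected to match a tool report exactly.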
Operation Scheduling: Cycle Accurate

With clock period set to 0.7ns:

![Diagram of operation scheduling with clock period set to 0.7ns]
Cycle-Accurate Design: Generated RTL

• Directives / constraints:
  – Unroll loops
  – Balance expressions
  – Clock period: 0.7ns
  – Scheduling: cycle accurate

• Area:
  – 4 fast multipliers, 3 fast adders

<table>
<thead>
<tr>
<th>Micro-arch</th>
<th>Area</th>
<th>Throughput (cycles)</th>
<th>Latency (cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cycle accurate</td>
<td>335</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

![Diagram showing a network of operations involving variables and coefficients with calculated values]

Let’s go back and try a different micro-architecture...
HLS: Minimal Area Design (1/3)

- Directives / constraints:
  - Unroll loops
  - Balance expressions
  - Clock period: 0.7ns
  - Scheduling: minimize area

Reduce area: increase latency to share resources, then generate new RTL

• The state machine is changed
  – The scheduler adds 4 states to share 1 multiplier for 4 multiplications
  – Adders are also shared
HLS: Minimal Area Design (2/3)

• Generated RTL will now include
  – shared resources
  – shared registers
  – sharing muxes

• The generated FSM drives enables to sharing muxes and registers at the correct time
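The resource sharing described above can be modeled behaviorally in plain C++. In this sketch one call to `clock()` stands for one rising clock edge, a single multiply-accumulate is reused across four states, and one extra state captures the inputs. `SharedMacFsm` and its members are hypothetical names, and this is a hand-written model of the idea, not the RTL an HLS tool would emit.

```cpp
#include <cstdint>

struct SharedMacFsm {
    int state = 0;            // 0: capture inputs, 1..4: one multiply-accumulate each
    uint16_t vals[4] = {};    // registered inputs
    uint16_t coeffs[4] = {};  // registered here; could be skipped if inputs are stable
    uint16_t acc = 0;         // accumulator register
    bool vld_o = false;
    uint16_t sum_o = 0;

    // One call models one rising clock edge.
    void clock(bool vld_i, const uint16_t v[4], const uint16_t c[4]) {
        vld_o = false;
        if (state == 0) {
            if (vld_i) {
                for (int i = 0; i < 4; ++i) { vals[i] = v[i]; coeffs[i] = c[i]; }
                acc = 0;
                state = 1;
            }
        } else {
            // The single shared multiplier: the state selects the operands,
            // which is the role of the sharing muxes.
            const uint32_t product = uint32_t(vals[state - 1]) * coeffs[state - 1];
            acc = uint16_t(acc + product);  // wraps modulo 2^16 like a 16-bit register
            if (state == 4) {
                sum_o = acc;
                vld_o = true;
                state = 0;
            } else {
                ++state;
            }
        }
    }
};
```

With a valid input on the first edge, `vld_o` is raised on the fifth edge, which corresponds to one result every 5 clock cycles.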
HLS: Minimal Area Design (3/3)

- Area reduced to about 40% of the cycle-accurate design (139 vs. 335), but throughput drops to one result every 5 clock cycles

<table>
<thead>
<tr>
<th>Micro-arch</th>
<th>Area</th>
<th>Throughput (cycles)</th>
<th>Latency (cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cycle accurate</td>
<td>335</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Min Area</td>
<td>139</td>
<td>5</td>
<td>5</td>
</tr>
</tbody>
</table>

(No need to register the coeffs...)

(Waveform diagram of the minimal area design: inputs `vld_i`, `vals_i[0..3]` carrying v0..v3, and `coeffs_i[0..3]` carrying c0..c3; outputs `vld_o` and `sum_o`; shown over 6 `clk` cycles)

(Accellera, 2019 Design and Verification Conference and Exhibition, United States)
HLS: Minimal Area Design, Stable Inputs

- Directives / constraints:
  - Unroll loops
  - Balance expressions
  - Minimize area
  - Coeffs inputs are stable
- 18% smaller, throughput 5clk

<table>
<thead>
<tr>
<th>Micro-arch</th>
<th>Area</th>
<th>Throughput (cycles)</th>
<th>Latency (cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cycle accurate</td>
<td>335</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Min Area</td>
<td>139</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Min area / stable inputs</td>
<td>115</td>
<td>5</td>
<td>5</td>
</tr>
</tbody>
</table>
HLS: Pipeline (1/2)

- Directives / constraints:
  - Unroll loops
  - Balance expressions
  - Pipeline

- Can use slow resources except for the last adder
HLS: Pipeline (2/2)

- Throughput is 1 per cycle
- Latency is now 2 cycles
- Significant area gain for extra cycle of latency

<table>
<thead>
<tr>
<th>Micro-arch</th>
<th>Area</th>
<th>Throughput (cycles)</th>
<th>Latency (cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cycle accurate</td>
<td>335</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Min Area</td>
<td>139</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Min area / stable inputs</td>
<td>115</td>
<td>5</td>
<td>5</td>
</tr>
<tr>
<td>Pipeline</td>
<td>205</td>
<td>1</td>
<td>2</td>
</tr>
</tbody>
</table>

Imagine making all these changes by hand...
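For comparison with the cycle-accurate and minimal-area variants, a pipelined version can be modeled in a few lines of plain C++. The sketch assumes the pipeline register sits between the multipliers and the adder tree; the slides do not say where the tool placed it. `Pipeline2` and its members are hypothetical names, and one call to `clock()` stands for one rising clock edge.

```cpp
#include <cstdint>

struct Pipeline2 {
    uint32_t prod[4] = {};  // stage-1 registers holding the four products
    bool vld_s1 = false;    // valid bit travelling with the products
    uint16_t sum_o = 0;     // output register
    bool vld_o = false;

    // One call models one rising clock edge.
    void clock(bool vld_i, const uint16_t v[4], const uint16_t c[4]) {
        // Stage 2 reads the values registered on the previous edge,
        // so it is evaluated before stage 1 overwrites them.
        sum_o = uint16_t((prod[0] + prod[1]) + (prod[2] + prod[3]));
        vld_o = vld_s1;
        // Stage 1 accepts a new input set on every edge.
        for (int i = 0; i < 4; ++i) prod[i] = uint32_t(v[i]) * c[i];
        vld_s1 = vld_i;
    }
};
```

A new input can be accepted on every edge (throughput of 1), and each result appears two edges after its input was sampled (latency of 2).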
Abstracted in SystemC, Refined by HLS

1. Operations to resource bindings and sharing muxes
   – Resource sharing depends on the synthesis directives (performance or area?)

2. Allocation and mapping of values to internal registers
   – Values in flight need to be registered
   – Depends on when the operations are mapped to the resources, which depends on the HLS directives

3. Creation of FSM states and transitions:
   – `wait()` statements are converted to FSM states (in code, and added by tool)
   – Transitions between waits are FSM transitions
   – Current / next state logic generated by the tool
Benefits of HLS

1. Fast design turnaround:
   – Quickly implement large (micro-architecture) changes and regenerate RTL
   – Allows for fast micro-architecture exploration for design and QoR optimizations

2. High-level verification:
   – huge productivity benefits to verify and close coverage at SystemC level
   – Bit match datapath functions
   – Bugs are mostly in integration with other non-HLS RTL blocks

3. Get to finish line faster
   – Get a first version up and optimize it (when good enough, tape it out!)
Accellera SystemC Standardization

- Goal: support eco-system with multiple HLS vendors

- Further standardization work needed:
  - Channel/hierarchical port syntax
  - Channel libraries
    - fifos, point-to-point, memories, etc.
  - Standardization of Synthesis Directives
    - pipeline, loop unrolling, etc
    - syntax and interpretation
  - C++11 / C++14 support
High-level Synthesis
SystemC Model Structure and Datatypes

Mike Meredith
Contact: mmeredith@cadence.com
SystemC Models

Primary purposes for use of SystemC

• Virtual platform modelling
  – Primarily for integration and validation of embedded software
  – TLM now part of IEEE 1666-2011 SystemC language standard

• High-level synthesis
  – As an alternative to traditional RTL design by hand
  – Accellera SystemC Synthesis Subset standard

• Verification
  – As glue for multiple languages and abstractions
  – Increasingly as a testbench language
  – Accellera SystemC Verification Library standard and new UVM-SystemC Library draft
TLM Modeling

• TLM requirements: *SPEED!*
• Appropriate scope
  – System
• Appropriate detail
  – Memory map
  – Algorithm
  – Transaction order
• Appropriate techniques
  – Event sensitivity
  – Abstract communication with function calls through sc_port
  – Passing pointers to host memory
  – Any technique that will increase speed without losing necessary detail
Modeling For Synthesis

- Synthesis requirements: Bring the model down to earth
- Appropriate scope
  - Block, subsystem
- Appropriate detail
  - Cycle accuracy for protocol and control
  - Abstract algorithm for exploration
- Appropriate techniques
  - Clock sensitivity
  - Concrete communication with pin-level protocols
  - Detail modeling of reset behaviors
  - Abstract modeling of algorithm and storage architecture
Modeling For Verification

• Appropriate scope
  – Block, subsystem, and system
  – For verifying virtual platforms and synthesizable implementations

• Appropriate techniques
  – Constrained random stimulus
  – Test sequences
  – Sequencer, driver, monitor functionality
  – Functional coverage
Module Structure For Synthesis

(Block diagram of an SC_MODULE: clock and reset `sc_in` ports, required for SC_CTHREAD and SC_THREAD; `sc_in` and `sc_out` ports for reading and writing data; SC_CTHREAD, SC_THREAD, and SC_METHOD processes connected by `sc_signal`s; submodules; member functions; data members for storage)
SC_CTHREAD And SC_THREAD Reset Semantics

For Simulation

• At start_of_simulation each SC_THREAD and SC_CTHREAD function is called
  – It runs until it hits a wait()
• When an SC_THREAD or SC_CTHREAD is restarted after wait()
  – If reset condition is false
    • execution continues
  – If reset condition is true
    • stack is torn down and function is called again from the beginning
• This means
  – Everything before the first wait will be executed while reset is asserted

```cpp
void process() {
  // reset behavior: executed while reset is asserted
  wait();
  // post-reset initialization
  while (true) {
    // main loop
    wait();
  }
}
```

Note that every path through main loop must contain a wait() or simulation hangs with an infinite loop
SC_CTHREAD And SC_THREAD Reset Semantics

For Synthesis

- Assignments become reset initializations of registers in the hardware
  - Assignments to ports
  - Assignments to signals
  - Assignments to variables

- Initialization of data members of modules
  - Includes ports, signals, and data members
  - Should be done in reset behavior of some process
  - Should NOT be done in module constructor
    - This invites a mismatch between behavior and RTL reset functionality

SC_CTHREAD or SC_THREAD structure:

- reset behavior
- wait();
- post-reset initialization
- while (true) { main loop }

Note that every path through main loop must contain a wait() or simulation hangs with an infinite loop.
SystemC Processes For Synthesis

SC_CTHREAD
- Clock-synchronous thread process
- Must have clock and reset specification
- Can have wait()s to span clock cycles
- Implemented in RTL as an FSM

SC_METHOD
- For implementing RTL constructs
- Semantics are same as Verilog always block
- Can be synchronous or asynchronous

SC_THREAD
- Equivalent in synthesis to SC_CTHREAD
- Only synthesizable if constrained like SC_CTHREAD
  - Sensitive to clock and reset
  - Only wait()s are to the sensitive clock edge
C++ Datatypes For Synthesis

• All C++ integer types are supported except wchar_t
• Synthesis standard refinements over ISO C++
  – Two's complement signed representation
  – Specific bit widths

<table>
<thead>
<tr>
<th>Type</th>
<th>Width</th>
</tr>
</thead>
<tbody>
<tr>
<td>(un)signed char, char</td>
<td>8</td>
</tr>
<tr>
<td>(un)signed short</td>
<td>16</td>
</tr>
<tr>
<td>(un)signed int</td>
<td>32</td>
</tr>
<tr>
<td>(un)signed long</td>
<td>32</td>
</tr>
<tr>
<td>(un)signed long long</td>
<td>64</td>
</tr>
</tbody>
</table>

Note that specification of narrower bit widths using SystemC datatypes can significantly reduce hardware cost after synthesis.
SystemC Datatypes

- **sc_int, sc_uint**
  - Limited precision signed and unsigned integers with widths from 1 to 64
- **sc_bigint, sc_biguint**
  - Finite precision signed and unsigned integers with width from 1 to unlimited
- **sc_fixed, sc_ufixed**
  - Finite precision fixed-point data with user selectable saturation and rounding
- **sc_bv**
  - Finite word-length bit vector without arithmetic support
- **sc_lv**
  - 4-state logic, but X and Z not supported for synthesis
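To make the effect of a fixed bit width concrete without depending on the SystemC library, the sketch below emulates an unsigned W-bit integer that truncates to W bits whenever a value is stored. `UIntW` is a hypothetical stand-in written for this illustration and is not the `sc_uint` class. Real `sc_uint` arithmetic is carried out at 64-bit precision and truncated on assignment, which gives the same stored result when the destination has the same width.

```cpp
#include <cstdint>

template <int W>
struct UIntW {
    static_assert(W >= 1 && W <= 64, "width must be between 1 and 64");
    uint64_t value;

    // Mask with the low W bits set.
    static constexpr uint64_t mask() { return ~uint64_t{0} >> (64 - W); }

    // Storing a value keeps only its low W bits.
    UIntW(uint64_t v = 0) : value(v & mask()) {}
};

template <int W>
UIntW<W> operator+(UIntW<W> a, UIntW<W> b) { return UIntW<W>(a.value + b.value); }

template <int W>
UIntW<W> operator*(UIntW<W> a, UIntW<W> b) { return UIntW<W>(a.value * b.value); }
```

For example, adding 1 to a 16-bit value of 65535 stores 0, and 300 * 300 stored in 16 bits gives 24464 (90000 modulo 65536). A narrower width therefore implies both a smaller operator in hardware and wraparound that the model has to tolerate.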
Lessons Learned – Intel’s Experience

Bob Condon
Intel
• Bob Condon: past 8 years at Intel, coaching new HLS teams
• At Intel we use HLS in production for both algorithm-dominated and control-dominated designs.
• We have many groups that have produced multiple generations of designs and have thoroughly integrated HLS into their default workflow.
• Key benefit: faster time to market, because we
  – Find bugs sooner
  – Tolerate late-breaking architecture changes
• What have we learned from designs that have gone through several iterations?
Power

• HLS tools have some ability to consider power when pipelining.
• When targeting cell libraries with low-leakage cells, designer intuition of a “good design” is sketchy, and using these cells is a bit like a technology change.
• Key HLS benefit: rapidly generating multiple micro-architectures lets us evaluate design properties that HLS doesn’t explicitly address (e.g., static and dynamic power consumption).
Reuse tests across flows

• When a test fails, who is wrong? The test, the DUT, the spec?...
• Goal – find as many failures as possible with the cheapest tests.
• SystemC DUT tested with MATLAB vectors
  – Keeps algorithm and implementation in sync
  – Flushes out functional and quantization bugs
• Some HLS models are fast enough to integrate directly in a VP flow.
• For designs with well established interfaces, test the pre-HLS code with OVM/UVM testbench.
Designs Evolve -- Refactor

- Refactor – change a design to make it easier to debug, reuse, maintain without changing the functionality.
- A good one-minute C++ test will find almost all functional bugs in an HLS design. Run on every clean compile.
- Refactorings
  - Templatizing datatypes, modules ...
  - separating control from algorithm
  - adding debugging
Evolution of a function

// Closest to the original C code
OUT_T filt_calc_v0(sc_fixed<10,2> d[4]) {
    const sc_fixed<5,1> Coef[] = { 9.0/16, -1.0/16 };
    sc_fixed<14,2> dac = (Coef[0] * (d[1]+d[2])) +
                         (Coef[1] * (d[0]+d[3]));
    return dac;
}

// Matches the spec (with explicit datapath sizing),
// but the multiple RNDs and SATs are expensive
template <typename OUT_T, typename IN_T>
OUT_T filt_calc_v1(IN_T d[4]) {
    const sc_fixed<5,1> C[] = { 9.0/16, -1.0/16 };
    sc_fixed<10,2,SC_RND,SC_SAT> t1 = d[1]+d[2];
    sc_fixed<10,2,SC_RND,SC_SAT> t2 = d[0]+d[3];
    sc_fixed<14,2,SC_RND,SC_SAT> t3 = t1 * C[0];
    sc_fixed<14,2,SC_RND,SC_SAT> t4 = t2 * C[1];
    sc_fixed<15,2> t5 = t3 + t4;
    OUT_T dac = t5;
    return dac;
}

// Avoids rounding until the end
template <typename OUT_T, typename IN_T>
OUT_T filt_calc_v2(IN_T d[4]) {
    const sc_fixed<5,1> C[] = { 9.0/16, -1.0/16 };
    sc_fixed<10,2> t1 = d[1]+d[2];
    sc_fixed<10,2> t2 = d[0]+d[3];
    sc_fixed<14,2> t3 = t1 * C[0];
    sc_fixed<14,2> t4 = t2 * C[1];
    sc_fixed<15,2> t5 = t3 + t4;
    OUT_T dac = t5;
    return dac;
}
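The difference between truncating and rounding the final result can be illustrated with plain integers standing in for fixed-point values. In this sketch the inputs are raw integers with 8 fractional bits (as in `sc_fixed<10,2>`), the coefficients 9/16 and -1/16 are the raw integers 9 and -1 with 4 fractional bits, and the full-precision result therefore has 12 fractional bits. Truncation is modeled as rounding toward minus infinity and rounding as adding half an LSB before truncating, which is how I understand `SC_TRN` and `SC_RND`. Saturation is not modeled, and all function names are hypothetical.

```cpp
#include <cstdint>

// floor(a / 2^bits) for signed a, without relying on shifting negative values.
int64_t floor_div_pow2(int64_t a, int bits) {
    const int64_t d = int64_t{1} << bits;
    int64_t q = a / d;                 // C++ division truncates toward zero
    if (a % d != 0 && a < 0) --q;      // adjust to round toward minus infinity
    return q;
}

// Drop `bits` fractional bits by truncation.
int64_t quantize_trn(int64_t raw, int bits) { return floor_div_pow2(raw, bits); }

// Drop `bits` fractional bits after adding half an LSB of the result.
int64_t quantize_rnd(int64_t raw, int bits) {
    return floor_div_pow2(raw + (int64_t{1} << (bits - 1)), bits);
}

// Full-precision filter: 9/16 * (d1 + d2) - 1/16 * (d0 + d3),
// returned as a raw integer with 12 fractional bits.
int64_t filt_full_precision(const int64_t d[4]) {
    return 9 * (d[1] + d[2]) - (d[0] + d[3]);
}
```

Keeping the intermediate values at full precision and quantizing once at the end needs a single rounding step, which is the idea behind the third variant above.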
Evolution of a function (cont.)

- Refactored to a one-off test (of a subunit)
- Kept 3 variants of the code
  - Tradeoff between maintenance and triage
- Added unit tests for individual functions
  - The first three vectors found all the functional bugs.

```c++
// Here is a small test harness to isolate the core of the design and allow rapid experimentation
template <int VER>
SC_MODULE(x2_experiment) {
    typedef x2::input_t i_t;
    typedef x2::dac_output_t o_t;

    void process() {
        wait();
        while (true) {
            i_t d[4]; d[0]=d0.read(); d[1]=d1.read();
            switch (VER) {
            case 0: dac.write(filt_calc_v0(d)); break;
            case 1: dac.write(filt_calc_v1<o_t,i_t>(d));
            break;
            case 2: dac.write(filt_calc_v2<o_t,i_t>(d));
            break;
            }
            wait();
        }
    }
};
```
Refactor to extract common control idioms

Many blocks will have the same control pattern. A common idiom: calculate with a throughput of \( K \) outputs per clock:

\[
\text{result}[i] = f(s[i-W], \ldots s[i-1])
\]

Let \( K = 8, W = 4 \)

Implement with a shift register

```
template <typename T, int N, int K=1>
struct TD_Window {
    unsigned maxSample; // debug – total samples
    T d[N];

    void reset() {
        maxSample=0;
        for (size_t i=0; i<N; ++i)   // Add HLS pragmas here
            d[i] = 0;
    }

    T operator[](int indx) const {
        sc_assert(size_t(indx) <N);
        return d[indx];
    }

    void shift_in(const T t[K]) {
        maxSample += K;
        for (size_t i=0; i<N-K; i++)
            d[i] = d[i+K]; // Shift the old
        for (size_t i=0; i<K; i++)
            d[i + (N-K)] = t[i]; // ... and read in the new
    }
};
```
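The shifting behavior of such a window is easy to check in isolation. The sketch below is a plain C++ stand-in that drops the debug counter and uses `assert` in place of `sc_assert`, so it compiles without SystemC. `Window` is a hypothetical name, and clearing every element in `reset()` is an assumption about the intended behavior.

```cpp
#include <cassert>

template <typename T, int N, int K = 1>
struct Window {
    static_assert(K >= 1 && K <= N, "K samples must fit in an N-entry window");
    T d[N];

    // Clear the whole window (assumed intent of the reset).
    void reset() {
        for (int i = 0; i < N; ++i) d[i] = T();
    }

    T operator[](int indx) const {
        assert(indx >= 0 && indx < N);
        return d[indx];
    }

    // Shift the oldest K samples out and append K new ones at the end.
    void shift_in(const T (&t)[K]) {
        for (int i = 0; i < N - K; ++i) d[i] = d[i + K];
        for (int i = 0; i < K; ++i) d[i + (N - K)] = t[i];
    }
};
```

With `N = 4` and `K = 2`, shifting in {1, 2}, then {3, 4}, then {5, 6} after a reset leaves the window holding 3, 4, 5, 6, with the oldest sample at index 0.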
Evolution of a function (cont.)

- Algorithm code still uses [], but now it comes from the window class.
- HLS can optimize the arrays and the functions together (unlike sharing a module).
- Any debugging or logging is shared across all users.
Repurpose C++ tools

• Eclipse with extensions for SystemC datatypes.
  – Our code has lots of templates and the IDE helps new coders get up to speed on the codebase.

• gtest for regression testing of C++ libraries.

• Boost command line argument parsing.

• Boost metaprogramming for iteration over the repetitive parts.
Recap

- Rapid generation of different RTL implementations allows power exploration
- Re-use every test you can
  - From the architectural/functional model
  - From the RTL turnin model
- Refactor to make code usable in more circumstances and easier to debug.
- Separate datapath from control to make each piece re-usable with other models.
- Keep an eye on what you can steal from the C++ software engineering world.
Thanks for listening

Bob Condon
Intel
Functional Coverage For SystemC (FC4SC)

Dragoș Dospinescu
Contact: contributors@amiq.com
Agenda

1. Motivation
2. What is functional coverage?
3. What is FC4SC?
4. FC4SC features overview
5. Coverage constructs
6. Coverage control
7. Coverage database management
8. Conclusions
9. Roadmap
Motivation

• Implement constrained-random testbenches for verification
• Measure the degree of randomisation in the test suite
• Define milestones based on coverage metrics
• Track verification progress during the development cycle
• Generate reports on what functionality was tested
What is functional coverage? (1)

- User defined metric used in constrained-random verification
- Records what “happens” during test execution
- Qualitative metric relative to functionality aspects of the model

Two 64-bit inputs ⇒ $2^{128}$ possibilities.

Impossible to verify exhaustively!
The functional coverage approach:

- Interesting values for A & B
  - 0, 1, MIN, MAX
  - some values in [MIN:MAX]
- Relationship between A & B
  - parity
  - sign

...
What is FC4SC?

- C++11 header only library
- No dependency on any 3rd party library
- Provides functional coverage capabilities
- Based on the **IEEE 1800 - 2012 SystemVerilog Standard**

- Download library: [https://github.com/amiq-consulting/fc4sc](https://github.com/amiq-consulting/fc4sc)
- Include it in your project: `#include "fc4sc.hpp"`
- Ready to use!
FC4SC features overview

- Coverage definition: bin, coverpoint, cross, covergroup
- Coverage control: options, sample disabling
- Runtime coverage interrogation
- Coverage database saving
- Coverage database management tools
Coverage constructs: bin (1)

Bin: collection of values and intervals

### FC4SC

```cpp
bin<int>("less_than_8", // bin name
    1, // 1
    interval(2, 3), // [2:3]
    interval(7, 5) // [5:7]
);

bin_array<int>("split",
    3, // 3 bins
    interval(0, 255) // [0:255]
);

illegal_bin<int>("illegal_10", 10);
ignore_bin<int>("ignore_100", 100);
```

### SystemVerilog

```verilog
bins less_than_8 = {
    1,
    [2:3],
    [5:7]
};

bins split[3] = {
    [0:255]
};

illegal_bins illegal_10 = {10};
ignore_bins ignore_100 = {100};
```
auto fibonacci = [](size_t N) -> std::vector<int> {
    int f0 = 1, f1 = 2; // initialize starting number
    std::vector<int> result(N, f0);
    // calculate following fibonacci numbers
    for (size_t i = 1; i < N; i++) {
        std::swap(f0, f1);
        result[i] = f0;
        f1 += f0;
    }
    return result;
};

COVERPOINT(int, bin_array_cvp, value) {
    bin_array<int>("fibonacci", fibonacci(5))
};

The bin array above expands to these bins:

bin<int>("fibonacci[0]", 1),
bin<int>("fibonacci[1]", 2),
bin<int>("fibonacci[2]", 3),
bin<int>("fibonacci[3]", 5),
bin<int>("fibonacci[4]", 8)
Coverage constructs: coverpoint (1)

- Contains bins with data of interest
- Handles sampling
- `ignore_bin → illegal_bin → bin`

**FC4SC**

```cpp
COVERPOINT(int, datacp, data)
{
  bin<int>("positive", interval(0, 10)),
  bin<int>("negative", interval(-10,0)),
  illegal_bin<int>("illegal_zero", 0)
};
```

**SystemVerilog**

```systemverilog
datacp : coverpoint data
{
  bins positive = {[0:10]};
  bins negative = {[ -10:0]};
  illegal_bins illegal_zero = {0};
}
```
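What a coverpoint does with a sampled value can be sketched independently of the library. The model below is not the FC4SC implementation and uses none of its API. It reads the arrow notation above as an evaluation order (ignore bins first, then illegal bins, then regular bins), and every type and function name in it is hypothetical.

```cpp
#include <string>
#include <vector>

struct Interval { int lo; int hi; };  // closed interval [lo:hi]

enum class BinKind { Ignore, Illegal, Regular };
enum class SampleResult { Ignored, Illegal, Counted, Missed };

struct Bin {
    std::string name;
    BinKind kind;
    std::vector<Interval> ranges;
    unsigned hits;

    bool contains(int v) const {
        for (const Interval& r : ranges)
            if (v >= r.lo && v <= r.hi) return true;
        return false;
    }
};

struct Coverpoint {
    std::vector<Bin> bins;

    SampleResult sample(int v) {
        for (const Bin& b : bins)
            if (b.kind == BinKind::Ignore && b.contains(v)) return SampleResult::Ignored;
        for (const Bin& b : bins)
            if (b.kind == BinKind::Illegal && b.contains(v)) return SampleResult::Illegal;
        bool hit = false;
        for (Bin& b : bins)
            if (b.kind == BinKind::Regular && b.contains(v)) { ++b.hits; hit = true; }
        return hit ? SampleResult::Counted : SampleResult::Missed;
    }
};

// The coverpoint from the example above: positive, negative, illegal_zero.
Coverpoint make_datacp() {
    Coverpoint cp;
    cp.bins.push_back(Bin{"positive", BinKind::Regular, {{0, 10}}, 0});
    cp.bins.push_back(Bin{"negative", BinKind::Regular, {{-10, 0}}, 0});
    cp.bins.push_back(Bin{"illegal_zero", BinKind::Illegal, {{0, 0}}, 0});
    return cp;
}
```

In this model, sampling 0 is reported as illegal and does not count toward `positive` or `negative`, even though both intervals include 0.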
Coverage constructs: coverpoint (2)

Both are evaluated at the point of sampling (dynamically)!
Coverage constructs: cross

- Cartesian product of coverpoints’ bins
- Behaves the same as a coverpoint in all regards

**FC4SC**

```
COVERPOINT(int, cvp1, data1) {
    bin<int>("zero", 0),
    bin<int>("positive", 1, 2, 3)
};
COVERPOINT(int, cvp2, data2) {
    bin<int>("zero", 0),
    bin<int>("negative", -1, -2, -3)
};
auto cvp1_x_cvp2 = cross<int,int>(
    "cvp1_x_cvp2", &cvp1, &cvp2);
```

**SystemVerilog**

```
cvp1 : coverpoint data1 {
    bins zero = {0};
    bins positive = {1, 2, 3};
}
cvp2 : coverpoint data2 {
    bins zero = {0};
    bins negative = {-1, -2, -3};
}
cvp1_x_cvp2 : cross cvp1, cvp2;
```
Coverage constructs: covergroup

- Ties together all coverage constructs
- Dispatches sampling data to coverpoints and crosses

```cpp
class cvg_ex: public covergroup {
public:
  int data;
  COVERPOINT(int, cvp1, data) {
    bin<int>("zero", 0),
    bin<int>("positive", 1, 2, 3)
  };
  CG_CONS(cvg_ex) { /*constructor*/ }
};
```

```systemverilog
covergroup cvg_ex {
  cvp1 : coverpoint data {
    bins zero = {0};
    bins positive = {1, 2, 3};
  }
}
```
Coverage control (1)

• Options
  – adjusting coverage distributions: *weight*
  – setting coverage goals: *goal*, *at_least*
• Sample enable/disable
  – *starting* and *stopping* coverage collection
• Coverage interrogation (at runtime)
  – getting coverage percentage (per type/instance)
  – getting the number of hits
• Usable on: covergroup, coverpoint, cross
class cvg_ex: public covergroup {
public:
   int data;
   CG_CONS(cvg_ex, int w = 100) {
      this->option.weight = w;
   }
   COVERPOINT(int, cp1, data) {
      bin<int>("zero", 0),
      bin<int>("positive", 1, 2, 3)
   };
};
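How `weight` and `at_least` enter a coverage percentage can be shown with a small calculation. The sketch follows the SystemVerilog definition that FC4SC is based on: a bin counts as covered once its hit count reaches `at_least`, an item's coverage is the share of covered bins, and items are combined as a weighted average. Whether FC4SC computes it in exactly this way is an assumption, and the types below are hypothetical rather than library API.

```cpp
#include <vector>

struct CoverItem {
    std::vector<unsigned> bin_hits;  // hit count per bin
    unsigned at_least;               // hits needed for a bin to count as covered
    unsigned weight;                 // contribution to the combined figure
};

// Percentage of bins whose hit count reached at_least.
double item_coverage(const CoverItem& item) {
    if (item.bin_hits.empty()) return 0.0;  // arbitrary choice for this sketch
    unsigned covered = 0;
    for (unsigned h : item.bin_hits)
        if (h >= item.at_least) ++covered;
    return 100.0 * covered / item.bin_hits.size();
}

// Weighted average of the item percentages.
double group_coverage(const std::vector<CoverItem>& items) {
    double weighted_sum = 0.0;
    unsigned total_weight = 0;
    for (const CoverItem& item : items) {
        weighted_sum += item.weight * item_coverage(item);
        total_weight += item.weight;
    }
    return total_weight == 0 ? 0.0 : weighted_sum / total_weight;
}
```

For instance, an item with hit counts {1, 18, 5, 0} is 75% covered and one with {24, 0} is 50% covered. With equal weights the combined figure is 62.5%, and giving the first item a weight of 3 raises it to 68.75%.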
Coverage db management: visualization

JavaScript app: fc4sc/tools/gui/index.html

(Screenshot of the coverage GUI, "Output coverage" view: covergroup types output_coverage, fsm_coverage, and stimulus_coverage; coverage figures 75.00%, 100.00%, and 50.00%; per-bin hit counts for the coverpoints data_ready_cvp "value" (bins zero, positive [1:2147483646], negative [-2147483647:-1], illegal_zero) and output_valid_cvp "valid" (bins valid, invalid))
Coverage db management: creation

• Generate coverage database:
  
  fc4sc::global::coverage_save("coverage_db_name.xml");

• Databases can be generated at any point during runtime!

• Writes to XML file:
  – Complete coverage model
  – All coverage options
  – Number of hits for each bin
Coverage db management: merging

Merge = aggregate the coverage data from different executions

$> python merge.py /path/to/top/directory merged_coverage_db.xml
Coverage db management: reporting

$> python report.py
   --xml_report input_db.xml
   --yaml_out report.yaml
   --report_missing_bins

Special thanks to: Armond Paiva <apaiva@tenstorrent.com>
Conclusions

FC4SC:

● brings the functional coverage from SV domain to SystemC domain
● provides a qualitative metric of the functionality of a SystemC model
● introduces coverage-driven verification as an alternative to test-driven verification
● allows an easy transition from SV syntax
● is easy to integrate into a regression flow
Roadmap

• Default bins
• SystemC integration:
  – Support for coverage over custom data types
  – Event based sampling
• Cross bin filtering: with keyword
• Cross definition: binsof, intersect
• Transition coverage
References

• FC4SC github repository
• IEEE 1800 - 2012 SystemVerilog Standard
• Singhal M. (2015, June 4). What is functional coverage
• INF5430 - SystemVerilog for Verification, Ch. 9 Functional Coverage
Accellera SystemC Working Groups Update

Mike Meredith, Cadence Design Systems
Martin Barnasconi, Accellera Technical Committee Chair
Outline

- Accellera SystemC Working Groups
- IEEE-related SystemC Working Groups
- SystemC Working Groups update
- SystemC Evolution Day
- SystemC Community and Forum
Accellera SystemC Working Groups

• **Language** Working Group (LWG)
• **Transaction-Level Modeling** Working Group (TLMWG)
• **Analog/Mixed-Signal** Working Group (AMSWG)
• **Configuration, Control & Inspection** Working Group (CCIWG)
• **Synthesis** Working Group (SWG)
• **Datatypes** Working Group (SDTWG)
• **Verification** Working Group (VWG)
IEEE-related SystemC Working Groups

- **P1666 (SystemC)**
  - Latest version: IEEE 1666-2011, published 2012-01-09
  - Chair: Jerome Cornet (STMicroelectronics)
  - PAR approved, P1666 WG started at the end of 2018
  - **Call for Participation:** Please contact Jerome Cornet (chair) or Jonathan Goldberg (IEEE) to find out how to join

- **P1666.1 (SystemC-AMS)**
  - Latest version: IEEE 1666.1-2016, published 2016-04-06
  - Chair: Martin Barnasconi (NXP)
  - P1666.1 WG not active at the moment
SystemC Language + TLM WG

- SystemC Reference Implementation version 2.3.3 released in Nov 2018

- LWG is preparing contribution to IEEE P1666

- TLM-CAN contribution from Bosch + STMicroelectronics
  - Discussion started to explore the need for TLM standardization for other serial protocols
SystemC Analog/Mixed Signal

• SystemC AMS User’s Guide
  – Update to make it compatible with IEEE 1666.1 standard
  – Detailed documentation on dynamic TDF features
  – Release expected in Q2 2019

• Development and release of SystemC AMS regression suite
  – Containing many basic and application examples
  – Release expected 2H 2019
SystemC Configuration Control and Inspection

• CCI 1.0.0 released in June 2018, covering Configurability of SystemC models
• CCI Community forum is in place
• Language Reference Manual and supplemental material available
  – Overview tutorial, Reference implementation and 20+ examples
  – Key features
    • Portable information exchange
    • Preloading configuration info
    • Value callbacks & traceability
    • Architected for seamless integration of existing configuration solutions
• More information and download: http://accellera.org/activities/working-groups/systemc-cci
• Next: SystemC Checkpointing
SystemC Synthesis & Datatypes WG

• SystemC Synthesis Subset Language Reference Manual version 1.4.7 (2016) available on Accellera website
  – https://accellera.org/downloads/standards/systemc

• Ongoing discussion to enhance datatypes
  – Different contributions submitted to Accellera
  – Exploring standardization and implementation w.r.t. language, API and performance

• Enhancements for high-level synthesis under discussion
  – E.g. Benefit from modern language constructs in C++1
SystemC Verification Working Group

• UVM-SystemC reference implementation 1.0beta2 released for public review in November 2018
  – Current development focusing on completion of the register abstraction layer

• Next step: introduce Constrained Randomization capabilities, by using CRAVE as add-on library
SystemC Evolution Day

• Successful SystemC Evolution Day held at DVCon Europe October 2018
  – Interactive workshop to discuss evolution of SystemC standards to advance the SystemC eco-system
  – Topics discussed: AMS, CCI, TLM-serial, Multi-language
  – Presentation material available
    https://accellera.org/news/events/systemc-evolution-day-2018

• SystemC Evolution Day 2019 planned for October 31, 2019
  – Call for contributions will open soon, more information:
    https://accellera.org/news/events/systemc-evolution-day-2019
SystemC Community & Forum

- Join the vibrant SystemC Community!

- Accellera SystemC Community pages: https://accellera.org/community/systemc/about-systemc

- Or join any of the Accellera SystemC Working Groups!
Q & A