

# Efficacious Verification of Irreproachable and Steady Data Transfer Protocol for high-speed die-to-die communication on a 3DIC Chip

Varun Kumar C, Associate Staff Engineer, SSIR, Bangalore, India (varun.k2@samsung.com)

Jyoti Verma, Associate Director, SSIR, Bangalore, India (jyoti.verma@samsung.com)

Sekhar Dangudubiyyam, Associate Director, SSIR, Bangalore, India (sekhar.d@samsung.com)

*Abstract-* In the semiconductor world, vertically stacking integrated circuits has emerged as a viable way to meet electronic device requirements: higher packing density and a smaller footprint, shorter global interconnect owing to the short length of Through Silicon Vias (TSVs) and the flexibility of vertical routing, and higher performance owing to shorter interconnect and less circuitry for die-to-die communication. When the top die and bottom die are built on different technology nodes, making the package a heterogeneous 3D package, and both dies contain functional components such as CPU cores and intensive traffic generators, communication across the dies calls for an irreproachable and steady data transfer protocol. To meet these requirements, Samsung internally developed a protocol with a TSV IO based interface, called the D2D interface. This paper discusses the prominent features of the protocol and the detailed verification that enabled its integration on the chip in parallel with the IP development.

Keywords—D2D Interface; High Speed Die-to-Die Communication; 3DIC Technology; 3D Packages; CRC Error; ECC Error; AXI Performance Monitor; Error Injection Modules; Coarse Grain and Fine Grain Delays; Clock Skewer/Jitter inducing component;

#### I. INTRODUCTION

For more than fifty years, Moore's law has predicted the exponential increase in the number of components on a single chip [1]. The problem for chip designers is that Moore's law depends on transistors shrinking, and eventually the laws of physics intervene. The other problem hindering smaller transistors is heat extraction: the more transistors there are on a chip, the more heat it produces [2]. The previous 2.5D technology, also referred to as interposer technology, integrates several electronic devices inside a single package by assembling them side by side on a shared interposer. Taking the technology further, a 3D package offers a new paradigm that stacks multiple tiers of active devices above each other and brings various advantages.

3D heterogeneous stacking of diverse chips/dies is one of the most promising solutions [3]. 3D-ICs promise reductions in wire-length, critical path delay and power consumption compared to a 2D implementation. Other benefits include improved packing density, noise immunity, superior performance and the ability to implement added functionality [4]. Wire length can be reduced by ICs composed of vertically stacked dies that use Through Silicon Vias (TSVs) to connect inter-die signals, as shown in Figure 1. A long horizontal wire can be replaced by much shorter wires on different device layers plus a short vertical connection through TSVs. The reduced footprint area of 3D-ICs helps reduce the total wire-length as well. This technology brings a new dimension to system design and allows integration of diverse functionality and diverse technology. It has been shown that in systems that demand hundreds of GBs of memory bandwidth, power consumption could be reduced significantly by using thousands of low-power TSVs between memory blocks and processors or between processors [5].



Figure 1. Reduction of wire-length between blocks 3 and 5 in 3DIC

Such 3D stacked ICs are interconnected with TSVs, which are the building blocks that enable 3DIC technology. However, one of the major concerns with TSVs is reliability, due to their low yield rates [6], their vulnerability to thermal conditions and stress, and the crosstalk issues of parallel TSVs [7], [8]. A single dynamic defect in a TSV can corrupt the data communication between two layers. Identifying and correcting the corrupted data is therefore necessary to improve the overall yield rate. In addition, because of the higher operating temperature and the high temperature differences between layers [9], the thermal and stress impacts on 3D-IC reliability are also critical and can shorten the expected lifetime. Hence, conventional data transfer across the dies over an AXI interface through experimental TSVs poses several challenges, such as lack of reliability, data loss, a higher pin count (450 pins) and latency issues. Beyond the data loss, the chip may fall into complex die-to-die NOC timeout scenarios, where special registers are required on both dies to mitigate the hang scenarios, along with intelligence about which master has hung and which other master will clear the hang.

Considering these requirements, a new digital IP that removes the need for a PHY and communicates through TSVs has been introduced. The IP has complex TSV IO modules with CRC and ECC error checking and correction, retry, and fine-grain and coarse-grain delay cells.

The D2D TSV based interface is a digital IP for high speed die-to-die communication. It reduces the number of inputs and outputs by downsizing the data received from the AXI interface and transmitting it four times faster. The area can therefore be smaller than that of other die-to-die communication IPs, and data reliability is higher owing to CRC bits in the transferred payload, CRC error detection, a data retry function, bus maintenance and cleaning operations in case of continuous error detection, and ECC error detection and correction in the response payload. SRT supports SDR and DDR in 160bit/40bit transfer. The maximum data rate is about 4GBps. Centre-aligned and edge-aligned sampling with variable delay cell values is supported to overcome clock skew and clock delay in real silicon. It also supports 20 unique interrupts for various types of error scenarios.

The remainder of this paper is organised as follows. Section II describes the stack level verification. Section III presents the complete verification strategy followed in SOC level verification. Section IV provides an insight into the various bugs addressed. Section V describes the performance and coverage results. Sections VI and VII conclude the paper with future work and a summary.

#### II. STACK LEVEL VERIFICATION

To speed up verification of the newly developed protocol, the following testbench was created with all the features in mind.

As Figure 2 shows, D2D has two components, D2D\_Slave\_interface and D2D\_Master\_interface. The slave interface is connected to the AXI4 master VIP and the master interface is connected to the AXI4 slave VIP. Both the master and slave VIPs have internal coverage components. A scoreboard is instantiated to compare the transactions from master to slave. An Interface Monitor is provided for analysing the behaviour of the D2D protocol signals under various conditions. A back-to-back IP wrap is provided with the TSV IO modules. The AXI Channel arbiter is responsible for converting the AXI transactions into a D2D protocol specific payload, which is then sliced into four parts, each with its own CRC bits. This payload is stored in the D2D\_FIFO\_SI/MI, ready to be transferred through the TSVs.
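The slicing step can be pictured with a small Python model. This is only a sketch under stated assumptions: the paper does not give the payload width, CRC width or CRC polynomial, so the 160-bit payload, the 8-bit CRC and the polynomial below are hypothetical placeholders, as are the function names.

```python
from __future__ import annotations

SLICE_COUNT = 4      # from the paper: the payload is sliced into four parts
PAYLOAD_BITS = 160   # assumption: 160-bit payload, giving 40-bit slices
CRC_BITS = 8         # assumption: the CRC width is not given in the paper
CRC_POLY = 0x07      # assumption: CRC-8 polynomial x^8 + x^2 + x + 1


def crc8(value: int, width: int) -> int:
    """Bitwise CRC over `width` bits of `value`, MSB first, zero initial value."""
    crc = 0
    for i in reversed(range(width)):
        bit = (value >> i) & 1
        msb = (crc >> (CRC_BITS - 1)) & 1
        crc = (crc << 1) & ((1 << CRC_BITS) - 1)
        if msb ^ bit:
            crc ^= CRC_POLY
    return crc


def slice_payload(payload: int) -> list[tuple[int, int]]:
    """Return four (data, crc) pairs, least significant slice first."""
    slice_bits = PAYLOAD_BITS // SLICE_COUNT
    mask = (1 << slice_bits) - 1
    slices = []
    for n in range(SLICE_COUNT):
        data = (payload >> (n * slice_bits)) & mask
        slices.append((data, crc8(data, slice_bits)))
    return slices


def rebuild_payload(slices: list[tuple[int, int]]) -> tuple[int, list[int]]:
    """Reassemble the payload and list the slice indices that fail the CRC check."""
    slice_bits = PAYLOAD_BITS // SLICE_COUNT
    payload, bad = 0, []
    for n, (data, crc) in enumerate(slices):
        if crc8(data, slice_bits) != crc:
            bad.append(n)
        payload |= data << (n * slice_bits)
    return payload, bad
```

Because every slice carries its own CRC, a receiver modelled this way can tell which of the four parts was corrupted, which is the information a retry mechanism needs.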



Figure 2. Testbench Architecture for Stack verification

The scalable testbench provided a suitable verification environment at the stack level. The following features were verified with this setup:

## A. DRCG (Dynamic Root Clock Gating)

To reduce the power consumption of the D2D clock lines, the D2D clock cell can be disabled through gating modules by configuration. Transactions were issued after DRCG enable and disable, and the required clock gating was verified in the interface monitor.

## B. SDR and DDR operational modes

The standard data sampling modes were verified together with the required latency checks. Every operational mode was expected to meet the minimum latency requirement; the expected average latency was 7 ns.

## C. Delay Line modules

To mitigate the clock skew induced by gates, the delay line modules can be configured with various values so that the payload is sliced and reconstructed correctly.

## D. Dynamic mode switching

After initialization of the interface, mode switches between SDR, DDR and various delay line values were verified.

## E. Bus transaction monitoring

All the transactions can be monitored through dedicated registers. Verification involved continuous polling of these registers through the APB master VIPs during D2D interface operation.

## F. VIP coverage

The master should generate different types of D2D transactions with varying bus parameters such as ID, burst length, burst size, burst type and multiple outstanding transactions. The coverage modules provided internally in the VIPs helped in analyzing the coverage results.
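The idea of randomizing bus parameters and measuring which combinations were hit can be sketched in Python. This is a model of the approach, not the VIP's actual API: `AxiTxn`, `random_txn` and `cross_coverage` are hypothetical names, and the 4-bit ID space and 128-bit data bus are assumptions. The burst length limits follow the AXI4 rules (INCR up to 256 beats, FIXED up to 16, WRAP of 2, 4, 8 or 16).

```python
import random
from dataclasses import dataclass

BURST_TYPES = ("FIXED", "INCR", "WRAP")
BURST_SIZES = (1, 2, 4, 8, 16)   # bytes per beat; assumes a 128-bit data bus
ID_COUNT = 16                    # assumes 4-bit transaction IDs


@dataclass(frozen=True)
class AxiTxn:
    txn_id: int
    burst_type: str
    burst_len: int    # number of beats
    burst_size: int   # bytes per beat


def random_txn(rng):
    """Draw one transaction whose burst length is legal for its burst type."""
    burst_type = rng.choice(BURST_TYPES)
    if burst_type == "WRAP":
        burst_len = rng.choice((2, 4, 8, 16))
    elif burst_type == "FIXED":
        burst_len = rng.randint(1, 16)
    else:
        burst_len = rng.randint(1, 256)
    return AxiTxn(rng.randrange(ID_COUNT), burst_type, burst_len,
                  rng.choice(BURST_SIZES))


def cross_coverage(txns):
    """Fraction of (burst type, burst size) combinations seen at least once."""
    hit = {(t.burst_type, t.burst_size) for t in txns}
    return len(hit) / (len(BURST_TYPES) * len(BURST_SIZES))
```

A run would draw transactions with `random_txn(random.Random(seed))` and stop once `cross_coverage` reaches 1.0, mirroring how coverage results steer the number of transactions issued.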



#### III. SOC LEVEL VERIFICATION

Figure 3 gives a comprehensive view of the SOC architecture in the 3DIC chip. The traffic from all the traffic generators and the CPU cores passes through the D2D block. The testbench components include AXI Performance Monitors, an Error Injection Module and a Clock Skewer/Jitter inducing component.



Figure 3. Architectural Overview of a 3DIC Chip

Since the D2D interface discussed above was a first-time solution for die-to-die communication on a 3D chip, as opposed to PCIE, UCIE, Memory CXL controllers and similar solutions on 2.5D technology, various new strategies were developed and adopted to provide a complete verification solution for the interface at the SOC level.

## A. Real Time Application scenarios

Multiple masters were present in both the top and bottom dies, capable of generating traffic simultaneously and independently. Concurrent traffic from all the masters had to be generated to verify the quality of the channel arbitration supported by the D2D interface. This involved traffic generation from all the CPU cores, the traffic generators and the PCIE master to verify the stress scenario and understand the back pressure, if any.

## B. Clock Jitter and Skew introduction

Since the 3D package involves two different crystal oscillators, one in the top die and one in the bottom die, and the parts-per-million (ppm) tolerance of the clocks varies, variations such as skew and jitter are introduced. Clock skew is the temporal difference between the arrival of the same clock edge at the clock pins of the launch and capture flops. Clock jitter is the deviation of a clock edge from its ideal position in time. The new protocol must be verified against such effects, and hence a clock skewer and jitter inducing component was introduced. The D2D interface supported tuning of coarse-grain and fine-grain delays in the DDR and SDR clocks, and these were automatically configured through the CPU cores. Challenges were faced in selecting the coarse-grain and fine-grain delays for the various values presented by the delay cell modules.
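The definitions of skew and jitter above can be made concrete with a small Python model of clock edge times. This is an illustrative sketch, not the clock skewer component itself: the function names, the uniform jitter distribution and all numeric values are assumptions.

```python
import random


def clock_edges(freq_mhz, ppm, skew_ns, jitter_ns, count, seed=0):
    """Return `count` rising-edge times in ns.

    ppm       : frequency offset of this oscillator in parts per million
    skew_ns   : fixed arrival offset applied to every edge
    jitter_ns : each edge moves by a uniform random amount in
                [-jitter_ns, +jitter_ns] (assumed distribution)
    """
    rng = random.Random(seed)
    period_ns = 1e3 / (freq_mhz * (1.0 + ppm * 1e-6))
    return [skew_ns + n * period_ns + rng.uniform(-jitter_ns, jitter_ns)
            for n in range(count)]


def edge_offsets(launch, capture):
    """Per-edge time difference (capture minus launch) in ns."""
    return [c - l for l, c in zip(launch, capture)]


if __name__ == "__main__":
    # Hypothetical values: a 1 GHz top-die clock and a bottom-die clock that
    # is 100 ppm fast, arrives 0.2 ns late and has 0.02 ns of jitter.
    top = clock_edges(1000.0, ppm=0, skew_ns=0.0, jitter_ns=0.0, count=1001)
    bottom = clock_edges(1000.0, ppm=100, skew_ns=0.2, jitter_ns=0.02,
                         count=1001, seed=7)
    offsets = edge_offsets(top, bottom)
    print(f"offset at first edge:     {offsets[0]:.3f} ns")
    print(f"offset after 1000 cycles: {offsets[-1]:.3f} ns")
```

In this model a ppm mismatch makes the offset between the two clocks drift over time, while skew shifts it by a constant and jitter adds per-edge noise; the coarse-grain and fine-grain delay settings have to absorb the combination of all three.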

## C. Variable Clock Ratios with traffic stress

The operational clocks of D2D\_master\_interface and D2D\_slave\_interface were independent, and the D2D interface supported various clock ratios. Bus transactions were initiated from multiple masters to create traffic stress, and all the supported clock ratios were verified. This helped establish the robustness of the flow control provided by the protocol.

## D. System level tests for ATE

Since the D2D interface will be used in a 3DIC environment, various loopback modes were provided to verify the interface before die integration and in a single IP environment. These are internal loopback, external loopback and remote loopback. The D2D IP can be configured to generate patterns internally in various modes: pseudo random, toggle, step-up and user-defined. All the possible combinations were tested, with the mismatch register polled throughout the pattern generation period.
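The four pattern modes and the mismatch check can be modelled in a few lines of Python. This is a behavioural sketch only: the word width, the LFSR used for the pseudo random mode and all function names are assumptions, since the paper does not specify how the IP generates its patterns.

```python
from itertools import cycle, islice


def toggle(width):
    """Alternate between all-zeros and all-ones words."""
    mask = (1 << width) - 1
    word = 0
    while True:
        yield word
        word ^= mask


def step_up(width, start=0):
    """Incrementing words that wrap around at 2**width."""
    mask = (1 << width) - 1
    word = start & mask
    while True:
        yield word
        word = (word + 1) & mask


def pseudo_random(width, seed=0xACE1):
    """Words built from a 16-bit Galois LFSR (assumed; feedback mask 0xB400)."""
    state = (seed & 0xFFFF) or 1   # an all-zero LFSR state would never change
    while True:
        word = 0
        for _ in range(width):
            lsb = state & 1
            state >>= 1
            if lsb:
                state ^= 0xB400
            word = (word << 1) | lsb
        yield word


def user_defined(words):
    """Repeat a caller-supplied list of words."""
    return cycle(words)


def count_mismatches(pattern, loopback, n_words):
    """Send n_words through `loopback` and count words that come back changed."""
    sent = list(islice(pattern, n_words))
    return sum(1 for word in sent if loopback(word) != word)
```

A healthy loopback path is modelled by `lambda w: w` and yields zero mismatches; a stuck-at fault such as `lambda w: w & ~1` shows up as a non-zero count, which is what polling the mismatch register during pattern generation would reveal.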

## E. Cross die interrupt handling

The D2D protocol supported more than 20 interrupts for various link error features, bus maintenance and loopbacks. Interrupt assertion and servicing was expected from the master interface, the slave interface, or both in certain conditions, and these interrupts were acknowledged and serviced by the CPU cores present in both dies. The SOC under consideration had two GIC600s, one in the top die and one in the bottom die; communication between the GIC600s happens through an AXI stream interface (ICDR and ICRD). Real-time scenarios in which interrupt generation and acknowledgement occur in the top die but servicing is done in the bottom die, and vice versa, were verified.

## F. Low power verification

The chip under consideration is a heterogeneous chip whose top and bottom dies use different node technologies and therefore operate at different voltages. The D2D block alone had the most complex power architecture, with different voltage rails provided for the PLLs, the D2D core, TSV\_TX\_IO and TSV\_RX\_IO. This resulted in a large number of power domains, internal power modes and SOC power modes. Since TSVs carry the data between the dies and the top die, built on the lower node, operated at a lower voltage, voltage level shifters were provided in the bottom die TSV\_IO domain to avoid data corruption due to voltage changes and thermal conditions. Verification of such a complicated power architecture involved the following checks:

- The data transfer across the dies in all the power modes.
- X propagation with only one block powered down.
- SOC power modes dynamic switching during the D2D interface operation.
- Data loss and data retention with a single block powered on.
- Q channel handshakes during the power verification.

## G. Link Reliability features and Bus maintenance

The main future challenge for a TSV technology based D2D protocol relates to its ability to maintain performance parameters, such as signal integrity and heat management, as data rates climb [10]. Hence, the protocol was built with reliability features such as ECC blocks with QEDTEC (Quad Error Detection and Triple Error Correction), CRC error detection with retry, and bus maintenance during reliability failures. Verifying the data retry function, the bus cleaning operations and the general behaviour of the IP in error scenarios required complicated error injection mechanisms. The entire 3D package has multiple instances of the D2D interface, so error injection through forces would require a lot of coding effort. For this reason, an error injection module was developed that is compliant with the D2D interface, can be scaled to any payload based protocol, takes user input at runtime, and injects errors in both the payload and the response. The module can inject a configured number of errors at the required position in the payload. The Error Injection Module is presented in Figure 4.



Figure 4. Error injection module

The module had many runtime configurable parameters, among them injection enable, number of CRC errors, CRC error bit position, ECC error, ECC error type (single, double, triple, uncorrectable) and payload type (AW, AR, W, B, R). This enabled a plethora of combinations in error scenario verification. The error condition behaviour was monitored in the sequences through registers, counters and interface signals. The AXI interface was also monitored for error responses, since the D2D interface can generate an error response during bus cleany operations. This enabled robust verification of error scenarios without repeated forces and releases.
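The behaviour of such a runtime-configurable injector can be sketched in Python. This is a simplified model under stated assumptions, not the module shown in Figure 4: the class and field names are hypothetical, only a subset of the listed parameters is modelled, and corrupting consecutive bits starting at the configured position is an assumed policy.

```python
from dataclasses import dataclass

PAYLOAD_TYPES = ("AW", "AR", "W", "B", "R")


@dataclass
class InjectionConfig:
    enable: bool = False       # injection enable
    payload_type: str = "W"    # which channel's payload to corrupt
    num_errors: int = 1        # number of bits to flip
    bit_position: int = 0      # position of the first flipped bit


def inject(payload: int, width: int, payload_type: str,
           cfg: InjectionConfig) -> int:
    """Return the payload, with bits flipped if the configuration targets it."""
    if payload_type not in PAYLOAD_TYPES:
        raise ValueError(f"unknown payload type: {payload_type}")
    if not cfg.enable or payload_type != cfg.payload_type:
        return payload
    for k in range(cfg.num_errors):
        # Assumed policy: flip consecutive bits, wrapping at the payload width.
        payload ^= 1 << ((cfg.bit_position + k) % width)
    return payload
```

For example, `inject(0, 40, "W", InjectionConfig(enable=True, num_errors=2, bit_position=3))` flips bits 3 and 4, while payloads of any other type pass through untouched. Placing the corruption in the data path behind a configuration object is what removes the need for per-instance forces and releases.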

#### IV. ISSUES AND BUGS ADDRESSED

The extensive verification strategy provided a suitable environment for finding bugs and issues at an early stage of verification. The issues found are grouped in this section.

- In the stack level verification environment, verifying the protocol with a wide range of easily configurable AXI parameters, such as number of beats and burst sizes, exposed protocol violations, which were then addressed.
- Transactions hung during multi-master D2D interface access, where both ACE masters and AXI masters issued concurrent traffic.
- Forming the payload correctly for various coarse-grain and fine-grain delay values under induced clock skew and clock jitter produced a large number of combinations and gave insight into the complete flow control. Various issues related to improper payload formation and hanging transactions were addressed.
- In low power verification, an issue in the P channel and Q channel handshake [11] was found, caused by ambiguity in deciding the end of transactions across the dies through the D2D interface.
- The reliability features of the protocol, together with the error injection module, provided a complete verification setup and exposed various bugs: incorrect register behaviour for ECC and CRC errors, missing CRC error injection through SFR, bus cleany not being asserted due to an issue in the ECC module count logic, ECC uncorrectable errors not triggering the cleany operation, transactions lost during bus cleany, and the response error not being generated for a bus cleany caused by recurring CRC errors.

All the issues were addressed and fixes were provided in the subsequent design releases.



#### V. RESULTS

The data from the functional verification and performance verification are provided in this section.

The functional coverage data included cover points for all the features of the interface; most importantly, the various coarse-grain and fine-grain delays were covered. A captured snippet is provided in Figure 5.



Figure 5. Functional Coverage data capture

As discussed previously, there are multiple instances of the D2D IP throughout the SOC, so performance verification across the instances is necessary. VIP independent, module based performance monitors were developed. The performance monitors are SV bind modules, which makes integration easy: once a bus module of interest is bound to the monitor, the monitor is instantiated alongside every instance of that module in the system and tracks its performance.

The performance numbers captured for the D2D interface are shown in Figure 6.



Figure 6. D2D Interface performance numbers



Figure 7 represents the latency captured with the help of performance evaluators.

Output of the `axi_if_checker.u_AXI_LATENCY_PERF_MON` performance monitor (ACLK frequency = 992.06385 MHz):

| Metric                       | Write (WR)           | Read (RD)            |
|------------------------------|----------------------|----------------------|
| Accumulate Latency           | 9776 ns (9698 cycle) | 8645 ns (8576 cycle) |
| Average Latency              | 76 ns (75 cycle)     | 68 ns (67 cycle)     |
| Peak Latency                 | 78 ns (77 cycle)     | 92 ns (91 cycle)     |
| Request Number               | 128                  | 128                  |
| Write / Read data count      | 2048                 | 2048                 |
| Execution cycle              | 15627                | 2589                 |
| Pending cycle                | 2522                 | 2202                 |
| System utilization           | 13                   | 79                   |
| Utilization                  | 81                   | 93                   |
| Average MO                   | 3                    | 3                    |
| Maximum MO                   | 4                    | 4                    |
| Blocking ratio               | 0                    | 0                    |
| Total data transfer (Bytes)  | 32768                | 32768                |
| Average BL                   | 16                   | 16                   |
| Maximum BL                   | 16                   | 16                   |
| Average BandWidth in GBits/s | 27                   | 30                   |

Figure 7. D2D Interface latency numbers
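As a cross-check, the derived figures in Figure 7 can be recomputed from the raw counters with a few lines of Python. The formulas below are inferred from the reported numbers rather than stated in the paper; in particular, treating average bandwidth as total bits divided by accumulated latency is an assumption that happens to reproduce the reported 27 and 30 GBits/s. The function and key names are hypothetical.

```python
ACLK_MHZ = 992.06385   # ACLK frequency reported by the monitor


def derived_metrics(accum_latency_ns, accum_latency_cycles,
                    requests, beats, total_bytes):
    """Recompute averages from the raw counters of one direction (WR or RD)."""
    return {
        "avg_latency_ns": accum_latency_ns / requests,
        "avg_latency_cycles": accum_latency_cycles / requests,
        "avg_burst_length": beats / requests,
        "bytes_per_beat": total_bytes / beats,
        # bits per ns equals Gbit/s; the choice of denominator is an assumption
        "bandwidth_gbps": total_bytes * 8 / accum_latency_ns,
    }


wr = derived_metrics(9776, 9698, requests=128, beats=2048, total_bytes=32768)
rd = derived_metrics(8645, 8576, requests=128, beats=2048, total_bytes=32768)

if __name__ == "__main__":
    for name, metrics in (("WR", wr), ("RD", rd)):
        for key, value in metrics.items():
            print(f"{name} {key}: {value:.3f}")
```

Under these assumptions the counters are self-consistent: 128 requests of 16 beats give 2048 beats, 2048 beats of 16 bytes give 32768 bytes, and the cycle counts multiplied by the ACLK period of roughly 1.008 ns match the accumulated latencies in nanoseconds.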

#### VI. FUTURE WORKS

Detecting and debugging deep sequential CDC (Clock Domain Crossing) convergences using structural CDC verification is extremely difficult, since flat analysis of die-to-die 3D designs faces TSV reliability (power and capacitance) related challenges. Even if verification tools can complete the analysis, debugging the violations in complex sequential logic is very hard. Dynamic CDC verification using a metastability injection (CDC jitter) mechanism during simulation is therefore required, and a dynamic CDC jitter verification solution with SDC can be implemented.

Generating the concurrent traffic scenarios from all the masters in an SOC can result in very long running tests. Such traffic stress and back pressure scenarios can be targeted on the emulation systems.

#### VII. CONCLUSION

Die-to-die interfaces serve major use cases in applications such as HPC, networking, hyper scale data centers and artificial intelligence (AI). With high performance data transfer, a reduction in the number of inputs and outputs (which in turn reduces the number of TSV lines required), area reduction, power reduction and various data integrity checks, the presented D2D protocol offers better die-to-die communication. The complexity and enormous impact of verification came from the fact that the protocol must be validated against several objectives, such as correct functionality, timing, power, energy consumption and reliability, in pre-silicon and post-silicon stages before it can be used in hardware devices. Master-like VIPs, which are typically used for SOC level verification, were used to provide an efficient verification solution, since reusing the sequences at the SOC level required only changing the VIP master names. Complete code coverage and functional coverage solutions were provided. The all-inclusive verification strategy at both the stack level and the SOC level, with new methods in clock scenario verification, loopback verification, performance verification, CRC and ECC error verification and functional verification, guaranteed the success of the new protocol.



#### REFERENCES

- [1] Ding-Ming Kwai and Cheng-Wen Wu, "3D Integration Opportunities, Issues and Solutions", Proc. of SPIE Vol. 7520 752003-1.
- [2] Moore's Law | Definition, history and limitations [Online]. Available: https://www.financestrategists.com/wealth-management/moores-law/.
- [3] S. Borkar, "3D Integration for Energy Efficient System Design", 48th DAC, 5-9 June 2011, pp. 214-219.
- [4] Shinde, Ambadas, "3D VLSI Technology", 2013, unpublished.
- [5] M. Chrzanowska-Jeske and Mohammad A. Ahmed, "Power efficiency of 3D vs 2D ICs", Electrical and Computer Engineering, Portland State University.
- [6] J. U. Knickerbocker, P. S. Andry, B. Dang, R. R. Horton, M. J. Interrante, C. S. Patel, et al., "Three-dimensional silicon integration", IBM J. Res. Develop., vol. 52, no. 6, pp. 553-569, Nov. 2008.
- [7] G. Van der Plas et al., "Design issues and considerations for low-cost 3-D TSV IC technology", IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 293-307, Jan. 2011.
- [8] F. Ye and K. Chakrabarty, "TSV open defects in 3D integrated circuits: Characterization test and optimal spare allocation", Proc. 49th Annu. Design Autom. Conf., pp. 1024-1030, 2012.
- [9] D. Cuesta, J. L. Risco-Martín, J. L. Ayala and J. I. Hidalgo, "Thermal-aware floorplanner for 3D IC including TSVs liquid microchannels and thermal domains optimization", Appl. Soft Comput., vol. 34, pp. 164-177, Sep. 2015.
- [10] Frost & Sullivan, "Global Advances in Electronic/Chip Packaging (Technical Insights)", December 31, 2007. Available: http://www.frost.com/prod/servlet/report-homepage.page
- [11] Arm Corstone SSE-300 Example Subsystem Technical Reference Manual r0p1 [Online]. Available: https://developer.arm.com/documentation/101773/0001/Interfaces/P-Channel-and-Q-Channel-Device-interfaces/