Real Use Cases » Static Timing Analysis

We have applied Taskflow to solve a real-world VLSI static timing analysis problem that incorporates hundreds of millions of tasks and dependencies. The goal is to analyze the timing behavior of a design.

OpenTimer: A High-performance Timing Analysis Tool

Static timing analysis (STA) is an important step in the overall chip design flow. It verifies the static behavior of a circuit design and ensures its correct functionality under the given clock speed. However, efficient parallel timing analysis is extremely challenging to design and implement, due to its large irregularity and graph-oriented computing pattern. The following figure shows a timing graph extracted from an industrial design.


We consider our research project OpenTimer, an open-source static timing analyzer that has been used in many industrial and academic projects. The first release, v1 in 2015, implemented a pipeline-based levelization algorithm using the OpenMP 4.5 task dependency clause. To overcome the performance bottleneck caused by the pipeline, we rewrote the core incremental timing engine using Taskflow in the second release, v2.
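The shift in programming style can be illustrated with a minimal sketch. The snippet below is not OpenTimer's actual code; the pin names and task bodies are placeholders. It shows how Taskflow lets the engine state task dependencies directly on task handles, instead of encoding them in OpenMP depend clauses:

```cpp
#include <taskflow/taskflow.hpp>  // Taskflow is header-only

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow("timing_update");

  // Hypothetical propagation tasks; real bodies would update pin timing.
  tf::Task fprop_inp1 = taskflow.emplace([](){ /* forward-propagate inp1 */ });
  tf::Task fprop_u1_A = taskflow.emplace([](){ /* forward-propagate u1:A */ });
  tf::Task fprop_u1_Y = taskflow.emplace([](){ /* forward-propagate u1:Y */ });

  // Dependencies are declared directly on tasks, mirroring the timing graph.
  fprop_inp1.precede(fprop_u1_A);
  fprop_u1_A.precede(fprop_u1_Y);

  executor.run(taskflow).wait();  // schedule the graph and wait for completion
}
```

Because each dependency is one `precede` call on a task handle, the graph structure stays visible in the code, which is what drives the code-size reduction discussed below.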

Programming Effort

The table below measures the software costs of the two OpenTimer versions using the Linux tool SLOCCount. In OpenTimer v2, the large number of exhaustive OpenMP dependency clauses used to carry out task dependencies is replaced with only a few lines of flexible Taskflow code (9123 lines in v1 vs 4482 in v2). The maximum cyclomatic complexity in a single function is reduced from 58 to 20, due to Taskflow's programmability.

Tool          Task Model   Lines of Code   Cyclomatic Complexity   Cost
OpenTimer v1  OpenMP 4.5   9123            58                      $275,287
OpenTimer v2  Taskflow     4482            20                      $130,523

OpenTimer v1 relied on a pipeline data structure to adopt loop parallelism with OpenMP. We found it very difficult to go beyond this paradigm because of the insufficient support for dynamic dependencies in OpenMP. With Taskflow in place, we can break this bottleneck and easily model both static and dynamic task dependencies at programming time and runtime. The task dependency graph flows computation naturally along the timing graph, providing improved asynchrony and performance. The following figure shows a task graph that carries out one iteration of timing update.

[Figure: Taskflow task dependency graph for one iteration of timing update, consisting of forward-propagation (fprop) tasks starting from the clock source (fprop_tau2015_clk) through the circuit pins, followed by backward-propagation (bprop) tasks ending at the clock source (bprop_tau2015_clk).]
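The dynamic dependencies that OpenMP could not express map to Taskflow's subflow tasking. The sketch below is hypothetical, not OpenTimer's code: a static task whose body spawns further tasks discovered only at runtime, for example when the set of pins to update is known only during execution:

```cpp
#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  // A static task that dynamically spawns child tasks at runtime
  // through the tf::Subflow handle passed into its body.
  tf::Task update = taskflow.emplace([](tf::Subflow& sf){
    tf::Task fprop = sf.emplace([](){ /* forward propagation work */ });
    tf::Task bprop = sf.emplace([](){ /* backward propagation work */ });
    fprop.precede(bprop);  // bprop runs only after fprop completes
  });

  executor.run(taskflow).wait();
}
```

The subflow joins back to its parent task by default, so the dynamically created graph composes with the static one without any manual bookkeeping.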

Performance Improvement

We compare the performance of OpenTimer v1 and v2. We evaluated the runtime versus incremental iterations under 16 CPUs on two industrial circuit designs, tv80 (5.3K gates and 5.3K nets) and vga_lcd (139.5K gates and 139.6K nets), with the 45nm NanGate cell libraries. Each incremental iteration refers to a design modification followed by a timing query that triggers a timing update. In v1, this includes the time to reconstruct the data structure required by OpenMP to alter the task dependencies. In v2, it includes the time to create and launch a new task dependency graph.


The scalability of Taskflow is shown in the figure below. We used two million-scale designs, netcard (1.4M gates) and leon3mp (1.2M gates), to evaluate the runtime of v1 and v2 across different numbers of CPUs. There are two important observations. First, v2 is slightly slower than v1 (by 3-4%) at one CPU, where all of OpenMP's constructs are effectively disabled. This shows the graph overhead of Taskflow, which is negligible. Second, v2 is consistently faster than v1 at every other CPU count. This highlights that Taskflow's programming model largely improves the design of a parallel VLSI timing engine that would otherwise not be possible with OpenMP.



Programming models matter. Different models lead to different implementations. The parallel code sections may run fast, yet the data structures required to support a parallel decomposition strategy may outweigh its parallelism benefits. In OpenTimer v1, the loop-based OpenMP code is very fast, but maintaining the pipeline data structure across iterations is too costly.