Real Use Cases » Standard Cell Placement

We applied Taskflow to solve a VLSI placement problem. The goal is to determine the physical locations of cells (logic gates) in a fixed layout region while minimizing the total interconnect wirelength.

DREAMPlace: A GPU-accelerated Placement Engine

Placement is an important step in the layout generation stage of a circuit design. It places the cells of a synthesized netlist in a region and optimizes their interconnect. The following figure shows the placement layout of an industrial design, adaptec1.

[Figure: placement layout of the industrial design adaptec1]

Modern placement typically incorporates hundreds of millions of cells and takes several hours to finish. To reduce this long runtime, recent work has investigated new CPU-GPU algorithms. We consider the matching-based hybrid CPU-GPU placement refinement algorithm developed by DREAMPlace. Each iteration of the algorithm performs the following steps:

  • A GPU-based maximal independent set algorithm to identify cell candidates
  • A CPU-based partition algorithm to cluster adjacent cells
  • A CPU-based bipartite matching algorithm to find the best permutation of cell locations

Each iteration overlaps CPU and GPU tasks and uses nested conditions to decide convergence; a minimal Taskflow sketch of this loop structure follows the figure below.

[Figure: one iteration of the hybrid CPU-GPU placement refinement algorithm]
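To make the loop structure concrete, here is a minimal Taskflow sketch, not DREAMPlace's actual code: the task bodies, the iteration cap of 100, and the convergence test are hypothetical placeholders, and only the condition-task pattern mirrors the implementation described above.

// Sketch only: task bodies and the convergence test are hypothetical.
#include <taskflow/taskflow.hpp>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow("placement-refinement");

  int iteration = 0;

  // one copy of the loop body; the condition task below forms the cycle
  auto init  = taskflow.emplace([&](){ iteration = 0; }).name("init");
  auto mis   = taskflow.emplace([](){ /* GPU maximal independent set */ }).name("mis");
  auto part  = taskflow.emplace([](){ /* CPU partition of adjacent cells */ }).name("partition");
  auto match = taskflow.emplace([](){ /* CPU bipartite matching */ }).name("matching");

  // a condition task returns the index of the successor to run next:
  // 0 loops back to mis, 1 proceeds to the exit task
  auto cond  = taskflow.emplace([&]() -> int {
    return (++iteration < 100) ? 0 : 1;   // hypothetical convergence test
  }).name("converged?");

  auto done  = taskflow.emplace([](){}).name("done");

  init.precede(mis);
  mis.precede(part);
  part.precede(match);
  match.precede(cond);
  cond.precede(mis, done);   // the cyclic edge encodes the control flow

  executor.run(taskflow).wait();
}

Because the condition task returns the index of the successor to execute, the loop lives inside a single cyclic graph instead of being unrolled across iterations.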

Programming Effort

We implemented the hybrid CPU-GPU placement algorithm using Taskflow, Intel TBB, and StarPU. The algorithm targets one GPU and many CPUs. Since TBB and StarPU have no support for nested conditions, we unroll their task graphs across fixed-length iterations found in hindsight. The figure below shows a partial taskflow of 4 cudaFlows, 1 conditioned cycle, and 12 static tasks for one placement iteration.

[Taskflow diagram: cudaFlows h2d_constant, [0]mis_h2d, [0]mis_loop_k, and [0]mis_loop_end; the condition task [0]mis_cond forms a cycle back to [0]mis_loop_beg; static tasks include new_net_mask, new_pin2net, the h2d_* copies, [0]shuffle_beg/k/end, [0]mis_parallel_beg, [0]mis_loop_update, [0]hpwl_k, and the del_* cleanup tasks]
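Each cudaFlow in the figure groups GPU operations, host-to-device copies and kernels, into a single task. Below is a rough sketch of how one such cudaFlow, loosely modeled on h2d_constant, could be built. The pointer names, sizes, and kernel are hypothetical, and the cudaFlow-in-taskflow interface shown follows the older Taskflow releases used at the time of this study; newer releases expose GPU graphs differently.

// Sketch only: names, sizes, and the kernel body are hypothetical, and the
// cudaFlow interface shown here follows older Taskflow releases.
#include <taskflow/taskflow.hpp>
#include <taskflow/cuda/cudaflow.hpp>   // header location varies by version

__global__ void mis_kernel(int* pin2net, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { /* per-pin work of the MIS step */ }
}

void add_h2d_constant(tf::Taskflow& taskflow,
                      int* d_pin2net, const int* h_pin2net, size_t n) {
  taskflow.emplace([=](tf::cudaFlow& cf) {
    // host-to-device copy captured as a GPU task inside the cudaFlow
    tf::cudaTask h2d = cf.copy(d_pin2net, h_pin2net, n).name("h2d_pin2net");
    // a dependent kernel that consumes the copied data
    tf::cudaTask k = cf.kernel(static_cast<unsigned>((n + 255) / 256), 256, 0,
                               mis_kernel, d_pin2net, static_cast<int>(n))
                       .name("mis_k");
    h2d.precede(k);
  }).name("h2d_constant");
}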

The table below lists the programming effort of each method, measured by SLOCCount. Taskflow requires the least effort across all metrics: the whole program is about 1.5x and 1.7x less complex than its TBB and StarPU counterparts, respectively.

Method     Lines of Code   Number of Tokens   Cyclomatic Complexity   Cost
Taskflow   677             4180               53                      $15,054
TBB        1000            6415               78                      $21,736
StarPU     1279            8136               90                      $29,686

Performance

Using 8 CPUs and 1 GPU, Taskflow is consistently faster than the other two across all problem sizes (placement iterations). The gap widens at larger problem sizes; at 100 iterations, Taskflow is 17% faster than TBB and StarPU. We observed similar results with other numbers of CPUs. Performance saturates at about 16 CPUs, primarily due to the inherent irregularity of the placement algorithm.

[Figure: runtime comparison of Taskflow, TBB, and StarPU across placement iterations (8 CPUs, 1 GPU)]

The memory footprint shows the benefit of our conditional tasking. Memory usage stays nearly flat as the problem size increases, whereas StarPU and TBB grow linearly due to their unrolled task graphs. Scaling vertically, increasing the number of CPUs raises the memory usage of all methods, but Taskflow's growth rate is much slower than the others'.
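The sketch below, with hypothetical task bodies and a loop bound K, illustrates the underlying reason: without condition tasks every iteration has to be materialized as its own set of tasks, so graph size grows linearly with K, whereas a condition task lets one copy of the loop body serve all iterations.

// Schematic contrast only; not the benchmark code. K and the task bodies are hypothetical.
#include <taskflow/taskflow.hpp>

// Without condition tasks, every iteration must be materialized up front,
// so the number of tasks (and memory) grows linearly with K.
void build_unrolled(tf::Taskflow& graph, int K) {
  tf::Task prev;
  for (int i = 0; i < K; ++i) {
    auto mis   = graph.emplace([](){ /* ... */ });
    auto match = graph.emplace([](){ /* ... */ });
    mis.precede(match);
    if (i > 0) prev.precede(mis);
    prev = match;
  }
}

// With a condition task, one copy of the loop body serves all K iterations,
// so the graph size is independent of K.
void build_cyclic(tf::Taskflow& graph, int K) {
  auto init  = graph.emplace([](){});          // source task providing a strong entry dependency
  auto mis   = graph.emplace([](){ /* ... */ });
  auto match = graph.emplace([](){ /* ... */ });
  auto cond  = graph.emplace([i = 0, K]() mutable -> int {
    return (++i < K) ? 0 : 1;                  // loop back (0) or exit (1)
  });
  auto done  = graph.emplace([](){});
  init.precede(mis);
  mis.precede(match);
  match.precede(cond);
  cond.precede(mis, done);
}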

[Figure: memory footprint across placement iterations and CPU counts]

In terms of energy, our scheduler is very power-efficient in completing the placement workload, regardless of problem size and CPU count. Beyond 16 CPUs, where performance saturates, our system does not suffer from increasing power consumption as StarPU does, thanks to our adaptive task-scheduling algorithm.

[Figure: energy and power comparison across placement iterations and CPU counts]

For irregular task graphs akin to this placement workload, resource utilization is critical to system throughput. We corun the same program with up to nine processes that compete for 40 CPUs and 1 GPU. Corunning placement programs is very common when searching for the best parameters of an algorithm. We plot the throughput as the weighted speed-up across nine coruns at two problem sizes. Both Taskflow and TBB achieve higher throughput than StarPU. At the largest problem size, Taskflow outperforms TBB and StarPU across all coruns.
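Here, weighted speed-up is assumed to follow its standard definition (the text does not spell it out): for P corunning processes,

% Weighted speed-up over P corunning processes (standard definition, assumed)
\text{weighted speed-up} = \sum_{i=1}^{P} \frac{T_i^{\text{alone}}}{T_i^{\text{corun}}}

where T_i^alone is process i's runtime when it runs alone on the machine and T_i^corun is its runtime when all P processes run together; higher is better.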

[Figure: corun throughput measured by weighted speed-up at two problem sizes]

Conclusion

We have observed two significant benefits of Taskflow over existing programming systems. The first is conditional tasking: condition tasks encode control-flow decisions directly in a cyclic task graph rather than unrolling it statically across iterations, which greatly reduces memory usage. The second is the runtime scheduler: it adapts the number of worker threads to the available task parallelism at any point during graph execution, improving performance, power efficiency, and system throughput.
