We study a graph processing pipeline that propagates a sequence of stage functions through the nodes of a DAG in topological order. This example shows how task graph parallelism and pipeline parallelism can be combined — and when one model outperforms the other.
Given a DAG where each node must execute three sequential stage functions f1, f2, f3, and where an edge u→v requires fi(u) to complete before fi(v) starts, we want to process all nodes as quickly as possible.
The following figure shows the per-node stage dependency for a three-node DAG with edges A→B and A→C:
One approach is pure task graph parallelism: one task per node that runs f1, f2, f3 sequentially, with a task-level dependency for each edge. This is simple but misses stage-level overlap: for an edge u→v, fi(v) may start as soon as fi(u) finishes, yet v's task must wait for all three stages of u to complete.
A better approach transforms the problem into pipeline parallelism by finding a topological order of the DAG (e.g., A, B, C) and treating each node as a token that flows through a three-stage pipeline. The following figure shows the resulting pipeline execution:
Stages on the same anti-diagonal can execute simultaneously. For example, f3(A), f2(B), and f1(C) can all run in parallel — this is wavefront parallelism over the (stage × node) grid.
We create a three-serial-pipe pipeline. Each pipe calls the stage function for the node identified by the current token index into the topological order:
A pipeline propagates dependencies in one direction only: the current token may depend on previously processed tokens, never on later ones. We satisfy this by feeding nodes in a valid topological order, so that for any edge u→v, u appears before v in the token sequence. In this example we hard-code the order:
In a general application, a topological order can be computed with depth-first search or Kahn's algorithm on the input graph.
Three possible execution orderings for this pipeline:
The task graph for this pipeline is shown below:
This graph processing pipeline technique has been applied to accelerate timing analysis in VLSI circuit design. For details, see: