We study a graph processing pipeline that propagates a sequence of stage functions through the nodes of a DAG in topological order. This example shows how task graph parallelism and pipeline parallelism can be combined — and when one model outperforms the other.
Problem Formulation
Given a DAG where each node must execute three sequential stage functions f1, f2, f3, and where an edge u→v requires fi(u) to complete before fi(v) starts, we want to process all nodes as quickly as possible.
The following figure shows the per-node stage dependency for a three-node DAG with edges A→B and A→C:
One approach is pure task graph parallelism: one task per node that runs f1, f2, f3 sequentially. This is simple but leaves pipeline stages idle whenever one stage finishes before the next node is ready.
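For comparison, below is a minimal, self-contained sketch of this pure task-graph approach, reusing the same f1, f2, f3 stage functions shown in the implementation further down (the helper lambda node_task is ours, introduced only for illustration). Because each node is a single task, B and C cannot start f1 until A has finished f3, which is exactly where the idle time comes from:

#include <taskflow/taskflow.hpp>

// Pure task graph parallelism (illustrative sketch, not the pipeline example):
// one task per node runs f1, f2, f3 back to back, and an edge u->v means
// v starts only after u has finished all three of its stages.
void f1(const std::string& n) { printf("f1(%s)\n", n.c_str()); }
void f2(const std::string& n) { printf("f2(%s)\n", n.c_str()); }
void f3(const std::string& n) { printf("f3(%s)\n", n.c_str()); }

int main() {
  tf::Taskflow taskflow;
  tf::Executor executor;

  // create one task per node that runs the three stages sequentially
  auto node_task = [&](const std::string& node) {
    return taskflow.emplace([node](){ f1(node); f2(node); f3(node); })
                   .name(node);
  };

  tf::Task A = node_task("A");
  tf::Task B = node_task("B");
  tf::Task C = node_task("C");

  A.precede(B, C);  // edges A->B and A->C of the DAG

  executor.run(taskflow).wait();
}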
A better approach transforms the problem into pipeline parallelism by finding a topological order of the DAG (e.g., A, B, C) and treating each node as a token that flows through a three-stage pipeline. The following figure shows the resulting pipeline execution:
Stages on the same anti-diagonal can execute simultaneously. For example, f3(A), f2(B), and f1(C) can all run in parallel — this is wavefront parallelism over the (stage × node) grid.
- Note
- The relative performance of task graph parallelism vs pipeline parallelism depends on graph size and stage count. A small, wide graph with many short stage functions often favours task graph parallelism; a long chain with expensive stages favours pipelining. Benchmark both on your workload.
Implementation
We create a three-serial-pipe pipeline. Each pipe calls the stage function for the node identified by the current token index into the topological order:
#include <taskflow/taskflow.hpp>
#include <taskflow/algorithm/pipeline.hpp>

// stage functions: each prints the stage and the node it processes
void f1(const std::string& node) { printf("f1(%s)\n", node.c_str()); }
void f2(const std::string& node) { printf("f2(%s)\n", node.c_str()); }
void f3(const std::string& node) { printf("f3(%s)\n", node.c_str()); }

int main() {

  tf::Taskflow taskflow("graph processing pipeline");
  tf::Executor executor;

  // number of concurrent scheduling tokens (pipeline lines)
  const size_t num_lines = 2;

  // a valid topological order of the DAG (edges A->B and A->C)
  const std::vector<std::string> nodes = {"A", "B", "C"};

  // the pipeline consists of three serial pipes, one per stage function
  tf::Pipeline pl(num_lines,
    // first pipe: runs f1 and stops the pipeline after the last node
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      if(pf.token() == nodes.size()) {
        pf.stop();
      }
      else {
        f1(nodes[pf.token()]);
      }
    }},
    // second pipe: runs f2 on the node of the current token
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      f2(nodes[pf.token()]);
    }},
    // third pipe: runs f3 on the node of the current token
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      f3(nodes[pf.token()]);
    }}
  );

  // build the taskflow: init -> pipeline -> done
  tf::Task init = taskflow.emplace([](){ std::cout << "ready\n"; })
                          .name("start");
  tf::Task task = taskflow.composed_of(pl).name("pipeline");
  tf::Task done = taskflow.emplace([](){ std::cout << "done\n"; })
                          .name("stop");
  init.precede(task);
  task.precede(done);

  // run the pipeline once and wait for completion
  executor.run(taskflow).wait();

  return 0;
}
Topological Order
The pipeline only supports dependencies from the current token to a previously processed token (i.e., the pipeline flows in one direction). We satisfy this by feeding nodes in a valid topological order, so that for any edge u→v, u appears before v in the node sequence. In this example we hard-code it:
const std::vector<std::string> nodes = {"A", "B", "C"};
In a general application, topological order can be computed via DFS or Kahn's algorithm on the input graph.
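As a concrete illustration, here is a minimal, self-contained sketch of Kahn's algorithm. The function name topological_order and the adjacency-list representation are ours, introduced only for this sketch; it assumes every node appears either as a key of the adjacency list or as a successor of some key:

#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Kahn's algorithm: repeatedly emit a node with no remaining predecessors.
std::vector<std::string> topological_order(
  const std::unordered_map<std::string, std::vector<std::string>>& adj) {

  // count the incoming edges of every node
  std::unordered_map<std::string, size_t> indegree;
  for(const auto& [u, succs] : adj) {
    indegree.try_emplace(u, 0);
    for(const auto& v : succs) {
      ++indegree[v];
    }
  }

  // start with all nodes that have no predecessors
  std::queue<std::string> ready;
  for(const auto& [u, d] : indegree) {
    if(d == 0) {
      ready.push(u);
    }
  }

  // emit a ready node, then release its successors
  std::vector<std::string> order;
  while(!ready.empty()) {
    std::string u = ready.front();
    ready.pop();
    order.push_back(u);
    auto it = adj.find(u);
    if(it != adj.end()) {
      for(const auto& v : it->second) {
        if(--indegree[v] == 0) {
          ready.push(v);
        }
      }
    }
  }
  return order;
}

For the DAG above, an adjacency list such as {{"A", {"B", "C"}}, {"B", {}}, {"C", {}}} yields {"A", "B", "C"}, one valid order and the same one hard-coded in nodes.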
Sample Output
Two possible execution orderings for this pipeline:
# output 1 — full pipelining
ready
f1(A)
f2(A) f1(B)
f3(A) f2(B) f1(C)
f3(B) f2(C)
f3(C)
done
# output 2 — no overlap (sequential-like)
ready
f1(A) f2(A) f3(A)
f1(B) f2(B) f3(B)
f1(C) f2(C) f3(C)
done
The task graph for this pipeline is shown below:
Reference
This graph processing pipeline technique has been applied to accelerate timing analysis in VLSI circuit design. For details, see:
- Cheng-Hsiang Chiu and Tsung-Wei Huang, "Efficient Timing Propagation with Simultaneous Structural and Pipeline Parallelisms," DAC 2022.