
Reduce flow runtime

If a flow runs longer than you'd like, the cause is usually one of four things: too much data, redundant work, sequential steps that could parallelize, or a slow upstream connection. Walk through them in this order — fixes higher in the list have larger impact and are easier to apply.

1. Filter early

Push every WHERE clause as far upstream as possible. A flow that reads 10M rows from a connector and then filters down to 50K is doing 200× the work it needs to.

-- ✗ Bad: pull every row, then filter in a downstream recipe
SELECT * FROM raw.events;
-- (the 7-day filter is applied later in the flow, after the full scan)

-- ✓ Good: predicate pushdown into the connector
SELECT * FROM raw.events
WHERE created_at > now() - interval '7 days';
-- if the connector supports it, push this further into the source query

For partitioned datasets, name the partition column in the filter so the engine prunes partitions instead of scanning all of them.
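To make pruning concrete, here is a minimal Python sketch of what the engine effectively does when the filter names the partition column: it selects only the hive-style partition directories inside the time window and never touches the rest. The `created_date=` naming, the partition list, and the fixed `today` are illustrative assumptions, not a platform API.

```python
from datetime import date, timedelta

def prune_partitions(partitions, days=7, today=date(2024, 6, 10)):
    """Keep only hive-style partitions whose date falls in the last `days` days."""
    cutoff = today - timedelta(days=days)
    kept = []
    for name in partitions:
        # partition directory names look like "created_date=2024-06-08"
        value = name.split("=", 1)[1]
        if date.fromisoformat(value) >= cutoff:
            kept.append(name)
    return kept

partitions = [
    "created_date=2024-06-01",  # outside the window: never scanned
    "created_date=2024-06-05",
    "created_date=2024-06-09",
]
print(prune_partitions(partitions))
```

If the filter were written against a non-partition column (say, a raw timestamp the engine can't map to partitions), every directory would be scanned even though most rows are discarded.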

2. Cache intermediate datasets

If two recipes both read from the same upstream and apply the same first transformation, that transformation runs twice per flow. Materialize it once.

In the flow editor: right-click the recipe output → Make persistent. Subsequent runs read from the cached table instead of recomputing.
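The payoff is easiest to see with a counter. This is a toy sketch, not the platform's caching mechanism: `clean` stands in for the shared first transformation, and the counter shows it running once per consumer without materialization versus once per flow run with it.

```python
# Two downstream recipes share one upstream transformation.
calls = {"clean": 0}

def clean(rows):
    calls["clean"] += 1  # count how often the shared step actually runs
    return [r.strip().lower() for r in rows]

raw = ["  Alpha", "BETA  "]

# ✗ Not materialized: each consumer recomputes the transformation
report_a = clean(raw)
report_b = clean(raw)
recomputed_runs = calls["clean"]  # 2

# ✓ Materialized once: both consumers read the cached result
calls["clean"] = 0
cached = clean(raw)
report_a = cached
report_b = cached
cached_runs = calls["clean"]  # 1
```

With ten consumers the gap is 10× on that step, which is why a shared expensive transformation is usually the first thing worth persisting.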

3. Run independent steps in parallel

By default, the flow scheduler runs siblings concurrently up to a per-flow worker limit. Check the flow's Run settings → Max parallel workers. The default of 2 is conservative; raise it to 4 or 8 if you have CPU headroom and the steps don't fight over the same connection.

If two siblings read from the same database, parallelism may not help — the bottleneck is now the connection's query throughput, not your worker count.
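The effect of raising the worker limit can be sketched with a thread pool standing in for the scheduler. The step names and the 0.2 s sleeps below are made up; the point is that four independent I/O-bound steps overlap under 4 workers, so wall time is roughly one step's duration rather than the sum of all four.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def step(name, seconds=0.2):
    # stand-in for an independent flow step (I/O-bound wait)
    time.sleep(seconds)
    return name

siblings = ["build_users", "build_orders", "build_events", "build_sessions"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(step, siblings))
elapsed = time.perf_counter() - start

# Four 0.2 s steps overlap: wall time is ~0.2 s, not the ~0.8 s
# a sequential run (max_workers=1) would take.
print(results, round(elapsed, 1))
```

Note that this only helps because the steps are independent; if all four issued queries against the same saturated database connection, they would serialize there instead.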

4. Pick the right recipe type

Some recipes are 10× faster than others on the same data:

| Operation | Faster | Slower |
| --- | --- | --- |
| Aggregation | SQL group-by recipe | Python `df.groupby` |
| Join | SQL join recipe | Python `merge` |
| Filter | Visual recipe (predicate pushdown) | Python row-level filter |
| Multi-step transformation | dbt model | Chained Python recipes |

Reach for Python only when the SQL/visual recipe genuinely cannot express the operation (regex munging, custom ML inference, calling an external API).
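The aggregation row of the table can be illustrated with Python's built-in sqlite3 standing in for the SQL connector (the table and data are invented). The SQL version lets the engine do the group-by and returns one row per group; the Python version hauls every raw row across first and groups by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 7.0)],
)

# ✓ Aggregation expressed in SQL: the engine does the group-by,
#   and only one row per group crosses the wire
sql_totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()

# ✗ Row-level Python: every raw row is fetched, then grouped in the client
py_totals = {}
for user_id, amount in conn.execute("SELECT user_id, amount FROM events"):
    py_totals[user_id] = py_totals.get(user_id, 0.0) + amount
```

On three rows the difference is invisible; on 10M rows the Python path pays for transferring and iterating every row in the client.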

5. Reduce the working set

If the flow operates on a full historical dataset every run, consider:

  • Incremental processing — only process rows added since the last run. See Incremental vs full refresh for when this is safe.
  • Sampling — during development, run on a 1% sample. The flow's Run on sample toggle does this without code changes.
  • Time-bounded windows — most analytics flows only need the last N days. Hardcode that window unless your dashboards explicitly need full history.
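The incremental option above boils down to keeping a watermark of the newest timestamp already processed and skipping everything at or below it. This sketch shows the pattern only; the row shape and the in-memory `state` dict are illustrative, not a platform API (real flows persist the watermark between runs).

```python
rows = [
    {"id": 1, "ts": 100},
    {"id": 2, "ts": 200},
    {"id": 3, "ts": 300},
]

state = {"watermark": 0}

def run_incremental(rows, state):
    # only handle rows newer than the last run's high-water mark
    new = [r for r in rows if r["ts"] > state["watermark"]]
    if new:
        state["watermark"] = max(r["ts"] for r in new)
    return new

first = run_incremental(rows, state)   # first run: all 3 rows
second = run_incremental(rows, state)  # nothing new: 0 rows
print(len(first), len(second), state["watermark"])
```

This is only safe when rows are append-only or late updates don't matter; that caveat is exactly what the linked incremental-vs-full-refresh guidance covers.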

6. Profile, don't guess

Every flow run produces a per-step timing report under Activity → Run details. Open it. The slowest step is rarely the one you'd guess; spend optimization effort there, not on the steps that already run in seconds.
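If you want the same slowest-first view outside the built-in report, a few lines of timing code reproduce the idea. The step names and sleeps below are invented stand-ins for real recipes; this is a homemade sketch, not the platform's report format.

```python
import time

def profile_steps(steps):
    """Run each (name, fn) step and return (name, seconds), slowest first."""
    timings = []
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings.append((name, time.perf_counter() - start))
    return sorted(timings, key=lambda t: t[1], reverse=True)

steps = [
    ("load", lambda: time.sleep(0.01)),
    ("join", lambda: time.sleep(0.05)),  # the real bottleneck
    ("export", lambda: time.sleep(0.01)),
]
report = profile_steps(steps)
print(report[0][0])  # the step worth optimizing first
```

The head of the sorted list is where your optimization time should go; the tail already runs in seconds.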
