Incremental vs. full refresh

Every recipe and dataset materialization has a refresh strategy. Pick the right one — the difference between "5 minutes" and "5 hours" usually lives here.

Full refresh

Every run reads the entire upstream and rebuilds the output from scratch.

-- Full refresh recipe
CREATE OR REPLACE TABLE customer_metrics AS
SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total
FROM orders
GROUP BY customer_id;

When it's the right choice:

  • The upstream is small (< ~1M rows)
  • The transformation is non-incremental by nature (e.g. ranking — the rank changes when any new row arrives)
  • The output needs to reflect deletions in the upstream (incremental modes don't see deletes)
  • Simplicity matters more than runtime (early development)

Incremental refresh

Each run reads only rows added/changed since the last run, and merges them into the existing output.

-- Incremental recipe — pseudo-code; real syntax depends on the recipe type
SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total
FROM orders
WHERE created_at > {{ last_run_timestamp }} -- only new rows
GROUP BY customer_id

-- Then MERGE INTO customer_metrics ...
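
For a decomposable aggregation like this one, the merge step can simply add the delta to the previous output. A generic-SQL sketch (MERGE syntax varies by engine; "delta" is a hypothetical staging of the query above):

-- Sketch: merge the delta into the existing output
MERGE INTO customer_metrics AS m
USING delta AS d
  ON m.customer_id = d.customer_id
WHEN MATCHED THEN UPDATE SET
  orders = m.orders + d.orders,  -- counts and sums are associative,
  total  = m.total  + d.total    -- so old output + delta is correct
WHEN NOT MATCHED THEN
  INSERT (customer_id, orders, total)
  VALUES (d.customer_id, d.orders, d.total);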

When it's the right choice:

  • Upstream is large (> ~10M rows) and growing
  • Most rows don't change after they're written (events, logs, transactions)
  • The output computation is associative/decomposable (sums, counts, max/min — yes; medians, ranks, percentiles — no)
  • The flow runs on a schedule and runtime is starting to bite

Watermarks: how incremental works

Incremental needs a watermark column — typically a timestamp like created_at or updated_at, or a monotonically increasing ID. The recipe stores the highest watermark value from the previous run and filters the next run to rows above that value.

Run 1: read all rows → watermark = max(created_at) = '2026-04-25 18:00'
Run 2: read only rows with created_at > '2026-04-25 18:00' → watermark = max(created_at) from this run
Run 3: read only rows with created_at > <watermark from Run 2>
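
The bookkeeping behind this is small. A plain-SQL sketch, assuming a hypothetical one-row state table named _watermarks (real recipes store this state for you):

-- 1. Read the stored watermark (hypothetical state table)
SELECT last_value FROM _watermarks WHERE recipe = 'customer_metrics';

-- 2. Pull only rows above it; strict > avoids re-reading the boundary row
SELECT * FROM orders WHERE created_at > '2026-04-25 18:00';

-- 3. Once the run succeeds, advance the watermark to the max value seen
UPDATE _watermarks
SET last_value = '2026-04-25 19:12'  -- illustrative: max(created_at) from step 2
WHERE recipe = 'customer_metrics';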

Pick a watermark column that:

  • Is never updated retroactively — if a row's created_at can change after creation, you'll miss updates
  • Is indexed in the source — every incremental run does a range scan on this column
  • Has second-level (or finer) precision: a date-only watermark forces a bad trade at the boundary, since a strict > filter misses rows that arrive later on the watermark day, while >= re-reads (and can double-count) that entire day
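
If the source is a relational database you control, the index point above is a one-liner (a sketch; the index name is illustrative):

-- Support the range scan every incremental run performs
CREATE INDEX idx_orders_created_at ON orders (created_at);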

Pitfalls

Out-of-order arrivals

If row #100 (created_at = 18:01) arrives at the source AFTER row #101 (created_at = 18:02), and the flow runs in between, row #100 is missed forever.

Fix: use an _arrived_at column representing when the row landed in the source, not when the event happened. Use _arrived_at as the watermark.
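
A sketch of that fix when the source is a SQL database you can alter (column name and default are illustrative; managed ingestion tools usually populate an equivalent column for you):

-- Record when the row landed in the source, independent of event time
ALTER TABLE orders
  ADD COLUMN _arrived_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

-- Watermark on arrival time instead of event time
SELECT * FROM orders WHERE _arrived_at > {{ last_run_timestamp }};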

Updates in upstream

If orders.amount can be corrected after the fact, incremental misses the correction (the row's created_at didn't change).

Fix: use updated_at as the watermark and treat it as max-changed, not max-created. Make sure the source actually updates updated_at on every row change (most ORMs do; some legacy schemas don't).
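
Re-reading corrected rows and adding them to the old output would double-count them, so with updates in play a common pattern is to fully recompute just the customers touched since the last run. A generic-SQL sketch (MERGE syntax varies by engine):

-- Recompute affected customers from scratch, then replace their output rows
MERGE INTO customer_metrics AS m
USING (
  SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total
  FROM orders
  WHERE customer_id IN (
    SELECT DISTINCT customer_id
    FROM orders
    WHERE updated_at > {{ last_run_timestamp }}  -- max-changed watermark
  )
  GROUP BY customer_id
) AS d
  ON m.customer_id = d.customer_id
WHEN MATCHED THEN UPDATE SET
  orders = d.orders,  -- replace, don't add: these are full recomputes
  total  = d.total
WHEN NOT MATCHED THEN
  INSERT (customer_id, orders, total)
  VALUES (d.customer_id, d.orders, d.total);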

Deletions

If a row is deleted in the source, incremental never sees it — the WHERE created_at > x query just doesn't return the deleted row, which means the previously-computed output still has it.

Fix: either accept "soft deletions only" (the source marks rows deleted=true instead of removing them) so they look like updates, or run a periodic full refresh to reconcile.
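
The soft-delete variant in sketch form (column names are illustrative):

-- The source marks the row instead of removing it,
-- so it surfaces downstream as an ordinary update
UPDATE orders
SET deleted = TRUE,
    updated_at = CURRENT_TIMESTAMP
WHERE order_id = 42;  -- illustrative row

-- The incremental recipe then excludes soft-deleted rows
SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total
FROM orders
WHERE updated_at > {{ last_run_timestamp }}
  AND NOT deleted
GROUP BY customer_id;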

Aggregations that aren't decomposable

A MEDIAN(amount) over the full dataset can't be incrementally updated. Same for ranks and percentiles. Incremental can recompute "rows added since last run" but combining them with the previous output requires the aggregation to be associative.

Fix: full refresh, or accept approximations (e.g. T-Digest sketches for percentiles, which are mergeable).
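
To make the contrast concrete (comments only; mergeable sketch types such as T-Digest or KLL exist as engine-specific extensions, not standard SQL):

-- Mergeable: the new output is a pure function of (old output, delta)
--   total_new  = total_old  + SUM(delta.amount)
--   orders_new = orders_old + COUNT(delta rows)
--   max_new    = GREATEST(max_old, MAX(delta.amount))
-- Not mergeable: MEDIAN over (old rows ∪ delta rows) cannot be derived
--   from MEDIAN(old rows) and MEDIAN(delta rows); you need the raw rows,
--   or a mergeable approximation (a sketch) of them.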

How to choose

Is the upstream < 1M rows? → Full refresh.
Are deletions semantically important? → Full refresh, or implement soft-delete.
Is the aggregation rank/median/percentile? → Full refresh, or sketch.
Is the watermark column reliable? → Incremental.
None of the above? → Start full, switch when runtime hurts.

You can always change strategies later: the public-facing dataset keeps the same name, and downstream consumers don't care which strategy produced it.

See also