Flow stuck running

Symptom: a flow has been "Running" for hours when it normally finishes in minutes. Diagnose in this order — the first three checks resolve 90% of cases.

1. Is a step actually stuck, or just slow?

Open the flow's Activity → Current run. Each step shows its status (Pending / Running / Done) and elapsed time. Compare to the same step's average runtime in History.

  • Step elapsed time matches the historical average → not stuck, just slow. See Reduce flow runtime.
  • Step elapsed time is 10× or more the historical average → genuinely stuck; continue.
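The 10× heuristic can be expressed as a tiny script. A sketch only — the `is_stuck` helper and minute-based inputs are illustrative, not part of Honeyframe:

```shell
# Hypothetical helper: decide whether a step is stuck, given its current
# elapsed time and its historical average runtime (both in whole minutes).
is_stuck() {
  elapsed=$1
  avg=$2
  # Stuck if elapsed is at least 10x the historical average
  if [ "$elapsed" -ge $((avg * 10)) ]; then
    echo stuck
  else
    echo slow
  fi
}

# Example: a step that averages 4 minutes has been running for 50
is_stuck 50 4   # -> stuck
```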

2. Is the upstream connection alive?

If the stuck step reads from a connector, the connector might be hung. From a terminal:

# Test the source independently
ssh root@<honeyframe-host> "psql -h <source-host> -c 'SELECT 1' -t" # or whatever the source speaks

If the test query also hangs, the source is the problem — not Honeyframe. Either wait for the source to recover or kill the flow run and resume after the source is back.
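To keep the diagnostic itself from hanging your terminal, bound the probe with coreutils timeout. A minimal sketch — the psql invocation in the usage comment mirrors the one above, and the host names stay placeholders:

```shell
# Run a source probe under a time limit so the check itself can't hang.
# probe <seconds> <command...> - exit status 124 means the command timed out.
probe() {
  secs=$1; shift
  timeout "$secs" "$@"
}

# Example against a Postgres source (host is a placeholder):
# probe 10 psql -h <source-host> -c 'SELECT 1' -t
# if [ $? -eq 124 ]; then echo "source hung - problem is upstream"; fi
```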

3. Is the worker process alive?

# Check that flow workers are running and not zombies
ssh root@<honeyframe-host> 'systemctl status hub-platform-worker'

# Check for stuck Python processes
ssh root@<honeyframe-host> 'ps aux | grep python | grep flow'

If the worker process exited but the database still shows the run as "Running", you have an orphan run. Mark it as failed manually:

-- Connect to the platform DB
UPDATE hubstudio.flow_runs
SET status='failed', failed_reason='worker_orphaned', finished_at=now()
WHERE id=<run_id> AND status='running';

Then restart the worker: systemctl restart hub-platform-worker. The next scheduled run picks up cleanly.
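The orphan condition — DB says "running" while the worker unit is no longer active — can be checked mechanically. A hedged sketch: `is_orphan` is a hypothetical helper, and the wiring in the comments assumes `systemctl is-active` plus the flow_runs schema shown above:

```shell
# Hypothetical helper: a run is orphaned when the database still reports
# status "running" but the worker unit is not active.
# Usage: is_orphan <systemd-active-state> <db-run-status>
is_orphan() {
  unit_state=$1
  run_status=$2
  if [ "$run_status" = "running" ] && [ "$unit_state" != "active" ]; then
    echo orphan
  else
    echo ok
  fi
}

# In practice you would feed it live values, e.g.:
#   state=$(ssh root@<honeyframe-host> 'systemctl is-active hub-platform-worker')
#   status=$(psql -t -A -c "SELECT status FROM hubstudio.flow_runs WHERE id=<run_id>")
#   is_orphan "$state" "$status"
```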

4. Is the database in deadlock?

Two flows or a flow + an interactive query can deadlock on the same table.

-- On the dataset's source database
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

If you see two transactions waiting on each other (wait_event_type = 'Lock'), pick the one that has been running longer and cancel it with SELECT pg_cancel_backend(<pid>); the other then completes. On Postgres 9.6+, SELECT pg_blocking_pids(<pid>) shows exactly which backends are blocking a given one.

5. Is disk full?

A stuck INSERT or COPY step often means the destination table's tablespace ran out of disk.

ssh root@<honeyframe-host> 'df -h /'

If usage is above 95%, free space (drop old temp tables, run VACUUM, or extend the volume); the flow cannot complete until there is headroom.
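The 95% threshold can be checked from captured df output. A sketch — `disk_check` is a hypothetical helper and the sample values are illustrative:

```shell
# Read df -h style output on stdin, extract the Use% column of the first
# filesystem line, and print "full" (>95%) or "ok".
disk_check() {
  pct=$(awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$pct" -gt 95 ]; then echo full; else echo ok; fi
}

# Example with captured output (values are made up):
printf 'Filesystem Size Used Avail Use%% Mounted\n/dev/sda1 100G 97G 3G 97%% /\n' | disk_check
# -> full
```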

Cancelling a stuck flow safely

Activity → Current run → Cancel. The cancel signal:

  1. Sends SIGTERM to the flow worker for that run
  2. Waits up to 60 seconds for graceful shutdown
  3. If still running, sends SIGKILL
  4. Marks the run as cancelled in flow_runs

Cancellation is not transactional — if the stuck step had partially written rows to a destination table, those rows stay. Most recipes are idempotent (re-running produces the same end state), but for non-idempotent recipes (e.g. an "increment counter" step), you may need to manually undo the partial write before re-running.

Preventing it next time

  • Set a flow timeout — flow settings → Timeout. Default is "no timeout"; setting 2× the historical p95 runtime turns silent stuck-flows into loud timeouts.
  • Add a heartbeat alert — a flow that's been "running" for >2× p95 should page or email someone, not silently sit.
  • Watchdog cron — a tiny script that queries flow_runs WHERE status='running' AND started_at < now() - interval '4 hours' and posts to Slack. Less elegant than alerts but bulletproof.
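The watchdog bullet can be sketched as a cron-able script. The flow_runs query comes from the bullet above; everything else — the `slack_payload` helper, the payload shape, and the webhook URL — is an assumption:

```shell
#!/bin/sh
# Watchdog sketch: find runs stuck in "running" for >4 hours and post each
# to Slack. The live wiring is commented out so the sketch is self-contained.

slack_payload() {
  # Build a Slack-style JSON message for one stale run id.
  printf '{"text":"flow run %s has been running for over 4 hours"}' "$1"
}

# stale=$(psql -t -A -c \
#   "SELECT id FROM hubstudio.flow_runs
#    WHERE status='running' AND started_at < now() - interval '4 hours'")
# for id in $stale; do
#   curl -s -X POST -H 'Content-type: application/json' \
#     -d "$(slack_payload "$id")" https://hooks.slack.com/services/<webhook>
# done
```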

See also