# Flow stuck running
A flow has been "Running" for hours when it normally finishes in minutes. Diagnose in this order — the first three checks resolve 90% of cases.
## 1. Is a step actually stuck, or just slow?
Open the flow's Activity → Current run. Each step shows its status (Pending / Running / Done) and elapsed time. Compare to the same step's average runtime in History.
- Step elapsed time matches history → not stuck, just slow. See Reduce flow runtime.
- Step elapsed time is 10×+ history average → genuinely stuck, continue.
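If you prefer SQL over the UI, the same comparison can be sketched against the platform DB. This is only a sketch: the `flow_id` column and the `'succeeded'` status value are assumptions, not confirmed schema — adjust to what `hubstudio.flow_runs` actually contains.

```sql
-- Sketch: current run's elapsed time vs. the flow's historical average
SELECT now() - r.started_at              AS elapsed,
       avg(h.finished_at - h.started_at) AS historical_avg
FROM hubstudio.flow_runs r
JOIN hubstudio.flow_runs h ON h.flow_id = r.flow_id
                          AND h.status = 'succeeded'   -- assumed status value
WHERE r.id = <run_id>
GROUP BY r.started_at;
```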
## 2. Is the upstream connection alive?
If the stuck step reads from a connector, the connector might be hung. From a terminal:
```shell
# Test the source independently
ssh root@<honeyframe-host> "psql -h <source-host> -c 'SELECT 1' -t"  # or whatever the source speaks
```
If the test query also hangs, the source is the problem — not Honeyframe. Either wait for the source to recover or kill the flow run and resume after the source is back.
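A probe against a hung source will itself hang your shell, so it helps to bound it with `timeout(1)`. A minimal sketch — `sleep 10` stands in for the real probe, which you would swap for the `psql` command above:

```shell
# timeout(1) kills the command after N seconds and exits 124,
# so a hung source can't hang your diagnosis session.
# Real usage:  timeout 10 psql -h <source-host> -c 'SELECT 1' -t
timeout 2 sleep 10
echo "probe exit: $?"    # 124 → treat the source as hung; 0 → source answered
```

Exit code 0 means the source answered in time; 124 means the deadline passed and the source should be treated as hung.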
## 3. Is the worker process alive?
```shell
# Check that flow workers are running and not zombies
ssh root@<honeyframe-host> 'systemctl status hub-platform-worker'
# Check for stuck Python processes
ssh root@<honeyframe-host> 'ps aux | grep python | grep flow'
```
If the worker process exited but the database still shows the run as "Running", you have an orphan run. Mark it as failed manually:
```sql
-- Connect to the platform DB
UPDATE hubstudio.flow_runs
SET status='failed', failed_reason='worker_orphaned', finished_at=now()
WHERE id=<run_id> AND status='running';
```
Then restart the worker: `systemctl restart hub-platform-worker`. The next scheduled run picks up cleanly.
## 4. Is the database in deadlock?
Two flows or a flow + an interactive query can deadlock on the same table.
```sql
-- On the dataset's source database
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
```
If you see two transactions waiting on each other (`wait_event_type = 'Lock'`), pick the one that's been running longer and cancel it with `SELECT pg_cancel_backend(<pid>);`. The other then completes. (Postgres aborts one side of a true deadlock automatically after `deadlock_timeout`; what you usually find here is a long lock wait, which does need the manual cancel.)
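On Postgres 9.6 and later, `pg_blocking_pids()` names the blocker directly, which is quicker than eyeballing wait events. A sketch against the same view:

```sql
-- Who is blocking whom (Postgres 9.6+)
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```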
## 5. Is disk full?
A stuck INSERT or COPY step often means the destination table's tablespace ran out of disk.
```shell
ssh root@<honeyframe-host> 'df -h /'
```
If usage is above 95%, free up space (drop old temp tables, vacuum, or extend the volume) so the flow can complete.
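To check every mount at once rather than just `/`, a one-line sketch works: `df -P` guarantees stable column positions (column 5 is Use%, column 6 the mount point), and the threshold is easy to adjust.

```shell
# Flag any filesystem over 95% used; "$5 + 0" coerces "96%" to 96
df -P | awk 'NR > 1 && $5 + 0 > 95 {print $6 " is at " $5}'
```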
## Cancelling a stuck flow safely
Activity → Current run → Cancel. The cancel signal:
- Sends `SIGTERM` to the flow worker for that run
- Waits up to 60 seconds for graceful shutdown
- If still running, sends `SIGKILL`
- Marks the run as `cancelled` in `flow_runs`
Cancellation is not transactional — if the stuck step had partially written rows to a destination table, those rows stay. Most recipes are idempotent (re-running produces the same end state), but for non-idempotent recipes (e.g. an "increment counter" step), you may need to manually undo the partial write before re-running.
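If the destination table records which run wrote each row — an assumption; many recipes stamp a batch or run id column, but yours may not — the partial write can be undone before re-running. The table and column names here are purely hypothetical:

```sql
-- Hypothetical cleanup: remove rows the cancelled run partially wrote
DELETE FROM analytics.orders_daily
WHERE _run_id = <run_id>;
```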
## Preventing it next time
- Set a flow timeout — flow settings → Timeout. Default is "no timeout"; setting 2× the historical p95 runtime turns silent stuck-flows into loud timeouts.
- Add a heartbeat alert — a flow that's been "running" for >2× p95 should page or email someone, not silently sit.
- Watchdog cron — a tiny script that queries `flow_runs WHERE status='running' AND started_at < now() - interval '4 hours'` and posts to Slack. Less elegant than alerts but bulletproof.
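That watchdog can be a few lines of shell. A sketch, not a turnkey script: the table name comes from this article, but `PLATFORM_DSN` and `SLACK_WEBHOOK_URL` are assumed environment variables, and the Slack call uses a standard incoming webhook.

```shell
#!/bin/sh
# Watchdog sketch: post flow runs stuck >4h to Slack. Run from cron.
stuck=$(psql "$PLATFORM_DSN" -At -c \
  "SELECT 'run ' || id || ' running since ' || started_at
   FROM hubstudio.flow_runs
   WHERE status = 'running'
     AND started_at < now() - interval '4 hours'")

if [ -n "$stuck" ]; then
  # Turn real newlines into literal \n so the JSON payload stays valid
  msg=$(printf '%s' "$stuck" | awk '{printf "%s\\n", $0}')
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"Stuck flow runs:\\n$msg\"}" "$SLACK_WEBHOOK_URL"
fi
```

Install it with a crontab entry such as `*/15 * * * * /usr/local/bin/flow_watchdog.sh` (path illustrative).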
## See also
- Reduce flow runtime — for "slow but not stuck"
- Connection errors — when the upstream is the cause