Flow stuck running

Symptom: a flow has been "Running" for hours when it normally finishes in minutes. Diagnose in this order — the first three checks resolve 90% of cases.

1. Is a step actually stuck, or just slow?

Open the flow's Activity → Current run. Each step shows its status (Pending / Running / Done) and elapsed time. Compare to the same step's average runtime in History.

  • Step elapsed time matches the historical average → not stuck, just slow. See Reduce flow runtime.
  • Step elapsed time is 10× or more the historical average → genuinely stuck; continue.
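The 10× heuristic can be expressed as a tiny script. A sketch only — the `is_stuck` helper and minute-based inputs are illustrative, not part of Honeyframe:

```shell
# Hypothetical helper: decide whether a step is stuck, given its current
# elapsed time and its historical average runtime (both in whole minutes).
is_stuck() {
  elapsed=$1
  avg=$2
  # Stuck if elapsed is at least 10x the historical average
  if [ "$elapsed" -ge $((avg * 10)) ]; then
    echo stuck
  else
    echo slow
  fi
}

# Example: a step that averages 4 minutes has been running for 50
is_stuck 50 4   # -> stuck
```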

2. Is the upstream connection alive?

If the stuck step reads from a connector, the connector might be hung. From a terminal:

# Test the source independently
ssh root@<honeyframe-host> "psql -h <source-host> -c 'SELECT 1' -t" # or whatever the source speaks

If the test query also hangs, the source is the problem — not Honeyframe. Either wait for the source to recover or kill the flow run and resume after the source is back.
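To keep the diagnostic itself from hanging your terminal, bound the probe with coreutils timeout. A minimal sketch — the psql invocation in the usage comment mirrors the one above, and the host names stay placeholders:

```shell
# Run a source probe under a time limit so the check itself can't hang.
# probe <seconds> <command...> - exit status 124 means the command timed out.
probe() {
  secs=$1; shift
  timeout "$secs" "$@"
}

# Example against a Postgres source (host is a placeholder):
# probe 10 psql -h <source-host> -c 'SELECT 1' -t
# if [ $? -eq 124 ]; then echo "source hung - problem is upstream"; fi
```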

3. Is the worker process alive?

# Check that flow workers are running and not zombies
ssh root@<honeyframe-host> 'systemctl status hub-platform-worker'

# Check for stuck Python processes
ssh root@<honeyframe-host> 'ps aux | grep python | grep flow'

If the worker process exited but the database still shows the run as "Running", you have an orphan run. Mark it as failed manually:

-- Connect to the platform DB
UPDATE hubstudio.flow_runs
SET status='failed', failed_reason='worker_orphaned', finished_at=now()
WHERE id=<run_id> AND status='running';

Then restart the worker: systemctl restart hub-platform-worker. The next scheduled run picks up cleanly.
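The orphan condition — DB says "running" while the worker unit is no longer active — can be checked mechanically. A hedged sketch: `is_orphan` is a hypothetical helper, and the wiring in the comments assumes `systemctl is-active` plus the flow_runs schema shown above:

```shell
# Hypothetical helper: a run is orphaned when the database still reports
# status "running" but the worker unit is not active.
# Usage: is_orphan <systemd-active-state> <db-run-status>
is_orphan() {
  unit_state=$1
  run_status=$2
  if [ "$run_status" = "running" ] && [ "$unit_state" != "active" ]; then
    echo orphan
  else
    echo ok
  fi
}

# In practice you would feed it live values, e.g.:
#   state=$(ssh root@<honeyframe-host> 'systemctl is-active hub-platform-worker')
#   status=$(psql -t -A -c "SELECT status FROM hubstudio.flow_runs WHERE id=<run_id>")
#   is_orphan "$state" "$status"
```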

4. Is the database in deadlock?

Two flows or a flow + an interactive query can deadlock on the same table.

-- On the dataset's source database
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

If you see two transactions waiting on each other (wait_event_type = 'Lock'), pick the one that has been running longer and cancel it with SELECT pg_cancel_backend(<pid>); the other then completes. On Postgres 9.6+, SELECT pg_blocking_pids(<pid>) shows exactly which backends are blocking a given one.

5. Is disk full?

A stuck INSERT or COPY step often means the destination table's tablespace ran out of disk.

ssh root@<honeyframe-host> 'df -h /'

If usage is above 95%, free space (drop old temp tables, run VACUUM, or extend the volume); the flow cannot complete until there is headroom.
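The 95% threshold can be checked from captured df output. A sketch — `disk_check` is a hypothetical helper and the sample values are illustrative:

```shell
# Read df -h style output on stdin, extract the Use% column of the first
# filesystem line, and print "full" (>95%) or "ok".
disk_check() {
  pct=$(awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$pct" -gt 95 ]; then echo full; else echo ok; fi
}

# Example with captured output (values are made up):
printf 'Filesystem Size Used Avail Use%% Mounted\n/dev/sda1 100G 97G 3G 97%% /\n' | disk_check
# -> full
```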

Cancelling a stuck flow safely

Activity → Current run → Cancel. The cancel signal:

  1. Sends SIGTERM to the flow worker for that run
  2. Waits up to 60 seconds for graceful shutdown
  3. If still running, sends SIGKILL
  4. Marks the run as cancelled in flow_runs

Cancellation is not transactional — if the stuck step had partially written rows to a destination table, those rows stay. Most recipes are idempotent (re-running produces the same end state), but for non-idempotent recipes (e.g. an "increment counter" step), you may need to manually undo the partial write before re-running.

Preventing it next time

  • Set a flow timeout — flow settings → Timeout. Default is "no timeout"; setting 2× the historical p95 runtime turns silent stuck-flows into loud timeouts.
  • Add a heartbeat alert — a flow that's been "running" for >2× p95 should page or email someone, not silently sit.
  • Watchdog cron — a tiny script that queries flow_runs WHERE status='running' AND started_at < now() - interval '4 hours' and posts to Slack. Less elegant than alerts but bulletproof.
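The watchdog bullet can be sketched as a cron-able script. The flow_runs query comes from the bullet above; everything else — the `slack_payload` helper, the payload shape, and the webhook URL — is an assumption:

```shell
#!/bin/sh
# Watchdog sketch: find runs stuck in "running" for >4 hours and post each
# to Slack. The live wiring is commented out so the sketch is self-contained.

slack_payload() {
  # Build a Slack-style JSON message for one stale run id.
  printf '{"text":"flow run %s has been running for over 4 hours"}' "$1"
}

# stale=$(psql -t -A -c \
#   "SELECT id FROM hubstudio.flow_runs
#    WHERE status='running' AND started_at < now() - interval '4 hours'")
# for id in $stale; do
#   curl -s -X POST -H 'Content-type: application/json' \
#     -d "$(slack_payload "$id")" https://hooks.slack.com/services/<webhook>
# done
```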

See also