Ops/Admin hygiene sweep (SSOT alignment)
Problem
run_queue.json can accumulate items stuck in status="running" (e.g., child session lost, crash, dispatcher restart). This blocks visibility and makes "what's actually running?" untrustworthy.
Minimal policy (recommended)
Statuses (keep existing; add one new optional)
- queued: eligible to dispatch
- running: actively executing (must have run.startedAtMs)
- done / failed / canceled: terminal
- (optional) stale: non-terminal but not executing; requires manual triage
Stale definition
An item is stale-running if:
- status == "running", AND
- run.startedAtMs exists, AND
- nowMs - run.startedAtMs > staleAfterMs (default 6h for short runs, 24h for long), AND
- no recent heartbeat/update recorded (see run.lastHeartbeatAtMs below)
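The four conditions above can be sketched as a single predicate. This is a minimal illustration, not the dispatcher's actual code: the function name `is_stale_running` is hypothetical, and because the policy does not say what counts as a "recent" heartbeat, the sketch assumes a heartbeat is recent when it falls inside the same staleAfterMs window.

```python
from typing import Any, Mapping

SIX_HOURS_MS = 6 * 60 * 60 * 1000  # the policy's default for short runs


def is_stale_running(item: Mapping[str, Any], now_ms: int,
                     default_stale_after_ms: int = SIX_HOURS_MS) -> bool:
    """Return True only if all four stale-running conditions hold."""
    if item.get("status") != "running":
        return False
    run = item.get("run") or {}
    started = run.get("startedAtMs")
    if not started:  # a missing (or zero) start time is treated as "does not exist"
        return False
    # A per-item run.staleAfterMs overrides the default threshold.
    stale_after = run.get("staleAfterMs") or default_stale_after_ms
    if now_ms - started <= stale_after:
        return False
    # Assumption: "recent heartbeat" means within the same window.
    last_heartbeat = run.get("lastHeartbeatAtMs") or 0
    return now_ms - last_heartbeat > stale_after
```

An item that started seven hours ago with no heartbeat is reported as stale under the 6h default; the same item with a heartbeat a few seconds old is not.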
Suggested fields (backwards-compatible)
Add under item.run (optional fields):
{
  "lastHeartbeatAtMs": 0,
  "staleAfterMs": 21600000,
  "attempt": 1,
  "dispatcherRunId": "disp_...",
  "notes": "optional runtime notes"
}
Add at item top-level (optional):
{
  "triage": {
    "status": "needs_review",
    "reason": "stale_running",
    "flaggedAtMs": 0
  }
}
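Because every field above is optional, readers of the queue have to tolerate older items that lack them. One way to do that is sketched below. The helper names `run_fields` and `triage_of` are hypothetical, and treating the example values (0, 21600000, 1) as defaults is an assumption rather than something the policy states.

```python
from typing import Any, Dict, Mapping, Optional

# Assumed defaults, taken from the example values in the suggested fields.
RUN_DEFAULTS: Dict[str, Any] = {
    "lastHeartbeatAtMs": 0,
    "staleAfterMs": 21_600_000,  # 6 hours in milliseconds
    "attempt": 1,
}


def run_fields(item: Mapping[str, Any]) -> Dict[str, Any]:
    """Return item.run merged over the defaults, without mutating the item."""
    run = item.get("run") or {}
    return {**RUN_DEFAULTS, **run}


def triage_of(item: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
    """Return the top-level triage block, or None for items never flagged."""
    return item.get("triage")
```

Merging at read time leaves the stored items untouched, so existing entries in run_queue.json need no migration.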
Safe repair procedure (NO deletes)
- Snapshot backup: copy run_queue.json to run_queue.backup.<timestamp>.json.
- For each status=running item older than threshold:
  - If run.childSessionKey exists and you can confirm it's still active → keep running, update run.lastHeartbeatAtMs.
  - Else mark as failed (preferred) or stale (if you want explicit triage):
    - status: "failed"
    - run.finishedAtMs: nowMs
    - run.error: "Marked failed by dispatcher: stale running (no active child session / exceeded threshold)."
    - triage: {status:"needs_review", reason:"stale_running", flaggedAtMs: nowMs}
- Optionally requeue by creating a new item with same title/notes/taskIds and status:"queued" (do not mutate history).
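The procedure above could be scripted roughly as follows. This is a sketch under stated assumptions, not a definitive tool: the on-disk layout of run_queue.json is not shown in this note, so the code accepts either a bare list or an object with an "items" list (an assumption). The `is_session_active` callback is a hypothetical stand-in for however you confirm a child session is alive, and the millisecond timestamp in the backup name is just one possible choice.

```python
import json
import shutil
from pathlib import Path
from typing import Any, Callable, Dict, List

ERROR_MSG = ("Marked failed by dispatcher: stale running "
             "(no active child session / exceeded threshold).")


def load_items(data: Any) -> List[Dict[str, Any]]:
    # Assumption: the queue is either a bare list or {"items": [...]}.
    return data if isinstance(data, list) else data.get("items", [])


def requeue_copy(item: Dict[str, Any]) -> Dict[str, Any]:
    # Build a new queued item; the failed original stays in place as history.
    fresh = {k: item[k] for k in ("title", "notes", "taskIds") if k in item}
    fresh["status"] = "queued"
    return fresh


def repair_queue(path: str, now_ms: int, threshold_ms: int,
                 is_session_active: Callable[[str], bool],
                 requeue: bool = False) -> int:
    """Back up the queue, then fail stale running items. Returns the count failed."""
    queue_path = Path(path)
    # Step 1: snapshot backup before touching anything.
    backup = queue_path.with_name(f"{queue_path.stem}.backup.{now_ms}.json")
    shutil.copy2(queue_path, backup)

    data = json.loads(queue_path.read_text(encoding="utf-8"))
    items = load_items(data)
    failed = 0
    for item in list(items):  # iterate over a copy because requeue appends
        run = item.get("run") or {}
        started = run.get("startedAtMs")
        if item.get("status") != "running" or not started:
            continue
        if now_ms - started <= threshold_ms:
            continue
        key = run.get("childSessionKey")
        if key and is_session_active(key):
            run["lastHeartbeatAtMs"] = now_ms  # confirmed alive: keep running
            continue
        # Step 2: mark failed and flag for triage; nothing is deleted.
        item["status"] = "failed"
        run["finishedAtMs"] = now_ms
        run["error"] = ERROR_MSG
        item["triage"] = {"status": "needs_review",
                          "reason": "stale_running",
                          "flaggedAtMs": now_ms}
        failed += 1
        if requeue:  # Step 3 (optional): fresh item, history untouched
            items.append(requeue_copy(item))
    queue_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return failed
```

Running it once with `requeue=False` and inspecting the triage flags before requeueing anything keeps the repair reviewable; the backup file allows a full rollback if the result looks wrong.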
Dispatcher behavior (minimal code/data policy)
- Dispatcher only dispatches queued items (keep current rule).
- On each cycle, it can either only log stale-running items or (safer) auto-mark them failed once they exceed the threshold.
- Never overwrite an existing terminal status.
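A dispatcher cycle that follows these three rules might look like the sketch below. The names `dispatch_cycle` and `set_status` are hypothetical, and the stale check is deliberately simplified: it takes the later of run.startedAtMs and run.lastHeartbeatAtMs as the last sign of life and compares that against a single threshold.

```python
import logging
from typing import Any, Dict, List

log = logging.getLogger("dispatcher")
TERMINAL = {"done", "failed", "canceled"}


def set_status(item: Dict[str, Any], new_status: str) -> bool:
    """Change status unless the item is already terminal (rule 3)."""
    if item.get("status") in TERMINAL:
        return False
    item["status"] = new_status
    return True


def dispatch_cycle(items: List[Dict[str, Any]], now_ms: int,
                   stale_after_ms: int,
                   auto_fail: bool = True) -> List[Dict[str, Any]]:
    """Handle stale-running items, then return the items eligible to dispatch."""
    for item in items:
        run = item.get("run") or {}
        started = run.get("startedAtMs")
        if item.get("status") != "running" or not started:
            continue
        # Last sign of life: the start time or a later heartbeat.
        last_seen = max(started, run.get("lastHeartbeatAtMs") or 0)
        if now_ms - last_seen <= stale_after_ms:
            continue
        if auto_fail and set_status(item, "failed"):
            run["finishedAtMs"] = now_ms  # rule 2, safer variant
        else:
            log.warning("stale-running item: %s", item.get("title"))  # rule 2, log-only
    # Rule 1: only queued items are ever dispatched.
    return [item for item in items if item.get("status") == "queued"]
```

Routing every status change through one guard such as `set_status` is what makes the "never overwrite a terminal status" rule hard to violate by accident.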
Notes on your current data
Your queue already contains multiple running items with very old startedAtMs (2026-02-26 and earlier). The above policy lets you cleanly terminate them without deleting history, and optionally requeue fresh runs.
RECEIPT: runId: run_6f9a966b artifact: /Users/ENVOAI/.openclaw/workspace-theo/second-brain/brain/hq/agents/agt_hq/2026-02-26/ship-ready-run_6f9a966b.md