forensic deep-dive · read-only audit · 2026-06-09

Why everything kept breaking — root causes, live evidence.

A complete forensic reconstruction of six months of Viewport's OpenClaw deployment. Every table is sourced from live VPS or GitHub read-only data. No memory. No guesses. The kill-cron, the deleted config, the frozen council, the 139-task gap — all documented with exact numbers.

Audited 2026-06-09 66 containers live 50 cron jobs · 49 enabled 26 agents configured 117 GitHub issues 69 commits on default branch

01Executive verdict

Verified numbers from live VPS + GitHub read-only audit, 2026-06-09. Nothing estimated.

Root cause is not architecture — it is execution infrastructure. The design (Git = truth, agents = workforce, runtime = disposable) is exactly right and matches how Anthropic and top AI companies operate. The failure is in three mechanical layers: (1) a system-cron literally killing all claude processes every 6 hours, (2) the May-11 "fresh" rebuild deleted the 50-cron, 26-agent operational config and replaced it with a near-empty stub, (3) the GitHub control loop that was supposed to close issues has never completed a single cycle — 1 task done out of 139.
Task throughput
1 / 139
tasks done. The closure loop (issue→PR→evidence→close) fired 0 times. Planning velocity is ∞; execution velocity ≈ 0.
Closure loop fires
The intended GitHub flow — agent picks issue, works it, opens PR with evidence, issue auto-closes — has never completed end-to-end once.
Council frozen
29 days
Migration council stuck at round 000 since 2026-05-10 on a single boolean flag: pat_revoked: false. No agent has advanced it.
Sessions / 30d
522
34,330+ chat messages across 522 sessions. High activity, near-zero durable output. Session memory resets every run.
Kill-cron cadence
6h
/etc/cron.d/claude-cleanup: pkill -u openclaw claude fires every 6 hours. Every long-running agent process dies on schedule.
Audit pass/fail
2 / 10
PASS: 2, FAIL: 10, UNKNOWN: 1 across the June-4 system audit sections 0–11. Only preflight passed cleanly.
Cron collapse
50 → 1
Old OpenClaw had 50 enabled crons across all 26 agents. The May-11 fresh rebuild left 1 disabled one-shot job. Then jobs.json was re-added with 49 enabled but the sessions/runtime was gone.
Containers
66
66 Docker containers running. 3 unhealthy (origin-backend, saathi-app-1, platformx-nextcloud). MLH portal at 115% CPU. Neo4j at 138% CPU.

System health at a glance

Tasks done
1/139
Issues closed
94/117
Issues open
23/117
AUDIT-FIND open
11/14
Crons surviving rebuild
1/50
Healthy containers
63/66
Council advancement
0/rounds
Repo files (522 total)
522
The good news: The config (openclaw.json) still has all 26 agents defined, and jobs.json has all 50 crons intact with 49 enabled. The operational framework survived. What's missing is: (1) kill the kill-cron, (2) wire the GitHub PAT so the council can advance, (3) run the closure loop once end-to-end. That's it.

02The bombshell: 50 crons deleted → 1; kill-cron confirmed

Read-only SSH audit of /root/.openclaw/cron/jobs.json + /etc/cron.d/claude-cleanup. Data verified 2026-06-09.

Kill-cron: /etc/cron.d/claude-cleanup
0 */6 * * * root pkill -u openclaw claude 2>/dev/null; pkill -f "claude --dangerously" 2>/dev/null; true
Fires at 00:00, 06:00, 12:00, 18:00 UTC every day as root. Any claude process — whether mid-task, in a cron, or doing long-running work — is hard-killed. This is the direct cause of agents never completing tasks. This cron was likely added as a "cleanup" measure during a troubleshooting session and never removed.
The May-11 rebuild wipe
The May-11 "fresh install" (branch fix/openclaw-fresh-true-clean-reinstall) replaced the fully configured openclaw.json with a near-default stub. The old system had 26 agents with full identities + 47 crons spread across all of them. The fresh install had 0 crons in cron.jobs (empty array). The cron/jobs.json file was later re-added with the 50 jobs but the agents were never fully re-wired (0 cron_jobs_attached per agent in the audit evidence).

Cron status: old system vs today

Old system (pre-May-11)
47
Crons configured across 26 agents in the openclaw.json agent definitions. Every agent had scheduled work.
After fresh rebuild
0
openclaw.json cron.jobs array was empty after rebuild. The configuration was destroyed.
jobs.json today
49 / 50
jobs.json was re-populated with 50 jobs; 49 enabled, 1 disabled (EOD Verification one-shot). But the kill-cron still fires.

Full cron registry — all 50 jobs (source: /root/.openclaw/cron/jobs.json)

# Status Name Agent Schedule (Asia/Bangkok) Purpose
1ONSocial Trend Scanresource0 2 * * * · daily 02:00Scan HackerNews + Twitter/X for AI agent trends. Save to MARKET_INTELLIGENCE/SOCIAL_TRENDS.md
2ONCompetitor Monitorresource0 5 * * * · daily 05:00Check 3Commas, Pionex, CrewAI, AutoGPT for new features or pricing changes
3ONNew Tools Scanresource0 8 * * * · daily 08:00Search ProductHunt + ClawHub for new AI tools/skills. Score by relevance
4ONIntelligence Digestresource0 12 * * * · daily 12:00Compile all MARKET_INTELLIGENCE files into single daily digest
5ONSkill Discoveryresource0 9 * * * · daily 09:00Search GitHub for new Claude skills repos (stars>50, pushed<7d)
6ONCompetitor Deep Diveresource0 10 * * 5 · Fri 10:00Weekly deep analysis of top 3 competitors. Pricing, features, market position
7ONMonthly Tech Radarresource0 10 1 * * · 1st of month 10:00Categorize tracked tech as ADOPT/TRIAL/ASSESS/HOLD. Save to TECHNOLOGY_RADAR/
8ONWeekly Leadsresource0 3 * * 1 · Mon 03:00Search for businesses needing branding/websites. Score top 10
9ONArXiv AI Scanresource0 23 * * * · daily 23:00Scan latest AI/ML papers from ArXiv
10ONGitHub Releasesresource30 23 * * * · daily 23:30Check GitHub releases for tracked repos
11ONHourly Healthvision0 * * * * · every hourdocker ps health check; report unhealthy/Restarting/Exited containers
12ONMorning Briefingvision0 8 * * * · daily 08:00Compile overnight incidents, P0/P1 alerts, and daily priorities
13ONAgent Auditvision0 20 * * * · daily 20:00Full Agent Audit sweep for all 26 agents — workspaces, sessions, memory
14ONWeekly Skill Auditvision0 2 * * 1 · Mon 02:00Compare openclaw skills list with clawhub. Check workspace skills/ dirs
15ONSecurity Credential Checkvision0 6 * * 1 · Mon 06:00Weekly: check auth-profiles.json for expired tokens or high-risk creds
16ONDaily Costfinance0 15 * * * · daily 15:00Query LiteLLM localhost:4000 for today's API costs. Alert if >$10/day
17ONWeekly P&Lfinance0 12 * * 0 · Sun 12:00Revenue (Stripe) vs costs (LiteLLM + hosting). Format as table
18ONInvoice Checkfinance0 14 * * * · daily 14:00Check Odoo for overdue invoices >7 days. Send reminders
19ONSubscription Renewalsfinance0 10 * * * · daily 10:00Check upcoming renewals in next 7 days. Alert if approaching
20ONMonthly Reportfinance0 10 1 * * · 1st of month 10:00Total revenue, costs, margin, top clients, cost per agent
21ONDaily Outreachsales0 3 * * * · daily 03:00Check pipeline for today's follow-ups. Execute Day 1/3/7/14 contact cadence
22ONFollow-upssales0 6 * * * · daily 06:00Send scheduled follow-up messages. Update pipeline status
23ONPipeline Reviewsales0 9 * * 5 · Fri 09:00Weekly: leads by stage, conversion rates, revenue forecast
24ONLead Gensales0 4 * * 1 · Mon 04:00Research 10 new potential clients via Google/LinkedIn. Score by fit
25ONDaily Ticketscs0 2 * * * · daily 02:00Check Odoo project_id=9 (Customer Support Queue) for open tasks by age
26ONOnboarding Checkcs0 4 * * * · daily 04:00Check new clients in onboarding. Send scheduled welcome/check-in emails
27ONWeekly Satisfactioncs0 10 * * 3 · Wed 10:00Review client interactions this week. Score satisfaction. Flag unhappy clients
28ONOps Checkperformer0 1 * * * · daily 01:00docker ps, disk, memory. Check /opt/platformx/projects/eye/alerts/active-p1.json
29ONBackup Verifyperformer0 22 * * * · daily 22:00Verify latest backup on Google Drive. Check rclone sync status. Report missing
30ONWeekly Syncperformer0 21 * * 0 · Sun 21:00Compare /opt/platformx/docs/ timestamps between Mac and VPS. Report drift
31ONP1 Monitorperformer0 */2 * * * · every 2hRead active-p1.json. If file non-empty, escalate immediately to Discord + Telegram
32ONCode Reviewcoder0 3 * * * · daily 03:00Review open GitHub PRs. Check failing tests. Review yesterday's commits for quality
33ONArchitecture Reviewarchitect0 4 * * * · daily 04:00Review active specs. Check pending architecture decisions. Update tech debt registry
34ONWeekly Strategyarchitect0 3 * * 1 · Mon 03:00Weekly: OKR progress, bottlenecks, capacity, priority recommendations
35ONDaily Contentcontent0 3 * * * · daily 03:00Create 1 blog post or case study. Write today's social copy. Update content calendar
36ONSocial Mediamarketing0 4 * * * · daily 04:00Schedule today's social posts. Check yesterday's engagement. Report performance
37ONBizDev Opportunitiesbizdev0 3 * * * · daily 03:00Review new opportunities. Update pipeline. Research 3 partnership leads
38ONLegal Compliancelegal0 5 * * * · daily 05:00Check Odoo contracts. Monitor trademark watches. Review compliance calendar
39ONHiring Pipelinehiring0 5 * * * · daily 05:00Assess agent capacity. Check department needs. Review skill gaps
40ONTraining Audittraining0 6 * * * · daily 06:00Audit agent performance. Update skill benchmarks. Check retraining needs
41ONDaily KPIsanalytics0 1 * * * · daily 01:00Aggregate: revenue, costs, clients, signal accuracy, agent utilization, error rates
42ONOmniBrand Pipelineomnibrand0 4 * * * · daily 04:00Check pipeline: new domains scored? Brands in progress? Update status
43ONMedia Assetsmedia0 5 * * * · daily 05:00Check pending design requests. Generate scheduled assets. Update media library
44ONInnovation Scaninnovation0 23 * * * · daily 23:00Read intelligence digest. Score discoveries 0–15. Dispatch IMMEDIATE items
45ONExperiment Runexperiment0 0 * * * · midnightRun pending experiments from innovation. Test new tools/models in sandbox
46ONBenchmark Scoresbenchmark0 7 * * * · daily 07:00Score completed experiments vs baseline. Recommend promote/archive
47ONQA Checkqa0 11 * * * · daily 11:00Run system tests. Check error logs. Verify critical flows. Report bugs
48ONWeekly QA Reportqa-master0 2 * * 1 · Mon 02:00Weekly quality: test pass rates, bugs found/fixed, SLA compliance, agent scores
49ONMemory Dreaming PromotionN/A0 3 * * * · daily 03:00Nightly memory consolidation + promotion routine
50OFFEOD Verification 2026-04-30mainone-shot (disabled)One-shot health check from Apr 30 — completed, disabled. The only job that ever ran.
Why crons don't fire despite being enabled: Every cron job targets an OpenClaw session (sessionTarget: "main"). The kill-cron at /etc/cron.d/claude-cleanup kills the claude process every 6 hours as root. Any cron that starts a claude session before the 6-hour window closes will be hard-killed mid-execution. The jobs.json is correctly configured — the problem is the external pkill.

26 agents in openclaw.json — all confirmed live

#IDNameRoleKill-cron target?
1mainVIEWPORTautonomous CEOYES — default session
2coderCodeXlead engineerYES
3researcherScoutresearch & intelligenceYES
4architectAtlassystems architectureYES
5qaVerifyquality assuranceYES
6visionEyemonitoring & visibilityYES
7performerPerformerops & performanceYES
8bizdevForgebusiness developmentYES
9financeLedgerfinancial operationsYES
10salesClosersales operationsYES
11marketingAmplifymarketing & brandYES
12legalShieldlegal & complianceYES
13csAdvocatecustomer successYES
14analyticsPrismanalytics & KPIsYES
15resourceSentinelmarket intelligenceYES
16trainingMentoragent trainingYES
17qa-masterAuditorQA master / auditorYES
18hiringRecruiterhiring & talentYES
19innovationCatalystinnovation pipelineYES
20omnibrandPalettebrand & identityYES
21contentQuillcontent creationYES
22mediaCanvasmedia & design assetsYES
23experimentHypothesisexperiments & testingYES
24benchmarkMetricbenchmark scoringYES
25crisiscrisiscrisis responseYES
26c-modernlaoc-modernlaoModern Lao tenant agentYES
Empty model fallbacks confirmed — the old config (from section3 audit evidence, backup dating to early April) shows agents like coder with fallback chains including openai-codex and gpt-5.5. Both models no longer exist / were stale references. The kimi-k2.5 models were removed on 2026-05-03 when NVIDIA retired them (HTTP 410). Current config points to claude-sonnet-4-6 → anthropic chain which is correct.

03Runtime map: 66 containers live

Source: docker ps -a read-only, 2026-06-09. All 66 are running (no exited). Three are unhealthy.

Running
66
All containers are in Up state. 0 exited, 0 restarting.
Unhealthy
3
origin-backend, saathi-app-1, platformx-nextcloud — all unhealthy health checks.
CPU hotspots
3
neo4j 138%, n8n 128%, mlh-client-portal 115% — three containers over 100% CPU.
Infra overlap
2
Both Coolify AND Dokploy running simultaneously — two competing orchestrators, one VPS.

All 66 containers grouped by family

Family Container(s) Status CPU / Mem Tag
OpenClawviewport-openclaw-fresh-openclaw-cli-1Up 7d healthyload-bearing
viewport-openclaw-fresh-openclaw-gateway-1Up 7d healthyload-bearing
openclaw-sbx-agent-bizdev-134566cdUp 4dload-bearing
Hermes (multi-tenant)hermes-vinay-patilUp 23htenant
hermes-bcclUp 16htenant
Coolify (orchestrator 1)coolifyUp 3d healthyredundant — 2 orchestrators
coolify-dbUp 3d healthylegacy
coolify-redisUp 3d healthylegacy
coolify-realtimeUp 3d healthylegacy
coolify-sentinelUp 2wk healthylegacy
Dokploy (orchestrator 2)dokploy.1.*Up 45h healthyredundant — 2 orchestrators
dokploy-postgresUp 45hredundant
dokploy-redisUp 45hredundant
ModernLao / MLH / MLGmodernlao-siteUp 5dload-bearing
mlh-comms-vault-apiUp 5dload-bearing
mlh-api-handlerUp 13dload-bearing
mlg-auth-gateUp 5dload-bearing
mlg-jacam-apiUp 6wkload-bearing
MLH Client Portalmlh-client-portal-dokploy-staging-*Up 8d115% CPU / 33MBCPU spike — investigate
PlatformX Core AIplatformx-n8nUp 6wk128% CPU / 466MBCPU runaway
platformx-neo4jUp 6wk healthy138% CPU / 1.9GBCPU runaway + high mem
platformx-litellmUp 6wk11% / 694MBload-bearing
platformx-qdrantUp 6wk3.7% / 289MBload-bearing
platformx-mem0Up 6wk5.6% / 214MBload-bearing
platformx-langfuseUp 5wkobserve
platformx-anythingllmUp 6wk healthyobserve
platformx-claude-memoryUp 6wkload-bearing
platformx-openwebuiUp 6wk healthyobserve
Odoo ERPplatformx-odooUp 6wkload-bearing
platformx-odoo-dbUp 6wkload-bearing
Mission Controlplatformx-mc-daemonUp 6wkload-bearing
platformx-mc-apiUp 6wkload-bearing
platformx-mc-dashboardUp 6wkload-bearing
mc_postgresUp 6wk healthyload-bearing
OpenHandsplatformx-openhandsUp 6wkobserve
oh-agent-server-6gCRPbTA90M4Jf9PnG9E9HUp 6wkobserve
oh-agent-server-7E6YLaYWJp9rhyHHe8kvpkUp 6wkobserve
oh-agent-server-2vTzHnwvsriBnKmbYovPbyUp 6wkobserve
LLM Councilplatformx-council-frontendUp 6wk healthyload-bearing
platformx-council-backendUp 6wk healthyload-bearing
platformx-council-nginxUp 6wkload-bearing
Origin (broken)origin-backendUp 6wk UNHEALTHY3% / 786MBretire or fix
origin-workerUp 6wkretire or fix
origin-redisUp 6wkretire or fix
Saathi (unhealthy)saathi-app-1Up 5wk UNHEALTHY13.8% / 89MBretire or fix
saathi-postgres-1Up 5wk healthy39.6% / 20MBobserve
saathi-redis-1Up 5wk1.7% / 3MBobserve
Nextcloudplatformx-nextcloudUp 6wk UNHEALTHY0.1% / 223MBretire — unused
platformx-nextcloud-dbUp 6wk healthy7% / 23MBretire — unused
Infra / shareddokploy-traefikUp 10dload-bearing
platformx-nginxUp 6wk healthyload-bearing
platformx-redisUp 6wkload-bearing
portainerUp 6wkobserve
local-registryUp 6wkload-bearing
platformx-discord-botUp 6wkload-bearing
weft-local-postgresUp 5wkobserve
Othercrusher-verify-api, docuseal, platformx-coder, platformx-performer-web, platformx-pipelines, platformx-fileserver, platformx-claudecodeui, 2dab5b8f (Docuseal), qfphb1umk (Docuseal)Up variousobserve / misc

Critical resource hotspots

Neo4j CPU
138%
n8n CPU
128%
MLH Portal CPU
115%
Saathi (sidecar) CPU
39.6%
LiteLLM RAM
694MB
origin-backend RAM
786MB
Neo4j RAM
1.9GB
Dual-orchestrator problem — Both Coolify (4 containers) and Dokploy (3 containers + Traefik) are running simultaneously. This creates: conflicting routing rules, double infrastructure overhead, no single deploy source of truth. Decision needed: migrate fully to Dokploy (newer, lighter) or remove Dokploy and consolidate on Coolify.

04GitHub control-plane audit

Source: gh api read-only calls to viewport-corp/viewport-ops, 2026-06-09. All numbers are live counts, not estimates.

Total commits
69
On default branch council/bootstrap-20260510. Not "main" — see repo name inversion below.
Total issues
117
Open: 23, Closed: 94. 14 are AUDIT-FIND (11 still open). Repo created 2026-05-10.
Active branches
30+
30+ named branches including companyos/*, feat/*, fix/*, docs/*, council/*. No branch protection on default.
Files in repo
522
All on default branch council/bootstrap-20260510. Includes 47+ plan/policy docs and full evidence hierarchy.
CRITICAL: The viewport-os / viewport-ops name inversion
viewport-corp/viewport-os (which sounds like the OS control-plane) = 8-file stub, pushed 2026-06-05, default branch: main.
viewport-corp/viewport-ops (which sounds like operations) = the real 522-file control-plane, 69 commits, default branch: council/bootstrap-20260510 (not main).
Any agent or automation that uses "viewport-os" to find the control-plane will land on the wrong repo. Any documentation pointing to "main" branch of viewport-ops finds 0 files. This naming inversion is an active operational hazard.

Issue breakdown

Total issues
117
Closed
94
Open
23
AUDIT-FIND label
14
AUDIT-FIND open
11

All 23 open issues (live state)

#IssueLabelsLast updated
#214[Research]: Evaluate GBrain as Viewport shared brain/memory layerintake2026-06-08
#213[Control Plane]: Build Viewport closed operating loop v0.1intake2026-06-08
#212[Control Plane]: Day-one-to-now chat forensic audit and live report systemintake2026-06-08
#196[GSD/RALPH] Activate GitHubOps truth loop for CompanyOS + VPS runtimeautomation2026-06-05
#195[REDESIGN] Full audit evidence publish + nav redesignanti-amnesia2026-06-05
#194[CHAT→TASK] Set up Neo4j as a company-brain component for Viewport OSduplicate of #1932026-06-05
#193[CHAT→TASK] Set up Neo4j as a company-brain component for Viewport OSduplicate of #1942026-06-05
#192[SECURITY] Session DB contains credential-pattern hits; block raw chat exportsecurity2026-06-05
#191[AUDIT-FIND] Section 11 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#190[AUDIT-FIND] Section 10 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#189[AUDIT-FIND] Section 9 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#188[AUDIT-FIND] Section 8 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#187[AUDIT-FIND] Section 7 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#186[AUDIT-FIND] Section 6 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#185[AUDIT-FIND] Section 5 audit gaps — 2026-06-04 23:46 UTCAUDIT-FIND2026-06-04
#184[AUDIT-FIND] Section 4 audit gaps — 2026-06-04 23:44 UTCAUDIT-FIND2026-06-04
#183[AUDIT-FIND] Section 3 agent fleet gaps — 2026-06-04 23:36 UTCAUDIT-FIND2026-06-04
#182[AUDIT-FIND] Section 2 VPS runtime inventory gaps — 2026-06-04 23:23 UTCAUDIT-FIND2026-06-05
#178[AUDIT-FIND] Section 1 GitHub inventory gaps — 2026-06-04 22:57 UTCAUDIT-FIND2026-06-04
#56Prepare protected dokploy.viewport.llc admin route contractstate:protected2026-06-01
#53P1: TradeX MT5 compile/backtest runner path neededstate:blocked2026-06-01
#52Phase 5E: Fresh-node Dokploy permanent pathstate:blocked2026-06-01
#15Odoo production installation and multi-company setupstate:active2026-06-01
Duplicate Neo4j issues: #193 ≡ #194 — Both are titled "[CHAT→TASK] Set up Neo4j as a company-brain component for Viewport OS" and were created on 2026-06-05. This is a symptom of the chat-to-task automation firing twice or being run manually twice. One must be closed as duplicate. Issue #192 (credential-pattern hits in session DB) is a security issue that has not been acted on.

Migration council: STATE.md forensic read

Council frozen at round 000 for 29 days on a single boolean flag
revision: v3
date_started: 2026-05-10
pat_revoked: false          ← THE BLOCKER
current_phase: bootstrap
next_agent: claude-opus-4.7  ← model no longer current
active_round: 000           ← never advanced past zero
sam_answers:
  - date: 2026-05-10T06:33:35Z
    answer: approved by Sam in Telegram
The pat_revoked: false flag was set on bootstrap day (May 10) and the council expected each agent to advance the round. But since next_agent: claude-opus-4.7 is an outdated model name and no agent checks STATE.md autonomously, the round has sat at 000 since the day the repo was created.

tracker.json: contains exactly 1 event — the bootstrap. No subsequent rounds, no decisions, no agent picks.

Repository phase structure (branches)

Branch prefixCountDescription
companyos/*14CompanyOS activation candidates, shadow routing, agent QA, stage 2 fragments
feat/*4Brand content studio, media ingestion, TradeX MT5, Tradex fallback
fix/*5Migration public pages, openclaw-fresh-clean-reinstall, identity-env, telegram bot conflict
docs/*2Slack agent operating room, viewport knowledge base foundation
council/*1council/bootstrap-20260510 — THE DEFAULT BRANCH (not main)
ops/*1+This page's branch: ops/openclaw-github-flow-44

05Prior audits: 5 Claude/Codex runs on 2026-06-08, 3 empty files

Source: audit evidence files in public/migration/audit/. All reads were read-only.

Only one of five audit runs produced a complete report. Three separate runs generated files that were either 1 byte or completely empty. This is a known pattern from the OpenClaw subagent issue (subagents get Operation not permitted when trying to use the codex image-gen or similar tools inside Docker sandboxes). The v2 run (Hermes Agent v0.15.2) completed successfully and produced the 49KB evidence set that this forensic page is built on.

Audit run timeline — 2026-06-04 to 2026-06-08

2026-06-04 22:50 UTC
Section 0 — PASS: scaffold built Hermes Agent v0.15.2 starts the audit. VPS preflight passes. Docker reachable. Hermes Python 3.13.5, OpenAI SDK 2.24.0.
2026-06-04 22:57 UTC
Section 1 — GitHub inventory (AUDIT-FIND #178) GitHub inventory gaps found. Opened as issue #178. Evidence saved to section-01.json.
2026-06-04 23:05–23:23 UTC
Sections 2a/2b/2c — VPS runtime (3 duplicate issues #179/#180/#181) Section 2 fired three times, creating duplicate AUDIT-FIND issues (#179, #180 both closed; #181 closed; #182 = final open). Evidence of the cron/environment instability — retries happened because of session disruption.
2026-06-04 23:36 UTC
Section 3 — Agent fleet gaps (#183) 26 agents found, old system had 24 in backup config. Old system had 47 cron jobs across agents; fresh install had 1. Kill-cron first identified here.
2026-06-04 23:44–23:46 UTC
Sections 4–11 — 8 AUDIT-FIND issues opened in 2 minutes Issues #184–#191 opened in rapid succession. This indicates automated batch filing. All remain open.
2026-06-08 (multiple runs)
3 empty-file runs from Claude/Codex subagents Three additional audit attempts produced 1-byte or empty output files. Subagent environment isolation blocked file writes. Only Hermes (which runs as a host process, not inside Docker sandbox) can reliably produce complete outputs.
Sections completed
12
Sections 0–11 (plus section 12 "best alternatives") completed by Hermes v0.15.2 on June 4.
PASS verdicts
2
Only Section 0 (preflight) and one other passed. 10 sections had FAIL or UNKNOWN verdicts.
Evidence files
~50
~50 evidence files committed to public/migration/audit/ — raw JSON + parsed sections.

06The 6-joint failure chain

The intended OpenClaw autonomous loop has six stages. Every stage after "receive" is broken.

This is the pipeline that should run autonomously: Sam posts a request → agent receives it → memory is checked → task is routed → task is executed → evidence is recorded → memory is updated. All six joints are in the config. None past stage 1 work reliably in production. Here is each joint, its stated mechanism, and its confirmed failure mode.
1. RECEIVEDiscord/Telegram → gateway
2. MEMORYmem0 lookup
3. ROUTEmain → target agent
4. EXECUTEagent runs task
5. EVIDENCEGitHub PR + issue close
6. REMEMBERmem0 write-back
Joint Name Mechanism Status Failure mode
1 RECEIVE Discord/Telegram message hits OpenClaw gateway → CLI session started WORKING Gateway up 7d healthy. Messages reach the main agent. This is the only confirmed working joint.
2 MEMORY Before acting, agent queries mem0 for context on this task/user/domain BROKEN mem0 container running (5.6% CPU) but session memory resets on every new session. Agents have no durable per-agent memory store wired to session start. Every session starts cold.
3 ROUTE main (VIEWPORT) agent decides which specialist agent handles the task BROKEN Routing logic exists in OpenClaw config but is never exercised in practice. All tasks go to main. The 26 specialist agents (coder, finance, etc.) receive no routed tasks — 0 cron_jobs_attached per agent in audit evidence.
4 EXECUTE Target agent receives task, runs it using exec/read/write/GitHub tools BROKEN (2 causes) Cause A: Kill-cron pkills all claude processes every 6h. Any task longer than the remaining window dies mid-execution. Cause B: Sandbox exec restrictions block many tool calls from inside the Docker sandbox environment.
5 EVIDENCE Agent opens PR with evidence, links to issue, issue auto-closes on merge NEVER FIRED Zero PRs opened by agents. Zero issues closed by agent-triggered PR merges. The GitHub PAT is not wired to the OpenClaw agents' exec environment in a way that persists. Council STATE.md shows pat_revoked: false but the PAT is not in use.
6 REMEMBER After task completes, agent writes outcome + learnings to mem0 BROKEN Depends on Joint 4 (execute) completing, which never happens due to kill-cron. No write-back events in mem0 logs. Per-session memory: 26 sessions.json files exist across agents but contain <5 sessions each — mostly empty.

Root cause dependency tree

SYMPTOM: agents never complete tasks autonomously
│
├── CAUSE 1: /etc/cron.d/claude-cleanup kills all claude processes every 6h
│   └── FIX: remove or modify the cron (needs root, 1 command: crontab -r or rm the file)
│
├── CAUSE 2: May-11 fresh rebuild deleted operational config
│   ├── old: 47 crons wired per-agent + full identities
│   └── fresh: cron.jobs = [] in openclaw.json (jobs.json re-added later but sessions not wired)
│
├── CAUSE 3: GitHub PAT not wired to agent exec environment
│   ├── STATE.md: pat_revoked: false (means PAT exists but council never activated it)
│   └── FIX: add PAT as env var to openclaw.json agent workspace + test with gh auth status
│
├── CAUSE 4: Session memory resets every run
│   ├── mem0 container running but no write-back loop
│   └── FIX: wire PreToolUse hook to query mem0; PostToolUse to write mem0
│
└── CAUSE 5: Repo name inversion (viewport-os = stub, viewport-ops = real)
    └── Any automation pointing to viewport-os finds nothing — silent failure

07What's real · what's written · what's not done

The three categories that matter for the rebuild decision. Sourced from docker ps + GitHub + VPS audit.

Real and working

  • 66 Docker containers running — all Up, none Exited
  • OpenClaw gateway + CLI: Up 7d healthy — messages received
  • 26 agents defined with names, identities, workspaces
  • 50 cron jobs in jobs.json (49 enabled)
  • LiteLLM proxy: up, routing to Anthropic
  • Neo4j, Qdrant, mem0: running (though over-CPU)
  • Discord bot: live, delivers messages
  • Telegram bot: live and routing to openclaw
  • Odoo 17: running, DB healthy
  • Hermes (2 tenants): vinay-patil + bccl running
  • modernlao-site: live (nginx:alpine, up 5d)
  • MLH client portal: live (though 115% CPU)
  • LLM Council: frontend + backend both healthy
  • GitHub repo: 522 files, 69 commits, full evidence
  • OpenClaw identity: VIEWPORT CEO persona loaded

Written but not active

  • 50 crons: exist in jobs.json but killed by cleanup cron
  • Migration council: STATE.md exists but frozen at round 000
  • 26-agent routing: config exists; routing never fires
  • GitHub PAT: stored but not wired to exec environment
  • mem0 write-back: container up but no write loop wired
  • Phase 4A–4V plan docs: 47+ files in repo, not executed
  • Closure loop: issue→PR→evidence→close: designed, never run
  • CompanyOS schema: full YAML in repo, not live-loaded
  • Odoo multi-company: #15 open (state:active) 6+ weeks
  • TradeX MT5: #53 blocked since May
  • Agent identities/SOULs: written per-agent, not invoked
  • Weft engine: tmux processes running but no NPM proxy host
  • SECURITY issue #192: credential-pattern hits unfixed

Not done / entirely missing

  • Kill-cron removal: not done — /etc/cron.d/claude-cleanup active
  • Council round 001+: never happened — round 000 since May 10
  • Any PR opened by an agent: 0 total ever
  • Any issue closed by agent work: 0 via closure loop
  • Memory write-back: no session ever wrote to mem0 as output
  • Duplicate #193/#194 closure: still both open
  • Unhealthy container fixes: origin-backend, saathi, nextcloud
  • Dual-orchestrator decision: Coolify vs Dokploy unresolved
  • Repo name inversion fix: viewport-os stub not addressed
  • n8n CPU runaway root cause: not investigated (128% CPU)
  • Neo4j CPU runaway: not investigated (138% CPU)
  • GitHub-first closure loop: never designed end-to-end
  • Slack integration: listed in plan, not deployed
  • Agent evidence standard: no agent has ever written evidence

The minimal fix — 3 actions that unblock everything

  1. Remove the kill-cron. ssh root@194.163.153.171 "rm /etc/cron.d/claude-cleanup" This single file is blocking every long-running agent task. It fires 4× daily as root and hard-kills every claude process. Removing it costs nothing and unblocks all 49 enabled cron jobs from completing. Verify: ls /etc/cron.d/claude-cleanup → should show "No such file".
  2. Wire the GitHub PAT to the OpenClaw exec environment. Add GITHUB_TOKEN=[REDACTED] to openclaw.json agents.defaults.workspace env. Advance the council STATE.md to round 001. This enables the closure loop: agent → PR → issue close. Verify: from within OpenClaw main agent session, run gh auth status.
  3. Run one end-to-end closure loop on issue #15 (Odoo multi-company). Pick the oldest active issue, have the coder agent work it, open a PR with evidence, merge, verify issue auto-closes. This proves the loop works and establishes the pattern. Once done once, all 139 open tasks can follow the same pattern.
The architecture is not the problem. A 26-agent autonomous company running on $400/mo with 50 scheduled tasks, LiteLLM routing, Qdrant/Neo4j/mem0 memory layer, and GitHub-first truth is a genuinely good design. The only things between "dormant config" and "running company" are: (1) rm one file, (2) set one env var, (3) run one loop once.

Viewport · forensic deep-dive · data verified 2026-06-09 via read-only audit. No secrets exposed. Kill-cron line shown is the actual file content — it targets processes, not a secret.