body{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;max-width:800px;margin:0 auto;padding:2rem;line-height:1.7;color:#e0e0e0;background:#0a0a0a}h1{font-size:2rem;border-bottom:2px solid #3b82f6;padding-bottom:.5rem;color:#fff}h2{font-size:1.5rem;color:#3b82f6;margin-top:2.5rem}h3{font-size:1.2rem;color:#60a5fa}code{background:#1a1a2e;padding:2px 6px;border-radius:3px;font-size:.9em;color:#a5b4fc}pre{background:#1a1a2e;padding:1rem;border-radius:8px;overflow-x:auto;border-left:3px solid #3b82f6}table{width:100%;border-collapse:collapse;margin:1rem 0}th,td{padding:.6rem;text-align:left;border-bottom:1px solid #2a2a3e;font-size:.95rem}th{background:#1a1a2e;color:#3b82f6}.callout{background:#1a1a2e;border-left:4px solid #3b82f6;padding:1rem 1.5rem;margin:1.5rem 0;border-radius:0 8px 8px 0}.metric{display:inline-block;background:#1a1a2e;padding:.5rem 1rem;border-radius:8px;margin:.25rem;border:1px solid #2a2a3e}.metric strong{color:#3b82f6;font-size:1.3em}a{color:#60a5fa}

Part of the Forge Neural Registry

How Forge Heals Itself

Antifragile Architecture for Autonomous AI Agent Systems

By Jason MacDonald | Built with Forge + Claude

The thesis: Most AI agent systems break and wait for a human to fix them. Forge breaks, diagnoses itself, fixes itself, and becomes stronger from the failure. This paper documents how, with real production data from a weekend where 35 problems were detected and resolved.

The Self-Healing Stack

LAYER 1: forge-doctor.sh (every 15 min)
  6 infrastructure checks. Auto-fixes what it can.

LAYER 2: Conductor failure recovery (every 30 min)
  Scans failed tasks. Phantom recovery. Deploy recovery. Escalation.

LAYER 3: /root-cause skill (auto-triggered at strike 2+)
  5-layer backward trace. Same-level trap detector.

LAYER 4: /flag skill v3 (auto-triggered by smoke-test)
  5-step atomic cycle: log → classify → fix → governance → verify.

LAYER 5: /smoke-test + verify cards
  Tests against CORE-TRUTH. NEEDS_FIX triggers /flag. Cards = human-judgment ONLY.

Case Study: PROB-030 (5 Strikes)

Strike	Fix	Level	Held?
1	forge-doctor resets stale every 15min	L3	NO
2	Heartbeat files	L3	NO
3	pgrep concurrency count	L3	NO
4	Manual clearing	L3	NO
5	Re-architecture: poller WAITS for Ralph	L2	YES

Then we climbed higher: 67% of ALL failures came from the decomposer creating vague tasks Ralph couldn't execute. Fix the decomposer input quality (Level 1) → most failures eliminated at source.

The lesson: Four fixes at Level 3. All failed. One fix at Level 2 eliminated the class. Then one fix at Level 1 eliminated the source. Never fix at the same level twice.

What Makes This Different

Capability	Other Frameworks	Forge
Error classification	✗	3-tier: pattern match → PROB registry → LLM
Abstraction climbing	✗	Strike 2+ traces up a level automatically
Self-diagnosis	✗	forge-doctor, 6 checks, every 15 min
Governance enforcement	✗	Skill chains — can't skip steps
Vision anchoring	✗	CORE-TRUTH.md loaded before every build
Learns from corrections	✗	Every correction → governance rule

← Back to Neural Registry | 12 Thinking Frameworks →

]]>