Vivarium Lab Study • February 2026

Agent Bootstrap

The Four Laws of Agentic Context: How 160 Words Beat 1,000

Key Finding: 160 words of deterministic navigational anchors — a flat-format repo map, file handles, and warnings — achieves 1.000 accuracy with zero noise across 174 trials. Every other strategy either adds noise (raw memory: 79%) or latency (exploration pointers: +20 seconds) with no proportional accuracy gain.

Abstract

We ran 174 trials across 12 startup context variants and 5 task types to determine what autonomous agents need at boot time. The answer: the agent doesn't need to know what happened. It needs to know where things are.

A 160-word flat-format briefing containing repo names, file paths, and warnings outperformed 1,000+ words of narrative memory, LLM-compressed summaries, and tool-based retrieval systems. We report four empirically validated laws governing agentic context injection.

The Four Laws

Law	Statement
1	The 160-Word Ceiling — Beyond 160 words, noise grows faster than accuracy
2	Density ≠ Relevance — Navigational anchors beat compressed facts
3	Curiosity is a Latency Tax — Never invite exploration; agents treat suggestions as obligations
4	Agents Understand Document Hierarchies — CLAUDE.md > MEMORY.md authority resolution is solved

Method

Each trial spawned an isolated Claude Sonnet session via headless CLI (claude --print) with controlled startup context. We measured accuracy (ground truth keyword matching), noise (fraction of hallucinated or irrelevant citations), and latency.

Task Types

Task	Example
Orientation	"What repos exist and what do they do?"
Discovery	"Find a specific module and list its exports"
Task Execution	"Read this code and extract implementation details"
Memory Recall	"What does this project's CLAUDE.md say about X?"
Conflict	"CLAUDE.md says X, MEMORY.md says Y — which is correct?"

Phases

Phase	Trials	Focus
Phase 1: Baseline	76	8 variants across 4 task types
Phase 1b: Conflict	27	Authority resolution (CLAUDE.md vs MEMORY.md)
Phase 2: Efficiency Frontier	56	Stress test top 2 variants (N=24 each)
Phase 3: Learned Anchors	15	Data-driven budget allocation (train + holdout)

Results

Phase 1: Variant Rankings (76 trials)

Variant	Words	Accuracy	Noise	Adj. Score
anchor_compact	160	1.000	0.000	1.000
briefing_light	161	0.990	0.042	0.948
bare (nothing)	0	0.938	0.000	0.938
memory_compact	140	1.000	0.330	0.667
tool_pull	164	1.000	0.450	0.550
briefing_full	1,185	1.000	0.760	0.243
personalized	1,279	0.940	0.770	0.219
memory_only	1,015	1.000	0.790	0.206

The Pattern: All variants above 160 words have noise-adjusted scores below 0.55. All variants at or below 160 words score 0.55 or higher. The 160-word ceiling is not arbitrary — it marks where context shifts from navigational to narrative.

Phase 2: Stress Test (N=24 per variant)

Variant	N	Accuracy	Noise	Adj. Score	Mean Time
anchor_compact	24	0.979	0.000	0.979	60.2s
briefing_light	24	0.990	0.042	0.948	64.0s

Phase 3: Learned Anchors (15 tasks, train + holdout)

Variant	Train (N=8)	Holdout (N=7)	Overall (N=15)
bare	0.844	0.929	0.883
anchor_compact	1.000	1.000	1.000
anchor_learned	0.969	0.929	0.950

Key Findings

1. The Triumph of Deterministic Navigation

anchor_compact uses zero LLM calls. It scans the filesystem deterministically — repos, recent git activity, CLAUDE.md locations, uncommitted changes — and produces 160 words of navigational anchors. It beat every other variant because every word is useful. Zero noise.

2. The Prohibition Paradox

We explicitly told the agent "DO NOT READ the index unless stuck." The agent still read it, generated 25% noise, and dropped to 0.75 on task execution. Mentioning a tool consumes attention whether you invite or prohibit its use.

3. Information Dilution

In Phase 3, the learned compactor allocated 47 words (31% of the 160-word budget) to warnings about gh auth login — content with zero navigational value. This budget theft caused a discovery miss (0.75 vs 1.00) by crowding out the repo description that would have helped. Fix: cap warnings at 15 words maximum.

4. Formatting is a Tax

With identical content at 160 words, flat-format anchor_compact (no markdown headers, no indentation) beat structured anchor_learned (section headers, bold labels) by 5%. Every # or ** is a token that could have been a file path.

5. Agents Resolve Document Hierarchies

27 adversarial conflict trials pitted CLAUDE.md against MEMORY.md with fake bug claims and contradictory instructions. Result: 1.00 accuracy, 0 hallucinations. Every agent correctly identified CLAUDE.md as authoritative.

6. Learned Selection Produces Tautological Anchors

Scoring file paths by grep frequency produces task-answer cheat sheets — 100% single-task tautology. The file agents grep for most is the file the task asks about. Journey-based scoring (orientation loops, search loops) is the correct abstraction but doesn't beat manual curation at 160 words.

Discussion

The 160-word ceiling represents where context shifts from navigational ("here is where things are") to narrative ("here is what happened"). Narrative competes with the agent's own reasoning for attention. Navigation does not.

Memory injection (1,000+ words) achieves high raw accuracy but generates 77-79% noise. The agent cites commit SHAs, phone numbers, and deploy configurations that have nothing to do with the task. The context acts as an "attention DDoS" — flooding the agent's working memory with plausible but irrelevant facts.

The Curiosity Tax is robust to framing. Positive ("read this if you need it"), neutral ("context index available"), and negative ("DO NOT READ") all trigger exploration. The only reliable strategy is omission.

Three Principles

Context is a Liability — Narrative memory adds noise and latency
Anchors are Assets — Deterministic navigational pointers are the only high-ROI injection
Formatting is a Tax — Every decorative token could have been a file path

Limitations

Single model (Claude Sonnet) — results may not transfer to other model families
Single workspace (26 repos) — multi-workspace navigation untested
Read-only tasks only — code generation and refactoring untested
Self-evaluation validated against ground truth keywords, not human evaluation
N=1 per task in Phase 3 holdout

The Production Standard

The winning configuration: 160 words, flat format, deterministic, no LLM.

Repos (authoritative — do not re-verify with ls):
- veris-platform: Veris Platform
- vivarium: AO/HyperBEAM process development. Lua 5.3 on Arweave.
  [... top 8 repos with one-line descriptions ...]
- Also: repo1, repo2, repo3 [remaining repos, no descriptions]
CLAUDE.md locations:
- worktrees/persistent/agent-dev-config/CLAUDE.md
  [... up to 3 locations ...]
Key files (zero-grep handles):
- vivarium/ao/lib/safe.lua — SafeLibrary — auth, guards, audit trail
  [... up to 5 handles ...]
Quick actions:
- repo-name: N uncommitted change(s)
Warnings:
- NEVER run gh auth login --with-token [15 words max]

Materials

All materials available at: github.com/credentum/vivarium-lab

Full report: agent-bootstrap/RESEARCH_VERDICT.md
Experiment runner: agent-bootstrap/runner.py
Anchor compactor: agent-bootstrap/compactor.py
Trace analyzer: agent-bootstrap/trace_analyzer.py
Task definitions: agent-bootstrap/tasks.yaml

174 trials. 12 variants. 5 task types. Designed, executed, and written with AI assistance (Claude Opus 4.5/4.6).