Last updated: March 2026
BearcatCTF 2026 Case Study
At BearcatCTF 2026, Claw-Stack's Trinity architecture competed autonomously — no human solved any challenge. The system placed #20 out of 362 teams (top 6%), solving 40 of 44 challenges in 48 hours.
- Final rank: #20
- Total teams: 362
- Challenges solved: 40/44
- Competition window: 48h
The Trinity Architecture
The CTF system used a specialized three-agent configuration called the Trinity. Each agent has a distinct role, model, and permission boundary. They coordinate through a shared blackboard — a persistent key-value store that tracks challenge state, discovered credentials, and failed approaches.
Commander: CIPHER (Claude Opus 4). The strategic brain. CIPHER manages the full lifecycle of each challenge: reading the challenge description, decomposing it into sub-tasks, maintaining the blackboard, spawning Operator instances for execution, and consulting the Librarian for knowledge gaps. CIPHER never executes system commands directly.
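The lifecycle above can be sketched as a simple orchestration loop. Everything here is illustrative: `decompose`, `execute`, and `research` stand in for model calls to CIPHER, an Operator, and the Librarian, and the result schema (`flag`, `needs_research`, `summary`) is an assumption, not CIPHER's documented interface.

```python
def solve_challenge(desc, decompose, execute, research, max_rounds=5):
    """Hypothetical Commander loop: decompose a challenge, delegate
    sub-tasks to an Operator, consult the Librarian on knowledge gaps,
    and return the flag if one is found."""
    subtasks = decompose(desc)          # strategic decomposition
    notes = []                          # accumulated context for later tasks
    for _ in range(max_rounds):
        if not subtasks:
            break
        task = subtasks.pop(0)
        result = execute(task, context=notes)   # delegated, never run directly
        if result.get("flag"):
            return result["flag"]
        if result.get("needs_research"):
            notes.append(research(result["needs_research"]))
        notes.append(result.get("summary", ""))
    return None
```

The key property is that the Commander only routes information; all command execution happens behind `execute`.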
Operator: GRUNT (Claude Sonnet 4). The tactical executor. GRUNT receives a specific sub-task from CIPHER with full context from the blackboard, executes shell commands and exploit scripts in isolated Docker containers, reports results back as structured JSON, and handles micro-level errors (permission issues, missing dependencies) without escalating to CIPHER. GRUNT's context resets between tasks: it is stateless by design.
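A minimal sketch of how an Operator-style executor might run one sub-task in a throwaway container and report structured results. The `ctf-sandbox` image name, the `runner` prefix, and the result fields are assumptions for illustration; GRUNT's actual sandboxing setup is not documented here.

```python
import subprocess

def run_subtask(command,
                runner=("docker", "run", "--rm", "--network=none", "ctf-sandbox"),
                timeout=300):
    """Execute one shell command in an isolated environment and return
    a structured result, so the Commander never parses raw terminal output.
    Pass runner=() to run directly on the host (e.g. for testing)."""
    try:
        proc = subprocess.run(
            [*runner, "sh", "-c", command],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0,
                "stdout": proc.stdout,
                "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # Micro-level failure handled locally; reported, not escalated.
        return {"ok": False, "stdout": "", "stderr": "timeout"}
```

Because the function holds no state between calls, any number of these executors can run side by side, matching GRUNT's stateless design.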
Librarian: SAGE (Claude Haiku 4). The knowledge specialist. SAGE handles all research tasks so CIPHER and GRUNT can stay focused on execution. It searches the local CTFKnowledges database for relevant techniques, queries CTFTools for available tools and usage patterns, and performs web searches for CVEs and writeups when local knowledge is insufficient. It returns a maximum of 3 results to avoid context bloat.
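The capped-results behavior can be sketched with a toy retrieval function. The keyword-overlap scoring below is purely illustrative; the real CTFKnowledges/CTFTools query interface is not documented here. Only the cap (at most 3 results) comes from the description above.

```python
def top_k(query, documents, k=3):
    """Return at most k documents ranked by naive keyword overlap
    with the query, mirroring the Librarian's 3-result cap."""
    terms = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(terms & set(doc.lower().split()))
        if overlap:                       # skip irrelevant documents entirely
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: -pair[0])
    return [doc for _, doc in scored[:k]]
```

Capping the return size keeps the Commander's context window from filling with marginal search hits.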
The Blackboard
The shared blackboard was the critical innovation that prevented duplicate work and preserved state across CIPHER's long-running sessions. It tracked:
- Challenge state: unsolved / in-progress / solved / abandoned
- Discovered assets: IPs, ports, service banners, credentials found
- Failed attempts: approaches that didn't work, to prevent repetition
- Flags captured: confirmed flag strings submitted to the scoreboard
- GRUNT task queue: pending sub-tasks with priority ordering
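A minimal sketch of a blackboard with the fields listed above. The class and method names are assumptions for illustration; the actual store's schema and persistence mechanism are not documented here.

```python
import heapq
import json

class Blackboard:
    """Toy shared store tracking challenge state, assets, dead ends,
    captured flags, and a priority-ordered task queue."""

    STATES = {"unsolved", "in-progress", "solved", "abandoned"}

    def __init__(self, path=None):
        self.path = path            # optional file for persistence
        self.challenges = {}        # challenge name -> record
        self.task_queue = []        # heap of (priority, task)

    def _get(self, name):
        return self.challenges.setdefault(name, {
            "state": "unsolved",
            "assets": {},           # IPs, ports, banners, credentials
            "failed_attempts": [],
            "flag": None,
        })

    def set_state(self, name, state):
        assert state in self.STATES
        self._get(name)["state"] = state

    def log_failure(self, name, approach):
        """Record a dead end so no executor retries it."""
        self._get(name)["failed_attempts"].append(approach)

    def already_failed(self, name, approach):
        return approach in self._get(name)["failed_attempts"]

    def capture_flag(self, name, flag):
        rec = self._get(name)
        rec["flag"], rec["state"] = flag, "solved"

    def push_task(self, priority, task):
        heapq.heappush(self.task_queue, (priority, task))

    def pop_task(self):
        """Return the highest-priority (lowest number) pending task."""
        return heapq.heappop(self.task_queue)[1] if self.task_queue else None

    def save(self):
        if self.path:
            with open(self.path, "w") as f:
                json.dump(self.challenges, f)
```

The failed-attempts log is what the "Lessons Learned" section below credits with stopping repeated dead-end approaches.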
Challenge Category Breakdown
| Category | Solved | Notes |
|---|---|---|
| Cryptography | 8/8 | SAGE's knowledge base contained most attack patterns |
| Misc | 7/8 | One challenge required image analysis beyond current capabilities |
| Reverse Engineering | 6/7 | One challenge involved visual pattern recognition the system lacks |
| Forensics | 7/7 | Strong performance across memory dumps, disk images, and packet captures |
| Binary Exploitation (Pwn) | 5/5 | GRUNT handled buffer overflows, ROP chains, and format strings |
| OSINT | 3/5 | Image-based reconnaissance limited by weak visual analysis capabilities |
| Web | 4/4 | GRUNT excelled at SQLi, SSRF, and JWT forgery |
| Total | 40/44 (91%) | #20 / 362 teams — top 6% |
Lessons Learned
Blackboard prevents repetition. Without the failed-attempt log, GRUNT repeatedly tried the same approaches on heap challenges. Once the blackboard was implemented, dead-end approaches were not revisited.
Stateless GRUNT scales well. Running GRUNT as a stateless executor (context reset per task) allowed CIPHER to spawn multiple parallel GRUNT instances without context window conflicts.
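The fan-out pattern this lesson describes can be sketched in a few lines. The `worker` callable here is a stand-in for one stateless GRUNT invocation; the pool size and threading approach are assumptions, not the system's documented concurrency model.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tasks, worker, max_workers=4):
    """Fan sub-tasks out to independent, stateless workers.
    Each task carries its own context and nothing is shared between
    calls, so parallel instances cannot collide over context state."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))
```

Statelessness is what makes this safe: because each invocation starts from a fresh context, adding workers never requires coordinating a shared conversation history.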
Haiku for knowledge retrieval is cost-effective. SAGE used Claude Haiku 4, which returned answers quickly and at low cost. Most knowledge retrieval does not require frontier-model reasoning; it is search and retrieval, not synthesis.
Image analysis is the current bottleneck. The 4 unsolved challenges (1 rev, 2 OSINT, 1 misc) all required visual/image analysis — recognizing patterns in images, reading text from screenshots, or interpreting visual clues. This is a known weakness of current LLM-based agent systems.
Frequently Asked Questions
Did any human solve challenges during the competition?
No. The system ran fully autonomously for the entire 48-hour window. The human operator monitored the dashboard but did not intervene in any challenge. All 40 flags were captured and submitted by the Trinity system without human assistance.
What were the 4 unsolved challenges?
One reverse engineering challenge, two OSINT challenges, and one misc challenge. All four required image or visual analysis — recognizing patterns, reading text from images, or interpreting visual clues — which is a known limitation of current LLM-based agent systems.
Authors: Qiushi Wu & Orange 🍊