🐾 claw-stack

Last updated: March 2026

BearcatCTF 2026 Case Study

At BearcatCTF 2026, Claw-Stack's Trinity architecture competed autonomously — no human solved any challenge. The system placed #20 out of 362 teams (top 6%), solving 40 of 44 challenges in 48 hours.

#20

Final rank

362

Total teams

40/44

Challenges solved

48h

Competition window

The Trinity Architecture

The CTF system used a specialized three-agent configuration called the Trinity. Each agent has a distinct role, model, and permission boundary. They coordinate through a shared blackboard — a persistent key-value store that tracks challenge state, discovered credentials, and failed approaches.

Commander

CIPHER Claude Opus 4

The strategic brain. CIPHER does full lifecycle management of each challenge: reading the challenge description, decomposing it into sub-tasks, maintaining the blackboard, spawning Operator instances for execution, and consulting Librarian for knowledge gaps. CIPHER never executes system commands directly.

Spawn Operator/Librarian instances
Full blackboard read/write
Cannot read flag files directly
Cannot access systems outside CTF scope
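The decomposition-and-dispatch loop described above can be sketched as a priority queue of sub-tasks. This is an illustrative sketch, not the actual implementation: the task strings, priorities, and queue shape are all assumptions.

```python
import heapq

# Hypothetical sketch of CIPHER decomposing one challenge into a
# prioritized sub-task queue (lower number = higher priority).
# Sub-task wording and priorities are illustrative only.
queue = []
for priority, subtask in [
    (1, "recon: enumerate open ports on the target"),
    (2, "research: ask SAGE for known CVEs matching the service banner"),
    (3, "exploit: write and run a PoC against the vulnerable service"),
]:
    heapq.heappush(queue, (priority, subtask))

order = []
while queue:
    priority, subtask = heapq.heappop(queue)
    # In the real system, CIPHER would hand each popped sub-task
    # to a fresh Operator (GRUNT) instance rather than collect it.
    order.append(subtask)
```

Because `heapq` pops the smallest tuple first, recon always precedes research, which precedes exploitation, regardless of insertion order.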

Operator

GRUNT Claude Sonnet 4

The tactical executor. GRUNT receives a specific sub-task from CIPHER with full context from the blackboard, executes shell commands and exploit scripts in isolated Docker containers, reports results back as structured JSON, and handles micro-level errors (permission issues, missing dependencies) without bothering CIPHER. GRUNT's context resets between tasks — it is stateless by design.

Shell exec in CTF containers
Write exploit scripts
Cannot attack real external systems
Cannot run binaries on host macOS
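GRUNT's structured-JSON reporting might look like the following sketch. The field names (`task_id`, `status`, `stdout_tail`, `artifacts`) are assumptions for illustration; the source does not document the actual schema.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional

# Hypothetical shape of a GRUNT task result. Field names are
# illustrative, not taken from the actual system.
@dataclass
class TaskResult:
    task_id: str
    status: str                       # e.g. "success" | "failure"
    stdout_tail: str                  # last lines of command output
    artifacts: list = field(default_factory=list)  # files produced
    error: Optional[str] = None

def report(result: TaskResult) -> str:
    """Serialize a result so CIPHER can parse it deterministically."""
    return json.dumps(asdict(result), sort_keys=True)

payload = report(TaskResult(
    task_id="pwn-03/step-2",
    status="success",
    stdout_tail="[*] leaked libc base",
    artifacts=["exploit.py"],
))
```

Returning a fixed schema rather than free-form text is what lets a stateful commander consume results from many stateless executors without ambiguity.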

Librarian

SAGE Claude Haiku 4

The knowledge specialist. SAGE handles all research tasks so CIPHER and GRUNT can stay focused on execution. It searches the local CTFKnowledges database for relevant techniques, queries CTFTools for available tools and usage patterns, and performs web searches for CVEs and writeups when local knowledge is insufficient. It returns a maximum of 3 results to avoid context bloat.

Local knowledge base search
Web search and CVE lookup
Cannot execute system commands
Read-only access except own lessons.md
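The 3-result cap could be enforced with a simple ranked lookup like the sketch below. The knowledge-base entries and the keyword-overlap scoring are invented for illustration; the real CTFKnowledges schema and ranking are not documented here.

```python
# Illustrative sketch of SAGE's capped retrieval. Entries and
# scoring are assumptions; only the hard cap mirrors the design.
MAX_RESULTS = 3

KNOWLEDGE_BASE = [
    {"topic": "rsa small-e attack", "body": "..."},
    {"topic": "jwt alg=none forgery", "body": "..."},
    {"topic": "rop chain basics", "body": "..."},
    {"topic": "rsa common modulus", "body": "..."},
]

def search(query: str, kb=KNOWLEDGE_BASE, k=MAX_RESULTS):
    """Rank entries by naive keyword overlap; return at most k."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(e["topic"].split())), e) for e in kb]
    scored = [(s, e) for s, e in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]

hits = search("rsa attack small exponent")
```

Capping at `k` results regardless of how many entries match is what keeps the retrieval step from flooding CIPHER's context window.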

The Blackboard

The shared blackboard was the critical innovation that prevented duplicate work and preserved state across CIPHER's long-running sessions. It tracked:

  • Challenge state: unsolved / in-progress / solved / abandoned
  • Discovered assets: IPs, ports, service banners, credentials found
  • Failed attempts: approaches that didn't work, to prevent repetition
  • Flags captured: confirmed flag strings submitted to the scoreboard
  • GRUNT task queue: pending sub-tasks with priority ordering
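The state tracked above could be persisted in a small JSON-backed store like this sketch. The file name, schema, and method names are assumptions, and a real implementation would need locking for concurrent GRUNT instances.

```python
import json
import pathlib

# Minimal sketch of the shared blackboard described above.
# Schema and file name are illustrative assumptions.
class Blackboard:
    def __init__(self, path="blackboard.json"):
        self.path = pathlib.Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {"state": {}, "assets": {}, "failed": {},
                         "flags": {}, "queue": []}

    def save(self):
        # Persist so state survives across long-running sessions.
        self.path.write_text(json.dumps(self.data, indent=2))

    def set_state(self, challenge, state):
        assert state in {"unsolved", "in-progress", "solved", "abandoned"}
        self.data["state"][challenge] = state

    def record_failure(self, challenge, approach):
        # The failed-attempt log that prevents duplicate work.
        self.data["failed"].setdefault(challenge, []).append(approach)

    def already_failed(self, challenge, approach):
        return approach in self.data["failed"].get(challenge, [])

bb = Blackboard()
bb.set_state("heap-02", "in-progress")
bb.record_failure("heap-02", "tcache poisoning via off-by-one")
```

Before dispatching a sub-task, the commander would check `already_failed` so an approach logged as a dead end is never retried.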

Challenge Category Breakdown

Category                    Solved        Notes
Cryptography                8/8           SAGE's knowledge base contained most attack patterns
Misc                        7/8           One challenge required image analysis beyond current capabilities
Reverse Engineering         6/7           One challenge involved visual pattern recognition the system lacks
Forensics                   7/7           Strong performance across memory dumps, disk images, and packet captures
Binary Exploitation (Pwn)   5/5           GRUNT handled buffer overflows, ROP chains, and format strings
OSINT                       3/5           Image-based reconnaissance limited by weak visual analysis capabilities
Web                         4/4           GRUNT excelled at SQLi, SSRF, and JWT forgery
Total                       40/44 (91%)   #20 / 362 teams — top 6%

Lessons Learned

01

Blackboard prevents repetition. Without a failed-attempt log, GRUNT repeatedly retried the same dead-end approaches on heap challenges. Once the blackboard was in place, failed approaches were logged and never revisited.

02

Stateless GRUNT scales well. Running GRUNT as a stateless executor (context reset per task) allowed CIPHER to spawn multiple parallel GRUNT instances without context window conflicts.
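The scaling property described here follows directly from statelessness: each task carries its full context, so executors share nothing and can run in parallel. A minimal sketch, with `run_grunt` standing in for spawning a fresh Operator instance (the function and task shapes are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

# Each task bundles its own context, so workers need no shared state.
# run_grunt is a stand-in for launching a fresh, stateless Operator.
def run_grunt(task):
    context, command = task
    # A real worker would execute `command` inside an isolated
    # container, using only `context` -- never a shared session.
    return {"command": command, "ok": True}

tasks = [
    ("ctx for web-01", "probe login endpoint"),
    ("ctx for pwn-02", "fuzz input length"),
]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_grunt, tasks))
```

Because no worker mutates shared state, adding more parallel instances never creates context-window conflicts between them.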

03

Haiku for knowledge retrieval is cost-effective. SAGE used Claude Haiku 4, which returned answers quickly and cheaply. Most knowledge retrieval does not require frontier-model reasoning: it is search and lookup, not synthesis.

04

Image analysis is the current bottleneck. The 4 unsolved challenges (1 rev, 2 OSINT, 1 misc) all required visual/image analysis — recognizing patterns in images, reading text from screenshots, or interpreting visual clues. This is a known weakness of current LLM-based agent systems.

Frequently Asked Questions

Did any human solve challenges during the competition?

No. The system ran fully autonomously for the entire 48-hour window. The human operator monitored the dashboard but did not intervene in any challenge. All 40 flags were captured and submitted by the Trinity system without human assistance.

What were the 4 unsolved challenges?

One reverse engineering challenge, two OSINT challenges, and one misc challenge. All four required image or visual analysis — recognizing patterns, reading text from images, or interpreting visual clues — which is a known limitation of current LLM-based agent systems.

Authors: Qiushi Wu & Orange 🍊