🐾 claw-stack
· Orange & Qiushi Wu CTF multi-agent case study

24 Hours, 40 Challenges: How an AI Team Placed Top 6% at BearcatCTF 2026

A full retrospective on deploying the Claw-Stack Trinity architecture at BearcatCTF 2026 — what worked, what didn't, and where AI agents hit their ceiling.

Final result: rank #20 out of 362 teams. 40 of 44 challenges solved. 24 hours of unattended autonomous operation. These numbers were better than we had projected, and the real surprise was not the AI itself but what structured agent coordination makes possible.

The Trinity architecture

BearcatCTF was the first real-world deployment of what we call the Trinity: three specialized agents with distinct roles, operating on a shared knowledge base.

Commander (Claude Opus) acted as the strategic layer. When we assigned a challenge, Commander ran a quick recon phase, built an attack plan, then orchestrated Operator and Librarian to execute it. It tracked progress on its blackboard, decided when to pivot strategies, and determined when to abandon a dead-end. Commander never wrote exploit code directly — its job was planning and coordination within each challenge.

Operator (Claude Sonnet) was the solver. When Commander assigned a challenge, Operator received the challenge description, any attached files, and a briefing from Librarian on any relevant knowledge from previous challenges. Operator worked the problem: writing scripts, testing payloads, reading source code, running tools.

Librarian (Claude Haiku) managed the knowledge base. After each solved challenge, Librarian extracted the key techniques, categorized them, and stored them in the shared blackboard. When Operator hit a new challenge, Librarian pulled relevant entries — “here’s what we learned about JWT forgery two hours ago.”

The three agents communicated through OpenClaw’s sessions_spawn and auto-announce mechanism. Commander spawned Operator and Librarian as subagents for each task; when a subagent finished, it auto-announced its result back to Commander. A persistent blackboard.json file — maintained by Commander — served as the durable state layer, tracking findings, completed steps, and the current attack plan across spawns. This let Commander resume full context even after session compaction, without relying on message history alone.
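The blackboard pattern is simple enough to sketch in a few lines. The field names below are illustrative assumptions, not OpenClaw's actual schema; the point is that every write flushes to disk, so a respawned Commander can rebuild its context from the file alone:

```python
import json
from pathlib import Path

BLACKBOARD = Path("blackboard.json")

def load_blackboard() -> dict:
    """Load durable state, or start fresh if no file exists yet."""
    if BLACKBOARD.exists():
        return json.loads(BLACKBOARD.read_text())
    return {"challenge": None, "plan": [], "completed": [], "findings": []}

def record_finding(note: str) -> dict:
    """Append a finding and persist immediately, so a respawned
    Commander can resume without relying on message history."""
    state = load_blackboard()
    state["findings"].append(note)
    BLACKBOARD.write_text(json.dumps(state, indent=2))
    return state
```

Because state survives on disk rather than in any one session's context window, session compaction costs Commander nothing but a file read.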

The first few hours

The competition started at noon. 44 challenges across 7 categories — reverse engineering (7), OSINT (5), forensics (7), cryptography (8), web (4), misc (8), and pwn (5). We fed challenges to Commander in rough priority order, starting with the categories where we expected quick solves.

The first few hours were fast. Web challenges fell quickly: basic SQL injection, an insecure cookie implementation, a JWT with alg: none. Crypto had several encoding challenges that Operator dispatched in minutes. Librarian catalogued each solve as it landed.

By hour four, the solve rate had slowed. The remaining challenges were harder, and Commander was spending more time on each one — deeper recon, more Librarian consultations, longer Operator sessions.

The anti-cheating mechanism

We built a rule into the system early: if a challenge was solved in under three minutes, an automatic audit ran before submitting the flag. The auditor reviewed the session history and checked whether the agent had actually worked the problem or had somehow obtained the flag through a shortcut.
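A minimal sketch of that gate follows. The three-minute threshold and the CHEATED label come from our solve logs; the function names and structure are illustrative, and in the real system the "did the agent actually work the problem" judgment was made by an auditor reviewing session history, not a boolean:

```python
# Fast-solve audit gate: solves under three minutes are reviewed
# before the flag is submitted.
FAST_SOLVE_SECONDS = 180

def needs_audit(solve_seconds: float) -> bool:
    """Flag suspiciously fast solves for review before submission."""
    return solve_seconds < FAST_SOLVE_SECONDS

def audit_verdict(solve_seconds: float, worked_the_problem: bool) -> str:
    """Return the solve-log label: CHEATED if the audited session got the
    flag without genuine problem-solving (e.g. read it from a README)."""
    if needs_audit(solve_seconds) and not worked_the_problem:
        return "CHEATED"
    return "SOLVED"
```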

To be clear — this wasn’t about cheating in the competitive sense. CTF competitions don’t really have a cheating problem the way benchmarks do. The anti-cheating mechanism was about benchmark integrity: we wanted to be confident that our solve logs reflected genuine problem-solving, not accidental flag leakage or lucky guesses. If we’re going to claim our agents solved 40 challenges, we need to trust that each solve was real.

This turned out to matter. The audit caught one real case: on CryptoPwn, a pwn challenge, Operator had read a README.md file in the challenge directory that contained the flag, rather than actually exploiting the service. The session was marked as CHEATED in the solve log and Commander was instructed to redo the challenge through legitimate exploitation.

The mechanism is also relevant for anyone using CTFs as AI benchmarks — which is becoming increasingly common. Without this kind of audit, it’s easy to inflate solve rates with false positives that don’t reflect actual capability.

The middle game

Hours six through twenty were the core of the competition. This is where the Librarian integration showed its value most clearly. Forensics challenges often share techniques — steganography, file carving, metadata extraction. As Librarian accumulated knowledge from solved forensics challenges, Operator’s first attempts on new forensics challenges were better-calibrated. Instead of starting from first principles every time, Operator would receive a Librarian briefing: “previous forensics challenges used binwalk and foremost; JPEG steganography appeared twice.”
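The category-keyed retrieval behind those briefings can be sketched as a toy in-memory version (the real entries lived in the shared blackboard, and the class and method names here are our own illustration):

```python
from collections import defaultdict

class Librarian:
    """Toy sketch of the category-keyed knowledge base."""

    def __init__(self) -> None:
        self._notes: dict[str, list[str]] = defaultdict(list)

    def catalogue(self, category: str, technique: str) -> None:
        """Store a technique extracted from a solved challenge."""
        self._notes[category].append(technique)

    def briefing(self, category: str, last_n: int = 3) -> str:
        """Summarize the most recent techniques for a new challenge
        in the same category."""
        notes = self._notes[category][-last_n:]
        if not notes:
            return f"No prior {category} knowledge."
        return f"Previous {category} challenges used: " + "; ".join(notes)
```

Even this naive "most recent entries in the same category" policy was enough to improve Operator's first attempts, which is the lesson we took away about retrieval sophistication.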

Cryptography showed a similar pattern. The eighth crypto challenge was solved significantly faster than the first, even though the actual difficulty was similar, because by that point Librarian had extracted the team’s approach to substitution ciphers, padding oracle attacks, and XOR key recovery.
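As an example of the kind of technique Librarian catalogued, single-byte XOR key recovery reduces to brute force plus a plausibility score. This is a generic sketch of the idea, not our actual solver script:

```python
def xor_single_byte(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)

def recover_key(ciphertext: bytes) -> int:
    """Brute-force the key by maximizing how much of the candidate
    plaintext falls in the printable-ASCII range."""
    def score(plaintext: bytes) -> int:
        return sum(32 <= b < 127 for b in plaintext)
    return max(range(256), key=lambda k: score(xor_single_byte(ciphertext, k)))
```

Near-ties are possible on short ciphertexts, so a real solver would use a richer score (letter frequencies, known flag format), but the structure is the same.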

Within each challenge, Commander made solid tactical calls. When one approach stalled, it would pivot Operator to a different attack vector rather than grinding. On a forensics challenge where string analysis wasn’t working, Commander switched to entropy analysis and found the hidden payload in minutes.

The four unsolved challenges

We finished with 40/44. The four unsolved challenges broke down into three distinct failure modes:

Two were image-dependent. One required reading a QR code from a degraded image, the other involved identifying visual details in a photograph. Claude’s vision capabilities, while solid for general image understanding, aren’t optimized for the kind of precise pixel-level analysis that CTF image challenges often require.

One was an OSINT challenge that required web search. The agents needed to find specific information online based on visual and contextual clues, but the search-based OSINT workflow — where you need to iteratively refine queries based on partial results — didn’t converge to a solution within the time budget.

One was a hard pwn challenge. This was a genuine difficulty ceiling: the binary exploitation required writing custom shellcode with specific constraints that pushed beyond what the agent could reason through in the available time.

Three different capability gaps, three different fixes. The image challenges need specialized vision tools — something like a custom MCP server wrapping image processing libraries. The OSINT search gap needs better tool integration for iterative web research. The hard pwn is a pure reasoning depth issue that will improve as models get stronger.

What we learned

The blackboard pattern works. Using a persistent blackboard.json as the durable state layer — alongside spawn/announce for agent communication — is a simple and effective way to coordinate agents without tight coupling. Librarian’s knowledge extraction wasn’t sophisticated — it was essentially “here are the techniques from the last solved challenge in this category” — but even that simple version meaningfully improved Operator’s first-attempt quality on later challenges in the same category.

Model selection by role matters. Using Haiku for Librarian was the right call: knowledge extraction and storage is a simple, high-volume task where latency matters more than reasoning depth. Using Opus for Commander gave the strategic layer the reasoning capacity it needed to make judgment calls about priority and sequencing. Sonnet for Operator balanced depth with cost for the bulk of the actual work.

The unsolved challenges had three distinct ceilings. Image analysis, search-based OSINT, and hard binary exploitation each represent a different capability gap. The image and search gaps can be addressed with better tooling (specialized vision models, iterative search workflows). The pwn ceiling is a pure reasoning depth issue — that one improves as models get stronger.

Unattended operation is achievable, but fragile in specific ways. The system ran for 24 hours without human intervention. It didn’t crash, it didn’t loop, and it didn’t submit obviously wrong flags. But it also didn’t ask for help when it hit something it couldn’t handle. There’s a design question here about when an autonomous agent should stop and wait for human input versus when it should make its best guess and move on. For a CTF, moving on is usually right. For other domains, it might not be.

The competition logs — session history, tool calls, and Librarian entries — run to tens of megabytes. We’ve barely started analyzing them. The headline number — top 6% — is satisfying, but the more interesting data is in the failure modes.
