This is Part 00 of an ongoing series on building a custom agentic AI workflow. Part 01 covers evaluating prompt quality and agent performance with Braintrust. Future posts apply the workflow to real problems — from automated binary fuzzing to orchestrating my weekly dinner plans.
In Hacknet (a fun little game I played back in high school), you inherit a dead hacker's failsafe. Bit — the guy who built the most invasive security system on the planet — is gone, and his automated ghost starts feeding you instructions through the terminal. You follow the breadcrumbs, break through firewalls with real UNIX commands, and slowly realize that what Bit actually created was an operating system specialized in hacking. An autonomous system that could do what skilled humans do, but relentlessly and without sleep.
I've played through Hacknet multiple times, and the premise always stuck with me: what would it actually take to build something like that? Not a game — a real system where AI agents collaborate autonomously on security engineering tasks, each one specialized, each one operating through a shared orchestration layer that keeps the whole operation durable and recoverable when things go sideways.
This post breaks down the architecture behind a multi-agent workflow I built from scratch: a system where specialized AI agents collaborate through a durable orchestration layer to tackle complex, multi-step tasks autonomously. No LangChain. No CrewAI. Just Temporal, Pydantic, and a lot of deliberate design choices.
And yes — Claude Code wrote most of it. My will, its hands(?).
Why Build From Scratch?
The honest answer starts with a failed experiment. Before any of this, I tried building the workflow in n8n, which seemed like the fast path to a working prototype. I was using Gemini Pro's latest models, and even with very specific, descriptive tool-call examples and carefully engineered prompts, tool calls would fail in ways I couldn't diagnose. The visual workflow editor gave me speed, but I gave up visibility. When something broke, I was staring at a node that said "error" with no clear path to understanding why the model chose the wrong tool, why the structured output was malformed, or where in the chain the context had gone stale. I needed to see inside every decision the system was making, and n8n's abstraction layer was a wall between me and the problem.
So I scrapped it and started from zero. I wanted to understand every moving part. Frameworks like LangChain and CrewAI abstract away orchestration decisions that matter enormously when your agents are doing real work — routing tool calls, handling failures mid-pipeline, streaming partial results back to a user who's watching in real time. When something breaks, I want to know exactly what my system is doing.
The more strategic answer: I took heavy inspiration from a Praetorian blog post on deterministic AI orchestration, where they describe a philosophy of treating the LLM as a nondeterministic kernel process wrapped in a deterministic runtime.
Their core insight stuck with me — the bottleneck in agentic systems isn't model intelligence, it's context management and architectural determinism. Prompts are probabilistic. Your orchestration layer shouldn't be.
So I went with Temporal.io for durable workflow orchestration and PydanticAI for structured LLM interactions, and I built the glue in between myself. The result is a React/TypeScript frontend talking to a FastAPI backend, which signals a long-running Temporal workflow that orchestrates a pipeline of specialized agents — each running as an isolated Temporal activity.
The Agent Pipeline: Running a Crew
Here's the core philosophy: no single agent does everything. Every agent has one job, and the boundaries between those jobs are enforced structurally — not by prompt instructions. This mirrors the coordinator-executor pattern that Praetorian describes, but adapted for a conversational multi-turn workflow.
When a user sends a message, it enters the pipeline:
The Triage Agent is the fixer — it reads the user's request and breaks it into actionable goals. It doesn't execute anything. It scopes the job.
The Task Manager is the planner. It has read-only tools (search, file reading) and its job is to reason step-by-step about what needs to happen. It can take exactly four actions:
- spawn an executor
- mark a goal complete
- flag something as blocked
- ask for clarification
It does not, under any circumstances, write code or produce deliverables.
The Executor is the operator. Full tool access: code execution, file writing, web search. It receives a specific task from the Task Manager and works through it across multiple turns — reading files, running code, iterating on results, and producing structured output with artifact metadata. It's a full agent, not a one-shot function call. It reasons, acts, observes, and adapts until it's satisfied with its deliverable. What it doesn't do is plan the overall approach or decide what to work on next — that's the TM's job.
The Goal Judge only gets called when the Task Manager says "complete". It reviews the work against the original request and either approves or sends it back with feedback. This creates a built-in revision loop — the TM incorporates judge feedback and can re-dispatch the executor.
The Run Judge fires once at the end, after all goals are resolved. It produces a summary and selects which artifacts to surface to the user. It does not re-verify goals — that's not its job.
These boundaries matter. When I started, I had a single mega-agent that planned, executed, and judged its own work. It worked sometimes. But when it didn't, the failures were annoying and undebuggable. Splitting responsibilities gave me clean error boundaries and made each agent's context window dramatically more focused.
One detail worth calling out: every agent in this pipeline is genuinely autonomous within its scope, and I didn't have to build that machinery myself. PydanticAI's Agent.run() has a built-in agentic loop — the LLM calls tools, receives results, reasons over them, calls more tools, and continues iterating until it's ready to return structured output. All of that happens automatically within a single .run() invocation. None of my agents set max_retries or tool-call limits, so they all run on PydanticAI's defaults. This is a meaningful abstraction. I didn't have to write the observe-reason-act loop, the tool dispatch, or the output parsing. PydanticAI handles all of that, which let me focus on the orchestration between agents — the Temporal workflow, the incremental messaging, the signal/query protocol — rather than the inner mechanics of each agent's turn-taking. It's exactly the right layer of abstraction: enough control over what matters (agent boundaries, tool access, structured outputs), enough automation over what doesn't (the internal reasoning loop).
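To make that abstraction concrete, here is a hand-rolled sketch of the observe-reason-act loop that `Agent.run()` automates. Everything below (`FakeModel`, `ToolCall`, `run_agent`) is illustrative scaffolding I made up, not PydanticAI internals:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class FakeModel:
    """Stand-in for an LLM: returns a scripted sequence of decisions."""
    script: list  # each item is a ToolCall or a final answer (str)
    history: list = field(default_factory=list)

    def step(self, observation):
        self.history.append(observation)  # "observe"
        return self.script.pop(0)         # "reason" (scripted here)

def run_agent(model, tools, user_message, max_steps=10):
    """The loop Agent.run() handles for you: call the model, dispatch
    tool calls, feed results back, repeat until a final answer."""
    observation = user_message
    for _ in range(max_steps):
        decision = model.step(observation)
        if isinstance(decision, ToolCall):
            observation = tools[decision.name](**decision.args)  # act
        else:
            return decision  # final structured output
    raise RuntimeError("agent exceeded max_steps without finishing")

tools = {"read_file": lambda path: f"contents of {path}"}
model = FakeModel(script=[ToolCall("read_file", {"path": "auth.py"}),
                          "found the bug in auth.py"])
print(run_agent(model, tools, "find the bug"))  # → found the bug in auth.py
```

PydanticAI adds the parts this sketch waves away: real tool dispatch from model output, retries, and validated structured results.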
Temporal Backbone
Here's the problem with long-running AI pipelines: they crash. Workers restart. LLM calls time out. Network connections drop. If your entire multi-agent pipeline runs in a single Python process with no checkpointing, a failure at minute 4 of a 5-minute run means you eat the entire cost and start over.
Temporal solves this with three properties that turned out to be essential:
Durable execution. The full triage → task manager → executor → judge pipeline can take 30-60+ seconds. Temporal ensures the workflow survives worker restarts, deploys, and transient failures without losing progress. If a worker dies mid-execution, another worker picks up exactly where it left off.
Activity offloading. LLM calls are non-deterministic external API calls. Temporal's activity model isolates them from the deterministic workflow code, enabling replay-safe recovery. The workflow itself stays pure and replayable; all the messy I/O lives in activities.
Event History. Temporal maintains an append-only, durable log of all workflow events. This log is always readable regardless of what the workflow is currently doing — a property that turned out to be the key to solving the incremental messaging problem I'd been hitting since n8n.
The workflow is defined as a PydanticAIWorkflow subclass with a __pydantic_ai_agents__ class attribute that registers each agent. Combined with a PydanticAIPlugin, this auto-registers every agent's run() method as a Temporal activity. It's genuinely elegant — you declare your agents, and the framework handles the activity registration, serialization, and scheduling.
```python
@workflow.defn
class ResearchWorkflow(PydanticAIWorkflow):
    __pydantic_ai_agents__ = [
        temporal_triage,
        temporal_task_manager,
        temporal_executor,
        temporal_goal_judge,
        temporal_judge,
    ]
```

Context Engineering
The Praetorian post, and this project as a whole, made it clear that context management is the main problem to solve. When you're running a multi-turn Task Manager (TM) loop, every iteration adds to the step history: executor results, deliverables, judge feedback, error messages. By iteration six or seven, you've accumulated enough context that the TM prompt is pushing 85% of the model's context window. At that point, the next LLM call either fails outright or silently degrades.
The naive solution is truncation: just drop old steps. But step history is how the TM knows what it's already tried, what worked, what failed, and what the judge told it to fix. Blindly dropping that causes the TM to repeat failed approaches, forget artifacts it produced earlier, or lose track of key decisions. I watched this happen in testing. With n8n, you can provide message history to models, and you can configure how many messages to provide, but there is no easy path to context management as far as I can tell.
So instead of truncating, I built a compressor agent whose only job is to summarize old steps into a structured digest while preserving the most recent steps verbatim.
The mechanics work on a sliding window. Every TM call passes through a context check that estimates total token usage against the model's context window (using tiktoken for OpenAI models, with a character-heuristic fallback for others). Two thresholds control behavior: at 75% usage, the system emits a visible warning to the chat. At 85%, it triggers the compressor. When compression fires, the step history splits: old steps get fed to the compressor agent, while the most recent two steps are preserved word-for-word — because the TM needs exact details about what just happened to make good decisions about what to do next.
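A minimal sketch of that threshold check, using the numbers above (75%, 85%, two verbatim steps). The function names and the bare character heuristic are my illustration, not the actual implementation:

```python
WARN_THRESHOLD = 0.75      # emit a visible warning to the chat
COMPRESS_THRESHOLD = 0.85  # trigger the compressor agent
VERBATIM_TAIL = 2          # most recent steps preserved word-for-word

def estimate_tokens(text: str) -> int:
    # character-heuristic fallback (~4 chars/token);
    # OpenAI models get exact tiktoken counts in the real system
    return len(text) // 4

def check_context(steps: list[str], context_window: int):
    """Decide whether to warn, compress, or do nothing before a TM call."""
    usage = sum(estimate_tokens(s) for s in steps) / context_window
    if usage >= COMPRESS_THRESHOLD:
        # old steps go to the compressor; the tail stays verbatim
        old, recent = steps[:-VERBATIM_TAIL], steps[-VERBATIM_TAIL:]
        return ("compress", old, recent)
    if usage >= WARN_THRESHOLD:
        return ("warn", None, None)
    return ("ok", None, None)
```

The point of the tuple is that compression is a split, not a truncation: everything before the tail is summarized rather than dropped.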
The compressor produces structured output — not a free-text summary, but a CompressedContext object with distinct fields:
```python
from pydantic import BaseModel

class CompressedContext(BaseModel):
    summary: str                   # narrative summary (<500 tokens)
    key_decisions: list[str]       # 3-5 most important decisions
    unresolved_issues: list[str]   # open questions, pending items
    artifacts_summary: list[dict]  # [{index, description}]
```

This structure matters. The TM can act on "unresolved issues" as a checklist rather than mining them from prose. And the compressor's system prompt is explicit about what to preserve versus discard — key decisions and their reasoning, error messages, judge feedback, artifact references, and unresolved issues all survive compression. What gets dropped is the verbosity: redundant reasoning repeated across steps, verbose tool outputs already captured in artifacts, and duplicate information where only the most recent version matters. A step that says "searched 5 files, found the bug in auth.py, decided to use regex" carries the same planning value as the 2,000-token executor output it replaced.
One implementation detail worth mentioning: token counting uses tiktoken for OpenAI models (exact counts) with a character-heuristic fallback (~4 chars/token) for everything else, via a pluggable registry. The tiktoken encoding is cached per model with @functools.lru_cache and eagerly initialized at import time — because tiktoken's lazy registry initialization creates a threading.RLock, which would fail inside Temporal's deterministic sandbox. It's the kind of thing you only discover when your worker crashes with a threading error that makes no sense until you read Temporal's sandbox docs.
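A sketch of what such a pluggable counter registry might look like: tiktoken when available, character heuristic otherwise. The structure and names below are my assumptions; only the eager-initialization rationale comes from the post:

```python
import functools

def _char_heuristic(text: str) -> int:
    # fallback: ~4 characters per token
    return max(1, len(text) // 4)

try:
    import tiktoken

    @functools.lru_cache(maxsize=None)
    def _encoding_for(model: str):
        return tiktoken.encoding_for_model(model)

    def _tiktoken_counter(model: str):
        enc = _encoding_for(model)
        return lambda text: len(enc.encode(text))

    # Eagerly touch the registry at import time: tiktoken's lazy
    # initialization creates a threading.RLock, which fails inside
    # Temporal's deterministic sandbox if it happens in workflow code.
    _encoding_for("gpt-4o")
except Exception:
    tiktoken = None

COUNTERS: dict = {}  # explicit overrides: model name -> counter callable

def counter_for(model: str):
    if model in COUNTERS:
        return COUNTERS[model]
    # crude OpenAI detection for the sketch; the real registry
    # presumably keys on provider, not a string prefix
    if tiktoken is not None and model.startswith("gpt"):
        return _tiktoken_counter(model)
    return _char_heuristic
```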
Crucially, compression is iterative. Goals that run for many iterations (judge revisions, retries, clarification loops) may trigger compression multiple times. Each round, the compressor receives the previous compressed summary with instructions to incorporate it, so each new summary subsumes the last, just more condensed. The key information carries across compression rounds; the summaries stack.
Only the Task Manager gets compression. The other agents either don't accumulate unbounded history (the executor works on a single task) or have naturally bounded input (judges receive a fixed-structure prompt). The cost of a compression call is roughly 3,000-5,000 tokens — a good trade when you're compressing a 174K-token prompt down to 60K.
Clarification Loop
Before I built the clarification system, ambiguity was a dead end. If the triage agent couldn't decompose a vague message like "do something" into goals, it had to guess. If the task manager needed more information mid-goal — say, "which cloud provider should I target?" — NEEDS_CLARIFICATION was a terminal (💀) status. The goal was abandoned and the judge summarized it as incomplete. The user's only option was to start over with a better prompt.
This is the wrong tradeoff for a conversational system. Users send vague messages. Agents should be able to ask.
The solution leans on a Temporal primitive that turned out to be perfect for this: workflow.wait_condition(). When an agent needs clarification, the workflow emits the question as a clarification_request message and then calls wait_condition() — which genuinely suspends the workflow. No threads. No polling loops. No timeouts. Zero resource consumption. The workflow could wait for five seconds or five days — the cost is the same.
When the user responds in chat, the existing send_message signal fires. The workflow doesn't need a separate signal type — it disambiguates based on its own state. If awaiting_clarification is True, the message routes to the clarification response slot instead of the normal message queue. The wait condition triggers, and execution resumes exactly where it left off. This was a deliberate design choice: adding a separate send_clarification_response signal would mean the API, frontend, and workflow all need to know which signal to use when. Instead, the workflow — which is the only component that knows whether it's waiting — makes the routing decision. The frontend sends messages the same way regardless of context.
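The routing decision reduces to plain state logic. Here's a self-contained sketch (class and field names are mine; in the real workflow `send_message` is a Temporal signal handler and `wait_condition` watches `awaiting_clarification`):

```python
from collections import deque

class WorkflowState:
    """One signal, two destinations: the workflow alone knows
    whether it is waiting, so it makes the routing decision."""
    def __init__(self):
        self.awaiting_clarification = False
        self.clarification_response = None  # the "response slot"
        self.message_queue = deque()        # normal conversation messages

    def send_message(self, text: str):
        # The frontend sends messages the same way in both cases;
        # routing is purely internal state.
        if self.awaiting_clarification:
            self.clarification_response = text
            self.awaiting_clarification = False  # wait_condition unblocks
        else:
            self.message_queue.append(text)
```

The alternative, a dedicated `send_clarification_response` signal, would push this state knowledge out to the API and frontend, which is exactly the coupling the design avoids.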
This works at two points in the pipeline. At triage, if the user's request is too vague to decompose into goals, the triage agent returns structured clarification questions (via an optional clarification_questions field on TriageResponse) with an empty goals list. The workflow enters a clarification loop, and when the user responds, triage re-runs with the original message augmented by the user's answer — formatted as "original message\n\n[User clarification: response]" so the triage agent sees the full context in a single prompt without needing multi-turn conversation history. The loop can iterate multiple times: if the user's clarification is still too vague, triage can ask again.
At the task manager level, the TM's existing NEEDS_CLARIFICATION action — which was previously terminal — now pauses the workflow instead. The TM's summary field (which describes what's unclear) becomes the clarification question. When the user responds, the answer gets injected into the goal's step history as a synthetic step record, so the TM sees it as part of the goal's progress on its next iteration. No special handling needed in the TM's prompt — it just sees two new entries in "Previous Steps": the original clarification request and the user's response.
One distinction that matters: BLOCKED remains terminal. Blocked means something is fundamentally impossible — a missing dependency, a permission issue, an API that doesn't exist. NEEDS_CLARIFICATION means the agent needs more information that the user can provide. Conflating these would mean blocked goals wait indefinitely for a user response that won't help.
The frontend handles this cleanly too. The poll loop watches for clarification_request messages and exits early when it sees one — the run isn't done yet, so it skips the normal completion reconciliation. The input bar re-enables with a distinct placeholder ("respond to clarification..."). And on page refresh, the status endpoint now includes awaiting_clarification, so the frontend immediately restores the correct UI state without the user needing to re-send anything.
Incremental Messaging
When I was using n8n, I could stream LLM token output fine — text appearing chunk by chunk. But I couldn't stream tool calls, node transitions, or intermediate results as they occurred. From the user's perspective, the system would start generating text, then go silent for 5 minutes while tools executed behind the scenes, then suddenly dump a wall of output. At work, this was a real problem: users couldn't tell if the system was thinking, stuck, or crashed. People need to see constant real-time updates — not because they're impatient, but because silence is indistinguishable from failure. This UX gap was a big reason I moved away from n8n, and solving it properly became a first-class design goal.
Here's the scenario: the user sends a message, the backend kicks off a 45-second multi-agent pipeline, and the user needs to see what's happening in real time. Not a spinner. Not "processing..." for a minute. Actual incremental updates — the triage result, then each task manager step, then executor outputs — streaming into the chat as they're produced.
The natural approach would be Temporal workflow queries: the workflow maintains an in-memory buffer of messages, and the API polls it. Simple. Except workflow queries are unreliable during active execution. Queries compete with activity execution for the same workflow task slot. In the Python SDK, they're processed at the end of workflow activations, meaning they queue behind running tasks. In practice, queries during multi-agent execution consistently fail with RPC errors. The frontend accumulates errors and eventually gives up.
The solution is something I call the emit-message-activity pattern. Instead of trying to query the workflow's in-memory state, each message is emitted as a lightweight pass-through activity — a function that does nothing except return its input:
```python
from temporalio import activity

@activity.defn
async def emit_message_activity(msg_data: dict) -> dict:
    """Pass-through activity that stores a message in event history."""
    return msg_data
```

The purpose isn't the activity itself — it's the event it creates. Every activity execution produces an ActivityTaskCompleted event in Temporal's durable Event History. That history is an append-only log that is always readable, even while the workflow is busy executing the next agent call.
The wrapper method is worth showing because it reveals why this works so reliably:
```python
async def _emit_message(self, msg: Message) -> None:
    self._state.messages.append(msg)
    msg_dict = {"role": msg.role, "content": msg.content, ...}
    await workflow.execute_activity(
        emit_message_activity, msg_dict,
        start_to_close_timeout=timedelta(seconds=5),
    )
```

The await on execute_activity is the key. It yields to Temporal, creating a workflow task boundary — which means the emitted message is committed to Event History before the next agent starts. This guarantees ordering: the user sees the triage result before the task manager's first step, every time. No batching, no out-of-order messages.
The API polls this history, scans for completed emit-message activities, decodes the payloads, and returns them incrementally to the frontend. The frontend advances a cursor (last_event_id), appends new messages to React state, and the user watches the agents think in real time. The poll endpoint determines whether the workflow is still running by comparing scheduled versus completed activity counts in the event history — if any activities are still pending (scheduled but not completed, failed, or cancelled), it returns processing: true.
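Here's a simulation of that cursor scan over simplified event records. The real endpoint reads Temporal's event history through the client SDK; the dict shapes below are stand-ins for illustration:

```python
def poll_new_messages(events: list[dict], last_event_id: int) -> dict:
    """Scan event history for completed emit-message activities past
    the cursor, and infer 'still running' from scheduled vs. resolved
    activity counts."""
    scheduled = sum(1 for e in events
                    if e["type"] == "ActivityTaskScheduled")
    resolved = sum(1 for e in events
                   if e["type"] in ("ActivityTaskCompleted",
                                    "ActivityTaskFailed",
                                    "ActivityTaskCanceled"))
    new = [e for e in events
           if e["event_id"] > last_event_id
           and e["type"] == "ActivityTaskCompleted"
           and e.get("activity") == "emit_message_activity"]
    cursor = max((e["event_id"] for e in new), default=last_event_id)
    return {"messages": [e["payload"] for e in new],
            "last_event_id": cursor,          # frontend advances this
            "processing": scheduled > resolved}
```

Because the history is append-only, repeated polls with an advancing cursor can never miss or duplicate a message.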
For a typical message processing cycle with 10-15 emitted messages, this adds about 45 events to the history — well within Temporal's 51,200 event limit.
Lessons Learned (The Hard Way)
Agent boundaries must be structural, not instructional. Early on, I tried using prompt instructions to keep the Task Manager from writing code. It worked 90% of the time and failed the other 10%. The fix was giving the TM read-only tools. If it physically cannot call a code execution tool, it cannot write code. Enforce constraints through capability, not through asking nicely.
Timeout tuning is an art form. LLM calls have wildly variable latency — a reasoning model might take 3 seconds or 45 seconds depending on the complexity of its chain-of-thought. Set your activity timeouts too low and you get spurious failures. Set them too high and a genuinely stuck call blocks your pipeline for minutes. I ended up with tiered timeouts: 5 seconds for emit activities, 120 seconds for executor calls, and custom per-agent tuning.
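As a sketch, the tiers can live in a simple lookup. The 5-second and 120-second values are the ones mentioned above; the triage entry and the 60-second default are placeholders I made up to show the shape:

```python
from datetime import timedelta

# Tiered start-to-close timeouts, keyed by activity.
ACTIVITY_TIMEOUTS = {
    "emit_message": timedelta(seconds=5),  # pass-through, near-instant
    "triage": timedelta(seconds=60),       # placeholder per-agent tuning
    "executor": timedelta(seconds=120),    # multi-turn tool use
}

def timeout_for(activity: str) -> timedelta:
    # placeholder default for activities without explicit tuning
    return ACTIVITY_TIMEOUTS.get(activity, timedelta(seconds=60))
```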
Token tracking is non-optional. When you're running 5+ agents per user message and each is making reasoning-model calls, costs compound fast. Per-agent token tracking — not just per-session totals — was essential for identifying which agents were burning disproportionate context and optimizing their prompts accordingly. Braintrust's OTel integration made this observable in real time. More on this in Part 01.
The TM-Executor loop needs a leash. The Task Manager ↔ Executor iteration loop is powerful but dangerous. Without a hard cap on iterations (MAX_TM_ITERATIONS = 10), a confused TM can endlessly dispatch executors, each one burning a full LLM call's worth of tokens on slightly different attempts at the same failed task. The cap forces escalation — either mark it blocked or ask for clarification. Don't let agents spin.
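A sketch of the capped loop — the shape of the escalation, not the actual code (function signatures are mine):

```python
MAX_TM_ITERATIONS = 10  # the leash

def run_goal(task_manager_step, goal):
    """Iterate the TM until it resolves the goal; once the cap is hit,
    force escalation instead of letting it spin."""
    for i in range(MAX_TM_ITERATIONS):
        action = task_manager_step(goal, iteration=i)
        if action in ("complete", "blocked", "needs_clarification"):
            return action
        # otherwise the TM spawned an executor; loop again
    return "blocked"  # cap reached: stop burning tokens, escalate
```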
Replay safety is subtle. Temporal replays workflow code deterministically during recovery. This means any side-effect in workflow code — like emitting an OTel span — will fire again during replay, creating duplicates. I had to write a ReplaySafeSpanProcessor that checks workflow.unsafe.is_replaying() and suppresses span emission during replays. It's exactly the kind of bug that only manifests in production when a worker restarts.
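The idea reduces to a tiny wrapper. In this sketch the replay check is injected as a callable so it runs standalone; in the real processor it's `workflow.unsafe.is_replaying()` and the inner object is an OpenTelemetry SpanProcessor:

```python
class ReplaySafeSpanProcessor:
    """Delegate to a real span processor only when not replaying,
    so recovered workflows don't emit duplicate spans."""
    def __init__(self, inner, is_replaying):
        self._inner = inner
        self._is_replaying = is_replaying

    def on_end(self, span):
        if self._is_replaying():
            return  # suppress duplicates during deterministic replay
        self._inner.on_end(span)

emitted = []
class Recorder:
    def on_end(self, span):
        emitted.append(span)

live = ReplaySafeSpanProcessor(Recorder(), is_replaying=lambda: False)
live.on_end("span-1")      # forwarded to the inner processor
replay = ReplaySafeSpanProcessor(Recorder(), is_replaying=lambda: True)
replay.on_end("span-2")    # suppressed
```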
What's Next
In Part 01, I'll cover how I evaluate this system — using Braintrust to measure prompt quality, agent performance, and the iterative process of making these agents reliably good instead of occasionally impressive. How do you know if your triage agent is actually decomposing goals well? How do you A/B test prompt changes without breaking production? Turns out, eval-driven development for agents is a discipline unto itself.
After that, the series becomes open-ended. The whole point of building a general-purpose agentic workflow is that the applications are limitless — each one just becomes a new set of agent skills plugged into the same durable orchestration layer. Here's what's on deck:
Automated binary fuzzing — Working through the Fuzzing 101 challenges and following the OpenSecurityTraining2 AFL course. An agentic system that can analyze binary targets, generate intelligent seed corpora, orchestrate fuzzing campaigns, and triage crashes — the kind of work that usually requires a security engineer babysitting a terminal for hours.
AI-powered date night planner — Scraping my girlfriend's Beli "want to visit" list, cross-referencing it with our weekly budget pulled from Monarch Money, and having the agents plan out which restaurants we hit that week. The system handles the API integrations, budget constraints, and scheduling — I just show up and eat.
Each application gets its own standalone post. The foundation stays the same; the skills change.
Bit's failsafe didn't just preserve his work — it passed the terminal to someone else. Consider this post the first automated email. More instructions incoming.