Trustworthy Agentic Workflows with a Custom DSL

1. The trust problem with agentic AI

Elicit is an AI research assistant that reviews scientific literature on behalf of pharmaceutical companies, regulatory agencies, and systematic-review teams. Their challenge is not accuracy alone — it is provability.

When a drug company asks “does this compound show cardiotoxicity in published trials?”, they cannot submit an answer that comes with a footnote reading “an AI reasoned about it”. The regulator will ask: what did the system look at, in what order, and how did it combine sources? If those questions cannot be answered precisely and repeatably, the output has no standing.

This is the trust gap that purely chat-based agents fall into. A single Claude call that browses the web and writes a memo is opaque:

You cannot tell which sources were consulted.
You cannot replay the run and prove it produces the same output.
If a step changed, you cannot tell which step, or what the downstream effect was.

Elicit’s engineering lead James Brady presented their solution at Anthropic’s Code with Claude conference: a custom domain-specific language called AshPL (Ash Programming Language) that the agent writes as a plan, which a deterministic interpreter then executes.

2. Three desiderata for trustworthy workflows

Brady frames the design requirements as three properties a trustworthy agentic workflow must have. Each one rules out a whole class of naive architectures.

Desideratum 1: Legible process

You can read what the system is going to do before it runs, in a form humans can understand and verify.

A chain of tool calls buried in an agent loop is not legible. A natural-language summary generated after the fact is not legible either — it is a reconstruction, not a plan. Legibility requires that the workflow be expressed as a durable artifact a human can inspect and approve before execution starts.

This is what a DSL buys you: the program is the plan. A planning agent (the curator, described in section 4) writes it, a human or an automated check can read it, and only then does the interpreter run it.

Desideratum 2: Iteration fidelity

When you change something, only what you changed actually changes.

Agents that call LLMs at every node introduce stochastic variation throughout the pipeline. If you fix a bug in step 3, steps 4–10 might also change slightly because they took step 3’s output as a prompt and LLMs are not deterministic. You cannot tell whether the downstream change was caused by your fix or by random sampling variation.

Iteration fidelity requires memoization at the step level: if step 3’s input did not change, its output is served from cache, and no downstream step sees any variation it did not actually cause.

Desideratum 3: Faithful execution

What actually runs matches what was described.

It is possible to have a legible plan that the system then silently deviates from — taking shortcuts, skipping steps, or substituting one model call for another. Faithful execution means the interpreter is a strict, deterministic runtime that has no latitude to improvise. Every step in the plan maps one-to-one to an operation in the execution log.
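
The one-to-one mapping between plan and log can be checked mechanically. The sketch below is illustrative only: it assumes the plan is a list of step ids and that each log entry is a dict with a "step" key, which are shapes chosen for this example rather than AshPL's actual formats. The function name check_faithful is hypothetical.

```python
def check_faithful(plan_steps: list[str], log: list[dict]) -> list[str]:
    """Return the discrepancies between a plan and an execution log."""
    logged = [entry["step"] for entry in log]
    problems = []
    # Every planned step must appear in the log exactly once.
    for step in plan_steps:
        count = logged.count(step)
        if count != 1:
            problems.append(f"step '{step}' logged {count} times, expected 1")
    # Nothing may appear in the log that the plan did not ask for.
    for step in logged:
        if step not in plan_steps:
            problems.append(f"log contains unplanned step '{step}'")
    # Same set of steps, but executed out of order.
    if not problems and logged != plan_steps:
        problems.append("steps ran in a different order than planned")
    return problems


plan = ["fetch_abstracts", "screen", "extract"]
good_log = [{"step": "fetch_abstracts"}, {"step": "screen"}, {"step": "extract"}]
bad_log = [{"step": "fetch_abstracts"}, {"step": "extract"}, {"step": "summarise"}]

print(check_faithful(plan, good_log))  # []
print(check_faithful(plan, bad_log))   # one skipped step, one unplanned step
```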

The three desiderata form interlocking constraints. Violating any one of them breaks the full audit trail.

3. AshPL design choices

AshPL is the DSL Elicit built to satisfy all three desiderata simultaneously. Its three core design choices are deliberate constraints, not limitations.

Turing-incomplete by design

AshPL deliberately cannot express infinite loops or unbounded recursion. You cannot write a while True: construct. You cannot have a recursive function with an unbounded depth.

This feels restrictive, but it is a safety property. A Turing-complete agent can do anything, including getting stuck in a loop, consuming unbounded resources, or producing outputs that cannot be traced to a bounded set of inputs. Because AshPL has no unbounded loops or recursion, it guarantees:

Every program terminates.
The static structure of the program is its full execution graph — you can enumerate every step before running anything.
Audit tools can walk the plan statically without running it.
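
To make the "walk the plan statically" idea concrete, here is a minimal sketch that lists the calls in a Python-like plan without executing it, using Python's standard ast module. The plan text and the function names in it are invented for illustration; AshPL's real syntax and tooling are not shown in the talk summary above. In a language with no loops or recursion, each call site runs at most once, so the list of call sites bounds the full set of steps.

```python
import ast


def enumerate_calls(source: str) -> list[str]:
    """List every direct function call in a plan, in source order, without running it."""
    tree = ast.parse(source)
    calls = [
        node
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    ]
    # ast.walk does not promise source order, so sort by position.
    calls.sort(key=lambda node: (node.lineno, node.col_offset))
    return [call.func.id for call in calls]


# Hypothetical plan; the names are never resolved because nothing is executed.
plan = '''
abstracts = retrieve_abstracts(query, limit=20)
relevant = relevance_screen(abstracts, criteria)
findings = extract_findings(relevant)
'''

print(enumerate_calls(plan))
# ['retrieve_abstracts', 'relevance_screen', 'extract_findings']
```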

Purely functional

AshPL programs have no side effects and no shared mutable state. Every function takes inputs and returns a value. Two calls with the same inputs always return the same output.

This is what makes the content-addressed cache possible: if inputs haven’t changed, the interpreter can skip re-execution entirely and return the cached result without affecting correctness. Pure functions are also trivially parallelisable — there are no data races when functions have no shared state.

Opinionated Python subset

Rather than inventing new syntax, AshPL looks like a restricted subset of Python. This is a pragmatic choice with two beneficiaries:

LLMs: Claude already knows Python deeply. Asking it to write AshPL programs leverages that existing knowledge rather than requiring fine-tuning on a new grammar.
Humans: reviewers who know Python can read AshPL programs without learning a new language.

The “opinionated” part is that the interpreter enforces the restrictions at parse time — it is not just a style guide. Code that violates the no-loops or no-mutation rules is rejected before execution starts.
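
A parse-time check of this kind can be sketched with Python's ast module. The rule set below is an assumption made for illustration, not AshPL's actual list of forbidden constructs, and check_plan is a hypothetical name. The sketch also does not detect recursion, which a real implementation would need to handle separately.

```python
import ast

# Hypothetical rule set: syntax node types this sketch treats as forbidden.
FORBIDDEN = (
    ast.While, ast.For, ast.AsyncFor,  # loops
    ast.AugAssign,                     # in-place updates such as x += 1
    ast.Global, ast.Nonlocal,          # shared mutable state
    ast.Import, ast.ImportFrom,        # arbitrary library access
)


def check_plan(source: str) -> list[str]:
    """Return parse-time errors for a plan; an empty list means it may run."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: syntax error: {exc.msg}"]
    found = [
        (node.lineno, type(node).__name__)
        for node in ast.walk(tree)
        if isinstance(node, FORBIDDEN)
    ]
    return [f"line {line}: {name} is not allowed" for line, name in sorted(found)]


ok = "relevant = relevance_screen(abstracts, criteria)"
bad = "while True:\n    total += 1"

print(check_plan(ok))   # []
print(check_plan(bad))  # ['line 1: While is not allowed', 'line 2: AugAssign is not allowed']
```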

4. The AshPL architecture

Five components work together to deliver the three desiderata.

Figure: AshPL architecture. The curator writes a program, the interpreter executes it against a content-addressed cache and a quiver of task models, and every action is appended to the immutable event log.

Component 1: The Curator

The curator is a Claude agent whose job is to read the user’s research question and write an AshPL program. It has no other responsibility — it does not browse the web, extract data, or synthesise results. It only writes the plan.

If the plan fails validation (wrong syntax, uses a forbidden construct, references undefined functions), the curator receives feedback and enters a redraft loop until the program is valid. This separation of concerns — planning agent versus execution runtime — is what makes faithful execution possible.
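
The redraft loop can be sketched as a small function that alternates between drafting and validating. Everything here is illustrative: draft and validate are stand-in callables (in practice draft would call the model and validate would be the parse-time checker), and the max_attempts cap is an assumption added so the loop itself is bounded; the source only says the curator redrafts until the program is valid.

```python
from typing import Callable


def redraft_until_valid(
    draft: Callable[[str, list[str]], str],
    validate: Callable[[str], list[str]],
    question: str,
    max_attempts: int = 3,  # assumption: give up rather than loop forever
) -> str:
    """Ask for a plan, feed validation errors back, and stop at the first valid plan."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        program = draft(question, feedback)
        feedback = validate(program)
        if not feedback:
            return program
    raise RuntimeError(f"no valid plan after {max_attempts} attempts: {feedback}")


# Stand-ins for the model call and the validator.
def fake_draft(question: str, feedback: list[str]) -> str:
    return "plan_v2" if feedback else "plan_v1"


def fake_validate(program: str) -> list[str]:
    return [] if program == "plan_v2" else ["uses a forbidden construct"]


result = redraft_until_valid(fake_draft, fake_validate, "Does compound X show cardiotoxicity?")
print(result)  # plan_v2
```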

Component 2: The Python interpreter

Once the curator produces a valid AshPL program, a deterministic Python interpreter takes over. The interpreter never asks an LLM what to do next and has no ability to make judgment calls; it only follows the program. When the program calls a task function, the interpreter dispatches it to the quiver of models. When the function returns, the interpreter stores the result and continues.

This strict separation means the execution log is a deterministic consequence of the program and the model outputs, nothing else.

Component 3: The content-addressed cache

Every task call is identified by a hash of its inputs (the function name plus all its arguments). Before executing any step, the interpreter checks the cache. If a matching hash exists, the cached result is returned immediately — no model call, no latency, no non-determinism.

This is what delivers iteration fidelity. If you change a prompt in step 3, steps 1–2 are unchanged and their cached results are reused. Only step 3 and its downstream dependents re-execute. The diff between two runs is exactly the diff you intended.

Component 4: The quiver of models

The quiver is Elicit’s term for the pool of models (and model configurations) available to the interpreter. Different steps in a workflow can use different models — a fast, cheap model for screening, a more capable model for extraction, a specialised model for a particular domain. The AshPL program names which function to call; the quiver resolves the function to a specific model and prompt.
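
One way to picture the quiver is as a lookup table from function name to model configuration. The sketch below is a guess at the shape, not Elicit's implementation: TaskConfig, QUIVER, resolve, the model names, and the prompt templates are all placeholders invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskConfig:
    model: str            # placeholder model name, not a real model ID
    prompt_template: str  # filled in with the task's inputs


# Hypothetical quiver: each task function maps to one model and one prompt.
QUIVER = {
    "relevance_screen": TaskConfig(
        model="fast-cheap-model",
        prompt_template="Is this abstract relevant to: {criteria}?\n\n{abstract}",
    ),
    "extract_findings": TaskConfig(
        model="capable-model",
        prompt_template="Extract the key findings from:\n\n{paper}",
    ),
}


def resolve(fn_name: str, **inputs) -> tuple[str, str]:
    """Turn a function name plus inputs into the (model, prompt) pair to dispatch."""
    config = QUIVER[fn_name]
    return config.model, config.prompt_template.format(**inputs)


model, prompt = resolve(
    "relevance_screen",
    criteria="cardiotoxicity",
    abstract="Compound X prolonged the QT interval in rats.",
)
print(model)  # fast-cheap-model
print(prompt)
```

Keeping this table outside the plan means the program stays readable ("screen, then extract") while the choice of model for each task can change in one place.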

Component 5: The event log

Every action the interpreter takes is appended to an immutable, event-sourced log. The log is the ground truth of what happened. It enables:

Audit: show the regulator exactly which documents were read, with their hashes, at what timestamp.
Replay: re-run any historical workflow and compare the new outputs to the old, step by step.
Debugging: trace any output backwards through the event log to the exact inputs that produced it.
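
An append-only log can be sketched in a few lines. The hash chaining below (each entry stores the hash of the previous one, so edits to history are detectable) is a common technique added here for illustration; the source does not say how Elicit makes its log immutable. The class name EventLog and the entry fields are assumptions, and inputs and outputs are assumed to be JSON-serialisable.

```python
import hashlib
import json
import time


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class EventLog:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self):
        self._events: list[dict] = []

    def append(self, step: str, inputs: dict, output) -> None:
        body = {
            "step": step,
            "inputs": inputs,
            "output": output,
            "timestamp": time.time(),
            "prev_hash": self._events[-1]["entry_hash"] if self._events else "",
        }
        body["entry_hash"] = _digest(body)  # hash covers every field above
        self._events.append(body)

    def verify(self) -> bool:
        """Recompute every hash; False means some entry was altered or reordered."""
        prev = ""
        for event in self._events:
            body = {k: v for k, v in event.items() if k != "entry_hash"}
            if event["prev_hash"] != prev or event["entry_hash"] != _digest(body):
                return False
            prev = event["entry_hash"]
        return True


log = EventLog()
log.append("fetch_abstracts", {"query": "cardiotoxicity"}, ["a1", "a2"])
log.append("screen", {"abstracts": ["a1", "a2"]}, ["a1"])
intact = log.verify()

log._events[0]["output"] = ["tampered"]  # simulate an edit to history
tampered = log.verify()
print(intact, tampered)  # True False
```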

5. When to use this pattern

AshPL is not the right answer for every agentic workflow. Building and maintaining a custom DSL, interpreter, and event log is a significant engineering investment. The pattern is worth that investment when your context satisfies at least two of the following conditions:

External accountability: a client, regulator, or auditor can ask “show your work” and expects a precise, verifiable answer. If “the AI determined this” is an acceptable answer, you probably don’t need this pattern.

Repeated workflows: the same logical workflow runs many times on different inputs. The content-addressed cache only pays off when there is cache-hit opportunity, which requires repetition.

Long-running pipelines: if a workflow involves dozens of LLM calls and takes hours, the ability to resume after failure and to re-run only the changed steps is enormously valuable.

Multiple contributors: when a team edits workflows over time, iteration fidelity becomes critical. Without it, a change by one engineer silently affects outputs that another engineer is tracking.

When a simpler approach is better

If you are building a one-off assistant, a prototype, or a workflow where the answer itself is the only deliverable, a simple agent loop with good logging is likely sufficient. The overhead of a custom DSL is only justified by the accountability requirements it serves.

Check your understanding


1. Why is "Claude did some stuff" insufficient as an audit trail in regulated industries?

Regulated workflows require provenance: precisely what ran, on what inputs, producing what outputs. A natural-language summary is a reconstruction, not a verifiable record.

2. What does "iteration fidelity" mean in the context of AshPL?

Iteration fidelity means that when you change something, only what you changed actually changes. AshPL achieves it with the content-addressed cache: steps whose inputs have not changed return cached results, so only intentional changes propagate downstream.

3. Why is Turing-incompleteness a deliberate feature of AshPL rather than a bug?

A Turing-incomplete DSL cannot express infinite loops or unbounded recursion, so every program is guaranteed to terminate and its complete execution graph can be audited before any code runs.

4. In the AshPL architecture, what is the curator's sole responsibility?

The curator is a Claude agent that only writes the plan. It does not execute tasks, browse sources, or synthesise results. Strict separation of planning from execution is what makes faithful execution provable.

5. You change a prompt used in step 5 of a 10-step AshPL pipeline. Which steps re-execute on the next run?

The content-addressed cache keys on input hashes. Steps 1–4 have unchanged inputs, so their cached results are reused. Step 5 has a new prompt (changed input), so it re-executes. Steps 6–10, which depend on step 5's output, receive a new input and therefore also re-execute.

Build it yourself

Follow these exact steps to reproduce it yourself · estimated time: ~45 minutes

Prerequisites

Python 3.10+ installed
Access to a Claude API key (or any LLM API)
Familiarity with Python functions and dictionaries

Goal

Design and implement a minimal version of this pattern for your own use case: a structured plan format, a deterministic interpreter, and a content-addressed cache. You won’t build a full language — you’ll build just enough to understand the architectural commitments.

Step 1 — Define your DSL format

Pick a structure that is legible and constrainable. YAML works well for small workflows:

workflow:
  name: "screen_papers"
  steps:
    - id: fetch_abstracts
      fn: retrieve_abstracts
      inputs:
        query: "{{ user.query }}"
        limit: 20
    - id: screen
      fn: relevance_screen
      inputs:
        abstracts: "{{ fetch_abstracts.output }}"
        criteria: "{{ user.criteria }}"
    - id: extract
      fn: extract_findings
      inputs:
        papers: "{{ screen.output }}"

Constraints to enforce at parse time:

Every id must be unique.
Every inputs reference (e.g. {{ fetch_abstracts.output }}) must refer to a step defined earlier — no forward references.
No step may reference its own output (no self-loops).

Step 2 — Build the validator

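def validate_workflow(workflow: dict) -> list[str]:
    """Return validation errors for the mapping under the top-level workflow: key.

    An empty list means the plan is valid.
    """
    errors = []
    seen_ids = set()  # ids of the steps defined so far
    for step in workflow["steps"]:
        if step["id"] in seen_ids:
            errors.append(f"Duplicate step id: {step['id']}")
        # find_references is expected to return the step ids named in {{ step.output }} placeholders.
        for ref in find_references(step["inputs"]):
            if ref == step["id"]:
                errors.append(f"Step '{step['id']}' references its own output")
            elif ref not in seen_ids:
                errors.append(f"Step '{step['id']}' references unknown or later step '{ref}'")
        # Register the id only after checking references, so a self-reference is rejected.
        seen_ids.add(step["id"])
    return errors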

A validator that runs before execution is what turns a YAML file into a legible, checkable plan.
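
The validator relies on a find_references helper that is not defined above. One possible version is sketched here, under two assumptions that are not stated in the original: references always have the form {{ name.field }}, and the name user is reserved for caller-supplied values rather than a step. It only inspects top-level string values, which is enough for the Step 1 example.

```python
import re

# Matches placeholders such as "{{ fetch_abstracts.output }}" and captures the name.
REF_PATTERN = re.compile(r"\{\{\s*([A-Za-z_]\w*)\.\w+\s*\}\}")

# Assumption: "user" refers to caller-supplied values, not to a step.
RESERVED = {"user"}


def find_references(inputs: dict) -> list[str]:
    """Return the step ids referenced by a step's inputs, in first-seen order."""
    refs: list[str] = []
    for value in inputs.values():
        if not isinstance(value, str):
            continue  # literals such as limit: 20 contain no references
        for name in REF_PATTERN.findall(value):
            if name not in RESERVED and name not in refs:
                refs.append(name)
    return refs


screen_inputs = {
    "abstracts": "{{ fetch_abstracts.output }}",
    "criteria": "{{ user.criteria }}",
}
fetch_inputs = {"query": "{{ user.query }}", "limit": 20}

print(find_references(screen_inputs))  # ['fetch_abstracts']
print(find_references(fetch_inputs))   # []
```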

Step 3 — Implement the content-addressed cache

import hashlib, json
from typing import Any

class ContentCache:
    def __init__(self):
        # sha256 hex digest -> cached result; in memory only, so it lasts for one process.
        self._store: dict[str, Any] = {}

    def key(self, fn_name: str, inputs: dict) -> str:
        # sort_keys makes the hash independent of dict ordering; inputs must be JSON-serialisable.
        payload = json.dumps({"fn": fn_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, fn_name: str, inputs: dict):
        # Returns None on a miss.
        return self._store.get(self.key(fn_name, inputs))

    def put(self, fn_name: str, inputs: dict, result) -> None:
        self._store[self.key(fn_name, inputs)] = result

Test it: call a slow function twice with the same inputs. The second call should return instantly from cache.
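
One way to run that test is sketched below. It counts calls rather than timing them, so the result does not depend on the machine. The cache class is a condensed copy of the one above so the snippet runs on its own; slow_screen and cached_call are hypothetical names introduced for the example.

```python
import hashlib
import json
import time


class ContentCache:
    # Condensed copy of the Step 3 class so this snippet is self-contained.
    def __init__(self):
        self._store = {}

    def key(self, fn_name, inputs):
        payload = json.dumps({"fn": fn_name, "inputs": inputs}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, fn_name, inputs):
        return self._store.get(self.key(fn_name, inputs))

    def put(self, fn_name, inputs, result):
        self._store[self.key(fn_name, inputs)] = result


calls = 0  # how many times the slow function actually ran


def slow_screen(abstracts, criteria):
    global calls
    calls += 1
    time.sleep(0.1)  # stands in for a model call
    return [a for a in abstracts if criteria in a]


def cached_call(cache, fn, **inputs):
    """Serve from cache when the (function name, inputs) hash is already known."""
    hit = cache.get(fn.__name__, inputs)
    if hit is not None:
        return hit
    result = fn(**inputs)
    cache.put(fn.__name__, inputs, result)
    return result


cache = ContentCache()
inputs = {"abstracts": ["cardiotoxicity in rats", "liver enzymes"], "criteria": "cardiotoxicity"}
first = cached_call(cache, slow_screen, **inputs)
second = cached_call(cache, slow_screen, **inputs)
print(first == second, calls)  # True 1
```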

Step 4 — Write the interpreter

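import hashlib, json

def run_workflow(workflow: dict, registry: dict, cache: ContentCache, user: dict | None = None) -> dict:
    results = {}  # step id -> output
    log = []      # one entry per step, in execution order
    for step in workflow["steps"]:
        # Replace {{ ... }} placeholders with user values and earlier step outputs.
        resolved = resolve_inputs(step["inputs"], results, user or {})
        cached = cache.get(step["fn"], resolved)
        if cached is not None:  # note: a step whose output is None is never served from cache
            results[step["id"]] = cached
            log.append({"step": step["id"], "cache": "hit"})
        else:
            output = registry[step["fn"]](**resolved)
            cache.put(step["fn"], resolved, output)
            results[step["id"]] = output
            # sha256 is stable across processes; the built-in hash() is salted per process for strings.
            digest = hashlib.sha256(json.dumps(output, sort_keys=True, default=str).encode()).hexdigest()
            log.append({"step": step["id"], "cache": "miss", "output_hash": digest})
    return {"results": results, "log": log}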

Step 5 — Verify iteration fidelity

Run your workflow end-to-end. All steps should be cache misses.
Run it again without changes. All steps should be cache hits — your “models” are not called.
Change one input in the middle of the workflow.
Run again. Steps before the change should be cache hits; the changed step should be a cache miss, and so should every downstream step whose resolved inputs differ as a result.

If that last run behaves as described, you have iteration fidelity.
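
The whole check can be scripted. The sketch below uses a condensed interpreter so it runs on its own: a plain dict stands in for the cache, "@name" stands in for "{{ name.output }}", and the three task functions are toy lambdas rather than model calls. All of these are simplifications made for the example.

```python
import hashlib
import json


def cache_key(fn_name, inputs):
    payload = json.dumps({"fn": fn_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run(steps, registry, cache):
    """Condensed interpreter: returns 'hit' or 'miss' for each step id."""
    results, status = {}, {}
    for step in steps:
        # "@name" is shorthand here for a reference to an earlier step's output.
        resolved = {
            k: results[v[1:]] if isinstance(v, str) and v.startswith("@") else v
            for k, v in step["inputs"].items()
        }
        key = cache_key(step["fn"], resolved)
        if key in cache:
            status[step["id"]] = "hit"
        else:
            cache[key] = registry[step["fn"]](**resolved)
            status[step["id"]] = "miss"
        results[step["id"]] = cache[key]
    return status


# Toy stand-ins for model-backed task functions.
registry = {
    "fetch": lambda query: [f"{query} paper {i}" for i in range(3)],
    "screen": lambda papers, keep: papers[:keep],
    "extract": lambda papers: [p.upper() for p in papers],
}


def plan(keep):
    return [
        {"id": "fetch", "fn": "fetch", "inputs": {"query": "cardiotoxicity"}},
        {"id": "screen", "fn": "screen", "inputs": {"papers": "@fetch", "keep": keep}},
        {"id": "extract", "fn": "extract", "inputs": {"papers": "@screen"}},
    ]


cache = {}
run1 = run(plan(keep=2), registry, cache)  # first run
run2 = run(plan(keep=2), registry, cache)  # unchanged plan
run3 = run(plan(keep=1), registry, cache)  # one input changed in the middle step

print(run1)  # {'fetch': 'miss', 'screen': 'miss', 'extract': 'miss'}
print(run2)  # {'fetch': 'hit', 'screen': 'hit', 'extract': 'hit'}
print(run3)  # {'fetch': 'hit', 'screen': 'miss', 'extract': 'miss'}
```

Because the key is a hash of resolved inputs, a downstream step is a miss only when the value it receives actually differs. If an edited step happened to produce the same output as before, its dependents would still be served from cache.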

Step 6 — Add a planning agent (optional)

If you have Claude API access, add a curator step that generates the YAML workflow from a natural-language task description:

import anthropic
import yaml

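def curate_workflow(task: str, schema: str) -> dict:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-opus-4-8",  # substitute a model ID available to your account
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Write a workflow in this YAML schema to accomplish: {task}\n\n"
                f"Schema:\n{schema}\n\n"
                "Return only the YAML, with no commentary or code fences."
            ),
        }]
    )
    raw = response.content[0].text
    # The Step 1 format nests everything under a top-level "workflow" key.
    workflow = yaml.safe_load(raw)["workflow"]
    errors = validate_workflow(workflow)
    if errors:
        raise ValueError(f"Invalid workflow: {errors}")
    return workflow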

The curator writes the plan; the interpreter executes it. The same separation of concerns as AshPL.

Expected result: a working minimal DSL interpreter with cache hits on repeated runs and an event log (a list of dicts that you can dump to JSON and read after execution). The three desiderata should all be demonstrable: legibility (YAML plan), iteration fidelity (content-addressed cache), and faithful execution (deterministic interpreter).

Where to go next

Watch the original talk by James Brady from Elicit — the live demo of replaying a run and seeing exactly which steps re-executed is the clearest illustration of iteration fidelity.
Read about event sourcing as a general pattern for append-only audit logs.
Continue with Prompting for Agents to understand how to write effective system prompts for agents like the curator.

Key takeaways

AshPL satisfies three desiderata (legible process, iteration fidelity, faithful execution) with a Turing-incomplete, purely functional DSL that is written as a restricted Python subset and executed by a strict interpreter with a content-addressed cache
Turing-incompleteness is a safety property: every program terminates and its full execution graph can be enumerated statically
Pure functions plus a content-addressed cache mean that only intentionally changed steps re-execute — you can trust that a diff between two runs reflects only your edits
An append-only event log is the mechanism that makes compliance claims verifiable rather than just plausible
The pattern is worth building when external accountability, repeated workflows, or long pipelines make auditability a hard requirement — not for every agent
