
@hooeem: Your agentic workflows are was...

@hooeem
25 views Jun 04, 2026

Your agentic workflows are wasting your tokens. They're burning money on every pass through an orchestration loop. Here's how you fix it and make your workflows 100x cheaper (yes, 100x cheaper).


In fact, testing found it to be 128x, 296x, and 462x cheaper in the three tested domains, so 100x is an understatement.

The research comes from a paper by Simon Dennis, Riviaan Patil, Kevin Shabahang, & Hao Guo from the University of Melbourne (I'll link the paper in full at the end).

This article shows you how to use their research to make your agentic workflows 100x cheaper. Here's what it covers:

  • Why your agentic workflow costs so much
  • The one idea to take away
  • How it's actually done (theory)
  • Does it hold up?
  • Run your own numbers (find out how much it costs)
  • Is this for you?
  • Full build guide

This guide can make your conversations up to 462x cheaper while keeping 87-98% of frontier quality, so let's get started!

    1: Why your agentic workflow costs so much

    Okay, so you have a fixed procedure - an agentic workflow.

    Where that workflow lives is the single choice that drives its cost. There are three options.

    A: Orchestration

    This is the most common setup you see today: software sits on top of the model and, on every single turn, injects instructions and decides where the conversation goes next.

    The cost? $0.05-0.17 per conversation.

    B: In-context

    This is the "just prompt it" route: you paste the whole workflow into the model's system prompt and let it run itself.

    This is the most expensive approach. The cost? $0.10-0.33 per conversation.

    C: Compiled

    This is the method we're going to learn from the research paper: we teach a smaller model the procedure once and then host it ourselves.

    The procedure is baked into the model itself. The cost? $0.0003-0.001 per conversation.
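
To see how far apart the three tiers are, here is a rough sketch that divides the per-conversation ranges quoted above. The function name is made up for illustration, and pairing the cheapest baseline with the priciest compiled run (and the other way round) is my assumption for getting a conservative and an optimistic bound; the paper's 128x to 462x figures are measured per domain, so they sit somewhere inside these bounds rather than matching them.

```python
# Per-conversation cost ranges quoted in this article, in US dollars (low, high).
COSTS = {
    "orchestration": (0.05, 0.17),
    "in_context": (0.10, 0.33),
    "compiled": (0.0003, 0.001),
}

def ratio_bounds(baseline, target="compiled"):
    """Return (conservative, optimistic) cost ratios of baseline over target."""
    b_lo, b_hi = COSTS[baseline]
    t_lo, t_hi = COSTS[target]
    conservative = b_lo / t_hi   # cheapest baseline vs priciest compiled run
    optimistic = b_hi / t_lo     # priciest baseline vs cheapest compiled run
    return conservative, optimistic

for name in ("orchestration", "in_context"):
    lo, hi = ratio_bounds(name)
    print(f"{name}: between {lo:.0f}x and {hi:.0f}x the compiled cost")
```

With the ranges quoted here that works out to roughly 50x to 570x against orchestration and 100x to 1,100x against in-context.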


    2: The one idea to take away

    If the shape of your workflow (its steps, its branches, its order) doesn't change from one conversation to the next, then why pay to describe it every time? FFFFFFFF*****CCCCCKKKKK THAT.

    Go put the unchanging shape into the model itself and keep the prompt for the only thing that actually varies.


    3: How?

    Let's dive in...

    Step 1: Draw the workflow as a flowchart

    Map your procedure as boxes and arrows. Yes, a frickin flowchart: each box is a turn, each arrow is a possible next step. Mark the start and the ways a conversation can end.

    Why? It's a precise way of writing down the procedure a computer can walk through automatically. If you're able to write down your agentic workflow then you're ready for step 2.

    Step 2: Generate practice conversations from it

    A frontier model such as Claude Sonnet can walk every sensible path through your flowchart and write out a realistic example conversation for each, varying the details every time. These conversations are the teaching material. You need 2,000-6,000 of them, which costs around $40 in API calls (you don't need to chat to your workflow 2,000 times yourself lol).

    Step 3: Fine tune a small model on those conversations

    Take a small open model (the study used Qwen 2.5 at 3 billion parameters and Qwen3 at 8 billion) and train it on the examples until it absorbs the procedure. It needs to learn the workflow in its entirety, not as a set of instructions it reads each time.

    One important caveat from the study: this has to be a full fine-tune, not the cheap shortcut (LoRA). The shortcut was shown to fail at learning multi-step procedures.

  • Hardware: 1 high-end GPU (e.g. an A100 or H200)
  • Cost: ~$10–40
  • "3B" = 3 billion parameters
  • Fine-tuning: taking an existing model and continuing its training on your own examples so it specialises. Parameters are the model's adjustable internals; 3–8 billion is small enough to run on a single rented GPU, versus the ~70× larger frontier models behind the paid APIs.

    Step 4: Deploy it, with no orchestrator at all

    Host the trained model yourself and let conversations fly. The model self-orchestrates from what it learned. The expensive parts of the old setup are simply gone.

  • Serving: self-hosted on your own GPU
  • Prompt size: constant
  • Routing errors: eliminated by design
  • Self-hosting: running the model on a machine you rent or own (the study used a cloud A100 at ~$2.50/hour) instead of paying a provider per token. This is where the ~65× per-token saving comes from.
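
As a back-of-envelope check on those self-hosting numbers, the sketch below works out how many conversations one rented GPU has to complete per hour to land on the compiled per-conversation costs quoted earlier. It assumes the GPU rental is the only serving cost and that the GPU is kept busy the whole hour; neither assumption comes from the study, and the function name is only illustrative.

```python
GPU_DOLLARS_PER_HOUR = 2.50               # A100 rental price quoted for the study
COST_PER_CONVERSATION = (0.0003, 0.001)   # compiled range quoted in section 1

def conversations_per_hour(gpu_rate, cost_per_conversation):
    """Conversations one GPU must finish per hour to reach the given unit cost."""
    return gpu_rate / cost_per_conversation

cheapest, priciest = COST_PER_CONVERSATION
print("needed at the high-cost end:", round(conversations_per_hour(GPU_DOLLARS_PER_HOUR, priciest)))
print("needed at the low-cost end: ", round(conversations_per_hour(GPU_DOLLARS_PER_HOUR, cheapest)))
```

Under those assumptions the quoted costs imply roughly 2,500 to 8,300 conversations per GPU-hour, so the saving depends on keeping the machine well used.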


    I will be diving deeper into the build at the end of this article.

    4: Does it hold up?

    Cheap is easy if you don't care about quality. The study measured both, across three workflows, judged blind by an AI grader (and re-checked by a second, different grader).

    They found that, despite being ridiculously cheaper, the compiled model kept most of the gold standard's quality:

  • Naturalness: 97% of quality kept.
  • Graceful handling: 92% of quality kept.
  • Task success: 91% of quality kept.
  • Information accuracy: 87% of quality kept.

    Overall, the 8B compiled model scored 87–98% of the frontier in-context baseline. Not only that, but in two of the three workflows tested in the paper the compiled model failed far less often than the orchestrator version.


    5: How much does it cost?

  • At low volume the one-time setup dominates, so the advantage looks modest. As volume grows, the compiled cost approaches the raw 128–462× advantage from the cost ladder.
  • The study notes that break-even against the in-context approach arrives within 500 conversations.
  • For 10,000+ conversations, the study reports compilation adds less than $0.01 per conversation once the setup is spread out.
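
To run your own numbers, a minimal sketch of the break-even arithmetic follows. The function names are mine, and the example inputs are simply the midpoints of the ranges quoted in this article ($50-80 setup, $0.10-0.33 in-context, $0.0003-0.001 compiled), so treat them as illustrative rather than as figures from the study.

```python
import math

def break_even_conversations(setup_cost, baseline_per_conv, compiled_per_conv):
    """Smallest volume at which the compiled route's total cost no longer exceeds the baseline's."""
    saving = baseline_per_conv - compiled_per_conv
    if saving <= 0:
        raise ValueError("compiled must be cheaper per conversation than the baseline")
    return math.ceil(setup_cost / saving)

def compiled_cost_per_conversation(setup_cost, compiled_per_conv, volume):
    """Compiled cost per conversation with the one-time setup spread over the volume."""
    return compiled_per_conv + setup_cost / volume

# Illustrative midpoints of the ranges quoted in this article.
SETUP, IN_CONTEXT, COMPILED = 65.0, 0.215, 0.00065

print("break-even volume:", break_even_conversations(SETUP, IN_CONTEXT, COMPILED))
for volume in (500, 10_000, 100_000):
    cost = compiled_cost_per_conversation(SETUP, COMPILED, volume)
    print(f"{volume:>7} conversations: ${cost:.4f} each")
```

Swap in your own setup bill and your current per-conversation cost; the lower your current cost, the later the break-even point lands.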

    6: Is this for you?

    A strong fit when…

  • Your workflow is procedural - you can draw it as a flowchart with clear steps and branches.
  • The procedure is stable - it doesn't change shape from one conversation to the next.
  • You run enough volume for the one-time setup to pay off (break-even under 500 conversations).
  • You want to keep the procedure private rather than expose it to a third-party API.
  • Latency and per-conversation cost matter to you at scale.

    A poor fit when…

  • The task is open-ended - not a defined procedure you can chart. The study only tested procedural workflows.
  • Success depends on broad world knowledge - that was the compiled model's weakest area (87%).
  • You need the absolute highest quality and cost is no object - the in-context frontier baseline still scored highest.
  • Your procedure changes constantly - though a refresh is only 30–50 minutes, not a full rebuild.
  • You have no access to a suitable GPU or the skills to run a fine-tune (or to delegate it).

    7: Building it

    You do not need to understand every single nerdy thing that's going on, and you shouldn't expect to either. Think of it like building a house: you draw the plans, a builder does the construction.

    Here we are simply deciding the workflow and we'll judge whether the result is good or not (nice).

    Here's what we're about to do (we mentioned it briefly earlier but now we're going to give those who are actually going to do this the tools):

  • WRITE WORKFLOW DOWN AS FLOWCHART
  • GET AI TO PRACTICE CONVERSATIONS WITH IT
  • GET SMALL MODEL TO LEARN IT
  • IT THEN RUNS THAT

    It should cost around $50-80 to set this fucker up, assuming you have a GPU to use, and after 500 conversations with your agentic workflow it has already paid for itself!

    Stage 1: Draw your workflow as a flowchart

    Write down your procedure as a simple map: boxes for each thing that gets said, arrows for what can happen next, and a few clearly-marked endings (the customer is happy, the customer gives up, or it's handed to a human). That's it. If you can sketch your process on a whiteboard, you've done the hard part of this stage.

    This is the one stage that is genuinely yours. You know your workflow better than anyone; nobody else can draw it for you.

    A few tips from the study:

  • Keep agent and customer turns alternating: the agent says something, the customer replies, and so on.
  • Every ending should be one of three kinds: success, gave up, or handed to a human.
  • Write each agent step to ask one thing at a time. The study found this single-question rhythm is what makes the trained model feel natural and easy to follow.
  • For a sense of scale: the paper's simple workflows had 14 boxes; its most complex (insurance claims) had 55 boxes with 6 branching points. Both worked.

    The flowchart as a file (procedure.json): the map is saved as a plain text file.

    Here is a tiny travel-booking example; replace it with your own or get Claude to help you with it by explaining your workflow.

    {
      "system_prompt": "You are a helpful travel booking assistant.",
      "start": "greet",
      "terminals": { "booked": "success", "abandoned": "abandonment", "escalated": "escalation" },
      "scenario_variables": {
        "destination": ["Japan", "Portugal", "Peru"],
        "budget_per_person": ["£650", "£900", "£2,000"],
        "trip_length": ["a weekend", "6 days", "two weeks"],
        "user_style": ["uncertain", "specific", "price-conscious"]
      },
      "nodes": {
        "greet":        { "role": "agent", "prompt": "Warmly greet the customer and ask what trip they'd like to book." },
        "user_request": { "role": "user",  "prompt": "You want {destination} for {trip_length}, budget about {budget_per_person} each. Be {user_style}." },
        "gather":       { "role": "agent", "prompt": "Ask ONE focused follow-up question about dates or interests." },
        "user_detail":  { "role": "user",  "prompt": "Answer, staying consistent with your budget and style." },
        "present":      { "role": "agent", "prompt": "Present 2–3 concrete options that fit the budget." },
        "user_choose":  { "role": "user",  "prompt": "React; pick one or ask for alternatives." },
        "confirm":      { "role": "agent", "prompt": "Summarise the choice and ask the customer to confirm." },
        "user_confirm": { "role": "user",  "prompt": "Confirm you're happy to book." },
        "booked":       { "role": "agent", "prompt": "Confirm the booking and close with one travel tip." },
        "abandoned":    { "role": "agent", "prompt": "Politely acknowledge they're not ready." },
        "escalated":    { "role": "agent", "prompt": "Explain you're handing off to a human specialist." }
      },
      "edges": [
        { "from": "greet", "to": "user_request" }, { "from": "user_request", "to": "gather" },
        { "from": "gather", "to": "user_detail" }, { "from": "user_detail", "to": "present" },
        { "from": "user_detail", "to": "gather", "condition": "needs more info" },
        { "from": "present", "to": "user_choose" }, { "from": "user_choose", "to": "confirm" },
        { "from": "user_choose", "to": "present", "condition": "wants alternatives" },
        { "from": "user_choose", "to": "abandoned", "condition": "not interested" },
        { "from": "confirm", "to": "user_confirm" }, { "from": "user_confirm", "to": "booked" },
        { "from": "present", "to": "escalated", "condition": "too complex" }
      ]
    }
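
Before spending money on generation, it can help to sanity-check the file. The sketch below is one possible checker, not something from the paper: it assumes the same keys as the example above (start, terminals, nodes, edges) and flags unknown node names, dead ends that are not endings, and boxes that can never be reached. The tiny DEMO chart is a made-up stand-in so the snippet runs on its own.

```python
from collections import deque

def validate_flowchart(chart):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    nodes = chart["nodes"]
    terminals = set(chart["terminals"])
    if chart["start"] not in nodes:
        problems.append(f"start node {chart['start']!r} is not defined")
    for name in sorted(terminals - set(nodes)):
        problems.append(f"terminal {name!r} is not defined in nodes")
    outgoing = {name: [] for name in nodes}
    for edge in chart["edges"]:
        for end in (edge["from"], edge["to"]):
            if end not in nodes:
                problems.append(f"edge mentions unknown node {end!r}")
        if edge["from"] in nodes and edge["to"] in nodes:
            outgoing[edge["from"]].append(edge["to"])
    for name, targets in outgoing.items():
        if name not in terminals and not targets:
            problems.append(f"{name!r} is a dead end but not a terminal")
    # Breadth-first search from the start to find boxes no conversation can reach.
    seen = {chart["start"]} if chart["start"] in nodes else set()
    queue = deque(seen)
    while queue:
        for nxt in outgoing[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    for name in sorted(set(nodes) - seen):
        problems.append(f"{name!r} cannot be reached from the start")
    return problems

# Hypothetical three-box chart, only here to make the sketch self-contained.
DEMO = {
    "start": "greet",
    "terminals": {"done": "success"},
    "nodes": {"greet": {"role": "agent"}, "reply": {"role": "user"}, "done": {"role": "agent"}},
    "edges": [{"from": "greet", "to": "reply"}, {"from": "reply", "to": "done"}],
}
print(validate_flowchart(DEMO) or "no problems found")
```

To check your own file, load it with json.load(open("procedure.json")) and pass the result to validate_flowchart.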

    Here is a live artefact to help you with this stage:

    https://claude.ai/public/artifacts/3f0bd0cf-980b-407c-a515-9880f66103e7

    Stage 2: Let a clever AI have thousands of conversations with it

    You don't need a pile of real customer transcripts to start. Instead, a top-tier AI (the study used Claude) walks every sensible route through your flowchart and writes out realistic example conversations, thousands of them, changing the details each time, like a different destination or a more sceptical customer. These examples are the "textbook" your small model will learn from.

    The clever part: the finished examples read as completely natural dialogue. None of the flowchart labels show up in them. The procedure is hidden inside how the conversations flow, which is exactly how the small model will end up learning it.
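
If you want to verify that no flowchart labels leaked into the generated text, a simple scan along these lines could work. This is my own sketch, not part of the study's pipeline. It assumes conversations are lists of {"role", "content"} messages, and it only checks labels containing an underscore (such as user_detail), because labels like confirm or present are ordinary words that will legitimately appear in dialogue.

```python
import re

def find_label_leaks(messages, node_ids):
    """Return (turn_index, node_id) pairs where a flowchart label appears verbatim."""
    # Only underscore-style ids are checked; plain-word ids would give false alarms.
    suspicious = [n for n in node_ids if "_" in n]
    leaks = []
    for i, msg in enumerate(messages):
        if msg["role"] == "system":
            continue
        for node_id in suspicious:
            if re.search(rf"\b{re.escape(node_id)}\b", msg["content"], re.IGNORECASE):
                leaks.append((i, node_id))
    return leaks

NODE_IDS = ["greet", "user_request", "gather", "user_detail", "present", "confirm"]
clean = [
    {"role": "system", "content": "You are a helpful travel booking assistant."},
    {"role": "assistant", "content": "Hi! Where would you like to go?"},
    {"role": "user", "content": "Japan for two weeks, please confirm what you need."},
]
leaky = clean + [{"role": "assistant", "content": "Moving to user_detail now."}]
print(find_label_leaks(clean, NODE_IDS))
print(find_label_leaks(leaky, NODE_IDS))
```

Any conversation that returns a non-empty list is worth regenerating or dropping before training.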

    What does it cost? About $40 in API usage.

    For whoever runs it: generate.py (the tested data generator). This walks the flowchart, writes each turn with a frontier model, and saves the conversations.

    import json, random, os
    from anthropic import Anthropic
    client = Anthropic()                       # reads your ANTHROPIC_API_KEY
    GENERATOR_MODEL = "claude-sonnet-4-5"       # the study used Claude Sonnet 4.5
    
    F = json.load(open("procedure.json"))
    NODES, EDGES = F["nodes"], F["edges"]; TERMINALS = set(F["terminals"].keys())
    
    def enumerate_acyclic_paths(max_paths=10000):
        # List every distinct route through the flowchart (no box visited twice).
        # This gives even coverage of all endings — a simple random walk lopsidedly
        # over-samples short "gave up / escalated" routes (a bug found in testing).
        paths = []
        def dfs(node, seen, acc):
            acc = acc + [node]
            if node in TERMINALS: paths.append(acc); return
            for e in EDGES:
                if e["from"] == node and e["to"] not in seen and len(paths) < max_paths:
                    dfs(e["to"], seen | {e["to"]}, acc)
        dfs(F["start"], {F["start"]}, []); return paths
    ALL_PATHS = enumerate_acyclic_paths()
    
    def fill(t, s):
        for k, v in s.items(): t = t.replace("{"+k+"}", str(v))
        return t
    
    def generate_turn(node, scenario, history):
        who = "the booking AGENT" if node["role"] == "agent" else "the CUSTOMER"
        transcript = "\n".join(f'{m["role"].upper()}: {m["content"]}' for m in history) or "(start)"
        r = client.messages.create(model=GENERATOR_MODEL, max_tokens=400, messages=[{"role":"user","content":
            f"Write one turn as {who}.\nYOUR INSTRUCTION: {fill(node['prompt'], scenario)}\n"
            f"CONVERSATION SO FAR:\n{transcript}\n\nWrite only {who}'s next message, naturally. "
            f"No labels, no mention of any procedure."}])
        return r.content[0].text.strip()
    
    def generate_conversation():
        path = random.choice(ALL_PATHS)
        scenario = {k: random.choice(v) for k, v in F["scenario_variables"].items()}
        turns = []
        for nid in path:
            node = NODES[nid]; role = "assistant" if node["role"] == "agent" else "user"
            text = generate_turn(node, scenario, turns)
            # Merge two same-role turns in a row into one (keeps a valid chat format —
            # another bug found and fixed in testing).
            if turns and turns[-1]["role"] == role: turns[-1]["content"] += " " + text
            else: turns.append({"role": role, "content": text})
        return [{"role": "system", "content": F["system_prompt"]}] + turns
    
    def build(n=2125, eval_frac=0.10):           # the study's volume + 90/10 split
        convs = [{"messages": generate_conversation()} for _ in range(n)]
        random.shuffle(convs); cut = int(len(convs)*(1-eval_frac))
        os.makedirs("data", exist_ok=True)
        for name, part in [("train", convs[:cut]), ("eval", convs[cut:])]:
            with open(f"data/{name}.jsonl", "w") as f:
                for c in part: f.write(json.dumps(c)+"\n")
    if __name__ == "__main__": build()
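
The comment in enumerate_acyclic_paths says a plain random walk over-samples the short endings. Here is a small, API-free sketch of that effect on the travel-booking flowchart, with the edges copied from procedure.json and the conditions dropped. Treating every outgoing edge as equally likely is my assumption about what "a simple random walk" means; the function names are illustrative.

```python
import random
from collections import Counter

# Edges of the travel-booking flowchart (conditions omitted).
EDGES = [
    ("greet", "user_request"), ("user_request", "gather"),
    ("gather", "user_detail"), ("user_detail", "present"),
    ("user_detail", "gather"), ("present", "user_choose"),
    ("user_choose", "confirm"), ("user_choose", "present"),
    ("user_choose", "abandoned"), ("confirm", "user_confirm"),
    ("user_confirm", "booked"), ("present", "escalated"),
]
TERMINALS = {"booked", "abandoned", "escalated"}

def random_walk_ending(rng, start="greet", max_steps=1000):
    """Follow uniformly random edges until an ending is reached."""
    node = start
    for _ in range(max_steps):
        if node in TERMINALS:
            return node
        node = rng.choice([dst for src, dst in EDGES if src == node])
    raise RuntimeError("walk did not reach an ending")

def acyclic_paths(start="greet"):
    """Enumerate every route that never revisits a box, like generate.py does."""
    paths = []
    def dfs(node, path):
        if node in TERMINALS:
            paths.append(path)
            return
        for src, dst in EDGES:
            if src == node and dst not in path:
                dfs(dst, path + [dst])
    dfs(start, [start])
    return paths

rng = random.Random(0)
walk_counts = Counter(random_walk_ending(rng) for _ in range(10_000))
path_counts = Counter(path[-1] for path in acyclic_paths())
print("random walk endings:    ", dict(walk_counts))
print("enumerated path endings:", dict(path_counts))
```

On this flowchart the random walk ends in escalation about 60% of the time, because it is offered that exit with probability one half every time it reaches the present box, whereas picking uniformly from the three enumerated paths gives each ending an equal third.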

    Stage 3: Train the small model on those conversations

    Take a small, free, open model and let it study the practice conversations until it has absorbed the procedure. It doesn't memorise a rulebook to re-read each time; it picks up the workflow as a habit, the way a new employee eventually stops checking the manual.

    Training needs a powerful graphics processor (a GPU).

    The one rule you must insist on: it has to be a full training, not the popular cheap shortcut (called "LoRA"). The study tested the shortcut and it failed to learn multi-step procedures properly. If someone offers to do it the quick way, the answer is no.

    For whoever runs it: the settings that matter, and the tested script. Base model: Qwen 2.5 (3B) for simple workflows, Qwen3-8B for complex ones. Full fine-tune, never LoRA. Learning rate 2×10⁻⁵, 10–20 passes over the data, keep the best version by held-out score.

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import SFTConfig, SFTTrainer
    
    BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"     # use "Qwen/Qwen3-8B" for complex workflows
    tok = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)
    ds = load_dataset("json", data_files={"train": "data/train.jsonl", "eval": "data/eval.jsonl"})
    
    cfg = SFTConfig(
        output_dir="compiled-model",
        num_train_epochs=20,                    # 10–20 passes
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,          # effective batch 16 (use 32 for the 8B model)
        learning_rate=2e-5,
        lr_scheduler_type="cosine",
        bf16=True,
        optim="adamw_8bit",                     # lets it fit one GPU (needs bitsandbytes); use "adamw_torch" if you have plenty of memory
        eval_strategy="epoch", save_strategy="epoch",
        load_best_model_at_end=True, metric_for_best_model="eval_loss", greater_is_better=False,
        assistant_only_loss=True,               # learn the AGENT's turns only
        max_length=4096,                        # NOTE: this was called max_seq_length in older versions — the old name now errors
        report_to=[],
    )
    SFTTrainer(model=model, processing_class=tok,
               train_dataset=ds["train"], eval_dataset=ds["eval"], args=cfg).train()

    Stage 4: Switch it on

    Put the trained model on a computer you control. Its only instruction is a single line. The model runs the whole workflow from what it learned.

    This is where the saving comes from: you're no longer renting a giant model by the word, and you're no longer re-sending the procedure on every reply. A whole conversation now costs a tiny fraction of a penny.

    For whoever runs it: serve and query. The study used vLLM on a rented GPU (about $2.50/hour).

    vllm serve ./compiled-model --max-model-len 4096 --port 8000
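
Once the server is up, querying it could look something like the sketch below. It relies on vLLM exposing an OpenAI-style chat endpoint at /v1/chat/completions and uses only the standard library. The model name is an assumption (vLLM normally names the served model after the path it was given, unless you override it), and build_payload and chat are names I made up for illustration.

```python
import json
import urllib.request

SYSTEM_PROMPT = "You are a helpful travel booking assistant."  # the single line from procedure.json
BASE_URL = "http://localhost:8000/v1/chat/completions"         # port taken from the serve command
MODEL_NAME = "./compiled-model"  # assumption: defaults to the path passed to `vllm serve`

def build_payload(history, user_text, max_tokens=400):
    """Build a chat request: the one-line system prompt, prior turns, then the new user turn."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages += history
    messages.append({"role": "user", "content": user_text})
    return {"model": MODEL_NAME, "messages": messages, "max_tokens": max_tokens}

def chat(history, user_text):
    """Send one turn to the running server and return the assistant's reply text."""
    data = json.dumps(build_payload(history, user_text)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat([], "Hi, I'd like to book a trip to Japan") returns the model's first reply. Notice that the request carries no flowchart at all, only the one-line system prompt and the conversation so far, which is why the prompt size stays constant.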

    The bottom line

    Stop overpaying for your agentic workflows on repeat. If your workflow is procedural, stable and high-volume, compiling it into a small self-hosted model is the natural move: near-frontier quality, fewer failures, and a cost that drops by two orders of magnitude, with the advantage growing the more complex your workflow gets.

    The article: https://arxiv.org/pdf/2605.22502


    pls gib like lol
