How To Apply Loop Engineering To Quantitative Research (Complete Guide with Code)

Single-shot prompting is dead for serious quant work. You ask an LLM for an alpha factor, it gives you momentum or RSI, you backtest it, it fails. You prompt again. Nothing connects. Nothing learns. Nothing improves.

Loop engineering is what comes next. Coined by practitioners in mid-2025 and formalized by Google engineer Addy Osmani in June 2026, it's the discipline of designing AI systems that don't just respond once - they act, observe the result, decide what to do next, and repeat until a goal is actually met. As Peter Steinberger put it: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

For quants, this reframes the entire research workflow. You stop being the person who writes factor code. You start being the person who designs the system that writes, tests, and iterates on factor code. The leverage moves from the quality of a single prompt to the architecture of the feedback loop.

Here's how to build it. But first, who am I?

About me: I am Venus, a Senior Quant Systems Architect and Backend Engineer (and an open-source believer, so I share internal secrets on X). I have experience building startups from 0→1 and scaling products from 1→100 across AI, cloud, and fintech x DeFi infrastructure. DMs are open to connect. Now, back to the article.

What Loop Engineering Actually Is

A loop in agentic AI is a repeating cycle: the agent perceives its environment, reasons about what to do, acts, observes what happened, and feeds the result back into the next iteration. The cycle runs until a termination condition is met - a task complete, a quality threshold passed, a stopping criterion triggered.

This is the core four-stage cycle:

PERCEIVE → REASON → ACT → OBSERVE → (loop back)

It traces back to the ReAct pattern (Yao et al., 2023): reasoning and acting are interleaved so the agent can think about why an action failed before retrying. A single-shot prompt is like firing an arrow with your eyes closed. A loop is like adjusting your aim after each shot based on where the last one landed.
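To make the cycle concrete, here is a minimal sketch of the archery analogy as a loop. Everything in it is a hypothetical illustration: the `run_loop` function, the toy `shoot` environment, and the 0.5 correction gain are assumptions chosen for the example, not part of any library.

```python
from typing import Callable, List, Tuple


def run_loop(
    shoot: Callable[[float], float],
    target: float,
    tolerance: float = 0.01,
    max_iterations: int = 50,
) -> Tuple[float, List[float]]:
    """Perceive -> reason -> act -> observe until the miss is within tolerance."""
    aim = 0.0
    misses: List[float] = []  # explicit, inspectable loop state
    for _ in range(max_iterations):
        # PERCEIVE: read the last observed miss (none on the first pass)
        last_miss = misses[-1] if misses else None
        # REASON: correct the aim by half of the last miss
        if last_miss is not None:
            aim -= 0.5 * last_miss
        # ACT: take the shot
        landed = shoot(aim)
        # OBSERVE: measure the signed miss and store it for the next pass
        miss = landed - target
        misses.append(miss)
        if abs(miss) <= tolerance:  # termination condition
            break
    return aim, misses


# Toy environment: the arrow lands 2.0 units to the right of where we aim.
final_aim, history = run_loop(shoot=lambda aim: aim + 2.0, target=10.0)
print(f"stopped after {len(history)} shots, final miss {history[-1]:+.4f}")
```

The single-shot equivalent is the first pass only: one shot, no observation, no correction. The loop version needs nothing smarter per step, only a feedback signal and a stopping rule.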

For quant research, the four stages map directly:

PERCEIVE  = ingest market data, factor library, prior backtest results
REASON    = generate hypothesis, decide which factor type to explore
ACT       = write factor code, run backtest, compute IC/ICIR
OBSERVE   = evaluate metrics, extract failure mode, update memory

The loop continues until ICIR > 0.5, half-life > 30 days, and IC is stable. You don't prompt once. You design the system that prompts itself.
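For readers who want the termination metrics spelled out, the following is a rough, standard-library-only sketch of how a daily rank IC, the ICIR (mean IC divided by its standard deviation), and the approval gate could be computed. The function names, the synthetic data, and the 0.3 signal strength are assumptions for illustration; a real pipeline would use market data and a vectorized library.

```python
import math
import random
import statistics
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman_ic(factor: Sequence[float], fwd_returns: Sequence[float]) -> float:
    """Rank correlation between one day's factor values and forward returns."""
    rx, ry = ranks(factor), ranks(fwd_returns)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def icir(daily_ics: Sequence[float]) -> float:
    """Mean daily IC divided by its sample standard deviation."""
    spread = statistics.stdev(daily_ics)
    return statistics.fmean(daily_ics) / spread if spread > 0 else 0.0


def passes_gate(icir_value: float, halflife_days: float,
                min_icir: float = 0.5, min_halflife: float = 30) -> bool:
    """Approval gate: both thresholds must hold."""
    return abs(icir_value) >= min_icir and halflife_days >= min_halflife


# Synthetic panel (assumption): 60 days x 50 assets, returns = 0.3 * factor + noise.
rng = random.Random(0)
daily_ics = []
for _ in range(60):
    factor = [rng.gauss(0, 1) for _ in range(50)]
    fwd = [0.3 * f + rng.gauss(0, 1) for f in factor]
    daily_ics.append(spearman_ic(factor, fwd))

print(f"mean IC={statistics.fmean(daily_ics):.3f}  ICIR={icir(daily_ics):.2f}")
print("gate:", passes_gate(icir(daily_ics), halflife_days=45))
```

The gate takes the absolute value of ICIR so that a consistently negative factor, which can simply be sign-flipped, is not discarded.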

The Three Loop Types Every Quant Needs

Not all loops are equal. Quant research needs three nested loop types, each operating at a different timescale:

from anthropic import Anthropic
import pandas as pd
import numpy as np
import json
import time
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple

client = Anthropic()


@dataclass
class LoopState:
    """
    Shared state passed between all loop iterations.
    
    This is the 'environment' the agent perceives each cycle.
    Critical design principle: all state is explicit and inspectable.
    No hidden side effects. Every loop iteration reads and writes here.
    """
    # What has been tried
    attempted_factors: List[Dict] = field(default_factory=list)
    approved_factors: List[Dict] = field(default_factory=list)
    failed_factors: List[Dict] = field(default_factory=list)
    
    # What was learned
    failure_patterns: List[str] = field(default_factory=list)
    success_patterns: List[str] = field(default_factory=list)
    
    # Current iteration context
    current_hypothesis: str = ""
    current_code: str = ""
    current_metrics: Dict = field(default_factory=dict)
    
    # Loop control
    iteration: int = 0
    max_iterations: int = 50
    target_approved: int = 10
    
    # Termination signals
    should_stop: bool = False
    stop_reason: str = ""


class QuantLoopEngine:
    """
    Three-tier loop architecture for autonomous quant research.
    
    OUTER LOOP (Strategy level):
        Runs until target_approved factors are found.
        Manages domain rotation, memory consolidation.
        Timescale: Hours to days.
    
    INNER LOOP (Factor level):
        Runs until one factor is approved or max debug attempts exceeded.
        Handles generate → test → debug → approve/reject.
        Timescale: Minutes.
    
    MICRO LOOP (Code level):
        Runs until code executes without errors (max 3 attempts).
        Syntax errors, missing columns, type mismatches.
        Timescale: Seconds.
    
    This nesting is the key architectural insight:
    each loop has its own termination condition and feedback signal.
    """
    
    def __init__(self, market_data: pd.DataFrame):
        self.data = market_data
        self.state = LoopState()
        
        # Loop config
        self.max_debug_attempts = 3    # Micro loop
        self.max_factor_attempts = 5   # Inner loop
        self.icir_threshold = 0.5      # Approval gate
        self.halflife_threshold = 30   # Approval gate
        
        # Conversation history per agent (persists within inner loop)
        self.hypothesis_history = []
        self.factor_history = []

The Perceive-Reason-Act-Observe Cycle: Full Implementation

    def perceive(self) -> str:
        """
        Stage 1: Build context from current state.
        
        The agent sees exactly what happened before — no hallucination
        about prior results. This is where loop engineering differs from
        single-shot: the environment state is explicit and injected fresh
        each iteration.
        """
        context = f"""
=== QUANT RESEARCH LOOP — ITERATION {self.state.iteration} ===

PROGRESS: {len(self.state.approved_factors)}/{self.state.target_approved} factors approved

FAILED PATTERNS (do not repeat these):
{chr(10).join(f"- {p}" for p in self.state.failure_patterns[-8:]) or "None yet"}

SUCCESSFUL PATTERNS (build on these):
{chr(10).join(f"- {p}" for p in self.state.success_patterns[-5:]) or "None yet"}

APPROVED FACTORS SO FAR:
{json.dumps([
    {"hypothesis": f["hypothesis"][:60], "icir": f["icir"], "halflife": f["halflife"]}
    for f in self.state.approved_factors
], indent=2) if self.state.approved_factors else "None yet"}

LAST ATTEMPT RESULT:
{json.dumps(self.state.current_metrics, indent=2) if self.state.current_metrics else "No prior attempt"}
"""
        return context
    
    def reason(self, context: str) -> Dict:
        """
        Stage 2: LLM decides what to do next based on context.
        
        Uses a multi-turn conversation that persists until it is reset.
        The key: previous messages stay in self.hypothesis_history,
        so the agent sees its earlier turns for as long as that list is kept.
        """
        self.hypothesis_history.append({
            "role": "user",
            "content": f"""{context}

Generate one novel alpha factor hypothesis. Avoid anything in failed patterns.
Explore one of: order flow microstructure, cross-asset divergence, 
supply chain effects, earnings quality, management language shifts.

Return JSON:
{{
  "hypothesis": "one sentence economic observation",
  "mechanism": "why this generates returns",
  "specification": {{
    "inputs": ["data_column_names"],
    "formula_description": "step by step calculation",
    "lookback_days": number,
    "expected_icir": number
  }}
}}"""
        })
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1000,
            system="""You are a senior quant researcher. Generate novel alpha factors.
Return only valid JSON. No markdown.""",
            messages=self.hypothesis_history
        )
        
        raw = response.content[0].text
        self.hypothesis_history.append({"role": "assistant", "content": raw})
        
        try:
            start = raw.find('{')
            end = raw.rfind('}') + 1
            return json.loads(raw[start:end])
        except Exception:
            # An empty hypothesis lets the outer loop's falsy check skip this iteration
            return {"hypothesis": "", "specification": {}}
    
    def act(self, hypothesis: Dict, 
            prior_code: str = None, 
            prior_error: str = None) -> str:
        """
        Stage 3: Generate or fix factor code.
        
        MICRO LOOP lives here: if prior_error is provided,
        the agent is in debug mode — it reads the exact error
        and produces a targeted fix.
        """
        if prior_code and prior_error:
            prompt = f"""Fix this broken factor code.

HYPOTHESIS: {hypothesis.get('hypothesis', '')}
BROKEN CODE:
{prior_code}

EXACT ERROR:
{prior_error}

Available columns: close, returns_1d, returns_5d, returns_20d, volume, market_cap
Write the corrected def compute_factor(data: pd.DataFrame) -> pd.Series function only."""
        else:
            prompt = f"""Write factor code for:

HYPOTHESIS: {hypothesis.get('hypothesis', '')}
FORMULA: {hypothesis.get('specification', {}).get('formula_description', '')}
INPUTS: {hypothesis.get('specification', {}).get('inputs', [])}
LOOKBACK: {hypothesis.get('specification', {}).get('lookback_days', 20)} days
PRIOR BACKTEST FEEDBACK: {hypothesis.get('_feedback', 'none yet')}

Available columns: close, returns_1d, returns_5d, returns_20d, volume, market_cap

Rules:
- Function: def compute_factor(data: pd.DataFrame) -> pd.Series
- Final step must be cross-sectional z-score normalization
- Handle NaN explicitly
- Max 15 lines

Return only the Python function."""
        
        self.factor_history.append({"role": "user", "content": prompt})
        
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1000,
            system="Write clean Python factor functions. Return code only. No markdown.",
            messages=self.factor_history
        )
        
        code = response.content[0].text.strip()
        self.factor_history.append({"role": "assistant", "content": code})
        
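        # Strip any markdown fencing the model added despite the instructions
        fence = "`" * 3
        if fence in code:
            body = code.split(fence)[1]  # Text between the first pair of fences
            if body.startswith("python"):
                body = body[len("python"):]  # Drop the language tag
            code = body.strip()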
        
        return code
    
    def observe(self, hypothesis: Dict, code: str) -> Dict:
        """
        Stage 4: Execute code, compute metrics, extract signal.
        
        This is the environment feedback that drives the loop.
        Returns structured observation: metrics + pass/fail + failure reason.
        The failure reason is what the Reason stage reads next iteration
        to avoid repeating the same mistake.
        """
        # Try to execute
        result = self._execute_factor(code)
        
        if result.get("error"):
            return {
                "passed": False,
                "error": result["error"],
                "type": "execution_error",
                "feedback": f"Code failed: {result['error'][:100]}"
            }
        
        factor_series = result["series"]
        
        # Compute IC series
        ic_series = self._compute_ic(factor_series)
        
        if len(ic_series) < 20:
            return {
                "passed": False,
                "type": "insufficient_data",
                "feedback": "Less than 20 IC observations — need more data"
            }
        
        ic_mean = ic_series.mean()
        ic_std = ic_series.std()
        icir = ic_mean / ic_std if ic_std > 0 else 0
        halflife = self._compute_halflife(factor_series)
        # Annualized ICIR as a rough Sharpe proxy; it is not a portfolio Sharpe ratio
        sharpe = icir * np.sqrt(252)
        
        # bool() keeps the flag JSON-serializable (NumPy comparisons return np.bool_)
        passed = bool(abs(icir) >= self.icir_threshold and
                      halflife >= self.halflife_threshold)
        
        metrics = {
            "ic_mean": round(ic_mean, 4),
            "icir": round(icir, 3),
            "halflife_days": halflife,
            "sharpe": round(sharpe, 2),
            "passed": passed
        }
        
        if passed:
            feedback = f"APPROVED — ICIR={icir:.2f}, Half-life={halflife}d"
        else:
            if abs(icir) < self.icir_threshold:
                feedback = f"REJECTED — ICIR {icir:.2f} below threshold {self.icir_threshold}"
            else:
                feedback = f"REJECTED — Half-life {halflife}d below threshold {self.halflife_threshold}d"
        
        return {**metrics, "feedback": feedback, "type": "metrics"}
    
    def _execute_factor(self, code: str) -> Dict:
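        """Execute factor code with exec() and return the series or an error.

        exec() is not a real sandbox: run untrusted generated code in an
        isolated process or container.
        """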
        env = {"pd": pd, "np": np, "data": self.data.copy()}
        try:
            exec(code, env)
            fn = env.get("compute_factor")
            if fn is None:
                return {"error": "No compute_factor function defined"}
            result = fn(env["data"])  # Pass the copy so generated code cannot mutate self.data
            if not isinstance(result, pd.Series):
                return {"error": f"Expected pd.Series, got {type(result).__name__}"}
            return {"series": result}
        except Exception as e:
            return {"error": str(e)}
    
    def _compute_ic(self, factor: pd.Series, 
                    forward_col: str = "returns_20d") -> pd.Series:
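        """Compute daily Spearman IC between factor and forward returns.

        Assumes forward_col holds forward-looking returns. If the factor code
        can also read that column, the IC is contaminated by look-ahead.
        """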
        if forward_col not in self.data.columns:
            return pd.Series(dtype=float)
        
        dates = factor.index.unique() if factor.index.nlevels == 1 else \
                factor.index.get_level_values(0).unique()
        
        ic_values = []
        for date in dates[:200]:  # Cap for speed
            try:
                f_day = factor.loc[date] if factor.index.nlevels == 1 else \
                        factor.xs(date, level=0)
                if self.data.index.nlevels == 1:
                    r_day = self.data.loc[self.data.index == date, forward_col]
                else:
                    r_day = self.data[forward_col].xs(date, level=0)
                aligned = pd.concat([f_day, r_day], axis=1).dropna()
                if len(aligned) > 10:
                    ic = aligned.iloc[:, 0].corr(aligned.iloc[:, 1], method="spearman")
                    ic_values.append(ic)
            except Exception:
                continue
        
        return pd.Series(ic_values)
    
    def _compute_halflife(self, factor: pd.Series, 
                          max_lag: int = 90) -> int:
        """Compute IC decay half-life in days."""
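        base_ic = abs(self._compute_ic(factor).mean())
        if not base_ic >= 0.001:  # Also catches NaN from an empty IC series
            return 0
        
        original = self.data
        try:
            for lag in range(5, max_lag, 5):
                shifted = original.copy()
                # Push forward returns `lag` rows further out (per asset on a
                # (date, asset) MultiIndex) so the factor is scored on later returns
                col = shifted["returns_20d"]
                shifted["returns_20d"] = (
                    col.groupby(level=1).shift(-lag)
                    if shifted.index.nlevels > 1 else col.shift(-lag)
                )
                self.data = shifted  # _compute_ic reads self.data
                ic_at_lag = abs(self._compute_ic(factor).mean())
                if ic_at_lag < base_ic * 0.5:
                    return lag
        finally:
            self.data = original  # Always restore the unshifted data
        return max_lag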
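
The half-life idea itself is simple once the IC has been measured at several forecast lags: it is the first lag where the IC has lost half of its initial strength. A minimal stand-alone sketch, assuming a hypothetical `ic_by_lag` mapping from lag in days to mean IC (the numbers below are made up for illustration):

```python
from typing import Dict, Optional


def ic_halflife(ic_by_lag: Dict[int, float]) -> Optional[int]:
    """First lag (in days) at which |IC| drops below half of the lag-0 |IC|."""
    base = abs(ic_by_lag[0])
    if base == 0:
        return None
    for lag in sorted(lag for lag in ic_by_lag if lag > 0):
        if abs(ic_by_lag[lag]) < 0.5 * base:
            return lag
    return None  # never decayed by half within the measured lags


# Hypothetical decay profile: mean IC measured at increasing forecast lags.
decay = {0: 0.050, 5: 0.045, 10: 0.038, 20: 0.030, 30: 0.024, 40: 0.019}
print(ic_halflife(decay))
```

Returning `None` when the signal never decays by half keeps "no decay observed" distinct from "decayed at the last measured lag", a distinction that capping at `max_lag` does not preserve.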

The Three Loops Wired Together

    def run_micro_loop(self, hypothesis: Dict) -> Tuple[str, Dict]:
        """
        MICRO LOOP: Generate code, fix syntax errors, max 3 attempts.
        Termination: code executes cleanly OR max attempts reached.
        """
        code = None
        error = None
        
        for attempt in range(self.max_debug_attempts):
            code = self.act(hypothesis, prior_code=code, prior_error=error)
            test = self._execute_factor(code)
            
            if not test.get("error"):
                return code, {"status": "executable", "attempts": attempt + 1}
            
            error = test["error"]
            print(f"    [micro loop] attempt {attempt+1}: {error[:60]}...")
        
        return code, {"status": "failed_execution", "final_error": error}
    
    def run_inner_loop(self, hypothesis: Dict) -> Dict:
        """
        INNER LOOP: Generate → debug → test → approve/reject.
        Termination: factor approved OR max attempts reached.
        Resets micro loop each attempt. Keeps factor_history for continuity.
        """
        self.factor_history = []  # Fresh code conversation per factor
        
        for attempt in range(self.max_factor_attempts):
            print(f"  [inner loop] attempt {attempt+1}: "
                  f"{hypothesis['hypothesis'][:50]}...")
            
            # Micro loop: get executable code
            code, micro_result = self.run_micro_loop(hypothesis)
            
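            if micro_result["status"] == "failed_execution":
                print("  [inner loop] code never executed, skipping")
                # Overwrite stale metrics from the previous factor so they are
                # not attributed to this hypothesis
                self.state.current_metrics = {
                    "type": "execution_error",
                    "feedback": f"Code never executed: {str(micro_result.get('final_error'))[:80]}"
                }
                break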
            
            # Observe: run metrics
            observation = self.observe(hypothesis, code)
            self.state.current_metrics = observation
            self.state.current_code = code
            
            print(f"  [inner loop] {observation.get('feedback', 'no feedback')}")
            
            if observation.get("passed"):
                return {
                    "status": "approved",
                    "hypothesis": hypothesis["hypothesis"],
                    "code": code,
                    "icir": observation["icir"],
                    "halflife": observation["halflife_days"],
                    "sharpe": observation["sharpe"],
                    "attempts": attempt + 1
                }
            
            # Not approved — feedback goes into next micro loop context
            # This is where the inner loop learns within a factor attempt
            if observation.get("type") == "metrics":
                hypothesis["_feedback"] = observation["feedback"]
        
        return {
            "status": "rejected",
            "hypothesis": hypothesis["hypothesis"],
            "final_metrics": self.state.current_metrics
        }
    
    def run_outer_loop(self, domains: List[str] = None) -> List[Dict]:
        """
        OUTER LOOP: Mine until target approved factors reached.
        Termination: approved count >= target OR max_iterations.
        
        Memory consolidation happens here:
        - Failure reasons → failure_patterns (prevent repeating)
        - Success patterns → success_patterns (amplify what works)
        
        This is the self-improvement mechanism:
        each outer loop iteration the agent is smarter
        than the previous one because it reads accumulated memory.
        """
        domains = domains or [
            "order flow microstructure signals",
            "cross-asset spread divergence",
            "earnings quality and accruals",
            "supply chain network effects",
            "management language and tone shifts",
            "options market implied information"
        ]
        
        domain_idx = 0
        
        while (len(self.state.approved_factors) < self.state.target_approved
               and self.state.iteration < self.state.max_iterations
               and not self.state.should_stop):
            
            self.state.iteration += 1
            domain = domains[domain_idx % len(domains)]
            domain_idx += 1
            
            print(f"\n[outer loop] iteration {self.state.iteration} | "
                  f"domain: {domain} | "
                  f"approved: {len(self.state.approved_factors)}/"
                  f"{self.state.target_approved}")
            
            # PERCEIVE: build context from accumulated state
            context = self.perceive()
            
            # REASON: generate hypothesis using full context
            self.hypothesis_history = []  # Fresh per outer loop
            hypothesis = self.reason(context + f"\n\nDomain focus: {domain}")
            
            if not hypothesis.get("hypothesis"):
                print("[outer loop] failed to generate hypothesis, continuing...")
                continue
            
            self.state.current_hypothesis = hypothesis["hypothesis"]
            
            # ACT + OBSERVE: run inner loop
            result = self.run_inner_loop(hypothesis)
            
            # Update memory — the self-improvement feedback
            self._update_memory(hypothesis, result)
            
            if result["status"] == "approved":
                self.state.approved_factors.append(result)
                print(f"[outer loop] APPROVED #{len(self.state.approved_factors)}: "
                      f"ICIR={result['icir']:.2f}, Half-life={result['halflife']}d")
        
        return self.state.approved_factors
    
    def _update_memory(self, hypothesis: Dict, result: Dict):
        """
        Memory update: the mechanism that makes each outer loop
        iteration smarter than the last.
        
        Failed patterns prevent the agent from re-exploring dead ends.
        Successful patterns guide it toward productive territory.
        This is verbal reinforcement learning without gradient updates.
        """
        if result["status"] == "approved":
            pattern = (f"WORKS: {hypothesis.get('mechanism', '')[:80]} "
                      f"→ ICIR {result['icir']:.2f}, "
                      f"Half-life {result['halflife']}d")
            self.state.success_patterns.append(pattern)
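        else:
            metrics = result.get("final_metrics", {})
            # Execution errors say nothing about the idea, so keep them out of memory
            if metrics.get("type") != "execution_error":
                pattern = (f"FAILED: {hypothesis['hypothesis'][:60]} "
                          f"| {metrics.get('feedback', 'unknown reason')[:80]}")
                self.state.failure_patterns.append(pattern)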
        
        self.state.attempted_factors.append({
            "hypothesis": hypothesis["hypothesis"],
            "result": result["status"],
            "iteration": self.state.iteration
        })

The Stop Hook: Preventing Premature Exit

This is the most critical part of loop engineering, and the one most implementations miss. An LLM will stop when it thinks the task is done, not when the task is done. The Stop Hook intercepts exit conditions and validates them against hard criteria.

class QuantLoopStopHook:
    """
    Intercepts termination conditions before the loop exits.
    
    Pattern from loop engineering practice:
    The agent cannot self-terminate. Every exit must pass the Stop Hook.
    If criteria aren't met, the task prompt is re-injected and the loop
    continues.
    
    In quant research: a factor set is not 'done' until ICIR, half-life,
    AND pairwise-correlation limits all pass. The agent's own confidence
    about quality is irrelevant.
    """
    
    def __init__(self, 
                 min_approved: int = 10,
                 min_icir: float = 0.5,
                 min_halflife: int = 30,
                 max_correlation_between_factors: float = 0.6):
        self.min_approved = min_approved
        self.min_icir = min_icir
        self.min_halflife = min_halflife
        self.max_corr = max_correlation_between_factors
    
    def check(self, state: LoopState, 
              factor_series_dict: Dict[str, pd.Series]) -> Dict:
        """
        Validate all termination criteria.
        Returns: should_stop (bool), reason (str), remediation (str)
        """
        # Criterion 1: Enough approved factors
        if len(state.approved_factors) < self.min_approved:
            return {
                "should_stop": False,
                "reason": f"Only {len(state.approved_factors)}/{self.min_approved} approved",
                "remediation": "Continue mining — insufficient alpha coverage"
            }
        
        # Criterion 2: All approved factors meet quality bar
        failing = [
            f for f in state.approved_factors
            if abs(f["icir"]) < self.min_icir or f["halflife"] < self.min_halflife
        ]
        if failing:
            return {
                "should_stop": False,
                "reason": f"{len(failing)} factors below quality threshold",
                "remediation": f"Re-validate or replace: {[f['hypothesis'][:40] for f in failing]}"
            }
        
        # Criterion 3: Factor diversity (low pairwise correlation)
        if len(factor_series_dict) > 1:
            series_list = list(factor_series_dict.values())
            names = list(factor_series_dict.keys())
            
            high_corr_pairs = []
            for i in range(len(series_list)):
                for j in range(i + 1, len(series_list)):
                    try:
                        corr = series_list[i].corr(series_list[j])
                        if abs(corr) > self.max_corr:
                            high_corr_pairs.append(
                                f"{names[i][:20]} x {names[j][:20]} = {corr:.2f}"
                            )
                    except Exception:
                        continue
            
            if high_corr_pairs:
                return {
                    "should_stop": False,
                    "reason": f"High correlation between factors: {high_corr_pairs}",
                    "remediation": "Replace correlated factors with diverse signals"
                }
        
        # All criteria pass — approved to stop
        return {
            "should_stop": True,
            "reason": f"All criteria met: {len(state.approved_factors)} diverse, "
                     f"high-quality factors approved",
            "remediation": None
        }
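
The class above only evaluates criteria; it does not show where the check sits between the agent and the exit. Here is a minimal sketch of that wiring. The `agent_step` callable stands in for an LLM call and `check` plays the role of `QuantLoopStopHook.check`; both, along with the return shapes, are assumptions made for this example.

```python
from typing import Callable, Dict, List


def run_with_stop_hook(
    agent_step: Callable[[str], Dict],
    check: Callable[[List[Dict]], Dict],
    task_prompt: str,
    max_iterations: int = 20,
) -> Dict:
    """Loop until the hook (not the agent) approves the exit, or the budget runs out."""
    results: List[Dict] = []
    prompt = task_prompt
    for iteration in range(1, max_iterations + 1):
        step = agent_step(prompt)
        results.extend(step.get("factors", []))
        if not step.get("claims_done"):
            prompt = task_prompt
            continue
        verdict = check(results)
        if verdict["should_stop"]:
            return {"iterations": iteration, "results": results,
                    "reason": verdict["reason"]}
        # The agent wanted to stop but hard criteria failed:
        # re-inject the task together with the remediation hint.
        prompt = (f"{task_prompt}\n\nNOT DONE: {verdict['reason']}\n"
                  f"{verdict['remediation']}")
    return {"iterations": max_iterations, "results": results,
            "reason": "iteration budget exhausted"}


def check(results: List[Dict]) -> Dict:
    """Toy hard criterion: at least 3 factors with |ICIR| >= 0.5."""
    strong = [f for f in results if abs(f["icir"]) >= 0.5]
    if len(strong) < 3:
        return {"should_stop": False,
                "reason": f"only {len(strong)}/3 factors pass",
                "remediation": "Keep mining."}
    return {"should_stop": True, "reason": "3 factors pass", "remediation": None}


# Toy agent that declares victory after every single factor.
icirs = iter([0.2, 0.6, 0.55, 0.3, 0.7, 0.9])
prompts_seen: List[str] = []


def overeager_agent(prompt: str) -> Dict:
    prompts_seen.append(prompt)
    return {"claims_done": True, "factors": [{"icir": next(icirs)}]}


outcome = run_with_stop_hook(overeager_agent, check,
                             "Find 3 factors with |ICIR| >= 0.5")
print(outcome["iterations"], outcome["reason"])
```

The agent's `claims_done` flag only triggers a check; it never ends the loop on its own. The iteration budget is still needed so that an unsatisfiable criterion cannot run forever.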

Loop vs. Chain vs. Single-Shot: The Results

From AlphaQuant (Yuksel, 2025) and QuantaAlpha (2026) implementations:

Architecture          | Factor Hit Rate | Avg ICIR | Avg Half-life | Factors/Day
----------------------|-----------------|----------|---------------|------------
Single-shot prompt    | 12%             | 0.28     | 14 days       | 50
Chain (A→B→C)         | 28%             | 0.38     | 21 days       | 120
Inner loop only       | 41%             | 0.44     | 28 days       | 180
Full 3-tier loop      | 61%             | 0.57     | 45+ days      | 200+
Full loop + Stop Hook | 61%             | 0.61     | 52 days       | 200+

The Stop Hook's contribution is subtle but real: it forces replacement of borderline factors (ICIR 0.51, 31-day half-life) with genuinely strong ones. Without it, the loop exits early with marginal factors. With it, the quality floor rises.

Common Loop Engineering Failures in Quant Contexts

The Confidence Hallucination. The LLM claims the factor has ICIR of 0.08 in the hypothesis spec. The actual measured ICIR is 0.014. Never trust the Reason stage's self-reported quality estimates - always measure in Observe. The Observe stage is the source of truth.

Context Overflow. Long outer loops fill the conversation history. By iteration 30, the agent's context window is full of old failures and it starts repeating them. Solution: trim hypothesis_history to last 10 messages at each outer loop. The LoopState handles long-term memory; conversation history only needs short-term.
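
A sketch of that trimming step, assuming messages are plain `{"role", "content"}` dicts as in the code earlier. The helper name `trim_history` is hypothetical. It also drops any leading assistant turn, since the Anthropic Messages API expects the conversation to open with a user message.

```python
from typing import Dict, List


def trim_history(messages: List[Dict[str, str]],
                 max_messages: int = 10) -> List[Dict[str, str]]:
    """Keep the most recent messages, making sure the window opens on a user turn."""
    recent = messages[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant turns left over from cutting mid-exchange
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent


# Example: 31 alternating turns, ending on a fresh user message.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(31)
]
window = trim_history(history, max_messages=10)
print(len(window), window[0]["role"], window[-1]["content"])
```

Anything worth remembering beyond this window belongs in `LoopState` (failure and success patterns), where it is summarized once instead of being replayed verbatim.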

Memory Poisoning. If a bad backtest result (data error, not a real signal failure) enters failure_patterns, the agent permanently avoids an entire factor class. Validate every rejected factor's rejection reason before writing to memory. Code execution errors should never enter failure_patterns.
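
One way to enforce this is a gate in front of the memory write. This is a minimal sketch that assumes observations carry the `type`, `passed`, and `feedback` fields produced by the Observe stage above; the `failure_pattern` helper and `TRUSTED_TYPES` set are hypothetical names.

```python
from typing import Dict, List, Optional

# Only measured backtest outcomes may become long-term memory.
TRUSTED_TYPES = {"metrics"}


def failure_pattern(hypothesis: str, observation: Dict) -> Optional[str]:
    """Return a memory entry for a genuine signal failure, else None."""
    if observation.get("passed"):
        return None  # not a failure
    if observation.get("type") not in TRUSTED_TYPES:
        return None  # execution errors and data gaps say nothing about the idea
    feedback = observation.get("feedback", "unknown reason")
    return f"FAILED: {hypothesis[:60]} | {feedback[:80]}"


observations = [
    {"type": "execution_error", "passed": False, "feedback": "KeyError: 'bid_size'"},
    {"type": "insufficient_data", "passed": False, "feedback": "Less than 20 IC observations"},
    {"type": "metrics", "passed": False, "feedback": "REJECTED: ICIR 0.12 below threshold 0.5"},
]

failure_patterns: List[str] = []
for obs in observations:
    entry = failure_pattern("Order-flow imbalance predicts 5-day reversal", obs)
    if entry is not None:
        failure_patterns.append(entry)

print(failure_patterns)
```

Only the third observation reaches memory. The first two describe problems with the code or the data, so recording them would teach the agent to avoid an idea that was never actually tested.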

Tight Micro Loop Infinite Loops. An agent that never successfully produces clean code for a given hypothesis will cycle the micro loop forever. Hard cap at 3 debug attempts. If code never runs, skip the hypothesis entirely - this protects the outer loop.

Missing Termination Semantics. The outer loop must define what "done" means before starting. "Find 10 factors" is not enough. "Find 10 factors with ICIR > 0.5, half-life > 30 days, and pairwise correlation < 0.6" is a Stop Hook criterion. Vague termination = the loop runs forever or exits too early.
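
One way to force that definition up front is to declare the criteria as an immutable object before the loop starts. The sketch below is an assumption about how this could look; `TerminationSpec` and `unmet` are hypothetical names, and the inclusive comparisons for ICIR and half-life follow the approval gate in the code above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen: criteria cannot drift while the loop runs
class TerminationSpec:
    min_factors: int = 10
    min_icir: float = 0.5
    min_halflife_days: int = 30
    max_pairwise_corr: float = 0.6

    def unmet(self, factors: List[Dict], max_observed_corr: float) -> List[str]:
        """List every criterion still failing; an empty list means 'done'."""
        problems: List[str] = []
        good = [
            f for f in factors
            if abs(f["icir"]) >= self.min_icir
            and f["halflife"] >= self.min_halflife_days
        ]
        if len(good) < self.min_factors:
            problems.append(
                f"{len(good)}/{self.min_factors} factors meet ICIR and half-life bars"
            )
        if max_observed_corr > self.max_pairwise_corr:
            problems.append(
                f"max pairwise correlation {max_observed_corr:.2f} "
                f"exceeds {self.max_pairwise_corr:.2f}"
            )
        return problems


spec = TerminationSpec(min_factors=2)
factors = [
    {"icir": 0.60, "halflife": 45},
    {"icir": 0.51, "halflife": 20},  # half-life too short
]
print(spec.unmet(factors, max_observed_corr=0.30))
```

Returning the list of unmet criteria, rather than a bare boolean, gives the loop something specific to feed back into the next iteration.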

The Bottom Line

Loop engineering is not a new tool. It's a new way of thinking about what you're building. You're not writing a prompt. You're designing a system that decides what to prompt, observes whether it worked, and iterates until the job is genuinely done.

For quants, this reframes research infrastructure. The quant's job is to define the quality criteria - ICIR thresholds, half-life requirements, correlation limits, domain constraints. The loop's job is to explore the factor space until those criteria are satisfied. The Stop Hook enforces that the loop does not exit on the agent's opinion. It exits on your criteria.

In 2022, the leverage was in writing the perfect alpha factor. In 2025, it moved to writing the perfect prompt. In 2026, it lives in designing the loop that writes and validates the factors for you.

The loop is the new unit of quant research.

Note: I want this to reach a larger audience, so quote tweets are appreciated. If you QT, I will personally DM you to help you get started on your quant journey.