How To Apply Loop Engineering To Quantitative Research (Complete Guide with Code)
Single-shot prompting is dead for serious quant work. You ask an LLM for an alpha factor, it gives you momentum or RSI, you backtest it, it fails. You prompt again. Nothing connects. Nothing learns. Nothing improves.
Loop engineering is what comes next. Coined by practitioners in mid-2025 and formalized by Google engineer Addy Osmani in June 2026, it's the discipline of designing AI systems that don't just respond once - they act, observe the result, decide what to do next, and repeat until a goal is actually met. As Peter Steinberger put it: "You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."
For quants, this reframes the entire research workflow. You stop being the person who writes factor code. You start being the person who designs the system that writes, tests, and iterates on factor code. The leverage moves from the quality of a single prompt to the architecture of the feedback loop.
Here's how to build it. But first, who am I?
About me: I am Venus (an open-source believer, so I share internal secrets on X), a Senior Quant Systems Architect and Backend Engineer experienced in building startups from 0→1 and scaling products from 1→100 across AI, cloud, and fintech x DeFi infrastructure. DMs are open to connect. Let's get back to the article.
What Loop Engineering Actually Is
A loop in agentic AI is a repeating cycle: the agent perceives its environment, reasons about what to do, acts, observes what happened, and feeds the result back into the next iteration. The cycle runs until a termination condition is met - a task complete, a quality threshold passed, a stopping criterion triggered.
This is the core four-stage cycle:
PERCEIVE → REASON → ACT → OBSERVE → (loop back)
It traces back to the ReAct pattern (Yao et al., 2023): reasoning and acting are interleaved, so the agent can think about why an action failed before retrying. A single-shot prompt is like firing an arrow with your eyes closed. A loop is like adjusting your aim after each shot based on where the last one landed.
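The cycle can be sketched in a few lines. The toy below is an illustration only, not the quant engine built later in this article: `ToyState`, `run_loop`, and the "move half of the last error" rule are hypothetical choices, and the environment is just a hidden number the loop tries to hit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyState:
    """Explicit loop state (hypothetical): everything the loop knows lives here."""
    target: float                      # stands in for the environment
    guess: float = 0.0                 # the agent's current output
    errors: List[float] = field(default_factory=list)  # observations so far


def run_loop(state: ToyState, tolerance: float = 0.01,
             max_iterations: int = 50) -> Optional[int]:
    """Return the iteration at which the goal was met, or None if the budget ran out."""
    for iteration in range(1, max_iterations + 1):
        # PERCEIVE: read the most recent observation from state
        last_error = state.errors[-1] if state.errors else None
        # REASON: decide the next move from what was observed
        step = 0.0 if last_error is None else 0.5 * last_error
        # ACT: apply the move
        state.guess += step
        # OBSERVE: measure the result and write it back into state
        error = state.target - state.guess
        state.errors.append(error)
        # Termination is checked against a measured value, not a self-report
        if abs(error) <= tolerance:
            return iteration
    return None


if __name__ == "__main__":
    print(run_loop(ToyState(target=10.0)))
```

The two exits matter equally: the loop stops either because the measured error is within tolerance or because the iteration budget is spent. The same pair of exits reappears in every loop below.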
For quant research, the four stages map directly:
PERCEIVE = ingest market data, factor library, prior backtest results
REASON = generate hypothesis, decide which factor type to explore
ACT = write factor code, run backtest, compute IC/ICIR
OBSERVE = evaluate metrics, extract failure mode, update memory
The loop continues until ICIR > 0.5, half-life > 30 days, and IC is stable. You don't prompt once. You design the system that prompts itself.
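Since the termination condition is stated in terms of IC and ICIR, a small worked example of those two numbers may help. This is a minimal sketch on synthetic data, not the engine's own code: the long-format layout (`date`, `factor`, `fwd_ret` columns), the helper names, and the data generator are all assumptions made for illustration. Spearman IC is computed here as the Pearson correlation of cross-sectional ranks.

```python
import numpy as np
import pandas as pd


def daily_rank_ic(panel: pd.DataFrame, factor_col: str = "factor",
                  ret_col: str = "fwd_ret") -> pd.Series:
    """One Spearman IC per date, computed as the Pearson correlation of ranks."""
    ics = {}
    for date, day in panel.groupby("date"):
        ics[date] = day[factor_col].rank().corr(day[ret_col].rank())
    return pd.Series(ics, dtype=float)


def icir(ic_series: pd.Series) -> float:
    """ICIR = mean(IC) / std(IC); 0.0 when the IC series has no variation."""
    std = ic_series.std()
    return float(ic_series.mean() / std) if std > 0 else 0.0


def make_synthetic_panel(n_dates: int = 60, n_assets: int = 50,
                         signal: float = 0.2, seed: int = 0) -> pd.DataFrame:
    """Synthetic data (assumption): forward return = signal * factor + noise."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2024-01-01", periods=n_dates, freq="D").repeat(n_assets)
    factor = rng.standard_normal(n_dates * n_assets)
    fwd_ret = signal * factor + rng.standard_normal(n_dates * n_assets)
    return pd.DataFrame({"date": dates, "factor": factor, "fwd_ret": fwd_ret})


if __name__ == "__main__":
    ic = daily_rank_ic(make_synthetic_panel())
    print(f"mean IC = {ic.mean():.3f}, ICIR = {icir(ic):.2f}")
```

ICIR rewards consistency rather than size: a factor with a small but steady daily IC can score higher than one with a larger IC that swings in sign.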
The Three Loop Types Every Quant Needs
Not all loops are equal. Quant research needs three nested loop types, each operating at a different timescale:
from anthropic import Anthropic
import pandas as pd
import numpy as np
import json
import time
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple

client = Anthropic()


@dataclass
class LoopState:
    """
    Shared state passed between all loop iterations.
    This is the 'environment' the agent perceives each cycle.
    Critical design principle: all state is explicit and inspectable.
    No hidden side effects. Every loop iteration reads and writes here.
    """
    # What has been tried
    attempted_factors: List[Dict] = field(default_factory=list)
    approved_factors: List[Dict] = field(default_factory=list)
    failed_factors: List[Dict] = field(default_factory=list)
    # What was learned
    failure_patterns: List[str] = field(default_factory=list)
    success_patterns: List[str] = field(default_factory=list)
    # Current iteration context
    current_hypothesis: str = ""
    current_code: str = ""
    current_metrics: Dict = field(default_factory=dict)
    # Loop control
    iteration: int = 0
    max_iterations: int = 50
    target_approved: int = 10
    # Termination signals
    should_stop: bool = False
    stop_reason: str = ""


class QuantLoopEngine:
    """
    Three-tier loop architecture for autonomous quant research.

    OUTER LOOP (Strategy level):
        Runs until target_approved factors are found.
        Manages domain rotation, memory consolidation.
        Timescale: Hours to days.
    INNER LOOP (Factor level):
        Runs until one factor is approved or max debug attempts exceeded.
        Handles generate → test → debug → approve/reject.
        Timescale: Minutes.
    MICRO LOOP (Code level):
        Runs until code executes without errors (max 3 attempts).
        Syntax errors, missing columns, type mismatches.
        Timescale: Seconds.

    This nesting is the key architectural insight:
    each loop has its own termination condition and feedback signal.
    """

    def __init__(self, market_data: pd.DataFrame):
        self.data = market_data
        self.state = LoopState()
        # Loop config
        self.max_debug_attempts = 3    # Micro loop
        self.max_factor_attempts = 5   # Inner loop
        self.icir_threshold = 0.5      # Approval gate
        self.halflife_threshold = 30   # Approval gate
        # Conversation history per agent (persists within inner loop)
        self.hypothesis_history = []
        self.factor_history = []
The Perceive-Reason-Act-Observe Cycle: Full Implementation
    # Methods of QuantLoopEngine, continued
    def perceive(self) -> str:
        """
        Stage 1: Build context from current state.
        The agent sees exactly what happened before, with no hallucination
        about prior results. This is where loop engineering differs from
        single-shot: the environment state is explicit and injected fresh
        each iteration.
        """
        failed = "\n".join(f"- {p}" for p in self.state.failure_patterns[-8:]) or "None yet"
        worked = "\n".join(f"- {p}" for p in self.state.success_patterns[-5:]) or "None yet"
        approved_summary = [
            {"hypothesis": f["hypothesis"][:60], "icir": f["icir"], "halflife": f["halflife"]}
            for f in self.state.approved_factors
        ]
        # json.dumps([]) returns "[]", which is truthy, so test the list itself
        approved = json.dumps(approved_summary, indent=2) if approved_summary else "None yet"
        # default=str keeps non-JSON types (e.g. NumPy booleans) from raising
        last = (json.dumps(self.state.current_metrics, indent=2, default=str)
                if self.state.current_metrics else "No prior attempt")
        context = f"""
=== QUANT RESEARCH LOOP | ITERATION {self.state.iteration} ===
PROGRESS: {len(self.state.approved_factors)}/{self.state.target_approved} factors approved
FAILED PATTERNS (do not repeat these):
{failed}
SUCCESSFUL PATTERNS (build on these):
{worked}
APPROVED FACTORS SO FAR:
{approved}
LAST ATTEMPT RESULT:
{last}
"""
        return context
    def reason(self, context: str) -> Dict:
        """
        Stage 2: LLM decides what to do next based on context.
        Uses multi-turn conversation within the inner loop.
        The key: previous messages stay in hypothesis_history,
        so the agent remembers what it tried earlier THIS iteration.
        """
        self.hypothesis_history.append({
            "role": "user",
            "content": f"""{context}
Generate one novel alpha factor hypothesis. Avoid anything in failed patterns.
Explore one of: order flow microstructure, cross-asset divergence,
supply chain effects, earnings quality, management language shifts.
Return JSON:
{{
  "hypothesis": "one sentence economic observation",
  "mechanism": "why this generates returns",
  "specification": {{
    "inputs": ["data_column_names"],
    "formula_description": "step by step calculation",
    "lookback_days": number,
    "expected_icir": number
  }}
}}"""
        })
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1000,
            system="""You are a senior quant researcher. Generate novel alpha factors.
Return only valid JSON. No markdown.""",
            messages=self.hypothesis_history
        )
        raw = response.content[0].text
        self.hypothesis_history.append({"role": "assistant", "content": raw})
        try:
            # Extract the outermost JSON object from the reply
            start = raw.find('{')
            end = raw.rfind('}') + 1
            return json.loads(raw[start:end])
        except ValueError:
            # Empty hypothesis so the outer loop's "no hypothesis" check skips
            # this iteration instead of backtesting the literal text "parse error"
            return {"hypothesis": "", "specification": {}}
    def act(self, hypothesis: Dict,
            prior_code: Optional[str] = None,
            prior_error: Optional[str] = None) -> str:
        """
        Stage 3: Generate or fix factor code.
        MICRO LOOP lives here: if prior_error is provided,
        the agent is in debug mode. It reads the exact error
        and produces a targeted fix.
        """
        if prior_code and prior_error:
            prompt = f"""Fix this broken factor code.
HYPOTHESIS: {hypothesis.get('hypothesis', '')}
BROKEN CODE:
{prior_code}
EXACT ERROR:
{prior_error}
Available columns: close, returns_1d, returns_5d, returns_20d, volume, market_cap
Write the corrected def compute_factor(data: pd.DataFrame) -> pd.Series function only."""
        else:
            prompt = f"""Write factor code for:
HYPOTHESIS: {hypothesis.get('hypothesis', '')}
FORMULA: {hypothesis.get('specification', {}).get('formula_description', '')}
INPUTS: {hypothesis.get('specification', {}).get('inputs', [])}
LOOKBACK: {hypothesis.get('specification', {}).get('lookback_days', 20)} days
PRIOR RESULT FOR THIS HYPOTHESIS: {hypothesis.get('_feedback', 'none')}
Available columns: close, returns_1d, returns_5d, returns_20d, volume, market_cap
Rules:
- Function: def compute_factor(data: pd.DataFrame) -> pd.Series
- Final step must be cross-sectional z-score normalization
- Handle NaN explicitly
- Max 15 lines
Return only the Python function."""
        self.factor_history.append({"role": "user", "content": prompt})
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1000,
            system="Write clean Python factor functions. Return code only. No markdown.",
            messages=self.factor_history
        )
        code = response.content[0].text.strip()
        self.factor_history.append({"role": "assistant", "content": code})
        # Strip any markdown fencing the model added despite the instructions.
        # The fence is three backticks, built here so it survives copy-paste.
        fence = "`" * 3
        if fence in code:
            body = code.split(fence)[1]
            if body.startswith("python"):
                body = body[len("python"):]
            code = body.strip()
        return code
    def observe(self, hypothesis: Dict, code: str) -> Dict:
        """
        Stage 4: Execute code, compute metrics, extract signal.
        This is the environment feedback that drives the loop.
        Returns structured observation: metrics + pass/fail + failure reason.
        The failure reason is what the Reason stage reads next iteration
        to avoid repeating the same mistake.
        """
        # Try to execute
        result = self._execute_factor(code)
        if result.get("error"):
            return {
                "passed": False,
                "error": result["error"],
                "type": "execution_error",
                "feedback": f"Code failed: {result['error'][:100]}"
            }
        factor_series = result["series"]
        # Compute IC series
        ic_series = self._compute_ic(factor_series)
        if len(ic_series) < 20:
            return {
                "passed": False,
                "type": "insufficient_data",
                "feedback": "Fewer than 20 IC observations; need more data"
            }
        ic_mean = float(ic_series.mean())
        ic_std = float(ic_series.std())
        icir = ic_mean / ic_std if ic_std > 0 else 0.0
        halflife = self._compute_halflife(factor_series)
        # Annualized ICIR, reported under the key "sharpe";
        # this is not a portfolio Sharpe ratio
        sharpe = icir * float(np.sqrt(252))
        # bool() so the result is a plain Python bool and stays JSON-serializable
        passed = bool(abs(icir) >= self.icir_threshold and
                      halflife >= self.halflife_threshold)
        metrics = {
            "ic_mean": round(ic_mean, 4),
            "icir": round(icir, 3),
            "halflife_days": halflife,
            "sharpe": round(sharpe, 2),
            "passed": passed
        }
        if passed:
            feedback = f"APPROVED: ICIR={icir:.2f}, Half-life={halflife}d"
        else:
            if abs(icir) < self.icir_threshold:
                feedback = f"REJECTED: ICIR {icir:.2f} below threshold {self.icir_threshold}"
            else:
                feedback = f"REJECTED: Half-life {halflife}d below threshold {self.halflife_threshold}d"
        return {**metrics, "feedback": feedback, "type": "metrics"}
    def _execute_factor(self, code: str) -> Dict:
        """
        Execute factor code. Return series or error.
        Note: exec() with a custom namespace is NOT a security sandbox;
        run untrusted generated code in an isolated process or container.
        """
        env = {"pd": pd, "np": np, "data": self.data.copy()}
        try:
            exec(code, env)
            fn = env.get("compute_factor")
            if fn is None:
                return {"error": "No compute_factor function defined"}
            # Pass the copy so generated code cannot mutate self.data in place
            result = fn(env["data"])
            if not isinstance(result, pd.Series):
                return {"error": f"Expected pd.Series, got {type(result).__name__}"}
            return {"series": result}
        except Exception as e:
            return {"error": str(e)}
    def _compute_ic(self, factor: pd.Series,
                    forward_col: str = "returns_20d") -> pd.Series:
        """Compute daily Spearman IC between factor and forward returns."""
        if forward_col not in self.data.columns:
            return pd.Series(dtype=float)
        # Dates come from the index itself, or from level 0 of a MultiIndex
        multi = factor.index.nlevels > 1
        dates = (factor.index.get_level_values(0).unique() if multi
                 else factor.index.unique())
        ic_values = []
        for date in dates[:200]:  # Cap for speed
            try:
                f_day = factor.xs(date, level=0) if multi else factor.loc[date]
                r_day = self.data.loc[self.data.index == date, forward_col]
                aligned = pd.concat([f_day, r_day], axis=1).dropna()
                if len(aligned) > 10:
                    ic = aligned.iloc[:, 0].corr(aligned.iloc[:, 1], method="spearman")
                    ic_values.append(ic)
            except Exception:
                continue
        return pd.Series(ic_values, dtype=float)
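    def _compute_halflife(self, factor: pd.Series,
                          max_lag: int = 90) -> int:
        """Compute IC decay half-life in days."""
        base_ic = abs(self._compute_ic(factor).mean())
        if np.isnan(base_ic) or base_ic < 0.001:
            return 0
        original = self.data
        try:
            for lag in range(5, max_lag, 5):
                shifted = original.copy()
                # Row-wise shift: assumes rows are ordered in time for one asset;
                # a multi-asset panel would need a per-asset groupby shift
                shifted["returns_20d"] = original["returns_20d"].shift(-lag)
                # _compute_ic reads self.data, so point it at the lagged copy.
                # Without this, every lag reproduces base_ic and the method
                # always returns max_lag.
                self.data = shifted
                ic_at_lag = abs(self._compute_ic(factor).mean())
                if ic_at_lag < base_ic * 0.5:
                    return lag
        finally:
            self.data = original
        return max_lag
The Three Loops Wired Together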
    # Methods of QuantLoopEngine, continued
    def run_micro_loop(self, hypothesis: Dict) -> Tuple[str, Dict]:
        """
        MICRO LOOP: Generate code, fix syntax errors, max 3 attempts.
        Termination: code executes cleanly OR max attempts reached.
        """
        code = None
        error = None
        for attempt in range(self.max_debug_attempts):
            code = self.act(hypothesis, prior_code=code, prior_error=error)
            test = self._execute_factor(code)
            if not test.get("error"):
                return code, {"status": "executable", "attempts": attempt + 1}
            error = test["error"]
            print(f"    [micro loop] attempt {attempt+1}: {error[:60]}...")
        return code, {"status": "failed_execution", "final_error": error}
    def run_inner_loop(self, hypothesis: Dict) -> Dict:
        """
        INNER LOOP: Generate → debug → test → approve/reject.
        Termination: factor approved OR max attempts reached.
        Resets micro loop each attempt. Keeps factor_history for continuity.
        """
        self.factor_history = []  # Fresh code conversation per factor
        self.state.current_metrics = {}  # Do not carry over the previous factor's metrics
        for attempt in range(self.max_factor_attempts):
            print(f"  [inner loop] attempt {attempt+1}: "
                  f"{hypothesis['hypothesis'][:50]}...")
            # Micro loop: get executable code
            code, micro_result = self.run_micro_loop(hypothesis)
            if micro_result["status"] == "failed_execution":
                print("  [inner loop] code never executed, skipping")
                # Record the real reason so memory does not inherit stale metrics
                self.state.current_metrics = {
                    "passed": False,
                    "type": "execution_error",
                    "feedback": "Code never executed"
                }
                break
            # Observe: run metrics
            observation = self.observe(hypothesis, code)
            self.state.current_metrics = observation
            self.state.current_code = code
            print(f"  [inner loop] {observation.get('feedback', 'no feedback')}")
            if observation.get("passed"):
                return {
                    "status": "approved",
                    "hypothesis": hypothesis["hypothesis"],
                    "code": code,
                    "icir": observation["icir"],
                    "halflife": observation["halflife_days"],
                    "sharpe": observation["sharpe"],
                    "attempts": attempt + 1
                }
            # Not approved: store the feedback on the hypothesis so the next
            # attempt can see it. This is where the inner loop learns within
            # a factor attempt.
            if observation.get("type") == "metrics":
                hypothesis["_feedback"] = observation["feedback"]
        return {
            "status": "rejected",
            "hypothesis": hypothesis["hypothesis"],
            "final_metrics": self.state.current_metrics
        }
    def run_outer_loop(self, domains: Optional[List[str]] = None) -> List[Dict]:
        """
        OUTER LOOP: Mine until target approved factors reached.
        Termination: approved count >= target OR max_iterations.
        Memory consolidation happens here:
        - Failure reasons → failure_patterns (prevent repeating)
        - Success patterns → success_patterns (amplify what works)
        This is the self-improvement mechanism:
        each outer loop iteration the agent is smarter
        than the previous one because it reads accumulated memory.
        """
        domains = domains or [
            "order flow microstructure signals",
            "cross-asset spread divergence",
            "earnings quality and accruals",
            "supply chain network effects",
            "management language and tone shifts",
            "options market implied information"
        ]
        domain_idx = 0
        while (len(self.state.approved_factors) < self.state.target_approved
               and self.state.iteration < self.state.max_iterations
               and not self.state.should_stop):
            self.state.iteration += 1
            domain = domains[domain_idx % len(domains)]
            domain_idx += 1
            print(f"\n[outer loop] iteration {self.state.iteration} | "
                  f"domain: {domain} | "
                  f"approved: {len(self.state.approved_factors)}/"
                  f"{self.state.target_approved}")
            # PERCEIVE: build context from accumulated state
            context = self.perceive()
            # REASON: generate hypothesis using full context
            self.hypothesis_history = []  # Fresh per outer loop
            hypothesis = self.reason(context + f"\n\nDomain focus: {domain}")
            if not hypothesis.get("hypothesis"):
                print("[outer loop] failed to generate hypothesis, continuing...")
                continue
            self.state.current_hypothesis = hypothesis["hypothesis"]
            # ACT + OBSERVE: run inner loop
            result = self.run_inner_loop(hypothesis)
            # Update memory: the self-improvement feedback
            self._update_memory(hypothesis, result)
            if result["status"] == "approved":
                self.state.approved_factors.append(result)
                print(f"[outer loop] APPROVED #{len(self.state.approved_factors)}: "
                      f"ICIR={result['icir']:.2f}, Half-life={result['halflife']}d")
        return self.state.approved_factors
    def _update_memory(self, hypothesis: Dict, result: Dict):
        """
        Memory update: the mechanism that makes each outer loop
        iteration smarter than the last.
        Failed patterns prevent the agent from re-exploring dead ends.
        Successful patterns guide it toward productive territory.
        This is verbal reinforcement learning without gradient updates.
        """
        if result["status"] == "approved":
            pattern = (f"WORKS: {hypothesis.get('mechanism', '')[:80]} "
                       f"→ ICIR {result['icir']:.2f}, "
                       f"Half-life {result['halflife']}d")
            self.state.success_patterns.append(pattern)
        else:
            metrics = result.get("final_metrics", {})
            # Only measured signal failures are remembered. Execution errors and
            # data problems would teach the agent to avoid valid factor classes.
            if metrics.get("type") == "metrics":
                pattern = (f"FAILED: {hypothesis['hypothesis'][:60]} "
                           f"| {metrics.get('feedback', 'unknown reason')[:80]}")
                self.state.failure_patterns.append(pattern)
        self.state.attempted_factors.append({
            "hypothesis": hypothesis["hypothesis"],
            "result": result["status"],
            # Note: this stores the hypothesis text; LoopState does not
            # track the domain string separately
            "domain": self.state.current_hypothesis
        })
The Stop Hook: Preventing Premature Exit
The Stop Hook is the most critical part of loop engineering, and the one most implementations miss. An LLM will stop when it thinks the task is done, not when the task is actually done. The Stop Hook intercepts exit conditions and validates them against hard criteria.
class QuantLoopStopHook:
    """
    Intercepts termination conditions before the loop exits.
    Pattern from loop engineering practice:
    The agent cannot self-terminate. Every exit must pass the Stop Hook.
    If criteria aren't met, the task prompt is re-injected and the loop
    continues.
    In quant research: a factor is not 'done' until ICIR, half-life,
    AND out-of-sample stability all pass. The agent's own confidence
    about quality is irrelevant.
    Note: this class checks count, ICIR, half-life, and pairwise
    correlation; an out-of-sample stability check is not implemented here.
    """

    def __init__(self,
                 min_approved: int = 10,
                 min_icir: float = 0.5,
                 min_halflife: int = 30,
                 max_correlation_between_factors: float = 0.6):
        self.min_approved = min_approved
        self.min_icir = min_icir
        self.min_halflife = min_halflife
        self.max_corr = max_correlation_between_factors

    def check(self, state: LoopState,
              factor_series_dict: Dict[str, pd.Series]) -> Dict:
        """
        Validate all termination criteria.
        Returns: should_stop (bool), reason (str), remediation (str or None)
        """
        # Criterion 1: Enough approved factors
        if len(state.approved_factors) < self.min_approved:
            return {
                "should_stop": False,
                "reason": f"Only {len(state.approved_factors)}/{self.min_approved} approved",
                "remediation": "Continue mining: insufficient alpha coverage"
            }
        # Criterion 2: All approved factors meet quality bar.
        # abs() matches the approval gate in observe(), which accepts
        # strongly negative ICIR as well.
        failing = [
            f for f in state.approved_factors
            if abs(f["icir"]) < self.min_icir or f["halflife"] < self.min_halflife
        ]
        if failing:
            return {
                "should_stop": False,
                "reason": f"{len(failing)} factors below quality threshold",
                "remediation": f"Re-validate or replace: {[f['hypothesis'][:40] for f in failing]}"
            }
        # Criterion 3: Factor diversity (low pairwise correlation)
        if len(factor_series_dict) > 1:
            series_list = list(factor_series_dict.values())
            names = list(factor_series_dict.keys())
            high_corr_pairs = []
            for i in range(len(series_list)):
                for j in range(i + 1, len(series_list)):
                    try:
                        corr = series_list[i].corr(series_list[j])
                        if abs(corr) > self.max_corr:
                            high_corr_pairs.append(
                                f"{names[i][:20]} x {names[j][:20]} = {corr:.2f}"
                            )
                    except Exception:
                        continue
            if high_corr_pairs:
                return {
                    "should_stop": False,
                    "reason": f"High correlation between factors: {high_corr_pairs}",
                    "remediation": "Replace correlated factors with diverse signals"
                }
        # All criteria pass: approved to stop
        return {
            "should_stop": True,
            "reason": f"All criteria met: {len(state.approved_factors)} diverse, "
                      f"high-quality factors approved",
            "remediation": None
        }
Loop vs. Chain vs. Single-Shot: The Results
From AlphaQuant (Yuksel, 2025) and QuantaAlpha (2026) implementations:
Architecture | Factor Hit Rate | Avg ICIR | Avg Half-life | Factors/Day
----------------------|-----------------|----------|---------------|------------
Single-shot prompt | 12% | 0.28 | 14 days | 50
Chain (A→B→C) | 28% | 0.38 | 21 days | 120
Inner loop only | 41% | 0.44 | 28 days | 180
Full 3-tier loop | 61% | 0.57 | 45+ days | 200+
Full loop + Stop Hook | 61%             | 0.61     | 52 days       | 200+
The Stop Hook's contribution is subtle but real: it forces replacement of borderline factors (ICIR 0.51, 31-day half-life) with genuinely strong ones. Without it, the loop exits early with marginal factors. With it, the quality floor rises.
Common Loop Engineering Failures in Quant Contexts
The Confidence Hallucination. The LLM claims the factor has ICIR of 0.08 in the hypothesis spec. The actual measured ICIR is 0.014. Never trust the Reason stage's self-reported quality estimates - always measure in Observe. The Observe stage is the source of truth.
Context Overflow. Long outer loops fill the conversation history. By iteration 30, the agent's context window is full of old failures and it starts repeating them. Solution: trim hypothesis_history to the last 10 messages at each outer-loop iteration. The LoopState handles long-term memory; the conversation history only needs to hold short-term context.
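The trimming step might look like the sketch below. `trim_history` is a hypothetical helper, not part of the engine above, and it assumes messages are dicts with a `role` key, as in the histories used in this article. It also drops a leading assistant message, on the assumption that the chat API expects a conversation to open with a user turn.

```python
from typing import Dict, List


def trim_history(history: List[Dict[str, str]], keep: int = 10) -> List[Dict[str, str]]:
    """Keep at most `keep` of the most recent messages, opening on a user turn."""
    trimmed = history[-keep:] if keep > 0 else []
    # Drop leading assistant messages so the window starts with a user message
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed


if __name__ == "__main__":
    # Seven user/assistant exchanges: 14 messages in total
    history = []
    for i in range(7):
        history.append({"role": "user", "content": f"question {i}"})
        history.append({"role": "assistant", "content": f"answer {i}"})
    window = trim_history(history, keep=10)
    print(len(window), window[0]["role"])
```

The function returns a new list rather than mutating its input, so the full history can still be logged for inspection while only the window is sent to the model.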
Memory Poisoning. If a bad backtest result (data error, not a real signal failure) enters failure_patterns, the agent permanently avoids an entire factor class. Validate every rejected factor's rejection reason before writing to memory. Code execution errors should never enter failure_patterns.
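One way to express that validation is a small gate in front of the memory write. This is a sketch under assumptions: `is_signal_failure` and `remember_failure` are hypothetical helpers, and the `type` values (`metrics`, `execution_error`, `insufficient_data`) follow the observation dicts returned by the Observe stage in this article.

```python
from typing import Dict, List

# Observation types that describe tooling or data problems, not signal quality
NON_SIGNAL_TYPES = {"execution_error", "insufficient_data"}


def is_signal_failure(observation: Dict) -> bool:
    """True only when a factor ran, was measured, and failed on its metrics."""
    if observation.get("type") in NON_SIGNAL_TYPES:
        return False
    return observation.get("type") == "metrics" and not observation.get("passed", False)


def remember_failure(failure_patterns: List[str], hypothesis: str,
                     observation: Dict) -> bool:
    """Append to memory only for genuine signal failures. Returns True if recorded."""
    if not is_signal_failure(observation):
        return False
    feedback = observation.get("feedback", "unknown reason")
    failure_patterns.append(f"FAILED: {hypothesis[:60]} | {feedback[:80]}")
    return True


if __name__ == "__main__":
    memory: List[str] = []
    remember_failure(memory, "Volume spike reversal",
                     {"type": "execution_error", "passed": False, "feedback": "KeyError"})
    remember_failure(memory, "Accrual quality drift",
                     {"type": "metrics", "passed": False, "feedback": "ICIR 0.12 below 0.5"})
    print(memory)
```

An observation with an unknown or missing type is treated as not recordable, so the default errs on the side of keeping memory clean.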
Tight Micro Loop Infinite Loops. An agent that never successfully produces clean code for a given hypothesis will cycle the micro loop forever. Hard cap at 3 debug attempts. If code never runs, skip the hypothesis entirely - this protects the outer loop.
Missing Termination Semantics. The outer loop must define what "done" means before starting. "Find 10 factors" is not enough. "Find 10 factors with ICIR > 0.5, half-life > 30 days, and pairwise correlation < 0.6" is a Stop Hook criterion. Vague termination = the loop runs forever or exits too early.
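Writing "done" down as data before the loop starts makes the criteria hard to fudge later. The sketch below restates the sentence above as a frozen dataclass plus a checker that returns the reasons the loop may not stop. `DoneCriteria`, `unmet`, the `name` key, and the one-column-per-factor `factor_values` frame are assumptions for illustration; the strict inequalities follow the wording above (ICIR > 0.5, half-life > 30 days, correlation < 0.6).

```python
from dataclasses import dataclass
from typing import Dict, List

import numpy as np
import pandas as pd


@dataclass(frozen=True)
class DoneCriteria:
    """Hypothetical container: the definition of done, fixed before the loop runs."""
    min_factors: int = 10
    min_icir: float = 0.5
    min_halflife_days: int = 30
    max_pairwise_corr: float = 0.6


def unmet(criteria: DoneCriteria, factors: List[Dict],
          factor_values: pd.DataFrame) -> List[str]:
    """Return the reasons the loop may not stop yet; an empty list means done."""
    reasons = []
    if len(factors) < criteria.min_factors:
        reasons.append(f"only {len(factors)}/{criteria.min_factors} factors")
    weak = [f["name"] for f in factors
            if abs(f["icir"]) <= criteria.min_icir
            or f["halflife"] <= criteria.min_halflife_days]
    if weak:
        reasons.append(f"below quality bar: {weak}")
    if factor_values.shape[1] > 1:
        corr = factor_values.corr().abs().to_numpy()
        # Only the strict upper triangle holds distinct factor pairs
        pairs = corr[np.triu_indices_from(corr, k=1)]
        worst = float(np.nanmax(pairs))
        if worst >= criteria.max_pairwise_corr:
            reasons.append(f"max pairwise correlation {worst:.2f}")
    return reasons


if __name__ == "__main__":
    factors = [{"name": "a", "icir": 0.7, "halflife": 40},
               {"name": "b", "icir": 0.6, "halflife": 35}]
    values = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, -1.0, -1.0, 1.0]})
    print(unmet(DoneCriteria(min_factors=2), factors, values))
```

Returning a list of reasons instead of a single boolean gives the loop something concrete to re-inject as the next task prompt.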
The Bottom Line
Loop engineering is not a new tool. It's a new way of thinking about what you're building. You're not writing a prompt. You're designing a system that decides what to prompt, observes whether it worked, and iterates until the job is genuinely done.
For quants, this reframes research infrastructure. The quant's job is to define the quality criteria - ICIR thresholds, half-life requirements, correlation limits, domain constraints. The loop's job is to explore the factor space until those criteria are satisfied. The Stop Hook enforces that the loop does not exit on the agent's opinion. It exits on your criteria.
In 2022, the leverage was in writing the perfect alpha factor. In 2025, it moved to writing the perfect prompt. In 2026, it lives in designing the loop that writes and validates the factors for you.
The loop is the new unit of quant research.
