A Survey on LLMs in Scientific Discovery
The next step for AI agents is scientific discovery.
This is a great paper summarizing trends and the future.
Here are my notes:

What's the paper about?
This paper presents a conceptual framework to understand the evolving role of LLMs in scientific discovery, emphasizing their progression from task-specific tools to autonomous scientific agents.
Anchored in the stages of the scientific method, the survey proposes a three-level taxonomy, LLM as Tool, Analyst, and Scientist, and categorizes over 90 research works accordingly.

Three Levels of Autonomy:
Tool (Level 1): LLMs automate discrete tasks (e.g., literature summarization, code snippets) with direct human supervision.
Analyst (Level 2): LLMs independently handle analytical workflows, such as statistical modeling or symbolic regression, requiring less human intervention.
Scientist (Level 3): LLMs autonomously conduct multi-stage research cycles, including hypothesis generation, experimentation, and refinement, with minimal human input.
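One way to make the three levels concrete is a small classification sketch. This is only an illustration of the taxonomy as described in these notes, not code from the paper: the names `AutonomyLevel`, `SystemProfile`, `chains_tasks`, `runs_full_cycle`, and `classify` are all hypothetical, and the two boolean flags are a simplifying assumption about what separates the levels.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """The survey's three levels, ordered by increasing autonomy."""
    TOOL = 1       # discrete tasks under direct human supervision
    ANALYST = 2    # analytical workflows with less human intervention
    SCIENTIST = 3  # multi-stage research cycles with minimal human input


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical description of a system (fields are assumptions)."""
    name: str
    chains_tasks: bool     # handles multi-step analytical workflows on its own
    runs_full_cycle: bool  # hypothesis generation, experimentation, refinement


def classify(profile: SystemProfile) -> AutonomyLevel:
    """Map a profile to a level, checking the most autonomous case first."""
    if profile.runs_full_cycle:
        return AutonomyLevel.SCIENTIST
    if profile.chains_tasks:
        return AutonomyLevel.ANALYST
    return AutonomyLevel.TOOL


if __name__ == "__main__":
    summarizer = SystemProfile("literature summarizer", False, False)
    print(summarizer.name, "->", classify(summarizer).name)
```

Using an `IntEnum` keeps the levels comparable, so a check such as `classify(p) >= AutonomyLevel.ANALYST` reads naturally. Real systems will not always fit two flags, so treat this as a reading aid.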

Mapping to the Scientific Method
The paper maps LLM applications to all six stages of the scientific method (e.g., hypothesis generation, data analysis, conclusion). The table shows a detailed breakdown of Level 1 works by task and domain.
Characteristics of Level 1 systems include:
- Operate with explicit prompts and limited autonomy
- Enhance researcher productivity in discrete tasks
- Produce outputs that generally require human integration and validation

Level 2
Here is the comparison and classification of Level 2 research works in LLM-based scientific discovery.
These are autonomous analytical agents that execute goal-oriented tasks with moderate human oversight.
Characteristics include:
- Capable of multi-step reasoning and data modeling
- Manage sequences of tasks (e.g., analyzing experiments, refining models)
- Require humans mainly for goal definition and result validation

Level 3
Notable Level 3 systems include The AI Scientist, Agent Laboratory, and Zochi, which demonstrate autonomous literature review, idea development, experimentation, and report generation.
These systems often use agentic workflows and multi-agent feedback loops.
Unlike Level 2 systems, which require humans to define tasks or validate outputs, Level 3 systems may start from broad prompts or even operate autonomously within a domain, with human involvement limited to high-level oversight or quality control.

Challenges and Future Directions
The authors highlight key challenges for advancing LLM-based science:
- enabling fully autonomous research cycles
- integrating robotic automation for physical experiments
- achieving transparent and interpretable reasoning
- ensuring continuous self-improvement
- addressing ethical governance and societal alignment
This paper has a comprehensive set of related works for further reading if anyone is interested in specific domains.
Paper: arxiv.org/abs/2505.13259
