Run Management & Reproducibility¶
Versifai agents don't just produce answers; they produce auditable artifacts that show their work. Every SQL query, every statistical test, every chart, every decision is recorded to disk. If you remove the AI agent from the process, a human can follow the exact same reasoning path and verify every conclusion.
This page explains what agents output, how runs are isolated and resumed, and how knowledge flows between agents.
The Core Idea¶
Most AI systems work like this: Input → LLM → Answer. The answer is the only output. If something goes wrong, you start over.
Versifai agents run in a ReAct loop: a cycle of reasoning and tool calls. At each step in the loop, the agent produces durable artifacts on disk. The answer is one output among many:
flowchart TD
PROMPT["Prompt + config<br>injected into agent"] --> REASON["LLM reasons<br>about next step"]
REASON --> TOOL{"Agent calls<br>a tool"}
TOOL -->|execute_sql| SQL["SQL runs on Spark<br>→ results returned"]
TOOL -->|save_note| NOTE["Reasoning + SQL<br>→ written to notes/ file"]
TOOL -->|create_visualization| CHART["Chart rendered<br>→ saved to charts/"]
TOOL -->|save_finding| FINDING["Structured evidence<br>→ appended to findings.json"]
TOOL -->|write_narrative| SECTION["Section draft<br>→ saved to section_*.md"]
SQL --> BACK["Tool result returned<br>to LLM"]
NOTE --> BACK
CHART --> BACK
FINDING --> BACK
SECTION --> BACK
BACK --> STATE["RunState updated<br>→ run_metadata.json"]
STATE --> DONE{Done?}
DONE -->|No| REASON
DONE -->|Yes| FINAL["Final output +<br>all artifacts on disk"]
style PROMPT fill:#e8f0fe,stroke:#4a6f93
style REASON fill:#fff8e1,stroke:#b38600
style NOTE fill:#f0f0f0,stroke:#999
style CHART fill:#f0f0f0,stroke:#999
style FINDING fill:#e8f4e8,stroke:#4a8a4a
style SECTION fill:#f0f0f0,stroke:#999
style STATE fill:#e8f0fe,stroke:#4a6f93
style FINAL fill:#e8f4e8,stroke:#4a8a4a
Every cycle through the loop leaves something on disk. If the agent crashes at any point, the artifacts from completed cycles are already persisted — and the agent can resume from where it left off.
Why this matters:
- **Reproducibility**: A human can re-run any SQL query, re-generate any chart, and verify any statistical claim without the LLM
- **Resumability**: If the agent crashes after completing theme 4 of 7, it restarts at theme 5 with full knowledge of what was already done
- **Auditability**: Every decision has a timestamp, every chart has its source data, every finding has its p-value and methodology
- **Inter-agent handoff**: The StoryTeller reads the Scientist's structured outputs, not raw LLM text
Run Isolation¶
Every agent execution gets its own run directory: a timestamped folder that contains all outputs for that run. This means you can run the same agent multiple times without overwriting previous results.
Run ID Format¶
20260223_143012_a1b2
│        │      │
│        │      └── Random suffix (4 hex chars)
│        └── Time: 14:30:12
└── Date: 2026-02-23

Run IDs are lexicographically sortable: alphabetical order is chronological order. This makes "find the latest run" trivial.
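A quick illustration, using the run IDs from the directory diagram in the next section:

# Lexicographic sort = chronological sort: no date parsing needed.
run_ids = [
    "20260223_100000_f4e1",
    "20260225_091500_c7d3",
    "20260223_143012_a1b2",
]
assert sorted(run_ids)[-1] == "20260225_091500_c7d3"  # the latest run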
Directory Structure¶
flowchart LR
VOL["/Volumes/.../results/runs/"] --> RUN1["20260223_100000_f4e1/"]
VOL --> RUN2["20260223_143012_a1b2/"]
VOL --> RUN3["20260225_091500_c7d3/"]
RUN1 -.- S1(["interrupted"])
RUN2 -.- S2(["completed"])
RUN3 -.- S3(["in progress"])
style RUN1 fill:#fee,stroke:#c33
style RUN2 fill:#efe,stroke:#3a3
style RUN3 fill:#ffe,stroke:#b80
style S1 fill:#fee,stroke:#c33
style S2 fill:#efe,stroke:#3a3
style S3 fill:#ffe,stroke:#b80
Each run directory is self-contained:
runs/20260223_143012_a1b2/
├── run_metadata.json # Run state, timing, completion status
├── findings.json # Structured research findings (Scientist)
├── narrative_report.md # Final assembled report (StoryTeller)
├── section_intro.md # Individual narrative sections (StoryTeller)
├── section_analysis.md
├── charts/ # Visualization outputs
│ ├── theme0_distribution.png
│ ├── theme1_scatter.png
│ └── theme5_precision_recall.png
├── tables/ # CSV result tables
│ ├── correlation_matrix.csv
│ └── model_comparison.csv
└── notes/ # Per-theme reasoning logs
├── theme_0_notes.txt
├── theme_1_notes.txt
└── silver_notes.txt
What Each Agent Outputs¶
Data Engineer¶
The Data Engineer's outputs are Delta tables in Unity Catalog, not files on disk. It creates clean, validated, joinable tables from raw data files.
flowchart LR
RAW[/"6 raw files<br>(CSV, ZIP, Excel)"/] --> ENG["Data Engineer<br>profile → design → load"]
ENG --> T1[("silver_daily_weather")]
ENG --> T2[("silver_quack_frequency")]
ENG --> T3[("silver_feather_fluffing")]
ENG --> T4[("silver_ice_cream_sales")]
style RAW fill:#fff8e1,stroke:#b38600
style ENG fill:#e8f0fe,stroke:#4a6f93
style T1 fill:#e8f4e8,stroke:#4a8a4a
style T2 fill:#e8f4e8,stroke:#4a8a4a
style T3 fill:#e8f4e8,stroke:#4a8a4a
style T4 fill:#e8f4e8,stroke:#4a8a4a
Every table includes:
| Column | Purpose |
|---|---|
| `source_file_name` | Which raw file this row came from (audit trail) |
| `load_timestamp` | When this row was loaded (reproducibility) |
| Join key (e.g., `station_id`) | Links this table to every other table |
The `source_file_name` column is what enables smart resume: on re-run, the agent queries the table to see which files are already loaded and skips them.
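A minimal sketch of that check, assuming a live `spark` session and a `raw_files` list of candidate inputs (the table name is one of the silver tables shown above):

# Ask the table which raw files it already contains, then load only the rest.
loaded = {
    row.source_file_name
    for row in spark.sql(
        "SELECT DISTINCT source_file_name FROM silver_daily_weather"
    ).collect()
}
files_to_load = [f for f in raw_files if f not in loaded]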
Data Scientist¶
The Data Scientist produces the richest set of artifacts:
flowchart LR
SCI["Data Scientist<br>ReAct loop"] --> META["run_metadata.json<br>phases + progress"]
SCI --> FIND["findings.json<br>14 structured findings"]
SCI --> CHARTS[/"charts/<br>9 visualizations"/]
SCI --> TABLES[/"tables/<br>6 CSV summaries"/]
SCI --> NOTES[/"notes/<br>per-theme reasoning"/]
style SCI fill:#e8f0fe,stroke:#4a6f93
style FIND fill:#e8f4e8,stroke:#4a8a4a
style META fill:#f0f0f0,stroke:#999
style CHARTS fill:#fef3e0,stroke:#b38600
style TABLES fill:#f0f0f0,stroke:#999
style NOTES fill:#f0f0f0,stroke:#999
findings.json: Structured Evidence¶
Every statistical finding the agent discovers is saved as a structured record. This is the primary handoff artifact to the StoryTeller.
[
  {
    "research_question_id": "theme_1",
    "title": "Quack-Rain Correlation Confirmed",
    "finding": "Quack frequency at lag-1 shows Spearman rho=0.42 with next-day precipitation.",
    "evidence": "Spearman rank correlation: rho=0.42, p<0.001, n=312,847. Effect persists across all seasons (summer rho=0.51, winter rho=0.28). Partial correlation controlling for temperature: rho=0.38, p<0.001.",
    "significance": "high",
    "visualization_path": "charts/theme1_lag_correlation.png",
    "timestamp": "2026-02-23T14:35:22.123456",
    "index": 3
  }
]
| Field | Purpose |
|---|---|
| `research_question_id` | Links the finding to the research theme that produced it |
| `title` | One-line summary for the StoryTeller to reference |
| `finding` | The actual discovery in plain language |
| `evidence` | Statistical backing: p-values, effect sizes, sample sizes, methods |
| `significance` | `high` / `medium` / `low`; drives evidence tier classification |
| `visualization_path` | Chart that supports this finding |
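To make the append-only contract concrete, here is what a `save_finding` helper could look like. This is an illustrative sketch, not the actual tool implementation:

import datetime
import json
import os

def save_finding(run_dir: str, record: dict) -> None:
    """Append one structured finding to findings.json (illustrative)."""
    path = os.path.join(run_dir, "findings.json")
    findings = []
    if os.path.exists(path):
        with open(path) as f:
            findings = json.load(f)
    # Stamp the record the same way the example above is stamped.
    record["timestamp"] = datetime.datetime.now().isoformat()
    record["index"] = len(findings)
    findings.append(record)
    with open(path, "w") as f:
        json.dump(findings, f, indent=2)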
Notes: The Reasoning Log¶
Per-theme notes files are the most important artifact for reproducibility. They record everything the agent did and why: every SQL query, every statistical test, every decision.
--- 2026-02-23 14:25:33 ---
Starting Theme 1: Quack Before the Storm.
Research question: Is there a significant correlation between quack
frequency and next-day rain?
Required table: silver_weather_duck_daily
Available columns: station_id, observation_date, quack_count,
temp_max_c, temp_min_c, precip_mm, fluff_intensity
--- 2026-02-23 14:27:15 ---
Step 1: Computing lag-1 correlation.
SQL used:
SELECT station_id, observation_date, quack_count,
       LEAD(precip_mm, 1) OVER (
         PARTITION BY station_id ORDER BY observation_date
       ) AS next_day_precip
FROM silver_weather_duck_daily
WHERE quack_count IS NOT NULL
Result: 312,847 rows with complete lag-1 pairs.
--- 2026-02-23 14:28:40 ---
Spearman correlation: rho = 0.42, p < 0.001
Pearson correlation: r = 0.35, p < 0.001
(Spearman higher → relationship is monotonic but not strictly linear)
--- 2026-02-23 14:30:12 | CHART: theme1_lag_correlation.png ---
Type: line
Title: Lag-Correlation: Quack Frequency vs. Precipitation
Path: /Volumes/.../charts/theme1_lag_correlation.png
SQL Query:
SELECT lag_days, season,
       CORR(quack_count, precip_at_lag) AS correlation
FROM lagged_quack_precip
GROUP BY lag_days, season
Interpretation: The lag-1 peak is consistent across all seasons.
Summer shows the strongest signal (rho=0.51), winter the weakest (rho=0.28).
--- 2026-02-23 14:32:55 ---
Step 3: Checking for temperature confound.
Partial correlation controlling for temp_max_c: rho = 0.38, p < 0.001.
Temperature explains some variance but the quack signal persists.
Why notes matter
If you read a notes file top to bottom, you get a complete lab notebook of everything the agent did for that theme. Every SQL query is there. Every statistical result is there. You can re-run any of it manually without the AI.
Chart Metadata¶
Every chart saved to charts/ also has its metadata logged to the notes file.
This includes the SQL query that produced the data, the render code, and a
sample of the input data. A human can regenerate any chart from its metadata.
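As an illustration, the theme 1 chart could be rebuilt from its logged SQL alone. This is a sketch assuming a live `spark` session; the render code here is illustrative rather than the agent's exact code:

import matplotlib.pyplot as plt

# Re-run the SQL recorded in the notes file...
sql = """
    SELECT lag_days, season,
           CORR(quack_count, precip_at_lag) AS correlation
    FROM lagged_quack_precip
    GROUP BY lag_days, season
"""
df = spark.sql(sql).toPandas()

# ...then re-render: one line per season, as in the original chart.
for season, grp in df.groupby("season"):
    grp = grp.sort_values("lag_days")
    plt.plot(grp["lag_days"], grp["correlation"], label=season)
plt.legend()
plt.xlabel("lag_days")
plt.ylabel("correlation")
plt.title("Lag-Correlation: Quack Frequency vs. Precipitation")
plt.savefig("theme1_lag_correlation.png")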
StoryTeller¶
The StoryTeller produces narrative sections and a final assembled report:
flowchart LR
IN[/"findings.json<br>+ charts + notes"/] --> ST["StoryTeller<br>ReAct loop"]
ST -->|"per section"| SEC[/"section_*.md<br>(8 files, saved immediately)"/]
SEC --> ASM["Assemble + coherence pass"]
ASM --> RPT["narrative_report.md<br>~4,000 words with TOC<br>+ bibliography"]
style IN fill:#fff8e1,stroke:#b38600
style ST fill:#e8f0fe,stroke:#4a6f93
style SEC fill:#fef3e0,stroke:#b38600
style RPT fill:#e8f4e8,stroke:#4a8a4a
Each section is persisted to disk immediately after the agent writes it. If the agent crashes after writing 5 of 8 sections, those 5 sections are already on disk and will be loaded on resume.
Smart Resume¶
The most practical feature of the run management system: agents detect what's already been done and skip it.
How It Works¶
flowchart TD
START([Agent starts]) --> CHECK["Load run_metadata.json<br>from latest run"]
CHECK --> SCAN["Scan for completed work:<br>• Tables in Unity Catalog<br>• Findings in findings.json<br>• Sections on disk<br>• RunState completed_items"]
SCAN --> UNION["Union rule:<br>If ANY source says done → skip"]
UNION --> RESUME["Resume from first<br>incomplete item"]
style START fill:#e8f0fe,stroke:#4a6f93
style SCAN fill:#fef3e0,stroke:#b38600
style RESUME fill:#e8f4e8,stroke:#4a8a4a
Data Scientist Resume¶
The scientist checks three sources to determine what's complete:
| Source | What It Checks | Example |
|---|---|---|
| Unity Catalog | Do the silver tables exist? | silver_weather_duck_daily in catalog → silver phase done |
| findings.json | Which themes have findings? | Finding with research_question_id: "theme_1" → theme 1 done |
| RunState | What did the run state record? | completed_items.themes: ["theme_0", "theme_1"] |
The system uses a union rule: if any of these three sources says a piece of work is done, it's treated as done.
# Pseudocode from _scan_pipeline_state()
completed_silver = set()

# Check 1: Does the table exist in the catalog?
for dataset in config.silver_datasets:
    if dataset.name in catalog_tables:
        completed_silver.add(dataset.name)

# Check 2: Does the run state say it's done?
for name in run_state.completed_items.get("silver", []):
    completed_silver.add(name)

# Union: skip anything in completed_silver
StoryTeller Resume¶
The storyteller checks for section files on disk:
import os

# Scan the output directory for existing sections
completed = {}
for filename in os.listdir(output_path):
    if filename.startswith("section_") and filename.endswith(".md"):
        # This section exists: load it and skip it on resume
        section_id = filename[len("section_"):-len(".md")]
        with open(os.path.join(output_path, filename)) as f:
            completed[section_id] = f.read()
This is why sections are persisted immediately: each one is a durable checkpoint.
What Resume Looks Like¶
$ python run_scientist.py

Phase 1: Orientation - SKIPPED (already completed)

Phase 2: Silver Construction
├── silver_weather_duck_daily - SKIPPED (exists in catalog)
├── silver_duck_forecast_comparison - SKIPPED (exists in catalog)
└── silver_ice_cream_weather - RUNNING...

Phase 3: Theme Analysis
├── Theme 0: The Quack Census - SKIPPED (findings exist)
├── Theme 1: Quack Before the Storm - SKIPPED (findings exist)
├── Theme 2: The Fluff Factor - SKIPPED (findings exist)
├── Theme 3: The Ice Cream Confounder - SKIPPED (findings exist)
├── Theme 4: V-Formation Tornado Warning - RUNNING...   ← picks up here
...
Run State Tracking¶
The RunState dataclass tracks exactly where the agent is in its pipeline.
It's persisted to run_metadata.json and updated after every completed item.
{
  "run_id": "20260223_143012_a1b2",
  "agent_type": "scientist",
  "config_name": "global_development",
  "started_at": "2026-02-23T14:00:00",
  "state": {
    "status": "running",
    "entry_point": "run",
    "current_phase": "theme_analysis",
    "current_item": "theme_4",
    "completed_phases": ["orientation", "silver"],
    "completed_items": {
      "silver": ["silver_development_panel", "silver_development_recent", "silver_development_long_run"],
      "themes": ["theme_0", "theme_1", "theme_2", "theme_3"]
    },
    "updated_at": "2026-02-23T15:12:33"
  }
}
| Field | Purpose |
|---|---|
| `status` | `running` / `completed` / `failed` / `interrupted` |
| `current_phase` | What the agent is working on right now |
| `current_item` | The specific item within the phase |
| `completed_phases` | Phases that are fully done |
| `completed_items` | Per-phase list of completed items (for partial progress) |
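A minimal sketch of what the dataclass could look like. Field names mirror the JSON above; this is not the library's actual definition:

from dataclasses import dataclass, field

@dataclass
class RunState:
    status: str = "running"
    entry_point: str = "run"
    current_phase: str = ""
    current_item: str = ""
    completed_phases: list = field(default_factory=list)
    completed_items: dict = field(default_factory=dict)
    updated_at: str = ""

    def mark_item_done(self, phase: str, item: str) -> None:
        # Record partial progress so a crash never loses a completed item.
        self.completed_items.setdefault(phase, []).append(item)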
Carryover Context¶
When an agent moves between phases, its conversation history is cleared to save tokens, but key knowledge is preserved via carryover context.
flowchart LR
P1["Phase 1<br>Orientation"] -->|"extract summaries<br>+ key decisions"| CTX["Carryover<br>Context"]
CTX -->|"clear history,<br>inject context"| P2["Phase 2<br>Silver Construction"]
P2 -->|"extract summaries<br>+ key decisions"| CTX2["Carryover<br>Context"]
CTX2 -->|"clear history,<br>inject context"| P3["Phase 3<br>Theme Analysis"]
style P1 fill:#e8f0fe,stroke:#4a6f93
style CTX fill:#fee,stroke:#c33
style P2 fill:#e8f0fe,stroke:#4a6f93
style CTX2 fill:#fee,stroke:#c33
style P3 fill:#e8f0fe,stroke:#4a6f93
The AgentMemory class manages this:
- `reset_for_new_source()`: Clears conversation history but preserves source summaries, context notes, and decisions
- `get_carryover_context()`: Builds a markdown summary of everything learned so far (last 10 notes, all source summaries)
- `log_source_summary()`: Records a one-line summary of what was done for a source/phase
This means the agent in phase 3 knows what happened in phases 1 and 2 — without carrying 200 messages of conversation history.
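An illustrative sketch of how `get_carryover_context()` might assemble that summary (the real `AgentMemory` internals may differ):

def get_carryover_context(source_summaries: dict, context_notes: list) -> str:
    """Build the markdown block injected into the next phase (illustrative)."""
    lines = ["## Carryover context", "", "### Source summaries"]
    for name, summary in source_summaries.items():
        lines.append(f"- {name}: {summary}")
    lines += ["", "### Recent notes"]
    for note in context_notes[-10:]:  # last 10 notes, per the contract above
        lines.append(f"- {note}")
    return "\n".join(lines)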
Inter-Agent Handoff¶
The three agents don't communicate directly. They communicate through structured artifacts on disk.
flowchart LR
ENG["Data Engineer"] -->|"implicit handoff<br>via Unity Catalog"| SCI["Data Scientist"]
SCI -->|"explicit handoff<br>via AgentDependency"| STORY["StoryTeller"]
ENG -.->|"Delta tables"| CAT[("Unity Catalog")]
CAT -.->|"SQL queries"| SCI
SCI -.->|"findings + charts + notes"| DISK[/"Run directory"/]
DISK -.->|"resolve latest run"| STORY
style ENG fill:#e8f0fe,stroke:#4a6f93
style SCI fill:#e8f4e8,stroke:#4a8a4a
style STORY fill:#fef3e0,stroke:#b38600
style CAT fill:#f0f0f0,stroke:#999
style DISK fill:#f0f0f0,stroke:#999
Engineer → Scientist¶
The handoff is implicit: both agents point to the same Unity Catalog schema. The scientist runs `list_catalog_tables` and sees the tables the engineer created.
Scientist → StoryTeller¶
The handoff is explicit via AgentDependency:
from versifai.core.run_manager import AgentDependency

# StoryTeller config declares where to find scientist outputs
dependency = AgentDependency(
    agent_type="scientist",
    config_name="global_development",
    base_path="/Volumes/.../results",
    run_id=""  # Empty = use latest run
)
The dependency resolver finds the latest scientist run and returns its path.
The StoryTeller then reads findings.json, scans charts/, tables/, and
notes/ from that directory.
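A sketch of what that resolution could look like. The directory layout here is an assumption based on the run structure above, not the resolver's actual code:

import os

def resolve_run_path(dep) -> str:
    """Return the run directory a dependency points at (illustrative)."""
    runs_dir = os.path.join(dep.base_path, "runs")  # layout assumed
    run_id = dep.run_id or max(os.listdir(runs_dir))  # empty run_id = use latest
    return os.path.join(runs_dir, run_id)

With `run_id` left empty, the lexicographically greatest directory name wins, which by the run ID format is always the most recent run.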
The Reproducibility Contract¶
Every artifact the system produces can be verified by a human without the LLM:
| Artifact | How to Verify |
|---|---|
| Delta table | Run the CREATE TABLE SQL from the schema designer's output |
| Statistical finding | Re-run the SQL query from the notes file, feed results to scipy/statsmodels |
| Chart | Re-run the SQL query + render code from the notes file |
| Narrative claim | Check the finding it cites → check the evidence → check the SQL |
| Evidence tier | Compare the p-value and effect size against the tier criteria |
The chain is always: Narrative claim → Finding → Evidence → SQL query → Raw data.
Every link in that chain is recorded as an artifact on disk.
flowchart RL
CLAIM["Narrative claim:<br>'Ducks quack more<br>before rain'"]
FINDING["Finding:<br>Spearman rho=0.42<br>p < 0.001"]
EVIDENCE["Evidence:<br>Notes file with full<br>SQL + methodology"]
SQL["SQL query:<br>SELECT quack_count,<br>LEAD(precip_mm, 1)..."]
DATA["Raw data:<br>silver_weather_duck_daily<br>312,847 rows"]
CLAIM --> FINDING
FINDING --> EVIDENCE
EVIDENCE --> SQL
SQL --> DATA
style CLAIM fill:#fef3e0,stroke:#b38600
style FINDING fill:#e8f4e8,stroke:#4a8a4a
style EVIDENCE fill:#e8f0fe,stroke:#4a6f93
style SQL fill:#f0f0f0,stroke:#999
style DATA fill:#f0f0f0,stroke:#999
This is the core design principle: the AI is there to create artifacts, not to be the artifact. Remove the AI and the work still stands.