Tutorial: World Development Indicators¶
This tutorial builds a complete Versifai project using real World Bank data: 6 development indicators covering 217 countries from 1960 to 2023. By the end, three AI agents will have ingested raw ZIP archives, run statistical analysis on classic development economics questions, and produced a narrative research report.
The dataset is real. The analysis is real. The output is a genuine research artifact, not a toy example.
What You're About to Build¶
Three agents run in sequence, each picking up where the last one left off:
| Stage | Notebook | Agent | Output |
|---|---|---|---|
| 0. Download | 00_download_data.py | (script) | 6 ZIP files in Volume |
| 1. Ingest | 01_run_engineer.py | DataEngineerAgent | 7 Delta tables in Unity Catalog |
| 2. Analyze | 02_run_scientist.py | DataScientistAgent | findings.json, charts/, tables/ |
| 3. Narrate | 03_run_storyteller.py | StoryTellerAgent | world_development_report.md |
All example files live in examples/world_development/.
The Dataset¶
Source: World Bank Open Data API (free, authoritative, no authentication required).
| Code | Indicator | Table |
|---|---|---|
| NY.GDP.PCAP.CD | GDP per capita (current US$) | silver_gdp_per_capita |
| SP.DYN.LE00.IN | Life expectancy at birth (years) | silver_life_expectancy |
| SE.PRM.NENR | School enrollment, primary (% net) | silver_school_enrollment |
| SH.XPD.CHEX.PC.CD | Health expenditure per capita (US$) | silver_health_expenditure |
| EN.ATM.CO2E.PC | CO2 emissions (metric tons per capita) | silver_co2_emissions |
| SP.POP.TOTL | Population total | silver_population |

Plus silver_country_metadata: region and income group classifications for every country.
Each indicator is downloaded as a ZIP archive (roughly 2-3 MB each, under 20 MB in total) containing wide-format CSVs with years as columns. This is a real engineering challenge: the Data Engineer agent must extract ZIPs, skip metadata rows, pivot wide-to-long, and standardize column names.
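To make the wide-to-long step concrete, here is a minimal standard-library sketch. It is not the agent's implementation. The sample text, the `wide_to_long` helper, and the choice to drop empty cells instead of emitting NULL rows are all assumptions made for illustration, and the sample values are made up.

```python
import csv

# Made-up sample in the layout described above: 4 metadata rows, then a
# header with one column per year. Values are illustrative, not real data.
SAMPLE_CSV = "\n".join([
    '"Data Source","Example source",',
    '',
    '"Last Updated Date","YYYY-MM-DD",',
    '',
    '"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961","1962",',
    '"Country A","AAA","Example indicator","EX.IND","1.0","","3.0",',
    '"Country B","BBB","Example indicator","EX.IND","","2.0","",',
])


def wide_to_long(text, value_column, skip_rows=4):
    """Melt a year-per-column CSV into long (country, year, value) rows."""
    lines = text.splitlines()[skip_rows:]      # drop the metadata rows
    reader = csv.reader(lines)
    header = next(reader)
    # Year columns are the header cells that are purely numeric.
    year_columns = [(i, int(col)) for i, col in enumerate(header) if col.strip().isdigit()]
    name_i = header.index("Country Name")
    code_i = header.index("Country Code")

    rows = []
    for record in reader:
        if not record:                          # skip blank lines
            continue
        for i, year in year_columns:
            raw = record[i].strip() if i < len(record) else ""
            if raw == "":                       # assumption: missing cells yield no row
                continue
            rows.append({
                "country_name": record[name_i],
                "country_code": record[code_i],
                "year": year,
                value_column: float(raw),
            })
    return rows


long_rows = wide_to_long(SAMPLE_CSV, "gdp_per_capita")
```

Each output row carries one observation, so the grain becomes (country_code, year), which is what the later joins rely on.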
Stage 0: Download the Data¶
File: examples/world_development/notebooks/00_download_data.py
This notebook downloads all 6 indicators from the World Bank API and saves them to a Databricks Volume:
import urllib.request

# World Bank indicator code -> short name used for the ZIP file
INDICATORS = {
    "NY.GDP.PCAP.CD": "gdp_per_capita",
    "SP.DYN.LE00.IN": "life_expectancy",
    "SE.PRM.NENR": "school_enrollment",
    "SH.XPD.CHEX.PC.CD": "health_expenditure",
    "EN.ATM.CO2E.PC": "co2_emissions",
    "SP.POP.TOTL": "population",
}

# Fetch each indicator's CSV bundle (delivered as a ZIP) into the Volume
for code, name in INDICATORS.items():
    url = f"https://api.worldbank.org/v2/en/indicator/{code}?downloadformat=csv"
    urllib.request.urlretrieve(url, f"/Volumes/my_catalog/world_development/raw_data/{name}.zip")
The download is idempotent: it skips files that already exist. After running, your Volume will contain 6 ZIP files:
/Volumes/my_catalog/world_development/raw_data/
├── gdp_per_capita.zip (~3MB)
├── life_expectancy.zip (~2MB)
├── school_enrollment.zip (~2MB)
├── health_expenditure.zip (~2MB)
├── co2_emissions.zip (~2MB)
└── population.zip (~3MB)
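The excerpt above does not show the skip-if-exists check, so here is one way that behavior could look. This is a sketch, not the notebook's actual code: `download_if_missing` and the injectable `fetch` argument are hypothetical names, and the rename step assumes the target filesystem supports `os.replace`.

```python
import os
import tempfile
import urllib.request


def download_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download url to dest unless a non-empty dest already exists.

    Returns True when a download happened, False when it was skipped.
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return False
    partial = dest + ".part"
    fetch(url, partial)
    # Rename only after the fetch finished, so an interrupted download
    # is never mistaken for a complete file on the next run.
    os.replace(partial, dest)
    return True


def _fake_fetch(url, filename):
    """Stand-in for the network call so the demo runs offline."""
    with open(filename, "wb") as handle:
        handle.write(b"placeholder bytes")


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "gdp_per_capita.zip")
    first = download_if_missing("https://example.invalid/data", target, fetch=_fake_fetch)
    second = download_if_missing("https://example.invalid/data", target, fetch=_fake_fetch)
```

Here `first` is True and `second` is False: the second call finds the file and does nothing, which is the property that makes re-running the notebook safe.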
Stage 1: The Engineer Config¶
File: examples/world_development/engineer_config.py
The ProjectConfig tells the Data Engineer Agent everything about ingesting
these World Bank ZIP files.
Join Key: Country Code¶
Every table must include a country_code column (ISO 3166-1 alpha-3) so they
can be joined:
from versifai.data_agents.engineer.config import JoinKeyConfig
join_key = JoinKeyConfig(
column_name="country_code",
data_type="STRING",
description="ISO 3166-1 alpha-3 country code (e.g., 'USA', 'GBR', 'CHN').",
validation_rule="Must be exactly 3 uppercase letters matching [A-Z]{3}",
expected_entity_count=217,
)
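The validation rule is given to the agent as text. As an illustration of what that rule checks, the following sketch applies the same `[A-Z]{3}` pattern to a list of codes. The helper name `invalid_country_codes` is hypothetical and not part of Versifai.

```python
import re

# Same pattern as the validation_rule text: exactly three uppercase letters.
ISO3_PATTERN = re.compile(r"[A-Z]{3}")


def invalid_country_codes(codes):
    """Return the distinct values that fail the 3-uppercase-letter rule, in input order."""
    bad = []
    for code in codes:
        is_valid = isinstance(code, str) and ISO3_PATTERN.fullmatch(code) is not None
        if not is_valid and code not in bad:
            bad.append(code)
    return bad


problems = invalid_country_codes(["USA", "GBR", "usa", "US", None, "CHN"])
```

Note that this is a format check only. An aggregate code such as 'WLD' passes it, so aggregates still have to be excluded by other means.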
Alternative Keys: Region and Income Group¶
World Bank assigns every country to a region and income group. These enable stratified analysis:
from versifai.data_agents.engineer.config import AlternativeKeyConfig
alternative_keys = [
AlternativeKeyConfig(
column_name="region",
description="World Bank region for regional aggregations",
data_type="STRING",
grain="region",
),
AlternativeKeyConfig(
column_name="income_group",
description="Low income, Lower middle income, Upper middle income, High income",
data_type="STRING",
grain="income_group",
),
]
Source Processing Hints: Wide-Format Pivot¶
This is where the real engineering guidance lives. World Bank CSVs have a non-standard format:
- 4 metadata rows at the top (must be skipped)
- Wide format: years as columns (1960, 1961, ..., 2023)
- Each ZIP contains 3 files: data CSV, country metadata CSV, indicator metadata CSV
The hints tell the agent exactly how to handle this:
from versifai.data_agents.engineer.config import SourceProcessingHint, SourceFileHint
SourceProcessingHint(
source_pattern="gdp_per_capita",
description="GDP per capita (current US$), World Bank indicator NY.GDP.PCAP.CD",
multi_table=True, # This ZIP produces TWO tables
files=[
SourceFileHint(
file_pattern="API_NY.GDP.PCAP.CD",
target_table="silver_gdp_per_capita",
description="GDP per capita by country and year",
),
SourceFileHint(
file_pattern="Metadata_Country",
target_table="silver_country_metadata",
description="Country classifications: region, income group",
),
],
notes=(
"The main data CSV has 4 metadata rows at the top; skip them. "
"This is WIDE FORMAT; pivot to LONG FORMAT with columns: "
"country_name, country_code, year, and the indicator value column."
),
)
Country metadata deduplication
Country metadata is identical across all 6 ZIPs. Only the first source
hint (gdp_per_capita) sets multi_table=True to load it. The remaining
5 ZIPs skip the metadata file.
Domain Guidance¶
WORLD_DEVELOPMENT = ProjectConfig(
...,
grain_detection_guidance=(
"Country-level: Look for 'Country Code' or 3-letter ISO codes\n"
"Country-year: After pivoting wide-format data, grain is (country_code, year)\n"
"WARNING: World Bank data includes aggregate entities (e.g., 'World', "
"'East Asia & Pacific') alongside individual countries."
),
column_naming_examples=(
"'Country Name' -> country_name\n"
"'Country Code' -> country_code\n"
"Year columns (1960..2023) -> pivot to: year (INT) + value column\n"
"'IncomeGroup' -> income_group"
),
)
See the complete config in engineer_config.py.
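The column naming examples describe a snake_case convention. A small sketch of a normalizer that reproduces those mappings follows. The agent derives names from the guidance text, so this function is an illustrative assumption and not its actual logic.

```python
import re


def to_snake_case(name):
    """Convert a source header such as 'Country Name' or 'IncomeGroup' to snake_case."""
    # Insert an underscore at lower-to-upper boundaries: IncomeGroup -> Income_Group
    with_boundaries = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse spaces and punctuation into single underscores
    collapsed = re.sub(r"[^0-9A-Za-z]+", "_", with_boundaries)
    return collapsed.strip("_").lower()


renamed = {header: to_snake_case(header) for header in ["Country Name", "Country Code", "IncomeGroup"]}
```

Year columns are not renamed this way; they are pivoted into a single year column plus a value column.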
Running the Engineer¶
File: examples/world_development/notebooks/01_run_engineer.py
from examples.world_development.engineer_config import WORLD_DEVELOPMENT
from versifai.data_agents.engineer.agent import DataEngineerAgent
cfg = WORLD_DEVELOPMENT
agent = DataEngineerAgent(cfg=cfg, dbutils=dbutils)
# Stage 1: Discover, extract, pivot, load
results = agent.run(source_path=cfg.volume_path)
# Stage 2: Standardize column names
agent.run_rename()
# Stage 3: Build data catalog
agent.run_catalog()
# Stage 4: Validate all tables
agent.run_quality_check()
Result: 7 Delta tables in Unity Catalog: 6 indicator tables (one per indicator, long format) plus silver_country_metadata.
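After the pivot, each indicator table should have exactly one row per (country_code, year). What run_quality_check() verifies is not shown here, so the following is only a sketch of one check you might run yourself on rows pulled from a table. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_grain_keys(rows, grain=("country_code", "year")):
    """Return {key: count} for every grain key that appears more than once."""
    counts = Counter(tuple(row[column] for column in grain) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows: (AAA, 2000) appears twice, which violates the expected grain.
sample_rows = [
    {"country_code": "AAA", "year": 2000, "value": 1.0},
    {"country_code": "AAA", "year": 2001, "value": 2.0},
    {"country_code": "AAA", "year": 2000, "value": 1.5},
]
duplicates = duplicate_grain_keys(sample_rows)
```

An empty result means the grain holds. Any entry points at a pivot or load step that produced repeated country-year rows.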
Stage 2: The Research Config¶
File: examples/world_development/research_configs/global_development.py
The ResearchConfig defines the entire research agenda: a thesis, 6 analysis
themes, 3 silver datasets, and 5 literature references.
The Thesis¶
GLOBAL_DEVELOPMENT = ResearchConfig(
name="Does Economic Development Drive Human Wellbeing?",
thesis=(
"Economic development (GDP per capita) is strongly correlated with life "
"expectancy and health outcomes, but the relationship is log-linear with "
"sharply diminishing returns. Education correlates with growth but causality "
"is confounded. Healthcare spending efficiency varies enormously. Carbon "
"emissions track development, though high-income nations show signs of "
"decoupling. Whether the world is converging or diverging depends on what "
"you measure."
),
agent_role="Development Economist and Global Health Researcher",
)
Six Analysis Themes¶
| # | Title | Type | Key Question |
|---|---|---|---|
| 0 | The Development Dashboard | descriptive | What does the data look like? |
| 1 | The Preston Curve | correlation | Does GDP predict life expectancy? |
| 2 | Education and Economic Growth | comparative | Do educated countries grow faster? |
| 3 | Healthcare Spending Returns | correlation | Does health spending beat GDP alone? |
| 4 | The Carbon Cost of Development | trend | Is growth carbon-intensive? Kuznets Curve? |
| 5 | Convergence or Divergence? | trend | Are countries converging in GDP and health? |
Each theme is a self-contained research question with methodology, expected outputs, and a signature visualization:
from versifai.science_agents.scientist.config import AnalysisTheme
AnalysisTheme(
id="theme_1",
title="The Preston Curve",
question="Does the classic concave Preston Curve hold in modern data?",
analysis_type="correlation",
sequence=1,
required_tables=["silver_development_recent"],
analysis_steps=[
"Compute Pearson and Spearman correlation: log(GDP) vs life expectancy",
"Fit log-linear regression: life_expectancy ~ log(gdp_per_capita)",
"Test for concavity: add quadratic term",
"Identify outliers deviating from the curve",
"Compare curve shape across decades (2000, 2010, 2020)",
],
signature_visualization=(
"The Preston Curve scatter plot: log(GDP) vs life expectancy, "
"dots sized by population, colored by income group."
),
)
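The second analysis step, the log-linear fit, can be sketched with ordinary least squares in plain Python. This illustrates the technique only and is not the agent's code. The data below is synthetic, generated from an exact line so the fit can be checked by eye.

```python
import math


def fit_log_linear(gdp, life_expectancy):
    """OLS fit of life_expectancy = intercept + slope * ln(gdp).

    Returns (slope, intercept, pearson_r) computed on the log-transformed GDP.
    """
    x = [math.log(value) for value in gdp]
    y = list(life_expectancy)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Synthetic data lying exactly on life = 30 + 5 * ln(gdp); not real observations.
synthetic_gdp = [500, 1000, 2000, 4000, 8000, 16000, 32000, 64000]
synthetic_life = [30 + 5 * math.log(g) for g in synthetic_gdp]

slope, intercept, pearson_r = fit_log_linear(synthetic_gdp, synthetic_life)
years_per_doubling = slope * math.log(2)
```

Under a strictly log-linear fit, every doubling of GDP adds the same `slope * ln(2)` years regardless of the starting level. The diminishing return is per dollar: an extra $1,000 is a large proportional change for a poor country and a small one for a rich country. The quadratic-term step in the theme tests whether the curve flattens even on the log scale.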
Three Silver Datasets¶
The agent builds pre-joined analytical tables before running themes:
| Dataset | Description | Use |
|---|---|---|
| silver_development_panel | All 6 indicators + metadata joined on (country_code, year) | Themes 0-4 |
| silver_development_recent | Panel filtered to 2000-2023, excludes aggregates | Themes 1-4 |
| silver_development_long_run | Balanced panel 1960-2023, ~100-120 countries | Theme 5 |
from versifai.science_agents.scientist.config import SilverDatasetSpec
SilverDatasetSpec(
name="silver_development_recent",
description="2000-2023 panel with best data coverage",
source_tables=["silver_development_panel"],
join_key="country_code",
time_column="year",
data_notes=(
"Filter WHERE year >= 2000. Exclude aggregate entities: keep only "
"individual countries (those with a non-NULL region)."
),
)
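The data_notes describe a join followed by two filters. The sketch below shows that logic on in-memory rows. It is a simplified assumption: the agent builds these tables itself, the function name and row shapes are hypothetical, only two indicators are joined, and a left join from the GDP table is an arbitrary choice made for brevity.

```python
def build_recent_panel(gdp_rows, life_rows, metadata_rows, start_year=2000):
    """Join GDP and life expectancy on (country_code, year), then keep
    rows from start_year onward for entities with a non-empty region."""
    region_by_code = {m["country_code"]: m.get("region") for m in metadata_rows}
    life_by_key = {(r["country_code"], r["year"]): r["life_expectancy"] for r in life_rows}

    panel = []
    for row in gdp_rows:
        code, year = row["country_code"], row["year"]
        region = region_by_code.get(code)
        # Aggregates such as 'WLD' have no region in the country metadata.
        if year < start_year or not region:
            continue
        panel.append({
            "country_code": code,
            "year": year,
            "region": region,
            "gdp_per_capita": row["gdp_per_capita"],
            "life_expectancy": life_by_key.get((code, year)),  # None when missing
        })
    return panel


# Made-up rows for illustration only.
gdp = [
    {"country_code": "AAA", "year": 1995, "gdp_per_capita": 1.0},
    {"country_code": "AAA", "year": 2005, "gdp_per_capita": 2.0},
    {"country_code": "WLD", "year": 2005, "gdp_per_capita": 3.0},
]
life = [{"country_code": "AAA", "year": 2005, "life_expectancy": 70.0}]
metadata = [
    {"country_code": "AAA", "region": "Example Region"},
    {"country_code": "WLD", "region": None},
]
recent_panel = build_recent_panel(gdp, life, metadata)
```

Only the 2005 row for AAA survives: the 1995 row fails the year filter and the WLD row fails the region filter.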
Domain Context¶
The config injects data-specific knowledge into the agent's system prompt:
GLOBAL_DEVELOPMENT = ResearchConfig(
...,
domain_context=(
"## Data Quirks\n\n"
"- World Bank aggregates (e.g., 'WLD', 'EAS') must be excluded\n"
"- GDP per capita is in current US$; use log transforms\n"
"- Health expenditure starts ~2000; earlier years are missing\n"
"- School enrollment can exceed 100% (UNESCO methodology)\n\n"
"## Expected Value Ranges\n\n"
"- GDP per capita: $200 (Burundi) to $100,000+ (Luxembourg)\n"
"- Life expectancy: 50-85 years\n"
"- CO2 emissions: 0.05-40 metric tons per capita\n"
),
analysis_method_guidance={
"correlation": (
"Always log-transform GDP and health expenditure. Report both "
"Pearson and Spearman. Size scatter points by population."
),
"trend": (
"For convergence, use balanced panels. Report sigma-convergence "
"(CV of log GDP by decade). Annotate structural breaks."
),
},
)
See the complete config in global_development.py.
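The trend guidance asks for sigma-convergence, measured as the coefficient of variation (CV) of log GDP by decade. A minimal sketch of that statistic follows. Pooling all country-years within a decade is an assumption of this sketch (a snapshot year per decade is an equally plausible reading), and the input rows are synthetic.

```python
import math
import statistics


def sigma_convergence(rows, value_column="gdp_per_capita"):
    """Return {decade: CV of ln(value)}; a falling CV suggests convergence."""
    logs_by_decade = {}
    for row in rows:
        value = row.get(value_column)
        if value is None or value <= 0:      # log is undefined for these
            continue
        decade = (row["year"] // 10) * 10
        logs_by_decade.setdefault(decade, []).append(math.log(value))
    return {
        decade: statistics.pstdev(logs) / statistics.mean(logs)
        for decade, logs in sorted(logs_by_decade.items())
        if len(logs) > 1
    }


# Synthetic two-country panel in which the gap narrows over time.
synthetic_rows = [
    {"country_code": "AAA", "year": 1960, "gdp_per_capita": 100.0},
    {"country_code": "BBB", "year": 1960, "gdp_per_capita": 10000.0},
    {"country_code": "AAA", "year": 2000, "gdp_per_capita": 1000.0},
    {"country_code": "BBB", "year": 2000, "gdp_per_capita": 10000.0},
]
cv_by_decade = sigma_convergence(synthetic_rows)
```

In this synthetic case the CV falls from 1/3 in the 1960s to 1/7 in the 2000s. Because entry and exit of countries changes the dispersion mechanically, the guidance to use a balanced panel matters for real data.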
Running the Scientist¶
File: examples/world_development/notebooks/02_run_scientist.py
from examples.world_development.research_configs.global_development import GLOBAL_DEVELOPMENT
from versifai.science_agents.scientist.agent import DataScientistAgent
cfg = GLOBAL_DEVELOPMENT
agent = DataScientistAgent(cfg=cfg, dbutils=dbutils)
# Full pipeline: Orientation → Silver Construction → Theme Analysis → Synthesis
results = agent.run()
The agent runs through 4 phases automatically. If it crashes midway, re-running skips completed themes (smart resume).
You can also run specific themes:
# Skip themes 0-2, run themes 3-5 only
agent.run_themes(start_theme=3)
# Or run specific themes
agent.run_themes(themes=[1, 4]) # Preston Curve + Carbon Cost
Result: Structured findings with p-values and effect sizes, charts for each theme, CSV summary tables, and per-theme markdown reasoning notes.
Stage 3: The Storyteller Config¶
File: examples/world_development/storyteller_config.py
The StorytellerConfig defines how to turn the scientist's findings into "The Shape of Global Progress", an 8-section narrative report.
Style Guide¶
from versifai.story_agents.storyteller.config import StyleGuide
style = StyleGuide(
voice="third-person analytical",
audience="Policy analysts, development economists, and informed general readers",
document_type="Analytical white paper",
tone_guidance=(
"Authoritative but accessible. Write like The Economist or Our World in Data: "
"precise, evidence-first, globally aware. Let the data speak."
),
anti_patterns=(
"- NO: Vague claims like 'the world is getting better/worse' without data\n"
"- NO: Conflating correlation with causation\n"
"- NO: Cherry-picking countries that support a narrative\n"
),
)
Eight Narrative Sections¶
| # | Title | Source Themes | Max Words |
|---|---|---|---|
| 0 | The Shape of Progress (hook) | theme_0 | 800 |
| 1 | When Wealth Buys Years | theme_1 | 1,200 |
| 2 | Schools, Skills, and Growth | theme_2 | 1,000 |
| 3 | Diminishing Returns | theme_3 | 1,200 |
| 4 | The Carbon Crossroads | theme_4 | 1,200 |
| 5 | A Narrowing Gap? | theme_5 | 1,500 |
| 6 | What Development Data Reveals | all | 1,000 |
| 7 | Methodology & Reproducibility | all | 2,000 |
Each section maps to research themes and includes transition text:
from versifai.story_agents.storyteller.config import NarrativeSection
NarrativeSection(
id="section_preston",
title="When Wealth Buys Years",
purpose="Present the GDP-life expectancy relationship and the Preston Curve",
source_theme_ids=["theme_1"],
max_words=1200,
key_evidence="Preston Curve regression, R-squared, outlier analysis, decade shifts",
narrative_guidance=(
"Lead with the Preston Curve scatter plot. Explain the log-linear shape: "
"why doubling GDP from $2K to $4K buys more years than $40K to $80K."
),
transition_from="With data in hand, we begin with the most iconic relationship.",
transition_to="If wealth buys health with diminishing returns, what about direct investment?",
sequence=1,
)
Evidence Thresholds¶
Zero tolerance for ungrounded claims:
from versifai.story_agents.storyteller.config import EvidenceThreshold
evidence = EvidenceThreshold(
min_significance_for_lead="high",
min_significance_for_support="medium",
require_effect_size=True,
max_unsupported_claims=0,
)
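To show how thresholds like these could sort findings, here is a hypothetical sketch. The shape of a finding (a dict with `significance` and `effect_size`), the existence of a 'low' tier, and the `classify_finding` function are all assumptions. The StoryTeller's real evaluation logic is not documented in this tutorial.

```python
# Assumed ordering of significance tiers; 'low' is a guess, the other
# two appear in the EvidenceThreshold example above.
SIGNIFICANCE_ORDER = {"low": 0, "medium": 1, "high": 2}


def classify_finding(finding, min_lead="high", min_support="medium", require_effect_size=True):
    """Return 'lead', 'support', or 'excluded' for one finding dict."""
    if require_effect_size and finding.get("effect_size") is None:
        return "excluded"
    rank = SIGNIFICANCE_ORDER[finding["significance"]]
    if rank >= SIGNIFICANCE_ORDER[min_lead]:
        return "lead"
    if rank >= SIGNIFICANCE_ORDER[min_support]:
        return "support"
    return "excluded"


# Made-up findings for illustration only.
example_findings = [
    {"id": "f1", "significance": "high", "effect_size": 0.8},
    {"id": "f2", "significance": "medium", "effect_size": 0.3},
    {"id": "f3", "significance": "high", "effect_size": None},
    {"id": "f4", "significance": "low", "effect_size": 0.1},
]
roles = {f["id"]: classify_finding(f) for f in example_findings}
```

With `max_unsupported_claims=0`, anything classified as excluded here would not be allowed to back a claim in the report.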
Domain Writing Rules¶
WORLD_DEVELOPMENT_STORY = StorytellerConfig(
...,
domain_writing_rules=(
"EVIDENCE-FIRST ANALYTICAL TONE: Every claim must cite a specific statistic. "
"Acknowledge when data limitations affect conclusions. Let readers draw their "
"own policy conclusions; the report's job is to present evidence, not advocate."
),
citation_source_guidance=(
"World Bank technical documentation, peer-reviewed development economics literature, "
"Our World in Data, OECD reports, and classic texts on growth theory."
),
)
See the complete config in storyteller_config.py.
Running the Storyteller¶
File: examples/world_development/notebooks/03_run_storyteller.py
from examples.world_development.storyteller_config import WORLD_DEVELOPMENT_STORY
from versifai.story_agents.storyteller.agent import StoryTellerAgent
cfg = WORLD_DEVELOPMENT_STORY
agent = StoryTellerAgent(cfg=cfg, dbutils=dbutils)
# Full pipeline: Inventory → Evidence → Write → Coherence → Finalize
results = agent.run()
You can also rewrite specific sections or run an editorial pass:
# Rewrite sections 0, 3, and 5
agent.run_sections(sections=[0, 3, 5])
# Editor pass with specific instructions
agent.run_editor(
instructions="Strengthen the transition from the Preston Curve section "
"into the Education section."
)
Result: world_development_report.md, a narrative report with table of contents, inline citations, and bibliography.
The Final Report¶
The full report produced by the StoryTeller agent will be linked here after the pipeline runs end-to-end on Databricks. Check back soon, or run the notebooks yourself to generate it.
Data Flow Summary¶
flowchart LR
DL["Download<br>6 ZIPs"] --> VOL[/"Volume<br>6 ZIP archives"/]
VOL --> ENG["**Data Engineer**<br>Extract, pivot,<br>load to Delta"]
ENG --> CAT[("7 Delta Tables<br>in Unity Catalog")]
CAT --> SCI["**Data Scientist**<br>Join, analyze,<br>6 themes"]
SCI --> OUT[/"Findings +<br>Charts + Tables"/]
OUT --> ST["**StoryTeller**<br>Evaluate evidence,<br>write narrative"]
ST --> RPT[/"world_development_<br>report.md"/]
style DL fill:#fff8e1,stroke:#b38600
style VOL fill:#fff8e1,stroke:#b38600
style ENG fill:#e8f0fe,stroke:#4a6f93
style CAT fill:#e8f4e8,stroke:#4a8a4a
style SCI fill:#e8f0fe,stroke:#4a6f93
style OUT fill:#fff8e1,stroke:#b38600
style ST fill:#e8f0fe,stroke:#4a6f93
style RPT fill:#e8f4e8,stroke:#4a8a4a
What Each Agent Produces¶
Data Engineer (7 Delta tables):
- silver_gdp_per_capita: GDP per capita by country and year (long format)
- silver_life_expectancy: Life expectancy by country and year
- silver_school_enrollment: Primary enrollment rate by country and year
- silver_health_expenditure: Health spending per capita by country and year
- silver_co2_emissions: CO2 per capita by country and year
- silver_population: Total population by country and year
- silver_country_metadata: Region, income group, special notes per country

Data Scientist (research artifacts):
- findings.json: Structured findings with p-values, effect sizes, evidence tiers
- charts/: PNG visualizations (Preston Curve, Kuznets Curve, convergence, etc.)
- tables/: CSV summary tables (regression coefficients, ANOVA results, etc.)
- notes/: Per-theme markdown reasoning logs

StoryTeller (narrative report):
- world_development_report.md: ~10,000-word analytical report with TOC, citations, bibliography
Output File Structure¶
/Volumes/my_catalog/world_development/
├── raw_data/                          # Input (downloaded by notebook 0)
│   ├── gdp_per_capita.zip
│   ├── life_expectancy.zip
│   ├── school_enrollment.zip
│   ├── health_expenditure.zip
│   ├── co2_emissions.zip
│   └── population.zip
│
├── results/                           # Data Scientist outputs
│   ├── findings.json
│   ├── charts/
│   │   ├── development_dashboard_grid.png
│   │   ├── preston_curve_scatter.png
│   │   ├── education_growth_boxplots.png
│   │   ├── healthcare_spending_scatter.png
│   │   ├── carbon_kuznets_two_panel.png
│   │   └── convergence_dual_axis.png
│   ├── tables/
│   │   ├── data_inventory_summary.csv
│   │   ├── preston_curve_regression.csv
│   │   ├── enrollment_tertile_comparison.csv
│   │   ├── spending_model_comparison.csv
│   │   ├── kuznets_regression.csv
│   │   └── convergence_by_decade.csv
│   └── notes/
│       ├── theme_0.md
│       ├── theme_1.md
│       ├── theme_2.md
│       ├── theme_3.md
│       ├── theme_4.md
│       └── theme_5.md
│
└── narrative/                         # StoryTeller outputs
    └── world_development_report.md
How It All Connects¶
| Part | What It Is | What Changes Between Projects |
|---|---|---|
| Config | A Python dataclass holding all domain knowledge | Everything: this is where your project lives |
| Agent | A generic Python class that reads the config and does work | Nothing: agents are reusable across projects |
| Notebook | A Databricks notebook that creates the agent and runs it | Just the import path to your config |
The agents are generic. All domain-specific knowledge lives in the configs. To start a new project, write new configs and run the same agents.
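The separation between configs and agents can be illustrated with a toy example. Nothing below is Versifai code: `ToyConfig` and `ToyAgent` are hypothetical stand-ins, meant only to show how one generic class can serve two different projects when all the domain detail lives in a dataclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a project config: pure data, no behavior."""
    name: str
    join_key: str
    sources: tuple = ()


class ToyAgent:
    """Hypothetical generic agent: all project specifics come from cfg."""

    def __init__(self, cfg):
        self.cfg = cfg

    def plan(self):
        return [f"load {source} keyed on {self.cfg.join_key}" for source in self.cfg.sources]


world_dev = ToyConfig("world_development", "country_code", ("gdp_per_capita", "population"))
other_project = ToyConfig("my_project", "customer_id", ("orders",))

# Same agent class, different configs, different behavior.
world_dev_plan = ToyAgent(world_dev).plan()
other_plan = ToyAgent(other_project).plan()
```

Swapping the config changes what the agent does without touching the agent class, which is the pattern the table above describes.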
Adapting for Your Own Project¶
Copy the World Development example and replace the domain content:
1. Copy the example directory, examples/world_development/, to a new location for your project.
2. Edit engineer_config.py:
   - Change catalog, schema, volume_path to your Databricks target
   - Update join_key to your primary join column
   - List your data sources in known_sources
   - Add processing hints if your data has a non-standard format
   - Add grain_detection_guidance and column_naming_examples
3. Edit research_configs/:
   - Write your thesis
   - Define 5-10 analysis themes with research questions
   - Specify silver datasets for pre-joined tables
   - Set agent_role and domain_context
   - Add analysis_method_guidance for domain-specific methodology
4. Edit storyteller_config.py:
   - Define narrative sections (one per major finding)
   - Set the style guide for your audience
   - Configure evidence thresholds
   - Add domain_writing_rules and citation_source_guidance
5. Write a download notebook (if using public data) or upload files manually.
6. Run the notebooks in order:
   - 00_download_data.py: Get the data
   - 01_run_engineer.py: Ingest into Delta tables
   - 02_run_scientist.py: Analyze
   - 03_run_storyteller.py: Write the report
The agent code is the same for every project. Your configs are the only thing that changes.
Key Concepts Recap¶
| Concept | What It Is | Where It Lives |
|---|---|---|
| ProjectConfig | Data engineering instructions (catalog, schema, join keys, sources) | engineer_config.py |
| ResearchConfig | Research methodology (thesis, themes, silver datasets, domain context) | research_configs/*.py |
| StorytellerConfig | Narrative rules (sections, style, evidence thresholds) | storyteller_config.py |
| AnalysisTheme | One research question with steps and a signature chart | Inside ResearchConfig |
| SilverDatasetSpec | A pre-joined analytical table to build | Inside ResearchConfig |
| NarrativeSection | One section of the report with tone and evidence mapping | Inside StorytellerConfig |
| Smart Resume | Agents skip completed work on re-run | Built into all agents |
| Tools | The unit of agent capability (SQL, stats, charts, etc.) | src/versifai/*/tools/ |