Data Scientist Agent
The DataScientistAgent runs autonomous research analysis against validated data tables. It builds analytical datasets ("silver" tables), runs statistical tests, fits models, and produces charts and research findings.
How It Works
The scientist organizes its work by theme. Each theme represents a research question or area of investigation (e.g., "Geographic Disparity", "Temporal Trends"). For each theme, the agent works through these stages:
- Orientation: reviews available data and plans the analysis approach
- Silver dataset construction: creates derived analytical tables via SQL
- Statistical analysis: runs hypothesis tests, correlations, and descriptive statistics
- Model fitting: fits regression, classification, or clustering models as appropriate
- Visualization: produces charts and plots for each finding
- Findings: persists structured research findings with evidence
Silver Tables
The scientist can create derived tables via SQL, provided their names are prefixed with `silver_`. These tables are read-write. All other (bronze) tables are read-only, a restriction enforced by the `SilverOnlyExecuteSQLTool`.
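To make the silver-only rule concrete, here is a minimal sketch of a write guard. It is not the implementation of `SilverOnlyExecuteSQLTool`, which this page does not show; the function name `is_write_allowed`, the keyword list, and the table names in the usage note are all assumptions made for illustration. The idea is to let read statements through and accept a write only when its target table name starts with `silver_`.

```python
import re

# Statement keywords treated as writes (illustrative, not exhaustive).
WRITE_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "MERGE", "CREATE", "DROP", "ALTER", "TRUNCATE")

# Captures the table name that follows a recognised write form, e.g.
# "CREATE OR REPLACE TABLE analytics.churn.silver_x" -> "analytics.churn.silver_x".
_TARGET = re.compile(
    r"""^\s*(?:
        INSERT\s+(?:INTO|OVERWRITE)(?:\s+TABLE)? |
        UPDATE |
        DELETE\s+FROM |
        MERGE\s+INTO |
        TRUNCATE\s+TABLE |
        (?:CREATE(?:\s+OR\s+REPLACE)?|DROP|ALTER)\s+TABLE(?:\s+IF\s+(?:NOT\s+)?EXISTS)?
    )\s+([\w.`]+)""",
    re.IGNORECASE | re.VERBOSE,
)


def is_write_allowed(sql: str) -> bool:
    """Return True for read statements and for writes that target a silver_ table."""
    statement = sql.strip()
    first_word = statement.split(None, 1)[0].upper() if statement else ""
    if first_word not in WRITE_KEYWORDS:
        return True  # SELECT, SHOW, DESCRIBE, ... are treated as reads
    match = _TARGET.match(statement)
    if match is None:
        return False  # unrecognised write form (e.g. CREATE VIEW): reject rather than guess
    # Keep only the last component of catalog.schema.table, without backticks.
    table = match.group(1).replace("`", "").split(".")[-1]
    return table.lower().startswith("silver_")
```

With this sketch, `is_write_allowed("CREATE OR REPLACE TABLE analytics.churn.silver_tenure AS SELECT 1")` is accepted and `is_write_allowed("DROP TABLE bronze_claims")` is rejected. A keyword check like this misses forms such as a write that follows a `WITH` clause, so a production guard would more plausibly rely on a real SQL parser or on database-level permissions.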
Available Tools
| Tool | Description |
|---|---|
| `execute_sql` | SQL against Databricks (silver-only write access) |
| `list_catalog_tables` | List all tables in the target schema |
| `statistical_analysis` | Descriptive stats, correlations, distributions, hypothesis testing |
| `model_fitting` | Regression, classification, clustering with automated feature selection |
| `validate_statistics` | Cross-check statistical claims for validity |
| `literature_review` | Web-based research context gathering |
| `create_visualization` | Charts, plots, geographic maps |
| `save_finding` | Persist research findings with evidence |
| `save_note` | Save analytical notes and observations |
| `log_model` | Log model artifacts and metadata |
| `web_search` | External documentation and literature |
| `create_custom_tool` | Runtime tool creation for custom analysis |
| `ask_human` | Pause and ask the operator a question |
See the Tool Reference for detailed parameter and return value tables.
Run Outputs
Each run produces a structured output directory:
```text
results/{config_name}/runs/{run_id}/
├── findings.json        # All research findings with evidence
├── charts/              # Visualization images
├── tables/              # Exported data tables
├── notes/               # Analytical notes
├── models/              # Model artifacts
└── run_metadata.json    # Run info, timing, state
```
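The layout above can be navigated with a few lines of standard-library Python. The sketch below is not part of the package: `run_paths` and `load_findings` are hypothetical helpers, and because this page does not describe the schema of `findings.json`, the loader simply returns whatever JSON the file contains.

```python
import json
from pathlib import Path


def run_paths(results_root: str, config_name: str, run_id: str) -> dict[str, Path]:
    """Build the expected paths for one run, following the directory layout above."""
    run_dir = Path(results_root) / config_name / "runs" / run_id
    return {
        "run_dir": run_dir,
        "findings": run_dir / "findings.json",
        "charts": run_dir / "charts",
        "tables": run_dir / "tables",
        "notes": run_dir / "notes",
        "models": run_dir / "models",
        "metadata": run_dir / "run_metadata.json",
    }


def load_findings(paths: dict[str, Path]):
    """Parse findings.json and return its content unchanged (schema not assumed)."""
    with paths["findings"].open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `run_paths("results", "churn", "run_001")["findings"]` points at `results/churn/runs/run_001/findings.json`, where `churn` and `run_001` are made-up values.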
Usage
```python
from versifai.science_agents import DataScientistAgent, ResearchConfig

cfg = ResearchConfig(
    name="Customer Churn Analysis",
    catalog="analytics",
    schema="churn",
    results_path="/tmp/results/churn",
    themes=[...],  # Define research themes
    # Domain-specific guidance (optional, but recommended)
    agent_role="Customer Analytics Researcher",
    domain_context=(
        "Churn rate is typically 5-15% monthly for SaaS.\n"
        "Revenue values are in USD. MRR = Monthly Recurring Revenue.\n"
        "Customer tenure is measured in months since first subscription."
    ),
    analysis_method_guidance={
        "simulation": "Use survival analysis (Kaplan-Meier) for time-to-churn modeling.",
    },
)

# dbutils is the utility object provided by the Databricks notebook environment
agent = DataScientistAgent(cfg=cfg, dbutils=dbutils)

# Full run
result = agent.run()

# Re-run only specific themes
result = agent.run_themes(themes=[0, 3])
```
Domain Guidance Fields
The agent prompts are domain-agnostic. Use these config fields to inject domain knowledge:
| Field | Purpose | Default |
|---|---|---|
| `agent_role` | Agent identity in system prompt (e.g., "Health Policy Researcher") | `"Data Scientist"` |
| `domain_context` | Data quirks, validation ranges, sense-check benchmarks | `""` (no domain section) |
| `analysis_method_guidance` | Per-analysis-type methodology overrides (keys: `"simulation"`, `"comparative"`, etc.) | `{}` (use built-in defaults) |
| `visualization_guidance` | Chart and visualization priorities | `""` (generic chart guidance) |
When these fields are empty, the agent uses rigorous but generic statistical methodology.
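One way to picture how these fields and their defaults could feed a prompt is the sketch below. It is purely illustrative: `build_system_prompt`, the section headings, and the `GENERIC_CHART_GUIDANCE` text are invented here and are not taken from the package's actual prompt templates. What it mirrors from the table is the fallback behaviour: an empty `domain_context` yields no domain section, an empty `analysis_method_guidance` adds no overrides, and an empty `visualization_guidance` falls back to generic guidance.

```python
from typing import Optional

# Placeholder wording, invented for this sketch.
GENERIC_CHART_GUIDANCE = "Choose the simplest chart that shows the finding clearly."


def build_system_prompt(
    agent_role: str = "Data Scientist",
    domain_context: str = "",
    analysis_method_guidance: Optional[dict] = None,
    visualization_guidance: str = "",
) -> str:
    """Assemble prompt sections, skipping or defaulting any field left empty."""
    sections = [f"Role: {agent_role}"]
    if domain_context:  # empty string -> no domain section at all
        sections.append("Domain context:\n" + domain_context)
    for analysis_type, guidance in (analysis_method_guidance or {}).items():
        sections.append(f"Method guidance ({analysis_type}):\n{guidance}")
    # Empty visualization guidance falls back to the generic text.
    sections.append("Visualization:\n" + (visualization_guidance or GENERIC_CHART_GUIDANCE))
    return "\n\n".join(sections)
```

Calling `build_system_prompt()` with no arguments produces only the role line and the generic visualization section, which corresponds to the fully generic configuration.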
Evidence Standards
Every finding includes structured evidence and is assigned one of five tiers according to its statistical significance and effect size:
| Tier | Criteria | Example |
|---|---|---|
| DEFINITIVE | p < 0.001, large effect size, multiple methods agree | RCT with pre-registered hypothesis |
| STRONG | p < 0.01, meaningful effect size | Regression with significant predictors |
| SUGGESTIVE | p < 0.05, moderate effect size | Correlation with plausible mechanism |
| CONTEXTUAL | Descriptive patterns, no formal test | Geographic distribution pattern |
| WEAK | p > 0.05 or trivial effect size | Non-significant trend |
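The tier table can be read as a decision rule. The sketch below encodes one possible reading and is not the agent's own logic. The table gives p-value cut-offs but only qualitative effect-size labels, so the numeric thresholds are assumptions: they borrow Cohen's conventional values for d (0.2 small, 0.5 medium, 0.8 large) and map them to "moderate", "meaningful", and "large". The function name `evidence_tier` is likewise hypothetical.

```python
from typing import Optional

# ASSUMED effect-size cut-offs (absolute Cohen's d); the table itself gives no numbers.
MODERATE, MEANINGFUL, LARGE = 0.2, 0.5, 0.8


def evidence_tier(
    p_value: Optional[float],
    effect_size: Optional[float] = None,
    methods_agree: bool = False,
) -> str:
    """Map a test result to a tier name from the table above."""
    if p_value is None:
        return "CONTEXTUAL"  # descriptive pattern, no formal test
    # A missing effect size is treated as trivial in this sketch.
    size = abs(effect_size) if effect_size is not None else 0.0
    # The table does not cover p = 0.05 exactly; it is treated as WEAK here.
    if p_value >= 0.05 or size < MODERATE:
        return "WEAK"
    if p_value < 0.001 and size >= LARGE and methods_agree:
        return "DEFINITIVE"
    if p_value < 0.01 and size >= MEANINGFUL:
        return "STRONG"
    return "SUGGESTIVE"
```

Under these assumptions, a result with p = 0.0004 and d = 0.9 is DEFINITIVE only when several methods agree and otherwise drops to STRONG, while p = 0.03 with d = 0.3 is SUGGESTIVE.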