Versifai Tool Inventory¶
Complete reference for every tool available to each agent. All tools extend BaseTool and return ToolResult.
ToolResult fields: success (bool), data (Any), error (str), summary (str), image_path (str)
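For orientation, here is a minimal sketch of that result shape as a dataclass-style container. The field names match the list above; the defaults and comments are illustrative assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class ToolResult:
    """Illustrative sketch of the documented result fields (not the real class)."""
    success: bool = False              # whether the tool call completed without error
    data: Any = None                   # tool-specific payload (rows, paths, metrics, ...)
    error: Optional[str] = None        # error message when success is False
    summary: str = ""                  # short human-readable summary for the agent transcript
    image_path: Optional[str] = None   # set by chart tools so results can render inline
```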
Tool Cross-Reference Matrix¶
Which tools are available to which agents:
| Tool Name | Data Engineer | Data Scientist | StoryTeller |
|---|---|---|---|
| execute_sql | Full | Silver-only | Silver-only |
| list_catalog_tables | Y | Y | Y |
| web_search | Y | Y | Y |
| scrape_web | - | Y | Y |
| create_visualization | - | Y | Y |
| view_chart | - | Y | Y |
| save_note | - | Y | Y |
| create_custom_tool | Y | Y | Y |
| explore_volume | Y | - | - |
| extract_archive | Y | - | - |
| read_file_header | Y | - | - |
| read_documentation | Y | - | - |
| scan_for_documentation | Y | - | - |
| profile_data | Y | - | - |
| design_schema | Y | - | - |
| transform_and_load | Y | - | - |
| write_to_catalog | Y | - | - |
| rename_columns | Y | - | - |
| statistical_analysis | - | Y | - |
| fit_model | - | Y | - |
| check_confounders | - | Y | - |
| validate_silver | - | Y | - |
| validate_statistics | - | Y | - |
| review_literature | - | Y | - |
| save_finding | - | Y | - |
| log_model | - | Y* | - |
| read_findings | - | - | Y |
| read_chart | - | - | Y |
| read_table | - | - | Y |
| write_narrative | - | - | Y |
| evaluate_evidence | - | - | Y |
| cite_source | - | - | Y |
* log_model is conditionally registered only when cfg.mlflow_experiment is set.
Shared Tools¶
These tools are registered by two or more agents.
execute_sql¶
Class: ExecuteSQLTool / SilverOnlyExecuteSQLTool | File: core/tools/catalog_writer.py
Execute SQL queries against Unity Catalog. Data Engineer gets full write access; Data Scientist and StoryTeller get write-protected access (DDL/DML restricted to silver_* tables).
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| sql | string | Y | SQL query (DDL, DML, or SELECT) |
Returns:
| Field | Type | Description |
|---|---|---|
| rows | list[dict] | Query result rows (SELECT only, capped at 100) |
| row_count | int | Number of rows returned |
| method | string | Execution method (spark or sdk) |
Notes:
- SELECT results capped at 100 rows for agent display; use WHERE/LIMIT for large sets
- OOM detection with actionable error messages suggesting LIMIT, WHERE, or GROUP BY
- Spark execution with 10-minute timeout, SDK fallback with async polling
- SilverOnly variant returns success=False if DDL/DML targets non-silver tables
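A hedged sketch of an agent-side call is shown below. The `registry` accessor and the table and column names are hypothetical; the `sql` parameter and the `rows`, `row_count`, and `method` return fields are as documented above.

```python
# Hypothetical invocation sketch; only the sql parameter and returned fields are documented.
result = registry.get("execute_sql").execute(
    sql="""
        SELECT county_fips, AVG(median_income) AS avg_income
        FROM silver_county_demographics
        GROUP BY county_fips
        ORDER BY avg_income DESC
        LIMIT 50
    """
)
if result.success:
    print(result.data["row_count"], "rows via", result.data["method"])
    for row in result.data["rows"][:3]:
        print(row)
else:
    # The SilverOnly variant lands here if DDL/DML targets a non-silver table.
    print(result.error)
```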
list_catalog_tables¶
Class: ListCatalogTablesTool | File: core/tools/catalog_writer.py
List all tables in the configured Unity Catalog schema.
Parameters: None required.
Returns:
| Field | Type | Description |
|---|---|---|
| tables | list[str] | Table names in the schema |
| count | int | Number of tables |
web_search¶
Class: WebSearchTool | File: core/tools/web_search.py
Search the web or fetch a specific URL for data documentation and metadata.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| query | string | Y | Search query, matched against ProjectConfig.documentation_urls |
| url | string | - | Specific URL to fetch directly (bypasses query matching) |
| max_chars | int | - | Max characters to return (default 10000) |
Returns:
| Field | Type | Description |
|---|---|---|
| url | string | Fetched URL (direct mode) |
| content | string | Extracted text content (direct mode) |
| results | list[dict] | Search results with url/content (search mode) |
| urls_checked | list[str] | URLs that were fetched (search mode) |
Notes:
- Matches queries against ProjectConfig.documentation_urls for known data portals
- Falls back to DuckDuckGo search if no documentation URL match
- HTML text extraction via BeautifulSoup (falls back to regex strip)
scrape_web¶
Class: WebScraperTool | File: core/tools/web_scraper.py
Advanced web scraping with JavaScript rendering. Three operations.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | discover_site, fetch_page, or extract_tables |
| url | string | Y | Target URL |
| section_filter | string | - | Keyword filter for discover_site |
Returns (by operation):
| Operation | Key Fields |
|---|---|
| discover_site | base_url, total_pages, sections (list of name/url/page_count) |
| fetch_page | url, text, content_length, source (direct/google_cache/playwright) |
| extract_tables | url, tables (list of headers/rows), count |
Notes:
- Playwright headless browser for JavaScript-rendered pages
- Falls back to Google cache when direct fetch fails
- Supports PDF extraction
create_visualization¶
Class: CreateVisualizationTool | File: core/tools/visualization.py
Create publication-quality charts, maps, and result tables. 15 chart types including choropleth maps. Logs all metadata (SQL, data, render code) to theme notes for reproducibility.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| chart_type | string | Y | bar, scatter, heatmap, line, box, histogram, waterfall, dumbbell, lollipop, violin, choropleth, dual_choropleth, sankey, table, custom |
| title | string | Y | Chart title |
| filename | string | Y | Output filename (e.g., rq1_chart.png) |
| sql_query | string | - | SQL to fetch data (preferred over data for large datasets) |
| data | list[dict] | - | Pre-computed data rows (fallback if SQL not possible) |
| x_column | string | - | X-axis column name |
| y_column | string | - | Y-axis column name |
| color_column | string | - | Color/grouping column |
| theme_id | string | - | Theme ID for notes logging |
| interpretation | string | - | 2-3 sentence chart interpretation (logged to notes) |
| render_code | string | - | Full Python program for custom chart type |
| datasets | dict | - | Additional named SQL data sources for custom charts |
| fips_column | string | - | FIPS code column (choropleth only, default fips) |
| color_scale | string | - | Color scale: Viridis, RdYlGn, RdBu, Blues, Reds, YlOrRd, Plasma, Inferno |
Returns:
| Field | Type | Description |
|---|---|---|
| chart_path | string | Path to saved PNG/CSV |
| chart_type | string | Chart type used |
| row_count | int | Number of data rows rendered |
Notes:
- sql_query executes via Spark with no row cap (unlike execute_sql's 100-row display limit)
- Chart metadata (SQL, data, render code, interpretation) logged to theme notes file
- custom chart type: agent writes full Python; available variables include df, pd, np, plt, sns, go, px
- Saves to configured results volume path
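Because the custom chart type takes a full Python program in render_code, a minimal sketch of what that string might contain is shown below. It assumes the injected variables listed above (df from sql_query, plus pd, np, plt, sns, go, px); the column names are hypothetical, and the exact save semantics are left to the tool.

```python
# Sketch of a render_code payload for chart_type="custom" (hypothetical columns).
fig, ax = plt.subplots(figsize=(10, 6))
top = df.sort_values("avg_income", ascending=False).head(15)
ax.barh(top["county_name"], top["avg_income"], color="steelblue")
ax.set_xlabel("Average median income (USD)")
ax.set_title("Top 15 counties by income")
ax.invert_yaxis()          # largest value at the top
plt.tight_layout()
# How the figure is persisted to the requested filename is handled by the tool, not shown here.
```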
view_chart¶
Class: ViewChartTool | File: core/tools/view_chart.py
View a previously created chart or list all available charts and tables.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| filename | string | - | Specific file to view. Omit to list all. |
Returns:
| Field | Type | Description |
|---|---|---|
| charts | list[str] | Available PNG files (list mode) |
| tables | list[str] | Available CSV files (list mode) |
| filename | string | Viewed file name (view mode) |
| path | string | Full file path (view mode) |
Notes:
- PNG files displayed inline via base64 in Databricks notebooks
- CSV files returned as formatted text
- Sets image_path on ToolResult for automatic rendering
save_note¶
Class: SaveNoteTool | File: core/tools/save_note.py
Save a research note to a per-theme notes file for reproducibility and audit.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| theme_id | string | Y | Theme identifier (e.g., theme_1, rq_1) |
| note | string | Y | Note content (supports markdown) |
Returns:
| Field | Type | Description |
|---|---|---|
| theme_id | string | Theme ID |
| file | string | Path to notes file |
| timestamp | string | ISO timestamp |
| char_count | int | Note length |
Notes:
- Uses read-then-write pattern for Databricks FUSE compatibility (no append mode)
- One notes file per theme
- Timestamps included for audit trail
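The read-then-write pattern noted above matters because append mode is unreliable on Databricks FUSE mounts. A generic sketch of the pattern (paths and note formatting are illustrative, not the tool's actual code):

```python
from datetime import datetime, timezone
from pathlib import Path

def append_note(notes_path: str, note: str) -> None:
    """Illustrative read-then-write append: read existing content, rewrite the whole file."""
    path = Path(notes_path)
    existing = path.read_text() if path.exists() else ""
    stamp = datetime.now(timezone.utc).isoformat()
    path.write_text(existing + f"\n\n## {stamp}\n{note}\n")  # full rewrite, no append mode
```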
create_custom_tool¶
Class: DynamicToolBuilderTool | File: core/tools/dynamic_tool_builder.py
Create a custom tool at runtime from agent-provided Python code. Registered immediately for use.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| tool_name | string | Y | Name for the new tool |
| tool_description | string | Y | Description visible to agent |
| parameters | dict | Y | JSON Schema for the tool's parameters |
| code | string | Y | Python implementation (receives **kwargs, must return dict) |
Returns:
| Field | Type | Description |
|---|---|---|
| tool_name | string | Created tool name |
| registered | bool | Whether registration succeeded |
| total_custom_tools | int | Count of custom tools in session |
Security guardrails (blocked operations):
- subprocess, os.system: no shell commands
- eval, exec: no dynamic code execution
- open, file I/O: no direct file access
- socket: no network operations
- spark, dbutils: no direct Databricks access
Allowed: pandas, numpy, dict/list/string operations, math, stage_dataframe() bridge function.
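A hedged example of the arguments an agent might supply when building a small custom tool. The parameter names and the kwargs-in / dict-out contract follow the table above; whether the code body must define a function or is executed directly is implementation-specific, so a function form is shown only for readability.

```python
# Hypothetical argument payload; parameter names follow the table above.
create_custom_tool_args = {
    "tool_name": "zscore_values",
    "tool_description": "Z-score normalize a list of numeric values.",
    "parameters": {
        "type": "object",
        "properties": {"values": {"type": "array", "items": {"type": "number"}}},
        "required": ["values"],
    },
    # Implementation receives **kwargs and must return a dict.
    # It uses only allowed imports (numpy); no file, network, shell, or Spark access.
    "code": """
import numpy as np

def run(**kwargs):
    values = np.asarray(kwargs["values"], dtype=float)
    std = values.std() or 1.0
    return {"zscores": ((values - values.mean()) / std).tolist()}
""",
}
```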
Data Engineer Agent Tools¶
Agent: DataEngineerAgent | File: data_agents/engineer/agent.py
Total tools: 14 + 1 pseudo-tool (ask_human) | SQL access: Full (read/write)
Summary¶
| Tool | Purpose |
|---|---|
| explore_volume | Browse Databricks Volume directories |
| extract_archive | Unpack ZIP/GZ/TAR archives |
| read_file_header | Preview file headers and sample rows |
| read_documentation | Read and classify documentation files |
| scan_for_documentation | Find documentation in a directory |
| profile_data | Profile column types, distributions, nulls |
| design_schema | Design Delta table schema from source columns |
| transform_and_load | Transform and stage data for catalog write |
| write_to_catalog | Write staged DataFrames to Unity Catalog |
| rename_columns | Rename columns in Delta tables |
| + 4 shared tools | execute_sql, list_catalog_tables, web_search, create_custom_tool |
explore_volume¶
Class: VolumeExplorerTool | File: data_agents/engineer/tools/volume_explorer.py
List files and subdirectories in a Databricks Volume with metadata.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| path | string | Y | Volume path (e.g., /Volumes/catalog/schema/volume/data) |
| recursive | bool | - | Recurse into subdirectories (default false) |
Returns:
| Field | Type | Description |
|---|---|---|
| path | string | Explored path |
| entry_count | int | Number of entries |
| entries | list[dict] | File/directory entries with name, type, size_bytes, size_mb |
extract_archive¶
Class: FileExtractorTool | File: data_agents/engineer/tools/file_extractor.py
Extract compressed archive files.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | string | Y | Path to archive file |
| dest_path | string | - | Destination directory (default: same as archive) |
Returns:
| Field | Type | Description |
|---|---|---|
| source_archive | string | Archive path |
| destination | string | Extraction directory |
| file_count | int | Number of extracted files |
| files | list[str] | Paths to extracted files |
Supported formats: ZIP (.zip), GZIP (.gz), TAR (.tar), TAR.GZ (.tgz, .tar.gz)
read_file_header¶
Class: FileReaderTool | File: data_agents/engineer/tools/file_reader.py
Read file headers and sample rows from data files.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | string | Y | Path to data file |
| n_rows | int | - | Sample rows to read (default 10) |
| encoding | string | - | Encoding override (auto-detected) |
| separator | string | - | CSV separator override (auto-detected) |
| sheet_name | string | - | Excel sheet name (default: first) |
| skip_rows | int | - | Rows to skip before reading (default 0) |
Returns:
| Field | Type | Description |
|---|---|---|
| columns | list[str] | Column names |
| column_count | int | Number of columns |
| columns_detail | list[dict] | Per-column dtype and sample values |
| sample_data | list[dict] | Sample rows |
| estimated_total_rows | int | Estimated row count |
| file_size_mb | float | File size |
Supported formats: CSV, TSV, Excel (.xls/.xlsx), Parquet, SAS (.sas7bdat), Stata (.dta)
read_documentation¶
Class: DocumentationReaderTool | File: data_agents/engineer/tools/doc_reader.py
Read and classify a documentation file.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | string | Y | Path to documentation file |
| max_chars | int | - | Max characters to return (default 15000) |
Returns:
| Field | Type | Description |
|---|---|---|
| file_path | string | File path |
| doc_type | string | Detected type (e.g., markdown, pdf) |
| classification | string | data_dictionary, readme, schema_documentation, api_reference, general_documentation |
| content | string | Extracted text content |
| truncated | bool | Whether content was truncated |
Supported formats: TXT, MD, HTML, PDF, CSV, Excel
scan_for_documentation¶
Class: ScanForDocumentationTool | File: data_agents/engineer/tools/doc_reader.py
Scan a directory for documentation files, prioritized by type.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| path | string | Y | Directory path to scan |
Returns:
| Field | Type | Description |
|---|---|---|
| documentation_files | list[dict] | Found files with filename, path, priority, doc_type |
| count | int | Number of files found |
Priority order: README > DICTIONARY/DATA_DICTIONARY > SCHEMA > other docs
profile_data¶
Class: DataProfilerTool | File: data_agents/engineer/tools/data_profiler.py
Profile a data file's structure: column types, distributions, missing values, outliers.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | string | Y | Path to data file |
| sample_size | int | - | Rows to sample (default 500) |
| encoding | string | - | Encoding override |
| separator | string | - | Separator override |
Returns:
| Field | Type | Description |
|---|---|---|
| rows_sampled | int | Actual rows sampled |
| column_count | int | Number of columns |
| columns | list[str] | Column names |
| column_profiles | list[dict] | Per-column: dtype, null_count, unique_count, min/max/mean/median/std, sample_values |
| potential_fips_columns | list[str] | Columns detected as FIPS codes |
| potential_geo_columns | list[str] | Columns detected as geographic |
| memory_usage_mb | float | Estimated memory usage |
Heuristics:
- FIPS: columns with "fips" in name or 5-digit zero-padded patterns
- Geographic: columns named "state", "county", "city", "zip", etc.
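A rough sketch of the FIPS heuristic described above; the detection logic inside DataProfilerTool may differ in detail.

```python
import pandas as pd

def looks_like_fips(series: pd.Series, name: str) -> bool:
    """Heuristic sketch: name contains 'fips', or values are 5-digit zero-padded codes."""
    if "fips" in name.lower():
        return True
    sample = series.dropna().astype(str).head(100)
    return bool(len(sample)) and sample.str.fullmatch(r"\d{5}").mean() > 0.9
```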
design_schema¶
Class: SchemaDesignerTool | File: data_agents/engineer/tools/schema_designer.py
Design a Delta table schema from source file columns. Auto-infers types, detects cryptic names, generates CREATE TABLE SQL.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_name | string | Y | Source name key (must match a profiled file) |
| table_name | string | Y | Target table name (e.g., silver_county_demographics) |
| description | string | - | Table description |
| column_names | list[str] | - | Subset of columns to include (default: all) |
| join_key_source_column | string | - | Column to use as primary/join key |
| type_overrides | dict | - | Manual type overrides: {col: data_type} |
| name_overrides | dict | - | Manual name overrides: {source: target} |
| columns | list[dict] | - | Full column definitions with transform expressions |
| partition_columns | list[str] | - | Partition columns (e.g., ["year", "state"]) |
Returns:
| Field | Type | Description |
|---|---|---|
| schema | dict | Full schema with column definitions |
| create_table_sql | string | Executable CREATE TABLE SQL |
| warnings | list[str] | Cryptic names, missing join keys, etc. |
| column_count | int | Number of columns in schema |
| has_join_key | bool | Whether a join key was identified |
Auto-detection: pct_/rate_ -> DOUBLE; id/key -> STRING; date patterns -> DATE/TIMESTAMP
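A sketch of the kind of name-based inference described above; the real SchemaDesignerTool rules may be broader (for example, inspecting sample values as well as names).

```python
def infer_sql_type(column_name: str) -> str:
    """Illustrative name-based inference: pct_/rate_ -> DOUBLE, id/key -> STRING, date -> DATE."""
    name = column_name.lower()
    if name.startswith(("pct_", "rate_")):
        return "DOUBLE"
    if name.endswith(("_id", "_key")) or name in ("id", "key"):
        return "STRING"
    if "timestamp" in name or name.endswith("_ts"):
        return "TIMESTAMP"
    if "date" in name or name.endswith("_dt"):
        return "DATE"
    return "STRING"  # conservative default for anything unrecognized
```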
transform_and_load¶
Class: DataTransformerTool | File: data_agents/engineer/tools/data_transformer.py
Transform source data per designed schema and stage for catalog write. Supports single-file and batch modes.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_name | string | Y | Source key (must match a designed schema) |
| file_path | string | - | Source file (single-file mode) |
| files | list[dict] | - | Batch mode: list of {file_path, encoding, separator} |
| source_year | string | - | Year label (e.g., 2024) |
| column_overrides | dict | - | Per-column transform overrides |
| encoding | string | - | Encoding override |
| separator | string | - | Separator override |
| skip_rows | int | - | Rows to skip |
| sheet_name | string | - | Excel sheet |
| append | bool | - | Append to staged data (default true) |
Returns:
| Field | Type | Description |
|---|---|---|
| original_rows | int | Source row count |
| transformed_rows | int | Rows after transform |
| total_staged_rows | int | Cumulative staged rows |
| columns_mapped | int | Columns successfully mapped |
| auto_flush | bool | Whether auto-flush triggered |
| sample_data | list[dict] | Sample transformed rows |
Notes:
- Auto-flush to parquet at 30M row threshold to prevent OOM
- Column mapping, type casting, null handling per schema
- Batch mode normalizes multiple files to single schema
write_to_catalog¶
Class: CatalogWriterTool | File: core/tools/catalog_writer.py
Write staged DataFrames to Unity Catalog as managed Delta tables.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| source_name | string | Y | Staged data key |
| table_name | string | Y | Target table (e.g., silver_county_demographics) |
| mode | string | - | overwrite (default) or append |
Returns:
| Field | Type | Description |
|---|---|---|
| table_name | string | Full catalog.schema.table name |
| rows_written | int | Rows written |
| verified_row_count | int | Post-write verification count |
| method | string | spark or sdk |
Notes:
- Direct write up to 2M rows; stages to parquet above that threshold
- Post-write verification via COUNT(*) query
- Spark first, Databricks SDK fallback
rename_columns¶
Class: RenameColumnsTool | File: core/tools/column_renamer.py
Rename columns in a Delta table via ALTER TABLE (metadata-only, no data rewrite).
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| table_name | string | Y | Table name (without catalog.schema prefix) |
| column_renames | dict | Y | {old_name: new_name} mapping |
Returns:
| Field | Type | Description |
|---|---|---|
| table_name | string | Full table name |
| succeeded | int | Columns renamed |
| failed | list | Columns that failed |
Notes: Metadata-only operation via ALTER TABLE RENAME COLUMN; instant, with no data rewrite.
Data Scientist Agent Tools¶
Agent: DataScientistAgent | File: science_agents/scientist/agent.py
Total tools: 14-15 | SQL access: Silver-only (write-protected)
Summary¶
| Tool | Purpose |
|---|---|
| statistical_analysis | 8 analysis types: describe, distribution, hypothesis test, correlation, effect size, data quality, assumptions, Bayesian |
| fit_model | 9 model types: regression, classification, clustering, time series, counterfactual, Bayesian |
| check_confounders | Detect Simpson's Paradox and confounding variables |
| validate_silver | 6 data quality checks for silver-layer tables |
| validate_statistics | 4 checks: multiple comparisons, multicollinearity, ecological fallacy, robustness |
| review_literature | Search, fetch, and compare published research |
| save_finding | Save structured findings to JSON for StoryTeller |
| log_model | Log trained models to MLflow (conditional) |
| + 8 shared tools | execute_sql, list_catalog_tables, web_search, scrape_web, create_visualization, view_chart, save_note, create_custom_tool |
statistical_analysis¶
Class: StatisticalAnalysisTool | File: science_agents/scientist/tools/statistical_analysis.py
Perform statistical analysis with 8 analysis types.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| analysis_type | string | Y | describe, distribution, hypothesis_test, correlation, effect_size, data_quality, assumption_check, bayesian_test |
| sql_query | string | Y | SELECT query (tool fetches all rows directly) |
| columns | list[str] | - | Columns to analyze (default: all numeric) |
| group_column | string | - | Grouping column (hypothesis_test, effect_size) |
| value_column | string | - | Value column (hypothesis_test, effect_size) |
| method | string | - | Specific method (see below) |
| confidence_level | float | - | Confidence level (default 0.95) |
| prior | dict | - | Bayesian prior: {mean, std}, {alpha, beta} |
| rope | list[float] | - | Region of Practical Equivalence [low, high] |
Methods by analysis type:
| Analysis Type | Available Methods |
|---|---|
| hypothesis_test | ttest_ind, ttest_rel, mannwhitney, chi_square, anova, kruskal |
| correlation | pearson, spearman |
| assumption_check | regression, ttest, anova, chi_square |
| bayesian_test | bayesian_ttest, bayesian_proportion, bayesian_correlation |
Returns (key fields by type):
| Analysis Type | Key Return Fields |
|---|---|
| describe | Per-column: count, mean, median, std, min, max, q25, q75, skew, kurtosis |
| distribution | Normality tests (Shapiro-Wilk, D'Agostino), distribution fitting |
| hypothesis_test | test_statistic, p_value, result (REJECT_NULL / FAIL_TO_REJECT) |
| correlation | Correlation matrix with p-values |
| effect_size | Cohen's d, rank-biserial, magnitude classification |
| data_quality | Missing rates, outlier detection, cardinality |
| assumption_check | Normality, homoscedasticity, independence checks |
| bayesian_test | posterior_mean, credible_interval, bayes_factor, rope_analysis |
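A hedged example of a hypothesis-test call. The invocation wrapper, table, and column names are hypothetical; the parameter names and key return fields come from the tables above.

```python
# Hypothetical call shape; parameters follow the statistical_analysis tables above.
args = {
    "analysis_type": "hypothesis_test",
    "method": "ttest_ind",
    "sql_query": "SELECT urban_rural, graduation_rate FROM silver_county_outcomes",
    "group_column": "urban_rural",
    "value_column": "graduation_rate",
    "confidence_level": 0.95,
}
result = tools["statistical_analysis"].execute(**args)
if result.success:
    stats = result.data
    print(stats["test_statistic"], stats["p_value"], stats["result"])  # REJECT_NULL / FAIL_TO_REJECT
```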
fit_model¶
Class: ModelFittingTool | File: science_agents/scientist/tools/model_fitting.py
Fit statistical and ML models with 9 model types.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_type | string | Y | linear_regression, logistic_regression, random_forest, gradient_boosting, kmeans, time_series, counterfactual, cross_validate, bayesian_regression |
| sql_query | string | Y | SELECT statement (tool fetches all rows via Spark/SDK) |
| target_column | string | - | Target variable |
| feature_columns | list[str] | - | Feature variables |
| time_column | string | - | Time column (time_series only) |
| parameters | dict | - | Model-specific params (e.g., {n_clusters: 4}, {scenarios: [...]}) |
Returns (key fields by model type):
| Model Type | Key Return Fields |
|---|---|
| linear_regression | r_squared, coefficients, p_values, vif_scores, f_statistic |
| logistic_regression | accuracy, auc, precision, recall, confusion matrix |
| random_forest | accuracy, auc, feature_importance |
| gradient_boosting | accuracy, feature_importance |
| kmeans | Silhouette scores, cluster profiles |
| time_series | Trend decomposition, change points, forecasts |
| counterfactual | scenario, predicted_outcome, interpretation |
| cross_validate | Multi-model comparison metrics |
| bayesian_regression | coefficients (mean/std), credible_intervals, probability_direction |
Notes:
- Always use sql_query — the tool fetches all rows directly via Spark/SDK. Do not serialize data inline.
- VIF (Variance Inflation Factor) computed automatically for linear regression to detect multicollinearity
- counterfactual enables what-if scenario analysis
- bayesian_regression supports informative priors from published research
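A sketch of a linear-regression call; the wrapper and column names are hypothetical, while the parameter names and returned fields follow the tables above. Note that sql_query, not inline data, is the documented way to feed the model.

```python
# Hypothetical call shape; sql_query (not inline data) is the documented input path.
args = {
    "model_type": "linear_regression",
    "sql_query": (
        "SELECT median_income, pct_broadband, pct_college, graduation_rate "
        "FROM silver_county_outcomes"
    ),
    "target_column": "graduation_rate",
    "feature_columns": ["median_income", "pct_broadband", "pct_college"],
}
result = tools["fit_model"].execute(**args)
if result.success:
    fit = result.data
    print(fit["r_squared"], fit["coefficients"], fit["vif_scores"])  # high VIF flags multicollinearity
```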
check_confounders¶
Class: CheckConfoundersTool | File: science_agents/scientist/tools/check_confounders.py
Detect Simpson's Paradox by decomposing aggregate relationships into subgroups.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| sql_query | string | Y | SELECT statement (tool fetches all rows via Spark/SDK) |
| outcome_column | string | Y | Outcome/dependent variable |
| predictor_column | string | Y | Predictor/independent variable |
| grouping_columns | list[str] | Y | Potential confounder columns to stratify by |
Returns:
| Field | Type | Description |
|---|---|---|
| aggregate | dict | Overall correlation, direction, strength |
| decompositions | dict | Per-group correlations for each grouping column |
| paradox_detected | bool | Whether Simpson's Paradox found |
| paradox_type | string | direction_reversal, strength_masking, or strength_reversal |
| strongest_confounder | string | Most impactful grouping column |
| recommendation | string | Interpretation and guidance |
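At its core the check compares the aggregate correlation with per-group correlations; a generic pandas sketch of that decomposition (not the tool's actual code) is shown below.

```python
import pandas as pd

def decompose_correlation(df: pd.DataFrame, outcome: str, predictor: str, group: str) -> dict:
    """Sketch: aggregate vs. per-group correlation; a sign flip suggests Simpson's Paradox."""
    aggregate_r = df[outcome].corr(df[predictor])
    per_group = {
        g: sub[outcome].corr(sub[predictor])
        for g, sub in df.groupby(group)
        if len(sub) >= 3  # skip groups too small to correlate
    }
    flipped = any(pd.notna(r) and r * aggregate_r < 0 for r in per_group.values())
    return {"aggregate": aggregate_r, "per_group": per_group, "direction_reversal": flipped}
```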
validate_silver¶
Class: ValidateSilverTool | File: science_agents/scientist/tools/validate_silver.py
Validate silver-layer data quality with 6 check types.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| check_type | string | Y | grain, enrollment_sanity, year_alignment, join_completeness, value_ranges, zero_columns |
| data | list[dict] | Y | Data rows |
| primary_key_columns | list[str] | - | PK columns (for grain) |
| enrollment_columns | list[str] | - | Enrollment columns (for enrollment_sanity) |
| max_enrollment_per_row | int | - | Max reasonable value (default 1M) |
| year_column_left | string | - | Left year column (for year_alignment) |
| year_column_right | string | - | Right year column |
| expected_offset | int | - | Expected year offset (default -1) |
| left_count | int | - | Left table rows (for join_completeness) |
| matched_count | int | - | Matched rows |
| column_ranges | dict | - | Expected ranges: {col: {min, max}} |
Returns:
| Field | Type | Description |
|---|---|---|
| issues_found | bool | Whether any issues detected |
| issue_count | int | Number of issues |
| details | list[str] | Issue descriptions |
| suggested_fix | string | Recommended fix |
validate_statistics¶
Class: ValidateStatisticsTool | File: science_agents/scientist/tools/validate_statistics.py
Validate statistical analyses for common pitfalls.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| check_type | string | Y | multiple_comparisons, multicollinearity, ecological_fallacy, robustness |
| data | list[dict] | - | Data (for multicollinearity/robustness) |
| p_values | list[float] | - | P-values (for multiple_comparisons) |
| alpha | float | - | Significance threshold (default 0.05) |
| feature_columns | list[str] | - | Features (for multicollinearity) |
| outcome_column | string | - | Outcome (for robustness) |
| predictor_column | string | - | Predictor (for ecological_fallacy) |
| data_fine_grain | list[dict] | - | Individual-level data (for ecological_fallacy) |
Returns (by check type):
| Check Type | Key Return Fields |
|---|---|
| multiple_comparisons | bonferroni_corrected, bh_corrected, per-p-value significance |
| multicollinearity | vif_scores, high_vif (columns with VIF > 5) |
| ecological_fallacy | aggregate_correlation, individual_correlation, risk level |
| robustness | Sensitivity to outlier removal, stability assessment |
review_literature¶
Class: LiteratureReviewTool | File: science_agents/scientist/tools/literature_review.py
Search for and compare published research.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | search, fetch_article, compare_findings |
| query | string | - | Search query (for search) |
| url | string | - | Article URL (for fetch_article) |
| own_finding | string | - | Your finding (for compare_findings) |
| published_finding | string | - | Published finding to compare against |
| source_title | string | - | Published source title |
| max_results | int | - | Max search results (default 10) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| search | results list with title, authors, year, url, abstract |
| fetch_article | title, authors, abstract, full_text, key_findings |
| compare_findings | alignment (ALIGNED/CONTRADICTS/NOVEL), similarity_score, differences |
save_finding¶
Class: SaveFindingTool | File: science_agents/scientist/tools/save_finding.py
Save a structured research finding to findings JSON for the StoryTeller agent.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| research_question_id | string | Y | RQ identifier (e.g., rq_1) |
| title | string | Y | Short finding title |
| finding | string | Y | Finding statement |
| evidence | string | Y | Supporting evidence (stats, p-values, N) |
| significance | string | Y | high, medium, or low |
| visualization_path | string | - | Path to associated chart |
Returns:
| Field | Type | Description |
|---|---|---|
| research_question_id | string | RQ ID |
| title | string | Finding title |
| timestamp | string | ISO timestamp |
| index | int | Finding index in JSON file |
log_model¶
Class: LogModelTool | File: science_agents/scientist/tools/log_model.py
Log a trained model to MLflow with metrics and optional Unity Catalog registration.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model_type | string | Y | Model type (e.g., linear_regression) |
| data | list[dict] | Y | Training data |
| target_column | string | Y | Target column |
| feature_columns | list[str] | Y | Feature columns |
| metrics | dict | Y | Metrics to log (e.g., {r_squared: 0.85}) |
| model_name | string | Y | Registered model name |
| parameters | dict | - | Hyperparameters |
| tags | dict | - | Custom MLflow tags |
Returns:
| Field | Type | Description |
|---|---|---|
| mlflow_run_id | string | MLflow run ID |
| model_uri | string | Model URI for loading |
| registered_model | string | Registered model name |
| metrics_logged | dict | Logged metrics |
| top_features | list[str] | Top features by importance |
Notes:
- Only registered when cfg.mlflow_experiment is configured
- Optional Unity Catalog model registration for production deployment
StoryTeller Agent Tools¶
Agent: StoryTellerAgent | File: story_agents/storyteller/agent.py
Total tools: 14 | SQL access: Silver-only (SELECT in practice)
Summary¶
| Tool | Purpose |
|---|---|
| read_findings | Read findings saved by Data Scientist |
| read_chart | Read chart metadata from results directory |
| read_table | Read CSV result tables |
| write_narrative | Write, read, and assemble narrative sections |
| evaluate_evidence | Score evidence strength, curate findings |
| cite_source | Manage citations and references |
| + 8 shared tools | execute_sql, list_catalog_tables, web_search, scrape_web, create_visualization, view_chart, save_note, create_custom_tool |
read_findings¶
Class: ReadFindingsTool | File: story_agents/storyteller/tools/read_findings.py
Read research findings saved by the Data Scientist agent. 5 operations.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | list, get, by_theme, high_significance, search |
| index | int | - | Finding index (for get) |
| theme_id | string | - | Theme/RQ ID (for by_theme) |
| query | string | - | Search query (for search) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| list | findings list (index, title, significance), total_count |
| get | Full finding: title, finding, evidence, significance, visualization_path |
| by_theme | Filtered findings for theme |
| high_significance | Only high/medium significance findings |
| search | Keyword-matched findings with relevance scores |
read_chart¶
Class: ReadChartTool | File: story_agents/storyteller/tools/read_chart.py
Read chart metadata from the results directory.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | list, by_theme, metadata |
| theme_id | string | - | Theme ID (for by_theme) |
| chart_filename | string | - | Chart filename (for metadata) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| list | charts list (filename, theme_id, title), total_count |
| by_theme | Charts filtered by theme |
| metadata | Full metadata: chart_type, x/y columns, row_count, interpretation |
read_table¶
Class: ReadTableTool | File: story_agents/storyteller/tools/read_table.py
Read CSV result tables from the results directory.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | list, read, summary |
| filename | string | - | Table filename (for read/summary) |
| max_rows | int | - | Max rows to return (default 100) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| list | tables list (filename, row_count, column_count), total_count |
| read | columns, rows (as list[dict]), row_count |
| summary | Per-column statistics: mean, std, min, max |
write_narrative¶
Class: WriteNarrativeTool | File: story_agents/storyteller/tools/write_narrative.py
Write, read, update, and assemble narrative sections for the final report. 5 operations.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | write_section, read_section, update_section, list_sections, assemble |
| section_id | string | - | Section identifier (e.g., introduction, rq_1_findings) |
| title | string | - | Section title (for write_section) |
| content | string | - | Markdown content (for write_section/update_section) |
| sequence | int | - | Order in final document (for write_section) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| write_section | section_id, sequence, created |
| read_section | title, content, sequence |
| list_sections | sections list (id, title, sequence), total_count |
| assemble | document_path, table_of_contents, section_count, word_count |
Notes:
- assemble generates the final markdown report with auto-generated table of contents
- Sections ordered by sequence number in final document
evaluate_evidence¶
Class: EvaluateEvidenceTool | File: story_agents/storyteller/tools/evaluate_evidence.py
Score evidence strength and curate findings for narrative sections.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | evaluate or curate |
| finding | dict | - | Single finding (for evaluate) |
| findings | list[dict] | - | Multiple findings (for curate) |
| purpose | string | - | Section purpose for curation (e.g., lead finding for introduction) |
| max_findings | int | - | Max curated findings (default 5) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| evaluate | tier, tier_description, usable_as_lead, usable_as_support, effect_size, p_value |
| curate | curated list (ranked by tier), lead_candidates, support_candidates |
Evidence Tier Classification:
| Tier | Criteria | Suitable For |
|---|---|---|
| DEFINITIVE | p < 0.001, large effect size | Primary conclusions |
| STRONG | p < 0.01, medium+ effect | Leading paragraphs |
| SUGGESTIVE | p < 0.05 | Supporting evidence |
| CONTEXTUAL | Descriptive, no test | Background context |
| WEAK | p >= 0.05, negligible effect | Limitations |
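A sketch of how the tier thresholds above could be applied. The effect-size cutoffs (0.8 for large, 0.5 for medium, following Cohen's d conventions) are assumptions; the actual EvaluateEvidenceTool may weigh additional signals.

```python
from typing import Optional

def classify_tier(p_value: Optional[float], effect_size: Optional[float]) -> str:
    """Illustrative mapping of the tier table: p-value plus effect-size magnitude."""
    if p_value is None:
        return "CONTEXTUAL"                      # descriptive finding, no statistical test
    effect = abs(effect_size) if effect_size is not None else 0.0
    if p_value < 0.001 and effect >= 0.8:
        return "DEFINITIVE"
    if p_value < 0.01 and effect >= 0.5:
        return "STRONG"
    if p_value < 0.05:
        return "SUGGESTIVE"
    return "WEAK"                                # non-significant or negligible effect
```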
cite_source¶
Class: CiteSourceTool | File: story_agents/storyteller/tools/cite_source.py
Manage citations and references for the narrative report.
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| operation | string | Y | add, list, format, search |
| title | string | - | Source title (for add) |
| url | string | - | Source URL (for add) |
| author | string | - | Author (for add) |
| year | string | - | Publication year (for add) |
| description | string | - | Brief description (for add) |
| cite_key | string | - | Citation key (for format) |
| query | string | - | Search query (for search) |
Returns (by operation):
| Operation | Key Return Fields |
|---|---|
| add | cite_key (auto-generated), added |
| list | sources list (cite_key, title, author, year), total_count |
| format | citation_formats with apa, chicago, inline |
| search | results with relevance scores |