Skip to main content

Automatically extract structured facts, insights, and Q/A pairs from tabular datasets

Project description

StatQA

CI Documentation PyPI version Python 3.11+ License: MIT

StatQA is a modern Python framework for automatically extracting structured facts, statistical insights, and Q/A pairs from tabular datasets. It converts raw columns and values into clear, human-readable statements, enabling rapid knowledge discovery, RAG corpus construction, and LLM training.

🎯 Key Features

  • 📋 Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats
  • 🤖 LLM-Powered Enrichment: Automatically infer variable types and relationships
  • 📊 Comprehensive Statistical Analysis:
    • Univariate: descriptive statistics, distribution tests, robust estimators
    • Bivariate: correlations, chi-square, group comparisons with effect sizes
    • Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis
    • Causal: regression with confounding control, sensitivity analysis
  • 💬 Natural Language Insights: Convert statistics to publication-ready text
  • ❓ Q/A Generation: Create training data for LLMs with template-based and LLM-paraphrased questions
  • 🔍 Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types)
  • 📈 Publication-Quality Visualizations: Automated plots for all analyses
  • 🔬 Statistical Rigor: Multiple testing correction, effect sizes, normality tests
  • ⚡ Modern Python: Type-safe (Pydantic), async-ready, fully typed

📦 Installation

Basic Installation

pip install statqa

With Optional Features

# Include LLM support (OpenAI/Anthropic)
pip install statqa[llm]

# Include PDF parsing
pip install statqa[pdf]

# Development installation
pip install statqa[dev]

# Complete installation
pip install statqa[all]

From Source

git clone https://github.com/gojiplus/statqa.git
cd statqa
pip install -e ".[dev]"

🚀 Quick Start

1. Create a Codebook

from statqa.metadata.parsers import TextParser

codebook_text = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999

# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
  1: Very Dissatisfied
  2: Dissatisfied
  3: Neutral
  4: Satisfied
  5: Very Satisfied
"""

parser = TextParser()
codebook = parser.parse(codebook_text)

2. Run Statistical Analyses

import pandas as pd
from statqa.analysis import UnivariateAnalyzer, BivariateAnalyzer

# Load your data
data = pd.read_csv("survey_data.csv")

# Univariate analysis
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])

print(result)
# Output: {'mean': 42.5, 'median': 41.0, 'std': 12.3, ...}

# Bivariate analysis
biv_analyzer = BivariateAnalyzer()
result = biv_analyzer.analyze(
    data,
    codebook.variables["age"],
    codebook.variables["satisfaction"]
)

3. Generate Natural Language Insights

from statqa.interpretation import InsightFormatter

formatter = InsightFormatter()
insight = formatter.format_univariate(result)

print(insight)
# Output: "**Respondent Age**: mean=42.5, median=41.0, std=12.3, range=[18, 95]. N=1,000 [2.3% outliers]."

4. Create Q/A Pairs for LLM Training

from statqa.qa import QAGenerator

qa_gen = QAGenerator(use_llm=False)  # Template-based
qa_pairs = qa_gen.generate_qa_pairs(result, insight)

for qa in qa_pairs:
    print(f"Q: {qa['question']}")
    print(f"A: {qa['answer']}")
    print(f"Provenance: {qa['provenance']}\n")

Each Q/A pair includes provenance metadata tracking:

  • When the answer was generated (timestamp)
  • What tool was used (statqa version)
  • What compute was performed (analysis type, analyzer)
  • How it was generated (template vs. LLM paraphrase)
  • Which LLM was used (if applicable)

🎨 Complete Pipeline Example

from statqa import Codebook, UnivariateAnalyzer
from statqa.metadata.parsers import CSVParser
from statqa.interpretation import InsightFormatter
from statqa.qa import QAGenerator
from statqa.utils.io import load_data, save_json

# 1. Parse codebook
parser = CSVParser()
codebook = parser.parse("codebook.csv")

# 2. Load data
data = load_data("data.csv")

# 3. Run analyses
analyzer = UnivariateAnalyzer()
results = analyzer.batch_analyze(data, codebook.variables)

# 4. Format insights
formatter = InsightFormatter()
for result in results:
    result["insight"] = formatter.format_insight(result)

# 5. Generate Q/A pairs
qa_gen = QAGenerator(use_llm=True, api_key="your-api-key")
qa_results = qa_gen.generate_batch(
    results,
    [r["insight"] for r in results]
)

# 6. Export for LLM fine-tuning
lines = qa_gen.export_qa_dataset(qa_results, format="openai")
with open("training_data.jsonl", "w") as f:
    f.write("\n".join(lines))

📝 Q/A Provenance Tracking

Every Q/A pair generated by StatQA includes detailed provenance metadata to ensure reproducibility and traceability:

{
  "question": "What is the average Respondent Age?",
  "answer": "The mean age is 42.5 years (median=41.0, std=12.3).",
  "type": "descriptive",
  "provenance": {
    "generated_at": "2025-11-19T10:30:45.123456+00:00",
    "tool": "statqa",
    "tool_version": "0.1.0",
    "generation_method": "template",
    "analysis_type": "univariate",
    "analyzer": "UnivariateAnalyzer"
  }
}

Provenance Fields

Field Description Example Values
generated_at ISO 8601 timestamp (UTC) 2025-11-19T10:30:45+00:00
tool Software used for generation statqa
tool_version Version of statqa 0.1.0
generation_method How the Q/A was created template, llm_paraphrase
analysis_type Statistical analysis performed univariate, bivariate, temporal, causal
analyzer Specific analyzer class used UnivariateAnalyzer, BivariateAnalyzer
llm_model LLM model (if applicable) gpt-4, claude-3-opus

This provenance tracking enables:

  • Reproducibility: Recreate Q/A pairs from original data
  • Quality Control: Filter by generation method or analysis type
  • Auditing: Track when and how answers were computed
  • Citation: Properly attribute computational methods in research

🖥️ Command-Line Interface

StatQA provides a powerful CLI for common workflows:

# Parse a codebook
statqa parse-codebook codebook.csv --output codebook.json --enrich

# Run full analysis pipeline
statqa analyze data.csv codebook.json --output-dir results/ --plots

# Generate Q/A pairs
statqa generate-qa results/all_insights.json --output qa_pairs.jsonl --llm

# Complete pipeline
statqa pipeline data.csv codebook.csv --output-dir output/ --enrich --qa

📊 Supported Analyses

Univariate Statistics

  • Central tendency: mean, median, mode
  • Dispersion: std, IQR, MAD (robust)
  • Distribution: skewness, kurtosis, normality tests
  • Categorical: frequencies, entropy, diversity indices

Bivariate Relationships

  • Numeric × Numeric: Pearson/Spearman correlation, effect sizes
  • Categorical × Categorical: Chi-square, Cramér's V
  • Categorical × Numeric: t-tests, ANOVA, Cohen's d

Temporal Analysis

  • Trend detection: Mann-Kendall test, linear regression
  • Change point detection
  • Year-over-year comparisons
  • Seasonal decomposition

Causal Inference

  • Regression with control variables
  • Confounder identification
  • Sensitivity analysis
  • Treatment effect estimation

🔧 Advanced Features

LLM-Powered Metadata Enrichment

from statqa.metadata import MetadataEnricher

enricher = MetadataEnricher(provider="openai", api_key="your-key")
enriched_codebook = enricher.enrich_codebook(codebook)

# LLM infers variable types, suggests relationships, identifies confounders

Multiple Testing Correction

from statqa.utils.stats import correct_multiple_testing

p_values = [0.03, 0.01, 0.15, 0.002]
reject, corrected_p = correct_multiple_testing(p_values, method="fdr_bh")

Custom Visualizations

from statqa.visualization import PlotFactory

plotter = PlotFactory(style="publication", figsize=(10, 6))
fig = plotter.plot_bivariate(data, var1, var2, output_path="plot.png")

📚 Documentation

🧪 Development

Running Tests

pytest --cov=statqa --cov-report=html

Code Quality

# Linting
ruff check statqa tests

# Type checking
mypy statqa

# Formatting
black statqa tests

Building Documentation

cd docs
make html

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run tests and linting
  5. Commit (git commit -m 'Add amazing feature')
  6. Push (git push origin feature/amazing-feature)
  7. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with modern Python tools: Pydantic, pandas, statsmodels, typer
  • Inspired by survey data analysis workflows (ANES, GSS, etc.)
  • Statistical methods from standard social science practice

📬 Contact & Support

🗺️ Roadmap

  • Support for additional codebook formats (SPSS, Stata, SAS)
  • Web interface for interactive analysis
  • Integration with popular survey platforms
  • Advanced causal inference methods (instrumental variables, DiD)
  • Automated report generation (Markdown, LaTeX, HTML)
  • Cloud deployment templates

Made with ❤️ for data scientists, researchers, and LLM engineers

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

statqa-0.1.0.tar.gz (51.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

statqa-0.1.0-py3-none-any.whl (54.1 kB view details)

Uploaded Python 3

File details

Details for the file statqa-0.1.0.tar.gz.

File metadata

  • Download URL: statqa-0.1.0.tar.gz
  • Upload date:
  • Size: 51.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for statqa-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b1f4418f6fcb3fc174a73949577e7e1fc008356eaae4e4d25a8ca16e6aba0a3f
MD5 3e5b63be9970906241751bb0e31c9245
BLAKE2b-256 6e0dc9c9223d3fb9048a0f0c1265bc1823264ad65093d523c883aba331f1d824

See more details on using hashes here.

File details

Details for the file statqa-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: statqa-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 54.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for statqa-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 af97c52774b75010014d33896cbd4feb7a067745cdb12904e003920cbac0358a
MD5 593e74a0ebabe85618f4446d9e90259d
BLAKE2b-256 f4b32d2b405ac1ce411027a10b156d51bb1ecacf75fde3856f29ac35494adbfd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page