Automatically extract structured facts, insights, and Q/A pairs from tabular datasets

These details have not been verified by PyPI

Project links

Project description

StatQA

StatQA is a modern Python framework for automatically extracting structured facts, statistical insights, and Q/A pairs from tabular datasets. It converts raw columns and values into clear, human-readable statements, enabling rapid knowledge discovery, RAG corpus construction, and LLM training.

🎯 Key Features

📋 Flexible Metadata Parsing: Parse codebooks from text, CSV, or PDF formats
🤖 LLM-Powered Enrichment: Automatically infer variable types and relationships
📊 Comprehensive Statistical Analysis:
- Univariate: descriptive statistics, distribution tests, robust estimators
- Bivariate: correlations, chi-square, group comparisons with effect sizes
- Temporal: trend detection (Mann-Kendall), change points, year-over-year analysis
- Causal: regression with confounding control, sensitivity analysis
💬 Natural Language Insights: Convert statistics to publication-ready text
❓ Q/A Generation: Create training data for LLMs with template-based and LLM-paraphrased questions
🔍 Provenance Tracking: Full metadata for reproducibility (timestamps, tools, methods, analysis types)
📈 Publication-Quality Visualizations: Automated plots for all analyses
🔬 Statistical Rigor: Multiple testing correction, effect sizes, normality tests
⚡ Modern Python: Type-safe (Pydantic), async-ready, fully typed

📦 Installation

Basic Installation

pip install statqa

With Optional Features

# Include LLM support (OpenAI/Anthropic)
pip install statqa[llm]

# Include PDF parsing
pip install statqa[pdf]

# Development installation
pip install statqa[dev]

# Complete installation
pip install statqa[all]

From Source

git clone https://github.com/gojiplus/statqa.git
cd statqa
pip install -e ".[dev]"

🚀 Quick Start

1. Create a Codebook

from statqa.metadata.parsers import TextParser

codebook_text = """
# Variable: age
Label: Respondent Age
Type: numeric_continuous
Units: years
Range: 18-99
Missing: -1, 999

# Variable: satisfaction
Label: Job Satisfaction
Type: categorical_ordinal
Values:
  1: Very Dissatisfied
  2: Dissatisfied
  3: Neutral
  4: Satisfied
  5: Very Satisfied
"""

parser = TextParser()
codebook = parser.parse(codebook_text)

2. Run Statistical Analyses

import pandas as pd
from statqa.analysis import UnivariateAnalyzer, BivariateAnalyzer

# Load your data
data = pd.read_csv("survey_data.csv")

# Univariate analysis
analyzer = UnivariateAnalyzer()
result = analyzer.analyze(data["age"], codebook.variables["age"])

print(result)
# Output: {'mean': 42.5, 'median': 41.0, 'std': 12.3, ...}

# Bivariate analysis
biv_analyzer = BivariateAnalyzer()
result = biv_analyzer.analyze(
    data,
    codebook.variables["age"],
    codebook.variables["satisfaction"]
)

3. Generate Natural Language Insights

from statqa.interpretation import InsightFormatter

formatter = InsightFormatter()
insight = formatter.format_univariate(result)

print(insight)
# Output: "**Respondent Age**: mean=42.5, median=41.0, std=12.3, range=[18, 95]. N=1,000 [2.3% outliers]."

4. Create Q/A Pairs for LLM Training

from statqa.qa import QAGenerator

qa_gen = QAGenerator(use_llm=False)  # Template-based
qa_pairs = qa_gen.generate_qa_pairs(result, insight)

for qa in qa_pairs:
    print(f"Q: {qa['question']}")
    print(f"A: {qa['answer']}")
    print(f"Provenance: {qa['provenance']}\n")

Each Q/A pair includes provenance metadata tracking:

When the answer was generated (timestamp)
What tool was used (statqa version)
What compute was performed (analysis type, analyzer)
How it was generated (template vs. LLM paraphrase)
Which LLM was used (if applicable)

🎨 Complete Pipeline Example

from statqa import Codebook, UnivariateAnalyzer
from statqa.metadata.parsers import CSVParser
from statqa.interpretation import InsightFormatter
from statqa.qa import QAGenerator
from statqa.utils.io import load_data, save_json

# 1. Parse codebook
parser = CSVParser()
codebook = parser.parse("codebook.csv")

# 2. Load data
data = load_data("data.csv")

# 3. Run analyses
analyzer = UnivariateAnalyzer()
results = analyzer.batch_analyze(data, codebook.variables)

# 4. Format insights
formatter = InsightFormatter()
for result in results:
    result["insight"] = formatter.format_insight(result)

# 5. Generate Q/A pairs
qa_gen = QAGenerator(use_llm=True, api_key="your-api-key")
qa_results = qa_gen.generate_batch(
    results,
    [r["insight"] for r in results]
)

# 6. Export for LLM fine-tuning
lines = qa_gen.export_qa_dataset(qa_results, format="openai")
with open("training_data.jsonl", "w") as f:
    f.write("\n".join(lines))

📝 Q/A Provenance Tracking

Every Q/A pair generated by StatQA includes detailed provenance metadata to ensure reproducibility and traceability:

{
  "question": "What is the average Respondent Age?",
  "answer": "The mean age is 42.5 years (median=41.0, std=12.3).",
  "type": "descriptive",
  "provenance": {
    "generated_at": "2025-11-19T10:30:45.123456+00:00",
    "tool": "statqa",
    "tool_version": "0.1.0",
    "generation_method": "template",
    "analysis_type": "univariate",
    "analyzer": "UnivariateAnalyzer"
  }
}

Provenance Fields

Field	Description	Example Values
`generated_at`	ISO 8601 timestamp (UTC)	`2025-11-19T10:30:45+00:00`
`tool`	Software used for generation	`statqa`
`tool_version`	Version of statqa	`0.1.0`
`generation_method`	How the Q/A was created	`template`, `llm_paraphrase`
`analysis_type`	Statistical analysis performed	`univariate`, `bivariate`, `temporal`, `causal`
`analyzer`	Specific analyzer class used	`UnivariateAnalyzer`, `BivariateAnalyzer`
`llm_model`	LLM model (if applicable)	`gpt-4`, `claude-3-opus`

This provenance tracking enables:

Reproducibility: Recreate Q/A pairs from original data
Quality Control: Filter by generation method or analysis type
Auditing: Track when and how answers were computed
Citation: Properly attribute computational methods in research

🖥️ Command-Line Interface

StatQA provides a powerful CLI for common workflows:

# Parse a codebook
statqa parse-codebook codebook.csv --output codebook.json --enrich

# Run full analysis pipeline
statqa analyze data.csv codebook.json --output-dir results/ --plots

# Generate Q/A pairs
statqa generate-qa results/all_insights.json --output qa_pairs.jsonl --llm

# Complete pipeline
statqa pipeline data.csv codebook.csv --output-dir output/ --enrich --qa

📊 Supported Analyses

Univariate Statistics

Central tendency: mean, median, mode
Dispersion: std, IQR, MAD (robust)
Distribution: skewness, kurtosis, normality tests
Categorical: frequencies, entropy, diversity indices

Bivariate Relationships

Numeric × Numeric: Pearson/Spearman correlation, effect sizes
Categorical × Categorical: Chi-square, Cramér's V
Categorical × Numeric: t-tests, ANOVA, Cohen's d

Temporal Analysis

Trend detection: Mann-Kendall test, linear regression
Change point detection
Year-over-year comparisons
Seasonal decomposition

Causal Inference

Regression with control variables
Confounder identification
Sensitivity analysis
Treatment effect estimation

🔧 Advanced Features

LLM-Powered Metadata Enrichment

from statqa.metadata import MetadataEnricher

enricher = MetadataEnricher(provider="openai", api_key="your-key")
enriched_codebook = enricher.enrich_codebook(codebook)

# LLM infers variable types, suggests relationships, identifies confounders

Multiple Testing Correction

from statqa.utils.stats import correct_multiple_testing

p_values = [0.03, 0.01, 0.15, 0.002]
reject, corrected_p = correct_multiple_testing(p_values, method="fdr_bh")

Custom Visualizations

from statqa.visualization import PlotFactory

plotter = PlotFactory(style="publication", figsize=(10, 6))
fig = plotter.plot_bivariate(data, var1, var2, output_path="plot.png")

📚 Documentation

Full Documentation: https://gojiplus.github.io/statqa
API Reference: API Docs
Examples: See examples/ directory

🧪 Development

Running Tests

pytest --cov=statqa --cov-report=html

Code Quality

# Linting
ruff check statqa tests

# Type checking
mypy statqa

# Formatting
black statqa tests

Building Documentation

cd docs
make html

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes with tests
Run tests and linting
Commit (git commit -m 'Add amazing feature')
Push (git push origin feature/amazing-feature)
Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with modern Python tools: Pydantic, pandas, statsmodels, typer
Inspired by survey data analysis workflows (ANES, GSS, etc.)
Statistical methods from standard social science practice

📬 Contact & Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: maintainers@statqa.org

🗺️ Roadmap

Support for additional codebook formats (SPSS, Stata, SAS)
Web interface for interactive analysis
Integration with popular survey platforms
Advanced causal inference methods (instrumental variables, DiD)
Automated report generation (Markdown, LaTeX, HTML)
Cloud deployment templates

Made with ❤️ for data scientists, researchers, and LLM engineers

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.0

Dec 27, 2025

This version

0.1.0

Nov 19, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

statqa-0.1.0.tar.gz (51.3 kB view details)

Uploaded Nov 19, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

statqa-0.1.0-py3-none-any.whl (54.1 kB view details)

Uploaded Nov 19, 2025 Python 3

File details

Details for the file statqa-0.1.0.tar.gz.

File metadata

Download URL: statqa-0.1.0.tar.gz
Upload date: Nov 19, 2025
Size: 51.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for statqa-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b1f4418f6fcb3fc174a73949577e7e1fc008356eaae4e4d25a8ca16e6aba0a3f`
MD5	`3e5b63be9970906241751bb0e31c9245`
BLAKE2b-256	`6e0dc9c9223d3fb9048a0f0c1265bc1823264ad65093d523c883aba331f1d824`

See more details on using hashes here.

File details

Details for the file statqa-0.1.0-py3-none-any.whl.

File metadata

Download URL: statqa-0.1.0-py3-none-any.whl
Upload date: Nov 19, 2025
Size: 54.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for statqa-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af97c52774b75010014d33896cbd4feb7a067745cdb12904e003920cbac0358a`
MD5	`593e74a0ebabe85618f4446d9e90259d`
BLAKE2b-256	`f4b32d2b405ac1ce411027a10b156d51bb1ecacf75fde3856f29ac35494adbfd`

See more details on using hashes here.

statqa 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

StatQA

🎯 Key Features

📦 Installation

Basic Installation

With Optional Features

From Source

🚀 Quick Start

1. Create a Codebook

2. Run Statistical Analyses

3. Generate Natural Language Insights

4. Create Q/A Pairs for LLM Training

🎨 Complete Pipeline Example

📝 Q/A Provenance Tracking

Provenance Fields

🖥️ Command-Line Interface

📊 Supported Analyses

Univariate Statistics

Bivariate Relationships

Temporal Analysis

Causal Inference

🔧 Advanced Features

LLM-Powered Metadata Enrichment

Multiple Testing Correction

Custom Visualizations

📚 Documentation

🧪 Development

Running Tests

Code Quality

Building Documentation

🤝 Contributing

📄 License

🙏 Acknowledgments

📬 Contact & Support

🗺️ Roadmap

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes