Spatiotemporal Index Extraction from Unstructured Text

Maintainers

moebuta

STIndex - Spatiotemporal Information Extraction

STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.

Quick Start

Installation

pip install stindex

# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm

Basic Extraction

# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."

# Use specific LLM provider
stindex extract "Text here..." --config openai  # or anthropic, hf

End-to-End Pipeline

from stindex import InputDocument, STIndexPipeline

# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here")
]

# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
)
results = pipeline.run_pipeline(docs)

Schema Discovery (added in v0.6.0)

Automatically discover dimensional schemas from Q&A datasets:

from stindex.pipeline.discovery_pipeline import SchemaDiscoveryPipeline

# Discover schema from medical Q&A dataset
discovery = SchemaDiscoveryPipeline(
    questions_path="data/original/mirage/train.jsonl",
    corpus_path="data/original/medcorp/train.jsonl",
    output_path="cfg/discovered_medical_schema.yml",
    n_clusters=10
)
schema = discovery.run()

# Use discovered schema for extraction
pipeline = STIndexPipeline(
    dimension_config="cfg/discovered_medical_schema.yml"
)
results = pipeline.run_pipeline(docs)

Features:

Domain-agnostic schema discovery from question-answer datasets
Two-phase approach: cluster-based initial discovery + refinement
Outputs hierarchy-based dimension configs compatible with extraction pipeline
Automatic mandatory dimension inclusion (temporal, spatial)

Supported datasets: MIRAGE, MedCorp, HotpotQA, 2WikiMQA, MuSiQue

Python API (Direct Extraction)

from stindex import DimensionalExtractor

# Initialize with default config (cfg/extraction/inference/extract.yml)
extractor = DimensionalExtractor()

# Or specify a config
extractor = DimensionalExtractor(config_path="openai")

# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")

# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")

# Raw LLM output available for debugging
config = result.extraction_config
if config:
    # extraction_config may be a dict or an object with attributes
    if isinstance(config, dict):
        raw_output = config.get("raw_llm_output")
    else:
        raw_output = config.raw_llm_output
    print(f"Raw output: {raw_output}")

Server Deployment

MS-SWIFT Server (Model Sharding with Tensor Parallelism)

Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:

# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh

# Stop server
./scripts/stop_ms_swift.sh

# Check logs
tail -f logs/hf_server.log

Configuration (cfg/extraction/inference/hf.yml):

deployment.port: Server port (default: 8001)
deployment.model: HuggingFace model ID or local path
deployment.result_path: Directory for inference logs (default: data/output/result)
deployment.vllm.tensor_parallel_size:
- auto (default): Auto-detect all available GPUs
- Or set manually: 1, 2, 4, etc.
deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)

Configuration

STIndex uses a hierarchical configuration structure organized by module:

Preprocessing Configs (`cfg/preprocess/`)

chunking.yml: Document chunking strategies
- strategy: "sliding_window", "paragraph", "element_based", "semantic"
- max_chunk_size: Maximum tokens per chunk (default: 1500)
- overlap: Token overlap between chunks (default: 150)
parsing.yml: Document parsing settings
- parsing_method: "unstructured" (recommended) or "simple"
- Format-specific settings for PDF, HTML, DOCX
- max_file_size_mb: Maximum file size (default: 50MB)
scraping.yml: Web scraping configuration
- rate_limit: Seconds between requests (default: 2.0)
- timeout: Request timeout (default: 30s)
- cache.enabled: Enable response caching
- robots.respect_robots_txt: Respect robots.txt rules
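
To illustrate how `max_chunk_size` and `overlap` interact under the "sliding_window" strategy, here is a minimal sketch. It is not STIndex's implementation: `sliding_window_chunks` is a hypothetical helper, and whitespace-separated words stand in for tokens, so the real tokenizer and boundary handling may differ.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], max_chunk_size: int = 1500, overlap: int = 150
) -> List[List[str]]:
    """Split tokens into windows of at most max_chunk_size that share `overlap` tokens."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")

    step = max_chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Hypothetical toy input: words stand in for tokens.
words = "a cyclone hit Broome in Western Australia on March 15".split()
chunks = sliding_window_chunks(words, max_chunk_size=4, overlap=1)
```

With the defaults (1500 and 150), each window starts 1350 tokens after the previous one, so consecutive chunks repeat 150 tokens of context.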

Extraction Configs (`cfg/extraction/`)

Inference Configs (`cfg/extraction/inference/`)

extract.yml: Main extraction configuration
- llm.llm_provider: "hf", "openai", or "anthropic"
- extraction.enable_cache: Cache extraction results
- extraction.auto_save: Auto-save to data/output/yyyy-mm-dd/hh-mm-ss.json
- extraction.min_confidence: Minimum confidence threshold (0.0-1.0)
- Context-aware extraction settings
- Post-processing toggles (reflection, OSM context, relative temporal resolution)
dimensions.yml: Multi-dimensional extraction definitions (hierarchy-based format v0.6.0+)
- temporal: ISO 8601 normalized dates with 4-level hierarchy (timestamp → date → month → year)
- spatial: Geocoded locations with 4-level hierarchy (location → city → state → country)
- event: Optional categorical dimension for event types (disabled by default)
- entity: Optional categorical dimension for named entities (disabled by default)
- Each dimension defines: enabled, extraction_type, schema_type, hierarchy, examples
- Custom dimensions: Add hierarchical dimensions for domain-specific extraction
- Migration: Use scripts/migrate_dimension_configs.py to convert old field-based configs
reflection.yml: Two-pass reflection settings
- enabled: Enable LLM-based quality filtering (default: false)
- thresholds: Relevance, accuracy, completeness, consistency scores
- Context-aware reasoning for temporal/spatial consistency checks
- Quality scoring with configurable weights
openai.yml: OpenAI API settings
- model_name: "gpt-4o-mini", "gpt-4o", "gpt-4.1", etc.
- temperature: Generation temperature (default: 0.0)
- max_tokens: Maximum output tokens (default: 2048)
- Requires: OPENAI_API_KEY environment variable
anthropic.yml: Anthropic Claude API settings
- model_name: "claude-sonnet-4-5-20250929" (latest)
- temperature: Generation temperature (default: 0.0)
- max_tokens: Maximum output tokens (default: 2048)
- Requires: ANTHROPIC_API_KEY environment variable
hf.yml: HuggingFace/MS-SWIFT server settings
- Client config (llm): API endpoint and generation parameters
  - model_name: Model name as reported by server (e.g., "Qwen3-8B")
  - base_url: Server endpoint (e.g., "http://localhost:8001")
  - max_tokens: Maximum tokens per request (default: 32768)
- Server config (deployment): Model deployment settings
  - model: HuggingFace model ID (e.g., "Qwen/Qwen3-8B")
  - port: Server port (default: 8001)
  - result_path: Inference log directory (null to disable)
  - vllm.tensor_parallel_size: GPU configuration (auto or number)
  - vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
  - vllm.max_model_len: Maximum sequence length (default: 32768)
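
To make the 4-level temporal hierarchy in dimensions.yml concrete, the sketch below rolls an ISO 8601 timestamp up through timestamp → date → month → year using only the standard library. `temporal_hierarchy` and its dictionary keys are illustrative assumptions; the field names STIndex actually emits may differ.

```python
from datetime import datetime
from typing import Dict


def temporal_hierarchy(iso_timestamp: str) -> Dict[str, str]:
    """Derive coarser hierarchy levels from a single ISO 8601 timestamp."""
    dt = datetime.fromisoformat(iso_timestamp)
    return {
        "timestamp": dt.isoformat(),                 # finest level
        "date": dt.date().isoformat(),               # YYYY-MM-DD
        "month": f"{dt.year:04d}-{dt.month:02d}",    # YYYY-MM
        "year": f"{dt.year:04d}",                    # YYYY
    }


# Hypothetical example value; the time of day is made up for illustration.
levels = temporal_hierarchy("2022-03-15T14:30:00")
```

Each level is a prefix of the one before it, so the same extracted entity can be grouped by day, month, or year without re-parsing.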

Post-Processing Configs (`cfg/extraction/postprocess/`)

spatial.yml: Geocoding and spatial validation
- geocoder: "nominatim" (free, OSM) or "google" (requires API key)
- nominatim.rate_limit: Rate limiting (minimum 1.0 seconds for OSM)
- cache.enabled: Cache geocoding results
- disambiguation: Context-aware disambiguation settings
- validation: Geocoding validation (min_confidence, max_distance_km)
temporal.yml: Temporal normalization
- format: "iso8601" (default)
- timezone.default: Default timezone (default: "UTC")
- relative.handle_relative: Resolve relative dates (e.g., "Monday" → absolute date)
- ranges.expand_intervals: Expand date ranges to start/end
- validation: Year range validation (min_year: 1900, max_year: 2100)
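
The `relative.handle_relative` option resolves expressions such as "Monday" against a reference date. The following is a minimal sketch of one possible rule, assuming the reference is the document's date and that a bare weekday means its most recent occurrence on or before that date. `resolve_weekday` is a hypothetical helper and STIndex may apply a different rule; the year bounds mirror the `min_year` and `max_year` values listed above.

```python
from datetime import date, timedelta

WEEKDAYS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def resolve_weekday(
    name: str, reference: date, min_year: int = 1900, max_year: int = 2100
) -> str:
    """Return the ISO 8601 date of the latest `name` on or before `reference`."""
    target = WEEKDAYS.index(name.strip().lower())  # ValueError for unknown names
    days_back = (reference.weekday() - target) % 7  # 0 if reference is that weekday
    resolved = reference - timedelta(days=days_back)
    if not min_year <= resolved.year <= max_year:
        raise ValueError(f"year {resolved.year} outside [{min_year}, {max_year}]")
    return resolved.isoformat()


# Assumed reference date: the document was written on 2022-03-15 (a Tuesday).
resolved = resolve_weekday("Monday", date(2022, 3, 15))
```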

Evaluation Config (`cfg/extraction/evaluation/`)

evaluate.yml: Evaluation settings
- dataset.path: Path to evaluation dataset
- dataset.sample_limit: Limit number of chunks (null = all)
- llm.llm_provider: LLM provider for evaluation
- context_aware.enabled: Enable context-aware extraction
- Post-processing settings for evaluation

Switching LLM Providers

Edit cfg/extraction/inference/extract.yml:

llm:
  llm_provider: hf  # or openai, anthropic

Or specify at runtime:

extractor = DimensionalExtractor(config_path="openai")
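
Because the openai and anthropic configs each require an API key in the environment, it can help to check for the key before switching providers. The sketch below is not part of STIndex: `missing_api_key` is a hypothetical helper, and the mapping uses only the environment variable names given in the configuration section. The hf provider is assumed to need no key because it targets a self-hosted server.

```python
import os
from typing import Mapping, Optional

# Environment variable each provider needs (None = assumed not to need one).
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "hf": None,
}


def missing_api_key(
    provider: str, env: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Return the name of the unset variable for `provider`, or None if all is set."""
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    var = REQUIRED_ENV[provider]
    if var is not None and not env.get(var):
        return var
    return None


missing = missing_api_key("openai")
if missing:
    print(f"Set {missing} before using this provider")
```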

Quick Evaluation

# Sequential mode (default)
stindex evaluate

# With specific config
stindex evaluate --llm-config openai

# Limit samples
stindex evaluate --sample-limit 10

Output Structure

Results are organized by dataset and model:

data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv         # Detailed results
    └── eval_{timestamp}_{config}.summary.json # Aggregate metrics
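
Given this layout, the summary files can be collected with a glob. The sketch below assumes only the directory and file naming pattern shown above; `load_summaries` is a hypothetical helper, and nothing is assumed about the keys inside each summary JSON.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_summaries(root: str = "data/output/evaluations") -> Dict[str, Any]:
    """Map '{dataset}-{model}/eval_....summary.json' to its parsed JSON content."""
    summaries: Dict[str, Any] = {}
    # One subdirectory per dataset/model pair, one summary file per run.
    for path in sorted(Path(root).glob("*/eval_*.summary.json")):
        key = f"{path.parent.name}/{path.name}"
        with path.open(encoding="utf-8") as fh:
            summaries[key] = json.load(fh)
    return summaries


for name, summary in load_summaries().items():
    print(name, summary)
```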

TODOs

Backend server implementation
Data warehouse integration

License

MIT License

Release history

1.1.1 (Dec 11, 2025, this version)
1.1.0 (Dec 11, 2025)
1.0.2 (Nov 17, 2025)
1.0.1 (Nov 17, 2025)

Download files

Source Distribution

stindex-1.1.1.tar.gz (182.9 kB), uploaded Dec 11, 2025, Source

Built Distribution

stindex-1.1.1-py3-none-any.whl (218.7 kB), uploaded Dec 11, 2025, Python 3

File details

Details for the file stindex-1.1.1.tar.gz.

File metadata

Download URL: stindex-1.1.1.tar.gz
Upload date: Dec 11, 2025
Size: 182.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stindex-1.1.1.tar.gz
Algorithm	Hash digest
SHA256	`dfc780c00bbba1b503cda893daf4dde8c3300d2f96e79247b720c69a399e0a7c`
MD5	`66d4b067ed3ba1a86a6c73d4ddc844e2`
BLAKE2b-256	`f24a570a1418a0a457324bedb0c1c1a9931a0bc1f3c943ed98a5c35d745eb08a`

Provenance

The following attestation bundles were made for stindex-1.1.1.tar.gz:

Publisher: publish.yml on MoeBuTa/STIndex

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: stindex-1.1.1.tar.gz
- Subject digest: dfc780c00bbba1b503cda893daf4dde8c3300d2f96e79247b720c69a399e0a7c
- Sigstore transparency entry: 759639765
- Sigstore integration time: Dec 11, 2025
Source repository:
- Permalink: MoeBuTa/STIndex@971ecb718e21fec7fa819b0307f882675de3b0c7
- Branch / Tag: refs/tags/v1.1.1
- Owner: https://github.com/MoeBuTa
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@971ecb718e21fec7fa819b0307f882675de3b0c7
- Trigger Event: release

File details

Details for the file stindex-1.1.1-py3-none-any.whl.

File metadata

Download URL: stindex-1.1.1-py3-none-any.whl
Upload date: Dec 11, 2025
Size: 218.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for stindex-1.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ddd02a8b0259eb82413750d70a24c5f64f96e3a1c7d2e7022bc683c28566af8f`
MD5	`e1363482247295e5f4b111c100ed9b47`
BLAKE2b-256	`20a032f604ee039967e3ba40cfb9138741ed96d120909db2de51f5cd8ae66740`

Provenance

The following attestation bundles were made for stindex-1.1.1-py3-none-any.whl:

Publisher: publish.yml on MoeBuTa/STIndex

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: stindex-1.1.1-py3-none-any.whl
- Subject digest: ddd02a8b0259eb82413750d70a24c5f64f96e3a1c7d2e7022bc683c28566af8f
- Sigstore transparency entry: 759639791
- Sigstore integration time: Dec 11, 2025
Source repository:
- Permalink: MoeBuTa/STIndex@971ecb718e21fec7fa819b0307f882675de3b0c7
- Branch / Tag: refs/tags/v1.1.1
- Owner: https://github.com/MoeBuTa
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@971ecb718e21fec7fa819b0307f882675de3b0c7
- Trigger Event: release
