Spatiotemporal Index Extraction from Unstructured Text
STIndex - Spatiotemporal Information Extraction
STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.
Quick Start
Installation
pip install stindex
# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm
Basic Extraction
# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."
# Use specific LLM provider
stindex extract "Text here..." --config openai # or anthropic, hf
End-to-End Pipeline
from stindex import InputDocument, STIndexPipeline
# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here"),
]
# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
)
results = pipeline.run_pipeline(docs)
Schema Discovery (NEW in v0.6.0)
Automatically discover dimensional schemas from Q&A datasets:
from stindex.pipeline.discovery_pipeline import SchemaDiscoveryPipeline
# Discover schema from medical Q&A dataset
discovery = SchemaDiscoveryPipeline(
    questions_path="data/original/mirage/train.jsonl",
    corpus_path="data/original/medcorp/train.jsonl",
    output_path="cfg/discovered_medical_schema.yml",
    n_clusters=10,
)
schema = discovery.run()
# Use discovered schema for extraction
pipeline = STIndexPipeline(
    dimension_config="cfg/discovered_medical_schema.yml",
)
results = pipeline.run_pipeline(docs)
Features:
- Domain-agnostic schema discovery from question-answer datasets
- Two-phase approach: cluster-based initial discovery + refinement
- Outputs hierarchy-based dimension configs compatible with extraction pipeline
- Automatic mandatory dimension inclusion (temporal, spatial)
Supported datasets: MIRAGE, MedCorp, HotpotQA, 2WikiMQA, MuSiQue
Python API (Direct Extraction)
from stindex import DimensionalExtractor
# Initialize with default config (cfg/extract.yml)
extractor = DimensionalExtractor()
# Or specify a config
extractor = DimensionalExtractor(config_path="openai")
# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")
# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")
# Raw LLM output available for debugging
if result.extraction_config:
    config = result.extraction_config
    raw_output = config.get("raw_llm_output") if isinstance(config, dict) else config.raw_llm_output
    print(f"Raw output: {raw_output}")
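The dict-versus-object check in the snippet above can be factored into a small helper. This is a minimal sketch: `get_raw_llm_output` is a hypothetical name, not part of the STIndex API, and it assumes only that `extraction_config` is either a dict or an object exposing a `raw_llm_output` attribute, as the snippet implies.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_raw_llm_output(extraction_config: Any) -> Optional[str]:
    """Return the raw LLM output from a dict-style or attribute-style config.

    Hypothetical helper: returns None when the config is empty or the
    field is missing, instead of raising.
    """
    if not extraction_config:
        return None
    if isinstance(extraction_config, dict):
        return extraction_config.get("raw_llm_output")
    return getattr(extraction_config, "raw_llm_output", None)


# Stand-ins for the two shapes the config may take (illustrative values only).
as_dict = {"raw_llm_output": '{"temporal": []}'}
as_object = SimpleNamespace(raw_llm_output='{"temporal": []}')
print(get_raw_llm_output(as_dict))
print(get_raw_llm_output(as_object))
```

With a real result, the call would be `get_raw_llm_output(result.extraction_config)`.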
Server Deployment
MS-SWIFT Server (Model Sharding with Tensor Parallelism)
Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:
# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh
# Stop server
./scripts/stop_ms_swift.sh
# Check logs
tail -f logs/hf_server.log
Configuration (cfg/hf.yml):
- deployment.port: Server port (default: 8001)
- deployment.model: HuggingFace model ID or local path
- deployment.result_path: Directory for inference logs (default: data/output/result)
- deployment.vllm.tensor_parallel_size: auto (default) auto-detects all available GPUs; or set manually: 1, 2, 4, etc.
- deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
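To illustrate how an `auto` value for `tensor_parallel_size` could be turned into a concrete GPU count, here is a minimal sketch. `resolve_tensor_parallel_size` is a hypothetical helper, not the logic used by `deploy_ms_swift.sh`; the detected GPU count is passed in as a parameter because how the script detects GPUs is not described here.

```python
from typing import Union


def resolve_tensor_parallel_size(value: Union[str, int], detected_gpus: int) -> int:
    """Map a tensor_parallel_size setting to a concrete GPU count.

    Assumptions (not from the STIndex source): "auto" means every detected
    GPU, and an explicit value must be between 1 and the detected count.
    """
    if detected_gpus < 1:
        raise ValueError("no GPUs detected")
    if isinstance(value, str) and value.strip().lower() == "auto":
        return detected_gpus
    size = int(value)
    if size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    if size > detected_gpus:
        raise ValueError(f"requested {size} GPUs but only {detected_gpus} detected")
    return size


print(resolve_tensor_parallel_size("auto", detected_gpus=4))
print(resolve_tensor_parallel_size(2, detected_gpus=4))
```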
Configuration
STIndex uses a hierarchical configuration structure organized by module:
Preprocessing Configs (cfg/preprocess/)
- chunking.yml: Document chunking strategies
  - strategy: "sliding_window", "paragraph", "element_based", "semantic"
  - max_chunk_size: Maximum tokens per chunk (default: 1500)
  - overlap: Token overlap between chunks (default: 150)
- parsing.yml: Document parsing settings
  - parsing_method: "unstructured" (recommended) or "simple"
  - Format-specific settings for PDF, HTML, DOCX
  - max_file_size_mb: Maximum file size (default: 50MB)
- scraping.yml: Web scraping configuration
  - rate_limit: Seconds between requests (default: 2.0)
  - timeout: Request timeout (default: 30s)
  - cache.enabled: Enable response caching
  - robots.respect_robots_txt: Respect robots.txt rules
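The sliding_window strategy with the `max_chunk_size` and `overlap` settings from chunking.yml can be sketched as follows. This is an illustration of the general technique, not STIndex's implementation: `sliding_window_chunks` is a hypothetical function, and it treats whitespace-separated words as tokens, whereas a real pipeline would likely count tokens with a model tokenizer.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], max_chunk_size: int = 1500, overlap: int = 150
) -> List[List[str]]:
    """Split tokens into windows of at most max_chunk_size tokens.

    Consecutive windows share `overlap` tokens, so each window starts
    (max_chunk_size - overlap) tokens after the previous one.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")
    step = max_chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + max_chunk_size >= len(tokens):
            break
    return chunks


words = "On March 15, 2022, a cyclone hit Broome, Western Australia.".split()
for chunk in sliding_window_chunks(words, max_chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps an entity that straddles a chunk boundary visible in at least one chunk, at the cost of processing some tokens twice.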
Extraction Configs (cfg/extraction/)
Inference Configs (cfg/extraction/inference/)
- extract.yml: Main extraction configuration
  - llm.llm_provider: "hf", "openai", or "anthropic"
  - extraction.enable_cache: Cache extraction results
  - extraction.auto_save: Auto-save to data/output/yyyy-mm-dd/hh-mm-ss.json
  - extraction.min_confidence: Minimum confidence threshold (0.0-1.0)
  - Context-aware extraction settings
  - Post-processing toggles (reflection, OSM context, relative temporal resolution)
- dimensions.yml: Multi-dimensional extraction definitions (hierarchy-based format v0.6.0+)
  - temporal: ISO 8601 normalized dates with 4-level hierarchy (timestamp → date → month → year)
  - spatial: Geocoded locations with 4-level hierarchy (location → city → state → country)
  - event: Optional categorical dimension for event types (disabled by default)
  - entity: Optional categorical dimension for named entities (disabled by default)
  - Each dimension defines: enabled, extraction_type, schema_type, hierarchy, examples
  - Custom dimensions: Add hierarchical dimensions for domain-specific extraction
  - Migration: Use scripts/migrate_dimension_configs.py to convert old field-based configs
- reflection.yml: Two-pass reflection settings
  - enabled: Enable LLM-based quality filtering (default: false)
  - thresholds: Relevance, accuracy, completeness, consistency scores
  - Context-aware reasoning for temporal/spatial consistency checks
  - Quality scoring with configurable weights
- openai.yml: OpenAI API settings
  - model_name: "gpt-4o-mini", "gpt-4o", "gpt-4.1", etc.
  - temperature: Generation temperature (default: 0.0)
  - max_tokens: Maximum output tokens (default: 2048)
  - Requires: OPENAI_API_KEY environment variable
- anthropic.yml: Anthropic Claude API settings
  - model_name: "claude-sonnet-4-5-20250929" (latest)
  - temperature: Generation temperature (default: 0.0)
  - max_tokens: Maximum output tokens (default: 2048)
  - Requires: ANTHROPIC_API_KEY environment variable
- hf.yml: HuggingFace/MS-SWIFT server settings
  - Client config (llm): API endpoint and generation parameters
    - model_name: Model name as reported by server (e.g., "Qwen3-8B")
    - base_url: Server endpoint (e.g., "http://localhost:8001")
    - max_tokens: Maximum tokens per request (default: 32768)
  - Server config (deployment): Model deployment settings
    - model: HuggingFace model ID (e.g., "Qwen/Qwen3-8B")
    - port: Server port (default: 8001)
    - result_path: Inference log directory (null to disable)
    - vllm.tensor_parallel_size: GPU configuration (auto or number)
    - vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
    - vllm.max_model_len: Maximum sequence length (default: 32768)
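The 4-level temporal hierarchy (timestamp → date → month → year) listed under dimensions.yml can be pictured as successively coarser views of one ISO 8601 value. The sketch below is an illustration only: `temporal_hierarchy` is a hypothetical function, the level names come from the list above, and the string format chosen for each level is an assumption rather than STIndex's actual output schema.

```python
from datetime import datetime
from typing import Dict


def temporal_hierarchy(iso_timestamp: str) -> Dict[str, str]:
    """Roll an ISO 8601 timestamp up through the four hierarchy levels."""
    ts = datetime.fromisoformat(iso_timestamp)
    return {
        "timestamp": ts.isoformat(),      # full date and time
        "date": ts.date().isoformat(),    # YYYY-MM-DD
        "month": ts.strftime("%Y-%m"),    # YYYY-MM
        "year": ts.strftime("%Y"),        # YYYY
    }


for level, value in temporal_hierarchy("2022-03-15T10:30:00").items():
    print(f"{level}: {value}")
```

Storing every level alongside the finest one lets downstream queries group by month or year without re-parsing the timestamp.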
Post-Processing Configs (cfg/extraction/postprocess/)
- spatial.yml: Geocoding and spatial validation
  - geocoder: "nominatim" (free, OSM) or "google" (requires API key)
  - nominatim.rate_limit: Rate limiting (minimum 1.0 seconds for OSM)
  - cache.enabled: Cache geocoding results
  - disambiguation: Context-aware disambiguation settings
  - validation: Geocoding validation (min_confidence, max_distance_km)
- temporal.yml: Temporal normalization
  - format: "iso8601" (default)
  - timezone.default: Default timezone (default: "UTC")
  - relative.handle_relative: Resolve relative dates (e.g., "Monday" → absolute date)
  - ranges.expand_intervals: Expand date ranges to start/end
  - validation: Year range validation (min_year: 1900, max_year: 2100)
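As an illustration of the relative-date resolution and year-range validation settings in temporal.yml, the sketch below resolves a weekday name against a reference date. Both functions are hypothetical and not taken from STIndex. In particular, the choice to resolve "Monday" to the most recent Monday on or before the reference date is an assumption; the library may resolve it differently.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_weekday(name: str, reference: date) -> date:
    """Return the latest date on or before `reference` that falls on `name`."""
    target = WEEKDAYS.index(name.strip().lower())  # raises ValueError if unknown
    days_back = (reference.weekday() - target) % 7
    return reference - timedelta(days=days_back)


def year_in_range(value: date, min_year: int = 1900, max_year: int = 2100) -> bool:
    """Check a date against the min_year/max_year validation bounds."""
    return min_year <= value.year <= max_year


reference = date(2022, 3, 15)  # e.g. the document's publication date
resolved = resolve_weekday("Monday", reference)
print(resolved.isoformat(), year_in_range(resolved))
```

The reference date matters: the same word "Monday" maps to different absolute dates depending on when the document was written, which is why a document-level anchor date is needed before resolving relative expressions.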
Evaluation Config (cfg/extraction/evaluation/)
- evaluate.yml: Evaluation settings
  - dataset.path: Path to evaluation dataset
  - dataset.sample_limit: Limit number of chunks (null = all)
  - llm.llm_provider: LLM provider for evaluation
  - context_aware.enabled: Enable context-aware extraction
  - Post-processing settings for evaluation
Switching LLM Providers
Edit cfg/extraction/inference/extract.yml:
llm:
  llm_provider: hf # or openai, anthropic
Or specify at runtime:
extractor = DimensionalExtractor(config_path="openai")
Quick Evaluation
# Sequential mode (default)
stindex evaluate
# With specific config
stindex evaluate --llm-config openai
# Limit samples
stindex evaluate --sample-limit 10
Output Structure
Results are organized by dataset and model:
data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv           # Detailed results
    └── eval_{timestamp}_{config}.summary.json  # Aggregate metrics
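For scripts that need to locate these files, the layout above can be reproduced with `pathlib`. This is a sketch under assumptions: `evaluation_paths` is a hypothetical helper, and the `%Y%m%d_%H%M%S` timestamp format is a guess, since the exact format STIndex uses is not stated here.

```python
from datetime import datetime
from pathlib import Path
from typing import Tuple


def evaluation_paths(
    dataset_name: str,
    model_name: str,
    config: str,
    timestamp: datetime,
    root: Path = Path("data/output/evaluations"),
) -> Tuple[Path, Path]:
    """Build the detailed-results and summary paths for one evaluation run."""
    run_dir = root / f"{dataset_name}-{model_name}"
    # Assumed timestamp format; adjust to match the files actually produced.
    stem = f"eval_{timestamp.strftime('%Y%m%d_%H%M%S')}_{config}"
    return run_dir / f"{stem}.csv", run_dir / f"{stem}.summary.json"


csv_path, summary_path = evaluation_paths(
    "mydataset", "Qwen3-8B", "hf", datetime(2025, 1, 2, 3, 4, 5)
)
print(csv_path.as_posix())
print(summary_path.as_posix())
```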
TODOs
- Backend server implementation
- Data warehouse integration
License
MIT License