Spatiotemporal Index Extraction from Unstructured Text

STIndex - Spatiotemporal Information Extraction

Python 3.11+ · License: MIT

STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.

🌐 Try the Demo Dashboard

Quick Start

Installation

pip install stindex

# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm

Basic Extraction

# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."

# Use specific LLM provider
stindex extract "Text here..." --config openai  # or anthropic, hf

End-to-End Pipeline

from stindex import InputDocument, STIndexPipeline

# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here")
]

# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
)
results = pipeline.run_pipeline(docs)

Schema Discovery (NEW in v0.6.0)

Automatically discover dimensional schemas from Q&A datasets:

from stindex.pipeline.discovery_pipeline import SchemaDiscoveryPipeline

# Discover schema from medical Q&A dataset
discovery = SchemaDiscoveryPipeline(
    questions_path="data/original/mirage/train.jsonl",
    corpus_path="data/original/medcorp/train.jsonl",
    output_path="cfg/discovered_medical_schema.yml",
    n_clusters=10
)
schema = discovery.run()

# Use discovered schema for extraction
pipeline = STIndexPipeline(
    dimension_config="cfg/discovered_medical_schema.yml"
)
results = pipeline.run_pipeline(docs)

Features:

  • Domain-agnostic schema discovery from question-answer datasets
  • Two-phase approach: cluster-based initial discovery + refinement
  • Outputs hierarchy-based dimension configs compatible with extraction pipeline
  • Automatic mandatory dimension inclusion (temporal, spatial)

Supported datasets: MIRAGE, MedCorp, HotpotQA, 2WikiMQA, MuSiQue
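
The "automatic mandatory dimension inclusion" feature can be pictured with a small sketch. This is not the STIndex implementation: the function name, the dictionary layout, and the example "symptom" dimension are hypothetical, and only the `enabled` and `hierarchy` keys come from the dimension format described in this README.

```python
# Hypothetical sketch: make sure temporal and spatial dimensions are always
# present and enabled in a discovered schema. Not the STIndex implementation.
MANDATORY = ("temporal", "spatial")


def with_mandatory_dimensions(discovered: dict) -> dict:
    """Return a copy of `discovered` with mandatory dimensions forced on."""
    # Start with the mandatory dimensions so they always exist.
    merged = {name: {"enabled": True} for name in MANDATORY}
    # Layer the discovered dimensions on top, keeping their other settings.
    for name, cfg in discovered.items():
        merged[name] = {**merged.get(name, {}), **cfg}
    # A discovered schema cannot switch a mandatory dimension off.
    for name in MANDATORY:
        merged[name]["enabled"] = True
    return merged


# Hypothetical discovery output: one custom dimension, spatial disabled.
discovered = {
    "symptom": {"enabled": True, "hierarchy": ["symptom", "body_system"]},
    "spatial": {"enabled": False},
}
schema = with_mandatory_dimensions(discovered)
print(list(schema))  # ['temporal', 'spatial', 'symptom']
```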

Python API (Direct Extraction)

from stindex import DimensionalExtractor

# Initialize with default config (cfg/extraction/inference/extract.yml)
extractor = DimensionalExtractor()

# Or specify a config
extractor = DimensionalExtractor(config_path="openai")

# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")

# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")

# Raw LLM output is available for debugging.
# extraction_config may be a dict or an object, so handle both.
cfg = result.extraction_config
if cfg:
    raw_output = cfg.get("raw_llm_output") if isinstance(cfg, dict) else cfg.raw_llm_output
    print(f"Raw output: {raw_output}")

Server Deployment

MS-SWIFT Server (Model Sharding with Tensor Parallelism)

Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:

# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh

# Stop server
./scripts/stop_ms_swift.sh

# Check logs
tail -f logs/hf_server.log

Configuration (cfg/hf.yml):

  • deployment.port: Server port (default: 8001)
  • deployment.model: HuggingFace model ID or local path
  • deployment.result_path: Directory for inference logs (default: data/output/result)
  • deployment.vllm.tensor_parallel_size:
    • auto (default): Auto-detect all available GPUs
    • Or set manually: 1, 2, 4, etc.
  • deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)

Configuration

STIndex uses a hierarchical configuration structure organized by module:

Preprocessing Configs (cfg/preprocess/)

  • chunking.yml: Document chunking strategies

    • strategy: "sliding_window", "paragraph", "element_based", "semantic"
    • max_chunk_size: Maximum tokens per chunk (default: 1500)
    • overlap: Token overlap between chunks (default: 150)
  • parsing.yml: Document parsing settings

    • parsing_method: "unstructured" (recommended) or "simple"
    • Format-specific settings for PDF, HTML, DOCX
    • max_file_size_mb: Maximum file size (default: 50MB)
  • scraping.yml: Web scraping configuration

    • rate_limit: Seconds between requests (default: 2.0)
    • timeout: Request timeout (default: 30s)
    • cache.enabled: Enable response caching
    • robots.respect_robots_txt: Respect robots.txt rules
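
To illustrate how `max_chunk_size` and `overlap` interact in the "sliding_window" strategy, here is a minimal sketch. It is an assumption about the general technique, not the STIndex chunker: the function name is hypothetical and tokens are plain list items rather than model tokens.

```python
def sliding_window_chunks(tokens, max_chunk_size=1500, overlap=150):
    """Split a token list into windows that share `overlap` tokens."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")
    step = max_chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once a window has reached the end of the input.
        if start + max_chunk_size >= len(tokens):
            break
    return chunks


# 4000 placeholder tokens with the documented defaults (1500 / 150).
tokens = [f"tok{i}" for i in range(4000)]
chunks = sliding_window_chunks(tokens)
print([len(c) for c in chunks])  # [1500, 1500, 1300]
```

With the defaults, each window advances by 1350 tokens, so the last 150 tokens of one chunk reappear at the start of the next.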

Extraction Configs (cfg/extraction/)

Inference Configs (cfg/extraction/inference/)

  • extract.yml: Main extraction configuration

    • llm.llm_provider: "hf", "openai", or "anthropic"
    • extraction.enable_cache: Cache extraction results
    • extraction.auto_save: Auto-save to data/output/yyyy-mm-dd/hh-mm-ss.json
    • extraction.min_confidence: Minimum confidence threshold (0.0-1.0)
    • Context-aware extraction settings
    • Post-processing toggles (reflection, OSM context, relative temporal resolution)
  • dimensions.yml: Multi-dimensional extraction definitions (hierarchy-based format v0.6.0+)

    • temporal: ISO 8601 normalized dates with 4-level hierarchy (timestamp → date → month → year)
    • spatial: Geocoded locations with 4-level hierarchy (location → city → state → country)
    • event: Optional categorical dimension for event types (disabled by default)
    • entity: Optional categorical dimension for named entities (disabled by default)
    • Each dimension defines: enabled, extraction_type, schema_type, hierarchy, examples
    • Custom dimensions: Add hierarchical dimensions for domain-specific extraction
    • Migration: Use scripts/migrate_dimension_configs.py to convert old field-based configs
  • reflection.yml: Two-pass reflection settings

    • enabled: Enable LLM-based quality filtering (default: false)
    • thresholds: Relevance, accuracy, completeness, consistency scores
    • Context-aware reasoning for temporal/spatial consistency checks
    • Quality scoring with configurable weights
  • openai.yml: OpenAI API settings

    • model_name: "gpt-4o-mini", "gpt-4o", "gpt-4.1", etc.
    • temperature: Generation temperature (default: 0.0)
    • max_tokens: Maximum output tokens (default: 2048)
    • Requires: OPENAI_API_KEY environment variable
  • anthropic.yml: Anthropic Claude API settings

    • model_name: "claude-sonnet-4-5-20250929" (latest)
    • temperature: Generation temperature (default: 0.0)
    • max_tokens: Maximum output tokens (default: 2048)
    • Requires: ANTHROPIC_API_KEY environment variable
  • hf.yml: HuggingFace/MS-SWIFT server settings

    • Client config (llm): API endpoint and generation parameters
      • model_name: Model name as reported by server (e.g., "Qwen3-8B")
      • base_url: Server endpoint (e.g., "http://localhost:8001")
      • max_tokens: Maximum tokens per request (default: 32768)
    • Server config (deployment): Model deployment settings
      • model: HuggingFace model ID (e.g., "Qwen/Qwen3-8B")
      • port: Server port (default: 8001)
      • result_path: Inference log directory (null to disable)
      • vllm.tensor_parallel_size: GPU configuration (auto or number)
      • vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
      • vllm.max_model_len: Maximum sequence length (default: 32768)
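
The 4-level temporal hierarchy (timestamp → date → month → year) can be read as successive roll-ups of one ISO 8601 value. The sketch below shows that idea using only the standard library; the function name and the returned dictionary layout are assumptions for illustration and may differ from what STIndex emits.

```python
from datetime import datetime


def temporal_hierarchy(timestamp: str) -> dict:
    """Roll an ISO 8601 timestamp up through the four hierarchy levels."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "timestamp": dt.isoformat(),     # finest level
        "date": dt.date().isoformat(),   # drop the time of day
        "month": dt.strftime("%Y-%m"),   # drop the day
        "year": dt.strftime("%Y"),       # coarsest level
    }


levels = temporal_hierarchy("2022-03-15T14:30:00")
for level, value in levels.items():
    print(f"{level}: {value}")
```

Grouping extracted entities by any one of these keys gives the corresponding aggregation level.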

Post-Processing Configs (cfg/extraction/postprocess/)

  • spatial.yml: Geocoding and spatial validation

    • geocoder: "nominatim" (free, OSM) or "google" (requires API key)
    • nominatim.rate_limit: Rate limiting (minimum 1.0 seconds for OSM)
    • cache.enabled: Cache geocoding results
    • disambiguation: Context-aware disambiguation settings
    • validation: Geocoding validation (min_confidence, max_distance_km)
  • temporal.yml: Temporal normalization

    • format: "iso8601" (default)
    • timezone.default: Default timezone (default: "UTC")
    • relative.handle_relative: Resolve relative dates (e.g., "Monday" → absolute date)
    • ranges.expand_intervals: Expand date ranges to start/end
    • validation: Year range validation (min_year: 1900, max_year: 2100)
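
As a rough picture of what relative-date resolution and year-range validation involve, consider the sketch below. It is not STIndex code: the function names are hypothetical, and resolving a weekday name to its most recent occurrence on or before the reference date is an assumption, since the README does not say which direction STIndex resolves.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name: str, reference: date) -> date:
    """Most recent occurrence of the weekday on or before `reference`."""
    target = WEEKDAYS.index(name.lower())
    days_back = (reference.weekday() - target) % 7
    return reference - timedelta(days=days_back)


def normalize(d: date, min_year: int = 1900, max_year: int = 2100) -> str:
    """Validate the year range, then format as an ISO 8601 date."""
    if not min_year <= d.year <= max_year:
        raise ValueError(f"year {d.year} outside [{min_year}, {max_year}]")
    return d.isoformat()


# Reference date taken from the document; 2022-03-15 was a Tuesday.
reference = date(2022, 3, 15)
resolved = resolve_weekday("Monday", reference)
print(normalize(resolved))  # 2022-03-14
```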

Evaluation Config (cfg/extraction/evaluation/)

  • evaluate.yml: Evaluation settings
    • dataset.path: Path to evaluation dataset
    • dataset.sample_limit: Limit number of chunks (null = all)
    • llm.llm_provider: LLM provider for evaluation
    • context_aware.enabled: Enable context-aware extraction
    • Post-processing settings for evaluation

Switching LLM Providers

Edit cfg/extraction/inference/extract.yml:

llm:
  llm_provider: hf  # or openai, anthropic

Or specify at runtime:

extractor = DimensionalExtractor(config_path="openai")
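
Because the OpenAI and Anthropic configs each require an API key in the environment, it can help to check for the key before switching providers. The helper below is a hypothetical sketch, not part of STIndex. It assumes the "hf" provider needs no key, since this README lists none for it.

```python
import os

# Environment variable required by each provider, per the config notes above.
# None means no key is listed for that provider (an assumption for "hf").
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "hf": None,
}


def missing_api_key(provider, env=None):
    """Return the name of the unset variable, or None if nothing is missing."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    var = REQUIRED_ENV[provider]
    if var is not None and not env.get(var):
        return var
    return None


# An empty mapping stands in for an environment with no keys set.
print(missing_api_key("openai", env={}))  # OPENAI_API_KEY
```

Calling `missing_api_key("openai")` with no `env` argument checks the real process environment.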

Quick Evaluation

# Sequential mode (default)
stindex evaluate

# With specific config
stindex evaluate --llm-config openai

# Limit samples
stindex evaluate --sample-limit 10

Output Structure

Results are organized by dataset and model:

data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv         # Detailed results
    └── eval_{timestamp}_{config}.summary.json # Aggregate metrics
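
Given this layout, the most recent summary file can be located with the standard library alone. The sketch below is an illustration, not an STIndex utility: the function name is hypothetical, "most recent" is judged by file modification time, and no assumptions are made about the keys inside the summary JSON.

```python
import json
from pathlib import Path


def latest_summary(root="data/output/evaluations"):
    """Return (path, parsed JSON) for the newest summary file, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    # One directory per {dataset_name}-{model_name}, summaries inside it.
    paths = sorted(root.glob("*/eval_*.summary.json"),
                   key=lambda p: p.stat().st_mtime)
    if not paths:
        return None
    newest = paths[-1]
    with newest.open(encoding="utf-8") as fh:
        return newest, json.load(fh)


found = latest_summary()
if found is None:
    print("No evaluation summaries found")
else:
    path, summary = found
    print(f"{path}: {sorted(summary)}")
```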

TODOs

  • Backend server implementation
  • Data warehouse integration

License

MIT License
