Spatiotemporal Index Extraction from Unstructured Text

STIndex - Spatiotemporal Information Extraction

Requires Python 3.11+. License: MIT.

STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.

Quick Start

Installation

pip install stindex

# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm
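
To confirm that both installation steps succeeded, a quick check such as the following can help. It is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, not part of STIndex. It relies on the fact that spaCy models installed via `spacy download` are ordinary Python packages.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module or package can be found for import."""
    return find_spec(module_name) is not None


# Report whether spaCy and the English model are importable.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```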

Basic Extraction

# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."

# Use specific LLM provider
stindex extract "Text here..." --config openai  # or anthropic, hf

End-to-End Pipeline

from stindex import InputDocument, STIndexPipeline

# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here")
]

# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
)
results = pipeline.run_pipeline(docs)

Schema Discovery (NEW in v0.6.0)

Automatically discover dimensional schemas from Q&A datasets:

from stindex.pipeline.discovery_pipeline import SchemaDiscoveryPipeline

# Discover schema from medical Q&A dataset
discovery = SchemaDiscoveryPipeline(
    questions_path="data/original/mirage/train.jsonl",
    corpus_path="data/original/medcorp/train.jsonl",
    output_path="cfg/discovered_medical_schema.yml",
    n_clusters=10
)
schema = discovery.run()

# Use discovered schema for extraction
pipeline = STIndexPipeline(
    dimension_config="cfg/discovered_medical_schema.yml"
)
results = pipeline.run_pipeline(docs)

Features:

  • Domain-agnostic schema discovery from question-answer datasets
  • Two-phase approach: cluster-based initial discovery + refinement
  • Outputs hierarchy-based dimension configs compatible with extraction pipeline
  • Automatic mandatory dimension inclusion (temporal, spatial)

Supported datasets: MIRAGE, MedCorp, HotpotQA, 2WikiMQA, MuSiQue

Python API (Direct Extraction)

from stindex import DimensionalExtractor

# Initialize with default config (cfg/extraction/inference/extract.yml)
extractor = DimensionalExtractor()

# Or specify a config
extractor = DimensionalExtractor(config_path="openai")

# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")

# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")

# Raw LLM output is available for debugging
config = result.extraction_config
if config:
    # extraction_config may be a dict or an object with attributes
    if isinstance(config, dict):
        raw_output = config.get("raw_llm_output")
    else:
        raw_output = config.raw_llm_output
    print(f"Raw output: {raw_output}")

Server Deployment

MS-SWIFT Server (Model Sharding with Tensor Parallelism)

Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:

# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh

# Stop server
./scripts/stop_ms_swift.sh

# Check logs
tail -f logs/hf_server.log

Configuration (cfg/extraction/inference/hf.yml):

  • deployment.port: Server port (default: 8001)
  • deployment.model: HuggingFace model ID or local path
  • deployment.result_path: Directory for inference logs (default: data/output/result)
  • deployment.vllm.tensor_parallel_size:
    • auto (default): Auto-detect all available GPUs
    • Or set manually: 1, 2, 4, etc.
  • deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
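
To illustrate what the auto setting for tensor_parallel_size could mean in practice, here is a minimal sketch. It is not the logic used by deploy_ms_swift.sh, which may query the hardware directly. The function name `resolve_tensor_parallel_size` is hypothetical, and counting the entries of CUDA_VISIBLE_DEVICES is an assumption made to keep the example self-contained.

```python
import os


def resolve_tensor_parallel_size(setting, visible_devices=None):
    """Turn a tensor_parallel_size setting into a GPU count.

    `setting` is "auto" or a positive integer (int or numeric string).
    `visible_devices` defaults to the CUDA_VISIBLE_DEVICES environment variable.
    """
    if str(setting).lower() != "auto":
        size = int(setting)
        if size < 1:
            raise ValueError("tensor_parallel_size must be >= 1")
        return size
    if visible_devices is None:
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    devices = [d for d in visible_devices.split(",") if d.strip()]
    # Assumption: fall back to a single GPU when no devices are listed.
    return max(len(devices), 1)


print(resolve_tensor_parallel_size("auto", "0,1,2,3"))  # 4
print(resolve_tensor_parallel_size(2))                  # 2
```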

Configuration

STIndex uses a hierarchical configuration structure organized by module:

Preprocessing Configs (cfg/preprocess/)

  • chunking.yml: Document chunking strategies

    • strategy: "sliding_window", "paragraph", "element_based", "semantic"
    • max_chunk_size: Maximum tokens per chunk (default: 1500)
    • overlap: Token overlap between chunks (default: 150)
  • parsing.yml: Document parsing settings

    • parsing_method: "unstructured" (recommended) or "simple"
    • Format-specific settings for PDF, HTML, DOCX
    • max_file_size_mb: Maximum file size (default: 50MB)
  • scraping.yml: Web scraping configuration

    • rate_limit: Seconds between requests (default: 2.0)
    • timeout: Request timeout (default: 30s)
    • cache.enabled: Enable response caching
    • robots.respect_robots_txt: Respect robots.txt rules
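
The interaction between max_chunk_size and overlap in the "sliding_window" strategy can be sketched as follows. This is an illustrative implementation under stated assumptions, not STIndex's own chunker: the function `sliding_window_chunks` is hypothetical, and it operates on a pre-tokenized list because the tokenizer STIndex uses is not described here.

```python
def sliding_window_chunks(tokens, max_chunk_size=1500, overlap=150):
    """Split a token list into windows that share `overlap` tokens."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chunk_size")
    step = max_chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Small numbers make the overlap visible: each chunk repeats one token.
print(sliding_window_chunks(list(range(10)), max_chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the defaults listed above (1500 and 150), each window advances by 1350 tokens.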

Extraction Configs (cfg/extraction/)

Inference Configs (cfg/extraction/inference/)

  • extract.yml: Main extraction configuration

    • llm.llm_provider: "hf", "openai", or "anthropic"
    • extraction.enable_cache: Cache extraction results
    • extraction.auto_save: Auto-save to data/output/yyyy-mm-dd/hh-mm-ss.json
    • extraction.min_confidence: Minimum confidence threshold (0.0-1.0)
    • Context-aware extraction settings
    • Post-processing toggles (reflection, OSM context, relative temporal resolution)
  • dimensions.yml: Multi-dimensional extraction definitions (hierarchy-based format v0.6.0+)

    • temporal: ISO 8601 normalized dates with 4-level hierarchy (timestamp → date → month → year)
    • spatial: Geocoded locations with 4-level hierarchy (location → city → state → country)
    • event: Optional categorical dimension for event types (disabled by default)
    • entity: Optional categorical dimension for named entities (disabled by default)
    • Each dimension defines: enabled, extraction_type, schema_type, hierarchy, examples
    • Custom dimensions: Add hierarchical dimensions for domain-specific extraction
    • Migration: Use scripts/migrate_dimension_configs.py to convert old field-based configs
  • reflection.yml: Two-pass reflection settings

    • enabled: Enable LLM-based quality filtering (default: false)
    • thresholds: Relevance, accuracy, completeness, consistency scores
    • Context-aware reasoning for temporal/spatial consistency checks
    • Quality scoring with configurable weights
  • openai.yml: OpenAI API settings

    • model_name: "gpt-4o-mini", "gpt-4o", "gpt-4.1", etc.
    • temperature: Generation temperature (default: 0.0)
    • max_tokens: Maximum output tokens (default: 2048)
    • Requires: OPENAI_API_KEY environment variable
  • anthropic.yml: Anthropic Claude API settings

    • model_name: "claude-sonnet-4-5-20250929" (latest)
    • temperature: Generation temperature (default: 0.0)
    • max_tokens: Maximum output tokens (default: 2048)
    • Requires: ANTHROPIC_API_KEY environment variable
  • hf.yml: HuggingFace/MS-SWIFT server settings

    • Client config (llm): API endpoint and generation parameters
      • model_name: Model name as reported by server (e.g., "Qwen3-8B")
      • base_url: Server endpoint (e.g., "http://localhost:8001")
      • max_tokens: Maximum tokens per request (default: 32768)
    • Server config (deployment): Model deployment settings
      • model: HuggingFace model ID (e.g., "Qwen/Qwen3-8B")
      • port: Server port (default: 8001)
      • result_path: Inference log directory (null to disable)
      • vllm.tensor_parallel_size: GPU configuration (auto or number)
      • vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
      • vllm.max_model_len: Maximum sequence length (default: 32768)
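
The 4-level temporal hierarchy listed under dimensions.yml (timestamp → date → month → year) can be illustrated with a small roll-up function. This is a sketch only: `temporal_hierarchy` is a hypothetical helper, and the exact string formats STIndex emits for the month and year levels are assumptions.

```python
from datetime import datetime


def temporal_hierarchy(iso_timestamp: str) -> dict:
    """Derive coarser ISO 8601 levels from a single timestamp."""
    ts = datetime.fromisoformat(iso_timestamp)
    return {
        "timestamp": ts.isoformat(),
        "date": ts.date().isoformat(),
        "month": f"{ts.year:04d}-{ts.month:02d}",  # assumed YYYY-MM format
        "year": f"{ts.year:04d}",
    }


levels = temporal_hierarchy("2022-03-15T14:30:00")
for level, value in levels.items():
    print(f"{level}: {value}")
```

Storing every level alongside the raw value makes it possible to group extracted entities at whichever granularity a query needs.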

Post-Processing Configs (cfg/extraction/postprocess/)

  • spatial.yml: Geocoding and spatial validation

    • geocoder: "nominatim" (free, OSM) or "google" (requires API key)
    • nominatim.rate_limit: Rate limiting (minimum 1.0 seconds for OSM)
    • cache.enabled: Cache geocoding results
    • disambiguation: Context-aware disambiguation settings
    • validation: Geocoding validation (min_confidence, max_distance_km)
  • temporal.yml: Temporal normalization

    • format: "iso8601" (default)
    • timezone.default: Default timezone (default: "UTC")
    • relative.handle_relative: Resolve relative dates (e.g., "Monday" → absolute date)
    • ranges.expand_intervals: Expand date ranges to start/end
    • validation: Year range validation (min_year: 1900, max_year: 2100)
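
As an illustration of relative date resolution and year range validation, the sketch below resolves a weekday name against a reference date. Both function names are hypothetical. Whether STIndex resolves "Monday" to the most recent or the upcoming Monday is not stated here, so the choice of the most recent occurrence is an assumption.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name: str, reference: date) -> date:
    """Return the most recent occurrence of a weekday on or before `reference`."""
    target = WEEKDAYS.index(name.lower())
    days_back = (reference.weekday() - target) % 7
    return reference - timedelta(days=days_back)


def year_in_range(d: date, min_year: int = 1900, max_year: int = 2100) -> bool:
    """Mirror the min_year / max_year validation bounds listed above."""
    return min_year <= d.year <= max_year


reference = date(2022, 3, 15)  # a Tuesday
resolved = resolve_weekday("Monday", reference)
print(resolved.isoformat(), year_in_range(resolved))  # 2022-03-14 True
```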

Evaluation Config (cfg/extraction/evaluation/)

  • evaluate.yml: Evaluation settings
    • dataset.path: Path to evaluation dataset
    • dataset.sample_limit: Limit number of chunks (null = all)
    • llm.llm_provider: LLM provider for evaluation
    • context_aware.enabled: Enable context-aware extraction
    • Post-processing settings for evaluation

Switching LLM Providers

Edit cfg/extraction/inference/extract.yml:

llm:
  llm_provider: hf  # or openai, anthropic

Or specify at runtime:

extractor = DimensionalExtractor(config_path="openai")
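
Before switching providers, it can be useful to verify that the required API key is set. The following is a minimal sketch; `missing_api_key` is a hypothetical helper, not part of STIndex. The variable names come from the openai.yml and anthropic.yml notes above, and treating "hf" as needing no key is an assumption based on none being listed.

```python
import os

# Environment variables named in the provider config notes.
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_api_key(provider, env=None):
    """Return the name of the missing environment variable, or None."""
    env = os.environ if env is None else env
    var = REQUIRED_ENV.get(provider)
    if var is None:
        return None  # assumption: providers without an entry need no key
    return None if env.get(var) else var


for provider in ("hf", "openai", "anthropic"):
    missing = missing_api_key(provider)
    print(f"{provider}: {'ok' if missing is None else 'set ' + missing}")
```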

Quick Evaluation

# Sequential mode (default)
stindex evaluate

# With specific config
stindex evaluate --llm-config openai

# Limit samples
stindex evaluate --sample-limit 10

Output Structure

Results are organized by dataset and model:

data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv         # Detailed results
    └── eval_{timestamp}_{config}.summary.json # Aggregate metrics
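
Given this layout, the most recent summary for a run can be located with a short script. This is a sketch, not part of STIndex: `latest_summary` is a hypothetical helper, the directory name in the usage lines is a made-up example, and because the structure of the summary JSON is not documented here, the function simply returns the parsed object.

```python
import json
from pathlib import Path


def latest_summary(run_dir):
    """Load the most recently modified eval_*.summary.json in a run directory."""
    candidates = sorted(
        Path(run_dir).glob("eval_*.summary.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical run directory following the {dataset_name}-{model_name} pattern.
summary = latest_summary("data/output/evaluations/example_dataset-example_model")
print(summary if summary is not None else "no summary found")
```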

TODOs

  • Backend server implementation
  • Data warehouse integration

License

MIT License
