
Spatiotemporal Index Extraction from Unstructured Text


STIndex - Spatiotemporal Information Extraction

Requires Python 3.11+. License: MIT.

STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.


Quick Start

Installation

pip install stindex

# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm

Basic Extraction

# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."

# Use specific LLM provider
stindex extract "Text here..." --config openai  # or anthropic, hf

End-to-End Pipeline

from stindex import InputDocument, STIndexPipeline

# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here")
]

# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
)
results = pipeline.run_pipeline(docs)

Python API (Direct Extraction)

from stindex import DimensionalExtractor

# Initialize with default config (cfg/extraction/inference/extract.yml)
extractor = DimensionalExtractor()

# Or specify a config
extractor = DimensionalExtractor(config_path="openai")

# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")

# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")

# Raw LLM output is available for debugging.
# extraction_config may be a dict or an object, so handle both.
cfg = result.extraction_config
if cfg:
    is_dict = isinstance(cfg, dict)
    raw_output = cfg.get("raw_llm_output") if is_dict else cfg.raw_llm_output
    print(f"Raw output: {raw_output}")

Server Deployment

MS-SWIFT Server (Model Sharding with Tensor Parallelism)

Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:

# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh

# Stop server
./scripts/stop_ms_swift.sh

# Check logs
tail -f logs/hf_server.log

Configuration (cfg/extraction/inference/hf.yml):

  • deployment.port: Server port (default: 8001)
  • deployment.model: HuggingFace model ID or local path
  • deployment.result_path: Directory for inference logs (default: data/output/result)
  • deployment.vllm.tensor_parallel_size:
    • auto (default): Auto-detect all available GPUs
    • Or set manually: 1, 2, 4, etc.
  • deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
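
As an illustration of the tensor_parallel_size setting above, the following minimal sketch turns "auto" or an explicit number into a concrete GPU count. The function name and its validation rules are hypothetical and not part of STIndex; the deploy script may resolve the value differently.

```python
def resolve_tensor_parallel_size(setting, available_gpus):
    """Map a tensor_parallel_size setting to a concrete GPU count.

    Hypothetical helper (not STIndex's API): `setting` is "auto" or a
    positive integer, `available_gpus` is the number of visible GPUs.
    """
    if available_gpus < 1:
        raise ValueError("no GPUs available for tensor parallelism")
    if setting == "auto":
        # "auto" is assumed to mean: shard across every visible GPU.
        return available_gpus
    size = int(setting)
    if not 1 <= size <= available_gpus:
        raise ValueError(
            f"tensor_parallel_size must be between 1 and {available_gpus}"
        )
    return size
```

For example, resolve_tensor_parallel_size("auto", 4) yields 4, while an explicit value larger than the number of visible GPUs is rejected.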

Configuration

STIndex uses a hierarchical configuration structure organized by module:

Preprocessing Configs (cfg/preprocess/)

  • chunking.yml: Document chunking strategies

    • strategy: "sliding_window", "paragraph", "element_based", "semantic"
    • max_chunk_size: Maximum tokens per chunk (default: 1500)
    • overlap: Token overlap between chunks (default: 150)
  • parsing.yml: Document parsing settings

    • parsing_method: "unstructured" (recommended) or "simple"
    • Format-specific settings for PDF, HTML, DOCX
    • max_file_size_mb: Maximum file size (default: 50MB)
  • scraping.yml: Web scraping configuration

    • rate_limit: Seconds between requests (default: 2.0)
    • timeout: Request timeout (default: 30s)
    • cache.enabled: Enable response caching
    • robots.respect_robots_txt: Respect robots.txt rules
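
To make the interaction of max_chunk_size and overlap concrete, here is a minimal sketch of a sliding-window chunker using the defaults listed above. It is an assumption about how such a strategy typically works, not STIndex's implementation: the function name is hypothetical, and tokens are taken as an already-tokenized list rather than produced by a real tokenizer.

```python
def sliding_window_chunks(tokens, max_chunk_size=1500, overlap=150):
    """Split a token list into overlapping windows.

    Hypothetical illustration: each window holds at most `max_chunk_size`
    tokens and shares `overlap` tokens with the previous window.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")

    step = max_chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        # Stop once this window reaches the end of the input.
        if start + max_chunk_size >= len(tokens):
            break
    return chunks
```

With the defaults, each window advances by 1350 tokens, so a 3000-token document would be split into three chunks starting at tokens 0, 1350, and 2700.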

Extraction Configs (cfg/extraction/)

Inference Configs (cfg/extraction/inference/)

  • extract.yml: Main extraction configuration

    • llm.llm_provider: "hf", "openai", or "anthropic"
    • extraction.enable_cache: Cache extraction results
    • extraction.auto_save: Auto-save to data/output/yyyy-mm-dd/hh-mm-ss.json
    • extraction.min_confidence: Minimum confidence threshold (0.0-1.0)
    • Context-aware extraction settings
    • Post-processing toggles (reflection, OSM context, relative temporal resolution)
  • dimensions.yml: Multi-dimensional extraction definitions

    • temporal: ISO 8601 normalized dates (enabled by default)
    • spatial: Geocoded locations with parent regions (enabled by default)
    • event: Optional categorical dimension for event types (disabled by default)
    • entity: Optional categorical dimension for named entities (disabled by default)
    • Each dimension defines: enabled, extraction_type, schema_type, fields, examples
  • reflection.yml: Two-pass reflection settings

    • enabled: Enable LLM-based quality filtering (default: false)
    • thresholds: Relevance, accuracy, completeness, consistency scores
    • Context-aware reasoning for temporal/spatial consistency checks
    • Quality scoring with configurable weights
  • openai.yml: OpenAI API settings

    • model_name: "gpt-4o-mini", "gpt-4o", "gpt-4.1", etc.
    • temperature: Generation temperature (default: 0.0)
    • max_tokens: Maximum output tokens (default: 2048)
    • Requires: OPENAI_API_KEY environment variable
  • anthropic.yml: Anthropic Claude API settings

    • model_name: "claude-sonnet-4-5-20250929" (latest)
    • temperature: Generation temperature (default: 0.0)
    • max_tokens: Maximum output tokens (default: 2048)
    • Requires: ANTHROPIC_API_KEY environment variable
  • hf.yml: HuggingFace/MS-SWIFT server settings

    • Client config (llm): API endpoint and generation parameters
      • model_name: Model name as reported by server (e.g., "Qwen3-8B")
      • base_url: Server endpoint (e.g., "http://localhost:8001")
      • max_tokens: Maximum tokens per request (default: 32768)
    • Server config (deployment): Model deployment settings
      • model: HuggingFace model ID (e.g., "Qwen/Qwen3-8B")
      • port: Server port (default: 8001)
      • result_path: Inference log directory (null to disable)
      • vllm.tensor_parallel_size: GPU configuration (auto or number)
      • vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
      • vllm.max_model_len: Maximum sequence length (default: 32768)
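
The extraction.auto_save and extraction.min_confidence settings above can be pictured with the short sketch below. Both helpers are hypothetical: the path layout follows the yyyy-mm-dd/hh-mm-ss.json pattern from the list (a 24-hour clock is assumed), and entities are modelled as plain dicts with a "confidence" key, which may not match STIndex's actual result objects.

```python
from datetime import datetime
from pathlib import Path


def auto_save_path(now, output_root="data/output"):
    """Build <output_root>/yyyy-mm-dd/hh-mm-ss.json for a given timestamp."""
    day = now.strftime("%Y-%m-%d")
    clock = now.strftime("%H-%M-%S")  # assumption: 24-hour clock
    return Path(output_root) / day / f"{clock}.json"


def filter_by_confidence(entities, min_confidence=0.5):
    """Keep entities whose confidence meets the threshold.

    Assumption: each entity is a dict; a missing "confidence" counts as 0.0.
    """
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError("min_confidence must be between 0.0 and 1.0")
    return [e for e in entities if e.get("confidence", 0.0) >= min_confidence]
```

Under these assumptions, a run at 09:05:07 on 15 March 2022 would be saved to data/output/2022-03-15/09-05-07.json.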

Post-Processing Configs (cfg/extraction/postprocess/)

  • spatial.yml: Geocoding and spatial validation

    • geocoder: "nominatim" (free, OSM) or "google" (requires API key)
    • nominatim.rate_limit: Rate limiting (minimum 1.0 seconds for OSM)
    • cache.enabled: Cache geocoding results
    • disambiguation: Context-aware disambiguation settings
    • validation: Geocoding validation (min_confidence, max_distance_km)
  • temporal.yml: Temporal normalization

    • format: "iso8601" (default)
    • timezone.default: Default timezone (default: "UTC")
    • relative.handle_relative: Resolve relative dates (e.g., "Monday" → absolute date)
    • ranges.expand_intervals: Expand date ranges to start/end
    • validation: Year range validation (min_year: 1900, max_year: 2100)
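
As a rough illustration of relative.handle_relative and the year-range validation above, the sketch below resolves a weekday name against a reference date and checks the result against min_year and max_year. The helper names are hypothetical, and the rule used here (the most recent matching weekday on or before the reference date) is an assumption; STIndex may resolve relative expressions differently.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name, reference):
    """Return the latest date on or before `reference` that falls on `name`."""
    target = WEEKDAYS.index(name.lower())  # raises ValueError for unknown names
    days_back = (reference.weekday() - target) % 7
    return reference - timedelta(days=days_back)


def year_in_range(value, min_year=1900, max_year=2100):
    """Mirror the min_year/max_year validation bounds listed above."""
    return min_year <= value.year <= max_year


# Reference date taken from the example sentence earlier in this README.
document_date = date(2022, 3, 15)
resolved = resolve_weekday("Monday", document_date)
print(resolved.isoformat(), year_in_range(resolved))
```

The reference date would normally come from document metadata such as a publication date; without one, a relative expression like "Monday" cannot be resolved unambiguously.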

Evaluation Config (cfg/extraction/evaluation/)

  • evaluate.yml: Evaluation settings
    • dataset.path: Path to evaluation dataset
    • dataset.sample_limit: Limit number of chunks (null = all)
    • llm.llm_provider: LLM provider for evaluation
    • context_aware.enabled: Enable context-aware extraction
    • Post-processing settings for evaluation

Switching LLM Providers

Edit cfg/extraction/inference/extract.yml:

llm:
  llm_provider: hf  # or openai, anthropic

Or specify at runtime:

extractor = DimensionalExtractor(config_path="openai")
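
Because the openai and anthropic configs require an API key in the environment, it can help to check for it before switching providers. The helper below is a hypothetical sketch, not part of STIndex; it assumes the hf provider needs no key because it talks to a locally deployed server.

```python
import os

# Environment variable each provider needs (None = no key assumed).
REQUIRED_ENV = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "hf": None,  # assumption: local server, no API key
}


def missing_credentials(provider, env=None):
    """Return the name of the missing env var, or None if nothing is missing."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    variable = REQUIRED_ENV[provider]
    if variable is None or env.get(variable):
        return None
    return variable


missing = missing_credentials("openai")
if missing:
    print(f"Set {missing} before using the openai config")
```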

Quick Evaluation

# Sequential mode (default)
stindex evaluate

# With specific config
stindex evaluate --llm-config openai

# Limit samples
stindex evaluate --sample-limit 10

Output Structure

Results are organized by dataset and model:

data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv         # Detailed results
    └── eval_{timestamp}_{config}.summary.json # Aggregate metrics
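
Given the layout above, the aggregate metrics for every run can be collected with a few lines of standard-library Python. This is a sketch under stated assumptions: the function name is hypothetical, and the summary files are only assumed to be valid JSON, since their keys are not documented here.

```python
import json
from pathlib import Path


def load_summaries(root="data/output/evaluations"):
    """Group parsed *.summary.json files by their {dataset}-{model} directory."""
    summaries = {}
    for path in sorted(Path(root).glob("*/eval_*.summary.json")):
        with path.open(encoding="utf-8") as handle:
            summaries.setdefault(path.parent.name, []).append(json.load(handle))
    return summaries


# Print how many evaluation runs exist per dataset/model pair.
for run_name, runs in load_summaries().items():
    print(f"{run_name}: {len(runs)} summary file(s)")
```

If the evaluations directory does not exist yet, the glob matches nothing and the function returns an empty dict.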

TODOs

  • Backend server implementation
  • Data warehouse integration

License

MIT License
