Spatiotemporal Index Extraction from Unstructured Text

STIndex - Spatiotemporal Information Extraction

Requires Python 3.9+. License: MIT.

STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.

🌐 Try the Demo Dashboard

Quick Start

Installation

pip install stindex

# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm

Basic Extraction

# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."

# Use specific LLM provider
stindex extract "Text here..." --config openai  # or anthropic, hf

End-to-End Pipeline

from stindex import InputDocument, STIndexPipeline

# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here")
]

# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
    enable_warehouse=True,  # NEW in v0.6.0: Load data into warehouse
    warehouse_config="warehouse"
)
results = pipeline.run_pipeline(docs, load_to_warehouse=True)
# Automatically generates zip archive: data/visualizations/stindex_report_{timestamp}.zip
# Contains: HTML report + all plots, maps, and source files
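
Once the pipeline has finished, the report archive can be inspected with the standard library alone. The following is a minimal sketch that assumes only the naming pattern shown above (stindex_report_{timestamp}.zip under data/visualizations/). The helper names latest_report and list_html_reports are hypothetical and are not part of STIndex.

```python
import zipfile
from pathlib import Path
from typing import List, Optional


def latest_report(viz_dir: str = "data/visualizations") -> Optional[Path]:
    """Return the most recently modified stindex_report_*.zip, or None."""
    archives = sorted(
        Path(viz_dir).glob("stindex_report_*.zip"),
        key=lambda p: p.stat().st_mtime,
    )
    return archives[-1] if archives else None


def list_html_reports(archive: Path) -> List[str]:
    """List the HTML files stored inside a report archive."""
    with zipfile.ZipFile(archive) as zf:
        return [name for name in zf.namelist() if name.lower().endswith(".html")]


if __name__ == "__main__":
    report = latest_report()
    if report is None:
        print("No report archive found; run the pipeline first.")
    else:
        print(report, list_html_reports(report))
```

Selecting by modification time avoids having to parse the timestamp in the file name, whose exact format is not documented here.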

Python API (Direct Extraction)

from stindex import DimensionalExtractor

# Initialize with default config (cfg/extract.yml)
extractor = DimensionalExtractor()

# Or specify a config
extractor = DimensionalExtractor(config_path="openai")

# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")

# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")

# Raw LLM output available for debugging
config = result.extraction_config
if config:
    # extraction_config may be a dict or an object, so handle both
    if isinstance(config, dict):
        raw_output = config.get("raw_llm_output")
    else:
        raw_output = config.raw_llm_output
    print(f"Raw output: {raw_output}")

Server Deployment

MS-SWIFT Server (Model Sharding with Tensor Parallelism)

Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:

# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh

# Stop server
./scripts/stop_ms_swift.sh

# Check logs
tail -f logs/hf_server.log

Configuration (cfg/hf.yml):

  • deployment.port: Server port (default: 8001)
  • deployment.model: HuggingFace model ID or local path
  • deployment.result_path: Directory for inference logs (default: data/output/result)
  • deployment.vllm.tensor_parallel_size:
    • auto (default): Auto-detect all available GPUs
    • Or set manually: 1, 2, 4, etc.
  • deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
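
The auto setting can be read as "use every visible GPU". As a rough illustration of that rule (not the deploy script's actual logic, which this README does not show), the sketch below resolves a tensor_parallel_size value against a GPU count supplied by the caller. The function name resolve_tensor_parallel_size and its validation rules are assumptions.

```python
from typing import Union


def resolve_tensor_parallel_size(value: Union[str, int], available_gpus: int) -> int:
    """Turn a tensor_parallel_size setting into a concrete GPU count.

    Hypothetical helper: "auto" maps to all available GPUs; an explicit
    number is accepted only if that many GPUs are available.
    """
    if available_gpus < 1:
        raise ValueError("at least one GPU is required")
    if isinstance(value, str) and value.strip().lower() == "auto":
        return available_gpus
    size = int(value)  # accepts 2 or "2"; raises ValueError on other strings
    if not 1 <= size <= available_gpus:
        raise ValueError(
            f"tensor_parallel_size={size} is outside 1..{available_gpus}"
        )
    return size
```

For example, resolve_tensor_parallel_size("auto", 4) yields 4, while an explicit 8 on a 4-GPU machine raises ValueError instead of failing later at model load time.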

Output Logs:

  • Server logs: logs/hf_server.log
  • Inference logs: data/output/result/{model_name}/deploy_result/{timestamp}.jsonl

Each inference log contains:

  • response: Complete LLM output (including <think> tags and JSON)
  • infer_request: Input messages and generation config
  • generation_config: Sampling parameters used
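
Because response holds the model's reasoning and its JSON answer in one string, the two usually need to be separated before analysis. The following is a minimal sketch, assuming that each line of the .jsonl file is one JSON object, that response is a string, that reasoning is wrapped in <think>...</think>, and that the remaining text contains a single JSON object. The function names are hypothetical.

```python
import json
import re
from pathlib import Path
from typing import Any, Dict, Iterator, Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response: str) -> Tuple[str, Optional[Any]]:
    """Separate <think> reasoning from the JSON payload in one response."""
    thoughts = "\n".join(t.strip() for t in THINK_RE.findall(response))
    remainder = THINK_RE.sub("", response)
    # Take the outermost {...} span; assumes one JSON object per response.
    start, end = remainder.find("{"), remainder.rfind("}")
    if start == -1 or end < start:
        return thoughts, None
    try:
        return thoughts, json.loads(remainder[start : end + 1])
    except json.JSONDecodeError:
        return thoughts, None


def iter_inference_log(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a .jsonl inference log."""
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

A typical loop would call split_response(record["response"]) for each record from iter_inference_log; a payload of None flags a response whose JSON could not be parsed and is worth inspecting by hand.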

Configuration

Configuration files in cfg/:

  • extract.yml: Main configuration (sets LLM provider)
  • evaluate.yml: Evaluation settings
  • dimensions.yml: Multi-dimensional extraction configuration
  • warehouse.yml: Data warehouse configuration (connection, ETL, embeddings)
  • openai.yml: OpenAI API settings (GPT-4)
  • anthropic.yml: Anthropic API settings (Claude)
  • hf.yml: HuggingFace/MS-SWIFT server settings
    • Client config (llm): API endpoint and generation parameters
    • Server config (deployment): Model deployment settings
      • result_path: Inference log directory (default: data/output/result)
      • vllm.tensor_parallel_size: GPU configuration (auto or number)

Switching Providers

Edit cfg/extract.yml:

llm:
  llm_provider: hf  # or openai, anthropic

Or specify at runtime:

extractor = DimensionalExtractor(config_path="openai")

Quick Evaluation

# Sequential mode (default)
stindex evaluate

# With specific config
stindex evaluate --llm-config openai

# Limit samples
stindex evaluate --sample-limit 10

Output Structure

Results are organized by dataset and model:

data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv         # Detailed results
    └── eval_{timestamp}_{config}.summary.json # Aggregate metrics
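
To compare runs, the summary files can be collected straight from this layout. The sketch below assumes only the directory and file naming shown above. The keys inside each summary JSON are not documented here, so they are loaded as-is, and load_summaries is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_summaries(root: str = "data/output/evaluations") -> List[Dict[str, Any]]:
    """Load every eval_*.summary.json under root, tagged with its run folder."""
    rows = []
    for path in sorted(Path(root).glob("*/eval_*.summary.json")):
        with path.open(encoding="utf-8") as fh:
            metrics = json.load(fh)
        # The parent folder name is "{dataset_name}-{model_name}".
        rows.append({"run": path.parent.name, "file": path.name, "metrics": metrics})
    return rows
```

The folder name is kept whole rather than split on "-", since dataset and model names may themselves contain hyphens.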

TODOs

  • Backend server implementation
  • Data warehouse integration

License

MIT License

Download files

Source distribution: stindex-1.0.1.tar.gz (143.0 kB)

  • SHA256: 5440253cef002025997856de0bb28f0b5720cabc866940a93659ac998f6c21f1
  • MD5: 86b84bf19f307b945eb0530c3a1042fb
  • BLAKE2b-256: 6c62381c799b08c9cf367dc6525510a645dbbfd7c20e38ad8e4675a12c55aa87

Built distribution: stindex-1.0.1-py3-none-any.whl (174.1 kB, Python 3)

  • SHA256: f0d96d7aaf525ef2d05c817c1409f235861f0071f87b518d0ee97fc786d0eaeb
  • MD5: 5dca65f08326328269fb6bea73962f16
  • BLAKE2b-256: 28896eeff1f4bf138a4d1b6dd6c1c20d4373d7f89c966804ca822448c0ded1d0

Both files were uploaded via twine/6.1.0 CPython/3.13.7 using Trusted Publishing.

Provenance

Attestation bundles for both files were made by publish.yml on MoeBuTa/STIndex. Attestation values reflect the state when the release was signed and may no longer be current.