Spatiotemporal Index Extraction from Unstructured Text
STIndex - Spatiotemporal Information Extraction
STIndex is a multi-dimensional information extraction system that uses LLMs to extract temporal, spatial, and custom dimensional data from unstructured text. It provides an end-to-end pipeline covering preprocessing, extraction, and visualization.
Quick Start
Installation
pip install stindex
# Install spaCy language model (required for NER)
python -m spacy download en_core_web_sm
Basic Extraction
# Extract spatiotemporal entities
stindex extract "On March 15, 2022, a cyclone hit Broome, Western Australia."
# Use specific LLM provider
stindex extract "Text here..." --config openai # or anthropic, hf
End-to-End Pipeline
from stindex import InputDocument, STIndexPipeline
# Create input documents (URL, file, or text)
docs = [
    InputDocument.from_url("https://example.com/article"),
    InputDocument.from_file("/path/to/document.pdf"),
    InputDocument.from_text("Your text here"),
]
# Run full pipeline: preprocessing → extraction → warehouse → visualization
pipeline = STIndexPipeline(
    dimension_config="dimensions",
    output_dir="data/output",
    enable_warehouse=True,  # NEW in v0.6.0: Load data into warehouse
    warehouse_config="warehouse",
)
results = pipeline.run_pipeline(docs, load_to_warehouse=True)
# Automatically generates zip archive: data/visualizations/stindex_report_{timestamp}.zip
# Contains: HTML report + all plots, maps, and source files
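Once the pipeline has written its archive, the report can be inspected without unpacking it by hand. The following is a minimal sketch using only the standard library. It assumes the archives sit in `data/visualizations/` under the `stindex_report_*.zip` naming shown above; the helper names `latest_report` and `list_report_files` are hypothetical and not part of STIndex.

```python
import zipfile
from pathlib import Path
from typing import List, Optional


def latest_report(viz_dir: str = "data/visualizations") -> Optional[Path]:
    """Return the most recently modified stindex_report_*.zip, or None."""
    archives = sorted(
        Path(viz_dir).glob("stindex_report_*.zip"),
        key=lambda p: p.stat().st_mtime,
    )
    return archives[-1] if archives else None


def list_report_files(archive: Path) -> List[str]:
    """List the member names stored in a report archive."""
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


if __name__ == "__main__":
    report = latest_report()
    if report is None:
        print("No report archive found; run the pipeline first.")
    else:
        for name in list_report_files(report):
            print(name)
```

Sorting by modification time avoids relying on the exact timestamp format in the file name, which is not specified here.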
Python API (Direct Extraction)
from stindex import DimensionalExtractor
# Initialize with default config (cfg/extract.yml)
extractor = DimensionalExtractor()
# Or specify a config
extractor = DimensionalExtractor(config_path="openai")
# Extract entities
result = extractor.extract("March 15, 2022 in Broome, Australia")
# Access results
print(f"Temporal: {len(result.temporal_entities)} entities")
print(f"Spatial: {len(result.spatial_entities)} entities")
# Raw LLM output available for debugging
if result.extraction_config:
    cfg = result.extraction_config  # may be a dict or an object
    raw_output = cfg.get("raw_llm_output") if isinstance(cfg, dict) else cfg.raw_llm_output
    print(f"Raw output: {raw_output}")
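Because `extraction_config` may be either a dict or an object, the lookup can be wrapped in a small reusable helper. This is a sketch only: `get_raw_llm_output` is a hypothetical name, not an STIndex function, and the sample values are stand-ins for a real `result.extraction_config`.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_raw_llm_output(extraction_config: Any) -> Optional[str]:
    """Read raw_llm_output from a dict-like or attribute-style config."""
    if not extraction_config:
        return None
    if isinstance(extraction_config, dict):
        return extraction_config.get("raw_llm_output")
    # Fall back to attribute access; a missing attribute yields None.
    return getattr(extraction_config, "raw_llm_output", None)


# Stand-in values for illustration; a real call would pass result.extraction_config.
print(get_raw_llm_output({"raw_llm_output": "{...}"}))
print(get_raw_llm_output(SimpleNamespace(raw_llm_output="{...}")))
print(get_raw_llm_output(None))
```

Unlike the inline expression, the helper returns `None` rather than raising when the attribute is absent, which may be more convenient when debugging.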
Server Deployment
MS-SWIFT Server (Model Sharding with Tensor Parallelism)
Deploy a single MS-SWIFT server that uses all available GPUs via tensor parallelism:
# Deploy server (auto-detects GPUs by default)
./scripts/deploy_ms_swift.sh
# Stop server
./scripts/stop_ms_swift.sh
# Check logs
tail -f logs/hf_server.log
Configuration (cfg/hf.yml):
- deployment.port: Server port (default: 8001)
- deployment.model: HuggingFace model ID or local path
- deployment.result_path: Directory for inference logs (default: data/output/result)
- deployment.vllm.tensor_parallel_size:
  - auto (default): Auto-detect all available GPUs
  - Or set manually: 1, 2, 4, etc.
- deployment.vllm.gpu_memory_utilization: GPU memory fraction (default: 0.7)
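The dotted keys above describe nested settings under a `deployment` section. As a rough illustration of that nesting, the sketch below expands the dotted keys into a nested mapping. It assumes the dots correspond to nesting levels in cfg/hf.yml; `expand_dotted` is a hypothetical helper, and the model value is a placeholder rather than a documented default.

```python
from typing import Any, Dict


def expand_dotted(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {"a.b.c": 1} into {"a": {"b": {"c": 1}}}."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Defaults as documented above; the model value is a placeholder, not a real default.
hf_settings = expand_dotted({
    "deployment.port": 8001,
    "deployment.model": "<model-id-or-local-path>",
    "deployment.result_path": "data/output/result",
    "deployment.vllm.tensor_parallel_size": "auto",
    "deployment.vllm.gpu_memory_utilization": 0.7,
})
print(hf_settings["deployment"]["vllm"])
```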
Output Logs:
- Server logs: logs/hf_server.log
- Inference logs: data/output/result/{model_name}/deploy_result/{timestamp}.jsonl
Each inference log contains:
- response: Complete LLM output (including <think> tags and JSON)
- infer_request: Input messages and generation config
- generation_config: Sampling parameters used
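An inference log can be read back line by line to separate the model's reasoning from its structured output. The sketch below is an assumption-laden illustration: it presumes each line of the `.jsonl` file is one JSON record with a `response` string, and that the JSON payload follows the `<think>...</think>` block as bare JSON. The names `split_response` and `read_inference_log` are hypothetical, and the sample record is made up.

```python
import json
import re
from typing import Any, Dict, Iterator, Optional, Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(response: str) -> Tuple[str, Optional[Any]]:
    """Separate <think> reasoning from the JSON payload in a response string."""
    thoughts = "\n".join(m.strip() for m in THINK_BLOCK.findall(response))
    remainder = THINK_BLOCK.sub("", response).strip()
    try:
        payload = json.loads(remainder)
    except json.JSONDecodeError:
        payload = None  # the remainder was not bare JSON (e.g. fenced or truncated)
    return thoughts, payload


def read_inference_log(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line of a .jsonl inference log."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Made-up record for illustration only.
sample = {"response": '<think>Broome is a town.</think>{"spatial": ["Broome"]}'}
thoughts, payload = split_response(sample["response"])
print(thoughts)
print(payload)
```

If the server wraps the JSON in a Markdown fence or adds trailing text, `payload` comes back as `None` and the remainder would need extra cleanup before parsing.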
Configuration
Configuration files in cfg/:
- extract.yml: Main configuration (sets LLM provider)
- evaluate.yml: Evaluation settings
- dimensions.yml: Multi-dimensional extraction configuration
- warehouse.yml: Data warehouse configuration (connection, ETL, embeddings)
- openai.yml: OpenAI API settings (GPT-4)
- anthropic.yml: Anthropic API settings (Claude)
- hf.yml: HuggingFace/MS-SWIFT server settings
  - Client config (llm): API endpoint and generation parameters
  - Server config (deployment): Model deployment settings
    - result_path: Inference log directory (default: data/output/result)
    - vllm.tensor_parallel_size: GPU configuration (auto or number)
Switching Providers
Edit cfg/extract.yml:
llm:
  llm_provider: hf  # or openai, anthropic
Or specify at runtime:
extractor = DimensionalExtractor(config_path="openai")
Quick Evaluation
# Sequential mode (default)
stindex evaluate
# With specific config
stindex evaluate --llm-config openai
# Limit samples
stindex evaluate --sample-limit 10
Output Structure
Results are organized by dataset and model:
data/output/evaluations/
└── {dataset_name}-{model_name}/
    ├── eval_{timestamp}_{config}.csv             # Detailed results
    └── eval_{timestamp}_{config}.summary.json    # Aggregate metrics
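Given this layout, the most recent summary for each run directory can be collected with a few lines of standard-library Python. This is a sketch under assumptions: the helper names `latest_summary` and `summaries_by_run` are hypothetical, the contents of the summary JSON are not specified here and are loaded as-is, and "most recent" is judged by file modification time rather than by parsing the timestamp in the name.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def latest_summary(run_dir: str) -> Optional[Any]:
    """Load the newest eval_*.summary.json in one {dataset}-{model} directory."""
    summaries = sorted(
        Path(run_dir).glob("eval_*.summary.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not summaries:
        return None
    with summaries[-1].open(encoding="utf-8") as fh:
        return json.load(fh)


def summaries_by_run(root: str = "data/output/evaluations") -> Dict[str, Any]:
    """Map each run directory name to its newest summary (or None)."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}
    return {
        d.name: latest_summary(str(d))
        for d in sorted(root_path.iterdir())
        if d.is_dir()
    }


if __name__ == "__main__":
    for run_name, summary in summaries_by_run().items():
        print(run_name, summary)
```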
TODOs
- Backend server implementation
- Data warehouse integration
License
MIT License