# Terraphim Automata Python Bindings
Fast autocomplete and text processing library for knowledge graphs, powered by Rust.
## Features

- **Lightning Fast**: Built on Rust with Finite State Transducers (FST) and Aho-Corasick automata
- **Autocomplete**: Prefix-based search with sub-millisecond response times
- **Fuzzy Search**: Support for typos using Jaro-Winkler and Levenshtein distance
- **Text Processing**: Find and replace terms with automatic linking
- **Paragraph Extraction**: Extract relevant paragraphs based on term matches
- **Pythonic API**: Easy-to-use interface with type hints
- **Type Safe**: Full type stub support for IDEs and type checkers
## Installation

### From PyPI (Recommended)

```bash
pip install terraphim-automata
```

### From Source with uv

```bash
# Clone the repository
git clone https://github.com/terraphim/terraphim-ai.git
cd terraphim-ai/crates/terraphim_automata_py

# Install uv if you haven't already
pip install uv

# Build and install
uv pip install maturin
maturin develop
```
## Quick Start

### Building an Autocomplete Index

```python
from terraphim_automata import build_index

# Define a thesaurus with your terms
thesaurus_json = """{
    "name": "Engineering",
    "data": {
        "machine learning": {
            "id": 1,
            "nterm": "machine learning",
            "url": "https://example.com/ml"
        },
        "deep learning": {
            "id": 2,
            "nterm": "deep learning",
            "url": "https://example.com/dl"
        },
        "artificial intelligence": {
            "id": 3,
            "nterm": "artificial intelligence",
            "url": "https://example.com/ai"
        }
    }
}"""

# Build the index
index = build_index(thesaurus_json)

# Search for completions
results = index.search("mach")
for result in results:
    print(f"{result.term} (score: {result.score}, url: {result.url})")
```
### Fuzzy Search

```python
# Jaro-Winkler similarity (good for typos at the start)
results = index.fuzzy_search("machin lerning", threshold=0.8)

# Levenshtein distance (good for general typos)
results = index.fuzzy_search_levenshtein("machne", max_distance=2)
```
### Text Processing

```python
from terraphim_automata import find_all_matches, replace_with_links

text = "Machine learning and deep learning are subfields of artificial intelligence."

# Find all term matches
matches = find_all_matches(text, thesaurus_json)
for match in matches:
    print(f"Found '{match.term}' at position {match.pos}")

# Replace terms with markdown links
markdown = replace_with_links(text, thesaurus_json, "markdown")
print(markdown)
# Output: [machine learning](https://example.com/ml) and [deep learning](https://example.com/dl)
# are subfields of [artificial intelligence](https://example.com/ai).

# Or HTML links
html = replace_with_links(text, thesaurus_json, "html")

# Or wiki-style links
wiki = replace_with_links(text, thesaurus_json, "wiki")
```
### Paragraph Extraction

```python
from terraphim_automata import extract_paragraphs

document = """
Introduction to AI.

Machine learning is a subset of artificial intelligence that focuses on
developing systems that can learn from data. It has applications in various
fields including computer vision and natural language processing.

Deep learning is a specialized form of machine learning.
"""

# Extract paragraphs containing matched terms
paragraphs = extract_paragraphs(document, thesaurus_json)
for term, paragraph in paragraphs:
    print(f"\nTerm: {term}")
    print(f"Paragraph: {paragraph[:100]}...")
```
## API Reference

### Classes

#### AutocompleteIndex

The main index class for fast prefix searches.

**Properties:**

- `name: str` - Name of the thesaurus
- `len() -> int` - Number of terms in the index

**Methods:**

`search(prefix: str, max_results: int = 10, case_sensitive: bool = False) -> List[AutocompleteResult]`

Search for terms matching the prefix.

Parameters:

- `prefix` - The search prefix
- `max_results` - Maximum number of results (default: 10)
- `case_sensitive` - Whether the search is case-sensitive (default: False)
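For intuition, prefix search over a sorted term list can be sketched in pure Python with `bisect`. This is a toy stand-in for the FST the library actually uses, and `prefix_search` is a hypothetical helper, not part of the package:

```python
from bisect import bisect_left

def prefix_search(sorted_terms, prefix, max_results=10, case_sensitive=False):
    """Toy prefix lookup over a sorted list of (already lowercased) terms."""
    if not case_sensitive:
        prefix = prefix.lower()
    # All terms sharing a prefix form one contiguous run in sorted order,
    # so binary search finds the start of the run in O(log n).
    start = bisect_left(sorted_terms, prefix)
    results = []
    for term in sorted_terms[start:start + max_results]:
        if not term.startswith(prefix):
            break  # left the run of matching terms
        results.append(term)
    return results

terms = sorted(["machine learning", "deep learning", "artificial intelligence"])
print(prefix_search(terms, "mach"))  # ['machine learning']
```

An FST stores the same sorted key set as a shared-prefix automaton, which is why the real index answers these queries in sub-millisecond time with far less memory.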
`fuzzy_search(query: str, threshold: float = 0.8, max_results: int = 10) -> List[AutocompleteResult]`

Fuzzy search using Jaro-Winkler similarity.

Parameters:

- `query` - The search query
- `threshold` - Similarity threshold, 0.0-1.0 (default: 0.8)
- `max_results` - Maximum number of results (default: 10)
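To see what `threshold=0.8` is comparing against, here is a plain-Python sketch of Jaro-Winkler similarity. This is illustrative only; the library computes the metric in Rust:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: rewards shared characters within a match window."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Half-transpositions: matched characters that appear out of order.
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)
```

The prefix boost is why this metric is described above as "good for typos at the start": a typo late in the word ("machin" vs "machine") still scores well above the default 0.8 threshold.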
`fuzzy_search_levenshtein(query: str, max_distance: int = 2, max_results: int = 10) -> List[AutocompleteResult]`

Fuzzy search using Levenshtein distance.

Parameters:

- `query` - The search query
- `max_distance` - Maximum edit distance (default: 2)
- `max_results` - Maximum number of results (default: 10)
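`max_distance` bounds the number of single-character insertions, deletions, and substitutions between query and candidate. A minimal pure-Python Levenshtein implementation, shown only to illustrate the metric (not the library's internals):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string as the row
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            # min of: deletion, insertion, substitution/match
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

print(levenshtein("machne", "machine"))  # 1 edit: insert the missing "i"
```

With the default `max_distance=2`, a candidate term is kept only if it is within two such edits of the query.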
#### AutocompleteResult

Result from an autocomplete search.

Attributes:

- `term: str` - The matched term
- `normalized_term: str` - Normalized form of the term
- `id: int` - Term ID from the thesaurus
- `url: Optional[str]` - Associated URL
- `score: float` - Relevance score
#### Matched

A matched term found in text.

Attributes:

- `term: str` - The matched term
- `normalized_term: str` - Normalized form
- `id: int` - Term ID
- `url: Optional[str]` - Associated URL
- `pos: Optional[Tuple[int, int]]` - Match position (start, end)
### Functions

`build_index(json_str: str, case_sensitive: bool = False) -> AutocompleteIndex`

Build an autocomplete index from thesaurus JSON.

`load_thesaurus(json_str: str) -> Tuple[str, int]`

Load a thesaurus and return `(name, term_count)`.

`find_all_matches(text: str, json_str: str, return_positions: bool = True) -> List[Matched]`

Find all thesaurus term matches in text.

`replace_with_links(text: str, json_str: str, link_type: str) -> str`

Replace matched terms with links.

Link types:

- `"markdown"` - `[term](url)`
- `"html"` - `<a href="url">term</a>`
- `"wiki"` - `[[term]]`
- `"plain"` - `normalized_term`
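The four output shapes can be pinned down with a small standalone formatter. `format_link` is a hypothetical helper written only to document the formats, not part of the package:

```python
from typing import Optional

def format_link(term: str, url: Optional[str], link_type: str) -> str:
    # Reproduces the four link shapes listed above (illustration only).
    if link_type == "markdown":
        return f"[{term}]({url})"
    if link_type == "html":
        return f'<a href="{url}">{term}</a>'
    if link_type == "wiki":
        return f"[[{term}]]"
    return term  # "plain": just the term text, no markup

print(format_link("machine learning", "https://example.com/ml", "markdown"))
```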
`extract_paragraphs(text: str, json_str: str) -> List[Tuple[str, str]]`

Extract paragraphs containing matched terms.
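Conceptually, the behavior resembles this naive sketch: split on blank lines and keep a `(term, paragraph)` pair for every term found in a paragraph. The real implementation matches terms with Aho-Corasick and is much faster:

```python
def extract_paragraphs_naive(text, terms):
    """Naive stand-in: case-insensitive substring match per paragraph."""
    pairs = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        lowered = para.lower()
        for term in terms:
            if term.lower() in lowered:
                pairs.append((term, para))
    return pairs

doc = (
    "Introduction to AI.\n\n"
    "Machine learning is a subset of artificial intelligence.\n\n"
    "Deep learning is a specialized form of machine learning."
)
pairs = extract_paragraphs_naive(doc, ["machine learning", "deep learning"])
```

Note that a paragraph containing several terms yields one pair per term, as in the last paragraph above.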
## Thesaurus Format

The thesaurus JSON structure:

```json
{
  "name": "Thesaurus Name",
  "data": {
    "term to match": {
      "id": 1,
      "nterm": "normalized term",
      "url": "https://example.com/page"
    }
  }
}
```
Fields:

- `name` - Thesaurus name (required)
- `data` - Dictionary of terms (required)
  - Key: the term to match (case-insensitive)
  - `id` - Unique integer ID (required)
  - `nterm` - Normalized term form (required)
  - `url` - Associated URL (optional)
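Rather than hand-writing JSON strings, you can build the thesaurus as a plain dict and serialize it with the standard library (the names and URLs below are placeholders):

```python
import json

thesaurus = {
    "name": "Engineering",
    "data": {
        "machine learning": {
            "id": 1,
            "nterm": "machine learning",
            "url": "https://example.com/ml",
        },
        # "url" is optional and may be omitted
        "deep learning": {"id": 2, "nterm": "deep learning"},
    },
}

# Serialize for build_index(), find_all_matches(), etc.
thesaurus_json = json.dumps(thesaurus, indent=2)
```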
## Performance

Benchmarks on a modern laptop (Apple M1):
| Operation | Index Size | Time |
|---|---|---|
| Build index | 10,000 terms | ~50ms |
| Prefix search | 10,000 terms | ~0.1ms |
| Fuzzy search | 10,000 terms | ~5ms |
| Find matches | 100KB text | ~2ms |
| Replace links | 100KB text | ~3ms |
Run the benchmarks yourself:

```bash
cd crates/terraphim_automata_py
uv pip install pytest-benchmark
pytest python/benchmarks/ --benchmark-only
```
## Development

### Setup Development Environment

```bash
# Install uv
pip install uv

# Clone the repository
git clone https://github.com/terraphim/terraphim-ai.git
cd terraphim-ai/crates/terraphim_automata_py

# Install development dependencies
uv pip install maturin pytest pytest-benchmark pytest-cov black ruff mypy

# Build in development mode
maturin develop

# Run tests
pytest python/tests/ -v

# Run benchmarks
pytest python/benchmarks/ --benchmark-only

# Format code
black python/
ruff check python/ --fix

# Type check
mypy python/terraphim_automata/
```
### Project Structure

```
crates/terraphim_automata_py/
├── src/
│   └── lib.rs                  # Rust Python bindings (PyO3)
├── python/
│   ├── terraphim_automata/
│   │   ├── __init__.py         # Python package entry point
│   │   └── __init__.pyi        # Type stubs
│   ├── tests/                  # Pytest test suite
│   │   ├── test_autocomplete.py
│   │   ├── test_matcher.py
│   │   └── test_thesaurus.py
│   └── benchmarks/             # Performance benchmarks
│       ├── benchmark_autocomplete.py
│       └── benchmark_matcher.py
├── Cargo.toml                  # Rust dependencies
├── pyproject.toml              # Python package metadata
└── README.md                   # This file
```
## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure all tests pass: `pytest python/tests/ -v`
5. Format the code: `black python/ && ruff check python/ --fix`
6. Submit a pull request
## License

Apache License 2.0 - see LICENSE for details.
## Links

- Documentation: https://docs.terraphim.ai
- Repository: https://github.com/terraphim/terraphim-ai
- Issue Tracker: https://github.com/terraphim/terraphim-ai/issues
- PyPI: https://pypi.org/project/terraphim-automata/
- Terraphim AI: https://terraphim.ai

## Related Projects

- terraphim_automata - The Rust library this package wraps
- terraphim-ai - The full Terraphim AI system