
Terraphim Automata Python Bindings


Fast autocomplete and text processing library for knowledge graphs, powered by Rust.

Features

  • ⚡ Lightning Fast: Built on Rust with Finite State Transducers (FST) and Aho-Corasick automata
  • 🔍 Autocomplete: Prefix-based search with sub-millisecond response times
  • 🎯 Fuzzy Search: Support for typos using Jaro-Winkler and Levenshtein distance
  • 📝 Text Processing: Find and replace terms with automatic linking
  • 📄 Paragraph Extraction: Extract relevant paragraphs based on term matches
  • 🐍 Pythonic API: Easy-to-use interface with type hints
  • 🔒 Type Safe: Full type stub support for IDEs and type checkers

Installation

From PyPI (Recommended)

pip install terraphim-automata

From Source with uv

# Clone the repository
git clone https://github.com/terraphim/terraphim-ai.git
cd terraphim-ai/crates/terraphim_automata_py

# Install uv if you haven't already
pip install uv

# Build and install
uv pip install maturin
maturin develop

Quick Start

Building an Autocomplete Index

from terraphim_automata import build_index

# Define a thesaurus with your terms
thesaurus_json = """{
    "name": "Engineering",
    "data": {
        "machine learning": {
            "id": 1,
            "nterm": "machine learning",
            "url": "https://example.com/ml"
        },
        "deep learning": {
            "id": 2,
            "nterm": "deep learning",
            "url": "https://example.com/dl"
        },
        "artificial intelligence": {
            "id": 3,
            "nterm": "artificial intelligence",
            "url": "https://example.com/ai"
        }
    }
}"""

# Build the index
index = build_index(thesaurus_json)

# Search for completions
results = index.search("mach")
for result in results:
    print(f"{result.term} (score: {result.score}, url: {result.url})")

Fuzzy Search

# Jaro-Winkler similarity (good for typos at the start)
results = index.fuzzy_search("machin lerning", threshold=0.8)

# Levenshtein distance (good for general typos)
results = index.fuzzy_search_levenshtein("machne", max_distance=2)
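For intuition about what max_distance means, here is a plain-Python Levenshtein distance (a reference sketch only; the library computes this in Rust, not with this code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    # Classic dynamic-programming formulation, kept to one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

# "machne" is one insertion away from "machine", so it falls
# well within max_distance=2 and would match.
```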

Text Processing

from terraphim_automata import find_all_matches, replace_with_links

text = "Machine learning and deep learning are subfields of artificial intelligence."

# Find all term matches
matches = find_all_matches(text, thesaurus_json)
for match in matches:
    print(f"Found '{match.term}' at position {match.pos}")

# Replace terms with markdown links
markdown = replace_with_links(text, thesaurus_json, "markdown")
print(markdown)
# Output: [machine learning](https://example.com/ml) and [deep learning](https://example.com/dl)
#         are subfields of [artificial intelligence](https://example.com/ai).

# Or HTML links
html = replace_with_links(text, thesaurus_json, "html")

# Or wiki-style links
wiki = replace_with_links(text, thesaurus_json, "wiki")

Paragraph Extraction

from terraphim_automata import extract_paragraphs

document = """
Introduction to AI.

Machine learning is a subset of artificial intelligence that focuses on
developing systems that can learn from data. It has applications in various
fields including computer vision and natural language processing.

Deep learning is a specialized form of machine learning.
"""

# Extract paragraphs containing matched terms
paragraphs = extract_paragraphs(document, thesaurus_json)
for term, paragraph in paragraphs:
    print(f"\nTerm: {term}")
    print(f"Paragraph: {paragraph[:100]}...")
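Conceptually, paragraph extraction splits the document on blank lines and keeps every paragraph containing a matched term. A simplified pure-Python sketch (plain substring matching stands in for the library's Aho-Corasick automaton):

```python
import re
from typing import List, Tuple

def extract_paragraphs_naive(text: str, terms: List[str]) -> List[Tuple[str, str]]:
    """Return (term, paragraph) pairs for every paragraph that
    contains a term, matched case-insensitively."""
    results = []
    # Paragraphs are runs of text separated by blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        lowered = para.lower()
        for term in terms:
            if term.lower() in lowered:
                results.append((term, para))
    return results

pairs = extract_paragraphs_naive(
    "Intro.\n\nMachine learning is a subset of AI.",
    ["machine learning"],
)
```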

API Reference

Classes

AutocompleteIndex

The main index class for fast prefix searches.

Properties:

  • name: str - Name of the thesaurus
  • len() -> int - Number of terms in the index

Methods:

search(prefix: str, max_results: int = 10, case_sensitive: bool = False) -> List[AutocompleteResult]

Search for terms matching the prefix.

Parameters:

  • prefix - The search prefix
  • max_results - Maximum number of results (default: 10)
  • case_sensitive - Whether search is case-sensitive (default: False)
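The FST-backed prefix search behaves like a binary search over a sorted term list: all completions of a prefix sit in one contiguous run. A pure-Python sketch of the same semantics using the standard library's bisect (illustrative, not the actual implementation):

```python
import bisect
from typing import List

def prefix_search(sorted_terms: List[str], prefix: str, max_results: int = 10) -> List[str]:
    """Return up to max_results terms starting with prefix.

    Assumes sorted_terms is sorted and lowercase (i.e. the
    case-insensitive default)."""
    prefix = prefix.lower()
    # All matches form one contiguous run starting at this index.
    start = bisect.bisect_left(sorted_terms, prefix)
    out = []
    for term in sorted_terms[start:start + max_results]:
        if not term.startswith(prefix):
            break  # past the end of the matching run
        out.append(term)
    return out

terms = sorted(["artificial intelligence", "deep learning", "machine learning"])
```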
fuzzy_search(query: str, threshold: float = 0.8, max_results: int = 10) -> List[AutocompleteResult]

Fuzzy search using Jaro-Winkler similarity.

Parameters:

  • query - The search query
  • threshold - Similarity threshold 0.0-1.0 (default: 0.8)
  • max_results - Maximum number of results (default: 10)
fuzzy_search_levenshtein(query: str, max_distance: int = 2, max_results: int = 10) -> List[AutocompleteResult]

Fuzzy search using Levenshtein distance.

Parameters:

  • query - The search query
  • max_distance - Maximum edit distance (default: 2)
  • max_results - Maximum number of results (default: 10)

AutocompleteResult

Result from autocomplete search.

Attributes:

  • term: str - The matched term
  • normalized_term: str - Normalized form of the term
  • id: int - Term ID from thesaurus
  • url: Optional[str] - Associated URL
  • score: float - Relevance score

Matched

A matched term found in text.

Attributes:

  • term: str - The matched term
  • normalized_term: str - Normalized form
  • id: int - Term ID
  • url: Optional[str] - Associated URL
  • pos: Optional[Tuple[int, int]] - Match position (start, end)

Functions

build_index(json_str: str, case_sensitive: bool = False) -> AutocompleteIndex

Build an autocomplete index from thesaurus JSON.

load_thesaurus(json_str: str) -> Tuple[str, int]

Load thesaurus and return (name, term_count).

find_all_matches(text: str, json_str: str, return_positions: bool = True) -> List[Matched]

Find all thesaurus term matches in text.

replace_with_links(text: str, json_str: str, link_type: str) -> str

Replace matched terms with links.

Link types:

  • "markdown" - [term](url)
  • "html" - <a href="url">term</a>
  • "wiki" - [[term]]
  • "plain" - normalized_term
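The four output formats can be summarized with a small formatting sketch (illustrative only; the real rendering happens in the Rust bindings):

```python
from typing import Optional

def render_link(term: str, url: Optional[str], link_type: str) -> str:
    """Render one matched term in the requested link style."""
    if link_type == "markdown":
        return f"[{term}]({url})"
    if link_type == "html":
        return f'<a href="{url}">{term}</a>'
    if link_type == "wiki":
        return f"[[{term}]]"
    if link_type == "plain":
        return term  # normalized term, no link markup
    raise ValueError(f"unknown link type: {link_type}")
```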

extract_paragraphs(text: str, json_str: str) -> List[Tuple[str, str]]

Extract paragraphs containing matched terms.

Thesaurus Format

The thesaurus JSON structure:

{
    "name": "Thesaurus Name",
    "data": {
        "term to match": {
            "id": 1,
            "nterm": "normalized term",
            "url": "https://example.com/page"
        }
    }
}

Fields:

  • name - Thesaurus name (required)
  • data - Dictionary of terms (required)
    • Key: Term to match (case-insensitive)
    • id: Unique integer ID (required)
    • nterm: Normalized term form (required)
    • url: Associated URL (optional)
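Rather than hand-writing the JSON string, you can build it from a plain dict with the standard library and pass the serialized result to the functions above:

```python
import json

# Build the thesaurus as a regular Python dict...
thesaurus = {
    "name": "Engineering",
    "data": {
        "machine learning": {
            "id": 1,
            "nterm": "machine learning",
            "url": "https://example.com/ml",  # optional field
        },
    },
}

# ...then serialize it once and reuse the string everywhere
# a json_str parameter is expected (build_index, find_all_matches, ...).
thesaurus_json = json.dumps(thesaurus)
```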

Performance

Benchmarks on a modern laptop (Apple M1):

Operation       Index Size      Time
Build index     10,000 terms    ~50ms
Prefix search   10,000 terms    ~0.1ms
Fuzzy search    10,000 terms    ~5ms
Find matches    100KB text      ~2ms
Replace links   100KB text      ~3ms

Run benchmarks yourself:

cd crates/terraphim_automata_py
uv pip install pytest-benchmark
pytest python/benchmarks/ --benchmark-only

Development

Setup Development Environment

# Install uv
pip install uv

# Clone repository
git clone https://github.com/terraphim/terraphim-ai.git
cd terraphim-ai/crates/terraphim_automata_py

# Install development dependencies
uv pip install maturin pytest pytest-benchmark pytest-cov black ruff mypy

# Build in development mode
maturin develop

# Run tests
pytest python/tests/ -v

# Run benchmarks
pytest python/benchmarks/ --benchmark-only

# Format code
black python/
ruff check python/ --fix

# Type check
mypy python/terraphim_automata/

Project Structure

crates/terraphim_automata_py/
├── src/
│   └── lib.rs              # Rust Python bindings (PyO3)
├── python/
│   ├── terraphim_automata/
│   │   ├── __init__.py     # Python package entry point
│   │   └── __init__.pyi    # Type stubs
│   ├── tests/              # Pytest test suite
│   │   ├── test_autocomplete.py
│   │   ├── test_matcher.py
│   │   └── test_thesaurus.py
│   └── benchmarks/         # Performance benchmarks
│       ├── benchmark_autocomplete.py
│       └── benchmark_matcher.py
├── Cargo.toml              # Rust dependencies
├── pyproject.toml          # Python package metadata
└── README.md               # This file

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass: pytest python/tests/ -v
  5. Format code: black python/ && ruff check python/ --fix
  6. Submit a pull request

License

Apache License 2.0 - See LICENSE for details.
