Python bindings for ChemFST - a high-performance chemical name search library

ChemFST

ChemFST is a high-performance chemical name search library. It uses Finite State Transducers (FSTs) to search the systematic and trivial names of chemical compounds in milliseconds. It is particularly useful for autocomplete features and for searching large chemical compound databases.

Features

  • Memory-efficient indexing using Finite State Transducers
  • Extremely fast prefix-based searches (autocomplete)
  • Case-insensitive substring searches
  • Memory-mapped file access for optimal performance
  • Simple API with just a few functions

Setup

Prerequisites

  • Rust 1.56.0 or higher
  • Cargo (comes with Rust)

Installation

Add this to your Cargo.toml:

[dependencies]
chemfst = "0.1.1"

Using the Library

Basic Usage

use chemfst::{build_fst_set, load_fst_set, prefix_search, substring_search};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Step 1: Create an index from a list of chemical names (one term per line)
    // Note: The .fst file is generated and not distributed with the package
    // The repository includes a sample data/chemical_names.txt with 32+ chemical names
    let input_path = "data/chemical_names.txt";
    let fst_path = "data/chemical_names.fst";
    build_fst_set(input_path, fst_path)?;

    // Step 2: Load the index into memory
    let set = load_fst_set(fst_path)?;

    // Step 3: Perform searches

    // Prefix search (autocomplete): find up to 10 terms starting with "acet"
    let prefix_results = prefix_search(&set, "acet", 10);
    println!("Prefix matches: {:?}", prefix_results);

    // Substring search: find up to 10 terms containing "enz"
    let substring_results = substring_search(&set, "enz", 10)?;
    println!("Substring matches: {:?}", substring_results);

    Ok(())
}

API Reference

Functions

build_fst_set(input_path: &str, fst_path: &str) -> Result<(), Box<dyn Error>>

Creates an FST set from a list of chemical names in a text file. The resulting .fst file is generated and not distributed with the package.

  • input_path: Path to a text file with one chemical name per line
  • fst_path: Path where the FST index will be saved
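
Since the input is a plain text file with one name per line, it can help to clean the list before indexing. The following is a minimal sketch using only the standard library; `normalize_names` and `to_input_file_contents` are hypothetical helpers, not part of ChemFST. The underlying fst crate requires keys to be inserted in lexicographic order. Whether build_fst_set sorts the input itself is not stated here, so sorting up front is treated as a harmless precaution.

```rust
/// Hypothetical helper (not part of ChemFST): turn raw text into a clean,
/// sorted, duplicate-free list with one chemical name per entry.
fn normalize_names(raw: &str) -> Vec<String> {
    let mut names: Vec<String> = raw
        .lines()
        .map(str::trim) // drop surrounding whitespace
        .filter(|line| !line.is_empty()) // skip blank lines
        .map(String::from)
        .collect();
    names.sort(); // byte-wise lexicographic order
    names.dedup(); // removes adjacent duplicates, which is enough after sorting
    names
}

/// Join the names back into the one-name-per-line input format.
fn to_input_file_contents(names: &[String]) -> String {
    let mut out = names.join("\n");
    if !out.is_empty() {
        out.push('\n');
    }
    out
}
```

The resulting string can be written to the path that is later passed as input_path.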

load_fst_set(fst_path: &str) -> Result<Set<Mmap>, Box<dyn Error>>

Loads a previously created FST set from disk using memory mapping.

  • fst_path: Path to the FST index file
  • Returns: A Result containing a memory-mapped FST Set

prefix_search(set: &Set<Mmap>, prefix: &str, max_results: usize) -> Vec<String>

Performs a prefix-based search (autocomplete).

  • set: The FST Set to search through
  • prefix: The prefix to search for
  • max_results: Maximum number of results to return
  • Returns: A vector of matching chemical names
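
To make the expected behavior concrete, here is a small reference model of a prefix search over a sorted set, written with the standard library only. It is not ChemFST's implementation (which uses an FST), and `prefix_search_model` is a hypothetical name. The model is case-sensitive and returns matches in lexicographic order; both are assumptions of this sketch, not documented guarantees of prefix_search.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Reference model of prefix search over a sorted set (not ChemFST's code).
/// Jumps to the first name >= `prefix`, then collects names while they
/// still start with `prefix`, stopping after `max_results` matches.
fn prefix_search_model(
    names: &BTreeSet<String>,
    prefix: &str,
    max_results: usize,
) -> Vec<String> {
    names
        .range::<str, _>((Bound::Included(prefix), Bound::Unbounded))
        .take_while(|name| name.starts_with(prefix))
        .take(max_results)
        .cloned()
        .collect()
}
```

A model like this can serve as an oracle in tests: for small inputs, its output can be compared with what the indexed search returns.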

substring_search(set: &Set<Mmap>, substring: &str, max_results: usize) -> Result<Vec<String>, Box<dyn Error>>

Performs a case-insensitive substring search.

  • set: The FST Set to search through
  • substring: The substring to search for
  • max_results: Maximum number of results to return
  • Returns: A Result containing a vector of matching chemical names
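
The semantics of a case-insensitive substring search can be sketched with a plain linear scan. This is only a reference model under the assumption that matching is done on lowercased text; it is not how ChemFST searches its index, and `substring_search_model` is a hypothetical name.

```rust
/// Reference model of a case-insensitive substring search (not ChemFST's code).
/// Lowercases both the query and each candidate, keeps the candidates that
/// contain the query, and stops after `max_results` matches.
fn substring_search_model(names: &[&str], substring: &str, max_results: usize) -> Vec<String> {
    let needle = substring.to_lowercase();
    names
        .iter()
        .filter(|name| name.to_lowercase().contains(&needle))
        .take(max_results)
        .map(|name| (*name).to_owned()) // return the name in its original casing
        .collect()
}
```

The scan visits every name, so it is only suitable for small lists or for checking results in tests.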

Development

Project Structure

  • src/lib.rs - Core library functionality
  • src/main.rs - Example binary that demonstrates the library
  • tests/ - Integration tests

Setting Up Development Environment

  1. Clone the repository:

    git clone <repository_url>
    cd chemfst
    
  2. Build the project:

    cargo build
    
  3. Run the example:

    cargo run
    

Running Tests

Run all tests:

cargo test

Adding New Tests

Add new integration tests to the tests/fst_search_tests.rs file or create additional test files in the tests directory.

Continuous Integration

The project uses GitHub Actions for continuous integration and testing across multiple platforms and Python versions.

GitHub Workflows

Rust CI (rust.yml)

  • Platforms: Ubuntu, macOS, Windows
  • Rust versions: stable, beta
  • Features: Build, test, clippy linting, format checking, code coverage

Python CI (python.yml)

  • Platforms: Ubuntu, macOS, Windows
  • Python versions: 3.11, 3.12, 3.13
  • Features:
    • Automated FST file generation from test data
    • Cross-platform testing
    • Example execution validation
    • Code coverage reporting

Local Validation

Before pushing changes, validate the workflow locally:

# Run the validation script
python scripts/validate_workflow.py

This script:

  • Creates test data files
  • Builds the Python package
  • Runs all tests
  • Validates examples work correctly

FST File Generation in CI

The workflows automatically create test data files, since FST files are not distributed with the package. The job for each platform creates the required data/chemical_names.txt with sample chemical names for testing.
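
The same preparation step can be reproduced locally. The sketch below, using only the standard library, creates the names file and its parent directory when they are missing. `ensure_sample_data` is a hypothetical helper, and the names passed to it are illustrative rather than the contents used by the workflows.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: create the names file (and its parent directory)
/// if it does not exist yet. Returns Ok(true) when a new file was written
/// and Ok(false) when an existing file was left untouched.
fn ensure_sample_data(path: &Path, names: &[&str]) -> io::Result<bool> {
    if path.exists() {
        return Ok(false);
    }
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    // One chemical name per line, with a trailing newline.
    let mut contents = names.join("\n");
    contents.push('\n');
    fs::write(path, contents)?;
    Ok(true)
}
```

Calling it with data/chemical_names.txt before building the index leaves an existing file in place, so a curated list is never overwritten.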

Contributing

Contributions are welcome! Here's how you can contribute:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/your-feature-name)
  3. Make your changes
  4. Run the tests (cargo test)
  5. Commit your changes (git commit -m 'Add some feature')
  6. Push to the branch (git push origin feature/your-feature-name)
  7. Open a Pull Request

Performance Considerations

  • FST sets are immutable. If your chemical database changes, you'll need to rebuild the index.
  • For large chemical databases, consider building the index as an offline process.
  • Memory-mapped files provide excellent performance, but changing or truncating the index file while it is mapped can corrupt reads or crash the process, so avoid rewriting it in place.
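
One common way to rebuild an index without touching a file that may still be mapped is to write the new version to a temporary path and then rename it over the old one. The sketch below shows that pattern with the standard library only; `replace_file_atomically` is a hypothetical helper, not part of ChemFST. Behavior is platform-dependent: on Unix-like systems an existing mapping typically keeps referring to the old file contents, while on Windows replacing a file that is currently mapped may fail.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: write the new contents next to `target`, then rename
/// the temporary file over it, so the final path never holds a partial file.
fn replace_file_atomically(target: &Path, contents: &[u8]) -> io::Result<()> {
    // Same directory as the target, so the rename stays on one filesystem.
    let tmp = target.with_extension("tmp");
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, target)
}
```

For an index, the same idea would be to point build_fst_set at a temporary path and rename the result into place, then have readers call load_fst_set again to pick up the new file.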

License

MIT License

Credits

This project uses the following key dependencies:

  • fst - Finite State Transducer implementation
  • memmap2 - Memory mapping functionality
