Fast similarity search using forward-looking LSH

copyforward

Fast copy-forward compression for message threads. Detects repeated substrings across messages and replaces them with references to earlier occurrences, reducing storage requirements by 50-90%.

Perfect for chat logs, document histories, and any sequence of texts with repeated content.

Quick Start

Python

import copyforward

messages = ["Hello world", "Hello world, how are you?", "Hello world today"]

# Text API (exact by default)
cf = copyforward.CopyForwardText.from_texts(messages)
print(f"Compression ratio: {cf.compression_ratio():.2f}")
original = cf.render()  # ['Hello world', 'Hello world, how are you?', 'Hello world today']

# Approximate (faster) text compression
cf_fast = copyforward.CopyForwardText.from_texts(messages, exact_mode=False)
assert cf_fast.render() == messages

# Token API
toks = [[10, 11, 12], [10, 11, 12, 13]]
cf_tok = copyforward.CopyForwardTokens.from_tokens(toks, exact_mode=True)
assert cf_tok.render() == toks

# Tokenizer opt-in: build token-mode directly from texts and keep tokenizer for decoding
cf_tok2 = copyforward.CopyForwardTokens.from_texts_with_tokenizer(messages, tokenizer="whitespace", exact_mode=True)
token_ids = cf_tok2.render()         # List[List[int]]
roundtrip = cf_tok2.render_texts()   # Decoded via stored tokenizer
assert roundtrip == messages
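
To make the token round trip above concrete, here is a minimal pure-Python sketch of what a whitespace tokenizer could do. It does not use copyforward; `encode_whitespace` and `decode_whitespace` are hypothetical helpers written for illustration, and the library's own "whitespace" tokenizer may split and assign ids differently.

```python
from typing import Dict, List, Tuple


def encode_whitespace(texts: List[str]) -> Tuple[List[List[int]], List[str]]:
    """Map each space-separated word to an integer id (hypothetical helper)."""
    vocab: Dict[str, int] = {}
    id_to_word: List[str] = []
    encoded: List[List[int]] = []
    for text in texts:
        ids = []
        # Splitting on a single space keeps the round trip lossless.
        for word in text.split(" "):
            if word not in vocab:
                vocab[word] = len(id_to_word)
                id_to_word.append(word)
            ids.append(vocab[word])
        encoded.append(ids)
    return encoded, id_to_word


def decode_whitespace(encoded: List[List[int]], id_to_word: List[str]) -> List[str]:
    """Invert encode_whitespace by joining the words with single spaces."""
    return [" ".join(id_to_word[i] for i in ids) for ids in encoded]


messages = ["Hello world", "Hello world, how are you?", "Hello world today"]
token_ids, id_to_word = encode_whitespace(messages)
print(token_ids)  # [[0, 1], [0, 2, 3, 4, 5], [0, 1, 6]]
assert decode_whitespace(token_ids, id_to_word) == messages
```

Note that "world," and "world" get different ids in this sketch, so token-level matching can find fewer repeats than character-level matching on the same texts.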

Rust

use copyforward::{exact, approximate, Config};

let messages = &["Hello world", "Hello world, how are you?"];

// Exact compression - finds optimal matches
let compressed = exact(messages, Config::default());

// Fast approximate compression - roughly 2x faster on large texts
let compressed = approximate(messages, Config::default());

// Render back to original
let original = compressed.render_with(|_, _, _, text| text.to_string());

Algorithm Selection

Choose between two optimized algorithms:

Algorithm   | Best for                                     | Speed      | Accuracy
Exact       | < 1MB total text, perfect compression needed | Slower     | Perfect
Approximate | > 1MB text, speed matters                    | ~2x faster | Excellent

The approximate algorithm may split some long references but still achieves excellent compression ratios.
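
The size guidance above can be turned into a small helper. This is a sketch, not part of copyforward: `prefer_exact` is a hypothetical function, and the 1,000,000-byte cutoff is an assumption read off the "1MB" figure in the table.

```python
from typing import List

ONE_MB = 1_000_000  # assumed cutoff, taken from the "1MB" guidance above


def prefer_exact(messages: List[str], threshold_bytes: int = ONE_MB) -> bool:
    """Return True when the total UTF-8 size is below the threshold."""
    total = sum(len(m.encode("utf-8")) for m in messages)
    return total < threshold_bytes


small = ["Hello world", "Hello world today"]
large = ["x" * 600_000, "y" * 600_000]
print(prefer_exact(small))  # True
print(prefer_exact(large))  # False
```

The result could then be passed as the `exact_mode` argument shown in the Quick Start, for example `exact_mode=prefer_exact(messages)`.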

Installation

Python

pip install maturin
# Build the wheel with Python bindings enabled
maturin develop --features python

# If you want named tokenizers (e.g., whitespace / HF), enable the bundle:
# maturin develop --features python-tokenizers

Rust

[dependencies]
copyforward = "0.1"

Advanced Usage

Python

import copyforward

messages = ["Hello world", "Hello world, how are you?", "Hello world today"]

# Custom configuration (text)
cf = copyforward.CopyForwardText.from_texts(
    messages,
    exact_mode=True,      # Perfect compression
    min_match_len=8,      # Only create refs for 8+ char matches
    lookback=100          # Only search previous 100 messages
)

# Get detailed segment information
segments = cf.segments()
for msg_segments in segments:
    for segment in msg_segments:
        if segment['type'] == 'reference':
            print(f"Reference to message {segment['message']}")
        else:
            print(f"Literal text: {segment['text']}")

# Render with custom replacement (useful for debugging)
redacted = cf.render("[REFERENCE]")

# Tokenization (opt-in)
cf_tok = copyforward.CopyForwardTokens.from_texts_with_tokenizer(
    messages,
    tokenizer="whitespace",   # or feature-gated 'hf:<model>' / 'file:<path>'
    exact_mode=True,
)
token_ids = cf_tok.render()
texts = cf_tok.render_texts()   # Decoded via stored tokenizer
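
The segment loop above suggests a simple way to see how much of each message is stored literally. The sketch below runs on hand-written sample data shaped like the dictionaries in that loop; the `'literal'` type name and the sample values are assumptions for illustration, not output captured from the library.

```python
from typing import Any, Dict, List

Segment = Dict[str, Any]


def summarize_segments(segments: List[List[Segment]]) -> List[Dict[str, int]]:
    """Count references and literal characters for each message."""
    summary = []
    for msg_segments in segments:
        refs = sum(1 for s in msg_segments if s["type"] == "reference")
        literal_chars = sum(
            len(s["text"]) for s in msg_segments if s["type"] != "reference"
        )
        summary.append({"references": refs, "literal_chars": literal_chars})
    return summary


# Hand-written sample; the 'literal' type name is an assumption.
sample = [
    [{"type": "literal", "text": "Hello world"}],
    [{"type": "reference", "message": 0}, {"type": "literal", "text": " today"}],
]
summary = summarize_segments(sample)
print(summary)
# [{'references': 0, 'literal_chars': 11}, {'references': 1, 'literal_chars': 6}]
```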

Viewing generated Python docs

After building the Python extension with maturin, the PyO3 docstrings are available via Python's help system:

# Build and install the extension into your active venv
maturin develop --features python

python -c "import copyforward; help(copyforward.CopyForwardText)"
python -c "import copyforward; help(copyforward.CopyForwardTokens)"

This prints the docstring and usage information emitted by the PyO3 bindings.

Rust

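use copyforward::{exact, Config, CopyForward};

let messages = ["Hello world", "Hello world, how are you?"];

// Custom configuration
let config = Config {
    min_match_len: 8,     // Only create refs for 8+ char matches
    lookback: Some(100),  // Only search previous 100 messages
    ..Config::default()
};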

let compressed = exact(&messages, config);

// Get compression details
let segments = compressed.segments();
for (i, msg_segments) in segments.iter().enumerate() {
    println!("Message {}: {} segments", i, msg_segments.len());
}

// Custom rendering
let redacted = compressed.render_with_static("[REF]");

How It Works

Copy-forward compression works in two phases:

  1. Analysis: Scan messages to find repeated substrings using rolling hash indexing
  2. Compression: Replace repeated text with references to the first occurrence
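
The analysis phase can be illustrated with a generic polynomial rolling hash. This is a sketch of the general technique, not the crate's implementation: the base, modulus, window length, and all function names are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BASE = 257            # assumed hash base
MOD = (1 << 61) - 1   # assumed large prime modulus


def rolling_hashes(text: str, k: int) -> List[int]:
    """Hash every length-k window of text in a single left-to-right pass."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * BASE + ord(ch)) % MOD
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % MOD  # drop the oldest character
        h = (h * BASE + ord(text[i])) % MOD      # append the newest character
        hashes.append(h)
    return hashes


def build_index(text: str, k: int) -> Dict[int, List[int]]:
    """Map each window hash to the positions where that window starts."""
    index: Dict[int, List[int]] = defaultdict(list)
    for pos, h in enumerate(rolling_hashes(text, k)):
        index[h].append(pos)
    return index


def find_candidates(index: Dict[int, List[int]], earlier: str, later: str,
                    k: int) -> List[Tuple[int, int]]:
    """Return (pos_in_later, pos_in_earlier) pairs whose k-char windows match."""
    matches = []
    for pos, h in enumerate(rolling_hashes(later, k)):
        for cand in index.get(h, []):
            # Compare the actual text to rule out hash collisions.
            if earlier[cand:cand + k] == later[pos:pos + k]:
                matches.append((pos, cand))
    return matches


earlier = "Hello world"
later = "Hello world today"
index = build_index(earlier, 4)
matches = find_candidates(index, earlier, later, 4)
print(matches[:3])  # [(0, 0), (1, 1), (2, 2)]
```

Each matching window is only a seed; a compressor would extend seeds into the longest possible match before emitting a reference.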

Example:

Input:  ["Hello world", "Hello world today"]
Output: [[Literal("Hello world")], [Reference(0,0,11), Literal(" today")]]

This represents the second message as a reference to the entire first message plus the literal text " today".
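
The example above can be reproduced with a brute-force sketch in pure Python. It is deliberately naive (no rolling hash, no lookback limit) and is not the library's algorithm. The tuple layout and the reading of `Reference(0,0,11)` as (message index, start, length) are assumptions.

```python
from typing import List, Tuple, Union

Literal = Tuple[str, str]              # ("lit", text)
Reference = Tuple[str, int, int, int]  # ("ref", message_index, start, length)
Segment = Union[Literal, Reference]


def longest_match(earlier: List[str], text: str, pos: int) -> Tuple[int, int, int]:
    """Brute-force search for the longest earlier substring matching text[pos:]."""
    best = (-1, 0, 0)  # (message_index, start, length)
    for m, prev in enumerate(earlier):
        for start in range(len(prev)):
            length = 0
            while (start + length < len(prev) and pos + length < len(text)
                   and prev[start + length] == text[pos + length]):
                length += 1
            if length > best[2]:  # strict '>' keeps the first occurrence on ties
                best = (m, start, length)
    return best


def compress(messages: List[str], min_match_len: int = 4) -> List[List[Segment]]:
    """Replace repeats of min_match_len+ characters with references."""
    out: List[List[Segment]] = []
    for i, text in enumerate(messages):
        segments: List[Segment] = []
        literal = ""
        pos = 0
        while pos < len(text):
            m, start, length = longest_match(messages[:i], text, pos)
            if length >= min_match_len:
                if literal:
                    segments.append(("lit", literal))
                    literal = ""
                segments.append(("ref", m, start, length))
                pos += length
            else:
                literal += text[pos]
                pos += 1
        if literal:
            segments.append(("lit", literal))
        out.append(segments)
    return out


def render(compressed: List[List[Segment]]) -> List[str]:
    """Rebuild the original messages by resolving references in order."""
    texts: List[str] = []
    for segments in compressed:
        parts = []
        for seg in segments:
            if seg[0] == "lit":
                parts.append(seg[1])
            else:
                _, m, start, length = seg
                parts.append(texts[m][start:start + length])
        texts.append("".join(parts))
    return texts


messages = ["Hello world", "Hello world today"]
compressed = compress(messages)
print(compressed)
# [[('lit', 'Hello world')], [('ref', 0, 0, 11), ('lit', ' today')]]
assert render(compressed) == messages
```

Because references only ever point at earlier messages, rendering can proceed in a single forward pass, which is where the "copy-forward" name fits.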

Performance

Typical compression ratios:

  • Chat logs: 60-80% space savings
  • Code diffs: 70-90% space savings
  • Document versions: 50-80% space savings
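
As a worked example of what a space-savings percentage means, the sketch below applies the usual definition, one minus compressed size over original size, to the two-message example from How It Works. `space_savings` is a hypothetical helper, the calculation ignores the storage cost of the reference itself, and it is not necessarily how `compression_ratio()` is defined in the library.

```python
def space_savings(original_size: int, compressed_size: int) -> float:
    """Fraction of the original size that is no longer stored."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 1.0 - compressed_size / original_size


# "Hello world" (11 chars) + "Hello world today" (17 chars) = 28 chars.
original = len("Hello world") + len("Hello world today")
# Stored literals: "Hello world" (11) + " today" (6) = 17 chars.
stored = len("Hello world") + len(" today")

savings = space_savings(original, stored)
print(f"{savings:.1%}")  # 39.3%
```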

Speed comparison on 1MB of message data:

  • Exact: ~50ms, perfect compression
  • Approximate: ~25ms, 95% of perfect compression

Repository Structure

  • src/ — Rust library implementation
  • tests/ — Integration tests
  • benches/ — Performance benchmarks

License

MIT License - see LICENSE file for details.

Features and Builds

  • Default build has no Python or tokenizer dependencies, keeping Rust users lean.
  • Cargo features:
    • python: enables PyO3 and numpy for Python bindings.
    • tokenizers: enables integration with the tokenizers crate for named tokenizers.
    • hf-hub: adds optional support wiring for Hub-based tokenizers; current crate version does not implement hub loading at runtime.
    • Bundles: python-tokenizers, python-tokenizers-hub for convenience.

Python wheels

  • Build Python extension (bindings only):
    • maturin develop --features python
  • Build Python extension with tokenizer support:
    • maturin develop --features python-tokenizers
  • Hub bundle (compiles but hub loading not implemented in tokenizers v0.15):
    • maturin develop --features python-tokenizers-hub

Notes

  • Using tokenizer="whitespace" requires only the python feature.
  • Using tokenizer="hf:<model>" or tokenizer="file:<path>" requires tokenizers; hub loading by name is not implemented for the current tokenizers version. Load from a local tokenizer JSON using file:<path>.
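
The three tokenizer string forms mentioned above ("whitespace", "hf:<model>", "file:<path>") can be told apart with a small parser. This is a hypothetical sketch of how such a string might be interpreted, not the library's parsing code, and the example file name is made up.

```python
from typing import Tuple


def parse_tokenizer_spec(spec: str) -> Tuple[str, str]:
    """Split a tokenizer string into (kind, value); hypothetical helper."""
    if spec == "whitespace":
        return ("whitespace", "")
    # Split at the first colon only, so paths containing colons survive.
    kind, sep, value = spec.partition(":")
    if sep and kind in ("hf", "file") and value:
        return (kind, value)
    raise ValueError(f"unrecognized tokenizer spec: {spec!r}")


print(parse_tokenizer_spec("whitespace"))           # ('whitespace', '')
print(parse_tokenizer_spec("file:tokenizer.json"))  # ('file', 'tokenizer.json')
```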

Download files

Source Distribution

  • copyforward-0.1.0.tar.gz (46.0 kB): source

Built Distributions

  • copyforward-0.1.0-cp312-cp312-win_amd64.whl (193.0 kB): CPython 3.12, Windows x86-64
  • copyforward-0.1.0-cp312-cp312-manylinux_2_34_x86_64.whl (351.9 kB): CPython 3.12, manylinux (glibc 2.34+), x86-64
  • copyforward-0.1.0-cp312-cp312-macosx_11_0_arm64.whl (307.7 kB): CPython 3.12, macOS 11.0+, ARM64
  • copyforward-0.1.0-cp311-cp311-win_amd64.whl (195.4 kB): CPython 3.11, Windows x86-64
  • copyforward-0.1.0-cp311-cp311-manylinux_2_34_x86_64.whl (352.5 kB): CPython 3.11, manylinux (glibc 2.34+), x86-64
  • copyforward-0.1.0-cp311-cp311-macosx_11_0_arm64.whl (307.1 kB): CPython 3.11, macOS 11.0+, ARM64
  • copyforward-0.1.0-cp310-cp310-win_amd64.whl (195.3 kB): CPython 3.10, Windows x86-64
  • copyforward-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl (352.5 kB): CPython 3.10, manylinux (glibc 2.34+), x86-64
  • copyforward-0.1.0-cp310-cp310-macosx_11_0_arm64.whl (307.0 kB): CPython 3.10, macOS 11.0+, ARM64
