Fast similarity search using forward-looking LSH

copyforward

Fast copy-forward compression for message threads. Detects repeated substrings across messages and replaces them with references to earlier occurrences, reducing storage requirements by 50-90%.

Perfect for chat logs, document histories, dataframes with missing values, and any sequence of texts with repeated content.

Quick Start

Python

import copyforward

# Basic usage with text messages
messages = ["Hello world", "Hello world, how are you?", "Hello world today"]

# Text API (exact by default)
cf = copyforward.CopyForwardText.from_texts(messages)
print(f"Compression ratio: {cf.compression_ratio():.2f}")
# Render with replacement text to visualize references
visualized = cf.render("[REF]")  # ['Hello world', '[REF], how are you?', '[REF] today']

# Handle missing values (perfect for dataframes!)
messages_with_none = ["Hello world", None, "Hello world again"]
cf_none = copyforward.CopyForwardText.from_texts(messages_with_none)
result = cf_none.render("[REF]")  # ['Hello world', None, '[REF] again']

# Approximate (faster) text compression
cf_fast = copyforward.CopyForwardText.from_texts(messages, exact_mode=False)
approx_result = cf_fast.render("[REF]")  # May find different compression patterns

# Token API with missing values
toks = [[10, 11, 12], None, [10, 11, 12, 13]]
cf_tok = copyforward.CopyForwardTokens.from_tokens(toks, exact_mode=True)
# Render with replacement tokens - only non-None entries returned
rendered_toks = cf_tok.render([999])  # [[10, 11, 12], [999, 13]]

# Tokenizer opt-in: build token-mode directly from texts and keep tokenizer for decoding
repeated_messages = ["Hello world from Alice", "Hello world from Alice again", "Alice says hi"]
cf_tok2 = copyforward.CopyForwardTokens.from_texts_with_tokenizer(repeated_messages, tokenizer="whitespace", exact_mode=True)
token_ids = cf_tok2.render([9999])         # List[List[int]] with replacement tokens
decoded = cf_tok2.render_texts("[REF]")    # Decoded text with replacements

Rust

use copyforward::{exact, approximate, Config};

// Basic usage
let messages = &["Hello world", "Hello world, how are you?"];
let compressed = exact(messages, Config::default());

// Handle missing values (Option types work seamlessly!)
let messages_with_none = &[Some("Hello world"), None, Some("Hello world again")];
let compressed = exact(messages_with_none, Config::default());

// Approximate compression: roughly 2x faster on large inputs
let compressed = approximate(messages, Config::default());

// Render back to original
let original = compressed.render_with(|_, _, _, text| text.to_string());

Algorithm Selection

Choose between two optimized algorithms:

Algorithm    Best for                                        Speed       Accuracy
Exact        < 1MB total text, perfect compression needed    Slower      Perfect
Approximate  > 1MB text, speed matters                       ~2x faster  Excellent

The approximate algorithm may split some long references into several shorter ones, but it still reaches about 95% of the exact algorithm's compression (see Performance).
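The size rule in the table can be turned into a small helper. This is only a sketch: pick_exact_mode and ONE_MB are hypothetical names that are not part of copyforward, and reading "1MB" as 1,000,000 UTF-8 bytes is an assumption.

```python
from typing import Optional, Sequence

# Assumption: "1MB" is read as 1,000,000 UTF-8 bytes; adjust to taste.
ONE_MB = 1_000_000


def pick_exact_mode(messages: Sequence[Optional[str]],
                    threshold_bytes: int = ONE_MB) -> bool:
    """Return True when the total text is small enough to favor exact mode."""
    # None entries carry no text, so they do not count toward the total.
    total = sum(len(m.encode("utf-8")) for m in messages if m is not None)
    return total < threshold_bytes


# Possible use with the Python API shown in Quick Start:
# cf = copyforward.CopyForwardText.from_texts(
#     messages, exact_mode=pick_exact_mode(messages))
```

If compression quality matters more than speed, exact mode may still be worth timing on larger inputs before settling on a threshold.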

Missing Value Support

Both Python and Rust APIs seamlessly handle missing/None values, making them perfect for dataframe compression:

Python

import copyforward

# DataFrame-like data with missing values
messages = [
    "User logged in",
    None,  # Missing log entry
    "User logged in successfully", 
    None,
    "User logged out"
]

cf = copyforward.CopyForwardText.from_texts(messages)
compressed = cf.render("[REF]")
# Result: ['User logged in', None, '[REF] successfully', None, 'User logged out']

# Token data with missing values
tokens = [[1, 2, 3], None, [1, 2, 3, 4]]
cf_tok = copyforward.CopyForwardTokens.from_tokens(tokens)

Rust

use copyforward::{exact, exact_tokens, Config};

// Mixed Option types work seamlessly
let messages = &[
    Some("User logged in"),
    None,
    Some("User logged in successfully")
];
let compressed = exact(messages, Config::default());

// Token sequences with None values
let tokens = &[
    Some(vec![1u32, 2u32, 3u32]),
    None,
    Some(vec![1u32, 2u32, 3u32, 4u32])
];
let compressed = exact_tokens(tokens, Config::default());

Installation

Python

# Prebuilt wheels are published on PyPI
pip install copyforward
# Or build from source with Python bindings enabled
pip install maturin
maturin develop --features python

# If you want named tokenizers (e.g., whitespace / HF), enable the bundle:
# maturin develop --features python-tokenizers

Rust

[dependencies]
copyforward = "0.2"

Advanced Usage

Python

import copyforward

# Custom configuration (text)
cf = copyforward.CopyForwardText.from_texts(
    messages,
    exact_mode=True,      # Perfect compression
    min_match_len=8,      # Only create refs for 8+ char matches
    lookback=100          # Only search previous 100 messages
)

# Get detailed segment information
segments = cf.segments()
for msg_segments in segments:
    for segment in msg_segments:
        if segment['type'] == 'reference':
            print(f"Reference to message {segment['message']}")
        else:
            print(f"Literal text: {segment['text']}")

# Render with custom replacement (useful for debugging and visualization)
redacted = cf.render("[REFERENCE]")  # Shows where references occur

# Tokenization (opt-in)
cf_tok = copyforward.CopyForwardTokens.from_texts_with_tokenizer(
    messages,
    tokenizer="whitespace",   # or feature-gated 'hf:<model>' / 'file:<path>'
    exact_mode=True,
)
token_ids = cf_tok.render([9999])  # Replace references with token 9999
texts = cf_tok.render_texts("[REF]")  # Decoded text with "[REF]" replacements
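To get a feel for what min_match_len and lookback control, here is a toy pure-Python model that does not use the library. It is a sketch under assumptions: render_refs and _mark are hypothetical helpers, matching is done with the standard library's difflib rather than copyforward's own algorithm, and the real output may differ in detail.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def _mark(text: str, earlier: List[str], marker: str, min_match_len: int) -> str:
    """Replace the longest block shared with any earlier message, then recurse."""
    best_size, best_start = 0, 0
    for prev in earlier:
        m = SequenceMatcher(None, prev, text, autojunk=False).find_longest_match(
            0, len(prev), 0, len(text))
        if m.size > best_size:
            best_size, best_start = m.size, m.b
    if best_size < min_match_len:
        return text  # too short to be worth a reference
    left = _mark(text[:best_start], earlier, marker, min_match_len)
    right = _mark(text[best_start + best_size:], earlier, marker, min_match_len)
    return left + marker + right


def render_refs(messages: List[str], marker: str = "[REF]",
                min_match_len: int = 8,
                lookback: Optional[int] = None) -> List[str]:
    """Show where references would land, searching only `lookback` earlier messages."""
    out = []
    for i, text in enumerate(messages):
        lo = 0 if lookback is None else max(0, i - lookback)
        out.append(_mark(text, messages[lo:i], marker, min_match_len))
    return out


logs = ["User logged in now", "x", "User logged in again"]
print(render_refs(logs))              # third entry becomes '[REF]again'
print(render_refs(logs, lookback=1))  # third entry stays literal: only 'x' is searched
```

In this model, a larger min_match_len suppresses short references, and a smaller lookback shrinks the search window at the cost of missing older repeats.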

Viewing generated Python docs

After building the Python extension with maturin, the PyO3 docstrings are available via Python's help system:

# Build and install the extension into your active venv
maturin develop --features python

python -c "import copyforward; help(copyforward.CopyForwardText)"
python -c "import copyforward; help(copyforward.CopyForwardTokens)"

This prints the docstring and usage information emitted by the PyO3 bindings.

Rust

use copyforward::{exact, Config, CopyForward};

// Custom configuration
let config = Config {
    min_match_len: 8,
    lookback: Some(100),  
    ..Config::default()
};

let compressed = exact(&messages, config);

// Get compression details
let segments = compressed.segments();
for (i, msg_segments) in segments.iter().enumerate() {
    println!("Message {}: {} segments", i, msg_segments.len());
}

// Custom rendering
let redacted = compressed.render_with_static("[REF]");

How It Works

Copy-forward compression works in two phases:

  1. Analysis: Scan messages to find repeated substrings using rolling hash indexing
  2. Compression: Replace each repeated substring with a reference to its first occurrence

Example:

Input:  ["Hello world", "Hello world today"]
Output: [[Literal("Hello world")], [Reference(0,0,11), Literal(" today")]]

This represents the second message as a reference to the entire first message plus the literal text " today".
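The example above can be reproduced with a short brute-force sketch. It is not the library's implementation: it swaps the rolling hash index for a plain nested scan, and the tuple encodings ("lit", text) and ("ref", message_index, start, length) are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical segment encodings for this sketch (not the library's types):
#   ("lit", text)  or  ("ref", message_index, start, length)


def longest_match(earlier: List[str], text: str, pos: int) -> Tuple[int, int, int]:
    """Return (message_index, start, length) of the longest earlier match at text[pos:]."""
    best = (-1, 0, 0)
    for idx, prev in enumerate(earlier):
        for start in range(len(prev)):
            length = 0
            while (start + length < len(prev)
                   and pos + length < len(text)
                   and prev[start + length] == text[pos + length]):
                length += 1
            if length > best[2]:
                best = (idx, start, length)
    return best


def compress(messages: List[str], min_match_len: int = 4) -> List[list]:
    """Greedily replace repeated substrings with references to earlier messages."""
    out = []
    for i, text in enumerate(messages):
        segments, literal, pos = [], "", 0
        while pos < len(text):
            idx, start, length = longest_match(messages[:i], text, pos)
            if length >= min_match_len:
                if literal:  # flush pending literal text before the reference
                    segments.append(("lit", literal))
                    literal = ""
                segments.append(("ref", idx, start, length))
                pos += length
            else:
                literal += text[pos]
                pos += 1
        if literal:
            segments.append(("lit", literal))
        out.append(segments)
    return out


def decompress(compressed: List[list]) -> List[str]:
    """Rebuild every message by resolving references against already decoded ones."""
    messages: List[str] = []
    for segments in compressed:
        parts = []
        for seg in segments:
            if seg[0] == "lit":
                parts.append(seg[1])
            else:
                _, idx, start, length = seg
                parts.append(messages[idx][start:start + length])
        messages.append("".join(parts))
    return messages


print(compress(["Hello world", "Hello world today"]))
# [[('lit', 'Hello world')], [('ref', 0, 0, 11), ('lit', ' today')]]
```

Because references only ever point at earlier messages, decoding can proceed front to back in a single pass.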

Missing Value Handling: None/null values are preserved in their original positions but skipped during compression analysis, ensuring perfect round-trip fidelity for dataframe-like data.
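One way to picture this behavior is a wrapper that runs a batch step over the present values only and then puts each result back in its original slot. The helper name apply_skipping_none is hypothetical, and the library may handle None internally in a different way.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def apply_skipping_none(values: Sequence[Optional[T]],
                        fn: Callable[[List[T]], List[U]]) -> List[Optional[U]]:
    """Run fn over the non-None values and restore None at its original positions."""
    present = [v for v in values if v is not None]
    processed = iter(fn(present))  # fn must return one result per input
    return [None if v is None else next(processed) for v in values]


rows = ["User logged in", None, "User logged out"]
print(apply_skipping_none(rows, lambda xs: [x.upper() for x in xs]))
# ['USER LOGGED IN', None, 'USER LOGGED OUT']
```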

Performance

Typical compression ratios:

  • Chat logs: 60-80% space savings
  • Code diffs: 70-90% space savings
  • Document versions: 50-80% space savings
  • Dataframes with missing values: 50-85% space savings (None values don't affect compression)

Speed comparison on 1MB of message data:

  • Exact: ~50ms, perfect compression
  • Approximate: ~25ms, 95% of perfect compression

Missing values add minimal overhead - compression speed remains constant regardless of None density.
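These figures depend on the data, so it is worth measuring on your own messages. The sketch below uses only the standard library. timed and space_savings are hypothetical helpers, and how space savings relates to the value returned by compression_ratio() is not specified here, so treat that mapping as an assumption.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


def space_savings(original_size: int, compressed_size: int) -> float:
    """Fraction of space saved: 0.75 means the output is 25% of the input size."""
    if original_size == 0:
        return 0.0
    return 1.0 - compressed_size / original_size


# Possible use with the Python API shown in Quick Start:
# cf, ms = timed(lambda: copyforward.CopyForwardText.from_texts(messages, exact_mode=False))
# print(f"approximate mode took {ms:.1f} ms")
```

A single run is noisy; repeating the call several times and taking the median gives a steadier number.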

Repository Structure

  • src/ — Rust library implementation
  • tests/ — Integration tests
  • benches/ — Performance benchmarks

Changelog

See CHANGELOG.md for detailed release notes.

License

MIT License - see LICENSE file for details.

Features and Builds

  • Default build has no Python or tokenizer dependencies, keeping Rust users lean.
  • Cargo features:
    • python: enables PyO3 and numpy for Python bindings.
    • tokenizers: enables integration with the tokenizers crate for named tokenizers.
    • hf-hub: adds optional support wiring for Hub-based tokenizers; current crate version does not implement hub loading at runtime.
    • Bundles: python-tokenizers, python-tokenizers-hub for convenience.

Python wheels

  • Build Python extension (bindings only):
    • maturin develop --features python
  • Build Python extension with tokenizer support:
    • maturin develop --features python-tokenizers
  • Hub bundle (compiles but hub loading not implemented in tokenizers v0.15):
    • maturin develop --features python-tokenizers-hub

Notes

  • Using tokenizer="whitespace" requires only the python feature.
  • Using tokenizer="hf:<model>" or tokenizer="file:<path>" requires tokenizers; hub loading by name is not implemented for the current tokenizers version. Load from a local tokenizer JSON using file:<path>.
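For intuition about what a whitespace tokenizer contributes to token mode, here is a minimal stand-in written without the library. build_vocab, encode, and decode are hypothetical names, and the ID assignment (first-seen order) is an assumption; the built-in "whitespace" tokenizer may number tokens differently.

```python
from typing import Dict, List, Optional, Sequence


def build_vocab(texts: Sequence[Optional[str]]) -> Dict[str, int]:
    """Assign each distinct whitespace-separated word an ID in first-seen order."""
    vocab: Dict[str, int] = {}
    for text in texts:
        if text is None:
            continue
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(texts: Sequence[Optional[str]], vocab: Dict[str, int]) -> List[Optional[List[int]]]:
    """Turn each text into a list of token IDs, keeping None in place."""
    return [None if t is None else [vocab[w] for w in t.split()] for t in texts]


def decode(ids: Sequence[int], vocab: Dict[str, int]) -> str:
    """Map token IDs back to words joined by single spaces."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)


texts = ["Hello world from Alice", "Hello world from Alice again", "Alice says hi"]
vocab = build_vocab(texts)
print(encode(texts, vocab))
# [[0, 1, 2, 3], [0, 1, 2, 3, 4], [3, 5, 6]]
```

The shared prefix [0, 1, 2, 3] is what a token-level pass can replace with a reference. Decoding joins words with single spaces, so the original spacing is not preserved in this sketch.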

Download files

Source Distribution

  • copyforward-0.2.1.tar.gz (48.4 kB): source

Built Distributions

  • copyforward-0.2.1-cp312-cp312-win_amd64.whl (197.3 kB): CPython 3.12, Windows x86-64
  • copyforward-0.2.1-cp312-cp312-manylinux_2_34_x86_64.whl (355.8 kB): CPython 3.12, manylinux (glibc 2.34+) x86-64
  • copyforward-0.2.1-cp312-cp312-macosx_11_0_arm64.whl (311.9 kB): CPython 3.12, macOS 11.0+ ARM64
  • copyforward-0.2.1-cp311-cp311-win_amd64.whl (199.0 kB): CPython 3.11, Windows x86-64
  • copyforward-0.2.1-cp311-cp311-manylinux_2_34_x86_64.whl (355.4 kB): CPython 3.11, manylinux (glibc 2.34+) x86-64
  • copyforward-0.2.1-cp311-cp311-macosx_11_0_arm64.whl (310.8 kB): CPython 3.11, macOS 11.0+ ARM64
  • copyforward-0.2.1-cp310-cp310-win_amd64.whl (198.4 kB): CPython 3.10, Windows x86-64
  • copyforward-0.2.1-cp310-cp310-manylinux_2_34_x86_64.whl (356.0 kB): CPython 3.10, manylinux (glibc 2.34+) x86-64
  • copyforward-0.2.1-cp310-cp310-macosx_11_0_arm64.whl (310.9 kB): CPython 3.10, macOS 11.0+ ARM64

File details

Details for the file copyforward-0.2.1.tar.gz.

File metadata

  • Download URL: copyforward-0.2.1.tar.gz
  • Upload date:
  • Size: 48.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for copyforward-0.2.1.tar.gz

Algorithm    Hash digest
SHA256       7c4f3127867cf43caf399b42949f1e2bdc418393d570541134ed0d52ef36e3f6
MD5          d40b415fd356d8af28a27036173375c5
BLAKE2b-256  a658717f60b05230e392da660c1f9ebb1db3e372ada26bbca5ff746fb4694153

Provenance

The following attestation bundles were made for copyforward-0.2.1.tar.gz:

Publisher: python-ci.yml on SeanTater/copyforward

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
