Fast similarity search using forward-looking LSH

Project description

copyforward

Fast copy-forward compression for message threads. Detects repeated substrings across messages and replaces them with references to earlier occurrences, typically reducing storage requirements by 50-90% on repetitive text.

Perfect for chat logs, document histories, dataframes with missing values, and any sequence of texts with repeated content.

Quick Start

Python

import copyforward

# Basic usage with text messages
messages = ["Hello world", "Hello world, how are you?", "Hello world today"]

# Text API (exact by default)
cf = copyforward.CopyForwardText.from_texts(messages)
print(f"Compression ratio: {cf.compression_ratio():.2f}")
# Render with replacement text to visualize references
visualized = cf.render("[REF]")  # ['Hello world', '[REF], how are you?', '[REF] today']

# Handle missing values (perfect for dataframes!)
messages_with_none = ["Hello world", None, "Hello world again"]
cf_none = copyforward.CopyForwardText.from_texts(messages_with_none)
result = cf_none.render("[REF]")  # ['Hello world', None, '[REF] again']

# Approximate (faster) text compression
cf_fast = copyforward.CopyForwardText.from_texts(messages, exact_mode=False)
approx_result = cf_fast.render("[REF]")  # May find different compression patterns

# Token API with missing values
toks = [[10, 11, 12], None, [10, 11, 12, 13]]
cf_tok = copyforward.CopyForwardTokens.from_tokens(toks, exact_mode=True)
# Render with replacement tokens - only non-None entries returned
rendered_toks = cf_tok.render([999])  # [[10, 11, 12], [999, 13]]

# Tokenizer opt-in: build token-mode directly from texts and keep tokenizer for decoding
repeated_messages = ["Hello world from Alice", "Hello world from Alice again", "Alice says hi"]
cf_tok2 = copyforward.CopyForwardTokens.from_texts_with_tokenizer(repeated_messages, tokenizer="whitespace", exact_mode=True)
token_ids = cf_tok2.render([9999])         # List[List[int]] with replacement tokens
decoded = cf_tok2.render_texts("[REF]")    # Decoded text with replacements

Rust

use copyforward::{exact, approximate, Config};

// Basic usage
let messages = &["Hello world", "Hello world, how are you?"];
let compressed = exact(messages, Config::default());

// Handle missing values (Option types work seamlessly!)
let messages_with_none = &[Some("Hello world"), None, Some("Hello world again")];
let compressed = exact(messages_with_none, Config::default());

// Fast approximate compression - 2x speed for large texts
let compressed = approximate(messages, Config::default()); 

// Render back to original
let original = compressed.render_with(|_, _, _, text| text.to_string());

Algorithm Selection

Choose between two optimized algorithms:

Algorithm    Best for                                        Speed       Accuracy
Exact        < 1MB total text, perfect compression needed    Slower      Perfect
Approximate  > 1MB text, speed matters                       ~2x faster  Excellent

The approximate algorithm may split some long references but still achieves excellent compression ratios.
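The size guideline in the table can be turned into a small helper. This is only a sketch: `choose_exact_mode` and `total_text_bytes` are hypothetical names, not part of copyforward, and reading "1MB" as 1,000,000 UTF-8 bytes is an assumption.

```python
from typing import Optional, Sequence

# Assumption: "1MB" in the table above is read as 1,000,000 UTF-8 bytes.
ONE_MB = 1_000_000


def total_text_bytes(messages: Sequence[Optional[str]]) -> int:
    """Total UTF-8 size of all non-missing messages."""
    return sum(len(m.encode("utf-8")) for m in messages if m is not None)


def choose_exact_mode(messages: Sequence[Optional[str]], threshold: int = ONE_MB) -> bool:
    """Return True (exact) below the threshold, False (approximate) otherwise."""
    return total_text_bytes(messages) < threshold
```

The returned flag could then be passed as the `exact_mode` argument shown in the Quick Start, for example `from_texts(messages, exact_mode=choose_exact_mode(messages))`.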

Missing Value Support

Both Python and Rust APIs seamlessly handle missing/None values, making them perfect for dataframe compression:

Python

import copyforward

# DataFrame-like data with missing values
messages = [
    "User logged in",
    None,  # Missing log entry
    "User logged in successfully", 
    None,
    "User logged out"
]

cf = copyforward.CopyForwardText.from_texts(messages)
compressed = cf.render("[REF]")
# Result: ['User logged in', None, '[REF] successfully', None, 'User logged out']

# Token data with missing values
tokens = [[1, 2, 3], None, [1, 2, 3, 4]]
cf_tok = copyforward.CopyForwardTokens.from_tokens(tokens)

Rust

use copyforward::{exact, exact_tokens, Config};

// Mixed Option types work seamlessly
let messages = &[
    Some("User logged in"),
    None,
    Some("User logged in successfully")
];
let compressed = exact(messages, Config::default());

// Token sequences with None values
let tokens = &[
    Some(vec![1u32, 2u32, 3u32]),
    None,
    Some(vec![1u32, 2u32, 3u32, 4u32])
];
let compressed = exact_tokens(tokens, Config::default());

Installation

Python

pip install maturin
# Build the wheel with Python bindings enabled
maturin develop --features python

# If you want named tokenizers (e.g., whitespace / HF), enable the bundle:
# maturin develop --features python-tokenizers

Rust

[dependencies]
copyforward = "0.2"

Advanced Usage

Python

import copyforward

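# Sample input; any list of strings (or None) works
messages = ["Hello world", "Hello world, how are you?", "Hello world today"]

# Custom configuration (text)
cf = copyforward.CopyForwardText.from_texts(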
    messages,
    exact_mode=True,      # Perfect compression
    min_match_len=8,      # Only create refs for 8+ char matches
    lookback=100          # Only search previous 100 messages
)

# Get detailed segment information
segments = cf.segments()
for msg_segments in segments:
    for segment in msg_segments:
        if segment['type'] == 'reference':
            print(f"Reference to message {segment['message']}")
        else:
            print(f"Literal text: {segment['text']}")

# Render with custom replacement (useful for debugging and visualization)
redacted = cf.render("[REFERENCE]")  # Shows where references occur

# Tokenization (opt-in)
cf_tok = copyforward.CopyForwardTokens.from_texts_with_tokenizer(
    messages,
    tokenizer="whitespace",   # or feature-gated 'hf:<model>' / 'file:<path>'
    exact_mode=True,
)
token_ids = cf_tok.render([9999])  # Replace references with token 9999
texts = cf_tok.render_texts("[REF]")  # Decoded text with "[REF]" replacements
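The segment dictionaries iterated over above can also be summarized. The sketch below works on plain dictionaries and does not call copyforward. It assumes only the keys used in the loop above ('type', 'message', 'text') and treats every non-reference segment as literal. The `summarize_segments` helper and the `sample` data are hypothetical.

```python
from collections import Counter


def summarize_segments(segments):
    """Count references per target message and total literal characters."""
    refs_per_target = Counter()
    literal_chars = 0
    for msg_segments in segments:
        if not msg_segments:  # tolerate empty or missing entries
            continue
        for seg in msg_segments:
            if seg["type"] == "reference":
                refs_per_target[seg["message"]] += 1
            else:
                literal_chars += len(seg["text"])
    return refs_per_target, literal_chars


# Hand-written sample in the assumed shape (not real library output).
sample = [
    [{"type": "literal", "text": "Hello world"}],
    [{"type": "reference", "message": 0}, {"type": "literal", "text": " today"}],
]
refs, literal_chars = summarize_segments(sample)
```

A high literal count relative to the input size suggests that few matches were found, which may be a hint to lower `min_match_len`.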

Viewing generated Python docs

After building the Python extension with maturin, the PyO3 docstrings are available via Python's help system:

# Build and install the extension into your active venv
maturin develop --features python

python -c "import copyforward; help(copyforward.CopyForwardText)"
python -c "import copyforward; help(copyforward.CopyForwardTokens)"

This prints the docstring and usage information emitted by the PyO3 bindings.

Rust

use copyforward::{exact, Config, CopyForward};

// Sample input
let messages = ["Hello world", "Hello world, how are you?"];

// Custom configuration
let config = Config {
    min_match_len: 8,
    lookback: Some(100),
    ..Config::default()
};

let compressed = exact(&messages, config);

// Get compression details
let segments = compressed.segments();
for (i, msg_segments) in segments.iter().enumerate() {
    println!("Message {}: {} segments", i, msg_segments.len());
}

// Custom rendering
let redacted = compressed.render_with_static("[REF]");

How It Works

Copy-forward compression works in two phases:

  1. Analysis: Scan messages to find repeated substrings using rolling hash indexing
  2. Compression: Replace repeated text with references to the first occurrence

Example:

Input:  ["Hello world", "Hello world today"]
Output: [[Literal("Hello world")], [Reference(0,0,11), Literal(" today")]]

This represents the second message as a reference to the entire first message plus the literal text " today".
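The two phases can be illustrated with a minimal pure-Python sketch. It is not the library's implementation. It swaps the rolling hash for a plain dictionary of fixed-length substrings, matches greedily, and only looks at earlier messages. The tuple shapes `("lit", text)` and `("ref", message, start, length)` are hypothetical, and reading `Reference(0,0,11)` as (message, start, length) is an assumption.

```python
from typing import Dict, List, Tuple


def compress(messages: List[str], min_match_len: int = 4) -> List[list]:
    """Greedy copy-forward sketch using a k-gram index of earlier messages."""
    if min_match_len < 1:
        raise ValueError("min_match_len must be at least 1")
    k = min_match_len
    index: Dict[str, Tuple[int, int]] = {}  # k-gram -> (message, start) of first occurrence
    out = []
    for m, text in enumerate(messages):
        segs = []
        lit_start = 0
        i = 0
        while i < len(text):
            hit = index.get(text[i:i + k]) if i + k <= len(text) else None
            if hit is None:
                i += 1
                continue
            src_m, src_i = hit
            src = messages[src_m]
            # Extend the match for as long as both strings agree.
            n = k
            while i + n < len(text) and src_i + n < len(src) and text[i + n] == src[src_i + n]:
                n += 1
            if lit_start < i:
                segs.append(("lit", text[lit_start:i]))
            segs.append(("ref", src_m, src_i, n))
            i += n
            lit_start = i
        if lit_start < len(text):
            segs.append(("lit", text[lit_start:]))
        out.append(segs)
        # Index this message only after encoding it, so references point backward.
        for j in range(len(text) - k + 1):
            index.setdefault(text[j:j + k], (m, j))
    return out


def decompress(encoded: List[list]) -> List[str]:
    """Rebuild the messages by resolving each reference against earlier output."""
    messages: List[str] = []
    for segs in encoded:
        parts = []
        for seg in segs:
            if seg[0] == "lit":
                parts.append(seg[1])
            else:
                _, src_m, start, length = seg
                parts.append(messages[src_m][start:start + length])
        messages.append("".join(parts))
    return messages


encoded = compress(["Hello world", "Hello world today"])
```

For the two-message input above, `encoded` holds one literal for the first message and a reference followed by the literal " today" for the second, and `decompress(encoded)` returns the original list.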

Missing Value Handling: None/null values are preserved in their original positions but skipped during compression analysis, ensuring perfect round-trip fidelity for dataframe-like data.
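One way to picture this behaviour is a wrapper that drops missing entries before analysis and puts them back afterwards. The sketch below is an assumption about the general idea, not the library's code, and `map_skipping_none` is a hypothetical helper.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_skipping_none(
    items: Sequence[Optional[T]],
    fn: Callable[[List[T]], List[R]],
) -> List[Optional[R]]:
    """Apply fn to the non-None items only, keeping None at its original position."""
    present = [(i, x) for i, x in enumerate(items) if x is not None]
    results = fn([x for _, x in present])
    if len(results) != len(present):
        raise ValueError("fn must return one result per input item")
    out: List[Optional[R]] = [None] * len(items)
    for (i, _), r in zip(present, results):
        out[i] = r
    return out


# Stand-in for a compressor: any function from a dense list to a dense list.
rendered = map_skipping_none(
    ["User logged in", None, "User logged in successfully"],
    lambda texts: [t.upper() for t in texts],
)
```

With this approach, any message indices produced by `fn` refer to the dense list, so a real implementation would need to map them back to the original positions.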

Performance

Typical compression ratios:

  • Chat logs: 60-80% space savings
  • Code diffs: 70-90% space savings
  • Document versions: 50-80% space savings
  • Dataframes with missing values: 50-85% space savings (None values don't affect compression)

Speed comparison on 1MB of message data:

  • Exact: ~50ms, perfect compression
  • Approximate: ~25ms, 95% of perfect compression

Missing values add minimal overhead - compression speed remains constant regardless of None density.

Repository Structure

  • src/ — Rust library implementation
  • tests/ — Integration tests
  • benches/ — Performance benchmarks

License

MIT License - see LICENSE file for details.

Features and Builds

  • Default build has no Python or tokenizer dependencies, keeping Rust users lean.
  • Cargo features:
    • python: enables PyO3 and numpy for Python bindings.
    • tokenizers: enables integration with the tokenizers crate for named tokenizers.
    • hf-hub: adds optional support wiring for Hub-based tokenizers; current crate version does not implement hub loading at runtime.
    • Bundles: python-tokenizers, python-tokenizers-hub for convenience.

Python wheels

  • Build Python extension (bindings only):
    • maturin develop --features python
  • Build Python extension with tokenizer support:
    • maturin develop --features python-tokenizers
  • Hub bundle (compiles but hub loading not implemented in tokenizers v0.15):
    • maturin develop --features python-tokenizers-hub

Notes

  • Using tokenizer="whitespace" requires only the python feature.
  • Using tokenizer="hf:<model>" or tokenizer="file:<path>" requires tokenizers; hub loading by name is not implemented for the current tokenizers version. Load from a local tokenizer JSON using file:<path>.

Download files

Source Distribution

  • copyforward-0.2.0.tar.gz (46.7 kB): Source

Built Distributions

  • copyforward-0.2.0-cp312-cp312-win_amd64.whl (196.2 kB): CPython 3.12, Windows x86-64
  • copyforward-0.2.0-cp312-cp312-manylinux_2_34_x86_64.whl (354.7 kB): CPython 3.12, manylinux: glibc 2.34+ x86-64
  • copyforward-0.2.0-cp312-cp312-macosx_11_0_arm64.whl (310.9 kB): CPython 3.12, macOS 11.0+ ARM64
  • copyforward-0.2.0-cp311-cp311-win_amd64.whl (197.9 kB): CPython 3.11, Windows x86-64
  • copyforward-0.2.0-cp311-cp311-manylinux_2_34_x86_64.whl (354.7 kB): CPython 3.11, manylinux: glibc 2.34+ x86-64
  • copyforward-0.2.0-cp311-cp311-macosx_11_0_arm64.whl (309.8 kB): CPython 3.11, macOS 11.0+ ARM64
  • copyforward-0.2.0-cp310-cp310-win_amd64.whl (197.4 kB): CPython 3.10, Windows x86-64
  • copyforward-0.2.0-cp310-cp310-manylinux_2_34_x86_64.whl (354.3 kB): CPython 3.10, manylinux: glibc 2.34+ x86-64
  • copyforward-0.2.0-cp310-cp310-macosx_11_0_arm64.whl (309.7 kB): CPython 3.10, macOS 11.0+ ARM64

File details

Details for the file copyforward-0.2.0.tar.gz.

File metadata

  • Download URL: copyforward-0.2.0.tar.gz
  • Upload date:
  • Size: 46.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for copyforward-0.2.0.tar.gz
Algorithm Hash digest
SHA256 ff196132d66e9bfa84e48c2a5dfa177a10e1ce817f98c17e7e290b4a95c0f468
MD5 53e712c63b26fff523b3a8f40a660461
BLAKE2b-256 26dbb11abb2984ead23c89a7e1ce660798ccc9b9081b62c7841efa011a4a1bcc
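A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch. The helper names are hypothetical, and the path in the usage note assumes the file sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("copyforward-0.2.0.tar.gz", "ff196132d66e9bfa84e48c2a5dfa177a10e1ce817f98c17e7e290b4a95c0f468")` should return True for an intact download of the source distribution.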

Provenance

The following attestation bundles were made for copyforward-0.2.0.tar.gz:

Publisher: python-ci.yml on SeanTater/copyforward

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file copyforward-0.2.0-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: copyforward-0.2.0-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 196.2 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for copyforward-0.2.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 2c4dba910f5782fe1afd6ae781df38d7033d60f90cd589a9be0de63a7731adea
MD5 0c5275d905982fb2c06bc9a1153609b2
BLAKE2b-256 a7504c46794cc50bc63e17f00d0db55e435d06f4427b7f3c325cb92c2b8d8921

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp312-cp312-win_amd64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp312-cp312-manylinux_2_34_x86_64.whl.

File hashes

Hashes for copyforward-0.2.0-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 88c08236abb03454eeaf5d60e5d48baff0e70cc17795215411cfb544f7b5b811
MD5 12a7de7ae5660358f382d774e41be2ef
BLAKE2b-256 05b294c477cb5458683e45a124868753aa78d9117ebabcf1eda5f128161899d1

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for copyforward-0.2.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0bcdd4fbd5ebd8de2a95b2a95ae3ef92b4a49ab48b2e671f9697ff56236cee47
MD5 b6c4a9578c7ac847ca85d4418846f3f3
BLAKE2b-256 40d2c15d0eecfaca39cfbe46f24827115bea857e306004ac8a6ee8108cfa77db

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp312-cp312-macosx_11_0_arm64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: copyforward-0.2.0-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 197.9 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for copyforward-0.2.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 e0bae99d3260ccee7325b26c0c8721f658e465a2784fcef69790fc1c75ccb0f7
MD5 942b2a14d01ff076aebddb7c6516dfc8
BLAKE2b-256 09f17cde9db86bbab9d80cd38a9737e31b0a8b69740e741fb93678223fcbb7a5

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp311-cp311-win_amd64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp311-cp311-manylinux_2_34_x86_64.whl.

File hashes

Hashes for copyforward-0.2.0-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 1c1bb3f9f69b75398346c833d56cb7ef5802301284d416844017a22cadebf1bb
MD5 7832108d0ef238cef1262e8b6232cef1
BLAKE2b-256 6c23046fec838b6169ad67e1f7ea96251627954a4feb10d42ff2d06fb66c9663

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp311-cp311-manylinux_2_34_x86_64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for copyforward-0.2.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e61b8d2db15d9d1f34dc7fef4ce868e9097ccb083e882f847519a6b6967176d9
MD5 137d2e439834fae54e59ba81d5b207a6
BLAKE2b-256 6c7b6774972b1cc9a04445f32a51ca01faccc5cf699932743790dce226315450

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp311-cp311-macosx_11_0_arm64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: copyforward-0.2.0-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 197.4 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for copyforward-0.2.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 3a0bae8dffe75d4a70faa5c26ddcb8f4b5243a1d5b985ffd27965725e280bbdb
MD5 a2ac67879959b209537fca6319a0969c
BLAKE2b-256 348947fa366851c0d7f5be8be968e9f34691b6d32eeca4ce0e455ce1c85197af

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp310-cp310-win_amd64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp310-cp310-manylinux_2_34_x86_64.whl.

File hashes

Hashes for copyforward-0.2.0-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 6158b80feb4b61b0abb344adbd43b752452f91f166f299174d3f993b140cdc7d
MD5 56f96146a447263c447b5ddca0018b8f
BLAKE2b-256 398a2ea8b63dd81f47185a7973b1b0337a54c81a7bc3881d9f7f51b199aa50c6

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp310-cp310-manylinux_2_34_x86_64.whl:

Publisher: python-ci.yml on SeanTater/copyforward

File details

Details for the file copyforward-0.2.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for copyforward-0.2.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5d5d66b692c5027f5738245794a7522735b2611112b0c0be3d90eee3003b3ba5
MD5 47ba4b156258191c05ea1ddeb301b92a
BLAKE2b-256 53ce05379e7d51ae7fff83a214d987b5475b1e6e584de7efd735826e37146448

Provenance

The following attestation bundles were made for copyforward-0.2.0-cp310-cp310-macosx_11_0_arm64.whl:

Publisher: python-ci.yml on SeanTater/copyforward
