A locality sensitive hashing implementation optimized for large data processing.

LLSHash-Py

Version: 1.0.0

LLSHash-Py is an enhanced Python implementation of Locality Sensitive Hashing (LSH) tailored for data deduplication in large-scale datasets, particularly for training large language models. Developed at Shahid Beheshti University, this package leverages modern industry-standard libraries and parallel processing techniques to ensure efficient and scalable performance.

Highlights

  • Direct Duplicate Finder: Quickly identify and manage duplicate data points.
  • Batch Processing: Improved performance through efficient batch indexing.
  • Parallel Processing: Utilize CPUs and clusters with Ray for scalable hashing.
  • Integration with Hugging Face Suite: Seamlessly integrate with Hugging Face libraries for advanced data handling.
  • Disk-Based Storage Management: Offload data to disk and reload it on demand to keep RAM usage manageable on very large datasets.
  • Unique Identifier Tracking: Track indexed points with unique identifiers, simplifying data point management in large datasets.
  • NumPy-Based Storage: Persist random matrices and hash tables as NumPy .npz archives.
  • Removed Redis Dependency: Streamlined storage management without Redis.
  • Modern Best Practices: Incorporates type annotations and removes support for deprecated packages.

Installation

LLSHash-Py depends on the following libraries:

  • numpy
  • ray
  • transformers
  • datasets

To install the package via pip:

pip install llshash-py

Note: Ensure that Ray is properly set up for parallel processing, especially if you intend to use cluster features.
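
Ray can be initialized explicitly before the hasher is constructed. A minimal sketch of both setups (the cluster address below is illustrative, not part of this package):

import ray

# Local machine: give Ray an explicit CPU budget. This should match
# the num_cpus value passed to LLSHash.
ray.init(num_cpus=4)

# To connect to an existing cluster instead (address is illustrative):
# ray.init(address="ray://head-node:10001")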

Quickstart

To create 6-bit hashes for input data of 8 dimensions and perform batch indexing:

from llshash import LLSHash
import numpy as np

# Initialize LLSHash
lsh = LLSHash(
    hash_size=6,
    input_dim=8,
    num_hashtables=2,
    matrices_filename='matrices.npz',
    hashtable_filename='hashtables.npz',
    overwrite=True,
    num_cpus=4
)

# Index a single data point
input_point = np.random.rand(8)
extra_data = 1
lsh.index(input_point, extra_data)

# Batch indexing
input_points = np.random.rand(1000, 8)
extra_data_batch = np.arange(1000)
lsh.index_batch(input_points, extra_data_batch)

# Find similar documents
similar_pairs = lsh.find_similar_documents()
print(similar_pairs)

# Save hash tables
lsh.save()

# Load hash tables
lsh.load_batch()

Detailed Usage

Initializing LLSHash

from llshash import LLSHash

lsh = LLSHash(
    hash_size=6,
    input_dim=8,
    num_hashtables=2,
    matrices_filename='matrices.npz',
    hashtable_filename='hashtables.npz',
    overwrite=True,
    num_cpus=4
)

Parameters:

  • hash_size: Length of the resulting binary hash (e.g., 6 bits).
  • input_dim: Dimension of the input vector (e.g., 8).
  • num_hashtables: Number of hash tables used for multiple lookups (default: 1).
  • matrices_filename: Path to the .npz file where random matrices are stored.
  • hashtable_filename: Path to the .npz file where hash tables are stored.
  • overwrite: Whether to overwrite existing matrix files (default: False).
  • num_cpus: Number of CPUs to use for parallel processing (default: 1).
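
The package's internals are not reproduced here, but these parameters map onto the classic random-projection LSH scheme: each of the num_hashtables tables draws a hash_size x input_dim matrix of random hyperplanes, and a point's hash is the sign pattern of its projections onto them. A minimal, self-contained sketch of that idea (not LLSHash-Py's actual code):

import numpy as np

rng = np.random.default_rng(0)
hash_size, input_dim, num_hashtables = 6, 8, 2

# One matrix of random hyperplanes per hash table.
planes = [rng.standard_normal((hash_size, input_dim))
          for _ in range(num_hashtables)]

def hash_point(point, table):
    # The 6-bit hash is the sign pattern of the point's projections
    # onto the chosen table's hyperplanes.
    projections = planes[table] @ point
    return "".join("1" if p > 0 else "0" for p in projections)

point = rng.random(input_dim)
print([hash_point(point, t) for t in range(num_hashtables)])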

Indexing Data Points

Single Indexing

import numpy as np

input_point = np.random.rand(8)
extra_data = 1
lsh.index(input_point, extra_data)

Batch Indexing

input_points = np.random.rand(1000, 8)
extra_data_batch = np.arange(1000)
lsh.index_batch(input_points, extra_data_batch)

Finding Similar Documents

similar_pairs = lsh.find_similar_documents()
print(similar_pairs)
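
The pairing logic is internal to the library, but in standard LSH any two points that land in the same bucket of any hash table become a candidate duplicate pair. A schematic of that idea (not the package's implementation):

from itertools import combinations

def candidate_pairs(hash_tables):
    # hash_tables: one dict per table, mapping hash string -> list of ids.
    pairs = set()
    for table in hash_tables:
        for ids in table.values():
            # Any two ids sharing a bucket form a candidate duplicate pair.
            pairs.update(combinations(sorted(ids), 2))
    return pairs

tables = [{"101100": [0, 3, 7], "010111": [1]},
          {"110010": [3, 7]}]
print(candidate_pairs(tables))  # {(0, 3), (0, 7), (3, 7)}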

Saving and Loading Hash Tables

# Save hash tables
lsh.save()

# Load hash tables
lsh.load_batch()
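
Both calls use the .npz paths supplied at construction (hashtable_filename above). Inspecting a saved archive needs only plain NumPy; this is generic NumPy usage, not a documented LLSHash-Py schema:

import numpy as np

# List the arrays stored in the saved hash-table archive.
# allow_pickle=True is a guess, in case object arrays are stored.
with np.load('hashtables.npz', allow_pickle=True) as archive:
    print(archive.files)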

Example with Hugging Face Libraries

The example below integrates Hugging Face's Transformers and Datasets libraries: each batch of texts is tokenized into fixed-length token-ID vectors, which are then indexed and deduplicated with LLSHash.

from transformers import AutoTokenizer
from datasets import Dataset
from llshash import LLSHash
import numpy as np
from pprint import pprint

# Function to tokenize and pad the content
def tokenize_and_pad(content_list, tokenizer, max_length=2048):
    tokenized_output = tokenizer(
        content_list,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='np',
    )
    return tokenized_output['input_ids']
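
# For a batch of N texts, tokenize_and_pad returns an integer array of
# shape (N, max_length); each row is fed to LLSHash as a 2048-dimensional
# point, matching input_dim below.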

# Initialize LLSHash
ht = LLSHash(
    hash_size=6,
    input_dim=2048,
    num_hashtables=14,
    matrices_filename='mat.npz',
    hashtable_filename='hash.npz',
    num_cpus=4
)

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('amirakhlaghiqqq/persian-llama2')

def mapper(batch):
    # Tokenize the batch into fixed-length ID vectors and index them.
    ted_and_ped = tokenize_and_pad(batch["text"], tokenizer)
    ht.index_batch(ted_and_ped, np.array(batch['index']))
    # Flush the current hash tables to disk and reset in-memory state.
    ht.save_batch()
    ht.restart()
    return batch

def list_to_hf_dataset_with_index(list_of_strings):
    data_dict = {
        "index": list(range(1, len(list_of_strings) + 1)),
        "text": list_of_strings
    }
    dataset = Dataset.from_dict(data_dict)
    return dataset

# Example usage
if __name__ == "__main__":
    list_of_strings = [
        # Your test set for manual inspection
    ]
    data = list_to_hf_dataset_with_index(list_of_strings)
    mapped_data = data.map(mapper, batched=True, batch_size=7)
    ht.load_batch()
    pairs = ht.find_similar_documents()
    print(pairs)
    pprint(ht.hash_tables)
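
Note the pattern inside mapper: each mapped batch is indexed, flushed to disk with save_batch(), and the in-memory tables are cleared with restart(); this appears to be how the package keeps RAM bounded over a long map. load_batch() then reassembles all flushed batches before the final pair search.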
