A locality-sensitive hashing implementation optimized for large-scale data processing.
Project description
LLSHash-Py
Version: 1.0.0
LLSHash-Py is an enhanced Python implementation of Locality Sensitive Hashing (LSH) tailored for data deduplication in large-scale datasets, particularly for training large language models. Developed at Shahid Beheshti University, this package leverages modern industry-standard libraries and parallel processing techniques to ensure efficient and scalable performance.
Highlights
- Direct Duplicate Finder: Quickly identify and manage duplicate data points.
- Batch Processing: Improved performance through efficient batch indexing.
- Parallel Processing: Utilize CPUs and clusters with Ray for scalable hashing.
- Integration with Hugging Face Suite: Seamlessly integrate with Hugging Face libraries for advanced data handling.
- Disk-Based Storage Management: Offload and reload data to disk to manage RAM usage in gigantic datasets.
- Unique Identifier Tracking: Track indexed points with unique identifiers, simplifying data point management in large datasets.
- NumPy-Based Storage: Persist random matrices and hash tables in NumPy's .npz format.
- Removed Redis Dependency: Streamlined storage management without Redis.
- Modern Best Practices: Incorporates type annotations and removes support for deprecated packages.
Installation
LLSHash-Py depends on the following libraries:
- numpy
- ray
- transformers
- datasets
To install the package via pip:
pip install llshash-py
Note: Ensure that Ray is properly set up for parallel processing, especially if you intend to use cluster features.
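To illustrate the kind of parallel batch hashing the note above refers to, here is a minimal sketch that splits a batch across workers. It uses `concurrent.futures` purely so the example runs without a Ray cluster; LLSHash-Py itself uses Ray, and the hashing function here is an illustrative stand-in, not the package's internal implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Random hyperplanes shared by all workers (6-bit hash over 8-dim inputs,
# matching the Quickstart configuration below).
planes = np.random.default_rng(0).standard_normal((6, 8))

def hash_chunk(chunk: np.ndarray) -> np.ndarray:
    # Each worker hashes its own slice of the batch independently:
    # sign of each projection becomes one bit of the hash.
    return (chunk @ planes.T > 0).astype(np.uint8)

points = np.random.rand(1000, 8)
chunks = np.array_split(points, 4)              # one chunk per worker
with ThreadPoolExecutor(max_workers=4) as pool:
    hashed = np.vstack(list(pool.map(hash_chunk, chunks)))

print(hashed.shape)  # (1000, 6)
```

Because each chunk is hashed independently against the same shared matrices, the work parallelizes cleanly, which is the property Ray exploits when scaling this across a cluster.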
Quickstart
To create 6-bit hashes for input data of 8 dimensions and perform batch indexing:
from llshash import LLSHash
import numpy as np
# Initialize LLSHash
lsh = LLSHash(
hash_size=6,
input_dim=8,
num_hashtables=2,
matrices_filename='matrices.npz',
hashtable_filename='hashtables.npz',
overwrite=True,
num_cpus=4
)
# Index a single data point
input_point = np.random.rand(8)
extra_data = 1
lsh.index(input_point, extra_data)
# Batch indexing
input_points = np.random.rand(1000, 8)
extra_data_batch = np.arange(1000)
lsh.index_batch(input_points, extra_data_batch)
# Find similar documents
similar_pairs = lsh.find_similar_documents()
print(similar_pairs)
# Save hash tables
lsh.save()
# Load hash tables
lsh.load_batch()
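For intuition on what the Quickstart above is doing, here is a self-contained sketch of random-projection LSH, the general technique LLSHash-Py builds on (this is an illustrative reimplementation, not the package's actual code): project each input vector onto `hash_size` random hyperplanes and keep the sign of each projection as one bit of the hash.

```python
import numpy as np

rng = np.random.default_rng(42)
hash_size, input_dim = 6, 8
# One random hyperplane per hash bit.
planes = rng.standard_normal((hash_size, input_dim))

def lsh_hash(point: np.ndarray) -> str:
    # Sign of each dot product yields one bit; nearby points
    # tend to fall on the same side of each hyperplane.
    bits = (planes @ point) > 0
    return "".join("1" if b else "0" for b in bits)

a = rng.random(input_dim)
b = a + 1e-6 * rng.random(input_dim)  # a near-duplicate of `a`
print(lsh_hash(a), lsh_hash(b))
```

Near-duplicates land in the same hash bucket with high probability, so candidate duplicate pairs can be found by comparing only points that share a bucket rather than all pairs.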
Detailed Usage
Initializing LLSHash
from llshash import LLSHash
lsh = LLSHash(
hash_size=6,
input_dim=8,
num_hashtables=2,
matrices_filename='matrices.npz',
hashtable_filename='hashtables.npz',
overwrite=True,
num_cpus=4
)
Parameters:
- hash_size: Length of the resulting binary hash (e.g., 6 bits).
- input_dim: Dimension of the input vector (e.g., 8).
- num_hashtables: Number of hash tables used for multiple lookups (default: 1).
- matrices_filename: Path to the .npz file where random matrices are stored.
- hashtable_filename: Path to the .npz file where hash tables are stored.
- overwrite: Whether to overwrite existing matrix files (default: False).
- num_cpus: Number of CPUs to use for parallel processing (default: 1).
Indexing Data Points
Single Indexing
import numpy as np
input_point = np.random.rand(8)
extra_data = 1
lsh.index(input_point, extra_data)
Batch Indexing
input_points = np.random.rand(1000, 8)
extra_data_batch = np.arange(1000)
lsh.index_batch(input_points, extra_data_batch)
Finding Similar Documents
similar_pairs = lsh.find_similar_documents()
print(similar_pairs)
Saving and Loading Hash Tables
# Save hash tables
lsh.save()
# Load hash tables
lsh.load_batch()
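The `.npz` round-trip behind `matrices_filename` and `hashtable_filename` can be sketched with plain NumPy (`np.savez` / `np.load` are standard NumPy calls; the key names below are illustrative, not LLSHash-Py's actual internal keys):

```python
import numpy as np
import os
import tempfile

# Hypothetical contents: one random projection matrix per hash table.
matrices = {f"table_{i}": np.random.rand(6, 8) for i in range(2)}

path = os.path.join(tempfile.mkdtemp(), "matrices.npz")
np.savez(path, **matrices)  # each keyword becomes a named array in the archive

with np.load(path) as loaded:
    restored = {k: loaded[k] for k in loaded.files}

assert all(np.array_equal(matrices[k], restored[k]) for k in matrices)
```

Persisting to `.npz` is what lets the package offload hash tables to disk and reload them later, keeping RAM usage bounded on very large datasets.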
Example with Hugging Face Libraries
Integration with Hugging Face's Transformers and Datasets for advanced data handling.
from transformers import AutoTokenizer
from datasets import Dataset
from llshash import LLSHash
import numpy as np
from pprint import pprint
# Function to tokenize and pad the content
def tokenize_and_pad(content_list, tokenizer, max_length=2048):
tokenized_output = tokenizer(content_list, padding='max_length', truncation=True, max_length=max_length, return_tensors='np')
return tokenized_output['input_ids']
# Initialize LLSHash
ht = LLSHash(
hash_size=6,
input_dim=2048,
num_hashtables=14,
matrices_filename='mat.npz',
hashtable_filename='hash.npz',
num_cpus=4
)
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('amirakhlaghiqqq/persian-llama2')
def mapper(batch):
ted_and_ped = tokenize_and_pad(batch["text"], tokenizer)
ht.index_batch(ted_and_ped, np.array(batch['index']))
ht.save_batch()
ht.restart()
return batch
def list_to_hf_dataset_with_index(list_of_strings):
data_dict = {
"index": list(range(1, len(list_of_strings) + 1)),
"text": list_of_strings
}
dataset = Dataset.from_dict(data_dict)
return dataset
# Example usage
if __name__ == "__main__":
list_of_strings = [
# Your test set for manual inspection
]
data = list_to_hf_dataset_with_index(list_of_strings)
mapped_data = data.map(mapper, batched=True, batch_size=7)
ht.load_batch()
pairs = ht.find_similar_documents()
print(pairs)
pprint(ht.hash_tables)
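Once candidate pairs come back from `find_similar_documents()`, a typical follow-up step is to drop the later member of each pair from the dataset. The pair format below ((id, id) tuples of the `index` extra data) is an assumption for illustration, not the documented return type:

```python
# Hypothetical output shape: each tuple pairs two indices of near-duplicates.
pairs = [(1, 4), (2, 7), (4, 9)]

# Keep the first member of each pair, drop the second.
to_drop = {later for _, later in pairs}

all_ids = list(range(1, 11))
kept = [i for i in all_ids if i not in to_drop]
print(kept)  # -> [1, 2, 3, 5, 6, 8, 10]
```

In a real pipeline, `kept` would then be used to filter the Hugging Face dataset (e.g., via `Dataset.filter`) before training.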
File details
Details for the file llshash-package-0.1.0.tar.gz.
File metadata
- Download URL: llshash-package-0.1.0.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 79de9f7e31faee4fa3cd66622e87022e4b1b90ef4cfb8ba473711fab3f8fbab5 |
| MD5 | b12086c7c00357e5515b93b80a27d721 |
| BLAKE2b-256 | b223f9db851216bf409c9c53488dbbdfdb4ff2991f08fe524d9dae161c74bc2c |
File details
Details for the file llshash_package-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llshash_package-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcde45284dc5ab6ea8e2591c1086ddc0802dff4768d9f9f5f7743718e996b251 |
| MD5 | 726522ef80cde4d436d8d969d1d5f14f |
| BLAKE2b-256 | e85a05a53236a6deb6953f01b61c652c615fec633f5ff641067efd7251896693 |