A locality-sensitive hashing implementation optimized for large-scale data processing.
Project description
LLSHash-Py
Version: 1.0.0
LLSHash-Py is an enhanced Python implementation of Locality Sensitive Hashing (LSH) tailored for data deduplication in large-scale datasets, particularly for training large language models. Developed at Shahid Beheshti University, this package leverages modern industry-standard libraries and parallel processing techniques to ensure efficient and scalable performance.
Highlights
- Direct Duplicate Finder: Quickly identify and manage duplicate data points.
- Batch Processing: Improved performance through efficient batch indexing.
- Parallel Processing: Utilize CPUs and clusters with Ray for scalable hashing.
- Integration with Hugging Face Suite: Seamlessly integrate with Hugging Face libraries for advanced data handling.
- Disk-Based Storage Management: Offload and reload data to disk to manage RAM usage in gigantic datasets.
- Unique Identifier Tracking: Track indexed points with unique identifiers, simplifying data point management in large datasets.
- NumPy-Based Storage: Persist random matrices and hash tables in NumPy's .npz format.
- Removed Redis Dependency: Streamlined storage management without Redis.
- Modern Best Practices: Incorporates type annotations and removes support for deprecated packages.
Installation
LLSHash-Py depends on the following libraries:
- numpy
- ray
- transformers
- datasets
To install the package via pip:
pip install llshash-py
Note: Ensure that Ray is properly set up for parallel processing, especially if you intend to use cluster features.
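To illustrate the kind of parallel batch hashing the note above refers to, here is a minimal sketch that splits a batch across workers. It uses `concurrent.futures` purely so the example runs without a Ray cluster; LLSHash-Py itself uses Ray, and the hashing function here is an illustrative stand-in, not the package's internal implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Random hyperplanes shared by all workers (6-bit hash over 8-dim inputs,
# matching the Quickstart configuration below).
planes = np.random.default_rng(0).standard_normal((6, 8))

def hash_chunk(chunk: np.ndarray) -> np.ndarray:
    # Each worker hashes its own slice of the batch independently:
    # sign of each projection becomes one bit of the hash.
    return (chunk @ planes.T > 0).astype(np.uint8)

points = np.random.rand(1000, 8)
chunks = np.array_split(points, 4)              # one chunk per worker
with ThreadPoolExecutor(max_workers=4) as pool:
    hashed = np.vstack(list(pool.map(hash_chunk, chunks)))

print(hashed.shape)  # (1000, 6)
```

Because each chunk is hashed independently against the same shared matrices, the work parallelizes cleanly, which is the property Ray exploits when scaling this across a cluster.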
Quickstart
To create 6-bit hashes for input data of 8 dimensions and perform batch indexing:
from llshash import LLSHash
import numpy as np
# Initialize LLSHash
lsh = LLSHash(
hash_size=6,
input_dim=8,
num_hashtables=2,
matrices_filename='matrices.npz',
hashtable_filename='hashtables.npz',
overwrite=True,
num_cpus=4
)
# Index a single data point
input_point = np.random.rand(8)
extra_data = 1
lsh.index(input_point, extra_data)
# Batch indexing
input_points = np.random.rand(1000, 8)
extra_data_batch = np.arange(1000)
lsh.index_batch(input_points, extra_data_batch)
# Find similar documents
similar_pairs = lsh.find_similar_documents()
print(similar_pairs)
# Save hash tables
lsh.save()
# Load hash tables
lsh.load_batch()
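For intuition on what the Quickstart above is doing, here is a self-contained sketch of random-projection LSH, the general technique LLSHash-Py builds on (this is an illustrative reimplementation, not the package's actual code): project each input vector onto `hash_size` random hyperplanes and keep the sign of each projection as one bit of the hash.

```python
import numpy as np

rng = np.random.default_rng(42)
hash_size, input_dim = 6, 8
# One random hyperplane per hash bit.
planes = rng.standard_normal((hash_size, input_dim))

def lsh_hash(point: np.ndarray) -> str:
    # Sign of each dot product yields one bit; nearby points
    # tend to fall on the same side of each hyperplane.
    bits = (planes @ point) > 0
    return "".join("1" if b else "0" for b in bits)

a = rng.random(input_dim)
b = a + 1e-6 * rng.random(input_dim)  # a near-duplicate of `a`
print(lsh_hash(a), lsh_hash(b))
```

Near-duplicates land in the same hash bucket with high probability, so candidate duplicate pairs can be found by comparing only points that share a bucket rather than all pairs.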
Detailed Usage
Initializing LLSHash
from llshash import LLSHash
lsh = LLSHash(
hash_size=6,
input_dim=8,
num_hashtables=2,
matrices_filename='matrices.npz',
hashtable_filename='hashtables.npz',
overwrite=True,
num_cpus=4
)
Parameters:
- hash_size: Length of the resulting binary hash (e.g., 6 bits).
- input_dim: Dimension of the input vector (e.g., 8).
- num_hashtables: Number of hash tables used for multiple lookups (default: 1).
- matrices_filename: Path to the .npz file where random matrices are stored.
- hashtable_filename: Path to the .npz file where hash tables are stored.
- overwrite: Whether to overwrite existing matrix files (default: False).
- num_cpus: Number of CPUs to use for parallel processing (default: 1).
Indexing Data Points
Single Indexing
import numpy as np
input_point = np.random.rand(8)
extra_data = 1
lsh.index(input_point, extra_data)
Batch Indexing
input_points = np.random.rand(1000, 8)
extra_data_batch = np.arange(1000)
lsh.index_batch(input_points, extra_data_batch)
Finding Similar Documents
similar_pairs = lsh.find_similar_documents()
print(similar_pairs)
Saving and Loading Hash Tables
# Save hash tables
lsh.save()
# Load hash tables
lsh.load_batch()
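The `.npz` round-trip behind `matrices_filename` and `hashtable_filename` can be sketched with plain NumPy (`np.savez` / `np.load` are standard NumPy calls; the key names below are illustrative, not LLSHash-Py's actual internal keys):

```python
import numpy as np
import os
import tempfile

# Hypothetical contents: one random projection matrix per hash table.
matrices = {f"table_{i}": np.random.rand(6, 8) for i in range(2)}

path = os.path.join(tempfile.mkdtemp(), "matrices.npz")
np.savez(path, **matrices)  # each keyword becomes a named array in the archive

with np.load(path) as loaded:
    restored = {k: loaded[k] for k in loaded.files}

assert all(np.array_equal(matrices[k], restored[k]) for k in matrices)
```

Persisting to `.npz` is what lets the package offload hash tables to disk and reload them later, keeping RAM usage bounded on very large datasets.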
Example with Hugging Face Libraries
Integration with Hugging Face's Transformers and Datasets for advanced data handling.
from transformers import AutoTokenizer
from datasets import Dataset
from llshash import LLSHash
import numpy as np
from pprint import pprint
# Function to tokenize and pad the content
def tokenize_and_pad(content_list, tokenizer, max_length=2048):
tokenized_output = tokenizer(content_list, padding='max_length', truncation=True, max_length=max_length, return_tensors='np')
return tokenized_output['input_ids']
# Initialize LLSHash
ht = LLSHash(
hash_size=6,
input_dim=2048,
num_hashtables=14,
matrices_filename='mat.npz',
hashtable_filename='hash.npz',
num_cpus=4
)
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('amirakhlaghiqqq/persian-llama2')
def mapper(batch):
ted_and_ped = tokenize_and_pad(batch["text"], tokenizer)
ht.index_batch(ted_and_ped, np.array(batch['index']))
ht.save_batch()
ht.restart()
return batch
def list_to_hf_dataset_with_index(list_of_strings):
data_dict = {
"index": list(range(1, len(list_of_strings) + 1)),
"text": list_of_strings
}
dataset = Dataset.from_dict(data_dict)
return dataset
# Example usage
if __name__ == "__main__":
list_of_strings = [
# Your test set for manual inspection
]
data = list_to_hf_dataset_with_index(list_of_strings)
mapped_data = data.map(mapper, batched=True, batch_size=7)
ht.load_batch()
pairs = ht.find_similar_documents()
print(pairs)
pprint(ht.hash_tables)
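Once candidate pairs come back from `find_similar_documents()`, a typical follow-up step is to drop the later member of each pair from the dataset. The pair format below ((id, id) tuples of the `index` extra data) is an assumption for illustration, not the documented return type:

```python
# Hypothetical output shape: each tuple pairs two indices of near-duplicates.
pairs = [(1, 4), (2, 7), (4, 9)]

# Keep the first member of each pair, drop the second.
to_drop = {later for _, later in pairs}

all_ids = list(range(1, 11))
kept = [i for i in all_ids if i not in to_drop]
print(kept)  # -> [1, 2, 3, 5, 6, 8, 10]
```

In a real pipeline, `kept` would then be used to filter the Hugging Face dataset (e.g., via `Dataset.filter`) before training.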
File details
Details for the file llshash-package-0.1.0.tar.gz.
File metadata
- Download URL: llshash-package-0.1.0.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 79de9f7e31faee4fa3cd66622e87022e4b1b90ef4cfb8ba473711fab3f8fbab5 |
| MD5 | b12086c7c00357e5515b93b80a27d721 |
| BLAKE2b-256 | b223f9db851216bf409c9c53488dbbdfdb4ff2991f08fe524d9dae161c74bc2c |
File details
Details for the file llshash_package-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llshash_package-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcde45284dc5ab6ea8e2591c1086ddc0802dff4768d9f9f5f7743718e996b251 |
| MD5 | 726522ef80cde4d436d8d969d1d5f14f |
| BLAKE2b-256 | e85a05a53236a6deb6953f01b61c652c615fec633f5ff641067efd7251896693 |