Document deduplication package for DEEP
Project description
Document De-Duplication
A module for creating and indexing document character trigrams.
Installation
pip install deep_utils.deduplication
A use case scenario
Suppose we have an Elasticsearch index my-index set up in AWS, with a vector named vector1 of size 10000:
from deep_utils.deduplication.utils import es_wrapper
from deep_utils.deduplication.vector_generator import create_trigram_vector
from deep_utils.deduplication.elasticsearch import search_similar, add_to_index
es = es_wrapper('<aws endpoint>', 'aws_region')
text_document: str = 'this is a test document'
vector = create_trigram_vector('en', text_document)
similar_docs_resp = search_similar(10, ('vector1', vector), 'my-index', es)
total = similar_docs_resp['hits']['total']
max_score = similar_docs_resp['hits']['max_score']
docs_ids = [x['_id'] for x in similar_docs_resp['hits']['hits']]
docs_scores = [x['_score'] for x in similar_docs_resp['hits']['hits']]
# To add document to index
resp = add_to_index(doc_id='1', vectors=dict(vector1=vector), index_name='my-index', es=es)
has_error = resp['errors']
Motivation
Can be found here
Scripts
There are scripts that generate trigrams from leads (documents) in DEEP.
Modules
trigrams
A collection of the relevant trigrams for the en, es and fr languages.
from deep_utils.deduplication.trigrams import en, es, fr
en_trigrams = en.trigrams # [' th', 'the', 'he ', ....]
es_trigrams = es.trigrams
fr_trigrams = fr.trigrams
NOTE: Each language's trigram list contains the 10000 most relevant trigrams, so the vectors created from it have dimension 10000.
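Each trigram list can be thought of as a fixed vocabulary that maps a trigram to a position in the output vector. The sketch below illustrates the idea with a toy three-entry vocabulary instead of the real 10000-entry en list; extract_trigrams and trigram_index are illustrative names, not part of the package:

```python
# Toy stand-in for deep_utils.deduplication.trigrams.en.trigrams;
# the real list holds 10000 entries, three are used here for illustration.
toy_trigrams = [' th', 'the', 'he ']

# Map each vocabulary trigram to its fixed position in the vector.
trigram_index = {tri: i for i, tri in enumerate(toy_trigrams)}

def extract_trigrams(text: str):
    """Slide a 3-character window over the text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(extract_trigrams(' the '))   # [' th', 'the', 'he ']
print(trigram_index['the'])        # 1
```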
utils
Consists of the following functions:
es_wrapper(endpoint: str, region: str, profile_name: str = 'default') -> Elasticsearch, a wrapper for creating an Elasticsearch object
remove_puncs_and_extra_spaces(text: str) -> str, used for preprocessing texts
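The package does not document the exact preprocessing rules, so the following is only a plausible sketch of what remove_puncs_and_extra_spaces might do (a hypothetical re-implementation, not the library's code): strip ASCII punctuation and collapse runs of whitespace.

```python
import re
import string

def remove_puncs_and_extra_spaces(text: str) -> str:
    # Drop ASCII punctuation, then collapse whitespace runs to single spaces.
    # Illustrative sketch only; the package's rules may differ.
    no_puncs = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', no_puncs).strip()

print(remove_puncs_and_extra_spaces('Hello,   world!!  '))  # 'Hello world'
```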
vector_generator
create_trigram_vector(lang: str, text: str) -> List[float]
create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]
normalize_count_vector(count_vector: List[int]) -> List[float]
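Reading the signatures above, these functions presumably fit together as: count each vocabulary trigram's occurrences in the preprocessed text, then normalize the counts. A self-contained sketch, assuming L1 normalization so the components sum to 1 (the package's actual normalization scheme may differ):

```python
from typing import Dict, List

def create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]:
    # `trigrams` maps each vocabulary trigram to its vector position.
    counts = [0] * len(trigrams)
    for i in range(len(processed_text) - 2):
        pos = trigrams.get(processed_text[i:i + 3])
        if pos is not None:
            counts[pos] += 1
    return counts

def normalize_count_vector(count_vector: List[int]) -> List[float]:
    # L1-normalize (an assumption for illustration).
    total = sum(count_vector) or 1
    return [c / total for c in count_vector]

vocab = {' th': 0, 'the': 1, 'he ': 2}   # toy 3-trigram vocabulary
counts = create_count_vector(' the theme ', vocab)
print(counts)                          # [2, 2, 1]
print(normalize_count_vector(counts))  # [0.4, 0.4, 0.2]
```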
elasticsearch
search_similar(similar_count: int, vector: Tuple[str, List[float]], index_name: str, es: Elasticsearch)
add_to_index(doc_id: int, vectors: Dict[str, List[float]], index_name: str, es: Elasticsearch)
index_exists(index_name: str, es: Es) -> bool
create_knn_vector_index(index_name: str, vector_size: int, es: Es, ignore_error: bool = False) -> Tuple[bool, ErrorString]
create_knn_vector_index_if_not_exists(index_name: str, vector_size: int, es: Es) -> Tuple[bool, ErrorString]
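From the signatures, create_knn_vector_index_if_not_exists presumably combines index_exists and create_knn_vector_index into one idempotent call. A sketch of that pattern against a dict-backed stand-in (FakeEs is purely illustrative and not the package's API; no real Elasticsearch cluster is involved):

```python
from typing import Tuple

ErrorString = str

class FakeEs:
    """Dict-backed stand-in for an Elasticsearch client (illustration only)."""
    def __init__(self):
        self.indices = {}

def index_exists(index_name: str, es: FakeEs) -> bool:
    return index_name in es.indices

def create_knn_vector_index(index_name: str, vector_size: int, es: FakeEs,
                            ignore_error: bool = False) -> Tuple[bool, ErrorString]:
    if index_name in es.indices:
        return (True, '') if ignore_error else (False, f'index {index_name} exists')
    es.indices[index_name] = {'type': 'knn_vector', 'dimension': vector_size}
    return True, ''

def create_knn_vector_index_if_not_exists(index_name: str, vector_size: int,
                                          es: FakeEs) -> Tuple[bool, ErrorString]:
    # Idempotent: only create when the index is missing.
    if index_exists(index_name, es):
        return True, ''
    return create_knn_vector_index(index_name, vector_size, es)

fake_es = FakeEs()
print(create_knn_vector_index_if_not_exists('my-index', 10000, fake_es))  # (True, '')
print(create_knn_vector_index_if_not_exists('my-index', 10000, fake_es))  # (True, '')
```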
Hashes for deep_utils.deduplication-1.0.7.tar.gz (source distribution)
Algorithm | Hash digest
---|---
SHA256 | 8e94db1d9dd2cc453bc423e9c7d7c636c31952bfad41737198ec9457c0d58d2c
MD5 | 3c4e74206541deeff907ac31807d0169
BLAKE2b-256 | 35464432543398444fb20b37b94dc6e17245e9dd0fba321ec66487d656b58188
Hashes for deep_utils.deduplication-1.0.7-py3-none-any.whl (built distribution)
Algorithm | Hash digest
---|---
SHA256 | c65b675fe93291af36782375addd77fad3b4dcad3ac712d35e25d8724f706563
MD5 | c1242fa574c9b3e2615cf6ae31a00a06
BLAKE2b-256 | 1ae1b24ac68346015c84da21e2518bd149239a24e56e9a29b72cba5f61787712