Document deduplication package for DEEP
Project description
Document De-Duplication
A module for creating and indexing document character trigrams.
Installation
pip install deep_utils.deduplication
A use case scenario
Suppose we have an Elasticsearch index my-index set up in AWS with a vector named vector1 of size 10000.
from deep_utils.deduplication.utils import es_wrapper
from deep_utils.deduplication.vector_generator import create_trigram_vector
from deep_utils.deduplication.elasticsearch import search_similar, add_to_index
es = es_wrapper('<aws endpoint>', '<aws region>')
text_document: str = 'this is test document'
vector = create_trigram_vector('en', text_document)
similar_docs_resp = search_similar(10, ('vector1', vector), 'my-index', es)
total = similar_docs_resp['hits']['total']
max_score = similar_docs_resp['hits']['max_score']
docs_ids = [x['_id'] for x in similar_docs_resp['hits']['hits']]
docs_scores = [x['_score'] for x in similar_docs_resp['hits']['hits']]
# To add document to index
resp = add_to_index(doc_id='1', vectors=dict(vector1=vector), index_name='my-index', es=es)
has_error = resp['errors']
Motivation
Can be found here
Scripts
There are scripts that generate trigrams from leads (documents) in DEEP.
Modules
trigrams
The collection of relevant trigrams for the en, es and fr languages.
from deep_utils.deduplication.trigrams import en, es, fr
en_trigrams = en.trigrams # [' th', 'the', 'he ', ....]
es_trigrams = es.trigrams
fr_trigrams = fr.trigrams
NOTE: Each language's collection contains 10000 relevant trigrams, so the vector created will have dimension 10000.
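To illustrate how a trigram vocabulary turns text into a fixed-size vector, here is a minimal self-contained sketch using a toy four-trigram vocabulary (the real collections hold 10000 trigrams each; the helper name below is hypothetical, not part of the library's API):

```python
# Hypothetical miniature trigram vocabulary; the real `en.trigrams`
# list holds 10000 entries. Each trigram maps to a vector position.
trigrams = {' th': 0, 'the': 1, 'he ': 2, 'is ': 3}

def count_trigrams(text, vocab):
    """Count occurrences of each known trigram in `text` using a sliding window."""
    vector = [0] * len(vocab)
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        if tri in vocab:
            vector[vocab[tri]] += 1
    return vector

vec = count_trigrams('this is the test', trigrams)
print(len(vec), vec)  # 4 [1, 1, 1, 2]
```

The resulting vector's dimension always equals the vocabulary size, which is why the library's vectors have dimension 10000.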
utils
Consists of the following functions:
es_wrapper(endpoint: str, region: str, profile_name: str = 'default') -> Elasticsearch, a wrapper for creating an Elasticsearch object
remove_puncs_and_extra_spaces(text: str) -> str, used for preprocessing texts
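As a rough illustration of what such preprocessing typically does, here is a stdlib-only sketch; this is a guess at the behavior, not the library's actual implementation:

```python
import re
import string

def remove_puncs_and_extra_spaces(text):
    # Drop ASCII punctuation, collapse whitespace runs to single
    # spaces, and trim the ends. An assumed behavior, shown only to
    # illustrate the kind of cleanup done before trigram counting.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()

print(remove_puncs_and_extra_spaces('Hello,   world!  '))  # Hello world
```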
vector_generator
create_trigram_vector(lang: str, text: str) -> List[float]
create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]
normalize_count_vector(count_vector: List[int]) -> List[float]
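A small sketch of how count-vector normalization might look, assuming L2 normalization (the library's actual normalization scheme is not stated here, so this is an assumption):

```python
import math

def normalize_count_vector(count_vector):
    # One plausible normalization: divide by the L2 norm so that
    # documents of different lengths become comparable. This scheme
    # is assumed, not confirmed from the library's source.
    norm = math.sqrt(sum(c * c for c in count_vector))
    if norm == 0:
        return [0.0] * len(count_vector)
    return [c / norm for c in count_vector]

print(normalize_count_vector([3, 4]))  # [0.6, 0.8]
```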
elasticsearch
search_similar(similar_count: int, vector: Tuple[str, List[float]], index_name: str, es: Elasticsearch)
add_to_index(doc_id: int, vectors: Dict[str, List[float]], index_name: str, es: Elasticsearch)
index_exists(index_name: str, es: Es) -> bool
create_knn_vector_index(index_name: str, vector_size: int, es: Es, ignore_error: bool = False) -> Tuple[bool, ErrorString]
create_knn_vector_index_if_not_exists(index_name: str, vector_size: int, es: Es) -> Tuple[bool, ErrorString]
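For context, a k-NN similarity search against an AWS Elasticsearch/OpenSearch index with the k-NN plugin typically uses a query body like the one below. This is an assumption about the shape of the request that search_similar issues, not the library's confirmed internals:

```python
import json

def build_knn_query(similar_count, vector_name, vector):
    # Query body in the shape used by the OpenSearch k-NN plugin on
    # AWS; `vector_name` is the knn_vector field (e.g. 'vector1').
    return {
        'size': similar_count,
        'query': {
            'knn': {
                vector_name: {
                    'vector': vector,
                    'k': similar_count,
                }
            }
        },
    }

body = build_knn_query(10, 'vector1', [0.1, 0.2, 0.3])
print(json.dumps(body))
```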
Project details
Download files
Download the file for your platform.
Source Distribution
Built Distribution
Hashes for deep_utils.deduplication-1.0.8.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 0f57bb2a99a95c26add244a5ef751e6a9461b7279e23a7f278793e5c7a5cdc0f
MD5 | 964aceecdc9d437d5d3d72e9a332e248
BLAKE2b-256 | 77e4401f981c5914237c6055d090741b02a87fba87f485d60b175a29175952cb
Hashes for deep_utils.deduplication-1.0.8-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 5cfe6d60abea4d3f00728359db55ec9151c22ee5e27e930cd94c7a184bd5601f
MD5 | 9764f711aac4a992ee299a29dc9986bf
BLAKE2b-256 | 3892fd7168bd6383046b056b50c56561eb6d76b59af043021bf22840c767eefa