Document deduplication package for DEEP
Document De-Duplication
A module for creating and indexing document character trigrams.
Installation
pip install deep_utils.deduplication
A use case scenario
Suppose we have an Elasticsearch index named my-index set up in AWS, with a vector field named vector1 of size 10000.
from deep_utils.deduplication.utils import es_wrapper
from deep_utils.deduplication.vector_generator import create_trigram_vector
from deep_utils.deduplication.elasticsearch import search_similar, add_to_index
es = es_wrapper('<aws endpoint>', '<aws region>')
text_document: str = 'this is test document'
vector = create_trigram_vector('en', text_document)
similar_docs_resp = search_similar(10, ('vector1', vector), 'my-index', es)
total = similar_docs_resp['hits']['total']
max_score = similar_docs_resp['hits']['max_score']
docs_ids = [x['_id'] for x in similar_docs_resp['hits']['hits']]
docs_scores = [x['_score'] for x in similar_docs_resp['hits']['hits']]
# To add a document to the index
resp = add_to_index(doc_id='1', vectors=dict(vector1=vector), index_name='my-index', es=es)
has_error = resp['errors']
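The response handling above can be wrapped in a small helper. The following is a minimal sketch, not part of the package: `summarize_hits` and `sample_resp` are hypothetical names, and the sample response only imitates the general shape of an Elasticsearch search response. It also accounts for `hits.total` being a plain integer in older Elasticsearch versions and an object with a `value` key in 7.x.

```python
from typing import Any, Dict, List, Tuple


def summarize_hits(resp: Dict[str, Any]) -> Tuple[int, List[Tuple[str, float]]]:
    """Return (total hit count, [(doc_id, score), ...]) from a search response."""
    hits = resp["hits"]
    total = hits["total"]
    # Elasticsearch 7.x reports total as {"value": n, "relation": "eq"};
    # earlier versions report a plain integer.
    if isinstance(total, dict):
        total = total["value"]
    pairs = [(hit["_id"], hit["_score"]) for hit in hits["hits"]]
    return total, pairs


# Hand-written sample data (hypothetical), shaped like a search response.
sample_resp = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 0.97,
        "hits": [
            {"_id": "12", "_score": 0.97},
            {"_id": "7", "_score": 0.81},
        ],
    }
}

total, pairs = summarize_hits(sample_resp)
```

With a real response, `similar_docs_resp` from the scenario above would be passed in place of `sample_resp`.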
Motivation
Can be found here
Scripts
There are scripts that generate trigrams from leads (documents) in DEEP.
Modules
trigrams
A collection of relevant trigrams for the en, es and fr languages.
from deep_utils.deduplication.trigrams import en, es, fr
en_trigrams = en.trigrams # [' th', 'the', 'he ', ....]
es_trigrams = es.trigrams
fr_trigrams = fr.trigrams
NOTE: The trigram collection contains 10000 relevant trigrams, so the created vector has dimension 10000.
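To illustrate why the vector dimension equals the number of trigrams, here is a minimal sketch of character trigram extraction and of a trigram-to-position lookup. The names `char_trigrams` and `build_trigram_index` are hypothetical and the tiny vocabulary is made up; the package's own lists hold 10000 entries.

```python
from typing import Dict, List


def char_trigrams(text: str) -> List[str]:
    """Slide a window of three characters over the text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_trigram_index(trigrams: List[str]) -> Dict[str, int]:
    """Map each vocabulary trigram to its position in the vector."""
    return {trigram: position for position, trigram in enumerate(trigrams)}


# Made-up four-entry vocabulary, standing in for a 10000-entry list.
toy_vocabulary = [" th", "the", "he ", "cat"]
toy_index = build_trigram_index(toy_vocabulary)

extracted = char_trigrams("the cat")
# The vector has one slot per vocabulary trigram.
vector_dimension = len(toy_index)
```

Each vocabulary trigram owns exactly one slot, so a 10000-entry list yields a 10000-dimensional vector.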
utils
Consists of the following functions:
# Wrapper function for creating an Elasticsearch object
es_wrapper(endpoint: str, region: str, profile_name: str = 'default') -> Elasticsearch
# Used for preprocessing texts
remove_puncs_and_extra_spaces(text: str) -> str
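The package's exact preprocessing rules are not documented here. As a rough sketch of what such a step might look like, the hypothetical function below replaces punctuation with spaces and collapses runs of whitespace. Its behavior (for example, keeping case and underscores) is an assumption and may differ from `remove_puncs_and_extra_spaces`.

```python
import re


def strip_puncs_and_spaces(text: str) -> str:
    """Hypothetical stand-in: drop punctuation, collapse whitespace."""
    # Replace anything that is neither a word character nor whitespace.
    without_puncs = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", without_puncs).strip()


cleaned = strip_puncs_and_spaces("Hello,   world!  This is... a test.")
```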
vector_generator
create_trigram_vector(lang: str, text: str) -> List[float]
create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]
normalize_count_vector(count_vector: List[int]) -> List[float]
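The signatures above suggest a two-step pipeline: count trigram occurrences, then normalize. The sketch below shows one plausible reading under two assumptions that the source does not confirm: the `trigrams` dictionary maps each trigram to its vector position, and normalization is L2 (unit length). The function names are hypothetical and are not the package's implementations.

```python
import math
from typing import Dict, List


def count_vector_sketch(processed_text: str, trigram_index: Dict[str, int]) -> List[int]:
    """Count occurrences of each known trigram (assumes values are positions)."""
    counts = [0] * len(trigram_index)
    for i in range(len(processed_text) - 2):
        position = trigram_index.get(processed_text[i:i + 3])
        if position is not None:  # trigrams outside the vocabulary are ignored
            counts[position] += 1
    return counts


def l2_normalize_sketch(counts: List[int]) -> List[float]:
    """Scale the counts to unit length (assumed normalization scheme)."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        return [0.0] * len(counts)  # avoid division by zero for empty matches
    return [c / norm for c in counts]


# Made-up vocabulary for illustration.
trigram_index = {"the": 0, "he ": 1, " th": 2, "cat": 3}
counts = count_vector_sketch("the cat the", trigram_index)
vector = l2_normalize_sketch(counts)
```

Normalizing makes documents of different lengths comparable, since only the relative trigram frequencies remain.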
elasticsearch
search_similar(similar_count: int, vector: Tuple[str, List[float]], index_name: str, es: Elasticsearch)
add_to_index(doc_id: int, vectors: Dict[str, List[float]], index_name: str, es: Elasticsearch)
index_exists(index_name: str, es: Es) -> bool
create_knn_vector_index(index_name: str, vector_size: int, es: Es, ignore_error: bool = False) -> Tuple[bool, ErrorString]
create_knn_vector_index_if_not_exists(index_name: str, vector_size: int, es: Es) -> Tuple[bool, ErrorString]
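For orientation, the sketch below builds the kind of request bodies a k-NN vector index and query typically use with the Open Distro / OpenSearch k-NN plugin (a `knn_vector` field with a `dimension`, and a `knn` query). Whether the functions listed above build exactly these bodies is an assumption, and `knn_index_body` and `knn_query_body` are hypothetical helper names. No network call is made.

```python
from typing import Any, Dict, List


def knn_index_body(vector_names: List[str], vector_size: int) -> Dict[str, Any]:
    """Index settings and mappings with one knn_vector field per name."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                name: {"type": "knn_vector", "dimension": vector_size}
                for name in vector_names
            }
        },
    }


def knn_query_body(vector_name: str, vector: List[float], k: int) -> Dict[str, Any]:
    """Search body asking for the k nearest neighbours of the given vector."""
    return {
        "size": k,
        "query": {"knn": {vector_name: {"vector": vector, "k": k}}},
    }


index_body = knn_index_body(["vector1"], 10000)
query_body = knn_query_body("vector1", [0.1, 0.2, 0.3], 10)
```

Bodies like these would be passed to the client's index-creation and search calls respectively.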