Document deduplication package for DEEP


Document De-Duplication

A module for creating and indexing document character trigrams.

Installation

pip install deep_utils.deduplication

A use case scenario

Suppose we have an Elasticsearch index my-index set up in AWS, with a vector field named vector1 of size 10000:

from deep_utils.deduplication.utils import es_wrapper
from deep_utils.deduplication.vector_generator import create_trigram_vector
from deep_utils.deduplication.elasticsearch import search_similar, add_to_index

es = es_wrapper('<aws endpoint>', 'aws_region')
text_document: str = 'this is test document'
vector = create_trigram_vector('en', text_document)

similar_docs_resp = search_similar(10, ('vector1', vector), 'my-index', es)

total = similar_docs_resp['hits']['total']
max_score = similar_docs_resp['hits']['max_score']
docs_ids = [x['_id'] for x in similar_docs_resp['hits']['hits']]
docs_scores = [x['_score'] for x in similar_docs_resp['hits']['hits']]


# To add a document to the index
resp = add_to_index(doc_id='1', vectors=dict(vector1=vector), index_name='my-index', es=es)
has_error = resp['errors']
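Under the hood, search_similar presumably issues a k-NN query against the vector field. The package's actual request is not shown here; the following is a hedged sketch of what such a query body typically looks like with the Open Distro / OpenSearch k-NN plugin (the helper name and all values are illustrative):

```python
def knn_search_body_sketch(vector_name, vector, k):
    """Build a k-NN search body in the shape used by the Open Distro /
    OpenSearch k-NN plugin. Illustrative only; not the package's code."""
    return {
        'size': k,  # return at most k hits
        'query': {
            'knn': {
                vector_name: {'vector': vector, 'k': k}
            }
        },
    }

body = knn_search_body_sketch('vector1', [0.1, 0.2], 10)
print(body['size'])
```

The response from such a query has the usual hits/total/max_score shape unpacked in the example above.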

Motivation

Can be found here

Scripts

There are scripts that generate trigrams from leads (documents) in DEEP.

Modules

trigrams

A collection of relevant trigrams for the en, es and fr languages.

from deep_utils.deduplication.trigrams import en, es, fr

en_trigrams = en.trigrams  # [' th', 'the', 'he ', ....]
es_trigrams = es.trigrams
fr_trigrams = fr.trigrams

NOTE: Each language's trigram list contains the 10000 most relevant trigrams, so the vectors created have dimension 10000.
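To make the vector construction concrete, here is a standalone sketch of counting character trigrams against a fixed vocabulary. The five-entry vocab below is hypothetical; the package ships 10000 trigrams per language, so real vectors have dimension 10000:

```python
def text_to_trigrams(text):
    """Split text into overlapping character trigrams."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

# Hypothetical 5-entry vocabulary mapping trigram -> vector position.
# The real package uses its shipped 10000-trigram lists instead.
vocab = {' th': 0, 'the': 1, 'he ': 2, 'is ': 3, 'st ': 4}

def count_vector(text, vocab):
    """Count occurrences of each vocabulary trigram; dimension == len(vocab)."""
    vec = [0] * len(vocab)
    for tri in text_to_trigrams(text):
        if tri in vocab:
            vec[vocab[tri]] += 1
    return vec

print(count_vector('this is the test', vocab))  # -> [1, 1, 1, 2, 0]
```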

utils

Consists of the following functions:

# A wrapper function for creating an Elasticsearch object
es_wrapper(endpoint: str, region: str, profile_name: str = 'default') -> Elasticsearch
# Used for preprocessing texts
remove_puncs_and_extra_spaces(text: str) -> str
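The package's implementation of remove_puncs_and_extra_spaces is not reproduced here; a minimal sketch of equivalent preprocessing (drop punctuation, collapse whitespace) might look like the following — the actual rules may differ:

```python
import re
import string

def remove_puncs_and_extra_spaces_sketch(text):
    """Drop ASCII punctuation, then collapse whitespace runs to single
    spaces. A sketch, not the package's exact implementation."""
    no_puncs = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', no_puncs).strip()

print(remove_puncs_and_extra_spaces_sketch('Hello,   world!!  '))  # -> Hello world
```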

vector_generator

create_trigram_vector(lang: str, text: str) -> List[float]
create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]
normalize_count_vector(count_vector: List[int]) -> List[float]
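These three functions compose into a pipeline: preprocess text, count trigrams, then normalize the counts. The exact norm used by normalize_count_vector is not documented here; below is a sketch assuming L2 (unit-length) normalization, the usual choice for cosine-style k-NN similarity:

```python
import math

def normalize_count_vector_sketch(count_vector):
    """Scale a count vector to unit L2 length; zero vectors map to zeros.

    Assumption: the package may use a different norm (e.g. L1)."""
    norm = math.sqrt(sum(c * c for c in count_vector))
    if norm == 0:
        return [0.0] * len(count_vector)
    return [c / norm for c in count_vector]

print(normalize_count_vector_sketch([3, 4]))  # -> [0.6, 0.8]
```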

elasticsearch

search_similar(similar_count: int, vector: Tuple[str, List[float]], index_name: str, es: Elasticsearch)
add_to_index(doc_id: int, vectors: Dict[str, List[float]], index_name: str, es: Elasticsearch)
index_exists(index_name: str, es: Es) -> bool
create_knn_vector_index(index_name: str, vector_size: int, es: Es, ignore_error: bool = False) -> Tuple[bool, ErrorString]
create_knn_vector_index_if_not_exists(index_name: str, vector_size: int, es: Es) -> Tuple[bool, ErrorString]
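create_knn_vector_index presumably creates an index with a knn_vector field mapping. On AWS Elasticsearch (Open Distro k-NN plugin), such an index-creation body looks roughly like the following sketch; the field name is illustrative and the package's actual settings (shards, HNSW parameters, etc.) may differ:

```python
def knn_index_body_sketch(vector_name, vector_size):
    """Build an index-creation body for a k-NN vector field, in the shape
    used by the Open Distro / OpenSearch k-NN plugin. A sketch only."""
    return {
        'settings': {'index': {'knn': True}},  # enable k-NN for the index
        'mappings': {
            'properties': {
                vector_name: {
                    'type': 'knn_vector',   # vector field type
                    'dimension': vector_size,
                }
            }
        },
    }

body = knn_index_body_sketch('vector1', 10000)
print(body['mappings']['properties']['vector1']['dimension'])  # -> 10000
```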
