Document deduplication package for DEEP

Project description

Document De-Duplication

A module for creating and indexing document character trigrams.

Installation

pip install deep_utils.deduplication

A use case scenario

Suppose we have an Elasticsearch index my-index set up in AWS, with a vector field named vector1 of size 10000:

from deep_utils.deduplication.utils import es_wrapper
from deep_utils.deduplication.vector_generator import create_trigram_vector
from deep_utils.deduplication.elasticsearch import search_similar, add_to_index

es = es_wrapper('<aws endpoint>', 'aws_region')
text_document: str = 'this is a test document'
vector = create_trigram_vector('en', text_document)

similar_docs_resp = search_similar(10, ('vector1', vector), 'my-index', es)

total = similar_docs_resp['hits']['total']
max_score = similar_docs_resp['hits']['max_score']
docs_ids = [x['_id'] for x in similar_docs_resp['hits']['hits']]
docs_scores = [x['_score'] for x in similar_docs_resp['hits']['hits']]


# To add document to index
resp = add_to_index(doc_id='1', vectors=dict(vector1=vector), index_name='my-index', es=es)
has_error = resp['errors']

Motivation

Can be found here

Scripts

There are scripts that generate trigrams from leads (documents) in DEEP.

Modules

trigrams

Collections of relevant trigrams for the en, es and fr languages.

from deep_utils.deduplication.trigrams import en, es, fr

en_trigrams = en.trigrams  # [' th', 'the', 'he ', ....]
es_trigrams = es.trigrams
fr_trigrams = fr.trigrams

NOTE: Each language's collection contains the 10000 most relevant trigrams, so the vectors created have dimension 10000.
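To make the "character trigram" idea concrete, here is a minimal sketch of how overlapping trigrams are extracted from text. The helper name char_trigrams is hypothetical and not part of the package API:

```python
# Hypothetical helper (not part of deep_utils.deduplication):
# slide a window of width 3 over the text to collect all
# overlapping character trigrams, spaces included.
def char_trigrams(text: str) -> list:
    """Return all overlapping character trigrams of `text`."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(char_trigrams('the '))  # ['the', 'he ']
```

A document's trigram counts over a fixed trigram vocabulary are what the package turns into a fixed-length vector.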

utils

Consists of the following functions:

# A wrapper function for creating an Elasticsearch object
es_wrapper(endpoint: str, region: str, profile_name: str = 'default') -> Elasticsearch

# Used for preprocessing texts
remove_puncs_and_extra_spaces(text: str) -> str
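As a rough illustration of the preprocessing step, here is a sketch of what remove_puncs_and_extra_spaces likely does: strip punctuation, collapse whitespace runs, and lowercase. The packaged function's exact behaviour may differ:

```python
import re
import string

# Illustrative re-implementation (assumption, not the package's code):
# drop punctuation, collapse runs of whitespace to one space,
# trim the ends, and lowercase.
def remove_puncs_and_extra_spaces(text: str) -> str:
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip().lower()

print(remove_puncs_and_extra_spaces('Hello,   World!!  '))  # 'hello world'
```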

vector_generator

create_trigram_vector(lang: str, text: str) -> List[float]
create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]
normalize_count_vector(count_vector: List[int]) -> List[float]
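The count-vector pipeline can be sketched as follows, under two assumptions: the trigrams dict maps each known trigram to its position in the vector, and normalization divides by the Euclidean (L2) norm. The package may normalize differently:

```python
import math
from typing import Dict, List

# Sketch of create_count_vector: count how often each vocabulary
# trigram occurs in the preprocessed text, at its assigned position.
def create_count_vector(processed_text: str, trigrams: Dict[str, int]) -> List[int]:
    vector = [0] * len(trigrams)
    for i in range(len(processed_text) - 2):
        tri = processed_text[i:i + 3]
        if tri in trigrams:
            vector[trigrams[tri]] += 1
    return vector

# Sketch of normalize_count_vector: scale to unit L2 norm
# (assumed; the package might use a different normalization).
def normalize_count_vector(count_vector: List[int]) -> List[float]:
    norm = math.sqrt(sum(c * c for c in count_vector)) or 1.0
    return [c / norm for c in count_vector]

vocab = {'the': 0, 'he ': 1, 'e c': 2}
counts = create_count_vector('the cat the dog', vocab)
print(counts)  # [2, 2, 1]
print(normalize_count_vector(counts))
```

With a unit-norm vector, the dot product between two documents' vectors is their cosine similarity, which is what makes kNN search over these vectors meaningful.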

elasticsearch

search_similar(similar_count: int, vector: Tuple[str, List[float]], index_name: str, es: Elasticsearch)
add_to_index(doc_id: int, vectors: Dict[str, List[float]], index_name: str, es: Elasticsearch)
index_exists(index_name: str, es: Es) -> bool
create_knn_vector_index(index_name: str, vector_size: int, es: Es, ignore_error: bool = False) -> Tuple[bool, ErrorString]
create_knn_vector_index_if_not_exists(index_name: str, vector_size: int, es: Es) -> Tuple[bool, ErrorString]
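For orientation, here is a sketch of the kind of index body create_knn_vector_index presumably sends, assuming the Elasticsearch/OpenSearch k-NN plugin's knn_vector field type. The field and setting names below are illustrative, not taken from the package source:

```python
# Illustrative index body for a k-NN vector index (assumes the
# OpenSearch/Elasticsearch k-NN plugin; names are assumptions).
def knn_index_body(vector_name: str, vector_size: int) -> dict:
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # One knn_vector field per named vector, with a fixed dimension.
                vector_name: {"type": "knn_vector", "dimension": vector_size}
            }
        },
    }

body = knn_index_body("vector1", 10000)
```

The dimension must match the trigram vocabulary size (10000 here), which is why the trigrams note above fixes the vector length.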
