NLP library that extracts, compares, transforms, and bucket-sorts phrases.

python-semantic-compare

Extracts, compares, transforms, and bucket-sorts phrases.

Installation

The project requires a spaCy model for natural language processing. To use English, run this command:

$ python -m spacy download en_core_web_lg

Usage

Extract phrases

Simple Usage

from semantic_compare import SemanticComparator as sc
comparator = sc(sentencizer=True)
phrases = comparator.extract_phrases("Create, promote and develop a business.")

Output:

['Create a business', 'promote a business', 'develop a business']

sentencizer - enables a sentence splitter that breaks the text on punctuation (period, question mark, exclamation mark).
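
To illustrate the idea of punctuation-based sentence splitting, here is a minimal standard-library sketch. It is not the library's implementation: the function name split_sentences and the regular expression are assumptions made for this example only.

```python
import re


def split_sentences(text):
    """Split text after '.', '?' or '!' when whitespace follows (illustrative only)."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    # Drop empty fragments, e.g. when the input is an empty string.
    return [part for part in parts if part]


sentences = split_sentences("Create a business. Why promote it? Develop it!")
print(sentences)
# ['Create a business.', 'Why promote it?', 'Develop it!']
```

A real sentencizer works on tokens rather than raw characters, so it can handle abbreviations and ellipses more carefully than this sketch does.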

Advanced Usage

from semantic_compare import SemanticComparator as sc

# Sentence splitter
def our_sentencizer(doc):
    """
    Custom sentence splitter: starts a new sentence at bullet characters and
    after sentence-ending punctuation or a newline followed by a title-cased token.
    """
    sentence_enders = {".", "...", "?", "!", "\n"}
    for i, token in enumerate(doc[:-2]):
        if "•" in token.text:
            # A bullet character opens a new sentence.
            doc[i].is_sent_start = True
        elif token.text in sentence_enders and doc[i + 1].is_title:
            # Punctuation or newline followed by a title-cased token ends a sentence.
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False
    return doc


# Create a comparator with entity merging disabled, using the small English model
comparator = sc(merge_entities=False, spacy_model='en_core_web_sm')
    
# Add a custom pipe for text preprocessing
comparator.add_custom_pipe(our_sentencizer, before='parser')

phrases = comparator.extract_phrases('''
Must Have:
* Experience shaping the BI strategy from C-Level to Technical developers.
* Extensive delivery of platform within a Business Intelligence and Analytics function.
* Communication with stakeholders on all levels.
''')
print('\n'.join(phrases))

Use add_custom_pipe to add your own component to the spaCy text-processing pipeline.

Compare phrases (Semantic similarity)

Get the similarity of two phrases. Example 1:

phrase1 = 'Understand customer needs'
phrase2 = 'Capture business requirements'
similarity = comparator.compare_phrases(phrase1, phrase2)
print(similarity)

Output:

0.38569751
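
The README does not say how the score is computed. A common measure for comparing phrase vectors is cosine similarity, sketched below with hand-made vectors. The function cosine_similarity and the example vectors are assumptions for illustration, not the library's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (illustrative only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "phrase vectors".
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 1.0, 1.0]
print(round(cosine_similarity(vec_a, vec_b), 4))
# 0.7303
```

Scores of this kind range from -1 to 1, with 1 meaning the vectors point in the same direction.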

Example 2: Get a two-dimensional matrix of pairwise similarities between two lists of phrases. In the output, each row corresponds to a phrase in phrases_2 and each column to a phrase in phrases_1.

phrases_1 = [
    'Communication with stakeholders',
    'Understand customer needs',
    'Experience shaping the BI strategy',
    'shaping the BI strategy',
    'Delivery of platform Analytics function',
]

phrases_2 = [
    'Extensive delivery of platform within a Business Intelligence and Analytics function',
    'shaping the BI strategy',
    'Experience shaping the BI strategy from C-Level to Technical developers',
    'Communication with stakeholders on all levels',
    'Capture business requirements',
    'Play computer games',
]
similarity = comparator.build_similarity_matrix(phrases_1, phrases_2)
print(similarity)

Output:

[[-0.03689054  0.0372301   0.17840812  0.09079809  0.65748763]
 [ 0.18079719  0.12055688  0.77624094  1.          0.22749564]
 [ 0.08472343  0.11505745  0.7030021   0.48876476  0.13252231]
 [ 0.7132235   0.07449755  0.178031    0.15712512  0.0676512 ]
 [ 0.11637229  0.38569745  0.23005028  0.25646406  0.26493344]
 [ 0.17955953  0.15243992  0.11233422  0.16087453  0.03144675]]
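
One way to read this matrix is to look up, for each phrase in phrases_2, the best-scoring phrase in phrases_1. The sketch below copies the printed values into plain Python lists; the helper best_matches is a hypothetical name used only in this example and is not part of the library.

```python
phrases_1 = [
    'Communication with stakeholders',
    'Understand customer needs',
    'Experience shaping the BI strategy',
    'shaping the BI strategy',
    'Delivery of platform Analytics function',
]
phrases_2 = [
    'Extensive delivery of platform within a Business Intelligence and Analytics function',
    'shaping the BI strategy',
    'Experience shaping the BI strategy from C-Level to Technical developers',
    'Communication with stakeholders on all levels',
    'Capture business requirements',
    'Play computer games',
]
# Values copied from the printed output: rows follow phrases_2, columns follow phrases_1.
similarity = [
    [-0.03689054, 0.0372301, 0.17840812, 0.09079809, 0.65748763],
    [0.18079719, 0.12055688, 0.77624094, 1.0, 0.22749564],
    [0.08472343, 0.11505745, 0.7030021, 0.48876476, 0.13252231],
    [0.7132235, 0.07449755, 0.178031, 0.15712512, 0.0676512],
    [0.11637229, 0.38569745, 0.23005028, 0.25646406, 0.26493344],
    [0.17955953, 0.15243992, 0.11233422, 0.16087453, 0.03144675],
]


def best_matches(matrix, row_labels, col_labels):
    """Return (row phrase, best column phrase, score) for every row."""
    results = []
    for label, row in zip(row_labels, matrix):
        best = max(range(len(row)), key=row.__getitem__)
        results.append((label, col_labels[best], row[best]))
    return results


matches = best_matches(similarity, phrases_2, phrases_1)
for phrase_2, phrase_1, score in matches:
    print(f"{phrase_2!r} -> {phrase_1!r} ({score:.2f})")
```

Note that a best match is not necessarily a good match: 'Play computer games' still gets a nearest neighbour, but its score stays below 0.2.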

Bucket sorting

When you compare two documents, you can see which phrases are present in both and which appear in only one of them.

phrases_1 = [
    'Communication with stakeholders',
    'Understand customer needs',
    'Experience shaping the BI strategy',
    'shaping the BI strategy',
    'Delivery of platform Analytics function',
]

phrases_2 = [
    'Extensive delivery of platform within a Business Intelligence and Analytics function',
    'shaping the BI strategy',
    'Experience shaping the BI strategy from C-Level to Technical developers',
    'Communication with stakeholders on all levels',
    'Capture business requirements',
    'Play computer games',
]
# cut_off - similarity threshold: two phrases are considered similar when their similarity is above it (default=0.3)
in_both, in_doc1, in_doc2 = comparator.bucket_sorting(
    phrases_1, phrases_2, similarity, cut_off=0.5)
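
The README does not show the output of bucket_sorting or document its return types. As a rough sketch of the thresholding idea, the function below (bucket_sort_sketch, a hypothetical name) treats a pair as "in both" when its score is above cut_off and assigns unmatched phrases to their own document. The scores are a rounded subset of the similarity matrix shown earlier, with rows for phrases_2 and columns for phrases_1.

```python
def bucket_sort_sketch(phrases_1, phrases_2, similarity, cut_off=0.3):
    """Illustrative bucketing; similarity[i][j] scores phrases_2[i] against phrases_1[j]."""
    in_both = []
    matched_1, matched_2 = set(), set()
    for i, row in enumerate(similarity):
        for j, score in enumerate(row):
            if score > cut_off:
                in_both.append((phrases_1[j], phrases_2[i], score))
                matched_1.add(j)
                matched_2.add(i)
    # Phrases with no score above the threshold belong to one document only.
    in_doc1 = [p for j, p in enumerate(phrases_1) if j not in matched_1]
    in_doc2 = [p for i, p in enumerate(phrases_2) if i not in matched_2]
    return in_both, in_doc1, in_doc2


doc1 = ['Communication with stakeholders', 'Understand customer needs',
        'shaping the BI strategy']
doc2 = ['shaping the BI strategy', 'Communication with stakeholders on all levels',
        'Play computer games']
scores = [
    [0.18, 0.12, 1.00],
    [0.71, 0.07, 0.16],
    [0.18, 0.15, 0.16],
]

both, only_1, only_2 = bucket_sort_sketch(doc1, doc2, scores, cut_off=0.5)
print(both)
print(only_1)  # ['Understand customer needs']
print(only_2)  # ['Play computer games']
```

Raising cut_off moves borderline pairs out of the shared bucket, so the threshold is worth tuning against a few known examples.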

Transform phrases

Get all steps of transformation from one phrase to another. Example:

print(comparator.transform_phrase(
    'Understand customer needs',
    'Capture business requirements',
))

Output:

["Understand customer needs", "Capture customer needs", "Capture business requirements"]
