Python library for concurrent text preprocessing

contextpro

contextpro is a Python library for concurrent text preprocessing. It builds on functions from well-known NLP packages, including NLTK, spaCy and TextBlob.

Installation

Windows / OS X / Linux:

  • Installation with pip

    pip install contextpro
    python -m spacy download en_core_web_sm
    
  • Installation with poetry

    poetry add contextpro
    python -m spacy download en_core_web_sm
    

Configuration

  • Before using the package, run the following Python code in your virtual environment to download the required NLTK data:

    import nltk
    
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")
    

Usage examples

from contextpro.normalization import batch_lowercase_text

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

result = batch_lowercase_text(
    corpus,
    num_workers=2
)

print(result)

[
    "my name is dr. jekyll.",
    "his name is mr. hyde",
    "this guy's name is edward scissorhands",
    "and this is tom parker"
]
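The `num_workers` argument suggests that the corpus is spread across a pool of workers. How contextpro does this internally is not described here, so the following is only a minimal sketch of the general pattern using the standard library. The helper name `batch_apply` is hypothetical and not part of contextpro.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def batch_apply(
    func: Callable[[str], str], corpus: List[str], num_workers: int = 2
) -> List[str]:
    """Apply func to every document with a worker pool, keeping input order.

    Hypothetical helper for illustration; not contextpro's implementation.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map returns results in the same order as the inputs
        return list(executor.map(func, corpus))


corpus = ["My name is Dr. Jekyll.", "His name is Mr. Hyde"]
print(batch_apply(str.lower, corpus, num_workers=2))
# ['my name is dr. jekyll.', 'his name is mr. hyde']
```

Threads keep the sketch short; for CPU-bound pure-Python work, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and is usually the better fit.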

from contextpro.normalization import batch_remove_non_ascii_characters

corpus = [
    "https://sitebulb.com/Folder/øê.html?大学",
    "J\xf6reskog bi\xdfchen Z\xfcrcher",
    "This is a \xA9 but not a \xAE",
    "fractions \xBC, \xBD, \xBE"
]

result = batch_remove_non_ascii_characters(
    corpus,
    num_workers=2
)

print(result)

[
    "https://sitebulb.com/Folder/.html?",
    "Jreskog bichen Zrcher",
    "This is a  but not a ",
    "fractions , , "
]
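For a single string, the same result can be reproduced with the standard library alone. This is a sketch of the operation, not necessarily how contextpro implements it; `remove_non_ascii` is a hypothetical name.

```python
def remove_non_ascii(text: str) -> str:
    """Drop every character outside the 7-bit ASCII range."""
    # errors="ignore" silently skips characters that cannot be encoded
    return text.encode("ascii", errors="ignore").decode("ascii")


corpus = [
    "https://sitebulb.com/Folder/øê.html?大学",
    "J\xf6reskog bi\xdfchen Z\xfcrcher",
    "This is a \xA9 but not a \xAE",
    "fractions \xBC, \xBD, \xBE",
]

print([remove_non_ascii(text) for text in corpus])
# ['https://sitebulb.com/Folder/.html?', 'Jreskog bichen Zrcher',
#  'This is a  but not a ', 'fractions , , ']
```

Note that characters are removed rather than transliterated, so "ö" disappears instead of becoming "o".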

from contextpro.normalization import batch_replace_contractions

corpus = [
    "I don't want to be rude, but you shouldn't do this",
    "Do you think he'll pass his driving test?",
    "I'll see you next week",
    "I'm going for a walk"
]

result = batch_replace_contractions(
    corpus,
    num_workers=2
)

print(result)

[
    "I do not want to be rude, but you should not do this",
    "Do you think he will pass his driving test?",
    "I will see you next week",
    "I am going for a walk",
]

from contextpro.normalization import batch_remove_stopwords

corpus = [
    ['My', 'name', 'is', 'Dr', 'Jekyll'],
    ['His', 'name', 'is', 'Mr', 'Hyde'],
    ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
    ['And', 'this', 'is', 'Tom', 'Parker']
]

result = batch_remove_stopwords(
    corpus,
    num_workers=2
)

print(result)

[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]

from contextpro.normalization import batch_lemmatize

corpus = [
    ["I", "like", "driving", "a", "car"],
    ["I", "am", "going", "for", "a", "walk"],
    ["What", "are", "you", "doing"],
    ["Where", "are", "you", "coming", "from"]
]

result = batch_lemmatize(
    corpus,
    num_workers=2,
    pos="v"
)

print(result)

[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]

from contextpro.normalization import batch_convert_numerals_to_numbers

corpus = [
    "A bunch of five",
    "A picture is worth a thousand words",
    "A stitch in time saves nine",
    "Back to square one",
    "Behind the eight ball",
    "Between two stools",
]

result = batch_convert_numerals_to_numbers(
    corpus,
    num_workers=2
)

print(result)

[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]

from contextpro.feature_extraction import ConcurrentCountVectorizer

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

cvv = ConcurrentCountVectorizer(
    lowercase=True,
    remove_stopwords=True,
    ngram_range=(1, 1),
    num_workers=2
)

transformed = cvv.fit_transform(corpus)

print(cvv.get_feature_names())

[
    'dr', 'edward', 'guy', 'hyde', 'jekyll', 'mr',
    'name', 'parker', 'scissorhands', 'tom'
]

print(transformed.toarray())

[
    [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
]
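To make the matrix easier to read: each column corresponds to one feature name, in sorted order, and each row counts how often that feature occurs in the matching document. The sketch below rebuilds the same output with the standard library. The function `count_vectorize`, the tokenization rule and the tiny `STOPWORDS` set are assumptions chosen for this corpus; they are not contextpro's implementation or its stop word list.

```python
import re
from typing import List, Set, Tuple

# Hand-picked stand-in for a real stop word list (assumption for this example)
STOPWORDS: Set[str] = {"my", "is", "his", "this", "s", "and"}


def count_vectorize(corpus: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (sorted vocabulary, document-term count matrix)."""
    # Lowercase, split into runs of word characters, drop stop words
    docs = [
        [tok for tok in re.findall(r"\w+", text.lower()) if tok not in STOPWORDS]
        for text in corpus
    ]
    # Every unique token gets one column, in alphabetical order
    vocabulary = sorted({tok for doc in docs for tok in doc})
    column = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc:
            row[column[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix


corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker",
]
vocabulary, matrix = count_vectorize(corpus)
print(vocabulary)
print(matrix[0])  # [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
```

In the first row, the three ones sit in the columns for 'dr', 'jekyll' and 'name'.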

from contextpro.statistics import batch_calculate_corpus_statistics

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

statistics = batch_calculate_corpus_statistics(
    corpus,
    lowercase=False,
    remove_stopwords=False,
    num_workers=2,
)

print(statistics)

    characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

        ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
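The count columns in the table above can be reproduced for one document with the standard library. This is a sketch under assumptions: `basic_statistics` is a hypothetical name, and the token rule (runs of word characters) is inferred from the table, where the third sentence yields 7 tokens. The sentiment and subjectivity scores are left out.

```python
import re
import string
from typing import Dict


def basic_statistics(text: str) -> Dict[str, int]:
    """Character-level counts for one document (illustrative only)."""
    return {
        "characters": len(text),
        # Assumed tokenization: runs of word characters, so "guy's" gives 2 tokens
        "tokens": len(re.findall(r"\w+", text)),
        "punctuation_characters": sum(ch in string.punctuation for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
        "whitespace_characters": sum(ch.isspace() for ch in text),
        "ascii_characters": sum(ord(ch) < 128 for ch in text),
    }


print(basic_statistics("This guy's name is Edward Scissorhands"))
# {'characters': 38, 'tokens': 7, 'punctuation_characters': 1, 'digits': 0,
#  'whitespace_characters': 5, 'ascii_characters': 38}
```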

Release History

  • 0.1.0
    • First release

Meta

Łukasz Zawieska – zawieskal@yahoo.com

Gitlab account

Github account

Distributed under the MIT license. See LICENSE for more information.
