
Python library for concurrent text preprocessing


contextpro


contextpro is a Python library for concurrent text preprocessing using functions from some well-known NLP packages including NLTK, spaCy and TextBlob.

Installation

Windows / OS X / Linux:

  • Installation with pip

    pip install contextpro
    python -m spacy download en_core_web_sm
    
  • Installation with poetry

    poetry add contextpro
    python -m spacy download en_core_web_sm
    

Configuration

  • Before using the package, run the following Python code in your virtual environment to download the required NLTK resources:

    import nltk
    
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")
    

Usage examples

from contextpro.normalization import batch_lowercase_text

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

result = batch_lowercase_text(
    corpus,
    num_workers=2
)

print(result)

[
    "my name is dr. jekyll.",
    "his name is mr. hyde",
    "this guy's name is edward scissorhands",
    "and this is tom parker"
]
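The `num_workers` argument in the call above controls how many workers share the batch. How contextpro implements this internally is not shown here; the following is only a minimal sketch of the general pattern, assuming a thread pool from the standard library. The helper name `batch_apply` is hypothetical and not part of contextpro.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def batch_apply(
    func: Callable[[str], str],
    corpus: List[str],
    num_workers: int = 2,
) -> List[str]:
    """Apply `func` to every document using a pool of workers.

    Hypothetical helper for illustration only; not contextpro's API.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # executor.map returns results in the same order as the input.
        return list(executor.map(func, corpus))


corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
]

result = batch_apply(str.lower, corpus, num_workers=2)
print(result)
```

Threads keep the sketch simple. For CPU-bound preprocessing, a process pool is usually the better fit in CPython because of the global interpreter lock.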
from contextpro.normalization import batch_remove_non_ascii_characters

corpus = [
    "https://sitebulb.com/Folder/øê.html?大学",
    "J\xf6reskog bi\xdfchen Z\xfcrcher",
    "This is a \xA9 but not a \xAE",
    "fractions \xBC, \xBD, \xBE"
]

result = batch_remove_non_ascii_characters(
    corpus,
    num_workers=2
)

print(result)

[
    "https://sitebulb.com/Folder/.html?",
    "Jreskog bichen Zrcher",
    "This is a  but not a ",
    "fractions , , "
]
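For a single string, the same effect can be approximated with the standard library alone. This is a sketch of one common technique (encode to ASCII and ignore what does not fit), not necessarily what contextpro does internally; the function name `remove_non_ascii` is an assumption made for this example.

```python
def remove_non_ascii(text: str) -> str:
    # errors="ignore" silently drops every character that has no ASCII encoding.
    return text.encode("ascii", errors="ignore").decode("ascii")


corpus = [
    "J\xf6reskog bi\xdfchen Z\xfcrcher",
    "This is a \xA9 but not a \xAE",
    "fractions \xBC, \xBD, \xBE",
]

cleaned = [remove_non_ascii(text) for text in corpus]
print(cleaned)
```

Note that the characters are removed, not transliterated, so `ö` disappears instead of becoming `o`.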
from contextpro.normalization import batch_replace_contractions

corpus = [
    "I don't want to be rude, but you shouldn't do this",
    "Do you think he'll pass his driving test?",
    "I'll see you next week",
    "I'm going for a walk"
]

result = batch_replace_contractions(
    corpus,
    num_workers=2
)

print(result)

[
    "I do not want to be rude, but you should not do this",
    "Do you think he will pass his driving test?",
    "I will see you next week",
    "I am going for a walk",
]
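Contraction replacement is typically a lookup in a mapping table. The sketch below illustrates the idea with a deliberately tiny, hand-written table; the names `CONTRACTIONS` and `expand_contractions` are hypothetical, and the mapping contextpro actually uses is not shown in this document.

```python
import re

# Tiny illustrative table (an assumption); real text needs far more entries.
CONTRACTIONS = {
    "don't": "do not",
    "shouldn't": "should not",
    "he'll": "he will",
    "I'll": "I will",
    "I'm": "I am",
}

# Match any key as a whole word, so "he'll" does not match inside "she'll".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text: str) -> str:
    # Replace each matched contraction with its expanded form.
    return PATTERN.sub(lambda match: CONTRACTIONS[match.group(0)], text)


print(expand_contractions("I don't want to be rude, but you shouldn't do this"))
print(expand_contractions("Do you think he'll pass his driving test?"))
```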
from contextpro.normalization import batch_remove_stopwords

corpus = [
    ['My', 'name', 'is', 'Dr', 'Jekyll'],
    ['His', 'name', 'is', 'Mr', 'Hyde'],
    ['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
    ['And', 'this', 'is', 'Tom', 'Parker']
]

result = batch_remove_stopwords(
    corpus,
    num_workers=2
)

print(result)

[
    ['My', 'name', 'Dr', 'Jekyll'],
    ['His', 'name', 'Mr', 'Hyde'],
    ['This', 'guy', 'name', 'Edward', 'Scissorhands'],
    ['And', 'Tom', 'Parker']
]

from contextpro.normalization import batch_lemmatize

corpus =  [
    ["I", "like", "driving", "a", "car"],
    ["I", "am", "going", "for", "a", "walk"],
    ["What", "are", "you", "doing"],
    ["Where", "are", "you", "coming", "from"]
]

result = batch_lemmatize(
    corpus,
    num_workers=2,
    pos="v"
)

print(result)

[
    ['I', 'like', 'drive', 'a', 'car'],
    ['I', 'be', 'go', 'for', 'a', 'walk'],
    ['What', 'be', 'you', 'do'],
    ['Where', 'be', 'you', 'come', 'from']
]

from contextpro.normalization import batch_convert_numerals_to_numbers

corpus = [
    "A bunch of five",
    "A picture is worth a thousand words",
    "A stitch in time saves nine",
    "Back to square one",
    "Behind the eight ball",
    "Between two stools",
]

result = batch_convert_numerals_to_numbers(
    corpus,
    num_workers=2
)

print(result)

[
    'A bunch of 5',
    'A picture is worth a 1000 words',
    'A stitch in time saves 9',
    'Back to square 1',
    'Behind the 8 ball',
    'Between 2 stools',
]
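The conversion shown above can be pictured as a word-to-digit substitution. The following sketch covers only the single-word numerals that appear in this example; `NUMERALS` and `numerals_to_numbers` are hypothetical names, and compound numerals such as "twenty one" are out of scope for it.

```python
import re

# Only the numerals used in the example above (an illustrative assumption).
NUMERALS = {
    "one": 1,
    "two": 2,
    "five": 5,
    "eight": 8,
    "nine": 9,
    "thousand": 1000,
}

# Whole-word, case-insensitive match on any numeral in the table.
PATTERN = re.compile(r"\b(" + "|".join(NUMERALS) + r")\b", flags=re.IGNORECASE)


def numerals_to_numbers(text: str) -> str:
    # Look up the lowercased match and substitute its digit form.
    return PATTERN.sub(lambda match: str(NUMERALS[match.group(0).lower()]), text)


print(numerals_to_numbers("A picture is worth a thousand words"))
print(numerals_to_numbers("Between two stools"))
```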
from contextpro.feature_extraction import ConcurrentCountVectorizer

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

cvv = ConcurrentCountVectorizer(
    lowercase=True,
    remove_stopwords=True,
    ngram_range=(1, 1),
    num_workers=2
)

transformed = cvv.fit_transform(corpus)

print(cvv.get_feature_names())

[
    'dr', 'edward', 'guy', 'hyde', 'jekyll', 'mr',
    'name', 'parker', 'scissorhands', 'tom'
]

print(transformed.toarray())

[
    [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
]

from contextpro.statistics import batch_calculate_corpus_statistics

corpus = [
    "My name is Dr. Jekyll.",
    "His name is Mr. Hyde",
    "This guy's name is Edward Scissorhands",
    "And this is Tom Parker"
]

statistics = batch_calculate_corpus_statistics(
    corpus,
    lowercase=False,
    remove_stopwords=False,
    num_workers=2,
)

print(statistics)

    characters  tokens  punctuation_characters  digits  whitespace_characters  \
0          22       5                       2       0                      4
1          20       5                       1       0                      4
2          38       7                       1       0                      5
3          22       5                       0       0                      4

        ascii_characters  sentiment_score  subjectivity_score
0                22              0.0                 0.0
1                20              0.0                 0.0
2                38              0.0                 0.0
3                22              0.0                 0.0
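To make the count columns concrete, here is a standard-library sketch that reproduces the first six columns for one document. It is an approximation under stated assumptions: the function name `text_statistics` is hypothetical, and the `\w+` token pattern was chosen only because it matches the token counts in the table above. The sentiment and subjectivity scores are left out.

```python
import re
import string


def text_statistics(text: str) -> dict:
    """Per-document counts; hypothetical helper, not contextpro's API."""
    return {
        "characters": len(text),
        # Assumption: a token is a run of word characters.
        "tokens": len(re.findall(r"\w+", text)),
        "punctuation_characters": sum(ch in string.punctuation for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
        "whitespace_characters": sum(ch.isspace() for ch in text),
        "ascii_characters": sum(ord(ch) < 128 for ch in text),
    }


stats = text_statistics("This guy's name is Edward Scissorhands")
print(stats)
```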

Release History

  • 0.1.0
    • First release

Meta

Łukasz Zawieska – zawieskal@yahoo.com

Gitlab account

Github account

Distributed under the MIT license. See LICENSE for more information.
