contextpro
contextpro is a Python library for concurrent text preprocessing, built on functions from well-known NLP packages including NLTK, spaCy, and TextBlob.
- Documentation: https://contextpro.readthedocs.io/en/latest/
- Source code: https://gitlab.com/elzawie/contextpro
Installation
Windows / OS X / Linux:
- Installation with pip:
  pip install contextpro
  python -m spacy download en_core_web_sm
- Installation with poetry:
  poetry add contextpro
  python -m spacy download en_core_web_sm
Configuration
- Before using the package, run the following commands in a Python interpreter inside your virtual environment to download the required NLTK resources:
  import nltk
  nltk.download("punkt")
  nltk.download("stopwords")
  nltk.download("wordnet")
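The downloads above only need to run once per environment. As a small optional sketch (using NLTK's standard nltk.data.find lookup, not anything contextpro-specific), you can check that the resources are present and fetch only the missing ones:

import nltk

# nltk.data.find raises a LookupError when a resource is missing,
# so download it only in that case.
for resource, name in [
    ("tokenizers/punkt", "punkt"),
    ("corpora/stopwords", "stopwords"),
    ("corpora/wordnet", "wordnet"),
]:
    try:
        nltk.data.find(resource)
    except LookupError:
        nltk.download(name)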
Usage examples
from contextpro.normalization import batch_lowercase_text
corpus = [
"My name is Dr. Jekyll.",
"His name is Mr. Hyde",
"This guy's name is Edward Scissorhands",
"And this is Tom Parker"
]
result = batch_lowercase_text(
corpus,
num_workers=2
)
print(result)
[
"my name is dr. jekyll.",
"his name is mr. hyde",
"this guy's name is edward scissorhands",
"and this is tom parker"
]
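For comparison, the same output can be produced sequentially with a plain list comprehension; the batch_ variant distributes the work across num_workers workers, which pays off on larger corpora:

# Sequential equivalent of batch_lowercase_text, for illustration only
result = [sentence.lower() for sentence in corpus]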
from contextpro.normalization import batch_remove_non_ascii_characters
corpus = [
"https://sitebulb.com/Folder/øê.html?大学",
"J\xf6reskog bi\xdfchen Z\xfcrcher"
"This is a \xA9 but not a \xAE"
"fractions \xBC, \xBD, \xBE"
]
result = batch_remove_non_ascii_characters(
corpus,
num_workers=2
)
print(result)
[
"https://sitebulb.com/Folder/.html?",
"Jreskog bichen Zrcher",
"This is a but not a ",
"fractions , , "
]
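Conceptually, this amounts to dropping every code point outside the ASCII range, which the standard library can do for a single string (shown here only as an illustration, not as contextpro's actual implementation):

# Encode to ASCII, ignoring characters that cannot be represented
ascii_only = "J\xf6reskog bi\xdfchen Z\xfcrcher".encode("ascii", errors="ignore").decode("ascii")
print(ascii_only)  # Jreskog bichen Zrcher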
from contextpro.normalization import batch_replace_contractions
corpus = [
"I don't want to be rude, but you shouldn't do this",
"Do you think he'll pass his driving test?",
"I'll see you next week",
"I'm going for a walk"
]
result = batch_replace_contractions(
corpus,
num_workers=2
)
print(result)
[
"I do not want to be rude, but you should not do this",
"Do you think he will pass his driving test?",
"I will see you next week",
"I am going for a walk",
]
from contextpro.normalization import batch_remove_stopwords
corpus = [
['My', 'name', 'is', 'Dr', 'Jekyll'],
['His', 'name', 'is', 'Mr', 'Hyde'],
['This', 'guy', 's', 'name', 'is', 'Edward', 'Scissorhands'],
['And', 'this', 'is', 'Tom', 'Parker']
]
result = batch_remove_stopwords(
corpus,
num_workers=2
)
print(result)
[
['My', 'name', 'Dr', 'Jekyll'],
['His', 'name', 'Mr', 'Hyde'],
['This', 'guy', 'name', 'Edward', 'Scissorhands'],
['And', 'Tom', 'Parker']
]
from contextpro.normalization import batch_lemmatize
corpus = [
["I", "like", "driving", "a", "car"],
["I", "am", "going", "for", "a", "walk"],
["What", "are", "you", "doing"],
["Where", "are", "you", "coming", "from"]
]
result = batch_lemmatize(
corpus,
num_workers=2,
pos="v"
)
print(result)
[
['I', 'like', 'drive', 'a', 'car'],
['I', 'be', 'go', 'for', 'a', 'walk'],
['What', 'be', 'you', 'do'],
['Where', 'be', 'you', 'come', 'from']
]
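The pos="v" argument mirrors NLTK's WordNet part-of-speech tags ("v" for verbs). A sequential sketch with NLTK's WordNetLemmatizer yields the same lemmas; contextpro's internals may differ:

from nltk.stem import WordNetLemmatizer

# Lemmatize each token as a verb, one tokenized document at a time
lemmatizer = WordNetLemmatizer()
result = [
    [lemmatizer.lemmatize(token, pos="v") for token in tokens]
    for tokens in corpus
]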
from contextpro.normalization import batch_convert_numerals_to_numbers
corpus = [
"A bunch of five",
"A picture is worth a thousand words",
"A stitch in time saves nine",
"Back to square one",
"Behind the eight ball",
"Between two stools",
]
result = batch_convert_numerals_to_numbers(
corpus,
num_workers=2
)
print(result)
[
'A bunch of 5',
'A picture is worth a 1000 words',
'A stitch in time saves 9',
'Back to square 1',
'Behind the 8 ball',
'Between 2 stools',
]
from contextpro.feature_extraction import ConcurrentCountVectorizer
corpus = [
"My name is Dr. Jekyll.",
"His name is Mr. Hyde",
"This guy's name is Edward Scissorhands",
"And this is Tom Parker"
]
cvv = ConcurrentCountVectorizer(
lowercase=True,
remove_stopwords=True,
ngram_range=(1, 1),
num_workers=2
)
transformed = cvv.fit_transform(corpus)
print(cvv.get_feature_names())
[
'dr', 'edward', 'guy', 'hyde', 'jekyll', 'mr',
'name', 'parker', 'scissorhands', 'tom'
]
print(transformed.toarray())
[
[1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 1, 0, 0, 0],
[0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
]
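To inspect the document-term matrix with labeled columns, it can be wrapped in a pandas DataFrame (pandas is not required by contextpro; this is only a convenience for viewing the result):

import pandas as pd

# Rows are documents, columns are the extracted features
dtm = pd.DataFrame(transformed.toarray(), columns=cvv.get_feature_names())
print(dtm)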
from contextpro.statistics import batch_calculate_corpus_statistics
corpus = [
"My name is Dr. Jekyll.",
"His name is Mr. Hyde",
"This guy's name is Edward Scissorhands",
"And this is Tom Parker"
]
statistics = batch_calculate_corpus_statistics(
corpus,
lowercase=False,
remove_stopwords=False,
num_workers=2,
)
print(statistics)
   characters  tokens  punctuation_characters  digits  whitespace_characters  ascii_characters  sentiment_score  subjectivity_score
0          22       5                       2       0                      4                22              0.0                 0.0
1          20       5                       1       0                      4                20              0.0                 0.0
2          38       7                       1       0                      5                38              0.0                 0.0
3          22       5                       0       0                      4                22              0.0                 0.0
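The printed output suggests the statistics come back as a pandas DataFrame; assuming that is the case, individual metrics can be selected like any other DataFrame columns:

# Select a subset of the computed statistics (assumes a DataFrame return type)
print(statistics[["characters", "tokens", "sentiment_score"]])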
Release History
- 0.1.0
- First release
Meta
Łukasz Zawieska – zawieskal@yahoo.com
Distributed under the MIT license. See LICENSE for more information.