Aruana is a collection of methods that can be used for simple NLP tasks and for machine learning text preprocessing.

Aruana NLP for Machine Learning

Aruana is a collection of methods for simple NLP tasks, in particular text preprocessing for machine learning. It works mainly with text strings and lists of strings.

The library is developed in Python 3.

Installing Aruana

pip

$ pip3 install aruana

If you want, you can also install Aruana in a virtual environment:

$ python -m venv .env

$ source .env/bin/activate

$ pip3 install aruana

Prerequisites

Aruana uses the following external Python libraries:

nltk (3.3)
tqdm (4.19.5)
pdoc (0.5.1)

They are all documented in the requirements.txt file.

Usage examples

To use Aruana, initialize it with one of the three available languages ('en', 'fr', 'pt-br'):

from aruana import Aruana  # assumed import path; the original does not show the import
aruana_en = Aruana('en')

Quick preprocessing

Aruana provides the preprocess method, which applies commonly used preprocessing steps to your text.

text = "At the end of the day, you're solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure."
preprocessed_text = aruana_en.preprocess(text)
print(preprocessed_text)

>>> ['at', 'the', 'end', 'of', 'the', 'day', 'you', 'are', 'sole', 'respons', 'for', 'your', 'success', 'and', 'your', 'failur', 'and', 'the', 'sooner', 'you', 'realiz', 'that', 'you', 'accept', 'that', 'and', 'integr', 'that', 'into', 'your', 'work', 'ethic', 'you', 'will', 'start', 'be', 'success', 'as', 'long', 'as', 'you', 'blame', 'other', 'for', 'the', 'reason', 'you', 'are', 'not', 'where', 'you', 'want', 'to', 'be', 'you', 'will', 'alway', 'be', 'a', 'failur']

If you prefer, you can choose whether to:

tokenize the sentence
stem it
remove stop words
POS-tag Portuguese sentences

text = "At the end of the day, you're solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure."
preprocessed_text = aruana_en.preprocess(text, stem=False, remove_stopwords=True)
print(preprocessed_text)

>>> ['end', 'day', 'solely', 'responsible', 'success', 'failure', 'sooner', 'realize', 'that', 'accept', 'that', 'integrate', 'work', 'ethic', 'start', 'successful', 'long', 'blame', 'others', 'reason', 'want', 'be', 'always', 'failure']
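
To make the kind of steps involved more concrete, here is a minimal standard-library sketch of a lowercase, expand contractions, strip punctuation, tokenize flow with optional stop-word removal. It is not Aruana's implementation: the `simple_preprocess` name and the tiny `CONTRACTIONS` and `STOPWORDS` tables are hypothetical and exist only for illustration.

```python
import re

# Hypothetical, deliberately tiny lookup tables (not Aruana's own data).
CONTRACTIONS = {"you're": "you are", "aren't": "are not"}
STOPWORDS = {"the", "of", "you", "are", "for", "your", "and", "at"}


def simple_preprocess(text, remove_stopwords=False):
    """Lowercase, expand contractions, drop punctuation, and tokenize."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Replace everything except letters, digits and whitespace with a space,
    # then split on whitespace to get the tokens.
    tokens = re.sub(r"[^a-z0-9\s]", " ", text).split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


sample = "At the end of the day, you're solely responsible."
print(simple_preprocess(sample))
# ['at', 'the', 'end', 'of', 'the', 'day', 'you', 'are', 'solely', 'responsible']
print(simple_preprocess(sample, remove_stopwords=True))
# ['end', 'day', 'solely', 'responsible']
```

Contractions are expanded before punctuation is stripped; otherwise the apostrophe would be removed first and "you're" could no longer be matched.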

List preprocessing

If you have a list of sentences or you are using Pandas, you can pass the entire list for preprocessing by using the preprocess_list method.

list_of_strings = ['I love you',
                   'Please, never leave me alone',
                   'If you go, I will die',
                   'I am watching a lot of romantic comedy lately',
                   'I have to eat icecream']

list_processed = aruana_en.preprocess_list(list_of_strings, stem=False, remove_stopwords=True)

print(list_processed)

>>> [['love'], ['please', 'never', 'leave', 'alone'], ['go', 'die'], ['watching', 'lot', 'romantic', 'comedy', 'lately'], ['eat', 'icecream']]
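
Conceptually, list preprocessing applies a single-string routine to every element and returns one token list per input string. The sketch below illustrates that idea with the standard library only; `tokenize_one`, `preprocess_many`, and the small stop-word set are hypothetical names and data, not Aruana's code.

```python
import re

# Illustrative stop-word set; Aruana's real list is not reproduced here.
STOPWORDS = {"i", "you", "me", "if", "will", "am", "a", "of", "have", "to"}


def tokenize_one(text):
    """Lowercase a string, keep alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def preprocess_many(texts):
    """Apply tokenize_one to each string, preserving input order."""
    return [tokenize_one(t) for t in texts]


sentences = ["I love you", "If you go, I will die", "I have to eat icecream"]
print(preprocess_many(sentences))
# [['love'], ['go', 'die'], ['eat', 'icecream']]
```

With Pandas, a per-string function like `tokenize_one` could also be applied to a text column through `Series.apply`.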

Defining your own pipeline

Use the individual methods to build a custom pipeline instead of relying on the quick preprocess method.

text = "At the end of the day, @john you're solely responsible for your #success and your #failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being #successful."
text = aruana_en.lower_remove_white(text)
text = aruana_en.expand_contractions(text)
text = aruana_en.replace_handles(text, 'HANDLE')
text = aruana_en.replace_hashtags(text, 'HASHTAG')
text = aruana_en.remove_stopwords(text)
text = aruana_en.replace_punctuation(text, placeholder='PUNCTUATION')
text = aruana_en.tokenize(text)
print(text)

>>> ['end', 'day', 'PUNCTUATION', 'HANDLE', 'solely', 'responsible', 'HASHTAG', 'HASHTAG', 'PUNCTUATION', 'sooner', 'realize', 'that', 'PUNCTUATION', 'accept', 'that', 'PUNCTUATION', 'integrate', 'work', 'ethic', 'PUNCTUATION', 'start', 'HASHTAG', 'PUNCTUATION']
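
As a rough illustration of how placeholder replacement and step chaining could work, the sketch below reimplements two of the steps with regular expressions and runs them through a small pipeline helper. The patterns (a handle is `@` plus word characters, a hashtag is `#` plus word characters) and the `run_pipeline` helper are assumptions made for this example; Aruana's own matching rules may differ.

```python
import re

# Assumed patterns, chosen for illustration only.
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def replace_handles(text, placeholder):
    """Swap every @handle for the given placeholder."""
    return HANDLE_RE.sub(placeholder, text)


def replace_hashtags(text, placeholder):
    """Swap every #hashtag for the given placeholder."""
    return HASHTAG_RE.sub(placeholder, text)


def run_pipeline(text, steps):
    """Apply each step in order; every step takes and returns a string."""
    for step in steps:
        text = step(text)
    return text


steps = [
    str.lower,
    lambda t: replace_handles(t, "HANDLE"),
    lambda t: replace_hashtags(t, "HASHTAG"),
]
print(run_pipeline("Thanks @john for the #success", steps))
# thanks HANDLE for the HASHTAG
```

Lowercasing runs first so that the uppercase placeholders stay intact. Note that such a simple handle pattern would also match the domain part of an email address.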

Development

Testing

Create a clean test environment.
Navigate to the aruana project directory on your computer and generate a package using bdist_wheel:
```
 $ python3 setup.py sdist bdist_wheel
```
Install the package
```
 $ python3 setup.py install
```

Docs

Navigate to aruana/aruana and type:

$ pdoc --html aruana

Release

Follow the steps below before releasing a new version:

Update all necessary documents
Generate the package using bdist_wheel
Install the new version in a clean environment for testing
If everything works, generate the docs using pdoc

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

Use SemVer for versioning.

Authors

Wilame Vallantin - Initial work - Nhe'eng

License

This project is licensed under the Apache License - see the LICENSE.md file for details.

V. 1.1.1

New features

Adds the random_classification method, useful for random text classification for testing model accuracy
Adds the replace_with_blob method, useful for creating blobs from a corpus for testing method accuracy
Adds the strings module, with a list of punctuation and diacritic strings
Adds an internal tokenizer
Adds a POS tagger for Portuguese (experimental, version 0.0.1)
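
To show why random classification is useful when testing model accuracy, here is a minimal sketch of a chance-level baseline: labels are assigned at random, and the resulting accuracy gives a floor that a real model should beat. The `random_labels` and `accuracy` functions are hypothetical stand-ins and do not reflect the signature or behavior of Aruana's random_classification method.

```python
import random


def random_labels(texts, labels, seed=None):
    """Assign a uniformly random label to each text (chance-level baseline)."""
    rng = random.Random(seed)  # local generator, so a seed gives repeatable runs
    return [rng.choice(labels) for _ in texts]


def accuracy(predicted, expected):
    """Fraction of positions where the two label lists agree."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


texts = ["I love you", "If you go, I will die", "I have to eat icecream"]
gold = ["pos", "neg", "pos"]
baseline = random_labels(texts, ["pos", "neg"], seed=0)
print(baseline, accuracy(baseline, gold))
```

With two balanced classes, the baseline accuracy should hover around 0.5 over a large sample; a model that scores close to this value has learned little.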

Improvements

expand_contractions now recognizes more Portuguese words
preprocess now converts emojis to text instead of removing them completely
Replaces the NLTK tokenizer with an internal tokenizer
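
One possible way to convert emojis to text, shown here only as a sketch, is to look up each symbol's official Unicode name with the standard `unicodedata` module. How Aruana performs the conversion is not documented in this description; the `emojis_to_text` function and the choice of the "So" (Symbol, other) category as the emoji test are assumptions made for this example.

```python
import unicodedata


def emojis_to_text(text):
    """Replace 'So' category characters with their lowercase Unicode names."""
    pieces = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if unicodedata.category(ch) == "So" and name:
            # Pad with spaces so the name becomes a separate token.
            pieces.append(" " + name.lower().replace(" ", "_") + " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


print(emojis_to_text("Great job \U0001F600"))
# Great job grinning_face
```

The "So" category is broader than emojis (it also covers symbols such as the copyright sign), so a production version would need a stricter test.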

Release history

1.1.1 (Feb 8, 2019, this version)
1.0.0 (Jan 21, 2019)
0.0.1 (Jan 21, 2019)
