
Project description

Aruana NLP for Machine Learning

Aruana is a collection of methods for simple NLP tasks and for text preprocessing in machine learning. It works mainly with text strings and lists of strings.

The library is developed in Python 3.

Installing Aruana

pip

$ pip3 install aruana

If you want, you can also install Aruana in a virtual environment:

$ python -m venv .env

$ source .env/bin/activate

$ pip3 install aruana

Prerequisites

Aruana uses the following external Python libraries:

  • nltk (3.3)

  • tqdm (4.19.5)

  • pdoc (0.5.1)

They are all documented in the requirements.txt file.
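The pins presumably mirror the versions listed above (a sketch, not the authoritative file; check requirements.txt in the repository):

nltk==3.3
tqdm==4.19.5
pdoc==0.5.1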

Usage examples

To use Aruana, initialize it with one of the three available languages ('en', 'fr', 'pt-br'):

from aruana import Aruana

aruana_en = Aruana('en')

Quick preprocessing

Aruana provides the preprocess method, which applies commonly used preprocessing steps to your text.

text = "At the end of the day, you're solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure."
preprocessed_text = aruana_en.preprocess(text)
print(preprocessed_text)

>>> ['at', 'the', 'end', 'of', 'the', 'day', 'you', 'are', 'sole', 'respons', 'for', 'your', 'success', 'and', 'your', 'failur', 'and', 'the', 'sooner', 'you', 'realiz', 'that', 'you', 'accept', 'that', 'and', 'integr', 'that', 'into', 'your', 'work', 'ethic', 'you', 'will', 'start', 'be', 'success', 'as', 'long', 'as', 'you', 'blame', 'other', 'for', 'the', 'reason', 'you', 'are', 'not', 'where', 'you', 'want', 'to', 'be', 'you', 'will', 'alway', 'be', 'a', 'failur']
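As the output shows, by default preprocess lowercases the text, expands contractions ("you're" becomes "you are"), strips punctuation, stems each token ("solely" becomes "sole", "failure" becomes "failur"), and returns a list of tokens.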

If you prefer, you can choose to:

  • tokenize the sentence

  • stem it

  • remove stop words

  • POS-tag Portuguese sentences

      text = "At the end of the day, you're solely responsible for your success and your failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being successful. As long as you blame others for the reason you aren't where you want to be, you will always be a failure."
    
      preprocessed_text = aruana_en.preprocess(text, stem=False, remove_stopwords=True)
    
      print(preprocessed_text)
    
      ['end', 'day', 'solely', 'responsible', 'success', 'failure', 'sooner', 'realize', 'that', 'accept', 'that', 'integrate', 'work', 'ethic', 'start', 'successful', 'long', 'blame', 'others', 'reason', 'want', 'be', 'always', 'failure']
    

List preprocessing

If you have a list of sentences, or you are working with Pandas, you can preprocess the entire list at once with the preprocess_list method.

list_of_strings = ['I love you',
                   'Please, never leave me alone',
                   'If you go, I will die',
                   'I am watching a lot of romantic comedy lately',
                   'I have to eat icecream']

list_processed = aruana_en.preprocess_list(list_of_strings, stem=False, remove_stopwords=True)

print(list_processed)

>>> [['love'], ['please', 'never', 'leave', 'alone'], ['go', 'die'], ['watching', 'lot', 'romantic', 'comedy', 'lately'], ['eat', 'icecream']]
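If your sentences live in a Pandas column, a minimal sketch looks like this (the DataFrame and its text column are illustrative; it assumes preprocess_list returns one token list per input string, as shown above):

import pandas as pd
from aruana import Aruana

aruana_en = Aruana('en')

df = pd.DataFrame({'text': ['I love you', 'Please, never leave me alone']})

# preprocess_list expects a list of strings, so convert the column first
df['tokens'] = aruana_en.preprocess_list(df['text'].tolist(), stem=False, remove_stopwords=True)
print(df['tokens'])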

Defining your own pipeline

Instead of the quick preprocess method, you can combine the individual methods into a custom pipeline.

text = "At the end of the day, @john you're solely responsible for your #success and your #failure. And the sooner you realize that, you accept that, and integrate that into your work ethic, you will start being #successful."
text = aruana_en.lower_remove_white(text)
text = aruana_en.expand_contractions(text)
text = aruana_en.replace_handles(text, 'HANDLE')
text = aruana_en.replace_hashtags(text, 'HASHTAG')
text = aruana_en.remove_stopwords(text)
text = aruana_en.replace_punctuation(text, placeholder='PUNCTUATION')
text = aruana_en.tokenize(text)
print(text)

>>> ['end', 'day', 'PUNCTUATION', 'HANDLE', 'solely', 'responsible', 'HASHTAG', 'HASHTAG', 'PUNCTUATION', 'sooner', 'realize', 'that', 'PUNCTUATION', 'accept', 'that', 'PUNCTUATION', 'integrate', 'work', 'ethic', 'PUNCTUATION', 'start', 'HASHTAG', 'PUNCTUATION']
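If you reuse the same pipeline in several places, the steps above can be wrapped in a small helper. The sketch below uses only the methods already shown; the function name preprocess_tweet is hypothetical:

from aruana import Aruana

aruana_en = Aruana('en')

def preprocess_tweet(text):
    # Same steps as above: normalize case and whitespace, expand
    # contractions, mask handles and hashtags, drop stopwords,
    # mask punctuation, then tokenize.
    text = aruana_en.lower_remove_white(text)
    text = aruana_en.expand_contractions(text)
    text = aruana_en.replace_handles(text, 'HANDLE')
    text = aruana_en.replace_hashtags(text, 'HASHTAG')
    text = aruana_en.remove_stopwords(text)
    text = aruana_en.replace_punctuation(text, placeholder='PUNCTUATION')
    return aruana_en.tokenize(text)

tokens = preprocess_tweet("@john you're #winning")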

Development

Testing

  1. Create a clean test environment

  2. Navigate to the aruana project on your computer and generate a package using sdist and bdist_wheel

     $ python3 setup.py sdist bdist_wheel
    
  3. Install the package

     $ python3 setup.py install
    

Docs

Navigate to aruana/aruana and type:

$ pdoc --html aruana

Release

Follow the steps below before releasing a new version:

  1. Update all necessary documents

  2. Generate the package using sdist and bdist_wheel

  3. Install the new version on a clean environment for testing

  4. If everything is OK, generate the docs using pdoc

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning.

Authors

  • Wilame Vallantin - Initial work - Nhe'eng

License

This project is licensed under the Apache License - see the LICENSE.md file for details.

V. 1.1.1

New features

  • Adds the random_classification method, useful for randomly classifying text when testing model accuracy (see the sketch after this list)
  • Adds the replace_with_blob method, useful for creating blobs from a corpus for testing method accuracy
  • Adds the strings module, with a list of punctuation and diacritic strings
  • Adds an internal tokenizer
  • Adds a POS tagger for Portuguese (experimental, version 0.0.1)
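The exact signature of random_classification is not shown on this page. As an assumption-heavy sketch, a random baseline check might look roughly like this (both the labels parameter and the return shape are guesses, not the documented API):

# Hypothetical usage of random_classification as a random baseline;
# the 'labels' parameter and return shape are assumptions.
texts = ['I love you', 'If you go, I will die']
random_labels = aruana_en.random_classification(texts, labels=['positive', 'negative'])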

Improvements

  • expand_contractions now recognizes more words for Portuguese
  • preprocess now converts emojis to text instead of removing them completely
  • Replaces the NLTK tokenizer with an internal tokenizer



Download files

Download the file for your platform.

Source Distribution

  • Aruana-1.1.1.tar.gz (451.5 kB)

Built Distributions

  • Aruana-1.1.1-py3.5.egg (1.1 MB)

  • Aruana-1.1.1-py3-none-any.whl (466.1 kB)

File details

Details for the file Aruana-1.1.1.tar.gz.

File metadata

  • Download URL: Aruana-1.1.1.tar.gz
  • Size: 451.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.19.5 CPython/3.5.6

File hashes

Hashes for Aruana-1.1.1.tar.gz

Algorithm     Hash digest
SHA256        cb618ea6e5d1815a0b514a489359e380f82990403f05757be6fc39ec7324951f
MD5           791e087546a88de0aea0de3cb9f56daf
BLAKE2b-256   50fe6a4764a1e334bf73b30ab4ea01cf216723e83259bcdc7893d7cf4b50a633


File details

Details for the file Aruana-1.1.1-py3.5.egg.

File metadata

  • Download URL: Aruana-1.1.1-py3.5.egg
  • Size: 1.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.19.5 CPython/3.5.6

File hashes

Hashes for Aruana-1.1.1-py3.5.egg

Algorithm     Hash digest
SHA256        d83cb6fd6ef129272e30ef5ff09ae9e2ecd468130558557659ab5291240429fb
MD5           e679690a9826d2501a75669f5bbd419c
BLAKE2b-256   ee8488fd147fc05890a8671a2ac81d17b2b3f207a99e498dfc39be82ffe52448


File details

Details for the file Aruana-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: Aruana-1.1.1-py3-none-any.whl
  • Size: 466.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.12.1 pkginfo/1.4.2 requests/2.20.1 setuptools/40.6.3 requests-toolbelt/0.8.0 tqdm/4.19.5 CPython/3.5.6

File hashes

Hashes for Aruana-1.1.1-py3-none-any.whl

Algorithm     Hash digest
SHA256        ef4ecde8ae8083131e9773f7204b3bb0ae33d6c38c8e1cd8c98964ac1f3491ce
MD5           ad2b98bef0ca212b3e1f0513749e51c0
BLAKE2b-256   488222f17f45e45c47ad4896c097705786344df46ecc29b79d56ff9b1f340a7e

