
NLPretext


All the goto functions you need to handle NLP use-cases, integrated in NLPretext

TL;DR

Working on an NLP project and tired of always looking for the same silly preprocessing functions on the web? :tired_face:

Need to efficiently extract email addresses from a document? Hashtags from tweets? Remove accents from a French post? :disappointed_relieved:

NLPretext got you covered! :rocket:

NLPretext gathers in a single library all the text preprocessing functions you need to ease your NLP project.

:mag: Quickly explore our preprocessing pipelines and the reference of individual functions below.

Can't find what you're looking for? Feel free to open an issue.

Installation

Supported Python Versions

  • Main supported version: 3.8
  • Other supported versions: 3.9, 3.10

We strongly advise you to do the remaining steps in a virtual environment.
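
For example, with Python's built-in venv module:

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate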

To install this library from PyPI, run the following command:

pip install nlpretext

or with Poetry:

poetry add nlpretext

Usage

Default pipeline

Need to preprocess your text data but no clue about which functions to use and in which order? The default preprocessing pipeline got you covered:

from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I  recommend ๐Ÿ˜€ #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life!!! I recommend"

Create your custom pipeline

Another possibility is to create a custom pipeline if you know exactly which functions to apply to your data; here's an example:

from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (normalize_whitespace, remove_punct, remove_eol_characters,
remove_stopwords, lower_text)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I  recommend ๐Ÿ˜€ #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"

Take a look at all the available functions in the preprocess.py scripts of the different folders: basic, social, token.
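
Each of these functions can also be imported and applied on its own, without building a pipeline. As a minimal sketch, here is remove_accents (listed in the Credits below), assuming it is exposed in nlpretext.basic.preprocess like the other basic functions:

from nlpretext.basic.preprocess import remove_accents
example = "Ce dîner au café était délicieux"
example = remove_accents(example)
print(example)
# "Ce diner au cafe etait delicieux"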

Load text data

Pre-processing text data is useful only if you have loaded data to process! Importing text data as strings in your code can be really simple if you have short texts in a local .txt file, but it quickly becomes difficult if you want to load many texts, stored in multiple formats and split across multiple files. Fortunately, you can use NLPretext's TextLoader class to easily import text data. While it is not mandatory, TextLoader works best with Dask; make sure the library is installed if you want the best performance.

from nlpretext.textloader import TextLoader
files_path = "local_folder/texts/text.txt"
text_loader = TextLoader(use_dask=True)
text_dataframe = text_loader.read_text(files_path)
print(text_dataframe.text.values.tolist())
# ["I just got the best dinner in my life!!!",  "I recommend", "It was awesome"]

File paths can be provided as a string or a list of strings, with or without wildcards. Imports from cloud providers are also supported, provided your machine is authenticated on the project.

text_loader = TextLoader(text_column="name_of_text_column_in_your_data")

local_file_path = "local_folder/texts/text.csv" # File from local folder
local_corpus_path = ["local_folder/texts/text_1.csv", "local_folder/texts/text_2.csv", "local_folder/texts/text_3.csv"] # Multiple files from local folder

gcs_file_path = "gs://my-bucket/texts/text.json" # File from GCS
s3_file_path = "s3://my-bucket/texts/text.json" # File from S3
hdfs_file_path = "hdfs://folder/texts/text.txt" # File from HDFS
azure_file_path = "az://my-bucket/texts/text.parquet" # File from Azure

gcs_corpus_path = "gs://my-bucket/texts/text_*.json" # Multiple files from GCS with wildcard

text_dataframe_1 = text_loader.read_text(local_file_path)
text_dataframe_2 = text_loader.read_text(local_corpus_path)
text_dataframe_3 = text_loader.read_text(gcs_file_path)
text_dataframe_4 = text_loader.read_text(s3_file_path)
text_dataframe_5 = text_loader.read_text(hdfs_file_path)
text_dataframe_6 = text_loader.read_text(azure_file_path)
text_dataframe_7 = text_loader.read_text(gcs_corpus_path)

You can also specify a Preprocessor if you want your data to be directly pre-processed when loaded.

text_loader = TextLoader(text_column="text_col")
preprocessor = Preprocessor()

file_path = "local_folder/texts/text.csv" # File from local folder

raw_text_dataframe = text_loader.read_text(local_file_path)
preprocessed_text_dataframe = text_loader.read_text(local_file_path, preprocessor=preprocessor)

print(raw_text_dataframe.text_col.values.tolist())
# ["These   texts are not preprocessed",  "This is bad ## "]

print(preprocessed_text_dataframe.text_col.values.tolist())
# ["These texts are not preprocessed",  "This is bad"]

Individual Functions

Replacing emails

from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to obama@whitehouse.gov"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"

Replacing phone numbers

from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"

Removing Hashtags

from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"

Extracting emojis

from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin ๐Ÿ˜€"
example = extract_emojis(example)
print(example)
# [':grinning_face:']

Data augmentation

The augmentation module helps you generate new texts from your given examples by modifying some words in the initial ones while keeping the associated entities, if any, unchanged for NER tasks. If you want words other than entities to remain unchanged, you can list them in the stopwords argument (see the sketch after the example below). The modifications depend on the chosen method; the methods currently supported by the module are substitutions with synonyms, using either WordNet or BERT from the nlpaug library.

from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
print(example)
# "I need to buy a small black pocketbook please."

📈 Releases

You can see the list of available releases on the GitHub Releases page.

We follow Semantic Versions specification.

We use Release Drafter. As pull requests are merged, a draft release is kept up to date listing the changes, ready to publish when you're ready. With the categories option, you can categorize pull requests in release notes using labels.

For Pull Requests, these labels are configured, by default:

Label                           Title in Releases
enhancement, feature            🚀 Features
bug, refactoring, bugfix, fix   🔧 Fixes & Refactoring
build, ci, testing              📦 Build System & CI/CD
breaking                        💥 Breaking Changes
documentation                   📝 Documentation
dependencies                    ⬆️ Dependencies updates

GitHub creates the bug, enhancement, and documentation labels automatically. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of the GitHub repository when needed.

🛡 License

This project is licensed under the terms of the Apache Software License 2.0. See LICENSE for more details.

📃 Citation

@misc{nlpretext,
  author = {artefactory},
  title = {All the goto functions you need to handle NLP use-cases, integrated in NLPretext},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/artefactory/NLPretext}}
}

Project Organization


├── LICENSE
├── CONTRIBUTING.md     <- Contribution guidelines
├── CODE_OF_CONDUCT.md  <- Code of conduct guidelines
├── Makefile
├── README.md           <- The top-level README for developers using this project.
├── .github/workflows   <- Where the CI and CD live
├── datasets/external   <- Bash scripts to download external datasets
├── docker              <- All you need to build a Docker image from this package
├── docs                <- Sphinx HTML documentation
├── nlpretext           <- Main package. This is where the code lives
│   ├── preprocessor.py <- Main preprocessing script
│   ├── text_loader.py  <- Main loading script
│   ├── augmentation    <- Text augmentation scripts
│   ├── basic           <- Basic text preprocessing
│   ├── cli             <- Command lines that can be used
│   ├── social          <- Social text preprocessing
│   ├── token           <- Token text preprocessing
│   ├── textloader      <- File loading
│   ├── _config         <- Where the configuration and constants live
│   └── _utils          <- Where preprocessing utils scripts live
├── tests               <- Where the tests live
├── pyproject.toml      <- Package configuration
├── poetry.lock
└── setup.cfg           <- Configuration for plugins and other utils

Credits

  • textacy for the following basic preprocessing functions:
    • fix_bad_unicode
    • normalize_whitespace
    • unpack_english_contractions
    • replace_urls
    • replace_emails
    • replace_numbers
    • replace_currency_symbols
    • remove_punct
    • remove_accents
    • replace_phone_numbers (with some modifications of our own)

