Easily clean text with spaCy!

spacy-cleaner

Key Features

spacy-cleaner utilises spaCy Language models to replace, remove, and mutate spaCy tokens. Cleaning actions available are:

  • Remove/replace stopwords.
  • Remove/replace punctuation.
  • Remove/replace numbers.
  • Remove/replace emails.
  • Remove/replace URLs.
  • Perform lemmatisation.

See our docs for more information.

Installation

pip install -U spacy-cleaner

Or install with Poetry:

poetry add spacy-cleaner

📖 Example

spacy-cleaner can clean text written in any language spaCy has a model for:

import spacy
from spacy_cleaner import processing, Cleaner

model = spacy.load("en_core_web_sm")

The Cleaner class allows for configurable cleaning of text using spaCy. A Cleaner is initialised with a model and one or more functions that transform spaCy tokens:

cleaner = Cleaner(
    model,
    processing.remove_stopword_token,
    processing.replace_punctuation_token,
    processing.mutate_lemma_token,
)
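
To make the idea of "functions that transform tokens" concrete, here is a minimal standard-library sketch of a cleaner built from small per-token functions. It is not spacy-cleaner's implementation: plain strings stand in for spaCy tokens, the stopword list is a tiny made-up sample, lowercasing is a crude stand-in for lemmatisation, and the rule that the first function returning a value wins is an assumption of this sketch. All names (`clean_text`, `remove_stopword`, `replace_punctuation`, `lowercase`) are hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical stand-in: a plain string plays the role of a spaCy token.
# A transform returns a replacement string, "" to drop the token,
# or None when it does not apply to this token.
Transform = Callable[[str], Optional[str]]

STOPWORDS = {"my", "name", "is", "i", "to"}  # tiny illustrative sample


def remove_stopword(token: str) -> Optional[str]:
    """Drop the token if it is a stopword."""
    return "" if token.lower() in STOPWORDS else None


def replace_punctuation(token: str) -> Optional[str]:
    """Replace a punctuation token with a placeholder."""
    return "_IS_PUNCT_" if re.fullmatch(r"[^\w\s]+", token) else None


def lowercase(token: str) -> Optional[str]:
    """Mutate the token (a crude stand-in for lemmatisation)."""
    return token.lower()


def clean_text(text: str, *transforms: Transform) -> str:
    """Apply the transforms to each token; the first one that applies wins."""
    tokens = re.findall(r"\w+|[^\w\s]", text)  # naive word/punctuation split
    cleaned = []
    for token in tokens:
        for transform in transforms:
            result = transform(token)
            if result is not None:
                token = result
                break
        if token:  # empty string means the token was removed
            cleaned.append(token)
    return " ".join(cleaned)


print(clean_text(
    "Hello, my name is Cellan! I love to swim!",
    remove_stopword,
    replace_punctuation,
    lowercase,
))
```

The three functions mirror the three kinds of action the library names (remove, replace, mutate). The real library works on spaCy tokens and uses the model's own stopword, punctuation, and lemma information, so its results can differ from this sketch.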

Next, call the cleaner's clean method to clean a list of texts:

texts = ["Hello, my name is Cellan! I love to swim!"]

cleaner.clean(texts)

About the method clean:

The method clean is a wrapper around the spaCy Language class method pipe. Check the docs for more information:

https://spacy.io/api/language#pipe

Giving the output:

['hello _IS_PUNCT_ Cellan _IS_PUNCT_ love swim _IS_PUNCT_']
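
Since clean is described as a wrapper around Language.pipe, and spaCy documents pipe as processing texts as a stream in batches and yielding results in order, the general pattern can be sketched with the standard library alone. This is only an illustration of streaming in batches, not spaCy's or spacy-cleaner's code; `pipe_in_batches`, `process`, and the default `batch_size` are hypothetical names and values chosen for the example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def pipe_in_batches(
    texts: Iterable[str],
    process: Callable[[str], str],
    batch_size: int = 2,
) -> Iterator[str]:
    """Lazily process texts batch by batch, yielding results in input order."""
    iterator = iter(texts)
    while True:
        # Pull at most batch_size items without reading the whole input.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        for text in batch:
            yield process(text)


# Results come back in the same order as the inputs.
print(list(pipe_in_batches(["one", "two", "three"], str.upper)))
```

Processing in batches keeps memory use bounded on large inputs, which is one reason a streaming interface suits cleaning many texts at once.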

📈 Releases

You can see the list of available releases on the GitHub Releases page.

We follow Semantic Versions specification.

We use Release Drafter. As pull requests are merged, a draft release is kept up to date with a list of the changes, so it can be published at any time. With the categories option, pull requests are grouped in the release notes according to their labels.

List of labels and corresponding titles

Label                            Title in Releases
enhancement, feature             🚀 Features
bug, refactoring, bugfix, fix    🔧 Fixes & Refactoring
build, ci, testing               📦 Build System & CI/CD
breaking                         💥 Breaking Changes
documentation                    📝 Documentation
dependencies                     ⬆️ Dependencies updates

You can update this mapping in release-drafter.yml.

GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository when you need them.

🛡 License

This project is licensed under the terms of the MIT license. See LICENSE for more details.

📃 Citation

@misc{spacy-cleaner,
  author = {spacy-cleaner},
  title = {Easily clean text with spaCy!},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Ce11an/spacy-cleaner}}
}

🚀 Credits

This project was generated with python-package-template.

This project was built using IntelliJ IDEA.
