Easily clean text with spaCy!
Key Features
spacy-cleaner utilises spaCy Language models to replace, remove, and
mutate spaCy tokens. Cleaning actions available are:
- Remove/replace stopwords.
- Remove/replace punctuation.
- Remove/replace numbers.
- Remove/replace emails.
- Remove/replace URLs.
- Perform lemmatisation.
See our docs for more information.
Installation
pip install -U spacy-cleaner
or install with Poetry
poetry add spacy-cleaner
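After installing, one way to confirm which version ended up in the environment is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and is not part of spacy-cleaner.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string if spacy-cleaner is installed, otherwise None.
print(installed_version("spacy-cleaner"))
```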
📖 Example
spacy-cleaner can clean text written in any language spaCy has a model
for:
import spacy
from spacy_cleaner import processing, Cleaner
model = spacy.load("en_core_web_sm")
The Cleaner class allows for configurable cleaning of text using spaCy. A
Cleaner is initialised with a model and functions that transform spaCy
tokens:
cleaner = Cleaner(
    model,
    processing.remove_stopword_token,
    processing.replace_punctuation_token,
    processing.mutate_lemma_token,
)
Next, call the cleaner's clean method to clean a list of texts:
texts = ["Hello, my name is Cellan! I love to swim!"]
cleaner.clean(texts)
About the method clean: it is a wrapper around the spaCy Language class
method pipe. Check the spaCy docs for Language.pipe for more information.
Calling clean on the texts above gives the output:
['hello _IS_PUNCT_ Cellan _IS_PUNCT_ love swim _IS_PUNCT_']
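To make the result above easier to trace, here is a minimal pure-Python sketch of the three kinds of action (remove, replace, mutate) applied token by token. It does not use spaCy or spacy-cleaner: `Tok`, `clean_tokens`, and the three transform functions are hypothetical stand-ins, the stopword flags, punctuation flags, and lemmas are annotated by hand rather than produced by a model, and the "first matching transform wins" rule is an assumption made for illustration, not a description of how the library is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for a spaCy token, annotated by hand."""
    text: str
    lemma: str
    is_stop: bool = False
    is_punct: bool = False


def remove_stopword(tok):
    # "Remove": a stopword becomes an empty string and is dropped later.
    return "" if tok.is_stop else None


def replace_punctuation(tok):
    # "Replace": punctuation becomes a placeholder string.
    return "_IS_PUNCT_" if tok.is_punct else None


def mutate_lemma(tok):
    # "Mutate": the token text is swapped for its lemma.
    return tok.lemma


def clean_tokens(tokens, *transforms):
    """Apply the first transform that returns a string to each token (assumed rule)."""
    cleaned = []
    for tok in tokens:
        for transform in transforms:
            result = transform(tok)
            if result is not None:
                break
        else:
            result = tok.text  # no transform matched: keep the token unchanged
        if result:  # empty strings mark removed tokens
            cleaned.append(result)
    return " ".join(cleaned)


# Hand-annotated tokens for "Hello, my name is Cellan! I love to swim!"
TOKENS = [
    Tok("Hello", "hello"), Tok(",", ",", is_punct=True),
    Tok("my", "my", is_stop=True), Tok("name", "name", is_stop=True),
    Tok("is", "be", is_stop=True), Tok("Cellan", "Cellan"),
    Tok("!", "!", is_punct=True), Tok("I", "I", is_stop=True),
    Tok("love", "love"), Tok("to", "to", is_stop=True),
    Tok("swim", "swim"), Tok("!", "!", is_punct=True),
]

result = clean_tokens(TOKENS, remove_stopword, replace_punctuation, mutate_lemma)
print(result)  # hello _IS_PUNCT_ Cellan _IS_PUNCT_ love swim _IS_PUNCT_
```

With these hand-made annotations the sketch reproduces the string shown above. With a real model, the flags and lemmas come from spaCy, so results can differ between models and languages.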
📈 Releases
You can see the list of available releases on the GitHub Releases page.
We follow the Semantic Versioning specification.
We use Release Drafter. As pull requests are merged, a draft release is kept up to date with a list of the changes, ready to publish. With the categories option, pull requests can be categorised in the release notes using labels.
List of labels and corresponding titles
| Label | Title in Releases |
|---|---|
| enhancement, feature | 🚀 Features |
| bug, refactoring, bugfix, fix | 🔧 Fixes & Refactoring |
| build, ci, testing | 📦 Build System & CI/CD |
| breaking | 💥 Breaking Changes |
| documentation | 📝 Documentation |
| dependencies | ⬆️ Dependencies updates |
You can update this mapping in release-drafter.yml.
GitHub creates the bug, enhancement, and documentation labels for you. Dependabot creates the dependencies label. Create the remaining labels on the Issues tab of your GitHub repository when you need them.
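The label table can also be read as a simple lookup from pull request labels to release-note titles. The sketch below only mirrors that table in Python as a reading aid; `CATEGORIES` and `titles_for` are hypothetical names, and this is not how Release Drafter itself is implemented. The actual behaviour is governed by release-drafter.yml.

```python
# Label-to-title mapping copied from the table of labels and titles.
CATEGORIES = {
    "🚀 Features": {"enhancement", "feature"},
    "🔧 Fixes & Refactoring": {"bug", "refactoring", "bugfix", "fix"},
    "📦 Build System & CI/CD": {"build", "ci", "testing"},
    "💥 Breaking Changes": {"breaking"},
    "📝 Documentation": {"documentation"},
    "⬆️ Dependencies updates": {"dependencies"},
}


def titles_for(labels):
    """Return the release-note titles whose labels overlap the given PR labels."""
    wanted = set(labels)
    return [title for title, names in CATEGORIES.items() if names & wanted]


print(titles_for(["bugfix"]))
print(titles_for(["ci", "documentation"]))
```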
🛡 License
This project is licensed under the terms of the MIT license. See LICENSE for more details.
📃 Citation
@misc{spacy-cleaner,
author = {spacy-cleaner},
title = {Easily clean text with spaCy!},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Ce11an/spacy-cleaner}}
}
🚀 Credits
This project was generated with python-package-template.
This project was built using IntelliJ IDEA.
File details
Details for the file spacy_cleaner-3.2.1.tar.gz.
File metadata
- Download URL: spacy_cleaner-3.2.1.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89aa7bceb91b7c4710a2cdbe59cc40fd85f04a8de40c59545c0a33926a9e6cab |
| MD5 | 23e17c250fb27cd9261dcf31eeceee19 |
| BLAKE2b-256 | cb539505b923df4548a09d5e5572d74a33e0023be9d41316c9cc71a999f8b7b4 |
File details
Details for the file spacy_cleaner-3.2.1-py3-none-any.whl.
File metadata
- Download URL: spacy_cleaner-3.2.1-py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/4.0.2 CPython/3.11.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f733c8d05e3107964c1d0d11977a76f9c3db89e7658d298021d51c920c3c6c50 |
| MD5 | a7dcb6f60904020e6fffe0b040be6789 |
| BLAKE2b-256 | 4337f72fd6291546245d452fb72f90cc77908f62a2cae8a0a546c213b3e1b67f |