Light Text Pre-processing applies a chain of built-in regex rules to an input string.
Light Text Pre-processing
Light Text Pre-processing is an easy-to-use Python module that applies a chain of built-in regex rules to an input string. Regex rules are stored in a separate YML file and compiled at run-time. The compiling mechanism and how to add a custom regex are described below.
How it works
The package reads a list of regex rules from light_text_prepro/rules/regex.yml. Each row in regex.yml identifies a regex rule, such as user_tag: '"@[0-9a-z](\.?[0-9a-z])*"'. In this row, user_tag is the key of the regex, whereas '"@[0-9a-z](\.?[0-9a-z])*"' is its value.
At run-time, the package reads regex.yml and compiles a method for each regex; the method is named after the key of the row. For example, at the end of the process you will be able to call the user_tag() method, which matches user tags. Each method has the optional parameter replace_with, which lets you replace the string matched by the regex rule with arbitrary text.
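The mechanism described above can be sketched roughly as follows. This is an illustration only, not the package's actual source: the class name ChainSketch, the helper _make_rule_method, the in-memory RULES dict (standing in for the parsed regex.yml), the stripping of the wrapping double quotes, and the empty-string default for replace_with are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for the parsed contents of regex.yml.
# Values keep the wrapping double quotes, as they appear in the rules file.
RULES = {
    "user_tag": r'"(?<![\w@])@([\w@]+(?:[.!][\w@]+)*)"',
    "email": r'"([^@|\s]+@[^@]+\.[^@|\s]+)"',
    "multiple_space": '"[ ]+"',
}


class ChainSketch:
    """Illustrative class; not the package's LPrePro."""

    def __init__(self):
        self._text = ""

    def set_text(self, text):
        self._text = text
        return self  # returning self is what allows chaining

    def get_text(self):
        return self._text


def _make_rule_method(pattern):
    """Build one chainable method around a compiled regex."""
    compiled = re.compile(pattern)

    def method(self, replace_with=""):  # default value is an assumption
        self._text = compiled.sub(replace_with, self._text)
        return self

    return method


# Attach one method per rule, named after the rule's key.
for key, quoted in RULES.items():
    # Assumption: the real pattern is the text between the double quotes.
    setattr(ChainSketch, key, _make_rule_method(quoted[1:-1]))


result = (
    ChainSketch()
    .set_text("Hey @username, this is my email my@email.com")
    .user_tag(replace_with="[user]")
    .email(replace_with="[email]")
    .get_text()
)
print(result)  # Hey [user], this is my email [email]
```

Because every generated method returns the object itself, rules can be chained in any order, and the result of one substitution becomes the input of the next.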
Package installation
List of Regex
user_tag: '"(?<![\w@])@([\w@]+(?:[.!][\w@]+)*)"'
email: '"([^@|\s]+@[^@]+\.[^@|\s]+)"'
url: '"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})"'
punctuation: '"[-!`?,.\":;]"'
parentheses: '"[\[\]{}()]"'
special_chars: '"[$%^&*_+|~=<>:;\\]"'
ip_address: '"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"'
html_tag: '"^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$"'
tab_new_line: '"(\n|\t|\r)"'
multiple_space: '"[ ]+"'
emoji: '"[^\u1F600-\u1F6FF\s]"'
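To get a feel for what some of these rules match, the patterns can be tried directly with Python's re module. The snippet below is a small sketch that assumes the effective pattern is the text between the double quotes; the sample strings and replacement texts are made up for the example.

```python
import re

# Patterns copied from the list above, without the wrapping quotes.
tab_new_line = re.compile(r"(\n|\t|\r)")
multiple_space = re.compile(r"[ ]+")
parentheses = re.compile(r"[\[\]{}()]")
ip_address = re.compile(
    r"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

no_tabs = tab_new_line.sub(" ", "a\tb\nc")
print(no_tabs)  # a b c

single_spaces = multiple_space.sub(" ", "a    b  c")
print(single_spaces)  # a b c

no_brackets = parentheses.sub("", "f(x) [y] {z}")
print(no_brackets)  # fx y z

# The ip_address pattern ends with "$", so it only matches an address
# located at the very end of the string.
ip_at_end = ip_address.sub("[ip]", "server at 192.168.0.1")
print(ip_at_end)  # server at [ip]

ip_in_middle = ip_address.sub("[ip]", "192.168.0.1 is down")
print(ip_in_middle)  # 192.168.0.1 is down
```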
If you are happy with the list above, you can install the package via pip.
pip install light-text-prepro
How to use
from light_text_prepro.lprepro import LPrePro
...
obj = LPrePro()
...
result = obj.set_text('Hey @username, this is my email my@email.com') \
.user_tag(replace_with='[user]') \
.email(replace_with='[email]') \
.get_text()
# result -> Hey [user], this is my email [email]
Otherwise, if you want to contribute to the package by adding your own regex rule, please follow the section below.
How to add a regex rule
Setup project
$> git clone https://github.com/Arfius/light-text-prepro.git
$> cd light-text-prepro
$> pip install poetry flake8
$> poetry install
Add new regex
- Open light_text_prepro/rules/regex.yml and add a new row. Make sure to use a unique key for the rule. If you have trouble adding the regex rule, use any online regex validation tool and export the regex rule for Python (e.g. https://regex101.com/ => FLAVOR python => Copy to clipboard).
- Add a unit test under the tests folder and make sure all tests pass. Use $> poetry run pytest to run the unit tests.
- Update the section List of Regex at the end of this file.
- Create a Pull Request.
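Before opening a Pull Request, it can help to sanity-check a candidate rule locally. The following is a minimal sketch using only the standard library; the function check_new_rule, the hashtag rule, and the sample strings are hypothetical and not part of the package.

```python
import re


def check_new_rule(key, pattern, existing_keys, positives, negatives):
    """Return a list of problems found with a candidate rule (empty if none)."""
    problems = []
    if key in existing_keys:
        problems.append(f"key '{key}' is already used")
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # A pattern that does not compile cannot be tested any further.
        return problems + [f"pattern does not compile: {exc}"]
    for sample in positives:
        if not compiled.search(sample):
            problems.append(f"expected a match in {sample!r}")
    for sample in negatives:
        if compiled.search(sample):
            problems.append(f"unexpected match in {sample!r}")
    return problems


# Keys already present in regex.yml (only a few shown here).
existing = {"user_tag", "email", "url"}

# Hypothetical new rule: a simple hashtag matcher.
issues = check_new_rule(
    "hashtag", r"#\w+", existing,
    positives=["love #python"],
    negatives=["no tags here"],
)
print(issues)  # []
```

A check like this does not replace the unit tests under the tests folder; it only catches duplicate keys and obviously wrong patterns early.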