Powerful Python tool for modern NLP processing
Project description
MordinezNLP
Useful toolkit for NLP projects
MordinezNLP provides tools to download data from the web, CommonCrawl, and Elasticsearch using multiprocessing and custom file-processing functions.
MordinezNLP is a powerful tool for cleaning up dirty text so that it can be used in neural networks with better performance.
Use MordinezNLP to extract text from PDFs (omitting tables) and from HTML documents.
MordinezNLP is built on top of spaCy and Stanza.
Quick tour
Text cleaning and POS tagging
from MordinezNLP.processors import BasicProcessor
from MordinezNLP.pipelines import PartOfSpeech
from MordinezNLP.tokenizers import spacy_tokenizer
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = spacy_tokenizer(nlp)

bp = BasicProcessor()
post_process = bp.process("this is my text to process by a funcion", language='en')

pos_tagger = PartOfSpeech(
    nlp,
    'en'
)

pos_output = pos_tagger.process(
    [post_process],
    4,
    30,
)
CommonCrawl downloader
from MordinezNLP.downloaders import CommonCrawlDownloader
ccd = CommonCrawlDownloader(
    [
        "reddit.com/r/space/*",
        "reddit.com/r/spacex/*",
    ]
)
ccd.download('./test_data')
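The patterns passed to CommonCrawlDownloader resemble the URL wildcards accepted by the public CommonCrawl index server. As a rough illustration of what such a pattern selects (not a description of how MordinezNLP works internally), the sketch below builds an index query URL for each pattern without sending any request. The helper name `build_index_query` and the crawl id `CC-MAIN-2023-50` are assumptions chosen for this example.

```python
from urllib.parse import urlencode

# Base address of the public CommonCrawl index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(url_pattern: str, crawl_id: str) -> str:
    """Return the index query URL for one wildcard pattern.

    Hypothetical helper for illustration; it is not part of MordinezNLP.
    """
    # urlencode percent-escapes "/" and "*" so the pattern survives as one value.
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


# The crawl id is only an example; any id listed by the index server could be used.
for pattern in ["reddit.com/r/space/*", "reddit.com/r/spacex/*"]:
    print(build_index_query(pattern, "CC-MAIN-2023-50"))
```

Fetching such a URL should return one JSON record per captured page, which is the kind of listing a downloader would then resolve to archived content.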
PDF parser
from io import BytesIO
from MordinezNLP.parsers import process_pdf
with open("my_pdf_doc.pdf", "rb") as f:
    pdf = BytesIO(f.read())
    output = process_pdf(pdf)
    print(output)
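To run the parser over a whole folder rather than a single file, something like the following sketch may help. It assumes only what the example above shows, namely that the parser accepts a BytesIO object. The parser is passed in as an argument, and the helper name `parse_pdf_folder` is hypothetical.

```python
from io import BytesIO
from pathlib import Path
from typing import Callable, Dict


def parse_pdf_folder(folder: str, parse: Callable[[BytesIO], object]) -> Dict[str, object]:
    """Apply `parse` to every *.pdf file in `folder`, keyed by file name.

    Hypothetical helper for illustration; it is not part of MordinezNLP.
    """
    results = {}
    for path in sorted(Path(folder).glob("*.pdf")):
        # Load the file into memory, mirroring the BytesIO usage shown above.
        results[path.name] = parse(BytesIO(path.read_bytes()))
    return results
```

With MordinezNLP installed, a call such as `parse_pdf_folder("./pdfs", process_pdf)` would be the intended usage, where `./pdfs` is a placeholder directory.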
Installation
With pip
pip install MordinezNLP
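After installing, one way to confirm that the package is visible to the current interpreter is to query its metadata with the standard library. This is a minimal sketch; the helper name `installed_version` is an assumption made for the example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version("MordinezNLP"))
```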
URLs
Download files
Source Distributions
No source distribution files available for this release.
Built Distribution
File details
Details for the file MordinezNLP-0.1.0-py3-none-any.whl.
File metadata
- Download URL: MordinezNLP-0.1.0-py3-none-any.whl
- Upload date:
- Size: 30.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.25.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.56.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | d2ea7049bb863397cc156423243ced29b599b74ef6255aeeb7f3eea69fa1a896
MD5 | dbb13cd0a10bd1f3701325a8495a0e64
BLAKE2b-256 | 41245f5c394c68bf63bf0a30656be67845d09e5166b6f309cb8f793496ad4a5f
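The published digests can be used to check a downloaded wheel before installing it. The sketch below computes the SHA256 digest of a local file with the standard library and compares it with the value listed above. The helper names and the assumption that the wheel sits in the current directory are illustrative.

```python
import hashlib

# SHA256 digest listed above for MordinezNLP-0.1.0-py3-none-any.whl.
EXPECTED_SHA256 = "d2ea7049bb863397cc156423243ced29b599b74ef6255aeeb7f3eea69fa1a896"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with `expected`, ignoring hex letter case."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("MordinezNLP-0.1.0-py3-none-any.whl")` should return True for an unmodified copy of the wheel.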