Bridges the gap between OCR and NLP: converts the text in any image into ready-to-use NLP data structures, enabling real-world data collection.
pic2prose
A package that takes in images, builds a corpus, and produces NLP data structures for direct use in experimentation and model training.
Take any image containing text and use p2p to generate NLP data structures ready for fine-tuning LLMs, generating embeddings, sentiment classification, etc.
Installation
pip install pic2prose
Open up your favorite editor, import, and build a robust corpus.
from pic2prose.structures import Corp
# initialize the object
# may take longer if you're not using a GPU
corpus = Corp(image_path="ex1.png")
# generate co-occurrence matrix
corpus.get_co_occurrence_matrix()
# generate tf-idf matrix
corpus.get_tfidf_matrix()
# one-hot encodings
corpus.one_hot_encode()
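For readers new to these three structures, here is a minimal pure-Python sketch of what a co-occurrence matrix, a tf-idf matrix, and one-hot encodings conventionally contain. It is not pic2prose's implementation, which is not shown here; the helper names (`tokenize`, `build_vocab`, `co_occurrence_matrix`, `tfidf_matrix`, `one_hot_encode`), the window size, and the sample lines standing in for OCR output are all assumptions made for illustration. The tf-idf variant is the plain unsmoothed form, tf * log(N / df); libraries often use smoothed variants, so values may differ from what p2p returns.

```python
import math
from collections import Counter

PUNCT = ".,!?;:\"'()"

def tokenize(text):
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    tokens = (word.strip(PUNCT).lower() for word in text.split())
    return [tok for tok in tokens if tok]

def build_vocab(docs):
    # Sorted list of the unique tokens across all documents.
    return sorted({tok for doc in docs for tok in doc})

def co_occurrence_matrix(docs, vocab, window=1):
    # matrix[i][j] counts how often vocab[j] appears within `window`
    # positions of vocab[i] inside the same document.
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        for i, word in enumerate(doc):
            start = max(0, i - window)
            stop = min(len(doc), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    matrix[index[word]][index[doc[j]]] += 1
    return matrix

def tfidf_matrix(docs, vocab):
    # One row per document, one column per vocabulary word.
    # tf = count / document length, idf = log(N / document frequency).
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    rows = []
    for doc in docs:
        if not doc:
            rows.append([0.0] * len(vocab))
            continue
        counts = Counter(doc)
        rows.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocab
        ])
    return rows

def one_hot_encode(vocab):
    # Map each word to a vector with a single 1 at the word's index.
    return {
        word: [1 if i == j else 0 for j in range(len(vocab))]
        for i, word in enumerate(vocab)
    }

# Hypothetical lines of text, standing in for what OCR might extract.
lines = ["The cat sat.", "The dog sat down."]
docs = [tokenize(line) for line in lines]
vocab = build_vocab(docs)

cooc = co_occurrence_matrix(docs, vocab, window=1)
tfidf = tfidf_matrix(docs, vocab)
one_hot = one_hot_encode(vocab)
```

Each structure is indexed by the same sorted vocabulary, so row and column positions line up across all three.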
Coming Soon
Support for building corpora from a URL