Bridges the gap between OCR and NLP: converts the text in any image into ready-to-use NLP data structures, so you can collect data from real-world sources.

Project description

pic2prose

A package that takes in images, builds a corpus from the text they contain, and produces NLP data structures for direct use in experimentation and model training.

Take any image containing text and use p2p to generate NLP data structures ready for fine-tuning LLMs, generating embeddings, sentiment classification, and more.

Installation

pip install pic2prose

Open up your favorite editor, import the package, and build a robust corpus.

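# import only the class used below instead of a wildcard import
from pic2prose.structures import Corp

# initialize the corpus object from an image containing text
# (this may take longer if you're not using a GPU)
corpus = Corp(image_path="ex1.png")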

# generate co-occurrence matrix
corpus.get_co_occurrence_matrix()

# generate tf-idf matrix
corpus.get_tfidf_matrix()

# one-hot encodings
corpus.one_hot_encode()
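
For readers unfamiliar with the three structures above, here is a minimal pure-Python sketch of what each one typically contains. It is not pic2prose's implementation: the function names, the whitespace tokenizer, the window size of 1, and the unsmoothed tf-idf formula are all assumptions made for illustration, and the package's actual output format may differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(docs):
    """Sorted list of every distinct token across all documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def co_occurrence_matrix(docs, window=1):
    """Count how often each word pair appears within `window` tokens."""
    vocab = build_vocab(docs)
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for doc in docs:
        tokens = tokenize(doc)
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    matrix[index[word]][index[tokens[j]]] += 1
    return vocab, matrix


def tfidf_matrix(docs):
    """One row per document, one column per vocabulary word.

    Uses tf = count / document length and idf = log(N / df), no smoothing.
    """
    tokenized = [tokenize(doc) for doc in docs]
    vocab = build_vocab(docs)
    n_docs = len(tokenized)
    # document frequency: number of documents containing each word
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            (counts[w] / len(toks)) * math.log(n_docs / df[w]) if toks else 0.0
            for w in vocab
        ])
    return vocab, rows


def one_hot_encode(docs):
    """Map each vocabulary word to a vector with a single 1."""
    vocab = build_vocab(docs)
    return {
        word: [1 if i == j else 0 for j in range(len(vocab))]
        for i, word in enumerate(vocab)
    }


if __name__ == "__main__":
    # stand-in for text that OCR might extract from two images
    docs = ["the cat sat", "the dog sat"]
    vocab, cooc = co_occurrence_matrix(docs)
    print(vocab)
    print(cooc)
    print(tfidf_matrix(docs)[1])
    print(one_hot_encode(docs))
```

With this formula, a word that appears in every document gets a tf-idf weight of zero, which is why common words contribute little to the resulting vectors.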

Coming Soon

Support for building corpora from URLs

Download files

Source Distribution

pic2prose-0.0.2.tar.gz (4.0 kB)

Uploaded Source

Built Distribution

pic2prose-0.0.2-py3-none-any.whl (4.2 kB)

Uploaded Python 3
