A Python library to extract text from various sources for LLM preprocessing.

These details have not been verified by PyPI

Project description

ParseThis


ParseThis is a powerful, flexible tool with zero additional OS dependencies that makes raw data readable and structured for your AI and data processing workflows. Whether you're extracting information from PDFs, transforming files into Markdown, or preparing data for LLMs and RAG pipelines, ParseThis gets the job done quickly and effectively. Install it as a pip package and start using it; no third-party tools need to be configured first. Just parseThis.

Some parsers require an API key. You do not need a key for parsers you don't use; a parser that needs one raises an error on use if no API key is found.
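As a minimal sketch of checking up front whether the key-dependent parsers can be used, the helper below looks for `OPENAI_API_KEY` (the variable named in the ParserMatrix) in the environment. The function name `has_openai_key` is hypothetical and not part of ParseThis.

```python
import os


def has_openai_key(environ=None):
    """Return True if a non-empty OPENAI_API_KEY is set.

    Hypothetical helper for illustration; not part of ParseThis.
    """
    env = os.environ if environ is None else environ
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_openai_key():
        print("OPENAI_API_KEY found: key-dependent parsers can be attempted.")
    else:
        print("OPENAI_API_KEY missing: only parsers without external access will work.")
```

Passing a plain dictionary instead of the real environment makes the check easy to exercise in tests.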

Features
Prerequisites
Installation
Usage
License
ParserMatrix - Dependency overview

Features

Auto-detects file types (PDF, DOCX, CSV, and more).
Converts files into readable Markdown or plain text.
Extracts structured data for use in LLM and RAG pipelines.
Simple API for seamless integration into your workflows.

The mapping of parser to file type can be found in the ParserMatrix.

Prerequisites

Use Python 3.12, the maximum version supported by PyO3 (a dependency of scrapegraph-ai). Create a virtual environment with Python 3.12:

python3.12 -m venv myenv
source myenv/bin/activate
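To make the 3.12 ceiling explicit in your own scripts, one option is a small interpreter check like the sketch below. The function name `is_supported_python` and the default `maximum` are assumptions for illustration; ParseThis itself may not perform this check.

```python
import sys


def is_supported_python(version_info=None, maximum=(3, 12)):
    """Return True if the interpreter's (major, minor) does not exceed `maximum`.

    Hypothetical helper; the (3, 12) ceiling is taken from the prerequisites above.
    """
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) <= tuple(maximum)


if __name__ == "__main__":
    if not is_supported_python():
        print("Python newer than 3.12 detected; use a 3.12 virtual environment.")
```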

Installation

To install ParseThis, use pip:

pip install parsethisio

For more information, see how we install it in our GitHub Action.

Usage

Use the parse() function to auto-detect the content type. If auto-detection fails, you can provide more information to help identify the type. parse() accepts any input: a file path, a URL string, or raw file bytes.

import parsethis
from parsethis import ResultFormat

# Extract an image description for an LLM
with open('tests/fixtures/test_data_diagram.png', 'rb') as f:
    image_description = parsethis.parse(f.read(), result_format=ResultFormat.TXT)

# Get the transcript of an audio file
with open('tests/fixtures/test_data_ttsmaker-test-generated-file.mp3', 'rb') as f:
    audio_transcript = parsethis.parse(f.read(), result_format=ResultFormat.TXT)

The generic parse() function detects automatically which parsers will be used based on the file content.

import parsethis

from parsethis import ResultFormat


#automatic parse based on file_path
parsed_pdf_text = parsethis.parse('tests/fixtures/text_data_meeting_notes.pdf', result_format=ResultFormat.TXT)

#automatic parse based on file content
with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    parsed_pdf_text = parsethis.parse(f.read(), result_format=ResultFormat.TXT)  # works with any bytes content

#automatic parse based on string
parsed_github_repository = parsethis.parse('https://github.com/jdde/ParseThis', result_format=ResultFormat.TXT)

#automatic parse based on YouTube URL
transcribed_youtube_text = parsethis.parse('https://www.youtube.com/watch?v=ca7QkcAGe', result_format=ResultFormat.TXT)
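To give a feel for what content-based detection involves, here is a rough sketch that classifies an input as a URL, a file path, or a known byte signature. It is an illustration only and not how ParseThis detects types internally; the function name `classify_input` and the returned labels are hypothetical.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
# DOCX, PPTX and XLSX are ZIP containers, so they share the ZIP signature.
# MP3 files are only matched here when they begin with an ID3v2 tag.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"ID3", "mp3"),
    (b"PK\x03\x04", "zip"),
]


def classify_input(source):
    """Return a coarse label for a str or bytes input (hypothetical helper)."""
    if isinstance(source, str):
        if source.startswith(("http://", "https://")):
            return "url"
        return "file_path"
    if isinstance(source, (bytes, bytearray)):
        for magic, label in SIGNATURES:
            if bytes(source).startswith(magic):
                return label
        return "unknown"
    raise TypeError("expected str or bytes")
```

A real detector would need more signatures and a fallback for text formats such as CSV, JSON, and XML, which have no fixed magic bytes.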

Use parser detection when you only want to find the matching parser and configure it before it parses the content.

import parsethis

with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    file_content = f.read()
    parser = parsethis.get_parser(file_content)
    text = parser.parse(file_content)

Or use any parser directly.

from parsethis import PDFParser

with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    file_content = f.read()  # raw PDF bytes
    text = PDFParser.parse(file_content)

For more usage examples, see the Testing section.

ParserMatrix

Overview of dependencies used for specific parsing processes.

File Type	Parser	Dependency	External Access Required
PDF	PDFParser	PyPDF2, Markitdown	❌
Image	ImageParser	OpenAI GPT	✅ env.OPENAI_API_KEY
Audio	AudioParser	OpenAI Whisper	✅ env.OPENAI_API_KEY
URL	TextParser	scrapegraphai	✅ env.OPENAI_API_KEY
YouTube	TextParser	youtube-transcript-api	❌
GitHub	TextParser	gitingest	❌
DOCX	OfficeParser	Markitdown	❌
PPTX	OfficeParser	Markitdown	❌
XLSX/XLS	OfficeParser	Markitdown	❌
CSV	DataParser	Markitdown	❌
JSON	DataParser	Markitdown	❌
XML	DataParser	Markitdown	❌
ZIP	ArchiveParser	Markitdown	❌
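The "External Access Required" column of the matrix above can be encoded as a simple lookup, for example to decide which inputs to accept when no API key is configured. This is a sketch built from the table; the names `REQUIRES_OPENAI_KEY` and `usable_file_types` are hypothetical and not part of ParseThis.

```python
# File type -> whether the ParserMatrix lists env.OPENAI_API_KEY as required.
REQUIRES_OPENAI_KEY = {
    "PDF": False,
    "Image": True,
    "Audio": True,
    "URL": True,
    "YouTube": False,
    "GitHub": False,
    "DOCX": False,
    "PPTX": False,
    "XLSX/XLS": False,
    "CSV": False,
    "JSON": False,
    "XML": False,
    "ZIP": False,
}


def usable_file_types(has_key: bool) -> list[str]:
    """List the file types that can be parsed with or without an API key."""
    return [
        file_type
        for file_type, needs_key in REQUIRES_OPENAI_KEY.items()
        if has_key or not needs_key
    ]
```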

If you're working with the source code, you can install all dependencies using:

pip install -r requirements.txt

Testing

To run the tests:

coverage run -m pytest
#or for a single test:
pytest -k test_text_parser_github_url

License

This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.

Project details


Release history

0.2.3

Jul 24, 2025

0.2.2

Mar 31, 2025

0.2.1

Mar 31, 2025

0.2.0

Mar 30, 2025

0.1.9

Mar 30, 2025

0.1.8

Mar 30, 2025

0.1.7

Mar 30, 2025

0.1.5

Mar 30, 2025

0.1.4

Mar 30, 2025

0.1.1

Mar 2, 2025

This version

0.1.0

Feb 24, 2025

Download files


Source Distribution

parsethisio-0.1.0.tar.gz (28.1 kB view details)

Uploaded Feb 24, 2025 Source

Built Distribution


parsethisio-0.1.0-py3-none-any.whl (23.7 kB view details)

Uploaded Feb 24, 2025 Python 3

File details

Details for the file parsethisio-0.1.0.tar.gz.

File metadata

Download URL: parsethisio-0.1.0.tar.gz
Upload date: Feb 24, 2025
Size: 28.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for parsethisio-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`2f853031a16198c2f37b5a57e9407fc7ffe1552a00dd9902d9d5bcfc9135d7d9`
MD5	`68452760acc0a5dd8c0067248889e3bd`
BLAKE2b-256	`e5f3eedb9f63b67b1078b6c3252a0092dceb9daf8c6b9d5d53c55bdbf2be79a1`
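A downloaded file can be checked against the SHA256 digest listed above with the standard library's `hashlib`. The sketch below assumes the archive has already been downloaded locally; the helper names `sha256_of_file` and `matches_expected` are hypothetical.

```python
import hashlib

# SHA256 digest published above for parsethisio-0.1.0.tar.gz.
EXPECTED_SDIST_SHA256 = "2f853031a16198c2f37b5a57e9407fc7ffe1552a00dd9902d9d5bcfc9135d7d9"


def sha256_of_file(path, chunk_size=65536):
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SDIST_SHA256):
    """Return True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("parsethisio-0.1.0.tar.gz")` should return True for an intact copy of the source distribution.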


File details

Details for the file parsethisio-0.1.0-py3-none-any.whl.

File metadata

Download URL: parsethisio-0.1.0-py3-none-any.whl
Upload date: Feb 24, 2025
Size: 23.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for parsethisio-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3d516ea319041b1e061e600596caa00b6ff29449a9ef894d39d1618fe37de3ae`
MD5	`11479d3a9bb7c999c1f4268bcb129abd`
BLAKE2b-256	`a55a375b1a8569d52c3b20c97f4cfa40750741a4ca4a03d37d9576503de6b814`


parsethisio 0.1.0
