A Python library to extract text from various sources for LLM preprocessing.

ParseThisIO

ParseThisIO is a flexible text-extraction tool with no additional OS-level dependencies. It turns raw data into readable, structured text for AI and data processing workflows: extracting information from PDFs, converting files to Markdown, or preparing data for LLM and RAG pipelines. Install it as a pip package and use it right away, with no third-party tools to configure first. Just parseThis.io.

Some parsers require API keys. A key is only needed if you actually use the corresponding parser; a parser raises an error when it is used and its API key cannot be found.
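
A minimal sketch of checking for a key up front instead of waiting for a parser to fail. The ParserMatrix lists OPENAI_API_KEY as the environment variable used by the image, audio, and URL parsers; the helper name `missing_keys` is hypothetical and not part of ParseThisIO.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # OPENAI_API_KEY is the variable named in the ParserMatrix.
    missing = missing_keys(["OPENAI_API_KEY"])
    if missing:
        print("Parsers that need external access will fail; missing:", ", ".join(missing))
```

Passing the environment as a parameter keeps the check easy to test without touching the real process environment.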

ParseThis aggregates multiple open-source projects to avoid re-implementing a file type mapping for content conversion.

Features

  • Auto-detects file types (pdf, docx, csv, pptx, xlsx, xls, json, xml, zip, mp3, mp4 and more).
  • Converts any file into readable Markdown or plain text.
  • Extracts structured data for use in LLM and RAG pipelines.
  • Simple API for seamless integration into your workflows.
  • Forward user input to ParseThis and get back plain text or Markdown.

The mapping of parser to file type can be found in the ParserMatrix.

import parsethisio

# Get the list of supported file extensions
parsethisio.get_supported_extensions()
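
One possible use of that list is to filter file names before parsing. The sketch below is an assumption-laden helper, not part of ParseThisIO: `is_supported` is a hypothetical name, and because the exact format returned by get_supported_extensions() is not documented here, the helper accepts entries with or without a leading dot.

```python
from pathlib import Path
from typing import Iterable


def is_supported(path: str, extensions: Iterable[str]) -> bool:
    """Check a file name against an extension list, ignoring case and leading dots."""
    normalized = {ext.lower().lstrip(".") for ext in extensions}
    suffix = Path(path).suffix.lower().lstrip(".")
    # Files without any extension are never treated as supported.
    return bool(suffix) and suffix in normalized


# Hypothetical usage with the library installed:
#   is_supported("report.PDF", parsethisio.get_supported_extensions())
```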

Prerequisites

Use Python 3.12 in a virtual environment. It is the maximum version supported by PyO3, which is a dependency of scrapegraph-ai.

python3.12 -m venv myenv
source myenv/bin/activate
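
To confirm that the activated environment really runs the recommended interpreter, a small check like the following can help. It is only a sketch: `is_recommended_python` is a hypothetical helper, and it compares against 3.12 solely because that is the version named in the prerequisites.

```python
import sys

RECOMMENDED = (3, 12)  # major.minor version named in the prerequisites


def is_recommended_python(version_info=sys.version_info) -> bool:
    """True when the interpreter's major.minor matches the recommended version."""
    return tuple(version_info[:2]) == RECOMMENDED


if __name__ == "__main__":
    if not is_recommended_python():
        found = ".".join(str(part) for part in sys.version_info[:2])
        print(f"Warning: running Python {found}, but 3.12 is recommended.")
```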

Installation

To install ParseThisIO, use pip:

pip install parsethisio

Usage

Use the parse() function to auto-detect the type of the content. If auto-detection does not work, you can provide more information to help detect the type. parse() accepts a file path, a URL string, or raw file bytes.

import parsethisio
from parsethisio import ResultFormat

# Extract an image description for an LLM
with open('tests/fixtures/test_data_diagram.png', 'rb') as f:
    image_description = parsethisio.parse(f.read(), result_format=ResultFormat.TXT)

# Get the transcript of an audio file
with open('tests/fixtures/test_data_ttsmaker-test-generated-file.mp3', 'rb') as f:
    audio_transcript = parsethisio.parse(f.read(), result_format=ResultFormat.TXT)

The generic parse() function automatically selects which parser to use based on the input.

import parsethisio

from parsethisio import ResultFormat


#automatic parse based on file_path
parsed_pdf_text = parsethisio.parse('tests/fixtures/text_data_meeting_notes.pdf', result_format=ResultFormat.TXT)

#automatic parse based on file content
with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    parsed_pdf_text = parsethisio.parse(f.read(), result_format=ResultFormat.TXT)  # works with any bytes content

#automatic parse based on string
parsed_github_repository = parsethisio.parse('https://github.com/jdde/ParseThis', result_format=ResultFormat.TXT)

#automatic parse based on YouTube URL
transcribed_youtube_text = parsethisio.parse('https://www.youtube.com/watch?v=ca7QkcAGe', result_format=ResultFormat.TXT)
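
The three input kinds above (bytes, URL string, file path) can be told apart with a few standard-library checks. The sketch below illustrates the idea only; it is not ParseThisIO's own detection logic, and `classify_input` is a hypothetical name.

```python
from urllib.parse import urlparse


def classify_input(content) -> str:
    """Return 'bytes', 'url', or 'path' for the input kinds parse() is described as accepting."""
    if isinstance(content, (bytes, bytearray)):
        return "bytes"
    if isinstance(content, str):
        parsed = urlparse(content)
        # Treat only http(s) strings with a host as URLs; everything else is a path.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return "url"
        return "path"
    raise TypeError(f"unsupported input type: {type(content).__name__}")
```

Requiring both a scheme and a host avoids misreading Windows paths such as C:\data\file.pdf as URLs.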

Use parser detection when you only want to find the matching parser, for example to configure it differently before it parses the content.

import parsethisio

with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    file_content = f.read()
    parser = parsethisio.get_parser(file_content)
    text = parser.parse(file_content)
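
Content-based detection of this kind is commonly done by comparing the first bytes of a file against known signatures. The following sketch shows that general technique under stated assumptions: it is not the implementation behind get_parser(), the names `SIGNATURES` and `sniff_type` are hypothetical, and only a handful of well-known signatures are included.

```python
from typing import Optional

# Well-known leading byte sequences ("magic numbers") for a few formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),  # also the container format of docx, pptx, and xlsx
    (b"ID3", "mp3"),         # MP3 files that begin with an ID3v2 tag
]


def sniff_type(data: bytes) -> Optional[str]:
    """Return a coarse type label for `data`, or None when no signature matches."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return None
```

Because Office documents are ZIP containers, a real detector would need to inspect the archive contents to tell docx, pptx, and xlsx apart.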

Or just directly use a parser.

from parsethisio import PDFParser

with open('tests/fixtures/text_data_meeting_notes.pdf', 'rb') as f:
    file_content = f.read()
    text = PDFParser.parse(file_content)

For more usage examples, see the Testing section.


ParserMatrix

Overview of dependencies used for specific parsing processes.

| File Type | Parser | Dependency | External Access Required |
|-----------|--------|------------|--------------------------|
| PDF | PDFParser | PyPDF2, Markitdown | |
| Image | ImageParser | OpenAI GPT | ✅ env.OPENAI_API_KEY |
| Audio | AudioParser | OpenAI Whisper | ✅ env.OPENAI_API_KEY |
| URL | TextParser | scrapegraphai | ✅ env.OPENAI_API_KEY |
| YouTube | TextParser | youtube-transcript-api | |
| Github | TextParser | gitingest | |
| DOCX | OfficeParser | Markitdown | |
| PPTX | OfficeParser | Markitdown | |
| XLSX/XLS | OfficeParser | Markitdown | |
| CSV | DataParser | Markitdown | |
| JSON | DataParser | Markitdown | |
| XML | DataParser | Markitdown | |
| ZIP | ArchiveParser | Markitdown | |
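
For scripting, the matrix can be mirrored as plain data, for instance to list which file types need an API key. This is a sketch that copies the rows of the table by hand; `MatrixRow`, `PARSER_MATRIX`, and `types_needing` are hypothetical names and are not exported by ParseThisIO.

```python
from typing import Dict, List, NamedTuple, Optional


class MatrixRow(NamedTuple):
    parser: str
    dependency: str
    env_var: Optional[str]  # None when the matrix lists no external access


# Hand-copied from the ParserMatrix; update it if the matrix changes.
PARSER_MATRIX: Dict[str, MatrixRow] = {
    "PDF": MatrixRow("PDFParser", "PyPDF2, Markitdown", None),
    "Image": MatrixRow("ImageParser", "OpenAI GPT", "OPENAI_API_KEY"),
    "Audio": MatrixRow("AudioParser", "OpenAI Whisper", "OPENAI_API_KEY"),
    "URL": MatrixRow("TextParser", "scrapegraphai", "OPENAI_API_KEY"),
    "YouTube": MatrixRow("TextParser", "youtube-transcript-api", None),
    "Github": MatrixRow("TextParser", "gitingest", None),
    "DOCX": MatrixRow("OfficeParser", "Markitdown", None),
    "PPTX": MatrixRow("OfficeParser", "Markitdown", None),
    "XLSX/XLS": MatrixRow("OfficeParser", "Markitdown", None),
    "CSV": MatrixRow("DataParser", "Markitdown", None),
    "JSON": MatrixRow("DataParser", "Markitdown", None),
    "XML": MatrixRow("DataParser", "Markitdown", None),
    "ZIP": MatrixRow("ArchiveParser", "Markitdown", None),
}


def types_needing(env_var: Optional[str]) -> List[str]:
    """Return the file types whose row lists `env_var`, sorted alphabetically."""
    return sorted(ft for ft, row in PARSER_MATRIX.items() if row.env_var == env_var)
```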

If you're working with the source code, you can install all dependencies using:

pip install .

For more information, see how we install it in our GitHub Action.

Testing

To run the tests:

coverage run -m pytest
# or run a single test:
pytest -k test_text_parser_github_url

License

This project is licensed under the GNU Affero General Public License v3.0 - see the LICENSE file for details.
