AI-native extractor, powered by multimodal LLMs.
Extract clean markdown from PDFs, URLs, slides, videos, and more, ready for any LLM. ⚡
thepi.pe is a package that can scrape clean markdown and extract structured data from tricky sources, like PDFs. It uses vision-language models (VLMs) under the hood, and works out-of-the-box with any LLM, VLM, or vector database. It can be used right away on a hosted cloud, or it can be run locally.
Features 🌟
- Scrape clean markdown, tables, and images from any document or webpage
- Works out-of-the-box with LLMs, vector databases, and RAG frameworks
- AI-native filetype detection, layout analysis, and structured data extraction
- Accepts a wide range of sources, including Word docs, Powerpoints, Python notebooks, GitHub repos, videos, audio, and more
Get started in 5 minutes 🚀
thepi.pe can read a wide range of filetypes and web sources, so it requires a few dependencies. It also requires vision-language model inference for AI extraction features. For these reasons, we host an API that works out-of-the-box. For more detailed setup instructions, view the docs.
pip install thepipe-api
Hosted API (Python)
You can get an API key by signing up for a free account at thepi.pe. It is completely free to try out. Then, simply set the THEPIPE_API_KEY environment variable to your API key.
from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages
from openai import OpenAI
# scrape clean markdown
chunks = scrape_file(filepath="paper.pdf", ai_extraction=False)
# call LLM with scraped chunks
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=chunks_to_messages(chunks),
)
Local Installation (Python)
For a local installation, you can use the following command:
pip install thepipe-api[local]
You must have a local LLM server set up and running for AI extraction features. You can use any local LLM server that follows the OpenAI format (such as LiteLLM) or a provider (such as OpenRouter or OpenAI). Next, set the LLM_SERVER_BASE_URL environment variable to your LLM server's endpoint URL and set LLM_SERVER_API_KEY to the corresponding API key. The DEFAULT_AI_MODEL environment variable can be set to your VLM of choice. For example, you would use openai/gpt-4o-mini if using OpenRouter or gpt-4o-mini if using OpenAI.
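A minimal sketch of gathering these three settings before a local run is shown below. The helper name `local_llm_settings`, the placeholder URL and key, and the fallback model are assumptions made for illustration; they are not part of thepi.pe.

```python
import os

REQUIRED = ("LLM_SERVER_BASE_URL", "LLM_SERVER_API_KEY")

def local_llm_settings(env=None):
    """Collect the environment variables described above into a dict.

    Hypothetical helper for illustration only; not part of thepipe.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "base_url": env["LLM_SERVER_BASE_URL"],
        "api_key": env["LLM_SERVER_API_KEY"],
        # The fallback model name is an assumption, used only when
        # DEFAULT_AI_MODEL is unset.
        "model": env.get("DEFAULT_AI_MODEL", "gpt-4o-mini"),
    }

# Placeholder values; substitute your own server URL and key.
settings = local_llm_settings({
    "LLM_SERVER_BASE_URL": "http://localhost:4000",
    "LLM_SERVER_API_KEY": "sk-placeholder",
    "DEFAULT_AI_MODEL": "openai/gpt-4o-mini",
})
print(settings["model"])
```

Failing early on a missing variable makes a misconfigured server easier to spot than an error raised in the middle of an extraction.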
For full functionality with media-rich sources, you will need to install the following dependencies:
apt-get update && apt-get install -y git ffmpeg tesseract-ocr
python -m playwright install --with-deps chromium
When using thepi.pe locally, be sure to append local=True to your function calls:
from thepipe.scraper import scrape_url
chunks = scrape_url(url="https://example.com", local=True)
You can also use thepi.pe from the command line:
thepipe path/to/folder --include_regex '.*\.tsx' --local
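To preview which files a pattern like the one above would select, you can try the regex with Python's `re` module. This sketch assumes the pattern is searched against each file path; how the CLI applies it internally is not stated here, and the sample paths are made up.

```python
import re

# Same pattern as in the command above.
pattern = re.compile(r".*\.tsx")

# Made-up paths for illustration.
paths = ["src/App.tsx", "src/index.ts", "README.md", "components/Button.tsx"]

selected = [path for path in paths if pattern.search(path)]
print(selected)
```

Without a trailing `$`, the pattern also matches names such as `App.tsx.bak`; anchor it if you only want paths that end in `.tsx`.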
Supported File Types 📚
Source | Input types | Multimodal | Notes |
---|---|---|---|
Webpage | URLs starting with http, https, ftp | ✔️ | Scrapes markdown, images, and tables from web pages. ai_extraction available for AI content extraction from the webpage's screenshot |
PDF | .pdf | ✔️ | Extracts page markdown and page images. ai_extraction available for AI layout analysis |
Word Document | .docx | ✔️ | Extracts text, tables, and images |
PowerPoint | .pptx | ✔️ | Extracts text and images from slides |
Video | .mp4, .mov, .wmv | ✔️ | Uses Whisper for transcription and extracts frames |
Audio | .mp3, .wav | ✔️ | Uses Whisper for transcription |
Jupyter Notebook | .ipynb | ✔️ | Extracts markdown, code, outputs, and images |
Spreadsheet | .csv, .xls, .xlsx | ❌ | Converts each row to JSON format, including row index for each |
Plaintext | .txt, .md, .rtf, etc. | ❌ | Simple text extraction |
Image | .jpg, .jpeg, .png | ✔️ | Uses pytesseract for OCR in text-only mode |
ZIP File | .zip | ✔️ | Extracts and processes contained files |
Directory | any path/to/folder | ✔️ | Recursively processes all files in directory |
YouTube Video (known issues) | YouTube video URLs starting with https://youtube.com or https://www.youtube.com | ✔️ | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your pytube installation to send a valid user agent header (see this issue). |
Tweet | URLs starting with https://twitter.com or https://x.com | ✔️ | Uses unofficial API, may break unexpectedly |
GitHub Repository | GitHub repo URLs starting with https://github.com or https://www.github.com | ✔️ | Requires GITHUB_TOKEN environment variable |
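As an illustration of the Spreadsheet row in the table, the sketch below converts each CSV row to a JSON string that carries its row index. It uses only the standard library and is not thepi.pe's actual implementation; the function name `rows_to_json` and the `row_index` key are assumptions.

```python
import csv
import io
import json

def rows_to_json(csv_text):
    """Return one JSON string per CSV row, each including its row index.

    Illustrative only; the key name "row_index" is an assumption.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        json.dumps({"row_index": index, **row})
        for index, row in enumerate(reader)
    ]

sample = "name,score\nada,90\nlin,85\n"
json_rows = rows_to_json(sample)
for line in json_rows:
    print(line)
```

Keeping the row index inside each record lets a retrieved row be traced back to its position in the original sheet.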
How it works 🛠️
thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with language models or vision transformers. The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format that is compatible with any LLM or multimodal model with thepipe.core.chunks_to_messages, which gives the following format:
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
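For reference, a message in this shape can also be assembled by hand. The sketch below is not how `chunks_to_messages` is implemented; the helper name `build_message` and the placeholder bytes are assumptions used only to show how the text part and the base64 data URL fit together.

```python
import base64

def build_message(text, jpeg_bytes):
    """Build one user message with a text part and an inline JPEG part.

    Hypothetical helper; it only mirrors the format shown above.
    """
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64," + encoded},
            },
        ],
    }

# Placeholder bytes stand in for real JPEG data.
messages = [build_message("Page 1 text", b"\xff\xd8\xff placeholder")]
print(messages[0]["content"][1]["image_url"]["url"][:23])
```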
You can feed these messages directly into the model, or alternatively you can use chunker.chunk_by_document, chunker.chunk_by_page, chunker.chunk_by_section, or chunker.chunk_semantic to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to a LlamaIndex Document/ImageDocument with .to_llamaindex.
⚠️ It is important to be mindful of your model's token limit. GPT-4o does not work with too many images in the prompt (see discussion here). To remedy this issue, either use an LLM with a larger context window, extract larger documents with text_only=True, or embed the chunks into a vector database.
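If the messages have already been built, one fallback is to count the image parts and drop them when there are too many. The sketch below works on the message format shown earlier; the helper names and the `MAX_IMAGES` budget are hypothetical, and the right limit depends on your model.

```python
def count_images(messages):
    """Count image_url parts across all messages."""
    return sum(
        1
        for message in messages
        for part in message["content"]
        if isinstance(part, dict) and part.get("type") == "image_url"
    )

def drop_images(messages):
    """Return copies of the messages with every image_url part removed."""
    stripped = []
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            content = [part for part in content if part.get("type") != "image_url"]
        stripped.append({**message, "content": content})
    return stripped

MAX_IMAGES = 1  # hypothetical budget; tune for your model

sample_messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "..."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
    ],
}]

if count_images(sample_messages) > MAX_IMAGES:
    sample_messages = drop_images(sample_messages)
print(count_images(sample_messages))
```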
Sponsors
Thank you to Cal.com for sponsoring this project.