AI-native extractor, powered by multimodal LLMs.

Project description

Extract markdown and visuals from PDFs, URLs, slides, videos, and more, ready for multimodal LLMs. ⚡

thepi.pe is an API that can scrape multimodal data via thepipe.scrape or extract structured data via thepipe.extract from a wide range of sources. It is built to interface with vision-language models such as GPT-4o, and works out-of-the-box with any LLM or vector database. It can be used right away with a hosted cloud, or it can be run locally.

Features 🌟

  • Extract markdown, tables, and images from any document or webpage
  • Extract complex structured data from any document or webpage
  • Works out-of-the-box with LLMs, vector databases, and RAG frameworks
  • AI-native filetype detection, layout analysis, and structured data extraction
  • Multimodal scraping for video, audio, and image sources

Get started in 5 minutes 🚀

thepi.pe can read a wide range of filetypes and web sources, so it requires a few dependencies. It also requires vision-language model inference for AI extraction features. For these reasons, we host an API that works out-of-the-box. For more detailed setup instructions, view the docs.

pip install thepipe-api

Hosted API (Python)

You can get an API key by signing up for a free account at thepi.pe. Then, simply set the THEPIPE_API_KEY environment variable to your API key.

from thepipe.scraper import scrape_file
from thepipe.core import chunks_to_messages
from openai import OpenAI

# scrape clean markdown
chunks = scrape_file(filepath="paper.pdf", ai_extraction=False)

# call LLM with scraped chunks
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=chunks_to_messages(chunks),
)
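Since the hosted API takes its key from the THEPIPE_API_KEY environment variable, a missing variable is an easy mistake to make. The following is a minimal pre-flight check, not part of thepi.pe itself: require_api_key is a hypothetical helper name, and the error message is only illustrative.

```python
import os


def require_api_key(env=None, name="THEPIPE_API_KEY"):
    """Return the API key found in the environment, or raise a clear error.

    Hypothetical helper for illustration; thepi.pe does not ship this function.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before calling the hosted API")
    return key
```

Calling require_api_key() once at startup fails fast with a readable message instead of surfacing as an authentication error later.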

Local Installation (Python)

For a local installation, you can use the following command:

pip install "thepipe-api[local]"

You must have a local LLM server set up and running for AI extraction features. You can use any local LLM server that follows the OpenAI format (such as LiteLLM or OpenRouter). Next, set the LLM_SERVER_BASE_URL environment variable to your LLM server's endpoint URL and set LLM_SERVER_API_KEY to the API key for your LLM of choice. The DEFAULT_AI_MODEL environment variable can be set to the model name of your LLM. For example, you may use openai/gpt-4o-mini if using OpenRouter or gpt-4o-mini if using OpenAI.
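As a rough sketch of how these three variables fit together, the snippet below collects them into one settings dictionary and reports which required ones are missing. It does not show how thepi.pe reads them internally: load_llm_settings is a hypothetical helper, and the fallback model name is an assumption taken from the example above.

```python
import os

REQUIRED_VARS = ("LLM_SERVER_BASE_URL", "LLM_SERVER_API_KEY")


def load_llm_settings(env=None, fallback_model="gpt-4o-mini"):
    """Collect local LLM server settings from environment variables.

    Hypothetical helper; fallback_model is an assumed default, not thepi.pe's.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "base_url": env["LLM_SERVER_BASE_URL"],
        "api_key": env["LLM_SERVER_API_KEY"],
        # DEFAULT_AI_MODEL is optional in this sketch
        "model": env.get("DEFAULT_AI_MODEL") or fallback_model,
    }
```

The returned values line up with what an OpenAI-format client typically needs: a base URL, an API key, and a model name.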

For full functionality with media-rich sources, you will need to install the following dependencies:

apt-get update && apt-get install -y git ffmpeg tesseract-ocr
python -m playwright install --with-deps chromium

When using thepi.pe, be sure to append local=True to your function calls:

from thepipe.scraper import scrape_url

chunks = scrape_url(url="https://example.com", local=True)

You can also use thepi.pe from the command line:

thepipe path/to/folder --include_regex '.*\.tsx' --local
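To see what an include pattern such as .*\.tsx selects, the sketch below applies it to a list of paths with Python's re module. Whether the CLI uses search, match, or fullmatch semantics is an assumption here (re.search is used), and filter_paths is a hypothetical helper rather than thepi.pe's own code.

```python
import re


def filter_paths(paths, include_regex):
    """Keep only the paths that the pattern matches (re.search semantics assumed)."""
    pattern = re.compile(include_regex)
    return [path for path in paths if pattern.search(path)]


paths = ["src/App.tsx", "src/index.ts", "README.md", "src/components/Button.tsx"]
print(filter_paths(paths, r".*\.tsx"))
# → ['src/App.tsx', 'src/components/Button.tsx']
```

Under search semantics the pattern would also match a name like App.tsx.bak; appending $ to the pattern restricts it to paths that end in .tsx.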

Supported File Types 📚

Source | Input types | Multimodal | Notes
Webpage | URLs starting with http, https, ftp | ✔️ | Scrapes markdown, images, and tables from web pages. ai_extraction available for AI content extraction from the webpage's screenshot
PDF | .pdf | ✔️ | Extracts page markdown and page images. ai_extraction available for AI layout analysis
Word Document | .docx | ✔️ | Extracts text, tables, and images
PowerPoint | .pptx | ✔️ | Extracts text and images from slides
Video | .mp4, .mov, .wmv | ✔️ | Uses Whisper for transcription and extracts frames
Audio | .mp3, .wav | ✔️ | Uses Whisper for transcription
Jupyter Notebook | .ipynb | ✔️ | Extracts markdown, code, outputs, and images
Spreadsheet | .csv, .xls, .xlsx | | Converts each row to JSON format, including the row index for each
Plaintext | .txt, .md, .rtf, etc | | Simple text extraction
Image | .jpg, .jpeg, .png | ✔️ | Uses pytesseract for OCR in text-only mode
ZIP File | .zip | ✔️ | Extracts and processes contained files
Directory | any path/to/folder | ✔️ | Recursively processes all files in directory
YouTube Video (known issues) | YouTube video URLs starting with https://youtube.com or https://www.youtube.com | ✔️ | Uses pytube for video download and Whisper for transcription. For consistent extraction, you may need to modify your pytube installation to send a valid user agent header (see this issue).
Tweet | URLs starting with https://twitter.com or https://x.com | ✔️ | Uses unofficial API, may break unexpectedly
GitHub Repository | GitHub repo URLs starting with https://github.com or https://www.github.com | ✔️ | Requires GITHUB_TOKEN environment variable
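The routing in the table above can be approximated with a plain lookup on URL host and file extension. This is only an illustration of the table, not thepi.pe's actual detection logic (which the feature list describes as AI-native); guess_source_type and EXTENSION_TO_SOURCE are hypothetical names, and directories are left out because they need a filesystem check.

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions and source names copied from the supported file types table
EXTENSION_TO_SOURCE = {
    ".pdf": "PDF", ".docx": "Word Document", ".pptx": "PowerPoint",
    ".mp4": "Video", ".mov": "Video", ".wmv": "Video",
    ".mp3": "Audio", ".wav": "Audio", ".ipynb": "Jupyter Notebook",
    ".csv": "Spreadsheet", ".xls": "Spreadsheet", ".xlsx": "Spreadsheet",
    ".txt": "Plaintext", ".md": "Plaintext", ".rtf": "Plaintext",
    ".jpg": "Image", ".jpeg": "Image", ".png": "Image", ".zip": "ZIP File",
}
HOST_TO_SOURCE = {
    "youtube.com": "YouTube Video", "twitter.com": "Tweet",
    "x.com": "Tweet", "github.com": "GitHub Repository",
}


def guess_source_type(source: str) -> str:
    """Map a URL or file path to a source name from the table (illustrative only)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https", "ftp"):
        host = parsed.netloc.lower()
        if host.startswith("www."):
            host = host[len("www."):]
        # Any other host is treated as a generic webpage
        return HOST_TO_SOURCE.get(host, "Webpage")
    return EXTENSION_TO_SOURCE.get(Path(source).suffix.lower(), "Unknown")
```

For example, guess_source_type("paper.pdf") returns "PDF" and guess_source_type("https://example.com") returns "Webpage".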

How it works 🛠️

thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with language models or vision transformers. The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format that is compatible with any LLM or multimodal model with thepipe.core.chunks_to_messages, which gives the following format:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "..."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/jpeg;base64,..."
        }
      }
    ]
  }
]
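To make the shape above concrete, here is a small standard-library sketch that builds a message in the same format from a text string and raw JPEG bytes. It is not the implementation of chunks_to_messages; build_message is a hypothetical helper, and it assumes the images are already JPEG-encoded.

```python
import base64


def build_message(text, jpeg_images=()):
    """Build a one-message list in the OpenAI-style format shown above.

    Hypothetical helper; assumes each item in jpeg_images is JPEG-encoded bytes.
    """
    content = [{"type": "text", "text": text}]
    for jpeg_bytes in jpeg_images:
        encoded = base64.b64encode(jpeg_bytes).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
        })
    return [{"role": "user", "content": content}]
```

Each image travels inline as a base64 data URL, which is why prompts with many page images grow large quickly.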

You can feed these messages directly into the model, or alternatively you can use chunker.chunk_by_document, chunker.chunk_by_page, chunker.chunk_by_section, or chunker.chunk_semantic to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to LlamaIndex Document/ImageDocument with .to_llamaindex.

⚠️ It is important to be mindful of your model's token limit. GPT-4o does not work with too many images in the prompt (see discussion here). To remedy this issue, either use an LLM with a larger context window, extract larger documents with text_only=True, or embed the chunks into a vector database.
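If the messages have already been built, one possible workaround is to cap the number of image parts before sending them. The sketch below operates on the message format shown earlier; limit_images and count_images are hypothetical helpers, not thepi.pe functions, and the right cap for a given model is something you would need to determine yourself.

```python
def count_images(messages):
    """Count image_url parts across all messages."""
    return sum(
        1
        for message in messages
        if isinstance(message["content"], list)
        for part in message["content"]
        if part.get("type") == "image_url"
    )


def limit_images(messages, max_images):
    """Return a copy of messages that keeps at most max_images image parts.

    Text parts are always kept; images beyond the cap are dropped in order.
    """
    remaining = max_images
    trimmed = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            trimmed.append(message)  # plain-string content has no images
            continue
        kept = []
        for part in content:
            if part.get("type") == "image_url":
                if remaining <= 0:
                    continue
                remaining -= 1
            kept.append(part)
        trimmed.append({**message, "content": kept})
    return trimmed
```

Passing max_images=0 yields a text-only prompt, similar in spirit to scraping with text_only=True but applied after the fact.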

Sponsors

Thank you to Cal.com for sponsoring this project.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thepipe_api-1.2.8.tar.gz (28.6 kB)

Uploaded Source

Built Distribution

thepipe_api-1.2.8-py3-none-any.whl (28.9 kB)

Uploaded Python 3

File details

Details for the file thepipe_api-1.2.8.tar.gz.

File metadata

  • Download URL: thepipe_api-1.2.8.tar.gz
  • Upload date:
  • Size: 28.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.8

File hashes

Hashes for thepipe_api-1.2.8.tar.gz
Algorithm | Hash digest
SHA256 | 16109dfed2a46103bdaf04069b13dd0517fb3f5d2b72585f3333929450642fba
MD5 | 4686ff593605b8d1b239f1b167d54540
BLAKE2b-256 | 4a25c6f72a31d2c9ab72f28c3f52a89f7d2a58bbdf72f68b9aba9de8643ee174


File details

Details for the file thepipe_api-1.2.8-py3-none-any.whl.

File metadata

  • Download URL: thepipe_api-1.2.8-py3-none-any.whl
  • Upload date:
  • Size: 28.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.8

File hashes

Hashes for thepipe_api-1.2.8-py3-none-any.whl
Algorithm | Hash digest
SHA256 | 9b348e81a32fe98d0c4988f12ecb1b56c2644f09246f128e70418a549716ad44
MD5 | 94d51609f60e41cd826959b00dc7b285
BLAKE2b-256 | 803636dc5f3a7422674331c7cb5049e7236ccdae1f3da12737dfb85ed706d624

