Skip to main content

Automate information extraction for multimodal LLMs.

Project description

Pipeline Illustration The Pipe

codecov python-gh-action Website get API

Prepare PDFs, word docs, slides, web pages and more for Vision-LLMs with one line of code ⚡

The Pipe is a multimodal-first tool for feeding files and web pages into vision language models like GPT-4V, Gemini Pro, and LLaVa. It is best for LLM and RAG applications that require a deep understanding of complex sources. The Pipe is available as a hosted API at thepi.pe and as a standalone tool you can use locally.

Getting Started 🚀

First, install thepipe:

pip install thepipe_api

Now you can extract comprehensive text and visuals from any file:

from thepipe_api import thepipe
chunks = thepipe.extract("example.pdf")

Or any website:

chunks = thepipe.extract("https://example.com")

Then feed it into GPT-4-Vision:

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages = chunks,
)

The Pipe's output is a list of sensible "chunks", and thus can be used either for storage in a vector database or for direct use as a prompt. Extra features such as data table extraction, bar chart extraction, custom web authentications and more are available in the API documentation. LiteLLM can be used to easily integrate The Pipe with any LLM provider.

Features 🌟

  • Extracts text and visuals from any file or web page 📚
  • Outputs RAG-ready chunks, optimized for multimodal LLMs 🖼️ + 💬
  • Can interpret complex PDFs, web apps, markdown, etc 🧠
  • Auto-compress prompts exceeding your chosen token limit 📦
  • Works with missing file extensions, in-memory data streams 💾
  • Works with codebases, URL, git repos, and more 🌐
  • Multi-threaded ⚡️

How it works 🛠️

The pipe is accessible from the command line or from Python. The input source is either a file path, a URL, or a directory (or zip file) path. The pipe will extract information from the source and process it for downstream use with language models, vision transformers, or vision-language models. The output from the pipe is a sensible text-based (or multimodal) representation of the extracted information, carefully crafted to fit within context windows for any models from gemma-7b to GPT-4. It uses a variety of heuristics for optimal performance with vision-language models, including AI filetype detection with filetype detection, AI PDF extraction, efficient token compression, automatic image encoding, reranking for lost-in-the-middle effects, and more, all pre-built to work out-of-the-box.

Supported File Types 📚

Source Type Input types Token Compression 🗜️ Image Extraction 👁️ Notes 📌
Directory Any /path/to/directory ✔️ ✔️ Extracts from all files in directory, supports match and ignore patterns
Code .py, .tsx, .js, .html, .css, .cpp, etc ✔️ (varies) Combines all code files. .c, .cpp, .py are compressible with ctags, others are not
Plaintext .txt, .md, .rtf, etc ✔️ Regular text files
PDF .pdf ✔️ ✔️ Extracts text and images of each page; can use AI for extraction of table data and images within pages
Image .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg ✔️ Extracts images, uses OCR if text_only
Data Table .csv, .xls, .xlsx ✔️ Extracts data from spreadsheets; converts to text representation. For very large datasets, will only extract column names and types
Jupyter Notebook .ipynb ✔️ Extracts code, markdown, and images from Jupyter notebooks
Microsoft Word Document .docx ✔️ ✔️ Extracts text and images from Word documents
Microsoft PowerPoint Presentation .pptx ✔️ ✔️ Extracts text and images from PowerPoint presentations
Website URLs (inputs containing http, https, www, ftp) ✔️ ✔️ Extracts text from web page along with image (or images if scrollable); text-only extraction available
GitHub Repository GitHub repo URLs ✔️ ✔️ Extracts from GitHub repositories; supports branch specification
ZIP File .zip ✔️ ✔️ Extracts contents of ZIP files; supports nested directory extraction

Installation 📦

Local Installation 🛠️

To use The Pipe locally, you will need playwright, ctags, pytesseract, and the local python requirements, which differ from the more lightweight API requirements:

git clone https://github.com/emcf/thepipe
pip install -r requirements_local.txt

Tip for windows users: you may need to install the python-libmagic binaries with pip install python-magic-bin.

Now you can use The Pipe:

python thepipe.py path/to/directory

This command will process all supported files within the specified directory, compressing any information over the token limit if necessary, and outputting the resulting prompt and images to a folder.

Arguments are:

  • The input source (required): can be a file path, a URL, or a directory path.
  • --local (optional): Use the local version of The Pipe instead of the hosted API.
  • --match (optional): Regex pattern to match files in the directory.
  • --ignore (optional): Regex pattern to ignore files in the directory.
  • --limit (optional): The token limit for the output prompt, defaults to 100K. Prompts exceeding the limit will be compressed.
  • --ai_extraction (optional): Extract tables, figures, and math from PDFs using our extractor. Incurs extra costs.
  • --text_only (optional): Do not extract images from documents or websites. Additionally, image files will be represented with OCR instead of as images.

Demo

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

thepipe_api-0.1.4.tar.gz (18.4 kB view details)

Uploaded Source

Built Distribution

thepipe_api-0.1.4-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file thepipe_api-0.1.4.tar.gz.

File metadata

  • Download URL: thepipe_api-0.1.4.tar.gz
  • Upload date:
  • Size: 18.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.31.0 requests-toolbelt/0.10.1 urllib3/1.26.18 tqdm/4.66.2 importlib-metadata/6.11.0 keyring/24.3.0 rfc3986/1.5.0 colorama/0.4.6 CPython/3.10.8

File hashes

Hashes for thepipe_api-0.1.4.tar.gz
Algorithm Hash digest
SHA256 52f67a17c55c626335ec42b163001215c77bb5cb4d95b8f0d99315b70347b3b3
MD5 91ea396eecbd4b5d7be3b64dd20df74c
BLAKE2b-256 2afc30550d057555b655ab681128a5ef1dbdaf3cc8878660faf6e209f16112d3

See more details on using hashes here.

Provenance

File details

Details for the file thepipe_api-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: thepipe_api-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/37.3 requests/2.31.0 requests-toolbelt/0.10.1 urllib3/1.26.18 tqdm/4.66.2 importlib-metadata/6.11.0 keyring/24.3.0 rfc3986/1.5.0 colorama/0.4.6 CPython/3.10.8

File hashes

Hashes for thepipe_api-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 e95b3d21714d157dcc604c29fa64d04049cdcc56fe706bc8fdcfb2c40801bbde
MD5 2170a6af397c060c8c1a17a2624143a3
BLAKE2b-256 98e22f450a1fa02d83c8d6794350848ae5b5c627e3ccca8b65e5d73feb43d750

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page