Open anything, get clean Markdown for LLMs.

fyle

Any file in. Clean Markdown out. LLM ready.

A lightweight library that turns PDF, DOCX, XLSX, audio, video, and ~100 more formats into the Markdown your LLM already understands.


What is this?

fyle reads files. What makes it different is the output: clean, LLM-ready Markdown you can feed straight into any model, with no post-processing and no cleanup.

One line. Every common file. LLM-ready Markdown. Point fyle at a path, URL, or raw bytes — what comes back is already something a model can read natively. No OCR plumbing, no format-specific parser glue, no prompt engineering to "please strip the headers and footers".

import fyle

text = fyle.read("report.pdf")   # or .docx / .xlsx / .mp3 / .mp4 / an http(s) URL / raw bytes
llm.complete(text)               # that's it. (llm stands in for whatever model client you use)

Works out of the box on:

  • PDF / DOCX / XLSX / PPTX / HTML / Markdown / CSV — parsed into Markdown
  • Images — base64 data:image/... URLs ready for multimodal models
  • Audio / video — local ASR transcripts with [MM:SS] timestamps (+ keyframes for video)
  • SQLite — schema preview + fluent doc.table(t).query(sql) API
  • Archive — safe extraction + Markdown manifest, agent decides what to open next
  • ~100 source / config / log formats — passthrough as plain text

100% local. No cloud APIs. No telemetry. No API keys. Just fyle.open(...) and the file becomes something an LLM can see.
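To make the image bullet above concrete, here is a minimal standard-library sketch of what a base64 `data:image/...` URL looks like. It is an illustration only, not fyle's implementation (the format table credits Pillow for that step), and `to_data_url` is a hypothetical helper name.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The 8-byte PNG signature stands in for a real image file here.
png_signature = b"\x89PNG\r\n\x1a\n"
url = to_data_url(png_signature)
print(url)
```

A multimodal client that accepts image URLs can usually take a string of this shape directly, which is why no separate upload step is needed.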


Install

pip install fylepy

Audio and video transcription are opt-in extras (native wheels, plus a ~140 MB model downloaded on first run):

pip install 'fylepy[audio]'   # faster-whisper
pip install 'fylepy[video]'   # faster-whisper + PySceneDetect + PyAV
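If you want to check from Python whether the optional dependencies are importable, a rough standard-library sketch could look like the following. The module names (`faster_whisper`, `scenedetect`, `av`) are the usual import names for the packages named in the comments above and are an assumption about what the extras install; `fyle.readers()`, shown in the quick start, is the library's own way to list available readers.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


# Assumed import names for the packages behind each extra (not confirmed by fyle).
OPTIONAL = {
    "audio": ["faster_whisper"],
    "video": ["faster_whisper", "scenedetect", "av"],
}


def available_extras() -> dict:
    """Map each extra to True when all of its assumed modules are importable."""
    return {
        extra: all(is_installed(name) for name in modules)
        for extra, modules in OPTIONAL.items()
    }


print(available_extras())
```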

Quick start

import fyle

doc = fyle.open("report.pdf")
# or: fyle.open("https://example.com/report.pdf")
# or: fyle.open(raw_bytes)   # format auto-detected from magic bytes

# Three views of the same document:
print(doc.text)            # pure content — whatever the reader produced
print(str(doc))            # LLM-ready: filename + format + size header, then content
print(repr(doc))           # short debug line for logs

# Typical usage — hand the whole thing to your model in one line:
llm.complete(str(doc))     # filename carries real signal the model can use

print(doc.meta.format)     # "pdf"
print(doc.meta.ext)        # "pdf"
print(doc.pages[0].text)   # just page 1

# One-shot convenience: str in, LLM-ready string out (same as str(fyle.open(...)))
text = fyle.read("report.pdf")

# Check which readers are available in your install
fyle.readers()
# {"pdf": ["pymupdf4llm*"], "audio": ["faster-whisper*"], ...}
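The quick start mentions that raw bytes are auto-detected from magic bytes. The sketch below shows the general idea with a handful of well-known file signatures. It is not fyle's detector; `sniff_format` and the labels it returns are hypothetical.

```python
from typing import Optional

# (signature, label) pairs. Labels are illustrative, not fyle's format names.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx / .xlsx / .pptx
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"SQLite format 3\x00", "sqlite"),
    (b"\x1f\x8b", "gzip"),
]


def sniff_format(data: bytes) -> Optional[str]:
    """Return a label for the first signature that prefixes the data, else None."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return None


print(sniff_format(b"%PDF-1.7\n"))
print(sniff_format(b"plain text"))
```

Office formats share the ZIP signature, so a real detector has to look inside the container to tell .docx from .xlsx; this sketch stops at the outer signature.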

Supported formats

| Family | Extensions | Reader |
| --- | --- | --- |
| PDF | .pdf | pymupdf4llm |
| Word | .docx | mammoth |
| Excel | .xlsx | openpyxl + tabulate |
| PowerPoint | .pptx | python-pptx |
| Web | .html .htm | markdownify |
| Markdown | .md .markdown | markdown-it-py |
| CSV | .csv | stdlib + tabulate |
| Image | .png .jpg .jpeg .webp | Pillow → base64 data URL |
| Audio | .mp3 .m4a .wav .flac .ogg | faster-whisper (CPU, int8) |
| Video | .mp4 .m4v .mov .avi .mkv .webm | PySceneDetect + Whisper |
| Database | .db .sqlite .sqlite3 | stdlib + fluent SQL API |
| Archive | .zip .tar .gz .bz2 .xz ... | stdlib (extract to disk + manifest) |
| Text | .py .js .go .rs .java .json .yaml .toml .sql .log ... (~100) | passthrough |
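As a picture of what "parsed into Markdown" means for the simplest tabular case, here is a standard-library sketch that turns CSV text into a Markdown pipe table. fyle lists `tabulate` for this step, so this is only an approximation of the idea; `csv_to_markdown` is a hypothetical name.

```python
import csv
import io


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    def line(cells):
        # Escape pipes so cell content cannot break the table layout.
        return "| " + " | ".join(cell.replace("|", "\\|") for cell in cells) + " |"

    lines = [line(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [line(row) for row in body]
    return "\n".join(lines)


print(csv_to_markdown("name,country\nAna,Brazil\nBo,Norway\n"))
```

The sketch assumes every row has the same number of columns as the header; ragged rows are passed through unchanged.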

Audio & video

doc = fyle.open("meeting.mp4")

print(doc.text)
# # Video: meeting.mp4
#
# - Duration: `12:34`
# - Keyframes: 8
# - Language: `en`
#
# ## Transcript
#
# [00:00] Welcome everyone to the quarterly review...

for img in doc.images:
    print(img.caption, img.src[:32])
    # "02:17"  "data:image/jpeg;base64,/9j/4AA..."

The first call downloads the Whisper base model (~140 MB). Transcription runs on CPU, so no GPU is needed. Set FYLE_WHISPER_MODEL=small (or medium / large-v3) to trade speed for higher quality.
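The `[MM:SS]` prefixes in the transcript above are easy to reproduce if you post-process segments yourself. The sketch below assumes segments arrive as `(start_seconds, text)` pairs, which is an illustrative shape and not fyle's internal data model; both function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format a second offset as MM:SS (minutes keep growing past 59)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_transcript(segments) -> str:
    """Join (start_seconds, text) pairs into '[MM:SS] text' lines."""
    return "\n".join(
        f"[{format_timestamp(start)}] {text.strip()}" for start, text in segments
    )


segments = [(0.0, " Welcome everyone to the quarterly review..."), (137.4, "Next item.")]
print(render_transcript(segments))
```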


SQLite

doc = fyle.open("chinook.db")

for page in doc.pages:
    print(page.name)          # table or view name
    print(page.text)          # schema + sample rows

rows = doc.table("Customer").query(
    "SELECT Country, COUNT(*) AS n FROM Customer GROUP BY Country ORDER BY n DESC"
)
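For a sense of what a "schema + sample rows" preview involves, here is a sketch using only the standard `sqlite3` module against an in-memory database. It is not how fyle builds its pages; `preview` and the returned dictionary layout are hypothetical.

```python
import sqlite3


def preview(conn: sqlite3.Connection, sample_rows: int = 3) -> dict:
    """Collect the CREATE statement and a few rows for each table or view."""
    objects = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    ).fetchall()
    out = {}
    for name, ddl in objects:
        if name.startswith("sqlite_"):
            continue  # skip SQLite's internal bookkeeping tables
        quoted = '"' + name.replace('"', '""') + '"'  # quote the identifier safely
        rows = conn.execute(
            f"SELECT * FROM {quoted} LIMIT ?", (sample_rows,)
        ).fetchall()
        out[name] = {"schema": ddl, "rows": rows}
    return out


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Name TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)", [("Ana", "Brazil"), ("Bo", "Norway")]
)
summary = preview(conn)
print(summary["Customer"]["schema"])
print(summary["Customer"]["rows"])
```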

Archive

doc = fyle.open("~/Downloads/invoices.zip")

print(doc.text)                # Markdown listing of extracted files
print(doc.meta.warnings)       # ["extracted to: /.../invoices/"]

# Agent's next step: fyle.open(one of the extracted files)

The archive reader refuses `..` path traversal and symlink escapes, and it extracts into a directory next to the archive.
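The usual way to guard against path traversal is to resolve each member's destination and confirm it still sits inside the extraction directory. The sketch below shows that check with the standard library. It illustrates the technique only and is not fyle's code; `is_within` and `safe_members` are hypothetical helpers.

```python
import os
import zipfile


def is_within(base_dir: str, candidate: str) -> bool:
    """Return True if base_dir/candidate resolves to a path inside base_dir."""
    base = os.path.realpath(base_dir)
    # realpath collapses ".." and follows symlinks that already exist on disk.
    target = os.path.realpath(os.path.join(base, candidate))
    try:
        return os.path.commonpath([base, target]) == base
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


def safe_members(archive: zipfile.ZipFile, dest_dir: str) -> list:
    """List the member names that would stay inside dest_dir when extracted."""
    return [name for name in archive.namelist() if is_within(dest_dir, name)]
```

A caller would extract only the names returned by `safe_members` and report the rest as warnings.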


Chunking for RAG

for chunk in doc.chunks(max_tokens=4000, overlap=200):
    embed(chunk.text)
    # chunk.tokens / chunk.page_range also available
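To show how `max_tokens` and `overlap` interact, here is a small sliding-window sketch over a plain list of tokens. How fyle counts tokens is not stated here, so the whitespace split in the example is a stand-in, and `chunk_tokens` is a hypothetical function.

```python
def chunk_tokens(tokens: list, max_tokens: int, overlap: int) -> list:
    """Split tokens into windows of at most max_tokens, each sharing `overlap`
    tokens with the previous window."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, max_tokens=4, overlap=1):
    print(" ".join(chunk))
```

With `max_tokens=4000` and `overlap=200`, each window would start 3800 tokens after the previous one, so the last 200 tokens of a chunk reappear at the start of the next.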

Notes

  1. Local only. Every reader runs on your machine. The audio/video reader downloads the Whisper model from Hugging Face on first run; after that, the only network access is fetching a URL you pass in.
  2. Archive reader does not parse contents. It extracts files to disk and returns a manifest, but it does not recurse into the extracted files. The agent decides what to open next.
  3. Alpha. Core is stable, but APIs may move between 0.x releases.

Feedback

Issues, PRs, and stars are welcome.


License

MIT © 2026 zhixiangxue

