Open anything, get clean Markdown for LLMs.
Any file in. Clean Markdown out. LLM ready.
A lightweight library that turns PDF, DOCX, XLSX, audio, video, and ~100 more formats into the Markdown your LLM already understands.
What is this?
A lightweight library for reading files. What makes it different: the output is LLM-ready — clean Markdown you can feed straight into any model, no post-processing, no cleanup.
One line. Every common file. LLM-ready Markdown. Point fyle at a path, URL, or raw bytes — what comes back is already something a model can read natively. No OCR plumbing, no format-specific parser glue, no prompt engineering to "please strip the headers and footers".
import fyle
text = fyle.read("report.pdf") # or .docx / .xlsx / .mp3 / .mp4 / an http(s) URL / raw bytes
llm.complete(text) # that's it.
Works out of the box on:
- PDF / DOCX / XLSX / PPTX / HTML / Markdown / CSV: parsed into Markdown
- Images: base64 `data:image/...` URLs ready for multimodal models
- Audio / video: local ASR transcripts with `[MM:SS]` timestamps (+ keyframes for video)
- SQLite: schema preview + fluent `doc.table(t).query(sql)` API
- Archive: safe extraction + Markdown manifest, agent decides what to open next
- ~100 source / config / log formats: passthrough as plain text
100% local. No cloud APIs. No telemetry. No API keys. Just `fyle.open(...)` and the file becomes something an LLM can see.
Install
pip install fylepy
Audio and video transcription are opt-in extras (they install native wheels and download a ~140 MB model on first run):
pip install 'fylepy[audio]' # faster-whisper
pip install 'fylepy[video]' # faster-whisper + PySceneDetect + PyAV
Quick start
import fyle
doc = fyle.open("report.pdf")
# or: fyle.open("https://example.com/report.pdf")
# or: fyle.open(raw_bytes) # format auto-detected from magic bytes
# Three views of the same document:
print(doc.text) # pure content — whatever the reader produced
print(str(doc)) # LLM-ready: filename + format + size header, then content
print(repr(doc)) # short debug line for logs
# Typical usage — hand the whole thing to your model in one line:
llm.complete(str(doc)) # filename carries real signal the model can use
print(doc.meta.format) # "pdf"
print(doc.meta.ext) # "pdf"
print(doc.pages[0].text) # just page 1
# One-shot convenience: str in, LLM-ready string out (same as str(fyle.open(...)))
text = fyle.read("report.pdf")
# Check which readers are available in your install
fyle.readers()
# {"pdf": ["pymupdf4llm*"], "audio": ["faster-whisper*"], ...}
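The quick start notes that raw bytes are identified from their magic bytes. The sketch below illustrates that general technique with a small signature table. It is not fyle's implementation: the `sniff_format` name, the table, and the labels are all hypothetical.

```python
from typing import Optional

# Hypothetical signature table: (leading bytes, format label).
# These are well-known file signatures, not fyle's internal list.
_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"SQLite format 3\x00", "sqlite"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx / .xlsx / .pptx
    (b"\x1f\x8b", "gzip"),
]


def sniff_format(data: bytes) -> Optional[str]:
    """Return the label of the first signature that prefixes data, else None."""
    for magic, label in _SIGNATURES:
        if data.startswith(magic):
            return label
    return None


print(sniff_format(b"%PDF-1.7\n"))
print(sniff_format(b"just some text"))
```

Because Office formats share the ZIP signature, a real detector would need a second step (for example, looking at the entries inside the container) to tell `.docx` from `.xlsx`.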
Supported formats
| Family | Extensions | Reader |
|---|---|---|
| PDF | .pdf | pymupdf4llm |
| Word | .docx | mammoth |
| Excel | .xlsx | openpyxl + tabulate |
| PowerPoint | .pptx | python-pptx |
| Web | .html .htm | markdownify |
| Markdown | .md .markdown | markdown-it-py |
| CSV | .csv | stdlib + tabulate |
| Image | .png .jpg .jpeg .webp | Pillow → base64 data URL |
| Audio | .mp3 .m4a .wav .flac .ogg | faster-whisper (CPU, int8) |
| Video | .mp4 .m4v .mov .avi .mkv .webm | PySceneDetect + Whisper |
| Database | .db .sqlite .sqlite3 | stdlib + fluent SQL API |
| Archive | .zip .tar .gz .bz2 .xz ... | stdlib (extract to disk + manifest) |
| Text | .py .js .go .rs .java .json .yaml .toml .sql .log ... (~100) | passthrough |
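The Image row says images come back as base64 data URLs. As a rough illustration of what that encoding step looks like, here is a standard-library sketch. The `to_data_url` helper is hypothetical and skips whatever Pillow-based processing the real reader does.

```python
import base64


def to_data_url(raw: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data: URL with the given MIME type."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first three bytes of any JPEG file, used here as stand-in image data.
jpeg_prefix = b"\xff\xd8\xff"
print(to_data_url(jpeg_prefix, "image/jpeg"))
```

A string in this shape can be passed to multimodal APIs that accept inline image URLs.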
Audio & video
doc = fyle.open("meeting.mp4")
print(doc.text)
# # Video: meeting.mp4
#
# - Duration: `12:34`
# - Keyframes: 8
# - Language: `en`
#
# ## Transcript
#
# [00:00] Welcome everyone to the quarterly review...
for img in doc.images:
print(img.caption, img.src[:32])
# "02:17" "data:image/jpeg;base64,/9j/4AA..."
First call downloads the Whisper base model (~140 MB). CPU only — no GPU needed.
Override with FYLE_WHISPER_MODEL=small (or medium / large-v3) for higher quality.
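The transcript lines above carry `[MM:SS]` prefixes. A minimal sketch of how such prefixes could be produced from segment start times follows. The `(start_seconds, text)` segment shape and both helper names are assumptions for illustration, not fyle's internals.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format a non-negative offset in seconds as [MM:SS], truncating fractions."""
    minutes, secs = divmod(int(seconds), 60)
    # Minutes are not wrapped at 60, so long recordings keep counting up.
    return f"[{minutes:02d}:{secs:02d}]"


def render_transcript(segments: Iterable[Tuple[float, str]]) -> str:
    """Join (start_seconds, text) segments into one timestamped line each."""
    return "\n".join(
        f"{format_timestamp(start)} {text.strip()}" for start, text in segments
    )


print(render_transcript([(0.0, " Welcome everyone "), (137.4, "Next topic")]))
```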
SQLite
doc = fyle.open("chinook.db")
for page in doc.pages:
print(page.name) # table or view name
print(page.text) # schema + sample rows
rows = doc.table("Customer").query(
"SELECT Country, COUNT(*) AS n FROM Customer GROUP BY Country ORDER BY n DESC"
)
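For readers curious what a schema preview plus query involves underneath, the sketch below does the equivalent with the standard-library `sqlite3` module on an in-memory database. The helper names `list_tables` and `table_schema` are hypothetical, and the sample table is made up to mirror the query above.

```python
import sqlite3

# Build a tiny stand-in database in memory (sample data is invented).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Country TEXT)")
conn.executemany(
    "INSERT INTO Customer (Country) VALUES (?)",
    [("USA",), ("Canada",), ("USA",)],
)


def list_tables(conn: sqlite3.Connection):
    """Return (name, type) for every table and view in the database."""
    sql = (
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY name"
    )
    return list(conn.execute(sql))


def table_schema(conn: sqlite3.Connection, table: str):
    """Return (column_name, declared_type) pairs for one table."""
    # PRAGMA statements cannot use bound parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({quoted})")]


rows = conn.execute(
    "SELECT Country, COUNT(*) AS n FROM Customer GROUP BY Country ORDER BY n DESC"
).fetchall()

print(list_tables(conn))
print(table_schema(conn, "Customer"))
print(rows)
```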
Archive
doc = fyle.open("~/Downloads/invoices.zip")
print(doc.text) # Markdown listing of extracted files
print(doc.meta.warnings) # ["extracted to: /.../invoices/"]
# Agent's next step: fyle.open(one of the extracted files)
The archive reader refuses `..` path traversal and symlink escapes, and extracts into a directory next to the archive.
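The traversal check mentioned above can be illustrated with a small path test: resolve the would-be target and confirm it still sits under the extraction directory. This is a sketch of the general technique, not fyle's code, and the `is_safe_member` name is hypothetical. Symlink entries inside an archive need an additional check on their link targets, which this sketch does not cover.

```python
from pathlib import Path


def is_safe_member(dest: Path, member_name: str) -> bool:
    """Return True if member_name, extracted under dest, stays inside dest."""
    root = dest.resolve()
    # Joining with an absolute member name discards root, so that case fails too.
    target = (root / member_name).resolve()
    return target == root or root in target.parents


dest = Path("extracted")
for name in ["invoices/2024.pdf", "../evil.txt", "a/../../evil.txt"]:
    print(name, is_safe_member(dest, name))
```

An extractor built on this check would skip or reject any member for which the function returns `False` before writing anything to disk.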
Chunking for RAG
for chunk in doc.chunks(max_tokens=4000, overlap=200):
embed(chunk.text)
# chunk.tokens / chunk.page_range also available
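To make the `max_tokens` and `overlap` parameters concrete, here is a minimal sliding-window sketch over a pre-tokenized list. It assumes whitespace-split words stand in for tokens and that `overlap` means tokens shared between consecutive chunks. fyle's own tokenizer and boundary rules may differ, and `chunk_tokens` is a hypothetical name.

```python
from typing import Iterator, List


def chunk_tokens(tokens: List[str], max_tokens: int, overlap: int) -> Iterator[List[str]]:
    """Yield windows of at most max_tokens, each sharing overlap tokens with the previous one."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + max_tokens]
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + max_tokens >= len(tokens):
            break


words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, max_tokens=4, overlap=1):
    print(chunk)
```

The overlap trades a little duplicated embedding cost for context that survives chunk boundaries.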
Notes
- Local processing only. Every reader runs on your machine. Apart from fetching a URL you pass in, the only network access is the one-time Whisper model download from Hugging Face on the audio/video reader's first run.
- Archive reader is list-only. It extracts files to disk and returns a manifest — it does not recursively parse contents. The agent decides what to open next.
- Alpha. Core is stable, but APIs may move between `0.x` releases.
Feedback
Issues, PRs, and stars are welcome.
License
MIT © 2026 zhixiangxue