LLM-friendly PDF splitter & image optimizer. Chunk PDFs by size and downsample images for RAG/Bedrock.
llm_pdf_chunker
LLM-friendly PDF splitting & image optimization tool.
Designed to prepare documents for RAG and LLM context windows (e.g., AWS Bedrock, Claude) by handling chunking, CMYK to RGB conversion, and smart image resizing.
Features
- LLM Optimized:
  - Fits Within File Size Limits: Helps keep PDFs under strict upload constraints, such as the 4.5MB file size limit often encountered when using models like Claude on AWS Bedrock.
  - Token Efficiency: Downsampling embedded images shrinks the data payload while preserving the visual information a model needs, which can reduce token usage and cost.
- PDF Chunking: Splits PDFs based on file size (specified in MB).
- Image Optimization:
  - Downsampling: Resizes embedded images to a specified maximum dimension (default: 1500px).
  - Color Conversion: Converts CMYK images to RGB to prevent display issues (e.g., inverted colors).
  - Compression: Adjusts JPEG quality to reduce file size.
- Corrupted Font Removal: Removes corrupted fonts created by some software.
- Callback Support: Hook into the saving process via a callback for direct uploads to S3, databases, etc., without saving chunks to local disk.
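To make the downsampling rule concrete, here is a minimal sketch of how a target size could be computed when the longest side is capped at a maximum dimension. The helper name `fit_within` is hypothetical and the logic is not taken from the package's source; it only illustrates aspect-ratio-preserving resizing with the 1500px default listed above.

```python
def fit_within(width: int, height: int, max_dim: int = 1500) -> tuple[int, int]:
    """Return (width, height) scaled so the longest side is at most max_dim.

    Hypothetical helper for illustration only. Images that already fit
    are returned unchanged, so nothing is ever upscaled.
    """
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / longest
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2400, 3200))  # (1125, 1500)
print(fit_within(800, 600))    # (800, 600)
```

A 2400x3200 scan therefore shrinks to 1125x1500, roughly a fifth of the original pixel count, while a small image is left untouched.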
Requirements
- Python 3.11+
- External Dependencies: qpdf (native binary) is required by pikepdf.
  - macOS: brew install qpdf
  - Ubuntu/Debian: apt-get install qpdf
Quickstart (uv)
# ensure uv is installed
uv lock
uv sync
# Run via CLI (the output directory is an optional positional argument)
uv run pdf-chunker input.pdf output
CLI Usage
usage: pdf-chunker [-h] [--max-size MAX_SIZE] [--image-max-dim IMAGE_MAX_DIM] input_pdf [output_dir]

Split large PDFs into smaller chunks

positional arguments:
  input_pdf             Input PDF file path
  output_dir            Output directory (optional, defaults to source dir)

options:
  -h, --help            show this help message and exit
  --max-size MAX_SIZE   Max chunk size in MB (default: 4.0)
  --image-max-dim IMAGE_MAX_DIM
                        Max dimension for images in pixels (default: 1500)
Example:
Split into 10MB chunks and resize images to 2000px on the longest side.
pdf-chunker input.pdf --max-size 10.0 --image-max-dim 2000
Image Analysis Tool (pdf-image-dumper)
A debugging tool is included to inspect images embedded within a PDF. It lists details such as resolution, color space (CMYK/RGB), and filters.
pdf-image-dumper input.pdf
Output Example:
--- Analyzing PDF: input.pdf ---
Page | Name | Width | Height | Size (bytes) | ColorSpace | Filter | Bits/Comp | APP
------+------------+-------+--------+--------------+------------+--------------+-----------+-----
1 | /Im1 | 2400 | 3200 | 2,500,123 | /DeviceCMYK| /DCTDecode | 8 | APP14:Adobe
...
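The report is easy to post-process when looking for images that deserve attention, such as CMYK JPEGs. The sketch below assumes the output keeps the pipe-separated layout shown in the example; `parse_image_rows` is a hypothetical helper, not something shipped with the tool.

```python
def parse_image_rows(report: str) -> list[dict[str, str]]:
    """Turn the pipe-separated report into one dict per image row."""
    rows: list[dict[str, str]] = []
    header: list[str] | None = None
    for line in report.splitlines():
        if "|" not in line:
            continue  # banner, "+"-style separator, and "..." lines
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # the first pipe-separated line names the columns
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


sample = """\
--- Analyzing PDF: input.pdf ---
Page | Name | Width | Height | Size (bytes) | ColorSpace | Filter | Bits/Comp | APP
------+------------+-------+--------+--------------+------------+--------------+-----------+-----
1 | /Im1 | 2400 | 3200 | 2,500,123 | /DeviceCMYK| /DCTDecode | 8 | APP14:Adobe
"""

for row in parse_image_rows(sample):
    if row["ColorSpace"] == "/DeviceCMYK":
        size = int(row["Size (bytes)"].replace(",", ""))
        print(f"page {row['Page']} {row['Name']}: CMYK, {size} bytes")
```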
Python API Usage
Basic Usage
from pdf_chunker import chunk_pdf

# Split input.pdf into chunks in the 'output' directory
chunk_pdf(
    input_path="input.pdf",
    output_dir="output",
    max_chunk_size=4 * 1024 * 1024,  # 4MB (bytes)
    image_max_dim=1500,  # pixels
)
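After a run like the one above, it can be worth confirming that every chunk really fits the limit you care about. A minimal sketch, assuming the chunks are written as .pdf files directly inside the output directory; `oversized_chunks` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def oversized_chunks(output_dir: str, limit_bytes: int) -> list[Path]:
    """Return the PDFs in output_dir that are larger than limit_bytes."""
    return sorted(
        path for path in Path(output_dir).glob("*.pdf")
        if path.stat().st_size > limit_bytes
    )


# Example check against a 4.5MB limit. Treating "MB" as 1024 * 1024 bytes
# is an assumption here; adjust it to match how your provider counts.
# too_big = oversized_chunks("output", int(4.5 * 1024 * 1024))
```

An empty result means every chunk is within the limit. A single page that exceeds the limit on its own may still show up here, in which case a smaller image dimension is the likely remedy.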
Advanced: Using Callbacks (e.g., Upload to S3)
By providing a save_callback, you can receive the split PDF objects (pikepdf.Pdf) directly instead of saving them to the file system.
import io

from pdf_chunker import chunk_pdf


def upload_to_s3(pdf_obj, filename):
    # Convert the pikepdf object to bytes in memory
    with io.BytesIO() as buffer:
        pdf_obj.save(buffer)
        buffer.seek(0)
        # Here you would use boto3 or similar to upload
        print(f"Uploading {filename} ({len(buffer.getvalue())} bytes) to S3...")
        # s3.upload_fileobj(buffer, "my-bucket", filename)


chunk_pdf(
    input_path="large_document.pdf",
    save_callback=upload_to_s3,
)
Docker / MinIO Integration Example
The example/ directory contains a complete example of integration with MinIO (S3-compatible storage).
- MinIO: Triggers a webhook event when a PDF file is uploaded.
- Callback Server: Receives the webhook, downloads the PDF, chunks it, and uploads the parts back to MinIO (without intermediate disk storage).
Run the example:
cd example
docker-compose up --build
- Open MinIO Console at http://localhost:9001 (user: minioadmin, pass: minioadmin).
- Upload a PDF to the pdfs bucket.
- Check the server logs; chunked files (_part01.pdf, etc.) will appear in the output/ folder within the bucket.
License
MIT License