langchain-pdf

Generate clean, readable, professional PDFs from raw text or Large Language Model (LLM) output.

langchain-pdf is designed for developers who want deterministic, well-formatted documents instead of messy markdown or broken PDFs.



✨ Why langchain-pdf?

Large Language Models often generate:

  • markdown artifacts (**bold**, ---, 1. lists)
  • inconsistent spacing
  • duplicated headings
  • orphan bullets
  • blank pages in PDFs

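langchain-pdf addresses each of these: the normalizer strips markdown artifacts, noise, and duplicates, and the renderer avoids blank pages and orphan content.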

It introduces a proper document pipeline:


LLM Output → Normalize → Parse → Render → PDF
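To make the Normalize stage concrete, here is a minimal sketch of the kind of cleanup it could perform on the artifacts listed above. The function name `normalize` and its rules are illustrative assumptions, not the actual code in `normalizer.py`.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalizer: strip common markdown artifacts from LLM output."""
    cleaned = []
    seen_headings = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Drop horizontal rules such as --- or ***
        if re.fullmatch(r"(-{3,}|\*{3,}|_{3,})", line):
            continue
        # Drop orphan bullets (a bullet marker with no text after it)
        if line in {"-", "*", "•"}:
            continue
        # Unwrap **bold** and __bold__ markers, keeping the inner text
        line = re.sub(r"(\*\*|__)(.+?)\1", r"\2", line)
        # Skip headings that repeat an earlier heading (case-insensitive)
        heading = re.match(r"#{1,6}\s+(.*)", line)
        if heading:
            key = heading.group(1).strip().lower()
            if key in seen_headings:
                continue
            seen_headings.add(key)
        cleaned.append(line)
    # Collapse runs of blank lines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned)).strip()


sample = "# Intro\n\n**Bold** point\n---\n# Intro\n-\n\n\n\nDone"
print(normalize(sample))
```

Each rule maps to one item in the list above, so rules can be added or removed independently.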


🚀 Features

  • 🧠 Robust text normalization (handles messy LLM output)
  • 📚 Structured document parsing (headings, paragraphs, bullets)
  • 🖨️ Professional PDF rendering
  • 🛑 No blank pages or orphan content
  • 🔗 LangChain integration (Gemini, OpenAI, and Anthropic supported)
  • 💻 CLI support (no Python code required)
  • 🧪 Windows-tested (PowerShell friendly)
  • 📦 Open-source & extensible

📄 Sample Outputs

Want to see what the generated PDFs look like?

👉 Check out the sample outputs here:
docs/outputs/

📦 Installation

Clone the repository

git clone https://github.com/your-username/langchain-pdf.git
cd langchain-pdf

Create and activate a virtual environment

python -m venv venv

Windows

venv\Scripts\activate

macOS / Linux

source venv/bin/activate

Install dependencies

pip install -r requirements.txt
pip install -e .

Set ONE of the following environment variables:

  • OPENAI_API_KEY (OpenAI)
  • GOOGLE_API_KEY or GEMINI_API_KEY (Google Gemini)
  • ANTHROPIC_API_KEY (Anthropic)
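
Since only one key needs to be set, a script can check which provider is configured before generating anything. The sketch below is an assumption about how that check might look: `detect_provider`, `PROVIDER_KEYS`, and the lookup order are hypothetical and not taken from the library.

```python
import os

# Environment variable names come from the list above.
# The provider labels and the lookup order are assumptions.
PROVIDER_KEYS = {
    "openai": ("OPENAI_API_KEY",),
    "gemini": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),
    "anthropic": ("ANTHROPIC_API_KEY",),
}


def detect_provider(env=None):
    """Return the first provider with a non-empty API key, or None."""
    env = os.environ if env is None else env
    for provider, names in PROVIDER_KEYS.items():
        if any(env.get(name) for name in names):
            return provider
    return None


if __name__ == "__main__":
    provider = detect_provider()
    print(provider or "No API key found; set one of the variables above.")
```

Passing a plain dictionary as `env` keeps the function easy to test without touching the real environment.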

🔐 Environment Setup (for AI generation)

Create a .env file in the project root containing the key for your provider (only one is required):

GOOGLE_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

Optional LLM Providers

OpenAI:

pip install langchain-openai

Google Gemini:

pip install langchain-google-genai

Anthropic:

pip install langchain-anthropic

.env is ignored by Git and should never be committed.


🖥️ CLI Usage

1️⃣ Convert a text file to PDF

python -m langchain_pdf.cli input.txt output.pdf

Optional title:

python -m langchain_pdf.cli input.txt output.pdf --title "My Document"
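
If you want to call the converter from another Python script, one option is to wrap the documented command with `subprocess`. This is a sketch only: `build_command` and `convert` are hypothetical helper names, and it relies solely on the CLI arguments shown above.

```python
import subprocess
import sys


def build_command(src, dst, title=None):
    """Build the documented CLI invocation as an argument list."""
    cmd = [sys.executable, "-m", "langchain_pdf.cli", src, dst]
    if title:
        cmd += ["--title", title]
    return cmd


def convert(src, dst, title=None):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(src, dst, title), check=True)


if __name__ == "__main__":
    # Prints the command that would run; call convert(...) to execute it.
    print(build_command("input.txt", "output.pdf", title="My Document"))
```

Using `sys.executable` keeps the call inside the active virtual environment, and passing a list avoids shell quoting differences between PowerShell and POSIX shells.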

2️⃣ Generate a PDF using LangChain (Gemini)

python -m langchain_pdf.cli --topic "Generative AI with LangChain" --out reports/course.pdf

This will:

  • generate content using Gemini
  • normalize messy output
  • create a clean PDF automatically

3️⃣ Help

python -m langchain_pdf.cli --help

🧠 How It Works (Architecture)

┌──────────────┐
│  LLM / Text  │
└──────┬───────┘
       ↓
┌──────────────┐
│ Normalizer   │  ← removes markdown, noise, duplicates
└──────┬───────┘
       ↓
┌──────────────┐
│ Parser       │  ← converts text → document blocks
└──────┬───────┘
       ↓
┌──────────────┐
│ Renderer     │  ← layout-safe PDF rendering
└──────┬───────┘
       ↓
┌──────────────┐
│   PDF File   │
└──────────────┘
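
As an illustration of the Parser stage, the sketch below turns normalized text into a flat list of document blocks (headings, paragraphs, bullets). The `Block` dataclass and `parse` function are assumptions made for this example; the real `parser.py` may model blocks differently.

```python
import re
from dataclasses import dataclass

BULLET = re.compile(r"^([-*•]|\d+\.)\s+")


@dataclass
class Block:
    kind: str  # "heading", "bullet", or "paragraph"
    text: str


def parse(text: str) -> list[Block]:
    """Hypothetical parser: group lines into heading, bullet, and paragraph blocks."""
    blocks: list[Block] = []
    paragraph: list[str] = []

    def flush() -> None:
        # Emit any buffered paragraph lines as a single block
        if paragraph:
            blocks.append(Block("paragraph", " ".join(paragraph)))
            paragraph.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()
        elif line.startswith("#"):
            flush()
            blocks.append(Block("heading", line.lstrip("#").strip()))
        elif BULLET.match(line):
            flush()
            blocks.append(Block("bullet", BULLET.sub("", line)))
        else:
            paragraph.append(line)
    flush()
    return blocks


for block in parse("# Title\n\nFirst line\nsecond line\n\n- one\n2. two"):
    print(block.kind, "|", block.text)
```

A renderer can then walk this list and decide layout per block type, which is what keeps rendering independent of the input's original formatting.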

📁 Project Structure

docs/
└── outputs/
    ├── course_overview_sample.pdf
    ├── resume_sample.pdf
    └── README.md
langchain-pdf/
│
├── langchain_pdf/        # Core library
│   ├── assets/
│   │   └── fonts/
│   │       ├── DejaVuSans.ttf
│   │       ├── DejaVuSans-Bold.ttf
│   │       └── LICENSE.txt
│   ├── __init__.py
│   ├── exporter.py
│   ├── normalizer.py
│   ├── parser.py
│   ├── renderer.py
│   ├── templates.py
│   └── cli.py
│
├── examples/             # Usage examples (not packaged)
│   ├── llm_factory.py
│   └── langchain_example.py
│
├── tests/                # Tests (optional)
│
├── README.md
├── requirements.txt
├── pyproject.toml
└── .env.example

🧪 Example Use Cases

  • Generate course PDFs from LLMs
  • Convert AI-generated reports into readable documents
  • Create resumes, study material, or technical notes
  • Build SaaS features that export PDFs
  • Automate documentation pipelines

🤔 Is this made with AI?

Yes — and engineered by a human.

AI helps generate content. langchain-pdf ensures that content is structured, readable, and professional.

The value is not generation — it’s control.


🛠️ Extending the Project

Planned / easy extensions:

  • Support for local LLMs (Ollama)
  • Batch PDF generation
  • Themes (fonts, spacing)
  • DOCX export
  • Stream / stdin input

🤝 Contributing

Contributions are welcome.

If you:

  • improve normalization
  • add render themes
  • support new LLMs

feel free to open a PR.


📜 License

MIT License — free to use, modify, and distribute.


⭐ Final Note

If you are tired of broken PDFs from AI output, langchain-pdf is built for you.

🔤 Fonts & Attribution

This project bundles the Inter font for consistent, readable PDF output.

Inter is licensed under the SIL Open Font License (OFL 1.1).
Font copyright © The Inter Project Authors.

The font license is included in: langchain_pdf/assets/fonts/LICENSE.txt
