A smart PDF splitter that uses AI to extract chapters.

Folix ✂️

A smart, AI-powered PDF splitter.

Folix is a CLI tool designed to split large PDF textbooks and documents into clean, individual chapter files. Unlike standard splitters that blindly cut pages, Folix uses Mistral AI to parse the Table of Contents, automatically calculate page offsets, and handle complex layouts (like double‑column indices) with ease.


🚀 Features

  • 📚 Smart Chapter Extraction: Automatically detects chapters using native PDF bookmarks (ToC).

  • 🤖 AI‑Powered Fallback: If bookmarks are missing, Folix reads the visual Table of Contents page and uses Mistral AI to identify chapters.

  • 🧠 Intelligent Offset Calculation: Automatically aligns printed page numbers with the physical PDF structure.

  • 👁️ Physical Layout Analysis: Correctly parses multi‑column Tables of Contents that confuse standard PDF tools.

  • 🔍 Interactive Inspection: Visualizes the document structure so you can choose exactly which hierarchy level (Part, Chapter, Section) to extract.

  • 🛠️ Zero‑Config CLI: Simple commands for extracting, merging, and inspecting PDFs.


📦 Installation

Option A: Install via PyPI (Recommended)

pip install folix

Option B: Install from Source

git clone https://github.com/yourusername/folix.git
cd folix
pip install .

🔑 Setup (AI Features)

Folix works out‑of‑the‑box for PDFs that include standard bookmarks. For scanned books or files without metadata, you’ll need a free Mistral AI API key to enable automatic chapter detection.

1. Get an API Key

Sign up at https://console.mistral.ai (generous free tier available).

2. Set the Environment Variable

Mac / Linux

export MISTRAL_API_KEY="your_actual_key_here"

Windows (PowerShell)

$env:MISTRAL_API_KEY="your_actual_key_here"
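
To confirm the key is visible to a Python process before running Folix, a small check like the one below can help. This is only a sketch: `mistral_key_status` is a hypothetical helper and not part of Folix; the only name taken from this README is the `MISTRAL_API_KEY` variable.

```python
import os


def mistral_key_status(env=None):
    """Return a short message describing whether MISTRAL_API_KEY is set.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    Hypothetical helper for illustration, not part of Folix.
    """
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        return "MISTRAL_API_KEY is not set; AI chapter detection will be unavailable."
    # Show only the last 4 characters so the full key never ends up in logs.
    return f"MISTRAL_API_KEY is set (ends with ...{key[-4:]})."


if __name__ == "__main__":
    print(mistral_key_status())
```

If the message says the key is not set, re-run the `export` (or `$env:`) command in the same shell session you use to launch Folix.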

📖 Usage

1. Extract Chapters

The primary command. Folix first attempts bookmark‑based extraction; if none are found, it automatically falls back to AI detection.

folix extract <file_name>

Options:

  • --level 1 → Extract top‑level items (e.g. Parts)
  • --level 2 → Extract chapters
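
The bookmark-first, AI-second order described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Folix's actual code: `choose_strategy`, `filter_level`, and the `(level, title, page)` tuple layout are all hypothetical.

```python
def choose_strategy(bookmarks, api_key):
    """Pick an extraction strategy in the order described above.

    bookmarks: list of (level, title, page) tuples read from the PDF outline
               (empty when the file has no bookmarks). Assumed layout.
    api_key:   value of MISTRAL_API_KEY, or None/"" when unset.
    """
    if bookmarks:
        return "bookmarks"
    if api_key:
        return "ai"
    raise RuntimeError("No bookmarks found and MISTRAL_API_KEY is not set.")


def filter_level(bookmarks, level):
    """Keep only entries at the requested hierarchy level (compare --level)."""
    return [entry for entry in bookmarks if entry[0] == level]


if __name__ == "__main__":
    outline = [
        (1, "Part I", 1),
        (2, "1. Introduction", 3),
        (2, "2. The Basics", 20),
    ]
    print(choose_strategy(outline, api_key=None))
    print(filter_level(outline, level=2))
```

Raising an error when neither source is available is a design choice made for this sketch; the README does not say how Folix reports that case.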

2. Interactive Mode

If you’re unsure how the document is structured, run extraction normally. Folix prints a summary of each hierarchy level it finds and prompts you to choose one.

folix extract <file_name>

Example Output:

📘  Analyzing structure of: complex_book.pdf
--------------------------------------------------------------------------------
Lvl  | Count  | Samples (First 3 items)
--------------------------------------------------------------------------------
1    | 5      | Part I, Part II, Part III...
2    | 32     | 1. Introduction, 2. The Basics, 3. Advanced Topics...
--------------------------------------------------------------------------------

Select a Level to extract (or 'q' to quit):
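
A summary like the one above can be produced by grouping outline entries by level. The following is a minimal sketch, assuming the outline is available as `(level, title)` pairs; `summarize_levels` and `format_table` are hypothetical names, not Folix functions.

```python
from collections import defaultdict


def summarize_levels(entries, sample_size=3):
    """Group (level, title) pairs by level and return (level, count, samples) rows."""
    titles_by_level = defaultdict(list)
    for level, title in entries:
        titles_by_level[level].append(title)
    return [
        (level, len(titles), titles[:sample_size])
        for level, titles in sorted(titles_by_level.items())
    ]


def format_table(rows):
    """Render the rows in a layout similar to the example output above."""
    lines = [f"{'Lvl':<4} | {'Count':<6} | Samples (First 3 items)"]
    for level, count, samples in rows:
        lines.append(f"{level:<4} | {count:<6} | {', '.join(samples)}...")
    return "\n".join(lines)


if __name__ == "__main__":
    entries = [
        (1, "Part I"),
        (2, "1. Introduction"),
        (2, "2. The Basics"),
        (1, "Part II"),
        (2, "3. Advanced Topics"),
    ]
    print(format_table(summarize_levels(entries)))
```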

3. Merge PDFs

Combine multiple PDFs into a single file.

folix merge <pdf_names> --output <output_file_name>

4. Manual Split

Split a page range manually.

folix split input.pdf --start <start_page> --end <end_page> --output <output_file_name> 

🛠️ How It Works

Folix uses a three‑stage fallback system to ensure accurate chapter extraction:

  1. Metadata Scan: Detects native PDF bookmarks (Table of Contents).

  2. AI Analysis: If metadata is missing, Folix locates the visual Contents page, cleans the extracted text to reduce token usage, and sends it to Mistral AI for chapter identification.

  3. Visual Anchor & Offset Alignment

    • The AI may report the printed page number, e.g. "Chapter 1 starts on page 1".
    • Folix scans the PDF to find where "Chapter 1" physically appears (e.g. page 18).
    • The difference (here 18 − 1 = 17) becomes a global offset that is applied to all chapters, ensuring precise cuts.
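
The offset arithmetic in stage 3 can be made concrete with a short Python sketch. It assumes 1-based page numbers, a single offset for the whole book, and that the last chapter runs to the end of the file; `compute_offset` and `chapter_ranges` are hypothetical names used for illustration, not Folix's API.

```python
def compute_offset(printed_page, physical_page):
    """Offset to add to a printed page number to get the physical PDF page."""
    return physical_page - printed_page


def chapter_ranges(toc, offset, total_pages):
    """Turn (title, printed_start) pairs into (title, first, last) physical ranges.

    Pages are 1-based and inclusive. Each chapter ends on the page before the
    next chapter starts; the last chapter is assumed to end at total_pages.
    """
    starts = [(title, printed + offset) for title, printed in toc]
    ranges = []
    for i, (title, first) in enumerate(starts):
        last = starts[i + 1][1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((title, first, last))
    return ranges


if __name__ == "__main__":
    # The ToC says Chapter 1 starts on printed page 1, but it is found on PDF page 18.
    offset = compute_offset(printed_page=1, physical_page=18)
    toc = [("Chapter 1", 1), ("Chapter 2", 25), ("Chapter 3", 60)]
    for row in chapter_ranges(toc, offset, total_pages=120):
        print(row)
```

With the numbers from the example, the offset is 17, so a chapter printed as starting on page 25 would be cut starting at physical page 42.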

🤝 Contributing

Contributions are welcome!

  1. Fork the repository

  2. Create your feature branch:

    git checkout -b feature/amazing-feature
    
  3. Commit your changes:

    git commit -m "Add some amazing feature"
    
  4. Push to the branch:

    git push origin feature/amazing-feature
    
  5. Open a Pull Request


📄 License

Distributed under the MIT License. See the LICENSE file for details.
