A smart PDF splitter that uses AI to extract chapters.

Folix ✂️

A smart, AI-powered PDF splitter.

Folix is a CLI tool designed to split large PDF textbooks and documents into clean, individual chapter files. Unlike standard splitters that blindly cut pages, Folix uses Mistral AI to parse the Table of Contents, automatically calculate page offsets, and handle complex layouts (like double‑column indices) with ease.


🚀 Features

  • 📚 Smart Chapter Extraction: Automatically detects chapters using native PDF bookmarks (ToC).

  • 🤖 AI‑Powered Fallback: If bookmarks are missing, Folix reads the visual Table of Contents page and uses Mistral AI to identify chapters.

  • 🧠 Intelligent Offset Calculation: Automatically aligns printed page numbers with the physical PDF structure.

  • 👁️ Physical Layout Analysis: Correctly parses multi‑column Tables of Contents that confuse standard PDF tools.

  • 🔍 Interactive Inspection: Visualizes the document structure so you can choose exactly which hierarchy level (Part, Chapter, Section) to extract.

  • 🛠️ Zero‑Config CLI: Simple commands for extracting, merging, and inspecting PDFs.


📦 Installation

Option A: Install via PyPI (Recommended)

pip install folix

Option B: Install from Source

git clone https://github.com/yourusername/folix.git
cd folix
pip install .

🔑 Setup (AI Features)

Folix works out‑of‑the‑box for PDFs that include standard bookmarks. For scanned books or files without metadata, you’ll need a free Mistral AI API key to enable automatic chapter detection.

1. Get an API Key

Sign up at https://console.mistral.ai (generous free tier available).

2. Set the Environment Variable

Mac / Linux

export MISTRAL_API_KEY="your_actual_key_here"

Windows (PowerShell)

$env:MISTRAL_API_KEY="your_actual_key_here"
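
To confirm that the variable is actually visible to Python processes before running Folix, a quick check like the one below may help. This is a minimal sketch: the helper name `mistral_key_is_set` is hypothetical and not part of Folix; only the `MISTRAL_API_KEY` variable name comes from the steps above.

```python
import os
from typing import Mapping, Optional


def mistral_key_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if MISTRAL_API_KEY is present and not blank.

    Hypothetical helper for illustration only; it is not part of Folix.
    Pass a mapping to check something other than the real environment.
    """
    source = os.environ if env is None else env
    return bool(source.get("MISTRAL_API_KEY", "").strip())


if __name__ == "__main__":
    if mistral_key_is_set():
        print("MISTRAL_API_KEY is set; AI detection should be available.")
    else:
        print("MISTRAL_API_KEY is not set; PDFs without bookmarks will need it.")
```

Run it in the same terminal session where you exported the key, since the variable set with `export` or `$env:` only lasts for that session.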

📖 Usage

1. Extract Chapters

The primary command. Folix first attempts bookmark‑based extraction; if no bookmarks are found, it automatically falls back to AI detection.

folix extract <file_name>

Options:

  • --level 1 → Extract top‑level items (e.g. Parts)
  • --level 2 → Extract chapters
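
To illustrate what choosing a level means for the resulting files, here is a minimal sketch of turning a flat outline into page ranges for one hierarchy level. It is not Folix's actual implementation: `TocEntry` and `ranges_for_level` are hypothetical names, and the rule used (an entry ends just before the next entry at the same or a shallower level) is an assumption about how such a splitter could behave.

```python
from typing import List, NamedTuple, Tuple


class TocEntry(NamedTuple):
    level: int  # 1 = top-level item (e.g. Part), 2 = chapter, ...
    title: str
    page: int   # 1-based physical page where the entry starts


def ranges_for_level(
    entries: List[TocEntry], level: int, total_pages: int
) -> List[Tuple[str, int, int]]:
    """Return (title, first_page, last_page) for every entry at `level`.

    An entry ends on the page before the next entry at the same or a
    shallower level, or on the last page of the document.
    """
    result = []
    for i, entry in enumerate(entries):
        if entry.level != level:
            continue
        end = total_pages
        for later in entries[i + 1:]:
            if later.level <= level:
                # Guard against two entries starting on the same page.
                end = max(entry.page, later.page - 1)
                break
        result.append((entry.title, entry.page, end))
    return result


toc = [
    TocEntry(1, "Part I", 1),
    TocEntry(2, "1. Introduction", 3),
    TocEntry(2, "2. The Basics", 20),
    TocEntry(1, "Part II", 41),
    TocEntry(2, "3. Advanced Topics", 43),
]
for title, first, last in ranges_for_level(toc, level=2, total_pages=60):
    print(f"{title}: pages {first}-{last}")
```

With this sample outline, level 1 would yield two files (one per Part) and level 2 would yield three (one per chapter).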

2. Interactive Mode

If you’re unsure how the document is structured, run the extract command as usual. Folix prints a summary of the hierarchy levels it found and prompts you to pick one.

folix extract <file_name>

Example Output:

📘  Analyzing structure of: complex_book.pdf
--------------------------------------------------------------------------------
Lvl  | Count  | Samples (First 3 items)
--------------------------------------------------------------------------------
1    | 5      | Part I, Part II, Part III...
2    | 32     | 1. Introduction, 2. The Basics, 3. Advanced Topics...
--------------------------------------------------------------------------------

Select a Level to extract (or 'q' to quit):
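
A summary table like the one above can be built by grouping outline entries by level. The following is a rough sketch under assumptions, not Folix's own code: `summarize_levels` and `format_summary` are hypothetical helpers, and the input is assumed to be a list of `(level, title)` pairs in document order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_levels(
    entries: List[Tuple[int, str]], sample_size: int = 3
) -> Dict[int, Tuple[int, List[str]]]:
    """Map each level to (count, first few titles), ordered by level."""
    titles_by_level: Dict[int, List[str]] = defaultdict(list)
    for level, title in entries:
        titles_by_level[level].append(title)
    return {
        level: (len(titles), titles[:sample_size])
        for level, titles in sorted(titles_by_level.items())
    }


def format_summary(summary: Dict[int, Tuple[int, List[str]]]) -> str:
    """Render the summary as a plain-text table, one row per level."""
    lines = [f"{'Lvl':<4} | {'Count':<6} | Samples"]
    for level, (count, samples) in summary.items():
        lines.append(f"{level:<4} | {count:<6} | {', '.join(samples)}...")
    return "\n".join(lines)


outline = [
    (1, "Part I"),
    (2, "1. Introduction"),
    (2, "2. The Basics"),
    (1, "Part II"),
    (2, "3. Advanced Topics"),
    (2, "4. More"),
]
print(format_summary(summarize_levels(outline)))
```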

3. Merge PDFs

Combine multiple PDFs into a single file.

folix merge <pdf_names> --output <output_file_name>

4. Manual Split

Split a page range manually.

folix split <file_name> --start <start_page> --end <end_page> --output <output_file_name>

🛠️ How It Works

Folix uses a three‑stage process to ensure accurate chapter extraction:

  1. Metadata Scan: Detects native PDF bookmarks (Table of Contents).

  2. AI Analysis: If metadata is missing, Folix locates the visual Contents page, cleans the extracted text to reduce token usage, and sends it to Mistral AI for chapter identification.

  3. Visual Anchor & Offset Alignment:

    • The AI may say: "Chapter 1 starts on page 1"
    • Folix scans the PDF to find where "Chapter 1" physically appears (e.g. page 18)
    • A global offset is calculated (here, 18 − 1 = 17) and applied to all chapters, ensuring precise cuts
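
The offset alignment in stage 3 can be sketched as follows. This is an illustration under stated assumptions rather than Folix's actual code: the function names are hypothetical, page text is assumed to be already extracted into a list of strings (one per physical page), and the anchor search is a plain case-insensitive substring match that skips the leading pages where the Contents page itself would mention the title.

```python
from typing import Dict, List, Optional, Tuple


def find_anchor_page(
    page_texts: List[str], title: str, skip_until: int = 0
) -> Optional[int]:
    """Return the 1-based physical page whose text contains `title`.

    The first `skip_until` pages are ignored so that the Contents page,
    which also lists the title, is not mistaken for the chapter start.
    """
    needle = title.casefold()
    for index, text in enumerate(page_texts):
        if index < skip_until:
            continue
        if needle in text.casefold():
            return index + 1
    return None


def global_offset(printed_page: int, physical_page: int) -> int:
    """Offset to add to a printed page number to get the physical page."""
    return physical_page - printed_page


def apply_offset(chapters: List[Tuple[str, int]], offset: int) -> Dict[str, int]:
    """Convert printed start pages to physical start pages."""
    return {title: printed + offset for title, printed in chapters}


# Build an 18-page dummy document: Contents on page 3, Chapter 1 on page 18.
pages = [""] * 18
pages[2] = "Contents\nChapter 1 ..... 1\nChapter 2 ..... 25"
pages[17] = "Chapter 1\nIntroduction"

anchor = find_anchor_page(pages, "Chapter 1", skip_until=3)
offset = global_offset(printed_page=1, physical_page=anchor)
print(apply_offset([("Chapter 1", 1), ("Chapter 2", 25)], offset))
```

A single global offset works when the front matter is the only source of mismatch; books with unnumbered inserts mid-volume would need more than one anchor.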

🤝 Contributing

Contributions are welcome!

  1. Fork the repository

  2. Create your feature branch:

    git checkout -b feature/amazing-feature
    
  3. Commit your changes:

    git commit -m "Add some amazing feature"
    
  4. Push to the branch:

    git push origin feature/amazing-feature
    
  5. Open a Pull Request


📄 License

Distributed under the MIT License. See the LICENSE file for details.
