Extract text and figures from BibTeX library attachments into LLM-readable formats for AI-assisted research

bib4llm

Convert your PDF library into an LLM-readable format for AI-assisted research. bib4llm extracts text and figures from PDFs into markdown and PNG files so that AI coding assistants such as Cursor AI can index them. The input can be a directory of PDF files or a BibTeX file whose `file` field holds the PDF attachment paths; the BibTeX route lets the converted output update automatically from, for example, a Zotero library. bib4llm does not perform any RAG (Retrieval-Augmented Generation) itself; that is left to downstream tools (e.g. Cursor AI, which indexes the active workspace folder).

Features

Reads PDF files from a directory, or reads the file field of a BibTeX file to get the paths of attachments
Extracts text and figures from PDF attachments into markdown and PNG formats using PyMuPDF4LLM (see examples)
Watches directories or BibTeX files for changes and automatically updates the converted files
Developed with Zotero + BetterBibTeX for Cursor AI in mind, but may work with other reference managers' BibTeX exports (depending on their file field format) and for other LLM-based processing
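
As a rough illustration of the first feature, the sketch below pulls PDF paths out of the value of a BibTeX file field. The helper name `extract_pdf_paths` is hypothetical, and the two layouts it handles (semicolon-separated paths, and `description:path:mime/type` triples) are assumptions about common exporter output, not a description of how bib4llm itself parses the field.

```python
import re


def extract_pdf_paths(file_field: str) -> list[str]:
    """Return the PDF paths found in a BibTeX `file` field value (hypothetical helper)."""
    paths = []
    # Assumption: multiple attachments are separated by ';'.
    for part in file_field.split(";"):
        part = part.strip()
        if not part:
            continue
        # Assumption: some exporters write 'description:path:mime/type'.
        match = re.fullmatch(r"[^:]*:(.+):[\w.+-]+/[\w.+-]+", part)
        candidate = match.group(1) if match else part
        # Keep only attachments that look like PDFs.
        if candidate.lower().endswith(".pdf"):
            paths.append(candidate)
    return paths
```

Paths that themselves contain semicolons or escaped colons would need extra handling that this sketch leaves out.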

Installation

pip install bib4llm

Usage

Command Line

# Convert a BibTeX file (one-time)
bib4llm convert path/to/library.bib [options]

# Convert a PDF file directly (one-time)
bib4llm convert path/to/paper.pdf [options]

# Convert all PDFs and BibTeX files in a directory
bib4llm convert path/to/directory [options]

# Watch a BibTeX file for changes and run convert when changes occur
bib4llm watch path/to/library.bib [options]

# Watch a PDF file for changes and run convert when changes occur
bib4llm watch path/to/paper.pdf [options]

# Watch a directory for changes (including new files) and convert accordingly
bib4llm watch path/to/directory [options]

# Remove generated files
bib4llm clean path/to/library.bib [options]

The tool uses multiprocessing to process library entries in parallel. Depending on how many attachments your library contains, the initial conversion may take some time.

Command Options

`convert`

bib4llm convert <input_path> [options]

Arguments:
  input_path            Path to the BibTeX file, PDF file, or directory to process

Options:
  -f, --force           Force reprocessing of all entries
  -p, --processes       Number of parallel processes to use (default: number of CPU cores)
  -n, --dry-run         Show what would be processed without actually doing it
  -q, --quiet           Suppress all output except warnings and errors
  -d, --debug           Enable debug logging
  -R, --no-recursive    Disable recursive processing of directories (only applicable if input is a directory)
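
For scripting, the documented `convert` options can be assembled programmatically. The sketch below is one possible wrapper; the function names are hypothetical, and it assumes that `--processes` takes its value as a separate argument and that `bib4llm` is on the PATH.

```python
import subprocess


def build_convert_command(input_path, force=False, processes=None,
                          dry_run=False, quiet=False, debug=False,
                          recursive=True):
    """Assemble the argument list for `bib4llm convert` from keyword options."""
    command = ["bib4llm", "convert", str(input_path)]
    if force:
        command.append("--force")
    if processes is not None:
        # Assumption: the process count follows the flag as its own argument.
        command += ["--processes", str(processes)]
    if dry_run:
        command.append("--dry-run")
    if quiet:
        command.append("--quiet")
    if debug:
        command.append("--debug")
    if not recursive:
        command.append("--no-recursive")
    return command


def run_convert(input_path, **options):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_convert_command(input_path, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.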

`watch`

bib4llm watch <input_path> [options]

Arguments:
  input_path            Path to the BibTeX file, PDF file, or directory to watch

Options:
  -p, --processes       Number of parallel processes to use (default: number of CPU cores)
  -q, --quiet           Suppress all output except warnings and errors
  -d, --debug           Enable debug logging
  -R, --no-recursive    Disable recursive watching of directories (only applicable if input is a directory)
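
To make the idea of watching concrete, here is a small polling sketch that detects new or modified PDF and BibTeX files by comparing modification times. It is an assumption-laden illustration: bib4llm may well use a different mechanism (for example file-system events), and `snapshot` and `changed_paths` are hypothetical names.

```python
from pathlib import Path


def snapshot(root: Path, recursive: bool = True) -> dict[Path, float]:
    """Map each .pdf/.bib file under root (or root itself) to its modification time."""
    if root.is_file():
        candidates = [root]
    else:
        # The recursive flag plays the role of the inverse of --no-recursive.
        candidates = root.rglob("*") if recursive else root.glob("*")
    return {
        path: path.stat().st_mtime
        for path in candidates
        if path.is_file() and path.suffix.lower() in {".pdf", ".bib"}
    }


def changed_paths(before: dict[Path, float], after: dict[Path, float]) -> set[Path]:
    """Return files that are new or whose modification time differs."""
    return {path for path, mtime in after.items() if before.get(path) != mtime}
```

A watcher loop would take a snapshot, sleep for a short interval, take another snapshot, and reconvert whatever `changed_paths` reports.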

`clean`

bib4llm clean <input_path> [options]

Arguments:
  input_path            Path to the BibTeX file or PDF file whose generated data should be removed

Options:
  -n, --dry-run         Show what would be removed without actually doing it

Setup with Zotero for Cursor AI

Install Zotero and the BetterBibTeX extension
Create a collection for your project papers
(Optional) Configure BetterBibTeX to use your preferred citation key format (e.g. AuthorYYYY)
Export your collection with BetterBibTeX and enable automatic BibTeX file updates
Place the exported .bib file in your project
Run bib4llm to convert and watch for changes:
```
bib4llm watch path/to/library.bib
```

Output Directory Structure

When processing a single PDF file:

paper.pdf -> paper-bib4llm/paper.md (and extracted images)

When processing a directory of PDF files, the directory structure is preserved:

pdf_dir/
├── paper1.pdf
└── subfolder/
    └── paper2.pdf

->

pdf_dir-bib4llm/
├── paper1/
│   ├── paper1.md
│   └── (extracted images)
└── subfolder/
    └── paper2/
        ├── paper2.md
        └── (extracted images)

For BibTeX files, each entry gets its own folder within the output directory:

bibtex_library.bib -> bibtex_library-bib4llm/
    ├── entry1/
    │   ├── entry1.md
    │   └── (extracted images)
    └── entry2/
        ├── entry2.md
        └── (extracted images)
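
The three layouts above can be expressed as path computations. The sketch below reproduces the documented mappings only; the function names are hypothetical, and it assumes POSIX-style paths and that the markdown file is always named after the PDF stem or the entry key.

```python
from pathlib import PurePosixPath

SUFFIX = "-bib4llm"  # suffix appended to the input name, as in the layouts above


def output_for_single_pdf(pdf: str) -> PurePosixPath:
    """paper.pdf -> paper-bib4llm/paper.md"""
    p = PurePosixPath(pdf)
    return p.with_name(p.stem + SUFFIX) / f"{p.stem}.md"


def output_for_pdf_in_directory(root: str, pdf: str) -> PurePosixPath:
    """pdf_dir/sub/paper.pdf -> pdf_dir-bib4llm/sub/paper/paper.md"""
    root_path = PurePosixPath(root)
    relative = PurePosixPath(pdf).relative_to(root_path)
    out_root = root_path.with_name(root_path.name + SUFFIX)
    # Preserve the subfolder structure, then add one folder per PDF.
    return out_root / relative.parent / relative.stem / f"{relative.stem}.md"


def output_for_bibtex_entry(bib: str, key: str) -> PurePosixPath:
    """library.bib + entry key -> library-bib4llm/key/key.md"""
    b = PurePosixPath(bib)
    return b.with_name(b.stem + SUFFIX) / key / f"{key}.md"
```

For example, `output_for_pdf_in_directory("pdf_dir", "pdf_dir/subfolder/paper2.pdf")` yields `pdf_dir-bib4llm/subfolder/paper2/paper2.md`, matching the tree shown for directory input.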

Future work

Fix the progress bar during convert (currently garbled by the interaction of tqdm, multiprocessing, and logger output)
Develop a VS Code extension that automatically starts the watch command based on a per-workspace setting (which .bib file to watch)
Add support for other PDF extraction tools such as llama-parse

Release history

0.2.2 (this version): Apr 1, 2025
0.2.1: Mar 20, 2025
0.2.0: Mar 7, 2025
0.1.5: Mar 7, 2025
0.1.4: Jan 23, 2025
0.1.3: Jan 23, 2025
0.1.2: Jan 23, 2025
0.1.0: Dec 19, 2024

Download files

Source Distribution

bib4llm-0.2.2.tar.gz (15.0 MB view details)

Uploaded Apr 1, 2025 Source

Built Distribution

bib4llm-0.2.2-py3-none-any.whl (29.6 kB view details)

Uploaded Apr 1, 2025 Python 3

File details

Details for the file bib4llm-0.2.2.tar.gz.

File metadata

Download URL: bib4llm-0.2.2.tar.gz
Upload date: Apr 1, 2025
Size: 15.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for bib4llm-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`e33f5d4fd9450d515e12840511ed938359e8b51742a54af3ef5817b7ddd071ec`
MD5	`6c17e3512f066802c98635042f63b6f2`
BLAKE2b-256	`f3044f69928a8ca81b39f73338153398b76b5b20c33fabcbbe4e17681d2a75a9`
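
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. This is a minimal sketch; the helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, this would be called as `matches_published_hash("bib4llm-0.2.2.tar.gz", "e33f5d4fd9450d515e12840511ed938359e8b51742a54af3ef5817b7ddd071ec")`.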

File details

Details for the file bib4llm-0.2.2-py3-none-any.whl.

File metadata

Download URL: bib4llm-0.2.2-py3-none-any.whl
Upload date: Apr 1, 2025
Size: 29.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for bib4llm-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d81dbe5bb008f0ffd718573582aad257870116efdd698ca46843b7d350f08052`
MD5	`5196d63b006909d68bbf3cb62acc1b1a`
BLAKE2b-256	`9705d842a0762eb61b8b525f04811f719d6974d52d872a0d9efa1ea63b97d12b`
