
A package for processing documents and generating questions and answers using LLMs on CPU.

Project description

FragenAntwortLLMCPU (Generating Efficient Question and Answer Pairs for LLM Fine-Tuning)


Incorporating question and answer pairs is crucial for creating accurate, context-aware, and user-friendly Large Language Models (LLMs).
FragenAntwortLLMCPU is a Python package for processing PDF documents and generating efficient Q&A pairs using LLMs on CPU only.
These Q&A pairs can be used to fine-tune an LLM or to build specialized training datasets.

The package leverages various NLP libraries and, via the ctransformers backend, supports multiple GGUF models, including:

  • Mistral-7B-Instruct v0.3 GGUF (default)
  • Qwen1.5-7B-Chat GGUF


Table of Contents

  • Installation
  • Usage
  • Parameter Explanation
  • Model Selection
  • Features
  • Contributing
  • License
  • Authors

Installation

You can install the package from PyPI.

Linux / macOS

pip install FragenAntwortLLMCPU
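To confirm the package is importable after installation, a quick standard-library check (illustrative) is:

```python
import importlib.util

def is_installed(name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(name) is not None

print("FragenAntwortLLMCPU installed:", is_installed("FragenAntwortLLMCPU"))
```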

Windows Users with Anaconda

Due to a conflict between the pip and conda versions of the tbb package, Windows Anaconda users may need extra steps.

Step 1: Uninstall tbb manually (if previously installed)

  1. Find the installation location:

    conda list tbb
    
  2. Navigate to the directory shown and delete the tbb package files.

Step 2: Install tbb using conda

conda install -c conda-forge tbb

Step 3: Install FragenAntwortLLMCPU with pip

pip install FragenAntwortLLMCPU

Alternative: Force reinstall with pip

If you prefer not to manually remove tbb, you can try:

pip install --ignore-installed --force-reinstall FragenAntwortLLMCPU

Usage

Here is an example of how to use the DocumentProcessor:

from FragenAntwortLLMCPU import DocumentProcessor

processor = DocumentProcessor(
    book_path="/path/to/your/book/",      # Directory containing the PDF
    temp_folder="/path/to/temp/folder",
    output_file="/path/to/output/QA.jsonl",
    book_name="example.pdf",
    start_page=9,
    end_page=77,
    number_Q_A="one",                     # written number: "one", "two", ...
    target_information="foods and locations",
    max_new_tokens=1000,
    temperature=0.1,
    context_length=2100,
    max_tokens_chunk=400,
    arbitrary_prompt="",
    model_family="mistral",               # or "qwen"
    # hf_token="your_hf_token_here",      # optional, can also come from env vars
)

processor.process_book()
processor.generate_prompts()
processor.save_to_jsonl()
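Once save_to_jsonl() has run, the output file contains one JSON object per line. A minimal sketch for loading it back (the exact keys in each record depend on the package's output format, so none are assumed here):

```python
import json

def load_qa_pairs(path):
    """Read one JSON object per line from a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# pairs = load_qa_pairs("/path/to/output/QA.jsonl")
# print(f"Loaded {len(pairs)} Q&A records")
```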

Parameter Explanation

  • book_path: Directory path where your PDF files are stored.
  • temp_folder: Directory for temporary output (e.g., intermediate Q&A text files).
  • output_file: Path to the final JSONL file containing the formatted Q&A pairs.
  • book_name: Name of the PDF file to process.
  • start_page: Starting page number for processing (1-based in the example).
  • end_page: Ending page number for processing (1-based in the example).
  • number_Q_A: Number of questions and answers to generate (as a written number, e.g. "one", "five").
  • target_information: The focus of the questions and answers. You can specify domain-specific entities like
    "genes, diseases, locations" or "people, organizations, agreements".
  • max_new_tokens: Maximum number of tokens to generate per response.
  • temperature: Sampling temperature for the LLM (higher = more diverse).
  • context_length: Maximum context length for the LLM.
  • max_tokens_chunk: Maximum number of tokens per text chunk before sending to the LLM.
  • arbitrary_prompt: Custom prompt to override the default question-generation instructions.
  • model_family: Selects which underlying LLM to use. Supported values:
    • "mistral" (default)
    • "qwen"
  • hf_token (optional): Hugging Face API token. If not provided, the package looks for HUGGINGFACEHUB_API_TOKEN or HF_TOKEN in the environment and may also prompt for it interactively.

Model Selection

FragenAntwortLLMCPU uses ctransformers with GGUF models. You can choose the model family with the model_family parameter:

  • model_family="mistral"
    Uses a Mistral-7B-Instruct v0.3 GGUF checkpoint.

  • model_family="qwen"
    Uses a Qwen1.5-7B-Chat GGUF checkpoint.

You must download the appropriate .gguf files and ensure their filenames and locations match the configuration in document_processor.py. These files are not bundled in the Python package.
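As a sanity check before running, you can verify that a matching .gguf file is present. This helper is hypothetical (not part of the package); the filenames actually expected are defined in document_processor.py:

```python
from pathlib import Path

def find_gguf(model_dir, model_family):
    """Return the first .gguf file whose name mentions the model family, or None."""
    for path in sorted(Path(model_dir).glob("*.gguf")):
        if model_family.lower() in path.name.lower():
            return path
    return None

# find_gguf("/path/to/models", "mistral")
```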

If a Hugging Face token is required (for private or gated models), you can:

  • Set HUGGINGFACEHUB_API_TOKEN or HF_TOKEN in your environment, or
  • Pass hf_token="..." to DocumentProcessor.
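The lookup order above can be sketched as follows (illustrative only; the package's actual logic may differ, and the interactive prompt is omitted here):

```python
import os

def resolve_hf_token(explicit_token=None):
    """Explicit argument first, then the two environment variables."""
    return (explicit_token
            or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
            or os.environ.get("HF_TOKEN"))
```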

Features

  • Extracts text from PDF documents.
  • Splits text into manageable chunks for LLM processing.
  • Generates efficient question–answer pairs based on specific target information.
  • Supports custom prompts for question generation.
  • Runs entirely on CPU (no GPU required).
  • Supports multiple GGUF models:
    • Mistral-7B-Instruct v0.3 (default)
    • Qwen1.5-7B-Chat
  • Accepts PDF input in multiple languages (e.g. French, German, English) and generates Q&A pairs in English.
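The chunking step can be illustrated with a small sketch. Whitespace splitting is used here as a stand-in for real tokenization; the package's own chunker (governed by max_tokens_chunk) may count tokens differently:

```python
def chunk_text(text, max_tokens_chunk=400):
    """Split text into pieces of at most max_tokens_chunk whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens_chunk])
            for i in range(0, len(words), max_tokens_chunk)]
```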

Contributing

Contributions are welcome!
Please fork the repository, open issues for bugs or feature requests, and submit pull requests with your improvements.


License

This project is licensed under the MIT License.
See the LICENSE file for details.


Authors

  • Mehrdad Almasi
  • Lars Wieneke
  • Demival Vasques
