A package for processing documents and generating questions and answers using LLMs on CPU.
Project description
FragenAntwortLLMCPU (Generating Efficient Question and Answer Pairs for LLM Fine-Tuning)
Incorporating question and answer pairs is crucial for creating accurate, context-aware, and user-friendly Large Language Models (LLMs).
FragenAntwortLLMCPU is a Python package for processing PDF documents and generating efficient Q&A pairs using LLMs on CPU only.
These Q&A pairs can be used to fine-tune an LLM or to build specialized training datasets.
The package leverages various NLP libraries and supports multiple GGUF models, including:
- Mistral-7B-Instruct v0.3 GGUF (default)
- Qwen1.5-7B-Chat GGUF
via the ctransformers backend.
Installation
You can install the package from PyPI.
Linux / macOS
pip install FragenAntwortLLMCPU
Windows Users with Anaconda
Due to package conflicts involving tbb, a few extra steps may be needed.
Step 1: Uninstall tbb manually (if previously installed)
- Find the installation location:
  conda list tbb
- Navigate to the directory shown and remove the tbb package files (the directory associated with tbb).
Step 2: Install tbb using conda
conda install -c conda-forge tbb
Step 3: Install FragenAntwortLLMCPU with pip
pip install FragenAntwortLLMCPU
Alternative: Force reinstall with pip
If you prefer not to manually remove tbb, you can try:
pip install --ignore-installed --force-reinstall FragenAntwortLLMCPU
Usage
Here is an example of how to use the DocumentProcessor:
from FragenAntwortLLMCPU import DocumentProcessor
processor = DocumentProcessor(
book_path="/path/to/your/book/", # Directory containing the PDF
temp_folder="/path/to/temp/folder",
output_file="/path/to/output/QA.jsonl",
book_name="example.pdf",
start_page=9,
end_page=77,
number_Q_A="one", # written number: "one", "two", ...
target_information="foods and locations",
max_new_tokens=1000,
temperature=0.1,
context_length=2100,
max_tokens_chunk=400,
arbitrary_prompt="",
model_family="mistral", # or "qwen"
# hf_token="your_hf_token_here", # optional, can also come from env vars
)
processor.process_book()
processor.generate_prompts()
processor.save_to_jsonl()
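After save_to_jsonl() runs, the output file contains one JSON object per line. A minimal sketch for loading the generated pairs back into Python (the exact field names in each object depend on the package's output format, so treat "question"/"answer" below as an assumption and check your own file):

```python
import json

def load_qa_pairs(path):
    """Read a JSONL file and return one dict per non-empty line."""
    pairs = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines defensively
                pairs.append(json.loads(line))
    return pairs
```

This is a convenient shape for feeding the pairs into a fine-tuning pipeline or inspecting them manually.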
Parameter Explanation
- book_path: Directory path where your PDF files are stored.
- temp_folder: Directory for temporary output (e.g., intermediate Q&A text files).
- output_file: Path to the final JSONL file containing the formatted Q&A pairs.
- book_name: Name of the PDF file to process.
- start_page: Starting page number for processing (1-based in the example).
- end_page: Ending page number for processing (1-based in the example).
- number_Q_A: Number of questions and answers to generate, as a written number (e.g. "one", "five").
- target_information: The focus of the questions and answers. You can specify domain-specific entities such as "genes, diseases, locations" or "people, organizations, agreements".
- max_new_tokens: Maximum number of tokens to generate per response.
- temperature: Sampling temperature for the LLM (higher = more diverse).
- context_length: Maximum context length for the LLM.
- max_tokens_chunk: Maximum number of tokens per text chunk before sending to the LLM.
- arbitrary_prompt: Custom prompt to override the default question-generation instructions.
- model_family: Selects which underlying LLM to use. Supported values: "mistral" (default) and "qwen".
- hf_token (optional): Hugging Face API token. If not provided, the package looks for HUGGINGFACEHUB_API_TOKEN or HF_TOKEN in the environment, and can also prompt interactively.
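To illustrate how max_tokens_chunk bounds what is sent to the model, here is a minimal chunking sketch. It uses whitespace tokens as a stand-in for the package's real tokenizer, and the function name is an assumption for illustration, not the package's internal API:

```python
def chunk_text(text, max_tokens_chunk=400):
    """Split text into chunks of at most max_tokens_chunk whitespace
    tokens (a simplification; the package may count tokens differently)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens_chunk])
        for i in range(0, len(tokens), max_tokens_chunk)
    ]
```

Keeping chunks safely below context_length leaves room for the prompt template and the generated answer.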
Model Selection
FragenAntwortLLMCPU uses ctransformers with GGUF models. You can choose the model family with the
model_family parameter:
- model_family="mistral": uses a Mistral-7B-Instruct v0.3 GGUF checkpoint.
- model_family="qwen": uses a Qwen1.5-7B-Chat GGUF checkpoint.
You must download the appropriate .gguf files and ensure their filenames and locations match the
configuration in document_processor.py. These files are not bundled in the Python package.
If a Hugging Face token is required (for private or gated models), you can:
- Set HUGGINGFACEHUB_API_TOKEN or HF_TOKEN in your environment, or
- Pass hf_token="..." to DocumentProcessor.
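The token fallback order described above can be sketched as follows. This is a plain illustration of the documented behavior, not the package's actual code (and it omits the interactive prompt step):

```python
import os

def resolve_hf_token(hf_token=None):
    """Return an explicit token if given, otherwise fall back to the
    HUGGINGFACEHUB_API_TOKEN and then HF_TOKEN environment variables."""
    return (
        hf_token
        or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
        or os.environ.get("HF_TOKEN")
    )
```

An explicit hf_token argument always wins, which makes it easy to override the environment in tests or notebooks.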
Features
- Extracts text from PDF documents.
- Splits text into manageable chunks for LLM processing.
- Generates efficient question–answer pairs based on specific target information.
- Supports custom prompts for question generation.
- Runs entirely on CPU (no GPU required).
- Supports multiple GGUF models:
- Mistral-7B-Instruct v0.3 (default)
- Qwen1.5-7B-Chat
- Accepts PDF input in multiple languages (e.g. French, German, English) and generates Q&A pairs in English.
Contributing
Contributions are welcome!
Please fork the repository, open issues for bugs or feature requests, and submit pull requests with your improvements.
License
This project is licensed under the MIT License.
See the LICENSE file for details.
Authors
- Mehrdad Almasi
- Lars Wieneke
- Demival Vasques
File details
Details for the file fragenantwortllmcpu-0.1.19.tar.gz.
File metadata
- Download URL: fragenantwortllmcpu-0.1.19.tar.gz
- Size: 9.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | c06b4307c038e9e0ae8351638a397e0138b7edbae4ad67ec8777cd4806dcdac5 |
| MD5 | ec2cd22b7e0b775223209a74e8cd923f |
| BLAKE2b-256 | 7c035c2236fc4789d59044a7ed7238a20db77075d888367dd3dcb5467af394ed |
File details
Details for the file fragenantwortllmcpu-0.1.19-py3-none-any.whl.
File metadata
- Download URL: fragenantwortllmcpu-0.1.19-py3-none-any.whl
- Size: 9.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.13

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 145f978c066e052107f25d55857c5e2690fad3ef766c638043263c3552b29c18 |
| MD5 | a44ddacb2d872b57fa4f1eb237d1702e |
| BLAKE2b-256 | b487648ac49162288a95361ca6fa934b4f9688b00815522dfa42b7822654ba81 |