
A package for processing documents and generating questions and answers using LLMs on CPU.

Project description

FragenAntwortLLMCPU (Generating Efficient Question and Answer Pairs for LLM Fine-Tuning)


Incorporating question and answer pairs is crucial for creating accurate, context-aware, and user-friendly Large Language Models (LLMs).
FragenAntwortLLMCPU is a Python package for processing PDF documents and generating efficient Q&A pairs using LLMs on CPU only.
These Q&A pairs can be used to fine-tune an LLM or to build specialized training datasets.

The package leverages various NLP libraries and, via the ctransformers backend, supports multiple GGUF models, including:

  • Mistral-7B-Instruct v0.3 GGUF (default)
  • Qwen1.5-7B-Chat GGUF


Table of Contents

  • Installation
  • Usage
  • Parameter Explanation
  • Model Selection
  • Features
  • Contributing
  • License
  • Authors

Installation

You can install the package from PyPI.

Linux / macOS

pip install FragenAntwortLLMCPU
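To confirm the package is importable after installation, a quick standard-library check (illustrative) is:

```python
import importlib.util

def is_installed(name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(name) is not None

print("FragenAntwortLLMCPU installed:", is_installed("FragenAntwortLLMCPU"))
```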

Windows Users with Anaconda

Due to a conflict between the pip and conda versions of the tbb package, Windows Anaconda users may need extra steps.

Step 1: Uninstall tbb manually (if previously installed)

  1. Find the installation location:

    conda list tbb
    
  2. Navigate to the directory shown and delete the tbb package files.

Step 2: Install tbb using conda

conda install -c conda-forge tbb

Step 3: Install FragenAntwortLLMCPU with pip

pip install FragenAntwortLLMCPU

Alternative: Force reinstall with pip

If you prefer not to manually remove tbb, you can try:

pip install --ignore-installed --force-reinstall FragenAntwortLLMCPU

Usage

Here is an example of how to use the DocumentProcessor:

from FragenAntwortLLMCPU import DocumentProcessor

processor = DocumentProcessor(
    book_path="/path/to/your/book/",      # Directory containing the PDF
    temp_folder="/path/to/temp/folder",
    output_file="/path/to/output/QA.jsonl",
    book_name="example.pdf",
    start_page=9,
    end_page=77,
    number_Q_A="one",                     # written number: "one", "two", ...
    target_information="foods and locations",
    max_new_tokens=1000,
    temperature=0.1,
    context_length=2100,
    max_tokens_chunk=400,
    arbitrary_prompt="",
    model_family="mistral",               # or "qwen"
    # hf_token="your_hf_token_here",      # optional, can also come from env vars
)

processor.process_book()
processor.generate_prompts()
processor.save_to_jsonl()
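Once save_to_jsonl() has run, the output file contains one JSON object per line. A minimal sketch for loading it back (the exact keys in each record depend on the package's output format, so none are assumed here):

```python
import json

def load_qa_pairs(path):
    """Read one JSON object per line from a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# pairs = load_qa_pairs("/path/to/output/QA.jsonl")
# print(f"Loaded {len(pairs)} Q&A records")
```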

Parameter Explanation

  • book_path: Directory path where your PDF files are stored.
  • temp_folder: Directory for temporary output (e.g., intermediate Q&A text files).
  • output_file: Path to the final JSONL file containing the formatted Q&A pairs.
  • book_name: Name of the PDF file to process.
  • start_page: Starting page number for processing (1-based in the example).
  • end_page: Ending page number for processing (1-based in the example).
  • number_Q_A: Number of questions and answers to generate (as a written number, e.g. "one", "five").
  • target_information: The focus of the questions and answers. You can specify domain-specific entities like
    "genes, diseases, locations" or "people, organizations, agreements".
  • max_new_tokens: Maximum number of tokens to generate per response.
  • temperature: Sampling temperature for the LLM (higher = more diverse).
  • context_length: Maximum context length for the LLM.
  • max_tokens_chunk: Maximum number of tokens per text chunk before sending to the LLM.
  • arbitrary_prompt: Custom prompt to override the default question-generation instructions.
  • model_family: Selects which underlying LLM to use. Supported values:
    • "mistral" (default)
    • "qwen"
  • hf_token (optional): Hugging Face API token. If not provided, the package looks for HUGGINGFACEHUB_API_TOKEN or HF_TOKEN in the environment and may also prompt for it interactively.

Model Selection

FragenAntwortLLMCPU uses ctransformers with GGUF models. You can choose the model family with the model_family parameter:

  • model_family="mistral"
    Uses a Mistral-7B-Instruct v0.3 GGUF checkpoint.

  • model_family="qwen"
    Uses a Qwen1.5-7B-Chat GGUF checkpoint.

You must download the appropriate .gguf files and ensure their filenames and locations match the configuration in document_processor.py. These files are not bundled in the Python package.
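As a sanity check before running, you can verify that a matching .gguf file is present. This helper is hypothetical (not part of the package); the filenames actually expected are defined in document_processor.py:

```python
from pathlib import Path

def find_gguf(model_dir, model_family):
    """Return the first .gguf file whose name mentions the model family, or None."""
    for path in sorted(Path(model_dir).glob("*.gguf")):
        if model_family.lower() in path.name.lower():
            return path
    return None

# find_gguf("/path/to/models", "mistral")
```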

If a Hugging Face token is required (for private or gated models), you can:

  • Set HUGGINGFACEHUB_API_TOKEN or HF_TOKEN in your environment, or
  • Pass hf_token="..." to DocumentProcessor.
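The lookup order above can be sketched as follows (illustrative only; the package's actual logic may differ, and the interactive prompt is omitted here):

```python
import os

def resolve_hf_token(explicit_token=None):
    """Explicit argument first, then the two environment variables."""
    return (explicit_token
            or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
            or os.environ.get("HF_TOKEN"))
```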

Features

  • Extracts text from PDF documents.
  • Splits text into manageable chunks for LLM processing.
  • Generates efficient question–answer pairs based on specific target information.
  • Supports custom prompts for question generation.
  • Runs entirely on CPU (no GPU required).
  • Supports multiple GGUF models:
    • Mistral-7B-Instruct v0.3 (default)
    • Qwen1.5-7B-Chat
  • Accepts PDF input in multiple languages (e.g. French, German, English) and generates Q&A pairs in English.
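The chunking step can be illustrated with a small sketch. Whitespace splitting is used here as a stand-in for real tokenization; the package's own chunker (governed by max_tokens_chunk) may count tokens differently:

```python
def chunk_text(text, max_tokens_chunk=400):
    """Split text into pieces of at most max_tokens_chunk whitespace tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens_chunk])
            for i in range(0, len(words), max_tokens_chunk)]
```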

Contributing

Contributions are welcome!
Please fork the repository, open issues for bugs or feature requests, and submit pull requests with your improvements.


License

This project is licensed under the MIT License.
See the LICENSE file for details.


Authors

  • Mehrdad Almasi
  • Lars Wieneke
  • Demival Vasques
