
A package for processing documents and generating questions and answers using LLMs on CPU.


FragenAntwortLLMCPU: Generating Efficient Question and Answer Pairs for LLM Fine-Tuning


Incorporating question and answer pairs is crucial for creating accurate, context-aware, and user-friendly large language models (LLMs). FragenAntwortLLMCPU is a Python package for processing PDF documents and generating efficient Q&A pairs with LLMs on CPU; the resulting pairs can be used to fine-tune an LLM. It leverages several NLP libraries and the Mistral v1 model (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF) to achieve this.


Installation

To install the required dependencies, follow these steps:

For Linux Users

You can directly install the package using pip:

pip install FragenAntwortLLMCPU

For Windows Users with Anaconda

Due to a conflict around the tbb package, a few extra steps are needed to install FragenAntwortLLMCPU successfully.

Step 1: Uninstall tbb manually (if you have installed it previously)

  1. Find the installation location: Open Anaconda Prompt and run:

    conda list tbb
    

    This will show you the location where tbb is installed.

  2. Remove the tbb package: Navigate to the location shown by conda list and manually delete the files associated with tbb (typically the tbb directory itself).

Step 2: Install tbb using conda

  1. Install tbb with conda: After manually removing the package, install tbb from the conda-forge channel:

    conda install -c conda-forge tbb
    

Step 3: Install the package using pip

Now that tbb is managed by conda, you can install FragenAntwortLLMCPU without conflicts:

pip install FragenAntwortLLMCPU

Alternative: Force reinstall with pip

If you prefer not to remove tbb manually, you can instead force a reinstallation with pip:

pip install --ignore-installed --force-reinstall FragenAntwortLLMCPU

Usage

Here's an example of how to use the Document Processor:

from FragenAntwortLLMCPU import DocumentProcessor

processor = DocumentProcessor(
    book_path="/path/to/your/book/",  # Directory containing the PDF (path only, no file name)
    temp_folder="/path/to/temp/folder",
    output_file="/path/to/output/QA.jsonl",
    book_name="example.pdf",
    start_page=9,
    end_page=77,
    number_Q_A="one",  # This should be a written number like "one", "two", etc.
    target_information="foods and locations", 
    max_new_tokens=1000,
    temperature=0.1,
    context_length=2100,
    max_tokens_chunk=400,
    arbitrary_prompt=""
)

processor.process_book()
processor.generate_prompts()
processor.save_to_jsonl()

Explanation

  • book_path: The directory path where your PDF book files are stored.
  • temp_folder: The directory where temporary files will be stored.
  • output_file: The path to the output JSONL file where the Q&A pairs will be saved.
  • book_name: The name of the book PDF file to process.
  • start_page: The starting page number for processing.
  • end_page: The ending page number for processing.
  • number_Q_A: The number of questions and answers to generate (as a written number).
  • target_information: The focus of the generated Q&A pairs. List the kinds of entities you want questions about, such as gene names, disease names, or locations. For example, for a medical dataset you might specify "genes, diseases, locations".
  • max_new_tokens: The maximum number of new tokens the LLM may generate per response.
  • temperature: The sampling temperature for the LLM.
  • context_length: The maximum context length for the LLM.
  • max_tokens_chunk: The maximum number of tokens per text chunk.
  • arbitrary_prompt: A custom prompt for generating questions and answers.
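After save_to_jsonl() runs, the output file contains one JSON object per line (the JSON Lines format). A minimal sketch of reading the pairs back for inspection, assuming the records use "question" and "answer" keys (the actual key names produced by the package may differ):

```python
import json
import tempfile
from pathlib import Path

def load_qa_pairs(jsonl_path):
    """Read Q&A records from a JSON Lines file: one JSON object per line."""
    records = []
    with open(jsonl_path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Demo with a hand-written record; "question"/"answer" are assumed key names.
tmp = Path(tempfile.mkdtemp()) / "QA.jsonl"
tmp.write_text('{"question": "Where is Lyon?", "answer": "In France."}\n', encoding="utf-8")
pairs = load_qa_pairs(tmp)
print(pairs[0]["question"])  # Where is Lyon?
```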

Features

  • Extracts text from PDF documents
  • Splits text into manageable chunks for processing
  • Generates efficient question and answer pairs based on specific target information
  • Supports custom prompts for question generation
  • Runs on CPU: no GPU is required to run the code.
  • Uses Mistral Model (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF): Utilizes the CTransformers version of the Mistral v1 model.
  • Multilingual Input: Accepts PDF books in French, German, or English and generates Q&A pairs in English.
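The chunking step above can be pictured with a simple word-count sketch. This is illustrative only: the package's own splitting is token-aware and governed by max_tokens_chunk, whereas the toy function below counts whitespace-separated words.

```python
def chunk_words(text, max_words):
    """Greedily split text into chunks of at most max_words words.
    A rough stand-in for token-aware chunking."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_words("one two three four five", 2)
print(chunks)  # ['one two', 'three four', 'five']
```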

Contributing

Contributions are welcome! Please fork this repository and submit pull requests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

  • Mehrdad Almasi, Lars Wieneke, and Demival Vasques Filho

Contact

For questions or feedback, please contact Mehrdad.al.2023@gmail.com, lars.wieneke@gmail.com, demival.vasques@uni.lu.

