A package for processing documents and generating questions and answers using LLMs on GPU and CPU.

FragenAntwortLLMGPU: Generating Efficient Question and Answer Pairs for LLM Fine-Tuning

Incorporating question and answer pairs is crucial for creating accurate, context-aware, and user-friendly Large Language Models. FragenAntwortLLMGPU is a Python package for processing PDF documents and generating efficient Q&A pairs with large language models (LLMs) on CPU and GPU; the resulting pairs can be used to fine-tune an LLM. It leverages various NLP libraries and the Mistral-7B-Instruct-v0.1 model (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF) to achieve this.

Installation

To install the required dependencies, follow these steps:

For Linux Users

  1. Install the required version of torch manually:

    pip install torch==2.0.0+cu118 -f https://download.pytorch.org/whl/cu118/torch_stable.html
    
  2. Then, install the package using pip:

    pip install FragenAntwortLLMGPU
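
To verify the installation, you can run a quick import check (a simple sanity check, not part of the package's documented workflow):

    python -c "from FragenAntwortLLMGPU import DocumentProcessor; print('import OK')"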
    

Usage

Here's an example of how to use the DocumentProcessor class:

from FragenAntwortLLMGPU import DocumentProcessor

processor = DocumentProcessor(
    book_path="/path/to/your/book/",  # Directory containing the PDF; do not include the file name
    temp_folder="/path/to/temp/folder",
    output_file="/path/to/output/QA.jsonl",
    book_name="example.pdf",
    start_page=9,
    end_page=77,
    gpu_layers=100,
    number_Q_A="five",  # A written number such as "one", "two", etc.
    target_information="foods and locations",
    max_new_tokens=1000,
    temperature=0.1,
    context_length=2100,
    max_tokens_chunk=800,
    arbitrary_prompt=""
)

processor.process_book()
processor.generate_prompts()
processor.save_to_jsonl()
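
Once save_to_jsonl() has finished, the Q&A pairs are written to the output_file path as JSON Lines, one record per line. Here is a minimal sketch for inspecting the result; the exact field names are defined by the package, so this prints each record as parsed rather than assuming a schema:

import json

# Read the generated Q&A file back: each non-empty line is one JSON record.
with open("/path/to/output/QA.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if line.strip():
            record = json.loads(line)
            print(f"Record {i}: {record}")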

Usage (Docker)

Below is an example Python script that runs the Docker container with additional parameters. Please visit the Docker Hub repository for more information.

First, ensure you have the docker package installed:

pip install docker
nano run_docker.py

Copy the following lines and paste them into run_docker.py:

# Example Python script to work with docker:

import os
import time

import docker
from huggingface_hub import hf_hub_download

# Initialize the Docker client
client = docker.from_env()

# Define the parameters
book_path = "/home/devuser/BOOKS"  # Change "/home/devuser/" to the path of the folder containing the PDF file
temp_folder = "/home/devuser/TEMP" # Change "/home/devuser/" to the path of the TEMP folder on your machine
output_file = "/home/devuser/QA.jsonl" # Change "/home/devuser/" to the path on your machine
book_name = "TEST.pdf"
start_page = 2
end_page = 5
gpu_layers = 0   # Set to 0 if you do not have access to a GPU
number_Q_A = "five"
target_information = "Recommender Systems"
max_new_tokens = 1000
temperature = 0.1
context_length = 2100
max_tokens_chunk = 500
arbitrary_prompt = ""

# Function to retry download
def retry_download(model_name, filename, retries=3, wait=10):
    for i in range(retries):
        try:
            return hf_hub_download(model_name, filename)
        except Exception as e:
            print(f"Download failed: {e}. Retrying {i+1}/{retries}...")
            time.sleep(wait)
    raise RuntimeError("Failed to download model after multiple attempts.")

# Ensure model files are downloaded before running the container
retry_download('bert-base-uncased', 'tokenizer.json')

# Run the container with the correct parameters
container = client.containers.run(
    'mehrdadal/fragenantwortllmgpu:latest',
    command=f'python -c "from FragenAntwortLLMGPU import DocumentProcessor; '
            f'processor = DocumentProcessor('
            f'book_path=\'/app/book\', '  # Adjusted for container's internal path (if needed)
            f'temp_folder=\'/app/temp\', '  # Adjusted for container's internal path (if needed)
            f'output_file=\'/app/output/QA.jsonl\', '  # Adjusted for container's internal path (if needed)
            f'book_name=\'{book_name}\', '
            f'start_page={start_page}, '
            f'end_page={end_page}, '
            f'gpu_layers={gpu_layers}, '
            f'number_Q_A=\'{number_Q_A}\', '
            f'target_information=\'{target_information}\', '
            f'max_new_tokens={max_new_tokens}, '
            f'temperature={temperature}, '
            f'context_length={context_length}, '
            f'max_tokens_chunk={max_tokens_chunk}, '
            f'arbitrary_prompt=\'{arbitrary_prompt}\'); '
            f'processor.process_book(); '
            f'processor.generate_prompts(); '
            f'processor.save_to_jsonl()"',
    volumes={
        book_path: {'bind': '/app/book', 'mode': 'rw'},    # Host folder containing the PDF
        temp_folder: {'bind': '/app/temp', 'mode': 'rw'},  # Host TEMP folder
        os.path.dirname(output_file): {'bind': '/app/output', 'mode': 'rw'}  # Host folder for QA.jsonl
    },
    detach=True
)

# Wait for the container to finish
container.wait()

# Fetch the logs (if needed)
logs = container.logs()
print(logs.decode('utf-8'))

# Remove the container
container.remove()

Run the Python script: save the file in nano (CTRL+O to write, CTRL+X to exit) and execute it:

python run_docker.py
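
The script above sets gpu_layers = 0, and the containers.run call does not expose the host GPU to the container, so a nonzero gpu_layers would have no effect as written. If the host has the NVIDIA Container Toolkit installed, docker-py can request GPU access via device_requests. A minimal sketch (the nvidia-smi check assumes the image includes that utility):

import docker

client = docker.from_env()

# Request all host GPUs for the container (count=-1 means "all").
# Requires the NVIDIA Container Toolkit on the host machine.
container = client.containers.run(
    'mehrdadal/fragenantwortllmgpu:latest',
    command='nvidia-smi',  # quick check that the GPU is visible inside the container
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])],
    detach=True
)
container.wait()
print(container.logs().decode('utf-8'))
container.remove()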

Explanation

  • book_path: The directory path where your PDF book files are stored.
  • temp_folder: The directory where temporary files will be stored.
  • output_file: The path to the output JSONL file where the Q&A pairs will be saved.
  • book_name: The name of the book PDF file to process.
  • start_page: The starting page number for processing.
  • end_page: The ending page number for processing.
  • gpu_layers: The number of model layers to offload to the GPU. Adjust this value based on your GPU's memory capacity and the model's architecture; offloading more layers can improve performance but requires more memory. Example: if you have a powerful GPU, you might set gpu_layers=1000 to maximize performance.
  • number_Q_A: The number of questions and answers to generate (as a written number).
  • target_information: The focus of the generated questions and answers. List the kinds of entities you want Q&A pairs about, such as gene names, disease names, or locations. For example, for a medical dataset you might specify "genes, diseases, locations".
  • max_new_tokens: The maximum number of tokens to generate.
  • temperature: The temperature rate for the LLM.
  • context_length: The maximum context length for the LLM.
  • max_tokens_chunk: The maximum number of tokens per text chunk (see the chunking sketch after this list).
  • arbitrary_prompt: A custom prompt for generating questions and answers.
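
As a rough illustration of what max_tokens_chunk controls, here is a sketch of token-bounded chunking using the bert-base-uncased tokenizer that the Docker script already downloads. This is not the package's internal algorithm, just a minimal example of splitting text so that no chunk exceeds a token budget:

from transformers import AutoTokenizer

# Load the tokenizer once (same model the Docker script downloads).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, max_tokens_chunk=500):
    """Split text into chunks of at most max_tokens_chunk tokens each.
    Illustrative only; FragenAntwortLLMGPU's own chunking may differ."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[start:start + max_tokens_chunk])
        for start in range(0, len(token_ids), max_tokens_chunk)
    ]

chunks = chunk_text("Some long extracted PDF text. " * 400, max_tokens_chunk=500)
print(f"{len(chunks)} chunks; first begins: {chunks[0][:60]!r}")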

Features

  • Extracts text from PDF documents
  • Splits text into manageable chunks for processing
  • Generates efficient question and answer pairs based on specific target information
  • Supports custom prompts for question generation
  • Runs on CPU and GPU
  • Uses the Mistral model: utilizes the CTransformers (GGUF) version of Mistral-7B-Instruct-v0.1 (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF)
  • Multilingual Input: Accepts PDF books in French, German, or English and generates Q&A pairs in English.

Contributing

Contributions are welcome! Please fork this repository and submit pull requests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

  • Mehrdad Almasi, Lars Wieneke, and Demival Vasques Filho

Contact

For questions or feedback, please contact Mehrdad.al.2023@gmail.com, lars.wieneke@gmail.com, or demival.vasques@uni.lu.
