This is a helper library to push data to HuggingFace.

huggify-data

Introduction

huggify-data 📦 is a Python library 🐍 designed to simplify the process of scraping .pdf documents, generating question-answer pairs with the OpenAI API, conversing with the document, and uploading datasets 📊 to the Hugging Face Hub 🤗. The library lets you verify ✅, process 🔄, and push 🚀 your pandas DataFrame directly to Hugging Face, making it easier to share and collaborate 🤝 on datasets. The latest version also lets you fine-tune the Llama2 model on your own proprietary data, enhancing its capabilities even further. As the name suggests, huggify-data aims to wrap your data workflow in warmth, comfort, and user-friendly interactions, making data handling feel as reassuring and pleasant as a hug.

Repo

You can access the repo here: ✨ Huggify Data ✨

Installation

To use huggify-data, ensure you have the necessary libraries installed. You can easily install them using pip:

pip install huggify-data

Notebooks

We have made tutorial notebooks available to guide you through the process step-by-step:

  • Step 1: Scrape any .pdf file and generate question-answer pairs. Link
  • Step 2: Fine-tune the Llama2 model on customized data. Link
  • Step 3: Perform inference on customized data. Link

Examples

Here's a complete example illustrating how to use huggify-data to scrape a PDF, generate question-answer pairs from its content, and save them locally as a .csv file:

from huggify_data.scrape_modules import *

# Example usage:
pdf_path = "path_of_pdf.pdf"
openai_api_key = "<sk-API_KEY_HERE>"
generator = PDFQnAGenerator(pdf_path, openai_api_key)
generator.process_scraped_content()
generator.generate_questions_answers()
df = generator.convert_to_dataframe()
print(df)

# Save the question-answer pairs locally; the file name is illustrative
df.to_csv("qna_pairs.csv", index=False)

Once you have a .csv file or a pd.DataFrame from the previous chunk of code, you can run the following code to iteratively generate a list of .md files from it.

from huggify_data.bot_modules import ChatBot
bot = ChatBot(api_key=openai_api_key)

from huggify_data.generate_md_modules import *

# This code generates a list of .md files from the DataFrame
# Make sure you are in the desired output directory first
markdown_generator = MarkdownGenerator(bot, df)
markdown_generator.generate_markdown()

After the .md files are generated, you can navigate here to build a chatbot that reads them in iteratively with a Retrieval-Augmented Generation (RAG) pipeline built on llama_index.
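The ingestion half of that pipeline can be sketched without extra dependencies. The helper below (its name and the flat-directory layout are illustrative assumptions, not part of huggify-data) simply gathers the generated .md files as text; a llama_index pipeline would then chunk, embed, and index these documents:

```python
from pathlib import Path

def load_markdown_docs(directory="."):
    """Collect generated .md files as (filename, text) pairs.

    This covers only the ingestion step of a RAG pipeline; the texts
    would subsequently be chunked, embedded, and indexed (e.g. with
    llama_index's VectorStoreIndex).
    """
    docs = []
    for md_file in sorted(Path(directory).glob("*.md")):
        docs.append((md_file.name, md_file.read_text(encoding="utf-8")))
    return docs
```

With llama_index installed, SimpleDirectoryReader performs the same file gathering, and VectorStoreIndex.from_documents builds the retrieval index on top of it.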

Once you have created a DataFrame of question-answer pairs, you can have a conversation with your data:

from huggify_data.bot_modules import *

current_prompt = "<question_about_the_document>"
chatbot = ChatBot(api_key=openai_api_key)
response = chatbot.run_rag(openai_api_key, current_prompt, df, top_n=2)
print(response)
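For intuition, the top_n argument presumably controls how many question-answer rows are retrieved as context before the model answers. A toy stand-in for that retrieval step (word-overlap scoring instead of real embeddings, a deliberate simplification rather than the library's actual implementation) might look like:

```python
import re

def retrieve_top_n(prompt, qa_pairs, top_n=2):
    """Return the top_n (question, answer) pairs whose questions share
    the most words with the prompt.

    Toy stand-in for embedding-based retrieval: production RAG scores
    similarity with vector embeddings, not raw word overlap.
    """
    tokenize = lambda text: set(re.findall(r"\w+", text.lower()))
    prompt_words = tokenize(prompt)
    ranked = sorted(
        qa_pairs,
        key=lambda pair: len(prompt_words & tokenize(pair[0])),
        reverse=True,
    )
    return ranked[:top_n]
```

The retrieved pairs are then stuffed into the prompt sent to the model, which is what grounds the response in your document.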

Moreover, you can push your data to the cloud. Here's a complete example illustrating how to use huggify-data to push data (assuming an existing .csv file with columns questions and answers) to the Hugging Face Hub:

import pandas as pd

from huggify_data.push_modules import DataFrameUploader

# Example usage:
df = pd.read_csv('/content/toy_data.csv')
uploader = DataFrameUploader(df, hf_token="<huggingface-token-here>", repo_name='<desired-repo-name>', username='<your-username>')
uploader.process_data()
uploader.push_to_hub()
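The uploader assumes the DataFrame carries questions and answers columns. A toy .csv in that schema can be built like this (file name and contents are illustrative):

```python
import pandas as pd

# Build a toy DataFrame in the schema DataFrameUploader expects:
# a 'questions' column and an 'answers' column.
toy_df = pd.DataFrame({
    "questions": ["What does huggify-data do?", "How is it installed?"],
    "answers": [
        "It pushes question-answer datasets to the Hugging Face Hub.",
        "pip install huggify-data",
    ],
})
toy_df.to_csv("toy_data.csv", index=False)

# Round-trip to confirm the expected columns survive.
df = pd.read_csv("toy_data.csv")
print(list(df.columns))
```

Once pushed, collaborators can pull the dataset back with datasets.load_dataset("<your-username>/<desired-repo-name>").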

Here's a complete example illustrating how to use the huggify-data library to fine-tune a Llama2 model (assuming you have a directory from Hugging Face ready):

from huggify_data.train_modules import *
from google.colab import userdata  # Colab secrets helper; outside Colab, set the token directly

# Parameters
model_name = "NousResearch/Llama-2-7b-chat-hf" # Recommended base model
dataset_name = "eagle0504/sample_toy_data_v9" # Desired name, e.g., <hf_user_id>/<desired_name>
new_model = "youthless-homeless-shelter-web-scrape-dataset-v4" # Desired name
huggingface_token = userdata.get('HF_TOKEN')  # reads the token from Colab secrets

# Initiate
trainer = LlamaTrainer(model_name, dataset_name, new_model, huggingface_token)
peft_config = trainer.configure_lora()
training_args = trainer.configure_training_arguments(num_train_epochs=1)

# Train
trainer.train_model(training_args, peft_config)

# Inference
some_model, some_tokenizer = trainer.load_model_and_tokenizer(
    base_model_path="NousResearch/Llama-2-7b-chat-hf",
    new_model_path=new_model,  # the fine-tuned model saved above
)

prompt = "hi, tell me a joke"
response = trainer.generate_response(
    some_model,
    some_tokenizer,
    prompt,
    max_len=200)
print(response)

To perform inference, please follow the example below:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="eagle0504/youthless-homeless-shelter-web-scrape-dataset-v4") # Same name as above
response = pipe("### Human: What is YSA? ### Assistant: ")
print(response[0]["generated_text"])
print(response[0]["generated_text"].split("### ")[-1])
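The split("### ") trick above works because the fine-tuning data uses a ### Human: … ### Assistant: … prompt format. A small helper (the function name is illustrative) makes the extraction slightly more robust by also stripping the Assistant: prefix:

```python
def extract_assistant_reply(generated_text):
    """Return the text after the last '### ' marker, minus the
    'Assistant:' prefix and surrounding whitespace."""
    last_chunk = generated_text.split("### ")[-1]
    if last_chunk.startswith("Assistant:"):
        last_chunk = last_chunk[len("Assistant:"):]
    return last_chunk.strip()

print(extract_assistant_reply(
    "### Human: What is YSA? ### Assistant: YSA is a youth shelter."
))  # YSA is a youth shelter.
```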

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or suggestions.

Contact

For any questions or support, please contact [eagle0504@gmail.com](mailto:eagle0504@gmail.com).

About Me

Hello there! I'm excited to share a bit about myself and my projects. Check out these links for more information:

Feel free to explore and connect with me! 😊

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

huggify_data-0.4.4.tar.gz (12.4 kB)

Built Distribution

huggify_data-0.4.4-py3-none-any.whl (14.1 kB)

