
A CLI tool to extract structured Q&A from documents using LLMs.


DAmon

Data Arrangement/Annotation via Simon's tool.

A CLI tool to extract structured Q&A from documents using Large Language Models (LLMs).


Features

  • Document Parsing: Automatically parses content from various document formats (PDF, CSV, DOCX, PPTX).
  • LLM-powered Q&A Extraction: Uses LiteLLM to call different LLMs (e.g., gemini/gemini-2.5-flash) and extract question-answer pairs along with the model's thought process.
  • Flexible Output: Exports extracted Q&A into JSONL, CSV, or Parquet formats.
  • Batch Processing: Processes single files or entire directories of documents.
  • Hugging Face Hub Integration: Easily push your extracted datasets to the Hugging Face Hub.

Installation

  1. Clone the repository:

    git clone https://github.com/simonliu-ai-project/DAmon.git
    cd DAmon
    
  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate
    
  3. Install the package and its dependencies:

    pip install -e .
    

    Alternatively, you can install dependencies from requirements.txt:

    pip install -r requirements.txt
    

Configuration

This tool uses litellm to interact with various LLM providers. You'll need to set up your API keys and specify the model.

Create a .env file in the root directory of the project and add your LiteLLM model and API key. For example, if you are using Gemini:

LITELLM_MODEL=gemini/gemini-2.0-flash
GEMINI_API_KEY=<gemini-api-key>

Refer to the LiteLLM documentation for details on configuring different LLM providers and their respective environment variables.
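
Before running an extraction, it can help to confirm that the .env file actually contains the settings described above. The following is a minimal standalone sketch, not part of DAmon: the helper names read_env_file and missing_settings are hypothetical, and the required variable names assume the Gemini example shown here (other providers use different key names).

```python
from pathlib import Path


def read_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values, required=("LITELLM_MODEL", "GEMINI_API_KEY")):
    """Return the required names that are absent or empty (names are assumptions)."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = read_env_file(env_path.read_text()) if env_path.exists() else {}
    print("missing settings:", missing_settings(values))
```

An empty list means both settings were found; it does not verify that the key itself is valid.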

Usage

The main command-line interface is damon.

Extracting Q&A

Use the extract command to process documents and extract Q&A pairs.

damon extract <INPUT_PATH> [OPTIONS]
  • <INPUT_PATH>: Path to the input document (file or directory).

Options:

  • --input-format [pdf|csv|docx|pptx|auto]: Format of the input document(s). Use "auto" to detect based on file extension. Default: auto.
  • --model TEXT: LiteLLM model name to use for Q&A extraction. Overrides LITELLM_MODEL from .env. Default: gemini/gemini-2.5-flash (or value from LITELLM_MODEL).
  • --output-path PATH: Path to save the extracted Q&A. Can be a file or a directory. If a directory, a timestamped file will be created. Default: results/output.jsonl.
  • --export-format [jsonl|csv|parquet]: Format for exporting the extracted Q&A. Default: jsonl.
  • --num-qa-pairs INTEGER: Number of Q&A pairs to extract per document. If not specified, extracts as many as possible.
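
The --output-path behaviour for directories can be illustrated with a short sketch. This is an assumption about how the timestamped name might be built, inferred only from the example filename output_20250630_140154.csv used later in this README; the function resolve_output_path is hypothetical and is not DAmon's actual code. It also assumes that a path without a file suffix is meant as a directory.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def resolve_output_path(output_path: str,
                        export_format: str = "jsonl",
                        now: Optional[datetime] = None) -> Path:
    """Return the file that results would be written to (illustrative only)."""
    path = Path(output_path)
    # Assumption: a path with a suffix (e.g. .csv) is a file and is used as-is.
    if path.suffix:
        return path
    # Assumption: directories get a file named output_<YYYYMMDD>_<HHMMSS>.<format>.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return path / f"output_{stamp}.{export_format}"


if __name__ == "__main__":
    print(resolve_output_path("results/manual_qa.csv", "csv"))
    print(resolve_output_path("results/", "jsonl"))
```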

Examples:

  1. Extract from a single PDF file, output to CSV:

    damon extract data/test.pdf --input-format pdf --output-path results/manual_qa.csv --export-format csv --model gemini/gemini-2.5-flash
    
  2. Process all documents in a directory, auto-detect format, output to JSONL:

    damon extract data/ --input-format auto --output-path results/ --export-format jsonl
    
  3. Extract a specific number of Q&A pairs from a DOCX file:

    damon extract documents/report.docx --input-format docx --num-qa-pairs 5 --output-path results/report_qa.jsonl
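
Once an extraction has produced a JSONL file, the records can be inspected with the Python standard library. The sketch below makes no assumption about the field names DAmon writes, since this README does not document them; it only parses one JSON object per line and reports which keys appear. The helper names parse_jsonl and field_names are hypothetical.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of objects, skipping blanks."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def field_names(records):
    """Return the sorted set of keys seen across all records."""
    return sorted({key for record in records for key in record})


if __name__ == "__main__":
    # The path matches the third example above; adjust it to your own output file.
    with open("results/report_qa.jsonl", encoding="utf-8") as fh:
        records = parse_jsonl(fh)
    print(len(records), "records with fields:", field_names(records))
```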
    

Pushing to Hugging Face Hub

Use the push command to upload your extracted dataset files to the Hugging Face Hub.

damon push --input-file <FILE_PATH> --repo-id <REPO_ID> [--split <SPLIT_NAME>]
  • --input-file <FILE_PATH>: Path to the data file to push (e.g., results/output.jsonl).
  • --repo-id <REPO_ID>: Hugging Face Hub repository ID (e.g., your-username/your-dataset-repo).
  • --split <SPLIT_NAME>: Optional. The name of the dataset split (e.g., train, validation, test). Defaults to train.

Prerequisites for pushing:

  • You need to have the datasets and huggingface_hub libraries installed (pip install datasets huggingface_hub).
  • You must be logged in to Hugging Face. Run huggingface-cli login in your terminal and follow the prompts.

Example:

damon push --input-file results/output_20250630_140154.csv --repo-id your-username/my-extracted-qa-dataset --split train
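
For readers who want to see what such an upload involves, here is a rough equivalent written directly against the datasets library. It is a sketch of one way to do it, not a description of how damon push is implemented. The functions builder_for and push_file are hypothetical; load_dataset and push_to_hub are the library's own calls, and the upload requires the login step listed in the prerequisites.

```python
from pathlib import Path

# Map file extensions to the names of the generic loaders in the datasets library.
BUILDERS = {".jsonl": "json", ".json": "json", ".csv": "csv", ".parquet": "parquet"}


def builder_for(input_file):
    """Pick the datasets loader name for a file, based on its extension."""
    suffix = Path(input_file).suffix.lower()
    if suffix not in BUILDERS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return BUILDERS[suffix]


def push_file(input_file, repo_id, split="train"):
    """Load one data file as the given split and upload it to the Hub."""
    # Imported here so the helpers above work without datasets installed.
    from datasets import load_dataset

    dataset = load_dataset(builder_for(input_file), data_files={split: input_file})
    dataset.push_to_hub(repo_id)


if __name__ == "__main__":
    print(builder_for("results/output_20250630_140154.csv"))
    # To upload for real (needs network access and a prior Hugging Face login):
    # push_file("results/output_20250630_140154.csv",
    #           "your-username/my-extracted-qa-dataset", split="train")
```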

Supported Document Types

  • .pdf (Portable Document Format)
  • .csv (Comma Separated Values)
  • .docx (Microsoft Word Document)
  • .pptx (Microsoft PowerPoint Presentation)
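
The "auto" input format detects the type from the file extension. A minimal sketch of that idea follows, assuming a simple extension lookup over the four types listed above. The names detect_format and collect_inputs are hypothetical, and whether DAmon scans directories recursively is not stated in this README, so the sketch only looks at the top level.

```python
from pathlib import Path

# Extensions taken from the supported document types listed above.
SUPPORTED = {".pdf": "pdf", ".csv": "csv", ".docx": "docx", ".pptx": "pptx"}


def detect_format(path):
    """Return the format name for a path, or None if the extension is unsupported."""
    return SUPPORTED.get(Path(path).suffix.lower())


def collect_inputs(input_path):
    """List supported files for a file or directory path (top level only, by assumption)."""
    root = Path(input_path)
    candidates = [root] if root.is_file() else sorted(root.glob("*"))
    return [p for p in candidates if p.is_file() and detect_format(p)]


if __name__ == "__main__":
    for path in collect_inputs("data"):
        print(path, "->", detect_format(path))
```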

Contributing

Contributions are welcome! Please feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Simon Liu

A technology enthusiast in the field of artificial intelligence solutions, he focuses on helping companies adopt generative AI, MLOps, and large language model (LLM) technologies to drive digital transformation and technology implementation.

He is also a Google GenAI Developer Expert (GDE) and an active member of the technical community, promoting the application and development of AI through technical articles, talks, and practical experience sharing. He has published more than 100 technical articles on Medium, covering topics such as generative AI, RAG, and AI agents, and has spoken at many technical seminars on the practical application of AI and generative AI.

LinkedIn: https://www.linkedin.com/in/simonliuyuwei/
