A CLI tool to extract structured Q&A from documents using LLMs.

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering :: Artificial Intelligence
- Text Processing

Project description

DAmon

Data Arrangement/Annotation via Simon's tool.

A CLI tool to extract structured Q&A from documents using Large Language Models (LLMs).

Features
Installation
Configuration
- Custom Prompt Template
Usage
- Processing Documents
- Pushing to Hugging Face Hub
Supported Document Types
Contributing
License
Author

Features

Document Parsing: Automatically parses content from various document formats (PDF, CSV, DOCX, PPTX).
LLM-powered Q&A Extraction: Uses LiteLLM to call different LLMs (e.g., gemini/gemini-2.5-flash) and extract question-answer pairs along with the model's reasoning.
Flexible Output: Exports extracted Q&A into JSONL, CSV, or Parquet formats.
Batch Processing: Processes single files or entire directories of documents.
Hugging Face Hub Integration: Easily push your extracted datasets to the Hugging Face Hub.

Installation

Install the package from PyPI:

pip install DAmon

Configuration

This tool uses litellm to interact with various LLM providers. You'll need to set up your API keys and specify the model.

Create a .env file in the root directory of the project and add the API key for your LLM provider. For example, if you are using Gemini:

GEMINI_API_KEY=<gemini-api-key>

Refer to the LiteLLM documentation for the environment variables each LLM provider expects.
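
As a rough illustration of the configuration step, the sketch below reads KEY=VALUE lines from a .env file and checks that a key such as GEMINI_API_KEY is present before any LLM call is made. It uses only the standard library. The function names load_env_file and require_api_key are hypothetical and are not part of DAmon; how DAmon itself loads the .env file is not described here.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file (hypothetical helper, not DAmon's API).

    Values already present in os.environ are left untouched.
    Returns the mapping that was parsed from the file.
    """
    env_path = Path(path)
    loaded = {}
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def require_api_key(name="GEMINI_API_KEY"):
    """Return the key, or fail early with a readable message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it")
    return value
```

Calling load_env_file() followed by require_api_key() at startup surfaces a missing key immediately rather than partway through a batch run.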

Custom Prompt Template

You can customize the LLM's prompt template. The damon command will look for a file named DAMON_PROMPT.md in the directory where it is executed.

If DAMON_PROMPT.md exists, its content will be used as the PROMPT_TEMPLATE.
If DAMON_PROMPT.md does not exist, the default prompt template embedded in DAmon/core.py will be used.

A template file, DAMON_PROMPT_template.md, has been provided in the project root. You can copy and rename this file to DAMON_PROMPT.md to start customizing your prompt.
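
The lookup rule above can be sketched as follows. This is an illustration of the described behavior, not the code in DAmon/core.py: the function name load_prompt_template and the placeholder default text are assumptions.

```python
from pathlib import Path

# Hypothetical stand-in: the real default template lives in DAmon/core.py
# and is not reproduced here.
DEFAULT_PROMPT_TEMPLATE = "Extract question-answer pairs from the following text:\n{text}"


def load_prompt_template(workdir=None):
    """Return the content of DAMON_PROMPT.md if it exists, else the default.

    workdir defaults to the current directory, i.e. where the command is run.
    """
    base = Path(workdir) if workdir is not None else Path.cwd()
    custom = base / "DAMON_PROMPT.md"
    if custom.is_file():
        return custom.read_text(encoding="utf-8")
    return DEFAULT_PROMPT_TEMPLATE
```

Because the file is resolved relative to the working directory, running damon from a different directory would pick up a different prompt, or fall back to the default.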

Usage

The main command-line interface is damon.

Processing Documents

Use the process command to process documents and extract Q&A pairs.

damon process <INPUT_PATH> [OPTIONS]

<INPUT_PATH>: Path to the input document (file or directory).

Options:

--input-format [pdf|csv|docx|pptx|auto]: Format of the input document(s). Use "auto" to detect based on file extension. Default: auto.
--model TEXT: LiteLLM model name to use for Q&A extraction (e.g., gemini/gemini-2.5-flash).
--output-path PATH: Path to save the extracted Q&A. Can be a file or a directory. If a directory, a timestamped file will be created. Default: results/output.jsonl.
--export-format [jsonl|csv|parquet]: Format for exporting the extracted Q&A. Default: jsonl.
--num-qa INTEGER: Number of Q&A pairs to extract per document. If not specified, extracts as many as possible.
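
One way the --output-path behavior could work is sketched below. The timestamp pattern output_YYYYMMDD_HHMMSS is inferred from the example filename results/output_20250630_140154.csv used later in this README. The function name resolve_output_path and the rule that a path without a file extension counts as a directory are assumptions, not confirmed DAmon behavior.

```python
from datetime import datetime
from pathlib import Path


def resolve_output_path(output_path="results/output.jsonl", export_format="jsonl", now=None):
    """Turn an --output-path value into a concrete file path (hypothetical helper).

    Assumption: an existing directory, or a path with no file extension,
    is treated as a directory and receives a timestamped file name.
    """
    path = Path(output_path)
    if path.is_dir() or path.suffix == "":
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        return path / f"output_{stamp}.{export_format}"
    # Anything with an extension is taken as the exact file to write.
    return path
```

For example, resolve_output_path("results/", "csv") yields a path of the form results/output_<timestamp>.csv, while a value such as results/report_qa.jsonl is returned unchanged.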

Examples:

Process a single PDF file, output to CSV:

damon process data/test.pdf --model gemini/gemini-2.5-flash --output-path results/cyber_output --export-format csv --num-qa 5

Process all documents in a directory, auto-detect format, output to JSONL:

damon process data/ --input-format auto --output-path results/ --export-format jsonl

Extract a specific number of Q&A pairs from a DOCX file:

damon process documents/report.docx --input-format docx --num-qa 5 --output-path results/report_qa.jsonl

Pushing to Hugging Face Hub

Use the push-to-hf command to upload your extracted dataset files to the Hugging Face Hub.

damon push-to-hf --input-file <FILE_PATH> --repo-id <REPO_ID> [--split <SPLIT_NAME>]

--input-file <FILE_PATH>: Path to the data file to push (e.g., results/output.jsonl).
--repo-id <REPO_ID>: Hugging Face Hub repository ID (e.g., your-username/your-dataset-repo).
--split <SPLIT_NAME>: Optional. The name of the dataset split (e.g., train, validation, test). Defaults to train.

Prerequisites for pushing:

You need to have the datasets and huggingface_hub libraries installed (pip install datasets huggingface_hub).
You must be logged in to Hugging Face. Run huggingface-cli login in your terminal and follow the prompts.

Example:

damon push-to-hf --input-file results/output_20250630_140154.csv --repo-id your-username/my-extracted-qa-dataset --split train
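
For readers who want to see roughly what such an upload involves, here is a minimal sketch using the datasets library directly. It is not DAmon's implementation: the names loader_for and push_file_to_hub are hypothetical, and the sketch assumes the file is one of the three export formats and that you are already logged in to Hugging Face.

```python
from pathlib import Path

# Maps the export file extensions to builder names accepted by datasets.load_dataset.
LOADER_BY_SUFFIX = {".jsonl": "json", ".csv": "csv", ".parquet": "parquet"}


def loader_for(input_file):
    """Pick the datasets builder for a file based on its extension."""
    suffix = Path(input_file).suffix.lower()
    if suffix not in LOADER_BY_SUFFIX:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return LOADER_BY_SUFFIX[suffix]


def push_file_to_hub(input_file, repo_id, split="train"):
    """Load a local data file and upload it as one split of a Hub dataset."""
    # Imported lazily so loader_for works without the optional dependency.
    from datasets import load_dataset

    # A single data file is exposed by load_dataset under the "train" split.
    dataset = load_dataset(loader_for(input_file), data_files=str(input_file), split="train")
    dataset.push_to_hub(repo_id, split=split)
```

A call such as push_file_to_hub("results/output.jsonl", "your-username/your-dataset-repo") needs network access and valid credentials, so it is best tried on a small file first.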

Supported Document Types

.pdf (Portable Document Format)
.csv (Comma Separated Values)
.docx (Microsoft Word Document)
.pptx (Microsoft PowerPoint Presentation)
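
Extension-based detection, as used by --input-format auto, could look like the sketch below. The names detect_format and collect_documents are hypothetical, and whether DAmon descends into subdirectories is not stated in this README; the sketch assumes it does.

```python
from pathlib import Path

# The four supported extensions listed above, mapped to --input-format values.
SUPPORTED_FORMATS = {".pdf": "pdf", ".csv": "csv", ".docx": "docx", ".pptx": "pptx"}


def detect_format(path, input_format="auto"):
    """Return the format for a file, or None if the extension is unsupported."""
    if input_format != "auto":
        return input_format  # an explicit format overrides detection
    return SUPPORTED_FORMATS.get(Path(path).suffix.lower())


def collect_documents(input_path):
    """List supported files for a single file or a directory tree.

    Assumption: directories are searched recursively.
    """
    root = Path(input_path)
    candidates = [root] if root.is_file() else sorted(root.rglob("*"))
    return [p for p in candidates if p.is_file() and detect_format(p) is not None]
```

Matching on the lowercased suffix means files such as REPORT.PDF are picked up as well, while unsupported files in the same directory are skipped.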

Contributing

Contributions are welcome! Please feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License; see the LICENSE file for details.

Author

Simon Liu

A technology enthusiast in the field of artificial intelligence solutions, he focuses on helping companies adopt generative AI, MLOps, and large language model (LLM) technologies to support digital transformation and technology implementation.

He is also a Google GenAI Developer Expert (GDE) and an active member of the technical community, promoting the application and development of AI through technical articles, talks, and shared practical experience. He has published more than 100 technical articles on Medium covering topics such as generative AI, RAG, and AI agents, and has spoken at many technical seminars on the practical application of AI and generative AI.

LinkedIn: https://www.linkedin.com/in/simonliuyuwei/

Release history

0.1.2 (this version): Jul 4, 2025
0.1.1: Jul 1, 2025
0.1.0: Jun 30, 2025

Download files

Source Distribution

damon-0.1.2.tar.gz (17.8 kB)

Uploaded Jul 4, 2025 Source

Built Distribution

damon-0.1.2-py3-none-any.whl (16.0 kB)

Uploaded Jul 4, 2025 Python 3

File details

Details for the file damon-0.1.2.tar.gz.

File metadata

Download URL: damon-0.1.2.tar.gz
Upload date: Jul 4, 2025
Size: 17.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for damon-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`d0a6755efc013c549fee1af80f19c2788b98a7c4a5053943595222e354978236`
MD5	`4b643daf3b29e37cc896057e16a68a5c`
BLAKE2b-256	`e3ea725f84d0912cc812be1db49fe40251a4d75b1859f0198d3cf4e9b4a83722`

File details

Details for the file damon-0.1.2-py3-none-any.whl.

File metadata

Download URL: damon-0.1.2-py3-none-any.whl
Upload date: Jul 4, 2025
Size: 16.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for damon-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d00c37b74e4bbe24022d7814a838f6e200ed4795ee3355389babb4fac61fb09e`
MD5	`af3028069e6726e3e91a0b7c22154df8`
BLAKE2b-256	`a9d951ceda20c0cd0fe2006f2555435c07aeaaf6a1c7cbc312786d2c4f466c36`
