🌞 dygest: Document Insights Generator

dygest is a command-line tool that takes text documents as input and extracts content insights using Large Language Models (LLMs) and Named Entity Recognition (NER).

It generates summaries, keywords, and a table of contents, and exports human-readable Markdown or HTML documents. It also ships with customizable template support for HTML.

Features

  • Text insights
    Generate concise insights for your text files using various LLM services by creating summaries, keywords, table of contents (TOC) and named entities (NER).

  • Unified LLM Interface
    dygest uses litellm and provides integration for various LLM service providers: OpenAI, Anthropic, HuggingFace, Groq, Ollama etc. Check the complete provider list for all available services.

  • Token Friendly
    dygest performs token-heavy text analysis and summarization tasks. The underlying LLM pipeline can therefore be tailored to your needs and specific rate limits using a mixed-experts approach.

  • Mixed Experts Approach
    dygest utilizes two fully customizable LLMs to handle different processing tasks. The first, referred to as the light_model, is designed for lighter tasks such as summarization and keyword extraction. The second, called the expert_model, is optimized for more complex tasks like constructing Tables of Contents (TOCs).

    This flexibility allows for various pipeline configurations. For example, the light_model can run locally using Ollama, while the expert_model can leverage an external API service like OpenAI or Groq. This approach ensures efficiency and adaptability based on specific requirements.

  • Named Entity Recognition (NER)
    Named Entity Recognition via the fast and reliable flair framework (identifies persons, organisations, locations, etc.).

  • Customizable HTML Templates
    By default dygest creates a .html file that can be viewed in any standard browser and combines summaries, keywords, TOC and NER results for your text. Two default templates are available (tabs and plain), and you can also build your own.

  • Input Formats: .txt, .csv, .xlsx, .doc, .docx, .pdf, .html, .xml

  • Export Formats: .json, .md, .html
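A minimal end-to-end sketch of the workflow described above (model names and paths are illustrative, not recommendations):

```shell
# Install, configure a light and an expert model, then process a folder of texts
pip install dygest
dygest config -l 'ollama/gemma3:12b' -x 'groq/llama-3.3-70b-versatile'
dygest run -f ./texts -s -k -t
```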

Requirements

  • ๐Ÿ Python >=3.10
  • ๐Ÿ”‘ API keys for LLM services like OpenAI, Anthropic and Groq and / or a running Ollama instance

Installation

Install with pip

Create a Python virtual environment

python3 -m venv venv

Activate the environment

source venv/bin/activate

Install dygest

pip install dygest

Install from source

Clone this repository

git clone https://github.com/tsmdt/dygest.git
cd dygest

Create a Python virtual environment

python3 -m venv venv

Activate the environment

source venv/bin/activate

Install dygest

pip install .

Usage

Configuration

Copy the .env.example in the project directory and rename it to .env. Update the dygest settings by running the dygest config command or by editing the .env manually.

 Usage: dygest config [OPTIONS]

 Configure LLMs, Embeddings and Named Entity Recognition. (Config file: .env)

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --add_custom       -add               TEXT     Add a custom key-value pair to the config .env (format: KEY=VALUE). [default: None] │
│ --light_model      -l                 TEXT     LLM model name for lighter tasks (summarization, keywords). [default: None]         │
│ --expert_model     -x                 TEXT     LLM model name for heavier tasks (TOCs). [default: None]                            │
│ --embedding_model  -e                 TEXT     Embedding model name. [default: None]                                               │
│ --temperature      -t                 FLOAT    Temperature of LLM. [default: None]                                                 │
│ --sleep            -s                 FLOAT    Pause LLM requests to prevent rate limit errors (in seconds). [default: None]       │
│ --chunk_size       -c                 INTEGER  Maximum number of tokens per chunk. [default: None]                                 │
│ --ner                     --no-ner             Enable Named Entity Recognition (NER). Defaults to False. [default: no-ner]         │
│ --precise                 --fast               Enable precise mode for NER. Defaults to fast mode. [default: fast]                 │
│ --lang             -lang              TEXT     Language of file(s) for NER. Defaults to auto-detection. [default: None]            │
│ --view_config      -v                          View loaded config parameters.                                                      │
│ --help                                         Show this message and exit.                                                         │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The configuration is saved as .env in the project directory and can be easily edited either by hand or using the dygest config command.
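For example, the .env settings could be written from the command line like this (model names and values are illustrative):

```shell
# Sketch: set both models and the temperature, enable NER, then inspect the result
dygest config -l 'ollama/gemma3:12b' -x 'groq/llama-3.3-70b-versatile' -t 0.1 --ner
dygest config -v
```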

If you want to add custom keys and values (e.g. a custom api_base), just run dygest config -add CUSTOM_PROVIDER_API_BASE=https://custom-provider.com/v1. A populated .env looks like this:

LIGHT_MODEL='ollama/gemma3:12b'
EXPERT_MODEL='groq/llama-3.3-70b-versatile'
EMBEDDING_MODEL='ollama/nomic-embed-text:latest'
TEMPERATURE='0.1'
SLEEP='0'
CHUNK_SIZE='1000'
NER='True'
NER_LANGUAGE='auto'
NER_PRECISE='False'

# API KEYS
OPENAI_API_KEY=''
GROQ_API_KEY=''

# CUSTOM SETTINGS
OLLAMA_API_BASE='http://localhost:11434'

Processing

Run the dygest LLM pipeline with the dygest run command:

 Usage: dygest run [OPTIONS]

 Create insights for your documents (summaries, keywords, TOCs).

╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --files             -f         TEXT                      Path to the input folder or file. [default: None]                         │
│ --output_dir        -o         TEXT                      If not provided, outputs will be saved in the input folder.               │
│                                                          [default: None]                                                           │
│ --export_format     -ex        [all|json|markdown|html]  Set the data format for exporting. [default: html]                        │
│ --toc               -t                                   Create a Table of Contents (TOC) for the text. Defaults to False.         │
│ --summarize         -s                                   Include a short summary for the text. Defaults to False.                  │
│ --keywords          -k                                   Create descriptive keywords for the text. Defaults to False.              │
│ --sim_threshold     -sim       FLOAT                     Similarity threshold for removing duplicate topics. [default: 0.85]       │
│ --default_template  -dt        [tabs|plain]              Choose a built-in HTML template ('tabs' or 'plain'). [default: tabs]      │
│ --user_template     -ut        DIRECTORY                 Provide a custom folder path for an HTML template. [default: None]        │
│ --skip_html         -skip                                Skip files if HTML already exists in the same folder. Defaults to False.  │
│ --export_metadata   -meta                                Enable exporting metadata to output file(s). Defaults to False.           │
│ --verbose           -v                                   Enable verbose output. Defaults to False.                                 │
│ --help                                                   Show this message and exit.                                               │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
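For instance, a typical run over a folder of documents might look like this (paths are illustrative):

```shell
# Summaries, keywords and a TOC for every file in ./documents, exported in all formats
dygest run -f ./documents -o ./insights -ex all -s -k -t
```

Without -o, the outputs are written next to the input files.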

Documentation

View the documentation for detailed usage information.

Acknowledgments

dygest builds on great Python packages, including litellm and flair.

Citation

@software{dygest,
  author       = {Thomas Schmidt},
  title        = {DYGEST: Document Insights Generator},
  organization = {Mannheim University Library},
  year         = {2025},
  url          = {https://github.com/tsmdt/dygest}
}
