
🌞 DYGEST: Document Insights Generator

[!NOTE] dygest is a text analysis tool that extracts insights from .txt files, generating summaries, keywords, and TOCs, and performing Named Entity Recognition (NER).

Info

dygest was created to gain fast insights into longer transcripts of audio and video content by retrieving relevant topics and providing an easy-to-use HTML interface with shortcuts from summaries to the corresponding text chunks. NER processing further enhances those insights by identifying names of individuals, organisations, locations, etc.

Features 🧩

  • Text insights
    Generate concise insights for your text files using various LLM services by creating summaries, keywords, a table of contents (TOC), and named entities (NER).

  • Unified LLM Interface
    dygest uses litellm and provides integration for various LLM service providers: OpenAI, Anthropic, HuggingFace, Groq, Ollama etc. Check the complete provider list for all available services.

  • Token Friendly
    dygest performs token-heavy text analysis and summarization tasks. Therefore, the underlying LLM pipeline can be tailored to your needs and specific rate limits using a mixed experts approach.

  • Mixed Experts Approach
    dygest utilizes two fully customizable LLMs to handle different processing tasks. The first, referred to as the light_model, is designed for lighter tasks such as summarization and keyword extraction. The second, called the expert_model, is optimized for more complex tasks like constructing Tables of Contents (TOCs).

    This flexibility allows for various pipeline configurations. For example, the light_model can run locally using Ollama, while the expert_model can leverage an external API service like OpenAI or Groq. This approach ensures efficiency and adaptability based on specific requirements.

[!TIP] As the expert_model deals with a lot of input content, it is recommended to use a larger LLM (>=32B) for this task. Smaller LLMs (3B to 7B) perform well as the light_model.

  • Named Entity Recognition (NER)
    Named Entity Recognition via the fast and reliable flair framework (identifies persons, organisations, locations, etc.).

  • User-friendly HTML Editor
    By default, dygest creates a .html file that can be viewed in standard browsers and combines summaries, keywords, TOC, and NER results for your text. It features a text editor so you can make further changes.

  • Export Formats: .json .csv .html

Requirements

  • ๐Ÿ Python >=3.10
  • ๐Ÿ”‘ API keys for LLM services like OpenAI, Anthropic and Groq and / or a running Ollama instance

[!NOTE] API keys have to be stored in your environment (e.g. export OPENAI_API_KEY=skj....)
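
For example, on Linux or macOS the keys can be exported in the shell before running dygest. OPENAI_API_KEY is the variable named in the note above; GROQ_API_KEY is an assumption based on common provider conventions:

# replace the placeholder values with your actual keys
export OPENAI_API_KEY=your-openai-key
export GROQ_API_KEY=your-groq-key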

Installation

Install with pip

Create a Python virtual environment

python3.10 -m venv venv

Activate the environment

source venv/bin/activate

Install dygest

pip install dygest
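
To verify the installation, you can print the CLI help (this assumes the dygest entry point is available on your PATH, which pip sets up inside the activated environment):

dygest --help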

Install from source

Clone this repository

git clone https://github.com/tsmdt/dygest.git
cd dygest

Create a Python virtual environment

python3.10 -m venv venv

Activate the environment

source venv/bin/activate

Install dygest

pip install .
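
If you plan to modify the source, an editable install keeps the installed package in sync with your working copy (a standard pip option, not specific to dygest):

pip install -e .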

Usage

Configuration

Customize the dygest LLM pipeline by running the dygest config command:

 Usage: dygest config [OPTIONS]

 Configure LLMs, Embeddings and Named Entity Recognition.

Options:
  --light_model      -l                TEXT     LLM model name for lighter tasks (summarization, keywords). [default: None]
  --expert_model     -x                TEXT     LLM model name for heavier tasks (TOCs). [default: None]
  --embedding_model  -e                TEXT     Embedding model name. [default: None]
  --temperature      -t                FLOAT    Temperature of LLM. [default: None]
  --sleep            -s                FLOAT    Pause LLM requests to prevent rate limit errors (in seconds). [default: None]
  --chunk_size       -c                INTEGER  Maximum number of tokens per chunk. [default: None]
  --ner                   --no-ner              Enable Named Entity Recognition (NER). Defaults to False. [default: no-ner]
  --precise               --fast                Enable precise mode for NER. Defaults to fast mode. [default: fast]
  --lang             -lang             TEXT     Language of file(s) for NER. Defaults to auto-detection. [default: None]
  --api_base         -api              TEXT     Set custom API base URL for providers like Ollama and Hugging Face. [default: None]
  --view_config      -v                         View loaded config parameters.
  --help                                        Show this message and exit.

The configuration is saved as dygest_config.yaml in the project directory. The .yaml config looks like this:

light_model: ollama/mistral:latest
expert_model: groq/llama-3.3-70b-versatile
embedding_model: ollama/nomic-embed-text:latest
temperature: 0.4
chunk_size: 1000
ner: true
language: auto
precise: false
api_base: null
sleep: 0
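
A configuration like the one above could also be written in a single call; the model names are simply the ones from the sample config and can be swapped for any provider supported by litellm:

dygest config -l ollama/mistral:latest -x groq/llama-3.3-70b-versatile -e ollama/nomic-embed-text:latest -t 0.4 -c 1000 --ner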

Processing

Run the dygest LLM pipeline with the dygest run command:

 Usage: dygest run [OPTIONS]

 Create insights for your documents (summaries, keywords, TOCs).

Options:
  --files            -f       TEXT             Path to the input folder or .txt file. [default: None]
  --output_dir       -o       TEXT             If not provided, outputs will be saved in the input folder. [default: None]
  --export_format    -ex      [json|csv|html]  Set the data format for exporting. [default: html]
  --toc              -t                        Create a Table of Contents (TOC) for the text. Defaults to False.
  --summarize        -s                        Include a short summary for the whole text. Defaults to False.
  --keywords         -k                        Create descriptive keywords for the text. Defaults to False.
  --sim_threshold    -sim     FLOAT            Similarity threshold for removing duplicate topics. [default: 0.85]
  --verbose          -v                        Enable verbose output. Defaults to False.
  --export_metadata  -meta                     Enable exporting metadata to output file(s). Defaults to False.
  --help                                       Show this message and exit.
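
A typical invocation that creates an HTML file with a TOC, summary, and keywords for every .txt file in a folder might look like this (the folder paths are placeholders):

dygest run -f ./transcripts -o ./results -t -s -k -ex html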

Acknowledgments

dygest builds on great Python packages, including litellm and flair.

