DYGEST: Document Insights Generator
[!NOTE] dygest is a text analysis tool that extracts insights from documents, generating summaries, keywords, TOCs, and performing Named Entity Recognition (NER).
Info
dygest was created to gain fast insights into longer transcripts of audio and video content by retrieving relevant topics and providing an easy-to-use HTML interface with shortcuts from summaries to the corresponding text chunks. NER processing further enhances those insights by identifying names of individuals, organisations, locations etc.
Features
- Text insights
  Generate concise insights for your text files using various LLM services by creating summaries, keywords, tables of contents (TOCs) and named entities (NER).
- Unified LLM Interface
  dygest uses litellm and provides integration for various LLM service providers: OpenAI, Anthropic, HuggingFace, Groq, Ollama etc. Check the complete provider list for all available services.
- Token Friendly
  dygest performs token-heavy text analysis and summarization tasks. Therefore, the underlying LLM pipeline can be tailored to your needs and specific rate limits using a mixed experts approach.
- Mixed Experts Approach
  dygest utilizes two fully customizable LLMs to handle different processing tasks. The first, referred to as the light_model, is designed for lighter tasks such as summarization and keyword extraction. The second, called the expert_model, is optimized for more complex tasks like constructing Tables of Contents (TOCs).
  This flexibility allows for various pipeline configurations. For example, the light_model can run locally using Ollama, while the expert_model can leverage an external API service like OpenAI or Groq. This lets you balance cost, speed and rate limits according to your requirements.
  [!TIP] As the expert_model is dealing with a lot of input content, it is recommended to use a larger LLM (>=32B) for this task. Smaller LLMs (3B to 7B) perform well as light_model.
- Named Entity Recognition (NER)
  Named Entity Recognition via the fast and reliable flair framework (identifies persons, organisations, locations etc.).
- User-friendly HTML Editor
  By default dygest will create a .html file that can be viewed in standard browsers and combines summaries, keywords, TOC and NER for your text. It features a text editor for you to make further changes.
- Input Formats: .txt, .csv, .xlsx, .doc, .docx, .pdf, .html, .xml
- Export Formats: .json, .csv, .html
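The mixed experts approach from the feature list can be pictured as a small routing table. The following is a minimal sketch and not dygest's actual implementation: the class `PipelineModels`, the task names and the `model_for` method are hypothetical and only illustrate how lighter tasks could be routed to the light_model and TOC construction to the expert_model. The model names are taken from the example configuration in this README.

```python
from dataclasses import dataclass

# Hypothetical task names; dygest's internal names may differ.
LIGHT_TASKS = {"summarize", "keywords"}
EXPERT_TASKS = {"toc"}


@dataclass(frozen=True)
class PipelineModels:
    """Illustrative container for the two models of a mixed experts pipeline."""

    light_model: str
    expert_model: str

    def model_for(self, task: str) -> str:
        """Return the model name that should handle the given task."""
        if task in LIGHT_TASKS:
            return self.light_model
        if task in EXPERT_TASKS:
            return self.expert_model
        raise ValueError(f"unknown task: {task!r}")


models = PipelineModels(
    light_model="ollama/mistral:latest",
    expert_model="groq/llama-3.3-70b-versatile",
)
for task in ("summarize", "keywords", "toc"):
    print(f"{task:<10} -> {models.model_for(task)}")
```

Keeping the routing in one place makes it easy to swap a local model for an API-based one without touching the task code.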
Requirements
- Python >=3.10
- API keys for LLM services like OpenAI, Anthropic and Groq and/or a running Ollama instance
[!NOTE] API keys have to be stored in your environment (e.g. export OPENAI_API_KEY=skj....)
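Before starting a run it can help to check that the required keys are actually set. The sketch below is an illustration only: the helper `missing_api_keys` and the `PROVIDER_KEYS` mapping are hypothetical, and the variable names for Anthropic and Groq are assumed to follow the same `<PROVIDER>_API_KEY` pattern as the OpenAI example. Providers without an entry, such as a local Ollama instance, are assumed to need no key.

```python
import os

# Assumed environment variable names, following the OPENAI_API_KEY pattern.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}


def missing_api_keys(model_names, env=None):
    """Return the sorted names of required environment variables that are unset.

    Model names are expected in the "provider/model" form used in the
    example config. Providers not listed in PROVIDER_KEYS are skipped.
    """
    env = os.environ if env is None else env
    missing = set()
    for name in model_names:
        provider = name.split("/", 1)[0].lower()
        key = PROVIDER_KEYS.get(provider)
        if key and not env.get(key):
            missing.add(key)
    return sorted(missing)


configured = ["ollama/mistral:latest", "groq/llama-3.3-70b-versatile"]
# An empty mapping simulates an environment without any keys.
print(missing_api_keys(configured, env={}))  # ['GROQ_API_KEY']
```

Calling `missing_api_keys(configured)` without the `env` argument checks the real process environment instead.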
Installation
Install with pip
Create a Python virtual environment
python3 -m venv venv
Activate the environment
source venv/bin/activate
Install dygest
pip install dygest
Install from source
Clone this repository
git clone https://github.com/tsmdt/dygest.git
cd dygest
Create a Python virtual environment
python3 -m venv venv
Activate the environment
source venv/bin/activate
Install dygest
pip install .
Usage
Configuration
Customize the dygest LLM pipeline by running the dygest config command:
Usage: dygest config [OPTIONS]
Configure LLMs, Embeddings and Named Entity Recognition.
Options:
  --light_model      -l      TEXT     LLM model name for lighter tasks (summarization, keywords) [default: None]
  --expert_model     -x      TEXT     LLM model name for heavier tasks (TOCs). [default: None]
  --embedding_model  -e      TEXT     Embedding model name. [default: None]
  --temperature      -t      FLOAT    Temperature of LLM. [default: None]
  --sleep            -s      FLOAT    Pause LLM requests to prevent rate limit errors (in seconds). [default: None]
  --chunk_size       -c      INTEGER  Maximum number of tokens per chunk. [default: None]
  --ner              --no-ner         Enable Named Entity Recognition (NER). Defaults to False. [default: no-ner]
  --precise          --fast           Enable precise mode for NER. Defaults to fast mode. [default: fast]
  --lang             -lang   TEXT     Language of file(s) for NER. Defaults to auto-detection. [default: None]
  --api_base         -api    TEXT     Set custom API base url for providers like Ollama and HuggingFace. [default: None]
  --view_config      -v               View loaded config parameters.
  --help                              Show this message and exit.
The configuration is saved as dygest_config.yaml in the project directory. The .yaml config looks like this:
light_model: ollama/mistral:latest
expert_model: groq/llama-3.3-70b-versatile
embedding_model: ollama/nomic-embed-text:latest
temperature: 0.4
chunk_size: 1000
ner: true
language: auto
precise: false
api_base: null
sleep: 0
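To make the chunk_size and sleep settings more concrete, here is a minimal sketch of how a text could be split into chunks and processed with a pause between requests. It is not dygest's implementation: `chunk_tokens` and `process_chunks` are hypothetical helpers, whitespace-separated words stand in for real model tokens, and `handler` stands in for an LLM call.

```python
import time


def chunk_tokens(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def process_chunks(chunks, handler, sleep=0.0):
    """Call handler on every chunk, pausing `sleep` seconds between calls."""
    results = []
    for index, chunk in enumerate(chunks):
        if index > 0 and sleep > 0:
            time.sleep(sleep)  # throttle to stay below provider rate limits
        results.append(handler(chunk))
    return results


text = "one two three four five six seven"
chunks = chunk_tokens(text, chunk_size=3)
print(chunks)  # ['one two three', 'four five six', 'seven']
# len() is a stand-in for an LLM request on each chunk.
print(process_chunks(chunks, handler=len, sleep=0))  # [13, 13, 5]
```

A smaller chunk_size produces more, shorter requests, so a non-zero sleep value becomes more relevant when a provider enforces strict rate limits.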
Processing
Run the dygest LLM pipeline with the dygest run command:
Usage: dygest run [OPTIONS]
Create insights for your documents (summaries, keywords, TOCs).
Options:
  --files            -f      TEXT                 Path to the input folder or .txt file. [default: None]
  --output_dir       -o      TEXT                 If not provided, outputs will be saved in the input folder. [default: None]
  --export_format    -ex     [all|json|csv|html]  Set the data format for exporting. [default: html]
  --toc              -t                           Create a Table of Contents (TOC) for the text. Defaults to False.
  --summarize        -s                           Include a short summary for the text. Defaults to False.
  --keywords         -k                           Create descriptive keywords for the text. Defaults to False.
  --sim_threshold    -sim    FLOAT                Similarity threshold for removing duplicate topics. [default: 0.85]
  --verbose          -v                           Enable verbose output. Defaults to False.
  --export_metadata  -meta                        Enable exporting metadata to output file(s). Defaults to False.
  --help                                          Show this message and exit.
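The --sim_threshold option removes duplicate topics based on a similarity score. The sketch below shows one plausible reading of that option, assuming cosine similarity between topic embeddings; the README does not state which similarity measure dygest uses, and `cosine_similarity`, `deduplicate_topics` and the toy two-dimensional vectors are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate_topics(topics, embeddings, sim_threshold=0.85):
    """Keep a topic only if it is below the threshold to every kept topic."""
    kept = []  # indices of topics that survive
    for i, vector in enumerate(embeddings):
        if all(
            cosine_similarity(vector, embeddings[j]) < sim_threshold
            for j in kept
        ):
            kept.append(i)
    return [topics[i] for i in kept]


topics = ["Climate policy", "Climate politics", "Football results"]
# Toy embeddings; a real pipeline would obtain them from an embedding model.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(deduplicate_topics(topics, embeddings))
# ['Climate policy', 'Football results']
```

Under this reading, lowering the threshold removes more near-duplicates, while raising it towards 1.0 keeps almost every topic.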
Export formats
json
Find an example .json output in the examples folder.
Acknowledgments
dygest uses great Python packages:
- litellm: https://github.com/BerriAI/litellm
- flair: https://github.com/flairNLP/flair
- typer: https://github.com/fastapi/typer
- json_repair: https://github.com/mangiucugna/json_repair
- markitdown: https://github.com/microsoft/markitdown
File details
Details for the file dygest-0.5.2.tar.gz.
File metadata
- Download URL: dygest-0.5.2.tar.gz
- Upload date:
- Size: 33.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bdb996fc2a5e16d64637acaab73f5905bf29c3d353b5cbfd59dab3de0a2e6259 |
| MD5 | 8f4b4d83d06f27b30b6555d8dfaf6de4 |
| BLAKE2b-256 | 9d1089ee100968c21509386cd32c6535c4946e48adb55e3e0d4534cea2efaa35 |
File details
Details for the file dygest-0.5.2-py3-none-any.whl.
File metadata
- Download URL: dygest-0.5.2-py3-none-any.whl
- Upload date:
- Size: 33.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e8dccdb8c3ec8c827328068684b8ecc3b13f248dc6743b726c0114245eec5cef |
| MD5 | 9110b2d2e786517effccebe16dd4483a |
| BLAKE2b-256 | 50e676e8de5057fe2d78b2d6172d8dbc0c0f9fdfc7194399286819d742ce9ead |