
Project description

HADES: Homologous Automated Document Exploration and Summarization

A powerful tool for comparing similarly structured documents


Overview

HADES is a Python package for comparing similarly structured documents. It is designed to streamline the work of professionals dealing with large volumes of documents, such as policy documents, legal acts, and scientific papers. The tool employs a multi-step pipeline that processes PDF documents with topic modeling, summarization, and analysis of the most important words for each topic, and concludes with an interactive web app whose visualizations facilitate comparison of the documents. By reducing the time and effort required for comparative document analysis, HADES can significantly improve the productivity of professionals who handle high volumes of documents.

Installation

The latest released version of the HADES package is available on the Python Package Index (PyPI):

  1. Install the spaCy en_core_web_sm or en_core_web_lg model for English according to the spaCy instructions (a download command is shown below)

  2. Install HADES package using pip:

pip install -U hades-nlp
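
For example, the small English model can be downloaded with spaCy's standard CLI:

python -m spacy download en_core_web_sm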

The source code and development version are currently hosted on GitHub.

Usage

The HADES package is designed to be used in a Python environment. The package can be imported as follows:

from hades.data_loading import load_processed_data
from hades.topic_modeling import ModelOptimizer, save_data_for_app, set_openai_key
from my_documents_data import PARAGRAPHS, COMMON_WORDS, STOPWORDS

The load_processed_data function loads the documents to be processed. The ModelOptimizer class optimizes the topic modeling process, the save_data_for_app function saves the data for the interactive web app, and the set_openai_key function sets the OpenAI API key. my_documents_data is a user-defined module containing information about the documents to be processed: PARAGRAPHS is a list of strings naming the paragraphs of the documents, COMMON_WORDS holds the most common words in the documents (indexed by paragraph in the example below), and STOPWORDS is a list of words that should be excluded from the analysis.
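
For illustration, a my_documents_data module might look like the sketch below; the module and all of its values are user-defined placeholders, not part of the HADES package:

# my_documents_data.py - user-supplied metadata about the documents (illustrative values)
PARAGRAPHS = ["economy", "health"]            # paragraph names shared across the documents
COMMON_WORDS = {                              # frequent words per paragraph, to be excluded
    "economy": ["growth", "market", "gdp"],
    "health": ["care", "hospital", "patient"],
}
STOPWORDS = ["the", "and", "of"]              # generic words excluded from the analysis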

First, the documents are loaded and processed:

set_openai_key("my openai key")  # needed for GPT-based topic naming and summaries
data_path = "my/data/path"
processed_df = load_processed_data(
    data_path=data_path,
    stop_words=STOPWORDS,
    id_column='country',          # column that identifies each document
    flattened_by_col='my_column',
)

After the documents are loaded, the topic modeling process is optimized for each paragraph:

model_optimizers = []
for paragraph in PARAGRAPHS:
    filter_dict = {'paragraph': paragraph}  # restrict the model to one paragraph at a time
    model_optimizer = ModelOptimizer(
        processed_df,
        'country',                # id column, as used when loading the data
        'section',
        filter_dict,
        "lda",                    # topic modeling algorithm
        COMMON_WORDS[paragraph],  # common words to exclude for this paragraph
        (3, 6),                   # range of topic counts explored during optimization
        alpha=100                 # LDA hyperparameter
    )
    model_optimizer.name_topics_automatically_gpt3()
    model_optimizers.append(model_optimizer)

For each paragraph, the ModelOptimizer class optimizes the topic modeling process. The name_topics_automatically_gpt3 method names the topics automatically using the OpenAI GPT-3 API; users can instead call the name_topics_manually method to name the topics themselves, as sketched below.
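
A minimal sketch of manual naming follows; the exact signature of name_topics_manually is not documented here, so passing one name per topic as a list is an assumption:

model_optimizer.name_topics_manually(["Fiscal policy", "Trade", "Labour market"])  # assumed: one name per topic, in topic order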

Finally, the data is saved for the interactive web app:

save_data_for_app(model_optimizers, path='path/to/results', do_summaries=True)

The save_data_for_app function writes everything the interactive web app needs to the given path; setting do_summaries=True additionally generates a summary for each topic.

When the data is saved, the interactive web app can be launched:

hades run-app --config path/to/results/config.json

