With AuDoLab you can do LDA on highly imbalanced datasets.

AuDoLab

https://img.shields.io/pypi/v/AuDoLab.svg https://api.travis-ci.com/ArneTillmann/AuDoLab.svg?branch=main&status=passed Documentation Status https://joss.theoj.org/papers/10.21105/joss.03719/status.svg

With AuDoLab you can perform Latent Dirichlet Allocation on highly imbalanced datasets.

Summary

AuDoLab provides a novel approach to one-class document classification for heavily imbalanced datasets, even if labelled training data is not available. Our package enables the user to create specific out-of-domain training data to classify a heavily underrepresented target class in a document dataset using a recently developed integration of Web Scraping, Latent Dirichlet Allocation Topic Modelling and One-class Support Vector Machines. AuDoLab can achieve high-quality results even on highly specific classification problems without the need to invest in the time- and cost-intensive labelling of training documents by humans. Hence, AuDoLab has a broad range of applications in scientific research and business.

Unsupervised document classification is mainly performed to gain insight into the underlying topics of large text corpora. In this process, documents covering highly underrepresented topics have only a minor impact on the algorithm’s topic definitions. As a result, underrepresented topics can sometimes be “overlooked” and documents are assigned topic prevalences that do not reflect the underlying content. Thus, labeling underrepresented topics in large text corpora is often done manually and can therefore be very labour-intensive and time-consuming. AuDoLab enables the user to tackle this problem and perform unsupervised one-class document classification for heavily underrepresented document classes.

Installation

Stable release

To install AuDoLab, run this command in your terminal (bash, PowerShell, etc.), provided that you have Python 3 and pip installed:

$ pip install AuDoLab

This is the preferred method to install AuDoLab, as it will always install the most recent stable release.

If you don’t have pip installed, the Python installation guide can walk you through the process.

From sources

The sources for AuDoLab can be downloaded from the GitHub repo.

You can either clone the public repository:

$ git clone https://github.com/ArneTillmann/AuDoLab

Or download the tarball:

$ curl -OJL https://github.com/ArneTillmann/AuDoLab/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Usage

Before using AuDoLab, download the stopwords for nltk by running:

import nltk
nltk.download('stopwords')

inside a Python console. To use AuDoLab in a project:

from AuDoLab import AuDoLab

Then create an instance of the AuDoLab class:

audo = AuDoLab.AuDoLab()

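In this example we use publicly available data from the nltk package, the Reuters corpus (if it is not yet on your machine, fetch it first with nltk.download('reuters')):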

from nltk.corpus import reuters
import numpy as np
import pandas as pd

data = []

for fileid in reuters.fileids():
    # File ids have the form "<tag>/<filename>"
    tag, filename = fileid.split("/")
    data.append(
        (filename, ", ".join(reuters.categories(fileid)), reuters.raw(fileid))
    )

data = pd.DataFrame(data, columns=["filename", "categories", "text"])

Then you want to scrape abstracts, e.g. from IEEE with the abstract scraper:

# Adjacent string literals are joined by Python, so the URL stays one string
scraped_documents = audo.get_ieee(
    "https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&"
    "queryText=cotton&highlight=true&returnFacets=ALL&returnType=SEARCH&"
    "matchPubs=true&rowsPerPage=100&pageNumber=1",
    pages=1)
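
Because the search URL is long, it can be easier to assemble it from its query parameters than to type it by hand. The following is a minimal sketch using only the Python standard library. The helper name build_ieee_search_url is hypothetical and not part of AuDoLab; the parameter names and values are simply those that appear in the URL above.

```python
from urllib.parse import urlencode

IEEE_SEARCH = "https://ieeexplore.ieee.org/search/searchresult.jsp"


def build_ieee_search_url(query_text, rows_per_page=100, page_number=1):
    """Hypothetical helper: rebuild the search URL shown above for any keyword."""
    params = {
        "newsearch": "true",
        "queryText": query_text,
        "highlight": "true",
        "returnFacets": "ALL",
        "returnType": "SEARCH",
        "matchPubs": "true",
        "rowsPerPage": rows_per_page,
        "pageNumber": page_number,
    }
    # urlencode keeps the insertion order and escapes spaces and special characters
    return IEEE_SEARCH + "?" + urlencode(params)


url = build_ieee_search_url("cotton")
print(url)
```

The resulting string can be passed to audo.get_ieee in place of the literal URL, with pages=1 as before.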

The data as well as the scraped papers need to be preprocessed before use in the classifier:

preprocessed_target = audo.text_cleaning(data=data, column="text")

preprocessed_paper = audo.text_cleaning(
    data=scraped_documents, column="abstract")

target_tfidf, training_tfidf = audo.tf_idf(
    data=preprocessed_target,
    papers=preprocessed_paper,
    data_column="lemma",
    papers_column="lemma",
    features=100000,
)
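
To illustrate what a TF-IDF step of this kind produces, here is a minimal pure-Python sketch. It is not AuDoLab's implementation: the smoothed weighting formula, the choice to fit the vocabulary on the scraped papers only, and the toy token lists are all assumptions made for illustration. The point it shows is that both corpora end up as vectors over the same vocabulary, so a classifier trained on one can score the other.

```python
import math
from collections import Counter


def fit_idf(tokenised_docs):
    """Learn a vocabulary and smoothed inverse document frequencies."""
    n_docs = len(tokenised_docs)
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    # Smoothed idf: log((1 + N) / (1 + df)) + 1, so every known term gets a positive weight
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(tokenised_docs, idf):
    """Map each document to a dense TF-IDF vector over the fitted vocabulary."""
    vocab = sorted(idf)
    vectors = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        # Terms outside the fitted vocabulary are ignored; missing terms count as 0
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


# Toy stand-ins for the "lemma" columns (hypothetical data)
papers = [["cotton", "yield", "crop"], ["cotton", "fibre"]]
target = [["cotton", "price", "rise"], ["bank", "rate"]]

idf = fit_idf(papers)
vocab, training_vectors = transform(papers, idf)
_, target_vectors = transform(target, idf)
print(vocab)
print(target_vectors)
```

In this toy run the second target document shares no term with the scraped papers, so its vector is all zeros, which is the kind of document a one-class classifier trained on the papers would be expected to reject.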

Afterwards we can train and use the classifiers and choose the desired one:

o_svm_result = audo.one_class_svm(
    training=training_tfidf,
    predicting=target_tfidf,
    nus=np.round(np.arange(0.001, 0.5, 0.01), 7),
    quality_train=0.9,
    min_pred=0.001,
    max_pred=0.05,
)

result = audo.choose_classifier(preprocessed_target, o_svm_result, 0)
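
The call above tries a grid of nu values and passes three thresholds. In a standard one-class SVM, nu is an upper bound on the fraction of training points treated as outliers. The sketch below reproduces the grid without NumPy and shows one plausible reading of the thresholds: quality_train as the minimum share of training documents a classifier must label as in-class, and min_pred and max_pred as bounds on the share of target documents it labels as in-class. This reading is an assumption based on the parameter names, not something stated above, and the function names and candidate figures are hypothetical.

```python
import math


def nu_grid(start=0.001, stop=0.5, step=0.01, decimals=7):
    """Pure-Python equivalent of np.round(np.arange(start, stop, step), decimals)."""
    n = math.ceil((stop - start) / step)
    return [round(start + i * step, decimals) for i in range(n)]


def keep_candidates(candidates, quality_train, min_pred, max_pred):
    """Keep classifiers whose in-class shares satisfy all three thresholds (assumed semantics)."""
    return [
        c for c in candidates
        if c["train_share"] >= quality_train
        and min_pred <= c["pred_share"] <= max_pred
    ]


nus = nu_grid()

# Hypothetical results: share of documents each classifier labels as in-class
candidates = [
    {"nu": 0.001, "train_share": 0.99, "pred_share": 0.200},  # accepts too many target documents
    {"nu": 0.051, "train_share": 0.94, "pred_share": 0.030},  # satisfies all thresholds
    {"nu": 0.301, "train_share": 0.70, "pred_share": 0.004},  # rejects too much training data
]
kept = keep_candidates(candidates, quality_train=0.9, min_pred=0.001, max_pred=0.05)
print(len(nus), [c["nu"] for c in kept])
```

Under this reading, a narrow max_pred encodes the prior belief that the target class is rare in the dataset, which matches the imbalanced setting described in the summary.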

And finally you can estimate the topics of the data:

lda_target = audo.lda_modeling(data=result, num_topics=5)

audo.lda_visualize_topics(type="pyldavis")
