
With AuDoLab you can perform Latent Dirichlet Allocation on highly imbalanced datasets.

Project description

AuDoLab



Installation

Stable release

To install AuDoLab, run this command in your terminal:

$ pip install AuDoLab

This is the preferred method to install AuDoLab, as it will always install the most recent stable release.

If you don’t have pip installed, the Python installation guide walks you through the process.

From sources

The sources for AuDoLab can be downloaded from the GitHub repo.

You can either clone the public repository:

$ git clone https://github.com/ArneTillmann/AuDoLab

Or download the tarball:

$ curl -OJL https://github.com/ArneTillmann/AuDoLab/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install
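
To check that either installation method worked, one option is to ask Python for the installed version. The following is a minimal sketch using only the standard library; the helper name installed_version is made up for illustration, and it assumes the distribution is registered under the name AuDoLab, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


# Prints a version string if AuDoLab is installed, otherwise None.
print(installed_version("AuDoLab"))
```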

Usage

Before using AuDoLab, download the stopwords for nltk by running:

import nltk
nltk.download('stopwords')

inside a Python console. To use AuDoLab in a project:

import asyncio

import nest_asyncio
from AuDoLab import AuDoLab

# Allow nested event loops, e.g. when running inside a Jupyter notebook
nest_asyncio.apply()

Then create an instance of the AuDoLab class:

audo = AuDoLab.AuDoLab()

This example uses the publicly available Reuters corpus from the nltk package (if it is not yet available locally, fetch it with nltk.download('reuters')):

from nltk.corpus import reuters
import numpy as np
import pandas as pd

data = []

for fileid in reuters.fileids():
    # File ids have the form "<split>/<name>"; only the name is kept
    _, filename = fileid.split("/")
    categories = ", ".join(reuters.categories(fileid))
    data.append((filename, categories, reuters.raw(fileid)))

data = pd.DataFrame(data, columns=["filename", "categories", "text"])
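
To make the shape of the resulting rows concrete, here is a small sketch that runs the same loop on invented stand-in data instead of the Reuters corpus. The dictionaries sample_categories and sample_text and the helper build_rows are hypothetical and exist only for this illustration; they replace reuters.categories() and reuters.raw().

```python
# Hypothetical stand-ins for the corpus; the contents are invented.
sample_categories = {
    "training/1": ["cotton", "grain"],
    "test/2": ["trade"],
}
sample_text = {
    "training/1": "Cotton exports rose last month.",
    "test/2": "Trade talks resumed on Monday.",
}


def build_rows(fileids, categories, raw):
    """Build (filename, categories, text) tuples, one per file id."""
    rows = []
    for fileid in fileids:
        # Drop the part before the slash and keep only the file name
        _, filename = fileid.split("/")
        rows.append((filename, ", ".join(categories[fileid]), raw[fileid]))
    return rows


rows = build_rows(sample_categories, sample_categories, sample_text)
for row in rows:
    print(row)
```

Each tuple maps onto the columns "filename", "categories" and "text" of the DataFrame above; the "text" column is the one passed to preprocessing later on.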

Then scrape abstracts, e.g. from IEEE, with the abstract scraper:

async def scrape():
    return await audo.scrape_abstracts(
        url=None, keywords=["cotton"], in_data="all_meta", pages=5
    )

scraped_documents = asyncio.get_event_loop().run_until_complete(scrape())
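
Because scrape_abstracts is awaited, it has to be driven by an event loop. The sketch below shows the same pattern with a stand-in coroutine so that it runs without network access; fake_scrape and its return values are hypothetical and say nothing about what AuDoLab actually returns. It uses asyncio.run, the usual entry point in a plain script, whereas the snippet above uses run_until_complete together with nest_asyncio so that it also works when a loop is already running.

```python
import asyncio


async def fake_scrape(keywords, pages):
    """Hypothetical stand-in for a scraper coroutine; performs no network I/O."""
    await asyncio.sleep(0)  # yield control to the event loop once
    return [f"{kw}-page-{p}" for kw in keywords for p in range(1, pages + 1)]


async def main():
    # Awaiting the coroutine returns its result once it has finished
    return await fake_scrape(keywords=["cotton"], pages=2)


# Start an event loop, run main() to completion, and close the loop
results = asyncio.run(main())
print(results)
```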

Both the target data and the scraped papers need to be preprocessed and converted to tf-idf features before they are used in the classifier:

preprocessed_target = audo.preprocessing(data=data, column="text")

preprocessed_paper = audo.preprocessing(
    data=scraped_documents, column="text")

target_tfidf, training_tfidf = audo.tf_idf(
    data=preprocessed_target,
    papers=preprocessed_paper,
    data_column="lemma",
    papers_column="lemma",
    features=100000,
)
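
For readers unfamiliar with tf-idf, the following pure-Python sketch illustrates the general idea: a term's weight grows with its count in a document and shrinks the more documents contain it. It is not the implementation behind audo.tf_idf. In particular, fitting the vocabulary on the papers only and then applying it to both collections is an assumption made for this example, and the function names and toy documents are invented.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute smoothed inverse document frequencies from tokenized docs."""
    n_docs = len(docs)
    # Count in how many documents each term occurs
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1
            for t, df in doc_freq.items()}


def transform(docs, idf):
    """Map each tokenized doc to a {term: count * idf} dict.

    Terms outside the fitted vocabulary are ignored.
    """
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in idf)
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors


# Toy, already tokenized documents (invented for illustration)
papers = [["cotton", "yield", "model"], ["cotton", "fiber"]]
target = [["cotton", "price", "cotton"], ["oil", "price"]]

idf = fit_idf(papers)
training_vectors = transform(papers, idf)
target_vectors = transform(target, idf)
print(target_vectors)
```

In this toy setup the second target document shares no vocabulary with the papers and ends up as an empty vector, which is the kind of signal a one-class classifier trained on the papers can use to reject it.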

Afterwards we can train and use the classifiers and choose the desired one:

classifier = audo.one_class_svm(
    training=training_tfidf,
    predicting=target_tfidf,
    nus=np.round(np.arange(0.01, 0.5, 0.01), 7),
    quality_train=0.9,
    min_pred=0.001,
    max_pred=0.05,
)
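
In a one-class SVM, nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so the call above tries a grid of values between 0.01 and 0.49. The sketch below builds a comparable grid and filters some candidate results by the three thresholds. The reading of quality_train, min_pred and max_pred used here (minimum share of training papers accepted, and minimum and maximum share of target documents accepted) is an assumption, not taken from the AuDoLab source, and the candidate numbers are invented.

```python
# nu grid comparable to the one in the call above: 0.01, 0.02, ..., 0.49.
# Built from integers to avoid floating-point drift.
nus = [round(k / 100, 7) for k in range(1, 50)]


def keep_candidate(train_rate, pred_rate, quality_train=0.9,
                   min_pred=0.001, max_pred=0.05):
    """Assumed selection rule, not taken from the AuDoLab source."""
    return train_rate >= quality_train and min_pred <= pred_rate <= max_pred


# Hypothetical (nu, train_rate, pred_rate) results, invented for illustration
candidates = [
    (0.01, 0.99, 0.20),  # accepts too large a share of the target data
    (0.05, 0.95, 0.03),  # satisfies all three thresholds
    (0.30, 0.70, 0.01),  # accepts too small a share of the training data
]

selected = [nu for nu, tr, pr in candidates if keep_candidate(tr, pr)]
print(len(nus), selected)
```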

df_data = audo.choose_classifier(preprocessed_target, classifier, 2)

Finally, you can estimate the topics of the data:

audo.lda_modeling(df_data, num_topics=2)

# Render the topic visualization and save it as an HTML file
visualization = audo.lda_visualize_topics()
html = visualization.data
with open('html_file.html', 'w') as f:
    f.write(html)

Download files

Source distribution: AuDoLab-0.1.15.tar.gz (5.2 MB)

Built distribution: AuDoLab-0.1.15-py2.py3-none-any.whl (14.8 kB, Python 2 and Python 3)
