
TweetBench

TweetBench allows you to queue, run, and benchmark Tweet classification experiments with minimal configuration. TweetBench imports libraries and utilities, loads data, gathers experiments, executes each pipeline on five different train/test splits, evaluates and averages the scores, compares them to a baseline, and generates a submission file for you. All you have to do is add and modify the pipelines in the ./expirements/ folder (Jupyter Notebooks or Python scripts) with your parameters.
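The evaluate-on-multiple-splits-and-average loop described above can be sketched with plain scikit-learn. This is an illustration of the idea, not TweetBench's actual internals; the toy tweets and labels are made up, and only three splits are used here because the toy dataset is tiny (TweetBench uses five):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in data: a few toy tweets with binary conspiracy/non-conspiracy labels.
texts = ["5g towers cause it", "just a normal tweet", "they are hiding it",
         "lunch was great", "the signal controls us", "nice weather today"]
labels = [1, 0, 1, 0, 1, 0]

# The same kind of pipeline an experiment file would define.
pipeline = Pipeline([("vectorizer", CountVectorizer()),
                     ("classifier", LogisticRegression())])

# Score the pipeline on stratified train/test splits and average the results,
# which is the comparison TweetBench reports against the baseline.
cv = StratifiedKFold(n_splits=3)
scores = cross_val_score(pipeline, texts, labels, cv=cv)
mean_score = scores.mean()
```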

Prerequisites

  • Python 3
  • setuptools
  • wheel
  • virtualenv (optional)

Requirements (included in installation)

  • Jupyter Notebook
  • matplotlib
  • pandas
  • scikit-learn

Installation

Clone this repository

git clone git@git.txstate.edu:CS7311/a-m730.git # or https://git.txstate.edu/CS7311/a-m730.git

cd a-m730/Project/source

It is recommended that you work in a virtual environment:

python -m virtualenv tweetbench_env && source tweetbench_env/bin/activate

Run installation:

python3 -m pip install --index-url https://test.pypi.org/simple/ --no-deps TweetBench-andrewmagill

Run Benchmark Pipeline

Start Jupyter Notebook:

jupyter notebook

Open and execute benchmark.ipynb to run the experiments contained in ./expirements/. To add a new experiment to the queue, simply add another Jupyter Notebook or Python script to the ./expirements/ directory and re-run the notebook. Results will be displayed in the benchmark.ipynb notebook and written to the ./output/ directory.

Creating New Experiments

TweetBench will run pipelines found in any Jupyter Notebook or Python script (.py file) in the ./expirements/ folder. There is one requirement: for an experiment to run, it must be written as a scikit-learn Pipeline (documentation and examples can be found here).

Example: the simplest possible pipeline, which should serve as the baseline for most of your experiments:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([('vectorizer', CountVectorizer()), ('classifier', LogisticRegression())])

You may also want to include metadata for your experiment. This step is optional, but necessary if you want to designate a pipeline as the baseline for comparison. Your metadata variable must be named META, must be structured as a Python dictionary, and may contain the following fields: name: str, description: str, baseline: bool. Your pipeline's parameters will be inserted into the metadata and output along with your experiment's evaluation scores and predictions.

Example metadata:

META = {
    "name": "fine-grained logreg text classifier",
    "description": "Fine grained four classification: 5G Conspiracy, Other-Conspiracy, Non-conspiracy, Indeterminate",
    "baseline": False
}
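Putting the pieces together, a complete experiment file dropped into ./expirements/ might look like the sketch below. The file name, the TfidfVectorizer choice, and all parameter values are illustrative assumptions, not part of TweetBench; per the requirements above, the file only needs to define a scikit-learn Pipeline (and, optionally, META):

```python
# Hypothetical experiment file, e.g. ./expirements/tfidf_logreg.py
# (file name and vectorizer choice are illustrative, not from TweetBench).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Optional metadata; baseline=False means this run is compared *against*
# whichever experiment is marked as the baseline.
META = {
    "name": "tfidf logreg text classifier",
    "description": "Binary classification: conspiracy vs. non-conspiracy",
    "baseline": False,
}

# The whole experiment is expressed as a scikit-learn Pipeline,
# which is the one hard requirement for an experiment file.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2))),
    ("classifier", LogisticRegression(max_iter=1000)),
])
```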

MediaEval 2020: FakeNews

The code used for coarse- and fine-grained text classification, classification augmented by OCR on Tweet images, and Lia Nogueria's community labels is included in the ./expirements/ folder.

Note: These experiments are run by the benchmark.ipynb notebook, which imports libraries, loads data, gathers pipelines, and outputs results.
