Skip to main content

Short Text Mining

Project description

## Introduction

This package shorttext is a Python package that facilitates supervised and unsupervised learning for short text categorization. Due to the sparseness of words and the lack of information carried in the short texts themselves, an intermediate representation of the texts and documents are needed before they are put into any classification algorithm. In this package, it facilitates various types of these representations, including topic modeling and word-embedding algorithms.

Since release 1.0.0, shorttext runs on Python 2.7, 3.5, and 3.6. Since release 1.0.7, it runs on Python 3.7 as well, but the backend for keras cannot be TensorFlow. Since release 1.0.8, it runs on Python 3.7 with ‘TensorFlow’ being the backend for keras.

Characteristics:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification;
  • neural network classification (including ConvNet, and C-LSTM);
  • maximum entropy classification;
  • metrics of phrases differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover’s distance (WMD);
  • character-level sequence-to-sequence (seq2seq) learning; and
  • spell correction.

## Documentation

Documentation and tutorials for shorttext can be found here: [http://shorttext.rtfd.io/](http://shorttext.rtfd.io/).

See [tutorial](http://shorttext.readthedocs.io/en/latest/tutorial.html) for how to use the package, and [FAQ](https://shorttext.readthedocs.io/en/latest/faq.html).

## Installation

To install it, in a console, use pip.

` >>> pip install -U shorttext `

or, if you want the most recent development version on Github, type

` >>> pip install -U git+https://github.com/stephenhky/PyShortTextCategorization@master `

Developers are advised to make sure Keras >=2 be installed. Users are advised to install the backend Tensorflow (preferred) or Theano in advance. It is desirable if Cython has been previously installed too.

Before using, check the language model of spaCy has been installed or updated, by running:

` >>> python -m spacy download en `

See [installation guide](https://shorttext.readthedocs.io/en/latest/install.html) for more details.

## Issues

To report any issues, go to the [Issues](https://github.com/stephenhky/PyShortTextCategorization/issues) tab of the Github page and start a thread. It is welcome for developers to submit pull requests on their own to fix any errors.

## Contributors

If you would like to contribute, feel free to submit the pull requests. You can talk to me in advance through e-mails or the [Issues](https://github.com/stephenhky/PyShortTextCategorization/issues) page.

## Useful Links

## News

  • 02/14/2019: shorttext 1.0.8 released.
  • 01/30/2019: shorttext 1.0.7 released.
  • 01/29/2019: shorttext 1.0.6 released.
  • 01/13/2019: shorttext 1.0.5 released.
  • 10/03/2018: shorttext 1.0.4 released.
  • 08/06/2018: shorttext 1.0.3 released.
  • 07/24/2018: shorttext 1.0.2 released.
  • 07/17/2018: shorttext 1.0.1 released.
  • 07/14/2018: shorttext 1.0.0 released.
  • 06/18/2018: shorttext 0.7.2 released.
  • 05/30/2018: shorttext 0.7.1 released.
  • 05/17/2018: shorttext 0.7.0 released.
  • 02/27/2018: shorttext 0.6.0 released.
  • 01/19/2018: shorttext 0.5.11 released.
  • 01/15/2018: shorttext 0.5.10 released.
  • 12/14/2017: shorttext 0.5.9 released.
  • 11/08/2017: shorttext 0.5.8 released.
  • 10/27/2017: shorttext 0.5.7 released.
  • 10/17/2017: shorttext 0.5.6 released.
  • 09/28/2017: shorttext 0.5.5 released.
  • 09/08/2017: shorttext 0.5.4 released.
  • 09/02/2017: end of GSoC project. ([Report](https://rare-technologies.com/chinmayas-gsoc-2017-summary-integration-with-sklearn-keras-and-implementing-fasttext/))
  • 08/22/2017: shorttext 0.5.1 released.
  • 07/28/2017: shorttext 0.4.1 released.
  • 07/26/2017: shorttext 0.4.0 released.
  • 06/16/2017: shorttext 0.3.8 released.
  • 06/12/2017: shorttext 0.3.7 released.
  • 06/02/2017: shorttext 0.3.6 released.
  • 05/30/2017: GSoC project ([Chinmaya Pancholi](https://rare-technologies.com/google-summer-of-code-2017-week-1-on-integrating-gensim-with-scikit-learn-and-keras/), with [gensim](https://radimrehurek.com/gensim/))
  • 05/16/2017: shorttext 0.3.5 released.
  • 04/27/2017: shorttext 0.3.4 released.
  • 04/19/2017: shorttext 0.3.3 released.
  • 03/28/2017: shorttext 0.3.2 released.
  • 03/14/2017: shorttext 0.3.1 released.
  • 02/23/2017: shorttext 0.2.1 released.
  • 12/21/2016: shorttext 0.2.0 released.
  • 11/25/2016: shorttext 0.1.2 released.
  • 11/21/2016: shorttext 0.1.1 released.

## Possible Future Updates

  • [ ] More scalability;
  • [ ] Including BERT models;
  • [ ] Dividing components to other packages;
  • [ ] More available corpus.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for shorttext, version 1.0.8
Filename, size File type Python version Upload date Hashes
Filename, size shorttext-1.0.8.tar.gz (229.9 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page