Text preprocessing, representation and visualization from zero to hero.

From zero to hero

Texthero is a Python toolkit that helps you work with text-based datasets quickly and effortlessly. Texthero is simple to learn and designed to be used on top of Pandas.

You can think of Texthero as a tool to help you understand and work with text-based datasets. Given a tabular dataset, it's easy to grasp the main concepts. Given a text dataset, by contrast, it's harder to gain quick insights into the underlying data.

With Texthero, preprocessing text data, mapping it into vectors and visualizing the resulting vector space takes only a couple of lines.

Texthero is composed of only three Python modules (preprocessing.py, representation.py and visualization.py) and is well documented.

Installation

Install texthero via pip:

pip install texthero

☝️ Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, SpaCy and scikit-learn. You don't need to install them separately; pip will take care of that.

Getting started

The best way to learn Texthero is through the Getting Started docs.

If you are an advanced Python user, help(texthero) should do the job.

Example

Text preprocessing, TF-IDF representation and scatter visualization

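import texthero as hero
import pandas as pd

# Load the BBC Sport dataset (columns used below: 'text' and 'topic')
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Clean the text, compute TF-IDF vectors, then reduce them with PCA
df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

# Scatter plot of the PCA coordinates, colored by topic
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")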

Text preprocessing, TF-IDF, K-means and visualization

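import texthero as hero
import pandas as pd

# Load the BBC Sport dataset
df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Clean the text and compute TF-IDF vectors
df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

# Cluster the TF-IDF vectors into 5 groups and store the labels as strings
df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

# Reduce the TF-IDF vectors with PCA for plotting
df['pca'] = (
    df['tfidf']
    .pipe(hero.pca)
)

# Scatter plot of the PCA coordinates, colored by K-means cluster
hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")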

API

Texthero is composed of three modules: preprocessing.py, representation.py and visualization.py.

1. Preprocessing

Scope: prepare the text data for further analysis.

Complete documentation: preprocessing
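
To give a feel for what "preparing text" usually means, here is a minimal sketch using only the Python standard library. It is not Texthero's implementation: the function name clean_text and the exact steps (lowercasing, removing accents, digits and punctuation, collapsing whitespace) are assumptions chosen for illustration, and hero.clean may apply a different set of steps.

```python
import re
import string
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleaner, for illustration only (not hero.clean)."""
    text = text.lower()
    # Decompose accented characters and drop the combining marks (é -> e)
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace runs of digits with a space
    text = re.sub(r"\d+", " ", text)
    # Replace every ASCII punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Collapse repeated whitespace and trim both ends
    return " ".join(text.split())


docs = ["Café  scored 3 goals!", "  It's a WIN, 2-0.  "]
cleaned = [clean_text(d) for d in docs]
print(cleaned)
```

On a Pandas Series, a function like this could be applied with `series.apply(clean_text)`.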

2. Representation

Scope: map text data into vectors and do dimensionality reduction.

Supported representation algorithms:

  1. Term frequency, inverse document frequency (do_tfidf)
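
As a rough illustration of the idea behind TF-IDF, the sketch below computes the textbook weight tf × log(N / df) with the standard library only. The helper name tfidf_weights and the toy documents are made up for this example. Libraries such as scikit-learn use smoothed and normalized variants of the formula, so the numbers will not match Texthero's output.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["football match win", "tennis match win", "football transfer news"]
weights = tfidf_weights(docs)
for doc, vec in zip(docs, weights):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in few documents (here "tennis") receive a higher weight than terms shared by several documents (here "match").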

Supported dimensionality reduction algorithms:

  1. Principal component analysis (do_pca)
  2. Non-negative matrix factorization (do_nmf)
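
To show what principal component analysis does, here is a small pure-Python sketch that finds the first principal component with power iteration on the covariance matrix. The function name and the toy points are assumptions for illustration, and this is not how Texthero computes PCA. The sketch assumes the data has non-zero variance and that the starting vector is not orthogonal to the leading component.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, projections) for the top principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Coordinate of each centered point along the component
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
direction, projected = first_principal_component(points)
print(direction, projected)
```

Because the toy points lie on the line y = x, the recovered direction has equal components, and a single coordinate per point is enough to describe the data.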

Complete documentation: representation

3. Visualization

Scope: a collection of functions to summarize the main facts about the data and visualize the results. This part is very opinionated and ideal for anyone who needs a quick way to visualize text data on screen, for instance during text exploratory data analysis (EDA).

Most common functions:

  • Text scatterplot. Handy when coupled with dimensionality reduction algorithms such as pca.
  • Most common words
  • Most common words between two entities
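
The "most common words" idea can be sketched with collections.Counter from the standard library. The helper name top_words and the toy documents are assumptions for illustration. Texthero's own visualization functions are described in the documentation linked below and may behave differently.

```python
from collections import Counter


def top_words(docs, n=3):
    """Count whitespace-separated, lowercased words across all documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return counts.most_common(n)


docs = ["football match win", "tennis match win", "football transfer news"]
top = top_words(docs, n=3)
print(top)
```

In practice, the text would normally be cleaned first so that punctuation and very frequent function words do not dominate the counts.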

Complete documentation: visualization

Contributions

Pull requests are amazing and most welcome. Start by forking this repository and opening an issue.

Also, Texthero is looking for maintainers. If you are interested, just drop a line at jonathanbesomi__AT__gmail.com
