Text preprocessing, representation and visualization made easy.

Text preprocessing, representation and visualization from zero to hero

From zero to hero

Texthero is a Python toolkit for quick handling of text data. It is concise, simple to learn and integrates smoothly with Pandas.

Given a Pandas DataFrame with one or more text columns, texthero helps to preprocess the text data, map it into vectors using different algorithms and models, and visualize it on screen.

You can think of texthero as a utility tool for quickly understanding a text-based dataset. Given a tabular dataset such as stock predictions or best-selling items, it's easy to grasp the main insights, but given a text dataset, it's harder to quickly get an understanding of the underlying data. Texthero helps you with that.

Installation

pip install texthero

☝️ Under the hood, texthero makes use of multiple NLP/ML toolkits such as Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip will take care of that.

Getting started and examples

1. Preprocessing, tf-idf representation and visualization

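import texthero.texthero as hero
import pandas as pd

# columns must be list-like; a bare string such as 'text' raises an error in pandas
df = pd.DataFrame(["hello world", "hello", "world"], columns=["text"])
df = hero.do_preprocess(df)  # clean the text column
df = hero.do_tfidf(df)       # map the text to TF-IDF vectors
df = hero.do_pca(df)         # reduce the vectors with PCA for plotting
hero.scatterplot(df)         # draw the text scatterplot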

2. Most common words and top TF-IDF words

import texthero.texthero as hero
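
The example above stops at the import. As a rough illustration of what "most common words" means, here is a minimal sketch that does not use texthero at all: it relies only on the Python standard library, and both the `most_common_words` helper and the toy corpus are hypothetical names introduced for this illustration.

```python
from collections import Counter

# Hypothetical toy corpus, mirroring the strings used in the first example.
texts = ["hello world", "hello", "world"]


def most_common_words(documents, n=10):
    """Count whitespace-separated, lowercased tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    # most_common returns (word, count) pairs sorted by descending count.
    return counts.most_common(n)


top_words = most_common_words(texts, n=2)
print(top_words)
```

Splitting on whitespace is the simplest possible tokenizer; a real pipeline would normally run a preprocessing step first.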

3. Transformers representation and visualization [🔜]

import texthero.texthero as hero

Documentation

The structure of texthero and its documentation follow the same principle as texthero itself: to provide a simple tool for handling text data. We do our best to keep the code concise and simple to read and understand.

Texthero is composed of three main components: preprocessing.py, representation.py and visualization.py.

⚒️ 1. Preprocessing

Job: prepare the text data for further analysis.

Complete documentation: preprocessing
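
To give a feel for what "preparing text" can involve, the following is a minimal sketch of a few common cleaning steps (lowercasing, dropping punctuation and digits, collapsing whitespace). It is an assumption about typical preprocessing, not a description of what texthero's own preprocessing does, and `simple_preprocess` is a hypothetical helper name.

```python
import re
import string

# Translation table that deletes every punctuation character and digit.
_STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def simple_preprocess(text):
    """Lowercase, strip punctuation and digits, and collapse whitespace."""
    text = text.lower()
    text = text.translate(_STRIP_TABLE)
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


raw = ["Hello,   World!", "Texthero 1.0 is  simple."]
cleaned = [simple_preprocess(t) for t in raw]
print(cleaned)
```

With Pandas, a function like this could be applied to a text column through `Series.apply`.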

📒 2. Representation

Job: map text data into vectors and do dimensionality reduction.

Supported representation algorithms:

  1. Term frequency, inverse document frequency (do_tfidf)
  2. Word2Vec from Gensim [🔜]
  3. GloVe [🔜]
  4. Transformers [🔜]
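
As an illustration of the first item, here is a minimal sketch of the textbook TF-IDF weighting, tf * log(N / df), written with the standard library only. It is not texthero's implementation; the `tfidf` helper and the sample documents are hypothetical, and libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency is the share of the document taken by the term.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["hello world", "hello texthero", "pandas"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Terms that appear in fewer documents get a larger weight, which is why "world" outweighs "hello" in the first document.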

Supported dimensionality reduction algorithms:

  1. Principal component analysis (do_pca)
  2. Non-negative matrix factorization (do_nmf)

Complete documentation: representation

🔮 3. Visualization

Job: a collection of functions to summarize the main facts about the data and to visualize the results. This part is deliberately opinionated and suits anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA) of text.

Most common functions:

  • Text scatterplot. Handy when coupled with dimensionality reduction algorithms such as PCA.
  • Most common words
  • Most common words between two entities [🔜]

Complete documentation: visualization

Contributions

Any help, feedback and contributions are very welcome. Simply fork this repository and open an issue.
