Text preprocessing, representation and visualization made easy.
Text preprocessing, representation and visualization from zero to hero.
Texthero is a Python toolkit that helps you work with text-based datasets quickly and effortlessly. Texthero is very simple to learn and is designed to be used on top of Pandas.
You can think of Texthero as a tool to help you understand and work with text-based datasets. Given a tabular dataset, it's easy to grasp the main concepts. Given a text dataset, however, it's harder to get quick insights into the underlying data. With Texthero, preprocessing (cleaning) text, mapping words into vectors, and visualizing the reduced vector space takes only a couple of lines of code.
To use Texthero, you don't have to spend hours understanding complex and messy code. Texthero is composed of only three Python modules (preprocessing.py, representation.py, and visualization.py), and it is well documented. Texthero is also optimized to be fast.
Texthero is available on the Python Package Index and can be installed via
pip install texthero
☝️ Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip will take care of that for you.
The best way to learn Texthero is probably via the official Getting Started documentation. If you are an experienced Python user, help(texthero) should suffice.
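import texthero.texthero as hero
import pandas as pd

# columns must be a list of column names, not a bare string
df = pd.DataFrame(["hello world", "hello", "world"], columns=["text"])
df = hero.do_preprocess(df)  # clean the text
df = hero.do_tfidf(df)       # map the text to TF-IDF vectors
df = hero.do_pca(df)         # reduce the vectors with PCA
hero.scatterplot(df)         # plot the reduced vector space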
(The same example can also be found as a "getting-started guide" here: ...)
⚒️ 1. Preprocessing
Scope: prepare the text data for further analysis.
Complete documentation: preprocessing
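To give a feel for what text preparation usually involves, here is a minimal sketch using only the Python standard library. The function name clean_text and the specific steps (lowercasing, dropping digits and punctuation, collapsing whitespace) are assumptions chosen for illustration; they are not Texthero's actual pipeline.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Illustrative cleaning steps (hypothetical helper, not Texthero's API)."""
    text = text.lower()                        # normalize case
    text = re.sub(r"\d+", " ", text)           # drop digits
    text = text.translate(_PUNCT_TO_SPACE)     # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text

docs = ["Hello, World!!", "  Texthero   is FAST 100% "]
cleaned = [clean_text(d) for d in docs]
print(cleaned)
```

With a Pandas column, the same helper could be applied row by row, for example with df["text"].apply(clean_text).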
📒 2. Representation
Job: map text data into vectors and do dimensionality reduction.
Supported representation algorithms:
- Term frequency, inverse document frequency (TF-IDF)
- Word2Vec from Gensim [🔜]
- GloVe [🔜]
- Transformers [🔜]
Supported dimensionality reduction algorithms:
- Principal component analysis (PCA)
- Non-negative matrix factorization (NMF)
Complete documentation: representation
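As a rough illustration of what mapping text into vectors means, the sketch below computes term frequency, inverse document frequency weights with the standard library only. The function name tfidf is hypothetical, and the formula (tf = count / document length, idf = log(N / df)) is one common textbook variant assumed here; Texthero and the libraries it relies on may weight and normalize differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper).

    Assumes tf = raw count / document length and idf = log(N / df).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["hello world", "hello", "world"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, vec)
```

Each document becomes a sparse vector keyed by term; a dimensionality reduction step such as PCA would then operate on the dense matrix built from these weights.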
🔮 3. Visualization
Job: a collection of functions to summarize the main facts about the data and to visualize the results. This part is very opinionated and is ideal for anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA) of text.
Most common functions:
- Text scatterplot. Handy when coupled with dimensionality reduction algorithms such as PCA.
- Most common words
- Most common words between two entities [🔜]
Complete documentation: visualization
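The idea behind a "most common words" summary can be sketched in a few lines with collections.Counter. The helper name top_words and the whitespace tokenization are assumptions for illustration, not Texthero's implementation.

```python
from collections import Counter

def top_words(docs, n=3):
    """Return the n most frequent lowercase words across all documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return counts.most_common(n)

docs = ["hello world", "hello", "world hello"]
top = top_words(docs, n=2)
print(top)
```

The resulting (word, count) pairs are the kind of data a bar chart of most common words would be drawn from.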
Pull requests are amazing and most welcome.
Any help, feedback, and contributions are very welcome. You can simply fork this repository and open an issue.