Text preprocessing, representation and visualization made easy.
Text preprocessing, representation and visualization from zero to hero
From zero to hero
Texthero is a Python toolkit for working with text data quickly. It is concise, simple to learn, and integrates smoothly with Pandas.
Given a Pandas DataFrame with one or more text columns, texthero helps you preprocess the text, map it into vectors using different algorithms and models, and visualize the result on screen.
You can think of texthero as a utility for quickly understanding a text-based dataset. With a tabular dataset, such as stock predictions or best-selling items, the main insights are easy to grasp; with a text dataset, it is harder to get a quick understanding of the underlying data. Texthero helps you with that.
Installation
pip install texthero
☝️ Under the hood, texthero makes use of multiple NLP and ML toolkits such as Gensim, NLTK, SpaCy and Sklearn. You don't need to install them separately; pip will take care of that.
Getting started and examples
1. Preprocessing, tf-idf representation and visualization
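import texthero.texthero as hero
import pandas as pd

# Build a small DataFrame with a single text column.
# Note: `columns` must be a list, not a bare string.
df = pd.DataFrame(["hello world", "hello", "world"], columns=["text"])
df = hero.do_preprocess(df)  # clean the raw text
df = hero.do_tfidf(df)       # map the text to TF-IDF vectors
df = hero.do_pca(df)         # reduce dimensionality with PCA
hero.scatterplot(df)         # plot the result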
2. Most common words and top TF-IDF words
import texthero.texthero as hero
3. Transformers representation and visualization [🔜]
import texthero.texthero as hero
Documentation
Texthero's structure and documentation follow the same principle as the library itself: provide a simple tool for handling text data. We do our best to keep the code concise and simple to read and understand.
Texthero is composed of three main components: preprocessing.py, representation.py and visualization.py.
⚒️ 1. Preprocessing
Job: prepare the text data for further analysis.
Complete documentation: preprocessing
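To make "prepare the text data" concrete, here is a minimal, library-independent sketch of the kind of cleaning a preprocessing step usually performs. It uses only the standard library. The helper name clean_text and the exact steps (lowercasing, dropping digits and punctuation, collapsing whitespace) are assumptions for illustration and are not taken from texthero's preprocessing.py.

```python
import re
import string


def clean_text(text: str) -> str:
    """Hypothetical cleaning helper; not part of texthero's API."""
    text = text.lower()                                   # normalize case
    text = re.sub(r"\d+", " ", text)                      # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace


docs = ["Hello, World!", "  Texthero   makes 2 things EASY.  "]
cleaned = [clean_text(d) for d in docs]
print(cleaned)
```

Applied to a DataFrame, the same function could be mapped over a text column; which steps are worth applying depends on the dataset.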
📒 2. Representation
Job: map text data into vectors and do dimensionality reduction.
Supported representation algorithms:
- Term frequency, inverse document frequency (do_tfidf)
- Word2Vec from Gensim [🔜]
- GloVe [🔜]
- Transformers [🔜]
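As an illustration of the first item in the list above, the following sketch computes TF-IDF vectors by hand with the standard library. It assumes whitespace tokenization, relative term frequency, and the unsmoothed idf = log(N / df); do_tfidf may well use a different variant (for example a smoothed one), so treat the helper tfidf below as hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Hypothetical helper: returns (vocabulary, one TF-IDF vector per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[term] / total) * math.log(n_docs / df[term]) if total else 0.0
            for term in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf(["hello world", "hello", "world"])
print(vocab)
print(vectors)
```

Terms that appear in every document get a weight of zero under this formula, which is why smoothed variants are common in practice.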
Supported dimensionality reduction algorithms:
- Principal component analysis (do_pca)
- Non-negative matrix factorization (do_nmf)
Complete documentation: representation
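To show what principal component analysis does to a set of vectors, here is a small standard-library sketch that projects rows onto their top two principal components using power iteration with deflation. This is an illustrative technique chosen to keep the example dependency-free; it is not a description of how do_pca is implemented, and the helper name pca_2d is hypothetical. It assumes at least two rows and at least two columns, and a clear gap between the leading eigenvalues.

```python
import math


def pca_2d(rows, n_iter=200):
    """Hypothetical helper: project rows onto their top two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    def matvec(m, v):
        return [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]

    components = []
    for k in range(2):
        # Deterministic, normalized starting vector.
        v = [1.0 / (i + 1 + k) for i in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(n_iter):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left in this direction
                break
            v = [x / norm for x in w]
        eigval = sum(a * b for a, b in zip(v, matvec(cov, v)))
        components.append(v)
        # Deflate: remove the component just found from the covariance.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
            for c in centered]


points = [[2.0, 2.0], [-2.0, -2.0], [1.0, -1.0], [-1.0, 1.0]]
projected = pca_2d(points)
print(projected)
```

In this toy dataset most of the variance lies along the diagonal, so the first coordinate of each projected point carries the diagonal spread and the second carries the remainder. For real data, a numerical library is a better choice than hand-rolled power iteration.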
🔮 3. Visualization
Job: a collection of functions to summarize the main facts about the data and to visualize the results. This part is deliberately opinionated and suits anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA) of text.
Most common functions:
- Text scatterplot. Handy when coupled with dimensionality reduction algorithms such as PCA.
- Most common words
- Most common words between two entities [🔜]
Complete documentation: visualization
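The "most common words" summary listed above can be sketched with the standard library alone. The helper most_common_words is hypothetical and assumes lowercased, whitespace-separated tokens; texthero's own function may tokenize and present the counts differently.

```python
from collections import Counter


def most_common_words(docs, n=3):
    """Hypothetical helper: the n most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    return counts.most_common(n)


top = most_common_words(["hello world", "hello", "world hello"], n=2)
print(top)
```

Running this on cleaned text rather than raw text usually gives more useful counts, since punctuation and casing no longer split the same word into several entries.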
Contributions
Any help, feedback and contributions are very welcome. You can simply fork this repository and open an issue.