
Let's go and play with text!


TextGo

TextGo is a Python package for working with text data conveniently and efficiently. It is an NLP toolkit that provides APIs for text preprocessing, representation, similarity calculation, text search, and classification. It supports both English and Chinese.

Highlights

  • Supports both English and Chinese in text preprocessing
  • Provides various text representation algorithms, including BOW, TF-IDF, LDA, LSA, PCA, Word2Vec/GloVe/FastText, and BERT
  • Supports fast text search based on Faiss
  • Supports various text classification algorithms, including FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, and XLNet
  • Easy to use in just a few lines of code

Installing

Install and update using pip:
pip install textgo

Note: successfully tested on Python 3.
Tip: the fasttext package needs to be installed manually as follows:

git clone https://github.com/facebookresearch/fastText.git
cd fastText
make
pip install .

Getting Started

1. Text preprocessing

Clean text

from textgo import Preprocess
# Chinese
tp1 = Preprocess(lang='zh')
texts1 = ["<text>自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。</text>", "??文本预处理~其实很简单!"]
ptexts1 = tp1.clean(texts1)
print(ptexts1)

Output: ['自然语言处理是计算机科学领域与人工智能领域中的一个重要方向', '文本预处理其实很简单']

# English
tp2 = Preprocess(lang='en')
texts2 = ["<text>Natural Language Processing, usually shortened as NLP, is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language</text>"]
ptexts2 = tp2.clean(texts2)
print(ptexts2)

Output: ['natural language processing usually shortened as nlp is a branch of artificial intelligence that deals with the interaction between computers and humans using the natural language']
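To make the cleaning step concrete, here is a minimal sketch of the kind of normalization shown above: stripping markup, dropping punctuation, and lowercasing. The function clean_text is a hypothetical stand-in written for illustration; it is not TextGo's implementation, and its handling of edge cases may differ from Preprocess.clean.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaning step for illustration; not TextGo's own code."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop markup such as <text> or </text>
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.lower()

cleaned = clean_text("<text>Natural Language Processing, usually shortened as NLP!</text>")
print(cleaned)
```

In Python 3, \w matches Unicode word characters, so Chinese characters survive the punctuation filter while symbols such as 。 are dropped.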

Tokenize and drop stopwords

# Chinese
tokens1 = tp1.tokenize(ptexts1)
print(tokens1)

Output: [['自然语言', '处理', '计算机科学', '领域', '人工智能', '领域', '中', '重要', '方向'], ['文本', '预处理', '其实', '很', '简单']]

# English
tokens2 = tp2.tokenize(ptexts2)
print(tokens2)

Output: [['natural', 'language', 'processing', 'usually', 'shortened', 'nlp', 'branch', 'artificial', 'intelligence', 'deals', 'interaction', 'computers', 'humans', 'using', 'natural', 'language']]
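As a rough illustration of tokenizing and dropping stopwords for English, the sketch below splits on whitespace and filters against a tiny, made-up stopword set. Both the tokenize function and the STOPWORDS set are assumptions for this example; TextGo ships its own tokenizer and stopword lists, and Chinese text needs a dedicated word segmenter because it has no spaces between words.

```python
# Tiny illustrative stopword set; a real list is much longer.
STOPWORDS = {"a", "as", "is", "of", "that", "the", "with", "and", "between"}

def tokenize(text, stopwords=STOPWORDS):
    """Whitespace tokenization followed by stopword filtering (English only)."""
    return [tok for tok in text.split() if tok not in stopwords]

tokens = tokenize("natural language processing is a branch of artificial intelligence")
print(tokens)

# Joining the tokens back with spaces gives the form used by the later steps.
joined = " ".join(tokens)
```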

Preprocess (Clean + Tokenize + Remove stopwords + Join words)

# Chinese
ptexts1 = tp1.preprocess(texts1)
print(ptexts1)

Output: ['自然语言 处理 计算机科学 领域 人工智能 领域 中 重要 方向', '文本 预处理 其实 很 简单']

# English
ptexts2 = tp2.preprocess(texts2)
print(ptexts2)

Output: ['natural language processing usually shortened nlp branch artificial intelligence deals interaction computers humans using natural language']

2. Text representation

from textgo import Embeddings
# Preprocessed texts: tokens joined by spaces
ptexts = ['自然语言 处理 计算机科学 领域 人工智能 领域 中 重要 方向', '文本 预处理 其实 很 简单']
emb = Embeddings()
# BOW
bow_emb = emb.bow(ptexts)

# TF-IDF
tfidf_emb = emb.tfidf(ptexts)

# LDA
lda_emb = emb.lda(ptexts, dim=2)

# LSA
lsa_emb = emb.lsa(ptexts, dim=2)

# PCA
pca_emb = emb.pca(ptexts, dim=2)

# Word2Vec
w2v_emb = emb.word2vec(ptexts, method='word2vec', model_path='model/word2vec.bin')

# GloVe
glove_emb = emb.word2vec(ptexts, method='glove', model_path='model/glove.bin')

# FastText
ft_emb = emb.word2vec(ptexts, method='fasttext', model_path='model/fasttext.bin')

# BERT
bert_emb = emb.bert(ptexts, model_path='model/bert-base-chinese')
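For readers unfamiliar with the simplest of these representations, the following sketch computes bag-of-words counts and an unsmoothed TF-IDF weighting in plain Python. The helper names bow and tfidf and the toy documents are assumptions for illustration; TextGo's own methods may use a different weighting or normalization.

```python
import math
from collections import Counter

docs = ["natural language processing", "language models process text"]
tokenized = [d.split() for d in docs]
# Shared vocabulary, sorted so that every vector uses the same column order.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow(tokens):
    """Term counts over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

def tfidf(tokens):
    """Term frequency times plain (unsmoothed) inverse document frequency."""
    n_docs = len(tokenized)
    vec = []
    for term, count in zip(vocab, bow(tokens)):
        df = sum(term in doc for doc in tokenized)  # documents containing the term
        vec.append(count / len(tokens) * math.log(n_docs / df))
    return vec

bow_vectors = [bow(doc) for doc in tokenized]
tfidf_vectors = [tfidf(doc) for doc in tokenized]
print(vocab)
print(bow_vectors)
```

A term that appears in every document, such as "language" here, gets a TF-IDF weight of zero under this unsmoothed formula.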

Tip: For methods such as Word2Vec and BERT, you can load the model once and then compute embeddings, which avoids reloading the model on every call. Take BERT as an example:

emb.load_model(method="bert", model_path='model/bert-base-chinese')
bert_emb1 = emb.bert(ptexts1)
bert_emb2 = emb.bert(ptexts2)

3. Similarity calculation

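TextGo supports calculating the similarity or distance between texts based on the text representations mentioned above. For example, we can use BERT sentence embeddings to compute the cosine similarity between two lists of sentences pair by pair (the i-th sentence of one list against the i-th sentence of the other).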

from textgo import TextSim
texts1 = ["她的笑渐渐变少了。","最近天气晴朗适合出去玩!"]
texts2 = ["她变得越来越不开心了。","近来总是风雨交加没法外出!"]

ts = TextSim(lang='zh', method='bert', model_path='model/bert-base-chinese')
sim = ts.similarity(texts1, texts2, mutual=False)
print(sim)

Output: [0.9143135, 0.7350756]

We can also calculate the similarity between every pair of sentences across the two datasets by setting mutual=True.

sim = ts.similarity(texts1, texts2, mutual=True)
print(sim)

Output: array([[0.9143138 , 0.772496 ], [0.704296 , 0.73507595]], dtype=float32)
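The difference between the two modes can be sketched with toy vectors in plain Python. The similarity function below is a hypothetical illustration operating on ready-made embeddings, not TextSim.similarity itself: without mutual it compares the lists element by element, and with mutual=True it returns the full matrix whose diagonal holds the element-wise results.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similarity(emb1, emb2, mutual=False):
    """Hypothetical helper: pairwise scores, or the full matrix when mutual=True."""
    if mutual:
        return [[cosine(u, v) for v in emb2] for u in emb1]
    return [cosine(u, v) for u, v in zip(emb1, emb2)]

# Toy 2-dimensional "embeddings" standing in for real sentence vectors.
emb1 = [[1.0, 0.0], [0.0, 1.0]]
emb2 = [[1.0, 0.0], [1.0, 1.0]]

pairwise = similarity(emb1, emb2)             # one score per aligned pair
matrix = similarity(emb1, emb2, mutual=True)  # len(emb1) x len(emb2) scores
print(pairwise)
print(matrix)
```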

4. Text search

TextGo also supports searching for query texts in a large text database based on cosine similarity or Euclidean distance. It provides two implementations: a normal one, suitable for small datasets, and an optimized one, based on Faiss and suitable for large datasets.

from textgo import TextSim
# query texts
texts1 = ["A soccer game with multiple males playing."]
# database
texts2 = ["Some men are playing a sport.", "A man is driving down a lonely road.", "A happy woman in a fairy costume holds an umbrella."]
ts = TextSim(lang='en', method='word2vec', model_path='model/word2vec.bin')

Normal search

res = ts.get_similar_res(texts1, texts2, metric='cosine', threshold=0.5, topn=2)
print(res)

Output: [[(0, 'Some men are playing a sport.', 0.828474), (1, 'A man is driving down a lonely road.', 0.60927737)]]

Fast search

ts.build_index(texts2, metric='cosine')
res = ts.search(texts1, threshold=0.5, topn=2)
print(res)

Output: [[(0, 'Some men are playing a sport.', 0.828474), (1, 'A man is driving down a lonely road.', 0.60927737)]]
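Conceptually, the normal search is a brute-force scan: score every database entry against the query, keep those above the threshold, and return the top results. The sketch below shows that idea on toy vectors. The search function, the inclusive threshold comparison, and the vectors are assumptions for illustration and do not reproduce TextGo's or Faiss's internals; only the (index, text, score) result shape follows the output above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search(query_vecs, db_vecs, db_texts, threshold=0.5, topn=2):
    """Hypothetical brute-force search returning (index, text, score) tuples."""
    results = []
    for q in query_vecs:
        scored = [(i, db_texts[i], cosine(q, v)) for i, v in enumerate(db_vecs)]
        hits = [hit for hit in scored if hit[2] >= threshold]  # assumed inclusive cut-off
        hits.sort(key=lambda hit: hit[2], reverse=True)        # best match first
        results.append(hits[:topn])
    return results

# Toy vectors standing in for real text embeddings.
db_texts = ["a", "b", "c"]
db_vecs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
res = search([[1.0, 0.2]], db_vecs, db_texts, threshold=0.5, topn=2)
print(res)
```

The scan costs one similarity computation per database entry per query, which is why an index-based approach becomes attractive for large databases.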

5. Text classification

Train a text classifier in just a few lines of code. Supported models: FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet.

from textgo import Classifier

# Prepare data
X = [text1, text2, ... textn]
y = [label1, label2, ... labeln]

# load config
config_path = "./config.ini"  # Include all model parameters
model_name = "Bert" # Supported models: FastText, TextCNN, TextRNN, TextRCNN, TextRCNN_Att, Bert, XLNet
# Note: load_config is not imported in this snippet; import it from TextGo before this call
args = load_config(config_path, model_name)
args['model_name'] = model_name 
args['save_path'] = "output/%s"%model_name

# train 
clf = Classifier(args) 
clf.train(X, y, evaluate_test=False)  # With evaluate_test=True, 10% of the data is held out as a test set and evaluated.

# predict
predclass = clf.predict(X)
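The layout of config.ini is not shown here. As a sketch only, assuming the file has one section per model name, the standard-library configparser can read such a section into a dictionary of parameters. The function load_ini_section and the keys learning_rate and batch_size are hypothetical and are not taken from TextGo.

```python
import configparser

def load_ini_section(config_text, section):
    """Hypothetical loader: return one INI section as a dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser[section])

# Made-up example content; the real config.ini may use different keys.
EXAMPLE_INI = """
[Bert]
learning_rate = 2e-5
batch_size = 32
"""

model_name = "Bert"
args = load_ini_section(EXAMPLE_INI, model_name)
args["model_name"] = model_name
args["save_path"] = "output/%s" % model_name
print(args)
```

Note that configparser returns every value as a string, so numeric parameters need an explicit conversion such as int(args["batch_size"]).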

Resources

1. Pretrained word embeddings

Chinese

  1. Various Chinese word vectors: https://github.com/Embedding/Chinese-Word-Vectors
  2. Tencent AI Lab Chinese word vectors: https://ai.tencent.com/ailab/nlp/en/embedding.html

English

  1. GloVe: https://nlp.stanford.edu/projects/glove/
  2. FastText: https://fasttext.cc/docs/en/english-vectors.html
  3. Word2Vec: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

2. Pretrained models

https://huggingface.co/models

LICENSE

TextGo is MIT-licensed.
