connlp
A collection of Python code for analyzing text data in the construction industry.
It mainly repackages existing Python libraries for Natural Language Processing (NLP).
Project Information
- Supported by C!LAB (@Seoul Nat'l Univ.)
Contributors
- Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
- Gitaek Lee (lgt0427@snu.ac.kr)
- Taeyeon Chang (jgwoon1838@snu.ac.kr, a.k.a. Kowoon Chang)
- Sehwan Chung (hwani751@snu.ac.kr)
Initialize
Setup
pip install connlp
Test
If the code below runs with no error, connlp is installed successfully.
from connlp.test import hello
hello()
# 'Helloworld'
Preprocess
The preprocessing module supports English and Korean.
NOTE: There are no plans to support other languages (as of 2021-04-02).
Normalizer
Normalizer normalizes the input text by removing unwanted characters and keeping only numbers, letters, and punctuation marks.
from connlp.preprocess import Normalizer
normalizer = Normalizer()
normalizer.normalize(text='I am a boy!')
# 'i am a boy'
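To make the normalization step concrete, here is a minimal standard-library sketch that reproduces the example output above (lowercased, with the exclamation mark dropped). The function name normalize_sketch and the exact set of kept characters are assumptions for illustration, not connlp's implementation, and the sketch handles English only.

```python
import re


def normalize_sketch(text: str) -> str:
    """Lowercase the text and keep only ASCII letters, digits, and single spaces.

    Hypothetical helper: the characters connlp's Normalizer keeps may differ.
    """
    lowered = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    kept = re.sub(r'[^a-z0-9\s]', ' ', lowered)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r'\s+', ' ', kept).strip()


print(normalize_sketch('I am a boy!'))
# 'i am a boy'
```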
EnglishTokenizer
EnglishTokenizer splits English input text into tokens on whitespace.
N-gram-based tokenization is in preparation.
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
tokenizer.tokenize(text='I am a boy!')
# ['I', 'am', 'a', 'boy!']
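Whitespace tokenization, and the n-gram step that is still in preparation, can be sketched with plain Python. The names tokenize_sketch and ngrams_sketch, and the underscore used to join n-gram parts, are hypothetical choices for illustration; they are not part of connlp.

```python
from typing import List


def tokenize_sketch(text: str) -> List[str]:
    """Split on any run of whitespace; punctuation stays attached to the word."""
    return text.split()


def ngrams_sketch(tokens: List[str], n: int = 2) -> List[str]:
    """Join every window of n consecutive tokens with an underscore."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize_sketch('I am a boy!')
print(tokens)
# ['I', 'am', 'a', 'boy!']
print(ngrams_sketch(tokens, n=2))
# ['I_am', 'am_a', 'a_boy!']
```

Because punctuation stays attached to tokens ('boy!'), running Normalizer before tokenizing is likely the intended order.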
Embedding
Vectorizer
Vectorizer provides several widely used text embedding methods: TF-IDF, Word2Vec, and Doc2Vec.
tfidf
TF-IDF is one of the most commonly used techniques for turning text into numeric vectors.
The TF-IDF model computes the term frequency (TF) and inverse document frequency (IDF) from the given documents.
The method returns the following:
- TF-IDF Vectorizer (an instance of sklearn.feature_extraction.text.TfidfVectorizer)
- TF-IDF Matrix
- TF-IDF Vocabulary
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)
# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>
The user can get a document vector by indexing the tfidf_matrix.
tfidf_matrix[0]
# (0, 2) 0.444514311537431
# (0, 0) 0.34520501686496574
# (0, 1) 0.5844829010200651
# (0, 5) 0.5844829010200651
The tfidf_vocab maps every token to its column index in the matrix.
print(tfidf_vocab)
# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
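The numbers in the example above are consistent with scikit-learn's default weighting: a smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The sketch below recomputes them with the standard library only. The function name tfidf_sketch and the lowercase-and-split tokenization are assumptions for illustration; connlp's internal settings may differ.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_sketch(docs: List[str]) -> Tuple[List[Dict[int, float]], Dict[str, int]]:
    """Return one {column index: weight} dict per document, plus the vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenization
    # Vocabulary indices follow alphabetical token order.
    vocab = {tok: idx for idx, tok in enumerate(sorted({t for d in tokenized for t in d}))}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF (scikit-learn's default, smooth_idf=True).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        weights = {vocab[tok]: count * idf[tok] for tok, count in Counter(doc).items()}
        # L2-normalize the document vector; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({idx: w / norm for idx, w in weights.items()})
    return rows, vocab


rows, vocab = tfidf_sketch(['I am a boy', 'He is a boy', 'She is a girl'])
print(sorted((idx, round(w, 4)) for idx, w in rows[0].items()))
# [(0, 0.3452), (1, 0.5845), (2, 0.4445), (5, 0.5845)]
```

The token 'a' appears in all three documents, so it gets the lowest weight (0.3452), while 'i' and 'am' appear only once and get the highest (0.5845).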
word2vec
Word2Vec is a distributed representation language model for word embedding.
The Word2Vec model is trained on tokenized docs and provides word vectors.
The result is an instance of 'gensim.models.word2vec.Word2Vec'.
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)
# <class 'gensim.models.word2vec.Word2Vec'>
The user can get a word vector through the .wv attribute.
w2v_model.wv['boy']
# [-2.0130998e-03 -3.5652996e-03 2.7793974e-03 ...]
The Word2Vec model returns the topn words whose vectors are most similar to the given word.
w2v_model.wv.most_similar('boy', topn=3)
# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
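gensim ranks neighbors by cosine similarity between word vectors. The following sketch shows that ranking on a hand-made table of two-dimensional vectors; toy_vectors, cosine, and most_similar_sketch are hypothetical names for illustration and the vectors are not trained. With a corpus as small as the three sentences above, trained similarities stay close to zero, as the output shows.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_sketch(vectors: Dict[str, List[float]], word: str, topn: int = 3) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made vectors, for illustration only.
toy_vectors = {'boy': [1.0, 0.0], 'girl': [0.8, 0.6], 'is': [0.0, 1.0], 'a': [-1.0, 0.0]}
print([w for w, _ in most_similar_sketch(toy_vectors, 'boy', topn=2)])
# ['girl', 'is']
```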
word2vec (update)
The user can update the Word2Vec model with new data.
new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)
w2v_model_updated.wv['man']
# [4.9649975e-03 3.8002312e-04 -1.5773597e-03 ...]
doc2vec
Doc2Vec is a distributed representation language model for embedding longer text (e.g., sentences, paragraphs, documents).
The Doc2Vec model is trained on tokenized docs with tags and provides document vectors.
The result is an instance of 'gensim.models.doc2vec.Doc2Vec'.
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)
# <class 'gensim.models.doc2vec.Doc2Vec'>
The Doc2Vec model can infer a vector for a new, unseen document.
test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)
# [4.8494316e-03 -4.3647490e-03 1.1437446e-03 ...]
Visualization
Visualizer
Visualizer includes several simple tools for text visualization.
network
The network method draws a word network from tokenized docs.
from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()
docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
visualizer.network(docs=tokenized_docs, show=True)
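One common way to build a word network is to link words that occur in the same document and weight each link by how often that happens. The sketch below computes such edges with the standard library. Whether Visualizer.network uses exactly this co-occurrence definition is an assumption, and cooccurrence_edges is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence_edges(tokenized_docs: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count, for each unordered word pair, the number of docs containing both words."""
    edges: Counter = Counter()
    for tokens in tokenized_docs:
        # Sort the unique tokens so each pair has one canonical orientation.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


tokenized_docs = [['I', 'am', 'a', 'boy'], ['She', 'is', 'a', 'girl']]
edges = cooccurrence_edges(tokenized_docs)
print(len(edges))
# 12
print(edges[('a', 'boy')])
# 1
```

In this toy input, 'a' is the only word shared by both documents, so it is the node that connects the two halves of the network.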
Extracting Text
TextConverter
TextConverter includes several methods that extract raw text from various types of files (e.g., PDF, HWP) and/or convert the files into plain text files (e.g., TXT).
hwp2txt
The hwp2txt method converts an HWP file into a plain text file.
Dependencies: pyhwp package
Install pyhwp (the pre-release version is required):
pip install --pre pyhwp
Example
from connlp.text_extract import TextConverter
converter = TextConverter()
hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'
converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs
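Since hwp2txt returns 0 when no error occurs, a batch conversion can collect the files whose return value is nonzero. The sketch below is a hypothetical helper (convert_all_hwp is not part of connlp) that takes the conversion function as a parameter, so it works with any callable of the form convert(input_path, output_path) -> int.

```python
from pathlib import Path
from typing import Callable, List


def convert_all_hwp(src_dir: str, dst_dir: str, convert: Callable[[str, str], int]) -> List[str]:
    """Convert every .hwp file directly under src_dir; return the inputs that failed."""
    out_root = Path(dst_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    failed = []
    for hwp_path in sorted(Path(src_dir).glob('*.hwp')):
        txt_path = out_root / (hwp_path.stem + '.txt')
        # A nonzero return value is treated as a failure.
        if convert(str(hwp_path), str(txt_path)) != 0:
            failed.append(str(hwp_path))
    return failed


# Assumed usage with connlp (requires connlp and pyhwp to be installed):
# from connlp.text_extract import TextConverter
# failed = convert_all_hwp('/data/raw', '/data/processed', TextConverter().hwp2txt)
```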
GPU Utils
GPUMonitor
GPUMonitor is a class that monitors and displays the GPU status based on nvidia-smi.
Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usage.
Install the GPUtil module with pip.
pip install GPUtil
Write your code between the instantiation of GPUMonitor and the call to monitor.stop().
from connlp.util import GPUMonitor
monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()
# | ID | GPU | MEM |
# ------------------
# | 0 | 0% | 0% |
# | 1 | 1% | 0% |
# | 2 | 0% | 94% |
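The start-then-stop pattern above is typically built on a background thread that samples the GPU every delay seconds until it is told to stop. The following sketch shows that pattern with the standard library. PeriodicMonitor and its sample parameter are hypothetical names and this is not connlp's implementation. With GPUtil installed, passing GPUtil.showUtilization as sample would print a table like the one above.

```python
import threading
import time
from typing import Callable


class PeriodicMonitor(threading.Thread):
    """Call `sample` every `delay` seconds on a background thread until stopped."""

    def __init__(self, delay: float, sample: Callable[[], None]):
        super().__init__(daemon=True)
        self.delay = delay
        self.sample = sample
        self._halt = threading.Event()
        self.start()  # begin sampling as soon as the monitor is created

    def run(self) -> None:
        while not self._halt.is_set():
            self.sample()
            # Sleep for `delay`, but wake immediately if stop() is called.
            self._halt.wait(self.delay)

    def stop(self) -> None:
        self._halt.set()
        self.join()


readings = []
monitor = PeriodicMonitor(delay=0.01, sample=lambda: readings.append(time.time()))
time.sleep(0.05)  # stand-in for the user's workload
monitor.stop()
print(len(readings) >= 1)
# True
```

Using an Event instead of time.sleep lets stop() return promptly even when delay is long.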