connlp
A collection of Python code for analyzing text data in the construction industry.
It mainly repackages existing Python libraries for Natural Language Processing (NLP).
Project Information
- Supported by C!LAB (@Seoul Nat'l Univ.)
Contributors
- Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
- Sehwan Chung (hwani751@snu.ac.kr)
- Jungyeon Kim (janykjy@snu.ac.kr)
Initialize
Setup
pip install connlp
Test
If the code below runs with no error, connlp is installed successfully.
from connlp.test import hello
hello()
# 'Helloworld'
Preprocess
Preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02).
Normalizer
Normalizer normalizes the input text by removing unwanted characters and keeping only numbers, letters, and punctuation marks.
from connlp.preprocess import Normalizer
normalizer = Normalizer()
normalizer.normalize(text='I am a boy!')
# 'i am a boy'
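The idea behind normalization can be sketched with the standard library alone. This is not the connlp implementation: the function name normalize and the set of characters kept (lowercase Latin letters, digits, Hangul syllables, whitespace) are assumptions chosen only to reproduce the example output above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop characters outside an assumed allow-list."""
    lowered = text.lower()
    # Replace every run of disallowed characters with a single space.
    # The allow-list (a-z, 0-9, Hangul syllables, whitespace) is an assumption.
    cleaned = re.sub(r"[^a-z0-9가-힣\s]+", " ", lowered)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", cleaned).strip()


print(normalize("I am a boy!"))  # i am a boy
```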
EnglishTokenizer
EnglishTokenizer tokenizes the input text in English based on word spacing.
N-gram-based tokenization is under development.
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
tokenizer.tokenize(text='I am a boy!')
# ['I', 'am', 'a', 'boy!']
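Tokenization by word spacing amounts to splitting on whitespace, which is why the punctuation stays attached to 'boy!'. The sketch below shows that behavior in plain Python, plus a hypothetical ngrams helper to illustrate what n-gram tokenization could look like. Neither function is part of connlp.

```python
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Split the text on runs of whitespace; punctuation is left untouched."""
    return text.split()


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Hypothetical helper: return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = whitespace_tokenize("I am a boy!")
print(tokens)             # ['I', 'am', 'a', 'boy!']
print(ngrams(tokens, 2))  # [('I', 'am'), ('am', 'a'), ('a', 'boy!')]
```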
KoreanTokenizer
KoreanTokenizer tokenizes the input text in Korean, and is based on soynlp (https://github.com/lovit/soynlp), an unsupervised text analyzer in Korean.
train
A KoreanTokenizer object first needs to be trained on an (unlabeled) corpus. A 'word score' is calculated for every subword in the corpus.
from connlp.preprocess import KoreanTokenizer
tokenizer = KoreanTokenizer(min_frequency=0) # see 'soynlp' for detailed explanation on keyword arguments
docs = ['코퍼스의 첫 번째 문서입니다.', '두 번째 문서입니다.', '마지막 문서']
tokenizer.train(text=docs)
print(tokenizer.word_score)
# {'서': 0.0, '코': 0.0, '째': 0.0, '.': 0.0, '의': 0.0, '마': 0.0, '막': 0.0, '번': 0.0, '문': 0.0, '코퍼': 1.0, '번째': 1.0, '마지': 1.0, '문서': 1.0, '코퍼스': 1.0, '문서입': 0.816496580927726, '마지막': 1.0, '코퍼스의': 1.0, '문서입니': 0.8735804647362989, '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278}
tokenize
Tokenization is based on the 'word score' calculated by the KoreanTokenizer.train method. For each blank-separated token, the subword with the maximum 'word score' is selected as an individual 'word' and split off from the remaining part.
doc = docs[0] # '코퍼스의 첫 번째 문서입니다.'
tokenizer.tokenize(doc)
# ['코퍼스의', '첫', '번째', '문서', '입니다.']
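The splitting rule can be illustrated with a small standalone sketch. It assumes that only prefixes of each token are scored and that ties go to the longer prefix; both are assumptions made to reproduce the output above, not a description of the soynlp internals. The scores are copied from the word_score example in the train section.

```python
from typing import Dict, List


def split_by_word_score(token: str, word_score: Dict[str, float]) -> List[str]:
    """Split a token after its best-scoring prefix (assumed rule)."""
    # Score every prefix of the token; unseen prefixes score 0.0.
    candidates = [
        (word_score.get(token[:end], 0.0), end)
        for end in range(1, len(token) + 1)
    ]
    # max() compares the score first and the prefix length second,
    # so ties are resolved in favor of the longer prefix.
    _, best_end = max(candidates)
    if best_end == len(token):
        return [token]
    return [token[:best_end], token[best_end:]]


def tokenize(doc: str, word_score: Dict[str, float]) -> List[str]:
    """Apply the split to every blank-separated token of the document."""
    tokens: List[str] = []
    for chunk in doc.split():
        tokens.extend(split_by_word_score(chunk, word_score))
    return tokens


# Subset of the scores printed in the train example.
word_score = {
    '코퍼': 1.0, '코퍼스': 1.0, '코퍼스의': 1.0, '번째': 1.0, '문서': 1.0,
    '문서입': 0.816496580927726, '문서입니': 0.8735804647362989,
    '문서입니다': 0.9036020036098448, '문서입니다.': 0.9221079114817278,
}
result = tokenize('코퍼스의 첫 번째 문서입니다.', word_score)
print(result)  # ['코퍼스의', '첫', '번째', '문서', '입니다.']
```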
Embedding
Vectorizer
Vectorizer includes several widely used text embedding methods.
tfidf
TF-IDF is a widely used technique for representing documents as weighted term vectors.
The TF-IDF model computes the term frequency (TF) and inverse document frequency (IDF) from the given documents.
The method returns the following.
- TF-IDF Vectorizer (an instance of sklearn.feature_extraction.text.TfidfVectorizer)
- TF-IDF Matrix
- TF-IDF Vocabulary
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)
# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>
The user can get a document vector by indexing the tfidf_matrix.
tfidf_matrix[0]
# (0, 2) 0.444514311537431
# (0, 0) 0.34520501686496574
# (0, 1) 0.5844829010200651
# (0, 5) 0.5844829010200651
The tfidf_vocab maps every token to its column index in the matrix.
print(tfidf_vocab)
# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
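The weights in tfidf_matrix[0] can be reproduced by hand. The sketch below assumes the scikit-learn defaults (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization) and a lowercase whitespace tokenizer. The function name tfidf_vectors is hypothetical and not part of connlp.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(tokenized_docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one L2-normalized TF-IDF vector (token -> weight) per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each token.
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default).
    idf = {tok: math.log((1 + n_docs) / (1 + cnt)) + 1 for tok, cnt in df.items()}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {tok: freq * idf[tok] for tok, freq in tf.items()}
        # L2 norm; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in weights.items()})
    return vectors


docs = ['I am a boy', 'He is a boy', 'She is a girl']
vectors = tfidf_vectors([doc.lower().split() for doc in docs])
for token, weight in sorted(vectors[0].items()):
    print(token, round(weight, 4))
# a 0.3452
# am 0.5845
# boy 0.4445
# i 0.5845
```

The token 'a' appears in every document, so it receives the lowest weight, while 'i' and 'am' appear only in the first document and receive the highest.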
word2vec
Word2Vec is a distributed representation language model for word embedding.
The Word2Vec model is trained on tokenized docs and returns word vectors.
The result is an instance of 'gensim.models.word2vec.Word2Vec'.
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)
# <class 'gensim.models.word2vec.Word2Vec'>
The user can get a word vector through the .wv attribute.
w2v_model.wv['boy']
# [-2.0130998e-03 -3.5652996e-03 2.7793974e-03 ...]
The Word2Vec model returns the topn words whose vectors are most similar to the given word.
w2v_model.wv.most_similar('boy', topn=3)
# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
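Similarity between word vectors is usually measured with cosine similarity. The sketch below ranks words that way using the standard library only. The two-dimensional toy_vectors are hypothetical values for illustration, not output of a trained model, and the functions are not part of connlp or gensim.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word: str, vectors: Dict[str, List[float]], topn: int = 3) -> List[Tuple[str, float]]:
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors.
toy_vectors = {
    'boy': [1.0, 0.0],
    'girl': [0.8, 0.6],
    'a': [0.0, 1.0],
    'is': [-1.0, 0.0],
}
ranking = most_similar('boy', toy_vectors, topn=2)
print([word for word, _ in ranking])  # ['girl', 'a']
```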
word2vec (update)
The user can update the Word2Vec model with new data.
new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)
w2v_model_updated.wv['man']
# [4.9649975e-03 3.8002312e-04 -1.5773597e-03 ...]
doc2vec
Doc2Vec is a distributed representation language model for embedding longer text (e.g., a sentence, paragraph, or document).
The Doc2Vec model is trained on tagged, tokenized docs and returns document vectors.
The result is an instance of 'gensim.models.doc2vec.Doc2Vec'.
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)
# <class 'gensim.models.doc2vec.Doc2Vec'>
The Doc2Vec model can infer a vector for a new document.
test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)
# [4.8494316e-03 -4.3647490e-03 1.1437446e-03 ...]
Analysis
TopicModel
TopicModel is a class for topic modeling based on the gensim LDA model.
It provides a simple way to train an LDA model and assign topics to docs.
TopicModel requires two arguments.
- a dict of tokenized docs whose keys are tags
- the number of topics for modeling
from connlp.analysis import TopicModel
num_topics = 2
docs = {'doc1': ['I', 'am', 'a', 'boy'],
'doc2': ['He', 'is', 'a', 'boy'],
'doc3': ['Cat', 'on', 'the', 'table'],
'doc4': ['Mike', 'is', 'a', 'boy'],
'doc5': ['Dog', 'on', 'the', 'table'],
}
lda_model = TopicModel(docs=docs, num_topics=num_topics)
learn
The user can train the model with the learn method. Unless the user provides parameters, the model is trained with default parameters.
After learn, TopicModel exposes the trained model as the model attribute, which is an instance of 'gensim.models.ldamodel.LdaModel'.
parameters = {
'iterations': 100,
'alpha': 0.7,
'eta': 0.05,
}
lda_model.learn(parameters=parameters)
type(lda_model.model)
# <class 'gensim.models.ldamodel.LdaModel'>
coherence
TopicModel provides a coherence value for model evaluation.
The coherence value is automatically calculated right after model training.
print(lda_model.coherence)
# 0.3607990279229385
assign
The user can assign the most suitable topic to each doc with the assign method.
After assign, TopicModel provides the tag2topic and topic2tag attributes for convenience.
lda_model.assign()
print(lda_model.tag2topic)
print(lda_model.topic2tag)
# defaultdict(<class 'int'>, {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0})
# defaultdict(<class 'list'>, {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']})
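The two mappings can be understood as an argmax over each document's topic distribution. The sketch below assumes distributions given as (topic_id, probability) pairs; the probabilities in doc_topics are hypothetical and were chosen to match the assignment printed above. It is not the connlp implementation.

```python
from collections import defaultdict

# Hypothetical per-document topic distributions as (topic_id, probability) pairs.
doc_topics = {
    'doc1': [(0, 0.2), (1, 0.8)],
    'doc2': [(0, 0.3), (1, 0.7)],
    'doc3': [(0, 0.9), (1, 0.1)],
    'doc4': [(0, 0.4), (1, 0.6)],
    'doc5': [(0, 0.8), (1, 0.2)],
}

tag2topic = {}
topic2tag = defaultdict(list)
for tag, distribution in doc_topics.items():
    # Pick the topic with the highest probability for this document.
    best_topic, _ = max(distribution, key=lambda pair: pair[1])
    tag2topic[tag] = best_topic
    topic2tag[best_topic].append(tag)

print(tag2topic)        # {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0}
print(dict(topic2tag))  # {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']}
```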
NamedEntityRecognition
Before using NER modules, the user should install proper versions of TensorFlow and Keras.
pip install config==0.4.2 gensim==3.8.1 gpustat==0.6.0 GPUtil==1.4.0 h5py==2.10.0 JPype1==0.7.1 Keras==2.2.4 konlpy==0.5.2 nltk==3.4.5 numpy==1.18.1 pandas==1.0.1 scikit-learn==0.22.1 scipy==1.4.1 silence-tensorflow==1.1.1 soynlp==0.0.493 tensorflow==1.14.0 tensorflow-gpu==1.14.0
The NER modules might also require keras-contrib.
The user can install it as follows.
git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install
Labels
NER_Model is a class that conducts named entity recognition using Bidirectional Long Short-Term Memory (Bi-LSTM) and Conditional Random Field (CRF).
First, appropriate labels are required.
The labels should be numbered starting from 0.
from connlp.analysis import NER_Labels
label_dict = {'NON': 0, #None
'PER': 1, #PERSON
'FOD': 2,} #FOOD
ner_labels = NER_Labels(label_dict=label_dict)
Corpus
Next, the user should prepare sentences and labels, where each sentence and its label sequence share the same tag.
The tokenized sentences and labels are then combined via NER_LabeledSentence.
With the data, the labels, and a suitable max_sent_len (i.e., the maximum sentence length for analysis), an NER_Corpus is built.
Once the corpus is built, every sentence and label sequence is padded to the length of max_sent_len.
from connlp.preprocess import EnglishTokenizer
from connlp.analysis import NER_LabeledSentence, NER_Corpus
tokenizer = EnglishTokenizer()
data_sents = {'sent1': 'Sam likes pizza',
'sent2': 'Erik eats pizza',
'sent3': 'Erik and Sam are drinking soda',
'sent4': 'Flora cooks chicken',
'sent5': 'Sam ordered a chicken',
'sent6': 'Flora likes chicken sandwich',
'sent7': 'Erik likes to drink soda'}
data_labels = {'sent1': [1, 0, 2],
'sent2': [1, 0, 2],
'sent3': [1, 0, 1, 0, 0, 2],
'sent4': [1, 0, 2],
'sent5': [1, 0, 0, 2],
'sent6': [1, 0, 2, 2],
'sent7': [1, 0, 0, 0, 2]}
docs = []
for tag, sent in data_sents.items():
    words = [str(w) for w in tokenizer.tokenize(text=sent)]
    labels = data_labels[tag]
    docs.append(NER_LabeledSentence(tag=tag, words=words, labels=labels))
max_sent_len = 10
ner_corpus = NER_Corpus(docs=docs, ner_labels=ner_labels, max_sent_len=max_sent_len)
type(ner_corpus)
# <class 'connlp.analysis.NER_Corpus'>
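Padding to max_sent_len can be pictured as follows. The pad token '__PAD__' and the pad label 0 are assumptions for illustration; the values that NER_Corpus actually uses are not stated here, and the pad function is hypothetical.

```python
from typing import List, TypeVar

T = TypeVar('T')


def pad(sequence: List[T], max_len: int, pad_value: T) -> List[T]:
    """Truncate the sequence to max_len, then fill it up with pad_value."""
    trimmed = list(sequence[:max_len])
    return trimmed + [pad_value] * (max_len - len(trimmed))


words = ['Sam', 'likes', 'pizza']
labels = [1, 0, 2]
max_sent_len = 5

print(pad(words, max_sent_len, '__PAD__'))  # ['Sam', 'likes', 'pizza', '__PAD__', '__PAD__']
print(pad(labels, max_sent_len, 0))         # [1, 0, 2, 0, 0]
```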
Word Embedding
Every word in the NER_Corpus should be embedded into a numeric vector space.
The user can conduct the embedding with Word2Vec, which is provided by the Vectorizer of connlp.
Note that the embedding process of NER_Corpus requires only the dictionary of word vectors and the feature size.
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()
tokenized_sents = [tokenizer.tokenize(sent) for sent in data_sents.values()]
w2v_model = vectorizer.word2vec(docs=tokenized_sents)
word2vector = vectorizer.get_word_vectors(w2v_model)
feature_size = w2v_model.vector_size
ner_corpus.word_embedding(word2vector=word2vector, feature_size=feature_size)
print(ner_corpus.X_embedded)
# [[[-2.40120804e-03 1.74632657e-03 ...]
# [-3.57543468e-03 2.86567654e-03 ...]
# ...
# [ 0.00000000e+00 0.00000000e+00 ...]] ...]
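The zero rows at the end of the printed array suggest that padded positions are filled with zero vectors. A minimal sketch of such a lookup is given below. The function embed, the toy word2vector values, and the treatment of unknown words as zero vectors are all assumptions, not the connlp implementation.

```python
from typing import Dict, List


def embed(words: List[str], word2vector: Dict[str, List[float]],
          feature_size: int, max_sent_len: int) -> List[List[float]]:
    """Map each word to its vector and pad the sentence with zero vectors."""
    zero = [0.0] * feature_size
    # Unknown words fall back to the zero vector (assumption).
    rows = [list(word2vector.get(word, zero)) for word in words[:max_sent_len]]
    # Fill the remaining positions up to max_sent_len with zero vectors.
    rows += [list(zero) for _ in range(max_sent_len - len(rows))]
    return rows


# Hypothetical two-dimensional word vectors.
toy_word2vector = {'Sam': [0.1, 0.2], 'likes': [0.3, 0.4]}
embedded = embed(['Sam', 'likes', 'soda'], toy_word2vector, feature_size=2, max_sent_len=4)
print(embedded)  # [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0], [0.0, 0.0]]
```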
Model Initialization
Parameters for both the Bi-LSTM and model training must be provided; they can be combined in a single dictionary.
The user should initialize the NER_Model with the NER_Corpus and the parameters.
from connlp.analysis import NER_Model
parameters = {
# Parameters for Bi-LSTM.
'lstm_units': 512,
'lstm_return_sequences': True,
'lstm_recurrent_dropout': 0.2,
'dense_units': 100,
'dense_activation': 'relu',
# Parameters for model training.
'test_size': 0.3,
'batch_size': 1,
'epochs': 100,
'validation_split': 0.1,
}
ner_model = NER_Model()
ner_model.initialize(ner_corpus=ner_corpus, parameters=parameters)
type(ner_model)
# <class 'connlp.analysis.NER_Model'>
Model Training
The user can train the NER_Model with customized parameters.
The model automatically gets the dataset from the NER_Corpus.
ner_model.train(parameters=parameters)
# Train on 3 samples, validate on 1 samples
# Epoch 1/100
# 3/3 [==============================] - 3s 1s/step - loss: 1.4545 - crf_viterbi_accuracy: 0.3000 - val_loss: 1.0767 - val_crf_viterbi_accuracy: 0.8000
# Epoch 2/100
# 3/3 [==============================] - 0s 74ms/step - loss: 0.8602 - crf_viterbi_accuracy: 0.7000 - val_loss: 0.5287 - val_crf_viterbi_accuracy: 0.8000
# ...
Model Evaluation
The model performance is reported as a confusion matrix and F1 scores.
ner_model.evaluate()
# |--------------------------------------------------
# |Confusion Matrix:
# [[ 3 0 3 6]
# [ 1 3 0 4]
# [ 0 0 2 2]
# [ 4 3 5 12]]
# |--------------------------------------------------
# |F1 Score: 0.757
# |--------------------------------------------------
# | [NON]: 0.600
# | [PER]: 0.857
# | [FOD]: 0.571
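The per-label F1 scores can be recomputed from the confusion matrix. The sketch below assumes that the last row and column of the printed matrix are totals, so only the inner 3x3 block is used. It reproduces the three per-label scores; how the overall F1 score is aggregated is not derived here.

```python
from typing import List

# Inner 3x3 block of the printed confusion matrix (totals removed).
confusion = [
    [3, 0, 3],
    [1, 3, 0],
    [0, 0, 2],
]
label_names = ['NON', 'PER', 'FOD']


def per_class_f1(matrix: List[List[int]]) -> List[float]:
    """F1 per class: 2*TP / (row total + column total)."""
    size = len(matrix)
    scores = []
    for k in range(size):
        true_positive = matrix[k][k]
        row_total = sum(matrix[k])
        col_total = sum(matrix[r][k] for r in range(size))
        denominator = row_total + col_total
        scores.append(2 * true_positive / denominator if denominator else 0.0)
    return scores


f1_scores = per_class_f1(confusion)
for name, score in zip(label_names, f1_scores):
    print(f'[{name}]: {score:.3f}')
# [NON]: 0.600
# [PER]: 0.857
# [FOD]: 0.571
```

Because F1 is symmetric in precision and recall, the result does not depend on whether the rows hold the true labels or the predicted labels.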
Save
The user can save the NER_Model.
Saving writes the model itself ("<FileName>.pk") and the dataset ("<FileName>-dataset.pk") that was used in model development.
Note that the directory should exist before saving the model.
from connlp.util import makedir
fpath_model = 'test/ner/model.pk'
makedir(fpath=fpath_model)
ner_model.save(fpath_model=fpath_model)
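Creating the parent directory before writing a file can be done with the standard library. The sketch below is a stand-in for the pattern above, not the connlp code. The helper save_pickle is hypothetical, and the use of pickle is an assumption suggested only by the ".pk" extension.

```python
import os
import pickle
import tempfile


def save_pickle(obj, fpath: str) -> None:
    """Create the parent directory if needed, then pickle the object."""
    os.makedirs(os.path.dirname(fpath) or '.', exist_ok=True)
    with open(fpath, 'wb') as f:
        pickle.dump(obj, f)


# Work inside a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as root:
    fpath_demo = os.path.join(root, 'test', 'ner', 'model.pk')
    save_pickle({'weights': [0.1, 0.2]}, fpath_demo)
    with open(fpath_demo, 'rb') as f:
        restored = pickle.load(f)

print(restored)  # {'weights': [0.1, 0.2]}
```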
Load
To load an already trained model, create an NER_Model and call its load method.
fpath_model = 'test/ner/model.pk'
ner_model = NER_Model()
ner_model.load(fpath_model=fpath_model, ner_corpus=ner_corpus, parameters=parameters)
Prediction
NER_Model can conduct NER on a new sentence.
The result is an instance of NER_Result.
from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()
new_sent = 'Tom eats apple'
tokenized_sent = tokenizer.tokenize(new_sent)
ner_result = ner_model.predict(sent=tokenized_sent)
print(ner_result)
# Tom/PER eats/NON apple/FOD
Visualization
Visualizer
Visualizer includes several simple tools for text visualization.
network
The network method draws a word network from tokenized docs.
from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()
docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
visualizer.network(docs=tokenized_docs, show=True)
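A word network is commonly built from co-occurrence counts, with words as nodes and shared documents as weighted edges. The sketch below computes such edges with the standard library. Whether Visualizer.network uses document-level co-occurrence is an assumption, and cooccurrence_edges is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations
from typing import List


def cooccurrence_edges(tokenized_docs: List[List[str]]) -> Counter:
    """Count, for each word pair, the number of docs in which both words appear."""
    edges = Counter()
    for doc in tokenized_docs:
        # Sort the unique words so each pair has one canonical order.
        for first, second in combinations(sorted(set(doc)), 2):
            edges[(first, second)] += 1
    return edges


network_docs = [['I', 'am', 'a', 'boy'], ['She', 'is', 'a', 'girl']]
edges = cooccurrence_edges(network_docs)
print(len(edges))            # 12
print(edges[('a', 'boy')])   # 1
```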
Extracting Text
TextConverter
TextConverter includes several methods that extract raw text from various types of files (e.g., PDF, HWP) and/or convert the files into plain text files (e.g., TXT).
hwp2txt
The hwp2txt method converts an HWP file into a plain text file.
Dependencies: pyhwp package
Install pyhwp (the pre-release version is required).
pip install --pre pyhwp
Example
from connlp.text_extract import TextConverter
converter = TextConverter()
hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'
converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs
GPU Utils
GPUMonitor
GPUMonitor is a class that monitors and displays the GPU status based on nvidia-smi.
Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usages.
Install the GPUtil module with pip.
pip install GPUtil
Write your code between the creation of the GPUMonitor and monitor.stop().
from connlp.util import GPUMonitor
monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()
# | ID | GPU | MEM |
# ------------------
# | 0 | 0% | 0% |
# | 1 | 1% | 0% |
# | 2 | 0% | 94% |
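The start-then-stop usage pattern matches a background thread that samples a status at a fixed delay until it is told to stop. The sketch below shows that pattern with the standard library. PeriodicMonitor and the fake probe are hypothetical; it does not query nvidia-smi and is not the GPUMonitor implementation.

```python
import threading
import time
from typing import Callable, Dict, List


class PeriodicMonitor:
    """Call a probe function every `delay` seconds in a background thread."""

    def __init__(self, probe: Callable[[], Dict[str, int]], delay: float):
        self.probe = probe
        self.delay = delay
        self.samples: List[Dict[str, int]] = []
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()  # monitoring begins as soon as the object is created

    def _run(self) -> None:
        while not self._stopped.is_set():
            self.samples.append(self.probe())
            # wait() returns early when stop() sets the event.
            self._stopped.wait(self.delay)

    def stop(self) -> None:
        self._stopped.set()
        self._thread.join()


# Fake probe standing in for a real GPU query.
demo_monitor = PeriodicMonitor(probe=lambda: {'gpu': 0, 'mem': 0}, delay=0.01)
time.sleep(0.05)  # >>> the monitored workload would run here <<<
demo_monitor.stop()
print(len(demo_monitor.samples) >= 1)  # True
```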