
connlp

A collection of Python tools for analyzing text data in the construction industry.
It mainly repackages pre-existing Python libraries for Natural Language Processing (NLP).

Project Information

  • Supported by C!LAB (@Seoul Nat'l Univ.)

Contributors

Initialize

Setup

pip install connlp

Test

If the code below runs without errors, connlp has been installed successfully.

from connlp.test import hello
hello()

# 'Helloworld'

Preprocess

The preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02).

Normalizer

Normalizer normalizes the input text by eliminating unwanted characters; as the example below shows, it also lowercases the text and strips punctuation marks such as '!'.

from connlp.preprocess import Normalizer
normalizer = Normalizer()

normalizer.normalize(text='I am a boy!')

# 'i am a boy'

EnglishTokenizer

EnglishTokenizer tokenizes English text based on word spacing.
N-gram-based tokenization is in preparation (see the sketch after the example below).

from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()

tokenizer.tokenize(text='I am a boy!')

# ['I', 'am', 'a', 'boy!']
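
In the meantime, n-grams can be assembled from the tokenizer output with plain Python. A minimal sketch (the ngrams helper below is illustrative, not a connlp API):

from connlp.preprocess import EnglishTokenizer

def ngrams(tokens, n=2):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokenizer = EnglishTokenizer()
print(ngrams(tokenizer.tokenize(text='I am a boy!'), n=2))

# [('I', 'am'), ('am', 'a'), ('a', 'boy!')]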

Embedding

Vectorizer

Vectorizer includes several text embedding methods that have been commonly used for decades.

tfidf

TF-IDF is one of the most commonly used techniques for word embedding.
The TF-IDF model counts the term frequency (TF) and inverse document frequency (IDF) from the given documents.
The method returns the following:

  • TF-IDF Vectorizer (an instance of sklearn.feature_extraction.text.TfidfVectorizer)
  • TF-IDF Matrix
  • TF-IDF Vocabulary
from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)

# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>

The user can get a document vector by indexing the tfidf_matrix.

tfidf_matrix[0]

# (0, 2)    0.444514311537431
# (0, 0)    0.34520501686496574
# (0, 1)    0.5844829010200651
# (0, 5)    0.5844829010200651

The tfidf_vocab maps every token to its index in the matrix.

print(tfidf_vocab)

# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
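
Combining the matrix with the vocabulary gives the weight of each token in a document. A short sketch in plain Python/scipy (index2token is a hypothetical helper, not a connlp API):

# Invert the vocabulary so sparse column indices map back to tokens.
index2token = {idx: token for token, idx in tfidf_vocab.items()}

row = tfidf_matrix[0].tocoo()
for idx, weight in zip(row.col, row.data):
    print(index2token[idx], round(weight, 3))

# boy 0.445
# a 0.345
# am 0.584
# i 0.584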

word2vec

Word2Vec is a distributed representation language model for word embedding.
The Word2Vec model trains on tokenized docs and returns word vectors.
The result is an instance of gensim.models.word2vec.Word2Vec.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)

# <class 'gensim.models.word2vec.Word2Vec'>

The user can get a word vector through the .wv attribute.

w2v_model.wv['boy']

# [-2.0130998e-03 -3.5652996e-03  2.7793974e-03 ...]

The Word2Vec model also provides the topn most similar words along with their similarity scores.

w2v_model.wv.most_similar('boy', topn=3)

# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]

word2vec (update)

The user can update the Word2Vec model with new data.

new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)

w2v_model_updated.wv['man']

# [4.9649975e-03  3.8002312e-04 -1.5773597e-03 ...]

doc2vec

Doc2Vec is a distributed representation language model for embedding longer texts (e.g., sentences, paragraphs, documents).
The Doc2Vec model trains on tokenized docs with tags and returns document vectors.
The result is an instance of gensim.models.doc2vec.Doc2Vec.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)

# <class 'gensim.models.doc2vec.Doc2Vec'>

The Doc2Vec model can infer a vector for a new document.

test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)

# [4.8494316e-03 -4.3647490e-03  1.1437446e-03 ...]
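
Inferred vectors can be compared with ordinary cosine similarity. A minimal numpy sketch (not a connlp API):

import numpy as np

vec_a = d2v_model.infer_vector(doc_words=['My', 'name', 'is', 'Peter'])
vec_b = d2v_model.infer_vector(doc_words=['He', 'is', 'a', 'boy'])

# Cosine similarity between the two inferred document vectors.
print(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))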

Analysis

TopicModel

TopicModel is a class for topic modeling based on the gensim LDA model.
It provides a simple way to train an LDA model and assign topics to docs.

TopicModel requires two arguments:

  • a dict of docs whose keys are the tags
  • the number of topics for modeling
from connlp.analysis import TopicModel

num_topics = 2
docs = {'doc1': ['I', 'am', 'a', 'boy'],
        'doc2': ['He', 'is', 'a', 'boy'],
        'doc3': ['Cat', 'on', 'the', 'table'],
        'doc4': ['Mike', 'is', 'a', 'boy'],
        'doc5': ['Dog', 'on', 'the', 'table'],
        }

lda_model = TopicModel(docs=docs, num_topics=num_topics)

learn

The users can train the model with the learn method. Unless parameters are provided by the user, the model trains with default parameters.

After learn, TopicModel provides the model attribute, which is an instance of gensim.models.ldamodel.LdaModel.

parameters = {
    'iterations': 100,
    'alpha': 0.7,
    'eta': 0.05,
}
lda_model.learn(parameters=parameters)
type(lda_model.model)

# <class 'gensim.models.ldamodel.LdaModel'>

coherence

TopicModel provides a coherence value for model evaluation.
The coherence value is automatically calculated right after model training.

print(lda_model.coherence)

# 0.3607990279229385
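
Because the coherence value is computed automatically, it can serve as a simple criterion for choosing num_topics. A usage sketch built only on the interface shown above:

from connlp.analysis import TopicModel

# Train one model per candidate topic number and keep the most coherent one.
best_model, best_coherence = None, float('-inf')
for k in range(2, 5):
    candidate = TopicModel(docs=docs, num_topics=k)
    candidate.learn()    # default parameters
    if candidate.coherence > best_coherence:
        best_model, best_coherence = candidate, candidate.coherence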

assign

The users can easily assign the most appropriate topic to each doc using the assign method.
After assign, the TopicModel provides the tag2topic and topic2tag attributes for convenience.

lda_model.assign()

print(lda_model.tag2topic)
print(lda_model.topic2tag)

# defaultdict(<class 'int'>, {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0})
# defaultdict(<class 'list'>, {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']})

Visualization

Visualizer

Visualizer includes several simple tools for text visualization.

network

The network method draws a word network for tokenized docs.

from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()

docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
visualizer.network(docs=tokenized_docs, show=True)

Extracting Text

TextConverter

TextConverter includes several methods that extract raw text from various file types (e.g., PDF, HWP) and/or convert the files into plain-text files (e.g., TXT).

hwp2txt

The hwp2txt method converts an HWP file into a plain-text file. Dependency: the pyhwp package.

Install pyhwp (you need to install the pre-release version)

pip install --pre pyhwp

Example

from connlp.text_extract import TextConverter
converter = TextConverter()

hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'

converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs
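
Since hwp2txt returns 0 on success, the extracted text can be read back with the standard library (assuming UTF-8 output):

# Read the extracted plain text back in (standard library only).
if converter.hwp2txt(hwp_fpath, output_fpath) == 0:
    with open(output_fpath, 'r', encoding='utf-8') as f:
        print(f.read())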

GPU Utils

GPUMonitor

GPUMonitor is a class that monitors and displays the GPU status based on nvidia-smi.
Refer to https://github.com/anderskm/gputil and https://data-newbie.tistory.com/561 for usage.

Install the GPUtil module with pip.

pip install GPUtil

Write your code between the initialization of the GPUMonitor and monitor.stop().

from connlp.util import GPUMonitor

monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()

# | ID | GPU | MEM |
# ------------------
# |  0 |  0% |  0% |
# |  1 |  1% |  0% |
# |  2 |  0% | 94% |

