connlp

A collection of Python code for analyzing text data in the construction industry.
It mainly repackages existing Python libraries for Natural Language Processing (NLP).

Project Information

Supported by C!LAB (@Seoul Nat'l Univ.)

Contributors

Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
Gitaek Lee (lgt0427@snu.ac.kr)
Taeyeon Chang (jgwoon1838@snu.ac.kr, a.k.a. Kowoon Chang)
Sehwan Chung (hwani751@snu.ac.kr)

Initialize

Setup

pip install connlp

Test

If the code below runs with no error, connlp is installed successfully.

from connlp.test import hello
hello()

# 'Helloworld'

Preprocess

The preprocessing module supports English and Korean.
NOTE: There is no plan to support other languages (as of 2021.04.02.).

Normalizer

Normalizer normalizes the input text by removing unwanted characters while keeping numbers, alphabetic characters, and punctuation marks. In the example below, the text is also lowercased and the exclamation mark is dropped.

from connlp.preprocess import Normalizer
normalizer = Normalizer()

normalizer.normalize(text='I am a boy!')

# 'i am a boy'
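The kind of cleanup described above can be sketched with a regular expression. This is only an illustration, not connlp's implementation: the function name `normalize_sketch` and the exact set of retained characters are assumptions chosen so that the sample input gives the same result as the example above.

```python
import re


def normalize_sketch(text):
    """Hypothetical stand-in for Normalizer.normalize (not connlp's code)."""
    # Lowercase first, as the example output above suggests.
    text = text.lower()
    # Assumed whitelist: digits, ASCII letters, whitespace, and a few marks.
    text = re.sub(r"[^0-9a-z\s.,?']", "", text)
    # Collapse runs of whitespace left behind by the removal step.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_sketch("I am a boy!"))  # i am a boy
```

A real normalizer for this package would also need to retain Korean characters, which this sketch leaves out.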

EnglishTokenizer

EnglishTokenizer tokenizes English input text by splitting it on whitespace.
N-gram-based tokenization is in preparation.

from connlp.preprocess import EnglishTokenizer
tokenizer = EnglishTokenizer()

tokenizer.tokenize(text='I am a boy!')

# ['I', 'am', 'a', 'boy!']
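For readers unfamiliar with the two ideas mentioned above, here is a minimal sketch of whitespace tokenization and of how n-grams could be built from the resulting tokens. The helper names `whitespace_tokenize` and `ngrams` are hypothetical and are not part of connlp; the n-gram feature of the package itself is still in preparation.

```python
def whitespace_tokenize(text):
    """Split on any run of whitespace (hypothetical helper, not connlp's API)."""
    return text.split()


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = whitespace_tokenize('I am a boy!')
print(tokens)             # ['I', 'am', 'a', 'boy!']
print(ngrams(tokens, 2))  # [('I', 'am'), ('am', 'a'), ('a', 'boy!')]
```

Note that punctuation stays attached to the token ('boy!'), as in the EnglishTokenizer output above, so normalizing the text first is usually worthwhile.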

Embedding

Vectorizer

Vectorizer provides several widely used text embedding methods.

tfidf

TF-IDF is one of the most commonly used techniques for turning text into vectors.
The TF-IDF model computes the term frequency (TF) and inverse document frequency (IDF) from the given documents.
The method returns the following.

TF-IDF Vectorizer (an instance of sklearn.feature_extraction.text.TfidfVectorizer)
TF-IDF Matrix
TF-IDF Vocabulary

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
type(tfidf_vectorizer)

# <class 'sklearn.feature_extraction.text.TfidfVectorizer'>

The user can get a document vector by indexing the tfidf_matrix.

tfidf_matrix[0]

# (0, 2)    0.444514311537431
# (0, 0)    0.34520501686496574
# (0, 1)    0.5844829010200651
# (0, 5)    0.5844829010200651

The tfidf_vocab maps every token to its column index in the tfidf_matrix.

print(tfidf_vocab)

# {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
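To make the numbers above less opaque, the following sketch recomputes the weights of the first document in plain Python. It assumes scikit-learn's default settings (smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization) and lowercased whitespace tokens; the README does not state which settings connlp passes on, so treat this as an illustration rather than the package's code. All names here (`tokenized`, `idf`, `tfidf_vector`) are hypothetical.

```python
import math
from collections import Counter

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: alphabetically sorted terms mapped to column indices.
vocab = {term: idx for idx, term in enumerate(sorted({t for d in tokenized for t in d}))}

# Document frequency and smoothed IDF (assumed scikit-learn default).
n_docs = len(tokenized)
df = Counter(term for d in tokenized for term in set(d))
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_vector(tokens):
    """Return {column index: weight} for one document, L2-normalized."""
    tf = Counter(tokens)
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {vocab[term]: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[0])
print({idx: round(w, 4) for idx, w in sorted(vec.items())})
# {0: 0.3452, 1: 0.5845, 2: 0.4445, 5: 0.5845}
```

The word 'a' appears in every document, so it gets the lowest weight, while 'i' and 'am' appear only in the first document and get the highest.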

word2vec

Word2Vec is a distributed representation language model for word embedding.
The Word2Vec model is trained on tokenized docs and yields word vectors.
The result is an instance of 'gensim.models.word2vec.Word2Vec'.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
w2v_model = vectorizer.word2vec(docs=tokenized_docs)
type(w2v_model)

# <class 'gensim.models.word2vec.Word2Vec'>

The user can get a word vector through the .wv attribute.

w2v_model.wv['boy']

# [-2.0130998e-03 -3.5652996e-03  2.7793974e-03 ...]

The Word2Vec model returns the topn words most similar to a given word.

w2v_model.wv.most_similar('boy', topn=3)

# [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
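The scores returned by most_similar are cosine similarities between word vectors. The sketch below shows that ranking on a handful of made-up 3-dimensional vectors. The vectors and the helper names (`toy_vectors`, `cosine_similarity`, `most_similar_sketch`) are hypothetical and only stand in for a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only (not taken from a trained model).
toy_vectors = {
    'boy': [1.0, 0.0, 1.0],
    'girl': [1.0, 0.1, 0.9],
    'is': [0.0, 1.0, 0.0],
    'a': [0.5, 0.5, 0.5],
}


def most_similar_sketch(word, topn):
    """Rank every other word by cosine similarity to `word`."""
    query = toy_vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print([w for w, _ in most_similar_sketch('boy', topn=2)])  # ['girl', 'a']
```

With a corpus as small as the three sentences used above, the trained vectors stay close to their random initialization, which is why the similarity scores in the example output are near zero.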

word2vec (update)

The user can update the Word2Vec model with new data.

new_docs = ['Tom is a man', 'Sally is not a boy']
tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)

w2v_model_updated.wv['man']

# [4.9649975e-03  3.8002312e-04 -1.5773597e-03 ...]

doc2vec

Doc2Vec is a distributed representation language model for embedding longer text (e.g., sentence, paragraph, document).
The Doc2Vec model is trained on tokenized docs with tags and yields document vectors.
The result is an instance of 'gensim.models.doc2vec.Doc2Vec'.

from connlp.preprocess import EnglishTokenizer
from connlp.embedding import Vectorizer
tokenizer = EnglishTokenizer()
vectorizer = Vectorizer()

docs = ['I am a boy', 'He is a boy', 'She is a girl']
tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
type(d2v_model)

# <class 'gensim.models.doc2vec.Doc2Vec'>

The Doc2Vec model can infer a vector for a new, unseen document.

test_doc = ['My', 'name', 'is', 'Peter']
d2v_model.infer_vector(doc_words=test_doc)

# [4.8494316e-03 -4.3647490e-03  1.1437446e-03 ...]

Visualization

Visualizer

Visualizer includes several simple tools for text visualization.

network

The network method draws a word network from tokenized docs.

from connlp.preprocess import EnglishTokenizer
from connlp.visualize import Visualizer
tokenizer = EnglishTokenizer()
visualizer = Visualizer()

docs = ['I am a boy', 'She is a girl']
tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
visualizer.network(docs=tokenized_docs, show=True)
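A word network of this kind is commonly built from co-occurrence counts: two words are linked when they appear in the same document, and the edge weight is the number of documents in which they appear together. The sketch below computes such edges in plain Python. Whether connlp's network method uses exactly this definition is an assumption, and `cooccurrence_edges` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(tokenized_docs):
    """Count, for every word pair, the number of docs containing both words."""
    edges = Counter()
    for tokens in tokenized_docs:
        # Sort the unique tokens so each pair has one canonical ordering.
        for pair in combinations(sorted(set(tokens)), 2):
            edges[pair] += 1
    return edges


docs = ['I am a boy', 'He is a boy', 'She is a girl']
tokenized_docs = [doc.split() for doc in docs]
edges = cooccurrence_edges(tokenized_docs)

print(edges[('a', 'boy')])   # 2
print(edges[('a', 'girl')])  # 1
```

The resulting pairs and weights can then be handed to a graph library for layout and drawing.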

Extracting Text

TextConverter

TextConverter includes several methods that extract raw text from various types of files (e.g., PDF, HWP) and/or convert the files into plain text files (e.g., TXT).

hwp2txt

The hwp2txt method converts an HWP file into a plain text file. It depends on the pyhwp package.

Install pyhwp (the pre-release version is required).

pip install --pre pyhwp

Example

from connlp.text_extract import TextConverter
converter = TextConverter()

hwp_fpath = '/data/raw/hwp_file.hwp'
output_fpath = '/data/processed/extracted_text.txt'

converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs

GPU Utils

GPUMonitor

GPUMonitor is a class that monitors and displays the GPU status based on nvidia-smi.
Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usage.

Install the GPUtil module with pip.

pip install GPUtil

Write your code between the creation of the GPUMonitor and the call to monitor.stop().

from connlp.util import GPUMonitor

monitor = GPUMonitor(delay=3)
# >>>Write your code here<<<
monitor.stop()

# | ID | GPU | MEM |
# ------------------
# |  0 |  0% |  0% |
# |  1 |  1% |  0% |
# |  2 |  0% | 94% |
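The start/stop usage above follows a common background-thread pattern: a thread wakes up every `delay` seconds, reports a status, and exits when stop() is called. The sketch below shows that pattern with the standard library only. `PeriodicMonitor` and its `report` callback are hypothetical stand-ins; the sketch does not query nvidia-smi and is not connlp's GPUMonitor.

```python
import threading
import time


class PeriodicMonitor(threading.Thread):
    """Hypothetical stand-in: calls `report` every `delay` seconds until stopped."""

    def __init__(self, delay, report):
        super().__init__(daemon=True)
        self.delay = delay
        self.report = report
        self._stop_event = threading.Event()
        self.start()  # begin monitoring as soon as the object is created

    def run(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop_event.wait(self.delay):
            self.report()

    def stop(self):
        self._stop_event.set()
        self.join()


samples = []
monitor = PeriodicMonitor(delay=0.05, report=lambda: samples.append(time.time()))
time.sleep(0.3)  # the monitored workload would run here
monitor.stop()
print(len(samples) >= 1)  # True
```

Using an Event rather than time.sleep inside the loop lets stop() take effect immediately instead of waiting out the current delay.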
