NLP vocabulary builder and embedding loader
capricorn
capricorn is a lightweight library that helps build a vocabulary from a corpus and prepare word embeddings ready to be used by learning models.
- Build a vocabulary from a corpus
- Load the necessary word embeddings, with word indices consistent with the vocabulary
Getting started
pip install capricorn
import os

import capricorn

# Specify file paths
Vocab_path = "vocab_processor"
embedding_vector_path = "path/to/embedding"

# Load the vocabulary if it was saved before, otherwise build it
if os.path.isfile(Vocab_path):
    print("Loading Vocabulary ...")
    vocab_processor = capricorn.VocabularyProcessor.restore(Vocab_path)
else:
    print("Building Vocabulary ...")
    x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
              "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
              "Stansted Could Double Passengers on Deregulation, Times Reports."]
    # Vocabulary settings
    max_document_length = 11
    min_freq_filter = 2
    vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length,
                                                    min_frequency=min_freq_filter)
    # Only fit:
    # vocab_processor.fit(x_text)
    # Or fit_transform to also get the transformed corpus:
    x_text_transformed = vocab_processor.fit_transform(x_text)
    vocab_processor.save(Vocab_path)
    print("vocab_processor saved at:", Vocab_path)

# Build an embedding matrix whose row indices are consistent with the vocab word2index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix_with_dim(embedding_vector_path, 300)
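To make "consistent with the vocab word2index mapping" concrete, here is a minimal sketch of the general idea in plain Python, without using capricorn itself. The helper name `build_embedding_matrix`, the toy `word2index` dictionary, and the in-memory `pretrained` vectors are all assumptions made for illustration; capricorn's internals and the embedding file format it expects may differ.

```python
import random


def build_embedding_matrix(word2index, pretrained, dim, seed=0):
    """Return a list of rows where row i holds the vector of the word with index i.

    Hypothetical helper for illustration only; not part of capricorn.
    Assumes the indices in word2index are contiguous and start at 0.
    """
    rng = random.Random(seed)
    # Start with all-zero rows, one per vocabulary entry
    matrix = [[0.0] * dim for _ in range(len(word2index))]
    for word, idx in word2index.items():
        if word in pretrained:
            # Copy the pretrained vector into the row given by the word's index
            matrix[idx] = list(pretrained[word])
        elif word != "__PAD__":
            # Assumption: words without a pretrained vector get small random values,
            # while the padding token keeps its zero row
            matrix[idx] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return matrix


# Toy data standing in for a fitted vocabulary and a pretrained embedding file
word2index = {"__PAD__": 0, "__UNK__": 1, "orange": 2, "cloud": 3}
pretrained = {"orange": [0.1, 0.2, 0.3], "cloud": [0.4, 0.5, 0.6]}

embedding_matrix = build_embedding_matrix(word2index, pretrained, dim=3)
print(embedding_matrix[word2index["orange"]])
```

Because row i always belongs to the word with index i, a model can look up a transformed sequence of word indices directly in the matrix. Only words present in the vocabulary need a row, which is why the whole pretrained file does not have to be kept.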
User input
By default, the library uses the special tokens __UNK__ and __PAD__. If an input sequence is shorter than the max_document_length set when the VocabularyProcessor was initialized, the sequence is automatically padded with __PAD__.
If you define your own special tokens when initializing the vocabulary, you need to pre-process each sequence yourself by adding those tokens to the input sequence. For example, if you defined __START__ and __END__ as additional special tokens and max_document_length=11, you have to change the original sentence from:
"We like it very much"
to:
"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"
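One way to do this pre-processing is sketched below. The helper `wrap_and_pad` is hypothetical and not part of capricorn. It assumes whitespace tokenization and splits the padding evenly around the sentence, which is inferred from the example above rather than documented behaviour.

```python
def wrap_and_pad(sentence, max_document_length,
                 start="__START__", end="__END__", pad="__PAD__"):
    """Wrap a sentence in start/end tokens and pad it to max_document_length tokens.

    Hypothetical helper for illustration only; not part of capricorn.
    """
    tokens = sentence.split()
    # Two slots are reserved for the start and end tokens
    n_pad = max_document_length - len(tokens) - 2
    if n_pad < 0:
        raise ValueError("sentence is too long for max_document_length")
    # Assumption: padding is split evenly, with any odd leftover on the right
    left = n_pad // 2
    right = n_pad - left
    return " ".join([start] + [pad] * left + tokens + [pad] * right + [end])


padded = wrap_and_pad("We like it very much", max_document_length=11)
print(padded)
```

The resulting strings can then be passed to fit or fit_transform in place of the raw sentences.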