NLP vocabulary builder and embedding loader
capricorn
capricorn is a lightweight library that helps build a vocabulary from a corpus and prepare word embeddings for use by learning models.
- build a vocabulary from a corpus
- load the necessary word embeddings, with word indices consistent with the vocabulary
Getting started
import capricorn
import os
# Specify filepaths
Vocab_path = "vocab_processor"
embedding_vector_path = "data/embedding/model.vec"
# Load vocab
if os.path.isfile(Vocab_path):
    print("Loading Vocabulary ...")
    vocab_processor = capricorn.VocabularyProcessor.restore(Vocab_path)
else:  # build vocab
    print("Building Vocabulary ...")
    x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
              "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
              "Stansted Could Double Passengers on Deregulation, Times Reports."]
    # Build the vocabulary
    max_document_length = 11
    min_freq_filter = 2
    vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length, min_frequency=min_freq_filter)
    vocab_processor.fit(x_text)  # use fit_transform instead to also get the transformed corpus
    vocab_processor.save(Vocab_path)
    print("vocab_processor saved at:", Vocab_path)
# Build an embedding matrix whose indices are consistent with the vocab word2index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix(embedding_vector_path)
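To illustrate what a "consistent word index" means, here is a minimal sketch in plain Python. It is not capricorn's implementation. It assumes a word2vec/fastText-style text file (one `word v1 v2 ...` entry per line, optionally preceded by a `count dim` header) and a `word2index` dict with contiguous indices starting at 0. The helper name `build_embedding_matrix` and the toy data are made up for this example, and nested lists are used only to keep it dependency-free.

```python
import io

def build_embedding_matrix(vec_file, word2index, dim):
    """Return a len(word2index) x dim matrix aligned with word2index.

    Hypothetical helper for illustration; not part of capricorn's API.
    """
    # Words missing from the embedding file keep an all-zero row
    matrix = [[0.0] * dim for _ in range(len(word2index))]
    for line in vec_file:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            # Skip the optional header line and any malformed entries
            continue
        index = word2index.get(parts[0])
        if index is not None:
            # Row i holds the vector of the word whose index is i
            matrix[index] = [float(v) for v in parts[1:]]
    return matrix

# Toy data standing in for a real .vec file and a fitted vocabulary
toy_vec = io.StringIO("3 2\nlike 0.1 0.2\nmuch 0.3 0.4\nunused 0.5 0.6\n")
toy_word2index = {"__PAD__": 0, "__UNK__": 1, "like": 2, "much": 3}
toy_matrix = build_embedding_matrix(toy_vec, toy_word2index, dim=2)
print(len(toy_matrix), len(toy_matrix[0]))
```

Only vectors for words in the vocabulary are kept, so the matrix stays small even when the embedding file is large.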
User input
The library uses the special tokens __UNK__ and __PAD__ by default. If an input sequence is shorter than the max_document_length set when the VocabularyProcessor was initialized, it is automatically padded with __PAD__.
If you define additional special tokens when initializing the vocabulary, you need to pre-process each sequence yourself, namely by adding your special tokens to the input sequence. For example, if you define __START__ and __END__ as additional special tokens and max_document_length=11, you have to transform the original sentence from:
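The default behavior can be pictured with a small sketch. This is not capricorn's code: the helper `to_fixed_length` and the toy `known_words` set are assumptions made for illustration. It also assumes that __UNK__ stands in for out-of-vocabulary words (as the name suggests), that padding goes on the right, and that longer inputs are truncated; none of these details is confirmed here.

```python
PAD, UNK = "__PAD__", "__UNK__"

def to_fixed_length(sentence, known_words, max_document_length):
    """Hypothetical helper: map a sentence to exactly max_document_length tokens."""
    # Whitespace tokenization is an assumption of this sketch
    tokens = [w if w in known_words else UNK for w in sentence.split()]
    # Truncating long inputs is an assumption, not documented behavior
    tokens = tokens[:max_document_length]
    # Right-padding is likewise an assumption
    return tokens + [PAD] * (max_document_length - len(tokens))

known_words = {"We", "like", "it"}
fixed = to_fixed_length("We like it very much", known_words, 7)
print(" ".join(fixed))
# prints: We like it __UNK__ __UNK__ __PAD__ __PAD__
```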
"We like it very much"
to:
"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"
Hashes for capricorn-0.1.1-py2.py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 682bc39549989747232cc824c459a4504577165c88569f88968769452f05b22e |
MD5 | a9231f08faec9ba9fca97438a551231f |
BLAKE2b-256 | a0140f6f397a3f6b265436e1c4ee3eb91b6b571a91f8db3dcdb5297aeff65e88 |