NLP vocabulary builder and embedding loader

capricorn

capricorn is a lightweight library that helps build a vocabulary from a corpus and prepare word embeddings for use by learning models.

  1. Build a vocabulary from a corpus.
  2. Load the necessary word embeddings, indexed consistently with the word indices in the Vocabulary.
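
As a rough illustration of the first step, the sketch below builds a word-to-index mapping from a corpus with a minimum-frequency filter. It is not capricorn's implementation: the helper name build_word2index, the whitespace tokenization, and the choice to reserve indices 0 and 1 for __PAD__ and __UNK__ are all assumptions made for this example.

```python
from collections import Counter

PAD, UNK = "__PAD__", "__UNK__"


def build_word2index(corpus, min_frequency=1):
    """Map each sufficiently frequent word to an integer index.

    Hypothetical helper: reserving index 0 for PAD and 1 for UNK is an
    assumption of this sketch, not documented capricorn behavior.
    """
    # Count words using plain whitespace tokenization (assumption).
    counts = Counter(word for sentence in corpus for word in sentence.split())
    word2index = {PAD: 0, UNK: 1}
    # Sort by descending frequency, then alphabetically, so indices are deterministic.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_frequency:
            word2index[word] = len(word2index)
    return word2index


corpus = ["we like it", "we like tea", "they like it"]
word2index = build_word2index(corpus, min_frequency=2)
```

With min_frequency=2, the words "tea" and "they" occur only once and are left out of the mapping.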

Getting started

import os

import capricorn

# Specify file paths
vocab_path = "vocab_processor"
embedding_vector_path = "data/embedding/model.vec"

if os.path.isfile(vocab_path):
    # Restore a previously saved vocabulary
    print("Loading Vocabulary ...")
    vocab_processor = capricorn.VocabularyProcessor.restore(vocab_path)
else:
    # Build the vocabulary from the corpus
    print("Building Vocabulary ...")

    x_text = [
        "Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
        "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
        "Stansted Could Double Passengers on Deregulation, Times Reports.",
    ]

    max_document_length = 11
    min_freq_filter = 2

    vocab_processor = capricorn.VocabularyProcessor(
        max_document_length=max_document_length,
        min_frequency=min_freq_filter
    )
    vocab_processor.fit(x_text)  # use fit_transform to also get the transformed corpus
    vocab_processor.save(vocab_path)
    print("vocab_processor saved at: " + vocab_path)

# Build an embedding matrix whose row indices are consistent with the vocabulary's word2index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix(embedding_vector_path)
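
To make "consistent with the word2index mapping" concrete, the following is a minimal sketch of the idea in plain Python. It assumes the embedding file uses the plain-text .vec layout (an optional "count dim" header, then one word followed by its vector components per line) and that words missing from the file keep a zero vector. The helper load_aligned_embeddings is hypothetical and is not part of capricorn; how prepare_embedding_matrix works internally is not described here.

```python
import io


def load_aligned_embeddings(lines, word2index, dim):
    """Return a list of rows where row i holds the vector of the word with index i.

    Hypothetical helper. Words absent from the embedding file keep an
    all-zero row (an assumption of this sketch).
    """
    matrix = [[0.0] * dim for _ in range(len(word2index))]
    for line in lines:
        parts = line.rstrip().split(" ")
        # Lines without exactly dim + 1 fields (such as the "count dim"
        # header when dim > 1) are skipped.
        if len(parts) != dim + 1:
            continue
        word, values = parts[0], parts[1:]
        # Only words present in the vocabulary are kept.
        if word in word2index:
            matrix[word2index[word]] = [float(v) for v in values]
    return matrix


word2index = {"__PAD__": 0, "__UNK__": 1, "like": 2, "it": 3}
# In-memory stand-in for a small .vec file with 3 words of dimension 2.
vec_file = io.StringIO(u"3 2\nit 0.1 0.2\nlike 0.3 0.4\ntea 0.5 0.6\n")
embedding_matrix = load_aligned_embeddings(vec_file, word2index, dim=2)
```

Because only words found in word2index are copied, the matrix has one row per vocabulary entry regardless of how large the embedding file is.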

User input

The library uses the special tokens __UNK__ and __PAD__ by default. If an input sequence is shorter than the max_document_length given when the VocabularyProcessor was initialized, the sequence is automatically padded with __PAD__.
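
The padding behavior can be pictured with the short sketch below. It is an illustration only: the encode helper is hypothetical, and padding on the right, truncating over-long inputs, and mapping out-of-vocabulary words to __UNK__ are assumptions of this sketch that the library may handle differently.

```python
PAD, UNK = "__PAD__", "__UNK__"


def encode(sentence, word2index, max_document_length):
    """Convert a sentence to a fixed-length list of word indices (hypothetical helper)."""
    # Truncate inputs longer than max_document_length (assumption).
    tokens = sentence.split()[:max_document_length]
    # Pad short inputs on the right with PAD (assumption).
    tokens += [PAD] * (max_document_length - len(tokens))
    # Words not in the vocabulary fall back to the UNK index (assumption).
    return [word2index.get(tok, word2index[UNK]) for tok in tokens]


word2index = {PAD: 0, UNK: 1, "we": 2, "like": 3, "it": 4}
ids = encode("we like it very much", word2index, max_document_length=8)
```

Here "very" and "much" are not in the toy vocabulary, so they map to the __UNK__ index, and the three remaining positions are filled with the __PAD__ index.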

If you define additional special tokens when initializing the Vocabulary, you must pre-process each sequence yourself by adding those tokens to the input. For example, if you define __START__ and __END__ as additional special tokens and set max_document_length=11, you have to transform the original sentence from:

"We like it very much"

to:

"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"

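The transformation above can be automated with a small helper along these lines. This is a sketch under stated assumptions: the function name wrap_with_special_tokens is hypothetical, tokens are split on whitespace, padding is divided evenly between both sides as in the example, and any odd leftover pad is placed on the right, a case the example above does not cover.

```python
def wrap_with_special_tokens(sentence, max_document_length,
                             start="__START__", end="__END__", pad="__PAD__"):
    """Surround a sentence with start/end tokens and pad it to a fixed length.

    Hypothetical helper, not part of capricorn.
    """
    tokens = sentence.split()
    # Two slots are reserved for the start and end tokens.
    n_pad = max_document_length - len(tokens) - 2
    if n_pad < 0:
        raise ValueError("sentence is too long for max_document_length")
    left = n_pad // 2
    right = n_pad - left  # an odd leftover pad goes on the right (assumption)
    return " ".join([start] + [pad] * left + tokens + [pad] * right + [end])


processed = wrap_with_special_tokens("We like it very much", max_document_length=11)
```

The resulting string contains exactly max_document_length tokens, so no further automatic padding should be needed for it.
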
