
NLP vocabulary builder and embedding loader

capricorn

capricorn is a lightweight library that helps build a vocabulary from a corpus and prepare word embeddings ready for use by learning models.

  1. Build a vocabulary from a corpus.
  2. Load only the necessary word embeddings, with word indices consistent with the vocabulary.
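
To make the first step concrete, here is a minimal sketch of frequency-filtered vocabulary building. It does not use capricorn; the function name `build_word2index`, the lowercase whitespace tokenization, the `freq >= min_frequency` rule, and the placement of `__PAD__` and `__UNK__` at indices 0 and 1 are all assumptions made for illustration, and the library may behave differently.

```python
from collections import Counter


def build_word2index(corpus, min_frequency=1, specials=("__PAD__", "__UNK__")):
    """Map each sufficiently frequent word to an integer index (hypothetical helper)."""
    # Count whitespace-separated, lowercased tokens across all documents.
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    # Special tokens take the first indices (an assumption, not capricorn's documented layout).
    word2index = {token: i for i, token in enumerate(specials)}
    # Most frequent words first; ties are broken alphabetically so the result is deterministic.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_frequency:
            word2index[word] = len(word2index)
    return word2index


corpus = ["the cat sat", "the dog sat", "a bird flew"]
word2index = build_word2index(corpus, min_frequency=2)
print(word2index)
```

Words that fall below the threshold get no index of their own; in a scheme like this they would typically be mapped to the `__UNK__` index at transform time.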

getting started

pip install capricorn

import capricorn
import os

# Specify filepaths
Vocab_path = "vocab_processor"
embedding_vector_path = "path/to/embedding"

# Load vocab
if os.path.isfile(Vocab_path):
  print("Loading Vocabulary ...")
  vocab_processor = capricorn.VocabularyProcessor.restore(Vocab_path)

else:  # build vocab
  print("Building Vocabulary ...")

  x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
            "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
            "Stansted Could Double Passengers on Deregulation, Times Reports."]

  # Vocabulary settings
  max_document_length = 11
  min_freq_filter = 2

  vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length,
                                                  min_frequency=min_freq_filter)
  # To fit only, without transforming the corpus:
  # vocab_processor.fit(x_text)
  # Or use fit_transform to fit and get the transformed corpus in one call:
  x_text_transformed = vocab_processor.fit_transform(x_text)
  vocab_processor.save(Vocab_path)
  print("vocab_processor saved at:", Vocab_path)

# Build an embedding matrix whose row indices match the vocab's word-to-index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix_with_dim(embedding_vector_path, 300)
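
The idea of an embedding matrix whose index is consistent with the vocabulary can be sketched without the library. The example below is not capricorn's implementation: the helper `build_embedding_matrix`, the GloVe-like text format (a word followed by space-separated floats), the zero rows for words missing from the embedding file, and the sample `word2index` mapping are all assumptions chosen for illustration.

```python
import io


def build_embedding_matrix(word2index, embedding_lines, dim):
    """Return a list of rows where row i is the vector of the word with index i."""
    # Start from zeros, so words absent from the embedding file keep a zero vector
    # (an assumption; other initializations are possible).
    matrix = [[0.0] * dim for _ in range(len(word2index))]
    for line in embedding_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Only keep vectors for words that are in the vocabulary and have the expected size.
        if word in word2index and len(values) == dim:
            matrix[word2index[word]] = [float(v) for v in values]
    return matrix


# Hypothetical vocabulary and a tiny in-memory stand-in for an embedding file.
word2index = {"__PAD__": 0, "__UNK__": 1, "cloud": 2, "funds": 3}
fake_file = io.StringIO("cloud 0.1 0.2 0.3\nbank 0.9 0.9 0.9\nfunds 0.4 0.5 0.6\n")
matrix = build_embedding_matrix(word2index, fake_file, dim=3)

for word, index in word2index.items():
    print(index, word, matrix[index])
```

Because only vocabulary words are kept, the vector for "bank" is skipped and the matrix has one row per vocabulary entry rather than one row per line of the embedding file.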

User input

By default, the library uses the special tokens __UNK__ and __PAD__. If an input sequence is shorter than the max_document_length given when initializing VocabularyProcessor, it is automatically padded with __PAD__.
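
As a rough illustration of the padding behavior, the sketch below pads a token list up to a fixed length. It is not the library's code: the helper `pad_tokens` is hypothetical, padding on the right is an assumption, and so is truncating sequences longer than max_document_length, which the text above does not describe.

```python
def pad_tokens(tokens, max_document_length, pad_token="__PAD__"):
    """Pad a token list on the right up to max_document_length (hypothetical helper)."""
    # Assumption: sequences that are too long are cut to the maximum length.
    tokens = list(tokens)[:max_document_length]
    return tokens + [pad_token] * (max_document_length - len(tokens))


padded = pad_tokens("We like it very much".split(), max_document_length=11)
print(padded)
```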

If you define your own special tokens when initializing the vocabulary, you need to pre-process the sequences yourself by adding those tokens to each input sequence. For example, if you define __START__ and __END__ as additional special tokens and set max_document_length=11, you have to change the original sentence from:

"We like it very much"

to:

"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"
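
The transformation above can be automated with a small helper. This is a sketch only: `wrap_with_special_tokens` is a hypothetical function, not part of capricorn, and splitting the padding evenly on both sides (with any extra pad token on the right) is an assumption inferred from the example.

```python
def wrap_with_special_tokens(sentence, max_document_length,
                             start="__START__", end="__END__", pad="__PAD__"):
    """Add start/end tokens and pad both sides so the result has max_document_length tokens."""
    words = sentence.split()
    # Two slots are reserved for the start and end tokens.
    n_pad = max_document_length - len(words) - 2
    if n_pad < 0:
        raise ValueError("sentence is too long for max_document_length")
    left = n_pad // 2
    right = n_pad - left  # any odd leftover goes to the right side (assumption)
    tokens = [start] + [pad] * left + words + [pad] * right + [end]
    return " ".join(tokens)


processed = wrap_with_special_tokens("We like it very much", max_document_length=11)
print(processed)
# __START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__
```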
