Corpusit

corpusit provides easy-to-use dataset iterators for natural language modeling tasks, such as SkipGram.

It is written in Rust to enable fast multi-threaded random sampling with deterministic results, so you don't have to worry about speed or reproducibility.

Corpusit does not provide tokenization, so please run it on corpus files that are already tokenized (plain text).

Environment

Python >= 3.6

Installation

$ pip install corpusit

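On Windows and macOS

Please install the Rust compiler before running pip install corpusit (prebuilt wheels are published for Linux only, so the package is compiled from source on these platforms).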

Usage

SkipGram

Each line in the corpus file is a document, and the tokens should be separated by whitespace.
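
Since corpusit expects an already-tokenized file, the corpus has to be prepared beforehand. The following is a minimal sketch of one way to produce a file in this layout (one document per line, tokens joined by a single space). The helpers `simple_tokenize` and `format_corpus` are hypothetical names introduced here for illustration and are not part of corpusit; the regex-based tokenizer is a deliberately crude stand-in for whatever tokenizer you actually use.

```python
import re


def simple_tokenize(text):
    """Illustrative tokenizer (an assumption, not corpusit's):
    lowercase the text and split on runs of non-alphanumeric ASCII characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def format_corpus(documents):
    """Return the corpus as text: one document per line,
    tokens separated by a single space."""
    return "".join(" ".join(simple_tokenize(doc)) + "\n" for doc in documents)


docs = ["Anarchism is a political philosophy.", "Hello, world!"]
corpus_text = format_corpus(docs)
print(corpus_text, end="")
```

Writing `corpus_text` to a UTF-8 text file gives a path that can be passed as `corpus_path` in the example below, with `sep=" "` matching the single-space separator used here.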

import corpusit

corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')

dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",       # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)

it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)

for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')

# Output:
# Iter    0, shape=(100, 2). First pair:    14 (        is),    10 ( anarchism)
# Iter    1, shape=(100, 2). First pair:     8 (        to),   540 (      and/)
# Iter    2, shape=(100, 2). First pair:   775 (constitutes),    34 (anarchists)
# Iter    3, shape=(100, 2). First pair:    72 (     other),   214 (  criteria)
# Iter    4, shape=(100, 2). First pair:   650 (  defining),   487 ( companion)
# ...

SkipGram with negative sampling

it = dataset.sampler(100, seed=0, num_threads=4)

for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')

# Output:
# Iter    0, shape=(200, 2). First pair:    15 (        is),    10 ( anarchism), label = True
# Iter    1, shape=(200, 2). First pair:     9 (        to),   722 (      and/), label = True
# Iter    2, shape=(200, 2). First pair:   389 (constitutes),    34 (anarchists), label = True
# Iter    3, shape=(200, 2). First pair:    73 (     other),   212 (  criteria), label = True
# Iter    4, shape=(200, 2). First pair:   445 (  defining),   793 ( companion), label = True
# ...
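
Judging from the printed shapes, a batch of 100 positive pairs with `n_neg=1` appears to come back as 200 rows, each with a boolean label. One common way to consume such `(pair, label)` batches is the skip-gram negative-sampling logistic loss. The sketch below is pure Python and independent of corpusit: `sgns_loss`, `emb_in`, and `emb_out` are hypothetical names, and the embeddings are plain nested lists rather than whatever tensor type your training code uses.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(pairs, labels, emb_in, emb_out):
    """Mean logistic loss over (center, context) index pairs.

    Positive pairs (label True) are pushed toward sigmoid(score) = 1,
    negative pairs (label False) toward sigmoid(score) = 0.
    """
    total = 0.0
    count = 0
    for (center, context), is_positive in zip(pairs, labels):
        score = dot(emb_in[center], emb_out[context])
        x = score if is_positive else -score
        # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form
        total += max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
        count += 1
    if count == 0:
        raise ValueError("empty batch")
    return total / count


# Toy data standing in for one batch: 3 words, 2-dimensional embeddings.
emb_in = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
emb_out = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
pairs = [(1, 1), (1, 2)]
labels = [True, False]
loss = sgns_loss(pairs, labels, emb_in, emb_out)
```

Because the function only iterates over rows and indexes by integer, it should also accept the array-like `pair` and `label` objects from the loop above, though that is untested here.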

Roadmap

  • GloVe

License

MIT
