Corpusit
corpusit provides easy-to-use dataset iterators for natural language modeling tasks, such as SkipGram.
It is written in Rust to enable fast, multi-threaded random sampling with deterministic results, so you don't have to worry about speed or reproducibility.
Corpusit does not provide tokenization functionality, so please use corpusit on tokenized corpus files (plain text).
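Because tokenization is left to the user, a preprocessing step is needed before corpusit can read a corpus. The following is a minimal sketch of such a step using only the standard library. It writes one document per line with tokens separated by single spaces, which is the layout the Usage section asks for. The `tokenize` and `write_tokenized_corpus` helpers and the regular expression are hypothetical and not part of corpusit; any tokenizer that produces whitespace-separated tokens should work equally well.

```python
import re
import tempfile
from pathlib import Path


def tokenize(text):
    """Hypothetical tokenizer: lowercase, then split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def write_tokenized_corpus(documents, path):
    """Write one document per line, with tokens joined by single spaces."""
    n_written = 0
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            tokens = tokenize(doc)
            if tokens:  # skip documents that contain no tokens
                f.write(" ".join(tokens) + "\n")
                n_written += 1
    return n_written


raw_documents = [
    "Anarchism is a political philosophy.",
    "It is sceptical of authority!",
]
corpus_path = Path(tempfile.mkdtemp()) / "corpus.txt"
write_tokenized_corpus(raw_documents, corpus_path)
print(corpus_path.read_text(encoding="utf-8"))
# anarchism is a political philosophy .
# it is sceptical of authority !
```

The resulting file path can then be passed wherever a corpus path is expected.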
Environment
Python >= 3.6
Installation
$ pip install corpusit
On Windows and macOS
Please install the Rust compiler before running pip install corpusit.
Usage
SkipGram
Each line in the corpus file is a document, and the tokens should be separated by whitespace.
import corpusit
corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')
dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",  # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)
it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)
for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')
# Output:
# Iter 0, shape=(100, 2). First pair: 14 ( is), 10 ( anarchism)
# Iter 1, shape=(100, 2). First pair: 8 ( to), 540 ( and/)
# Iter 2, shape=(100, 2). First pair: 775 (constitutes), 34 (anarchists)
# Iter 3, shape=(100, 2). First pair: 72 ( other), 214 ( criteria)
# Iter 4, shape=(100, 2). First pair: 650 ( defining), 487 ( companion)
# ...
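To make the (batch_size, 2) pairs above more concrete, here is a minimal pure-Python sketch of how skip-gram positive pairs can be enumerated from one tokenized document. It is an illustration of the general technique, not corpusit's implementation: the `skipgram_pairs` helper is hypothetical, it assumes `win_size` means the maximum distance between a center token and its context on either side, and it lists every pair exhaustively rather than sampling them at random or applying subsampling.

```python
def skipgram_pairs(tokens, win_size):
    """Enumerate (center, context) pairs whose positions differ by at most win_size."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - win_size)                # left edge of the window
        hi = min(len(tokens), i + win_size + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


doc = "anarchism is a political philosophy".split()
pairs = skipgram_pairs(doc, win_size=1)
for center, context in pairs:
    print(center, context)
```

With `win_size=1` each interior token contributes two pairs and each boundary token contributes one, so this five-token document yields eight pairs.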
SkipGram with negative sampling
it = dataset.sampler(100, seed=0, num_threads=4)
for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')
# Output:
# Iter 0, shape=(200, 2). First pair: 15 ( is), 10 ( anarchism), label = True
# Iter 1, shape=(200, 2). First pair: 9 ( to), 722 ( and/), label = True
# Iter 2, shape=(200, 2). First pair: 389 (constitutes), 34 (anarchists), label = True
# Iter 3, shape=(200, 2). First pair: 73 ( other), 212 ( criteria), label = True
# Iter 4, shape=(200, 2). First pair: 445 ( defining), 793 ( companion), label = True
# ...
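The shape (200, 2) above is consistent with 100 positive pairs plus `n_neg=1` negative pair for each. The sketch below illustrates the common word2vec-style recipe for drawing such negatives, in which noise words are sampled with probability proportional to their count raised to `power`. This is an assumption about what the `power` and `n_neg` parameters mean, not a description of corpusit's internals; the `negative_distribution` and `add_negatives` helpers are hypothetical, and the ordering of positives and negatives within a real corpusit batch may differ.

```python
import random
from collections import Counter


def negative_distribution(counts, power=0.75):
    """Return words and probabilities proportional to count ** power."""
    words = sorted(counts)
    weights = [counts[w] ** power for w in words]
    total = sum(weights)
    return words, [w / total for w in weights]


def add_negatives(positive_pairs, counts, n_neg=1, power=0.75, seed=0):
    """Return (pairs, labels): each positive pair followed by n_neg sampled negatives."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    words, probs = negative_distribution(counts, power)
    pairs, labels = [], []
    for center, context in positive_pairs:
        pairs.append((center, context))
        labels.append(True)
        # Noise words are drawn independently of the true context,
        # so a negative can occasionally coincide with it.
        for noise in rng.choices(words, weights=probs, k=n_neg):
            pairs.append((center, noise))
            labels.append(False)
    return pairs, labels


tokens = "the cat sat on the mat the end".split()
counts = Counter(tokens)
positives = [("cat", "sat"), ("sat", "on")]
pairs, labels = add_negatives(positives, counts, n_neg=1, seed=0)
print(len(pairs), labels)
# 4 [True, False, True, False]
```

Raising counts to a power below 1 flattens the distribution: here "the" makes up 3/8 of the tokens but receives a smaller share of the negative samples.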
Roadmap
- GloVe
License
MIT