Corpusit
corpusit provides easy-to-use dataset iterators for natural language modeling tasks, such as SkipGram.
It is written in Rust to enable fast multi-threaded random sampling with deterministic results, so you don't have to worry about speed or reproducibility.
Corpusit does not provide tokenization functionality, so please use it on tokenized corpus files (plain text).
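Since tokenization is left to the user, a corpus file has to be prepared beforehand. The sketch below is one minimal way to do that with only the Python standard library. The regular expression, the `tokenize` and `write_corpus` helpers, and the output path `corpus.txt` are illustrative assumptions and not part of corpusit. It writes one document per line, with tokens separated by a single space.

```python
import re
from pathlib import Path

# Hypothetical tokenization rule: runs of word characters, or single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(text.lower())


def write_corpus(documents, path):
    """Write one document per line, tokens separated by a single space."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            tokens = tokenize(doc)
            if tokens:  # skip documents that produce no tokens
                f.write(" ".join(tokens) + "\n")


if __name__ == "__main__":
    docs = ["Anarchism is a political philosophy.", "It rejects hierarchy!"]
    write_corpus(docs, "corpus.txt")
    print(Path("corpus.txt").read_text(encoding="utf-8"))
```

Any tokenizer can be substituted for `tokenize`, as long as the resulting file keeps one document per line and uses a consistent separator between tokens.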
Environment
Python >= 3.6
Installation
$ pip install corpusit
On Windows and macOS
Please install the Rust compiler before executing pip install corpusit.
Usage
SkipGram
Each line in the corpus file is a document, and the tokens should be separated by whitespace.
import corpusit

corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')

dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",  # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)

it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)

for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')

# Returns:
# Iter 0, shape=(100, 2). First pair: 14 ( is), 10 ( anarchism)
# Iter 1, shape=(100, 2). First pair: 8 ( to), 540 ( and/)
# Iter 2, shape=(100, 2). First pair: 775 (constitutes), 34 (anarchists)
# Iter 3, shape=(100, 2). First pair: 72 ( other), 214 ( criteria)
# Iter 4, shape=(100, 2). First pair: 650 ( defining), 487 ( companion)
# ...
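Each row of a batch above is a pair of token ids. As a rough illustration of what such pairs represent, the sketch below enumerates every (center, context) pair for a single tokenized document, assuming a symmetric window of `win_size` tokens on each side. Whether corpusit defines the window this way is not stated in this README, and corpusit draws pairs by random sampling rather than enumerating them. `skipgram_pairs` is a hypothetical helper, not a corpusit function.

```python
def skipgram_pairs(tokens, win_size):
    """Return every (center, context) pair at most win_size positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - win_size)                # left edge of the window
        hi = min(len(tokens), i + win_size + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    doc = "anarchism is a political philosophy".split(" ")
    for pair in skipgram_pairs(doc, win_size=1):
        print(pair)
```

The sketch works on token strings for readability. Mapping them to integer ids through a vocabulary would give rows of the same (batch, 2) layout as the output above.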
SkipGram with negative sampling
it = dataset.sampler(100, seed=0, num_threads=4)

for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')

# Returns:
# Iter 0, shape=(200, 2). First pair: 15 ( is), 10 ( anarchism), label = True
# Iter 1, shape=(200, 2). First pair: 9 ( to), 722 ( and/), label = True
# Iter 2, shape=(200, 2). First pair: 389 (constitutes), 34 (anarchists), label = True
# Iter 3, shape=(200, 2). First pair: 73 ( other), 212 ( criteria), label = True
# Iter 4, shape=(200, 2). First pair: 445 ( defining), 793 ( companion), label = True
# ...
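In the output above, a batch size of 100 with `n_neg=1` yields 200 rows, which is consistent with each positive pair being accompanied by one negative pair. The sketch below shows one way such a labeled batch could be assembled in plain Python. It assumes the word2vec convention in which negative contexts are drawn from token counts raised to `power`. That reading of `power`, the row ordering, and the `add_negatives` helper are assumptions for illustration, not corpusit's documented behavior.

```python
import random


def add_negatives(positive_pairs, counts, n_neg=1, power=0.75, seed=0):
    """Append n_neg negative pairs per positive pair; return (pairs, labels)."""
    rng = random.Random(seed)  # fixed seed makes the draws reproducible
    positives = list(positive_pairs)
    token_ids = list(counts)
    # Smoothed unigram distribution: frequent tokens are sampled more often,
    # but less than proportionally to their raw counts.
    weights = [counts[t] ** power for t in token_ids]

    pairs = list(positives)
    labels = [True] * len(positives)
    for center, _ in positives:
        for context in rng.choices(token_ids, weights=weights, k=n_neg):
            pairs.append((center, context))
            labels.append(False)
    return pairs, labels


if __name__ == "__main__":
    counts = {0: 50, 1: 30, 2: 15, 3: 5}  # made-up token id -> frequency
    positives = [(0, 1), (1, 2), (2, 3)]
    pairs, labels = add_negatives(positives, counts, n_neg=1)
    print(len(pairs), labels)
```

With `n_neg=1` the result has twice as many rows as there are positive pairs, matching the (200, 2) shape shown above for a batch size of 100. A randomly drawn negative can occasionally coincide with a true context; this sketch does not filter such cases.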
Roadmap
- GloVe
License
MIT