Corpusit

corpusit provides easy-to-use dataset iterators for natural language modeling tasks, such as SkipGram.

It is written in Rust to enable fast, multi-threaded random sampling with deterministic results, so you don't have to worry about speed or reproducibility.

corpusit does not provide tokenization. Run it on corpus files that are already tokenized (plain text).

Environment

Python >= 3.6

Installation

$ pip install corpusit

On Windows and macOS

Please install the Rust compiler before running pip install corpusit.

Usage

SkipGram

Each line in the corpus file is a document, and the tokens should be separated by whitespace.
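Since corpusit expects an already tokenized file, the format described above can be produced with a few lines of standard-library Python. This is a minimal sketch, not part of corpusit: `tokenize` and `write_corpus` are hypothetical helpers, and the regex tokenizer is only a stand-in for whatever tokenizer you actually use.

```python
import re
from pathlib import Path
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: lowercase, then split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def write_corpus(documents: Iterable[str], path: Path) -> int:
    """Write one document per line, with tokens joined by single spaces.

    Returns the number of non-empty documents written.
    """
    n_written = 0
    with path.open("w", encoding="utf-8") as f:
        for doc in documents:
            tokens = tokenize(doc)
            if not tokens:
                continue  # skip empty documents so no blank lines are emitted
            f.write(" ".join(tokens) + "\n")
            n_written += 1
    return n_written
```

For example, `write_corpus(["Anarchism is a political philosophy.", "It rejects hierarchy!"], Path("corpus.txt"))` writes a two-line file whose path can then be passed as the corpus path in the example below.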

import corpusit

corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')

dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",       # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)

it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)

for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')

# Output:
# Iter    0, shape=(100, 2). First pair:    14 (        is),    10 ( anarchism)
# Iter    1, shape=(100, 2). First pair:     8 (        to),   540 (      and/)
# Iter    2, shape=(100, 2). First pair:   775 (constitutes),    34 (anarchists)
# Iter    3, shape=(100, 2). First pair:    72 (     other),   214 (  criteria)
# Iter    4, shape=(100, 2). First pair:   650 (  defining),   487 ( companion)
# ...
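
To make the idea of a positive pair concrete, the sketch below enumerates (center, context) pairs from one tokenized document in plain Python. It is an illustration only and not corpusit's implementation: it assumes a fixed symmetric window of `win_size` tokens on each side, works on strings rather than vocabulary indices, and ignores subsampling and random sampling entirely. `skipgram_pairs` is a hypothetical helper name.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], win_size: int) -> Iterator[Tuple[str, str]]:
    """Yield (center, context) pairs within a symmetric window (assumed behavior)."""
    for i, center in enumerate(tokens):
        lo = max(0, i - win_size)                 # left edge of the window
        hi = min(len(tokens), i + win_size + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                            # a token is not its own context
                yield center, tokens[j]


doc = "anarchism is a political philosophy".split(" ")
pairs = list(skipgram_pairs(doc, win_size=1))
print(pairs[:3])
# [('anarchism', 'is'), ('is', 'anarchism'), ('is', 'a')]
```

In the corpusit example above, each batch plays the role of 100 such pairs, encoded as vocabulary indices.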

SkipGram with negative sampling

it = dataset.sampler(100, seed=0, num_threads=4)

for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')

# Output:
# Iter    0, shape=(200, 2). First pair:    15 (        is),    10 ( anarchism), label = True
# Iter    1, shape=(200, 2). First pair:     9 (        to),   722 (      and/), label = True
# Iter    2, shape=(200, 2). First pair:   389 (constitutes),    34 (anarchists), label = True
# Iter    3, shape=(200, 2). First pair:    73 (     other),   212 (  criteria), label = True
# Iter    4, shape=(200, 2). First pair:   445 (  defining),   793 ( companion), label = True
# ...
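
The batch shape of (200, 2) with a batch size of 100 and n_neg=1 is consistent with one negative pair being added per positive pair. The sketch below shows one common way to do this, under the assumption that `power` follows the word2vec convention of sampling noise words in proportion to count ** power. This is not confirmed by the corpusit documentation, and the helper names, the interleaved layout, and the use of Python's `random` module are all illustrative choices rather than the library's behavior.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def negative_sampling_weights(counts: Counter, power: float = 0.75) -> Tuple[List[str], List[float]]:
    """Return words (sorted for determinism) and unnormalized weights count ** power."""
    words = sorted(counts)
    weights = [counts[w] ** power for w in words]
    return words, weights


def add_negatives(
    positives: Sequence[Tuple[str, str]],
    counts: Counter,
    n_neg: int = 1,
    power: float = 0.75,
    seed: int = 0,
) -> Tuple[List[Tuple[str, str]], List[bool]]:
    """Follow each positive pair with n_neg (center, noise word) pairs labeled False."""
    rng = random.Random(seed)  # seeded generator, so repeated calls give the same result
    words, weights = negative_sampling_weights(counts, power)
    pairs: List[Tuple[str, str]] = []
    labels: List[bool] = []
    for center, context in positives:
        pairs.append((center, context))
        labels.append(True)
        for noise in rng.choices(words, weights=weights, k=n_neg):
            pairs.append((center, noise))
            labels.append(False)
    return pairs, labels


counts = Counter("anarchism is a political philosophy that is".split(" "))
pairs, labels = add_negatives([("anarchism", "is"), ("is", "a")], counts, n_neg=1)
print(len(pairs), labels)
```

With n_neg=1 the output has twice as many rows as there are positive pairs, which matches the doubling seen in the shapes above. Raising counts to a power below 1 flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.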

Roadmap

  • GloVe

License

MIT

