Corpusit

corpusit provides easy-to-use dataset iterators for natural language modeling tasks, such as SkipGram.

It is written in Rust to enable fast multi-threaded random sampling with deterministic results, so you don't have to worry about speed or reproducibility.

Corpusit does not provide tokenization, so please run it on already tokenized corpus files (plain text).
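Since tokenization is left to the user, here is a minimal sketch of preparing a corpus file in the expected layout (one document per line, tokens separated by a single space). The helper `write_tokenized_corpus` and its regex tokenizer are hypothetical and only for illustration; substitute whatever tokenizer suits your data.

```python
import os
import re
import tempfile


def write_tokenized_corpus(documents, path):
    """Write one document per line, tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:
            # Hypothetical tokenizer: lowercase, then keep runs of
            # letters, digits and apostrophes. Not part of corpusit.
            tokens = re.findall(r"[a-z0-9']+", doc.lower())
            if tokens:  # skip documents that end up empty
                f.write(" ".join(tokens) + "\n")


raw_documents = [
    "Anarchism is a political philosophy.",
    "It is sceptical of authority!",
]
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
write_tokenized_corpus(raw_documents, corpus_path)

# Read the file back to check the layout.
with open(corpus_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

The resulting file can then be passed as the corpus path, with the separator set to a single space.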

Environment

Python >= 3.6

Installation

$ pip install corpusit

On Windows and macOS

Please install the Rust compiler before running pip install corpusit.

Usage

SkipGram

Each line in the corpus file is a document, and the tokens should be separated by whitespace.

import corpusit

corpus_path = 'corpusit/data/corpus.txt'
vocab = corpusit.Vocab.build(corpus_path, min_count=1, unk='<unk>')

dataset = corpusit.SkipGramDataset(
    path_to_corpus=corpus_path,
    vocab=vocab,
    win_size=10,
    sep=" ",
    mode="onepass",       # onepass | repeat | shuffle
    subsample=1e-3,
    power=0.75,
    n_neg=1,
)

it = dataset.positive_sampler(batch_size=100, seed=0, num_threads=4)

for i, pair in enumerate(it):
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10})')

# Output:
# Iter    0, shape=(100, 2). First pair:    14 (        is),    10 ( anarchism)
# Iter    1, shape=(100, 2). First pair:     8 (        to),   540 (      and/)
# Iter    2, shape=(100, 2). First pair:   775 (constitutes),    34 (anarchists)
# Iter    3, shape=(100, 2). First pair:    72 (     other),   214 (  criteria)
# Iter    4, shape=(100, 2). First pair:   650 (  defining),   487 ( companion)
# ...
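For readers new to SkipGram, the sketch below shows in plain Python how positive (center, context) pairs can be derived from one tokenized line with a fixed window. It is a reference illustration only, not corpusit's implementation: the function `skipgram_pairs` is hypothetical, and details such as whether corpusit shrinks the window at random, or in which order it emits each pair, are not stated here.

```python
def skipgram_pairs(tokens, win_size):
    """Yield (center, context) for every context within win_size positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - win_size)                 # left edge of the window
        hi = min(len(tokens), i + win_size + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                            # a token is not its own context
                yield center, tokens[j]


line = "anarchism is a political philosophy"
pairs = list(skipgram_pairs(line.split(" "), win_size=1))
for center, context in pairs:
    print(center, context)
```

With a larger window, such as the `win_size=10` used above, each token is paired with many more neighbours, which is why sampling in batches is useful.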

SkipGram with negative sampling

it = dataset.sampler(100, seed=0, num_threads=4)

for i, res in enumerate(it):
    pair, label = res
    print(f'Iter {i:>4d}, shape={pair.shape}. First pair: '
          f'{pair[0,0]:>5} ({vocab.i2s[pair[0,0]]:>10}), '
          f'{pair[0,1]:>5} ({vocab.i2s[pair[0,1]]:>10}), '
          f'label = {label[0]}')

# Output:
# Iter    0, shape=(200, 2). First pair:    15 (        is),    10 ( anarchism), label = True
# Iter    1, shape=(200, 2). First pair:     9 (        to),   722 (      and/), label = True
# Iter    2, shape=(200, 2). First pair:   389 (constitutes),    34 (anarchists), label = True
# Iter    3, shape=(200, 2). First pair:    73 (     other),   212 (  criteria), label = True
# Iter    4, shape=(200, 2). First pair:   445 (  defining),   793 ( companion), label = True
# ...
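The batch shape of (200, 2) above is consistent with 100 positive pairs plus 100 × n_neg negative pairs, although that reading is an inference from the output. The parameter names subsample and power match the usual word2vec conventions. Assuming corpusit follows those conventions (not confirmed here), the sketch below shows what the two values would control: frequent words are kept with probability sqrt(subsample / frequency), and negatives are drawn in proportion to count ** power. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def keep_probability(count, total, subsample):
    """word2vec-style subsampling: frequent words are dropped more often."""
    freq = count / total
    return min(1.0, math.sqrt(subsample / freq))


def negative_distribution(counts, power):
    """Unigram distribution raised to `power` and renormalized."""
    weights = {w: c ** power for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: x / norm for w, x in weights.items()}


# Toy counts, for illustration only.
counts = Counter({"the": 900, "anarchism": 90, "criteria": 10})
total = sum(counts.values())

keep = {w: keep_probability(c, total, subsample=1e-3) for w, c in counts.items()}
neg = negative_distribution(counts, power=0.75)

# Draw a few negative words with a seeded generator for reproducibility.
rng = random.Random(0)
negatives = rng.choices(list(neg), weights=list(neg.values()), k=5)
print(keep, neg, negatives)
```

A power below 1 flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.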

Roadmap

  • GloVe

License

MIT

Download files

Source Distribution

  • corpusit-0.2.0.tar.gz (30.4 kB)

Built Distributions

  • corpusit-0.2.0-pp39-pypy39_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.2 kB): PyPy, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-pp38-pypy38_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.3 kB): PyPy, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.8 kB): CPython 3.13, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.9 kB): CPython 3.12, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (287.9 kB): CPython 3.11, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.1 kB): CPython 3.10, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.7 kB): CPython 3.9, manylinux (glibc 2.17+), x86-64
  • corpusit-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288.8 kB): CPython 3.8, manylinux (glibc 2.17+), x86-64
