Corpusit

corpusit provides easy-to-use dataset iterators for natural language modeling tasks, such as SkipGram.

It is written in Rust to enable fast multi-threaded random sampling with deterministic results, so you don't have to worry about speed or reproducibility.
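
To illustrate how multi-threaded sampling can stay deterministic, here is a minimal pure-Python sketch. It does not show corpusit's Rust implementation; it only demonstrates one common approach, which is to derive each work item's random generator from the seed and the item's index so the result never depends on thread scheduling. The names `sample_one` and `sample_all` are hypothetical and are not part of corpusit.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sample_one(seed, index, sequence):
    """Pick one random element using an RNG keyed by (seed, index)."""
    # Seeding with a string is deterministic across runs and thread counts.
    rng = random.Random(f"{seed}:{index}")
    return rng.choice(sequence)


def sample_all(sequences, seed, num_threads):
    """Sample from every sequence in parallel; output order follows input order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(sample_one, seed, i, seq)
            for i, seq in enumerate(sequences)
        ]
        return [f.result() for f in futures]


sequences = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
single = sample_all(sequences, seed=0, num_threads=1)
multi = sample_all(sequences, seed=0, num_threads=4)
print(single == multi)  # True: the result does not depend on the thread count
```

Because no generator is shared between work items, changing `num_threads` changes only the speed, not the samples.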

corpusit does not provide tokenization functionality, so please use it on tokenized corpus files (plain text).

Environment

Python >= 3.6

Installation

$ pip install corpusit

On Windows and macOS

Please install the Rust compiler before running pip install corpusit.

Usage

SkipGram with Positive Sampling

Process tokenized sequences to generate positive SkipGram pairs:

import corpusit
import numpy as np

# Create word counts mapping (word_id -> count)
word_counts = {0: 100, 1: 50, 2: 200, 3: 75, 4: 150}

# Create SkipGram configuration
config = corpusit.SkipGramConfig(
    word_counts=word_counts,
    win_size=5,
    subsample=1e-3,
    power=0.75,
    n_neg=1
)

# Create positive sampler
sampler = config.positive_sampler(seed=0)

# Process a sequence of word IDs
sequence = [0, 1, 2, 3, 4, 1, 2, 0]
pairs = sampler.process_sequence(sequence)
print(f'Generated {len(pairs)} positive pairs')
print(f'Shape: {pairs.shape}')
print(f'First few pairs: {pairs[:3]}')
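
For readers new to SkipGram, the following pure-Python sketch shows what a positive pair is: a center word paired with each word inside a window around it. It is a reference illustration only, under the assumption that `win_size` counts words on each side of the center. corpusit's sampler may differ, for example by applying subsampling or by shrinking the window at random, so its output will generally not match this sketch. The function `positive_pairs` is hypothetical.

```python
def positive_pairs(sequence, win_size):
    """Return (center, context) pairs within a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - win_size)
        hi = min(len(sequence), i + win_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sequence[j]))
    return pairs


pairs = positive_pairs([0, 1, 2, 3], win_size=1)
print(pairs)
# [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
```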

SkipGram with Negative Sampling

Generate both positive and negative samples with labels:

# Create sampler with negative sampling
# (reuses `config` and `np` from the previous example)
sampler = config.sampler(seed=0, num_threads=4)

# Process sequences
sequences = [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
pairs, labels = sampler.process_sequences(sequences)

print(f'Generated {len(pairs)} samples')
print(f'Pairs shape: {pairs.shape}')
print(f'Labels shape: {labels.shape}')
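# Count via nonzero entries so the result is correct for boolean or integer labels
print(f'Positive samples: {np.count_nonzero(labels)}')
print(f'Negative samples: {labels.size - np.count_nonzero(labels)}')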
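
The `power`, `subsample`, and `n_neg` settings resemble the usual word2vec hyperparameters. Assuming they follow those conventions (this README does not confirm it), the sketch below shows the standard formulas: negatives are drawn with probability proportional to `count ** power`, and each occurrence of a word is kept with probability `sqrt(subsample / frequency)`, capped at 1. All function names here are hypothetical and are not part of corpusit.

```python
import random


def negative_distribution(word_counts, power):
    """Map word_id -> probability proportional to count ** power."""
    weights = {w: c ** power for w, c in word_counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def keep_probability(word_counts, subsample):
    """Map word_id -> probability of keeping an occurrence (word2vec paper form)."""
    total = sum(word_counts.values())
    return {
        w: min(1.0, (subsample / (c / total)) ** 0.5)
        for w, c in word_counts.items()
    }


def draw_negatives(word_counts, power, n_neg, rng):
    """Draw n_neg word IDs from the smoothed unigram distribution."""
    dist = negative_distribution(word_counts, power)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[w] for w in ids], k=n_neg)


word_counts = {0: 100, 1: 50, 2: 200, 3: 75, 4: 150}
dist = negative_distribution(word_counts, power=0.75)
keep = keep_probability(word_counts, subsample=1e-3)
negatives = draw_negatives(word_counts, power=0.75, n_neg=5, rng=random.Random(0))
print({w: round(p, 3) for w, p in dist.items()})
print({w: round(p, 3) for w, p in keep.items()})
print(negatives)
```

With `power` below 1, frequent words are still sampled more often than rare ones, but by a smaller margin than their raw counts would give.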

SkipGram with Tokenization

Process raw text sequences with automatic tokenization:

# Create configuration with tokenization support
word_counts = {0: 100, 1: 50, 2: 200, 3: 75, 4: 150}
word_to_id = {"hello": 0, "world": 1, "python": 2, "rust": 3, "fast": 4}

config = corpusit.SkipGramConfigWithTokenization(
    word_counts=word_counts,
    word_to_id=word_to_id,
    separator=" ",
    win_size=5,
    subsample=1e-3,
    power=0.75,
    n_neg=1
)

# Create sampler
sampler = config.sampler(seed=0, num_threads=4)

# Process raw text
text_sequences = ["hello world python", "world python rust", "python rust fast"]
pairs, labels = sampler.process_string_sequences(text_sequences)

print(f'Generated {len(pairs)} samples from text')
print(f'First few pairs: {pairs[:3]}')
print(f'Labels: {labels[:3]}')
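
The `word_counts` and `word_to_id` mappings above are written by hand. For a real corpus they would normally be built from the tokenized text. The sketch below shows one way to do that with the standard library; `build_vocab` is a hypothetical helper, and assigning IDs by descending frequency is an arbitrary choice made here, not something corpusit is known to require.

```python
from collections import Counter


def build_vocab(lines, separator=" "):
    """Return (word_to_id, word_counts) from an iterable of tokenized lines."""
    token_counts = Counter()
    for line in lines:
        token_counts.update(tok for tok in line.strip().split(separator) if tok)
    # Assign IDs by descending frequency; ties broken alphabetically for stability.
    ordered = sorted(token_counts, key=lambda tok: (-token_counts[tok], tok))
    word_to_id = {tok: i for i, tok in enumerate(ordered)}
    word_counts = {word_to_id[tok]: token_counts[tok] for tok in ordered}
    return word_to_id, word_counts


lines = ["hello world python", "world python rust", "python rust fast"]
word_to_id, word_counts = build_vocab(lines)
print(word_to_id)   # {'python': 0, 'rust': 1, 'world': 2, 'fast': 3, 'hello': 4}
print(word_counts)  # {0: 3, 1: 2, 2: 2, 3: 1, 4: 1}
```

Since an open file is an iterable of lines, the same helper can be given a file object for a tokenized plain-text corpus.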

Roadmap

  • GloVe

License

MIT
