Loads SVMlight text files into scipy sparse CSR matrices in parallel.

SVMlight text files to scipy CSR

Many sparse datasets are distributed in a lightweight text format called svmlight. While simple and familiar, the format is slow to read in Python, even with C++ parsers, because those parsers process the file serially. svm2csr instead uses a parallel Rust extension that splits the file into byte blocks and seeks to each block so the blocks can be parsed in parallel.

# benchmark dataset is kdda training set, 2.5GB flat text
# https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

import sklearn.datasets
%timeit sklearn.datasets.load_svmlight_file('kdda')
1min 56s ± 1.72 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# https://github.com/mblondel/svmlight-loader
import svmlight_loader
%timeit svmlight_loader.load_svmlight_file('kdda')
1min 52s ± 3.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

import svm2csr
%timeit svm2csr.load_svmlight_file('kdda')
11.4 s ± 527 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The micro-benchmark above was run on my 8-core laptop.
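To make the chunk-and-seek idea from the introduction concrete, here is a minimal pure-Python sketch. It is not the Rust implementation used by svm2csr: the names chunk_bounds, parse_chunk, and load_chunked are hypothetical, the output is plain lists in CSR layout rather than a scipy matrix, and Python threads will not give the speedup that native threads do. It only illustrates how byte blocks can be aligned to line boundaries so each block is parsed independently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(path, min_chunk_size=16 * 1024, max_chunks=8):
    """Split a file into (start, end) byte ranges that begin on line starts."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    n_chunks = max(1, min(max_chunks, size // min_chunk_size))
    step = -(-size // n_chunks)  # ceiling division
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            # Seek just before the tentative boundary, then read to the
            # end of that line so the next chunk starts on a fresh line.
            f.seek(i * step - 1)
            f.readline()
            pos = f.tell()
            if starts[-1] < pos < size:
                starts.append(pos)
    return list(zip(starts, starts[1:] + [size]))


def parse_chunk(path, start, end):
    """Parse the whole lines stored in bytes [start, end) of the file."""
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode("ascii")
    labels, data, indices, counts = [], [], [], []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        labels.append(float(tokens[0]))
        for token in tokens[1:]:
            index, value = token.split(":")
            indices.append(int(index))
            data.append(float(value))
        counts.append(len(tokens) - 1)
    return labels, data, indices, counts


def load_chunked(path, min_chunk_size=16 * 1024, max_chunks=8):
    """Parse chunks concurrently, then stitch them into CSR-style lists."""
    bounds = chunk_bounds(path, min_chunk_size, max_chunks)
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in chunk order, so row order is preserved.
        parts = list(pool.map(lambda b: parse_chunk(path, *b), bounds))
    labels, data, indices, indptr = [], [], [], [0]
    for part_labels, part_data, part_indices, part_counts in parts:
        labels.extend(part_labels)
        data.extend(part_data)
        indices.extend(part_indices)
        for n in part_counts:
            indptr.append(indptr[-1] + n)
    return labels, data, indices, indptr
```

The only coordination needed between workers is the final concatenation, which is why the approach scales with the number of cores.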

Install

pip install svm2csr

Note that pre-built wheels are only available for the Python versions, operating systems, and machine architectures I can build for (see Publishing). On anything other than the following, pip install should still work, but it requires Rust to be installed and compiles the package from source for your platform.

  • cp36-cp39, manylinux2010, x86_64

Unsupported Features

  • dtype (currently only doubles supported)
  • an svmlight ranking mode where query ids are identified with qid
  • comments in svmlight files (start with #)
  • empty or blank lines
  • multilabel extension
  • reading from compressed files
  • reading from multiple files and stacking
  • reading from streams
  • writing SVMlight files
  • n_features option
  • graceful client multiprocessing
  • mac and windows wheels

All of these are fixable (even stream reading, with a parallel bridge). Let me know if you'd like to make a PR.
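Until comments and blank lines are supported, one possible workaround is to clean the file before loading it. The sketch below is an assumption about how one might do that, not part of svm2csr: clean_svmlight is a hypothetical helper, and it assumes that everything from a # to the end of a line is a comment.

```python
def clean_svmlight(src, dst):
    """Copy src to dst, dropping '#' comments and blank lines.

    Returns the number of data lines written.
    """
    kept = 0
    with open(src, "r") as fin, open(dst, "w") as fout:
        for line in fin:
            # Assumption: '#' starts a comment that runs to end of line.
            body = line.split("#", 1)[0].strip()
            if not body:
                continue  # blank line or comment-only line
            fout.write(body + "\n")
            kept += 1
    return kept


# The cleaned file can then be passed to the loader, for example:
#   clean_svmlight("raw.svm", "clean.svm")
#   X, y = svm2csr.load_svmlight_file("clean.svm")
```

This costs an extra serial pass over the file, so it is only worthwhile when the input is known to contain comments or blank lines.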

Documentation

def load_svmlight_file(fname, zero_based="auto", min_chunk_size=(16 * 1024)):
    """
    Loads an SVMlight file into a CSR matrix.

    fname (str): the file name of the file to load.
    zero_based ("auto" or bool): whether the svmlight file uses zero-based
        indexing; if False, or if "auto" and all indices are nonzero, shifts
        indices down uniformly by 1 to match Python's zero-based indexing.
    min_chunk_size (int): minimum chunk size in bytes per
        parallel processing task

    Returns (X, y) where X is a sparse CSR matrix and y is a numpy double array
    with length equal to the number of rows in X. Values of X are doubles.
    """
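The zero_based rule in the docstring can be illustrated with a small pure-Python sketch. This reflects one reading of the documented behaviour, not the library's actual code, and shift_indices is a hypothetical name used only for illustration.

```python
def shift_indices(indices, zero_based="auto"):
    """Return column indices converted to zero-based numbering."""
    if zero_based == "auto":
        # Heuristic: a file that contains index 0 must already be
        # zero-based; otherwise assume it is one-based.
        zero_based = any(i == 0 for i in indices)
    if zero_based:
        return list(indices)
    # One-based input: shift every index down by 1.
    return [i - 1 for i in indices]
```

For example, shift_indices([1, 3, 2]) gives [0, 2, 1], while shift_indices([0, 3]) leaves the indices unchanged. Passing zero_based explicitly avoids the heuristic when a zero-based file happens not to use column 0.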

Dev Info

Install maturin and pytest first.

pip install maturin pytest

Local development.

cargo test # test rust only
maturin develop # create py bindings for rust code
pytest # test python bindings

Publishing

  1. Fetch the most recent master.
  2. Bump the version in Cargo.toml appropriately if needed. Commit these changes.
  3. Tag the release. git tag -a "v<CURRENT VERSION>" -m "v<CURRENT VERSION>"
  4. Push to GitHub, triggering a Travis build that tests, packages, and uploads to PyPI. git push --follow-tags

Every Travis build on master attempts to publish to PyPI (but may fail if a build with the same version is already present).
