Skip to main content

Loads SVMlight text files into scipy sparse CSR matrices in parallel.

Project description

SVMlight text files to scipy CSR

travis build pypi

Many sparse datasets are distributed in a lightweight text format called svmlight. While simple and familiar, it's terribly slow to read in python even with C++ solutions due to serial processing. Instead, svm2csr loads by using a parallel Rust extension which chunks files into byte blocks, then seeks to different blocks to parse in parallel.

# benchmark dataset is kdda training set, 2.5GB flat text
# https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

import sklearn.datasets
%timeit sklearn.datasets.load_svmlight_file('kdda')
1min 56s ± 1.72 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# https://github.com/mblondel/svmlight-loader
%timeit svmlight_loader.load_svmlight_file('kdda')
1min 52s ± 3.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

import svm2csr
%timeit svm2csr.load_svmlight_file('kdda')
11.4 s ± 527 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Above micro-benchmark performed on my 8-core laptop.

Install

pip install svm2csr

Note this package is only available pre-built for pythons, operating systems, and machine architecture targets I can build wheels for (see Publishing). Settings other than the following need to install rust and compile from source (pip install should still work, but will compile for your platform).

  • cp36-cp39, manylinux2010, x86_64

Unsupported Features

  • dtype (currently only doubles supported)
  • an svmlight ranking mode where query ids are identified with qid
  • comments in svmlight files (start with #)
  • empty or blank lines
  • multilabel extension
  • reading from compressed files
  • reading from multiple files and stacking
  • reading from streams
  • writing SVMlight files
  • n_features option
  • graceful client multiprocessing
  • mac and windows wheels

All of these are fixable (even stream reading with parallel bridge). Let me know if you'd like to make PR.

Documentation

def load_svmlight_file(fname, zero_based="auto", min_chunk_size=(16 * 1024)):
    """
    Loads an SVMlight file into a CSR matrix.

    fname (str): the file name of the file to load.
    zero_based ("auto" or bool): whether the corresponding svmlight file uses
        zero based indexing; if false or all indices are nonzero, then
        shifts indices down uniformly by 1 for python's zero indexing.
    min_chunk_size (int): minimum chunk size in bytes per
        parallel processing task

    Returns (X, y) where X is a sparse CSR matrix and y is a numpy double array
    with length equal to the number of rows in X. Values of X are doubles.
    """

Dev Info

Install maturin and pytest first.

pip install maturin pytest

Local development.

cargo test # test rust only
maturin develop # create py bindings for rust code
pytest # test python bindings

Publishing

  1. Fetch the most recent master.
  2. Bump the version in Cargo.toml appropriately if needed. Commit these changes.
  3. Tag the release. git tag -a -m "v<CURRENT VERSION>"
  4. Push to github, triggering a Travis build that tests, packages, and uploads to pypi. git push --follow-tags

Every master travis build attempts to publish to pypi (but may fail if a build with the same version is already present).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

svm2csr-0.3.1.tar.gz (13.9 kB view hashes)

Uploaded Source

Built Distributions

svm2csr-0.3.1-cp39-cp39-manylinux2010_x86_64.whl (245.0 kB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.12+ x86-64

svm2csr-0.3.1-cp38-cp38-manylinux2010_x86_64.whl (245.0 kB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.12+ x86-64

svm2csr-0.3.1-cp37-cp37m-manylinux2010_x86_64.whl (244.9 kB view hashes)

Uploaded CPython 3.7m manylinux: glibc 2.12+ x86-64

svm2csr-0.3.1-cp36-cp36m-manylinux2010_x86_64.whl (245.0 kB view hashes)

Uploaded CPython 3.6m manylinux: glibc 2.12+ x86-64

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page