Sample lines from a file.

Project description

Sample lines from a file that has already been written.

Install

Install with pip.

pip install sample-lines

How to

The built-in help documents the options.

$ sample-lines -h
usage: Randomly select lines from a file. [-h] [--sample-size N]
                                          [--method {simple-random,systematic}]
                                          [--repeat REPEAT]
                                          file

positional arguments:
  file

optional arguments:
  -h, --help            show this help message and exit
  --sample-size N, -n N
                        Number of lines to emit
  --method {simple-random,systematic}, -m {simple-random,systematic}
                        Sampling method
  --repeat REPEAT, -r REPEAT
                        Number of repetitions for systematic sampling

Lines are sampled with replacement, and the probability of selecting a line is proportional to its length. This makes sampling very fast, but the approach is appropriate only if your file has reasonably consistent line lengths or if you don’t mind short lines being under-represented.
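
One way to get length-weighted sampling of this kind is to pick uniformly random byte offsets and emit whichever line contains each offset. The following is a minimal sketch under that assumption, not the actual source of sample-lines; the names line_at and sample_lines are hypothetical.

```python
import os
import random


def line_at(f, offset, chunk=4096):
    """Return the whole line (as bytes) containing byte `offset` of binary file `f`."""
    start = offset
    # Scan backwards in chunks for the newline that ends the previous line.
    while start > 0:
        step = min(chunk, start)
        f.seek(start - step)
        block = f.read(step)
        nl = block.rfind(b"\n")
        if nl != -1:
            start = start - step + nl + 1  # first byte after that newline
            break
        start -= step
    f.seek(start)
    return f.readline()


def sample_lines(path, n, rng=random):
    """Pick n lines with replacement; longer lines cover more bytes, so they win more often."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    with open(path, "rb") as f:
        # Each draw is a uniform byte offset, so cost does not depend on file length.
        return [line_at(f, rng.randrange(size)) for _ in range(n)]
```

Because each draw only seeks and reads near one offset, the file is never scanned from start to finish, which would explain why sampling is fast regardless of file size. It also shows why a short line, which covers few bytes, is rarely chosen.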

How fast

Consider this 1-gigabyte CSV file.

$ wc big-file.csv
 2388430 27673790 1071895374 big-file.csv

Running wc took almost four seconds.

$ time wc big-file.csv
 2388430 27673790 1071895374 big-file.csv

real    0m3.789s
user    0m3.560s
sys     0m0.190s

sample-lines is much faster. Here’s a simple random sample of 40 lines,

$ time sample-lines -n 40 -m simple-random big-file.csv > /dev/null

real    0m0.136s
user    0m0.113s
sys     0m0.018s

a systematic sample of 40 lines,

$ time sample-lines -n 40 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.148s
user    0m0.122s
sys     0m0.019s

and a repeated systematic sample, with 4 repeats of 10 lines each, for a total of 40 lines.

$ time sample-lines -n 10 -m systematic -r 4 big-file.csv > /dev/null

real    0m0.175s
user    0m0.140s
sys     0m0.025s
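
For readers unfamiliar with the term, systematic sampling takes evenly spaced positions after a random start, and a repeated systematic sample does this several times with independent starts. Here is a minimal sketch over byte offsets. It assumes the positions are byte offsets into the file, which the help text does not confirm, and systematic_offsets is a hypothetical name.

```python
import random


def systematic_offsets(size, n, repeats=1, rng=random):
    """Return `repeats` lists, each holding n evenly spaced byte offsets in [0, size)."""
    if size < 1 or n < 1:
        raise ValueError("size and n must be positive")
    interval = size / n
    samples = []
    for _ in range(repeats):
        # Random start inside the first interval, then fixed steps of `interval`.
        start = rng.random() * interval
        samples.append([min(size - 1, int(start + i * interval)) for i in range(n)])
    return samples
```

Each offset would then be mapped to the line that contains it. With n=10 and repeats=4 this yields four independent passes of ten offsets each, 40 in total.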

Most of the time in the above examples was spent loading Python and its modules; printing the help takes about as long as running the sample.

$ time sample-lines -h > /dev/null

real    0m0.157s
user    0m0.129s
sys     0m0.021s

So even a fairly large sample, here 50 repeats of 2000 lines each, finishes in under three seconds.

$ time sample-lines -n 2000 -m systematic -r 50 big-file.csv > /dev/null

real    0m2.695s
user    0m2.435s
sys     0m0.231s

Alternatives

Use sample if you want to sample from a stream.
