Tools for working with large datasets

Newline Tools

File-processing utilities for newline-delimited text, designed for datasets too large to fit in memory.

Installation

pip install newline-tools

CLI Usage

The newline command provides several subcommands:

newline <command> [options]

shuffle

Shuffle lines in a file:

newline shuffle <input_file> <output_file> [-b BUFFER_SIZE] [--progress] [--include_empty] [-r ROUNDS]

Options:

  • -b, --buffer_size: Buffer size in bytes (default: 1GB)
  • --progress: Show progress bars during shuffling
  • --include_empty: Include empty lines during shuffling (default: ignore empty lines)
  • -r, --rounds: Number of shuffling rounds (default: 1)
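A full external shuffle that honors the buffer size and multiple rounds is more involved, but the core idea can be sketched in a few lines. This is an illustrative, in-memory sketch (the `shuffle_lines` helper is hypothetical, not the library's implementation):

```python
import random

def shuffle_lines(input_path, output_path, seed=None):
    # Read all lines at once; the real tool instead works in
    # buffer_size chunks and may make several shuffling rounds.
    with open(input_path) as f:
        # Skip empty/whitespace-only lines, matching the default
        # behavior (use --include_empty to keep them).
        lines = [line for line in f if line.strip()]
    rng = random.Random(seed)
    rng.shuffle(lines)
    with open(output_path, "w") as f:
        f.writelines(lines)
```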

dedupe

Remove duplicate lines:

newline dedupe <input_file> <output_file> [--progress] [--error_ratio ERROR_RATIO]

Options:

  • --progress: Show progress bar during deduplication
  • --error_ratio: Error ratio for the Bloom filter (default: 1e-5)
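The `--error_ratio` option tunes a Bloom filter: seen lines are tracked probabilistically, so roughly that fraction of unique lines may be wrongly treated as duplicates, in exchange for a small, fixed memory footprint. A minimal sketch of the technique (the names and the `capacity` parameter are illustrative; the library's actual filter may differ):

```python
import hashlib
import math

class BloomFilter:
    # Bit array sized from the expected item count and target error ratio.
    def __init__(self, capacity, error_ratio=1e-5):
        self.size = max(1, int(-capacity * math.log(error_ratio) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit hash halves
        # (Kirsch-Mitzenmacher double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item):
        # Set the item's bits; return True if all were already set
        # (i.e. the item was possibly seen before).
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

def dedupe_lines(input_path, output_path, capacity=1_000_000, error_ratio=1e-5):
    bf = BloomFilter(capacity, error_ratio)
    with open(input_path) as src, open(output_path, "w") as dst:
        for line in src:
            if not bf.add(line):
                dst.write(line)
```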

split

Split a file into parts:

newline split <input_file> <output_prefix> (-n NUM_PARTS | -s SIZE) [--progress]

Options:

  • -n, --num_parts: Number of parts to split into
  • -s, --size: Size of each part (e.g., '100MB', '1GB')
  • --progress: Show progress bar during splitting
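Size-based splitting keeps whole lines intact, starting a new part whenever the current part reaches the requested size. A minimal sketch of the idea (the `split_by_size` helper and the part-naming scheme here are illustrative):

```python
def split_by_size(input_path, output_prefix, max_bytes):
    # Write consecutive parts; a new part begins once max_bytes is
    # reached, and lines are never broken across parts.
    part, written, out = 0, 0, None
    paths = []
    with open(input_path, "rb") as src:
        for line in src:
            if out is None or written >= max_bytes:
                if out:
                    out.close()
                path = f"{output_prefix}.{part:04d}"
                paths.append(path)
                out = open(path, "wb")
                part, written = part + 1, 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return paths
```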

sample

Sample lines from a file:

newline sample <input_file> <output_file> (-n NUM_LINES | -p PERCENTAGE) [--progress]

Options:

  • -n, --num_lines: Number of lines to sample
  • -p, --percentage: Percentage of lines to sample
  • --progress: Show progress bar during sampling
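Sampling a fixed number of lines in a single pass, without knowing the file's length in advance, is typically done with reservoir sampling (the `method='reservoir'` option in the Python API refers to this idea). A minimal sketch of Algorithm R, with illustrative names:

```python
import random

def reservoir_sample(input_path, k, seed=None):
    # Keep the first k lines, then replace reservoir entries with
    # decreasing probability so every line ends up equally likely.
    rng = random.Random(seed)
    sample = []
    with open(input_path) as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = rng.randrange(i + 1)
                if j < k:
                    sample[j] = line
    return sample
```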

Python Usage

from newline_tools import Shuffle, Dedupe, Split, Sample

# Shuffle
shuffler = Shuffle('input.txt', buffer_size=2**24, progress=True, ignore_empty=True, rounds=2)
shuffler.shuffle('output.txt')

# Dedupe
deduper = Dedupe('input.txt', progress=True)
deduper.dedupe('output.txt', error_ratio=1e-5)

# Split
splitter = Split('input.txt', progress=True)
splitter.split_by_parts('output_prefix', 5)
# or
splitter.split_by_size('output_prefix', '100MB')

# Sample
sampler = Sample('input.txt', 'output.txt', sample_size=10000)
sampler.sample(method='reservoir')  # or 'index'

License

Dedicated to the public domain (CC0). Use as you wish.

