Tools for working with large datasets

Newline Tools

File processing utilities for working with massive datasets.

Installation

pip install newline-tools

CLI Usage

The newline command provides four subcommands (shuffle, dedupe, split, and sample):

newline <command> [options]

shuffle

Shuffle lines in a file:

newline shuffle <input_file> <output_file> [-b BUFFER_SIZE] [--progress] [--include_empty] [-r ROUNDS] [--seed SEED]

Options:

  • -b, --buffer_size: Buffer size in bytes (default: 1GB)
  • --progress: Show progress bars during shuffling
  • --include_empty: Include empty lines during shuffling (default: ignore empty lines)
  • -r, --rounds: Number of shuffling rounds (default: 1)
  • --seed: Seed for random number generator (for reproducibility)
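
How newline-tools shuffles internally is not described here. As a rough illustration of why the buffer size, rounds, and seed options matter, here is a minimal sketch of one common approach: shuffle the lines one buffer at a time. The function name chunked_shuffle and its parameters are hypothetical and are not part of the newline-tools API.

```python
import random


def chunked_shuffle(lines, buffer_bytes, seed=None, include_empty=False):
    """Shuffle lines within consecutive buffers of roughly buffer_bytes each.

    Hypothetical helper for illustration only; not the library's implementation.
    """
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    buffer, used, out = [], 0, []
    for line in lines:
        if not include_empty and not line.strip():
            continue  # skip empty lines unless asked to keep them
        buffer.append(line)
        used += len(line.encode("utf-8"))
        if used >= buffer_bytes:
            rng.shuffle(buffer)
            out.extend(buffer)
            buffer, used = [], 0
    rng.shuffle(buffer)  # flush the final, partially filled buffer
    out.extend(buffer)
    return out
```

In this sketch a line can only move within its own buffer during a single pass, so a larger buffer mixes lines more thoroughly. Under this assumption, additional rounds would only help if buffer boundaries differ between passes.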

dedupe

Remove duplicate lines:

newline dedupe <input_file> <output_file> [--progress] [--error_ratio ERROR_RATIO]

Options:

  • --progress: Show progress bar during deduplication
  • --error_ratio: Error ratio for the Bloom filter (default: 1e-5)
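
To make the --error_ratio option more concrete, the following is a minimal sketch of Bloom-filter deduplication in plain Python. It is not the library's implementation: the BloomFilter class, the dedupe_lines function, and the capacity parameter are assumptions made for this example.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter sized for `capacity` items at a target error ratio."""

    def __init__(self, capacity, error_ratio=1e-5):
        # Standard sizing formulas: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_ratio) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item):
        """Record item; return True if it was (probably) not seen before."""
        positions = self._positions(item)
        seen = all((self.bits[p // 8] >> (p % 8)) & 1 for p in positions)
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)
        return not seen


def dedupe_lines(lines, capacity, error_ratio=1e-5):
    """Keep the first occurrence of each line, in input order."""
    bloom = BloomFilter(capacity, error_ratio)
    return [line for line in lines if bloom.add_if_new(line.encode("utf-8"))]
```

A Bloom filter never misses a true duplicate, but it can occasionally report a new line as already seen. With this approach, a lower error ratio means fewer unique lines are dropped by mistake, at the cost of more memory.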

split

Split a file into parts:

newline split <input_file> <output_prefix> (-n NUM_PARTS | -s SIZE | -p PROPORTIONS) [--progress]

Options:

  • -n, --num_parts: Number of parts to split into
  • -s, --size: Size of each part (e.g., '100MB', '1GB')
  • -p, --proportions: Space-separated proportions, one per part (must sum to 1)
  • --progress: Show progress bar during splitting

Examples:

newline split input.txt output_prefix -n 5
newline split input.txt output_prefix -s 100MB
newline split input.txt output_prefix -p 0.3 0.3 0.4
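
The proportions option has to be turned into whole line counts at some point. The sketch below shows one way to do that with cumulative rounding, so the parts always add up to the total. The function proportion_line_counts is hypothetical, and the library may round differently or split on bytes instead of lines.

```python
import math


def proportion_line_counts(total_lines, proportions):
    """Convert proportions that sum to 1 into per-part line counts."""
    if not math.isclose(sum(proportions), 1.0, abs_tol=1e-9):
        raise ValueError("proportions must sum to 1")
    counts, assigned, cumulative = [], 0, 0.0
    for p in proportions:
        cumulative += p
        # Round the cumulative boundary, not each part, so errors do not add up
        boundary = round(cumulative * total_lines)
        counts.append(boundary - assigned)
        assigned = boundary
    counts[-1] += total_lines - assigned  # absorb any floating-point drift
    return counts
```

For example, proportion_line_counts(10, [0.3, 0.3, 0.4]) returns [3, 3, 4].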

sample

Sample lines from a file:

newline sample <input_file> <output_file> (-n NUM_LINES | -p PERCENTAGE) [--progress] [--seed SEED]

Options:

  • -n, --num_lines: Number of lines to sample
  • -p, --percentage: Percentage of lines to sample
  • --progress: Show progress bar during sampling
  • --seed: Seed for random number generator (for reproducibility)

Python Usage

from newline_tools import Shuffle, Dedupe, Split, Sample

# Shuffle (ignore_empty=True skips empty lines; it is the inverse of the CLI's --include_empty flag)
shuffler = Shuffle('input.txt', buffer_size=2**24, progress=True, ignore_empty=True, rounds=2, seed=42)
shuffler.shuffle('output.txt')

# Dedupe
deduper = Dedupe('input.txt', progress=True)
deduper.dedupe('output.txt', error_ratio=1e-5)

# Split
splitter = Split('input.txt', progress=True)
splitter.split_by_parts('output_prefix', 5)
# or
splitter.split_by_size('output_prefix', '100MB')
# or
splitter.split_by_proportion('output_prefix', [0.3, 0.3, 0.4])

# Sample
sampler = Sample('input.txt', 'output.txt', sample_size=10000, progress=True, seed=42)
sampler.sample(method='reservoir')  # or 'index'
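
For readers unfamiliar with the 'reservoir' method named above, here is a minimal sketch of classic reservoir sampling (Algorithm R). It picks a fixed-size uniform sample in a single pass without knowing the number of lines in advance. The function reservoir_sample is an illustration only; whether newline-tools uses this exact variant is an assumption.

```python
import random


def reservoir_sample(lines, sample_size, seed=None):
    """Return a uniform random sample of up to sample_size lines in one pass."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    reservoir = []
    for index, line in enumerate(lines):
        if index < sample_size:
            reservoir.append(line)  # fill the reservoir first
        else:
            # Replace a random slot with probability sample_size / (index + 1)
            j = rng.randint(0, index)
            if j < sample_size:
                reservoir[j] = line
    return reservoir
```

Because only sample_size lines are held in memory at any time, this approach suits files too large to load whole. An index-based method would typically need the line count up front.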

License

Dedicated to the public domain (CC0). Use as you wish.

Download files

Source Distribution

newline_tools-0.1.1.tar.gz (10.9 kB)

Uploaded Source

Built Distribution

newline_tools-0.1.1-py3-none-any.whl (12.5 kB)

Uploaded Python 3

File details

Details for the file newline_tools-0.1.1.tar.gz.

File metadata

  • Download URL: newline_tools-0.1.1.tar.gz
  • Size: 10.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for newline_tools-0.1.1.tar.gz
Algorithm Hash digest
SHA256 1f4b105cb95ff8e3e48d11e0d89084298584e80d284b4bac66b873af4f419d21
MD5 d61582c7852c81119445dfc080e5d2c3
BLAKE2b-256 7073f59921ece7e0e57f2224d38cbc10ee6d1023c3a3bc7ad36790ac18ded07a

File details

Details for the file newline_tools-0.1.1-py3-none-any.whl.

File hashes

Hashes for newline_tools-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 460902c1d143c4189d176310cda00c85994781414608303122c0895fb9c81b0f
MD5 f7fe9ca37407e87bb109055cc58f04ca
BLAKE2b-256 9671892a82262c411e1d40f78d173b51161528468f7cd638e3fd5fd5ccd6a9d8
