Tools for working with large datasets

Newline Tools

File processing utilities for working with massive datasets.

Installation

pip install newline-tools

CLI Usage

The newline command provides several subcommands:

newline <command> [options]

shuffle

Shuffle lines in a file:

newline shuffle <input_file> <output_file> [-b BUFFER_SIZE] [--progress] [--include_empty] [-r ROUNDS] [--seed SEED]

Options:

  • -b, --buffer_size: Buffer size in bytes (default: 1GB)
  • --progress: Show progress bars during shuffling
  • --include_empty: Include empty lines during shuffling (default: ignore empty lines)
  • -r, --rounds: Number of shuffling rounds (default: 1)
  • --seed: Seed for random number generator (for reproducibility)
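To illustrate what `--seed` and `--include_empty` control, here is a minimal in-memory sketch. It is not the package's buffered implementation; `shuffle_lines` is a hypothetical helper, and it assumes the whole input fits in memory.

```python
import random

def shuffle_lines(lines, seed=None, include_empty=False):
    """Return the lines in random order; the same seed gives the same order."""
    # Drop empty (whitespace-only) lines unless the caller asks to keep them.
    kept = [ln for ln in lines if include_empty or ln.strip()]
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    rng.shuffle(kept)
    return kept

lines = ["a", "", "b", "c", "", "d"]
first = shuffle_lines(lines, seed=42)
second = shuffle_lines(lines, seed=42)
with_empty = shuffle_lines(lines, seed=42, include_empty=True)
```

With a fixed seed, `first` and `second` come out in the same order, which is the reproducibility the option is meant to provide.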

dedupe

Remove duplicate lines:

newline dedupe <input_file> <output_file> [--progress] [--error_ratio ERROR_RATIO]

Options:

  • --progress: Show progress bar during deduplication
  • --error_ratio: Target error (false-positive) ratio for the Bloom filter (default: 1e-5)
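The error ratio trades memory for accuracy. As a rough guide, the standard Bloom filter sizing formulas can be sketched as follows. `bloom_parameters` is a hypothetical helper, and the package may size its filter differently.

```python
import math

def bloom_parameters(expected_items, error_ratio):
    """Textbook Bloom filter sizing: (number of bits, number of hash functions)."""
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-expected_items * math.log(error_ratio) / (math.log(2) ** 2))
    # k = (m / n) * ln 2
    hashes = max(1, round(bits / expected_items * math.log(2)))
    return bits, hashes

bits, hashes = bloom_parameters(1_000_000, 1e-5)
print(f"{bits / 8 / 1e6:.1f} MB, {hashes} hash functions")
```

For one million lines at the default ratio of 1e-5, this works out to roughly 3 MB and 17 hash functions.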

split

Split a file into parts:

newline split <input_file> <output_prefix> (-n NUM_PARTS | -s SIZE | -p PROPORTIONS) [--progress]

Options:

  • -n, --num_parts: Number of parts to split into
  • -s, --size: Size of each part (e.g., '100MB', '1GB')
  • -p, --proportions: Split by proportions, given as space-separated fractions (e.g., 0.3 0.3 0.4)
  • --progress: Show progress bar during splitting

Examples:

newline split input.txt output_prefix -n 5
newline split input.txt output_prefix -s 100MB
newline split input.txt output_prefix -p 0.3 0.3 0.4
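One way to turn proportions into per-part line counts is sketched below. This is an assumption about the approach, not the package's code: `part_line_counts` is a hypothetical helper that splits by line count and lets the last part absorb any rounding leftover.

```python
def part_line_counts(total_lines, proportions):
    """Map proportions (summing to 1) to per-part line counts."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    counts, start, cumulative = [], 0, 0.0
    for i, p in enumerate(proportions):
        cumulative += p
        if i == len(proportions) - 1:
            end = total_lines  # last part takes whatever remains
        else:
            end = round(total_lines * cumulative)
        counts.append(end - start)
        start = end
    return counts

counts = part_line_counts(10, [0.3, 0.3, 0.4])
```

Working from cumulative boundaries rather than rounding each part separately guarantees that the counts always add up to the total.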

sample

Sample lines from a file:

newline sample <input_file> <output_file> (-n NUM_LINES | -p PERCENTAGE) [--progress] [--seed SEED]

Options:

  • -n, --num_lines: Number of lines to sample
  • -p, --percentage: Percentage of lines to sample
  • --progress: Show progress bar during sampling
  • --seed: Seed for random number generator (for reproducibility)

Python Usage

from newline_tools import Shuffle, Dedupe, Split, Sample

# Shuffle: 16 MiB buffer (2**24 bytes), empty lines skipped, two rounds, fixed seed
shuffler = Shuffle('input.txt', buffer_size=2**24, progress=True, ignore_empty=True, rounds=2, seed=42)
shuffler.shuffle('output.txt')

# Dedupe: remove duplicate lines using a Bloom filter with the given error ratio
deduper = Dedupe('input.txt', progress=True)
deduper.dedupe('output.txt', error_ratio=1e-5)

# Split into 5 parts
splitter = Split('input.txt', progress=True)
splitter.split_by_parts('output_prefix', 5)
# or into parts of 100MB each
splitter.split_by_size('output_prefix', '100MB')
# or by proportions (30% / 30% / 40%)
splitter.split_by_proportion('output_prefix', [0.3, 0.3, 0.4])

# Sample 10,000 lines
sampler = Sample('input.txt', 'output.txt', sample_size=10000, progress=True, seed=42)
sampler.sample(method='reservoir')  # or 'index'
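
The 'reservoir' method named above presumably refers to reservoir sampling, which draws a fixed-size uniform sample in a single pass without knowing the line count in advance. A minimal sketch of the classic Algorithm R follows. `reservoir_sample` is a hypothetical helper, not the package's implementation.

```python
import random

def reservoir_sample(lines, sample_size, seed=None):
    """Pick sample_size lines uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < sample_size:
            # Fill the reservoir with the first sample_size lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability sample_size / (index + 1).
            j = rng.randint(0, index)  # both ends inclusive
            if j < sample_size:
                reservoir[j] = line
    return reservoir

picked = reservoir_sample((f"line {i}" for i in range(1000)), 10, seed=42)
```

Memory use depends only on the sample size, which is why this approach suits files too large to load at once.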

License

Dedicated to the public domain (CC0). Use as you wish.

Download files

Source Distribution

newline_tools-0.1.2.tar.gz (10.9 kB)

Built Distribution

newline_tools-0.1.2-py3-none-any.whl (12.5 kB)

File details

Details for the file newline_tools-0.1.2.tar.gz.

File metadata

  • Download URL: newline_tools-0.1.2.tar.gz
  • Upload date:
  • Size: 10.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.10.12

File hashes

Hashes for newline_tools-0.1.2.tar.gz
Algorithm Hash digest
SHA256 f6904b245861fac1b58791301fb46f8a0a49b3e127d729f254d05168f38c492e
MD5 4fe87fd87f187ec4abcbe9f7957696ba
BLAKE2b-256 a9ba287e03a048c7dfc090c33ceed0abd525a1c2377ba87ebcb36314873422da

File details

Details for the file newline_tools-0.1.2-py3-none-any.whl.

File hashes

Hashes for newline_tools-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 56916a7e6c3ee5b7f5e4c8d95b63caebb5b1dd3605e357f63b0e697269ba0a8d
MD5 b728a891a70f65d2f93c2da34633b233
BLAKE2b-256 6f6beb9ca19891642433ddd3444883287d2419c63676aff10f595725d115700a
