Tools for working with large datasets
Newline Tools
File processing utilities. Useful for working with massive datasets.
Installation
pip install newline-tools
CLI Usage
The newline command provides several subcommands:
newline <command> [options]
shuffle
Shuffle lines in a file:
newline shuffle <input_file> <output_file> [-b BUFFER_SIZE] [--progress] [--include_empty] [-r ROUNDS]
Options:
- -b, --buffer_size: Buffer size in bytes (default: 1GB)
- --progress: Show progress bars during shuffling
- --include_empty: Include empty lines during shuffling (default: ignore empty lines)
- -r, --rounds: Number of shuffling rounds (default: 1)
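How the shuffle command uses its buffer internally is not described here. As a rough illustration of shuffling a file that does not fit in memory, the following sketch uses a two-pass bucket shuffle: lines are scattered at random into temporary bucket files of roughly buffer_size bytes each, then each bucket is shuffled in memory and appended to the output. The function name bucket_shuffle and its parameters are hypothetical and are not part of newline-tools.

```python
import os
import random
import tempfile


def bucket_shuffle(src, dst, buffer_size=1 << 20, include_empty=False, seed=None):
    """Two-pass external shuffle (illustrative sketch, not the library's code)."""
    rng = random.Random(seed)
    # Ceiling division: enough buckets that each holds about buffer_size bytes.
    # Note: a very large file with a small buffer could exceed the OS limit
    # on simultaneously open files.
    num_buckets = max(1, -(-os.path.getsize(src) // buffer_size))
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}") for i in range(num_buckets)]
        buckets = [open(p, "wb") for p in paths]
        try:
            # Pass 1: scatter every line into a randomly chosen bucket.
            with open(src, "rb") as fin:
                for line in fin:
                    if not include_empty and not line.strip():
                        continue
                    if not line.endswith(b"\n"):
                        line += b"\n"  # normalize a final line without newline
                    rng.choice(buckets).write(line)
        finally:
            for bucket in buckets:
                bucket.close()
        # Pass 2: shuffle each bucket in memory and append it to the output.
        with open(dst, "wb") as fout:
            for path in paths:
                with open(path, "rb") as fb:
                    lines = fb.readlines()
                rng.shuffle(lines)
                fout.writelines(lines)
```

Calling bucket_shuffle('input.txt', 'output.txt', buffer_size=2**24) would keep roughly 16 MB of lines in memory at a time. Bucket sizes are only approximately equal because lines are assigned at random.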
dedupe
Remove duplicate lines:
newline dedupe <input_file> <output_file> [--progress] [--error_ratio ERROR_RATIO]
Options:
- --progress: Show progress bar during deduplication
- --error_ratio: Error ratio for the Bloom filter (default: 1e-5)
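To show what the error ratio controls, here is a minimal Bloom-filter deduplication sketch. It is an assumption about the general technique, not the package's actual implementation; the BloomFilter class and dedupe_lines function are hypothetical names. The filter is sized with the standard formulas for an expected number of items and a target false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Small Bloom filter for illustration (hypothetical, not the library's class)."""

    def __init__(self, capacity, error_ratio=1e-5):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_ratio) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, item):
        """Set the item's bits; return True if any bit was previously unset."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                self.bits[byte] |= mask
                is_new = True
        return is_new


def dedupe_lines(lines, capacity, error_ratio=1e-5):
    """Yield each line the first time it is seen, keeping the original order."""
    seen = BloomFilter(capacity, error_ratio)
    for line in lines:
        if seen.add_if_new(line.encode("utf-8")):
            yield line
```

Because a Bloom filter can report false positives, a small fraction of genuinely unique lines (about the error ratio, when the filter is sized for the real line count) may be dropped. A lower error ratio reduces that at the cost of more memory.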
split
Split a file into parts:
newline split <input_file> <output_prefix> (-n NUM_PARTS | -s SIZE) [--progress]
Options:
- -n, --num_parts: Number of parts to split into
- -s, --size: Size of each part (e.g., '100MB', '1GB')
- --progress: Show progress bar during splitting
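As a sketch of size-based splitting, the code below parses a size string such as '100MB' and writes parts that never break a line. Several details are assumptions made for illustration: units are treated as binary multiples (1 MB = 1024 * 1024 bytes), parts are named prefix_000, prefix_001, and so on, and the helper names parse_size and split_by_size are hypothetical. The package may differ on all of these.

```python
import re

# Assumption: binary multiples; the package may use decimal units instead.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a string like '100MB' or '1.5 GB' to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*", text.upper())
    if not match:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def split_by_size(src, prefix, max_bytes):
    """Split src into numbered parts of at most max_bytes, on line boundaries."""
    paths, out, written = [], None, 0
    with open(src, "rb") as fin:
        for line in fin:
            # Start a new part when the current one would overflow. A single
            # line longer than max_bytes still gets a part of its own.
            if out is None or (written and written + len(line) > max_bytes):
                if out:
                    out.close()
                path = f"{prefix}_{len(paths):03d}"  # hypothetical naming scheme
                paths.append(path)
                out, written = open(path, "wb"), 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return paths
```

For example, split_by_size('input.txt', 'output_prefix', parse_size('100MB')) returns the list of part paths it wrote.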
sample
Sample lines from a file:
newline sample <input_file> <output_file> (-n NUM_LINES | -p PERCENTAGE) [--progress]
Options:
- -n, --num_lines: Number of lines to sample
- -p, --percentage: Percentage of lines to sample
- --progress: Show progress bar during sampling
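Sampling a fixed number of lines from a file of unknown length is commonly done with reservoir sampling. The sketch below shows the classic single-pass version (Algorithm R); it is an illustration of the technique, not the package's code, and the function name reservoir_sample is hypothetical.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

Memory use is proportional to k, not to the file size, so this works on a file object iterated line by line. If the input has fewer than k lines, all of them are returned.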
Python Usage
from newline_tools import Shuffle, Dedupe, Split, Sample
# Shuffle
shuffler = Shuffle('input.txt', buffer_size=2**24, progress=True, ignore_empty=True, rounds=2)
shuffler.shuffle('output.txt')
# Dedupe
deduper = Dedupe('input.txt', progress=True)
deduper.dedupe('output.txt', error_ratio=1e-5)
# Split
splitter = Split('input.txt', progress=True)
splitter.split_by_parts('output_prefix', 5)
# or
splitter.split_by_size('output_prefix', '100MB')
# Sample
sampler = Sample('input.txt', 'output.txt', sample_size=10000)
sampler.sample(method='reservoir') # or 'index'
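What the 'index' method does internally is not documented here. One common interpretation, sketched below as an assumption, is to record the byte offset of every line first and then seek directly to a random subset of those offsets. The names build_line_index and index_sample are hypothetical.

```python
import random


def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets, pos = [], 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def index_sample(path, k, seed=None):
    """Sample up to k distinct lines by seeking to randomly chosen offsets."""
    offsets = build_line_index(path)
    rng = random.Random(seed)
    # Sorting keeps the sampled lines in file order and makes seeks sequential.
    chosen = sorted(rng.sample(offsets, min(k, len(offsets))))
    sampled = []
    with open(path, "rb") as f:
        for offset in chosen:
            f.seek(offset)
            sampled.append(f.readline())
    return sampled
```

Compared with reservoir sampling, this approach needs memory for one offset per line and two passes over the file, but it only reads the content of the lines that are actually selected.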
License
Dedicated to the public domain (CC0). Use as you wish.
Download files
Source Distribution
newline_tools-0.1.0.tar.gz (10.5 kB)
Built Distribution
newline_tools-0.1.0-py3-none-any.whl (12.4 kB)
File details
Details for the file newline_tools-0.1.0.tar.gz.
File metadata
- Download URL: newline_tools-0.1.0.tar.gz
- Upload date:
- Size: 10.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8d93011c386465f14d0517ddf787fe99067458b038321fe1d364649c9fed2ced
MD5 | 5a85aecd08617aed66a1206a6abf03c0
BLAKE2b-256 | 260e4c54e49cc57028e8bab89e724b8e8de4f6b4d07a42cb307dec0c3bb81b87
File details
Details for the file newline_tools-0.1.0-py3-none-any.whl.
File metadata
- Download URL: newline_tools-0.1.0-py3-none-any.whl
- Upload date:
- Size: 12.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
Algorithm | Hash digest
---|---
SHA256 | 104dc304d634336eb6f98e3d8f797c791fc384f8929a24ed1d2cf2e7d354c72a
MD5 | 65c53fc9db11b847386dedcf6f3066c9
BLAKE2b-256 | 19349412c3bb5d3fb9aab902029515865d2565d7bb9b7bb939798ae90ff64566