Tools for working with large datasets
Project description
Newline Tools
File processing utilities for working with massive datasets.
Installation
pip install newline-tools
CLI Usage
The newline
command provides several subcommands:
newline <command> [options]
shuffle
Shuffle lines in a file:
newline shuffle <input_file> <output_file> [-b BUFFER_SIZE] [--progress] [--include_empty] [-r ROUNDS] [--seed SEED]
Options:
-b, --buffer_size
: Buffer size in bytes (default: 1GB)--progress
: Show progress bars during shuffling--include_empty
: Include empty lines during shuffling (default: ignore empty lines)-r, --rounds
: Number of shuffling rounds (default: 1)--seed
: Seed for random number generator (for reproducibility)
dedupe
Remove duplicate lines:
newline dedupe <input_file> <output_file> [--progress] [--error_ratio ERROR_RATIO]
Options:
--progress
: Show progress bar during deduplication--error_ratio
: Error ratio for the Bloom filter (default: 1e-5)
split
Split a file into parts:
newline split <input_file> <output_prefix> (-n NUM_PARTS | -s SIZE | -p PROPORTIONS) [--progress]
Options:
-n, --num_parts
: Number of parts to split into-s, --size
: Size of each part (e.g., '100MB', '1GB')-p, --proportions
: Split by proportions--progress
: Show progress bar during splitting
Examples:
newline split input.txt output_prefix -n 5
newline split input.txt output_prefix -s 100MB
newline split input.txt output_prefix -p 0.3 0.3 0.4
sample
Sample lines from a file:
newline sample <input_file> <output_file> (-n NUM_LINES | -p PERCENTAGE) [--progress] [--seed SEED]
Options:
-n, --num_lines
: Number of lines to sample-p, --percentage
: Percentage of lines to sample--progress
: Show progress bar during sampling--seed
: Seed for random number generator (for reproducibility)
Python Usage
from newline_tools import Shuffle, Dedupe, Split, Sample
# Shuffle
shuffler = Shuffle('input.txt', buffer_size=2**24, progress=True, ignore_empty=True, rounds=2, seed=42)
shuffler.shuffle('output.txt')
# Dedupe
deduper = Dedupe('input.txt', progress=True)
deduper.dedupe('output.txt', error_ratio=1e-5)
# Split
splitter = Split('input.txt', progress=True)
splitter.split_by_parts('output_prefix', 5)
# or
splitter.split_by_size('output_prefix', '100MB')
# or
splitter.split_by_proportion('output_prefix', [0.3, 0.3, 0.4])
# Sample
sampler = Sample('input.txt', 'output.txt', sample_size=10000, progress=True, seed=42)
sampler.sample(method='reservoir') # or 'index'
License
Dedicated to the public domain (CC0). Use as you wish.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
newline_tools-0.1.2.tar.gz
(10.9 kB
view details)
Built Distribution
File details
Details for the file newline_tools-0.1.2.tar.gz
.
File metadata
- Download URL: newline_tools-0.1.2.tar.gz
- Upload date:
- Size: 10.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f6904b245861fac1b58791301fb46f8a0a49b3e127d729f254d05168f38c492e |
|
MD5 | 4fe87fd87f187ec4abcbe9f7957696ba |
|
BLAKE2b-256 | a9ba287e03a048c7dfc090c33ceed0abd525a1c2377ba87ebcb36314873422da |
File details
Details for the file newline_tools-0.1.2-py3-none-any.whl
.
File metadata
- Download URL: newline_tools-0.1.2-py3-none-any.whl
- Upload date:
- Size: 12.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 56916a7e6c3ee5b7f5e4c8d95b63caebb5b1dd3605e357f63b0e697269ba0a8d |
|
MD5 | b728a891a70f65d2f93c2da34633b233 |
|
BLAKE2b-256 | 6f6beb9ca19891642433ddd3444883287d2419c63676aff10f595725d115700a |