A data file manifest system backed by AWS S3 for orchestrating big data ETL processes.
Welcome to s3manifesto Documentation
Efficient file metadata management and intelligent partitioning for large-scale data processing on AWS S3.
Why s3manifesto?
In big data and ETL pipelines, managing thousands or millions of files can become a critical bottleneck. s3manifesto addresses this by providing:
Metadata Organization: Consolidate file metadata (URI, size, record count, ETag) into easily manageable collections
Intelligent Partitioning: Automatically group files into balanced batches for optimal parallel processing
Divide-and-Conquer Optimization: Implement efficient distributed processing workflows with predictable resource utilization
Instead of dealing with individual file metadata scattered across your data lake, s3manifesto enables you to treat collections of files as single, manageable units with powerful partitioning capabilities.
Core Concepts
1. Manifest as Metadata Collection
A manifest holds metadata for a collection of data files. The entry for each data file records:
S3 URI: File location identifier
ETag: S3 entity tag, used to check that the object has not changed
Size: File size in bytes for resource planning
Record Count: Number of records for workload estimation
Additional attributes: Extensible metadata as needed
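The per-file entry described above can be pictured as a small record type. The following is a minimal sketch only: FileRecord and its extra field are hypothetical names chosen for illustration and are not the library's DataFile class, although the uri, etag, size, and n_record field names follow the Quick Example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file manifest entry (illustration, not the library's DataFile)."""

    uri: str        # S3 URI of the data file
    etag: str       # S3 entity tag of the object
    size: int       # file size in bytes
    n_record: int   # number of records in the file
    extra: Dict[str, Any] = field(default_factory=dict)  # extensible attributes


records = [
    FileRecord("s3://bucket/file1.json", "abc123", 1_000_000, 1_000),
    FileRecord("s3://bucket/file2.json", "def456", 2_000_000, 2_000, {"schema_version": 2}),
]

# Aggregates like these are what make a manifest useful for planning.
total_size = sum(r.size for r in records)
total_records = sum(r.n_record for r in records)
```

Keeping size and record count on every entry is what allows workloads to be estimated without opening the data files themselves.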
2. Two-File Storage System
Each manifest consists of two files stored in S3:
Manifest Summary File (JSON): Aggregate statistics and references
Manifest Data File (Parquet): Detailed per-file metadata
This design enables quick access to summary information without loading detailed metadata, optimizing both storage and retrieval performance.
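To illustrate the split between the two files, the sketch below derives a small summary document from the detailed rows. It is an assumption-laden illustration: the key names (manifest_data_uri, n_file, total_size, total_n_record) and the build_summary helper are hypothetical and may not match the JSON that s3manifesto actually writes. The detailed rows are kept as plain dictionaries here; in the real system they would live in the Parquet data file.

```python
import json

# Detailed per-file rows (these would be stored in the Parquet data file).
rows = [
    {"uri": "s3://bucket/file1.json", "etag": "abc123", "size": 1_000_000, "n_record": 1_000},
    {"uri": "s3://bucket/file2.json", "etag": "def456", "size": 2_000_000, "n_record": 2_000},
    {"uri": "s3://bucket/file3.json", "etag": "ghi789", "size": 3_000_000, "n_record": 3_000},
]


def build_summary(data_uri, rows):
    """Build a hypothetical summary: aggregate statistics plus a pointer to the data file."""
    return {
        "manifest_data_uri": data_uri,
        "n_file": len(rows),
        "total_size": sum(row["size"] for row in rows),
        "total_n_record": sum(row["n_record"] for row in rows),
    }


summary = build_summary("s3://bucket/manifest-data.parquet", rows)
summary_json = json.dumps(summary, indent=2)  # small enough to read in one request
```

A reader that only needs totals can fetch the few hundred bytes of JSON and skip the Parquet file entirely.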
3. Intelligent File Partitioning
Manifest files can partition large collections into balanced groups using the Best Fit Decreasing (BFD) algorithm:
By Total Size: Group files into batches of ~100MB each for memory optimization
By Record Count: Group files into batches of ~10M records each for processing time consistency
Optimal Distribution: Ensures balanced workloads across parallel workers
Divide-and-Conquer Ready: Perfect for distributed processing frameworks
Example: Transform 10,000 files into 50 balanced groups of ~200 files each, with each group totaling approximately your target size or record count.
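The grouping idea can be sketched with a textbook Best Fit Decreasing routine. This is not the library's implementation: best_fit_decreasing and its (name, weight) item shape are hypothetical, and s3manifesto may treat the target as approximate rather than as a hard capacity. The weight can stand for either file size or record count.

```python
from typing import List, Tuple

Item = Tuple[str, int]  # (file name, weight such as size or record count)


def best_fit_decreasing(items: List[Item], target: int) -> List[List[Item]]:
    """Pack items into groups of at most `target` weight (textbook BFD sketch)."""
    groups: List[List[Item]] = []
    remaining: List[int] = []  # free capacity left in each group

    # Largest items first, so small items can fill the leftover gaps.
    for name, weight in sorted(items, key=lambda it: it[1], reverse=True):
        best = None
        for i, free in enumerate(remaining):
            # "Best fit": the group that fits the item with the least room to spare.
            if weight <= free and (best is None or free < remaining[best]):
                best = i
        if best is None:
            # Nothing fits (or the item exceeds the target): open a new group.
            groups.append([(name, weight)])
            remaining.append(target - weight)
        else:
            groups[best].append((name, weight))
            remaining[best] -= weight
    return groups


# Weights in MB for readability, with a 100 MB target per group.
files = [("a", 40), ("b", 70), ("c", 10), ("d", 60), ("e", 30), ("f", 20)]
groups = best_fit_decreasing(files, target=100)
totals = [sum(weight for _, weight in group) for group in groups]
```

In this sketch an item larger than the target is placed in a group of its own instead of being rejected, which seems the safer default for a pipeline that must process every file.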
Quick Example
import boto3
from s3manifesto import ManifestFile, DataFile

# Assumption: s3_client is a standard boto3 S3 client
s3_client = boto3.client("s3")

# Create manifest from file metadata
data_files = [
    DataFile(uri="s3://bucket/file1.json", size=1000000, n_record=1000, etag="abc123"),
    DataFile(uri="s3://bucket/file2.json", size=2000000, n_record=2000, etag="def456"),
    DataFile(uri="s3://bucket/file3.json", size=3000000, n_record=3000, etag="ghi789"),
]
manifest = ManifestFile.new(
    uri="s3://bucket/manifest-data.parquet",
    uri_summary="s3://bucket/manifest-summary.json",
    data_file_list=data_files,
)

# Write both the data file and the summary file to S3
manifest.write(s3_client)

# Read the manifest back from S3 via its summary file
manifest = ManifestFile.read("s3://bucket/manifest-summary.json", s3_client)

# Partition files for parallel processing
groups = manifest.partition_files_by_size(target_size=100_000_000)  # 100MB groups
groups = manifest.partition_files_by_n_record(target_n_record=10_000_000)  # 10M record groups
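Once files are partitioned, each group can be handed to a separate worker. The sketch below shows one way to do that with the standard library. It makes assumptions: the groups are represented as plain lists of URIs, which may not match the structure the partition methods return, and process_group is a hypothetical placeholder for real ETL work.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for partition output: one list of S3 URIs per group (assumed shape).
groups = [
    ["s3://bucket/file1.json", "s3://bucket/file2.json"],
    ["s3://bucket/file3.json"],
]


def process_group(uris):
    """Hypothetical worker: a real one would download and transform each file."""
    return len(uris)


# One task per group; pool.map returns results in the same order as the groups.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_group, groups))
```

Because the groups are balanced by size or record count, each task should take roughly the same time, which keeps workers from sitting idle while one oversized batch finishes.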
Install
s3manifesto is released on PyPI, so installation takes a single command:
$ pip install s3manifesto
To upgrade to the latest version:
$ pip install --upgrade s3manifesto