s3manifesto

A data file manifest system backed by AWS S3 for orchestrating big data ETL processes.

Welcome to s3manifesto Documentation

Efficient file metadata management and intelligent partitioning for large-scale data processing on AWS S3.

Why s3manifesto?

In big data and ETL pipelines, managing thousands or millions of files efficiently can become a critical bottleneck. s3manifesto addresses this by providing:

Metadata Organization: Consolidate file metadata (URI, size, record count, ETag) into easily manageable collections
Intelligent Partitioning: Automatically group files into balanced batches for optimal parallel processing
Divide-and-Conquer Optimization: Implement efficient distributed processing workflows with predictable resource utilization

Instead of dealing with individual file metadata scattered across your data lake, s3manifesto enables you to treat collections of files as single, manageable units with powerful partitioning capabilities.

Core Concepts

1. Manifest as Metadata Collection

A manifest represents metadata for a collection of data files, where each data file contains:

S3 URI: File location identifier
ETag: Data integrity verification hash
Size: File size in bytes for resource planning
Record Count: Number of records for workload estimation
Additional attributes: Extensible metadata as needed

2. Two-File Storage System

Each manifest consists of two files stored in S3:

Manifest Summary File (JSON): Aggregate statistics and references
Manifest Data File (Parquet): Detailed per-file metadata

This design enables quick access to summary information without loading detailed metadata, optimizing both storage and retrieval performance.
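The split between a small summary and a larger detail file can be illustrated with a minimal sketch. It does not use s3manifesto or S3: a dictionary stands in for the bucket, JSON lines stand in for Parquet so that only the standard library is needed, and the field names (`data_uri`, `n_file`, `total_size`, `total_n_record`) and the `.jsonl` URI are hypothetical, not the library's actual schema.

```python
import json

# In-memory stand-in for S3: maps a URI to the bytes stored there (assumption).
fake_s3 = {}

files = [
    {"uri": "s3://bucket/file1.json", "etag": "abc123", "size": 1_000_000, "n_record": 1000},
    {"uri": "s3://bucket/file2.json", "etag": "def456", "size": 2_000_000, "n_record": 2000},
    {"uri": "s3://bucket/file3.json", "etag": "ghi789", "size": 3_000_000, "n_record": 3000},
]

# Detail file: one row per data file. JSON lines are used here only as a
# stdlib stand-in for the Parquet format the description mentions.
detail_uri = "s3://bucket/manifest-data.jsonl"  # hypothetical URI
fake_s3[detail_uri] = "\n".join(json.dumps(f) for f in files).encode("utf-8")

# Summary file: aggregate statistics plus a reference to the detail file.
# These field names are hypothetical.
summary_uri = "s3://bucket/manifest-summary.json"
summary = {
    "data_uri": detail_uri,
    "n_file": len(files),
    "total_size": sum(f["size"] for f in files),
    "total_n_record": sum(f["n_record"] for f in files),
}
fake_s3[summary_uri] = json.dumps(summary).encode("utf-8")

# A planner can answer "how much data is there?" from the summary alone,
# without fetching or parsing the per-file rows.
loaded = json.loads(fake_s3[summary_uri])
print(loaded["n_file"], loaded["total_size"], loaded["total_n_record"])
```

Only a consumer that needs the individual file rows would follow `data_uri` and load the detail file.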

3. Intelligent File Partitioning

Manifest files can partition large collections into balanced groups using the Best Fit Decreasing (BFD) algorithm:

By Total Size: Group files into batches of ~100MB each for memory optimization
By Record Count: Group files into batches of ~10M records each for processing time consistency
Optimal Distribution: Ensures balanced workloads across parallel workers
Divide-and-Conquer Ready: Well suited to distributed processing frameworks

Example: Transform 10,000 files into 50 balanced groups of ~200 files each, with each group totaling approximately your target size or record count.
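To make the Best Fit Decreasing idea concrete, here is a minimal standalone sketch. It is a textbook version of BFD, not the library's implementation; the function name `best_fit_decreasing` and the `(name, weight)` tuple layout are assumptions made for illustration. The weight can be a file size or a record count.

```python
from typing import List, Tuple

Item = Tuple[str, int]  # (file name, weight), hypothetical layout


def best_fit_decreasing(items: List[Item], capacity: int) -> List[List[Item]]:
    """Group items so that each group's total weight stays within capacity."""
    groups: List[List[Item]] = []
    free: List[int] = []  # remaining capacity of each group

    # "Decreasing": place the heaviest items first.
    for name, weight in sorted(items, key=lambda item: item[1], reverse=True):
        # "Best fit": among groups that can hold the item, pick the one
        # with the least remaining capacity.
        best = None
        for i, room in enumerate(free):
            if weight <= room and (best is None or room < free[best]):
                best = i
        if best is None:
            # No existing group fits, so open a new one. An item larger
            # than the capacity ends up alone in its own group.
            groups.append([(name, weight)])
            free.append(capacity - weight)
        else:
            groups[best].append((name, weight))
            free[best] -= weight
    return groups


# Sizes in MB, target of 100 MB per group.
files = [("a", 60), ("b", 50), ("c", 40), ("d", 30), ("e", 20)]
groups = best_fit_decreasing(files, capacity=100)
print(groups)
# [[('a', 60), ('c', 40)], [('b', 50), ('d', 30), ('e', 20)]]
```

In this run both groups total exactly 100 MB. With real file sizes the groups only approximate the target, which is why the targets above are written as ~100MB and ~10M records.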

Quick Example

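import boto3
from s3manifesto import ManifestFile, DataFile

# S3 client used for the reads and writes below (assumes AWS credentials are configured)
s3_client = boto3.client("s3")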

# Create manifest from file metadata
data_files = [
    DataFile(uri="s3://bucket/file1.json", size=1000000, n_record=1000, etag="abc123"),
    DataFile(uri="s3://bucket/file2.json", size=2000000, n_record=2000, etag="def456"),
    DataFile(uri="s3://bucket/file3.json", size=3000000, n_record=3000, etag="ghi789")
]

manifest = ManifestFile.new(
    uri="s3://bucket/manifest-data.parquet",
    uri_summary="s3://bucket/manifest-summary.json",
    data_file_list=data_files
)

# Write to S3
manifest.write(s3_client)

# Read from S3
manifest = ManifestFile.read("s3://bucket/manifest-summary.json", s3_client)

# Partition files for parallel processing (two independent strategies)
groups_by_size = manifest.partition_files_by_size(target_size=100_000_000)  # 100MB groups
groups_by_n_record = manifest.partition_files_by_n_record(target_n_record=10_000_000)  # 10M record groups
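
Once files are grouped, each group can be handed to a separate worker. The sketch below shows that fan-out with the standard library only. The return type of the partition methods is not shown in this description, so the groups here are plain lists of `(uri, size)` pairs, and `process_group` is a hypothetical placeholder for real work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for partition output: each group is a list of
# (uri, size) pairs. The library's real return type may differ.
groups = [
    [("s3://bucket/file1.json", 1_000_000), ("s3://bucket/file2.json", 2_000_000)],
    [("s3://bucket/file3.json", 3_000_000)],
]


def process_group(group):
    """Placeholder worker: a real one would read and transform each file."""
    return sum(size for _uri, size in group)


# One task per group; pool.map returns results in the order of the input groups.
with ThreadPoolExecutor(max_workers=2) as pool:
    bytes_per_group = list(pool.map(process_group, groups))

print(bytes_per_group)  # [3000000, 3000000]
```

Because the groups are balanced, each worker receives a similar amount of data, which keeps run times per task predictable.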

Install

s3manifesto is released on PyPI, so you can install it with pip:

$ pip install s3manifesto

To upgrade to the latest version:

$ pip install --upgrade s3manifesto

Release history

1.0.0 (this version): Jun 23, 2025
0.4.1: Aug 10, 2024
0.3.1: Aug 10, 2024
0.2.1: Aug 10, 2024
0.1.1: Aug 9, 2024

Download files


Source Distribution

s3manifesto-1.0.0.tar.gz (15.3 kB)

Uploaded Jun 23, 2025 Source

Built Distribution

s3manifesto-1.0.0-py3-none-any.whl (16.4 kB)

Uploaded Jun 23, 2025 Python 3

File details

Details for the file s3manifesto-1.0.0.tar.gz.

File metadata

Download URL: s3manifesto-1.0.0.tar.gz
Upload date: Jun 23, 2025
Size: 15.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.10

File hashes

Hashes for s3manifesto-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`943cd8e990dcc1c4bea36085b427eeb532b917918e1f2fcdc988926f270f5bf5`
MD5	`6e4c1cc23848bde6c76dd21fc112d3fb`
BLAKE2b-256	`0da1216040a9928248dbab5fe4cde7c3a6b369b59427a1bbe047204dc31278af`


File details

Details for the file s3manifesto-1.0.0-py3-none-any.whl.

File metadata

Download URL: s3manifesto-1.0.0-py3-none-any.whl
Upload date: Jun 23, 2025
Size: 16.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.10.10

File hashes

Hashes for s3manifesto-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e9f9f9ddf754a0f37008e6db70ca2ba48d1b4936fc93066c0ffea174f78d2ce9`
MD5	`778e78afe10a3b2d0f0fff039cf178d7`
BLAKE2b-256	`8547c9f58b0debaf4e8e5d9db9394ebdf7d973b140a0c2b9378e57668c1557a3`

