A data file manifest system backed by AWS S3 for orchestrating big data ETL processes.

Welcome to s3manifesto Documentation

Efficient file metadata management and intelligent partitioning for large-scale data processing on AWS S3.

Why s3manifesto?

In big data and ETL pipelines, managing thousands or millions of files can become a critical bottleneck. s3manifesto addresses this by providing:

  • Metadata Organization: Consolidate file metadata (URI, size, record count, ETag) into easily manageable collections

  • Intelligent Partitioning: Automatically group files into balanced batches for optimal parallel processing

  • Divide-and-Conquer Optimization: Implement efficient distributed processing workflows with predictable resource utilization

Instead of dealing with individual file metadata scattered across your data lake, s3manifesto enables you to treat collections of files as single, manageable units with powerful partitioning capabilities.

Core Concepts

1. Manifest as Metadata Collection

A manifest holds metadata for a collection of data files. For each data file, it records:

  • S3 URI: File location identifier

  • ETag: Data integrity verification hash

  • Size: File size in bytes for resource planning

  • Record Count: Number of records for workload estimation

  • Additional attributes: Extensible metadata as needed
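To make the shape of one manifest entry concrete, here is a minimal sketch using a plain dataclass. `FileRecord` and its `extra` field are hypothetical names chosen for illustration; they are not the library's `DataFile` class, whose exact fields may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for one manifest entry (not the library's DataFile)."""

    uri: str                         # S3 URI: where the file lives
    etag: Optional[str] = None       # ETag: used to verify data integrity
    size: Optional[int] = None       # size in bytes
    n_record: Optional[int] = None   # number of records in the file
    extra: Dict[str, Any] = field(default_factory=dict)  # additional attributes


records = [
    FileRecord(uri="s3://bucket/file1.json", etag="abc123", size=1_000_000, n_record=1_000),
    FileRecord(
        uri="s3://bucket/file2.json",
        etag="def456",
        size=2_000_000,
        n_record=2_000,
        extra={"source": "daily-export"},  # illustrative extra attribute
    ),
]

# Aggregates like these are what make a collection usable as a single unit.
total_size = sum(r.size or 0 for r in records)
total_records = sum(r.n_record or 0 for r in records)
```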

2. Two-File Storage System

Each manifest consists of two files stored in S3:

  • Manifest Summary File (JSON): Aggregate statistics and references

  • Manifest Data File (Parquet): Detailed per-file metadata, one row per data file

This design enables quick access to summary information without loading detailed metadata, optimizing both storage and retrieval performance.
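The split can be pictured as follows: the summary is a small JSON document derived from the per-file rows, so a reader can get totals without opening the detailed file. This is only a sketch using the standard library. The key names (`manifest_data_uri`, `n_file`, `total_size`, `total_n_record`) are assumptions for illustration and may not match the library's actual summary schema, and writing the Parquet data file is omitted.

```python
import json

# Per-file rows that would go into the detailed (Parquet) data file.
file_records = [
    {"uri": "s3://bucket/file1.json", "etag": "abc123", "size": 1_000_000, "n_record": 1_000},
    {"uri": "s3://bucket/file2.json", "etag": "def456", "size": 2_000_000, "n_record": 2_000},
    {"uri": "s3://bucket/file3.json", "etag": "ghi789", "size": 3_000_000, "n_record": 3_000},
]


def build_summary(data_uri, records):
    """Build a small summary document (hypothetical schema) from per-file rows."""
    return {
        "manifest_data_uri": data_uri,  # reference to the detailed file
        "n_file": len(records),
        "total_size": sum(r["size"] for r in records),
        "total_n_record": sum(r["n_record"] for r in records),
    }


summary = build_summary("s3://bucket/manifest-data.parquet", file_records)
summary_json = json.dumps(summary, indent=2)  # what would be stored as the summary file
```

A job that only needs to know how much data there is can read the few hundred bytes of summary and skip the detailed file entirely.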

3. Intelligent File Partitioning

A manifest can partition a large file collection into balanced groups using the Best Fit Decreasing (BFD) bin-packing algorithm:

  • By Total Size: Group files into batches of a target total size (for example, ~100MB each) to keep memory use predictable

  • By Record Count: Group files into batches of a target record count (for example, ~10M records each) to keep processing time consistent

  • Optimal Distribution: Ensures balanced workloads across parallel workers

  • Divide-and-Conquer Ready: Perfect for distributed processing frameworks

Example: Transform 10,000 files into 50 balanced groups of ~200 files each, with each group totaling approximately your target size or record count.
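For readers unfamiliar with Best Fit Decreasing, the following is a generic, self-contained sketch of the algorithm, not the library's own implementation. The function name `best_fit_decreasing` and the `(name, weight)` tuples are illustrative assumptions; the weight can stand for either file size or record count.

```python
from typing import List, Tuple

Item = Tuple[str, int]  # (file name, weight), where weight is a size or record count


def best_fit_decreasing(items: List[Item], capacity: int) -> List[List[Item]]:
    """Pack items into groups whose total weight stays within capacity where possible."""
    groups: List[List[Item]] = []
    remaining: List[int] = []  # free capacity left in each group

    # "Decreasing": place the largest items first.
    for name, weight in sorted(items, key=lambda it: it[1], reverse=True):
        # "Best fit": pick the group with the least free space that still fits the item.
        best = None
        for i, free in enumerate(remaining):
            if weight <= free and (best is None or free < remaining[best]):
                best = i
        if best is None:
            # No existing group fits; open a new one (oversized items get their own group).
            groups.append([(name, weight)])
            remaining.append(capacity - weight)
        else:
            groups[best].append((name, weight))
            remaining[best] -= weight
    return groups


# Five files with sizes in MB, packed into groups of at most 100 MB.
files = [("f1", 20), ("f2", 60), ("f3", 30), ("f4", 50), ("f5", 40)]
packed = best_fit_decreasing(files, capacity=100)
group_totals = [sum(w for _, w in g) for g in packed]  # [100, 100]
```

Here the five files land in two groups of exactly 100 MB each (60 + 40 and 50 + 30 + 20), which is the kind of balance that keeps parallel workers evenly loaded.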

Quick Example

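import boto3

from s3manifesto import ManifestFile, DataFile

# S3 client used below; assumes AWS credentials are configured in your environment
s3_client = boto3.client("s3")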

# Create manifest from file metadata
data_files = [
    DataFile(uri="s3://bucket/file1.json", size=1000000, n_record=1000, etag="abc123"),
    DataFile(uri="s3://bucket/file2.json", size=2000000, n_record=2000, etag="def456"),
    DataFile(uri="s3://bucket/file3.json", size=3000000, n_record=3000, etag="ghi789")
]

manifest = ManifestFile.new(
    uri="s3://bucket/manifest-data.parquet",
    uri_summary="s3://bucket/manifest-summary.json",
    data_file_list=data_files
)

# Write to S3
manifest.write(s3_client)

# Read from S3
manifest = ManifestFile.read("s3://bucket/manifest-summary.json", s3_client)

# Partition files for parallel processing (separate names so neither result is overwritten)
groups_by_size = manifest.partition_files_by_size(target_size=100_000_000)  # 100MB groups
groups_by_n_record = manifest.partition_files_by_n_record(target_n_record=10_000_000)  # 10M record groups
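Once files are grouped, each group can be handed to a separate worker. The sketch below shows that divide-and-conquer step with the standard library only. It assumes each group can be reduced to a list of S3 URIs; the exact return type of the partition methods is not described here, and `process_group` is a hypothetical placeholder for your own ETL logic.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical groups: each one is a list of S3 URIs assigned to a single worker.
uri_groups = [
    ["s3://bucket/file1.json", "s3://bucket/file2.json"],
    ["s3://bucket/file3.json"],
]


def process_group(uris):
    """Placeholder worker: a real one would read and transform every file in the group."""
    return len(uris)  # here we only report how many files were handled


# One task per group; pool.map returns results in the same order as the input groups.
with ThreadPoolExecutor(max_workers=4) as pool:
    processed_counts = list(pool.map(process_group, uri_groups))
```

Because the groups are balanced by size or record count, each task should take a similar amount of time and memory, which makes worker sizing more predictable.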

Install

s3manifesto is released on PyPI, so all you need to do is:

$ pip install s3manifesto

To upgrade to the latest version:

$ pip install --upgrade s3manifesto


