SampleShard - Training sample storage format
SampleShard Python
Python implementation of the SampleShard format for storing training samples.
Installation
pip install sampleshard
# With optional compression support
pip install sampleshard[compression]
# With optional xxhash support (faster hashing)
pip install sampleshard[hash]
# All optional dependencies
pip install sampleshard[all]
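To see which optional features are usable in the current environment, a small check along these lines may help. It is only a sketch: the import names `zstandard`, `lz4`, and `xxhash` are assumptions about what the extras pull in, not something this README states.

```python
import importlib.util

# Assumed import names for the optional extras (not confirmed by the
# package documentation); adjust if the real dependencies differ.
OPTIONAL_MODULES = {
    "zstd compression": "zstandard",
    "lz4 compression": "lz4",
    "xxhash hashing": "xxhash",
}


def available_extras(modules=OPTIONAL_MODULES):
    """Return {feature: bool}, True when the module can be located."""
    return {
        feature: importlib.util.find_spec(name) is not None
        for feature, name in modules.items()
    }


if __name__ == "__main__":
    for feature, ok in available_extras().items():
        print(f"{feature}: {'available' if ok else 'missing'}")
```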
Quick Start
from sampleshard import SampleShardWriter, SampleShardReader
# Writing samples
with SampleShardWriter("train.smpl") as w:
    w.add_sample(1, {"input": [1, 2, 3], "label": 0})
    w.add_sample(2, {"input": [4, 5, 6], "label": 1})
    w.add_sample(3, {"input": [7, 8, 9], "label": 2})
# Reading samples
with SampleShardReader("train.smpl") as r:
    # Get sample count
    print(f"Total samples: {r.sample_count()}")
    # Random access by ID
    sample = r.get_sample(1)
    print(sample)  # {"input": [1, 2, 3], "label": 0}
    # Check if sample exists
    if r.has_sample(2):
        print("Sample 2 exists!")
    # Iterate all samples
    for sample_id, sample in r:
        print(f"Sample {sample_id}: {sample}")
    # Batch access
    batch = r.get_batch([1, 2, 3])
    range_batch = r.get_batch_by_range(0, 10)
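For training loops it can be convenient to pull samples in fixed-size groups. The helper below is a hypothetical sketch, not part of the package: `iter_minibatches` is an illustrative name, and it assumes only that the reader exposes `get_batch(list_of_ids)` as in the Quick Start. The return type of `get_batch` is not documented here, so the helper passes it through unchanged.

```python
from typing import Iterable, Iterator, List


def iter_minibatches(reader, sample_ids: Iterable[int], batch_size: int) -> Iterator:
    """Yield reader.get_batch(chunk) for consecutive chunks of sample_ids.

    `reader` is any object with a get_batch(list_of_ids) method.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    chunk: List[int] = []
    for sample_id in sample_ids:
        chunk.append(sample_id)
        if len(chunk) == batch_size:
            yield reader.get_batch(chunk)
            chunk = []  # start a fresh list so yielded chunks are not reused
    if chunk:  # final, possibly shorter, batch
        yield reader.get_batch(chunk)
```

With an open `SampleShardReader` named `r`, this would be used as `for batch in iter_minibatches(r, [1, 2, 3], batch_size=2): ...`.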
Features
- Fast random access by sample ID (O(1) lookup)
- Deterministic iteration order
- Metadata-safe: reserved entries (names starting with __) are excluded from sample counts
- Memory-mapped access for zero-copy reads
- Optional compression (zstd, lz4)
- CRC32C checksums for data integrity
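To make the checksum feature concrete, here is a minimal pure-Python CRC32C (Castagnoli) sketch. It illustrates the checksum algorithm itself, not how the package computes or stores it: which bytes of an entry are covered and where the checksum lives are not described here, and `verify` is a hypothetical helper name.

```python
def _build_crc32c_table():
    """Precompute the 256-entry table for the reflected Castagnoli polynomial."""
    poly = 0x82F63B78
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data; pass a previous result as crc to continue."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def verify(payload: bytes, expected: int) -> bool:
    """Hypothetical helper: True when payload matches the stored checksum."""
    return crc32c(payload) == expected


if __name__ == "__main__":
    # Standard CRC32C check value for the ASCII string "123456789".
    print(hex(crc32c(b"123456789")))  # 0xe3069283
```

A table-driven loop like this is slow for large payloads; a production reader would more likely rely on a native implementation.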
File Format
SampleShard uses the .smpl extension and the Shard v2 binary format:
- 64-byte header with magic bytes SHRD
- Role byte = 0x02 (Sample)
- 48-byte index entries with xxHash64 name hashes
- JSON-encoded sample data
- CRC32C checksums per entry
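As a quick sanity check before opening a file, the header facts above can be tested without the library. This is a sketch under a stated assumption: the magic bytes are taken to be the first four bytes of the file, which this description does not confirm. The role byte is not checked because its offset is not given here. `looks_like_shard` is a hypothetical name.

```python
HEADER_SIZE = 64   # from the format description
MAGIC = b"SHRD"    # from the format description


def looks_like_shard(path) -> bool:
    """Return True if the file is at least one header long and starts with MAGIC.

    Assumption: the magic bytes sit at offset 0 of the header.
    """
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    return len(header) == HEADER_SIZE and header[:len(MAGIC)] == MAGIC
```

This only rejects obviously wrong files; it does not validate the index or any checksums.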
Interoperability
SampleShard files created with Python can be read by:
- Go: agentscope/cowrie/ucodec.OpenSampleShard()
- TypeScript: @sampleshard/core
License
MIT
Download files
Source Distribution
sampleshard-0.1.0.tar.gz (17.6 kB)
Built Distribution
sampleshard-0.1.0-py3-none-any.whl (17.9 kB)
File details
Details for the file sampleshard-0.1.0.tar.gz.
File metadata
- Download URL: sampleshard-0.1.0.tar.gz
- Upload date:
- Size: 17.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8dd31835538f0a9352f58b6a45868ad6d61e8526e1d95f06564e5e05ff971847 |
| MD5 | 68a72acece398fbfd298e896feb38f17 |
| BLAKE2b-256 | 15dd81dbaa35830bac28b9493218cad69da69e67d78132cf5ef511b3d696cd71 |
File details
Details for the file sampleshard-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sampleshard-0.1.0-py3-none-any.whl
- Upload date:
- Size: 17.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f49ad989b0a9250740faa29c6da791bde43771d34c68f67682a39f48c222c7c0 |
| MD5 | 2ceab1fdab818233ee7da5b883868604 |
| BLAKE2b-256 | c7172ac871d45ae66964093261392d567827c631073a64dfdce89c30edaf45ae |
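A downloaded file can be checked against its published SHA256 digest with the standard library. The sketch below uses only `hashlib`; `sha256_of` is an illustrative helper name, and the local path passed to it is whatever location the file was saved to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of("sampleshard-0.1.0.tar.gz")` should equal the SHA256 value listed for that file; a mismatch indicates a corrupted or altered download.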