
Client for interacting with KL3M data stored in S3


KL3M Data Client

A lightweight client for interacting with the KL3M data pipeline and S3 storage architecture.

Features

  • Access and manage KL3M datasets stored in S3
  • List datasets and check their processing status
  • Retrieve and parse document content from all pipeline stages
  • Export datasets to JSONL format with filtering options
  • Both programmatic and command-line interfaces
  • Minimal dependencies (only boto3 and rich)
  • Streaming support for efficient handling of large datasets

Installation

pip install kl3m-data-client

For development, from a local clone of the repository:

pip install -e ".[dev]"

Or install with pipx to make the CLI available globally:

pipx install kl3m-data-client

Developer API Usage

Basic Usage

from kl3m_data_client import KL3MClient
from kl3m_data_client.models.common import Stage

# Initialize client
client = KL3MClient()

# List available datasets
datasets = client.list_datasets()

# Get status of a specific dataset
status = client.get_dataset_status("usc")
print(f"Dataset: {status.dataset_id}")
print(f"Document count: {status.document_count}")

# Streaming document IDs with a limit (efficient for large datasets)
for doc_id in client.iter_documents("usc", Stage.DOCUMENTS, limit=5):
    # Process each document as it's retrieved
    document = client.get_document("usc", doc_id, Stage.DOCUMENTS)
    content = document.get_content(Stage.DOCUMENTS)
    print(f"Document: {doc_id}, Title: {content.metadata.title}")
    
    # No need to load the entire dataset into memory!

# Export to JSONL using streaming
client.export_to_jsonl(
    dataset_id="usc",
    output_path="usc_export.jsonl",
    source_stage=Stage.PARQUET,
    max_documents=1000,
    deduplicate=True,
)

# Process documents during export with streaming iterator
for document in client.iter_jsonl_export(
    dataset_id="usc",
    source_stage=Stage.PARQUET,
    max_documents=100,
    deduplicate=True,
):
    # Process each document as it's streamed
    print(f"Document: {document['id']}")
    print(f"Token count: {len(document['tokens'])}")
    
    # You can perform custom processing on each document here
    # without loading the entire dataset into memory

See the examples directory for more detailed usage examples.
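
An exported file can be read back one line at a time, so downstream processing stays memory-efficient as well. The following is a minimal sketch, assuming each line of the export is a JSON object with `id` and `tokens` keys, as in the iterator example above. The helpers `iter_jsonl` and `summarize_jsonl` are hypothetical and not part of kl3m-data-client; they use only the standard library.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize_jsonl(path: str) -> dict:
    """Count records and total tokens without loading the whole file."""
    records = 0
    total_tokens = 0
    for record in iter_jsonl(path):
        records += 1
        # Assumption: "tokens" is a list, as in the iterator example above.
        total_tokens += len(record.get("tokens", []))
    return {"records": records, "total_tokens": total_tokens}
```

For example, `summarize_jsonl("usc_export.jsonl")` would return a dictionary with the record count and the summed token count for the export written earlier.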

Working with Documents and Representations

The library provides dedicated classes for working with document data at each pipeline stage (documents, representations, and Parquet):

from kl3m_data_client import KL3MClient
from kl3m_data_client.models.common import Stage

# Initialize client
client = KL3MClient()

# Get a document
document = client.get_document("cap", "1000")

# === Document Stage ===
# Get basic document content
doc_content = document.get_content(Stage.DOCUMENTS)
print(f"Title: {doc_content.metadata.title}")
print(f"Content preview: {doc_content.content[:100]}...")

# === Representation Stage ===
# Get representation using the convenient helper method
representation = document.get_representation()

# Access representation data with a clean API
available_mime_types = representation.get_available_mime_types()
print(f"Available MIME types: {available_mime_types}")

# Get content for a specific MIME type
markdown_content = representation.get_content("text/markdown")

# Get available tokenizers for a representation
tokenizers = representation.get_available_tokenizers("text/markdown")
print(f"Available tokenizers: {tokenizers}")

# Get tokens and token count
tokens = representation.get_tokens("text/markdown", "cl100k_base")
token_count = representation.token_count("text/markdown", "cl100k_base")
print(f"Token count: {token_count}")

# Get a summary of all representations
summary = representation.summarize()
print(summary)

# === Parquet Stage ===
# Get parquet data with PyArrow support
parquet = document.get_parquet()
print(f"Parquet size: {parquet.size} bytes")

# Get PyArrow table
table = parquet.get_table()
print(f"Table columns: {parquet.get_columns()}")
print(f"Schema: {parquet.get_schema()}")

# Access document representations from parquet
representations = parquet.get_representations()
for rep_type, tokens in representations.items():
    print(f"{rep_type}: {len(tokens)} tokens")

# Save parquet data to a file
parquet.save_to_file("document.parquet")

Low-level S3 Utilities

The library also exposes low-level S3 utilities for advanced usage:

from kl3m_data_client.models.common import Stage
from kl3m_data_client.utils.s3 import (
    get_s3_client,
    list_dataset_ids,
    get_stage_prefix,
    iter_prefix,
    check_object_exists,
    get_object_bytes,
    decompress_content,
)

# Initialize S3 client directly
s3_client = get_s3_client()

# List datasets directly using S3 utilities
datasets = list_dataset_ids(s3_client, "data.kl3m.ai", Stage.DOCUMENTS)

# Iterate through S3 keys (streaming)
for key in iter_prefix(s3_client, "data.kl3m.ai", "documents/usc/"):
    # Process each key as it comes in
    pass
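
For readers unfamiliar with how streaming key listing typically works on S3, here is a minimal sketch built on boto3's `list_objects_v2` paginator. It is not the implementation of `iter_prefix`, which may differ; `iter_keys` is a hypothetical name, and the function accepts any client object exposing `get_paginator`, so it can be exercised without network access.

```python
from typing import Any, Iterator, Optional


def iter_keys(
    s3_client: Any,
    bucket: str,
    prefix: str,
    limit: Optional[int] = None,
) -> Iterator[str]:
    """Lazily yield object keys under a prefix, one result page at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    yielded = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if limit is not None and yielded >= limit:
                return
            yield obj["Key"]
            yielded += 1
```

With a real client this could be called as `iter_keys(boto3.client("s3"), "data.kl3m.ai", "documents/usc/", limit=10)`. Because the function is a generator, pages are fetched only as the caller consumes keys.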

Command Line Interface

# List all datasets
kl3m-client list

# Show status of a dataset
kl3m-client status usc

# List documents in a dataset
kl3m-client documents usc --stage documents --count

# Inspect a specific document
kl3m-client inspect usc <document_id> --stage documents

# Export a dataset to JSONL
kl3m-client export-jsonl usc --output usc_export.jsonl

For detailed CLI documentation:

kl3m-client --help
kl3m-client <command> --help

AWS Authentication

The client uses boto3 for S3 access, so credentials are resolved through the standard boto3 mechanisms, including:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  2. Shared credential file (~/.aws/credentials)
  3. AWS config file (~/.aws/config)
  4. IAM role for Amazon EC2

You can also explicitly provide credentials when initializing the client:

from kl3m_data_client import KL3MClient

client = KL3MClient(
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    region="us-east-1",
)
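
When authentication fails, it can help to check which of the sources listed above are present on the machine. The following is a rough diagnostic sketch, not part of kl3m-data-client: `detect_credential_sources` is a hypothetical helper that inspects only the two environment variables and the default file locations named above. It does not detect IAM roles or custom file locations, and it does not validate the credentials themselves.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def detect_credential_sources(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Dict[str, bool]:
    """Report which common AWS credential sources appear to be present."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    return {
        # Both variables must be set and non-empty to count.
        "environment": bool(
            env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY")
        ),
        # Default locations only; overrides are not considered here.
        "shared_credentials_file": (home / ".aws" / "credentials").is_file(),
        "config_file": (home / ".aws" / "config").is_file(),
    }
```

Calling `detect_credential_sources()` with no arguments checks the current process environment and home directory. The `env` and `home` parameters exist so the check can be run against other values, for example in tests.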

License

MIT
