Preview and analyze data files in Google Cloud Storage, AWS S3, and Azure Blob Storage from your terminal

CloudCat

The Swiss Army knife for viewing cloud storage data from your terminal

CloudCat is a powerful command-line tool that lets you instantly preview and analyze data files stored in Google Cloud Storage (GCS), Amazon S3, and Azure Blob Storage — without downloading entire files. Think of it as cat, head, and less combined, but for cloud storage with built-in support for CSV, JSON, Parquet, Avro, ORC, and plain text formats.

Why CloudCat?

No Downloads Required — Stream and preview data directly from cloud storage
Format-Aware — Intelligently handles CSV, JSON, Parquet, Avro, ORC, and plain text files
Directory Smart — Automatically discovers data files in Spark/Hive/Kafka output directories
Beautiful Output — Colorized tables, pretty-printed JSON, and schema visualization
Developer Friendly — Simple CLI with sensible defaults and powerful options
Compression Support — Automatic decompression of gzip, zstd, lz4, snappy, and bz2 files
SQL-like Filtering — Filter rows with WHERE clauses (e.g., --where "status=active")

Installation

Homebrew (macOS Apple Silicon)

The easiest way to install on Apple Silicon Macs (M1/M2/M3/M4) — no Python required:

brew tap jonathansudhakar1/cloudcat https://github.com/jonathansudhakar1/cloudcat.git && brew install cloudcat

This installs a self-contained binary that includes Python and all dependencies.

Intel Mac users: Homebrew bottles are not available for Intel. Please use pip install 'cloudcat[all]' instead.

To upgrade:

brew update && brew upgrade cloudcat

Note: On first run, macOS may block the app. Go to System Settings > Privacy & Security and click "Allow", or run:
xattr -d com.apple.quarantine "$(which cloudcat)"

pip (Python)

# Full installation with all formats and compression
pip install 'cloudcat[all]'

# Standard installation (includes GCS, S3, and Azure support)
pip install cloudcat

# With Parquet file support
pip install 'cloudcat[parquet]'

# With Avro file support
pip install 'cloudcat[avro]'

# With ORC file support (uses pyarrow)
pip install 'cloudcat[orc]'

# With compression support (zstd, lz4, snappy)
pip install 'cloudcat[compression]'

Note: If using zsh (default on macOS), quotes around extras are required to prevent shell interpretation of brackets.

To upgrade:

pip install --upgrade 'cloudcat[all]'

Requirements

Homebrew: macOS (Apple Silicon only). Intel Mac users should use pip.
pip: Python 3.7+ (all platforms)
Cloud provider credentials configured (see Authentication)

Quick Start

# Preview a CSV file from GCS
cloudcat -p gcs://my-bucket/data.csv

# Preview a Parquet file from S3
cloudcat -p s3://my-bucket/analytics/events.parquet

# Preview JSON data from Azure with pretty formatting
cloudcat -p az://my-container/logs.json -o jsonp

# Read an Avro file exported from Kafka
cloudcat -p s3://my-bucket/kafka-export.avro

# Read an ORC file written by Hive
cloudcat -p gcs://my-bucket/hive-table.orc

# Read log files as plain text
cloudcat -p az://logs/app.log -i text

# Read from a Spark output directory
cloudcat -p s3://my-bucket/spark-output/ -i parquet

# Read compressed files (auto-detected)
cloudcat -p gcs://my-bucket/data.csv.gz

# Filter rows with WHERE clause
cloudcat -p s3://bucket/users.parquet --where "status=active"

# Skip first 100 rows (pagination)
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10

Features

Cloud Storage Support

Provider	URL Scheme	Status
Google Cloud Storage	`gcs://` or `gs://`	✅ Supported
Amazon S3	`s3://`	✅ Supported
Azure Blob Storage	`az://` or `azure://`	✅ Supported
Azure Data Lake Gen2	`abfss://`	✅ Supported
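
The scheme table above can be read as a simple lookup from URL scheme to provider. The following is a minimal sketch of that idea, not CloudCat's actual code: the names `SCHEMES`, `CloudPath`, and `parse_cloud_path` are hypothetical, and the handling of the `container@account` form for `abfss://` is an assumption based on the usual ABFS URI layout.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

# Scheme-to-provider mapping taken from the table above.
SCHEMES = {
    "gcs": "gcs",
    "gs": "gcs",
    "s3": "s3",
    "az": "azure",
    "azure": "azure",
    "abfss": "azure",
}


class CloudPath(NamedTuple):
    """Hypothetical container for the parts of a cloud storage URL."""
    provider: str   # "gcs", "s3", or "azure"
    container: str  # bucket or container name
    key: str        # object path inside the bucket or container


def parse_cloud_path(url: str) -> CloudPath:
    """Split a cloud URL into provider, bucket/container, and object key."""
    parts = urlsplit(url)
    provider = SCHEMES.get(parts.scheme)
    if provider is None:
        raise ValueError(f"Unsupported URL scheme: {parts.scheme!r}")
    # abfss URLs usually look like container@account.dfs.core.windows.net;
    # keep only the container part in that case (assumption for this sketch).
    container = parts.netloc.partition("@")[0]
    return CloudPath(provider, container, parts.path.lstrip("/"))
```

For example, `parse_cloud_path("gs://my-bucket/data.csv")` yields the provider `"gcs"`, the bucket `"my-bucket"`, and the key `"data.csv"`.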

File Format Support

Format	Auto-Detect	Use Case
CSV	✅	General data files
JSON	✅	API responses, configs
JSON Lines	✅	Log files, streaming data
Parquet	✅	Spark/analytics data
Avro	✅	Kafka, data pipelines
ORC	✅	Hive, Hadoop ecosystem
Text	✅	Log files, plain text
TSV	Via `--delimiter`	Tab-separated data
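
Auto-detection by extension could work roughly as sketched below. This is an illustration only: the extension list is an assumption drawn from the file names used in this README, the function name `infer_format` is hypothetical, and CloudCat's real detection logic may differ (for instance, it may also inspect file contents).

```python
import posixpath

# Compression suffixes listed in the Compression Support section.
COMPRESSION_SUFFIXES = {".gz", ".gzip", ".bz2", ".zst", ".zstd", ".lz4", ".snappy"}

# Assumed mapping from extension to an --input-format value.
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".avro": "avro",
    ".orc": "orc",
}


def infer_format(path: str) -> str:
    """Guess the input format from the file extension, ignoring a compression suffix."""
    name = posixpath.basename(path.rstrip("/"))
    root, ext = posixpath.splitext(name)
    if ext.lower() in COMPRESSION_SUFFIXES:
        # data.csv.gz -> look at ".csv" instead of ".gz"
        root, ext = posixpath.splitext(root)
    try:
        return FORMAT_BY_EXTENSION[ext.lower()]
    except KeyError:
        raise ValueError(f"Could not infer format from path: {path}") from None
```

A path with no recognizable extension, such as `gcs://bucket/data`, raises an error in this sketch, which is the situation where passing `-i` explicitly helps.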

Streaming Efficiency

CloudCat uses intelligent streaming to minimize data transfer and egress costs:

Format	Compression	Streams	Column Projection	Early Row Stop
Parquet	None/Internal	✅	✅ Range requests	✅
Parquet	External (.gz)	❌	❌	❌
ORC	None/Internal	❌	❌	❌
ORC	External (.gz)	❌	❌	❌
CSV	None	✅	❌	✅
CSV	gzip/zstd/lz4/bz2	✅	❌	✅
CSV	snappy	❌	❌	❌
JSON Lines	None/streamable	✅	❌	✅
JSON Array	Any	❌	❌	❌
Avro	Any	✅	✅ Record-level	✅
Text	Any streamable	✅	N/A	✅

Streams: Only reads data as needed, stops early when row limit is reached
Column Projection: For Parquet, only fetches required column chunks via HTTP range requests
Early Row Stop: Stops reading when --num-rows limit is reached
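
To illustrate what "early row stop" means for a streamable format such as uncompressed CSV, here is a small sketch using only the Python standard library. It is not CloudCat's implementation; `preview_rows` is a hypothetical helper, and treating `num_rows=0` as "all rows" mirrors the `--num-rows` description in the Command Reference.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List


def preview_rows(
    lines: Iterable[str], num_rows: int = 10, offset: int = 0
) -> Iterator[List[str]]:
    """Yield the header, then up to num_rows data rows after skipping offset rows.

    The input is consumed lazily, so lines past the requested window are
    never pulled from the underlying stream.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input
    yield header
    stop = None if num_rows == 0 else offset + num_rows
    yield from islice(reader, offset, stop)
```

Because `islice` stops pulling from the reader once the window is filled, a source that fetches data on demand (for example, a network stream) would transfer only the header plus `offset + num_rows` rows.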

Compression Support

Format	Extension	Built-in	Use Case
Gzip	`.gz`, `.gzip`	✅	Most common, universal
Bzip2	`.bz2`	✅	High compression ratio
Zstandard	`.zst`, `.zstd`	Optional	Fast, modern compression
LZ4	`.lz4`	Optional	Very fast decompression
Snappy	`.snappy`	Optional	Hadoop ecosystem

CloudCat automatically detects and decompresses files based on extension (e.g., data.csv.gz, logs.json.zst).
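
Extension-based detection can be pictured as follows. This sketch handles only the two codecs that ship with Python (gzip and bz2) and is an assumption about the general approach, not a copy of CloudCat's code; `detect_codec` and `open_decompressed` are hypothetical names.

```python
import bz2
import gzip
import io
from typing import BinaryIO, Optional

# Extensions from the table above, mapped to a codec name.
CODEC_BY_EXTENSION = {
    ".gz": "gzip",
    ".gzip": "gzip",
    ".bz2": "bz2",
    ".zst": "zstd",
    ".zstd": "zstd",
    ".lz4": "lz4",
    ".snappy": "snappy",
}


def detect_codec(path: str) -> Optional[str]:
    """Return the codec implied by the file extension, or None if uncompressed."""
    lower = path.lower()
    for ext, codec in CODEC_BY_EXTENSION.items():
        if lower.endswith(ext):
            return codec
    return None


def open_decompressed(path: str, raw: BinaryIO) -> BinaryIO:
    """Wrap a binary stream in a streaming decompressor chosen by extension."""
    codec = detect_codec(path)
    if codec is None:
        return raw
    if codec == "gzip":
        return gzip.GzipFile(fileobj=raw, mode="rb")
    if codec == "bz2":
        return bz2.BZ2File(raw, mode="rb")
    # zstd, lz4, and snappy need third-party packages and are left out here.
    raise RuntimeError(f"{codec} requires an optional package not covered by this sketch")
```

Both wrappers decompress incrementally as the caller reads, which is consistent with gzip and bz2 being listed as streamable for CSV.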

Output Formats

Format	Flag	Description
Table	`-o table`	Beautiful ASCII table with colored headers (default)
JSON	`-o json`	JSON Lines output (one JSON object per line)
Pretty JSON	`-o jsonp`	Syntax-highlighted, indented JSON
CSV	`-o csv`	Comma-separated values
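
The difference between the machine-readable output formats can be shown with a short sketch. The `render` function below is hypothetical and covers only `json`, `jsonp`, and `csv`; table rendering and the syntax highlighting that CloudCat applies to `jsonp` are omitted.

```python
import csv
import io
import json
from typing import Dict, List


def render(rows: List[Dict[str, object]], output_format: str) -> str:
    """Serialize rows in a way that mirrors the -o json, jsonp, and csv options."""
    if output_format == "json":
        # JSON Lines: one compact object per line, convenient for piping to jq.
        return "\n".join(json.dumps(row) for row in rows)
    if output_format == "jsonp":
        # Indented objects, easier to read for nested data.
        return "\n".join(json.dumps(row, indent=2) for row in rows)
    if output_format == "csv":
        if not rows:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"Unsupported output format: {output_format}")
```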

Key Capabilities

Schema Inspection — View column names and data types
Column Selection — Display only the columns you need
Row Limiting — Control how many rows to preview
Row Offset — Skip first N rows for pagination/sampling
WHERE Filtering — Filter rows with SQL-like conditions
Record Counting — Get total record counts (with Parquet metadata optimization)
Multi-File Reading — Combine data from multiple files in a directory
Custom Delimiters — Support for tab, pipe, semicolon, and other delimiters
Auto Decompression — Transparent handling of compressed files

Examples

Basic Usage

# Preview first 10 rows (default)
cloudcat -p gcs://bucket/data.csv

# Preview 50 rows
cloudcat -p s3://bucket/data.parquet -n 50

# Show only specific columns
cloudcat -p gcs://bucket/users.json -c id,name,email

# View schema only (no data)
cloudcat -p s3://bucket/events.parquet -s schema_only

Working with Different Formats

# CSV with custom delimiter (tab-separated)
cloudcat -p gcs://bucket/data.tsv -d "\t"

# Pipe-delimited file
cloudcat -p s3://bucket/export.txt -d "|"

# Semicolon-delimited (common in European data)
cloudcat -p gcs://bucket/report.csv -d ";"

# JSON array file
cloudcat -p s3://bucket/config.json

# JSON Lines file (auto-detected)
cloudcat -p gcs://bucket/events.jsonl
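
In the delimiter examples above, the shell passes `"\t"` to the program as two characters (a backslash and a `t`), so a tool has to translate it into a real tab before parsing. A minimal sketch of that step, with hypothetical helper names and no claim to match CloudCat's internals:

```python
import csv
from typing import Iterable, Iterator, List


def resolve_delimiter(value: str) -> str:
    r"""Turn a command-line delimiter into a single character.

    The two-character sequence \t is mapped to a tab; anything else must
    already be exactly one character.
    """
    if value == r"\t":
        return "\t"
    if len(value) != 1:
        raise ValueError(f"Delimiter must be a single character, got {value!r}")
    return value


def read_delimited(lines: Iterable[str], delimiter: str = ",") -> Iterator[List[str]]:
    """Parse delimited text using the resolved delimiter."""
    return csv.reader(lines, delimiter=resolve_delimiter(delimiter))
```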

Filtering and Pagination

# Filter rows with WHERE clause
cloudcat -p s3://bucket/users.parquet --where "status=active"
cloudcat -p gcs://bucket/events.json --where "age>30"
cloudcat -p s3://bucket/logs.csv --where "level=ERROR"

# String matching filters
cloudcat -p gcs://bucket/data.csv --where "name contains john"
cloudcat -p s3://bucket/emails.json --where "email endswith @gmail.com"
cloudcat -p az://logs/app.log --where "message startswith ERROR"

# Skip first N rows (pagination)
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10

# Combine offset with filters
cloudcat -p s3://bucket/users.parquet --where "active=true" --offset 50 -n 20

Compressed Files

# Gzip compressed (built-in)
cloudcat -p gcs://bucket/data.csv.gz
cloudcat -p s3://bucket/logs.json.gz

# Zstandard compressed (requires: pip install 'cloudcat[zstd]')
cloudcat -p gcs://bucket/events.parquet.zst

# LZ4 compressed (requires: pip install 'cloudcat[lz4]')
cloudcat -p s3://bucket/data.csv.lz4

# Bzip2 compressed (built-in)
cloudcat -p az://container/archive.json.bz2

Directory Operations

CloudCat intelligently handles directories containing multiple data files (common with Spark, Hive, and distributed processing outputs):

# Auto-detect and read first data file in directory
cloudcat -p gcs://bucket/spark-output/

# Read and combine multiple files (up to 25MB by default)
cloudcat -p s3://bucket/daily-logs/ -m all

# Read up to 100MB of data from multiple files
cloudcat -p gcs://bucket/events/ -m all --max-size-mb 100

# Force reading only the first file
cloudcat -p s3://bucket/output/ -m first

CloudCat automatically:

Skips empty files
Ignores metadata files (_SUCCESS, _metadata, .crc, etc.)
Prioritizes files matching the specified format
Reports which files were selected
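
The selection rules listed above can be sketched as a filter over a directory listing. Everything here is illustrative: `Blob`, `is_data_file`, and `select_files` are hypothetical names, the alphabetical ordering and the exact size-budget behavior are assumptions, and only the `first` and `all` modes are covered.

```python
import posixpath
from typing import Iterable, List, NamedTuple, Optional


class Blob(NamedTuple):
    """Hypothetical listing entry: object name and size in bytes."""
    name: str
    size: int


# Marker files named in the list above.
METADATA_NAMES = {"_SUCCESS", "_metadata"}


def is_data_file(blob: Blob) -> bool:
    """Reject empty objects, directory placeholders, and metadata files."""
    base = posixpath.basename(blob.name)
    if not base or blob.size == 0:
        return False
    return base not in METADATA_NAMES and not base.endswith(".crc")


def select_files(
    blobs: Iterable[Blob],
    input_format: Optional[str] = None,
    mode: str = "first",
    max_size_mb: int = 25,
) -> List[Blob]:
    """Pick the files to read from a directory listing."""
    candidates = sorted((b for b in blobs if is_data_file(b)), key=lambda b: b.name)
    if input_format:
        # Prefer files whose name mentions the requested format; otherwise keep all.
        suffix = f".{input_format.lower()}"
        preferred = [b for b in candidates if suffix in posixpath.basename(b.name).lower()]
        candidates = preferred or candidates
    if mode == "first":
        return candidates[:1]
    # mode == "all": add files until the size budget would be exceeded,
    # always keeping at least one file.
    budget = max_size_mb * 1024 * 1024
    selected: List[Blob] = []
    total = 0
    for blob in candidates:
        if selected and total + blob.size > budget:
            break
        selected.append(blob)
        total += blob.size
    return selected
```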

Output Format Examples

# Default table output (great for terminals)
cloudcat -p gcs://bucket/data.csv
# ┌────┬────────────┬─────────┐
# │ id │ name       │ value   │
# ├────┼────────────┼─────────┤
# │ 1  │ Alice      │ 100     │
# │ 2  │ Bob        │ 200     │
# └────┴────────────┴─────────┘

# Pretty JSON (great for nested data)
cloudcat -p s3://bucket/events.json -o jsonp
# {
#   "id": 1,
#   "name": "Alice",
#   "metadata": {
#     "created": "2024-01-15"
#   }
# }

# JSON Lines (great for piping to jq)
cloudcat -p gcs://bucket/data.parquet -o json | jq '.name'

# CSV (great for further processing)
cloudcat -p s3://bucket/data.json -o csv > output.csv

Data Pipeline Examples

# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

# Preview and filter with jq
cloudcat -p s3://bucket/events.json -o json | jq 'select(.status == "error")'

# Quick data validation
cloudcat -p gcs://bucket/import.csv -s schema_only

# Sample data from large dataset
cloudcat -p s3://bucket/big-table.parquet -n 100 -c user_id,event_type

# Export specific columns to CSV
cloudcat -p gcs://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv

Real-World Use Cases

Debugging Spark Jobs

# Check output of a Spark job
cloudcat -p gcs://data-lake/jobs/daily-etl/output/ -i parquet -n 20

# Verify schema matches expectations
cloudcat -p s3://analytics/spark-output/ -s schema_only

Log Analysis

# Preview recent logs
cloudcat -p gcs://logs/app/2024-01-15/ -m all -n 50

# Check error logs (combine with grep)
cloudcat -p s3://logs/errors/ -o json | grep "ERROR"

Data Validation

# Quick sanity check on data export
cloudcat -p gcs://exports/daily/users.csv -s show

# Verify record count
cloudcat -p s3://warehouse/transactions.parquet --count

Format Conversion

# Convert tab-separated to comma-separated (-n 0 exports all rows instead of the default 10)
cloudcat -p gcs://imports/data.tsv -d "\t" -o csv -n 0 > converted.csv

# Convert JSON to CSV for spreadsheet import
cloudcat -p s3://api-dumps/response.json -o csv -n 0 > data.csv

Command Reference

Usage: cloudcat [OPTIONS]

Options:
  -p, --path TEXT              Cloud storage path (required)
                               Format: gcs://bucket/path, s3://bucket/path,
                               or az://container/path

  -o, --output-format TEXT     Output format: table, json, jsonp, csv
                               [default: table]

  -i, --input-format TEXT      Input format: csv, json, parquet, avro, orc, text
                               [default: auto-detect from extension]

  -c, --columns TEXT           Comma-separated list of columns to display
                               [default: all columns]

  -n, --num-rows INTEGER       Number of rows to display (0 for all)
                               [default: 10]

  --offset INTEGER             Skip first N rows
                               [default: 0]

  -w, --where TEXT             Filter rows with SQL-like conditions
                               Examples: "status=active", "age>30",
                               "name contains john", "email endswith @gmail.com"

  -s, --schema TEXT            Schema display: show, dont_show, schema_only
                               [default: show]

  --count                      Show total record count (scans entire file)

  -m, --multi-file-mode TEXT   Directory handling: auto, first, all
                               [default: auto]

  --max-size-mb INTEGER        Max data size for multi-file mode in MB
                               [default: 25]

  -d, --delimiter TEXT         CSV delimiter (use \t for tab)
                               [default: comma]

  --profile TEXT               AWS profile name (for S3 access)

  --project TEXT               GCP project ID (for GCS access)

  --credentials TEXT           Path to GCP service account JSON file

  --account TEXT               Azure storage account name

  --help                       Show this message and exit

WHERE Clause Operators

Operator	Example	Description
`=`	`status=active`	Exact match
`!=`	`type!=deleted`	Not equal
`>`	`age>30`	Greater than
`<`	`price<100`	Less than
`>=`	`count>=10`	Greater than or equal
`<=`	`score<=50`	Less than or equal
`contains`	`name contains john`	Case-insensitive substring match
`startswith`	`email startswith admin`	String prefix match
`endswith`	`file endswith .csv`	String suffix match
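
One way to picture how such a clause is evaluated is the sketch below, which parses a single condition and turns it into a row predicate. It is an assumption-laden illustration rather than CloudCat's parser: `compile_where` is a hypothetical name, numeric comparison is attempted whenever both sides parse as numbers, and `startswith`/`endswith` are treated as case-sensitive because the table only marks `contains` as case-insensitive.

```python
import operator
import re
from typing import Any, Callable, Dict

SYMBOL_OPS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

WORD_OPS = {
    "contains": lambda text, needle: needle.lower() in text.lower(),
    "startswith": lambda text, prefix: text.startswith(prefix),
    "endswith": lambda text, suffix: text.endswith(suffix),
}

# Two-character symbols come first so ">=" is not read as ">" followed by "=".
CLAUSE = re.compile(
    r"^\s*(?P<column>\w+)"
    r"(?:\s*(?P<sym>!=|>=|<=|=|>|<)\s*|\s+(?P<word>contains|startswith|endswith)\s+)"
    r"(?P<value>.*?)\s*$"
)


def _coerce(value: Any) -> Any:
    """Use a float when the value looks numeric, otherwise a string."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return str(value)


def compile_where(clause: str) -> Callable[[Dict[str, Any]], bool]:
    """Compile one WHERE condition into a function that tests a row dict."""
    match = CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"Cannot parse WHERE clause: {clause!r}")
    column, value = match["column"], match["value"]

    if match["word"]:
        test = WORD_OPS[match["word"]]
        return lambda row: column in row and test(str(row[column]), value)

    compare = SYMBOL_OPS[match["sym"]]

    def predicate(row: Dict[str, Any]) -> bool:
        if column not in row:
            return False
        left, right = _coerce(row[column]), _coerce(value)
        if type(left) is not type(right):
            # Mixed numeric/text: fall back to plain string comparison.
            left, right = str(row[column]), value
        return compare(left, right)

    return predicate
```

Usage follows the pattern `keep = compile_where("age>30")` and then `[row for row in rows if keep(row)]`. Booleans, nulls, and compound conditions are deliberately out of scope for this sketch.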

Authentication

Google Cloud Storage

CloudCat uses Application Default Credentials (ADC). Set up authentication using one of these methods:

# Option 1: User credentials (for development)
gcloud auth application-default login

# Option 2: Service account via environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Option 3: Service account via CLI option
cloudcat -p gcs://bucket/data.csv --credentials /path/to/service-account.json

# Option 4: Specify GCP project
cloudcat -p gcs://bucket/data.csv --project my-gcp-project

Amazon S3

CloudCat uses the standard AWS credential chain:

# Option 1: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Option 2: AWS credentials file (~/.aws/credentials)
aws configure

# Option 3: AWS named profile
cloudcat -p s3://bucket/data.csv --profile production

# Option 4: IAM role (for EC2/ECS/Lambda)
# Automatically detected

Azure Blob Storage

CloudCat supports multiple authentication methods for Azure:

# Option 1: Connection string (simplest)
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

# Option 2: Account URL with DefaultAzureCredential (for Azure AD auth)
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
az login

# Option 3: Specify storage account via CLI option
cloudcat -p az://container/data.csv --account mystorageaccount

Path format: az://container-name/path/to/blob

Performance Tips

Counting is off by default — use --count only when you need the total record count
Prefer Parquet format when possible — record counts are instant from metadata
Use --num-rows to limit data transfer for large files
Use --columns to select only needed columns (especially effective with Parquet)
Use -m first when you only need a sample from directories with many files

Troubleshooting

Common Issues

"google-cloud-storage package is required"

pip install 'cloudcat[gcs]'

"boto3 package is required"

pip install 'cloudcat[s3]'

"pyarrow package is required"

pip install 'cloudcat[parquet]'

"azure-storage-blob package is required"

pip install 'cloudcat[azure]'

"fastavro package is required"

pip install 'cloudcat[avro]'

"pyarrow with ORC support is required"

pip install 'cloudcat[orc]'

"zstandard package is required for .zst files"

pip install 'cloudcat[zstd]'
# or for all compression formats:
pip install 'cloudcat[compression]'

"lz4 package is required for .lz4 files"

pip install 'cloudcat[lz4]'

"python-snappy package is required for .snappy files"

pip install 'cloudcat[snappy]'

Authentication errors

GCS: Run gcloud auth application-default login
S3: Run aws configure or check your credentials
Azure: Set AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCOUNT_URL and run az login

"Could not infer format from path"

# Specify the format explicitly
cloudcat -p gcs://bucket/data -i parquet

Contributing

Contributions are welcome! Here's how you can help:

Report bugs — Open an issue with reproduction steps
Suggest features — Open an issue describing the use case
Submit PRs — Fork, create a branch, and submit a pull request

Development Setup

# Clone the repository
git clone https://github.com/jonathansudhakar1/cloudcat.git
cd cloudcat

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode with all dependencies
pip install -e ".[all]"

# Run tests
pytest

Roadmap

Shipped (documented in the sections above):
Azure Blob Storage support
Avro format support
ORC format support
Plain text format support
SQL-like filtering (--where clause)
Compression support (gzip, zstd, lz4, snappy, bz2)
Row offset/pagination (--offset)

Not yet available in this release:
Interactive mode with pagination
Output to file with --output-file
Configuration file support

Related Projects

s3cmd — S3 command-line tool
gsutil — Google Cloud Storage CLI
aws-cli — AWS command-line interface
azcopy — Azure Storage data transfer tool
duckdb — In-process SQL OLAP database

License

MIT License — see LICENSE for details.

Release history

Version	Release date
0.3.8	Jan 29, 2026
0.3.7	Jan 27, 2026
0.3.6	Jan 26, 2026
0.3.5	Jan 26, 2026
0.3.4	Jan 26, 2026
0.3.3	Jan 26, 2026
0.3.2	Jan 26, 2026
0.3.1	Jan 26, 2026
0.3.0 (this version)	Jan 26, 2026
0.2.8	Jan 23, 2026
0.2.7	Jan 23, 2026
0.2.6	Jan 23, 2026
0.2.5	Jan 23, 2026
0.2.4	Jan 23, 2026
0.2.3	Jan 23, 2026
0.2.2	Jan 23, 2026
0.2.1	Jan 23, 2026
0.2.0	Jan 23, 2026
0.1.4	May 12, 2025
0.1.3	May 12, 2025
0.1.2	May 12, 2025
0.1.1	May 12, 2025

Download files

Source Distribution

cloudcat-0.3.0.tar.gz (51.4 kB)

Uploaded Jan 26, 2026 Source

Built Distribution

cloudcat-0.3.0-py3-none-any.whl (57.5 kB)

Uploaded Jan 26, 2026 Python 3

File details

Details for the file cloudcat-0.3.0.tar.gz.

File metadata

Download URL: cloudcat-0.3.0.tar.gz
Upload date: Jan 26, 2026
Size: 51.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cloudcat-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a29ba93f52f428a2870a7952927e4b8ac0f83ad3ad0bfaf61df4c2f29c234a34`
MD5	`533edae92a85efdc37e8ed22da4c0e52`
BLAKE2b-256	`b141e1c438bbb2f1b0d79242f104e38670386ebd41e258cc46421dd3b42f8b15`

File details

Details for the file cloudcat-0.3.0-py3-none-any.whl.

File metadata

Download URL: cloudcat-0.3.0-py3-none-any.whl
Upload date: Jan 26, 2026
Size: 57.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for cloudcat-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`eac0d1af1ae2207a8dee820f311504b97c74b54f9e47b82a6bc90af2f7e76676`
MD5	`f144f3a6f9b27975212ab75f2053d5c4`
BLAKE2b-256	`5854fc0ec78ba501e31c4f0eda121d71c0cc2a8b2cdcaf63635a98768cb923a5`
