
Preview and analyze data files in Google Cloud Storage, AWS S3, and Azure Blob Storage from your terminal

CloudCat

The Swiss Army knife for viewing cloud storage data from your terminal


CloudCat is a powerful command-line tool that lets you instantly preview and analyze data files stored in Google Cloud Storage (GCS), Amazon S3, and Azure Blob Storage — without downloading entire files. Think of it as cat, head, and less combined, but for cloud storage with built-in support for CSV, JSON, Parquet, Avro, ORC, and plain text formats.

Why CloudCat?

  • No Downloads Required — Stream and preview data directly from cloud storage
  • Format-Aware — Intelligently handles CSV, JSON, Parquet, Avro, ORC, and plain text files
  • Directory Smart — Automatically discovers data files in Spark/Hive/Kafka output directories
  • Beautiful Output — Colorized tables, pretty-printed JSON, and schema visualization
  • Developer Friendly — Simple CLI with sensible defaults and powerful options
  • Compression Support — Automatic decompression of gzip, zstd, lz4, snappy, and bz2 files
  • SQL-like Filtering — Filter rows with WHERE clauses (e.g., --where "status=active")

Installation

Homebrew (macOS Apple Silicon)

The easiest way to install on Apple Silicon Macs (M1/M2/M3/M4) — no Python required:

brew install jonathansudhakar1/cloudcat/cloudcat

This installs a self-contained binary that includes Python and all dependencies.

Intel Mac users: Homebrew bottles are not available for Intel. Please use pip install 'cloudcat[all]' instead.

To upgrade:

brew upgrade cloudcat

Note: On first run, macOS may block the app. Go to System Settings > Privacy & Security and click "Allow", or run:

xattr -d com.apple.quarantine $(which cloudcat)

pip (Python)

# Full installation with all formats and compression
pip install 'cloudcat[all]'

# Standard installation (includes GCS, S3, and Azure support)
pip install cloudcat

# With Parquet file support
pip install 'cloudcat[parquet]'

# With Avro file support
pip install 'cloudcat[avro]'

# With ORC file support (uses pyarrow)
pip install 'cloudcat[orc]'

# With compression support (zstd, lz4, snappy)
pip install 'cloudcat[compression]'

Note: If using zsh (default on macOS), quotes around extras are required to prevent shell interpretation of brackets.

To upgrade:

pip install --upgrade 'cloudcat[all]'

Requirements

  • Homebrew: macOS (Apple Silicon only). Intel Mac users should use pip.
  • pip: Python 3.7+ (all platforms)
  • Cloud provider credentials configured (see Authentication)

Quick Start

# Preview a CSV file from GCS
cloudcat -p gcs://my-bucket/data.csv

# Preview a Parquet file from S3
cloudcat -p s3://my-bucket/analytics/events.parquet

# Preview JSON data from Azure with pretty formatting
cloudcat -p az://my-container/logs.json -o jsonp

# Read Avro files from Kafka
cloudcat -p s3://my-bucket/kafka-export.avro

# Read ORC files from Hive
cloudcat -p gcs://my-bucket/hive-table.orc

# Read log files as plain text
cloudcat -p az://logs/app.log -i text

# Read from a Spark output directory
cloudcat -p s3://my-bucket/spark-output/ -i parquet

# Read compressed files (auto-detected)
cloudcat -p gcs://my-bucket/data.csv.gz

# Filter rows with WHERE clause
cloudcat -p s3://bucket/users.parquet --where "status=active"

# Skip first 100 rows (pagination)
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10

Features

Cloud Storage Support

Provider              URL Scheme         Status
Google Cloud Storage  gcs:// or gs://    ✅ Supported
Amazon S3             s3://              ✅ Supported
Azure Blob Storage    az:// or azure://  ✅ Supported
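
To make the scheme table concrete, here is a minimal Python sketch of how a path could be split into provider, bucket (or container), and object key. The function name `parse_cloud_path` and the `PROVIDERS` mapping are illustrative assumptions, not CloudCat's actual internals.

```python
from urllib.parse import urlsplit

# Scheme-to-provider mapping copied from the table above.
PROVIDERS = {
    "gcs": "Google Cloud Storage",
    "gs": "Google Cloud Storage",
    "s3": "Amazon S3",
    "az": "Azure Blob Storage",
    "azure": "Azure Blob Storage",
}


def parse_cloud_path(path):
    """Split a cloud path into (provider, bucket_or_container, object_key).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(path)
    scheme = parts.scheme.lower()
    if scheme not in PROVIDERS:
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")
    # netloc holds the bucket (or Azure container); the rest is the object key.
    return PROVIDERS[scheme], parts.netloc, parts.path.lstrip("/")


print(parse_cloud_path("gcs://my-bucket/data.csv"))
```

A trailing slash in the key (for example `s3://my-bucket/spark-output/`) is preserved, which is one plausible way to tell a directory path from a single object.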

File Format Support

Format      Use Case
CSV         General data files
JSON        API responses, configs
JSON Lines  Log files, streaming data
Parquet     Spark/analytics data
Avro        Kafka, data pipelines
ORC         Hive, Hadoop ecosystem
Text        Log files, plain text
TSV         Tab-separated data (via --delimiter)

Compression Support

Format     Extension    Built-in  Use Case
Gzip       .gz, .gzip   Yes       Most common, universal
Bzip2      .bz2         Yes       High compression ratio
Zstandard  .zst, .zstd  Optional  Fast, modern compression
LZ4        .lz4         Optional  Very fast decompression
Snappy     .snappy      Optional  Hadoop ecosystem

CloudCat automatically detects and decompresses files based on extension (e.g., data.csv.gz, logs.json.zst).
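
As a rough illustration of extension-based detection, the sketch below maps a file's final suffix to a decompressor. It wires up only the two codecs available in the Python standard library (gzip and bz2); the names `split_compression` and `decompress` are hypothetical, and CloudCat's real implementation may differ.

```python
import bz2
import gzip

# Built-in codecs, keyed by the extensions listed above.
BUILTIN_CODECS = {
    ".gz": gzip.decompress,
    ".gzip": gzip.decompress,
    ".bz2": bz2.decompress,
}
# Extensions that need an optional third-party package (not handled here).
OPTIONAL_EXTENSIONS = (".zst", ".zstd", ".lz4", ".snappy")


def split_compression(path):
    """Return (path_without_suffix, compression_extension or None)."""
    lowered = path.lower()
    for ext in tuple(BUILTIN_CODECS) + OPTIONAL_EXTENSIONS:
        if lowered.endswith(ext):
            return path[: -len(ext)], ext
    return path, None


def decompress(path, raw):
    """Decompress raw bytes according to the path's extension."""
    _, ext = split_compression(path)
    if ext is None:
        return raw
    if ext in OPTIONAL_EXTENSIONS:
        raise RuntimeError(f"{ext} files need an optional compression package")
    return BUILTIN_CODECS[ext](raw)


payload = gzip.compress(b"id,name\n1,Alice\n")
print(decompress("data.csv.gz", payload))
```

Stripping the compression suffix first leaves `data.csv`, so the inner format can still be inferred from the remaining extension.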

Output Formats

Format       Flag      Description
Table        -o table  Beautiful ASCII table with colored headers (default)
JSON         -o json   Standard JSON Lines output
Pretty JSON  -o jsonp  Syntax-highlighted, indented JSON
CSV          -o csv    Comma-separated values

Key Capabilities

  • Schema Inspection — View column names and data types
  • Column Selection — Display only the columns you need
  • Row Limiting — Control how many rows to preview
  • Row Offset — Skip first N rows for pagination/sampling
  • WHERE Filtering — Filter rows with SQL-like conditions
  • Record Counting — Get total record counts (with Parquet metadata optimization)
  • Multi-File Reading — Combine data from multiple files in a directory
  • Custom Delimiters — Support for tab, pipe, semicolon, and other delimiters
  • Auto Decompression — Transparent handling of compressed files

Examples

Basic Usage

# Preview first 10 rows (default)
cloudcat -p gcs://bucket/data.csv

# Preview 50 rows
cloudcat -p s3://bucket/data.parquet -n 50

# Show only specific columns
cloudcat -p gcs://bucket/users.json -c id,name,email

# View schema only (no data)
cloudcat -p s3://bucket/events.parquet -s schema_only

Working with Different Formats

# CSV with custom delimiter (tab-separated)
cloudcat -p gcs://bucket/data.tsv -d "\t"

# Pipe-delimited file
cloudcat -p s3://bucket/export.txt -d "|"

# Semicolon-delimited (common in European data)
cloudcat -p gcs://bucket/report.csv -d ";"

# JSON array file
cloudcat -p s3://bucket/config.json

# JSON Lines file (auto-detected)
cloudcat -p gcs://bucket/events.jsonl

Filtering and Pagination

# Filter rows with WHERE clause
cloudcat -p s3://bucket/users.parquet --where "status=active"
cloudcat -p gcs://bucket/events.json --where "age>30"
cloudcat -p s3://bucket/logs.csv --where "level=ERROR"

# String matching filters
cloudcat -p gcs://bucket/data.csv --where "name contains john"
cloudcat -p s3://bucket/emails.json --where "email endswith @gmail.com"
cloudcat -p az://logs/app.log --where "message startswith ERROR"

# Skip first N rows (pagination)
cloudcat -p gcs://bucket/data.csv --offset 100 -n 10

# Combine offset with filters
cloudcat -p s3://bucket/users.parquet --where "active=true" --offset 50 -n 20
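
The interaction between `--where`, `--offset`, and `-n` can be sketched in a few lines of Python. This assumes the filter is applied before the offset, which the documentation does not state explicitly; the function `preview` is a hypothetical stand-in, not CloudCat's code.

```python
from itertools import islice


def preview(rows, where=None, offset=0, num_rows=10):
    """Filter rows, skip `offset` matches, then return up to `num_rows` rows.

    num_rows=0 means "no limit", mirroring `-n 0`.
    Assumption: filtering happens before the offset is applied.
    """
    matching = rows if where is None else (r for r in rows if where(r))
    stop = None if num_rows == 0 else offset + num_rows
    # islice consumes the input lazily, so rows past `stop` are never read.
    return list(islice(matching, offset, stop))


# Equivalent in spirit to: --offset 100 -n 10
print(preview(range(1000), offset=100, num_rows=10))
```

Because the slice is lazy, a small `-n` keeps the amount of data read proportional to `offset + num_rows` rather than to the full file.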

Compressed Files

# Gzip compressed (built-in)
cloudcat -p gcs://bucket/data.csv.gz
cloudcat -p s3://bucket/logs.json.gz

# Zstandard compressed (requires: pip install 'cloudcat[zstd]')
cloudcat -p gcs://bucket/events.parquet.zst

# LZ4 compressed (requires: pip install 'cloudcat[lz4]')
cloudcat -p s3://bucket/data.csv.lz4

# Bzip2 compressed (built-in)
cloudcat -p az://container/archive.json.bz2

Directory Operations

CloudCat intelligently handles directories containing multiple data files (common with Spark, Hive, and distributed processing outputs):

# Auto-detect and read first data file in directory
cloudcat -p gcs://bucket/spark-output/

# Read and combine multiple files (up to 25MB by default)
cloudcat -p s3://bucket/daily-logs/ -m all

# Read up to 100MB of data from multiple files
cloudcat -p gcs://bucket/events/ -m all --max-size-mb 100

# Force reading only the first file
cloudcat -p s3://bucket/output/ -m first

CloudCat automatically:

  • Skips empty files
  • Ignores metadata files (_SUCCESS, _metadata, .crc, etc.)
  • Prioritizes files matching the specified format
  • Reports which files were selected
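
The selection rules above could look roughly like the following Python sketch. It models only the `first` and `all` modes plus the size cap; the function name, the `(name, size)` listing shape, and the exact metadata test are assumptions made for illustration.

```python
def select_data_files(listing, input_format=None, mode="first", max_size_mb=25):
    """Pick data files from a directory listing of (name, size_in_bytes) pairs.

    Hypothetical sketch: skips empty and metadata files, puts files matching
    the requested format first, and caps the total size in "all" mode.
    """
    def is_metadata(name):
        base = name.rsplit("/", 1)[-1]
        # Covers _SUCCESS, _metadata, hidden files, and .crc checksums.
        return base.startswith(("_", ".")) or base.endswith(".crc")

    candidates = [(n, s) for n, s in listing if s > 0 and not is_metadata(n)]
    if input_format:
        # Stable sort: matching files move to the front, original order kept.
        candidates.sort(key=lambda item: not item[0].endswith("." + input_format))
    if mode == "first":
        return [name for name, _ in candidates[:1]]

    budget = max_size_mb * 1024 * 1024
    selected, used = [], 0
    for name, size in candidates:
        # Always take at least one file, then stop once the budget is exceeded.
        if selected and used + size > budget:
            break
        selected.append(name)
        used += size
    return selected


listing = [
    ("out/_SUCCESS", 0),
    ("out/part-00000.parquet", 1024),
    ("out/.part-00000.parquet.crc", 12),
    ("out/part-00001.parquet", 2048),
]
print(select_data_files(listing, "parquet", mode="all"))
```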

Output Format Examples

# Default table output (great for terminals)
cloudcat -p gcs://bucket/data.csv
# ┌────┬────────────┬─────────┐
# │ id │ name       │ value   │
# ├────┼────────────┼─────────┤
# │ 1  │ Alice      │ 100     │
# │ 2  │ Bob        │ 200     │
# └────┴────────────┴─────────┘

# Pretty JSON (great for nested data)
cloudcat -p s3://bucket/events.json -o jsonp
# {
#   "id": 1,
#   "name": "Alice",
#   "metadata": {
#     "created": "2024-01-15"
#   }
# }

# JSON Lines (great for piping to jq)
cloudcat -p gcs://bucket/data.parquet -o json | jq '.name'

# CSV (great for further processing)
cloudcat -p s3://bucket/data.json -o csv > output.csv

Data Pipeline Examples

# Convert Parquet to CSV
cloudcat -p gcs://bucket/data.parquet -o csv -n 0 > data.csv

# Preview and filter with jq
cloudcat -p s3://bucket/events.json -o json | jq 'select(.status == "error")'

# Quick data validation
cloudcat -p gcs://bucket/import.csv -s schema_only

# Sample data from large dataset
cloudcat -p s3://bucket/big-table.parquet -n 100 -c user_id,event_type

# Export specific columns to CSV
cloudcat -p gcs://bucket/users.parquet -c email,created_at -o csv -n 0 > emails.csv
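
When jq is not available, the JSON Lines produced by `-o json` can be consumed from a short Python script instead. The sketch below assumes standard output contains only one JSON object per line (for instance with `-s dont_show`); whether schema or count lines also appear on stdout is not confirmed here, and `read_json_lines` is a hypothetical helper.

```python
import json


def read_json_lines(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In a pipeline, `stream` would be sys.stdin, e.g.:
#   cloudcat -p s3://bucket/events.json -o json -s dont_show | python filter.py
sample = ['{"id": 1, "status": "ok"}\n', '{"id": 2, "status": "error"}\n']
errors = [row for row in read_json_lines(sample) if row.get("status") == "error"]
print(errors)
```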

Real-World Use Cases

Debugging Spark Jobs

# Check output of a Spark job
cloudcat -p gcs://data-lake/jobs/daily-etl/output/ -i parquet -n 20

# Verify schema matches expectations
cloudcat -p s3://analytics/spark-output/ -s schema_only

Log Analysis

# Preview recent logs
cloudcat -p gcs://logs/app/2024-01-15/ -m all -n 50

# Check error logs (combine with grep)
cloudcat -p s3://logs/errors/ -o json | grep "ERROR"

Data Validation

# Quick sanity check on data export
cloudcat -p gcs://exports/daily/users.csv -s show

# Verify record count (counting is on by default; do not pass --no-count)
cloudcat -p s3://warehouse/transactions.parquet

Format Conversion

# Convert tab-separated to comma-separated
cloudcat -p gcs://imports/data.tsv -d "\t" -o csv > converted.csv

# Convert JSON to CSV for spreadsheet import
cloudcat -p s3://api-dumps/response.json -o csv > data.csv

Command Reference

Usage: cloudcat [OPTIONS]

Options:
  -p, --path TEXT              Cloud storage path (required)
                               Format: gcs://bucket/path, s3://bucket/path,
                               or az://container/path

  -o, --output-format TEXT     Output format: table, json, jsonp, csv
                               [default: table]

  -i, --input-format TEXT      Input format: csv, json, parquet, avro, orc, text
                               [default: auto-detect from extension]

  -c, --columns TEXT           Comma-separated list of columns to display
                               [default: all columns]

  -n, --num-rows INTEGER       Number of rows to display (0 for all)
                               [default: 10]

  --offset INTEGER             Skip first N rows
                               [default: 0]

  -w, --where TEXT             Filter rows with SQL-like conditions
                               Examples: "status=active", "age>30",
                               "name contains john", "email endswith @gmail.com"

  -s, --schema TEXT            Schema display: show, dont_show, schema_only
                               [default: show]

  --no-count                   Disable automatic record counting

  -m, --multi-file-mode TEXT   Directory handling: auto, first, all
                               [default: auto]

  --max-size-mb INTEGER        Max data size for multi-file mode in MB
                               [default: 25]

  -d, --delimiter TEXT         CSV delimiter (use \t for tab)
                               [default: comma]

  --profile TEXT               AWS profile name (for S3 access)

  --project TEXT               GCP project ID (for GCS access)

  --credentials TEXT           Path to GCP service account JSON file

  --account TEXT               Azure storage account name

  --help                       Show this message and exit

WHERE Clause Operators

Operator    Example                 Description
=           status=active           Exact match
!=          type!=deleted           Not equal
>           age>30                  Greater than
<           price<100               Less than
>=          count>=10               Greater than or equal
<=          score<=50               Less than or equal
contains    name contains john      Case-insensitive substring match
startswith  email startswith admin  String prefix match
endswith    file endswith .csv      String suffix match
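
For readers who want to see how such conditions might be evaluated, here is a minimal Python sketch that turns one clause into a row predicate. It is not CloudCat's parser: the numeric-versus-string comparison rule and the case sensitivity of `startswith` and `endswith` are assumptions, and `compile_where` is a hypothetical name.

```python
import operator
import re

SYMBOLS = {
    "=": operator.eq, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
}
WORDS = {
    "contains": lambda value, target: target.lower() in value.lower(),
    "startswith": lambda value, target: value.startswith(target),
    "endswith": lambda value, target: value.endswith(target),
}
# Two-character symbols come first in the alternation so ">=" beats ">".
SYMBOL_RE = re.compile(r"^\s*(\w+)\s*(!=|>=|<=|=|>|<)\s*(.*?)\s*$")
WORD_RE = re.compile(r"^\s*(\w+)\s+(contains|startswith|endswith)\s+(.*?)\s*$")


def _coerce(value, target):
    """Compare as numbers when both sides parse as floats, else as strings."""
    try:
        return float(value), float(target)
    except (TypeError, ValueError):
        return str(value), target


def compile_where(clause):
    """Return a function row -> bool for a single condition."""
    match = WORD_RE.match(clause)
    if match:
        column, word, target = match.groups()
        return lambda row: WORDS[word](str(row[column]), target)
    match = SYMBOL_RE.match(clause)
    if not match:
        raise ValueError(f"Cannot parse condition: {clause!r}")
    column, symbol, target = match.groups()

    def test(row):
        left, right = _coerce(row[column], target)
        return SYMBOLS[symbol](left, right)

    return test


rows = [{"name": "John Smith", "age": 42}, {"name": "Ann", "age": "7"}]
is_over_30 = compile_where("age>30")
print([r["name"] for r in rows if is_over_30(r)])
```

Comparing numerically where possible avoids the string-ordering trap in which `"7" > "30"` would otherwise be true.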

Authentication

Google Cloud Storage

CloudCat uses Application Default Credentials (ADC). Set up authentication using one of these methods:

# Option 1: User credentials (for development)
gcloud auth application-default login

# Option 2: Service account via environment variable
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Option 3: Service account via CLI option
cloudcat -p gcs://bucket/data.csv --credentials /path/to/service-account.json

# Option 4: Specify GCP project
cloudcat -p gcs://bucket/data.csv --project my-gcp-project

Amazon S3

CloudCat uses the standard AWS credential chain:

# Option 1: Environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"

# Option 2: AWS credentials file (~/.aws/credentials)
aws configure

# Option 3: AWS named profile
cloudcat -p s3://bucket/data.csv --profile production

# Option 4: IAM role (for EC2/ECS/Lambda)
# Automatically detected

Azure Blob Storage

CloudCat supports multiple authentication methods for Azure:

# Option 1: Connection string (simplest)
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

# Option 2: Account URL with DefaultAzureCredential (for Azure AD auth)
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
az login

# Option 3: Specify storage account via CLI option
cloudcat -p az://container/data.csv --account mystorageaccount

Path format: az://container-name/path/to/blob

Performance Tips

  1. Use --no-count for large files when you don't need the total record count
  2. Prefer Parquet format when possible — record counts are instant from metadata
  3. Use --num-rows to limit data transfer for large files
  4. Use --columns to select only needed columns (especially effective with Parquet)
  5. Use -m first when you only need a sample from directories with many files

Troubleshooting

Common Issues

"google-cloud-storage package is required"

pip install 'cloudcat[gcs]'

"boto3 package is required"

pip install 'cloudcat[s3]'

"pyarrow package is required"

pip install 'cloudcat[parquet]'

"azure-storage-blob package is required"

pip install 'cloudcat[azure]'

"fastavro package is required"

pip install 'cloudcat[avro]'

"pyarrow with ORC support is required"

pip install 'cloudcat[orc]'

"zstandard package is required for .zst files"

pip install 'cloudcat[zstd]'
# or for all compression formats:
pip install 'cloudcat[compression]'

"lz4 package is required for .lz4 files"

pip install 'cloudcat[lz4]'

"python-snappy package is required for .snappy files"

pip install 'cloudcat[snappy]'

Authentication errors

  • GCS: Run gcloud auth application-default login
  • S3: Run aws configure or check your credentials
  • Azure: Set AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCOUNT_URL and run az login

"Could not infer format from path"

# Specify the format explicitly
cloudcat -p gcs://bucket/data -i parquet
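
This error arises when the path has no recognizable extension. A minimal Python sketch of extension-based inference follows; the extension map is an assumption limited to formats this page names, and `infer_format` is a hypothetical function rather than CloudCat's own.

```python
import posixpath

# Assumed mapping; only extensions that appear in this document are included.
FORMAT_BY_EXTENSION = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".avro": "avro",
    ".orc": "orc",
}
COMPRESSION_EXTENSIONS = (".gz", ".gzip", ".bz2", ".zst", ".zstd", ".lz4", ".snappy")


def infer_format(path, explicit=None):
    """Return the input format, preferring an explicit -i style override."""
    if explicit:
        return explicit
    name = posixpath.basename(path.rstrip("/")).lower()
    # Drop one compression suffix so data.csv.gz is treated as data.csv.
    for ext in COMPRESSION_EXTENSIONS:
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    _, ext = posixpath.splitext(name)
    if ext not in FORMAT_BY_EXTENSION:
        raise ValueError(f"Could not infer format from path: {path}")
    return FORMAT_BY_EXTENSION[ext]


print(infer_format("gcs://bucket/data.csv.gz"))
print(infer_format("gcs://bucket/data", explicit="parquet"))
```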

Contributing

Contributions are welcome! Here's how you can help:

  1. Report bugs — Open an issue with reproduction steps
  2. Suggest features — Open an issue describing the use case
  3. Submit PRs — Fork, create a branch, and submit a pull request

Development Setup

# Clone the repository
git clone https://github.com/jonathansudhakar1/cloudcat.git
cd cloudcat

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode with all dependencies
pip install -e ".[all]"

# Run tests
pytest

Roadmap

  • Azure Blob Storage support (done)
  • Avro format support (done)
  • ORC format support (done)
  • Plain text format support (done)
  • SQL-like filtering (--where clause) (done)
  • Compression support (gzip, zstd, lz4, snappy, bz2) (done)
  • Row offset/pagination (--offset) (done)
  • Interactive mode with pagination (planned)
  • Output to file with --output-file (planned)
  • Configuration file support (planned)

Related Projects

  • s3cmd — S3 command-line tool
  • gsutil — Google Cloud Storage CLI
  • aws-cli — AWS command-line interface
  • azcopy — Azure Storage data transfer tool
  • duckdb — In-process SQL OLAP database

License

MIT License — see LICENSE for details.

