
Data ingestion engine — Singer + Parquet + Bucket


DataSpoc Pipe

CI PyPI License Python 3.10+ Discord

Singer taps to Parquet in cloud buckets. That simple.

Docs · Tutorial · Discord

Why DataSpoc Pipe?

Most data ingestion tools drown you in orchestration complexity. DataSpoc Pipe does one thing well: connect to any of the 400+ Singer taps (databases, APIs, SaaS), convert to Parquet, and land it in your cloud bucket — cataloged and ready to query. Handles tables from kilobytes to hundreds of GBs via streaming. No DAGs, no servers, no infrastructure.

400+ data sources · Streaming (no memory limits) · Zero infrastructure · < 15 min setup

Highlights

  • Singer-compatible — use any of the 400+ existing Singer taps
  • Parquet output — columnar, compressed (zstd), ready for analytics
  • Multi-cloud — S3, GCS, Azure Blob, or local filesystem
  • Auto-catalog — generates manifest.json so downstream tools discover your tables automatically
  • Incremental ingestion — bookmark-based state tracking, only fetch new data
  • Convention-based transforms — drop a Python file in transforms/ to clean data during ingestion, per batch, no config needed
  • Built-in taps — Google Sheets (public) works out of the box, no extra install
  • CLI-first — one command to run a pipeline, cron to schedule it
  • Stateless — all state lives in the bucket, not on your machine
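The "CLI-first" and "Stateless" bullets together are the whole scheduling story: since state lives in the bucket, a plain crontab entry is enough. A hypothetical entry (pipeline name and log path are examples, not defaults):

```shell
# Run the pipeline daily at 02:00; bookmark state in the bucket keeps reruns incremental
0 2 * * * dataspoc-pipe run my-pipeline >> /var/log/dataspoc-pipe.log 2>&1
```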

Installation

pip install dataspoc-pipe[s3]

Other cloud providers:
# Google Cloud Storage
pip install dataspoc-pipe[gcs]

# Azure Blob Storage
pip install dataspoc-pipe[azure]

# Local filesystem only (no extras needed)
pip install dataspoc-pipe

Quick start

# 1. Initialize config structure
dataspoc-pipe init

# 2. Create a pipeline (interactive wizard)
dataspoc-pipe add my-pipeline

# 3. Edit the generated source config if needed
#    ~/.dataspoc-pipe/sources/my-pipeline.json

# 4. Run it
dataspoc-pipe run my-pipeline

# 5. Check results
dataspoc-pipe status

Your data is now at <bucket>/raw/<source>/<table>/ as Parquet.

Config structure created by init:

~/.dataspoc-pipe/
  config.yaml           # Global defaults
  sources/              # Source configs (1 JSON per source, generated by `add`)
  pipelines/            # Pipeline definitions (1 YAML per pipeline)
  transforms/           # Optional Python transforms (same name as pipeline)

How it works

                          stdout
┌─────────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────────┐
│ Data Source │───>│ Singer   │───>│ DataSpoc Pipe │───>│ Cloud Bucket │
│ (DB, API, …)│    │ Tap      │    │ transform(df) │    │ (S3/GCS/Az)  │
└─────────────┘    └──────────┘    └───────┬───────┘    └──────────────┘
                                           │
                                    manifest.json
                                     state.json
                                       logs/
  1. Singer tap extracts data from the source, emits JSON on stdout
  2. Pipe reads the stream, buffers in batches (~10K records)
  3. If transforms/<pipeline>.py exists → applies transform(df) per batch
  4. Converts to Parquet and uploads to bucket
  5. Updates the manifest catalog and saves execution logs
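The Singer half of that flow is just line-delimited JSON on stdout. A minimal sketch of step 2, assuming the standard Singer message types (SCHEMA/RECORD/STATE); `BATCH_SIZE`, `consume`, and the `flush` callback are illustrative names, not Pipe's internals:

```python
import json

BATCH_SIZE = 10_000  # the ~10K-record buffer described above

def consume(stream, flush):
    """Read Singer messages line by line, buffer RECORDs, flush in batches."""
    batch = []
    for line in stream:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            batch.append(msg["record"])
            if len(batch) >= BATCH_SIZE:
                flush(batch)   # in Pipe: transform + Parquet + upload
                batch = []
        elif msg["type"] == "STATE":
            pass               # a real pipeline persists this bookmark to state.json
    if batch:
        flush(batch)           # final partial batch

# Demo with an in-memory "tap" stream instead of a subprocess's stdout
lines = [json.dumps({"type": "RECORD", "record": {"id": i}}) for i in range(3)]
out = []
consume(lines, out.extend)
print(len(out))  # 3
```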

Built-in taps

Tap                   Source                                 Config template  Extra install
parquet               Parquet files (local or S3/GCS/Azure)  Built-in         None
google-sheets-public  Public Google Sheets                   Built-in         None
tap-postgres          PostgreSQL                             Yes              pip install tap-postgres
tap-mysql             MySQL                                  Yes              pip install tap-mysql
tap-csv               CSV files                              Yes              pip install tap-csv
tap-s3-csv            CSV on S3                              Yes              pip install tap-s3-csv
tap-github            GitHub API                             Yes              pip install tap-github
tap-rest-api          Any REST API                           Yes              pip install tap-rest-api
tap-mongodb           MongoDB                                Yes              pip install tap-mongodb
tap-salesforce        Salesforce                             Yes              pip install tap-salesforce
tap-google-sheets     Google Sheets (OAuth)                  Yes              pip install tap-google-sheets

Any Singer-compatible tap works. Run dataspoc-pipe add to see available templates.

Access control

DataSpoc delegates all access control to your cloud provider's IAM. Best practices:

  • One bucket per permission boundary — e.g., s3://company-public, s3://company-finance, s3://company-hr
  • Pipe needs write access to the destination bucket; users need only read access
  • Use IAM roles and policies — never store credentials in pipeline configs
  • If credentials lack permission, the pipeline fails with "Access Denied"
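For S3, the "write for Pipe, read for users" split from the bullets above might look like the following bucket policy; the bucket name and role ARNs are placeholders (Pipe also needs read access, since state.json lives in the bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PipeReadWrite",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/dataspoc-pipe"},
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
    },
    {
      "Sid": "UsersReadOnly",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/analysts"},
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
    }
  ]
}
```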

Part of the DataSpoc Platform

Project               Role
DataSpoc Pipe (this)  Ingestion: Singer taps to Parquet in cloud buckets
DataSpoc Lens         Virtual warehouse: SQL + Jupyter + AI over your data lake
DataSpoc ML           AutoML: train and deploy models from your lake

The bucket is the contract. Pipe writes. Lens reads. ML consumes and produces.

Community

License

Apache 2.0 — free to use, modify, and distribute.
