
Data ingestion engine — Singer + Parquet + Bucket


DataSpoc Pipe

CI PyPI License Python 3.10+ Discord

Singer taps to Parquet in cloud buckets. That simple.

Docs · Tutorial · Discord

Why DataSpoc Pipe?

Most data ingestion tools drown you in orchestration complexity. DataSpoc Pipe does one thing well: connect to any of the 400+ Singer taps (databases, APIs, SaaS), convert to Parquet, and land it in your cloud bucket — cataloged and ready to query. Handles tables from kilobytes to hundreds of GBs via streaming. No DAGs, no servers, no infrastructure.

400+ data sources · Streaming (no memory limits) · Zero infrastructure · < 15 min setup

Highlights

  • Singer-compatible — use any of the 400+ existing Singer taps
  • Parquet output — columnar, compressed (zstd), ready for analytics
  • Multi-cloud — S3, GCS, Azure Blob, or local filesystem
  • Auto-catalog — generates manifest.json so downstream tools discover your tables automatically
  • Incremental ingestion — bookmark-based state tracking, only fetch new data
  • Convention-based transforms — drop a Python file in transforms/ to clean data during ingestion, per batch, no config needed
  • Built-in taps — Google Sheets (public) works out of the box, no extra install
  • CLI-first — one command to run a pipeline, cron to schedule it
  • Stateless — all state lives in the bucket, not on your machine
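The "CLI-first" and "Stateless" bullets together are the whole scheduling story: since state lives in the bucket, a plain crontab entry is enough. A hypothetical entry (pipeline name and log path are examples, not defaults):

```shell
# Run the pipeline daily at 02:00; bookmark state in the bucket keeps reruns incremental
0 2 * * * dataspoc-pipe run my-pipeline >> /var/log/dataspoc-pipe.log 2>&1
```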

Installation

pip install dataspoc-pipe[s3]

Other cloud providers:
# Google Cloud Storage
pip install dataspoc-pipe[gcs]

# Azure Blob Storage
pip install dataspoc-pipe[azure]

# Local filesystem only (no extras needed)
pip install dataspoc-pipe

Quick start

# 1. Initialize config structure
dataspoc-pipe init

# 2. Create a pipeline (interactive wizard)
dataspoc-pipe add my-pipeline

# 3. Edit the generated source config if needed
#    ~/.dataspoc-pipe/sources/my-pipeline.json

# 4. Run it
dataspoc-pipe run my-pipeline

# 5. Check results
dataspoc-pipe status

Your data is now at <bucket>/raw/<source>/<table>/ as Parquet.

Config structure created by init:

~/.dataspoc-pipe/
  config.yaml           # Global defaults
  sources/              # Source configs (1 JSON per source, generated by `add`)
  pipelines/            # Pipeline definitions (1 YAML per pipeline)
  transforms/           # Optional Python transforms (same name as pipeline)

How it works

                          stdout
┌─────────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────────┐
│ Data Source │───>│ Singer   │───>│ DataSpoc Pipe │───>│ Cloud Bucket │
│ (DB, API, …)│    │ Tap      │    │ transform(df) │    │ (S3/GCS/Az)  │
└─────────────┘    └──────────┘    └───────┬───────┘    └──────────────┘
                                           │
                                    manifest.json
                                     state.json
                                       logs/
  1. Singer tap extracts data from the source, emits JSON on stdout
  2. Pipe reads the stream, buffers in batches (~10K records)
  3. If transforms/<pipeline>.py exists → applies transform(df) per batch
  4. Converts to Parquet and uploads to bucket
  5. Updates the manifest catalog and saves execution logs
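The Singer half of that flow is just line-delimited JSON on stdout. A minimal sketch of step 2, assuming the standard Singer message types (SCHEMA/RECORD/STATE); `BATCH_SIZE`, `consume`, and the `flush` callback are illustrative names, not Pipe's internals:

```python
import json

BATCH_SIZE = 10_000  # the ~10K-record buffer described above

def consume(stream, flush):
    """Read Singer messages line by line, buffer RECORDs, flush in batches."""
    batch = []
    for line in stream:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            batch.append(msg["record"])
            if len(batch) >= BATCH_SIZE:
                flush(batch)   # in Pipe: transform + Parquet + upload
                batch = []
        elif msg["type"] == "STATE":
            pass               # a real pipeline persists this bookmark to state.json
    if batch:
        flush(batch)           # final partial batch

# Demo with an in-memory "tap" stream instead of a subprocess's stdout
lines = [json.dumps({"type": "RECORD", "record": {"id": i}}) for i in range(3)]
out = []
consume(lines, out.extend)
print(len(out))  # 3
```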

Built-in taps

Tap                   Source                                 Config template  Extra install
parquet               Parquet files (local or S3/GCS/Azure)  Built-in         None
google-sheets-public  Public Google Sheets                   Built-in         None
tap-postgres          PostgreSQL                             Yes              pip install tap-postgres
tap-mysql             MySQL                                  Yes              pip install tap-mysql
tap-csv               CSV files                              Yes              pip install tap-csv
tap-s3-csv            CSV on S3                              Yes              pip install tap-s3-csv
tap-github            GitHub API                             Yes              pip install tap-github
tap-rest-api          Any REST API                           Yes              pip install tap-rest-api
tap-mongodb           MongoDB                                Yes              pip install tap-mongodb
tap-salesforce        Salesforce                             Yes              pip install tap-salesforce
tap-google-sheets     Google Sheets (OAuth)                  Yes              pip install tap-google-sheets

Any Singer-compatible tap works. Run dataspoc-pipe add to see available templates.

Access control

DataSpoc delegates all access control to your cloud provider's IAM. Best practices:

  • One bucket per permission boundary — e.g., s3://company-public, s3://company-finance, s3://company-hr
  • Pipe needs write access to the destination bucket; users need only read access
  • Use IAM roles and policies — never store credentials in pipeline configs
  • If credentials lack permission, the pipeline fails with "Access Denied"
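For S3, the "write for Pipe, read for users" split from the bullets above might look like the following bucket policy; the bucket name and role ARNs are placeholders (Pipe also needs read access, since state.json lives in the bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PipeReadWrite",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/dataspoc-pipe"},
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
    },
    {
      "Sid": "UsersReadOnly",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:role/analysts"},
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::company-finance", "arn:aws:s3:::company-finance/*"]
    }
  ]
}
```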

Part of the DataSpoc Platform

Project               Role
DataSpoc Pipe (this)  Ingestion: Singer taps to Parquet in cloud buckets
DataSpoc Lens         Virtual warehouse: SQL + Jupyter + AI over your data lake
DataSpoc ML           AutoML: train and deploy models from your lake

The bucket is the contract. Pipe writes. Lens reads. ML consumes and produces.

Community

License

Apache 2.0 — free to use, modify, and distribute.
