Data ingestion engine — Singer + Parquet + Bucket
Project description
DataSpoc Pipe
Singer taps to Parquet in cloud buckets. That simple.
Why DataSpoc Pipe?
Most data ingestion tools drown you in orchestration complexity. DataSpoc Pipe does one thing well: connect to any of the 400+ Singer taps (databases, APIs, SaaS), convert to Parquet, and land it in your cloud bucket — cataloged and ready to query. Handles tables from kilobytes to hundreds of GBs via streaming. No DAGs, no servers, no infrastructure.
400+ data sources · Streaming (no memory limits) · Zero infrastructure · < 15 min setup
Highlights
- Singer-compatible — use any of the 400+ existing Singer taps
- Parquet output — columnar, compressed (zstd), ready for analytics
- Multi-cloud — S3, GCS, Azure Blob, or local filesystem
- Auto-catalog — generates `manifest.json` so downstream tools discover your tables automatically
- Incremental ingestion — bookmark-based state tracking, only fetch new data
- Convention-based transforms — drop a Python file in `transforms/` to clean data during ingestion, per batch, no config needed
- Built-in taps — Google Sheets (public) works out of the box, no extra install
- CLI-first — one command to run a pipeline, cron to schedule it
- Stateless — all state lives in the bucket, not on your machine
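The convention-based transform hook can be sketched as follows. This is a hypothetical `transforms/my-pipeline.py`; the cleaning steps and column names are illustrative — only the `transform(df)` entry point is the documented convention:

```python
# transforms/my-pipeline.py — called once per ~10K-record batch during
# ingestion, before the batch is written to Parquet.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean one batch: drop exact duplicates, normalize column names."""
    df = df.drop_duplicates()
    df.columns = [c.strip().lower() for c in df.columns]
    return df
```

Because the hook runs per batch, it should stay stateless and row-local; cross-batch logic (joins, global deduplication) belongs downstream.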
Installation
pip install dataspoc-pipe[s3]
Other cloud providers
# Google Cloud Storage
pip install dataspoc-pipe[gcs]
# Azure Blob Storage
pip install dataspoc-pipe[azure]
# Local filesystem only (no extras needed)
pip install dataspoc-pipe
Quick start
# 1. Initialize config structure
dataspoc-pipe init
# 2. Create a pipeline (interactive wizard)
dataspoc-pipe add my-pipeline
# 3. Edit the generated source config if needed
# ~/.dataspoc-pipe/sources/my-pipeline.json
# 4. Run it
dataspoc-pipe run my-pipeline
# 5. Check results
dataspoc-pipe status
Your data is now at `<bucket>/raw/<source>/<table>/` as Parquet.
Config structure created by init:
~/.dataspoc-pipe/
config.yaml # Global defaults
sources/ # Source configs (1 JSON per source, generated by `add`)
pipelines/ # Pipeline definitions (1 YAML per pipeline)
transforms/ # Optional Python transforms (same name as pipeline)
How it works
┌─────────────┐     ┌──────────┐  stdout   ┌───────────────┐     ┌──────────────┐
│ Data Source │────>│  Singer  │──────────>│ DataSpoc Pipe │────>│ Cloud Bucket │
│ (DB, API, …)│     │   Tap    │           │ transform(df) │     │ (S3/GCS/Az)  │
└─────────────┘     └──────────┘           └───────┬───────┘     └──────────────┘
                                                   │
                                                   ▼
                                 manifest.json · state.json · logs/
- Singer tap extracts data from the source, emits JSON on stdout
- Pipe reads the stream, buffers in batches (~10K records)
- If `transforms/<pipeline>.py` exists → applies `transform(df)` per batch
- Converts to Parquet and uploads to bucket
- Updates the manifest catalog and saves execution logs
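The read-and-batch step above can be sketched in a few lines. This is not the actual DataSpoc Pipe implementation, just a minimal illustration of consuming a Singer tap's stdout: parse each JSON line, buffer `RECORD` messages, and yield a DataFrame per batch:

```python
# Minimal sketch: turn a stream of Singer JSON messages into
# ~fixed-size DataFrame batches ready for Parquet conversion.
import json
import pandas as pd

BATCH_SIZE = 10_000

def batches(lines, batch_size=BATCH_SIZE):
    buf = []
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":  # ignore SCHEMA/STATE messages here
            buf.append(msg["record"])
            if len(buf) >= batch_size:
                yield pd.DataFrame(buf)
                buf = []
    if buf:  # flush the final partial batch
        yield pd.DataFrame(buf)

# Usage sketch: pipe a tap into this process and write each batch:
#   import sys
#   for i, df in enumerate(batches(sys.stdin)):
#       df.to_parquet(f"part-{i:05d}.parquet", compression="zstd")
```

Streaming in bounded batches is what keeps memory flat regardless of table size.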
Built-in taps
| Tap | Source | Config template | Extra install |
|---|---|---|---|
| `parquet` | Parquet files (local or S3/GCS/Azure) | Built-in | None |
| `google-sheets-public` | Public Google Sheets | Built-in | None |
| `tap-postgres` | PostgreSQL | Yes | `pip install tap-postgres` |
| `tap-mysql` | MySQL | Yes | `pip install tap-mysql` |
| `tap-csv` | CSV files | Yes | `pip install tap-csv` |
| `tap-s3-csv` | CSV on S3 | Yes | `pip install tap-s3-csv` |
| `tap-github` | GitHub API | Yes | `pip install tap-github` |
| `tap-rest-api` | Any REST API | Yes | `pip install tap-rest-api` |
| `tap-mongodb` | MongoDB | Yes | `pip install tap-mongodb` |
| `tap-salesforce` | Salesforce | Yes | `pip install tap-salesforce` |
| `tap-google-sheets` | Google Sheets (OAuth) | Yes | `pip install tap-google-sheets` |
Any Singer-compatible tap works. Run `dataspoc-pipe add` to see available templates.
Access control
DataSpoc delegates all access control to your cloud provider's IAM. Best practices:
- One bucket per permission boundary — e.g., `s3://company-public`, `s3://company-finance`, `s3://company-hr`
- Pipe needs write access to the destination bucket; users need only read access
- Use IAM roles and policies — never store credentials in pipeline configs
- If credentials lack permission, the pipeline fails with "Access Denied"
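As an illustration of the write/read split above, a minimal AWS IAM policy for the identity running Pipe might look like this (the bucket name is a placeholder; readers would get an analogous policy with `s3:GetObject` instead of `s3:PutObject`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DataSpocPipeWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::company-finance",
        "arn:aws:s3:::company-finance/*"
      ]
    }
  ]
}
```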
Part of the DataSpoc Platform
| Project | Role |
|---|---|
| DataSpoc Pipe (this) | Ingestion: Singer taps to Parquet in cloud buckets |
| DataSpoc Lens | Virtual warehouse: SQL + Jupyter + AI over your data lake |
| DataSpoc ML | AutoML: train and deploy models from your lake |
The bucket is the contract. Pipe writes. Lens reads. ML consumes and produces.
Community
- Discord — Join the conversation for questions and support
- GitHub Issues — Report bugs or request features
- Contributing — PRs welcome! See CONTRIBUTING.md for guidelines
License
Apache 2.0 — free to use, modify, and distribute.
Download files
File details
Details for the file dataspoc_pipe-0.1.0.tar.gz.
File metadata
- Download URL: dataspoc_pipe-0.1.0.tar.gz
- Upload date:
- Size: 60.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `0767b9be42cbb03fad7dde9861e19ff26d1abec2f06ab754aa67d7eceee8b553` |
| MD5 | `d6610c093540f1578bb78b2a288fec47` |
| BLAKE2b-256 | `7e781a926553ffe3c2f7610b90dcb433358bf5725c7cf1416cb89a850f4b1477` |
|
File details
Details for the file dataspoc_pipe-0.1.0-py3-none-any.whl.
File metadata
- Download URL: dataspoc_pipe-0.1.0-py3-none-any.whl
- Upload date:
- Size: 28.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `89d78d411811f0def93837f1c2b9683460600cc0de5a5a7c7c05796e32297f14` |
| MD5 | `659285152a981842e38b051b32df5224` |
| BLAKE2b-256 | `ad10c8536a39d0ff5a4c21f499e7c4140abd2ead095b87fcf2788e01bcd7be5b` |