Data ingestion engine -- Singer + Parquet + Bucket
DataSpoc Pipe
Singer taps to Parquet in cloud buckets. That simple.
Why DataSpoc Pipe?
Most data ingestion tools drown you in orchestration complexity. DataSpoc Pipe does one thing well: connect to any of the 400+ Singer taps (databases, APIs, SaaS), convert to Parquet, and land it in your cloud bucket -- cataloged and ready to query. No DAGs, no servers, no infrastructure.
400+ data sources -- Streaming (no memory limits) -- Zero infrastructure -- < 15 min setup
Installation
pip install dataspoc-pipe
Cloud storage extras:
pip install dataspoc-pipe[s3] # AWS S3
pip install dataspoc-pipe[gcs] # Google Cloud Storage
pip install dataspoc-pipe[azure] # Azure Blob Storage
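With a cloud extra installed, a pipeline's destination can point at the corresponding cloud store instead of a local path. An illustrative fragment (the `s3://` URL scheme is an assumption here, mirroring the `file://` scheme used in the Quick Start):

```yaml
destination:
  bucket: s3://my-data-lake   # hypothetical bucket name
  path: raw
  compression: zstd
```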
Singer taps are installed separately:
pip install tap-csv
pip install tap-postgres
Quick Start
1. Initialize
dataspoc-pipe init
Creates ~/.dataspoc-pipe/ with config.yaml, pipelines/, sources/, and transforms/.
2. Install a Singer tap and prepare data
pip install tap-csv
Create /tmp/sample/users.csv:
id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com
3,Carol,carol@example.com
3. Create a pipeline
dataspoc-pipe add my-first-pipeline
The interactive wizard prompts for tap name, destination bucket, compression, incremental mode, and schedule. Or create ~/.dataspoc-pipe/pipelines/my-first-pipeline.yaml manually:
source:
  tap: tap-csv
  config:
    files:
      - entity: users
        path: /tmp/sample/users.csv
        keys:
          - id
destination:
  bucket: file:///tmp/my-lake
  path: raw
  compression: zstd
incremental:
  enabled: false
4. Validate and run
dataspoc-pipe validate my-first-pipeline
dataspoc-pipe run my-first-pipeline
5. Check results
dataspoc-pipe status
dataspoc-pipe logs my-first-pipeline
dataspoc-pipe manifest file:///tmp/my-lake
Your data is now at /tmp/my-lake/raw/csv/users/dt=2026-03-20/users_0000.parquet.
How It Works
┌─────────────┐ ┌──────────┐ stdout ┌───────────────┐ ┌──────────────┐
│ Data Source │───>│ Singer │─────────>│ DataSpoc Pipe │───>│ Cloud Bucket │
│ (DB, API, …)│ │ Tap │ │ transform(df) │ │ (S3/GCS/Az) │
└─────────────┘ └──────────┘ └───────┬───────┘ └──────────────┘
│
manifest.json
state.json
logs/
- Singer tap extracts data from the source, emits JSON on stdout
- Pipe reads the stream, buffers in batches (~10K records)
- If ~/.dataspoc-pipe/transforms/<pipeline>.py exists, applies transform(df) per batch
- Converts each batch to Parquet (zstd) and uploads it to the bucket
- Updates the manifest catalog and saves execution logs
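The per-batch hook described above can be sketched as a small module. This is a hypothetical example for ~/.dataspoc-pipe/transforms/my-first-pipeline.py; it assumes batches arrive as pandas DataFrames, which is an assumption rather than a documented guarantee:

```python
# Hypothetical transform hook: cleans each ~10K-record batch
# before DataSpoc Pipe converts it to Parquet.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows missing a primary key and normalize email casing."""
    df = df.dropna(subset=["id"]).copy()   # rows without an id are discarded
    df["email"] = df["email"].str.lower()  # normalize email addresses
    return df
```

Because the hook runs per batch, it should be stateless: each call sees only one slice of the stream.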
Commands
dataspoc-pipe init # Initialize config structure
dataspoc-pipe add <name> # Create pipeline (interactive wizard)
dataspoc-pipe run <name> # Run a pipeline
dataspoc-pipe run <name> --full # Force full extraction (ignore bookmarks)
dataspoc-pipe run _ --all # Run all pipelines
dataspoc-pipe status # Status table for all pipelines
dataspoc-pipe logs <name> # Last execution log (JSON)
dataspoc-pipe validate [name] # Test bucket and tap connectivity
dataspoc-pipe manifest <bucket> # Show data catalog
dataspoc-pipe schedule install # Install cron jobs
dataspoc-pipe schedule remove # Remove cron jobs
dataspoc-pipe --version # Show version
Incremental Extraction
Enable in pipeline YAML:
incremental:
enabled: true
Pipe saves Singer bookmarks to <bucket>/.dataspoc/state/<pipeline>/state.json. Next run only fetches new data. Use --full to re-extract everything.
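For reference, Singer state files record per-stream bookmarks. A hypothetical state.json for the users stream might look like this (field names follow the common Singer bookmark convention; the exact shape depends on the tap):

```json
{
  "bookmarks": {
    "users": {
      "replication_key": "updated_at",
      "replication_key_value": "2026-03-20T00:00:00Z"
    }
  }
}
```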
Bucket Convention
This is the public contract between Pipe, Lens, and ML. Do not change without versioning.
<bucket>/
.dataspoc/
manifest.json # Data catalog
state/<pipeline>/state.json # Incremental bookmarks
logs/<pipeline>/<timestamp>.json # Execution logs
raw/<source>/<table>/
dt=YYYY-MM-DD/ # Hive-style partitioning
<table>_0000.parquet # Data files
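The partition layout above is mechanical, so object keys can be derived from source, table, batch index, and run date. A minimal sketch (the function name is hypothetical, not part of the Pipe API):

```python
from datetime import date

def parquet_key(source: str, table: str, batch_index: int, dt: date) -> str:
    """Build a Hive-style object key following the bucket convention."""
    return f"raw/{source}/{table}/dt={dt.isoformat()}/{table}_{batch_index:04d}.parquet"
```

For example, the Quick Start run lands its first batch at raw/csv/users/dt=2026-03-20/users_0000.parquet.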
Built-in Taps
| Tap | Source | Extra install |
|---|---|---|
| parquet | Parquet files (local/cloud) | None |
| google-sheets-public | Public Google Sheets | None |
Any Singer-compatible tap works. Run dataspoc-pipe add to see available templates.
Part of the DataSpoc Platform
| Product | Role |
|---|---|
| DataSpoc Pipe (this) | Ingestion: Singer taps to Parquet in cloud buckets |
| DataSpoc Lens | Virtual warehouse: SQL + Jupyter + AI over your data lake |
| DataSpoc ML | AutoML: train and deploy models from your lake |
The bucket is the contract. Pipe writes. Lens reads. ML learns.
Community
- GitHub Issues -- Report bugs or request features
- Contributing -- PRs welcome! See CONTRIBUTING.md for guidelines
License
Apache 2.0 -- free to use, modify, and distribute.