
Data ingestion engine -- Singer + Parquet + Bucket

Project description

DataSpoc Pipe

CI · PyPI · License · Python 3.10+

Singer taps to Parquet in cloud buckets. That simple.

Why DataSpoc Pipe?

Most data ingestion tools drown you in orchestration complexity. DataSpoc Pipe does one thing well: connect to any of the 400+ Singer taps (databases, APIs, SaaS), convert to Parquet, and land it in your cloud bucket -- cataloged and ready to query. No DAGs, no servers, no infrastructure.

400+ data sources -- Streaming (no memory limits) -- Zero infrastructure -- < 15 min setup

Installation

pip install dataspoc-pipe

Cloud storage extras:

pip install dataspoc-pipe[s3]      # AWS S3
pip install dataspoc-pipe[gcs]     # Google Cloud Storage
pip install dataspoc-pipe[azure]   # Azure Blob Storage

Singer taps are installed separately:

pip install tap-csv
pip install tap-postgres

Quick Start

1. Initialize

dataspoc-pipe init

Creates ~/.dataspoc-pipe/ with config.yaml, pipelines/, sources/, and transforms/.

2. Install a Singer tap and prepare data

pip install tap-csv

Create /tmp/sample/users.csv:

id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com
3,Carol,carol@example.com
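
The directory and sample file can be created in one go from the shell (a convenience, not a Pipe command):

```shell
mkdir -p /tmp/sample
cat > /tmp/sample/users.csv <<'EOF'
id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com
3,Carol,carol@example.com
EOF
```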

3. Create a pipeline

dataspoc-pipe add my-first-pipeline

The interactive wizard prompts for tap name, destination bucket, compression, incremental mode, and schedule. Or create ~/.dataspoc-pipe/pipelines/my-first-pipeline.yaml manually:

source:
  tap: tap-csv
  config:
    files:
      - entity: users
        path: /tmp/sample/users.csv
        keys:
          - id

destination:
  bucket: file:///tmp/my-lake
  path: raw
  compression: zstd

incremental:
  enabled: false

4. Validate and run

dataspoc-pipe validate my-first-pipeline
dataspoc-pipe run my-first-pipeline

5. Check results

dataspoc-pipe status
dataspoc-pipe logs my-first-pipeline
dataspoc-pipe manifest file:///tmp/my-lake

Your data is now at /tmp/my-lake/raw/csv/users/dt=2026-03-20/users_0000.parquet.

How It Works

┌─────────────┐    ┌──────────┐  stdout  ┌───────────────┐    ┌──────────────┐
│ Data Source │───>│ Singer   │─────────>│ DataSpoc Pipe │───>│ Cloud Bucket │
│ (DB, API, …)│    │ Tap      │          │ transform(df) │    │ (S3/GCS/Az)  │
└─────────────┘    └──────────┘          └───────┬───────┘    └──────────────┘
                                                 │
                                          manifest.json
                                           state.json
                                             logs/
  1. Singer tap extracts data from the source, emits JSON on stdout
  2. Pipe reads the stream, buffers in batches (~10K records)
  3. If ~/.dataspoc-pipe/transforms/<pipeline>.py exists, applies transform(df) per batch
  4. Converts to Parquet (zstd) and uploads to bucket
  5. Updates the manifest catalog and saves execution logs
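
The hook in step 3 is just a module named after the pipeline that exposes transform(df). A minimal sketch, assuming each batch arrives as a pandas DataFrame (the df naming suggests it, but check the docs for the exact contract):

```python
# ~/.dataspoc-pipe/transforms/my-first-pipeline.py (illustrative)
# DataSpoc Pipe calls transform(df) once per ~10K-record batch.
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows without an email, then normalize the rest to lowercase.
    df = df.dropna(subset=["email"]).copy()
    df["email"] = df["email"].str.lower()
    return df
```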

Commands

dataspoc-pipe init                    # Initialize config structure
dataspoc-pipe add <name>              # Create pipeline (interactive wizard)
dataspoc-pipe run <name>              # Run a pipeline
dataspoc-pipe run <name> --full       # Force full extraction (ignore bookmarks)
dataspoc-pipe run _ --all             # Run all pipelines
dataspoc-pipe status                  # Status table for all pipelines
dataspoc-pipe logs <name>             # Last execution log (JSON)
dataspoc-pipe validate [name]         # Test bucket and tap connectivity
dataspoc-pipe manifest <bucket>       # Show data catalog
dataspoc-pipe schedule install        # Install cron jobs
dataspoc-pipe schedule remove         # Remove cron jobs
dataspoc-pipe --version               # Show version

Incremental Extraction

Enable in pipeline YAML:

incremental:
  enabled: true

Pipe saves Singer bookmarks to <bucket>/.dataspoc/state/<pipeline>/state.json. Next run only fetches new data. Use --full to re-extract everything.
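
The saved state follows the Singer convention of bookmarks nested per stream. A sketch of reading one (the exact keys inside each bookmark depend on the tap):

```python
# Peek at a pipeline's incremental bookmarks. The
# {"bookmarks": {<stream>: {...}}} layout is the common Singer
# convention; the sample value here is illustrative.
import json

state = json.loads(
    '{"bookmarks": {"users": {"replication_key_value": "2026-03-20T00:00:00Z"}}}'
)
for stream, bookmark in state.get("bookmarks", {}).items():
    print(stream, "->", bookmark)
```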

Bucket Convention

This is the public contract between Pipe, Lens, and ML. Do not change without versioning.

<bucket>/
  .dataspoc/
    manifest.json                          # Data catalog
    state/<pipeline>/state.json            # Incremental bookmarks
    logs/<pipeline>/<timestamp>.json       # Execution logs
  raw/<source>/<table>/
    dt=YYYY-MM-DD/                         # Hive-style partitioning
      <table>_0000.parquet                 # Data files
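
Because the layout is fixed, downstream consumers can derive paths without listing the bucket. A sketch (the helper name and signature are hypothetical, not part of Pipe's API):

```python
# Build a data-file path under the bucket convention above.
from datetime import date

def partition_path(source: str, table: str, dt: date, file_index: int = 0) -> str:
    return f"raw/{source}/{table}/dt={dt:%Y-%m-%d}/{table}_{file_index:04d}.parquet"

print(partition_path("csv", "users", date(2026, 3, 20)))
# -> raw/csv/users/dt=2026-03-20/users_0000.parquet
```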

Built-in Taps

Tap                   Source                         Extra install
parquet               Parquet files (local/cloud)    None
google-sheets-public  Public Google Sheets           None

Any Singer-compatible tap works. Run dataspoc-pipe add to see available templates.

Part of the DataSpoc Platform

Product               Role
DataSpoc Pipe (this)  Ingestion: Singer taps to Parquet in cloud buckets
DataSpoc Lens         Virtual warehouse: SQL + Jupyter + AI over your data lake
DataSpoc ML           AutoML: train and deploy models from your lake

The bucket is the contract. Pipe writes. Lens reads. ML learns.

License

Apache 2.0 -- free to use, modify, and distribute.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dataspoc_pipe-0.1.1.tar.gz (60.3 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

dataspoc_pipe-0.1.1-py3-none-any.whl (27.9 kB)

Uploaded Python 3

File details

Details for the file dataspoc_pipe-0.1.1.tar.gz.

File metadata

  • Download URL: dataspoc_pipe-0.1.1.tar.gz
  • Upload date:
  • Size: 60.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for dataspoc_pipe-0.1.1.tar.gz
Algorithm Hash digest
SHA256 ed1947c2ab50290bd8cf57638bf42b76e2416693d1d30c21a1ba40b1127b2451
MD5 c1c8794a6fef5ce1088ee7d47012ff4a
BLAKE2b-256 737851bf0bb1f8e33d55ddaf2fb1c94958df32a492d0a7eb4374c37a8ad322b7

See more details on using hashes here.

File details

Details for the file dataspoc_pipe-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: dataspoc_pipe-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 27.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for dataspoc_pipe-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 ba4821d6c79c135496c543b2e994677f3dcec9f494ae0ee4e31c4b4088fcae17
MD5 f3e0bd37588d97a6807427e2f86a8174
BLAKE2b-256 d7898618c4017ab7dda1a54bd6ee7c43883bac2cee94e6e505bf06e5bde146a4

See more details on using hashes here.
