
Reflowfy

A horizontally scalable data movement and transformation framework

Reflowfy lets you build pipelines that fetch data from sources, apply custom transformations, and send results to destinations, all while scaling to millions of records.

🎯 Key Features

  • Horizontally Scalable: Process millions of records in parallel
  • Kafka-Based: Reliable message queue for job distribution
  • Kubernetes-Native: KEDA autoscaling from 0 to N workers
  • Order-Independent: Maximum parallelism without coordination overhead
  • User-Extensible: Plugin architecture for sources, destinations, and transformations
  • Two Execution Modes: Local testing and distributed production execution

🏗 Architecture

User Request
    ↓ HTTP
API (FastAPI) ────→ ReflowManager Service (port 8001)
    │                    ↓
    │                PostgreSQL (state + checkpoints)
    │                    ↓
    │                Kafka Producer (rate limited) → Kafka Topic (reflow.jobs)
    │                    ↓
    │                Worker Pool (KEDA scaled)
    │                    ↓
    │                Destinations
    │
    └─→ Execution Tracking

Components:

  • API: Orchestration, job splitting, route generation
  • ReflowManager: Rate limiting, state management, checkpointing
  • PostgreSQL: Persistent state storage for executions and checkpoints
  • Kafka: Job queue and load balancing
  • Workers: Generic executors that process jobs
  • KEDA: Kafka lag-based autoscaling
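
The components above cooperate as follows: the API splits a request into jobs, the ReflowManager rate-limits and checkpoints them, and each worker pulls jobs off the reflow.jobs topic. As a rough illustration only (not Reflowfy's actual worker code), a worker's consume loop might look like the sketch below; the run_job helper and the job message fields are assumptions.

import json

from kafka import KafkaConsumer  # kafka-python client, used here purely for illustration


def run_job(job: dict) -> None:
    # Hypothetical stand-in for Reflowfy's actual job execution logic.
    print(f"processing job for pipeline {job.get('pipeline_name')}")


consumer = KafkaConsumer(
    "reflow.jobs",
    bootstrap_servers="kafka:9092",
    group_id="reflowfy-workers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)

for message in consumer:
    run_job(message.value)  # jobs are order-independent, so any worker can take any job
    consumer.commit()       # commit the offset only after the job has been handled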

🚀 Quick Start

1. Define a Custom Transformation

import xml.etree.ElementTree as ET

from reflowfy import BaseTransformation

def parse_xml(record):
    root = ET.fromstring(record)
    return {child.tag: child.text for child in root}

class XmlToJson(BaseTransformation):
    name = "xml_to_json"

    def apply(self, records, context):
        # Your transformation logic: here, convert each XML record into a dict
        return [parse_xml(r) for r in records]
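
Before wiring the transformation into a pipeline, you can sanity-check it locally; this assumes XmlToJson can be instantiated with no arguments and that an empty dict is an acceptable context for a one-off call.

records = ["<event><level>INFO</level><message>worker started</message></event>"]
print(XmlToJson().apply(records, context={}))
# [{'level': 'INFO', 'message': 'worker started'}]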

2. Build a Pipeline

from reflowfy import build_pipeline, pipeline_registry
from reflowfy import elastic_source, kafka_destination

# Configure source with runtime parameters
source = elastic_source(
    url="http://elasticsearch:9200",
    index="logs-*",
    base_query={
        "query": {
            "range": {
                "@timestamp": {
                    "gte": "{{ start_time }}",  # Runtime parameter
                    "lte": "{{ end_time }}"
                }
            }
        }
    }
)

# Configure destination
destination = kafka_destination(
    bootstrap_servers="kafka:9092",
    topic="processed-logs"
)

# Build and register
pipeline = build_pipeline(
    name="elastic_xml_pipeline",
    source=source,
    transformations=[XmlToJson()],
    destination=destination,
    rate_limit={"jobs_per_second": 50}
)

pipeline_registry.register(pipeline)
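
The {{ start_time }} and {{ end_time }} placeholders in base_query are filled from the runtime_params passed when the pipeline is executed (step 4 below). The snippet here is only a conceptual sketch of that substitution, not Reflowfy's internal templating.

import json

base_query = '{"range": {"@timestamp": {"gte": "{{ start_time }}", "lte": "{{ end_time }}"}}}'
runtime_params = {"start_time": "2024-01-01", "end_time": "2024-01-02"}

rendered = base_query
for key, value in runtime_params.items():
    rendered = rendered.replace("{{ " + key + " }}", value)

print(json.loads(rendered))
# {'range': {'@timestamp': {'gte': '2024-01-01', 'lte': '2024-01-02'}}}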

3. Start the API

# In your main.py
from reflowfy.api.app import main
import examples.xml_to_json_pipeline  # Import to trigger registration

if __name__ == "__main__":
    main()
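
Then start the service as a regular Python entry point (the file name main.py is just this example's convention):

python main.py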

4. Execute Pipeline

Run Distributed (async via Kafka):

curl -X POST http://localhost:8001/run \
  -H "Content-Type: application/json" \
  -d '{
    "pipeline_name": "elastic_xml_pipeline",
    "runtime_params": {
      "start_time": "2024-01-01",
      "end_time": "2024-01-02"
    }
  }'

Dry Run (Preview jobs without executing):

curl -X POST http://localhost:8001/run \
  -H "Content-Type: application/json" \
  -d '{
    "pipeline_name": "elastic_xml_pipeline",
    "runtime_params": {
      "start_time": "2024-01-01",
      "end_time": "2024-01-02"
    },
    "dry_run": true
  }'

Returns a preview of the job execution plan, sample records, and configuration.
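
The same endpoint can also be called from Python. The sketch below mirrors the curl requests above using the requests library; the host and port follow the earlier examples and may differ in your deployment.

import requests

payload = {
    "pipeline_name": "elastic_xml_pipeline",
    "runtime_params": {"start_time": "2024-01-01", "end_time": "2024-01-02"},
    "dry_run": True,  # drop this field (or set it to False) for a real distributed run
}

response = requests.post("http://localhost:8001/run", json=payload, timeout=30)
response.raise_for_status()
print(response.json())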

📦 Installation

# Using pip
pip install -e .

# Using Docker
docker build -f Dockerfile.api -t reflowfy-api .
docker build -f Dockerfile.worker -t reflowfy-worker .
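
To confirm the installation succeeded:

pip show reflowfy
python -c "import reflowfy"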

🔌 Built-in Connectors

Sources

  • Elasticsearch: Scroll-based pagination with runtime parameters
  • SQL: ID range and offset-based splitting (Postgres, MySQL, etc.)
  • HTTP API: Offset/cursor pagination with authentication

Destinations

  • Kafka: Batching, compression, health checks
  • HTTP: Webhooks with retry logic

⚙️ Configuration

Environment Variables

API:

API_HOST=0.0.0.0
API_PORT=8000
KAFKA_BOOTSTRAP_SERVERS=kafka:9092
KAFKA_TOPIC=reflow.jobs

Worker:

KAFKA_BOOTSTRAP_SERVERS=kafka:9092
KAFKA_TOPIC=reflow.jobs
KAFKA_GROUP_ID=reflowfy-workers
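
As an example, assuming the worker image reads these variables at startup, a worker container built from Dockerfile.worker (see Installation) could be run locally like this; networking to your Kafka broker is up to you:

docker run --rm \
  -e KAFKA_BOOTSTRAP_SERVERS=kafka:9092 \
  -e KAFKA_TOPIC=reflow.jobs \
  -e KAFKA_GROUP_ID=reflowfy-workers \
  reflowfy-worker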

Execution Modes

Mode         Endpoint                  Use Case              Kafka  Workers
Distributed  POST /run                 Production execution  Yes    Yes
Dry Run      POST /run (dry_run=true)  Preview/Testing       No     No

📊 Monitoring

Reflowfy exposes Prometheus metrics:

  • reflowfy_jobs_processed_total - Total jobs processed
  • reflowfy_job_processing_duration_seconds - Job processing time
  • reflowfy_records_processed_total - Total records processed
  • reflowfy_active_workers - Active worker count
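
The metrics use the standard Prometheus text format. Assuming the API serves them on a /metrics endpoint at its configured port (the path is an assumption, not documented above), you can inspect them directly:

curl -s http://localhost:8000/metrics | grep reflowfy_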

🐳 Kubernetes Deployment

# Deploy with Helm (using bundled charts)
# Note: For production, we recommend using the CLI 'reflowfy deploy'
helm install reflowfy-api ./reflowfy/helm/reflowfy-api
helm install reflowfy-worker ./reflowfy/helm/reflowfy-worker

KEDA will automatically scale workers based on Kafka lag.
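
You can watch the scaling happen with standard kubectl commands; this assumes the worker chart creates a KEDA ScaledObject, which the lag-based autoscaling implies:

kubectl get scaledobjects
kubectl get pods -w   # worker pods scale from 0 to N as Kafka lag grows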

📝 License

MIT

🤝 Contributing

Contributions welcome! This is a production-grade framework designed for real-world data processing at scale.

