Build modular data pipelines that run inside a PostgreSQL database

Project description

Ralsei

Ralsei is a Python framework for building modular data pipelines acting on a SQL database. Inspired by kedro and dbt, it aims to combine data collection (through scraping/APIs) and data processing in a single declarative pipeline.

Design goals

Lightweight and portable
Preserve knowledge of how the data was acquired, in the form of a pipeline script
Cover both data collection (downloading) and analysis
Control over the workflow: rerun any specific task on demand
Support for resumable long-running tasks
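
The last two goals can be illustrated without Ralsei at all. The sketch below uses only the standard-library sqlite3 module, with a hypothetical process function and a hypothetical _done column, to show why a per-row completion flag makes a long-running task resumable. It is a mental model under those assumptions, not Ralsei's implementation.

```python
import sqlite3
from typing import Optional

def process(name: str) -> str:
    """Hypothetical stand-in for slow work such as an HTTP download."""
    return name.upper()

def run_task(conn: sqlite3.Connection, limit: Optional[int] = None) -> int:
    """Process rows that are not marked done yet; return how many were handled."""
    rows = conn.execute(
        "SELECT id, name FROM sources WHERE NOT _done ORDER BY id"
    ).fetchall()
    if limit is not None:
        rows = rows[:limit]  # simulate an interruption after `limit` rows
    for row_id, name in rows:
        conn.execute(
            "UPDATE sources SET result = ?, _done = 1 WHERE id = ?",
            (process(name), row_id),
        )
        conn.commit()  # commit per row so finished work survives a crash
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sources("
    "id INTEGER PRIMARY KEY, name TEXT, result TEXT, _done INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO sources(name) VALUES (?)",
    [("Physics",), ("Computer Science",), ("Philosophy",)],
)
conn.commit()

first_run = run_task(conn, limit=2)  # "interrupted" after two rows
second_run = run_task(conn)          # picks up only the remaining row
```

Because the flag is stored next to the data, rerunning the task is safe: rows that are already done are skipped by the WHERE clause.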

Installation

pip install ralsei

Example

See the documentation for an in-depth explanation.

init_sources.sql

CREATE TABLE {{table}}(
  id INTEGER PRIMARY KEY,
  year INT,
  name TEXT
);
{%split%}
INSERT INTO {{table}}(year, name) VALUES
(2015, 'Physics'),
(2018, 'Computer Science'),
(2021, 'Philosophy');
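
The template above contains a {{table}} placeholder and a {%split%} marker that appears to separate two statements. As a rough illustration of that idea, the sketch below uses plain string operations and a hypothetical render_statements helper. Ralsei has its own template engine, which may behave differently (for example, in how it quotes table names).

```python
import sqlite3

TEMPLATE = """\
CREATE TABLE {{table}}(
  id INTEGER PRIMARY KEY,
  year INT,
  name TEXT
);
{%split%}
INSERT INTO {{table}}(year, name) VALUES
(2015, 'Physics'),
(2018, 'Computer Science'),
(2021, 'Philosophy');
"""

def render_statements(template: str, table: str) -> list:
    """Hypothetical helper: fill in {{table}}, then cut the text at {%split%}."""
    rendered = template.replace("{{table}}", table)
    parts = (part.strip() for part in rendered.split("{%split%}"))
    return [part for part in parts if part]

statements = render_statements(TEMPLATE, "sources")

conn = sqlite3.connect(":memory:")
for statement in statements:
    conn.execute(statement)  # each piece is executed as one statement
row_count = conn.execute("SELECT count(*) FROM sources").fetchone()[0]
```

Splitting matters because many database drivers execute only one statement per call.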

logic.py

import requests
import json

def download(year: int, name: str):
  response = requests.get(
     "https://foo.com/api",
     params={"year": year, "name": name},
  )
  response.raise_for_status()
  return {"json": response.text}

def parse_page(data: str):
  for item in json.loads(data)["items"]:
     yield {"university": item["name"], "rank": item["rank"]}
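
To see what parse_page produces, it can be called on its own with a hand-written payload. The payload shape below is an assumption inferred from the keys the function reads ("items", "name", "rank"); the real API response is not shown in this document.

```python
import json

def parse_page(data: str):
    # Same logic as in logic.py: one output row per item in the payload
    for item in json.loads(data)["items"]:
        yield {"university": item["name"], "rank": item["rank"]}

# Hypothetical sample payload, for illustration only
sample = json.dumps(
    {
        "items": [
            {"name": "Example University", "rank": 1},
            {"name": "Sample Institute", "rank": 2},
        ]
    }
)

rows = list(parse_page(sample))
```

Each input row can therefore yield several output rows, which is why this function is used to fill a new table rather than new columns.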

app.py

from typing import Optional
from pathlib import Path
import click
import sqlalchemy
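from ralsei import (
  Ralsei,
  Pipeline,
  Table,
  ValueColumn,
  Placeholder,
  compose,
  compose_one,
  pop_id_fields,
)
# Assumed import path for the task classes used below; check the documentation
from ralsei.task import CreateTableSql, MapToNewColumns, MapToNewTable
# Absolute import, so that the file can be run directly as `python ./app.py`
from logic import download, parse_page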

# Define your tasks
class MyPipeline(Pipeline):
  def __init__(self, schema: Optional[str]):
     self.schema = schema

  def create_tasks(self):
     return {
        "init": CreateTableSql(
           table=Table("sources", self.schema),
           sql=Path("./init_sources.sql").read_text(),
        ),
        "download": MapToNewColumns(
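           table=self.outputof("init"), # table produced by the "init" task
           select=(
              # only rows not yet marked as done, so a rerun resumes the work
              "SELECT id, year, name FROM {{table}} WHERE NOT {{is_done}}"
           ),
           columns=[ValueColumn("json", "TEXT")], # column filled from download()'s result
           is_done_column="_downloaded", # flag column that records finished rows
           fn=compose_one(download, pop_id_fields("id")) # keep id out of download()'s arguments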
        ),
        "parse": MapToNewTable(
           source_table=self.outputof("download"),
           select="SELECT id, json AS data FROM {{source}}", # alias matches parse_page(data)
           table=Table("records", self.schema),
           columns=[
              "record_id INTEGER PRIMARY KEY", # plain SQL column definition, not filled by fn
              ValueColumn(
                 "source_id",
                 "INT REFERENCES {{source}}",
                 Placeholder("id"),
              ),
              ValueColumn("university", "TEXT"),
              ValueColumn("rank", "INT"),
           ],
           fn=compose(parse_page, pop_id_fields("id")),
        )
     }

# Create a CLI application
@click.option("-s", "--schema", help="Database schema")
class App(Ralsei):
  def __init__(self, db: sqlalchemy.URL, schema: Optional[str]):
     super().__init__(db, MyPipeline(schema))

if __name__ == "__main__":
  App.run_cli()

The resulting app can be run like this:

python ./app.py -d sqlite:///result.sqlite --schema dev run
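
The fn arguments in app.py wrap download and parse_page with pop_id_fields("id"). A plausible reading is that the id column is held back from the wrapped function and re-attached to every row it produces, which is what lets Placeholder("id") fill source_id. The sketch below models that reading with a hypothetical pop_ids helper; it assumes rows are passed as keyword arguments and is not Ralsei's actual code, whose pop_id_fields and compose may differ in signature and behaviour.

```python
import json
from typing import Any, Callable, Dict, Iterator

Row = Dict[str, Any]

def pop_ids(*id_fields: str):
    """Hypothetical wrapper: hide id fields from fn, then add them back to its output."""
    def wrap(fn: Callable[..., Iterator[Row]]) -> Callable[..., Iterator[Row]]:
        def wrapped(**row: Any) -> Iterator[Row]:
            ids = {field: row.pop(field) for field in id_fields}
            for out in fn(**row):  # fn only sees the remaining columns
                yield {**ids, **out}
        return wrapped
    return wrap

def parse_page(data: str) -> Iterator[Row]:
    for item in json.loads(data)["items"]:
        yield {"university": item["name"], "rank": item["rank"]}

parse_with_id = pop_ids("id")(parse_page)

# One source row (id=7) fans out into output rows that still carry id=7
rows = list(
    parse_with_id(
        id=7,
        data='{"items": [{"name": "Example University", "rank": 1}]}',
    )
)
```

Keeping the id handling in a wrapper leaves download and parse_page free of database concerns, so they can be tested as ordinary functions.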

Project details

Release history

Version	Released
3.1.0.post3 (this version)	Oct 9, 2024
3.1.0.post2	Oct 8, 2024
3.1.0.post1	Oct 7, 2024
3.1.0	Oct 7, 2024
3.0.0.dev5 (pre-release)	Oct 7, 2024
3.0.0.dev4 (pre-release)	Oct 7, 2024
3.0.0.dev3 (pre-release)	Sep 6, 2024
2.2.0	Nov 7, 2023
2.1.4	Sep 25, 2023
2.1.3	Sep 5, 2023
2.1.2	Sep 3, 2023
2.1.1	Sep 3, 2023
2.0.0rc1 (pre-release)	Aug 18, 2023
0.1.0	Jun 30, 2023

Download files

Source Distribution

ralsei-3.1.0.post2.tar.gz (30.5 kB)

Uploaded Oct 8, 2024 Source

Built Distribution

ralsei-3.1.0.post2-py3-none-any.whl (46.8 kB)

Uploaded Oct 8, 2024 Python 3

Hashes for ralsei-3.1.0.post2.tar.gz

Algorithm	Hash digest
SHA256	`fa3bd8a65cfd9473f62c6b200221e5a6eebd21e506a9769d765f6196e31db0c7`
MD5	`9d9b479bcd2f4037b6e7832e15da066e`
BLAKE2b-256	`30fe8d7ce304a78ec92dc96271eb04221ca5837c61772ae574760e194e4aab4b`

Hashes for ralsei-3.1.0.post2-py3-none-any.whl

Algorithm	Hash digest
SHA256	`30558c4c7698c9e8bd48ac0e7b04a84c4b7c8772c0f38fb04b3792f51768acb7`
MD5	`41ceadb14675ca3fccd378a406d88ae6`
BLAKE2b-256	`b1c27bf6c9cfd498ae65a027ec5cfdf5f7293c92a30825f85be3e82da6b94d8b`