
Welcome to your Curie Project!

This is the beginning of your pipeline project. Please take note of the project structure in your file explorer, and note that any file path referenced here is relative to the root of your project.

What is a KETL Pipeline?: Our devs derived the name from the intended nature of the program. KETL stands for "Kinetic Extract Transform Load," which best aligned with their goal of producing a system that lets users easily build and run pipelines from a mono-repo architecture. Given its ability to simplify building, running, and deploying pipelines, the name stuck.

Due Diligence: Please make sure you have read the Getting Started section before continuing.

  1. "( . )" indicates a location where the user can indicate they want to affect all items, by swapping the list for the dot. For example, curie --clean . will clean all pipelines, and curie --clean pipeline1 pipeline2 . will clean pipeline1 and pipeline2.

Getting Started 1.0.0

Command Line Interface 1.1.0

  1. Initialize a new project - Change your working directory to the location where you want to create your project. Then run either of the following commands:

    curie etl --init <project>
    curie etl -i <project>
    
  2. Add a pipeline - Change your working directory to the location of your project. Then run either of the following commands:

    curie etl add --name <pipeline_name> [--connection <connection_name>]
    curie etl add -n <pipeline_name> [-c <connection_name>]
    

    This will produce the required entries in your pathways.yaml file, and write a simple blueprint to <root>/blueprints/<pipeline_name>.yaml. You will need to edit these files to get your pipeline up and running. See "Configuration Files" and "Pipeline Blueprints" for more information.

  3. Compile your project - Compiling your project will generate the scripts that will be used to run your pipeline. By default these will be stored in <root>/scripts/compiled/<pipeline>/.

    Change your working directory to the location of your project. Then run either of the following commands:

    curie etl --run <pipeline> --compile
    curie etl -r <pipeline> -c
    
  4. Running your pipeline - Running your pipeline will execute the scripts generated in the previous step. This action affects your database. Common uses include: updating tables, building a new dataset, refreshing dependencies.

    Change your working directory to the location of your project. Then run either of the following commands:

    curie etl --run <pipeline> --tables <t1 t2 t3 ... tn (.)>
    curie etl -r <pipeline> <mode> --tables <t1 t2 t3 ... tn (.)>
    
  5. Saving your pipeline - Saving your pipeline will download selections of the tables specified in the command according to terms defined in your config file. By default these will be stored in <root>/data/<pipeline>/.

    Change your working directory to the location of your project. Then run either of the following commands:

    curie etl --save <pipeline> --tables <t1 t2 t3 ... tn (.)>
    curie etl -s <pipeline> --tables <t1 t2 t3 ... tn (.)>
     TODO: alter to allow multiple file types
    
  6. Cleaning your pipeline - Cleaning your pipeline will remove all files generated by the pipeline. This action does not affect your database. Common uses include: removing downloaded data, removing compiled scripts.

    Change your working directory to the location of your project. Then run the following command:

    curie etl --clean <pipeline (.)> --facet <facet (.)>
    
  7. Automated Documentation - Curie is self-documenting, with plenty of options to add more insight.

    Change your working directory to the location of your project. Then run the following command:

    curie docs generate
    

    And to view the documentation, run the following command:

    curie docs serve
    

    This will launch a local server that will allow you to view your documentation in your browser.
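
Putting the steps together, a typical first session might look like the sketch below. The names my_project, pipeline1, and table1 are placeholders for illustration only; substitute your own project, pipeline, and table names.

    curie etl --init my_project                  # 1. create the project
    cd my_project                                #    work from the project root
    curie etl add --name pipeline1               # 2. add a pipeline, then edit config/ and blueprints/
    curie etl --run pipeline1 --compile          # 3. compile the pipeline's scripts
    curie etl --run pipeline1 --tables .         # 4. run every table in the pipeline
    curie etl --save pipeline1 --tables table1   # 5. download table1
    curie etl --clean pipeline1 --facet .        # 6. remove everything the pipeline generated
    curie docs generate                          # 7. build the project documentation
    curie docs serve                             #    and view it in your browser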

Configuration Files 1.2.0

There are two primary configuration files that you will need to edit to get your project up and running. These are connections.yaml and pathways.yaml. connections.yaml contains the information needed to connect to your database. pathways.yaml contains the information needed to associate pipeline blueprints with your database connection.

After you have initialized your project, you will find these files in <root>/config/.

To declare a new connection, add a new entry to connections.yaml in the following format:

DatabaseClass:
    connection_name:
        user: username
        password: password
        host: host
        port: 1123
        database: database
        schema: schema

The connection arguments required depend on the database class you are using. For example, a Postgres object requires a schema argument, while a SQLite object does not.
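
As an illustrative sketch of that difference (the class names, connection names, and exact keys below are assumptions for illustration; the keys each class actually expects are defined by the class itself):

Postgres:
    warehouse_conn:
        user: analyst
        password: secret
        host: db.example.com
        port: 5432
        database: warehouse
        schema: public
SQLite:
    local_conn:
        database: ./data/local.db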

To declare a new pathway, add a new entry to pathways.yaml in the following format:

# Where to find connection information:
Connections:
    config:
        file: connections.yaml
# Pipeline declarations:
Pathways:
    pipeline_name:
        connection: connection_name
        blueprint: blueprints/this_pipeline.yaml

Pipeline Blueprints 1.3.0

Pipeline blueprints are exactly what they sound like: they outline the steps that will be taken to run a pipeline. They are written in YAML and stored in <root>/blueprints/. A blueprint lets you define a pipeline in a way that is both human-readable and machine-readable.

Three things to note:

  1. The order of the tables in the blueprint does not matter. Tables will be executed according to their dependencies.
  2. The order of the dependencies in the blueprint does not matter. The program will form a dependency graph and execute the tables in the correct order.
  3. Should a dependency not exist, the program may fault. This is a feature, not a bug. It is a good idea to check your dependencies before running a pipeline.

Root Tags: The root tags of a pipeline blueprint are SYNC, ARGUMENTS, and ETL. SYNC defines the target for downloaded tables. ARGUMENTS defines the arguments that will be passed to the pipeline. ETL defines the tables that will be executed in the pipeline.

SYNC: The SYNC tag defines the target directory for downloaded tables. By default, this is <root>/data/<pipeline>/.

ARGUMENTS: The ARGUMENTS tag is a dictionary of arguments that will be passed to the pipeline. Defining the arguments here allows you to change their values without changing the rest of the pipeline blueprint.

You can reference these arguments from your queries using Jinja syntax. For example, if your arguments include offset: 5000, you can reference it in a query as {{ offset }}.
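
For instance, assuming your arguments include offset: 5000 and limit: 100, a sketch of a paged query might look like this (source_table is a placeholder name):

SELECT *
FROM source_table
LIMIT {{ limit }}
OFFSET {{ offset }}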

ETL: The ETL tag is a dictionary of the tables that will be executed in the pipeline. The key is the name of the table, and the value is a dictionary of the table's attributes.

Table Attributes: The table attributes are manifest, schema, save, and run. manifest defines how the query should be manifested in the database: table, view, etc. schema defines the schema of the table. save defines how the table should be saved (See "Modal Elements"). run defines how the table should be run (See "Modal Elements").

  • Modal Elements: Two modes are supported: run and save. Each lets you specify a list of dependencies that will be executed before the table is processed, and a query that will generate the data. run executes the query and affects the database; save downloads the data generated by the query.

    • query: The query is a string that will be executed by the database. It is the core of the pipeline. These should be written in Jinja SQL.

    • script: Scripts are Jinja SQL files that will be compiled then run to execute the pipeline. Store them where you prefer, but please reserve the scripts/compiled/<pipeline> directory for compiled scripts.

    script and query are mutually exclusive; if both are specified, script takes precedence.

    • dependencies: Dependencies is a list of tables that must be built before the current table is run. This allows the program to form a dependency graph and execute the tables in the correct order, which matters in procedural ETLs and in forming good data models.

    Save Mode Only

    • target: The target directory for downloaded tables. By default, this is <root>/data/<pipeline>/.
    • filetype: The file type for downloaded tables. By default, this is csv.

    Run Mode Only

    • method: Defines the manner in which a table is affected: replace, truncate, merge, append. replace will drop the table and replace it with the new data. truncate will delete all rows from the table and insert the new data. merge will update the table with the new data using an identifier. append will insert the new data into the table.
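
As a sketch of how these modal elements might fit together in a blueprint (the table names, script path, and method choices here are illustrative only, and {{ limit }} assumes a limit entry under ARGUMENTS; see also the example pipeline below):

ETL:
  raw_orders:
    manifest: table
    schema: public
    run:
      method: replace
      query: SELECT * FROM orders_source LIMIT {{ limit }}
  clean_orders:
    manifest: table
    schema: public
    run:
      method: truncate
      dependencies:
        - raw_orders
      script: scripts/pipeline1/clean_orders.sql
    save:
      target: ./data/pipeline1/clean
      filetype: csv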

Example Pipeline:

SYNC:
  target: ./data/pipeline1/raw
  overwrite: true
ARGUMENTS:
  limit: 100
ETL:
  table_name:
    manifest: table
    schema: public
    save:
      query: SELECT * FROM table limit {{limit}}


Project Structure 1.4.0

├── blueprints
│   └── pipeline1.yaml
├── config
│   ├── connections.yaml
│   └── pathways.yaml
├── data
│   └── pipeline1
│       └── table1.csv
├── scripts
│   ├── compiled
│   │   └── pipeline1
│   │       └── table1.sql (compiled)
│   └── pipeline1
│       └── table1.sql (jinjasql)
├── logs
│   └── pipeline1
│       └── table1.log
└── README-curie.md
