Microsoft Fabric ETL toolbox

bifabrik

This is an early build - if you find a problem, please report it here: https://github.com/rjankovic/bifabrik/issues. Thanks!

What is the point?

  • make BI development in Microsoft Fabric easier by providing a fluent API for common ETL tasks
  • reduce repetitive code by setting preferences in config files

For info on all the features, see the project page

Quickstart

First, let's install the library. You can add the bifabrik library to an environment in Fabric and attach that environment to your notebook.

Alternatively, add %pip install bifabrik at the beginning of the notebook.

Init the class

To load data, bifabrik needs to access the spark session.

from bifabrik import bifabrik
bif = bifabrik(spark)
# 'bif' will be used in many code samples as a reference to the bifabrik class instance

Also, make sure that your notebook is connected to a lakehouse. This is the lakehouse to which bifabrik will save data.

Load CSV files (JSON is similar)

Simple tasks should be easy.

from bifabrik import bifabrik
bif = bifabrik(spark)

bif.fromCsv('Files/CsvFiles/annual-enterprise-survey-2021.csv').toTable('Survey2021').run()

...and the table is in place

display(spark.sql('SELECT * FROM Survey2021'))

Or you can make use of pattern matching

# take all files matching the pattern and concat them
bif.fromCsv('Files/*/annual-enterprise-survey-*.csv').toTable('SurveyAll').run()

These are full loads, overwriting the target table if it exists.
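To get a feel for which files a pattern such as Files/*/annual-enterprise-survey-*.csv selects, here is a minimal sketch using Python's standard glob module on a throwaway folder. The folder layout and file names are made up for illustration, and bifabrik's own pattern resolution inside the lakehouse may differ in details.

```python
import glob
import os
import tempfile

# Hypothetical folder layout, created in a temp directory so the sketch runs anywhere.
root = tempfile.mkdtemp()
sample_files = [
    "Files/2020/annual-enterprise-survey-2020.csv",
    "Files/2021/annual-enterprise-survey-2021.csv",
    "Files/2021/other-report.csv",              # name does not match
    "Files/annual-enterprise-survey-2019.csv",  # not inside a subfolder
]
for rel in sample_files:
    path = os.path.join(root, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("Year,Value\n")

# '*' stands for exactly one folder level, or for part of a file name.
pattern = os.path.join(root, "Files", "*", "annual-enterprise-survey-*.csv")
matched = sorted(
    os.path.relpath(p, root).replace(os.sep, "/") for p in glob.glob(pattern)
)
print(matched)
```

With glob semantics, only the two survey files that sit one folder below Files are picked up.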

Configure load preferences

Is your CSV a bit...special? No problem, we'll tend to it.

Let's say you have a European CSV with commas instead of decimal points and semicolons instead of commas as separators.

bif.fromCsv("Files/CsvFiles/dimBranch.csv").delimiter(';').decimal(',').toTable('DimBranch').run()

The backend uses pandas, so you can take advantage of many other options - see help(bif.fromCsv())
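If you are unsure what the delimiter and decimal settings mean for the data, the following sketch parses a tiny "European" CSV with only the Python standard library. It is an illustration of the file format, not bifabrik's implementation; the sample data and the parse_european_csv helper are hypothetical.

```python
import csv
import io

# Hypothetical sample: ';' separates the fields, ',' marks the decimals.
raw = "BranchId;BranchName;Budget\n1;Praha;1250,50\n2;Brno;980,25\n"

def parse_european_csv(text, delimiter=";", decimal=","):
    """Read the sample and convert its numeric columns to Python numbers."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        row["BranchId"] = int(row["BranchId"])
        # Swap the decimal mark for '.' so float() can parse the value.
        row["Budget"] = float(row["Budget"].replace(decimal, "."))
        rows.append(row)
    return rows

rows = parse_european_csv(raw)
total_budget = sum(r["Budget"] for r in rows)
print(rows)
print(total_budget)
```

Without the two settings, the same file would be read as a single text column per line, which is why they need to be set before the load.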

Keep the configuration

What, you have more files like that? Well then, you probably don't want to repeat the setup each time. Good news is, the bifabrik object can keep all your preferences:

from bifabrik import bifabrik
bif = bifabrik(spark)

# set the configuration
bif.cfg.csv.delimiter = ';'
bif.cfg.csv.decimal = ','

# the configuration will be applied to all these loads
bif.fromCsv("Files/CsvFiles/dimBranch.csv").toTable('DimBranch').run()
bif.fromCsv("Files/CsvFiles/dimDepartment.csv").toTable('DimDepartment').run()
bif.fromCsv("Files/CsvFiles/dimDivision.csv").toTable('DimDivision').run()

# (You can still apply configuration in the individual loads, as seen above, to override the general configuration.)

If you want to persist your configuration beyond the PySpark session, you can save it to a JSON file - see Configuration

Consistent configuration is one of the core values of the project.

We like our lakehouses to be uniform in terms of loading patterns, table structures, tracking, etc. At the same time, we want to keep it DRY.

bifabrik configuration aims to cover many aspects of the lakehouse so that you can define your conventions once, use them repeatedly, and override them when necessary.
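The "define once, override when necessary" idea can be sketched in plain Python. This is not bifabrik's internal code: CsvPreferences and effective_preferences are hypothetical names used only to show shared settings, a per-load override, and a round trip through JSON.

```python
import json
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class CsvPreferences:
    """Hypothetical stand-in for a set of shared CSV settings."""
    delimiter: str = ","
    decimal: str = "."

def effective_preferences(shared, **overrides):
    """Return the shared preferences with any per-load overrides applied."""
    return replace(shared, **overrides)

# Conventions defined once...
shared = CsvPreferences(delimiter=";", decimal=",")

# ...reused as they are for most loads...
branch_prefs = effective_preferences(shared)

# ...and overridden for the odd file that needs something else.
legacy_prefs = effective_preferences(shared, delimiter="|")

# Persisting the shared settings as JSON and restoring them later.
saved = json.dumps(asdict(shared))
restored = CsvPreferences(**json.loads(saved))
print(branch_prefs, legacy_prefs, restored)
```

The override leaves the shared object untouched, so later loads still see the original conventions.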

See the GitHub page for more details on this.

SQL transformations

Enough with the files! Let's make a simple SQL transformation, writing data to another SQL table - a straightforward full load:

bif.fromSql('''

SELECT Industry_name_NZSIOC AS Industry_Name 
,AVG(`Value`) AS AvgValue
FROM LakeHouse1.Survey2021
WHERE Variable_Code = 'H35'
GROUP BY Industry_name_NZSIOC

''').toTable('SurveySummarized').run()

# The resulting table will be saved to the lakehouse attached to your notebook.
# You can refer to a different source warehouse in the query, though.
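To try the same kind of aggregation and full load outside Fabric, here is a rough sketch that uses SQLite from the Python standard library as a stand-in for Spark SQL. The sample rows and the full_load helper are made up, and dropping and recreating the table is just one way to get full-load behaviour; it is an assumption, not a description of how bifabrik writes tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Survey2021 '
    '(Industry_name_NZSIOC TEXT, Variable_Code TEXT, "Value" REAL)'
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Survey2021 VALUES (?, ?, ?)",
    [
        ("Mining", "H35", 10.0),
        ("Mining", "H35", 20.0),
        ("Mining", "H01", 999.0),  # filtered out by the WHERE clause
        ("Retail", "H35", 4.0),
    ],
)

def full_load(connection):
    """Rebuild the target table from scratch, replacing any previous contents."""
    connection.execute("DROP TABLE IF EXISTS SurveySummarized")
    connection.execute(
        """
        CREATE TABLE SurveySummarized AS
        SELECT Industry_name_NZSIOC AS Industry_Name,
               AVG("Value") AS AvgValue
        FROM Survey2021
        WHERE Variable_Code = 'H35'
        GROUP BY Industry_name_NZSIOC
        """
    )

full_load(conn)
full_load(conn)  # a second run replaces the table instead of appending to it

summary = conn.execute(
    "SELECT Industry_Name, AvgValue FROM SurveySummarized ORDER BY Industry_Name"
).fetchall()
print(summary)
```

Running the load twice leaves one row per industry, which is the behaviour to expect from a full load.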
