Skip to main content

Airflow operators for reading and writing XLSX files

Project description

Airflow Provider XLSX

Apache Airflow operators for converting XLSX files from/to Parquet, CSV and JSON.

Build Status PyPI version PyPI Downloads Code style: black

System Requirements

  • Airflow Versions
    • 2.0 or newer

Installation

$ pip install airflow-provider-xlsx

Operators

FromXLSXOperator

Read an XLSX or XLS file and convert it into Parquet, CSV, JSON, JSON Lines(one line per record) file.

API Documentation

Example

XLSX Source

image

Airflow Task

from xlsx_provider.operators.from_xlsx_operator import FromXLSXOperator

xlsx_to_jsonl = FromXLSXOperator(
   task_id='xlsx_to_jsonl',
   source='{{ var.value.tmp_path }}/test.xlsx',
   target='{{ var.value.tmp_path }}/test.jsonl',
   file_format='jsonl',
   dag=dag
)

JSON Lines Output

{"month": "Jan", "high": -12.2, "mean": -16.2, "low": -20.1, "precipitation": 19}
{"month": "Feb", "high": -10.3, "mean": -14.7, "low": -19.1, "precipitation": 14}
{"month": "Mar", "high": -2.6, "mean": -7.2, "low": -11.8, "precipitation": 15}
{"month": "Apr", "high": 8.1, "mean": 3.2, "low": -1.7, "precipitation": 24}
{"month": "May", "high": 17.5, "mean": 11.6, "low": 5.6, "precipitation": 36}
{"month": "Jun", "high": 24, "mean": 18.2, "low": 12.3, "precipitation": 58}
{"month": "Jul", "high": 25.7, "mean": 20.2, "low": 14.7, "precipitation": 72}
{"month": "Aug", "high": 22.2, "mean": 17, "low": 11.7, "precipitation": 66}
{"month": "Sep", "high": 16.6, "mean": 11.5, "low": 6.4, "precipitation": 44}
{"month": "Oct", "high": 6.8, "mean": 3.4, "low": 0, "precipitation": 38}

FromXLSXQueryOperator

Execute an SQL query an XLSX/XLS file and export the result into a Parquet or CSV file

This operators loads an XLSX or XLS file into an in-memory SQLite database, executes a query on the db and stores the result into a Parquet, CSV, JSON, JSON Lines(one line per record) file. The output columns names and types are determinated by the SQL query output.

API Documentation

Example

XLSX Source

image

SQL Query

 select
     g as high_tech_sector,
     h as eur_bilion,
     i as share
 from
     high_tech
 where
     _index > 1
     and high_tech_sector <> ''
     and lower(high_tech_sector) <> 'total'

Airflow Task

from xlsx_provider.operators.from_xlsx_query_operator import FromXLSXQueryOperator

xlsx_to_csv = FromXLSXQueryOperator(
   task_id='xlsx_to_csv',
   source='{{ var.value.tmp_path }}/high_tech.xlsx',
   target='{{ var.value.tmp_path }}/high_tech.parquet',
   file_format='csv',
   csv_delimiter=',',
   table_name='high_tech',
   worksheet='Figure 3',
   query='''
       select
           g as high_tech_sector,
           h as eur_bilion,
           i as share
       from
           high_tech
       where
           _index > 1
           and high_tech_sector <> ''
           and lower(high_tech_sector) <> 'total'
   ''',
   dag = dag
)

Output

high_tech_sector,value,share
Pharmacy,78280,0.231952169555313
Electronics-telecommunications,75243,0.222954583130376
Scientific instruments,64010,0.189670433253542
Aerospace,44472,0.131776952366115
Computers office machines,21772,0.0645136852766778
Non-electrical machinery,20813,0.0616714981835167
Chemistry,19776,0.058598734453222
Electrical machinery,9730,0.028831912195612
Armament,3384,0.0100300315856265

ToXLSXOperator

Read a Parquest, CSV, JSON, JSON Lines(one line per record) file and convert it into XLSX.

API Documentation

Example
from xlsx_provider.operators.to_xlsx_operator import ToXLSXOperator

parquet_to_xlsx = ToXLSXOperator(
   task_id='parquet_to_xlsx',
   source='{{ var.value.tmp_path }}/test.parquet',
   target='{{ var.value.tmp_path }}/test.xlsx',
   dag=dag
)

Links

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

airflow-provider-xlsx-1.0.1.tar.gz (16.8 kB view details)

Uploaded Source

Built Distribution

airflow_provider_xlsx-1.0.1-py3-none-any.whl (21.5 kB view details)

Uploaded Python 3

File details

Details for the file airflow-provider-xlsx-1.0.1.tar.gz.

File metadata

  • Download URL: airflow-provider-xlsx-1.0.1.tar.gz
  • Upload date:
  • Size: 16.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.8.2 requests/2.25.1 setuptools/59.6.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.6.15

File hashes

Hashes for airflow-provider-xlsx-1.0.1.tar.gz
Algorithm Hash digest
SHA256 63596c14ceb923fee1cacb1d65df6de7bf3b36b6d0d969dbbf27bdd90a1c7c72
MD5 78d1072ffd196c52f2349a7303adb9ed
BLAKE2b-256 953d5c3046f8b734adeaba2bf8b4a549f91475f0bbb7ad01c65b4c21aba41a1b

See more details on using hashes here.

File details

Details for the file airflow_provider_xlsx-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: airflow_provider_xlsx-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 21.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.8.2 requests/2.25.1 setuptools/59.6.0 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.6.15

File hashes

Hashes for airflow_provider_xlsx-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 40ba22e33febe43f9a046dd53b592905c8ec154b2e3ff54411df7a610be8bab5
MD5 162d28e3bab3bc5ccad5538620d2c389
BLAKE2b-256 28046b6d521523c9c94649aea400c0bffcfef61fb8df53d07d12c015e661995b

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page