A library to accelerate ML and ETL pipelines by connecting to all your data sources


DataLigo

This library helps you read and write data across most common data sources. It accelerates ML and ETL workflows so you don't have to worry about managing multiple data connectors.

Installation

pip install -U dataligo

Install from source

Alternatively, you can clone the latest version from the repository and install it directly from source:

pip install -e .

Quick tour

>>> from dataligo import Ligo
>>> from transformers import pipeline

>>> ligo = Ligo('./ligo_config.yaml') # Check the sample_ligo_config.yaml for reference
>>> print(ligo.get_supported_data_sources_list())
['s3',
 'gcs',
 'azureblob',
 'bigquery',
 'snowflake',
 'redshift',
 'starrocks',
 'postgresql',
 'mysql',
 'oracle',
 'mssql',
 'mariadb',
 'sqlite',
 'elasticsearch',
 'mongodb',
 'dynamodb',
 'redis']

>>> mongodb = ligo.connect('mongodb')
>>> df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='pandas') # Default return_type is pandas; see the polars example after this tour
>>> df.head()
                        _id                                             Review
0  64272bb06a14f52787e0a09e                              good and interesting
1  64272bb06a14f52787e0a09f  This class is very helpful to me. Currently, I...
2  64272bb06a14f52787e0a0a0  like!Prof and TAs are helpful and the discussi...
3  64272bb06a14f52787e0a0a1  Easy to follow and includes a lot basic and im...
4  64272bb06a14f52787e0a0a2  Really nice teacher!I could got the point eazl...

>>> classifier = pipeline("sentiment-analysis")
>>> reviews = df.Review.tolist()
>>> results = classifier(reviews, truncation=True)
>>> for result in results:
...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.9997
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.999
label: POSITIVE, with score: 0.9967

>>> df['predicted_label'] = [result['label'] for result in results]
>>> df['predicted_score'] = [round(result['score'], 4) for result in results]

# Write the results back to MongoDB
>>> mongodb.write_dataframe(df, 'reviewdb', 'review_sentiments')
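
The return_type parameter controls which dataframe flavor a read returns. Continuing the session above, a minimal sketch that pulls the same collection back as a polars frame (only connect, read_as_dataframe, and return_type are shown in the tour; the rest is standard polars):

>>> pl_df = mongodb.read_as_dataframe(database='reviewdb',
...                                   collection='reviews',
...                                   return_type='polars')
>>> pl_df.head()  # now a polars DataFrame instead of pandas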

Example DataLigo Pipelines

ETL Pipeline

[Diagram: DataLigo ETL pipeline]

ML Pipeline

[Diagram: DataLigo ML pipeline]

Supported Connectors

Data Source          Type            pandas        polars        dask
                                     read  write   read  write   read  write
S3                   datalake        [x]   [x]     [x]   [x]     [ ]   [ ]
GCS                  datalake        [x]   [x]     [x]   [x]     [ ]   [ ]
Azure Blob Storage   datalake        [x]   [x]     [x]   [x]     [ ]   [ ]
Snowflake            datawarehouse   [x]   [x]     [x]   [x]     [ ]   [ ]
BigQuery             datawarehouse   [x]   [x]     [x]   [x]     [x]   [ ]
StarRocks            datawarehouse   [x]   [x]     [x]   [x]     [x]   [ ]
Redshift             datawarehouse   [x]   [x]     [x]   [x]     [x]   [ ]
PostgreSQL           database        [x]   [x]     [x]   [x]     [x]   [ ]
MySQL                database        [x]   [x]     [x]   [x]     [x]   [ ]
MariaDB              database        [x]   [x]     [x]   [x]     [x]   [ ]
MsSQL                database        [x]   [x]     [x]   [x]     [x]   [ ]
Oracle               database        [x]   [x]     [x]   [x]     [x]   [ ]
SQLite               database        [x]   [x]     [x]   [x]     [x]   [ ]
MongoDB              nosql           [x]   [x]     [x]   [x]     [ ]   [ ]
ElasticSearch        nosql           [x]   [x]     [x]   [x]     [ ]   [ ]
DynamoDB             nosql           [x]   [x]     [x]   [x]     [ ]   [ ]
Redis (beta)         nosql           [x]   [ ]     [ ]   [ ]     [ ]   [ ]
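
Per the table, the relational connectors can also return dask dataframes for reads. A minimal sketch, assuming the SQL connectors follow the same connect/read pattern as the MongoDB tour above; the read_as_dataframe method name and its query parameter are assumptions here, not confirmed API:

>>> postgres = ligo.connect('postgresql')  # credentials come from ligo_config.yaml
>>> # hypothetical call: method and parameter names mirror the MongoDB example
>>> ddf = postgres.read_as_dataframe(query='SELECT * FROM reviews', return_type='dask')
>>> ddf.npartitions  # a lazy dask DataFrame; rows load only when computed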

Acknowledgement

Some of DataLigo's functionality is inspired by the following packages.

  • ConnectorX

DataLigo uses ConnectorX to read from most RDBMS sources for its performance benefits, and the return_type parameter is inspired by it.

  • dynamo-pandas

DataLigo uses dynamo-pandas to read and write DynamoDB data.
