A library to accelerate ML and ETL pipelines by connecting to all your data sources
DataLigo
DataLigo helps you read and write data across most common data sources. It accelerates ML and ETL work so you don't have to worry about managing multiple data connectors.
Installation
pip install -U dataligo
Install from source
Alternatively, you can clone the latest version from the repository and install it directly from the source code:
pip install -e .
Quick tour
>>> from dataligo import Ligo
>>> from transformers import pipeline
>>> ligo = Ligo('./ligo_config.yaml') # Check the sample_ligo_config.yaml for reference
>>> print(ligo.get_supported_data_sources_list())
['s3',
'gcs',
'azureblob',
'bigquery',
'snowflake',
'redshift',
'starrocks',
'postgresql',
'mysql',
'oracle',
'mssql',
'mariadb',
'sqlite',
'elasticsearch',
'mongodb',
'dynamodb',
'redis']
>>> mongodb = ligo.connect('mongodb')
>>> df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='pandas') # Default return_type is pandas
>>> df.head()
_id Review
0 64272bb06a14f52787e0a09e good and interesting
1 64272bb06a14f52787e0a09f This class is very helpful to me. Currently, I...
2 64272bb06a14f52787e0a0a0 like!Prof and TAs are helpful and the discussi...
3 64272bb06a14f52787e0a0a1 Easy to follow and includes a lot basic and im...
4 64272bb06a14f52787e0a0a2 Really nice teacher!I could got the point eazl...
>>> classifier = pipeline("sentiment-analysis")
>>> reviews = df.Review.tolist()
>>> results = classifier(reviews, truncation=True)
>>> for result in results:
...     print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.9997
label: POSITIVE, with score: 0.9999
label: POSITIVE, with score: 0.999
label: POSITIVE, with score: 0.9967
>>> df['predicted_label'] = [result['label'] for result in results]
>>> df['predicted_score'] = [round(result['score'], 4) for result in results]
>>> # Write the results back to MongoDB
>>> mongodb.write_dataframe(df, 'reviewdb', 'review_sentiments')
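The Ligo constructor above takes a YAML config holding the credentials for each data source. The snippet below is only an illustrative guess at its shape; the keys are not the confirmed schema, so check sample_ligo_config.yaml in the repository for the actual format:

# Illustrative guess — see sample_ligo_config.yaml for the real keys
mongodb:
  connection_string: mongodb://localhost:27017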
Example DataLigo Pipeline
ETL Pipeline
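As a minimal sketch of what an ETL flow can look like with DataLigo, using only the connector methods shown in the quick tour (the target collection name reviews_clean is illustrative):

from dataligo import Ligo

ligo = Ligo('./ligo_config.yaml')
mongodb = ligo.connect('mongodb')

# Extract: pull raw reviews from MongoDB
df = mongodb.read_as_dataframe(database='reviewdb', collection='reviews')

# Transform: basic cleanup with pandas
df = df.dropna(subset=['Review'])
df['Review'] = df['Review'].str.strip()

# Load: write the cleaned data to a new collection
mongodb.write_dataframe(df, 'reviewdb', 'reviews_clean')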
ML Pipeline
Supported Connectors
| Data Sources | Type | pandas | polars | dask |
|---|---|---|---|---|
| S3 | datalake | ✅ | ✅ | ✅ |
| GCS | datalake | ✅ | ✅ | ✅ |
| Azure Blob Storage | datalake | ✅ | ✅ | ✅ |
| Snowflake | datawarehouse | ✅ | ✅ | ✅ |
| BigQuery | datawarehouse | ✅ | ✅ | ✅ |
| StarRocks | datawarehouse | ✅ | ✅ | ✅ |
| Redshift | datawarehouse | ✅ | ✅ | ✅ |
| PostgreSQL | database | ✅ | ✅ | ✅ |
| MySQL | database | ✅ | ✅ | ✅ |
| MariaDB | database | ✅ | ✅ | ✅ |
| MsSQL | database | ✅ | ✅ | ✅ |
| Oracle | database | ✅ | ✅ | ✅ |
| SQLite | database | ✅ | ✅ | ✅ |
| MongoDB | nosql | ✅ | ✅ | ✅ |
| ElasticSearch | nosql | ✅ | ✅ | ✅ |
| DynamoDB | nosql | ✅ | ✅ | ✅ |
| Redis | nosql | ✅ | ✅ | ✅ |
Acknowledgement
Some functionalities of DataLigo are inspired by the following packages:
- DataLigo uses ConnectorX to read data from most of the RDBMS databases, both for its performance benefits and as the inspiration for the return_type parameter (see the example after this list)
- DataLigo uses dynamo-pandas to read and write data from DynamoDB
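For example, the quick-tour call can return a polars DataFrame instead of pandas, assuming the connector has already been created as shown above:

>>> df_pl = mongodb.read_as_dataframe(database='reviewdb', collection='reviews', return_type='polars')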