Skip to main content

A Python package that provides an efficient way to upload pandas DataFrame to MySQL and download from the database table into a DataFrame.

Project description

What is it?

pd2ml (pandas DataFrame to MySQL Loader) is a Python package that provides an efficient way to upload Pandas DataFrame to MySQL and download from the database table into a DataFrame. The application of MySQL statements LOAD DATA LOCAL INFILE and SELECT INTO OUTILE are the most essential reason for such efficient upload and download.It turns out that the advantages of pd2ml will gradually become apparent as the amount of data increases, so it will be a good assistant when dealing with massive amounts of data.

To make a digression, the name pd2ml can be pronounced as pudumal (/'pudum蓴:/), which is a homonym for purdy cat in Chinese, and this is also the inspiration for the logo of this package.

How to get it

You can install using the pip package manager by running

pip install pd2ml

Note that it is necessary for pd2ml to run in the Python3 environment.

Dependencies

Examples

It is veary easy to get started and easy to use.

Firstly, start by reading a csv file into your pandas DataFrame, e.g. by using

import pandas as pd
import sqlalchemy as sa
engine = sa.create_engine('mysql+pymysql://username:pwd@localhost:3306/db?charset=utf8&local_infile=1')
df = pd.read_csv('stock.csv')

To upload df to database, run

from pd2ml import Loader
Loader(engine).load_to(df, 'stock')

To download df from databse, run

from pd2ml import Loader
df = Loader(engine).load_from('stock')
print(df)

In particular, it supports pandas extension, so you can also use it like this

import pd2ml
# Call it directly just like any other DataFrame's native methods.
# load to db
df.load_to('stock', engine)
# load from db
df = pd.DataFrame()
df = df.load_from('stock', engine)
print(df)

What's more, it also works well with multi-processing or multi-threading

# from multiprocessing.pool import ThreadPool as Pool
from multiprocessing.pool import Pool
from utils import split_dataframe, load_to, load_from_
# load to db
with Loader(engine):
    for d in split_dataframe(df, chunk_size):
        pool.apply_async(load_to, (d, 'stock'))
# load from db
with Loader(engine):
    for table in tables:
        result = pool.apply_async(load_from_, (table, get_engine, ))
        result_arr.append(result)

For more details about examples, please see here.

Tips

To ensure that pd2ml works well, here are some tips and suggestions

  • It is essential to add parameters infile_local=1 when connecting to the database
  • To make sure pd2ml works, it must be set secure-file-priv="" in MySQL configuration file my.ini or my.cnf
  • Meanwhile, to maximize efficiency, the value innodb_buffer_pool_size may be adjusted appropriately in configuration file

Performance

To estimate the performance of pd2ml, I did a preliminary rough test on my laptop.

Test Environment

  • MySQL 5.7
  • Windows 10
  • ThinkPad of 4 cores and 8G RAM

Test Items

I created a table named stock, this table has 6 fields, of which 2 primary keys, 3 indexes. Then

  • Use pd2ml and multi-value insert SQL (INSERT INTO ... VALUES (), () ...) to insert 10000/100000/1000000 records 10 times into the empty table respectively, the average elapsed time is shown in the figure

compare_1

  • Use p2dml and multi-value replace SQL (REPLACE INTO ... VALUES (), () ...) to insert 10000/100000/1000000 records 10 times into the non-empty table, the average elapsed time is shown in the figure

compare_2

  • Use pd2ml and pandas native method read_sql_table to read 10000/100000/1000000 records 10 times from the table as DATAFRAME, the average elapsed time is shown in the figure

compare_3

Conclusion

As the amount of data increases, the speed advantage of pd2ml becomes more and more obvious. The efficiency of writing into the database is at least 20% faster, and reading from the databse is improved three to four times.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pd2ml-0.1.0.tar.gz (12.4 kB view details)

Uploaded Source

Built Distribution

pd2ml-0.1.0-py3-none-any.whl (13.1 kB view details)

Uploaded Python 3

File details

Details for the file pd2ml-0.1.0.tar.gz.

File metadata

  • Download URL: pd2ml-0.1.0.tar.gz
  • Upload date:
  • Size: 12.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for pd2ml-0.1.0.tar.gz
Algorithm Hash digest
SHA256 60cf82290fc0d10b3d87dac7d533b0b6763a0b5d3cc13e1af799e581d1d9f7f2
MD5 b49bfdce893a5f62f28ce056003a21d3
BLAKE2b-256 769509226db9b4d891bac20f0c982578123605aee9693c138a2405a80382200a

See more details on using hashes here.

File details

Details for the file pd2ml-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pd2ml-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 13.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for pd2ml-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 beb9f6695afab3534de525abb93b0eb01122a415ce4edcc11a5084a7ea1f65c9
MD5 fdd4a8d53dbb6d3956c961b990ef8aab
BLAKE2b-256 f046034f536a9bb93bbf571a3304030a23e79bec3cc95568f2e06566403ea8ed

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page