A Python package that provides an efficient way to upload pandas DataFrame to MySQL and download from the database table into a DataFrame.
Project description
What is it?
pd2ml (pandas DataFrame to MySQL Loader) is a Python package that provides an efficient way to upload Pandas DataFrame
to MySQL
and download from the database table into a DataFrame. The application of MySQL statements LOAD DATA LOCAL INFILE
and SELECT INTO OUTILE
are the most essential reason for such efficient upload and download.It turns out that the advantages of pd2ml will gradually become apparent as the amount of data increases, so it will be a good assistant when dealing with massive amounts of data.
To make a digression, the name pd2ml can be pronounced as pudumal (/'pudum蓴:/), which is a homonym for purdy cat in Chinese, and this is also the inspiration for the logo of this package.
How to get it
You can install using the pip package manager by running
pip install pd2ml
Note that it is necessary for pd2ml to run in the Python3
environment.
Dependencies
Examples
It is veary easy to get started and easy to use.
Firstly, start by reading a csv file into your pandas DataFrame, e.g. by using
import pandas as pd
import sqlalchemy as sa
engine = sa.create_engine('mysql+pymysql://username:pwd@localhost:3306/db?charset=utf8&local_infile=1')
df = pd.read_csv('stock.csv')
To upload df
to database, run
from pd2ml import Loader
Loader(engine).load_to(df, 'stock')
To download df
from databse, run
from pd2ml import Loader
df = Loader(engine).load_from('stock')
print(df)
In particular, it supports pandas extension, so you can also use it like this
import pd2ml
# Call it directly just like any other DataFrame's native methods.
# load to db
df.load_to('stock', engine)
# load from db
df = pd.DataFrame()
df = df.load_from('stock', engine)
print(df)
What's more, it also works well with multi-processing or multi-threading
# from multiprocessing.pool import ThreadPool as Pool
from multiprocessing.pool import Pool
from utils import split_dataframe, load_to, load_from_
# load to db
with Loader(engine):
for d in split_dataframe(df, chunk_size):
pool.apply_async(load_to, (d, 'stock'))
# load from db
with Loader(engine):
for table in tables:
result = pool.apply_async(load_from_, (table, get_engine, ))
result_arr.append(result)
For more details about examples, please see here.
Tips
To ensure that pd2ml works well, here are some tips and suggestions
- It is essential to add parameters
infile_local=1
when connecting to the database - To make sure pd2ml works, it must be set
secure-file-priv=""
in MySQL configuration filemy.ini
ormy.cnf
- Meanwhile, to maximize efficiency, the value
innodb_buffer_pool_size
may be adjusted appropriately in configuration file
Performance
To estimate the performance of pd2ml, I did a preliminary rough test on my laptop.
Test Environment
- MySQL 5.7
- Windows 10
- ThinkPad of 4 cores and 8G RAM
Test Items
I created a table named stock, this table has 6 fields, of which 2 primary keys, 3 indexes. Then
- Use pd2ml and multi-value insert SQL (
INSERT INTO ... VALUES (), () ...
) to insert 10000/100000/1000000 records 10 times into the empty table respectively, the average elapsed time is shown in the figure
- Use p2dml and multi-value replace SQL (
REPLACE INTO ... VALUES (), () ...
) to insert 10000/100000/1000000 records 10 times into the non-empty table, the average elapsed time is shown in the figure
- Use pd2ml and pandas native method
read_sql_table
to read 10000/100000/1000000 records 10 times from the table as DATAFRAME, the average elapsed time is shown in the figure
Conclusion
As the amount of data increases, the speed advantage of pd2ml becomes more and more obvious. The efficiency of writing into the database is at least 20% faster, and reading from the databse is improved three to four times.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file pd2ml-0.1.0.tar.gz
.
File metadata
- Download URL: pd2ml-0.1.0.tar.gz
- Upload date:
- Size: 12.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 60cf82290fc0d10b3d87dac7d533b0b6763a0b5d3cc13e1af799e581d1d9f7f2 |
|
MD5 | b49bfdce893a5f62f28ce056003a21d3 |
|
BLAKE2b-256 | 769509226db9b4d891bac20f0c982578123605aee9693c138a2405a80382200a |
File details
Details for the file pd2ml-0.1.0-py3-none-any.whl
.
File metadata
- Download URL: pd2ml-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/49.2.0 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | beb9f6695afab3534de525abb93b0eb01122a415ce4edcc11a5084a7ea1f65c9 |
|
MD5 | fdd4a8d53dbb6d3956c961b990ef8aab |
|
BLAKE2b-256 | f046034f536a9bb93bbf571a3304030a23e79bec3cc95568f2e06566403ea8ed |