ParquetLoader
Parquet file Load and Read from MinIO & S3 or Local
This repository helps you read parquet files when you are training a model or analyzing big data.
1. Installation
1.1. Install from pip
pip install parquet-loader
1.2. Install from source code
git clone https://github.com/Keunyoung-Jung/ParquetLoader
cd ParquetLoader
pip install -e .
2. Introduction
ParquetLoader helps you read large parquet files.
It is built on top of pandas and fastparquet, which makes it useful in situations where a Spark cluster is not available.
It loads data into memory chunk by chunk, based on the configured chunk size, and returns each chunk as a pandas dataframe.
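Under the hood this is the same idea as iterating a parquet dataset one row group at a time with fastparquet. The sketch below is only an illustration of that mechanism (the file path is hypothetical, and this is not ParquetLoader's actual implementation):
from fastparquet import ParquetFile

# open one parquet file (hypothetical path) and stream it row group by row group
pf = ParquetFile('parquet_data/part-0000-example.snappy.parquet')
for chunk in pf.iter_row_groups(columns=['name', 'age']):
    # each chunk arrives as a pandas DataFrame
    print(chunk.head())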
3. Quick Start
3.1. Local Path
If your files are located locally, you can load the data this way.
from ParquetLoader import DataLoader

dl = DataLoader(
    folder='parquet_data',
    shuffle=False
)

for df in dl:
    print(df.head())
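Each iteration yields one chunk as a pandas dataframe, so if the whole dataset fits in memory you can also concatenate the chunks into a single dataframe, for example:
import pandas as pd
from ParquetLoader import DataLoader

dl = DataLoader(
    folder='parquet_data',
    shuffle=False
)

# stitch all chunks together (only sensible when the full dataset fits in memory)
full_df = pd.concat(list(dl), ignore_index=True)
print(full_df.shape)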
3.2. S3 Path
If your files are located in S3 or MinIO, you have to set the following environment variables.
export AWS_ACCESS_KEY_ID=my-access-key
export AWS_SECRET_ACCESS_KEY=my-secret-key
export AWS_DEFAULT_REGION=ap-northeast-2
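If you prefer to set the credentials from Python (for example inside a notebook), setting the same variables via os.environ before creating the loader should work as well:
import os

# same values as the shell exports above
os.environ['AWS_ACCESS_KEY_ID'] = 'my-access-key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'my-secret-key'
os.environ['AWS_DEFAULT_REGION'] = 'ap-northeast-2'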
Once these are set, you can load data this way.
from ParquetLoader import S3Loader

sl = S3Loader(
    bucket='mysterico-feature-store',
    folder='mongo-sentence-token-feature',
    depth=2
)

for df in sl:
    print(df.head())
4. Parameters
ParquetLoader controls how data is read through the parameters below; a combined example follows the parameter descriptions. The only difference between S3Loader and DataLoader is the root_path parameter (S3Loader uses bucket instead).
dl = DataLoader(
    chunk_size: int = 100000,
    root_path: str = '.',    # S3Loader uses "bucket" instead
    folder: str = 'data',
    shuffle: bool = True,
    random_seed: int = int((time() - int(time()))*100000),
    columns: list = None,
    depth: int = 0,
    std_out: bool = True,
    filters: list = None
)
chunk_size
- default : 100,000 rows
- This parameter controls the number of rows loaded into memory when reading data.
root_path
or bucket
- default : current path
- This parameter is used to specify the project path or datastore path.
folder
- default : "data"
- This parameter specifies the folder where the parquet files are actually gathered.
shuffle
- default : True
- Whether to shuffle the data.
random_seed
- default : int((time() - int(time()))*100000)
- You can fix the order of the shuffled data by passing a fixed random seed.
columns
- default : None
- When reading data, you can select columns.
depth
- default : 0
- Used when the parquet files in the folder are partitioned and nested at some depth.
std_out
- default : True
- You can turn off output.
filters
- default : None
- Used when you want a filtered dataframe; it must be a 2-dimensional list (see section 7).
- example : [[("column","==",10)]]
4.1. Select Columns
The columns param takes a list.
dl = DataLoader(
    folder='parquet_data',
    columns=['name','age','gender']
)
4.2. Setting depth
Use this if your parquet files are partitioned and the directory tree has depth.
Example
📦 data
 ┣ 📦 Year=2020
 ┃ ┣ 📜 part-0000-example.snappy.parquet
 ┃ ┗ 📜 part-0001-example.snappy.parquet
 ┗ 📦 Year=2021
   ┣ 📜 part-0002-example.snappy.parquet
   ┗ 📜 part-0003-example.snappy.parquet
The data path in this example has a depth of 1.
dl = DataLoader(
    folder='parquet_data',
    depth=1
)
5. Get Metadata
A DataLoader object can read the metadata of your parquet data.
print(data_loader.schema) # get data schema
print(data_loader.columns) # get data columns
print(data_loader.count) # get total count
print(data_loader.info) # get metadata information
6. Customize S3 Path
If you use MinIO or another object storage service, you will use the S3 parameters.
dl = S3Loader(
    s3_endpoint_url: str = '',
    s3_access_key: str = '',
    s3_secret_key: str = '',
    bucket: str = '.',
    folder: str = 'data',
)
s3_endpoint_url
- The endpoint URL of the storage you want to use
- example : "http://mino-service.kubeflow:9000"
s3_access_key and s3_secret_key
- You can set s3_access_key and s3_secret_key directly, but this is not recommended.
- It is recommended to use environment variables instead.
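For example, a loader pointed at a MinIO endpoint could look like this (the bucket and folder names are placeholders; the endpoint URL is the one from the example above):
from ParquetLoader import S3Loader

sl = S3Loader(
    s3_endpoint_url='http://mino-service.kubeflow:9000',  # MinIO endpoint
    bucket='my-bucket',
    folder='my-parquet-folder'
)

for df in sl:
    print(df.head())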
7. Get Filtered Dataframe
Use this when you want a filtered dataframe. Filters are built as a two-dimensional list, equivalent to the fastparquet filters argument.
dl = S3Loader(
    bucket='test',
    folder='data',
    filters=[[("col1", ">", 10)]]
)
The first (outer) list combines its elements with OR.
# col1 > 10 or col2 in ["children","kids"]
filters = [
    [("col1", ">", 10)],
    [("col2", "in", ["children", "kids"])]
]
The second (inner) list combines its conditions with AND.
# col1 > 10 and col2 == "male"
filters = [
    [("col1", ">", 10), ("col2", "==", "male")]
]
You can also mix the two to make a filter.
# (col1 > 10 and col2 == "male") or col3 in ["children","kids"]
filters = [
    [("col1", ">", 10), ("col2", "==", "male")],
    [("col3", "in", ["children", "kids"])]
]
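A filter list like the last one is passed straight through the filters parameter; an illustrative end-to-end call (folder and column names are placeholders):
from ParquetLoader import DataLoader

# keep rows where (col1 > 10 and col2 == "male") or col3 in ["children","kids"]
dl = DataLoader(
    folder='parquet_data',
    filters=[
        [("col1", ">", 10), ("col2", "==", "male")],
        [("col3", "in", ["children", "kids"])]
    ]
)

for df in dl:
    print(df.head())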