pandas-tfrecords

Converts pandas DataFrames to TFRecords and TFRecords back to pandas DataFrames

Project description

This project was inspired by spark-tensorflow-connector and implements similar functionality: it saves a pandas DataFrame to TFRecords and restores TFRecords to a pandas DataFrame.

It works with both local files and AWS S3 files. Keep in mind that TensorFlow works here with local copies of the remote files, which are synced with S3 via s3fs. I used this workaround because my TensorFlow v2.1.0 did not work with S3 directly and raised the authentication error "Credentials have expired attempting to repull from EC2 Metadata Service"; this may have been fixed since.

Quick start

pip install pandas-tfrecords

import pandas as pd
from pandas_tfrecords import pd2tf, tf2pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [[1, 2], [3, 4], [5, 6]]})

# local
pd2tf(df, './tfrecords')
my_df = tf2pd('./tfrecords')

# S3
pd2tf(df, 's3://my-bucket/tfrecords')
my_df = tf2pd('s3://my-bucket/tfrecords')

Converted types

pandas -> tfrecords

bytes, str -> tf.string
int, np.integer -> tf.int64
float, np.floating -> tf.float32
list, np.ndarray of bytes, str, int, np.integer, float, np.floating -> sequence of tf.string, tf.int64, tf.float32

tfrecords -> pandas

tf.string -> bytes
tf.int64 -> int
tf.float32 -> float
sequence of tf.string, tf.int64, tf.float32 -> list of bytes, int, float
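
The two tables above can be read as a simple classification rule. The following is a minimal sketch of that rule in plain Python, not the library's actual code: the function names `scalar_dtype` and `describe` are hypothetical, and `np.ndarray` handling is left out so the example needs only the standard library. NumPy integer and floating scalars register with `numbers.Integral` and `numbers.Real`, so they would follow the same branches.

```python
import numbers


def scalar_dtype(value):
    """Return the TFRecord dtype name that the tables assign to one scalar."""
    if isinstance(value, (bytes, str)):
        return "tf.string"
    if isinstance(value, numbers.Integral):
        return "tf.int64"
    if isinstance(value, numbers.Real):
        return "tf.float32"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def describe(value):
    """Classify a cell value: a scalar, or a one-dimensional list of scalars."""
    if isinstance(value, list):
        # A nested list fails here, because a list is not a supported scalar.
        kinds = {scalar_dtype(item) for item in value}
        if len(kinds) != 1:
            raise TypeError("a sequence must be non-empty and hold one kind of scalar")
        return "sequence of " + kinds.pop()
    return scalar_dtype(value)


print(describe(1))         # tf.int64
print(describe(0.5))       # tf.float32
print(describe("a"))       # tf.string
print(describe([1, 2]))    # sequence of tf.int64
```

Rejecting empty and mixed-type lists is a choice made for this sketch only; the package description does not say how the library treats them.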

NB! This works only with one-dimensional arrays. [1, 2, 3] is converted in both directions, but [[1, 2, 3]] is not converted in either direction. It behaves this way because spark-tensorflow-connector behaves similarly, and I have not yet worked out how to implement nested sequences. To work with nested sequences, reshape them to one dimension before saving and back to the original shape after reading.
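
One way to apply the reshape advice is to store a flattened copy of each nested value together with its shape, since both are one-dimensional lists of integers. The sketch below uses hypothetical helpers `flatten` and `unflatten` written in plain Python; with NumPy, `np.asarray(x).ravel()` and `reshape(shape)` would do the same job.

```python
def flatten(rows):
    """Flatten a rectangular 2-D list into (flat_values, shape)."""
    n_cols = len(rows[0]) if rows else 0
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    flat = [value for row in rows for value in row]
    return flat, [len(rows), n_cols]


def unflatten(flat, shape):
    """Rebuild the 2-D list from the flat values and the stored shape."""
    n_rows, n_cols = shape
    if n_rows * n_cols != len(flat):
        raise ValueError("shape does not match the number of values")
    return [flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


nested = [[1, 2, 3], [4, 5, 6]]
flat, shape = flatten(nested)      # both are one-dimensional int lists
restored = unflatten(flat, shape)  # back to the nested form after reading
print(flat, shape, restored)
```

In a DataFrame this would mean two columns per nested column, for example one for the values and one for the shape; those column names are up to you.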

API

pandas_tfrecords.pandas_to_tfrecords(df, folder, compression_type='GZIP', compression_level=9, columns=None, max_mb=50)

Arguments:

df - pandas DataFrame. See the note above about nested sequences.
folder - folder to save the TFRecords to, local or S3. If you plan to read from this folder later, make sure it contains no other files or folders.
compression_type='GZIP' - compression type: 'GZIP', 'ZLIB' or None. If None, the data is not compressed.
compression_level=9 - compression level, 0…9.
columns=None - list of columns to save. If None, all columns are saved.
max_mb=50 - maximum size of uncompressed data to save per file, in megabytes. If the DataFrame's total size exceeds this limit, several files are saved. If None, the size is not limited and a single file is saved.

alias pandas_tfrecords.pd2tf
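
To illustrate what the max_mb argument implies, here is a rough sketch of how rows could be grouped into several files by uncompressed size. It is an assumption about the general idea, not the library's actual splitting logic: the helper `split_by_size` is hypothetical, and so is treating one megabyte as 1024 * 1024 bytes.

```python
def split_by_size(row_sizes, max_mb=50):
    """Group consecutive row indices so each group holds at most max_mb of data.

    row_sizes: uncompressed size of each row in bytes.
    Returns a list of index lists, one per output file.
    """
    if max_mb is None:
        # No limit: everything goes into a single file.
        return [list(range(len(row_sizes)))]
    limit = max_mb * 1024 * 1024  # assumed definition of a megabyte
    chunks, current, current_size = [], [], 0
    for index, size in enumerate(row_sizes):
        # Start a new file when adding this row would exceed the limit.
        if current and current_size + size > limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


MB = 1024 * 1024
print(split_by_size([20 * MB] * 4, max_mb=50))    # [[0, 1], [2, 3]]
print(split_by_size([20 * MB] * 4, max_mb=None))  # [[0, 1, 2, 3]]
```

A single row larger than the limit still gets its own file in this sketch, because a row cannot be split.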

pandas_tfrecords.tfrecords_to_pandas(file_paths, schema=None, compression_type='auto', cast=True)

Arguments:

file_paths - a single file path or folder, or a sequence of them, local or S3, to read TFRecords from.
schema=None - if None, the schema is detected automatically. You can also pass a schema to read only selected columns. It must be a dict whose keys are column names and whose values are column data types: str (or bytes), int, float. For sequences, wrap the type in a list: [str] (or [bytes]), [int], [float]. For example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [[1, 2], [3, 4], [5, 6]]})
print(df)
   A  B       C
0  1  a  [1, 2]
1  2  b  [3, 4]
2  3  c  [5, 6]

pd2tf(df, './tfrecords')
tf2pd('./tfrecords', schema={'A': int, 'C': [int]})
   A       C
0  1  [1, 2]
1  2  [3, 4]
2  3  [5, 6]

compression_type='auto' - compression type: 'auto', 'GZIP', 'ZLIB' or None.
cast=True - if True, bytes data read from tf.string is cast after conversion. It tries int, float and str, in that order, and keeps the value as is if none of them works.

alias pandas_tfrecords.tf2pd
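
The cast=True behaviour described above can be pictured with a small sketch. This is an illustration of the "int, then float, then str" rule, not the library's implementation: the helper name `cast_bytes` and the choice of UTF-8 decoding are assumptions.

```python
def cast_bytes(value):
    """Try to cast a bytes value to int, then float, then str; else keep it."""
    if not isinstance(value, bytes):
        return value
    try:
        text = value.decode("utf-8")  # assumed encoding
    except UnicodeDecodeError:
        return value  # not valid text, keep the raw bytes
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return text


print(cast_bytes(b"42"))        # 42
print(cast_bytes(b"1.5"))       # 1.5
print(cast_bytes(b"abc"))       # abc
print(cast_bytes(b"\xff\xfe"))  # b'\xff\xfe'
```

This shows why a column of numeric strings can come back as numbers when cast is enabled; pass cast=False to tf2pd if you need the raw bytes.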

Release history

0.1.6 - Sep 23, 2021
0.1.5 - Jun 10, 2021
0.1.4 - Mar 1, 2020
0.1.3 - Feb 21, 2020 (this version)
0.1.2 - Feb 21, 2020
0.1.1 - Feb 21, 2020
0.1 - Feb 20, 2020

Download files

Source Distribution

pandas-tfrecords-0.1.3.tar.gz (5.3 kB view details)

Uploaded Feb 21, 2020 Source

Built Distribution

pandas_tfrecords-0.1.3-py3-none-any.whl (6.1 kB view details)

Uploaded Feb 21, 2020 Python 3

File details

Details for the file pandas-tfrecords-0.1.3.tar.gz.

File metadata

Download URL: pandas-tfrecords-0.1.3.tar.gz
Upload date: Feb 21, 2020
Size: 5.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.2

File hashes

Hashes for pandas-tfrecords-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`61426b5689200976a681e238e24a7eff5f1fce462b451a2d9c3e2f38ca013df1`
MD5	`418d0e0254f9e4cd812378c54f788dbe`
BLAKE2b-256	`f75c47c84a82d7ae84c52eddb57eae5443f6e13a01f3a1fde607c41b25e3acc1`

File details

Details for the file pandas_tfrecords-0.1.3-py3-none-any.whl.

File metadata

Download URL: pandas_tfrecords-0.1.3-py3-none-any.whl
Upload date: Feb 21, 2020
Size: 6.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.43.0 CPython/3.7.2

File hashes

Hashes for pandas_tfrecords-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7439b5bdf35ce5c33e283bb3e1cf240ac454f5b215a5b5f747a0e9b74eb358fa`
MD5	`4990900fcf2a1238d3ba63ed9de7a5d9`
BLAKE2b-256	`f1cda63e5c37d0a15693da02c1601c32d281a9f5c0048c07b9cc7a53f5251216`
