X-I-A Easy Protobuf Convertor
Introduction
Quick protobuf serialization without any message definition. The main use case is BigQuery's Storage Write API.
Requirements
This module requires the following Python runtime and platform:
Python 3.9 or 3.10
Windows 64bit, Linux or MacOS11+
Quick start
Install the package:
pip install xia-easy-proto
Then create your first test script, test.py:
from xia_easy_proto import EasyProto

if __name__ == '__main__':
    songs = {"composer": {'given_name': 'Johann', 'family_name': 'Pachelbel'},
             "title": 'Canon in D',
             "year": [1680, 1681]}
    song_class, song_payload = EasyProto.serialize(songs)
    print(song_class)  # It is the message class
    print(song_payload)  # It is the serialized message
All you need to do is pass a Python object to EasyProto.serialize(); the module does the rest.
NO MORE precompile / NO MORE message class pre-definition.
Data Format
Structure
The module is designed to hold JSON-style records, that is, a list of Python dictionaries. An embedded value can be a dictionary or a list of dictionaries.
We apply the same rules as BigQuery tables, so any data exported by BigQuery is supported.
Note that, as in BigQuery, lists of lists are not supported.
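Because lists of lists are rejected, it can help to locate them before serializing. The helper below is a minimal sketch in plain Python; find_nested_lists and its path notation are hypothetical names used for illustration and are not part of xia_easy_proto.

```python
def find_nested_lists(value, path="$"):
    """Return the path of every list element that is itself a list."""
    found = []
    if isinstance(value, dict):
        for key, item in value.items():
            found.extend(find_nested_lists(item, f"{path}.{key}"))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            item_path = f"{path}[{index}]"
            if isinstance(item, list):
                # A list directly inside a list: the unsupported shape.
                found.append(item_path)
            # Keep walking so deeper problems are reported as well.
            found.extend(find_nested_lists(item, item_path))
    return found


ok_record = {"title": "Canon in D", "year": [1680, 1681]}
bad_record = {"title": "Canon in D", "year": [[1680], [1681]]}

print(find_nested_lists(ok_record))
print(find_nested_lists(bad_record))
```

An empty result means the record has no list nested directly inside another list.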
Data Element
Only int, float, str, bool and bytes are supported as data elements. Values of any other type are ignored during parsing. See the FAQ for how to handle other data elements such as datetime.
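To preview what would be ignored, the rule above can be imitated in plain Python. This is only a sketch of the documented behavior, not the library's own code; keep_supported and SUPPORTED_SCALARS are hypothetical names, and treating None as unsupported is an assumption.

```python
from datetime import datetime

# The scalar types listed as supported in this section.
SUPPORTED_SCALARS = (int, float, str, bool, bytes)
KEPT_TYPES = (dict, list) + SUPPORTED_SCALARS


def keep_supported(value):
    """Return a copy of value without elements of unsupported types."""
    if isinstance(value, dict):
        return {key: keep_supported(item)
                for key, item in value.items()
                if isinstance(item, KEPT_TYPES)}
    if isinstance(value, list):
        return [keep_supported(item)
                for item in value
                if isinstance(item, KEPT_TYPES)]
    return value


record = {
    "title": "Canon in D",
    "recorded_at": datetime(2020, 1, 1),  # datetime is not a supported type
    "year": [1680, 1681],
}
cleaned = keep_supported(record)
print(cleaned)
```

Comparing record with cleaned shows which fields need converting before they are passed to the serializer.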
FAQ
1. Why develop this module?
The new BigQuery Storage Write API is hard to use with Python. The data model must be compiled at design time, which is far from a Pythonic approach.
2. How to improve the performance
When transforming a huge amount of data (more than 1 GB of in-memory data), provide a complete example record to avoid a full scan of the content.
Take a simple example: [{"Hello": 1}, {"World": 2}, {"Hello": 3}, {"World": 4}, …]. The parser cannot know that the records have only two columns, "Hello" and "World", until the full scan ends. If you pass the sample_data parameter as:
EasyProto.serialize(songs, sample_data=[{"Hello": 1, "World": 2}])
CPU and RAM consumption will be dramatically reduced.
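If no complete example record is at hand, one can be assembled from the first few records. The sketch below is plain Python and independent of the library; build_sample, merge_into and the limit parameter are hypothetical, and lists of dictionaries are copied from their first occurrence rather than merged.

```python
from itertools import islice


def merge_into(sample, record):
    """Add to sample every key of record that sample does not have yet."""
    for key, value in record.items():
        if isinstance(value, dict) and isinstance(sample.get(key, {}), dict):
            # Merge nested dictionaries key by key.
            merge_into(sample.setdefault(key, {}), value)
        elif key not in sample:
            sample[key] = value
    return sample


def build_sample(records, limit=1000):
    """Merge at most `limit` records into a one-element sample list."""
    sample = {}
    for record in islice(records, limit):
        merge_into(sample, record)
    return [sample]


records = [{"Hello": 1}, {"World": 2}, {"Hello": 3}, {"World": 4}]
sample_data = build_sample(records)
print(sample_data)
```

The result has the same shape as the sample_data argument shown above. The sample is only complete if every column appears within the first limit records.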
When the first serialization is finished, you get the message class as a return value. You can reuse it later:
EasyProto.serialize(songs, message_class=song_class)
When you are sure that the data structure won't change during the whole transfer, you can specify the label parameter, Song for example:
EasyProto.serialize(songs, label="Song")
In order of priority, the algorithm is:
If label is defined and a compiled message class is found under this label, use the one found
If message_class is defined, use it
If sample_data is given, compile the message class from sample_data
Otherwise, compile the message class by a full scan of the payload
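The priority order above can be sketched in plain Python. This is an illustration only, not the library's implementation: resolve_message_class, compile_from and the label cache are hypothetical, the "compiled class" is replaced by a tuple of field names, and storing the result under the label after the first call is an assumption.

```python
# Hypothetical cache: label -> previously compiled message class.
_compiled_by_label = {}


def compile_from(records):
    """Stand-in for compilation: collect the sorted field names."""
    fields = set()
    for record in records:
        fields.update(record)
    return tuple(sorted(fields))


def resolve_message_class(payload, label=None, message_class=None,
                          sample_data=None):
    """Return (message class, origin) following the documented priority."""
    # 1. A class already compiled under this label wins.
    if label is not None and label in _compiled_by_label:
        return _compiled_by_label[label], "label cache"
    if message_class is not None:
        # 2. An explicitly provided class comes next.
        resolved, origin = message_class, "message_class"
    elif sample_data is not None:
        # 3. Compile from the sample only.
        resolved, origin = compile_from(sample_data), "sample_data"
    else:
        # 4. Last resort: scan the whole payload.
        resolved, origin = compile_from(payload), "full scan"
    if label is not None:
        # Assumption: remember the class under the label for later calls.
        _compiled_by_label[label] = resolved
    return resolved, origin


payload = [{"Hello": 1}, {"World": 2}]
first = resolve_message_class(payload, label="Song")
second = resolve_message_class(payload, label="Song")
print(first[1], second[1])
```

In this sketch the second call with the same label never looks at the payload, which is the reason a stable label is the cheapest option.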
3. How to handle complex datatypes
Data types such as datetime are never stored as datetime in the database, so it is up to you to do the adaptation. For the BigQuery use case, a datetime is saved as INTEGER with the value int(timestamp * 1000000). In any case, this module already does better than the classic streaming API because it supports the bytes type.
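A possible adaptation for datetime values is sketched below using only the standard library. The helper names are hypothetical, and treating naive datetimes as UTC is an assumption. Integer timedelta arithmetic is used in place of int(timestamp * 1000000) so that no microsecond is lost to floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MICROSECOND = timedelta(microseconds=1)


def to_epoch_micros(moment):
    """Convert a datetime to whole microseconds since the Unix epoch."""
    if moment.tzinfo is None:
        # Assumption: naive datetimes are interpreted as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    return (moment - EPOCH) // ONE_MICROSECOND


def from_epoch_micros(micros):
    """Convert microseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


recorded_at = datetime(2020, 1, 1, 12, 30, 15, 123456, tzinfo=timezone.utc)
as_integer = to_epoch_micros(recorded_at)
print(as_integer)
print(from_epoch_micros(as_integer))
```

The integer can be placed in the record in place of the datetime before the record is serialized.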
4. How to do data validation
We want to keep things as simple as possible. You should run your own data validation before passing in the Python data object. Again, compared to the classic JSON format, we don't lose any functionality.
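What such a validation step looks like depends on your data. As one possible starting point, here is a minimal sketch that checks top-level fields against expected Python types; validate_record and SONG_TYPES are hypothetical names and not part of the module.

```python
def validate_record(record, expected_types):
    """Return a list of problems found in record; empty means valid."""
    errors = []
    for field, expected in expected_types.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in expected_types:
            errors.append(f"{field}: unexpected field")
    return errors


# Hypothetical expectations for the song record used in this document.
SONG_TYPES = {"title": str, "year": list, "composer": dict}

good = {"composer": {"given_name": "Johann"}, "title": "Canon in D",
        "year": [1680, 1681]}
bad = {"title": 42, "year": [1680]}

print(validate_record(good, SONG_TYPES))
print(validate_record(bad, SONG_TYPES))
```

Records that yield a non-empty list can be rejected or repaired before serialization.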
5. Where to find the source code
Using this module will be always FREE.
This project will be open sourced when it becomes popular.
Bigquery Integration
Here is an example that writes the song data to BigQuery:
import asyncio

from google.protobuf.descriptor_pb2 import DescriptorProto
from google.cloud.bigquery_storage_v1.types.storage import AppendRowsRequest
from google.cloud.bigquery_storage_v1.types.protobuf import ProtoSchema, ProtoRows
from google.cloud.bigquery_storage_v1.services.big_query_write import BigQueryWriteAsyncClient
from xia_easy_proto import EasyProto

songs = {"composer": {'given_name': 'Johann', 'family_name': 'Pachelbel'},
         "title": 'Canon in D',
         "year": [1680, 1681]}
song_class, song_payload = EasyProto.serialize(songs)


async def main():
    # Replace the "xxx" placeholders with your project, dataset and table
    stream_path = BigQueryWriteAsyncClient.write_stream_path("xxx", "xxx", "xxx", "_default")
    bq_write_client = BigQueryWriteAsyncClient()

    # Describe the generated message class so BigQuery can decode the rows
    proto_descriptor = DescriptorProto()
    song_class().DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema = ProtoSchema(proto_descriptor=proto_descriptor)

    proto_data = AppendRowsRequest.ProtoData(
        rows=ProtoRows(serialized_rows=song_payload),
        writer_schema=proto_schema
    )
    append_row_request = AppendRowsRequest(
        write_stream=stream_path,
        proto_rows=proto_data
    )
    result = await bq_write_client.append_rows(iter([append_row_request]))
    async for item in result:
        print(item)


if __name__ == "__main__":
    asyncio.run(main())
The BigQuery table schema should be:
[
  {
    "name": "composer",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      { "name": "given_name", "type": "STRING", "mode": "NULLABLE" },
      { "name": "family_name", "type": "STRING", "mode": "NULLABLE" }
    ]
  },
  { "name": "title", "type": "STRING", "mode": "NULLABLE" },
  { "name": "lyrics", "type": "STRING", "mode": "NULLABLE" },
  { "name": "year", "type": "INTEGER", "mode": "REPEATED" }
]
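A schema of this shape can be drafted from a sample record. The sketch below is plain Python and not a feature of the module; infer_schema, infer_field and BQ_TYPES are hypothetical names. It assumes that every non-list field is NULLABLE and that the first element of a list is representative of the rest.

```python
import json

# Python scalar type -> BigQuery column type.
BQ_TYPES = {bool: "BOOLEAN", int: "INTEGER", float: "FLOAT",
            str: "STRING", bytes: "BYTES"}


def infer_field(name, value):
    """Return the schema entry for one field, or None if it cannot be typed."""
    mode = "NULLABLE"
    if isinstance(value, list):
        if not value:
            return None  # an empty list gives no element type
        mode, value = "REPEATED", value[0]
    if isinstance(value, dict):
        return {"name": name, "type": "RECORD", "mode": mode,
                "fields": infer_schema(value)}
    bq_type = BQ_TYPES.get(type(value))
    if bq_type is None:
        return None  # unsupported type, including a list inside a list
    return {"name": name, "type": bq_type, "mode": mode}


def infer_schema(record):
    """Return a list of BigQuery-style field definitions for record."""
    fields = (infer_field(name, value) for name, value in record.items())
    return [field for field in fields if field is not None]


songs = {"composer": {"given_name": "Johann", "family_name": "Pachelbel"},
         "title": "Canon in D", "year": [1680, 1681]}
schema = infer_schema(songs)
print(json.dumps(schema, indent=2))
```

Columns that never appear in the sample, such as lyrics in the schema above, cannot be inferred and have to be added by hand.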