The following describes `NOCK`, a proposed standard for a Parquet format that supports efficient distributed serialization of multiple kinds of graph technologies.
The `pynock` library provides examples of efficient low-level Parquet read/write in Python.
Our intent is to serialize graphs in a way that aligns the data representations required by popular graph technologies and related data sources:
- semantic graphs (e.g., W3C formats RDF, TTL, JSON-LD, etc.)
- labeled property graphs (e.g., openCypher)
- probabilistic graphs (e.g., PSL)
- spreadsheet import/export (e.g., CSV)
- dataframes (e.g., Pandas, Dask, Spark, etc.)
- edge lists (e.g., NetworkX, cuGraph, etc.)
This approach also supports efficient distributed partitions based on Parquet, which can scale on a cluster to very large (+1 T node) graphs.
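As a rough illustration of what aligning these representations can look like, the sketch below flattens a tiny graph into edge-list rows and writes them out as CSV, using only the Python standard library. The `Node` and `Edge` classes, the column names `src`, `rel`, `dst`, and the sample identifiers are all hypothetical; they are not the NOCK schema or the `pynock` API, and a real partition would be stored as Parquet rather than CSV.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Edge:
    """Hypothetical edge record: a relation pointing at a destination node."""
    rel: str
    dst: str


@dataclass
class Node:
    """Hypothetical node record: a name plus its outgoing edges."""
    name: str
    edges: List[Edge] = field(default_factory=list)


def to_edge_rows(nodes: List[Node]) -> List[Dict[str, str]]:
    """Flatten nodes into one row per edge, in the style of an edge list."""
    rows = []
    for node in nodes:
        for edge in node.edges:
            rows.append({"src": node.name, "rel": edge.rel, "dst": edge.dst})
    return rows


def rows_to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize edge rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["src", "rel", "dst"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample data, for illustration only.
graph = [
    Node(
        "recipe:1",
        [Edge("uses", "ingredient:flour"), Edge("uses", "ingredient:egg")],
    ),
    Node("ingredient:flour"),
    Node("ingredient:egg"),
]

edge_rows = to_edge_rows(graph)
csv_text = rows_to_csv(edge_rows)
print(csv_text)
```

The same row-per-edge shape maps naturally onto a dataframe or a columnar file, which is the general idea behind using one tabular layout for several graph technologies.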
For details about the proposed format in Parquet files, see the
If you have questions, suggestions, or bug reports, please open an issue on our public GitHub repo.
Note that the `pynock` library does not provide any support for graph computation or querying, merely for manipulating and validating
Our intent is to provide examples where others from the broader open source developer community can help troubleshoot edge cases in Parquet.
This code has been tested and validated using Python 3.8, and we make no guarantees regarding correct behaviors on other versions.
The Parquet file formats depend on Arrow 5.0.x or later.
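A quick way to check a local environment against these stated versions is sketched below. It assumes that "Arrow" here refers to the `pyarrow` package on PyPI, which the text above does not say explicitly; the helper names `parse_major_minor`, `installed_arrow_version`, and `check_environment` are hypothetical and not part of `pynock`.

```python
import re
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import List, Optional, Tuple


def parse_major_minor(version_str: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '5.0.1'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_str)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_str!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def installed_arrow_version() -> Optional[str]:
    """Return the installed pyarrow version, or None if it is missing.

    Assumption: the Arrow dependency is provided by the `pyarrow` package.
    """
    try:
        return version("pyarrow")
    except PackageNotFoundError:
        return None


def check_environment(
    arrow_version: Optional[str],
    python_version: Tuple[int, int] = tuple(sys.version_info[:2]),
) -> List[str]:
    """Collect human-readable warnings instead of raising errors."""
    warnings = []
    if python_version != (3, 8):
        warnings.append(
            "Python %d.%d is untested; only 3.8 has been validated" % python_version
        )
    if arrow_version is None:
        warnings.append("pyarrow is not installed")
    elif parse_major_minor(arrow_version) < (5, 0):
        warnings.append(f"pyarrow {arrow_version} is older than 5.0")
    return warnings


for message in check_environment(installed_arrow_version()):
    print("warning:", message)
```

The check only warns, since other Python versions may well work; they simply have not been validated.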
For the Python dependencies, the library versioning info is listed in the `requirements.txt` file.
To install via pip:
python3 -m pip install -U pynock
To set up this library locally:
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -U pip wheel
python3 -m pip install -r requirements.txt
Usage via CLI
To run examples from CLI:
python3 cli.py load-parq --file dat/recipes.parq --debug
python3 cli.py load-rdf --file dat/tiny.ttl --save-csv foo.csv
For further information:
python3 cli.py --help
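If you prefer to drive these same commands from a Python script, one possible approach is to wrap them with `subprocess`. This is a minimal sketch that assumes it runs from a checkout of the repo where `cli.py` and the `dat/` files exist; the helper `build_cli_command` is hypothetical and not part of `pynock`.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional


def build_cli_command(
    subcommand: str,
    file: str,
    extra: Optional[List[str]] = None,
) -> List[str]:
    """Build the argument list for `python cli.py <subcommand> --file <file>`."""
    return [sys.executable, "cli.py", subcommand, "--file", file] + list(extra or [])


# The two example invocations shown above, expressed as argument lists.
load_parq = build_cli_command("load-parq", "dat/recipes.parq", ["--debug"])
load_rdf = build_cli_command("load-rdf", "dat/tiny.ttl", ["--save-csv", "foo.csv"])

# Only run when cli.py is present in the current working directory.
if Path("cli.py").exists():
    subprocess.run(load_parq, check=True)
    subprocess.run(load_rdf, check=True)
else:
    print("cli.py not found; would have run:", " ".join(load_parq))
```

Passing the arguments as a list avoids shell quoting issues, and `sys.executable` keeps the call inside the active virtual environment.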
Usage programmatically in Python
To construct a partition file programmatically, see the
for Jupyter notebooks with sample code and debugging.
For more details about using Arrow and Parquet see:
"Apache Arrow: Read DataFrame With Zero Memory"
Towards Data Science (2020-06-25)
Why the name?
nock is the English word for the end of an arrow opposite its point.
If you must have an acronym, the proposed standard `NOCK` stands for Network Objects for Consistent Knowledge.
Also, the library name had minimal namespace collisions on GitHub and PyPI :)
To set up the build environment locally, also run:
python3 -m pip install -U pip setuptools wheel
python3 -m pip install -r requirements-dev.txt
Note that we require the use of `pre-commit` hooks. To configure that locally:
pre-commit install
git config --local core.hooksPath .git/hooks/
First, verify that `setup.py` will run correctly for the package:
python3 -m pip install -e .
python3 -m pytest -rx tests/
python3 -m pip uninstall pynock
Next, update the semantic version number in `setup.py` and create a release on GitHub, and make sure to update the local repo:
git stash
git checkout main
git pull
Make sure that you have set up 2FA for generating an API token on PyPI: https://pypi.org/manage/account/token/
Then run our PyPI push script: