GraphQL service for arrow tables and parquet data sets. The schema for a query API is derived automatically.
Usage
% env PARQUET_PATH=... uvicorn graphique.service:app
Open http://localhost:8000/graphql to try out the API in GraphiQL. There is a test fixture at ./tests/fixtures/zipcodes.parquet.
% python3 -m graphique.schema ...
outputs the GraphQL schema for a parquet data set.
Configuration
Graphique uses Starlette's config: in environment variables or a .env file. Config variables are used as input to a parquet dataset.
- PARQUET_PATH: path to the parquet directory or file
- INDEX = []: partition keys or names of columns which represent a sorted composite index
- FEDERATED = '': field name to extend type `Query` with a federated `Table`
- DEBUG = False: run service in debug mode, which includes timing
- DICTIONARIES = []: names of columns to read as dictionaries
- COLUMNS = []: names of columns to read at startup; `*` indicates all
- FILTERS = {}: json `Query` for which rows to read at startup
API
types
- `Table`: an arrow Table; the primary interface.
- `Column`: an arrow Column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, DateTime, Time, Duration, Binary, String, List, Struct. All columns have a `values` field for their list of scalars. Additional fields vary by type.
- `Row`: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single `row` field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
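For example, a minimal query reading a column's scalars through its `values` field might look like the sketch below; the `state` column name is an assumption based on the zipcodes test fixture.

```graphql
{
  columns {
    state {
      # values: the column's list of scalars
      values
    }
  }
}
```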
selection
- `slice`: contiguous selection of rows
- `search`: binary search if the table is sorted, i.e., provides an index
- `filter`: select rows from predicate functions
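As a sketch of selection against the zipcodes fixture: the `length` argument to `slice` and the column names are assumptions, so consult the generated schema in GraphiQL for exact signatures.

```graphql
{
  # take the first 3 rows (argument name assumed)
  slice(length: 3) {
    columns {
      state {
        values
      }
      zipcode {
        values
      }
    }
  }
}
```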
projection
- `columns`: provides a field for every `Column` in the schema
- `column`: access a column of any type by name
- `row`: provides a field for each scalar of a single row
- `apply`: transform columns by applying a function
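A sketch of single-row access follows; the `index` argument and the column names are assumptions. Note that `row` returns one row's scalars, in keeping with the column-oriented design.

```graphql
{
  # fetch scalars from a single row (index argument assumed)
  row(index: 0) {
    state
    zipcode
  }
}
```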
aggregation
- `group`: group by given columns, transforming the others into list columns
- `partition`: partition on adjacent values in given columns, transforming the others into list columns
- `aggregate`: apply reduce functions to list columns
- `tables`: return a list of tables by splitting on the scalars in list columns
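A sketch of grouping, assuming `group` takes a `by` list of column names and that the fixture has a `state` column; the exact argument and aggregate function names should be confirmed in the generated schema.

```graphql
{
  # group rows by state; other columns become list columns
  group(by: ["state"]) {
    length
    columns {
      state {
        values
      }
    }
  }
}
```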
ordering
- `sort`: sort table by given columns
- `min`: select rows with smallest values
- `max`: select rows with largest values
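A sketch of ordering, assuming `sort` takes a `by` list of column names:

```graphql
{
  # sort the table by state (argument name assumed)
  sort(by: ["state"]) {
    columns {
      state {
        values
      }
    }
  }
}
```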
Performance
Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy, optionally Polars, or custom optimizations.
By default, datasets are read on-demand, with only the necessary columns selected. Additionally, `filter(query: ...)` is optimized to filter rows while reading the dataset. Although graphique is a running service, parquet is performant at reading a subset of data. Optionally specify COLUMNS to read a subset of columns (or `*`) at startup, trading off memory for latency. Similarly, specify FILTERS in the json format of the `Query` input type to read a subset of rows at startup.
Specifying an INDEX indicates the table is sorted, and enables the binary search field. Specifying just INDEX without reading (FILTERS or COLUMNS) is allowed but only recommended if it corresponds to the partition keys. In that case, search(...) is functionally equivalent to filter(query: ...).
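A sketch of that equivalence, assuming `zipcode` is an indexed column and that the generated `Query` input exposes comparison fields such as `equal` (both are assumptions to verify in GraphiQL):

```graphql
{
  # binary search on the sorted index column
  search(zipcode: {equal: 90210}) {
    length
  }
  # equivalent predicate filter, optimized while reading
  filter(query: {zipcode: {equal: 90210}}) {
    length
  }
}
```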
Installation
% pip install graphique[server]
Dependencies
- pyarrow >=7
- strawberry-graphql[asgi] >=0.84.4
- uvicorn (or other ASGI server)
- pytz (optional timestamp support)
- polars (optional optimization for list aggregation)
Tests
100% branch coverage.
% pytest [--cov]
Changes
0.7
- Pyarrow >=7 required
- `FILTERS` use query syntax and trigger reading the dataset
- `FEDERATED` field configuration
- List columns support sorting and filtering
- Group by and aggregate optimizations
- Dataset scanning
0.6
- Pyarrow >=6 required
- Group by optimized and replaced `unique` field
- Dictionary related optimizations
- Null consistency with arrow `count` functions
0.5
- Pyarrow >=5 required
- Stricter validation of inputs
- Columns can be cast to another arrow data type
- Grouping uses large list arrays with 64-bit counts
- Datasets are read on-demand or optionally at startup
0.4
- Pyarrow >=4 required
- `sort` updated to use new native routines
- `partition` tables by adjacent values and differences
- `filter` supports unknown column types using tagged union pattern
- `Groups` replaced with `Table.tables` and `Table.aggregate` fields
- Tagged unions used for `filter`, `apply`, and `partition` functions
0.3
- Pyarrow >=3 required
- `any` and `all` fields
- String column `split` field
0.2
- `ListColumn` and `StructColumn` types
- `Groups` type with `aggregate` field
- `group` and `unique` optimized
- pyarrow >=2 required
- Statistical fields: `mode`, `stddev`, `variance`
- `is_in`, `min`, and `max` optimized