# graphique
GraphQL service for arrow tables and parquet data sets. The schema for a query API is derived automatically.
## Usage

```console
% env PARQUET_PATH=... uvicorn graphique.service:app
```

Open http://localhost:8000/graphql to try out the API in GraphiQL. There is a test fixture at `./tests/fixtures/zipcodes.parquet`.

```console
% python3 -m graphique.schema ...
```

outputs the graphql schema for a parquet data set.
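With the service running against the zipcodes fixture, a first query in GraphiQL could look like the sketch below. The field names (`slice`, `columns`, `zipcode`, `values`) are assumptions for illustration; the actual names come from the schema derived from the data set.

```graphql
{
  slice(length: 3) {
    columns {
      zipcode { values }
    }
  }
}
```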
## Configuration

Graphique uses Starlette's config: in environment variables or a `.env` file. Config variables are used as input to a parquet dataset.

- PARQUET_PATH: path to the parquet directory or file
- INDEX = []: partition keys or names of columns which represent a sorted composite index
- FEDERATED = '': field name to extend type `Query` with a federated `Table`
- DEBUG = False: run service in debug mode, which includes timing
- DICTIONARIES = []: names of columns to read as dictionaries
- COLUMNS = []: names of columns to read at startup; `*` indicates all
- FILTERS = {}: json `Queries` for which rows to read at startup
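As a concrete sketch, a `.env` file for the bundled zipcodes fixture might look like the following. The values are assumptions for illustration; only PARQUET_PATH is required.

```
PARQUET_PATH=tests/fixtures/zipcodes.parquet
DEBUG=True
COLUMNS=*
```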
## API

### types

- `Table`: an arrow Table; the primary interface.
- `Column`: an arrow Column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, DateTime, Time, Duration, Binary, String, List, Struct. All columns have a `values` field for their list of scalars. Additional fields vary by type.
- `Row`: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single `row` field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
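To make the column-oriented advice concrete: rather than requesting rows one by one, request the parallel columns in a single query, which maps directly onto arrow's memory layout. The column names below are assumptions about the data set, not part of a fixed schema.

```graphql
{
  columns {
    state { values }
    county { values }
  }
}
```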
### selection

- `slice`: contiguous selection of rows
- `search`: binary search if the table is sorted, i.e., provides an index
- `filter`: select rows from predicate functions
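A sketch of how the selection fields compose, assuming the table is sorted by state so `search` can narrow the rows before `filter` applies a predicate. The field and input names here are illustrative guesses, not the exact generated schema.

```graphql
{
  search(state: {eq: "CA"}) {
    filter(query: {city: {eq: "Los Angeles"}}) {
      slice(length: 3) {
        columns {
          zipcode { values }
        }
      }
    }
  }
}
```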
### projection

- `columns`: provides a field for every `Column` in the schema
- `column`: access a column of any type by name
- `row`: provides a field for each scalar of a single row
- `apply`: transform columns by applying a function
### aggregation

- `group`: group by given columns, transforming the others into list columns
- `partition`: partition on adjacent values in given columns, transforming the others into list columns
- `aggregate`: apply reduce functions to list columns
- `tables`: return a list of tables by splitting on the scalars in list columns
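For example, grouping by one column turns the others into list columns, and `tables` then splits those back out into one table per group. A hedged sketch, with hypothetical argument and field names:

```graphql
{
  group(by: ["state"]) {
    tables {
      count
    }
  }
}
```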
### ordering

- `sort`: sort table by given columns
- `min`: select rows with smallest values
- `max`: select rows with largest values
## Performance

Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy, optionally Polars, or custom optimizations.

By default, datasets are read on-demand, with only the necessary columns selected. Additionally, `filter(query: ...)` is optimized to filter rows while reading the dataset. Although graphique is a running service, parquet is performant at reading a subset of data. Optionally specify `COLUMNS` to read a subset of columns (or `*`) at startup, trading off memory for latency. Similarly, specify `FILTERS` in the json format of the `Query` input type to read a subset of rows at startup.

Specifying an `INDEX` indicates the table is sorted, and enables the binary `search` field. Specifying just `INDEX` without reading (`FILTERS` or `COLUMNS`) is allowed, but only recommended if it corresponds to the partition keys. In that case, `search(...)` is functionally equivalent to `filter(query: ...)`.
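The equivalence of `search(...)` and `filter(query: ...)` on a sorted column rests on binary search locating a contiguous run of matching rows in O(log n), where a generic filter must scan every row. A minimal stdlib sketch of that idea, with plain Python lists standing in for arrow columns and illustrative names:

```python
import bisect

def search(column, value):
    """Locate the contiguous run of rows equal to `value` in a sorted column.

    Runs in O(log n), versus the O(n) scan a generic filter would need.
    """
    lo = bisect.bisect_left(column, value)
    hi = bisect.bisect_right(column, value)
    return slice(lo, hi)

# A sorted "state" column, as a configured INDEX would guarantee.
states = ["CA", "CA", "NY", "NY", "NY", "TX"]
rows = search(states, "NY")
print(states[rows])  # → ['NY', 'NY', 'NY']
```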
## Installation

```console
% pip install graphique[server]
```
## Dependencies
- pyarrow >=7
- strawberry-graphql[asgi] >=0.84.4
- uvicorn (or other ASGI server)
- pytz (optional timestamp support)
- polars (optional optimization for list aggregation)
## Tests

100% branch coverage.

```console
% pytest [--cov]
```
## Changes

### 0.7

- Pyarrow >=7 required
- `FILTERS` use query syntax and trigger reading the dataset
- `FEDERATED` field configuration
- List columns support sorting and filtering
- Group by and aggregate optimizations
- Dataset scanning

### 0.6

- Pyarrow >=6 required
- Group by optimized and replaced `unique` field
- Dictionary related optimizations
- Null consistency with arrow `count` functions

### 0.5

- Pyarrow >=5 required
- Stricter validation of inputs
- Columns can be cast to another arrow data type
- Grouping uses large list arrays with 64-bit counts
- Datasets are read on-demand or optionally at startup

### 0.4

- Pyarrow >=4 required
- `sort` updated to use new native routines
- `partition` tables by adjacent values and differences
- `filter` supports unknown column types using tagged union pattern
- `Groups` replaced with `Table.tables` and `Table.aggregate` fields
- Tagged unions used for `filter`, `apply`, and `partition` functions

### 0.3

- Pyarrow >=3 required
- `any` and `all` fields
- String column `split` field

### 0.2

- `ListColumn` and `StructColumn` types
- `Groups` type with `aggregate` field
- `group` and `unique` optimized
- Pyarrow >=2 required
- Statistical fields: `mode`, `stddev`, `variance`
- `is_in`, `min`, and `max` optimized