GraphQL service for arrow tables and parquet data sets.
The schema for a query API is derived automatically.
## Usage

```shell
% env PARQUET_PATH=... uvicorn graphique.service:app
```

Open http://localhost:8000/ to try out the API in GraphiQL. There is a test fixture at `./tests/fixtures/zipcodes.parquet`.

```shell
% env PARQUET_PATH=... strawberry export-schema graphique.service:app.schema
```

outputs the GraphQL schema for a parquet data set.
## Configuration

Graphique uses Starlette's config: in environment variables or a `.env` file. Config variables are used as input to a parquet dataset.

- `PARQUET_PATH`: path to the parquet directory or file
- `FEDERATED = ''`: field name to extend type `Query` with a federated `Table`
- `DEBUG = False`: run service in debug mode, which includes timing
- `COLUMNS = None`: list of names, or mapping of aliases, of columns to select
- `FILTERS = None`: json `filter` query for which rows to read at startup
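For example, a minimal `.env` file might look like the following (the path is hypothetical; an empty `FILTERS` reads the whole table at startup):

```
PARQUET_PATH=/data/zipcodes.parquet
DEBUG=true
FILTERS={}
```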
For more options create a custom ASGI app. Call graphique's `GraphQL` on an arrow Dataset, Scanner, or Table. The GraphQL `Table` type will be the root `Query` type. Supply a mapping of names to datasets for multiple roots, and to enable federation.

```python
import pyarrow.dataset as ds
from graphique import GraphQL

app = GraphQL(ds.dataset(...))  # Table is root query type
app = GraphQL.federated({<name>: ds.dataset(...), ...}, keys={...})  # Tables on federated fields
```

Start like any ASGI app.

```shell
uvicorn <module>:app
```
Configuration options provide a convenient no-code solution, but are subject to change. A custom app is recommended for production usage.
## API

### types

- `Dataset`: interface for an arrow dataset, scanner, or table.
- `Table`: implements the `Dataset` interface. Adds typed `row`, `columns`, and `filter` fields from introspecting the schema.
- `Column`: interface for an arrow column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, Datetime, Time, Duration, Base64, String, List, Struct. All columns have a `values` field for their list of scalars. Additional fields vary by type.
- `Row`: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single `row` field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
### selection

- `slice`: contiguous selection of rows
- `filter`: select rows with simple predicates
- `scan`: select rows and project columns with expressions
### projection

- `columns`: provides a field for every `Column` in the schema
- `column`: access a column of any type by name
- `row`: provides a field for each scalar of a single row
- `apply`: transform columns by applying a function
- `join`: join tables by key columns
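The `join` field joins tables by key columns. As a rough sketch of inner-join semantics over column-oriented data (illustrative pure Python, not graphique's implementation):

```python
# Column-oriented "tables": a mapping of column name -> list of values.
left = {"key": ["a", "b", "c"], "x": [1, 2, 3]}
right = {"key": ["b", "c", "d"], "y": [10, 20, 30]}

def inner_join(left, right, on):
    """Inner-join two column dicts on a key column (illustrative only)."""
    # Build a hash index from key value to row indices on the right side.
    index = {}
    for j, value in enumerate(right[on]):
        index.setdefault(value, []).append(j)
    names = list(left) + [name for name in right if name != on]
    result = {name: [] for name in names}
    # Emit one output row per matching pair of left and right rows.
    for i, value in enumerate(left[on]):
        for j in index.get(value, []):
            for name in left:
                result[name].append(left[name][i])
            for name in right:
                if name != on:
                    result[name].append(right[name][j])
    return result

print(inner_join(left, right, "key"))
# {'key': ['b', 'c'], 'x': [2, 3], 'y': [10, 20]}
```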
### aggregation

- `group`: group by given columns, transforming the others into list columns
- `partition`: partition on adjacent values in given columns, transforming the others into list columns
- `aggregate`: apply reduce functions to list columns
- `tables`: return a list of tables by splitting on the scalars in list columns
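`group` collects rows by key wherever they occur, whereas `partition` only merges runs of adjacent equal values. A pure-Python sketch of the distinction (illustrative only, not graphique's implementation):

```python
from itertools import groupby

values = ["a", "a", "b", "a"]  # values of the grouping column

# group: one entry per distinct key, regardless of row position
grouped = {}
for index, key in enumerate(values):
    grouped.setdefault(key, []).append(index)
# grouped == {"a": [0, 1, 3], "b": [2]}

# partition: one entry per run of adjacent equal values
partitioned = [(key, [index for index, _ in run])
               for key, run in groupby(enumerate(values), key=lambda pair: pair[1])]
# partitioned == [("a", [0, 1]), ("b", [2]), ("a", [3])]
```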
### ordering

- `sort`: sort table by given columns
- `min`: select rows with smallest values
- `max`: select rows with largest values
## Performance

Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy or custom optimizations.

By default, datasets are read on-demand, with only the necessary rows and columns scanned. Although graphique is a running service, parquet is performant at reading a subset of data. Optionally specify `FILTERS` in the json `filter` format to read a subset of rows at startup, trading off memory for latency. An empty filter (`{}`) will read the whole table.

Specifying `COLUMNS` will limit memory usage when reading at startup (`FILTERS`). There is little speed difference, as unused columns are inherently ignored. Optional aliasing can also be used for camel casing.

If index columns are detected in the schema metadata, then an initial `filter` will also attempt a binary search on tables.
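The idea behind that binary search can be sketched with the standard library's `bisect` module (a simplification of the technique, not graphique's code):

```python
import bisect

# A column already sorted on disk, e.g. an index column named in the schema metadata.
zipcodes = [10001, 10002, 10003, 20001, 20002, 30001]

def filter_eq(column, value):
    """Resolve an equality filter to a contiguous row slice via binary search."""
    lo = bisect.bisect_left(column, value)
    hi = bisect.bisect_right(column, value)
    return slice(lo, hi)  # O(log n) instead of scanning every row

print(zipcodes[filter_eq(zipcodes, 20001)])  # [20001]
```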
## Installation

```shell
% pip install graphique[server]
```
## Dependencies

- pyarrow >=12
- strawberry-graphql[asgi,cli]
- uvicorn (or other ASGI server)
## Tests

100% branch coverage.

```shell
% pytest [--cov]
```
## Changes

### 1.2

- Pyarrow >=12 required
- Grouping fragments optimized
- Group by empty columns
- Batch sorting and grouping into lists

### 1.1

- Pyarrow >=11 required
- Python >=3.8 required
- Scannable functions added
- List aggregations deprecated
- Group by fragments
- Month day nano interval array
- `min` and `max` fields memory optimized

### 1.0

- Pyarrow >=10 required
- Dataset schema introspection
- Dataset scanning with selection and projection
- Binary search on sorted columns
- List aggregation, filtering, and sorting optimizations
- Compute functions generalized
- Multiple datasets and federation
- Provisional dataset `join` and `take`

### 0.9

- Pyarrow >=9 required
- Multi-directional sorting
- Removed unnecessary interfaces
- Filtering has stricter typing

### 0.8

- Pyarrow >=8 required
- Grouping and aggregation integrated
- `AbstractTable` interface renamed to `Dataset`
- `Binary` scalar renamed to `Base64`

### 0.7

- Pyarrow >=7 required
- `FILTERS` use query syntax and trigger reading the dataset
- `FEDERATED` field configuration
- List columns support sorting and filtering
- Group by and aggregate optimizations
- Dataset scanning

### 0.6

- Pyarrow >=6 required
- Group by optimized and replaced `unique` field
- Dictionary related optimizations
- Null consistency with arrow `count` functions

### 0.5

- Pyarrow >=5 required
- Stricter validation of inputs
- Columns can be cast to another arrow data type
- Grouping uses large list arrays with 64-bit counts
- Datasets are read on-demand or optionally at startup

### 0.4

- Pyarrow >=4 required
- `sort` updated to use new native routines
- `partition` tables by adjacent values and differences
- `filter` supports unknown column types using tagged union pattern
- `Groups` replaced with `Table.tables` and `Table.aggregate` fields
- Tagged unions used for `filter`, `apply`, and `partition` functions

### 0.3

- Pyarrow >=3 required
- `any` and `all` fields
- String column `split` field

### 0.2

- Pyarrow >=2 required
- `ListColumn` and `StructColumn` types
- `Groups` type with `aggregate` field
- `group` and `unique` optimized
- Statistical fields: `mode`, `stddev`, `variance`
- `is_in`, `min`, and `max` optimized