GraphQL service for arrow tables and parquet data sets. The schema for a query API is derived automatically.
Usage
% env PARQUET_PATH=... uvicorn graphique.service:app
Open http://localhost:8000/graphql to try out the API in GraphiQL. There is a test fixture at ./tests/fixtures/zipcodes.parquet.
% python3 -m graphique.schema ...
outputs the GraphQL schema for a parquet data set.
Configuration
Graphique uses Starlette's config: in environment variables or a .env file. Config variables are used as input to a parquet dataset.

- PARQUET_PATH: path to the parquet directory or file
- INDEX = []: partition keys or names of columns which represent a sorted composite index
- FEDERATED = '': field name to extend type `Query` with a federated `Table`
- DEBUG = False: run service in debug mode, which includes timing
- DICTIONARIES = []: names of columns to read as dictionaries
- COLUMNS = []: names of columns to read at startup; * indicates all
- FILTERS = {}: json `Query` for which rows to read at startup
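For illustration, a minimal .env file might look like the following sketch. The path points at the bundled test fixture; the `zipcode` column name is an assumption about that fixture, and list values are assumed to be comma-separated (as with Starlette's CommaSeparatedStrings):

```
PARQUET_PATH=./tests/fixtures/zipcodes.parquet
INDEX=zipcode
DEBUG=true
```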
API
types

- `Table`: an arrow Table; the primary interface.
- `Column`: an arrow Column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, DateTime, Time, Duration, Base64, String, List, Struct. All columns have a `values` field for their list of scalars. Additional fields vary by type.
- `Row`: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single `row` field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
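As a sketch of the column-oriented usage described above, requesting parallel columns might look like this; the `zipcode` and `state` column names are assumptions about the dataset schema, not fields guaranteed by graphique:

```graphql
{
  columns {
    zipcode { values }
    state { values }
  }
}
```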
selection

- `slice`: contiguous selection of rows
- `search`: binary search if the table is sorted, i.e., provides an index
- `filter`: select rows from predicate functions
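A hedged sketch of combining the selection fields; the predicate shape and the `state` column name are assumptions, and the generated schema defines the exact `query` input:

```graphql
{
  filter(query: {state: {eq: "CA"}}) {
    slice(length: 10) {
      columns { state { values } }
    }
  }
}
```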
projection

- `columns`: provides a field for every `Column` in the schema
- `column`: access a column of any type by name
- `row`: provides a field for each scalar of a single row
- `apply`: transform columns by applying a function
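For example, accessing a column dynamically by name might look like the following sketch; the inline-fragment type name `StringColumn` and the `state` column are assumptions about the generated schema:

```graphql
{
  column(name: "state") {
    ... on StringColumn { values }
  }
}
```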
aggregation

- `group`: group by given columns, transforming the others into list columns
- `partition`: partition on adjacent values in given columns, transforming the others into list columns
- `aggregate`: apply reduce functions to list columns
- `tables`: return a list of tables by splitting on the scalars in list columns
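A hedged sketch of grouping; the `by` argument name and the `state` column are assumptions, and the generated schema defines the exact aggregate inputs. After grouping, the grouped column stays scalar while the remaining columns become list columns:

```graphql
{
  group(by: ["state"]) {
    columns { state { values } }
  }
}
```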
ordering

- `sort`: sort table by given columns
- `min`: select rows with smallest values
- `max`: select rows with largest values
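As a sketch, sorting combined with a limited slice might look like this; the `by` argument name and the `state` column are assumptions:

```graphql
{
  sort(by: ["state"]) {
    slice(length: 3) {
      columns { state { values } }
    }
  }
}
```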
Performance
Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy or custom optimizations.
By default, datasets are read on-demand, with only the necessary columns selected. Additionally, `filter(query: ...)` is optimized to filter rows while reading the dataset. Although graphique is a running service, parquet is performant at reading a subset of data. Optionally specify COLUMNS to read a subset of columns (or *) at startup, trading off memory for latency. Similarly specify FILTERS in the json format of the `Query` input type to read a subset of rows at startup.
Specifying an INDEX indicates the table is sorted, and enables the binary `search` field. Specifying just INDEX without reading (FILTERS or COLUMNS) is allowed, but only recommended if it corresponds to the partition keys. In that case, `search(...)` is functionally equivalent to `filter(query: ...)`.
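For illustration, startup reading could be configured like the following hedged sketch; the `state` column and the filter value are assumptions, and the exact json accepted by FILTERS is defined by the generated `Query` input type:

```
PARQUET_PATH=./tests/fixtures/zipcodes.parquet
COLUMNS=*
FILTERS={"state": {"eq": "CA"}}
```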
Installation
% pip install graphique[server]
Dependencies
- pyarrow >=8
- strawberry-graphql[asgi] >=0.109
- uvicorn (or other ASGI server)
Tests
100% branch coverage.
% pytest [--cov]
Changes
0.8
- Pyarrow >=8 required
- Grouping and aggregation integrated
- `AbstractTable` interface renamed to `Dataset`
- `Binary` scalar renamed to `Base64`
0.7
- Pyarrow >=7 required
- `FILTERS` use query syntax and trigger reading the dataset
- `FEDERATED` field configuration
- List columns support sorting and filtering
- Group by and aggregate optimizations
- Dataset scanning
0.6
- Pyarrow >=6 required
- Group by optimized and replaced `unique` field
- Dictionary related optimizations
- Null consistency with arrow `count` functions
0.5
- Pyarrow >=5 required
- Stricter validation of inputs
- Columns can be cast to another arrow data type
- Grouping uses large list arrays with 64-bit counts
- Datasets are read on-demand or optionally at startup
0.4
- Pyarrow >=4 required
- `sort` updated to use new native routines
- `partition` tables by adjacent values and differences
- `filter` supports unknown column types using tagged union pattern
- `Groups` replaced with `Table.tables` and `Table.aggregate` fields
- Tagged unions used for `filter`, `apply`, and `partition` functions
0.3
- Pyarrow >=3 required
- `any` and `all` fields
- String column `split` field
0.2
- `ListColumn` and `StructColumn` types
- `Groups` type with `aggregate` field
- `group` and `unique` optimized
- pyarrow >=2 required
- Statistical fields: `mode`, `stddev`, `variance`
- `is_in`, `min`, and `max` optimized