GraphQL service for arrow tables and parquet data sets. The schema for a query API is derived automatically.

Usage

% env PARQUET_PATH=... uvicorn graphique.service:app

Open http://localhost:8000/graphql to try out the API in GraphiQL. There is a test fixture at ./tests/fixtures/zipcodes.parquet.

% python3 -m graphique.schema ...

This outputs the GraphQL schema for a parquet data set.

Configuration

Graphique uses Starlette's config: settings are read from environment variables or a .env file. Config variables are used as input to a parquet dataset.

  • PARQUET_PATH: path to the parquet directory or file
  • INDEX = []: partition keys or names of columns which represent a sorted composite index
  • FEDERATED = '': field name to extend type Query with a federated Table
  • DEBUG = False: run service in debug mode, which includes timing
  • DICTIONARIES = []: names of columns to read as dictionaries
  • COLUMNS = []: names of columns to read at startup; * indicates all
  • FILTERS = {}: JSON in the format of the Query input type, selecting which rows to read at startup
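As a rough sketch of how these variables might be put together, the helper below builds the settings as strings and prints them in the KEY=VALUE form of a .env file. The helper name build_env is hypothetical, and two details are assumptions rather than facts stated above: that list-valued settings such as INDEX and COLUMNS are written as comma-separated strings, and that a filter on a state column looks like {"state": {"equal": "CA"}}. Check the generated schema for the actual Query input format.

```python
import json


def build_env(path, index=(), columns=(), filters=None):
    """Return the service settings as a dict of strings.

    Assumptions (not confirmed by the text above): list-valued settings
    are comma-separated strings, and FILTERS is a JSON object.
    """
    env = {"PARQUET_PATH": path}
    if index:
        env["INDEX"] = ",".join(index)
    if columns:
        env["COLUMNS"] = ",".join(columns)
    if filters:
        env["FILTERS"] = json.dumps(filters)
    return env


# Hypothetical settings for the zipcodes test fixture; the column names
# and the shape of the filter are illustrative only.
env = build_env(
    "./tests/fixtures/zipcodes.parquet",
    index=["zipcode"],
    columns=["*"],
    filters={"state": {"equal": "CA"}},
)

# Write the values in the KEY=VALUE form used by a .env file.
dotenv_text = "\n".join(f"{key}={value}" for key, value in env.items())
print(dotenv_text)
```

The same values could instead be exported as environment variables before starting uvicorn.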

API

types

  • Table: an arrow Table; the primary interface.
  • Column: an arrow Column (a.k.a. ChunkedArray). Each arrow data type has a corresponding column implementation: Boolean, Int, Long, Float, Decimal, Date, DateTime, Time, Duration, Binary, String, List, Struct. All columns have a values field for their list of scalars. Additional fields vary by type.
  • Row: scalar fields. Arrow tables are column-oriented, and graphique encourages that usage for performance. A single row field is provided for convenience, but a field for a list of rows is not. Requesting parallel columns is far more efficient.
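If rows are needed on the client, they can be rebuilt from parallel columns after the response arrives. The sketch below uses plain Python only; the column names and values are made up, and to_rows is a hypothetical helper, not part of graphique.

```python
# Column values as they might arrive in a GraphQL response, one list per
# column. The column names and data are made up for illustration.
columns = {
    "zipcode": [501, 544, 1001],
    "state": ["NY", "NY", "MA"],
}


def to_rows(columns):
    """Zip parallel column lists into a list of row dictionaries."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = to_rows(columns)
print(rows[0])
```

This keeps the service response column-oriented and defers the row-wise view to the caller.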

selection

  • slice: contiguous selection of rows
  • search: binary search if the table is sorted, i.e., provides an index
  • filter: select rows from predicate functions
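The difference between search and filter can be illustrated with the standard library. This is a conceptual sketch on a plain Python list, not graphique's implementation: on sorted data, a binary search finds a contiguous run of matching rows, whereas a predicate has to examine every row. The function names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right

# A sorted column standing in for an indexed table column (made-up data).
states = ["CA", "CA", "MA", "NY", "NY", "NY", "TX"]


def search_equal(sorted_values, value):
    """Return (offset, length) of the contiguous run equal to value."""
    start = bisect_left(sorted_values, value)
    stop = bisect_right(sorted_values, value)
    return start, stop - start


def filter_equal(values, value):
    """Return the positions of matching rows by scanning every value."""
    return [i for i, v in enumerate(values) if v == value]


offset, length = search_equal(states, "NY")
print(offset, length)
```

Both approaches select the same rows here; the search result is simply a slice described by an offset and a length.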

projection

  • columns: provides a field for every Column in the schema
  • column: access a column of any type by name
  • row: provides a field for each scalar of a single row
  • apply: transform columns by applying a function

aggregation

  • group: group by given columns, transforming the others into list columns
  • partition: partition on adjacent values in given columns, transforming the others into list columns
  • aggregate: apply reduce functions to list columns
  • tables: return a list of tables by splitting on the scalars in list columns
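The distinction between group and partition is easiest to see on a small example. The sketch below is plain Python with made-up data and hypothetical function names; it only mirrors the semantics described above and says nothing about how graphique computes them.

```python
from itertools import groupby

# Made-up rows as (state, zipcode) pairs; note "NY" appears in two runs.
rows = [("NY", 501), ("NY", 544), ("MA", 1001), ("NY", 10001)]


def group(rows):
    """Group by key regardless of position; values become lists."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return groups


def partition(rows):
    """Split only where adjacent keys differ; values become lists."""
    return [(key, [value for _, value in run])
            for key, run in groupby(rows, key=lambda row: row[0])]


grouped = group(rows)
partitioned = partition(rows)
# An aggregate step then reduces each list, for example by counting.
counts = {key: len(values) for key, values in grouped.items()}
```

Grouping merges both "NY" runs into one list, while partitioning keeps them as separate entries because they are not adjacent.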

ordering

  • sort: sort table by given columns
  • min: select rows with smallest values
  • max: select rows with largest values
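Fields from this section are combined by nesting them in a query. The sketch below builds a JSON POST request for the GraphiQL endpoint mentioned under Usage, using only the standard library. The column names (state, zipcode) and the "equal" operator are assumptions about the derived schema, and build_request and run_query are hypothetical helpers; confirm the actual field and argument names in GraphiQL.

```python
import json
import urllib.request

# Query text using fields named in this section. The column names
# (state, zipcode) and the "equal" operator are assumptions about the
# derived schema; check them in GraphiQL before relying on them.
QUERY = """
{
  filter(query: {state: {equal: "CA"}}) {
    columns {
      zipcode { values }
    }
  }
}
"""


def build_request(query, url="http://localhost:8000/graphql"):
    """Build (but do not send) a JSON POST request for a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def run_query(query):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(query)) as response:
        return json.load(response)


request = build_request(QUERY)
```

Calling run_query(QUERY) requires the service to be running locally; build_request alone can be used to inspect the payload.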

Performance

Graphique relies on native PyArrow routines wherever possible. Otherwise it falls back to using NumPy, optionally Polars, or custom optimizations.

By default, datasets are read on-demand, with only the necessary columns selected. Additionally, filter(query: ...) is optimized to filter rows while reading the dataset. Although graphique is a running service, parquet is performant at reading a subset of data. Optionally specify COLUMNS to read a subset of columns (or *) at startup, trading off memory for latency. Similarly, specify FILTERS in the JSON format of the Query input type to read a subset of rows at startup.

Specifying an INDEX indicates the table is sorted, and enables the binary search field. Specifying just INDEX without reading (FILTERS or COLUMNS) is allowed but only recommended if it corresponds to the partition keys. In that case, search(...) is functionally equivalent to filter(query: ...).

Installation

% pip install graphique[server]

Dependencies

  • pyarrow >=7
  • strawberry-graphql[asgi] >=0.84.4
  • uvicorn (or other ASGI server)
  • pytz (optional timestamp support)
  • polars (optional optimization for list aggregation)

Tests

100% branch coverage.

% pytest [--cov]

Changes

0.7

  • Pyarrow >=7 required
  • FILTERS use query syntax and trigger reading the dataset
  • FEDERATED field configuration
  • List columns support sorting and filtering
  • Group by and aggregate optimizations
  • Dataset scanning

0.6

  • Pyarrow >=6 required
  • Group by optimized and replaced unique field
  • Dictionary related optimizations
  • Null consistency with arrow count functions

0.5

  • Pyarrow >=5 required
  • Stricter validation of inputs
  • Columns can be cast to another arrow data type
  • Grouping uses large list arrays with 64-bit counts
  • Datasets are read on-demand or optionally at startup

0.4

  • Pyarrow >=4 required
  • sort updated to use new native routines
  • partition tables by adjacent values and differences
  • filter supports unknown column types using tagged union pattern
  • Groups replaced with Table.tables and Table.aggregate fields
  • Tagged unions used for filter, apply, and partition functions

0.3

  • Pyarrow >=3 required
  • any and all fields
  • String column split field

0.2

  • ListColumn and StructColumn types
  • Groups type with aggregate field
  • group and unique optimized
  • Pyarrow >=2 required
  • Statistical fields: mode, stddev, variance
  • is_in, min, and max optimized


