Skip to main content

Python SQL Query Engine

Project description

Opteryx

Query your data, where it lives.

Opteryx is an in-process SQL query engine for analysis of distributed datasets.

Documentation | Examples | Contributing

Tweet

PyPI Latest Release opteryx Downloads last_commit codecov PyPI Latest Release

What is Opteryx?

Opteryx is a Python library designed for data wrangling and analytics. It combines the benefits of in-process database engines like SQLite and DuckDB with query federators like Presto and Trino, providing a flexible and intuitive solution that allows users to seamlessly access and query data from various sources in a unified manner.

Opteryx offers the following features:

  • SQL queries on data files generated by other processes such as logs
  • A command-line tool for filtering, transforming, and combining files
  • Integration with familiar tools like pandas and Polars
  • Embeddable as a low-cost engine, enabling portability and allowing for hundreds of analysts to leverage ad hoc databases with ease
  • Unified and federated access to data on disk, in the cloud, and in on-premises databases, not only through the same interface but in the same query.

Why Use Opteryx?

Familiar Interface

Opteryx supports key parts of the Python DBAPI and SQL92 standard standards which many analysts and engineers will already know how to use.

Consistent Syntax

Opteryx creates a common SQL-layer over multiple data platforms, allowing backend systems to be upgraded, migrated or consolidated without changing any Opteryx code.

Bring your own Data

Opteryx

Opteryx supports multiple query engines, dataframe APIs and storage formats. You can mix-and-match sources in a single query. Opteryx can even JOIN datasets stored in different formats, such as Parquet and MySQL.

Opteryx allows you to query your data directly in the systems where they are stored, eliminating the need to duplicate data into a common store for analytics. This saves you the cost and effort of maintaining duplicates.

Opteryx can push parts of your query to the source query engine, allowing queries to run at the speed of the backend, rather than your local computer.

And if there's not a connector in the box for your data platform; bespoke connectors can be added.

Consumption-Based Billing Friendly

Opteryx is well-suited for deployments to environments which are pay-as-you-use, like Google Cloud Run. Great for situations where you have low-volume usage, or multiple environments, where the costs of many traditional database deployment can quickly add up.

Python Native

Opteryx is Open Source Python, it quickly and easily integrates into Python code, including Jupyter Notebooks, so you can start querying your data within a few minutes. You can use Opteryx to run SQL against pandas DataFrames, and even execute a JOIN on an in-memory DataFrame and a remote dataset.

Time Travel

Designed for data analytics in environments where decisions need to be replayable, Opteryx allows you to query data as at a point in time in the past to replay decision algorithms against facts as they were known in the past. You can even self-join tables historic data, great for finding deltas in datasets over time. (data must be structured to enable temporal queries)

Fast

Benchmarks on M1 Pro Mac running an ad hoc GROUP BY over a 1Gb parquet file via the CLI in ~1/5th of a second, from a cold start. (different systems will have different performance characteristics)

Rows Columns File Size Query Time
561225 81 1Gb 0.22sec
1064539 81 2Gb 0.27sec

Instant Elasticity

Designed to run in Knative and similar environments like Google Cloud Run, Opteryx can scale down to zero, and scale up to respond to thousands of concurrent queries within seconds.

Examples

Install from PyPI
Filter a Dataset on the Command Line
Execute a Simple Query in Python
Execute SQL on a pandas DataFrame
Query Data on Local Disk
Query Data on GCS
Query Data in SQLite
Further Examples

Install from PyPI

pip install opteryx

Filter a Dataset on the Command Line

In this example, we are running Opteryx from the command line to filter one of the internal example datasets and display the results on the console.

python -m opteryx "SELECT * FROM \$astronauts WHERE 'Apollo 11' IN UNNEST(missions);"

Opteryx this example is complete and should run as-is

Execute a Simple Query in Python

In this example, we are showing the basic usage of the Python API by executing a simple query that makes no references to any datasets.

import opteryx

result = opteryx.query("SELECT 4 * 7;")
result.head()
   | 4.0*7.0  
===#========= 
 0 |    28.0 

this example is complete and should run as-is

Execute SQL on a pandas DataFrame

In this example, we are running a SQL statement on a pandas DataFrame and returning the result as a new pandas DataFrame.

import opteryx
import pandas

pandas_df = pandas.read_csv("https://storage.googleapis.com/opteryx/exoplanets/exoplanets.csv")
opteryx.register_df("exoplanets", pandas_df)
aggregated_df = opteryx.query("SELECT koi_disposition, COUNT(*) FROM exoplanets GROUP BY koi_disposition;").pandas()
aggregated_df.head()
  koi_disposition  COUNT(*)
0       CONFIRMED      2293
1  FALSE POSITIVE      5023
2       CANDIDATE      2248 

this example is complete and should run as-is

Query Data on Local Disk

In this example, we are querying and filtering a file directly. This example will not run as written because the file being queried does not exist.

import opteryx

result = opteryx.query("SELECT * FROM 'space_missions.parquet' LIMIT 5;")
result.head()
     | Company   | Location                       | Price | Launched_at         | Rocket         | Rocket_Status | Mission        | Mission_Status 
=====#===========#================================#=======#=====================#================#===============#================#================ 
   0 | RVSN USSR | Site 1/5, Baikonur Cosmodrome, |  None | 1957-10-04 19:28:00 | Sputnik 8K71PS | Retired       | Sputnik-1      | Success        
   1 | RVSN USSR | Site 1/5, Baikonur Cosmodrome, |  None | 1957-11-03 02:30:00 | Sputnik 8K71PS | Retired       | Sputnik-2      | Success        
   2 | US Navy   | LC-18A, Cape Canaveral AFS, Fl |  None | 1957-12-06 16:44:00 | Vanguard       | Retired       | Vanguard TV3   | Failure        
   3 | AMBA      | LC-26A, Cape Canaveral AFS, Fl |  None | 1958-02-01 03:48:00 | Juno I         | Retired       | Explorer 1     | Success        
   4 | US Navy   | LC-18A, Cape Canaveral AFS, Fl |  None | 1958-02-05 07:33:00 | Vanguard       | Retired       | Vanguard TV3BU | Failure        

this example requires a data file, space_missions.parquet.

Query Data in SQLite

In this example, we are querying a SQLite database via Opteryx. This example will not run as written because the file being queried does not exist.

import opteryx
from opteryx.connectors import SqlConnector

# Register the store, so we know queries for the 'sql' store should be handled by
# the SQL Connector
opteryx.register_store(
   prefix="sql",
   connector=SqlConnector, 
   remove_prefix=True,  # the prefix isn't part of the SQLite table name
   connection="sqlite:///database.db"  # SQLAlchemy connection string
)
result = opteryx.query("SELECT name, mass, diameter, density FROM sql.planets LIMIT 5;")
result.head()
   │ name    │ mass   │ diameter │ density
═══#═════════#════════#══════════#════════
 0 │ Mercury │   0.33 │   4879.0 │  5427.0
 1 │ Venus   │   4.87 │  12104.0 │  5243.0
 2 │ Earth   │   5.97 │  12756.0 │  5514.0
 3 │ Mars    │  0.642 │   6792.0 │  3933.0
 4 │ Jupiter │ 1898.0 │ 142984.0 │  1326.0

this example requires a data file, database.db.

Query Data on GCS

In this example, we are to querying a dataset on GCS in a public bucket called 'opteryx'.

import opteryx
from opteryx.connectors import GcpCloudStorageConnector

# Register the store, so we know queries for this store should be handled by
# the GCS connector
opteryx.register_store("opteryx", GcpCloudStorageConnector)
result = opteryx.query("SELECT * FROM opteryx.space_missions LIMIT 5;")
result.head()
     | Company   | Location                       | Price | Launched_at         | Rocket         | Rocket_Status | Mission        | Mission_Status 
=====#===========#================================#=======#=====================#================#===============#================#================ 
   0 | RVSN USSR | Site 1/5, Baikonur Cosmodrome, |  None | 1957-10-04 19:28:00 | Sputnik 8K71PS | Retired       | Sputnik-1      | Success        
   1 | RVSN USSR | Site 1/5, Baikonur Cosmodrome, |  None | 1957-11-03 02:30:00 | Sputnik 8K71PS | Retired       | Sputnik-2      | Success        
   2 | US Navy   | LC-18A, Cape Canaveral AFS, Fl |  None | 1957-12-06 16:44:00 | Vanguard       | Retired       | Vanguard TV3   | Failure        
   3 | AMBA      | LC-26A, Cape Canaveral AFS, Fl |  None | 1958-02-01 03:48:00 | Juno I         | Retired       | Explorer 1     | Success        
   4 | US Navy   | LC-18A, Cape Canaveral AFS, Fl |  None | 1958-02-05 07:33:00 | Vanguard       | Retired       | Vanguard TV3BU | Failure         

this example is complete and should run as-is

Further Examples

For prepared example usage, see the below Notebooks on Google Colab.

Community

gitter Twitter Follow

How do I get Support?

For support ask our Gitter Community.

How Can I Contribute?

We are looking for volunteers to help build and direct Opteryx. If you are interested please use the Issues to let use know.

All contributions, bug reports, documentation improvements, enhancements, and ideas are welcome.

Want to help build Opteryx? See the Contribution and Set Up Guides.

Security

Static Analysis Vulnerabilities Security Rating

See the project Security Policy for information about reporting vulnerabilities.

License

License FOSSA Status

Opteryx is licensed under Apache 2.0 unless otherwise noted.

Status

Status

Opteryx is in beta. Beta means different things to different people, to us, being beta means:

  • Core functionality has good regression test coverage to help ensure stability
  • Some edge cases may have undetected bugs
  • Performance tuning is incomplete
  • Changes are focused on feature completion, bugs, performance, reducing debt, and security
  • Code structure and APIs are not stable and may change

Related Projects

  • HadroDB Data storage engine
  • mabel Streaming data APIs
  • mesos MySQL connector for Opteryx
  • orso DataFrame library

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

opteryx-0.11.0.tar.gz (384.1 kB view hashes)

Uploaded Source

Built Distributions

opteryx-0.11.0-cp311-cp311-win_amd64.whl (995.1 kB view hashes)

Uploaded CPython 3.11 Windows x86-64

opteryx-0.11.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view hashes)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

opteryx-0.11.0-cp311-cp311-macosx_10_15_universal2.whl (2.0 MB view hashes)

Uploaded CPython 3.11 macOS 10.15+ universal2 (ARM64, x86-64)

opteryx-0.11.0-cp310-cp310-win_amd64.whl (996.4 kB view hashes)

Uploaded CPython 3.10 Windows x86-64

opteryx-0.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view hashes)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

opteryx-0.11.0-cp310-cp310-macosx_10_15_universal2.whl (2.0 MB view hashes)

Uploaded CPython 3.10 macOS 10.15+ universal2 (ARM64, x86-64)

opteryx-0.11.0-cp39-cp39-win_amd64.whl (999.5 kB view hashes)

Uploaded CPython 3.9 Windows x86-64

opteryx-0.11.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view hashes)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

opteryx-0.11.0-cp39-cp39-macosx_10_15_universal2.whl (2.0 MB view hashes)

Uploaded CPython 3.9 macOS 10.15+ universal2 (ARM64, x86-64)

opteryx-0.11.0-cp38-cp38-win_amd64.whl (999.5 kB view hashes)

Uploaded CPython 3.8 Windows x86-64

opteryx-0.11.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view hashes)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

opteryx-0.11.0-cp38-cp38-macosx_10_15_universal2.whl (2.0 MB view hashes)

Uploaded CPython 3.8 macOS 10.15+ universal2 (ARM64, x86-64)

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page