Skip to main content

Analytics DSL for Python

Project description

PyDough

PyDough is an alternative DSL that can be used to solve analytical problems by phrasing questions in terms of a logical document model instead of translating to relational SQL logic.

What Is PyDough

PyDough allows expressing analytical questions with hierarchical thinking, as seen in models such as MongoDB, since that mental model is closer to human linguistics than a relational model. Unlike MongoDB, PyDough only uses a logical document model for abstractly explaining & interacting with data, rather than a physical document model to store the data. PyDough code can be written in and interleaved with Python code, and practices a lazy evaluation scheme that does not qualify or execute any logic until requested. PyDough executes by translating its logic into SQL which it can directly executing in an arbitrary database.

Consider the following information represented by the tables in a database:

  • There are people; each person has a name, ssn, birth date, records of jobs they have had, and records of schools they have attended.
  • There are employment records; each job record has the ssn of the person being employed, the name of the company, and the total income they made from the job.
  • There are education records; each education record has the ssn of the person attending the school, the name of the school, and the total tuition they paid to that school.

Suppose I want to know for every person their name & the total income they've made from all jobs minus the total tuition paid to all schools. However, I want to include people who have never had a job or never attended any schools, and I need to account for people who could have had multiple jobs or attended multiple schools. The following PyDough snippet solves this problem:

result = People.CALCULATE(
    name,
    net_income = SUM(jobs.income_earned) - SUM(schools.tuition_paid)
)
pydough.to_df(result)

However, if answering the question with SQL, I would need to write the following less-intuitive SQL query:

SELECT
    P.name AS name,
    COALESCE(J.total_income_earned, 0) - COALESCE(S.total_tuition_paid, 0) AS net_income
FROM PEOPLE AS P
LEFT JOIN (
    SELECT person_ssn, SUM(income_earned) AS total_income_earned
    FROM EMPLOYMENT_RECORDS
    GROUP BY person_ssn
) AS J
ON P.ssn = J.person_ssn
LEFT JOIN (
    SELECT person_ssn, SUM(tuition_paid) AS total_tuition_paid
    FROM EDUCATION_RECORDS
    GROUP BY person_ssn
) AS S
ON P.ssn = S.person_ssn

Internally, PyDough solves the question by translating the much simpler logical document model logic into SQL, which can be directly executed on a database. Even if the same SQL is generated by PyDough as the example above, all a user needs to worry about is writing the much smaller PyDough code snippet in Python.

Currently, the main mechanism to execute PyDough code is via Jupyter notebooks with a special cell magic. See the usage guide and demo notebooks for more details.

Why Build PyDough?

PyDough as a DSL has several benefits over other solutions, both for human use and LLM generation:

  • ORMs still require understanding & writing SQL, including dealing directly with joins. If a human or AI is bad at writing SQL, they will be just as bad at writing ORM-based code. PyDough, on the other hand, abstracts away joins in favor of thinking about logical relationships between collections & sub-collections.
  • The complex semantics of aggregation keys, different types of joins, and aggregating before vs after joining are all abstracted away by PyDough. These details require much deeper understanding of SQL semantics than most have time to learn how to do correctly, meaning that PyDough can have a lower learning curve to write correct code for complex questions.
  • When a question is being asked, the PyDough code to answer it will look more similar to the text of the question than the SQL text would. This makes LLM generation of PyDough code simpler since there is a stronger correlation between a question asked and the PyDough code to answer it.
  • Often, PyDough code will be significantly more compact than equivalent SQL text, and therefore easier for a human to verify for logical correctness.
  • PyDough is portable between various database execution solutions, so you are not locked into one data storage solution while using PyDough.

Learning About PyDough

Refer to these documents to learn how to use PyDough:

Installing or Developing PyDough

PyDough releases are available on PyPI and can be installed via pip:

pip install pydough

For local development, PyDough uses uv as a package manager. Please refer to their docs for installation.

To run testing commands after installing uv, run the following command:

uv run pytest <pytest_arguments>

If you want to skip tests that execute runtime results because they are slower, make sure to include -m "not execute" in the pytest arguments.

Note: some tests may require an additional setup to run successfully. The demos directory contains more information on how to setup the TPC-H sqlite database. For testing, the tpch.db file must be located in the tests directory. Additionally, the setup_defog.sh script must be run so that the defog.db file is located in the tests directory.

Running CI Tests

When submitting a PR, you can control which CI tests run by adding special flags to your latest commit message.

Note: All flags are case-insensitive.

  • To run PyDough CI tests, add: [run CI] (only runs SQLite tests, no other SQL dialects)
  • To run PyDough and all dialect tests, add: [run all]
  • To run specific dialect tests, use the corresponding flag as described below.

Running Snowflake Tests on CI

To run Snowflake CI tests, add the flag [run SF] to your commit message.

Running Snowflake tests locally:

  1. Install the Snowflake Connector for Python with Pandas support

    pip install "snowflake-connector-python[pandas]"
    
  2. Set your Snowflake credentials as environment variables:

        export SF_USERNAME="your_username"
        export SF_PASSWORD="your_password"
        export SF_ACCOUNT="your_account"
    

Running MySQL Tests on CI

To run MySQL CI tests, add the flag [run mysql] to your commit message.

Running MySQL tests locally:

  1. Make sure you have Docker Desktop installed and running.

  2. Install the MySQL Connector for Python

    pip install mysql-connector-python
    
  3. Set your MySQL credentials as environment variables:

        export MYSQL_USERNAME="your_username"
        export MYSQL_PASSWORD="your_password"
    

Running Postgres Tests on CI

To run Postgres CI tests, add the flag [run postgres] to your commit message.

Running Postgres tests locally:

  1. Make sure you have Docker Desktop installed and running.

  2. Install the Postgres Connector for Python

    pip install psycopg2-binary
    
  3. Set your Postgres credentials as environment variables:

        export POSTGRES_DB="your_database"
        export POSTGRES_USER="your_username"
        export POSTGRES_PASSWORD="your_password"
    

Runtime Dependencies

PyDough requires having the following Python modules installed to use the library:

  • pytz, pandas, sqlglot

The full list of dependencies can be found in the pyproject.toml file.

Demo Notebooks

The demo folder contains a series of example Jupyter Notebooks that can be used to understand PyDough's capabilities. We recommend any new user start with the demo readme and then walk through the example Juypter notebooks.

Meta Visualizer

The meta_visualizer folder contains a simple web application that can be used to visualize the metadata of a PyDough Knowledge Graph. It displays the collections, properties, and relationships between collections in a knowledge graph and can be used to check the relations between collections to debug PyDough queries as well as build complex queries.

Please refer to the meta visualizer README for more information on features and usage.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pydough-1.0.18-py3-none-any.whl (492.8 kB view details)

Uploaded Python 3

File details

Details for the file pydough-1.0.18-py3-none-any.whl.

File metadata

  • Download URL: pydough-1.0.18-py3-none-any.whl
  • Upload date:
  • Size: 492.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pydough-1.0.18-py3-none-any.whl
Algorithm Hash digest
SHA256 226af34b46f301e8d45434170d6b0d897ca1cccb024fad62004390205997f607
MD5 8d11b0ab0e4b5c6ad4121a8ad1a7ea74
BLAKE2b-256 f32e4851456ffe44444a195d8791bd251fb9b6d3ad809a275cc11c084b584590

See more details on using hashes here.

Provenance

The following attestation bundles were made for pydough-1.0.18-py3-none-any.whl:

Publisher: build_pip.yml on bodo-ai/PyDough

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page