Skip to main content

videx, the Disaggregated, Extensible Virtual Index Engine for What-If Analysis

Project description

VIDEX

English | 简体中文

VIDEX: The Disaggregated, Extensible [VI]rtual in[DEX] Engine for What-If Analysis in MySQL 🚀

  • Virtual Index: Does not require real data, relies only on statistical information and algorithm models to accurately simulate query plans, table join orders, and index selections;
  • Decoupled: VIDEX can be deployed on a separate instance with no impact on the production MySQL environment.
  • Scalable: VIDEX offers convenient interfaces allowing users to apply models like cardinality and ndv to downstream MySQL tasks (e.g., index recommendation);

The virtual index (aka hypothetical index) aims to simulate the cost of indexes within SQL query plans, thereby demonstrating to users the impact of indexes on SQL plans without the need to create actual indexes on raw instances. This technology is widely applied in various SQL optimization tasks, including index recommendation and table join order optimization. As a reference, many other databases already have virtual index features from official or third-party sources, such as Postgres, Oracle, and IBM DB2.

Note: The term virtual index used here is distinct from the "virtual index" referenced in the MySQL Official Documents, which refers to indexes built on virtual generated columns.

Additionally, VIDEX encapsulates a set of standardized interfaces for cost estimation, addressing popular topics in academic research such as cardinality estimation and NDV (Number of Distinct Values) estimation. Researchers and database developers can easily integrate custom algorithms into VIDEX for optimization tasks. By default, VIDEX includes implementations based on histograms and NDV collected from the ANALYZE TABLE or small-scale data sampling.

VIDEX offers two startup modes:

  1. Plugin to production database: Install VIDEX as a plugin to the production database instance.
  2. Individual instance: This mode can completely avoid impacting the stability of online running instances, making it practical for industrial environments.

Functionally, VIDEX supports creating and deleting indexes (single-column indexes, composite indexes, EXTENDED_KEYS indexes). However, it currently does not support functional indexes, FULL-Text, and Spatial Indexes.

In terms of accuracy, we have tested VIDEX on complex analytical benchmarks such as TPC-H, TPC-H-Skew, and JOB. Given only the oracle NDV and cardinality, the VIDEX query plan is 100% identical to InnoDB. (Refer to Example: TPC-H for additional details). We expect that VIDEX can provide users with a better platform to more easily test the effectiveness of cardinality and NDV algorithms, and apply them on SQL optimization tasks.


1. Overview

VIDEX consists of two parts:

  • VIDEX-MySQL: Conducted a thorough review of over 90 interface functions in the MySQL handler, and implement the index-related parts.
  • VIDEX-Statistic-Server: The cost estimation service calculates NDV and Cardinality based on collected statistical information and estimation algorithms, and returns the results to the VIDEX-MySQL instance.

VIDEX creates an individual virtual database according to the specified target database in the raw instance, containing a series of tables with the same DDL, but replacing the engine from InnoDB to VIDEX.

Here's the English translation of the documentation for GitHub:

Quick Start

2.1 Install Python Environment

VIDEX requires Python 3.9 for metadata collection tasks. We recommend using Anaconda/Miniconda to create an isolated Python environment:

# Clone repository
VIDEX_HOME=videx_server
git clone git@github.com:bytedance/videx.git $VIDEX_HOME
cd $VIDEX_HOME

# Create and activate Python environment
conda create -n videx_py39 python=3.9
conda activate videx_py39

# Install VIDEX
python3.9 -m pip install -e . --use-pep517

2.2 Launch VIDEX (Docker Mode)

For simplified deployment, we provide pre-built Docker images containing:

  • VIDEX-MySQL: Based on Percona-MySQL 8.0.34-26 with integrated VIDEX plugin
  • VIDEX-Server: NDV and cardinality algorithm service

Install Docker

If you haven't installed Docker yet:

Launch VIDEX Container

docker run -d -p 13308:13308 -p 5001:5001 --name videx kangrongme/videx:0.0.2

Alternative Deployment Options

VIDEX also supports the following deployment methods, see Installation Guide:

  • Build complete MySQL Server from source
  • Build VIDEX plugin only and install on existing MySQL
  • Deploy VIDEX-Server independently (supports custom optimization algorithms)

Examples

TPCH-Tiny Example

This example demonstrates the complete VIDEX workflow using the TPC-H Tiny dataset (1% random sample from TPC-H sf1).

Environment Details

The example assumes all components are deployed locally via Docker:

Component Connection Info
Target-MySQL (Production DB) 127.0.0.1:13308, username:videx, password:password
VIDEX-MySQL (Plugin) Same as Target-MySQL
VIDEX-Server 127.0.0.1:5001

Step 1: Import Test Data

cd $VIDEX_HOME

# Create database
mysql -h127.0.0.1 -P13308 -uvidex -ppassword -e "create database tpch_tiny;"

# Import data
tar -zxf data/tpch_tiny/tpch_tiny.sql.tar.gz
mysql -h127.0.0.1 -P13308 -uvidex -ppassword -Dtpch_tiny < tpch_tiny.sql

Step 2: Collect and Import VIDEX Metadata

Ensure VIDEX environment is installed. If not, refer to 2.1 Install Python Environment.

cd $VIDEX_HOME
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py \
 --target 127.0.0.1:13308:tpch_tiny:videx:password \
 --videx 127.0.0.1:13308:videx_tpch_tiny:videx:password

Output:

2025-02-17 13:46:48 [2855595:140670043553408] INFO     root            [videx_build_env.py:178] - Build env finished. Your VIDEX server is 127.0.0.1:5001.
You are running in non-task mode.
To use VIDEX, please set the following variable before explaining your SQL:
--------------------
-- Connect VIDEX-MySQL: mysql -h127.0.0.1 -P13308 -uvidex -ppassword -Dvidex_tpch_tiny
USE videx_tpch_tiny;
SET @VIDEX_SERVER='127.0.0.1:5001';
-- EXPLAIN YOUR_SQL;

Metadata is now collected and imported to VIDEX-Server. The JSON file is written to videx_metadata_tpch_tiny.json.

If users have prepared metadata files, they can specify --meta_path to skip collection and import directly.

Step 3: EXPLAIN SQL

Connect to VIDEX-MySQL and execute EXPLAIN.

To demonstrate VIDEX's effectiveness, we compare EXPLAIN details for TPC-H Q21, a complex query with four-table joins involving WHERE, aggregation, ORDER BY, GROUP BY, EXISTS and self-joins. MySQL can choose from 11 indexes across 4 tables.

Since VIDEX-Server is deployed on the VIDEX-MySQL node with default port (5001), we don't need to set VIDEX_SERVER additionally. If VIDEX-Server is deployed elsewhere, execute SET @VIDEX_SERVER first.

-- SET @VIDEX_SERVER='127.0.0.1:5001'; -- Not needed for Docker deployment
EXPLAIN 
FORMAT = JSON
SELECT s_name, count(*) AS numwait
FROM supplier,
     lineitem l1,
     orders,
     nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (SELECT *
              FROM lineitem l2
              WHERE l2.l_orderkey = l1.l_orderkey
                AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (SELECT *
                  FROM lineitem l3
                  WHERE l3.l_orderkey = l1.l_orderkey
                    AND l3.l_suppkey <> l1.l_suppkey
                    AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = 'IRAQ'
GROUP BY s_name
ORDER BY numwait DESC, s_name;

We compare VIDEX and InnoDB. We use EXPLAIN FORMAT=JSON, a more strict format. We compare not only table join order and index selection but every detail of query plans (e.g., rows and cost at each step).

As shown below, VIDEX (left) generates a query plan almost 100% identical to InnoDB (right). Complete EXPLAIN results are in data/tpch_tiny.

explain_tpch_tiny_compare.png

Note that VIDEX accuracy depends on three key algorithm interfaces:

  • ndv
  • cardinality
  • pct_cached (percentage of index data loaded in memory). Can be set to 0 (cold start) or 1 (hot data) if unknown, but production instances' pct_cached may vary constantly.

A key VIDEX function is simulating index costs. We add an extra index. VIDEX's index addition cost is O(1):

ALTER TABLE tpch_tiny.orders ADD INDEX idx_o_orderstatus (o_orderstatus);
ALTER TABLE videx_tpch_tiny.orders ADD INDEX idx_o_orderstatus (o_orderstatus);

Re-running EXPLAIN shows MySQL-InnoDB and VIDEX query plans changed identically, both adopting the new index.

explain_tpch_tiny_compare_alter_index.png

VIDEX's row estimate (7404) differs from MySQL-InnoDB (7362) by ~0.56%, due to cardinality estimation algorithm error.

Finally, we remove the index:

ALTER TABLE tpch_tiny.orders DROP INDEX idx_o_orderstatus;
ALTER TABLE videx_tpch_tiny.orders DROP INDEX idx_o_orderstatus;

Example 3.2 TPCH sf1 (1g)

We provide metadata file for TPC-H sf1: data/videx_metadata_tpch_sf1.json, allowing direct import without collection.

cd $VIDEX_HOME
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py \
 --target 127.0.0.1:13308:tpch_sf1:user:password \
 --meta_path data/tpch_sf1/videx_metadata_tpch_sf1.json

Like TPCH-tiny, VIDEX generates nearly identical query plans to InnoDB for TPCH-sf1 Q21, see data/tpch_sf1.

explain_tpch_sf1_compare.png

4. API

Specify connection methods for original database and videx-stats-server. Collect statistics from original database, save to intermediate file, then import to VIDEX database.

  • If VIDEX-MySQL starts separately rather than installing plugin on target-MySQL, users can specify VIDEX-MySQL address via --videx
  • If VIDEX-Server starts separately rather than deploying on VIDEX-MySQL machine, users can specify VIDEX-Server address via --videx_server
  • If users have generated metadata file, specify --meta_path to skip collection

Command example:

cd $VIDEX_HOME/src/sub_platforms/sql_opt/videx/scripts
python videx_build_env.py --target 127.0.0.1:13308:tpch_tiny:videx:password \
[--videx 127.0.0.1:13309:videx_tpch_tiny:videx:password] \
[--videx_server 127.0.0.1:5001] \
[--meta_path /path/to/file]

🚀 5. Integrate Your Custom Model

Method 1: Add into VIDEX-Statistic-Server

Users can fully implement VidexModelBase.

If users focus on cardinality and ndv (two popular research topics), they can also inherit from VidexModelInnoDB (see VidexModelExample). VidexModelInnoDB abstracts away complexities such as system variables and index metadata formats, providing a basic (heuristic) algorithm for ndv and cardinality.

class VidexModelBase(ABC):
    """
    Abstract cost model class. VIDEX-Statistic-Server receives requests from VIDEX-MySQL for Cardinality
    and NDV estimates, parses them into structured data for ease use of developers.

    Implement these methods to inject Cardinality and NDV algorithms into MySQL.
    """

    @abstractmethod
    def cardinality(self, idx_range_cond: IndexRangeCond) -> int:
        """
        Estimates the cardinality (number of rows matching a criteria) for a given index range condition.

        Parameters:
            idx_range_cond (IndexRangeCond): Condition object representing the index range.

        Returns:
            int: Estimated number of rows that match the condition.

        Example:
            where c1 = 3 and c2 < 3 and c2 > 1, ranges = [RangeCond(c1 = 3), RangeCond(c2 < 3 and c2 > 1)]
        """
        pass

    @abstractmethod
    def ndv(self, index_name: str, table_name: str, column_list: List[str]) -> int:
        """
        Estimates the number of distinct values (NDV) for specified fields within an index.

        Parameters:
            index_name (str): Name of the index.
            table_name (str): Table Name
            column_list (List[str]): List of columns(aka. fields) for which NDV is to be estimated.

        Returns:
            int: Estimated number of distinct values.

        Example:
            index_name = 'idx_videx_c1c2', table_name= 't1', field_list = ['c1', 'c2']
        """
        raise NotImplementedError()

Method 2: Implement a New VIDEX-Statistic-Server

VIDEX-MySQL will request NDV and cardinality results via HTTP based on the user-specified address. Therefore, users can implement the HTTP response in any programming language.

License

This project is dual-licensed:

  • The MySQL engine implementation is licensed under GPL-2.0
  • All other codes and scripts are licensed under MIT

See the LICENSE directory for details.

Authors

SQLBrain Group, ByteBrain, Bytedance

Contact

If you have any questions, feel free to contact ours through email (kangrong.cn@bytedance.com, kr11thss@gmail.com).

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

videx-0.0.3.tar.gz (94.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

videx-0.0.3-py3-none-any.whl (106.8 kB view details)

Uploaded Python 3

File details

Details for the file videx-0.0.3.tar.gz.

File metadata

  • Download URL: videx-0.0.3.tar.gz
  • Upload date:
  • Size: 94.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for videx-0.0.3.tar.gz
Algorithm Hash digest
SHA256 f95860f760397ac690f6d03ee4f9cc2a3f139e33de3b4a5fdef8bd97922d1837
MD5 64954943bac8941b1384b3cce958610f
BLAKE2b-256 27e1790d87e2c5f5241ada1e50cba1a8df110bb05020e86c76e0316386f009e5

See more details on using hashes here.

File details

Details for the file videx-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: videx-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 106.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.9.21

File hashes

Hashes for videx-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 ed47eb5e68972cd75574331115fd96f69e4f31c8013f56809049c48a21746f36
MD5 23557bf01ff2df38396011c7a740f4e3
BLAKE2b-256 7a8f1f41f9b8ccfade41895fd08b592a7e835e26ff23027d160e4769c9084771

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page