videx, the Disaggregated, Extensible Virtual Index Engine for What-If Analysis
Project description
VIDEX
VIDEX: The Disaggregated, Extensible [VI]rtual in[DEX] Engine for What-If Analysis in MySQL 🚀
- Virtual Index: Does not require real data, relies only on statistical information and algorithm models to accurately simulate query plans, table join orders, and index selections;
- Decoupled: VIDEX can be deployed on a separate instance with no impact on the production MySQL environment.
- Scalable: VIDEX offers convenient interfaces allowing users to apply models like
cardinalityandndvto downstream MySQL tasks (e.g., index recommendation);
The virtual index (aka hypothetical index) aims to simulate the cost of indexes within SQL query plans,
thereby demonstrating to users the impact of indexes on SQL plans without the need to create actual indexes on raw instances.
This technology is widely applied in various SQL optimization tasks, including index recommendation and table join order optimization.
As a reference, many other databases already have virtual index features from official or third-party sources,
such as Postgres,
Oracle,
and IBM DB2.
Note: The term
virtual indexused here is distinct from the "virtual index" referenced in the MySQL Official Documents, which refers to indexes built on virtual generated columns.
Additionally, VIDEX encapsulates a set of standardized interfaces for cost estimation,
addressing popular topics in academic research such as cardinality estimation and NDV (Number of Distinct Values) estimation.
Researchers and database developers can easily integrate custom algorithms into VIDEX for optimization tasks.
By default, VIDEX includes implementations based on histograms and NDV collected from the ANALYZE TABLE or small-scale data sampling.
VIDEX offers two startup modes:
- Plugin to production database: Install VIDEX as a plugin to the production database instance.
- Individual instance: This mode can completely avoid impacting the stability of online running instances, making it practical for industrial environments.
Functionally, VIDEX supports creating and deleting indexes (single-column indexes, composite indexes, EXTENDED_KEYS indexes). However, it currently does not support functional indexes, FULL-Text, and Spatial Indexes.
In terms of accuracy, we have tested VIDEX on complex analytical benchmarks such as TPC-H, TPC-H-Skew, and JOB.
Given only the oracle NDV and cardinality, the VIDEX query plan is 100% identical to InnoDB.
(Refer to Example: TPC-H for additional details).
We expect that VIDEX can provide users with a better platform to more easily test the effectiveness of cardinality and NDV algorithms, and apply them on SQL optimization tasks.
Overview
VIDEX consists of two parts:
- VIDEX-MySQL: Conducted a thorough review of over 90 interface functions in the MySQL handler, and implement the index-related parts.
- VIDEX-Statistic-Server: The cost estimation service calculates NDV and Cardinality based on collected statistical information and estimation algorithms, and returns the results to the VIDEX-MySQL instance.
VIDEX creates an individual virtual database according to the specified target database in the raw instance,
containing a series of tables with the same DDL, but replacing the engine from InnoDB to VIDEX.
Startup VIDEX-MySQL and VIDEX-Statistic-Server
From Docker Image
docker run -itd --name videx -p 13383:3306 -p 5001:5001 \
--entrypoint=/bin/bash hub.byted.org/boe/toutiao.mysql.sqlbrain_parse_80:54a3bf649b5c6e0795954669ee4447b9 \
-c "cd /opt/tiger/mysql-server && bash init_start.sh"
The dockerhub version will come soon.
Considering the complexity of compiling VIDEX-MySQL, a Docker image has been created for ease of use. This image includes both the VIDEX-MySQL instance and the VIDEX-Statistic-Server, with the VIDEX engine plugin already installed. It is based on Percona-MySQL release-8.0.34-26, where Percona-MySQL is a fully compatible, enhanced version of MySQL.
VIDEX-Statistic-Server and VIDEX-MySQL are decoupled; users can add new cost estimation algorithms (NDV, cardinality, index cache pct), start their own VIDEX-Statistic-Server, and specify the IP of the VIDEX Statistic Server when executing queries.
From Source Code
Compile VIDEX-MySQL Plugin
Clone the MySQL or Percona server (verified with MySQL-server 8.0+ and Percona-server 8.0.34-26+).
MySQL8_HOME=MySQL8_Server_Source
# mysql
git clone --depth=1 --recursive -b 8.0 https://github.com/mysql/mysql-server.git $MySQL8_HOME
# percona
git clone --depth=1 --recursive -b release-8.0.34-26 https://github.com/percona/percona-server.git $MySQL8_HOME
copy VIDEX-MySQL codes into $MySQL8_HOME/storage:
cp -r $VIDEX_HOME/src/mysql/videx $MySQL8_HOME/storage
Generate the necessary Makefile with cmake.
cmake .. \
-B./build \
-DWITH_DEBUG=0 \
-DCMAKE_INSTALL_PREFIX=. \
-DMYSQL_DATADIR=./data \
-DSYSCONFDIR=./etc \
-DWITH_BOOST=../boost \
-DDOWNLOAD_BOOST=1 \
-DWITH_ROCKSDB=OFF
Navigate to the videx directory and compile videx individually.
cd $MySQL8_HOME/build/storage/videx/
make -j `nproc`
Store the generated ha_videx.so in plugin_dir:
SHOW VARIABLES LIKE "%plugin%"
-> ;
+-----------------------------------------------+-----------------------------------------------+
| Variable_name | Value |
+-----------------------------------------------+-----------------------------------------------+
| default_authentication_plugin | caching_sha2_password |
| plugin_dir | /root/mysql8/lib/plugin/ |
| replication_optimize_for_static_plugin_config | OFF |
+-----------------------------------------------+-----------------------------------------------+
cp ha_videx.so /root/mysql8/lib/plugin/
Install the plugin.
INSTALL PLUGIN VIDEX SONAME 'ha_videx.so';
UNINSTALL PLUGIN VIDEX;
Verify that VIDEX has been installed. You will see a new engine VIDEX.
SHOW ENGINES;
Startup Videx-Stats-Server
We recommend using Anaconda or Miniconda to create a standalone Python environment, then install VIDEX.
cd $VIDEX_HOME
conda create -n videx_py39 python=3.9
conda activate videx_py39
python3.9 -m pip install -e . --use-pep517
Specify the port for Videx-Stats-Server and start the service.
cd $VIDEX_HOME/src/sub_platforms/sql_opt/videx/scripts
python start_videx_server.py --port 5001
Import VIDEX Metadata and Do EXPLAIN
Specify the connection details for the original database and the videx-stats-server. Gather statistics from the original database, save them to an intermediate file, then import them into the VIDEX database.
- If VIDEX-MySQL is started independently rather than as a plugin on the target-MySQL, users can specify the VIDEX-MySQL address using the
--videxparameter.- If VIDEX-Server is started independently rather than being deployed on the same machine as VIDEX-MySQL, users can specify the VIDEX-Server address using the
--videx_serverparameter.- If metadata files have already been generated, users can specify the
--meta_pathparameter to skip the collection process.
cd $VIDEX_HOME/src/sub_platforms/sql_opt/videx/scripts
python videx_build_env.py --target 127.0.0.1:13383:tpch_sf1:user:password \
[--target 127.0.0.1:13309:videx_tpch_sf1:user:password] \
[--videx_server 127.0.0.1:5001] \
[--meta_path /path/to/file]
You can use MySQL's native DDL syntax to create indexes, without any adaption and modification.
ALTER TABLE t1 ADD INDEX idx_videx_c1c2(c1, c2);
The only difference introduced by VIDEX is that you need to set the address of the videx-stats-server before querying.
Then, you can then use EXPLAIN SQL to obtain the query plan and see the impact of VIDEX virtual indexes.
- The default value for
VIDEX_SERVERis127.0.0.1:5001.- If VIDEX-MySQL and VIDEX-Server are deployed on the same instance or machine, there is no need to specify
SET @VIDEX_SERVER.
SET @VIDEX_SERVER='127.0.0.1:5001';
EXPLAIN select * from t1 where c2 > 10 and c1 = 5
Explain results are displayed as follows:
+----+-------------+-------+------------+-------+------------------------+---------+---------+-------+------+----------+--------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+------------------------+---------+---------+-------+------+----------+--------+
| 1 | SIMPLE | t1 | <null> | const | PRIMARY,idx_videx_c1c2 | PRIMARY | 4 | const | 1 | 100.0 | <null> |
+----+-------------+-------+------------+-------+------------------------+---------+---------+-------+------+----------+--------+
Use MySQL's native DDL syntax to delete indexes.
ALTER TABLE t1 DROP INDEX idx_videx_c1c2(c1, c2);
Example: TPC-H
TPC-H tiny
In this example, we will start by importing data from TPC-H-tiny, which randomly samples 1% of the data from TPC-H sf1(1g), to demonstrate how to use VIDEX.
To demonstrate the effectiveness of VIDEX, we compare the explain details on TPC-H Q21,
a complex query involving a 4-table join that includes various elements such as WHERE, aggregation, ORDER BY,
GROUP BY, EXISTS, and self-join. There are 11 indexes on 4 tables available for MySQL to choose from.
Prepare VIDEX Environment
Given a target database (target-MySQL), users can independently start VIDEX-MySQL and VIDEX-Server on any node.
Particularly, launching VIDEX through Docker is the simplest approach. The VIDEX Docker container includes
both VIDEX-MySQL and VIDEX-Server deployed on the same instance, simplifying many parameter configurations.
# TODO: The following image is ByteDance's image. In the next step, it will be replaced with the Docker Hub image.
docker run -itd --name videx -p 13383:3306 -p 5001:5001 \
--entrypoint=/bin/bash hub.byted.org/boe/toutiao.mysql.sqlbrain_parse_80:54a3bf649b5c6e0795954669ee4447b9 \
-c "cd /opt/tiger/mysql-server && bash init_start.sh"
We assume the user environment is as follows:
target-MySQL: The target instance (production database) with the address 127.0.0.1:13383:tpch_tiny:user:password.VIDEX-MySQL: An instance with the VIDEX plugin installed, located on the production database with the same address.VIDEX-Server: The VIDEX metadata and algorithm server installed on the same node, running on the default port. Address: 127.0.0.1:5001.
Import TPCH-Tiny
Import TPCH-tiny.sql into the target instance.
cd $VIDEX_HOME
mysql -h127.0.0.1 -P13383 -uroot -ppassword -e "create database tpch_tiny;"
tar -zxf data/tpch_tiny/tpch_tiny.sql.tar.gz
mysql -h127.0.0.1 -P13383 -uroot -ppassword -Dtpch_tiny < tpch_tiny.sql
Import VIDEX Metadata
pip install -e . -r requirements.txt # if the python env hasn't installed
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py \
--target 127.0.0.1:13383:tpch_sf1:user:password
The output is as follows:
2025-02-17 13:46:48 [2855595:140670043553408] INFO root [videx_build_env.py:178] - Build env finished. Your VIDEX server is 127.0.0.1:5001.
You are running in non-task mode.
To use VIDEX, please set the following variable before explaining your SQL:
--------------------
-- Connect VIDEX-MySQL: mysql -h127.0.0.1 -P13383 -uroot -ppassowrd -Dvidex_tpch_tiny
USE videx_tpch_tiny;
SET @VIDEX_SERVER='127.0.0.1:5001';
-- EXPLAIN YOUR_SQL;
Now the metadata file has already been written to videx_metadata_tpch_tiny.json and imported into VIDEX-Server.
If the metadata file is prepared in advance, users can specify --meta_path to bypass the collection process.
EXPLAIN SQL
Connect to VIDEX-MySQL and execute EXPLAIN.
Since VIDEX-Server is deployed on the same node as VIDEX-MySQL and is running on the default port (5001), there is no need to set VIDEX_SERVER additionally.
-- SET @VIDEX_SERVER='127.0.0.1:5001';
EXPLAIN
FORMAT = JSON
SELECT s_name, count(*) AS numwait
FROM supplier,
lineitem l1,
orders,
nation
WHERE s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (SELECT *
FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (SELECT *
FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND s_nationkey = n_nationkey
AND n_name = 'IRAQ'
GROUP BY s_name
ORDER BY numwait DESC, s_name;
We compared VIDEX with InnoDB using EXPLAIN FORMAT=JSON, a more rigorous format.
We evaluated not only the table join order and index selection,
but also every detail of the query plan, including the number of rows and cost at each step.
As shown in the following image, VIDEX (left) can generate a query plan almost 100% the same as InnoDB (right).
The complete EXPLAIN result files are located in data/explain_result.
Note that, The simulation accuracy of VIDEX dependencies on three crucial interfaces:
ndvcardinalitypct_cached(the percentage of the index loaded into memory).
TPC-H sf1
We also prepared a metadata file for TPC-H sf1: data/videx_metadata_tpch_sf1.json.
cd $VIDEX_HOME
python src/sub_platforms/sql_opt/videx/scripts/videx_build_env.py \
--target 127.0.0.1:13383:tpch_sf1:user:password --videx_server 5001 \
--meta_path data/videx_metadata_tpch_sf1.json
Consistent with TPCH-tiny, VIDEX can generate a query plan for TPCH-sf1 Q21 that is almost identical to InnoDB's,
as detailed in data/tpch_sf1.
🚀 Integrate Your Custom Model
Method 1: Add into VIDEX-Statistic-Server
Users can fully implement VidexModelBase.
If users focus on cardinality and ndv (two popular research topics),
they can also inherit from VidexModelInnoDB (see VidexModelExample).
VidexModelInnoDB abstracts away complexities such as system variables
and index metadata formats, providing a basic (heuristic) algorithm for ndv and cardinality.
class VidexModelBase(ABC):
"""
Abstract cost model class. VIDEX-Statistic-Server receives requests from VIDEX-MySQL for Cardinality
and NDV estimates, parses them into structured data for ease use of developers.
Implement these methods to inject Cardinality and NDV algorithms into MySQL.
"""
@abstractmethod
def cardinality(self, idx_range_cond: IndexRangeCond) -> int:
"""
Estimates the cardinality (number of rows matching a criteria) for a given index range condition.
Parameters:
idx_range_cond (IndexRangeCond): Condition object representing the index range.
Returns:
int: Estimated number of rows that match the condition.
Example:
where c1 = 3 and c2 < 3 and c2 > 1, ranges = [RangeCond(c1 = 3), RangeCond(c2 < 3 and c2 > 1)]
"""
pass
@abstractmethod
def ndv(self, index_name: str, table_name: str, column_list: List[str]) -> int:
"""
Estimates the number of distinct values (NDV) for specified fields within an index.
Parameters:
index_name (str): Name of the index.
table_name (str): Table Name
column_list (List[str]): List of columns(aka. fields) for which NDV is to be estimated.
Returns:
int: Estimated number of distinct values.
Example:
index_name = 'idx_videx_c1c2', table_name= 't1', field_list = ['c1', 'c2']
"""
raise NotImplementedError()
Method 2: Implement a New VIDEX-Statistic-Server
VIDEX-MySQL will request NDV and cardinality results via HTTP based on the user-specified address. Therefore, users can implement the HTTP response in any programming language.
License
This project is dual-licensed:
- The MySQL engine implementation is licensed under GPL-2.0
- All other codes and scripts are licensed under MIT
See the LICENSE directory for details.
Authors
SQLBrain Group, ByteBrain, Bytedance
Contact
If you have any questions, feel free to contact ours through email (kangrong.cn@bytedance.com, kr11thss@gmail.com).
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file videx-0.0.1.tar.gz.
File metadata
- Download URL: videx-0.0.1.tar.gz
- Upload date:
- Size: 97.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
72ce89a92958ca346a6bac51867fd525c8e49ab0325380e011aa9c5dd270ae57
|
|
| MD5 |
6b882641e51439802ea8005a1e1c1802
|
|
| BLAKE2b-256 |
3d59bfcdd0d899082be6f87f465caf46947e635573086416e330814983b34bb2
|
File details
Details for the file videx-0.0.1-py3-none-any.whl.
File metadata
- Download URL: videx-0.0.1-py3-none-any.whl
- Upload date:
- Size: 107.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.21
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8b683023ff158c7a599985bfe0448e18072df37a8ddc8de63660575edfab32a8
|
|
| MD5 |
009fcc9175e0ffe9999f798877395692
|
|
| BLAKE2b-256 |
c27ce05d806a6d822c3040fa67e0efa6a6d097506df629845c19dba5e111949d
|