Collection of tools for schema parsing and workload generation used by MongoDB Research
mdbrtools
This package contains experimental tools for schema analysis and query workload generation used by MongoDB Research (MDBR).
Disclaimer
This tool is not officially supported or endorsed by MongoDB Inc. The code is released for use "AS IS" without any warranties of any kind, including, but not limited to its installation, use, or performance. Do not run this tool in critical production systems.
Installation
Installation with pip
This tool requires Python 3.x and pip on your system. To install mdbrtools, run the following command:
pip install mdbrtools
Installation from source
Clone the repository from GitHub. From the top-level directory, run:
pip install -e .
This installs an editable development version of mdbrtools in your current Python environment.
Usage
See the ./notebooks directory for more detailed examples of schema parsing and workload generation.
Schema Parsing
Schema parsing operates on a list of Python dictionaries.
from mdbrtools.schema import parse_schema
from pprint import pprint
docs = [
    {"_id": 1, "mixed_field": "world", "missing_field": False},
    {"_id": 2, "mixed_field": 123},
    {"_id": 3, "mixed_field": False, "missing_field": True},
]
schema = parse_schema(docs)
pprint(dict(schema))
Converting the schema object to a dictionary will output some general information about the schema:
{'_id': [{'counter': 3, 'type': 'int'}],
 'missing_field': [{'counter': 2, 'type': 'bool'}],
 'mixed_field': [{'counter': 1, 'type': 'str'},
                 {'counter': 1, 'type': 'int'},
                 {'counter': 1, 'type': 'bool'}]}
For access to types, values and uniqueness information, see the examples in ./notebooks/schema_parsing.ipynb.
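To make the per-field counters in the output above easier to read, here is a minimal pure-Python sketch that produces a summary of the same shape for top-level fields. It is not how mdbrtools implements parse_schema; the helper name count_field_types is hypothetical, and it relies on Python's own type names, which happen to match the type labels shown above.

```python
from collections import Counter, defaultdict


def count_field_types(docs):
    """Count, per top-level field, how often each Python type name occurs.

    Hypothetical helper for illustration only; not part of mdbrtools.
    """
    counts = defaultdict(Counter)
    for doc in docs:
        for field, value in doc.items():
            counts[field][type(value).__name__] += 1
    # Mirror the shape of the output above: field -> list of {counter, type}.
    return {
        field: [{"counter": n, "type": t} for t, n in type_counts.items()]
        for field, type_counts in counts.items()
    }


docs = [
    {"_id": 1, "mixed_field": "world", "missing_field": False},
    {"_id": 2, "mixed_field": 123},
    {"_id": 3, "mixed_field": False, "missing_field": True},
]
summary = count_field_types(docs)
```

A counter lower than the number of documents (as for missing_field) means the field is absent from some documents, and more than one entry per field (as for mixed_field) means the field holds values of several types.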
Workload Generation
Workload generation takes either a list of Python dictionaries or a MongoCollection object as input.
from mdbrtools.workload import Workload
docs = [
    {"_id": 1, "mixed_field": "world", "missing_field": False},
    {"_id": 2, "mixed_field": 123},
    {"_id": 3, "mixed_field": False, "missing_field": True},
]
workload = Workload()
workload.generate(docs, num_queries=5)
for query in workload:
    print(query.to_mql())
The generated MQL queries are:
{'missing_field': True}
{'missing_field': {'$exists': False}, '_id': {'$gte': 3}}
{'_id': {'$gt': 3}, 'mixed_field': False, 'missing_field': {'$exists': False}}
{'mixed_field': {'$gte': 'world'}, '_id': 3, 'missing_field': {'$ne': False}}
{'mixed_field': 'world'}
The workload generator supports a number of different constraints on the queries:
- minimum and maximum number of predicates per query
- restricting queries to certain fields
- which query operators are allowed for which data types
- the weights by which operators are randomly chosen
- minimum and maximum query selectivity
See the notebook ./notebooks/workload_generation.ipynb for examples.
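As a rough illustration of what a selectivity constraint refers to, the sketch below assumes that selectivity means the fraction of documents a query matches. The helpers matches and selectivity are hypothetical and not part of mdbrtools. They handle only exact equality and $exists, and they compare types strictly, which is simpler than MongoDB's real matching rules (for example, MongoDB treats the integer 1 and the double 1.0 as equal).

```python
def matches(doc, query):
    """Return True if doc satisfies every predicate in query.

    Hypothetical helper: supports only exact equality and {'$exists': bool}.
    """
    for field, condition in query.items():
        if isinstance(condition, dict):
            if set(condition) != {"$exists"}:
                raise NotImplementedError(
                    f"unsupported operator(s): {sorted(condition)}"
                )
            if (field in doc) != condition["$exists"]:
                return False
        else:
            # Compare the type as well as the value so that True does not
            # match 1 (in Python, True == 1 evaluates to True).
            if (
                field not in doc
                or type(doc[field]) is not type(condition)
                or doc[field] != condition
            ):
                return False
    return True


def selectivity(docs, query):
    """Fraction of docs matched by query (0.0 for an empty collection)."""
    if not docs:
        return 0.0
    return sum(matches(doc, query) for doc in docs) / len(docs)


docs = [
    {"_id": 1, "mixed_field": "world", "missing_field": False},
    {"_id": 2, "mixed_field": 123},
    {"_id": 3, "mixed_field": False, "missing_field": True},
]

for query in (
    {"missing_field": True},
    {"missing_field": {"$exists": True}},
    {"mixed_field": "world", "_id": 1},
):
    print(query, selectivity(docs, query))
```

Under this definition, a query that matches one of the three sample documents has a selectivity of one third, so minimum and maximum bounds would filter out queries that match too few or too many documents.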
Tests
To execute the unit tests, run from the top-level directory:
python -m unittest discover ./tests
License
MIT, see LICENSE.