mongo2pq
A quick and dirty script that migrates a MongoDB database to parquet files.
I wrote this after being unsatisfied with the existing options for exporting
larger MongoDB collections. There is the mongoexport command from the database
tools, but it tends to be unstable and often fails for larger collections.
There are tools such as Fivetran and Airbyte, but these are big, heavy,
monolithic programs (and, in the case of Fivetran, proprietary) that are
expensive to set up and manage. Moreover, they are not easy to connect to other
components you might have in your technology stack. Finally, I stumbled upon
dlt. I very much liked the philosophy and approach of this tool. However, in
its current form it didn't work well for me. Its understandable complexity
requires middle layers during the extraction and loading process, and that
causes sub-optimal performance for large collections.
Hence, I wrote a simple and efficient extractor and loader that is very much focused on MongoDB and Apache Parquet.
Architecture
This script simply uses the asynchronous Python driver for MongoDB, motor, and
the Python implementation of Apache Arrow. The main idea is that there is no
middle layer: using motor we load a chunk of data from a particular collection
and immediately dump it into the corresponding parquet file. This lowers the
demand for both memory and storage, even for large collections. Moreover, using
the asyncio Python library speeds the process up considerably, since the
biggest bottleneck in an EL process tends to be the transfer of data from the
database. Therefore, we simply request multiple chunks at once and process them
as they arrive.
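The "request multiple chunks at once and process them as they arrive" idea can be illustrated with a minimal asyncio sketch. It is not the code of mongo2pq: fetch_chunk is a hypothetical stand-in for a motor query, and the parquet write is reduced to counting rows, so the example runs with the standard library alone.

```python
import asyncio


async def fetch_chunk(chunk_id: int, size: int) -> list[dict]:
    """Hypothetical stand-in for a motor query that returns one chunk."""
    # Simulate uneven network latency so chunks arrive out of order.
    await asyncio.sleep(0.01 * (3 - chunk_id % 3))
    start = chunk_id * size
    return [{"_id": i} for i in range(start, start + size)]


async def export(n_chunks: int, size: int) -> list[int]:
    """Request all chunks at once and handle each one as soon as it arrives."""
    tasks = [asyncio.create_task(fetch_chunk(i, size)) for i in range(n_chunks)]
    rows_written: list[int] = []
    for finished in asyncio.as_completed(tasks):
        docs = await finished
        # A real exporter would append docs to a parquet file here.
        rows_written.append(len(docs))
    return rows_written


rows_per_chunk = asyncio.run(export(n_chunks=6, size=100))
```

Because each chunk is handled and dropped as soon as it arrives, only the chunks currently in flight need to be held in memory.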
Installation
mongo2pq supports Python 3.11+. You can install it with pip:
pip install mongo2pq
This will add the command mongo2pq to your $PATH.
The dependencies of the project are handled with poetry. Hence, for a
development installation, you can also use poetry. Make sure you have poetry
available on your system, then clone the repository and run poetry install.
This will also install the development dependencies, which include ipython
(which is quite heavy). To install without this group, run
poetry install --without dev
Usage
The simplest run is started with
mongo2pq -u <URI> -o <OUTDIR> -d <DB> -c <COLLECTION1> <COLLECTION2>
where URI is the connection string for the MongoDB instance (such as
mongodb://user:passwd@ip:port/opts), OUTDIR is where the script will save the
parquet files and the yaml schema files, and DB and COLLECTION1, COLLECTION2
are the database and its collections to download. See more options with
mongo2pq --help.
You can also specify the URI with the environment variable MONGODB_URI.
Without specifying collections, all collections in a database will be exported, and without specifying a database, all databases and their collections will be exported (note that to list DBs you need root access to MongoDB).
The script will infer the schema of the database for you by sampling the
collections. For large collections you might need a large sample set (keys
missing from the schema during export are dropped). This inference can take
significant time since it relies on the MongoDB sample operation, which is
expensive. In case the export needs to be repeated or there is an error during
the export, the schema files are saved as yaml and can be reused the next time
you run the script.
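To see why a small sample can silently lose fields, here is a minimal sketch of schema inference from sampled documents. The functions infer_schema and conform are hypothetical names for illustration, not the script's API, and Python type names stand in for the Arrow types a real schema would hold.

```python
from typing import Any


def infer_schema(sample: list[dict[str, Any]]) -> dict[str, str]:
    """Map each key seen in the sample to the type name of its first non-null value."""
    schema: dict[str, str] = {}
    for doc in sample:
        for key, value in doc.items():
            if value is not None:
                schema.setdefault(key, type(value).__name__)
    return schema


def conform(doc: dict[str, Any], schema: dict[str, str]) -> dict[str, Any]:
    """Keep only keys known to the schema; keys outside it are dropped."""
    return {key: doc.get(key) for key in schema}


# The sample happens to contain no document with an "altitude" key.
sample = [{"_id": "a1", "speed": 3.5}, {"_id": "a2", "speed": 4.0}]
schema = infer_schema(sample)

# During export, a document carrying the unseen key loses it.
exported = conform({"_id": "a3", "speed": 1.0, "altitude": 120}, schema)
```

Here altitude never appeared in the sample, so it is absent from the schema and dropped from the exported row; a larger sample makes this less likely.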
Partitioning
Parquet allows you to partition your data. This script supports partitioning
through the option -p or --partition, which takes the partition key as its
argument. Then, the parquet files will be stored with the structure
collection_name.parquet
|
|-partition_key=value1
| |
| |-data.parquet
|
|-partition_key=value2
| |
| |-data.parquet
|
...
Taking advantage of partitioning in parquet has many known benefits, but the
added bonus from the perspective of this script is that it allows it to fully
take advantage of asyncio, since each partition dataset can be extracted
independently. With at least a few partitions, the extraction performance
increases significantly.
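The directory layout above can be sketched as a mapping from a partition value to a file path. This is only an illustration of the layout, with hypothetical helper names (partition_file, group_by_partition) and a made-up partition key device; it does not write any parquet data.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_file(collection: str, partition_key: str, value: object) -> PurePosixPath:
    """Build collection.parquet/partition_key=value/data.parquet."""
    return PurePosixPath(
        f"{collection}.parquet", f"{partition_key}={value}", "data.parquet"
    )


def group_by_partition(docs: list[dict], partition_key: str) -> dict:
    """Split documents into independent groups, one per partition value."""
    groups: dict = defaultdict(list)
    for doc in docs:
        groups[doc[partition_key]].append(doc)
    return dict(groups)


docs = [
    {"device": "d1", "speed": 1.0},
    {"device": "d2", "speed": 2.0},
    {"device": "d1", "speed": 3.0},
]
groups = group_by_partition(docs, "device")
paths = {
    value: str(partition_file("telemetry_data", "device", value)) for value in groups
}
```

Each group maps to its own file, which is what lets the partitions be extracted independently of one another.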
Configuration
Sometimes, you might need to make small modifications to the original schema.
This can be done with a config file, which also uses the yaml format. As of
now, the config file accepts only config for the schema. To specify the schema
config, the root keyword is schema, and the next keyword is the name of the
collection to which the config applies. Finally, you specify all the
transformations of item fields as yaml lists. A sample config would be:
schema:
  telemetry_data:
    - type: retype_regex
      fieldname: (?<!string)_id
      fieldtype: string
    - type: retype_contains
      fieldname: flap_orientation
      fieldtype: float
    - type: rename_regex_upper
      oldname: (\S+)@(\S+)
      newname: \2__\1
      upper: [2]
The list part of the yaml config contains the schema transformations. Right
now, two types of transformations are supported: retype, which changes the type
of a field from the inferred type (or the type in an input schema), and rename,
which renames an item key from the key in the database.
The retype can be one of retype_regex, where fieldname is a regex to search for
in a key, retype_contains, which searches for the substring fieldname in a key,
and retype_equals, which retypes a specific key.
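The three retype variants differ only in how a key is matched. The sketch below shows one way this matching could work; it is an assumption about the semantics, not the script's implementation, and the schema is simplified to a dict of key to type name with illustrative type names.

```python
import re


def matches(rule: dict, key: str) -> bool:
    """Decide whether a retype rule applies to a schema key."""
    kind, pattern = rule["type"], rule["fieldname"]
    if kind == "retype_regex":
        return re.search(pattern, key) is not None
    if kind == "retype_contains":
        return pattern in key
    if kind == "retype_equals":
        return pattern == key
    raise ValueError(f"unknown retype rule: {kind}")


def apply_retypes(schema: dict[str, str], rules: list[dict]) -> dict[str, str]:
    """Return a copy of the schema with every matching key retyped."""
    result = dict(schema)
    for rule in rules:
        for key in result:
            if matches(rule, key):
                result[key] = rule["fieldtype"]
    return result


rules = [
    {"type": "retype_regex", "fieldname": r"(?<!string)_id", "fieldtype": "string"},
    {"type": "retype_contains", "fieldname": "flap_orientation", "fieldtype": "float"},
]
# Illustrative inferred schema; the type names are placeholders.
inferred = {
    "_id": "objectid",
    "device_id": "int64",
    "flap_orientation_left": "int64",
    "speed": "double",
}
retyped = apply_retypes(inferred, rules)
```

With the two retype rules from the sample config, _id and device_id become string, flap_orientation_left becomes float, and speed keeps its inferred type.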
The rename can be one of rename_regex, which takes a Python regex in the
oldname field, usually with some match groups, and a newname which may
reference the groups from oldname, and rename_regex_upper, which has an
additional key upper that allows you to transform a match group to upper case.