data_db_extractor: Easily retrieve large amounts of data from a database
For the latest source, discussion, etc., please visit the GitLab repository.
This repository hosts the code for data_db_extractor, a tool for retrieving large amounts of data from a database for use in ML workflows.
The main features of this tool are the following:
- Simple and lean API to retrieve data from a database
- Simple to advanced filtering operations on the data
- Basic data transformations such as clipping, normalization and resampling
- Spills to disk if there is not sufficient memory available
- Runs on as many CPU cores as the system has available
- Usable either programmatically or via the terminal
- Python 3.7+
- PostgreSQL (or compatible database like TimescaleDB)
You can install data_db_extractor either with pip or from source.
Installing with pip is as simple as:
$ pip install data_db_extractor
Install from Source
To install data_db_extractor from source, clone the repository from GitLab:
$ git clone https://gitlab.com/farrajota/jungle.ai-challenge-data-extractor
$ cd jungle.ai-challenge-data-extractor
$ make install

or

$ python setup.py install
To fetch data from a database and make it ML-ready with data_db_extractor, first import the data_db_extractor() method in Python; it queries a database and stores the results in a .parquet file:
>>> from data_db_extractor import data_db_extractor
Next, pass this method the configuration needed to connect to the database and retrieve data from a target table:
>>> data_db_extractor(host_ip="localhost", port="5432", username="postgres", password="postgres", database="postgres", table="turbine")
In this example, the method connects to a database hosted on localhost at port 5432, authenticates with the given username and password, connects to the postgres database, and retrieves all rows and columns from the turbine table.
However, if you want to select only a subset of the data, you can apply simple filtering operations when fetching data from the database using the following input args: columns, timestamp_range_ini, timestamp_range_end, and limit. For example, to select only a handful of columns, use the columns arg in data_db_extractor() to retrieve only those columns:
>>> data_db_extractor(host_ip="localhost", port="5432", username="postgres", password="postgres", database="postgres", table="turbine", columns=["generator___mechanical___speed", "generator___temperature"])
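Conceptually, these filtering arguments map onto a SQL SELECT statement. The function below is only an illustrative sketch of how such a query could be assembled, not the package's actual implementation, and the timestamp column name is an assumption:

```python
def build_query(table, columns=None, timestamp_range_ini=None,
                timestamp_range_end=None, limit=0):
    """Sketch: translate the filtering args into a SQL query string."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    conditions = []
    if timestamp_range_ini:
        # "timestamp" is an assumed column name for illustration only.
        conditions.append(f"timestamp >= '{timestamp_range_ini}'")
    if timestamp_range_end:
        conditions.append(f"timestamp <= '{timestamp_range_end}'")
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    if limit > 0:
        query += f" LIMIT {limit}"
    return query
```

A real implementation would use parameterized queries rather than string interpolation to avoid SQL injection; the sketch only shows how the args narrow the result set.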
You can also apply simple transformations to the data, such as value clipping, normalization, and resampling to a target frequency (e.g., 10s, 1min, 1h, 1d). To do this, use the following input args of data_db_extractor(): norm, clip, and resample. For example, to resample the retrieved data to one-minute intervals, call data_db_extractor() with the following arguments:
>>> data_db_extractor(host_ip="localhost", port="5432", username="postgres", password="postgres", database="postgres", table="turbine", resample="1min")
The resampling transformation squashes all rows in a time range and computes metrics such as mean, std, min, max, and median for each column.
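To illustrate what these transformations do, here is a rough pandas equivalent on hypothetical data (the package's internals may differ): resampling aggregates each time bucket into the listed metrics, clip bounds values at N standard deviations from the mean, and norm applies a z-score normalization:

```python
import numpy as np
import pandas as pd

# Hypothetical 10-second sensor readings, similar to a "turbine" table.
idx = pd.date_range("2021-01-01", periods=12, freq="10s")
df = pd.DataFrame({"generator___mechanical___speed": np.arange(12.0)},
                  index=idx)

# resample="1min": squash each one-minute bucket into summary metrics.
agg = df.resample("1min").agg(["mean", "std", "min", "max", "median"])

# clip=2.0: bound values at 2 standard deviations from the mean.
s = df["generator___mechanical___speed"]
clipped = s.clip(lower=s.mean() - 2 * s.std(), upper=s.mean() + 2 * s.std())

# norm: z-score normalization (zero mean, unit variance).
normed = (s - s.mean()) / s.std()
```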
For more information, see the official documentation.
You can perform the same operations available in the Python data_db_extractor module from the terminal. Installing the package also installs a CLI on your system, which lets you start using the tool with little to no effort by calling data-db-extractor in the terminal.
To list all available options, simply run:
$ data-db-extractor --help
Usage: data-db-extractor [OPTIONS]

  Easily retrieve large amounts of data from a database.

Options:
  --host TEXT                 IP Address of the database host.
  --port TEXT                 Port number.
  --username TEXT             User name registered on the database.
  --password TEXT             Password of the username.
  --database TEXT             Database name to connect to.
  --table TEXT                Table name to perform the queries.
  --sql-command TEXT          A custom sql query the user can pass to the
                              database for more complex operations.
  --timestamp_range_ini TEXT  Lower bound range to include rows by timestamp.
  --timestamp_range_end TEXT  Upper bound range to include rows by timestamp.
  --columns TEXT              Columns to be selected from the database.
  --limit INTEGER             Limit of the number of rows returned (if > 0).
  --norm                      Computes the normalization transformation on
                              the retrieved data.
  --clip FLOAT                Clips values that are below / above *N*
                              standard deviations from the mean.
  --resample TEXT             Computes a resampling of the data based on a
                              given frequency.
  --output-path TEXT          Directory to store the generated files.
  --help                      Show this message and exit.
As you can see, it has the same options as the data_db_extractor() method in the data_db_extractor module. Using the last example from the previous section, we can resample the retrieved data to one-minute intervals by typing the following command in the terminal:
$ data-db-extractor \
    --host localhost \
    --port 5432 \
    --username postgres \
    --password postgres \
    --database postgres \
    --table turbine \
    --resample 1min
So, computing what you need is as simple as configuring the connection to the database and setting the transformations you want to perform on the retrieved data; the tool does the rest.
Documentation for the package can be found here.