Skip to main content

Retrieve data from a database for large amounts of data.

Project description

Note

For the latest source, discussion, etc, please visit the GitLab repository

data_db_extractor: Easily retrieve large amounts of data from a database

This repository hosts the code for the data_db_extractor tool used for retrieving large amounts of data from a database to be used for ML workflows.

The main features of this tool are the following:

  • Simple and lean API to retrieve data from a database
  • Simple to advanced filtering operations on the data
  • Basic data transformations such as clipping, normalization and resampling
  • Spills to disk if there is not sufficient memory available
  • Runs on as many cpu cores as the system has available for use
  • You can use it either programmatically or via a terminal

Requirements

  • Python 3.7+
  • PostgreSQL (or compatible database like TimescaleDB)

Installation

You can install data_db_extractor using either pip or by installing from source.

Pip

Installing from pip is simple and easy as:

$ pip install data_db_extractor

Install from Source

To install data_db_extractor from source, clone the repository from GitLab:

$ git clone https://gitlab.com/farrajota/jungle.ai-challenge-data-extractor
$ cd jungle.ai-challenge-data-extractor
$ make install
or
$ python setup.py install

How-to use

Via Python

To fetch data from a database and perform transformations to make it ML-ready using data_db_extractor, you first need to import the package lib in Python and access the data_db_extractor() method to query a database and store the results into a .parquet file:

>>> from data_db_extractor import data_db_extractor

Next, to use this method, you are required to pass the necessary configs to connect to the database and retrieve data from a target table:

>>> data_db_extractor(host_ip="localhost",
                      port="5432",
                      username="postgres",
                      password="postgres",
                      database="postgres",
                      table="turbine")

In this example, the method will connect to a database hosted in localhost in the port 5432 and use the username and password to connect to the postgres database and query the turbine table and retrieve all rows and columns from the target table.

However, if you want to select only a subset of the data, you can apply some simple filtering operations when fetching data from the database using the following input args: columns, timestamp_range_ini, timestamp_range_end, and limit. For example, if you want to select only a handfull of columns, you can use the columns agr in data_db_extractor() to retrieve only the information of those columns:

>>> data_db_extractor(host_ip="localhost",
                      port="5432",
                      username="postgres",
                      password="postgres",
                      database="postgres",
                      table="turbine",
                      columns=["generator___mechanical___speed",
                               "generator___temperature"])

Additionally, you can also apply some simple transformations to the data like value clipping and/or normalization and resampling data w.r.t. a target frequency (e.g., 10s, 1min, 1h, 1d, etc.). To do this, use the following input args available in data_db_extractor(): norm, clip, and resample. For example, if you want to resample the retrieved data to intervals of one minute, you can achieve this by calling data_db_extractor() with the following arguments:

>>> data_db_extractor(host_ip="localhost",
                      port="5432",
                      username="postgres",
                      password="postgres",
                      database="postgres",
                      table="turbine",
                      resample="1min")

The resampling transformation will squash all rows in a tiem range and compute metrics like mean, std, min, max and median for each column.

Note

For more information, see the official documentation.

Via Terminal

You can perform the same operations you do in Python with the data_db_extractor module on the terminal. When installing the package, a cli api is also installed on your system which allows you to quickly start using the tool with little to no effort by calling data-db-extractor on the terminal.

To list check all available options, simply do:

$ data-db-extractor --help
Usage: data-db-extractor [OPTIONS]

Easily retrieve large amounts of data from a database.

Options:
--host TEXT                 IP Address of the database host.
--port TEXT                 Port number.
--username TEXT             User name registered on the database.
--password TEXT             Password of the username.
--database TEXT             Database name to connect to.
--table TEXT                Table name to perform the queries.
--sql-command TEXT          A custom sql query the user can pass to the
                            database
                            for more complex operations.
--timestamp_range_ini TEXT  Lower bound range to include rows by timestamp.
--timestamp_range_end TEXT  Upper bound range to include rows by timestamp.
--columns TEXT              Columns to be selected from the database.
--limit INTEGER             Limit of the number of rows returned (if > 0).
--norm                      Computes the normalization transformation on the
                            retrieved data.
--clip FLOAT                Clips values that are below / above *N* standard
                            deviations from the mean.
--resample TEXT             Computes a resampling of the data based on a
                            given frequency.
--output-path TEXT          Directory to store the generated files.
--help                      Show this message and exit.

You can see that it has the same options than the data_db_extractor() method in data_db_extractor module. Using the last example in the previous section, we can resample the retrieved data to intervals of one minute by typping the following command on the terminal:

$ data-db-extractor \
    --host localhost \
    --port 5432 \
    --username postgres \
    --password postgres \
    --database postgres \
    --table turbine \
    --resample 1s

So, computing what you need is as simply as configuring the connection to the database and setting the transformation you want to perform on the data being retrieved, and let the tool do the rest.

Documentation

Documentation for the package can be found here.

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for data-db-extractor, version 0.0.4
Filename, size File type Python version Upload date Hashes
Filename, size data_db_extractor-0.0.4-py2.py3-none-any.whl (12.1 kB) File type Wheel Python version py2.py3 Upload date Hashes View hashes
Filename, size data_db_extractor-0.0.4.tar.gz (11.2 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page