Singer.io target for loading data into Redshift

Target Redshift

A Singer Redshift target, for use with Singer streams generated by Singer taps.

Features

Creates SQL tables for Singer streams
Denests objects, flattening them into the parent object's table
Denests rows into separate tables
Adds columns and sub-tables as new fields are added to the stream JSON Schema
Full stream replication via record version and ACTIVATE_VERSION messages
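
As a rough illustration of the object denesting described above, the sketch below flattens nested objects into a single level of column names. The `flatten` helper and the `__` separator are assumptions made for this example, not the target's actual implementation, and lists are left untouched here because they would be denested into separate tables.

```python
def flatten(record, parent_key="", sep="__"):
    """Flatten nested dicts into one level of column names.

    Hypothetical helper: the separator is an assumption for illustration.
    """
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested object: fold its fields into the parent's columns.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


row = flatten({"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9}}})
```

Under these assumptions, `row` holds the keys `id`, `address__city`, and `address__geo__lat`.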

Install

pip install target-redshift

Usage

Follow the Singer.io Best Practices for setting up separate tap and target virtualenvs to avoid version conflicts.

Create a config file at ~/singer.io/target_redshift_config.json with Redshift connection information and the target Redshift schema.

{
  "redshift_host": "aws.something.or.other",
  "redshift_port": 5439,
  "redshift_database": "my_analytics",
  "redshift_username": "myuser",
  "redshift_password": "1234",
  "redshift_schema": "mytapname",
  "target_s3": {
    "aws_access_key_id": "AKIA...",
    "aws_secret_access_key": "supersecret",
    "bucket": "target_redshift_staging",
    "key_prefix": "__tmp"
  }
}

Run target-redshift against a Singer tap.

~/.virtualenvs/tap-something/bin/tap-something \
  | ~/.virtualenvs/target-redshift/bin/target-redshift \
    --config ~/singer.io/target_redshift_config.json
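
For trying out the pipeline without a real tap, a stand-in tap can be sketched in a few lines of Python. A Singer tap writes one JSON message per line to stdout: a SCHEMA message for a stream, followed by RECORD messages. The stream name `users` and its fields are invented for this example.

```python
import json
import sys


def singer_messages():
    """Yield a SCHEMA message followed by RECORD messages for one stream."""
    yield {
        "type": "SCHEMA",
        "stream": "users",  # hypothetical stream name
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "email": {"type": ["string", "null"]},
            },
        },
    }
    yield {"type": "RECORD", "stream": "users",
           "record": {"id": 1, "email": "a@example.com"}}
    yield {"type": "RECORD", "stream": "users",
           "record": {"id": 2, "email": None}}


def main(out=sys.stdout):
    # One JSON document per line, as the target expects on stdin.
    for message in singer_messages():
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    main()
```

Saved under a name of your choosing (say `fake_tap.py`, a hypothetical file), its output could be piped into target-redshift in place of the tap in the command above.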

Config.json

The config file accepts the following fields.

Field	Type	Default	Details
`redshift_host`	`["string"]`	`N/A`
`redshift_port`	`["integer", "null"]`	`5432`
`redshift_database`	`["string"]`	`N/A`
`redshift_username`	`["string"]`	`N/A`
`redshift_password`	`["string"]`	`N/A`
`redshift_schema`	`["string", "null"]`	`"public"`
`invalid_records_detect`	`["boolean", "null"]`	`true`	Set to `false` in your config to keep `target-redshift` from crashing on invalid records.
`invalid_records_threshold`	`["integer", "null"]`	`0`	Set to a positive value `n` in your config to allow `target-redshift` to encounter at most `n` invalid records per stream before giving up.
`disable_collection`	`["string", "null"]`	`false`	Include `true` in your config to disable Singer Usage Logging.
`logging_level`	`["string", "null"]`	`"INFO"`	The level for logging. Set to `DEBUG` to log details such as the queries executed and their timing. See Python's Logger Levels for information about valid values.
`target_s3`	`["object"]`	`N/A`	See `S3 Config.json` below
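
One way to read the interaction between `invalid_records_detect` and `invalid_records_threshold` is sketched below. The `InvalidRecordTracker` class is hypothetical and only models the behaviour described in the table; how the target tracks invalid records internally is not shown in this document.

```python
from collections import Counter


class InvalidRecordTracker:
    """Count invalid records per stream and fail past the threshold.

    Hypothetical model of the two config fields, not the target's own code.
    """

    def __init__(self, detect=True, threshold=0):
        # Both fields may be null in the config; fall back to the defaults.
        self.detect = True if detect is None else detect
        self.threshold = 0 if threshold is None else threshold
        self.counts = Counter()

    def record_invalid(self, stream):
        self.counts[stream] += 1
        if self.detect and self.counts[stream] > self.threshold:
            raise ValueError(
                f"{self.counts[stream]} invalid records in stream "
                f"'{stream}' exceeds threshold {self.threshold}"
            )


tracker = InvalidRecordTracker(detect=True, threshold=2)
tracker.record_invalid("users")  # tolerated
tracker.record_invalid("users")  # tolerated; a third call would raise
```

With the defaults (`true` and `0`), the first invalid record in any stream raises; with detection set to `false`, the counter never raises.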

S3 Config.json

Field	Type	Default	Details
`aws_access_key_id`	`["string"]`	`N/A`
`aws_secret_access_key`	`["string"]`	`N/A`
`bucket`	`["string"]`	`N/A`	Bucket where staging files should be uploaded to.
`key_prefix`	`["string", "null"]`	`""`	Prefix for staging file uploads to allow for better delineation of tmp files
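
Before launching a pipeline it can help to check a config against the two tables above. The sketch below treats every field whose default is `N/A` as required and fills in the listed defaults for the rest. `load_config` is a hypothetical helper written for this example and is not part of the package.

```python
REQUIRED = ("redshift_host", "redshift_database",
            "redshift_username", "redshift_password", "target_s3")
REQUIRED_S3 = ("aws_access_key_id", "aws_secret_access_key", "bucket")

# Defaults copied from the tables above.
DEFAULTS = {
    "redshift_port": 5432,
    "redshift_schema": "public",
    "invalid_records_detect": True,
    "invalid_records_threshold": 0,
    "disable_collection": False,
    "logging_level": "INFO",
}


def load_config(raw):
    """Return a copy of `raw` with defaults applied, or raise on gaps."""
    config = dict(raw)
    s3 = dict(config.get("target_s3") or {})
    missing = [key for key in REQUIRED if config.get(key) is None]
    missing += [f"target_s3.{key}" for key in REQUIRED_S3
                if s3.get(key) is None]
    if missing:
        raise ValueError("missing config fields: " + ", ".join(missing))
    for key, default in DEFAULTS.items():
        if config.get(key) is None:
            config[key] = default
    if s3.get("key_prefix") is None:
        s3["key_prefix"] = ""
    config["target_s3"] = s3
    return config
```

A config read with `json.load` can be passed straight to `load_config`; a missing required field surfaces as a `ValueError` naming the field rather than as a connection error later on.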

Known Limitations

Ignores STATE Singer messages.
Requires a JSON Schema for every stream.
Only string, string with date-time format, integer, number, boolean, object, and array types with or without null are supported. Arrays can have any of the other types listed, including objects as types within items.
- Examples of JSON Schema types that work
  - ['number']
  - ['string']
  - ['string', 'null']
- Examples of JSON Schema types that DO NOT work
  - ['string', 'integer']
  - ['integer', 'number']
  - ['any']
  - ['null']
JSON Schema combinations such as anyOf and allOf are not supported.
JSON Schema $ref is partially supported:
- NOTE: The following limitations are known to NOT fail gracefully
- Presently you cannot have any circular or recursive $refs
- $refs must be present within the schema:
  - URIs do not work
  - if the $ref is broken, the behaviour is undefined
Any value that is the string NULL will be streamed to Redshift as the literal null.
Table names are restricted as follows:
- at most 127 characters in length
- composed only of _, lowercase letters, numbers, and $
- must not start with $
- ASCII characters only
Field/Column names are restricted as follows:
- at most 127 characters in length
- ASCII characters only
Fields/Columns are ALL nullable
Fields/Columns use the default largest type available for them
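
Two of the limitations above lend themselves to a quick pre-flight check: the supported JSON Schema `type` values and the table name rules. The sketch below encodes only the rules as listed here. Both helper names are hypothetical, and the target's own validation may differ in detail.

```python
import re

SUPPORTED_TYPES = {"string", "integer", "number", "boolean", "object", "array"}

# Lowercase ASCII letters, digits, _ and $; at most 127 characters;
# the first character must not be $.
TABLE_NAME = re.compile(r"[a-z0-9_][a-z0-9_$]{0,126}")


def is_supported_type(json_schema_type):
    """True if the type is one supported type, optionally with 'null'."""
    if isinstance(json_schema_type, str):
        types = [json_schema_type]
    else:
        types = list(json_schema_type)
    non_null = [t for t in types if t != "null"]
    return len(non_null) == 1 and non_null[0] in SUPPORTED_TYPES


def is_valid_table_name(name):
    """True if the name satisfies the table name rules listed above."""
    return TABLE_NAME.fullmatch(name) is not None
```

Run against the examples above, `is_supported_type` accepts `['number']`, `['string']`, and `['string', 'null']`, and rejects `['string', 'integer']`, `['integer', 'number']`, `['any']`, and `['null']`.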

Usage Logging

Singer.io requires official taps and targets to collect anonymous usage data. This data is only used in aggregate to report on individual taps and targets, as well as the Singer community at large. IP addresses are recorded to detect unique tap and target users but are not shared with third parties.

To disable anonymous data collection set disable_collection to true in the configuration JSON file.

Developing

target-redshift uses setup.py for package management and PyTest for testing.

Docker

If you have Docker and Docker Compose installed, you can run the following to set up a local environment quickly.

First, make sure to create a .env file in the root of this repo (it has been .gitignored so don't worry about accidentally staging it).

Therein, fill out the following information:

REDSHIFT_HOST='<your-host-name>' # Most likely 'localhost'
REDSHIFT_DATABASE='<your-db-name>' # Most likely 'dev'
REDSHIFT_SCHEMA='<your-schema-name>' # Probably 'public'
REDSHIFT_PORT='<your-port>' # Probably 5439
REDSHIFT_USERNAME='<your-user-name>'
REDSHIFT_PASSWORD='<your-password>'
TARGET_S3_AWS_ACCESS_KEY_ID='<AKIA...>'
TARGET_S3_AWS_SECRET_ACCESS_KEY='<secret>'
TARGET_S3_BUCKET='<bucket-string>'
TARGET_S3_KEY_PREFIX='<some-string>' # We use 'target_redshift_test'

$ docker-compose up -d --build
$ docker logs -tf target-redshift_target-redshift_1 # Your container names might differ

As soon as you see INFO: Dev environment ready. you can shell into the container and start running test commands:

$ docker exec -it target-redshift_target-redshift_1 bash # Your container names might differ

See the PyTest commands below!

DB

To run the tests, you will need an actual Redshift cluster running, and a user that either:

Has the ability to create schemas therein
- This is required if you wish to run multiple versions of the tests, similar to how we run our CI tests by varying the REDSHIFT_SCHEMA envvar
Has access to the public schema
- If the REDSHIFT_SCHEMA is seen to be the string "public", the tests will ignore creating and dropping schemas
- This setup is often preferred for situations in which GRANT CREATE ON DATABASE db TO user; is viewed as too risky
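
The schema handling described above can be pictured as a small decision: for any schema other than "public", create it before the tests and drop it afterwards. The sketch below is an illustration of that rule only. The `schema_statements` helper is hypothetical and the exact SQL issued by the test suite is an assumption.

```python
def schema_statements(schema_name):
    """Return (setup_sql, teardown_sql) lists for a test run.

    Hypothetical helper: the test suite's real SQL is not shown here.
    """
    if schema_name == "public":
        # The public schema is left alone: nothing to create or drop.
        return [], []
    # Double any embedded quotes so the identifier stays well-formed.
    quoted = '"' + schema_name.replace('"', '""') + '"'
    setup = [f"CREATE SCHEMA IF NOT EXISTS {quoted}"]
    teardown = [f"DROP SCHEMA IF EXISTS {quoted} CASCADE"]
    return setup, teardown


setup_sql, teardown_sql = schema_statements("ci_run_1")
```

Creating schemas this way is what requires the CREATE privilege on the database, which is why the "public" path avoids it.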

Make sure to set the following env vars for PyTest:

$ export REDSHIFT_HOST='<your-host-name>' # Most likely 'localhost'
$ export REDSHIFT_DATABASE='<your-db-name>' # Most likely 'dev'
$ export REDSHIFT_SCHEMA='<your-schema-name>' # Probably 'public'
$ export REDSHIFT_PORT='<your-port>' # Probably 5439
$ export REDSHIFT_USERNAME='<your-user-name>'
$ export REDSHIFT_PASSWORD='<your-password>' # Redshift requires passwords

S3

To run the tests, you will need an actual S3 bucket available.

Make sure to set the following env vars for PyTest:

$ export TARGET_S3_AWS_ACCESS_KEY_ID='<AKIA...>'
$ export TARGET_S3_AWS_SECRET_ACCESS_KEY='<secret>'
$ export TARGET_S3_BUCKET='<bucket-string>'
$ export TARGET_S3_KEY_PREFIX='<some-string>' # We use 'target_redshift_test'
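
The environment variables above line up one-to-one with the fields of the config file, so a config dict can be assembled from them. The sketch below shows one way to do that. The mapping and the `config_from_env` helper are assumptions for illustration, and the test suite may build its config differently.

```python
import os

# Assumed mapping from env var names to config keys.
ENV_TO_CONFIG = {
    "REDSHIFT_HOST": "redshift_host",
    "REDSHIFT_PORT": "redshift_port",
    "REDSHIFT_DATABASE": "redshift_database",
    "REDSHIFT_USERNAME": "redshift_username",
    "REDSHIFT_PASSWORD": "redshift_password",
    "REDSHIFT_SCHEMA": "redshift_schema",
}
S3_ENV_TO_CONFIG = {
    "TARGET_S3_AWS_ACCESS_KEY_ID": "aws_access_key_id",
    "TARGET_S3_AWS_SECRET_ACCESS_KEY": "aws_secret_access_key",
    "TARGET_S3_BUCKET": "bucket",
    "TARGET_S3_KEY_PREFIX": "key_prefix",
}


def config_from_env(env=None):
    """Build a target config dict; raises KeyError if a variable is unset."""
    env = os.environ if env is None else env
    config = {key: env[var] for var, key in ENV_TO_CONFIG.items()}
    # Environment values are strings; the port is an integer in the config.
    config["redshift_port"] = int(config["redshift_port"])
    config["target_s3"] = {key: env[var]
                           for var, key in S3_ENV_TO_CONFIG.items()}
    return config
```

Calling `config_from_env()` with no argument reads the current process environment; passing a plain dict makes the function easy to exercise in isolation.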

PyTest

To run tests, try:

$ python setup.py pytest

If you have opened a bash shell in the Docker Compose container (see above), you should be able to simply use:

$ pytest

Sponsorship

Target Redshift is sponsored by Data Mill (Data Mill Services, LLC) datamill.co.

Data Mill helps organizations utilize modern data infrastructure and data science to power analytics, products, and services.
