Command line library and Python client for Splitgraph, a version control system for data

Splitgraph

Overview

Splitgraph is a tool for building, versioning and querying reproducible datasets. It's inspired by Docker and Git, so it feels familiar. And it's powered by PostgreSQL, so it works seamlessly with existing tools in the Postgres ecosystem. Use Splitgraph to package your data into self-contained data images that you can share with other Splitgraph instances.

Splitgraph.com, or Splitgraph Cloud, is a public Splitgraph instance where you can share and discover data. It's a Splitgraph peer powered by the Splitgraph Core code in this repository, adding proprietary features like a data catalog, multitenancy, and a distributed SQL proxy.

You can explore 40k+ open datasets in the catalog. You can also connect directly to the Data Delivery Network and query any of the datasets, without installing anything.

To install sgr (the command line client) or a local Splitgraph Engine, see the Installation section of this readme.

Build and Query Versioned, Reproducible Datasets

Splitfiles give you a declarative language, inspired by Dockerfiles, for expressing data transformations in ordinary SQL familiar to any researcher or business analyst. You can reference other images, or even other databases, with a simple JOIN.

When you build data with Splitfiles, you get provenance tracking of the resulting data: it's possible to find out what sources went into every dataset and know when to rebuild it if the sources ever change. You can easily integrate Splitgraph into your existing CI pipelines, to keep your data up-to-date and stay on top of changes to upstream sources.
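
As a rough illustration of the "know when to rebuild" idea above, the sketch below keeps a record of which source image hashes a dataset was built from and compares it with the hashes currently available. The record layout, the repository names and the functions `stale_sources` and `needs_rebuild` are hypothetical and exist only for this example; this is not Splitgraph's API or its metadata format.

```python
from typing import Dict, List


def stale_sources(built_from: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the sources whose current image hash differs from the hash used at build time."""
    return sorted(
        name
        for name, image_hash in built_from.items()
        if current.get(name) != image_hash
    )


def needs_rebuild(built_from: Dict[str, str], current: Dict[str, str]) -> bool:
    """A dataset is considered out of date if at least one of its sources has changed."""
    return bool(stale_sources(built_from, current))


# Hypothetical provenance record: source repository -> image hash used for the build.
built_from = {"example/weather": "a1b2c3", "example/stations": "d4e5f6"}
# Hypothetical latest image hashes of the same sources.
current = {"example/weather": "a1b2c3", "example/stations": "0f9e8d"}

print(stale_sources(built_from, current))
print(needs_rebuild(built_from, current))
```

A CI job could run a check of this shape on a schedule and trigger a rebuild only when a source has actually moved.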

Splitgraph images are also version-controlled, and you can manipulate them with Git-like operations through a CLI. You can check out any image into a PostgreSQL schema and interact with it using any PostgreSQL client. Splitgraph will capture your changes to the data, and then you can commit them as delta-compressed changesets that you can package into new images.
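
To make the idea of committing changes as a changeset more concrete, here is a minimal conceptual sketch in plain Python. It models a table as a dictionary keyed by primary key, computes the rows that were inserted, updated or deleted between two versions, and replays that changeset on the old version. The names `diff_tables` and `apply_changeset` are hypothetical, and the sketch says nothing about how Splitgraph actually stores or compresses its objects.

```python
from typing import Any, Dict, Tuple

Row = Tuple[Any, ...]
Table = Dict[Any, Row]  # primary key -> remaining column values


def diff_tables(old: Table, new: Table) -> Dict[str, Table]:
    """Compute a changeset describing how to turn `old` into `new`."""
    inserted = {pk: row for pk, row in new.items() if pk not in old}
    deleted = {pk: row for pk, row in old.items() if pk not in new}
    # For updated rows, the changeset stores the new values.
    updated = {pk: row for pk, row in new.items() if pk in old and old[pk] != row}
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


def apply_changeset(old: Table, changeset: Dict[str, Table]) -> Table:
    """Replay a changeset on top of an older version of the table."""
    result = {pk: row for pk, row in old.items() if pk not in changeset["deleted"]}
    result.update(changeset["inserted"])
    result.update(changeset["updated"])
    return result


v1 = {1: ("apple", 3), 2: ("pear", 5), 3: ("plum", 1)}
v2 = {1: ("apple", 4), 3: ("plum", 1), 4: ("fig", 9)}

changeset = diff_tables(v1, v2)
print(changeset)
print(apply_changeset(v1, changeset) == v2)
```

Only the rows that differ end up in the changeset, which is why storing a new version on top of an old one can be much cheaper than storing a full copy.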

Splitgraph supports PostgreSQL foreign data wrappers. We call this feature mounting. With mounting, you can query other databases (like PostgreSQL/MongoDB/MySQL) or open data providers (like Socrata) from your Splitgraph instance with plain SQL. You can even snapshot the results or use them in Splitfiles.

Why Splitgraph?

Splitgraph isn't opinionated and doesn't break existing abstractions. To any existing PostgreSQL application, Splitgraph images are just another database. We have carefully designed Splitgraph to not break the abstraction of a PostgreSQL table and wire protocol, because doing otherwise would mean throwing away a vast existing ecosystem of applications, users, libraries and extensions. This means that a lot of tools that work with PostgreSQL work with Splitgraph out of the box.

Components

The code in this repository, known as Splitgraph Core, contains:

  • sgr command line client: sgr is the main command line tool used to work with Splitgraph "images" (data snapshots). Use it to ingest data, work with Splitfiles, and push data to Splitgraph.com.
  • Splitgraph Engine: a Docker image of the latest Postgres with Splitgraph and other required extensions pre-installed.
  • Splitgraph Python library: All Splitgraph functionality is available in the Python API, offering first-class support for data science workflows including Jupyter notebooks and Pandas dataframes.

Docs

Documentation is available at https://www.splitgraph.com/docs.

We also recommend reading our Blog.

Installation

Pre-requisites:

  • Docker is required to run the Splitgraph Engine. sgr must have access to Docker. You either need to install Docker locally or have access to a remote Docker socket.
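
Before running the installer, it can help to confirm that Docker is actually reachable. The sketch below is one way to do that from Python: it looks for the `docker` CLI on the PATH and then runs `docker info`, which exits with a non-zero status when the daemon cannot be contacted. The function names are hypothetical, and this is not how sgr itself detects Docker.

```python
import shutil
import subprocess
from typing import Optional


def docker_cli_path() -> Optional[str]:
    """Return the full path of the docker CLI, or None if it is not on the PATH."""
    return shutil.which("docker")


def docker_daemon_reachable(timeout: float = 10.0) -> bool:
    """Return True if `docker info` succeeds within the timeout."""
    path = docker_cli_path()
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("docker CLI:", docker_cli_path())
    print("daemon reachable:", docker_daemon_reachable())
```

When a remote Docker socket is used, the Docker CLI reads its address from the `DOCKER_HOST` environment variable, so the same check applies unchanged.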

For Linux and OSX, once Docker is running, install Splitgraph with a single script:

$ bash -c "$(curl -sL https://github.com/splitgraph/splitgraph/releases/latest/download/install.sh)"

This will download the sgr binary and set up the Splitgraph Engine Docker container.

Alternatively, you can download the standalone sgr binary from the releases page and run sgr engine add to create an engine.

See the installation guide for more installation methods.

Quick start guide

The quick start guide walks you through the basics of using Splitgraph with public and private data.

Alternatively, Splitgraph comes with plenty of examples to get you started.

If you're stuck or have any questions, check out the documentation or join our Discord channel!

Contributing

Setting up a development environment

  • Splitgraph requires Python 3.6 or later.
  • Install Poetry to manage dependencies: curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
  • Install pre-commit hooks (we use Black to format code)
  • git clone --recurse-submodules https://github.com/splitgraph/splitgraph.git
  • poetry install
  • To build the engine Docker image: cd engine && make

Running tests

The test suite requires docker-compose. You will also need to add these lines to your /etc/hosts or equivalent:

127.0.0.1       local_engine
127.0.0.1       remote_engine
127.0.0.1       objectstorage
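
A small, optional check for the entries above: the sketch below parses hosts-file text and reports which of the three names are not mapped to 127.0.0.1. The function name `missing_hosts` is hypothetical and the script is not part of the Splitgraph test suite.

```python
from typing import Iterable, List

REQUIRED_HOSTS = ("local_engine", "remote_engine", "objectstorage")


def missing_hosts(hosts_text: str, required: Iterable[str] = REQUIRED_HOSTS) -> List[str]:
    """Return the required host names that are not mapped to 127.0.0.1."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if address == "127.0.0.1":
            mapped.update(names)
    return [name for name in required if name not in mapped]


sample = """
127.0.0.1       localhost
127.0.0.1       local_engine
# 127.0.0.1     remote_engine
127.0.0.1       objectstorage  # used by the tests
"""

print(missing_hosts(sample))
```

To check a real machine, pass in the file contents instead of the sample, for example `missing_hosts(open("/etc/hosts").read())`.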

To run the core test suite, do

docker-compose -f test/architecture/docker-compose.core.yml up -d
poetry run pytest -m "not mounting and not example"

To run the test suite related to "mounting" and importing data from other databases (PostgreSQL, MySQL, Mongo), do

docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml up -d
poetry run pytest -m mounting

Finally, to test the example projects, do

# Example projects spin up their own engines
docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml down -v
poetry run pytest -m example

All of these tests run in CI.

Download files

Source Distribution

splitgraph-0.2.13.tar.gz (225.8 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.5 CPython/3.8.1 Linux/5.7.1-050701-generic
  • SHA256: 3bba110049e8ddfe65a6c6b299227609864979d6ac2d2a737e642f1480fe9393
  • MD5: 71226aa816c2a280ac7d3ef42664861a
  • BLAKE2b-256: 11d6fa15739a7bad7ce446910af3cef8a0cb5de65941c19524781c5fb86571bd

Built Distribution

splitgraph-0.2.13-py3-none-any.whl (263.5 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.5 CPython/3.8.1 Linux/5.7.1-050701-generic
  • SHA256: 23757838fde3e1e03205671b8fe9c73fbe44bf5236026a05c08f34d3befafa0e
  • MD5: 668cfb16636b582a5171c3cb66b3ab1b
  • BLAKE2b-256: aff670fa48bf0e0e47f4373c67daa1f735c119e323a669d6c2ae6c44445b586d
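
If you download one of these files by hand, you can compare it against the SHA256 digest listed above. The following is a minimal sketch using only the Python standard library; the function names `sha256_of` and `matches_digest` and the local file path in the usage note are assumptions made for this example.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("splitgraph-0.2.13.tar.gz", "3bba110049e8ddfe65a6c6b299227609864979d6ac2d2a737e642f1480fe9393")` should return True for an intact copy of the source distribution saved in the current directory.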
