
dgraphpandas


A library (with an accompanying CLI tool) to transform Pandas DataFrames into RDF exports to be sent to the DGraph Live Loader

Usage

python -m pip install dgraphpandas

Command Line

This is a real example which you can find in the samples folder and run from the root of this repository.

python -m dgraphpandas \
  --config samples/planets/dgraphpandas.json \
  --config_file_key planet \
  --file samples/planets/solar_system.csv \
  --output samples/planets/output

Module

from dgraphpandas.strategies.horizontal import horizontal_transform
from dgraphpandas.strategies.vertical import vertical_transform
from dgraphpandas.writers.upserts import generate_upserts

# Define a Configuration for your data file(s). Explained further in the Configuration section.
config = {
  "transform": "horizontal",
  "files": {
    "planet": {
      "subject_fields": ["id"],
      "edge_fields": ["type"],
      "type_overrides": {
        "order_from_sun": "int32",
        "diameter_earth_relative": "float32",
        "diameter_km": "float32",
        "mass_earth_relative": "float32",
        "mean_distance_from_sun_au": "float32",
        "orbital_period_years": "float32",
        "orbital_eccentricity": "float32",
        "mean_orbital_velocity_km_sec": "float32",
        "rotation_period_days": "float32",
        "inclination_axis_degrees": "float32",
        "mean_temperature_surface_c": "float32",
        "gravity_equator_earth_relative": "float32",
        "escape_velocity_km_sec": "float32",
        "mean_density": "float32",
        "number_moons": "int32",
        "rings": "bool"
      },
      "ignore_fields": ["image", "parent"]
    }
  }
}

# Perform a Horizontal Transform on the passed file using the config/key
intrinsic, edges = horizontal_transform('planets.csv', config, "planet")

# Generate RDF Upsert statements
intrinsic_upserts, edges_upserts = generate_upserts(intrinsic, edges)

# Do something with these statements e.g. write to a zip and ship to DGraph
# The CLI zips this output automatically
print(intrinsic)
print(edges)
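For example, a minimal sketch of shipping the output yourself (this assumes the generated upserts are iterables of RDF statement strings; check the actual return types). DGraph's live loader accepts gzipped RDF files:

import gzip

# Write intrinsic and edge statements into a single gzipped RDF file,
# ready to be handed to dgraph live.
with gzip.open('planets.rdf.gz', 'wt', encoding='utf-8') as f:
    for statement in list(intrinsic_upserts) + list(edges_upserts):
        f.write(statement + '\n')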

Configuration

A Configuration file influences how we transform a DataFrame. It consists of:

  • Global configuration options

    • Options which will be applied to all files
    • These can either be defined in the configuration or passed as kwargs to the transform method
    • A collection of files
  • File configuration options

    • Options which apply only to a single file entry
    • subject_fields is required so the unique identifier for a row in the DataFrame can be found
    • edge_fields are optional; if provided, they will generate edge output
    • type_overrides are optional but recommended to ensure the correct type is attached in the RDF

If you are calling the module and passing options via kwargs, these options may also be a callable which takes the DataFrame. For example, if you didn't want to hard-code all your edge fields and were following a convention that all edge fields have the suffix _id, you could set edge_fields to lambda frame: frame.loc[frame['predicate'].str.endswith('_id'), 'predicate'].unique().tolist()
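As a minimal sketch (the lambda follows the example above; whether the callable receives the raw or melted frame is an assumption here):

from dgraphpandas.strategies.horizontal import horizontal_transform

# Hypothetical call: derive the edge fields dynamically rather than
# hard-coding them, assuming the frame exposes a 'predicate' column.
intrinsic, edges = horizontal_transform(
    'planets.csv', config, 'planet',
    edge_fields=lambda frame: frame.loc[
        frame['predicate'].str.endswith('_id'), 'predicate'
    ].unique().tolist())

For reference, here is the full example configuration again: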

config = {
  "transform": "horizontal",
  "files": {
    "planet": {
      "subject_fields": ["id"],
      "edge_fields": ["type"],
      "type_overrides": {
        "order_from_sun": "int32",
        "diameter_earth_relative": "float32",
        "diameter_km": "float32",
        "mass_earth_relative": "float32",
        "mean_distance_from_sun_au": "float32",
        "orbital_period_years": "float32",
        "orbital_eccentricity": "float32",
        "mean_orbital_velocity_km_sec": "float32",
        "rotation_period_days": "float32",
        "inclination_axis_degrees": "float32",
        "mean_temperature_surface_c": "float32",
        "gravity_equator_earth_relative": "float32",
        "escape_velocity_km_sec": "float32",
        "mean_density": "float32",
        "number_moons": "int32",
        "rings": "bool"
      },
      "ignore_fields": ["image", "parent"]
    }
  }
}

Additional Configuration

Global Level

These options can be placed on the root of the config or passed as kwargs directly (a sketch follows the list below).

  • add_dgraph_type_records
    • DGraph has a special field called dgraph.type, which can be used to query via the type() function. If add_dgraph_type_records is enabled, dgraph.type records are added to the current frame.
  • strip_id_from_edge_names
    • It's common for a data set to reference another 'table' using an _id convention.
    • For example, with a Student and a School, it makes more sense for the student to have (Student) - school -> (School) rather than keeping the _id in the predicate.
  • drop_na_intrinsic_objects
    • Automatically drop intrinsic records where the object is NA. In a relational model you might have a column with a null entry; in a graph model you may want to omit the attribute altogether.
  • drop_na_edge_objects
    • Same as drop_na_intrinsic_objects but for edges.
  • key_separator
    • Separator used to combine key fields. For example, if the key separator was _ and we were operating on an intrinsic attribute for a customer with id 1, the xid would be customer_1.
  • illegal_characters
    • Characters to strip from intrinsic and edge subjects. If the unique identifier has characters not supported by RDF/DGraph, strip them away or the records will not be accepted by live loading.
  • illegal_characters_intrinsic_object
    • Same as illegal_characters but for the object on intrinsic records. These have a different set of illegal characters because objects on intrinsic records are actual data values and are quoted; they can therefore accept many more characters than the subject.
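A minimal sketch of passing these global options as kwargs (the values shown are illustrative, not the library's defaults):

from dgraphpandas.strategies.horizontal import horizontal_transform

# Illustrative values only; consult the library for the actual defaults.
intrinsic, edges = horizontal_transform(
    'planets.csv', config, 'planet',
    add_dgraph_type_records=True,
    strip_id_from_edge_names=True,
    drop_na_intrinsic_objects=True,
    drop_na_edge_objects=True,
    key_separator='_')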

File Level

These options apply per file entry (a combined sketch follows the list below).

  • type_overrides
    • Recommended. This ensures data is treated as the intended type and the correct type is mapped into the output RDF. Without this, fields fall under the default RDF type <xs:string>, but you may want a field to be a true int in RDF.
    • Additionally, certain data types such as datetime64 activate special handling to ensure the RDF output is in the correct format to be ingested into DGraph.
    • Supported Types can be found here
  • csv_edges
    • Sometimes a vendor provides a data file where a single column is actually a csv list, and each value should become its own RDF statement (because the values relate to independent entities). Adding that column to this list will do that.
    • For example, in the Netflix sample's title file there is a cast column whose values look like actor_1, actor_2; adding cast to csv_edges ensures the movie gets a separate relationship for each cast member.
  • ignore_fields
    • Add fields in the input that we don't care about to this list so they aren't present in the output.
  • override_edge_name
    • Ensure that the edge has a different predicate and/or target_node_type to what is defined in the file.
    • For example, in the Pokemon sample's pokemon_species file you will see a column called evolves_from_species, which tells us which other pokemon a given pokemon evolves from. If we used the raw data here we would get an evolves_from_species edge with an incorrect target xid. Instead we want to override the target_node_type to pokemon so the edge correctly loops back to a node of the same type.
  • pre_rename
    • Rename intrinsic predicates or edge names to something else.
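A hypothetical file entry combining these options (the exact shapes of override_edge_name and pre_rename here are assumptions inferred from the descriptions above, not confirmed schema):

config = {
  "transform": "horizontal",
  "files": {
    "title": {
      "subject_fields": ["show_id"],
      "edge_fields": ["cast"],
      # datetime64 activates special date handling for DGraph ingestion
      "type_overrides": {"release_year": "int32", "date_added": "datetime64"},
      # split the csv-valued cast column into one edge per cast member
      "csv_edges": ["cast"],
      # columns dropped from the output
      "ignore_fields": ["description"],
      # assumed shape: point cast edges at a person node type
      "override_edge_name": {"cast": {"target_node_type": "person"}},
      # assumed shape: rename a predicate before output
      "pre_rename": {"listed_in": "genre"}
    }
  }
}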

Samples

Samples can be found here. They follow a convention where the download script can be found within the input directory, and the config, generate_upsert, and publish scripts can be found in the root of each respective sample.

Local Setup

Assuming you have already cloned the repo and have a terminal in the root of the project.

# Create Virtual Environment and Activate it
conda create -n dgraphpandas python=3.6 # or venv
conda activate dgraphpandas

# Restore packages
python -m pip install -r requirements-dev.txt
python -m pip install -r requirements.txt

# Run Flake
flake8 --count .

# Run Tests
python -m unittest

# Create & Run DGraph
docker-compose up

# Try a Sample
# See the Samples section for more details.
# It should help with getting some data,
# generating RDF and publishing to your
# local DGraph

# Install a Local Copy of the Library
python -m pip install -e .

# Remember to Uninstall once ready
python -m pip uninstall dgraphpandas -y
