AgeFreighter is a Python package that helps you to create a graph database using Azure Database for PostgreSQL.

AGEFreighter

A Python package that helps you create a graph database using Azure Database for PostgreSQL.

Apache AGE™ is a graph database extension for PostgreSQL. It is compatible with PostgreSQL's distributed assets and leverages graph data structures to analyze and use relationships and patterns in data.

Azure Database for PostgreSQL is a managed database service that is based on the open-source Postgres database engine.

Blog post: Introducing support for Graph data in Azure Database for PostgreSQL (Preview).

0.5.0 Release

Refactored the code to make it more readable and maintainable, with separate classes built on a factory model. Please note that the usage of the new version of the package is totally different from previous versions.

0.5.2 Release -AzureStorageFreighter-

  • AzureStorageFreighter class is used to load data from Azure Storage into the graph database. It's totally different from other classes. The class works as follows:
    • If the 'subscription_id' argument is not set, the class tries to find the Azure Subscription ID from your local environment using the 'az' command.
    • Creates an Azure Storage account and a blob container in the resource group where the PostgreSQL server runs.
    • Enables the 'azure_storage' extension in the PostgreSQL server, if it's not enabled.
    • Uploads the CSV file to the blob container.
    • Creates a UDF (User Defined Function) named 'load_from_azure_storage' in the PostgreSQL server. The UDF loads data from the Azure Storage into the graph database.
    • Executes the UDF.
  • The above process takes time to prepare for loading data, making it unsuitable for loading small files, but effective for loading large files. For instance, it takes under 3 seconds to load 'actorfilms.csv' after uploading.
  • However, please note that it is still in the early stages of implementation, so there is room for optimization and potential issues due to insufficient testing.
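
The subscription lookup in the first step above can be approximated as follows. This is a minimal sketch, not the package's implementation: the function names are hypothetical, and it assumes the Azure CLI is installed and logged in, so that 'az account show' prints the active account as JSON with an 'id' field.

```python
import json
import subprocess
from typing import Optional


def parse_subscription_id(az_output: str) -> str:
    """Extract the subscription ID from the JSON printed by `az account show`."""
    account = json.loads(az_output)
    return account["id"]


def find_subscription_id(subscription_id: Optional[str] = None) -> str:
    """Return the given ID, or fall back to the Azure CLI's active account."""
    if subscription_id:
        return subscription_id
    # Assumption: `az` is on PATH and `az login` has already been run.
    result = subprocess.run(
        ["az", "account", "show", "--output", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_subscription_id(result.stdout)
```

Passing an explicit ID skips the CLI call entirely, which mirrors the behavior described for the 'subscription_id' argument.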

0.5.3 Release -AzureStorageFreighter-

  • AzureStorageFreighter class has been completely refactored for better performance and scalability.
    • 0.5.2 didn't work well for large files.
    • It now works well for large files. In a test with a 5.4GB CSV file consisting of 10M start vertices, 10K end vertices, and 25M edges, loading the data into the graph database took 512 seconds on PostgreSQL Flex, Standard_D32ds_v4 (32 vCPUs, 128 GiB memory) with 512TB / 7500 IOPS of storage.
    • The test data was generated with tests/generate_dummy_data.py.
    • The UDF that loaded the data into the graph is no longer used.
  • However, please note that it is still in the early stages of implementation, so there is room for optimization and potential issues due to insufficient testing.

Features

  • Asynchronous connection pool support for psycopg PostgreSQL driver
  • 'direct_loading' option for loading data directly into the graph. If 'direct_loading' is True, the data is loaded into the graph using the 'INSERT' statement, not Cypher queries.
  • 'COPY' protocol support for loading data into the graph. If 'use_copy' is True, the data is loaded into the graph using the 'COPY' protocol.
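
For context, a Cypher query in Apache AGE is issued through the cypher() SQL function; this is the path that 'direct_loading' avoids by using 'INSERT' statements instead. The sketch below only builds such a statement as a string for a single vertex. The helper name is hypothetical, it is not how the package generates its queries, and it assumes plain property values that contain no quotes, backslashes, or '$$'.

```python
def cypher_create_vertex(graph_name: str, label: str, props: dict) -> str:
    """Build an AGE statement that creates one vertex via Cypher.

    Illustration only: values are not escaped, so this must not be used
    with untrusted input.
    """
    prop_text = ", ".join(f"{key}: '{value}'" for key, value in props.items())
    return (
        f"SELECT * FROM cypher('{graph_name}', "
        f"$$ CREATE (n:{label} {{{prop_text}}}) $$) AS (n agtype);"
    )


# Made-up values, using the graph and label names from the usage example.
statement = cypher_create_vertex(
    "AgeTester", "Actor", {"id": "a1", "Actor": "Actor One"}
)
print(statement)
```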

Classes

  • AzureStorageFreighter
  • AvroFreighter
  • CosmosGremlinFreighter
  • CSVFreighter
  • MultiCSVFreighter
  • Neo4jFreighter
  • NetworkXFreighter
  • ParquetFreighter
  • PGFreighter

Method

All the classes have the same load() method. The method loads data into the graph database.
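
The idea of a factory that returns objects sharing one load() method can be sketched as below. This is an illustration of the pattern only, under the assumption that class names are looked up in a registry; the Sketch* names are hypothetical and the real package's internals may differ.

```python
import asyncio


class SketchFreighter:
    """Hypothetical base class: every loader exposes the same async load()."""

    async def load(self, **kwargs) -> str:
        raise NotImplementedError


class SketchCSVFreighter(SketchFreighter):
    async def load(self, **kwargs) -> str:
        return f"CSV -> {kwargs['graph_name']}"


class SketchParquetFreighter(SketchFreighter):
    async def load(self, **kwargs) -> str:
        return f"Parquet -> {kwargs['graph_name']}"


class SketchFactory:
    """Maps a class name to a loader class and instantiates it."""

    _registry = {
        "CSVFreighter": SketchCSVFreighter,
        "ParquetFreighter": SketchParquetFreighter,
    }

    @classmethod
    def create_instance(cls, class_name: str) -> SketchFreighter:
        try:
            return cls._registry[class_name]()
        except KeyError:
            raise ValueError(f"unknown class: {class_name}") from None


async def demo() -> list:
    # The calling code is identical regardless of which loader is created.
    results = []
    for name in ("CSVFreighter", "ParquetFreighter"):
        loader = SketchFactory.create_instance(name)
        results.append(await loader.load(graph_name="AgeTester"))
    return results


print(asyncio.run(demo()))
```

Because callers depend only on load(), switching the data source means changing the class name string and the source-specific arguments.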

Arguments for each class

  • common arguments

    • graph_name (str) : the name of the graph
    • chunk_size (int) : the number of rows to be loaded at once
    • direct_loading (bool) : if True, the data is loaded into the graph using the 'INSERT' statement, not Cypher queries
    • use_copy (bool) : if True, the data is loaded into the graph using the 'COPY' protocol
    • drop_graph (bool) : if True, the graph is dropped before loading the data
  • AzureStorageFreighter

    • csv (str): CSV file path
    • start_v_label (str): Start Vertex Label
    • start_id (str): Start Vertex ID
    • start_props (list): Start Vertex Properties
    • edge_type (str): Edge Type
    • end_v_label (str): End Vertex Label
    • end_id (str): End Vertex ID
    • end_props (list): End Vertex Properties
    • graph_name (str): Graph Name
    • chunk_size (int): Chunk Size
    • drop_graph (bool): Drop Graph
  • AvroFreighter

    • source_avro (str): The path to the Avro file.
    • start_v_label (str): The label of the start vertex.
    • start_id (str): The ID of the start vertex.
    • start_props (list): The properties of the start vertex.
    • edge_type (str): The type of the edge.
    • end_v_label (str): The label of the end vertex.
    • end_id (str): The ID of the end vertex.
    • end_props (list): The properties of the end vertex.
  • CosmosGremlinFreighter

    • cosmos_gremlin_endpoint (str): The Cosmos Gremlin endpoint.
    • cosmos_gremlin_key (str): The Cosmos Gremlin key.
    • cosmos_username (str): The Cosmos username.
    • id_map (dict): The ID map.
  • CSVFreighter

    • csv (str): The path to the CSV file.
    • start_v_label (str): The label of the start vertex.
    • start_id (str): The ID of the start vertex.
    • start_props (list): The properties of the start vertex.
    • edge_type (str): The type of the edge.
    • end_v_label (str): The label of the end vertex.
    • end_id (str): The ID of the end vertex.
    • end_props (list): The properties of the end vertex.
  • MultiCSVFreighter

    • vertex_csvs (list): The paths to the vertex CSV files.
    • vertex_labels (list): The labels of the vertices.
    • edge_csvs (list): The paths to the edge CSV files.
    • edge_types (list): The types of the edges.
  • Neo4jFreighter

    • neo4j_uri (str): The URI of the Neo4j database.
    • neo4j_user (str): The username of the Neo4j database.
    • neo4j_password (str): The password of the Neo4j database.
    • neo4j_database (str): The name of the Neo4j database.
    • id_map (dict): The ID map.
  • NetworkXFreighter

    • networkx_graph (nx.Graph): The NetworkX graph.
    • id_map (dict): The ID map.
  • ParquetFreighter

    • source_parquet (str): The path to the Parquet file.
    • start_v_label (str): The label of the start vertex.
    • start_id (str): The ID of the start vertex.
    • start_props (list): The properties of the start vertex.
    • edge_type (str): The type of the edge.
    • end_v_label (str): The label of the end vertex.
    • end_id (str): The ID of the end vertex.
    • end_props (list): The properties of the end vertex.
  • PGFreighter

    • source_pg_con_string (str): The connection string of the source PostgreSQL database.
    • source_schema (str): The source schema.
    • source_tables (list): The source tables.
    • id_map (dict): The ID map.
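
To make the start/end arguments of the file-based classes concrete, the sketch below splits each row of a single CSV file into a start vertex, an edge, and an end vertex. It is an illustration under assumptions, not the package's code: the helper names are hypothetical, the sample rows are made up, and the column names are taken from the usage example for 'actorfilms.csv'.

```python
import csv
import io

# Made-up sample rows; the column names follow the 'actorfilms.csv' usage example.
SAMPLE_CSV = """ActorID,Actor,FilmID,Film,Year,Votes,Rating
a1,Actor One,f1,Film One,2000,10,5.0
a1,Actor One,f2,Film Two,2001,20,6.0
"""


def split_row(row, start_id, start_props, edge_type, end_id, end_props):
    """Split one CSV row into (start vertex, edge, end vertex)."""
    start = {"id": row[start_id], **{p: row[p] for p in start_props}}
    end = {"id": row[end_id], **{p: row[p] for p in end_props}}
    edge = (start["id"], edge_type, end["id"])
    return start, edge, end


def read_graph_elements(text, **mapping):
    """Apply split_row to every row of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [split_row(row, **mapping) for row in reader]


elements = read_graph_elements(
    SAMPLE_CSV,
    start_id="ActorID",
    start_props=["Actor"],
    edge_type="ACTED_IN",
    end_id="FilmID",
    end_props=["Film", "Year", "Votes", "Rating"],
)
for start, edge, end in elements:
    print(start, edge, end)
```

Note that the same start vertex appears once per row here; a real loader would need to deduplicate vertices before inserting them.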

Release Notes

  • 0.4.0 : Added 'loadFromCosmosGremlin()' function.
  • 0.4.1 : Changed base Python version to 3.9 to run on Azure Cloud Shell and Databricks 15.4ML.
  • 0.4.2 : Tuning for 'loadFromCosmosGremlin()' function.
  • 0.4.3 : Standardized the argument names. Enhanced the tests for each function.
  • 0.4.4 : Performance tuning.
  • 0.4.5 : Simplified 'loadFromNeo4j'.
  • 0.4.6 : Added 'loadFromAvro()' function.
  • 0.5.0 : Refactored the code to make it more readable and maintainable, with separate classes built on a factory model. Introduced concurrent.futures for better performance.
  • 0.5.1 : Improved the usage.
  • 0.5.2 : Added AzureStorageFreighter class, fixed a bug in ParquetFreighter class (THX! Reported from my co-worker, Srikanth-san)
  • 0.5.3 : Refactored AzureStorageFreighter class for better performance and scalability.

Install

pip install agefreighter

Prerequisites

  • Python 3.9 or later
  • This module runs on psycopg and psycopg_pool.
  • Enable the Apache AGE extension in your Azure Database for PostgreSQL instance. Log in to the Azure Portal, go to the 'server parameters' blade, and check 'AGE' in both the 'azure.extensions' and 'shared_preload_libraries' parameters. See the blog post mentioned above for more information.
  • Load the AGE extension in your PostgreSQL database.
CREATE EXTENSION IF NOT EXISTS age CASCADE;

Usage

import asyncio
import os
from agefreighter import Factory
import logging

log = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)


async def main():
    class_name = "CSVFreighter"
    instance = Factory.create_instance(class_name)

    await instance.connect(
        dsn=os.environ["PG_CONNECTION_STRING"],
        max_connections=64,
    )
    await instance.load(
        graph_name="AgeTester",
        start_v_label="Actor",
        start_id="ActorID",
        start_props=["Actor"],
        edge_type="ACTED_IN",
        end_v_label="Film",
        end_id="FilmID",
        end_props=["Film", "Year", "Votes", "Rating"],
        csv="./actorfilms.csv",
        drop_graph=True,
    )


if __name__ == "__main__":
    # asyncio is already imported at the top of the script.
    asyncio.run(main())

See tests/agefreightertester.py for more details.

Test & Samples

export PG_CONNECTION_STRING="host=your_host.postgres.database.azure.com port=5432 dbname=postgres user=account password=your_password"
cd tests/
python3.9 agefreightertester.py
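
The connection string above uses libpq's space-separated keyword=value format, in which a value that is empty or contains spaces must be wrapped in single quotes, with single quotes and backslashes inside it escaped by a backslash. A minimal sketch of a helper that applies that quoting follows; the function name build_dsn is hypothetical and not part of the package.

```python
def build_dsn(**params) -> str:
    """Join keyword arguments into a libpq keyword=value connection string."""
    parts = []
    for key, value in params.items():
        text = str(value)
        needs_quotes = (
            text == ""
            or any(ch.isspace() for ch in text)
            or "'" in text
            or "\\" in text
        )
        if needs_quotes:
            # Escape backslashes first, then single quotes, then wrap in quotes.
            text = "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"
        parts.append(f"{key}={text}")
    return " ".join(parts)


# Placeholder values, matching the shape of the example above.
dsn = build_dsn(
    host="your_host.postgres.database.azure.com",
    port=5432,
    dbname="postgres",
    user="account",
    password="your password",
)
print(dsn)
```

The result can be exported as PG_CONNECTION_STRING or passed as the dsn argument shown in the usage example.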

For more information about Apache AGE

License

MIT License
