Building blocks for ETL pipelines.

Neo4j ETL Toolbox

A robust Python library of building blocks to assemble efficient, scalable ETL pipelines for Neo4j.

It simplifies the process of moving data from SQL, CSV, and Parquet sources into Neo4j by handling common concerns like batching, parallelism, logging, and error handling.
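
To make the batching concern concrete, here is a minimal, standard-library-only sketch of reading a CSV source in fixed-size batches. It is an illustration, not the library's implementation: `read_batches` and the sample data are hypothetical names introduced here. Each yielded list of row dictionaries could then be passed to a query as a single parameter.

```python
import csv
import io
from itertools import islice


def read_batches(lines, batch_size):
    """Yield lists of row dicts, each holding at most batch_size rows.

    Hypothetical helper for illustration; not part of neo4j-etl-lib.
    """
    reader = csv.DictReader(lines)
    while True:
        # islice pulls the next batch_size rows without loading the whole file
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Small in-memory CSV standing in for a real file
sample = io.StringIO("pu_location,do_location\n1,2\n3,4\n5,6\n")
batches = list(read_batches(sample, batch_size=2))
```

With three data rows and `batch_size=2`, the generator yields one full batch followed by a final, shorter batch. Note that `csv.DictReader` returns all values as strings, so type conversion is left to a later step such as model validation.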

Key Features

  • Task-Based Architecture: Compose pipelines from reusable units of work.
  • Parallel Loading: Strategies for loading data in parallel while minimizing lock contention and deadlocks.
  • Data Validation: Integrated Pydantic support for ensuring data quality before loading.
  • Detailed Reporting: Built-in tracking of execution time and row counts.
  • Flexible Sources: Support for SQL (via SQLAlchemy), CSV, Neo4j and Parquet (via PyArrow).

Parallel Loading Example

The library provides specialized tasks for parallel data loading. By using a "mix-and-batch" strategy, it can load relationships in parallel while minimizing deadlocks.

Here is an example of defining a parallel CSV loader task (taken from the examples/nyc-taxi project):

from pathlib import Path
from etl_lib.core.ETLContext import ETLContext
from etl_lib.core.SplittingBatchProcessor import dict_id_extractor
from etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask import ParallelCSVLoad2Neo4jTask
from model.trip import Trip  # Your Pydantic model

class LoadTripsParallelTask(ParallelCSVLoad2Neo4jTask):
    def __init__(self, context: ETLContext, csv_path: Path):
        super().__init__(
            context,
            file=csv_path,
            model=Trip,
            error_file=Path('errors_parallel.json'),
            batch_size=5000,
            max_workers=10
        )

    def _query(self):
        return """
            UNWIND $batch AS row
            MATCH (pu:Location {id: row.pu_location})
            MATCH (do:Location {id: row.do_location})
            CREATE (t:Trip {
              id: randomUUID(),
              pickup_datetime: row.pickup_datetime,
              dropoff_datetime: row.dropoff_datetime,
              ...
            })
            CREATE (t)-[:STARTED_AT]->(pu)
            CREATE (t)-[:ENDED_AT]->(do)
        """

    def _id_extractor(self):
        # Defines how to route rows to avoid locking on start/end nodes
        return dict_id_extractor(table_size=10, start_key='pu_location', end_key='do_location')
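
The routing idea behind the start/end keys and `table_size` can be illustrated with a small, self-contained sketch. This is not the library's code: `bucket`, `mix_and_batch`, and `load_rounds` are hypothetical names, and the hashing scheme is an assumption. Rows are placed into a `table_size` x `table_size` grid by hashing their start and end ids, and the grid is then processed one diagonal at a time, because no two cells on a diagonal share a start bucket or an end bucket.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def bucket(value, table_size):
    """Map an id to a stable bucket index in [0, table_size)."""
    return zlib.crc32(str(value).encode("utf-8")) % table_size


def mix_and_batch(rows, table_size, start_key, end_key):
    """Group rows into rounds of grid cells that can run concurrently.

    Hypothetical helper for illustration; not part of neo4j-etl-lib.
    """
    grid = defaultdict(list)
    for row in rows:
        r = bucket(row[start_key], table_size)
        c = bucket(row[end_key], table_size)
        grid[(r, c)].append(row)

    rounds = []
    for d in range(table_size):
        # Diagonal d holds the cells (r, (r + d) % table_size):
        # every start bucket and every end bucket appears at most once.
        cells = {}
        for r in range(table_size):
            cell = (r, (r + d) % table_size)
            if cell in grid:
                cells[cell] = grid[cell]
        if cells:
            rounds.append(cells)
    return rounds


def load_rounds(rounds, load_cell, max_workers=4):
    """Run rounds sequentially; cells inside one round run in parallel."""
    loaded = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for cells in rounds:
            # Consuming map() waits until every cell of this round is done
            loaded += sum(pool.map(load_cell, cells.values()))
    return loaded


rows = [{"pu_location": i % 7, "do_location": (i * 3) % 11} for i in range(50)]
rounds = mix_and_batch(rows, table_size=4, start_key="pu_location", end_key="do_location")
total = load_rounds(rounds, load_cell=len)  # len stands in for a real loader
```

The sketch only guarantees that concurrent cells never share a start bucket or an end bucket. When start and end nodes come from the same id space, a node can still appear as a start in one cell and as an end in another cell of the same round; the sketch does not cover that case, and how the library deals with it is not shown here.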

Documentation & Examples

Complete documentation is available at https://neo-technology-field.github.io/python-etl-lib/index.html

See the examples directory for complete projects.

Installation

The library can be installed via:

pip install neo4j-etl-lib

System Dependencies

Some components or documentation tools require additional system-level packages.

Graphviz

If you are building the documentation locally and want to generate diagrams (e.g., using make docs), you need Graphviz installed.

Debian/Ubuntu:

sudo apt install graphviz

Fedora/RHEL/CentOS:

sudo dnf install graphviz

Arch Linux / CachyOS:

sudo pacman -S graphviz

Download files

Source Distribution

neo4j_etl_lib-0.3.4.tar.gz (40.8 kB)

Uploaded Source

Built Distribution

neo4j_etl_lib-0.3.4-py3-none-any.whl (55.6 kB)

Uploaded Python 3

File details

Details for the file neo4j_etl_lib-0.3.4.tar.gz.

File metadata

  • Download URL: neo4j_etl_lib-0.3.4.tar.gz
  • Size: 40.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for neo4j_etl_lib-0.3.4.tar.gz:

  • SHA256: 9fb7d57e0f5594b9d34707bebafc627c63d985de456c7829cd9e085d5b7a7599
  • MD5: 971d9805fdeca9b4fb4585b95390444d
  • BLAKE2b-256: ddc1f2a5a05690b3f47d0a7c463ac9a4863db5efc825fff436d05f3506df160e

File details

Details for the file neo4j_etl_lib-0.3.4-py3-none-any.whl.

File metadata

  • Download URL: neo4j_etl_lib-0.3.4-py3-none-any.whl
  • Size: 55.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.1

File hashes

Hashes for neo4j_etl_lib-0.3.4-py3-none-any.whl:

  • SHA256: 0c7727ec6724d7cfba212397aa497a8bd23c8ba8e42ef38e69aa8686d65819e6
  • MD5: 599f60b76258041572157a65dfb7b9b0
  • BLAKE2b-256: e1eedbed46b310c6a57a99ca383fd505f0bb11805ad761972cd696def7553c07
