Building blocks for ETL pipelines.

Neo4j ETL Toolbox

A robust Python library of building blocks to assemble efficient, scalable ETL pipelines for Neo4j.

It simplifies the process of moving data from SQL, CSV, and Parquet sources into Neo4j by handling common concerns like batching, parallelism, logging, and error handling.

Key Features

  • Task-Based Architecture: Compose pipelines from reusable units of work.
  • Parallel Loading: Strategies for loading relationships in parallel while minimizing lock contention and deadlocks.
  • Data Validation: Integrated Pydantic support for ensuring data quality before loading.
  • Detailed Reporting: Built-in tracking of execution time and row counts.
  • Flexible Sources: Support for SQL (via SQLAlchemy), CSV, Neo4j, and Parquet (via PyArrow).

Parallel Loading Example

The library provides specialized tasks for parallel data loading. By using a "mix-and-batch" strategy, it can load relationships in parallel while minimizing deadlocks.
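The idea behind mix-and-batch can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names (`bucket`, `mix`, `diagonal_groups`, `load_in_parallel`), the `start`/`end` row keys, and the modulo bucketing of integer ids are all hypothetical. Rows are sorted into a grid by the bucket of their start and end node ids, and cells on the same diagonal are written concurrently because they share no start bucket and no end bucket. The sketch treats start and end buckets as separate lock domains; when both ends come from the same set of nodes, two cells in a group can still touch the same node, so extra care would be needed.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def bucket(node_id, table_size):
    """Map an integer node id to a bucket index in [0, table_size)."""
    return node_id % table_size


def mix(rows, table_size, start_key, end_key):
    """Sort rows into grid cells keyed by (start bucket, end bucket)."""
    grid = defaultdict(list)
    for row in rows:
        cell = (bucket(row[start_key], table_size), bucket(row[end_key], table_size))
        grid[cell].append(row)
    return grid


def diagonal_groups(grid, table_size):
    """Yield groups of non-empty cells that share no start bucket and no end bucket."""
    for shift in range(table_size):
        group = []
        for start_bucket in range(table_size):
            cell = (start_bucket, (start_bucket + shift) % table_size)
            if cell in grid:
                group.append((cell, grid[cell]))
        if group:
            yield group


def load_in_parallel(rows, write_batch, table_size=10, max_workers=4):
    """Write one group at a time; batches inside a group run concurrently."""
    grid = mix(rows, table_size, "start", "end")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for group in diagonal_groups(grid, table_size):
            # list() drains the iterator, so a group finishes before the next one starts.
            list(pool.map(write_batch, [batch for _, batch in group]))


# Demo with a stand-in writer that only records the batches it receives.
rows = [
    {"start": 1, "end": 2},
    {"start": 11, "end": 12},
    {"start": 3, "end": 3},
    {"start": 2, "end": 1},
]
written = []
load_in_parallel(rows, written.append)
```

In a real pipeline, `write_batch` would run a Cypher statement for the batch inside its own transaction; here it is a placeholder so the sketch runs without a database.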

Here is an example of defining a parallel CSV loader task (taken from the examples/nyc-taxi project):

from pathlib import Path
from etl_lib.core.ETLContext import ETLContext
from etl_lib.core.SplittingBatchProcessor import dict_id_extractor
from etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask import ParallelCSVLoad2Neo4jTask
from model.trip import Trip  # Your Pydantic model

class LoadTripsParallelTask(ParallelCSVLoad2Neo4jTask):
    def __init__(self, context: ETLContext, csv_path: Path):
        super().__init__(
            context,
            file=csv_path,
            model=Trip,
            error_file=Path('errors_parallel.json'),
            batch_size=5000,
            max_workers=10
        )

    def _query(self):
        return """
            UNWIND $batch AS row
            MATCH (pu:Location {id: row.pu_location})
            MATCH (do:Location {id: row.do_location})
            CREATE (t:Trip {
              id: randomUUID(),
              pickup_datetime: row.pickup_datetime,
              dropoff_datetime: row.dropoff_datetime,
              ...
            })
            CREATE (t)-[:STARTED_AT]->(pu)
            CREATE (t)-[:ENDED_AT]->(do)
        """

    def _id_extractor(self):
        # Defines how to route rows to avoid locking on start/end nodes
        return dict_id_extractor(table_size=10, start_key='pu_location', end_key='do_location')
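
The Trip model imported above is not shown in this README. As a rough sketch, assuming Pydantic is installed and that the model simply mirrors the row keys used in the query, it might look like the following. The field types and the `split_valid_invalid` helper are hypothetical and only illustrate how rows can be validated before loading; the library's own validation and error-file handling may work differently.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError


class Trip(BaseModel):
    """Hypothetical row model; field names mirror the keys used in the Cypher query."""
    pu_location: int
    do_location: int
    pickup_datetime: datetime
    dropoff_datetime: datetime


def split_valid_invalid(rows):
    """Validate raw rows and separate those that pass from those that fail."""
    valid, errors = [], []
    for row in rows:
        try:
            valid.append(Trip(**row))
        except ValidationError as exc:
            # Keep the offending row next to the reason it was rejected.
            errors.append({"row": row, "error": str(exc)})
    return valid, errors


raw_rows = [
    {
        "pu_location": "132",
        "do_location": "48",
        "pickup_datetime": "2024-01-01T08:15:00",
        "dropoff_datetime": "2024-01-01T08:40:00",
    },
    {
        "pu_location": "not-a-number",
        "do_location": "48",
        "pickup_datetime": "2024-01-01T09:00:00",
        "dropoff_datetime": "2024-01-01T09:20:00",
    },
]
valid, errors = split_valid_invalid(raw_rows)
```

CSV readers yield strings, so the model also converts values such as "132" to an integer; the second row is rejected because its pickup location cannot be parsed as one.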

Documentation & Examples

Complete documentation is available at https://neo-technology-field.github.io/python-etl-lib/index.html

See the examples directory for complete projects, such as examples/nyc-taxi.

Installation

The library can be installed via:

pip install neo4j-etl-lib

System Dependencies

Some components or documentation tools require additional system-level packages.

Graphviz

If you are building the documentation locally and want to generate diagrams (e.g., using make docs), you need Graphviz installed.

Debian/Ubuntu:

sudo apt install graphviz

Fedora/RHEL/CentOS:

sudo dnf install graphviz

Arch Linux / CachyOS:

sudo pacman -S graphviz

Podman + Testcontainers (Linux)

Don't. I could not get this to work without a brittle setup. I currently run my tests against a running database instance configured via .env. On CI I use Docker, and it just works.
