Building blocks for ETL pipelines.

Neo4j ETL Toolbox

A robust Python library of building blocks to assemble efficient, scalable ETL pipelines for Neo4j.

It simplifies the process of moving data from SQL, CSV, and Parquet sources into Neo4j by handling common concerns like batching, parallelism, logging, and error handling.

Key Features

  • Task-Based Architecture: Compose pipelines from reusable units of work.
  • Parallel Loading: Loading strategies that write in parallel while minimizing lock contention and deadlocks.
  • Data Validation: Integrated Pydantic support for ensuring data quality before loading.
  • Detailed Reporting: Built-in tracking of execution time and row counts.
  • Flexible Sources: Support for SQL (via SQLAlchemy), CSV, Neo4j, and Parquet (via PyArrow).
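
To make the data validation feature concrete, here is a minimal sketch of validating raw rows with Pydantic before they reach the database. It does not use this library's API: the Trip fields are borrowed from the Cypher query in the parallel loading example in this README, their types are assumptions, and split_valid_invalid is a hypothetical helper.

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError


class Trip(BaseModel):
    """Hypothetical row model; field types are assumptions for illustration."""
    pu_location: int
    do_location: int
    pickup_datetime: datetime
    dropoff_datetime: datetime


def split_valid_invalid(rows):
    """Validate each raw row; return (validated models, rejected rows with errors)."""
    valid, invalid = [], []
    for row in rows:
        try:
            valid.append(Trip(**row))
        except ValidationError as exc:
            # Keep the raw row next to the error text so it can be written to an error file.
            invalid.append({"row": row, "errors": str(exc)})
    return valid, invalid


# CSV readers yield strings, so the sample rows use string values.
rows = [
    {"pu_location": "132", "do_location": "48",
     "pickup_datetime": "2024-01-01T08:00:00",
     "dropoff_datetime": "2024-01-01T08:25:00"},
    {"pu_location": "not-a-number", "do_location": "48",
     "pickup_datetime": "2024-01-01T09:00:00",
     "dropoff_datetime": "2024-01-01T09:10:00"},
]
valid, invalid = split_valid_invalid(rows)
```

Rows that fail validation are collected instead of aborting the run, which is roughly the role the error_file argument appears to play in the loader example.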

Parallel Loading Example

The library provides specialized tasks for parallel data loading. By using a "mix-and-batch" strategy, it can load relationships in parallel while minimizing deadlocks.

Here is an example of defining a parallel CSV loader task (taken from the examples/nyc-taxi project):

from pathlib import Path
from etl_lib.core.ETLContext import ETLContext
from etl_lib.core.SplittingBatchProcessor import dict_id_extractor
from etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask import ParallelCSVLoad2Neo4jTask
from model.trip import Trip  # Your Pydantic model

class LoadTripsParallelTask(ParallelCSVLoad2Neo4jTask):
    def __init__(self, context: ETLContext, csv_path: Path):
        super().__init__(
            context,
            file=csv_path,
            model=Trip,
            error_file=Path('errors_parallel.json'),
            batch_size=5000,
            max_workers=10
        )

    def _query(self):
        return """
            UNWIND $batch AS row
            MATCH (pu:Location {id: row.pu_location})
            MATCH (do:Location {id: row.do_location})
            CREATE (t:Trip {
              id: randomUUID(),
              pickup_datetime: row.pickup_datetime,
              dropoff_datetime: row.dropoff_datetime,
              ...
            })
            CREATE (t)-[:STARTED_AT]->(pu)
            CREATE (t)-[:ENDED_AT]->(do)
        """

    def _id_extractor(self):
        # Defines how to route rows to avoid locking on start/end nodes
        return dict_id_extractor(table_size=10, start_key='pu_location', end_key='do_location')
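
The routing idea behind the ID extractor can be illustrated with a standalone sketch. This is not the library's implementation, and bucket_cell and diagonal_groups are hypothetical names. Each row is placed in a cell of a table_size x table_size grid based on its start and end IDs, and cells are then grouped along diagonals so that no two batches in a group share a start bucket or an end bucket. The sketch assumes integer IDs and treats start and end buckets as independent key spaces.

```python
from collections import defaultdict


def bucket_cell(row, table_size, start_key, end_key):
    """Return the (start_bucket, end_bucket) grid cell for a row (hypothetical helper)."""
    return int(row[start_key]) % table_size, int(row[end_key]) % table_size


def diagonal_groups(rows, table_size, start_key, end_key):
    """Yield groups of batches; batches in one group share no start or end bucket."""
    cells = defaultdict(list)
    for row in rows:
        cells[bucket_cell(row, table_size, start_key, end_key)].append(row)

    # Diagonal `shift` holds the cells (r, (r + shift) % table_size), so every
    # start bucket and every end bucket occurs at most once per group.
    for shift in range(table_size):
        group = []
        for start_bucket in range(table_size):
            end_bucket = (start_bucket + shift) % table_size
            batch = cells.get((start_bucket, end_bucket))
            if batch:
                group.append(batch)
        if group:
            yield group


rows = [
    {"pu_location": 1, "do_location": 2},
    {"pu_location": 11, "do_location": 12},
    {"pu_location": 3, "do_location": 4},
    {"pu_location": 1, "do_location": 4},
]
groups = list(diagonal_groups(rows, 10, "pu_location", "do_location"))
```

The batches inside one group could be handed to separate workers, while the groups themselves would run one after another. Here the fourth row lands in its own group because it shares start bucket 1 with the first two rows.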

Documentation & Examples

Complete documentation is available at https://neo-technology-field.github.io/python-etl-lib/index.html

See the examples directory for complete projects.

Installation

The library can be installed via:

pip install neo4j-etl-lib

System Dependencies

Some components or documentation tools require additional system-level packages.

Graphviz

If you are building the documentation locally and want to generate diagrams (e.g., using make docs), you need Graphviz installed.

Debian/Ubuntu:

sudo apt install graphviz

Fedora/RHEL/CentOS:

sudo dnf install graphviz

Arch Linux / CachyOS:

sudo pacman -S graphviz

Podman + Testcontainers (Linux)

Don't. I could not get this to work without a brittle setup. I currently run my tests by pointing to a running database instance via .env. On CI I use Docker, and it just works.
