Skip to main content

A document analysis pipeline with knowledge graph

Project description

DocumentGraph: An ETL Pipeline for Document Analysis with Neo4j

Overview

DocumentGraph is a Python package designed for end-to-end document analysis, using an ETL (Extract, Transform, Load) pipeline to process textual documents and represent the extracted information in a Neo4j knowledge graph. The package extracts text from documents, preprocesses and chunks the content, generates embeddings, and identifies entities and relationships within the text. These entities, relationships, and text chunks are then loaded into a Neo4j graph database for advanced analysis and querying.

This package is ideal for users who need to process large volumes of documents and structure them into a graph-based knowledge representation, where entities and their relationships can be explored and queried efficiently.

Key Features

  • Document extraction: Loads documents from a specified input folder.
  • Text preprocessing: Cleans and chunks the document into smaller, meaningful pieces.
  • Embedding generation: Generates vector representations for text chunks.
  • Entity and relationship extraction: Detects entities and relationships within the text using a knowledge extraction model.
  • Knowledge graph loading: Loads documents, text chunks, entities, and relationships into a Neo4j graph database.

Limitations

  • The package assumes that the input documents are in .txt format.
  • Preprocessing and extraction pipelines are designed for text data only.
  • The performance depends on the quality of the pre-trained embedding models and entity extraction logic.
  • Neo4j must be set up and running (locally or via AuraDB) with proper credentials for the package to function.

Prerequisites

1. Python Environment

Ensure you have Python 3.10+ installed in your environment. You can create a virtual environment to manage dependencies easily:

python -m venv documentgraph-env
source documentgraph-env/bin/activate   # On Windows: documentgraph-env\Scripts\activate

2. Required Python Packages

Install the required dependencies from the requirements.txt file:

pip install -r requirements.txt

These packages include libraries for Neo4j, logging, and document extraction.

3. Neo4j Setup

The package requires a Neo4j database to store and query the knowledge graph. You can either use Neo4j Aura (cloud-based) or run a local Neo4j instance.

Option 1: Neo4j Aura (Cloud-based)

  • Sign up for a free or paid Neo4j Aura account at https://aura.neo4j.io/.
  • Create a new Neo4j project and note down the uri, username, and password for connection.

Option 2: Local Neo4j Instance

  • Download and install Neo4j Desktop from https://neo4j.com/download/.
  • Start a new local graph database instance.
  • The default local connection uri is usually bolt://localhost:7687, and the default username/password is neo4j/neo4j.

4. Environment Variables

You need to set up environment variables to allow the package to connect to the Neo4j database. You can add these variables to your shell environment or use a .env file.

export NEO4J_URI=bolt://localhost:7687  # For local Neo4j instance
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export OPENAI_API_KEY=your-openai-api-key

If you are using Neo4j Aura, replace the URI and credentials accordingly:

export NEO4J_URI=neo4j+s://your-aura-database-uri
export NEO4J_USER=your-username
export NEO4J_PASSWORD=your-password
export OPENAI_API_KEY=your-openai-api-key

5. Additional Neo4j Configuration

For proper relationship creation, ensure you have the APOC (Awesome Procedures on Cypher) plugin installed in your Neo4j instance. This is necessary for creating custom relationships between entities and text chunks.

Usage

1. Setting up the ETL Pipeline

from documentgraph import ETLConfig, DocumentAnalysisPipeline

# Create an ETLConfig with Neo4j credentials
etl_config = ETLConfig()

# Initialize the ETL pipeline
pipeline = DocumentAnalysisPipeline(etl_config)

# Execute the pipeline with the input folder containing text documents
pipeline.execute_pipeline(input_folder="path/to/your/text/files")

2. Pipeline Workflow

  1. Document Extraction: The pipeline reads all .txt files from the specified input folder.
  2. Text Preprocessing: The text is cleaned and broken down into smaller chunks.
  3. Embedding Generation: Each chunk gets converted into a vector using a pre-trained embedding model.
  4. Entity and Relationship Extraction: Entities and relationships between them are identified within the chunks.
  5. Knowledge Graph Loading: The extracted entities, relationships, and chunks are saved in the Neo4j knowledge graph.

3. Querying the Knowledge Graph

Once the pipeline has processed the documents and loaded the data into Neo4j, you can query the graph for insights using Cypher.

For example, to retrieve all entities in the graph:

MATCH (e:Entity) RETURN e LIMIT 10;

To retrieve relationships between entities:

MATCH (e1:Entity)-[r]->(e2:Entity) RETURN e1, r, e2 LIMIT 10;

Contributing

We welcome contributions to DocumentGraph! Here's how you can help:

Reporting Issues

If you encounter any bugs or have suggestions for improvements:

  1. Check the existing issues to avoid duplicates.
  2. If your issue isn't already listed, open a new issue.
  3. Clearly describe the problem or enhancement, including steps to reproduce if applicable.
  4. Add relevant labels (e.g., 'bug', 'enhancement', 'documentation').

Making Enhancements

To contribute code or documentation improvements:

  1. Fork the repository.
  2. Create a new branch for your feature: git checkout -b feature/your-feature-name.
  3. Make your changes, ensuring you follow the project's coding standards.
  4. Write or update tests as necessary.
  5. Commit your changes with clear, descriptive commit messages.
  6. Push to your fork and submit a pull request.

Proposing Major Changes

For significant changes that could alter the project's direction:

  1. Open an issue to discuss your proposal before starting work.
  2. Outline the rationale and implementation details of your proposal.
  3. Engage in discussion with maintainers and the community.
  4. If approved, follow the process for making enhancements.

We appreciate your contributions to making DocumentGraph better!

License

DocumentGraph is licensed under the Apache License Version 2.0.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

documentgraph-0.1.2.tar.gz (18.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

documentgraph-0.1.2-py3-none-any.whl (18.1 kB view details)

Uploaded Python 3

File details

Details for the file documentgraph-0.1.2.tar.gz.

File metadata

  • Download URL: documentgraph-0.1.2.tar.gz
  • Upload date:
  • Size: 18.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for documentgraph-0.1.2.tar.gz
Algorithm Hash digest
SHA256 dc02486bcf2253ecd2d310b6248e31bec5c7077652321f19602582dcfca7f80b
MD5 67ff2e0f5f23b16ed53f49ec1316f553
BLAKE2b-256 b873638418551b31ad97812c0eca0fcb2e98441b945fae7daff9ac6301c09e29

See more details on using hashes here.

File details

Details for the file documentgraph-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: documentgraph-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for documentgraph-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 597aaaf1357787b434fa17c5e7081d5b25075f2d28255b1088d11f0239483e21
MD5 bd171e4e21a657daf6e785f974954de6
BLAKE2b-256 360c7797753b209d5434f881a53cef3506335850ecf92d4079bd601fb5ac1a3f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page