A document analysis pipeline with a knowledge graph

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

DocumentGraph: An ETL Pipeline for Document Analysis with Neo4j

Overview

DocumentGraph is a Python package designed for end-to-end document analysis, using an ETL (Extract, Transform, Load) pipeline to process textual documents and represent the extracted information in a Neo4j knowledge graph. The package extracts text from documents, preprocesses and chunks the content, generates embeddings, and identifies entities and relationships within the text. These entities, relationships, and text chunks are then loaded into a Neo4j graph database for advanced analysis and querying.

This package is ideal for users who need to process large volumes of documents and structure them into a graph-based knowledge representation, where entities and their relationships can be explored and queried efficiently.

Key Features

Document extraction: Loads documents from a specified input folder.
Text preprocessing: Cleans and chunks the document into smaller, meaningful pieces.
Embedding generation: Generates vector representations for text chunks.
Entity and relationship extraction: Detects entities and relationships within the text using a knowledge extraction model.
Knowledge graph loading: Loads documents, text chunks, entities, and relationships into a Neo4j graph database.

Limitations

The package assumes that the input documents are in .txt format.
Preprocessing and extraction pipelines are designed for text data only.
Output quality depends on the pre-trained embedding model and on the entity extraction logic.
Neo4j must be set up and running (locally or via AuraDB) with proper credentials for the package to function.

Prerequisites

1. Python Environment

Ensure you have Python 3.10+ installed in your environment. You can create a virtual environment to manage dependencies easily:

python -m venv documentgraph-env
source documentgraph-env/bin/activate   # On Windows: documentgraph-env\Scripts\activate

2. Required Python Packages

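Install the package from PyPI:

pip install documentgraph

If you are working from a source checkout instead, install the required dependencies from the requirements.txt file:

pip install -r requirements.txt

These packages include libraries for Neo4j, logging, and document extraction.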

3. Neo4j Setup

The package requires a Neo4j database to store and query the knowledge graph. You can either use Neo4j Aura (cloud-based) or run a local Neo4j instance.

Option 1: Neo4j Aura (Cloud-based)

Sign up for a free or paid Neo4j Aura account at https://aura.neo4j.io/.
Create a new Neo4j instance and note down its connection URI, username, and password.

Option 2: Local Neo4j Instance

Download and install Neo4j Desktop from https://neo4j.com/download/.
Start a new local graph database instance.
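The default local connection URI is usually bolt://localhost:7687. A fresh Neo4j server starts with the username/password neo4j/neo4j and requires you to change the password on first login; use the password you chose in the environment variables below.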

4. Environment Variables

Set the following environment variables so the package can connect to the Neo4j database. The examples also set an OpenAI API key alongside the Neo4j credentials. You can add these variables to your shell environment or use a .env file.

export NEO4J_URI=bolt://localhost:7687  # For local Neo4j instance
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export OPENAI_API_KEY=your-openai-api-key

If you are using Neo4j Aura, replace the URI and credentials accordingly:

export NEO4J_URI=neo4j+s://your-aura-database-uri
export NEO4J_USER=your-username
export NEO4J_PASSWORD=your-password
export OPENAI_API_KEY=your-openai-api-key
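
Before running the pipeline, it can help to confirm that all four variables are actually set. The following is a minimal sketch, not part of DocumentGraph: the ConnectionSettings class and load_settings function are hypothetical helpers, and only the variable names come from the export lines above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "OPENAI_API_KEY")


@dataclass(frozen=True)
class ConnectionSettings:
    """Hypothetical container; DocumentGraph's own ETLConfig may differ."""
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    openai_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> ConnectionSettings:
    """Read the four variables, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ConnectionSettings(
        neo4j_uri=env["NEO4J_URI"],
        neo4j_user=env["NEO4J_USER"],
        neo4j_password=env["NEO4J_PASSWORD"],
        openai_api_key=env["OPENAI_API_KEY"],
    )
```

Calling load_settings() with no arguments reads from the current process environment; passing a plain dict makes the check easy to test.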

5. Additional Neo4j Configuration

For proper relationship creation, ensure you have the APOC (Awesome Procedures on Cypher) plugin installed in your Neo4j instance. This is necessary for creating custom relationships between entities and text chunks.
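
One plausible reason APOC is needed: Cypher has traditionally not allowed a relationship type to be passed as a query parameter, whereas the apoc.create.relationship procedure accepts the type as a string. The sketch below is an assumption about how such a query might be built, not the package's actual loader. The helper build_relationship_query and the name property on Entity nodes are hypothetical.

```python
from __future__ import annotations

from typing import Any


def build_relationship_query(
    source: str,
    target: str,
    rel_type: str,
    properties: dict[str, Any] | None = None,
) -> tuple[str, dict[str, Any]]:
    """Return a Cypher query and its parameters for a dynamically typed relationship.

    Assumes Entity nodes are identified by a `name` property (hypothetical).
    """
    query = (
        "MATCH (a:Entity {name: $source}) "
        "MATCH (b:Entity {name: $target}) "
        # APOC procedure signature: (from, relType, props, to)
        "CALL apoc.create.relationship(a, $rel_type, $props, b) YIELD rel "
        "RETURN type(rel) AS rel_type"
    )
    params = {
        "source": source,
        "target": target,
        "rel_type": rel_type,
        "props": properties or {},
    }
    return query, params
```

The returned pair can be handed to a Neo4j session as session.run(query, **params). Keeping every value in a parameter avoids building Cypher by string concatenation.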

Usage

1. Setting up the ETL Pipeline

from documentgraph import ETLConfig, DocumentAnalysisPipeline

# Create an ETLConfig; no credentials are passed here, so they are presumably read from the environment variables set above
etl_config = ETLConfig()

# Initialize the ETL pipeline
pipeline = DocumentAnalysisPipeline(etl_config)

# Execute the pipeline with the input folder containing text documents
pipeline.execute_pipeline(input_folder="path/to/your/text/files")
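
Because the package expects .txt input, a quick pre-flight check of the input folder can catch an empty or misnamed directory before the pipeline starts. This is a minimal sketch; find_text_files is a hypothetical helper and not part of the DocumentGraph API.

```python
from __future__ import annotations

from pathlib import Path


def find_text_files(input_folder: str) -> list[Path]:
    """Return the .txt files directly inside input_folder, sorted by name."""
    folder = Path(input_folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"Input folder not found: {folder}")
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() == ".txt"
    )
```

If the returned list is empty, there is nothing for the pipeline to extract from that folder.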

2. Pipeline Workflow

Document Extraction: The pipeline reads all .txt files from the specified input folder.
Text Preprocessing: The text is cleaned and broken down into smaller chunks.
Embedding Generation: Each chunk is converted into a vector using a pre-trained embedding model.
Entity and Relationship Extraction: Entities and relationships between them are identified within the chunks.
Knowledge Graph Loading: The extracted entities, relationships, and chunks are saved in the Neo4j knowledge graph.
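
To make the preprocessing step concrete, here is a minimal sketch of word-based chunking with overlap. It illustrates the general technique only; the chunk_text function, its parameters, and its default sizes are assumptions, and DocumentGraph's own chunking strategy may differ.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that context spanning a
    chunk boundary is not lost.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()  # also collapses runs of whitespace and newlines
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each returned chunk would then be passed to the embedding model and to the entity extraction step.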

3. Querying the Knowledge Graph

Once the pipeline has processed the documents and loaded the data into Neo4j, you can query the graph for insights using Cypher.

For example, to retrieve all entities in the graph:

MATCH (e:Entity) RETURN e LIMIT 10;

To retrieve relationships between entities:

MATCH (e1:Entity)-[r]->(e2:Entity) RETURN e1, r, e2 LIMIT 10;
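
The same queries can be run from Python. The sketch below assumes the official neo4j Python driver is installed; fetch_entities is a hypothetical helper, not part of DocumentGraph, and the commented usage lines assume the environment variables from the Prerequisites section.

```python
from __future__ import annotations

from typing import Any

# Same query as above, with the limit passed as a parameter.
ENTITY_QUERY = "MATCH (e:Entity) RETURN e LIMIT $limit"


def fetch_entities(driver: Any, limit: int = 10) -> list[dict]:
    """Run the entity query and return each record as a plain dict."""
    with driver.session() as session:
        result = session.run(ENTITY_QUERY, limit=limit)
        return [record.data() for record in result]


# Usage, assuming a reachable Neo4j instance:
#
#   import os
#   from neo4j import GraphDatabase
#
#   driver = GraphDatabase.driver(
#       os.environ["NEO4J_URI"],
#       auth=(os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"]),
#   )
#   for row in fetch_entities(driver):
#       print(row)
#   driver.close()
```

The property names inside each returned dict depend on what the pipeline stored on the Entity nodes, so inspect a few rows before relying on specific keys.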

Contributing

We welcome contributions to DocumentGraph! Here's how you can help:

Reporting Issues

If you encounter any bugs or have suggestions for improvements:

Check the existing issues to avoid duplicates.
If your issue isn't already listed, open a new issue.
Clearly describe the problem or enhancement, including steps to reproduce if applicable.
Add relevant labels (e.g., 'bug', 'enhancement', 'documentation').

Making Enhancements

To contribute code or documentation improvements:

Fork the repository.
Create a new branch for your feature: git checkout -b feature/your-feature-name.
Make your changes, ensuring you follow the project's coding standards.
Write or update tests as necessary.
Commit your changes with clear, descriptive commit messages.
Push to your fork and submit a pull request.

Proposing Major Changes

For significant changes that could alter the project's direction:

Open an issue to discuss your proposal before starting work.
Outline the rationale and implementation details of your proposal.
Engage in discussion with maintainers and the community.
If approved, follow the process for making enhancements.

We appreciate your contributions to making DocumentGraph better!

License

DocumentGraph is licensed under the Apache License, Version 2.0.

Release history

- 0.1.2 (this version): Sep 28, 2024
- 0.1.1: Sep 28, 2024
- 0.1.0: Sep 27, 2024

Download files

Source Distribution

documentgraph-0.1.2.tar.gz (18.2 kB view details)

Uploaded Sep 28, 2024 Source

Built Distribution

documentgraph-0.1.2-py3-none-any.whl (18.1 kB view details)

Uploaded Sep 28, 2024 Python 3

File details

Details for the file documentgraph-0.1.2.tar.gz.

File metadata

Download URL: documentgraph-0.1.2.tar.gz
Upload date: Sep 28, 2024
Size: 18.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for documentgraph-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`dc02486bcf2253ecd2d310b6248e31bec5c7077652321f19602582dcfca7f80b`
MD5	`67ff2e0f5f23b16ed53f49ec1316f553`
BLAKE2b-256	`b873638418551b31ad97812c0eca0fcb2e98441b945fae7daff9ac6301c09e29`

File details

Details for the file documentgraph-0.1.2-py3-none-any.whl.

File metadata

Download URL: documentgraph-0.1.2-py3-none-any.whl
Upload date: Sep 28, 2024
Size: 18.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.9.13

File hashes

Hashes for documentgraph-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`597aaaf1357787b434fa17c5e7081d5b25075f2d28255b1088d11f0239483e21`
MD5	`bd171e4e21a657daf6e785f974954de6`
BLAKE2b-256	`360c7797753b209d5434f881a53cef3506335850ecf92d4079bd601fb5ac1a3f`
