A document analysis pipeline with knowledge graph
DocumentGraph: An ETL Pipeline for Document Analysis with Neo4j
Overview
DocumentGraph is a Python package designed for end-to-end document analysis, using an ETL (Extract, Transform, Load) pipeline to process textual documents and represent the extracted information in a Neo4j knowledge graph. The package extracts text from documents, preprocesses and chunks the content, generates embeddings, and identifies entities and relationships within the text. These entities, relationships, and text chunks are then loaded into a Neo4j graph database for advanced analysis and querying.
This package is ideal for users who need to process large volumes of documents and structure them into a graph-based knowledge representation, where entities and their relationships can be explored and queried efficiently.
Key Features
- Document extraction: Loads documents from a specified input folder.
- Text preprocessing: Cleans and chunks the document into smaller, meaningful pieces.
- Embedding generation: Generates vector representations for text chunks.
- Entity and relationship extraction: Detects entities and relationships within the text using a knowledge extraction model.
- Knowledge graph loading: Loads documents, text chunks, entities, and relationships into a Neo4j graph database.
Limitations
- The package assumes that the input documents are in .txt format.
- Preprocessing and extraction pipelines are designed for text data only.
- The performance depends on the quality of the pre-trained embedding models and entity extraction logic.
- Neo4j must be set up and running (locally or via AuraDB) with proper credentials for the package to function.
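Because the package only handles .txt input, it can help to check what an input folder actually contains before running the pipeline. The following is a minimal sketch using only the standard library; `find_text_files` is a hypothetical helper, not part of DocumentGraph, and it assumes a non-recursive scan (whether the package itself descends into subfolders is not documented here).

```python
from pathlib import Path


def find_text_files(input_folder: str) -> list[Path]:
    """Return the .txt files directly inside input_folder, sorted by path.

    Hypothetical helper for a pre-flight check; DocumentGraph's own
    loader may select files differently.
    """
    folder = Path(input_folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"Input folder not found: {folder}")
    # Match the extension case-insensitively so NOTES.TXT is also picked up.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
```

Files with any other extension are skipped by this check, which makes it easy to spot documents that would need converting to plain text first.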
Prerequisites
1. Python Environment
Ensure you have Python 3.10+ installed in your environment. You can create a virtual environment to manage dependencies easily:
python -m venv documentgraph-env
source documentgraph-env/bin/activate # On Windows: documentgraph-env\Scripts\activate
2. Required Python Packages
Install the required dependencies from the requirements.txt file:
pip install -r requirements.txt
These packages include libraries for Neo4j, logging, and document extraction.
3. Neo4j Setup
The package requires a Neo4j database to store and query the knowledge graph. You can either use Neo4j Aura (cloud-based) or run a local Neo4j instance.
Option 1: Neo4j Aura (Cloud-based)
- Sign up for a free or paid Neo4j Aura account at https://aura.neo4j.io/.
- Create a new Neo4j project and note down the uri, username, and password for the connection.
Option 2: Local Neo4j Instance
- Download and install Neo4j Desktop from https://neo4j.com/download/.
- Start a new local graph database instance.
- The default local connection uri is usually bolt://localhost:7687, and the default username/password is neo4j/neo4j.
4. Environment Variables
Set the following environment variables so the package can connect to the Neo4j database; the examples also set an OpenAI API key. You can add these variables to your shell environment or use a .env file.
export NEO4J_URI=bolt://localhost:7687 # For local Neo4j instance
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password
export OPENAI_API_KEY=your-openai-api-key
If you are using Neo4j Aura, replace the URI and credentials accordingly:
export NEO4J_URI=neo4j+s://your-aura-database-uri
export NEO4J_USER=your-username
export NEO4J_PASSWORD=your-password
export OPENAI_API_KEY=your-openai-api-key
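A missing variable is easier to diagnose before the pipeline starts than halfway through a run. The sketch below reads the four variables shown above and fails early with the names that are absent. `ConnectionSettings` and `load_settings` are hypothetical names used for illustration; how ETLConfig actually reads its configuration is not described in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the export examples above.
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD", "OPENAI_API_KEY")


@dataclass(frozen=True)
class ConnectionSettings:
    """Hypothetical container; not part of DocumentGraph."""
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    openai_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> ConnectionSettings:
    """Read the required variables, raising if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return ConnectionSettings(
        neo4j_uri=env["NEO4J_URI"],
        neo4j_user=env["NEO4J_USER"],
        neo4j_password=env["NEO4J_PASSWORD"],
        openai_api_key=env["OPENAI_API_KEY"],
    )
```

Calling `load_settings()` with no arguments checks the current process environment; passing a dictionary instead makes the check easy to exercise in tests.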
5. Additional Neo4j Configuration
For proper relationship creation, ensure you have the APOC (Awesome Procedures on Cypher) plugin installed in your Neo4j instance. This is necessary for creating custom relationships between entities and text chunks.
Usage
1. Setting up the ETL Pipeline
from documentgraph import ETLConfig, DocumentAnalysisPipeline
# Create an ETLConfig with Neo4j credentials
etl_config = ETLConfig()
# Initialize the ETL pipeline
pipeline = DocumentAnalysisPipeline(etl_config)
# Execute the pipeline with the input folder containing text documents
pipeline.execute_pipeline(input_folder="path/to/your/text/files")
2. Pipeline Workflow
- Document Extraction: The pipeline reads all .txt files from the specified input folder.
- Text Preprocessing: The text is cleaned and broken down into smaller chunks.
- Embedding Generation: Each chunk gets converted into a vector using a pre-trained embedding model.
- Entity and Relationship Extraction: Entities and relationships between them are identified within the chunks.
- Knowledge Graph Loading: The extracted entities, relationships, and chunks are saved in the Neo4j knowledge graph.
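To make the preprocessing step above more concrete, here is one common way to split text into chunks: a fixed-size word window with a small overlap between neighbours. This is an illustrative sketch only. The chunking strategy, chunk size, and overlap that DocumentGraph actually uses are not documented here, and `chunk_words` is a hypothetical function.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears with some context in the next chunk.
    Illustrative only; not DocumentGraph's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlap trades some duplicated storage for better recall: an entity mention that straddles a chunk boundary is less likely to be split in both chunks.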
3. Querying the Knowledge Graph
Once the pipeline has processed the documents and loaded the data into Neo4j, you can query the graph for insights using Cypher.
For example, to retrieve all entities in the graph:
MATCH (e:Entity) RETURN e LIMIT 10;
To retrieve relationships between entities:
MATCH (e1:Entity)-[r]->(e2:Entity) RETURN e1, r, e2 LIMIT 10;
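The same kind of query can be run from Python with the official neo4j driver (`pip install neo4j`). The sketch below counts relationships between Entity nodes by type, which gives a quick overview of what the pipeline extracted. It relies only on the Entity label used in the queries above; the relationship types present in your graph depend on your documents. `relationship_type_counts` and `main` are hypothetical helpers, and the connection details are assumed to come from the environment variables described earlier.

```python
import os

# Count relationships between Entity nodes, grouped by relationship type.
REL_TYPE_COUNTS = (
    "MATCH (:Entity)-[r]->(:Entity) "
    "RETURN type(r) AS rel_type, count(*) AS n "
    "ORDER BY n DESC LIMIT $limit"
)


def relationship_type_counts(session, limit: int = 10) -> list[tuple[str, int]]:
    """Run the query on an open session and return (type, count) pairs."""
    result = session.run(REL_TYPE_COUNTS, limit=limit)
    return [(record["rel_type"], record["n"]) for record in result]


def main() -> None:
    # Imported here so the helper above can be used without the driver installed.
    from neo4j import GraphDatabase

    uri = os.environ["NEO4J_URI"]
    auth = (os.environ["NEO4J_USER"], os.environ["NEO4J_PASSWORD"])
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for rel_type, n in relationship_type_counts(session):
                print(f"{rel_type}: {n}")
```

Call `main()` once the pipeline has loaded data. Passing the limit as a query parameter rather than formatting it into the string keeps the Cypher text constant and avoids injection issues.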
Contributing
We welcome contributions to DocumentGraph! Here's how you can help:
Reporting Issues
If you encounter any bugs or have suggestions for improvements:
- Check the existing issues to avoid duplicates.
- If your issue isn't already listed, open a new issue.
- Clearly describe the problem or enhancement, including steps to reproduce if applicable.
- Add relevant labels (e.g., 'bug', 'enhancement', 'documentation').
Making Enhancements
To contribute code or documentation improvements:
- Fork the repository.
- Create a new branch for your feature: git checkout -b feature/your-feature-name.
- Make your changes, ensuring you follow the project's coding standards.
- Write or update tests as necessary.
- Commit your changes with clear, descriptive commit messages.
- Push to your fork and submit a pull request.
Proposing Major Changes
For significant changes that could alter the project's direction:
- Open an issue to discuss your proposal before starting work.
- Outline the rationale and implementation details of your proposal.
- Engage in discussion with maintainers and the community.
- If approved, follow the process for making enhancements.
We appreciate your contributions to making DocumentGraph better!
License
DocumentGraph is licensed under the Apache License Version 2.0.