
Library for extracting schemas and building ontologies from documents using LLMs


scrapontologies


The generated schemas are inferred from documents and can be used to define tables in a database or to build a knowledge graph.
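
To make the database use case concrete, here is a minimal sketch that turns the scalar properties of a generated schema into a CREATE TABLE statement. The helper `schema_to_create_table` and the type mapping are hypothetical and not part of the library; nested objects and arrays are skipped because they would need their own tables.

```python
# Hypothetical mapping from JSON-schema scalar types to SQL column types.
JSON_TO_SQL = {
    "string": "TEXT",
    "integer": "INTEGER",
    "number": "REAL",
    "boolean": "BOOLEAN",
}


def schema_to_create_table(table_name, schema):
    """Build a CREATE TABLE statement from the scalar properties of a schema."""
    columns = []
    for name, spec in schema.get("properties", {}).items():
        sql_type = JSON_TO_SQL.get(spec.get("type"))
        if sql_type is None:
            # Nested objects and arrays need their own tables; skipped in this sketch.
            continue
        columns.append(f'  "{name}" {sql_type}')
    return f'CREATE TABLE "{table_name}" (\n' + ",\n".join(columns) + "\n);"


# A small hand-written schema in the same shape the library produces.
example_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "series": {"type": "string"},
        "fees": {"type": "object", "properties": {}},
    },
}

print(schema_to_create_table("portfolio", example_schema))
```

Only `name` and `series` become columns here; `fees` is an object, so a fuller mapping would place it in a separate table linked by a foreign key.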

Features

  • Entity Extraction: Automatically identifies and extracts entities from PDF files.
  • Schema Generation: Constructs a schema based on the structure of the extracted entities.
  • Visualization: Dynamic schema visualization.

Quick Start

Prerequisites

Before you begin, ensure you have the following installed on your system:

  • Python: Make sure Python 3.9+ is installed.
  • Poppler: This tool is required for converting PDFs to images.

macOS Installation

To install Poppler on macOS, use the following command:

brew install poppler

Linux Installation

To install Poppler on Linux, use the following command:

sudo apt-get install poppler-utils

Windows Installation

  1. Download the latest Poppler release for Windows from the Poppler releases page.
  2. Extract the downloaded zip file to a location on your computer (e.g., C:\Program Files\poppler).
  3. Add the bin directory of the extracted folder to your system's PATH environment variable.

To add to PATH:

  1. Search for "Environment Variables" in the Start menu and open it.
  2. Under "System variables", find and select "Path", then click "Edit".
  3. Click "New" and add the path to the Poppler bin directory (e.g., C:\Program Files\poppler\bin).
  4. Click "OK" to save the changes.

After installation, restart your terminal or command prompt for the changes to take effect. If Poppler is still not found, restart your computer.
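
To confirm that Poppler is reachable from your PATH, a quick check like the following can help. It is a minimal sketch using only the standard library; `pdftoppm` and `pdfinfo` are command-line tools shipped with Poppler, and the helper name `find_poppler_tools` is made up for this example.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo")):
    """Return a mapping of tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_poppler_tools().items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, recheck the PATH entry and open a new terminal before trying again.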

Installation

After installing the prerequisites, clone the repository and install its dependencies. You can then use scrape_schema to extract entities and their schemas from PDFs.

git clone https://github.com/ScrapeGraphAI/scrape_schema
cd scrape_schema
pip install -r requirements.txt

Usage

# LLMClient is used below; importing it from the package root is an assumption
from scrape_schema import FileExtractor, LLMClient, PDFParser
import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file
api_key = os.getenv("OPENAI_API_KEY")

# Path to your PDF file
pdf_path = "./test.pdf"

# Create an LLMClient instance
llm_client = LLMClient(api_key)

# Create a PDFParser instance with the LLMClient
pdf_parser = PDFParser(llm_client)

# Create a FileExtractor instance with the PDF parser
pdf_extractor = FileExtractor(pdf_path, pdf_parser)

# Generate a JSON schema from the PDF
entities = pdf_extractor.generate_json_schema()

print(entities)
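
In the example above, os.getenv returns None when OPENAI_API_KEY is missing, and the problem may only surface later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of the library.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or .env file"
        )
    return value
```

You would call it after load_dotenv(), for example api_key = require_env("OPENAI_API_KEY"), so a missing key stops the script before any PDF is processed.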

Output

{
  "ROOT": {
    "portfolio": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string"
        },
        "series": {
          "type": "string"
        },
        "fees": {
          "type": "object",
          "properties": {
            "salesCharges": {
              "type": "string"
            },
            "fundExpenses": {
              "type": "object",
              "properties": {
                "managementExpenseRatio": {
                  "type": "string"
                },
                "tradingExpenseRatio": {
                  "type": "string"
                },
                "totalExpenses": {
                  "type": "string"
                }
              }
            },
            "trailingCommissions": {
              "type": "string"
            }
          }
        },
        "withdrawalRights": {
          "type": "object",
          "properties": {
            "timeLimit": {
              "type": "string"
            },
            "conditions": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          }
        },
        "contactInformation": {
          "type": "object",
          "properties": {
            "companyName": {
              "type": "string"
            },
            "address": {
              "type": "string"
            },
            "phone": {
              "type": "string"
            },
            "email": {
              "type": "string"
            },
            "website": {
              "type": "string"
            }
          }
        },
        "yearByYearReturns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "year": {
                "type": "string"
              },
              "return": {
                "type": "string"
              }
            }
          }
        },
        "bestWorstReturns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "type": {
                "type": "string"
              },
              "return": {
                "type": "string"
              },
              "date": {
                "type": "string"
              },
              "investmentValue": {
                "type": "string"
              }
            }
          }
        },
        "averageReturn": {
          "type": "string"
        },
        "targetInvestors": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "taxInformation": {
          "type": "string"
        }
      }
    }
  }
}
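
Nested schemas like the one above are easier to inspect as a flat list of property paths. The sketch below walks a schema recursively and yields each dotted path with its type; `iter_schema_paths` is a hypothetical helper, and the `[]` suffix for array items is just a convention chosen for this example.

```python
def iter_schema_paths(schema, prefix=""):
    """Yield (dotted_path, type) pairs for every property in a nested schema."""
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        yield path, spec.get("type")
        if spec.get("type") == "object":
            # Recurse into nested objects.
            yield from iter_schema_paths(spec, path)
        elif spec.get("type") == "array":
            items = spec.get("items", {})
            if items.get("type") == "object":
                # Recurse into the element schema of object arrays.
                yield from iter_schema_paths(items, path + "[]")


# A trimmed-down version of the schema shown above.
sample = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "fees": {
            "type": "object",
            "properties": {"salesCharges": {"type": "string"}},
        },
        "yearByYearReturns": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"year": {"type": "string"}},
            },
        },
    },
}

for path, kind in iter_schema_paths(sample):
    print(f"{path}: {kind}")
```

With the output shown above, the object schema sits under the "ROOT" and "portfolio" keys, so you would pass entities["ROOT"]["portfolio"] to the helper.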

🤝 Contributing

Feel free to contribute, and join our Discord server to discuss improvements and share suggestions!

Please see the contributing guidelines.



Created by Scrapegraphai

