A library for generating RDF datasets from various data sources

RDF Generator

The rdf_generator library provides tools for generating RDF datasets from various data sources, including scraped websites, Excel sheets, CSV files, text files, PDFs, and relational databases (PostgreSQL, MySQL, etc.). It aims to simplify the process of building RDF datasets, enabling seamless integration into linked data workflows.


Features

  • Modular Parsers: Support for CSV, Excel, PDFs, relational databases (PostgreSQL, MySQL), and BEACON files.
  • Web Scraping: Extract structured data from websites.
  • RDF Generation: Build RDF graphs using rdflib, complete with namespaces and serialization options.
  • Customizable Workflows: Easily extend and integrate with your data pipelines.
  • Serialization Formats: Generate RDF in Turtle, RDF/XML, JSON-LD, and other formats.

Installation

1. Install from PyPI (Standard Method)

pip install rdf_generator

2. Install Directly from GitHub (Alternative Method)

Alternatively, clone the repository and install the library from source:

git clone https://github.com/judaicalink/rdf_generator.git
cd rdf_generator
pip install .

Or, install it directly from GitHub using:

pip install git+https://github.com/judaicalink/rdf_generator.git

Requirements

  • Python 3.7 or higher

  • Core dependencies:
    • rdflib
    • pandas
    • requests
    • beautifulsoup4
    • PyPDF2
    • mysql-connector-python
    • psycopg2

Install the dependencies with:

pip install -r requirements.txt

Usage

The rdf_generator library is designed to provide parsers for multiple data sources and utilities to generate RDF datasets. Below are examples for various data sources.

  1. Generate RDF from CSV Files
from rdf_generator.parsers.csv_parser import CSVParser
from rdf_generator.rdf_builder import RDFBuilder

csv_parser = CSVParser("data/people.csv")
data = csv_parser.read_csv()

# Generate RDF
rdf_builder = RDFBuilder()
for row in data:
    rdf_builder.add_person(row['Name'], row['Email'])

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))
  2. Generate RDF from PostgreSQL
from rdf_generator.parsers.sql_parser import PostgreSQLParser
from rdf_generator.rdf_builder import RDFBuilder

# Connect to the database
db_parser = PostgreSQLParser(
    host="localhost",
    database="testdb",
    user="your_username",
    password="your_password"
)

# Fetch data
query = "SELECT name, email FROM people;"
data = db_parser.fetch_data(query)

# Generate RDF
rdf_builder = RDFBuilder()
for row in data:
    rdf_builder.add_person(row['name'], row['email'])

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))
  3. Generate RDF from Websites (Web Scraping)
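from urllib.parse import quote

from rdflib import Literal

from rdf_generator.parsers.web_scraper import WebScraper
from rdf_generator.rdf_builder import RDFBuilder

# Scrape the website
scraper = WebScraper("https://example.com")
data = scraper.extract_data("h1")  # Extract all H1 elements

# Generate RDF
rdf_builder = RDFBuilder()
for item in data:
    # Assumes extract_data returns strings and rdf_builder.ns is an rdflib Namespace.
    # Percent-encode the heading for the subject IRI and store the text as a literal.
    rdf_builder.graph.add((rdf_builder.ns[quote(item)], rdf_builder.ns.title, Literal(item)))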

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))
  4. Generate RDF from Excel Files
from rdf_generator.parsers.excel_parser import ExcelParser
from rdf_generator.rdf_builder import RDFBuilder

# Parse the Excel file
excel_parser = ExcelParser("data/people.xlsx")
data = excel_parser.read_sheet(sheet_name="People")

# Generate RDF
rdf_builder = RDFBuilder()
for row in data:
    rdf_builder.add_person(row['Name'], row['Email'])

# Serialize RDF
print(rdf_builder.serialize(format="turtle"))

Supported Parsers

| Parser | Description |
|--------|-------------|
| CSV | Parses CSV files and extracts data as dictionaries. |
| Excel | Parses Excel (.xls/.xlsx) files and handles multiple sheets. |
| PDF | Extracts text and tables from PDF files. |
| SQL | Fetches data from relational databases like PostgreSQL and MySQL. |
| BEACON | Parses BEACON link dump files for RDF generation. |
| Web | Scrapes websites to extract structured data. |

Serialization Formats

The rdf_generator library supports the following RDF serialization formats:

  • Turtle: rdf_builder.serialize(format="turtle")
  • RDF/XML: rdf_builder.serialize(format="xml")
  • JSON-LD: rdf_builder.serialize(format="json-ld")

Example Dataset Workflow

Here’s an example pipeline to process multiple data sources and generate a combined RDF dataset:

from urllib.parse import quote

from rdflib import Literal

from rdf_generator.parsers.csv_parser import CSVParser
from rdf_generator.parsers.web_scraper import WebScraper
from rdf_generator.rdf_builder import RDFBuilder

rdf_builder = RDFBuilder()

# Parse CSV
csv_parser = CSVParser("data/people.csv")
for row in csv_parser.read_csv():
    rdf_builder.add_person(row['Name'], row['Email'])

# Scrape Website
web_scraper = WebScraper("https://example.com")
titles = web_scraper.extract_data("h1")
for title in titles:
    # Assumes extract_data returns strings and rdf_builder.ns is an rdflib Namespace.
    # Percent-encode the title for the subject IRI and store the text as a literal.
    rdf_builder.graph.add((rdf_builder.ns[quote(title)], rdf_builder.ns.label, Literal(title)))

# Serialize RDF
with open("output.ttl", "w") as f:
    f.write(rdf_builder.serialize(format="turtle"))

Development

Clone the Repository

To contribute or use the library without installation:

git clone https://github.com/yourusername/rdf_generator.git
cd rdf_generator

Install Dependencies

Install dependencies using:

pip install -r requirements.txt

Run Tests

Run unit tests using:

python -m unittest discover -s tests

License

This library is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature-name
  3. Commit your changes: git commit -m "Add feature name"
  4. Push to the branch: git push origin feature-name
  5. Submit a pull request.

Download files

Source Distribution

rdf_generator-0.1.4.tar.gz (9.7 kB view details)

Uploaded Source

Built Distribution

rdf_generator-0.1.4-py3-none-any.whl (10.0 kB view details)

Uploaded Python 3

File details

Details for the file rdf_generator-0.1.4.tar.gz.

File metadata

  • Download URL: rdf_generator-0.1.4.tar.gz
  • Size: 9.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for rdf_generator-0.1.4.tar.gz
Algorithm Hash digest
SHA256 3844e174192ed3c2334bd7e8a24f42573eaaf0eab5e60ed7ae0dbb8bc5f8f5b2
MD5 630c6bd2e1a93299e1f65f69dc702164
BLAKE2b-256 f6c42399bf5b1a4e53f8fd8bee353b1d9280cd4d4a814202b15d0b629fa9c9d1

Provenance

The following attestation bundles were made for rdf_generator-0.1.4.tar.gz:

Publisher: python-publish.yml on judaicalink/rdf_generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file rdf_generator-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: rdf_generator-0.1.4-py3-none-any.whl
  • Size: 10.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for rdf_generator-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 0ce2f2d7aa0b815934b8562cd1dcf338e44fef4f521606285aa711b8b20271b9
MD5 1b658ffff32cf9a39922e59290c6d496
BLAKE2b-256 2843022613ed6fb005c7fddda2d527d7eb3760287b87ee1d992583b4a0398a29

Provenance

The following attestation bundles were made for rdf_generator-0.1.4-py3-none-any.whl:

Publisher: python-publish.yml on judaicalink/rdf_generator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
