From Knowledge Graphs to Machine Learning!

Welcome to the SparkKG-ML Documentation

Welcome to the documentation for SparkKG-ML, a Python library designed to facilitate machine learning with Spark on semantic web and knowledge graph data.

SparkKG-ML is specifically built to bridge the gap between the semantic web data model and the powerful distributed computing capabilities of Apache Spark. By leveraging the flexibility of semantic web and the scalability of Spark, SparkKG-ML empowers you to extract meaningful insights and build robust machine learning models on semantic web and knowledge graph datasets.

The detailed documentation of SparkKG-ML is available online. It serves as a comprehensive guide to understanding and using SparkKG-ML, with explanations of the library's core concepts, step-by-step tutorials to get you started, and code examples illustrating various use cases.

The documentation now also covers SERE (Scalable and Distributed Framework for Unsupervised Embeddings Computation on LaRge-scale KnowlEdge Graphs).

Key Features of SparkKG-ML:

  1. Seamless Integration: SparkKG-ML seamlessly integrates with Apache Spark, providing a unified and efficient platform for distributed machine learning on semantic web and knowledge graph data.

  2. Data Processing: With SparkKG-ML, you can easily preprocess semantic web data, handle missing values, perform feature engineering, and transform your data into a format suitable for machine learning.

  3. Scalable Machine Learning: SparkKG-ML leverages the distributed computing capabilities of Spark to enable scalable and parallel machine learning on large semantic web and knowledge graph datasets.

  4. Advanced Algorithms: SparkKG-ML provides a wide range of machine learning algorithms specifically designed for semantic web and knowledge graph data, allowing you to tackle complex tasks within the context of knowledge graphs and the semantic web.

  5. Extensibility: SparkKG-ML is designed to be easily extended, allowing you to incorporate your own custom algorithms and techniques seamlessly into the library.

We hope this documentation proves to be a valuable resource as you explore the capabilities of SparkKG-ML and embark on your journey of machine learning with Spark on semantic web and knowledge graph data. Happy learning!

Installation Guide

This guide provides step-by-step instructions on how to install the SparkKG-ML library. SparkKG-ML can be installed using pip or by installing from the source code.

Installation via pip:

To install SparkKG-ML using pip, follow these steps:

  1. Open a terminal or command prompt.

  2. Run the following command to install the latest stable version of SparkKG-ML:

       
       pip install sparkkgml
    

This will download and install SparkKG-ML and its dependencies.

  3. Once the installation is complete, you can import SparkKG-ML into your Python projects and start using it for machine learning on semantic web and knowledge graph data.

Installation from source:

To install SparkKG-ML from the source code, follow these steps:

  1. Clone the SparkKG-ML repository from GitHub using the following command:

       git clone https://github.com/IDIASLab/SparkKG-ML
    

This will create a local copy of the SparkKG-ML source code on your machine.

  2. Change into the SparkKG-ML directory created by the clone:

       cd SparkKG-ML
    
  3. Run the following command to install SparkKG-ML and its dependencies:

       pip install .
    

This will install SparkKG-ML using the source code in the current directory.

  4. Once the installation is complete, you can import SparkKG-ML into your Python projects and start using it for machine learning on semantic web and knowledge graph data.

Congratulations! You have successfully installed the SparkKG-ML library. You are now ready to explore the capabilities of SparkKG-ML and leverage its machine learning functionalities.
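
To confirm that either installation route worked, you can ask Python whether the distribution is visible in the current environment. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is our own and is not part of SparkKG-ML.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "sparkkgml" is the distribution name used in the pip command above.
    version = installed_version("sparkkgml")
    if version is None:
        print("sparkkgml is not installed in this environment")
    else:
        print(f"sparkkgml {version} is installed")
```

If the check reports that the package is missing, make sure the terminal where you ran pip uses the same Python environment as the one running the check.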

For more details on how to use SparkKG-ML, please refer to the documentation.

Getting Started

Let's start with a basic example: we will retrieve data from a SPARQL endpoint and convert it into a Spark DataFrame using the getDataFrame function.

        # Import the required libraries
        from sparkkgml.data_acquisition import DataAcquisition
        
        # Create an instance of DataAcquisition 
        DataAcquisitionObject = DataAcquisition()

        # Specify the SPARQL endpoint and query
        endpoint = "https://recipekg.arcc.albany.edu/RecipeKG"
        query ="""
            PREFIX schema: <https://schema.org/>
            PREFIX recipeKG:<http://purl.org/recipekg/>
            SELECT  ?recipe
            WHERE { ?recipe a schema:Recipe. }
            LIMIT 3
            """

        # Retrieve the data as a Spark DataFrame
        spark_df = DataAcquisitionObject.getDataFrame(endpoint=endpoint, query=query)
        spark_df.show()
+------------------------------------------+
| recipe                                   |
+==========================================+
| recipeKG:recipe/peanut-butter-tandy-bars |
+------------------------------------------+
| recipeKG:recipe/the-best-oatmeal-cookies |
+------------------------------------------+
| recipeKG:recipe/peach-cobbler-ii         |
+------------------------------------------+

The getDataFrame function will query the data from the specified SPARQL endpoint and return a Spark DataFrame that you can use for further analysis or machine learning tasks.
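
To give an intuition for what turning query results into tabular rows involves, the sketch below flattens a document in the standard SPARQL 1.1 Query Results JSON format into column names and row tuples. It is an illustration only and does not claim to be how getDataFrame is implemented; the helper name `bindings_to_rows` is hypothetical, and the sample document is hand-written (it expands the `recipeKG:` prefix from the query above) rather than an actual endpoint response.

```python
import json


def bindings_to_rows(sparql_json: str):
    """Flatten a SPARQL 1.1 JSON result document into (columns, rows)."""
    doc = json.loads(sparql_json)
    columns = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given solution; represent it as None.
        rows.append(
            tuple(binding[var]["value"] if var in binding else None for var in columns)
        )
    return columns, rows


# Hand-written sample in the standard result format (not real endpoint output).
sample = json.dumps({
    "head": {"vars": ["recipe"]},
    "results": {"bindings": [
        {"recipe": {"type": "uri",
                    "value": "http://purl.org/recipekg/recipe/peanut-butter-tandy-bars"}},
        {"recipe": {"type": "uri",
                    "value": "http://purl.org/recipekg/recipe/peach-cobbler-ii"}},
    ]},
})

columns, rows = bindings_to_rows(sample)
print(columns)
for row in rows:
    print(row)
```

Rows in this shape could then be handed to `SparkSession.createDataFrame(rows, columns)` to obtain a Spark DataFrame.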

SERE

SERE is a scalable and distributed embedding framework designed for large-scale knowledge graphs (KGs), leveraging the distributed computing capabilities of Apache Spark. The framework extracts walks over a KG and then creates embeddings from these walks. Both steps are fully implemented in Spark, so the resulting embeddings are ready for integration into Machine Learning (ML) pipelines.

KGs store RDF data in a graph format, where entities are linked by relations. To compute RDF data embeddings, the graph representation is converted into sequences of entities. These sequences are processed by neural language models, such as Word2Vec, treating them like sentences composed of words. This allows the model to represent each entity in the RDF graph as a vector of numerical values in a latent feature space.
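
The conversion from graph to entity sequences can be sketched in plain Python. The example below draws uniform random walks over a toy set of triples; it is a single-machine illustration of the general idea under our own assumptions, not SERE's distributed implementation, and the function names, the `ex:` identifiers, and the parameters `depth` and `walks_per_entity` are all hypothetical.

```python
import random
from collections import defaultdict


def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return adjacency


def random_walks(triples, start, depth, walks_per_entity, seed=0):
    """Draw uniform random walks of at most `depth` hops starting at `start`."""
    rng = random.Random(seed)
    adjacency = build_adjacency(triples)
    walks = []
    for _ in range(walks_per_entity):
        walk = [start]
        current = start
        for _ in range(depth):
            edges = adjacency.get(current)
            if not edges:
                break  # dead end: the current entity has no outgoing edges
            predicate, current = rng.choice(edges)
            walk.extend([predicate, current])
        walks.append(walk)
    return walks


# Toy graph with made-up identifiers, for illustration only.
triples = [
    ("ex:cake", "ex:hasIngredient", "ex:flour"),
    ("ex:cake", "ex:hasIngredient", "ex:sugar"),
    ("ex:flour", "ex:madeFrom", "ex:wheat"),
]

for walk in random_walks(triples, "ex:cake", depth=2, walks_per_entity=3):
    print(" -> ".join(walk))
```

Each walk is a list of tokens that plays the role of a sentence. A Word2Vec-style model, for example `pyspark.ml.feature.Word2Vec`, which takes a column of string arrays as input, can then learn one vector per token.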

SERE makes it possible to compute embeddings over very large KGs in scenarios where this was previously not feasible, thanks to a significantly lower runtime and reduced memory requirements. SERE is open-source, well-documented, and fully integrated into the SparkKG-ML Python library, which offers end-to-end ML pipelines over semantic data stored in KGs directly in Python.

Check out the documentation for more details.

Evaluation

To evaluate SparkKG-ML, we investigated various factors to understand their impact on overall runtime performance, comparing the results with an existing framework. The experiments folder contains detailed information for those interested, covering not only the general analysis but also the performance of our library's additional features, including both runtime efficiency and overall performance metrics.

License

SparkKG-ML was created by IDIAS Lab. It is licensed under the terms of the Apache License 2.0.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grants No. CRII:III-1850097 and ECCS-1737443. This work was supported in part by Oracle Cloud credits and related resources provided by the Oracle for Research program.
