Skip to main content

Fine-tuning Automation System for LLM-driven Semantic Data Analysis

Project description

FAUST: Fine-tuning Automation System for LLM-driven Semantic Data Analysis

Contributions Contributions Contributions Contributions

Abstract:

Knowledge graph question answering (KGQA) based on large language models (LLMs) has gained significant traction, particularly on large-scale, schema-light datasets. However, existing approaches do not fully address the semantic, structural, and mapping requirements of ontology-based data access (OBDA). This limitation is especially relevant in domains such as cyber-physical systems, where data is semantically rich, heterogeneous, and dynamically changing. Moreover, large cloud-based LLMs, while proven effective in general-purpose QA, may introduce high computational costs and data privacy concerns in such domains. A straightforward alternative is to use locally deployed LLMs; however, without task-specific adaptation, they typically fail to achieve sufficient performance. To address these challenges, we present FAUST, an automated fine-tuning system for semantic data analysis. Given an ontology, FAUST generates OBDA-compliant training datasets consisting of system prompts, natural-language instructions, and corresponding SPARQL queries, enabling efficient fine-tuning of local LLMs for OBDA scenarios. In addition, we introduce the Modular OBDA Architecture (MOA), which integrates LLM-based query generation with an OBDA engine and supports interactive querying over both static and streaming data sources. We evaluate our approach on real-world sensor data in terms of query accuracy, latency, and output correctness. The results show that FAUST-based lightweight LLM fine-tuning enables robust, cost-efficient, and semantically accurate question answering, outperforming (i) raw local LLMs, (ii) prompt-engineering methods, and (iii) cloud-based LLMs.

FAUST Modules:

The logical Data Flow Diagram of FAUST, including external, process, and store elements.

FAUST consists of several modular components for automatic generation of NL-to-SPARQL training datasets from domain ontologies. The framework starts with the KG Maker, which instantiates the ontology using a configurable knowledge graph matrix (KGM) and produces the initial knowledge graphs. Next, the KG Reader queries these graphs and generates reusable ontology elements, such as classes, properties, instances, date ranges, and value samples, used throughout dataset generation.

The Agnostic Module (AM) creates ontology-independent NL/SPARQL pairs using generic RDF/OWL concepts (e.g., classes, instances, properties). The General Module (GM) focuses on ontology-specific conceptual queries derived from competency-question templates, capturing semantic relations between classes and properties. The Domain Module (DM) extends this process with instance-level knowledge, generating realistic OBDA queries involving domain entities, measurements, aggregations, and temporal constraints.

Finally, the Orchestrator coordinates all modules and combines their outputs into complete training and validation datasets, exported in JSON or CSV format for LLM fine-tuning.

Ontology Documentation:

DBC Ontology Specification (w3id.org/dbc-ontology), used for FAUST implementation:

Documentation

FAUST User Guide

Running FAUST on user hardware involves two steps:

  1. Knowledge Graph Matrices (KGMs) Population
  2. FAUST Deployment

1. KGM Population

Signal representation in KGM, including individuals, type, and properties.

KGM presents a tabular format for mapping individuals (instances) with corresponding properties. Each sheet in the document depicts a single class, with the first column reserved for instances, while the remaining ones reflect combined data and object properties. Also, the user is not required to define datatypes for each literal, as this is resolved in the later OBDA mapping phase. An example representation of dbc:Signal instances (individuals) is shown above.

The user is required to populate KVM with their own domain-specific instances and save it to the /KVM folder, as a reference. In the next iteration, KVM can be split into KVM_train and KVM_val, although it is not mandatory.

2. FAUST Deployment

  • After verifying the KGM, check the config file config.yaml, and ensure that the settings reflect project requirements.

  • Deploy the framework:

    python3 FAUST.py
    

Results: Training and Validation datasets are created in the project root directory.

Modular OBDA Architecture

The Modular OBDA Architecture (MOA).

To streamline the development of LLM-driven OBDA systems, we designed MOA, a Modular OBDA Architecture that supports independent, project-specific implementation of individual system components. Its modular design promotes reuse across platforms, systems, and programming environments. For example, the user plane can be implemented as a standalone GUI application, an embedded component, or a web-based frontend, while the architecture supports integration with diverse LLM deployments, including local, institution-hosted, and cloud-based models.

Modular OBDA Architecture (MOA) is organized as a modular multi-layer system comprising, from top to bottom, the user plane, an LLM-based translation layer, the processing unit, an OBDA engine linking the ontology to the system configuration, the mapping layer, a relational database, and the data layer. This structure allows each component to be implemented independently, thereby increasing architectural flexibility and supporting a wide range of deployment scenarios.

MOA Implementation

  • To run the MOA GUI (Linux) Application, set the corresponding Ollama-built LLM model in the App's config section and run:

    python3 MOA_App.py
    
  • The recommendation for the OBDA Engine is the Ontop Endpoint CLI, although the Docker version is also available.

  • An example configuration, including input files, db, jdbc, and a data sample (raw data), is available in the /Ontop folder.

  • To start the Ontop endpoint, run:

    ./ontop endpoint --ontology=input/DBC.ttl --mapping=input/mapping.ttl --properties=input/ontop.properties --port=8080
    

Module Requirements:

  • customtkinter
  • tkinter
  • requests
  • ollama

License

All resources are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

License: CC BY-NC-SA 4.0

Citation

🔴 Correct DOI!

Ivanovic, P., Hranisavljevic, N., & Maleshkova, M. (2026). paitools/FAUST: Fine-tuning Automation System v1.0: Pre-publication Release. Zenodo. https://doi.org/10.5281/zenodo.20083471

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

faust_obda-1.0.0.tar.gz (15.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

faust_obda-1.0.0-py3-none-any.whl (17.3 kB view details)

Uploaded Python 3

File details

Details for the file faust_obda-1.0.0.tar.gz.

File metadata

  • Download URL: faust_obda-1.0.0.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for faust_obda-1.0.0.tar.gz
Algorithm Hash digest
SHA256 d43b2e3bb744056d4597bb744b25328393d2d59490d67c03093d0b8e7a2f0bf2
MD5 e5f54bb52a56d91c95e41136545f262f
BLAKE2b-256 71b0624cb4ff40e6ab0b8e19dcc1c7aab162f01262f2f941185e3a6336800ecc

See more details on using hashes here.

File details

Details for the file faust_obda-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: faust_obda-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 17.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for faust_obda-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4b9a005b5d3d99bd9426c11b4f2d9e426514807e69135fd721ac4f1a49b7cce4
MD5 b4d6612cea7046e4cedfebf4d3ab728d
BLAKE2b-256 2b660325dd4524cb522cf2e7112810fd667ac643e1e41ebc6a2571626f1f6877

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page