A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow

Schema-Miner is an open-source framework for scientific schema mining. It combines Large Language Models (LLMs) with human-in-the-loop refinement to extract and semantically ground schema properties from unstructured text. Schema-Miner Pro extends this framework with an automated ontology-grounding component that aligns the schema with formal ontologies (e.g., QUDT). Documentation and usage guides are available at schema-miner.readthedocs.io.

🧪 Installation

Install the package directly from PyPI using pip:

pip install schema-miner

If you are working with the source code directly, install dependencies from requirements.txt:

git clone https://github.com/sciknoworg/schema-miner.git
cd schema-miner
pip install -r requirements.txt
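
After either install route, it can help to confirm that the package is visible in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of schema-miner. It looks up the distribution by its PyPI name, so it reports nothing for a source checkout where only requirements.txt was installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "schema-miner"):
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of the schema-miner API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("schema-miner is not installed in this environment")
    else:
        print(f"schema-miner {found} is installed")
```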

⚙️ System Requirements

Running with OpenAI models (e.g., GPT-4o, GPT-4-turbo) requires no special hardware beyond a basic system with internet access, since inference is API-based. For open-source models (e.g., Llama 3.1 8B), local execution is possible on CPU but slow; for practical performance, a GPU with sufficient VRAM (per model specifications) is strongly recommended.
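
The choice described above can be sketched as a small environment check. This is a heuristic under stated assumptions, not schema-miner's own logic: `OPENAI_API_KEY` is the variable the official OpenAI Python client reads by default, but whether schema-miner reads it the same way is an assumption; finding `nvidia-smi` on the PATH only suggests an NVIDIA GPU is present and says nothing about available VRAM. The function name and the returned labels are hypothetical.

```python
import os
import shutil


def suggest_backend(env=None, which=shutil.which):
    """Heuristically suggest where inference could run.

    Returns one of the hypothetical labels "openai-api", "local-gpu",
    or "local-cpu". Not part of the schema-miner API.
    """
    env = os.environ if env is None else env
    # Assumption: an API key in the environment means API-based inference is intended.
    if env.get("OPENAI_API_KEY"):
        return "openai-api"
    # Assumption: nvidia-smi on PATH indicates a usable NVIDIA GPU (VRAM not checked).
    if which("nvidia-smi") is not None:
        return "local-gpu"
    # Fallback: local CPU execution, which is possible but slow for open-source models.
    return "local-cpu"


if __name__ == "__main__":
    print(f"Suggested backend: {suggest_backend()}")
```

Passing `env` and `which` as parameters keeps the check easy to exercise without changing the real environment.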

For more details, please check the documentation: https://schema-miner.readthedocs.io/en/latest/.

🚀 Quick Start

For a quick start, see the provided example notebooks, which walk through the overall schema-miner workflows.

📚 Citing this Work

If you use this repository in your research or applications, please cite the appropriate paper(s):

  • Schema-Miner (schema discovery/mining only):

    Sameer Sadruddin, Jennifer D’Souza, Eleni Poupaki, Alex Watkins, Hamed Babaei Giglou, Anisa Rula, Bora Karasulu, Sören Auer, Adrie Mackus, and Erwin Kessels. LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models. In The Semantic Web – ESWC 2025, Springer, Cham, pp. 244–261. https://doi.org/10.1007/978-3-031-94578-6_14

    📌 BibTeX

    @InProceedings{10.1007/978-3-031-94578-6_14,
      author    = {Sadruddin, Sameer and D'Souza, Jennifer and Poupaki, Eleni and Watkins, Alex and Babaei Giglou, Hamed and Rula, Anisa and Karasulu, Bora and Auer, S{\"o}ren and Mackus, Adrie and Kessels, Erwin},
      editor    = {Curry, Edward and Acosta, Maribel and Poveda-Villal{\'o}n, Maria and van Erp, Marieke and Ojo, Adegboyega and Hose, Katja and Shimizu, Cogan and Lisena, Pasquale},
      title     = {LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models},
      booktitle = {The Semantic Web},
      year      = {2025},
      publisher = {Springer Nature Switzerland},
      address   = {Cham},
      pages     = {244--261},
      isbn      = {978-3-031-94578-6},
    }
    
  • Schema-Miner Pro (schema mining with ontology grounding, e.g., QUDT):

    Sameer Sadruddin, Jennifer D’Souza, Eleni Poupaki, Alex Watkins, Bora Karasulu, Sören Auer, Adrie Mackus, and Erwin Kessels. SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow. In Semantic Web Journal. https://www.semantic-web-journal.net/system/files/swj3871.pdf

    📌 BibTeX

    @Article{sadruddin2025schemaminerpro,
      author    = {Sadruddin, Sameer and D'Souza, Jennifer and Poupaki, Eleni and Watkins, Alex and Karasulu, Bora and Auer, S{\"o}ren and Mackus, Adrie and Kessels, Erwin},
      title     = {SCHEMA-MINERpro: Agentic AI for Ontology Grounding over LLM-Discovered Scientific Schemas in a Human-in-the-Loop Workflow},
      journal   = {Semantic Web Journal},
      year      = {2025},
    }
    

👥 Contact & Contributions

We’d love to hear from you! Whether you're interested in collaborating on Schema-Miner Pro or have ideas to extend its capabilities, feel free to reach out:

  • Collaboration inquiries: Contact Jennifer D'Souza at jennifer.dsouza [at] tib.eu

  • Development questions or bug reports: Please open an issue in the repository or get in touch with the lead developer, Sameer Sadruddin, at sameer.sadruddin [at] tib.eu

Let’s build better schema-mining tools—together!

📃 License

This work is licensed under the MIT License.

🔗 Links

Source Code: https://github.com/sciknoworg/schema-miner

Documentation: https://schema-miner.readthedocs.io/en/latest/

Issues: https://github.com/sciknoworg/schema-miner/issues
