Multi-agent system for extracting and processing structured composition-property data from scientific literature

ComProScanner

A comprehensive Python package for extracting composition-property data from scientific articles for building databases

Overview

ComProScanner is a multi-agent framework designed to extract composition-property relationships from scientific articles in materials science. It automates the entire workflow from metadata collection to data extraction, evaluation, and visualization.

Key Features:

  • 📚 Multi-publisher support (Elsevier, Springer, Wiley, IOP, local PDFs)
  • 🤖 Agentic extraction using CrewAI framework
  • 🔍 RAG-powered context retrieval for accurate, cost-effective automation
  • 📊 Comprehensive evaluation and visualization tools
  • 🎯 Customizable extraction workflows
  • 🌐 Knowledge graph generation

Installation

Install from PyPI:

pip install comproscanner

Or install from source:

git clone https://github.com/slimeslab/ComProScanner.git
cd ComProScanner
pip install -e .

Quick Start

Here's a complete example extracting piezoelectric coefficient ($d_{33}$) data:

from comproscanner import ComProScanner

# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Collect metadata
scanner.collect_metadata(
    base_queries=["piezoelectric", "piezoelectricity"],
    extra_queries=["ceramics", "applications"]
)

# Process articles
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "]
}

scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer"]
)

# Extract composition-property data
scanner.extract_composition_property_data(
    main_extraction_keyword="d33"
)
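
The two lists in `property_keywords` suggest two matching modes. The sketch below is one plausible reading, not ComProScanner's internal code: it assumes exact keywords are matched as standalone tokens and substring keywords are matched anywhere in the text (which would explain the spaces around " d 33 "). The helper name `mentions_property` is hypothetical.

```python
import re


def mentions_property(text, exact_keywords, substring_keywords):
    """Hypothetical helper: return True if the text appears to mention the property."""
    # Assumed behaviour: exact keywords must appear as standalone tokens.
    tokens = set(re.findall(r"[A-Za-z0-9_]+", text))
    if any(keyword in tokens for keyword in exact_keywords):
        return True
    # Assumed behaviour: substring keywords may appear anywhere in the raw text,
    # so surrounding spaces in the keyword are significant.
    return any(keyword in text for keyword in substring_keywords)


property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "],
}

for sentence in (
    "The d33 value reached 420 pC/N.",
    "We measured d 33 at room temperature.",
    "Sample d330 was discarded.",
):
    print(sentence, "->", mentions_property(sentence, **property_keywords))
```

Under these assumptions the first two sentences match and the third does not, because "d330" is a different token and contains no spaced " d 33 " substring.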

Workflow

The ComProScanner workflow consists of four main stages:

  1. Metadata Retrieval - Find relevant scientific articles
  2. Article Collection - Extract full text from various publishers
  3. Information Extraction - Use LLM agents to extract structured data
  4. Post Processing & Dataset Creation - Evaluate, clean, and visualize results

Documentation

📖 Full documentation is available at slimeslab.github.io/ComProScanner

Core Capabilities

Supported Publishers

  • Elsevier (via TDM API)
  • Springer Nature (via TDM API)
  • Wiley (via TDM API)
  • IOP Publishing (via SFTP bulk access)
  • Local PDFs (any publication)

Data Extraction

  • Composition-property relationships
  • Material families
  • Synthesis methods and precursors
  • Characterization techniques
  • Synthesis steps

Evaluation Methods

  • Semantic Evaluation - Using semantic similarity measures
  • Agentic Evaluation - LLM-powered contextual analysis
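
To make the idea of semantic evaluation concrete, here is a minimal, dependency-free sketch of scoring an extracted value against a reference with cosine similarity. It is an illustration only: a real semantic evaluator would normally compare embeddings from a language model, whereas this stand-in uses bag-of-words counts. The function names and the 0.8 threshold are assumptions, not part of ComProScanner's API.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two strings share.
    dot = sum(va[term] * vb[term] for term in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def is_match(extracted, reference, threshold=0.8):
    """Hypothetical rule: count an extraction as correct above a similarity threshold."""
    return cosine_similarity(extracted, reference) >= threshold


print(is_match("solid state reaction method", "solid state reaction"))
print(is_match("sol-gel", "hydrothermal"))
```

A similarity threshold lets near-identical phrasings count as matches while unrelated terms score zero; the right threshold depends on the similarity measure in use.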

Visualization

  • Data Visualization
  • Evaluation Visualization

Example Use Cases

Extract Data from Multiple Sources

scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer", "wiley"]
)

Customize RAG Configuration

scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_chat_model="gemini-2.5-pro",
    rag_max_tokens=2048,
    rag_top_k=5
)
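
For readers unfamiliar with what a top-k setting such as `rag_top_k` typically controls, the sketch below shows the general idea: score every text chunk against the query and pass only the k best to the model. This is a toy under stated assumptions, not ComProScanner's retriever. It ranks by shared words where a real RAG pipeline would use vector embeddings, and `top_k_chunks` is a hypothetical name.

```python
def top_k_chunks(query, chunks, k=5):
    """Return up to k chunks that share the most words with the query."""
    query_terms = set(query.lower().split())
    # Score each chunk by the number of query terms it contains.
    scored = [
        (len(query_terms & set(chunk.lower().split())), chunk) for chunk in chunks
    ]
    # Stable sort: chunks with equal scores keep their original order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop chunks with no overlap at all.
    return [chunk for score, chunk in ranked[:k] if score > 0]


chunks = [
    "The d33 coefficient of BaTiO3 ceramics was 190 pC/N.",
    "Samples were sintered at 1300 C for two hours.",
    "Doping raised the piezoelectric d33 coefficient to 350 pC/N.",
]

for chunk in top_k_chunks("piezoelectric d33 coefficient", chunks, k=2):
    print(chunk)
```

A smaller k keeps prompts short and cheap but risks missing relevant passages; a larger k does the opposite.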

Visualize Results

from comproscanner import data_visualizer, eval_visualizer

# Create knowledge graph
data_visualizer.create_knowledge_graph(result_file="results.json")

# Plot evaluation metrics
eval_visualizer.plot_multiple_radar_charts(
    result_sources=["model1.json", "model2.json"],
    model_names=["GPT-4o", "Claude-3.5"]
)

Requirements

  • Python 3.12 or 3.13
  • TDM API keys for desired publishers (Elsevier, Springer, Wiley)
  • LLM API keys (OpenAI, Anthropic, Google, etc.)
  • Optional: Neo4j for knowledge graph visualization
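
A quick pre-flight check along these lines can catch an unsupported interpreter or a missing key before a long run. This is a sketch, not part of the package: the helper names are hypothetical, and `EXAMPLE_TDM_API_KEY` is a placeholder, since the environment variable names ComProScanner actually reads are not stated here.

```python
import os
import sys

# Versions listed under Requirements.
SUPPORTED_PYTHON = {(3, 12), (3, 13)}


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is a supported one."""
    return tuple(version_info[:2]) in SUPPORTED_PYTHON


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    if not python_is_supported():
        print("Warning: this Python version is not listed as supported.")
    # Placeholder name; substitute the variables your setup actually uses.
    for name in missing_env_vars(["EXAMPLE_TDM_API_KEY"]):
        print(f"Missing environment variable: {name}")
```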

Citation

If you use ComProScanner in your research, please cite:

@Article{roy2026comproscannermultiagentbasedframework,
author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
title  ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
journal  ="Digital Discovery",
year  ="2026",
pages  ="Accepted",
publisher  ="RSC",
doi  ="10.1039/D5DD00521C",
url  ="https://doi.org/10.1039/D5DD00521C"
}

Changelog

See the CHANGELOG for details on what has changed in each version.

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Copyright © 2025-2026 SLIMES Lab

Contact

Author: Aritra Roy

Made with ❤️ by SLIMES Lab
