Multi-agent system for extracting and processing structured composition-property data from scientific literature
ComProScanner
A comprehensive Python package for extracting composition-property data from scientific articles for building databases
Overview
ComProScanner is a multi-agent framework designed to extract composition-property relationships from scientific articles in materials science. It automates the entire workflow from metadata collection to data extraction, evaluation, and visualization.
Key Features:
- 📚 Multi-publisher support (Elsevier, Springer, Wiley, IOP, local PDFs)
- 🤖 Agentic extraction using CrewAI framework
- 🔍 RAG-powered context retrieval for accurate, cost-effective automation
- 📊 Comprehensive evaluation and visualization tools
- 🎯 Customizable extraction workflows
- 🌐 Knowledge graph generation
Installation
Install from PyPI:
pip install comproscanner
Or install from source:
git clone https://github.com/slimeslab/ComProScanner.git
cd ComProScanner
pip install -e .
Quick Start
Here's a complete example extracting piezoelectric coefficient ($d_{33}$) data:
from comproscanner import ComProScanner

# Initialize scanner
scanner = ComProScanner(main_property_keyword="piezoelectric")

# Collect metadata
scanner.collect_metadata(
    base_queries=["piezoelectric", "piezoelectricity"],
    extra_queries=["ceramics", "applications"]
)

# Process articles
property_keywords = {
    "exact_keywords": ["d33"],
    "substring_keywords": [" d 33 "]
}
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer"]
)

# Extract composition-property data
scanner.extract_composition_property_data(
    main_extraction_keyword="d33"
)
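The `property_keywords` dictionary distinguishes exact keywords from substring keywords. How ComProScanner applies them internally is not described in this README, so the following is only a minimal sketch of the distinction, assuming exact keywords must match whole tokens while substring keywords may match anywhere in the text. The function `mentions_property` is hypothetical and not part of the package.

```python
import re


def mentions_property(text, exact_keywords, substring_keywords):
    """Return True if the text mentions the property (illustrative only)."""
    # Exact keywords: match only whole tokens, so "d33" does not match "d331".
    tokens = set(re.findall(r"[A-Za-z0-9_]+", text))
    if any(keyword in tokens for keyword in exact_keywords):
        return True
    # Substring keywords: match anywhere, including the surrounding spaces,
    # which catches spaced-out forms such as " d 33 ".
    return any(keyword in text for keyword in substring_keywords)
```

Under this assumption, the padded substring `" d 33 "` catches text where subscript formatting was lost during full-text conversion, while the exact keyword avoids false hits on longer identifiers.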
Workflow
The ComProScanner workflow consists of four main stages:
- Metadata Retrieval - Find relevant scientific articles
- Article Collection - Retrieve full text from the supported publishers
- Information Extraction - Use LLM agents to extract structured data
- Post Processing & Dataset Creation - Evaluate, clean, and visualize results
Documentation
📖 Full documentation is available at slimeslab.github.io/ComProScanner
Core Capabilities
Supported Publishers
- Elsevier (via TDM API)
- Springer Nature (via TDM API)
- Wiley (via TDM API)
- IOP Publishing (via SFTP bulk access)
- Local PDFs (any publication)
Data Extraction
- Composition-property relationships
- Material families
- Synthesis methods and precursors
- Characterization techniques
- Synthesis steps
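To make the list above concrete, here is one way such a record could be represented in Python. This is a sketch only: the class `ExtractedRecord`, its field names, and the sample values are all hypothetical placeholders, and the package's actual output schema may differ (see the full documentation).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExtractedRecord:
    """Hypothetical container; field names are NOT the package's schema."""

    doi: str
    property_name: str
    property_unit: str
    # Maps a composition string to its reported property value.
    compositions: dict[str, float] = field(default_factory=dict)
    material_family: str = ""
    synthesis_method: str = ""
    precursors: list[str] = field(default_factory=list)
    characterization_techniques: list[str] = field(default_factory=list)
    synthesis_steps: list[str] = field(default_factory=list)


# Placeholder values for illustration; not taken from any real article.
record = ExtractedRecord(
    doi="10.0000/placeholder",
    property_name="d33",
    property_unit="pC/N",
    compositions={"PlaceholderComposition-A": 1.0},
    material_family="placeholder family",
    synthesis_method="placeholder method",
)

# Serialize to JSON, the format the visualization examples read from.
print(json.dumps(asdict(record), indent=2))
```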
Evaluation Methods
- Semantic Evaluation - Using semantic similarity measures
- Agentic Evaluation - LLM-powered contextual analysis
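The idea behind semantic evaluation can be illustrated with a small similarity check between an extracted value and its reference. The sketch below uses cosine similarity over word counts as a simple stand-in; it is an assumption for illustration, not the measure ComProScanner uses, and the names `cosine_similarity`, `field_matches`, and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def field_matches(extracted: str, reference: str, threshold: float = 0.8) -> bool:
    """Treat an extracted field as correct if it is similar enough to the reference."""
    return cosine_similarity(extracted, reference) >= threshold
```

A similarity threshold lets near-identical phrasings such as "solid state reaction" and "solid state reaction method" count as a match, which strict string equality would reject.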
Visualization
- Data Visualization
- Evaluation Visualization
Example Use Cases
Extract Data from Multiple Sources
scanner.process_articles(
    property_keywords=property_keywords,
    source_list=["elsevier", "springer", "wiley"]
)
Customize RAG Configuration
scanner.extract_composition_property_data(
    main_extraction_keyword="d33",
    rag_chat_model="gemini-2.5-pro",
    rag_max_tokens=2048,
    rag_top_k=5
)
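Assuming `rag_top_k` sets how many retrieved text chunks are passed to the model (the usual meaning of top-k in RAG), the retrieval step can be sketched as below. This is not the package's implementation: the function `top_k_chunks` is hypothetical, and it scores chunks by word overlap with the query where a real pipeline would typically use embedding similarity.

```python
import heapq


def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks sharing the most words with the query."""
    query_tokens = set(query.lower().split())

    def overlap(chunk: str) -> int:
        # Score = number of distinct query words that appear in the chunk.
        return len(query_tokens & set(chunk.lower().split()))

    return heapq.nlargest(k, chunks, key=overlap)
```

A smaller `k` sends less context to the model and lowers token cost; a larger `k` reduces the chance of missing the passage that reports the value.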
Visualize Results
from comproscanner import data_visualizer, eval_visualizer

# Create knowledge graph
data_visualizer.create_knowledge_graph(result_file="results.json")

# Plot evaluation metrics
eval_visualizer.plot_multiple_radar_charts(
    result_sources=["model1.json", "model2.json"],
    model_names=["GPT-4o", "Claude-3.5"]
)
Requirements
- Python 3.12 or 3.13
- TDM API keys for desired publishers (Elsevier, Springer, Wiley)
- LLM API keys (OpenAI, Anthropic, Google, etc.)
- Optional: Neo4j for knowledge graph visualization
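Before installing, the interpreter version can be checked against the requirement above. The helper `python_is_supported` is a small illustrative sketch and not part of the package.

```python
import sys

# Minor versions listed in the requirements above.
SUPPORTED_VERSIONS = {(3, 12), (3, 13)}


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version is one of the supported ones."""
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    status = "supported" if python_is_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```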
Citation
If you use ComProScanner in your research, please cite:
@Article{roy2026comproscannermultiagentbasedframework,
author ="Roy, Aritra and Grisan, Enrico and Buckeridge, John and Gattinoni, Chiara",
title ="ComProScanner: a multi-agent based framework for composition-property structured data extraction from scientific literature",
journal ="Digital Discovery",
year ="2026",
pages ="Accepted",
publisher ="RSC",
doi ="10.1039/D5DD00521C",
url ="https://doi.org/10.1039/D5DD00521C"
}
Changelog
See the CHANGELOG for details on what has changed in each version.
Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Copyright © 2025-2026 SLIMES Lab
Contact
Author: Aritra Roy
- 🌐 Website: aritraroy.live
- 📧 Email: contact@aritraroy.live
- 🐙 GitHub: @aritraroy24
- 𝕏 Twitter: @aritraroy24
Project Links:
- 📦 PyPI: pypi.org/project/comproscanner
- 📖 Documentation: slimeslab.github.io/ComProScanner
- 🐛 Issues: github.com/slimeslab/ComProScanner/issues
Made with ❤️ by SLIMES Lab
File details
Details for the file comproscanner-0.1.6.tar.gz.
File metadata
- Download URL: comproscanner-0.1.6.tar.gz
- Upload date:
- Size: 173.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ceac898b5af518e70d4af1afdbbcc826aa1d62397490a49ec45e3f99b92eddf6 |
| MD5 | 09311ba2c8f9d2c3fddcc9ee2600c9d6 |
| BLAKE2b-256 | 92aa45b83ab433156cba6e946e8e87046461ec96082db5c5314336abcf6e852a |
File details
Details for the file comproscanner-0.1.6-py3-none-any.whl.
File metadata
- Download URL: comproscanner-0.1.6-py3-none-any.whl
- Upload date:
- Size: 194.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eddea52ee199ea044aa2268f916c2944c04fd35ad7b705a3786fd57cf0ead054 |
| MD5 | 22e0abc8c1ebbfcee0b9caeb6c339907 |
| BLAKE2b-256 | f7259b11b7ca4561feefc9022552a6cae0dc72df77d8cb4abfa5c871db71e581 |