Automated System for Mining Articles (asma)
asma is a modular, extensible Python library designed to automate the ingestion, parsing, and structured extraction of scientific research articles from PDFs using NCBI PMC APIs and local Large Language Models (LLMs).
Features
- Automated DOI Extraction & Validation: Lazy-loads PDF processing utilities to scan and validate DOIs via Crossref.
- NCBI PMC Ingestion: Resolves DOIs to PMCIDs and PubMed IDs with robust exponential backoff.
- Dual-Purpose BioC Parsing:
  - LLM-optimized Markdown: Strips references and serializes tables to raw CSV layout to conserve tokens and improve accuracy.
  - Human-optimized Markdown: Preserves references and builds clean Markdown tables for easy reading.
- In-Context Prompt Engineering: Decouples instruction logic from schema definition fields to support dynamic prompting.
- LLM Provider Agnostic: Interface-driven (LLMProvider), so you can swap between LM Studio, Ollama, OpenAI, or other backends.
- Automated Validation: Evaluates extraction outputs against ground-truth files and generates Markdown report cards.
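The "exponential backoff" mentioned for PMC ingestion can be pictured with a generic retry helper. This is only a sketch of the technique, not asma's actual code: the name `retry_with_backoff`, its parameters, and the choice of which exceptions count as transient are all assumptions made for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, doubling the wait after each transient failure.

    Hypothetical helper: the delay is base_delay * 2**attempt, capped at
    max_delay. The last failure is re-raised once attempts run out.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("loop always returns or raises")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A production version would usually add random jitter to the delay so that many clients do not retry in lockstep.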
Repository Structure
├── src/
│ └── asma/ # Main library source code
│ ├── core/ # Markdown parsing and Evaluation engine
│ ├── providers/ # Crossref resolvers, PMC fetchers, LM Studio client
│ ├── utils/ # PDF helpers, XML parsers, text utils
│ └── config.py # Prompt templates and default schemas
├── tests/ # Unit test suite
├── run_pipeline.py # End-to-end command-line orchestrator
├── pipeline.ipynb # Interactive Jupyter demo notebook
├── pyproject.toml # Package definition (PEP-621)
└── asma_documentation.md # Detailed SDK reference & developer guide
Quick Start
1. Installation
Install the package in editable mode:
pip install -e .
To enable local PDF DOI extraction, install the PDF support extras (installs PyMuPDF):
pip install "asma[pdf]"
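To illustrate why PDF support is an optional extra, here is a minimal sketch of lazy-loading PyMuPDF and scanning text for DOI candidates. The function names (`find_dois`, `extract_dois_from_pdf`), the regular expression, and the `max_pages` default are assumptions for illustration, not asma's API. Crossref validation is not shown.

```python
import re
from typing import List

# Heuristic for modern DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dois(text: str) -> List[str]:
    """Return unique DOI candidates in order of first appearance."""
    seen: List[str] = []
    for match in DOI_PATTERN.findall(text):
        # Heuristic: trailing punctuation usually belongs to the sentence.
        doi = match.rstrip(".,;)")
        if doi not in seen:
            seen.append(doi)
    return seen


def extract_dois_from_pdf(path: str, max_pages: int = 2) -> List[str]:
    """Scan the first pages of a PDF, importing PyMuPDF only when called."""
    try:
        import fitz  # PyMuPDF; deferred so the base install works without it
    except ImportError as exc:
        raise RuntimeError(
            "PDF support is not installed; try: pip install 'asma[pdf]'"
        ) from exc
    with fitz.open(path) as doc:
        last = min(max_pages, doc.page_count)
        pages = [doc[i].get_text() for i in range(last)]
    return find_dois("\n".join(pages))
```

Because the import sits inside the function, `find_dois` works on plain text even when PyMuPDF is absent. Only the PDF path needs the extra.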
2. Run the Orchestrator
To run the pipeline end-to-end (requires a model loaded and served locally by LM Studio):
python run_pipeline.py 36374021
3. Detailed Documentation
For a comprehensive guide covering custom schemas, extending providers (like Ollama), streaming controls, and the testing framework, read the Developer Reference Guide.
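As a rough picture of what extending providers could look like, the sketch below defines a provider interface and a stand-in backend. Only the name LLMProvider comes from this README. The `complete` method, `EchoProvider`, `build_prompt`, and `extract` are hypothetical, and the real interface may differ, so check the reference guide for the actual signatures.

```python
import json
from typing import Dict, List, Protocol


class LLMProvider(Protocol):
    """Assumed shape of the provider interface (illustrative only)."""

    def complete(self, prompt: str) -> str:
        ...


class EchoProvider:
    """Stand-in backend: returns canned JSON instead of calling a model."""

    def __init__(self, canned: str) -> None:
        self.canned = canned
        self.prompts: List[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)  # record the prompt for inspection
        return self.canned


def build_prompt(instructions: str, schema: Dict[str, str], article_md: str) -> str:
    """Keep instruction text separate from the schema fields it refers to."""
    fields = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return f"{instructions}\n\nFields to extract:\n{fields}\n\nArticle:\n{article_md}"


def extract(
    provider: LLMProvider,
    instructions: str,
    schema: Dict[str, str],
    article_md: str,
) -> dict:
    """Send the assembled prompt to any backend and parse its JSON reply."""
    return json.loads(provider.complete(build_prompt(instructions, schema, article_md)))
```

Under these assumptions, an Ollama or OpenAI backend would be another class with the same `complete` method. `extract` never needs to know which one it received.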
File details
Details for the file asma-0.1.0.tar.gz.
File metadata
- Download URL: asma-0.1.0.tar.gz
- Upload date:
- Size: 14.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cad6d17bc40d0ab7d843ef946a494966060f27338b473f9a48b96f6d97a9435a |
| MD5 | bf40f74efb0a32fcb89c6099b7461f8e |
| BLAKE2b-256 | 4e6cbd06134fa0b4dc0b742306d3393c51068f23eececb44aa2a56fb2b2ce2a6 |
File details
Details for the file asma-0.1.0-py3-none-any.whl.
File metadata
- Download URL: asma-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf8df0c2612b7397ea672ba5535453f8391173049e71923a5f719535a8b242c1 |
| MD5 | 3e85d20e6ec2ef13afd797a51ec839fe |
| BLAKE2b-256 | 3d3adb5805856c8f8bba9305c6489bf5b0a5611534d3d1697aba36abe3566306 |