Automated System for Mining Articles (asma)

asma is a modular, extensible Python library designed to automate the ingestion, parsing, and structured extraction of scientific research articles from PDFs using NCBI PMC APIs and local Large Language Models (LLMs).

Features

  • Automated DOI Extraction & Validation: Scans PDFs for DOIs and validates them via Crossref. The PDF processing utilities are lazy-loaded, so they are only imported when this feature is used.
  • NCBI PMC Ingestion: Resolves DOIs to PMCIDs and PubMed IDs, retrying failed requests with exponential backoff.
  • Dual-Purpose BioC Parsing:
    • LLM-optimized Markdown: Strips references and serializes tables to raw CSV layout to conserve tokens and improve accuracy.
    • Human-optimized Markdown: Preserves references and builds clean Markdown tables for easy reading.
  • In-Context Prompt Engineering: Decouples instruction logic from schema definition fields to support dynamic prompting.
  • LLM Provider Agnostic: Interface-driven (LLMProvider) to easily swap between LM Studio, Ollama, OpenAI, or other backends.
  • Automated Validation: Evaluates extraction outputs against ground-truth files and generates Markdown report cards.
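
The exponential backoff mentioned under NCBI PMC Ingestion can be sketched as follows. This is a minimal illustration of the general technique, not asma's actual implementation: `retry_with_backoff`, its parameters, and the fake `flaky_lookup` request are all hypothetical names introduced here.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Base waits of 0.5 s, 1 s, 2 s, ... plus a little random jitter
            # so that parallel clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


# Demo with a fake request (hypothetical) that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "PMC-ID-placeholder"


waits = []  # record the requested delays instead of actually sleeping
result = retry_with_backoff(flaky_lookup, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays. A production client would likely also retry on HTTP 429 and 5xx responses rather than only on `ConnectionError`.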

Repository Structure

├── src/
│   └── asma/              # Main library source code
│       ├── core/          # Markdown parsing and Evaluation engine
│       ├── providers/     # Crossref resolvers, PMC fetchers, LM Studio client
│       ├── utils/         # PDF helpers, XML parsers, text utils
│       └── config.py      # Prompt templates and default schemas
├── tests/                 # Unit test suite
├── run_pipeline.py        # End-to-end command-line orchestrator
├── pipeline.ipynb         # Interactive Jupyter demo notebook
├── pyproject.toml         # Package definition (PEP 621)
└── asma_documentation.md  # Detailed SDK reference & developer guide

Quick Start

1. Installation

Install the package in editable mode:

pip install -e .

To enable local PDF DOI extraction, install the PDF support extra, which pulls in PyMuPDF. Quoting the requirement keeps shells such as zsh from interpreting the square brackets:

pip install "asma[pdf]"
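
As a rough sketch of what lazy-loaded PDF DOI extraction can look like, the example below imports PyMuPDF only inside the function that needs it and pulls DOI-like strings out of text with a regular expression. The names `find_dois`, `extract_pdf_text`, and `DOI_PATTERN` are hypothetical and not taken from asma's API, and the Crossref validation step is not shown.

```python
import re

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_dois(text):
    """Return the DOI-like strings found in text, in order, without duplicates."""
    found = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,;)")  # drop trailing sentence punctuation
        if doi not in found:
            found.append(doi)
    return found


def extract_pdf_text(path):
    """Read all page text from a PDF, importing PyMuPDF only when needed."""
    try:
        import fitz  # PyMuPDF; assumed to come from the optional "pdf" extra
    except ImportError as exc:
        raise RuntimeError('PDF support is missing; run: pip install "asma[pdf]"') from exc
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


# 10.1000/xyz123 is a placeholder DOI used only for illustration.
sample = "See https://doi.org/10.1000/xyz123. Also cited as doi:10.1000/xyz123, again."
dois = find_dois(sample)
```

Because the import sits inside `extract_pdf_text`, code that never touches PDFs works without PyMuPDF installed.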

2. Run the Orchestrator

To run the pipeline end-to-end (requires a model loaded in a running LM Studio local server):

python run_pipeline.py 36374021

3. Detailed Documentation

For a comprehensive guide covering custom schemas, extending providers (like Ollama), streaming controls, and the testing framework, read the Developer Reference Guide (asma_documentation.md in the repository root).
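
To give a feel for what an interface-driven provider looks like, here is a minimal sketch. Only the name `LLMProvider` comes from this README; the `complete` method, `EchoProvider`, and `extract` are assumptions made for illustration, and the real signatures are documented in the Developer Reference Guide.

```python
from typing import Protocol


class LLMProvider(Protocol):
    """Hypothetical interface: asma's real LLMProvider may use other method names."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in backend for testing; a real one would call Ollama, LM Studio, etc."""

    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


def extract(provider: LLMProvider, article_markdown: str, instructions: str) -> str:
    """Build a prompt and delegate to whichever backend was passed in."""
    prompt = f"{instructions}\n\n{article_markdown}"
    return provider.complete(prompt)


answer = extract(EchoProvider(), "# Title", "List the study species.")
```

Since `extract` depends only on the interface, swapping backends means passing a different object; no pipeline code has to change.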

Download files

Source Distribution

asma-0.1.0.tar.gz (14.5 kB)

Uploaded Source

Built Distribution

asma-0.1.0-py3-none-any.whl (16.8 kB)

Uploaded Python 3

File details

Details for the file asma-0.1.0.tar.gz.

File metadata

  • Download URL: asma-0.1.0.tar.gz
  • Size: 14.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for asma-0.1.0.tar.gz
Algorithm Hash digest
SHA256 cad6d17bc40d0ab7d843ef946a494966060f27338b473f9a48b96f6d97a9435a
MD5 bf40f74efb0a32fcb89c6099b7461f8e
BLAKE2b-256 4e6cbd06134fa0b4dc0b742306d3393c51068f23eececb44aa2a56fb2b2ce2a6

File details

Details for the file asma-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: asma-0.1.0-py3-none-any.whl
  • Size: 16.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for asma-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 bf8df0c2612b7397ea672ba5535453f8391173049e71923a5f719535a8b242c1
MD5 3e85d20e6ec2ef13afd797a51ec839fe
BLAKE2b-256 3d3adb5805856c8f8bba9305c6489bf5b0a5611534d3d1697aba36abe3566306
