
Ashmatics FDA Pipeline

Version: 0.1.0 | Last Updated: 2025-11-29

Copyright 2025 Asher Informatics PBC - Proprietary and Confidential


FDA 510(k) and De Novo document extraction, enrichment, and structuring pipeline for the Ashmatics Knowledge Base platform.

Overview

The Ashmatics FDA Pipeline provides a comprehensive solution for extracting structured metadata from FDA regulatory documents (510(k) summaries and De Novo decision summaries). It transforms unstructured PDF documents into MongoDB-ready structured documents suitable for AI/ML analysis and knowledge base integration.

Key Features

  • PDF Parsing: High-quality document parsing using Docling
  • Metadata Extraction: Regex-based extraction with LLM validation
  • Table Processing: Multi-page table consolidation and classification
  • Predicate Extraction: Multi-source predicate device identification
  • AI/ML Data Extraction: Training data and performance metrics extraction
  • Domain Knowledge: Product code-aware extraction with confidence scoring
  • Batch Processing: Concurrent document processing with structured output
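
The regex-based extraction feature can be illustrated with a minimal sketch. The patterns and function below are illustrative assumptions for this example, not the pipeline's actual regexes or API:

```python
import re

# FDA 510(k) numbers look like "K" followed by six digits (e.g. K123456);
# De Novo numbers use "DEN" followed by six digits (e.g. DEN200001).
# These patterns are illustrative, not the pipeline's actual regexes.
K_NUMBER_RE = re.compile(r"\bK\d{6}\b")
DEN_NUMBER_RE = re.compile(r"\bDEN\d{6}\b")

def extract_document_numbers(text: str) -> dict[str, list[str]]:
    """Return all 510(k) and De Novo numbers found in free text."""
    return {
        "k_numbers": K_NUMBER_RE.findall(text),
        "den_numbers": DEN_NUMBER_RE.findall(text),
    }

sample = "Predicate device K123456 was cleared; see also DEN200001."
print(extract_document_numbers(sample))
```

In the real pipeline these regex hits are then passed through LLM validation, as noted above.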

Installation

Prerequisites

  • Python 3.12+
  • uv package manager (recommended)
  • Access to JFK-Ashmatics private repositories

Install from GitHub

# Using uv (recommended)
uv add "ashmatics-fda-pipeline @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"

# With optional dependencies
uv add "ashmatics-fda-pipeline[all] @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"

Development Installation

git clone https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git
cd ashmatics-fda-pipeline
uv sync --all-extras

Quick Start

CLI Usage

# Process a batch of PDFs
fda-pipeline process /path/to/pdfs --output /path/to/output

# Process a single document
fda-pipeline process-single /path/to/K123456.pdf

# Show version
fda-pipeline version

Python API

import asyncio
from pathlib import Path

from ashmatics_fda_pipeline import FDA510kPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    enable_llm_validation=True,
    enable_performance_extraction=True,
    llm_provider="azure_openai",
)

# Create pipeline
pipeline = FDA510kPipeline(config)

async def main() -> None:
    # Process a single document
    result = await pipeline.process_single(Path("K123456.pdf"))

    print(f"K-Number: {result.k_number}")
    print(f"Manufacturer: {result.metadata['manufacturer']}")

    # Process a batch
    results = await pipeline.process_batch([Path("K123456.pdf"), Path("K234567.pdf")])

asyncio.run(main())
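
Since the process methods are coroutines, batch processing is naturally concurrency-bounded. The sketch below shows the semaphore pattern behind a `max_concurrent`-style limit, with a stub worker standing in for the real pipeline (which is not importable outside the private repositories):

```python
import asyncio
from pathlib import Path

async def process_single_stub(path: Path) -> str:
    # Stand-in for pipeline.process_single(); the real call parses the PDF.
    await asyncio.sleep(0)
    return path.stem  # e.g. "K123456"

async def process_batch_stub(paths: list[Path], max_concurrent: int = 3) -> list[str]:
    # Bound concurrency with a semaphore, mirroring PipelineConfig.max_concurrent.
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(path: Path) -> str:
        async with sem:
            return await process_single_stub(path)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(worker(p) for p in paths))

results = asyncio.run(process_batch_stub([Path("K123456.pdf"), Path("K234567.pdf")]))
print(results)
```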

Architecture

ashmatics_fda_pipeline/
├── __init__.py          # Public API exports
├── config.py            # PipelineConfig dataclass
├── pipeline.py          # FDA510kPipeline main class
├── pipeline_registry.py # Factory pattern for pipelines
├── cli.py               # Typer CLI entry point
│
├── extractors/          # Metadata extraction
│   ├── base.py          # DocumentExtractor ABC
│   ├── metadata_extractor.py  # FDA510kExtractor
│   └── llm_validator.py       # LLM validation
│
├── enrichers/           # Content enrichment
│   ├── table_classifier.py       # Table classification
│   ├── table_consolidator.py     # Multi-page table merge
│   ├── predicate_extractor.py    # Predicate device extraction
│   ├── training_data_extractor.py    # AI/ML training data
│   └── performance_data_extractor.py # Validation results
│
├── mappers/             # Schema mapping
│   ├── base.py          # DocumentMapper ABC
│   └── document_mapper.py # RegulatoryDocumentMapper
│
├── storage/             # Output management
│   └── batch_output.py  # BatchOutputManager
│
└── domain_knowledge/    # FDA domain patterns
    ├── __init__.py      # DomainKnowledge, DocumentPatternLoader
    ├── document_patterns.py
    ├── ai_device_expectations.yaml
    ├── 510k_summary_document_patterns.yaml
    └── de_novo_document_patterns.yaml
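
pipeline_registry.py implements a factory pattern for pipelines. A minimal sketch of how such a registry might work (all names here are illustrative, not the module's actual API):

```python
# Minimal factory/registry sketch; names are illustrative, not the actual API.
_PIPELINES: dict[str, type] = {}

def register_pipeline(name: str):
    """Class decorator that registers a pipeline class under a string key."""
    def decorator(cls: type) -> type:
        _PIPELINES[name] = cls
        return cls
    return decorator

def create_pipeline(name: str, *args, **kwargs):
    """Factory: look up a registered pipeline class and instantiate it."""
    try:
        return _PIPELINES[name](*args, **kwargs)
    except KeyError:
        raise ValueError(f"Unknown pipeline: {name!r}") from None

@register_pipeline("510k")
class Example510kPipeline:
    def __init__(self, config=None):
        self.config = config

pipeline = create_pipeline("510k")
print(type(pipeline).__name__)
```

A registry like this lets the CLI select a pipeline (510(k) vs. De Novo) from a string without hard-coding class imports.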

Configuration

Environment Variables

# Azure OpenAI (default LLM provider)
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o

# OpenAI (alternative)
OPENAI_API_KEY=your-api-key

# Anthropic (alternative)
ANTHROPIC_API_KEY=your-api-key

PipelineConfig Options

@dataclass
class PipelineConfig:
    # LLM configuration
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    enable_performance_extraction: bool = True

    # Processing
    max_concurrent: int = 3
    batch_size: int = 5

    # Output
    enable_batch_output: bool = True
    write_markdown: bool = False
    save_figures: bool = True
    save_tables: bool = True
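
Because PipelineConfig is a dataclass, variant configurations can be derived from the defaults with `dataclasses.replace`. The sketch below uses a simplified stand-in class so it runs standalone:

```python
from dataclasses import dataclass, replace

# Simplified stand-in for PipelineConfig so this example is self-contained.
@dataclass(frozen=True)
class ExampleConfig:
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    max_concurrent: int = 3

default = ExampleConfig()
# Derive an offline variant: same defaults, LLM validation disabled.
offline = replace(default, enable_llm_validation=False)
print(offline)
```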

Output Structure

When enable_batch_output=True, the pipeline writes a structured output directory:

batch-YYYYMMDD-HHMMSS/
├── manifest.json          # Batch metadata and processing summary
├── K123456/
│   ├── K123456_parsed.md  # Parsed markdown
│   ├── K123456_metadata.json
│   ├── K123456_mongo_doc.json
│   ├── figures/
│   │   ├── figure_1.png
│   │   └── figure_metadata.json
│   └── tables/
│       ├── table_1.md
│       ├── table_1.json
│       └── table_classifications.json
└── K234567/
    └── ...
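
A sketch of how the timestamped batch directory and manifest.json could be produced; the manifest fields shown are illustrative, not the pipeline's actual schema:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def create_batch_dir(root: Path) -> Path:
    """Create a batch-YYYYMMDD-HHMMSS directory under root."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    batch_dir = root / f"batch-{stamp}"
    batch_dir.mkdir(parents=True)
    return batch_dir

def write_manifest(batch_dir: Path, documents: list[str]) -> Path:
    # Manifest fields here are illustrative, not the actual schema.
    manifest = {"batch": batch_dir.name, "documents": documents, "count": len(documents)}
    path = batch_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path

root = Path(tempfile.mkdtemp())
batch = create_batch_dir(root)
manifest_path = write_manifest(batch, ["K123456", "K234567"])
print(manifest_path.name)
```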

Dependencies

Core Ashmatics Packages

  • ashmatics-tools: Base utilities, parsers, LLM clients
  • ashmatics-datamodels: Shared Pydantic data models

Optional Dependencies

# LLM enrichment (Azure OpenAI, OpenAI, Anthropic)
uv add "ashmatics-fda-pipeline[llm]"

# Azure storage support
uv add "ashmatics-fda-pipeline[azure]"

# MongoDB support
uv add "ashmatics-fda-pipeline[mongodb]"

# All optional dependencies
uv add "ashmatics-fda-pipeline[all]"

Development

Running Tests

# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=ashmatics_fda_pipeline

# Run specific test
uv run pytest tests/unit/test_extractors.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check .

# Type check
uv run mypy src/

License

Copyright 2025 Asher Informatics PBC. All rights reserved.

This software is proprietary and confidential. Unauthorized copying, modification, distribution, or use is strictly prohibited.

See LICENSE for details.

Support

For licensing inquiries: legal@asherinformatics.com

For technical support: engineering@asherinformatics.com

Project details

Download files

Source distribution: ashmatics_fda_pipeline-0.1.1.tar.gz (107.4 kB)

  • SHA256: 48f56ed050e2f8a08d93c05db7bbbabd1a9a5deb68f5a7d193b2c5deafa3e730
  • MD5: c455cfcb54ab3b8b33670bd30c0b697b
  • BLAKE2b-256: c2f22f1efdd31d281ddc0e86ca7204d73913e7934992bc910716d4ba7e272947
  • Uploaded via Trusted Publishing (twine/6.1.0, CPython/3.13.12)

Built distribution: ashmatics_fda_pipeline-0.1.1-py3-none-any.whl (121.3 kB)

  • SHA256: ebb913804362629d88e8bd34701d48c0ac297248c1ffd19af0e1776575694b19
  • MD5: 878a6e77d11456db642b54936ad801cb
  • BLAKE2b-256: 119bd5b938359e357dd4921e03d1a134362181c7eb3df19fe0aabf0811ad479b

Provenance

Attestation bundles for both files were published by publish.yml on AshMatics/ashmatics-fda-pipeline. Attestation values reflect the state when the release was signed and may no longer be current.
