
Ashmatics FDA Pipeline

Version: 0.1.0 | Last Updated: 2025-11-29

Copyright 2025 Asher Informatics PBC - Proprietary and Confidential


FDA 510(k) and De Novo document extraction, enrichment, and structuring pipeline for the Ashmatics Knowledge Base platform.

Overview

The Ashmatics FDA Pipeline provides a comprehensive solution for extracting structured metadata from FDA regulatory documents (510(k) summaries and De Novo decision summaries). It transforms unstructured PDF documents into MongoDB-ready structured documents suitable for AI/ML analysis and knowledge base integration.

Key Features

  • PDF Parsing: High-quality document parsing using Docling
  • Metadata Extraction: Regex-based extraction with LLM validation
  • Table Processing: Multi-page table consolidation and classification
  • Predicate Extraction: Multi-source predicate device identification
  • AI/ML Data Extraction: Training data and performance metrics extraction
  • Domain Knowledge: Product code-aware extraction with confidence scoring
  • Batch Processing: Concurrent document processing with structured output
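
The regex-based extraction feature can be illustrated with a minimal sketch. The patterns and function below are illustrative assumptions for this example, not the pipeline's actual regexes or API:

```python
import re

# FDA 510(k) numbers look like "K" followed by six digits (e.g. K123456);
# De Novo numbers use "DEN" followed by six digits (e.g. DEN200001).
# These patterns are illustrative, not the pipeline's actual regexes.
K_NUMBER_RE = re.compile(r"\bK\d{6}\b")
DEN_NUMBER_RE = re.compile(r"\bDEN\d{6}\b")

def extract_document_numbers(text: str) -> dict[str, list[str]]:
    """Return all 510(k) and De Novo numbers found in free text."""
    return {
        "k_numbers": K_NUMBER_RE.findall(text),
        "den_numbers": DEN_NUMBER_RE.findall(text),
    }

sample = "Predicate device K123456 was cleared; see also DEN200001."
print(extract_document_numbers(sample))
```

In the real pipeline these regex hits are then passed through LLM validation, as noted above.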

Installation

Prerequisites

  • Python 3.12+
  • uv package manager (recommended)
  • Access to JFK-Ashmatics private repositories

Install from GitHub

# Using uv (recommended)
uv add "ashmatics-fda-pipeline @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"

# With optional dependencies
uv add "ashmatics-fda-pipeline[all] @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"

Development Installation

git clone https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git
cd ashmatics-fda-pipeline
uv sync --all-extras

Quick Start

CLI Usage

# Process a batch of PDFs
fda-pipeline process /path/to/pdfs --output /path/to/output

# Process a single document
fda-pipeline process-single /path/to/K123456.pdf

# Show version
fda-pipeline version

Python API

import asyncio
from pathlib import Path

from ashmatics_fda_pipeline import FDA510kPipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    enable_llm_validation=True,
    enable_performance_extraction=True,
    llm_provider="azure_openai",
)

# Create pipeline
pipeline = FDA510kPipeline(config)

async def main() -> None:
    # Process a single document
    result = await pipeline.process_single(Path("K123456.pdf"))

    print(f"K-Number: {result.k_number}")
    print(f"Manufacturer: {result.metadata['manufacturer']}")

    # Process a batch
    results = await pipeline.process_batch([Path("K123456.pdf"), Path("K234567.pdf")])

asyncio.run(main())
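
Since the process methods are coroutines, batch processing is naturally concurrency-bounded. The sketch below shows the semaphore pattern behind a `max_concurrent`-style limit, with a stub worker standing in for the real pipeline (which is not importable outside the private repositories):

```python
import asyncio
from pathlib import Path

async def process_single_stub(path: Path) -> str:
    # Stand-in for pipeline.process_single(); the real call parses the PDF.
    await asyncio.sleep(0)
    return path.stem  # e.g. "K123456"

async def process_batch_stub(paths: list[Path], max_concurrent: int = 3) -> list[str]:
    # Bound concurrency with a semaphore, mirroring PipelineConfig.max_concurrent.
    sem = asyncio.Semaphore(max_concurrent)

    async def worker(path: Path) -> str:
        async with sem:
            return await process_single_stub(path)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(worker(p) for p in paths))

results = asyncio.run(process_batch_stub([Path("K123456.pdf"), Path("K234567.pdf")]))
print(results)
```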

Architecture

ashmatics_fda_pipeline/
├── __init__.py          # Public API exports
├── config.py            # PipelineConfig dataclass
├── pipeline.py          # FDA510kPipeline main class
├── pipeline_registry.py # Factory pattern for pipelines
├── cli.py               # Typer CLI entry point
│
├── extractors/          # Metadata extraction
│   ├── base.py          # DocumentExtractor ABC
│   ├── metadata_extractor.py  # FDA510kExtractor
│   └── llm_validator.py       # LLM validation
│
├── enrichers/           # Content enrichment
│   ├── table_classifier.py       # Table classification
│   ├── table_consolidator.py     # Multi-page table merge
│   ├── predicate_extractor.py    # Predicate device extraction
│   ├── training_data_extractor.py    # AI/ML training data
│   └── performance_data_extractor.py # Validation results
│
├── mappers/             # Schema mapping
│   ├── base.py          # DocumentMapper ABC
│   └── document_mapper.py # RegulatoryDocumentMapper
│
├── storage/             # Output management
│   └── batch_output.py  # BatchOutputManager
│
└── domain_knowledge/    # FDA domain patterns
    ├── __init__.py      # DomainKnowledge, DocumentPatternLoader
    ├── document_patterns.py
    ├── ai_device_expectations.yaml
    ├── 510k_summary_document_patterns.yaml
    └── de_novo_document_patterns.yaml
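
pipeline_registry.py implements a factory pattern for pipelines. A minimal sketch of how such a registry might work (all names here are illustrative, not the module's actual API):

```python
# Minimal factory/registry sketch; names are illustrative, not the actual API.
_PIPELINES: dict[str, type] = {}

def register_pipeline(name: str):
    """Class decorator that registers a pipeline class under a string key."""
    def decorator(cls: type) -> type:
        _PIPELINES[name] = cls
        return cls
    return decorator

def create_pipeline(name: str, *args, **kwargs):
    """Factory: look up a registered pipeline class and instantiate it."""
    try:
        return _PIPELINES[name](*args, **kwargs)
    except KeyError:
        raise ValueError(f"Unknown pipeline: {name!r}") from None

@register_pipeline("510k")
class Example510kPipeline:
    def __init__(self, config=None):
        self.config = config

pipeline = create_pipeline("510k")
print(type(pipeline).__name__)
```

A registry like this lets the CLI select a pipeline (510(k) vs. De Novo) from a string without hard-coding class imports.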

Configuration

Environment Variables

# Azure OpenAI (default LLM provider)
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o

# OpenAI (alternative)
OPENAI_API_KEY=your-api-key

# Anthropic (alternative)
ANTHROPIC_API_KEY=your-api-key

PipelineConfig Options

@dataclass
class PipelineConfig:
    # LLM configuration
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    enable_performance_extraction: bool = True

    # Processing
    max_concurrent: int = 3
    batch_size: int = 5

    # Output
    enable_batch_output: bool = True
    write_markdown: bool = False
    save_figures: bool = True
    save_tables: bool = True
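
Because PipelineConfig is a dataclass, variant configurations can be derived from the defaults with `dataclasses.replace`. The sketch below uses a simplified stand-in class so it runs standalone:

```python
from dataclasses import dataclass, replace

# Simplified stand-in for PipelineConfig so this example is self-contained.
@dataclass(frozen=True)
class ExampleConfig:
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    max_concurrent: int = 3

default = ExampleConfig()
# Derive an offline variant: same defaults, LLM validation disabled.
offline = replace(default, enable_llm_validation=False)
print(offline)
```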

Output Structure

When enable_batch_output=True, the pipeline writes a structured output directory:

batch-YYYYMMDD-HHMMSS/
├── manifest.json          # Batch metadata and processing summary
├── K123456/
│   ├── K123456_parsed.md  # Parsed markdown
│   ├── K123456_metadata.json
│   ├── K123456_mongo_doc.json
│   ├── figures/
│   │   ├── figure_1.png
│   │   └── figure_metadata.json
│   └── tables/
│       ├── table_1.md
│       ├── table_1.json
│       └── table_classifications.json
└── K234567/
    └── ...
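
A sketch of how the timestamped batch directory and manifest.json could be produced; the manifest fields shown are illustrative, not the pipeline's actual schema:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def create_batch_dir(root: Path) -> Path:
    """Create a batch-YYYYMMDD-HHMMSS directory under root."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    batch_dir = root / f"batch-{stamp}"
    batch_dir.mkdir(parents=True)
    return batch_dir

def write_manifest(batch_dir: Path, documents: list[str]) -> Path:
    # Manifest fields here are illustrative, not the actual schema.
    manifest = {"batch": batch_dir.name, "documents": documents, "count": len(documents)}
    path = batch_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path

root = Path(tempfile.mkdtemp())
batch = create_batch_dir(root)
manifest_path = write_manifest(batch, ["K123456", "K234567"])
print(manifest_path.name)
```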

Dependencies

Core Ashmatics Packages

  • ashmatics-tools: Base utilities, parsers, LLM clients
  • ashmatics-datamodels: Shared Pydantic data models

Optional Dependencies

# LLM enrichment (Azure OpenAI, OpenAI, Anthropic)
uv add "ashmatics-fda-pipeline[llm]"

# Azure storage support
uv add "ashmatics-fda-pipeline[azure]"

# MongoDB support
uv add "ashmatics-fda-pipeline[mongodb]"

# All optional dependencies
uv add "ashmatics-fda-pipeline[all]"

Development

Running Tests

# Run all tests
uv run pytest

# With coverage
uv run pytest --cov=ashmatics_fda_pipeline

# Run specific test
uv run pytest tests/unit/test_extractors.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check .

# Type check
uv run mypy src/

License

Copyright 2025 Asher Informatics PBC. All rights reserved.

This software is proprietary and confidential. Unauthorized copying, modification, distribution, or use is strictly prohibited.

See LICENSE for details.

Support

For licensing inquiries: legal@asherinformatics.com

For technical support: engineering@asherinformatics.com

Project details

Download files

Source distribution: ashmatics_fda_pipeline-0.1.1.tar.gz (107.4 kB)

  • SHA256: 48f56ed050e2f8a08d93c05db7bbbabd1a9a5deb68f5a7d193b2c5deafa3e730
  • MD5: c455cfcb54ab3b8b33670bd30c0b697b
  • BLAKE2b-256: c2f22f1efdd31d281ddc0e86ca7204d73913e7934992bc910716d4ba7e272947
  • Uploaded via Trusted Publishing (twine/6.1.0, CPython/3.13.12)

Built distribution: ashmatics_fda_pipeline-0.1.1-py3-none-any.whl (121.3 kB)

  • SHA256: ebb913804362629d88e8bd34701d48c0ac297248c1ffd19af0e1776575694b19
  • MD5: 878a6e77d11456db642b54936ad801cb
  • BLAKE2b-256: 119bd5b938359e357dd4921e03d1a134362181c7eb3df19fe0aabf0811ad479b

Provenance

Attestation bundles for both files were published by publish.yml on AshMatics/ashmatics-fda-pipeline. Attestation values reflect the state when the release was signed and may no longer be current.
