FDA 510(k) and De Novo document extraction, enrichment, and structuring pipeline
Ashmatics FDA Pipeline
Version: 0.1.0 | Last Updated: 2025-11-29
Copyright 2025 Asher Informatics PBC - Proprietary and Confidential
FDA 510(k) and De Novo document extraction, enrichment, and structuring pipeline for the Ashmatics Knowledge Base platform.
Overview
The Ashmatics FDA Pipeline extracts structured metadata from FDA regulatory documents (510(k) summaries and De Novo decision summaries). It converts unstructured PDFs into MongoDB-ready structured documents suitable for AI/ML analysis and knowledge base integration.
Key Features
- PDF Parsing: High-quality document parsing using Docling
- Metadata Extraction: Regex-based extraction with LLM validation
- Table Processing: Multi-page table consolidation and classification
- Predicate Extraction: Multi-source predicate device identification
- AI/ML Data Extraction: Training data and performance metrics extraction
- Domain Knowledge: Product code-aware extraction with confidence scoring
- Batch Processing: Concurrent document processing with structured output
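To make the regex-based extraction step concrete, here is a minimal sketch of how submission numbers could be pulled from parsed text. It is not the package's implementation: `SUBMISSION_RE`, `SubmissionMatch`, and `find_submission_numbers` are hypothetical names, and the pattern assumes the usual formats of `K` plus six digits for 510(k) numbers and `DEN` plus six digits for De Novo numbers.

```python
import re
from dataclasses import dataclass

# Assumed formats: 510(k) numbers are "K" + 6 digits, De Novo numbers are "DEN" + 6 digits.
SUBMISSION_RE = re.compile(r"\b(K\d{6}|DEN\d{6})\b")


@dataclass(frozen=True)
class SubmissionMatch:
    number: str       # e.g. "K123456"
    kind: str         # "510k" or "de_novo"
    occurrences: int  # how many times the number appears in the text


def find_submission_numbers(text: str) -> list[SubmissionMatch]:
    """Return distinct submission numbers, most frequent first."""
    counts: dict[str, int] = {}
    for match in SUBMISSION_RE.finditer(text):
        counts[match.group(1)] = counts.get(match.group(1), 0) + 1
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        SubmissionMatch(
            number=number,
            kind="de_novo" if number.startswith("DEN") else "510k",
            occurrences=n,
        )
        for number, n in ranked
    ]


sample = "510(k) Summary K123456. Predicate device: K234567. Subject device K123456."
matches = find_submission_numbers(sample)
```

The occurrence count is only a crude signal. A real pipeline would likely combine it with section context and the LLM validation step before assigning a confidence score.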
Installation
Prerequisites
- Python 3.12+
- uv package manager (recommended)
- Access to JFK-Ashmatics private repositories
Install from GitHub
# Using uv (recommended)
uv add "ashmatics-fda-pipeline @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"
# With optional dependencies
uv add "ashmatics-fda-pipeline[all] @ git+https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git"
Development Installation
git clone https://github.com/JFK-Ashmatics/ashmatics-fda-pipeline.git
cd ashmatics-fda-pipeline
uv sync --all-extras
Quick Start
CLI Usage
# Process a batch of PDFs
fda-pipeline process /path/to/pdfs --output /path/to/output
# Process a single document
fda-pipeline process-single /path/to/K123456.pdf
# Show version
fda-pipeline version
Python API
import asyncio
from pathlib import Path
from ashmatics_fda_pipeline import FDA510kPipeline, PipelineConfig
# Configure pipeline
config = PipelineConfig(
    enable_llm_validation=True,
    enable_performance_extraction=True,
    llm_provider="azure_openai",
)
async def main() -> None:
    # Create pipeline
    pipeline = FDA510kPipeline(config)
    # Process single document (the pipeline methods are coroutines)
    result = await pipeline.process_single(Path("K123456.pdf"))
    print(f"K-Number: {result.k_number}")
    print(f"Manufacturer: {result.metadata['manufacturer']}")
    # Process batch
    results = await pipeline.process_batch([Path("K123456.pdf"), Path("K234567.pdf")])
# Run the coroutine from synchronous code
asyncio.run(main())
Architecture
ashmatics_fda_pipeline/
├── __init__.py # Public API exports
├── config.py # PipelineConfig dataclass
├── pipeline.py # FDA510kPipeline main class
├── pipeline_registry.py # Factory pattern for pipelines
├── cli.py # Typer CLI entry point
│
├── extractors/ # Metadata extraction
│ ├── base.py # DocumentExtractor ABC
│ ├── metadata_extractor.py # FDA510kExtractor
│ └── llm_validator.py # LLM validation
│
├── enrichers/ # Content enrichment
│ ├── table_classifier.py # Table classification
│ ├── table_consolidator.py # Multi-page table merge
│ ├── predicate_extractor.py # Predicate device extraction
│ ├── training_data_extractor.py # AI/ML training data
│ └── performance_data_extractor.py # Validation results
│
├── mappers/ # Schema mapping
│ ├── base.py # DocumentMapper ABC
│ └── document_mapper.py # RegulatoryDocumentMapper
│
├── storage/ # Output management
│ └── batch_output.py # BatchOutputManager
│
└── domain_knowledge/ # FDA domain patterns
  ├── __init__.py # DomainKnowledge, DocumentPatternLoader
  ├── document_patterns.py
  ├── ai_device_expectations.yaml
  ├── 510k_summary_document_patterns.yaml
  └── de_novo_document_patterns.yaml
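The layout pairs abstract base classes (`extractors/base.py`, `mappers/base.py`) with a registry that acts as a factory. The sketch below shows that general pattern only. The method name `extract`, the `TitleExtractor` class, and the `register`/`create` helpers are illustrative assumptions and may not match the package's actual interfaces.

```python
from abc import ABC, abstractmethod
from typing import Callable


class DocumentExtractor(ABC):
    """Hypothetical base class; the real ABC in extractors/base.py may differ."""

    @abstractmethod
    def extract(self, text: str) -> dict[str, str]:
        ...


class TitleExtractor(DocumentExtractor):
    """Toy extractor: treats the first non-empty line as the title."""

    def extract(self, text: str) -> dict[str, str]:
        for line in text.splitlines():
            if line.strip():
                return {"title": line.strip()}
        return {}


# Registry mapping a document type to a factory that builds its extractor.
_REGISTRY: dict[str, Callable[[], DocumentExtractor]] = {}


def register(doc_type: str, factory: Callable[[], DocumentExtractor]) -> None:
    _REGISTRY[doc_type] = factory


def create(doc_type: str) -> DocumentExtractor:
    try:
        return _REGISTRY[doc_type]()
    except KeyError:
        raise ValueError(f"no extractor registered for {doc_type!r}") from None


register("510k_summary", TitleExtractor)
extractor = create("510k_summary")
result = extractor.extract("\n510(k) Summary\nDevice: Example")
```

Keeping construction behind a registry lets a new document type (for example, De Novo summaries) be added without touching the calling code.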
Configuration
Environment Variables
# Azure OpenAI (default LLM provider)
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o
# OpenAI (alternative)
OPENAI_API_KEY=your-api-key
# Anthropic (alternative)
ANTHROPIC_API_KEY=your-api-key
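A small pre-flight check can catch missing credentials before a batch starts. The following is a sketch, not part of the package: `REQUIRED_ENV` and `missing_env` are hypothetical names, the variable names come from the list above, and the provider keys `"openai"` and `"anthropic"` are assumed spellings (only `"azure_openai"` appears in this README).

```python
import os
from collections.abc import Mapping

# Variable names come from the list above; provider keys other than
# "azure_openai" are assumed spellings.
REQUIRED_ENV: dict[str, tuple[str, ...]] = {
    "azure_openai": (
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_DEPLOYMENT_NAME",
    ),
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
}


def missing_env(provider: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for a provider."""
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]
```

Calling `missing_env("azure_openai")` before building the pipeline returns an empty list when all three Azure variables are set.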
PipelineConfig Options
from dataclasses import dataclass
@dataclass
class PipelineConfig:
    # LLM configuration
    enable_llm_validation: bool = True
    llm_provider: str = "azure_openai"
    enable_performance_extraction: bool = True
    # Processing
    max_concurrent: int = 3
    batch_size: int = 5
    # Output
    enable_batch_output: bool = True
    write_markdown: bool = False
    save_figures: bool = True
    save_tables: bool = True
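A limit such as `max_concurrent` is commonly enforced with an `asyncio.Semaphore`. The sketch below illustrates that technique in isolation; it is an assumption about how such a limit could work, not a description of the pipeline's internals. `run_bounded` and `fake_process` are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrent: int = 3,
) -> list[R]:
    """Run worker over items with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def _demo() -> tuple[list[str], int]:
    in_flight = 0
    peak = 0

    async def fake_process(name: str) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for parsing a PDF
        in_flight -= 1
        return name.upper()

    names = [f"k{n}" for n in range(8)]
    return await run_bounded(names, fake_process, max_concurrent=3), peak


results, peak = asyncio.run(_demo())
```

Here `peak` records the highest number of simultaneous workers observed, which stays at or below the configured limit.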
Output Structure
When enable_batch_output=True, the pipeline writes each batch to a timestamped directory with one subdirectory per document:
batch-YYYYMMDD-HHMMSS/
├── manifest.json # Batch metadata and processing summary
├── K123456/
│ ├── K123456_parsed.md # Parsed markdown
│ ├── K123456_metadata.json
│ ├── K123456_mongo_doc.json
│ ├── figures/
│ │ ├── figure_1.png
│ │ └── figure_metadata.json
│ └── tables/
│ ├── table_1.md
│ ├── table_1.json
│ └── table_classifications.json
└── K234567/
  └── ...
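Downstream tooling can walk this layout with the standard library alone. The sketch below assumes only the directory shape shown above. `batch_dir_name` and `list_document_outputs` are hypothetical helpers, not functions exported by the package, and the `manifest.json` schema is not assumed.

```python
from datetime import datetime
from pathlib import Path


def batch_dir_name(now: datetime) -> str:
    """Format a directory name following the batch-YYYYMMDD-HHMMSS layout."""
    return now.strftime("batch-%Y%m%d-%H%M%S")


def list_document_outputs(batch_dir: Path) -> dict[str, list[str]]:
    """Map each per-document subdirectory to its files (paths relative to it)."""
    outputs: dict[str, list[str]] = {}
    for doc_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
        outputs[doc_dir.name] = sorted(
            f.relative_to(doc_dir).as_posix()
            for f in doc_dir.rglob("*")
            if f.is_file()
        )
    return outputs
```

Top-level files such as `manifest.json` are skipped because only subdirectories are treated as documents.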
Dependencies
Core Ashmatics Packages
- ashmatics-tools: Base utilities, parsers, LLM clients
- ashmatics-datamodels: Shared Pydantic data models
Optional Dependencies
# LLM enrichment (Azure OpenAI, OpenAI, Anthropic)
uv add "ashmatics-fda-pipeline[llm]"
# Azure storage support
uv add "ashmatics-fda-pipeline[azure]"
# MongoDB support
uv add "ashmatics-fda-pipeline[mongodb]"
# All optional dependencies (quotes keep shells such as zsh from expanding the brackets)
uv add "ashmatics-fda-pipeline[all]"
Development
Running Tests
# Run all tests
uv run pytest
# With coverage
uv run pytest --cov=ashmatics_fda_pipeline
# Run specific test
uv run pytest tests/unit/test_extractors.py -v
Code Quality
# Format code
uv run ruff format .
# Lint
uv run ruff check .
# Type check
uv run mypy src/
License
Copyright 2025 Asher Informatics PBC. All rights reserved.
This software is proprietary and confidential. Unauthorized copying, modification, distribution, or use is strictly prohibited.
See LICENSE for details.
Support
For licensing inquiries: legal@asherinformatics.com
For technical support: engineering@asherinformatics.com