A Python package to extract AUTOSAR models from PDF files and convert them to markdown

AUTOSAR PDF to Text

A Python package to extract AUTOSAR model hierarchies from PDF specification documents and convert them to markdown format.

Features

PDF Extraction: Extract AUTOSAR packages, classes, enumerations, and primitive types from PDF specification documents
Two-Phase Parsing: The read phase extracts all text from the PDF; the parse phase then processes the complete buffer, so definitions that span multiple pages are handled correctly
Hierarchical Parsing: Parse complex hierarchical class structures with inheritance relationships
Source Location Tracking: Track PDF file and page number for each type definition and base class reference
Markdown Output: Generate well-formatted markdown output with proper indentation
JSON Output: Generate structured JSON output with complete type information
Type Mapping: Generate type-to-package mapping in JSON or Markdown table format
Class Details: Support for abstract classes, attributes, ATP markers, and source information
Class Hierarchy: Generate separate class inheritance hierarchy files showing root classes and their subclasses
Individual Class Files: Create separate markdown files for each class with detailed information
Model Validation: Built-in duplicate prevention and validation at the model level
Subclasses Validation: Validate subclass relationships against actual inheritance hierarchy
Comprehensive Coverage: 97%+ test coverage with robust error handling
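
The two-phase approach listed above can be illustrated with a minimal sketch. This is not the package's actual parser: the function names `read_phase` and `parse_phase` are hypothetical, and the page texts are passed in as plain strings where the real tool would obtain them from the PDF. The `<<<PAGE:N>>>` marker format is the one the changelog mentions for version 0.19.0.

```python
import re
from typing import Iterator, List, Tuple

# Page boundary marker; the <<<PAGE:N>>> format is taken from the changelog.
PAGE_MARKER = re.compile(r"^<<<PAGE:(\d+)>>>$")


def read_phase(page_texts: List[str]) -> str:
    """Join the text of every page into one buffer, prefixing each page with a marker."""
    chunks = []
    for number, text in enumerate(page_texts, start=1):
        chunks.append(f"<<<PAGE:{number}>>>")
        chunks.append(text)
    return "\n".join(chunks)


def parse_phase(buffer: str) -> Iterator[Tuple[int, str]]:
    """Walk the complete buffer and yield (page_number, line) for each content line."""
    page = 0
    for line in buffer.splitlines():
        stripped = line.strip()
        match = PAGE_MARKER.match(stripped)
        if match:
            # A marker only updates the current page; it is not content.
            page = int(match.group(1))
            continue
        if stripped:
            yield page, stripped


# A definition that starts on page 1 and continues on page 2.
pages = [
    "Class SwComponentPrototype\nPackage M2::AUTOSAR::Components",
    "Attribute shortName",
]
lines = list(parse_phase(read_phase(pages)))
```

Because the parse phase sees one continuous buffer, a definition that crosses a page break is processed as a single run of lines, while each line still carries the page it came from for source location tracking.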

Installation

pip install autosar-pdf2txt

Or install from source:

git clone https://github.com/melodypapa/autosar-pdf.git
cd autosar-pdf
pip install -e .

Version: 2.0.0 (Production Release)

Requirements

Python 3.7+
pdfplumber

Usage

Command Line Interface

The autosar-extract command provides a simple interface for extracting AUTOSAR models from PDF files.

# Generate type-to-package mapping
autosar-extract examples/pdf/AUTOSAR_CP_TPS_ECUConfiguration.pdf --mapping mapping.md

# Generate class inheritance hierarchy
autosar-extract examples/pdf/AUTOSAR_CP_TPS_ECUConfiguration.pdf --hierarchy hierarchy.md

# Generate individual class files
autosar-extract examples/pdf/AUTOSAR_CP_TPS_ECUConfiguration.pdf --class-details classes/

# Combine multiple outputs
autosar-extract examples/pdf/ --mapping mapping.md --hierarchy hierarchy.md --class-details classes/

# Generate mapping in JSON format (auto-detected from .json extension)
autosar-extract examples/pdf/ --mapping mapping.json

# Process multiple PDFs
autosar-extract path/to/file1.pdf path/to/file2.pdf path/to/file3.pdf --mapping mapping.md

# Process all PDFs in a directory
autosar-extract path/to/directory --mapping mapping.md

# Enable verbose mode for detailed debug information
autosar-extract examples/pdf/ --mapping mapping.md -v

# Write logs to a file with timestamps
autosar-extract examples/pdf/ --mapping mapping.md --log-file extraction.log

# Combine log file with verbose mode
autosar-extract examples/pdf/ --mapping mapping.md --log-file extraction.log -v

CLI Options

pdf_files: Path(s) to PDF files, or to directories containing PDFs, to parse
--mapping FILE: Generate type-to-package mapping to FILE
--hierarchy FILE: Generate class inheritance hierarchy to FILE
--class-details DIR: Generate individual class files to DIR/
--format {markdown,json}: Output format (default: inferred from file extension)
-v, --verbose: Enable verbose output mode for detailed debug information
--log-file LOG_FILE: Write log messages to a file with timestamps (default: console only)

Note: At least one output flag (--mapping, --hierarchy, or --class-details) must be specified.
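
The option rules above can be sketched with `argparse`. This is a hypothetical reconstruction from the option list, not the tool's real implementation; the helper names `build_parser`, `parse_args`, and `infer_format` are assumptions made for illustration.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="autosar-extract")
    parser.add_argument("pdf_files", nargs="+", help="PDF files or directories")
    parser.add_argument("--mapping", metavar="FILE")
    parser.add_argument("--hierarchy", metavar="FILE")
    parser.add_argument("--class-details", metavar="DIR")
    parser.add_argument("--format", choices=["markdown", "json"])
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("--log-file", metavar="LOG_FILE")
    return parser


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce the documented rule: at least one output flag is required.
    if not (args.mapping or args.hierarchy or args.class_details):
        parser.error("specify at least one of --mapping, --hierarchy, --class-details")
    return args


def infer_format(path: str, explicit: Optional[str] = None) -> str:
    """Return the explicit format if given, else infer it from the file extension."""
    if explicit:
        return explicit
    return "json" if Path(path).suffix.lower() == ".json" else "markdown"


args = parse_args(["examples/pdf/", "--mapping", "mapping.json", "-v"])
mapping_format = infer_format(args.mapping, args.format)
```

Calling `parse_args(["input.pdf"])` without any output flag exits with a usage error, which matches the note above.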

Migration from v1.x to v2.0

Version 2.0.0 includes breaking changes to CLI arguments. Here's how to migrate:

Old: Generate mapping

autosar-extract input.pdf -o output.md --generate-mapping

New:

autosar-extract input.pdf --mapping output.md

Old: Generate hierarchy

autosar-extract input.pdf -o output.md --include-class-hierarchy

New:

autosar-extract input.pdf --hierarchy output.md

Old: Generate class details

autosar-extract input.pdf -o output.md --include-class-details

New:

autosar-extract input.pdf --class-details output/

Old: Combine mapping + hierarchy

autosar-extract input.pdf -o output.md --generate-mapping --include-class-hierarchy

New:

autosar-extract input.pdf --mapping mapping.md --hierarchy hierarchy.md

Note: The v1.x --generate-mapping flag conflicted with --include-class-details and --include-class-hierarchy and could not be combined with them. The v2.0 output flags have no such restriction and can be combined freely.

Python API

You can also use the package programmatically in your Python code:

from autosar_pdf2txt import PdfParser, MarkdownWriter, MappingWriter

# Parse single PDF file
parser = PdfParser()
packages = parser.parse_pdf("path/to/file.pdf")

# Parse multiple PDF files
parser = PdfParser()
all_packages = []
for pdf_path in ["path/to/file1.pdf", "path/to/file2.pdf"]:
    packages = parser.parse_pdf(pdf_path)
    all_packages.extend(packages)

# Write package hierarchy to markdown
writer = MarkdownWriter()
markdown = writer.write_packages(all_packages)
print(markdown)

# Generate class inheritance hierarchy
from autosar_pdf2txt import AutosarClass

# Collect all classes from packages
all_classes = []
for pkg in all_packages:
    classes_from_pkg = writer._collect_classes_from_package(pkg)
    all_classes.extend(classes_from_pkg)

# Get root classes (classes with no parent/inheritance)
root_classes = [cls for cls in all_classes if not cls.bases]

# Write class hierarchy
hierarchy = writer.write_class_hierarchy(root_classes, all_classes)
print(hierarchy)

# Generate type-to-package mapping
mapping_writer = MappingWriter()
json_mapping = mapping_writer.write_mapping(all_packages, format="json")
md_mapping = mapping_writer.write_mapping(all_packages, format="markdown")

Data Models

The package provides comprehensive data models for representing AUTOSAR structures:

AutosarPackage

Represents a hierarchical package containing classes and subpackages.

from autosar_pdf2txt import AutosarPackage, AutosarClass

pkg = AutosarPackage(name="AUTOSAR")
pkg.add_class(AutosarClass(name="MyClass", package="M2::AUTOSAR", is_abstract=False))

AutosarClass

Represents an AUTOSAR class with attributes, inheritance, and optional ATP markers.

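# AttributeKind is used below; it is assumed to be exported from the package root like the other models
from autosar_pdf2txt import AutosarClass, AutosarAttribute, AttributeKind, ATPType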

cls = AutosarClass(
    name="SwComponentPrototype",
    package="M2::AUTOSAR::Components",
    is_abstract=False,
    atp_type=ATPType.ATP_MIXED_STRING,
    attributes=[
        AutosarAttribute(
            name="shortName",
            type="String",
            mult="1",
            kind=AttributeKind.ATTRIBUTE
        )
    ]
)

AutosarEnumeration

Represents an AUTOSAR enumeration type with literals.

from autosar_pdf2txt import AutosarEnumeration, AutosarEnumLiteral

enum = AutosarEnumeration(
    name="Category",
    package="M2::AUTOSAR"
)
enum.enumeration_literals = [
    AutosarEnumLiteral(name="VALUE1", index=0, description="First value"),
    AutosarEnumLiteral(name="VALUE2", index=1, description="Second value"),
]

AutosarDoc

Represents a complete AUTOSAR document with packages and root classes.

from autosar_pdf2txt import AutosarDoc

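# pkg1, pkg2, root_cls1 and root_cls2 stand for AutosarPackage and AutosarClass objects built beforehand
doc = AutosarDoc(packages=[pkg1, pkg2], root_classes=[root_cls1, root_cls2])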

# Query packages and classes
pkg = doc.get_package("AUTOSAR")
cls = doc.get_root_class("SwComponentPrototype")

Examples

The repository includes sample AUTOSAR specification PDFs in the examples/pdf/ directory:

AUTOSAR_CP_TPS_BSWModuleDescriptionTemplate.pdf
AUTOSAR_CP_TPS_DiagnosticExtractTemplate.pdf
AUTOSAR_CP_TPS_ECUConfiguration.pdf
AUTOSAR_CP_TPS_ECUResourceTemplate.pdf
AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf
AUTOSAR_CP_TPS_SystemTemplate.pdf
AUTOSAR_CP_TPS_TimingExtensions.pdf

Note: Several commands in the example sections that follow still use the v1.x flags (-o, --generate-mapping, --include-class-hierarchy, --include-class-details), which version 2.0.0 removed. See "Migration from v1.x to v2.0" for the current equivalents.

Example: Basic Extraction

# Extract a single AUTOSAR template
autosar-extract examples/pdf/AUTOSAR_CP_TPS_ECUConfiguration.pdf

# Extract all AUTOSAR templates from the examples directory
autosar-extract examples/pdf/

# Save output to a markdown file
autosar-extract examples/pdf/ -o autosar_templates.md

# Extract specific templates
autosar-extract \
  examples/pdf/AUTOSAR_CP_TPS_SystemTemplate.pdf \
  examples/pdf/AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf \
  -o system_and_component.md

# Extract with verbose output to see processing details
autosar-extract examples/pdf/AUTOSAR_CP_TPS_ECUConfiguration.pdf -v

Example: Generate Class Hierarchy

Create a separate file showing the class inheritance hierarchy:

# Extract Software Component Template with class hierarchy
autosar-extract examples/pdf/AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf -o autosar_model.md --include-class-hierarchy

# This creates two files:
# - autosar_model.md (package hierarchy)
# - autosar_model-hierarchy.md (class inheritance tree)

The class hierarchy file shows:

## Class Hierarchy

* SwComponentPrototype
  * RequiredSwComponentPrototype
* SwcInternalBehavior
  * RunnableEntity
    * ClientServerOperation
  * TriggerEntity

Example: Generate Individual Class Files

Generate separate markdown files for each AUTOSAR class:

# Extract and create individual class files
autosar-extract examples/pdf/AUTOSAR_CP_TPS_ECUConfiguration.pdf \
  --include-class-details \
  -o data/autosar_models.md

# This creates:
# - data/autosar_models.md (consolidated output)
# - data/autosar_models/classes/<PackageName>/<ClassName>.md (individual files)

Example: Combined Output

Generate all outputs in a single run:

autosar-extract examples/pdf/ -o autosar_complete.md --include-class-hierarchy --include-class-details

Output:

Parsing: examples/pdf/AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf
Found 15 packages
Collected 234 classes from 15 packages
Generated class hierarchy for 45 root classes
Writing to: autosar_complete.md
Class hierarchy written to: autosar_complete-hierarchy.md
Writing class files to: autosar_complete/classes/


**Common auto-corrections** include:
- Attribute name case corrections (e.g., `Shortname` → `shortName`)
- Type name corrections (e.g., `SwComponent` → `SwComponentType`)

Example: Generate Type-to-Package Mapping

Generate a simple mapping of all types to their package paths:

```bash
# Generate JSON mapping
autosar-extract examples/pdf/AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf \
  -o mapping.json --generate-mapping

# Generate Markdown table mapping
autosar-extract examples/pdf/AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf \
  -o mapping.md --generate-mapping
```

JSON Output Format (mapping.json):

{
  "types": [
    {
      "name": "SwComponentPrototype",
      "type": "Class",
      "package_path": "M2::AUTOSAR::Components"
    },
    {
      "name": "Category",
      "type": "Enumeration",
      "package_path": "M2::AUTOSAR::DataTypes"
    },
    {
      "name": "LimitValue",
      "type": "Primitive",
      "package_path": "M2::AUTOSAR::DataTypes"
    }
  ]
}

Markdown Output Format (mapping.md):

# Type to Package Mapping

| Name | Type | Package Path |
|------|------|--------------|
| SwComponentPrototype | Class | M2::AUTOSAR::Components |
| RequiredSwComponentPrototype | Class | M2::AUTOSAR::Components |
| Category | Enumeration | M2::AUTOSAR::DataTypes |
| LimitValue | Primitive | M2::AUTOSAR::DataTypes |
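
As an illustration of the two mapping formats shown above, here is a minimal sketch that produces both from the same list of entries. It does not use the package's `MappingWriter`; the functions `mapping_to_json` and `mapping_to_markdown` are hypothetical stand-ins, and the entries are the sample values from the JSON example.

```python
import json
from typing import List, Tuple

# (name, kind, package_path) triples; the sample values come from the example above.
TypeEntry = Tuple[str, str, str]


def mapping_to_json(entries: List[TypeEntry]) -> str:
    """Serialize the entries under a top-level "types" key."""
    payload = {
        "types": [
            {"name": name, "type": kind, "package_path": path}
            for name, kind, path in entries
        ]
    }
    return json.dumps(payload, indent=2)


def mapping_to_markdown(entries: List[TypeEntry]) -> str:
    """Render the entries as a three-column markdown table."""
    lines = [
        "# Type to Package Mapping",
        "",
        "| Name | Type | Package Path |",
        "|------|------|--------------|",
    ]
    lines += [f"| {name} | {kind} | {path} |" for name, kind, path in entries]
    return "\n".join(lines)


entries = [
    ("SwComponentPrototype", "Class", "M2::AUTOSAR::Components"),
    ("Category", "Enumeration", "M2::AUTOSAR::DataTypes"),
    ("LimitValue", "Primitive", "M2::AUTOSAR::DataTypes"),
]
json_text = mapping_to_json(entries)
markdown_text = mapping_to_markdown(entries)
```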

Python API for Mapping Generation:

from autosar_pdf2txt import PdfParser, MappingWriter

# Parse PDFs
parser = PdfParser()
doc = parser.parse_pdfs(["examples/pdf/AUTOSAR_CP_TPS_SoftwareComponentTemplate.pdf"])

# Generate mapping
writer = MappingWriter()

# JSON format
json_mapping = writer.write_mapping(doc.packages, format="json")
print(json_mapping)

# Markdown format
md_mapping = writer.write_mapping(doc.packages, format="markdown")
print(md_mapping)

Alternative import (if you prefer importing from the writer submodule):

from autosar_pdf2txt import PdfParser
from autosar_pdf2txt.writer import MappingWriter

Output Format

Package Hierarchy Output

The package hierarchy uses asterisk-based markdown formatting with indentation:

* AUTOSAR
  * DataTypes
    * String
  * Components
    * SwComponentPrototype (abstract)
    * RequiredSwComponentPrototype

Packages: indented 2 spaces per level
Classes: indented 1 level deeper than their parent package
Abstract classes marked with (abstract) suffix
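
The indentation rules above can be sketched as a small recursive renderer. The `PackageNode` and `ClassNode` dataclasses are simplified, hypothetical stand-ins for the package's own models, and `render_package` is not part of the library's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClassNode:
    name: str
    is_abstract: bool = False


@dataclass
class PackageNode:
    name: str
    classes: List[ClassNode] = field(default_factory=list)
    subpackages: List["PackageNode"] = field(default_factory=list)


def render_package(pkg: PackageNode, level: int = 0) -> List[str]:
    """Render a package as asterisk list items, two spaces of indent per level."""
    lines = [f"{'  ' * level}* {pkg.name}"]
    for sub in pkg.subpackages:
        lines.extend(render_package(sub, level + 1))
    for cls in pkg.classes:
        # Classes sit one level deeper than the package that owns them.
        suffix = " (abstract)" if cls.is_abstract else ""
        lines.append(f"{'  ' * (level + 1)}* {cls.name}{suffix}")
    return lines


root = PackageNode(
    "AUTOSAR",
    subpackages=[
        PackageNode("DataTypes", classes=[ClassNode("String")]),
        PackageNode(
            "Components",
            classes=[
                ClassNode("SwComponentPrototype", is_abstract=True),
                ClassNode("RequiredSwComponentPrototype"),
            ],
        ),
    ],
)
output = "\n".join(render_package(root))
```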

Class Hierarchy Output

The class hierarchy shows inheritance relationships from root classes:

## Class Hierarchy

* RootClass1 (abstract)
  * ChildClass1
    * GrandchildClass
  * ChildClass2
* RootClass2
  * ChildClass3

Root classes (no parent) at top level
Child classes indented 2 spaces per inheritance level
Circular references detected and marked with "(cycle detected)"
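
One way to produce such a tree, including the cycle marker, is sketched here. It works on a plain `{class_name: [base_names]}` dictionary rather than on the package's model classes, and the function `render_hierarchy` is hypothetical; how the real writer detects cycles is not documented in this README.

```python
from typing import Dict, List, Set


def render_hierarchy(bases: Dict[str, List[str]]) -> List[str]:
    """Render an inheritance tree from a {class_name: [base_names]} mapping."""
    # Invert the mapping so each class knows its direct subclasses.
    children: Dict[str, List[str]] = {name: [] for name in bases}
    for name, parents in bases.items():
        for parent in parents:
            children.setdefault(parent, []).append(name)

    lines: List[str] = []

    def visit(name: str, level: int, path: Set[str]) -> None:
        indent = "  " * level
        if name in path:
            # The class is already on the current branch: stop instead of recursing forever.
            lines.append(f"{indent}* {name} (cycle detected)")
            return
        lines.append(f"{indent}* {name}")
        for child in children.get(name, []):
            visit(child, level + 1, path | {name})

    # Root classes are those without any base class.
    for root in [name for name, parents in bases.items() if not parents]:
        visit(root, 0, set())
    return lines


tree = render_hierarchy({
    "RootClass1": [],
    "ChildClass1": ["RootClass1"],
    "GrandchildClass": ["ChildClass1"],
    "ChildClass2": ["RootClass1"],
    "RootClass2": [],
    "ChildClass3": ["RootClass2"],
})
cyclic = render_hierarchy({"A": [], "B": ["A", "C"], "C": ["B"]})
```

Tracking the set of classes on the current branch, rather than every class visited so far, lets a class appear under several parents while still stopping on a true cycle.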

JSON Output Format

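The tool also supports JSON output for machine-readable extraction and programmatic processing. The commands below are written with the v1.x -o flag, which version 2.0.0 removed; with the v2.0 output flags the format is likewise inferred from the file extension or set with --format.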

# Explicit format selection
autosar-extract input.pdf -o output.json --format json
autosar-extract input.pdf -o output.md --format markdown

# Automatic format inference from file extension
autosar-extract input.pdf -o output.json    # Creates JSON output
autosar-extract input.pdf -o output.md      # Creates markdown output
autosar-extract input.pdf -o output         # Default: markdown

JSON File Structure

JSON output creates a multi-file structure with separate files for different entity types:

output/
├── index.json                              # Root index with overview
└── packages/
    ├── M2.json                              # Package metadata
    ├── M2.classes.json                      # All classes in M2
    ├── M2.enums.json                        # All enumerations in M2
    ├── M2_AUTOSAR.json                      # Subpackage metadata
    ├── M2_AUTOSAR.classes.json              # Classes in subpackage
    └── ...

JSON Schema

index.json - Root index with:

version: Schema version
metadata: Generation timestamp, source files, entity counts
packages: List of package references

Package metadata file (packages/{name}.json):

name: Package name
path: Full package path with :: separator
files: References to entity files
subpackages: Child package metadata
summary: Entity counts

Classes file (packages/{name}.classes.json):

Complete class data including attributes, sources, inheritance hierarchy
atp_type: ATP marker type or null
implements, implemented_by: ATP interface relationships

Enumerations file (packages/{name}.enums.json):

Enumeration literals with index and description
Tags merged into description with <br>Tags: format

Primitives file (packages/{name}.primitives.json):

Primitive types with attributes (no inheritance fields)

For complete JSON schema details, see JSON Writer Design Document.
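
The file names in the layout above appear to follow from the package path. The sketch below derives them under that assumption: replacing `::` with `_` is inferred from the `M2_AUTOSAR.json` example rather than from a documented rule, and `package_file_names` is a hypothetical helper.

```python
from typing import Dict


def package_file_names(package_path: str) -> Dict[str, str]:
    """Map a '::'-separated package path to the per-package JSON file names."""
    # Assumption: '::' becomes '_' in file names, as in M2::AUTOSAR -> M2_AUTOSAR.
    stem = package_path.replace("::", "_")
    return {
        "metadata": f"packages/{stem}.json",
        "classes": f"packages/{stem}.classes.json",
        "enums": f"packages/{stem}.enums.json",
        "primitives": f"packages/{stem}.primitives.json",
    }


names = package_file_names("M2::AUTOSAR")
```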

Individual Class Files

Each class file contains detailed information:

# Package: AUTOSAR::Components

## Class: SwComponentPrototype

**Abstract**: No
**Package**: M2::AUTOSAR::Components
**Parent**: None
**ATP Type**: None

### Attributes

| Name | Type | Mult. | Kind | Note |
|------|------|-------|------|------|
| shortName | String | 1 | attribute | |
| category | Category | 0..1 | attribute | |
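
A class file of this shape could be generated along the following lines. This is a sketch, not the package's writer: `render_class_file` is a hypothetical function, the "Yes" label for abstract classes is an assumption (the sample only shows "No"), and the sketch prints the package path as given in both places rather than reproducing the shortened path in the sample heading.

```python
from typing import List, Optional, Tuple

# (name, type, multiplicity, kind, note) for each attribute row.
AttributeRow = Tuple[str, str, str, str, str]


def render_class_file(
    name: str,
    package: str,
    is_abstract: bool,
    parent: Optional[str],
    atp_type: Optional[str],
    attributes: List[AttributeRow],
) -> str:
    """Render one class as markdown: header fields followed by an attribute table."""
    lines = [
        f"# Package: {package}",
        "",
        f"## Class: {name}",
        "",
        f"**Abstract**: {'Yes' if is_abstract else 'No'}",  # "Yes" is assumed
        f"**Package**: {package}",
        f"**Parent**: {parent or 'None'}",
        f"**ATP Type**: {atp_type or 'None'}",
        "",
        "### Attributes",
        "",
        "| Name | Type | Mult. | Kind | Note |",
        "|------|------|-------|------|------|",
    ]
    for attr_name, attr_type, mult, kind, note in attributes:
        lines.append(f"| {attr_name} | {attr_type} | {mult} | {kind} | {note} |")
    return "\n".join(lines)


text = render_class_file(
    name="SwComponentPrototype",
    package="M2::AUTOSAR::Components",
    is_abstract=False,
    parent=None,
    atp_type=None,
    attributes=[
        ("shortName", "String", "1", "attribute", ""),
        ("category", "Category", "0..1", "attribute", ""),
    ],
)
```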

Development

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=autosar_pdf2txt --cov-report=term-missing

# Run specific test file
pytest tests/models/test_autosar_models.py -v

Code Quality

# Linting
ruff check src/ tests/

# Type checking
mypy src/autosar_pdf2txt/

# Run full quality checks
pytest tests/ && ruff check src/ tests/ && mypy src/autosar_pdf2txt/

Test Coverage

The project maintains 97%+ test coverage with comprehensive test suites for all modules:

Models: 100% coverage (attributes, containers, enums, types)
Parser: 90% coverage (PDF parsing, pattern recognition, hierarchy building, subclasses validation)
Writer: 100% coverage (markdown generation, class hierarchy, file output)
CLI: 82% coverage (accepted per the project requirements; the uncovered code is mostly error-handling paths)

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please ensure:

All tests pass: pytest tests/
Code coverage remains ≥95%
Linting passes: ruff check src/ tests/
Type checking passes: mypy src/autosar_pdf2txt/

Project Links

GitHub Repository: https://github.com/melodypapa/autosar-pdf
Issue Tracker: https://github.com/melodypapa/autosar-pdf/issues
Documentation: See docs/ directory for detailed requirements and development guidelines

Changelog

Version 2.0.0 (Breaking Change)

CLI Redesign: Redesigned CLI output arguments for better flexibility
Removed: -o, --generate-mapping, --include-class-hierarchy, --include-class-details
Added: --mapping FILE, --hierarchy FILE, --class-details DIR
Feature: Output flags can now be combined in any combination
Feature: Format auto-detected from file extension (.md, .json)
Migration: See "Migration from v1.x to v2.0" section in README

Version 1.0.0

Production Release: Project has reached production stability with comprehensive test coverage
CamelCase Attribute Extraction: Fixed attribute parsing for camelCase names like shortNameFragment (SWR_PARSER_00012)
Improved Attribute Name Parsing: Resolved issues with Referrable class showing correct attributes (shortName and shortNameFragment)
Modern Python Packaging: Migrated from setup.py to pyproject.toml with PEP 621 compliance
Enhanced Type Detection: Added 34 common type suffixes to exclusion list for better camelCase detection
Test Coverage: Maintained 97%+ test coverage with 524 total tests (510 unit + 14 integration)
Python 3.12 Support: Added Python 3.12 to supported versions
Development Status: Updated from "Beta" to "4 - Production" status
Type-to-Package Mapping: Added mapping generation feature with --generate-mapping CLI flag (from PR #167)

Version 0.19.0

Added page number tracking in two-phase parsing (SWR_PARSER_00030) for accurate source location
Enhanced multi-page class definition parsing with improved state management
Added integration tests for multi-page class parsing scenarios
Improved page boundary marker handling with <<<PAGE:N>>> format
Specialized parsers now receive accurate page numbers from parse phase
Fixed page number assignment for types defined beyond page 1
Enhanced integration test documentation with multi-page parsing test cases

Version 0.18.0

Enhanced M2 package prefix preservation as root metamodel package
Improved source location tracking with AUTOSAR standard and release extraction
Added markdown table format for source information output (SWR_WRITER_00008)
Refactored duplicate type handling to log warnings instead of raising errors
Renamed AutosarSource to AutosarDocumentSource for clarity
Enhanced source information display in individual class files
Updated requirements documentation with source location details
Added 7 new AUTOSAR FO (Foundation) template PDFs to examples

Version 0.17.0

Enhanced integration tests for multi-page class definition parsing
Improved state management for multi-page definitions
Added test documentation for multi-page parsing scenarios
Fixed issues with class definitions spanning multiple pages
Improved error messages for parsing failures

Version 0.16.0

Added CLI log file support (--log-file) for persistent logging with timestamps
Implemented subclasses validation (SWR_PARSER_00029) to detect inheritance contradictions
Added comprehensive TDD enforcement documentation to prevent future violations
Enhanced test documentation with 15 new test cases for log file feature
Enhanced test documentation with 10 new test cases for subclasses validation
Improved test coverage from 96% to 97%
Updated AGENTS.md with mandatory TDD section
Updated development guidelines with TDD enforcement and common mistakes

Version 0.15.0

Implemented two-phase PDF parsing approach (read phase + parse phase)
Added specialized parsers for classes, enumerations, and primitives
Added ancestry-based parent resolution for complex inheritance hierarchies
Added source location tracking for PDF file and page number
Added subclasses attribute to track explicitly documented subclass relationships
Refactored requirements documentation into separate module files
Enhanced TDD rules with test type selection strategy
Fixed multi-line class list parsing and multi-page class definition handling

Version 0.9.0

Added class hierarchy generation feature (--include-class-hierarchy)
Added separate output file for class hierarchy
Enhanced /sync-docs command with coverage validation
Improved test coverage from 90% to 96%
Added AutosarDoc model for document-level operations
Added enumeration and enum literal support
Enhanced logging for class hierarchy generation
Fixed model validation and duplicate prevention

Version 0.8.0

Initial release with basic PDF extraction and markdown output
Support for packages, classes, and attributes
ATP marker support
Individual class file generation

Project details

Maintainers

melodypapa

Release history

2.0.0 (this version): Feb 14, 2026
1.0.0: Feb 9, 2026
0.26.0: Jan 31, 2026

Download files

Source Distribution

autosar_pdf2txt-2.0.0.tar.gz (73.7 kB), uploaded Feb 14, 2026, Source

Built Distribution

autosar_pdf2txt-2.0.0-py3-none-any.whl (78.9 kB), uploaded Feb 14, 2026, Python 3

File details

Details for the file autosar_pdf2txt-2.0.0.tar.gz.

File metadata

Download URL: autosar_pdf2txt-2.0.0.tar.gz
Upload date: Feb 14, 2026
Size: 73.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for autosar_pdf2txt-2.0.0.tar.gz
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | `974f341adb9f6a34734c211c295c9c9050c66f8f02b8c3b88912a3de55b0c0a5` |
| MD5 | `a8c5b1f176ba5b19f7e1872e0cdbf1e9` |
| BLAKE2b-256 | `d3f241e47105267d3d9e844b04d98fd7a57d42827468594f7ad892e031f4fc3d` |

File details

Details for the file autosar_pdf2txt-2.0.0-py3-none-any.whl.

File metadata

Download URL: autosar_pdf2txt-2.0.0-py3-none-any.whl
Upload date: Feb 14, 2026
Size: 78.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for autosar_pdf2txt-2.0.0-py3-none-any.whl
| Algorithm | Hash digest |
|-----------|-------------|
| SHA256 | `7d2db13d408af16843d022d3fa45919efa72bc64168d4fdd7815fb62148ed5d8` |
| MD5 | `2d840b20f56fed254e259e1d1756334f` |
| BLAKE2b-256 | `88c8b16cdb0ba8de53bec6a504f7a600070f07693fe493a2e2d4ad003cbd3b2c` |
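
A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is a generic example, not something shipped with the package; the helper names `sha256_of` and `matches_published_hash` are hypothetical, and the demonstration uses a temporary file instead of a real download.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a temporary file standing in for a downloaded archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    ok = matches_published_hash(sample, hashlib.sha256(b"abc").hexdigest())
    bad = matches_published_hash(sample, "0" * 64)
```

To check a real download, pass the path of the wheel or source archive together with the SHA256 value listed for that file.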
