D361: Robust Offline Documentation Generator

🚀 TL;DR

D361 is a robust, enterprise-grade Python package that creates comprehensive offline versions of Document360 knowledge bases and other sitemap-based documentation sites. It's the generic, reusable foundation of the Document360 unified toolkit, designed for reliability, performance, and seamless integration.

Quick Start:

# Install and generate offline docs in one command
pip install d361 && playwright install chromium
d361-offline all --map-url="https://docs.example.com/sitemap-en.xml" --output-dir="offline_docs"

# Or use the standalone binary (no Python required)
curl -L -o d361-offline https://github.com/twardoch/d361/releases/latest/download/d361-offline-ubuntu-latest
chmod +x d361-offline && ./d361-offline all --map-url="https://docs.example.com/sitemap-en.xml"

Key Features:

  • 🎯 Complete Documentation Capture - Intelligently extracts entire documentation structures
  • 🚀 Multi-Strategy Parsing - Robust sitemap parsing with multiple fallback mechanisms
  • 🤖 Browser Automation - Playwright-based extraction with stealth techniques for dynamic content
  • 📱 Multi-Format Output - HTML, Markdown, and combined documentation files
  • ⚡ Performance Optimized - Concurrent downloads with intelligent retry logic
  • 🔄 Navigation Preservation - Maintains original site structure for intuitive offline browsing

📦 What is D361?

D361 is the offline documentation generator at the base of the Document360 unified toolkit. As a standalone package, it extracts, processes, and organizes documentation content for offline access.

Core Purpose: D361 automates the complete process of downloading entire Document360 sites (or other sitemap-based documentation) and converting them into comprehensive, browsable offline formats. It's designed to handle the complexities of modern documentation sites, including dynamic content, virtual scrolling, and complex navigation structures.

The D361 Workflow:

  1. ๐Ÿ” Multi-Strategy Discovery - Advanced sitemap parsing with multiple fallback mechanisms
  2. ๐Ÿ—บ๏ธ Dynamic Structure Extraction - Intelligently maps navigation hierarchies from live sites
  3. โšก Concurrent Content Fetching - High-performance parallel downloading with retry logic
  4. ๐Ÿ”„ Multi-Format Processing - Converts content to HTML, Markdown, and combined formats
  5. ๐Ÿ“ Intelligent Organization - Creates structured offline archives with preserved navigation

Result: A complete, self-contained documentation snapshot that works entirely offline.

🎯 Who Uses D361?

Enterprise Documentation Teams:

  • Technical Writers - Archive documentation versions, perform offline reviews, and create distribution packages
  • DevOps Engineers - Integrate offline documentation into deployment pipelines and container images
  • Support Engineers - Access knowledge bases instantly in customer support scenarios
  • Compliance Teams - Create immutable documentation snapshots for regulatory requirements

Development & Integration:

  • Software Developers - Bundle documentation with applications for offline environments
  • System Integrators - Deploy documentation in air-gapped or restricted network environments
  • CI/CD Pipelines - Automated documentation processing and archival as part of build processes
  • Documentation Toolkit Builders - Use D361 as a foundational component (like in vexy-help)

Specialized Use Cases:

  • Industrial/Manufacturing - Offline documentation access on factory floors and production environments
  • Healthcare/Government - Secure, compliant documentation in regulated environments
  • Field Service - Technical documentation for remote locations with limited connectivity
  • Training & Education - Portable documentation packages for distributed learning

🚀 Why Choose D361?

🔧 Technical Excellence:

  • Robust Architecture - Handles complex modern documentation sites with dynamic content
  • Enterprise Performance - Concurrent processing with intelligent retry mechanisms and error handling
  • Multiple Fallback Strategies - Ensures successful content extraction even with challenging sites
  • Format Flexibility - Outputs HTML, Markdown, and combined formats for different use cases

๐ŸŒ Real-World Reliability:

  • Production-Tested - Successfully processes large-scale documentation sites with thousands of pages
  • Stealth Browser Automation - Advanced Playwright techniques to handle cookie banners, virtual scrolling, and dynamic loading
  • Content Preservation - Maintains original navigation structure, styling, and cross-references
  • Error Resilience - Comprehensive error handling ensures partial success even with network issues

🔄 Integration-Friendly:

  • Standalone Operation - Works independently without external dependencies on other toolkit components
  • API-First Design - Clean programmatic interface for integration into larger workflows
  • Container-Ready - Docker-friendly with minimal resource requirements
  • Cross-Platform - Native support for Linux, macOS, and Windows environments

⚡ Core Features & Capabilities

🎯 Complete Documentation Extraction

D361 employs sophisticated techniques to capture entire documentation ecosystems:

# Content discovery with multiple fallback strategies
import asyncio

from d361.offline.parser import parse_sitemap


async def main():
    # Strategy 1: Direct sitemap parsing
    urls = await parse_sitemap("https://docs.example.com/sitemap-en.xml")

    # Strategy 2: Robots.txt discovery + parsing
    urls = await parse_sitemap("https://docs.example.com/robots.txt", strategy="robots")

    # Strategy 3: Stealth browser automation for protected sites
    urls = await parse_sitemap("https://docs.example.com", strategy="stealth")


asyncio.run(main())

What gets captured:

  • 📄 All article content (HTML + converted Markdown)
  • 🗺️ Complete navigation hierarchy with nested categories
  • 🖼️ Referenced images and media files
  • 🔗 Cross-references and internal links
  • 🎨 Original styling and CSS (optional)

🚀 Multi-Strategy Sitemap Parsing

Robust discovery mechanisms ensure content extraction even from challenging sites:

import asyncio

from d361.offline.d361_offline import D361Offline
from d361.offline.config import Config

config = Config(
    map_url="https://docs.example.com/sitemap-en.xml",
    # Fallback parsing strategies are attempted automatically if the primary one fails
    effort=True,  # Try harder to map all sitemap links in the navigation
    max_concurrent=8,  # Maximum number of concurrent requests
    retries=3  # Retry attempts for failed requests
)


async def main():
    offline_gen = D361Offline(config)
    await offline_gen.prep()  # Sitemap discovery, parsing, and navigation extraction


asyncio.run(main())

Parsing Strategies:

  1. Direct Navigation - Playwright loads the sitemap URL directly
  2. Stealth Browser - Playwright with human-like behavior patterns
  3. HTTP Direct - aiohttp-based lightweight parsing
  4. Robots.txt Discovery - Automatic sitemap URL discovery
  5. Google Cache - Last resort via cached versions

🤖 Advanced Browser Automation

Playwright-powered content extraction handles modern web complexity:

import asyncio

from d361.offline.browser import setup_browser, expand_all_items
# Assumption: extract_tree_structure lives in navigation.py, as the
# "Navigation Extraction" notes suggest; adjust the import if it does not.
from d361.offline.navigation import extract_tree_structure

# Configure stealth browser with realistic parameters
browser_config = {
    'headless': True,
    'user_agent': 'Mozilla/5.0 (compatible; D361 Documentation Archiver)',
    'viewport': {'width': 1920, 'height': 1080},
    'extra_http_headers': {
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br'
    }
}


async def main():
    async with setup_browser(**browser_config) as browser:
        page = await browser.new_page()
        await page.goto("https://docs.example.com")

        # Handle dynamic content loading.
        # page.locator() and .first are synchronous in Playwright, so they are not awaited.
        navigation_tree = page.locator('#left-panel d360-data-list-tree-view').first
        await expand_all_items(navigation_tree, page)  # Recursively expand all navigation

        # Extract complete navigation structure
        return await extract_tree_structure(navigation_tree)


nav_data = asyncio.run(main())

Browser Automation Capabilities:

  • ๐Ÿช Cookie Banner Dismissal - Automatically handles consent dialogs
  • ๐Ÿ“œ Virtual Scrolling - Loads all content from virtually rendered lists
  • ๐ŸŒณ Dynamic Tree Expansion - Recursively expands navigation hierarchies
  • โฑ๏ธ Network Idle Detection - Waits for complete content loading
  • ๐Ÿ”„ Retry Logic - Handles intermittent failures gracefully

๐Ÿ“ฑ Multi-Format Output Generation

Flexible output formats for different consumption needs:

import asyncio
from pathlib import Path

from d361.offline.config import Config
from d361.offline.d361_offline import D361Offline

# Configure output formats and customization
config = Config(
    map_url="https://docs.example.com/sitemap-en.xml",
    output_dir=Path("./offline_docs"),
    css_file=Path("./custom-styling.css"),  # Custom CSS for HTML output

    # File naming patterns
    all_docs_html_filename="complete_documentation.html",
    all_docs_md_filename="complete_documentation.md",

    # Processing options
    test=False,  # Process all content (not just test subset)
    verbose=True  # Detailed logging
)


async def main():
    offline_gen = D361Offline(config)
    await offline_gen.all()  # Generate all formats


asyncio.run(main())

Generated Output Structure:

offline_docs/docs.example.com/
├── prep.json              # Sitemap discovery metadata
├── fetch.json             # Content extraction results
├── nav.json               # Navigation structure data
├── nav.html               # Standalone navigation menu
├── nav.md                 # Markdown navigation index
├── all_docs.html          # Complete HTML with embedded navigation
├── all_docs.md            # Complete Markdown with TOC
├── html/                  # Individual HTML pages
│   ├── getting-started.html
│   ├── api-reference.html
│   └── ...
└── md/                    # Individual Markdown pages
    ├── getting-started.md
    ├── api-reference.md
    └── ...

⚡ Performance-Optimized Processing

Enterprise-grade performance with intelligent resource management:

import asyncio
import time

from d361.offline.config import Config
from d361.offline.d361_offline import D361Offline

# Performance-tuned configuration
config = Config(
    map_url="https://docs.example.com/sitemap-en.xml",
    max_concurrent=12,        # Concurrent page fetching
    timeout=60,              # Per-page timeout (seconds)
    retries=5,               # Retry attempts for failed pages
    pause=0,                 # No artificial delays (max speed)
)


async def main():
    # Monitor performance during processing
    offline_gen = D361Offline(config)

    start_time = time.time()
    result = await offline_gen.all()
    processing_time = time.time() - start_time

    page_count = len(result['content'])
    print(f"Processed {page_count} pages in {processing_time:.2f}s")
    if page_count:  # avoid dividing by zero when nothing was fetched
        print(f"Average: {processing_time / page_count:.3f}s per page")


asyncio.run(main())

Performance Features:

  • 🚀 Concurrent Downloads - Configurable parallel processing (default: 5 concurrent)
  • 🔄 Exponential Backoff - Intelligent retry delays with tenacity library
  • 💾 Memory Efficient - Streaming content processing to minimize memory usage
  • 📊 Progress Tracking - Real-time processing status and performance metrics
  • ⚡ Network Optimization - Connection pooling and keep-alive for HTTP efficiency
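
To make the first two features concrete, here is a rough sketch of bounded concurrency combined with exponential-backoff retries. D361 itself uses the tenacity library for retries; this sketch swaps in plain asyncio so it runs with the standard library alone. `fetch_with_retry`, `fetch_all`, and `flaky_fetch` are hypothetical names, and the fetcher is a stand-in that performs no network I/O.

```python
import asyncio


async def fetch_with_retry(fetch, url, retries=3, base_delay=0.01):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries:
                raise
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2**attempt)


async def fetch_all(fetch, urls, max_concurrent=5, retries=3):
    """Fetch every URL with at most max_concurrent requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url):
        async with semaphore:
            try:
                return url, await fetch_with_retry(fetch, url, retries)
            except Exception as exc:
                # Record the failure and carry on, so one bad page
                # does not stop the whole batch.
                return url, exc

    return dict(await asyncio.gather(*(worker(u) for u in urls)))


# Stand-in fetcher: the first attempt per URL fails, "/broken" always fails.
attempts = {}


async def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    if url.endswith("/broken"):
        raise ConnectionError("always fails")
    if attempts[url] == 1:
        raise TimeoutError("first attempt times out")
    return f"<html>{url}</html>"


urls = [f"https://docs.example.com/page-{i}" for i in range(4)]
urls.append("https://docs.example.com/broken")
results = asyncio.run(fetch_all(flaky_fetch, urls, max_concurrent=2, retries=2))
ok = [u for u, r in results.items() if not isinstance(r, Exception)]
print(f"{len(ok)} of {len(urls)} pages fetched")
```

Returning the exception as the value for a failed URL is one simple way to get partial success: callers can report what failed without losing the pages that did download.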

Installation

D361 can be installed in multiple ways depending on your needs:

Quick Installation (Recommended)

# One-line installation script
curl -sSL https://raw.githubusercontent.com/twardoch/d361/main/scripts/install.sh | bash

Manual Installation

Via pip:

pip install d361
playwright install chromium

Via uv (faster):

uv pip install d361
playwright install chromium

Binary Download (No Python required):

# Linux
curl -L -o d361-offline https://github.com/twardoch/d361/releases/latest/download/d361-offline-ubuntu-latest
chmod +x d361-offline

# macOS
curl -L -o d361-offline https://github.com/twardoch/d361/releases/latest/download/d361-offline-macos-latest
chmod +x d361-offline

# Windows
curl -L -o d361-offline.exe https://github.com/twardoch/d361/releases/latest/download/d361-offline-windows-latest.exe

Installation Options

The installation script supports various options:

# Install specific version
./scripts/install.sh --version 1.0.0

# Install via specific method
./scripts/install.sh --method binary

# Install with Playwright browsers
./scripts/install.sh --install-browsers

# Install to custom directory
./scripts/install.sh --install-dir ~/.local/bin

# See all options
./scripts/install.sh --help

Command Line Usage

The package provides a command-line interface d361-offline with several operations. The main commands are prep, fetch, build, and all.

1. all (Recommended for most users): Runs the entire process: preparation, fetching, and building.

d361-offline all --map-url="https://docs.example.com/sitemap-en.xml" --output-dir="my_offline_docs"
  • --map-url: (Required) The URL to your Document360 sitemap (usually ends with sitemap-en.xml or similar).
  • --output-dir: (Optional) The directory where offline documentation will be saved. Defaults to a folder named after the domain in the current directory (e.g., ./docs.example.com/).
  • --style: (Optional) Path to a custom CSS file to style the HTML output.
  • --nav-url: (Optional) URL of a specific page to extract navigation from. If not provided, uses the first URL from the sitemap.

2. Individual Steps (for advanced control):

  • prep: Parses the sitemap and extracts the navigation structure.

    d361-offline prep --map-url="https://docs.example.com/sitemap-en.xml" --output-dir="my_docs"
    

    This creates a prep.json file in the output directory.

  • fetch: Downloads the content for all URLs found in the prep phase.

    d361-offline fetch --prep-file="my_docs/prep.json" --output-dir="my_docs"
    

    This creates a fetch.json file and saves individual HTML/Markdown pages.

  • build: Generates the final combined documentation files from the fetched content.

    d361-offline build --fetch-file="my_docs/fetch.json" --output-dir="my_docs" --style="path/to/custom.css"
    

Getting Help: For a full list of options for each command, use d361-offline <command> --help. For example: d361-offline all --help.
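
When scripting the individual steps, it can help to build the three command lines in one place. The sketch below only assembles and prints the argument lists using the flags documented above; `phase_commands` is a hypothetical helper, and the file locations follow the examples in this section rather than anything read from the tool itself.

```python
import shlex


def phase_commands(map_url: str, output_dir: str, style: str = "") -> list[list[str]]:
    """Build argv lists for running prep, fetch, and build in sequence."""
    build = [
        "d361-offline", "build",
        f"--fetch-file={output_dir}/fetch.json",
        f"--output-dir={output_dir}",
    ]
    if style:
        build.append(f"--style={style}")
    return [
        ["d361-offline", "prep", f"--map-url={map_url}", f"--output-dir={output_dir}"],
        ["d361-offline", "fetch", f"--prep-file={output_dir}/prep.json", f"--output-dir={output_dir}"],
        build,
    ]


commands = phase_commands("https://docs.example.com/sitemap-en.xml", "my_docs")
for argv in commands:
    print(shlex.join(argv))
    # To execute for real: subprocess.run(argv, check=True)
```

Passing each command as a list to `subprocess.run(..., check=True)` avoids shell quoting issues and stops the sequence as soon as one phase fails.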

Programmatic Usage

You can also use D361 from your Python scripts:

import asyncio
from pathlib import Path
from d361.offline.config import Config
from d361.offline.d361_offline import D361Offline

async def generate_my_docs():
    # Configure the generator
    # Ensure map_url is provided
    sitemap_url = "https://docs.example.com/sitemap-en.xml" # Replace with actual sitemap URL
    if not sitemap_url:
        raise ValueError("map_url must be set for Config")

    config = Config(
        map_url=sitemap_url,
        output_dir=Path("custom_offline_docs"),  # Output will be in ./custom_offline_docs/docs.example.com/
        css_file=Path("styles/my_custom_style.css") if Path("styles/my_custom_style.css").exists() else None,
        max_concurrent=5,  # Number of parallel downloads
        retries=3,         # Number of retries for failed requests
        timeout=60,        # Timeout for page loads in seconds
        verbose=False,     # Set to True for detailed logging
        test=False         # Set to True to process only a few items for testing
    )

    # Create an instance of the offline generator
    offline_generator = D361Offline(config)

    try:
        print(f"Starting offline generation for {config.map_url}...")
        print(f"Output will be saved to: {config.output_dir.resolve()}")

        # Run the entire process: prep, fetch, and build
        await offline_generator.all()

        # Alternatively, run individual phases:
        # print("Running prep phase...")
        # prep_data = await offline_generator.prep()
        # print(f"Prep phase complete. Data saved to {config.prep_file}")

        # print("Running fetch phase...")
        # fetch_data = await offline_generator.fetch(prep_file=config.prep_file)
        # print(f"Fetch phase complete. Data saved to {config.fetch_file}")

        # print("Running build phase...")
        # await offline_generator.build(fetch_file=config.fetch_file)
        # print("Build phase complete.")

        print("Offline documentation generated successfully!")
        print(f"Combined HTML: {config.output_dir / config.all_docs_html_filename}")
        print(f"Combined Markdown: {config.output_dir / config.all_docs_md_filename}")

    except Exception as e:
        print(f"An error occurred: {e}")
        import traceback
        traceback.print_exc()

if __name__ == "__main__":
    # Ensure Playwright browsers are installed:
    # Run `playwright install` in your terminal if you haven't already.
    asyncio.run(generate_my_docs())

Configuration Options

The behavior of d361-offline is controlled by the Config model (see src/d361/offline/config.py). Key options include:

| Option (Config field) | CLI Argument | Description | Default (from Config) |
| --- | --- | --- | --- |
| map_url | --map-url | URL of the sitemap (e.g., sitemap.xml) | Required (None by default, must be set) |
| nav_url | --nav-url | URL of a page to extract navigation from (optional) | None |
| output_dir | --output-dir | Base directory for output. A subfolder named after the domain will be created here. | Current working directory |
| css_file | --style (build) | Path to custom CSS file for styling HTML output | None |
| effort | --effort (prep) | Try harder to map all sitemap links in navigation | False |
| max_concurrent | --parallel | Maximum number of concurrent download requests | 5 |
| retries | --retries | Number of retry attempts for failed requests | 3 |
| timeout | --timeout | Request timeout in seconds for page loads | 60 |
| verbose | --verbose | Enable verbose (DEBUG level) logging | False |
| test | --test (prep, fetch) | Test mode: process only a few items (typically 5) | False |
| pause | --wait (prep) | Pause during navigation extraction (for debugging browser) | False (numeric value for seconds in CLI) |

Note: Default output_dir behavior: If map_url is https://docs.example.com/... and output_dir is my_docs (or not set, defaulting to current dir), the actual output path will be my_docs/docs.example.com/.
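
The rule in the note can be sketched in a few lines. This is an illustration of the described behavior, not the Config model's code: `resolve_output_dir` is a hypothetical helper, and the real implementation may treat ports or unusual URLs differently.

```python
from pathlib import Path
from urllib.parse import urlparse


def resolve_output_dir(map_url: str, output_dir=None) -> Path:
    """Return <output_dir>/<domain>, defaulting to the current directory."""
    base = Path(output_dir) if output_dir is not None else Path.cwd()
    domain = urlparse(map_url).hostname
    if not domain:
        raise ValueError(f"cannot derive a domain from {map_url!r}")
    return base / domain


print(resolve_output_dir("https://docs.example.com/sitemap-en.xml", "my_docs"))
print(resolve_output_dir("https://docs.example.com/sitemap-en.xml"))
```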

Output Structure

The generated offline documentation will be organized as follows in your specified output directory (e.g., output_dir/your_domain_com/):

output_dir/your_domain_com/
├── prep.json          # Intermediate data from preparation phase (URLs, navigation)
├── fetch.json         # Intermediate data from fetch phase (content map)
├── nav.json           # Navigation structure in JSON format
├── nav.html           # Standalone HTML version of the navigation menu
├── nav.md             # Markdown version of the navigation menu
├── all_docs.html      # Combined HTML documentation with navigation and styling
├── all_docs.md        # Combined Markdown documentation with a table of contents
├── html/              # Directory for individual HTML page files
│   ├── page-slug-1.html
│   ├── page-slug-2.html
│   └── ...
└── md/                # Directory for individual Markdown page files
    ├── page-slug-1.md
    ├── page-slug-2.md
    └── ...

If a custom CSS file is used, it will be copied into html/assets/ and linked in all_docs.html.
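
A quick way to confirm that a run produced the layout shown above is to check for the expected files and directories. The sketch below assumes the default file names from the tree; if `all_docs_html_filename` or `all_docs_md_filename` were customized, the lists would need adjusting. `missing_outputs` is a hypothetical helper, and the demonstration uses a temporary directory rather than real output.

```python
import tempfile
from pathlib import Path

# Default names taken from the tree above.
EXPECTED_FILES = ["prep.json", "fetch.json", "nav.json", "nav.html",
                  "nav.md", "all_docs.html", "all_docs.md"]
EXPECTED_DIRS = ["html", "md"]


def missing_outputs(site_dir) -> list[str]:
    """List expected files and directories that are absent from site_dir."""
    site_dir = Path(site_dir)
    missing = [name for name in EXPECTED_FILES if not (site_dir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (site_dir / name).is_dir()]
    return missing


# Demonstration: simulate a run that stopped after the fetch phase.
with tempfile.TemporaryDirectory() as tmp:
    site = Path(tmp) / "docs.example.com"
    (site / "html").mkdir(parents=True)
    for name in ["prep.json", "fetch.json", "nav.json"]:
        (site / name).write_text("{}", encoding="utf-8")
    report = missing_outputs(site)

print(report)
```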

Part 2: Technical Deep-Dive

This section describes how D361 works internally, its architecture, and guidelines for contributors.

How the Code Works

D361 operates in a three-phase workflow: Prep, Fetch, and Build. All operations are asynchronous using Python's asyncio library for efficient I/O and concurrency.

Core Workflow:

  1. Prep Phase:

    • Parses the sitemap (map_url) to get a list of all unique page URLs.
    • Extracts the navigation structure (table of contents) from a specified page (nav_url or the first sitemap URL).
    • Saves this information (urls, navigation, config) into prep.json.
    • Generates nav.json, nav.html, and nav.md.
  2. Fetch Phase:

    • Reads prep.json.
    • For each URL, fetches the page content (title, HTML body, Markdown version).
    • Saves individual page content as html/<slug>.html and md/<slug>.md.
    • Saves all fetched content mapped by URL, along with the navigation structure and config, into fetch.json.
  3. Build Phase:

    • Reads fetch.json.
    • Generates all_docs.html: A single HTML file containing all articles, prepended with the navigation menu and linked to the specified CSS.
    • Generates all_docs.md: A single Markdown file containing all articles, with a generated table of contents at the top.
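
As an illustration of the build phase, the table of contents for a combined Markdown file can be derived by walking the navigation tree recursively. This is a sketch under assumptions: the node keys `title`, `link`, and `children` mirror the fields the navigation extractor is described as collecting, but the exact nav.json schema is not shown here, and `slug_anchor` and `build_toc` are hypothetical helpers.

```python
def slug_anchor(title: str) -> str:
    """Lower-case the title and join its alphanumeric words with hyphens."""
    words = "".join(c if c.isalnum() else " " for c in title.lower()).split()
    return "-".join(words)


def build_toc(nodes, depth=0):
    """Return Markdown list lines for a nested navigation structure."""
    lines = []
    for node in nodes:
        indent = "  " * depth
        lines.append(f"{indent}- [{node['title']}](#{slug_anchor(node['title'])})")
        lines.extend(build_toc(node.get("children", []), depth + 1))
    return lines


# Assumed shape of the navigation data (illustrative only).
nav = [
    {
        "title": "Getting Started",
        "link": "https://docs.example.com/getting-started",
        "children": [
            {"title": "Installation & Setup", "link": "https://docs.example.com/install", "children": []},
        ],
    },
    {"title": "API Reference", "link": "https://docs.example.com/api-reference", "children": []},
]

toc = "\n".join(build_toc(nav))
print(toc)
```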

Key Components and Processes:

  • Configuration (src/d361/offline/config.py):

    • The Config class (Pydantic model) manages all settings. It validates inputs, computes default values (like output_dir based on the domain), and provides paths for various output files.
  • Main Orchestrator (src/d361/offline/d361_offline.py):

    • The D361Offline class is the heart of the generator. It takes a Config object and has methods for prep(), fetch(), build(), and all().
    • It coordinates interactions between other modules.
  • Command Line Interface (src/d361/offline/__main__.py):

    • Uses the fire library to expose D361Offline methods and configuration options to the command line as d361-offline prep, fetch, build, all.
  • Sitemap Parsing (src/d361/offline/parser.py):

    • The parse_sitemap function is responsible for fetching and extracting URLs from the sitemap.xml.
    • It employs multiple strategies for robustness:
      1. Direct Playwright navigation (_parse_with_playwright_direct).
      2. Playwright with enhanced stealth techniques (_parse_with_playwright_stealth) to mimic human browsing.
      3. Direct HTTP GET request using aiohttp (_parse_with_aiohttp_direct).
      4. Checking robots.txt for sitemap directives and then parsing found URLs (_parse_with_playwright_via_robots).
      5. As a last resort, it can try Google's web cache of the sitemap.
    • Uses BeautifulSoup (with lxml parser) to parse XML content and extract <loc> tags.
  • Navigation Extraction (src/d361/offline/navigation.py):

    • The extract_navigation function uses Playwright to load the nav_url.
    • This is one of the most complex interactions due to Document360's dynamic UI:
      • Cookie/Consent Handling: Attempts to detect and dismiss various cookie consent banners.
      • Tree Expansion: Locates the main navigation tree element (e.g., #left-panel ... d360-data-list-tree-view). It then calls expand_navigation_tree which uses browser.expand_all_items.
      • expand_all_items (in browser.py) repeatedly scrolls the navigation pane (to load virtually rendered items via scroll_to_bottom) and clicks on collapsed item indicators (e.g., triangle icons) until all navigation nodes are visible.
      • Structure Parsing: extract_tree_structure then iterates over the DOM elements of the expanded tree to rebuild the hierarchical navigation data (titles, links, children).
    • Includes fallback mechanisms if standard Document360 selectors are not found.
  • Content Fetching and Processing (src/d361/offline/content.py, D361Offline.process_url):

    • For each URL, D361Offline.process_url launches a Playwright page.
    • extract_page_content (in content.py) is called:
      • Navigates to the URL.
      • Attempts to dismiss cookie banners.
      • Waits for network idle and content to render.
      • Extracts the page title (trying common selectors like h1.article-title).
      • Extracts the main article HTML content (trying selectors like #articleContent, .article-content).
      • Converts the extracted HTML to Markdown using the markdownify library.
    • The D361Offline class then saves this content to html/<slug>.html and md/<slug>.md. Slugs are generated from URLs.
  • Browser Automation (src/d361/offline/browser.py):

    • setup_browser: Configures and launches Playwright (Chromium by default) with specific arguments to appear more like a regular browser and handle various environments.
    • scroll_to_bottom: Handles scrolling within elements that use virtual scrolling (common in Document360 navigation) to ensure all items are loaded into the DOM.
    • expand_all_items: A sophisticated function to recursively find and click "expand" icons in a tree structure, dealing with items that might only appear after scrolling or previous expansions. It uses multiple selector strategies.
  • Output Generation (D361Offline._generate_combined_files, src/d361/offline/generator.py):

    • D361Offline._generate_combined_files is responsible for creating all_docs.html and all_docs.md.
    • For all_docs.html:
      • It includes a navigation section generated from nav.json.
      • It appends the HTML content of each article, ordered by the navigation structure.
      • It embeds the custom CSS (if provided) or a default style.
    • For all_docs.md:
      • It generates a Table of Contents based on the navigation and article titles.
      • It appends the Markdown content of each article.
    • The generator.py module contains helper functions for creating directories and was initially intended for more granular file generation, though much of that logic is now within D361Offline.
  • Error Handling and Retries:

    • The tenacity library is used in content.extract_page_content to automatically retry page content extraction on failure, using exponential backoff.
    • Individual URL processing errors are logged but generally don't stop the entire batch, allowing the tool to fetch as much content as possible.
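
The `<loc>` extraction step described under Sitemap Parsing can be sketched with the standard library. D361 uses BeautifulSoup with the lxml parser for this; `xml.etree.ElementTree` is swapped in here so the example has no third-party dependencies. `extract_locs` is a hypothetical helper, and the sample sitemap is made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/getting-started</loc></url>
  <url><loc> https://docs.example.com/api-reference </loc></url>
  <url><loc>https://docs.example.com/getting-started</loc></url>
</urlset>
"""


def extract_locs(xml_text: str) -> list[str]:
    """Return unique <loc> URLs from a sitemap, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    seen = {}
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            seen.setdefault(element.text.strip(), None)
    return list(seen)


page_urls = extract_locs(SITEMAP_XML)
print(page_urls)
```

Matching on the local tag name sidesteps the sitemap namespace, and the dictionary keeps the first occurrence of each URL while preserving order.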

Development Environment

This project uses Hatch for managing dependencies, virtual environments, and running development tasks. Hatch leverages uv if available, which significantly speeds up environment setup and package installation.

Setup:

  1. Install Hatch and uv: It's recommended to install uv first, then use it to install hatch.

    # Install uv (refer to official uv documentation for your OS)
    # Example for Linux/macOS:
    curl -LsSf https://astral.sh/uv/install.sh | sh
    # Then install Hatch using uv
    uv pip install hatch
    
  2. Create/Activate Hatch Environment: Navigate to the project root directory and run:

    hatch shell
    

    This command:

    • Creates an isolated virtual environment (e.g., in .hatch/) if one doesn't exist.
    • Installs all project dependencies, including development tools (pytest, ruff, mypy, etc.), using uv if available.
    • Activates the environment in your current shell.
  3. Install Playwright Browsers: After activating the environment, install the necessary browser binaries for Playwright:

    playwright install
    

    This typically installs Chromium, Firefox, and WebKit. D361 primarily uses Chromium.

Running Tasks with Hatch:

Hatch scripts are defined in pyproject.toml under [tool.hatch.envs.*.scripts].

  • Run Tests: The project uses pytest.

    # Run tests with coverage report
    hatch run test:test-cov
    
    # Run tests without coverage
    hatch run test:test
    
  • Linting and Formatting: The project uses Ruff for super-fast linting and formatting, and MyPy for static type checking.

    # Format code and fix lint issues (where possible)
    hatch run lint:fix  # Or an alias: hatch run fix
    
    # Check for linting and formatting issues
    hatch run lint:style # Or an alias: hatch run lint
    
    # Run static type checking
    hatch run lint:typing # Or an alias: hatch run type-check
    
    # Run all checks (style, format, types)
    hatch run lint:all
    
  • Pre-commit Hooks: The project is configured with pre-commit hooks (see .pre-commit-config.yaml). Install them to automatically run checks before each commit:

    pre-commit install
    

Coding and Contribution Guidelines

Contributions are highly welcome! Please adhere to the following guidelines:

  1. Branching Strategy:

    • Create new branches from main for features or bug fixes (e.g., feat/add-new-exporter, fix/navigation-parsing-bug).
  2. Code Style & Quality:

    • Formatting: Code is formatted with Ruff. Run hatch run lint:fix before committing.
    • Linting: Code is linted with Ruff. Ensure hatch run lint:style passes.
    • Type Checking: All code should pass MyPy checks. Run hatch run lint:typing.
    • Pythonic Code: Write clear, readable, and idiomatic Python.
    • Docstrings and Comments: Add docstrings to all public modules, classes, and functions. Use comments for complex logic.
  3. Commit Messages:

    • Follow the Conventional Commits specification.
    • Examples:
      • feat: add support for Confluence sitemap parsing
      • fix: improve resilience of cookie banner dismissal
      • docs: update README with advanced usage examples
      • refactor: simplify content extraction logic
      • test: add unit tests for slug generation
  4. Testing:

    • Write tests for all new features and bug fixes using pytest.
    • Place tests in the tests/ directory, mirroring the structure of src/d361/.
    • Aim for high test coverage. Check coverage with hatch run test:test-cov.
    • Ensure all tests pass locally before submitting a Pull Request.
  5. Pull Requests (PRs):

    • Submit PRs against the main branch.
    • Provide a clear and descriptive title and summary for your PR.
    • Explain the "what" and "why" of your changes. Link to any relevant issues.
    • Ensure all GitHub Actions CI checks (tests, linting, type checking) pass on your PR.
    • Be responsive to feedback and code reviews.
  6. Dependencies:

    • Minimize new dependencies. If adding one, justify its need.
    • Add new dependencies to pyproject.toml under [project.dependencies] or [project.optional-dependencies.dev].

Releases

D361 follows Semantic Versioning and provides multiple distribution formats:

  • PyPI Package: Available on PyPI for pip and uv installation
  • Binary Releases: Pre-built executables for Linux, macOS, and Windows
  • Source Code: Available on GitHub

Each release includes:

  • Source distribution (.tar.gz)
  • Wheel distribution (.whl)
  • Standalone binaries for all platforms
  • Automated testing across Python 3.10-3.12 and multiple operating systems

Release Process

New releases are automatically created when version tags are pushed:

# Create and push a new release tag
git tag v1.0.0
git push origin v1.0.0

This triggers the CI/CD pipeline which:

  1. Runs comprehensive tests on all platforms
  2. Builds Python packages and binaries
  3. Publishes to PyPI
  4. Creates GitHub release with binary artifacts

For development and contribution guidelines, see DEVELOPMENT.md.

License

D361 is licensed under the MIT License. See the LICENSE file for details.
