A project for extracting useful content from unstructured files.

Project description

HTML Content Extraction Tool

A command-line tool for extracting structured content from HTML documents. It converts HTML sections into hierarchical JSON data while preserving formatting, links, and semantic structure.

Features

Hierarchical Parsing: Automatically detects heading levels and creates nested section structures
HTML Preservation: Maintains original formatting, links, and semantic elements
Smart Element Filtering: Keeps elements with meaningful content and filters out scripts, styles, meta elements, and empty elements
Flexible Input/Output: Read from files or stdin, output to files or stdout
Section Support: Works with existing <section>, <article>, and <main> elements
Custom Headings: Supports both standard headings (h1-h6) and custom headings marked with role="heading" and aria-level

Installation

# Install dependencies
pip install beautifulsoup4

# Clone or download this repository
git clone <repository-url>
cd content-extraction

Usage

Basic Usage

# Parse HTML file and output to stdout
python main.py example.html

# Parse with pretty-printed JSON
python main.py --pretty example.html

# Save output to file
python main.py example.html -o output.json

# Read from stdin
cat example.html | python main.py --pretty

# Verbose mode with debug information
python main.py --verbose example.html

Command Line Options

usage: main.py [-h] [-o FILE] [--pretty] [-v] [--version] [input_file]

Extract structured content from HTML documents

positional arguments:
  input_file         Input HTML file (if not provided, reads from stdin)

options:
  -h, --help         show this help message and exit
  -o, --output FILE  Output JSON file (if not provided, writes to stdout)
  --pretty           Pretty-print JSON output with indentation
  -v, --verbose      Show verbose output and debug information
  --version          show program's version number and exit
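
For readers who want to see how an interface like this can be declared, here is a minimal argparse sketch that mirrors the options listed above. It is an illustration only, not the tool's actual source: the function name build_parser and the version string are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the real main.py)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract structured content from HTML documents",
    )
    # Optional positional: when omitted, the tool is documented to read stdin.
    parser.add_argument(
        "input_file",
        nargs="?",
        help="Input HTML file (if not provided, reads from stdin)",
    )
    parser.add_argument(
        "-o", "--output",
        metavar="FILE",
        help="Output JSON file (if not provided, writes to stdout)",
    )
    parser.add_argument(
        "--pretty",
        action="store_true",
        help="Pretty-print JSON output with indentation",
    )
    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Show verbose output and debug information",
    )
    # The version string is a placeholder chosen for this sketch.
    parser.add_argument("--version", action="version", version="%(prog)s 0.3.0")
    return parser


args = build_parser().parse_args(["--pretty", "example.html", "-o", "output.json"])
print(args.input_file, args.output, args.pretty, args.verbose)
```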

Output Format

The tool outputs JSON with the following structure:

{
  "title": "Section Title",
  "text": "<p>HTML content preserved</p>",
  "level": 1,
  "subsections": [
    {
      "title": "Subsection Title",
      "text": "<p>Subsection content</p>",
      "level": 2,
      "subsections": []
    }
  ]
}

Fields

title: Text content of the highest-level heading in the section
text: All content except headings, with HTML formatting preserved
level: ARIA level of the main heading (1-6, or custom levels)
subsections: Array of nested subsections with the same structure
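
Because every subsection has the same four fields, the output can be consumed with a short recursive function. The sketch below prints an indented outline of titles; the function name outline is made up for this example, and the JSON is the sample structure shown above.

```python
import json


def outline(section: dict, depth: int = 0) -> list[str]:
    """Return one indented line per section title, depth-first."""
    lines = ["  " * depth + f"{section['title']} (level {section['level']})"]
    for child in section["subsections"]:
        lines.extend(outline(child, depth + 1))
    return lines


doc = json.loads("""
{
  "title": "Section Title",
  "text": "<p>HTML content preserved</p>",
  "level": 1,
  "subsections": [
    {
      "title": "Subsection Title",
      "text": "<p>Subsection content</p>",
      "level": 2,
      "subsections": []
    }
  ]
}
""")

print("\n".join(outline(doc)))
```

In practice the JSON would come from the tool's stdout or from the file passed to -o.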

Examples

Simple Section

Input HTML:

<section>
    <h2>Getting Started</h2>
    <p>Welcome to our <a href="/api">API</a>!</p>
    <ul>
        <li>Step 1: Register</li>
        <li>Step 2: Get API key</li>
    </ul>
</section>

Output:

{
  "title": "Getting Started",
  "text": "<p>Welcome to our <a href=\"/api\">API</a>!</p>\n<ul>\n<li>Step 1: Register</li>\n<li>Step 2: Get API key</li>\n</ul>",
  "level": 2,
  "subsections": []
}

Nested Sections

Input HTML:

<main>
    <h1>Documentation</h1>
    <p>Introduction text.</p>
    <h2>Installation</h2>
    <p>Installation instructions.</p>
    <h3>Requirements</h3>
    <p>System requirements.</p>
    <h2>Usage</h2>
    <p>Usage examples.</p>
</main>

Output:

{
  "title": "Documentation",
  "text": "<p>Introduction text.</p>",
  "level": 1,
  "subsections": [
    {
      "title": "Installation",
      "text": "<p>Installation instructions.</p>",
      "level": 2,
      "subsections": [
        {
          "title": "Requirements",
          "text": "<p>System requirements.</p>",
          "level": 3,
          "subsections": []
        }
      ]
    },
    {
      "title": "Usage",
      "text": "<p>Usage examples.</p>",
      "level": 2,
      "subsections": []
    }
  ]
}
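
The nesting in this example follows from heading levels alone: a heading becomes a child of the nearest preceding heading with a smaller level. One common way to get that behavior is a stack of open sections. The sketch below works on a flat list of (level, title, text) tuples rather than real HTML, and the name nest_sections is an assumption for illustration. It is not necessarily how the tool is implemented.

```python
import json


def nest_sections(flat):
    """Build nested section dicts from (level, title, text) tuples in document order.

    Returns the list of top-level sections.
    """
    top = {"level": 0, "subsections": []}  # virtual parent, shallower than any heading
    stack = [top]                          # currently open sections, outermost first
    for level, title, text in flat:
        node = {"title": title, "text": text, "level": level, "subsections": []}
        # Close every open section that is not strictly shallower than this heading.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["subsections"].append(node)
        stack.append(node)
    return top["subsections"]


flat = [
    (1, "Documentation", "<p>Introduction text.</p>"),
    (2, "Installation", "<p>Installation instructions.</p>"),
    (3, "Requirements", "<p>System requirements.</p>"),
    (2, "Usage", "<p>Usage examples.</p>"),
]

print(json.dumps(nest_sections(flat)[0], indent=2))
```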

Supported HTML Elements

Included Elements

Paragraphs (<p>)
Lists (<ul>, <ol>, <li>)
Links (<a>)
Formatting (<strong>, <em>, <code>, etc.)
Semantic elements (<section>, <article>, <aside>, etc.)
Tables (<table>, <tr>, <td>, etc.)
Media (<img>, <figure>)
Code blocks (<pre>, <code>)
Quotes (<blockquote>, <q>)
All other content elements with meaningful text

Excluded Elements

Headings (processed separately as section titles)
Script and style tags
Meta elements
Empty elements
Elements containing headings (processed as subsections)

Smart Root Element Detection

The tool automatically detects the best root element in this priority order:

1. <main> - Primary content area
2. <article> - Standalone article content
3. <section> - Document section
4. <body> - Document body
5. First substantial <div> - Fallback for div-based layouts
6. Entire document - Last resort
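
The priority order can be pictured as a first-match lookup. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, and it only reports which tag would be chosen. The names ROOT_PRIORITY and pick_root_tag are assumptions, and the sketch treats any <div> as acceptable because this README does not define what "substantial" means.

```python
from html.parser import HTMLParser
from typing import Optional

# Candidate root tags, highest priority first (taken from the list above).
ROOT_PRIORITY = ["main", "article", "section", "body", "div"]


class _TagCollector(HTMLParser):
    """Records the name of every start tag seen in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)  # HTMLParser already lowercases tag names


def pick_root_tag(html: str) -> Optional[str]:
    """Return the highest-priority candidate tag present, or None.

    None corresponds to the last resort of using the entire document.
    """
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    for tag in ROOT_PRIORITY:
        if tag in collector.seen:
            return tag
    return None


print(pick_root_tag("<body><article><h1>Title</h1></article></body>"))
```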

Advanced Features

Custom Headings

Supports custom headings with ARIA attributes:

<div role="heading" aria-level="2">Custom Heading</div>

Aria Level Overrides

Standard headings can have their levels overridden:

<h3 aria-level="1">This is treated as level 1</h3>
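
The two rules above (custom headings and aria-level overrides) can be combined into one level-resolution function. This is a sketch under assumptions: the name heading_level is hypothetical, attributes are passed as a plain dict, and an element with role="heading" but no valid aria-level is left unresolved because this README does not say how the tool treats that case.

```python
import re
from typing import Mapping, Optional

_H_TAG = re.compile(r"h([1-6])")
_POSITIVE_INT = re.compile(r"[1-9][0-9]*")


def heading_level(tag: str, attrs: Mapping[str, str]) -> Optional[int]:
    """Return the heading level for an element, or None if it is not a heading."""
    native = _H_TAG.fullmatch(tag.lower())
    is_custom = attrs.get("role") == "heading"
    if not native and not is_custom:
        return None  # aria-level on a non-heading element is ignored here
    aria = attrs.get("aria-level", "")
    if _POSITIVE_INT.fullmatch(aria):
        return int(aria)  # an explicit aria-level wins over the tag name
    if native:
        return int(native.group(1))
    return None  # role="heading" without aria-level: left undecided in this sketch


print(heading_level("h3", {"aria-level": "1"}))
print(heading_level("div", {"role": "heading", "aria-level": "2"}))
```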

Mixed Content

Handles complex layouts with mixed content types:

<div>
    <h1>Main Title</h1>
    <p>Introduction</p>
    <section>
        <h2>Section in Section</h2>
        <p>Section content</p>
    </section>
    <h2>Regular Heading</h2>
    <p>Regular content</p>
</div>

Testing

Run the test suite:

python -m pytest tests/ -v

The project includes comprehensive tests covering:

Basic parsing functionality
Heading level detection
Content extraction
Section handling
Edge cases and error conditions

License

This project is open source. See LICENSE file for details.

Contributing

Contributions are welcome! Please submit pull requests with tests for any new features.

Project details

Release history

0.5.0 - Sep 26, 2025
0.4.4 - Sep 12, 2025
0.4.3 - Sep 11, 2025
0.4.2 - Sep 11, 2025
0.4.1 - Sep 11, 2025
0.4.0 - Sep 11, 2025
0.3.1 - Jul 25, 2025
0.3.0 - Jul 25, 2025 (this version)
0.2.0 - Jul 25, 2025
0.1.0 - Jul 25, 2025

Download files

Source Distribution

content_extraction-0.3.0.tar.gz (21.3 kB)

Uploaded Jul 25, 2025 Source

Built Distribution

content_extraction-0.3.0-py3-none-any.whl (22.7 kB)

Uploaded Jul 25, 2025 Python 3

File details

Details for the file content_extraction-0.3.0.tar.gz.

File metadata

Download URL: content_extraction-0.3.0.tar.gz
Upload date: Jul 25, 2025
Size: 21.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for content_extraction-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`a049c7f0c1d7c6f9e4770c5ff22693d3a976bc338fc306eb69e424e2b280bc6f`
MD5	`771d05102dcfc2350f028da13b5b6792`
BLAKE2b-256	`fb06cea34c3c23a037fbe2555766dbcfdbb00e441b97d323ca0a96198c0efc9c`

Provenance

The following attestation bundles were made for content_extraction-0.3.0.tar.gz:

Publisher: python-package.yml on ChrisW-priv/html-chunking

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: content_extraction-0.3.0.tar.gz
- Subject digest: a049c7f0c1d7c6f9e4770c5ff22693d3a976bc338fc306eb69e424e2b280bc6f
- Sigstore transparency entry: 311878422
- Sigstore integration time: Jul 25, 2025
Source repository:
- Permalink: ChrisW-priv/html-chunking@71b7a8f54e6b616f424864871045b92d5ea32f0c
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/ChrisW-priv
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@71b7a8f54e6b616f424864871045b92d5ea32f0c
- Trigger Event: release

File details

Details for the file content_extraction-0.3.0-py3-none-any.whl.

File metadata

Download URL: content_extraction-0.3.0-py3-none-any.whl
Upload date: Jul 25, 2025
Size: 22.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for content_extraction-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f1a01e87f4ad8ca7a9ea70b7f853554a89ec08071e164e415c510eac11978a90`
MD5	`6a4c56a52be2dabce0d2f4fa1d296d67`
BLAKE2b-256	`ef4cd2e744d32b18a27770052ae5f9ca16e5bf04513e87f77295846202dc4cd2`

Provenance

The following attestation bundles were made for content_extraction-0.3.0-py3-none-any.whl:

Publisher: python-package.yml on ChrisW-priv/html-chunking

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: content_extraction-0.3.0-py3-none-any.whl
- Subject digest: f1a01e87f4ad8ca7a9ea70b7f853554a89ec08071e164e415c510eac11978a90
- Sigstore transparency entry: 311878450
- Sigstore integration time: Jul 25, 2025
Source repository:
- Permalink: ChrisW-priv/html-chunking@71b7a8f54e6b616f424864871045b92d5ea32f0c
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/ChrisW-priv
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@71b7a8f54e6b616f424864871045b92d5ea32f0c
- Trigger Event: release

content-extraction 0.3.0
