Project dedicated to content extraction from unstructured files that contain some useful information.
HTML Content Extraction Tool
A powerful command-line tool for extracting structured content from HTML documents. Converts HTML sections into hierarchical JSON data while preserving formatting, links, and semantic structure.
Features
- Hierarchical Parsing: Automatically detects heading levels and creates nested section structures
- HTML Preservation: Maintains original formatting, links, and semantic elements
- Smart Element Filtering: Includes meaningful content while filtering out irrelevant elements
- Flexible Input/Output: Read from files or stdin, output to files or stdout
- Section Support: Works with existing <section>, <article>, and <main> elements
- Custom Headings: Supports both standard headings (h1-h6) and custom headings with aria-level
Installation
# Install dependencies
pip install beautifulsoup4
# Clone or download this repository
git clone <repository-url>
cd content-extraction
Usage
Basic Usage
# Parse HTML file and output to stdout
python main.py example.html
# Parse with pretty-printed JSON
python main.py --pretty example.html
# Save output to file
python main.py example.html -o output.json
# Read from stdin
cat example.html | python main.py --pretty
# Verbose mode with debug information
python main.py --verbose example.html
Command Line Options
usage: main.py [-h] [-o FILE] [--pretty] [-v] [--version] [input_file]
Extract structured content from HTML documents
positional arguments:
  input_file         Input HTML file (if not provided, reads from stdin)
options:
  -h, --help         show this help message and exit
  -o, --output FILE  Output JSON file (if not provided, writes to stdout)
  --pretty           Pretty-print JSON output with indentation
  -v, --verbose      Show verbose output and debug information
  --version          show program's version number and exit
Output Format
The tool outputs JSON with the following structure:
{
  "title": "Section Title",
  "text": "<p>HTML content preserved</p>",
  "level": 1,
  "subsections": [
    {
      "title": "Subsection Title",
      "text": "<p>Subsection content</p>",
      "level": 2,
      "subsections": []
    }
  ]
}
Fields
- title: Text content of the highest-level heading in the section
- text: All content except headings, with HTML formatting preserved
- level: Aria level of the main heading (1-6, or custom levels)
- subsections: Array of nested subsections with the same structure
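Because every subsection repeats the same four fields, the output can be walked with a short recursive function. The following is a minimal sketch of a consumer, not part of the tool itself; the helper name iter_sections is hypothetical, and the sample data is the structure shown above.

```python
import json


def iter_sections(section, path=()):
    """Yield (path, level) for a section and all of its nested subsections.

    `iter_sections` is a hypothetical helper, not an API of the tool.
    """
    current = path + (section["title"],)
    yield current, section["level"]
    for child in section.get("subsections", []):
        yield from iter_sections(child, current)


# Sample document copied from the structure documented above.
doc = json.loads("""
{
  "title": "Section Title",
  "text": "<p>HTML content preserved</p>",
  "level": 1,
  "subsections": [
    {
      "title": "Subsection Title",
      "text": "<p>Subsection content</p>",
      "level": 2,
      "subsections": []
    }
  ]
}
""")

for path, level in iter_sections(doc):
    print(level, " > ".join(path))
```

Carrying the path of titles along makes it easy to build a breadcrumb for each section, for example when chunking a document for search.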
Examples
Simple Section
Input HTML:
<section>
  <h2>Getting Started</h2>
  <p>Welcome to our <a href="/api">API</a>!</p>
  <ul>
    <li>Step 1: Register</li>
    <li>Step 2: Get API key</li>
  </ul>
</section>
Output:
{
  "title": "Getting Started",
  "text": "<p>Welcome to our <a href=\"/api\">API</a>!</p>\n<ul>\n<li>Step 1: Register</li>\n<li>Step 2: Get API key</li>\n</ul>",
  "level": 2,
  "subsections": []
}
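Since the text field keeps the original markup, downstream code can parse it again to recover details such as link targets. The sketch below uses only the standard library's html.parser; the LinkCollector class and extract_links function are hypothetical names chosen for illustration, and the section dictionary is the output shown above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_text):
    """Return all link targets found in an HTML fragment, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# The "Getting Started" section as emitted in the example above.
section = {
    "title": "Getting Started",
    "text": '<p>Welcome to our <a href="/api">API</a>!</p>\n<ul>\n'
            '<li>Step 1: Register</li>\n<li>Step 2: Get API key</li>\n</ul>',
    "level": 2,
    "subsections": [],
}

print(extract_links(section["text"]))  # ['/api']
```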
Nested Sections
Input HTML:
<main>
  <h1>Documentation</h1>
  <p>Introduction text.</p>
  <h2>Installation</h2>
  <p>Installation instructions.</p>
  <h3>Requirements</h3>
  <p>System requirements.</p>
  <h2>Usage</h2>
  <p>Usage examples.</p>
</main>
Output:
{
  "title": "Documentation",
  "text": "<p>Introduction text.</p>",
  "level": 1,
  "subsections": [
    {
      "title": "Installation",
      "text": "<p>Installation instructions.</p>",
      "level": 2,
      "subsections": [
        {
          "title": "Requirements",
          "text": "<p>System requirements.</p>",
          "level": 3,
          "subsections": []
        }
      ]
    },
    {
      "title": "Usage",
      "text": "<p>Usage examples.</p>",
      "level": 2,
      "subsections": []
    }
  ]
}
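One common way to turn a flat run of headings into a tree like the one above is to keep a stack of currently open sections and close every section whose level is the same as or deeper than the next heading. The sketch below illustrates that general technique under simplifying assumptions: the input is already a list of (level, title, text) tuples, and the function name nest_headings is hypothetical. It is not taken from the tool's source and may differ from how the tool works internally.

```python
def nest_headings(flat):
    """Build a section tree from (level, title, text) tuples in document order.

    Hypothetical illustration of stack-based nesting; assumes the first
    heading is the single top-level heading.
    """
    root = None
    stack = []  # currently open sections, outermost first
    for level, title, text in flat:
        node = {"title": title, "text": text, "level": level, "subsections": []}
        # Close sections at the same or a deeper level than the new heading.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        if stack:
            stack[-1]["subsections"].append(node)
        elif root is None:
            root = node
        else:
            raise ValueError("more than one top-level heading")
        stack.append(node)
    return root


# Headings from the "Nested Sections" input above, flattened by hand.
flat = [
    (1, "Documentation", "<p>Introduction text.</p>"),
    (2, "Installation", "<p>Installation instructions.</p>"),
    (3, "Requirements", "<p>System requirements.</p>"),
    (2, "Usage", "<p>Usage examples.</p>"),
]

tree = nest_headings(flat)
print([s["title"] for s in tree["subsections"]])  # ['Installation', 'Usage']
```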
Supported HTML Elements
Included Elements
- Paragraphs (<p>)
- Lists (<ul>, <ol>, <li>)
- Links (<a>)
- Formatting (<strong>, <em>, <code>, etc.)
- Semantic elements (<section>, <article>, <aside>, etc.)
- Tables (<table>, <tr>, <td>, etc.)
- Media (<img>, <figure>)
- Code blocks (<pre>, <code>)
- Quotes (<blockquote>, <q>)
- All other content elements with meaningful text
Excluded Elements
- Headings (processed separately as section titles)
- Script and style tags
- Meta elements
- Empty elements
- Elements containing headings (processed as subsections)
Smart Root Element Detection
The tool automatically detects the best root element in this priority order:
1. <main> - Primary content area
2. <article> - Standalone article content
3. <section> - Document section
4. <body> - Document body
5. First substantial <div> - Fallback for div-based layouts
6. Entire document - Last resort
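The priority order above amounts to a first-match lookup over a fixed list of tag names. The sketch below shows that idea with the standard library's html.parser rather than the tool's actual code. The names ROOT_PRIORITY, TagCollector, and pick_root_tag are hypothetical, and because the README does not define what makes a <div> "substantial", this sketch treats any <div> as the fallback.

```python
from html.parser import HTMLParser

# Candidate root tags, highest priority first (order taken from the list above).
ROOT_PRIORITY = ("main", "article", "section", "body", "div")


class TagCollector(HTMLParser):
    """Record which tag names occur in a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)


def pick_root_tag(html_text):
    """Return the highest-priority root tag present, or None.

    None means the caller should fall back to the entire document.
    Assumption: any <div> counts as "substantial" in this sketch.
    """
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    for tag in ROOT_PRIORITY:
        if tag in collector.seen:
            return tag
    return None


print(pick_root_tag("<body><div><article><p>Hi</p></article></div></body>"))  # article
```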
Advanced Features
Custom Headings
Supports custom headings with ARIA attributes:
<div role="heading" aria-level="2">Custom Heading</div>
Aria Level Overrides
Standard headings can have their levels overridden:
<h3 aria-level="1">This is treated as level 1</h3>
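The two rules above (custom headings via role="heading", and aria-level taking precedence over the tag name) can be combined into one level-resolution function. This is a minimal sketch under stated assumptions, not the tool's implementation: heading_level is a hypothetical name, attributes are passed as a plain dictionary, and an element with role="heading" but no usable aria-level is treated as "not a heading" because the README does not say how the tool handles that case.

```python
import re

_HEADING_TAG = re.compile(r"h([1-6])")
_POSITIVE_INT = re.compile(r"[1-9][0-9]*")


def heading_level(tag, attrs):
    """Return the heading level of an element, or None if it is not a heading.

    Hypothetical helper: `tag` is the element name, `attrs` a dict of
    its attributes.
    """
    match = _HEADING_TAG.fullmatch(tag.lower())
    is_custom = attrs.get("role") == "heading"
    if not match and not is_custom:
        return None
    # An explicit aria-level overrides the level implied by the tag name.
    aria = attrs.get("aria-level", "")
    if _POSITIVE_INT.fullmatch(aria):
        return int(aria)
    if match:
        return int(match.group(1))
    # Assumption: role="heading" without a valid aria-level is skipped.
    return None


print(heading_level("h3", {"aria-level": "1"}))                     # 1
print(heading_level("div", {"role": "heading", "aria-level": "2"}))  # 2
```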
Mixed Content
Handles complex layouts with mixed content types:
<div>
  <h1>Main Title</h1>
  <p>Introduction</p>
  <section>
    <h2>Section in Section</h2>
    <p>Section content</p>
  </section>
  <h2>Regular Heading</h2>
  <p>Regular content</p>
</div>
Testing
Run the test suite:
python -m pytest tests/ -v
The project includes comprehensive tests covering:
- Basic parsing functionality
- Heading level detection
- Content extraction
- Section handling
- Edge cases and error conditions
License
This project is open source. See the LICENSE file for details.
Contributing
Contributions are welcome! Please submit pull requests with tests for any new features.
File details
Details for the file content_extraction-0.4.4.tar.gz.
File metadata
- Download URL: content_extraction-0.4.4.tar.gz
- Upload date:
- Size: 21.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a40cd9285113d38e09cf73331c50d3e3b9ac7e87eaa1b50ffee47a1a60783ea |
| MD5 | ba032888cfcc7f1b0e46996d4074501f |
| BLAKE2b-256 | 19853676eff703893ba0b9f9642c17390afbe862020156bdbca1e2f6a061c405 |
Provenance
The following attestation bundles were made for content_extraction-0.4.4.tar.gz:
Publisher: python-package.yml on ChrisW-priv/html-chunking
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: content_extraction-0.4.4.tar.gz
- Subject digest: 8a40cd9285113d38e09cf73331c50d3e3b9ac7e87eaa1b50ffee47a1a60783ea
- Sigstore transparency entry: 505382890
- Sigstore integration time:
- Permalink: ChrisW-priv/html-chunking@ac8747542ba567a172c35b6ba0ff106a2816f19c
- Branch / Tag: refs/tags/v0.4.4
- Owner: https://github.com/ChrisW-priv
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@ac8747542ba567a172c35b6ba0ff106a2816f19c
- Trigger Event: release
File details
Details for the file content_extraction-0.4.4-py3-none-any.whl.
File metadata
- Download URL: content_extraction-0.4.4-py3-none-any.whl
- Upload date:
- Size: 22.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a3238fdee81991ba7e89cc89675467bb3d62a80d89fc89c624971edb46bc129 |
| MD5 | 2a1c2706cc7f555740d3b54badd95562 |
| BLAKE2b-256 | 8d23ccc7a0a4ca0b817efad395602cdb6b81c57c6eee2e401e55091c356ece4f |
Provenance
The following attestation bundles were made for content_extraction-0.4.4-py3-none-any.whl:
Publisher: python-package.yml on ChrisW-priv/html-chunking
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: content_extraction-0.4.4-py3-none-any.whl
- Subject digest: 3a3238fdee81991ba7e89cc89675467bb3d62a80d89fc89c624971edb46bc129
- Sigstore transparency entry: 505382897
- Sigstore integration time:
- Permalink: ChrisW-priv/html-chunking@ac8747542ba567a172c35b6ba0ff106a2816f19c
- Branch / Tag: refs/tags/v0.4.4
- Owner: https://github.com/ChrisW-priv
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-package.yml@ac8747542ba567a172c35b6ba0ff106a2816f19c
- Trigger Event: release