Convert bs4 Tags into JSON
bs2json
A lightweight Python library that converts BeautifulSoup4 HTML elements into structured JSON. Parse any HTML and get clean, traversable dictionaries — preserving document order, with full control over comments, whitespace, and label naming.
Python 3.8+ | Only dependency: beautifulsoup4
Table of Contents
| Section | Description |
|---|---|
| Installation | How to install |
| Quick Start | Basic usage example |
| Output Format | How HTML maps to JSON |
| Conversion | Converting tags, multiple tags, from BeautifulSoup |
| Options | group_by_tag, comments, whitespace, labels, config |
| Output | Save to file, pretty print |
| Advanced Usage | Context manager, callable, extension mode |
| API Reference | BS2Json methods, ConversionConfig fields |
| Contributing | How to contribute |
Installation
pip install -U bs2json
Quick Start
from bs2json import BS2Json
html = """
<html>
<head><title>My Page</title></head>
<body>
<h1>Welcome</h1>
<p class="intro">Hello <b>world</b></p>
<a href="/link1">Link 1</a>
<a href="/link2">Link 2</a>
</body>
</html>
"""
converter = BS2Json(html)
result = converter.convert()
converter.prettify()
Output Format
Elements preserve their original document order. The JSON structure follows these rules:
| HTML | JSON |
|---|---|
| <h1>text</h1> | {"h1": "text"} |
| <p class="x">text</p> | {"p": {"attrs": {"class": ["x"]}, "text": "text"}} |
| <div><h1>A</h1><p>B</p></div> | {"div": {"children": [{"h1": "A"}, {"p": "B"}]}} |
| <a href="/">link</a> | {"a": {"attrs": {"href": "/"}, "text": "link"}} |
| <!-- note --> | {"comment": "<!-- note -->"} |
- Single text child stays simple: {"tag": "text"}
- Multiple children use: {"tag": {"children": [...]}}
- Attributes appear under the "attrs" key
- Mixed content (text + tags) preserves order in children
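Because a tag's value can be a plain string, a dict with a "text" key, or a dict with "children", code that reads the output usually has to handle all three shapes. The following is a minimal sketch of such a reader, based only on the shapes listed above. `node_text` is a hypothetical helper, not part of bs2json, and joining fragments with a single space is an assumption.

```python
def node_text(value, text_key="text", children_key="children",
              comment_key="comment"):
    """Return the text of a converted tag value, whatever its shape.

    Hypothetical helper for illustration; not part of bs2json.
    """
    if isinstance(value, str):
        # Simple form: {"h1": "text"}
        return value
    if isinstance(value, dict):
        if text_key in value:
            # Attribute form: {"attrs": {...}, "text": "text"}
            return value[text_key]
        # Children form: collect text from each child in document order
        parts = []
        for child in value.get(children_key, []):
            for key, child_value in child.items():
                if key == comment_key:
                    continue  # skip HTML comments
                parts.append(node_text(child_value, text_key,
                                       children_key, comment_key))
        return " ".join(part for part in parts if part)
    return ""


simple = node_text("text")
with_attrs = node_text({"attrs": {"class": ["x"]}, "text": "text"})
mixed = node_text({"attrs": {"class": ["intro"]},
                   "children": [{"text": "Hello"}, {"b": "world"}]})
print(simple, with_attrs, mixed)
```

If custom labels are in use, pass the matching key names through `text_key` and `comment_key`.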
Full output example
{'html': {'head': {'title': 'My Page'},
'body': {'children': [{'h1': 'Welcome'},
{'p': {'attrs': {'class': ['intro']},
'children': [{'text': 'Hello'},
{'b': 'world'}]}},
{'a': {'attrs': {'href': '/link1'},
'text': 'Link 1'}},
{'a': {'attrs': {'href': '/link2'},
'text': 'Link 2'}}]}}}
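Since the result is made of plain dicts and lists, it can be traversed with ordinary Python. As a rough sketch, the helper below walks the structure shown above and collects every `href` value. `find_attr` is a hypothetical name introduced here, not a bs2json function, and the sketch assumes the default "attrs" label.

```python
def find_attr(node, name, attr_key="attrs"):
    """Yield each value stored under `name` in any attributes dict.

    Hypothetical helper for illustration; not part of bs2json.
    """
    if isinstance(node, dict):
        attrs = node.get(attr_key)
        if isinstance(attrs, dict) and name in attrs:
            yield attrs[name]
        for key, value in node.items():
            if key != attr_key:
                yield from find_attr(value, name, attr_key)
    elif isinstance(node, list):
        for item in node:
            yield from find_attr(item, name, attr_key)


# The Quick Start result, copied from the output above
result = {'html': {'head': {'title': 'My Page'},
                   'body': {'children': [
                       {'h1': 'Welcome'},
                       {'p': {'attrs': {'class': ['intro']},
                              'children': [{'text': 'Hello'},
                                           {'b': 'world'}]}},
                       {'a': {'attrs': {'href': '/link1'},
                              'text': 'Link 1'}},
                       {'a': {'attrs': {'href': '/link2'},
                              'text': 'Link 2'}}]}}}

links = list(find_attr(result, 'href'))
print(links)  # ['/link1', '/link2']
```

The walker recurses into every dict and list, so it does not depend on whether a given tag uses the "children" form or direct keys.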
Conversion
Convert Specific Tags
converter = BS2Json(html)
# By tag name
converter.convert('body')
# By CSS class
converter.convert(class_='intro')
# By attribute
converter.convert('a', href='/link1')
# {'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}}
Convert Multiple Tags
converter = BS2Json(html)
# As a list of individual results
converter.convert_all('a')
# [{'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}},
# {'a': {'attrs': {'href': '/link2'}, 'text': 'Link 2'}}]
# Grouped by tag name: each dict maps a tag to the list of its matches
converter.convert_all('a', join=True)
# [{'a': [{'attrs': {'href': '/link1'}, 'text': 'Link 1'},
# {'attrs': {'href': '/link2'}, 'text': 'Link 2'}]}]
From BeautifulSoup Objects
You can pass an existing BeautifulSoup object or Tag instead of raw HTML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# From a soup object
BS2Json(soup).convert()
# From a specific tag
BS2Json(soup.find('body')).convert()
# Create an empty converter and pass the element at conversion time
converter = BS2Json()
converter.convert(soup.body)
Options
Group by Tag Name
By default, elements preserve document order. Use group_by_tag=True to group siblings by tag name — useful when you don't care about order and want quick access by tag:
html = '<html><body><h3>First</h3><p>Text</p><h3>Second</h3></body></html>'
# Default: preserves document order
BS2Json(html).convert()
# {'html': {'body': {'children': [{'h3': 'First'}, {'p': 'Text'}, {'h3': 'Second'}]}}}
# Grouped: siblings merged by tag name
BS2Json(html, group_by_tag=True).convert()
# {'html': {'body': {'h3': ['First', 'Second'], 'p': 'Text'}}}
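To make the difference between the two shapes concrete, here is a minimal sketch that turns an ordered children list into the grouped form. It only illustrates the transformation; it is not the library's implementation, and `group_siblings` is a hypothetical name.

```python
from collections import defaultdict


def group_siblings(children):
    """Merge an ordered list of {tag: value} dicts into one dict per tag.

    Illustration only; not bs2json's own code.
    """
    buckets = defaultdict(list)
    for child in children:
        for tag, value in child.items():
            buckets[tag].append(value)
    # A tag that occurs once keeps its bare value; repeats become a list
    return {tag: values[0] if len(values) == 1 else values
            for tag, values in buckets.items()}


ordered = [{'h3': 'First'}, {'p': 'Text'}, {'h3': 'Second'}]
grouped = group_siblings(ordered)
print(grouped)  # {'h3': ['First', 'Second'], 'p': 'Text'}
```

The grouped form loses the information that the paragraph sat between the two headings, which is why document order is the default.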
Comments
comment_html = '<html><body><!-- TODO --><p>text</p></body></html>'
# Included by default
BS2Json(comment_html).convert()
# {'html': {'body': {'children': [{'comment': '<!-- TODO -->'}, {'p': 'text'}]}}}
# Exclude comments
BS2Json(comment_html, include_comments=False).convert()
# {'html': {'body': {'p': 'text'}}}
Whitespace
ws_html = '<html><body><p> hello </p></body></html>'
# Stripped by default
BS2Json(ws_html).convert()
# {'html': {'body': {'p': 'hello'}}}
# Preserve whitespace
BS2Json(ws_html, strip=False).convert()
# {'html': {'body': {'p': ' hello '}}}
Custom Labels
Change the JSON key names for attributes, text content, and comments:
converter = BS2Json('<html><body><p class="x">hello</p></body></html>')
converter.labels(attrs='attributes', text='content', comment='notes')
result = converter.convert()
# {'html': {'body': {'p': {'attributes': {'class': ['x']}, 'content': 'hello'}}}}
Or via constructor:
BS2Json(html, attr_name='@', text_name='#text', comment_name='#comment')
Configuration Object
All options are stored in a ConversionConfig dataclass, accessible and modifiable at any time:
from bs2json import BS2Json, ConversionConfig
converter = BS2Json(html, strip=False)
print(converter.config)
# ConversionConfig(attr_name='attrs', text_name='text', comment_name='comment',
# include_comments=True, strip=False, group_by_tag=False)
# Modify config directly
converter.config.group_by_tag = True
converter.config.include_comments = False
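Mutating the config changes every later conversion on that converter. When a one-off variation is needed, the standard library's `dataclasses.replace` can build a modified copy of any dataclass instance. The sketch below uses a stand-in class, `SketchConfig`, that mirrors the documented fields and defaults; it is an assumption that the same call works unchanged on `ConversionConfig`, and it has not been run against the library.

```python
from dataclasses import dataclass, replace


@dataclass
class SketchConfig:
    """Stand-in with the fields and defaults listed for ConversionConfig."""
    attr_name: str = "attrs"
    text_name: str = "text"
    comment_name: str = "comment"
    include_comments: bool = True
    strip: bool = True
    group_by_tag: bool = False


base = SketchConfig(strip=False)

# In-place change: visible to everything that shares this object
base.group_by_tag = True

# Copy with one override: `base` keeps its own values
quiet = replace(base, include_comments=False)

print(base)
print(quiet)
```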
Output
Save to File
converter = BS2Json(html)
converter.convert()
# Save to JSON file (pretty-printed, 4-space indent)
converter.save('output.json')
# Save compact
converter.save('compact.json', prettify=False)
# Custom indent
converter.save('indented.json', indent=2)
# Save to a file-like object
import io
buf = io.StringIO()
converter.save(buf)
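A saved result can be read back with the standard `json` module. The sketch below calls `json` directly instead of `save`, so it runs without bs2json installed. It assumes `save` writes standard JSON and that its `indent` option behaves like the `indent` argument of `json.dump`, which the README implies but does not state.

```python
import io
import json

# A small result in the documented output shape
result = {"a": {"attrs": {"href": "/link1"}, "text": "Link 1"}}

pretty = json.dumps(result, indent=4)   # multi-line, 4-space indent
compact = json.dumps(result)            # single line

# Write to an in-memory buffer, then read the data back
buf = io.StringIO()
json.dump(result, buf, indent=2)
buf.seek(0)
loaded = json.load(buf)

print(pretty)
print(compact)
```

Indentation only affects the text on disk; the loaded data is the same dict either way.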
Pretty Print
converter = BS2Json(html)
converter.convert()
converter.prettify() # prints to stdout
Advanced Usage
Context Manager and Callable
# Use as context manager
with BS2Json(html) as converter:
    result = converter.convert()
# Use as callable (shortcut for .convert())
converter = BS2Json(html)
result = converter()
Extension Mode
Monkey-patch .to_json() directly onto every BeautifulSoup Tag element:
from bs4 import BeautifulSoup
from bs2json import install, remove
install()
soup = BeautifulSoup(html, 'html.parser')
# Now every tag has .to_json()
soup.find('body').to_json()
soup.find('a').to_json(include_comments=False, strip=False)
remove() # clean up when done
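Because the patch is global, it is worth undoing it even when a conversion fails. The following sketch shows the general install/remove pattern with `try`/`finally` on a stand-in class, so it runs without bs4. `FakeTag`, `install_patch`, and `remove_patch` are hypothetical names for illustration, and the sketch makes no claim about how bs2json implements `install()` and `remove()` internally.

```python
class FakeTag:
    """Stand-in for bs4's Tag so the sketch runs without dependencies."""

    def __init__(self, name, text):
        self.name = name
        self.text = text


def _to_json(self):
    # Simplified conversion: only the {"tag": "text"} form
    return {self.name: self.text}


def install_patch():
    """Attach to_json to the class, so every instance gains the method."""
    FakeTag.to_json = _to_json


def remove_patch():
    """Detach the method again; safe to call when nothing is installed."""
    if "to_json" in vars(FakeTag):
        del FakeTag.to_json


install_patch()
try:
    patched_result = FakeTag("h1", "Welcome").to_json()
finally:
    # Always undo the patch, even if the conversion raises
    remove_patch()

still_patched = hasattr(FakeTag, "to_json")
print(patched_result, still_patched)
```

The same `try`/`finally` wrapper can be placed around the real `install()` and `remove()` calls.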
API Reference
BS2Json
| Method | Description |
|---|---|
| BS2Json(soup, features, *, include_comments, strip, group_by_tag, **kwargs) | Initialize from HTML string, Tag, or BeautifulSoup object |
| .convert(element=None, json=None, *, inplace=False, **kwargs) | Convert a single tag to a dict |
| .convert_all(elements=None, lst=None, *, join=False, **kwargs) | Convert multiple tags to a list of dicts |
| .labels(attrs=..., text=..., comment=...) | Change JSON key names |
| .save(file, /, mode='w', *, prettify=True, indent=4) | Save last result to file path or file object |
| .prettify() | Pretty-print last result to stdout |
| .config | ConversionConfig dataclass with all options |
| .last_obj | Result of the most recent conversion |
| .soup | The underlying BeautifulSoup object |
ConversionConfig
| Field | Default | Description |
|---|---|---|
| attr_name | "attrs" | JSON key for element attributes |
| text_name | "text" | JSON key for text content |
| comment_name | "comment" | JSON key for HTML comments |
| include_comments | True | Whether to include HTML comments |
| strip | True | Strip leading/trailing whitespace from text |
| group_by_tag | False | Group siblings by tag name instead of preserving order |
Contributing
See CONTRIBUTING.md for development setup, versioning guide, and how to submit changes.