Convert bs4 Tags into JSON
bs2json
A lightweight Python library that converts BeautifulSoup4 HTML elements into structured JSON. Parse any HTML and get clean, traversable dictionaries — with full control over element ordering, comments, whitespace, and label naming.
Installation
pip install -U bs2json
Requirements: Python 3.8+ | Only dependency: beautifulsoup4
Quick Start
from bs2json import BS2Json
html = """
<html>
<head><title>My Page</title></head>
<body>
<h1>Welcome</h1>
<p class="intro">Hello <b>world</b></p>
<a href="/link1">Link 1</a>
<a href="/link2">Link 2</a>
</body>
</html>
"""
converter = BS2Json(html)
result = converter.convert()
converter.prettify()
Output:
{'html': {'head': {'title': 'My Page'},
'body': {'h1': 'Welcome',
'p': {'attrs': {'class': ['intro']},
'text': 'Hello',
'b': 'world'},
'a': [{'attrs': {'href': '/link1'}, 'text': 'Link 1'},
{'attrs': {'href': '/link2'}, 'text': 'Link 2'}]}}}
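Because the result is an ordinary nested dictionary, it can be walked with plain Python. The sketch below pastes the Quick Start output in as a literal, so it runs without bs2json installed, and collects every `href` it finds. The helper name `collect_hrefs` is hypothetical and not part of the library.

```python
# The Quick Start output, pasted as a literal so the example is self-contained.
result = {
    'html': {
        'head': {'title': 'My Page'},
        'body': {
            'h1': 'Welcome',
            'p': {'attrs': {'class': ['intro']}, 'text': 'Hello', 'b': 'world'},
            'a': [
                {'attrs': {'href': '/link1'}, 'text': 'Link 1'},
                {'attrs': {'href': '/link2'}, 'text': 'Link 2'},
            ],
        },
    }
}


def collect_hrefs(node):
    """Recursively gather every 'href' stored under an 'attrs' key."""
    hrefs = []
    if isinstance(node, dict):
        attrs = node.get('attrs')
        if isinstance(attrs, dict) and 'href' in attrs:
            hrefs.append(attrs['href'])
        # Descend into child elements; 'attrs' itself was handled above.
        for key, value in node.items():
            if key != 'attrs':
                hrefs.extend(collect_hrefs(value))
    elif isinstance(node, list):
        # Repeated sibling tags are stored as a list of dicts.
        for item in node:
            hrefs.extend(collect_hrefs(item))
    return hrefs


hrefs = collect_hrefs(result)
print(hrefs)
```

Note that the walker has to handle both shapes a key can take: a single value for a unique tag and a list for repeated siblings such as the two `a` elements.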
Features
Convert Specific Tags
converter = BS2Json(html)
# By tag name
converter.convert('body')
# {'body': {'h1': 'Welcome', 'p': {...}, 'a': [...]}}
# By CSS class
converter.convert(class_='intro')
# {'p': {'attrs': {'class': ['intro']}, 'text': 'Hello', 'b': 'world'}}
# By attribute (id, href, or any other bs4 find() argument)
converter.convert('a', href='/link1')
# {'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}}
Convert Multiple Tags
converter = BS2Json(html)
# As a list of individual results
converter.convert_all('a')
# [{'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}},
# {'a': {'attrs': {'href': '/link2'}, 'text': 'Link 2'}}]
# Grouped by tag name; the result is still a list, here holding a single dict
converter.convert_all('a', join=True)
# [{'a': [{'attrs': {'href': '/link1'}, 'text': 'Link 1'},
# {'attrs': {'href': '/link2'}, 'text': 'Link 2'}]}]
Preserve Element Order
By default, sibling elements with the same tag are grouped together. Use keep_order=True to preserve the original document order — useful when the sequence of elements matters:
html = '<html><body><h3>First</h3><p>Text</p><h3>Second</h3></body></html>'
# Default: groups by tag name
BS2Json(html).convert()
# {'html': {'body': {'h3': ['First', 'Second'], 'p': 'Text'}}}
# Ordered: preserves document order
BS2Json(html, keep_order=True).convert()
# {'html': [{'body': [{'h3': 'First'}, {'p': 'Text'}, {'h3': 'Second'}]}]}
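The difference between the two modes comes down to how sibling results are merged. The following is a minimal sketch of both strategies over a flat list of `(tag, value)` pairs. It is an illustration only, not the library's implementation, and the function names `group_siblings` and `ordered_siblings` are hypothetical.

```python
def group_siblings(pairs):
    """Merge (tag, value) pairs by tag name; repeated tags become a list."""
    grouped = {}
    seen = {}  # how many times each tag has appeared so far
    for tag, value in pairs:
        count = seen.get(tag, 0)
        if count == 0:
            grouped[tag] = value                  # first occurrence: plain value
        elif count == 1:
            grouped[tag] = [grouped[tag], value]  # second: promote to a list
        else:
            grouped[tag].append(value)            # later ones: extend the list
        seen[tag] = count + 1
    return grouped


def ordered_siblings(pairs):
    """Keep document order by emitting one single-key dict per element."""
    return [{tag: value} for tag, value in pairs]


children = [('h3', 'First'), ('p', 'Text'), ('h3', 'Second')]
grouped = group_siblings(children)
ordered = ordered_siblings(children)
print(grouped)
print(ordered)
```

Grouping gives direct access by tag name but loses the fact that the paragraph sat between the two headings; the ordered form keeps that information at the cost of a list lookup.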
Control Comments and Whitespace
comment_html = '<html><body><!-- TODO: fix --><p> hello </p></body></html>'
# Include comments (default)
BS2Json(comment_html).convert()
# {'html': {'body': {'comment': '<!-- TODO: fix -->', 'p': 'hello'}}}
# Exclude comments
BS2Json(comment_html, include_comments=False).convert()
# {'html': {'body': {'p': 'hello'}}}
# Preserve whitespace (stripped by default)
BS2Json(comment_html, strip=False).convert()
# {'html': {'body': {'comment': '<!-- TODO: fix -->', 'p': ' hello '}}}
Custom Labels
Change the JSON key names for attributes, text content, and comments:
converter = BS2Json('<html><body><p class="x">hello</p></body></html>')
converter.labels(attrs='attributes', text='content', comment='notes')
result = converter.convert()
# {'html': {'body': {'p': {'attributes': {'class': ['x']}, 'content': 'hello'}}}}
Save and Prettify
converter = BS2Json(html)
converter.convert()
# Save to JSON file
converter.save('output.json')
# Save with custom formatting
converter.save('compact.json', prettify=False)
converter.save('indented.json', indent=2)
# Save to a file-like object
import io
buf = io.StringIO()
converter.save(buf)
# Pretty-print to stdout
converter.prettify()
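The source does not say how `save` serializes the result. Assuming the output is standard JSON, the effect of compact versus indented formatting, and of writing to a file-like object and reading it back, can be sketched with the standard `json` module alone. The `result` literal is a trimmed copy of the Quick Start output, not something produced by the library here.

```python
import io
import json

# Trimmed copy of a conversion result, used as stand-in data.
result = {'html': {'body': {'h1': 'Welcome'}}}

# Compact form: everything on one line.
compact = json.dumps(result)

# Indented form: one key per line, nested by two spaces.
indented = json.dumps(result, indent=2)

# Round trip through a file-like object.
buf = io.StringIO()
json.dump(result, buf, indent=4)
buf.seek(0)
restored = json.load(buf)

print(compact)
print(indented)
```

Whatever the formatting, the data loaded back is identical to the original dictionary, so the choice only affects file size and readability.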
Context Manager and Callable
# Use as context manager
with BS2Json(html) as converter:
    result = converter.convert()
# Use as callable (shortcut for .convert())
converter = BS2Json(html)
result = converter()
Extension Mode
Monkey-patch .to_json() directly onto every BeautifulSoup Tag element:
from bs4 import BeautifulSoup
from bs2json import install, remove
install()
soup = BeautifulSoup(html, 'html.parser')
# Now every tag has .to_json()
soup.find('body').to_json()
soup.body.to_json(keep_order=True)
soup.find('a').to_json(include_comments=False, strip=False)
remove() # clean up when done
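For readers unfamiliar with monkey-patching, the general install/remove pattern can be sketched as follows. This is not bs2json's code: the `Tag` class is a stand-in so the example runs without BeautifulSoup, and the body of `_to_json` is a hypothetical placeholder for a real converter.

```python
class Tag:
    """Stand-in for bs4's Tag so the sketch runs without BeautifulSoup."""

    def __init__(self, name, text):
        self.name = name
        self.text = text


def _to_json(self):
    """Hypothetical converter: maps the tag name to its text."""
    return {self.name: self.text}


def install():
    """Attach to_json to the class, so every instance gains the method."""
    Tag.to_json = _to_json


def remove():
    """Detach the method again; safe to call even if never installed."""
    if 'to_json' in vars(Tag):
        del Tag.to_json


tag = Tag('h1', 'Welcome')  # created before the patch is installed
install()
patched = tag.to_json()     # existing instances see the new method too
remove()
still_patched = hasattr(tag, 'to_json')
print(patched, still_patched)
```

Because the method is attached to the class rather than to individual objects, the patch is global for the process, which is why cleaning up with a matching remove step matters.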
Configuration Object
All conversion options are stored in a ConversionConfig dataclass, accessible and modifiable at any time:
from bs2json import BS2Json, ConversionConfig
converter = BS2Json(html, keep_order=True, strip=False)
print(converter.config)
# ConversionConfig(attr_name='attrs', text_name='text', comment_name='comment',
# include_comments=True, strip=False, keep_order=True)
# Modify config directly
converter.config.keep_order = False
converter.config.include_comments = False
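To make the shape of the configuration concrete, here is a stand-in dataclass built from the documented field names and defaults. The class name `ConfigSketch` is hypothetical, and the real `ConversionConfig` may carry extra behavior that this sketch does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Stand-in mirroring the documented fields and defaults."""
    attr_name: str = 'attrs'
    text_name: str = 'text'
    comment_name: str = 'comment'
    include_comments: bool = True
    strip: bool = True
    keep_order: bool = False


# Override two defaults at construction time, as in the example above.
config = ConfigSketch(strip=False, keep_order=True)
print(config)

# Fields of a plain dataclass are mutable, so they can be changed later.
config.keep_order = False
config.include_comments = False
print(config)
```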
Also Works With BeautifulSoup Objects
You can pass an existing BeautifulSoup object or Tag instead of a raw HTML string:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# From a soup object
BS2Json(soup).convert()
# From a specific tag
tag = soup.find('body')
BS2Json(tag).convert()
# Convert on-the-fly with no soup
converter = BS2Json()
converter.convert(tag)
API Reference
BS2Json
| Method | Description |
|---|---|
| `BS2Json(soup, features, *, include_comments, strip, keep_order, **kwargs)` | Initialize from HTML string, Tag, or BeautifulSoup object |
| `.convert(element=None, json=None, *, inplace=False, **kwargs)` | Convert a single tag to a dict |
| `.convert_all(elements=None, lst=None, *, join=False, **kwargs)` | Convert multiple tags to a list of dicts |
| `.labels(attrs=..., text=..., comment=...)` | Change JSON key names |
| `.save(file, /, mode='w', *, prettify=True, indent=4)` | Save last result to file path or file object |
| `.prettify()` | Pretty-print last result to stdout |
| `.config` | ConversionConfig dataclass with all options |
| `.last_obj` | Result of the most recent conversion |
| `.soup` | The underlying BeautifulSoup object |
ConversionConfig
| Field | Default | Description |
|---|---|---|
| `attr_name` | `"attrs"` | JSON key for element attributes |
| `text_name` | `"text"` | JSON key for text content |
| `comment_name` | `"comment"` | JSON key for HTML comments |
| `include_comments` | `True` | Whether to include HTML comments |
| `strip` | `True` | Strip leading/trailing whitespace from text |
| `keep_order` | `False` | Preserve element order instead of grouping |
Contributing
We appreciate all contributions. If you are planning to contribute bug-fixes, please do so without further discussion.
If you plan to contribute new features, please first open an issue and discuss the feature with us.
File details
Details for the file bs2json-0.2.0.tar.gz.
File metadata
- Download URL: bs2json-0.2.0.tar.gz
- Upload date:
- Size: 13.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.11 {"installer":{"name":"uv","version":"0.10.11","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d24b6848610df58f6820678485efffd5873c12ccbcc2c49212af828f0c8bcd5c |
| MD5 | 48fd2a84d5e0607035e957786854df78 |
| BLAKE2b-256 | 392b207058a82cb2bc6669ff32261f085a2bba98757497943213cd57a49460f6 |
File details
Details for the file bs2json-0.2.0-py3-none-any.whl.
File metadata
- Download URL: bs2json-0.2.0-py3-none-any.whl
- Upload date:
- Size: 9.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.10.11 {"installer":{"name":"uv","version":"0.10.11","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 68dcf29a547265c519bc6be72c87a9b703a0cade4b8016af7a0b81f70567b312 |
| MD5 | ac31261ced163cddd8846f7065576543 |
| BLAKE2b-256 | 5eae3f47af52d4e5afb6a4f3cff4e6653fbe99e7aa94c8d445e8abfc1cb4e768 |