Convert bs4 Tags into JSON

bs2json

A lightweight Python library that converts BeautifulSoup4 HTML elements into structured JSON. Parse any HTML and get clean, traversable dictionaries that preserve document order, with full control over comments, whitespace, and label naming.

Python 3.8+ | Only dependency: beautifulsoup4


Table of Contents

| Section | Description |
| --- | --- |
| Installation | How to install |
| Quick Start | Basic usage example |
| Output Format | How HTML maps to JSON |
| Conversion | Converting tags, multiple tags, from BeautifulSoup |
| Options | group_by_tag, comments, whitespace, labels, config |
| Output | Save to file, pretty print |
| Advanced Usage | Context manager, callable, extension mode |
| API Reference | BS2Json methods, ConversionConfig fields |
| Contributing | How to contribute |

Installation

pip install -U bs2json

Quick Start

from bs2json import BS2Json

html = """
<html>
<head><title>My Page</title></head>
<body>
    <h1>Welcome</h1>
    <p class="intro">Hello <b>world</b></p>
    <a href="/link1">Link 1</a>
    <a href="/link2">Link 2</a>
</body>
</html>
"""

converter = BS2Json(html)
result = converter.convert()
converter.prettify()

Output Format

Elements preserve their original document order. The JSON structure follows these rules:

| HTML | JSON |
| --- | --- |
| <h1>text</h1> | {"h1": "text"} |
| <p class="x">text</p> | {"p": {"attrs": {"class": ["x"]}, "text": "text"}} |
| <div><h1>A</h1><p>B</p></div> | {"div": {"children": [{"h1": "A"}, {"p": "B"}]}} |
| <a href="/">link</a> | {"a": {"attrs": {"href": "/"}, "text": "link"}} |
| <!-- note --> | {"comment": "<!-- note -->"} |

  • Single text child stays simple: {"tag": "text"}
  • Multiple children use: {"tag": {"children": [...]}}
  • Attributes appear under the "attrs" key
  • Mixed content (text + tags) preserves order in children
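
Because an element's value is either a plain string or a dict, code that reads the output usually has to handle both shapes. The helper below is a minimal sketch of one way to do that; `text_of` is a hypothetical name for illustration and is not part of bs2json. It relies only on the rules listed above.

```python
from typing import Any, Optional


def text_of(value: Any, text_name: str = "text") -> Optional[str]:
    """Return the direct text of a converted element value, or None if it has none."""
    if isinstance(value, str):       # simple form: {"h1": "text"}
        return value
    if isinstance(value, dict):      # expanded form: {"p": {"attrs": ..., "text": "text"}}
        return value.get(text_name)
    return None


# Values copied from the mapping rules above.
simple = {"h1": "text"}
with_attrs = {"p": {"attrs": {"class": ["x"]}, "text": "text"}}
nested = {"div": {"children": [{"h1": "A"}, {"p": "B"}]}}

print(text_of(simple["h1"]))      # text
print(text_of(with_attrs["p"]))   # text
print(text_of(nested["div"]))     # None (the text lives in the children)
```

The `text_name` parameter is there so the same helper keeps working if the text label is renamed.
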
Full output example
{'html': {'head': {'title': 'My Page'},
          'body': {'children': [{'h1': 'Welcome'},
                                {'p': {'attrs': {'class': ['intro']},
                                       'children': [{'text': 'Hello'},
                                                    {'b': 'world'}]}},
                                {'a': {'attrs': {'href': '/link1'},
                                       'text': 'Link 1'}},
                                {'a': {'attrs': {'href': '/link2'},
                                       'text': 'Link 2'}}]}}}
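
Since the result is ordinary nested dicts and lists, it can be walked with plain Python. The sketch below copies the documented output as a literal (so it runs without bs2json installed) and collects every link target; `iter_tags` is an illustrative helper, not a library function.

```python
from typing import Any, Iterator

# Literal copy of the documented output for the Quick Start HTML.
result = {
    "html": {
        "head": {"title": "My Page"},
        "body": {
            "children": [
                {"h1": "Welcome"},
                {"p": {"attrs": {"class": ["intro"]},
                       "children": [{"text": "Hello"}, {"b": "world"}]}},
                {"a": {"attrs": {"href": "/link1"}, "text": "Link 1"}},
                {"a": {"attrs": {"href": "/link2"}, "text": "Link 2"}},
            ]
        },
    }
}


def iter_tags(node: Any, name: str) -> Iterator[Any]:
    """Yield the value stored under every key equal to `name`, depth-first."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                yield value
            yield from iter_tags(value, name)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tags(item, name)


hrefs = [a["attrs"]["href"] for a in iter_tags(result, "a")]
print(hrefs)  # ['/link1', '/link2']
```

The walk matches on key names only, so it would also match an HTML attribute that happens to be called `a`, and the `a["attrs"]` lookup assumes every matched link has attributes. Real code may want to guard both cases.
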

Conversion

Convert Specific Tags
converter = BS2Json(html)

# By tag name
converter.convert('body')

# By CSS class
converter.convert(class_='intro')

# By attribute
converter.convert('a', href='/link1')
# {'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}}
Convert Multiple Tags
converter = BS2Json(html)

# As a list of individual results
converter.convert_all('a')
# [{'a': {'attrs': {'href': '/link1'}, 'text': 'Link 1'}},
#  {'a': {'attrs': {'href': '/link2'}, 'text': 'Link 2'}}]

# Grouped by tag name: results that share a tag are merged under one key
converter.convert_all('a', join=True)
# [{'a': [{'attrs': {'href': '/link1'}, 'text': 'Link 1'},
#         {'attrs': {'href': '/link2'}, 'text': 'Link 2'}]}]
From BeautifulSoup Objects

You can pass an existing BeautifulSoup object or Tag instead of raw HTML:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# From a soup object
BS2Json(soup).convert()

# From a specific tag
BS2Json(soup.find('body')).convert()

# Or create an empty converter and pass the element at call time
converter = BS2Json()
converter.convert(soup.body)

Options

Group by Tag Name

By default, elements preserve document order. Set group_by_tag=True to group siblings by tag name, which is useful when order does not matter and you want quick access by tag:

html = '<html><body><h3>First</h3><p>Text</p><h3>Second</h3></body></html>'

# Default: preserves document order
BS2Json(html).convert()
# {'html': {'body': {'children': [{'h3': 'First'}, {'p': 'Text'}, {'h3': 'Second'}]}}}

# Grouped: siblings merged by tag name
BS2Json(html, group_by_tag=True).convert()
# {'html': {'body': {'h3': ['First', 'Second'], 'p': 'Text'}}}
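
To make the relationship between the two shapes concrete, here is a rough sketch of how an ordered children list could be folded into the grouped form. It is an illustration of the idea only, not bs2json's actual implementation, and `group_siblings` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def group_siblings(children: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge single-key sibling dicts by tag name (illustrative only)."""
    buckets: Dict[str, List[Any]] = {}
    for child in children:
        for tag, value in child.items():
            buckets.setdefault(tag, []).append(value)
    # A tag that occurs once keeps its plain value; repeated tags become a list.
    return {tag: vals[0] if len(vals) == 1 else vals
            for tag, vals in buckets.items()}


ordered = [{"h3": "First"}, {"p": "Text"}, {"h3": "Second"}]
grouped = group_siblings(ordered)
print(grouped)  # {'h3': ['First', 'Second'], 'p': 'Text'}
```

The grouped form no longer records that the paragraph sat between the two headings, which is the trade-off described above.
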
Comments
comment_html = '<html><body><!-- TODO --><p>text</p></body></html>'

# Included by default
BS2Json(comment_html).convert()
# {'html': {'body': {'children': [{'comment': '<!-- TODO -->'}, {'p': 'text'}]}}}

# Exclude comments
BS2Json(comment_html, include_comments=False).convert()
# {'html': {'body': {'p': 'text'}}}
Whitespace
ws_html = '<html><body><p>  hello  </p></body></html>'

# Stripped by default
BS2Json(ws_html).convert()
# {'html': {'body': {'p': 'hello'}}}

# Preserve whitespace
BS2Json(ws_html, strip=False).convert()
# {'html': {'body': {'p': '  hello  '}}}
Custom Labels

Change the JSON key names for attributes, text content, and comments:

converter = BS2Json('<html><body><p class="x">hello</p></body></html>')
converter.labels(attrs='attributes', text='content', comment='notes')
result = converter.convert()
# {'html': {'body': {'p': {'attributes': {'class': ['x']}, 'content': 'hello'}}}}

Or via the constructor:

BS2Json(html, attr_name='@', text_name='#text', comment_name='#comment')
Configuration Object

All options are stored in a ConversionConfig dataclass, accessible and modifiable at any time:

from bs2json import BS2Json, ConversionConfig

converter = BS2Json(html, strip=False)
print(converter.config)
# ConversionConfig(attr_name='attrs', text_name='text', comment_name='comment',
#                  include_comments=True, strip=False, group_by_tag=False)

# Modify config directly
converter.config.group_by_tag = True
converter.config.include_comments = False
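
If mutating a shared config in place is undesirable, the standard library's `dataclasses.replace` can derive a modified copy of a dataclass instance. The sketch below uses a stand-in class, `ConfigSketch`, that mirrors the documented fields and defaults so it runs without bs2json. Whether the real ConversionConfig behaves identically is an assumption.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class ConfigSketch:
    """Stand-in mirroring the documented ConversionConfig fields and defaults."""
    attr_name: str = "attrs"
    text_name: str = "text"
    comment_name: str = "comment"
    include_comments: bool = True
    strip: bool = True
    group_by_tag: bool = False


base = ConfigSketch(strip=False)

# In-place mutation, in the style of the example above.
base.group_by_tag = True

# Non-mutating alternative: derive a modified copy and leave `base` untouched.
quiet = replace(base, include_comments=False)

print(asdict(quiet))
```
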

Output

Save to File
converter = BS2Json(html)
converter.convert()

# Save to JSON file (pretty-printed, 4-space indent)
converter.save('output.json')

# Save compact
converter.save('compact.json', prettify=False)

# Custom indent
converter.save('indented.json', indent=2)

# Save to a file-like object
import io
buf = io.StringIO()
converter.save(buf)
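
A saved file can be read back with the standard `json` module. The round trip below is a sketch that uses `json.dump` as a stand-in for `converter.save(buf)`, on the assumption that save writes standard JSON. The sample dict is the whitespace example's output.

```python
import io
import json

# Result dict in the documented shape.
result = {"html": {"body": {"p": "hello"}}}

buf = io.StringIO()
json.dump(result, buf, indent=4)   # stand-in for converter.save(buf); an assumption

buf.seek(0)                        # rewind before reading what was written
loaded = json.load(buf)

print(loaded == result)            # True
```

When reading from a real file, open it in text mode and pass the handle to `json.load` in the same way.
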
Pretty Print
converter = BS2Json(html)
converter.convert()
converter.prettify()  # prints to stdout

Advanced Usage

Context Manager and Callable
# Use as context manager
with BS2Json(html) as converter:
    result = converter.convert()

# Use as callable (shortcut for .convert())
converter = BS2Json(html)
result = converter()
Extension Mode

Monkey-patch .to_json() directly onto every BeautifulSoup Tag element:

from bs4 import BeautifulSoup
from bs2json import install, remove

install()

soup = BeautifulSoup(html, 'html.parser')

# Now every tag has .to_json()
soup.find('body').to_json()
soup.find('a').to_json(include_comments=False, strip=False)

remove()  # clean up when done
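
For readers unfamiliar with the pattern, the sketch below shows how an install/remove pair of this kind can work in general: a method is attached to a class at runtime and later deleted. It uses a stand-in `Node` class instead of bs4's Tag so it runs on its own, and it is not bs2json's actual implementation.

```python
class Node:
    """Stand-in for a bs4 Tag so the sketch runs without bs4."""

    def __init__(self, name: str, text: str) -> None:
        self.name = name
        self.text = text


def _to_json(self: Node) -> dict:
    # Simplified conversion: tag name mapped to its text.
    return {self.name: self.text}


def install() -> None:
    """Attach to_json to the class; existing and future instances all gain it."""
    Node.to_json = _to_json


def remove() -> None:
    """Detach the method again, if it is currently installed."""
    if "to_json" in vars(Node):
        delattr(Node, "to_json")


node = Node("h1", "Welcome")       # created before install()
install()
try:
    patched = node.to_json()       # attribute lookup falls through to the class
finally:
    remove()                       # always undo the patch, even on error

print(patched)                     # {'h1': 'Welcome'}
print(hasattr(node, "to_json"))    # False
```

Wrapping the patched region in try/finally keeps the global change from leaking if a conversion raises.
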

API Reference

BS2Json

| Method / attribute | Description |
| --- | --- |
| BS2Json(soup, features, *, include_comments, strip, group_by_tag, **kwargs) | Initialize from HTML string, Tag, or BeautifulSoup object |
| .convert(element=None, json=None, *, inplace=False, **kwargs) | Convert a single tag to a dict |
| .convert_all(elements=None, lst=None, *, join=False, **kwargs) | Convert multiple tags to a list of dicts |
| .labels(attrs=..., text=..., comment=...) | Change JSON key names |
| .save(file, /, mode='w', *, prettify=True, indent=4) | Save last result to file path or file object |
| .prettify() | Pretty-print last result to stdout |
| .config | ConversionConfig dataclass with all options |
| .last_obj | Result of the most recent conversion |
| .soup | The underlying BeautifulSoup object |

ConversionConfig

| Field | Default | Description |
| --- | --- | --- |
| attr_name | "attrs" | JSON key for element attributes |
| text_name | "text" | JSON key for text content |
| comment_name | "comment" | JSON key for HTML comments |
| include_comments | True | Whether to include HTML comments |
| strip | True | Strip leading/trailing whitespace from text |
| group_by_tag | False | Group siblings by tag name instead of preserving order |

Contributing

See CONTRIBUTING.md for development setup, versioning guide, and how to submit changes.
