
A simple wrapper that provides an easy way to store parsed HTML content

Project description

HTML Serializer Parser

Based on the html-to-json library (https://pypi.org/project/html-to-json/), this library extends its functionality with an additional layer of extra information: a query selector for every node, a list of all query selectors, and different return options (as a list, as a tree dictionary, and/or as a dict), each of which adds specific properties to every node.

Quick Start

import json
from html2json.parser import ParserOptions, html2json


if __name__ == "__main__":
    # You can use an HTML file, a raw HTML string or an endpoint
    FILE_DIR = "./PATH_TO_YOUR_FILES/index.html"
    output = html2json(
        input_path=open(FILE_DIR).read(),
        options=ParserOptions.parser_factory(
            store_as_list=True,
            store_as_dict=True,
            store_as_tree_dict=True
        ),
        raw_content=False,
    )
    # Retrieves a dict with the following keys:
    # as_list, as_dict, as_tree_dict, query_selectors
    json_output = json.dumps(output, ensure_ascii=True, indent=2)
    with open("data.json", "w") as o:
        o.write(json_output)
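Once saved, the output can be consumed like any other JSON document. A minimal sketch of the round trip, using hypothetical sample data rather than real parser output:

```python
import json

# Hypothetical output mimicking the keys returned by html2json
output = {
    "as_list": [{"node_id": 0, "tag": "html", "content": ""}],
    "as_dict": {},
    "as_tree_dict": {},
    "query_selectors": ["html"],
}

json_output = json.dumps(output, ensure_ascii=True, indent=2)

# Round trip: the serialized form loads back into the same structure
restored = json.loads(json_output)
print(restored["query_selectors"])
```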

Some code examples

Extracting query selectors

from html2json.parser import ParserOptions, html2json
VALID_SCRAPPING_SITE = "https://www.scrapethissite.com/pages/simple/"

def get_query_selectors():
    output = html2json(
        input_path=VALID_SCRAPPING_SITE,
        options=ParserOptions.parser_factory(False, False, False),
        raw_content=False,
    )
    print(output['query_selectors'])

get_query_selectors()
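The returned selectors can be fed back into BeautifulSoup's CSS selector engine. A minimal standalone sketch, assuming the query selectors are standard CSS selectors accepted by `soup.select` (the HTML and selector below are illustrative, not real library output):

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
  <div class="country"><h3 class="country-name">Andorra</h3></div>
  <div class="country"><h3 class="country-name">France</h3></div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A CSS selector similar in shape to the entries of `query_selectors`
selector = "div.country > h3.country-name"
names = [tag.get_text(strip=True) for tag in soup.select(selector)]
print(names)  # ['Andorra', 'France']
```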

Let's create a block using the internal logic

If you want to go deeper, you may want to build the logic yourself by replicating the functionality of the html2json function

import requests
from bs4 import BeautifulSoup
from html2json.parser import ParserOptions, Html2JsonParser
VALID_SCRAPPING_SITE = "https://www.scrapethissite.com/pages/simple/"

def step_by_step_usage():
    # We get the HTML content using Requests
    html_content = requests.get(VALID_SCRAPPING_SITE).content
    # We instantiate a BeautifulSoup object using the content
    soup_instance = BeautifulSoup(html_content, 'html.parser')
    # We get a Html2JsonParser instance, prepared to return the
    # data in a List, injecting a BeautifulSoup instance
    parser = Html2JsonParser(
        soup_instance=soup_instance,
        **ParserOptions.parser_factory(
            store_as_list=True,
            store_as_dict=False,
            store_as_tree_dict=False
        ).as_dict()
    )
    # We tell the parser to process the information;
    # if this method isn't called, the content won't be processed
    parser.process_parser()

    # We extract the query selectors that we collected before
    query_selectors = parser.query_selectors
    # We extract the list of all the nodes that we collected before
    process_list = parser.as_list()

    # We return the data
    return query_selectors, process_list

# Show the first 5 nodes that we collected
_, process_list = step_by_step_usage()
print(process_list[:5])

Let's use the library in a real-world case with Pandas

In this example we will extract the information with the library, load the dictionary into a Pandas DataFrame, apply some filters, and then store the information in CSV format.

import requests
from bs4 import BeautifulSoup
import pandas as pd
from html2json.parser import ParserOptions, Html2JsonParser
VALID_SCRAPPING_SITE = "https://www.scrapethissite.com/pages/simple/"

def get_pandas_info():
    # We get the HTML content using Requests
    html_content = requests.get(VALID_SCRAPPING_SITE).content
    # We instantiate a BeautifulSoup object using the content
    soup_instance = BeautifulSoup(html_content, 'html.parser')
    # We get a Html2JsonParser instance, prepared to return the
    # data in a List, injecting a BeautifulSoup instance
    parser = Html2JsonParser(
        soup_instance=soup_instance,
        **ParserOptions.parser_factory(
            store_as_list=True,
            store_as_dict=False,
            store_as_tree_dict=False
        ).as_dict()
    )
    # We tell the parser to process the information;
    # if this method isn't called, the content won't be processed
    parser.process_parser()

    # We extract the list of all the nodes that we collected before
    process_list = parser.as_list()

    # Based on the process list we build a Pandas DataFrame
    df = pd.DataFrame(process_list)

    # We know that all the information we want is in h3 tags,
    # so we apply a filter to keep only those rows
    df = df[df['tag'] == 'h3']

    # Each node in the JSON provides the columns node_id, tag, children,
    # attrs, query_selector and content; we keep only node_id, tag and content
    df = df[['node_id', 'tag', 'content']]

    # Return the generated CSV using the Pandas to_csv method
    return df.to_csv()

csv_data = get_pandas_info()
print(csv_data)
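The same filter-and-project pattern can be exercised without any network access. A minimal sketch over hypothetical node dictionaries shaped like the ones described above (node_id, tag, content):

```python
import pandas as pd

# Hypothetical nodes mimicking the shape of parser.as_list() output
nodes = [
    {"node_id": 0, "tag": "h1", "content": "Countries of the World"},
    {"node_id": 1, "tag": "h3", "content": "Andorra"},
    {"node_id": 2, "tag": "p", "content": "Capital: Andorra la Vella"},
    {"node_id": 3, "tag": "h3", "content": "France"},
]

df = pd.DataFrame(nodes)
df = df[df["tag"] == "h3"]              # keep only the h3 rows
df = df[["node_id", "tag", "content"]]  # project the columns we care about
print(df.to_csv(index=False))
```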

Changelog

  • 0.0.3
    • Modified the internal logic to allow dependency injection
    • If a BeautifulSoup object is injected, html_content is not required
    • If a BeautifulSoup object is injected, the library won't analyze html_content, because it is None
    • If neither a BeautifulSoup object nor html_content is provided, an Html2JsonEmptyBody exception is raised
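The injection rules above can be sketched in isolation. The class and function below are hypothetical stand-ins, not the library's actual code; they only mimic the behavior the changelog describes:

```python
from bs4 import BeautifulSoup

class Html2JsonEmptyBody(Exception):
    """Hypothetical stand-in for the library's exception."""

def build_soup(soup_instance=None, html_content=None):
    # Mimics the 0.0.3 injection rules: an injected BeautifulSoup
    # instance wins, otherwise html_content is parsed, otherwise
    # an Html2JsonEmptyBody exception is raised.
    if soup_instance is not None:
        return soup_instance
    if html_content is not None:
        return BeautifulSoup(html_content, "html.parser")
    raise Html2JsonEmptyBody("no BeautifulSoup instance and no html_content")

print(build_soup(html_content="<p>hi</p>").p.get_text())  # hi
```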

TODOS

  • Improve the Readme to make it easier to understand
  • Improve the abstractions so that specific steps are easier to modify
  • Add docstrings
  • Include more tests
  • Avoid repeating node content by extending the bs4.element.Tag class

Project details


Download files

Download the file for your platform.

Source Distribution

bs4_html2json-0.0.3.tar.gz (7.7 kB)

Uploaded Source

Built Distribution

bs4_html2json-0.0.3-py3-none-any.whl (7.2 kB)

Uploaded Python 3

File details

Details for the file bs4_html2json-0.0.3.tar.gz.

File metadata

  • Download URL: bs4_html2json-0.0.3.tar.gz
  • Upload date:
  • Size: 7.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for bs4_html2json-0.0.3.tar.gz
Algorithm Hash digest
SHA256 226e435c2acf6f84b15b6b9bbe4c02883c3f09c115554e45d6041ef61be1a4ee
MD5 d41f578a9db6b42ec562684e3b20fe60
BLAKE2b-256 bcd1b669f65737009c57ebc947bd846edd88a385bf64fe2a7fd23dd532ff969f


File details

Details for the file bs4_html2json-0.0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for bs4_html2json-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 e50dbbb2dedbe83cda0eb1b7347a7c41c1d9f6a2da7592c1b3b1a0addd0119e5
MD5 fdc8e7d2fcd5a39f3a697f9e5b0e021a
BLAKE2b-256 d960590a44699efce20c54c439f128985d38cafcf00ea9a2a0d8edca12e54b21

