bs4 to pd.DataFrame
Tested against Windows / Python 3.11 / Anaconda
pip install bs42frame
Parse HTML content and extract information using BeautifulSoup.
This function takes HTML content as input, parses it using BeautifulSoup, and extracts
information about the HTML structure, tag attributes, tag text, and the BeautifulSoup
object for each element found in the HTML.
Args:
html (str, bytes, or file path): The HTML content to be parsed. It can be provided as
a string, bytes, or a file path. If a file path is provided, the function will
attempt to read the file.
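The three accepted input forms can be told apart before parsing. The following is only a minimal sketch of that idea using the standard library; `load_html` is a hypothetical helper name, and the UTF-8 decoding with replacement is an assumption, not necessarily what `parse_html` does internally.

```python
import os


def load_html(html):
    """Hypothetical helper: return HTML text from a str, bytes, or file path."""
    # Raw bytes: decode them (UTF-8 with replacement is an assumption of this sketch).
    if isinstance(html, bytes):
        return html.decode("utf-8", errors="replace")
    # A string that names an existing file: read the file content.
    if isinstance(html, str) and os.path.isfile(html):
        with open(html, "r", encoding="utf-8", errors="replace") as fh:
            return fh.read()
    # Anything else is treated as HTML markup already.
    return str(html)


print(load_html(b"<p>hi</p>"))
print(load_html("<p>hi</p>"))
```

Checking for bytes first keeps the file-system lookup limited to real strings, and a string that does not name an existing file simply falls through as markup.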
Returns:
pandas.DataFrame: A DataFrame containing the extracted information from the HTML.
The DataFrame columns include 'aa_tag' (HTML tag name), 'aa_attrs' (list of tag
attributes), 'aa_text' (text content of the tag), 'aa_soup' (BeautifulSoup object
for the tag), 'aa_old_index' (original index of the tag), 'aa_key' (attribute
key), and 'aa_value' (attribute value).
Example:
from bs42frame import parse_html
df = parse_html(
    html=r"C:\Users\hansc\Downloads\Your Repositories.mhtml"
)
#       aa_tag  aa_text            aa_soup                           aa_old_index  aa_key                     aa_value
# 1000  span    Import repository  [\r\n Import repository\r\n\r\n]  274           ActionListItem-label       class
# 1001  li                         []                                275           presentation               role
# 1002  li                         []                                275           true                       aria-hidden
# 1003  li                         []                                275           true                       data-view-component
# 1004  li                         []                                275           ActionList-sectionDivider  class
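In the sample output, the `li` element with `aa_old_index` 275 appears on several rows, one per attribute. The sketch below reproduces that "one row per element and attribute" layout with the standard library's `html.parser`, so it runs without BeautifulSoup or pandas. It is an illustration only: `AttrRowParser` and `attribute_rows` are hypothetical names, the key and value columns follow the Returns description above, and emitting a single empty row for attribute-less elements is an assumption of this sketch, not confirmed library behaviour.

```python
from html.parser import HTMLParser


class AttrRowParser(HTMLParser):
    """Hypothetical stand-in: collect one row per (element, attribute) pair."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._element_index = -1  # running index of start tags seen so far

    def handle_starttag(self, tag, attrs):
        self._element_index += 1
        # Elements without attributes still get one row (assumption of this sketch).
        for key, value in attrs or [(None, None)]:
            self.rows.append(
                {
                    "aa_tag": tag,
                    "aa_old_index": self._element_index,
                    "aa_key": key,
                    "aa_value": value,
                }
            )


def attribute_rows(html_text):
    """Return a list of dicts, one per element attribute found in html_text."""
    parser = AttrRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


rows = attribute_rows('<ul><li role="presentation" aria-hidden="true">x</li></ul>')
for row in rows:
    print(row)
```

A list of dicts in this shape can be handed to `pandas.DataFrame(rows)` if a DataFrame is wanted; the text and soup columns are left out here to keep the sketch short.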