One-line-web-scraping by combining pandas and BeautifulSoup4
Code from the video
pip install a-pandas-ex-bs4df

from a_pandas_ex_bs4df import pd_add_bs4_to_df
import pandas as pd
pd_add_bs4_to_df()  # makes pd.Q_bs4_to_df available
from PrettyColorPrinter import add_printer  # optional
add_printer(True)  # optional
df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')
# Rows that have an href and whose attribute values contain 'middle'
df.loc[df.bb_href.notna() & df.aa_attrs_values.str.contains('middle', regex=False, na=False)]
# Same rows, calling the function stored in ff_fetchParents for each one
df.loc[df.bb_href.notna() & df.aa_attrs_values.str.contains('middle', regex=False, na=False)].ff_fetchParents.apply(lambda x: x())
# Rows that have a src not ending in .png
df.loc[df.bb_src.notna() & ~df.bb_src.str.contains(r'\.png$', regex=True, na=False)]
# Rows that have a src ending in .png
df.loc[df.bb_src.notna() & df.bb_src.str.contains(r'\.png$', regex=True, na=False)]
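The filters above all follow one pattern: a not-null check combined with a substring or regex match. The sketch below shows that pattern on a small hand-built DataFrame. The rows are invented for illustration, and only the column names bb_href, bb_src and aa_attrs_values come from the code above. It does not call pd.Q_bs4_to_df and says nothing about which other columns the real DataFrame has.

```python
import pandas as pd

# Hypothetical rows standing in for a scraped page; all values are invented.
df = pd.DataFrame(
    {
        "bb_href": ["/a/repo", None, "/b/repo", None],
        "bb_src": [
            None,
            "https://example.com/logo.png",
            None,
            "https://example.com/app.js",
        ],
        "aa_attrs_values": ["v-align-middle", "avatar", "link", None],
    }
)

# Rows that have an href and whose attribute values contain 'middle'.
# na=False makes missing values count as "no match" instead of NaN.
links = df.loc[
    df.bb_href.notna()
    & df.aa_attrs_values.str.contains("middle", regex=False, na=False)
]

# Split rows that have a src into those ending in .png and all others.
has_src = df.bb_src.notna()
is_png = df.bb_src.str.contains(r"\.png$", regex=True, na=False)
png_rows = df.loc[has_src & is_png]
other_src_rows = df.loc[has_src & ~is_png]

print(links.bb_href.tolist())
print(png_rows.bb_src.tolist())
print(other_src_rows.bb_src.tolist())
```

Without na=False, str.contains returns missing values for rows with no string, and combining such a mask with & can raise or behave unexpectedly, which is likely why the original filters set it.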
Parameters:
    htmlcode: Union[str, bytes]
        File path, URL, or HTML source code.
        URLs will be downloaded with requests.
    dontuse: tuple
        bs4 attributes to exclude from the DataFrame.
        default = (
            "element_classes",
            "builder",
            "is_xml",
            "known_xml",
            "_namespaces",
            "parse_only",
            "markup",
            "contains_replacement_characters",
            "original_encoding",
            "declared_html_encoding",
            "parser_class",
            "namespace",
            "prefix",
            "cdata_list_attributes",
            "preserve_whitespace_tag_stack",
            "open_tag_counter",
            "preserve_whitespace_tags",
            "interesting_string_types",
            "current_data",
            "string_container_stack",
            "_most_recent_element",
            "currentTag",
        )
    parser: str
        Have a look at the bs4 documentation.
        (default='lxml')
    tags_to_find: Union[bool, str] = True
        Will be passed to soup.find_all().
        Have a look at the bs4 documentation.
        (default=True)  # everything
Returns:
    df: pd.DataFrame
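Because tags_to_find is passed to soup.find_all(), the difference between the default True and a tag name can be seen with BeautifulSoup alone. The sketch below is not the package's implementation. The HTML snippet is made up, and it uses the built-in html.parser rather than the default lxml so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet for illustration.
html = """
<div class="box">
  <a href="/first">First</a>
  <img src="/logo.png">
  <a href="/second">Second</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag in the document.
all_tags = [tag.name for tag in soup.find_all(True)]

# find_all("a") matches only <a> tags.
link_hrefs = [tag.get("href") for tag in soup.find_all("a")]

print(all_tags)
print(link_hrefs)
```

Passing a tag name such as "a" should therefore narrow the resulting DataFrame to the elements of interest, while the default True keeps every tag on the page.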