One-line web scraping by combining pandas and BeautifulSoup4
Check out the video
<a href="https://www.youtube.com/watch?v=pvnODvnMyrg">
<img src="https://img.youtube.com/vi/pvnODvnMyrg/0.jpg" style="width:100%;">
</a>
Code from the video
pip install a-pandas-ex-bs4df
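from a_pandas_ex_bs4df import pd_add_bs4_to_df
import pandas as pd
# Make pd.Q_bs4_to_df available
pd_add_bs4_to_df()
from PrettyColorPrinter import add_printer  # optional
add_printer(True)  # optional
# Download the page and convert it to a DataFrame
df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')
# Rows that have an href and whose attribute values contain 'middle'
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)]
# Same rows; call the function stored in ff_fetchParents for each of them
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)].ff_fetchParents.apply(lambda x: x())
# Rows that have a src that does not end in .png
df.loc[(~df.bb_src.isna()) & (~df.bb_src.str.contains(r'\.png$', regex=True, na=False))]
# Rows that have a src that ends in .png
df.loc[(~df.bb_src.isna()) & (df.bb_src.str.contains(r'\.png$', regex=True, na=False))]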
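The filters above all follow one pattern: a not-missing check combined with a substring or regex match. Below is a minimal sketch of that pattern on a small hand-made DataFrame, so it runs without network access. The column names bb_href, bb_src and aa_attrs_values are taken from the code above; the sample rows are invented for illustration and are not real output of pd.Q_bs4_to_df.

```python
import pandas as pd

# Hypothetical sample rows; a real frame would come from pd.Q_bs4_to_df(...)
sample = pd.DataFrame(
    {
        "bb_href": ["/a/repo", None, "/b/repo", None],
        "bb_src": [
            None,
            "https://example.com/logo.png",
            None,
            "https://example.com/avatar.jpg",
        ],
        "aa_attrs_values": ["v-align-middle", "avatar", "Link--muted", None],
    }
)

# notna() is the same test as ~isna()
has_href = sample["bb_href"].notna()
# regex=False does a plain substring match; na=False counts missing values as "no match"
has_middle = sample["aa_attrs_values"].str.contains("middle", regex=False, na=False)
links_with_middle = sample.loc[has_href & has_middle]

has_src = sample["bb_src"].notna()
is_png = sample["bb_src"].str.contains(r"\.png$", regex=True, na=False)
png_images = sample.loc[has_src & is_png]
other_images = sample.loc[has_src & ~is_png]

print(links_with_middle.index.tolist())
print(png_images.index.tolist())
print(other_images.index.tolist())
```

Storing each boolean mask in a named variable keeps the conditions readable and lets the .png mask be reused for both the matching and the non-matching selection.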
Parameters:
    htmlcode: Union[str, bytes]
        File path, URL, or HTML source code.
        URLs are downloaded with requests.
    dontuse: tuple
        bs4 attributes to exclude from the DataFrame.
        default = (
            "element_classes",
            "builder",
            "is_xml",
            "known_xml",
            "_namespaces",
            "parse_only",
            "markup",
            "contains_replacement_characters",
            "original_encoding",
            "declared_html_encoding",
            "parser_class",
            "namespace",
            "prefix",
            "cdata_list_attributes",
            "preserve_whitespace_tag_stack",
            "open_tag_counter",
            "preserve_whitespace_tags",
            "interesting_string_types",
            "current_data",
            "string_container_stack",
            "_most_recent_element",
            "currentTag",
        )
    parser: str
        See the bs4 documentation.
        (default='lxml')
    tags_to_find: Union[bool, str] = True
        Passed to soup.find_all(); see the bs4 documentation.
        (default=True)  # everything
Returns:
    df: pd.DataFrame
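
Since tags_to_find is passed to soup.find_all(), its effect can be previewed with BeautifulSoup alone. The sketch below is not the package's code: it calls bs4 directly to show the difference between True (every tag) and a tag name. The HTML snippet and variable names are made up for illustration, and "html.parser" is used instead of the package default 'lxml' only so that the example needs no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for this illustration
html = """
<div class="results">
  <a href="/user/repo1" class="v-align-middle">repo1</a>
  <img src="/static/logo.png">
  <a href="/user/repo2">repo2</a>
</div>
"""

# The second argument selects the parser, as the parser parameter presumably does
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag in document order
all_tags = soup.find_all(True)
# find_all("a") matches only <a> tags
only_links = soup.find_all("a")

print([tag.name for tag in all_tags])
print([tag.get("href") for tag in only_links])
```

Assuming the parameters listed above are accepted as keyword arguments, passing tags_to_find='a' to pd.Q_bs4_to_df would be expected to narrow the resulting DataFrame in the same way.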