One line web scraping by combining pandas and BeautifulSoup4
Check out the video
<a href="https://www.youtube.com/watch?v=pvnODvnMyrg">
<img src="https://img.youtube.com/vi/pvnODvnMyrg/0.jpg" style="width:100%;">
</a>
Code from the video
pip install a-pandas-ex-bs4df

import pandas as pd
from a_pandas_ex_bs4df import pd_add_bs4_to_df
from PrettyColorPrinter import add_printer  # optional

pd_add_bs4_to_df()  # makes pd.Q_bs4_to_df available
add_printer(True)  # optional

df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')
# Elements that have an href and whose attribute values contain 'middle'
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)]
# Same rows, calling the callables stored in ff_fetchParents
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)].ff_fetchParents.apply(lambda x: x())
# Elements with a src that does not end in .png
df.loc[(~df.bb_src.isna()) & (~df.bb_src.str.contains(r'\.png$', regex=True, na=False))]
# Elements with a src that ends in .png
df.loc[(~df.bb_src.isna()) & (df.bb_src.str.contains(r'\.png$', regex=True, na=False))]
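The filters above all follow the same pattern: one boolean mask for "the attribute is present" combined with one for "the value matches". The sketch below shows that pattern on a small hand-built DataFrame so it runs without network access. Only the column names bb_href and bb_src come from the snippet above; the rows are made up for illustration, and the real frame has many more columns.

```python
import pandas as pd

# Stand-in for the scraped frame. The rows are invented; only the
# column names bb_href and bb_src are taken from the snippet above.
df = pd.DataFrame(
    {
        "bb_href": ["/a/repo", None, "/b/repo", None],
        "bb_src": [
            None,
            "https://example.com/logo.png",
            None,
            "https://example.com/avatar.jpg",
        ],
    }
)

# Mask 1: the element has a src attribute at all.
has_src = df.bb_src.notna()
# Mask 2: the src ends in .png; na=False turns missing values into False.
is_png = df.bb_src.str.contains(r"\.png$", regex=True, na=False)

png_images = df.loc[has_src & is_png]
other_images = df.loc[has_src & ~is_png]
links = df.loc[df.bb_href.notna()]

print(png_images)
print(other_images)
print(links)
```

Naming the masks first keeps each filter short and makes it easy to reuse a mask with its negation, as the two src filters do.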
Parameters:
    htmlcode: Union[str, bytes]
        file path, url or html source code
        urls will be downloaded with requests
    dontuse: tuple
        bs4 attributes to exclude from the dataframe
        default = (
            "element_classes",
            "builder",
            "is_xml",
            "known_xml",
            "_namespaces",
            "parse_only",
            "markup",
            "contains_replacement_characters",
            "original_encoding",
            "declared_html_encoding",
            "parser_class",
            "namespace",
            "prefix",
            "cdata_list_attributes",
            "preserve_whitespace_tag_stack",
            "open_tag_counter",
            "preserve_whitespace_tags",
            "interesting_string_types",
            "current_data",
            "string_container_stack",
            "_most_recent_element",
            "currentTag",
        )
    parser: str
        Have a look at the bs4 documentation
        (default='lxml')
    tags_to_find: Union[bool, str] = True
        will be passed to soup.find_all()
        Have a look at the bs4 documentation
        (default=True)  # everything
Returns:
    df: pd.DataFrame
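To illustrate how the parser and tags_to_find parameters relate to bs4, here is a minimal sketch that turns the result of soup.find_all() into a DataFrame. It is not the package's implementation: the helper name tags_to_frame and the output columns (name, attrs, href, src, text) are hypothetical, and the sketch defaults to the standard-library "html.parser" so it runs without lxml, whereas the package documents 'lxml' as its default.

```python
import pandas as pd
from bs4 import BeautifulSoup


def tags_to_frame(html, parser="html.parser", tags_to_find=True):
    """Hypothetical helper: one DataFrame row per tag matched by find_all()."""
    soup = BeautifulSoup(html, parser)
    rows = []
    # find_all(True) matches every tag; a string such as "a" matches by tag name.
    for tag in soup.find_all(tags_to_find):
        rows.append(
            {
                "name": tag.name,
                "attrs": dict(tag.attrs),
                "href": tag.get("href"),  # None when the attribute is absent
                "src": tag.get("src"),
                "text": tag.get_text(strip=True),
            }
        )
    return pd.DataFrame(rows, columns=["name", "attrs", "href", "src", "text"])


sample_html = '<div class="box"><a href="/x">X</a><img src="/logo.png"></div>'

frame = tags_to_frame(sample_html)  # every tag
only_links = tags_to_frame(sample_html, tags_to_find="a")  # only <a> tags

print(frame)
print(only_links)
```

Passing True collects every element, which matches the "everything" note on the default; passing a tag name narrows the frame before any pandas filtering is needed.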