One-line-web-scraping by combining pandas and BeautifulSoup4

Check out the video
Code from the video
pip install a-pandas-ex-bs4df

from a_pandas_ex_bs4df import pd_add_bs4_to_df
import pandas as pd

# Makes pd.Q_bs4_to_df available
pd_add_bs4_to_df()

from PrettyColorPrinter import add_printer  # optional
add_printer(True)  # optional

# Download the page and parse it into a DataFrame
df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')

# Rows that have an href and whose attribute values contain 'middle'
middle_links = df.bb_href.notna() & df.aa_attrs_values.str.contains('middle', regex=False, na=False)
df.loc[middle_links]

# Call the callable stored in ff_fetchParents for each of those rows
df.loc[middle_links].ff_fetchParents.apply(lambda x: x())

# Rows that have a src, split by whether the src ends in .png
has_src = df.bb_src.notna()
is_png = df.bb_src.str.contains(r'\.png$', regex=True, na=False)
df.loc[has_src & ~is_png]
df.loc[has_src & is_png]
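
The `bb_src` filters above combine a not-null check with a regex match. The sketch below reproduces only that filtering logic on a small hand-made DataFrame, so it runs without downloading anything. The column name `bb_src` is taken from the example above; the values are invented for illustration, and the real DataFrame returned by the package has more columns.

```python
import pandas as pd

# Hand-made stand-in for the scraped DataFrame; the values are invented.
df = pd.DataFrame(
    {"bb_src": ["logo.png", "photo.jpg", None, "icon.PNG", "banner.png?v=2"]}
)

# True for rows that actually have a src value
has_src = df["bb_src"].notna()

# na=False makes missing values count as "no match" instead of propagating NaN
is_png = df["bb_src"].str.contains(r"\.png$", regex=True, na=False)

png_rows = df.loc[has_src & is_png]
other_rows = df.loc[has_src & ~is_png]

print(png_rows)
print(other_rows)
```

The pattern is case-sensitive and anchored at the end of the string, so `icon.PNG` and `banner.png?v=2` land in `other_rows`.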
Parameters:
    htmlcode: Union[str, bytes]
        File path, URL, or HTML source code.
        URLs are downloaded with requests.
    dontuse: tuple
        bs4 attributes to exclude from the DataFrame
        default = (
        "element_classes",
        "builder",
        "is_xml",
        "known_xml",
        "_namespaces",
        "parse_only",
        "markup",
        "contains_replacement_characters",
        "original_encoding",
        "declared_html_encoding",
        "parser_class",
        "namespace",
        "prefix",
        "cdata_list_attributes",
        "preserve_whitespace_tag_stack",
        "open_tag_counter",
        "preserve_whitespace_tags",
        "interesting_string_types",
        "current_data",
        "string_container_stack",
        "_most_recent_element",
        "currentTag",
    )
    parser: str
        bs4 parser to use; see the bs4 documentation
        (default='lxml')
    tags_to_find: Union[bool, str] = True
        Passed to soup.find_all(); see the bs4 documentation
        (default=True, which matches every tag)
Returns:
    df: pd.DataFrame
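
Since `tags_to_find` is passed to `soup.find_all()`, it may help to see what that call returns for `True` versus a tag name. The sketch below calls BeautifulSoup directly rather than this package, and it uses the built-in `html.parser` instead of the package default `lxml` to avoid an extra dependency. The HTML snippet is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented snippet, only for demonstrating find_all()
html = """
<div class="middle"><a href="/repo">repo</a><img src="logo.png"></div>
<p>plain text</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag in document order
all_tags = [tag.name for tag in soup.find_all(True)]

# find_all("a") matches only <a> tags
only_links = [tag.get("href") for tag in soup.find_all("a")]

print(all_tags)
print(only_links)
```

Presumably a tag name given as `tags_to_find` narrows the resulting DataFrame in the same way, though that depends on how the package uses the result.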
