One line web scraping by combining pandas and BeautifulSoup4

Check out the video: https://www.youtube.com/watch?v=pvnODvnMyrg

Code from the video
pip install a-pandas-ex-bs4df

from a_pandas_ex_bs4df import pd_add_bs4_to_df
import pandas as pd

pd_add_bs4_to_df()  # must run before pd.Q_bs4_to_df is used

from PrettyColorPrinter import add_printer  # optional
add_printer(True)  # optional

# Download the page and parse it into a DataFrame
df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')

# Rows that have an href and whose attribute values contain 'middle'
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)]

# Same rows; call the callable stored in ff_fetchParents for each of them
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)].ff_fetchParents.apply(lambda x: x())

# Rows that have a src that does not end in .png
df.loc[(~df.bb_src.isna()) & (~df.bb_src.str.contains(r'\.png$', regex=True, na=False))]

# Rows that have a src that ends in .png
df.loc[(~df.bb_src.isna()) & (df.bb_src.str.contains(r'\.png$', regex=True, na=False))]

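The filters above are ordinary pandas boolean masks, so they can be tried without scraping anything. The following is a minimal sketch on a hand-made frame: only the column names (aa_attrs_values, bb_href, bb_src) are taken from the code above, while the rows and the helper variable names are invented for illustration and do not reflect what the package returns for a real page.

```python
import pandas as pd

# Hypothetical stand-in for a scraped frame. Only the column names come
# from the example above; the values are made up for illustration.
df = pd.DataFrame(
    {
        "aa_attrs_values": ["v-align-middle", "avatar", None, "logo"],
        "bb_href": ["/psf/requests", None, None, None],
        "bb_src": [None, "https://example.com/a.png", None, "https://example.com/b.svg"],
    }
)

# Links: href is present and the attribute values contain the literal 'middle'.
has_href = df.bb_href.notna()
is_middle = df.aa_attrs_values.str.contains("middle", regex=False, na=False)
links = df.loc[has_href & is_middle]

# Images: src is present, split by whether it ends in '.png'.
has_src = df.bb_src.notna()
is_png = df.bb_src.str.contains(r"\.png$", regex=True, na=False)
png_images = df.loc[has_src & is_png]
other_images = df.loc[has_src & ~is_png]

print(links)
print(png_images)
print(other_images)
```

Passing na=False matters here: rows without a value would otherwise produce missing entries in the mask instead of False.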
Parameters:
    htmlcode: Union[str, bytes]
        File path, URL, or HTML source code.
        URLs are downloaded with requests.
    dontuse: tuple
        bs4 attributes to exclude from the DataFrame.
        default = (
            "element_classes",
            "builder",
            "is_xml",
            "known_xml",
            "_namespaces",
            "parse_only",
            "markup",
            "contains_replacement_characters",
            "original_encoding",
            "declared_html_encoding",
            "parser_class",
            "namespace",
            "prefix",
            "cdata_list_attributes",
            "preserve_whitespace_tag_stack",
            "open_tag_counter",
            "preserve_whitespace_tags",
            "interesting_string_types",
            "current_data",
            "string_container_stack",
            "_most_recent_element",
            "currentTag",
        )
    parser: str
        See the bs4 documentation for the available parsers.
        (default='lxml')
    tags_to_find: Union[bool, str] = True
        Passed to soup.find_all(); see the bs4 documentation.
        (default=True)  # matches every tag
Returns:
    df: pd.DataFrame
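
Since tags_to_find is handed to soup.find_all(), its effect can be previewed with BeautifulSoup alone. The sketch below is not the package's implementation; the HTML snippet is invented, and it uses the built-in "html.parser" instead of the default 'lxml' only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Invented sample markup, for illustration only.
html = '<div><a href="/x" class="link">x</a><img src="a.png"><p>text</p></div>'

# "html.parser" ships with Python; the package default is 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# True matches every tag in the document, in document order.
all_tags = soup.find_all(True)
names = [tag.name for tag in all_tags]

# A tag name restricts the result to that kind of element.
only_links = soup.find_all("a")

print(names)
print(only_links[0]["href"])
```

Passing a narrower value such as "a" or "img" should therefore give a smaller DataFrame when only one kind of element is needed.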