a-pandas-ex-bs4df

One-line web scraping by combining pandas and BeautifulSoup4


Project description

Check out the video: https://www.youtube.com/watch?v=pvnODvnMyrg

Code from the video

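pip install a-pandas-ex-bs4df

from a_pandas_ex_bs4df import pd_add_bs4_to_df
import pandas as pd

# Call once so that pd.Q_bs4_to_df becomes available
pd_add_bs4_to_df()

from PrettyColorPrinter import add_printer  # optional
add_printer(True)  # optional

# Download the page and parse it into a DataFrame
df = pd.Q_bs4_to_df(r'https://github.com/search?l=Python&q=python&type=Repositories')

# Rows that have an href and whose attribute values contain 'middle'
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)]

# Same rows, calling the callable stored in ff_fetchParents for each one
df.loc[(~df.bb_href.isna()) & df.aa_attrs_values.str.contains('middle', regex=False, na=False)].ff_fetchParents.apply(lambda x: x())

# Rows that have a src that does not end in .png
df.loc[(~df.bb_src.isna()) & (~df.bb_src.str.contains(r'\.png$', regex=True, na=False))]

# Rows that have a src that ends in .png
df.loc[(~df.bb_src.isna()) & (df.bb_src.str.contains(r'\.png$', regex=True, na=False))]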
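
The filters above all follow one pattern: a "value is present" mask combined with a substring or regex test. Below is a minimal sketch of that pattern on a hand-made DataFrame, so it runs without network access and without the package installed. The rows are invented for illustration; only the column names bb_href, bb_src and aa_attrs_values are taken from the code above.

```python
import pandas as pd

# Hypothetical rows standing in for a scraped page; only the column
# names come from the example above.
df = pd.DataFrame(
    {
        "bb_href": ["/psf/requests", None, "/pallets/flask", None],
        "bb_src": [None, "https://example.com/logo.png", None, "https://example.com/app.js"],
        "aa_attrs_values": ["v-align-middle", "avatar", "link", None],
    }
)

# Rows that have an href and whose attribute values contain "middle".
# na=False makes missing attribute values count as "no match".
has_href = df.bb_href.notna()
has_middle = df.aa_attrs_values.str.contains("middle", regex=False, na=False)
links = df.loc[has_href & has_middle]

# Rows that have a src, split by whether it ends in ".png".
has_src = df.bb_src.notna()
is_png = df.bb_src.str.contains(r"\.png$", regex=True, na=False)
png_rows = df.loc[has_src & is_png]
other_rows = df.loc[has_src & ~is_png]

print(links.bb_href.tolist())
print(png_rows.bb_src.tolist())
print(other_rows.bb_src.tolist())
```

Passing na=False matters here: without it, missing values produce NaN in the mask, and combining or negating such a mask can fail or select the wrong rows.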

Parameters:
    htmlcode: Union[str, bytes]
        File path, URL, or HTML source code.
        URLs are downloaded with requests.
    dontuse: tuple
        bs4 attributes to exclude from the DataFrame.
        default = (
            "element_classes",
            "builder",
            "is_xml",
            "known_xml",
            "_namespaces",
            "parse_only",
            "markup",
            "contains_replacement_characters",
            "original_encoding",
            "declared_html_encoding",
            "parser_class",
            "namespace",
            "prefix",
            "cdata_list_attributes",
            "preserve_whitespace_tag_stack",
            "open_tag_counter",
            "preserve_whitespace_tags",
            "interesting_string_types",
            "current_data",
            "string_container_stack",
            "_most_recent_element",
            "currentTag",
        )
    parser: str
        See the bs4 documentation for the available parsers.
        (default='lxml')
    tags_to_find: Union[bool, str]
        Passed to soup.find_all(); see the bs4 documentation.
        (default=True)  # everything

Returns:
    df: pd.DataFrame
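
Since tags_to_find is passed to soup.find_all(), it may help to see what that call does in plain BeautifulSoup. The sketch below shows only bs4 behaviour, not the package's internals. The HTML snippet is invented, and "html.parser" is used instead of the documented default 'lxml' only so that the example needs no extra install.

```python
from bs4 import BeautifulSoup

# Invented HTML snippet, for illustration only.
html = """
<html><body>
  <a href="/first" class="link">First</a>
  <p>Some text <a href="/second">Second</a></p>
  <img src="logo.png">
</body></html>
"""

# "html.parser" ships with Python; "lxml" has to be installed separately.
soup = BeautifulSoup(html, "html.parser")

# True matches every tag in the document, in document order.
all_tags = soup.find_all(True)

# A tag name restricts the result to tags with that name.
only_links = soup.find_all("a")

tag_names = [tag.name for tag in all_tags]
hrefs = [tag.get("href") for tag in only_links]

print(tag_names)
print(hrefs)
```

With the default tags_to_find=True the resulting DataFrame can get large on real pages, so passing a tag name such as "a" is a reasonable way to narrow it down when only one kind of element is needed.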


Release history

0.12 (Oct 27, 2023)
0.11 (Oct 12, 2023)
0.10 (Oct 29, 2022), this version
