lxml to pandas for fast web scraping
Tested against Windows / Python 3.11 / Anaconda
pip install lxml2pandas
from lxml2pandas import subprocess_parsing
from PrettyColorPrinter import add_printer
add_printer(1)
htmldata = [
    ("bet365", r"C:\Users\hansc\Downloads\bet365 - Apostas Desportivas Online.mhtml"),
    (
        "betano",
        r"C:\Users\hansc\Downloads\Brasil Brasileirão - Série A Apostas - Futebol Odds _ Betano.mhtml",
    ),
    ("sportingbet", r"C:\Users\hansc\Downloads\Apostas Futebol _ Sportingbet.mhtml"),
]
df = subprocess_parsing(
    htmldata,
    chunks=1,
    processes=5,
    fake_header=True,
    print_stdout=True,
    print_stderr=True,
)
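# Select the fixture-list containers by their class attribute value.
firstlist = df.loc[
    df.aa_attr_values.str.contains("ovm-FixtureList ovm-Competition_Fixtures", na=False)
]
for k, i in firstlist.iterrows():
    # Restrict the frame to the elements listed as children of this container.
    df2 = df.loc[df.aa_element_id.isin(i.aa_all_children)]
    # Keep only the team-name and odds elements.
    df3 = df2.loc[
        df2.aa_attr_values.isin(
            ["ovm-FixtureDetailsTwoWay_TeamName", "ovm-ParticipantOddsOnly_Odds"]
        )
    ]
    # Split into chunks of 5 rows: 2 team names + 3 odds per fixture.
    df4 = [df3.iloc[x : x + 5] for x in range(0, len(df3), 5)]
    for dframe in df4:
        if len(dframe) == 5:
            dfc = dframe["aa_attr_values"].value_counts()
            try:
                if (
                    dfc.loc["ovm-FixtureDetailsTwoWay_TeamName"] == 2
                    and dfc.loc["ovm-ParticipantOddsOnly_Odds"] == 3
                ):
                    print(dframe)
            except Exception:
                # A chunk that lacks one of the two labels is skipped.
                pass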
# Collect the children of every event row in the grid.
chi = df.loc[df.aa_attr_values == "events-list__grid__event"].aa_all_children
for c in chi:
    # Within each event, keep the "betano" spans that hold an odd or a participant name.
    (
        df.loc[
            (df.aa_element_id.isin(c))
            & (df.aa_doc_id == "betano")
            & (
                (
                    (df.aa_tag == "span")
                    & (df.aa_attr_values == "selections__selection__odd")
                )
                | (
                    (df.aa_tag == "span")
                    & (df.aa_attr_values.str.contains("participant-name", na=False))
                )
            )
        ]
    ).ds_color_print_all()
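
The two loops above share one pattern: look up a container row, select its descendants with isin() on aa_all_children, then filter by attribute value. The sketch below replays that pattern on a small hand-made DataFrame so it can be run without lxml2pandas or any saved page. The rows, the class names ("fixture-list", "team-name", "odds"), the aa_text column, and the complete_fixtures helper are all assumptions made up for illustration, not part of the library's documented output. It also swaps the try/except for Series.get(), which returns a default when a label is missing.

```python
import pandas as pd

# Synthetic stand-in for the parsed frame. Column names mirror the ones used
# above, except aa_text, which is a hypothetical column added for readability.
df = pd.DataFrame(
    {
        "aa_element_id": [0, 1, 2, 3, 4, 5, 6, 7],
        "aa_tag": ["div", "div", "div", "span", "span", "span", "div", "div"],
        "aa_attr_values": [
            "fixture-list",
            "team-name",
            "team-name",
            "odds",
            "odds",
            "odds",
            "team-name",
            "footer",
        ],
        "aa_text": ["", "Team A", "Team B", "1.80", "3.40", "4.20", "Team C", ""],
        "aa_all_children": [(1, 2, 3, 4, 5, 6), (), (), (), (), (), (), ()],
    }
)


def complete_fixtures(frame, container_class, size=5):
    """Return the chunks of `size` rows that hold 2 team names and 3 odds."""
    fixtures = []
    containers = frame.loc[frame.aa_attr_values == container_class]
    for children in containers.aa_all_children:
        # Keep only the descendants of this container that carry useful data.
        inside = frame.loc[frame.aa_element_id.isin(children)]
        wanted = inside.loc[inside.aa_attr_values.isin(["team-name", "odds"])]
        for start in range(0, len(wanted), size):
            chunk = wanted.iloc[start : start + size]
            counts = chunk.aa_attr_values.value_counts()
            # .get() falls back to 0 for a missing label, so no try/except.
            if counts.get("team-name", 0) == 2 and counts.get("odds", 0) == 3:
                fixtures.append(chunk)
    return fixtures


fixtures = complete_fixtures(df, "fixture-list")
for chunk in fixtures:
    print(chunk[["aa_tag", "aa_attr_values", "aa_text"]])
```

The trailing "Team C" row ends up alone in the second chunk and is dropped, which is the same guard the len(dframe) == 5 check provides in the original loop.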