lxml to pandas for fast web scraping
Tested against Windows / Python 3.11 / Anaconda
pip install lxml2pandas
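from lxml2pandas import subprocess_parsing

# tmpfiles holds the input to parse; it must be defined before this call.
# The filter is true when the lowercased attribute values contain 'odd' or 'team'.
df = subprocess_parsing(
    tmpfiles,
    chunks=1,
    processes=4,
    print_stdout=True,
    print_stderr=True,
    print_function_exceptions=True,
    children_and_parents=True,
    allowed_tags=('span',),
    filter_function=lambda x: 'odd' in (gh := str(x.aa_attr_values).lower()) or 'team' in gh,
)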

# Same filter on 'span' tags, this time with children_and_parents=False.
df = subprocess_parsing(
    tmpfiles,
    chunks=1,
    processes=4,
    print_stdout=True,
    print_stderr=True,
    print_function_exceptions=True,
    children_and_parents=False,
    allowed_tags=('span',),
    filter_function=lambda x: 'odd' in (gh := str(x.aa_attr_values).lower()) or 'team' in gh,
    # Alternative filter:
    # lambda x: 'yt' in str(x.aa_attr_values).lower()
)
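
# allowed_tags is empty; 'div', 'body' and 'html' are listed in forbidden_tags.
# The filter is a case-sensitive check for 'Odd' in aa_html.
df = subprocess_parsing(
    tmpfiles,
    chunks=1,
    processes=4,
    print_stdout=True,
    print_stderr=True,
    children_and_parents=True,
    allowed_tags=(),
    forbidden_tags=['div', 'body', 'html'],
    filter_function=lambda x: 'Odd' in str(x.aa_html),
)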

# 'div' tags only; the filter is true when the lowercased attribute values contain 'team'.
df = subprocess_parsing(
    tmpfiles,
    chunks=1,
    processes=4,
    print_stdout=True,
    print_stderr=True,
    print_function_exceptions=True,
    children_and_parents=False,
    allowed_tags=('div',),
    filter_function=lambda x: 'team' in str(x.aa_attr_values).lower(),
    # Alternative filters:
    # lambda x: 'yt' in str(x.aa_attr_values).lower()
    # lambda x: x.aa_tag == 'div' or str(x.aa_attr_keys) in ['class', 'href'] or 'mais' in str(x.aa_text)
)
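
# parsedata is the input here; like tmpfiles, it must be defined before this call.
# The filter is true when the lowercased attribute values contain the letter 't'.
df = subprocess_parsing(
    parsedata,
    chunks=1,
    processes=4,
    print_stdout=True,
    print_stderr=True,
    print_function_exceptions=True,
    children_and_parents=True,
    allowed_tags=(),
    forbidden_tags=('html', 'body'),
    filter_function=lambda x: 't' in str(x.aa_attr_values).lower(),
)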
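
Every filter_function above is a predicate over string attributes of the object it receives (aa_attr_values, aa_html, aa_tag and so on). As a rough sketch of that predicate logic in isolation, the snippet below runs the 'odd' / 'team' filter against stand-in objects. The SimpleNamespace rows and their values are made up for illustration, and the shape of the object that lxml2pandas really passes to the filter is an assumption here, not something this page confirms.

```python
from types import SimpleNamespace


def odd_or_team(x):
    # Same condition as the first filter_function above, written as a
    # named function: lowercase the attribute values, then test substrings.
    values = str(x.aa_attr_values).lower()
    return 'odd' in values or 'team' in values


# Hypothetical stand-ins for whatever object the library hands to the filter.
rows = [
    SimpleNamespace(aa_attr_values=('Odds-Cell',)),
    SimpleNamespace(aa_attr_values=('team-name', 'bold')),
    SimpleNamespace(aa_attr_values=('footer',)),
]

matches = [odd_or_team(row) for row in rows]
print(matches)  # [True, True, False]
```

Trying a predicate this way before passing it as filter_function makes it easier to see which attribute strings it matches, without starting the worker processes.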