A package that allows you to smartly scrape data from a web page and export it to a CSV file.
Smart Web Scraper
This Python script allows you to smartly scrape data from a web page and export it to a CSV file.
It uses the requests, BeautifulSoup, re, and dateutil libraries to retrieve and parse HTML content, extract specific fields, and recognize dates and prices.
Features
- Retrieves HTML content from a specified URL using the requests library.
- Parses the HTML content using BeautifulSoup to extract desired information.
- Extracts dates from text using the extract_date method, leveraging the dateutil library.
- Extracts prices from text using the extract_price method, using regular expressions.
- Exports the extracted data to a CSV file, allowing customization of repeater selector and fields.
- Handles cases where the repeater selector matches no elements or the output file is open in another program.
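As a rough illustration of the regular-expression approach to prices, here is a minimal sketch. The pattern and the standalone function are assumptions made for this example; the package's own extract_price method may use a different pattern and return type.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not taken from the package):
# an optional currency symbol, then digits with optional thousands
# separators, then an optional decimal part.
PRICE_RE = re.compile(r"[$€£]?\s*(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?")


def extract_price(text: str) -> Optional[float]:
    """Return the first price-like number found in text, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")  # drop thousands separators
    fraction = match.group(2) or ""          # keep ".99" if present
    return float(whole + fraction)
```

With this sketch, extract_price("Regular price: $1,299.99") returns 1299.99, and text without any digits returns None.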
Usage
from smartWebScraper import SmartWebScraper
scraper = SmartWebScraper(
# replace URL_TO_SCRAPE with the URL of the page to scrape
url="URL_TO_SCRAPE",
# optional, defaults to True; set to False to leave empty fields blank instead of writing N/A
empty_as_na=False,
# optional, defaults to data.csv
filename='test.csv'
)
# Example fields; replace these with your own.
# fields is a list of tuples:
# 1st element (e.g. Title): the column header in the CSV file
# 2nd element (e.g. h3.bc-heading): the selector for the field; tags, classes, IDs, etc. all work
# 3rd element (e.g. text): what to extract; use text for the element's text, or an attribute name such as href
# 4th element (optional): marks the field as a date or a price, as in the 'date' and 'price' entries below, so the value is extracted automatically
fields = [
('Title', 'h3.bc-heading', 'text'),
('Sub Title', 'li.bc-list-item.subtitle span', 'text'),
('Author', 'span.bc-text.bc-size-small.bc-color-secondary a', 'text'),
('Author Link', 'span.bc-text.bc-size-small.bc-color-secondary a', 'href'),
('Link', 'h3.bc-heading a.bc-link', 'href'),
('Image', '.bc-image-inset-border', 'src'),
('Length', 'li.bc-list-item.runtimeLabel', 'text'),
('Date', 'li.bc-list-item.releaseDateLabel span', 'text', 'date'),
('Language', 'li.bc-list-item.languageLabel', 'text'),
('Price', '.buybox-regular-price', 'text', 'price'),
]
# export_to_csv takes the repeater_selector (the selector matching each repeated element) and the fields defined above
result = scraper.export_to_csv(repeater_selector='li.bc-list-item.productListItem', fields=fields)
print(result)
Output
{'success': True, 'message': 'CSV file test.csv created successfully.'}
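The result dictionary hints at how the export step reports success or failure. The sketch below shows one way such a step could work, using only the standard csv module. The function name write_rows, its parameters, and both failure messages are assumptions for illustration, not the package's actual code. Only the success message mirrors the output shown above.

```python
import csv
from typing import Dict, List, Optional


def write_rows(rows: List[Dict[str, Optional[str]]], headers: List[str],
               filename: str = "data.csv", empty_as_na: bool = True) -> dict:
    """Write rows to filename and report the outcome as a result dict."""
    if not rows:
        # Assumed behaviour when the repeater selector matched nothing.
        return {"success": False,
                "message": "No elements matched the repeater selector."}
    placeholder = "N/A" if empty_as_na else ""
    try:
        with open(filename, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(headers)
            for row in rows:
                # Missing or empty values become the placeholder.
                writer.writerow([row.get(h) or placeholder for h in headers])
    except PermissionError:
        # Typically raised when the file is open in another program.
        return {"success": False,
                "message": f"Could not write {filename}; it may be open in another program."}
    return {"success": True,
            "message": f"CSV file {filename} created successfully."}
```

Returning a dictionary instead of raising lets the caller print or log the outcome directly, as the usage example does with print(result).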
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on GitHub.
License
This project is licensed under the MIT License.