Smart Web Scraper

This Python package allows you to smartly scrape data from a web page and export it to a CSV file.

It uses the requests, BeautifulSoup, re, and dateutil libraries to retrieve and parse HTML content, extract specific fields, and handle dates and prices.

Features

  • Retrieves HTML content from a specified URL using the requests library.
  • Parses the HTML content using BeautifulSoup to extract desired information.
  • Extracts dates from text with the extract_date method, which leverages the dateutil library.
  • Extracts prices from text with the extract_price method, which uses regular expressions (see the sketch after this list).
  • Exports the extracted data to a CSV file, allowing customization of repeater selector and fields.
  • Handles cases where the repeater selector does not match any elements or the output file is locked by another program.
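
For illustration, here is a minimal sketch of how the date and price extraction can work. The helper names and signatures below are assumptions made for this example, not the package's exact implementation; they only demonstrate the underlying technique (fuzzy parsing with dateutil and a regular expression for prices).

import re
from dateutil import parser as date_parser

def extract_date(text):
    # dateutil's fuzzy mode skips surrounding words such as "Release date:"
    try:
        return date_parser.parse(text, fuzzy=True).date().isoformat()
    except (ValueError, OverflowError):
        return None

def extract_price(text):
    # match an optional currency symbol followed by digits, thousands separators, and decimals
    match = re.search(r'[\$£€]?\s*\d[\d,]*(?:\.\d+)?', text)
    return match.group().strip() if match else None

print(extract_date('Release date: 05-21-23'))   # 2023-05-21
print(extract_price('Regular price: $14.95'))   # $14.95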

Usage

from smartWebScraper import SmartWebScraper

scraper = SmartWebScraper(
    # replace URL_TO_SCRAPE with the URL you want to scrape
    url="URL_TO_SCRAPE",
    # optional, defaults to True; set to False to leave empty fields blank instead of writing N/A
    empty_as_na=False,
    # optional, defaults to data.csv
    filename='test.csv'
)

# fields is a list of tuples describing what to extract; replace the example below with your own
# first element (e.g. 'Title'): the column header in the CSV file
# second element (e.g. 'h3.bc-heading'): the CSS selector for the field (tags, classes, ids, etc.)
# third element (e.g. 'text'): what to extract; use 'text' for the element text, or an attribute name such as 'href'
# fourth element (optional, e.g. 'date' or 'price'): tells the scraper to parse the value as a datetime or price automatically
fields = [
            ('Title', 'h3.bc-heading', 'text'),
            ('Sub Title', 'li.bc-list-item.subtitle span', 'text'),
            ('Author', 'span.bc-text.bc-size-small.bc-color-secondary a', 'text'),
            ('Author Link', 'span.bc-text.bc-size-small.bc-color-secondary a', 'href'),
            ('Link', 'h3.bc-heading a.bc-link', 'href'),
            ('Image', '.bc-image-inset-border', 'src'),
            ('Length', 'li.bc-list-item.runtimeLabel', 'text'),
            ('Date', 'li.bc-list-item.releaseDateLabel span', 'text', 'date'),
            ('Language', 'li.bc-list-item.languageLabel', 'text'),
            ('Price', '.buybox-regular-price', 'text', 'price'),
         ]
# export_to_csv takes the repeater_selector (the selector of the repeated elements) and the fields defined above
result = scraper.export_to_csv(repeater_selector='li.bc-list-item.productListItem', fields=fields)
print(result)

Output

{'success': True, 'message': 'CSV file test.csv created successfully.'}
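
If the export succeeds, you can sanity-check the generated file with Python's standard csv module. The snippet below only assumes the result dictionary shown above and the test.csv filename passed to the scraper.

import csv

if result['success']:
    # print the first few rows of the generated file to confirm the extraction
    with open('test.csv', newline='', encoding='utf-8') as f:
        for row in list(csv.DictReader(f))[:3]:
            print(row)
else:
    print('Export failed:', result['message'])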

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.


Download files

Download the file for your platform.

Source Distribution

smartWebScraper-1.0.0.tar.gz (4.7 kB)


Built Distribution

smartWebScraper-1.0.0-py3-none-any.whl (6.2 kB)

