
A package that allows you to smartly scrape data from a web page and export it to a CSV file.

Project description

Smart Web Scraper

This Python script allows you to smartly scrape data from a web page and export it to a CSV file.

It uses requests and BeautifulSoup to retrieve and parse HTML content, the standard-library re module to extract prices, and dateutil to extract dates.

Features

  • Retrieves HTML content from a specified URL using the requests library.
  • Parses the HTML content using BeautifulSoup to extract desired information.
  • Extracts dates from text using the extract_date method, leveraging the dateutil library.
  • Extracts prices from text using the extract_price method, using regular expressions.
  • Exports the extracted data to a CSV file, with a repeater selector and a field list that you define.
  • Handles the cases where the repeater selector matches no elements or the output file is open in another program.
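The date and price features above can be illustrated with a minimal sketch. This is not the package's own extract_date or extract_price code; the function names extract_date_sketch and extract_price_sketch and the regular expression are assumptions made for illustration, and only the dateutil call parser.parse(..., fuzzy=True) and the re module are taken as given.

```python
import re
from datetime import datetime
from typing import Optional

from dateutil import parser as date_parser


def extract_price_sketch(text: str) -> Optional[float]:
    """Return the first number in the text that looks like a price, or None.

    Hypothetical helper: the pattern below is an assumption, not the
    expression used by the package.
    """
    # First alternative: digits grouped with thousands separators (1,299.99).
    # Second alternative: plain digits with an optional decimal part (14.95).
    match = re.search(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def extract_date_sketch(text: str) -> Optional[datetime]:
    """Return the date dateutil finds in the text, or None if there is none."""
    try:
        # fuzzy=True tells dateutil to skip tokens that are not part of a date.
        return date_parser.parse(text, fuzzy=True)
    except (ValueError, OverflowError):
        return None


price = extract_price_sketch("Regular price: $1,299.99")
date = extract_date_sketch("Release date: 2023-05-30")
print(price, date)
```

Note that dateutil fills any missing date parts from the current date, so a text containing only a month and year still yields a full datetime.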

Usage

from smartWebScraper import SmartWebScraper

scraper = SmartWebScraper(
    # Replace URL_TO_SCRAPE with the URL of the page to scrape.
    url="URL_TO_SCRAPE",
    # Optional, defaults to True. Set to False to leave missing fields empty instead of writing N/A.
    empty_as_na=False,
    # Optional, defaults to data.csv.
    filename='test.csv'
)

# Example fields; replace them with your own.
# fields is a list of tuples:
#   1st element (e.g. 'Title'): the column header in the CSV file.
#   2nd element (e.g. 'h3.bc-heading'): the field selector; tags, classes, ids, etc. all work.
#   3rd element (e.g. 'text'): what to extract; 'text' for the element's text, or an attribute name such as 'href'.
#   4th element (optional): 'price' or 'date' to have the price or date extracted from the value automatically.
fields = [
    ('Title', 'h3.bc-heading', 'text'),
    ('Sub Title', 'li.bc-list-item.subtitle span', 'text'),
    ('Author', 'span.bc-text.bc-size-small.bc-color-secondary a', 'text'),
    ('Author Link', 'span.bc-text.bc-size-small.bc-color-secondary a', 'href'),
    ('Link', 'h3.bc-heading a.bc-link', 'href'),
    ('Image', '.bc-image-inset-border', 'src'),
    ('Length', 'li.bc-list-item.runtimeLabel', 'text'),
    ('Date', 'li.bc-list-item.releaseDateLabel span', 'text', 'date'),
    ('Language', 'li.bc-list-item.languageLabel', 'text'),
    ('Price', '.buybox-regular-price', 'text', 'price'),
]
# export_to_csv takes the repeater_selector (the selector that matches each repeated element) and the fields defined above.
result = scraper.export_to_csv(repeater_selector='li.bc-list-item.productListItem', fields=fields)
print(result)
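To make the relationship between the repeater selector and the field tuples concrete, here is a rough sketch of how such a mapping could work. It is not the package's implementation: the helper names rows_from_html and to_csv_text, the sample HTML, and the sample selectors are all hypothetical. It relies only on BeautifulSoup's select and select_one methods and the standard csv module, and it parses an inline string instead of fetching a URL.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample page: two repeated items, the second without a price.
HTML = """
<ul>
  <li class="item"><h3><a href="/book/1">First Book</a></h3><span class="price">$9.99</span></li>
  <li class="item"><h3><a href="/book/2">Second Book</a></h3></li>
</ul>
"""

FIELDS = [
    ("Title", "h3", "text"),
    ("Link", "h3 a", "href"),
    ("Price", ".price", "text"),
]


def rows_from_html(html, repeater_selector, fields, empty_as_na=True):
    """Build one row per repeated element and one column per field tuple."""
    soup = BeautifulSoup(html, "html.parser")
    placeholder = "N/A" if empty_as_na else ""
    rows = []
    for item in soup.select(repeater_selector):
        row = []
        # Each tuple is (header, selector, what to extract[, optional kind]).
        for _header, selector, what, *_kind in fields:
            node = item.select_one(selector)
            if node is None:
                value = None
            elif what == "text":
                value = node.get_text(strip=True)
            else:
                value = node.get(what)  # attribute lookup, e.g. href or src
            row.append(value if value else placeholder)
        rows.append(row)
    return rows


def to_csv_text(fields, rows):
    """Serialize the header row and data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header for header, *_ in fields])
    writer.writerows(rows)
    return buffer.getvalue()


rows = rows_from_html(HTML, "li.item", FIELDS)
print(to_csv_text(FIELDS, rows))
```

An empty list from rows_from_html corresponds to the case where the repeater selector matches nothing, which a caller can check before writing a file.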

Output

{'success': True, 'message': 'CSV file test.csv created successfully.'}
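Since the method returns a dictionary with success and message keys, a script can branch on it. The sketch below assumes that failures are reported with the same two keys and success set to False; only the success case is shown above, so treat that as an assumption. The helper name report and the failure message in the example are hypothetical.

```python
import sys


def report(result: dict) -> int:
    """Print the result message and return a process exit code.

    Assumes the dictionary shape shown above: 'success' and 'message' keys.
    """
    ok = bool(result.get("success"))
    # Successful runs go to stdout, failures to stderr.
    stream = sys.stdout if ok else sys.stderr
    print(result.get("message", ""), file=stream)
    return 0 if ok else 1


exit_code = report({"success": True, "message": "CSV file test.csv created successfully."})
# Hypothetical failure message, for illustration only.
failed_code = report({"success": False, "message": "Example failure message."})
```

In a command-line script, the returned code can be passed to sys.exit so that a failed export is visible to the calling shell.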

Contributing

Contributions are welcome! If you find any issues or have suggestions for improvement, please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.


