
Versatile web scraper for extracting data from both static HTML and JavaScript-rendered sites, using headless browsers for dynamic content.

Project description

Web Scraper for Static and JS Rendered Sites

This repository contains a versatile web scraper that handles both statically rendered HTML sites and dynamically rendered JavaScript sites. By combining traditional HTML parsing with headless-browser rendering for JavaScript-heavy pages, a single program covers data extraction from a wide range of web sources.

Features

  • Dual Scraping Modes: Processes both static HTML and JS-rendered content.
  • Headless Browser Support: Utilizes headless browsers to render and scrape JavaScript-heavy sites.
  • Easy Configuration: Simple setup and configuration to specify target URLs and data extraction rules.
  • Efficient Data Extraction: Optimized for speed and reliability, ensuring accurate data retrieval.

Usage

1. Modify the config.yaml file to suit your requirements.

  • The configuration should follow the same format as the sample config.yaml provided by default.
  • The terminator key is optional; include it only if there is a specific point at which scraping should stop.
  • The path key can take multiple values if you want to extract elements matched by several CSS selectors.
  • The nesting depth of the configuration does not matter, as long as each branch ends with a path key.

Here is the sample configuration:

<YOUR_VENDOR_NAME>:
  base:
    url: <URL_OF_THE_SITE_FROM_WHERE_YOU_WANT_TO_EXTRACT_ALL_URLS>
    path:
      - <PATH_TO_SINGLE_URL_ELEMENT>
  content:
    <CONTENT_SECTION_1>:
      path:
        - <YOUR_CSS_SELECTOR_PATH>
      terminator: <YOUR_CSS_SELECTOR_PATH_TO_TERMINATING_POINT>
    <CONTENT_SECTION_2>:
      path:
        - <YOUR_CSS_SELECTOR_PATH>
      terminator: <YOUR_CSS_SELECTOR_PATH_TO_TERMINATING_POINT>
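
For illustration, a hypothetical filled-in configuration might look like this (the vendor name, URL, and CSS selectors below are invented placeholders, not taken from any real site):

example_blog:
  base:
    url: https://example.com/articles
    path:
      - div.article-list a.article-link
  content:
    title:
      path:
        - h1.article-title
    body:
      path:
        - div.article-body p
      terminator: div.related-posts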

2. Run the following code

To scrape JS-rendered content using a headless browser, use the code below:

import asyncio

from tqdm import tqdm

from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    # Launch the headless browser used to render JS-heavy pages.
    await parser.initialize_browser()
    # Collect the target URLs defined by the base section of the config.
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        # Extract the content sections configured for this vendor.
        await parser.get_content(url)
        data[key] = parser.data

    await parser.close_browser()
    return data


if __name__ == "__main__":
    VENDOR = "<YOUR_VENDOR_NAME>"
    CONFIG_PATH = "<PATH_TO_YOUR_YAML_FILE>"
    data = asyncio.run(main(VENDOR, CONFIG_PATH))

To scrape content from static HTML sites/pages, modify the main function as follows (the browser setup and teardown are simply omitted):

async def main(vendor: str, config_path: str):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data
    return data
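
If you prefer a single entry point for both modes, one option is to gate the browser calls behind a flag. The sketch below only reuses the calls shown above; the use_browser parameter is an illustration, not part of the library API:

import asyncio

from tqdm import tqdm

from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str, use_browser: bool = True):
    # use_browser is an illustrative flag, not part of the scrapeall API:
    # pass False for purely static HTML sites.
    data = dict()

    parser = HTMLParser(vendor, config_path)
    if use_browser:
        # Only start the headless browser when JS rendering is needed.
        await parser.initialize_browser()
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data

    if use_browser:
        await parser.close_browser()
    return data

Call it exactly like the earlier examples, passing use_browser=False for static sites.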

If you want to save the scraped content, you can do so with the following snippet:

import asyncio

from tqdm import tqdm

from scrapeall.utils import save_data
from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str, output_file: str = ""):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.initialize_browser()
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data

    await parser.close_browser()
    if output_file:
        await save_data(filename=output_file, data=data)
    return data


if __name__ == "__main__":
    VENDOR = "<YOUR_VENDOR_NAME>"
    CONFIG_PATH = "<PATH_TO_YOUR_YAML_FILE>"
    OUTPUT_FILE = "<PATH_TO_YOUR_JSON_FILE>"
    data = asyncio.run(main(VENDOR, CONFIG_PATH, OUTPUT_FILE))

Remember: YOUR_VENDOR_NAME must match a top-level vendor key in the YAML file at CONFIG_PATH.
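
If you saved the output, you can read it back with the standard library. This assumes save_data writes JSON (suggested by the .json placeholder above, but not confirmed here):

import json

# The path placeholder matches OUTPUT_FILE from the snippet above.
with open("<PATH_TO_YOUR_JSON_FILE>", "r", encoding="utf-8") as f:
    scraped_data = json.load(f)  # keyed the same way as parser.urls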

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your enhancements or bug fixes.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapeall-0.1.2.tar.gz (4.3 kB)

Uploaded Source

Built Distribution

scrapeall-0.1.2-py3-none-any.whl (4.7 kB)

Uploaded Python 3

File details

Details for the file scrapeall-0.1.2.tar.gz.

File metadata

  • Download URL: scrapeall-0.1.2.tar.gz
  • Upload date:
  • Size: 4.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for scrapeall-0.1.2.tar.gz

  • SHA256: a32a9b42caadf86b9687052d7be6b656a8d39fbf5607982028eaa5de7d01b238
  • MD5: 7346b2ae8df6a8e15af8766fde344d26
  • BLAKE2b-256: 73fab13ca47d7835bcec88fae8783b02a6ec4872f1d3308a0f204eb5b9f01d95

See more details on using hashes here.

File details

Details for the file scrapeall-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: scrapeall-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 4.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.19

File hashes

Hashes for scrapeall-0.1.2-py3-none-any.whl

  • SHA256: 181121f4e5dc1c508a74fcdfe8a25e87381547860943e18fce037d4887f0f21b
  • MD5: d752af6ce9b58ff284aee6895ac281b3
  • BLAKE2b-256: 2ed25322acd7d49ae65f74bd0441b09f9de9d365c87cd32365f343cfbc0eb9c1

See more details on using hashes here.
