Versatile web scraper for extracting data from both static HTML and JavaScript-rendered sites, using headless browsers for dynamic content.

Web Scraper for Static and JS Rendered Sites

This repository contains a versatile web scraper capable of handling both statically rendered HTML sites and dynamically rendered JavaScript sites. By combining techniques for traditional HTML scraping with methods for interacting with JavaScript-heavy websites, this single program offers a comprehensive solution for extracting data from a wide range of web sources.

Features

  • Dual Scraping Modes: Processes both static HTML and JS-rendered content.
  • Headless Browser Support: Utilizes headless browsers to render and scrape JavaScript-heavy sites.
  • Easy Configuration: Simple setup and configuration to specify target URLs and data extraction rules.
  • Efficient Data Extraction: Optimized for speed and reliability, ensuring accurate data retrieval.
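
Installation

The package is published on PyPI, so it can be installed with pip:

pip install scrapeall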

Usage

1. Modify the config.yaml file to match your requirements.

  • The configuration should follow the same format as the default sample config.yaml.
  • The terminator key is optional; provide it only if scraping should stop at a specific element.
  • The path key can take multiple values if you want to extract elements matched by several CSS selectors.
  • The nesting depth of the configuration does not matter, as long as each branch ends with the path key.

Here's a sample configuration:

<YOUR_VENDOR_NAME>:
  base:
    url: <URL_OF_THE_SITE_FROM_WHERE_YOU_WANT_TO_EXTRACT_ALL_URLS>
    path:
      - <PATH_TO_SINGLE_URL_ELEMENT>
  content:
    <CONTENT_SECTION_1>:
      path:
        - <YOUR_CSS_SELECTOR_PATH>
      terminator: <YOUR_CSS_SELECTOR_PATH_TO_TERMINATING_POINT>
    <CONTENT_SECTION_2>:
      path:
        - <YOUR_CSS_SELECTOR_PATH>
      terminator: <YOUR_CSS_SELECTOR_PATH_TO_TERMINATING_POINT>
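
For instance, a filled-in configuration might look like this (the vendor name, URL, and selectors below are purely illustrative, not taken from a real site):

example_vendor:
  base:
    url: https://example.com/blog
    path:
      - article h2 a           # selector matching one link element per post
  content:
    title:
      path:
        - h1.post-title
    body:
      path:
        - div.post-content p
      terminator: div.comments   # stop extracting once the comments begin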

2. Run the scraper.

To scrape JS-rendered content with a headless browser, use the following code:

import asyncio

from tqdm import tqdm

from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.initialize_browser()
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data

    await parser.close_browser()
    return data


if __name__ == "__main__":
    VENDOR = "<YOUR_VENDOR_NAME>"
    CONFIG_PATH = "<PATH_TO_YOUR_YAML_FILE>"
    data = asyncio.run(main(VENDOR, CONFIG_PATH))
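
If one page fails to load, the loop above stops with an exception. A minimal way to skip failing URLs instead is to wrap each call in a try/except block (this sketch assumes get_content raises an exception on failure, which may not cover every error mode):

    for key, url in tqdm(parser.urls.items()):
        try:
            await parser.get_content(url)
        except Exception as exc:  # assumption: failures surface as exceptions
            print(f"skipping {url}: {exc}")
            continue
        data[key] = parser.data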

To scrape content from static HTML sites, modify the main function as follows (no browser needs to be initialized or closed):

async def main(vendor: str, config_path: str):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data
    return data

If you want to save the scraped content to a file, use the following snippet:

import asyncio

from tqdm import tqdm

from scrapeall.utils import save_data
from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str, output_file: str = ""):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.initialize_browser()
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data

    await parser.close_browser()
    if output_file:
        await save_data(filename=output_file, data=data)
    return data


if __name__ == "__main__":
    VENDOR = "<YOUR_VENDOR_NAME>"
    CONFIG_PATH = "<PATH_TO_YOUR_YAML_FILE>"
    OUTPUT_FILE = "<PATH_TO_YOUR_JSON_FILE>"
    data = asyncio.run(main(VENDOR, CONFIG_PATH, OUTPUT_FILE))

Remember: <YOUR_VENDOR_NAME> must match a top-level key in your YAML configuration file.
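
For example, with the illustrative config shown earlier, the two values must line up (the names here are hypothetical):

VENDOR = "example_vendor"   # top-level key in config.yaml
CONFIG_PATH = "config.yaml"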

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your enhancements or bug fixes.
