Versatile web scraper for extracting data from both static HTML and JavaScript-rendered sites, using headless browsers for dynamic content.

Project description

Web Scraper for Static and JS Rendered Sites

This repository contains a web scraper that handles both statically rendered HTML sites and dynamically rendered JavaScript sites. It combines traditional HTML scraping with headless-browser rendering for JavaScript-heavy pages, so a single program can extract data from both kinds of source.

Features

Dual Scraping Modes: Can process static HTML or JS-rendered content.
Headless Browser Support: Uses a headless browser to render and scrape JavaScript-heavy sites.
Easy Configuration: A single YAML file specifies the target URLs and the CSS-selector extraction rules.
Efficient Data Extraction: Designed for speed and reliable data retrieval.

Usage

1. Modify the `config.yaml` file to suit your requirements.

The configuration should follow the same format as the sample provided in `config.yaml`.
The `terminator` key is optional. It is only required if there is a specific point at which scraping should stop.
The `path` key can hold multiple values if you want to extract elements matching several CSS selectors.
The nesting depth of the configuration does not matter, as long as each branch ends with the `path` key.

Here is a sample configuration:

<YOUR_VENDOR_NAME>:
  base:
    url: <URL_OF_THE_SITE_FROM-WHERE_YOU_WANT_TO_EXTRACT_ALL_URLS>
    path:
      - <PATH_TO_SINGLE_URL_ELEMENT>
  content:
    <CONTENT_SECTION_1>:
      path:
        - <YOUR_CSS_SELECTOR_PATH>
      terminator: <YOUR_CSS_SELECTOR_PATH_TO_TERMINATING_POINT>
    <CONTENT_SECTION_2>:
      path:
        - <YOUR_CSS_SELECTOR_PATH>
      terminator: <YOUR_CSS_SELECTOR_PATH_TO_TERMINATING_POINT>
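
The rule that nesting depth does not matter as long as each branch ends in `path` can be illustrated with a small recursive walk. This is only a sketch of the idea, not the package's own loader: the function name `iter_path_nodes`, the vendor name, the URL, and the selectors are all made up for illustration, and the configuration is written as a plain Python dict instead of being read from YAML.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_path_nodes(
    node: Any, trail: tuple[str, ...] = ()
) -> Iterator[tuple[tuple[str, ...], list[str], str | None]]:
    """Yield (key trail, selectors, terminator) for each mapping that has a `path` key."""
    if not isinstance(node, dict):
        return
    if "path" in node:
        selectors = node["path"]
        # Accept a single selector string as well as a list of selectors.
        if isinstance(selectors, str):
            selectors = [selectors]
        # `terminator` is optional, so a missing key becomes None.
        yield trail, list(selectors), node.get("terminator")
        return
    # No `path` here yet: keep descending, whatever the depth.
    for key, child in node.items():
        yield from iter_path_nodes(child, trail + (key,))


# Hypothetical configuration mirroring the sample layout above.
config = {
    "example_vendor": {
        "base": {
            "url": "https://example.com/articles",
            "path": ["a.article-link"],
        },
        "content": {
            "title": {"path": ["h1.title"]},
            "body": {
                "intro": {
                    "path": ["div.intro p", "div.summary p"],
                    "terminator": "div.footer",
                },
            },
        },
    }
}

nodes = list(iter_path_nodes(config["example_vendor"]))
for trail, selectors, terminator in nodes:
    print("/".join(trail), selectors, terminator)
```

Here `content/title` sits one level below `content` and `content/body/intro` sits two levels below it, yet both are picked up because each branch ends in a `path` key.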

2. Run the following code.

To get JS-rendered content using a headless browser, use the code below:

import asyncio

from tqdm import tqdm

from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.initialize_browser()
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data

    await parser.close_browser()
    return data


if __name__ == "__main__":
    VENDOR = "<YOUR_VENDOR_NAME>"
    CONFIG_PATH = "<PATH_TO_YOUR_YAML_FILE>"
    data = asyncio.run(main(VENDOR, CONFIG_PATH))

To get content from static HTML pages, modify the `main` function as follows (the browser setup and teardown calls are dropped):

async def main(vendor: str, config_path: str):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data
    return data
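
The two variants above differ only in the browser setup and teardown calls, so they could be folded into one helper with a flag. The sketch below is an assumption-laden illustration, not part of the package: `scrape`, `use_browser`, and `FakeParser` are hypothetical names, and `FakeParser` only imitates the method names used in the snippets above so that the example runs without network access. In real use you would pass `HTMLParser(vendor, config_path)` instead.

```python
import asyncio
from typing import Any


async def scrape(parser: Any, use_browser: bool) -> dict:
    """Collect content for every URL the parser discovers."""
    data = {}
    if use_browser:
        await parser.initialize_browser()
    try:
        await parser.get_all_urls()
        for key, url in parser.urls.items():
            await parser.get_content(url)
            data[key] = parser.data
    finally:
        # Close the browser even if a page fails part-way through.
        if use_browser:
            await parser.close_browser()
    return data


class FakeParser:
    """Hypothetical stand-in that records calls instead of fetching pages."""

    def __init__(self) -> None:
        self.urls: dict = {}
        self.data: Any = None
        self.calls: list = []

    async def initialize_browser(self) -> None:
        self.calls.append("initialize_browser")

    async def close_browser(self) -> None:
        self.calls.append("close_browser")

    async def get_all_urls(self) -> None:
        self.urls = {
            "page-1": "https://example.com/1",
            "page-2": "https://example.com/2",
        }

    async def get_content(self, url: str) -> None:
        self.data = {"source": url}


js_parser = FakeParser()
js_result = asyncio.run(scrape(js_parser, use_browser=True))

static_parser = FakeParser()
static_result = asyncio.run(scrape(static_parser, use_browser=False))
```

Wrapping the loop in `try`/`finally` is a design choice of this sketch: it keeps a failed page from leaving a headless browser process running.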

If you want to save the scraped content, use the following snippet:

import asyncio

from tqdm import tqdm

from scrapeall.utils import save_data
from scrapeall.parse import HTMLParser


async def main(vendor: str, config_path: str, output_file: str = ""):
    data = dict()

    parser = HTMLParser(vendor, config_path)
    await parser.initialize_browser()
    await parser.get_all_urls()
    for key, url in tqdm(parser.urls.items()):
        await parser.get_content(url)
        data[key] = parser.data

    await parser.close_browser()
    if output_file:
        await save_data(filename=output_file, data=data)
    return data


if __name__ == "__main__":
    VENDOR = "<YOUR_VENDOR_NAME>"
    CONFIG_PATH = "<PATH_TO_YOUR_YAML_FILE>"
    OUTPUT_FILE = "<PATH_TO_YOUR_JSON_FILE>"
    data = asyncio.run(main(VENDOR, CONFIG_PATH, OUTPUT_FILE))

Remember: `YOUR_VENDOR_NAME` must match a top-level key in the YAML file at `CONFIG_PATH`.
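
The internals of `save_data` are not shown here. Assuming the output is ordinary JSON, as the `<PATH_TO_YOUR_JSON_FILE>` placeholder suggests, the following standard-library sketch shows what writing the `data` dict and reading it back could look like. The helper names `write_json` and `read_json` and the sample record are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_json(filename, data) -> None:
    """Serialize the scraped dict as UTF-8 JSON."""
    text = json.dumps(data, indent=2, ensure_ascii=False)
    Path(filename).write_text(text, encoding="utf-8")


def read_json(filename):
    """Load a previously saved JSON file back into Python objects."""
    return json.loads(Path(filename).read_text(encoding="utf-8"))


# Made-up record shaped like the `data` dict built in the loop above.
scraped = {"page-1": {"title": ["Example title"]}}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output.json"
    write_json(out, scraped)
    restored = read_json(out)

print(restored == scraped)
```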

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your enhancements or bug fixes.

Project details

Release history

0.1.2 (Aug 31, 2024)
0.1.1 (Aug 28, 2024), this version

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapeall-0.1.1.tar.gz (4.3 kB), uploaded Aug 28, 2024

Built Distribution

scrapeall-0.1.1-py3-none-any.whl (4.7 kB), uploaded Aug 28, 2024, Python 3

Hashes for scrapeall-0.1.1.tar.gz

Algorithm	Hash digest
SHA256	`df774a2c0467f8aa94433b44702507d3744b50ac8ed6038a8a0036bc730afce4`
MD5	`bf85b02df758017b501c25e2b0a75cbd`
BLAKE2b-256	`70ef7ee9e21e2c0f39523686235911e537fd800adafa045eebe3f9ff8ec31293`

Hashes for scrapeall-0.1.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`c617ae65496f5b5b9a337f40eb03715fa6deb8302b4d9be9c82b150e1ac6ce3f`
MD5	`5548785cb90adb0d5359a60d681fd33c`
BLAKE2b-256	`47b50e6ddc248d551febe566192a605b875554192749a3f11b5dc3ba5c2a54a0`
