ScrapDynamics

Introduction

ScrapDynamics is a framework for exploring and crawling websites, extracting links, and finding information with regular expressions. It makes it easy to automate collecting data from the web.

Installation

To install ScrapDynamics, clone the repository and install its dependencies from inside the cloned directory:

git clone https://github.com/GuyChahine/ScrapDynamics.git
cd ScrapDynamics
pip install -r requirements.txt

Getting Started

CLI

To run ScrapDynamics from the command line, use the following command:

python -m scrapdynamics -u https://example.org -o ./results.json

This starts the crawler at the URL "https://example.org" and saves the results to the JSON file "./results.json".

The following options are available:

usage: ScrapDynamics [-h] [-v] [-u URL] [-o OUTPUT]
options:
  -h           --help             show this help message and exit
  -v           --version          show version
  -u URL       --url URL          base url
  -o OUTPUT    --output OUTPUT    path/filename.extension

Library

To use ScrapDynamics as a library, import it into your Python code:

import scrapdynamics as sd

crawler = sd.Crawler("https://example.org")
crawler.start()
crawler.to_json("./results.json")

This code creates a Crawler object with the specified URL, starts the crawling process, and saves the results in a JSON file.

Advanced Usage

Settings

You can customize the behavior of ScrapDynamics through a Settings object. Here's how to create one:

from scrapdynamics.settings import Settings

settings = Settings()

You can modify the following settings:

  • link_findall: regular expression used to find links in a page
link_findall = "href=\"((?:https?|\/\w|\/\/\w).+?)\""
  • link_relative_sub: pattern and replacement used to turn relative links into absolute URLs
link_relative_sub = ["(^\/\w.+?$)", "https://{domain}\1"]
  • link_schema_relative_sub: pattern and replacement used to turn scheme-relative links (those starting with //) into absolute URLs
link_schema_relative_sub = ["(^(?:\/\/\w).*?$)", "https:\1"]
  • domain_findall: regular expression used to extract the domain from a URL
domain_findall = "https?:\/\/(?:www\.)?([^\/\s\'\"]+)"
  • search_expressions: dict of regular expressions to search for in each HTML page
search_expressions = {
    "title": "(?:<title>|<meta.*?property=\"og:title\".*?content=\")(.*?)(?:<\/title>|\".*?>)",
    "emails": "[\w\-\.]+?\@[\w\-\.]+?\.[\w]+",
    "phones": "(?:tel\:)(\+?[\d\-\ ]{6,20})(?!\d)",
}
  • restrict_to_domain: only follow URLs on the domain given at the start
restrict_to_domain = True
  • depth: maximum crawl depth
depth = 1
  • simulate_human: use a Selenium WebDriver to fetch HTML pages
simulate_human = False
  • scroll_first_page: have the Selenium WebDriver scroll down the URL given at the start
scroll_first_page = False
  • scroll_all_page: have the Selenium WebDriver scroll down every URL found
scroll_all_page = False
  • headless: run the Selenium WebDriver without showing the browser window
headless = False
  • get_timeout: timeout of a GET request, in seconds
get_timeout = 3
  • progress_bar: show a progress bar instead of simple prints
progress_bar = False
  • request_header: headers sent with GET requests made through the requests module
request_header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
  • valid_content_type: content types of pages the crawler is allowed to explore
valid_content_type = ["text/html"]
  • xpath_restrict_link_crawl: XPath whose child elements are used to find links at depth 1
xpath_restrict_link_crawl = "/html"
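
To see what the link-related patterns above match, they can be tried directly with Python's re module. The following is a minimal sketch, not ScrapDynamics' own code: the sample HTML, the start URL, and the helper name absolute_links are made up for illustration, and the order in which the substitutions are applied is an assumption.

```python
import re

# Patterns copied from the settings above, written as raw strings so that
# backslashes such as \w and \1 reach the regex engine unchanged.
LINK_FINDALL = r'href="((?:https?|\/\w|\/\/\w).+?)"'
LINK_RELATIVE_SUB = (r"(^\/\w.+?$)", r"https://{domain}\1")
LINK_SCHEMA_RELATIVE_SUB = (r"(^(?:\/\/\w).*?$)", r"https:\1")
DOMAIN_FINDALL = r"https?:\/\/(?:www\.)?([^\/\s\'\"]+)"

# Hypothetical start URL and page content, used only for illustration.
START_URL = "https://www.example.org/start"
SAMPLE_HTML = """
<a href="https://example.org/about">About</a>
<a href="/contact">Contact</a>
<a href="//cdn.example.org/app.js">CDN</a>
<a href="#top">Back to top</a>
"""


def absolute_links(html, start_url):
    """Find links in html and rewrite relative ones as absolute URLs."""
    # The domain of the start URL fills the {domain} placeholder.
    domain = re.findall(DOMAIN_FINDALL, start_url)[0]
    links = []
    for link in re.findall(LINK_FINDALL, html):
        # "/path" becomes "https://<domain>/path".
        pattern, repl = LINK_RELATIVE_SUB
        link = re.sub(pattern, repl.format(domain=domain), link)
        # "//host/path" becomes "https://host/path".
        pattern, repl = LINK_SCHEMA_RELATIVE_SUB
        link = re.sub(pattern, repl, link)
        links.append(link)
    return links


if __name__ == "__main__":
    for link in absolute_links(SAMPLE_HTML, START_URL):
        print(link)
```

With this sample, the fragment link "#top" is not matched by link_findall, while the other three links come out as absolute https URLs.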

Here's an example of how to use the Settings object with ScrapDynamics:

import scrapdynamics as sd
from scrapdynamics.settings import Settings

settings = Settings(progress_bar=True)
crawler = sd.Crawler("https://example.org", settings)
crawler.start()
print(crawler.show())

This code creates a Settings object with the progress_bar option set to True, creates a Crawler object with the specified URL and settings, starts the crawling process, and displays the results.
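
Before passing custom patterns to the search_expressions setting, it can help to test them against a small HTML sample. The sketch below uses only Python's re module. It assumes each pattern is applied with re.findall, which this document does not confirm. The "headings" entry, the sample HTML, and the helper name extract are hypothetical.

```python
import re

# The three patterns listed under search_expressions, as raw strings.
search_expressions = {
    "title": r'(?:<title>|<meta.*?property="og:title".*?content=")(.*?)(?:<\/title>|".*?>)',
    "emails": r"[\w\-\.]+?\@[\w\-\.]+?\.[\w]+",
    "phones": r"(?:tel\:)(\+?[\d\-\ ]{6,20})(?!\d)",
}

# Hypothetical custom pattern added for illustration: text of <h1> elements.
search_expressions["headings"] = r"<h1[^>]*>(.*?)</h1>"

# Hypothetical page content.
SAMPLE_HTML = """
<html><head><title>Example Shop</title></head>
<body>
<h1>Contact</h1>
<a href="mailto:sales@example.org">Email us</a>
<a href="tel:+33 1 23 45 67 89">Call us</a>
</body></html>
"""


def extract(html, expressions):
    """Apply every pattern to html and collect the matches per key."""
    return {name: re.findall(pattern, html) for name, pattern in expressions.items()}


if __name__ == "__main__":
    for name, matches in extract(SAMPLE_HTML, search_expressions).items():
        print(name, matches)
```

Once a pattern behaves as expected on sample pages, it can be added to the dict given to the search_expressions setting.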

Features

  • Regex-based Information Extraction: ScrapDynamics supports the use of regular expressions to search for specific information within the explored website. In addition to the regular expression patterns already implemented, you can define custom regular expression patterns and extract any other structured information.

  • Website Crawling: ScrapDynamics lets you navigate through a website and discover its accessible pages. It follows links, collects URLs, and traverses the website structure up to the configured depth.

  • Customizable Scraping Rules: You have full control over the scraping process. You can define the starting URL, specify the depth of crawling, set exclusion rules for certain URLs, and fine-tune the behavior of the crawler according to your requirements.

  • Data Export: The extracted information can be exported to formats such as Excel, CSV, or JSON, allowing you to further analyze the scraped data or integrate it into your existing workflows.
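
How a depth limit and a domain restriction shape a crawl can be illustrated without any network access. The sketch below is a generic breadth-first traversal over a made-up, in-memory set of pages. It is not ScrapDynamics' implementation, and it assumes that depth counts link hops from the start URL. The names PAGES, domain_of, and crawl are hypothetical.

```python
import re
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links on that page.
PAGES = {
    "https://example.org/": ["https://example.org/a", "https://other.test/x"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/"],
    "https://example.org/b": [],
    "https://other.test/x": ["https://other.test/y"],
}

# Same pattern as the domain_findall setting.
DOMAIN_RE = r"https?:\/\/(?:www\.)?([^\/\s\'\"]+)"


def domain_of(url):
    """Return the domain part of url, or None if it cannot be found."""
    match = re.search(DOMAIN_RE, url)
    return match.group(1) if match else None


def crawl(start, depth=1, restrict_to_domain=True):
    """Breadth-first crawl that stops following links after depth hops."""
    start_domain = domain_of(start)
    visited = [start]          # URLs in discovery order
    seen = {start}             # fast membership test
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            continue           # do not follow links beyond the depth limit
        for link in PAGES.get(url, []):
            if link in seen:
                continue
            if restrict_to_domain and domain_of(link) != start_domain:
                continue       # skip links that leave the start domain
            seen.add(link)
            visited.append(link)
            queue.append((link, level + 1))
    return visited


if __name__ == "__main__":
    print(crawl("https://example.org/", depth=1))
    print(crawl("https://example.org/", depth=2, restrict_to_domain=False))
```

Raising depth or disabling restrict_to_domain widens the set of discovered URLs, which is the trade-off to keep in mind when tuning those two settings.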

Examples

  • LinkedIn Job Scraping: In this example, ScrapDynamics is configured to scrape job details. It collects the job link, job title, job description, company name, job location, and time of posting, and saves the results in an Excel file.
