ScrapDynamics is a powerful framework for exploring and crawling websites.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

scrapdynamics

ScrapDynamics

Introduction
Installation
Getting Started
- CLI
- Library
Advance Usage
- Settings
Features
Examples

Introduction

ScrapDynamics is a powerful framework for exploring and crawling websites, extracting links and finding information using regex expressions, making it easy to automate the process of collecting any data on a web.

Installation

To install ScrapDynamics, you can use the following commands:

git clone https://github.com/GuyChahine/ScrapDynamics.git

pip install -r requirements.txt

Getting Started

CLI

To use ScrapDynamics from the command line interface, you can use the following command:

python -m scrapdynamics -u https://example.org -o ./results.json

In this example it starts the crawler at the specified URL "https://example.org" and saves the results in a JSON file at the path "./results.json".

You can also use the following options:

usage: ScrapDynamics [-h] [-v] [-u URL] [-o OUTPUT]
options:
  -h           --help             show this help message and exit
  -v           --version          show version
  -u URL       --url URL          base url
  -o OUTPUT    --output OUTPUT    path/filename.extention

Library

When using ScrapDynamics as a library, you can import it into your Python code and use the following code snippet:

import scrapdynamics as sd

crawler = sd.Crawler("https://example.org")
crawler.start()
crawler.to_json("./results.json")

This code creates a Crawler object with the specified URL, starts the crawling process, and saves the results in a JSON file.

Advance Usage

Settings

You can customize the behavior of ScrapDynamics by modifying the settings. Here's an example of how to create a Settings object and set various options:

from scrapdynamics.settings import Settings

settings = Settings()

You can modify the following settings:

link_findall: regex expression to find links

link_findall = "href=\"((?:https?|\/\w|\/\/\w).+?)\""

link_relative_sub: regex expression to substitute relative links

link_relative_sub = ["(^\/\w.+?$)", "https://{domain}\1"]

link_schema_relative_sub: regex expression to substitute schema relative links

link_schema_relative_sub = ["(^(?:\/\/\w).*?$)", "https:\1"]

domain_findall: regex expression to find domain from a url

domain_findall = "https?:\/\/(?:www\.)?([^\/\s\'\"]+)"

search_expressions: dict of regex expression to look for in the html page

search_expressions = {
    "title": "(?:<title>|<meta.*?property=\"og:title\".*?content=\")(.*?)(?:<\/title>|\".*?>)",
    "emails": "[\w\-\.]+?\@[\w\-\.]+?\.[\w]+",
    "phones": "(?:tel\:)(\+?[\d\-\ ]{6,20})(?!\d)",
}

restrict_to_domain: restrict future urls to the domain given at the start

restrict_to_domain = True

depth: max depth to crawl

depth = 1

simulate_human: use selenium webdriver to get html page

simulate_human = False

scroll_first_page: selenium webdriver scroll down the first url given at the start

scroll_first_page = False

scroll_all_page: selenium webdriver scroll down all the url found

scroll_all_page = False

headless: don't show the page of selenium webdriver

headless = False

get_timeout: time in seconds of a GET timeout

get_timeout = 3

progress_bar: use progress bar or simple prints

progress_bar = False

request_header: header to add when doing a GET request with the module requests

request_header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}

valid_content_type: content type of page content to allow the crawler to explore

valid_content_type = ["text/html"]

xpath_restrict_link_crawl: xpath where children elements will be used to find links for depth 1

xpath_restrict_link_crawl = "/html"

Here's an example of how to use the Settings object with ScrapDynamics:

import scrapdynamics as sd
from scrapdynamics.settings import Settings

settings = Settings(progress_bar=True)
crawler = sd.Crawler("https://example.org", settings)
crawler.start()
print(crawler.show())

This code creates a Settings object with the progress_bar option set to True, creates a Crawler object with the specified URL and settings, starts the crawling process, and displays the results.

Features

Regex-based Information Extraction: ScrapDynamics supports the use of regular expressions to search for specific information within the explored website. In addition to the regular expression patterns already implemented, you can define custom regular expression patterns and extract any other structured information.
Website Crawling: ScrapDynamics provides a robust web crawling functionality that allows you to navigate through a website and discover all its accessible pages. It follows links, collects URLs, and traverses the website structure efficiently.
Customizable Scraping Rules: You have full control over the scraping process. You can define the starting URL, specify the depth of crawling, set exclusion rules for certain URLs, and fine-tune the behavior of the crawler according to your requirements.
Data Export: The extracted information can be easily exported to various formats, such as EXCEL, CSV or JSON, allowing you to further analyze or integrate the scraped data into your existing workflows.

Examples

Linkedin Job Scrapping In this example, ScrapDynamics is set for scraping job details. It include link of the job, job title, description of the job, company name, job location, time of posting, and save the results in an Excel file.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.2.0

Jun 16, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapdynamics-0.2.0.tar.gz (12.2 kB view hashes)

Uploaded Jun 16, 2023 Source

Built Distribution

scrapdynamics-0.2.0-py3-none-any.whl (12.1 kB view hashes)

Uploaded Jun 16, 2023 Python 3

Hashes for scrapdynamics-0.2.0.tar.gz

Hashes for scrapdynamics-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`82d5a389a72a574d703a1c6daaf6a72d0f595f43e7be7b037b0bfa4824f84a86`
MD5	`8629712dceb8be6ed19f5b9c0c2a8bc6`
BLAKE2b-256	`d4fb23099161e10c16ceb912aa3a11c862e82426aa1baef96c01f5e066bec672`

Hashes for scrapdynamics-0.2.0-py3-none-any.whl

Hashes for scrapdynamics-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f62cbc0dac9a2fc75e2e941fd8e0618450f651a076533440dc88bca93263247a`
MD5	`104dc94e8411bf1bc255b217b8b9c952`
BLAKE2b-256	`f20d3e49edd3b3c99f5f6c65d424eab95335f5be6a11bd4bf32241a93b31a4e1`