# ScrapDynamics

## Table of Contents

- [Introduction](#introduction)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Advanced Usage](#advanced-usage)
- [Features](#features)
- [Examples](#examples)

## Introduction
ScrapDynamics is a powerful framework for exploring and crawling websites: it extracts links and finds information using regular expressions, making it easy to automate the collection of data from the web.
## Installation

To install ScrapDynamics from source, run the following commands:

```bash
git clone https://github.com/GuyChahine/ScrapDynamics.git
cd ScrapDynamics
pip install -r requirements.txt
```
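The package is also published on PyPI as `scrapdynamics`, so it should be installable with pip directly:

```bash
pip install scrapdynamics
```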
## Getting Started

### CLI

To use ScrapDynamics from the command-line interface, run:

```bash
python -m scrapdynamics -u https://example.org -o ./results.json
```

This starts the crawler at the specified URL (https://example.org) and saves the results as a JSON file at ./results.json.
You can also use the following options:

```
usage: ScrapDynamics [-h] [-v] [-u URL] [-o OUTPUT]

options:
  -h, --help            show this help message and exit
  -v, --version         show version
  -u URL, --url URL     base url
  -o OUTPUT, --output OUTPUT
                        path/filename.extension
```
### Library

To use ScrapDynamics as a library, import it into your Python code:

```python
import scrapdynamics as sd

crawler = sd.Crawler("https://example.org")
crawler.start()
crawler.to_json("./results.json")
```

This creates a Crawler object with the specified URL, starts the crawling process, and saves the results to a JSON file.
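Since `to_json` writes a standard JSON file, the results can be loaded back with the standard library; note that the exact schema of the exported data is not documented here, so inspect it before relying on specific keys:

```python
import json

# Load the crawl results exported by crawler.to_json above.
with open("./results.json", encoding="utf-8") as f:
    results = json.load(f)

# The result schema is not documented here; inspect it first.
print(type(results))
```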
## Advanced Usage

### Settings

You can customize the behavior of ScrapDynamics by modifying its settings. Create a Settings object and set the options you need:

```python
from scrapdynamics.settings import Settings

settings = Settings()
```
You can modify the following settings:
- `link_findall`: regular expression used to find links (see the sketch after this list for how the link-related settings combine)
  `link_findall = "href=\"((?:https?|\/\w|\/\/\w).+?)\""`
- `link_relative_sub`: regular expression used to rewrite relative links as absolute ones
  `link_relative_sub = ["(^\/\w.+?$)", "https://{domain}\1"]`
- `link_schema_relative_sub`: regular expression used to rewrite schema-relative links
  `link_schema_relative_sub = ["(^(?:\/\/\w).*?$)", "https:\1"]`
- `domain_findall`: regular expression used to extract the domain from a URL
  `domain_findall = "https?:\/\/(?:www\.)?([^\/\s\'\"]+)"`
- `search_expressions`: dict of regular expressions to look for in each HTML page
  ```python
  search_expressions = {
      "title": "(?:<title>|<meta.*?property=\"og:title\".*?content=\")(.*?)(?:<\/title>|\".*?>)",
      "emails": "[\w\-\.]+?\@[\w\-\.]+?\.[\w]+",
      "phones": "(?:tel\:)(\+?[\d\-\ ]{6,20})(?!\d)",
  }
  ```
- `restrict_to_domain`: restrict discovered URLs to the domain given at the start
  `restrict_to_domain = True`
- `depth`: maximum depth to crawl
  `depth = 1`
- `simulate_human`: use a Selenium WebDriver to fetch HTML pages
  `simulate_human = False`
- `scroll_first_page`: have the Selenium WebDriver scroll down the first URL given at the start
  `scroll_first_page = False`
- `scroll_all_page`: have the Selenium WebDriver scroll down every URL found
  `scroll_all_page = False`
- `headless`: run the Selenium WebDriver without showing the browser window
  `headless = False`
- `get_timeout`: timeout in seconds for a GET request
  `get_timeout = 3`
- `progress_bar`: show a progress bar instead of plain prints
  `progress_bar = False`
- `request_header`: headers added to GET requests made with the requests module
  `request_header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}`
- `valid_content_type`: content types the crawler is allowed to explore
  `valid_content_type = ["text/html"]`
- `xpath_restrict_link_crawl`: XPath whose child elements are used to find links at depth 1
  `xpath_restrict_link_crawl = "/html"`
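Because these settings are plain Python `re` patterns, you can experiment with them outside the crawler. Here is a minimal sketch of how the link-related settings combine, using only the standard library; the crawler's actual internal pipeline may differ:

```python
import re

# Values copied from the settings above (raw strings keep the backslashes intact).
link_findall = r"href=\"((?:https?|\/\w|\/\/\w).+?)\""
link_relative_sub = [r"(^\/\w.+?$)", r"https://{domain}\1"]
link_schema_relative_sub = [r"(^(?:\/\/\w).*?$)", r"https:\1"]
domain_findall = r"https?:\/\/(?:www\.)?([^\/\s\'\"]+)"

base_url = "https://www.example.org"
html = (
    '<a href="/about">About</a>'
    '<a href="//cdn.example.org/app.js">Script</a>'
    '<a href="https://example.org/contact">Contact</a>'
)

domain = re.findall(domain_findall, base_url)[0]  # "example.org"

normalized = []
for link in re.findall(link_findall, html):
    # "/path" -> "https://example.org/path"
    link = re.sub(link_relative_sub[0], link_relative_sub[1].format(domain=domain), link)
    # "//host/path" -> "https://host/path"
    link = re.sub(link_schema_relative_sub[0], link_schema_relative_sub[1], link)
    normalized.append(link)

print(normalized)
# ['https://example.org/about', 'https://cdn.example.org/app.js', 'https://example.org/contact']
```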
Here's an example of how to use the Settings object with ScrapDynamics:

```python
import scrapdynamics as sd
from scrapdynamics.settings import Settings

settings = Settings(progress_bar=True)
crawler = sd.Crawler("https://example.org", settings)
crawler.start()
print(crawler.show())
```
This creates a Settings object with progress_bar enabled, creates a Crawler with the specified URL and settings, starts the crawl, and displays the results.
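The same pattern should extend to the other settings. The snippet below is a sketch that assumes Settings accepts each documented option as a keyword argument, just as progress_bar is passed above; the "prices" expression is purely illustrative and not part of the defaults:

```python
import scrapdynamics as sd
from scrapdynamics.settings import Settings

# Assumption: every setting listed above can be passed as a keyword argument.
settings = Settings(
    depth=2,
    restrict_to_domain=True,
    search_expressions={
        # Illustrative custom pattern: capture prices such as "$19.99".
        "prices": r"\$\d{1,6}(?:\.\d{2})?",
    },
)

crawler = sd.Crawler("https://example.org", settings)
crawler.start()
crawler.to_json("./results.json")
```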
## Features

- **Regex-based Information Extraction**: ScrapDynamics supports the use of regular expressions to search for specific information within the explored website. In addition to the patterns already implemented, you can define custom regular expressions to extract any other structured information.
- **Website Crawling**: ScrapDynamics provides robust web-crawling functionality that lets you navigate a website and discover all of its accessible pages. It follows links, collects URLs, and traverses the site structure efficiently.
- **Customizable Scraping Rules**: You have full control over the scraping process. You can define the starting URL, specify the crawl depth, set exclusion rules for certain URLs, and fine-tune the crawler's behavior to your requirements.
- **Data Export**: The extracted information can be exported to formats such as Excel, CSV, or JSON (see the sketch below), letting you analyze the scraped data further or integrate it into your existing workflows.
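Only `to_json` is shown earlier in this documentation; the CSV and Excel method names in the sketch below are assumptions that follow the same naming pattern, so verify them against the library before use:

```python
import scrapdynamics as sd

crawler = sd.Crawler("https://example.org")
crawler.start()

crawler.to_json("./results.json")       # confirmed in Getting Started
# Hypothetical equivalents for the other advertised export formats:
# crawler.to_csv("./results.csv")
# crawler.to_excel("./results.xlsx")
```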
## Examples

- **LinkedIn Job Scraping**: In this example, ScrapDynamics is set up to scrape job details. It collects the job link, job title, job description, company name, job location, and time of posting, and saves the results to an Excel file.