Scrapy helper
README
Goldendoodle is a universal spider for analyzing whether domains contain a search string.
It is intended for a first analysis of an unknown site. Each run searches for one search string across a list of start URLs. The allowed domain(s) are extracted from the start URLs.
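To illustrate how allowed domains can be derived from start URLs, here is a minimal sketch. It is not taken from the Goldendoodle source: the function name allowed_domains_from is hypothetical, and the sketch assumes the comma-separated start_urls form shown in the command syntax and that the host name of each URL is used as the allowed domain.

```python
from urllib.parse import urlparse


def allowed_domains_from(start_urls: str) -> list[str]:
    """Return the unique host names of a comma-separated URL list, in order.

    Hypothetical helper for illustration; Goldendoodle's own logic may differ.
    """
    domains: list[str] = []
    for url in start_urls.split(","):
        host = urlparse(url.strip()).hostname  # None if the URL has no host part
        if host and host not in domains:
            domains.append(host)
    return domains


print(allowed_domains_from(
    "https://www.drta-archiv.de/blauaugenkaerpflinge/,https://example.com/"
))
# prints ['www.drta-archiv.de', 'example.com']
```

Restricting the crawl to the hosts of the start URLs keeps the spider from following links to unrelated sites.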
The result is a Scrapy output file with Goldendoodle items.
How to execute:
From goldendoodle run:
scrapy crawl recherchedechaîne_gldnddl -O <file> [-a gldnddlSearchString=<string>] -a option=<None|regex|email> -a start_urls=<url>[,<url>]
e.g.:
scrapy crawl recherchedechaîne_gldnddl -O ../reports/findings.json -a gldnddlSearchString='(Bereich.*obe[nr].*?)(?=<)' -a start_urls=https://www.drta-archiv.de/blauaugenkaerpflinge/ 2> ../reports/stderr.log
option
option is optional; if it is omitted, it defaults to None.
If gldnddlSearchString is a regular expression, you should set option=regex; otherwise findingElements and findingElementsQuery will not be filled. With option=regex, findingElements and findingElementsQuery are filled only for the first finding on the site.
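Before starting a crawl, it can help to try a regular expression on a small piece of markup. The sketch below applies the search string from the example command using Python's re module. The sample HTML is made up, and the assumption that Goldendoodle evaluates the pattern with Python's re semantics is not confirmed by this README.

```python
import re

# Search string from the example command: capture text that starts with
# "Bereich", contains "oben" or "ober", and ends just before the next "<".
pattern = re.compile(r"(Bereich.*obe[nr].*?)(?=<)")

# Made-up sample markup, for illustration only.
sample_html = "<p>Bereich: oberer Teil</p><p>x</p>"

match = pattern.search(sample_html)
if match:
    print(match.group(1))
# prints Bereich: oberer Teil
```

The lookahead (?=<) stops the match before the next tag without including the "<" in the captured text.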
With option=email, no gldnddlSearchString is necessary. A built-in search algorithm recognizes the most common ways of encoding email addresses. If you find an email address that is not recognized, please let me know so I can adjust the algorithm.
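As a rough idea of what recognizing encoded email addresses can involve, the following sketch normalizes two simple obfuscations, "[at]" and "[dot]" (also with parentheses), before matching. It is not Goldendoodle's algorithm, which is not described here; the function name find_emails and the patterns are assumptions for illustration only.

```python
import re

# Deliberately simple address pattern; real-world addresses can be more varied.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> list[str]:
    """Return email addresses found in text, undoing two common obfuscations.

    Hypothetical helper; not the algorithm used by Goldendoodle.
    """
    # Turn "name [at] host" or "name (at) host" into "name@host".
    normalized = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    # Turn "host [dot] tld" or "host (dot) tld" into "host.tld".
    normalized = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", normalized, flags=re.IGNORECASE)
    return EMAIL.findall(normalized)


print(find_emails("Contact: info [at] example [dot] com or admin@example.org"))
# prints ['info@example.com', 'admin@example.org']
```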
With option=email and gldnddlSearchString both set, gldnddlSearchString is used, but the results are prepared as if emails had been searched for.
Example:
scrapy crawl recherchedechaîne_gldnddl -O reports/findings.json -a gldnddlSearchString='for use in illustrative examples in documents' -a start_urls=https://example.com/ | tee reports/log.txt
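After a run, the output file can be inspected with a few lines of Python. The sketch below assumes the file was written with a .json extension, so that Scrapy exports a single JSON array of items. The function name summarize_items is hypothetical, and no particular item fields are assumed: it only reports how many items were found and which field names occur.

```python
import json
from pathlib import Path


def summarize_items(path: str) -> tuple[int, list[str]]:
    """Return the number of exported items and the sorted set of field names.

    Hypothetical helper for illustration; assumes a JSON array of objects.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    # An empty file is treated here as "no items".
    items = json.loads(text) if text else []
    fields = sorted({key for item in items for key in item})
    return len(items), fields
```

For the example above, summarize_items("reports/findings.json") would return the item count together with the field names present in the Goldendoodle items.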
Prerequisites
- see requirements.txt
Documentation
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.