Scrapy helper

Project description

README

Goldendoodle is a universal spider for analyzing whether domains contain a search string.

It is intended for a first analysis of an unknown site. Each run searches for one search string across a list of start URLs. The allowed domain(s) are extracted from the start URLs.
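To illustrate how allowed domains can be derived from start URLs, here is a minimal sketch using only the Python standard library. The function name `allowed_domains_from` is hypothetical and this is not Goldendoodle's own code; the actual spider may derive the domains differently.

```python
from urllib.parse import urlparse


def allowed_domains_from(start_urls):
    """Return the unique host names of the given URLs, in first-seen order.

    Hypothetical helper for illustration; not part of Goldendoodle.
    """
    domains = []
    for url in start_urls:
        host = urlparse(url).hostname  # None if the URL has no network location
        if host and host not in domains:
            domains.append(host)
    return domains


print(allowed_domains_from([
    "https://www.drta-archiv.de/blauaugenkaerpflinge/",
    "https://example.com/",
]))
# prints ['www.drta-archiv.de', 'example.com']
```

Restricting the crawl to these host names keeps the spider from following links to other sites.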

The result is a Scrapy output file with Goldendoodle items.

How to execute:

From the goldendoodle directory, run:

scrapy crawl recherchedechaîne_gldnddl -O <file> [-a gldnddlSearchString=<string>] [-a option=<None|regex|email>] -a start_urls=<url>[,<url>]

For example:

scrapy crawl recherchedechaîne_gldnddl -O ../reports/findings.json -a gldnddlSearchString='(Bereich.*obe[nr].*?)(?=<)' -a start_urls=https://www.drta-archiv.de/blauaugenkaerpflinge/ 2> ../reports/stderr.log

option

option is optional; if it is omitted, it defaults to None.

If gldnddlSearchString is a regular expression, then you should set option=regex. Otherwise findingElements and findingElementsQuery will not be filled. With option=regex you will get findingElements and findingElementsQuery only for the first finding on the site.
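Before starting a crawl, it can help to try a regular expression on a small piece of markup. The sketch below runs the expression from the example command through Python's `re` module. It assumes that Goldendoodle evaluates the expression with Python-compatible syntax, which the README does not state, and `sample_html` is invented for illustration.

```python
import re

# The expression from the example command above
pattern = re.compile(r"(Bereich.*obe[nr].*?)(?=<)")

# Invented sample markup, not taken from a real page
sample_html = "<p>Bereich oben links</p>"

match = pattern.search(sample_html)
print(match.group(1) if match else None)
# prints Bereich oben links
```

The lookahead `(?=<)` stops the match just before the next tag, so the captured group contains only the text.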

With option=email no gldnddlSearchString is necessary. A built-in search algorithm recognizes the most common ways of encoding email addresses. If you find an email address that is not recognized, please let me know so I can adjust the algorithm.

With option=email AND gldnddlSearchString set, the gldnddlSearchString is used, but the results are prepared as if emails had been searched for.

Example:
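As a rough idea of what recognizing several email encodings can involve, here is a small sketch that finds plain addresses and one common obfuscation ("name [at] domain"). The patterns and the function `find_emails` are assumptions made for illustration only; they are not Goldendoodle's algorithm, which may cover other encodings.

```python
import re

# Hypothetical patterns for illustration; not Goldendoodle's own.
PLAIN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OBFUSCATED = re.compile(
    r"([A-Za-z0-9._%+-]+)\s*[\[(]\s*at\s*[\])]\s*([A-Za-z0-9.-]+\.[A-Za-z]{2,})",
    re.IGNORECASE,
)


def find_emails(text):
    """Return plain addresses first, then de-obfuscated '[at]' addresses."""
    found = PLAIN.findall(text)
    found += [f"{user}@{domain}" for user, domain in OBFUSCATED.findall(text)]
    return found


print(find_emails("Write to info@example.com or sales [at] example.org."))
# prints ['info@example.com', 'sales@example.org']
```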

scrapy crawl recherchedechaîne_gldnddl -O reports/findings.json -a gldnddlSearchString='for use in illustrative examples in documents' -a start_urls=https://example.com/ | tee reports/log.txt
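After a run like the one above, the output file can be inspected with a few lines of Python. This sketch assumes the file was written with a .json extension, in which case Scrapy exports the items as a single JSON array. The helper names are hypothetical, and `findingElements` is the only item field used because it is one of the two fields this README names.

```python
import json
from pathlib import Path


def load_findings(path):
    """Load the items from a Scrapy JSON feed (a single JSON array)."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


def count_with_elements(items):
    """Count items whose findingElements field is present and non-empty."""
    return sum(1 for item in items if item.get("findingElements"))
```

For example, `count_with_elements(load_findings("reports/findings.json"))` reports how many items carry element information.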

Prerequisites

  • see requirements.txt

Documentation

see Goldendoodle docs

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Download files

Source Distribution

Goldendoodle-23.tar.gz (24.7 MB)

Uploaded Source

Built Distribution

Goldendoodle-23-py3-none-any.whl (25.1 MB)

Uploaded Python 3

File details

Details for the file Goldendoodle-23.tar.gz.

File metadata

  • Download URL: Goldendoodle-23.tar.gz
  • Upload date:
  • Size: 24.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.14 CPython/3.11.1 Linux/6.1.7-200.fc37.x86_64

File hashes

Hashes for Goldendoodle-23.tar.gz
Algorithm Hash digest
SHA256 909af2eefe8244cd54f18a6540506944ed7b08728b5811d21283d132a4af1e1b
MD5 31bdf710b624a628535ddb6196cd560a
BLAKE2b-256 fb406ad0d94f441b1b6c4183fea08205962a48fd0a4af4cca6dda4bdfe442132

File details

Details for the file Goldendoodle-23-py3-none-any.whl.

File metadata

  • Download URL: Goldendoodle-23-py3-none-any.whl
  • Upload date:
  • Size: 25.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.1.14 CPython/3.11.1 Linux/6.1.7-200.fc37.x86_64

File hashes

Hashes for Goldendoodle-23-py3-none-any.whl
Algorithm Hash digest
SHA256 08d007ef15b414feb7fb64c00d613ebbb6ab8ed13bc736b0f116b9421d2bcef9
MD5 e541846ea575acf0ae20b623dd45da15
BLAKE2b-256 e87c6e53c32a9255b1d558a53c659d37b1915e0c37844b8c2bd489c2d4368e65
