
Profile Scout is a kit that uses crawling and machine learning to identify profile pages on any website, simplifying the process of extracting user profiles, gathering information, and performing targeted actions.

Project description

Profile Scout


About

Profile Scout is a versatile Python package that detects and scrapes profile pages on any given website, with support for information extraction. Leveraging its search functionality and machine learning, the tool crawls the provided URL and identifies the URLs of profile pages within the website. This offers a convenient way to extract user profiles, gather information, and perform targeted actions on profile pages, making it a useful asset for data collection, web scraping, and analysis tasks. It also supports information extraction, allowing users to pull specific data from profile pages efficiently.

Profile Scout can be useful to:

  1. Investigators and OSINT Specialists (information extraction, creating information graphs, ...)
  2. Penetration Testers and Ethical Hackers/Social Engineers (information extraction, reconnaissance, profile building)
  3. Scientists and researchers (data engineering, data science, social science, research)
  4. Companies (talent research, marketing, contact acquisition/harvesting)
  5. Organizations (contact acquisition/harvesting, data collecting, database updating)

Capabilities

Profile Scout is primarily a crawler. For a given URL, it crawls the site and performs the selected action. If a file with URLs is provided, each URL is processed in a separate thread.

Main features:

  1. Flexible and controlled page scraping (HTML, page screenshot, or both)
  2. Detecting and scraping profile pages during the crawling process
  3. Locating the collective page from which all profile pages originate
  4. Information extraction from HTML files

Options:

-h, --help            
    show this help message and exit
    
--url URL             
    URL of the website to crawl
    
-f URLS_FILE_PATH, --file URLS_FILE_PATH
    Path to the file with URLs of the websites to crawl
    
-D DIRECTORY, --directory DIRECTORY
    Extract data from HTML files in the directory. To avoid saving output, set '-ep'/'--export-path' to ''

-v, --version
    print current version of the program

-a {scrape_pages,scrape_profiles,find_origin}, --action {scrape_pages,scrape_profiles,find_origin}
    Action to perform at a time of visiting the page (default: scrape_pages)
    
-b, --buffer          
    Buffer errors and outputs until crawling of website is finished and then create logs
    
-br, --bump-relevant  
    Bump relevant links to the top of the visiting queue (based on RELEVANT_WORDS list)
    
-ep EXPORT_PATH, --export-path EXPORT_PATH
    Path to destination directory for exporting
    
-ic {scooby}, --image-classifier {scooby}
    Image classifier to be used for identifying profile pages (default: scooby)
    
-cs CRAWL_SLEEP, --crawl-sleep CRAWL_SLEEP
    Time to sleep between each page visit (default: 2)
    
-d DEPTH, --depth DEPTH
    Maximum crawl depth (default: 2)
    
-if, --include-fragment
    Consider links with a URI fragment (e.g. http://example.com/some#fragment) as a separate page
    
-ol OUT_LOG_PATH, --output-log-path OUT_LOG_PATH
    Path to output log file. Ignored if '-f'/'--file' is used
    
-el ERR_LOG_PATH, --error-log-path ERR_LOG_PATH
    Path to error log file. Ignored if '-f'/'--file' is used
    
-so {all,html,screenshot}, --scrape-option {all,html,screenshot}
    Data to be scraped (default: all)
                
-t MAX_THREADS, --threads MAX_THREADS
    Maximum number of threads to use if '-f'/'--file' is provided (default: 4)
    
-mp MAX_PAGES, --max-pages MAX_PAGES
    Maximum number of pages to scrape; a page is considered scraped if the action is performed successfully (default: unlimited)
    
-p, --preserve        
    Preserve whole URI (e.g. 'http://example.com/something/' instead of 'http://example.com/')

-r RESOLUTION, --resolution RESOLUTION
    Resolution of headless browser and output images. Format: WIDTHxHEIGHT (default: 2880x1620)

Full input line format is: '[DEPTH [CRAWL_SLEEP]] URL'

DEPTH and CRAWL_SLEEP are optional; if a single number is present, it is treated as DEPTH.
For example, "3 https://example.com" means that the URL should be crawled to a depth of 3.

If a field (DEPTH or CRAWL_SLEEP) is present in a line, the corresponding command-line argument is ignored for that URL.
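Under these rules, a URLs file might look like the following (hypothetical URLs; the first line sets only DEPTH, the second sets both DEPTH and CRAWL_SLEEP, and the third falls back to the command-line arguments):

```shell
# sample links.txt (hypothetical URLs)
cat > links.txt <<'EOF'
3 https://example.com
2 5 https://example.org
https://example.net
EOF
```

This file can then be passed to the crawler with '-f links.txt'.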

Writing too much to the storage drive can reduce its lifespan. To mitigate this, if there are more than
30 links, informational and error messages are buffered and written at the end of
the crawling process.

RELEVANT_WORDS=['profile', 'user', 'users', 'about-us', 'team', 'employees', 'staff', 'professor', 
                'profil', 'o-nama', 'zaposlen', 'nastavnik', 'nastavnici', 'saradnici', 'profesor', 'osoblje', 
                'запослен', 'наставник', 'наставници', 'сарадници', 'професор', 'особље']

Common Use Cases

Note: The order of arguments/switches doesn't matter

Scraping

Scrape the URL up to a depth of 2 (-d) or a maximum of 300 scraped pages (-mp), whichever comes first. Store scraped data at /data (-ep)

profilescout --url https://example.com -d 2 -mp 300 -ep /data

Scrape HTML (-so html) for every page up to a depth of 2 for the list of URLs (-f). The number of threads is set with -t

profilescout -ep /data -t `nproc` -f links.txt -d 2 -so html

Start scraping screenshots from a specific page (-p). It is important to note that without -p, the program would ignore the full path, specifically the /about-us/meet-the-team/ part

profilescout -p --url https://www.wowt.com/about-us/meet-the-team/ -mp 4 -so screenshot

Scrape each website in the URLs list and postpone writing to the storage drive (by using a buffer, -b)

profilescout -b -t `nproc` -f links.txt -d 0 -ep /data

Profile related tasks

Scrape profile pages (-a scrape_profiles) and prioritize links that are relevant to a specific domain (-br). For example, when searching for professors' profile pages, we would give priority to links containing related terms that could lead to a profile page. Note: the list of relevant words can be changed in constants.py

profilescout -br -t `nproc` -f links.txt -a scrape_profiles -mp 30

Find and screenshot profiles, store each as a 600x400 (-r) image, and then wait (-cs) 30 seconds before moving to the next profile

profilescout -br -t `nproc` -f links.txt -a scrape_profiles -mp 1000 -d 3 -cs 30 -r 600x400

Locate the origin page of profile pages (-a find_origin) with the classifier called scooby (-ic scooby). Note that visited pages are logged, so this can also be used for something like scanning the website

profilescout -t `nproc` -f links.txt -a find_origin -ic scooby

Information extraction

Extract information (-D) from profile HTML files located at /data and store the results at ~/results (-ep)

profilescout -D /data -ep ~/results

Installation

PyPI

pip3 install profilescout

Source

Host

  1. Create virtual environment (optional, but recommended)
python3 -m venv /path/to/some/dir
  2. Activate virtual environment (skip if you skipped the first step)
source /path/to/some/dir/bin/activate
  3. Install requirements
pip3 install -r requirements.txt
  4. Install package locally
pip3 install -e .
  5. Explore
profilescout -h

Docker container

  1. Create the image and run the container. Execute this in the project's directory
mkdir "/path/to/screenshot/dir/"            # if it does not exist
# this line may differ depending on your shell, 
# so check the documentation for the equivalent file to .bashrc
echo 'export SS_EXPORT_PATH="/path/to/screenshot/dir/"' >> ~/.bashrc
docker build -t profilescout .
docker run -it -v "$SS_EXPORT_PATH":/data profilescout

Add --rm if you want it to be disposable (one-time task)

  2. Test deployment (inside docker container)
profilescout -mp 4 -t 1 -ep '/data' -p --url https://en.wikipedia.org/wiki/GNU

Possibilities for future improvements

  • Classification
    • Profile classification based on existing data (without crawling)
    • Classification using HTML and images, as well as the selection of appropriate classifiers
  • Scraping
    • Intelligent downloading of files through links available on the profile page
  • Crawling
    • Support for scraping using proxies.
  • Crawling actions
    • Ability to provide custom actions
    • Actions before and after page loading.
    • Multiple actions for each stage of page processing (before, during, and after access).
  • Crawling strategy
    • Ability to provide custom heuristics
    • Ability to choose crawling strategy (link filters, etc.)
    • Support for deeper link bump
    • Selection of relevant words using CLI
  • Usability
    • Saving progress and the ability to resume
    • Increased automation (if the profile is not found at depth DEPTH, increase the depth and continue).
  • Extraction
    • Support for national numbers, e.g. 011/123-4567
    • Experiment with lightweight LLMs
    • Experiment with Key-Value extraction and Layout techniques like LayoutLM

Contributing

If you discover a bug or have a feature idea, feel free to open an issue or PR.
Any improvements or suggestions are welcome!

Project details


Download files


Source Distribution

profilescout-0.3.2.post1.tar.gz (41.7 kB)

Uploaded Source

Built Distribution


profilescout-0.3.2.post1-py3-none-any.whl (46.5 kB)

Uploaded Python 3

File details

Details for the file profilescout-0.3.2.post1.tar.gz.

File metadata

  • Download URL: profilescout-0.3.2.post1.tar.gz
  • Size: 41.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.24.1

File hashes

Hashes for profilescout-0.3.2.post1.tar.gz
Algorithm Hash digest
SHA256 7063d4c744968ac08d23d1311151c5fa3813e0f51b4b4cb86c27fa4a4975cddb
MD5 e510a911685578e76fc79453a3e074af
BLAKE2b-256 5ef5d3bcceb16cf311fe0a4136a559be18d4edc28fa98196450e6c61a804ae8d


File details

Details for the file profilescout-0.3.2.post1-py3-none-any.whl.

File metadata

File hashes

Hashes for profilescout-0.3.2.post1-py3-none-any.whl
Algorithm Hash digest
SHA256 9ec8d68533ad67daf6bf139d959895ad187ac8b40d12559419cc99bbecdcf950
MD5 d0ecf7e063063eeb32f5ba2f23ffd658
BLAKE2b-256 9a95cad95877c2cbeb2e0dc6389714c5ea5fcef2c27e3c7ff24e8c6da62e1e20

