Skip to main content

A powerful web scraping and downloading utility

Project description

Scrappify

Scrappify is a powerful yet simple website scraping and downloading tool. It allows you to easily scrape links, download files, filter by file types, extract patterns (like emails or phone numbers), and perform deep crawling — all from Python or the command line.


Features

  • Download entire websites
  • Extract links, emails, phone numbers, or custom regex patterns
  • Filter downloads by file type (images, documents, scripts, etc.)
  • Fast downloads with configurable workers
  • Cross-domain crawling support
  • Command-line interface (CLI) and Python API

Installation

pip install scrappify

Python Usage

Basic Usage

from scrappify import url, scrap, download

# Download entire website
url_download = url("https://example.com")
downloaded_files = download(url_download, output_dir="my_site")
print(f"Downloaded {len(downloaded_files)} files")

# Get all links from a page
links = scrap(url_download)
print(f"Found {len(links)} links")

File Type Filtering

from scrappify import url, download
from scrappify.patterns import file_type

# Download only JavaScript files
js_files = download("https://example.com", file_type="js", output_dir="js_files")

# Download images using category
images = download("https://example.com", file_type=file_type['image'], output_dir="images")

# Download multiple specific file types
docs_and_images = download("https://example.com", file_type=["pdf", "jpg", "png"])

Pattern Searching

from scrappify import url, download
from scrappify.patterns import pattern

# Find emails in all downloaded files
email_results = download("https://example.com", pattern=pattern['email'])

# Find phone numbers in HTML files only
phone_results = download("https://example.com", file_type="html", pattern=pattern['phone'])

# Custom regex pattern
custom_pattern = r'\b\d{3}-\d{2}-\d{4}\b'  # SSN pattern
ssn_results = download("https://example.com", pattern=custom_pattern)

# Combine file type and pattern
results = download("https://example.com", file_type="js", pattern=pattern['url'])

Advanced Scraping

from scrappify import url, scrap, download

# Deep crawling (multiple levels)
deep_links = scrap("https://example.com", depth=3)
print(f"Found {len(deep_links)} links across 3 levels")

# Download with increased workers
fast_download = download("https://example.com", max_workers=20, output_dir="fast_download")

# Cross-domain downloading (disable same-domain restriction)
all_links = scrap("https://example.com", same_domain_only=False)

Programmatic Pattern Extraction

from scrappify.core.utils import search_pattern_in_file

# Search pattern in specific file
results = search_pattern_in_file("downloaded_file.html", r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b')
for result in results:
    print(f"Email found: {result['match']} at line {result['line']}")

Command Line Usage

# Download entire website
scrappify https://example.com -o my_site

# Download only PDF files
scrappify https://example.com -t pdf -o documents

# Download images and search for emails
scrappify https://example.com -t image -p email -o images_with_emails

# Deep crawl (3 levels) and download everything
scrappify https://example.com -d 3 -o deep_site

# Use custom regex pattern
scrappify https://example.com -p '\b\d{3}-\d{2}-\d{4}\b' -o ssn_search

# List available patterns
scrappify --list-patterns

# List available file types
scrappify --list-types

# High-performance download with 20 workers
scrappify https://example.com -w 20 -o fast_download

Complex Examples

# Download all JavaScript and CSS files, search for URLs
scrappify https://example.com -t javascript -t css -p url -o assets_with_urls

# Download documents and images, search for prices
scrappify https://example.com -t document -t image -p price -o priced_content

# Deep crawl with custom pattern
scrappify https://example.com -d 2 -p '#[a-zA-Z0-9_]+' -o hashtags

Available Options

File Types

  • image → png, jpg, gif, svg, etc.
  • document → pdf, docx, txt, etc.
  • javascript, css, html
  • Custom extensions supported (e.g., zip, mp4)

Patterns

  • email → find emails
  • phone → detect phone numbers
  • url → extract URLs
  • price → detect price patterns
  • Custom regex patterns supported

License

MIT License © 2025 [MrFidal]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrappify-0.0.1.tar.gz (8.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

scrappify-0.0.1-py3-none-any.whl (8.6 kB view details)

Uploaded Python 3

File details

Details for the file scrappify-0.0.1.tar.gz.

File metadata

  • Download URL: scrappify-0.0.1.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for scrappify-0.0.1.tar.gz
Algorithm Hash digest
SHA256 8ebf524e78cede0598ceedf6213504d4830fa416b025069d0e190c90ba0d7fbe
MD5 8c5ded4056078f9ce9cf58f7fe443977
BLAKE2b-256 a426307582480604e891dd22ed1bcdb9c28497e3605d686245ce866d2f4064bc

See more details on using hashes here.

File details

Details for the file scrappify-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: scrappify-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 8.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for scrappify-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 44aea120bd9c95d0b31cd2f82d5666fa497b0c627cf82890b8c843b55676a735
MD5 e7c8c51b4dfffe3ae4941a9aff4c0b2d
BLAKE2b-256 6304b9e36203d24758e08a628322077df06a27157be647a5956297ec85bf980f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page