easyscrapper is a fast, lightweight Python package and CLI tool that lets developers, data scientists, and AI engineers extract text, HTML, emails, links, canonical, meta and images from any public webpage - perfect for AI, RAG pipelines, SEO, content aggregation, and scalable data workflows with just one command or a few lines of code.

These details have not been verified by PyPI

Project links

Project description

EasyScrapper

EasyScrapper is a simple yet powerful Python package that helps you scrape structured content, plain text, emails, links, images, and generate RAG-ready chunks from any webpage. Built on requests and BeautifulSoup, it is easy to use and ready for integration with AI/NLP workflows.

Installation

To install the EasyScrapper package, you can use pip. Run the following command:

pip install easyscrapper

Features

Feature	Description
URL Fetching	Fetches and parses HTML content from any public webpage.
Custom Tag Extraction	Extracts content from specified HTML tags like h1, h2, p, etc.
Link Extraction	Gathers all valid hyperlinks from anchor tags (`<a href>`).
Email Extraction	Finds email addresses from plain text and mailto: links.
Image Extraction	Extracts images from `<img>`, href, and style attributes including background images.
Plain Text Extraction	Extracts and cleans all text content from the page.
RAG Chunking	Converts plain text into AI-friendly chunks (useful for RAG and vector databases).
Save to File	Saves the scraped result in `.txt` or `.json` formats.
Manual Chunking	Allows users to input their own text and get it chunked for AI applications.

Use Cases

Use Case	Description
Basic Web Scraping	Extract headings, paragraphs, links, and other HTML content from any public website.
SEO Audits	Collect all headings and metadata to analyze page structure and content hierarchy.
Email Lead Collection	Quickly find and extract emails from websites and landing pages.
Image Collection	Gather all image URLs (including hidden ones) for data analysis or machine learning.
Plain Text Conversion	Convert a web page into clean, readable plain text.
RAG (Retrieval-Augmented Generation)	Convert scraped or custom text into clean, token-friendly chunks for RAG/LLM ingestion.
Data Archiving	Save scraped data in JSON or TXT format for offline use or long-term storage.
Preprocessing for AI	Prepare web data for indexing into vector stores like Pinecone, FAISS, ChromaDB, etc.

Function Reference

Function	Description
`fetch_content()`	Loads content from the URL
`get_raw_content()`	Returns the raw HTML
`extract_title()`	Gets the `<title>` of the page
`extract_custom_tags(tags)`	Extracts content from specified tags
`extract_links()`	Extracts all absolute hyperlinks
`extract_emails()`	Finds emails in the HTML and mailto links
`extract_images()`	Finds images from `<img>`, href, style
`extract_plain_text()`	Returns the entire plain text of the page
`chunk_plain_text(text, max_words)`	Chunks plain text into N-word segments
`chunk_provided_text(text, max_words)`	Same as above, but from user input
`scrape_all(tags, plain_text, rag_format)`	Extracts everything at once
`extract_meta_tags()`	Extracts meta tags such as title, description, keywords, and Open Graph data
`get_canonical_url()`	Extracts the canonical URL
`save_to_file(data, filename)`	Saves output to `.txt` or `.json` file

EasyScrapper CLI Options

Flag	Alias	Description	Example Usage
`--tags`	`-t`	Extract specified HTML tags (`h1`, `p`, `div`, etc.)	`--tags h1 p div`
`--plain-text`	`-p`	Extract full plain visible text	`--plain-text` or `-p`
`--emails`	`-e`	Extract all email addresses (inline and `mailto:` links)	`--emails` or `-e`
`--links`	`-l`	Extract all hyperlinks (from `<a href>`)	`--links` or `-l`
`--images`	`-i`	Extract all image URLs (including background images in `style`/`href`)	`--images` or `-i`
`--rag`	`-r`	Extract RAG-style chunks from plain text (for vector DBs or AI pipelines)	`--rag` or `-r`
`--meta`	`-m`	Extract all `<meta>` tags (SEO, Open Graph, Twitter card info, etc.)	`--meta` or `-m`
`--canonical`	`-c`	Extract canonical link tag (`<link rel="canonical" href="...">`)	`--canonical` or `-c`
`--output`	`-o`	Save the result to a JSON file	`--output data.json` or `-o out.json`
`--help`	`-h`	Show help message

Usage

1. Fetching Raw Content - fetch_content()

Fetches and prepares the HTML content from the given URL. This is required before calling any other function.

from easyscrapper.scraper import WebScraper

# Specify the URL to scrape
url = 'https://www.example.com'  # Replace with your target URL

# Initialize the WebScraper
scraper = WebScraper(url)

# Fetch content from the URL
scraper.fetch_content()

# Get and print the raw content
raw_content = scraper.get_raw_content()
print("Raw Content:")
print(raw_content)  # Print the entire raw content
# Output: Entire HTML content of the webpage

# Get and print the first 500 characters of the raw content
print("\nFirst 500 Characters of Raw Content:")
print(raw_content[:500])  # Print the first 500 characters
# Output: (First 500 characters of the HTML content)

# Get and print the last 500 characters of the raw content
print(raw_content[-500:])  # Print the last 500 characters
# Output: (Last 500 characters of the HTML content)

2. Fetching Raw Content - get_raw_content()

Returns the raw HTML content (with all tags and markup).

scraper.fetch_content()
html = scraper.get_raw_content()
print(html[:500])  # Print the first 500 characters of raw HTML

3. Extract Page Title - extract_title()

Retrieve the content of the title tag from the webpage.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

title = scraper.extract_title()
print("Page Title:", title)

4. Extract Specific Tags - extract_custom_tags(tags: List[str])

Get only the clean text from selected HTML tags like headings (h1, h2) or paragraphs (p).

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

data = scraper.extract_custom_tags(['h1', 'p'])
print(data)

5. Extract All Links - extract_links()

Finds and returns all hyperlink URLs ( href="...") as absolute URLs.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

links = scraper.extract_links()
print("Links found:", links)

6. Extract Emails - extract_emails()

Detect and extract all email addresses present on the webpage.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

emails = scraper.extract_emails()
print("Emails:", emails)

7. Extract Image URLs - extract_images()

Automatically extract all image sources, including those in image tags, links, and styles.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

images = scraper.extract_images()
for img in images:
    print("Image:", img['src'], "| Alt Text:", img['alt'])

8. Extract Plain Text - extract_plain_text()

Get the full visible plain text of the webpage, without any HTML tags or scripts.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

text = scraper.extract_plain_text()
print("Visible Text:", text[:500])

9. Chunk Extracted Text - chunk_plain_text(text: str, max_words: int = 100)

Split the extracted plain text into smaller chunks (e.g., 100 words each) for use in NLP, AI, or search systems.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

plain_text = scraper.extract_plain_text()
chunks = scraper.chunk_plain_text(plain_text, max_words=100)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk}")

10. Chunk Custom Text - chunk_provided_text(text: str, max_words: int = 100)

Split your own text (not scraped) into chunks â€” useful for preparing content for AI training or search indexing.

scraper = WebScraper('https://www.example.com')

custom_text = "This is a long paragraph that needs to be broken into smaller chunks..."
chunks = scraper.chunk_provided_text(custom_text, max_words=50)

for chunk in chunks:
    print(chunk)

11. Scrape All Content Together - scrape_all(tags=None, plain_text=True, rag_format=False)

Extract everything in one go - title, text tags, emails, links, images, and full text.

plain_text=True to extract full text
rag_format=True to get AI/RAG chunks

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()
# Extract full data with plain text
full_data = scraper.scrape_all(tags=['h1', 'p'], plain_text=True)
print(full_data)

12. Get RAG-Friendly Chunks - scrape_all(tags=None, plain_text=False, rag_format=True)

Automatically turn plain text into source-linked chunks ready for retrieval-augmented generation (RAG) systems.

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

# Get just RAG-ready text chunks
rag_data = scraper.scrape_all(tags=['p'], rag_format=True)
for chunk in rag_data['rag_chunks']:
    print(chunk)

13. Save Output to File - save_to_file(data, filename='output.txt')

Save any output (plain text or structured data) to a .txt or .json file.

.txt for plain text
.json for structured data

scraper = WebScraper('https://www.example.com')
scraper.fetch_content()

# Save plain text
text = scraper.extract_plain_text()
scraper.save_to_file(text, filename='easyscrapper_output.txt')

# Save full scraped result as JSON
data = scraper.scrape_all(tags=['h1', 'p'], plain_text=True)
scraper.save_to_file(data, filename='easyscrapper_output.json')

14. Meta Tag Extraction - extract_meta_tags()

Extracts all <meta> tags from a webpage, including name, property, and content attributes. This is helpful for getting SEO data, Open Graph info, keywords, descriptions, etc.

scraper = WebScraper('https://pypi.org/project/easyscrapper/')
scraper.fetch_content()

meta_tags = scraper.extract_meta_tags()
print("Meta Tags:")
for name, content in meta_tags.items():
    print(f"{name}: {content}")

15. Canonical URL Extraction - get_canonical_url()

Finds the <link rel="canonical" href="..."> tag, which indicates the preferred URL for the content. Useful in SEO or deduplication tasks.

scraper = WebScraper('https://github.com/Krishnatadi/')
scraper.fetch_content()

canonical = scraper.get_canonical_url()
print("Canonical URL:", canonical)

Example CLI Commands

# Extract h1, h2, and p tags
easyscrapper https://www.example.com --tags h1 h2 p

# Extract all plain text and save to a JSON file
easyscrapper https://www.example.com --plain --output page.json

# Extract all emails, links, and images
easyscrapper https://www.example.com --emails --links --images

# RAG-style chunked output
easyscrapper https://www.example.com --rag

# Full combo with everything
easyscrapper https://www.example.com -t h1 p -p -e -l -i -r -o full_output.json

# Extracts all meta tags
easyscrapper https://www.example.com --meta

# Extracts canonical URL
easyscrapper https://www.example.com --canonical

# Extracts both meta tags and canonical URL
easyscrapper https://www.example.com -m -c

What Can You Do with EasyScrapper?

Whether you're a developer, researcher, or digital marketer, EasyScrapper helps you pull structured content from websitesâ€”fast. Hereâ€™s how you can put it to work:

AI & Machine Learning

Build NLP Training Sets: Easily gather clean text from multiple websites to create high-quality datasets.
Summarize Content Automatically: Scrape blog posts or articles and feed them into summarization models.
Train Language Models: Collect multilingual data to train or test language detection and classification systems.

Retrieval-Augmented Generation (RAG)

Feed Your Vector DB: Break down scraped text into chunks for vector embedding with FAISS, Pinecone, etc.
Inject Real-Time Context: Dynamically scrape relevant info and add it to prompts for more accurate LLM outputs.
Tap into Web Knowledge: Let your RAG systems stay up-to-date with live website content.

Web & SEO Analytics

Analyze HTML Tags: Extract headings, paragraphs, and other tags to optimize SEO strategies.
Monitor Links: Find broken or redirected internal and outbound links.
Audit Media: Check images for alt-text, size, and loading performance.

Market Intelligence

Track Competitors: Monitor product pages or blogs for pricing or messaging changes.
Spot Trends: Pull headlines or blog snippets to detect emerging topics.
Collect Emails (Ethically!): Scrape emails for outreachâ€”just be sure to follow privacy laws.

Legal & Compliance

Compare Legal Docs: Scrape privacy policies or terms of service for quick analysis or summaries.
Run Compliance Checks: Make sure required disclaimers or clauses exist on client or partner sites.

News Aggregation

Chunk the News: Break long articles into readable bits for newsletters or summaries.
Detect Events in Real-Time: Pull headlines to power event detection systems or alerts.

Developer Utilities

Generate Realistic Test Data: Use real-world HTML for unit or integration tests.
Build Your Own Crawler: Combine EasyScrapper with your custom logic to scrape multiple pages or domains.

Documentation Mining

Scrape Technical Docs: Pull changelogs, API references, or developer guides for indexing or embedding.
Parse Open Source Projects: Extract structured info from READMEs or API docs.

E-commerce Data Extraction

Mine Product Info: Scrape titles, prices, descriptions, and more for research or comparison.
Grab Images & Metadata: Collect all product-related images and details automatically.

Research & Academia

Create Custom Web Datasets: Pull content from niche sites to build your own research corpus.
Explore Search & Retrieval: Use real-world data for experiments in search relevance, ranking, and more.

Join the Community

If you find EasyScrapper useful, consider giving it a star on GitHub and sharing it with others. Your support helps us improve and expand our tool!

Issues and Feedback

For issues, feedback, and feature requests, please open an issue on our GitHub Issues page. We actively monitor and respond to community feedback.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

1.0.2

Jun 30, 2025

1.0.1

Jun 30, 2025

1.0.0

Nov 1, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

easyscrapper-1.0.2.tar.gz (16.2 kB view details)

Uploaded Jun 30, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

easyscrapper-1.0.2-py3-none-any.whl (11.5 kB view details)

Uploaded Jun 30, 2025 Python 3

File details

Details for the file easyscrapper-1.0.2.tar.gz.

File metadata

Download URL: easyscrapper-1.0.2.tar.gz
Upload date: Jun 30, 2025
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for easyscrapper-1.0.2.tar.gz
Algorithm	Hash digest
SHA256	`681514169b69abf1eb7ae14a50849afda6557acfcca3e455269d602e57e946f6`
MD5	`f6147fb1873f18631cc98ead52b1a040`
BLAKE2b-256	`78040dcd86f0cd604b00ff93bde4213654b8bb619d4a574d3cac00ca989cca8f`

See more details on using hashes here.

File details

Details for the file easyscrapper-1.0.2-py3-none-any.whl.

File metadata

Download URL: easyscrapper-1.0.2-py3-none-any.whl
Upload date: Jun 30, 2025
Size: 11.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for easyscrapper-1.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`61489cb4d3e77cca612f3a038c89a7fa3f1fac22fbb36aa59af30dcff0f12e11`
MD5	`f384afc073ad6150261198203b258457`
BLAKE2b-256	`be0c85d4ec3f7b9a54698bb04bd0006442a7a01e14557728952eda678bddce63`

See more details on using hashes here.

easyscrapper 1.0.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

EasyScrapper

Installation

Features

Use Cases

Function Reference

EasyScrapper CLI Options

Usage

1. Fetching Raw Content - fetch_content()

2. Fetching Raw Content - get_raw_content()

3. Extract Page Title - extract_title()

4. Extract Specific Tags - extract_custom_tags(tags: List[str])

5. Extract All Links - extract_links()

6. Extract Emails - extract_emails()

7. Extract Image URLs - extract_images()

8. Extract Plain Text - extract_plain_text()

9. Chunk Extracted Text - chunk_plain_text(text: str, max_words: int = 100)

10. Chunk Custom Text - chunk_provided_text(text: str, max_words: int = 100)

11. Scrape All Content Together - scrape_all(tags=None, plain_text=True, rag_format=False)

12. Get RAG-Friendly Chunks - scrape_all(tags=None, plain_text=False, rag_format=True)

13. Save Output to File - save_to_file(data, filename='output.txt')

14. Meta Tag Extraction - extract_meta_tags()

15. Canonical URL Extraction - get_canonical_url()

Example CLI Commands

What Can You Do with EasyScrapper?

AI & Machine Learning

Retrieval-Augmented Generation (RAG)

Web & SEO Analytics

Market Intelligence

Legal & Compliance

News Aggregation

Developer Utilities

Documentation Mining

E-commerce Data Extraction

Research & Academia

Join the Community

Issues and Feedback

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes