
webclip

webclip is a powerful Python tool to fetch, extract, and convert the main content of webpages into clean, readable Markdown or plain text. It intelligently removes clutter like ads, headers, and navigation bars, letting you focus on the article's core content.

It can be used as a command-line application for quick conversions and batch processing, or as a library in your own Python projects.

Features

  • Smart Content Extraction: Uses readability to identify and extract the primary article or content from any URL
  • Dual Output Formats: Converts cleaned HTML to either rich Markdown or customizable plain text
  • Batch Processing: Process multiple URLs from files or stdin for automation
  • Robust Network Handling: Built-in retry logic with exponential backoff for unreliable connections
  • Flexible Text Options: Fine-tune output by removing links, images, emphasis, tables, or lists
  • Metadata Extraction: Optionally include page titles, HTTP status codes, and other page metadata
  • Multiple Input Methods: Single URLs, file lists, or piped input from other commands
  • Library Integration: Clean API for use in your own Python projects

Installation

webclip is published on PyPI as webclipper, so you can install it directly:

pip install webclipper

Alternatively, to work from the source, clone the repository and install it with pip:

# Clone the repository
git clone https://github.com/your-username/webclip.git
cd webclip

# Install the package in editable mode
# (Your changes to the source code will be reflected immediately)
pip install -e .

This will install the package and its dependencies, and make the webclip command available in your terminal.

Dependencies

requests
readability-lxml
html2text

Command-Line Usage

Basic Examples

Get plain text content:

webclip "https://en.wikipedia.org/wiki/Python_(programming_language)"

Get Markdown output:

webclip "https://www.example.com/article" --markdown

Include source URL and metadata:

webclip "https://www.example.com/article" -m -i --metadata

Batch Processing

Process URLs from a file:

# Create a file with URLs (one per line)
echo "https://example.com/article1" > urls.txt
echo "https://example.com/article2" >> urls.txt

# Process all URLs and save to file
webclip -f urls.txt -o results.txt

Process URLs from stdin:

echo "https://example.com" | webclip --markdown
# or
cat urls.txt | webclip --no-links
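
If you are scripting around webclip, the one-URL-per-line file format is easy to generate and consume from your own code as well. A minimal standard-library sketch (an illustration, not webclip's own parser) that reads such a file, skipping blank lines:

```python
from pathlib import Path

def read_url_list(path):
    """Read a one-URL-per-line file, skipping blank lines and
    surrounding whitespace. A sketch, not webclip's own parser."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]

# Example: write and read back a small URL list
Path("urls.txt").write_text(
    "https://example.com/article1\n\nhttps://example.com/article2\n"
)
print(read_url_list("urls.txt"))
# → ['https://example.com/article1', 'https://example.com/article2']
```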

Text Formatting Options

Clean text output (remove formatting elements):

webclip "https://example.com" --no-links --no-images --no-emphasis

Minimal text output:

webclip "https://example.com" --no-links --no-images --no-tables --no-lists

Preserve original formatting:

webclip "https://example.com" --preserve-formatting --preserve-whitespace
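
To see what an option like --no-links does conceptually: in Markdown, a link is written [text](url), and removing links means keeping only the visible text. A standard-library sketch of that transformation (illustration only, not webclip's implementation, which works on the HTML before conversion):

```python
import re

def strip_markdown_links(text):
    """Replace [label](url) links with just their label.
    Illustration only, not webclip's actual implementation."""
    return re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)

print(strip_markdown_links("Read the [docs](https://example.com/docs) first."))
# → Read the docs first.
```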

Network Configuration

Custom timeout and retry settings:

webclip "https://slow-site.com" --timeout 30 --retries 5 --delay 2.0

Quiet mode for scripting:

webclip "https://example.com" --quiet > content.txt

Complete Command Reference

usage: webclip [-h] [-f FILE] [-o OUTPUT] [-m] [-i] [--metadata]
               [--no-links] [--no-images] [--no-emphasis] [--no-tables]
               [--no-lists] [--preserve-whitespace] [--preserve-formatting]
               [--timeout TIMEOUT] [--retries RETRIES] [--delay DELAY]
               [--quiet] [--version]
               [url]

Examples:
  webclip https://example.com                    # Basic usage
  webclip https://example.com -m -i              # Markdown with source URL
  echo "https://example.com" | webclip           # From stdin
  webclip -f urls.txt -o output.txt              # Process file of URLs
  webclip https://example.com --no-links --quiet # Clean text output

Options:
  -h, --help            show this help message and exit
  url                   URL to process (optional if using -f or stdin)
  -f FILE, --file FILE  File containing URLs to process (one per line)
  -o OUTPUT, --output OUTPUT
                        Output file (default: stdout)
  -m, --markdown        Output in Markdown format (default: plain text)
  -i, --include-url     Include source URL in output
  --metadata            Include metadata (title, status, etc.) in output
  --quiet               Suppress progress messages
  --version             show program's version number and exit

Text Formatting (ignored with -m):
  --no-links            Remove links
  --no-images           Remove image references
  --no-emphasis         Remove bold/italic
  --no-tables           Remove tables
  --no-lists            Remove list formatting
  --preserve-whitespace Don't clean up excessive whitespace
  --preserve-formatting Preserve original formatting quirks

Network Options:
  --timeout TIMEOUT     Request timeout (default: 15s)
  --retries RETRIES     Retry attempts (default: 3)
  --delay DELAY         Delay between retries (default: 1.0s)

Library Usage

You can import webclip into your own Python scripts for programmatic content extraction:

Basic Usage

from webclip.main import WebClip, TextOptions

# Initialize WebClip
webclip = WebClip()

# Extract content from a URL
url = "https://en.wikipedia.org/wiki/Web_scraping"

try:
    # Get Markdown content
    markdown_content, metadata = webclip.get_url_content(url, output_format='markdown')
    print("--- MARKDOWN ---")
    print(markdown_content)
    print(f"Title: {metadata['title']}")
    
    # Get plain text content with custom options
    text_options = TextOptions(no_links=True, no_images=True)
    text_content, _ = webclip.get_url_content(url, output_format='text', text_options=text_options)
    print("\n--- CLEAN TEXT ---")
    print(text_content)
    
except Exception as e:
    print(f"Error: {e}")

Advanced Usage

from webclip.main import WebClip, TextOptions

# Initialize with custom settings
webclip = WebClip(timeout=30, retries=5, delay=2.0)

# Custom text processing options
text_options = TextOptions(
    no_links=True,
    no_images=True,
    no_emphasis=False,  # Keep bold/italic
    strip_whitespace=True,
    preserve_formatting=False
)

urls = [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
]

results = []
for url in urls:
    try:
        content, metadata = webclip.get_url_content(
            url, 
            output_format='text', 
            text_options=text_options
        )
        results.append({
            'url': url,
            'title': metadata['title'],
            'content': content,
            'success': True
        })
    except Exception as e:
        results.append({
            'url': url,
            'error': str(e),
            'success': False
        })

# Process results
successful = [r for r in results if r['success']]
failed = [r for r in results if not r['success']]

print(f"Successfully processed: {len(successful)}")
print(f"Failed: {len(failed)}")

TextOptions Configuration

from webclip.main import TextOptions

# Maximum cleanup - minimal text output
minimal_options = TextOptions(
    no_links=True,
    no_images=True,
    no_emphasis=True,
    no_tables=True,
    no_lists=True,
    strip_whitespace=True
)

# Preserve formatting - keep original structure
preserve_options = TextOptions(
    no_links=False,
    no_images=False,
    no_emphasis=False,
    no_tables=False,
    no_lists=False,
    strip_whitespace=False,
    preserve_formatting=True
)

Error Handling

webclip includes comprehensive error handling for common web scraping issues:

  • Network timeouts - Automatic retry with exponential backoff
  • HTTP errors - Proper handling of 4xx/5xx status codes
  • Invalid URLs - URL validation and normalization
  • Content extraction failures - Graceful fallback to raw HTML
  • Encoding issues - Automatic encoding detection and handling
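
The retry-with-exponential-backoff pattern described above can be sketched in a few lines of standard-library Python. This illustrates the technique only, not webclip's actual code; the doubling schedule (delay, 2*delay, 4*delay, ...) is an assumption about how --delay and --retries interact:

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on failure with exponential backoff.
    Sketch of the pattern only, not webclip's actual implementation."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            time.sleep(delay * 2 ** attempt)  # delay, 2*delay, 4*delay, ...

# Example: a flaky fetcher that fails twice, then succeeds
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"

print(fetch_with_retries(flaky, "https://example.com", delay=0.01))
# → content of https://example.com
```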

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.
