A tool to fetch the main content of a webpage and convert it to Markdown or plain text.

Project description

webclipper

webclipper is a simple Python tool to fetch the main content of a webpage and convert it into clean, readable Markdown or plain text. It removes clutter like ads, headers, and navigation bars, letting you focus on the article's text.

It can be used as a command-line application for quick conversions in your terminal or as a library in your own Python projects.

Features

  • Content Extraction: Uses readability to identify and extract the primary article or content from a URL.
  • Dual Output: Converts cleaned HTML to either Markdown or plain text.
  • Flexible Usage: Works as both a standalone command-line tool and an importable Python library.
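
To make the "cleaned HTML to plain text" step concrete, here is a minimal sketch that uses only Python's standard library. It is not webclipper's actual implementation, which relies on readability for extraction. The names TextExtractor and html_to_text are hypothetical, and the sets of skipped and block-level tags are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose content is treated as clutter in this sketch (an assumption).
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Tags whose closing marks the end of a block of text (an assumption).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text while ignoring content inside skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            # End of a block element: start a new line.
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Strip each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Convert an HTML string to plain text, skipping clutter tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<nav>Menu</nav><p>Hello <b>world</b>.</p>"))
    # prints: Hello world.
```

A tag blocklist like this is far cruder than readability's scoring of candidate content blocks, so treat it only as an illustration of the idea.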

Installation

To install webclipper, you can clone the repository and install it using pip.


# Clone the repository (if you haven't already)
git clone https://github.com/your-username/webclipper.git
cd webclipper

# Install the package in editable mode
# (Your changes to the source code will be reflected immediately)
pip install -e .

This will install the package and its dependencies, and also make the webclipper command available in your terminal.
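
If you want to confirm that the installation worked, a short check along these lines may help. It is only a sketch: the helper name check_install is hypothetical, and it assumes the importable package and the console command are both named webclipper, as this README describes.

```python
import importlib.util
import shutil


def check_install(package="webclipper", command="webclipper"):
    """Report whether the package is importable and the command is on PATH."""
    return {
        "importable": importlib.util.find_spec(package) is not None,
        "on_path": shutil.which(command) is not None,
    }


if __name__ == "__main__":
    status = check_install()
    print(f"Package importable: {status['importable']}")
    print(f"Command on PATH:    {status['on_path']}")
```

If the package is importable but the command is not found, the scripts directory of your Python environment is probably missing from PATH.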

How to Use

As a Command-Line App

Once installed, you can use the webclipper command directly from your terminal. The output is sent to standard output, so you can easily redirect it to a file.

Basic Usage (get plain text):


webclipper "https://en.wikipedia.org/wiki/Python_(programming_language)"

Get Markdown Output:

Use the -m or --markdown flag.


webclipper "https://www.some-article-url.com" --markdown

Include the Source URL:

Use the -i or --include-url flag to append the source URL at the end of the output.


webclipper "https://www.some-article-url.com" -m -i

Redirect to a File:

You can save the output using standard shell redirection.


webclipper "https://www.some-article-url.com" > my_article.txt
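
For readers curious how the flags described in this section could be wired together, here is a rough sketch using argparse. It is not webclipper's actual source: the function names build_parser, render, and main are hypothetical, and the "Source:" label used when appending the URL is an assumption. Only the flag names and the get_url_content call come from this README.

```python
import argparse
import sys


def build_parser():
    """Build a parser for the flags documented in this README."""
    parser = argparse.ArgumentParser(
        prog="webclipper",
        description="Fetch the main content of a webpage as Markdown or plain text.",
    )
    parser.add_argument("url", help="URL of the page to clip")
    parser.add_argument("-m", "--markdown", action="store_true",
                        help="output Markdown instead of plain text")
    parser.add_argument("-i", "--include-url", action="store_true",
                        help="append the source URL at the end of the output")
    return parser


def render(content, url, include_url):
    """Append the source URL when requested (the 'Source:' label is assumed)."""
    if include_url:
        return f"{content.rstrip()}\n\nSource: {url}\n"
    return content


def main(argv=None):
    args = build_parser().parse_args(argv)
    output_format = "markdown" if args.markdown else "text"
    # Imported here so the parsing helpers above work without the package.
    from webclipper import get_url_content
    content = get_url_content(args.url, output_format=output_format)
    # Write to standard output so shell redirection keeps working.
    sys.stdout.write(render(content, args.url, args.include_url))
```

Keeping parsing and rendering in separate functions makes each piece easy to test without touching the network.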

As a Library

You can also import webclipper into your own Python scripts to integrate its functionality. The get_url_content function is all you need.

from webclipper import get_url_content

# The URL of the article you want to clip
article_url = "https://en.wikipedia.org/wiki/Web_scraping"

try:
    # Get the content as Markdown
    markdown_content = get_url_content(article_url, output_format='markdown')
    print("--- MARKDOWN ---")
    print(markdown_content)

    # Get the content as plain text
    text_content = get_url_content(article_url, output_format='text')
    print("\n--- PLAIN TEXT ---")
    print(text_content)

except Exception as e:
    print(f"An error occurred: {e}")
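
Building on the library usage shown here, a small helper can save a clipped article straight to disk. This is a sketch under stated assumptions: save_clip and its fetch parameter are hypothetical names introduced for illustration, and the helper assumes get_url_content returns a string, as the example above suggests.

```python
from pathlib import Path


def save_clip(url, dest, output_format="markdown", fetch=None):
    """Fetch `url` and write the result to `dest`; return the written path."""
    if fetch is None:
        # Imported lazily so the helper can be exercised with a stand-in fetcher.
        from webclipper import get_url_content
        fetch = get_url_content

    content = fetch(url, output_format=output_format)
    path = Path(dest)
    # Create missing parent directories before writing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```

Calling save_clip("https://en.wikipedia.org/wiki/Web_scraping", "clips/web_scraping.md") would then write the Markdown version into a clips directory. The optional fetch argument exists only so the helper can be tried without network access.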

License

This project is licensed under the MIT License. See the LICENSE file for details.

Download files

Source Distribution

webclipper-0.1.0.tar.gz (3.6 kB)

Uploaded Source

Built Distribution

webclipper-0.1.0-py3-none-any.whl (4.0 kB)

Uploaded Python 3

File details

Details for the file webclipper-0.1.0.tar.gz.

File metadata

  • Download URL: webclipper-0.1.0.tar.gz
  • Upload date:
  • Size: 3.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for webclipper-0.1.0.tar.gz

  • SHA256: d52fd1a91345c8e4e08b9a7b8ae48098991eb8e19267fb92b9dda71726d5d6fb
  • MD5: b5939ad8d41353ef30ab19e89b0f4112
  • BLAKE2b-256: d699e8ac7474c9c0ad1ad0b1b8d095e28c1db7dea02cb13484e717b6b8f9bb7f

File details

Details for the file webclipper-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: webclipper-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2

File hashes

Hashes for webclipper-0.1.0-py3-none-any.whl

  • SHA256: cb67010013e66d1f49189965d7b2a5094f89de8c325cab4a53c955fbb3db5841
  • MD5: 3a7929580ec29f4546c17c2dc868d05f
  • BLAKE2b-256: 6a39f398f7cd8b611ac64781eb5a528209e460e05e56ce293f9fd0914799b51a
