
A tool to fetch the main content of a webpage and convert it to Markdown or plain text.


webclipper

webclipper is a simple Python tool to fetch the main content of a webpage and convert it into clean, readable Markdown or plain text. It removes clutter like ads, headers, and navigation bars, letting you focus on the article's text.

It can be used as a command-line application for quick conversions in your terminal or as a library in your own Python projects.

Features

  • Content Extraction: Uses readability to identify and extract the primary article or content from a URL.
  • Dual Output: Converts the cleaned HTML to either Markdown or plain text.
  • Flexible Usage: Works as both a standalone command-line tool and an importable Python library.
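
To make the extract-then-convert idea concrete, here is a minimal sketch that uses only the standard library. It is not webclipper's implementation: the `TextExtractor` class, the `html_to_text` function, and the list of skipped tags are all hypothetical, and a real extractor such as readability scores content rather than relying on a fixed tag list.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold page clutter."""

    # Hypothetical skip list, chosen for illustration only.
    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


sample = """
<html><body>
  <nav>Home | About</nav>
  <article><h1>Title</h1><p>First paragraph.</p></article>
  <footer>Copyright</footer>
</body></html>
"""
print(html_to_text(sample))
```

The navigation bar and footer are dropped and only the article text is printed, which is roughly the effect the features above describe.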

Installation

To install webclipper, you can clone the repository and install it using pip.


# Clone the repository (if you haven't already)
git clone https://github.com/your-username/webclipper.git
cd webclipper

# Install the package in editable mode
# (Your changes to the source code will be reflected immediately)
pip install -e .

This will install the package and its dependencies, and also make the webclipper command available in your terminal.

How to Use

As a Command-Line App

Once installed, you can use the webclipper command directly from your terminal. The output is sent to standard output, so you can easily redirect it to a file.

Basic Usage (get plain text):


webclipper "https://en.wikipedia.org/wiki/Python_(programming_language)"

Get Markdown Output:

Use the -m or --markdown flag.


webclipper "https://www.some-article-url.com" --markdown

Include the Source URL:

Use the -i or --include-url flag to append the source URL at the end of the output.


webclipper "https://www.some-article-url.com" -m -i

Redirect to a File:

You can save the output using standard shell redirection.


webclipper "https://www.some-article-url.com" > my_article.txt
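
If you want to drive the command from a script, the flags above can be assembled programmatically. The following is a sketch under assumptions: `build_webclipper_command` is a hypothetical helper, not part of webclipper, and it uses only the `--markdown` and `--include-url` flags documented above.

```python
import shlex


def build_webclipper_command(url, markdown=False, include_url=False):
    """Return the argument list for a webclipper call (hypothetical helper)."""
    command = ["webclipper", url]
    if markdown:
        command.append("--markdown")
    if include_url:
        command.append("--include-url")
    return command


command = build_webclipper_command(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    markdown=True,
)
# shlex.join quotes the URL so its parentheses are safe to paste into a shell.
print(shlex.join(command))
```

Passing the list form to `subprocess.run(command, capture_output=True, text=True)` avoids shell quoting issues entirely; the captured standard output can then be written to a file instead of using shell redirection.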

As a Library

You can also import webclipper into your own Python scripts to integrate its functionality. The get_url_content function is all you need.

from webclipper import get_url_content

# The URL of the article you want to clip
article_url = "https://en.wikipedia.org/wiki/Web_scraping"

try:
    # Get the content as Markdown
    markdown_content = get_url_content(article_url, output_format='markdown')
    print("--- MARKDOWN ---")
    print(markdown_content)

    # Get the content as plain text
    text_content = get_url_content(article_url, output_format='text')
    print("\n--- PLAIN TEXT ---")
    print(text_content)

except Exception as e:
    print(f"An error occurred: {e}")
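
A common next step is saving the clipped content to a file. The sketch below is one possible wrapper, assuming that `get_url_content` returns a string as the example above suggests. The `save_clip` helper and its `fetch` parameter are hypothetical; `fetch` exists so the helper can be exercised with a stand-in function, without network access or the package installed.

```python
import tempfile
from pathlib import Path


def save_clip(url, path, output_format="markdown", fetch=None):
    """Fetch a page and write it to a file (hypothetical helper)."""
    if fetch is None:
        # Imported lazily so the helper can be tested without the package.
        from webclipper import get_url_content as fetch
    content = fetch(url, output_format=output_format)
    destination = Path(path)
    destination.write_text(content, encoding="utf-8")
    return destination


def fake_fetch(url, output_format="markdown"):
    """Stand-in for get_url_content, used only for this demonstration."""
    return f"# Clipped from {url} as {output_format}\n"


with tempfile.TemporaryDirectory() as tmp:
    saved = save_clip(
        "https://example.com/article",
        Path(tmp) / "article.md",
        fetch=fake_fetch,
    )
    print(saved.read_text(encoding="utf-8"))
```

In real use you would omit the `fetch` argument so that the helper calls `get_url_content` itself.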

License

This project is licensed under the MIT License. See the LICENSE file for details.

