
Project description

Blog Link Extractor & Content Handler (BLECH)

BLECH is a tool that automatically identifies and extracts links to individual posts from a blog's main page, index, or feed. It then fetches and parses the content of each post, making it easier to process, analyze, or archive blog data.

Installation

From PyPI (Recommended)

# Install using pip
pip install blech

From Source

This project uses Poetry for dependency management. To install from source:

  1. Install Poetry if you haven't already:
curl -sSL https://install.python-poetry.org | python3 -
  2. Clone the repository and install dependencies:
git clone https://github.com/jkarenko/blog-link-extractor-content-handler
cd blog-link-extractor-content-handler
poetry install

Development

To run the development version:

poetry run blech [OPTIONS] <BASE_URL>

Publishing to PyPI

To publish this package to PyPI:

  1. Make sure you have the latest version of Poetry:
poetry self update
  2. Build the package:
poetry build
  3. Publish to PyPI (you'll need a PyPI account and API token):
# For the first time publishing
poetry publish --username __token__ --password your_api_token

# For subsequent updates
poetry publish --build

For more information on obtaining a PyPI API token, visit: https://pypi.org/help/#apitoken
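Rather than passing the token on the command line each time, Poetry can store it in its own configuration. A minimal sketch, where `your_api_token` is a placeholder for your actual token:

```shell
# Store the PyPI token once in Poetry's config
poetry config pypi-token.pypi your_api_token

# Subsequent publishes no longer need credentials on the command line
poetry publish --build
```

This avoids leaving the token in your shell history via `--password`.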

Usage

blech [OPTIONS] <BASE_URL>

Positional Arguments:

  • <BASE_URL>: (Required) The starting URL of the blog's main page, index, or feed where post links can be found.

Options:

  • -o, --output <FILENAME>: (Optional) The file where extracted content should be saved. If not provided, a default filename will be generated based on the blog's domain (e.g., example-blog.com_blog_posts.txt).
  • -l, --lang <LANG_CODE>: (Optional) Filter posts by language code (e.g., 'en', 'fi'). This primarily works when the blog uses a WordPress REST API that supports language filtering.
  • --one-file: (Optional) Save all blog posts to a single file instead of separate files. By default, each post is saved as a separate file in a directory named based on the output filename without the .txt extension (e.g., example-blog.com_blog_posts).
  • --max-pages <NUMBER>: (Optional) Maximum number of pages to fetch. Overrides the default limit (10 pages).
  • --start-page <NUMBER>: (Optional) Starting page number for scraping. Default is 1.
  • --end-page <NUMBER>: (Optional) Ending page number for scraping. Defaults to the value of --max-pages.
  • --posts-per-page <NUMBER>: (Optional) Number of posts to fetch per API request. Overrides the default value (20 posts).
  • -h, --help: (Optional) Show this help message and exit.

Example:

# Scrape English posts from the blog archive and save to a specific file
poetry run blech --output my_blog_extract.txt --lang en https://example-blog.com/archive

# Scrape all posts and use the default filename
poetry run blech https://another-blog.org/

# Scrape posts and save each one as a separate file in a directory (default behavior)
poetry run blech https://example-blog.com/

# Scrape posts, specify output directory name, and save as separate files
poetry run blech --output custom_dir_name.txt https://example-blog.com/

# Scrape posts and save all to a single file
poetry run blech --one-file https://example-blog.com/

# Scrape posts, specify output filename, and save to a single file
poetry run blech --output custom_file_name.txt --one-file https://example-blog.com/

# Scrape posts with pagination control (fetch up to 20 pages)
poetry run blech --max-pages 20 https://example-blog.com/

# Scrape posts starting from page 3
poetry run blech --start-page 3 https://example-blog.com/

# Scrape posts from page 2 to page 5 only
poetry run blech --start-page 2 --end-page 5 https://example-blog.com/

# Scrape posts with 50 posts per API request (instead of the default 20)
poetry run blech --posts-per-page 50 https://example-blog.com/
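The examples above can also be driven from a script. The helper below is a hypothetical convenience wrapper (not part of blech itself) that assembles a CLI invocation from keyword arguments, using only the flags documented above:

```python
import subprocess


def build_blech_command(base_url, output=None, lang=None, one_file=False,
                        max_pages=None, start_page=None, end_page=None,
                        posts_per_page=None):
    """Assemble a blech CLI invocation as an argument list.

    Only the documented flags are emitted; the base URL is the final
    positional argument, matching `blech [OPTIONS] <BASE_URL>`.
    """
    cmd = ["blech"]
    if output:
        cmd += ["--output", output]
    if lang:
        cmd += ["--lang", lang]
    if one_file:
        cmd.append("--one-file")
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if end_page is not None:
        cmd += ["--end-page", str(end_page)]
    if posts_per_page is not None:
        cmd += ["--posts-per-page", str(posts_per_page)]
    cmd.append(base_url)
    return cmd


# Example: scrape English posts, 20 pages max
# subprocess.run(build_blech_command("https://example-blog.com/",
#                                    lang="en", max_pages=20), check=True)
```

Passing an argument list (rather than a shell string) to `subprocess.run` avoids quoting issues with URLs.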



Download files

Download the file for your platform. If you're not sure which to choose, see the Python Packaging User Guide on installing packages.

Source Distribution

blech-0.3.0.tar.gz (17.7 kB)

Built Distribution

If you're not sure about the file name format, see the specification for wheel file names.

blech-0.3.0-py3-none-any.whl (18.9 kB)

File details

Details for the file blech-0.3.0.tar.gz.

File metadata

  • Download URL: blech-0.3.0.tar.gz
  • Upload date:
  • Size: 17.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Darwin/24.4.0

File hashes

Hashes for blech-0.3.0.tar.gz

  • SHA256: 8c8f7cced0e8dce00ae878a0803ab479cbe462ce31b7abb3df2d9d519acc6452
  • MD5: 59187260af34c61f3cc788e7bafee3c4
  • BLAKE2b-256: ae01aeb4c7b05dd7f28692f57827b8a84e557706bacf253678908328bc37c259

See the PyPI documentation for more details on using hashes.

File details

Details for the file blech-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: blech-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 18.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.2 CPython/3.13.2 Darwin/24.4.0

File hashes

Hashes for blech-0.3.0-py3-none-any.whl

  • SHA256: 77c3c90be6c5ce279832ede31b854e0022176f29e766562da5203ac762606b1b
  • MD5: 1b1c8ac995573c97630b1a5fbfbf4173
  • BLAKE2b-256: e8ff0e03603521a12a07293aca216097c8184470017dfecaafb8ea86c28197a1

See the PyPI documentation for more details on using hashes.
