A web content extractor and archiver for Simplenote

Project description

Web Content Extractor and Archiver

This project is a web content extractor and archiver. It fetches content from Simplenote URLs only, processes it, and saves it as Markdown files organized by date. Start the script with a Simplenote URL in your clipboard; it fetches the content from that URL and saves it as a Markdown file in the output directory.
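
The two constraints described above (Simplenote URLs only, output organized by date) could be sketched roughly as follows. This is an illustration, not the project's actual code: the host list, the function names, and the `YYYYMMDD` folder layout are all assumptions.

```python
from datetime import date
from pathlib import Path
from urllib.parse import urlparse

# Assumption: hosts that serve published Simplenote notes.
SIMPLENOTE_HOSTS = {"app.simplenote.com", "simp.ly"}


def is_simplenote_url(text: str) -> bool:
    """Return True if text looks like an http(s) URL on a Simplenote host."""
    parsed = urlparse(text.strip())
    return parsed.scheme in {"http", "https"} and parsed.hostname in SIMPLENOTE_HOSTS


def dated_output_path(base: Path, name: str, day: date) -> Path:
    """Build base/YYYYMMDD/name.md (hypothetical layout, not confirmed by the project)."""
    return base / day.strftime("%Y%m%d") / f"{name}.md"
```

A clipboard value would be passed through `is_simplenote_url` before any fetch is attempted, and `dated_output_path` would decide where the resulting Markdown file lands.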

Project Structure

web-content-extractor
├── src
│   ├── simex.py          # Main script to orchestrate fetching and processing
│   ├── fetcher.py        # Functions for fetching content from URLs
│   ├── processor.py      # Functions for processing fetched HTML content
│   ├── archiver.py       # Manages saving processed content as Markdown files
│   ├── imp_clip.py       # Functions for updating sources from clipboard
│   └── utils
│       └── __init__.py   # Utility functions shared across modules
├── requirements.txt      # Project dependencies
├── config.yaml           # Configuration settings for sources and paths
├── output                # Directory for saved Markdown files
└── README.md             # Project documentation
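
The division of labor between fetcher.py, processor.py, and archiver.py might look something like the sketch below. It uses only the standard library, and every function name here is hypothetical; the real modules may use different names and libraries, and may produce real Markdown rather than the plain text extracted here.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


def fetch_html(url: str) -> str:
    """Fetcher step: download a page and decode it as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def process_html(html: str) -> str:
    """Processor step: reduce HTML to text blocks separated by blank lines."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n\n".join(collector.parts)


def archive(text: str, path: Path) -> Path:
    """Archiver step: write the text to path, creating parent folders."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text + "\n", encoding="utf-8")
    return path
```

Under these assumptions, the main script would chain the three steps: `archive(process_html(fetch_html(url)), target_path)`.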

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd web-content-extractor
    
  2. Install the required dependencies:

    pip install -r requirements.txt
    

Usage

  1. Configure the config.yaml file to specify the URLs and output filenames.
  2. Run the main script:
    python src/simex.py
    

Customization

  • Modify the SOURCES dictionary in config.yaml to add or change the URLs you want to fetch content from.
  • Adjust the BASE_PATH in config.yaml to change where the Markdown files are saved.
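
To make the roles of SOURCES and BASE_PATH concrete, here is a small sketch of how the two settings could combine into output paths. The dictionary stands in for the parsed contents of config.yaml; its exact shape (output name mapped to URL), the example URL, and the function name are assumptions, since the actual schema is not shown here. Date-based subfolders are left out for brevity.

```python
from pathlib import Path

# Hypothetical parsed contents of config.yaml (the real schema may differ).
EXAMPLE_CONFIG = {
    "BASE_PATH": "./output",
    "SOURCES": {
        # output filename (without extension) -> URL to fetch; placeholder URL
        "my-note": "https://app.simplenote.com/p/EXAMPLE",
    },
}


def plan_outputs(config: dict) -> dict[str, Path]:
    """Map each source URL to the Markdown file it would be saved as."""
    base = Path(config["BASE_PATH"])
    return {url: base / f"{name}.md" for name, url in config["SOURCES"].items()}
```

Changing BASE_PATH moves every planned file at once, while adding an entry to SOURCES adds one more URL-to-file pair.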

Contributing

Feel free to submit issues or pull requests for improvements or additional features.


Download files

Source Distribution

simexp-0.1.2.tar.gz (2.4 kB)

Uploaded Source

Built Distribution

simexp-0.1.2-py3-none-any.whl (2.2 kB)

Uploaded Python 3

File details

Details for the file simexp-0.1.2.tar.gz.

File metadata

  • Download URL: simexp-0.1.2.tar.gz
  • Size: 2.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for simexp-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       d9320f00d4e5c218eaecdc099efbadf660af7b7db1c20e7fb882e38f03cabf26
MD5          ca0be5cc5c8839018601cae8c190642e
BLAKE2b-256  e2adf4e8369f9e4d659f3e1157a1ef6105cf52c709b7dc781b4220add8b49e29

File details

Details for the file simexp-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: simexp-0.1.2-py3-none-any.whl
  • Size: 2.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for simexp-0.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       8037ddcf7737e7361822a48d24323b1503c05181bcf66035e4e96f4b10e12a46
MD5          4d7b68a786f3f30dc98d9a16fe62f0dc
BLAKE2b-256  417b6ee7137751e5f9673668d7b5ff2a853c01612bdad399a1dc3df4d1a90ca4
