Scrape.md is a command-line interface for scraping website content and converting it to Markdown.

Scrape.md

Introduction

Scrape.md is a Python package that scrapes the text of a web page and uses OpenAI's API to generate a Markdown transcription of the content. The script runs a four-stage iterative process so that the resulting Markdown file closely matches the original content.

Requirements

  • Python 3.6+

  • OpenAI API Key set as an environment variable (OPENAI_API_KEY)

    • This can be done with the following command:
    export OPENAI_API_KEY="your-key-here"
    
    • You can also add this to your .bashrc or .zshrc file to make it permanent.
  • Optional: Set the SCRAPE_ARCHIVE_PATH environment variable to specify a directory where the generated Markdown files will be saved. I figure those who want to use this with Obsidian or another note-taking app might want to keep all their transcriptions in one directory. That's what I'm doing, at least.

    • Set it using the following command:
    export SCRAPE_ARCHIVE_PATH="/path/to/your/archive"
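
Before running the tool, it can help to confirm that the variables above are visible to Python. The snippet below is a minimal sketch, not part of Scrape.md itself: the helper name missing_settings is hypothetical, and only the two variable names come from this README.

```python
import os


def missing_settings(env):
    """Return the names of required variables that are unset or empty in `env`."""
    required = ["OPENAI_API_KEY"]  # the only variable this README lists as required
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing required variables:", ", ".join(missing))
    else:
        print("OPENAI_API_KEY is set.")
    # SCRAPE_ARCHIVE_PATH is optional, so it is reported but never required.
    print("Archive path:", os.environ.get("SCRAPE_ARCHIVE_PATH", "(not set)"))
```

Passing the environment in as a plain mapping keeps the check easy to try with a test dictionary instead of the real os.environ.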
    

Installation

First, clone the repository:

git clone https://github.com/bobbyhiddn/Scrape.md.git

Then you can either install the package using pip:

pip install .

Or install the package using the setup.py file:

python setup.py install

Usage

To use the package, run the following command:

scrape_md https://www.example.com

By default, this will create a Markdown file in the current directory with a filename based on the content of the website you are scraping.

If the SCRAPE_ARCHIVE_PATH environment variable is set, the script prompts you to choose whether to save the file in that directory ($SCRAPE_ARCHIVE_PATH) or in the current working directory.
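
The save-location rule described above can be sketched in a few lines. This is an illustration under assumptions rather than the package's actual code: choose_output_dir and its use_archive flag are hypothetical names, and the flag stands in for the answer given at the prompt.

```python
from pathlib import Path


def choose_output_dir(env, use_archive):
    """Pick the directory for the generated Markdown file.

    `env` is a mapping such as os.environ; `use_archive` stands in for the
    user's answer to the save-location prompt (hypothetical parameter).
    """
    archive = env.get("SCRAPE_ARCHIVE_PATH")
    if archive and use_archive:
        return Path(archive).expanduser()
    # No archive path configured, or the user declined it.
    return Path.cwd()
```

For example, choose_output_dir(os.environ, use_archive=False) falls back to the current working directory even when the variable is set.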

Four-Stage Iterative Process

The script now employs a four-stage process to enhance the quality of the transcribed content:

  1. First Draft Generation: Creates an initial Markdown version of the website's content.
  2. Review Stage: An AI assistant reviews the first draft for discrepancies and provides detailed feedback.
  3. Improvement Stage: The first draft is improved based on the AI's feedback to better match the original content.
  4. Final Review: A final AI review is performed to ensure the quality of the transcription. The review is displayed in the CLI for your reference.

This process is designed to produce a Markdown file that is clean, accurate, and closely reflects the original content, with all important details preserved.
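
The four stages above can be sketched as a chain of model calls in which each stage receives the output of the previous one. This is a simplified illustration, not the package's implementation: the function name, the prompt wording, and the ask callable (any function that sends a prompt to a model and returns its reply) are all assumptions.

```python
from typing import Callable, Tuple


def four_stage_transcription(page_text: str, ask: Callable[[str], str]) -> Tuple[str, str]:
    """Run draft -> review -> improve -> final review.

    Returns (improved_markdown, final_review). `ask` is a hypothetical
    callable that wraps whatever model API is in use.
    """
    # Stage 1: first draft of the page as Markdown.
    draft = ask("Convert this page text to Markdown:\n\n" + page_text)

    # Stage 2: review the draft against the original text.
    feedback = ask(
        "List discrepancies between the original and the draft.\n\n"
        "Original:\n" + page_text + "\n\nDraft:\n" + draft
    )

    # Stage 3: improve the draft using the review feedback.
    improved = ask(
        "Revise the draft so it matches the original, using the feedback.\n\n"
        "Original:\n" + page_text + "\n\nDraft:\n" + draft
        + "\n\nFeedback:\n" + feedback
    )

    # Stage 4: final review of the improved version, shown to the user.
    final_review = ask(
        "Review this Markdown against the original.\n\n"
        "Original:\n" + page_text + "\n\nMarkdown:\n" + improved
    )
    return improved, final_review
```

Keeping the model call behind a single callable makes the flow easy to exercise with a stub before spending API credits.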

Example

scrape_md https://www.greenmatters.com/news/new-species-2024

Output:

Fetching content.
Title: New Species 2024: The Animals and Plant Species Revealed This Year
Generating the first draft.
Reviewing the first draft.
Improving the Markdown content.
Generating a suitable filename.
Content saved to new_species_discoveries_2024.md
Final AI Review:
[Detailed review output]

The generated file new_species_discoveries_2024.md will contain a high-quality transcription of the webpage content.
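
How Scrape.md arrives at the filename is not documented here. As a rough sketch under that caveat, a suggested title can be turned into a safe Markdown filename as shown below; the helper name to_markdown_filename and the "untitled" fallback are hypothetical.

```python
import re


def to_markdown_filename(suggestion):
    """Turn a suggested title into a lowercase, underscore-separated .md filename."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", suggestion.lower()).strip("_")
    # Fall back to a placeholder when nothing usable is left.
    return (slug or "untitled") + ".md"


print(to_markdown_filename("New Species Discoveries 2024"))
# prints "new_species_discoveries_2024.md"
```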

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Download files

Source Distribution

scrape_md-1.0.0.tar.gz (5.3 kB, Source)

Built Distribution

scrape_md-1.0.0-py3-none-any.whl (5.7 kB, Python 3)

File details

Details for the file scrape_md-1.0.0.tar.gz.

File metadata

  • Download URL: scrape_md-1.0.0.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for scrape_md-1.0.0.tar.gz

Algorithm    Hash digest
SHA256       9648bc9e6b0e174373190fdcc576581ee890fb5158fc29c38cf3dca8ad633ebc
MD5          6118e32e11283ee083829c50b3c0fbf1
BLAKE2b-256  a9132d0a90b8dd7582acd84f892722e189566941e693fd420d87205f6db474b8

File details

Details for the file scrape_md-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: scrape_md-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 5.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.9.20

File hashes

Hashes for scrape_md-1.0.0-py3-none-any.whl

Algorithm    Hash digest
SHA256       7488436db8b1688c2fdd8644b5007cd89e26ada39e92dd240ef8b13af61299a6
MD5          3bb960fbddb887406e24499eea261970
BLAKE2b-256  4d7dc3f87b55b718ba6e8b214743803cc6c4c2dde51f318465867c3fff83180c
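
A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. The snippet is a minimal sketch: the helper names sha256_of and matches_published_hash are illustrative, and the local path passed in is whatever location the file was saved to.

```python
import hashlib

# SHA256 digests copied from the file details above.
EXPECTED_SHA256 = {
    "scrape_md-1.0.0.tar.gz":
        "9648bc9e6b0e174373190fdcc576581ee890fb5158fc29c38cf3dca8ad633ebc",
    "scrape_md-1.0.0-py3-none-any.whl":
        "7488436db8b1688c2fdd8644b5007cd89e26ada39e92dd240ef8b13af61299a6",
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, filename):
    """Check the file at `path` against the digest listed for `filename`."""
    return sha256_of(path) == EXPECTED_SHA256[filename]
```

For example, matches_published_hash("scrape_md-1.0.0.tar.gz", "scrape_md-1.0.0.tar.gz") returns True only if the file in the current directory is byte-for-byte identical to the published archive.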
