A Python tool to scrape content from deepwiki sites and convert it to Markdown format

These details have not been verified by PyPI

Project links

Homepage

Project description

Deepwiki to Markdown Converter

このドキュメントの日本語版は README_ja.md にあります。

A Python tool to scrape content from deepwiki sites and convert it to Markdown format.

Features

Scrapes content from deepwiki sites
Extracts navigation items from the specified UI elements
Converts HTML content to Markdown format
Saves the converted files in an organized directory structure
Supports scraping multiple libraries
Supports static page scraping with requests
Offers direct scraping methods for improved reliability and direct Markdown fetching

Requirements

Python 3.6 or higher
Required Python packages:
- requests
- beautifulsoup4
- argparse
- markdownify

Installation

Option 1: Install from PyPI

pip install deepwiki-to-md

Option 2: Install from source

Clone this repository:

git clone https://github.com/yourusername/deepwiki_to_md.git
cd deepwiki_to_md

Install the package in development mode:
```
pip install -e .
```

Usage

Basic Usage

If installed from PyPI, you can use the command-line tool:

deepwiki-to-md "https://deepwiki.com/library_path"

Or with explicit parameters:

deepwiki-to-md --library "library_name" "https://deepwiki.example.com/library_path"

If installed from source, you can run the script directly:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/library_path"

Or with explicit parameters:

python -m deepwiki_to_md.run_scraper --library "library_name" "https://deepwiki.example.com/library_path"

Note: The output directory will be created in the current working directory where the command is executed, not in the package installation directory.

Using the Python API

You can also use the DeepwikiScraper class directly in your Python code. See example.py for a complete example:

from deepwiki_to_md import DeepwikiScraper
from deepwiki_to_md.direct_scraper import DirectDeepwikiScraper
from deepwiki_to_md.direct_md_scraper import DirectMarkdownScraper

# Create a scraper instance (default uses DirectMarkdownScraper)
scraper = DeepwikiScraper(output_dir="MyDocuments")

# Scrape a library using the default (DirectMarkdownScraper)
scraper.scrape_library("python", "https://deepwiki.com/python")

# Create another scraper with a different output directory
other_scraper = DeepwikiScraper(output_dir="OtherDocuments")

# Scrape another library
other_scraper.scrape_library("javascript", "https://deepwiki.example.com/javascript")

# --- Using DirectDeepwikiScraper (HTML to Markdown) ---
# Create a scraper instance explicitly using DirectDeepwikiScraper
html_scraper = DeepwikiScraper(
    output_dir="HtmlScrapedDocuments",
    use_direct_scraper=True,  # Enable DirectDeepwikiScraper
    use_alternative_scraper=False,
    use_direct_md_scraper=False
)
html_scraper.scrape_library("go", "https://deepwiki.com/go")

# --- Using DirectMarkdownScraper (Direct Markdown Fetching) ---
# Create a scraper instance explicitly using DirectMarkdownScraper
md_scraper = DeepwikiScraper(
    output_dir="DirectMarkdownDocuments",
    use_direct_scraper=False,
    use_alternative_scraper=False,
    use_direct_md_scraper=True  # Enable DirectMarkdownScraper (this is the default)
)
md_scraper.scrape_library("rust", "https://deepwiki.com/rust")

# --- Using the individual direct scrapers --- 
# Create a DirectDeepwikiScraper instance (HTML to Markdown)
direct_html_scraper = DirectDeepwikiScraper(output_dir="DirectHtmlScraped")

# Scrape a specific page directly (HTML to Markdown)
direct_html_scraper.scrape_page(
    "https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
    "python_bytecode",
    save_html=True  # Optionally save the original HTML
)

# Create a DirectMarkdownScraper instance (Direct Markdown Fetching)
direct_md_scraper = DirectMarkdownScraper(output_dir="DirectMarkdownFetched")

# Scrape a specific page directly as Markdown
direct_md_scraper.scrape_page(
   "https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
   "python_bytecode"
)

# You can also use the run method for multiple direct scrapes (for DirectDeepwikiScraper)
# direct_html_results = direct_html_scraper.run([
#     {"name": "page1", "url": "url1"},
#     {"name": "page2", "url": "url2"}
# ])

# You can also use the run method for multiple direct scrapes (for DirectMarkdownScraper)
# direct_md_results = direct_md_scraper.run([
#     {"name": "page1", "url": "url1"},
#     {"name": "page2", "url": "url2"}
# ])

Run the example script:

python example.py

Command-line Arguments

library_url: URL of the library to scrape (can be provided as a positional argument).
--library, -l: Library name and URL to scrape. Can be specified multiple times for different libraries. Format: --library NAME URL.
--output-dir, -o: Output directory for Markdown files (default: Documents).
--use-direct-scraper: Use DirectDeepwikiScraper (HTML to Markdown conversion). Overrides --use-direct-md-scraper if both are specified.
--no-direct-scraper: Disable DirectDeepwikiScraper.
--use-alternative-scraper: Use the scrape_deepwiki function from direct_scraper.py as a fallback if the primary method fails (default: True).
--no-alternative-scraper: Disable the alternative scraper fallback.
--use-direct-md-scraper: Use DirectMarkdownScraper (fetches Markdown directly). This is the default behavior if no scraper type is explicitly specified.
--no-direct-md-scraper: Disable DirectMarkdownScraper.

Scraper Priority:

If --use-direct-scraper is specified, DirectDeepwikiScraper (HTML to Markdown) is used.
If --use-direct-md-scraper is specified (and --use-direct-scraper is not), DirectMarkdownScraper (Direct Markdown) is used.
If neither --use-direct-scraper nor --use-direct-md-scraper is specified, DirectMarkdownScraper (Direct Markdown) is used by default.
The --use-alternative-scraper flag controls a fallback mechanism within the chosen primary scraper.

Examples

Simplified usage (uses DirectMarkdownScraper by default):

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python"

Scrape a single library with explicit parameters (uses DirectMarkdownScraper by default):

python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.example.com/python"

Scrape multiple libraries (uses DirectMarkdownScraper by default):

python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.example.com/python" --library "javascript" "https://deepwiki.example.com/javascript"

Specify a custom output directory:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --output-dir "MyDocuments"

Explicitly use DirectMarkdownScraper (Direct Markdown):

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --use-direct-md-scraper

Explicitly use DirectDeepwikiScraper (HTML to Markdown):

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --use-direct-scraper

Disable the alternative scraper fallback:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --no-alternative-scraper

Use DirectDeepwikiScraper and disable the alternative fallback:

python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --use-direct-scraper --no-alternative-scraper

Output Structure

The converted Markdown files will be saved in the following directory structure:

<output_dir>/
├── <library_name1>/
│   └── md/
│       ├── <page_name1>.md
│       ├── <page_name2>.md
│       └── ...
├── <library_name2>/
│   └── md/
│       ├── <page_name1>.md
│       ├── <page_name2>.md
│       └── ...
└── ...

<output_dir> is the directory specified by --output-dir (default: Documents).
<library_name> is the name provided for the library.
Each page from the Deepwiki site is saved as a separate .md file within the md subdirectory.

How It Works

The tool offers different scraping strategies:

1. Direct Markdown Scraping (`DirectMarkdownScraper` - Default)

Priority: Highest (used by default).
Method: Connects to the Deepwiki site using specialized headers optimized for fetching raw Markdown content directly from the server's internal API or data structures.
Process:
1. Sends requests designed to retrieve Markdown data.
2. Parses the response (often JSON or a specific text format) to extract the Markdown content.
3. Cleans up the extracted Markdown (removes potential artifacts like metadata or script tags).
4. Saves the cleaned Markdown content directly to files.
Advantage: Highest fidelity Markdown, preserves original formatting, avoids HTML conversion errors.

2. Direct HTML Scraping (`DirectDeepwikiScraper`)

Priority: Medium (used if --use-direct-scraper is specified).
Method: Connects to the Deepwiki site using headers that mimic a browser request to fetch the rendered HTML page.
Process:
1. Fetches the full HTML of the page.
2. Uses BeautifulSoup to parse the HTML.
3. Identifies the main content area using various CSS selectors.
4. Uses markdownify library to convert the selected HTML content block to Markdown.
5. Saves the converted Markdown.
Advantage: More robust than basic static scraping if direct Markdown fetching fails or is unavailable.
Disadvantage: Relies on HTML structure and conversion quality.

3. Alternative Scraper Fallback (`scrape_deepwiki` from `direct_scraper.py`)

Priority: Lowest (used as a fallback within the primary scraper if --use-alternative-scraper is enabled, which is the default).
Method: A simpler static request mechanism, potentially used if the main methods encounter issues (e.g., complex navigation or unexpected page structure).
Process: Fetches HTML and attempts basic content extraction.

Navigation and Hierarchy

Both DirectMarkdownScraper and DirectDeepwikiScraper attempt to identify navigation links (like a table of contents or sidebar) within the fetched content (either Markdown or HTML).
They recursively follow these links to scrape the entire library structure.

Error Handling

The tool includes robust error handling:

Validates domains before scraping.
Checks domain reachability.
Provides clear error messages.
Implements retry mechanisms with exponential backoff for transient network errors.
Falls back to the alternative scraper if configured and the primary method fails.

Customization

You can modify the Python scripts (deepwiki_to_md/deepwiki_to_md.py, deepwiki_to_md/direct_scraper.py, deepwiki_to_md/direct_md_scraper.py) to customize:

HTML selectors used for content extraction (in DirectDeepwikiScraper).
Markdown parsing/cleaning logic (in DirectMarkdownScraper).
HTML to Markdown conversion options (markdownify settings).
Output file naming conventions.
Request headers and delays.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

2.0.3

Oct 1, 2025

2.0.2

Sep 30, 2025

2.0.1

Sep 30, 2025

2.0.0

Sep 30, 2025

0.3.2

May 6, 2025

0.3.1

May 6, 2025

0.3.0

May 5, 2025

0.2.0

May 5, 2025

This version

0.1.0

May 2, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

deepwiki_to_md-0.1.0.tar.gz (26.1 kB view details)

Uploaded May 2, 2025 Source

File details

Details for the file deepwiki_to_md-0.1.0.tar.gz.

File metadata

Download URL: deepwiki_to_md-0.1.0.tar.gz
Upload date: May 2, 2025
Size: 26.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.7

File hashes

Hashes for deepwiki_to_md-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d08ef9b78f159481eff8ed7684c28b03125bf0a4e3f8ab638c81e19cd9f34161`
MD5	`c4c5fe834aa6ba871baba6bd7d3f8238`
BLAKE2b-256	`4b2adc608a1cf182a5b636543b80f653cf8fc1ec0c25ed210d30de87c38f2655`

See more details on using hashes here.

deepwiki-to-md 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Deepwiki to Markdown Converter

Features

Requirements

Installation

Option 1: Install from PyPI

Option 2: Install from source

Usage

Basic Usage

Using the Python API

Command-line Arguments

Examples

Output Structure

How It Works

1. Direct Markdown Scraping (DirectMarkdownScraper - Default)

2. Direct HTML Scraping (DirectDeepwikiScraper)

3. Alternative Scraper Fallback (scrape_deepwiki from direct_scraper.py)

Navigation and Hierarchy

Error Handling

Customization

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes

1. Direct Markdown Scraping (`DirectMarkdownScraper` - Default)

2. Direct HTML Scraping (`DirectDeepwikiScraper`)

3. Alternative Scraper Fallback (`scrape_deepwiki` from `direct_scraper.py`)