Deepwiki to Markdown Converter
A Japanese version of this document is available in README_ja.md.
A Python tool to scrape content from deepwiki sites and convert it to Markdown format.
Features
- Scrapes content from deepwiki sites
- Extracts navigation items from the specified UI elements
- Converts HTML content to Markdown format
- Saves the converted files in an organized directory structure
- Supports scraping multiple libraries
- Supports static page scraping with requests
- Offers direct scraping methods, including direct Markdown fetching, for improved reliability
Requirements
- Python 3.6 or higher
- Required Python packages:
- requests
- beautifulsoup4
- argparse
- markdownify
Installation
Option 1: Install from PyPI
pip install deepwiki-to-md
Option 2: Install from source
1. Clone this repository:

   git clone https://github.com/yourusername/deepwiki_to_md.git
   cd deepwiki_to_md

2. Install the package in development mode:
pip install -e .
Usage
Basic Usage
If installed from PyPI, you can use the command-line tool:
deepwiki-to-md "https://deepwiki.com/library_path"
Or with explicit parameters:
deepwiki-to-md --library "library_name" "https://deepwiki.example.com/library_path"
If installed from source, you can run the script directly:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/library_path"
Or with explicit parameters:
python -m deepwiki_to_md.run_scraper --library "library_name" "https://deepwiki.example.com/library_path"
Note: The output directory will be created in the current working directory where the command is executed, not in the package installation directory.
Using the Python API
You can also use the DeepwikiScraper class directly in your Python code. See example.py for a complete example:
from deepwiki_to_md import DeepwikiScraper
from deepwiki_to_md.direct_scraper import DirectDeepwikiScraper
from deepwiki_to_md.direct_md_scraper import DirectMarkdownScraper
# Create a scraper instance (default uses DirectMarkdownScraper)
scraper = DeepwikiScraper(output_dir="MyDocuments")
# Scrape a library using the default (DirectMarkdownScraper)
scraper.scrape_library("python", "https://deepwiki.com/python")
# Create another scraper with a different output directory
other_scraper = DeepwikiScraper(output_dir="OtherDocuments")
# Scrape another library
other_scraper.scrape_library("javascript", "https://deepwiki.example.com/javascript")
# --- Using DirectDeepwikiScraper (HTML to Markdown) ---
# Create a scraper instance explicitly using DirectDeepwikiScraper
html_scraper = DeepwikiScraper(
output_dir="HtmlScrapedDocuments",
use_direct_scraper=True, # Enable DirectDeepwikiScraper
use_alternative_scraper=False,
use_direct_md_scraper=False
)
html_scraper.scrape_library("go", "https://deepwiki.com/go")
# --- Using DirectMarkdownScraper (Direct Markdown Fetching) ---
# Create a scraper instance explicitly using DirectMarkdownScraper
md_scraper = DeepwikiScraper(
output_dir="DirectMarkdownDocuments",
use_direct_scraper=False,
use_alternative_scraper=False,
use_direct_md_scraper=True # Enable DirectMarkdownScraper (this is the default)
)
md_scraper.scrape_library("rust", "https://deepwiki.com/rust")
# --- Using the individual direct scrapers ---
# Create a DirectDeepwikiScraper instance (HTML to Markdown)
direct_html_scraper = DirectDeepwikiScraper(output_dir="DirectHtmlScraped")
# Scrape a specific page directly (HTML to Markdown)
direct_html_scraper.scrape_page(
"https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
"python_bytecode",
save_html=True # Optionally save the original HTML
)
# Create a DirectMarkdownScraper instance (Direct Markdown Fetching)
direct_md_scraper = DirectMarkdownScraper(output_dir="DirectMarkdownFetched")
# Scrape a specific page directly as Markdown
direct_md_scraper.scrape_page(
"https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
"python_bytecode"
)
# You can also use the run method for multiple direct scrapes (for DirectDeepwikiScraper)
# direct_html_results = direct_html_scraper.run([
# {"name": "page1", "url": "url1"},
# {"name": "page2", "url": "url2"}
# ])
# You can also use the run method for multiple direct scrapes (for DirectMarkdownScraper)
# direct_md_results = direct_md_scraper.run([
# {"name": "page1", "url": "url1"},
# {"name": "page2", "url": "url2"}
# ])
Run the example script:
python example.py
Command-line Arguments
- library_url: URL of the library to scrape (can be provided as a positional argument).
- --library, -l: Library name and URL to scrape. Can be specified multiple times for different libraries. Format: --library NAME URL.
- --output-dir, -o: Output directory for Markdown files (default: Documents).
- --use-direct-scraper: Use DirectDeepwikiScraper (HTML to Markdown conversion). Overrides --use-direct-md-scraper if both are specified.
- --no-direct-scraper: Disable DirectDeepwikiScraper.
- --use-alternative-scraper: Use the scrape_deepwiki function from direct_scraper.py as a fallback if the primary method fails (default: True).
- --no-alternative-scraper: Disable the alternative scraper fallback.
- --use-direct-md-scraper: Use DirectMarkdownScraper (fetches Markdown directly). This is the default behavior if no scraper type is explicitly specified.
- --no-direct-md-scraper: Disable DirectMarkdownScraper.
Scraper Priority:
- If --use-direct-scraper is specified, DirectDeepwikiScraper (HTML to Markdown) is used.
- If --use-direct-md-scraper is specified (and --use-direct-scraper is not), DirectMarkdownScraper (Direct Markdown) is used.
- If neither --use-direct-scraper nor --use-direct-md-scraper is specified, DirectMarkdownScraper (Direct Markdown) is used by default.
- The --use-alternative-scraper flag controls a fallback mechanism within the chosen primary scraper.
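The priority rules above can be sketched as a small selection function. This is illustrative only; the names mirror the CLI flags, not the package's internal code:

```python
def resolve_scraper(use_direct_scraper=False, use_direct_md_scraper=False):
    """Return the name of the scraper the documented priority rules select."""
    if use_direct_scraper:
        # --use-direct-scraper wins even if --use-direct-md-scraper is also set
        return "DirectDeepwikiScraper"
    # --use-direct-md-scraper, or no flag at all, selects the default scraper
    return "DirectMarkdownScraper"
```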
Examples
1. Simplified usage (uses DirectMarkdownScraper by default):

   python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python"

2. Scrape a single library with explicit parameters (uses DirectMarkdownScraper by default):

   python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.example.com/python"

3. Scrape multiple libraries (uses DirectMarkdownScraper by default):

   python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.example.com/python" --library "javascript" "https://deepwiki.example.com/javascript"

4. Specify a custom output directory:

   python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --output-dir "MyDocuments"

5. Explicitly use DirectMarkdownScraper (Direct Markdown):

   python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --use-direct-md-scraper

6. Explicitly use DirectDeepwikiScraper (HTML to Markdown):

   python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --use-direct-scraper

7. Disable the alternative scraper fallback:

   python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --no-alternative-scraper

8. Use DirectDeepwikiScraper and disable the alternative fallback:

   python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python" --use-direct-scraper --no-alternative-scraper
Output Structure
The converted Markdown files will be saved in the following directory structure:
<output_dir>/
├── <library_name1>/
│ └── md/
│ ├── <page_name1>.md
│ ├── <page_name2>.md
│ └── ...
├── <library_name2>/
│ └── md/
│ ├── <page_name1>.md
│ ├── <page_name2>.md
│ └── ...
└── ...
- <output_dir> is the directory specified by --output-dir (default: Documents).
- <library_name> is the name provided for the library.
- Each page from the Deepwiki site is saved as a separate .md file within the md subdirectory.
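As a quick sanity check after a run, the layout above can be traversed with a few lines of Python. This is a small helper sketch, not part of the package API:

```python
from pathlib import Path

def list_markdown_files(output_dir="Documents"):
    """Collect every converted page matching <output_dir>/<library>/md/*.md."""
    root = Path(output_dir)
    # One glob pattern covers the whole two-level layout shown above
    return sorted(str(p) for p in root.glob("*/md/*.md"))
```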
How It Works
The tool offers different scraping strategies:
1. Direct Markdown Scraping (DirectMarkdownScraper - Default)
- Priority: Highest (used by default).
- Method: Connects to the Deepwiki site using specialized headers optimized for fetching raw Markdown content directly from the server's internal API or data structures.
- Process:
- Sends requests designed to retrieve Markdown data.
- Parses the response (often JSON or a specific text format) to extract the Markdown content.
- Cleans up the extracted Markdown (removes potential artifacts like metadata or script tags).
- Saves the cleaned Markdown content directly to files.
- Advantage: Highest fidelity Markdown, preserves original formatting, avoids HTML conversion errors.
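The exact response format of Deepwiki's internal endpoint is not documented here, so the parsing step can only be sketched. The snippet below assumes a JSON body with a hypothetical top-level "markdown" field, falling back to treating the response as plain Markdown:

```python
import json

def extract_markdown(response_text):
    """Hypothetical sketch of the parse step: try JSON first, else
    assume the body is already raw Markdown."""
    try:
        data = json.loads(response_text)
    except ValueError:
        return response_text  # not JSON: treat as plain Markdown
    # "markdown" is an assumed field name, not a documented API contract
    return data.get("markdown", "")
```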
2. Direct HTML Scraping (DirectDeepwikiScraper)
- Priority: Medium (used if --use-direct-scraper is specified).
- Method: Connects to the Deepwiki site using headers that mimic a browser request to fetch the rendered HTML page.
- Process:
- Fetches the full HTML of the page.
- Uses BeautifulSoup to parse the HTML.
- Identifies the main content area using various CSS selectors.
- Uses the markdownify library to convert the selected HTML content block to Markdown.
- Saves the converted Markdown.
- Advantage: More robust than basic static scraping if direct Markdown fetching fails or is unavailable.
- Disadvantage: Relies on HTML structure and conversion quality.
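The parse-and-select steps of that process can be sketched with BeautifulSoup. The selector list here is illustrative, not the package's actual list; the matched node would then be passed to markdownify for conversion:

```python
from bs4 import BeautifulSoup

# Candidate selectors, tried in order; the real list lives in the package
CONTENT_SELECTORS = ["article", "main", "div.content"]

def find_content(html):
    """Parse the page and return the first node matching a content selector,
    falling back to <body> (or the whole document) if none match."""
    soup = BeautifulSoup(html, "html.parser")
    for sel in CONTENT_SELECTORS:
        node = soup.select_one(sel)
        if node is not None:
            return node
    return soup.body or soup
```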
3. Alternative Scraper Fallback (scrape_deepwiki from direct_scraper.py)
- Priority: Lowest (used as a fallback within the primary scraper if --use-alternative-scraper is enabled, which is the default).
- Method: A simpler static request mechanism, potentially used if the main methods encounter issues (e.g., complex navigation or unexpected page structure).
- Process: Fetches HTML and attempts basic content extraction.
Navigation and Hierarchy
- Both DirectMarkdownScraper and DirectDeepwikiScraper attempt to identify navigation links (like a table of contents or sidebar) within the fetched content (either Markdown or HTML).
- They recursively follow these links to scrape the entire library structure.
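For the Markdown case, link identification can be sketched as a regular expression that pulls relative links (candidate sub-pages) out of the fetched text; the real scrapers apply more filtering than this:

```python
import re

# [link text](/relative/path) -- relative links are candidate sub-pages;
# absolute (external) URLs are deliberately not matched
MD_LINK = re.compile(r"\[([^\]]+)\]\((/[^)\s]+)\)")

def extract_nav_links(markdown_text):
    """Return (text, url) pairs for every relative Markdown link."""
    return MD_LINK.findall(markdown_text)
```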
Error Handling
The tool includes robust error handling:
- Validates domains before scraping.
- Checks domain reachability.
- Provides clear error messages.
- Implements retry mechanisms with exponential backoff for transient network errors.
- Falls back to the alternative scraper if configured and the primary method fails.
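The retry-with-backoff behavior can be sketched generically; this is a minimal illustration of the pattern, not the package's actual implementation:

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=1.0):
    """Call `fetch` (any zero-argument callable) up to `retries` times,
    sleeping base_delay * 2**attempt between failures (exponential backoff)."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # exhausted all attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))
```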
Customization
You can modify the Python scripts (deepwiki_to_md/deepwiki_to_md.py, deepwiki_to_md/direct_scraper.py,
deepwiki_to_md/direct_md_scraper.py) to customize:
- HTML selectors used for content extraction (in DirectDeepwikiScraper).
- Markdown parsing/cleaning logic (in DirectMarkdownScraper).
- HTML to Markdown conversion options (markdownify settings).
- Output file naming conventions.
- Request headers and delays.
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file deepwiki_to_md-0.1.0.tar.gz.
File metadata
- Download URL: deepwiki_to_md-0.1.0.tar.gz
- Upload date:
- Size: 26.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | d08ef9b78f159481eff8ed7684c28b03125bf0a4e3f8ab638c81e19cd9f34161 |
| MD5 | c4c5fe834aa6ba871baba6bd7d3f8238 |
| BLAKE2b-256 | 4b2adc608a1cf182a5b636543b80f653cf8fc1ec0c25ed210d30de87c38f2655 |