Deepwiki to Markdown Converter
The Japanese version of this document is available at README_ja.md.
A Python tool to scrape content from deepwiki sites and convert it to Markdown format. It provides various scraping strategies and utility functions for processing the scraped data.
Features
- Scrapes content from deepwiki sites using multiple strategies:
- Direct Markdown Fetching (default)
- Direct HTML Scraping with conversion
- Simple static fallback
- Extracts navigation items from specified UI elements to traverse libraries
- Converts HTML content to Markdown format using markdownify
- Saves the converted files in an organized directory structure
- Supports scraping multiple libraries in a single run
- Includes error handling with domain validation, reachability checks, and retry mechanisms
- Offers a utility to convert Markdown files to YAML format while preserving formatting
- Provides a utility to fix links within the scraped Markdown files
- Supports scraping responses from chat interfaces using Selenium
Requirements
- Python 3.6 or higher
- Required Python packages (see requirements.txt):
  - requests
  - beautifulsoup4
  - argparse
  - markdownify
  - selenium (required for the chat scraping feature)
  - webdriver-manager (required for the chat scraping feature)
  - pyyaml (required for the Markdown to YAML conversion feature)
Installation
Option 1: Install from PyPI
pip install deepwiki-to-md
This will install the core dependencies listed in setup.py. Note that selenium, webdriver-manager, and pyyaml are listed in requirements.txt but not in setup.py's install_requires. If you need the chat scraping or YAML conversion features, you may need to install these manually or install from source including requirements.txt.
Option 2: Install from source
Clone this repository:
git clone https://github.com/yourusername/deepwiki_to_md.git
cd deepwiki_to_md
Install the package in development mode, including all dependencies from requirements.txt:
pip install -e . -r requirements.txt
Usage
Basic Usage (Command Line)
If installed from PyPI, you can use the command-line tool:
deepwiki-to-md "https://deepwiki.com/library_path"
Or with explicit parameters:
deepwiki-to-md --library "library_name" "https://deepwiki.example.com/library_path"
If installed from source, you can run the script directly:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/library_path"
Or with explicit parameters:
python -m deepwiki_to_md.run_scraper --library "library_name" "https://deepwiki.example.com/library_path"
Note: The output directory will be created in the current working directory where the command is executed, not in the package installation directory.
Repository Creation Tool
The package also includes a tool that submits repository creation requests by entering an email address and submitting a form:
If installed from PyPI, you can use the command-line tool:
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com"
To run in headless mode (without opening a browser window):
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com" --headless
If installed from source, you can run the script directly:
python -m deepwiki_to_md.create --url "https://example.com/repository/create" --email "user@example.com"
Using the Python API
You can also use the DeepwikiScraper class directly in your Python code:
from deepwiki_to_md import DeepwikiScraper
# Import specific scraper classes if needed for direct use
from deepwiki_to_md.direct_scraper import DirectDeepwikiScraper # For HTML -> MD
from deepwiki_to_md.direct_md_scraper import DirectMarkdownScraper # For Direct MD
# Import the RepositoryCreator class for repository creation
from deepwiki_to_md.create import RepositoryCreator
# Create a scraper instance (DirectMarkdownScraper is used by default)
scraper = DeepwikiScraper(output_dir="MyDocuments")
# Scrape a library using the default (DirectMarkdownScraper)
scraper.scrape_library("python", "https://deepwiki.com/python/cpython")
# Create another scraper with a different output directory
other_scraper = DeepwikiScraper(output_dir="OtherDocuments")
# Scrape another library (still uses DirectMarkdownScraper by default)
other_scraper.scrape_library("javascript", "https://deepwiki.example.com/javascript")
# --- Using DirectDeepwikiScraper explicitly (HTML to Markdown) ---
# Create a scraper instance explicitly using DirectDeepwikiScraper
# This scraper fetches HTML and converts it to Markdown
html_scraper = DeepwikiScraper(
output_dir="HtmlScrapedDocuments",
use_direct_scraper=True, # Enable DirectDeepwikiScraper
use_alternative_scraper=False, # Disable alternative fallback for clarity
use_direct_md_scraper=False # Disable DirectMarkdownScraper
)
html_scraper.scrape_library("go", "https://deepwiki.com/go")
# --- Using DirectMarkdownScraper explicitly (Direct Markdown Fetching) ---
# Create a scraper instance explicitly using DirectMarkdownScraper
# This is already the default, but can be specified for clarity or if other defaults change
md_scraper = DeepwikiScraper(
output_dir="DirectMarkdownDocuments",
use_direct_scraper=False,
use_alternative_scraper=False,
use_direct_md_scraper=True # Enable DirectMarkdownScraper (this is the default)
)
md_scraper.scrape_library("rust", "https://deepwiki.com/rust")
# --- Using the individual direct scrapers directly ---
# These classes can be used independently for scraping specific pages or lists of pages
# Create a DirectDeepwikiScraper instance (HTML to Markdown)
direct_html_scraper = DirectDeepwikiScraper(output_dir="DirectHtmlScraped")
# Scrape a specific page directly (HTML to Markdown)
direct_html_scraper.scrape_page(
"https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
"python_bytecode", # Library name/path part for output folder
save_html=True # Optionally save the original HTML
)
# Create a DirectMarkdownScraper instance (Direct Markdown Fetching)
direct_md_scraper = DirectMarkdownScraper(output_dir="DirectMarkdownFetched")
# Scrape a specific page directly as Markdown
direct_md_scraper.scrape_page(
"https://deepwiki.com/python/cpython/2.1-bytecode-interpreter-and-optimization",
"python_bytecode" # Library name/path part for output folder
)
# --- Using the RepositoryCreator for repository creation requests ---
# Create a RepositoryCreator instance
creator = RepositoryCreator(headless=False) # Set headless=True to run without browser UI
try:
# Send a repository creation request
success = creator.create(
url="https://example.com/repository/create",
email="user@example.com"
)
if success:
print("Repository creation request sent successfully")
else:
print("Failed to send repository creation request")
finally:
# Always close the browser when done
creator.close()
Command-line Arguments
For deepwiki-to-md or python -m deepwiki_to_md.run_scraper:
- library_url: URL of the library to scrape (can be provided as a positional argument).
- --library, -l: Library name and URL to scrape. Can be specified multiple times for different libraries. Format: --library NAME URL.
- --output-dir, -o: Output directory for Markdown files (default: Documents).
- --use-direct-scraper: Use DirectDeepwikiScraper (HTML to Markdown conversion). Prioritized over --use-direct-md-scraper if both are specified.
- --no-direct-scraper: Disable DirectDeepwikiScraper.
- --use-alternative-scraper: Use the scrape_deepwiki function from direct_scraper.py as a fallback if the primary method fails (default: True).
- --no-alternative-scraper: Disable the alternative scraper fallback.
- --use-direct-md-scraper: Use DirectMarkdownScraper (fetches Markdown directly). This is the default behavior if no scraper type is explicitly specified.
- --no-direct-md-scraper: Disable DirectMarkdownScraper.
Scraper Priority:
- If --use-direct-scraper is specified, DirectDeepwikiScraper (HTML to Markdown) is used.
- If --use-direct-md-scraper is specified (and --use-direct-scraper is not), DirectMarkdownScraper (Direct Markdown) is used.
- If neither is specified, DirectMarkdownScraper (Direct Markdown) is used by default.
- The --use-alternative-scraper flag controls a fallback mechanism within the chosen primary scraper.
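The priority rules above can be sketched as a small resolver. This is a hypothetical helper for illustration only; the actual flag handling lives inside the package's CLI code:

```python
def resolve_scraper(use_direct_scraper=False, use_direct_md_scraper=False):
    """Pick the primary scraper from the CLI flags.

    Mirrors the documented priority: --use-direct-scraper wins,
    then --use-direct-md-scraper, then the Direct Markdown default.
    (Illustrative only; not the package's real code.)
    """
    if use_direct_scraper:
        return "DirectDeepwikiScraper"   # HTML -> Markdown conversion
    if use_direct_md_scraper:
        return "DirectMarkdownScraper"   # direct Markdown fetching
    return "DirectMarkdownScraper"       # default when neither flag is given

print(resolve_scraper(use_direct_scraper=True, use_direct_md_scraper=True))
```

Note that when both flags are passed, the HTML scraper is chosen, matching the documented precedence.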
For deepwiki-create or python -m deepwiki_to_md.create:
- --url (required): The URL of the repository creation page.
- --email (required): The email address to notify.
- --headless: Run the browser in headless mode (without UI).
Examples (Command Line)
Simplified usage (uses DirectMarkdownScraper by default):
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython"
# Or if installed via pip: deepwiki-to-md "https://deepwiki.com/python/cpython"
Scrape a single library with explicit parameters:
python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.com/python/cpython"
Scrape multiple libraries:
python -m deepwiki_to_md.run_scraper --library "python" "https://deepwiki.com/python/cpython" --library "microsoft/vscode" "https://deepwiki.com/microsoft/vscode"
Specify a custom output directory:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --output-dir "MyDocuments"
Explicitly use DirectMarkdownScraper (Direct Markdown):
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --use-direct-md-scraper
Explicitly use DirectDeepwikiScraper (HTML to Markdown):
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --use-direct-scraper
Disable the alternative scraper fallback:
python -m deepwiki_to_md.run_scraper "https://deepwiki.com/python/cpython" --no-alternative-scraper
Using the repository creation tool:
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com"
Using the repository creation tool in headless mode:
deepwiki-create --url "https://example.com/repository/create" --email "user@example.com" --headless
Usage with run_direct_scraper.py
You can also use the run_direct_scraper.py script, which is a simplified entry point specifically for the DirectDeepwikiScraper (HTML to Markdown):
python -m deepwiki_to_md.run_direct_scraper "https://deepwiki.com/python/cpython"
# Or with explicit parameters:
python -m deepwiki_to_md.run_direct_scraper --library "python" "https://deepwiki.com/python/cpython"
# To save HTML as well:
python -m deepwiki_to_md.run_direct_scraper "https://deepwiki.com/python/cpython" --save-html
Arguments for run_direct_scraper.py:
- library_url: URL of the library (positional).
- --library, -l: Library name and URL (can be specified multiple times).
- --output-dir, -o: Output directory (default: DynamicDocuments).
- --save-html: Save original HTML files alongside Markdown.
Output Structure
The converted Markdown files will be saved in the following directory structure:
<output_dir>/
├── <library_name1>/
│   ├── md/
│   │   ├── <page_name1>.md
│   │   ├── <page_name2>.md
│   │   └── ...
│   └── html/   # Only if --save-html is used with DirectDeepwikiScraper
│       ├── <page_name1>.html
│       ├── <page_name2>.html
│       └── ...
├── <library_name2>/
│   └── md/
│       ├── <page_name1>.md
│       ├── <page_name2>.md
│       └── ...
└── ...
- <output_dir> is the directory specified by --output-dir (default: Documents for run_scraper.py, DynamicDocuments for run_direct_scraper.py).
- <library_name> is the name provided for the library (or inferred from the URL path).
- Each page from the Deepwiki site is saved as a separate .md file within the md subdirectory.
- Original HTML is saved in the html subdirectory if the --save-html option is used with DirectDeepwikiScraper.
How It Works
The tool offers different scraping strategies to maximize compatibility and output quality:
1. Direct Markdown Scraping (DirectMarkdownScraper - Default)
- Priority: Highest (used by default if no other scraper is explicitly chosen).
- Method: Attempts to fetch the raw Markdown content directly from the Deepwiki site's underlying data source or API. This is done by sending requests with specialized headers that mimic internal application requests.
- Process:
- Sends requests designed to retrieve Markdown data (using specific Accept headers or query parameters)
- Parses the response to extract the Markdown content
- Performs minimal cleaning on the extracted Markdown
- Splits the content into multiple files based on level 2 headings (##)
- Saves the cleaned and split Markdown content directly to .md files
- Advantage: Produces the highest fidelity Markdown, preserving the original formatting and structure as intended by the author.
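The heading-based splitting step described above can be sketched as follows. This is a simplified illustration with hypothetical helper names, not the package's actual implementation:

```python
import re

def split_by_h2(markdown_text):
    """Split a Markdown document into sections at level-2 headings (##).

    Returns (title, section_text) pairs; content before the first
    '## ' heading is kept under the placeholder title 'index'.
    (Illustrative sketch; the real splitter may differ.)
    """
    sections = []
    title, lines = "index", []
    for line in markdown_text.splitlines():
        m = re.match(r"^##\s+(.*)", line)  # matches '## ', not '###'
        if m:
            sections.append((title, "\n".join(lines).strip()))
            title, lines = m.group(1).strip(), [line]
        else:
            lines.append(line)
    sections.append((title, "\n".join(lines).strip()))
    return [(t, s) for t, s in sections if s]

doc = "# Intro\ntext\n## Setup\nsteps\n## Usage\nexamples\n"
for title, body in split_by_h2(doc):
    print(title)
```

Each returned section would then be written to its own .md file, which matches the one-page-per-file output structure described later.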
2. Direct HTML Scraping (DirectDeepwikiScraper)
- Priority: Medium (used if --use-direct-scraper is specified).
- Method: Connects to the Deepwiki site using headers that mimic a standard browser request to fetch the fully rendered HTML page.
- Process:
- Fetches the full HTML of the page using the scrape_deepwiki function
- Uses BeautifulSoup to parse the HTML
- Identifies the main content area using a list of potential CSS selectors
- Uses the markdownify library to convert the selected HTML content to Markdown
- Saves the converted Markdown
- Advantage: More robust than basic static scraping if direct Markdown fetching fails or is unavailable.
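The project performs this conversion with BeautifulSoup and markdownify; as a rough stdlib-only illustration of what the HTML-to-Markdown step does, here is a toy converter that handles only headings, paragraphs, and links (it is a stand-in, not the library the scraper actually uses):

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter covering h1/h2, p, and a tags.

    A stand-in for markdownify, shown only to illustrate the
    HTML -> Markdown conversion step; real output is richer.
    """
    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "h2":
            self.out.append("## ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "p"):
            self.out.append("\n\n")
        elif tag == "a":
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)

    def markdown(self):
        return "".join(self.out).strip()

conv = TinyMarkdown()
conv.feed('<h1>Title</h1><p>See <a href="x.md">docs</a>.</p>')
print(conv.markdown())
```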
3. Alternative Scraper Fallback
- Priority: Lowest (used as a fallback if --use-alternative-scraper is enabled).
- Method: A simpler static requests mechanism with specific headers designed to fetch the page HTML reliably.
Markdown to YAML Conversion Utility
The tool provides a utility to convert Markdown files to YAML format while preserving formatting. This is particularly useful for processing the scraped content for LLMs.
Using the Conversion Tool (Command Line)
python -m deepwiki_to_md.chat convert --md "path/to/markdown/file.md"
# Or if console script entry point is installed:
# deepwiki-chat convert --md "path/to/markdown/file.md"
To specify a custom output directory:
python -m deepwiki_to_md.chat convert --md "path/to/markdown/file.md" --output "path/to/output/directory"
Using the Python API (Markdown to YAML)
from deepwiki_to_md.md_to_yaml import convert_md_file_to_yaml, markdown_to_yaml
# Convert a Markdown file to YAML
yaml_file_path = convert_md_file_to_yaml("path/to/markdown/file.md")
# Convert a Markdown file to YAML with a custom output directory
yaml_file_path = convert_md_file_to_yaml("path/to/markdown/file.md", "path/to/output/directory")
# Or convert a Markdown string directly to a YAML string
markdown_string = "# My Document\n\nThis is the content."
yaml_string = markdown_to_yaml(markdown_string)
print(yaml_string)
YAML Format
The converted YAML file includes a structured representation of the document while embedding the original Markdown content:
timestamp: 'YYYY-MM-DD HH:MM:SS' # Timestamp of the conversion
title: Extracted Document Title # Title extracted from the first H1/H2 header
content: |
# Original Title
## Section 1
Content of section 1.
* List item 1
* List item 2
print("code")
[Link Text](url)
## Section 2
Content of section 2.
... # Full original Markdown content is preserved
links:
- text: Link Text
url: url # List of links extracted from the Markdown
images: [ ] # List of images extracted (currently empty)
metadata:
headers: # List of all header texts
- Original Title
- Section 1
- Section 2
...
paragraphs_count: 5 # Count of paragraphs
lists_count: 1 # Count of lists
tables_count: 0 # Count of tables
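The structure above can be approximated with a small stdlib-only sketch. The field names follow the documented layout, but the extraction logic here is simplified and hypothetical; in practice, pyyaml's yaml.dump would serialize the resulting dict:

```python
import re
from datetime import datetime

def markdown_to_record(md):
    """Build a dict mirroring the documented YAML layout.

    Simplified sketch: title is the first H1/H2, links come from a
    basic regex; the real converter does more (images, counts, etc.).
    """
    headers = re.findall(r"^#{1,6}\s+(.*)$", md, flags=re.M)
    title_m = re.search(r"^#{1,2}\s+(.*)$", md, flags=re.M)
    links = [{"text": t, "url": u}
             for t, u in re.findall(r"\[([^\]]+)\]\(([^)]+)\)", md)]
    return {
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "title": title_m.group(1) if title_m else "",
        "content": md,          # full original Markdown is preserved
        "links": links,
        "images": [],
        "metadata": {"headers": headers},
    }

rec = markdown_to_record("# My Doc\n\nSee [home](https://example.com).\n")
print(rec["title"], rec["links"])
```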
Markdown Link Fixing Utility
The tool automatically runs a link-fixing utility on the generated .md files. This utility detects Markdown links whose targets are written in a form that does not resolve correctly within the scraped output and rewrites them into working links.
Using the Link Fixing Tool (Command Line)
python -m deepwiki_to_md.fix_markdown_links "path/to/your/markdown/directory"
Using the Python API (Link Fixing)
from deepwiki_to_md.fix_markdown_links import fix_markdown_links
# Fix links in all markdown files within a directory
fix_markdown_links("path/to/your/markdown/directory")
Chat Scraping Feature (Requires Selenium)
The tool includes a feature to interact with chat interfaces using Selenium and save the responses.
Using the Chat Scraper (Command Line)
python -m deepwiki_to_md.chat --url "https://deepwiki.com/some_chat_page" --message "Your message here" --wait 10 --debug --format "html,md,yaml" --output "MyChatResponses" --deep
Arguments for chat.py:
- --url: URL of the chat interface.
- --message: Message to send.
- --selector: CSS selector for the chat input (default: textarea).
- --button: CSS selector for the submit button (default: button).
- --wait: Time to wait for the response, in seconds (default: 30).
- --debug: Enable debug mode.
- --output: Output directory (default: ChatResponses).
- --deep: Enable "Deep Research" mode (specific to some interfaces).
- --headless: Run the browser in headless mode.
- --format: Output format(s): html, md, yaml, or a comma-separated list (default: html).
Note: The chat scraper uses Selenium, which requires a compatible browser to be installed (the matching driver is managed by webdriver-manager).
License
This project is licensed under the MIT License - see the LICENSE file for details.
File details
Details for the file deepwiki_to_md-0.3.0.tar.gz.
File metadata
- Download URL: deepwiki_to_md-0.3.0.tar.gz
- Upload date:
- Size: 43.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c0179384b5e90d9d6bc9c4979d3128203d7c42b6f77a5dc17f4f62757f1fa683 |
| MD5 | cfdccba756c4c531d099f61f531871cc |
| BLAKE2b-256 | 640d44eba00b736d7e1cc47266c1d83725366e019a86d02862afa5cfe346c66d |
File details
Details for the file deepwiki_to_md-0.3.0-py3-none-any.whl.
File metadata
- Download URL: deepwiki_to_md-0.3.0-py3-none-any.whl
- Upload date:
- Size: 50.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 630c934e0a15b9b39d4d490c3de7a22c8ccad080c9ac979d1a0a73484e53b15b |
| MD5 | 837f2680e0df45108afe51ff8a6e9541 |
| BLAKE2b-256 | e87805a0ed27b507a5fe2c1181e6a39416679adb7c2dbb0939c9e3c4f05fcb00 |