A Python library to scrape Feishu wiki pages and convert them to Markdown
Feishu Wiki Scraper
A Python library to scrape Feishu (飞书) wiki pages and convert them to Markdown format, similar to Firecrawl. This tool can scrape entire wiki sites by following sidebar links and extracting all content.
Features
- 🚀 Scrape single Feishu wiki pages or entire wiki sites
- 📝 Convert HTML content to clean Markdown format
- 🔗 Automatically follow sidebar links to scrape related pages
- 🍪 Support for authentication via cookies and custom headers
- ⚙️ Configurable scraping options (delays, max pages, etc.)
- 💾 Export to Markdown files or JSON format
- 📂 Directory output mode: save each page as a separate .md file, preserving the wiki tree structure
- 🎯 Command-line interface for easy usage
- 🔥 Firecrawl-compatible JSON output with metadata
Installation
From source
git clone https://github.com/rwifeng/feishu-wiki-scrape.git
cd feishu-wiki-scrape
pip install -e .
Dependencies
pip install -r requirements.txt
Usage
Command Line Interface
Basic usage to scrape a Feishu wiki:
# Save to a single Markdown file
feishu-wiki-scrape https://zcn3fx96oxg4.feishu.cn/wiki/H5V5wMczPif5A5khSG3cWx65nbc -o output.md
# Save as a directory tree (one .md file per page, preserving wiki structure)
feishu-wiki-scrape https://zcn3fx96oxg4.feishu.cn/wiki/H5V5wMczPif5A5khSG3cWx65nbc -o ./wiki-docs/
Options
- -o, --output: Output path (default: output.md). If the path ends with /, is an existing directory, or has no file extension, each page is saved as a separate .md file in a nested directory tree matching the wiki structure
- --max-pages: Maximum number of pages to scrape (default: unlimited)
- --no-sidebar: Don't follow sidebar links (scrape only the given URL)
- --delay: Delay between requests in seconds (default: 1.0)
- --cookies: Cookies as JSON string for authentication
- --headers: Custom headers as JSON string
- --json-output: Output as JSON instead of Markdown file
- --firecrawl-format: Output in Firecrawl-compatible JSON format with metadata
- -v, --verbose: Enable verbose logging
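The rule that decides between single-file and directory output can be sketched in a few lines. This is only an illustration of the rule as described for -o above, not the library's own code; the function name is_directory_output is hypothetical.

```python
from pathlib import Path


def is_directory_output(output: str) -> bool:
    """Return True when `output` should be treated as a directory target.

    Hypothetical helper mirroring the documented rule: trailing slash,
    existing directory, or no file extension.
    """
    if output.endswith("/"):
        return True
    path = Path(output)
    if path.is_dir():
        return True
    # A path such as "wiki-docs" has no suffix, so it counts as a directory.
    return path.suffix == ""


if __name__ == "__main__":
    for candidate in ("output.md", "./wiki-docs/", "wiki-docs"):
        print(candidate, "->", is_directory_output(candidate))
```

Under this reading, a name such as notes.v2 would be treated as a file because it has a suffix, so a trailing / is the least ambiguous way to request directory output.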
Examples
Scrape a single page without following links:
feishu-wiki-scrape https://example.feishu.cn/wiki/page --no-sidebar -o single_page.md
Scrape with authentication cookies:
feishu-wiki-scrape https://example.feishu.cn/wiki/page \
--cookies '{"session_id": "your-session-id"}' \
-o authenticated_output.md
Limit to 10 pages with custom delay:
feishu-wiki-scrape https://example.feishu.cn/wiki/page \
--max-pages 10 \
--delay 2.0 \
-o limited_output.md
Output as JSON:
feishu-wiki-scrape https://example.feishu.cn/wiki/page --json-output > output.json
Save as directory tree preserving wiki structure:
feishu-wiki-scrape https://example.feishu.cn/wiki/page -o ./docs/
This produces a directory tree like:
docs/
  🚀 Introduction/
    index.md              # parent page with children
    Getting Started.md
  FAQ/
    index.md
    Common Errors.md
  Claude.md               # leaf page (no children)
Python API
from feishu_wiki_scrape import FeishuWikiScraper
# Create scraper instance
scraper = FeishuWikiScraper(
    cookies={"session_id": "your-session-id"},  # Optional
    headers={"Custom-Header": "value"},  # Optional
    delay=1.0  # Delay between requests
)
# Scrape a single page
page = scraper.scrape_page("https://example.feishu.cn/wiki/page")
print(page["title"])
print(page["markdown"])
# Scrape entire wiki (follows sidebar links)
results = scraper.scrape_wiki(
    start_url="https://example.feishu.cn/wiki/page",
    max_pages=50,  # Optional: limit number of pages
    include_sidebar=True  # Follow sidebar links
)
for page in results:
    print(f"Title: {page['title']}")
    print(f"URL: {page['url']}")
    print(f"Content:\n{page['markdown']}\n")
# Save to file
scraper.scrape_to_file(
    start_url="https://example.feishu.cn/wiki/page",
    output_file="output.md",
    max_pages=None,  # No limit
    include_sidebar=True
)
# Save to directory tree (preserves wiki sidebar structure)
count = scraper.scrape_wiki_to_directory(
    start_url="https://example.feishu.cn/wiki/page",
    output_dir="./docs/",
    max_pages=None  # No limit
)
print(f"Saved {count} pages")
Firecrawl-Compatible Output
This library supports Firecrawl-compatible JSON output with rich metadata, making it easy to build API-compatible tools.
Using CLI
feishu-wiki-scrape https://example.feishu.cn/wiki/page \
--firecrawl-format \
--max-pages 10 > output.json
Output format:
{
  "success": true,
  "status": "completed",
  "completed": 10,
  "total": 10,
  "data": [
    {
      "markdown": "# Page Title\n\nPage content...",
      "metadata": {
        "url": "https://example.feishu.cn/wiki/page",
        "title": "Page Title",
        "keywords": "keyword1, keyword2",
        "language": "zh-CN",
        "sourceURL": "https://example.feishu.cn/wiki/page",
        "statusCode": 200,
        "contentType": "text/html; charset=utf-8",
        "description": "Page description"
      }
    }
  ]
}
Using Python API
from feishu_wiki_scrape import FeishuWikiScraper
scraper = FeishuWikiScraper()
# Scrape with metadata
start_url = "https://example.feishu.cn/wiki/page"
results = scraper.scrape_wiki_with_metadata(
    start_url=start_url,
    max_pages=10,
    include_sidebar=True
)
# Format as Firecrawl response (automatically handles metadata format)
firecrawl_response = scraper.format_as_firecrawl(results, start_url)
print(firecrawl_response)
For a complete example of building a Firecrawl-compatible API, see example_firecrawl.py.
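To make the shape of the response concrete, here is a minimal sketch that builds the same envelope from plain page dictionaries with url, title, and markdown keys. It is not the library's format_as_firecrawl; the function name to_firecrawl_envelope is hypothetical, and only a subset of the metadata fields shown in the output format is filled in.

```python
import json


def to_firecrawl_envelope(pages):
    """Wrap page dicts in a Firecrawl-style envelope (illustrative only)."""
    data = []
    for page in pages:
        data.append({
            "markdown": page["markdown"],
            "metadata": {
                "url": page["url"],
                "title": page["title"],
                "sourceURL": page["url"],
            },
        })
    return {
        "success": True,
        "status": "completed",
        "completed": len(data),
        "total": len(data),
        "data": data,
    }


if __name__ == "__main__":
    sample = [{
        "url": "https://example.feishu.cn/wiki/page",
        "title": "Page Title",
        "markdown": "# Page Title\n\nPage content...",
    }]
    print(json.dumps(to_firecrawl_envelope(sample), ensure_ascii=False, indent=2))
```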
How It Works
- Page Fetching: Uses requests to fetch wiki pages with configurable headers and cookies
- Content Extraction: Parses HTML with BeautifulSoup to extract the main content area
- Link Discovery: Finds all wiki links in sidebars and navigation elements
- Markdown Conversion: Converts HTML to clean Markdown using html2text
- Crawling: Follows links breadth-first to scrape entire wiki sites
- Rate Limiting: Respects configurable delays between requests
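The crawling and rate-limiting steps above can be sketched as a plain breadth-first traversal. This is a simplified illustration, not the library's implementation: the function crawl_breadth_first and its get_links callback are hypothetical, and fetching and link extraction are abstracted behind that callback so the example runs without network access.

```python
import time
from collections import deque
from typing import Callable, Iterable, List, Optional


def crawl_breadth_first(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: Optional[int] = None,
    delay: float = 1.0,
) -> List[str]:
    """Visit pages level by level, pausing `delay` seconds between visits."""
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    visited: List[str] = []
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break
        url = queue.popleft()
        if visited:
            time.sleep(delay)   # wait between requests, not before the first
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    # A toy link graph standing in for real sidebar links.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(crawl_breadth_first("a", lambda u: graph.get(u, []), delay=0))
```

Tracking queued URLs in a set keeps a page that is linked from several sidebars from being fetched more than once.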
Authentication
Feishu wikis may require authentication. You can provide cookies or headers:
Getting Cookies
- Open the Feishu wiki in your browser
- Open Developer Tools (F12)
- Go to Application/Storage > Cookies
- Copy the relevant cookie values
- Pass them using the --cookies option or in Python code
Example:
feishu-wiki-scrape https://example.feishu.cn/wiki/page \
--cookies '{"session_id": "abc123", "other_cookie": "value"}'
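Because --cookies takes a JSON string, a malformed value is an easy mistake to make. The sketch below shows one way to validate such a string before use; parse_cookie_option is a hypothetical helper, not part of the library, and the error messages are illustrative.

```python
import json
from typing import Dict


def parse_cookie_option(raw: str) -> Dict[str, str]:
    """Parse a JSON object of cookie name/value pairs into a dict of strings."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"--cookies is not valid JSON: {exc}") from exc
    if not isinstance(value, dict):
        raise ValueError("--cookies must be a JSON object of name/value pairs")
    # Cookie values are sent as text, so coerce numbers and booleans to str.
    return {str(name): str(val) for name, val in value.items()}


if __name__ == "__main__":
    print(parse_cookie_option('{"session_id": "abc123", "other_cookie": "value"}'))
```

The resulting dictionary has the same shape as the cookies argument shown in the Python API section. On the command line, wrapping the JSON in single quotes keeps the shell from consuming the double quotes.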
Output Format
Single Markdown File (-o output.md)
Pages are separated by horizontal rules (---) with each page containing:
- Page title as H1 heading
- Source URL
- Markdown content
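As a rough illustration of this layout, the following sketch joins page dictionaries into one Markdown document. The exact spacing and the "Source:" label are assumptions, and render_single_file is a hypothetical name; the file written by the tool may differ in detail.

```python
def render_single_file(pages):
    """Join pages into one Markdown string separated by horizontal rules."""
    sections = []
    for page in pages:
        sections.append(
            f"# {page['title']}\n\n"
            f"Source: {page['url']}\n\n"   # label is an assumption
            f"{page['markdown'].strip()}\n"
        )
    return "\n---\n\n".join(sections)


if __name__ == "__main__":
    demo = [
        {"title": "Page Title", "url": "https://example.feishu.cn/wiki/page1",
         "markdown": "Page content in markdown..."},
        {"title": "Another Page", "url": "https://example.feishu.cn/wiki/page2",
         "markdown": "More content..."},
    ]
    print(render_single_file(demo))
```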
Directory Tree (-o dir/)
Each wiki page is saved as a separate .md file. The directory structure mirrors the wiki's sidebar tree:
- Pages with children become a directory containing index.md (the page content) plus child pages
- Leaf pages (no children) are saved as {title}.md in the parent directory
- The wiki space root container is skipped so the output directory maps directly to the top-level pages
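The mapping rules above can be sketched as a small recursive function over a page tree. PageNode and plan_paths are hypothetical names used only for illustration; the function takes the top-level pages directly, which corresponds to skipping the root container, and it does not sanitize titles that contain path separators.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List


@dataclass
class PageNode:
    """A wiki page title plus its child pages (illustrative structure)."""
    title: str
    children: List["PageNode"] = field(default_factory=list)


def plan_paths(nodes, parent=PurePosixPath(".")):
    """Return relative .md paths for each page, in tree order."""
    paths = []
    for node in nodes:
        if node.children:
            # A page with children becomes a folder holding index.md.
            folder = parent / node.title
            paths.append(str(folder / "index.md"))
            paths.extend(plan_paths(node.children, folder))
        else:
            # A leaf page becomes {title}.md in its parent's folder.
            paths.append(str(parent / f"{node.title}.md"))
    return paths


if __name__ == "__main__":
    tree = [
        PageNode("FAQ", [PageNode("Common Errors")]),
        PageNode("Claude"),
    ]
    for path in plan_paths(tree):
        print(path)
```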
JSON Format
[
  {
    "url": "https://example.feishu.cn/wiki/page1",
    "title": "Page Title",
    "markdown": "# Content\n\nPage content in markdown..."
  },
  {
    "url": "https://example.feishu.cn/wiki/page2",
    "title": "Another Page",
    "markdown": "# Content\n\nMore content..."
  }
]
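Output in this shape is straightforward to post-process. As a minimal sketch, the helper below builds a URL-to-title index from the JSON text; index_by_url is a hypothetical name and assumes every entry carries the url and title keys shown above.

```python
import json


def index_by_url(json_text: str) -> dict:
    """Map each page URL to its title from the JSON array output."""
    pages = json.loads(json_text)
    return {page["url"]: page["title"] for page in pages}


if __name__ == "__main__":
    sample = json.dumps([
        {"url": "https://example.feishu.cn/wiki/page1", "title": "Page Title",
         "markdown": "# Content\n\nPage content in markdown..."},
        {"url": "https://example.feishu.cn/wiki/page2", "title": "Another Page",
         "markdown": "# Content\n\nMore content..."},
    ])
    for url, title in index_by_url(sample).items():
        print(f"{title}: {url}")
```

With a file produced by --json-output, the same function can be applied to the file's contents read as UTF-8 text.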
Troubleshooting
Pages not loading
- Check if authentication is required (try with cookies)
- Verify the URL is accessible in a browser
- Increase delay between requests
Missing content
- Some content may be loaded dynamically with JavaScript
- Try using cookies from an authenticated session
- Check verbose output with the -v flag
License
MIT License
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.