Turn any web article into clean Markdown via CLI.
墨探 (omni-article-markdown)
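Easily convert web articles (blogs, news, documentation, and more) to Markdown.
Introduction
墨探 was created to solve one problem: how to convert articles from all kinds of websites into a uniform Markdown format, accurately and efficiently.
Websites vary widely in design, and their HTML structures differ just as much. This diversity makes automated content extraction and format conversion very difficult, and a general-purpose solution that copes with every kind of complex HTML structure is not easy to build.
My approach is to start by adapting to specific websites, generalize from those cases step by step into a common solution, and eventually cover as many sites as possible.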
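Features
- Converts most HTML elements
- Converts KaTeX formulas on some pages (example: https://quantum.country/qcvc)
- Supports GitHub Gist on some pages (example: https://towardsdatascience.com/hands-on-multi-agent-llm-restaurant-simulation-with-python-and-openai)
- Saves the result to a file or writes it to stdout
- Gets past the anti-crawling measures of some websites (via playwright)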
Here are some example sites; feel free to test the results yourself.
| Site | Link | Notes |
|---|---|---|
| Medium | link | |
| CSDN | link | |
| Juejin (掘金) | link | |
| WeChat Official Accounts (公众号) | link | |
| NetEase (网易) | link | |
| Jianshu (简书) | link | |
| Towards Data Science | link | |
| Quantamagazine | link | |
| Cloudflare Blog | link | |
| Alibaba Cloud Developer Community (阿里云开发者社区) | link | |
| Microsoft technical documentation | link | |
| InfoQ | link | |
| Cnblogs (博客园) | link | |
| SegmentFault (思否) | link | |
| OSChina (开源中国) | link | |
| Forbes | link | |
| sspai (少数派) | link | |
| Yuque (语雀) | link | |
| Tencent Cloud Developer Community (腾讯云开发者社区) | link | |
| 人人都是产品经理 | link | |
| JetBrains Blog | link | |
| Claude documentation | link | |
| Anthropic | link | |
| Meta Blog | link | |
| Android Developers Blog | link | |
| Spring Blog | link | |
| Hackernoon | link | |
| LinkedIn Blog | link | |
| Wallstreetcn (华尔街见闻) | link | |
| Apple Developer Documentation | link | |
| Baijiahao (百家号) | link | |
| Snowflake technical blog | link | |
| Zhihu Column (知乎专栏) | link | |
| Toutiao (今日头条) | link | |
| X Articles | link | |
| Feishu (飞书) | link | |
| | link | No longer available |
Installation
Option 1: pip (recommended)
pip install omni-article-markdown
Once installed, it is ready to use:
mdcli --help
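To confirm the installation from Python rather than from the shell, a small check like the following may help. It is only a sketch: it assumes the distribution name `omni-article-markdown` and the command name `mdcli` shown above, and the helper names `installed_version` and `cli_on_path` are hypothetical.

```python
import shutil
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "omni-article-markdown") -> Optional[str]:
    """Return the installed version of the distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def cli_on_path(command: str = "mdcli") -> bool:
    """Return True if the command can be found on the current PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    print("installed version:", installed_version())
    print("mdcli on PATH:", cli_on_path())
```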
Basic usage
Convert only
mdcli https://example.com
Save to the current directory
mdcli https://example.com -s
Save to a specified path
mdcli https://example.com -s /home/user/
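The same three invocations can be driven from a script. The sketch below only builds and runs the command lines shown above; it uses no flag other than `-s`, and it assumes that the Markdown is written to stdout when `-s` is omitted. The function names `build_mdcli_command` and `run_mdcli` are hypothetical helpers, not part of the package.

```python
import subprocess
from typing import List, Optional


def build_mdcli_command(url: str, save: bool = False,
                        save_path: Optional[str] = None) -> List[str]:
    """Build the argument list for one of the three usages shown above."""
    cmd = ["mdcli", url]
    if save or save_path is not None:
        cmd.append("-s")                # save instead of only converting
        if save_path is not None:
            cmd.append(save_path)       # optional target path after -s
    return cmd


def run_mdcli(url: str, save: bool = False,
              save_path: Optional[str] = None) -> str:
    """Run mdcli and return whatever it wrote to stdout (requires mdcli on PATH)."""
    result = subprocess.run(
        build_mdcli_command(url, save, save_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For example, `build_mdcli_command("https://example.com", save_path="/home/user/")` produces the argument list for the last command above.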
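Architecture
墨探 consists of three main modules:
- The Reader module fetches the full page content
- The Extractor module extracts the main article content and removes irrelevant data
- The Parser module converts the HTML to Markdown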
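To make the division of labour concrete, here is a toy three-stage pipeline built on the standard library only. It is an illustration of the Reader, Extractor, Parser idea, not the project's actual code: every name in it (`read`, `extract`, `to_markdown`, `convert`) is hypothetical, the reader is a stand-in that takes HTML as a string instead of fetching it, and only a handful of tags are handled.

```python
from html.parser import HTMLParser
from typing import List, Tuple

BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}                 # blocks kept as content
NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}  # subtrees dropped
PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}


def read(source: str) -> str:
    """Reader stage (stand-in): a real reader would fetch the page."""
    return source


class _BlockCollector(HTMLParser):
    """Collects (tag, text) pairs for block elements outside noise subtrees."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Tuple[str, str]] = []
        self._noise = 0      # nesting depth inside noise elements
        self._tag = None     # block element currently being collected
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise += 1
        elif tag in BLOCK_TAGS and not self._noise:
            self._tag, self._buf = tag, []

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise:
            self._noise -= 1
        elif tag == self._tag:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.blocks.append((tag, text))
            self._tag = None

    def handle_data(self, data):
        if self._tag and not self._noise:
            self._buf.append(data)


def extract(html: str) -> List[Tuple[str, str]]:
    """Extractor stage: keep content blocks, drop navigation and scripts."""
    collector = _BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def to_markdown(blocks: List[Tuple[str, str]]) -> str:
    """Parser stage: turn the extracted blocks into Markdown text."""
    return "\n\n".join(PREFIX[tag] + text for tag, text in blocks)


def convert(source: str) -> str:
    """Chain the three stages."""
    return to_markdown(extract(read(source)))
```

Keeping the stages separate means a site-specific fix can be confined to one stage, for instance a different extractor for a site with unusual markup, while the other two stay unchanged.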
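Contributing and feedback
- Found a parsing problem? Please open an Issue
- Improved the parsing? Pull Requests are welcome
Sponsorship
If you find 墨探 helpful, you can buy my cat some canned food ❤️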
License
MIT License
File details
Details for the file omni_article_markdown-0.2.0.tar.gz.
File metadata
- Download URL: omni_article_markdown-0.2.0.tar.gz
- Upload date:
- Size: 41.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d6afc3a35246c3381ac2ef7ab664264932c85b9a69539886e70927a1a0ca8a38 |
| MD5 | 4586ad59f459f899954b491dbec4d84a |
| BLAKE2b-256 | 5fd78c679f006057660c39afdc354225ffdace75427ff4e9faa6dd68605a4912 |
File details
Details for the file omni_article_markdown-0.2.0-py3-none-any.whl.
File metadata
- Download URL: omni_article_markdown-0.2.0-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.3 {"installer":{"name":"uv","version":"0.11.3","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4592f9753774f6f7dd25cfe3aec49e3f807d8dd783207bedfb67e00ba6c6f4d7 |
| MD5 | 87b7d648dc9c3d25c9f179eed1850d86 |
| BLAKE2b-256 | 3eebe3c149cf7202abf9d2b36efa67ad7e1189ed9cdd7141462c349d809af033 |