DataHarvest
DataHarvest is a tool for searching, crawling, and cleaning data.
Search
Search engine | Website | Supported |
---|---|---|
tavily | https://docs.tavily.com/ | ✅ |
Tiangong (天工) | https://www.tiangong.cn/ | coming soon |
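Crawling and cleaning
Site | Content | URL pattern | Crawling | Cleaning |
---|---|---|---|---|
Baidu Baike (百度百科) | Encyclopedia entry | baike.baidu.com/item | ✅ | ✅ |
Baidu Baijiahao (百度百家号) | Article | baijiahao.baidu.com/s | ✅ | ✅ |
Bilibili (B站) | Article | www.bilibili.com/read | ✅ | ✅ |
Tencent News (腾讯网) | Article | new.qq.com/rain/a | ✅ | ✅ |
360doc (360个人图书馆) | Article | www.360doc.com/content | ✅ | ✅ |
360 Baike (360百科) | Encyclopedia entry | baike.so.com/doc | ✅ | ✅ |
Sogou Baike (搜狗百科) | Encyclopedia entry | baike.sogou.com/v | ✅ | ✅ |
Sohu (搜狐) | Article | www.sohu.com/a | ✅ | ✅ |
Toutiao (头条) | Article | www.toutiao.com/article | ✅ | ✅ |
NetEase (网易) | Article | www.163.com/\w+/article/.+ | ✅ | ✅ |
WeChat Official Accounts (微信公众号) | Article | weixin.qq.com/s | ✅ | ✅ |
Mafengwo (马蜂窝) | coming soon | | | |
Xiaohongshu (小红书) | coming soon | | | |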
Any other site falls back to basic Playwright crawling and html2text cleaning, with no site-specific adaptation.
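The table pairs each supported site with a URL pattern, and anything that matches none of them takes the generic path. A minimal sketch of that kind of routing, using only the standard library, could look like the following. The names `SITE_PATTERNS`, `FALLBACK`, and `pick_handler`, and the handler labels, are hypothetical and are not part of DataHarvest's API; only the patterns themselves are taken from the table.

```python
import re

# Patterns copied from the table; the labels on the right are illustrative only.
SITE_PATTERNS = [
    (r"baike\.baidu\.com/item", "baidu_baike"),
    (r"baijiahao\.baidu\.com/s", "baijiahao"),
    (r"www\.163\.com/\w+/article/.+", "netease"),
    (r"weixin\.qq\.com/s", "wechat"),
]

FALLBACK = "generic"  # hypothetical label for the Playwright + html2text path


def pick_handler(url: str) -> str:
    """Return the label of the first pattern found in the URL, else the fallback."""
    for pattern, label in SITE_PATTERNS:
        if re.search(pattern, url):
            return label
    return FALLBACK
```

For example, `pick_handler("https://example.com/post/1")` returns `"generic"` because no pattern in the list matches. How DataHarvest itself selects a spider or purifier may differ.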
Installation and usage
pip install dataharvest
Best practices
Search
from dataharvest.searcher.tavily_searcher import TavilySearcher
api_key = "xxx"  # or set the TAVILY_API_KEY environment variable
searcher = TavilySearcher(api_key)
searcher.search("战国水晶杯")
SearchResult(keyword='战国水晶杯', answer=None, images=None, items=[
SearchResultItem(title='战国水晶杯_百度百科', url='https://baike.baidu.com/item/战国水晶杯/7041521', score=0.98661,
description='战国水晶杯为战国晚期水晶器皿,于1990年出土于浙江省杭州市半山镇石塘村,现藏于杭州博物馆。战国水晶杯高15.4厘米、口径7.8厘米、底径5.4厘米,整器略带淡琥珀色,局部可见絮状包裹体;器身为敞口,平唇,斜直壁,圆底,圈足外撇;光素无纹,造型简洁。',
content='')])
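Each returned item carries a `url` and a `score`, so a common next step is to keep only the better matches before crawling. The sketch below shows one way to do that. The two dataclasses are stand-ins that mirror the fields printed in the output above so the example runs without an API key; they are not the library's own classes. The helper `top_urls` and its 0.5 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchResultItem:  # stand-in mirroring the fields printed above
    title: str
    url: str
    score: float
    description: str = ""
    content: str = ""


@dataclass
class SearchResult:  # stand-in mirroring the fields printed above
    keyword: str
    answer: Optional[str] = None
    images: Optional[list] = None
    items: List[SearchResultItem] = field(default_factory=list)


def top_urls(result: SearchResult, min_score: float = 0.5) -> List[str]:
    """Return URLs of items scoring at least min_score, highest score first."""
    kept = [item for item in result.items if item.score >= min_score]
    kept.sort(key=lambda item: item.score, reverse=True)
    return [item.url for item in kept]
```

With a real result object, the helper relies only on the `items`, `url`, and `score` attributes that appear in the printed output.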
Crawling
from dataharvest.spider import AutoSpider
url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)
print(doc)
Cleaning
from dataharvest.purifier import AutoPurifier
from dataharvest.spider import AutoSpider
url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)
print(doc)
auto_purifier = AutoPurifier()
doc = auto_purifier.purify(doc)
print(doc)
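To make the idea of cleaning concrete, here is a small standalone sketch that reduces an HTML string to its visible text. It deliberately uses the standard library's `html.parser` rather than html2text or any DataHarvest purifier, so it illustrates the general technique only. The names `TextExtractor` and `clean_html` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A site-specific purifier would typically go further, for instance by keeping only the article body and dropping navigation, ads, and comment sections.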
Integration
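import asyncio
from dataharvest.base import DataHarvest
from dataharvest.searcher import TavilySearcher
async def main():
    searcher = TavilySearcher()  # no key passed: uses the TAVILY_API_KEY environment variable
    dh = DataHarvest()
    r = searcher.search("战国水晶杯")
    # Crawl and clean every search hit concurrently.
    tasks = [dh.a_crawl_and_purify(item.url) for item in r.items]
    return await asyncio.gather(*tasks)
# asyncio.run replaces the deprecated get_event_loop()/run_until_complete pattern.
docs = asyncio.run(main())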
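When a search returns many URLs, launching every crawl at once can overload the browser or the target sites, and a single failure makes a plain `asyncio.gather` raise. The sketch below shows one way to cap concurrency and collect failures instead. The helper name `gather_limited` and the default limit of 3 are assumptions for illustration, and `fetch` stands for any coroutine function that takes a URL.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_limited(
    urls: List[str],
    fetch: Callable[[str], Awaitable[object]],
    limit: int = 3,
) -> list:
    """Run fetch(url) for every URL with at most `limit` calls in flight.

    Results come back in input order; a failed call yields its exception
    object instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str):
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)
```

Assuming `dh.a_crawl_and_purify` accepts a URL as in the example above, it could be passed as `fetch`, for instance `asyncio.run(gather_limited([item.url for item in r.items], dh.a_crawl_and_purify))`. Callers should then check each result with `isinstance(result, Exception)` before using it.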