DataHarvest
DataHarvest is a tool for searching 🔍, crawling 🕷, and cleaning 🧽 data.
In the AI era, data is the foundation of everything; DataHarvest helps you obtain clean, usable data.
Search
Search engine | Website | Supported |
---|---|---|
tavily | https://docs.tavily.com/ | ✅ |
Tiangong (天工) | https://www.tiangong.cn/ | coming soon |
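Crawling & Cleaning
Site | Content | URL pattern | Crawling | Cleaning |
---|---|---|---|---|
Baidu Baike (百度百科) | Entry | baike.baidu.com/item/ | ✅ | ✅ |
Baidu Baijiahao (百度百家号) | Article | baijiahao.baidu.com/s/ | ✅ | ✅ |
Bilibili (B站) | Article | www.bilibili.com/read/ | ✅ | ✅ |
Tencent News (腾讯网) | Article | new.qq.com/rain/a/ | ✅ | ✅ |
360doc (360个人图书馆) | Article | www.360doc.com/content/ | ✅ | ✅ |
360 Baike (360百科) | Entry | baike.so.com/doc/ | ✅ | ✅ |
Sogou Baike (搜狗百科) | Entry | baike.sogou.com/v/ | ✅ | ✅ |
Sohu (搜狐) | Article | www.sohu.com/a/ | ✅ | ✅ |
Toutiao (头条) | Article | www.toutiao.com/article/ | ✅ | ✅ |
NetEase (网易) | Article | www.163.com/\w+/article/.+ | ✅ | ✅ |
WeChat Official Accounts (微信公众号) | Article | weixin.qq.com/s/ | ✅ | ✅ |
Mafengwo (马蜂窝) | Article | www.mafengwo.cn/i/ | ✅ | |
Xiaohongshu (小红书) | coming soon |
All other sites fall back to basic Playwright crawling and html2text cleaning, with no site-specific adaptation.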
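The table pairs each site with a URL pattern and leaves everything else to a generic fallback. A minimal sketch of that kind of dispatch, assuming the patterns are matched as regular expressions anywhere in the URL. The `PATTERNS` list, the labels, and `pick_handler` are hypothetical names for illustration, not part of DataHarvest's API:

```python
import re

# Hypothetical mapping from a URL pattern (copied from the table) to a site label.
PATTERNS = [
    (r"baike\.baidu\.com/item/", "baidu_baike"),
    (r"baike\.so\.com/doc/", "360_baike"),
    (r"www\.163\.com/\w+/article/.+", "netease"),
    (r"weixin\.qq\.com/s/", "wechat"),
]


def pick_handler(url: str) -> str:
    """Return the label of the first pattern found in the URL, else 'generic'."""
    for pattern, label in PATTERNS:
        if re.search(pattern, url):
            return label
    # No site-specific match: use the generic Playwright + html2text path.
    return "generic"


print(pick_handler("https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"))
```

Because the match is a search rather than a full match, query strings and fragments on the URL do not prevent a site from being recognized.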
Installation and Usage
pip install dataharvest
playwright install
Best Practices
Search
from dataharvest.searcher import TavilySearcher
api_key = "xxx" # or set the TAVILY_API_KEY environment variable
searcher = TavilySearcher(api_key)
searcher.search("战国水晶杯")
SearchResult(keyword='战国水晶杯', answer=None, images=None, items=[
SearchResultItem(title='战国水晶杯_百度百科', url='https://baike.baidu.com/item/战国水晶杯/7041521', score=0.98661,
description='战国水晶杯为战国晚期水晶器皿,于1990年出土于浙江省杭州市半山镇石塘村,现藏于杭州博物馆。战国水晶杯高15.4厘米、口径7.8厘米、底径5.4厘米,整器略带淡琥珀色,局部可见絮状包裹体;器身为敞口,平唇,斜直壁,圆底,圈足外撇;光素无纹,造型简洁。',
content='')])
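Each result item carries a `url` and a relevance `score`, so a common next step is to keep only the strongest hits before crawling. A minimal sketch, assuming the fields shown in the printed output above. `Item` and `Result` are stand-in dataclasses defined here so the example runs without an API key, and `top_urls` is a hypothetical helper:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Stand-in mirroring the printed SearchResultItem fields."""
    title: str
    url: str
    score: float
    description: str = ""
    content: str = ""


@dataclass
class Result:
    """Stand-in mirroring the printed SearchResult fields used here."""
    keyword: str
    items: List[Item] = field(default_factory=list)


def top_urls(result, min_score: float = 0.5) -> List[str]:
    """Return URLs ordered by descending score, dropping items below min_score."""
    ranked = sorted(result.items, key=lambda item: item.score, reverse=True)
    return [item.url for item in ranked if item.score >= min_score]


demo = Result("demo", [Item("a", "u1", 0.4), Item("b", "u2", 0.98661), Item("c", "u3", 0.7)])
print(top_urls(demo))
```

The 0.5 threshold is an arbitrary illustration; a sensible cut-off depends on the query and the search backend.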
Crawling
from dataharvest.spider import AutoSpider
url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)
print(doc)
Cleaning
from dataharvest.purifier import AutoPurifier
from dataharvest.spider import AutoSpider
url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)
print(doc)
auto_purifier = AutoPurifier()
doc = auto_purifier.purify(doc)
print(doc)
Integration
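import asyncio
from dataharvest.base import DataHarvest
from dataharvest.searcher import TavilySearcher

async def main():
    searcher = TavilySearcher()  # reads TAVILY_API_KEY from the environment
    dh = DataHarvest()
    r = searcher.search("战国水晶杯")
    # Crawl and purify every search hit concurrently
    tasks = [dh.a_crawl_and_purify(item.url) for item in r.items]
    return await asyncio.gather(*tasks)

# asyncio.run replaces the deprecated get_event_loop()/run_until_complete pair
docs = asyncio.run(main())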
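Gathering every task at once launches all crawls simultaneously, which can be heavy when a search returns many URLs. A minimal sketch of capping concurrency with a semaphore, assuming the crawl coroutine takes a single URL as in the example above. `gather_limited` and `fake_crawl` are hypothetical names, and `fake_crawl` stands in for `dh.a_crawl_and_purify` so the sketch runs offline:

```python
import asyncio


async def gather_limited(coro_fn, urls, limit=3):
    """Run coro_fn(url) for every URL with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            try:
                return await coro_fn(url)
            except Exception as exc:
                # Return the error so one failing page does not abort the batch.
                return exc

    # Results come back in the same order as `urls`.
    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_crawl(url):
    """Stand-in for dh.a_crawl_and_purify(url); fails for URLs containing 'bad'."""
    await asyncio.sleep(0)
    if "bad" in url:
        raise ValueError(url)
    return f"doc:{url}"


results = asyncio.run(gather_limited(fake_crawl, ["a", "bad", "c"], limit=2))
print(results)
```

With the real objects, the call would look like `asyncio.run(gather_limited(dh.a_crawl_and_purify, [item.url for item in r.items]))`; failed pages then appear as exception objects in the result list and can be filtered out or retried.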
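Acknowledgements
If this project has been helpful to you, please give it a star ✨. If you run into problems or have other requirements, feel free to open an issue. Contributions that help improve the project are very welcome.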