DataHarvest

DataHarvest is a tool for data search 🔍, crawling 🕷, and cleaning 🧽.

In the AI era, data is the foundation of everything. DataHarvest helps you obtain clean, usable data.

Search

| Search engine | Official site | Supported |
| --- | --- | --- |
| tavily | https://docs.tavily.com/ | yes |
| Tiangong (天工) | https://www.tiangong.cn/ | coming soon |

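Data crawling & cleaning

| Site | Content | URL pattern | Crawl | Clean |
| --- | --- | --- | --- | --- |
| Baidu Baike (百度百科) | Entry | baike.baidu.com/item/ | | |
| Baidu Baijiahao (百度百家号) | Article | baijiahao.baidu.com/s/ | | |
| Bilibili (B站) | Article | www.bilibili.com/read/ | | |
| Tencent News (腾讯网) | Article | new.qq.com/rain/a/ | | |
| 360doc Personal Library (360个人图书馆) | Article | www.360doc.com/content/ | | |
| 360 Baike (360百科) | Entry | baike.so.com/doc/ | | |
| Sogou Baike (搜狗百科) | Entry | baike.sogou.com/v/ | | |
| Sohu (搜狐) | Article | www.sohu.com/a/ | | |
| Toutiao (头条) | Article | www.toutiao.com/article/ | | |
| NetEase (网易) | Article | www.163.com/\w+/article/.+ | | |
| WeChat Official Accounts (微信公众号) | Article | weixin.qq.com/s/ | | |
| Mafengwo (马蜂窝) | Article | www.mafengwo.cn/i/ | | |
| Xiaohongshu (小红书) | coming soon | | | |

All other sites fall back to basic Playwright crawling and html2text cleaning, with no site-specific adaptation.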
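
The URL patterns in the table suggest that each URL is matched against a list of site patterns, with a generic fallback when nothing matches. The sketch below illustrates that idea only; `SITE_PATTERNS`, `FALLBACK`, and `pick_handler` are hypothetical names, the labels are invented for the example, and DataHarvest's real dispatch logic may differ. The dots are escaped here so the patterns behave as strict regular expressions.

```python
import re

# Hypothetical registry: (label, URL pattern) pairs based on three table rows.
# The labels are made up for this sketch and are not DataHarvest identifiers.
SITE_PATTERNS = [
    ("baidu-baike", r"baike\.baidu\.com/item/"),
    ("360-baike", r"baike\.so\.com/doc/"),
    ("netease", r"www\.163\.com/\w+/article/.+"),
]

# Label for the generic path (basic Playwright crawl + html2text cleaning).
FALLBACK = "generic"


def pick_handler(url: str) -> str:
    """Return the label of the first pattern found in the URL, else the fallback."""
    for label, pattern in SITE_PATTERNS:
        if re.search(pattern, url):
            return label
    return FALLBACK


print(pick_handler("https://baike.so.com/doc/5579340-5792710.html"))
print(pick_handler("https://example.com/post/1"))
```

Using `re.search` rather than `re.match` lets the patterns omit the scheme (`https://`), which matches how they are written in the table.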

Installation and usage

pip install dataharvest
playwright install
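
After installing, a quick check that the packages are importable can save debugging time later. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the top-level import names `dataharvest` and `playwright` are assumed from the commands above. It does not verify that the browser binaries downloaded by `playwright install` are present.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed from the install commands above.
missing = missing_modules(["dataharvest", "playwright"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All modules found.")
```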

Best practices

Search

from dataharvest.searcher import TavilySearcher

api_key = "xxx"  # or set the TAVILY_API_KEY environment variable

searcher = TavilySearcher(api_key)
result = searcher.search("战国水晶杯")
print(result)
# Example output:
# SearchResult(keyword='战国水晶杯', answer=None, images=None, items=[
#     SearchResultItem(title='战国水晶杯_百度百科', url='https://baike.baidu.com/item/战国水晶杯/7041521', score=0.98661,
#                      description='战国水晶杯为战国晚期水晶器皿,于1990年出土于浙江省杭州市半山镇石塘村,现藏于杭州博物馆。战国水晶杯高15.4厘米、口径7.8厘米、底径5.4厘米,整器略带淡琥珀色,局部可见絮状包裹体;器身为敞口,平唇,斜直壁,圆底,圈足外撇;光素无纹,造型简洁。',
#                      content='')])
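
The snippet above notes that the key can either be passed in or supplied through the `TAVILY_API_KEY` environment variable. How `TavilySearcher` resolves the key internally is not shown in this README, so the following is only a sketch of the usual "explicit argument first, environment second" pattern; `resolve_api_key` is a hypothetical helper, not part of DataHarvest.

```python
import os


def resolve_api_key(explicit=None, env_var="TAVILY_API_KEY"):
    """Prefer an explicitly passed key, otherwise read it from the environment."""
    key = explicit or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key given and {env_var} is not set")
    return key
```

Failing early with a clear message tends to be easier to diagnose than an authentication error returned by the remote API.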

Crawling

from dataharvest.spider import AutoSpider

url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)
print(doc)

Cleaning

from dataharvest.purifier import AutoPurifier
from dataharvest.spider import AutoSpider

url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)
print(doc)
auto_purifier = AutoPurifier()
doc = auto_purifier.purify(doc)
print(doc)
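
To give a feel for what a cleaning step does, here is a deliberately simplified stand-in built on the standard library's `html.parser`. It is not html2text and not `AutoPurifier`; `TextExtractor` and `html_to_text` are hypothetical names. It only keeps visible text and drops the contents of `<script>` and `<style>`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A real purifier would also remove navigation, ads, and other page chrome, which is where site-specific adaptation pays off.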

Putting it together

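import asyncio

from dataharvest.base import DataHarvest
from dataharvest.searcher import TavilySearcher


async def main():
    searcher = TavilySearcher()  # relies on the TAVILY_API_KEY environment variable
    dh = DataHarvest()
    r = searcher.search("战国水晶杯")
    # Crawl and purify every search result concurrently.
    tasks = [dh.a_crawl_and_purify(item.url) for item in r.items]
    return await asyncio.gather(*tasks)


# asyncio.run replaces the deprecated get_event_loop()/run_until_complete pattern.
docs = asyncio.run(main())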
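
The example above starts one crawl per search result and gathers them all at once. With many results it may be worth capping concurrency and keeping one failed page from aborting the whole batch. The sketch below shows one way to do that with `asyncio.Semaphore` and `return_exceptions=True`; `gather_limited` is a hypothetical helper and `fake_crawl` is a stand-in for `dh.a_crawl_and_purify`, so no network access is needed to run it.

```python
import asyncio


async def gather_limited(coro_factories, limit=3):
    """Run coroutine factories with at most `limit` in flight at once.

    Failures come back as exception objects instead of aborting the batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await factory()

    return await asyncio.gather(
        *(run(factory) for factory in coro_factories),
        return_exceptions=True,
    )


async def fake_crawl(url):
    # Stand-in for dh.a_crawl_and_purify(url); simulates latency and one failure.
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return f"doc for {url}"


async def main():
    urls = [
        "https://example.com/a",
        "https://example.com/bad",
        "https://example.com/c",
    ]
    # Bind each URL as a default argument so every lambda keeps its own value.
    factories = [lambda u=u: fake_crawl(u) for u in urls]
    return await gather_limited(factories, limit=2)


results = asyncio.run(main())
for item in results:
    print(type(item).__name__, item)
```

Passing factories rather than already-created coroutines means no coroutine object exists until a semaphore slot is free.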

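Acknowledgements

If this project helps you, please give it a star ✨. If you run into problems or have other needs, feel free to open an issue. Contributions that help improve the project are very welcome.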
