DataHarvest

DataHarvest is a tool for searching, crawling, and cleaning data.

Search

Search engine | Official site | Supported
tavily | https://docs.tavily.com/
Tiangong (天工) | https://www.tiangong.cn/ | coming soon

Data crawling & cleaning

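Site | Content | URL pattern | Crawl | Clean
Baidu Baike (百度百科) | entry | baike.baidu.com/item
Baidu Baijiahao (百度百家号) | article | baijiahao.baidu.com/s
Bilibili (B站) | article | www.bilibili.com/read
Tencent News (腾讯网) | article | new.qq.com/rain/a
360doc (360个人图书馆) | article | www.360doc.com/content
360 Baike (360百科) | entry | baike.so.com/doc
Sogou Baike (搜狗百科) | entry | baike.sogou.com/v
Sohu (搜狐) | article | www.sohu.com/a
Toutiao (头条) | article | www.toutiao.com/article
NetEase (网易) | article | www.163.com/\w+/article/.+
WeChat Official Accounts (微信公众号) | article | weixin.qq.com/s
Mafengwo (马蜂窝) | coming soon
Xiaohongshu (小红书) | coming soon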

For any other site, DataHarvest falls back to basic Playwright crawling and html2text cleaning, with no site-specific adaptation.
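
The URL patterns listed above suggest that each URL is routed to a site-specific crawler and cleaner, with a generic fallback when nothing matches. A minimal sketch of that dispatch idea follows. The `SITE_PATTERNS` mapping, the handler labels, and `pick_handler` are hypothetical names chosen for illustration; they are not DataHarvest's actual internals.

```python
import re

# Hypothetical routing table: URL pattern (taken from the list above) -> handler label.
SITE_PATTERNS = {
    r"baike\.baidu\.com/item": "baidu_baike",
    r"baijiahao\.baidu\.com/s": "baidu_baijiahao",
    r"www\.163\.com/\w+/article/.+": "netease",
    r"weixin\.qq\.com/s": "wechat",
}

FALLBACK = "generic"  # stands in for the Playwright + html2text fallback


def pick_handler(url: str) -> str:
    """Return the label of the first pattern found in the URL, else the fallback."""
    for pattern, label in SITE_PATTERNS.items():
        if re.search(pattern, url):
            return label
    return FALLBACK


print(pick_handler("https://baike.baidu.com/item/战国水晶杯/7041521"))
print(pick_handler("https://www.163.com/dy/article/ABC123.html"))
print(pick_handler("https://example.com/post/1"))
```

Using `re.search` rather than `re.match` lets a pattern match anywhere in the URL, so the scheme and any subdomain prefix do not need to be spelled out in each pattern.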

Installation and usage

pip install dataharvest

Best practices

Search

from dataharvest.searcher.tavily_searcher import TavilySearcher

api_key = "xxx"  # or set the TAVILY_API_KEY environment variable

searcher = TavilySearcher(api_key)
result = searcher.search("战国水晶杯")
print(result)

Example output:

SearchResult(keyword='战国水晶杯', answer=None, images=None, items=[
    SearchResultItem(title='战国水晶杯_百度百科', url='https://baike.baidu.com/item/战国水晶杯/7041521', score=0.98661,
                     description='战国水晶杯为战国晚期水晶器皿,于1990年出土于浙江省杭州市半山镇石塘村,现藏于杭州博物馆。战国水晶杯高15.4厘米、口径7.8厘米、底径5.4厘米,整器略带淡琥珀色,局部可见絮状包裹体;器身为敞口,平唇,斜直壁,圆底,圈足外撇;光素无纹,造型简洁。',
                     content='')])
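
Each item in the result shown above carries a relevance `score`, which is useful for deciding which URLs are worth crawling. The sketch below filters and ranks items by score. The two dataclasses are hypothetical stand-ins that only mirror the field names visible in that result, so the example runs without an API key; `top_urls` and its `min_score` default are likewise assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins mirroring the fields visible in the example result.
@dataclass
class SearchResultItem:
    title: str
    url: str
    score: float
    description: str = ""
    content: str = ""


@dataclass
class SearchResult:
    keyword: str
    answer: Optional[str] = None
    images: Optional[list] = None
    items: List[SearchResultItem] = field(default_factory=list)


def top_urls(result: SearchResult, min_score: float = 0.5) -> List[str]:
    """Return URLs of items scoring at least min_score, highest score first."""
    kept = [item for item in result.items if item.score >= min_score]
    kept.sort(key=lambda item: item.score, reverse=True)
    return [item.url for item in kept]


result = SearchResult(
    keyword="战国水晶杯",
    items=[
        SearchResultItem("low-relevance page", "https://example.com/other", 0.12),
        SearchResultItem("战国水晶杯_百度百科", "https://baike.baidu.com/item/战国水晶杯/7041521", 0.98661),
    ],
)
print(top_urls(result))
```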

Crawling

from dataharvest.spider import AutoSpider

url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)  # fetch the raw page
print(doc)

Cleaning

from dataharvest.purifier import AutoPurifier
from dataharvest.spider import AutoSpider

url = "https://baike.so.com/doc/5579340-5792710.html?src=index#entry_concern"
auto_spider = AutoSpider()
doc = auto_spider.crawl(url)  # fetch the raw page
print(doc)
auto_purifier = AutoPurifier()
doc = auto_purifier.purify(doc)  # clean the crawled document
print(doc)
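
Cleaning reduces crawled HTML to readable text. The sketch below illustrates that idea using only the standard library's `html.parser`. It is a simplified stand-in, not html2text and not DataHarvest's purifier; the `TextExtractor` class and the `html_to_text` helper are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A real cleaner also has to drop navigation, ads, and comment sections, which is presumably where the site-specific purifiers add value over a generic converter.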

Putting it together

import asyncio

from dataharvest.base import DataHarvest
from dataharvest.searcher import TavilySearcher


async def main():
    searcher = TavilySearcher()  # no key passed: relies on TAVILY_API_KEY
    dh = DataHarvest()
    r = searcher.search("战国水晶杯")
    # Crawl and clean every search hit concurrently.
    tasks = [dh.a_crawl_and_purify(item.url) for item in r.items]
    return await asyncio.gather(*tasks)


docs = asyncio.run(main())
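
When a search returns many hits, starting every crawl at once may open more browser pages than the machine handles comfortably. One common way to cap this is an `asyncio.Semaphore`. The sketch below shows the pattern with `fake_crawl_and_purify`, a hypothetical stand-in for an async crawl-and-clean call, so that it runs without network access; `crawl_all` and its `limit` parameter are also assumptions for illustration.

```python
import asyncio


async def fake_crawl_and_purify(url: str) -> str:
    """Hypothetical stand-in for an async crawl-and-clean call."""
    await asyncio.sleep(0.01)  # simulate network and browser latency
    return f"cleaned:{url}"


async def crawl_all(urls, limit: int = 3):
    """Process all URLs with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            return await fake_crawl_and_purify(url)

    # return_exceptions=True puts an exception object in the result list
    # instead of raising, so one failed page does not discard the others.
    return await asyncio.gather(*(worker(u) for u in urls), return_exceptions=True)


urls = [f"https://example.com/{i}" for i in range(5)]
docs = asyncio.run(crawl_all(urls))
print(docs)
```

`asyncio.gather` returns results in the same order as its inputs, so `docs[i]` corresponds to `urls[i]` even though the crawls finish at different times.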
