An efficient information processing program.

Project description

cmip

An efficient information processing library.

Installation

pip install -U cmip

Usage

1. Asynchronous web scraper with dynamic rendering

example:

import asyncio

from cmip.web import web_scraping

urls = [
    "https://baidu.com",
    "https://qq.com",
    # ... more URLs
]

# Scrape all URLs concurrently and write the results to the "output" directory.
asyncio.run(
    web_scraping(
        urls,
        output_path="output",
        max_concurrent_tasks=10,
        save_image=True,
        min_img_size=200,
    )
)

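Parameters:

urls: page URLs, including the scheme (for example https://)
output_path: output path
max_concurrent_tasks: maximum number of tasks run at the same time; adjust it to your machine's resources and network conditions
save_image: whether to save images
min_img_size: images smaller than this value are not downloaded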
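
To make the effect of max_concurrent_tasks concrete, here is a minimal sketch of bounded concurrency using asyncio.Semaphore. It is not cmip's implementation: run_bounded and fake_fetch are hypothetical names introduced for illustration, and fake_fetch sleeps instead of touching the network.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent_tasks=10):
    """Run worker(item) for every item, with at most max_concurrent_tasks in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(item):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_fetch(url):
    # Hypothetical stand-in for a real page fetch; no network access.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


results = asyncio.run(
    run_bounded(["https://baidu.com", "https://qq.com"], fake_fetch, max_concurrent_tasks=1)
)
print(results)
```

A lower limit reduces memory and bandwidth pressure at the cost of throughput, which is why the value is best tuned per machine.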

2. Computing structural similarity between web pages

example:

from cmip.web import html, simhash_array, hamming_distance_array
import requests

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

html_text_baidu = requests.get("https://baidu.com", headers=headers).text
html_text_example_com = requests.get("https://example.com", headers=headers).text
html_text_example_org = requests.get("https://example.org", headers=headers).text
dom_tree_baidu = html.dom_tree(html_text_baidu)
dom_tree_example_com = html.dom_tree(html_text_example_com)
dom_tree_example_org = html.dom_tree(html_text_example_org)
simhash_baidu = simhash_array(dom_tree_baidu)
simhash_example_com = simhash_array(dom_tree_example_com)
simhash_example_org = simhash_array(dom_tree_example_org)

print("Similarity:", hamming_distance_array([simhash_example_com], [simhash_baidu, simhash_example_org]))

In general, when the distance between two pages is within 4, their structures are fairly similar.
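
The idea behind the distance threshold can be sketched with a generic SimHash over structural features. This is an illustrative sketch only, not cmip's algorithm: the functions simhash, hamming_distance, and is_similar, the use of tag paths as features, the 64-bit width, and treating "within 4" as inclusive are all assumptions.

```python
import hashlib


def simhash(features, bits=64):
    """Fold a list of string features into one fingerprint of `bits` bits."""
    weights = [0] * bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(a, b, threshold=4):
    # Assumption: "within 4" is read as distance <= 4.
    return hamming_distance(a, b) <= threshold


# Hypothetical features: tag paths taken from a page's DOM tree.
page_a = ["html", "html/head", "html/body", "html/body/div", "html/body/div/p"]
page_b = ["html", "html/head", "html/body", "html/body/div", "html/body/div/p"]

print(hamming_distance(simhash(page_a), simhash(page_b)))  # identical features -> 0
```

Because similar feature sets flip only a few bits of the fingerprint, a small Hamming distance indicates similar structure.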

Download files

Source Distribution

cmip-0.0.9.tar.gz (24.1 kB)

Uploaded Source

Built Distribution

cmip-0.0.9-py3-none-any.whl (32.7 kB)

Uploaded Python 3

File details

Details for the file cmip-0.0.9.tar.gz.

File metadata

  • Download URL: cmip-0.0.9.tar.gz
  • Upload date:
  • Size: 24.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for cmip-0.0.9.tar.gz
Algorithm Hash digest
SHA256 6aa47f047f3b7260fb11b1fabcf07c4f7564cc27a7aa40268f0138fa7772b583
MD5 d7f81267f094c526fc428cb6c1337f79
BLAKE2b-256 05a978cb280d4e24e843ea9bd9b22844b19eddfd935902d2898490ed04d91fad

File details

Details for the file cmip-0.0.9-py3-none-any.whl.

File metadata

  • Download URL: cmip-0.0.9-py3-none-any.whl
  • Upload date:
  • Size: 32.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for cmip-0.0.9-py3-none-any.whl
Algorithm Hash digest
SHA256 ef865b932fc03edd27121b11db38291f0192519a812d3098728d1038a3082837
MD5 222deb156c52d9d8b9526d0f0e8657df
BLAKE2b-256 cf8d356b62487eb5e7a0bad54736c8ecde96415ee15267065c4d67fdd29b8eb5
