An efficient information processing program.
cmip
An efficient information processing library.
Installation
pip install -U cmip
Usage
1. Asynchronous crawler with dynamic rendering
Example:
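from cmip.web import web_scraping
import asyncio

# Pages to crawl; every URL must include its scheme.
urls = [
    "https://baidu.com",
    "https://qq.com",
    # ... more URLs
]

# Crawl all pages concurrently and write the results under output_path.
asyncio.run(web_scraping(urls, output_path="output", max_concurrent_tasks=10, save_image=True, min_img_size=200))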
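Parameters:
Parameter | Meaning |
---|---|
urls | Web page URLs (including the scheme, e.g. https://) |
output_path | Output path |
max_concurrent_tasks | Maximum number of tasks running at the same time; tune it to your machine's resources and network conditions |
save_image | Whether to save images |
min_img_size | Images smaller than this value are not downloaded |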
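To make the effect of max_concurrent_tasks concrete, here is a minimal sketch of how a concurrency cap is commonly implemented with asyncio.Semaphore. It is not taken from cmip's source; run_bounded and fake_fetch are hypothetical names used only for illustration, and the fake worker sleeps instead of fetching a page.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent_tasks=10):
    """Run worker(item) for every item with at most max_concurrent_tasks in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def guarded(item):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))


# Hypothetical worker that records how many calls overlap.
active = 0
peak = 0


async def fake_fetch(url):
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)  # stand-in for network I/O
    active -= 1
    return url.upper()


demo_urls = [f"https://example.com/{i}" for i in range(20)]
results = asyncio.run(run_bounded(demo_urls, fake_fetch, max_concurrent_tasks=3))
print("pages processed:", len(results), "peak concurrency:", peak)
```

A higher cap finishes sooner when the network is the bottleneck, but it also uses more memory and opens more connections, which is why the value should be tuned per machine.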
2. Computing web page structure similarity
Example:
from cmip.web import html, simhash_array, hamming_distance_array
import requests

# Use a browser-like User-Agent header for the requests.
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

# Download the raw HTML of each page.
html_text_baidu = requests.get("https://baidu.com", headers=headers).text
html_text_example_com = requests.get("https://example.com", headers=headers).text
html_text_example_org = requests.get("https://example.org", headers=headers).text

# Build a DOM tree from each page.
dom_tree_baidu = html.dom_tree(html_text_baidu)
dom_tree_example_com = html.dom_tree(html_text_example_com)
dom_tree_example_org = html.dom_tree(html_text_example_org)

# Compute a SimHash fingerprint for each DOM tree.
simhash_baidu = simhash_array(dom_tree_baidu)
simhash_example_com = simhash_array(dom_tree_example_com)
simhash_example_org = simhash_array(dom_tree_example_org)

# Compare example.com against baidu.com and example.org.
print("Distance:", hamming_distance_array([simhash_example_com], [simhash_baidu, simhash_example_org]))
As a rule of thumb, two pages have a fairly similar structure when the distance between them is within 4.
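The threshold rule can be illustrated with a small, library-independent sketch. It assumes each fingerprint is a plain Python integer, which may differ from what simhash_array actually returns, and it treats "within 4" as inclusive. hamming_distance and is_similar are hypothetical helpers, not part of cmip.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, threshold: int = 4) -> bool:
    """Apply the rule of thumb: a distance of at most `threshold` means similar."""
    return hamming_distance(a, b) <= threshold


# Made-up 8-bit fingerprints, for illustration only.
fp_a = 0b10110110
fp_b = 0b10110011  # differs from fp_a in 2 bits
fp_c = 0b01001001  # differs from fp_a in all 8 bits

print(hamming_distance(fp_a, fp_b), is_similar(fp_a, fp_b))  # 2 True
print(hamming_distance(fp_a, fp_c), is_similar(fp_a, fp_c))  # 8 False
```

Because SimHash maps similar inputs to fingerprints that differ in few bits, a small Hamming distance is a cheap proxy for structural similarity.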