LinkedIn search-result profile extractor: automated scraping, parsing, and export

linkedin-horse

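linkedin-horse automatically extracts profile data from LinkedIn search results. It drives a browser through Selenium to scrape profiles in bulk from LinkedIn search pages, with paginated collection, automatic deduplication, intermediate JSON storage, and a final Excel export.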

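Features

Automated collection: drives Chrome through Selenium, simulating real user actions
Robust parsing: multi-strategy HTML parsing that handles both the old and new LinkedIn page structures
Paginated scraping: accepts start and end page numbers and advances page by page automatically
Automatic deduplication: filters duplicate records by profile_url, which allows incremental collection
Two storage formats: each page is saved to a JSON file as soon as it is scraped, and all pages are merged into an Excel file at the end
Retries: up to 3 retries per page, so that transient network errors do not lose data
Modern CLI: built on Typer and Rich, with colored output, progress bars, and a configuration panel
Cookie guidance: on first run the tool checks for the cookie file and prints detailed export instructions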
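
The deduplication feature can be pictured with a small sketch. The package's actual logic is not shown in this README, so `normalize_url` and `dedupe_profiles` below are hypothetical helpers, and treating URLs that differ only by query string or trailing slash as the same profile is an assumption.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a profile URL to scheme, host, and path (assumed dedup key)."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def dedupe_profiles(new_records, seen_urls):
    """Return records whose profile_url is new; seen_urls is updated in place."""
    unique = []
    for record in new_records:
        url = record.get("profile_url")
        if not url:
            continue  # assumption: a record without a URL cannot be deduplicated
        key = normalize_url(url)
        if key in seen_urls:
            continue
        seen_urls.add(key)
        unique.append(record)
    return unique
```

Because the `seen_urls` set outlives a single call, the same set can be reused across pages or across runs, which is one way to get the incremental behavior described above.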

Installation and environment setup

Installation

pip install linkedin-horse

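Requirements

Python >= 3.9
Chrome browser (driven by Selenium)
ChromeDriver (the version must match your Chrome version)

Cookie setup (read this before first use)

linkedin-horse needs the cookies from your logged-in LinkedIn session to access search results. On first run, the program checks for the cookie file and prompts you to set it up.

Steps:

1. Install the EditThisCookie extension: open Chrome, go to the Chrome Web Store, search for "EditThisCookie", and install it.
2. Log in to LinkedIn: open https://www.linkedin.com in Chrome and log in to your account. Make sure the home feed loads normally.
3. Export the cookies: click the EditThisCookie icon (the cookie shape) in the top-right corner of the browser, then click the "Export" button in the popup.
4. Save the file: create a new text file, paste the clipboard contents, save it as linkedin_cookies.json, and place it in the directory you run the program from.
5. Verify: make sure the file is a valid JSON array (it starts with [ and ends with ]).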
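
The verification step can also be done with a few lines of Python instead of by eye. This is a minimal sketch, not the package's own `check_cookies`; `validate_cookie_file` is a hypothetical helper, and the check that each entry has `name` and `value` keys is an assumption about the exported format.

```python
import json
from pathlib import Path


def validate_cookie_file(path):
    """Return the list of cookies, or raise ValueError with a readable reason."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        raise ValueError(f"cookie file not found: {path}") from None
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from None

    # The top level must be an array, as described in the steps above.
    if not isinstance(data, list):
        raise ValueError("expected a JSON array at the top level")

    # Assumed shape: every entry is an object with at least name and value.
    for index, cookie in enumerate(data):
        if not isinstance(cookie, dict) or "name" not in cookie or "value" not in cookie:
            raise ValueError(f"entry {index} is missing 'name' or 'value'")
    return data
```

Calling `validate_cookie_file("linkedin_cookies.json")` before a long run surfaces a malformed export immediately rather than partway through scraping.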

Usage examples

Basic usage

# Extract data from a LinkedIn search page (pages 1-5)
linkedin-horse extract \
  --base-url "https://www.linkedin.com/search/results/people/?keywords=python%20developer&origin=GLOBAL_SEARCH_HEADER" \
  --search-keyword "python_developer" \
  --start-page 1 \
  --end-page 5

Full parameter example

linkedin-horse extract \
  --base-url "https://www.linkedin.com/search/results/people/?keywords=data%20engineer" \
  --search-keyword "data_engineer" \
  --start-page 1 \
  --end-page 20 \
  --headless \
  --cookies-json ./my_cookies.json \
  --max-retries 5 \
  --retry-delay 10
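
The --start-page and --end-page options imply one request per result page. How linkedin-horse derives each page URL from --base-url is not documented here; the sketch below assumes the page number travels in a `page` query parameter, and `page_url` and `page_urls` are hypothetical helpers rather than part of the package.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its page query parameter set to the given page."""
    parts = urlsplit(base_url)
    # Drop any existing page parameter, keep everything else in order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "page"]
    query.append(("page", str(page)))
    # quote_via=quote keeps spaces encoded as %20 instead of +.
    return urlunsplit(parts._replace(query=urlencode(query, quote_via=quote)))


def page_urls(base_url, start_page, end_page):
    """Return one URL per page, inclusive of both ends."""
    return [page_url(base_url, page) for page in range(start_page, end_page + 1)]
```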

Output structure

A run produces the following file structure:

./
├── data_engineer/                    # Data directory named after search_keyword
│   ├── data_engineer_1.json          # Page 1 data
│   ├── data_engineer_2.json          # Page 2 data
│   └── ...
└── data_engineer.xlsx                # Final merged Excel file
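
The per-page files in this layout can be read back in page order with the standard library alone. The sketch below assumes each `<keyword>_<n>.json` file holds a JSON array of record objects, which this README does not state explicitly; `load_pages` is a hypothetical helper, and the package's own `merge_json_to_excel` remains the supported way to produce the Excel file.

```python
import json
import re
from pathlib import Path


def load_pages(data_dir, keyword):
    """Load every <keyword>_<n>.json file in data_dir, ordered by page number."""
    pattern = re.compile(rf"^{re.escape(keyword)}_(\d+)\.json$")
    numbered = []
    for path in Path(data_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))

    records = []
    # Sort numerically so that page 10 comes after page 2, not after page 1.
    for _, path in sorted(numbered):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records
```

Sorting on the parsed integer rather than the file name matters once a run goes past nine pages, because a plain string sort would place `data_engineer_10.json` before `data_engineer_2.json`.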

Getting help

linkedin-horse --help
linkedin-horse extract --help

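API reference

linkedin-horse has a modular design, and the core modules can be called independently:

extractor module

from linkedin_horse.extractor import extract_profile_data_from_page

# Pass in the page's HTML source (html_source); returns a list of profile dicts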
profiles = extract_profile_data_from_page(html_source)

export module

from linkedin_horse.export import save_page_json, merge_json_to_excel
from pathlib import Path

# Save one page of data as JSON
save_page_json(profiles, Path("my_search"), "my_search", page=1)

# Merge all JSON files into one Excel file
merge_json_to_excel(Path("my_search"), Path("my_search.xlsx"))

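browser module

from linkedin_horse.browser import init_browser, fetch_page_with_retry, close_browser
from pathlib import Path

# Search URL to fetch (the same one used in the full parameter example)
url = "https://www.linkedin.com/search/results/people/?keywords=data%20engineer"

# Start a headless browser with the saved cookies, fetch one page, then close the browser
bot, driver = init_browser(Path("linkedin_cookies.json"), headless=True)
html = fetch_page_with_retry(driver, url, page_num=1)
close_browser(driver)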
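
The internals of `fetch_page_with_retry` are not shown in this README. A generic retry loop of the kind the feature list describes might look like the sketch below; `call_with_retry`, its parameter names, and its default delay are assumptions chosen to mirror the --max-retries and --retry-delay options, not the package's real implementation.

```python
import time


def call_with_retry(fetch, max_retries=3, retry_delay=5.0, sleep=time.sleep):
    """Call fetch() up to max_retries times, pausing retry_delay seconds between tries."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay)
    # Every attempt failed: surface the last underlying error as the cause.
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error
```

Passing `sleep` as a parameter is a small design choice that lets tests run without real delays; in normal use the default `time.sleep` applies.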

cookies module

from linkedin_horse.cookies import check_cookies
from pathlib import Path

# Check the cookie file; if it is missing, print the setup instructions and exit
check_cookies(Path("linkedin_cookies.json"))

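Dependencies

Dependency	Purpose
`typer`	CLI framework
`rich`	Styled terminal output
`beautifulsoup4`	HTML parsing
`pandas`	Data processing and Excel export
`openpyxl`	Excel file engine
`python-dotenv`	Environment variable loading
`linkedin-cat`	LinkedIn browser automation
`llmdog`	LLM call wrapper
`larkfunc`	General-purpose utility library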

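Architecture

linkedin_horse/
├── cli.py          # Typer CLI entry point: argument parsing and workflow orchestration
├── output.py       # Unified Rich output module (theme, colored print functions)
├── cookies.py      # Cookie file check and user instructions
├── extractor.py    # HTML parsing and profile data extraction (core logic)
├── browser.py      # Browser initialization and page fetching (with retries)
└── export.py       # Per-page JSON storage and merged Excel export

Data flow: search URL → fetch HTML page by page → parse and extract → save JSON per page → merge and export to Excel

Each module has a single responsibility and a clear interface, which makes later extension easier (for example, adding new parsing strategies or output formats).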

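Contributing and license

Contributing

1. Fork this repository
2. Create a feature branch (git checkout -b feature/my-feature)
3. Commit your changes (git commit -m 'Add my feature')
4. Push to the branch (git push origin feature/my-feature)
5. Open a Pull Request

License

This project is open source under the MIT License.

Disclaimer

This tool is for learning and research purposes only; you use it at your own risk
Before using this tool, make sure you comply with LinkedIn's terms of service and usage policies
Overly frequent automated access may get your account restricted, so keep the collection rate reasonable
The developers accept no liability for any consequences of using this tool
Respect other people's privacy and use collected data lawfully and in compliance with applicable rules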

Project details

This version: 0.1.0, released Apr 29, 2026

Download files

Source Distribution

linkedin_horse-0.1.0.tar.gz (14.2 kB)

Uploaded Apr 29, 2026 Source

Built Distribution

linkedin_horse-0.1.0-py3-none-any.whl (14.9 kB)

Uploaded Apr 29, 2026 Python 3

File details

Details for the file linkedin_horse-0.1.0.tar.gz.

File metadata

Download URL: linkedin_horse-0.1.0.tar.gz
Upload date: Apr 29, 2026
Size: 14.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for linkedin_horse-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`02aed7d5932884e69e4c189b55375c93cf25d456c17d582047f7642421d48188`
MD5	`e023b05a30e41d0ffdc8f6e150f979c2`
BLAKE2b-256	`a95db6f649fc5cd40493f9dd73ea871c060b29510a5e9783387a0c943e2b008f`

File details

Details for the file linkedin_horse-0.1.0-py3-none-any.whl.

File metadata

Download URL: linkedin_horse-0.1.0-py3-none-any.whl
Upload date: Apr 29, 2026
Size: 14.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for linkedin_horse-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`402cc9b9b5e371302429046ba68ddc0fafb1e7bce8e84f93b3f783f5e1effcd9`
MD5	`33da1fb0d49c8a18eb5cbe17bcd4236b`
BLAKE2b-256	`78be80c35f30cdd529f793532863ad97d210e0540a40c3f126d59546e313600b`
