LinkedIn profile scraping + LLM-based structured resume extraction tool

linkedin-lion

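A tool for scraping LinkedIn Talent profiles and extracting structured resume data with an LLM

A modern Python CLI toolkit built on Typer + Rich that combines LinkedIn Talent profile scraping and LLM-based resume extraction into a single workflow.

Features

lion scrape: scrape a single LinkedIn Talent profile and save its plain text as a .txt file
lion batch: scrape in bulk from a URL list file, with resume support (files that already exist are skipped automatically)
lion extract: call an LLM (via llmdog) to turn .txt resumes into structured JSON, with batch processing support
All commands render their output with Rich, with modern colors and a clear information hierarchy
The core functions can also be imported and used from Python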
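
The resume behaviour of lion batch (skip any profile whose output file already exists) can be sketched as follows. This is a minimal illustration, not the package's actual code: the function name scrape_missing, the fetch_text callback, and the rule for deriving a file name from the URL are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def scrape_missing(
    urls: Iterable[str],
    output_dir: Path,
    fetch_text: Callable[[str], str],
) -> List[str]:
    """Fetch only the URLs whose output file does not exist yet.

    Returns the URLs that were actually fetched in this run.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url in urls:
        # Hypothetical naming rule: use the last path segment of the URL.
        slug = url.rstrip("/").rsplit("/", 1)[-1]
        target = output_dir / f"{slug}.txt"
        if target.exists():
            continue  # saved by an earlier run, so skip it
        target.write_text(fetch_text(url), encoding="utf-8")
        fetched.append(url)
    return fetched
```

Because the check is made per file, an interrupted run can simply be started again with the same arguments and only the missing profiles are fetched.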

Installation and environment setup

1. Install the package

pip install linkedin-lion

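2. Configure the `.env` file

Create a .env file in the project root (or in ~/.linkedin_lion/).

Version A: minimal configuration (rely entirely on llmdog's built-in defaults)

# Only the Selenium cookie is configured (required by lion scrape / lion batch)
LION_COOKIE_FILE=cookie.json

Use this when llmdog already has its API key and model set through its own configuration (for example ~/.llmdog/config.json), so nothing needs to be repeated at the linkedin-lion level.

Version B: full configuration (every LLM parameter set explicitly)

# LLM settings (used by the lion extract command; all optional)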
LION_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx
LION_API_URL=https://api.deepseek.com/v1/chat/completions
LION_MODEL=deepseek-chat
LION_TIMEOUT=60
LION_VERIFY_SSL=true

# Path to the Selenium cookie file (used by lion scrape / lion batch)
LION_COOKIE_FILE=cookie.json

All LLM parameters are optional: fields that are not set are not passed to llmdog, which then falls back to its own built-in defaults.

3. Prepare the cookie file

Export your logged-in LinkedIn Talent session cookies to cookie.json and place the file in the project root, or point to it with the --cookie option.
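
Before starting a long scrape it can help to confirm that the exported file is readable. The sketch below assumes that cookie.json is a JSON array of cookie objects in the shape Selenium's get_cookies() returns (each with at least "name" and "value"); the README does not specify the exact format that browser-dog expects, so treat the check and the helper name load_cookie_list as illustrative only.

```python
import json
from pathlib import Path
from typing import Any, List


def load_cookie_list(path: str) -> List[dict]:
    """Load a cookie export and check it has the assumed Selenium-style shape."""
    raw: Any = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(raw, list):
        raise ValueError("expected a JSON array of cookie objects")
    for cookie in raw:
        # Selenium's add_cookie() needs at least "name" and "value".
        if not isinstance(cookie, dict) or not {"name", "value"}.issubset(cookie):
            raise ValueError(f"malformed cookie entry: {cookie!r}")
    return raw
```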

Usage examples

lion scrape: scrape a single profile

# Basic usage
lion scrape "https://www.linkedin.com/talent/profile/ACoAAA..." --folder ./output

# Specify a cookie file and run in headless mode
lion scrape "https://www.linkedin.com/talent/profile/ACoAAA..." \
    --cookie path/to/cookie.json \
    --folder ./resumes \
    --headless

# Show the full help
lion scrape --help

lion batch: batch scraping

Prepare urls.txt (one LinkedIn URL per line):

https://www.linkedin.com/in/johndoe/
https://www.linkedin.com/in/janedoe/

# Scrape every URL in the file
lion batch urls.txt --output-dir ./resumes

# Process only entries 10 through 50
lion batch urls.txt --output-dir ./resumes --start 9 --end 50

# Use a diskcache URN cache (speeds up building direct Talent profile links)
lion batch urls.txt --output-dir ./resumes --cache-dir /path/to/cache_dir

# Headless mode
lion batch urls.txt --output-dir ./resumes --headless
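
The pairing of "entries 10 through 50" with --start 9 --end 50 matches Python slice semantics, where the start index is zero-based and the end index is exclusive. The following sketch assumes the CLI behaves that way; select_range is a hypothetical helper, not part of the package.

```python
from typing import List, Optional


def select_range(urls: List[str], start: int = 0, end: Optional[int] = None) -> List[str]:
    """Return urls[start:end]: start is zero-based, end is exclusive."""
    return urls[start:end]


# Placeholder URLs numbered 1..100 so the selected entries are easy to read.
urls = [f"https://example.com/profile/{n}" for n in range(1, 101)]
chosen = select_range(urls, start=9, end=50)
print(len(chosen), chosen[0], chosen[-1])  # 41 entries: the 10th through the 50th
```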

lion extract: LLM resume extraction

# Process a single file
lion extract resume.txt --output ./json_output

# Process every .txt file in a directory
lion extract ./resumes --output ./json_output

# Use a custom prompt template
lion extract ./resumes --output ./json_output --prompt-file my_prompt.txt

# Increase the number of retries
lion extract ./resumes --output ./json_output --max-retries 5
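
The README does not describe how --max-retries is applied internally. One common pattern, shown here purely as a sketch, is to repeat a failing call (for example an LLM request whose reply is not valid JSON) until it succeeds or the attempt budget is used up. The helper call_with_retries and its interpretation of max_retries as the total number of attempts are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], max_retries: int = 3, delay: float = 0.0) -> T:
    """Call func until it succeeds or max_retries attempts have failed."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay)  # optional pause between attempts
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```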

Complete workflow example

# Step 1: batch-scrape LinkedIn profiles
lion batch urls.txt --output-dir ./resumes --headless

# Step 2: extract structured JSON with the LLM
lion extract ./resumes --output ./json_output
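
In step 2 the input may be a directory of .txt files or a single file. The sketch below shows one way such an input could be expanded into (source, destination) pairs. The rule that each output is named after its input with a .json extension is an assumption for illustration; the README does not state how output files are named, and plan_outputs is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Tuple


def plan_outputs(input_path: Path, output_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair each input .txt file with the .json path it would be written to."""
    if input_path.is_dir():
        sources = sorted(input_path.glob("*.txt"))  # ignore non-.txt files
    else:
        sources = [input_path]
    return [(src, output_dir / (src.stem + ".json")) for src in sources]
```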

API reference

All core features can be imported as Python functions:

from linkedin_lion import login, scrape_profile, batch_scrape, extract_resume

# --- Single profile ---
driver = login(cookie_file="cookie.json", headless=False)
try:
    text = scrape_profile(
        driver,
        profile_url="https://www.linkedin.com/talent/profile/ACoAAA...",
        filename="resume.txt",
        folder="./output",
    )
finally:
    driver.quit()  # always close the browser, even if scraping fails

# --- Batch scraping ---
batch_scrape(
    filepath="urls.txt",
    output_dir="./resumes",
    headless=True,
    start=0,
    end=100,
    cookie_file="cookie.json",
)

# --- LLM extraction ---
extract_resume(
    input_path="./resumes",        # a directory or a single .txt file
    output_dir="./json_output",
    max_retries=3,
    prompt_template=None,          # None uses the built-in default template
)

Configuration API

Function signatures and parameters

load_config(): load and return a LionConfig instance

from linkedin_lion.config import load_config, get_llm_config

def load_config(
    api_key:     Optional[str]  = None,  # LLM API key; takes precedence over LION_API_KEY
    api_url:     Optional[str]  = None,  # LLM API endpoint; takes precedence over LION_API_URL
    model:       Optional[str]  = None,  # Model name; takes precedence over LION_MODEL
    timeout:     Optional[int]  = None,  # Timeout in seconds; takes precedence over LION_TIMEOUT
    verify_ssl:  Optional[bool] = None,  # SSL verification; takes precedence over LION_VERIFY_SSL
    cookie_file: Optional[str]  = None,  # Cookie path; takes precedence over LION_COOKIE_FILE
) -> LionConfig: ...

Parameter priority (high → low):

Function arguments (passed in code)
    ↓
.env in the current directory  (./.env)
    ↓
User-level global .env  (~/.linkedin_lion/.env)
    ↓
System environment variables  (export LION_API_KEY=...)
    ↓
None  (left to llmdog's built-in defaults)
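
A "first value that is set wins" chain like the one listed above can be sketched in a few lines. This assumes the listed order is what the package implements; the helper names first_set and resolve, and the idea of passing the two .env files in as already-parsed mappings, are hypothetical.

```python
import os
from typing import Mapping, Optional


def first_set(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is not None, or None if all are unset."""
    for value in candidates:
        if value is not None:
            return value
    return None


def resolve(
    key: str,
    argument: Optional[str],
    local_env_file: Mapping[str, str],
    global_env_file: Mapping[str, str],
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Resolve one setting by walking the sources from highest to lowest priority."""
    return first_set(
        argument,                  # value passed in code
        local_env_file.get(key),   # ./.env
        global_env_file.get(key),  # ~/.linkedin_lion/.env
        environ.get(key),          # exported environment variable
    )
```

Returning None at the end of the chain is what lets the caller omit the field entirely and leave the decision to llmdog.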

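get_llm_config(cfg): convert a LionConfig into the keyword-argument dict that llmdog expects

def get_llm_config(cfg: LionConfig) -> dict:
    # Contains only the fields that were explicitly set (not None)
    # Unset fields are not passed, so llmdog uses its own built-in defaults
    ...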
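
The "only pass what was set" rule can be illustrated with a small stand-in. SketchConfig below is a hypothetical dataclass that mirrors the documented fields; the real LionConfig and get_llm_config may be implemented differently.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for LionConfig; the real class may differ."""
    api_key: Optional[str] = None
    api_url: Optional[str] = None
    model: Optional[str] = None
    timeout: Optional[int] = None
    verify_ssl: Optional[bool] = None
    cookie_file: str = "cookie.json"


# cookie_file is deliberately excluded: it is not an LLM setting.
LLM_FIELDS = ("api_key", "api_url", "model", "timeout", "verify_ssl")


def llm_kwargs_from(cfg: SketchConfig) -> Dict[str, Any]:
    """Keep only the LLM fields that were explicitly set (not None)."""
    values = asdict(cfg)
    return {name: values[name] for name in LLM_FIELDS if values[name] is not None}
```

Testing against None rather than truthiness matters here: verify_ssl=False is an explicit choice and must still be passed on.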

Parameter	Type	Default	Description
`api_key`	`str | None`	`None`	API key for the LLM service
`api_url`	`str | None`	`None`	Full endpoint URL of the LLM API
`model`	`str | None`	`None`	Model name, e.g. `deepseek-chat`
`timeout`	`int | None`	`None`	HTTP request timeout (seconds)
`verify_ssl`	`bool | None`	`None`	Whether to verify HTTPS certificates
`cookie_file`	`str`	`"cookie.json"`	Path to the Selenium cookie file

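Version A: minimal configuration (recommended when llmdog is already configured on its own)

When to use: llmdog already has its API key and model set through ~/.llmdog/config.json or its own environment variables, so linkedin-lion only handles scraping and the LLM parameters need not be repeated.

from linkedin_lion.config import load_config, get_llm_config

# Configure only the cookie; all LLM parameters are left to llmdog's built-in defaults
cfg = load_config(cookie_file="path/to/cookie.json")
llm_kwargs = get_llm_config(cfg)
# llm_kwargs == {}  →  equivalent to calling llm_load_config() with no arguments

The .env file needs only one line: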

LION_COOKIE_FILE=cookie.json

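Version B: full configuration (recommended when you need to choose an LLM provider)

When to use: you use a non-default LLM provider (such as DeepSeek, OpenAI, or a local Ollama) and need to set every LLM parameter explicitly.

from linkedin_lion.config import load_config, get_llm_config
from llmdog.config import load_config as llm_load_config

# Full configuration, explicitly overriding every LLM parameter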
cfg = load_config(
    api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    api_url="https://api.deepseek.com/v1/chat/completions",
    model="deepseek-chat",
    timeout=60,
    verify_ssl=True,
    cookie_file="cookie.json",
)

llm_kwargs = get_llm_config(cfg)
# llm_kwargs == {
#     'api_key':    'sk-xxxx...',
#     'api_url':    'https://api.deepseek.com/v1/chat/completions',
#     'model':      'deepseek-chat',
#     'timeout':    60,
#     'verify_ssl': True,
# }

# Pass the result to llmdog
llm_cfg = llm_load_config(**llm_kwargs)

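Version C: partial configuration (recommended when only a few parameters need overriding)

When to use: llmdog's default configuration is mostly fine and you only need to override a few parameters such as the model name or the timeout.

from linkedin_lion.config import load_config, get_llm_config

# Override only the model and the timeout; llmdog's built-in defaults handle the rest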
cfg = load_config(
    model="deepseek-coder",
    timeout=120,
)
llm_kwargs = get_llm_config(cfg)
# llm_kwargs == {'model': 'deepseek-coder', 'timeout': 120}

Best practice: integrating the configuration into your own script

from linkedin_lion.config import load_config, get_llm_config, ensure_api_key
from linkedin_lion import extract_resume
from llmdog.config import load_config as llm_load_config

def run_pipeline(input_dir: str, output_dir: str):
    # 1. Load the configuration (reads .env files and environment variables automatically)
    cfg = load_config()

    # 2. Check whether an API key is configured
    if not ensure_api_key(cfg):
        print("Warning: LION_API_KEY is not set; llmdog's built-in configuration will be used")

    # 3. Build the llmdog configuration
    llm_kwargs = get_llm_config(cfg)
    llm_cfg = llm_load_config(**llm_kwargs)

    # 4. Run the extraction
    extract_resume(
        input_path=input_dir,
        output_dir=output_dir,
        max_retries=3,
    )

run_pipeline("./resumes", "./json_output")

Error handling and common configuration problems

Problem 1: LION_TIMEOUT is set to a non-numeric value

# Incorrect
LION_TIMEOUT=sixty   # ❌ raises ValueError

# Correct
LION_TIMEOUT=60      # ✅ integer string
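
The ValueError comes from converting the string to an integer. A sketch of a conversion that fails with a clearer message is shown below; parse_timeout is a hypothetical helper, and rejecting zero or negative values is an extra assumption not stated in the README.

```python
from typing import Optional


def parse_timeout(raw: Optional[str]) -> Optional[int]:
    """Convert a LION_TIMEOUT string to an int; None or blank means unset."""
    if raw is None or not raw.strip():
        return None
    try:
        seconds = int(raw.strip())
    except ValueError:
        raise ValueError(
            f"LION_TIMEOUT must be an integer number of seconds, got {raw!r}"
        ) from None
    if seconds <= 0:
        raise ValueError(f"LION_TIMEOUT must be positive, got {seconds}")
    return seconds
```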

Problem 2: LION_VERIFY_SSL value is not recognized

# All of the following are read as False
LION_VERIFY_SSL=false
LION_VERIFY_SSL=False
LION_VERIFY_SSL=0
LION_VERIFY_SSL=no

# All of the following are read as True
LION_VERIFY_SSL=true
LION_VERIFY_SSL=True
LION_VERIFY_SSL=1
LION_VERIFY_SSL=yes

# Any other value is treated as None (llmdog decides)
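
The three-way mapping above (True, False, or None) can be sketched as a small parser. The helper name parse_verify_ssl is hypothetical, and lowercasing the input so that spellings such as TRUE are also accepted is an assumption that goes slightly beyond the values listed.

```python
from typing import Optional

_TRUE = {"true", "1", "yes"}
_FALSE = {"false", "0", "no"}


def parse_verify_ssl(raw: Optional[str]) -> Optional[bool]:
    """Map a LION_VERIFY_SSL string to True, False, or None (unset/unrecognized)."""
    if raw is None:
        return None
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return None  # unrecognized: leave the decision to llmdog
```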

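Problem 3: conflicts between multiple .env files

Load order: .env in the current directory → ~/.linkedin_lion/.env
Variables that are already set are not overwritten by a lower-priority file (override=False). Under python-dotenv semantics this also covers variables exported in the shell, so an exported LION_* value is kept even when a .env file defines the same key.
Recommendation: keep project-level settings in ./.env and global defaults in ~/.linkedin_lion/.env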

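Problem 4: arguments passed in code seem to be ignored

# LION_MODEL=gpt-4 is set in the environment, but arguments passed in code take priority
cfg = load_config(model="deepseek-chat")  # model="deepseek-chat" wins; the environment variable is ignored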

Problem 5: the cookie file path cannot be found

import os
from linkedin_lion.config import load_config

cfg = load_config()
if not os.path.exists(cfg.cookie_file):
    raise FileNotFoundError(
        f"Cookie file not found: {cfg.cookie_file}\n"
        "Set the correct path with the LION_COOKIE_FILE environment variable or the --cookie option"
    )

ensure_api_key(): check whether an API key is configured

from linkedin_lion.config import load_config, ensure_api_key

cfg = load_config()
if ensure_api_key(cfg):
    print("API key is configured; the LLM service can be called")
else:
    print("API key is not configured; lion extract will rely on llmdog's built-in configuration")

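Dependencies

Dependency	Purpose
`typer[all]>=0.12`	CLI framework
`rich>=13.0`	Styled terminal output
`selenium`	Browser automation
`browser-dog`	Wrapper for Selenium cookie login
`beautifulsoup4`	HTML text extraction
`llmdog`	LLM calls (chat interface)
`larkfunc`	File I/O and text-processing helpers
`diskcache`	Local cache for LinkedIn URNs
`python-dotenv`	Loading .env configuration files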

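Contributing and license

Contributing

Fork this repository and create a feature branch
Follow the code style in the project's DESIGN_SPEC.md (Typer + Rich, no bare print calls)
Make sure the code passes the basic import test before opening a PR

License

MIT License. See LICENSE for details.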

Release history

0.0.3 (this version): Apr 27, 2026
0.0.2: Apr 26, 2026
0.0.1: Apr 26, 2026

Download files

Source Distribution

linkedin_lion-0.0.3.tar.gz (38.1 kB)

Uploaded Apr 27, 2026 Source

Built Distribution

linkedin_lion-0.0.3-py3-none-any.whl (25.6 kB)

Uploaded Apr 27, 2026 Python 3

File details

Details for the file linkedin_lion-0.0.3.tar.gz.

File metadata

Download URL: linkedin_lion-0.0.3.tar.gz
Upload date: Apr 27, 2026
Size: 38.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for linkedin_lion-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`707ed04a606a2634059bdbf3b877f264836947ca28bbc17a7ee088a02128181f`
MD5	`08a291656cf7754b8d0382d77475f8be`
BLAKE2b-256	`7ac361a0d9a9d08375b80085feea2de10669fe4080113adc0ea1e3ff5d662cf9`

File details

Details for the file linkedin_lion-0.0.3-py3-none-any.whl.

File metadata

Download URL: linkedin_lion-0.0.3-py3-none-any.whl
Upload date: Apr 27, 2026
Size: 25.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.3

File hashes

Hashes for linkedin_lion-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`570514edf783d6d8acca31381472e804fc7936b3812aeb4e1008fd0796828fc6`
MD5	`504ca6d83245d00e3a02e9feefdfc588`
BLAKE2b-256	`47e41c2e93673dd1c58282c5713c01670d7d719b1d77e7550c716e8b51c7aea8`
