PDF metadata and content extraction tool with AI-powered analysis

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

changyy

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language

Project description

pdf-ai-extractor

PyPI

PDF metadata and content extraction tool with AI-powered analysis capabilities.

Features

Extract PDF metadata including abstract, keywords, and bookmarks/TOC
Multiple analysis backends:
- Local extraction (no AI)
- OpenAI-powered analysis
- xAI-powered analysis
- Hugging Face models integration

Installation

pip install pdf-ai-extractor

From source:

git clone https://github.com/changyy/py-pdf-ai-extractor.git
cd py-pdf-ai-extractor
pip install -e .

Usage

Command Line Interface

# Basic usage (local extraction)
pdf-ai-extractor input.pdf

# Using OpenAI backend
pdf-ai-extractor --backend openai --api-key YOUR_API_KEY input.pdf

# Using xAI backend
pdf-ai-extractor --backend xai --api-key YOUR_API_KEY input.pdf

# Using Hugging Face backend
pdf-ai-extractor --backend huggingface --model-name "model/name" input.pdf

# Save output to file
pdf-ai-extractor input.pdf --output result.json

Example

% wget https://github.com/datawhalechina/leedl-tutorial/releases/download/v1.2.2/LeeDL_Tutorial_v.1.2.2.pdf -O /tmp/LeeDL_Tutorial.pdf

% pdf-ai-extractor -b openai -k 'sk-XXXXXXX' /tmp/LeeDL_Tutorial.pdf    
[
  {
    "path": "/tmp/LeeDL_Tutorial.pdf",
    "abstract": "The 'LeeDL Tutorial' authored by Wang Qi, Yang Yiyuan, and Jiang Ji is a comprehensive guide to deep learning, inspired by the popular machine learning course by Professor Li Hongyi from National Taiwan University. This tutorial aims to make deep learning accessible to Chinese-speaking students by simplifying complex theories and providing detailed derivations of formulas. It covers essential topics in deep learning, including foundational concepts, practical methodologies, and advanced techniques, while integrating original content and supplementary materials from previous courses. The tutorial is designed for beginners and those seeking to deepen their understanding of deep learning, making it a recommended resource for students interested in the field. The authors, who are members of the Datawhale organization, have backgrounds in artificial intelligence, reinforcement learning, and computer vision, further enhancing the tutorial's credibility and depth.",
    "keywords": [
      "深度学习",
      "机器学习",
      "李宏毅",
      "教程",
      "人工智能",
      "数据挖掘",
      "优化算法",
      "模型训练",
      "中文教育",
      "开源组织"
    ],
    "bookmarks": [
      "机器学习基础",
      "案例学习",
      "线性模型",
      "分段线性曲线",
      "模型变形",
      "机器学习框架",
      "实践方法论",
      "模型偏差",
      "优化问题",
      "过拟合",
      "交叉验证",
      "不匹配",
      "深度学习基础",
      "局部极小值与鞍点",
      "临界点及其种类",
      "判断临界值种类的方法",
      "批量和动量",
      "批量大小对梯度下降法的影响",
      "动量法",
      "自适应学习率",
      "AdaGrad",
      "RMSProp",
      "Adam",
      "学习率调度",
      "优化总结",
      "分类",
      "分类与回归的关系",
      "带有 softmax 的分类",
      "分类损失",
      "批量归一化",
      "考虑深度学习",
      "测试时的批量归一化",
      "内部协变量偏移",
      "卷积神经网络",
      "观察1：检测模式不需要整张图像",
      "简化1：感受野",
      "观察2：同样的模式可能会出现在图像的不同区域",
      "简化2：共享参数",
      "简化1和2的总结",
      "观察3：下采样不影响模式检测",
      "简化3：汇聚",
      "卷积神经网络的应用：下围棋",
      "循环神经网络",
      "独热编码",
      "什么是RNN？",
      "RNN架构",
      "其他RNN",
      "Elman 网络 &Jordan 网络",
      "双向循环神经网络",
      "长短期记忆网络",
      "LSTM举例",
      "LSTM运算示例",
      "LSTM原理",
      "RNN学习方式",
      "如何解决RNN梯度消失或者爆炸",
      "RNN其他应用",
      "多对一序列",
      "多对多序列",
      "序列到序列",
      "自注意力机制",
      "输入是向量序列的情况",
      "类型1：输入与输出数量相同",
      "类型2：输入是一个序列，输出是一个标签",
      "类型3：序列到序列",
      "自注意力的运作原理",
      "多头注意力",
      "位置编码",
      "截断自注意力",
      "自注意力与卷积神经网络对比",
      "自注意力与循环神经网络对比",
      "Transformer",
      "序列到序列模型",
      "语音识别、机器翻译与语音翻译",
      "语音合成",
      "聊天机器人",
      "问答任务",
      "句法分析",
      "多标签分类",
      "Transformer结构",
      "Transformer编码器",
      "Transformer解码器",
      "自回归解码器",
      "非自回归解码器",
      "编码器-解码器注意⼒",
      "Transformer的训练过程",
      "序列到序列模型训练常用技巧",
      "复制机制",
      "引导注意力",
      "束搜索",
      "加入噪声",
      "使用强化学习训练",
      "计划采样",
      "生成模型",
      "生成对抗网络",
      "生成器",
      "辨别器",
      "生成器与辨别器的训练过程",
      "GAN的应用案例",
      "GAN的理论介绍",
      "WGAN算法",
      "训练GAN的难点与技巧",
      "GAN的性能评估方法",
      "条件型生成",
      "Cycle GAN",
      "扩散模型",
      "自监督学习",
      "来⾃Transformers的双向编码器表⽰（BERT）",
      "BERT的使用方式",
      "BERT有用的原因",
      "BERT的变种",
      "⽣成式预训练（GPT）",
      "自编码器",
      "自编码器的概念",
      "为什么需要自编码器？",
      "去噪自编码器",
      "自编码器应用之特征解耦",
      "自编码器应用之离散隐表征",
      "自编码器的其他应用",
      "对抗攻击",
      "对抗攻击简介",
      "如何进行网络攻击",
      "快速梯度符号法",
      "白盒攻击与黑盒攻击",
      "其他模态数据被攻击案例",
      "现实世界中的攻击",
      "防御方式中的被动防御",
      "防御方式中的主动防御",
      "迁移学习",
      "领域偏移",
      "领域自适应",
      "领域泛化",
      "强化学习",
      "强化学习应用",
      "玩电子游戏",
      "下围棋",
      "强化学习框架",
      "第1步：未知函数",
      "第2步：定义损失",
      "第3步：优化",
      "评价动作的标准",
      "使用即时奖励作为评价标准",
      "使用累积奖励作为评价标准",
      "使用折扣累积奖励作为评价标准",
      "使用折扣累积奖励减去基线作为评价标准",
      "Actor-Critic",
      "优势 Actor-Critic",
      "元学习",
      "元学习的概念",
      "元学习的三个步骤",
      "元学习与机器学习",
      "元学习的实例算法",
      "元学习的应用",
      "终身学习",
      "灾难性遗忘",
      "终身学习评估方法",
      "终身学习的主要解法",
      "网络压缩",
      "网络剪枝",
      "知识蒸馏",
      "参数量化",
      "网络架构设计",
      "动态计算",
      "可解释性人工智能",
      "可解释性人工智能的重要性",
      "决策树模型的可解释性",
      "可解释性机器学习的目标",
      "可解释性机器学习中的局部解释",
      "可解释性机器学习中的全局解释",
      "扩展与小结",
      "ChatGPT",
      "ChatGPT简介和功能",
      "对于ChatGPT的误解",
      "ChatGPT背后的关键技术——预训练",
      "ChatGPT带来的研究问题",
      "术语"
    ],
    "error": null
  }
]

% pdf-ai-extractor -b xai -k 'xai-XXXXX' /tmp/LeeDL_Tutorial.pdf 
[
  {
    "path": "/tmp/LeeDL_Tutorial.pdf",
    "abstract": "本教程基于李宏毅教授的《机器学习》（2021年春）课程，旨在为深度学习初学者提供一个轻松入门的中文学习资源。教程内容全面，涵盖了深度学习的基本理论和实践方法，通过幽默风趣的讲解和大量动漫相关的例子，使深奥的理论变得易于理解。教程不仅选取了课程的精华内容，还对公式进行了详细推导，对难点进行了重点讲解，并补充了其他深度学习相关知识。",
    "keywords": [
      "深度学习",
      "机器学习",
      "李宏毅",
      "教程",
      "入门",
      "中文",
      "强化学习",
      "计算机视觉",
      "时间序列",
      "数据挖掘"
    ],
    "bookmarks": [
      "机器学习基础",
      "案例学习",
      "线性模型",
      "分段线性曲线",
      "模型变形",
      "机器学习框架",
      ...
      "术语"
    ],
    "error": null
  }
]


% pdf-ai-extractor -b huggingface /tmp/LeeDL_Tutorial.pdf
[
  {
    "path": "/tmp/LeeDL_Tutorial.pdf",
    "abstract": "LeeDL Tutorialhttps://github.com/datawhalechina/leedl-tutorial.html is a free, open-source version of the popular LeeDL tutorial. The tutorial is broken up into two parts: 1.2.2 and 2.3.",
    "keywords": [],
    "bookmarks": [
      "机器学习基础",
      "案例学习",
      ...
      "术语"
    ],
    "error": null
  }
]

Python API

from pdf_ai_extractor.handlers import create_handler

# Local analysis
handler = create_handler("local")
result = handler.result("input.pdf")
print(result)

# Using OpenAI
handler = create_handler(
    "openai",
    config={"api_key": "YOUR_API_KEY"}
)
result = handler.result("input.pdf")
print(result)

# Using xAI
handler = create_handler(
    "xai",
    config={"api_key": "YOUR_API_KEY"}
)
result = handler.result("input.pdf")
print(result)

# Using HuggingFace - facebook/bart-large-cnn
handler = create_handler(
    "huggingface",
    config={"model_name": "facebook/bart-large-cnn"}
)
result = handler.result("input.pdf")
print(result)

# Output format
# {
#     "abstract": "Document abstract extracted from the PDF",
#     "keywords": ["keyword1", "keyword2", ...],
#     "bookmarks": ["Chapter 1", "Section 1.1", ...]
# }

You can also set additional parameters:

# Advanced OpenAI settings
handler = create_handler(
    "openai",
    config={
        "api_key": "YOUR_API_KEY",
        "model": "gpt-4",
        "max_tokens": 500
    }
)

# Advanced xAI settings
handler = create_handler(
    "xai",
    config={
        "api_key": "YOUR_API_KEY",
        "model": "grok-beta",
        "max_content_length": 4000
    }
)

# Advanced HuggingFace settings
handler = create_handler(
    "huggingface",
    config={
        "model_name": "facebook/bart-large-cnn",
        "device": "cuda",  # 或 "cpu", "mps"
        "max_content_length": 1024,
        "min_length": 50,
        "max_length": 150
    }
)

Development

Setup for development:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

Configuration

The tool can be configured using environment variables or a config file:

# Environment variables
export OPENAI_API_KEY="your-key"
export XAI_API_KEY="your-key"

Or create a ~/.pdf-ai-extractor.yaml file:

openai:
  api_key: "your-key"
xai:
  api_key: "your-key"
huggingface:
  model_name: "default/model"

Output Format

The tool outputs JSON in the following format:

{
  "abstract": "Document abstract extracted from the PDF",
  "keywords": ["keyword1", "keyword2", "keyword3"],
  "bookmarks": [
    "Chapter 1",
    "Section 1.1"
  ]
}

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project details

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

changyy

These details have not been verified by PyPI

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language

Release history Release notifications | RSS feed

This version

1.0.1

Jun 4, 2025

1.0.0

Nov 27, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf_ai_extractor-1.0.1.tar.gz (19.8 kB view details)

Uploaded Jun 4, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pdf_ai_extractor-1.0.1-py3-none-any.whl (16.3 kB view details)

Uploaded Jun 4, 2025 Python 3

File details

Details for the file pdf_ai_extractor-1.0.1.tar.gz.

File metadata

Download URL: pdf_ai_extractor-1.0.1.tar.gz
Upload date: Jun 4, 2025
Size: 19.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pdf_ai_extractor-1.0.1.tar.gz
Algorithm	Hash digest
SHA256	`2aefec0b44e9e6dbd13cef2270e20d10c439ca35ee50396e1877c443e6a2db16`
MD5	`98ff33232c9426ad95f18a41a671fce0`
BLAKE2b-256	`e1c66ec63e3d34be530a5863ebd6780cb7348a4644c65b6417d0acdc68182e79`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pdf_ai_extractor-1.0.1.tar.gz:

Publisher: python-publish.yml on changyy/py-pdf-ai-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf_ai_extractor-1.0.1.tar.gz
- Subject digest: 2aefec0b44e9e6dbd13cef2270e20d10c439ca35ee50396e1877c443e6a2db16
- Sigstore transparency entry: 229894163
- Sigstore integration time: Jun 4, 2025
Source repository:
- Permalink: changyy/py-pdf-ai-extractor@05e30278b9d40642df8e3055d8313a266c684a92
- Branch / Tag: refs/tags/1.0.1
- Owner: https://github.com/changyy
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@05e30278b9d40642df8e3055d8313a266c684a92
- Trigger Event: release

File details

Details for the file pdf_ai_extractor-1.0.1-py3-none-any.whl.

File metadata

Download URL: pdf_ai_extractor-1.0.1-py3-none-any.whl
Upload date: Jun 4, 2025
Size: 16.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pdf_ai_extractor-1.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`46a63b2256ee25d81d1c55d7d416774d99dd7d42ff4e110d08480fdf124915d7`
MD5	`5c4ebb798f6fcb0ed6fec11e8f301e49`
BLAKE2b-256	`d8da4ef7c8aebb9747d6ee76d96e8a83e56d97f57f58c45d8e4420f4379ceb35`

See more details on using hashes here.

Provenance

The following attestation bundles were made for pdf_ai_extractor-1.0.1-py3-none-any.whl:

Publisher: python-publish.yml on changyy/py-pdf-ai-extractor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf_ai_extractor-1.0.1-py3-none-any.whl
- Subject digest: 46a63b2256ee25d81d1c55d7d416774d99dd7d42ff4e110d08480fdf124915d7
- Sigstore transparency entry: 229894168
- Sigstore integration time: Jun 4, 2025
Source repository:
- Permalink: changyy/py-pdf-ai-extractor@05e30278b9d40642df8e3055d8313a266c684a92
- Branch / Tag: refs/tags/1.0.1
- Owner: https://github.com/changyy
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@05e30278b9d40642df8e3055d8313a266c684a92
- Trigger Event: release

pdf-ai-extractor 1.0.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

pdf-ai-extractor

Features

Installation

Usage

Command Line Interface

Example

Python API

Development

Configuration

Output Format

License

Contributing

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance