A Python script that parses Chinese invention patents and extracts their fields, sections, and subsections.

ChinesePatentParser

A Python script that parses Chinese invention patents and extracts their named fields, sections, and subsections. The parsed result can then be used for various NLP tasks.

Chinese invention patents typically follow a fixed template with named fields, sections, and subsections, like below:

[Image: example Chinese invention patent]

The parser uses regular expressions to extract the named fields, sections, and subsections.
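To illustrate the idea, here is a minimal sketch of label-based field extraction with regular expressions. It is not the parser's actual code: the label list, the optional "(NN)" numeric prefix, and the function name `extract_fields` are all assumptions made for this example.

```python
import re

# Hypothetical label list; the real parser's labels and patterns may differ.
FIELD_LABELS = ["申请号", "申请日", "申请人", "发明人"]

def extract_fields(text, labels=FIELD_LABELS):
    """Split text into {label: value} using the labels as delimiters."""
    # Try longer labels first so a short label never shadows a longer one.
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(label) for label in ordered)
    # Assumption: each label may be preceded by a numeric code such as "(21)".
    pattern = re.compile(r"(?:\(\d+\))?\s*(" + alternation + r")")

    matches = list(pattern.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # A field's value runs from the end of its label to the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[match.group(1)] = text[match.end():end].strip()
    return fields

sample = (
    "(21)申请号 201110207897.1\n"
    "(22)申请日 2011.07.22\n"
    "(71)申请人 阿里巴巴集团控股有限公司\n"
    "(72)发明人 孙一鸣 强琦"
)
fields = extract_fields(sample)
print(fields["申请号"])
```

Because a value runs until the next label, any stray text that the PDF extraction places between two labels ends up inside the preceding field. This sketch also does not handle a label that appears inside a value.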

Dependencies

The script uses pdfplumber to extract text from the input PDF.
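For reference, text extraction with pdfplumber can look roughly like the sketch below. The helper name `extract_pdf_text` and its `open_pdf` parameter are hypothetical and not part of ChinesePatentParser; only `pdfplumber.open`, `pdf.pages`, and `page.extract_text()` are pdfplumber's own API.

```python
def extract_pdf_text(pdf_path, open_pdf=None):
    """Return the text of every page in the PDF, joined with newlines.

    open_pdf is a hypothetical hook for swapping in another opener
    (for example in tests); by default pdfplumber.open is used.
    """
    if open_pdf is None:
        # Imported lazily so the module loads even without pdfplumber installed.
        import pdfplumber
        open_pdf = pdfplumber.open

    with open_pdf(pdf_path) as pdf:
        # extract_text() can yield None or "" for pages with no text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

Calling `extract_pdf_text('./example/Alibaba.pdf')` would return the raw text that a regex-based parser then has to split into fields and sections. Scanned patents without a text layer would yield empty strings and need OCR first.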

Installation

Run the following on the command line:

pip install ChinesePatentParser

How to Use

To run the script from the command line and save the result as JSON:

python -m ChinesePatentParser.patent_parser ./example/Alibaba.pdf > ./example/Alibaba.json

To use the parser in your own script:

from ChinesePatentParser import patent_parser  # Absolute import

pdf_path = './example/Alibaba.pdf'

# Parse the PDF, then serialize the extracted fields and sections to JSON
parser = patent_parser.PatentParser()
data = parser.parse_pdf_file(pdf_path)
data_json = data.to_json()

print(f"\n{data_json}")

Example Result

The parser extracts all named fields, sections, and subsections and outputs them as JSON, like below (entries shown as ... are elided here for brevity):

{
    "申请公布号": "CN 102890692 A",
    "申请公布日": "2013.01.23 A 296098201 NC (19)中华人民共和国国家知识产权局 *CN102890692A* (12)发明专利申请",
    "申请号": "201110207897.1",
    "申请日": "2011.07.22",
    "申请人": "阿里巴巴集团控股有限公司\n地址 英属开曼群岛大开曼资本大厦一座四\n层847号邮箱",
    "发明人": "孙一鸣 强琦 蔡波洋 金晓军\n吴宗远",
    "代理机构": "北京润泽恒知识产权代理有\n限公司 11319\n代理人 苏培华",
    "国际分类号": "l.\nG06F 17/30(2006.01)\n权利权要利求要书求 书2 页2页 说 说明明书书 121 2页页 附附图图 77 页页",
    "发明名称": "一种网页信息抽取方法及抽取系统",
    "摘要": "本申请提供了一种网页信息抽取方法及抽取系统,...,可以实现大批量网页高度自动化的信息抽取。",
    "权利要求书": [
      "1.一种网页信息抽取方法,其特征在于,包括:\n通过界面交互方式配置网页信息抽取任务,并存入数据库;\n监控数据库,当发现数据库中存入新的网页信息抽取任务后,将所述新的网页信息抽\n取任务发送给调度器;\n调度器解析网页信息抽取任务,并依据解析结果自动执行所述网页信息抽取任务。",
      "2. 根据权利要求 1 所述的方法,其特征在于,..., 对所述点击行为或抽取行为进行细化配置。",
      ...,
      ...,
      ...,
      "11.根据权利要求10所述的系统,其特征在于,...,则依据点击行为的配置调度渲染引擎进行渲染。\n33"
    ],
    "技术领域": [
      "[0001] 本申请涉及网页处理技术,特别是涉及一种网页信息抽取方法及抽取系统。"
    ],
    "背景技术": [
      "[0002] 网页信息抽取就是获取网页的数据,...,另一种就是利用机器学习方法进行抽取。",
      ...,
      ...,
      ...,
      "[0007] 因此,目前还没有一种真正简单、...网应用进行网页信息的自动抽取。"
    ],
    "发明内容": [
      "[0008] 本申请提供了一种网页信息抽取方法及抽取系统,...技术门槛较高的问题。",
      "[0009] 为了解决上述问题,本申请公开了一种网页信息抽取方法,包括:",
      ...,
      ...,
      ...,
      "[0046] 当然,实施本申请的任一产品不一定需要同时达到以上所述的所有优点。"
    ],
    "附图说明": [
      "[0047] 图1是本申请实施例所述一种网页信息抽取方法的流程图;",
      "[0048] 图2是本申请实施例中页面节点的示意图;",
      ...,
      ...,
      ...,
      "[0055] 图9是本申请实施例所述一种网页信息抽取系统的结构图。"
    ],
    "具体实施方式": [
      "[0056] 为使本申请的上述目的、特征和优点...进一步详细的说明。",
      "[0057] 本申请提供了一种网页信息抽取方法及系统,...,可实现针对互联网站点的信息抽取。",
      ...,
      ...,
      ...,
      "[0244] 以上对本申请所提供的一种网页信息抽取方法及抽取系统,..., 本申请的限制。\n1155"
    ]
}

See the JSON file in the example folder for the complete extraction result of this example.
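The example output still carries some PDF layout residue: paragraph numbers such as "[0001]", line breaks in the middle of sentences, and trailing page numbers such as "\n33". Before feeding the sections to an NLP pipeline, a post-processing pass along the following lines may help. This is a sketch under those assumptions, not part of the package; `clean_paragraph` and `clean_sections` are hypothetical names.

```python
import json
import re

TRAILING_PAGE_NO = re.compile(r"\n\d+$")    # e.g. "\n33" at the end of a paragraph
PARAGRAPH_NO = re.compile(r"^\[\d{4}\]\s*")  # e.g. "[0001] " at the start

def clean_paragraph(paragraph):
    """Strip page-number and paragraph-number residue and rejoin wrapped lines."""
    paragraph = TRAILING_PAGE_NO.sub("", paragraph)
    paragraph = PARAGRAPH_NO.sub("", paragraph)
    # PDF line breaks fall mid-sentence, so join the lines back together.
    return paragraph.replace("\n", "")

def clean_sections(data):
    """Clean every list-valued section; leave scalar fields untouched."""
    return {
        key: [clean_paragraph(p) for p in value] if isinstance(value, list) else value
        for key, value in data.items()
    }

# Small stand-in for the parser's JSON output.
sample_json = (
    '{"申请号": "201110207897.1", '
    '"技术领域": ["[0001] 本申请涉及网页处理技术,\\n特别是涉及一种网页信息抽取方法。\\n33"]}'
)
cleaned = clean_sections(json.loads(sample_json))
print(cleaned["技术领域"][0])
```

Note that the trailing-number rule would also drop a paragraph's last line if that line legitimately consisted only of digits, so it is worth checking against your own data.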

Acknowledgement

Thanks to the authors of all the dependency libraries, and to the Python and open source communities.

Version

1.0
