Skip to main content

修复llm_enhancement参数支持,优化代码结构

Project description

DOC-JSON-SDK (PYTHON)

什么是DOC-JSON

doc-json-model 简要描述

DOC-JSON-SDK功能特点

  • 提供DocMind文档结构化输出的doc-json结果反序列化对象,以及辅助功能函数SDK

使用场景

使用场景: DocMind 文档智能解析调用

阿里云官网 文档智能解析调用

集成方式

  • 源码安装
#poetry 准备环境
poetry install
#使用虚拟环境
poetry shell
# 构建
poetry build
twine check $pkg_path
# 上传
twine upload -r aliyun-pypi pkg_path --verbose
  • python 3.7以上 环境

集团环境

pip3 install -i http://yum.tbsite.net/aliyun-pypi/simple/ --extra-index-url https://mirrors.aliyun.com/pypi/simple/   --trusted-host=yum.tbsite.net  doc_json_sdk

云上环境

pip install https://docmind-api-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/sdk/doc_json_sdk-1.0.8-py3-none-any.whl

release:

  • 1.0.8 : 当前版本,修复llm_enhancement参数支持,优化代码结构,兼容Python 3.12和macOS环境,修复字体文件包含问题
  • 1.0.4 : 已废除,不兼容Python 3.12
  • 1.0.3 : 已废除,请勿使用
  • 1.0.0 : 正式版本
  • 0.1.9.0: 新调用和接口方式
  • 0.1.8.0:修复
  • 设置DocMind文档智能解析环境变量
export ALIBABA_CLOUD_ACCESS_KEY_ID=<access_key_id>
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=<access_key_secret>
#调用服务

功能方法示例

1、获得json数据:

2、json加载/公有云服务调用

加载对象可以是:

  • doc-json 字符串对象
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
def test_local_json_document():
    file_path = "gongshi.json"
    loader = DocumentModelLoader()
    document = loader.load(doc_json_fp=open(file_path,"r"))
  • 公有云环境调用(配置ALIBABA_CLOUD_ACCESS_KEY_ID,ALIBABA_CLOUD_ACCESS_KEY_SECRET)
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler, DocumentDigitalExtractHandler
from doc_json_sdk.handler.document_parser_handler import DocumentParserHandler, DocumentParserWithCallbackHandler
def test_document_hander():
    file_path = "gongshi.png"
    file_url = None
    # DocumentExtractHandler:文档智能解析,DocumentDigitalExtractHandler:文档电子解析
    loader = DocumentModelLoader(handler=DocumentExtractHandler())
    document = loader.load(file_path=file_path,file_url=file_url)
  • 公式参数调用/markdown输出/json保存
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler
def test_render_formula_markdown():
    file_path = "gongshi.png"
    file_url = None
    handler = DocumentExtractHandler()
    loader = DocumentModelLoader(handler=handler)
    document = loader.load(file_path=file_path,file_url=file_url,
                           formula_enhancement=True,
                           markdown_result=True,
                           save_json_path="/Users/sanchuan/Downloads/docmind.json")
  • 私有化服务调用(配置PRIVATE_DOCMIND_HOST或显式传入)
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_private_handler import  PrivateDocumentExtractHandler,PrivateDigitalDocumentExtractHandler
def test_private_document_hander():
    file_path = "gongshi.png"
    file_url = None
    loader = DocumentModelLoader(handler=PrivateDocumentExtractHandler(host="127.0.0.1:7001"))
    document = loader.load(file_path=file_path,file_url=file_url)

3、功能函数

3.1 对DocumentModel使用处理为markdown

使用内置函数处理为markdown

from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler, DocumentDigitalExtractHandler
from doc_json_sdk.handler.document_parser_handler import DocumentParserHandler, DocumentParserWithCallbackHandler
from doc_json_sdk.render.document_model_render import DocumentModelRender
def test_render_markdown():
    file_path = "gongshi.png"
    file_url = None
    loader = DocumentModelLoader(handler=DocumentExtractHandler())
    document = loader.load(file_path=file_path,file_url=file_url,markdown_result=True)
    render = DocumentModelRender(document_model=document)
    with open("/Users/sanchuan/Downloads/docmind.md","w") as f:
        f.write(render.render_markdown_result())

可视化查看处理效果

from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler, DocumentDigitalExtractHandler
from doc_json_sdk.handler.document_parser_handler import DocumentParserHandler, DocumentParserWithCallbackHandler
from doc_json_sdk.render.document_model_render import DocumentModelRender
def test_document_hander():
    file_path = "gongshi.png"
    file_url = None
    loader = DocumentModelLoader(handler=DocumentExtractHandler())
    document = loader.load(file_path=file_path,file_url=file_url)
    render = DocumentModelRender(document_model=document)
    render.render_image_result("/Users/sanchuan/Downloads")

3.2 对Layout版面块使用

LayoutModel 对象分为内容信息(来源电子解析/OCR)、版面类型信息(来源OCR/NLP)、逻辑关系信息(来源NLP)

doc-json-layout-model 简要描述

from doc_json_sdk.model.enums.layout_type_enum import LayoutTypeEnum

for layout in document:
    type_enum = layout.get_layout_type_enum()
    if (type_enum == LayoutTypeEnum.Elements.FOOTER or
            type_enum == LayoutTypeEnum.Elements.HEADER or
            type_enum == LayoutTypeEnum.Elements.NOTE):
        #  header and footer notes
        pass
    elif type_enum == LayoutTypeEnum.Elements.IMAGE:
        # image with head_line or split_line
        if layout.type.find("_line")!=-1:
            continue
    elif type_enum == LayoutTypeEnum.Elements.TABLE:
        #table
        pass
    else:
        # paragraph or note(table or figure)
        pass

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

doc_json_sdk-1.0.8.tar.gz (62.1 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

doc_json_sdk-1.0.8-py3-none-any.whl (62.2 MB view details)

Uploaded Python 3

File details

Details for the file doc_json_sdk-1.0.8.tar.gz.

File metadata

  • Download URL: doc_json_sdk-1.0.8.tar.gz
  • Upload date:
  • Size: 62.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.16 Darwin/24.3.0

File hashes

Hashes for doc_json_sdk-1.0.8.tar.gz
Algorithm Hash digest
SHA256 1ddc9c15226e0c0b45d71f9926127204a0662606e14ad4b337791633c0ea6134
MD5 2f959ccc12f715d23f888cc0f1fcaf5f
BLAKE2b-256 d69e789c0cafab28ac27269902df4d278cddd1ab01be4cbc1b5a70e255f8f195

See more details on using hashes here.

File details

Details for the file doc_json_sdk-1.0.8-py3-none-any.whl.

File metadata

  • Download URL: doc_json_sdk-1.0.8-py3-none-any.whl
  • Upload date:
  • Size: 62.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.1 CPython/3.10.16 Darwin/24.3.0

File hashes

Hashes for doc_json_sdk-1.0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 a797a946686a317c9c6d6a5aa34c11d2380eefca52ee2676efa90d31377bea33
MD5 083e564234792ddd665458320d48127a
BLAKE2b-256 364c3f7483ec8fdab2e378caac08f0c7711f5da74e4ac55705bc304f98fa4741

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page