
Fixes llm_enhancement parameter support and streamlines the code structure


DOC-JSON-SDK (PYTHON)

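What is DOC-JSON

doc-json-model: brief description

DOC-JSON-SDK features

  • Provides deserialization objects for the doc-json results produced by DocMind structured document output, together with an SDK of helper functions

Use cases

Use case: calling DocMind intelligent document parsing

Alibaba Cloud official site: calling intelligent document parsing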

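Integration

  • Install from source
# Prepare the environment with poetry
poetry install
# Enter the virtual environment
poetry shell
# Build
poetry build
twine check $pkg_path
# Upload
twine upload -r aliyun-pypi $pkg_path --verbose
  • Python 3.7 or later

Internal (Alibaba Group) environment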

pip3 install -i http://yum.tbsite.net/aliyun-pypi/simple/ --extra-index-url https://mirrors.aliyun.com/pypi/simple/   --trusted-host=yum.tbsite.net  doc_json_sdk

Cloud environment

pip install https://docmind-api-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/sdk/doc_json_sdk-1.0.5-py3-none-any.whl
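
After installing, it can help to confirm which release is actually present, since the release notes list some versions as deprecated. The following is a minimal sketch that relies only on the standard library's `importlib.metadata` (available from Python 3.8, so it does not cover 3.7). The helper name `installed_version` is an illustrative assumption and is not part of the SDK, and the distribution name `doc_json_sdk` is taken from the install command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "doc_json_sdk") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(version if version else "doc_json_sdk is not installed")
```

Returning `None` instead of raising keeps the check easy to use in a startup script that should report a missing package in its own way.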

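release:

  • 1.0.5: current version; fixes llm_enhancement parameter support, streamlines the code structure, and is compatible with Python 3.12 and macOS
  • 1.0.4: deprecated; incompatible with Python 3.12
  • 1.0.3: deprecated; do not use
  • 1.0.0: official release
  • 0.1.9.0: new invocation and interface style
  • 0.1.8.0: bug fixes
  • Set the DocMind intelligent document parsing environment variables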
export ALIBABA_CLOUD_ACCESS_KEY_ID=<access_key_id>
export ALIBABA_CLOUD_ACCESS_KEY_SECRET=<access_key_secret>
# Call the service
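
Because the public cloud handlers depend on these two variables, a quick check before making a call can turn an authentication failure into a clearer message. This is a sketch using only the standard library; the helper name `missing_credentials` is hypothetical and not part of the SDK, and how the SDK itself reacts to missing variables is not covered here.

```python
import os
from typing import List, Mapping

# Variable names taken from the export lines above.
REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Credentials are configured")
```

Passing the environment as a parameter keeps the function testable without touching the real process environment.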

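Usage examples

1. Obtain the JSON data:

2. Loading JSON / calling the public cloud service

The object to load can be:

  • a doc-json string object
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
def test_local_json_document():
    file_path = "gongshi.json"
    loader = DocumentModelLoader()
    # Open the file in a context manager so the handle is always closed
    with open(file_path, "r", encoding="utf-8") as fp:
        document = loader.load(doc_json_fp=fp)
  • a public cloud call (requires ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET to be set)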
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler, DocumentDigitalExtractHandler
from doc_json_sdk.handler.document_parser_handler import DocumentParserHandler, DocumentParserWithCallbackHandler
def test_document_hander():
    file_path = "gongshi.png"
    file_url = None
    # DocumentExtractHandler: intelligent document parsing; DocumentDigitalExtractHandler: digital document parsing
    loader = DocumentModelLoader(handler=DocumentExtractHandler())
    document = loader.load(file_path=file_path,file_url=file_url)
  • a call with the formula parameter / Markdown output / saving the JSON
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler
def test_render_formula_markdown():
    file_path = "gongshi.png"
    file_url = None
    handler = DocumentExtractHandler()
    loader = DocumentModelLoader(handler=handler)
    document = loader.load(file_path=file_path,file_url=file_url,
                           formula_enhancement=True,
                           markdown_result=True,
                           save_json_path="/Users/sanchuan/Downloads/docmind.json")
  • a call to a privately deployed service (set PRIVATE_DOCMIND_HOST or pass the host explicitly)
from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_private_handler import  PrivateDocumentExtractHandler,PrivateDigitalDocumentExtractHandler
def test_private_document_hander():
    file_path = "gongshi.png"
    file_url = None
    loader = DocumentModelLoader(handler=PrivateDocumentExtractHandler(host="127.0.0.1:7001"))
    document = loader.load(file_path=file_path,file_url=file_url)
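
The private deployment option above accepts the host either from `PRIVATE_DOCMIND_HOST` or as an explicit argument. One way to make that choice explicit in calling code is sketched below. The helper `resolve_private_host` is hypothetical, and the precedence it applies (explicit argument first, environment variable second) is an assumption, not confirmed SDK behavior.

```python
import os
from typing import Mapping, Optional


def resolve_private_host(
    explicit_host: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the host for a private deployment.

    Assumed precedence (not confirmed by the SDK): an explicitly passed
    host wins; otherwise PRIVATE_DOCMIND_HOST is used.
    """
    host = explicit_host or env.get("PRIVATE_DOCMIND_HOST")
    if not host:
        raise ValueError(
            "No host given: pass one explicitly or set PRIVATE_DOCMIND_HOST"
        )
    return host
```

The returned string could then be passed as the `host` argument shown in the `PrivateDocumentExtractHandler(host="127.0.0.1:7001")` example.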

3. Helper functions

3.1 Rendering a DocumentModel to Markdown

Render to Markdown with the built-in function

from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler, DocumentDigitalExtractHandler
from doc_json_sdk.handler.document_parser_handler import DocumentParserHandler, DocumentParserWithCallbackHandler
from doc_json_sdk.render.document_model_render import DocumentModelRender
def test_render_markdown():
    file_path = "gongshi.png"
    file_url = None
    loader = DocumentModelLoader(handler=DocumentExtractHandler())
    document = loader.load(file_path=file_path,file_url=file_url,markdown_result=True)
    render = DocumentModelRender(document_model=document)
    with open("/Users/sanchuan/Downloads/docmind.md", "w", encoding="utf-8") as f:
        f.write(render.render_markdown_result())
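
The example above writes to a hard-coded path. A small helper that takes the output directory as a parameter and always writes UTF-8 may be more portable, particularly for documents containing Chinese text or formulas. This is a sketch using only `pathlib`; `save_markdown` is a hypothetical name and not part of the SDK, and it expects the Markdown string that `render.render_markdown_result()` returns in the example above.

```python
from pathlib import Path


def save_markdown(markdown_text: str, out_dir: str, file_name: str = "docmind.md") -> Path:
    """Write Markdown text to out_dir/file_name as UTF-8 and return the path."""
    target_dir = Path(out_dir).expanduser()
    # Create the directory (and any parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    target.write_text(markdown_text, encoding="utf-8")
    return target
```

A call such as `save_markdown(render.render_markdown_result(), "~/Downloads")` would then replace the `open(...)` block.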

Visualize the parsing result

from doc_json_sdk.loader.document_model_loader import DocumentModelLoader
from doc_json_sdk.handler.document_handler import DocumentExtractHandler, DocumentDigitalExtractHandler
from doc_json_sdk.handler.document_parser_handler import DocumentParserHandler, DocumentParserWithCallbackHandler
from doc_json_sdk.render.document_model_render import DocumentModelRender
def test_document_hander():
    file_path = "gongshi.png"
    file_url = None
    loader = DocumentModelLoader(handler=DocumentExtractHandler())
    document = loader.load(file_path=file_path,file_url=file_url)
    render = DocumentModelRender(document_model=document)
    render.render_image_result("/Users/sanchuan/Downloads")

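3.2 Working with Layout blocks

A LayoutModel object carries content information (from digital parsing or OCR), layout type information (from OCR or NLP), and logical relationship information (from NLP)

doc-json-layout-model: brief description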

from doc_json_sdk.model.enums.layout_type_enum import LayoutTypeEnum

for layout in document:
    type_enum = layout.get_layout_type_enum()
    if (type_enum == LayoutTypeEnum.Elements.FOOTER or
            type_enum == LayoutTypeEnum.Elements.HEADER or
            type_enum == LayoutTypeEnum.Elements.NOTE):
        #  header and footer notes
        pass
    elif type_enum == LayoutTypeEnum.Elements.IMAGE:
        # image with head_line or split_line
        if "_line" in layout.type:
            continue
    elif type_enum == LayoutTypeEnum.Elements.TABLE:
        #table
        pass
    else:
        # paragraph or note(table or figure)
        pass
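
The branching above can also be expressed as a small grouping function, which is convenient when the different kinds of block are processed in separate passes. The sketch below does not import the SDK: it works on plain `(element_name, raw_type)` string pairs as stand-ins, where `element_name` is assumed to correspond to the member name of `LayoutTypeEnum.Elements` and `raw_type` to `layout.type`. The bucket labels and sample type strings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Element names treated as page furniture, mirroring the first branch above.
PAGE_FURNITURE = {"FOOTER", "HEADER", "NOTE"}


def bucket_for(element_name: str, raw_type: str) -> str:
    """Map one layout block to a coarse bucket label."""
    if element_name in PAGE_FURNITURE:
        return "furniture"
    if element_name == "IMAGE":
        # Images whose raw type contains "_line" are rule lines, not figures.
        return "rule_line" if "_line" in raw_type else "image"
    if element_name == "TABLE":
        return "table"
    return "text"


def group_layouts(layouts: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group raw layout types by bucket, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for element_name, raw_type in layouts:
        groups[bucket_for(element_name, raw_type)].append(raw_type)
    return dict(groups)
```

With real SDK objects, the pairs could presumably be built from `layout.get_layout_type_enum()` and `layout.type` while iterating over the document, as in the loop above.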
