This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

📊 Structured Output Kit

🚀 An integrated benchmark toolkit for PDF/image parsing, LLM structured-output extraction, quantitative evaluation, and visualization

Python 3.12+ License FastAPI Streamlit

A comprehensive benchmark tool that converts documents to text with a range of parsing frameworks (Docling, PyPDF, PDFPlumber, etc.), runs multiple LLM hosts (OpenAI, Anthropic, Google, Ollama, etc.) and extraction frameworks (Instructor, LangChain, LlamaIndex, Marvin, etc.) through a unified interface to extract structured information, and then quantifies and visualizes how closely the output matches a ground-truth JSON.

✨ Key Features

Multiple parsing frameworks

  • PDF: supports Docling, PyPDF, PDFPlumber, and PyMuPDF
  • Images: OCR based on a Vision Language Model (VLM)
  • Microsoft: MarkItDown for a variety of document formats

🔄 Multiple LLM hosts & frameworks

  • Hosts: OpenAI, Anthropic, Google, Ollama, OpenAI-compatible servers
  • Frameworks: Instructor, LangChain (Tool/Parser), LlamaIndex, Marvin, Mirascope, Ollama, etc.

🎯 Quantitative evaluation system

  • Hybrid scoring based on embedding similarity (cosine similarity) and exact match
  • Detailed per-field evaluation, with optimal matching via the Hungarian algorithm

📊 Real-time visualization

  • Streamlit-based interactive dashboard
  • Static HTML report generation
  • Performance distribution and detailed per-field analysis

🚀 Unified API & CLI interface

  • RESTful API server (FastAPI)
  • Typer-based command-line interface
  • Supports the full parsing → extraction → evaluation → visualization pipeline

⚡ YAML-based workflows

  • Automatically runs every combination of multiple parsing methods × multiple extraction configurations
  • Batch processing driven by a configuration file
  • Direct text input without parsing is also supported
  • Run results are organized automatically, with a summary report

🔧 Extensibility & customization

  • Add custom schemas (Pydantic-based)
  • Customize evaluation criteria (YAML configuration)
  • New frameworks can be added easily

🚀 Quick Start

1️⃣ Installation

# Install uv (recommended)
curl -fsSL https://astral.sh/uv/install.sh | sh

# Clone the project and install dependencies
git clone https://github.com/Bae-ChangHyun/StructuredOutputKit.git
cd StructuredOutputKit
uv venv
source .venv/bin/activate
uv sync

2️⃣ Environment setup

# Create the environment variable file
cp .env.example .env

# Set the API key in the .env file
echo "OPENAI_API_KEY=your_api_key_here" >> .env

3️⃣ 30-second test

# Start the API server
python main.py

# In a new terminal, test text extraction
curl -X POST http://localhost:8000/v1/extraction \
  -H 'Content-Type: application/json' \
  -d '{
    "input_text": "Hello. I am Hong Gil-dong. I graduated in computer engineering and worked as a Python developer for 3 years.",
    "schema_name": "schema_han",
    "framework": "OpenAIFramework",
    "host_info": {
      "provider": "openai",
      "base_url": "https://api.openai.com/v1",
      "model": "gpt-4o-mini"
    }
  }'

4️⃣ Getting started with the CLI

# Extract structured information from text
python main.py --cli extract --input "Hello. I am Kim Cheol-su. After graduating from Seoul National University, I worked at Samsung for 5 years."

# Parse a PDF file (example)
python main.py --cli parse --file document.pdf --framework DoclingFramework

# Run evaluation (using sample data)
python main.py --cli eval \
  --pred result/extraction/$(ls result/extraction | tail -1)/result.json \
  --gt data/리멤버-s1.json

# Run visualization
python main.py --cli viz --eval-result result/evaluation/$(ls result/evaluation | tail -1)/eval_result.json
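
The `$(ls … | tail -1)` idiom in the commands above selects the most recent run directory. A minimal Python equivalent is sketched below. It assumes run directories are named with timestamps such as `20250812_0850` (the pattern used in the paths elsewhere in this README), so that sorting the names also sorts them chronologically. `latest_run_dir` is a hypothetical helper and not part of the project.

```python
from pathlib import Path


def latest_run_dir(base: str) -> Path:
    """Return the newest run directory under `base`.

    Assumes directory names are timestamps such as 20250812_0850,
    so sorting the names also sorts them chronologically.
    """
    runs = sorted(p for p in Path(base).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run directories under {base}")
    return runs[-1]
```

For example, `latest_run_dir("result/extraction") / "result.json"` would give the path to pass to `--pred`.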

5️⃣ Getting started with workflows

# Generate a workflow template
python workflow_cli.py workflow template --output my_workflow.yaml

# Run after editing the config file (API key required)
python workflow_cli.py workflow run my_workflow.yaml

# Test extraction only (no parsing)
python workflow_cli.py workflow run test_extraction_only.yaml

📦 Installation Guide

System requirements

  • Python 3.12 or later
  • Linux/macOS/Windows
  • At least 4 GB RAM (8 GB+ recommended when using a VLM)

Installation methods

Method 1: uv (recommended)

# Install uv
curl -fsSL https://astral.sh/uv/install.sh | sh

# Set up the project
git clone https://github.com/Bae-ChangHyun/StructuredOutputKit.git
cd StructuredOutputKit
uv venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv sync

Method 2: pip

git clone https://github.com/Bae-ChangHyun/StructuredOutputKit.git
cd StructuredOutputKit
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e .

Method 3: Docker (experimental)

FROM python:3.12-slim
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN pip install uv && uv sync --no-dev
COPY . .
ENV API_HOST=0.0.0.0 API_PORT=8000
EXPOSE 8000
CMD ["python", "main.py"]

docker build -t structured-output-kit .
docker run -p 8000:8000 --env-file .env structured-output-kit

⚙️ Configuration

Environment variables

Copy .env.example to create a .env file, then set the API keys you need.

Detailed environment variable settings

# Server settings
API_HOST=0.0.0.0
API_PORT=8000
DEBUG=true

# OpenAI
OPENAI_API_KEY=sk-your-api-key
OPENAI_MODELS=gpt-4o-mini
OPENAI_EMBED_MODELS=text-embedding-3-small

# Anthropic
ANTHROPIC_API_KEY=your-api-key
ANTHROPIC_MODELS=claude-3-5-sonnet-latest

# Google
GOOGLE_API_KEY=your-api-key
GOOGLE_MODELS=gemini-1.5-flash

# OpenAI-Compatible (vLLM, Together AI, etc.)
OPENAI_COMPATIBLE_BASEURL=http://localhost:8000/v1
OPENAI_COMPATIBLE_MODELS=your-model-name
OPENAI_COMPATIBLE_API_KEY=dummy

# Ollama
OLLAMA_BASEURL=http://localhost:11434/v1
OLLAMA_MODELS=llama3.1:8b

# HuggingFace (local embeddings)
HUGGINGFACE_EMBED_MODELS=jhgan/ko-sroberta-multitask

# Langfuse (optional)
LANGFUSE_HOST=your-langfuse-host
LANGFUSE_PUBLIC_KEY=your-public-key
LANGFUSE_SECRET_KEY=your-secret-key

# Limits
MAX_FILE_SIZE=10485760
TASK_TIMEOUT=3600

💻 Usage

API usage

Starting the server

# Default run
python main.py

# Run on a custom port
python main.py --port 8080

# Development mode (auto-reload)
python main.py --reload

API docs: http://localhost:8000/docs

Main endpoints

Parsing API - POST /v1/parsing

Parse a PDF/image file:

curl -X POST http://localhost:8000/v1/parsing \
  -F 'file=@document.pdf' \
  -F 'framework=DoclingFramework'

Python usage:

import requests

with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/v1/parsing",
        files={"file": f},
        data={
            "framework": "DoclingFramework",
            "extra_kwargs": '{"parse_figures": true}'
        }
    )

result = response.json()
print(f"Parsed text: {result['data']['content']}")
print(f"Parsing time: {result['latency']} s")

🔄 Extraction API - POST /v1/extraction

Basic usage:

curl -X POST http://localhost:8000/v1/extraction \
  -H 'Content-Type: application/json' \
  -d '{
    "input_text": "Hello. My name is Hong Gil-dong.",
    "schema_name": "schema_han",
    "framework": "OpenAIFramework",
    "host_info": {
      "provider": "openai",
      "base_url": "https://api.openai.com/v1",
      "model": "gpt-4o-mini"
    }
  }'

Python usage:

import requests

response = requests.post("http://localhost:8000/v1/extraction", json={
    "input_text": "I am Kim Cheol-su. After graduating from the Department of Computer Engineering at Seoul National University, I worked at Naver for 5 years.",
    "schema_name": "schema_han",
    "framework": "OpenAIFramework",
    "extra_kwargs": {"temperature": 0.1, "timeout": 900},
    "host_info": {
        "provider": "openai",
        "base_url": "https://api.openai.com/v1",
        "model": "gpt-4o-mini"
    }
})

result = response.json()
print(f"Extraction result: {result['data']['result']}")
print(f"Success rate: {result['success_rate']}")
print(f"Response time: {result['latency']} s")

📊 Evaluation API - POST /v1/evaluation

curl -X POST http://localhost:8000/v1/evaluation \
  -H 'Content-Type: application/json' \
  -d '{
    "pred_json_path": "result/extraction/20250812_0850/result.json",
    "gt_json_path": "data/리멤버-s1.json",
    "schema_name": "schema_han",
    "host_info": {
      "provider": "huggingface",
      "base_url": "",
      "model": "jhgan/ko-sroberta-multitask"
    }
  }'

🎨 Visualization API - POST /v1/visualization/generate

curl -X POST http://localhost:8000/v1/visualization/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "eval_result_path": "result/evaluation/20250812_0854/eval_result.json"
  }'

🔧 Utility API

# List supported hosts
curl http://localhost:8000/v1/utils/providers

# List frameworks for a host (URL quoted so the shell does not glob the "?")
curl 'http://localhost:8000/v1/utils/frameworks?provider=openai'

# List available schemas
curl http://localhost:8000/v1/utils/schemas

# List parsing frameworks
curl http://localhost:8000/v1/utils/parsing-frameworks

CLI usage

Parsing

# Parse a PDF file
python main.py --cli parse --file document.pdf --framework DoclingFramework

# Image OCR (using a VLM)
python main.py --cli parse --file image.png --framework VLMFramework

# Advanced options
python main.py --cli parse \
  --file document.pdf \
  --framework PDFPlumberFramework \
  --kwargs '{"parse_tables":true}' \
  --save

Extraction

# Basic extraction
python main.py --cli extract --input "I am Hong Gil-dong. After graduating from Seoul National University, I worked at Kakao for 3 years."

# Extract from a file
python main.py --cli extract --input ./sample.txt --schema schema_han

# Advanced options
python main.py --cli extract \
  --input "text content" \
  --schema schema_han \
  --retries 3 \
  --kwargs '{"temperature":0.1,"timeout":900}' \
  --save

Evaluation

# Basic evaluation
python main.py --cli eval \
  --pred result/extraction/latest/result.json \
  --gt data/리멤버-s1.json

# Use custom evaluation criteria
python main.py --cli eval \
  --pred result/extraction/latest/result.json \
  --gt data/리멤버-s1.json \
  --criteria evaluation/criteria/custom_criteria.json \
  --save

Visualization

# Launch the Streamlit dashboard
python main.py --cli viz --eval-result result/evaluation/latest/eval_result.json

# Generate static HTML
python main.py --cli viz \
  --eval-result result/evaluation/latest/eval_result.json \
  --html \
  --out result/visualization/custom_dir

🔄 Workflow usage

The workflow feature runs the parsing, extraction, and evaluation stages in one go from a YAML configuration file. It automatically runs every combination of the parsing methods and extraction settings you list, so you can compare them and pick the best-performing one.

Key features

  • Integrated multi-stage execution: parsing → extraction → evaluation pipeline
  • Combination runs: every combination of parsing configs × extraction configs is run automatically
  • Config-driven: all parameters are managed in a YAML file
  • Reuses existing code: builds directly on the existing CLI features
  • Extensibility: new stages or frameworks can be added easily

Quick start

# 1. Generate a template
python workflow_cli.py workflow template --output my_workflow.yaml

# 2. Edit the config file, then run
python workflow_cli.py workflow run my_workflow.yaml

Example 1: extraction only (no parsing)

# test_extraction_only.yaml
name: "extraction_test"
description: "Test extraction only"

# No parsing configuration
parsing: null

# Extraction configurations (multiple entries allowed)
extraction:
  - prompt: "Extract person information"
    input_text: "Hello. I am Kim Cheol-su. After graduating from Seoul National University, I worked at Samsung for 3 years."
    schema_name: "schema_han"
    framework: "openai"
    host_info:
      provider: "openai"
      model: "gpt-4o-mini"
      api_key: "${OPENAI_API_KEY}"
    retries: 2
    extra_kwargs:
      temperature: 0.1
    save: true

evaluation:
  enabled: false
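
The `api_key: "${OPENAI_API_KEY}"` entry above implies that placeholders are filled from the environment. How the project's loader does this is not shown here. The sketch below is one plausible approach using the standard library's `os.path.expandvars`, applied to a configuration that has already been loaded into Python dicts and lists. `expand_env` is a hypothetical name.

```python
import os


def expand_env(value):
    """Recursively expand ${VAR} placeholders in strings nested in dicts/lists.

    Unknown variables are left untouched, which is how os.path.expandvars behaves.
    """
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Leaving unknown variables untouched makes a missing key easy to spot in the resolved config, at the cost of not failing early.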

Example 2: parsing + extraction combinations

# test_parsing_extraction.yaml
name: "full_pipeline_test"
description: "Test parsing and extraction combinations"

# Parsing configurations (2)
parsing:
  - file_path: "./document1.pdf"
    framework: "docling"
    extra_kwargs:
      use_ocr: true
    save: true

  - file_path: "./document1.pdf"
    framework: "pypdf"
    extra_kwargs: {}
    save: true

# Extraction configurations (2)
extraction:
  - prompt: "Extract person information"
    schema_name: "schema_han"
    framework: "openai"
    host_info:
      provider: "openai"
      model: "gpt-4o-mini"
      api_key: "${OPENAI_API_KEY}"
    save: true

  - prompt: "Extract detailed career info"
    schema_name: "schema_han"
    framework: "anthropic"
    host_info:
      provider: "anthropic"
      model: "claude-3-sonnet"
      api_key: "${ANTHROPIC_API_KEY}"
    save: true

evaluation:
  enabled: false

# Runs 2 (parsing) × 2 (extraction) = 4 combinations in total:
# 1. docling + openai
# 2. docling + anthropic
# 3. pypdf + openai
# 4. pypdf + anthropic
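
The 2 × 2 expansion listed above is a Cartesian product of the two config lists. The sketch below shows the idea with `itertools.product`. The config dicts are trimmed-down stand-ins and `build_combinations` is a hypothetical name. Treating `parsing: null` as "run each extraction once" is an assumption based on Example 1.

```python
from itertools import product

# Hypothetical, trimmed-down stand-ins for the YAML lists above.
parsing_configs = [{"framework": "docling"}, {"framework": "pypdf"}]
extraction_configs = [{"framework": "openai"}, {"framework": "anthropic"}]


def build_combinations(parsing, extraction):
    """Pair every parsing config with every extraction config.

    With no parsing section, each extraction config runs once on its own
    (the parsing slot of the pair is None).
    """
    parsing = parsing or [None]
    return list(product(parsing, extraction))


for i, (p, e) in enumerate(build_combinations(parsing_configs, extraction_configs), start=1):
    parser_name = p["framework"] if p else "(no parsing)"
    print(f"{i}. {parser_name} + {e['framework']}")
```

Because `product` varies the last argument fastest, the pairs come out in the same order as the numbered list in the YAML comments.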

Workflow commands

# Run a workflow
python workflow_cli.py workflow run config.yaml

# Validate the configuration
python workflow_cli.py workflow validate config.yaml

# Generate a template
python workflow_cli.py workflow template
python workflow_cli.py workflow template --no-eval  # exclude evaluation settings

# Advanced options:
#   --parallel       run in parallel
#   --no-fail-fast   keep going even if a step fails
#   --output DIR     set the output directory
#   --dry-run        validate only, without actually running
python workflow_cli.py workflow run config.yaml \
  --parallel \
  --no-fail-fast \
  --output ./results \
  --dry-run

Result structure

result/workflow/
└── workflow_name_20240824_143022/
    ├── workflow_config.json      # configuration that was run
    ├── workflow_summary.json     # run summary
    ├── combination_0_0/          # results of the first combination
    │   ├── parsing_result.txt
    │   ├── extraction_result.json
    │   └── evaluation_result.json
    ├── combination_0_1/          # results of the second combination
    └── ...

For more detailed workflow usage, see WORKFLOW.md.

🏗️ Project structure

📁 Full project structure

structured_output_kit/
├── main.py                       # 🚀 Main entry point (API server / CLI)
├── cli.py                        # 💻 Typer-based CLI
├── 📁 server/                    # 🌐 FastAPI server
│   ├── main.py                   # FastAPI app setup and router registration
│   ├── config.py                 # Server configuration management
│   ├── routers/                  # API endpoints
│   │   ├── extraction.py         # Structured information extraction API
│   │   ├── evaluation.py         # Evaluation API
│   │   ├── parsing.py            # Parsing API
│   │   ├── visualization.py      # Visualization API
│   │   └── utils.py              # Utility API
│   ├── models/                   # Data models
│   └── services/                 # Business-logic services
├── 📁 parsing/                   # 📄 Parsing module
│   ├── core.py                   # Core parsing logic
│   ├── factory.py                # Parsing framework factory
│   ├── utils.py                  # Parsing utilities
│   └── frameworks/               # Parsing framework implementations
│       ├── docling_framework.py  # IBM Docling (recommended)
│       ├── pypdf_framework.py    # PyPDF
│       ├── pdfplumber_framework.py # PDFPlumber
│       ├── fitz_framework.py     # PyMuPDF
│       ├── markitdown_framework.py # Microsoft MarkItDown
│       └── vlm_framework.py      # Vision Language Model
├── 📁 extraction/                # 🔧 Extraction module
│   ├── core.py                   # Core extraction logic
│   ├── utils.py                  # Extraction utilities
│   ├── factory.py                # LLM framework factory
│   ├── compatibility.yaml        # Framework-to-host compatibility mapping
│   ├── frameworks/               # LLM framework implementations
│   │   ├── openai_framework.py   # OpenAI native
│   │   ├── anthropic_framework.py # Anthropic native
│   │   ├── google_framework.py   # Google Gemini native
│   │   ├── ollama_framework.py   # Ollama native
│   │   ├── instructor_framework.py # Instructor
│   │   ├── langchain_tool_framework.py # LangChain Tool
│   │   ├── langchain_parser_framework.py # LangChain Parser
│   │   ├── llamaindex_framework.py # LlamaIndex
│   │   ├── marvin_framework.py   # Marvin
│   │   └── mirascope_framework.py # Mirascope
│   └── schema/                   # Data schemas
│       └── schema_han.py         # Korean résumé schema
├── 📁 evaluation/                # 📊 Evaluation module
│   ├── core.py                   # Core evaluation logic
│   ├── metrics.py                # Evaluation metrics (embedding similarity, exact match)
│   ├── utils.py                  # Evaluation utilities
│   ├── visualizer.py             # Streamlit visualization
│   └── criteria/                 # Evaluation criteria config files
├── 📁 utils/                     # 🛠️ Common utilities
│   ├── types.py                  # Type definitions (Request/Response models)
│   ├── logging.py                # Logging setup
│   ├── tracing.py                # Langfuse tracing setup
│   ├── cli_helpers.py            # CLI helper functions
│   ├── common.py                 # Common functionality
│   └── visualization.py          # Visualization helpers
├── 📁 data/                      # 📄 Sample data
│   ├── 리멤버-s1.json            # Korean résumé sample
│   ├── 국문이력서(그림포함)-s1.json
│   └── ...
└── 📁 result/                    # 📈 Result storage
    ├── parsing/                  # Parsing results
    ├── extraction/               # Extraction results
    ├── evaluation/               # Evaluation results
    └── visualization/            # Visualization results

Run modes

  • 🌐 API server mode: python main.py (default)
  • 💻 CLI mode: python main.py --cli [command]

Pipeline flow

graph LR
    A[📄 PDF/Image] --> B[Parsing]
    B --> C[📋 Text]
    C --> D[🔧 LLM extraction]
    D --> E[📊 Structured JSON]
    E --> F[📈 Evaluation]
    F --> G[🎨 Visualization]

    H[⚙️ Config] --> B
    H --> D
    H --> F

🔧 Supported frameworks

Parsing frameworks

📄 PDF parsing frameworks

| Framework | Characteristics | Use case |
|---|---|---|
| DoclingFramework | Developed by IBM, state-of-the-art AI-based | Complex layouts, table extraction (recommended) |
| PDFPlumberFramework | Specialized in table extraction | When accurate table data is needed |
| PyPDFFramework | Fast and lightweight | Simple text extraction |
| FitzFramework | PyMuPDF-based | High performance, support for various formats |
| MarkItDownFramework | Developed by Microsoft | Office documents, support for various formats |

🖼️ Image/OCR frameworks

| Framework | Supported models | Characteristics |
|---|---|---|
| VLMFramework | OpenAI GPT-4V, Google Gemini | Multimodal OCR, strong comprehension |

Supported frameworks by LLM host

🤖 OpenAI
  • ✅ OpenAIFramework - native Structured Outputs
  • ✅ InstructorFramework - enhanced type safety
  • ✅ LangchainToolFramework - tool-based extraction
  • ✅ LangchainParserFramework - parser-based extraction
  • ✅ LlamaIndexFramework - data-centric extraction
  • ✅ MarvinFramework - AI engineering tool
  • ✅ MirascopeFramework - modern LLM library

🎭 Anthropic
  • ✅ AnthropicFramework - native Tool Use
  • ✅ InstructorFramework - Anthropic support
  • ✅ LangchainToolFramework - Claude integration
  • ✅ LangchainParserFramework - Claude parser
  • ✅ MarvinFramework - Claude support

🔍 Google
  • ✅ GoogleFramework - Gemini native JSON mode
  • ✅ InstructorFramework - Gemini support
  • ✅ LangchainToolFramework - Gemini integration
  • ✅ LangchainParserFramework - Gemini parser
  • ✅ LlamaIndexFramework - Gemini support
  • ✅ MarvinFramework - Gemini support
  • ✅ MirascopeFramework - Gemini support

🦙 Ollama (local)
  • ✅ OllamaFramework - native JSON structuring
  • ✅ OpenAIFramework - OpenAI-compatible mode
  • ✅ InstructorFramework - local model support
  • ✅ LangchainToolFramework - Ollama integration
  • ✅ LangchainParserFramework - Ollama parser
  • ✅ LlamaIndexFramework - Ollama support
  • ✅ MarvinFramework - Ollama support
  • ✅ MirascopeFramework - Ollama support

🔗 OpenAI-Compatible

Supports OpenAI-compatible servers such as vLLM, Together AI, and Groq.

  • ✅ OpenAIFramework - compatible mode
  • ✅ InstructorFramework - compatible support
  • ✅ LangchainToolFramework - compatible integration
  • ✅ LangchainParserFramework - compatible parser
  • ✅ LlamaIndexFramework - compatible support
  • ✅ MarvinFramework - compatible support
  • ✅ MirascopeFramework - compatible support

📄 Parsing system

Supported file formats

📋 List of supported file formats

| Format | Extension | Recommended framework | Notes |
|---|---|---|---|
| PDF | .pdf | DoclingFramework | Preserves layout and tables |
| Image | .png, .jpg, .jpeg | VLMFramework | OCR + comprehension |
| Word | .docx | MarkItDownFramework | Office documents |
| PowerPoint | .pptx | MarkItDownFramework | Slide text |
| Excel | .xlsx | MarkItDownFramework | Spreadsheets |

Parsing examples

# Complex PDF document (with tables)
python main.py --cli parse --file report.pdf --framework DoclingFramework

# Image-based document (OCR)
python main.py --cli parse --file scan.png --framework VLMFramework

# Office document
python main.py --cli parse --file document.docx --framework MarkItDownFramework

📋 Schemas and evaluation

Default schema (schema_han)

A comprehensive structured schema for extracting information from Korean résumés.

Schema structure details

class ExtractInfo(BaseModel):
    # Basic information
    personal_info: Optional[PersonalInfo]           # Personal info (name, contact, address, etc.)
    summary_info: Optional[SummaryInfo]             # Summary (brief introduction, core competencies)

    # Education and training
    educations: List[Education]                     # Education history (school, major, GPA, etc.)
    education_programs: List[EducationProgram]      # Training programs (external courses, training, etc.)
    overseas_experiences: List[OverseasExperience]  # Overseas experience (country, period, details)

    # Career and achievements
    careers: List[Career]                           # Work history (company, role, responsibilities, etc.)
    certificates: List[Certificate]                 # Certificates (name, issuer, score, etc.)
    awards: List[Award]                             # Awards/competitions (award name, organization, date)

    # Other information
    employment_preference: Optional[EmploymentPreference] # Employment preference (veteran status, disability, etc.)
    military_service: Optional[MilitaryService]     # Military service (branch, rank, period)
    cover_letter: Optional[CoverLetter]             # Cover letter
    etc_info: Optional[EtcInfo]                     # Other information

Key field examples:

  • PersonalInfo: name, gender, date of birth, contact, email, address, social media links
  • Career: company name, start/end dates, responsibilities, salary, position, rank, employment type
  • Education: school type, school name, major, degree, GPA, graduation status

Evaluation system

A hybrid evaluation approach measures both accuracy and semantic similarity.

📊 Evaluation metric details

1. Exact Match

  • Checks whether two strings match exactly
  • Used for fields where precision matters, such as names, emails, and dates

2. Embedding Similarity

  • Measures semantic similarity via cosine similarity
  • Used for free-text fields such as cover letters and job responsibilities
  • Supported models: OpenAI, HuggingFace (ko-sroberta-multitask)
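
For reference, cosine similarity between two embedding vectors u and v is (u · v) / (‖u‖ ‖v‖). The sketch below computes it in plain Python on already-computed vectors. Obtaining the embeddings from OpenAI or HuggingFace is out of scope here, and returning 0.0 for a zero vector is an assumption made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero vector as "no similarity"
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1 regardless of length, and orthogonal vectors score 0, which is why the metric suits comparing text embeddings of different magnitudes.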

3. Hybrid score

  • Weighted average of exact match and embedding similarity
  • Per-field weights can be configured
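
Written out, the weighted average is score = (w_exact · s_exact + w_emb · s_emb) / (w_exact + w_emb). The sketch below is one way to compute it. The function names are hypothetical, the 0.3/0.7 defaults mirror the `careers` entry in the criteria example in this README, and dividing by the weight sum is an assumption about how unnormalized weights would be handled.

```python
def exact_match(pred: str, truth: str) -> float:
    """1.0 when the two strings are identical, otherwise 0.0."""
    return 1.0 if pred == truth else 0.0


def hybrid_score(exact: float, embedding: float,
                 exact_weight: float = 0.3, embedding_weight: float = 0.7) -> float:
    """Weighted average of an exact-match score and an embedding similarity."""
    total = exact_weight + embedding_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (exact_weight * exact + embedding_weight * embedding) / total
```

With the default weights, a field that misses exact match but has an embedding similarity of 0.9 scores about 0.63, so a close paraphrase is still rewarded.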

4. Hungarian algorithm

  • Optimal matching between list elements
  • Used to evaluate array data whose order may differ, such as work history and education
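
The matching step is an assignment problem: given a similarity for every (predicted item, ground-truth item) pair, pick the one-to-one pairing with the highest total. The Hungarian algorithm solves this in polynomial time, and SciPy's `scipy.optimize.linear_sum_assignment` is a commonly used implementation. To stay dependency-free, the sketch below finds the same optimum by exhaustive search over permutations, which is only practical for short lists. It is not the Hungarian algorithm itself, and `best_matching` is a hypothetical name.

```python
from itertools import permutations


def best_matching(sim):
    """Return (pairs, total) for the one-to-one pairing with maximum total similarity.

    sim[i][j] is the similarity between predicted item i and ground-truth item j.
    Brute force over permutations: fine for a handful of items only.
    """
    n_pred = len(sim)
    n_gt = len(sim[0]) if sim else 0
    if n_pred <= n_gt:
        # Each predicted item picks a distinct ground-truth index.
        candidates = (list(zip(range(n_pred), perm))
                      for perm in permutations(range(n_gt), n_pred))
    else:
        # More predictions than truths: each truth picks a distinct prediction.
        candidates = (list(zip(perm, range(n_gt)))
                      for perm in permutations(range(n_pred), n_gt))
    best_pairs, best_total = [], float("-inf")
    for pairs in candidates:
        total = sum(sim[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs, best_total
```

With `sim = [[0.9, 0.8], [0.85, 0.1]]`, a greedy pass would first pair item 0 with truth 0 and be left with the 0.1 pairing, whereas the optimal assignment pairs 0 with 1 and 1 with 0. That difference is why an order-insensitive evaluation needs global matching.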

# Example of per-field evaluation methods
evaluation_criteria = {
    "personal_info.name": {"method": "exact"},        # the name must match exactly
    "personal_info.email": {"method": "exact"},       # the email must match exactly too
    "summary_info.brief_introduction": {"method": "embedding"}, # the introduction is scored by semantic similarity
    "careers": {"method": "hybrid", "exact_weight": 0.3, "embedding_weight": 0.7}
}

🎨 Visualization

Streamlit dashboard

Explore evaluation results in real time with the interactive dashboard.

📊 Dashboard features

Key features:

  • 📊 Overall performance overview: total score, exact-match rate, embedding similarity
  • 📈 Detailed per-field analysis: score distribution and detailed comparison for each field
  • 🔍 Prediction vs. ground truth: visual comparison of actual and predicted values
  • 📉 Performance distribution charts: score histograms and distribution plots
  • 🎯 Error analysis: root-cause analysis for low-scoring fields

How to run:

python main.py --cli viz --eval-result path/to/eval_result.json

Static HTML report

Results can be shared as a simple HTML report.

How to generate:

# Generate via the CLI
python main.py --cli viz --eval-result path/to/eval_result.json --html

# Generate via the API
curl -X POST http://localhost:8000/v1/visualization/generate \
  -H 'Content-Type: application/json' \
  -d '{"eval_result_path": "path/to/eval_result.json"}'

🛠️ Development guide

Local development environment

# Run the server in development mode
python main.py --reload

# Run tests (not yet implemented)
pytest tests/

# Code formatting (recommended)
black .
ruff check .

Adding a custom schema

Guide to creating a new schema

  1. Create a new schema file in the extraction/schema/ directory
  2. Define an ExtractInfo class that inherits from Pydantic v2 BaseModel
  3. The file can then be referenced by its schema name

# extraction/schema/custom_schema.py
from pydantic import BaseModel, Field
from typing import Optional, List

class PersonInfo(BaseModel):
    name: Optional[str] = Field(description="Name", default=None)
    age: Optional[int] = Field(description="Age", default=None)

class SkillInfo(BaseModel):
    skill_name: Optional[str] = Field(description="Skill name", default=None)
    proficiency: Optional[str] = Field(description="Proficiency", default=None)

class ExtractInfo(BaseModel):
    person: Optional[PersonInfo] = Field(description="Person information", default=None)
    skills: List[SkillInfo] = Field(description="Tech stack", default_factory=list)

Adding a custom parsing framework

🔧 Implementing a new parsing framework

  1. Create a new framework file in the parsing/frameworks/ directory
  2. Implement a class that inherits from BaseFramework

# parsing/frameworks/custom_framework.py
from structured_output_kit.parsing.base import BaseFramework

class CustomFramework(BaseFramework):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialization logic

    def run(self, retries: int = 1) -> tuple[str, bool, float]:
        # Implement the parsing logic
        try:
            content = self.parse_file(self.file_path)
            return content, True, 0.5  # content, success, latency (placeholder value)
        except Exception as e:
            return f"ERROR: {str(e)}", False, 0.0

    def parse_file(self, file_path: str) -> str:
        # Actual parsing logic goes here
        pass
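
The template above returns a fixed latency of 0.5. In practice the `(content, success, latency)` tuple would carry a measured value. The sketch below shows one way to do that with `time.perf_counter`. It is a standalone stand-in that does not inherit from the project's `BaseFramework`; `TimedTextParser` and its plain-text `parse_file` are hypothetical, and using `retries` as a number of attempts is an assumption.

```python
import time
from pathlib import Path


class TimedTextParser:
    """Hypothetical stand-in mimicking the (content, success, latency) contract."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def parse_file(self, file_path: str) -> str:
        # Placeholder parsing logic: read the file as UTF-8 text.
        return Path(file_path).read_text(encoding="utf-8")

    def run(self, retries: int = 1) -> tuple[str, bool, float]:
        start = time.perf_counter()
        last_error = ""
        for _ in range(max(1, retries)):
            try:
                content = self.parse_file(self.file_path)
                return content, True, time.perf_counter() - start
            except Exception as exc:
                last_error = f"ERROR: {exc}"
        # Every attempt failed: report the last error and the elapsed time.
        return last_error, False, time.perf_counter() - start
```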

Adding a custom LLM framework

🤖 Implementing a new LLM framework

  1. Create a new framework file in the extraction/frameworks/ directory
  2. Implement a class that inherits from BaseFramework
  3. Add host compatibility information to compatibility.yaml

# extraction/frameworks/custom_framework.py
from typing import Any

from structured_output_kit.extraction.base import BaseFramework, experiment

class CustomFramework(BaseFramework):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize the LLM client

    def run(self, retries: int, inputs: dict | None = None) -> tuple[list[Any], float, list[float]]:
        inputs = inputs or {}  # avoid a shared mutable default argument

        @experiment(retries=retries)
        def run_experiment(inputs):
            # LLM call logic
            response = self.client.complete(
                prompt=self.prompt.format(**inputs),
                response_model=self.response_model
            )
            return response

        predictions, percent_successful, latencies = run_experiment(inputs)
        return predictions, percent_successful, latencies
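
The `experiment` decorator imported above is not shown in this README. The sketch below is a hypothetical re-creation of a decorator with the same return shape `(predictions, percent_successful, latencies)`. It assumes that `retries` means "call the function this many times and record each outcome", which may differ from the project's actual behavior.

```python
import time
from functools import wraps


def experiment(retries: int = 1):
    """Hypothetical sketch: call the wrapped function `retries` times.

    Returns (predictions, percent_successful, latencies), where the last two
    describe only the calls that completed without raising.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            predictions, latencies = [], []
            for _ in range(retries):
                start = time.perf_counter()
                try:
                    predictions.append(func(*args, **kwargs))
                    latencies.append(time.perf_counter() - start)
                except Exception:
                    # A failed call only lowers the success rate.
                    continue
            percent_successful = len(predictions) / retries if retries else 0.0
            return predictions, percent_successful, latencies
        return wrapper
    return decorator
```

Swallowing exceptions inside the loop keeps one bad LLM response from aborting a benchmark run, and the failure still shows up in `percent_successful`.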

How to contribute

  1. 🍴 Fork the repository
  2. 🌟 Create a feature branch: git checkout -b feature/amazing-feature
  3. 💾 Commit your changes: git commit -m 'Add amazing feature'
  4. 📤 Push to the branch: git push origin feature/amazing-feature
  5. 🎯 Open a Pull Request

🔄 Output structure

result/
├── 📁 extraction/               # Extraction results
│   └── 20250823_1430/           # Timestamp folder
│       ├── result.json          # Extracted JSON result
│       ├── extraction.log       # Extraction log
│       └── metadata.json        # Run metadata
├── 📁 evaluation/               # Evaluation results
│   └── 20250823_1435/
│       ├── eval_result.json     # Evaluation result
│       ├── pred.json            # Predicted JSON (normalized)
│       ├── gt.json              # Ground-truth JSON
│       ├── criteria.json        # Evaluation criteria used
│       └── evaluation.log       # Evaluation log
└── 📁 visualization/            # Visualization results
    └── 20250823_1440/
        └── visualization.html   # HTML report

📈 Performance benchmarks

📊 Sample benchmark results

LLM framework performance comparison (Korean résumé dataset)

| Framework | Host | Model | Accuracy | Response time | Stability | Notes |
|---|---|---|---|---|---|---|
| OpenAIFramework | OpenAI | gpt-4o-mini | 94.2% | 1.2s | ⭐⭐⭐⭐⭐ | Native Structured Outputs |
| AnthropicFramework | Anthropic | claude-3-5-sonnet | 95.1% | 2.1s | ⭐⭐⭐⭐⭐ | Tool Use-based |
| GoogleFramework | Google | gemini-1.5-flash | 92.8% | 1.8s | ⭐⭐⭐⭐ | JSON mode |
| InstructorFramework | OpenAI | gpt-4o-mini | 93.8% | 1.4s | ⭐⭐⭐⭐⭐ | Enhanced type validation |
| LangchainToolFramework | OpenAI | gpt-4o-mini | 92.5% | 1.8s | ⭐⭐⭐⭐ | Tool-based |
| OllamaFramework | Ollama | llama3.1:8b | 88.3% | 3.2s | ⭐⭐⭐ | Runs locally |

Parsing framework performance comparison

| Framework | File type | Accuracy | Speed | Notes |
|---|---|---|---|---|
| DoclingFramework | PDF | 95.2% | Medium | Preserves tables and layout |
| PDFPlumberFramework | PDF | 91.8% | Fast | Specialized in table extraction |
| VLMFramework | Image | 89.7% | Slow | OCR + comprehension |
| PyPDFFramework | PDF | 87.3% | Very fast | Plain text |

*Results are based on the sample dataset; actual performance may vary with document complexity and model.

🔍 Troubleshooting

🚨 Common issues

🔌 Port conflict

# Use a different port
python main.py --port 8080
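
Before choosing another port, it can help to confirm which ports are actually taken. The following is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper, not something the kit provides, and it only checks the local loopback address.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    for candidate in (8000, 8080, 8888):
        print(candidate, "free" if port_is_free(candidate) else "in use")
```

A port reported as free can then be passed to `--port` as shown above.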

🔑 API key error

# Check the .env file
grep API_KEY .env

# Set the environment variable directly
export OPENAI_API_KEY=your-key-here
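
The same check can be made from Python before starting a run. This is a sketch under assumptions: `missing_keys` is a hypothetical helper, and only `OPENAI_API_KEY` is a variable name taken from this README; add whichever other keys your hosts require.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # An empty list means every listed key is set in the current environment
    print(missing_keys(["OPENAI_API_KEY"]))
```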

📦 Dependency issues

# Recreate the virtual environment
rm -rf .venv
uv venv
source .venv/bin/activate
uv sync

🐌 Slow first run

  • Delay caused by downloading HuggingFace models
  • VLM model loading time
  • Check the network connection

📄 PDF parsing failure

  • For scanned PDFs, VLMFramework is recommended
  • Encrypted PDFs are not supported
  • For complex layouts, DoclingFramework is recommended

🧠 Embedding model error

# Clear the HuggingFace model cache
rm -rf ~/.cache/huggingface/

# Use a different embedding model
HUGGINGFACE_EMBED_MODELS=sentence-transformers/all-MiniLM-L6-v2

🔄 Framework compatibility issues

  • Check the supported combinations in compatibility.yaml
  • Unsupported combinations are reported with an error message

🔄 Output structure

📁 Result directory structure
result/
├── 📁 parsing/                  # Parsing results
│   └── 20250824_1430/           # Timestamp folder
│       ├── content.txt          # Parsed text
│       ├── parsing.log          # Parsing log
│       └── metadata.json        # Parsing metadata
├── 📁 extraction/               # Extraction results
│   └── 20250824_1435/
│       ├── result.json          # Extracted JSON result
│       ├── extraction.log       # Extraction log
│       └── metadata.json        # Run metadata (framework, model, etc.)
├── 📁 evaluation/               # Evaluation results
│   └── 20250824_1440/
│       ├── eval_result.json     # Evaluation results (scores, metrics)
│       ├── pred.json            # Predicted JSON (normalized)
│       ├── gt.json              # Ground-truth JSON
│       ├── criteria.json        # Evaluation criteria used
│       └── evaluation.log       # Evaluation log
└── 📁 visualization/            # Visualization results
    └── 20250824_1445/
        └── visualization.html   # HTML report
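
Because each stage writes into a timestamp folder, a script usually wants the most recent run. The sketch below assumes only the `result/<stage>/<timestamp>/` layout and file names shown in the tree above; `latest_run` and `load_json` are hypothetical helpers, not part of the kit.

```python
import json
from pathlib import Path


def latest_run(result_root, stage):
    """Return the newest timestamp folder under <result_root>/<stage>/, or None."""
    stage_dir = Path(result_root) / stage
    if not stage_dir.is_dir():
        return None
    # Names like 20250824_1435 sort chronologically as plain strings
    runs = sorted((p for p in stage_dir.iterdir() if p.is_dir()),
                  key=lambda p: p.name)
    return runs[-1] if runs else None


def load_json(path):
    """Read a UTF-8 JSON file such as result.json or eval_result.json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    run = latest_run("result", "extraction")
    if run is not None and (run / "result.json").exists():
        print(run.name, load_json(run / "result.json"))
```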

🚀 Use cases

💼 Business use cases

1. HR staff

  • Automated résumé analysis and database building
  • Automatic extraction and classification of applicant information
  • Hiring process automation

2. Researchers

  • Benchmarking LLM structured-output performance
  • Comparative studies across frameworks
  • Developing and validating evaluation metrics

3. Developers

  • Prototyping document-processing systems
  • LLM integration test environment
  • Data pipeline validation

4. Data scientists

  • Structuring unstructured data
  • Model performance analysis and optimization
  • Data quality assessment

Advanced usage

🎛️ Advanced configuration and customization

Batch processing

# Process multiple files in one go
for file in documents/*.pdf; do
    python main.py --cli parse --file "$file" --framework DoclingFramework
done
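
The same loop can be driven from Python when you want to collect failures instead of stopping at the first one. This is a sketch, not the kit's API: it only shells out to the `main.py --cli parse` command shown above, and `build_parse_commands` and `run_all` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path


def build_parse_commands(folder, framework="DoclingFramework"):
    """Build one `main.py --cli parse` command per PDF found in `folder`."""
    return [
        [sys.executable, "main.py", "--cli", "parse",
         "--file", str(pdf), "--framework", framework]
        for pdf in sorted(Path(folder).glob("*.pdf"))
    ]


def run_all(commands):
    """Run every command in order and return the ones that exited non-zero."""
    failed = []
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd)
    return failed


# Usage, from the project root:
#   failed = run_all(build_parse_commands("documents"))
```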

Performance optimization

# Tune request parameters (API)
extra_kwargs = {
    "temperature": 0.1,
    "timeout": 60,
    "max_tokens": 4096
}

Custom evaluation criteria

{
    "personal_info.name": {
        "method": "exact",
        "weight": 1.0
    },
    "careers.*.responsibilities": {
        "method": "embedding",
        "weight": 0.8,
        "embedding_model": "jhgan/ko-sroberta-multitask"
    }
}
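
To make the roles of `method` and `weight` concrete, here is a rough sketch of how such criteria could be combined into one score. It is an illustration under assumptions, not the kit's implementation: paths are matched with `fnmatchcase`, list items are compared by index rather than by Hungarian matching, and `difflib.SequenceMatcher` stands in for embedding cosine similarity so the example runs without downloading a model. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from fnmatch import fnmatchcase


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {"a.0.b": leaf} dotted paths."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(child, path))
    return flat


def field_score(method, pred, gt):
    """Score one field in [0, 1]; a missing prediction scores 0."""
    if pred is None:
        return 0.0
    if method == "exact":
        return 1.0 if pred == gt else 0.0
    # Stand-in for embedding similarity: character-level ratio
    return SequenceMatcher(None, str(pred), str(gt)).ratio()


def weighted_score(pred, gt, criteria):
    """Weighted mean of field scores over ground-truth paths matching a pattern."""
    pred_flat, gt_flat = flatten(pred), flatten(gt)
    total = weight_sum = 0.0
    for path, gt_value in gt_flat.items():
        for pattern, rule in criteria.items():
            if fnmatchcase(path, pattern):
                score = field_score(rule["method"], pred_flat.get(path), gt_value)
                total += rule["weight"] * score
                weight_sum += rule["weight"]
                break  # first matching pattern wins
    return total / weight_sum if weight_sum else 0.0


criteria = {
    "personal_info.name": {"method": "exact", "weight": 1.0},
    "careers.*.responsibilities": {"method": "embedding", "weight": 0.8},
}
gt = {"personal_info": {"name": "Kim"},
      "careers": [{"responsibilities": "backend API development"}]}
pred = {"personal_info": {"name": "Lee"},
        "careers": [{"responsibilities": "backend API development"}]}
print(round(weighted_score(pred, gt, criteria), 3))
```

With a wrong name and an identical responsibilities string, only the 0.8-weight field contributes, so the score is 0.8 / 1.8.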

Langfuse integrated monitoring

# Run with a trace ID
python main.py --cli extract \
  --input "text" \
  --trace-id "custom-trace-123"

📄 License

This project is distributed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgments

This project was influenced by the following open-source projects:

🔗 Dependency projects

LLM frameworks

  • Instructor - OpenAI structured output
  • LangChain - LLM application framework
  • LlamaIndex - Data framework
  • Marvin - AI engineering toolkit
  • Mirascope - LLM library

Parsing libraries

  • Docling - IBM document parsing
  • PDFPlumber - PDF table extraction
  • PyPDF - PDF processing
  • MarkItDown - Microsoft document conversion

Web frameworks

  • FastAPI - Modern API framework
  • Streamlit - Data app framework

Monitoring & tracing

  • Langfuse - LLM tracing and monitoring

📞 Contact

🎯 Roadmap

🚧 Development plan

v0.2.0 (planned)

  • Implement a test suite
  • Docker Compose setup
  • Add a web UI
  • Support more parsing frameworks

v0.3.0 (planned)

  • Real-time streaming processing
  • Distributed processing support
  • Cloud deployment guide
  • Performance optimization

Long-term plans

  • Multilingual schema support
  • Automatic schema generation
  • Deep-learning-based evaluation metrics
  • Plugin system

⭐ If you find this project helpful, please give it a Star! ⭐
