Pipeline for extracting unregistered Korean blend words (혼성어) from corpora

wordextractor

A pipeline for extracting Korean blend words (혼성어) that are not yet registered in a dictionary

Installation

# Basic installation (steps 1-4)
pip install wordextractor

# With LLM annotation support (step 5)
pip install "wordextractor[llm]"

# With Naver search support (step 6)
pip install "wordextractor[naver]"

# Full installation
pip install "wordextractor[all]"

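Pipeline overview

Step  | Description                                                       | Main dependencies
step1 | Extract N-gram patterns from already-registered blend words       | pandas
step2 | Build an eojeol (space-delimited token) frequency list from the corpus | polars
step3 | Pattern matching + dictionary filtering + morphological analysis  | ahocorasick-rs, kiwipiepy
step4 | Extract usage examples from the corpus                            | polars, ahocorasick-rs
step5 | LLM-assisted blend word judgment (OpenAI Batch API)               | openai
step6 | Search Naver News for the date of first appearance                | selenium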
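
To make the first three steps concrete, here is a minimal sketch in plain Python. It is not the package's implementation: the real steps rely on pandas, polars, ahocorasick-rs, and kiwipiepy, and the exact definition of an N-gram pattern is an assumption here (character n-grams taken from the start and end of each known blend word). The function names and sample sentences are hypothetical.

```python
from collections import Counter


def edge_ngrams(words, n=2):
    """Collect character n-grams from the start and end of each word (assumed pattern shape)."""
    patterns = set()
    for word in words:
        if len(word) >= n:
            patterns.add(word[:n])   # leading n-gram
            patterns.add(word[-n:])  # trailing n-gram
    return patterns


def eojeol_frequencies(sentences):
    """Count whitespace-delimited tokens (eojeol) across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def match_candidates(frequencies, patterns, known_words):
    """Keep tokens that are not already listed and contain at least one pattern."""
    return {
        token: count
        for token, count in frequencies.items()
        if token not in known_words and any(p in token for p in patterns)
    }


# Hypothetical inputs: two known blend words and a three-sentence corpus.
known = {"치맥", "호캉스"}
corpus = ["주말에 치맥 먹고 호캉스 갔다", "요즘은 피맥 이랑 북캉스 가 유행", "북캉스 좋다"]

patterns = edge_ngrams(known, n=2)
candidates = match_candidates(eojeol_frequencies(corpus), patterns, known)
print(candidates)  # {'북캉스': 2}
```

A linear scan over every pattern is fine for a toy corpus. For large pattern sets, a multi-pattern matcher such as Aho-Corasick (which the table lists as a dependency of step3 and step4) avoids rescanning each token once per pattern.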

Usage

1. Write a configuration file

Create a config.yaml. Example: examples/config.yaml

2. Run from the CLI

# Run individual steps
wordextractor -c config.yaml step1
wordextractor -c config.yaml step2

# Shortcut commands
wordextractor -c config.yaml step3

# Run the full pipeline
wordextractor -c config.yaml run-all

# Run only a specific range of steps
wordextractor -c config.yaml run-all --start 3 --end 5

# Show the configuration
wordextractor -c config.yaml show-config
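
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the subcommands and flags shown above. The helper name `build_command` is hypothetical, and restricting `--start`/`--end` to `run-all` is an assumption based on the examples rather than documented behaviour.

```python
import shlex


def build_command(config_path, subcommand, start=None, end=None):
    """Build an argument list for the wordextractor CLI."""
    if subcommand != "run-all" and (start is not None or end is not None):
        # The examples only show --start/--end together with run-all.
        raise ValueError("--start/--end are only shown for run-all")
    args = ["wordextractor", "-c", config_path, subcommand]
    if start is not None:
        args += ["--start", str(start)]
    if end is not None:
        args += ["--end", str(end)]
    return args


cmd = build_command("config.yaml", "run-all", start=3, end=5)
print(shlex.join(cmd))  # wordextractor -c config.yaml run-all --start 3 --end 5
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would then execute the pipeline and raise if the command exits with a non-zero status.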

3. Use the Python API

from wordextractor import PipelineConfig
from wordextractor.steps.step1_extract_patterns import run as run_step1
from wordextractor.steps.step3_pattern_matching import run as run_step3

cfg = PipelineConfig.from_yaml("config.yaml")
run_step1(cfg)
run_step3(cfg)

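Required resources

  • wordlist.xlsx: list of already-registered blend words (must contain the columns 혼성어(색인표제어) and 음절 수)
  • Directory of Urimalsaem (우리말샘) XLS files (optional)
  • Corpus Parquet files (named SC_YYYYMM.parquet)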
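
Before a long run it may be worth checking that corpus files follow the expected naming scheme. The following is a small sketch that assumes YYYYMM encodes a four-digit year and a two-digit month. The helper name `parse_corpus_month` and the sample file names are hypothetical, and the package may validate names differently.

```python
import re
from datetime import date

# SC_ + four-digit year + two-digit month + .parquet
CORPUS_NAME = re.compile(r"^SC_(\d{4})(\d{2})\.parquet$")


def parse_corpus_month(filename):
    """Return the first day of the month encoded in the name, or None if it does not fit."""
    match = CORPUS_NAME.match(filename)
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    if year < 1 or not 1 <= month <= 12:
        return None
    return date(year, month, 1)


names = ["SC_202401.parquet", "SC_202413.parquet", "notes.txt"]
for name in names:
    print(name, parse_corpus_month(name))
```

Only the first name parses to a date. The second has an out-of-range month and the third does not match the pattern, so both yield `None`.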

License

MIT
