
Pipeline for extracting unregistered Korean blend words (혼성어) from corpora


wordextractor

A pipeline for extracting unregistered (not yet dictionary-listed) Korean blend words (혼성어).

Installation

# Basic install (steps 1-4)
pip install wordextractor

# With LLM annotation support (step 5)
pip install "wordextractor[llm]"

# With Naver search (step 6)
pip install "wordextractor[naver]"

# Full install
pip install "wordextractor[all]"

Pipeline overview

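| Step  | Description                                                               | Main dependencies         |
|-------|---------------------------------------------------------------------------|---------------------------|
| step1 | Extract N-gram patterns from already-registered blend words               | pandas                    |
| step2 | Build a word-form (eojeol) frequency list from the corpus                 | polars                    |
| step3 | Pattern matching + dictionary filtering (사전 필터링) + morphological analysis | ahocorasick-rs, kiwipiepy |
| step4 | Extract usage examples from the corpus                                    | polars, ahocorasick-rs    |
| step5 | LLM-assisted blend-word judgment (OpenAI Batch API)                       | openai                    |
| step6 | Search Naver News for the date of first appearance                        | selenium                  |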
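
To make the first stages more concrete, here is a minimal standard-library sketch of the idea behind step1 through step3: collect character n-grams from known blend words, count word forms in a corpus, and keep the unlisted forms that contain a pattern. The function names, the n-gram sizes, and the toy data are all assumptions for illustration. The package itself relies on pandas, polars, ahocorasick-rs, and kiwipiepy, and its actual pattern definition is likely more selective than this.

```python
from collections import Counter


def extract_ngrams(words, n_min=1, n_max=2):
    """Collect every character n-gram (n_min..n_max) found in the known words."""
    patterns = set()
    for word in words:
        for n in range(n_min, n_max + 1):
            for i in range(len(word) - n + 1):
                patterns.add(word[i:i + n])
    return patterns


def count_eojeol(sentences):
    """Count whitespace-delimited word forms (eojeol) across sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def find_candidates(counts, patterns, known_words):
    """Keep unlisted word forms that contain at least one pattern."""
    known = set(known_words)
    return {
        word: freq
        for word, freq in counts.items()
        if word not in known and any(p in word for p in patterns)
    }


# Toy data, invented for illustration only.
known_blends = ["치맥", "소맥"]
corpus = ["오늘 피맥 어때", "피맥 좋아", "치맥 먹자"]

patterns = extract_ngrams(known_blends)
counts = count_eojeol(corpus)
candidates = find_candidates(counts, patterns, known_blends)
print(candidates)  # {'피맥': 2}
```

A linear scan with `any(...)` is fine for a toy list. With many patterns and a large corpus, a multi-pattern automaton such as Aho-Corasick (which the step3 dependency list suggests) avoids testing each pattern separately.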

Usage

1. Write a configuration file

Create a config.yaml. For an example, see examples/config.yaml.

2. Run from the CLI

# Run individual steps
wordextractor -c config.yaml step1
wordextractor -c config.yaml step2

# Shorthand command
wordextractor -c config.yaml step3

# Run the full pipeline
wordextractor -c config.yaml run-all

# Run only a range of steps
wordextractor -c config.yaml run-all --start 3 --end 5

# Show the configuration
wordextractor -c config.yaml show-config
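
If the CLI needs to be driven from a script, for example to re-run only a range of steps, something like the following could work. This is a sketch that uses only the commands and flags shown above. The helper name `build_run_all_command` is hypothetical, and whether `--start` and `--end` can each be given on their own is an assumption. The actual invocation is left commented out because it needs an installed package and a real config.yaml.

```python
import subprocess


def build_run_all_command(config_path, start=None, end=None):
    """Build the argument list for `wordextractor -c <config> run-all`."""
    cmd = ["wordextractor", "-c", config_path, "run-all"]
    if start is not None:
        cmd += ["--start", str(start)]
    if end is not None:
        cmd += ["--end", str(end)]
    return cmd


cmd = build_run_all_command("config.yaml", start=3, end=5)
print(" ".join(cmd))

# Uncomment to actually run; check=True raises CalledProcessError on a non-zero exit.
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems if the config path contains spaces.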

3. Use from the Python API

from wordextractor import PipelineConfig
from wordextractor.steps.step1_extract_patterns import run as run_step1
from wordextractor.steps.step3_pattern_matching import run as run_step3

cfg = PipelineConfig.from_yaml("config.yaml")
run_step1(cfg)
run_step3(cfg)

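Required resources

  • wordlist.xlsx: list of already-registered blend words (must contain the columns 혼성어(색인표제어) and 음절 수)
  • Directory of Urimalsaem (우리말샘) XLS files (optional)
  • Corpus Parquet files (named in the SC_YYYYMM.parquet format)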
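
Before starting a long run it may help to check that the corpus files follow the expected naming. The sketch below is one way to do that with the standard library. It assumes that YYYYMM means a four-digit year followed by a two-digit month (01 to 12), and `split_corpus_files` is a hypothetical helper, not part of the package.

```python
import re
from pathlib import Path

# SC_ followed by a four-digit year and a two-digit month (01-12).
CORPUS_NAME = re.compile(r"SC_(\d{4})(0[1-9]|1[0-2])\.parquet")


def split_corpus_files(names):
    """Separate file names that match SC_YYYYMM.parquet from those that do not."""
    valid, invalid = [], []
    for name in names:
        # Only the final path component is checked, so directories are ignored.
        if CORPUS_NAME.fullmatch(Path(name).name):
            valid.append(name)
        else:
            invalid.append(name)
    return valid, invalid


valid, invalid = split_corpus_files(
    ["SC_202401.parquet", "SC_202413.parquet", "corpus/SC_202312.parquet", "notes.txt"]
)
print(valid)    # ['SC_202401.parquet', 'corpus/SC_202312.parquet']
print(invalid)  # ['SC_202413.parquet', 'notes.txt']
```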

License

MIT
