Pipeline for extracting unregistered Korean blend words (혼성어) from corpora
wordextractor
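Extraction pipeline for Korean blend words (혼성어) that are not yet registered in dictionaries
Installation
# Basic install (steps 1-4)
pip install wordextractor
# With LLM annotation support (step 5)
pip install wordextractor[llm]
# With Naver search support (step 6)
pip install wordextractor[naver]
# Full install
pip install wordextractor[all]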
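Pipeline overview
| Step | Description | Main dependencies |
|---|---|---|
| step1 | Extract N-gram patterns from already-registered blend words | pandas |
| step2 | Build a word (eojeol) frequency list from the corpus | polars |
| step3 | Pattern matching + dictionary filtering + morphological analysis | ahocorasick-rs, kiwipiepy |
| step4 | Extract usage examples from the corpus | polars, ahocorasick-rs |
| step5 | LLM-assisted blend word judgment (OpenAI Batch API) | openai |
| step6 | Search Naver News for the date of first appearance | selenium |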
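The first three steps can be pictured with a standard-library sketch. This is not the package's implementation: the n-gram definition (leading and trailing character n-grams), the function names, and the sample words are assumptions made for illustration, and plain substring checks stand in for ahocorasick-rs and kiwipiepy.

```python
from collections import Counter


def extract_ngram_patterns(blends, n=1):
    """Step 1 (assumed form): count leading/trailing character n-grams."""
    patterns = Counter()
    for word in blends:
        if len(word) < n:
            continue
        patterns[word[:n]] += 1   # leading n-gram
        patterns[word[-n:]] += 1  # trailing n-gram
    return patterns


def count_eojeol(lines):
    """Step 2: frequency table of whitespace-delimited tokens (eojeol)."""
    freq = Counter()
    for line in lines:
        freq.update(line.split())
    return freq


def match_candidates(freq, patterns, known_words):
    """Step 3 (simplified): keep unlisted tokens that contain a pattern."""
    candidates = {}
    for token, count in freq.items():
        if token in known_words:
            continue  # dictionary filter: already registered
        hits = sorted(p for p in patterns if p in token)
        if hits:
            candidates[token] = (count, hits)
    return candidates


# Illustrative data only, not taken from the package or its corpus.
known_blends = ["치맥", "짜파구리"]
corpus = ["오늘 저녁은 치맥 어때", "피맥 좋아 피맥 가자"]

patterns = extract_ngram_patterns(known_blends, n=1)
freq = count_eojeol(corpus)
candidates = match_candidates(freq, patterns, set(known_blends))
print(candidates)
```

A substring scan like this costs one pass per pattern per token. An Aho-Corasick automaton, which the table lists as the step 3 dependency, matches all patterns in a single pass over each token, which matters once the pattern set and the corpus grow.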
Usage
1. Write a configuration file
Create a config.yaml. Example: examples/config.yaml
2. Run from the CLI
# Run individual steps
wordextractor -c config.yaml step1
wordextractor -c config.yaml step2
# Shortcut command
wordextractor -c config.yaml step3
# Run the full pipeline
wordextractor -c config.yaml run-all
# Run only a specific range of steps
wordextractor -c config.yaml run-all --start 3 --end 5
# Show the configuration
wordextractor -c config.yaml show-config
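When the CLI is driven from a script, it can help to build the argument list in one place. The helper below is a hypothetical convenience, not part of wordextractor. It uses only the subcommands and flags shown above, and it assumes that --start and --end may each be given on their own, which the examples do not confirm.

```python
import shlex


def build_command(config, subcommand, start=None, end=None):
    """Return an argv list for the wordextractor CLI (hypothetical helper)."""
    argv = ["wordextractor", "-c", config, subcommand]
    if subcommand == "run-all":
        # Assumption: the range flags are optional and independent.
        if start is not None:
            argv += ["--start", str(start)]
        if end is not None:
            argv += ["--end", str(end)]
    return argv


cmd = build_command("config.yaml", "run-all", start=3, end=5)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True), so that a failing step raises instead of being silently ignored.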
3. Use the Python API
from wordextractor import PipelineConfig
from wordextractor.steps.step1_extract_patterns import run as run_step1
from wordextractor.steps.step3_pattern_matching import run as run_step3
cfg = PipelineConfig.from_yaml("config.yaml")
run_step1(cfg)
run_step3(cfg)
Required resources
- wordlist.xlsx: list of already-registered blend words (the columns 혼성어(색인표제어) and 음절 수 are required)
- Directory of 우리말샘 XLS files (optional)
- Corpus Parquet files (named SC_YYYYMM.parquet)
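Before a long run it may be worth checking that the corpus files follow the expected naming scheme. The sketch below assumes YYYYMM means a four-digit year followed by a two-digit month from 01 to 12. The function name is hypothetical and the package may validate file names differently, or not at all.

```python
import re
from pathlib import Path

# Assumed reading of SC_YYYYMM.parquet: 4-digit year, month 01-12.
CORPUS_NAME = re.compile(r"SC_(\d{4})(0[1-9]|1[0-2])\.parquet")


def parse_corpus_month(filename):
    """Return (year, month) for a valid corpus file name, else None."""
    match = CORPUS_NAME.fullmatch(Path(filename).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_corpus_month("SC_202401.parquet"))
print(parse_corpus_month("SC_202413.parquet"))
```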
License
MIT