Pipeline for extracting unregistered Korean blend words (혼성어) from corpora
wordextractor
A pipeline for extracting Korean blend words (혼성어) that are not yet registered in a dictionary
Installation
# Basic install (steps 1-4)
pip install wordextractor
# With LLM annotation support (step 5)
pip install wordextractor[llm]
# With Naver search support (step 6)
pip install wordextractor[naver]
# Full install
pip install wordextractor[all]
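Pipeline overview
| Step | Description | Main dependencies |
|---|---|---|
| step1 | Extract n-gram patterns from already-registered blend words | pandas |
| step2 | Build an eojeol (space-delimited word unit) frequency list from the corpus | polars |
| step3 | Pattern matching + dictionary filtering + morphological analysis | ahocorasick-rs, kiwipiepy |
| step4 | Extract usage examples from the corpus | polars, ahocorasick-rs |
| step5 | LLM-assisted blend word judgment (OpenAI Batch API) | openai |
| step6 | Search Naver News for the date of first appearance | selenium |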
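The idea behind step1 and step3 can be illustrated with a small standard-library sketch. This is not the package's implementation (the package relies on pandas, ahocorasick-rs, and kiwipiepy); the function names `extract_ngrams` and `match_candidates`, the `min_freq` threshold, and the toy data are all hypothetical, and the naive substring test stands in for Aho-Corasick multi-pattern matching.

```python
from collections import Counter


def extract_ngrams(words, n=2):
    """Count every character n-gram that occurs in the given words."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def match_candidates(token_counts, patterns, known_words, min_freq=2):
    """Return corpus tokens that contain a pattern, are frequent enough,
    and are not already in the known-word list."""
    return sorted(
        token
        for token, freq in token_counts.items()
        if freq >= min_freq
        and token not in known_words
        and any(p in token for p in patterns)
    )


# Toy stand-ins for registered blend words (assumption, not package data).
known = ["치맥", "소맥"]
patterns = extract_ngrams(known, n=1)

# Toy eojeol frequency list, as step2 might produce.
corpus_counts = Counter({"피맥": 3, "치맥": 10, "사과": 5, "맥주": 1})
candidates = match_candidates(corpus_counts, patterns, set(known))
print(candidates)
```

In this toy run only "피맥" survives: "치맥" is already registered, "사과" contains no pattern, and "맥주" falls below the frequency threshold. A real run would also need the dictionary filtering and morphological analysis that step3 lists.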
Usage
1. Write a configuration file
Create a config.yaml file. For an example, see examples/config.yaml.
2. Run from the CLI
# Run individual steps
wordextractor -c config.yaml step1
wordextractor -c config.yaml step2
# Shorthand command
wordextractor -c config.yaml step3
# Run the full pipeline
wordextractor -c config.yaml run-all
# Run only a specific range of steps
wordextractor -c config.yaml run-all --start 3 --end 5
# Show the configuration
wordextractor -c config.yaml show-config
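When the pipeline is driven from a script, the `run-all` invocation can be wrapped in a small helper. The sketch below uses only the flags shown above (`-c`, `run-all`, `--start`, `--end`); the helper names `build_command` and `run_pipeline` are hypothetical, and it assumes that `--start` and `--end` may each be omitted independently, which the CLI may or may not allow.

```python
import subprocess


def build_command(config_path, start=None, end=None):
    """Build the argv list for a `run-all` invocation of the CLI."""
    cmd = ["wordextractor", "-c", config_path, "run-all"]
    if start is not None:
        cmd += ["--start", str(start)]
    if end is not None:
        cmd += ["--end", str(end)]
    return cmd


def run_pipeline(config_path, start=None, end=None):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(config_path, start, end), check=True)


# Building the command does not execute anything, so it is safe to inspect.
cmd = build_command("config.yaml", start=3, end=5)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with configuration paths that contain spaces.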
3. Use the Python API
from wordextractor import PipelineConfig
from wordextractor.steps.step1_extract_patterns import run as run_step1
from wordextractor.steps.step3_pattern_matching import run as run_step3
cfg = PipelineConfig.from_yaml("config.yaml")
run_step1(cfg)
run_step3(cfg)
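Required resources
- wordlist.xlsx: list of already-registered blend words (requires the columns 혼성어(색인표제어) "blend word (index headword)" and 음절 수 "syllable count")
- Directory of Urimalsaem (우리말샘) XLS files (optional)
- Corpus Parquet files (named in the form SC_YYYYMM.parquet)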
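Before running the pipeline it can help to check that the corpus directory contains files in the expected naming form. The following is a minimal sketch, assuming that YYYYMM means a four-digit year followed by a two-digit month; the names `CORPUS_NAME`, `parse_corpus_name`, and `list_corpus_files` are hypothetical and not part of the package.

```python
import re
from pathlib import Path

# Assumed reading of SC_YYYYMM.parquet: 4-digit year, month 01-12.
CORPUS_NAME = re.compile(r"^SC_(\d{4})(0[1-9]|1[0-2])\.parquet$")


def parse_corpus_name(name):
    """Return (year, month) for a name like SC_202401.parquet, else None."""
    match = CORPUS_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def list_corpus_files(directory):
    """Return the matching files in a directory, sorted chronologically."""
    found = []
    for path in Path(directory).iterdir():
        parsed = parse_corpus_name(path.name)
        if parsed is not None:
            found.append((parsed, path))
    return [path for _, path in sorted(found, key=lambda item: item[0])]


print(parse_corpus_name("SC_202401.parquet"))
```

Files that do not match the pattern are silently skipped here; a stricter check could report them instead.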
License
MIT