Pipeline for extracting unregistered Korean blend words (혼성어) from corpora

Project description

wordextractor

A pipeline that extracts Korean blend words (혼성어) not yet registered in a dictionary.

Installation

# Basic installation (steps 1-4)
pip install wordextractor

# With LLM annotation support (step 5)
pip install "wordextractor[llm]"

# With Naver search support (step 6)
pip install "wordextractor[naver]"

# Full installation
pip install "wordextractor[all]"
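
After installing, it can be useful to confirm which optional extras are actually usable. The following is a minimal sketch, not part of wordextractor: the mapping from extras to importable modules (llm to openai, naver to selenium) is an assumption inferred from the dependencies listed for steps 5 and 6, and the helper name available_extras is hypothetical.

```python
from importlib.util import find_spec

# Assumed mapping from extra name to the module it is expected to provide.
# Inferred from the step 5 / step 6 dependencies; not confirmed by the package.
EXTRA_MODULES = {"llm": "openai", "naver": "selenium"}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra_name: bool}, True when the module can be imported."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```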

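Pipeline overview

Step    Description                                                        Main dependencies
step1   Extract N-gram patterns from already-registered blend words        pandas
step2   Build an eojeol (word-unit) frequency list from the corpus         polars
step3   Pattern matching + dictionary filtering + morphological analysis   ahocorasick-rs, kiwipiepy
step4   Extract usage examples from the corpus                             polars, ahocorasick-rs
step5   LLM-assisted blend word judgment (OpenAI Batch API)                openai
step6   Search Naver News for the date of first appearance                 selenium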
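
To make the idea behind steps 1 to 3 concrete, here is a toy sketch in pure Python. It is not the package's implementation: the real steps use pandas, polars, ahocorasick-rs and kiwipiepy, and the exact pattern definition is not documented here. The function names, the choice of single-syllable n-grams, and the sample words are all illustrative assumptions.

```python
from collections import Counter


def extract_ngrams(words, n=1):
    """Toy step 1: count character n-grams over known blend words."""
    counts = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def eojeol_frequencies(sentences):
    """Toy step 2: count whitespace-delimited tokens (eojeol)."""
    return Counter(token for s in sentences for token in s.split())


def match_candidates(frequencies, patterns, known_words, min_count=2):
    """Toy step 3: keep frequent tokens that contain a pattern and are not known."""
    return sorted(
        token
        for token, count in frequencies.items()
        if count >= min_count
        and token not in known_words
        and any(p in token for p in patterns)
    )


# Illustrative data only; not taken from the package's word list.
known = ["치맥", "소맥"]
corpus = ["오늘 피맥 어때", "피맥 먹자 치맥 말고"]

ngrams = extract_ngrams(known, n=1)
# Keep only n-grams shared by at least two known blends.
patterns = {gram for gram, count in ngrams.items() if count >= 2}
candidates = match_candidates(eojeol_frequencies(corpus), patterns, set(known))
print(candidates)
```

The substring scan in match_candidates is quadratic in the worst case; an Aho-Corasick automaton, as used by the real step 3, matches all patterns in a single pass over each token.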

Usage

1. Write the configuration file

Write a config.yaml. Example: examples/config.yaml

2. Run from the CLI

# Run individual steps
wordextractor -c config.yaml step1
wordextractor -c config.yaml step2

# Shorthand command
wordextractor -c config.yaml step3

# Run the full pipeline
wordextractor -c config.yaml run-all

# Run only a specific range of steps
wordextractor -c config.yaml run-all --start 3 --end 5

# Show the configuration
wordextractor -c config.yaml show-config
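
When the CLI is driven from a script, building the argument list in one place avoids quoting mistakes. The sketch below only assembles the commands shown above; build_command is a hypothetical helper, and restricting --start/--end to run-all is an assumption based on the examples, not a documented rule.

```python
import shlex


def build_command(config_path, subcommand, start=None, end=None):
    """Assemble a wordextractor argv list from the documented CLI forms."""
    if (start is not None or end is not None) and subcommand != "run-all":
        # Assumption: the range flags are only shown together with run-all.
        raise ValueError("--start/--end are only used with run-all here")
    argv = ["wordextractor", "-c", config_path, subcommand]
    if start is not None:
        argv += ["--start", str(start)]
    if end is not None:
        argv += ["--end", str(end)]
    return argv


cmd = build_command("config.yaml", "run-all", start=3, end=5)
print(shlex.join(cmd))
# prints: wordextractor -c config.yaml run-all --start 3 --end 5
```

To execute the command, pass the list to subprocess.run(cmd, check=True) so that a failing step raises an error instead of being silently ignored.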

3. Use from the Python API

from wordextractor import PipelineConfig
from wordextractor.steps.step1_extract_patterns import run as run_step1
from wordextractor.steps.step3_pattern_matching import run as run_step3

# Load the pipeline settings, then run the selected steps in order
cfg = PipelineConfig.from_yaml("config.yaml")
run_step1(cfg)
run_step3(cfg)
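
When several steps are chained from Python, a small runner that stops at the first failure and records timings can help. This is a sketch under assumptions: run_steps is a hypothetical helper, it only assumes that each step is a callable taking the config object, and the stub steps below stand in for the real run functions.

```python
import time


def run_steps(cfg, steps):
    """Run (name, callable) pairs in order; stop at the first failure."""
    results = []
    for name, step in steps:
        started = time.perf_counter()
        try:
            step(cfg)
        except Exception as exc:
            results.append((name, "failed", str(exc)))
            break
        results.append((name, "ok", f"{time.perf_counter() - started:.2f}s"))
    return results


# Stub steps for illustration; they are not part of wordextractor.
def fake_step1(cfg):
    cfg["log"].append("step1")


def fake_step3(cfg):
    raise RuntimeError("missing input from step2")


def fake_step4(cfg):
    cfg["log"].append("step4")


demo_cfg = {"log": []}
report = run_steps(
    demo_cfg, [("step1", fake_step1), ("step3", fake_step3), ("step4", fake_step4)]
)
for name, status, detail in report:
    print(name, status, detail)
```

With the real package, the same runner could be called as run_steps(cfg, [("step1", run_step1), ("step3", run_step3)]).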

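Required resources

  • wordlist.xlsx: list of already-registered blend words (requires the columns 혼성어(색인표제어), the blend word as index headword, and 음절 수, the syllable count)
  • Directory of 우리말샘 (Urimalsaem) XLS files (optional)
  • Corpus Parquet files (named SC_YYYYMM.parquet)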
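
Before a long run, the inputs listed above can be sanity-checked. The following sketch uses only the standard library and makes two assumptions: that YYYYMM means a four-digit year followed by a month from 01 to 12, and that the required columns are matched by exact name. The helper names corpus_month and missing_columns are hypothetical.

```python
import re
from pathlib import Path

# Assumed reading of the SC_YYYYMM.parquet naming scheme.
CORPUS_NAME = re.compile(r"^SC_(\d{4})(0[1-9]|1[0-2])\.parquet$")
REQUIRED_COLUMNS = ("혼성어(색인표제어)", "음절 수")


def corpus_month(filename):
    """Return (year, month) for a well-formed corpus file name, else None."""
    match = CORPUS_NAME.match(Path(filename).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    return [name for name in required if name not in columns]


print(corpus_month("data/SC_202401.parquet"))
print(corpus_month("SC_202413.parquet"))
print(missing_columns(["혼성어(색인표제어)", "기타"]))
```

The column list would typically come from the header row of wordlist.xlsx, for example via the columns attribute of a pandas DataFrame.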

License

MIT

Download files

Source Distribution

wordextractor-0.2.1.tar.gz (4.5 MB view details)

Uploaded Source

Built Distribution

wordextractor-0.2.1-py3-none-any.whl (4.5 MB view details)

Uploaded Python 3

File details

Details for the file wordextractor-0.2.1.tar.gz.

File metadata

  • Download URL: wordextractor-0.2.1.tar.gz
  • Upload date:
  • Size: 4.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for wordextractor-0.2.1.tar.gz
Algorithm Hash digest
SHA256 f793bd69bd0f4c5d8a0d614dfb3ede19cc1aff744a44099b20c476fe741af9fd
MD5 afbfd20b1b9b575a910bb0ce2644937c
BLAKE2b-256 4ec2350a68963cb9428161cad10ce45a74efb9622bb709ef11e0995cc4fe1f07
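
A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using Python's hashlib; the helper names sha256_of and matches are hypothetical, and the file is read in chunks so that large archives do not need to fit in memory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches("wordextractor-0.2.1.tar.gz", "f793bd69bd0f4c5d8a0d614dfb3ede19cc1aff744a44099b20c476fe741af9fd") should return True for an intact download of the source distribution. pip can enforce the same check at install time through its --require-hashes mode.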

File details

Details for the file wordextractor-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: wordextractor-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 4.5 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.7

File hashes

Hashes for wordextractor-0.2.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5b94f03e95e2b233e6bd3f72e3ccd0650b754aa985d58bc440a48f2e641662ae
MD5 34d894fe9a12e6ca1519a789f4683f04
BLAKE2b-256 1b3fab2c46a6d2c84cf1e5c989d025347b1a85b8f21958087630681e3c9d9f3f
