
Polyhymnia: Natural Chinese Data Augmentation


polyhymnia

Polyhymnia (/pɒliˈhɪmniə/; Greek: Πολυύμνια, lit. 'the one of many hymns'), alternatively Polymnia (Πολύμνια) was in Greek mythology the Muse of sacred poetry, sacred hymn, dance, and eloquence as well as agriculture and pantomime.

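The name Polyhymnia comes from the Greek words "poly", meaning "many", and "hymnos", meaning "praise".

If we read "praise" as "enhancement", then this project is an enhancement of "poetry", that is, augmentation of natural-language data (I say so, and no rebuttals are accepted). The project aims to give NLP engineers a set of out-of-the-box data augmentation methods.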

Survey

Installation

Local installation is recommended.

  • Local installation:
git clone https://github.com/luoy2/polyhymnia.git
cd polyhymnia
pip install .

Test the installation:

~/polyhymnia$ python38 -m unittest tests/test_methods.py -v
test_aeda (tests.test_methods.TestMethods) ... [jieba] default dict file path ../data/vocab.txt
[jieba] default dict file path ../data/vocab.txt
[jieba] load default dict ../data/vocab.txt ...
/opt/python38/lib/python3.8/site-packages/pkg_resources/__init__.py:1151: DeprecationWarning: Use of  in a future release.
  return get_provider(package_or_requirement).get_resource_stream(
[jieba] load default dict ../data/vocab.txt ...
>> Synonyms load wordseg dict [/data/wanting/.local/lib/python3.8/site-packages/synonyms/data/vocab.t
>> Synonyms on loading stopwords [/data/wanting/.local/lib/python3.8/site-packages/synonyms/data/stop
/data/wanting/.local/lib/python3.8/site-packages/synonyms/synonyms.py:104: ResourceWarning: unclosed synonyms/data/stopwords.txt' mode='r' encoding='utf-8'>
  _load_stopwords(_fin_stopwords_path)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
[Synonyms] on loading vectors [/data/comm/pkgs/polyhymnia/polyhymnia/data/words.vector] ...
/data/wanting/.local/lib/python3.8/site-packages/smart_open/smart_open_lib.py:479: DeprecationWarningopen/blob/develop/MIGRATING_FROM_OLDER_VERSIONS.rst for more information
  warnings.warn(message, category=DeprecationWarning)
/data/wanting/.local/lib/python3.8/site-packages/synonyms/word2vec.py:175: DeprecationWarning: The biputs. Use frombuffer instead
  weights = fromstring(fin.read(binary_len), dtype=REAL)
['身份证丢了怎 。么办', '身份证丢了  。么办', '身份证丢了怎。么,\t', '身份证丢了怎 么办']
ok
test_eda (tests.test_methods.TestMethods) ... /data/wanting/.local/lib/python3.8/site-packages/scipy/e n to be a power of 2.
  warnings.warn("The balance properties of Sobol' points require"
[Polyhnmnia] 2021-10-21 11:11:38,266 - DEBUG - execute tasks: 
[Polyhnmnia] 2021-10-21 11:11:38,266 - DEBUG - random_insertion: 3 times
[Polyhnmnia] 2021-10-21 11:11:38,266 - DEBUG - synonym_replacement: 2 times
[Polyhnmnia] 2021-10-21 11:11:38,266 - DEBUG - random_deletion: 2 times
[Polyhnmnia] 2021-10-21 11:11:38,266 - DEBUG - random_swap: 2 times
[Polyhnmnia] 2021-10-21 11:11:38,302 - DEBUG - random_insertion --- 谷物小麦种植
[Polyhnmnia] 2021-10-21 11:11:38,341 - DEBUG - synonym_replacement --- 玉米栽植
[Polyhnmnia] 2021-10-21 11:11:38,341 - DEBUG - random_deletion --- 小麦种植
[Polyhnmnia] 2021-10-21 11:11:38,341 - DEBUG - random_swap --- 种植小麦
[Polyhnmnia] 2021-10-21 11:11:38,341 - DEBUG - random_swap --- 种植小麦
[Polyhnmnia] 2021-10-21 11:11:38,341 - DEBUG - random_deletion --- 小麦种植
[Polyhnmnia] 2021-10-21 11:11:38,342 - DEBUG - synonym_replacement --- 小麦甘蔗
[Polyhnmnia] 2021-10-21 11:11:38,342 - DEBUG - random_insertion --- 棉花小麦种植
[Polyhnmnia] 2021-10-21 11:11:38,342 - DEBUG - random_insertion --- 农作物小麦种植
['种植小麦', '小麦种植', '谷物小麦种植', '小麦种植', '种植小麦', '农作物小麦种植', '小麦甘蔗', '棉花小
ok
test_reverse_translate (tests.test_methods.TestMethods) ... 请使用 ReverseTranslate.set_creds(appid, 
ok
test_simbert (tests.test_methods.TestMethods) ... 2021-10-21 11:11:38.407926: I tensorflow/stream_exebcudart.so.11.0
['身份证丢了,怎么办',
 '身份证丢了怎么办?',
 '身份证丢了怎么办啊',
 '身份证丢了怎么办?',
 '身份证丢了怎么办!',
 '身份证丢失怎么办?',
 '身份证丢了该怎么办',
 '身份证丢失,怎么办?']
ok

----------------------------------------------------------------------
Ran 4 tests in 11.913s

Usage

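  • simbert: data generation

    Uses simbert_v2 (introduced in the post "SimBERTv2来了!融合检索和生成的RoFormer-Sim模型", roughly "SimBERTv2 is here! A RoFormer-Sim model that unifies retrieval and generation") to generate similar sentences.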

    from polyhymnia import Simbert
    Simbert.gen("身份证丢了怎么办", 8)
    
    Out[38]: 
    ['身份证丢了怎么办?',
     '身份证丢了怎么办?',
     '身份证丢了怎么办。',
     '身份证丢失了怎么办',
     '身份证丢失了怎么办?',
     '身份证丢了咋办?',
     '身份证丢失怎么办?',
     '身份证丢失怎么办!']
    

    Generation runs on the GPU by default. To run on the CPU instead, set the CUDA_VISIBLE_DEVICES environment variable before importing the package:

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
    from polyhymnia import Simbert
    
  • reverse_translate: multi-hop back-translation

    Before use, apply for a translation API appid and appsecret at https://api.fanyi.baidu.com/product/11

    from polyhymnia import ReverseTranslate
    
    In [3]: ReverseTranslate.set_creds(appid, appSecret)
    In [4]: ReverseTranslate.gen("小麦种植", 4)   
    [Polyhnmnia] 2021-10-20 18:05:04,331 - DEBUG - start translate for: 小麦种植
    [Polyhnmnia] 2021-10-20 18:05:05,744 - DEBUG - zh -> hu -> spa -> zh: 小麦栽培
    [Polyhnmnia] 2021-10-20 18:05:07,481 - DEBUG - zh -> dan -> rom -> zh: 小麦种植
    [Polyhnmnia] 2021-10-20 18:05:08,424 - DEBUG - zh -> bul -> zh: 小麦作物
    [Polyhnmnia] 2021-10-20 18:05:09,301 - DEBUG - zh -> en -> zh: 小麦种植
    Out[4]: ['小麦种植', '小麦作物', '小麦种植;', '小麦栽培']
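
The chains in the log above (for example zh -> hu -> spa -> zh) can be sketched as a simple loop over language pairs. This is a minimal illustration and not the package's implementation: `back_translate`, `pivots`, and the injected `translate` callable are hypothetical names, and a real `translate` would have to wrap the Baidu translation API with your own credentials.

```python
from typing import Callable, Sequence


def back_translate(
    text: str,
    pivots: Sequence[str],
    translate: Callable[[str, str, str], str],
    source: str = "zh",
) -> str:
    """Translate text through each pivot language and back to the source.

    `translate(text, src, dst)` is a caller-supplied function (hypothetical);
    it is the only place where a real translation service would be called.
    """
    langs = [source, *pivots, source]
    # Walk consecutive pairs: source -> pivot1 -> ... -> source.
    for src, dst in zip(langs, langs[1:]):
        text = translate(text, src, dst)
    return text


def fake_translate(text: str, src: str, dst: str) -> str:
    """Stand-in translator that only records each hop, for demonstration."""
    return f"{text}[{src}>{dst}]"
```

Calling `back_translate("小麦种植", ["hu", "spa"], fake_translate)` walks the three hops zh -> hu, hu -> spa, spa -> zh in order. Passing the translator in as an argument keeps the chain logic testable without network access.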
    
  • EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

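    This implementation adds one modification: it uses a Sobol quasi-random sequence to generate the order of the augmentation tasks, so that the tasks are drawn randomly but evenly.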

    In [1]: from polyhymnia import EDA    
    
    In [2]: EDA.gen("小麦种植", 8)                
    
    [Polyhnmnia] 2021-10-20 18:10:24,632 - DEBUG - execute tasks: 
    [Polyhnmnia] 2021-10-20 18:10:24,632 - DEBUG - random_insertion: 3 times
    [Polyhnmnia] 2021-10-20 18:10:24,632 - DEBUG - synonym_replacement: 2 times
    [Polyhnmnia] 2021-10-20 18:10:24,632 - DEBUG - random_deletion: 2 times
    [Polyhnmnia] 2021-10-20 18:10:24,632 - DEBUG - random_swap: 2 times
    [Polyhnmnia] 2021-10-20 18:10:24,673 - DEBUG - random_insertion --- 玉米小麦种植
    [Polyhnmnia] 2021-10-20 18:10:24,717 - DEBUG - synonym_replacement --- 甜菜作物
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - random_deletion --- 小麦种植
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - random_swap --- 种植小麦
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - random_swap --- 种植小麦
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - random_deletion --- 小麦种植
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - synonym_replacement --- 大豆栽植
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - random_insertion --- 马铃薯小麦种植
    [Polyhnmnia] 2021-10-20 18:10:24,718 - DEBUG - random_insertion --- 种植小麦种植
            
    Out[2]: ['种植小麦', '小麦种植', '种植小麦种植', '玉米小麦种植', '小麦种植', '种植小麦', '甜菜作物', '大豆栽植', '马铃薯小麦种植']
    

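    Advanced API: polyhymnia.methods.noising.stacking.eda.gen

    • alpha_sr: synonym replacement probability, default 0.1

    • alpha_ri: random insertion probability, default 0.1

    • alpha_rs: random swap (word order) probability, default 0.1

    • p_rd: random deletion probability, default 0.1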
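
To make two of these parameters concrete, here is a minimal sketch of random swap and random deletion as described in the EDA paper, operating on an already tokenized sentence. It is an illustration under assumptions, not the package's code: the function names and the `rng` argument are hypothetical, and synonym replacement and random insertion are omitted because they need a synonym dictionary.

```python
import random


def random_swap(tokens, alpha_rs=0.1, rng=None):
    """Swap two randomly chosen tokens, about alpha_rs * len(tokens) times."""
    rng = rng or random.Random()
    out = list(tokens)
    if len(out) < 2:
        return out
    n_swaps = max(1, int(alpha_rs * len(out)))  # always swap at least once
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p_rd=0.1, rng=None):
    """Drop each token independently with probability p_rd."""
    rng = rng or random.Random()
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p_rd]
    # Never return an empty sentence: fall back to one random token.
    return kept or [rng.choice(list(tokens))]
```

With the two-token input `["小麦", "种植"]`, a single swap can only reverse the order, which is consistent with the `种植小麦` lines in the logs.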

    The advanced API lets you choose which tasks to combine, for example:

    from polyhymnia.methods.noising.stacking.eda import gen
    contents = '小麦种植'
    print(gen(contents, num_aug=8, p_rd=0, alpha_ri=0))
    
    [Polyhnmnia] 2021-10-20 18:15:03,438 - DEBUG - execute tasks: 
    [Polyhnmnia] 2021-10-20 18:15:03,438 - DEBUG - synonym_replacement: 4 times
    [Polyhnmnia] 2021-10-20 18:15:03,438 - DEBUG - random_swap: 4 times
    [Polyhnmnia] 2021-10-20 18:15:03,438 - DEBUG - synonym_replacement --- 小麦耕种
    [Polyhnmnia] 2021-10-20 18:15:03,438 - DEBUG - synonym_replacement --- 小麦栽植
    [Polyhnmnia] 2021-10-20 18:15:03,438 - DEBUG - random_swap --- 种植小麦
    [Polyhnmnia] 2021-10-20 18:15:03,439 - DEBUG - random_swap --- 种植小麦
    [Polyhnmnia] 2021-10-20 18:15:03,439 - DEBUG - synonym_replacement --- 小麦种植
    [Polyhnmnia] 2021-10-20 18:15:03,439 - DEBUG - synonym_replacement --- 谷物种植
    [Polyhnmnia] 2021-10-20 18:15:03,439 - DEBUG - random_swap --- 种植小麦
    [Polyhnmnia] 2021-10-20 18:15:03,439 - DEBUG - random_swap --- 种植小麦
    
    Out[6]: ['种植小麦', '小麦耕种', '种植小麦', '小麦栽植', '谷物种植', '小麦种植', '种植小麦', '种植小麦']
    

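    Note that when some probabilities are set to 0, the properties of the Sobol sequence mean that EDA distributes the augmentation runs evenly among the tasks whose probability is nonzero.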
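
The even split can be illustrated with a small, self-contained sketch. It is not the package's implementation: it swaps in the base-2 van der Corput sequence, a one-dimensional low-discrepancy sequence, in place of a library Sobol generator, and `allocate_tasks` and its arguments are hypothetical names. Exact fractions are used so that points falling on a boundary are assigned deterministically.

```python
from bisect import bisect_right
from collections import Counter
from fractions import Fraction
from itertools import accumulate


def van_der_corput(n: int) -> Fraction:
    """Return the n-th point of the base-2 van der Corput sequence in [0, 1)."""
    point, denom = Fraction(0), 1
    while n:
        denom *= 2
        n, bit = divmod(n, 2)
        point += Fraction(bit, denom)  # mirror the binary digits of n
    return point


def allocate_tasks(weights: dict, num_aug: int) -> list:
    """Pick one task name per augmentation run, skipping zero-weight tasks."""
    active = {name: Fraction(str(w)) for name, w in weights.items() if w > 0}
    if not active:
        raise ValueError("at least one task needs a nonzero weight")
    total = sum(active.values())
    names = list(active)
    # Cumulative upper bound of each task's slice of [0, 1).
    bounds = list(accumulate(w / total for w in active.values()))
    return [names[bisect_right(bounds, van_der_corput(i))] for i in range(num_aug)]


weights = {"alpha_sr": 0.1, "alpha_ri": 0, "alpha_rs": 0.1, "p_rd": 0}
print(Counter(allocate_tasks(weights, 8)))
# Counter({'alpha_sr': 4, 'alpha_rs': 4})
```

Because the points fill the unit interval evenly rather than clustering, each active task receives the same share of the 8 runs, mirroring the "4 times / 4 times" split in the log above.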

    Ref:

  • AEDA: An Easier Data Augmentation Technique for Text Classification

    In [6]: from polyhymnia import AEDA   
    In [7]: AEDA.gen("身份证丢了怎么办", 4)       
    Out[7]: ['身份证丢了怎么,办', '身份证丢了怎么办。', '身 份证丢了怎么办', '身份证丢了怎么办']
    

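    Advanced API: polyhymnia.methods.noising.insertion.aeda_text

    • fraction: ratio of punctuation marks to insert relative to the sentence length, default 1/3

    • puncs: list of punctuation marks to insert, default [",", "。", ",", "\t", " "]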
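
As a rough illustration of how these two parameters interact, the sketch below inserts between 1 and `fraction * len(tokens)` punctuation marks at random positions, following the AEDA paper. It is a hedged sketch and not the package's `aeda_text`: the function name, the pre-tokenized input, and the `rng` argument are assumptions made for the example.

```python
import random

DEFAULT_PUNCS = [",", "。", ",", "\t", " "]


def aeda_sketch(tokens, fraction=1 / 3, puncs=DEFAULT_PUNCS, rng=None):
    """Insert random punctuation marks between tokens and join the result."""
    rng = rng or random.Random()
    # Upper bound on insertions, with at least one insertion allowed.
    max_inserts = max(1, int(fraction * len(tokens)))
    n_inserts = rng.randint(1, max_inserts)
    out = list(tokens)
    for _ in range(n_inserts):
        pos = rng.randint(0, len(out))  # any gap, including both ends
        out.insert(pos, rng.choice(puncs))
    return "".join(out)
```

For example, `aeda_sketch(["身份证", "丢", "了", "怎么办"])` returns the original sentence with exactly one extra mark from `puncs`, since a third of four tokens rounds down to one insertion.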

Logging

  • Use polyhymnia._logger.LoggingFactory to configure module logging, and polyhymnia.set_verbose to switch the log level quickly.

    In [1]: import polyhymnia                               
    In [2]: polyhymnia.ReverseTranslate.set_creds(appid, appsecret)              
    In [3]: polyhymnia.ReverseTranslate.gen("你是狗吗", 3)                                             
    Out[3]: ['你是狗吗?', '你是一只狗。']
    # No log output at this point
        
    In [4]: import polyhymnia                                 
    In [5]: polyhymnia.set_verbose(True)                                               
    In [6]: polyhymnia.ReverseTranslate.gen("你是狗吗", 3)
    [Polyhnmnia] 2021-10-21 08:21:05,464 - DEBUG - start translate for: 你是狗吗
    [Polyhnmnia] 2021-10-21 08:21:06,668 - DEBUG - zh -> el -> en -> zh: 你是一只狗
    [Polyhnmnia] 2021-10-21 08:21:08,065 - DEBUG - zh -> cs -> ara -> zh: 你是狗吗
    [Polyhnmnia] 2021-10-21 08:21:09,438 - DEBUG - zh -> est -> pt -> zh: 你是狗吗
    [Polyhnmnia] 2021-10-21 08:21:10,775 - DEBUG - zh -> bul -> ru -> zh: 你是狗吗
    [Polyhnmnia] 2021-10-21 08:21:11,693 - DEBUG - zh -> en -> zh: 你是狗吗
    Out[6]: ['你是狗吗', '你是狗吗?', '你是一只狗。']
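
A verbosity switch of this kind is usually a thin wrapper around the standard `logging` module. The sketch below shows one way it could work. It is not the package's code: the logger name, the function name, and the choice of WARNING as the quiet level are assumptions, and the format string simply mirrors the log lines shown above.

```python
import logging

LOGGER_NAME = "polyhymnia_sketch"  # hypothetical logger name for this example
LOG_FORMAT = "[Polyhnmnia] %(asctime)s - %(levelname)s - %(message)s"


def set_verbose_sketch(verbose: bool) -> logging.Logger:
    """Switch the example logger between DEBUG (verbose) and WARNING (quiet)."""
    logger = logging.getLogger(LOGGER_NAME)
    if not logger.handlers:  # attach the handler only once
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    return logger
```

After `set_verbose_sketch(True)`, calls such as `logger.debug("start translate for: ...")` are emitted; after `set_verbose_sketch(False)` they are suppressed.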
    

Acknowledgements
