Byte-pair embeddings in 275 languages

BPEmb

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Usage

Install BPEmb with pip:

pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

You can do two main things with BPEmb. The first is subword segmentation:

# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]
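
The `▁` character (U+2581) in the output above is the SentencePiece word-boundary marker: it prefixes a piece that starts a new word. A minimal sketch of turning pieces back into text, assuming only that convention; `pieces_to_text` is a hypothetical helper introduced here, not part of the bpemb API. The result stays lowercased, as in the `encode` output above.

```python
# Hypothetical helper (not part of bpemb): undo subword segmentation by
# joining the pieces and turning the SentencePiece marker into a space.
WORD_BOUNDARY = "\u2581"  # the "▁" character

def pieces_to_text(pieces):
    """Join subword pieces into a string, restoring spaces at word starts."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

print(pieces_to_text(["\u2581strat", "ford"]))  # stratford
```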

Whether and how a word gets split depends on the vocabulary size. Generally, a smaller vocabulary size yields a segmentation into many subwords, while a larger vocabulary size leaves frequent words unsplit:

vocabulary size    segmentation
1000               ['▁str', 'at', 'f', 'ord']
3000               ['▁str', 'at', 'ford']
5000               ['▁str', 'at', 'ford']
10000              ['▁strat', 'ford']
25000              ['▁stratford']
50000              ['▁stratford']
100000             ['▁stratford']
200000             ['▁stratford']
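
A comparison like the one above could be produced by loading one model per vocabulary size and encoding the same word with each. The sketch below assumes that `BPEmb(lang=..., vs=...)` and `encode` behave as shown earlier; `load_bpemb`, `segment_by_vocab_size`, `format_table`, and the `load_model` parameter are hypothetical names introduced here so that the loop can also be run with a stand-in model.

```python
def load_bpemb(lang, vs):
    """Load a BPEmb model; downloads files on first use (needs `pip install bpemb`)."""
    from bpemb import BPEmb  # imported lazily so the helpers below have no hard dependency
    return BPEmb(lang=lang, vs=vs)

def segment_by_vocab_size(word, vocab_sizes, lang="en", load_model=load_bpemb):
    """Map each vocabulary size to the segmentation of `word` under that model."""
    return {vs: load_model(lang, vs).encode(word) for vs in vocab_sizes}

def format_table(segmentations):
    """Render the mapping as plain-text rows, smallest vocabulary first."""
    return "\n".join(f"{vs:>7}  {pieces}" for vs, pieces in sorted(segmentations.items()))

# Usage (downloads one model per vocabulary size):
# print(format_table(segment_by_vocab_size("Stratford", [1000, 10000, 25000])))
```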

The second purpose of BPEmb is to provide pretrained subword embeddings:

# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)
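
The scores returned by `most_similar` are cosine similarities between embedding vectors. As an illustration, the sketch below computes that measure in plain Python; `cosine_similarity` is a hypothetical helper, and the short vectors are made-up toy values, not actual BPEmb embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors for illustration only.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With a loaded model, two rows of `bpemb_en.vectors` could be passed in as `u` and `v`.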

To use subword embeddings in your neural network, either encode your input into subword IDs:

>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)

Or use the embed method:

# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)
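
When subword IDs from `encode_ids` are fed to a network in batches, sequences of different lengths usually need padding to a common length before the embedding lookup. A minimal sketch in plain Python; `pad_batch` is a hypothetical helper, and the padding ID is an assumption: here it is set to the vocabulary size, which presumes that you append one extra row (for example all zeros) to the embedding matrix yourself.

```python
def pad_batch(id_seqs, pad_id):
    """Right-pad each ID sequence with `pad_id` to the length of the longest one."""
    max_len = max((len(seq) for seq in id_seqs), default=0)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in id_seqs]

# Assumption: index 100000 (the vocabulary size) is reserved for padding.
batch = pad_batch([[25950, 695, 20199], [695]], pad_id=100000)
print(batch)  # [[25950, 695, 20199], [695, 100000, 100000]]
```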

Downloads for each language

ab (Abkhazian), ace (Achinese), ady (Adyghe), af (Afrikaans), ak (Akan), als (Alemannic), am (Amharic), an (Aragonese), ang (Old English), ar (Arabic), arc (Official Aramaic), arz (Egyptian Arabic), as (Assamese), ast (Asturian), atj (Atikamekw), av (Avaric), ay (Aymara), az (Azerbaijani), azb (South Azerbaijani)

ba (Bashkir), bar (Bavarian), bcl (Central Bikol), be (Belarusian), bg (Bulgarian), bi (Bislama), bjn (Banjar), bm (Bambara), bn (Bengali), bo (Tibetan), bpy (Bishnupriya), br (Breton), bs (Bosnian), bug (Buginese), bxr (Russia Buriat)

ca (Catalan), cdo (Min Dong Chinese), ce (Chechen), ceb (Cebuano), ch (Chamorro), chr (Cherokee), chy (Cheyenne), ckb (Central Kurdish), co (Corsican), cr (Cree), crh (Crimean Tatar), cs (Czech), csb (Kashubian), cu (Church Slavic), cv (Chuvash), cy (Welsh)

da (Danish), de (German), din (Dinka), diq (Dimli), dsb (Lower Sorbian), dty (Dotyali), dv (Dhivehi), dz (Dzongkha)

ee (Ewe), el (Modern Greek), en (English), eo (Esperanto), es (Spanish), et (Estonian), eu (Basque), ext (Extremaduran)

fa (Persian), ff (Fulah), fi (Finnish), fj (Fijian), fo (Faroese), fr (French), frp (Arpitan), frr (Northern Frisian), fur (Friulian), fy (Western Frisian)

ga (Irish), gag (Gagauz), gan (Gan Chinese), gd (Scottish Gaelic), gl (Galician), glk (Gilaki), gn (Guarani), gom (Goan Konkani), got (Gothic), gu (Gujarati), gv (Manx)

ha (Hausa), hak (Hakka Chinese), haw (Hawaiian), he (Hebrew), hi (Hindi), hif (Fiji Hindi), hr (Croatian), hsb (Upper Sorbian), ht (Haitian), hu (Hungarian), hy (Armenian)

ia (Interlingua), id (Indonesian), ie (Interlingue), ig (Igbo), ik (Inupiaq), ilo (Iloko), io (Ido), is (Icelandic), it (Italian), iu (Inuktitut)

ja (Japanese), jam (Jamaican Creole English), jbo (Lojban), jv (Javanese)

ka (Georgian), kaa (Kara-Kalpak), kab (Kabyle), kbd (Kabardian), kbp (Kabiyè), kg (Kongo), ki (Kikuyu), kk (Kazakh), kl (Kalaallisut), km (Central Khmer), kn (Kannada), ko (Korean), koi (Komi-Permyak), krc (Karachay-Balkar), ks (Kashmiri), ksh (Kölsch), ku (Kurdish), kv (Komi), kw (Cornish), ky (Kirghiz)

la (Latin), lad (Ladino), lb (Luxembourgish), lbe (Lak), lez (Lezghian), lg (Ganda), li (Limburgan), lij (Ligurian), lmo (Lombard), ln (Lingala), lo (Lao), lrc (Northern Luri), lt (Lithuanian), ltg (Latgalian), lv (Latvian)

mai (Maithili), mdf (Moksha), mg (Malagasy), mh (Marshallese), mhr (Eastern Mari), mi (Maori), min (Minangkabau), mk (Macedonian), ml (Malayalam), mn (Mongolian), mr (Marathi), mrj (Western Mari), ms (Malay), mt (Maltese), mwl (Mirandese), my (Burmese), myv (Erzya), mzn (Mazanderani)

na (Nauru), nap (Neapolitan), nds (Low German), ne (Nepali), new (Newari), ng (Ndonga), nl (Dutch), nn (Norwegian Nynorsk), no (Norwegian), nov (Novial), nrm (Narom), nso (Pedi), nv (Navajo), ny (Nyanja)

oc (Occitan), olo (Livvi), om (Oromo), or (Oriya), os (Ossetian)

pa (Panjabi), pag (Pangasinan), pam (Pampanga), pap (Papiamento), pcd (Picard), pdc (Pennsylvania German), pfl (Pfaelzisch), pi (Pali), pih (Pitcairn-Norfolk), pl (Polish), pms (Piemontese), pnb (Western Panjabi), pnt (Pontic), ps (Pushto), pt (Portuguese)

qu (Quechua)

rm (Romansh), rmy (Vlax Romani), rn (Rundi), ro (Romanian), ru (Russian), rue (Rusyn), rw (Kinyarwanda)

sa (Sanskrit), sah (Yakut), sc (Sardinian), scn (Sicilian), sco (Scots), sd (Sindhi), se (Northern Sami), sg (Sango), sh (Serbo-Croatian), si (Sinhala), sk (Slovak), sl (Slovenian), sm (Samoan), sn (Shona), so (Somali), sq (Albanian), sr (Serbian), srn (Sranan Tongo), ss (Swati), st (Southern Sotho), stq (Saterfriesisch), su (Sundanese), sv (Swedish), sw (Swahili), szl (Silesian)

ta (Tamil), tcy (Tulu), te (Telugu), tet (Tetum), tg (Tajik), th (Thai), ti (Tigrinya), tk (Turkmen), tl (Tagalog), tn (Tswana), to (Tonga), tpi (Tok Pisin), tr (Turkish), ts (Tsonga), tt (Tatar), tum (Tumbuka), tw (Twi), ty (Tahitian), tyv (Tuvinian)

udm (Udmurt), ug (Uighur), uk (Ukrainian), ur (Urdu), uz (Uzbek)

ve (Venda), vec (Venetian), vep (Veps), vi (Vietnamese), vls (Vlaams), vo (Volapük)

wa (Walloon), war (Waray), wo (Wolof), wuu (Wu Chinese)

xal (Kalmyk), xh (Xhosa), xmf (Mingrelian)

yi (Yiddish), yo (Yoruba)

za (Zhuang), zea (Zeeuws), zh (Chinese), zu (Zulu)

MultiBPEmb

multi (multilingual)

Citing BPEmb

If you use BPEmb in academic work, please cite:

@InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }
