A portable document embedding using SWEM.

SWEM

An implementation of SWEM (Simple Word-Embedding-based Models), as described in
"Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms" (ACL 2018).
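
The pooling idea behind SWEM can be sketched in a few lines of NumPy. This is a minimal illustration of average, max, and concatenated pooling over word vectors, not this package's actual implementation. The function name `pool_word_vectors` and the method label 'avg' are assumptions made for the example; only 'max' and 'concat' appear in the examples in this README.

```python
import numpy as np


def pool_word_vectors(vectors: np.ndarray, method: str = "avg") -> np.ndarray:
    """Pool a (num_tokens, dim) matrix of word vectors into one document vector.

    Hypothetical helper for illustration; not part of the swem package.
    """
    if method == "avg":
        # Element-wise mean over all tokens -> shape (dim,)
        return vectors.mean(axis=0)
    if method == "max":
        # Element-wise maximum over all tokens -> shape (dim,)
        return vectors.max(axis=0)
    if method == "concat":
        # Mean-pooled and max-pooled vectors side by side -> shape (2 * dim,)
        return np.concatenate([vectors.mean(axis=0), vectors.max(axis=0)])
    raise ValueError(f"unknown pooling method: {method}")


# Two toy 2-dimensional "word vectors"
word_vectors = np.array([[1.0, 4.0], [3.0, 2.0]])
print(pool_word_vectors(word_vectors, "avg"))     # [2. 3.]
print(pool_word_vectors(word_vectors, "max"))     # [3. 4.]
print(pool_word_vectors(word_vectors, "concat"))  # [2. 3. 3. 4.]
```

In this sketch, concatenation doubles the output dimension, while average and max pooling keep the dimension of the word vectors.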

Installation

pip install swem

Example

Examples are available in the examples directory.

Functional API

from typing import List

import numpy as np
import swem
from gensim.models import KeyedVectors

if __name__ == '__main__':
    # Empty KeyedVectors for illustration; every token is out of vocabulary here.
    kv: KeyedVectors = KeyedVectors(vector_size=200)
    tokens: List[str] = ['I', 'have', 'a', 'pen']

    embed: np.ndarray = swem.infer_vector(
        tokens=tokens, kv=kv, method='concat'
    )
    print(embed.shape)

Japanese

from typing import List

import MeCab
import swem
from gensim.models import KeyedVectors


def tokenize_ja(text: str, args: str = '-O wakati') -> List[str]:
    tagger = MeCab.Tagger(args)
    return tagger.parse(text).strip().split(' ')


if __name__ == '__main__':
    kv = KeyedVectors.load('wiki_mecab-ipadic-neologd.kv')
    swem_embed = swem.SWEM(kv, tokenize_ja)

    doc = 'すもももももももものうち'
    embed = swem_embed.infer_vector(doc, method='max')
    print(embed.shape)

Results

(200,)

English

from typing import List

import swem
from gensim.models import KeyedVectors


def tokenize_en(text: str) -> List[str]:
    text_processed = text.replace('.', ' .').replace(',', ' ,')
    return text_processed.replace('?', ' ?').replace('!', ' !').split()


if __name__ == '__main__':
    kv = KeyedVectors.load('wiki_mecab-ipadic-neologd.kv')
    swem_embed = swem.SWEM(kv, tokenizer=tokenize_en)

    doc = 'This is an implementation of SWEM.'
    embed = swem_embed.infer_vector(doc, method='max')
    print(embed.shape)

Results

(200,)

Set random seed

SWEM generates a random vector when a given token is out of vocabulary. To reproduce the embeddings of such tokens, set NumPy's random seed before inferring vectors.

from typing import List

import numpy as np
import swem
from gensim.models import KeyedVectors

if __name__ == '__main__':
    np.random.seed(0)
    kv: KeyedVectors = KeyedVectors(vector_size=200)
    tokens: List[str] = ['I', 'have', 'a', 'pen']

    embed: np.ndarray = swem.infer_vector(
        tokens=tokens, kv=kv, method='concat'
    )
    print(embed.shape)
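
Why seeding matters can be shown with NumPy alone. The sketch below uses a hypothetical stand-in, `random_oov_vector`, for whatever the package does internally for an out-of-vocabulary token. The uniform distribution and its range are assumptions chosen for illustration, not taken from the package.

```python
import numpy as np


def random_oov_vector(dim: int) -> np.ndarray:
    """Hypothetical stand-in for an out-of-vocabulary embedding.

    Draws from NumPy's global random state, so np.random.seed controls it.
    The distribution and range are assumptions for this example.
    """
    return np.random.uniform(-0.01, 0.01, size=dim)


np.random.seed(0)
first = random_oov_vector(200)

np.random.seed(0)
second = random_oov_vector(200)

# Drawn without resetting the seed, so the global state has moved on.
unseeded = random_oov_vector(200)

print(np.array_equal(first, second))     # True: same seed, same vector
print(np.array_equal(second, unseeded))  # False: state advanced between draws
```

Because the global random state advances with every draw, the seed has to be set at the same point in the program each time, before the first vector is inferred, for results to match across runs.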

Download a pretrained w2v model and use it

import swem
swem.download_w2v(lang='ja')
kv = swem.load_w2v(lang='ja')
Downloading w2v file to /Users/<username>/.swem/ja.zip
Extract zipfile into /Users/<username>/.swem/ja
Success to extract files.
