Skip to main content

A portable document embedding using SWEM.

Project description

SWEM

GitHub Actions PyPI Version MIT License GitHub Starts GitHub Forks

Implementation of SWEM(Simple Word-Embedding-based Models)
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms (ACL 2018)

Installation

pip install swem

Example

Examples are available in examples directory.

Functional API

from typing import List

import numpy as np
import swem
from gensim.models import KeyedVectors

if __name__ == '__main__':
    kv: KeyedVectors = KeyedVectors(vector_size=200)
    tokens: List[str] = ['I', 'have', 'a', 'pen']

    embed: np.ndarray = swem.infer_vector(
        tokens=tokens, kv=kv, method='concat'
    )
    print(embed.shape)

Japanese

from typing import List

import swem
from gensim.models import KeyedVectors


if __name__ == '__main__':
    kv: KeyedVectors = KeyedVectors(vector_size=200)
    swem_embed = swem.SWEM(kv)

    tokens: List[str] = ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']
    embed = swem_embed.infer_vector(tokens, method='max')
    print(embed.shape)

Results

(200,)

English

from typing import List

import swem
from gensim.models import KeyedVectors


if __name__ == '__main__':
    kv: KeyedVectors = KeyedVectors(vector_size=200)
    swem_embed = swem.SWEM(kv)

    tokens: List[str] = ['This', 'is', 'an', 'implementation', 'of', 'SWEM']
    embed = swem_embed.infer_vector(tokens, method='max')
    print(embed.shape)

Results

(200,)

Set random seed

SWEM generates random vector when given token is out of vocaburary. To reproduce token's embeddings, you need to set seed of NumPy.

from typing import List

import numpy as np
import swem
from gensim.models import KeyedVectors

if __name__ == '__main__':
    np.random.seed(0)
    kv: KeyedVectors = KeyedVectors(vector_size=200)
    tokens: List[str] = ['I', 'have', 'a', 'pen']

    embed: np.ndarray = swem.infer_vector(
        tokens=tokens, kv=kv, method='concat'
    )
    print(embed.shape)

Download pretained w2v and use it.

import swem
swem.download_w2v(lang='ja')
kv = swem.load_w2v(lang='ja')
Downloading w2v file to /Users/<username>/.swem/ja.zip
Extract zipfile into /Users/<username>/.swem/ja
Success to extract files.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

swem-0.4.1.tar.gz (7.3 kB view hashes)

Uploaded source

Built Distribution

swem-0.4.1-py3-none-any.whl (6.2 kB view hashes)

Uploaded py3

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Huawei Huawei PSF Sponsor Microsoft Microsoft PSF Sponsor NVIDIA NVIDIA PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page