Ecommerce product title recognition package

revizor

This package solves the task of splitting a product title string into components such as type, brand, model and vendor_code.
Think of it as classic named entity recognition, applied to product titles.

Install

revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.vendor_code == "CY.563781.P273"
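
To tag several titles at once, the call above can be wrapped in a small helper. This is only a sketch: `tag_titles` and `FIELDS` are hypothetical names that are not part of revizor, and the helper relies solely on the `predict` method and the four attributes shown in the example above.

```python
from typing import Dict, Iterable, List

# Attribute names taken from the usage example; nothing else is assumed.
FIELDS = ("type", "brand", "model", "vendor_code")


def tag_titles(tagger, titles: Iterable[str]) -> List[Dict[str, str]]:
    """Run tagger.predict on every title and collect the results as dicts."""
    rows = []
    for title in titles:
        product = tagger.predict(title)
        row = {"title": title}
        for field in FIELDS:
            row[field] = getattr(product, field)
        rows.append(row)
    return rows
```

With revizor installed, `tag_titles(ProductTagger(), titles)` should return one dict per title, which is convenient to dump to CSV or load into a data frame.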

Boring numbers

Actually, this is just the output from the flair training log:

Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
VENDOR_CODE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND          tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL          tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE           tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
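
For readers who want to check how these figures relate, the sketch below recomputes precision, recall and F1 from the tp/fp/fn counts in the log using the standard definitions. The helper name `prf` is ours, not flair's. Micro F1 pools the counts over all classes, while macro F1 is the plain average of the per-class F1 scores.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) per class, copied from the log above.
counts = {
    "VENDOR_CODE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}

per_class = {name: prf(*c) for name, c in counts.items()}

# Macro: average the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)

# Micro: sum tp, fp and fn over all classes, then compute F1 once.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = prf(total_tp, total_fp, total_fn)[2]

print(f"micro F1: {micro_f1:.4f}, macro F1: {macro_f1:.4f}")
```

Rounded to four decimals, both values match the micro and macro scores reported in the log.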

Dataset

The model was trained on an automatically annotated corpus. Since the corpus may be affected by DMCA, we won't publish it.
But we can give a hint on how to obtain it, can't we?
The dataset can be created by scraping any large marketplace, like goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model. Then we can split the product title on the brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

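# Split once on the brand and strip the whitespace left around each part.
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, 1)
)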

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'
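
The same idea can be extended to per-token labels for NER training. The sketch below is an assumption about how such an annotation step might look, not the code that produced the original corpus: `annotate_title` is a hypothetical helper, tokenization is plain whitespace splitting, and the label names are borrowed from the class names in the training log.

```python
from typing import List, Tuple


def annotate_title(title: str, brand: str, model: str) -> List[Tuple[str, str]]:
    """Label each whitespace token of the title as TYPE, BRAND, MODEL or O."""
    before, found, after = title.partition(brand)
    if not found:
        # Brand string is not in the title: nothing can be labeled reliably.
        return [(token, "O") for token in title.split()]

    labeled = [(token, "TYPE") for token in before.split()]
    labeled += [(token, "BRAND") for token in brand.split()]

    rest = after.strip()
    if rest.startswith(model):
        # Model directly follows the brand, as in the example above.
        labeled += [(token, "MODEL") for token in model.split()]
        rest = rest[len(model):]

    # Whatever remains (capacity, color, ...) gets the outside label.
    labeled += [(token, "O") for token in rest.split()]
    return labeled


for token, label in annotate_title(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro"
):
    print(token, label)
```

A real pipeline would also need to handle titles where the brand is missing, repeated, or spelled differently than in the product info table, and to locate vendor codes, which this sketch leaves out.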

License

This package is licensed under the MIT license.
