Ecommerce product title recognition package
revizor
This package solves the task of splitting a product title string into components such as type, brand, model and vendor_code.
Think of it as classic named entity recognition, applied to product titles.
Install
revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.
$ pip install revizor
Usage
from revizor.tagger import ProductTagger
tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")
assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.vendor_code == "CY.563781.P273"
Boring numbers
These are taken straight from the flair training log:
Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766
By class:
VENDOR_CODE tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
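The per-class and aggregate scores follow from the tp/fp/fn counts in the log. The sketch below recomputes them with the standard precision, recall and F1 definitions; the names COUNTS and prf are illustrative and are not part of revizor or flair.

```python
from typing import Dict, Tuple

# (tp, fp, fn) per class, copied from the training log above
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "VENDOR_CODE": (9893, 1899, 3268),
    "BRAND": (47977, 2335, 514),
    "MODEL": (35187, 11824, 9995),
    "TYPE": (25044, 637, 443),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(prf(*counts)[2] for counts in COUNTS.values()) / len(COUNTS)

# Micro F1: sum the counts over all classes first, then compute F1 once
total_tp, total_fp, total_fn = (sum(col) for col in zip(*COUNTS.values()))
micro_f1 = prf(total_tp, total_fp, total_fn)[2]

for name, counts in COUNTS.items():
    p, r, f = prf(*counts)
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"micro={micro_f1:.4f} macro={macro_f1:.4f}")
```

Micro F1 comes out higher than macro F1 because the large, well-recognized BRAND class dominates the pooled counts, while the macro average gives the weaker MODEL and VENDOR_CODE classes equal weight.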
Dataset
The model was trained on an automatically annotated corpus. Since it may be affected by DMCA, we won't publish it.
But we can give a hint on how to obtain it, can't we?
A dataset can be created by scraping any large marketplace, like goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model. Then we can split the product title by the brand string, e.g.:
product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"
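# Split on the first occurrence of the brand only, and strip the surrounding spaces
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, 1)
)
product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'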
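The split above can be taken one step further to produce token-level labels for training. The following is only a sketch of one possible approach, not the pipeline actually used for revizor: the annotate function, the BIO tag scheme and the whitespace tokenization are all assumptions made for illustration.

```python
from typing import List, Tuple


def annotate(title: str, brand: str, model: str) -> List[Tuple[str, str]]:
    """Tag whitespace tokens of `title` with BIO labels (hypothetical helper)."""
    head, found, tail = title.partition(brand)
    if not found:
        # Brand string is not in the title: nothing can be labeled
        return [(token, "O") for token in title.split()]

    # Everything before the brand is treated as the product type
    spans = [("TYPE", head), ("BRAND", brand)]
    tail = tail.strip()
    if model and tail.startswith(model):
        spans.append(("MODEL", model))
        tail = tail[len(model):]
    # Whatever is left (memory size, color, ...) stays unlabeled
    spans.append(("O", tail))

    tagged: List[Tuple[str, str]] = []
    for label, text in spans:
        for i, token in enumerate(text.split()):
            if label == "O":
                tagged.append((token, "O"))
            else:
                prefix = "B-" if i == 0 else "I-"
                tagged.append((token, prefix + label))
    return tagged


pairs = annotate(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro"
)
for token, tag in pairs:
    print(token, tag)
```

Note that the plain prefix check on the model string is naive: a model "iPhone 12" would also match a title that continues with "iPhone 128". A real annotation script would want to compare whole tokens and discard titles where the brand or model cannot be located cleanly.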
License
This package is licensed under the MIT license.