Ecommerce product title recognition package

Last update: Mar 03, 2022

Overview

revizor

This package solves task of splitting product title string into components, like type, brand, model and article (or SKU or product code or you name it).
Imagine classic named entity recognition, but recognition done on product titles.

Install

revizor requires python 3.8+ version on Linux or macOS, Windows isn't supported now, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"

Boring numbers

Actually, just output from flair training log:

Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789

Dataset

Model was trained on automatically annotated corpus. Since it may be affected by DMCA, we'll not publish it.
But we can give hint on how to obtain it, don't we?
Dataset can be created by scrapping any large marketplace, like goods, yandex.market or ozon.
We extract product title and table with product info, then we parse brand and model strings from product info table.
Now we have product title, brand and model. Then we can split product title by brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

product_type, product_model_plus_some_random_info = product_title.split(brand)

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'

License

This package is licensed under MIT license.

Ecommerce product title recognition package

Related tags

Overview

revizor

Install

Usage

Boring numbers

Dataset

License

Owner

Bureaucratic Labs

Bu Chatbot, Konya Bilim Merkezi Yen için tasarlanmış olan bir projedir.

Sorce code and datasets for "K-BERT: Enabling Language Representation with Knowledge Graph",

Beyond Paragraphs: NLP for Long Sequences

A Python script that compares files in directories

code for modular summarization work published in ACL2021 by Krishna et al

Script to generate VAD dataset used in Asteroid recipe

Behavioral Testing of Clinical NLP Models

Paddle2.x version AI-Writer

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Automated question generation and question answering from Turkish texts using text-to-text transformers

Pretrain CPM - 大规模预训练语言模型的预训练代码

Fastseq 基于ONNXRUNTIME的文本生成加速框架

✔👉A Centralized WebApp to Ensure Road Safety by checking on with the activities of the driver and activating label generator using NLP.

A raytrace framework using taichi language

Chatbot for the Chatango messaging platform

Code voor mijn Master project omtrent VideoBERT

KR-FinBert And KR-FinBert-SC

Open source annotation tool for machine learning practitioners.

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.