This page provides the WDC-24 Gold Standard for Product Categorization as well as a categorization of all product offers in the WDC PDC2020-C Products Corpus for public download. The gold standard consists of over 17,000 manually labeled product offers from e-commerce websites. The offers are assigned to a non-hierarchical categorization schema consisting of 24 product categories. The gold standard is the result of the master's thesis of José Luis Garza Garza. Below, José explains how he built the gold standard and how he used it to compare the performance of different machine learning techniques on the product categorization task. The best-performing model (RoBERTa Base) is used to categorize all offers in the PDC2020-C Products Corpus.
Contents
- 1. Introduction
- 2. Gold Standard Creation
- 3. Gold Standard Statistics
- 4. Benchmark Experiments and Results
- 5. Categorization of all Offers from PDC2020-C
- 6. Download
- 7. References
- 8. Feedback
1. Introduction
Product categorization is an important task in e-commerce, influencing search and recommendation systems, customer experience, and revenue generation. However, categorizing an enormous volume of products into a consistent structure is challenging due to the inherent complexity of the data and the lack of labeled examples. Product categorization comes with a unique set of challenges: product descriptions are highly variable and unstructured, and different vendors may describe the same product in very different ways, so a machine learning model must identify and categorize products accurately despite these discrepancies. In addition, multiclass classification problems with a large number of product categories significantly increase the complexity of the task.
We therefore focus on addressing these challenges by developing scalable and efficient deep learning models for product categorization that can handle the large volume and diversity of products, manage the variability in product descriptions, and outperform traditional machine learning methods. We explore the application of transformer-based deep learning models, such as BERT and RoBERTa, to this task and develop a strategy for creating the WDC-24 Gold Standard. The goal is to accurately and consistently categorize the WDC Products Corpus using transformer-based models.
2. Gold Standard Creation
To create the WDC-24 Gold Standard, we use a semi-automatic annotation strategy. A baseline model first predicts category labels for the products in the corpus; the model's outputs act as preliminary category assignments for these products. However, these initial labels should not be accepted without further review: a verification process is paramount to ensure the validity of the categorization predictions. We therefore introduce a manual verification step following the model's predictions, in which we take a subset of clusters and verify whether the predicted category aligns with the product descriptions.
First, we established a baseline model following the experimental procedure, training data, and testing data introduced in the WDC-25 Gold Standard for Product Categorization. The experiment uses a non-hierarchical categorization schema and Logistic Regression with a One-vs-Rest approach, a widely used and easily interpretable machine learning model. This technique is well suited for multiclass classification because it fits a separate model for each category, treating it as a binary classification problem. Furthermore, the method is computationally efficient, which was crucial given the size of the datasets used in our task.
After the baseline model has predicted the product categories for the WDC Products Corpus, majority voting is applied as a post-processing step that exploits the underlying structure of the data and increases confidence in the final category assignments. The WDC Products Corpus groups products into clusters based on similarity. These clusters are derived from product co-occurrences across various e-commerce sites, which means that products within the same cluster typically belong to the same category. However, due to the diversity of e-commerce offers and their descriptions, some products within a cluster may be misclassified into different categories. Majority voting establishes a democratic rule within each cluster: the category predicted most frequently within a cluster is deemed the majority vote and assigned as the final category for all products in that cluster, as sketched below. This lets us leverage the inherent cluster structure of the data to rectify inconsistencies in the model's predictions, improving the quality of the output in two ways. Firstly, it reduces noise in the predicted categories, as the majority vote within each cluster overrules individual misclassifications. Secondly, by aligning the category assignments with the underlying structure of the data, we ensure that the model's outputs are both theoretically sound and practical.
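The following minimal sketch illustrates this voting step in Python, assuming the offer-level predictions are available as a DataFrame; the column names and toy data are illustrative, not the actual corpus schema.

```python
import pandas as pd

# Toy offer-level predictions; `cluster_id` and `predicted_category` are
# illustrative column names, not the actual corpus schema.
df = pd.DataFrame({
    "cluster_id":         [1, 1, 1, 2, 2],
    "predicted_category": ["Books", "Books", "Toys and Games",
                           "Shoes", "Shoes"],
})

# Most frequent predicted label per cluster; ties are broken by taking the
# first value returned by Series.mode().
majority = (df.groupby("cluster_id")["predicted_category"]
              .agg(lambda s: s.mode().iloc[0]))

# Assign the majority label to every offer in its cluster.
df["final_category"] = df["cluster_id"].map(majority)
print(df)  # cluster 1 -> Books for all three offers, cluster 2 -> Shoes
```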
Our next step is to define a strategy for the manual verification. The selection for this verification stage is not arbitrary: we focus on clusters that contain between 5 and 30 products, as such clusters are narrow enough to be likely to contain products that are indeed related (see the selection sketch below). These clusters are a subset of those labeled by the baseline model's predictions. One result of the manual verification process is an imbalanced dataset: because some categories receive more correct predictions than others, the number of verified clusters per category can vary significantly. During the analysis and categorization of the data, several assumptions were made about the classification of certain products; these assumptions were necessary to create a unified category structure and to aid the interpretation of the data. To mitigate the imbalance and ensure that the models are well trained across all categories, we verify additional clusters from the minority classes.
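A possible selection step, sketched under the assumption that the baseline predictions are available as a CSV file with a `cluster_id` column (both the file name and the column name are placeholders):

```python
import pandas as pd

# Sketch of the verification selection; file name and column names
# are placeholders for the corpus-wide baseline output.
predictions = pd.read_csv("baseline_predictions.csv")  # hypothetical file

sizes = predictions.groupby("cluster_id").size()
eligible = sizes[sizes.between(5, 30)].index   # clusters with 5-30 offers
to_verify = predictions[predictions["cluster_id"].isin(eligible)]
```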
3. Gold Standard Statistics
The WDC-24 Gold Standard contains at least 70 clusters per category, contributing to a more balanced dataset with 2,620 unique clusters, 17,580 products, and an average cluster size of 6.67 products. Table 1 gives a complete overview of the distribution of product categories within the WDC-24 Gold Standard after the manual verification. This analysis helps us understand the size and class distribution of the labeled corpus and allows us to define a suitable sampling approach for creating the training, validation, and testing sets. The WDC-24 Gold Standard contains 24 categories; the 'Not Found' category from the original taxonomy was excluded because of the low number of predicted labels.
Category Label | # Offers | # Clusters | Avg Cluster Size | Max Cluster Size |
---|---|---|---|---|
Home and Garden | 1,389 | 181 | 7.67 | 25 |
Toys and Games | 1,318 | 194 | 6.79 | 30 |
Computers and Accessories | 1,305 | 161 | 8.11 | 25 |
Sports and Outdoors | 1,144 | 172 | 6.65 | 22 |
Other Electronics | 915 | 118 | 7.75 | 28 |
Shoes | 735 | 122 | 6.02 | 13 |
Clothing | 733 | 116 | 6.32 | 15 |
CDs and Vinyl | 716 | 114 | 6.28 | 13 |
Cell Phones and Accessories | 694 | 98 | 7.08 | 29 |
Automotive | 667 | 92 | 7.25 | 21 |
Pet Supplies | 664 | 99 | 6.71 | 20 |
Books | 654 | 105 | 6.23 | 16 |
Jewelry | 642 | 103 | 6.23 | 16 |
Office Products | 613 | 98 | 6.26 | 11 |
Camera and Photo | 604 | 82 | 7.36 | 20 |
Health and Beauty | 588 | 99 | 5.94 | 11 |
Video Games | 583 | 91 | 6.41 | 16 |
Movies and TV | 556 | 90 | 6.18 | 16 |
Tools and Home Improvement | 555 | 92 | 6.03 | 20 |
Musical Instruments | 546 | 90 | 6.07 | 11 |
Grocery and Gourmet Food | 546 | 90 | 6.07 | 14 |
Baby | 498 | 72 | 6.92 | 22 |
Luggage and Travel Gear | 486 | 82 | 5.93 | 19 |
Others | 429 | 72 | 5.96 | 16 |
Totals | 17,580 | 2,633 | 6.67 | - |
4. Benchmark Experiments and Results
This section presents the approach we followed to train our selected models and evaluate their performance using the WDC-24 Gold Standard. We analyze a diverse range of models to gain a thorough understanding of their performance across different architectures. This analysis helps us determine which model is best suited for the large-scale experiment on the latest WDC Products Corpus (2020).
- Traditional Machine Learning Model
- Logistic Regression with One-vs-Rest Approach: A grid search is performed to find the best combination of hyperparameters. The searched parameters include the maximum number of features for TF-IDF, the regularization parameter (C), the solver algorithm, and the maximum number of iterations. After defining the hyperparameter grid, we build a pipeline and run the search with 5-fold cross-validation to identify the most suitable configuration (a sketch of this search follows below). The Logistic Regression classifier performed best with a C value of 10, the 'newton-cg' solver, and 500 iterations, with the TF-IDF vectorizer limited to 50,000 features. The model configured with these hyperparameters is our final baseline.
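The sketch below shows how such a search might be set up with scikit-learn. The candidate grids and the `load_gold_standard` helper are assumptions; only the winning hyperparameter values are taken from the text above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

texts, labels = load_gold_standard()  # hypothetical loader for the training set

# TF-IDF features feeding a One-vs-Rest Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Candidate grids are assumptions; the winning values match the text.
param_grid = {
    "tfidf__max_features": [10_000, 50_000, 100_000],
    "clf__estimator__C": [0.1, 1, 10],
    "clf__estimator__solver": ["lbfgs", "newton-cg"],
    "clf__estimator__max_iter": [100, 500],
}

search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="f1_micro", n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_)  # best found: C=10, 'newton-cg', 500 iterations,
                            # 50,000 TF-IDF features
```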
- Transformer-based Models
BERT Variants

All three BERT models (Base Uncased, Large Uncased, and Base Multilingual Uncased) use the BERT tokenizer, which employs the WordPiece tokenization algorithm. This algorithm breaks words into smaller subwords to handle out-of-vocabulary words more effectively. For the classification task, these models use BertForSequenceClassification. With this groundwork established, the distinct training configurations of the individual models are given below; a fine-tuning sketch follows the list.
- BERT Base Uncased: The model was trained for 10 epochs, meaning the entire dataset passes through the neural network ten times. The batch size for both training and evaluation is 64, meaning that 64 samples from the dataset are used at each step of the gradient update during training, and model performance is evaluated over batches of 64 samples during evaluation. The learning rate for this model is set to 5e-5, which determines how much the model changes in response to the estimated error each time the model weights are updated. The evaluation and saving of the model are performed at the end of each epoch, and the model that achieved the highest Micro F1 Score is loaded at the end of training.
- BERT Large Uncased: This model was trained for a total of 5 epochs due to its more extensive size and computational demands. The training and evaluation batch sizes are decreased to 16 to compensate for the larger model size. Despite the size differences, the learning rate is kept consistent with the BERT Base model at 5e-5. Evaluation, saving, and loading of the best model are handled similarly to the BERT Base model, with these actions occurring at the end of each epoch and the best model determined by the highest Micro F1 Score.
- BERT Base Multilingual Uncased: This model follows a similar training configuration to the previous two models. It is trained for 10 epochs, with training and evaluation batch sizes of 64. The learning rate is maintained at 5e-5, and the best model, decided by the highest Micro F1 Score, is loaded at the end of training. Evaluation and saving of the model are performed at the conclusion of each epoch.
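The following sketch shows how the BERT Base Uncased configuration above could be expressed with the Hugging Face Trainer API. Dataset preparation is omitted; `train_ds`, `val_ds`, and the output directory are assumptions, and argument names may differ across transformers versions.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=24)  # 24 WDC-24 categories

def compute_metrics(eval_pred):
    # Micro and Macro F1 over the predicted category labels.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"micro_f1": f1_score(labels, preds, average="micro"),
            "macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="bert-base-wdc24",      # hypothetical output directory
    num_train_epochs=10,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=5e-5,
    evaluation_strategy="epoch",       # evaluate at the end of each epoch
    save_strategy="epoch",             # save a checkpoint at each epoch
    load_best_model_at_end=True,       # reload the best checkpoint afterwards
    metric_for_best_model="micro_f1",
)

# `train_ds` and `val_ds` are assumed tokenized datasets with a `labels` column.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  compute_metrics=compute_metrics)
trainer.train()
```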
RoBERTa Variants

Both RoBERTa models (Base and Large) use the RoBERTa tokenizer, which handles all preprocessing from raw text to input sequences. Like the BERT models, the RoBERTa models use a variant of the BERT sequence classification model, RobertaForSequenceClassification: a pre-trained transformer model with a single linear classification layer on top. The XLM-RoBERTa model, in contrast, uses the AutoTokenizer, which automatically determines and applies the appropriate tokenizer based on the model architecture, and employs XLMRobertaForSequenceClassification for the classification task. Each model was fine-tuned with its respective configuration, taking advantage of efficient tokenization and powerful sequence classification capabilities. The individual training configurations are listed below, followed by a short loading sketch.
- RoBERTa Base: This model was trained over 10 epochs using a batch size of 32 for the training and evaluation phases. The learning rate was set to 5e-5, and the model that achieved the highest Micro F1 Score was loaded at the end of training.
- RoBERTa Large: Given its larger computational requirements, this model underwent a more limited training regimen of 5 epochs. The learning rate was kept consistent with the other models at 5e-5, while the batch size was reduced to 16 for training and evaluation to accommodate the increased resource usage. As with the other models, the checkpoint achieving the highest Micro F1 Score was loaded at the end of training.
- XLM-RoBERTa: This model was trained for 10 epochs with a batch size of 32 for both training and evaluation and a learning rate of 5e-5. The highest Micro F1 Score determined the best model, which was loaded at the end of training.
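A short sketch of loading the three RoBERTa-family models with the classes named above; the checkpoint names are the standard Hugging Face model identifiers, and fine-tuning would then proceed analogously to the BERT sketch, with the hyperparameters listed for each variant.

```python
from transformers import (AutoTokenizer, RobertaForSequenceClassification,
                          RobertaTokenizer, XLMRobertaForSequenceClassification)

# Each model gets a 24-class classification head for the WDC-24 categories.
roberta_base = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=24)
roberta_large = RobertaForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=24)
xlm_roberta = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=24)

roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# AutoTokenizer picks the matching tokenizer from the model name.
xlm_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
```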
The performance results for each model, after being trained with their respective configurations, are presented in Table 2.
Model | Micro F1-Score | Macro F1-Score |
---|---|---|
Logistic Regression with OvR | 0.85 | 0.85 |
BERT Base Uncased | 0.88 | 0.88 |
BERT Large Uncased | 0.88 | 0.88 |
BERT Base Multilingual | 0.86 | 0.85 |
RoBERTa Base | 0.88 | 0.89 |
RoBERTa Large | 0.88 | 0.88 |
XLM-RoBERTa | 0.88 | 0.87 |
Selection of the Best Performing Model
Factors such as model complexity, generalization capability, efficiency, and scalability play significant roles in making an informed decision. After analyzing the results above, we choose the RoBERTa Base model: following Occam's razor, a simpler model that provides accurate predictions is preferable to a larger one with comparable scores. RoBERTa Base has demonstrated excellent generalization across the diverse categories, and it offers computational efficiency and scalability as the data volume increases. Considering these factors, RoBERTa Base emerges as the most suitable choice for the large-scale experiment.
5. Categorization of all Offers in the PDC2020-C Corpus
We use the best-performing model to categorize all 22 million offers in the WDC PDC2020-C product corpus. Table 3 and Figure 1 show the outcome of this large-scale experiment using the RoBERTa Base model together with majority voting as the final classification step on the WDC PDC2020-C Product Corpus; a sketch of the batched inference follows Table 3.
Category | # Offers | # Clusters | Avg Cluster Size | % Clusters > 1 Product |
---|---|---|---|---|
Clothing | 3,716,240 | 3,617,411 | 1.03 | 2.21 |
Home and Garden | 2,953,284 | 2,789,280 | 1.06 | 4.10 |
Sports and Outdoors | 1,884,730 | 1,779,077 | 1.06 | 4.01 |
Office Products | 1,336,911 | 1,266,785 | 1.06 | 3.64 |
Jewelry | 1,310,126 | 1,246,801 | 1.05 | 3.55 |
Tools and Home Improvement | 1,190,917 | 1,147,093 | 1.04 | 2.79 |
Health and Beauty | 1,048,407 | 904,705 | 1.16 | 9.27 |
Automotive | 937,906 | 886,365 | 1.06 | 4.40 |
Toys and Games | 925,174 | 845,964 | 1.09 | 4.61 |
Books | 866,078 | 802,832 | 1.08 | 5.85 |
Shoes | 817,529 | 777,374 | 1.05 | 3.36 |
Grocery and Gourmet Food | 799,108 | 724,006 | 1.10 | 7.71 |
Computers and Accessories | 609,701 | 533,220 | 1.14 | 7.79 |
Other Electronics | 605,783 | 557,814 | 1.09 | 5.25 |
Pet Supplies | 477,353 | 438,148 | 1.09 | 5.40 |
Luggage and Travel Gear | 434,211 | 417,249 | 1.04 | 3.11 |
Musical Instruments | 423,967 | 393,570 | 1.08 | 5.23 |
Baby | 354,002 | 332,050 | 1.07 | 4.84 |
Cell Phones and Accessories | 332,424 | 310,285 | 1.07 | 4.75 |
CDs and Vinyl | 315,851 | 289,000 | 1.09 | 6.53 |
Camera and Photo | 260,459 | 233,486 | 1.12 | 7.15 |
Others | 183,658 | 156,559 | 1.17 | 8.42 |
Video Games | 88,780 | 84,024 | 1.06 | 3.55 |
Movies and TV | 56,611 | 52,143 | 1.09 | 5.52 |
Totals | 21,929,210 | 20,585,241 | 1.06 | - |
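A minimal sketch of how the batched inference over the corpus might look; the fine-tuned checkpoint path, the batch size, and the `load_offer_texts` helper are assumptions.

```python
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "wdc24-roberta-base").to(device).eval()  # hypothetical fine-tuned checkpoint

@torch.no_grad()
def predict(batch):
    # Tokenize a batch of offer texts and return the predicted category ids.
    enc = tokenizer(batch, truncation=True, padding=True,
                    max_length=128, return_tensors="pt").to(device)
    return model(**enc).logits.argmax(dim=-1).tolist()

offers = load_offer_texts()  # hypothetical loader for the 22M offer texts
predictions = []
for start in range(0, len(offers), 256):
    predictions.extend(predict(offers[start:start + 256]))
# Cluster-level majority voting (see the sketch in Section 2) is then applied
# to turn these offer-level predictions into final category assignments.
```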
To verify the performance of the RoBERTa Base model on the whole corpus, we manually checked a sample of 10 offers per category (240 records), considering only products that provide all three of the following features: Title, Brand, and Description (see the sampling sketch below). Overall, the experiment reaches an accuracy of 72.04%. However, the performance varies significantly across categories, highlighting both the strengths of the model and areas for improvement in its predictive capability. Figure 2 shows the results per category in descending order.
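The sampling step could be sketched as follows; `results` and its column names are assumptions about the prediction output joined back to the offer text, not the published schema.

```python
import pandas as pd

def verification_sample(results: pd.DataFrame,
                        per_category: int = 10) -> pd.DataFrame:
    # Keep only offers where Title, Brand and Description are all present.
    complete = results.dropna(subset=["Title", "Brand", "Description"])
    # Draw 10 offers per predicted category: 24 x 10 = 240 records.
    return (complete.groupby("predicted_category", group_keys=False)
                    .apply(lambda g: g.sample(n=per_category, random_state=42)))
```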
6. Download
Below, we provide the WDC-24 Gold Standard for Product Categorization for public download, together with the training, validation, and testing sets used for all experiments. Running our best-performing model, RoBERTa Base, on the WDC Products Corpus produced two output files: one contains the results obtained with majority voting (cluster ID and category label), the other the results without majority voting (offer ID and category label). A sketch of reading these files follows the table.
File | Size | Download |
---|---|---|
WDC-24 Gold Standard | 11.7MB | WDC_24_GoldStandard.csv |
WDC-24 Gold Standard Training Set | 9.3MB | WDC_24_GoldStandard_TrainingSet.csv |
WDC-24 Gold Standard Validation Set | 1.1MB | WDC_24_GoldStandard_ValidationSet.csv |
WDC-24 Gold Standard Testing Set | 1.2MB | WDC_24_GoldStandard_TestingSet.csv |
Large Scale Experiment on WDC Product Corpus with Majority Voting Using RoBERTa Base Model | 133.6MB | WDC_Corpus_LargeScaleExperiment_MajorityVoting.json.gz |
Large Scale Experiment on WDC Product Corpus without Majority Voting Using RoBERTa Base Model | 82.8MB | WDC_Corpus_LargeScaleExperiment_WithoutMajorityVoting.json.gz |
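For working with the two output files, a minimal reading sketch is given below, assuming one JSON object per line; the field names are assumptions about the schema.

```python
import gzip
import json

with gzip.open("WDC_Corpus_LargeScaleExperiment_MajorityVoting.json.gz",
               "rt", encoding="utf-8") as f:
    for line in f:                 # assuming one JSON object per line
        record = json.loads(line)
        cluster_id = record.get("cluster_id")  # assumed field name
        category = record.get("category")      # assumed field name
```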
7. References
[1] Anna Primpeli, Ralph Peeters, and Christian Bizer. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Companion Proceedings of The 2019 World Wide Web Conference, WWW '19, pages 381-386, New York, NY, USA, May 2019. Association for Computing Machinery.
[2] Ralph Peeters, Reng Chiz Der, and Christian Bizer. WDC Products: A Multi-Dimensional Entity Matching Benchmark. January 2023. arXiv:2301.09521 [cs].
8. Feedback
Please send questions and feedback to the Web Data Commons Google Group.
More information about Web Data Commons can be found here.