This page provides the WDC-24 Gold Standard for Product Categorization as well as a categorization of all product offers in the WDC PDC2020-C Products Corpus for public download. The gold standard consists of over 17,000 manually labeled product offers from e-commerce websites. The offers are assigned to a non-hierarchical categorization schema consisting of 24 product categories. The gold standard is the result of the master's thesis of José Luis Garza Garza. Below, José explains how he built the gold standard and how he used it to compare the performance of different machine learning techniques on the product categorization task. The best-performing model (RoBERTa Base) is used to categorize all offers in the PDC2020-C Products Corpus.
Contents
- 1. Introduction
- 2. Gold Standard Creation
- 3. Gold Standard Statistics
- 4. Benchmark Experiments and Results
- 5. Categorization of all Offers from PDC2020-C
- 6. Download
- 7. References
- 8. Feedback
1. Introduction
Product categorization is an important task in e-commerce, influencing search and recommendation systems, customer experience, and revenue generation. However, categorizing an enormous volume of products into a consistent structure is challenging due to the inherent complexity of the data and the lack of labeled examples. Product categorization comes with a unique set of challenges: product descriptions are highly variable and unstructured, and different vendors may describe the same product in very different ways, so a machine learning model must identify and categorize products accurately despite these discrepancies. In addition, multiclass classification problems with a large number of product categories significantly increase the complexity of the task.
We therefore focus on addressing these challenges by developing scalable and efficient deep learning models for product categorization that can handle the large volume and diversity of products, manage the variability in product descriptions, and outperform traditional machine learning methods. We explore the application of transformer-based deep learning models, such as BERT and RoBERTa, to this task and develop a strategy for creating the WDC-24 Gold Standard. The goal is to accurately and consistently categorize the WDC Products Corpus using transformer-based models.
2. Gold Standard Creation
To create the WDC-24 Gold Standard, we use a semi-automatic annotation strategy. A baseline model first predicts category labels for the products in the corpus; the model's outputs act as preliminary category assignments for these products. However, these initial labels should not be accepted without further review: a verification process is paramount to ensure the validity of the categorization predictions. We therefore introduce a manual verification step following the model's predictions, in which we take a subset of clusters and verify whether the predicted category aligns with the product descriptions.
First, we established a baseline model following the experimental procedure, training data, and testing data introduced in the WDC-25 Gold Standard for Product Categorization. The experiment uses a non-hierarchical categorization schema and Logistic Regression with a One-vs-Rest approach, a widely used and easily interpretable machine learning model. This technique is well suited for multiclass classification because it fits a separate model for each category, treating it as a binary classification problem. Furthermore, the method is computationally efficient, which was crucial given the size of the datasets used in our task.
After the baseline model has predicted the product categories for the WDC Products Corpus, majority voting is applied as a post-processing step that exploits the underlying structure of the data and increases confidence in the final category assignments. The WDC Products Corpus groups products into clusters based on similarity. These clusters are derived from product co-occurrences across various e-commerce sites, which means that products within the same cluster typically belong to the same category. However, due to the diversity of e-commerce offers and their descriptions, some products within a cluster may be misclassified into different categories. Majority voting establishes a democratic rule within each cluster: the category predicted most frequently within a cluster is deemed the majority vote and assigned as the final category for all products in that cluster, as sketched below. This lets us leverage the inherent cluster structure of the data to rectify inconsistencies in the model's predictions, improving the quality of the output in two ways. Firstly, it reduces noise in the predicted categories, as the majority vote within each cluster overrules individual misclassifications. Secondly, by aligning the category assignments with the underlying structure of the data, we ensure that the model's outputs are both theoretically sound and practical.
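The following minimal sketch illustrates this voting step in Python, assuming the offer-level predictions are available as a DataFrame; the column names and toy data are illustrative, not the actual corpus schema.

```python
import pandas as pd

# Toy offer-level predictions; `cluster_id` and `predicted_category` are
# illustrative column names, not the actual corpus schema.
df = pd.DataFrame({
    "cluster_id":         [1, 1, 1, 2, 2],
    "predicted_category": ["Books", "Books", "Toys and Games",
                           "Shoes", "Shoes"],
})

# Most frequent predicted label per cluster; ties are broken by taking the
# first value returned by Series.mode().
majority = (df.groupby("cluster_id")["predicted_category"]
              .agg(lambda s: s.mode().iloc[0]))

# Assign the majority label to every offer in its cluster.
df["final_category"] = df["cluster_id"].map(majority)
print(df)  # cluster 1 -> Books for all three offers, cluster 2 -> Shoes
```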
Our next step is to define a strategy for the manual verification. The selection for this verification stage is not arbitrary: we focus on clusters that contain between 5 and 30 products, as such clusters are narrow enough to be likely to contain products that are indeed related (see the selection sketch below). These clusters are a subset of those labeled by the baseline model's predictions. One result of the manual verification process is an imbalanced dataset: because some categories receive more correct predictions than others, the number of verified clusters per category can vary significantly. During the analysis and categorization of the data, several assumptions were made about the classification of certain products; these assumptions were necessary to create a unified category structure and to aid the interpretation of the data. To mitigate the imbalance and ensure that the models are well trained across all categories, we verify additional clusters from the minority classes.
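A possible selection step, sketched under the assumption that the baseline predictions are available as a CSV file with a `cluster_id` column (both the file name and the column name are placeholders):

```python
import pandas as pd

# Sketch of the verification selection; file name and column names
# are placeholders for the corpus-wide baseline output.
predictions = pd.read_csv("baseline_predictions.csv")  # hypothetical file

sizes = predictions.groupby("cluster_id").size()
eligible = sizes[sizes.between(5, 30)].index   # clusters with 5-30 offers
to_verify = predictions[predictions["cluster_id"].isin(eligible)]
```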
3. Gold Standard Statistics
The WDC-24 Gold Standard contains at least 70 clusters per category, contributing to a more balanced dataset with 2,620 unique clusters, 17,580 products, and an average cluster size of 6.67 products. Table 1 gives a complete overview of the distribution of product categories within the WDC-24 Gold Standard after the manual verification. This analysis helps us understand the size and class distribution of the labeled corpus and allows us to define a suitable sampling approach for creating the training, validation, and testing sets. The WDC-24 Gold Standard contains 24 categories; the 'Not Found' category from the original taxonomy was excluded because of the low number of predicted labels.
Category Label | # Offers | # Clusters | Avg Cluster Size | Max Cluster Size |
---|---|---|---|---|
Home and Garden | 1,389 | 181 | 7.67 | 25 |
Toys and Games | 1,318 | 194 | 6.79 | 30 |
Computers and Accessories | 1,305 | 161 | 8.11 | 25 |
Sports and Outdoors | 1,144 | 172 | 6.65 | 22 |
Other Electronics | 915 | 118 | 7.75 | 28 |
Shoes | 735 | 122 | 6.02 | 13 |
Clothing | 733 | 116 | 6.32 | 15 |
CDs and Vinyl | 716 | 114 | 6.28 | 13 |
Cell Phones and Accessories | 694 | 98 | 7.08 | 29 |
Automotive | 667 | 92 | 7.25 | 21 |
Pet Supplies | 664 | 99 | 6.71 | 20 |
Books | 654 | 105 | 6.23 | 16 |
Jewelry | 642 | 103 | 6.23 | 16 |
Office Products | 613 | 98 | 6.26 | 11 |
Camera and Photo | 604 | 82 | 7.36 | 20 |
Health and Beauty | 588 | 99 | 5.94 | 11 |
Video Games | 583 | 91 | 6.41 | 16 |
Movies and TV | 556 | 90 | 6.18 | 16 |
Tools and Home Improvement | 555 | 92 | 6.03 | 20 |
Musical Instruments | 546 | 90 | 6.07 | 11 |
Grocery and Gourmet Food | 546 | 90 | 6.07 | 14 |
Baby | 498 | 72 | 6.92 | 22 |
Luggage and Travel Gear | 486 | 82 | 5.93 | 19 |
Others | 429 | 72 | 5.96 | 16 |
Totals | 17,580 | 2,633 | 6.67 | - |
4. Benchmark Experiments and Results
This section presents the approach we followed to train our selected models and evaluate their performance using the WDC-24 Gold Standard. We analyze a diverse range of models to gain a thorough understanding of their performance across different architectures. This analysis helps us determine which model is best suited for the large-scale experiment on the latest WDC Products Corpus (2020).
- Traditional Machine Learning Model
- Logistic Regression with One-vs-Rest Approach: A grid search is performed to find the best combination of hyperparameters. The searched parameters include the maximum number of features for TF-IDF, the regularization parameter (C), the solver algorithm, and the maximum number of iterations. After defining the hyperparameter grid, we build a pipeline and run the search with 5-fold cross-validation to identify the most suitable configuration (a sketch of this search follows below). The Logistic Regression classifier performed best with a C value of 10, the 'newton-cg' solver, and 500 iterations, with the TF-IDF vectorizer limited to 50,000 features. The model configured with these hyperparameters is our final baseline.
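The sketch below shows how such a search might be set up with scikit-learn. The candidate grids and the `load_gold_standard` helper are assumptions; only the winning hyperparameter values are taken from the text above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

texts, labels = load_gold_standard()  # hypothetical loader for the training set

# TF-IDF features feeding a One-vs-Rest Logistic Regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Candidate grids are assumptions; the winning values match the text.
param_grid = {
    "tfidf__max_features": [10_000, 50_000, 100_000],
    "clf__estimator__C": [0.1, 1, 10],
    "clf__estimator__solver": ["lbfgs", "newton-cg"],
    "clf__estimator__max_iter": [100, 500],
}

search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="f1_micro", n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_)  # best found: C=10, 'newton-cg', 500 iterations,
                            # 50,000 TF-IDF features
```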
- Transformer-based Models
BERT Variants

All three BERT models (Base Uncased, Large Uncased, and Base Multilingual Uncased) use the BERT tokenizer, which employs the WordPiece tokenization algorithm. This algorithm breaks words into smaller subwords to handle out-of-vocabulary words more effectively. For the classification task, these models use BertForSequenceClassification. With this groundwork established, the distinct training configurations of the individual models are given below; a fine-tuning sketch follows the list.
- BERT Base Uncased: The model was trained for 10 epochs, meaning the entire dataset passes through the neural network ten times. The batch size for both training and evaluation is 64, meaning that 64 samples from the dataset are used at each step of the gradient update during training, and model performance is evaluated over batches of 64 samples during evaluation. The learning rate for this model is set to 5e-5, which determines how much the model changes in response to the estimated error each time the model weights are updated. The evaluation and saving of the model are performed at the end of each epoch, and the model that achieved the highest Micro F1 Score is loaded at the end of training.
- BERT Large Uncased: This model was trained for a total of 5 epochs due to its more extensive size and computational demands. The training and evaluation batch sizes are decreased to 16 to compensate for the larger model size. Despite the size differences, the learning rate is kept consistent with the BERT Base model at 5e-5. Evaluation, saving, and loading of the best model are handled similarly to the BERT Base model, with these actions occurring at the end of each epoch and the best model determined by the highest Micro F1 Score.
- BERT Base Multilingual Uncased: This model follows a similar training configuration to the previous two models. It is trained for 10 epochs, with training and evaluation batch sizes of 64. The learning rate is maintained at 5e-5, and the best model, decided by the highest Micro F1 Score, is loaded at the end of training. Evaluation and saving of the model are performed at the conclusion of each epoch.
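The following sketch shows how the BERT Base Uncased configuration above could be expressed with the Hugging Face Trainer API. Dataset preparation is omitted; `train_ds`, `val_ds`, and the output directory are assumptions, and argument names may differ across transformers versions.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=24)  # 24 WDC-24 categories

def compute_metrics(eval_pred):
    # Micro and Macro F1 over the predicted category labels.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"micro_f1": f1_score(labels, preds, average="micro"),
            "macro_f1": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="bert-base-wdc24",      # hypothetical output directory
    num_train_epochs=10,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=5e-5,
    evaluation_strategy="epoch",       # evaluate at the end of each epoch
    save_strategy="epoch",             # save a checkpoint at each epoch
    load_best_model_at_end=True,       # reload the best checkpoint afterwards
    metric_for_best_model="micro_f1",
)

# `train_ds` and `val_ds` are assumed tokenized datasets with a `labels` column.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  compute_metrics=compute_metrics)
trainer.train()
```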
RoBERTa Variants

Both RoBERTa models (Base and Large) use the RoBERTa tokenizer, which handles all preprocessing from raw text to input sequences. Like the BERT models, the RoBERTa models use a variant of the BERT sequence classification model, RobertaForSequenceClassification: a pre-trained transformer model with a single linear classification layer on top. The XLM-RoBERTa model, in contrast, uses the AutoTokenizer, which automatically determines and applies the appropriate tokenizer based on the model architecture, and employs XLMRobertaForSequenceClassification for the classification task. Each model was fine-tuned with its respective configuration, taking advantage of efficient tokenization and powerful sequence classification capabilities. The individual training configurations are listed below, followed by a short loading sketch.
- RoBERTa Base: This model was trained over 10 epochs using a batch size of 32 for the training and evaluation phases. The learning rate was set to 5e-5, and the model that achieved the highest Micro F1 Score was loaded at the end of training.
- RoBERTa Large: Given its larger computational requirements, this model underwent a more limited training regimen of 5 epochs. The learning rate was kept consistent with the other models at 5e-5, while the batch size was reduced to 16 for training and evaluation to accommodate the increased resource usage. As with the other models, the checkpoint achieving the highest Micro F1 Score was loaded at the end of training.
- XLM-RoBERTa: This model was trained for 10 epochs with a batch size of 32 for both training and evaluation and a learning rate of 5e-5. The highest Micro F1 Score determined the best model, which was loaded at the end of training.
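A short sketch of loading the three RoBERTa-family models with the classes named above; the checkpoint names are the standard Hugging Face model identifiers, and fine-tuning would then proceed analogously to the BERT sketch, with the hyperparameters listed for each variant.

```python
from transformers import (AutoTokenizer, RobertaForSequenceClassification,
                          RobertaTokenizer, XLMRobertaForSequenceClassification)

# Each model gets a 24-class classification head for the WDC-24 categories.
roberta_base = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=24)
roberta_large = RobertaForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=24)
xlm_roberta = XLMRobertaForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=24)

roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# AutoTokenizer picks the matching tokenizer from the model name.
xlm_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
```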
The performance results for each model, after being trained with their respective configurations, are presented in Table 2.
Model | Micro F1-Score | Macro F1-Score |
---|---|---|
Logistic Regression with OvR | 0.85 | 0.85 |
BERT Base Uncased | 0.88 | 0.88 |
BERT Large Uncased | 0.88 | 0.88 |
BERT Base Multilingual | 0.86 | 0.85 |
RoBERTa Base | 0.88 | 0.89 |
RoBERTa Large | 0.88 | 0.88 |
XLM-RoBERTa | 0.88 | 0.87 |
Selection of the Best Performing Model
Factors such as model complexity, generalization capability, efficiency, and scalability play significant roles in making an informed decision. After analyzing the results above, we choose the RoBERTa Base model: following Occam's razor, a simpler model that provides accurate predictions is preferable to a larger one with comparable scores. RoBERTa Base has demonstrated excellent generalization across the diverse categories, and it offers computational efficiency and scalability as the data volume increases. Considering these factors, RoBERTa Base emerges as the most suitable choice for the large-scale experiment.
5. Categorization of all Offers in the PDC2020-C Corpus
We use the best-performing model to categorize all 22 million offers in the WDC PDC2020-C product corpus. Table 3 and Figure 1 show the outcome of this large-scale experiment using the RoBERTa Base model together with majority voting as the final classification step on the WDC PDC2020-C Product Corpus; a sketch of the batched inference follows Table 3.
Category | # Offers | # Clusters | Avg Cluster Size | % Clusters > 1 Product |
---|---|---|---|---|
Clothing | 3,716,240 | 3,617,411 | 1.03 | 2.21 |
Home and Garden | 2,953,284 | 2,789,280 | 1.06 | 4.10 |
Sports and Outdoors | 1,884,730 | 1,779,077 | 1.06 | 4.01 |
Office Products | 1,336,911 | 1,266,785 | 1.06 | 3.64 |
Jewelry | 1,310,126 | 1,246,801 | 1.05 | 3.55 |
Tools and Home Improvement | 1,190,917 | 1,147,093 | 1.04 | 2.79 |
Health and Beauty | 1,048,407 | 904,705 | 1.16 | 9.27 |
Automotive | 937,906 | 886,365 | 1.06 | 4.40 |
Toys and Games | 925,174 | 845,964 | 1.09 | 4.61 |
Books | 866,078 | 802,832 | 1.08 | 5.85 |
Shoes | 817,529 | 777,374 | 1.05 | 3.36 |
Grocery and Gourmet Food | 799,108 | 724,006 | 1.10 | 7.71 |
Computers and Accessories | 609,701 | 533,220 | 1.14 | 7.79 |
Other Electronics | 605,783 | 557,814 | 1.09 | 5.25 |
Pet Supplies | 477,353 | 438,148 | 1.09 | 5.40 |
Luggage and Travel Gear | 434,211 | 417,249 | 1.04 | 3.11 |
Musical Instruments | 423,967 | 393,570 | 1.08 | 5.23 |
Baby | 354,002 | 332,050 | 1.07 | 4.84 |
Cell Phones and Accessories | 332,424 | 310,285 | 1.07 | 4.75 |
CDs and Vinyl | 315,851 | 289,000 | 1.09 | 6.53 |
Camera and Photo | 260,459 | 233,486 | 1.12 | 7.15 |
Others | 183,658 | 156,559 | 1.17 | 8.42 |
Video Games | 88,780 | 84,024 | 1.06 | 3.55 |
Movies and TV | 56,611 | 52,143 | 1.09 | 5.52 |
Totals | 21,929,210 | 20,585,241 | 1.06 | - |
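A minimal sketch of how the batched inference over the corpus might look; the fine-tuned checkpoint path, the batch size, and the `load_offer_texts` helper are assumptions.

```python
import torch
from transformers import AutoTokenizer, RobertaForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "wdc24-roberta-base").to(device).eval()  # hypothetical fine-tuned checkpoint

@torch.no_grad()
def predict(batch):
    # Tokenize a batch of offer texts and return the predicted category ids.
    enc = tokenizer(batch, truncation=True, padding=True,
                    max_length=128, return_tensors="pt").to(device)
    return model(**enc).logits.argmax(dim=-1).tolist()

offers = load_offer_texts()  # hypothetical loader for the 22M offer texts
predictions = []
for start in range(0, len(offers), 256):
    predictions.extend(predict(offers[start:start + 256]))
# Cluster-level majority voting (see the sketch in Section 2) is then applied
# to turn these offer-level predictions into final category assignments.
```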
To verify the performance of the RoBERTa Base model on the whole corpus, we manually checked a sample of 10 offers per category (240 records), considering only products that provide all three of the following features: Title, Brand, and Description (see the sampling sketch below). Overall, the experiment reaches an accuracy of 72.04%. However, the performance varies significantly across categories, highlighting both the strengths of the model and areas for improvement in its predictive capability. Figure 2 shows the results per category in descending order.
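The sampling step could be sketched as follows; `results` and its column names are assumptions about the prediction output joined back to the offer text, not the published schema.

```python
import pandas as pd

def verification_sample(results: pd.DataFrame,
                        per_category: int = 10) -> pd.DataFrame:
    # Keep only offers where Title, Brand and Description are all present.
    complete = results.dropna(subset=["Title", "Brand", "Description"])
    # Draw 10 offers per predicted category: 24 x 10 = 240 records.
    return (complete.groupby("predicted_category", group_keys=False)
                    .apply(lambda g: g.sample(n=per_category, random_state=42)))
```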
6. Download
Below, we provide the WDC-24 Gold Standard for Product Categorization for public download, together with the training, validation, and testing sets used for all experiments. Running our best-performing model, RoBERTa Base, on the WDC Products Corpus produced two output files: one contains the results obtained with majority voting (cluster ID and category label), the other the results without majority voting (offer ID and category label). A sketch of reading these files follows the table.
File | Size | Download |
---|---|---|
WDC-24 Gold Standard | 11.7MB | WDC_24_GoldStandard.csv |
WDC-24 Gold Standard Training Set | 9.3MB | WDC_24_GoldStandard_TrainingSet.csv |
WDC-24 Gold Standard Validation Set | 1.1MB | WDC_24_GoldStandard_ValidationSet.csv |
WDC-24 Gold Standard Testing Set | 1.2MB | WDC_24_GoldStandard_TestingSet.csv |
Large Scale Experiment on WDC Product Corpus with Majority Voting Using RoBERTa Base Model | 133.6MB | WDC_Corpus_LargeScaleExperiment_MajorityVoting.json.gz |
Large Scale Experiment on WDC Product Corpus without Majority Voting Using RoBERTa Base Model | 82.8MB | WDC_Corpus_LargeScaleExperiment_WithoutMajorityVoting.json.gz |
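For working with the two output files, a minimal reading sketch is given below, assuming one JSON object per line; the field names are assumptions about the schema.

```python
import gzip
import json

with gzip.open("WDC_Corpus_LargeScaleExperiment_MajorityVoting.json.gz",
               "rt", encoding="utf-8") as f:
    for line in f:                 # assuming one JSON object per line
        record = json.loads(line)
        cluster_id = record.get("cluster_id")  # assumed field name
        category = record.get("category")      # assumed field name
```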
7. References
[1] Anna Primpeli, Ralph Peeters, and Christian Bizer. The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. In Companion Proceedings of The 2019 World Wide Web Conference, WWW '19, pages 381-386, New York, NY, USA, May 2019. Association for Computing Machinery.
[2] Ralph Peeters, Reng Chiz Der, and Christian Bizer. WDC Products: A Multi-Dimensional Entity Matching Benchmark. January 2023. arXiv:2301.09521 [cs].
8. Feedback
Please send questions and feedback to the Web Data Commons Google Group.
More information about Web Data Commons can be found here.