To support the evaluation and comparison of product feature extraction and product matching methods, we have manually created two public gold standards for these tasks. The gold standards comprise several hundred products from the categories mobile phones, TVs, and headphones.
News
- 2018-12-20: We have released the WDC Training Dataset and Gold Standard for Large-Scale Product Matching consisting of 26 million product offers originating from 79 thousand websites.
- 2017-06-30: Paper about extracting attribute-value pairs from product specifications accepted at the WI'17 conference in Leipzig, Germany: Extracting Attribute-Value Pairs from Product Specifications on the Web
- 2016-01-09: Paper about the gold standards accepted at EC-WEB'16 conference in Porto, Portugal: The WDC Gold Standards for Product Feature Extraction and Product Matching
Contents
- 1. About the WDC gold standard for product matching and product feature extraction
- 2. Product Selection
- 3. Gold Standard for Feature Extraction
- 4. Gold Standard for Product Matching
- 5. Product Data Corpus
- 6. Baselines
- 7. Extraction of Attribute-Value Pairs from Product Specifications
- 8. References
1. About the WDC Gold Standard for Product Matching and Product Feature Extraction
Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards [5, 6] for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity found on the Web; (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions.
To overcome these shortcomings, we introduce the WDC gold standard for product matching and product feature extraction, which packages the following artifacts:
- Gold Standard for Product Feature Extraction: containing over 500 deeply annotated product web pages. On each web page, the product-specific features are annotated.
- Gold Standard for Product Matching: containing over 75,000 correspondences (1,500 positive and 73,500 negative) between products from the product catalog and products described on web pages.
In addition, we provide a Product Data Corpus for public download containing over 13 million product-related web pages retrieved from the same 32 websites. This corpus may be useful as background knowledge for the semi-supervised training of feature extraction and matching methods.
2. Product Selection
We used the 32 most frequently visited shopping web sites based on the ranking provided by Alexa. For each of the three chosen product categories, we first collected the ten most popular products from the different web sites. We further complemented this list with similar products (based on their names). For example, we found the product Apple iPhone 6 64GB to be one of the most popular across all shopping web sites. We therefore also included the products Apple iPhone 6 128GB and Apple iPhone 5 in our product catalog. Especially for the product matching task, this methodology introduces a certain level of complexity, as the product names differ by only one or two characters. In total, we selected 50 different products for each product category.
3. Gold Standard for Feature Extraction
We labeled 4 distinct structural units from the HTML pages: (1) Microdata title, (2) Microdata description, (3) HTML tables, and (4) HTML lists. The labeled set comprises 500 product entities with 338 distinct labeled properties in total. It was created by three different annotators. In order to ensure the validity of the annotation process, the inter-annotator agreement was calculated: out of 1,500 labeled properties, 75 disagreements were found, so the inter-annotator agreement is estimated to be 95% (a minimal check of this figure follows the table below). The table below shows the distribution of labeled entities per category. The product entities were labeled as JSON objects, for which a documentation of property meaning and usage is available here. Additionally, the files containing the labeled entities are provided per category and in total for further use.
Category | #labeled Products | File with labeled Products |
---|---|---|
Headphones | 124 | headphones_labeled_entities.json(230kb) |
Mobile Phones | 212 | phones_labeled_entities.json(603kb) |
Televisions | 164 | tvs_labeled_entities.json(291kb) |
Total | 500 | all_labeled_entities.json(1,125kb) |
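As a quick sanity check, the reported agreement follows directly from the counts above:

```python
# Reported counts from the annotation process
labeled_properties = 1500
disagreements = 75

# Agreement = share of properties on which the annotators agreed
agreement = 1 - disagreements / labeled_properties
print(f"{agreement:.0%}")  # -> 95%
```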
General statistics per PLD, Web Page and Totals for each category
The charts below show the PLD usage for the top 12 most used properties in the gold standard for each product category. More detailed statistics on all the properties before mapping can be found in the following file (UnMappedProps_Statistics.xlsx). The file includes the feature names of tables and lists and their frequency per product category as used by the different commercial websites, as well as the distribution of PLDs over the feature names per product category.
The charts below show the usage of attributes per product, for each product category, within lists and tables.
Among the total number of defined properties, we observed some that were used rather frequently and others that were used by only a few websites. The charts below present the most frequent properties per product category as found in titles, descriptions, tables, and lists. Please note that the presented results do not include the tagline property, which is used for cases where a word in a title or description did not reveal any specification and thus could not be assigned any property value.
4. Gold Standard for Product Matching
Product Catalog
Feature Enrichment: In order to complement the products with features, for each of the 150 products in our product catalog we obtained product-specific features from the manufacturer's web site or from shopping service web sites like Google Shopping. We did not extract features from individual vendor web sites, like Amazon, as we want to focus on product-specific features rather than on features arising from bundles offered by specific vendors.
Figure 1 shows two example pages which we used to manually extract the features for our product catalog. Figure 1 (left) depicts a page from Google Shopping and Figure 1 (right) a page from the manufacturer's web site.
The table below shows the scraped attributes per product category.
Defined Properties per Product Category
Category | #Properties | List of Properties | Downloads |
---|---|---|---|
Headphones | 36 | Show Properties for Headphones | HeadphoneCatalog.json(22kb) |
Mobile Phones | 32 | Show Properties for Mobile Phones | PhoneCatalog.json(38kb) |
Televisions | 76 | Show Properties for Televisions | TVCatalog.json(38kb) |
Correspondences
We manually generated 1,500 positive correspondences, 500 for each product category (the data are available here). For each product of the product catalog, at least one positive correspondence is included. Additionally, to make the matching task more realistic, the annotators also annotated products closely related to the ones in the product catalog, such as phone cases, TV wall mounts, headphone cables, ear-buds, etc. Furthermore, we created additional negative correspondences by exploiting transitive closure: as all products in the product catalog are distinct, for each product description on a web page with a positive correspondence to a catalog product, we can generate a negative correspondence to every other product in the catalog. Using the two approaches, we ended up with 73,500 negative correspondences; a small sketch of the transitive-closure idea is shown below. Please note that the properties product_name from the product catalogs and normalized product name can only be used as keys and not as extracted attributes. The full list of correspondences can be downloaded here.
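A minimal sketch of this generation step, assuming hypothetical field names (page_id, product_id, category) and assuming negatives are generated within each product category, which the arithmetic suggests: with 50 catalog products per category, each of the 1,500 positively matched pages yields 49 negatives, and 1,500 × 49 = 73,500. The published correspondence files may be structured differently.

```python
def generate_negatives(positives, catalog):
    """positives: iterable of (page_id, product_id, category) triples.
    catalog: dict mapping category -> set of catalog product ids."""
    negatives = []
    for page_id, product_id, category in positives:
        # The page describes exactly one (distinct) catalog product, so it
        # is a non-match for every other product in the same category.
        for other in catalog[category] - {product_id}:
            negatives.append((page_id, other))
    return negatives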
Figure 2 depicts the number of positive correspondences which are contained for each product from the three different product categories.
Download Page Subset
We also provide the pages included in the gold standard for download. Each page is named by the id of the product given in the data.
5. Product Data Corpus
The motivation for the product crawl follows two main criteria: diversity of origin and quantity. This means that the chosen product specifications derive from different websites which use different schemata to express product specifications for the same product category. Furthermore, the data should include a comprehensive amount of the products and offers provided by the selected websites.
To crawl data satisfying these criteria, we adopted the crawling and scraping framework Scrapy. The crawler was restricted to retrieve data from 32 different PLDs, which were chosen based on whether they contain markup annotations as well as on their traffic rankings as reported by Alexa. A minimal spider sketch is shown below, followed by the distribution of pages per PLD in the crawled corpus.
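The sketch below shows a domain-restricted Scrapy CrawlSpider in the spirit of the crawl described above; the spider name, the domain subset, and the stored fields are illustrative, not the project's actual configuration.

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Illustrative subset of the 32 selected PLDs
PLDS = ["target.com", "walmart.com", "ebay.com"]

class ProductSpider(CrawlSpider):
    name = "product_spider"
    # allowed_domains confines the crawl to the selected PLDs
    allowed_domains = PLDS
    start_urls = [f"https://www.{d}/" for d in PLDS]

    # Follow links within the allowed domains and store every fetched page
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield {"url": response.url, "html": response.text}
```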
Number of Pages per PLD in the Crawled Corpus
PLD | #Pages | PLD | #Pages |
---|---|---|---|
target.com | 2,007,121 | shop.com | 1,754,368 |
walmart.com | 1,711,517 | selection.alibaba.com | 607,410 |
microcenter.com | 459,921 | aliexpress.com | 431,005 |
ebay.com | 413,144 | macmall.com | 401,944 |
apple.com | 391,539 | bestbuy.com | 389,146 |
techspot.com | 386,273 | techforless.com | 361,234 |
overstock.com | 347,846 | searsoutlet.com | 341,924 |
pcrush.com | 292,904 | tesco.com | 222,802 |
frontierpc.com | 187,184 | abt.com | 115,539 |
ipkart.com | 63,112 | conns.com | 61,158 |
costco.com | 54,274 | dhgate.com | 50,099 |
shop.lenovo.com | 41,465 | bjs.com | 40,930 |
newegg.com | 37,393 | microsoftstore.com | 24,163 |
samsclub.com | 22,788 | tomtop.com | 13,306 |
alibaba.com | 7,136 | boostmobile.com | 2,487 |
sears.com | 659 | membershipwireless.com | 457 |
To restrict the obtained results, product categories and a list of 10 products per category were specified. The table below shows the distribution of the retrieved pages after applying the above-mentioned filters.
Number of Pages per Product Category
Category | #Pages |
---|---|
Headphones | 355,648 |
Mobile Phones | 1,246,213 |
Televisions | 556,121 |
Markup Data of the Crawled Corpus
It has been shown that markup data like Microdata and RDFa are becoming the standard for embedding various semantics of data items. For the purpose of exploiting the Microdata markup in this corpus, the data derived from the crawl were profiled on the basis of markup data. The following table gives an overview of the top domains, classes, and properties as observed in the markup entities of the crawled corpus.
Statistics on the Markup Data of the Crawled Corpus
Distinct PLDs with markup data | 29 |
---|---|
Total Triples | 26,184,401 |
Total Entities | 5,636,972 |
Top Domains by Extracted Triples | Show top values |
Top Classes by Entities | Show top values |
Top Properties | Show top values |
Detailed Statistics as Excel file | CorpusStatistics.xlsx (40kb) |
Product Corpus and Extracted Microdata Downloads
Download access to the product corpus, its general statistics, and the extracted Microdata as N-Quads from the general crawl can be found at the following link. Similarly, we provide the product corpus from the keyword search crawl here.
6. Baselines
Baselines for Product Feature Extraction
Method
The approach makes use of the properties which are contained for the different products in the product catalog, described in Section 2. From these property names, we generate a dictionary, which we then apply to all web pages in the gold standard. This means that whenever the name of a feature from the catalog occurs on a web page, we extract it as a feature of the product, as sketched below.
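A minimal sketch of this baseline, assuming the catalog files are JSON arrays of objects whose keys are the property names; the helper names are illustrative and the actual files' structure may differ.

```python
import json

def build_dictionary(catalog_file):
    """Collect all property names occurring in a catalog file."""
    with open(catalog_file) as f:
        products = json.load(f)
    return {name.lower() for product in products for name in product}

def extract_features(page_text, dictionary):
    """Return every catalog property name that occurs in the page text."""
    text = page_text.lower()
    return [name for name in sorted(dictionary) if name in text]
```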
Results
We applied the dictionary method described above to each part of our gold standard: (1) title and description from the markup data, (2) specification tables, and (3) specification lists; all shown in the tables below. The results show that the dictionary method does not perform well on any part of the gold standard and that improvement is needed. The reason for the poor performance lies in the disparity of values coming from different vendors. For instance, display sizes in our catalog are given in inches, while some vendors use the metric system for that measure.
Markup Data

Category | Precision | Recall | F1 |
---|---|---|---|
Headphones | 0.623 | 0.588 | 0.604 |
Phones | 0.601 | 0.452 | 0.515 |
TVs | 0.573 | 0.604 | 0.590 |

Table Specification

Category | Precision | Recall | F1 |
---|---|---|---|
Headphones | 0.579 | 0.614 | 0.596 |
Phones | 0.443 | 0.555 | 0.493 |
TVs | 0.521 | 0.658 | 0.581 |

List Specification

Category | Precision | Recall | F1 |
---|---|---|---|
Headphones | 0.436 | 0.512 | 0.494 |
Phones | 0.389 | 0.571 | 0.462 |
TVs | 0.418 | 0.499 | 0.455 |
Baselines for Product Matching
We experiment on 3 datasets:
- Preprocessed text from the entire HTML
- Preprocessed name and description properties from Microdata markup
- Specification tables and lists
Common text preprocessing techniques are applied in order to reduce the number of tokens of every textual input and to produce filtered, content-richer unstructured and structured product specifications, which eventually contribute better to the identity resolution task.
The preprocessing starts with the removal of HTML tags which do not contribute to the content of the specification. As a second step, tokenization by non-alphanumeric characters is performed. Thirdly, the textual characters are converted to lowercase and stemmed using a Porter stemming filter. Finally, stopwords are removed. All steps are applied equally to the unstructured and structured product specifications, apart from the first step, which is only applied to the unstructured specifications contained in the HTML pages. A sketch of this pipeline follows.
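A minimal sketch of the pipeline, using NLTK's Porter stemmer; the tag-stripping regex and the small inline stopword list are simplifications of whatever the original implementation used.

```python
import re
from nltk.stem.porter import PorterStemmer

# Illustrative stopword list; the original set is not specified
STOPWORDS = {"the", "a", "an", "and", "or", "of", "with", "for", "in"}
stemmer = PorterStemmer()

def preprocess(text, is_html=False):
    # Step 1 (unstructured input only): strip HTML tags
    if is_html:
        text = re.sub(r"<[^>]+>", " ", text)
    # Step 2: tokenize on non-alphanumeric characters
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    # Steps 3 and 4: lowercase, stem, and drop stopwords
    return [stemmer.stem(t.lower()) for t in tokens
            if t and t.lower() not in STOPWORDS]
```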
Parametrization
We experiment with 3 different variable parameters. We take all n-grams (n ∈ {1, 2, 3}) and test on them separately. Additionally, we relax the string matching constraints by calculating the Levenshtein distance between potential matches. Lastly, we apply maximum and minimum frequency pruning.
Similarity Methods
There are three main similarity methods involved in the experiments: exact string matching, Jaccard similarity, and cosine similarity. On top of these similarity measures, the Levenshtein distance can be calculated.
- 1. Simple Model: Calculation of the percentage of common unique words in the two inputs (exact similarity on n-gram level).
  i. The lists of tokens are transformed into sets of unique grams.
  ii. The containment of the exact unique words is calculated as commonGrams / min{UniqueGramsCatalogEntity, UniqueGramsPage}.
- 2. TF-IDF Cosine Similarity:
  i. The unique grams of the two inputs are assigned weights (see the weighting section).
  ii. Common words between the unique sets are retrieved.
  iii. Cosine similarity is calculated as the dot product of the weights of the common words divided by the normalization factor.
- 3. Jaccard: Can be applied only for n-grams with n > 1. The similarity score is calculated as |A ∩ B| / |A ∪ B|.
- Levenshtein Similarity (on top): In order to calculate the intersection between vectors (unique common words), the user can define whether grams must match exactly or whether grams with a Levenshtein similarity above a certain user-defined threshold should be considered common. This relaxation can be applied on top of all the previously mentioned similarity methods; see the sketch after this list.
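Below is a minimal sketch of the simple model, the Jaccard measure, and the Levenshtein relaxation. difflib's SequenceMatcher ratio stands in for a normalized Levenshtein similarity, and the TF-IDF cosine variant is omitted for brevity.

```python
from difflib import SequenceMatcher

def containment(a, b):
    """Simple model: |A ∩ B| / min(|A|, |B|) over unique grams."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over unique grams."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def relaxed_common_grams(a, b, threshold=0.9):
    """'Levenshtein on top': treat two grams as common when their string
    similarity exceeds a user-defined threshold instead of requiring
    exact equality."""
    return {x for x in set(a)
            if any(SequenceMatcher(None, x, y).ratio() >= threshold
                   for y in set(b))}
```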
Feature Extraction
We implement 2 feature extraction methods:
- Bag-of-Words (BOW): Keeping all tokens from the preprocessing
- Dictionary method (see the feature extraction baseline)
In order to evaluate the implemented similarity methods, we measure precision, recall, and F1. For this, we first need to define a threshold. The threshold is selected by a brute-force search for the best F1 score over [0, 1] with a step size of 0.001, as sketched below.
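A minimal sketch of this threshold search, assuming scores and gold labels are aligned lists; the helper name is illustrative.

```python
def best_threshold(scores, labels):
    """scores: one similarity score per candidate pair; labels: gold booleans.
    Returns (best_f1, threshold) found by scanning [0, 1] in 0.001 steps."""
    best = (0.0, 0.0)
    for step in range(1001):
        t = step / 1000
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        if tp:
            p, r = tp / (tp + fp), tp / (tp + fn)
            best = max(best, (2 * p * r / (p + r), t))
    return best
```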
Results
Below we show the best results for both the BOW and Dictionary approaches. All results are available for download in the respective tables.
BOW results
Category | Precision | Recall | F1 |
---|---|---|---|
Headphones | 0.622 | 0.559 | 0.588 |
Phones | 0.298 | 0.668 | 0.412 |
TVs | 0.661 | 0.474 | 0.552 |
Download: BOW.xlsx (16kb)
Dictionary results
Category | Precision | Recall | F1 |
---|---|---|---|
Headphones | 0.335 | 0.556 | 0.418 |
Phones | 0.617 | 0.612 | 0.614 |
TVs | 0.578 | 0.522 | 0.553 |
Download: Dictionary.xlsx (16kb)
7. Extraction of Attribute-Value Pairs from Product Specifications
Automatically extracting specifications from HTML pages is not a trivial task. Qiu et al. [3] have provided an analysis of product specifications and concluded that technical specifications can be contained in different HTML structures, but are primarily found in tables and lists. Similar to Qiu et al. [3], we employ a two-step approach: first, we detect specification tables and lists in HTML pages; afterwards, we extract attribute-value pairs from the detected product specifications. We start by training a model for the detection of specification tables and lists. Next, we apply that model to the web pages. Subsequently, we use a sample of the detected specifications to learn a model for attribute and value column detection. Finally, with the learned model, we extract attribute-value pairs from the specifications.
Specification Detection
As stated above, product specifications are mostly found in tables and lists. Considering that HTML tables and lists have a different base structure, we train separate classifiers for detecting specification tables and lists. The models are trained by learning a binary classifier that classifies tables/lists into specification and non-specification. The classifiers use the following features, which have also been used previously by Qiu et al. [3]: average text length per row, number of rows, overall frequency of the word "specification", number of links, number of images, and standard deviation of text size. In addition to these features, we introduce the following new features in order to improve the detection accuracy: average DOM node depth of the items relative to the root, average number of columns, standard deviation of columns, maximum number of columns, number of non-table/list tags, average ratio between numerical and alphabetical characters in a cell, and maximum number of rows. In order to illustrate the correlation of these features with the target attribute (specification vs. non-specification), the figure below shows summary statistics about the evaluation dataset that we use. As is evident from the figure, there is a clear difference between specifications and non-specifications when considering these features. For instance, when considering "standard deviation of columns", specification tables hardly deviate from a given layout, while non-specification tables deviate from the layout much more. Another interesting feature is the "number of non-table/list tags", where non-specification structures contain many more tags like <p> and <span> than the specification ones. With that said, it is intuitive that the binary classifier can learn a better model if both feature sets are used. A sketch of the feature computation and classifier training is shown below.
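A minimal sketch of the detection step over a subset of the named features; the text does not state which classifier is used, so the random forest below is an assumption.

```python
import statistics
from sklearn.ensemble import RandomForestClassifier

def table_features(rows):
    """rows: a parsed table as a list of rows of cell strings."""
    row_lengths = [sum(len(cell) for cell in row) for row in rows]
    col_counts = [len(row) for row in rows]
    return [
        statistics.mean(row_lengths),   # average text length per row
        len(rows),                      # number of rows
        statistics.mean(col_counts),    # average number of columns
        statistics.pstdev(col_counts),  # standard deviation of columns
        max(col_counts),                # maximum number of columns
    ]

def train_detector(tables, labels):
    """labels: 1 = specification, 0 = non-specification."""
    X = [table_features(t) for t in tables]
    return RandomForestClassifier().fit(X, labels)
```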
Specification Extraction
Qiu et al. [3] apply straightforward heuristics to extract attribute-value pairs from product specifications. Specifically, for tables they use a heuristic that, for each table row, extracts the first cell as the attribute name and the remaining cells as concatenated values. The main limitation of this heuristic is that it cannot handle tables that have more than two attribute name columns: instead of yielding a correct result like in Figure 3, it will treat values in the third column as attribute values, which is obviously incorrect. To solve this issue, we train a model which can detect whether a given column is an attribute or a value column. Much like the detection model, we train a binary classifier with the following features: average text length per cell, average text length per column, number of non-table tags per column, standard deviation of text size per column, and ratio between alphabetical and numerical characters in a cell.

Taking into account that lists follow a different structure (no columns), we convert list items to columns by separating the items by a delimiter and organising the separation result into columns. We consider the common delimiter characters ":" and ";". We do not use delimiters like ",", "-", "/" and "\" since they might be part of product identification numbers like MPNs. After the conversion is done, we are able to train the same model as described above. However, like in [3], this approach falls short when the lists do not contain any delimiter. Since the percentage of specifications without delimiters is less than 8%, we do not pursue a solution for this case.

After the columns have been classified, we continue by pairing attribute columns with value columns, after which each attribute and value item, row-wise, is considered a pair. The pairing of columns is done left to right: starting from the leftmost column, we pair the first attribute and value columns and continue to the right. In the case of consecutive attribute or value columns, we concatenate them; a sketch of this pairing is shown below. Figure 3 shows an example of a table with tagged attribute and value columns, where the first and second columns constitute the first attribute-value pairing, while the third and fourth columns constitute the second attribute-value pairing.
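A minimal sketch of the left-to-right pairing step, assuming the columns have already been classified as attribute or value columns; the function and label names are illustrative.

```python
def pair_columns(columns, kinds):
    """columns: list of columns (each a list of cell strings, equal length).
    kinds: one label per column, either 'attr' or 'value'."""
    # Concatenate runs of consecutive same-kind columns
    merged = []
    for col, kind in zip(columns, kinds):
        if merged and merged[-1][1] == kind:
            merged[-1][0] = [a + " " + b for a, b in zip(merged[-1][0], col)]
        else:
            merged.append([list(col), kind])
    # Pair each attribute column with the value column directly to its right
    pairs = []
    for (col, kind), (nxt, nkind) in zip(merged, merged[1:]):
        if kind == "attr" and nkind == "value":
            pairs.extend(zip(col, nxt))  # row-wise attribute-value pairs
    return pairs

# Example: a two-column specification table yields row-wise pairs
# pair_columns([["Brand", "Weight"], ["Apple", "129 g"]],
#              ["attr", "value"])
# -> [("Brand", "Apple"), ("Weight", "129 g")]
```

For the Figure 3 case, the four classified columns would produce two attribute-value pairings, as described above.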
Details on the methodology and the results are presented in Petrovski and Bizer, available here.
Additionally, we provide the code, models and training data at the following repository: Attribute-Value Pairs Extraction. Moreover, we provide extracted product attributes from specification tables and lists with aligned attribute names to the ones in the product catalog (available for download here).
8. References
- [1] Petar Petrovski, Anna Primpeli, Robert Meusel, Christian Bizer: Product Data Corpus and Gold Standard for Product Feature Extraction and Matching. 17th International Conference on Electronic Commerce and Web Technologies (EC-Web), Porto, Portugal, September 2016.
- [2] Robert Meusel, Anna Primpeli, Christian Meilicke, Heiko Paulheim, Christian Bizer: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. 16th International Conference on Electronic Commerce and Web Technologies (EC-Web), 2015.
- [3] Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, Divesh Srivastava: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. Proceedings of the VLDB Endowment, Vol. 8, No. 13, 2015.
- [4] Hanna Köpcke, Andreas Thor, Stefan Thomas, Erhard Rahm: Tailoring Entity Resolution for Matching Product Offers. Proceedings of the 15th International Conference on Extending Database Technology (EDBT), 2012.
- [5] Hanna Köpcke, Andreas Thor, Erhard Rahm: Evaluation of Entity Resolution Approaches on Real-World Match Problems. Proceedings of the VLDB Endowment, Vol. 3, No. 1-2, September 2010.
- [6] Petar Ristoski, Peter Mika: Enriching Product Ads with Metadata from HTML Annotations. 13th Extended Semantic Web Conference (ESWC), 2016.