This page offers the WDC Block benchmark for download. WDC Block is a benchmark for comparing the performance of blocking methods that are used as part of entity resolution pipelines. WDC Block features a maximum Cartesian product of 200 billion pairs as well as training sets of different sizes for evaluating supervised blockers. We use WDC Block to evaluate several state-of-the-art blocking systems, including CTT, Auto, JedAI, Sudowoodo, SBERT, BM25 and SC-Block.
News
- 2023-06-22: We have released WDC Block, a benchmark for comparing the performance of blocking methods. WDC Block features a maximum Cartesian product of 200 billion pairs of product offers, which were extracted from 3,259 e-shops.
- 2023-02-15: We have released an initial version of WDC Block.
Table of Contents
- 1 Introduction
- 2 Benchmark Creation
- 3 Benchmark Profiling
- 4 Experiments
- 5 Downloads
- 6 Feedback
- 7 References
1 Introduction
Entity resolution aims to identify records in two datasets (A and B) that describe the same real-world entity [2,3,4]. Since comparing all record pairs between two datasets can be computationally expensive, entity resolution is approached in two steps: blocking and matching. Blocking applies a computationally cheap method to remove non-matching record pairs and produces a smaller set of candidate record pairs, reducing the workload of the matcher. During matching, a more expensive pairwise matcher produces the final set of matching record pairs [2,8]. Existing benchmark datasets for blocking and matching are rather small with respect to both the Cartesian product AxB of all record comparisons and the vocabulary size [7]. If blockers are evaluated only on these small datasets, effects resulting from a high number of records or from a large vocabulary (a large number of unique tokens that need to be indexed) may be missed. Web Data Commons Block (WDC Block) is a new blocking benchmark that provides much larger datasets and thus requires blockers that address these scalability challenges. Additionally, we provide three development sets of different sizes (~1k pairs, ~5k pairs & ~20k pairs) to experiment with different amounts of training data for the blockers.
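The two-step setup can be illustrated with a minimal sketch. It is not one of the blockers evaluated on WDC Block: the token-overlap blocker, the Jaccard matcher, the threshold and the toy records are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tokens(text):
    """Lower-case a string and split it on whitespace."""
    return set(text.lower().split())


def block(table_a, table_b):
    """Cheap step: keep only pairs (a, b) that share at least one token."""
    index = defaultdict(set)  # token -> ids of records in B containing it
    for bid, text in table_b.items():
        for tok in tokens(text):
            index[tok].add(bid)
    candidates = set()
    for aid, text in table_a.items():
        for tok in tokens(text):
            for bid in index.get(tok, ()):
                candidates.add((aid, bid))
    return candidates


def match(candidates, table_a, table_b, threshold=0.5):
    """More expensive step: pairwise Jaccard similarity on candidates only."""
    matches = set()
    for aid, bid in candidates:
        ta, tb = tokens(table_a[aid]), tokens(table_b[bid])
        if len(ta & tb) / len(ta | tb) >= threshold:
            matches.add((aid, bid))
    return matches


# Hypothetical toy records (id -> title), not taken from the benchmark.
table_a = {
    "a1": "Canon EOS 90D camera body",
    "a2": "Logitech MX Master 3 mouse",
}
table_b = {
    "b1": "canon eos 90d dslr camera",
    "b2": "Dell UltraSharp 27 monitor",
    "b3": "logitech mx master 3 wireless mouse",
}

candidates = block(table_a, table_b)
matches = match(candidates, table_a, table_b)
print(f"Cartesian product: {len(table_a) * len(table_b)} pairs")
print(f"Candidates after blocking: {len(candidates)} pairs")
print(f"Matches: {sorted(matches)}")
```

The matcher only ever sees the candidate set, so its cost depends on how aggressively the blocker prunes the Cartesian product. The inverted index is also the structure that grows with the vocabulary size, which is why large vocabularies stress blockers of this kind.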
WDC Block is based on product data that was extracted in 2020 from 3,259 e-shops that mark up product offers within their HTML pages using the schema.org vocabulary. The largest variant of WDC Block uses offers for 2 million different products. Multiple offers referring to the same product are identified based on the GTIN and MPN numbers provided by the e-shops.
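As a rough illustration of identifier-based grouping, the sketch below clusters offers that share a GTIN or, failing that, an MPN. The page does not describe the actual grouping procedure, so the field names `gtin` and `mpn`, the normalisation and the preference for GTIN over MPN are assumptions.

```python
from collections import defaultdict


def product_key(offer):
    """Return a (field, value) key from the first identifier that is present.

    Assumption: GTIN is preferred over MPN and values are compared after
    trimming whitespace and upper-casing.
    """
    for field in ("gtin", "mpn"):
        value = offer.get(field)
        if value:
            return (field, value.strip().upper())
    return None


def group_offers(offers):
    """Group offer ids by product key; offers without an identifier are skipped."""
    clusters = defaultdict(list)
    for offer in offers:
        key = product_key(offer)
        if key is not None:
            clusters[key].append(offer["id"])
    return dict(clusters)


# Hypothetical offers, not taken from the corpus.
offers = [
    {"id": "o1", "gtin": "04549292138313", "mpn": None},
    {"id": "o2", "gtin": " 04549292138313 ", "mpn": "3616C002"},
    {"id": "o3", "gtin": None, "mpn": "910-005620"},
    {"id": "o4", "gtin": None, "mpn": None},
]

clusters = group_offers(offers)
for key, ids in clusters.items():
    print(key, ids)
```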
2 Benchmark Creation
The WDC Block benchmark was created in four steps: (i) we select a difficult variant of the WDC Products entity matching benchmark as the seed dataset for WDC Block, (ii) we split the dataset into two separate datasets A and B, (iii) we enlarge the datasets by adding offers for additional non-matching products from the WDC Product Data Corpus V2020, and (iv) we prepare three development sets (~1k pairs, ~5k pairs & ~20k pairs). This section gives an overview of the four steps.
- We choose the large, pairwise training set (80% corner cases, 20% random) and the 50%-unseen test set from the WDC Products entity matching benchmark as the seed dataset for our benchmark. We selected the large version to start with a large number of product offers that are difficult to match, and the 50% seen / 50% unseen test set to balance product offers that are part of the training set against product offers that the blockers did not see during training.
- We split the original record pairs into two datasets A and B to follow the common setup of entity matching datasets in the related work. Both datasets A and B are deduplicated to obtain clean datasets.
- Depending on the specific benchmark (small, medium, large), we populate the datasets A and B with additional randomly selected offers from the WDC Product Data Corpus V2020. We make sure that the randomly selected records do not match any of the existing records in the datasets to avoid introducing additional matching pairs.
- We derive the train, validation & test splits and convert them to the format of the entity matching datasets in the related work. For the large development set (~20k pairs), the initially derived pairs are retained. For the medium development set (~5k pairs) & small development set (~1k pairs), the train & validation sets are down-sampled so that their distributions remain comparable to those of the large development set. The test set remains the same for all versions of this benchmark to make the results comparable.
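The down-sampling in the last step can be sketched as follows. This is one plausible reading rather than the procedure actually used: it assumes that "distribution" refers to the ratio of matching to non-matching pairs, and the function name `downsample`, the `label` field and the toy pairs are hypothetical.

```python
import random


def downsample(pairs, target_size, seed=42):
    """Draw roughly target_size pairs while keeping the label ratio of `pairs`."""
    rng = random.Random(seed)

    # Group the pairs by label (1 = match, 0 = non-match).
    by_label = {}
    for pair in pairs:
        by_label.setdefault(pair["label"], []).append(pair)

    # Sample the same fraction from every label group.
    fraction = target_size / len(pairs)
    sample = []
    for label, group in sorted(by_label.items()):
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))

    rng.shuffle(sample)
    return sample


# Toy development set: 40 matching and 60 non-matching pairs.
pairs = [{"id_a": i, "id_b": i, "label": 1} for i in range(40)]
pairs += [{"id_a": i, "id_b": i + 1, "label": 0} for i in range(60)]

small = downsample(pairs, target_size=25)
positives = sum(p["label"] for p in small)
print(f"{len(small)} pairs, {positives} positive")
```

Because each group is rounded separately, the result can differ from `target_size` by a pair or two on real data. Fixing the seed keeps the smaller sets reproducible.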
3 Benchmark Profiling
The WDC Block benchmark consists of a total of 2,073,224 real-world products across all subsets, which are described by 2,100,000 product offers. Each product offer in WDC Block has five attributes: title, description, price, priceCurrency and brand. WDC Block comes with nine configurations, which are derived from two dimensions: the dataset size (small, medium, large) described in Table 1 and the development set size (~1k, ~5k, ~20k) described in Table 2. Table 1 shows the number of records in Table A and Table B, the number of positive and negative pairs in the test set, as well as the vocabulary size and the Cartesian product of the different dataset sizes. The vocabulary size is the number of unique tokens obtained by concatenating the attribute values of the product offers in Table A and Table B and tokenizing the concatenated values by whitespace. The Cartesian product is the maximum number of record comparisons between Table A and Table B (AxB).
Table 1: Dataset sizes of WDC Block
Dataset | Table A | Table B | Pos. Pairs Test | Neg. Pairs Test | Vocabulary Size | Cartesian Product |
---|---|---|---|---|---|---|
WDC Block small | 5,000 | 5,000 | 500 | 4,000 | 67,294 | 25M |
WDC Block medium | 5,000 | 200,000 | 500 | 4,000 | 1,174,280 | 1,000M |
WDC Block large | 100,000 | 2,000,000 | 500 | 4,000 | 6,880,107 | 200,000M |
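The two size measures can be made concrete with a short sketch. It follows the description above (concatenate the attribute values, split on whitespace, count unique tokens). The handling of missing values, the absence of any lower-casing and the toy offers are assumptions. The Cartesian products are recomputed from the record counts of the three dataset sizes.

```python
ATTRIBUTES = ["title", "description", "price", "priceCurrency", "brand"]


def vocabulary(*tables):
    """Collect the unique whitespace tokens over all offers of all tables."""
    vocab = set()
    for table in tables:
        for offer in table:
            # Assumption: missing attribute values are skipped.
            text = " ".join(
                str(offer[a]) for a in ATTRIBUTES if offer.get(a) is not None
            )
            vocab.update(text.split())
    return vocab


# Hypothetical toy offers, not taken from the benchmark.
table_a = [
    {"title": "Canon EOS 90D", "description": "DSLR camera body",
     "price": "1199.00", "priceCurrency": "USD", "brand": "Canon"},
]
table_b = [
    {"title": "Canon EOS 90D body", "description": None,
     "price": "1150.00", "priceCurrency": "EUR", "brand": "Canon"},
    {"title": "Nikon Z6", "description": "mirrorless camera",
     "price": "1150.00", "priceCurrency": "USD", "brand": "Nikon"},
]

vocab = vocabulary(table_a, table_b)
print(f"Vocabulary size: {len(vocab)}")
print(f"Cartesian product: {len(table_a) * len(table_b)}")

# Cartesian products recomputed from the record counts in Table A and Table B.
sizes = {
    "small": (5_000, 5_000),
    "medium": (5_000, 200_000),
    "large": (100_000, 2_000_000),
}
cartesian = {name: a * b for name, (a, b) in sizes.items()}
for name, product in cartesian.items():
    print(f"WDC Block {name}: {product // 1_000_000:,}M pairs")
```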
Table 2: Development set sizes of WDC Block
Development Set Size | Pos. Pairs Train | Neg. Pairs Train | Pos. Pairs Val. | Neg. Pairs Val. | Total Pairs |
---|---|---|---|---|---|
Small ~1k | 266 | 408 | 133 | 203 | 1,011 |
Medium ~5k | 1,559 | 1,880 | 779 | 939 | 5,192 |
Large ~20k | 6,454 | 10,502 | 3,226 | 5,250 | 21,932 |