Petar Ristoski
Oliver Lehmberg
Heiko Paulheim
Christian Bizer



This page provides basic statistics about the subset of relational Web tables in the WDC Web Table Corpus 2012 that originate from top-level domains (TLDs) that likely provide English-language content. The subset consists of 91,815,190 tables out of the 147 million Web tables in the overall corpus. All tables are publicly available for download.


1. Identifying English-language TLD Web Tables

The Web tables in this corpus are extracted from the following top-level domains: "com", "org", "net", "eu" and "uk".
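The TLD filter can be illustrated with a minimal Python sketch. It assumes that the source URL of each table is available and that the TLD is simply the last label of the host name; the function names are hypothetical and not part of the extraction framework.

```python
from urllib.parse import urlsplit

# TLDs listed in the text above.
ENGLISH_TLDS = {"com", "org", "net", "eu", "uk"}

def top_level_domain(url):
    """Return the last label of the URL's host name, or None if there is none."""
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None without a host
    if not host or "." not in host:
        return None
    return host.rstrip(".").rsplit(".", 1)[-1]

def is_english_tld(url):
    """True if the URL's TLD is one of the five TLDs kept in this subset."""
    return top_level_domain(url) in ENGLISH_TLDS
```

Under this reading, a page hosted at www.example.co.uk is kept because its last host label is uk, while a page at example.de is dropped.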

2. TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per English-language top-level domain.

Fig. 1 - Number of tables per English-language TLD

2. Number of Columns and Rows Distribution

The table below provides basic statistics for the tables' size in the complete corpus.

          min.   max.     average   median
columns   2      713      3.48      3
rows      1      35,640   10.37     2

2.1. Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table. Note that the row counts in the table above exclude the header row and thus refer to the data rows of each table.

Fig. 2 - Distribution of Number of Columns per Table


The complete distribution of the number of columns per table can be found here. The file contains a list of two tab-separated fields, #columns and #tables. E.g. the first entry of the file, 2 43930478, means that there are 43930478 tables that have exactly 2 columns.
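A distribution file of this shape can be loaded with a few lines of Python. The sketch below assumes one entry per line with exactly two tab-separated integer fields; the function name is hypothetical, and the sample input contains only the first entry quoted above.

```python
import io

def read_distribution(lines):
    """Parse lines of the form '<value>TAB<#tables>' into a dict {value: count}."""
    dist = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        value, count = line.split("\t")
        dist[int(value)] = int(count)
    return dist

# In-memory stand-in for the downloaded file (first entry only).
sample = io.StringIO("2\t43930478\n")
columns_dist = read_distribution(sample)
```

The same function would work for the rows distribution file, since it has the same two-field layout.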

2.2. Number of Rows Distribution

Figure 3 shows the distribution of number of data rows per table.
Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. The file contains a list of two tab-separated fields, #rows and #tables. E.g. the first entry of the file, 1 416575, means that there are 416575 tables that have exactly 1 data row.
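Summary statistics such as those in the table of section 2 could in principle be recomputed from such a distribution. The following sketch assumes the distribution has already been loaded into a non-empty dictionary mapping a value to its number of tables; it uses the lower median, which is an assumption, since the page does not state how the median was computed. The counts in the example are toy values, not corpus data.

```python
def summarize(dist):
    """Compute (min, max, mean, median) from a {value: #tables} histogram."""
    total = sum(dist.values())
    mean = sum(value * count for value, count in dist.items()) / total
    # Walk the sorted values until the cumulative count reaches the middle table.
    middle = (total + 1) // 2  # 1-based position of the lower median
    seen = 0
    median = None
    for value in sorted(dist):
        seen += dist[value]
        if seen >= middle:
            median = value
            break
    return min(dist), max(dist), mean, median

# Toy histogram: three tables with 1 row, one with 2 rows, one with 10 rows.
toy = {1: 3, 2: 1, 10: 1}
stats = summarize(toy)  # (1, 10, 3.0, 1)
```

Working on the histogram avoids expanding it into one value per table, which matters for a corpus of this size.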

3. Headers Distribution

In order to get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic assumes that the column headers are in the first row of the Web table whose number of non-empty cells is at least 80% of the number of cells of the row with the highest number of non-empty cells in the table. The heuristic will fail on vertical tables as well as on tables that require more sophisticated header unfolding. We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With the current approach, we were able to identify a total of 319,598,356 column headers, of which 16,978,151 are distinct.
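One possible reading of the header heuristic is sketched below in Python. It assumes a table is given as a list of rows, each a list of cell strings, and that the 80% threshold is taken relative to the non-empty cell count of the fullest row; the function names and the toy table are illustrative only and do not reproduce the original implementation.

```python
def non_empty(row):
    """Count the cells of a row that contain something other than whitespace."""
    return sum(1 for cell in row if cell is not None and str(cell).strip())

def find_header_row(table, ratio=0.8):
    """Return the index of the first row that is 'full enough', or None."""
    counts = [non_empty(row) for row in table]
    if not counts or max(counts) == 0:
        return None  # empty table or only empty cells
    threshold = ratio * max(counts)
    for index, count in enumerate(counts):
        if count >= threshold:
            return index
    return None

# Toy table: a sparse title row followed by the header row and one data row.
table = [
    ["Title", "", ""],
    ["name", "year", "price"],
    ["x", "2001", "10"],
]
header_index = find_header_row(table)  # 1
```

The sparse title row is skipped because it has only one non-empty cell, while the fullest row has three. As noted above, a vertical table would defeat this check, since its attribute names sit in the first column rather than in a row.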

Figure 4 shows the number of tables in the corpus that contain some popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. The file contains a list of two tab-separated fields, header and #tables. E.g. the first entry of the file, name 3645104, means that there are 3645104 tables that contain a column with the header name.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching to the cross-domain knowledge base DBpedia, which is a structured data version of a subset of Wikipedia. We scanned the tables for properties used in DBpedia which are also used as table headers in our dataset. The complete list can be found here. The file contains a list of two tab-separated fields, DBpediaProperty and #tables. E.g. the entry name 3645104 means that there are 3645104 tables that contain a column with the header name.

4. Labels Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of the tables using a simple heuristic and counted value occurrences in the label column of all Web tables. Our heuristic assumes the left-most column that is not a number or a date and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similar simple heuristic.
Before counting, all values are normalized and stop words are removed. E.g. the music album name The Dark Side of the Moon is normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms into account (like 'New York' and 'New York City'). Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 952,800,935 label column values, of which 206,034,980 are distinct.
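The label column heuristic and the normalization step can be sketched as follows. Several details are assumptions made for illustration, since the text does not specify them: the stop-word list, the recognized date formats, the rule that a column counts as numeric or date when more than half of its values parse as such, and the 0.9 uniqueness threshold standing in for "almost unique". All names and the toy rows are hypothetical.

```python
import re
from datetime import datetime

# Hypothetical stop-word list; the list actually used is not given in the text.
STOP_WORDS = {"a", "an", "and", "of", "the"}
# Assumed date formats; a real detector would need many more.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")

def normalize(value):
    """Lower-case, keep ASCII alphanumeric tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def is_number(text):
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False

def is_date(text):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False

def find_label_column(rows, min_unique_ratio=0.9):
    """Return the index of the left-most text column with almost unique values.

    rows holds the data rows only (lists of strings, header row excluded).
    """
    width = max((len(row) for row in rows), default=0)
    for col in range(width):
        values = [row[col].strip() for row in rows
                  if col < len(row) and row[col].strip()]
        if not values:
            continue
        typed = sum(1 for v in values if is_number(v) or is_date(v))
        if typed > len(values) / 2:
            continue  # treat the column as numeric or date
        if len(set(values)) / len(values) >= min_unique_ratio:
            return col
    return None

# Toy data rows: an index column, a name column and a date column.
rows = [
    ["1", "alpha", "2001-01-01"],
    ["2", "beta", "2001-01-02"],
    ["3", "gamma", "2001-01-03"],
]
label_index = find_label_column(rows)  # 1
```

With this sketch, normalize("The Dark Side of the Moon") yields dark side moon, matching the example in the text. The entries in Table 1 suggest that the original normalization goes further, for example by splitting letters from digits, which is not reproduced here.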

Table 1 shows the value coverage for different topics.

Countries | Cities | Rivers | Movies | Camera Models | Music Albums | Footballers
Name #Tables | Name #Tables | Name #Tables | Name #Tables | Name #Tables | Name #Tables | Name #Tables
usa 111040 | new york 47264 | mississippi 69466 | avatar 5496 | nikon d 200 624 | thriller 2769 | robin van persie 6046
france 76373 | luxembourg 39092 | lena 6109 | taxi 3993 | canon eos 20 d 345 | aftermath 1710 | cristiano ronaldo 1431
germany 64144 | london 18651 | don 5031 | titanic 3008 | canon eos 40 d 314 | twist shout 1512 | fernando torres 1055
united states 61923 | berlin 18612 | yangtze 2212 | inception 1628 | nikon d 5000 304 | true blue 1247 | ronaldo 1055
japan 56873 | madrid 17021 | mackenzie 2090 | fantastic four 1123 | canon eos 30 d 293 | like prayer 1123 | gareth bale 1018
italy 52358 | andorra 13940 | oka 866 | black knight 1104 | canon eos 50 d 270 | yellow submarine 923 | lionel messi 999
mexico 43628 | dublin 13177 | tigris 577 | deception 965 | nikon d 80 261 | like virgin 885 | frank lampard 998
netherlands 40997 | amsterdam 9489 | vardar 569 | moulin rouge 948 | canon eos 10 d 208 | something new 696 | thierry henry 897
austria 39866 | athens 9019 | tisa 484 | minority report 867 | nikon d 60 194 | dark side moon 667 | ronaldinho 673
belgium 38791 | new york city 4641 | loire 418 | ice age 816 | nikon d 90 183 | shine light 636 | david beckham 595
denmark 35007 | budapest 4188 | volta 396 | unfaithful 786 | nikon d 100 162 | please please me 621 | paolo maldini 569
ireland 33200 | brussels 4036 | volga 363 | glitter 570 | canon eos d 30 90 | abbey road 551 | roberto carlos 383
finland 32011 | copenhagen 3671 | kama 283 | joy ride 526 | canon eos d 60 72 | sticky fingers 458 | sergio ramos 371
greece 31811 | lisbon 2950 | rhine 280 | from hell 388 | sony cybershot dsc s 3000 digital camera 31 | one day your life 450 | xabi alonso 329
hungary 29003 | helsinki 2831 | danube 265 | just married 356 | sony cybershot dsc w 510 point amp shoot 31 | some girls 429 | zinedine zidane 323
bulgaria 27901 | bern 1758 | sava 253 | shallow hal 206 | sony cybershot dsc w 520 digital camera 31 | exciter 301 | fabio cannavaro 196
malta 22148 | bratislava 1699 | indus 247 | high crimes 187 | sony cybershot dsc w 570 digital camera 31 | let bleed 285 | rivaldo 195
croatia 21981 | bucharest 1544 | ural 207 | monkeybone 162 | olympus e 500 23 | rubber soul 245 | roberto baggio 158
egypt 21917 | kiev 1405 | elbe 166 | like mike 137 | olympus e 30 22 | blood dance floor 220 | marco van basten 140
cyprus 19032 | belgrade 1374 | seine 143 | joe somebody 127 | sony cybershot dsc w 530 digital camera 21 | black celebration 177 | oliver kahn 116

Table 1 - Values Coverage

5. Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using 5 pre-defined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
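The two-step procedure (per-value detection followed by a majority vote) can be sketched in Python as below. The per-value rules are assumptions made for illustration, since the page does not describe them: which literals count as boolean, which date formats are recognized, and that a list is a value with at least two parts separated by ';' or '|'. Only the five type names and the majority vote come from the text.

```python
import re
from collections import Counter
from datetime import datetime

BOOLEAN_LITERALS = {"true", "false", "yes", "no"}        # assumed literals
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")      # assumed formats

def guess_value_type(text):
    """Guess one of: boolean, numeric, date, list, string (rules are assumptions)."""
    value = text.strip()
    if value.lower() in BOOLEAN_LITERALS:
        return "boolean"
    try:
        float(value.replace(",", ""))  # tolerate thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            continue
    parts = [p for p in re.split(r"[;|]", value) if p.strip()]
    if len(parts) >= 2:
        return "list"
    return "string"

def guess_column_type(values):
    """Return the most frequent value type among the non-empty cells, or None."""
    types = Counter(guess_value_type(v) for v in values if v.strip())
    if not types:
        return None
    # On a tie, most_common keeps the type that was encountered first.
    return types.most_common(1)[0][0]

column_type = guess_column_type(["12", "7.5", "n/a", "1,200"])  # "numeric"
```

The majority vote makes the column type robust against a few stray values, such as the n/a placeholder in the example.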

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

6. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons is found here.

7. Credits

The extraction of the Web Table Corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.
