This page provides basic statistics describing the relational subset of the WDC Web Table Corpus 2012. The subset consists of 147 million relational tables. In relational tables, a set of entities is described with one or more attributes. In addition to this subset, we offer statistics about a subset consisting of only English-language relational tables. All tables are publicly available for download.

1. TLDs Distribution

Figure 1 shows the distribution of extracted Web tables per top-level domain.

Fig. 1 - Number of tables per TLD

The complete distribution of tables per top-level domain can be found here. The file contains a list of two tab-separated fields, TLD and #tables. For example, the first entry of the file, com 75229798, means that 75229798 tables were extracted from the "com" top-level domain.
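As a minimal sketch of how such a two-field file could be read, the following Python snippet parses tab-separated lines into a dictionary. The function name read_distribution is hypothetical, and only the "com" entry in the sample is taken from this page; the other two sample lines are invented placeholders.

```python
import io


def read_distribution(lines):
    """Parse 'key<TAB>count' lines into a dict mapping key -> int count."""
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        key, count = fields
        counts[key] = int(count)
    return counts


# Only the first line is taken from the page; the rest are made-up placeholders.
sample = io.StringIO("com\t75229798\nexample1\t100\nexample2\t50\n")
tld_counts = read_distribution(sample)
total_tables = sum(tld_counts.values())
share_com = tld_counts["com"] / total_tables
```

The same reader should work for the other distribution files described on this page, since they are all described as lists of two tab-separated fields.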

2. Number of Columns and Rows Distribution

The table below provides basic statistics on table size in the complete corpus. The row count excludes the header row and thus refers only to the data rows of the table.

	min.	max.	average	median
columns	2	2 368	3.49	3
rows	1	70 068	12.41	6

2.1 Number of Columns Distribution

Figure 2 shows the distribution of the number of columns per table.

Fig. 2 - Distribution of Number of Columns per Table

The complete distribution of the number of columns per table can be found here. The file contains a list of two tab-separated fields, #columns and #tables. For example, the first entry of the file, 2 70147349, means that 70147349 tables have exactly two columns.

2.2 Number of Rows Distribution

Figure 3 shows the distribution of the number of data rows per table.
Data rows are all rows of the table that are positioned under the header row and contain at least one non-empty cell.

Fig. 3 - Distribution of Number of Rows per Table

The complete distribution of the number of rows per table can be found here. The file contains a list of two tab-separated fields, #rows and #tables. For example, the first entry of the file, 1 426104, means that 426104 tables have exactly one data row.
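Summary statistics such as the minimum, maximum, average and median reported in Section 2 can in principle be recomputed from a distribution of this shape. The sketch below shows one way to do so for a mapping from a value (for example #rows) to the number of tables with that value. The function name summarize and the toy data are assumptions for illustration, not part of the corpus tooling.

```python
def summarize(distribution):
    """Compute min, max, mean and median from a {value: number_of_tables} mapping."""
    values = sorted(distribution)
    total = sum(distribution.values())
    mean = sum(v * distribution[v] for v in values) / total

    # Median: walk the sorted values until the cumulative count passes the
    # two middle positions (they coincide when the total is odd).
    lower_idx = (total - 1) // 2
    upper_idx = total // 2
    lower = upper = None
    seen = 0
    for v in values:
        seen += distribution[v]
        if lower is None and seen > lower_idx:
            lower = v
        if seen > upper_idx:
            upper = v
            break

    return {
        "min": values[0],
        "max": values[-1],
        "mean": mean,
        "median": (lower + upper) / 2,
    }


# Invented toy distribution: two tables with 1 row, one with 2 rows, one with 10 rows.
toy = {1: 2, 2: 1, 10: 1}
stats = summarize(toy)
```

Working on the aggregated counts avoids expanding the distribution into one entry per table, which matters for a corpus of this size.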

3. Headers Distribution

To get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic assumes that the column headers are in the first row of the Web table that contains at least 80% as many non-empty cells as the row with the highest number of non-empty cells in the table. The heuristic will fail on vertical tables [Crestan2011], on tables that require more sophisticated header unfolding [Chen2013], as well as on tables that do not have headers (20% of all tables according to [Pimplikar2012]). We also did not take column name synonyms like 'population' and 'number of inhabitants' into account. Thus, the numbers presented below should be understood as lower bounds.
With the current approach, we were able to identify a total of 509,351,189 column headers, of which 28,072,596 are distinct.
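The header heuristic described above can be sketched as follows. This is an illustrative reading of the description, not the code used for the corpus: it assumes a table is given as a list of rows of strings, and it interprets the 80% rule relative to the non-empty cell count of the fullest row. The function name find_header_row is hypothetical.

```python
def find_header_row(table, threshold=0.8):
    """Return the index of the first row whose non-empty cell count is at
    least `threshold` times that of the fullest row, or None if no row has
    any non-empty cell."""

    def non_empty(row):
        return sum(1 for cell in row if cell is not None and str(cell).strip())

    counts = [non_empty(row) for row in table]
    if not counts or max(counts) == 0:
        return None

    cutoff = threshold * max(counts)
    for index, count in enumerate(counts):
        if count >= cutoff:
            return index
    return None


# Invented example: a caption row with one filled cell precedes the header row.
example_table = [
    ["City overview", "", ""],
    ["name", "population", "area"],
    ["berlin", "3500000", "891"],
]
header_index = find_header_row(example_table)
```

In this example the caption row is skipped because it has only one of three cells filled, and the second row is selected as the header.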

Figure 4 shows the number of tables in the corpus that contain some of the most popular column headers.

Fig. 4 - Popular Column Headers

The complete distribution of headers can be found here. The file contains a list of two tab-separated fields, header and #tables. For example, the first entry of the file, name 4653155, means that 4653155 tables contain a column with the header name.

To get a better understanding of which topics are covered in the corpus, we performed a rough matching to the cross-domain knowledge base DBpedia, which is a structured data version of a subset of Wikipedia. We scanned the tables for properties used in DBpedia that are also used as table headers in our dataset. The complete list can be found here. The file contains a list of two tab-separated fields, DBpediaProperty and #tables. For example, the entry title 2121028 means that 2121028 tables contain a column with the header title.
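A rough matching of this kind could look like the sketch below, which keeps only those headers whose lowercased form equals a known property name. The exact matching rule used for the corpus is not stated on this page, so case-insensitive string equality is an assumption, as are the function name and the small property list. The counts for name and title are the ones quoted on this page; the third header is invented.

```python
def match_headers_to_properties(header_counts, property_names):
    """Keep the headers whose lowercased, stripped form equals a property name."""
    known = {name.strip().lower() for name in property_names}
    return {
        header: tables
        for header, tables in header_counts.items()
        if header.strip().lower() in known
    }


# Counts for "name" and "title" are quoted on this page; "foo bar" is invented.
header_counts = {"name": 4653155, "title": 2121028, "foo bar": 10}
# Hypothetical, illustrative property list; the real list is in the linked file.
property_names = ["title", "birthDate"]
matched = match_headers_to_properties(header_counts, property_names)
```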

4. Label Distribution

Most applications working with Web tables assume that the tables are entity-attribute tables and that they contain a string column that provides the name of the described entity (label column). To get an initial insight into the entity coverage of the corpus, we determined the label column of the tables using a simple heuristic and counted value occurrences in the label column of all Web tables. Our heuristic takes the left-most column that is not a number or a date and has almost unique values to be the label column. [Venetis2010] report an accuracy of 83% using a similar simple heuristic.
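The label column heuristic can be sketched roughly as follows. The regular expressions for numbers and dates, the rule that a column counts as numeric or date when more than half of its values look that way, and the 0.9 uniqueness threshold standing in for "almost unique" are all assumptions made for this illustration; the page does not specify them.

```python
import re

# Assumed, deliberately simple patterns for numbers and dates.
NUMBER_RE = re.compile(r"^[+-]?\d+([.,]\d+)*$")
DATE_RE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$|^\d{1,2}[./]\d{1,2}[./]\d{2,4}$")


def looks_like_number_or_date(value):
    text = value.strip()
    return bool(NUMBER_RE.match(text) or DATE_RE.match(text))


def find_label_column(rows, min_uniqueness=0.9):
    """Return the index of the left-most column that is mostly non-numeric,
    non-date and has almost unique values, or None if there is none.
    `rows` holds the data rows only (no header row)."""
    if not rows:
        return None
    n_cols = max(len(row) for row in rows)
    for col in range(n_cols):
        values = [row[col].strip() for row in rows
                  if col < len(row) and row[col].strip()]
        if not values:
            continue
        typed = sum(looks_like_number_or_date(v) for v in values)
        if typed > len(values) / 2:
            continue  # column is mostly numbers or dates
        if len(set(values)) / len(values) >= min_uniqueness:
            return col
    return None


# Invented example rows: a rank column, a city name column, a numeric column.
rows = [
    ["1", "berlin", "3500000"],
    ["2", "london", "8900000"],
    ["3", "madrid", "3200000"],
]
label_col = find_label_column(rows)
```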
Before counting, all values are normalized and stop words are removed. For example, the music album name The Dark Side of the Moon is normalized to dark side moon. While counting the value occurrences, we do not take surface form synonyms into account (like 'New York' and 'New York City'). Thus, the reported numbers should be understood as lower bounds. In the corpus of Web tables, we were able to identify a total of 1,742,015,870 label column values, of which 253,001,795 are distinct.
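A minimal sketch of the normalization and counting step is given below. The stop-word list and the tokenization rule (lowercasing and keeping runs of ASCII letters and digits) are assumptions; the page only states that values are normalized and stop words are removed, and the actual normalization may differ in detail. The input labels are illustrative.

```python
import re
from collections import Counter

# Assumed stop-word list; the list used for the corpus is not given on this page.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "on", "it"}


def normalize_label(value):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


def count_labels(values):
    """Count occurrences of normalized label values."""
    return Counter(normalize_label(value) for value in values)


labels = ["The Dark Side of the Moon", "Dark Side of the Moon", "Abbey Road"]
label_counts = count_labels(labels)
```

With this sketch, the two spellings of the album title collapse to the same key, dark side moon, while synonyms such as 'New York' and 'New York City' would still be counted separately.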

Table 1 shows the coverage of values from different topics.

Countries		Cities		Rivers		Movies		Camera Models		Music Albums		Footballers
Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables	Name	#Tables
usa	135688	new york	59398	mississippi	87367	avatar	11080	nikon d 200	1390	thriller	4268	robin van persie	7439
germany	91170	luxembourg	47722	lena	8717	inception	8121	canon eos 20 d	480	aftermath	2466	david beckham	3041
japan	76512	berlin	46850	don	6504	taxi	6292	canon eos 40 d	355	twist shout	2017	cristiano ronaldo	2927
united states	73169	london	37541	mackenzie	3346	titanic	4270	nikon d 5000	351	true blue	1737	lionel messi	1748
italy	71129	amsterdam	31548	yangtze	2241	fantastic four	2113	canon eos 30 d	346	like prayer	1616	ronaldo	1716
austria	56622	madrid	30486	oka	1708	moulin rouge	1616	nikon d 80	339	like virgin	1414	gareth bale	1708
netherlands	56533	andorra	21075	loire	1096	black knight	1298	canon eos 50 d	304	yellow submarine	1405	fernando torres	1641
mexico	55267	dublin	19790	tigris	946	deception	1286	nikon d 90	274	dark side moon	1201	frank lampard	1461
belgium	53175	athens	12228	volga	904	minority report	1201	canon eos 10 d	248	abbey road	971	thierry henry	1332
ireland	48543	budapest	9702	sava	873	ice age	1201	nikon d 60	233	something new	919	ronaldinho	1195
denmark	48389	helsinki	7761	volta	710	unfaithful	1179	nikon d 100	191	please please me	886	roberto carlos	817
finland	45156	bern	5839	vardar	595	glitter	943	canon eos d 30	172	shine light	833	xabi alonso	735
greece	42314	new york city	5611	kama	582	joy ride	674	sony cybershot dsc w120	104	some girls	801	oliver kahn	710
russia	41729	brussels	5305	tisa	552	from hell	520	canon eos d 60	93	sticky fingers	740	sergio ramos	647
hungary	38536	copenhagen	4949	ural	437	just married	459	sony cybershot dsc s3000	67	one day your life	711	paolo maldini	638
malta	37009	bratislava	4938	indus	420	shallow hal	265	sony cybershot dsc w520	64	exciter	543	zinedine zidane	517
bulgaria	36523	belgrade	4460	elbe	382	highn crimes	247	sony cybershot dsc w510	62	let bleed	492	fabio cannavaro	348
croatia	29022	lisbon	4194	danube	365	monkeybone	228	olympus e 500	53	rubber soul	464	rivaldo	331
egypt	27725	kiev	2406	rhine	352	like mike	175	sony cybershot dsc w570	45	blood dance floor	382	roberto baggio	251
cyprus	25828	bucharest	2180	seine	225	joe somebody	160	olympus e 30	38	black celebration	338	marco van basten	243

Table 1 - Values Coverage

5. Column Data Types Distribution

We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using five predefined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
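The two-step procedure (guess a type per value, then take the most frequent type per column) can be sketched as below. Only the five type names and the majority vote come from the description above; the concrete detection rules for each type are simplified assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter


def guess_value_type(value):
    """Guess one of: boolean, numeric, date, list, string (assumed simple rules)."""
    text = value.strip()
    if text.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    if re.fullmatch(r"[+-]?\d+([.,]\d+)?", text):
        return "numeric"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    if re.search(r"[,;|]", text):
        return "list"
    return "string"


def guess_column_type(values):
    """Return the most frequent value type in the column, ignoring empty cells."""
    types = Counter(guess_value_type(v) for v in values if v.strip())
    if not types:
        return "string"  # assumed default for columns without any values
    return types.most_common(1)[0][0]


# Invented example column: three numeric values and one placeholder string.
column_type = guess_column_type(["12", "7.5", "n/a", "3"])
```

Because the column type is decided by majority, a few stray values such as "n/a" do not change the result for an otherwise numeric column.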

Figure 5 shows the distribution of column data types.

Fig. 5 - Column Data Types Distribution

6. Feedback

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.

More information about Web Data Commons can be found here.

7. Credits

The extraction of the Web Table Corpus was supported by the German Research Foundation (DFG) under grant number PA 2373/1-1 (Mine@LOD), an Amazon Web Services in Education Grant award and by the EU FP7 research project PlanetData.

WDC Web Table Corpus 2012 - Statistics about Relational Subset
