This page provides basic statistics describing the relational subset of the WDC Web Tables Corpus 2015. The subset consists of 90 million tables out of the 233 million Web tables in the corpus. In relational tables, a set of similar entities is described with one or more attributes. In addition to this subset, we offer statistics about a subset consisting of only English-language relational tables and a subset containing entity tables. All tables are publicly available for download.
1. TLDs Distribution
Figure 1 shows the distribution of extracted Web tables per top-level domain.
Fig. 1 - Number of tables per TLD
The complete distribution of tables per top-level domain can be found here.
The file contains a list of key-value pairs, with the TLD as key and the number of tables as value. For example, the entry Key : .com Value : 62249515 means that 62,249,515 tables were extracted from the "com" domain. Compared to the previous corpus, it is noticeable that the "gov" domain was not even among the top 20 TLDs.
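The distribution files can be read with a few lines of code. The following is a minimal sketch that assumes every line follows the layout of the example entry above (`Key : <key> Value : <count>`); the function name `parse_distribution` and the regular expression are illustrative assumptions, not part of the corpus tooling.

```python
import re

# Assumed line layout, taken from the example entry: "Key : .com Value : 62249515".
LINE_RE = re.compile(r"^\s*Key\s*:\s*(?P<key>.*?)\s+Value\s*:\s*(?P<value>\d+)\s*$")


def parse_distribution(lines):
    """Return a dict mapping each key to its table count."""
    counts = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank or malformed lines
        counts[match.group("key")] = int(match.group("value"))
    return counts


sample = ["Key : .com Value : 62249515"]
tld_counts = parse_distribution(sample)
```

The same parser should apply to the other distribution files described on this page, provided they use the same line layout.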
2. PLD Distribution
Figure 2 shows the distribution of extracted Web tables per pay-level domain.
Fig. 2 - Number of tables per PLD
Altogether, 540,418 different PLDs are represented among the 90 million relational tables. The complete distribution of tables per pay-level domain can be found here. Again, the file contains key-value pairs, where the key is the PLD and the value is the number of tables for that PLD.
3. Table Sizes and Distribution
Table 1 shows the overall number of extracted relational tables, divided into horizontal and vertical tables. In a horizontal table, the entities are represented in rows and the attributes in columns. When the entities are instead arranged in columns and the attributes in rows, we speak of a vertical table. A vertical table can be converted into a horizontal table by simply flipping it, i.e. by swapping its rows and columns.
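Flipping a vertical table amounts to a transposition. A minimal sketch, assuming the table is held as a list of equally long rows; the sample values are made up for illustration and are not taken from the corpus:

```python
def flip_table(rows):
    """Transpose a table given as a list of equally long rows."""
    return [list(column) for column in zip(*rows)]


# Illustrative vertical table: attributes in rows, entities in columns.
vertical = [
    ["name", "Berlin", "Paris"],
    ["country", "Germany", "France"],
]
horizontal = flip_table(vertical)
```

After flipping, the first row holds the attribute names and every further row describes one entity.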
Orientation | #tables |
---|---|
horizontal | 84,784,969 |
vertical | 5,481,254 |
sum | 90,266,223 |
Table 2 provides basic statistics on table size. The row counts exclude the header row (if present) and thus refer to data rows. Data rows are all rows of a table that are positioned below the header row and contain at least one non-empty cell.
Dimension | min. | max. | average | median |
---|---|---|---|---|
Columns Horizontal Tables (attributes) | 2 | 18,106 | 5.20 | 4 |
Rows Horizontal Tables (entities) | 2 | 17,033 | 14.45 | 6 |
Columns Vertical Tables (entities) | 3 | 16,142 | 8.44 | 5 |
Rows Vertical Tables (attributes) | 1 | 486 | 3.66 | 3 |
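The definition of data rows used for Table 2 can be sketched as follows. This is an illustration under the assumption that a table is a list of rows of strings and that whitespace-only cells count as empty; the function name `count_data_rows` is hypothetical.

```python
def count_data_rows(rows, has_header):
    """Count rows below the header that contain at least one non-empty cell."""
    body = rows[1:] if has_header else rows
    return sum(1 for row in body if any(cell.strip() for cell in row))


# Illustrative table: one header row, two data rows, and one empty row.
table = [
    ["city", "population"],
    ["Berlin", ""],
    ["", ""],
    ["Paris", "2100000"],
]
n_rows = count_data_rows(table, has_header=True)
```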
3.1. Horizontal Tables
Figure 3 shows the distribution of the number of columns (attributes) per horizontal table.
Fig. 3 - Distribution of Number of Columns (attributes) per Table (Horizontal Table)
The complete distribution of the number of columns per horizontal table can be found here. The key of each key-value pair is the number of columns, and the value is the number of horizontal tables with exactly that many columns. For example, the line Key : 33 Value : 1323 means that there are 1,323 tables with exactly 33 columns.
Figure 4 shows the distribution of number of data rows (entities) per table (Horizontal Table).
Fig. 4 - Distribution of Number of Rows (entities) per Table (Horizontal Table)
The complete distribution of the number of rows per horizontal table can be found here.
The key of each key-value pair is the number of data rows, and the value is the number of horizontal tables with exactly that many data rows. For example, the line Key : 8 Value : 2932320 means that there are 2,932,320 tables with exactly 8 data rows.
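Summary statistics such as the average and median in Table 2 can be recomputed from a distribution of this kind. The sketch below assumes the distribution has already been loaded into a dict mapping a size to the number of tables of that size; the function name `mean_and_median` and the sample numbers are illustrative only.

```python
def mean_and_median(distribution):
    """Compute mean and (lower) median from a size -> table-count mapping."""
    total = sum(distribution.values())
    mean = sum(size * count for size, count in distribution.items()) / total
    # Walk the sizes in ascending order until the cumulative count
    # reaches the middle position of the sorted list of tables.
    midpoint = (total + 1) // 2
    running = 0
    for size in sorted(distribution):
        running += distribution[size]
        if running >= midpoint:
            return mean, size


# Illustrative distribution: three tables with 2 rows, one with 6, one with 10.
mean, median = mean_and_median({2: 3, 6: 1, 10: 1})
```

Working on the aggregated counts avoids expanding the distribution into one entry per table, which matters at the scale of 90 million tables.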
3.2. Vertical Tables
Figure 5 shows the distribution of number of columns (entities) per table (Vertical Table).
Fig. 5 - Distribution of Number of Columns (entities) per Table (Vertical Table)
The complete distribution of the number of columns per vertical table can be found here.
The key of each key-value pair is the number of columns, and the value is the number of vertical tables with exactly that many columns. For example, the line Key : 13 Value : 42917 means that there are 42,917 tables with exactly 13 columns. Since these tables are vertical, this number corresponds to the number of rows after flipping the table.
Figure 6 shows the distribution of the number of data rows (attributes) per vertical table.
Fig. 6 - Distribution of Number of Rows (attributes) per Table (Vertical Table)
The complete distribution of the number of rows per vertical table can be found here.
The key of each key-value pair is the number of data rows, and the value is the number of vertical tables with exactly that many data rows. For example, the line Key : 8 Value : 38554 means that there are 38,554 tables with exactly 8 rows. Since these tables are vertical, this number corresponds to the number of columns after flipping the table.
4. Header Distribution
To get a first impression of the topics covered by the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic is based on the Cell Content Pattern, which is defined as a tuple containing a representation of the composition of characters in a cell [Tang2006]. After extracting the content pattern of each cell, we compare the patterns in the current row with those in the following rows. If a row shows different patterns than the rows that follow it, we consider this row to be the header containing the column names. So far, we only consider two cases: either the first row is the header row, or no header row exists. A more sophisticated header unfolding [Chen2013] would be necessary to find, for example, headers that span several rows.
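The idea behind the heuristic can be sketched as follows. This is not the extraction code used for the corpus: the character classes, the number of rows compared (`lookahead`), and the decision threshold are all assumptions made for illustration.

```python
from itertools import groupby


def cell_pattern(cell):
    """Collapse a cell into a tuple of character classes,
    e.g. 'Year 2015' -> ('A', ' ', '9')."""
    def char_class(ch):
        if ch.isalpha():
            return "A"
        if ch.isdigit():
            return "9"
        if ch.isspace():
            return " "
        return "P"  # punctuation and other symbols
    return tuple(key for key, _ in groupby(char_class(ch) for ch in cell))


def first_row_is_header(rows, lookahead=3):
    """Guess whether rows[0] is a header by comparing patterns per column."""
    if len(rows) < 2 or not rows[0]:
        return False
    header, body = rows[0], rows[1:1 + lookahead]
    differing = 0
    for col, cell in enumerate(header):
        body_patterns = {cell_pattern(row[col]) for row in body if col < len(row)}
        if cell_pattern(cell) not in body_patterns:
            differing += 1
    # Assumed threshold: at least half of the columns must differ.
    return differing * 2 >= len(header)


# Illustrative tables, not taken from the corpus.
with_header = [["City", "Population"], ["Berlin", "3500000"], ["Paris", "2100000"]]
without_header = [["Berlin", "3500000"], ["Paris", "2100000"], ["Rome", "2800000"]]
```

In the first sample the "Population" cell is alphabetic while the cells below it are numeric, so the first row is flagged as a header; in the second sample every first-row pattern recurs in the rows below.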
In contrast to our previous extraction, we can now deal with headers of vertical tables [Crestan2011], and we know whether a header is present or not (about 20% of all tables do not have a header according to [Pimplikar2012]). We did not take column name synonyms such as 'population' and 'number of inhabitants' into account. The only normalization we apply is to remove a trailing 's' to obtain the singular form of nouns. Thus, the number of different headers can be seen as an upper bound.
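The normalization step is simple enough to show in a few lines. A minimal sketch, assuming that exactly one trailing 's' is removed and that no other normalization (such as lowercasing) takes place; the header values are illustrative:

```python
from collections import Counter


def normalize_header(header):
    """Strip a single trailing 's' to approximate the singular form."""
    return header[:-1] if header.endswith("s") else header


# Illustrative headers, not taken from the corpus.
headers = ["Titles", "Title", "Price", "Prices"]
header_counts = Counter(normalize_header(h) for h in headers)
```

Such a rule is deliberately crude: it also shortens words like "address" to "addres", and it does not merge synonyms, which is why the count of distinct headers is only an upper bound.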
With the current approach, we were able to identify a total of 462,165,071 column headers, of which 5,477,071 are distinct. Figure 7 shows popular (useful) column headers together with their number of occurrences. The most frequent header is the empty string (29,879,843 occurrences), which we exclude from the figure since it does not provide any information about the content of the tables.
Fig. 7 - Popular Column Headers
The complete distribution of headers can be found here.
The key of each key-value pair is the header, and the value is the number of columns with exactly this header. For example, the line Key : Title Value : 4043775 means that there are 4,043,775 columns with "Title" as header.
5. Column Data Types Distribution
We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each cell in a column is detected by trying to parse it as each of the candidate types. We use the following six predefined data types: string, numeric, date, link, boolean, and list. Afterwards, the most frequent data type in the column is chosen as the data type of the whole column.
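The two-step procedure (per-cell parsing followed by a majority vote) can be sketched as follows. The concrete parsing rules below, including the date formats, the boolean vocabulary, the link prefixes, and the comma-based list test, are assumptions for illustration; the page does not specify how each type is recognized.

```python
from collections import Counter
from datetime import datetime

# Assumed date formats; the source does not list the formats it accepts.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def guess_cell_type(cell):
    """Return one of: boolean, numeric, date, link, list, string."""
    text = cell.strip()
    if text.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    if text.lower().startswith(("http://", "https://", "www.")):
        return "link"
    if text.count(",") >= 2:
        return "list"
    return "string"


def guess_column_type(cells):
    """Pick the most frequent cell type of a non-empty column."""
    votes = Counter(guess_cell_type(cell) for cell in cells)
    return votes.most_common(1)[0][0]


column_type = guess_column_type(["12", "7.5", "n/a", "1,323"])
```

Because the column type is decided by majority, a few unparseable cells such as "n/a" do not change the result for an otherwise numeric column.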
Figure 8 shows the distribution of column data types. The category "Other" comprises link, boolean, and list, since these types account for only a very small percentage of all columns.
Fig. 8 - Column Data Types Distribution
6. Context Information
Table 3 provides basic statistics about the context data that we additionally extracted. For each table, we extract the 200 words before and after the table. In previous experiments, we found that without additional context or temporal information it is difficult to process the tables further, e.g. to match them to a knowledge base [Zhang2013]. For almost half of the tables, we can extract a timestamp located after the relational table. In many cases, this timestamp is the imprint of the web page. The last modified date comes from the HTTP header of the HTML page.
Context data | #tables |
---|---|
Timestamp Before Relational Table | 13,520,557 |
Timestamp After Relational Table | 42,990,899 |
Last Modified Date | 19,275,614 |
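The 200-word context window can be illustrated with a short sketch. It assumes the page text has already been split into the words preceding and following the table; how the text is tokenized is not described here, and the function name `context_window` is hypothetical.

```python
def context_window(words_before, words_after, size=200):
    """Return the last `size` words before and the first `size` words after a table."""
    return words_before[-size:], words_after[:size]


# Hypothetical page text, already split into words around the table position.
before = [f"b{i}" for i in range(300)]
after = [f"a{i}" for i in range(50)]
left, right = context_window(before, after)
```

If fewer than 200 words are available on one side, as in the `after` sample, the window simply contains all of them.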
7. Further Information
We also offer statistics describing the English-language subset of the relational tables and the entity tables subset.
All tables are publicly available for download.