On this page, we provide the results of an initial profiling of the entity tables from the WDC Web Table Corpus 2015. An entity table usually describes exactly one entity with several attributes; the name of the entity itself is not contained in the table but can be inferred from the context. Of all 233 million extracted tables, 139,687,207 are entity tables. In addition, we offer statistics about a subset consisting of only relational tables and a subset containing English tables. All tables are publicly available for download.
1. TLD Distribution
Figure 1 shows the distribution of extracted Web tables per top-level domain.
Fig. 1 - Number of tables per TLD
The complete distribution of tables per top-level domain can be found here.
The file contains a list of key-value pairs, with the TLD as key and the number of tables as value. For example, the entry Key : .com Value : 62249515 means that 62,249,515 tables were extracted from the "com" domain.
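As an illustration of how such a distribution file might be read, the following minimal Python sketch parses entries of the form shown above. The exact file layout (one entry per line, with "Key :" and "Value :" markers) is an assumption derived from the single example entry, and `parse_distribution` is a hypothetical helper name.

```python
import re

# Assumed line layout, taken from the example entry: "Key : .com Value : 62249515"
ENTRY = re.compile(r"^\s*Key\s*:\s*(?P<key>.*?)\s*Value\s*:\s*(?P<value>\d+)\s*$")

def parse_distribution(lines):
    """Return a dict mapping each key to its integer count."""
    counts = {}
    for line in lines:
        match = ENTRY.match(line)
        if match:  # lines that do not fit the assumed layout are skipped
            counts[match.group("key")] = int(match.group("value"))
    return counts

sample = ["Key : .com Value : 62249515"]
tld_counts = parse_distribution(sample)
```

The same helper would apply to the other distribution files on this page if they follow the same layout.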
2. PLD Distribution
Figure 2 shows the distribution of extracted Web tables per pay-level domain.
Fig. 2 - Number of tables per PLD
Altogether, 616,946 different PLDs are represented among the roughly 140 million entity tables. The complete distribution of tables per pay-level domain can be found here. Again, the file contains key-value pairs where the key represents the PLD and the value the number of tables per PLD.
3. Table Sizes and Distribution
Table 1 shows the overall number of extracted entity tables, divided into horizontal and vertical tables. In a horizontal table, the entities are represented in rows and the attributes in columns. When the entities are instead represented in columns and the attributes in rows, we speak of a vertical table. A vertical table can be transformed into a horizontal table by simply flipping (transposing) it.
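Flipping a vertical table can be sketched in a few lines of Python. The sketch assumes that a table is a rectangular list of rows; the sample data and the function name `flip` are made up for illustration.

```python
def flip(table):
    """Transpose a rectangular table given as a list of rows."""
    # zip(*table) pairs up the i-th cell of every row, i.e. it yields the columns
    return [list(column) for column in zip(*table)]

# Hypothetical vertical table: attributes in rows, entities in columns
vertical = [
    ["name", "Berlin", "Paris"],
    ["country", "Germany", "France"],
]
horizontal = flip(vertical)
```

Note that `zip` truncates to the shortest row, so ragged tables would need padding first.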
| | #tables |
---|---|
horizontal | 76,699,222 |
vertical | 62,987,985 |
sum | 139,687,207 |
Table 2 provides basic statistics on the size of the tables. The row numbers exclude the header row (if present) and thus refer to data rows. Data rows are all rows of the table that are positioned below the header row and contain at least one non-empty cell.
| | min. | max. | average | median |
---|---|---|---|---|
Columns Horizontal Tables (attributes) | 2 | 256 | 2.40 | 2 |
Rows Horizontal Tables (entities) | 3 | 46994 | 9.08 | 3 |
Columns Vertical Tables (entities) | 3 | 24272 | 7.53 | 3 |
Rows Vertical Tables (attributes) | 2 | 241 | 2.06 | 2 |
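The definition of data rows used for Table 2 can be illustrated with a short Python sketch. It assumes a table is a list of rows of strings and that the presence of a header row is already known; `count_data_rows` is a hypothetical name, and treating whitespace-only cells as empty is an assumption.

```python
def count_data_rows(rows, has_header):
    """Count rows below the header that contain at least one non-empty cell."""
    body = rows[1:] if has_header else rows
    # Assumption: a cell consisting only of whitespace counts as empty
    return sum(1 for row in body if any(cell.strip() for cell in row))

rows = [
    ["attribute", "value"],  # header row, not counted
    ["height", "1.80 m"],
    ["", " "],               # empty row, not counted
    ["weight", ""],
]
n_data_rows = count_data_rows(rows, has_header=True)
```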
3.1 Horizontal Tables
Figure 3 shows the distribution of number of columns (attributes) per table (Horizontal Table).
Fig. 3 - Distribution of Number of Columns (attributes) per Table (Horizontal Table)
The complete distribution of number of columns per horizontal table can be found here. The key of the key-value pair represents the number of columns and the value the number of horizontal tables having exactly this number of columns, e.g. the line Key : 13 Value : 1323
means that there are 1,323 tables with exactly 13 columns.
Figure 4 shows the distribution of number of data rows (entities) per table (Horizontal Table).
Fig. 4 - Distribution of Number of Rows (entities) per Table (Horizontal Table)
The complete distribution of number of rows per horizontal table can be found here.
The key of the key value pair represents the number of data rows and the value the number of horizontal tables having exactly this number of data rows, e.g. the line Key : 8 Value : 2932320
means that there are 2,932,320 tables with exactly 8 data rows.
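Summary statistics such as those in Table 2 can in principle be derived from a distribution of this kind. The Python sketch below assumes the distribution has already been loaded into a dict mapping a size to a table count; the toy numbers are invented, and the lower median is used as a simplification, which may differ from how the published medians were computed.

```python
def summarize(distribution):
    """distribution maps a size (e.g. number of data rows) to a table count."""
    sizes = sorted(distribution)
    total = sum(distribution.values())
    mean = sum(size * count for size, count in distribution.items()) / total
    # Lower median: smallest size whose cumulative count covers half of all tables
    seen = 0
    median = sizes[-1]
    for size in sizes:
        seen += distribution[size]
        if 2 * seen >= total:
            median = size
            break
    return {"min": sizes[0], "max": sizes[-1], "average": mean, "median": median}

# Invented toy distribution: 5 tables with 3 rows, 2 with 8 rows, 1 with 20 rows
stats = summarize({3: 5, 8: 2, 20: 1})
```

Working on the histogram instead of the individual tables keeps the computation small even for millions of tables.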
3.2. Vertical Tables
Figure 5 shows the distribution of number of columns (entities) per table (Vertical Table).
Fig. 5 - Distribution of Number of Columns (entities) per Table (Vertical Table)
The complete distribution of number of columns per vertical table can be found here.
The key of the key value pair represents the number of columns and the value the number of vertical tables having exactly this number of columns, e.g. the line Key : 13 Value : 42917
means that there are 42,917 tables with exactly 13 columns. Since we know that these tables are vertical, this number corresponds to the number of rows after flipping the table.
Figure 6 shows the distribution of number of data rows (attributes) per table (Vertical Table).
Fig. 6 - Distribution of Number of Rows (attributes) per Table (Vertical Table)
The complete distribution of number of rows per vertical table can be found here.
The key of the key value pair represents the number of data rows and the value the number of vertical tables having exactly this number of data rows, e.g. the line Key : 8 Value : 38554
means that there are 38,554 tables with exactly 8 rows. Since we know that these tables are vertical, this number corresponds to the number of columns after flipping the table.
4. Header Distribution
In order to get a first impression of the topics of the tables, we applied a simple heuristic for identifying the column headers of each Web table. Our heuristic is based on the Cell Content Pattern, which is defined as a tuple containing a representation of the composition of characters in a cell [Tang2006]. After extracting the content pattern of each cell, the patterns in the current row are compared with those in the following rows. If a row shows different patterns than the rows that follow it, we consider this row the header containing the column names. So far, we only consider two cases: either the first row is the header row or no header row exists. A more sophisticated header unfolding [Chen2013] would be necessary to find, for example, headers that span several rows.
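The idea behind the heuristic can be sketched as follows. This is not the implementation used for the corpus: the character classes, the collapsing of repeated classes, and the rule "the first row is a header if its pattern tuple differs from that of every following row" are all simplifying assumptions, and the function names are hypothetical.

```python
from itertools import groupby

def cell_pattern(cell):
    """Collapse a cell into character classes, e.g. 'Berlin 2015' -> 'A_9'."""
    def char_class(ch):
        if ch.isalpha():
            return "A"
        if ch.isdigit():
            return "9"
        if ch.isspace():
            return "_"
        return "P"  # punctuation and everything else
    # groupby merges runs of the same class into a single symbol
    return "".join(key for key, _ in groupby(char_class(ch) for ch in cell))

def first_row_is_header(rows):
    """Assumed rule: the first row's patterns differ from all following rows."""
    if len(rows) < 2:
        return False
    patterns = [tuple(cell_pattern(cell) for cell in row) for row in rows]
    return all(patterns[0] != other for other in patterns[1:])

table = [["City", "Population"], ["Berlin", "3500000"], ["Paris", "2200000"]]
has_header = first_row_is_header(table)
```

A table whose header and data cells are all plain words yields identical patterns in every row, so this simplified rule would report no header for it.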
In contrast to our previous extraction, we can now deal with headers of vertical tables [Crestan2011] and we know whether a header is present or not (about 20% of all tables do not have a header according to [Pimplikar2012]). We did not take column name synonyms such as 'population' and 'number of inhabitants' into account. The only normalization we apply is to remove a trailing 's' to obtain the singular form of nouns. Thus, the number of different headers can be seen as an upper bound.
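The normalization step is simple enough to show directly. The sketch below assumes that exactly one trailing lowercase 's' is removed and that case is otherwise left untouched; neither detail is confirmed by the text, and the sample headers are invented.

```python
from collections import Counter

def normalize_header(header):
    """Strip a single trailing 's' (assumed: lowercase only, one character)."""
    return header[:-1] if header.endswith("s") else header

headers = ["Title", "Titles", "Price", "Prices", "Address"]
header_counts = Counter(normalize_header(h) for h in headers)
```

The last sample header shows a side effect of such a rule: words that merely end in 's' are clipped as well ('Address' becomes 'Addres').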
With the current approach, we were able to identify a total of 100,420,803 column headers, of which 17,812,609 are distinct. Figure 7 shows popular (useful) column headers together with their number of occurrences. The most frequently used header is the empty string (20,060,263 times), which we exclude from the figure since it does not provide any information about the content of the tables.
Fig. 7 - Popular Column Headers
The complete distribution of headers can be found here.
The key of the key value pair represents the header and the value the number of columns having exactly this header, e.g. the line Key : Title Value : 5934288
means that there are 5,934,288 columns with title as header.
5. Column Data Types Distribution
We used a rough type-guessing algorithm to detect the data type of each table column. First, the data type of each column value was detected, using five pre-defined data types: string, numeric, date, boolean and list. Afterwards, the most frequent data type in the column was chosen as the final data type of the column.
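The two-step procedure (guess a type per value, then take a majority vote per column) might look like the following Python sketch. Only the five type names and the majority vote come from the description above; the detection rules for each type, the date formats, and the function names are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

def guess_value_type(value):
    """Very rough per-value guess; all detection rules here are assumed."""
    text = value.strip()
    if text.lower() in {"true", "false", "yes", "no"}:
        return "boolean"
    try:
        float(text.replace(",", ""))  # tolerate thousands separators
        return "numeric"
    except ValueError:
        pass
    for fmt in ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"):  # assumed date formats
        try:
            datetime.strptime(text, fmt)
            return "date"
        except ValueError:
            continue
    if "," in text or ";" in text:
        return "list"
    return "string"

def guess_column_type(values):
    """Majority vote over the per-value guesses (values must be non-empty)."""
    votes = Counter(guess_value_type(v) for v in values)
    return votes.most_common(1)[0][0]

column_type = guess_column_type(["2015-07-01", "01.02.2014", "unknown"])
```

With a majority vote, a few malformed or missing values do not change the type assigned to the column.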
Figure 8 shows the distribution of column data types.
Fig. 8 - Column Data Types Distribution
6. Context Information
Table 3 provides basic statistics about the context-related data that we additionally extracted. For each table, we extract the 200 words before and after the table. In previous experiments, we found that without additional context or temporal information, it is difficult to further process the tables, e.g. to match them to a knowledge base [Zhang2013]. For roughly a third of the tables (43,998,029 of 139,687,207), we can extract a timestamp which is located after the entity table. In many cases, this timestamp is the imprint of the webpage. The last modified date comes from the HTTP header of the HTML page.
| | #tables |
---|---|
Timestamp Before Entity Table | 15,044,690 |
Timestamp After Entity Table | 43,998,029 |
Last Modified Date | 19,654,944 |
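The extraction of the surrounding words can be sketched as below. The sketch assumes the page text before and after the table is already available as plain strings and that words are separated by whitespace; `context_window` is a hypothetical name, and the actual extraction may tokenize differently.

```python
def context_window(text_before, text_after, size=200):
    """Return the last `size` words before and the first `size` words after a table."""
    # Assumes size > 0 and whitespace-separated words
    before = text_before.split()[-size:]
    after = text_after.split()[:size]
    return before, after

before, after = context_window("a b c d", "e f g", size=2)
```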