This page describes the T4LTE dataset, a gold standard for the task of long-tail entity extraction from web tables.
Knowledge bases like DBpedia, Wikidata or Yago rely on data that has been extracted from Wikipedia and as a result mostly cover head instances that fulfill the Wikipedia notability criteria [Oulabi2019]. Their coverage of less well-known instances from the long tail is rather low [Dong2014]. As the usefulness of a knowledge base increases with its completeness, adding long-tail instances to a knowledge base is an important task. Web tables [Cafarella2008], which are relational HTML tables extracted from the Web, contain large amounts of structured information covering a wide range of topics and describe very specific long-tail instances. Web tables are thus a promising source of information for the task of augmenting cross-domain knowledge bases.
This dataset provides annotations for a selection of web tables for the task of augmenting the DBpedia knowledge base with new long-tail entities from those web tables. It includes annotations for the number of unique entities that can be derived from these web tables, and which of these entities are new, given the instances that are already covered in DBpedia. Additionally, there are annotations for values and facts that can be generated from the web table data, allowing the evaluation of how well descriptions of new entities were created.
This dataset was used to develop and evaluate a method for augmenting a knowledge base with long-tail entities from web tables [Oulabi2019]. Using this dataset for training, we were able to add 187 thousand new Song entities with 394 thousand facts and 14 thousand new GridironFootballPlayer entities with 44 thousand new facts to DBpedia. In terms of the number of instances, this is an increase of 356 % and 67 % for Song and GridironFootballPlayer respectively [Oulabi2019].
1. Dataset Purpose
The purpose of this dataset is to act as a gold standard for evaluating the extraction of long-tail entities from web tables. It fulfills three tasks:
- Allow measuring the performance of long-tail entity extraction, including recall. By focusing on recall we can ensure that methods retrieve a large number of potential new entities from web tables.
- Allow the automatic evaluation of implemented methods.
- Allow us to train long-tail entity extraction methods.
2. Knowledge Base and Class Selection
We employ DBpedia [Lehmann2015] as the target knowledge base to be extended. It is extracted from Wikipedia and especially Wikipedia infoboxes. As a result, the covered instances are limited to those identified as notable by the Wikipedia community [Oulabi2019]. We use the 2014 release of DBpedia, as this release has been used in related work [Ritze2015, Ritze2016, Oulabi2016, Oulabi2017], and its release date is also closer to the extraction of the web table corpus from which we created this dataset.
From DBpedia we selected three classes for which we built the dataset. This selection was based on four criteria:
- Versatility: the three chosen classes must be from three different first-level classes. The first level classes in DBpedia are Species, Work, Agent and Place.
- Specificity: we preferred classes further down in the DBpedia class hierarchy.
- Potential: using a baseline set expansion approach we measured how many new instances and facts can potentially be added to the knowledge base. Classes with higher numbers were preferred.
- Name conflict likelihood: we utilize the labels of instances in the knowledge base to measure the potential for homonyms given a certain class. Classes with a higher relative occurrence of homonyms were preferred, as those represent classes for which this task is more difficult.
Based on this approach we chose the following three classes: (1) GridironFootballPlayer (GF-Player), (2) Song and (3) Settlement, where the class Song also includes all instances of the class Single.
Given these three classes, we profile the existing entities within the knowledge base. The first table provides the number of instances and facts per class, while the second profiles the properties and their densities. The first table shows that DBpedia already covers tens of thousands of instances for the profiled classes. This could indicate that most of the well-known instances are already covered, which is why we are especially interested in finding instances from the long tail.
Class | Instances | Facts |
---|---|---|
GF-Player | 20,751 | 137,319 |
Song | 52,533 | 315,414 |
Settlement | 468,986 | 1,444,316 |
The following table reveals that the density differs significantly from property to property. We only consider head properties that have a density of at least 30 %.
Only the properties of the class Song have consistently high densities larger than 60 %. The football player class has many properties, but half of them have a density below 50 %. The class Settlement suffers from both a small number of properties and low densities for some of them.
Class | Property | Facts | Density |
---|---|---|---|
GF-Player | birthDate | 20,218 | 97.43 % |
GF-Player | college | 19,281 | 92.92 % |
GF-Player | birthPlace | 17,912 | 86.32 % |
GF-Player | team | 13,349 | 64.33 % |
GF-Player | number | 11,430 | 55.08 % |
GF-Player | position | 11,240 | 54.17 % |
GF-Player | height | 10,059 | 48.47 % |
GF-Player | weight | 10,027 | 48.32 % |
GF-Player | draftYear | 7,947 | 38.30 % |
GF-Player | draftRound | 7,932 | 38.22 % |
GF-Player | draftPick | 7,924 | 38.19 % |
Song | genre | 47,040 | 89.54 % |
Song | musicalArtist | 45,097 | 85.85 % |
Song | recordLabel | 43,053 | 81.95 % |
Song | runtime | 42,035 | 80.02 % |
Song | album | 40,666 | 77.41 % |
Song | writer | 33,942 | 64.61 % |
Song | releaseDate | 31,696 | 60.34 % |
Settlement | country | 433,838 | 92.51 % |
Settlement | isPartOf | 416,454 | 88.80 % |
Settlement | populationTotal | 292,831 | 62.44 % |
Settlement | postalCode | 154,575 | 32.96 % |
Settlement | elevation | 146,618 | 31.26 % |
3. Web Table Corpus
We extract this dataset from the English-language relational tables set of the Web Data Commons 2012 Web Table Corpus. The set consists of 91.8 million tables. The table below gives an overview of the general characteristics of the tables in the corpus. We can see that the majority of tables are rather short, with an average of 10.4 rows and a median of 2, whereas the average and median number of columns are 3.5 and 3. As a result, a table on average describes 10 instances with 30 values, which is likely a sufficient size to be useful for finding new instances and their descriptions. In [Ritze2016] we profiled the potential of the same corpus for the task of slot filling, i.e. finding missing values for existing DBpedia instances.
 | Average | Median | Min | Max |
---|---|---|---|---|
Rows | 10.37 | 2 | 1 | 35,640 |
Columns | 3.48 | 3 | 2 | 713 |
For every table we assume that there is one attribute that contains the labels of the instances described by the rows. The remaining columns contain values, which potentially can be used to generate descriptions according to the knowledge base schema.
For the three evaluated classes, the following table shows the result of matching the web table corpus to existing instances and properties in DBpedia, using the T2K Matching Framework [Ritze2015, Ritze2016]. The first column shows the number of matched tables that have at least one matched attribute column. Rows of those tables were matched directly to existing instances of DBpedia. The second and third columns show how many values were matched to existing instances and how many values remained unmatched. While more values were matched than remained unmatched, the number of unmatched values is still large, especially for the Song class.
Class | Tables | Matched Values | Unmatched Values |
---|---|---|---|
GF-Player | 10,432 | 206,847 | 35,968 |
Song | 58,594 | 1,315,381 | 443,194 |
Settlement | 11,757 | 82,816 | 13,735 |
4. Dataset Creation
In this section we outline the creation of the dataset. This includes how we selected tables from the corpus, what the labeling process is, and which annotations are included in the dataset.
4.1 Web Table Selection
For the gold standard we had to select a certain number of tables per class to annotate. We first matched tables to classes in the knowledge base using the T2K framework [Ritze2015] and then selected the tables for each class separately. To do this, we first divided the instances in the knowledge base into quartiles of popularity using the indegree count based on a dataset of Wikipedia page links [1, 2], and selected three instances per quartile, i.e. 12 per class. For each of these 12 instances, we then looked in the table corpus for the label that co-occurs most often with the label of the selected instance but cannot be matched to the knowledge base, giving us 12 additional "new" labels per class. For both the labels of the 12 knowledge base instances and the 12 additional "new" labels, we extracted up to 15 tables per label, ensuring that few tables are chosen from the same pay-level domain (PLD) and that the tables vary in their types of attributes.
4.2 Labeling Process
Using the table selection method described above, we ended up with 280, 217 and 620 tables for the classes GridironFootballPlayer, Song and Settlement respectively. We did not label all tables, and especially not all rows of these tables, but looked for potentially new entities and entities with potentially conflicting names (homonyms). For those we then created clusters by identifying the rows within the tables that describe the same real-world entity. From these row clusters entities can be created and added to the knowledge base. For each cluster we then identified whether the entity already exists in DBpedia or is a new entity that can be added to DBpedia. For existing entities, we also added a correspondence to the URI of the entity in DBpedia.
For all web tables from which we created row clusters, we matched the table columns to the properties of the knowledge base. These property correspondences allow us to identify how many candidate values exist for a certain combination of entity and property. Finally, we also annotated facts for all clusters, i.e. the correct values for certain properties. We only annotated facts for properties for which a candidate value exists among the table data, i.e. we only annotated the correct fact for a property if the cluster contains a row from a table in which one column was matched to that property. We also annotated whether the value of the correct fact was present among the values in the web tables.
When labeling rows, we aimed to label interesting row clusters first. As a result, most tables have only a small number of rows labeled. This does not apply to columns: whenever we label one row of a table, we always label all of its columns.
Finally, for the class Song, we include additional row clusters for existing entities for learning purposes only. These clusters are not fully labeled, as they are missing the fact annotations.
4.3 Annotation Types
The dataset contains various annotation types, which are all described in the table below.
Annotation Type | Description | Format |
---|---|---|
Table-To-Class Annotation | All tables included in the dataset are matched to one of the three classes we chose to evaluate. | The tables are placed in separate folders per class. |
Row-To-Instance Annotation | For a selection of table rows, we annotate which instance, existing or new, each row belongs to. If the instance described by the row already exists in DBpedia, the annotation corresponds to the entity URI of that instance in DBpedia. Otherwise we generate a new random URI, but keep the prefix of DBpedia entity URIs. All rows matched to the same instance form a row cluster. | CSV file format (see 7.3) |
New Instance Annotation | We provide the list of entity URIs that we created to describe new instances that do not yet exist in DBpedia. | LST file format (see 7.6) |
Attribute-To-Property Annotation | Given the columns of a web table, we annotate which of these columns describe information that corresponds to a property in the schema of the knowledge base. | CSV file format (see 7.2) |
Fact Annotations | Given row clusters and attribute-to-property correspondences, we can determine for each entity, existing or new, described in the dataset, for which triples we have candidate values in the web tables. For these triples, we annotate the correct facts to allow for evaluation. We additionally annotate whether the correct fact is present among the candidate values within the web table data. | CSV file format (see 7.5) |
5. Dataset Statistics
The following table provides an overview of the number of annotations in the dataset. The first three columns show the number of table, attribute and row annotations. On average, we have 1.85 attribute annotations per table, not counting the label attribute. The two following columns show the number of annotated clusters, followed by the number of values within those clusters that match a knowledge base property. Overall we annotated 266 clusters, of which 103 are new; each cluster has on average 3.63 rows and 7.85 matched values. The last two columns show the number of unique facts that can be derived for those clusters and the number of facts for which a correct value is present. Per cluster we can derive on average 3.17 facts, and for 92 % of them the correct value is present.
Class | Tables | Attributes | Rows | Existing Clusters | New Clusters | Matched Values | Facts | Correct Value Present |
---|---|---|---|---|---|---|---|---|
GF-Player | 192 | 572 | 358 | 80 | 17 | 1,177 | 460 | 436 |
Song | 152 | 248 | 195 | 34 | 63 | 428 | 231 | 212 |
Settlement | 188 | 162 | 413 | 51 | 23 | 487 | 152 | 124 |
Sum | 532 | 982 | 966 | 165 | 103 | 2,092 | 843 | 772 |
The number of row clusters describing existing entities is low for Song, especially when compared to the number of clusters describing new entities. This is not relevant for evaluation purposes, but for learning, more training examples for existing entities might be required. We therefore additionally include 15 existing entities for learning purposes only. Unlike for the other existing entities, we did not annotate any facts for them. For these entities we also include 17 additional tables for the class Song, so that overall 169 tables for the class Song are included in the dataset.
The following three tables show the distribution of properties among the matched values and facts annotated in the dataset, per class. We notice that for all three classes there are clear head properties, for which many more values were matched than for the remaining properties. We also find that for some properties we have barely any matched values. The number of such properties is especially high for the class Settlement.
GridironFootballPlayer | Matched Values | Facts | Correct Value Present |
---|---|---|---|
http://dbpedia.org/ontology/birthDate | 50 | 37 | 33 |
http://dbpedia.org/ontology/birthPlace | 5 | 5 | 3 |
http://dbpedia.org/ontology/college | 246 | 82 | 81 |
http://dbpedia.org/ontology/draftPick | 48 | 20 | 20 |
http://dbpedia.org/ontology/draftRound | 10 | 10 | 10 |
http://dbpedia.org/ontology/draftYear | 20 | 10 | 10 |
http://dbpedia.org/ontology/height | 134 | 61 | 58 |
http://dbpedia.org/ontology/highschool | 4 | 4 | 4 |
http://dbpedia.org/ontology/number | 72 | 40 | 37 |
http://dbpedia.org/ontology/Person/weight | 141 | 63 | 56 |
http://dbpedia.org/ontology/position | 269 | 78 | 74 |
http://dbpedia.org/ontology/team | 178 | 50 | 50 |
Song | Matched Values | Facts | Correct Value Present |
---|---|---|---|
http://dbpedia.org/ontology/album | 98 | 54 | 50 |
http://dbpedia.org/ontology/bSide | 1 | 1 | 1 |
http://dbpedia.org/ontology/genre | 9 | 7 | 6 |
http://dbpedia.org/ontology/musicalArtist | 167 | 64 | 64 |
http://dbpedia.org/ontology/producer | 1 | 1 | 1 |
http://dbpedia.org/ontology/recordLabel | 16 | 10 | 8 |
http://dbpedia.org/ontology/releaseDate | 53 | 38 | 33 |
http://dbpedia.org/ontology/runtime | 78 | 52 | 45 |
http://dbpedia.org/ontology/writer | 5 | 4 | 4 |
Settlement | Matched Values | Facts | Correct Value Present |
---|---|---|---|
http://dbpedia.org/ontology/area | 3 | 3 | 0 |
http://dbpedia.org/ontology/continent | 2 | 2 | 2 |
http://dbpedia.org/ontology/country | 156 | 22 | 22 |
http://dbpedia.org/ontology/elevation | 4 | 2 | 0 |
http://dbpedia.org/ontology/isPartOf | 158 | 50 | 50 |
http://dbpedia.org/ontology/populationDensity | 3 | 3 | 0 |
http://dbpedia.org/ontology/populationMetro | 8 | 7 | 0 |
http://dbpedia.org/ontology/populationTotal | 30 | 21 | 10 |
http://dbpedia.org/ontology/postalCode | 100 | 22 | 22 |
http://dbpedia.org/ontology/utcOffset | 5 | 4 | 4 |
http://www.w3.org/2003/01/geo/wgs84_pos#long | 9 | 8 | 6 |
http://www.w3.org/2003/01/geo/wgs84_pos#lat | 9 | 8 | 8 |
6. Cross-Validation Splits
We use the gold standard for learning and testing. For this, we split the data into three folds and performed cross-validation in our research [Oulabi2019]. To allow for comparable results, we provide the exact folds used in our work.
We split by cluster, so that the rows of one cluster are always fully included in one fold. We ensured that both new clusters and homonym groups are split evenly across folds. A homonym group is a group of clusters with highly similar labels. All clusters of a homonym group were always placed in the same fold.
Class | Fold | Clusters | New | Homonym Groups | Clusters in Homonym Groups | Rows |
---|---|---|---|---|---|---|
GridironFootballPlayer | All | 97 | 17 | 10 | 21 | 358 |
GridironFootballPlayer | 0 | 31 | 5 | 3 | 6 | 126 |
GridironFootballPlayer | 1 | 33 | 5 | 4 | 8 | 118 |
GridironFootballPlayer | 2 | 33 | 7 | 3 | 7 | 114 |
Song | All | 97 | 63 | 20 | 65 | 195 |
Song | 0 | 32 | 18 | 6 | 21 | 51 |
Song | 1 | 34 | 24 | 7 | 27 | 86 |
Song | 2 | 31 | 21 | 7 | 17 | 58 |
Settlement | All | 74 | 23 | 14 | 31 | 413 |
Settlement | 0 | 26 | 7 | 5 | 11 | 150 |
Settlement | 1 | 24 | 9 | 4 | 9 | 106 |
Settlement | 2 | 24 | 7 | 5 | 11 | 157 |
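The fold lists described in Section 7.6 can be used directly to reproduce this split. Below is a minimal sketch, assuming the per-class directory layout from Section 7; the helper name and paths are illustrative and not part of the dataset.

```python
# Sketch: load the three provided folds for one class and use fold 0 as the test split.
import os

def load_folds(class_dir, n_folds=3):
    """Return one set of entity URIs per cross-validation fold."""
    folds = []
    for i in range(n_folds):
        path = os.path.join(class_dir, "fold{}.lst".format(i))
        with open(path, encoding="utf-8") as f:
            folds.append({line.strip() for line in f if line.strip()})
    return folds

folds = load_folds("Settlement")
test_uris = folds[0]                 # evaluate on fold 0
train_uris = folds[1] | folds[2]     # train on the remaining folds
```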
7. Structure and File Formats
The dataset contains broadly three different file formats:
- CSV: used for row mappings, attribute mappings, facts and referenced entities
- LST: used to list e.g. tables and entities
- JSON: used to describe the whole table
All files are encoded using UTF-8. All CSV files have no headers, use commas as separators and double quotation marks as quotation characters. In LST files, each line corresponds to one entry in the list. No quotation or separation characters are used in LST files.
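As an illustration of these conventions, the following sketch shows one possible way to read the two plain-text formats; the function names are our own and not part of the dataset.

```python
# Sketch: readers for the header-less, comma-separated, double-quoted CSV files
# and for the line-per-entry LST files. All files are UTF-8 encoded.
import csv

def read_csv_rows(path):
    """Yield each row of a T4LTE CSV file as a list of string values."""
    with open(path, encoding="utf-8", newline="") as f:
        yield from csv.reader(f, delimiter=",", quotechar='"')

def read_lst(path):
    """Return the entries of an LST file, one entry per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```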
7.1 Dataset Directory Structure
The gold standard is split into three separate folders, one per knowledge base class. These folders have the following structure:
CLASS_NAME (GridironFootballPlayer, Song, Settlement)
│
├─ attributeMapping
│ ├─ table1.csv
│ ├─ table2.csv
│ ├─ table3.csv
│ ├─ ...
│ └─ ...
│
├─ rowMapping
│ ├─ table1.csv
│ ├─ table2.csv
│ ├─ table3.csv
│ ├─ ...
│ └─ ...
│
├─ tables
│ ├─ table1.json
│ ├─ table2.json
│ ├─ table3.json
│ ├─ ...
│ └─ ...
│
├─ facts.csv
├─ fold0.lst
├─ fold1.lst
├─ fold2.lst
├─ forLearning.lst (Song only)
├─ newInstances.lst
├─ referencedEntities.csv (Song only)
└─ tableList.lst
7.2 Attribute Mapping CSV Format
The attribute mapping consists of files that describe correspondences between columns of tables included in the dataset and properties of the knowledge base. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.
Each row of this file describes two values. The first value contains the web table column number, while the second contains the mapped DBpedia property. The first column of a table has the number 0.
Example: Song/attributeMapping/1346981172250_1346981783559_593.arc5217493555668914181#52592088_6_1764789873557608114.csv
"0","http://dbpedia.org/ontology/musicalArtist" "3","http://dbpedia.org/ontology/releaseDate" "4","http://dbpedia.org/ontology/recordLabel" "6","http://dbpedia.org/ontology/genre"
7.3 Row Mappings CSV Format
The row mapping consists of files that describe which table rows correspond to which entity URI. For each table we include one CSV file, where the name of the table corresponds to the name of the file without the ".csv" extension.
Each row of this file describes two values. The first value contains the web table row number, while the second contains the full URI of the entity. The first row of the table, which is very often the header row, has the number 0.
Example: GridironFootballPlayer/rowMapping/1346876860779_1346961276986_5601.arc3719795019286941883#45451800_0_7520172109909715831.csv
"28","http://dbpedia.org/resource/Andrew_Sweat" "32","http://dbpedia.org/resource/Ameet_Pall" "46","http://dbpedia.org/resource/Jerrell_Wedge-00869aeb-d468-46fc-8a33-e11e6b771730" "50","http://dbpedia.org/resource/Chris Donald-248fa1b2-6061-4e39-b394-4b0717de75b4" "35","http://dbpedia.org/resource/shelly_lyons-c026bb63-4fa2-11e8-9b01-1d14cf16e545" "24","http://dbpedia.org/resource/Brandon_Marshall_(linebacker)" "23","http://dbpedia.org/resource/Jerrell_Harris"
7.4 Table JSON Format
The JSON files within the tables folder fully describe the individual tables included in the dataset, including rows that were not annotated as part of the dataset. The JSON format is illustrated by the example below. These tables can also be found in the web table corpus linked above.
Two properties are important. First, the relation property describes the actual content of the table. It is an array of arrays, where the outer array contains the columns of the table and each inner array contains all rows of that column. The second important property is keyColumnIndex, which specifies which column is the key column of the table and is therefore linked to the label property of the knowledge base.
Example: GridironFootballPlayer/tables/1346823846150_1346837956103_5045.arc6474871151262925852#91994528_4_1071800122125102457.json
{ "hasKeyColumn":true, "headerRowIndex":0, "keyColumnIndex":0, "pld":"draftboardinsider.com", "url":"http://www.draftboardinsider.com/ncaateams/sec/auburn.shtml", "relation":[ [ "player", "ben grubbs", "kenny irons", "will herring", "david irons", "courtney taylor" ], [ "pos", "og", "rb", "olb", "cb", "wr" ], [ "pick", "29", "49", "161", "194", "197" ], [ "nfl team", "baltimore ravens", "cincinnati bengals", "seattle seahawks", "atlanta falcons", "seattle seahawks" ] ] }
7.5 Facts CSV Format
Each line in the facts file describes one individual annotated fact. Per line, there are four values. The first contains the URI of the entity, while the second contains the URI of the property. The third contains the annotated fact, while the last is a boolean flag indicating whether the correct value of the fact is present among the values found in the web table data, where "true" and "false" correspond to present and not present respectively.
While for most facts there is only one correct value, for some there can be multiple correct values. Multiple values are separated by a simple |, and values need to be split accordingly when using the dataset.
Parsing the first two and the last values is straightforward; for the actual fact annotation, the parsing depends on the fact's data type. The table below provides parsing instructions:
Data-Type | Description | Format | Example |
---|---|---|---|
Date | A date, either with year or day granularity | yyyy OR yyyy-mm-dd | 2000, 2012-04-20 |
Reference | DBpedia URI (needs to be prefixed with http://dbpedia.org/resource/) | No parsing required | Nina_Simone |
String | Literal string | No parsing required | FH3312 |
Integer | Integer number | No parsing required | 21 |
Decimal | Mixed decimal number; some values are not mixed and are simple integers | I.F OR I | 187.96, 88.45051, 4144 |
Signed Decimal | Mixed decimal number with a sign | ±I.F | +1.0, -4.5333333 |
Runtime | Runtime in minutes and seconds | m:ss | 4:01, 5:13 |
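A possible way to parse the annotated fact values according to these data types is sketched below; the helper names are our own, and year-granularity dates are kept as a one-element tuple rather than expanded to a full date.

```python
# Sketch: split multi-valued facts and parse Date and Runtime values.
def split_values(raw):
    """Facts with multiple correct values are separated by '|'."""
    return raw.split("|")

def parse_date(value):
    """'2000' (year granularity) or '2012-04-20' (day granularity)."""
    return tuple(int(part) for part in value.split("-"))

def parse_runtime(value):
    """Minutes and seconds, e.g. '4:01', converted to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)

# split_values("Defensive_end|Linebacker")  ->  ['Defensive_end', 'Linebacker']
# parse_date("2012-04-20")                  ->  (2012, 4, 20)
# parse_runtime("6:19")                     ->  379
```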
The following table maps the properties to the data types above. Additionally, we provide notes for some properties where applicable.
Class | Property | Data-Type | Note |
---|---|---|---|
GridironFootballPlayer | birthDate | Date | |
GridironFootballPlayer | birthPlace | Reference | |
GridironFootballPlayer | college | Reference | |
GridironFootballPlayer | draftPick | Integer | |
GridironFootballPlayer | draftRound | Integer | |
GridironFootballPlayer | draftYear | Date | |
GridironFootballPlayer | height | Decimal | We record height in centimeters, while DBpedia records height in meters, so that a conversion is necessary. Also, all tables exclusively record height in feet and inches. |
GridironFootballPlayer | highschool | Reference | |
GridironFootballPlayer | number | Integer | |
GridironFootballPlayer | Person/weight | Decimal | We record weight in kg, and so does DBpedia. All tables exclusively record weight in pounds. |
GridironFootballPlayer | position | Reference | |
GridironFootballPlayer | team | Reference | |
Song | album | Reference | |
Song | bSide | Reference | |
Song | genre | Reference | |
Song | musicalArtist | Reference | |
Song | producer | Reference | |
Song | recordLabel | Reference | |
Song | releaseDate | Date | |
Song | runtime | Runtime | DBpedia records runtime in seconds as a simple numeric property, while we record it as time in minutes and seconds. As a result, a conversion is necessary. |
Song | writer | Reference | |
Settlement | area | Decimal | |
Settlement | continent | Reference | |
Settlement | country | Reference | |
Settlement | elevation | Decimal | |
Settlement | isPartOf | Reference | |
Settlement | populationDensity | Decimal | |
Settlement | populationMetro | Decimal | |
Settlement | populationTotal | Decimal | |
Settlement | postalCode | String | |
Settlement | utcOffset | Signed Decimal | |
Settlement | wgs84_pos#long | Signed Decimal | |
Settlement | wgs84_pos#lat | Signed Decimal |
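The notes above imply a few unit conversions when comparing web table values to the annotated facts. The sketch below uses standard conversion factors; the helper names are assumptions for illustration.

```python
# Sketch: unit conversions for height (feet/inches -> cm, cm -> m for DBpedia)
# and weight (pounds -> kg).
def feet_inches_to_cm(feet, inches=0):
    return (feet * 12 + inches) * 2.54

def pounds_to_kg(pounds):
    return pounds * 0.45359237

def cm_to_m(cm):
    return cm / 100.0

# feet_inches_to_cm(6, 2)  ->  187.96      (cf. the height fact example below)
# pounds_to_kg(195)        ->  ~88.45051   (cf. the Person/weight fact example below)
```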
Below you will find some examples of the facts CSV file for all three classes.
Example: GridironFootballPlayer/facts.csv
"http://dbpedia.org/resource/Al_Harris_(defensive_lineman)","http://dbpedia.org/ontology/position","Defensive_end|Linebacker","true" "http://dbpedia.org/resource/Allen_Reisner","http://dbpedia.org/ontology/birthDate","1988-09-29","true" "http://dbpedia.org/resource/Mike_Bell_(defensive_lineman)","http://dbpedia.org/ontology/college","Colorado_State_Rams","true" "http://dbpedia.org/resource/Andre_Roberts_(American_football)","http://dbpedia.org/ontology/Person/weight","88.45051","true" "http://dbpedia.org/resource/louis_nzegwu-90617f1a-1dbf-48c0-ae52-dfb4cb5043ab","http://dbpedia.org/ontology/team","Atlanta_Falcons","true" "http://dbpedia.org/resource/Donald_Jones_(American_football)","http://dbpedia.org/ontology/birthDate","1987-12-17","true" "http://dbpedia.org/resource/Mike_Williams_(wide_receiver,_born_1987)","http://dbpedia.org/ontology/height","187.96","true" "http://dbpedia.org/resource/Anquan_Boldin","http://dbpedia.org/ontology/team","Arizona_Cardinals|Baltimore_Ravens|San_Francisco_49ers|Detroit_Lions|Buffalo_Bills","true" "http://dbpedia.org/resource/Al_Harris_(defensive_lineman)","http://dbpedia.org/ontology/team","Chicago_Bears","true" "http://dbpedia.org/resource/Mike_Williams_(wide_receiver,_born_1984)","http://dbpedia.org/ontology/draftPick","10","true"
...
Example: Song/facts.csv
"http://dbpedia.org/resource/rhythm_of_life-17f821d8-8424-49b9-ad35-aae0094d475c","http://dbpedia.org/ontology/musicalArtist","U96","true" "http://dbpedia.org/resource/Men's_Needs","http://dbpedia.org/ontology/musicalArtist","The_Cribs","true" "http://dbpedia.org/resource/The_Lemon_Song","http://dbpedia.org/ontology/runtime","6:19","true" "http://dbpedia.org/resource/lemon_tree-c4ef5525-b118-4ed9-8206-2d64b91a0b89","http://dbpedia.org/ontology/album","The_Very_Best_of_Peter,_Paul_and_Mary-f3c362a9-764f-45b2-a3b4-dc32f71c8902","true" "http://dbpedia.org/resource/Seek_&_Destroy","http://dbpedia.org/ontology/album","Kill_%27Em_All","true" "http://dbpedia.org/resource/I'm_Ready_for_Love","http://dbpedia.org/ontology/musicalArtist","Martha_and_the_Vandellas","true" "http://dbpedia.org/resource/Something_About_You-646ffccf-6fb9-4279-9b84-eb582b959388","http://dbpedia.org/ontology/album","Aliens_&_Rainbows","true" "http://dbpedia.org/resource/Beautiful_(Mai_Kuraki_song)","http://dbpedia.org/ontology/releaseDate","2009-06-10","true" "http://dbpedia.org/resource/Lemon_(song)","http://dbpedia.org/ontology/releaseDate","1993","true" ...
Example: Settlement/facts.csv
"http://dbpedia.org/resource/Beijing","http://www.w3.org/2003/01/geo/wgs84_pos#long","116.383333","true" "http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/area","1285","false" "http://dbpedia.org/resource/Bakar","http://dbpedia.org/ontology/country","Croatia","true" "http://dbpedia.org/resource/Rome","http://dbpedia.org/ontology/utcOffset","+1.0","true" "http://dbpedia.org/resource/arriondas-a4a0fc90-8a84-11e8-82e1-3fad23f94135","http://dbpedia.org/ontology/country","Spain","true" "http://dbpedia.org/resource/Bwaga_Cheti","http://www.w3.org/2003/01/geo/wgs84_pos#lat","-4.5333333","true" "http://dbpedia.org/resource/belec-a4a20efb-8a84-11e8-82e1-55488444b4bb","http://dbpedia.org/ontology/postalCode","49254","true" "http://dbpedia.org/resource/Bakarac","http://dbpedia.org/ontology/isPartOf","Primorje-Gorski_Kotar_County","true" ...
7.6 Table, New Instance, Fold and For-Learning Lists
We use the list format for a number of different purposes:
- For the cross-validation folds, we provide a list of the entities in each fold.
- We provide a list of all tables per class.
- We provide a list of entity URIs of new entities.
- For the class Song we additionally list the 15 existing entities which are to be used for learning purposes only.
These list files have the extension .lst. Each line of the file is another entry in the list. There are no quoting characters.
Example: Settlement/fold0.lst
http://dbpedia.org/resource/Aurel,_Vaucluse
http://dbpedia.org/resource/Stamford,_Lincolnshire
http://dbpedia.org/resource/Bwaga_Cheti
http://dbpedia.org/resource/kalakho-f889b410-4fa2-11e8-9b01-4b7e9cb868f5
http://dbpedia.org/resource/Belica,_Me%C4%91imurje_County
http://dbpedia.org/resource/burgo_ranero_(el)-a4a1e6e3-8a84-11e8-82e1-f765c0073ff5
http://dbpedia.org/resource/Parys
http://dbpedia.org/resource/Bakar
http://dbpedia.org/resource/Beli_Manastir
http://dbpedia.org/resource/Chaville
http://dbpedia.org/resource/Bonyunyu
...
Example: Song/tableList.lst
1346981172231_1347009637666_1623.arc2035420803784423551#8157737_1_7371407078293434892
1346981172137_1346990327917_1687.arc8790234217643537183#29324093_0_4104648016207008655
1350433107059_1350464532277_262.arc3274150641837087721#45894422_0_4048022465851316720
1346876860798_1346941853400_2953.arc2527404313287902461#59379225_0_6616355908335718299
1346876860596_1346938127566_2223.arc3753870089959127664#66546593_0_554336699268001312
1346876860493_1346903186037_233.arc7106100585551357027#48571984_0_6773473850340800215
1350433107058_1350501694041_732.arc1602187029723264891#37777045_0_5887313017136165099
1346981172186_1346995420474_2788.arc8792188372387791527#96007354_0_5596511497072590105
1346876860840_1346953273866_1315.arc7476207801019051251#90975928_0_7687754714967118394
1346981172231_1347010609872_3515.arc2403866143077224377#1524220_0_7599368370767966283
1346876860596_1346938903566_3154.arc4273244386981436402#87484324_1_1397156714755041772
1346876860611_1346928074552_1536.arc941312314634454173#7668872_0_4460134851750954295
1346876860807_1346934722734_147.arc4322826721152635511#74438125_0_3796119154304144126
1346981172155_1347002310061_2451.arc2996742124595566891#69073711_0_6127538729831462210
1346981172239_1346995950584_63.arc1630314548530234317#65396150_0_5145880845606151839
...
Example: GridironFootballPlayer/newInstances.lst
http://dbpedia.org/resource/shelly_lyons-c026bb63-4fa2-11e8-9b01-1d14cf16e545
http://dbpedia.org/resource/mike_ball-86931519-f533-4e99-b1be-b45fb805e7e5
http://dbpedia.org/resource/michael_vandermeulen-c020ef8f-4fa2-11e8-9b01-c3c34fe696e5
http://dbpedia.org/resource/Chris Donald-248fa1b2-6061-4e39-b394-4b0717de75b4
http://dbpedia.org/resource/james_carmon-c02869e4-4fa2-11e8-9b01-81f205eb29eb
http://dbpedia.org/resource/alvin_mitchell-33d17a8e-ed7d-433e-9728-3e1c26658a6a
http://dbpedia.org/resource/ben_buchanan-c0289116-4fa2-11e8-9b01-fdd98ba7cc66
http://dbpedia.org/resource/merritt_kersey-c025d110-4fa2-11e8-9b01-65117e7d5512
http://dbpedia.org/resource/aderious_simmoms-c023ae8d-4fa2-11e8-9b01-89f1d0e15a08
...
7.7 Referenced Entities
For the class Song, there exist reference facts whose objects are entities that do not exist in DBpedia, i.e. they are long-tail entities themselves. We provide these additional entities in a separate file. The file is especially useful, as it provides both the labels and the classes of those referenced entities.
For this file we again use the CSV file format, with three values per line. The first is the URI of the referenced entity, the second its label, and the third its class.
Example: Song/referencedEntities.csv
"http://dbpedia.org/resource/Shelley_Laine-8ee3f06c-a68a-495b-a788-5a4473e39384","Shelley Laine","MusicalArtist" "http://dbpedia.org/resource/Skipping_Stones-b3ff0194-6a65-4835-8635-f02ba6d58e3d","Skipping Stones","Album" "http://dbpedia.org/resource/Best_Of_1991-2001-ab96f5ab-730a-48a2-a0b9-e3275993bf07","Best Of 1991-2001","Album" "http://dbpedia.org/resource/Terry_Steele-67a5c337-4b29-497e-9852-28ae222c7bfd","Terry Steele","Writer" "http://dbpedia.org/resource/David_L._Elliott-b11a4b31-8e01-4c06-8847-af359577525a","David L. Elliott","Writer" "http://dbpedia.org/resource/Anthology:_1965-1972-013145ea-382f-448e-ae7f-dfd3d959a2e0","Anthology: 1965-1972","Album" "http://dbpedia.org/resource/Bangarang_(EP)-b3bf5bc8-454a-47b6-9b4a-8b8b1b53f728","Bangarang_(EP)","Album" "http://dbpedia.org/resource/Old_School_New Rules-34473575-5c29-4048-8435-f96717404db7","Old School New Rules","Album" "http://dbpedia.org/resource/Wanna-d7a7259b-da78-41cd-bcac-bcd51ff040f2","Wanna","MusicalWork"
...
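A small sketch for working with these reference facts: load referencedEntities.csv into a lookup from URI to label and class. The function name and path are illustrative.

```python
# Sketch: lookup table for referenced long-tail entities of the class Song.
import csv

def load_referenced_entities(path="Song/referencedEntities.csv"):
    lookup = {}
    with open(path, encoding="utf-8", newline="") as f:
        for uri, label, entity_class in csv.reader(f):
            lookup[uri] = (label, entity_class)
    return lookup
```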
8. Download
You can download the dataset here: T4LTE.zip
9. Feedback
Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.
10. References
- [Cafarella2008] Cafarella, Michael J. and Halevy, Alon Y. and Zhang, Yang and Wang, Daisy Zhe and Wu, Eugene (2008), "Uncovering the Relational Web", In WebDB '08.
- [Dong2014] Dong, Xin and Gabrilovich, Evgeniy and Heitz, Geremy and Horn, Wilko and Lao, Ni and Murphy, Kevin and Strohmann, Thomas and Sun, Shaohua and Zhang, Wei (2014), "Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion", In KDD '14.
- [Lehmann2015] Lehmann, Jens and Isele, Robert and Jakob, Max and Jentzsch, Anja and Kontokostas, Dimitris and Mendes, Pablo N and Hellmann, Sebastian and Morsey, Mohamed and Van Kleef, Patrick and Auer, Sören and others (2015), "DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia", Semantic Web. Vol. 6(2), pp. 167-195. IOS Press.
- [Oulabi2016] Oulabi, Yaser and Meusel, Robert and Bizer, Christian (2016), "Fusing Time-dependent Web Table Data", In WebDB '16.
- [Oulabi2017] Oulabi, Yaser and Bizer, Christian (2017), "Estimating missing temporal meta-information using Knowledge-Based-Trust", In KDWeb '17.
- [Oulabi2019] Oulabi, Yaser and Bizer, Christian (2019), "Extending Cross-Domain Knowledge Bases with Long Tail Entities using Web Table Data", In EDBT '19.
- [Ritze2015] Ritze, Dominique and Lehmberg, Oliver and Bizer, Christian (2015), "Matching HTML Tables to DBpedia", In WIMS '15.
- [Ritze2016] Ritze, Dominique and Lehmberg, Oliver and Oulabi, Yaser and Bizer, Christian (2016), "Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases", In WWW '16.
Released: 15.07.19