This page describes the TDGT Dataset of time-dependent data, useful for evaluating tasks that make use of the time-aspect of data.
The dataset was built to evaluate fusion methods for slot filling time-dependent attributes of a knowledge base from web table data. Data fusion methods, also called truth discovery methods, try to select the correct value from a set of alternative input values [Bleiholder2009]. In time-dependent data, also called long-data [Dong2016], the correctness of a value depends on a point in time or a time range. For any given combination of entity and property, multiple values can be correct, as each reflects a different period or point in time [Oulabi2016]. Evaluating data fusion methods on time-dependent data therefore requires knowing the correct values of a property at different points in time. The TDGT (Time-Dependent Ground Truth) dataset consists of high-quality data describing countries, cities, athletes, companies, politicians and officeholders. All ground-truth data is annotated with temporal meta-information and can thus be used to evaluate data fusion methods for time-dependent data.
1. Dataset Overview
This dataset provides data for a selection of time-dependent attributes in a uniform format. We built it in the context of knowledge base augmentation from web table data. Web tables, which are relational HTML tables extracted from the Web, contain large amounts of valuable temporal data [Lehmberg2016, Zhang2013]. We use this dataset to evaluate fusion methods that augment a temporal knowledge base [Oulabi2017] with new facts from web table data. Unlike snapshot-based knowledge bases such as DBpedia, which try to reflect only the most recent facts, temporal knowledge bases such as Wikidata store time-dependent data as series of timed facts. Similarly, this ground truth provides a variety of time-dependent data as series of timed facts. In previous work [Oulabi2016] we have presented and evaluated a method for slot filling snapshot-based knowledge bases.
The data was consolidated from different sources and covers the following domains:
- Countries and Cities
- Athletes
- Companies
- Politicians and Officeholders
During the creation of this dataset, we ensured that all entities in the dataset are matched to a comprehensive knowledge base, in our case Wikidata. In addition, we aimed at completeness of values per entity, so that one can assume that all values of a given time-dependent attribute are available for a given entity.
2. Entity Classes Statistics
The dataset contains seven entity classes. The table below provides an overview of those classes, including the number of instances, properties and values per class. The table also lists the name of the JSON file for each class.
Class | JSON file name | # of Instances | # of Properties | # of Values | Source |
---|---|---|---|---|---|
Basketball Athlete (mostly NBA) | basketballAthlete.json | 3,781 | 1 | 10,625 | www.basketball-reference.com |
City | city.json | 11,372 | 2 | 40,666 | Wikidata |
Country | country.json | 197 | 7 | 44,013 | WorldBank (World Development Indicators) |
NFL Athlete | nflAthlete.json | 12,756 | 2 | 96,711 | www.footballdb.com |
Politician / Holder of a political office | politician.json | 22,062 | 1 | 36,816 | Wikidata |
Soccer Athlete | soccerAthlete.json | 134,617 | 1 | 778,602 | Wikidata |
Traded Company (NYSE and NASDAQ) | tradedCompany.json | 1,646 | 5 | 73,806 | www.stockrow.com |
3. Time-dependent Properties Overview
The dataset contains 19 time-dependent attributes. The following table provides an overview of those attributes, including their type, datatype and number of values. The table also lists the name of each property in the JSON files.
Class | Property | Name in JSON file | Type | Datatype | # of Values |
---|---|---|---|---|---|
Basketball Athlete | Team | team | Time Range | Reference | 10,625 |
City | Mayor | mayor | Time Range | Reference | 1,742 |
City | Population | population | Point in Time | Number | 38,924 |
Country | Head of Government | headOfGovernment | Time Range | Reference | 1,641 |
Country | Head of State | headOfState | Time Range | Reference | 2,727 |
Country | Memberships | memberOf | Time Range | Reference | 1,540 |
Country | Nominal GDP | gdp | Point in Time | Number | 8,417 |
Country | Nominal GDP per Capita | gdpPerCapita | Point in Time | Number | 8,414 |
Country | Population | population | Point in Time | Number | 10,845 |
Country | Population Density | populationDensity | Point in Time | Number | 10,429 |
NFL Athlete | Sports Number | sportsNumber | Point in Time | Number | 28,379 |
NFL Athlete | Team | team | Point in Time | Reference | 68,332 |
Politician | Position Held | positionHeld | Time Range | Reference | 36,816 |
Soccer Athlete | Team | team | Time Range | Reference | 778,602 |
Traded Company | Earnings before interests and taxes | ebit | Point in Time | Number | 14,945 |
Traded Company | Net Income | netIncome | Point in Time | Number | 14,948 |
Traded Company | Total Assets | totalAssets | Point in Time | Number | 14,585 |
Traded Company | Total Equity | totalEquity | Point in Time | Number | 14,527 |
Traded Company | Total Revenue | totalRevenue | Point in Time | Number | 14,801 |
4. Data Format
The dataset is provided in JSON format, where every entity class is stored in a separate file.
4.1. JSON File Format
Every JSON file is an array of JSON Objects and has the following format:
[
{ENTITY_OBJECT_1},
{ENTITY_OBJECT_2},
{ENTITY_OBJECT_3},
{ENTITY_OBJECT_4},
{ENTITY_OBJECT_5},
......
{ENTITY_OBJECT_97},
{ENTITY_OBJECT_98},
{ENTITY_OBJECT_99}
]
Any {ENTITY_OBJECT_XX} is a JSON Object as described in Section 4.2 below.
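As a minimal sketch of reading one class file in Python, the helper below loads the top-level array and checks its shape. The function name `load_entities` is hypothetical, and UTF-8 encoding is an assumption, since this page does not state the file encoding.

```python
import json
from pathlib import Path


def load_entities(path):
    """Read one class file and return its list of entity objects."""
    # Assumption: the files are UTF-8 encoded.
    with Path(path).open(encoding="utf-8") as handle:
        entities = json.load(handle)
    # Section 4.1 describes the top level as an array of entity objects.
    if not isinstance(entities, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    return entities
```

For example, `load_entities("city.json")` would return the entity objects of the City class, assuming the file sits in the working directory.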
4.2. Entity Object JSON Format
Entity Objects have the following format:
{
"name": "{ENTITY_NAME}",
"wikidataId": "{WIKIDATA_ID}",
"values": [
{VALUE_OBJECT_1},
{VALUE_OBJECT_2},
{VALUE_OBJECT_3},
{VALUE_OBJECT_4},
{VALUE_OBJECT_5},
......
{VALUE_OBJECT_97},
{VALUE_OBJECT_98},
{VALUE_OBJECT_99}
]
}
{ENTITY_NAME}, {WIKIDATA_ID}, {PROPERTY_NAME} are string values. {ENTITY_NAME} is a label of the entity, while {WIKIDATA_ID} is the ID of the entity in Wikidata.
Any {VALUE_OBJECT_XX} is a JSON Object as described in Section 4.3 below.
4.3. Value Object JSON Format
The JSON format of the value object has two variations. Both variations have a string property type, which determines the variation, and a string property propertyName ({PROPERTY_NAME}), which is the name of the property as shown in the "Name in JSON file" column of the table in Section 3 above.
4.3.1. Point-in-Time value
The point-in-time value reflects values that are valid for a given point in time. The type property has the value "point".
{
"propertyName": "{PROPERTY_NAME}",
"point": "{POINT_DATE}",
"type": "point",
"value": {VALUE_TYPE_OBJECT}
}
{POINT_DATE} is a string value and needs to be parsed. It can have one of the following formats: yyyy-mm-dd or yyyy.
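The two date formats can be handled with a small parser. The sketch below is illustrative only: the name `parse_date` is hypothetical, and treating a bare year as the whole calendar year is an assumption of this sketch rather than something the dataset specifies.

```python
from datetime import date


def parse_date(text):
    """Parse 'yyyy-mm-dd' or 'yyyy' into a (first_day, last_day) pair.

    Assumption: a bare four-digit year stands for the whole calendar year.
    """
    if len(text) == 4 and text.isdigit():
        year = int(text)
        return date(year, 1, 1), date(year, 12, 31)
    # Full dates map to a one-day interval.
    day = date.fromisoformat(text)
    return day, day
```

Returning an interval keeps the year-only and full-date cases uniform, so later comparisons do not need to know which format the string had.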
4.3.2. Time-Range value
The time-range value reflects values that are valid for a given time range. The type property has the value "range".
{
"propertyName": "{PROPERTY_NAME}",
"from": "{FROM_DATE}",
"to": "{TO_DATE}",
"type": "range",
"value": {VALUE_TYPE_OBJECT}
}
{FROM_DATE} and {TO_DATE} are both string values and need to be parsed. They can both have one of the following formats: yyyy-mm-dd or yyyy. The to property is optional.
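One way to test whether a time-range value covers a given day is sketched below. The function names are hypothetical. The sketch assumes that a missing to property means the range is still open, that both bounds are inclusive, and that a bare year covers the whole calendar year. None of these interpretations is stated on this page.

```python
from datetime import date


def _first_day(text):
    """Earliest day covered by a 'yyyy' or 'yyyy-mm-dd' string."""
    return date(int(text), 1, 1) if len(text) == 4 else date.fromisoformat(text)


def _last_day(text):
    """Latest day covered by a 'yyyy' or 'yyyy-mm-dd' string."""
    return date(int(text), 12, 31) if len(text) == 4 else date.fromisoformat(text)


def range_covers(value, day):
    """Return True if the time-range value object is valid on the given day."""
    if value.get("type") != "range":
        raise ValueError("not a time-range value")
    if day < _first_day(value["from"]):
        return False
    to = value.get("to")
    # Assumption: a missing "to" is read as an open-ended range.
    return to is None or day <= _last_day(to)
```

With year-only bounds and inclusive comparison, consecutive ranges that share a boundary year both cover days in that year, so a caller may get more than one matching value.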
4.4. Value Type Object
The value type objects are JSON Objects that describe the actual value. They include a string property type that describes the type of the value.
4.4.1. Reference
For the reference type, the value of the type property is simply reference. The reference type has two additional properties: label, which provides a name for the referenced entity, and wikidataId, which provides the ID of the referenced entity in the Wikidata knowledge base. Both properties are string values.
{
"wikidataId": "{WIKIDATA_ID}",
"label": "{LABEL}",
"type": "reference"
}
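A reference value can be turned into a link to the referenced entity. The helper below is a small sketch with a hypothetical name. It assumes the wikidataId is a Wikidata item identifier such as "Q2119", which matches the sample in Section 5, and it uses the public `https://www.wikidata.org/wiki/` page address pattern.

```python
def wikidata_url(reference):
    """Build the Wikidata page URL for a reference value type object."""
    if reference.get("type") != "reference":
        raise ValueError("not a reference value")
    # Assumption: wikidataId holds an item identifier such as "Q2119".
    return "https://www.wikidata.org/wiki/" + reference["wikidataId"]
```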
4.4.2. Number
For the number type, the value of the type property is simply number. There is an additional property amount, which contains the actual number. The amount property is of type string and needs to be parsed.
{
"amount": "{AMOUNT}",
"type": "number"
}
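Because amount is a string, it has to be converted before any arithmetic. The sketch below uses Python's `decimal.Decimal` so that no precision is lost on large or fractional amounts. The function name is hypothetical, and it is an assumption that all amount strings use plain decimal notation that `Decimal` accepts.

```python
from decimal import Decimal


def parse_amount(value_type_object):
    """Convert the string amount of a number value type object to a Decimal."""
    if value_type_object.get("type") != "number":
        raise ValueError("not a number value")
    # Assumption: the string is in a notation Decimal can parse, e.g. "296690".
    return Decimal(value_type_object["amount"])
```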
5. Sample Entity Object
You can download this sample below.
{
"name":"Mannheim",
"wikidataId":"Q2119",
"values":[
{
"from":"2007",
"propertyName":"mayor",
"type":"range",
"value":{
"wikidataId":"Q2076493",
"label":"Peter Kurz",
"type":"reference"
}
},
{
"from":"1983",
"to":"2007",
"propertyName":"mayor",
"type":"range",
"value":{
"wikidataId":"Q1512753",
"label":"Gerhard Widder",
"type":"reference"
}
},
{
"from":"1980",
"to":"1983",
"propertyName":"mayor",
"type":"range",
"value":{
"wikidataId":"Q2575485",
"label":"Wilhelm Varnholt",
"type":"reference"
}
},
{
"point":"2013-12-31",
"propertyName":"population",
"type":"point",
"value":{
"amount":"296690",
"type":"number"
}
},
{
"point":"2012-12-31",
"propertyName":"population",
"type":"point",
"value":{
"amount":"294627",
"type":"number"
}
},
{
"point":"1961",
"propertyName":"population",
"type":"point",
"value":{
"amount":"313890",
"type":"number"
}
},
{
"point":"1962",
"propertyName":"population",
"type":"point",
"value":{
"amount":"318919",
"type":"number"
}
},
{
"point":"1963",
"propertyName":"population",
"type":"point",
"value":{
"amount":"321075",
"type":"number"
}
},
{
"point":"1964",
"propertyName":"population",
"type":"point",
"value":{
"amount":"323444",
"type":"number"
}
},
{
"point":"1965",
"propertyName":"population",
"type":"point",
"value":{
"amount":"328156",
"type":"number"
}
},
{
"point":"1966",
"propertyName":"population",
"type":"point",
"value":{
"amount":"329301",
"type":"number"
}
},
{
"point":"1967",
"propertyName":"population",
"type":"point",
"value":{
"amount":"323744",
"type":"number"
}
},
{
"point":"1968",
"propertyName":"population",
"type":"point",
"value":{
"amount":"326302",
"type":"number"
}
},
{
"point":"1969",
"propertyName":"population",
"type":"point",
"value":{
"amount":"330920",
"type":"number"
}
},
{
"point":"1970",
"propertyName":"population",
"type":"point",
"value":{
"amount":"332163",
"type":"number"
}
},
{
"point":"1971",
"propertyName":"population",
"type":"point",
"value":{
"amount":"330635",
"type":"number"
}
},
{
"point":"1972",
"propertyName":"population",
"type":"point",
"value":{
"amount":"328411",
"type":"number"
}
},
{
"point":"1973",
"propertyName":"population",
"type":"point",
"value":{
"amount":"325386",
"type":"number"
}
},
{
"point":"1974",
"propertyName":"population",
"type":"point",
"value":{
"amount":"320508",
"type":"number"
}
},
{
"point":"1975",
"propertyName":"population",
"type":"point",
"value":{
"amount":"314086",
"type":"number"
}
},
{
"point":"1976",
"propertyName":"population",
"type":"point",
"value":{
"amount":"309059",
"type":"number"
}
},
{
"point":"1977",
"propertyName":"population",
"type":"point",
"value":{
"amount":"305741",
"type":"number"
}
},
{
"point":"1978",
"propertyName":"population",
"type":"point",
"value":{
"amount":"302794",
"type":"number"
}
},
{
"point":"1979",
"propertyName":"population",
"type":"point",
"value":{
"amount":"303247",
"type":"number"
}
},
{
"point":"1980",
"propertyName":"population",
"type":"point",
"value":{
"amount":"304303",
"type":"number"
}
},
{
"point":"1981",
"propertyName":"population",
"type":"point",
"value":{
"amount":"304219",
"type":"number"
}
},
{
"point":"1982",
"propertyName":"population",
"type":"point",
"value":{
"amount":"302621",
"type":"number"
}
},
{
"point":"1983",
"propertyName":"population",
"type":"point",
"value":{
"amount":"298042",
"type":"number"
}
},
{
"point":"1984",
"propertyName":"population",
"type":"point",
"value":{
"amount":"295178",
"type":"number"
}
},
{
"point":"1985",
"propertyName":"population",
"type":"point",
"value":{
"amount":"294984",
"type":"number"
}
},
{
"point":"1986",
"propertyName":"population",
"type":"point",
"value":{
"amount":"294648",
"type":"number"
}
},
{
"point":"1987",
"propertyName":"population",
"type":"point",
"value":{
"amount":"295191",
"type":"number"
}
},
{
"point":"1988",
"propertyName":"population",
"type":"point",
"value":{
"amount":"300468",
"type":"number"
}
},
{
"point":"1989",
"propertyName":"population",
"type":"point",
"value":{
"amount":"305974",
"type":"number"
}
},
{
"point":"1990",
"propertyName":"population",
"type":"point",
"value":{
"amount":"310411",
"type":"number"
}
},
{
"point":"1991",
"propertyName":"population",
"type":"point",
"value":{
"amount":"314685",
"type":"number"
}
},
{
"point":"1992",
"propertyName":"population",
"type":"point",
"value":{
"amount":"318446",
"type":"number"
}
},
{
"point":"1993",
"propertyName":"population",
"type":"point",
"value":{
"amount":"318025",
"type":"number"
}
},
{
"point":"1994",
"propertyName":"population",
"type":"point",
"value":{
"amount":"316223",
"type":"number"
}
},
{
"point":"1995",
"propertyName":"population",
"type":"point",
"value":{
"amount":"311292",
"type":"number"
}
},
{
"point":"1996",
"propertyName":"population",
"type":"point",
"value":{
"amount":"312216",
"type":"number"
}
},
{
"point":"1997",
"propertyName":"population",
"type":"point",
"value":{
"amount":"310475",
"type":"number"
}
},
{
"point":"1998",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308903",
"type":"number"
}
},
{
"point":"1999",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307730",
"type":"number"
}
},
{
"point":"2000",
"propertyName":"population",
"type":"point",
"value":{
"amount":"306729",
"type":"number"
}
},
{
"point":"2001",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308385",
"type":"number"
}
},
{
"point":"2002",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308759",
"type":"number"
}
},
{
"point":"2003",
"propertyName":"population",
"type":"point",
"value":{
"amount":"308353",
"type":"number"
}
},
{
"point":"2004",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307499",
"type":"number"
}
},
{
"point":"2005",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307900",
"type":"number"
}
},
{
"point":"2006",
"propertyName":"population",
"type":"point",
"value":{
"amount":"307914",
"type":"number"
}
},
{
"point":"2007",
"propertyName":"population",
"type":"point",
"value":{
"amount":"309795",
"type":"number"
}
},
{
"point":"2008",
"propertyName":"population",
"type":"point",
"value":{
"amount":"311342",
"type":"number"
}
},
{
"point":"2009",
"propertyName":"population",
"type":"point",
"value":{
"amount":"311969",
"type":"number"
}
},
{
"point":"2010",
"propertyName":"population",
"type":"point",
"value":{
"amount":"313174",
"type":"number"
}
},
{
"point":"2011",
"propertyName":"population",
"type":"point",
"value":{
"amount":"291458",
"type":"number"
}
},
{
"point":"2014",
"propertyName":"population",
"type":"point",
"value":{
"amount":"299844",
"type":"number"
}
}
]
}
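The values array of the sample mixes properties and is not in chronological order, so a consumer will usually group and sort it first. The following sketch shows one way to do that. Both function names are hypothetical, and sorting the date strings lexicographically assumes four-digit years, as in the sample.

```python
from collections import defaultdict


def values_by_property(entity):
    """Group the value objects of one entity by their propertyName."""
    grouped = defaultdict(list)
    for value in entity["values"]:
        grouped[value["propertyName"]].append(value)
    return dict(grouped)


def point_series(entity, property_name):
    """Return (point, amount) string pairs of a number property, sorted by date string."""
    points = [
        (value["point"], value["value"]["amount"])
        for value in entity["values"]
        if value["propertyName"] == property_name and value["type"] == "point"
    ]
    # Assumption: four-digit years, so string order equals chronological order.
    return sorted(points)
```

Applied to the sample above, `point_series(entity, "population")` would list the population values in date order, and `values_by_property(entity)` would have the keys "mayor" and "population".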
6. Download
You can download the dataset here:
7. Feedback
Please send questions and feedback directly to the authors (listed above) or post them in the Web Data Commons Google Group.
8. References
- [Bleiholder2009] Jens Bleiholder and Felix Naumann. 2009. Data Fusion. ACM Computing Surveys 41, 1:1-1:41 (January 2009).
- [Dong2016] Xin Luna Dong, Anastasios Kementsietsidis, and Wang-Chiew Tan. 2016. A Time Machine for Information: Looking Back to Look Forward. SIGMOD Rec. 45, 2 (September 2016).
- [Oulabi2016] Yaser Oulabi, Robert Meusel, and Christian Bizer. 2016. Fusing time-dependent web table data. In Proceedings of the 19th International Workshop on Web and Databases (WebDB '16). ACM, New York, NY, USA, Article 3, 7 pages.
- [Oulabi2017] Yaser Oulabi and Christian Bizer. 2017. Estimating Missing Temporal Meta-Information using Knowledge-Based-Trust. In Proceedings of the 3rd International Workshop on Knowledge Discovery on the WEB (KDWeb '16). CEUR Workshop Proceedings, RWTH: Aachen.
- [Lehmberg2016] Oliver Lehmberg, Dominique Ritze, Robert Meusel, and Christian Bizer. 2016. A Large Public Corpus of Web Tables containing Time and Context Metadata. In Proceedings of the 25th International Conference Companion on World Wide Web (WWW '16 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 75-76.
- [Zhang2013] Meihui Zhang and Kaushik Chakrabarti. 2013. InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 145-156.
Released: 15.07.19