Class-Specific Subsets of the Schema.org Data contained in the October 2022 Corpus

This page provides access to and statistics about class-specific subsets of the Schema.org data contained in the October 2022 version of the Web Data Commons Microdata and JSON-LD corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.

Introduction

As many users are only interested in specific types of Schema.org data (like product data, event data, job postings, or data describing local businesses), we have created class-specific subsets out of the complete and merged Microdata and JSON-LD corpora for a selection of Schema.org classes. The subsets contain all instances of a specific class in either format as well as all other data found on the webpages containing these instances. For example, a page containing data about a product might also contain reviews and offers for this product; a page containing data about an event might also contain data about the location of the event and the persons involved in the event. The data is represented in N-Quads format, meaning that the fourth element of each quad contains the URL of the webpage from which the data was extracted. To facilitate download of and access to the class-specific data, we provide the Schema.org subsets in chunks. Each chunk contains the quads of specific pay-level domains (PLDs), i.e. all quads of one PLD, e.g. yummly.com, are stored in the same chunk file. Additionally, we provide lookup files containing the mappings between PLDs and their corresponding chunks as well as CSV files with PLD-specific statistics.
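
To illustrate the quad layout described above, the following minimal Python sketch reads N-Quads lines, extracts the fourth element (the page URL), and counts quads per host. It is not the code used to build the subsets. The regular expression covers only the common cases of IRIs, blank nodes, and quoted literals; the sample URLs are hypothetical; and the host name is used as a simple stand-in for the PLD, since deriving a true PLD requires a public suffix list.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Simplified N-Quads pattern (an assumption, not the full grammar):
# subject = IRI or blank node, predicate = IRI,
# object = IRI, blank node, or quoted literal with optional language tag/datatype,
# graph label = IRI (the URL of the page the data was extracted from).
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'
    r'(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'
    r'(<[^>]*>)\s*\.\s*$'
)


def parse_quad(line):
    """Return (subject, predicate, object, page_url), or None if the line does not match."""
    m = QUAD_RE.match(line)
    if m is None:
        return None
    subject, predicate, obj, graph = m.groups()
    return subject, predicate, obj, graph[1:-1]  # strip the angle brackets


def quads_per_host(lines):
    """Count quads per host name of the page URL (host used as a rough PLD stand-in)."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[urlsplit(quad[3]).hostname] += 1
    return counts


# Hypothetical sample quads, for illustration only.
sample = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Recipe> <https://www.yummly.com/recipe/example> .',
    '_:b0 <http://schema.org/name> "Pancakes \\"deluxe\\""@en <https://www.yummly.com/recipe/example> .',
    '_:b1 <http://schema.org/name> "Something" <http://example.org/page> .',
]

host_counts = quads_per_host(sample)
```

For real chunk files, the same functions can be applied line by line to an open file handle, so a chunk never has to be loaded into memory as a whole.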

Please note that:

You are welcome to use the datasets and to tell us about your findings. If you find our datasets useful for your research, please cite the poster: The Web Data Commons Schema.org Data Set Series by Alexander Brinkmann, Anna Primpeli and Christian Bizer in Companion Proceedings of the ACM Web Conference 2023 (WWW ’23 Companion), Austin, Texas, USA, April 2023.

Class-Specific Subsets of the Schema.org Data

Schema.org Subset
General Stats
Related Classes
Size (# Files)
Download (Sample)
PLD to File look-up
PLD Specific Stats
AdministrativeArea Quads: 77,960,962
URLs: 381,624
Hosts: 2,695
http://schema.org/City (2,154,382)
http://schema.org/ImageObject (1,260,633)
http://schema.org/AdministrativeArea (932,550)
http://schema.org/Person (819,487)
http://schema.org/ListItem (810,411)
923.8 MB
(1)
AdministrativeArea (sample) lookup_file
pld_stats_file
Airport Quads: 54,033,331
URLs: 160,852
Hosts: 538
http://schema.org/Airport (4,569,542)
http://schema.org/GeoCoordinates (2,461,017)
http://schema.org/Flight (1,593,849)
http://schema.org/Airline (1,473,676)
http://schema.org/Offer (1,244,616)
415.83 MB
(1)
Airport (sample) lookup_file
pld_stats_file
Answer Quads: 1,485,993,805
URLs: 13,279,199
Hosts: 251,442
http://schema.org/Answer (53,728,963)
http://schema.org/Question (45,624,470)
http://schema.org/ImageObject (30,813,788)
http://schema.org/ListItem (30,093,835)
http://schema.org/Person (18,278,578)
26.69 GB
(15)
Answer (sample) lookup_file
pld_stats_file
Book Quads: 308,456,812
URLs: 4,999,213
Hosts: 21,623
http://schema.org/Book (13,434,918)
http://schema.org/Country (7,685,986)
http://schema.org/Offer (7,381,043)
http://schema.org/Person (6,792,930)
http://schema.org/ListItem (3,755,829)
7.67 GB
(4)
Book (sample) lookup_file
pld_stats_file
City Quads: 213,841,120
URLs: 1,152,893
Hosts: 11,786
http://schema.org/City (5,832,337)
http://schema.org/ImageObject (3,890,159)
http://schema.org/Person (3,482,668)
http://schema.org/ListItem (3,132,617)
http://schema.org/PostalAddress (2,928,984)
2.25 GB
(3)
City (sample) lookup_file
pld_stats_file
CollegeOrUniversity Quads: 154,890,333
URLs: 1,326,291
Hosts: 3,544
http://schema.org/CollegeOrUniversity (5,643,657)
http://schema.org/ListItem (4,582,716)
http://schema.org/ImageObject (4,072,323)
http://schema.org/Person (3,630,069)
http://schema.org/PostalAddress (2,812,621)
2.15 GB
(2)
CollegeOrUniversity (sample) lookup_file
pld_stats_file
Continent Quads: 1,807,060
URLs: 8,584
Hosts: 57
http://schema.org/City (243,141)
http://schema.org/AdministrativeArea (122,577)
http://schema.org/Country (11,022)
http://schema.org/Continent (10,846)
http://schema.org/ListItem (7,788)
16.38 MB
(1)
Continent (sample) lookup_file
pld_stats_file
Country Quads: 625,627,034
URLs: 4,840,011
Hosts: 24,422
http://schema.org/Country (43,110,397)
http://schema.org/ListItem (15,718,968)
http://schema.org/Organization (8,272,171)
http://schema.org/Offer (8,244,657)
http://schema.org/PostalAddress (6,901,736)
7.83 GB
(7)
Country (sample) lookup_file
pld_stats_file
CreativeWork Quads: 2,909,313,285
URLs: 57,905,383
Hosts: 1,060,682
https://schema.org/CreativeWork (97,847,794)
https://schema.org/Person (69,897,980)
https://schema.org/SiteNavigationElement (60,989,946)
https://schema.org/WPHeader (39,355,522)
https://schema.org/WPFooter (38,637,170)
112.09 GB
(30)
CreativeWork (sample) lookup_file
pld_stats_file
Dataset Quads: 50,243,885
URLs: 930,859
Hosts: 1,676
http://schema.org/DataDownload (2,447,952)
http://schema.org/Dataset (1,320,329)
http://schema.org/Organization (951,404)
http://schema.org/PropertyValue (800,666)
http://schema.org/Person (449,880)
782.07 MB
(1)
Dataset (sample) lookup_file
pld_stats_file
EducationalOrganization Quads: 96,140,998
URLs: 1,146,674
Hosts: 9,080
http://schema.org/EducationalOrganization (2,312,278)
http://schema.org/ListItem (2,090,170)
http://schema.org/PostalAddress (1,825,786)
http://schema.org/ImageObject (1,036,325)
http://schema.org/GeoCoordinates (752,916)
1.5 GB
(1)
EducationalOrganization (sample) lookup_file
pld_stats_file
Event Quads: 1,732,974,389
URLs: 16,038,172
Hosts: 313,782
http://schema.org/Event (68,785,848)
http://schema.org/Place (48,427,726)
http://schema.org/PostalAddress (38,748,060)
http://schema.org/Person (21,337,534)
http://schema.org/ListItem (15,879,353)
22.93 GB
(18)
Event (sample) lookup_file
pld_stats_file
FAQPage Quads: 1,283,468,270
URLs: 10,111,531
Hosts: 230,283
http://schema.org/Question (42,934,439)
http://schema.org/Answer (42,802,142)
http://schema.org/ImageObject (31,156,198)
http://schema.org/ListItem (27,405,514)
https://schema.org/Question (13,404,725)
21.57 GB
(13)
FAQPage (sample) lookup_file
pld_stats_file
GeoCoordinates Quads: 3,699,629,956
URLs: 29,887,844
Hosts: 461,695
http://schema.org/ListItem (98,973,753)
http://schema.org/PostalAddress (64,876,448)
http://schema.org/GeoCoordinates (60,789,730)
http://schema.org/ImageObject (37,038,861)
http://schema.org/OpeningHoursSpecification (34,280,120)
45.45 GB
(37)
GeoCoordinates (sample) lookup_file
pld_stats_file
GovernmentOrganization Quads: 15,661,042
URLs: 348,175
Hosts: 1,398
http://schema.org/GovernmentOrganization (467,886)
http://schema.org/ImageObject (374,525)
http://schema.org/ListItem (335,640)
http://schema.org/PostalAddress (251,555)
http://schema.org/Organization (181,023)
303.11 MB
(1)
GovernmentOrganization (sample) lookup_file
pld_stats_file
Hospital Quads: 23,904,433
URLs: 272,304
Hosts: 2,058
http://schema.org/PostalAddress (682,782)
http://schema.org/Hospital (511,631)
http://schema.org/ListItem (364,906)
http://schema.org/Physician (271,490)
http://schema.org/Review (263,414)
315.43 MB
(1)
Hospital (sample) lookup_file
pld_stats_file
Hotel Quads: 323,859,754
URLs: 2,636,528
Hosts: 24,065
http://schema.org/Hotel (9,579,097)
http://schema.org/PostalAddress (8,664,346)
http://schema.org/Rating (8,582,433)
http://schema.org/ImageObject (5,499,477)
http://schema.org/ListItem (5,076,113)
4.56 GB
(4)
Hotel (sample) lookup_file
pld_stats_file
JobPosting Quads: 182,882,855
URLs: 4,126,373
Hosts: 50,466
http://schema.org/Place (5,538,410)
http://schema.org/PostalAddress (5,396,582)
http://schema.org/Organization (5,085,787)
http://schema.org/JobPosting (4,977,840)
http://schema.org/ListItem (2,628,233)
7.01 GB
(2)
JobPosting (sample) lookup_file
pld_stats_file
LakeBodyOfWater Quads: 54,795
URLs: 1,719
Hosts: 110
http://schema.org/LakeBodyOfWater (1,680)
http://schema.org/PropertyValue (1,396)
http://schema.org/ImageObject (1,336)
http://schema.org/GeoCoordinates (1,243)
http://schema.org/PostalAddress (820)
2.5 MB
(1)
LakeBodyOfWater (sample) lookup_file
pld_stats_file
LandmarksOrHistoricalBuildings Quads: 1,985,725
URLs: 23,513
Hosts: 348
http://schema.org/LandmarksOrHistoricalBuildings (129,909)
http://schema.org/PostalAddress (56,802)
http://schema.org/PropertyValue (34,550)
http://schema.org/ImageObject (31,991)
http://schema.org/Organization (27,292)
32.47 MB
(1)
LandmarksOrHistoricalBuildings (sample) lookup_file
pld_stats_file
Language Quads: 701,081,445
URLs: 5,783,390
Hosts: 10,513
http://schema.org/Person (30,919,803)
http://schema.org/Comment (24,954,078)
http://schema.org/ListItem (12,274,809)
http://schema.org/Language (9,191,675)
http://schema.org/InteractionCounter (8,765,796)
12.56 GB
(7)
Language (sample) lookup_file
pld_stats_file
Library Quads: 6,597,774
URLs: 199,425
Hosts: 714
http://schema.org/Library (215,364)
http://schema.org/OpeningHoursSpecification (205,783)
http://schema.org/PostalAddress (94,133)
http://schema.org/ListItem (78,297)
http://schema.org/Place (62,714)
109.4 MB
(1)
Library (sample) lookup_file
pld_stats_file
LocalBusiness Quads: 2,671,216,988
URLs: 37,243,804
Hosts: 1,197,063
http://schema.org/ListItem (89,958,346)
http://schema.org/LocalBusiness (55,898,742)
http://schema.org/PostalAddress (51,249,632)
http://schema.org/ImageObject (24,054,225)
http://schema.org/OpeningHoursSpecification (19,182,321)
34.85 GB
(27)
LocalBusiness (sample) lookup_file
pld_stats_file
Mountain Quads: 300,982
URLs: 15,293
Hosts: 56
http://schema.org/propertyValue (23,340)
http://schema.org/Mountain (16,503)
http://schema.org/GeoCoordinates (16,380)
http://schema.org/ImageObject (2,323)
http://schema.org/Place (838)
5.63 MB
(1)
Mountain (sample) lookup_file
pld_stats_file
Movie Quads: 232,715,373
URLs: 2,265,947
Hosts: 7,801
http://schema.org/Person (10,501,940)
http://schema.org/Movie (6,832,078)
https://schema.org/Person (3,423,482)
http://schema.org/ImageObject (2,789,665)
http://schema.org/VideoObject (2,435,667)
3.39 GB
(3)
Movie (sample) lookup_file
pld_stats_file
Museum Quads: 6,137,731
URLs: 102,146
Hosts: 610
http://schema.org/OpeningHoursSpecification (223,564)
http://schema.org/Museum (113,752)
http://schema.org/PostalAddress (91,909)
http://schema.org/Event (68,844)
http://schema.org/ListItem (62,544)
83.71 MB
(1)
Museum (sample) lookup_file
pld_stats_file
MusicAlbum Quads: 116,521,141
URLs: 901,157
Hosts: 16,829
http://schema.org/MusicRecording (7,489,652)
http://schema.org/Country (6,478,141)
http://schema.org/Offer (2,588,062)
http://schema.org/MusicAlbum (2,378,182)
http://schema.org/AudioObject (2,209,240)
1.07 GB
(2)
MusicAlbum (sample) lookup_file
pld_stats_file
MusicRecording Quads: 194,556,455
URLs: 1,683,808
Hosts: 24,513
http://schema.org/Country (16,492,792)
http://schema.org/MusicRecording (13,576,880)
http://schema.org/Offer (2,866,310)
http://schema.org/AudioObject (2,533,762)
http://schema.org/MusicGroup (2,344,164)
1.8 GB
(2)
MusicRecording (sample) lookup_file
pld_stats_file
Organization Quads: 40,108,866,817
URLs: 637,002,088
Hosts: 5,915,483
http://schema.org/ListItem (1,085,683,835)
http://schema.org/ImageObject (933,908,146)
http://schema.org/Organization (840,543,926)
http://schema.org/WebPage (429,047,446)
http://schema.org/Person (408,926,073)
679.48 GB
(401)
Organization (sample) lookup_file
pld_stats_file
Painting Quads: 15,219,466
URLs: 137,718
Hosts: 525
http://schema.org/Person (2,870,457)
http://schema.org/Painting (619,613)
http://schema.org/Offer (398,543)
http://schema.org/ListItem (309,616)
http://schema.org/Property (179,754)
136.25 MB
(1)
Painting (sample) lookup_file
pld_stats_file
Park Quads: 1,105,870
URLs: 13,473
Hosts: 280
http://schema.org/PostalAddress (31,540)
http://schema.org/OpeningHoursSpecification (29,531)
http://schema.org/ListItem (16,167)
http://schema.org/Park (15,215)
http://schema.org/GeoCoordinates (13,676)
16.83 MB
(1)
Park (sample) lookup_file
pld_stats_file
Person Quads: 28,389,014,403
URLs: 396,876,413
Hosts: 4,342,475
http://schema.org/ImageObject (707,265,548)
http://schema.org/Person (685,309,482)
http://schema.org/ListItem (587,832,845)
http://schema.org/WebPage (356,489,547)
http://schema.org/Organization (326,175,879)
559.6 GB
(284)
Person (sample) lookup_file
pld_stats_file
Place Quads: 3,732,184,081
URLs: 32,121,226
Hosts: 430,758
http://schema.org/ListItem (94,624,361)
http://schema.org/Place (93,816,638)
http://schema.org/PostalAddress (78,329,151)
http://schema.org/Event (52,594,074)
http://schema.org/Person (38,433,126)
51.62 GB
(38)
Place (sample) lookup_file
pld_stats_file
Product Quads: 17,883,521,101
URLs: 256,873,876
Hosts: 2,551,754
http://schema.org/Offer (568,611,575)
http://schema.org/Product (502,424,891)
http://schema.org/ListItem (500,361,108)
http://schema.org/Organization (218,515,422)
http://schema.org/ImageObject (142,962,903)
257.16 GB
(179)
Product (sample) lookup_file
pld_stats_file
QAPage Quads: 179,194,962
URLs: 3,174,379
Hosts: 11,172
http://schema.org/Person (10,610,649)
http://schema.org/Answer (6,971,763)
http://schema.org/Question (2,617,916)
http://schema.org/QAPage (2,434,624)
http://schema.org/ListItem (2,264,667)
4.25 GB
(2)
QAPage (sample) lookup_file
pld_stats_file
Question Quads: 1,520,018,091
URLs: 14,257,595
Hosts: 254,857
http://schema.org/Answer (52,752,628)
http://schema.org/Question (47,219,749)
http://schema.org/ImageObject (31,489,782)
http://schema.org/ListItem (30,231,894)
http://schema.org/Person (18,552,705)
27.47 GB
(16)
Question (sample) lookup_file
pld_stats_file
RadioStation Quads: 19,207,934
URLs: 393,145
Hosts: 967
http://schema.org/ListItem (847,579)
http://schema.org/RadioStation (449,935)
http://schema.org/ImageObject (308,248)
http://schema.org/NewsArticle (264,461)
http://schema.org/BreadcrumbList (199,135)
301.1 MB
(1)
RadioStation (sample) lookup_file
pld_stats_file
Recipe Quads: 449,214,017
URLs: 4,090,428
Hosts: 40,702
http://schema.org/HowToStep (13,702,473)
http://schema.org/ImageObject (7,240,806)
http://schema.org/Person (6,709,755)
http://schema.org/ListItem (6,635,160)
http://schema.org/Recipe (5,290,014)
7.33 GB
(5)
Recipe (sample) lookup_file
pld_stats_file
Restaurant Quads: 216,942,320
URLs: 1,599,457
Hosts: 63,121
http://schema.org/Offer (6,963,370)
http://schema.org/MenuItem (6,426,675)
http://schema.org/Restaurant (4,675,743)
http://schema.org/ListItem (3,907,235)
http://schema.org/PostalAddress (3,561,463)
2.35 GB
(3)
Restaurant (sample) lookup_file
pld_stats_file
RiverBodyOfWater Quads: 167,629
URLs: 3,221
Hosts: 19
http://schema.org/ImageObject (6,369)
http://schema.org/ListItem (6,367)
http://schema.org/Organization (4,276)
http://schema.org/RiverBodyOfWater (3,239)
http://schema.org/GeoCoordinates (2,726)
4.06 MB
(1)
RiverBodyOfWater (sample) lookup_file
pld_stats_file
School Quads: 12,547,288
URLs: 235,736
Hosts: 1,699
http://schema.org/School (395,869)
http://schema.org/ListItem (339,734)
http://schema.org/PostalAddress (271,185)
http://schema.org/ImageObject (140,270)
http://schema.org/WebPage (122,865)
182.74 MB
(1)
School (sample) lookup_file
pld_stats_file
ShoppingCenter Quads: 10,680,910
URLs: 136,690
Hosts: 1,199
http://schema.org/PostalAddress (207,036)
http://schema.org/ShoppingCenter (202,312)
http://schema.org/Offer (173,447)
http://schema.org/Organization (165,687)
http://schema.org/ListItem (148,766)
135.23 MB
(1)
ShoppingCenter (sample) lookup_file
pld_stats_file
SkiResort Quads: 1,475,134
URLs: 34,829
Hosts: 250
http://schema.org/ListItem (50,178)
http://schema.org/SkiResort (39,260)
http://schema.org/PostalAddress (28,966)
http://schema.org/Person (23,906)
http://schema.org/Review (23,384)
26.25 MB
(1)
SkiResort (sample) lookup_file
pld_stats_file
SportsEvent Quads: 140,098,380
URLs: 905,879
Hosts: 6,743
http://schema.org/SportsEvent (6,683,759)
http://schema.org/Place (6,215,841)
http://schema.org/SportsTeam (5,822,261)
http://schema.org/PostalAddress (4,944,824)
http://schema.org/Organization (1,993,490)
1.18 GB
(2)
SportsEvent (sample) lookup_file
pld_stats_file
SportsTeam Quads: 99,150,611
URLs: 734,718
Hosts: 3,818
http://schema.org/SportsTeam (6,879,209)
http://schema.org/SportsEvent (2,591,617)
http://schema.org/Place (2,420,968)
http://schema.org/Organization (1,837,178)
http://schema.org/PostalAddress (1,788,906)
876.31 MB
(1)
SportsTeam (sample) lookup_file
pld_stats_file
StadiumOrArena Quads: 27,517,383
URLs: 78,382
Hosts: 253
http://schema.org/Organization (1,118,034)
http://schema.org/SportsTeam (997,719)
http://schema.org/ImageObject (848,949)
http://schema.org/BlogPosting (417,543)
http://schema.org/SportsEvent (397,598)
207.27 MB
(1)
StadiumOrArena (sample) lookup_file
pld_stats_file
TelevisionStation Quads: 1,637,886
URLs: 20,631
Hosts: 91
http://schema.org/ListItem (51,424)
http://schema.org/ImageObject (31,052)
http://schema.org/TelevisionStation (25,578)
http://schema.org/Organization (23,518)
http://schema.org/PostalAddress (22,709)
24.68 MB
(1)
TelevisionStation (sample) lookup_file
pld_stats_file
TVEpisode Quads: 68,894,417
URLs: 394,881
Hosts: 1,204
http://schema.org/Country (5,010,019)
https://schema.org/TVEpisode (3,390,748)
http://schema.org/TVEpisode (1,626,080)
http://schema.org/Person (866,730)
http://schema.org/ListItem (408,453)
703.18 MB
(1)
TVEpisode (sample) lookup_file
pld_stats_file


If you are interested in a particular class or set of classes that is not listed above, please contact the Web Data Commons team via the mailing list or our Google Group.

Conversion to Other Formats

We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON formats, which are supported by a wide range of spreadsheet applications, relational databases and data mining frameworks like the Python data analysis library pandas. Please find further details on how to convert the download files to other formats on the main page.
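
As a rough illustration of what such a conversion involves, the sketch below groups quads into one JSON record per page URL and subject node. It is not the converter offered on the main page. The record layout (`page`, `node`, `properties`), the simplified quad pattern, and the sample data are assumptions made for this example.

```python
import json
import re
from collections import defaultdict

# Simplified N-Quads pattern (an assumption; it does not cover the full grammar).
QUAD_RE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'
    r'(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'
    r'(<[^>]*>)\s*\.\s*$'
)


def quads_to_records(lines):
    """Group quads into one dict per (page URL, subject node)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for line in lines:
        m = QUAD_RE.match(line)
        if m is None:
            continue  # skip lines the simplified pattern does not cover
        subject, predicate, obj, graph = m.groups()
        # Key by page and node; collect all object values per predicate IRI.
        grouped[(graph[1:-1], subject)][predicate[1:-1]].append(obj)
    return [
        {"page": page, "node": node, "properties": dict(props)}
        for (page, node), props in grouped.items()
    ]


# Hypothetical sample quads, for illustration only.
sample = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Event> <http://example.org/events/1> .',
    '_:b0 <http://schema.org/name> "Concert" <http://example.org/events/1> .',
    '_:b1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Place> <http://example.org/events/1> .',
]

records = quads_to_records(sample)
# One JSON document per line, a layout that many data tools can read.
json_lines = "\n".join(json.dumps(record) for record in records)
```

Object values are kept in their N-Quads serialization (including quotes and angle brackets) so that no information is lost; a real conversion would likely normalize them further.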

Get the Code

The Jupyter notebooks used to create the Schema.org subsets from the Microdata and JSON-LD corpus can be checked out from our Git repository.

The extraction of the October 2022 corpus was done with version 1.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.

Get Support

Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.