This page provides access to, and statistics about, class-specific subsets of the Schema.org data contained in the October 2021 version of the Web Data Commons Microdata and JSON-LD corpus. The datasets are part of the Web Data Commons Schema.org Data Set Series.
As many users are only interested in specific types of Schema.org data (like product data, event data, job postings,
or data describing local businesses), we have created class-specific subsets out of the complete and merged Microdata and JSON-LD corpora for a
selection of schema.org classes.
The subsets contain all instances of a specific class in either format, as well as all other data found on
the webpages containing these instances. For example, a page containing data about a product might also contain
reviews and offers for this product; a page containing data about an event might also contain data about the
location of the event and the persons involved in the event.
The data is represented in N-Quads format, meaning that the fourth
element of each quad contains the URL of the webpage from which the data was extracted.
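As a rough illustration of the quad layout described above, the following Python sketch pulls the page URL out of the fourth element of a single line. It assumes standard N-Quads term syntax; the helper names `parse_quad` and `page_url` and the sample line are hypothetical, and lines in the download files that deviate from this syntax would simply return `None`.

```python
import re

# One N-Quads term: an IRI, a blank node, or a literal with an optional
# datatype or language tag. This is a simplified pattern, not a full grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
QUAD = re.compile(rf'^\s*{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$')


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def page_url(line):
    """Return the URL stored in the fourth element of the quad, without angle brackets."""
    quad = parse_quad(line)
    if quad is None:
        return None
    return quad[3].strip("<>")


# Hypothetical example line; the URL is made up for illustration.
example = '_:n1 <http://schema.org/name> "Apple Pie"@en <https://www.yummly.com/recipe/1> .'
print(page_url(example))
```

Grouping lines by the value returned from `page_url` would recover all data extracted from one webpage.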
To make the class-specific data easier to download and access, we provide the schema.org subsets in chunks. Each
chunk contains the quads of specific pay-level domains (PLDs), i.e. all quads of one PLD, e.g. yummly.com,
are stored in the same chunk file. Additionally, we provide lookup files containing the mappings between PLDs
and their corresponding chunks, as well as CSV files with PLD-specific statistics.
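A lookup file can be used to find the chunk that holds the data of a given PLD before downloading anything else. The sketch below assumes the lookup file is a CSV with a header row; the column names `pld` and `file_lookup`, the chunk file names, and the helper `load_lookup` are assumptions made for illustration and should be checked against the actual files.

```python
import csv
import io


def load_lookup(handle, pld_column="pld", file_column="file_lookup"):
    """Build a {pld: chunk_file} dictionary from an open CSV lookup file.

    The default column names are assumptions, not the documented layout.
    """
    reader = csv.DictReader(handle)
    return {row[pld_column]: row[file_column] for row in reader}


# Hypothetical lookup content standing in for a downloaded lookup file.
sample = io.StringIO(
    "pld,file_lookup\n"
    "yummly.com,part_3.gz\n"
    "example.org,part_0.gz\n"
)
lookup = load_lookup(sample)
print(lookup.get("yummly.com"))
```

With a real file, `handle` would be the result of `open(path, newline="", encoding="utf-8")`.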
Please note that:
Schema.org Subset | General Stats | Related Classes | Size (# Files) | Download (Sample) | PLD-to-File Lookup / PLD-Specific Stats |
---|---|---|---|---|---|
AdministrativeArea | Quads: 60,765,271 URLs: 447,918 PLDs: 2,622 | http://schema.org/ImageObject (944,007)http://schema.org/AdministrativeArea (850,664) http://schema.org/ListItem (802,971)http://schema.org/City (576,671)http://schema.org/Person (562,119) | 842.04 MB (1 file(s)) | AdministrativeArea (sample) | lookup_file pld_stats_file |
Airport | Quads: 64,083,243 URLs: 202,332 PLDs: 617 | http://schema.org/Airport (5,406,690)http://schema.org/GeoCoordinates (2,454,639) http://schema.org/Flight (1,835,551)http://schema.org/Airline (1,662,291) http://schema.org/Offer (1,473,869) | 486.08 MB (1 file(s)) | Airport (sample) | lookup_file pld_stats_file |
Book | Quads: 360,291,528 URLs: 6,444,929 PLDs: 17,998 | http://schema.org/Book (18,731,029)http://schema.org/Country (8,274,215) http://schema.org/Person (7,899,124)http://schema.org/Offer (7,376,178) http://schema.org/ListItem (3,784,768) | 6.2 GB (10 file(s)) | Book (sample) | lookup_file pld_stats_file |
City | Quads: 199,894,118 URLs: 1,276,457 PLDs: 11,260 | http://schema.org/City (4,662,681)http://schema.org/ImageObject (3,526,546) http://schema.org/PostalAddress (3,375,680)http://schema.org/Person (3,215,783) http://schema.org/ListItem (3,091,827) | 2.1 GB (2 file(s)) | City (sample) | lookup_file pld_stats_file |
CollegeOrUniversity | Quads: 118,551,289 URLs: 1,219,624 PLDs: 3,007 | http://schema.org/CollegeOrUniversity (4,566,888)http://schema.org/Person (3,332,672) http://schema.org/ImageObject (3,149,952)http://schema.org/PostalAddress (2,949,686) http://schema.org/ListItem (1,589,546) | 1.5 GB (2 file(s)) | CollegeOrUniversity (sample) | lookup_file pld_stats_file |
Continent | Quads: 1,403,157 URLs: 10,604 PLDs: 55 | http://schema.org/City (176,404)http://schema.org/AdministrativeArea (89,871) http://schema.org/Continent (12,635)http://schema.org/Country (11,200) http://schema.org/GeoCoordinates (9,053) | 13.57 MB (1 file(s)) | Continent (sample) | lookup_file pld_stats_file |
Country | Quads: 790,910,810 URLs: 6,673,935 PLDs: 40,409 | http://schema.org/Country (56,150,693)http://schema.org/ListItem (23,345,865) http://schema.org/Organization (14,653,383)http://schema.org/Offer (10,637,184) http://schema.org/ContactPoint (10,181,864) | 10.1 GB (3 file(s)) | Country (sample) | lookup_file pld_stats_file |
CreativeWork | Quads: 2,858,229,005 URLs: 47,768,961 PLDs: 1,033,049 | https://schema.org/CreativeWork (79,159,623)https://schema.org/Person (75,522,631) https://schema.org/Comment (44,714,553)https://schema.org/SiteNavigationElement (42,802,162) https://schema.org/WPHeader (29,937,233) | 97.9 GB (160 file(s)) | CreativeWork (sample) | lookup_file pld_stats_file |
Dataset | Quads: 38,953,280 URLs: 895,386 PLDs: 1,555 | http://schema.org/Dataset (1,286,271)http://schema.org/Organization (961,307) http://schema.org/PropertyValue (878,478)http://schema.org/DataDownload (790,598) http://schema.org/ImageObject (517,210) | 574.78 MB (1 file(s)) | Dataset (sample) | lookup_file pld_stats_file |
EducationalOrganization | Quads: 76,596,728 URLs: 1,102,570 PLDs: 8,375 | http://schema.org/EducationalOrganization (2,206,149)http://schema.org/PostalAddress (1,558,213)http://schema.org/ListItem (1,526,639)http://schema.org/ImageObject (706,628) http://schema.org/Course (672,205) | 1.2 GB (1 file(s)) | EducationalOrganization (sample) | lookup_file pld_stats_file |
Event | Quads: 1,633,842,997 URLs: 13,433,021 PLDs: 261,160 | http://schema.org/Event (72,764,507)http://schema.org/Place (52,668,977) http://schema.org/PostalAddress (38,480,951)http://schema.org/Person (20,414,748) http://schema.org/Offer (15,676,301) | 22.1 GB (12 file(s)) | Event (sample) | lookup_file pld_stats_file |
GeoCoordinates | Quads: 3,226,742,821 URLs: 28,650,982 PLDs: 469,253 | http://schema.org/PostalAddress (62,911,798)http://schema.org/GeoCoordinates (59,146,834) http://schema.org/ListItem (42,608,148)http://schema.org/ImageObject (32,868,276) http://schema.org/Offer (29,585,584) | 42.3 GB (20 file(s)) | GeoCoordinates (sample) | lookup_file pld_stats_file |
GovernmentOrganization | Quads: 16,628,646 URLs: 342,837 PLDs: 1,304 | http://schema.org/ImageObject (575,166)http://schema.org/GovernmentOrganization (530,788) http://schema.org/ListItem (343,372)http://schema.org/PostalAddress (296,180) http://schema.org/Organization (231,342) | 269.92 MB (1 file(s)) | GovernmentOrganization (sample) | lookup_file pld_stats_file |
Hospital | Quads: 21,693,491 URLs: 286,317 PLDs: 2,154 | http://schema.org/PostalAddress (538,221)http://schema.org/Hospital (443,902) http://schema.org/ListItem (277,373)http://schema.org/GeoCoordinates (238,584) https://schema.org/MedicalCondition (230,568) | 294.66 MB (1 file(s)) | Hospital (sample) | lookup_file pld_stats_file |
Hotel | Quads: 359,533,927 URLs: 2,609,123 PLDs: 23,487 | http://schema.org/ImageObject (8,530,131)http://schema.org/Hotel (7,448,020) http://schema.org/LocationFeatureSpecification (6,061,145)http://schema.org/Rating (5,313,944) http://schema.org/ListItem (5,108,429) | 4.6 GB (4 file(s)) | Hotel (sample) | lookup_file pld_stats_file |
JobPosting | Quads: 159,344,116 URLs: 3,698,141 PLDs: 43,357 | http://schema.org/JobPosting (4,883,181)http://schema.org/Place (4,881,113) http://schema.org/PostalAddress (4,770,776)http://schema.org/Organization (4,450,457) http://schema.org/ListItem (2,213,865) | 6.2 GB (4 file(s)) | JobPosting (sample) | lookup_file pld_stats_file |
LakeBodyOfWater | Quads: 1,007,227 URLs: 5,515 PLDs: 85 | https://schema.org/AdministrativeArea (63,970)https://schema.org/Place (39,398) https://schema.org/Map (21,306)https://schema.org/LakeBodyOfWater (17,822) https://schema.org/ListItem (15,398) | 10.26 MB (1 file(s)) | LakeBodyOfWater (sample) | lookup_file pld_stats_file |
LandmarksOrHistoricalBuildings | Quads: 2,499,817 URLs: 31,107 PLDs: 367 | http://schema.org/LandmarksOrHistoricalBuildings (98,037)http://schema.org/ImageObject (38,520)http://schema.org/CreativeWork (37,995)http://schema.org/PostalAddress (34,395) http://schema.org/PropertyValue (30,455) | 46.52 MB (1 file(s)) | LandmarksOrHistoricalBuildings (sample) | lookup_file pld_stats_file |
Language | Quads: 905,313,908 URLs: 7,490,791 PLDs: 11,793 | http://schema.org/Person (44,425,861)http://schema.org/Comment (37,520,118) http://schema.org/ListItem (15,781,306)http://schema.org/Language (11,051,639) http://schema.org/InteractionCounter (10,489,920) | 16.9 GB (10 file(s)) | Language (sample) | lookup_file pld_stats_file |
Library | Quads: 6,316,189 URLs: 189,051 PLDs: 617 | http://schema.org/Library (211,533)http://schema.org/OpeningHoursSpecification (200,112) http://schema.org/Book (63,575)http://schema.org/ListItem (60,836) http://schema.org/PostalAddress (59,573) | 103.42 MB (1 file(s)) | Library (sample) | lookup_file pld_stats_file |
LocalBusiness | Quads: 2,133,052,253 URLs: 36,545,099 PLDs: 727,613 | http://schema.org/LocalBusiness (56,481,369)http://schema.org/PostalAddress (49,720,404) http://schema.org/ListItem (37,017,817)http://schema.org/ImageObject (24,011,725) http://schema.org/Rating (14,248,632) | 29.6 GB (40 file(s)) | LocalBusiness (sample) | lookup_file pld_stats_file |
Mountain | Quads: 2,113,514 URLs: 33,646 PLDs: 60 | https://schema.org/AdministrativeArea (101,191)https://schema.org/Place (59,246) http://schema.org/Mountain (46,130)http://schema.org/GeoCoordinates (32,748) https://schema.org/Map (31,990) | 25.74 MB (1 file(s)) | Mountain (sample) | lookup_file pld_stats_file |
Movie | Quads: 188,319,450 URLs: 2,589,342 PLDs: 8,372 | http://schema.org/Person (11,715,266)http://schema.org/Movie (4,900,155) https://schema.org/Person (3,242,657)http://schema.org/ListItem (2,082,100) https://schema.org/Comment (2,065,826) | 3.2 GB (5 file(s)) | Movie (sample) | lookup_file pld_stats_file |
Museum | Quads: 7,104,754 URLs: 128,841 PLDs: 645 | http://schema.org/OpeningHoursSpecification (173,199)http://schema.org/PostalAddress (124,607) http://schema.org/Museum (121,880)http://schema.org/Person (94,311)https://schema.org/Museum (90,757) | 91.81 MB (1 file(s)) | Museum (sample) | lookup_file pld_stats_file |
MusicAlbum | Quads: 144,932,250 URLs: 1,325,222 Hosts: 15,569 | http://schema.org/MusicRecording (10,531,635)http://schema.org/Country (8,875,501) http://schema.org/MusicAlbum (3,006,915)http://schema.org/Offer (1,984,069) http://schema.org/MusicGroup (1,974,991) | 1.27 GB (2 file(s)) | MusicAlbum (sample) | lookup_file pld_stats_file |
MusicRecording | Quads: 224,396,466 URLs: 2,347,138 PLDs: 23,401 | http://schema.org/Country (17,125,378)http://schema.org/MusicRecording (16,930,443) http://schema.org/MusicGroup (2,942,496)http://schema.org/Offer (2,274,844) http://schema.org/MusicAlbum (1,991,504) | 2.0 GB (4 file(s)) | MusicRecording (sample) | lookup_file pld_stats_file |
Organization | Quads: 33,849,515,925 URLs: 586,167,976 PLDs: 5,590,365 | http://schema.org/ListItem (868,673,730) http://schema.org/Organization (768,124,229) http://schema.org/ImageObject (723,967,119)http://schema.org/WebPage (395,618,415) http://schema.org/Person (388,122,526) | 614 GB (500 file(s)) | Organization (sample) | lookup_file pld_stats_file |
Painting | Quads: 20,541,804 URLs: 165,580 PLDs: 451 | http://schema.org/Person (5,608,152)http://schema.org/Painting (594,433) http://schema.org/ListItem (320,224)http://schema.org/Offer (193,129) http://schema.org/Organization (154,693) | 161.87 MB (1 file(s)) | Painting (sample) | lookup_file pld_stats_file |
Park | Quads: 1,992,346 URLs: 18,013 PLDs: 295 | https://schema.org/AdministrativeArea (50,733)https://schema.org/Place (42,126) http://schema.org/GeoCoordinates (36,681)http://schema.org/Park (30,678) http://schema.org/PostalAddress (27,463) | 26.31 MB (1 file(s)) | Park (sample) | lookup_file pld_stats_file |
Person | Quads: 25,370,826,738 URLs: 379,276,912 PLDs: 4,162,621 | http://schema.org/Person (683,705,984)http://schema.org/ImageObject (645,796,885) http://schema.org/ListItem (488,479,836)http://schema.org/WebPage (358,768,733) http://schema.org/Organization (294,732,412) | 492.0 GB (527 file(s)) | Person (sample) | lookup_file pld_stats_file |
Place | Quads: 3,265,055,697 URLs: 29,633,238 PLDs: 378,270 | http://schema.org/Place (97,713,014)http://schema.org/PostalAddress (76,701,627) http://schema.org/Event (57,270,391)http://schema.org/ListItem (36,550,268) http://schema.org/Person (34,451,494) | 47.1 GB (25 file(s)) | Place (sample) | lookup_file pld_stats_file |
Product | Quads: 17,301,144,036 URLs: 271,813,425 PLDs: 2,583,228 | http://schema.org/Offer (610,738,978)http://schema.org/Product (590,894,883) http://schema.org/ListItem (437,803,391)http://schema.org/Organization (212,004,989) http://schema.org/BreadcrumbList (117,297,771) | 274.7 GB (300 file(s)) | Product (sample) | lookup_file pld_stats_file |
RadioStation | Quads: 14,952,891 URLs: 337,200 PLDs: 663 | http://schema.org/ListItem (693,857)http://schema.org/RadioStation (373,590) http://schema.org/NewsArticle (323,919)http://schema.org/ImageObject (153,898) http://schema.org/PostalAddress (144,450) | 227.1 MB (1 file(s)) | RadioStation (sample) | lookup_file pld_stats_file |
Recipe | Quads: 367,757,913 URLs: 4,521,389 PLDs: 42,495 | http://schema.org/HowToStep (10,567,769)http://schema.org/ListItem (6,030,978) http://schema.org/Person (5,353,168)http://schema.org/Recipe (4,937,223) http://schema.org/ImageObject (4,499,447) | 6.8 GB (5 file(s)) | Recipe (sample) | lookup_file pld_stats_file |
Restaurant | Quads: 233,766,701 URLs: 1,556,445 PLDs: 57,710 | http://schema.org/Offer (11,319,253)http://schema.org/MenuItem (10,559,825) http://schema.org/Product (4,257,912)http://schema.org/Restaurant (4,144,203) http://schema.org/ListItem (3,457,459) | 2.5 GB (2 file(s)) | Restaurant (sample) | lookup_file pld_stats_file |
RiverBodyOfWater | Quads: 367,856 URLs: 3,945 PLDs: 24 | https://schema.org/AdministrativeArea (14,587)https://schema.org/Place (8,679) https://schema.org/BodyOfWater (6,838)http://schema.org/ListItem (5,234) http://schema.org/ImageObject (5,138) | 4.93 MB (1 file(s)) | RiverBodyOfWater (sample) | lookup_file pld_stats_file |
School | Quads: 15,037,483 URLs: 280,034 PLDs: 1,734 | http://schema.org/ListItem (455,460)http://schema.org/School (401,765) http://schema.org/PostalAddress (262,017)http://schema.org/WebPage (201,885) http://schema.org/ImageObject (135,657) | 218.82 MB (1 file(s)) | School (sample) | lookup_file pld_stats_file |
ShoppingCenter | Quads: 13,571,658 URLs: 166,061 PLDs: 1,316 | http://schema.org/Offer (313,733)http://schema.org/PostalAddress (296,514) http://schema.org/Organization (282,880)http://schema.org/ShoppingCenter (258,638) http://schema.org/Product (147,719) | 163.38 MB (1 file(s)) | ShoppingCenter (sample) | lookup_file pld_stats_file |
SkiResort | Quads: 1,034,270 URLs: 30,349 PLDs: 220 | http://schema.org/SkiResort (31,231)http://schema.org/PostalAddress (28,630) http://schema.org/ListItem (27,387)http://schema.org/AggregateRating (21,499) http://schema.org/Organization (15,414) | 18.51 MB (1 file(s)) | SkiResort (sample) | lookup_file pld_stats_file |
SportsEvent | Quads: 141,738,795 URLs: 957,983 PLDs: 6,844 | http://schema.org/SportsTeam (7,329,202)http://schema.org/SportsEvent (6,290,522) http://schema.org/Place (5,328,600)http://schema.org/PostalAddress (4,300,669) http://schema.org/Organization (1,676,829) | 1.2 GB (3 file(s)) | SportsEvent (sample) | lookup_file pld_stats_file |
SportsTeam | Quads: 132,540,282 URLs: 936,635 PLDs: 4,770 | http://schema.org/SportsTeam (9,126,369)http://schema.org/Place (4,777,298) http://schema.org/SportsEvent (3,758,544)http://schema.org/PostalAddress (2,995,648) http://schema.org/Person (1,736,128) | 1.1 GB (3 file(s)) | SportsTeam (sample) | lookup_file pld_stats_file |
StadiumOrArena | Quads: 26,788,192 URLs: 95,667 PLDs: 235 | http://schema.org/Place (1,331,291)http://schema.org/SportsTeam (720,717) http://schema.org/Organization (637,179)http://schema.org/ImageObject (577,567) http://schema.org/StadiumOrArena (328,592) | 221.26 MB (1 file(s)) | StadiumOrArena (sample) | lookup_file pld_stats_file |
TVEpisode | Quads: 85,209,605 URLs: 460,809 PLDs: 1,284 | http://schema.org/Country (7,956,609)http://schema.org/TVEpisode (3,579,380) https://schema.org/TVEpisode (1,303,687)http://schema.org/Person (1,253,586) http://schema.org/OnDemandEvent (732,639) | 801.1 MB (1 file(s)) | TVEpisode (sample) | lookup_file pld_stats_file |
TelevisionStation | Quads: 1,184,918 URLs: 16,879 PLDs: 103 | http://schema.org/ListItem (25,527)http://schema.org/ImageObject (21,425) http://schema.org/TelevisionStation (20,807)http://schema.org/SiteNavigationElement (19,267) http://schema.org/CreativeWorkSeries (17,466) | 17.63 MB (1 file(s)) | TelevisionStation (sample) | lookup_file pld_stats_file |
If you are interested in a class or set of classes that is not listed above, please contact the Web Data Commons team via the mailing list or our Google Group.
We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON, which are supported by a wide range of spreadsheet applications, relational databases and data mining frameworks such as the Python data analysis library pandas. Further details on how to convert the download files to other formats can be found on the main page.
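For readers who only need a quick look at a chunk, a conversion to CSV can be sketched in a few lines of standard-library Python. This is not the conversion code mentioned above; it is a minimal sketch that assumes the chunk is gzip-compressed, that subject, predicate and graph contain no whitespace, and that every quad ends with a period. The function names `split_quad` and `quads_to_csv` are hypothetical, and lines that do not fit these assumptions are skipped.

```python
import csv
import gzip
import io


def split_quad(line):
    """Split one quad into (subject, predicate, object, graph), or return None."""
    text = line.strip()
    if not text.endswith("."):
        return None
    text = text[:-1].rstrip()
    try:
        # The graph is the last whitespace-free token; the object may contain spaces.
        rest, graph = text.rsplit(None, 1)
        subject, predicate, obj = rest.split(None, 2)
    except ValueError:
        return None
    return subject, predicate, obj, graph


def quads_to_csv(lines, out):
    """Write all parseable quads as CSV rows and return how many were written."""
    writer = csv.writer(out)
    writer.writerow(["subject", "predicate", "object", "graph"])
    written = 0
    for line in lines:
        quad = split_quad(line)
        if quad is not None:
            writer.writerow(quad)
            written += 1
    return written


# Demo on an in-memory gzip stream standing in for a downloaded chunk file.
raw = (
    '_:n1 <http://schema.org/name> "Apple Pie"@en <https://example.org/pie> .\n'
    "not a quad\n"
)
chunk = io.BytesIO(gzip.compress(raw.encode("utf-8")))
out = io.StringIO()
with gzip.open(chunk, mode="rt", encoding="utf-8") as lines:
    count = quads_to_csv(lines, out)
print(count)
```

The resulting CSV has one row per quad and can be loaded into a spreadsheet or a data frame; for large chunks, `out` would be a file opened with `newline=""` rather than an in-memory buffer.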
The Jupyter notebooks used to create the schema.org subsets from the Microdata and JSON-LD corpus can be checked out from our Git repository.
The December 2021 extraction was done with version 1.5 of the extractor. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.