Web Data Commons - RDFa, Microdata, and Microformats Data Sets - November 2013

This document provides statistics about the Web Data Commons RDFa, Microdata, and Microformats data sets, which were extracted from the November 2013 release of the Common Crawl.

In summary, we found structured data within about 586 million HTML pages out of the 2.22 billion pages contained in the crawl (26%).
These pages originate from 1.78 million different pay-level-domains out of the 12.8 million pay-level-domains covered by the crawl (14%).
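
The shares quoted above can be reproduced from the exact counts listed in the Overall table. The following is a minimal sketch; the variable and function names are illustrative and not part of the data sets.

```python
# Exact counts copied from the Overall table of this document.
parsed_urls = 2_224_829_946
urls_with_triples = 585_792_337
domains_in_crawl = 12_831_509
domains_with_triples = 1_779_935


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with structured data:    {url_share:.1f}%")
print(f"Domains with structured data: {domain_share:.1f}%")
```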

Altogether, the extracted data sets consist of 17.2 billion RDF quads.
Instructions on how to download the RDFa, Microdata, and Microformats data sets are given on the "How to Get the Data" page.

Please note that in the following, the term "domains" refers to pay-level-domains. Subdomains are not counted as separate domains.
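
To illustrate what counting by pay-level-domain means, the sketch below collapses host names to the registrable part. It is a deliberate simplification and not the method used to produce these statistics: the suffix set is a tiny hypothetical sample, whereas a real implementation would consult a full public suffix list.

```python
# Hypothetical, deliberately tiny sample of multi-label public suffixes.
# A real implementation would use a complete public suffix list instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def pay_level_domain(host: str) -> str:
    """Collapse a host name to its pay-level-domain (naive sketch)."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels)
    last_two = ".".join(labels[-2:])
    # Keep one extra label when the host ends in a multi-label suffix.
    keep = 3 if last_two in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


hosts = ["www.ebay.co.uk", "example.blogspot.com", "stackoverflow.com"]
plds = {pay_level_domain(h) for h in hosts}
print(sorted(plds))
```

Under this simplification, every subdomain of blogspot.com is counted as the single domain blogspot.com, which is consistent with how that domain appears in the rankings in this document.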

Overall


Crawl Date: Winter 2013
Total Data: 148 terabytes (uncompressed), 44 terabytes (compressed)
Parsed HTML URLs: 2,224,829,946
URLs with Triples: 585,792,337
Domains in Crawl: 12,831,509
Domains with Triples: 1,779,935
Typed Entities: 4,264,562,758
Triples: 17,241,313,916

Results per Format


Format               Domains    URLs           Typed Entities    Triples
html-rdfa            471,406    296,005,115    436,100,210       2,636,964,693
html-microdata       463,539    276,348,609    1,964,777,851     8,795,074,538
html-mf-geo          23,044     14,436,467     56,611,312        222,780,517
html-mf-hcalendar    20,981     3,683,002      41,683,362        212,675,776
html-mf-hcard        995,258    113,402,968    1,643,288,889     4,884,918,863
html-mf-hlisting     2,584      528,387        19,204,882        65,494,465
html-mf-hrecipe      3,530      814,793        7,094,914         34,062,142
html-mf-hresume      262        52,675         81,924            231,573
html-mf-hreview      12,880     3,504,643      33,027,023        145,692,102
html-mf-species      109        22,419         121,200           373,033
html-mf-xfn          195,663    18,467,168     62,571,191        243,046,214
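
As a consistency check, the per-format triple and typed-entity counts can be summed and compared with the Overall figures. The sketch below uses illustrative names. URL and domain counts are not expected to add up in the same way, presumably because a single page or domain can use more than one format.

```python
# (domains, urls, typed_entities, triples) per format, copied from the table above.
per_format = {
    "html-rdfa":         (471_406, 296_005_115, 436_100_210, 2_636_964_693),
    "html-microdata":    (463_539, 276_348_609, 1_964_777_851, 8_795_074_538),
    "html-mf-geo":       (23_044, 14_436_467, 56_611_312, 222_780_517),
    "html-mf-hcalendar": (20_981, 3_683_002, 41_683_362, 212_675_776),
    "html-mf-hcard":     (995_258, 113_402_968, 1_643_288_889, 4_884_918_863),
    "html-mf-hlisting":  (2_584, 528_387, 19_204_882, 65_494_465),
    "html-mf-hrecipe":   (3_530, 814_793, 7_094_914, 34_062_142),
    "html-mf-hresume":   (262, 52_675, 81_924, 231_573),
    "html-mf-hreview":   (12_880, 3_504_643, 33_027_023, 145_692_102),
    "html-mf-species":   (109, 22_419, 121_200, 373_033),
    "html-mf-xfn":       (195_663, 18_467_168, 62_571_191, 243_046_214),
}

# Column-wise sums over all formats.
total_domains, total_urls, total_entities, total_triples = (
    sum(column) for column in zip(*per_format.values())
)

# Overall figures as reported in this document.
overall_entities = 4_264_562_758
overall_triples = 17_241_313_916

print("typed entities match:", total_entities == overall_entities)
print("triples match:       ", total_triples == overall_triples)
print("summed URLs exceed URLs with triples:", total_urls > 585_792_337)
```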

Top Domains by Extracted Triples


  1. citysearch.com (1,007,368,178 triples)
  2. blogspot.com (635,069,324 triples)
  3. fotolia.com (343,244,317 triples)
  4. spokeo.com (308,093,753 triples)
  5. peoplefinders.com (190,348,060 triples)
  6. wordpress.com (172,753,849 triples)
  7. ebay.com (153,004,160 triples)
  8. tripadvisor.com (135,531,030 triples)
  9. flightaware.com (129,407,201 triples)
  10. aliexpress.com (112,576,855 triples)
  11. hotels.com (104,271,712 triples)
  12. ticketliquidator.com (94,324,212 triples)
  13. thecubreporter.com (93,263,000 triples)
  14. stubhub.com (91,691,504 triples)
  15. yellowpages.com (80,249,766 triples)
  16. competitivecyclist.com (76,746,874 triples)
  17. snagajob.com (76,662,607 triples)
  18. repairpal.com (74,065,387 triples)
  19. imore.com (73,357,679 triples)
  20. ebay.co.uk (71,703,305 triples)
  21. razorgator.com (71,680,910 triples)

Top Domains by URLs with Triples


  1. blogspot.com (18,726,871 urls)
  2. citysearch.com (11,028,975 urls)
  3. yahoo.com (3,735,246 urls)
  4. wordpress.com (3,501,400 urls)
  5. tripadvisor.com (3,439,581 urls)
  6. stackexchange.com (3,355,910 urls)
  7. mlb.com (3,168,405 urls)
  8. go.com (2,939,738 urls)
  9. patch.com (2,141,669 urls)
  10. wikipedia.org (2,122,209 urls)
  11. epicsports.com (1,481,094 urls)
  12. dreamstime.com (1,476,765 urls)
  13. hotels.com (1,462,058 urls)
  14. tvguide.com (1,259,297 urls)
  15. google.com (1,239,243 urls)
  16. agoda.com (1,206,243 urls)
  17. reference.com (1,109,185 urls)
  18. youtube.com (1,095,288 urls)
  19. whsv.com (1,060,982 urls)
  20. hp.com (1,044,450 urls)
  21. stackoverflow.com (1,040,652 urls)

Extractor html-rdfa


Triples Extracted: 2,636,964,693
URLs with Triples: 296,005,115
Average Triples per URL: 8.9085
Domains with Triples: 471,406
Average Triples per Domain: 5,593.8292
Typed Entities: 436,100,210
Detailed Statistics as Excel File: html-rdfa.xlsx (127 KB)
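
The two averages reported for each extractor follow directly from the triple, URL, and domain counts. The sketch below recomputes them for html-rdfa; the variable names are illustrative. The last printed digit can differ from the published value, which appears to be truncated rather than rounded.

```python
# Counts for the html-rdfa extractor, copied from the statistics above.
triples = 2_636_964_693
urls_with_triples = 296_005_115
domains_with_triples = 471_406

avg_per_url = triples / urls_with_triples
avg_per_domain = triples / domains_with_triples

print(f"Average triples per URL:    {avg_per_url:,.4f}")
print(f"Average triples per domain: {avg_per_domain:,.4f}")
```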

Extractor html-microdata


Triples Extracted: 8,795,074,538
URLs with Triples: 276,348,609
Average Triples per URL: 31.8260
Domains with Triples: 463,539
Average Triples per Domain: 18,973.7530
Typed Entities: 1,964,777,851
Detailed Statistics as Excel File: html-microdata.xlsx (178 KB)

Property Usage Statistics for the Schema.org Classes Product and Local Business

schema.org/Product and schema.org/LocalBusiness are two of the most frequently used Microdata classes. Below, we therefore report how many websites (pay-level-domains, PLDs) use each property together with these classes:

schema:Product

Class Occurrence      # PLDs
{schema:Product}      56,338

Property Co-occurrence         # PLDs    % of PLDs
schema:Product/name            50,536    89.70
schema:Product/description     38,037    67.51
schema:Product/image           34,921    61.98
schema:Offer/price             33,509    59.47
schema:Product/offers          33,090    58.73

schema:LocalBusiness

Class Occurrence          # PLDs
{schema:LocalBusiness}    35,264

Property Co-occurrence              # PLDs    % of PLDs
schema:LocalBusiness/name           29,404    83.38
schema:LocalBusiness/address        27,718    78.60
schema:LocalBusiness/telephone      21,380    60.62
schema:LocalBusiness/description    7,489     21.23
schema:LocalBusiness/url            5,192     14.72
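
Each percentage in the two tables above is the number of PLDs using a property divided by the number of PLDs using the class. The sketch below recomputes them with illustrative names. It truncates to two decimals because that is what reproduces the published values; this is an observation from the numbers here, not something the source states.

```python
import math


def truncated_percent(count: int, total: int, decimals: int = 2) -> float:
    """Return count/total as a percentage, truncated (not rounded)."""
    factor = 10 ** decimals
    return math.floor(100 * count / total * factor) / factor


# PLD counts copied from the tables above.
class_plds = {"schema:Product": 56_338, "schema:LocalBusiness": 35_264}
property_plds = {
    "schema:Product": {
        "schema:Product/name": 50_536,
        "schema:Product/description": 38_037,
        "schema:Product/image": 34_921,
        "schema:Offer/price": 33_509,
        "schema:Product/offers": 33_090,
    },
    "schema:LocalBusiness": {
        "schema:LocalBusiness/name": 29_404,
        "schema:LocalBusiness/address": 27_718,
        "schema:LocalBusiness/telephone": 21_380,
        "schema:LocalBusiness/description": 7_489,
        "schema:LocalBusiness/url": 5_192,
    },
}

# Percentage of class-using PLDs that also use each property.
shares = {
    cls: {prop: truncated_percent(n, class_plds[cls]) for prop, n in props.items()}
    for cls, props in property_plds.items()
}

for cls, props in shares.items():
    for prop, pct in props.items():
        print(f"{prop:<34} {pct:6.2f}")
```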

Extractor html-mf-geo


Triples Extracted: 222,780,517
URLs with Triples: 14,436,467
Average Triples per URL: 15.4317
Domains with Triples: 23,044
Average Triples per Domain: 9,667.6148
Typed Entities: 56,611,312

Extractor html-mf-hcalendar


Triples Extracted: 212,675,776
URLs with Triples: 3,683,002
Average Triples per URL: 57.7452
Domains with Triples: 20,981
Average Triples per Domain: 10,136.5891
Typed Entities: 41,683,362

Extractor html-mf-hcard


Triples Extracted: 4,884,918,863
URLs with Triples: 113,402,968
Average Triples per URL: 43.0757
Domains with Triples: 995,258
Average Triples per Domain: 4,908.1935
Typed Entities: 1,643,288,889

Extractor html-mf-hlisting


Triples Extracted: 65,494,465
URLs with Triples: 528,387
Average Triples per URL: 123.9516
Domains with Triples: 2,584
Average Triples per Domain: 25,346.1551
Typed Entities: 19,204,882

Extractor html-mf-hrecipe


Triples Extracted: 34,062,142
URLs with Triples: 814,793
Average Triples per URL: 41.8046
Domains with Triples: 3,530
Average Triples per Domain: 9,649.3320
Typed Entities: 7,094,914

Extractor html-mf-hresume


Triples Extracted: 231,573
URLs with Triples: 52,675
Average Triples per URL: 4.3962
Domains with Triples: 262
Average Triples per Domain: 883.8664
Typed Entities: 81,924

Extractor html-mf-hreview


Triples Extracted: 145,692,102
URLs with Triples: 3,504,643
Average Triples per URL: 41.5711
Domains with Triples: 12,880
Average Triples per Domain: 11,311.4986
Typed Entities: 33,027,023

Extractor html-mf-species


Triples Extracted: 373,033
URLs with Triples: 22,419
Average Triples per URL: 16.6391
Domains with Triples: 109
Average Triples per Domain: 3,422.3211
Typed Entities: 121,200

Extractor html-mf-xfn


Triples Extracted: 243,046,214
URLs with Triples: 18,467,168
Average Triples per URL: 13.1609
Domains with Triples: 195,663
Average Triples per Domain: 1,242.1674
Typed Entities: 62,571,191