Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - November 2015

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets, which were extracted from the November 2015 release of the Common Crawl.

In summary, we found structured data within 541 million HTML pages out of the 1.77 billion pages contained in the crawl (30%). These pages originate from 2.72 million different pay-level-domains out of the 14.41 million pay-level-domains covered by the crawl (19%). Altogether, the extracted data sets consist of 24.38 billion RDF quads.
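The percentages in the summary can be recomputed from the raw counts. The following is a minimal sketch, assuming the exact figures listed in the Overall table on this page; the variable and function names are illustrative and not part of the Web Data Commons tooling.

```python
# Counts copied from the Overall table on this page.
parsed_urls = 1_770_525_212
urls_with_triples = 541_514_775
domains_in_crawl = 14_409_425
domains_with_triples = 2_724_591


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with structured data:    {url_share:.1f}%")
print(f"Domains with structured data: {domain_share:.1f}%")
```

The summary reports these shares as whole-number percentages.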

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the page "How to get the data".

Please note that in the following, the term "Domains" refers to pay-level-domains. Subdomains are not counted as separate domains.
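To illustrate what grouping by pay-level-domain means, here is a rough sketch that collapses subdomains onto their pay-level-domain. This page does not describe how the data sets actually derive pay-level-domains, so the function name `pay_level_domain` and the tiny hard-coded suffix set are assumptions made purely for illustration; a real pipeline would rely on a complete public suffix list.

```python
from urllib.parse import urlsplit

# ASSUMPTION: a tiny, hand-picked suffix set for illustration only.
# It is not the list used to build the data sets.
EXAMPLE_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(url: str) -> str:
    """Return the known suffix plus the one label in front of it."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in EXAMPLE_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host  # unknown suffix: fall back to the full host name


for url in (
    "http://foo.blogspot.com/post",
    "https://en.wikipedia.org/wiki/X",
    "http://www.example.co.uk/",
):
    print(url, "->", pay_level_domain(url))
```

Under this scheme, every `*.blogspot.com` page counts toward the single domain `blogspot.com`, which matches how the top-domain lists on this page are presented.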

Overall


Crawl Date: November 2015
Total Data: 151 Terabyte (compressed)
Parsed HTML URLs: 1,770,525,212
URLs with Triples: 541,514,775
Domains in Crawl: 14,409,425
Domains with Triples: 2,724,591
Typed Entities: 6,107,584,968
Triples: 24,377,132,352

Results per Format


Format | Domains | URLs | Typed Entities | Triples
html-microdata | 1,100,783 | 312,229,919 | 2,463,165,945 | 13,224,134,881
html-rdfa | 521,806 | 196,336,975 | 460,658,558 | 1,598,114,462
html-embedded-jsonld | 596,229 | 35,486,192 | 72,180,127 | 382,896,204
html-mf-adr | 140,063 | 13,937,305 | 45,073,622 | 146,968,811
html-mf-geo | 16,263 | 3,802,038 | 6,995,794 | 17,847,779
html-mf-hcalendar | 25,721 | 3,007,502 | 24,846,049 | 119,168,060
html-mf-hcard | 1,189,863 | 104,318,715 | 1,471,748,370 | 4,238,311,400
html-mf-hcard incl. html-mf-adr | 1,198,702 | 105,712,446 | 1,516,821,992 | 4,385,280,211
html-mf-hlisting | 2,484 | 135,780 | 4,043,961 | 16,126,218
html-mf-hrecipe | 2,862 | 389,108 | 3,504,888 | 13,456,521
html-mf-hresume | 115 | 2,987 | 6,741 | 19,338
html-mf-hreview | 11,469 | 1,637,792 | 6,048,539 | 37,280,831
html-mf-species | 88 | 21,370 | 138,818 | 382,344
html-mf-xfn | 139,426 | 16,159,224 | 32,351,564 | 197,145,292
overall | 2,724,591 | 541,514,775 | 6,107,584,968 | 24,377,132,352
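The combined row "html-mf-hcard incl. html-mf-adr" can be checked against its two component rows. The sketch below uses only figures from the table above; the dictionary layout and variable names are illustrative assumptions. It shows that typed entities and triples add up exactly, while domains and URLs do not.

```python
# (domains, urls, typed_entities, triples), copied from the table above.
rows = {
    "html-mf-adr": (140_063, 13_937_305, 45_073_622, 146_968_811),
    "html-mf-hcard": (1_189_863, 104_318_715, 1_471_748_370, 4_238_311_400),
    "html-mf-hcard incl. html-mf-adr": (
        1_198_702, 105_712_446, 1_516_821_992, 4_385_280_211,
    ),
}

adr = rows["html-mf-adr"]
hcard = rows["html-mf-hcard"]
combined = rows["html-mf-hcard incl. html-mf-adr"]

# For each column, does adr + hcard equal the combined row?
column_names = ("domains", "urls", "typed_entities", "triples")
adds_up = {
    name: adr[i] + hcard[i] == combined[i]
    for i, name in enumerate(column_names)
}

for name, ok in adds_up.items():
    print(f"{name:15s} adr + hcard == combined: {ok}")
```

One plausible reading is that a single page or domain can carry both hCard and adr markup, so it is counted once in the combined row, whereas each entity and triple belongs to exactly one format.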

Top Domains by Extracted Triples


  1. theclothdiaperwhisperer.com (570,484,881 triples)
  2. blogspot.com (538,760,931 triples)
  3. moosejaw.com (362,538,262 triples)
  4. hallmark.com (351,073,701 triples)
  5. cnbc.com (340,398,109 triples)
  6. wordpress.com (213,049,733 triples)
  7. rent.com (189,856,235 triples)
  8. zap2it.com (169,534,834 triples)
  9. gigmasters.com (161,002,214 triples)
  10. apple.com (127,552,186 triples)
  11. hotels.com (122,152,702 triples)
  12. epicsports.com (114,339,437 triples)
  13. snagajob.com (105,647,356 triples)
  14. competitivecyclist.com (105,170,966 triples)
  15. flightaware.com (102,813,934 triples)
  16. diaryofaquilter.com (101,945,154 triples)
  17. backcountry.com (100,449,223 triples)
  18. eventbrite.com (99,028,396 triples)
  19. go.com (97,834,845 triples)
  20. meetup.com (91,931,385 triples)
  21. razorgator.com (90,671,418 triples)

Top Domains by URLs with Triples


  1. blogspot.com (17,945,167 urls)
  2. stackexchange.com (4,094,109 urls)
  3. wordpress.com (4,074,073 urls)
  4. oclc.org (3,371,590 urls)
  5. wikipedia.org (2,924,510 urls)
  6. google.com (2,014,975 urls)
  7. hotels.com (1,830,659 urls)
  8. deviantart.com (1,808,415 urls)
  9. yahoo.com (1,786,726 urls)
  10. dreamstime.com (1,437,273 urls)
  11. go.com (1,428,903 urls)
  12. popsugar.com (1,405,929 urls)
  13. agoda.com (1,378,677 urls)
  14. epicsports.com (1,167,852 urls)
  15. wsj.com (1,141,992 urls)
  16. cnet.com (1,114,463 urls)
  17. meetup.com (1,082,299 urls)
  18. nytimes.com (1,075,548 urls)
  19. flightaware.com (1,032,731 urls)
  20. mlb.com (984,393 urls)
  21. polyvore.com (950,460 urls)

Extractor html-rdfa


Triples Extracted: 1,598,114,462
URLs with Triples: 196,336,975
Average Triples per URL: 8.14
Domains with Triples: 521,806
Average Triples per Domain: 3,062.66
Typed Entities: 460,658,558
Detailed statistics as Excel file: html-rdfa.xlsx (127 KB)
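The two averages reported for each extractor follow directly from the triple, URL, and domain counts. As a minimal sketch, using the html-rdfa figures above (the variable names are illustrative):

```python
# html-rdfa counts copied from the section above.
triples = 1_598_114_462
urls_with_triples = 196_336_975
domains_with_triples = 521_806

# Each average is the triple count divided by the respective total.
avg_per_url = round(triples / urls_with_triples, 2)
avg_per_domain = round(triples / domains_with_triples, 2)

print(f"Average triples per URL:    {avg_per_url:,.2f}")
print(f"Average triples per domain: {avg_per_domain:,.2f}")
```

The same calculation can be repeated for any other extractor section by substituting its three counts.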

Extractor html-microdata


Triples Extracted: 13,224,134,881
URLs with Triples: 312,229,919
Average Triples per URL: 42.35
Domains with Triples: 1,100,783
Average Triples per Domain: 12,013.39
Typed Entities: 2,463,165,945
Detailed statistics as Excel file: html-microdata.xlsx (221 KB)

Extractor html-embedded-jsonld


Triples Extracted: 382,896,204
URLs with Triples: 35,486,192
Average Triples per URL: 10.79
Domains with Triples: 596,229
Average Triples per Domain: 642.20
Typed Entities: 72,180,127
Detailed statistics as Excel file: html-embedded-jsonld.xlsx (221 KB)

Extractor html-mf-geo


Triples Extracted: 17,847,779
URLs with Triples: 3,802,038
Average Triples per URL: 4.69
Domains with Triples: 16,263
Average Triples per Domain: 1,097.45
Typed Entities: 6,995,794

Extractor html-mf-hcalendar


Triples Extracted: 119,168,060
URLs with Triples: 3,007,502
Average Triples per URL: 39.62
Domains with Triples: 25,721
Average Triples per Domain: 4,633.10
Typed Entities: 24,846,049

Extractor html-mf-hcard


Triples Extracted: 4,238,311,400
URLs with Triples: 104,318,715
Average Triples per URL: 40.63
Domains with Triples: 1,189,863
Average Triples per Domain: 3,562.02
Typed Entities: 1,471,748,370

Extractor html-mf-hlisting


Triples Extracted: 16,126,218
URLs with Triples: 135,780
Average Triples per URL: 118.77
Domains with Triples: 2,484
Average Triples per Domain: 6,492.04
Typed Entities: 4,043,961

Extractor html-mf-hrecipe


Triples Extracted: 13,456,521
URLs with Triples: 389,108
Average Triples per URL: 34.58
Domains with Triples: 2,862
Average Triples per Domain: 4,701.79
Typed Entities: 3,504,888

Extractor html-mf-hresume


Triples Extracted: 19,338
URLs with Triples: 2,987
Average Triples per URL: 6.47
Domains with Triples: 115
Average Triples per Domain: 168.16
Typed Entities: 6,741

Extractor html-mf-hreview


Triples Extracted: 37,280,831
URLs with Triples: 1,637,792
Average Triples per URL: 22.76
Domains with Triples: 11,469
Average Triples per Domain: 3,250.57
Typed Entities: 6,048,539

Extractor html-mf-species


Triples Extracted: 382,344
URLs with Triples: 21,370
Average Triples per URL: 17.89
Domains with Triples: 88
Average Triples per Domain: 4,344.82
Typed Entities: 138,818

Extractor html-mf-xfn


Triples Extracted: 197,145,292
URLs with Triples: 16,159,224
Average Triples per URL: 12.20
Domains with Triples: 139,426
Average Triples per Domain: 1,413.98
Typed Entities: 32,351,564