Web Data Commons - RDFa, Microdata, Embedded JSON-LD, and Microformats Data Sets - November 2019

This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD, and Microformats data sets that were extracted from the November 2019 release of the Common Crawl.

In summary, we found structured data within 0.93 billion HTML pages out of the 2.45 billion pages contained in the crawl (38.1%). These pages originate from 11.9 million different pay-level-domains out of the 32 million pay-level-domains covered by the crawl (37.2%). Altogether, the extracted data sets consist of 44.2 billion RDF quads.

Instructions on how to download the RDFa, Microdata, Embedded JSON-LD, and Microformats data sets are given on the page "How to get the data".

In addition, we have extracted schema.org class-specific data sets from the Microdata and JSON-LD corpora.

Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.
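
To illustrate what counting by pay-level-domain means, here is a minimal sketch. It assumes that a pay-level-domain is a public suffix plus one label to its left. The suffix set `SAMPLE_SUFFIXES` and the function `pay_level_domain` are hypothetical names introduced for this example and are not part of the Web Data Commons tooling; a real analysis would rely on the complete Public Suffix List.

```python
# Hypothetical, deliberately tiny sample of public suffixes (assumption);
# a real analysis would load the complete Public Suffix List instead.
SAMPLE_SUFFIXES = {"com", "org", "jp", "com.tw", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return the public suffix plus one label to its left."""
    labels = host.lower().rstrip(".").split(".")
    # Check the longest candidate suffix first so "com.tw" wins over "tw".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SAMPLE_SUFFIXES:
            return ".".join(labels[i - 1:])
    # No known suffix: fall back to the full host name.
    return ".".join(labels)


hosts = [
    "en.wikipedia.org",
    "de.wikipedia.org",
    "www.kidsroom.com.tw",
    "foo.blogspot.com",
]
# Subdomains collapse into one entry per pay-level-domain.
distinct = {pay_level_domain(h) for h in hosts}
print(sorted(distinct))
```

Under this assumption, the two Wikipedia language editions are counted as a single domain.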

Overall


Crawl Date November 2019
Total Data 53.9 Terabyte (compressed)
Parsed HTML URLs 2,454,900,132
URLs with Triples 934,814,452
Domains in Crawl 32,040,026
Domains with Triples 11,917,576
Typed Entities 14,450,406,289
Triples 44,245,690,165
Size of Extracted Data 1.01 Terabyte (compressed)
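
The coverage shares can be recomputed from the counts listed above. The following is a minimal sketch; the variable names and the helper `share` are chosen for this example only.

```python
# Counts copied from the Overall statistics above.
parsed_urls = 2_454_900_132
urls_with_triples = 934_814_452
domains_in_crawl = 32_040_026
domains_with_triples = 11_917_576


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with triples:    {url_share:.1f}%")
print(f"Domains with triples: {domain_share:.1f}%")
```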

Results per Format


Format Domains URLs Typed Entities Triples
html-microdata 6,359,911 517,256,498 4,229,005,443 22,290,981,916
html-embedded-jsonld 5,100,519 284,269,873 1,785,653,581 8,413,179,046
html-mf-hcard 4,064,555 249,200,906 7,846,931,312 11,839,793,514
html-rdfa 1,039,623 122,273,033 409,843,447 1,134,186,061
html-mf-xfn 403,901 22,512,393 52,357,474 318,840,671
html-mf-adr 182,265 11,071,072 24,050,529 83,568,496
html-mf-geo 66,106 4,318,085 9,510,347 26,565,723
html-mf-hcalendar 41,301 2,383,464 13,638,573 58,647,691
html-mf-hreview 31,843 1,649,989 5,868,293 35,191,798
html-mf-hlisting 10,106 456,970 11,031,428 36,205,607
html-mf-hrecipe 4,967 385,120 1,853,472 7,014,399
html-mf2-h-adr 7,566 168,188 239,828 900,583
html-mf-hresume 135 3,605 6,101 22,981
html-mf-species 238 81,953 237,739 591,679
OVERALL 11,917,576 934,814,452 14,311,261,855 44,245,690,165
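
As a consistency check, the per-format rows can be parsed and summed. This is a minimal sketch over the figures in the table above; `parse_rows` is a hypothetical helper written for this example.

```python
# Per-format rows copied from the table above:
# format, domains, URLs, typed entities, triples.
TABLE = """\
html-microdata 6,359,911 517,256,498 4,229,005,443 22,290,981,916
html-embedded-jsonld 5,100,519 284,269,873 1,785,653,581 8,413,179,046
html-mf-hcard 4,064,555 249,200,906 7,846,931,312 11,839,793,514
html-rdfa 1,039,623 122,273,033 409,843,447 1,134,186,061
html-mf-xfn 403,901 22,512,393 52,357,474 318,840,671
html-mf-adr 182,265 11,071,072 24,050,529 83,568,496
html-mf-geo 66,106 4,318,085 9,510,347 26,565,723
html-mf-hcalendar 41,301 2,383,464 13,638,573 58,647,691
html-mf-hreview 31,843 1,649,989 5,868,293 35,191,798
html-mf-hlisting 10,106 456,970 11,031,428 36,205,607
html-mf-hrecipe 4,967 385,120 1,853,472 7,014,399
html-mf2-h-adr 7,566 168,188 239,828 900,583
html-mf-hresume 135 3,605 6,101 22,981
html-mf-species 238 81,953 237,739 591,679
"""
OVERALL_URLS = 934_814_452
OVERALL_TRIPLES = 44_245_690_165


def parse_rows(text: str) -> dict:
    """Map each format name to (domains, urls, entities, triples)."""
    rows = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        rows[name] = tuple(int(v.replace(",", "")) for v in values)
    return rows


rows = parse_rows(TABLE)
url_sum = sum(r[1] for r in rows.values())
triple_sum = sum(r[3] for r in rows.values())

print("triples add up to OVERALL:", triple_sum == OVERALL_TRIPLES)
print("URL sum exceeds OVERALL:  ", url_sum > OVERALL_URLS)
```

With these figures, the triple counts add up exactly to the OVERALL value, while the URL counts sum to more than the OVERALL value, presumably because a single page can carry more than one format.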



Top Domains by Extracted Triples


  1. wordpress.com (961,381,004 triples)
  2. kidsroom.com.tw (451,621,141 triples)
  3. blogspot.com (411,891,887 triples)
  4. postimees.ee (157,046,344 triples)
  5. google.com (140,410,209 triples)
  6. directorycircle.com (82,705,509 triples)
  7. aliexpress.com (77,831,120 triples)
  8. tut.by (66,561,718 triples)
  9. apple.com (66,155,741 triples)
  10. alibaba.com (65,369,806 triples)
  11. startlap.hu (64,517,607 triples)
  12. tjfs-journal.org (61,271,964 triples)
  13. remax.com (59,385,644 triples)
  14. notamedia.es (57,068,512 triples)
  15. momstart.com (56,202,286 triples)
  16. coocan.jp (53,455,069 triples)
  17. wikipedia.org (43,215,386 triples)
  18. stockholm.se (42,266,166 triples)
  19. smittenkitchen.com (42,115,528 triples)
  20. elpais.com (41,217,023 triples)

Top Domains by URLs with Triples


  1. blogspot.com (19,409,126 urls)
  2. wordpress.com (18,255,371 urls)
  3. wikipedia.org (2,473,953 urls)
  4. livejournal.com (1,813,536 urls)
  5. tistory.com (919,648 urls)
  6. hotels.com (781,430 urls)
  7. canalblog.com (772,270 urls)
  8. ning.com (764,828 urls)
  9. apple.com (749,529 urls)
  10. airbnb.com (651,620 urls)
  11. google.com (638,695 urls)
  12. hatenablog.com (614,355 urls)
  13. thefreedictionary.com (576,302 urls)
  14. pixnet.net (546,225 urls)
  15. tradingview.com (520,116 urls)
  16. exblog.jp (503,984 urls)
  17. sched.com (495,299 urls)
  18. nih.gov (486,558 urls)
  19. typepad.com (485,365 urls)
  20. blog.jp (478,820 urls)

Extractor html-microdata


Triples Extracted 22,290,981,916
URLs with Triples 517,256,498
Average Triples per URL 43.09
Domains with Triples 6,359,911
Average Triples per Domain 3,504.92
Typed Entities 4,229,005,443
Detailed Statistics as Excel-File html-microdata.xlsx (452kb)
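
The two averages in this section are plain ratios of the counts listed with them. The sketch below recomputes them with integer arithmetic; the helper `ratio_2dp` is hypothetical, and cutting the result to two decimals instead of rounding it is an assumption made for this example.

```python
# Counts copied from the html-microdata section above.
triples = 22_290_981_916
urls_with_triples = 517_256_498
domains_with_triples = 6_359_911


def ratio_2dp(numerator: int, denominator: int) -> str:
    """Divide two integers and cut the result to two decimals.

    Truncating (instead of rounding) is an assumption of this sketch.
    """
    hundredths = numerator * 100 // denominator
    return f"{hundredths // 100}.{hundredths % 100:02d}"


per_url = ratio_2dp(triples, urls_with_triples)
per_domain = ratio_2dp(triples, domains_with_triples)

print("Average Triples per URL:   ", per_url)
print("Average Triples per Domain:", per_domain)
```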

Extractor html-embedded-jsonld


Triples Extracted 8,413,179,046
URLs with Triples 284,269,873
Average Triples per URL 29.59
Domains with Triples 5,100,519
Average Triples per Domain 1,649.47
Typed Entities 1,785,653,581
Detailed Statistics as Excel-File html-embedded-jsonld.xlsx (73kb)

Extractor html-mf-hcard


Triples Extracted 11,839,793,514
URLs with Triples 249,200,906
Average Triples per URL 47.51
Domains with Triples 4,064,555
Average Triples per Domain 2,926.88
Typed Entities 7,846,931,312

Extractor html-rdfa


Triples Extracted 1,134,186,061
URLs with Triples 122,273,033
Average Triples per URL 9.27
Domains with Triples 1,039,623
Average Triples per Domain 1,090.95
Typed Entities 409,843,447
Detailed Statistics as Excel-File html-rdfa.xlsx (56kb)

Extractor html-mf-xfn


Triples Extracted 318,840,671
URLs with Triples 22,512,393
Average Triples per URL 14.16
Domains with Triples 403,901
Average Triples per Domain 789.40
Typed Entities 52,357,474

Extractor html-mf-adr


Triples Extracted 83,568,496
URLs with Triples 11,071,072
Average Triples per URL 7.54
Domains with Triples 182,265
Average Triples per Domain 458.49
Typed Entities 24,050,529

Extractor html-mf-geo


Triples Extracted 26,565,723
URLs with Triples 4,318,085
Average Triples per URL 6.15
Domains with Triples 66,106
Average Triples per Domain 401.86
Typed Entities 9,510,347

Extractor html-mf-hcalendar


Triples Extracted 58,647,691
URLs with Triples 2,383,464
Average Triples per URL 24.60
Domains with Triples 41,301
Average Triples per Domain 1,420.00
Typed Entities 13,638,573

Extractor html-mf-hreview


Triples Extracted 35,191,798
URLs with Triples 1,649,989
Average Triples per URL 21.32
Domains with Triples 31,843
Average Triples per Domain 1,105.16
Typed Entities 5,868,293

Extractor html-mf-hlisting


Triples Extracted 36,205,607
URLs with Triples 456,970
Average Triples per URL 79.22
Domains with Triples 10,106
Average Triples per Domain 3,582.58
Typed Entities 11,031,428

Extractor html-mf-hrecipe


Triples Extracted 7,014,399
URLs with Triples 385,120
Average Triples per URL 18.21
Domains with Triples 4,967
Average Triples per Domain 1,412.20
Typed Entities 1,853,472

Extractor html-mf-hresume


Triples Extracted 22,981
URLs with Triples 3,605
Average Triples per URL 6.37
Domains with Triples 135
Average Triples per Domain 170.22
Typed Entities 6,101

Extractor html-mf-species


Triples Extracted 591,679
URLs with Triples 81,953
Average Triples per URL 7.21
Domains with Triples 238
Average Triples per Domain 2,486.04
Typed Entities 237,739