This document provides statistics about the Web Data Commons RDFa, Microdata, Embedded JSON-LD and Microformats data sets, which have been extracted from the November 2015 release of the Common Crawl.
In summary, we found structured data within 541 million HTML pages out of the 1.77 billion pages contained in the crawl (30%). These pages originate from 2.72 million different pay-level-domains out of the 14.41 million pay-level-domains covered by the crawl (19%). Altogether, the extracted data sets consist of 24.38 billion RDF quads.
Instructions on how to download the RDFa, Microdata, Embedded JSON-LD and Microformats data sets are given on the "How to get the data" page.
Please note that in the following, the term Domains refers to pay-level-domains. Subdomains are not counted as separate domains.
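To illustrate what counting by pay-level-domain means, here is a minimal sketch that collapses subdomains onto the registrable domain. It is not the extraction framework's actual code: the function name `pay_level_domain` and the tiny `SAMPLE_SUFFIXES` set are hypothetical, and a real implementation would consult the complete Public Suffix List instead of a hand-picked sample.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny sample of public suffixes (assumption).
# A real implementation would load the complete Public Suffix List.
SAMPLE_SUFFIXES = {"com", "org", "de", "uk", "co.uk"}


def pay_level_domain(url: str) -> str:
    """Return the public suffix plus one label to its left, e.g. example.co.uk."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in SAMPLE_SUFFIXES:
            if i == 0:
                return host  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return host  # unknown suffix: fall back to the full host name


urls = [
    "http://www.example.com/page",
    "http://blog.example.com/post",
    "http://shop.example.co.uk/item",
]
# Both example.com URLs collapse onto a single pay-level-domain.
distinct_domains = {pay_level_domain(u) for u in urls}
print(sorted(distinct_domains))
```

Under this sketch, `www.example.com` and `blog.example.com` count as one domain, which matches the note that subdomains are not counted separately.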
Crawl Date | November 2015 | |
Total Data | 151 Terabytes | (compressed) |
Parsed HTML URLs | 1,770,525,212 | |
URLs with Triples | 541,514,775 | |
Domains in Crawl | 14,409,425 | |
Domains with Triples | 2,724,591 | |
Typed Entities | 6,107,584,968 | |
Triples | 24,377,132,352 |
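The coverage percentages quoted in the summary can be recomputed from the counts in the table above. The following is a small sketch; the variable names and the helper `share` are illustrative only.

```python
# Counts copied from the summary table above.
parsed_urls = 1_770_525_212
urls_with_triples = 541_514_775
domains_in_crawl = 14_409_425
domains_with_triples = 2_724_591


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


url_share = share(urls_with_triples, parsed_urls)
domain_share = share(domains_with_triples, domains_in_crawl)

print(f"URLs with triples:    {url_share:.1f}%")
print(f"Domains with triples: {domain_share:.1f}%")
```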
Format | Domains | URLs | Typed Entities | Triples |
html-microdata | 1,100,783 | 312,229,919 | 2,463,165,945 | 13,224,134,881 |
html-rdfa | 521,806 | 196,336,975 | 460,658,558 | 1,598,114,462 |
html-embedded-jsonld | 596,229 | 35,486,192 | 72,180,127 | 382,896,204 |
html-mf-adr | 140,063 | 13,937,305 | 45,073,622 | 146,968,811 |
html-mf-geo | 16,263 | 3,802,038 | 6,995,794 | 17,847,779 |
html-mf-hcalendar | 25,721 | 3,007,502 | 24,846,049 | 119,168,060 |
html-mf-hcard | 1,189,863 | 104,318,715 | 1,471,748,370 | 4,238,311,400 |
html-mf-hcard incl. html-mf-adr | 1,198,702 | 105,712,446 | 1,516,821,992 | 4,385,280,211 |
html-mf-hlisting | 2,484 | 135,780 | 4,043,961 | 16,126,218 |
html-mf-hrecipe | 2,862 | 389,108 | 3,504,888 | 13,456,521 |
html-mf-hresume | 115 | 2,987 | 6,741 | 19,338 |
html-mf-hreview | 11,469 | 1,637,792 | 6,048,539 | 37,280,831 |
html-mf-species | 88 | 21,370 | 138,818 | 382,344 |
html-mf-xfn | 139,426 | 16,159,224 | 32,351,564 | 197,145,292 |
overall | 2,724,591 | 541,514,775 | 6,107,584,968 | 24,377,132,352 |
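As a sketch of how the per-format table can be processed, the snippet below parses a few of its rows and computes each format's share of all URLs that contain triples. Only four rows are copied in to keep the example short; the names `TABLE_ROWS`, `parse_row` and `url_share` are illustrative.

```python
# Four rows copied verbatim from the per-format table above.
TABLE_ROWS = """\
html-microdata | 1,100,783 | 312,229,919 | 2,463,165,945 | 13,224,134,881 |
html-rdfa | 521,806 | 196,336,975 | 460,658,558 | 1,598,114,462 |
html-embedded-jsonld | 596,229 | 35,486,192 | 72,180,127 | 382,896,204 |
html-mf-hcard | 1,189,863 | 104,318,715 | 1,471,748,370 | 4,238,311,400 |
"""

# URL count from the "overall" row of the same table.
OVERALL_URLS = 541_514_775


def parse_row(line: str) -> tuple[str, dict[str, int]]:
    """Split one pipe-delimited row into a format name and integer counts."""
    cells = [cell.strip() for cell in line.split("|")]
    keys = ("domains", "urls", "typed_entities", "triples")
    values = (int(cell.replace(",", "")) for cell in cells[1:5])
    return cells[0], dict(zip(keys, values))


stats = dict(parse_row(line) for line in TABLE_ROWS.splitlines() if line.strip())

# Share of all URLs with triples that use each format, in percent.
url_share = {name: 100.0 * row["urls"] / OVERALL_URLS for name, row in stats.items()}

for name, pct in sorted(url_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {pct:5.1f}% of URLs with triples")
```

The per-format URL counts add up to more than the overall figure, which is what one would expect if a page that embeds several formats is counted once for each of them.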
Format | html-rdfa |
Triples Extracted | 1,598,114,462 |
URLs with Triples | 196,336,975 |
Average Triples per URL | 8.14 |
Domains with Triples | 521,806 |
Average Triples per Domain | 3,062.66 |
Typed Entities | 460,658,558 |
Detailed Statistics as Excel-File | html-rdfa.xlsx (127kb) |
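The two averages in the html-rdfa figures above appear to be plain ratios of the triple count to the URL count and to the domain count. A quick check, with illustrative variable names:

```python
# html-rdfa counts copied from the figures above.
triples = 1_598_114_462
urls_with_triples = 196_336_975
domains_with_triples = 521_806

avg_per_url = triples / urls_with_triples
avg_per_domain = triples / domains_with_triples

print(f"Average triples per URL:    {avg_per_url:,.2f}")
print(f"Average triples per domain: {avg_per_domain:,.2f}")
```

The same two divisions can be applied to the figures of any other format listed on this page.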
Format | html-microdata |
Triples Extracted | 13,224,134,881 |
URLs with Triples | 312,229,919 |
Average Triples per URL | 42.35 |
Domains with Triples | 1,100,783 |
Average Triples per Domain | 12,013.39 |
Typed Entities | 2,463,165,945 |
Detailed Statistics as Excel-File | html-microdata.xlsx (221kb) |
Format | html-embedded-jsonld |
Triples Extracted | 382,896,204 |
URLs with Triples | 35,486,192 |
Average Triples per URL | 10.79 |
Domains with Triples | 596,229 |
Average Triples per Domain | 642.20 |
Typed Entities | 72,180,127 |
Detailed Statistics as Excel-File | html-embedded-jsonld.xlsx (221kb) |
Format | html-mf-geo |
Triples Extracted | 17,847,779 |
URLs with Triples | 3,802,038 |
Average Triples per URL | 4.69 |
Domains with Triples | 16,263 |
Average Triples per Domain | 1,097.45 |
Typed Entities | 6,995,794 |
Format | html-mf-hcalendar |
Triples Extracted | 119,168,060 |
URLs with Triples | 3,007,502 |
Average Triples per URL | 39.62 |
Domains with Triples | 25,721 |
Average Triples per Domain | 4,633.10 |
Typed Entities | 24,846,049 |
Format | html-mf-hcard |
Triples Extracted | 4,238,311,400 |
URLs with Triples | 104,318,715 |
Average Triples per URL | 40.63 |
Domains with Triples | 1,189,863 |
Average Triples per Domain | 3,562.02 |
Typed Entities | 1,471,748,370 |
Format | html-mf-hlisting |
Triples Extracted | 16,126,218 |
URLs with Triples | 135,780 |
Average Triples per URL | 118.77 |
Domains with Triples | 2,484 |
Average Triples per Domain | 6,492.04 |
Typed Entities | 4,043,961 |
Format | html-mf-hrecipe |
Triples Extracted | 13,456,521 |
URLs with Triples | 389,108 |
Average Triples per URL | 34.58 |
Domains with Triples | 2,862 |
Average Triples per Domain | 4,701.79 |
Typed Entities | 3,504,888 |
Format | html-mf-hresume |
Triples Extracted | 19,338 |
URLs with Triples | 2,987 |
Average Triples per URL | 6.47 |
Domains with Triples | 115 |
Average Triples per Domain | 168.16 |
Typed Entities | 6,741 |
Format | html-mf-hreview |
Triples Extracted | 37,280,831 |
URLs with Triples | 1,637,792 |
Average Triples per URL | 22.76 |
Domains with Triples | 11,469 |
Average Triples per Domain | 3,250.57 |
Typed Entities | 6,048,539 |
Format | html-mf-species |
Triples Extracted | 382,344 |
URLs with Triples | 21,370 |
Average Triples per URL | 17.89 |
Domains with Triples | 88 |
Average Triples per Domain | 4,344.82 |
Typed Entities | 138,818 |
Format | html-mf-xfn |
Triples Extracted | 197,145,292 |
URLs with Triples | 16,159,224 |
Average Triples per URL | 12.20 |
Domains with Triples | 139,426 |
Average Triples per Domain | 1,413.98 |
Typed Entities | 32,351,564 |