Web Data Commons - Feb 2012 Corpus - Format Development
The following tables show how the different structured data formats developed between the two web crawls analyzed so far. Since the crawls have different sizes, we report the relative differences between the percentages of the formats over Domains, URLs, Typed Entities and Triples. To calculate the relative difference, we first calculate each format's percentage of the total sum for each crawl and then subtract the 2010 percentage from the 2012 percentage, so each value is a difference in percentage points.
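The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the code used to produce the tables: the function name `share_difference` and the dictionary names are hypothetical, and the counts are copied from the "Domains with Triples" table.

```python
# Domain counts per extractor, copied from the "Domains with Triples" table.
domains_2010 = {
    "html-rdfa": 537_820,
    "html-microdata": 3_930,
    "html-mf-geo": 244_838,
    "html-mf-hcalendar": 226_279,
    "html-mf-hcard": 12_502_500,
    "html-mf-hlisting": 31_871,
    "html-mf-hresume": 10_419,
    "html-mf-hreview": 216_331,
    "html-mf-species": 3_244,
    "html-mf-hrecipe": 13_362,
    "html-mf-xfn": 5_323_335,
}
domains_2012 = {
    "html-rdfa": 16_976_232,
    "html-microdata": 3_952_674,
    "html-mf-geo": 897_080,
    "html-mf-hcalendar": 629_319,
    "html-mf-hcard": 30_417_192,
    "html-mf-hlisting": 69_569,
    "html-mf-hresume": 9_890,
    "html-mf-hreview": 615_681,
    "html-mf-species": 4_109,
    "html-mf-hrecipe": 127_381,
    "html-mf-xfn": 11_709_819,
}


def share_difference(old, new):
    """Hypothetical helper: per format, the share in `new` minus the
    share in `old`, both expressed as a percentage of the crawl total."""
    old_total = sum(old.values())
    new_total = sum(new.values())
    return {
        fmt: 100.0 * new[fmt] / new_total - 100.0 * old[fmt] / old_total
        for fmt in old
    }


diff = share_difference(domains_2010, domains_2012)
for fmt, value in diff.items():
    print(f"{fmt}: {value:+.2f}%")
```

With these inputs, the sketch gives 23.14 for html-rdfa and -18.91 for html-mf-hcard, matching the table. Because every share is taken relative to its own crawl total, the differences across all formats sum to zero.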
Domains with Triples
Extractor | 2010 | 2012 | Δ % |
html-rdfa | 537,820 | 16,976,232 | 23.14% |
html-microdata | 3,930 | 3,952,674 | 6.02% |
html-mf-geo | 244,838 | 897,080 | 0.09% |
html-mf-hcalendar | 226,279 | 629,319 | -0.22% |
html-mf-hcard | 12,502,500 | 30,417,192 | -18.91% |
html-mf-hlisting | 31,871 | 69,569 | -0.06% |
html-mf-hresume | 10,419 | 9,890 | -0.04% |
html-mf-hreview | 216,331 | 615,681 | -0.19% |
html-mf-species | 3,244 | 4,109 | -0.01% |
html-mf-hrecipe | 13,362 | 127,381 | 0.12% |
html-mf-xfn | 5,323,335 | 11,709,819 | -9.95% |
URLs with Triples
Extractor | 2010 | 2012 | Δ % |
html-rdfa | 14,314,036 | 67,901,246 | 26.28% |
html-microdata | 56,964 | 26,929,865 | 14.22% |
html-mf-geo | 5,051,622 | 2,491,933 | -2.10% |
html-mf-hcalendar | 2,747,276 | 1,506,379 | -1.06% |
html-mf-hcard | 83,583,167 | 61,360,686 | -24.03% |
html-mf-hlisting | 1,227,574 | 197,027 | -0.73% |
html-mf-hresume | 387,364 | 20,762 | -0.25% |
html-mf-hreview | 2,836,701 | 1,971,870 | -0.87% |
html-mf-species | 25,158 | 14,033 | -0.01% |
html-mf-hrecipe | 115,345 | 422,289 | 0.15% |
html-mf-xfn | 37,526,630 | 26,004,925 | -11.61% |
Typed Entities
Extractor | 2010 | 2012 | Δ % |
html-rdfa | 26,583,558 | 49,370,729 | 2.32% |
html-microdata | 346,411 | 90,526,013 | 7.38% |
html-mf-geo | 7,326,516 | 4,787,126 | -0.08% |
html-mf-hcalendar | 21,289,402 | 27,165,545 | 0.85% |
html-mf-hcard | 973,170,050 | 865,633,059 | 7.89% |
html-mf-hlisting | 25,660,498 | 8,252,632 | -0.98% |
html-mf-hresume | 1,501,009 | 92,346 | -0.09% |
html-mf-hreview | 8,234,850 | 7,809,088 | 0.11% |
html-mf-species | 152,621 | 139,631 | 0.00% |
html-mf-hrecipe | 695,838 | 5,516,036 | 0.41% |
html-mf-xfn | 481,945,127 | 163,271,544 | -17.80% |
Triples
Extractor | 2010 | 2012 | Δ % |
html-rdfa | 293,542,991 | 456,169,126 | 8.20% |
html-microdata | 1,197,115 | 404,413,915 | 12.25% |
html-mf-geo | 28,831,795 | 11,222,766 | -0.21% |
html-mf-hcalendar | 65,727,393 | 65,547,870 | 0.72% |
html-mf-hcard | 3,226,066,019 | 1,837,847,772 | -6.33% |
html-mf-hlisting | 88,146,122 | 20,703,189 | -1.07% |
html-mf-hresume | 12,640,527 | 432,363 | -0.23% |
html-mf-hreview | 84,411,951 | 50,475,411 | -0.09% |
html-mf-species | 391,911 | 224,847 | -0.00% |
html-mf-hrecipe | 1,228,925 | 5,513,030 | 0.14% |
html-mf-xfn | 1,391,091,386 | 441,698,363 | -13.38% |
Absolute Development
In this section, we directly compare the number of Domains, URLs, Typed Entities and Triples for the different formats. However, since the two crawls have different sizes and contain different URLs, the following results should be interpreted with caution. The delta value is the newer value divided by the older value, minus 1, so a value of 1.00 means the count doubled and a negative value means it shrank.
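As a small worked example of the delta definition above, the following Python sketch applies it to two rows of the "Domains with Triples" table. The function name `growth_delta` is hypothetical and chosen only for illustration.

```python
def growth_delta(old, new):
    """Hypothetical helper: newer value divided by older value, minus 1."""
    return new / old - 1


# Counts copied from the "Domains with Triples" table (2010, 2012).
rdfa_delta = growth_delta(537_820, 16_976_232)   # html-rdfa
resume_delta = growth_delta(10_419, 9_890)       # html-mf-hresume

print(f"html-rdfa: {rdfa_delta:.2f}")
print(f"html-mf-hresume: {resume_delta:.2f}")
```

Rounded to two decimals, these come out as 30.56 and -0.05, the values listed for the two extractors.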
Domains with Triples
Extractor | 2010 | 2012 | Δ |
html-rdfa | 537,820 | 16,976,232 | 30.56 |
html-microdata | 3,930 | 3,952,674 | 1004.77 |
html-mf-geo | 244,838 | 897,080 | 2.66 |
html-mf-hcalendar | 226,279 | 629,319 | 1.78 |
html-mf-hcard | 12,502,500 | 30,417,192 | 1.43 |
html-mf-hlisting | 31,871 | 69,569 | 1.18 |
html-mf-hresume | 10,419 | 9,890 | -0.05 |
html-mf-hreview | 216,331 | 615,681 | 1.85 |
html-mf-species | 3,244 | 4,109 | 0.27 |
html-mf-hrecipe | 13,362 | 127,381 | 8.53 |
html-mf-xfn | 5,323,335 | 11,709,819 | 1.20 |
URLs with Triples
Extractor | 2010 | 2012 | Δ |
html-rdfa | 14,314,036 | 67,901,246 | 3.74 |
html-microdata | 56,964 | 26,929,865 | 471.75 |
html-mf-geo | 5,051,622 | 2,491,933 | -0.51 |
html-mf-hcalendar | 2,747,276 | 1,506,379 | -0.45 |
html-mf-hcard | 83,583,167 | 61,360,686 | -0.27 |
html-mf-hlisting | 1,227,574 | 197,027 | -0.84 |
html-mf-hresume | 387,364 | 20,762 | -0.95 |
html-mf-hreview | 2,836,701 | 1,971,870 | -0.30 |
html-mf-species | 25,158 | 14,033 | -0.44 |
html-mf-hrecipe | 115,345 | 422,289 | 2.66 |
html-mf-xfn | 37,526,630 | 26,004,925 | -0.31 |
Typed Entities
Extractor | 2010 | 2012 | Δ |
html-rdfa | 26,583,558 | 49,370,729 | 0.86 |
html-microdata | 346,411 | 90,526,013 | 260.33 |
html-mf-geo | 7,326,516 | 4,787,126 | -0.35 |
html-mf-hcalendar | 21,289,402 | 27,165,545 | 0.28 |
html-mf-hcard | 973,170,050 | 865,633,059 | -0.11 |
html-mf-hlisting | 25,660,498 | 8,252,632 | -0.68 |
html-mf-hresume | 1,501,009 | 92,346 | -0.94 |
html-mf-hreview | 8,234,850 | 7,809,088 | -0.05 |
html-mf-species | 152,621 | 139,631 | -0.09 |
html-mf-hrecipe | 695,838 | 5,516,036 | 6.93 |
html-mf-xfn | 481,945,127 | 163,271,544 | -0.66 |
Triples
Extractor | 2010 | 2012 | Δ |
html-rdfa | 293,542,991 | 456,169,126 | 0.55 |
html-microdata | 1,197,115 | 404,413,915 | 336.82 |
html-mf-geo | 28,831,795 | 11,222,766 | -0.61 |
html-mf-hcalendar | 65,727,393 | 65,547,870 | -0.00 |
html-mf-hcard | 3,226,066,019 | 1,837,847,772 | -0.43 |
html-mf-hlisting | 88,146,122 | 20,703,189 | -0.77 |
html-mf-hresume | 12,640,527 | 432,363 | -0.97 |
html-mf-hreview | 84,411,951 | 50,475,411 | -0.40 |
html-mf-species | 391,911 | 224,847 | -0.43 |
html-mf-hrecipe | 1,228,925 | 5,513,030 | 3.49 |
html-mf-xfn | 1,391,091,386 | 441,698,363 | -0.68 |