Web Data Commons Extraction Report - 2009/2010 Corpus

This file contains the extraction report for the Web Data Commons project. Both extraction and data statistics are given overall as well as for each structured data format.

In the following statistics, we count every subdomain as a separate domain. In contrast, the statistics for the August 2012 release count only pay-level domains as separate domains. The domain counts are therefore not directly comparable between the two releases.
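The difference between the two counting schemes can be illustrated with a minimal sketch. The suffix set below is a deliberately tiny, hypothetical stand-in for the full Public Suffix List, the URL paths are made up, and the helper names `host_of` and `pay_level_domain` are illustrative only; this is not the project's actual implementation.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny suffix set. A real run would need
# the complete Public Suffix List instead.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def pay_level_domain(host: str) -> str:
    """Return the public suffix plus one label, e.g. 'yahoo.com'."""
    labels = host.split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host  # no known suffix: fall back to the full host


# Hosts taken from the report; the paths are invented for illustration.
urls = [
    "http://local.yahoo.com/a",
    "http://upcoming.yahoo.com/b",
    "http://www.flickr.com/c",
]

hosts = {host_of(u) for u in urls}
plds = {pay_level_domain(h) for h in hosts}

print(len(hosts))  # subdomain-level count, as used in this report: 3
print(len(plds))   # pay-level-domain count, as used for August 2012: 2
```

Under the first scheme local.yahoo.com and upcoming.yahoo.com count twice; under the second they collapse into yahoo.com, which is why the two releases' domain counts diverge.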

Overall


Total Data: 28.9 Terabyte (compressed)
Total URLs: 2,804,054,789
Parsed HTML URLs: 2,565,741,671
Domains with Triples: 19,113,929
URLs with Triples: 147,871,837
Typed Entities: 1,546,905,880
Triples: 5,193,276,058

Results per Format


Extractor          Domains with Triples  URLs with Triples  Typed Entities        Triples
html-rdfa                       537,820         14,314,036      26,583,558    293,542,991
html-microdata                    3,930             56,964         346,411      1,197,115
html-mf-geo                     244,838          5,051,622       7,326,516     28,831,795
html-mf-hcalendar               226,279          2,747,276      21,289,402     65,727,393
html-mf-hcard                12,502,500         83,583,167     973,170,050  3,226,066,019
html-mf-hlisting                 31,871          1,227,574      25,660,498     88,146,122
html-mf-hresume                  10,419            387,364       1,501,009     12,640,527
html-mf-hreview                 216,331          2,836,701       8,234,850     84,411,951
html-mf-species                   3,244             25,158         152,621        391,911
html-mf-hrecipe                  13,362            115,345         695,838      1,228,925
html-mf-xfn                   5,323,335         37,526,630     481,945,127  1,391,091,386
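The per-extractor sections further down report "Average Triples per URL" and "Average Triples per Domain". As a small sketch, these averages can be reproduced from the table above by dividing the triple count by the URL and domain counts. The `rows` dictionary and the `averages` helper are illustrative names, and only two rows are included to keep the example short.

```python
# (domains with triples, URLs with triples, triples), copied from the table
rows = {
    "html-rdfa": (537_820, 14_314_036, 293_542_991),
    "html-microdata": (3_930, 56_964, 1_197_115),
}


def averages(domains: int, urls: int, triples: int) -> tuple[float, float]:
    """Return (average triples per URL, average triples per domain)."""
    return round(triples / urls, 2), round(triples / domains, 2)


for name, row in rows.items():
    per_url, per_domain = averages(*row)
    print(f"{name}: {per_url} triples/URL, {per_domain} triples/domain")
```

For html-rdfa this yields 20.51 triples per URL and 545.8 per domain, matching the values listed in that extractor's section.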

Top Domains by URLs with Triples


  1. www.myspace.com (1,586,653 URLs)
  2. twitter.com (834,656 URLs)
  3. local.yahoo.com (798,436 URLs)
  4. www.youtube.com (681,046 URLs)
  5. www.flickr.com (679,915 URLs)
  6. www.slideshare.net (572,806 URLs)
  7. bitmunk.com (518,584 URLs)
  8. www.naymz.com (456,555 URLs)
  9. eventful.com (449,609 URLs)
  10. upcoming.yahoo.com (447,862 URLs)
  11. dbpedia.org (409,250 URLs)
  12. www.threadless.com (392,825 URLs)
  13. maps.google.com (386,103 URLs)
  14. www.backcountry.com (376,250 URLs)
  15. es.netlog.com (376,199 URLs)
  16. digg.com (373,790 URLs)
  17. www.linkedin.com (345,838 URLs)
  18. www.bookfinder.com (345,203 URLs)
  19. www.facebook.com (335,008 URLs)
  20. www.ncbi.nlm.nih.gov (323,250 URLs)

In the following statistics, the term "Property Values" refers to the total number of property values that describe all typed entities. The term "URL Values" refers to the subset of the property values that have a URL as object. The term "Remote URL Values" refers to the subset of the URL values that point to a different website (meaning that the namespace of the URL differs from the namespace of the described entity). The term "Literal Values" refers to the subset of the property values that are literals rather than URLs.
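The definitions above can be sketched as a small classifier. This is only an illustration under two assumptions that the report does not spell out: "namespace" is approximated by the host name of the URL, and only http/https values count as URLs. The names `is_url` and `classify` and the example.org/example.com data are hypothetical. The "Other Values" category used in the tables is not defined in the text, so it is not modelled here.

```python
from collections import Counter
from urllib.parse import urlsplit


def is_url(value: str) -> bool:
    """Treat a value as a URL if it parses as http(s) with a host."""
    try:
        parts = urlsplit(value)
    except ValueError:  # malformed input, e.g. unbalanced brackets
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def classify(entity_url: str, value: str) -> str:
    """Label a property value as 'literal', 'url', or 'remote_url'.

    'remote_url' is a subset of URL values: the object URL lives on a
    different host than the described entity (assumed reading of
    "namespace").
    """
    if not is_url(value):
        return "literal"
    same_site = urlsplit(value).hostname == urlsplit(entity_url).hostname
    return "url" if same_site else "remote_url"


entity = "http://example.org/people/1"
values = ["Jane Doe", "http://example.org/people/2", "http://example.com/x"]

counts = Counter(classify(entity, v) for v in values)
url_values = counts["url"] + counts["remote_url"]  # remote URLs are URL values too
print(counts, url_values)
```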

Extractor html-rdfa


Triples Extracted: 293,542,991
URLs with Triples: 14,314,036
Average Triples per URL: 20.51
Domains with Triples: 537,820
Average Triples per Domain: 545.8
Typed Entities: 26,583,558
Property Values: 118,460,273
URL Values: 32,309,783
Remote URL Values: 8,196,812
Literal Values: 68,903,623
Other Values: 17,246,867
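In the html-rdfa figures above, the URL, literal, and other values add up exactly to the property values. The sketch below checks that sum and derives the share of URL values that are remote. Treating "Other Values" as the remainder is a reading of the numbers, not a definition given in the report, and the dictionary keys are illustrative names.

```python
# Figures copied from the html-rdfa section above.
rdfa = {
    "property": 118_460_273,
    "url": 32_309_783,
    "remote_url": 8_196_812,  # subset of "url"
    "literal": 68_903_623,
    "other": 17_246_867,
}

# URL, literal, and other values partition the property values.
parts_total = rdfa["url"] + rdfa["literal"] + rdfa["other"]
consistent = parts_total == rdfa["property"]

# Share of URL values that point to a different website.
remote_share = rdfa["remote_url"] / rdfa["url"]

print(consistent)
print(f"{remote_share:.1%}")
```

The same breakdown can be applied to the other extractor sections by swapping in their figures.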

Extractor html-microdata


Triples Extracted: 1,197,115
URLs with Triples: 56,964
Average Triples per URL: 21.02
Domains with Triples: 3,930
Average Triples per Domain: 304.61
Typed Entities: 346,411
Property Values: 911,741
URL Values: 91,282
Remote URL Values: 9,390
Literal Values: 558,852
Other Values: 261,607

Extractor html-mf-geo


Triples Extracted: 28,831,795
URLs with Triples: 5,051,622
Average Triples per URL: 5.71
Domains with Triples: 244,838
Average Triples per Domain: 117.76
Typed Entities: 7,326,516
Property Values: 20,572,529
URL Values: 229
Remote URL Values: 28
Literal Values: 13,245,784
Other Values: 7,326,516

Extractor html-mf-hcalendar


Triples Extracted: 65,727,393
URLs with Triples: 2,747,276
Average Triples per URL: 23.92
Domains with Triples: 226,279
Average Triples per Domain: 290.47
Typed Entities: 21,289,402
Property Values: 47,785,003
URL Values: 4,392,784
Remote URL Values: 533,554
Literal Values: 31,993,374
Other Values: 11,398,845

Extractor html-mf-hcard


Triples Extracted: 3,226,066,019
URLs with Triples: 83,583,167
Average Triples per URL: 38.6
Domains with Triples: 12,502,500
Average Triples per Domain: 258.03
Typed Entities: 973,170,050
Property Values: 2,235,174,193
URL Values: 247,632,634
Remote URL Values: 156,770,254
Literal Values: 1,350,136,682
Other Values: 637,404,877

Extractor html-mf-hlisting


Triples Extracted: 88,146,122
URLs with Triples: 1,227,574
Average Triples per URL: 71.81
Domains with Triples: 31,871
Average Triples per Domain: 2,765.72
Typed Entities: 25,660,498
Property Values: 43,758,895
URL Values: 7,713,675
Remote URL Values: 3,930,083
Literal Values: 19,288,625
Other Values: 16,756,595

Extractor html-mf-hresume


Triples Extracted: 12,640,527
URLs with Triples: 387,364
Average Triples per URL: 32.63
Domains with Triples: 10,419
Average Triples per Domain: 1,213.22
Typed Entities: 1,501,009
Property Values: 1,425,436
URL Values: 1,065
Remote URL Values: 992
Literal Values: 1,145,713
Other Values: 278,658

Extractor html-mf-hreview


Triples Extracted: 84,411,951
URLs with Triples: 2,836,701
Average Triples per URL: 29.76
Domains with Triples: 216,331
Average Triples per Domain: 390.2
Typed Entities: 8,234,850
Property Values: 30,433,082
URL Values: 600,245
Remote URL Values: 207,851
Literal Values: 21,598,626
Other Values: 8,234,211

Extractor html-mf-species


Triples Extracted: 391,911
URLs with Triples: 25,158
Average Triples per URL: 15.58
Domains with Triples: 3,244
Average Triples per Domain: 120.81
Typed Entities: 152,621
Property Values: 237,470
URL Values: 0
Remote URL Values: 0
Literal Values: 143,238
Other Values: 94,232

Extractor html-mf-hrecipe


Triples Extracted: 1,228,925
URLs with Triples: 115,345
Average Triples per URL: 10.65
Domains with Triples: 13,362
Average Triples per Domain: 91.97
Typed Entities: 695,838
Property Values: 695,838
URL Values: 0
Remote URL Values: 0
Literal Values: 0
Other Values: 695,838

Extractor html-mf-xfn


Triples Extracted: 1,391,091,386
URLs with Triples: 37,526,630
Average Triples per URL: 37.07
Domains with Triples: 5,323,335
Average Triples per Domain: 261.32
Typed Entities: 481,945,127
Property Values: 530,000,989
URL Values: 264,961,031
Remote URL Values: 155,329,158
Literal Values: 30,827
Other Values: 265,009,131