This document contains instructions on how to download the October 2016 version of the Web Data Commons RDFa, Microdata, Embedded JSONLD, and Microformats data sets.
The extracted RDFa, Microdata, Microformats and Embedded JSONLD data is provided for download as N-Quads. If you want to transform the data into another format, you can use our conversion tool. The files are compressed with GZIP, and each file is around 100 MB in size. In total, 9,661 files with a combined size of 967 GB are provided.
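As an illustration of working with the gzipped N-Quads files, the following minimal Python sketch streams one downloaded file and counts the statements per graph label (the fourth element of each quad). The function names `graph_label` and `count_quads_per_graph` and the file name in the usage note are hypothetical, and the sketch assumes that every statement carries a graph label and fits on one line. It is not a full N-Quads parser.

```python
import gzip
from collections import Counter


def graph_label(line):
    """Return the last term of an N-Quads statement, or None if the line
    is blank, a comment, or does not end with an IRI followed by '.'.

    ASSUMPTION: every statement carries a graph label, so the last term
    before the closing '.' is the graph IRI. A plain triple whose object
    is an IRI would be misread by this simple check.
    """
    text = line.strip()
    if not text or text.startswith("#") or not text.endswith("."):
        return None
    body = text[:-1].rstrip()          # drop the terminating '.'
    parts = body.rsplit(None, 1)       # split off the last whitespace-separated term
    if len(parts) != 2:
        return None
    last = parts[1]
    if last.startswith("<") and last.endswith(">"):
        return last
    return None


def count_quads_per_graph(path):
    """Stream a gzipped N-Quads file and count statements per graph label."""
    counts = Counter()
    # Text mode decompresses on the fly, so the file is never fully in memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            label = graph_label(line)
            if label is not None:
                counts[label] += 1
    return counts
```

For example, `count_quads_per_graph("some-file.nq.gz").most_common(10)` would list the ten graph labels with the most statements in a locally downloaded file; the file name here is a placeholder.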
List of download URLs for RDF from the October 2016 corpus (Example Content)
The extracted RDF data can be downloaded using wget with the command `wget -i http://webdatacommons.org/structureddata/2016-10/files/file.list`. The files containing the quads for a specific format are listed in the table below, together with more detailed statistics about the number of files and their sizes.
Format | Number Of Files | Approx. Total File Size | File List |
---|---|---|---|
html-rdfa | 600 | 60 GB | html-rdfa.list |
html-microdata | 8,007 | 805 GB | html-microdata.list |
html-embedded-jsonld | 295 | 29 GB | html-embedded-jsonld.list |
html-mf-geo | 7 | 691 MB | html-mf-geo.list |
html-mf-hcalendar | 23 | 2 GB | html-mf-hcalendar.list |
html-mf-hcard | 600 | 60 GB | html-mf-hcard.list |
html-mf-adr | 45 | 4 GB | html-mf-adr.list |
html-mf2-h-adr | 1 | 12 MB | html-mf-adr.list |
html-mf-hrecipe | 7 | 698 MB | html-mf-hrecipe.list |
html-mf-hlisting | 7 | 661 MB | html-mf-hlisting.list |
html-mf-hresume | 1 | 0.8 MB | html-mf-hresume.list |
html-mf-hreview | 35 | 3 GB | html-mf-hreview.list |
html-mf-species | 1 | 0.24 MB | html-mf-species.list |
html-mf-xfn | 32 | 3 GB | html-mf-xfn.list |
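For readers who only need one format, the download can also be scripted. The sketch below uses only the Python standard library. The URL of the complete list is the one from the wget command above. The location of the per-format lists is an assumption: the sketch guesses that they sit in the same directory as `file.list`, which should be checked against the links in the table (the file list named for a format does not always match the format name). All function names are hypothetical.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse

# Taken from the wget command in the text: the list of all files.
ALL_FILES_LIST = "http://webdatacommons.org/structureddata/2016-10/files/file.list"


def list_url(list_name=None):
    """Return the URL of a file list.

    ASSUMPTION: per-format lists such as 'html-rdfa.list' are stored next
    to 'file.list'. Verify this against the links in the table.
    """
    if list_name is None:
        return ALL_FILES_LIST
    base = ALL_FILES_LIST.rsplit("/", 1)[0]
    return f"{base}/{list_name}.list"


def parse_file_list(text):
    """Return the non-empty lines of a file list, one download URL per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def download(url, target_dir="."):
    """Stream one file to target_dir and return the local path."""
    name = os.path.basename(urlparse(url).path)
    target = os.path.join(target_dir, name)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


def download_list(list_name=None, target_dir=".", limit=None):
    """Fetch a file list and download its first `limit` files (all if None)."""
    with urllib.request.urlopen(list_url(list_name)) as response:
        urls = parse_file_list(response.read().decode("utf-8"))
    return [download(url, target_dir) for url in urls[:limit]]
```

Calling `download_list("html-mf-geo", limit=1)` would then fetch a single file as a smoke test before committing to a larger download, provided the assumed list location is correct.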
The source code can be checked out from our GitHub repository. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.
The code for the analysis of the quads can be checked out from the StructuredDataProfiler GitHub repository.
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.