This document contains instructions on how to download the October 2022 version of the Web Data Commons RDFa, Microdata, Embedded JSONLD, and Microformats data sets.
The extracted RDFa, Microdata, Microformats, and Embedded JSONLD data is provided for download as N-Quads. Files are compressed with GZIP, and each file is around 100 MB in size. Overall, 15,819 files with a total size of 1.6 TB are provided.
List of download URLs for RDF from the October 2022 corpus (Example Content)
The extracted RDF data can be downloaded using wget with the command `wget -i http://webdatacommons.org/structureddata/2022-12/files/file.list`. The files containing quads for a specific format are listed in the table below, together with more detailed statistics about the number of files and their sizes.
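For users who prefer to script the download rather than call wget, the following is a minimal Python sketch of the same step. It assumes that the file list contains one download URL per line (which is what `wget -i` expects). The function names, the output directory `wdc-2022-12`, and the `limit` parameter are illustrative choices, not part of the Web Data Commons tooling.

```python
import urllib.request
from pathlib import Path

# URL of the complete file list, taken from the wget command above.
FILE_LIST_URL = "http://webdatacommons.org/structureddata/2022-12/files/file.list"


def parse_file_list(text):
    """Return the non-empty lines of a file list (assumed: one URL per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def target_path(url, out_dir):
    """Local destination for a URL: its last path segment inside out_dir."""
    return Path(out_dir) / url.rsplit("/", 1)[-1]


def download_all(list_url=FILE_LIST_URL, out_dir="wdc-2022-12", limit=None):
    """Download the listed files, skipping those that are already complete."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(list_url) as resp:
        urls = parse_file_list(resp.read().decode("utf-8"))
    for url in urls[:limit]:  # limit=None means "all files"
        dest = target_path(url, out)
        if dest.exists():
            continue
        # Write to a temporary name first so that an interrupted download
        # is not mistaken for a complete file on the next run.
        part = dest.with_name(dest.name + ".part")
        urllib.request.urlretrieve(url, str(part))
        part.replace(dest)
```

Calling `download_all(limit=2)` would fetch only the first two files, which is a reasonable smoke test before committing to the full 1.6 TB.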
If you want to download the class-specific schema.org datasets please navigate to the WDC class-specific subsets page.
Format | Number Of Files | Approx. Total File Size | File List |
---|---|---|---|
html-rdfa | 221 | 22 GB | html-rdfa.list |
html-microdata | 7,180 | 725 GB | html-microdata.list |
html-embedded-jsonld | 7,030 | 707 GB | html-embedded-jsonld.list |
html-mf-geo | 3 | 290 MB | html-mf-geo.list |
html-mf-hcalendar | 8 | 790 MB | html-mf-hcalendar.list |
html-mf-hcard | 1,316 | 130 GB | html-mf-hcard.list |
html-mf-adr | 14 | 1.3 GB | html-mf-adr.list |
html-mf-hrecipe | 2 | 165 MB | html-mf-hrecipe.list |
html-mf-hlisting | 3 | 271 MB | html-mf-hlisting.list |
html-mf-hresume | 1 | 944 KB | html-mf-hresume.list |
html-mf-hreview | 8 | 762 MB | html-mf-hreview.list |
html-mf-species | 1 | 7.9 MB | html-mf-species.list |
html-mf-xfn | 32 | 3.1 GB | html-mf-xfn.list |
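
As a quick plausibility check before downloading, the per-format figures in the table can be summed and compared with the overall totals stated above. The sketch below copies the values from the table and assumes decimal units (1 GB = 1,000 MB = 1,000,000 KB). The sizes are approximate, so the total size is only an estimate.

```python
# (format, number of files, approximate size in GB), copied from the table.
# Sizes given in MB or KB are converted assuming decimal units.
TABLE = [
    ("html-rdfa", 221, 22.0),
    ("html-microdata", 7180, 725.0),
    ("html-embedded-jsonld", 7030, 707.0),
    ("html-mf-geo", 3, 0.290),
    ("html-mf-hcalendar", 8, 0.790),
    ("html-mf-hcard", 1316, 130.0),
    ("html-mf-adr", 14, 1.3),
    ("html-mf-hrecipe", 2, 0.165),
    ("html-mf-hlisting", 3, 0.271),
    ("html-mf-hresume", 1, 0.000944),
    ("html-mf-hreview", 8, 0.762),
    ("html-mf-species", 1, 0.0079),
    ("html-mf-xfn", 32, 3.1),
]

total_files = sum(files for _, files, _ in TABLE)
total_gb = sum(size_gb for _, _, size_gb in TABLE)

print(f"files: {total_files}")
print(f"approx. size: {total_gb / 1000:.2f} TB")
```

The file counts add up to the 15,819 files stated above, and the summed sizes come to roughly 1.6 TB.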
We provide the extracted data for download using a variation of the N-Quads format. For users who prefer other formats, we provide code for converting the download files into CSV and JSON, which are supported by a wide range of spreadsheet applications, relational databases, and data mining frameworks such as the Python data analysis library pandas. Further details on how to convert the download files to other formats can be found on the main page.
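To illustrate what such a conversion involves, the following is a minimal Python sketch that reads one GZIP-compressed file and writes its quads to CSV. It is not the conversion code referred to above. It assumes that each line has the shape `subject predicate object graph .`, uses a simple regular expression instead of a full N-Quads parser, and silently skips lines that do not match. The function names and the CSV column names are illustrative.

```python
import csv
import gzip
import re

# Simplified pattern for one quad per line (an assumption, not a full parser).
QUAD = re.compile(
    r'^(\S+)\s+'           # subject: IRI or blank node
    r'(<[^>]*>)\s+'        # predicate: IRI
    r'(.+)\s+'             # object: IRI, blank node, or literal (greedy)
    r'(<[^>]*>)\s*\.\s*$'  # graph: IRI of the source page, then the final dot
)


def parse_quad(line):
    """Return (subject, predicate, object, graph), or None if the line does not match."""
    match = QUAD.match(line.strip())
    return match.groups() if match else None


def quads_to_csv(gz_path, csv_path):
    """Convert a gzipped quad file to CSV and return the number of rows written."""
    written = 0
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["subject", "predicate", "object", "graph"])
        for line in src:
            quad = parse_quad(line)
            if quad is not None:
                writer.writerow(quad)
                written += 1
    return written
```

The resulting CSV file can then be loaded with `pandas.read_csv`. For production use, a dedicated RDF parser or the provided conversion code is likely the more robust choice, since this sketch does not validate IRIs or unescape literals.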
The source code of the extraction framework can be checked out from our GitHub repository. For more information about the framework and a detailed description of how to run your own extraction, visit the framework page.
The code for the analysis of the quads can be checked out from the StructuredDataProfiler GitHub repository.
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.