This page provides detailed instructions for downloading the hyperlink graph extracted from the Common Crawl 2012 web corpus. The graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages.
We also provide basic statistics about the hyperlink graph. Please visit the overview page for more information about the provided file formats.
1. Index/Arc Format
The following table contains the links for downloading the index and arc file(s) of the graph for each aggregation level.
Data Set | Index File | Arc File |
---|---|---|
Page Graph | see below (45 GB) | see below (331 GB) |
Subdomain/Host Graph | download (832 MB) | download (9.2 GB) |
1st Subdomain Graph | download (757 MB) | download (8.7 GB) |
PLD Graph | download (297 MB) | download (2.8 GB) |
Downloading the page graph: Due to their size, the arc and index files of the page graph are split into smaller files of around 500 MB each. The index files can be downloaded using
wget -i http://webdatacommons.org/hyperlinkgraph/2012-08/data/index.list.txt
and the arc files using
wget -i http://webdatacommons.org/hyperlinkgraph/2012-08/data/arc.list.txt
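Once an arc file has been downloaded and decompressed, a first sanity check is to count outgoing links per node. The following is only a minimal sketch: it assumes that each arc line holds a source node id and a target node id separated by a tab, which should be confirmed against the format description on the overview page. The class name `ArcDegreeSketch` and the method `outDegrees` are hypothetical names chosen for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

final class ArcDegreeSketch {

    // Counts outgoing arcs per source node id.
    // ASSUMPTION: every non-empty line is "<sourceId><TAB><targetId>".
    static Map<Long, Integer> outDegrees(Reader arcs) throws IOException {
        Map<Long, Integer> degrees = new HashMap<>();
        BufferedReader reader = new BufferedReader(arcs);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IOException("Unexpected arc line: " + line);
            }
            long source = Long.parseLong(fields[0]);
            degrees.merge(source, 1, Integer::sum);
        }
        return degrees;
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory sample instead of a real arc file.
        Reader sample = new StringReader("0\t1\n0\t2\n2\t0\n");
        System.out.println(outDegrees(sample));
    }
}
```

An in-memory map like this is only practical for samples or the smaller aggregated graphs. The full page graph would need a streaming or disk-based approach.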
2. WebGraph Files
We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, and .properties. All three are necessary to load the network into the library.
- network.graph (52 GB)
- network.offsets (4 GB)
- network.properties (1.5 MB)
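With the WebGraph library on the classpath, the graph can be loaded in Java using
BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
where baseName is the path of the three files without their extension.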
3. Additional Material
3.1. PLD Graph (.NET)
We also offer the PLD aggregation of the page graph in the Pajek NET format, which is understood by various graph analysis tools such as Pajek or Gephi. The format combines the index and the arc list into a single file (example file, 106 nodes, 141 arcs). The PLD graph can be downloaded from Pajek version of PLD Graph (2.3 GB) and can be imported directly into Pajek after unzipping. To process the graph in acceptable time, we recommend running Pajek with at least 32 GB of RAM.
3.2. Ranking Files
Besides the pure graph files, we also calculated the harmonic centrality, indegree centrality, Katz's index, and PageRank for each host in the host graph. An interactive version of these rankings is available at wwwranking.webdatacommons.org. The underlying ranking data can also be downloaded using the files listed below. Each line within the files consists of the host and the value for the corresponding measure.
- Harmonic Centrality: https://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/ranking/hostgraph-h.tsv.gz
- Indegree: https://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/ranking/hostgraph-indegree.tsv.gz
- Katz: https://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/ranking/hostgraph-katz.tsv.gz
- PageRank: https://data.dws.informatik.uni-mannheim.de/hyperlinkgraph/2012-08/ranking/hostgraph-pr.tsv.gz
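Since each line of a ranking file consists of a host and its value, the highest-ranked hosts can be extracted in a single streaming pass. The sketch below assumes that the two fields are separated by a tab with the host first (as the .tsv.gz extension suggests) and that the files contain no header line. `RankingTopK` and `topK` are hypothetical names used only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.zip.GZIPInputStream;

final class RankingTopK {

    // Returns the k entries with the largest values, largest first.
    // ASSUMPTION: each line is "<host><TAB><value>", with no header line.
    static List<Map.Entry<String, Double>> topK(InputStream gzipped, int k) throws IOException {
        Comparator<Map.Entry<String, Double>> byValue = Map.Entry.comparingByValue();
        // Min-heap on the value: the smallest of the current top k sits at the head.
        PriorityQueue<Map.Entry<String, Double>> heap = new PriorityQueue<>(byValue);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a tab separator
                }
                String host = line.substring(0, tab);
                double value = Double.parseDouble(line.substring(tab + 1).trim());
                heap.add(new SimpleEntry<>(host, value));
                if (heap.size() > k) {
                    heap.poll(); // drop the smallest so at most k entries remain
                }
            }
        }
        List<Map.Entry<String, Double>> result = new ArrayList<>(heap);
        result.sort(byValue.reversed());
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: RankingTopK <ranking-file.tsv.gz>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (Map.Entry<String, Double> e : topK(in, 10)) {
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

The bounded heap keeps memory use proportional to k rather than to the number of hosts, so the file never has to be loaded in full.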
Checksums of the ranking files:
4253d61a54f38a577c1a86fba8fa0a84 hostgraph-h.tsv.gz
f8fc1d3f239f64ef5e0b95856fa216e4 hostgraph-indegree.tsv.gz
23537ca26d123298b84c24f6298c27f7 hostgraph-katz.tsv.gz
1edc26f368022e45689fc1c4ee82a5f8 hostgraph-pr.tsv.gz
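A downloaded ranking file can be checked against the values listed above. The sketch below assumes that these values are MD5 digests, which their 32 hexadecimal digits and `md5sum`-style layout suggest but which the page does not state. `Md5Check` and `md5Hex` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Check {

    // Streams the input through an MD5 digest and returns it as lowercase hex.
    // ASSUMPTION: the checksums published for the ranking files are MD5 digests.
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: Md5Check <file> <expected-hex-digest>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = md5Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
        }
    }
}
```

The file is read in fixed-size chunks, so even large downloads can be verified without holding them in memory.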
4. License
The extracted data is provided according to the same terms of use, disclaimer of warranties and limitation of liabilities that apply to the Common Crawl corpus.
The Web Data Commons extraction framework can be used under the terms of the Apache Software License.
5. Feedback
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.
More information about Web Data Commons is found here.