This page provides two large hyperlink graphs for public download. The graphs have been extracted from the 2012 and 2014 versions of the Common Crawl web corpora. The 2012 graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. The 2014 graph covers 1.7 billion web pages connected by 64 billion hyperlinks. Below we provide instructions on how to download the graphs as well as basic statistics about their topology.
We hope that the graphs will be useful for
- researchers who develop search algorithms that rank results based on the hyperlinks between pages.
- developers of SPAM detection methods which identify networks of web pages that are published in order to trick search engines.
- researchers working on graph analysis algorithms, who can use the hyperlink graphs to test the scalability and performance of their tools.
- Web Science researchers who want to analyze the linking patterns within specific topical domains in order to identify the social mechanisms that govern these domains.
1. Levels of Aggregation
We provide the graphs on three different levels of aggregation:
- Page-Level Graph - This version of the graph contains all details, with each node representing a single web page (like http://dws.informatik.uni-mannheim.de/en/projects/current-projects/#c13686) and each arc a hyperlink between two pages.
- Host Graph - This graph aggregates the page graph by subdomain/host. Each node represents a specific subdomain/host (like research.dws.uni-mannheim.de) and an arc exists if at least one hyperlink was found between pages that belong to a pair of subdomains/hosts. Note that subdomains/hosts can be of arbitrary depth.
- Pay-Level-Domain Graph - Each node represents a pay-level-domain (like uni-mannheim.de). An arc exists if at least one hyperlink was found between pages contained in a pair of pay-level-domains (see the sketch below for how a page URL maps to the two coarser levels).
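For illustration, the following minimal Java sketch shows how a single page URL maps to the node that would represent it in the host graph and in the pay-level-domain graph. It uses Guava's InternetDomainName (an assumption on our side, not part of the extraction framework) and a made-up page path on one of the hosts mentioned above:

```java
// Minimal sketch of the three aggregation levels. Guava's InternetDomainName
// (public suffix list based) is used here only for illustration; the page
// path is hypothetical.
import java.net.URI;
import com.google.common.net.InternetDomainName;

public class AggregationLevels {
    public static void main(String[] args) throws Exception {
        // Page-level node: the full URL of a single crawled page (made-up path).
        String page = "http://research.dws.uni-mannheim.de/some/page.html";

        // Host-level node: the full subdomain/host of the page.
        String host = new URI(page).getHost();  // research.dws.uni-mannheim.de

        // Pay-level-domain node: the domain directly below the public suffix.
        String pld = InternetDomainName.from(host).topPrivateDomain().toString();  // uni-mannheim.de

        System.out.println("page: " + page);
        System.out.println("host: " + host);
        System.out.println("pld:  " + pld);
    }
}
```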
2. Available Web Graphs
So far, we have extracted hyperlink graphs from two releases of the Common Crawl web corpora. The first graph was extracted from a Web corpus gathered in the first half of 2012 and released in August 2012. The second graph was extracted from a crawl gathered in the first quarter of 2014 and released in April 2014.
2.1. Extracted Hyperlink Graph from August 2012 Common Crawl Corpus
The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler that employed a breadth-first-search selection strategy and discovered new links while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. The table below gives an overview of the size of the different aggregation levels of the extracted graph which are provided for download:
Granularity | #Nodes | #Arcs |
---|---|---|
Page | 3,563 million | 128,736 million |
Host | 101 million | 2,043 million |
Pay-Level-Domain | 43 million | 623 million |
Within the graph, over 94% of all pages are connected and the largest strongly connected component consists of over 50% of all pages. Comparing these and other properties to what is known about the structure of the Web graph from earlier research indicates that the graph is a good sample of the overall Web graph. Please visit the following links for more detailed statistics about the graph as well as a detailed description of how to download and make use of the graph:
2.2. Extracted Hyperlink Graph from Spring 2014 Common Crawl Corpus
The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in April 2014. The Web corpus was gathered using a modified Apache Nutch crawler to download pages from a large but fixed seed list. The crawler was restricted to URLs contained in this list and did not extract additional URLs from links in the crawled pages. The seed list contained around 6 billion URLs and was provided by the search engine company blekko. The Common Crawl Foundation and blekko started a cooperation in 2013 to increase the quality and popularity of the crawled pages and to reduce the number of crawled spam pages and crawler traps. The table below gives an overview of the size of the different aggregation levels of the extracted graph which are provided for download:
Granularity | #Nodes | #Arcs |
---|---|---|
Page | 1,727 million | 64,422 million |
Host | 22 million | 123 million |
Pay-Level-Domain | 13 million | 56 million |
Within the extracted graph, over 91% of all crawled pages are connected, and the largest strongly connected component consists of 19% of all nodes in the graph. This means that, in comparison to the 2012 graph, the percentage of connected nodes is almost equal, but the size of the largest strongly connected component is much smaller. This phenomenon results from the different crawling strategies that were used to gather the web corpora. For the analysis of the connectivity of Web pages or the overall analysis of the Web graph, we therefore recommend using the 2012 graph rather than the 2014 graph, as a BFS-based selection strategy that includes URL discovery while crawling is more likely to result in a realistic sample of the Web graph.
Please visit the following links for more detailed statistics about the 2014 graph as well as a detailed description of how to download and make use of the graph:
3. Data Formats and Download
We provide the graphs for free download in several formats. All graphs are provided in an index/arc data format. In addition, we provide the page graphs in the format used by the WebGraph library and the PLD graphs in the format used by Pajek.
3.1. Index/Arc Format
The Index/Arc format represents each graph using two files. Within the index file each line represents one node. The first column states the node name, the second column states the node index. Within the arc file each line represents a directed edge between two nodes, where the first column is the origin node and the second the target node. The files are sorted by index and use tabs as a delimiter. The following example files contain a graph with 106 nodes and 141 arcs.
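As a rough illustration of the format, the following Java sketch reads both files and prints each arc with resolved node names. The file names are placeholders for the downloaded files, and for the page-level graph the index would of course be far too large to hold in memory like this:

```java
// Minimal sketch of reading the tab-separated index and arc files.
// "example.index" and "example.arc" are placeholder file names.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class IndexArcReader {
    public static void main(String[] args) throws Exception {
        // Index file: <node name> TAB <node index>, sorted by index.
        Map<Long, String> names = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("example.index"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                names.put(Long.parseLong(parts[1]), parts[0]);
            }
        }
        // Arc file: <origin index> TAB <target index>, one directed edge per line.
        try (BufferedReader in = new BufferedReader(new FileReader("example.arc"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                long from = Long.parseLong(parts[0]);
                long to = Long.parseLong(parts[1]);
                System.out.println(names.get(from) + " -> " + names.get(to));
            }
        }
    }
}
```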
3.2. WebGraph Framework Format
We also provide the page graph in the format expected by the WebGraph Framework developed by Sebastiano Vigna. The graph is represented using three files: .graph, .offsets, and .properties. All three are necessary to load the network into the library.
Using the WebGraph Framework, which can be downloaded from Maven Central, these files can be loaded using the following line of code:
BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());
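For a slightly fuller picture, here is a hedged sketch of loading the three files and iterating over the out-links of a few nodes with the WebGraph library. The base name "page-graph" is a placeholder for the actual names of the downloaded files:

```java
// Minimal sketch using the it.unimi.dsi.webgraph artifact from Maven Central.
// "page-graph" is a placeholder base name (path without file extension).
import it.unimi.dsi.logging.ProgressLogger;
import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.NodeIterator;

public class LoadGraph {
    public static void main(String[] args) throws Exception {
        String baseName = "page-graph";  // expects page-graph.graph/.offsets/.properties
        BVGraph graph = BVGraph.loadMapped(baseName, new ProgressLogger());

        System.out.println("nodes: " + graph.numNodes());
        System.out.println("arcs:  " + graph.numArcs());

        // Print the successors (out-links) of the first few nodes.
        NodeIterator it = graph.nodeIterator();
        for (int i = 0; i < 5 && it.hasNext(); i++) {
            int node = it.nextInt();
            int outdegree = it.outdegree();
            int[] successors = it.successorArray();  // only the first 'outdegree' entries are valid
            System.out.print(node + " ->");
            for (int j = 0; j < outdegree; j++) System.out.print(" " + successors[j]);
            System.out.println();
        }
    }
}
```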
3.3. Pajek NET Format
We also offer the PLD aggregation of the page graph in the Pajek NET format, which is understood by various graph analysis tools such as Pajek or Gephi. The format combines the index and the arc list into a single file (example file, 106 nodes, 141 arcs).
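To give an idea of what such a file looks like, here is a tiny hand-made NET fragment (the domain names are made up; this is not the linked example file). Vertices are listed with 1-based indices and labels, followed by the directed arcs:

```
*Vertices 3
1 "example-a.com"
2 "example-b.org"
3 "example-c.de"
*Arcs
1 2
1 3
```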
4. Extraction Process and Source Code
We extracted all HTML pages (mime-type: text/html) and every hyperlink pointing to another crawled HTML page (link types: a and link) from the corpora. For each redirect, we include an additional node in the graph which links to the redirect target.
Since the Common Crawl corpus is provided via the AWS Simple Storage Service (S3), it made sense to perform the extraction in the Amazon cloud (EC2). The main criterion here is the cost to achieve a certain task. Instead of using the ubiquitous Hadoop framework, we found that using the Simple Queue Service (SQS) for our extraction process increased efficiency. SQS provides a message queue implementation, which we use to coordinate the extraction nodes. The Common Crawl data set is readily partitioned into compressed files of around 100MB each, containing several thousand webpages. Besides these content files, metadata files are also provided. For each page, these metadata files include the URL, redirects, mime-type, hyperlinks, and the type of each link in a JSON format. As these files contain all information needed to extract the hyperlink graph for the crawled webpages, we used an adapted version of the framework that we had already used to extract RDFa, Microformats and Microdata from the crawled pages to parse the URLs, redirects, links and link types from the metadata files. We used 100 machines on Amazon EC2 to process the metadata files. In a second step, we created an index file for each aggregation level (PLD, domain, first subdomain) and indexed the graphs based on these mappings using Apache Pig running on a 40-node Amazon Elastic MapReduce (EMR) cluster.
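The coordination pattern behind this setup is simple: each metadata file key is placed in an SQS queue, and every EC2 worker repeatedly fetches a message, processes the referenced file, and deletes the message. The following is a minimal sketch of such a worker loop, assuming the AWS SDK for Java v1; the queue name and the processing step are placeholders, not the actual WDC extraction code:

```java
// Sketch of an SQS-coordinated worker loop (AWS SDK for Java v1).
// Queue name and processing logic are hypothetical placeholders.
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;

public class ExtractionWorker {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
        String queueUrl = sqs.getQueueUrl("metadata-files").getQueueUrl();  // hypothetical queue name

        while (true) {
            for (Message msg : sqs.receiveMessage(queueUrl).getMessages()) {
                String s3Key = msg.getBody();                          // key of one ~100MB metadata file
                processMetadataFile(s3Key);                            // parse URLs, redirects, links (placeholder)
                sqs.deleteMessage(queueUrl, msg.getReceiptHandle());   // mark the work item as done
            }
        }
    }

    private static void processMetadataFile(String s3Key) {
        // Placeholder: download the file from S3 and emit (source, target) arcs.
        System.out.println("processing " + s3Key);
    }
}
```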
The source code to extract the WDC Hyperlink Graph from the Common Crawl corpus can be checked out from our Subversion repository. To use the code, you will need to create your own configuration and fill in your AWS authentication information and bucket names. Compilation is performed using Maven, so changing into the source root directory and typing mvn install
should be sufficient to create a build. In order to run the extractor on more than 10 EC2 nodes, you will have to request an EC2 instance limit increase for your AWS account. Besides the raw extraction framework, the project also includes various algorithms to format and manipulate the entire graph, such as shrinking it to a specific aggregation level or indexing the graph in order to compress it.
5. Other Public Hyperlink Graphs and Web Crawls
The Laboratory for Web Algorithms provides various hyperlink graphs for public download in the format understood by the WebGraph Framework. In comparison to these graphs, the WDC Hyperlink Graph is more recent and larger.
The Stanford Large Network Dataset Collection also contains several smaller hyperlink graphs (all below 1 million nodes).
Besides the Common Crawl corpus that was used to extract the WDC Hyperlink Graph, there are several other public web corpora that could be used to extract hyperlink graphs:
- The ClueWeb12 corpus was crawled in a similar time period as the Common Crawl. The corpus consists of 740 million English webpages. In comparison, the Common Crawl is 4 times larger and covers non-English top-level domains as well.
- The Stanford WebBase project provides a Web crawl containing 118 million pages and around 1 billion links. The corpus was collected in 2001.
- The Yahoo! Webscope Project has published an older version of the AltaVista crawl, created in 2002. The corpus includes 1.4 billion webpages that are connected by 6.6 billion hyperlinks.
6. License
The Web Data Commons extraction framework can be used under the terms of the Apache Software License.
7. Feedback
Please send questions and feedback to the Web Data Commons mailing list or post them in our Web Data Commons Google Group.
More information about Web Data Commons can be found here.
8. Credits
Lots of thanks to
- the Common Crawl project for providing their great web crawl and thus enabling the creation of the WDC Hyperlink Graph.
- Sebastiano Vigna for providing and supporting us with his amazing Java WebGraph library.
- Andrej Mrvar for his fast and detailed answers about the usage of specific functions in Pajek.
- Stephan Seufert for giving us some initial ideas about how to compress and format our graph.
The creation of the WDC Hyperlink Graph was supported by the EU research project PlanetData and by an Amazon Web Services in Education Grant award. We thank our sponsors a lot for supporting Web Data Commons.
9. References
- Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer: The Graph Structure in the Web - Analyzed on Different Aggregation Levels. Journal of Web Science: Vol. 1: No. 1, pp 33-47, 2015.
- Oliver Lehmberg, Robert Meusel, Christian Bizer: The Graph Structure of the Web aggregated by Pay-Level Domain. ACM Web Science 2014 Conference (WebSci2014), Bloomington, USA, June 2014.
- Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer: Graph Structure in the Web - Revisited. 23rd International World Wide Web Conference (WWW2014), Web Science Track, Seoul, Korea, April 2014.