This page shows the results for the pre-release of the 2012 corpus, which Common Crawl published in February. The pages contained in the pre-release are a subset of the pages contained in the August 2012 Common Crawl Corpus. We also extracted the structured data from this pre-release. The results for the complete 2012 corpus can be found here.
The February 2012 Common Crawl Corpus is part of the August 2012 Common Crawl Corpus and is no longer available as a separate download.
Crawl Date | Feb 2012 | |
---|---|---|
Total Data | 20.9 Terabyte | (compressed) |
Total URLs | 1,700,611,442 | |
Parsed HTML URLs | 1,486,186,868 | |
Domains with Triples | 65,408,946 | |
URLs with Triples | 188,821,015 | |
Typed Entities | 1,222,563,749 | |
Triples | 3,294,248,652 | |
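The counts in the table can be put in proportion with a few divisions. The following is a minimal sketch in Python; the constant and variable names are illustrative and not part of the extraction framework, and the derived ratios are computed here rather than reported on the original page.

```python
# Counts copied from the table above (February 2012 pre-release).
PARSED_HTML_URLS = 1_486_186_868
URLS_WITH_TRIPLES = 188_821_015
TYPED_ENTITIES = 1_222_563_749
TRIPLES = 3_294_248_652

# Share of parsed HTML URLs that yielded at least one triple.
share_with_triples = URLS_WITH_TRIPLES / PARSED_HTML_URLS

# Average number of triples and typed entities per URL that contains triples.
triples_per_url = TRIPLES / URLS_WITH_TRIPLES
entities_per_url = TYPED_ENTITIES / URLS_WITH_TRIPLES

# Average number of triples describing one typed entity.
triples_per_entity = TRIPLES / TYPED_ENTITIES

print(f"URLs with triples:   {share_with_triples:.1%} of parsed HTML URLs")
print(f"Triples per URL:     {triples_per_url:.2f}")
print(f"Entities per URL:    {entities_per_url:.2f}")
print(f"Triples per entity:  {triples_per_entity:.2f}")
```

By this calculation, roughly one in eight parsed HTML pages contains structured data, with about 17 triples on each such page.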
The costs for parsing the 20.9 Terabytes of compressed input data of the February 2012 Common Crawl corpus, extracting the RDF data and storing the extracted data on S3 totaled 523 EUR (excluding VAT) in Amazon EC2 fees. We used 100 spot instances of type c1.xlarge for the extraction, which altogether required 3,007 machine hours.
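The cost figures above can be broken down into unit costs. The sketch below is illustrative only: the names are made up for this example, and the hours-per-instance value assumes the machine hours were spread evenly across all 100 instances, which the page does not state.

```python
# Figures copied from the paragraph above.
TOTAL_COST_EUR = 523.0      # Amazon EC2 fees, excluding VAT
MACHINE_HOURS = 3_007       # summed over all instances
INSTANCES = 100             # c1.xlarge spot instances
INPUT_TERABYTES = 20.9      # compressed input data

# Average price paid per machine hour.
cost_per_machine_hour = TOTAL_COST_EUR / MACHINE_HOURS

# Cost of processing one terabyte of compressed input.
cost_per_terabyte = TOTAL_COST_EUR / INPUT_TERABYTES

# Assumption: load is spread evenly, so each instance runs for the same time.
hours_per_instance = MACHINE_HOURS / INSTANCES

print(f"Cost per machine hour: {cost_per_machine_hour:.3f} EUR")
print(f"Cost per terabyte:     {cost_per_terabyte:.2f} EUR")
print(f"Hours per instance:    {hours_per_instance:.2f}")
```

Under these assumptions the extraction cost about 0.17 EUR per machine hour and about 25 EUR per terabyte of compressed input, with each instance running for roughly 30 hours.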