CDNs and Duplicate Content
June 17, 2011 | Taylor Jasko
All webmasters know that duplicate content is one of the major issues they face in search engines. Google recently announced on its Webmaster Central Blog that it now supports canonical links via HTTP headers.
One practical example Google gave on the blog involves a PDF file and an HTML file containing the same content. By adding `Link: <http://www.example.com/white-paper.html>; rel="canonical"` to the HTTP headers of the PDF file, Google will fully understand that the PDF is a duplicate of "white-paper.html". Many PDF documents and HTML files share the same content, and it happens all the time; now that Google supports canonical links in HTTP headers, webmasters can finally declare a PDF file a duplicate of another, similar file.
The example HTTP header Google mentioned is below for reference:
GET /white-paper.pdf HTTP/1.1
Host: www.example.com
(...rest of HTTP request headers...)

HTTP/1.1 200 OK
Content-Type: application/pdf
Link: <http://www.example.com/white-paper.html>; rel="canonical"
Content-Length: 785710
(...rest of HTTP response headers...)
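As a rough sketch of what this looks like in practice, the snippet below builds the `Link` header value in the exact format from Google's example. The helper name `canonical_link_header` is our own for illustration, not part of any particular framework:

```python
def canonical_link_header(canonical_url):
    # Build the value for the HTTP "Link" header in the format
    # Google's example uses: <URL>; rel="canonical"
    return '<{}>; rel="canonical"'.format(canonical_url)

# Headers you might attach to the PDF response, assuming your
# server or application lets you set arbitrary response headers.
headers = {
    "Content-Type": "application/pdf",
    "Link": canonical_link_header("http://www.example.com/white-paper.html"),
}
```

How you actually attach the header depends on your stack; most web servers and frameworks expose some way to add custom response headers for a given file type or route.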
Google is always trying to help webmasters use its search engine as efficiently as possible. Even though canonical links have been around for quite a while now, it's somewhat surprising that Google only now decided to extend the canonical attribute to HTTP headers.
There's one amazing use for the canonical link in HTTP headers that Google did not give an example of. Because CDNs work by duplicating your content, files served from a CDN can show up in Google alongside the originals on your main domain. One example is the same image being available at both "http://cdn.maxcdn.com/image.png" and "http://maxcdn.com/image.png". With the canonical attribute, a whole new door opens for CDN providers: they can make sure Google knows the content on your CDN is, in fact, duplicate content.
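A CDN edge could, in principle, derive the canonical URL for the `Link` header by rewriting the CDN hostname back to the origin hostname. The sketch below assumes a simple setup where only the hostname differs between the CDN and the origin (the function name and default host are illustrative, not a real MaxCDN API):

```python
from urllib.parse import urlparse, urlunparse

def canonical_for_cdn(cdn_url, origin_host="maxcdn.com"):
    # Rewrite a CDN URL (e.g. http://cdn.maxcdn.com/image.png) to the
    # matching origin URL, which is what the canonical Link header
    # on the CDN's response should point to.
    parts = urlparse(cdn_url)
    return urlunparse(parts._replace(netloc=origin_host))
```

For example, `canonical_for_cdn("http://cdn.maxcdn.com/image.png")` yields the origin URL on "maxcdn.com" with the same path.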
Either way, we hope every webmaster will take advantage of this new feature from Google. It may not be a large feature, but canonical links are very important for SEO rankings.
MaxCDN also has full robots.txt support on all pull zones!
There's most likely no reason to let search engines index your CDN, since all of the content already lives on your main server, especially with a pull zone. MaxCDN came up with a brilliant way to prevent search engines from crawling your pull zone: under the advanced settings of every MaxCDN pull zone, you will now see a "Custom Robots.txt" section that gives you full control over the robots.txt file.
Not only can you enable a robots.txt file that denies all user agents by default, you can also create a custom robots.txt that blocks crawlers only from your scripts folder or anything else search engines do not need to index. Search engines crawling your CDN can also drain your bandwidth, so enabling the robots.txt file on your pull zone is something you should consider.
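For reference, a custom robots.txt along these lines would cover both cases described above; the "/scripts/" path is just an example of a folder you might not want indexed:

```
# Block all crawlers from the entire pull zone:
User-agent: *
Disallow: /

# ...or, alternatively, block only a specific folder:
# User-agent: *
# Disallow: /scripts/
```

Whichever variant you choose, paste it into the "Custom Robots.txt" section of your pull zone's advanced settings.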