I have been using Google Reader as my main RSS aggregator for several years now. Unlike some others, however, I continue to use a desktop-based RSS client to subscribe to private feeds. That was an intuitive decision; I didn’t spend much time thinking about it.
Earlier this week, I was in for a big surprise. I did a search on a public search engine, and the results included a link that I recognized as coming from a private feed. In other words, a public index included information that should only be available to registered and authorized users of a particular site.
So I started digging. First of all, there are three general ways a private feed can be implemented. This post summarizes them nicely: a unique secret token in the URI, a cookie, or HTTP authentication. The unique secret token seems to be the most popular these days, possibly because the other two methods make it more difficult to subscribe to such a feed in an online reader.
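For illustration, a secret-token feed URI usually embeds a long, hard-to-guess string directly in the path or query (the domain, path, and token below are hypothetical):

```
https://example.com/feeds/private/a81bc81b-dead-4e5d-abff-90865d1e13b1.xml
```

The only thing protecting this feed is that nobody else knows the URI, which is exactly why it must never end up in a search index.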
With the unique-secret-token method, however, a feed publisher must somehow tell search bots that this content is not to be indexed. Otherwise we are relying on the URI never being discovered, which becomes problematic with so many people switching to online readers recently. I found an old story on TechCrunch that brought up the same issue and discussed efforts by Bloglines to set up a standard for this, but I could not confirm whether those efforts led anywhere. That leaves one well-known method: robots.txt.
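As a sketch, assuming the publisher serves all private feeds under a common prefix such as a hypothetical /feeds/private/ path, the robots.txt rule could look like this:

```
User-agent: *
Disallow: /feeds/private/
```

Since robots.txt matches URLs by prefix, one Disallow line covers every token under that path. Note that robots.txt itself is public, so the rule should name only the shared prefix, never individual token URIs.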
Dear publishers of private feeds! Please make sure to disallow access to private feed URIs on your sites in robots.txt. In the last couple of days I checked two major publishers of private feeds that use the unique-secret-token method, and neither has a proper Disallow rule in its robots.txt. Ouch!
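If you want to verify your own site, Python's standard urllib.robotparser can evaluate a robots.txt against a feed URL. This is a minimal sketch; the rules, domain, path, and token are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /feeds/private/",
]

rp = RobotFileParser()
rp.parse(rules)

# A secret-token feed under the disallowed prefix: bots should skip it.
print(rp.can_fetch("*", "https://example.com/feeds/private/a81bc81b.xml"))  # False

# A public feed elsewhere on the site remains crawlable.
print(rp.can_fetch("*", "https://example.com/feed.xml"))  # True
```

In practice you would point RobotFileParser at the live robots.txt with set_url() and read(), then test the exact private feed URI your subscribers receive.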
Am I missing anything? If you think my theory is wrong and there is a better method, please let me know in the comments below.