I have been using Google Reader as my main RSS aggregator for several years now. Unlike some others, however, I continue to use a desktop-based RSS client to subscribe to private feeds. This was an intuitive decision; I didn’t spend much time thinking about it.
Earlier this week, I was in for a big surprise. I ran a search on a public search engine, and the results included a link that I recognized as coming from a private feed. In other words, a public index included information that should be available only to registered and authorized users of a particular site.
So I started digging. First of all, there are three general ways to implement a private feed. This post summarizes them nicely - a unique secret token in the URI, a cookie, or HTTP authentication. The unique secret token seems to be the most popular these days, possibly because the other two methods make it more difficult to read such a feed in online readers.
With the unique secret token method, however, a feed publisher must somehow notify search bots that this content is not to be indexed. Otherwise, we rely on the URI never being discovered, which becomes problematic now that so many people are switching to online readers. I found an old story on Techcrunch that brought up the same issue and discussed efforts by Bloglines to set up a standard for this, but I could not confirm whether those efforts led anywhere. This leaves one well-known method - robots.txt.
Dear publishers of private feeds! Please make sure that robots.txt on your site disallows access to private feed URIs. In the last couple of days I checked two major publishers of private feeds that use the unique secret token method, and neither of them has a proper disallow rule in its robots.txt. Ouch!
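To make this concrete, here is a minimal sketch of how a publisher could check that a disallow rule really covers private feed URIs, using Python’s standard urllib.robotparser module. The /feeds/private/ prefix, the example.com host, and the token in the URL are all hypothetical. Since robots.txt is itself publicly readable, the rule should name a common path prefix, never an individual secret URI.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all private feeds are assumed to live under /feeds/private/
ROBOTS_TXT = """\
User-agent: *
Disallow: /feeds/private/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a well-behaved bot may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, is_crawlable(ROBOTS_TXT, "http://example.com/feeds/private/abc123.xml") is False, while a public feed outside that prefix remains crawlable. This only keeps out bots that honor robots.txt; it is not access control.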
Am I missing anything? If you think my theory is wrong and there is a better method, please let me know in comments below.
Tags: Internet
UPDATED 2008-11-12: Adjusted Failover section below (additions in italic) based on a thread on rabbit-discuss.
I am a big fan of RabbitMQ, an implementation of the Advanced Message Queuing Protocol (AMQP). In this post I am going to provide an overview of how RabbitMQ can be used beyond simple queueing and pubsub. For more background on this topic, please see a list of messaging scenarios that RabbitMQ supports.
Queueing
The broker can take a message from a producer and keep it until a consumer shows up. To survive broker restarts, queues in this case should be durable, with auto-delete set to false, and messages should be published with a delivery mode of 2 (which means persistent). This pattern can also help if a consumer is temporarily unable to keep up with the incoming message flow - queueing allows the producer and the consumer to keep going at their own pace, and makes sure all messages are consumed eventually.
Multiple producers can publish messages that are routed to a single consumer. You can have all your producers publish to the same fanout exchange, to the same direct exchange with the same routing key, or to a topic exchange with routing keys that match the consumer’s binding pattern. The latter case allows producers to publish “flights.aa.ord” and “flights.ua.sfo”, while the consumer reads all of these with “flights.#” (* matches a single word, # matches zero or more words).
A single producer can also publish messages that are routed to multiple different consumers. With a topic exchange, for example, the producer could publish “order.book” and “order.cd”, while orders for books and CDs are handled by different systems.
Instant Feedback (Queueing Bypass)
A producer can instruct the broker not to queue a message at all and to return it to the sender if no consumer is currently available to read it (this is achieved by setting the immediate flag to true in the basic.publish method). This can be helpful in scenarios where message content is time sensitive - it needs to be processed now or an error must be returned.
Duplicating
In AMQP, a message is delivered to every queue bound to a given exchange whose binding meets the routing criteria (these criteria differ for different types of exchanges). For example, a message published with the “prod.server01.disk.full” key can be simultaneously routed to a queue bound with “prod.#” (for a production logger to keep track of all events in the production environment) and a queue bound with “#.disk.full” (for an archiver process that removes old logs). This is a very powerful feature, and it works with direct and fanout exchanges as well.
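The topic wildcard rules (* for exactly one word, # for zero or more words) are easy to get slightly wrong, so here is a small sketch of the matching logic in plain Python. It is a toy model for illustration, not RabbitMQ’s implementation and not a client library call. The function names and queue names are made up for this example.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Check a routing key against a topic binding pattern.

    '*' stands for exactly one word, '#' for zero or more words;
    words are separated by dots.
    """
    p = pattern.split(".")
    k = routing_key.split(".")

    def match(i: int, j: int) -> bool:
        if i == len(p):
            # Pattern exhausted: match only if the key is exhausted too
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb any number of the remaining words, including none
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


def route(bindings: dict, routing_key: str) -> list:
    """Return the names of all queues whose binding pattern matches the key."""
    return [queue for queue, pattern in bindings.items()
            if topic_matches(pattern, routing_key)]


# Hypothetical queue names bound to one topic exchange
BINDINGS = {
    "production_log": "prod.#",
    "disk_archiver": "#.disk.full",
    "staging_log": "stage.#",
}
```

Calling route(BINDINGS, "prod.server01.disk.full") returns both production_log and disk_archiver, which is the duplication described above. Likewise, topic_matches("flights.#", "flights.aa.ord") holds, while "flights.*" would not match that key because * covers only one word.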
Load balancing
If multiple consumers read from the same queue, the RabbitMQ broker will automatically load balance messages between all available consumers. Each message is sent to only one consumer at a time.
Failover
In no_ack=false mode, a consumer must eventually acknowledge receipt of each message explicitly, individually or as a group (this does not mean that a message must be ack’ed before the next one can be received). If a consumer disconnects without acknowledging, unack’ed messages are automatically re-queued for another consumer. This helps achieve consumer failover in response to crashes or loss of network connectivity.
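The bookkeeping behind this can be sketched with a toy in-memory queue. This is only a model of the behavior described above, not a RabbitMQ client and not how the broker is implemented. The class and method names are invented for the example, and putting re-queued messages back at the front of the queue is an assumption of this sketch.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a broker queue with explicit acknowledgements."""

    def __init__(self, messages):
        self.ready = deque(messages)
        self.unacked = {}  # consumer name -> delivered but unacknowledged messages

    def deliver(self, consumer):
        """Hand the next ready message to a consumer and remember it as unacked."""
        msg = self.ready.popleft()
        self.unacked.setdefault(consumer, []).append(msg)
        return msg

    def ack(self, consumer, msg):
        """The consumer confirms it has processed msg; the queue can forget it."""
        self.unacked[consumer].remove(msg)

    def disconnect(self, consumer):
        """The consumer went away: put its unacked messages back, keeping their order."""
        for msg in reversed(self.unacked.pop(consumer, [])):
            self.ready.appendleft(msg)
```

For example, if consumer "a" receives m1 and m2, acks only m1, and then disconnects, the next deliver call to consumer "b" hands out m2 again instead of skipping to m3.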
Relaying
If producers and consumers do not have a direct line of sight network-wise (for example, they are behind NAT or are located on private subnets), RabbitMQ can provide the connectivity by serving as a message relay. Both producers and consumers must be able to establish client connections to the broker (the official AMQP port is TCP 5672), and then they can exchange messages.
Some of these patterns can be mixed and matched, which further expands the set of problems where RabbitMQ can help you achieve distributed messaging nirvana.
Tags: rabbitmq
Most people who work with the Internet know about RFC 1918 “Address Allocation for Private Internets.” But did you know that RFC 3330 “Special-Use IPv4 Addresses” sets aside even more address blocks for non-public use?
I didn’t know about it till today.
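As a quick illustration, here is a sketch that classifies an IPv4 address against a handful of these blocks using Python’s standard ipaddress module. The table is deliberately a small subset chosen for the example, not the full RFC 3330 list, and the labels are my own shorthand.

```python
import ipaddress

# A small, incomplete subset of special-use IPv4 blocks (labels are informal)
SPECIAL_USE = {
    "10.0.0.0/8": "private (RFC 1918)",
    "172.16.0.0/12": "private (RFC 1918)",
    "192.168.0.0/16": "private (RFC 1918)",
    "127.0.0.0/8": "loopback",
    "169.254.0.0/16": "link local",
    "192.0.2.0/24": "TEST-NET (documentation and examples)",
    "198.18.0.0/15": "network benchmark testing",
}


def classify(address: str):
    """Return the label of the first listed block containing address, or None."""
    ip = ipaddress.ip_address(address)
    for block, label in SPECIAL_USE.items():
        if ip in ipaddress.ip_network(block):
            return label
    return None
```

So classify("198.19.0.1") reports the benchmark testing block, which RFC 1918 alone would never tell you, while classify("8.8.8.8") returns None.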
Tags: Internet
… And it’s not only because it’s often cheaper to own or use, but also because it raises the bar for every single piece of proprietary software - they no longer can get away with poor user interface or limited features like they used to. Proprietary software now has to beat and exceed open source to win a customer, which results in better products. A win-win-win situation for users, open source and proprietary software.
In economics, this is called a non-zero-sum situation.
Tags: Economics · technology
Someone once asked me to explain cloud computing. I jokingly replied that it’s like running your servers somewhere where there is no shortage of CPU power, storage capacity or bandwidth, and you get charged only for what you actually use. And if you need more, you just ask (via API) - and it’s there. “Wow! There’s gotta be some magic involved in that,” my buddy said.
Today we at CohesiveFT announced a new solution called VPN-Cubed, which can add even more magic to your cloud-based deployment. It offers “customer-controlled security in a cloud, across multiple clouds, and between the physical data center and cloud(s).” But it’s not only a security solution, but also a network infrastructure component that complements our flagship Elastic Server On Demand platform. It has high availability built in, and no single points of failure. It supports many different topologies and is available on many different operating systems (including Windows). It was developed in part to facilitate our own internal infrastructure (read: we needed something like this to run our own business), and has been in use internally for some time.
I was involved in this project from the engineering side, and I am extremely excited about the end result. You should definitely check it out!
Tags: cloud computing · cohesiveft
Designing a distributed application to be fault tolerant is one of my favorite things that I often get to do at work. First of all, it should never fail under normal circumstances. Don’t believe people who tell you that circumstances are never normal - if that’s the case, a fault-tolerant design is the least of your worries and you need to get the overall environment to be at least somewhat stable first. But then, circumstances don’t remain unchanged for too long - something will happen sooner or later. So you want to consider as many failure scenarios as you can think of and, for each one, try to anticipate how the event will impact your application, how the app will find out that it occurred, and what to do about it.
But that’s not what I wanted to write about. As you might imagine, I read a lot on the subject - learning from other people’s mistakes and experiences in the distributed systems world has never been easier, thanks to blogging and a general tendency towards openness and disclosure. In all this stream of data that I get, the most frequent failure scenarios can typically be categorized as a “hardware crash” or a “software crash.” Something was running fine, and then - BAM! - it crashed. It no longer exists. Nothing can talk to it anymore. Nothing can ask it how it’s doing, or what was the last thing it did. It crashed. Died. Disappeared.
But is a crash the worst that could happen? Unfortunately not. Connectivity loss is way more tricky to deal with. Your Nagios thinks your web server crashed because it’s not responding? Can’t tell - not enough information. All you know is that Nagios could not connect to the web server. It doesn’t mean that the latter crashed. Or you can’t connect to your messaging backend - did it crash? Not necessarily; all you know at the moment is that connectivity between you and the remote end is broken.
So why do I say that connectivity loss is way worse than a crash?
- A crash looks the same to all clients: all of them will fail to connect. Connectivity loss, however, can impact only a fraction of your client base. So half of your clients are failing over to the secondary, while the other half are still attached to the primary. And you neglected to implement an alarm for that - and now your customers see only half of your inventory on the site? Oops.
- A crash is usually a terminal state, as in your application can’t easily leave a crash state on its own. And what about connectivity? Oh, not at all - connectivity can be restored without your direct intervention. It can range from route convergence after a backup link comes up, to network congestion easing after a spike in traffic. Are you going to be prepared?
And here is yet another twist. No matter what your position is on cloud computing, it is here to stay. And it is only a matter of time before many more services on which you rely for your operations will be scattered all over the world (or space, but that’s later). Connectivity loss will be occurring way more often than crashes, and unless you start approaching it as a different problem, you might be in for a big surprise.
Tags: cloud computing · distributed · technology
September 22nd, 2008
In Rails, the update_attribute method bypasses model validations, while update_attributes and update_attributes! will fail (return false or raise an exception, respectively) if the record you are trying to save is not valid.
This means that if at some point during the project you adjust validations such that some records that used to be valid are now invalid, expect your code to stop working for no obvious reason. This will happen because update_attributes will no longer update old records properly, but unless you check its return value, you’ll never know that it’s failing.
I stumbled upon this problem at least twice in the last month, and decided to write it up to finally remember to do it right next time.
Tags: ruby
September 18th, 2008
Those of us who [still] have a PalmOS-based device and use its Blazer browser will probably know that Blazer may take some time to render complex pages, and the end result might not even be readable on a small screen. I recently found a solution to this problem that works great, at least for me.
On your phone, head over to http://www.google.com/gwt/n and enter the URL you are trying to get. GWT (which stands for Google Wireless Transcoder) will fetch the content and optimize it for your mobile browser. Additionally, it will adjust all links to also go through GWT, which makes Internet surfing with Blazer not painful at all.
For example, I like checking Techmeme on the train on my way to work. They offer a mini version, which renders well on my Palm Centro from Sprint. But if I want to follow a story and click on a link, I usually get a page that is not optimized for mobile (there are several exceptions that detect the user agent and adjust content formatting). Instead, in my Blazer bookmarks, I have this - http://www.google.com/gwt/n?u=http%3A%2F%2Ftechmeme.com. From this page, I can jump to any story and get the content nicely formatted for my Centro.
UPDATE: An earlier version of this post expanded GWT as Google Web Toolkit, which is a different product. The service described here is Google Wireless Transcoder.
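The bookmark above is simply the target address percent-encoded into the u parameter. Here is a small sketch that builds such a link for any page. The function name is made up, and the only thing assumed about the service is the ?u= form visible in the bookmark.

```python
from urllib.parse import quote

GWT_ENDPOINT = "http://www.google.com/gwt/n"


def transcoder_url(target: str) -> str:
    """Build a transcoder link by percent-encoding the whole target URL into ?u=."""
    # safe="" forces ':' and '/' to be encoded as well
    return GWT_ENDPOINT + "?u=" + quote(target, safe="")
```

Calling transcoder_url("http://techmeme.com") reproduces the bookmark shown above.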
Tags: technology · web
I was under the assumption that when a site moves to a new domain or URL space, the best thing to do from an SEO perspective was to put up one’s site at the new place and set up the old site to do HTTP 301 redirects (Moved Permanently).
I did this a couple of weeks ago when I was moving this site to its current address, but noticed today that my old address still shows up in Google at the top of search results. I got curious and checked both Yahoo and MSN, and both of them correctly no longer display links that have been redirected.
Am I missing anything, or is it a bug in Googlebot?
Tags: technology · web
I have done some reorg on the site. If you are reading this in a feed, you might have received old entries from my blog as new. There is no way around that, and I apologize for the inconvenience.
The new address for this blog is http://somic.org. The old URL will redirect here with a 301, so I expect all links to work properly.
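For the curious, a 301 redirect of this kind can be sketched as a tiny WSGI application in Python. This is not how the redirect on the old site is actually implemented. It assumes, purely for illustration, that every path on the old site maps to the same path under the new address.

```python
NEW_BASE = "http://somic.org"


def redirect_app(environ, start_response):
    """WSGI app for the old site: answer every request with a 301 to the new address."""
    # Assumption: old and new sites share the same path layout
    location = NEW_BASE + environ.get("PATH_INFO", "/")
    query = environ.get("QUERY_STRING", "")
    if query:
        location += "?" + query
    start_response("301 Moved Permanently",
                   [("Location", location), ("Content-Length", "0")])
    return [b""]
```

It can be tried locally with wsgiref.simple_server.make_server from the standard library. A well-behaved client or crawler that receives the 301 is expected to follow the Location header and remember the new address.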
Tags: blogging