Graphite RabbitMQ Integration

I started a new project on GitHub: http://github.com/somic/graphite-rabbitmq. It currently includes a couple of tools written in Python that send data to Graphite via RabbitMQ instead of connecting directly to the service over TCP.

Graphite CLI Screenshot

Graphite is a flexible and powerful tool to build charts. It’s also a data series analytics framework. It was developed inside Orbitz by my former colleague, originally for use within a single group (of which I was a part). However, its power did not remain a secret for too long - it quickly spread to the entire organization and became an irreplaceable tool for both development and engineering/operations. Graphite was then open-sourced under the Apache license. It currently lives at http://graphite.wikidot.com/

Graphite has a dynamic web UI, an improved RRD implementation called “whisper” (read this FAQ - highly recommend!) and a web-based command line with auto-completion which allows you to overlay any metrics on a single chart. But IMHO the key to its power is the fact that you are in control of what kind of data to send to it, how often, and how to set up hierarchies of your metrics - by environment, by machine type, by datacenter, etc. Graphite doesn’t do its own polling, an approach that won’t scale to hundreds or thousands of metrics. Nor does it enforce anything but the fact that your metrics are dot-separated hierarchies (as in routing keys of AMQP topic exchanges - my.metric.name) and that their values are numeric (int or float).
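To make those two rules concrete, here is a minimal sketch that checks a metric name and value and renders them as a single "name value timestamp" line, which is the plaintext layout Graphite accepts as far as I know. The function name, the set of characters allowed in a name component, and the idea that the same line would be used as a message body are my assumptions, not something taken from the graphite-rabbitmq tools.

```python
import re
import time

# Assumption: a path component is made of letters, digits, '_' or '-'.
_COMPONENT = re.compile(r"[A-Za-z0-9_-]+")


def format_metric(name, value, timestamp=None):
    """Return one 'name value timestamp' line for a dot-separated metric."""
    parts = name.split(".")
    # Every component between dots must be non-empty and well-formed.
    if not all(_COMPONENT.fullmatch(p) for p in parts):
        raise ValueError("bad metric name: %r" % (name,))
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("metric value must be int or float: %r" % (value,))
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)
```

A call such as format_metric("prod.web01.cpu.load", 0.75, 1239999289) yields the line "prod.web01.cpu.load 0.75 1239999289" followed by a newline.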

If you are still reading this but are not yet convinced that it’s the way to go, I’ve got one last argument. If you already use RabbitMQ to publish and consume data, wouldn’t it be nice to get powerful charts without touching your application AND without installing agents on your publishers or consumers? Recall the duplication pattern of RabbitMQ - you can fork the incoming stream of messages into another queue (without impacting your original consumers and the queues to which they attach) and set up Graphite+RabbitMQ off of this new queue.
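The duplication pattern is easier to see in a toy model. The sketch below is a pure-Python, in-memory stand-in for a topic exchange, not a RabbitMQ client: it only illustrates that binding a second queue with the same pattern gives that queue its own copy of every message while the original queue is untouched. The class, queue names and routing keys are all made up for illustration.

```python
from collections import defaultdict


def topic_matches(pattern, routing_key):
    """AMQP topic matching: '*' is exactly one word, '#' is zero or more."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may swallow zero words, or one word and then keep going.
            return match(p[1:], k) or (bool(k) and match(p, k[1:]))
        if not k:
            return False
        return p[0] in ("*", k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), routing_key.split("."))


class ToyTopicExchange:
    """In-memory stand-in for a topic exchange (hypothetical, for illustration)."""

    def __init__(self):
        self.bindings = []               # (pattern, queue name) pairs
        self.queues = defaultdict(list)  # queue name -> delivered messages

    def bind(self, queue, pattern):
        self.bindings.append((pattern, queue))

    def publish(self, routing_key, body):
        # Each queue with a matching binding gets its own copy, at most once.
        targets = {q for pat, q in self.bindings if topic_matches(pat, routing_key)}
        for q in targets:
            self.queues[q].append((routing_key, body))


exchange = ToyTopicExchange()
exchange.bind("app_queue", "prod.#")        # the existing consumer's queue
exchange.bind("graphite_queue", "prod.#")   # the new, forked queue
exchange.publish("prod.web01.requests", "42")
```

After the publish call both queues hold the same message, so a Graphite feeder can consume from graphite_queue without the original consumer noticing.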

If you are planning to run multiple carbon instances, remember that the heavy lifting (writing to disk) is actually performed by another process called carbon-persister.py (it’s started by carbon-agent, with communications over a pipe) - try to avoid multiple persisters writing data within the same hierarchy, which can cause slowdowns and possible data corruption. RabbitMQ can help you sort out which messages go where, thus minimizing this risk.
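One way to picture "sorting out what goes where" is a table that gives each hierarchy exactly one owner. The sketch below is an assumption-laden illustration: the instance names and prefixes are invented, and in a real setup the same effect would come from binding each carbon instance's queue with a pattern such as prod.db.# rather than from application code.

```python
def pick_carbon_instance(metric, prefix_to_instance):
    """Return the instance that owns the longest matching dot-separated prefix."""
    parts = metric.split(".")
    # Try the full name first, then drop one trailing component at a time.
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in prefix_to_instance:
            return prefix_to_instance[prefix]
    raise KeyError("no carbon instance owns %r" % (metric,))


# Hypothetical layout: disjoint hierarchies, one persister each.
OWNERS = {
    "prod.web": "carbon-a",
    "prod.db": "carbon-b",
    "staging": "carbon-c",
}
```

With this table, pick_carbon_instance("prod.db.db01.queries", OWNERS) returns "carbon-b", and a metric outside every listed hierarchy raises KeyError instead of silently landing on an arbitrary persister.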

I am very excited about future opportunities that a Graphite-RabbitMQ combination can deliver, and I hope someone finds my scripts useful. Both tools bring a lot of awesomeness to the table, and nicely complement each other, forming the charts and data series analytics solution you have been searching for. Check it out!

Categories: rabbitmq | python |

The Power of Knowing "Why?" in Software Engineering

I am currently reading “How Life Imitates Chess” by Garry Kasparov, after I saw a great review of the book by Baron Schwartz. Great book and I highly recommend it.

It’s got many lessons for software engineers as well. For example, in chapter 9 “Phases of the game” Kasparov talks about inexperienced players blindly following openings by famous grandmasters and how this can carry one only so far and ultimately is a trap.

Players, even club amateurs, dedicate hours to studying and memorizing the lines of the preferred opening. This knowledge is invaluable, but it can also be a trap. Many make the mistake of believing that if they know what a famous Grandmaster played in this exact position back in 1962, they don't have to think for themselves. [...] Without knowing why all the moves are made, he'll have little idea of how to continue when play inevitably advances beyond the moves he was able to store in his memory.

In software engineering, we have many conferences and online tutorials and blogs where our own Grandmasters talk about how they tackled a particular problem or resolved a particular outage. Sharing experiences is invaluable, but as Kasparov says, it can only carry you so far. Many people will blindly follow solutions described during conference talks, without understanding why it was done this way and not the other. Some people base their selection of a certain technology on the opinion of a guru. Again - without fully understanding the context and reasons behind the decision.

What I am trying to say is this: learn from other people's experiences, but don't forget to understand their context and their reasons. Your ability as a software engineer is based on your ability to adapt the solution to your needs, not simply copy it. Or if you copy, you need to know exactly why it will work for you.

Categories: software-engineering |

Don't Use OpenDNS On Servers

Are you thinking about using OpenDNS in your servers’ /etc/resolv.conf? Don’t. Why? Because when OpenDNS receives a query for a non-existing name, instead of returning NXDOMAIN (essentially, the name you’re looking for does not exist), it will return some IP, which probably is meant to catch typos, misspelt URLs or phishing attempts. Works great for humans and their browsers, not so much for your applications. NXDOMAIN is a valid result after all, and your application’s logic may depend on it.

$ dig @208.67.222.222 doesnotexist---doesnt.com

; <<>> DiG 9.4.2-P2 <<>> @208.67.222.222 doesnotexist---doesnt.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46259
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;doesnotexist---doesnt.com.	IN	A

;; ANSWER SECTION:
doesnotexist---doesnt.com. 0	IN	A	208.69.36.132

;; Query time: 14 msec
;; SERVER: 208.67.222.222#53(208.67.222.222)
;; WHEN: Fri Apr 17 14:14:49 2009
;; MSG SIZE  rcvd: 59
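If you want your own code to notice this behaviour, a rough check is to look up a name that cannot exist and see whether an address comes back anyway. The sketch below is only that, a sketch: the function name is mine, it goes through whatever resolver the machine is configured with (it does not query 208.67.222.222 directly), and using the reserved .invalid TLD as the default probe is my assumption; whether a given resolver rewrites answers for it is not something the dig output above tells us, so pass your own probe name if in doubt.

```python
import socket
import uuid


def resolver_rewrites_nxdomain(resolve=socket.gethostbyname, name=None):
    """Return True if a name that should not exist still resolves."""
    if name is None:
        # Assumption: .invalid is reserved, so a random label under it never exists.
        name = "nx-%s.invalid" % uuid.uuid4().hex
    try:
        resolve(name)
    except OSError:
        # socket.gaierror is a subclass of OSError: the lookup failed, as it should.
        # Note that a plain network outage ends up here as well.
        return False
    # An address came back for a nonexistent name, so the answer was synthesized.
    return True
```

Calling resolver_rewrites_nxdomain() at service start-up and logging a warning when it returns True is one cheap way to catch a misconfigured /etc/resolv.conf.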

Categories: linux |

Compiling Erlang On Linux With Old Glibc

I recently wanted to compile Erlang (in order to install RabbitMQ) on a Linux box with old glibc (2.3.2, from the days of Red Hat Linux 7.0). It was the only out-of-date component; everything else was quite fresh - GCC 4.3.3, binutils 2.19.1.

The version of Erlang I used was R12B-5. I configured it with ./configure --disable-x --enable-threads --disable-hipe.

But it wouldn’t build, giving me the following error:

Fatal, could not get clock_monotonic value!, errno = 22

This was strange because I had no problems building this version of Erlang on Debian Etch, even with an older compiler.

The solution was to edit all instances of config.h in the build tree (in my case, there were 2 - lib/erl_interface/src/i686-pc-linux-gnu/config.h and erts/i686-pc-linux-gnu/config.h) after running ./configure but before starting make and comment out this line:

/* Define if you want to use clock_gettime to simulate gethrtime */
/* #define GETHRTIME_WITH_CLOCK_GETTIME 1 */
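If you would rather not edit the files by hand, the same change can be scripted. This is a minimal sketch, assuming the define appears on a line of its own exactly as shown above; the function name is mine, and it simply walks the build tree, finds every config.h and wraps that one line in a C comment. Run it after ./configure and before make.

```python
import os
import re

# Matches an active '#define GETHRTIME_WITH_CLOCK_GETTIME ...' line only;
# a line that is already wrapped in /* ... */ does not start with '#'.
DEFINE = re.compile(
    r"^([ \t]*#[ \t]*define[ \t]+GETHRTIME_WITH_CLOCK_GETTIME\b.*)$",
    re.MULTILINE,
)


def comment_out_gethrtime(build_root):
    """Comment out the define in every config.h under build_root.

    Returns the sorted list of files that were actually changed.
    """
    patched = []
    for dirpath, _dirs, files in os.walk(build_root):
        if "config.h" not in files:
            continue
        path = os.path.join(dirpath, "config.h")
        with open(path) as fh:
            text = fh.read()
        new_text, count = DEFINE.subn(r"/* \1 */", text)
        if count:
            with open(path, "w") as fh:
                fh.write(new_text)
            patched.append(path)
    return sorted(patched)
```

Running it a second time changes nothing, so it is safe to call again after re-running ./configure.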

Categories: erlang | linux |

Eliminating Single Points of Failure - One, Two, Many

I recently reached an interesting conclusion. When you are trying to eliminate a single point of failure from your architecture, it’s almost always beneficial to first go with a 2-way redundant solution (active-passive or active-active pair, whichever is easiest to implement) and move to N-way (N > 2) only afterwards, and only if necessary.

One huge difference between a pair and N-way (N>2) is how difficult it is to detect partitioning (of CAP Theorem fame - you can simultaneously achieve only two properties from the following three: data Consistency, high Availability and Partition tolerance). Assuming symmetrical communications (A can talk to B if and only if B can talk to A), partitioning detection in a pair is trivial, because there can be only one option - system A can’t talk to system B. With N>2 however, there are way more scenarios to deal with: A can’t talk to B while both A and B can talk to C, A can’t talk to either B or C, etc. Additionally, communications may be restored in some random order - A may first be able to talk to B, and only some time later get its visibility to C back.
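A quick count shows how fast the number of scenarios grows. The sketch below assumes symmetric links, as above, and simply counts every way in which at least one link between two nodes can be down. It is a raw count of link states under my own simplification: many of these states are equivalent once you rename the nodes, so read it as an upper bound on what detection logic has to tell apart, not as a number of distinct code paths.

```python
from itertools import combinations


def failure_scenarios(n):
    """Count non-empty sets of broken links among n nodes with symmetric links."""
    links = list(combinations(range(n), 2))  # one link per unordered pair of nodes
    return 2 ** len(links) - 1               # every subset of links except "all up"


for n in (2, 3, 4, 5):
    print(n, failure_scenarios(n))
# prints:
# 2 1
# 3 7
# 4 63
# 5 1023
```

A pair has exactly one failure state to detect, which matches the "only one option" observation; three nodes already have seven.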

Interestingly, also from personal experience, once you manage to build 3-way redundancy, building 4-way or even 5-way is not that much more difficult.

There are also a couple of purely practical aspects that make 2-way redundancy an attractive option, even if it’s going to be an intermediate step before N-way is achieved. First, 2-way can serve as a working prototype - you can observe it, learn and analyze its failure scenarios and make sure your response to each is optimal. This can validate your approach before you sink a lot of time into partitioning detection for N-way.

And secondly, after you build the easier 2-way, you might well discover that you don’t need N-way redundancy at all. If a pair meets your goal (say a given percentage of service availability), you can save a lot of time and effort.

My advice - don’t skip two on your way from one to many.

Categories: distributed | software-engineering |
