On rabbitmqctl and badrpc,nodedown
In the true spirit of the open source community that has formed around RabbitMQ over the past several years and keeps growing every week, we recently tackled an issue on the mailing list: running “rabbitmqctl status” returns a “badrpc,nodedown” response even though the broker is running, as evidenced by ps output. Check out the thread “broker runs; can't get status” on the list. The issue centers on Erlang's security and communication mechanism in distributed mode. Here is a tentative step-by-step procedure that can help you resolve it.
Are you running the broker as user rabbitmq and rabbitmqctl as root or rabbitmq? If not, please stop and fix this. This is not a requirement per se, but it is the canonical installation of the RabbitMQ broker, and it matters because rabbitmqctl and the broker must use the same Erlang cookie, which is stored per user. You can certainly hack your scripts to work around this, but you are on your own if you do.
Double-check that the broker is in fact running. Use ps and netstat -lptn (look for port 5672 unless you overrode it in /etc/default/rabbitmq). Telnet to localhost on port 5672, type AMQP and press ENTER several times. You should get a response that shows at least AMQ. Check the logs in /var/log/rabbitmq to verify that the broker saw your connection attempt.
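If you would rather script the telnet check, here is a minimal Python sketch of it. It mimics typing AMQP and pressing ENTER a few times, then looks for the AMQ marker in whatever comes back. The function names probe_broker and looks_like_amqp are made up for this example, and the exact bytes a broker replies with depend on its version, so treat the output as a hint rather than proof.

```python
import socket

# Mimics typing "AMQP" and pressing ENTER several times in telnet.
# This is deliberately not a valid protocol header.
TELNET_LIKE_INPUT = b"AMQP\r\n\r\n\r\n"


def looks_like_amqp(reply: bytes) -> bool:
    """True if the reply starts with the 'AMQ' marker mentioned above."""
    return reply.startswith(b"AMQ")


def probe_broker(host="localhost", port=5672, timeout=3.0):
    """Return the first bytes sent back, or None if nothing usable answered."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(TELNET_LIKE_INPUT)
            return conn.recv(64)
    except OSError:
        # Covers refused connections, timeouts and resolution failures.
        return None


if __name__ == "__main__":
    reply = probe_broker()
    if reply is None:
        print("nothing answered on port 5672")
    else:
        print("reply:", reply, "looks like AMQP:", looks_like_amqp(reply))
```

If the port was changed in /etc/default/rabbitmq, pass it explicitly, e.g. probe_broker(port=5673).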
The next step is to start "erl -sname foo -cookie coo" in a shell and run this command: "net_adm:names()."
If this command returns ok followed by a list of nodes within 1 or 2 seconds, check whether rabbit is there. If it is, it is very likely that you have the users mixed up as described above. Please double-check. If the rabbit node is not listed, double-check that the rabbit broker is still running.
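For the curious, net_adm:names() asks EPMD for its list of registered nodes, and the same question can be asked from outside Erlang. The sketch below is an illustration only: it assumes the names request of the Erlang distribution protocol (a 2-byte length followed by request code 110, answered with a 4-byte port number and lines of the form "name rabbit at port N"). The helper names are invented for this example.

```python
import re
import socket
import struct

NAMES_REQ = b"\x00\x01n"  # 2-byte length prefix, then request code 110 ('n')
NODE_LINE = re.compile(r"^name (\S+) at port (\d+)$")


def parse_names_reply(reply: bytes):
    """Split an EPMD names reply into (epmd_port, [(node, port), ...])."""
    if len(reply) < 4:
        raise ValueError("reply too short to contain the EPMD port number")
    (epmd_port,) = struct.unpack(">I", reply[:4])
    nodes = []
    for line in reply[4:].decode("utf-8", "replace").splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            nodes.append((match.group(1), int(match.group(2))))
    return epmd_port, nodes


def epmd_names(host="localhost", port=4369, timeout=2.0):
    """Ask EPMD on host for its registered nodes; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(NAMES_REQ)
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # EPMD closes the connection after replying
                break
            chunks.append(chunk)
    return parse_names_reply(b"".join(chunks))


if __name__ == "__main__":
    epmd_port, nodes = epmd_names()
    print("EPMD port:", epmd_port)
    print("rabbit registered:", any(name == "rabbit" for name, _ in nodes))
```

A timeout or a refused connection here surfaces as an exception, which roughly corresponds to the error cases discussed next.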
If this command returns {error,address}, there is a problem with your instance of EPMD, the Erlang port mapper daemon that handles node naming (man net_adm). First, check that it is running (it most likely is). Then, in Erlang, run “net_adm:localhost().” Exit Erlang and try to connect to exactly the name you got from net_adm:localhost() on port 4369 (EPMD). This should fail and time out. If it does not time out, you should not have gotten {error,address}.
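The connection test can be scripted as well. This is a rough Python sketch, not an official tool: check_epmd_reachable is a name made up here, and you should pass it the exact name that net_adm:localhost() printed. Using socket.gethostname() as a stand-in, as the example at the bottom does, is only an approximation and may give a different name.

```python
import socket


def check_epmd_reachable(name, port=4369, timeout=3.0):
    """Classify a TCP connection attempt to EPMD under the given host name."""
    try:
        with socket.create_connection((name, port), timeout=timeout):
            return "connected"
    except socket.gaierror:
        return "unresolvable"  # the name does not resolve at all
    except socket.timeout:
        return "timeout"  # resolves, but nothing answers in time
    except OSError:
        return "refused-or-unreachable"


if __name__ == "__main__":
    # Assumption: gethostname() approximates what net_adm:localhost() returns.
    name = socket.gethostname()
    print(name, "->", check_epmd_reachable(name))
```

A result of "timeout" matches the situation described above; "connected" suggests the {error,address} has some other cause.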
The most likely problem is that the name returned by net_adm:localhost() is associated, in /etc/hosts or in DNS, with an IP address that is either not reachable from this host or firewalled off. An entry in /etc/hosts that associates this name with 127.0.0.1 or with one of the other IPs on this server should fix the problem.
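To see what /etc/hosts currently says about that name, a small sketch like the following may help. It only reads the hosts file and does not consult DNS. The function name addresses_for is invented for this example, and socket.gethostname() is again only an approximation of net_adm:localhost().

```python
import socket


def addresses_for(hosts_text, name):
    """Return the addresses that a hosts-file text maps to the given name."""
    wanted = name.lower()
    found = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) >= 2 and wanted in (f.lower() for f in fields[1:]):
            found.append(fields[0])
    return found


if __name__ == "__main__":
    name = socket.gethostname()  # assumption: close to net_adm:localhost()
    with open("/etc/hosts") as handle:
        print(name, "->", addresses_for(handle.read(), name))
```

An empty list means the name is not in /etc/hosts, so whatever address it resolves to comes from DNS or another source.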
Alternatively, net_adm:names() may time out with {error, timeout}. We have seen this caused by snoopy in the past. Remove snoopy, or at least do not install it system-wide in /etc/ld.so.preload, and you should be fine.
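A quick way to check whether snoopy is preloaded system-wide is to look at /etc/ld.so.preload. The sketch below assumes the file is a whitespace- or colon-separated list of library paths, and it treats # as a comment marker, which is an assumption on our part. It matches on the substring "snoopy", which is a heuristic, and the function names are made up for this example.

```python
def preloaded_libraries(preload_text):
    """List the library paths named in ld.so.preload-style text."""
    libs = []
    for line in preload_text.splitlines():
        line = line.split("#", 1)[0]  # assumption: '#' starts a comment
        libs.extend(line.replace(":", " ").split())
    return libs


def snoopy_entries(preload_text):
    """Return preloaded libraries whose path mentions snoopy (heuristic)."""
    return [lib for lib in preloaded_libraries(preload_text)
            if "snoopy" in lib.lower()]


if __name__ == "__main__":
    try:
        with open("/etc/ld.so.preload") as handle:
            print("snoopy entries:", snoopy_entries(handle.read()))
    except FileNotFoundError:
        print("/etc/ld.so.preload does not exist, nothing is preloaded there")
```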
If these steps did not help, please leave a comment below or join rabbitmq-discuss and ask your question there, and we’ll help!
UPDATE 2009-04-17: Your host's name as shown by net_adm:localhost() does not technically need to be defined in /etc/hosts. But if it is not there, it should not be defined anywhere: when you run "ping name", you should get "ping: unknown host name". I have seen at least one case where it worked this way. This is somewhat unverified, though.