I'm having trouble with my cloud installation.
I've installed everything from the Ubuntu Server 11.04 CD, then upgraded the system to the latest version (apt-get update && apt-get dist-upgrade), installed NTP, set it up correctly, and booted everything. The cloud runs fine for some time, but then the cloud controller / cluster controller stops talking to one of the node controllers!
One of the symptoms is that euca-describe-availability-zones verbose suddenly starts showing fewer resources than the appropriate (and initial) values.
If I look at cc.log, it shows:
[Mon Sep 19 19:09:07 2011][002531][EUCADEBUG ] DEBUG: requested URI http://10.20.200.10:8775/axis2/services/EucalyptusNC
[Mon Sep 19 19:09:07 2011][002531][EUCADEBUG ] ncClientCall(ncDescribeResource): ppid=13403 client calling 'ncDescribeResource'
[Mon Sep 19 19:09:07 2011][002531][EUCAERROR ] ERROR: DescribeResource() could not be invoked (check NC host, port, and credentials)
[Mon Sep 19 19:09:07 2011][002531][EUCADEBUG ] ncClientCall(ncDescribeResource): ppid=13403 done calling 'ncDescribeResource' with exit code '1'
Then, in axis2c.log on the corresponding node controller, I see this:
[Mon Sep 19 19:10:58 2011] [error] rampart_timestamp_token.c(179) [rampart]Timestamp not valid: Created time is not valid
[Mon Sep 19 19:10:58 2011] [error] rampart_sec_header_processor.c(612) [rampart]Timestamp is not valid
[Mon Sep 19 19:10:58 2011] [error] rampart_sec_header_processor.c(1911) [rampart]Timestamp processing failed
[Mon Sep 19 19:10:58 2011] [error] rampart_in_handler.c(143) [rampart][rampart_in_handler] Security Header processing failed.
[Mon Sep 19 19:10:58 2011] [error] phase.c(233) Handler RampartInHandler invoke failed within phase Security
[Mon Sep 19 19:10:58 2011] [error] engine.c(696) Invoking phase Security failed
[Mon Sep 19 19:10:58 2011] [error] engine.c(279) Invoking operation specific phases failed for operation ncDescribeResource
[Mon Sep 19 19:10:58 2011] [error] rampart_engine.c(159) [rampart][rampart_engine] Cannot get saved rampart_context
[Mon Sep 19 19:10:58 2011] [error] rampart_out_handler.c(136) [rampart][rampart_out_handler] ramaprt_context creation failed.
[Mon Sep 19 19:10:58 2011] [error] phase.c(233) Handler RampartOutHandler invoke failed within phase MessageOut
[Mon Sep 19 19:10:58 2011] [error] engine.c(696) Invoking phase MessageOut failed
So: there's a time synchronization problem.
However, NTP is installed, and correctly configured.
One thing I've noticed, by issuing lots of ntpq -np, is that the NC stops working once the time offset is positive. As long as the offset stays negative, everything works fine. The offsets are very small (around 5 ms; the absolute maximum I have seen is 10 ms).
Googling, I've found this rampart code: http://wso2.org/project/wsf/php/1.1.0/docs/code-coverage/rampartc/src/util/.libs/rampart_timestamp_token.c.gcov.html
/*Check whether created is less than current time or not*/
current_val = rampart_generate_time(env, 0);
validity = rampart_compare_date_time(env, current_val, created_val);
if (validity == AXIS2_SUCCESS)
{
    AXIS2_LOG_ERROR(env->log, AXIS2_LOG_SI, "[rampart][ts]Timestamp not valid: Created time is not valid");
    AXIS2_FREE(env->allocator, current_val);
    current_val = NULL;
    return AXIS2_FAILURE;
}
As we can see, it apparently allows for deviation in time in one direction, but not in the other.
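To make clear what I mean, here is a minimal sketch of the two behaviours as I understand them. The function names and the millisecond-since-epoch representation are my own hypothetical choices for illustration. This is not Rampart's actual code, and I'm assuming the check boils down to comparing the message's Created time with the receiver's clock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of Rampart.
   All times are milliseconds since the epoch. */

/* One-sided check: reject any Created time that is later than the
   receiver's clock, even by a single millisecond. */
bool created_ok_strict(int64_t receiver_now_ms, int64_t created_ms)
{
    return created_ms <= receiver_now_ms;
}

/* Tolerant check: accept a Created time that is at most skew_ms
   ahead of the receiver's clock. */
bool created_ok_with_skew(int64_t receiver_now_ms, int64_t created_ms,
                          int64_t skew_ms)
{
    return created_ms <= receiver_now_ms + skew_ms;
}
```

With the receiver's clock at 1000 ms, a Created time of 1005 ms fails the strict check but passes the tolerant one with a 60000 ms allowance. A Created time of 995 ms passes both, which would match what I'm seeing with small offsets of one sign working and the other not.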
Am I missing something? Am I the only one facing this issue? Isn't it stupid to verify timestamps with millisecond precision and allow only negative deviations?!
How do people deal with this problem? What can I do to keep my cloud alive?
I thought of some solutions:
- Patch rampart, to simply remove the timestamp verification
- Patch rampart, to allow for positive deviations as well
- Find a way to make ntp or ntpdate adjust the time to some specific offset behind the server's reference time
- Write my own time synchronization tool
What do you think?
EDIT: it looks like it is possible to disable Rampart in Axis2 configuration, but I can't figure out how to do that!
EDIT 2: The version of Rampart available in Ubuntu's repositories is 1.3.0, which is from 2007 or 2008... the latest released version is something like 1.6.0, from June 2011. Apparently this latest version allows packets "from the future". I'd really like to find this latest version in a PPA!
EDIT 3: I've found some parameters that change Rampart 1.3.0's behaviour: TimeToLive, ClockSkewBuffer and PrecisionInMilliseconds. I've added them (360, 60 and false, respectively) to EucalyptusNC.xml and EucalyptusCC.xml, and things are getting better. Occasionally I still see the timestamp error messages in the logs, but they are very rare now. I've also disabled NTP on the NCs and created a cron script (that runs every hour) to sync time with the CC using ntpdate -b.
EDIT 4: Apparently this is a bug in Ubuntu's packaging of Eucalyptus. I've filed a bug on Launchpad, following recommendations from people in #eucalyptus on Freenode: https://bugs.launchpad.net/ubuntu/+source/eucalyptus/+bug/854946