I'm having trouble with my cloud installation.
I've installed everything from the Ubuntu Server 11.04 CD, then upgraded the system to the latest version (apt-get update && apt-get dist-upgrade), installed NTP, set it up correctly, and booted everything. The cloud runs OK for some time, but then the cloud controller / cluster controller stops talking to one of the node controllers!
One of the symptoms is that euca-describe-availability-zones verbose suddenly starts showing fewer resources than the appropriate (and initial) values.
If I look at cc.log, it shows:
[Mon Sep 19 19:09:07 2011][002531][EUCADEBUG ] DEBUG: requested URI http://10.20.200.10:8775/axis2/services/EucalyptusNC
[Mon Sep 19 19:09:07 2011][002531][EUCADEBUG ] ncClientCall(ncDescribeResource): ppid=13403 client calling 'ncDescribeResource'
[Mon Sep 19 19:09:07 2011][002531][EUCAERROR ] ERROR: DescribeResource() could not be invoked (check NC host, port, and credentials)
[Mon Sep 19 19:09:07 2011][002531][EUCADEBUG ] ncClientCall(ncDescribeResource): ppid=13403 done calling 'ncDescribeResource' with exit code '1'
Then, in the corresponding node controller's axis2c.log, I see this:
[Mon Sep 19 19:10:58 2011] [error] rampart_timestamp_token.c(179) [rampart]Timestamp not valid: Created time is not valid
[Mon Sep 19 19:10:58 2011] [error] rampart_sec_header_processor.c(612) [rampart]Timestamp is not valid
[Mon Sep 19 19:10:58 2011] [error] rampart_sec_header_processor.c(1911) [rampart]Timestamp processing failed
[Mon Sep 19 19:10:58 2011] [error] rampart_in_handler.c(143) [rampart][rampart_in_handler] Security Header processing failed.
[Mon Sep 19 19:10:58 2011] [error] phase.c(233) Handler RampartInHandler invoke failed within phase Security
[Mon Sep 19 19:10:58 2011] [error] engine.c(696) Invoking phase Security failed
[Mon Sep 19 19:10:58 2011] [error] engine.c(279) Invoking operation specific phases failed for operation ncDescribeResource
[Mon Sep 19 19:10:58 2011] [error] rampart_engine.c(159) [rampart][rampart_engine] Cannot get saved rampart_context
[Mon Sep 19 19:10:58 2011] [error] rampart_out_handler.c(136) [rampart][rampart_out_handler] ramaprt_context creation failed.
[Mon Sep 19 19:10:58 2011] [error] phase.c(233) Handler RampartOutHandler invoke failed within phase MessageOut
[Mon Sep 19 19:10:58 2011] [error] engine.c(696) Invoking phase MessageOut failed
So: there's a time synchronization problem.
However, NTP is installed, and correctly configured.
One thing I've noticed, by issuing lots of ntpq -np, is that the NC stops working once the time offset is positive. If the offset stays negative, everything works fine. The offsets are very small (around 5 ms; the absolute maximum I could see is 10 ms).
Googling, I've found this rampart code: http://wso2.org/project/wsf/php/1.1.0/docs/code-coverage/rampartc/src/util/.libs/rampart_timestamp_token.c.gcov.html
/*Check whether created is less than current time or not*/
current_val = rampart_generate_time(env, 0);
validity = rampart_compare_date_time(env, current_val, created_val);
if (validity == AXIS2_SUCCESS)
{
    AXIS2_LOG_ERROR(env->log, AXIS2_LOG_SI, "[rampart][ts]Timestamp not valid: Created time is not valid");
    AXIS2_FREE(env->allocator, current_val);
    current_val = NULL;
    return AXIS2_FAILURE;
}
As we can see, it apparently allows for deviation in time in one direction, but not in the other.
Am I missing something? Am I the only one facing this issue? Isn't it stupid to verify timestamps with millisecond precision and allow only negative deviations?!
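To make the one-sided behaviour concrete, here is a minimal sketch in C. It is not Rampart's code: the function names and the plain millisecond timestamps are assumptions made for illustration, and the strict variant only mimics what the quoted snippet appears to do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers, not part of Rampart or Axis2/C.
 * All times are milliseconds since an arbitrary epoch. */

/* Strict check in the spirit of the quoted snippet: a Created time
 * later than the receiver's clock is rejected, however small the gap. */
bool created_ok_strict(int64_t created_ms, int64_t now_ms)
{
    return created_ms <= now_ms;
}

/* Tolerant variant: accept a Created time that is up to skew_ms
 * ahead of the receiver's clock. */
bool created_ok_with_skew(int64_t created_ms, int64_t now_ms,
                          int64_t skew_ms)
{
    return created_ms <= now_ms + skew_ms;
}
```

With these definitions, a message created at 1005 when the receiver's clock reads 1000 fails created_ok_strict but passes created_ok_with_skew with any skew of 5 ms or more, which is the kind of few-millisecond gap seen between the CC and the NC.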
How do people deal with this problem? What can I do to keep my cloud alive?
I thought of some solutions:
- Patch rampart, to simply remove the timestamp verification
- Patch rampart, to allow for positive deviations as well
- Find a way to make ntp or ntpdate adjust the time to some specific offset behind the server's reference time
- Write my own time synchronization tool
What do you think?
EDIT: it looks like it is possible to disable Rampart in Axis2 configuration, but I can't figure out how to do that!
EDIT 2: The version of Rampart available on Ubuntu's repositories is 1.3.0, which is from 2007 or 2008... the latest released version is something like 1.6.0, from June 2011. Apparently this latest version allows packets "from the future". I'd really like to find this latest version from a PPA!
EDIT 3: I've found some parameters to change Rampart 1.3.0's behaviour: TimeToLive, ClockSkewBuffer and PrecisionInMilliseconds. I've added them (360, 60 and false, respectively) to EucalyptusNC.xml and EucalyptusCC.xml, and things are getting better. Occasionally I still see the timestamp error messages in the logs, but they are very rare now. I've also disabled NTP on the NCs and created a cron script (that runs every hour) to sync time (ntpdate -b) with the CC.
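As a rough model of how those three settings could interact, here is a sketch in C. The semantics are assumptions, not taken from Rampart's source: the struct, both function names, the symmetric use of the skew buffer, and the truncation of Created to whole seconds when millisecond precision is off are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the three settings; not Rampart's types. */
typedef struct {
    int64_t ttl_s;           /* TimeToLive, in seconds (360 above)       */
    int64_t clock_skew_s;    /* ClockSkewBuffer, in seconds (60 above)   */
    bool    precision_in_ms; /* PrecisionInMilliseconds (false above)    */
} ts_policy;

/* Sender side: the Created value written into the message.
 * Assumes non-negative times. Without millisecond precision the
 * value is truncated to a whole second. */
int64_t make_created_ms(const ts_policy *p, int64_t sender_now_ms)
{
    if (p->precision_in_ms)
        return sender_now_ms;
    return sender_now_ms - (sender_now_ms % 1000);
}

/* Receiver side: accept the message if Created is not too far in the
 * future and Created + TTL is not too far in the past, where "too far"
 * is the skew buffer (assumed to apply in both directions). */
bool timestamp_valid(const ts_policy *p, int64_t created_ms,
                     int64_t receiver_now_ms)
{
    int64_t skew_ms    = p->clock_skew_s * 1000;
    int64_t expires_ms = created_ms + p->ttl_s * 1000;

    if (created_ms > receiver_now_ms + skew_ms)
        return false;   /* created too far in the future */
    if (receiver_now_ms > expires_ms + skew_ms)
        return false;   /* expired, even allowing for skew */
    return true;
}
```

Under this model, a policy of {360, 60, false} accepts a message whose Created time is a few milliseconds ahead of the receiver's clock and rejects one that is more than 60 seconds ahead or more than 420 seconds old. It does not explain the occasional errors that still appear in the logs.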
EDIT 4: Apparently this is a bug in Ubuntu's packaging of Eucalyptus. I've filed a bug on Launchpad, following recommendations from people in #eucalyptus on Freenode: https://bugs.launchpad.net/ubuntu/+source/eucalyptus/+bug/854946
It is my understanding that the version of rampartc is 1.3.0 while the current version of axis2c is 1.6.0. So it is the current version.
We have not seen that problem with synchronization: if the times are within 5 minutes, it usually works.
You have hit upon a key issue with software virtualization in general and, by extension, cloud-based virtualization: the clock is not grounded in a hardware clock, and it will float relative to it depending on the overall busyness of the underlying host operating system. Occasionally the physical clock and the virtual clock will sync up, causing a clock jump when this occurs. There are lots of apps where this clock jump can play havoc with performance. Where you need a really high-precision clock for timing or auditing purposes, you may need to move to physical hosting as opposed to virtual hosting on the internet.