We recently set up an Ultra Monkey load balancer with 2 real servers
and 95% of the time it seems to be working perfectly, but every now
and then our customers are getting "Page cannot be displayed" errors.
What's the average/peak request rate and size?
It happens at different stages on our websites and we can't seem to
reproduce the problem here. Our customers are very large Fortune 500
companies, so we assume that their networking etc. is top of the line,
and since it is occurring with multiple customers we assume the fault
lies in our architecture. Our LB environment is as follows:
Ultra Monkey box:
# Global Directives
Can you correlate any log messages from ldirectord with the 5% page
display problems? Since you seem to have a very high timeout value for
your persistence and no indication of expire_nodest_conn, it's not easy
to pinpoint the problem. What kind of application is running behind the
services? Does the application logic span both services within the
session's lifetime? Does the fallback work?
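One way to start correlating is to compare ldirectord's log timestamps
with the IPVS state on the director. A sketch, assuming the ipvsadm
userland tool is installed there:

```shell
# Show the configured virtual services, their schedulers and
# persistence timeouts (numeric output, no DNS lookups)
ipvsadm -L -n

# Dump the connection table: entries still pointing at a real
# server that ldirectord has removed are a likely culprit
ipvsadm -L -c -n

# Check whether connections to removed destinations get expired
cat /proc/sys/net/ipv4/vs/expire_nodest_conn
```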
# Controls IP packet forwarding
net.ipv4.ip_forward = 1
# Controls source route verification
net.ipv4.conf.default.rp_filter = 1
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
Besides the strange comment, enabling this can be helpful at times.
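For reference, directives like the ones above are typically applied
like this (the /etc/sysctl.conf path is an assumption; adjust to
wherever your distribution keeps them):

```shell
# Reload all settings from the sysctl configuration file
sysctl -p /etc/sysctl.conf

# Or set a single key at runtime, e.g. re-enable SysRq
sysctl -w kernel.sysrq=1
```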
1: I assume that because we are using masq (NAT), we don't need to
worry about the noarp problem that affects DR or TUN?
2: Is there any IP tuning that we should do on the Ultra Monkey box,
as it is acting not only as the load balancer but also as a router?
Only if you experience performance problems. So I'd like to ask back if
you've previously seen any indication of such problems in your log files
(including kernel log: dmesg -s 100000).
3: Has anybody else seen this intermittent "Page cannot be displayed"
error with UM?
Sure, but there are tons of possibilities for this to happen. I can
envision that ldirectord takes one of the real servers out and, due to
the high service template timeout, the missing expire_nodest_conn
setting, and probably other issues, client requests are still being
forwarded to the non-functional RS, which will definitely cause such a
message to be displayed in the client's browser.
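A sketch of the settings alluded to here, assuming a kernel that
exposes the IPVS sysctls under /proc/sys/net/ipv4/vs/:

```shell
# Expire connections whose destination real server has been
# removed from the service, instead of silently forwarding
# their packets into the void
echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn

# With persistence enabled, also expire the persistence
# templates of a quiesced (weight 0) real server so that new
# sessions are no longer bound to it
echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template
```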
For your own amusement, I've allowed myself to quote the KB241344
article from Microsoft:
This is perhaps a wonderful example of why Microsoft is so much more
successful than others: no mention of tcpdump/windump to their users,
and of course it's always the fault of the user :).
Roberto Nibali, ratz
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc