Hi,
I am currently trying to get to the bottom of a problem where my
LVS director seems to drop a packet coming from a client from time to
time. We have this problem on our production systems and can reproduce
it on staging.
Our setup:
===========
We are using ipvsadm on CentOS 5 x86_64 in a paravirtualized Xen DomU.
Current Version details:
Kernel: 2.6.18-348.1.1.el5xen
ipvsadm: 1.24-13.el5
LVS setup:
- We use IPVS in DR mode; the running connections are managed by lvs-kiss.
- LVS runs in a Heartbeat v1 cluster (two virtual nodes); master and
backup run constantly on both nodes.
- For the LVS services we use logical IPs set up by Heartbeat
(active/passive cluster mode).
- The real servers are physical Linux machines.
Network setup:
The VM acting as director runs as a Xen PV DomU on a Dom0 using
bridged networks.
Networks in play:
- abn network (staging network): connects the client to the director,
is used by the real servers to send their answers directly to the
clients (direct-routing approach), and carries the ipvsadm
master/backup multicast sync traffic.
- lvs network: a dedicated VLAN which connects the director and the
real servers.
The DR ARP problem is solved by suppressing ARP answers for the
service IP on the real servers; the service IP is configured as a
logical IP on the lvs interface of the real servers.
In this setup ip_forwarding is not needed anywhere (neither on the
director nor on the real servers).
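For reference, a common way to suppress those ARP answers in DR setups is via the arp_ignore/arp_announce sysctls on the real servers. This is only a sketch: the interface name below is an assumption, and in this setup the same effect is achieved by placing the VIP only on the dedicated lvs interface.

```shell
# Sketch of the usual DR-mode ARP suppression on a real server.
# eth1 (the lvs VLAN interface) is an assumed name; adjust to the real setup.
sysctl -w net.ipv4.conf.all.arp_ignore=1      # answer ARP only for addresses on the incoming interface
sysctl -w net.ipv4.conf.eth1.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2    # pick the best local source address for outgoing ARP
sysctl -w net.ipv4.conf.eth1.arp_announce=2
```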
VM details:
1 GB RAM, 2 vCPUs, system load almost 0; memory: 73 MB free, 224 MB
in buffers, 536 MB cache, no swap.
top almost always shows 100% idle and 0% us/sy/ni/wa/hi/si/st.
Configuration details:
ipvsadm -Ln for the service in question shows:
TCP x.y.183.217:12405 wrr persistent 7200
-> 192.168.83.234:12405 Route 1000 0 0
-> 192.168.83.235:12405 Route 1000 0 0
x.y: the first two octets are from our internal class B range.
We use 192.168.83.x as the lvs network for staging.
Persistent ipvsadm-configuration:
/etc/sysconfig/ipvsadm: --set 20 20 20
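For completeness, the service shown by `ipvsadm -Ln` above could be recreated by hand roughly like this (a sketch only; in our setup lvs-kiss manages the real servers, and `x.y` stays elided as in the listing):

```shell
# Hypothetical manual recreation of the virtual service above.
ipvsadm -A -t x.y.183.217:12405 -s wrr -p 7200          # wrr scheduler, 7200s persistence
ipvsadm -a -t x.y.183.217:12405 -r 192.168.83.234:12405 -g -w 1000   # -g = gatewaying (DR)
ipvsadm -a -t x.y.183.217:12405 -r 192.168.83.235:12405 -g -w 1000
ipvsadm --set 20 20 20    # tcp / tcpfin / udp timeouts, as in /etc/sysconfig/ipvsadm
```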
Cluster-configuration:
/etc/ha.d/haresources: $primary_directorname lvs-kiss x.y.183.217
lvs-kiss-configuration-snippet for the service above:
<VirtualServer idm-abn:12405>
  ServiceType tcp
  Scheduler wrr
  DynamicScheduler 0
  Persistance 7200
  QueueSize 2
  Fuzz 0.1
  <RealServer rs1-lvs:12405>
    PacketForwardingMethod gatewaying
    Test ping -c 1 -nq -W 1 rs1-lvs >/dev/null
    RunOnFailure "/sbin/ipvsadm -d -t idm-abn:12405 -r rs1-lvs"
    RunOnRecovery "/sbin/ipvsadm -a -t idm-abn:12405 -r rs1-lvs"
  </RealServer>
  <RealServer rs2-lvs:12405>
    PacketForwardingMethod gatewaying
    Test ping -c 1 -nq -W 1 rs2-lvs >/dev/null
    RunOnFailure "/sbin/ipvsadm -d -t idm-abn:12405 -r rs2-lvs"
    RunOnRecovery "/sbin/ipvsadm -a -t idm-abn:12405 -r rs2-lvs"
  </RealServer>
</VirtualServer>
idm-abn, rs1 and rs2 resolve via /etc/hosts.
About the service:
This is a SOA web service.
How we reproduce the error:
From a client we call the web service continuously, one call every
three seconds.
From time to time the director answers the client with a connection
reset.
Interesting: this happens on the (n * 100 + 1)th try - the "+ 1" is
the interesting part.
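To keep track of that pattern while reproducing, a small helper can flag whether a failing attempt number is of the form n * 100 + 1. This is only a sketch; the sample numbers in the loop are hypothetical, not taken from our logs.

```shell
#!/bin/sh
# Succeeds when the attempt number fits the observed n*100 + 1 pattern (n >= 1).
fits_pattern() {
    [ "$1" -gt 1 ] && [ $(( ($1 - 1) % 100 )) -eq 0 ]
}

# Hypothetical failing attempt numbers, for illustration only:
for n in 101 201 305 401; do
    if fits_pattern "$n"; then
        echo "$n fits the pattern"
    else
        echo "$n does not fit"
    fi
done
```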
What we did to trace down the problem:
- Checked /proc/sys/net/ipv4/vs: all values are at their defaults, so
drop_packet is NOT in effect (= 0).
- Ran tcpdump on the client, on the frontend/abn and backend/lvs
interfaces of the director, and on the lvs and abn interfaces of the
real servers.
In these tcpdumps we could see a request from the client being
answered with a connection reset by the director.
The packet was NOT forwarded via LVS.
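To narrow this down further, a capture restricted to SYN and RST segments on the director makes it easier to see which side emits the reset - a sketch, assuming eth0 is the frontend/abn interface:

```shell
# Capture only SYN and RST segments for the service port on the director's
# frontend interface (eth0 is an assumed name) to isolate the reset source.
tcpdump -ni eth0 'tcp port 12405 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
```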
I welcome any ideas on how to track this problem down further.
If any information needed to drill down on the problem is unclear or
missing - please ask.
Kind regards
Nils Hildebrand