Recently we ran into an issue on our Kubernetes cluster. More or less frequently we were being punished with a 5-second delay on ingress, egress and service-to-service calls in the cluster. We are using Azure AKS configured with advanced networking, which uses CNI, implemented with Azure CNI. We have a vnet in the /20 range with a single subnet taking up the entire range, i.e., /20. The topology is flat with no overlay network, and the pods get IP addresses in the subnet range. I plan on writing a post on the topology of Azure AKS advanced networking.
One of the benefits of using Kubernetes as a platform for deploying, scaling and monitoring applications, instead of something vendor-specific, is the broader community. Everyone was being hit by this issue (AWS, GKE and AKS alike), so everyone was working on a fix. Some of the "big companies" were experiencing this as well, which probably made it a top priority.
I started investigating based on something I heard from a service request: that this was only affecting Alpine-based images and not Debian.
Due to a race condition bug in the Linux kernel, some packets to the internal k8s DNS service, among others, get lost. The default behavior is for the DNS lookup to wait for a response to both the A and AAAA requests, even if one of them is lost, which eventually results in a timeout. Setting the single-request-reopen flag for the pod overrides the DNS resolver settings for the containers in the pod so that they do not keep listening for the lost A or AAAA reply.
Debian, Alpine issue?
It is fairly easy to look up our base image. Looking at the Dockerfile for one of the pods with slow responses, I found:
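The actual Dockerfile is not reproduced here, but the relevant part looked something like this (simplified illustration; the app name and paths are placeholders, not our real ones):

```dockerfile
# The base image is what matters for the DNS question
FROM microsoft/dotnet:2.1-aspnetcore-runtime
WORKDIR /app
COPY ./publish .
ENTRYPOINT ["dotnet", "App.dll"]
```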
So we are actually using microsoft/dotnet:2.1-aspnetcore-runtime as the base image. But is that based on Debian? A quick search for it on store.docker.com answered that.
OK, so what is the 2.1-aspnetcore-runtime tag telling us?
There we go! The tag points at a Dockerfile built on a Debian (stretch-slim) base image. So the hypothesis that this only affects Alpine and not Debian is wrong in our case.
We were running Kubernetes version 1.10.6, and I knew that version 1.11.3 was available. I read the release notes to look for breaking changes before doing an upgrade: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md. Everything looked fine in our case, so we went ahead.
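For reference, an AKS upgrade can be done from the Azure CLI; the resource group and cluster names below are placeholders, not our actual ones:

```shell
# List the versions this cluster can move to
az aks get-upgrades --resource-group <resource-group> --name <cluster-name> --output table

# Upgrade the control plane and nodes
az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.11.3
```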
The upgrade was quick and painless. But the DNS timeout persisted.
A little background info. In Kubernetes there is an abstraction called a service. Pods can die and be rescheduled as a result of an application error, or just because the k8s scheduler is trying to optimize the application execution. This can happen at any time. As part of the rescheduling the pod gets a new IP, so other applications cannot use pod IPs directly. The service acts as an abstraction layer that gives other applications a DNS name for the service IP. This is an internal DNS that k8s keeps up to date for you. If you want to see the IP addresses behind a service, you can look at the endpoints object:
kubectl get endpoints --all-namespaces
If you have looked at the pods running in the kube-system namespace, you will have found kube-proxy. It is responsible for implementing a form of virtual IP for services, and there is one kube-proxy pod running on each node. The mode this pod runs in depends on your Kubernetes version and setup; see https://kubernetes.io/docs/concepts/services-networking/service/#virtual-ips-and-service-proxies for more information about the different modes. Either way, kube-proxy ultimately provides a stable virtual IP for each service, and the internal DNS gives that IP a reliable name that other pods in the cluster can use.
In our case we are using https://traefik.io/ as an ingress controller. It is responsible for routing different HTTP requests to a given service. In other words, it uses the k8s service abstraction by calling the DNS name. We were seeing a 5-second increase for each subsequent k8s service request that was part of one initial request to the load balancer.
A consistent 5-second delay turns out to be a good indication of a DNS issue. Some searches on GitHub suggest that this issue goes back to Kubernetes 1.7: https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-378334492.
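A quick way to check whether you are hitting this is to time name resolution from inside an affected pod. This small probe is my own illustration, not from any of the linked sources; a healthy lookup returns in milliseconds, while an affected one clusters around the resolver's 5-second retry timeout:

```python
import socket
import time

def timed_lookup(host, port=80):
    """Time a name resolution. A consistent ~5 s result points at a lost
    A or AAAA packet and the glibc resolver's default 5-second timeout."""
    start = time.monotonic()
    socket.getaddrinfo(host, port)
    return time.monotonic() - start

# "localhost" resolves locally, so this should be near-instant.
print(f"localhost resolved in {timed_lookup('localhost'):.3f}s")
```

Run it in a loop against an in-cluster service name to see how often the 5-second penalty shows up.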
I also found this post, https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02, suggesting that some packets could get lost during DNS lookups because of a race condition in the Linux kernel. This aligns well with our experience: we saw the frequency of delays go up with the number of applications deployed, and the delays were not consistent. Sometimes there were delays, other times not. The internal DNS in k8s is itself a service.
So what happens when one of the packets gets lost? I found the answer by looking at a setting for pods called dnsConfig: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.11/#poddnsconfigoption-v1-core. Here I can set different DNS resolver options. One of them is single-request-reopen. From http://man7.org/linux/man-pages/man5/resolv.conf.5.html:
single-request-reopen (since glibc 2.9) Sets RES_SNGLKUPREOP in _res.options. The resolver uses the same socket for the A and AAAA requests. Some hardware mistakenly sends back only one reply. When that happens the client system will sit and wait for the second reply. Turning this option on changes this behavior so that if two requests from the same port are not handled correctly it will close the socket and open a new one before sending the second request.
In our case it is not hardware that mistakenly sends back only one reply, but the kernel race condition.
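To make the mechanics concrete, here is a toy simulation (my own sketch, not from any of the linked sources). The "flaky server" answers the first packet from a given source port and drops the second, mimicking the conntrack race losing one of the two parallel queries. With the default behavior, the client sends A and AAAA from one socket and then waits in vain for the second reply; what single-request-reopen does is retry on a fresh socket, which gets through:

```python
import socket
import threading
import time

def flaky_server(sock, seen):
    # Answer a query unless we have already seen a packet from that source
    # port, simulating the conntrack race dropping the second parallel query.
    while True:
        try:
            data, addr = sock.recvfrom(1024)
        except OSError:
            return  # socket closed, shut down
        if addr in seen:
            continue  # drop: second packet from the same socket is lost
        seen.add(addr)
        sock.sendto(b"reply:" + data, addr)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server_addr = server.getsockname()
threading.Thread(target=flaky_server, args=(server, set()), daemon=True).start()

TIMEOUT = 0.5  # stand-in for the resolver's 5 s timeout

# Default behavior: both queries share one socket; one reply never comes.
c = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
c.settimeout(TIMEOUT)
c.sendto(b"A", server_addr)
c.sendto(b"AAAA", server_addr)
c.recvfrom(1024)  # reply to the first query arrives
start = time.monotonic()
try:
    c.recvfrom(1024)  # waits in vain for the second reply
    timed_out = False
except socket.timeout:
    timed_out = True
elapsed = time.monotonic() - start
c.close()

# single-request-reopen behavior: after the timeout, retry on a fresh socket
# (new source port), so the server no longer drops the packet.
c2 = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
c2.settimeout(TIMEOUT)
c2.sendto(b"AAAA", server_addr)
retry_reply, _ = c2.recvfrom(1024)
c2.close()
server.close()

print(timed_out, round(elapsed, 1), retry_reply)
```

The real glibc resolver does the same dance: without the option it blocks for the full timeout; with it, it closes the socket and resends the unanswered query from a new one.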
So, to recap: due to a race condition bug in the Linux kernel, some packets to the internal DNS service, among others, get lost. The default behavior is for the DNS lookup to keep waiting for the missing response, resulting in a timeout. Setting the single-request-reopen flag for the pod overrides the DNS resolver settings for all containers in the pod.
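Concretely, the workaround is a dnsConfig block in the pod spec (or in a deployment's pod template); the pod and container names here are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # illustrative name
spec:
  containers:
    - name: app
      image: microsoft/dotnet:2.1-aspnetcore-runtime
  dnsConfig:
    options:
      - name: single-request-reopen   # merged into the containers' resolv.conf
```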
This resolved the issue for us. Rumor has it that k8s 1.11.4 solves the underlying issue, making the dnsConfig workaround obsolete. I will investigate when I get the time and update this post.