Troubleshooting Services
Service stuck in <pending>
If a service is stuck in the <pending> state, there are a number of places to begin looking!
Are all the components running?
For a load balancer service to be created successfully, ensure the following components are running (a quick check is shown after the list):
- A Cloud controller manager, such as the kube-vip-cloud-provider
- The kube-vip pods (either as a daemonset or as static pods)
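As a quick sanity check (assuming the pods contain "kube-vip" in their names, as in the default manifests; names may differ in your deployment), list both sets of pods across all namespaces:
kubectl get pods -A | grep kube-vip
Both the kube-vip pods and the kube-vip-cloud-provider pod should report a STATUS of Running.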
Is kube-vip running with services enabled?
Look at the logs of the kube-vip pods to determine if services are enabled:
kubectl logs -n test kube-vip-ds-9kbgv
time="2022-10-07T09:44:23Z" level=info msg="Starting kube-vip.io [v0.5.0]"
time="2022-10-07T09:44:23Z" level=info msg="namespace [kube-system], Mode: [ARP], Features(s): Control Plane:[false], Services:[true]"
The Services:[true] entry is what is required.
Is an address being assigned?
The <pending> state is only removed from a service once its status is updated; however, to rule out the cloud controller we can examine the service to see if an IP was allocated.
kubectl get svc nginx -o yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-vip.io/vipHost: k8s04
    "kube-vip.io/loadbalancerIPs": "1.1.1.1"
  labels:
    implementation: kube-vip
    ipam-address: 192.168.0.220
  name: nginx
  namespace: default
spec:
...
  loadBalancerIP: 192.168.0.220
The above example shows that the kube-vip.io/loadbalancerIPs annotation was populated with an IP by the cloud controller, which means the problem lies with the kube-vip pods themselves.
Since Kubernetes 1.24, the loadBalancerIP field is deprecated. It is recommended to use the annotation rather than the command line or service.spec.loadBalancerIP to specify the IP.
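As an illustration (a minimal sketch assuming the nginx service from the example above and an address of 192.168.0.220), the annotation can be set on an existing service with kubectl annotate:
kubectl annotate service nginx "kube-vip.io/loadbalancerIPs=192.168.0.220"
Alternatively, add the annotation under metadata.annotations in the service manifest itself.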
Examining the kube-vip pods
Checking the logs of the kube-vip pods should hopefully reveal why they are failing to advertise the IP to the outside world and to update the status of the service.
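For example (assuming kube-vip runs as a daemonset named kube-vip-ds in the kube-system namespace, as in the default manifests; adjust the names to match your deployment):
kubectl logs -n kube-system daemonset/kube-vip-ds
Note that this returns logs from only one pod of the daemonset; to inspect a specific node, target the pod running on that node as in the earlier kubectl logs example.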
If kubectl doesn't work
Sometimes kubectl can't talk to the cluster, which makes it difficult to troubleshoot why the control plane node isn't working. This is likely due to the API server and etcd pods crashing, which results in kube-vip crashing.
If a new control plane node is unstable, there may be an issue with your Container Runtime Interface (CRI) cgroup configuration if using containerd on a systemd-based distro.
Check the stability of your Control Plane Node's Pods
To check the stability of your control plane pods when kubectl is unusable, you can use crictl:
crictl ps -a
Or to watch the pods over a period of time:
watch -n 1 crictl ps -a
If you see the control plane pods (etcd, kube-apiserver, etc.) show a mix of "Exited" and "Running" and the "ATTEMPT" counters are going up every minute or so, it is likely the CRI is not configured correctly.
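If a particular container keeps exiting, its logs can also be inspected directly through the CRI; the container ID below is a placeholder taken from the crictl ps -a output:
crictl logs <container-id>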
On a system using containerd (sometimes installed as a dependency of docker) for the CRI and systemd for the init system, the cgroup driver in containerd needs to be configured for systemd.
Without the systemd cgroup driver, it appears containers are frequently sent the SIGTERM signal.
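To see which cgroup driver containerd is currently configured with (assuming the default configuration path of /etc/containerd/config.toml; the file may not exist if containerd is running purely on defaults):
grep SystemdCgroup /etc/containerd/config.toml
If this prints SystemdCgroup = false, or the file is missing, change the driver as described below.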
Set containerd to use systemd cgroups
containerd needs the cgroup driver set to systemd when a systemd init system is present on your distro. To do this, you can execute the following three commands to generate the containerd config and set the option:
sudo mkdir -p /etc/containerd
sudo containerd config default | sed 's/SystemdCgroup = false/SystemdCgroup = true/' | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd.service
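As a quick verification (assuming the config path used above), confirm that containerd came back up and then watch the control plane pods settle using the commands from earlier in this section:
sudo systemctl is-active containerd
watch -n 1 crictl ps -a
The ATTEMPT counters should stop climbing once the control plane pods stabilize.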
If you have already attempted to init a new control plane node with kubeadm, and it is the first node in a new cluster, you can then reset and init it again with the following commands:
sudo kubeadm reset -f
sudo kubeadm init .....