Cluster Troubleshooting

This page describes some basic issues we have faced while deploying and operating the cluster.

1. MongoDB Restarts

We define the following in the mongo-ss.yaml file:

    cpu: 200m
    memory: 5G

When the MongoDB cache occupies a memory greater than 5GB, it is terminated by the kubelet. This can usually be verified by logging in to the worker node running MongoDB container and looking at the syslog (the journalctl command should usually work).

This issue is resolved in PR #1757.

2. 502 Bad Gateway Error on Runscope Tests

It means that NGINX could not find the appropriate backed to forward the requests to. This typically happens when:

  1. MongoDB goes down (as described above) and BigchainDB, after trying for BIGCHAINDB_DATABASE_MAXTRIES times, gives up. The Kubernetes BigchainDB Deployment then restarts the BigchainDB pod.
  2. BigchainDB crashes for some reason. We have seen this happen when updating BigchainDB from one version to the next. This usually means the older connections to the service gets disconnected; retrying the request one more time, forwards the connection to the new instance and succeed.

3. Service Unreachable

Communication between Kubernetes Services and Deployments fail in v1.6.6 and before due to a trivial key lookup error for non-existent services in the kubelet. This error can be reproduced by restarting any public facing (that is, services using the cloud load balancer) Kubernetes services, and watching the kube-proxy failure in its logs. The solution to this problem is to restart kube-proxy on the affected worker/agent node. Login to the worker node and run:

docker stop `docker ps | grep k8s_kube-proxy | cut -d" " -f1`

docker logs -f `docker ps | grep k8s_kube-proxy | cut -d" " -f1`

This issue is fixed in Kubernetes v1.7.

4. Single Disk Attached to Multiple Mountpoints in a Container

This is currently the issue faced in one of the clusters and being debugged by the support team at Microsoft.

The issue was first seen on August 29, 2017 on the Test Network and has been logged in the Azure/acs-engine repo on GitHub.

This is apparently fixed in Kubernetes v1.7.2 which include a new disk driver, but is yet to tested by us.

5. MongoDB Monitoring Agent throws a dial error while connecting to MongoDB

You might see something similar to this in the MongoDB Monitoring Agent logs:

Failure dialing host without auth. Err: `no reachable servers`
    at monitoring-agent/components/dialing.go:278
    at monitoring-agent/components/dialing.go:116
    at monitoring-agent/components/dialing.go:213
    at src/runtime/asm_amd64.s:2086

The first thing to check is if the networking is set up correctly. You can use the (maybe using the toolbox container).

If everything looks fine, it might be a problem with the Preferred Hostnames setting in MongoDB Cloud Manager. If you do need to change the regular expression, ensure that it is correct and saved properly (maybe try refreshing the MongoDB Cloud Manager web page to see if the setting sticks).

Once you update the regular expression, you will need to remove the deployment and add it again for the Monitoring Agent to discover and connect to the MongoDB instance correctly.

More information about this configuration is provided in this document.

6. Create a Persistent Volume from existing Azure disk storage Resource

When deleting a k8s cluster, all dynamically-created PVs are deleted, along with the underlying Azure storage disks (so those can’t be used in a new cluster). resources are also deleted thus cannot be used in a new cluster. This workflow will preserve the Azure storage disks while deleting the k8s cluster and re-use the same disks on a new cluster for MongoDB persistent storage without losing any data.

The template to create two PVs for MongoDB Stateful Set (One for MongoDB data store and the other for MongoDB config store) is located at mongodb/mongo-pv.yaml.

You need to configure diskName and diskURI in mongodb/mongo-pv.yaml file. You can get these values by logging into your Azure portal and going to Resource Groups and click on your relevant resource group. From the list of resources click on the storage account resource and click the container (usually named as vhds) that contains storage disk blobs that are available for PVs. Click on the storage disk file that you wish to use for your PV and you will be able to see NAME and URL parameters which you can use for diskName and diskURI values in your template respectively and run the following command to create PVs:

$ kubectl --context <context-name> apply -f mongodb/mongo-pv.yaml


Please make sure the storage disks you are using are not already being used by any other PVs. To check the existing PVs in your cluster, run the following command to get PVs and Storage disk file mapping.

$ kubectl --context <context-name> get pv --output yaml