This document describes how to troubleshoot common DC/OS failure scenarios.
An instance running DC/OS can run out of disk space, which can cause DC/OS services to fail.
This can leave agents unhealthy and unresponsive to the Mesos master.
Follow the steps below to troubleshoot this scenario.
If you can identify the source of the high disk usage,
clean it up to reclaim the disk space.
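To find what is consuming the disk, a quick sketch (the paths are examples; on DC/OS agents /var/lib/docker is a common culprit):

```shell
# Show which filesystem is running out of space.
df -h

# List the largest directories under /var; -x stays on one filesystem.
# Run with sudo on the agent to avoid permission-denied gaps.
du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -n 15
```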
If Docker is failing or misconfigured,
the /var/lib/docker/overlay folder can fill up the disk.
- If you can identify the offending Docker container, stop it; it is not advised to delete the instance.
- Disk space used by Jenkins containers.
We have seen scenarios where Jenkins backups accumulate and use up the disk space.
Stopping the backup plugin and deleting the backups from the container can resolve the issue.
- If none of the scenarios above applies or helps,
stop the instance from the console and wait for the auto-scaling configuration to bring up new agents.
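The two container-related cleanups above can be sketched as follows. The container name `jenkins` and the backup path `/var/jenkins_home/backup` are assumptions; substitute the values from your deployment.

```shell
# Show per-container disk usage (the SIZE column includes the
# writable layer) to spot the offending container.
docker ps -a -s

# Stop the offending container instead of deleting the instance.
docker stop jenkins                                  # assumed container name

# Remove the accumulated Jenkins backups inside the container.
docker exec jenkins rm -rf /var/jenkins_home/backup  # assumed backup path
```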
In this scenario, the iptables rules on the instances are set up in a way
that blocks the marathon-lb ports, resulting in timeouts.
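To confirm the rules are the problem, list the current iptables entries on an affected agent and look for DROP or REJECT rules covering the marathon-lb ports (which ports marathon-lb uses depends on your configuration):

```shell
# List all rules with line numbers; DROP/REJECT entries here are
# candidates for blocking marathon-lb traffic.
sudo iptables -L -n --line-numbers | grep -E 'DROP|REJECT'
```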
You can use the script below to back up and then flush the iptables rules on every agent:
for i in $(curl -sS leader-ip:5050/slaves | jq -r '.slaves[].hostname'); do ssh "core@$i" "mkdir -p debugging && sudo iptables-save >> debugging/iptables-backup && sudo iptables -F"; done
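The loop above uses jq to extract agent hostnames from the Mesos /slaves endpoint; here is that extraction in isolation on a trimmed sample payload (the IPs are made up):

```shell
# Shape of the JSON returned by leader-ip:5050/slaves (trimmed).
payload='{"slaves":[{"hostname":"10.0.1.11"},{"hostname":"10.0.1.12"}]}'

# -r prints raw strings, so the surrounding quotes do not need stripping.
echo "$payload" | jq -r '.slaves[].hostname'
# → 10.0.1.11
# → 10.0.1.12
```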
This scenario occurs when the Marathon service running on one of the masters fails and does not restart on its own.
Check the service logs on that master to find the failure reason:
journalctl -u dcos-marathon.service
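Once the logs point at the cause, restarting the unit on the affected master usually brings Marathon back; a minimal sketch:

```shell
# Restart Marathon and confirm the unit is running again.
sudo systemctl restart dcos-marathon.service
systemctl is-active dcos-marathon.service   # prints "active" once recovered
```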