Fault Tolerance
This page describes the fault-tolerant behavior of Anaconda Enterprise.
The status of core Enterprise services and user deployments can be monitored from the Anaconda Enterprise Operations Center.
Anaconda Enterprise can be deployed with automatically provided service redundancy and fault tolerance. It uses automatic service restarts and health monitoring to remain operational if a process halts or a worker node becomes unavailable. Additional levels of fault tolerance, such as service migration, are provided if there are at least three nodes in the deployment. However, the Master Node cannot currently be configured for automatic failover and is a single point of failure.
When Anaconda Enterprise is deployed to a cluster with three or more nodes, the core services are automatically configured in a fault-tolerant mode. This applies whether the cluster starts with three or more nodes or reaches that size later: as soon as three or more nodes are available, the service fault tolerance features take effect.
Service migration is only possible when there are multiple worker nodes in the deployment.
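The node-count rules above can be summarized as a small decision function. This is an illustrative sketch only and is not part of Anaconda Enterprise: the function name `fault_tolerance_features` and its return keys are hypothetical. It also assumes that a cluster has exactly one Master Node and that the "three or more nodes" threshold counts the Master Node, which the text above does not state explicitly.

```python
def fault_tolerance_features(worker_nodes: int) -> dict:
    """Summarize the documented fault tolerance rules for a cluster.

    Hypothetical helper, not an Anaconda Enterprise API.
    Assumption: the cluster is one Master Node plus `worker_nodes` workers,
    and the three-node threshold includes the Master Node.
    """
    if worker_nodes < 0:
        raise ValueError("worker_nodes must be non-negative")

    total_nodes = 1 + worker_nodes
    return {
        # Service restarts and health monitoring apply to every deployment.
        "automatic_restart": True,
        # Core services enter fault-tolerant mode at three or more nodes.
        "fault_tolerant_mode": total_nodes >= 3,
        # Migration needs three or more nodes and multiple worker nodes.
        "service_migration": total_nodes >= 3 and worker_nodes >= 2,
        # The Master Node cannot currently fail over automatically.
        "master_automatic_failover": False,
    }
```

Under these assumptions, a cluster with one Master Node and two worker nodes is the smallest one for which `service_migration` is true.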
Core Services
Anaconda Enterprise core services will automatically be restarted or, if possible, migrated in the event of any service failure.
User Deployments
User-initiated project deployments will automatically be restarted or, if possible, migrated in the event of any failure.
Worker Nodes
If any worker node becomes unresponsive or unavailable, it will be flagged while the core Enterprise services and backend continue to run without interruption. If additional worker nodes are available, the services that had been running on the failed worker node will be migrated or restarted on the remaining live worker nodes. This migration may take a few minutes.
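The worker-failure behavior described above can be pictured with a toy model. The sketch below is not how Anaconda Enterprise implements migration; the function `migrate_services`, the node and service names, and the "least-loaded worker" placement policy are all assumptions made for illustration.

```python
def migrate_services(assignments, failed_node):
    """Move services off a failed worker onto the remaining live workers.

    Toy model only; placement policy (least-loaded worker) is an assumption.
    assignments: dict mapping worker node name -> list of service names.
    Returns (new_assignments, stranded_services).
    """
    if failed_node not in assignments:
        raise KeyError(failed_node)

    # Copy the live workers so the caller's data is not modified.
    live = {node: list(services)
            for node, services in assignments.items()
            if node != failed_node}

    if not live:
        # No other worker is available, so nothing can be migrated.
        return live, list(assignments[failed_node])

    for service in assignments[failed_node]:
        # Choose the live worker with the fewest services (ties by name).
        target = min(live, key=lambda node: (len(live[node]), node))
        live[target].append(service)
    return live, []


cluster = {"worker-1": ["a", "b"], "worker-2": ["c"], "worker-3": []}
new_cluster, stranded = migrate_services(cluster, "worker-1")
# new_cluster == {"worker-2": ["c", "b"], "worker-3": ["a"]}, stranded == []
```

With a single worker node, the model returns every service as stranded, which mirrors the statement that migration requires multiple worker nodes.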
New worker nodes can be added to the Enterprise cluster from the Anaconda Enterprise Operations Center.
Storage and Persistency Layer
Anaconda Enterprise does not automatically configure fault tolerance for the storage or persistency layer when using the default storage and persistency services. These services include the database, Git server, and object storage. If you have configured Anaconda Enterprise to use external storage and persistency services, you will need to configure those services for fault tolerance yourself.
Recovering Anaconda Enterprise After Node Failure
The Master Node is a single point of failure. If it fails, it can be recovered using the steps outlined on the Master Node Failure Recovery page.