Manage failovers
Trigger a failover
You can trigger a failover manually using the Temporal Cloud Web UI, the tcld CLI, or the Cloud Ops API.
Always check the replication lag before initiating a failover. A forced failover when there is a significant replication lag has a higher likelihood of rolling back Workflow progress.
Web UI
- Visit the Namespace page on the Temporal Cloud Web UI.
- Navigate to your Namespace details page and select the Trigger a failover option from the menu.
- Confirm your action. After confirmation, Temporal initiates the failover.
tcld
To manually trigger a failover, run the following command in your terminal:
tcld namespace failover \
  --namespace <namespace_id>.<account_id> \
  --region <target_region>
The <target_region> must be the name of a region (example: us-east-1) where the Namespace has a replica that is
ready to be failed over to (replica state is Activated).
If using API key authentication with the --api-key flag, you must add it directly after the tcld command and before
namespace failover.
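When a script drives tcld, the flag ordering described above is easy to get wrong. The following is a minimal Python sketch that assembles the argument list so that --api-key sits directly after tcld. The helper name build_failover_command and the sample values are hypothetical; only the tcld flags shown on this page are used.

```python
import shlex
from typing import List, Optional


def build_failover_command(namespace: str, region: str,
                           api_key: Optional[str] = None) -> List[str]:
    """Assemble the argv for `tcld namespace failover` (hypothetical helper)."""
    argv = ["tcld"]
    if api_key is not None:
        # The flag goes directly after `tcld`, before the `namespace failover` subcommand.
        argv += ["--api-key", api_key]
    argv += ["namespace", "failover",
             "--namespace", namespace,
             "--region", region]
    return argv


if __name__ == "__main__":
    # Placeholder values; substitute your own Namespace, region, and key.
    cmd = build_failover_command("my-ns.a1b2c", "us-east-1", api_key="EXAMPLE_KEY")
    # Print the command rather than running it; pass `cmd` to subprocess.run to execute.
    print(shlex.join(cmd))
```

Returning a list rather than a single string avoids shell quoting problems if the command is later passed to subprocess.run.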
Cloud Ops API
You can trigger a failover programmatically using the Cloud Ops API. The API is available via both HTTP and gRPC.
Using HTTP
Send a POST request to the FailoverNamespaceRegion endpoint:
POST https://saas-api.tmprl.cloud/cloud/namespaces/<namespace>/failover-region
Request body:
{
  "region": "<target_region>",
  "asyncOperationId": "<optional_async_operation_id>"
}
- region (required): The region code of the region to failover to. Must be a region where the Namespace has a replica in Activated replica state, indicating the replica is ready to be failed over to. Example: aws-us-east-1
- asyncOperationId (optional): A user-defined ID for tracking the async operation. If not set, the server will assign one.
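As an illustration of the HTTP call, here is a minimal Python sketch that builds the POST request with the standard library. The function name build_failover_request is hypothetical, and the Authorization header with a bearer API key is an assumption about how your account authenticates. The URL and body fields are the ones shown above.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

API_BASE = "https://saas-api.tmprl.cloud"


def build_failover_request(namespace: str, region: str, api_key: str,
                           async_operation_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the failover-region POST request."""
    body = {"region": region}
    if async_operation_id is not None:
        # Optional: the server assigns an ID when this field is omitted.
        body["asyncOperationId"] = async_operation_id
    return urllib.request.Request(
        url=f"{API_BASE}/cloud/namespaces/{quote(namespace, safe='')}/failover-region",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: API key sent as a bearer token.
            "Authorization": f"Bearer {api_key}",
        },
    )
```

To send the request, pass the returned object to urllib.request.urlopen and parse the JSON response, which contains the async operation.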
Using gRPC
Use the FailoverNamespaceRegion RPC with a FailoverNamespaceRegionRequest:
message FailoverNamespaceRegionRequest {
  // The namespace to failover.
  string namespace = 1;
  // The id of the region to failover to.
  // Must be a region that the namespace is currently available in.
  string region = 2;
  // The id to use for this async operation - optional.
  string async_operation_id = 3;
}
Both methods return a FailoverNamespaceRegionResponse containing an async operation that you can use to track the failover status.
The Temporal Cloud Terraform provider does not support triggering failovers. You must use the Web UI, tcld CLI, or Cloud Ops API.
Once the failover async operation returns successfully, the Namespace will be failed over. Temporal manages retries for the failover workflow. In the rare event that an internal error prevents the failover from completing, the Temporal on-call team is automatically paged to intervene and force the failover to completion.
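Tracking the async operation usually means polling until it reaches a terminal state. The sketch below shows one way to structure that loop in Python. It is deliberately generic: fetch_state is a hypothetical callable that you implement against the Cloud Ops API, and the state names are supplied by the caller because this page does not list them.

```python
import time
from typing import Callable, Iterable


def wait_for_operation(fetch_state: Callable[[], str],
                       success_states: Iterable[str],
                       failure_states: Iterable[str],
                       interval_s: float = 5.0,
                       timeout_s: float = 600.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_state() until a terminal state is seen or the timeout expires."""
    success, failure = set(success_states), set(failure_states)
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state in success:
            return state
        if state in failure:
            raise RuntimeError(f"failover operation ended in state {state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"operation still in state {state!r} after {timeout_s}s")
        sleep(interval_s)
```

The sleep parameter is injectable so the loop can be exercised in tests without waiting. A timeout here only means the caller stopped waiting; it says nothing about whether the failover itself will complete.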
Post-failover event information
After any failover, whether triggered by you or by Temporal, an event appears in both the
Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs. The
audit log entry for Failover uses the "operation": "FailoverNamespace" event. After failover, the replica becomes
active, taking over from the original region.
Temporal Cloud notifies you via email whenever there is a failover event.
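If you export audit logs for analysis, failover entries can be picked out by their operation value. The following Python sketch assumes the logs are available as one JSON object per line, which may not match your export format. Apart from "operation": "FailoverNamespace", any field names in your data are outside what this page documents.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def failover_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield audit log entries whose operation is FailoverNamespace."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if isinstance(entry, dict) and entry.get("operation") == "FailoverNamespace":
            yield entry
```

For example, passing an open file object as lines yields each matching entry as a dictionary.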
Return to the primary with failbacks
After a Temporal-managed (automatic) failover, Temporal Cloud automatically fails back to the original region once it is healthy. Follow Temporal's status page for updates on the original region's health.
After a Temporal-managed failover
When Temporal triggers an automatic failover due to an outage, Temporal will also trigger an automatic failback to the original region once the region recovers. No action is required from you.
If you prefer to manage failback yourself, you have two options:
- Opt out of automatic failback (manage failback manually): Disable Temporal-managed failovers on the Namespace. When you're ready to fail back to the original region, trigger a failover to that region and then re-enable Temporal-managed failovers.
- Stay on the new region permanently ("fail forward"): Trigger a failover to the region that is already active. This tells Temporal that you want to treat the new region as your primary for as long as it's healthy. Temporal-managed automatic failovers remain enabled, so Temporal will still protect you if the new region has an outage.
After a user-triggered failover
If you triggered a failover yourself during an outage (instead of relying on a Temporal-managed failover), Temporal will not automatically fail back for you. You must trigger a failover back to the original region when it is healthy. Monitor Temporal's status page for updates on region health.
Automatic failback is only available after Temporal-managed (automatic) failovers.
How to check whether your Namespace will be automatically failed back
If you are not sure whether your Namespace will be automatically failed back, check the list of failovers in the Temporal Cloud Web UI on your Namespace's detail page:
- If the most recent failover was Temporal-triggered, then Temporal will automatically fail back the Namespace when the original region is healthy.
- If the most recent failover was user-triggered, then the Namespace will not be automatically failed back. You must trigger the failback yourself.
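The rule above can be written as a small check. This Python sketch encodes only that rule. The FailoverRecord type and its "temporal" and "user" labels are hypothetical stand-ins for whatever you record from the failover list in the Web UI.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FailoverRecord:
    # Hypothetical labels: "temporal" for Temporal-triggered, "user" for user-triggered.
    triggered_by: str


def will_auto_fail_back(history: Sequence[FailoverRecord]) -> bool:
    """Return True if the most recent failover was Temporal-triggered.

    `history` is ordered oldest to newest. An empty history means there is
    nothing to fail back from.
    """
    if not history:
        return False
    return history[-1].triggered_by == "temporal"
```

Only the last entry matters: a user-triggered failover after a Temporal-triggered one, including a "fail forward", turns automatic failback off.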
Disable Temporal-initiated failovers
When you add a replica to a Namespace, Temporal Cloud automatically fails over the Namespace to its replica in the event of an outage. This is the recommended and default option.
If you prefer to disable Temporal-initiated failovers and handle your own failovers, follow these instructions:
Web UI
- Navigate to the Namespace detail page in Temporal Cloud.
- Choose the "Disable Temporal-initiated failovers" option.
tcld
To disable Temporal-initiated failovers, run the following command in your terminal:
tcld namespace update-high-availability \
  --namespace <namespace_id>.<account_id> \
  --disable-auto-failover=true
If using API key authentication with the --api-key flag, you must add it directly after the tcld command and before
namespace update-high-availability.
To restore the default behavior, unselect the option in the Web UI or change true to false in the CLI command.
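If you switch this setting from automation, a single helper can produce both forms of the command. This is a minimal Python sketch; the function name build_auto_failover_command is hypothetical, and it uses only the flags shown above.

```python
from typing import List


def build_auto_failover_command(namespace: str, disable: bool) -> List[str]:
    """Assemble the argv that disables (True) or restores (False) Temporal-initiated failovers."""
    flag_value = "true" if disable else "false"
    return [
        "tcld", "namespace", "update-high-availability",
        "--namespace", namespace,
        f"--disable-auto-failover={flag_value}",
    ]
```

Calling it with disable=False produces the command that restores the default behavior.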
Workers and failovers
Enabling High Availability for Namespaces does not require specific Worker configuration. When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption.
When a Namespace fails over to a replica in a different region, Workers communicate with the Namespace across regions, which adds latency.
- If your application cannot tolerate this latency, deploy a second set of Workers in the replica's region or opt for a replica in the same region.
- In the case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to the secondary region.
Temporal Cloud enforces a maximum connection lifetime of 5 minutes, which gives your Workers an opportunity to re-resolve the DNS.
Test failovers
Temporal recommends regular failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your application continues to function even when parts of the infrastructure fail.
If this is your first time performing a failover test, run it with a test-specific Namespace and application. Practice runs help ensure the process runs smoothly during real incidents in production.