Manage failovers

Trigger a failover

You can trigger a failover manually using the Temporal Cloud Web UI, the tcld CLI, or the Cloud Ops API.

Check your replication lag

Always check the replication lag before initiating a failover. A forced failover when there is a significant replication lag has a higher likelihood of rolling back Workflow progress.
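
The lag check can be turned into an explicit gate in a runbook script. The following is a minimal sketch, not a Temporal API: the function name, the 30-second threshold, and the way the lag value is obtained are all assumptions made for illustration. Read the actual lag from your own monitoring and pick a threshold that fits your tolerance for rolled-back progress.

```python
from datetime import timedelta

# Hypothetical threshold (an assumption, not a Temporal recommendation).
MAX_ACCEPTABLE_LAG = timedelta(seconds=30)


def safe_to_fail_over(replication_lag: timedelta,
                      max_lag: timedelta = MAX_ACCEPTABLE_LAG) -> bool:
    """Return True when the observed lag is within the accepted threshold."""
    if replication_lag < timedelta(0):
        raise ValueError("replication lag cannot be negative")
    return replication_lag <= max_lag


if __name__ == "__main__":
    # In practice this value comes from your monitoring; it is hard-coded here.
    observed = timedelta(seconds=4)
    if safe_to_fail_over(observed):
        print("Lag is within the threshold; proceed with the failover.")
    else:
        print("Lag is too high; wait before triggering the failover.")
```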

  1. Visit the Namespace page on the Temporal Cloud Web UI.
  2. Navigate to your Namespace details page and select the Trigger a failover option from the menu.
  3. Confirm your action. After confirmation, Temporal initiates the failover.

Terraform not supported

The Temporal Cloud Terraform provider does not support triggering failovers. You must use the Web UI, tcld CLI, or Cloud Ops API.

Once the asynchronous failover operation returns successfully, the Namespace has failed over. Temporal manages retries for the failover workflow. In the rare event that an internal error prevents the failover from completing, the Temporal on-call team is automatically paged to intervene and force the failover to completion.

Post-failover event information

After any failover, whether triggered by you or by Temporal, an event appears in both the Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs. The audit log entry for Failover uses the "operation": "FailoverNamespace" event. After failover, the replica becomes active, taking over from the original region.
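
If you export audit logs for your own tooling, failover entries can be picked out by their operation value. The sketch below assumes the logs are available as one JSON object per line, which may not match your export format. Only the "operation": "FailoverNamespace" pair comes from the text above; the function name and every other field are hypothetical.

```python
import json
from typing import Iterable, Iterator


def failover_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed audit log entries whose operation is FailoverNamespace."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("operation") == "FailoverNamespace":
            yield entry


if __name__ == "__main__":
    # Hypothetical sample; real entries carry more fields than shown here.
    sample_lines = [
        '{"operation": "FailoverNamespace", "id": "example-1"}',
        '{"operation": "UpdateNamespace", "id": "example-2"}',
    ]
    for event in failover_events(sample_lines):
        print(event)
```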

Temporal Cloud notifies you via email whenever there is a failover event.

Return to the primary with failbacks

After a Temporal-managed (automatic) failover, Temporal Cloud automatically fails back to the original region once it is healthy. Follow Temporal's status page for updates on the original region's health.

After a Temporal-managed failover

When Temporal triggers an automatic failover due to an outage, Temporal will also trigger an automatic failback to the original region once the region recovers. No action is required from you.

If you prefer to manage failback yourself, you have two options:

  • Opt out of automatic failback (manage failback manually): Disable Temporal-managed failovers on the Namespace. When you're ready to fail back to the original region, trigger a failover to that region and then re-enable Temporal-managed failovers.

  • Stay on the new region permanently ("fail forward"): Trigger a failover to the region that is already active. This tells Temporal that you want to treat the new region as your primary for as long as it's healthy. Temporal-managed automatic failovers remain enabled, so Temporal will still protect you if the new region has an outage.

After a user-triggered failover

If you triggered a failover yourself during an outage (instead of relying on a Temporal-managed failover), Temporal will not automatically fail back for you. You must trigger a failover back to the original region when it is healthy. Monitor Temporal's status page for updates on region health.

Automatic failback is only available after Temporal-managed (automatic) failovers.

How to check whether your Namespace will be automatically failed back

If you are not sure whether your Namespace will be automatically failed back, check the list of failovers in the Temporal Cloud Web UI on your Namespace's detail page:

  • If the most recent failover was Temporal-triggered, then Temporal will automatically fail back the Namespace when the original region is healthy.
  • If the most recent failover was user-triggered, then the Namespace will not be automatically failed back. You must trigger the failback yourself.
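
The rule above reduces to a check on the most recent failover. Here is a minimal sketch, assuming you have copied the failover list from the Web UI into your own records. The types and field names are hypothetical and do not correspond to a Temporal API. The sketch also does not cover a Namespace where Temporal-managed failovers were disabled after the failover took place.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Sequence


class Trigger(Enum):
    TEMPORAL = "temporal"  # Temporal-managed (automatic) failover
    USER = "user"          # failover triggered by you


@dataclass(frozen=True)
class FailoverRecord:
    occurred_at: datetime
    trigger: Trigger


def will_auto_fail_back(history: Sequence[FailoverRecord]) -> bool:
    """Return True only if the most recent failover was Temporal-triggered."""
    if not history:
        return False  # no failover has happened, so there is nothing to fail back
    latest = max(history, key=lambda record: record.occurred_at)
    return latest.trigger is Trigger.TEMPORAL
```

Because the check looks only at the latest record, a user-triggered "fail forward" after a Temporal-managed failover correctly yields False.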

Disable Temporal-initiated failovers

When you add a replica to a Namespace, Temporal Cloud automatically fails over the Namespace to its replica in the event of an outage. This is the recommended and default option.

If you prefer to disable Temporal-initiated failovers and handle your own failovers, follow these instructions:

  1. Navigate to the Namespace detail page in Temporal Cloud.
  2. Choose the "Disable Temporal-initiated failovers" option.

To restore the default behavior, clear the option in the Web UI. If you disabled Temporal-initiated failovers with the CLI, rerun the same command with the value changed from true to false.

Workers and failovers

Enabling High Availability for Namespaces does not require specific Worker configuration. When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption.

When a Namespace fails over to a replica in a different region, Workers will be communicating cross-region.

  • If your application cannot tolerate this latency, deploy a second set of Workers in the replica's region or opt for a replica in the same region.
  • In the case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to the secondary region.

Temporal Cloud enforces a maximum connection lifetime of 5 minutes, which gives your Workers an opportunity to re-resolve the DNS.
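
During a failover test it can help to watch, from a Worker host, whether the Namespace endpoint resolves to different addresses than before. The sketch below uses only the Python standard library. The helper names and the default port of 7233 are assumptions, and whether the resolved addresses visibly change depends on how the endpoint is served, so treat this as a diagnostic aid rather than a health check.

```python
import socket
from typing import Callable, FrozenSet

Resolver = Callable[[str], FrozenSet[str]]


def resolve_addresses(host: str, port: int = 7233) -> FrozenSet[str]:
    """Resolve a hostname to the set of IP addresses it currently maps to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return frozenset(info[4][0] for info in infos)


def endpoint_changed(host: str,
                     previous: FrozenSet[str],
                     resolver: Resolver = resolve_addresses) -> bool:
    """Return True when the host now resolves to a different address set."""
    return resolver(host) != previous
```

Passing the resolver as a parameter keeps the comparison testable without network access: record the address set before the test, then call endpoint_changed again after the failover.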

Test failovers

Temporal recommends regular failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your application continues to function even when parts of the infrastructure fail.

Tip

If this is your first time performing a failover test, run it with a test-specific Namespace and application. Practice runs help ensure the process runs smoothly during real incidents in production.

Dive deeper: Why test?

Failover testing (also known as "trigger testing") can:

  • Validate replicated deployments: In multi-region setups, failover testing ensures your application can run from another region when the primary region experiences outages. In Same-region Replication setups, failover testing works with a separate cell within the same region.

  • Assess replication lag: In multi-region deployments, monitoring replication lag between regions is important. Check the lag before initiating a failover to avoid rolling back Workflow progress.

  • Assess recovery time: Manual testing helps you measure actual recovery time and check if it meets your expected Recovery Time Objective (RTO).

  • Identify potential issues: Failover testing uncovers problems that are not visible during normal operation, such as backlogs, capacity planning gaps, and unexpected behavior from external dependencies during a failover event.

  • Operational readiness: Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents.
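
To make the recovery-time assessment concrete, record when you trigger the test failover and when your application is confirmed healthy again, then compare the difference with your RTO. This is a minimal sketch; the function names are hypothetical, and deciding what counts as "restored" is up to you.

```python
from datetime import datetime, timedelta


def recovery_time(failover_started: datetime,
                  service_restored: datetime) -> timedelta:
    """Return the elapsed time between the two recorded timestamps."""
    if service_restored < failover_started:
        raise ValueError("service_restored must not precede failover_started")
    return service_restored - failover_started


def meets_rto(measured: timedelta, rto: timedelta) -> bool:
    """Return True when the measured recovery time is within the RTO."""
    return measured <= rto


if __name__ == "__main__":
    # Example timestamps; in a real test, record these as the events happen.
    started = datetime(2024, 1, 1, 12, 0, 0)
    restored = datetime(2024, 1, 1, 12, 3, 30)
    measured = recovery_time(started, restored)
    print(measured, meets_rto(measured, rto=timedelta(minutes=5)))
```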