Manage maintenance events with TPUs in All Capacity mode
All TPU hosts go through regular maintenance. In TPU All Capacity mode, you can plan for upcoming maintenance events and start the maintenance operations on all of your capacity when it suits you. You can update your used and unused capacity simultaneously or separately. You can also perform maintenance at the VM, sub-block, block, or reservation level. This fine-grained maintenance control lets you create an optimal maintenance sequence and schedule maintenance operations to minimize business impact.
TPU All Capacity mode only supports "grouped maintenance", which means the maintenance operations for all VM instances within a reservation are scheduled at the same time. All TPU VMs in a reservation have the same maintenance window. However, the maintenance operations can be carried out separately at the host, sub-block, block, or reservation level. Maintenance notifications are sent approximately 90 days in advance. Maintenance won't be carried out more frequently than once every 90 days.
If you're using TPU Cluster Director on GKE and you're using multi-host TPU slice node pools, we recommend you delete the GKE node pool before manually starting the pending maintenance for any hosts in that node pool. Once the maintenance has been executed for all hosts in the original node pool, you can recreate the node pool.
The following is an example timeline for a TPU host maintenance event:
- Maintenance is scheduled. A notification is sent to you that the host will be updated within 90 days.
- You can choose to manually update the host within those 90 days.
- After 90 days, the maintenance operation is run without exception.
- If another maintenance event is scheduled before the previous event has run, the second operation is scheduled to run after 180 days, which is 90 days after the initial maintenance event is scheduled to occur.
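As a rough illustration of the timeline above, the following Python sketch derives the key dates from a notification date. The function and constant names are hypothetical, and the sketch assumes the notice period is exactly 90 days, whereas notifications are described as arriving approximately 90 days in advance.

```python
from datetime import date, timedelta

# Both values come from the timeline above; the names are hypothetical.
NOTICE_PERIOD = timedelta(days=90)   # time between notification and forced run
MIN_INTERVAL = timedelta(days=90)    # minimum gap between maintenance events


def maintenance_deadlines(notified_on: date) -> dict:
    """Return the key dates for one maintenance event and a possible follow-up."""
    forced_run = notified_on + NOTICE_PERIOD
    return {
        # You can start maintenance manually from the notification date onward.
        "manual_window_start": notified_on,
        # After the notice period, maintenance runs without exception.
        "forced_run": forced_run,
        # A second event scheduled in the meantime runs 90 days after the first.
        "earliest_follow_up": forced_run + MIN_INTERVAL,
    }


deadlines = maintenance_deadlines(date(2025, 1, 1))
print(deadlines["forced_run"])          # 2025-04-01
print(deadlines["earliest_follow_up"])  # 2025-06-30
```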
Set up maintenance notification alerts for physical capacity
Compute Engine sends you Cloud Logging events for scheduled, started, or completed maintenance. These maintenance events stay in your logs so you can build log queries to get a historical view of the maintenance for your capacity. You can also get notified about future maintenance events for a reservation, block, or sub-block by creating log-based alert policies.
To create an alert for maintenance events on your physical capacity:
- In the Google Cloud console, go to the Logs Explorer.
- Make sure Show query is turned on.
- In the query pane, build a query in the format listed in the following sections. Replace the corresponding parameter placeholder accordingly and run the query.
- Once you verify that the returned results match what you want, create an alert by selecting Create log alert from the Actions drop-down in the query Results toolbar and providing the requested information.
Query for upcoming maintenance
The following is an example query for upcoming maintenance:
protoPayload.methodName="compute.CAPACITY_COMPONENT.upcomingGroupMaintenance" severity>=DEFAULT protoPayload.resourceName="projects/shared-reservation-project/reservations/RESOURCE_NAME" protoPayload.status.message =~ "scheduled"
Replace CAPACITY_COMPONENT and RESOURCE_NAME with the following values:
| Receive upcoming maintenance notification for | CAPACITY_COMPONENT | RESOURCE_NAME |
|---|---|---|
| All reservations | reservations | Omit RESOURCE_NAME |
| A specific reservation | reservations | YOUR_RESERVATION_NAME |
| Blocks across all reservations | reservations.blocks | Omit RESOURCE_NAME |
| A specific block | reservations.blocks | YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID |
| Sub-blocks across all reservations | reservations.blocks.subblocks | Omit RESOURCE_NAME |
| A specific sub-block | reservations.blocks.subblocks | YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID/reservationSubBlocks/YOUR_RESERVATION_SUBBLOCK_ID |
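If you maintain several alerts, it can help to generate the upcoming-maintenance query from its two placeholders. The following Python sketch only assembles the query text shown above; the function name and parameters are hypothetical. It assumes that "Omit RESOURCE_NAME" means dropping the protoPayload.resourceName clause entirely, and it writes one clause per line, which relies on Cloud Logging treating adjacent clauses as an implicit AND.

```python
from typing import Optional

# Capacity components listed in the table above.
VALID_COMPONENTS = (
    "reservations",
    "reservations.blocks",
    "reservations.blocks.subblocks",
)


def upcoming_maintenance_query(
    capacity_component: str,
    project: Optional[str] = None,
    resource_name: Optional[str] = None,
) -> str:
    """Build the upcoming-maintenance log query as a multi-line string."""
    if capacity_component not in VALID_COMPONENTS:
        raise ValueError(f"unknown capacity component: {capacity_component}")
    if resource_name is not None and project is None:
        raise ValueError("project is required when resource_name is set")

    clauses = [
        f'protoPayload.methodName="compute.{capacity_component}.upcomingGroupMaintenance"',
        "severity>=DEFAULT",
    ]
    # Assumption: omitting RESOURCE_NAME means leaving this clause out.
    if resource_name is not None:
        clauses.append(
            f'protoPayload.resourceName="projects/{project}/reservations/{resource_name}"'
        )
    clauses.append('protoPayload.status.message =~ "scheduled"')
    return "\n".join(clauses)


# Example: every block across all reservations.
print(upcoming_maintenance_query("reservations.blocks"))
```

Paste the printed text into the Logs Explorer query pane and verify the results before creating an alert from it.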
Query for maintenance window opening
protoPayload.methodName="compute.reservations.CAPACITY_COMPONENT.startGroupMaintenance" severity>=DEFAULT protoPayload.status.message =~ "started"
Replace CAPACITY_COMPONENT with one of the following values:
| Receive notification for a maintenance window opening for | CAPACITY_COMPONENT |
|---|---|
| Blocks in a reservation | reservations.blocks |
| Sub-blocks in a reservation | reservations.blocks.subblocks |
Query for completed maintenance
The following is an example query for completed maintenance:
protoPayload.methodName="compute.reservations.CAPACITY_COMPONENT.completedGroupMaintenance" severity>=DEFAULT protoPayload.resourceName="projects/YOUR_RESERVATION_PROJECT/reservations/RESOURCE_NAME" protoPayload.status.message =~ "completed"
Replace CAPACITY_COMPONENT and RESOURCE_NAME with the following values:
| Receive notification for completed maintenance for | CAPACITY_COMPONENT | RESOURCE_NAME |
|---|---|---|
| All reservations | reservations | Omit RESOURCE_NAME |
| A specific reservation | reservations | YOUR_RESERVATION_NAME |
| Blocks across all reservations | reservations.blocks | Omit RESOURCE_NAME |
| A specific block | reservations.blocks | YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID |
| Sub-blocks across all reservations | reservations.blocks.subblocks | Omit RESOURCE_NAME |
| A specific sub-block | reservations.blocks.subblocks | YOUR_RESERVATION_NAME/reservationBlocks/YOUR_RESERVATION_BLOCK_ID/reservationSubBlocks/YOUR_RESERVATION_SUBBLOCK_ID |
View the maintenance status of physical capacity
You can find the maintenance status of your capacity through Cloud Logging, the API, or the gcloud CLI. Maintenance status information is provided at four levels: reservation, block, sub-block, and host.
Cloud Logging
The following example JSON was generated in response to this example query:
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance is scheduled for this block in reservation YOUR_RESERVATION. Review the maintenance schedule by describing the reservation and block via gcloud CLI"
    },
    "metadata": {
      "type": "SCHEDULED",
      "canReschedule": true,
      "windowGroupStartTime": "2025-09-12T13:00:00.000-07:00",
      "windowGroupEndTime": "2025-09-12T17:00:00.000-07:00",
      "maintenanceGroupStatus": "PENDING",
      "maintenancePendingCount": 128,  # Used and unused machines
      "instanceMaintenancePendingCount": 64  # VMs only
    },
    "methodName": "compute.reservations.block.upcomingGroupMaintenance",
    …
  }
}
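To act on an entry like this programmatically, you can read the window and counts from protoPayload.metadata. The following Python sketch is illustrative only: the function name is hypothetical, the sample dictionary copies just the metadata fields shown above, and it assumes the timestamps always arrive in the ISO 8601 form used in this example.

```python
from datetime import datetime

# Sample values copied from the example entry above; other fields are left out.
entry = {
    "protoPayload": {
        "metadata": {
            "type": "SCHEDULED",
            "canReschedule": True,
            "windowGroupStartTime": "2025-09-12T13:00:00.000-07:00",
            "windowGroupEndTime": "2025-09-12T17:00:00.000-07:00",
            "maintenanceGroupStatus": "PENDING",
            "maintenancePendingCount": 128,
            "instanceMaintenancePendingCount": 64,
        }
    }
}


def summarize_group_maintenance(log_entry: dict) -> dict:
    """Extract the maintenance window and pending counts from a log entry."""
    meta = log_entry["protoPayload"]["metadata"]
    start = datetime.fromisoformat(meta["windowGroupStartTime"])
    end = datetime.fromisoformat(meta["windowGroupEndTime"])
    return {
        "status": meta["maintenanceGroupStatus"],
        "can_reschedule": meta["canReschedule"],
        "window_hours": (end - start).total_seconds() / 3600,
        "hosts_pending": meta["maintenancePendingCount"],
        "vms_pending": meta["instanceMaintenancePendingCount"],
    }


summary = summarize_group_maintenance(entry)
print(summary["status"], summary["window_hours"], summary["hosts_pending"])
```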
gcloud
gcloud compute reservations blocks describe YOUR_RESERVATION \
    --block-name=YOUR_BLOCK \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE
The output is similar to the following:
count: 128 # Host count
creationTimestamp: '2025-08-19T18:23:32.825-07:00'
id: '6404259976725386932'
inUseCount: 64 # In use host count
kind: compute#reservationBlock
name: exr1-block-0002
…
reservationMaintenance:
  instanceMaintenanceOngoingCount: 0
  instanceMaintenancePendingCount: 64 # VMs Only
  maintenanceOngoingCount: 0
  maintenancePendingCount: 128 # Used and Unused Hosts
  schedulingType: GROUPED
  subblockInfraMaintenanceOngoingCount: 0
  subblockInfraMaintenancePendingCount: 0
  upcomingGroupMaintenance:
    canReschedule: true
    maintenanceReasons:
    - PLANNED_UPDATE
    maintenanceStatus: PENDING
    type: SCHEDULED
    windowEndTime: '2025-09-12T17:00:00.000-07:00'
    windowStartTime: '2025-09-12T13:00:00.000-07:00'
The following values from the output describe the maintenance information:
- reservationMaintenance.instanceMaintenanceOngoingCount: the number of VM instances being updated
- reservationMaintenance.instanceMaintenancePendingCount: the number of VM instances pending maintenance
- reservationMaintenance.maintenanceOngoingCount: the number of hosts, used and unused, being updated
- reservationMaintenance.maintenancePendingCount: the number of hosts, used and unused, pending maintenance
- reservationMaintenance.upcomingGroupMaintenance.maintenanceReasons: the reason for the maintenance (for example, PLANNED_UPDATE)
- reservationMaintenance.upcomingGroupMaintenance.maintenanceStatus: the status of the maintenance operation
- reservationMaintenance.upcomingGroupMaintenance.type: the type of maintenance (SCHEDULED for planned maintenance or UNSCHEDULED for unplanned or emergency maintenance)
- reservationMaintenance.upcomingGroupMaintenance.windowEndTime: the scheduled end of the time window for the maintenance operation
- reservationMaintenance.upcomingGroupMaintenance.windowStartTime: the scheduled start of the time window for the maintenance operation
Set up maintenance notification alerts for TPU VMs
You can create alerts for maintenance events on your TPU VMs:
- In the Google Cloud console, go to the Logs Explorer.
- Set the Show query toggle to the "on" position.
- In the query pane, build a query in the format listed in the following sections.
- Once you verify that the returned results match what you want, create an alert by clicking the Actions drop-down, selecting Create log alert, and completing the information in the Create logs-based alert policy pane.
Query for when maintenance is scheduled for a VM instance
protoPayload.methodName="compute.instances.upcomingMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "scheduled"
Query for when maintenance window has opened for a VM instance
protoPayload.methodName="compute.instances.upcomingMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "ongoing"
Query for maintenance started for VM instances
protoPayload.methodName="compute.instances.blocks.terminateOnHostMaintenance" severity>=DEFAULT
Query for when maintenance has completed for a VM instance
protoPayload.methodName="compute.instances.upcomingMaintenance" severity>=DEFAULT
protoPayload.status.message =~ "completed"
View the maintenance status of a Cloud TPU VM
You can retrieve the maintenance status of a Cloud TPU VM with the
Compute Engine instance API or with a curl command from within the
guest operating system.
Describe an instance
gcloud
gcloud compute instances describe INSTANCE --zone ZONE
This command returns output like the following:
…
upcomingMaintenance:
  type: SCHEDULED
  canReschedule: true
  windowStartTime: '2025-09-12T13:00:00.000-07:00'
  windowEndTime: '2025-09-12T17:00:00.000-07:00'
  latestWindowStartTime: '2025-09-12T13:00:00.000-07:00'
  maintenanceStatus: PENDING
...
curl
curl "http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance?alt=json" -H "Metadata-Flavor: Google"
This command returns output like the following:
{
  "maintenanceType": "SCHEDULED",
  "canReschedule": true,
  "windowStartTime": "2025-09-12T13:00:00.000-07:00",
  "windowEndTime": "2025-09-12T17:00:00.000-07:00",
  "latestWindowStartTime": "2025-09-12T13:00:00.000-07:00",
  "maintenanceStatus": "PENDING"
}
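A workload running on the VM can make the same metadata request from code. The following Python sketch uses only the standard library and reuses the URL and header from the curl command above; the function names are hypothetical. It works only from inside the guest operating system, and it does not handle the case where no maintenance is scheduled, because the shape of that response isn't covered here.

```python
import json
import urllib.request

# URL and header taken from the curl command above.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/upcoming-maintenance?alt=json"
)


def build_request() -> urllib.request.Request:
    """Create the metadata server request with the required header."""
    return urllib.request.Request(METADATA_URL, headers={"Metadata-Flavor": "Google"})


def parse_upcoming_maintenance(body: str) -> dict:
    """Decode the JSON body returned by the metadata server."""
    return json.loads(body)


def fetch_upcoming_maintenance(timeout: float = 5.0) -> dict:
    """Query the metadata server; call this only from inside the VM."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as response:
        return parse_upcoming_maintenance(response.read().decode("utf-8"))
```

For example, a training job could call fetch_upcoming_maintenance() periodically and write a checkpoint when maintenanceStatus changes from PENDING to ONGOING.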
You can also find maintenance notifications in Cloud Logging.
The following is an example log message of a pending planned maintenance. For an example query, see View the maintenance status of physical capacity.
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance is scheduled for this instance. Review the maintenance schedule by describing the VM with gcloud CLI or querying the http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance metadata key."
    },
    "metadata": {
      "canReschedule": true,
      "latestWindowStartTime": "2024-01-01:00:00:00PST",
      "maintenanceStatus": "PENDING",
      "type": "SCHEDULED",
      "windowEndTime": "2024-01-01:00:02:00PST",
      "windowStartTime": "2024-01-01:00:00:00PST"
    }
  },
  "operation": {
    "id": "systemevent-1702539760425-60c736da2db40-701ddf19-b5424b20",
    "producer": "compute.instances.upcomingMaintenance",
    "first": true,
    "last": false
  }
}
The following example is a log message for ongoing unplanned maintenance. For an example query, see Query for when maintenance window has opened for a VM instance.
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance window has started for this instance. Review the maintenance schedule by describing the VM with gcloud CLI or querying the http://metadata.google.internal/computeMetadata/v1/instance/upcoming-maintenance metadata key."
    },
    "metadata": {
      "canReschedule": true,
      "latestWindowStartTime": "2024-01-01:00:00:00PST",
      "maintenanceStatus": "ONGOING",
      "type": "UNSCHEDULED",
      "windowEndTime": "2024-01-01:00:02:00PST",
      "windowStartTime": "2024-01-01:00:00:00PST"
    }
  },
  "operation": {
    "id": "systemevent-1702539760425-60c736da2db40-701ddf19-b5424b20",
    "producer": "compute.instances.upcomingMaintenance",
    "first": true,
    "last": false
  }
}
The following example is a log message for completed maintenance. For an example query, see Query for when maintenance has completed for a VM instance.
{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "message": "Maintenance window has completed for this instance. All maintenance notifications on the instance have been removed."
    }
  },
  "operation": {
    "id": "systemevent-1702539760425-60c736da2db40-701ddf19-b5424b20",
    "producer": "compute.instances.upcomingMaintenance",
    "first": false,
    "last": true
  }
}
Manually start pending maintenance for physical capacity
Once a maintenance event is scheduled (maintenanceStatus is set to PENDING),
you can manually start maintenance for your reservations, blocks, or sub-blocks
that have the canReschedule property set to True. When you manually start
a pending maintenance event, what happens depends on the maintenance state of
your reservation, blocks, or sub-blocks. The following table describes what
happens for each of these:
| Maintenance state | Description | What you see |
|---|---|---|
| Scheduled | Compute Engine has scheduled maintenance for the reservation. You can manually start maintenance before the scheduled time. | In the Google Cloud CLI or REST API, the maintenanceStatus field is set to PENDING. |
| In progress | Maintenance is underway. You can't reschedule it. | In the Google Cloud CLI or REST API, the maintenanceStatus field is set to ONGOING. |
| Complete | Maintenance is finished. Compute Engine has removed all maintenance notifications from the VM. | In the Google Cloud CLI or REST API, the maintenanceStatus field doesn't exist. |
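The table and the canReschedule condition can be expressed as a small check that a script runs before it tries to start maintenance. The following Python sketch is a minimal illustration with hypothetical function names. It assumes you have already loaded the upcoming maintenance object (for example, from describe output in JSON form) into a dictionary, and that a missing object or missing maintenanceStatus field corresponds to the Complete row.

```python
from typing import Optional


def maintenance_state(upcoming: Optional[dict]) -> str:
    """Map the maintenanceStatus field to the states in the table above."""
    status = (upcoming or {}).get("maintenanceStatus")
    if status == "PENDING":
        return "Scheduled"
    if status == "ONGOING":
        return "In progress"
    if status is None:
        # Assumption: no field means the notification has been removed.
        return "Complete"
    raise ValueError(f"unexpected maintenanceStatus: {status}")


def can_start_manually(upcoming: Optional[dict]) -> bool:
    """True only for pending maintenance that allows rescheduling."""
    if not upcoming:
        return False
    return (
        upcoming.get("maintenanceStatus") == "PENDING"
        and upcoming.get("canReschedule") is True
    )


pending = {"maintenanceStatus": "PENDING", "canReschedule": True}
print(maintenance_state(pending), can_start_manually(pending))
```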
Manually start maintenance on the entire reservation
The following command starts maintenance on a reservation. Use the --scope parameter to set the scope of the maintenance operation to one of the following values:
- All hosts: --scope=all
- Hosts with running VMs: --scope=running
- Unused, stopped, or suspended VMs: --scope=unused
To start maintenance on all blocks of a reservation, run the following command:
gcloud compute reservations perform-maintenance YOUR_RESERVATION \
    --zone=YOUR_ZONE \
    --scope=all
To check the progress of a maintenance event, run the following command:
gcloud compute reservations describe YOUR_RESERVATION \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE
The output is similar to the following:
ResourceStatus
upcomingGroupMaintenance:
"type":"SCHEDULED"
"canReschedule":True
"maintenanceStatus":"PENDING" → "ONGOING"
"maintenancePendingCount":512 → 0 # all hosts are moved into an ongoing state.
"maintenanceOngoingCount":0 → 512 → 256 → 0 # this number first increases to all hosts
# as machines complete, this number reduces.
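The two counters in this output are enough to estimate how many hosts have finished. The following Python sketch is a rough illustration, not an official formula: the function name is hypothetical, and it assumes you know the total host count (512 in this example) and that both counters refer to that same set of hosts.

```python
def hosts_completed(total_hosts: int, pending: int, ongoing: int) -> int:
    """Hosts that are neither pending nor being updated are treated as done."""
    if min(total_hosts, pending, ongoing) < 0:
        raise ValueError("counts must be non-negative")
    if pending + ongoing > total_hosts:
        raise ValueError("pending + ongoing exceeds the total host count")
    return total_hosts - pending - ongoing


# The (pending, ongoing) pairs follow the transitions shown in the output above.
for pending, ongoing in [(512, 0), (0, 512), (0, 256), (0, 0)]:
    done = hosts_completed(512, pending, ongoing)
    print(f"pending={pending} ongoing={ongoing} completed={done}")
```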
Manually start maintenance on a block
The following command starts maintenance on a block. Use the --scope parameter to set the scope of the maintenance operation to one of the following values:
- All hosts: --scope=all
- Hosts with running VMs: --scope=running
- Unused, stopped, or suspended VMs: --scope=unused
The following command shows how to start maintenance on running hosts:
gcloud compute reservations perform-maintenance YOUR_RESERVATION \
    --scope=running \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE
To check the progress of the maintenance event, run the following command:
gcloud compute reservations blocks describe YOUR_RESERVATION \
    --block-name=YOUR_BLOCK_NAME \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE
The output is similar to the following:
ResourceStatus
upcomingGroupMaintenance:
"maintenanceType":"SCHEDULED"
…
"maintenanceGroupStatus":"PENDING" → "ONGOING"
"maintenancePending":0
"maintenanceOngoing":70 → 0
Manually start maintenance on a sub-block
When starting maintenance on a sub-block, you don't specify the --scope
parameter because a sub-block is the smallest maintenance scope.
The following command starts maintenance on all hosts in a sub-block:
gcloud compute reservations sub-blocks perform-maintenance YOUR_RESERVATION \
    --block-name=YOUR_BLOCK_NAME \
    --sub-block-name=YOUR_SUBBLOCK_NAME \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE
The following command checks maintenance progress:
gcloud compute reservations sub-blocks describe YOUR_RESERVATION \
    --block-name=YOUR_BLOCK_NAME \
    --sub-block-name=YOUR_SUBBLOCK_NAME \
    --project=YOUR_PROJECT \
    --zone=YOUR_ZONE
The output is similar to the following:
ResourceStatus
groupMaintenance:
"maintenanceType":"SCHEDULED"
"canReschedule":True
"maintenanceGroupStatus":"PENDING" → "ONGOING"
"maintenancePendingCount": 32 → 0 # 32 hosts updated
"maintenanceOngoingCount":0 → 32 → 0
"instanceMaintenancePendingCount": 64 → 0
"instanceMaintenanceOngoingCount": 0 → 64 → 0 # 64 instances updated
Manually start pending maintenance for a TPU VM
If a host is running more than one VM, starting the maintenance on one VM triggers maintenance for all VMs on the host.
The following example shows how to manually trigger maintenance for a Trillium host that has two VMs:
gcloud compute instances perform-maintenance vm-1
Starting the maintenance of vm-1 will trigger the maintenance of vm-2.
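Because one call covers every VM on the host, a script that starts maintenance across many VMs only needs one command per host. The following Python sketch builds those commands without running them. The function name and the host-to-VM mapping are hypothetical, how you determine which VMs share a host is outside the scope of this example, and the --zone flag is added on the assumption that you want to set the zone explicitly.

```python
from typing import Dict, List


def perform_maintenance_commands(
    vms_by_host: Dict[str, List[str]], zone: str
) -> List[List[str]]:
    """Return one perform-maintenance command per host, not per VM."""
    commands = []
    for host in sorted(vms_by_host):
        vms = sorted(vms_by_host[host])
        if not vms:
            continue  # nothing is running on this host
        # Starting maintenance on the first VM triggers it for the others.
        commands.append(
            [
                "gcloud", "compute", "instances", "perform-maintenance",
                vms[0], f"--zone={zone}",
            ]
        )
    return commands


# "host-a" is a hypothetical label for the host shared by vm-1 and vm-2.
for command in perform_maintenance_commands({"host-a": ["vm-1", "vm-2"]}, "YOUR_ZONE"):
    print(" ".join(command))
```

Each returned list can be passed to subprocess.run once you have confirmed that the targets are correct.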