Train a model using TPU7x (Ironwood)

This document describes how to provision TPU7x resources and gives an example of deploying a training workload using MaxText and XPK.

TPU7x is the first release in the Ironwood family, Google Cloud's seventh-generation TPU. The Ironwood generation is designed for large-scale AI training and inference. For more information, see TPU7x.

For more examples optimized for TPU7x, see Training Recipes for Ironwood TPU on GitHub.

Provision TPUs

You can provision and manage TPU7x using the following methods:

  • GKE: You can use GKE to provision and manage TPUs as a pool of accelerators for your containerized machine learning workloads. Use the Google Cloud CLI to create your GKE cluster manually for precise customization or to expand existing production GKE environments; a minimal sketch follows this list. For more information, see About TPUs in GKE.
  • GKE and XPK: XPK is a command-line tool that simplifies cluster creation and workload execution on GKE. It's designed for ML practitioners to provision TPUs and run training jobs without needing deep Kubernetes expertise. Use XPK to quickly create GKE clusters and run workloads for proof-of-concept and testing. For more information, see the XPK GitHub repository.
  • GKE and TPU Cluster Director: TPU Cluster Director is available through an All Capacity mode reservation, which gives you full access to all of your reserved capacity (without hold-backs) and full visibility into the TPU hardware topology, utilization status, and health status. For more information, see All Capacity mode overview.
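
For example, the following is a minimal sketch of creating a cluster and a TPU7x node pool manually with the Google Cloud CLI, using the environment variables defined later in this document. MACHINE_TYPE and NUM_NODES are placeholders and the topology is illustrative; confirm the values that TPU7x supports before using them:

# Create a GKE cluster, then add a TPU7x node pool.
# MACHINE_TYPE and NUM_NODES are placeholders; check the GKE TPU
# documentation for the machine types, node counts, and topologies
# that TPU7x supports.
gcloud container clusters create ${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --zone=${ZONE}

gcloud container node-pools create tpu7x-pool \
    --cluster=${CLUSTER_NAME} \
    --project=${PROJECT_ID} \
    --zone=${ZONE} \
    --machine-type=MACHINE_TYPE \
    --num-nodes=NUM_NODES \
    --tpu-topology=4x4x8 \
    --reservation-affinity=specific \
    --reservation=${RESERVATION_NAME}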

Deploy a training workload with MaxText and XPK

Accelerated Processing Kit (XPK) is a command-line tool designed to simplify provisioning, managing, and running machine learning workloads. Use XPK to create GKE clusters for proof-of-concept and testing.

The following sections show how to deploy a training workload using MaxText and XPK.

Before you begin

Before you start, complete the following steps:

  • Ensure you have a Google Cloud project with billing enabled.
  • Get access to TPU7x. For more information, contact your account team.
  • Ensure the account you're using with XPK has the roles listed in the XPK GitHub repository.
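
As a sketch, granting a required role to the account looks like the following. The role shown is illustrative; use the exact roles listed in the XPK repository:

# Grant one required role to your account (the role shown is illustrative).
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="user:YOUR_EMAIL" \
    --role="roles/container.admin"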

Install XPK and dependencies

  1. Install XPK. Follow the instructions in the XPK GitHub repository.
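
    For example, installing XPK from PyPI into a Python virtual environment might look like the following sketch; check the repository for the currently recommended method:

    # Create and activate a virtual environment, then install XPK from PyPI.
    python3 -m venv ~/xpk-venv
    source ~/xpk-venv/bin/activate
    pip install xpk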

  2. Install Docker using the instructions provided by your administrator, or follow the official installation instructions. After installation, run the following commands to configure Docker and test the installation:

    gcloud auth configure-docker
    sudo usermod -aG docker $USER # relaunch the terminal and activate venv after running this command
    docker run hello-world # Test Docker
    
  3. Set the following environment variables:

    export PROJECT_ID=YOUR_PROJECT_ID
    export ZONE=YOUR_ZONE
    export CLUSTER_NAME=YOUR_CLUSTER_NAME
    export ACCELERATOR_TYPE=YOUR_ACCELERATOR_TYPE
    export RESERVATION_NAME=YOUR_RESERVATION_NAME
    export BASE_OUTPUT_DIR="gs://YOUR_BUCKET_NAME"

    Replace the following:

    • YOUR_PROJECT_ID: Your Google Cloud project ID.
    • YOUR_ZONE: The zone in which to create the cluster. For Preview, only us-central1-c is supported.
    • YOUR_CLUSTER_NAME: The name of the new cluster.
    • YOUR_ACCELERATOR_TYPE: The TPU version and topology. For example, tpu7x-4x4x8. For a list of supported topologies, see Supported configurations.
    • YOUR_RESERVATION_NAME: The name of your reservation. For shared reservations, use projects/YOUR_PROJECT_NUMBER/reservations/YOUR_RESERVATION_NAME.
    • YOUR_BUCKET_NAME: The name of your Cloud Storage bucket, which will be the output directory for model training.
  4. If you don't have an existing Cloud Storage bucket, create one using the following command:

    gcloud storage buckets create ${BASE_OUTPUT_DIR} \
        --project=${PROJECT_ID} \
        --location=US \
        --default-storage-class=STANDARD \
        --uniform-bucket-level-access
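
    Optionally, verify that the bucket exists and that you can access it:

    gcloud storage buckets describe ${BASE_OUTPUT_DIR}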
    

Create a single-NIC, single-slice cluster

  1. Follow the instructions in the Configure MTU section to optimize your network configuration.
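
    The Configure MTU section is the authoritative reference for this step. As a minimal sketch, creating a custom network and subnet with a larger MTU, and exporting the variable names that the next step expects, might look like the following. The MTU value and IP range are illustrative assumptions:

    export NETWORK_NAME=YOUR_NETWORK_NAME
    export SUBNET_NAME=YOUR_SUBNET_NAME

    # Create a custom-mode VPC network with a larger MTU (value is illustrative).
    gcloud compute networks create ${NETWORK_NAME} \
        --project=${PROJECT_ID} \
        --subnet-mode=custom \
        --mtu=8896

    # Create a subnet in the region that contains your zone (range is illustrative).
    gcloud compute networks subnets create ${SUBNET_NAME} \
        --project=${PROJECT_ID} \
        --network=${NETWORK_NAME} \
        --region=us-central1 \
        --range=10.0.0.0/16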

  2. Populate the ${CLUSTER_ARGUMENTS} variable with the ${NETWORK_NAME} and ${SUBNET_NAME} values from the previous step. You'll use this variable in the xpk cluster create command:

    export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${SUBNET_NAME}"
    
  3. Create your GKE cluster with TPU7x node pools using the xpk cluster create command:

    xpk cluster create \
        --project=${PROJECT_ID} \
        --zone=${ZONE} \
        --cluster=${CLUSTER_NAME} \
        --cluster-cpu-machine-type=n1-standard-8 \
        --tpu-type=${ACCELERATOR_TYPE} \
        --reservation=${RESERVATION_NAME} \
        --custom-cluster-arguments="${CLUSTER_ARGUMENTS}"
    

    Setting the --cluster-cpu-machine-type flag to n1-standard-8 (or larger) ensures that the default node pool has enough CPU for system pods, such as the JobSet webhook, which prevents scheduling errors. By default, XPK uses e2-standard-16. Some zones support only specific CPU machine types, so you might need to switch between n1, n2, and e2 types; otherwise, you might encounter quota errors.
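
    To see which CPU machine types are available in your zone, you can list them with the gcloud CLI:

    # List n1, n2, and e2 standard machine types available in your zone.
    gcloud compute machine-types list \
        --zones=${ZONE} \
        --filter="name ~ '^(n1|n2|e2)-standard'"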

  4. Add a maintenance exclusion to prevent upgrades for the cluster:

    gcloud container clusters update ${CLUSTER_NAME} \
        --zone=${ZONE} \
        --add-maintenance-exclusion-name="no-upgrade-next-month" \
        --add-maintenance-exclusion-start="EXCLUSION_START_TIME" \
        --add-maintenance-exclusion-end="EXCLUSION_END_TIME" \
        --add-maintenance-exclusion-scope="no_upgrades"

    Replace the following:

    • EXCLUSION_START_TIME: Your selected start time for the maintenance exclusion in YYYY-MM-DDTHH:MM:SSZ format.
    • EXCLUSION_END_TIME: Your selected end time for the maintenance exclusion in YYYY-MM-DDTHH:MM:SSZ format.
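
    For example, you can generate timestamps in this format with the date command. This sketch assumes GNU date and an illustrative 30-day exclusion window; pass the resulting ${EXCLUSION_START_TIME} and ${EXCLUSION_END_TIME} values to the command above:

    # Start now and end in 30 days (GNU date syntax).
    export EXCLUSION_START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    export EXCLUSION_END_TIME=$(date -u -d "+30 days" +"%Y-%m-%dT%H:%M:%SZ")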

Build or upload the MaxText Docker image

You can either build a Docker image locally using scripts provided by MaxText or use a prebuilt image.

Build locally

The following commands clone MaxText, download custom JAX and libtpu wheels, and build a dependency Docker image; the build copies your local directory into the container:

# Make sure you're running in a virtual environment with Python 3.12.
# The check prints an error only if the version doesn't match; if nothing is printed, you have the correct version.
[[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]] || { >&2 echo "Error: Python version must be 3.12."; false; }

# Clone MaxText
git clone https://github.com/AI-Hypercomputer/maxtext.git
cd maxtext
git checkout maxtext-tutorial-v1.0.0

# Custom Jax and LibTPU wheels
pip download libtpu==0.0.28.dev20251104+nightly -f "https://storage.googleapis.com/jax-releases/libtpu_releases.html"
pip download --pre jax==0.8.1.dev20251104 jaxlib==0.8.1.dev20251104 --index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/

# Build the Docker image
bash docker_build_dependency_image.sh MODE=custom_wheels

After the commands complete successfully, you should see an image named maxtext_base_image created locally. You can use this local image directly in the xpk workload create command.
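
You can confirm that the image exists by listing it with Docker:

docker images maxtext_base_image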

Upload image (optional)

After building the Docker image locally using the instructions in the previous section, you can upload it to your container registry using the following command:

export CLOUD_IMAGE_NAME="${USER}-maxtext-runner"
bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME}

After the command completes successfully, you should see the MaxText image in gcr.io with the name gcr.io/PROJECT_ID/CLOUD_IMAGE_NAME.
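
You can verify the upload by listing the image's tags; this assumes the PROJECT_ID and CLOUD_IMAGE_NAME environment variables are still set:

gcloud container images list-tags gcr.io/${PROJECT_ID}/${CLOUD_IMAGE_NAME}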

Define the MaxText training command

Prepare the command to run your training script within the Docker container.

The MaxText 1B model is a configuration within the MaxText framework for training a language model with approximately 1 billion parameters. Use this model to experiment at small chip scales; its performance is not optimized.

export MAXTEXT_COMMAND="JAX_PLATFORMS=tpu,cpu \
    ENABLE_PJRT_COMPATIBILITY=true \
    python3 src/MaxText/train.py src/MaxText/configs/base.yml \
        base_output_directory=${BASE_OUTPUT_DIR} \
        dataset_type=synthetic \
        per_device_batch_size=2 \
        enable_checkpointing=false \
        gcs_metrics=true \
        run_name=maxtext_xpk \
        steps=30"

Deploy the training workload

Run the xpk workload create command to deploy your training job. Specify either the --base-docker-image flag to use the locally built MaxText base image, or the --docker-image flag with the full name of the image you want to use. Optionally, include the --enable-debug-logs flag to enable debug logging.

xpk workload create \
    --cluster=${CLUSTER_NAME} \
    --base-docker-image=maxtext_base_image \
    --workload=maxtext-1b-$(date +%H%M) \
    --tpu-type=${ACCELERATOR_TYPE} \
    --zone=${ZONE} \
    --project=${PROJECT_ID} \
    --command="${MAXTEXT_COMMAND}"
    # [--enable-debug-logs]

Workload names must be unique within the cluster. In this example, $(date +%H%M) is appended to the workload name to ensure uniqueness.
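
After the workload is created, you can monitor it with XPK. A typical invocation to list workloads and their status looks like the following; see the XPK repository for the full set of flags:

# List workloads on the cluster and check their status.
xpk workload list \
    --cluster=${CLUSTER_NAME} \
    --zone=${ZONE} \
    --project=${PROJECT_ID}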

What's next