Train a model using TPU7x (Ironwood)
This document describes how to provision TPU7x resources and gives an example of deploying a training workload using MaxText and XPK.
TPU7x is the first release in the Ironwood family, Google Cloud's seventh-generation TPU. The Ironwood generation is designed for large-scale AI training and inference. For more information, see TPU7x.
For more examples optimized for TPU7x, see Training Recipes for Ironwood TPU on GitHub.
Provision TPUs
You can provision and manage TPU7x using the following methods:
- GKE: You can use GKE to provision and manage TPUs as a pool of accelerators for your containerized machine learning workloads. Use the Google Cloud CLI to create your GKE cluster manually for precise customization or to expand existing production GKE environments; a minimal sketch follows this list. For more information, see About TPUs in GKE.
- GKE and XPK: XPK is a command-line tool that simplifies cluster creation and workload execution on GKE. It's designed for ML practitioners to provision TPUs and run training jobs without needing deep Kubernetes expertise. Use XPK to quickly create GKE clusters and run workloads for proof-of-concept and testing. For more information, see the XPK GitHub repository.
- GKE and TPU Cluster Director: TPU Cluster Director is available through an All Capacity mode reservation, which gives you full access to all of your reserved capacity (without hold-backs) and full visibility into the TPU hardware topology, utilization status, and health status. For more information, see All Capacity mode overview.
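If you take the manual gcloud route, the following is a minimal sketch of creating a cluster and attaching a TPU node pool. TPU7X_MACHINE_TYPE is a hypothetical placeholder, and the topology and node-count values are illustrative; confirm the values that match your reservation with your account team.

# Create a GKE cluster, then attach a TPU7x node pool.
gcloud container clusters create YOUR_CLUSTER_NAME \
  --project=YOUR_PROJECT_ID \
  --zone=YOUR_ZONE \
  --machine-type=n1-standard-8

# TPU7X_MACHINE_TYPE is a hypothetical placeholder; use the machine type
# that matches your TPU7x reservation. --num-nodes must match the number
# of VMs implied by the topology.
gcloud container node-pools create tpu7x-pool \
  --cluster=YOUR_CLUSTER_NAME \
  --project=YOUR_PROJECT_ID \
  --zone=YOUR_ZONE \
  --machine-type=TPU7X_MACHINE_TYPE \
  --tpu-topology=4x4x8 \
  --num-nodes=NUM_NODES \
  --reservation-affinity=specific \
  --reservation=YOUR_RESERVATION_NAME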
Deploy a training workload with MaxText and XPK
Use Accelerated Processing Kit (XPK) to create GKE clusters for proof-of-concept and testing. XPK is a command-line tool designed to simplify provisioning, managing, and running machine learning workloads.
The following sections show how to deploy a training workload using MaxText and XPK.
Before you begin
Before you start, complete the following steps:
- Ensure you have a Google Cloud project with billing enabled.
- Get access to TPU7x. For more information, contact your account team.
- Ensure the account you're using with XPK has the roles listed in the XPK GitHub repository.
Install XPK and dependencies
Install XPK. Follow the instructions in the XPK GitHub repository.
Install Docker using the instructions provided by your administrator, or follow the official installation instructions. Once installed, run the following commands to configure Docker and test the installation:
gcloud auth configure-docker
sudo usermod -aG docker $USER # relaunch the terminal and activate your venv after running this command
docker run hello-world # Test Docker

Set the following environment variables:
export PROJECT_ID=YOUR_PROJECT_ID
export ZONE=YOUR_ZONE
export CLUSTER_NAME=YOUR_CLUSTER_NAME
export ACCELERATOR_TYPE=YOUR_ACCELERATOR_TYPE
export RESERVATION_NAME=YOUR_RESERVATION_NAME
export BASE_OUTPUT_DIR="gs://YOUR_BUCKET_NAME"
Replace the following:
- YOUR_PROJECT_ID: Your Google Cloud project ID.
- YOUR_ZONE: The zone in which to create the cluster. For Preview, only us-central1-c is supported.
- YOUR_CLUSTER_NAME: The name of the new cluster.
- YOUR_ACCELERATOR_TYPE: The TPU version and topology. For example, tpu7x-4x4x8. For a list of supported topologies, see Supported configurations.
- YOUR_RESERVATION_NAME: The name of your reservation. For shared reservations, use projects/YOUR_PROJECT_NUMBER/reservations/YOUR_RESERVATION_NAME.
- YOUR_BUCKET_NAME: The name of your Cloud Storage bucket, which will be the output directory for model training.
If you don't have an existing Cloud Storage bucket, create one using the following command:
gcloud storage buckets create ${BASE_OUTPUT_DIR} \
  --project=${PROJECT_ID} \
  --location=US \
  --default-storage-class=STANDARD \
  --uniform-bucket-level-access
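Optionally, verify that the bucket exists and check its location (a quick sanity check, not a required step):

gcloud storage buckets describe ${BASE_OUTPUT_DIR} --format="value(location)"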
Create a single-NIC, single-slice cluster
Follow the instructions in the Configure MTU section to optimize your network configuration.
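The Configure MTU instructions also define the ${NETWORK_NAME} and ${SUBNET_NAME} variables used in the next step. As a rough sketch only (the MTU value of 8896, the region, and the IP range here are assumptions; follow the linked section for the authoritative configuration):

export NETWORK_NAME=${CLUSTER_NAME}-net
export SUBNET_NAME=${CLUSTER_NAME}-subnet

# Create a custom-mode VPC with a large MTU, then a subnet in the
# cluster's region. The MTU and CIDR values are illustrative.
gcloud compute networks create ${NETWORK_NAME} \
  --subnet-mode=custom --mtu=8896
gcloud compute networks subnets create ${SUBNET_NAME} \
  --network=${NETWORK_NAME} \
  --region=us-central1 \
  --range=10.0.0.0/18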
1. Populate the ${CLUSTER_ARGUMENTS} variable, which you'll use in the xpk cluster create command:

export CLUSTER_ARGUMENTS="--network=${NETWORK_NAME} --subnetwork=${SUBNET_NAME}"

2. Create your GKE cluster with TPU7x node pools using the xpk cluster create command:

xpk cluster create \
  --project=${PROJECT_ID} \
  --zone=${ZONE} \
  --cluster ${CLUSTER_NAME} \
  --cluster-cpu-machine-type=n1-standard-8 \
  --tpu-type=${ACCELERATOR_TYPE} \
  --reservation=${RESERVATION_NAME} \
  --custom-cluster-arguments="${CLUSTER_ARGUMENTS}"

Setting the --cluster-cpu-machine-type flag to n1-standard-8 (or larger) ensures that the default node pool has sufficient CPU for system pods (for example, the JobSet webhook), preventing errors. By default, XPK uses e2-standard-16. Some zones only support specific CPU machine types, so you might need to switch between n1, n2, and e2 types; otherwise, you might encounter quota errors.

3. Add a maintenance exclusion to prevent upgrades for the cluster:

gcloud container clusters update ${CLUSTER_NAME} \
  --zone=${ZONE} \
  --add-maintenance-exclusion-name="no-upgrade-next-month" \
  --add-maintenance-exclusion-start="EXCLUSION_START_TIME" \
  --add-maintenance-exclusion-end="EXCLUSION_END_TIME" \
  --add-maintenance-exclusion-scope="no_upgrades"
Replace the following:
- EXCLUSION_START_TIME: Your selected start time for the maintenance exclusion, in YYYY-MM-DDTHH:MM:SSZ format.
- EXCLUSION_END_TIME: Your selected end time for the maintenance exclusion, in YYYY-MM-DDTHH:MM:SSZ format.
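After creating the cluster, you can optionally verify that it's reachable and that the TPU nodes registered with GKE. The xpk cluster list flags shown here are assumed from the XPK CLI:

# List XPK-managed clusters in this project and zone.
xpk cluster list --project=${PROJECT_ID} --zone=${ZONE}

# Fetch kubeconfig credentials and inspect the nodes.
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --zone=${ZONE} --project=${PROJECT_ID}
kubectl get nodes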
Build or upload the MaxText Docker image
You can either build a Docker image locally using scripts provided by MaxText or use a prebuilt image.
Build locally
The following commands copy your local directory into the container:
# Make sure you're running in a virtual environment with Python 3.12. If nothing is printed, you have the correct version.
[[ "$(python3 -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")' 2>/dev/null)" == "3.12" ]] || { >&2 echo "Error: Python version must be 3.12."; false; }
# Clone MaxText
git clone https://github.com/AI-Hypercomputer/maxtext.git
cd maxtext
git checkout maxtext-tutorial-v1.0.0
# Download custom JAX and libtpu wheels
pip download libtpu==0.0.28.dev20251104+nightly -f "https://storage.googleapis.com/jax-releases/libtpu_releases.html"
pip download --pre jax==0.8.1.dev20251104 jaxlib==0.8.1.dev20251104 --index-url https://us-python.pkg.dev/ml-oss-artifacts-published/jax/simple/
# Build the Docker image
bash docker_build_dependency_image.sh MODE=custom_wheels
After the commands complete successfully, you should see an image named
maxtext_base_image created locally. You can use this local image directly in
the xpk workload create command.
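To confirm the image was built, you can list it with Docker (an optional check):

docker images maxtext_base_image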
Upload image (optional)
After building the Docker image locally using the instructions in the previous section, you can upload the MaxText Docker image into the registry using the following command:
export CLOUD_IMAGE_NAME="${USER}-maxtext-runner"
bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME}
After the command completes successfully, you should see the MaxText image
in gcr.io with the name gcr.io/PROJECT_ID/CLOUD_IMAGE_NAME. You can pass this
image to xpk workload create with the --docker-image flag.
Define the MaxText training command
Prepare the command to run your training script within the Docker container.
The MaxText 1B model is a configuration within the MaxText framework for training a language model with approximately 1 billion parameters. Use this model to experiment at small chip scales; its performance is not optimized.
export MAXTEXT_COMMAND="JAX_PLATFORMS=tpu,cpu \
ENABLE_PJRT_COMPATIBILITY=true \
python3 src/MaxText/train.py src/MaxText/configs/base.yml \
base_output_directory=${BASE_OUTPUT_DIR} \
dataset_type=synthetic \
per_device_batch_size=2 \
enable_checkpointing=false \
gcs_metrics=true \
run_name=maxtext_xpk \
steps=30"
Deploy the training workload
Run the xpk workload create command to deploy your training job. Specify
either the --base-docker-image flag to use the locally built MaxText base
image, or the --docker-image flag with the image you want to use. Optionally,
include the --enable-debug-logs flag to enable debug logging.
xpk workload create \
--cluster ${CLUSTER_NAME} \
--base-docker-image maxtext_base_image \
--workload maxtext-1b-$(date +%H%M) \
--tpu-type=${ACCELERATOR_TYPE} \
--zone ${ZONE} \
--project ${PROJECT_ID} \
--command "${MAXTEXT_COMMAND}"
# [--enable-debug-logs]
Workload names must be unique within the cluster. In this example, $(date +%H%M) is appended to the workload name to ensure uniqueness.
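Once the workload is created, you can monitor it. The xpk workload list flags below are assumed from the XPK CLI, and the label selector relies on the jobset-name label that the JobSet controller (which XPK uses under the hood) applies to pods:

# List workloads and their status on the cluster.
xpk workload list --cluster ${CLUSTER_NAME} \
  --project ${PROJECT_ID} --zone ${ZONE}

# Stream training logs once pods are scheduled; replace WORKLOAD_NAME
# with the name printed by xpk workload create.
kubectl logs -f -l jobset.sigs.k8s.io/jobset-name=WORKLOAD_NAME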
What's next
- Use the Google Cloud ML Diagnostics platform to optimize and diagnose your workloads
- Run a training workload using a recipe optimized for TPU7x
- Run a TPU7x microbenchmark