feat: add gradient accumulation support #2646
base: main
Conversation
Pull request overview
This PR adds gradient accumulation support to the training pipeline, enabling simulation of larger batch sizes without increasing GPU memory usage. The implementation leverages Accelerator's built-in gradient accumulation features.
Key changes:
- Added `gradient_accumulation_steps` configuration parameter (default: 1) to control how many batches to accumulate before performing an optimizer step
- Updated the training loop to properly handle gradient synchronization, skipping evaluation/checkpointing during accumulation steps (see the sketch after this list)
- Modified effective batch size calculations throughout the codebase to account for gradient accumulation
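To make the loop-level behavior concrete, here is a minimal, self-contained sketch of gradient accumulation with Hugging Face Accelerate; the model, data, and printout are placeholders, not the PR's actual `lerobot_train.py` code:

```python
# Minimal sketch of accumulate()-based training with eval/checkpointing gated on
# gradient sync. Assumes HF Accelerate + PyTorch; everything here is illustrative.
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

optim_step = 0
for x, y in dataloader:
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)
        optimizer.step()        # Accelerate only applies the step once gradients are synced
        optimizer.zero_grad()

    # Gradients sync every gradient_accumulation_steps batches; evaluation and
    # checkpointing are skipped on the intermediate accumulation steps.
    if accelerator.sync_gradients:
        optim_step += 1
        print(f"optimizer step {optim_step}")
```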
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/lerobot/configs/train.py | Adds gradient_accumulation_steps configuration parameter with documentation |
| src/lerobot/scripts/lerobot_train.py | Updates training loop and update_policy() to use Accelerator's accumulation context, removes unused lock parameter, and adds gradient sync checks |
| src/lerobot/utils/logging_utils.py | Updates MetricsTracker.step() to calculate effective batch size including gradient accumulation |
| tests/training/test_update_policy.py | Adds comprehensive tests for gradient sync behavior and mathematical equivalence |
| tests/utils/test_logging_utils.py | Adds test for MetricsTracker with gradient accumulation |
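Regarding the `logging_utils.py` row above, the effective batch size per optimizer step is the per-device batch size times the number of processes times the accumulation steps. A hypothetical illustration of that bookkeeping (the real `MetricsTracker.step()` may differ in detail):

```python
# Hypothetical helper illustrating the effective batch size arithmetic; this is
# not the actual MetricsTracker.step() implementation.
def effective_batch_size(batch_size: int, num_processes: int, grad_accum_steps: int) -> int:
    # Samples contributing to a single optimizer step across all devices.
    return batch_size * num_processes * grad_accum_steps

# Matches the example command below: 8 per device x 1 process x 4 accumulation steps.
assert effective_batch_size(8, 1, 4) == 32
```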
What this does
Adds gradient accumulation support to the training script.
This allows simulating larger batch sizes without increasing GPU memory usage, which is useful when training large models or when memory is limited.
- Added `gradient_accumulation_steps` configuration parameter to `TrainPipelineConfig` (default: 1)
- Updated `update_policy()` to use the `accelerator.accumulate()` context manager
- Updated `MetricsTracker` (each `step` is counted as an optimizer step)
- Removed the `lock` parameter from `update_policy()`, as it was not used anywhere in the codebase
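As a rough sketch of how `update_policy()` can wrap the forward/backward pass in `accelerator.accumulate()` (the signature, helpers, and return values here are illustrative, not the actual lerobot implementation):

```python
# Illustrative accumulate()-based update function; the real update_policy() in
# lerobot_train.py may differ (e.g. in signature, metrics, and AMP handling).
def update_policy(policy, batch, optimizer, accelerator, grad_clip_norm: float = 10.0):
    with accelerator.accumulate(policy):
        loss, output_dict = policy.forward(batch)  # assumes the policy returns (loss, extras)
        accelerator.backward(loss)
        if accelerator.sync_gradients:
            # Clip only on iterations where gradients are actually synced and applied.
            accelerator.clip_grad_norm_(policy.parameters(), grad_clip_norm)
        optimizer.step()
        optimizer.zero_grad()
    # The caller can use sync_gradients to decide whether this was a real optimizer step.
    return loss.item(), accelerator.sync_gradients
```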
How it was tested
Added tests in `tests/training/test_update_policy.py`:
- `test_update_policy_sync_gradients`: verifies gradient sync behavior
- `test_update_policy_gradient_accumulation_equivalence`: validates mathematical equivalence
Added `test_metrics_tracker_step_with_accelerator` in `tests/utils/test_logging_utils.py`
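To show the property the equivalence test is after (this toy check is not the actual test code), accumulating scaled micro-batch losses should produce the same parameter update as one optimizer step on the full batch:

```python
# Toy demonstration of gradient-accumulation equivalence; not the actual code in
# tests/training/test_update_policy.py.
import torch

torch.manual_seed(0)
x, y = torch.randn(8, 4), torch.randn(8, 1)

def train_step(model, batches, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    for xb, yb in batches:
        # Scale each micro-batch loss so the summed gradients equal the
        # gradient of the mean loss over the full batch.
        loss = torch.nn.functional.mse_loss(model(xb), yb) / len(batches)
        loss.backward()
    opt.step()

model_a = torch.nn.Linear(4, 1)
model_b = torch.nn.Linear(4, 1)
model_b.load_state_dict(model_a.state_dict())

train_step(model_a, [(x, y)])                          # single full batch
train_step(model_b, [(x[:4], y[:4]), (x[4:], y[4:])])  # two accumulated micro-batches

for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
    assert torch.allclose(p_a, p_b, atol=1e-6)
```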
How to checkout & try? (for the reviewer)
```bash
# Effective batch size: 8 × 1 × 4 = 32
lerobot-train \
  --dataset.repo_id=lerobot/svla_so101_pickplace \
  --policy.type=act \
  --policy.repo_id=foo/my_policy \
  --policy.push_to_hub=false \
  --batch_size=8 \
  --gradient_accumulation_steps=4 \
  --log_freq=1 \
  --steps=50
```