How to Pass Config into Accelerate: A Practical Guide
The landscape of modern artificial intelligence, particularly in the realm of deep learning, has witnessed an explosion in model size and complexity. Training these sophisticated models, especially Large Language Models (LLMs) and intricate vision transformers, is no longer a task for a single GPU or even a single machine. Distributed training has become the de facto standard, enabling researchers and engineers to leverage immense computational power by distributing workloads across multiple GPUs, nodes, or even entire clusters. However, orchestrating distributed training effectively presents a significant challenge. Managing synchronization, communication overhead, mixed precision, and specialized distributed training strategies like DeepSpeed or Fully Sharded Data Parallel (FSDP) can quickly become a labyrinth of boilerplate code and intricate setups.
Enter Hugging Face Accelerate, a powerful library designed to abstract away the complexities of distributed training, allowing developers to write standard PyTorch training loops that can seamlessly run on various distributed configurations without significant code changes. Accelerate handles the intricacies of device placement, communication, and synchronization under the hood, empowering users to focus on model development rather than infrastructure headaches. But to harness Accelerate's full potential, understanding its robust configuration mechanisms is paramount. The way you configure Accelerate dictates how your training jobs utilize resources, manage memory, and optimize performance across different hardware setups.
This guide aims to provide a comprehensive, practical exploration of how to pass configuration into Accelerate. We will delve into the various methods available, from interactive command-line interfaces to programmatic adjustments within your Python scripts, and the use of dedicated configuration files. Each approach offers distinct advantages, catering to different workflows and levels of control. By the end of this article, you will possess a profound understanding of Accelerate's configuration philosophy, enabling you to tailor your distributed training pipelines with precision, efficiency, and scalability. This mastery is not just about making your code run; it's about making it run optimally, reliably, and repeatably, setting the stage for robust model development and subsequent deployment as a performant API.
Understanding Accelerate's Configuration Philosophy
At its core, Accelerate's design philosophy revolves around making distributed training as straightforward as possible. It aims to bridge the gap between single-GPU PyTorch scripts and complex multi-GPU or multi-node setups. To achieve this, Accelerate needs to understand the underlying hardware environment and the desired training strategy. This information is conveyed through its configuration.
The necessity for flexible configuration stems from the diverse nature of distributed training environments. A model might be trained on a single machine with multiple GPUs (e.g., an 8-GPU workstation), or across several interconnected machines, each with its own set of GPUs. Furthermore, different models and datasets might necessitate varying optimization strategies, such as the choice between fp16 and bf16 mixed precision, the number of gradient accumulation steps to simulate larger batch sizes, or the adoption of advanced techniques like DeepSpeed or FSDP for handling truly massive models. Without a clear and adaptable configuration system, developers would be forced to rewrite significant portions of their training code for each new environment or strategy, defeating the purpose of an abstraction layer.
Accelerate’s configuration system provides a hierarchical and intuitive way to specify these parameters. It prioritizes clarity and ease of use, allowing users to start with a simple interactive setup and progressively move towards more sophisticated, version-controlled configurations as their needs evolve. This flexible approach ensures that whether you're a beginner exploring distributed training or an experienced MLOps engineer deploying production-grade LLMs, Accelerate has a configuration pathway that suits your requirements. Used judiciously, these configuration methods make your training runs not just successful, but also efficient, reproducible, and scalable, laying the groundwork for exposing your models via a robust API.
Method 1: Command Line Interface (CLI) Configuration
The Command Line Interface (CLI) offers the most accessible entry point into Accelerate's configuration ecosystem. It's particularly useful for initial setup, interactive experimentation, and for scenarios where you need to quickly adapt a training script to a new environment without modifying any files. Accelerate provides two primary CLI mechanisms for configuration: accelerate config for interactive setup and accelerate launch for direct argument passing or using existing configuration files.
1.1. accelerate config: The Interactive Setup Wizard
The accelerate config command is designed to be an interactive wizard that guides you through the process of setting up your distributed training environment. When you run this command, Accelerate asks a series of questions about your hardware, desired precision, and distributed strategy, then generates a configuration file (typically ~/.cache/huggingface/accelerate/default_config.yaml on Linux, or a similar path on other OS) that accelerate launch will automatically pick up for subsequent runs. This is an excellent starting point for anyone new to Accelerate or setting up a new machine.
Let's walk through the typical prompts you might encounter:
- In which compute environment are you running?
  - Options: This machine, AWS (Amazon Web Services), GCP (Google Cloud Platform), Azure (Microsoft Azure), Slurm, Kubernetes, MPI. The exact list depends on your Accelerate version.
  - Explanation: This question helps Accelerate understand the underlying job scheduler or cloud provider. For most local setups, "This machine" is the correct choice. If you're using a cluster managed by Slurm, for instance, Accelerate will then prompt for Slurm-specific configurations.
- This machine only has one GPU, right? [y/N]
  - Explanation: Accelerate detects the number of GPUs. If you only have one, it simplifies the setup. If there are several, it proceeds to configure multi-GPU training.
- Do you wish to use deepspeed? [yes/NO]
  - Explanation: DeepSpeed is a highly optimized library for training large models, offering memory efficiency and speed benefits through techniques like ZeRO (Zero Redundancy Optimizer). If you say 'yes', Accelerate will then prompt you for DeepSpeed-specific settings.
- Do you wish to use Fully Sharded Data Parallel (FSDP)? [yes/NO]
  - Explanation: FSDP is another advanced distributed training strategy (primarily from PyTorch) that shards model parameters, gradients, and optimizer states across GPUs, making it possible to train models larger than a single GPU's memory. Similar to DeepSpeed, choosing 'yes' will lead to FSDP-specific configuration prompts.
- What distributed strategy would you like to use? <ddp|fsdp|deepspeed|smp|xla>
  - Explanation: This appears if you didn't explicitly choose DeepSpeed or FSDP earlier or want to confirm.
  - ddp (Distributed Data Parallel) is the standard PyTorch approach. smp is for SageMaker Model Parallel. xla is for TPUs.
- What is your choice for mixed precision? <no|fp16|bf16>
  - Explanation: Mixed precision training uses lower-precision formats (like float16 or bfloat16) for certain operations to speed up computation and reduce memory usage, while retaining float32 for critical parts to maintain numerical stability.
  - no: Full float32 precision.
  - fp16: Float16 (half-precision). Generally faster on NVIDIA GPUs with Tensor Cores (Volta, Turing, Ampere, Hopper architectures).
  - bf16: BFloat16. Offers a wider dynamic range than fp16, making it more robust against underflow/overflow issues, especially for models with very dynamic gradients. Often preferred for modern LLMs and works well on newer NVIDIA GPUs (Ampere onwards) and TPUs.
- How many gradient accumulation steps would you like to use? [1]
  - Explanation: Gradient accumulation allows you to simulate a larger batch size than what fits into GPU memory. Gradients are computed over several mini-batches (steps), accumulated, and then the optimizer step is performed once after the specified number of accumulations. For example, 4 steps with a per-GPU batch size of 8 yields an effective batch size of 32 per GPU (multiply by the number of GPUs to get the global batch size). This is crucial for training LLMs where large batch sizes are often beneficial but memory-prohibitive.
- What is the address of the main process? [localhost]
  - Explanation: For multi-node training, this specifies the IP address or hostname of the machine designated as the "main process," which orchestrates the training.
- What is the port of the main process? [29500]
  - Explanation: The port used by the main process for inter-process communication. Ensure it's an open port.
- What is the machine rank? [0]
  - Explanation: In multi-node setups, each machine is assigned a unique rank (0, 1, 2...). The main process usually has rank 0.
- How many machines are you going to use? [1]
  - Explanation: The total number of nodes/machines involved in distributed training.
- Do you want to use a find_unused_parameters=True option in DDP? [yes/NO]
  - Explanation: Relevant for DDP (Distributed Data Parallel). If parts of your model are not used in every forward pass (e.g., due to conditional logic or specific training stages), setting this to True prevents DDP from erroring out, but can incur a performance penalty. It's often safer to set it to True if you're unsure, or explicitly False if you know all parameters are always used.
- DeepSpeed/FSDP Specific Prompts: If you chose DeepSpeed or FSDP, Accelerate will then ask about their specific configurations, such as ZeRO stage (1, 2, or 3), offload_optimizer_to_cpu, offload_param_to_cpu, gradient clipping, checkpointing, etc. These settings are critical for optimizing memory and speed for very large models.
After answering these questions, accelerate config saves a default_config.yaml file, typically in your user cache directory. This file contains all your specified settings and will be automatically loaded by accelerate launch when you run your training script.
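The arithmetic behind the gradient accumulation prompt can be sketched in a few lines. This is only an illustration: effective_batch_size is a hypothetical helper written for this guide, not part of Accelerate.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int, num_processes: int = 1) -> int:
    """Samples that contribute to one optimizer step across all processes."""
    if min(per_device_batch, accumulation_steps, num_processes) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * accumulation_steps * num_processes


# Example from the walkthrough: batch size 8 per GPU, 4 accumulation steps.
print(effective_batch_size(8, 4))                   # a single process
print(effective_batch_size(8, 4, num_processes=4))  # the same settings on four GPUs
```

The second call shows why the global batch size grows with the number of processes even when the per-GPU settings stay the same.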
Pros of accelerate config:
- Ease of Use: Highly interactive and beginner-friendly, requiring no manual file editing.
- Automated Setup: Generates a working configuration file automatically.
- Good for Exploration: Quick way to try different settings on a given machine.

Cons of accelerate config:
- Less Repeatable: Relying solely on the default_config.yaml means configurations aren't explicitly tied to your project or version-controlled, making it harder to reproduce specific runs.
- Global Impact: Changes affect all subsequent accelerate launch commands unless overridden.
1.2. accelerate launch: Overriding and Direct Argument Passing
While accelerate config sets up a default, accelerate launch is the command you actually use to execute your training script in a distributed manner. It can implicitly use the default_config.yaml, but critically, it also allows you to override any of those settings directly via command-line flags. This provides immediate flexibility and fine-grained control for specific training runs.
The basic syntax for accelerate launch is:
accelerate launch [OPTIONS] your_script.py [SCRIPT_ARGS]
Here, [OPTIONS] are Accelerate-specific flags, and [SCRIPT_ARGS] are arguments intended for your Python script.
Let's look at some common accelerate launch options:
- --num_processes INTEGER: Specifies the total number of training processes to spawn. For single-node multi-GPU, this usually equals the number of GPUs you want to use. E.g., --num_processes 4 for a machine with 4 GPUs.
- --num_machines INTEGER: The total number of machines (nodes) involved in the training.
- --mixed_precision {no,fp16,bf16}: Overrides the mixed precision setting. E.g., --mixed_precision fp16.
- --gpu_ids STRING: Specifies the exact GPU IDs to use. E.g., --gpu_ids 0,2,3 to use GPUs 0, 2, and 3.
- --main_process_ip STRING: IP address of the main process for multi-node setups.
- --main_process_port INTEGER: Port for the main process.
- --machine_rank INTEGER: Rank of the current machine in a multi-node setup.
- --gradient_accumulation_steps INTEGER: Overrides the number of gradient accumulation steps. E.g., --gradient_accumulation_steps 8.
- --use_deepspeed: Enables DeepSpeed. Further DeepSpeed settings can be passed via --deepspeed_config_file or additional DeepSpeed-related flags (run accelerate launch --help to see the ones your version supports).
- --use_fsdp: Enables FSDP. Similarly, FSDP arguments can be passed.
- --config_file FILE: Points to a specific YAML or JSON configuration file (discussed in Method 3). This is crucial for version-controlled, project-specific configurations.
Example 1: Single-Node Multi-GPU Training with Overrides
Suppose your default_config.yaml uses fp16, but for a specific experiment, you want to try bf16 with a higher gradient accumulation:
accelerate launch --mixed_precision bf16 --gradient_accumulation_steps 4 train_script.py --data_path /mnt/data/my_dataset
In this command, train_script.py is your training script, and --data_path /mnt/data/my_dataset is an argument passed to your Python script. Accelerate-specific arguments like --mixed_precision and --gradient_accumulation_steps are processed by accelerate launch itself.
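When launch commands are assembled from Python (for instance in a small experiment runner), the same ordering applies: launcher options first, then the script, then the script's own arguments. The sketch below is one way to do that. build_launch_command is a hypothetical helper, and it assumes every launcher option takes a value.

```python
import shlex


def build_launch_command(script, script_args=(), **launch_options):
    """Return an argv list: launcher options first, then the script, then its own args."""
    command = ["accelerate", "launch"]
    for name, value in launch_options.items():
        command += [f"--{name}", str(value)]
    command.append(script)
    command += list(script_args)
    return command


cmd = build_launch_command(
    "train_script.py",
    script_args=["--data_path", "/mnt/data/my_dataset"],
    mixed_precision="bf16",
    gradient_accumulation_steps=4,
)
print(shlex.join(cmd))
```

Keeping the command as a list until the last moment avoids quoting mistakes; the list can be handed to subprocess.run directly.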
Example 2: Multi-Node Training (on Node 0)
If you have two machines (Node 0 and Node 1), you'd run accelerate launch on each:
On Node 0 (main process):
accelerate launch \
--num_machines 2 \
--machine_rank 0 \
--main_process_ip 192.168.1.100 \
--main_process_port 29500 \
--mixed_precision fp16 \
train_script.py
On Node 1 (worker process):
accelerate launch \
--num_machines 2 \
--machine_rank 1 \
--main_process_ip 192.168.1.100 \
--main_process_port 29500 \
--mixed_precision fp16 \
train_script.py
Here, 192.168.1.100 would be the IP address of Node 0. Both nodes need access to the same train_script.py and potentially the same data (via network file system).
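Because the per-node commands differ only in --machine_rank, they can be generated rather than typed by hand. The following is a minimal sketch using only the flags shown above. multi_node_commands is a hypothetical helper, and how each command reaches its node (SSH, a scheduler, and so on) is left out.

```python
import shlex


def multi_node_commands(num_machines, main_process_ip, main_process_port, script, mixed_precision="fp16"):
    """Return one launch command string per machine rank."""
    commands = []
    for rank in range(num_machines):
        argv = [
            "accelerate", "launch",
            "--num_machines", str(num_machines),
            "--machine_rank", str(rank),
            "--main_process_ip", main_process_ip,
            "--main_process_port", str(main_process_port),
            "--mixed_precision", mixed_precision,
            script,
        ]
        commands.append(shlex.join(argv))
    return commands


for rank, command in enumerate(multi_node_commands(2, "192.168.1.100", 29500, "train_script.py")):
    print(f"node {rank}: {command}")
```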
Precedence with accelerate launch: Arguments passed directly to accelerate launch always take precedence over settings in the default_config.yaml file. This provides a powerful way to conduct ad-hoc experiments or make temporary adjustments without altering your saved default configuration.
Pros of accelerate launch arguments:
- Immediate Control: Quick overrides for specific runs without changing files.
- Experimentation: Easily test different settings on the fly.
- Granular Adjustments: Fine-tune parameters for individual training jobs.

Cons of accelerate launch arguments:
- Verbosity: For many parameters, the command line can become very long and unwieldy.
- Error Prone: Easy to make typos with many flags.
- Not Version-Controlled: The specific command used for a run might not be easily recordable or reproducible without manual tracking, making it less ideal for production pipelines.
In summary, the CLI configuration methods in Accelerate provide a spectrum of control, from the guided accelerate config for initial setup to the precise overrides offered by accelerate launch. While convenient for quick tests, for more robust and reproducible workflows, we often turn to programmatic and file-based configurations.
Method 2: Programmatic Configuration within Python Scripts
While CLI configuration is excellent for initial setup and quick adjustments, there are scenarios where you need to configure Accelerate directly within your Python training script. This programmatic approach offers the highest degree of flexibility, allowing for dynamic configuration based on runtime logic, user inputs, or even complex environment checks. It's particularly powerful when integrating Accelerate into existing PyTorch codebases or when you need fine-grained control over specific Accelerate behaviors that might not be exposed as simple CLI flags.
The heart of programmatic configuration lies in the Accelerator object, which is the central orchestrator for distributed training in Accelerate. When you instantiate Accelerator, you can pass various arguments to its constructor, effectively configuring its behavior for the current training run.
2.1. Initializing the Accelerator Object
The most common way to use Accelerate in your script begins with importing and instantiating the Accelerator class:
from accelerate import Accelerator

# ... your model, optimizer, dataloader definitions ...

accelerator = Accelerator(
    mixed_precision="fp16",  # Can be "no", "fp16", "bf16"
    gradient_accumulation_steps=4,
    cpu=False,  # Set to True to force CPU training even if GPUs are available
    project_dir="./my_accelerate_project",  # Directory for logging/checkpoints
    log_with="tensorboard",  # Or "wandb", "clearml"
    # Additional plugins for DeepSpeed or FSDP
    deepspeed_plugin=None,
    fsdp_plugin=None,
    # Other advanced parameters
    step_scheduler_with_optimizer=True,
    # dispatch_batches and split_batches were constructor arguments in older
    # releases; recent releases move them into a dataloader configuration
    # object (dataloader_config), so check the documentation for your version.
    # Backend-specific settings (e.g., PyTorch DDP) go through kwargs_handlers.
)

# ... rest of your training loop using accelerator.prepare(), etc.

Let's break down some of the key parameters you can pass to the Accelerator constructor:

- mixed_precision: A string indicating the desired mixed precision mode: "no", "fp16", or "bf16". This directly controls whether and how half-precision training is enabled. For instance, setting mixed_precision="bf16" is often preferred for training large language models due to bf16's wider dynamic range, which helps prevent issues like NaN values or sudden loss spikes that can sometimes occur with fp16 in unstable training regimes.
- gradient_accumulation_steps: An integer specifying how many steps to accumulate gradients before performing an optimizer step. This parameter is crucial for simulating larger batch sizes when actual batch sizes are limited by GPU memory, a common strategy when training massive models. For example, if your per-process batch size is 8 and gradient_accumulation_steps is 4, each process effectively trains with a batch size of 32.
- cpu: A boolean. If True, Accelerate will force training on the CPU, even if GPUs are available. This can be useful for debugging or when you want to run a lightweight test without engaging GPU resources.
- project_dir: The directory where Accelerate should save logs, checkpoints, and potentially other run-related files. This helps in organizing experimental results.
- log_with: Specifies the logging backend to integrate with, such as "tensorboard", "wandb" (Weights & Biases), or "clearml". Accelerate sets up the chosen tracker for distributed runs so that metrics are reported once, from the main process, rather than once per process.
- deepspeed_plugin: An instance of Accelerate's DeepSpeedPlugin class. This allows for deep customization of DeepSpeed parameters directly within the script. We will delve into this in the advanced section.
- fsdp_plugin: An instance of Accelerate's FullyShardedDataParallelPlugin class, similar to deepspeed_plugin, for configuring PyTorch's FSDP.
- dispatch_batches: A boolean. If True, the dataloader is iterated only on the main process, and each batch is then split and sent to the other processes. If left unset, the default depends on the dataset type (it is typically enabled for iterable datasets).
- split_batches: A boolean. If True, each batch yielded by the dataloader is split across processes, so the batch size in your script is the global batch size. If False (default), each process receives a batch of the size set in your script, so the global batch size is that value multiplied by the number of processes.
- step_scheduler_with_optimizer: A boolean (default True). Leave it True if your learning rate scheduler is stepped at the same time as the optimizer; set it to False if the scheduler is stepped on a different schedule.
- sync_gradients: Not a constructor argument but a read-only attribute, accelerator.sync_gradients, which reports whether gradients are being synchronized on the current step. It is True on every step without gradient accumulation and only on the final accumulation step otherwise, which makes it a convenient guard for operations such as gradient clipping.
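The "wider dynamic range" of bf16 mentioned for mixed_precision can be made concrete with a little arithmetic. The sketch below derives the representable range of each format from its bit layout (fp16 uses 5 exponent bits and 10 mantissa bits; bf16 uses 8 and 7). float_format_range is a hypothetical helper written for illustration.

```python
def float_format_range(exponent_bits: int, mantissa_bits: int):
    """Return (smallest normal, largest finite) for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    smallest_normal = 2.0 ** (1 - bias)
    largest_finite = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    return smallest_normal, largest_finite


fp16_min, fp16_max = float_format_range(exponent_bits=5, mantissa_bits=10)
bf16_min, bf16_max = float_format_range(exponent_bits=8, mantissa_bits=7)

print(f"fp16: {fp16_min:.3e} .. {fp16_max:.3e}")
print(f"bf16: {bf16_min:.3e} .. {bf16_max:.3e}")
```

fp16 tops out at 65504, which is why it generally needs loss scaling, while bf16 covers roughly the same range as float32 at the cost of fewer mantissa bits.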
2.2. When to Use Programmatic Configuration
Programmatic configuration is invaluable in several scenarios:
- Dynamic Adjustments: When configuration parameters need to be determined at runtime. For example, you might adjust gradient_accumulation_steps based on the available GPU memory, or dynamically select mixed_precision based on the specific hardware detected.
- Integration with Existing Codebases: If you're porting an existing PyTorch training script, it's often simpler to add Accelerator instantiation with explicit parameters than to refactor it to rely heavily on CLI arguments or external config files.
- Complex Logic: For advanced use cases where simple flags aren't enough, such as conditionally enabling DeepSpeed or FSDP based on model size thresholds, or implementing custom logging behaviors.
- Unit Testing: Programmatic configuration makes it easier to write unit tests for your training loop, as you can directly pass specific Accelerate settings without relying on external files or environment variables.
- Hyperparameter Tuning Frameworks: When integrating with hyperparameter tuning libraries, programmatic configuration allows the tuning framework to directly pass Accelerate parameters as part of the trial configuration, offering a seamless experience.
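As an illustration of the "dynamic adjustments" case, the sketch below derives two constructor arguments from facts about the hardware and a target batch size. Both helpers are hypothetical, and the rule that bf16 is chosen from CUDA compute capability 8.0 (Ampere) upward is an assumption of this example. In a real script the inputs would typically come from torch.cuda.is_available() and torch.cuda.get_device_capability().

```python
import math


def choose_mixed_precision(has_cuda: bool, compute_capability=(0, 0)) -> str:
    """Pick a precision string from a (major, minor) CUDA compute capability."""
    if not has_cuda:
        return "no"
    # Assumption: bf16 is treated as available from compute capability 8.0 upward.
    return "bf16" if tuple(compute_capability) >= (8, 0) else "fp16"


def choose_accumulation_steps(target_global_batch: int, per_device_batch: int, num_processes: int) -> int:
    """Smallest step count whose effective global batch reaches the target."""
    return max(1, math.ceil(target_global_batch / (per_device_batch * num_processes)))


accelerator_kwargs = {
    "mixed_precision": choose_mixed_precision(True, (8, 0)),
    "gradient_accumulation_steps": choose_accumulation_steps(256, per_device_batch=8, num_processes=4),
}
print(accelerator_kwargs)
# In a training script: accelerator = Accelerator(**accelerator_kwargs)
```

Keeping the decision logic in plain functions like these also makes it easy to unit test without any GPU present.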
2.3. Interaction with CLI/File Configurations: Precedence Rules
It's important to understand how programmatic configurations interact with CLI arguments and configuration files. Accelerate follows a clear hierarchy of precedence:
- Programmatic arguments to the Accelerator() constructor: These have the highest precedence. Any parameter explicitly passed when instantiating Accelerator in your script will override equivalent settings from CLI flags or configuration files.
- CLI flags to accelerate launch: These override settings found in a configuration file (like default_config.yaml or a custom --config_file).
- Configuration file (--config_file or default_config.yaml): These provide the base configuration.

This hierarchy ensures that you can always override settings at a more granular level if needed. For example, if your default_config.yaml specifies mixed_precision: fp16, but your script explicitly sets accelerator = Accelerator(mixed_precision="bf16"), then bf16 will be used. Running accelerate launch --mixed_precision no train.py does not change this, because the explicit constructor argument still wins; the CLI flag only overrides the file when the script leaves mixed_precision unset. Precedence details can differ between parameters and Accelerate versions, so it is worth confirming the effective values at runtime, for example by printing accelerator.mixed_precision.
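The three-level hierarchy described here can be modeled with a layered lookup. This is a conceptual sketch using the standard library's ChainMap and made-up values; it is not how Accelerate resolves settings internally.

```python
from collections import ChainMap

# Hypothetical settings at each level of the hierarchy.
config_file = {"mixed_precision": "fp16", "gradient_accumulation_steps": 1, "num_processes": 4}
cli_flags = {"gradient_accumulation_steps": 8}
script_args = {"mixed_precision": "bf16"}

# ChainMap searches its mappings left to right, so the first one listed wins.
effective = ChainMap(script_args, cli_flags, config_file)

for key in ("mixed_precision", "gradient_accumulation_steps", "num_processes"):
    print(f"{key} = {effective[key]}")
```

Each key resolves to the most specific level that defines it: the script for mixed_precision, the CLI for gradient_accumulation_steps, and the file for num_processes.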
2.4. Example: Simple Training Loop with Programmatic Config
Here’s a simplified example demonstrating programmatic configuration within a PyTorch training loop:
import torch
from torch.utils.data import DataLoader, Dataset
from accelerate import Accelerator
from accelerate.utils import set_seed


# 1. Define a dummy dataset and model
class DummyDataset(Dataset):
    def __init__(self, num_samples=1000, input_dim=10, output_dim=2):
        self.data = torch.randn(num_samples, input_dim)
        self.labels = torch.randint(0, output_dim, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]


class SimpleModel(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)


def main():
    set_seed(42)  # For reproducibility

    # 2. Programmatic Accelerate Configuration
    accelerator = Accelerator(
        mixed_precision="bf16",  # Use bfloat16 for training
        gradient_accumulation_steps=8,  # Accumulate gradients over 8 steps
        log_with="tensorboard",  # Integrate with TensorBoard
        project_dir="./accelerate_logs",  # Log to a specific directory
    )

    # Logging setup using Accelerator's integrated logger
    accelerator.init_trackers("my_accelerate_run", config={"learning_rate": 1e-3, "epochs": 5})

    # 3. Model, Optimizer, and Dataloader
    input_dim = 10
    output_dim = 2
    model = SimpleModel(input_dim, output_dim)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    train_dataset = DummyDataset(input_dim=input_dim, output_dim=output_dim)
    train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

    # 4. Prepare everything for distributed training
    model, optimizer, train_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader
    )

    # 5. Training Loop
    num_epochs = 5
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for step, (inputs, labels) in enumerate(train_dataloader):
            with accelerator.accumulate(model):  # Accumulate gradients
                outputs = model(inputs)
                loss = loss_fn(outputs, labels)
                accelerator.backward(loss)  # Perform backward pass

                # Log loss for the current step
                accelerator.log({"train_loss": loss.item()}, step=epoch * len(train_dataloader) + step)

                if accelerator.sync_gradients:
                    # Clip only on steps where gradients are actually synchronized
                    accelerator.clip_grad_norm_(model.parameters(), 1.0)  # Example gradient clipping

                # Inside accumulate(), the prepared optimizer skips the update
                # on intermediate accumulation steps.
                optimizer.step()
                optimizer.zero_grad()  # Clear gradients

            total_loss += loss.item()

        avg_loss = total_loss / len(train_dataloader)
        accelerator.print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")
        accelerator.log({"avg_epoch_loss": avg_loss}, step=epoch)

    accelerator.end_training()  # Close the trackers


if __name__ == "__main__":
    main()
To run this script, save it as train_programmatic.py and simply execute:
accelerate launch train_programmatic.py
Accelerate will use the settings passed to the Accelerator constructor: bf16 mixed precision, 8 gradient accumulation steps, and TensorBoard logging in the specified accelerate_logs directory. Because mixed_precision is set explicitly in the script, running accelerate launch --mixed_precision fp16 train_programmatic.py would not change it, which is consistent with the precedence rules in Section 2.3. To let the launcher or a configuration file decide the precision, leave the argument out of the constructor call.
Programmatic configuration offers unparalleled control and integration, making it a powerful tool for sophisticated distributed training workflows. However, for managing complex, version-controlled setups, especially when collaborating in teams or deploying across various environments, dedicated configuration files often provide a more robust and organized solution.
Method 3: Configuration Files (.yaml or .json)
While CLI options are excellent for quick experiments and programmatic configuration offers dynamic control, relying solely on them for complex, production-grade distributed training can become cumbersome. Imagine a scenario where you need to manage dozens of parameters, including intricate settings for DeepSpeed or FSDP, across multiple environments (e.g., development, staging, production). Here, a dedicated configuration file becomes indispensable.
Configuration files, typically in YAML or JSON format, provide a structured, human-readable, and version-controllable way to define all aspects of your Accelerate environment. They promote separation of concerns: your training code focuses on the model and logic, while the configuration file specifies the execution environment and strategy. This approach is highly recommended for MLOps pipelines, team collaboration, and ensuring reproducibility of results.
3.1. Structure of the Configuration File
As mentioned earlier, running accelerate config generates a default_config.yaml file. This file serves as an excellent template for creating your own custom configuration files. These files are typically structured to define various aspects of the distributed environment and training strategy.
A typical Accelerate configuration file might look like this:
# my_custom_config.yaml
# Illustrative layout: exact key names vary across Accelerate versions,
# so compare against a file generated by accelerate config.

# General Compute Environment Settings
compute_environment: LOCAL_MACHINE # LOCAL_MACHINE, AWS, GCP, Azure, SLURM, KUBERNETES, MPI

# Distributed Training Type
distributed_type: MULTI_GPU # NO, MULTI_GPU, FSDP, DEEPSPEED, MEGATRON_LM, XLA

# Mixed Precision Settings
mixed_precision: bf16 # no, fp16, bf16

# Number of Processes (GPUs) to use per machine
num_processes: 4

# Multi-Machine (Node) Settings
num_machines: 1
machine_rank: 0
main_process_ip: null # Set for multi-node setups
main_process_port: null # Set for multi-node setups
main_process_uri: null # Alternative for main process identification

# Gradient Accumulation
gradient_accumulation_steps: 8

# DDP Specific Settings (if distributed_type is MULTI_GPU)
use_ddp: true # Explicitly stating DDP (Accelerate's multi-GPU mode uses DDP by default)
ddp_find_unused_parameters: false # Set to true if parts of model are not used in every forward pass

# DeepSpeed Specific Settings (if distributed_type is DEEPSPEED)
deepspeed_config:
  deepspeed_path: null # Path to deepspeed.json config if separate
  zero_optimization:
    stage: 2 # ZeRO Stage (1, 2, or 3)
    offload_optimizer_states: true # Offload optimizer states to CPU/NVMe
    offload_param_states: false # Offload model parameters to CPU/NVMe
    zero_hpz_partition_size: 1
    zero_quantized_weights: false
    zero_reduce_bucket_size: 5e8
    zero_max_live_parameters: 1e9
    zero_max_prefetch_stage_2_parameters: 1e9
  fp16:
    enabled: false # Use fp16 for DeepSpeed
    loss_scale_window: 1000
    initial_scale_power: 15
    hysteresis: 2
    min_loss_scale: 1
  bf16:
    enabled: true # Use bf16 for DeepSpeed
  gradient_accumulation_steps: auto # Or a specific integer
  gradient_clipping: 1.0 # Gradient clipping threshold
  # ... many other DeepSpeed specific parameters

# FSDP Specific Settings (if distributed_type is FSDP)
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_LAYER_WRAP # Can be TRANSFORMER_LAYER_WRAP, SIZE_BASED_WRAP, or custom
  fsdp_backward_prefetch: BACKWARD_PRE # NO_PREFETCH, BACKWARD_PRE, BACKWARD_POST
  fsdp_offload_params: true # Offload parameters to CPU
  fsdp_sharding_strategy: SHARD_GRAD_OP # NO_SHARD, SHARD_GRAD_OP, FULL_SHARD, HYBRID_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT # FULL_STATE_DICT, SHARDED_STATE_DICT, LOCAL_STATE_DICT
  fsdp_cpu_ram_efficient_loading: true # For efficient loading of large models
  # ... other FSDP parameters

# Logging and Project Settings
project_dir: ./my_accelerate_project_logs
logging_dir: null # Will default to project_dir/run_name
log_with: tensorboard # tensorboard, wandb, clearml, all, or null

# Other Accelerate specific settings
dynamo_config: {} # For PyTorch 2.0 dynamo compilation settings
use_cpu: false # Force CPU training
3.2. Loading a Specific Configuration File
To use a custom configuration file, you provide its path to accelerate launch using the --config_file argument:
accelerate launch --config_file my_custom_config.yaml train_script.py
When accelerate launch finds the --config_file argument, it will load the specified YAML or JSON file and use its settings as the base configuration for the training run. This overrides the default global configuration generated by accelerate config.
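Since a mistyped key or value in a configuration file only surfaces once the job starts, a small pre-flight check can be worthwhile. The sketch below validates a JSON configuration with the standard library. REQUIRED_KEYS is a hypothetical project convention rather than something Accelerate enforces, and the allowed precision values are the three named in this guide.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a hypothetical minimal key set this project wants in every config.
REQUIRED_KEYS = {"compute_environment", "distributed_type", "mixed_precision", "num_processes"}
ALLOWED_PRECISIONS = {"no", "fp16", "bf16"}


def check_config(path):
    """Return a list of human-readable problems; an empty list means the file passed."""
    config = json.loads(Path(path).read_text())
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    precision = config.get("mixed_precision")
    if precision is not None and precision not in ALLOWED_PRECISIONS:
        problems.append(f"unsupported mixed_precision: {precision}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_custom_config.json"
    path.write_text(json.dumps({"compute_environment": "LOCAL_MACHINE", "mixed_precision": "fp32"}))
    problems = check_config(path)
    print(problems)
```

A YAML file could be checked the same way after loading it with a YAML parser.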
3.3. Advantages of Configuration Files
- Repeatability and Reproducibility: The configuration is explicitly defined in a file, making it easy to reproduce the exact training environment later. This is crucial for scientific experiments and model validation.
- Version Control: Configuration files can be stored in your version control system (e.g., Git) alongside your code. This allows you to track changes, revert to previous versions, and ensure that everyone on the team is using the same setup.
- Separation of Concerns: Clearly separates infrastructure and environment details from your core training logic. Your Python script can remain cleaner and more focused on the ML problem.
- Team Collaboration: Teams can share and standardize configurations, ensuring consistent training environments across different developers and machines.
- Environment Management: Easily switch between different configurations for different environments (e.g., a dev_config.yaml for local debugging, prod_config.yaml for cluster training) by simply changing the --config_file path.
- Complex Settings: Ideal for defining intricate configurations for advanced features like DeepSpeed or FSDP, which often involve many nested parameters.
3.4. Disadvantages of Configuration Files
- Less Dynamic: While powerful, configuration files are static. They don't offer the same runtime dynamism as programmatic configuration without additional logic to generate or modify them on the fly.
- Overhead for Simple Cases: For very simple single-GPU training, creating a separate config file might feel like overkill.
3.5. Deep Dive into DeepSpeed and FSDP Configurations within YAML
For large language models and other complex architectures, DeepSpeed and FSDP are essential tools for scaling. Accelerate provides comprehensive support for configuring these directly within your YAML files, offering fine-grained control over their advanced features.
DeepSpeed Configuration (deepspeed_config):
The deepspeed_config block in your YAML file exposes the most commonly used DeepSpeed settings as flat keys. For anything beyond them, set deepspeed_config_file to the path of a regular DeepSpeed JSON file, which follows DeepSpeed's own schema. The flat keys include:
- zero_stage: The most critical parameter, determining the ZeRO stage (0 disables ZeRO; 1, 2, or 3 enable it).
  - ZeRO Stage 1: Partitions the optimizer states (e.g., Adam states) across GPUs.
  - ZeRO Stage 2: Partitions optimizer states and gradients across GPUs.
  - ZeRO Stage 3: Partitions optimizer states, gradients, and model parameters across GPUs. This is often necessary when the model states do not fit in a single GPU's memory.
- offload_optimizer_device: Where to offload optimizer states (none, cpu, or nvme) for further memory savings.
- offload_param_device: Where to offload model parameters (none, cpu, or nvme). Parameter offloading requires ZeRO Stage 3.
- gradient_accumulation_steps: DeepSpeed can also manage gradient accumulation.
- gradient_clipping: Sets a global gradient clipping value to prevent exploding gradients.
Precision is selected with the top-level mixed_precision key. In a DeepSpeed JSON file, the equivalent settings live in the fp16 and bf16 blocks: enabled switches the respective precision on (you typically enable one and disable the other), while loss_scale_window, initial_scale_power, etc. control dynamic loss scaling, which is crucial for fp16 stability. In the JSON file, values such as gradient_accumulation_steps, train_batch_size, and train_micro_batch_size_per_gpu can be set to auto so that Accelerate fills them in.
Example snippet for DeepSpeed (from my_custom_config.yaml):
distributed_type: DEEPSPEED
mixed_precision: bf16 # Selects the precision DeepSpeed will use
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: cpu
  offload_param_device: cpu # Enable for truly enormous models
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
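DeepSpeed's own configuration format is a JSON document, which Accelerate can consume through the deepspeed_config_file key. As a rough sketch, such a file can be generated from Python so that the ZeRO stage and offloading choices stay consistent with each other. The key names (zero_optimization, offload_optimizer, offload_param, bf16, fp16) are DeepSpeed's; the make_ds_config helper and its defaults are assumptions made for illustration.

```python
import json


def make_ds_config(stage=3, offload_to_cpu=True, use_bf16=True):
    """Build a minimal DeepSpeed config dict for the given ZeRO stage."""
    zero = {"stage": stage}
    if offload_to_cpu and stage >= 1:
        zero["offload_optimizer"] = {"device": "cpu"}
        if stage == 3:
            # Parameter offloading is only available with ZeRO stage 3.
            zero["offload_param"] = {"device": "cpu"}
    return {
        "zero_optimization": zero,
        "bf16": {"enabled": use_bf16},
        "fp16": {"enabled": not use_bf16},
        # "auto" leaves the value to the Accelerate integration.
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": 1.0,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


if __name__ == "__main__":
    # Inspect the result; redirect it to a file to use it with deepspeed_config_file.
    print(json.dumps(make_ds_config(), indent=2))
```

Generating the file this way avoids combinations such as parameter offloading with ZeRO Stage 2, which DeepSpeed does not support.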
When working with an AI Gateway or LLM Gateway, the ability to reliably train and deploy models, especially those optimized with DeepSpeed, is paramount. Such a gateway would eventually serve the api endpoints for these powerful models.
FSDP Configuration (fsdp_config):
FSDP (Fully Sharded Data Parallel) is PyTorch's native solution for sharding model states. Its configuration block allows you to fine-tune how parameters, gradients, and optimizer states are sharded.
- fsdp_sharding_strategy:
  - NO_SHARD: No sharding.
  - SHARD_GRAD_OP: Only shards gradients and optimizer states (similar to ZeRO Stage 2).
  - FULL_SHARD: Shards parameters, gradients, and optimizer states (similar to ZeRO Stage 3). This is the most memory-efficient.
  - HYBRID_SHARD: Applies full sharding within each node and replicates the model across nodes, trading some memory for less inter-node communication.
- fsdp_auto_wrap_policy: How FSDP decides which modules to wrap.
  - TRANSFORMER_BASED_WRAP: Wraps each Transformer layer, ideal for models like BERT, GPT.
  - SIZE_BASED_WRAP: Wraps modules larger than a certain size.
- fsdp_offload_params: Whether to offload sharded parameters to CPU when not needed on GPU.
- fsdp_backward_prefetch: Strategy for prefetching tensors during the backward pass to overlap communication and computation.
- fsdp_state_dict_type: How the state dictionary is saved (FULL_STATE_DICT, SHARDED_STATE_DICT, LOCAL_STATE_DICT). Crucial for loading and saving large models efficiently.
- fsdp_cpu_ram_efficient_loading: Lets only the first process load the pretrained checkpoint, so host RAM usage is not multiplied by the number of processes when loading very large models.
Example snippet for FSDP (from my_custom_config.yaml):
distributed_type: FSDP
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP # Set fsdp_transformer_layer_cls_to_wrap if your transformer block is custom
  fsdp_offload_params: true
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT # Recommended for saving large FSDP models
  fsdp_cpu_ram_efficient_loading: true
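Because these values are plain strings, a typo in a sharding strategy or wrap policy is easy to make and may only surface once the job has started. The following sketch checks an fsdp_config dictionary before launch. The allowed-value sets are assumptions based on the options the accelerate config questionnaire offers and may differ between releases; validate_fsdp_config is a hypothetical helper, not an Accelerate function.

```python
# Assumed allowed values per key; verify against your installed Accelerate version.
ALLOWED_VALUES = {
    "fsdp_sharding_strategy": {"NO_SHARD", "SHARD_GRAD_OP", "FULL_SHARD", "HYBRID_SHARD"},
    "fsdp_auto_wrap_policy": {"TRANSFORMER_BASED_WRAP", "SIZE_BASED_WRAP", "NO_WRAP"},
    "fsdp_state_dict_type": {"FULL_STATE_DICT", "SHARDED_STATE_DICT", "LOCAL_STATE_DICT"},
    "fsdp_backward_prefetch": {"BACKWARD_PRE", "BACKWARD_POST", "NO_PREFETCH"},
}


def validate_fsdp_config(fsdp_config):
    """Return a list of problems found; an empty list means no issue was detected."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        value = fsdp_config.get(key)
        if value is not None and value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return problems


if __name__ == "__main__":
    config = {
        "fsdp_sharding_strategy": "FULL_SHARD",
        "fsdp_state_dict_type": "SHARDED_STATE_DICT",
    }
    print(validate_fsdp_config(config) or "no problems found")
```

In practice the dictionary would come from parsing the fsdp_config block of the YAML file, for example with a YAML library of your choice.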
Effectively managing DeepSpeed and FSDP through configuration files ensures that your highly optimized LLMs can be trained reliably and repeatedly. Once trained, these models are often exposed through api endpoints, and their performance and cost can be managed by an AI Gateway or LLM Gateway. This level of detail in configuration is essential for models that will eventually serve critical applications.
Advanced Configuration Scenarios
Beyond the fundamental parameters, Accelerate offers sophisticated controls for optimizing specific aspects of distributed training. Mastering these advanced configurations is key to pushing the boundaries of model size and training efficiency.
4.1. Mixed Precision Training: fp16 vs. bf16
Mixed precision training is a cornerstone of modern deep learning, allowing for faster training and reduced memory footprint by performing some operations in lower precision (typically 16-bit floats). Accelerate seamlessly integrates this capability, but understanding the nuances between fp16 and bf16 is vital.
- fp16 (Half-precision float):
  - Representation: Uses 1 sign bit, 5 exponent bits, and 10 mantissa bits.
  - Hardware Support: Widely supported on NVIDIA GPUs with Tensor Cores (Volta, Turing, Ampere, Hopper architectures), offering significant speedups.
  - Pros: Fast on compatible hardware, significant memory savings.
  - Cons: Limited dynamic range (small exponent range) and precision (few mantissa bits) compared to fp32. Can be prone to underflow (numbers becoming zero) or overflow (numbers becoming infinity), especially with very small gradients or large weight scales. Dynamic loss scaling is often required to mitigate these issues.
- bf16 (BFloat16):
  - Representation: Uses 1 sign bit, 8 exponent bits, and 7 mantissa bits. Crucially, it has the same exponent range as fp32, but reduced precision.
  - Hardware Support: Supported on newer NVIDIA GPUs (Ampere architecture and later, like A100, H100), Google TPUs, and some Intel CPUs.
  - Pros: Wider dynamic range than fp16, making it much more robust against underflow/overflow errors, often allowing stable training without dynamic loss scaling. Generally easier to get stable training with bf16 for complex models like LLMs.
  - Cons: Not natively accelerated on older Tensor Cores (e.g., V100), which are built for fp16. Fewer mantissa bits than fp16, so individual values are stored less precisely. Memory usage is the same as fp16, since both formats occupy 16 bits.
Configuration:
- CLI: accelerate launch --mixed_precision bf16 train.py
- Programmatic: accelerator = Accelerator(mixed_precision="bf16")
- Config File: mixed_precision: bf16
Choosing between fp16 and bf16 depends heavily on your GPU hardware and the stability of your model's training. For LLMs, bf16 has become the preferred choice due to its robustness.
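The range-versus-precision trade-off follows directly from the bit layouts. As a small worked example, the sketch below derives the largest finite value, the smallest positive normal value, and the spacing just above 1.0 from the exponent and mantissa widths, using the standard IEEE-style formulas. The function name float_format_stats is illustrative.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Return (largest finite value, smallest positive normal value, machine epsilon)."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next representable value
    return max_finite, min_normal, epsilon


if __name__ == "__main__":
    formats = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}
    for name, (e_bits, m_bits) in formats.items():
        max_finite, min_normal, eps = float_format_stats(e_bits, m_bits)
        print(f"{name}: max={max_finite:.4e} min_normal={min_normal:.4e} eps={eps:.4e}")
```

The fp16 maximum of 65504 is why large activations or unscaled losses overflow, whereas bf16 reaches roughly the same maximum as fp32 at the cost of a coarser epsilon.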
4.2. Gradient Accumulation
Gradient accumulation is a technique to effectively increase the batch size without requiring more GPU memory. Instead of performing an optimizer step after every mini-batch, gradients are accumulated over several mini-batches before a single optimizer step is taken. This mimics the effect of a much larger batch size, which is often beneficial for model convergence and stability, especially for very large models where a true large batch size cannot fit into GPU memory.
How it works:
1. Forward pass and backward pass are performed for N mini-batches.
2. Gradients from each backward pass are added (accumulated) instead of immediately updating weights.
3. After N mini-batches, the accumulated gradients are used to perform a single optimizer.step().
4. Gradients are then zeroed out for the next accumulation cycle.
Configuration:
- CLI: accelerate launch --gradient_accumulation_steps 8 train.py
- Programmatic: accelerator = Accelerator(gradient_accumulation_steps=8)
- Config File: gradient_accumulation_steps: 8
When using gradient accumulation, wrap each training step in the accelerator.accumulate(model) context manager and call accelerator.backward(loss). Inside that context, Accelerate skips optimizer.step() and optimizer.zero_grad() on intermediate mini-batches and applies them only when the accumulation cycle completes.
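The claim that accumulation mimics a larger batch can be checked numerically. The sketch below uses a deliberately tiny one-parameter model (y = w * x with a mean-squared-error loss) in plain Python rather than PyTorch, and compares the gradient of one batch of eight samples with the sum of four micro-batch gradients, each scaled by 1/4. The data and function names are made up for illustration; the equality holds exactly only when all micro-batches have the same size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum per-micro-batch gradients, each scaled by 1 / number of accumulation steps."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch size must be divisible by micro_batch_size")
    num_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_steps
    return total


if __name__ == "__main__":
    xs = [float(i) for i in range(1, 9)]
    ys = [2.0 * x + 1.0 for x in xs]
    full = mse_grad(0.5, xs, ys)
    accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
    print(f"full batch: {full:.6f}  accumulated: {accumulated:.6f}")
```

The 1/N scaling is the step that is easy to forget in a hand-written loop; with gradient_accumulation_steps set, accelerator.backward(loss) performs that division for you.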
4.3. DeepSpeed Integration
DeepSpeed is an advanced optimization library from Microsoft that provides significant memory and training speed benefits, especially for training models with billions or trillions of parameters. Accelerate offers seamless integration with DeepSpeed, allowing you to leverage its features through configuration.
Key DeepSpeed features:
- ZeRO (Zero Redundancy Optimizer): Partitions optimizer states, gradients, and/or model parameters across GPUs to reduce memory consumption. ZeRO Stage 3 offers the most aggressive memory savings.
- Mixed Precision: DeepSpeed can manage its own mixed precision training, often with better stability than vanilla PyTorch AMP.
- Gradient Accumulation and Checkpointing: Optimized implementations for these memory-saving techniques.
Configuration with Accelerate: DeepSpeed is configured through a DeepSpeedPlugin object in a programmatic setup or a dedicated block in the YAML file.
- Programmatic (using DeepSpeedPlugin):

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Configure DeepSpeed parameters
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=4,
    offload_optimizer_device="cpu",
    offload_param_device="none",  # "cpu" or "nvme" requires ZeRO stage 3
    # ... other DeepSpeed options
)

# Precision is selected on the Accelerator rather than on the plugin
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)
```

- Config File (deepspeed_config block): As detailed in Method 3, the deepspeed_config block holds the DeepSpeed settings. The keys below are Accelerate's own flat keys; a full DeepSpeed JSON file can be referenced with deepspeed_config_file instead.

```yaml
distributed_type: DEEPSPEED
mixed_precision: bf16
deepspeed_config:
  zero_stage: 3
  offload_optimizer_device: cpu
  offload_param_device: cpu # Enable for massive models
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  # ... other DeepSpeed params
```

When configuring DeepSpeed, especially for very large models (e.g., 7B, 13B, 70B LLMs), careful consideration of zero_stage, offloading options, and memory limits is crucial. This directly impacts the ability to fit the model into memory and train efficiently. The output of such training efforts, whether fine-tuned LLMs or custom AI models, will eventually be exposed as an api, potentially managed by an AI Gateway or LLM Gateway for secure and scalable access.
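To get a feel for how the ZeRO stage affects memory, the following back-of-the-envelope sketch applies the per-GPU model-state accounting from the ZeRO paper. It assumes mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for optimizer states (fp32 weights, momentum, and variance). Activations, temporary buffers, and fragmentation are ignored, so the results are lower bounds rather than predictions. The function name is illustrative.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model states under a given ZeRO stage (0-3)."""
    params = 2 * num_params   # 16-bit weights
    grads = 2 * num_params    # 16-bit gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus     # Stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus     # Stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus    # Stage 3 also partitions parameters
    return params + grads + optim


if __name__ == "__main__":
    num_params, num_gpus = 7e9, 8  # e.g. a 7B-parameter model on one 8-GPU node
    for stage in range(4):
        gigabytes = zero_model_state_bytes(num_params, num_gpus, stage) / 1e9
        print(f"ZeRO stage {stage}: ~{gigabytes:.1f} GB of model states per GPU")
```

Comparing these figures with the memory of your GPUs gives a first indication of whether a lower stage is enough or whether Stage 3 and offloading need to be considered.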
4.4. FSDP (Fully Sharded Data Parallel) Integration
FSDP is PyTorch's native equivalent to DeepSpeed's ZeRO, providing a powerful way to shard model parameters, gradients, and optimizer states across GPUs. This allows for training models much larger than a single GPU's memory.
Key FSDP features:
- Parameter, Gradient, Optimizer Sharding: FSDP can shard all three components across GPUs.
- Auto-Wrapping: Automatically detects and wraps large modules (e.g., Transformer blocks) into FSDP units for efficient sharding.
- CPU Offloading: Can offload inactive sharded parameters to CPU to save GPU memory.
Configuration with Accelerate: Similar to DeepSpeed, FSDP is configured via a FullyShardedDataParallelPlugin programmatically or an fsdp_config block in YAML.
- Programmatic (using FullyShardedDataParallelPlugin):

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import (
    BackwardPrefetch,
    CPUOffload,
    ShardingStrategy,
    StateDictType,
)

# Configure FSDP parameters
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # FULL_SHARD, SHARD_GRAD_OP, etc.
    cpu_offload=CPUOffload(offload_params=True),  # Offload parameters to CPU
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    limit_all_gathers=True,  # Improves performance for some models
    state_dict_type=StateDictType.SHARDED_STATE_DICT,  # Important for saving/loading
    # ... other FSDP options
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```

- Config File (fsdp_config block): As detailed in Method 3, the fsdp_config block allows detailed FSDP configuration.

```yaml
distributed_type: FSDP
mixed_precision: bf16 # FSDP also works well with bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP # Specify this for transformer models
  fsdp_offload_params: true # Enable for maximum memory savings
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_state_dict_type: SHARDED_STATE_DICT # Essential for large model checkpoints
  fsdp_cpu_ram_efficient_loading: true
  # ... other FSDP params
```

FSDP is a powerful alternative to DeepSpeed, especially for those who prefer a more native PyTorch solution. Both require careful configuration to maximize memory efficiency and training throughput for huge models. These advanced strategies are indispensable for pushing the boundaries of what is possible in AI training, directly impacting the capabilities of future AI services offered via api endpoints.
4.5. Multi-Node / Multi-Machine Setups
Training models that don't fit on a single machine, or simply require more aggregate computing power, necessitates multi-node distributed training. Accelerate simplifies this by handling the inter-machine communication setup.
Key parameters:
- main_process_ip: The IP address of the node that will act as the "main" or "rank 0" process. All other nodes connect to this IP.
- main_process_port: The port on the main process machine that will be used for communication. Ensure this port is open and accessible between all nodes.
- machine_rank: A unique identifier for the current machine, starting from 0. The main process machine typically has rank 0.
- num_machines: The total number of machines participating in the distributed training.
Configuration:
- CLI (on each machine):

```bash
# On Machine 0 (main process, IP: 192.168.1.10)
accelerate launch --num_machines 2 --machine_rank 0 --main_process_ip 192.168.1.10 --main_process_port 29500 train.py

# On Machine 1 (worker process, IP: 192.168.1.11)
accelerate launch --num_machines 2 --machine_rank 1 --main_process_ip 192.168.1.10 --main_process_port 29500 train.py
```
- Config File: It's common to have a shared configuration file (e.g., on a network file system) or generate separate ones for each machine based on templates.

```yaml
# shared_multi_node_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU # Or FSDP, DEEPSPEED
mixed_precision: bf16
num_processes: 16 # Total number of processes across all machines (2 machines x 8 GPUs)
num_machines: 2
main_process_ip: 192.168.1.10 # Main process IP
main_process_port: 29500
# ... other settings
```

Then on Machine 0:

```bash
accelerate launch --config_file shared_multi_node_config.yaml --machine_rank 0 train.py
```

On Machine 1:

```bash
accelerate launch --config_file shared_multi_node_config.yaml --machine_rank 1 train.py
```

Note: When using a shared config file, you still need to override machine_rank via CLI for each machine unless your environment variable setup (e.g., Slurm) handles this automatically.
Networking Considerations: For multi-node setups, ensure that:
1. All nodes can communicate with the main_process_ip and main_process_port. Firewall rules must allow traffic on that port.
2. The train.py script and necessary data are accessible on all nodes (e.g., via a shared network file system like NFS or S3 buckets).
3. All nodes have the same versions of PyTorch, Accelerate, and any other dependencies.
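The first point can be checked from a worker node before committing to a long run. The sketch below attempts a plain TCP connection to the main process address using only the standard library. It is a generic reachability probe, not an Accelerate feature, and the helper name can_reach is an assumption. Note that it only succeeds while something is listening on the port, for example once the rank 0 process has started, so a failure before launch does not by itself prove that a firewall is blocking traffic.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

On a worker node, calling can_reach("192.168.1.10", 29500) with the address and port from the example above indicates whether the rendezvous endpoint is reachable from that machine.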
Mastering these advanced configuration techniques empowers you to tackle the most demanding AI training tasks. The models developed through such rigorous processes are highly valuable and, once deployed, require robust management, often through an AI Gateway or LLM Gateway to secure and optimize their performance as an api.
Precedence and Best Practices
With multiple ways to pass configurations to Accelerate, understanding their order of precedence and adopting best practices is crucial for efficient, reliable, and reproducible training.
5.1. Order of Precedence
Accelerate follows a clear and logical hierarchy when resolving configuration parameters:
1. Programmatic arguments to Accelerator() constructor: These have the highest priority. If you explicitly pass mixed_precision="bf16" to Accelerator(), it will override all other settings for mixed precision. This allows for dynamic, in-script control.
2. CLI flags passed to accelerate launch: These come next. For example, accelerate launch --mixed_precision fp16 ... will override any mixed_precision setting found in a configuration file or the default_config.yaml. This is useful for temporary overrides or specific experimental runs.
3. Custom configuration file (--config_file <path>.yaml): If you specify a custom config file, its contents are loaded. These settings override the default_config.yaml. This is the recommended method for project-specific, version-controlled configurations.
4. Default configuration file (~/.cache/huggingface/accelerate/default_config.yaml): This file, generated by accelerate config, provides the baseline settings if no other configurations are specified.
5. Accelerate's internal defaults: If a parameter is not set by any of the above methods, Accelerate falls back to its own hardcoded default values.
This hierarchy ensures that more specific or direct configurations (programmatic, CLI) take precedence over general or saved configurations (config files), giving you ultimate control over each training run.
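The hierarchy can be modeled as a stack of dictionaries in which the first layer that defines a key wins. The sketch below uses collections.ChainMap for that lookup. It is a mental model of the ordering described in this section, not Accelerate's actual resolution code, and all values in the layers are made up for illustration.

```python
from collections import ChainMap

# Layers ordered from highest to lowest priority (illustrative values only).
programmatic = {}                                   # nothing passed to Accelerator()
cli_flags = {"num_processes": 1}                    # e.g. --num_processes 1
custom_config = {"mixed_precision": "bf16", "num_processes": 2,
                 "gradient_accumulation_steps": 4}  # --config_file contents
default_config = {"mixed_precision": "fp16"}        # default_config.yaml
internal_defaults = {"mixed_precision": "no", "num_processes": 1,
                     "gradient_accumulation_steps": 1}


def resolve(*layers):
    """Merge layers so that earlier layers override later ones."""
    return dict(ChainMap(*layers))


if __name__ == "__main__":
    settings = resolve(programmatic, cli_flags, custom_config,
                       default_config, internal_defaults)
    for key in sorted(settings):
        print(f"{key} = {settings[key]}")
```

Here num_processes comes from the CLI layer, while mixed_precision and gradient_accumulation_steps fall through to the custom config file because no higher layer sets them.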
5.2. When to Use Which Method
- accelerate config (interactive setup):
  - Ideal for: Initial setup of a new machine, first-time users of Accelerate, quickly establishing a baseline configuration.
  - Avoid for: Production environments, shared projects, or when reproducibility is paramount, as the default_config.yaml can be easily overwritten.
- CLI flags (accelerate launch --param value ...):
  - Ideal for: Ad-hoc experiments, quickly testing a hypothesis (e.g., "what if I use bf16 instead of fp16 for this run?"), making temporary adjustments to a production job without changing files.
  - Avoid for: Defining complex configurations (e.g., DeepSpeed or FSDP with many sub-parameters) due to command line verbosity and potential for errors.
- Programmatic configuration (Accelerator(param=value, ...)):
  - Ideal for: Dynamic configuration based on runtime logic, integrating Accelerate into complex existing codebases, specific behaviors that are best controlled within the script, robust unit testing of training logic.
  - Avoid for: Managing environment-specific settings (e.g., num_processes, main_process_ip) that ideally should be external to the code. Over-reliance can clutter your training script.
- Configuration files (--config_file my_config.yaml):
  - Ideal for: Production deployments, team collaboration, MLOps pipelines, managing complex DeepSpeed/FSDP configurations, ensuring full reproducibility, separating infrastructure concerns from code logic, version control. This is generally the recommended best practice for serious development.
  - Avoid for: Very simple one-off runs where a CLI flag is quicker, or when parameters must be truly dynamic and determined at runtime by the script.
5.3. Best Practices
- Version Control Your Config Files: Always store your custom Accelerate configuration files (and any DeepSpeed/FSDP sub-configs) in your project's Git repository. This ensures that every team member uses the same settings and that your experiments are reproducible over time.
- Use Descriptive Naming: Name your config files clearly (e.g., training_config_fsdp_bf16.yaml, dev_local_machine.yaml).
- Parameterize Your Scripts: While Accelerate handles many environment details, your Python training scripts should still be parameterized for dataset paths, model names, learning rates, etc., typically using libraries like argparse or Hydra.
- Environment Variables for Secrets/Dynamic Values: For sensitive information (API keys for loggers) or values that change frequently/dynamically based on the execution environment (e.g., CUDA_VISIBLE_DEVICES or custom cluster-specific variables), consider using environment variables. Accelerate can pick up some settings from environment variables, or you can retrieve them in your script.
- Start Simple, Then Specialize: Begin with accelerate config or simple CLI flags. As your needs grow, transition to dedicated config files for more complex, reproducible setups.
- Validate Your Configurations: Before a long training run, perform a short "sanity check" run to ensure your Accelerate configuration is correctly interpreted and the desired distributed strategy (e.g., DeepSpeed ZeRO Stage 3) is active.
- Logging Integrations: Use Accelerate's log_with feature (or accelerator.init_trackers()) to integrate with tools like TensorBoard, Weights & Biases, or MLflow. This provides valuable insights into your training metrics and can help debug configuration-related performance issues.
5.4. Connecting to AI Gateway and LLM Gateway Concepts
Once a model, especially a large language model (LLM) or a specialized AI model, has been meticulously trained and fine-tuned using Accelerate with optimal configurations, its purpose often shifts from research to deployment. These powerful models are not typically consumed directly by end-user applications; instead, they are exposed through robust api endpoints. This is where the concepts of an AI Gateway and LLM Gateway become critically important.
An AI Gateway serves as a centralized management layer for all your AI service APIs. It acts as an entry point, handling routing, authentication, rate limiting, and monitoring for calls to your deployed models. For LLMs, an LLM Gateway specializes further, often providing features like prompt management, response caching, and cost optimization specific to large language models.
Consider a model trained with Accelerate using FSDP and bf16 precision to achieve maximum scale and efficiency. Once this model is ready, it needs to be made accessible. An AI Gateway or LLM Gateway platform, such as APIPark, plays a pivotal role in this transition. APIPark, for example, is an open-source AI gateway and API management platform designed to help developers and enterprises manage, integrate, and deploy AI and REST services with ease. It allows you to encapsulate your Accelerate-trained model into a secure and performant api endpoint.
With a platform like APIPark, you can:
- Unify API Invocation: Standardize how applications interact with your models, even if the underlying models or frameworks differ.
- Manage Prompts: For LLMs, APIPark allows you to manage and encapsulate prompts, turning complex model interactions into simple api calls.
- Apply Security Policies: Implement authentication, authorization, and rate limiting to protect your valuable AI resources.
- Monitor Performance and Cost: Track api call metrics, identify bottlenecks, and manage the operational costs of serving your models.
- Lifecycle Management: Handle the entire lifecycle of your model APIs, from publication to versioning and eventual deprecation.
By mastering Accelerate's configuration, you're not just training models; you're building the foundation for scalable, high-performance AI services. The robust, well-configured models that emerge from your Accelerate pipelines are then perfectly poised for seamless integration and management within an AI Gateway or LLM Gateway solution, making them ready for real-world applications as reliable and secure api endpoints.
Example Walkthrough: Training a Hugging Face Transformer with Accelerate and a Custom Config
To solidify our understanding, let's walk through an example of fine-tuning a small Hugging Face Transformer model using Accelerate, demonstrating the use of a custom configuration file and how CLI flags can override it.
Goal: Fine-tune a bert-base-uncased model for text classification using a custom configuration file with bf16 precision and gradient accumulation.
6.1. Project Structure
my_accelerate_project/
├── training_config.yaml
└── train_transformer.py
6.2. training_config.yaml
This file will define our default Accelerate settings for this project.
# training_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16 # Default to bf16
num_processes: 2 # Use 2 GPUs if available
gradient_accumulation_steps: 4 # Accumulate gradients over 4 steps
project_dir: ./accelerate_logs
log_with: tensorboard
6.3. train_transformer.py
This script will fine-tune a BERT model on MRPC from GLUE, a small sentence-pair classification dataset.
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    get_scheduler,
)
from accelerate import Accelerator
from accelerate.utils import set_seed
from tqdm.auto import tqdm


def main():
    set_seed(42)
    model_name = "bert-base-uncased"
    num_epochs = 3
    batch_size = 8
    learning_rate = 2e-5

    # 1. Initialize Accelerator
    # Only tracker settings are passed here. Mixed precision, process count, and
    # gradient accumulation come from training_config.yaml or CLI overrides.
    accelerator = Accelerator(log_with="tensorboard", project_dir="./accelerate_logs")
    accelerator.print(f"Using mixed precision: {accelerator.mixed_precision}")
    accelerator.print(f"Gradient accumulation steps: {accelerator.gradient_accumulation_steps}")
    accelerator.print(f"Number of processes: {accelerator.num_processes}")

    # Initialize tracker with relevant metadata
    accelerator.init_trackers(
        project_name="bert_finetuning",
        config={
            "model_name": model_name,
            "learning_rate": learning_rate,
            "epochs": num_epochs,
            "batch_size_per_gpu": batch_size,
            "gradient_accumulation_steps": accelerator.gradient_accumulation_steps,
            "mixed_precision": accelerator.mixed_precision,
            "num_processes": accelerator.num_processes,
        },
    )

    # 2. Load Dataset, Tokenizer, and Model
    # MRPC from GLUE is a quick, small sentence-pair classification task
    raw_datasets = load_dataset("glue", "mrpc")
    num_labels = raw_datasets["train"].features["label"].num_classes
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # Preprocessing function
    def tokenize_function(examples):
        return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
    # The model expects the target column to be named "labels"
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets.set_format("torch")
    train_dataset = tokenized_datasets["train"]
    eval_dataset = tokenized_datasets["validation"]

    # Pad each batch to its longest sequence so examples can be stacked into tensors
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(
        train_dataset, shuffle=True, batch_size=batch_size, collate_fn=data_collator
    )
    eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, collate_fn=data_collator)

    # 3. Optimizer and Scheduler
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    num_training_steps = len(train_dataloader) * num_epochs
    lr_scheduler = get_scheduler(
        name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )

    # 4. Prepare with Accelerator
    # This automatically distributes the model, optimizer, and dataloaders across devices
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # 5. Training Loop
    # After prepare(), len(train_dataloader) is the number of batches seen by this process
    progress_bar = tqdm(
        range(num_epochs * len(train_dataloader)), disable=not accelerator.is_main_process
    )
    global_step = 0
    for epoch in range(num_epochs):
        model.train()
        for batch in train_dataloader:
            with accelerator.accumulate(model):  # Accumulate gradients
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                # Gradient clipping (optional, but good for stability with transformers)
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
            progress_bar.update(1)
            global_step += 1
            accelerator.log({"train_loss": loss.item()}, step=global_step)

        # Evaluation loop (simplified: each process averages over its own shard)
        model.eval()
        total_eval_loss = 0.0
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(**batch)
            total_eval_loss += outputs.loss.item()
        avg_eval_loss = total_eval_loss / len(eval_dataloader)
        accelerator.print(f"Epoch {epoch + 1} - Eval Loss: {avg_eval_loss:.4f}")
        accelerator.log({"eval_loss": avg_eval_loss}, step=global_step)

    # Close the trackers
    accelerator.end_training()
    accelerator.print("Training complete!")

    # Save the final model (only the main process writes to disk)
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        "./output_model", is_main_process=accelerator.is_main_process, save_function=accelerator.save
    )


if __name__ == "__main__":
    main()
6.4. Running the Example
Scenario 1: Using the custom config file
To run with the settings defined in training_config.yaml (bf16, 2 processes, 4 gradient accumulation steps):
cd my_accelerate_project
accelerate launch --config_file training_config.yaml train_transformer.py
Accelerate will load training_config.yaml, configure two GPU processes (if available), enable bf16 mixed precision, and use 4 gradient accumulation steps. You should see the accelerator.print statements reflecting these values.
Scenario 2: Overriding config with CLI flags
Now, let's say you want to quickly test with fp16 precision and 8 gradient accumulation steps, and only use 1 GPU, without changing training_config.yaml.
cd my_accelerate_project
accelerate launch \
--config_file training_config.yaml \
--mixed_precision fp16 \
--gradient_accumulation_steps 8 \
--num_processes 1 \
train_transformer.py
In this case, the CLI flags (--mixed_precision fp16, --gradient_accumulation_steps 8, --num_processes 1) will take precedence over the settings in training_config.yaml. The script will run with fp16, 8 accumulation steps, and on a single GPU (or process), demonstrating the power of CLI overrides. The accelerator.print statements will show these overridden values.
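One consequence of these overrides is worth making explicit: the number of samples contributing to each optimizer step depends on all three settings together. The small calculation below assumes each process draws its own batch of 8 from the DataLoader, which is Accelerate's default behaviour when batches are not split across processes; the helper name is illustrative.

```python
def effective_batch_size(per_device_batch_size, num_processes, gradient_accumulation_steps):
    """Samples that contribute to a single optimizer step."""
    return per_device_batch_size * num_processes * gradient_accumulation_steps


# Scenario 1: training_config.yaml as written (2 processes, 4 accumulation steps)
scenario_1 = effective_batch_size(8, num_processes=2, gradient_accumulation_steps=4)

# Scenario 2: CLI overrides (1 process, 8 accumulation steps)
scenario_2 = effective_batch_size(8, num_processes=1, gradient_accumulation_steps=8)

if __name__ == "__main__":
    print(f"Scenario 1: {scenario_1} samples per optimizer step")
    print(f"Scenario 2: {scenario_2} samples per optimizer step")
```

Both scenarios come out at 64 samples per step, so the override in Scenario 2 halves the hardware while keeping the optimization dynamics comparable, at the cost of longer wall-clock time.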
This example illustrates how configuration files provide a robust baseline, while CLI flags offer the flexibility for on-the-fly adjustments, all orchestrated by Accelerate. This ensures that whether you are developing locally or deploying at scale, your training process remains adaptable and efficient.
Table: Comparison of Accelerate Configuration Methods
| Feature | CLI (accelerate config) | CLI (accelerate launch flags) | Programmatic (Accelerator()) | Configuration Files (.yaml/.json) |
|---|---|---|---|---|
| Ease of Initial Setup | Excellent (interactive wizard) | Good (direct, but manual) | Moderate (requires code changes) | Moderate (requires file creation) |
| Flexibility | Low (sets global default) | High (per-run overrides) | Very High (dynamic, in-code) | High (structured, external) |
| Reproducibility | Low (default can be overwritten) | Moderate (manual command tracking) | High (code is version-controlled) | Excellent (file is version-controlled) |
| Version Control | No (default file typically ignored) | No (commands aren't tracked) | Yes (code is tracked) | Yes (file is tracked) |
| Complexity for Advanced Features (DeepSpeed, FSDP) | Moderate (wizard prompts) | Low (can point to separate config file, but not ideal for full inline config) | High (requires Plugin objects) | Excellent (structured nested sections) |
| Ideal Use Case | First-time setup, quick starts | Ad-hoc experiments, temporary overrides | Dynamic runtime configurations, integration into complex apps | Production pipelines, team collaboration, complex distributed setups |
| Precedence | Lowest (unless no other method used) | High | Highest | Medium |
| Human Readability | N/A (interactive) | Moderate (if few flags) | Moderate (mixed with code) | Excellent (structured, commented) |
Conclusion
Navigating the complexities of distributed training is a formidable challenge in modern AI development, particularly as models grow exponentially in size and computational demands. Hugging Face Accelerate emerges as an invaluable tool, abstracting away much of this complexity and empowering developers to focus on the core machine learning problem. However, the true power and efficiency of Accelerate are unlocked through a deep understanding and masterful application of its diverse configuration mechanisms.
Throughout this guide, we've explored the comprehensive landscape of Accelerate's configuration, from the interactive wizard of accelerate config for initial environment setup, to the flexible command-line arguments of accelerate launch for quick overrides, and finally, to the robust and reproducible nature of programmatic configurations and dedicated configuration files. Each method offers distinct advantages, catering to different stages of development and levels of control. We've delved into advanced scenarios such as mixed precision strategies (fp16 vs. bf16), gradient accumulation for simulating larger batch sizes, and the intricate details of integrating powerful memory-saving techniques like DeepSpeed and FSDP, which are absolutely critical for training massive models, including the latest LLMs.
The ability to precisely configure num_processes, mixed_precision, gradient_accumulation_steps, and complex DeepSpeed or FSDP parameters directly impacts your training throughput, memory utilization, and the ultimate scalability of your AI projects. By understanding the precedence rules and adopting best practices like version-controlling your configuration files, you establish a solid foundation for reproducible research, efficient resource utilization, and streamlined collaboration within teams.
Ultimately, a well-configured Accelerate pipeline doesn't just train models; it trains them optimally, preparing them for the next crucial stage: deployment. Once your AI models are trained and fine-tuned, they transition from research artifacts to production-ready assets. This is where an AI Gateway or LLM Gateway solution becomes useful. Platforms like ApiPark exemplify how these gateways simplify the exposure, management, and security of your trained models as reliable API endpoints. They bridge the gap between model training and integration into applications, handling API invocation, prompt management, and traffic control.
In conclusion, mastering Accelerate's configuration is not merely a technical skill; it is a strategic advantage. It empowers you to build more robust, scalable, and efficient AI training pipelines, paving the way for the development and deployment of AI services that can be securely and effectively delivered to end users through well-managed APIs. Embrace these configuration techniques, and unlock the full potential of distributed AI.
5 FAQs
- What is the recommended way to configure Accelerate for a production environment? For production environments, the recommended approach is to use dedicated configuration files (YAML or JSON) in conjunction with accelerate launch --config_file <path_to_config.yaml>. This method ensures that configurations are structured, human-readable, version-controlled, and easily shareable across teams and different deployment stages (e.g., development, staging, production). It also allows for clear separation of concerns between your training code and infrastructure settings.
- How do I choose between fp16 and bf16 mixed precision in Accelerate? The choice between fp16 and bf16 depends primarily on your GPU hardware and the stability requirements of your model. fp16 (float16) generally offers faster performance on NVIDIA GPUs with Tensor Cores (Volta, Turing, Ampere, Hopper architectures) but has a limited dynamic range, which might require dynamic loss scaling for training stability. bf16 (bfloat16) provides a wider dynamic range, with the same 8 exponent bits as fp32, making it more robust against underflow/overflow issues and often more stable for training Large Language Models. Modern NVIDIA GPUs (Ampere and later) support bf16 efficiently. If training stability is a concern, bf16 is often the preferred choice, assuming your hardware supports it well.
- Can I combine different configuration methods in Accelerate, and what are the precedence rules? Yes, Accelerate allows you to combine different configuration methods, and it follows a clear hierarchy of precedence:
  - Programmatic arguments to Accelerator() constructor: Highest precedence.
  - CLI flags to accelerate launch: Second highest.
  - Custom configuration file (--config_file): Third highest.
  - Default configuration file (default_config.yaml): Fourth highest.
  - Accelerate's internal defaults: Lowest precedence.
  This means explicit settings in your script will override CLI flags, which in turn override settings in your config files, offering fine-grained control at various levels.
- How does Accelerate handle multi-node distributed training, and what configuration is needed? Accelerate simplifies multi-node distributed training by abstracting the communication setup. For multi-node setups, you need to configure the main_process_ip, main_process_port, machine_rank, and num_machines parameters. The main_process_ip points to the IP address of the "rank 0" node, and main_process_port is the communication port. Each node is assigned a unique machine_rank. These can be set via CLI flags (accelerate launch --num_machines 2 --machine_rank 0 --main_process_ip <ip> --main_process_port <port> ...) or within a shared configuration file. Ensure network connectivity and consistent code/data access across all nodes.
- After training a model with Accelerate, how can it be deployed as a secure and manageable API? Once a model is trained using Accelerate, it is often deployed as an API endpoint to be consumed by applications. To make this API secure, scalable, and manageable, an AI Gateway or LLM Gateway is typically used. For example, platforms like ApiPark provide an all-in-one solution for managing, integrating, and deploying AI and REST services. Such a gateway centralizes aspects like authentication, rate limiting, traffic routing, prompt management (for LLMs), monitoring, and end-to-end API lifecycle management, transforming your Accelerate-trained model into a robust, production-ready AI service.
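The multi-node flags described in the FAQ above can be assembled per node with a small helper. The following is a hedged sketch, not part of Accelerate: build_launch_command is a hypothetical helper, and the IP address 10.0.0.1, port 29500, and process count of 8 are placeholder values. It only builds the command line and does not launch anything. The sketch assumes --num_processes is the total across all machines; check this against your Accelerate version.

```python
import shlex


def build_launch_command(machine_rank, num_machines, main_process_ip,
                         main_process_port, num_processes,
                         script="train_transformer.py", config_file=None):
    """Build the `accelerate launch` argument list for one node (hypothetical helper)."""
    if not 0 <= machine_rank < num_machines:
        raise ValueError("machine_rank must be in [0, num_machines)")
    args = ["accelerate", "launch"]
    if config_file is not None:
        # A shared config file can carry the settings common to every node.
        args += ["--config_file", config_file]
    args += [
        "--num_machines", str(num_machines),
        "--machine_rank", str(machine_rank),
        "--main_process_ip", main_process_ip,
        "--main_process_port", str(main_process_port),
        "--num_processes", str(num_processes),
        script,
    ]
    return args


# Placeholder cluster: 2 nodes, rank 0 reachable at 10.0.0.1:29500.
for rank in range(2):
    command = build_launch_command(rank, 2, "10.0.0.1", 29500, 8)
    print(f"node {rank}: {shlex.join(command)}")
```

Every node runs the same command except for --machine_rank, which is why keeping the shared values in one version-controlled config file and passing only the rank on the command line tends to be less error-prone.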