Train Models Faster with Cutting-Edge Distributed Training Strategies and Techniques
HPE Machine Learning Development Environment Software integrates with DeepSpeed for 3D-parallel (data-, model-, and pipeline-parallel) distributed training to speed up training of large models like GPT-NeoX.
Enable Horovod for easy-to-use data-parallel distributed training.
Provide PyTorch Distributed Data Parallel (DDP) for flexibility in choosing a distributed training strategy.
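At the core of the data-parallel strategies above (Horovod, PyTorch DDP), each worker trains a model replica on its own shard of each batch, then all workers average their gradients so every replica applies the same update. The following pure-Python sketch illustrates only that averaging ("allreduce-mean") step; the function and data are illustrative inventions, not HPE's or PyTorch's API, and real frameworks perform this with NCCL or MPI collectives.

```python
# Toy sketch of the gradient-averaging step at the heart of data-parallel
# training. Names and structure are illustrative only.

def allreduce_mean(worker_grads):
    """Average per-parameter gradients computed independently on each worker."""
    num_workers = len(worker_grads)
    num_params = len(worker_grads[0])
    return [
        sum(grads[p] for grads in worker_grads) / num_workers
        for p in range(num_params)
    ]

# Each worker saw a different shard of the batch, so its gradients differ:
worker_grads = [
    [0.2, -1.0, 0.5],   # worker 0
    [0.4, -0.6, 0.1],   # worker 1
]
avg = allreduce_mean(worker_grads)  # every worker applies this same update
```

Because all workers apply identical averaged updates, the replicas stay in sync, which is what makes data parallelism a near drop-in way to scale batch throughput.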
Find Better Model Configurations Efficiently with Cutting-Edge Hyperparameter Tuning Techniques
HPE Machine Learning Development Environment Software features a production-grade implementation of the Asynchronous Successive Halving Algorithm (ASHA), from the algorithm's creators, for hyperparameter search and optimization.
Define your own logic to coordinate across multiple trials within an experiment.
Implement your own custom hyperparameter search algorithms, ensembling, active learning, neural architecture search, and reinforcement learning.
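ASHA builds on successive halving: train many configurations briefly, then repeatedly keep only the top fraction and grant the survivors more training budget. Here is a toy synchronous sketch of that core idea; the `evaluate` objective and all names are made up for illustration (ASHA itself promotes trials asynchronously rather than in lock-step rounds).

```python
def successive_halving(configs, evaluate, budget=1, eta=2, rounds=3):
    """Keep the top 1/eta of configs each round, growing the budget by eta."""
    survivors = list(configs)
    for _ in range(rounds):
        # Score every surviving config at the current training budget.
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the best 1/eta fraction (at least one survivor).
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

# Hypothetical objective: score improves with budget and peaks at lr = 0.1.
def evaluate(config, budget):
    lr = config["lr"]
    return (1 - (lr - 0.1) ** 2) * (budget / (budget + 1))

configs = [{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5, 1.0, 3.0)]
best = successive_halving(configs, evaluate)  # -> {"lr": 0.1}
```

The early rounds are cheap, so most of the total budget is spent only on configurations that already look promising; this is what lets ASHA explore large search spaces efficiently.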
Easily Share GPUs and Accelerators with ML Workflow-Aware Smart Scheduling and Resource Management
With HPE Machine Learning Development Environment Software, you can easily share your on-premises or cloud GPUs and accelerators with your ML development and operations teams.
Run ML and HPC jobs alongside each other on the same cluster, with support for workload managers like Slurm or PBS, and secure container runtimes like Singularity/Apptainer, Podman, or NVIDIA® Enroot.
Seamlessly use spot or preemptible instances to manage cloud costs.
Train models on NVIDIA or AMD GPUs without any code changes, with foundational support for accelerator heterogeneity.
Get a consistent user experience for deployments from your laptop to a supercomputer, and everything in between, including bare metal, virtual machines (including cloud and on-premises IaaS solutions), Kubernetes, Slurm, and PBS.
Track and Reproduce your Work with Integrated Experiment Tracking and the Model Registry
HPE Machine Learning Development Environment Software provides built-in experiment tracking that covers model code, configuration, hyperparameters, metrics, and checkpoints.
Version, annotate, and organize trained models so that MLOps teams can effectively collaborate with model developers to manage your models' lifecycle.
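The tracked artifacts listed above (code version, configuration, hyperparameters, metrics, and checkpoints) amount to a structured record per trial. The sketch below shows one hypothetical minimal shape for such a record; it is not HPE's actual schema, and all field names and values are invented for illustration.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical minimal record of one trial -- illustrating the kinds of
# metadata an experiment tracker stores, not HPE's actual schema.
@dataclass
class TrialRecord:
    experiment_id: int
    code_version: str                  # e.g. a git commit hash
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)      # metric name -> history
    checkpoints: list = field(default_factory=list)  # checkpoint paths/URIs

trial = TrialRecord(
    experiment_id=42,
    code_version="3f9c2ab",            # made-up commit hash
    hyperparameters={"lr": 0.01, "batch_size": 64},
)
trial.metrics["val_accuracy"] = [0.71, 0.79, 0.83]
trial.checkpoints.append("checkpoints/epoch_3")

# Serializing the record is what makes a trial shareable and reproducible.
blob = json.dumps(asdict(trial), indent=2)
```

Because every trial carries its code version and full configuration, anyone on the team can reconstruct how a registered model was produced, which is the collaboration the model registry enables.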