Scalability and Performance Optimization in ML Training Pipelines

Authors

  • Yasodhara Varma, Vice President at JPMorgan Chase & Co., USA

Keywords:

Machine Learning, Deep Learning, Neural Networks, Model Optimization

Abstract

The rapid expansion of machine learning (ML) has produced ever more sophisticated models that require enormous computational capacity to train. As models grow in scope and complexity, scaling the training process has become a serious obstacle for researchers and industry practitioners alike. Efficient use of compute and other resources is essential to keep training times reasonable, costs under control, and energy consumption sustainable. Among the main difficulties in scaling ML training are resource constraints arising from limited GPU availability, ineffective memory management, and network limits in distributed settings. In addition, poor distribution of computation leads to over-provisioning of hardware, which affects both cost and performance. Addressing these challenges calls for a mix of solutions that jointly optimize resource utilization, workload distribution, and data flow. Several fundamental techniques have proven effective at increasing the efficiency of ML training. Distributed training methods that combine data and model parallelism across many processing nodes make large-scale learning feasible. Hardware acceleration with GPUs, TPUs, and purpose-built AI chips provides still more processing capability. Automated scaling mechanisms adjust resource allocation dynamically to workload demands, reducing inefficiencies. Improving data pipelines through caching, prefetching, and compression ensures that data access does not bottleneck the training process. Real-world case studies demonstrate the importance of these optimizations, documenting the efficiency gains achieved with scalable ML training techniques. By applying these methods, organizations and research labs have significantly reduced training times and costs without sacrificing model accuracy. The case studies also offer insight into practical approaches for managing large-scale machine learning under resource constraints. As artificial intelligence continues to develop, scaling machine learning will depend on continually improving training methods. Efficient training pipelines support next-generation models and enable advances in both AI research and practical applications.
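
To make the distributed data-parallel and data-pipeline ideas above concrete, the following is a minimal sketch (not taken from the paper) of a PyTorch training loop that combines distributed data parallelism with prefetched, pinned-memory data loading. It assumes a multi-GPU node launched with torchrun and an NCCL backend; the linear model, synthetic dataset, and hyperparameters are illustrative placeholders only.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU; torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model and synthetic dataset (illustrative only).
    model = torch.nn.Linear(1024, 10).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(4096, 1024),
                            torch.randint(0, 10, (4096,)))

    # Data parallelism: each worker trains on a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    # Prefetching and pinned memory keep the GPU from stalling on input I/O.
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=2, pin_memory=True, prefetch_factor=2)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x = x.to(local_rank, non_blocking=True)
            y = y.to(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()   # gradients are all-reduced across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> train.py

In this pattern the gradient all-reduce during backward() implements the data-parallel synchronization described in the abstract, while the DataLoader's worker processes, prefetch_factor, and pinned host memory stand in for the data-pipeline optimizations (caching and prefetching) that keep accelerators fed.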

Published

11-07-2023

How to Cite

[1]
Yasodhara Varma, “Scalability and Performance Optimization in ML Training Pipelines”, American J Auton Syst Robot Eng, vol. 3, pp. 116–143, Jul. 2023, Accessed: Dec. 12, 2025. [Online]. Available: https://ajasre.org/index.php/publication/article/view/45