PipeSim: Trace-driven Simulation of Large-Scale AI Operations Platforms

Document pages: 11 pages

Abstract: Operationalizing AI has become a major endeavor in both research and industry. Automated, operationalized pipelines that manage the AI application lifecycle will form a significant part of tomorrow's infrastructure workloads. To optimize operations of production-grade AI workflow platforms we can leverage existing scheduling approaches, yet it is challenging to fine-tune operational strategies that achieve application-specific cost-benefit tradeoffs while catering to the specific domain characteristics of machine learning (ML) models, such as accuracy, robustness, or fairness. We present a trace-driven simulation-based experimentation and analytics environment that allows researchers and engineers to devise and evaluate such operational strategies for large-scale AI workflow systems. Analytics data from a production-grade AI platform developed at IBM are used to build a comprehensive simulation model. Our simulation model describes the interaction between pipelines and system infrastructure, and how pipeline tasks affect different ML model metrics. We implement the model in a standalone, stochastic, discrete event simulator, and provide a toolkit for running experiments. Synthetic traces are made available for ad-hoc exploration as well as statistical analysis of experiments to test and examine pipeline scheduling, cluster resource allocation, and similar operational mechanisms.