This interview explores the remarkable journey of Mahan Salehi, from founding AI startups to becoming a Senior Product Manager at NVIDIA. Initially, Salehi co-founded two AI startups—one automating insurance underwriting with machine learning, the other enhancing mental healthcare with an AI-powered digital assistant for primary care physicians. These ventures provided invaluable technical expertise and deep insights into AI’s business applications and economic fundamentals. Driven by intellectual curiosity and a desire to learn from industry pioneers, Salehi transitioned to NVIDIA, assuming a role akin to a startup CEO. At NVIDIA, the focus is on managing the deployment and scaling of large language models, ensuring efficiency and innovation. This interview covers Salehi’s entrepreneurial journey, the challenges faced in managing AI products, his vision for AI’s future in business and industry, and key advice for aspiring entrepreneurs looking to leverage machine learning for innovative solutions.
Can you walk us through your journey from founding AI startups to becoming a Senior Product Manager at NVIDIA? What motivated these transitions?
I have always been deeply driven towards entrepreneurship.
I co-founded and served as CEO of two AI startups. The first focused on automating underwriting in insurance using machine learning. After several years, we moved towards acquisition.
The second startup focused on healthcare, where we developed an AI-powered digital assistant for primary care physicians to better identify and treat mental illness. It empowered family doctors to feel as if they had a psychiatrist sitting right next to them, helping assess each patient who came in.
Building AI startups from scratch provided invaluable technical expertise while teaching me important lessons about the business applications, limitations, and economic fundamentals of building an AI company.
Despite my passion for building technology startups, at this point in my journey I wanted to take a break and try something different. My intellectual curiosity led me to seek opportunities where I could learn from the world's leading experts who are advancing the frontiers of computer science.
My interests led me to NVIDIA, known for pioneering technologies years ahead of others, where I had the opportunity to learn from pioneers in the field. I recall feeling out of place on my first day at NVIDIA after meeting several new interns and quickly realizing they were all PhDs (when I had previously interned, I was a lowly second-year university student).
I chose to be a technical product manager at NVIDIA because the role mirrored the responsibilities of a CEO of a well-funded startup. It entailed being a true product owner and wearing multiple hats, with a hand in all aspects of the business: engineering design, go-to-market planning, company strategy, legal, and more.
As the product owner of NVIDIA’s inference serving software portfolio, what are the biggest challenges you face in ensuring efficient deployment and scaling of large language models?
Deploying large language models efficiently at scale presents unique challenges due to their massive size, strict performance requirements, need for customization, and security considerations.
1) Massive Model Sizes:
LLMs are unprecedented in their size, containing billions of parameters (up to 10,000 times larger than traditional models).
This requires hardware with sufficient capacity for such models. NVIDIA's latest GPU architectures are designed to support LLMs, with ample on-board memory (up to 80GB), high memory bandwidth, and high-speed interconnects (like NVLink) for fast communication between hardware devices.
At the software layer, frameworks are required that use model parallelism algorithms to partition an LLM across multiple hardware devices, so that different parts of the model can be computed in parallel. The software must handle the division of the model (via pipeline or tensor parallelism), distribute the partitions, and manage the communication and synchronization of computations across devices.
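To make this concrete, here is a minimal sketch of pipeline parallelism in PyTorch, assuming two CUDA devices are available; the layer sizes and two-stage split are illustrative, not drawn from any particular NVIDIA framework:

```python
# Minimal pipeline-parallelism sketch: the model is split into two stages,
# each living on its own GPU; activations move between devices at the boundary.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network on GPU 0, second half on GPU 1.
        self.stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        # Hand the activation to the second device; production frameworks
        # overlap this communication with computation across micro-batches.
        return self.stage2(x.to("cuda:1"))

model = TwoStageModel()
output = model(torch.randn(8, 4096))  # batch of 8 illustrative inputs
```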
2) Performance Requirements:
AI applications require fast response times and high throughput; no one would use a chatbot that takes 10 seconds to reply to each question, for example.
As models grow larger, performance can degrade due to increased compute demands. To mitigate this, NVIDIA's software frameworks include features like in-flight (continuous) batching, KV cache management, quantization, and kernels optimized specifically for LLMs.
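As a toy illustration of one of these optimizations, below is a hedged sketch of post-training int8 weight quantization in PyTorch; the symmetric per-tensor scheme shown here is a deliberate simplification of what production inference frameworks implement:

```python
# Symmetric per-tensor int8 quantization: store int8 weights plus one scale,
# cutting weight memory roughly 4x versus float32 at some accuracy cost.
import torch

def quantize_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4096, 4096)            # illustrative weight matrix
q, s = quantize_int8(w)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```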
3) Customization Challenges:
Foundational models (such as Llama, Mixtral, etc.) are great for generic reasoning. They have been trained on publicly available datasets, so their knowledge is limited to what is public on the internet.
For most business applications, LLMs need to be customized for a specific task. This involves tuning a foundational model on a small proprietary dataset to tailor it for that task. For example, if an enterprise wants to create a customer support chatbot that can recommend the company's products and help troubleshoot issues, it will need to fine-tune a foundational model on its internal database of products as well as its troubleshooting guide.
There are several techniques and algorithms for customizing foundational LLMs for a specific task, including fine-tuning, LoRA (Low-Rank Adaptation) tuning, prompt tuning, and more (a minimal LoRA sketch follows the list below).
However, enterprises face challenges in:
- Identifying and using the optimal tuning algorithm to build a custom LLM
- Writing custom logic to integrate the customized LLM into their deployment infrastructure
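For intuition about how one of these techniques works, here is a minimal sketch of a LoRA layer in PyTorch, assuming a frozen base linear layer; the rank, scaling, and initialization choices are illustrative:

```python
# LoRA sketch: the foundational weights stay frozen; only the small low-rank
# matrices A and B are trained, drastically reducing tunable parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the foundational model's weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank update (B @ A) applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 4096))  # only A and B receive gradients
```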
4) Security Concerns:
Today there are several cloud-hosted API solutions for training and deploying LLMs. However, they can be a non-starter for many enterprises that do not wish to upload sensitive or proprietary data and models due to security, privacy, and compliance risks.
Additionally, many enterprises require control over the software and hardware stack used to deploy their applications. They want to be able to download their models and choose where they are deployed.
To solve all of these challenges, our team at NVIDIA recently released the NVIDIA NIM platform: https://www.nvidia.com/en-us/ai/
It provides enterprises with a set of microservices to easily build and deploy generative AI models anywhere they prefer (on-prem data centers, preferred cloud environments, GPU-accelerated workstations). It gives enterprises self-hosting capabilities, returning control over their AI infrastructure and strategy. At the same time, NVIDIA NIM abstracts away the complexity of LLM deployment, providing ready-to-deploy Docker containers with industry-standard APIs.
A demo video can be seen here: https://www.youtube.com/watch?v=bpOvayHifNQ
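As a hedged sketch of what those industry-standard APIs look like from a client's perspective, the snippet below queries a locally deployed container through an OpenAI-compatible chat endpoint; the port, model name, and payload fields are illustrative assumptions rather than a definitive NIM reference:

```python
# Illustrative client call against a self-hosted LLM microservice exposing an
# OpenAI-compatible REST API (endpoint, port, and model name are assumptions).
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Summarize our troubleshooting guide."}],
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```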
The Triton Inference Server has seen over 3 million downloads. To what do you attribute its success, and how do you envision its future evolution?
Triton Inference Server, a popular open-source platform, has become widely adopted due to its focus on simplifying AI deployment.
Its success can be attributed to two key factors:
1) Features to standardize inference and maximize performance:
- Supports all inference use cases:
  - Real-time online (low-latency requirement)
  - Offline batch (high-throughput requirement)
  - Streaming
  - Ensemble pipelines (multiple models and pre/post-processing chained together)
- Supports any model architecture:
  - All deep learning and machine learning models, including LLMs, Automatic Speech Recognition (ASR), Computer Vision (CV), recommender systems, tree-based models, linear models, etc.
- Maximizes performance and reduces costs via features like:
  - Dynamic batching
  - Concurrent execution of multiple models
  - Tools like Model Analyzer to optimize configuration parameters and maximize performance

2) Ecosystem Integrations and Versatility:

- Triton seamlessly integrates with all major cloud platforms, leading MLOps tools, and Kubernetes environments
- Supports all major frameworks: PyTorch, Python, TensorFlow, TensorRT, ONNX, OpenVINO, vLLM, Rapids FIL (XGBoost, scikit-learn, and more), etc.
- Supports multiple platforms:
  - GPUs, CPUs, and other accelerators
  - Linux, Windows, ARM, and Jetson builds
  - Available as a Docker container and as a shared library
- Can be deployed anywhere:
  - On-prem, in the cloud, or on embedded and edge devices
- Designed to scale:
  - Plugs into Kubernetes environments
  - Provides health and status metrics, critical for monitoring and autoscaling
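To make the client side of this concrete, here is a minimal sketch using the tritonclient Python package against a running Triton server; the model name, input/output tensor names, and shapes are illustrative assumptions that depend on the deployed model's configuration:

```python
# Illustrative Triton inference request over HTTP; assumes a server is running
# on localhost:8000 with a model named "my_model" taking one FP32 input.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))  # output tensor name depends on the model
```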
Triton's future evolution is being built as we speak. The next-generation Triton 3.0 promises to further streamline AI deployment, with features supporting model orchestration, enhanced Kubernetes scaling, and much more!
How do you see the role of generative AI and deep learning evolving in the next five years, particularly in the context of business and industry applications?
Generative AI is poised to become a game-changer for businesses in the next five years. The release of ChatGPT in 2022 ignited a wave of innovation across industries. From automating e-commerce tasks, to drug discovery, to extracting insights from legal documents, LLMs are tackling complex challenges with remarkable efficiency.
I believe we will start to see accelerated commoditization of LLMs in the coming years. The rise of open-source models and user-friendly tools is democratizing access to this powerful technology, allowing businesses of all sizes to leverage its potential.
This is analogous to the evolution of website development. Nowadays, anyone can build a web-hosted application with minimal experience using any of the countless no-code tools out there. We will likely see a similar trend for LLMs.
However, differentiation will stem from how companies tune models on proprietary datasets. The players with the best datasets tailored to specific applications will unlock the best performance.
Looking ahead, we will also start to see an explosion of multi-modal models that combine text, images, audio, and video. These advanced models will enable richer interactions and a deeper understanding of information, leading to a new wave of applications across various sectors.
With your experience in AI startups, what advice would you give to entrepreneurs looking to leverage machine learning for innovative solutions?
If AI models are increasingly becoming more accessible and commoditized, how does one create a competitive moat?
The answer lies in the ability to create a strong “data flywheel”.
This is an automated system with a feedback loop that collects data on how customers are using your product and how well your models are performing. The more data you collect, the more you can iterate on improving model accuracy, leading to a better user experience that attracts more users and generates even more data. It's a cyclical, self-improving process that only gets stronger and more efficient over time.
The key to a successful data flywheel lies in the quality and quantity of your data. The more specialized, proprietary, and high-quality data you can collect, the more accurate and valuable your solution becomes compared to competitors. Employ creative strategies and user incentives to encourage the data collection that fuels your flywheel.
How do you balance innovation with practicality when developing and managing NVIDIA’s suite of applications for large language models?
A key part of my focus is finding a way to strike a critical balance between cutting-edge research and practical application development for our generative AI software platforms. Our success hinges on the collaboration between our advanced research teams, constantly pushing the boundaries of LLM capabilities, and our product team, focused on translating those innovations into user-friendly and commercially viable products.
We achieve this balance by:
User-Centric Design: We build software that abstracts the underlying complexity, providing users with an easy-to-use interface and industry-standard APIs. Our solutions are designed to be “out-of-the-box” – downloadable and deployable in production environments with minimal hassle.
Performance Optimization: Our software is pre-optimized to maximize performance without sacrificing usability.
Cost-Effectiveness: We understand that the biggest model isn't always the best. We advocate for “right-sizing” LLMs: customizing foundational models for specific tasks. This allows us to achieve optimal performance without incurring the unnecessary costs associated with massive, generic models. For instance, we've developed industry-specific, customized models for domains like drug discovery and short-story generation.
In your opinion, what are the key skills and attributes necessary for someone to excel in the field of AI and machine learning today?
There is a lot more involved in building AI applications than just creating a neural network. A successful AI practitioner possesses a strong foundation in:
Technical Expertise: Proficiency in deep learning frameworks (PyTorch, TensorFlow, ONNX, etc.), machine learning frameworks (XGBoost, scikit-learn, etc.), and familiarity with the differences between model architectures.
Data Savvy: Understanding the MLOps lifecycle (data processing, feature engineering, experiment tracking, deployment, monitoring) and the critical role of high-quality data in training effective models is essential. Deep learning models are not magic. They are only as good as the data you feed them.
Problem-Solving Mindset: The ability to identify and analyze problems, determine if AI is the right solution, and then design and implement an effective approach is key.
Communication and Collaboration: Clearly explaining complex AI concepts to both technical and non-technical audiences, as well as collaborating effectively within teams, are essential for success.
Adaptability and Continuous Learning: The field of AI is constantly evolving. The ability to learn new skills and stay updated with the latest advancements is crucial for long-term success.
What are some of the most exciting developments you are currently working on at NVIDIA, especially in relation to generative AI and deep learning?
We just recently announced the release of NVIDIA NIM, a suite of microservices to power generative AI applications across modalities and every industry.
Enterprises can use NIM to run applications for generating text, images and video, speech, and digital humans.
BioNeMo™ NIM can be used for healthcare applications, including surgical planning, digital assistants, drug discovery, and clinical trial optimization.
ACE NIM is used by developers to easily build and operate interactive, lifelike digital humans in applications for customer service, telehealth, education, gaming, and entertainment.
The impact extends beyond specific companies. Leading MLOps partners and global system integrators are embracing NIM, making it easier for enterprises of all sizes to deploy production-ready generative AI solutions.
This technology is already making waves across industries. For example, Foxconn, the world's largest electronics manufacturer, is leveraging NIM to integrate LLMs into its smart manufacturing processes. Amdocs, a leading communications software provider, is using NIM to develop a customer billing LLM that significantly reduces costs and improves response times. Beyond these examples, Lowe's, a major home improvement retailer, is utilizing NIM for various AI use cases, while ServiceNow, a leading enterprise AI platform, is integrating NIM to enable faster and more cost-effective LLM development for its customers. This momentum also extends to Siemens, a global technology leader, which is using NIM to integrate AI into its operations technology and build an on-premises version of its Industrial Copilot for Machine Operators.
How do you envision the impact of AI and automation on the future of work, and what steps should professionals take to prepare for these changes?
As with any new groundbreaking technology, our relationship with work will significantly transform.
Some manual and repetitive tasks will undoubtedly be automated, leading to job displacement in certain sectors. In other areas, we will see the creation of entirely new opportunities.
The most significant shift will likely be the augmentation of existing roles. Human workers will work alongside AI systems to enhance productivity and efficiency. Imagine doctors leveraging AI assistants to handle routine tasks like note-taking and medical history analysis. This frees up valuable time for doctors to focus on the human aspects of their job – building rapport, picking up on subtle patient cues, and providing personalized care. In this way, AI becomes a powerful tool for enhancing human strengths, not replacing them.
To prepare for this future, professionals should invest in developing a well-rounded skill set:
Technical Skills: While deep technical expertise may not be required for every role, a foundational understanding of programming, data engineering, MLOps, and machine learning concepts will be valuable. This knowledge empowers individuals to leverage AI’s strengths and navigate its limitations.
Soft Skills: Critical thinking, creativity, and emotional intelligence are uniquely human strengths that AI struggles to replicate. By honing these skills, professionals can position themselves for success in the evolving workplace.