The possibilities of Gen AI are almost limitless at the moment and many of our customers are starting to look into it. We want to help them through MLOps.
Let’s discuss integrating LLMs with AWS: particularly what challenges it poses to us as a company that primarily focuses on DevOps & managing cloud infrastructure, but also to you as someone who’s interested in integrating LLMs into your own systems.
Just like you [unless you’re someone who’s been involved with Gen AI for many years], we’re also new to this and we’re testing the different paths one can take. But to even start illustrating the concept of transitioning into something like Gen AI, it’s important to explain how we’ve worked technically up to this point.
What is LARA?
Over the years, we’ve built something we’re proud to call our cloud-native platform LARA. It is a flexible set of ready-made, battle-tested and proven building blocks for the rapid setup of well-architected infrastructure. We’re able to apply LARA to various use cases, whether those are clients in gaming, fintech, streaming, IoT or SaaS.
With this approach we’re able to tackle problems in many domains, while maintaining similar patterns to address these issues. Besides LARA, our bread and butter are Kubernetes workloads, application containerization & orchestration, monitoring & observability, CI/CD, security and data analytics infrastructures.
Typical LARA Setup
A typical LARA setup focuses on addressing pain points and long-lasting issues. It’s important to mention, [especially when cloud is such a big topic and everyone is moving to it], that our goal is to move a company to the cloud and solve their issues to enable their growth, not just move them to the cloud because it’s fashionable nowadays.
We look at the cloud as a tool, a means to address issues, whether those are problems with scaling or performance, outages or high costs. That means we can’t just lift the infrastructure into the cloud. We mainly focus on re-architecting existing solutions so that we can truly solve the underlying issues, and only then move them to the cloud. We believe this is the proper way to do it, because many of those issues are caused by something hiding deeper in the existing infrastructure.
Common Generative AI Use Cases
The possibilities of what we can do with Generative AI are almost limitless at the moment. Although many of our customers haven’t used AI before, they are starting to look into it now. In particular, we’ve noticed these topics among our customers, and we expect to encounter even more:
- text summarization [taking large amounts of text & extracting the value; see the sketch after this list]
- content generation [the infamous ChatGPT, DALL-E …]
- improving customer experience [e.g. having a chatbot that handles your customer support]
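To make the first item a bit more concrete, here is a minimal sketch of text summarization through AWS Bedrock using boto3. It assumes you have Bedrock model access enabled in your account; the region, model ID and prompt wording are only example choices, not a recommendation.

```python
import boto3

# Assumption: Bedrock model access is enabled in this region and the
# Claude Haiku model ID below is available to your account.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def summarize(text: str) -> str:
    """Ask a Bedrock-hosted model for a short summary of `text`."""
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model choice
        messages=[{
            "role": "user",
            "content": [{"text": f"Summarize the following text in 3 sentences:\n\n{text}"}],
        }],
        inferenceConfig={"maxTokens": 300, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

print(summarize("<your long document here>"))
```

The same pattern covers the other two use cases as well; essentially only the prompt and the chosen model change.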
Therefore, we need to start thinking about how we can help our customers with these topics, in a similar way to how we help them with DevOps & cloud. AI is not something we can avoid. It’s inevitable, and every one of us will eventually have to work with it or have some knowledge about it one way or another.
Journey to Production
When it comes to the world of cloud infrastructures & AI, we know there are many challenges that we as a DevOps company & you as a DevOps engineer/Software Engineer/Platform Engineer [you name it] will most likely face. Our customers are beginning to approach AI in multiple ways. For some companies it may be just a small part of their product, building an MVP or just playing around with LLMs.
Using Gen AI & LLMs at this level is pretty easy, because you can readily incorporate a SaaS solution like OpenAI, or a similar service like AWS Bedrock. You can just as easily start playing with LLMs on your own machine and have e.g. a local chatbot running on your computer. It’s a great start; it allows you to fail fast and learn from that.
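For the “local chatbot on your own machine” path, a minimal sketch could look like this, assuming the Hugging Face transformers library and a small open model (the model name below is only an example; pick whatever fits your hardware):

```python
# pip install transformers torch
from transformers import pipeline

# Assumption: a small open chat model that runs on a laptop; swap in any other.
chat = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Generate a reply locally, no external API involved.
out = chat("Explain what MLOps is in one short paragraph.",
           max_new_tokens=200, do_sample=True)
print(out[0]["generated_text"])
```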
Some companies will be able to stay this way indefinitely, meaning the amount of time they want to put into research and Gen AI is enough for them at this level. Especially if AI is just a byproduct for them, or if they are only offloading some small part of their business to these applications.
Other companies, at some point, might reach a state where they need to scale more, and there are various reasons to turn to AI for that. So let’s discuss the challenges they might face once they reach this point, because let’s face it, scaling LLMs for production is hard.
As mentioned before, our typical setup is concentrated around building infrastructures, and by doing that we’re trying to address unwanted issues. But speaking for Labyrinth Labs and myself as an individual, we have to come up with ways to address the challenges that come with AI. Making this step, so that we can help our customers with AI, is what MLOps is about.
Machine Learning Operations
Or MLOps for short. You might have come across this term, but I think the best way to explain it is that MLOps extends the principles of DevOps to the world of machine learning. Especially for us cloud engineers, MLOps aims to improve and automate the lifecycle of machine learning models: gathering the data, preparing the data, model training, reusing, gathering feedback, evaluating, retraining, deploying, monitoring and more.
In other words, MLOps is a philosophy that extends and combines the knowledge of Machine Learning, DevOps & Data Engineering. Picture it as three overlapping circles: currently we occupy one of them, and we’re never going to fully comprehend all of the areas. And we’re not really aiming to. But throughout the process we’re trying to figure out and understand the intersection of these three circles, and how we can help our customers with that intersection, MLOps.
MLOps has many phases. Let’s focus on these four, which let us explain how to tackle some of the challenges:
- Model Development
- Training
- Deployment
- Monitoring & Management
When we look at these four areas and think about how we as DevOps engineers can help our customers address them, we’re not looking at Model Development per se [because we’re not machine learning engineers]. What we can help with are the other three areas: training, deployment, and monitoring & management. This is where we can offer our ability to make these areas more productive and help you decide what to focus on.
But, how?
In general, we need to incorporate machine learning workflows into our existing DevOps culture and existing cloud infrastructures. That can take various forms.
We can look at DevOps and MLOps workflows as made up of similar individual parts, because I think there are many parallels. One example is helping developers set up sandbox and training environments in the cloud, and giving them tools to spin those up quickly so they can try things, fail fast, learn from it, and get feedback and validation.
Automating ML Lifecycle
This is similar to how we would address it in software development. When our customers are developing new features, we give them feature environments as an option to test them. We help them build CI/CD pipelines which automate building, testing, packaging and deploying the application. Similarly, we can do this with MLOps in the world of AI.
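As a hedged illustration of one such pipeline step, the sketch below starts a SageMaker training job from code, which could just as well be called from a CI/CD job. The role ARN, container image and S3 paths are placeholders you would replace with your own.

```python
import time
import boto3

sm = boto3.client("sagemaker")

job_name = f"demo-training-{int(time.time())}"  # unique name per pipeline run

# Assumption: the role, image URI and buckets below are placeholders for
# resources in your own AWS account.
sm.create_training_job(
    TrainingJobName=job_name,
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    AlgorithmSpecification={
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training:latest",
        "TrainingInputMode": "File",
    },
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/datasets/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/models/"},
    ResourceConfig={"InstanceType": "ml.g5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)

# Block until the job finishes so the pipeline can fail fast on errors.
sm.get_waiter("training_job_completed_or_stopped").wait(TrainingJobName=job_name)
print(sm.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"])
```

Wrapping this in a pipeline gives developers the same try / fail fast / get feedback loop they already know from feature environments.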
The goal is to extract some of the complexities out of the topic [and frankly, with AI the complexities are quite huge], and help customers focus on the important parts of their business.
Data Preparation and Model Training
Another area we can focus on is data preparation and model training. We can help people prepare their data, and store and ingest it. We help them set up solutions in AWS, because AWS has many services that can help with this as well: data analytics services, managed database solutions, ETL solutions and more. Our goal is to give you the tools to do what you need to do.
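As one possible example of the data analytics piece, the sketch below runs an Athena query and writes the result set to S3, where it can later feed a training job. The database, table and bucket names are made up.

```python
import time
import boto3

athena = boto3.client("athena")

# Assumption: the "analytics" database, "raw_events" table and the output
# bucket are placeholders for resources in your own data lake.
execution = athena.start_query_execution(
    QueryString="""
        SELECT user_id, event_type, event_time
        FROM raw_events
        WHERE event_date >= date '2024-01-01'
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes; results land in the S3 output location.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
print(state)
```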
Model Deployment and Inference
We can also help with deploying or scaling the models, monitoring them and fine-tuning their performance. The point I’m trying to illustrate here is that DevOps is complex. There’s a huge amount of tooling and services. If you’ve ever tried to do something with cloud or Kubernetes, you know the complexities can be huge and it can get hard to navigate.
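To give a flavour of the deployment side, here is a hedged sketch of exposing an already trained model as a real-time SageMaker endpoint and invoking it. The model artifact, container image, role and endpoint names are placeholders, and a real setup would add autoscaling and monitoring on top.

```python
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Assumption: placeholder names, image URI, role and model artifact path.
sm.create_model(
    ModelName="demo-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference:latest",
        "ModelDataUrl": "s3://my-bucket/models/model.tar.gz",
    },
)

sm.create_endpoint_config(
    EndpointConfigName="demo-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "demo-model",
        "InstanceType": "ml.g5.xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(EndpointName="demo-endpoint", EndpointConfigName="demo-endpoint-config")
sm.get_waiter("endpoint_in_service").wait(EndpointName="demo-endpoint")

# Invoke the endpoint like any other HTTP-backed service.
response = runtime.invoke_endpoint(
    EndpointName="demo-endpoint",
    ContentType="application/json",
    Body=b'{"inputs": "Summarize: LLM inference at scale is ..."}',
)
print(response["Body"].read().decode())
```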
Handling MLOps, especially at scale, can be even more difficult given the current set of AI tools and the skill sets available to people. There aren’t many best practices you can rely on yet. There are many frameworks, but a lot of them only tackle a single problem. So, for now, it’s quite a challenge to navigate through it alone.
One way to offload these complexities for our clients is AWS managed solutions; another is to offload it to us, the same way we can help you e.g. manage your Kubernetes clusters. The main goal is that the client can focus on what’s really important for their business. That’s the challenge we’re facing today, and you might be as well.
Self-hosting Models
What would be the reason to self-host a model or create your own? It’s quite easy to get started with AWS Bedrock or other solutions from AWS, and the learning curve is much gentler than with self-hosting something. As I mentioned before, using SaaS solutions like OpenAI or Bedrock may be a good way to do it for companies whose core business model isn’t focused on this. But there are instances where creating your own has its place.
What are some reasons for self-hosting? It’s usually when companies outgrow these simple solutions and run into problems like these:
- Potentially high costs: Things can get out of hand, especially when you use managed solutions constantly, meaning you have high utilization all the time. This can generate a lot of cost. At some point, although self-hosting isn’t a cheap solution, it can be the cheaper option.
- Limited fine-tuning options: Depending on the service, some allow you to fine-tune and some don’t.
- Compliance: Some of the vendors or tools you are using may be an issue if you need to achieve a certain level of compliance. This can also be a valid reason to consider self-hosting.
- Potential licensing / data safety issues: Some companies prohibit the usage of GitHub Copilot or similar tools because they are afraid of leaking intellectual property.
- Custom models: Another reason might be that you simply have a valid use case for developing and hosting an entirely custom model.
Training & Inference
Before we begin with this topic, here’s a quick description of two terms that come up quite frequently when we’re talking about AI.
- Training: The process in which a machine learning (ML) algorithm is fed enough training data to learn from, depending on the purpose of your training.
- Inference: The process in which a trained model, such as an LLM, generates output based on input data. Basically, you reach a conclusion based on evidence and reasoning. [A tiny sketch of both steps follows this list.]
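To make the distinction concrete, here is a tiny sketch using scikit-learn rather than an LLM: the `fit` call is the training phase, the `predict` call is inference. The data is made up.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up data standing in for your real training set.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Training: the algorithm is fed data and learns its parameters.
model = LogisticRegression(max_iter=1_000).fit(X, y)

# Inference: the trained model produces output for new, unseen input.
X_new, _ = make_classification(n_samples=5, n_features=20, random_state=1)
print(model.predict(X_new))
```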
LLMs in AWS
We see some challenges that we can address when building infrastructure for these models. Similar to the DevOps world or application development, one of the goals is to reduce complexity. That’s something the cloud can be a good tool for, if you do it right. If you don’t do it right, it can even increase the complexity and make things worse. But if you do it right, the tools the cloud offers can actually help you move much faster and stay pretty flexible.
Challenges
AWS offers many solutions for Gen AI and AI/ML training, like AWS Bedrock and AWS SageMaker. And if you outgrow those, you can use tools like EKS, ECS or others. It doesn’t really matter which tools you’re using; at some point you may run into issues with performance, because running these models requires a lot of resources, and those resources are usually pricey and large. You can tune performance to get better latency. Another complex topic with LLMs is that you’ll most likely run into issues with scalability. Other things to look out for are certainly cost efficiency and observability.
All of these challenges map onto familiar topics in application or infrastructure development. They are very similar to what we already know, but they bring new aspects in the world of AI.
Hardware Needs
One of the specific challenges you face when running your own LLMs is hardware needs. This can sometimes be true when you’re running traditional application services as well, but not really at this scale. Right now, you don’t have to care so much about which CPU you’re using to host your application (Intel, AMD…).
But in the case of AI it’s different. You have a multitude of hardware options you can use when you’re in the cloud. And, interestingly, hardware can perform differently with different tasks, types of models or sizes of models.
You’re in a position where you want to experiment a little: benchmark the models, play around and figure out what works best for them, at what price, and determine what works for you. It’s also important to mention that hardware needs differ between training and inference. These tasks are very different in what they perform, what they require, and how you can scale them.
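Benchmarking doesn’t have to start out complicated. The sketch below assumes PyTorch and a Hugging Face BERT checkpoint, and simply times the forward pass at a few sequence lengths on whatever device is available; you would run the same script on the candidate instance types and compare the numbers.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # Inferentia/Trainium would go through their own SDK
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device).eval()

for seq_len in (128, 256, 512):
    # Fixed-size batch padded to the target sequence length.
    batch = tokenizer(["hello world"] * 8, padding="max_length",
                      max_length=seq_len, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        for _ in range(3):                      # warm-up iterations
            model(**batch)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(20):
            model(**batch)
        if device == "cuda":
            torch.cuda.synchronize()
        latency_ms = (time.perf_counter() - start) / 20 * 1000
    print(f"seq_len={seq_len}: {latency_ms:.1f} ms per batch on {device}")
```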
At the moment, we have hardware dedicated to these tasks. AWS offers EC2 instances with different GPUs dedicated to them. What’s even more interesting today is that you have specialized chips for these tasks: AWS offers its Inferentia accelerators, which are optimized for inference, and its own Trainium chips, which are optimized for training. I think it’s quite interesting that we live in an era where we’re designing and building dedicated hardware for different tasks in AI/ML. That’s something that wasn’t really common a couple of years ago.
The challenge we’re facing as a DevOps company is how to integrate this specialized hardware into our clients’ existing infrastructures; basically, how you can run it within EKS or SageMaker. At the same time, we’re trying to figure out what the most cost-optimal solution is. You need to benchmark what works best, because different models work best with different hardware.
I also wanted to showcase this chart; it shows BERT model latencies across different hardware. The specialized chips are currently able to outperform some GPUs, which wasn’t the case earlier. But note how the picture changes with the sequence length: if you look at the last column, the GPU’s advantage is quite significant (compared to the other columns, where GPU performance is pretty similar to the rest). It’s not just about picking one specific instance. You have to consider the model you’re using, the sequence length, the parameters and so on. You have to benchmark, test it out and implement it in your solution.
Scaling
Scaling is hard on its own even with traditional infrastructures: if you want to scale efficiently in terms of performance and costs, you have to know how to do it. Scale on different metrics, use spot instances to make it more cost-optimal; there are various ways to tackle it. But scaling training and inference is something different again, something you have to figure out how to do.
When you want to scale training, you run into distributed computing. There are frameworks you can use, like MPI and others, but it’s truly a different concept from just scaling vertically or horizontally the way you’d normally do it. You have to break the problem up into smaller chunks, distribute them, compute the bits and pieces, and then somehow connect it all together again.
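As a hedged illustration of the “break it up and distribute it” idea, here is a minimal data-parallel training sketch using PyTorch’s DistributedDataParallel, launched with torchrun. Each process works on its own batch and gradients are averaged across processes; the model and data are placeholders.

```python
# Launch with e.g.: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")              # one process per GPU
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    # Placeholder model; a real job would load your actual architecture.
    model = DDP(torch.nn.Linear(1024, 10).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):
        # Random batch standing in for this rank's shard of the data.
        x = torch.randn(32, 1024, device=rank)
        y = torch.randint(0, 10, (32,), device=rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                          # gradients are all-reduced across ranks
        optimizer.step()
        if rank == 0 and step % 20 == 0:
            print(f"step {step}: loss={loss.item():.3f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```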
Inference scaling is more similar to traditional scaling, meaning you use larger instances or more of them. You can use traditional load balancing and issue different tasks to different models. But then there’s another layer of scaling as well: models scale with the number of tokens and parameters. You can play around with that; if you have a model that’s too big, you could use a smaller, fine-tuned model for the same task. You’re also able to combine different models and batch requests, and you can improve scaling and performance this way as well.
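For the “more instances behind a load balancer” style of inference scaling, one option on AWS is attaching an autoscaling policy to a SageMaker endpoint variant. The sketch below is an example of target tracking on invocations per instance; the endpoint and variant names are placeholders.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Assumption: "demo-endpoint" / "AllTraffic" are placeholders for your endpoint.
resource_id = "endpoint/demo-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,  # target average invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```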
AI and LLMs bring unique challenges to topics we’re only starting to get familiar with, and they bring a lot of complexity.
Conclusion
There are many challenges we’re going to face, and that applies to us as a company as well as to you as individuals and your companies, because this is a very fast-paced topic that changes rapidly, and there isn’t really a set of best practices yet with proper ways to do it.
We also need to figure out how to integrate MLOps into DevOps, help other companies with it, and enable their growth.