Logz.io is one of Logz.io’s biggest customers. To handle the scale our customers demand, we operate a high-scale, 24/7 environment with close attention to performance and security. As part of that, we ingest large volumes of our own data into the service.
As we continue to add new features and build out our machine learning capabilities, we’ve incorporated new services into the platform. One of them combines several AWS services with our own code to deliver unique capabilities to our customers. There are numerous moving parts to keep running efficiently, and models to keep tuned.
As we moved these new features from the lab to production, we needed to get better at managing our models and pipelines. That shifted us from being purely an R&D organization to also being an operational team, and the team needed the capabilities to run machine learning in production, a practice known as MLOps.
How We Enhance Our MLOps Practice
The data science team did a lot of research to determine which tools we would need to run in production, and we evaluated various commercial solutions. None of them turned out to be a good fit: most were either too expensive or too complex, and they simply didn’t match what the team wanted for production support and monitoring.
We then looked at what we were already doing for observability across the organization and asked whether our own platform could be adapted to this use case. Not surprisingly, it handled the challenge with ease.
Like most data science teams, we use Jupyter notebooks along with Amazon EMR and Amazon SageMaker for the specific portion of our new features that leverages artificial intelligence and machine learning. You can read more about our use of SageMaker on the AWS blog.
We also run some Spark infrastructure for data ingestion between our data stores and the modeling. We leverage DataPlate to help manage much of this work across the different components and services. The first step in observability is debugging problems as we deploy new models daily, and troubleshooting is done with our log analytics platform. In this screenshot, you’ll see how we monitor our anomaly detection (detectors) along with their behavior and exceptions.
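To make failures like these easy to find, the pipelines emit structured log events. Below is a minimal sketch of that idea, assuming field names (detector_id, model_version) and a failing detector that are purely illustrative rather than our production schema; shipping the lines to Logz.io (for example with the Logz.io Python handler or a log shipper) is assumed and not shown.

```python
import json
import logging

logger = logging.getLogger("anomaly-detector")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(level: int, message: str, **fields) -> None:
    """Emit one JSON line per event so each field can be indexed and searched."""
    logger.log(level, json.dumps({"message": message, **fields}))

def run_detector(detector_id: str, model_version: str) -> None:
    """Placeholder for a real detector run; raises to demonstrate the error path."""
    raise ValueError("training window contained no data")

try:
    run_detector(detector_id="cpu-spike", model_version="2024-05-01")
except Exception as exc:
    # One searchable record per failure: which detector, which model, what went wrong.
    log_event(
        logging.ERROR,
        "detector run failed",
        detector_id="cpu-spike",
        model_version="2024-05-01",
        exception=type(exc).__name__,
        detail=str(exc),
    )
```

Because every field is part of the JSON payload, a dashboard or alert can slice failures by detector or model version instead of grepping free text.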
To understand whether our models are predicting outcomes effectively, we collect a lot of metric data and compare it across different runs using our Infrastructure Monitoring (metrics) solution. In this screenshot, we are comparing our 60- and 30-minute predictions with our 15-minute predictions to identify drift in our longer-term prediction models.
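As a rough illustration of that comparison, the sketch below treats the 15-minute predictions as the short-term reference and exposes the mean absolute deviation of the 30- and 60-minute predictions as a gauge metric. The metric and label names are made up for the example, and shipping the metrics to Logz.io (for example via Prometheus remote write) is assumed, not shown.

```python
import numpy as np
from prometheus_client import Gauge, start_http_server

# Drift of each longer horizon relative to the 15-minute predictions.
prediction_drift = Gauge(
    "prediction_drift_vs_15m",
    "Mean absolute difference between a horizon's predictions and the 15-minute predictions",
    ["horizon"],
)

def report_drift(preds_15m: np.ndarray, preds_by_horizon: dict[str, np.ndarray]) -> None:
    for horizon, preds in preds_by_horizon.items():
        drift = float(np.mean(np.abs(preds - preds_15m)))
        prediction_drift.labels(horizon=horizon).set(drift)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for scraping
    rng = np.random.default_rng(0)
    base = rng.normal(size=100)  # stand-in for the 15-minute predictions
    report_drift(base, {
        "30m": base + rng.normal(0.1, 0.05, 100),
        "60m": base + rng.normal(0.3, 0.1, 100),
    })
```

Charting the gauge per horizon over time makes it obvious when the longer-term models start drifting away from the short-term reference.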
As new models are trained, we also need to understand whether the predictions improve from one model to the next. We can compare two models side by side in our dashboards. In this screenshot, we compare the inference results of a complex model against a naive model to make sure we are always performing better than the baseline.
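The sketch below shows one way such a comparison could be computed: score the complex model and a naive baseline on the same window and emit a single record that can be charted and alerted on. The choice of metric (mean absolute error) and the naive baseline (predict the previous value) are assumptions for illustration only.

```python
import json
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error of a set of predictions."""
    return float(np.mean(np.abs(y_true - y_pred)))

def compare_models(y_true, complex_pred, naive_pred) -> dict:
    complex_mae = mae(y_true, complex_pred)
    naive_mae = mae(y_true, naive_pred)
    return {
        "complex_mae": complex_mae,
        "naive_mae": naive_mae,
        "complex_beats_naive": complex_mae < naive_mae,
    }

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    actual = np.cumsum(rng.normal(size=200))       # observed series
    naive = np.roll(actual, 1)                     # "predict the previous value" baseline
    naive[0] = actual[0]
    complex_model = actual + rng.normal(0, 0.2, size=200)  # stand-in for a real model
    print(json.dumps(compare_models(actual, complex_model, naive)))
```

An alert on `complex_beats_naive` being false catches the embarrassing case where a newly trained model does worse than simply repeating the last observation.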
We can also monitor the models themselves and create alerts when there are issues. In this image, we are collecting training statistics and alerting on problematic data from the logs generated by the modeling.
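A minimal sketch of that idea: run a few checks over the training statistics and surface problems as WARNING or ERROR log events that a log-based alert can trigger on. The thresholds and statistic names here are illustrative, not our production checks.

```python
import json
import logging
import math

logger = logging.getLogger("training-stats")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def check_training_stats(stats: dict) -> None:
    """Emit one log event per failed check; a log-based alert fires on these."""
    if stats["rows"] < 1_000:
        logger.warning(json.dumps({"check": "row_count", **stats}))
    if stats["null_ratio"] > 0.05:
        logger.warning(json.dumps({"check": "null_ratio", **stats}))
    if math.isnan(stats["loss"]):  # a NaN loss means the training run is unusable
        logger.error(json.dumps({"check": "nan_loss", **stats}))

# Example run with deliberately problematic statistics.
check_training_stats({"rows": 420, "null_ratio": 0.12, "loss": float("nan")})
```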
More advanced use cases such as inference model monitoring can also be handled within the Logz.io platform. As with other types of monitoring, understanding a model’s inference behavior is critical to learning how the model generates its output. In this view, we can see the statistics of the modeling run and set up monitoring on the data; this inference monitoring catches runtime failures caused by bugs or by data changes the algorithms do not expect.
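As a rough sketch of that kind of check, the example below compares the statistics of a batch of inference outputs against a baseline captured at training time and flags empty runs or unexpected shifts. The baseline format and the tolerance are assumptions for illustration, not our production logic.

```python
import json
import logging
import numpy as np

logger = logging.getLogger("inference-monitor")
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

def monitor_inference(outputs: np.ndarray, baseline: dict, tolerance: float = 3.0) -> None:
    """Log the batch statistics and flag runs that failed or drifted."""
    if outputs.size == 0:
        logger.error(json.dumps({"check": "empty_inference_batch"}))
        return
    stats = {"mean": float(outputs.mean()), "std": float(outputs.std()), "count": int(outputs.size)}
    # Flag the run if the batch mean drifts more than `tolerance` baseline
    # standard deviations away from the training-time mean.
    if abs(stats["mean"] - baseline["mean"]) > tolerance * baseline["std"]:
        logger.warning(json.dumps({"check": "output_drift", **stats}))
    else:
        logger.info(json.dumps({"check": "ok", **stats}))

baseline = {"mean": 0.0, "std": 1.0}  # captured when the model was trained
monitor_inference(np.random.default_rng(2).normal(5.0, 1.0, 500), baseline)
```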
Although we haven’t productized the use of the Logz.io platform for MLOps, this is precisely how the team here is using our platform. If you’re interested in this or would like to learn more, please reach out. We’ll be happy to discuss how we might be able to help you or your team solve challenges in operating artificial intelligence or machine learning technologies in production.