OctoML Raises $15M to Make Optimizing ML Models Easier



Machine learning startup OctoML was founded in July 2019 and is based in Seattle, with only 10 employees. It earlier raised around $3.9 million to grow its engineering team and released a software-as-a-service (SaaS) product for machine learning users. Recently, the company raised $15 million to make optimizing machine learning models easier. It aims to optimize machine learning models with the Apache TVM software stack, which was created by members of the OctoML team. TVM is an end-to-end deep learning compiler stack currently used by many top companies, including Amazon, Facebook, and Microsoft.

About OctoML

OctoML is a software company founded in July 2019. Its CEO is Luis Ceze, a professor at the University of Washington. The company focuses mainly on artificial intelligence, machine learning, software, and computers. It is a US-based company headquartered in the Greater Seattle Area, on the West Coast of the United States. It has a small team of 1 to 10 employees but high productivity. OctoML focuses on making people's lives simpler and better. Its latest project is about optimizing machine learning models for AI needs, making ML more efficient and effective to use. OctoML emphasizes unity in all dimensions.

“Stop worrying about operator coverage and debugging frameworks. Enjoy easy, efficient, and secure deployments”.

Apache TVM

OctoML is a startup founded by the team behind the Apache TVM machine learning compiler stack project. The key role of TVM is to optimize machine learning code. For example, the wake-word detection of Amazon's virtual assistant Alexa is powered by Apache TVM.

TVM is an end-to-end deep learning compiler and optimization stack created by members of OctoML. Deep learning solves many problems, and customers' use of deep learning applications keeps increasing, which makes this kind of optimization more valuable.

OctoML and Octomizer

Users upload their model, and the Octomizer optimizes and standardizes it. The optimized model runs faster because it takes better advantage of the hardware it runs on. These more efficient models cost less to run in the cloud, and users can move to cheaper, less powerful hardware and still get almost the same results. For some use cases, Apache TVM already delivers performance gains of up to 80 times. For more advanced users, there is also an option to add the service's API to their CI/CD pipelines.

Fig1: Optimizing ML code using Octomizer

What is Optimization?

In simple terms, optimization means choosing the inputs that result in the best possible outputs, that is, making something the best it can be. This can mean many things, from allocating resources in the best way, to producing a design with the best characteristics, to choosing the control variables that make a system behave as desired.

One way to think about machine learning is as a way to convert raw data into a simplified, cartoon-like representation and then use that representation to make predictions. Optimization is one of the foundations of machine learning, and it matters especially when we build deep neural networks. Optimization is a powerful tool in many applications.

What Does Optimizing Your Model Mean?

  • Transformation of your ML model so that it executes effectively and efficiently.
  • Fast computation
  • Low memory, storage and battery usage.
  • Focusing on faster inference, without retraining.

Why should we Optimize our model?

To optimize our ML models, we mainly use TensorFlow and the TensorFlow Lite toolkit. We optimize models:

  • To benefit users and unlock new use cases.
  • Because models are used for speech recognition, face recognition, object recognition, music recognition, and many more tasks.

There are two major techniques in the toolkit:

Quantization: A general term for techniques that reduce the numeric precision of a model's static parameters (its weights) and execute operations in lower precision.

There are two approaches to Quantization.

  • Post-training: Operates on an already trained model and is built on top of the TensorFlow Lite converter.
  • During-training: Performs additional weight fine-tuning during the training process. It is built on top of tf.keras and requires training.
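To make the idea of reduced precision concrete, here is a minimal sketch of post-training quantization in plain Python. It is not the TensorFlow Lite converter; the function names `quantize_int8` and `dequantize` and the sample weights are assumptions made for illustration, and the scheme shown is a simple affine (scale plus zero point) mapping from floats to 8-bit integers.

```python
def quantize_int8(weights):
    """Map floats to int8 codes using an affine (scale, zero_point) scheme."""
    lo, hi = min(weights), max(weights)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # make sure 0.0 is exactly representable
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 if all weights are zero
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float values from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights of an already trained model.
weights = [-1.2, -0.3, 0.0, 0.45, 2.1]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)

# The price of storing 1 byte instead of 4 per weight is a small rounding error.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(max_error)
```

Each weight now fits in one byte instead of four, and the restored values differ from the originals by at most about half of `scale`.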

Pruning: A method of removing (zeroing out) the parameters that have the least impact on predictions. Pruned models have the same size on disk and the same runtime latency, but they can be compressed much more effectively, so pruning reduces a model's download size.
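The following sketch illustrates why a pruned model has the same raw size yet compresses better. It uses magnitude pruning (zeroing the weights with the smallest absolute value), which is one common pruning criterion and an assumption here rather than something the toolkit is confirmed to do in exactly this way. The helper names, the random weights, and the 50% sparsity level are all hypothetical.

```python
import random
import struct
import zlib


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def sizes(weights):
    """Return (raw size, zlib-compressed size) in bytes for float32 storage."""
    raw = struct.pack(f"{len(weights)}f", *weights)
    return len(raw), len(zlib.compress(raw))


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
pruned = magnitude_prune(weights, sparsity=0.5)

dense_raw, dense_zip = sizes(weights)
pruned_raw, pruned_zip = sizes(pruned)
print("raw bytes:       ", dense_raw, pruned_raw)
print("compressed bytes:", dense_zip, pruned_zip)
```

The raw sizes are identical because the zeros are still stored, but the long runs of zero bytes let the compressor shrink the pruned weights further.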

Optimization of Machine Learning models

Let's use an example to make the idea of optimization clear. If tea is too hot, it burns the tongue, and if it is too cold, it is less enjoyable. So we need a temperature somewhere in between to enjoy the tea the most. Finding that particular temperature is an optimization problem. Two main methods are commonly used for optimizing machine learning models:

  • Exhaustive Search
  • Gradient Descent

Exhaustive Search

When scientists want to find the right temperature for tea, they flip the problem upside down: instead of maximizing enjoyment, they try to minimize the suffering of drinking the tea, which automatically maximizes the enjoyment. In machine learning, the quantity being minimized is called an energy function (also known as a loss or cost function). Trying every candidate temperature and keeping the one with the least suffering is called an exhaustive search. It is straightforward and effective, but it is time-consuming, so when time is limited it is better to look at other methods.
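As a minimal sketch of exhaustive search, the code below tries every whole-degree temperature in a range and keeps the one with the lowest suffering. The `suffering` function, the ideal temperature of 60 °C, and the 40 to 90 °C range are assumptions invented for this illustration; a real tea drinker would have to be measured.

```python
def suffering(temp_c):
    """Hypothetical energy function: suffering grows as we move away from 60 C."""
    return (temp_c - 60.0) ** 2


# Try every candidate temperature from 40 C to 90 C, one degree at a time.
candidates = range(40, 91)
best_temp = min(candidates, key=suffering)
evaluations = len(candidates)

print(best_temp, evaluations)
```

The search is guaranteed to find the best candidate on the grid, but it needs one evaluation per candidate, which becomes expensive as the grid gets finer or the number of inputs grows.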

Gradient Descent

If we picture the suffering as a curve over temperature, the best temperature is at the bottom of that curve. The method for finding it is called gradient descent, which literally means going down the hill. We start with an arbitrary temperature: we make a random guess and see how our tea drinker likes it. Next, we figure out which direction leads down the hill. To do this, we try another nearby temperature, check how much the tea drinker likes it, compare it with our first value, and keep moving in the downhill direction until we reach the bottom. If we take too large a step size, we may overshoot the bottom, and if we take too small a step size, it can take a long time to reach the bottom, so choosing the step size is an important part of this method.
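The sketch below shows gradient descent on the same kind of curve and how the step size changes the outcome. The `suffering` function with its minimum at 60 °C, the starting guess of 85 °C, and the three step sizes are assumptions chosen for illustration. The slope is estimated by comparing two nearby temperatures, as described above.

```python
def suffering(temp_c):
    """Hypothetical energy function with its minimum at 60 C."""
    return (temp_c - 60.0) ** 2


def slope(temp_c, eps=1e-3):
    """Estimate the slope by comparing two nearby temperatures."""
    return (suffering(temp_c + eps) - suffering(temp_c - eps)) / (2 * eps)


def gradient_descent(start, step_size, steps=100):
    """Repeatedly move downhill, against the slope."""
    temp = start
    for _ in range(steps):
        temp -= step_size * slope(temp)
    return temp


good = gradient_descent(85.0, step_size=0.1)                 # settles near the bottom
slow = gradient_descent(85.0, step_size=0.001)               # still far away after 100 steps
overshoot = gradient_descent(85.0, step_size=1.1, steps=20)  # jumps past the bottom and diverges

print(good, slow, overshoot)
```

With a moderate step size the guess lands very close to 60 °C, with a tiny step size it has barely moved after 100 steps, and with an oversized step each update jumps across the valley and ends up farther away than it started.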

One way to reach the bottom in fewer steps is gradient descent with curvature. If the slope is getting steeper, we take a big step; if the slope is getting shallower, we take a small step, because that means the bottom is getting closer. Curvature is simply the slope of the slope. A curve can have more than one dip. The bottom of each dip is a local minimum, and the deepest of all the dips, the lowest of all local minima, is the global minimum. We are mainly interested in the global minimum rather than a local one.

Comparing the two methods: exhaustive search makes few assumptions, is robust, and is expensive to compute, whereas gradient descent makes more assumptions, is sensitive (for example, to the step size and starting point), and is efficient to compute.
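One well-known way to use curvature is Newton's method, which divides the slope by the curvature to choose the step. The sketch below applies it to a hypothetical suffering curve; the curve itself, its hand-written `slope` and `curvature` functions, the starting guess, and the tolerance are all assumptions for illustration, and the article does not say which curvature-based method it has in mind.

```python
def suffering(temp_c):
    """Hypothetical energy function with a single minimum at 60 C."""
    d = temp_c - 60.0
    return d ** 2 + 0.001 * d ** 4


def slope(temp_c):
    """First derivative of suffering."""
    d = temp_c - 60.0
    return 2.0 * d + 0.004 * d ** 3


def curvature(temp_c):
    """Second derivative of suffering: the slope of the slope."""
    d = temp_c - 60.0
    return 2.0 + 0.012 * d ** 2


def newton_descent(start, tol=1e-6, max_steps=50):
    """Step by slope / curvature until the slope is nearly flat."""
    temp = start
    for step in range(max_steps):
        g = slope(temp)
        if abs(g) < tol:
            return temp, step
        temp -= g / curvature(temp)
    return temp, max_steps


best_temp, steps_used = newton_descent(85.0)
print(best_temp, steps_used)
```

Because the step adapts to the shape of the curve, there is no step size to tune by hand, and on this curve the bottom is reached in a handful of steps. Like plain gradient descent, the method only finds the minimum of the dip it starts in.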


While the Octomizer is a good start, OctoML's real goal is to build a more fully featured MLOps platform: the best platform in the world for automating MLOps. The main points about optimization are summarized below:

  • Optimization helps to find the input for the best outputs
  • Usually, optimization requires an optimization algorithm
  • Optimization is also applicable to many disciplines
