Intro to Computer Vision at the Edge
Why computer vision at the edge is more accessible now than ever and how to get started
Introduction
Before we really get started, I should probably define what we mean when we say “edge computing”. I’m only half joking when I circularly define “edge computing” as “not cloud computing” although that is basically what edge computing is. I generally define it as encompassing everything from your smartphone to your high end home server (as we all have, right?). In this post, I’ll be mostly discussing edge computing in the field of Computer Vision (CV), which is what I do.
I’ve been fortunate to be able to deploy CV models on various devices from Nvidia Jetsons to high end consumer hardware to cloud kubernetes clusters. I particularly enjoy working on the edge because of the large variety of hardware and the particular requirements of tuning applications and model types to fit the specific hardware as well as possible.
Edge computing provides much more of a challenge for CV, in my opinion, than deploying to the cloud. In the cloud you have to worry about setup details like how quickly to scale up when your service comes under heavy load and how far you want to scale, but the actual implementation of the application doesn’t matter nearly as much. If your model takes 300 ms per batch, you can get close to real-time inference by scaling your cluster, which almost feels like cheating. On some edge hardware you have to accept that if you want real-time data, you need to sacrifice accuracy so you can run a smaller model. The type of model makes a difference as well: object detection is generally faster than instance segmentation, but instance segmentation yields more accurate object tracking (when using tracking strategies based on Intersection over Union, or IoU, matching). For this reason I find working on the edge a fun challenge that is only really frustrating when someone insists on having a real-time, highly accurate model with tracking on a small device like Nvidia’s Jetson platform.
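Since IoU comes up again later when we talk about tracking, here’s a minimal sketch of how that overlap score is computed for two axis-aligned boxes. The (x1, y1, x2, y2) box format is just an assumption; adapt it to whatever your detector outputs.

```python
# Minimal Intersection over Union (IoU) between two axis-aligned boxes,
# each given as (x1, y1, x2, y2). Box format is an assumption.
def iou(box_a, box_b):
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```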
In this post, we’ll mostly go over questions that I wish people would ask themselves (or their customers) before starting projects and why they’re important to the task. These questions have all been a pain point for me at one time or another, such as when someone expects a real-time, highly accurate, cheap instance segmentation deployment on a laptop or a small Jetson. My hope is that you, the reader, will come out of this with a better understanding of how to properly spec out your edge deployment, or at least understand the common challenges better.
The important questions
Note: When I refer to real-time here, I’m generally speaking about 30 FPS (roughly a 33 ms budget per frame), but everything still applies when you need higher or lower frame rates, which will be dictated by your use-case or camera.
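As a sanity check on that budget, the per-frame time is just the inverse of the frame rate, and that budget has to cover the whole pipeline (decode, inference, post-processing). A quick sketch:

```python
# Back-of-the-envelope frame budgets for common frame-rate targets.
for fps in (15, 30, 60):
    print(f"{fps} FPS -> {1000 / fps:.1f} ms per frame")
# 15 FPS -> 66.7 ms per frame
# 30 FPS -> 33.3 ms per frame
# 60 FPS -> 16.7 ms per frame
```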
Due to the various levels of hardware and cost, there are several questions that must first be answered when beginning a project on the edge. All of these questions have an impact on the hardware selection and help manage expectations as the project develops. In no particular order:
How far are objects from the camera and how difficult are the objects to see?
What’s the camera’s resolution and how much compression loss is expected?
How close to real-time would you like this to be and how fast do the objects move?
How many objects do you expect in the frame at any given time and how many should be tracked?
How accurate does the output need to be? Is this a critical application where one missed detection spells doom or is there some leeway?
How much are you willing to spend?
Is power consumption a concern?
How much space do you have? Are you working on a drone or in really tight quarters? Do you have a server rack with extra space or a place where a desktop PC could go?
Optional questions:
How many devices do you want to eventually have and how do you plan on managing them?
What’s the network connectivity like?
Let’s dive into these questions a bit more to see why they’re important.
How far are objects from the camera and how difficult are the objects to see? The answer to this question can dictate how big a model you need. The further the objects are from the camera, or the harder they are to see, the more likely you are to need a model that can pick out finer differences.
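One common complement to simply using a bigger model (not something these questions prescribe, just a technique worth knowing) is to run the detector on overlapping tiles of the full-resolution frame so that distant objects keep more of their pixels. A minimal sketch of the tiling step, where the tile size and overlap are assumed values you would tune:

```python
import numpy as np

# Split a frame into overlapping tiles so small, distant objects keep more
# of their original pixels when each tile is fed to the detector.
# Tile size and overlap are assumptions; tune them for your camera and model.
def make_tiles(frame: np.ndarray, tile: int = 640, overlap: int = 64):
    h, w = frame.shape[:2]
    step = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            crop = frame[y:y + tile, x:x + tile]
            tiles.append(((x, y), crop))  # keep the offset to map detections back
    return tiles

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in for a real frame
print(len(make_tiles(frame)))  # number of tiles at 1080p with these settings
```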
What’s the camera’s resolution and how much compression loss is expected? This is related to the previous question about object size. When the images start getting compressed, or the image is small to begin with, the difficulty of making out the objects goes up. This can make smaller models fail to pick up any objects at all, especially in novel scenes. If a model has been retrained on your specific data, this is less of an issue, but small, pre-trained models will often fail when the image loses too much information.
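A cheap way to get a feel for this before committing to a camera or stream setting is to round-trip a sample frame through JPEG compression and compare your model’s output on the clean and degraded versions. A minimal sketch using OpenCV, where the quality value and file name are just examples:

```python
import cv2

# Round-trip a frame through JPEG at a given quality to approximate
# compression loss from a camera or video stream.
def simulate_compression(frame, quality=40):
    ok, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

frame = cv2.imread("sample_frame.jpg")            # any test image you have on hand
degraded = simulate_compression(frame, quality=40)
# Run your detector on both `frame` and `degraded` and compare the outputs.
```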
How close to real-time would you like this to be and how fast do the objects move? This question affects both the model size and the hardware, because the two are correlated here. The closer you get to real-time, the more you need to either beef up your hardware or shrink your model. If you want to run at 30 FPS and the objects move quickly but the hardware is lacking, you’ll need a smaller model. On the other hand, if you have beefy hardware, you can get away with a larger model. This also affects your decision on the type of model: if you are using an RTX 4090, you won’t have to worry about any of this, but using an RTX 2080 will require making concessions. An RTX 2080 can run a medium-sized Mask R-CNN model at somewhere around 15 FPS, but this can change depending on the amount of post-processing you need to do.
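The only reliable way to know where your combination of model and hardware lands is to measure it. Here’s a rough latency benchmark sketch, assuming PyTorch and torchvision’s pre-trained Mask R-CNN; your numbers will vary with input resolution, precision (FP16/INT8), and whatever post-processing you bolt on afterwards.

```python
import time
import torch
import torchvision

# Rough per-frame latency benchmark for torchvision's pre-trained Mask R-CNN.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval().to(device)

dummy = [torch.rand(3, 720, 1280, device=device)]  # one 720p-ish frame

with torch.no_grad():
    for _ in range(5):                  # warm-up iterations
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{elapsed / runs * 1000:.1f} ms per frame "
      f"(~{runs / elapsed:.1f} FPS) on {device}")
```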
How many objects do you expect in the frame at any given time and how many should be tracked? This ends up being mostly a post-processing question, since the actual model latency will likely be mostly unaffected by the number of objects in the frame, but it also helps determine what kind of hardware you need. If you are planning on tracking people in a busy area, such as Shibuya Crossing in Tokyo, Japan, you will need to be concerned with how much compute it takes to process all of that data. A 4-core CPU is going to get bogged down processing each frame in the Shibuya Crossing example, and you’ll want something with much more parallel processing power, such as an Intel Xeon or AMD Threadripper CPU. However, maybe you’re tracking wild mountain lions and only expect one every couple of hours, in which case you’ll be fine with a low-end CPU (or one of Nvidia’s Jetson lineup).
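To make the post-processing cost concrete, here’s a minimal sketch of the step that grows with object count: greedily matching current detections to existing tracks by IoU, using an iou helper like the one sketched in the introduction. Real trackers such as SORT or ByteTrack do considerably more, but even this naive matching is roughly O(tracks x detections) per frame, which is what bogs down a small CPU in a crowded scene.

```python
# Greedy IoU matching between existing tracks and new detections. Boxes are
# (x1, y1, x2, y2) tuples and `iou` is the helper sketched earlier. This is a
# simplification of what trackers like SORT do, kept short for illustration.
def match_detections(track_boxes, detection_boxes, iou_threshold=0.3):
    matches = []
    unmatched = list(range(len(detection_boxes)))
    for t_idx, track_box in enumerate(track_boxes):
        best_iou, best_d = 0.0, None
        for d_idx in unmatched:
            score = iou(track_box, detection_boxes[d_idx])
            if score > best_iou:
                best_iou, best_d = score, d_idx
        if best_d is not None and best_iou >= iou_threshold:
            matches.append((t_idx, best_d))  # keep this track/detection pair
            unmatched.remove(best_d)
    return matches, unmatched                # unmatched detections become new tracks
```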
How accurate does the output need to be? Is this a critical application where one missed detection spells doom or is there some leeway? Generally speaking, the larger the model, the more features can be extracted. This depends on model architecture, but if the accuracy must be >99%, you’re going to need a model that has higher latency. If you also need that model to run in real-time, you’ll need better hardware to keep up. This also extends to your CPU, since having untracked detections is unlikely to help much in critical applications.
How much are you willing to spend? This is one of the most impactful questions on this list, as you might expect. If your budget allows for no more than $500 worth of hardware, you will need to make concessions elsewhere to keep up with real-time. On the other end, if you have an unlimited budget and space on your server rack, by all means, get that $200k AI server with 8 A100s and you can likely run dozens of CV models in real-time simultaneously.
Is power consumption a concern? This ties in with the next question, but if you’re implementing an automated robot for your factory, you will likely be running off of battery or at least in a low-power situation. In that case, you will need to think hard about the model you use and the hardware. You’re probably going to be using a Jetson, but do you need a Mask R-CNN model or can you get away with just bounding box detections? Do you have expensive electricity where you live, in which case an RTX 4090 running 24/7 could be costly? Or is power consumption not a concern for you at all?
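If power does matter, it helps to measure rather than guess. On a desktop NVIDIA card you can sample GPU power draw with nvidia-smi (on Jetsons you would look at tegrastats instead); note that this only reads the GPU, not whole-system power, so treat it as a lower bound.

```python
import subprocess

# Sample the current GPU power draw in watts via nvidia-smi.
def gpu_power_watts():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"]
    )
    return float(out.decode().strip().splitlines()[0])

print(f"GPU power draw: {gpu_power_watts():.1f} W")
```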
How much space do you have? Are you working on a drone or in really tight quarters? Do you have a server rack with extra space or a place where a desktop PC could go? Similar to the previous question, physical space can be a limiting factor for many deployments. You can put a Jetson device on a drone for remote inference but you’re going to be limited in frame rate, especially at lower power levels. If you have extra space where a desktop could go or a server rack, this is less likely to be a concern for you.
How many devices do you want to eventually have and how do you plan on managing them? This optional question can be very important in some cases. If you’re working on a pilot program that your company wants to scale to thousands of locations, do you really want to have to SSH into each computer to do updates? If you’re implementing this for a Ph.D. or a one-off project, you can likely ignore this entirely.
What’s the network connectivity like? Without network connectivity, model updates and continuous data collection can be a lot more difficult unless you’re keeping all data local anyway. If using cloud storage, keep this question in your mind as you work out the details.
Conclusion
Hopefully these questions have provided some clarity into the challenges of Computer Vision at the edge. Computer Vision is a fascinating field with great potential for applications on the edge. Almost everyone has some sort of project that could use CV, home security being one of the big ones, and every company has a use for CV in some form or another. These companies are likely either unaware of the applications, wary about sending private company data to a cloud application, or find the challenge too daunting. Hopefully the rest of the posts in this series will help demystify some of the questions around CV at the edge and give you a better sense of how you can apply it to your next project.
Next up
In the next part, we’ll look into hardware more specifically before eventually moving into the software side.