Computer Vision at the Edge: Hardware Part 2
A brief overview of the multitude of non-SoC graphics options
In the first part of the hardware discussion, I talked about System-on-a-Chip (SoC) systems such as the NVIDIA Jetson lineup and Apple Silicon. Those are useful for very specific niche cases, but what if you want something that doesn’t require heavy optimization or sacrifices in accuracy? What if you want to infer on multiple camera feeds as fast as possible while keeping that data off the cloud? In this post, I’ll cover the options for non-SoC hardware such as laptops and consumer and enterprise discrete Graphics Processing Units (GPUs).
Quick clarification for those who are unfamiliar: a discrete (or dedicated) GPU is a GPU that is separate from the Central Processing Unit (CPU). Some CPUs have their own integrated GPU (often abbreviated iGPU). iGPUs are typically only powerful enough for light usage like web browsing, Excel, etc., and are not recommended for gaming or ML. In this post, I’ll be talking about discrete GPUs.
We should also discuss GPU processing and video memory before starting because these are key topics that play a role in how well a GPU can be applied to a task.
Let’s start with video memory. Just like your CPU, a GPU has memory (not storage). For a CPU, this is generally a separate module that you would have installed yourself if you built your own PC; on a GPU, the memory is built onto the card and can’t be changed without changing your GPU. This is referred to as VRAM or video memory. Different ML applications require different amounts of video memory. In computer vision, memory can be a factor if you use large batch sizes, but large batch sizes can create unnecessary delays in real-time situations. In other domains, video memory is extremely important. Large Language Models (LLMs) are very dependent on memory. Meta’s open-source LLaMA 7B, for instance, requires roughly 28 GB at full precision! There are no consumer-grade GPUs that can support this model at full precision, and that’s the smallest model in the family. Fortunately, this series is titled “Computer Vision at the Edge”, so we’ll work under the assumption that any consumer GPU will have enough video memory.
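To put a number on that, here’s a quick back-of-the-envelope calculation. The helper below is purely illustrative and only counts the weights; activations, batch size, and framework overhead eat additional memory, and the ~44M parameter count for Mask R-CNN is approximate.

```python
# Back-of-the-envelope VRAM estimate for model weights alone.
# Full precision (FP32) stores each parameter in 4 bytes; FP16 halves that.
def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    return num_params * bytes_per_param / 1e9

print(weight_memory_gb(7e9))      # LLaMA 7B at FP32: ~28 GB
print(weight_memory_gb(7e9, 2))   # LLaMA 7B at FP16: ~14 GB
print(weight_memory_gb(44e6))     # a ~44M-parameter Mask R-CNN at FP32: well under 1 GB
```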
The other key part is processing. I don’t want to go into too much detail about the inner workings of a GPU, but the key components we use to estimate performance (on NVIDIA hardware) are the number of CUDA cores and the number of Tensor cores. We’ll go into more detail about how these cores work in a later post, but the general idea is that the GPU does its actual processing on them. Tensor cores are the more recent addition and are specifically designed to be more efficient for ML workloads, and there are specific methods for ensuring that you’re actually using them. If you’re interested in reading more before my post about it, you can read this article explaining the difference between the two.
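As a taste of what those “specific methods” look like, here’s a minimal PyTorch sketch (the ResNet-50 model and random input are just placeholders) that opts into TF32 and FP16, the reduced-precision formats that route matrix math onto Tensor cores:

```python
import torch
import torchvision

# TF32 matmuls run on Tensor cores on Ampere and newer NVIDIA GPUs.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = torchvision.models.resnet50().cuda().eval()
image = torch.randn(1, 3, 224, 224, device="cuda")

# Autocast runs eligible ops in FP16, which also targets Tensor cores.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(image)

print(output.shape)  # torch.Size([1, 1000])
```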
Now that we’ve gotten that out of the way, let’s move into our actual hardware discussion.
Consumer hardware
I’m sure you’ll be glad to know that I don’t plan on talking about every piece of consumer hardware in existence. The plan here is to mainly cover laptops and desktops. Before we do that, however, I should talk about NVIDIA, AMD, and Intel. These are the companies that come up when talking about computer hardware as they are the designers and manufacturers of the hardware (I’m ignoring 3rd parties such as EVGA for the sake of this discussion).
Most people are familiar with NVIDIA. I mentioned them above when talking about the Jetsons, and they are impossible to avoid when talking about ML. NVIDIA is the global leader in the ML hardware scene thanks to their GPUs and CUDA. They have such a stronghold over the market that you’ll see people completely ignore the manufacturer name on their GPUs because everyone already knows who they’re talking about (e.g. “trained on 8 V100s”, “low latency on a T4”, etc.). NVIDIA is essentially the only player in the ML game due to their large lead in chip design. I personally have a 2080 Super that I use for testing and it works flawlessly. I used to daily drive a VM with a T4 and have tested on 4080s. ML performance has increased so quickly that a 4080 can run multiple models simultaneously and still keep up with real-time! You generally won’t go wrong picking one of their cards for ML.
To give you an example of the performance of a model on my 2080 Super, I ran my own benchmarking script. Each run uses a different batch size but the same single image, and warms up for 5 iterations before the timed portion begins. The benchmark uses my custom wrapper around the PyTorch Mask R-CNN implementation. One thing to note about this test: my 2080 Super ran out of video memory at around a batch size of 20.
These tests were run with a relatively small image (540x360) and could be optimized, especially at the larger batch sizes, but this is a quick, naive implementation intended to work out-of-the-box on any NVIDIA GPU. I will be keeping all benchmark results in this table.
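My actual script wraps a custom Mask R-CNN implementation, but a simplified sketch of the same idea, using torchvision’s stock Mask R-CNN and a handful of made-up batch sizes, looks roughly like this:

```python
import time
import torch
import torchvision

# Stock torchvision Mask R-CNN standing in for my custom wrapper.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model = model.cuda().eval()

def benchmark(batch_size: int, warmup: int = 5, iters: int = 20) -> float:
    # A single synthetic 540x360 image, repeated to form the batch.
    batch = [torch.rand(3, 360, 540, device="cuda") for _ in range(batch_size)]
    with torch.inference_mode():
        for _ in range(warmup):  # warm-up iterations are not timed
            model(batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for bs in (1, 4, 8, 16):
    print(f"batch size {bs}: {benchmark(bs) * 1000:.1f} ms per batch")
```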
PC gamers will be well acquainted with the name AMD. Their CPUs provide a ton of performance for a great price, but in ML, the GPU is king. AMD’s GPUs are also great bang-for-the-buck gaming devices, but they aren’t really used in ML. That’s not to say they can’t be used; it’s just not common. AMD’s ROCm platform is supported in PyTorch (Linux only), and advanced users can install TensorFlow for ROCm as well.
As the support grows, the hardware will grow as well. I have not personally run any models on an AMD GPU but I’d love to hear from anyone who has and what their experience was like.
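If you do want to try it yourself, a quick sanity check on a ROCm build of PyTorch might look like the sketch below. I haven’t verified this on AMD hardware; ROCm builds reuse the familiar torch.cuda API, which is what makes the check so short.

```python
import torch

# ROCm builds of PyTorch reuse the torch.cuda namespace; torch.version.hip
# is set on those builds, while torch.version.cuda is set on NVIDIA builds.
print("ROCm/HIP version:", torch.version.hip)
print("CUDA version:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```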
The last big name I want to bring up is Intel. Most people know them from their CPUs, but they have recently released their Arc lineup of GPUs. Intel is relatively new to the scene, but advanced users can use PyTorch on Arc GPUs by following Intel's installation guide. The only performance numbers I could find were for Stable Diffusion implementations, but they suggest the Arc A770 lands right between the RTX 3050 and RTX 3060 on ML tasks. That’s not bad for a first-generation GPU, but it’s definitely not going to take market share from NVIDIA.
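For completeness, recent PyTorch builds expose Intel GPUs through a torch.xpu backend (older setups go through Intel’s extension instead), so a hedged device-selection sketch might look like this:

```python
import torch

# Pick the best available accelerator: NVIDIA (or AMD ROCm) via torch.cuda,
# Intel Arc via torch.xpu on recent PyTorch builds, otherwise fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif hasattr(torch, "xpu") and torch.xpu.is_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")

print("Running on:", device)
```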
Laptop or desktop?
What if you want a portable edge setup for demos or for some other reason? Laptops can be a viable alternative but before going out and buying a laptop for ML, there are a few things to keep in mind.
First, many gaming laptops will list a GPU with the same name as a desktop card, but the laptop variant has (generally) been specially designed for the unique space and thermal constraints of a laptop. A few laptops do ship with a full desktop GPU, but they can be extremely expensive. Most laptops instead use a version of the desktop card that has been tuned to run efficiently in a laptop. This generally entails restricting the power draw of the GPU, which keeps your battery from draining instantly and ensures that you don’t melt your laptop. A desktop 3080, for instance, has a Thermal Design Power (TDP) of 320 watts. The laptop version of the same card is restricted to a range of 80-150 watts, and that directly correlates with how many operations per second the card can perform.
Second, I’ll admit, it’s been a while since I’ve used a true gaming laptop, but with the small form factor, heat can be a concern over time. Long-running, intensive tasks on a laptop, even with the lower power draw, can cause heat to build up. This can force the entire laptop to throttle, which hurts performance. My solution when I was in college was to rest my laptop on an ice pack, but I wouldn’t recommend that for anyone hoping to use a laptop as a long-term ML server. While you may find that your laptop starts at 150 watts and runs fine, a few days of constant inference may see your GPU throttled down to pulling only 80 watts. At that point, your GPU is likely delivering around half of its original performance.
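If you want to keep an eye on this, nvidia-smi can report the current power draw, the enforced power limit, and the GPU temperature; the sketch below simply shells out to it (the exact fields assume a reasonably recent driver):

```python
import subprocess

# Ask nvidia-smi for the current power draw, the enforced power limit,
# the GPU temperature, and the SM clock; handy for spotting throttling.
query = "power.draw,power.limit,temperature.gpu,clocks.sm"
result = subprocess.run(
    ["nvidia-smi", f"--query-gpu={query}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```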
While these are likely worst-case scenarios for laptops, and many people may never have seen an issue with inference, there is no doubt that a full desktop (or server) is a better option for very intensive tasks. Desktops (with a good case and fans) allow much more airflow, removing heat far more efficiently. This obviously comes at a cost to portability, but building a desktop PC can also get you more compute for the same amount of money. If you’re curious about the cost of building a PC, one of my guilty pleasure sites is PCPartPicker. I love putting together PC builds, so feel free to reach out if you would like my advice!
Enterprise hardware
This section is pretty short because there are only a few differences between consumer hardware and enterprise hardware. Probably the most visible difference (pun intended) is that most enterprise ML hardware cannot output video. The notable exception here is the professional-grade GPUs used for CAD applications.
These are not the GPUs that I’m talking about. Some examples of the hardware I’m talking about are: T4, V100, and the H100. There are many GPUs, details of which can be found on the NVIDIA website.
These are specifically designed to be used for data processing. If you’ve ever used a cloud provider before, you may have seen these GPUs with varying prices. The T4 (and the newer L4) are cheaper entry-level versions with the T4 available for free on Google Colab. Below is a screenshot of the T4 on the same benchmark that I ran on my 2080 Super.
You’ll see that it’s noticeably slower than a 2080 Super, but it was also able to run a larger final batch size. That’s because these enterprise cards come with much more memory than consumer GPUs, though they’re also far more expensive. The T4 has 16 GB of video memory while a 2080 Super has only 8 GB. However, a T4 costs around $2,300 while the 2080 Super has an MSRP of $699 (and can be found for anywhere from $100-$800 as of October 2023).
You may be wondering why we’re even talking about enterprise hardware for edge use cases. Sure, it’s far more expensive, but the extra memory alone can be a huge plus for some cases. We’ve also only been talking about a single T4, currently the lowest-end card, in isolation. Enterprise hardware really shines when doing what it’s meant to do: accelerate compute with parallel hardware.
For use cases where you may need more compute than a single card can provide or you may need a multi-modal solution, enterprise hardware can work extremely well. As I defined in a previous post, edge solutions are “not cloud” but you can bring the cloud to the edge if you’d like. It’s a bit of a grey (or foggy) area but on-premise servers are still on the edge to me.
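As a rough illustration of that “parallel hardware” point, spreading camera feeds across several cards can be as simple as pinning one model replica per GPU. The sketch below reuses the same torchvision Mask R-CNN from earlier and is nowhere near a production pipeline, but it shows the shape of the idea:

```python
import torch
import torchvision

# One model replica per visible GPU; each camera feed is pinned to a replica.
num_gpus = torch.cuda.device_count()
models = [
    torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    .to(f"cuda:{i}")
    .eval()
    for i in range(num_gpus)
]

def infer(frames):
    # frames: one (C, H, W) image tensor per camera, round-robined across GPUs.
    outputs = []
    with torch.inference_mode():
        for cam_idx, frame in enumerate(frames):
            gpu = cam_idx % num_gpus
            outputs.append(models[gpu]([frame.to(f"cuda:{gpu}")]))
    return outputs
```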
Autonomous Vehicles
I wanted to briefly talk about another edge deployment that many people may overlook, and that’s autonomous vehicles. I live in Phoenix, where Waymo is all over the place. Every time I enter their service area (I’m only a couple of minutes’ walk away), I see one of their vehicles, and I try to ride in one as often as possible. If you haven’t tried it out, I strongly recommend riding in an AV.
The main hardware for AVs is directly on the car. You can’t rely on cell networks for inference so the compute needs to be on the car. For Waymo and Cruise, the compute is neatly packaged in the trunk of the car.
AVs, to me, are the epitome of how good hardware has become. We now have the compute to create autonomous drivers that are safer than humans and never get fatigued, distracted, drunk, etc. This will help reduce traffic fatalities and ensure that everyone has a safer time on the roads.
Conclusion
Well, if you’ve made it to the end, I commend you. That one really got away from me, and it really could’ve been longer. The current hardware market is massive and powerful, and it’s advancing at a breakneck pace. This is allowing models to get bigger, faster, and more accurate, even on edge devices.
Now that we’ve covered hardware, I’ll be discussing software. If you missed my previous posts, you can start at the introduction or read my post on ARM SoCs.