There’s an old saying in the Artificial Intelligence community: once software starts working, people stop calling it AI. You could argue that the opposite has happened over the last six years or so of the neural network renaissance, with Machine Learning researchers returning to the term AI as the old stigma of exaggerated hype wears off. Still, the saying points to an interesting guideline for technological maturity: a technology is mature when you stop noticing it. That’s why the old Palm Pilots were a conversation piece while modern smartphones go completely unnoticed. One particularly powerful application of Deep Learning is the proliferation of voice assistants.
While there are plenty of valid critiques of the failure modes of various voice assistants, speech recognition, achieved via a combination of Natural Language Processing, convolutional neural networks, and long short-term memory models, is in general impressively good. But just try getting a response from Siri or Cortana when the internet is down.
And this brings us to today’s topic: where does Cloud Computing fall short in Deep Learning applications? As a provider of on-site high-performance computing hardware and software catering to Deep Learning applications, Exxact has particular expertise in identifying and delineating the scenarios where on-site compute has the advantage over the Cloud in terms of cost, flexibility, privacy, and/or security. Some of these applications are obvious: you would hardly want to rely on Cloud compute for the object recognition and decision processes of a self-driving car cruising at highway speeds.
While autonomous vehicles may have an obvious need to carry their own compute, many other applications are a matter of choosing the most important factors to optimize for. Unlike a safety-critical application such as driving, a failure in a voice assistant model is annoying but not fatal. Running the voice recognition models off-device helps save on battery usage in mobile applications and ensures that the most up-to-date model is always used. On the other hand, some people may weigh the battery savings against privacy concerns (their own or their customers’) and choose an offline voice recognition model.
It’s important to consider every aspect of your computational needs when deciding between Cloud and local compute for your next project. Many people may not realize that matching the performance of Cloud compute virtual machines doesn’t require a warehouse-sized datacenter: Deep Learning supercomputers like NVIDIA’s DGX-1 or a bespoke Tensor Workstation are not much larger than a conventional personal computer.
One of the selling points of Cloud Computing is “elasticity,” or the ability to quickly spin up additional virtual machines as needed. As counter-intuitive as it may sound, this elasticity does not necessarily translate into increased flexibility when it comes to pre-installed frameworks or choice of hardware. Invest in reserved P2/P3 instances from Amazon Web Services, for instance, and you’ll find yourself limited to a choice between older-generation Tesla K80 GPUs and the more capable but pricier Tesla V100.
Choosing a custom-built system for your Deep Learning application allows flexibility in the choice of GPUs. Not only that, but on-site providers support specialized software configurations, covering not only the ubiquitous TensorFlow, Torch, and Theano libraries but also more esoteric packages like DL4J, Chainer, and DeepChem for drug discovery, all configured with their dependencies to run smoothly out of the box. That degree of flexibility is not always available from the one-size-fits-all solutions offered by major Cloud providers. Remember, developer/researcher time is often demonstrably your most valuable resource.
Cloud Computing obviates the need to worry about upgrades and maintenance so that you and your team can concentrate on solving real problems. What’s less obvious is that sourcing a Deep Learning system from a dedicated provider delivers many of the same benefits, with services and warranties you’ll be hard-pressed to do without on a DIY system.
Security and Privacy
The obvious considerations of capability and cost may be the first to come to mind when debating the Cloud vs on-site decision, but in fact there are many applications where the choice will be made for you by data security or privacy requirements. As members of the public, we may be growing overly accustomed to news of security breaches in Cloud services, such as the personal information of US voters registered for the 2016 election left exposed on AWS by the data services company Deep Root Analytics, but when setting up a research or business project with potentially sensitive data, the consequences are all too real.
The convenience of Cloud resources comes at the cost of an increased attack surface, which may be vulnerable to malicious or accidental breaches. On-site systems mitigate some of this risk and can be configured to optimize for security, up to and including air-gapped systems that eliminate network-based attack vectors.
Applications serving the government, law enforcement, defense, and medical industries all face strict regulations on maintaining data security, often preventing the use of third-party storage solutions. In other instances, the control and protection of private data may be something of a gray area, but internal best practices may still encourage on-site data storage. Banking, FinTech, and Insurance applications all deal with sensitive data, and even in areas without explicit regulatory requirements, data security is a priority consideration when a breach may have long-term reputational consequences.
It’s an open secret that Cloud Computing can be expensive compared to dedicated systems, particularly for tasks with reliable compute needs known well in advance. Cost comparison estimates for Cloud vs. on-site systems range from about 2x as expensive for data centers in general to 3-4x more expensive for Deep Learning-specific setups. On the other hand, Cloud Computing may make sense for unknown or widely varying compute requirements: if you don’t know what scale your models will operate at, you can use Cloud Computing to explore your requirements between the concept stage and scaling up to a dedicated on-site system.
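As a rough illustration of how this trade-off plays out, the break-even point between renting and owning can be sketched with a few lines of arithmetic. All of the dollar figures below (cloud hourly rate, system price, power draw) are hypothetical placeholders, not quotes from any provider; only the electricity rate matches the estimate used later in this article:

```python
# Hypothetical break-even sketch: at what utilization does owning beat renting?
# All figures are illustrative assumptions, not real quotes.

CLOUD_RATE = 3.00        # $/hour for a comparable cloud GPU instance (assumed)
SYSTEM_PRICE = 15000.00  # purchase price of an on-site system (assumed)
POWER_KW = 1.5           # system power draw in kW (assumed)
ELECTRICITY = 0.20       # $/kWh, matching the estimate used in this article
YEARS = 3                # depreciation horizon

hours = YEARS * 365 * 24
onsite_hourly = SYSTEM_PRICE / hours + POWER_KW * ELECTRICITY

# Fraction of the time a cloud instance would have to run for costs to match:
breakeven_utilization = onsite_hourly / CLOUD_RATE

print(f"On-site effective rate: ${onsite_hourly:.2f}/hr")
print(f"Break-even cloud utilization: {breakeven_utilization:.0%}")
```

Under these assumptions, an on-site system pays for itself once the equivalent cloud instance would run more than roughly a third of the time; for continuous training workloads the balance tilts heavily toward ownership.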
There are a few factors that make it difficult for Cloud Computing to compete with on-site systems on price. Hardware manufacturers like NVIDIA often market more expensive versions of the same GPU to data center clients and use licensing agreements to segment the consumer and data center markets (this is one reason we see K80 GPUs in AWS P2 instances). Large data centers also carry increased thermal engineering costs, which are ultimately passed on in the price; the scale of these thermal inefficiencies is suggested by DeepMind’s work at Google data centers, which reduced the energy used for cooling by 40%. Finally, installing an on-site system allows your organization to claim depreciation against tax liabilities.
For application specifications that rule out neither on-site nor Cloud solutions, cost is king. In that case, it’s time to set the total cost of ownership against a comparable subscription to a major Cloud compute provider. Keep in mind that the numbers below are estimates, and that Cloud-based projects often accrue additional costs, such as data storage and transfer fees, that aren’t immediately obvious.
The costs of running Amazon Web Services P2 and P3 instances, marketed especially for machine learning, are shown below with and without a 3-year reservation (the 3-year commitment entails partial payment in advance). For the latest pricing, check the AWS EC2 pricing pages for P2 and P3 instances. Custom on-site systems are more sensitive to price fluctuations in the underlying hardware, allowing providers to pass on the price drops that accompany rollouts of new GPU architectures.
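When comparing reserved instances against on-demand rates, it helps to fold the partial upfront payment into a single effective hourly rate. The dollar figures below are hypothetical placeholders, not actual AWS prices; check the EC2 pricing pages cited above for current numbers:

```python
# Effective hourly rate of a reserved cloud instance with partial upfront
# payment. Dollar figures are hypothetical placeholders, not AWS quotes.

upfront = 8000.00   # one-time partial upfront payment (assumed)
monthly = 250.00    # recurring monthly charge over the term (assumed)
term_years = 3

total_cost = upfront + monthly * 12 * term_years
term_hours = term_years * 365 * 24

# Spread the full commitment over every hour of the term:
effective_rate = total_cost / term_hours

print(f"Effective rate over the {term_years}-year term: ${effective_rate:.2f}/hr")
```

Note that this effective rate is only realized if the instance is actually used around the clock; unlike on-demand pricing, a reservation is paid for whether or not the capacity is busy.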
The cost of ownership for on-site systems reported below is therefore a relatively conservative approximation, assuming total depreciation over 3 years. Other cost comparisons estimate maintenance and running costs at 50% of the original purchase price per year, but it probably makes more sense to consider only the cost of electricity (estimated at ~$0.20 per kWh in the figures below), especially when additional maintenance costs are covered by warranty. It’s worth noting that even with an estimate of 50% maintenance costs per year, on-site systems at 100% utilization would still be significantly cheaper than their slower Cloud counterparts. While the “lower-end” P100 on-site configuration costs about 50% less per hour than a reserved p2.xlarge AWS instance, P100 GPUs perform about 4x faster than the older K80 GPUs on TensorFlow benchmarks.
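Those two relative figures compound: the fair comparison is cost per unit of work, not cost per hour. The short calculation below uses only the rough ratios cited above (an on-site P100 configuration at ~50% of the reserved p2.xlarge hourly cost, and ~4x the K80’s TensorFlow throughput); these are the article’s estimates, not measured benchmarks:

```python
# Price-performance sketch using the relative figures cited in the text.
# Ratios are rough estimates, not measured benchmark results.

cloud_cost_per_hour = 1.0   # reserved p2.xlarge (K80) rate, normalized to 1
onsite_cost_per_hour = 0.5  # ~50% of the cloud rate (article estimate)
onsite_speedup = 4.0        # P100 vs K80 on TensorFlow (article estimate)

# Cost per unit of work = hourly cost / relative throughput
cloud_cost_per_unit = cloud_cost_per_hour / 1.0
onsite_cost_per_unit = onsite_cost_per_hour / onsite_speedup

advantage = cloud_cost_per_unit / onsite_cost_per_unit
print(f"Effective price-performance advantage: {advantage:.0f}x")
```

In other words, halving the hourly cost while quadrupling throughput yields roughly an 8x advantage in cost per unit of training work under these assumptions.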
Cloud Computing makes a lot of sense for small, unknown, or variable compute requirements, but for Deep Learning at scale there are numerous advantages to a dedicated on-site system. For continuous, large-scale, and anticipated Deep Learning compute requirements, the cost savings of dedicated on-site systems are significant. Computational needs that fall between a DIY system and a full-scale setup, for smaller or more experimental workloads, can be met by a cost-effective yet capable Deep Learning workstation.