Artificial intelligence (AI), like so many processes today, must often absorb sudden spikes in demand for capacity. Data tends to arrive in waves, and if it is processed in real time, your AI applications must be able to scale. If these spikes are significant, building a scalable application on premises can be expensive, perhaps even unaffordable.
Cloud computing, with its pay-as-you-go model, has democratized scalability and made AI far more feasible. As we’ll explore in this article, running AI in the cloud rather than on premises brings many benefits, including significant cost savings. We’ll start with a little history, then look at the differences between running AI on premises and in the cloud, and finally at what it takes to adopt an AI-in-the-cloud approach.
The Origins of Artificial Intelligence
In the early days, AI research went through cycles of boom and bust as scientists around the globe swung from excitement to disappointment in their attempts to build a smart computer. Whenever they envisioned a new approach to AI, a lack of high-performance processing power quickly stalled their ideas. The development of deep neural networks is a good example.
Deep Neural Networks
It all started with a 1943 paper proposing that the brain could be simulated by a computer if neurons were represented as connected digital logic gates. The simple idea behind artificial neurons is that they can be trained to either fire or not fire for a given set of inputs. When multiple artificial neurons are connected, a decision can be made by evaluating which neurons fired and which did not. This loosely mirrors the way our brains operate.
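To make the “fire or not fire” idea concrete, here is a minimal Python sketch of a single threshold neuron. It learns the logical AND of two inputs using the perceptron learning rule, which Frank Rosenblatt introduced in 1958, well after the 1943 paper. The function names, learning rate, and epoch count are illustrative assumptions, not details of any historical system.

```python
def fires(weights, bias, inputs):
    """Return 1 if the weighted sum of the inputs crosses the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=0.1):
    """Learn weights with the perceptron rule: adjust them only when the neuron is wrong."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # error is 0 when the neuron already gives the right answer,
            # so only mistakes move the weights toward the target.
            error = target - fires(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table for logical AND: the neuron should fire only when both inputs are 1.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
for inputs, target in AND_SAMPLES:
    print(inputs, "->", fires(weights, bias, inputs))
```

Each connection weight decides how much one input pushes the neuron toward firing. Training is nothing more than repeatedly nudging those weights until the firing pattern matches the examples.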
It wasn’t until 1954 that the first successful artificial neural network was created, and memory limitations restricted it to only 128 neurons. The network was trained to recognize a pattern of 1s and 0s and predict the next number in the sequence. The researchers also discovered that removing 10% of the neurons did not hinder the network’s performance, much as our brain can keep functioning after minor damage.
Despite this success, neural network research and funding were largely halted. The most prominent reason for the budget cuts was that computers were not powerful enough: training even a simple pattern recognition network could take weeks. Today, training the same 128-neuron network would barely register on your computer’s resources, and it would finish in about two minutes rather than weeks.
Another reason these early AI systems failed is that there wasn’t enough data to train them (a process now known as machine learning, or ML). Organizations did not yet understand the value of data, so they did not collect it. Even if they had known, there was almost no way to store such large volumes, since data storage was far less cost-efficient than it is today.
However, in the 1980s, continued research and funding produced more sophisticated algorithms, and more powerful computers became available. Only then did people start believing that AI was a real possibility.
Chess Playing AI
Perhaps the most notable example of AI in the past was Deep Blue, a system designed to play chess. Deep Blue was the first computer to win a match against a reigning world chess champion under standard tournament conditions, largely thanks to its 11.38 gigaflops of processing power. Its predecessor, Deep Thought, was a strong player too, but it did not come close to its successor.
The chess-playing computers of the time relied largely on brute force. By evaluating sequences of possible moves and the positions they led to, the computer could determine the best strategy. In simpler terms, the further ahead a computer can evaluate, the better the move it can pick. Some argued that this wasn’t true AI, but the grandmasters who played against these machines told stories of sensing a kind of intelligence behind their artificial opponent’s moves.
Even though Deep Thought couldn’t compete against a world champion, in 1988 it did win a regular tournament game against a grandmaster. At the time, it could evaluate about 10 moves ahead. By comparison, in 1996 Deep Blue could evaluate a massive 20 moves ahead, largely thanks to its greater performance and highly parallel architecture. It was this ability to connect many processors together efficiently that helped Deep Blue win its 1997 match.
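The look-ahead idea can be sketched with a depth-limited minimax search. The toy game below is an assumption chosen to keep the example small: players alternately take one, two, or three stones, and whoever takes the last stone wins. Real chess engines such as Deep Blue did not enumerate literally every line. They combined this kind of search with refinements such as alpha-beta pruning and a hand-tuned evaluation function.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may take 1, 2, or 3 stones per turn


@lru_cache(maxsize=None)
def minimax(stones, depth, maximizing):
    """Score a position from the maximizing player's point of view.

    +1 means a forced win for the maximizer, -1 a forced loss, and 0 means
    the search hit its depth limit before reaching the end of the game.
    """
    if stones == 0:
        # The previous player took the last stone and won.
        return -1 if maximizing else 1
    if depth == 0:
        return 0  # search horizon reached: no information about this line
    scores = [
        minimax(stones - take, depth - 1, not maximizing)
        for take in MOVES
        if take <= stones
    ]
    return max(scores) if maximizing else min(scores)


def best_move(stones, depth):
    """Pick the move whose resulting position scores best for the player to move."""
    legal = [take for take in MOVES if take <= stones]
    return max(legal, key=lambda take: minimax(stones - take, depth - 1, False))


print(best_move(10, depth=1))   # → 1: every line looks equal, so it takes the first move
print(best_move(10, depth=10))  # → 2: leaves 8 stones, a forced win it can now see
```

With a shallow search, every move looks the same, so the choice is arbitrary. Only when the search reaches the end of the game does the winning move stand out. This is the same reason evaluating more moves ahead made the chess machines stronger.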
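A back-of-the-envelope calculation shows why doubling the look-ahead demanded so much more hardware. This sketch makes two assumptions. The first is an average branching factor of b ≈ 35 legal moves per position, a commonly quoted estimate for chess rather than a figure from this article. The second is a naive search with no pruning. Under those assumptions, the number of positions examined at depth d grows as b^d:

```latex
N(d) \approx b^{d}, \qquad
\frac{N(20)}{N(10)} \approx \frac{35^{20}}{35^{10}} = 35^{10} \approx 2.76 \times 10^{15}
```

Real engines never searched this naively. In the best case, alpha-beta pruning cuts the work to roughly the order of b^{d/2} positions. Even so, every extra move of look-ahead multiplies the work, which is why parallel hardware mattered so much.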
The Present and Future of Machine Learning
Since those days, computer speeds have increased exponentially, and machine learning and AI are once again the center of attention. Today, we can run sophisticated algorithms using only the processing power of our laptops. Even the once mighty Deep Blue cannot compare to what we carry in our pockets every day.
In 1993, a supercomputer that cost over $100 million and filled a small room would have had 600 gigaflops with 2,048 processors. The Apple A13 Bionic chip in the iPhone 11 delivers 600+ gigaflops across 18 cores, ships in a phone that costs about $1,000, and fits in your pocket.
As we invent devices designed specifically for faster calculation, such as AI Processing Units (APUs) and Tensor Processing Units (TPUs), we keep pushing the boundaries of processing power.
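Taking the figures above at face value, the price of a gigaflop has dropped by about five orders of magnitude. This comparison ignores inflation, running costs, and the gap between peak and sustained performance:

```latex
\frac{\$100{,}000{,}000}{600\ \text{GFLOPS}} \approx \$166{,}667\ \text{per GFLOPS},
\qquad
\frac{\$1{,}000}{600\ \text{GFLOPS}} \approx \$1.67\ \text{per GFLOPS},
\qquad
\frac{\$100{,}000{,}000}{\$1{,}000} = 10^{5}
```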
Until recently, AI systems generally learned from data that had been generated and captured externally. The most sophisticated chess algorithms used games played by humans as their starting point for learning. As a result, the neural network behind an AI chess player knew only what human games could teach it. Even so, modern AI players can outplay any human chess player, because they can evaluate more moves ahead than humans can. However, the strategies they choose are still human-like, owing to the nature of the training data.
In 2017 this changed. DeepMind, the AI research lab owned by Google’s parent company, Alphabet, created a program called AlphaZero, based on a deep neural network. Instead of searching for the best moves using algorithms and data created by humans, it was designed to teach itself to play. Knowing only the rules of chess, it needed just 24 hours of self-play to surpass the world’s strongest chess program. During this time, the program played 44 million games against itself, training its neural network to find and remember the best strategies. This massive amount of training in such a short time was possible largely thanks to the 5,000 TPUs it could use in parallel.
After these 24 hours of training, AlphaZero reached a superhuman level of play and defeated world-champion programs: Stockfish in chess, Elmo in shogi, and its own predecessor AlphaGo Zero in Go. In 100 games against Stockfish, AlphaZero won 25 games as White and three as Black, and drew the remaining 72. During the matches, AlphaZero ran its previously trained neural network on a single machine with four application-specific TPUs and 44 CPU cores. AlphaZero offers an early glimpse of a future in which AI builds knowledge and strategies that surpass human capabilities.
Our advances in hardware sound like a great technological leap, but the biggest leap of all is AI in the cloud. The most popular cloud providers (AWS, Azure, and GCP) put teraflops of compute power at our fingertips. As with AlphaZero, that incredible amount of compute can be used to train your AI model at large scale up front. The trained model can then run on limited, economically viable resources.
Creating AI has always been a monumentally difficult task. Being able to train large-scale models without investing thousands of dollars in infrastructure makes the process much easier, and it is a dream come true for data scientists.
Next, we will explore what makes cloud-based AI so compelling and why it beats investing in on-premises infrastructure. We will cover typical machine learning and AI development tools for creating datasets, training models, and running inference.
AI in the Cloud vs On-Premises
Moving to a cloud platform can seem scary at first because it is unfamiliar, while on-premises infrastructure has been around for ages. On-premises creates the illusion of greater control, easier maintenance, easier scalability, and enhanced security, and this illusion stops people from making the switch to cloud services. By examining what the cloud has to offer and comparing the on-premises approach with the cloud approach, we hope to dispel that illusion and show that the cloud is the right choice for AI and machine learning.
Leveraging the elasticity of the cloud for scalability
First, let’s examine the benefits of using the cloud versus on-premises or data center solutions. You could start by creating your models on a laptop, as most of us do. This is a good way to build a proof-of-concept model, where performance demands are modest and the benefits of the cloud are less apparent. Sooner or later, though, you will hit bottlenecks, since fitting a model can leave your laptop unusable for any other task.
At that point, you need to make a decision: buy a dedicated machine learning computer, or rent one from a leading cloud provider with a state-of-the-art machine learning setup. If you spend the cash on a dedicated system, you will run into the same issue again down the road when it no longer has enough power. With a cloud solution, on the other hand, all you need to do is rent a more powerful (and slightly more expensive) instance, or run several cheap instances in parallel. Scalability and parallel processing are the main reasons AI in the cloud is so attractive.
The setup: Manual configuration versus convenient on-demand resources
The next challenge with on-premises is setup. When you use your own hardware, you must install and configure every piece of software and every library yourself. Even a simple task such as installing proprietary GPU drivers is tedious work. With a cloud solution, you can largely stop wasting time on this, since common machine learning software and libraries come preconfigured and ready to go.
Cloud providers assure you that if you choose one of their supported libraries, it will work out of the box and scale without issues. It will also be tightly integrated with the provider’s other services. A hidden benefit of the preinstalled software on cloud machines is that, in some cases, it is an optimized build of the open source project that the public has access to. Simply using the cloud’s version can therefore give you a noticeable performance boost.
Redundancy and security
Redundancy and security are further benefits of cloud computing: the files you put on the server stay there until you delete them, and they can be protected from unauthorized access. Because providers store multiple copies of your data, you can be assured that it is backed up in the event of a catastrophe.
In addition, cloud resources can be configured so that only the proper personnel can access your data or your instances. All major cloud providers support Role-Based Access Control (RBAC), which lets you define roles that dictate who has access to your systems. Cloud providers also offer built-in support for major data protection and security compliance frameworks such as GDPR, HIPAA, FedRAMP, and others. The same cannot always be promised for on-premises solutions.
Every cloud provider has an extensive suite of services in its portfolio designed to reduce the workload and speed up delivery for AI researchers. In addition, everything in the cloud is designed to work together and make life easier. The services can be broken down into data preparation and modeling.
To build any kind of artificial intelligence, you need big data (that is, a lot of data) for the machine to learn from. Because cloud resources scale easily, you can process terabytes of data in a matter of hours. Plenty of tools are configured and ready to use as soon as you boot up a machine in the cloud.
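The core idea of RBAC is that permissions attach to roles, and users receive roles rather than raw permissions. The following is a minimal, provider-neutral sketch. The role names, permission strings, users, and the `is_allowed` helper are all hypothetical and do not correspond to any cloud provider’s actual IAM API.

```python
# Hypothetical roles mapped to the actions they permit.
ROLE_PERMISSIONS = {
    "data-scientist": {"dataset:read", "model:train"},
    "ml-engineer": {"dataset:read", "model:train", "model:deploy"},
    "auditor": {"logs:read"},
}

# Hypothetical users mapped to the roles they hold.
USER_ROLES = {
    "alice": {"data-scientist"},
    "bob": {"ml-engineer"},
    "carol": {"auditor"},
}


def is_allowed(user: str, action: str) -> bool:
    """Allow an action only if one of the user's roles grants it (deny by default)."""
    return any(
        action in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("alice", "model:train"))     # True
print(is_allowed("alice", "model:deploy"))    # False: only ml-engineer may deploy
print(is_allowed("mallory", "dataset:read"))  # False: unknown users get nothing
```

Because access flows through roles, changing what a whole team may do means editing one role rather than every user. That property is what keeps access control manageable as an organization grows.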
Let’s say you want to create a dataset by downloading some information from the internet and processing it. A good way to go about it is Apache Spark, because it can parallelize your transformation algorithms. You could use Amazon EMR, Google Cloud Dataproc, or Azure Databricks to quickly create a cluster that is set up for autoscaling. Even though these services come from different cloud providers, they all come set up and ready to go with recent releases of Apache Spark.
Once your data has been processed, you can store it using one of the integrated database services offered by cloud providers. Amazon Aurora, for example, can be used to create a scalable relational database. If relational databases do not float your boat, try Google Cloud Bigtable, a low-latency, scalable non-relational database. If you are after something fresh and new, you might want to experiment with Delta Lake on Azure, which can help you start your organization’s next big data project.
Whatever your data use case, the cloud is ready for it and integrates with almost anything you might want. No more manual configuration or spending money on server racks: just pay for what you use and sleep well knowing that your data is protected by multiple layers of redundancy.
Now that we have covered the data side, we can move on to modeling. Any AI project requires a machine learning model, built either with one of the proven algorithms or as a custom solution. Setting up your project in the cloud gives you an advantage over your competition. Most cloud providers offer several state-of-the-art machine learning models that are already trained and ready to use. If using someone else’s models is not your style, you can use the cloud’s model creation tools to speed up delivery.
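Spark’s core benefit is that it splits a dataset into partitions, applies your transformation to each partition in parallel, and combines the results. The sketch below imitates that pattern on a single machine with Python’s standard `concurrent.futures` module. It is a stand-in for illustration, not actual Spark code. The `clean_record` transformation and the sample records are hypothetical. On EMR, Dataproc, or Databricks, the same work would be spread across the cluster’s worker nodes instead of local processes.

```python
from concurrent.futures import ProcessPoolExecutor


def clean_record(line):
    """Hypothetical transformation: collapse whitespace and lowercase the text."""
    text = " ".join(line.split()).lower()
    return text or None  # blank lines become None so they can be dropped


def clean_partition(partition):
    """Apply the transformation to every record in one partition, dropping blanks."""
    return [record for record in map(clean_record, partition) if record is not None]


def split_into_partitions(records, n_partitions):
    """Cut the records into at most n_partitions contiguous chunks."""
    size = max(1, -(-len(records) // n_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def parallel_clean(records, n_partitions=4):
    """Transform each partition in its own worker process, then combine results in order."""
    partitions = split_into_partitions(records, n_partitions)
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        cleaned_partitions = pool.map(clean_partition, partitions)
        return [record for part in cleaned_partitions for record in part]


if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing this module.
    raw = ["  Hello   World ", "", "Cloud  AI", "   ", "Apache Spark"]
    print(parallel_clean(raw))  # ['hello world', 'cloud ai', 'apache spark']
```

Because each partition is processed independently, adding workers (or, in the cloud, nodes) shortens the wall-clock time without changing the transformation code. That is the property the autoscaling clusters above exploit.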
AWS offers Amazon SageMaker, a fully managed service that allows anyone to build, test, and deploy models with ease. SageMaker removes the burden of building a robust machine learning architecture by bundling popular frameworks in one fully integrated package. This lets developers and data scientists skip the environment setup and get right to creating models.
SageMaker Studio is an integrated development environment (IDE) that gives you built-in tools to train many types of models automatically, monitor and debug the training process, collaborate faster using SageMaker Notebooks, and deploy models to production. Recreating all these processes yourself would take enormous effort from developers and data scientists, and it would cost much more than simply running them in the cloud. Keep in mind that this approach does require you to know how to write code. The next solution, however, does not require you to code at all.
If you are new to machine learning and not too familiar with the popular machine learning frameworks, Azure Machine Learning Studio might be for you. With its simple drag-and-drop user interface, you don’t need to write any code at all. Simply select prebuilt modules that each complete a certain task, put them in the proper order, and let Azure handle the rest. Writing custom code is still allowed, but it is entirely up to you whether you need a more custom approach.
Azure ML Studio lets you drag and drop modules for tasks such as importing datasets, transforming them, and training and evaluating models. Azure turns your design into a working model and gives you the ability to publish it to a production inference server. The whole experience is so easy that you can create a working model in a few hours without any coding experience.
To help data scientists, Google takes an approach similar to Amazon SageMaker’s with its AI Platform. Instead of a “coding-free” option, Google offers a scalable environment that can use its TPU servers for an additional performance boost. However, AI Platform was not designed to prepare and transform your datasets. In its documentation, Google explains that the service covers efficient model training, evaluation, deployment, and monitoring. Therefore, if you plan to use it, you should already have a prepared dataset.
More and more businesses are migrating their applications to the public cloud or a hybrid cloud. The affordability that autoscaling provides, and the level of redundancy on offer, are very hard to match with an on-premises solution.
Deep Blue once shook the world by showing what highly parallel computing power could do to advance our quest for artificial intelligence. The cloud now puts that same kind of power at our fingertips, at a scale orders of magnitude greater than was available to us before. It would be unwise not to use it for AI integration and research.
In the cloud, everything is preconfigured, managed, and easy to use. It is the perfect place to try out new frameworks, train highly scalable AI models, or simply run cheap experiments of any kind. Whatever your use case, try AI in the cloud first before buying more server racks.