Take Data Analytics to the Cloud
After completing this unit, you’ll be able to:
- Explain the challenges of on-premises data collection and analytics tools.
- List the advantages of cloud-based data analytics.
Build Your Data Analytics Solution in the Cloud
As business has gotten more complex over time, tools and services have gotten more powerful to enable organizations to keep up. A prime example is the evolution of data analytics from expensive, on-premises hardware to cloud-based architectures. Raf highlights the differences between these two approaches in the following video.
[Raf] You may already know that the cloud is more flexible, scalable, secure, distributed, and resilient. But I want to give a more data-related approach in terms of why cloud computing is relevant for data analytics. In this section, I will explain why the cloud is the best way to perform data analytics nowadays, and why it has been solid for operating big data workloads. So, let's get started.
Before we start talking about cloud, allow me to go back in time, maybe a decade, and tell you a brief story. After going back in time, it will be natural for you to understand why everybody loves doing data analytics in the cloud. Ready for the journey? Get your beverage of choice, and buckle up!
(cup hitting the floor)
Years ago, the most common approach for companies to have compute infrastructure, big data included, was to buy servers and install them in data centers. This is usually called colocation, or colo. The thing is, servers used for data operations are not cheap, because they need lots of storage, consume lots of electricity, and require careful maintenance to keep data durable.
Hence the need for entire dedicated infrastructure teams. And trust me, I've been one of those infrastructure analysts working with data centers. It is expensive and overwhelming.
With that scenario, only big companies were able to work with big data. And consequently, data analytics was not popular. It was very common for those servers to have a RAID storage controller that replicates data across the disks, increasing the cost and the maintenance effort even more.
In the early 2000s, big data operations were closely related to the underlying hardware, such as mainframes and server clusters. Although this was extremely profitable for the ones selling hardware, it was expensive and not flexible for the consumers. Then, something fantastic started to happen. And the name of this fantastic thing is Apache Hadoop.
Mostly, what Hadoop does is replace all that fancy hardware with software installed on operating systems. Yeah, that's right. With the help of Hadoop and computing frameworks, data could be distributed and replicated across multiple servers by using distributed systems, eliminating the need for that expensive data-replication hardware to start working with big data.
All you needed was efficient network equipment, and the data were synchronized over the network to other servers. By embracing failures instead of trying to avoid them, Hadoop helped reduce hardware complexity. And when you reduce hardware complexity, you reduce cost.
And by reducing cost, you start to democratize big data, because smaller companies could start leveraging it as well. Welcome to the big data boom.
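To make the idea of replication in software more concrete, here is a minimal Python sketch. It is not Hadoop's API or its actual placement logic. The `ToyCluster` class, the node names, and the replication factor of 3 are all hypothetical, chosen only to show how copying each block to several ordinary servers lets the data survive a server failure.

```python
class ToyCluster:
    """Hypothetical in-memory stand-in for a cluster of commodity servers."""

    def __init__(self, node_names, replication=3):
        if replication > len(node_names):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication = replication
        self._names = list(node_names)
        # Each node is modeled as a dict of block_id -> bytes.
        self.nodes = {name: {} for name in self._names}
        self._next = 0  # round-robin cursor used for placement

    def write_block(self, block_id, data):
        """Copy one block to `replication` distinct nodes."""
        targets = [
            self._names[(self._next + i) % len(self._names)]
            for i in range(self.replication)
        ]
        self._next = (self._next + 1) % len(self._names)
        for name in targets:
            self.nodes[name][block_id] = data
        return targets

    def fail_node(self, name):
        """Simulate a server dying: everything stored on it is gone."""
        self.nodes[name] = None

    def read_block(self, block_id):
        """Read from any surviving node that still holds a replica."""
        for store in self.nodes.values():
            if store is not None and block_id in store:
                return store[block_id]
        raise KeyError(f"all replicas of {block_id!r} are lost")


cluster = ToyCluster(["node-a", "node-b", "node-c", "node-d"], replication=3)
placed_on = cluster.write_block("block-0", b"sales records")
cluster.fail_node(placed_on[0])            # one server fails
recovered = cluster.read_block("block-0")  # the data is still readable
```

The failure is expected and tolerated rather than prevented, which is the "embracing failures" idea from the video. No special storage controller is involved, only plain copies on plain machines.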
I brought up Hadoop because it is the most popular open-source big data ecosystem. There are others. And what I want to highlight here is the concept, and not specific frameworks or vendors.
The thing is, by reducing hardware to a basic level and moving big data concepts, such as data replication, into software, we can start thinking about running big data operations on providers that are capable of providing virtual machines with storage and a network card attached. We can start thinking about using the cloud to build entire data lakes, data warehousing, and data analytics solutions.
Since then, cloud computing has emerged as an attractive alternative because that is exactly what it offers. You can get virtual machines, install the software that will handle the data replication, distributed file systems, and entire big data ecosystems, and be happy without having to spend lots of money on hardware. And the advantages of the cloud do not stop there.
Many cloud providers, such as Amazon Web Services, started to see that customers were spinning up virtual machines to install big data tools and frameworks. And then based on that, Amazon started to create offerings with everything already installed, configured, and ready to use. That's why you have AWS services, such as Amazon EMR, Amazon S3, Amazon RDS, Amazon Athena, and many others. Those are what we call managed services. All those are AWS services that operate in the data scope. In a later lesson, I will talk more about some of the services we will need to build our basic data analytics solution.
Another big advantage of running data analytics in the cloud is the ability to stop paying for infrastructure resources when you don't need them anymore. This is very common in data analytics, because due to the nature of big data operations, you may only need to run reports once in a while. And you can easily do that in the cloud by spinning up servers or services, using them, getting the report you need, saving it, and turning everything off.
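The spin up, use, save, and turn off pattern can be sketched in Python as follows. This is only an illustration and calls no real cloud API. The `temporary_servers` helper, the `running` set, and the `saved_reports` dictionary are hypothetical stand-ins for a provider's live servers and for durable storage.

```python
from contextlib import contextmanager

running = set()      # hypothetical: servers currently switched on
saved_reports = {}   # hypothetical: durable storage that outlives the servers


@contextmanager
def temporary_servers(count):
    """Start `count` pretend servers and always stop them on exit."""
    names = [f"worker-{i}" for i in range(count)]
    running.update(names)                  # spin up
    try:
        yield names
    finally:
        running.difference_update(names)   # turn everything off


with temporary_servers(3) as workers:
    report = f"report built on {len(workers)} servers"
    saved_reports["monthly"] = report      # save the result before shutdown

servers_left = len(running)
```

The `finally` clause means the servers are released even if the report job fails, so nothing keeps running (and billing) by accident.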
In addition, you can temporarily spin up more servers to speed up your jobs, and turn them off when you're done. And since you mostly pay for the time and resources you use, 10 servers running for 1 hour tend to cost the same as one server running for 10 hours. Basically, with the cloud, you get access to hardware without having to worry about all the burden involved in running data center operations. It is like the best of both worlds.
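The pricing comparison above is simple arithmetic, sketched here in Python. The hourly rate is a made-up number, not a real price, and the sketch assumes billing is strictly proportional to server-hours and that the job splits evenly across servers.

```python
HOURLY_RATE = 0.25  # hypothetical price per server-hour, not a real quote


def job_cost(servers, hours, hourly_rate=HOURLY_RATE):
    """Cost under a simple pay-per-server-hour model."""
    return servers * hours * hourly_rate


one_server = job_cost(servers=1, hours=10)    # slow: one machine for 10 hours
ten_servers = job_cost(servers=10, hours=1)   # fast: ten machines for 1 hour
same_price = one_server == ten_servers
```

Both runs consume 10 server-hours, so under these assumptions the bill is identical while the second run finishes much sooner. Real jobs rarely parallelize perfectly, which is why the video says the prices tend to match rather than always match.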
Did You Watch the Video?
Remember, the quiz asks about the video in this unit. If you haven’t watched it yet, go back and do that now. Then you’ll be ready to take the quiz.