Traditional relational databases have been around for decades and were designed at a time when storage and compute infrastructure was expensive. Today, with the wide availability and adoption of the Cloud, the cost of storage has come down drastically, paving the way for new non-relational database designs to take shape. These non-relational data models and services, while still emerging, promise to be quite disruptive. Unlike traditional applications, which rely on a single shared database, the new microservice architecture gives each individual microservice a separate database with its own domain data. This allows organisations to deploy and scale microservices independently.
To take advantage of this, CIOs have to ask a key question: “How prepared and competent is my organisation to work with these new database models and services?”
At re:Invent 2018, held late last year, AWS launched two new databases: Amazon Timestream, a scalable time-series database, and Amazon Quantum Ledger Database (QLDB), a fully managed ledger database with a central trusted authority, built on an immutable, cryptographically verifiable transaction log. Together with the earlier-launched Amazon Neptune (a fully managed graph database) and Amazon DynamoDB (a proprietary NoSQL database service supporting key-value and document data structures), AWS has made clear its intent to challenge monolithic databases and address the application performance concerns of enterprises in the era of extreme digital transformation.
I had a chance to speak with Herain Oberoi, General Manager, Database, Analytics, and Blockchain Marketing, AWS, at re:Invent 2018 to discuss how AWS is trying to free organisations from the shackles of monolithic, on-premise relational databases and help them transform to deliver a faster, superior customer experience using the new models.
Below are the excerpts:
DynamicCIO (DCIO): Data management is critical to deriving value from the deluge of data that enterprises collect today. How can a shift from relational to non-relational databases help in deriving that value?
Herain Oberoi (HO): There are a couple of things to consider here. One, how does an enterprise look at data in the context of application development, which is where this whole shift from relational to non-relational comes in. Two, how do we look at analytics on the stored data. These two things are important for data lifecycle management. A lot of CIOs are now moving to data lake architectures. The whole point of a data lake architecture is that you can put all your data in one place, such as Amazon S3 (Simple Storage Service). Organisations don’t have to worry about whether it’s relational or non-relational. The data can be catalogued, and you can apply metadata tags to build, secure, and manage it. One of the services recently announced to help with this was AWS Lake Formation, and customers are already using it to build data lake architectures. It does take some heavy lifting to make a data lake ready for use. That heavy lifting typically falls into three areas:
- How do I get my data in?
- How do I ensure that the data quality is right?
- How do I make sure it’s in the right format?
Another aspect is data security, both in terms of access and policies. And finally, there is the data discovery piece: ensuring that the people who need the data for analysis know which data to access and can trust its quality. AWS Lake Formation addresses each of these issues and automates the process, reducing the time it takes to build a data lake.
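The three heavy-lifting steps above — getting data in, checking quality, and normalising format — can be sketched as a minimal ingest routine. This is purely illustrative; the field names (`user_id`, `amount`, `ts`) and validation rules are hypothetical, not part of any AWS service:

```python
import json
from datetime import datetime, timezone

def ingest(raw_lines):
    """Illustrative sketch of the three heavy-lifting steps:
    ingest raw rows, enforce basic quality, normalise the format.
    Field names and rules are hypothetical."""
    clean = []
    for line in raw_lines:                          # 1. get the data in
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue                                # drop unparseable rows
        if not rec.get("user_id") or rec.get("amount") is None:
            continue                                # 2. quality: required fields
        clean.append({                              # 3. normalise the format
            "user_id": str(rec["user_id"]),
            "amount": float(rec["amount"]),
            "ts": rec.get("ts", datetime.now(timezone.utc).isoformat()),
        })
    return clean

rows = [
    '{"user_id": 1, "amount": "9.5", "ts": "2018-11-01"}',
    'not json',                       # fails step 1
    '{"user_id": null, "amount": 3}', # fails step 2
]
print(ingest(rows))  # only the first row survives
```

In a real lake, the same three concerns are handled by managed tooling rather than hand-written loops; the point is that each step is mechanical and therefore automatable.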
DCIO: The data lake is an emerging concept. How different is it from the earlier concept of the data warehouse?
HO: It’s a good point. The data warehouse concept was primarily built on the relational model. Earlier, you had a monolithic, single large enterprise data warehouse, and to get your data into it you had to know the schema beforehand. If you needed to analyse data that didn’t fit the schema, you had to change it and tinker with the data warehouse. Maintaining the warehouse became challenging because it was one large pool of data: if one group within an organisation was overusing it, others would face latency issues. The data lake architecture does a couple of things. One, you don’t have to define a schema when you put the data in. You can store the data in whatever native format it arrives in and apply the schema on read rather than on write. Two, it decentralises the analytics, putting the onus on the individual business groups. The lake provides the data at the right quality and gives you the metadata to know which data is useful and which is not, but the onus is on you to do the analytics. With a data lake architecture, especially in the Cloud, different organisations can have their own accounts and decide which analytics service to use: some will use Amazon Athena, some Amazon Redshift, and some Amazon EMR against the data in the data lake.
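The schema-on-read idea described above can be sketched in a few lines: records land in their raw, native form, and a schema is applied only at query time. The record fields and schema here are hypothetical examples:

```python
import json

# Raw records stored as-is in their native JSON form; note the second
# record is missing a field -- it is still stored without complaint.
RAW = [
    '{"fan_id": 7, "channel": "web", "spend": "120.0"}',
    '{"fan_id": 8, "channel": "pos"}',
]

def opt_float(v):
    """Cast to float, tolerating missing values."""
    return float(v) if v is not None else None

def read_with_schema(raw, schema):
    """Schema-on-read: project each raw record onto the requested
    columns, casting types only at query time."""
    for line in raw:
        rec = json.loads(line)
        yield {col: cast(rec.get(col)) for col, cast in schema.items()}

# Two different readers can apply two different schemas to the same data.
schema = {"fan_id": int, "spend": opt_float}
print(list(read_with_schema(RAW, schema)))
```

The contrast with schema-on-write is that nothing was rejected or reshaped at ingest time; a reader who needs a different view simply supplies a different schema.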
I’ll give a couple of examples here.
Major League Baseball (MLB) in the US has 30 baseball teams, and it has to enable each one to do analytics to learn about its fans. Each team’s goal is to increase ticket sales and renewal rates for season tickets. MLB provides the data lake to every team. It takes all the information that comes from various channels, including websites and social feeds, along with information from the teams on ticket purchases and point-of-sale purchases. MLB curates the data lake and makes it available to the teams, and the individual teams can then apply analytics. This wouldn’t have been possible with an enterprise data warehouse architecture.
The second example is Amazon.com. We recently moved from a legacy Oracle data warehouse to a data lake architecture built on Amazon S3 with EMR, Redshift, and Glue; AWS Glue is the data catalogue. Amazon had about 50 terabytes of data in that enterprise data warehouse, which was well beyond the limits of what the system was designed for, and about 3,000 different teams accessing it to perform analytics. Any small team within Amazon wants to look at metrics, and across all the different sub-teams that adds up. If a team ran a transformation on a table of more than 100 million rows, it consistently failed. With the data lake architecture in place, they are able to support 100 terabytes of data, and the dictionary and data lake are now available to all these teams. The teams can set up their own accounts and use EMR or Redshift to run their own analytics, removing the bottleneck of depending on one central enterprise data warehouse.
DCIO: How is AWS Lake Formation relevant in this context?
HO: It usually takes months to get a data lake to a place where it’s useful and productive. The reason it takes so long is that you have to ensure the right data is available, the quality is right, the security is appropriate, and so on. IT teams have to create what’s called a ‘data catalogue’ or ‘data dictionary’ that defines the metadata so that the data being accessed is useful. AWS Lake Formation gives you the ability to create templates, workflows, and policies, so you can, for example, define security policies once and for all. It also gives you a catalogue and a dictionary, which reduces the time from months to days and accelerates the business.
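The catalogue-plus-define-once-policies idea can be modelled in plain Python. Everything below — the table names, tags, and policy shape — is hypothetical and illustrates the concept only; it is not the actual AWS Lake Formation API:

```python
# Hypothetical data dictionary: each table carries a location and
# metadata tags (this models the idea, not Lake Formation itself).
CATALOG = {
    "sales.tickets":  {"location": "s3://lake/sales/tickets/",
                       "tags": {"pii": False, "domain": "sales"}},
    "fans.profiles":  {"location": "s3://lake/fans/profiles/",
                       "tags": {"pii": True, "domain": "marketing"}},
}

# One policy, defined once, applied to every table whose tags match --
# rather than being re-stated per table or per user.
POLICIES = [{"deny_if": {"pii": True}, "unless_role": "analyst-pii"}]

def can_access(table, role):
    """Check a role against the tag-based policies for a table."""
    tags = CATALOG[table]["tags"]
    for pol in POLICIES:
        if all(tags.get(k) == v for k, v in pol["deny_if"].items()):
            return role == pol["unless_role"]
    return True

print(can_access("sales.tickets", "analyst"))      # True: non-PII is open
print(can_access("fans.profiles", "analyst"))      # False: PII denied
print(can_access("fans.profiles", "analyst-pii"))  # True: privileged role
```

The time saving Oberoi describes comes from exactly this shape: policies keyed on metadata apply automatically to new tables, instead of being hand-configured for each one.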
DCIO: You mentioned about 3,000 teams working inside Amazon, but most user organisations wouldn’t have that luxury of skilled resources and time. How does AWS hand-hold clients and bring them to a level of maturity where they can do justice to these kinds of technology platforms?
HO: It all depends on the ‘analytics maturity model’: an organisation may or may not have the required skills and competencies. Organisations have teams that are further along the path and teams that are less so. That’s why we provide the broadest set of analytics components. Suppose an organisation has a team with business-analyst skills that isn’t going to get into data engineering; for them we’ve set up self-service tools such as Amazon QuickSight for interactive dashboards and Amazon Athena for ad hoc queries. If an organisation has a more advanced team that knows how to deal with big data and has a data engineer on board, it can use something like Amazon EMR or Amazon Redshift. Further, an organisation may have a team with data scientists, who can use Amazon SageMaker and perform machine learning. If you think of it as a progression, in all these cases you are still going against the same data lake, and the work to define the data and provide context for users is similar. Based on what the teams need, there are different tools for different jobs.
DCIO: While announcing QLDB and the blockchain initiative, Andy Jassy made a bold statement: “We don’t build things for optics.” Where do you see this momentum going for developers using cloud computing services, and how will it be helpful?
HO: Blockchain has had a lot of hype in the last year, so we tried to understand the problems users are actually trying to address with it. The core problem is the need for a verifiable history of transactions. Some users also want a central authority they can trust to own the infrastructure. An example is AWS Marketplace: our marketplace partners trust AWS to own the database that records the transactions, and as long as we provide them access and they can verify it, they’re good. But another situation might be a consortium of banks where no bank agrees to a single entity owning the database. There you want to decentralise the trust, and that’s when blockchain becomes vital: you need a peer-to-peer network where everyone has their own copy of the ledger, and to make updates you have to have agreement and consensus among the members of the network.
We realised there were two very distinct use cases. One is where trust is centralised: you can use a database, and you don’t need the complexity of the network and consensus. The other is where trust needs to be decentralised and the complexity of blockchain is required. For the first case we found that a ledger database is much easier to use and implement, and it runs a lot faster because you don’t need consensus to record transactions. Our goal is to drive clarity for customers because there has been so much hype around it. That’s what Andy meant when he said, “We don’t do things just for optics.”
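The “verifiable history with a central trusted authority” idea can be sketched as an append-only, hash-chained log: each entry commits to the one before it, so any later tampering is detectable by re-walking the chain. This is a simplified model of the concept, not QLDB’s actual journal format:

```python
import hashlib
import json

class Ledger:
    """Append-only log where each entry's hash covers the previous
    entry's hash, making retroactive edits detectable. A toy model of
    a centrally owned, verifiable ledger."""

    GENESIS = "0" * 64  # placeholder hash before the first entry

    def __init__(self):
        self.entries = []

    def append(self, txn):
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        payload = json.dumps(txn, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.entries.append({"txn": txn, "prev": prev, "hash": digest})

    def verify(self):
        """Re-walk the chain; any altered entry breaks every hash after it."""
        prev = self.GENESIS
        for e in self.entries:
            payload = json.dumps(e["txn"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True

ledger = Ledger()
ledger.append({"from": "bankA", "to": "bankB", "amount": 100})
ledger.append({"from": "bankB", "to": "bankC", "amount": 40})
print(ledger.verify())                    # True: history intact
ledger.entries[0]["txn"]["amount"] = 999  # tamper with old history
print(ledger.verify())                    # False: tampering detected
```

Note what is absent: no peer-to-peer network and no consensus round, because a single trusted owner appends entries. That is exactly the distinction drawn above between a ledger database and a blockchain.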
DCIO: On Nov 1, 2018, Amazon itself migrated from Oracle to Amazon Aurora. How would you position it as a use case for the external world?
HO: Most customers we talk to are quite fed up with the old-guard database vendors and are willing to move to open source: MySQL, PostgreSQL, or MariaDB. The job before us was to look at the merits of those open source databases and make them enterprise grade. That resulted in Amazon Aurora, which gives you the flexibility and openness of an open source database with the availability and performance of a commercial-grade one, at one-tenth of the cost. The value proposition is strong, and that’s why Aurora has been the fastest-growing service in the history of AWS, which now offers over 125 services. It’s been out for a while; last year alone we doubled the number of Aurora customers. With Aurora, we built a custom storage layer: underneath, there are six copies of your data across three Availability Zones, which allows faster failover. These kinds of architectures you can only design in the Cloud. Our strategy is to listen to customer pain points and help solve them. The legacy vendors are making it hard for users, and that’s a great opportunity for us.
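Aurora’s six-copies-across-three-AZs design is quorum-based: per AWS’s published design, a write must be acknowledged by four of the six copies and a read consults three, so the two quorums always overlap and writes survive the loss of a full Availability Zone. The code below is only a sketch of that quorum arithmetic, not of Aurora itself:

```python
COPIES = 6          # six storage copies, two per Availability Zone
WRITE_QUORUM = 4    # a write must reach 4 of 6 copies
READ_QUORUM = 3     # a read consults 3 of 6 copies

def write_ok(surviving_copies):
    """A write commits as long as a write quorum is still reachable."""
    return surviving_copies >= WRITE_QUORUM

def quorums_overlap():
    # R + W > N guarantees every read quorum intersects every write
    # quorum in at least one copy, so reads always see the latest
    # committed write.
    return READ_QUORUM + WRITE_QUORUM > COPIES

# Losing one full Availability Zone means losing 2 of the 6 copies.
print(write_ok(COPIES - 2))  # True: writes survive an AZ failure
print(quorums_overlap())     # True: 3 + 4 > 6
```

This is the sense in which such architectures are Cloud-native: the design presumes independent failure domains (Availability Zones) that a single on-premise data centre cannot provide.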