Public cloud computing platforms have made delivering data and analytics easier than ever. They are certainly easier and more powerful than the on-premise deployments of not so long ago, but "easier" is not always the same as "easy." There are still many decisions to be made. It is also worth reflecting on how these choices have evolved over the past 10-15 years, as well as on the principles that haven't changed.
Use Case Selection
It’s the simplest and most often overlooked question: what are we doing, exactly? Identify the goals, stakeholders, and users first.
Compliance
Understand the use case’s compliance requirements, such as HIPAA for healthcare. Cloud platforms have evolved significantly in the last 10 years in their ability to support myriad compliance contexts. Something to never forget, however, is that these capabilities enable you to manage your compliance requirements. You still have work to do.
Make friends with the compliance folks and lawyers. Things go much better when they are brought into projects early.
Data Ingest
Identifying data ingest frequency is a fundamental design decision, and it usually starts with a Near Real Time (NRT) vs. Batch decision. But these are broad classifications, and the devil is in the details. There is a difference between something that must be ingested and processed sub-second, compared to an expectation of a minute or two, even though both cases would be called NRT. Similarly, Batch could be once a day, or once an hour. It depends on the size of the batch.
Both cases are important and have their place. My advice would be that if you know you have NRT needs, then start designing for that. If you think you might have NRT needs, it is arguably easier to start designing for a Batch approach, get a basic solution working, and go from there.
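As a minimal illustration of that advice, assuming PySpark, with hypothetical bucket paths, broker, and topic names, the same logical ingest flow can be expressed as a batch job or as a micro-batch stream; the Batch version is usually the easier place to start.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

# Batch: pick up whatever landed since the last run (paths are hypothetical).
batch_df = spark.read.parquet("s3://my-bucket/landing/orders/2024-06-01/")
batch_df.write.mode("append").parquet("s3://my-bucket/bronze/orders/")

# Near Real Time: the same flow as a micro-batch stream from a Kafka topic
# (requires the spark-sql-kafka package; broker and topic names are made up).
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "orders")
             .load())

(stream_df.writeStream
    .format("parquet")
    .option("path", "s3://my-bucket/bronze/orders_stream/")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/orders_stream/")
    .trigger(processingTime="1 minute")
    .start())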
Cloud Platform Selection
Cloud platform selection takes the religious fervor of PC vs. Mac and Linux vs. Everybody Else, combines it with perennial vendor lock-in concerns, and takes the result to a new level. In the old days, acquiring a particular brand of technology such as a database presented the same challenges, but it was still only one component. Now the cloud platform in so many ways represents everything, with a bill that keeps coming every month (more on that later).
This might sound like stodgy advice, but from an enterprise perspective, try to work with whatever cloud platform is already in use at that organization. Each cloud vendor will try to tell you they are the only ones that can do something, but organizations will be better off with some semblance of technical consistency. The Big 3 (AWS, Azure, Google Cloud) are close enough on most capabilities. If there is a cloud provider in place at your organization, try to leverage it. If there are frameworks already in use, try to leverage them, too. While multi-cloud solutions are discussed with much excitement, it’s hard enough for an organization to be competent (and secure) with a single cloud platform, much less several.
The more cloud platforms are in use at an organization, the more compliance, security, networking, and support conversations will need to happen—and those conversations don’t typically scale linearly. Exceptional multi-cloud cases do exist; just go in with both eyes open.
Framework Selection
General Advice
If you’ve been around the data and solution space for any stretch of time, you’ve seen the industry survey slides with vendor logos packed more densely than anything NASCAR ever put on a racetrack. There are a lot of companies and frameworks out there, and it is good to have choices. A preponderance of choices can also be overwhelming.
My advice is to pick a few key frameworks and try to use the heck out of them, at least initially. There are always edge cases, of course, but simpler tends to be easier to integrate, debug, and maintain. Add additional frameworks when there is a proven need.
Identity Management
Identity management is a foundational security component. The details are beyond the scope of this post; just don't assume that every cloud service you intend to use actually supports your identity management framework, or that Multi-Factor Authentication is supported in the way you need.
Bring your security lead in early.
Storage
Object stores such as Azure's ADLS Gen2 and AWS's S3 are the de facto standard storage layer for modern cloud data and analytic use cases, as they provide the cheapest storage with "good enough" performance for most data and analytic workloads.
It’s still worth reflecting on why this pattern is such a big deal and so easy to take for granted. The concept of disaggregating storage and compute goes back decades, but the traditional RDBMS deployment was, for example, a database server plugged into a Storage Area Network. They were separated, but not by much. For anyone who lived through the Hadoop era, data locality was a big deal for read and processing performance, especially when running a multi-rack cluster, because while distributed workloads were powerful, they also put a tremendous strain on network infrastructure. The network is the unsung hero behind our ability to take object stores for granted for cloud data and analytic workflows.
Other storage features to pay attention to are encryption (and associated key management), soft-delete/restore options, and replication (e.g., intra-region, geo-replication) configuration.
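As one hedged example, assuming AWS S3 and boto3 with hypothetical bucket and key names, versioning (which underpins restore options) and default encryption with a customer-managed key can be enabled per bucket; replication and lifecycle rules are configured in a similar per-bucket fashion.

import boto3

s3 = boto3.client("s3")
bucket = "my-analytics-bucket"  # hypothetical bucket name

# Versioning retains prior object versions, enabling restore after accidental deletes.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Default encryption with a customer-managed KMS key (key ARN is made up).
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/example",
            }
        }]
    },
)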
Data Management
There are several components in a cloud analytic data management service worth noting.
Catalog—There needs to be a way to organize and secure your data (related to the Identity Management topic above). Whether it is a proprietary catalog or an open source solution such as Unity Catalog, Apache Iceberg, or the Apache Hive Metastore, without an effective catalog nothing in the data platform works.
Operations—One should ensure that a candidate data platform can support all required operations (select, insert, update, merge, delete). This is a subtle but critical point, and it is easy to make assumptions. Some query engines were built as just that—query engines—with the expectation that new data would be added by externally preparing files and then adding/registering them in the catalog. This pattern was common in the Hadoop era in how people used Apache Hive and Hive-compatible frameworks. Read the fine print and ensure all required data operations are supported for your solution's needs.
Files—Some data management frameworks will expose the underlying files and some won't. For analytic solutions, a common pattern now is a column-oriented file format such as Parquet (which emerged in the Hadoop era). Similarly, to support the full complement of data operations, frameworks such as Delta Lake (built on Parquet, but with write logs and transaction support) are increasingly prevalent. If that information is accessible, it is worth checking under the hood to understand how the data and files are actually structured, as it will better prepare you for operating the solution, for example knowing when actions such as compactions are triggered (see the sketch after this list).
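As a minimal sketch of what "all required operations" means in practice, assuming Delta Lake 2.x on PySpark with hypothetical table paths and a claim_id join key, a merge (upsert) and a file compaction might look like the following.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("delta-ops-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Hypothetical locations: a curated Delta table and a batch of incoming changes.
target = DeltaTable.forPath(spark, "s3://my-bucket/silver/claims")
updates = spark.read.parquet("s3://my-bucket/landing/claims_changes")

# Merge (upsert): update matched rows, insert new ones.
(target.alias("t")
    .merge(updates.alias("u"), "t.claim_id = u.claim_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Compaction: rewrite small files into larger ones; an operational concern worth
# understanding even when the platform triggers it automatically.
target.optimize().executeCompaction()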
Query, Redux
The verdict has long been in: SQL is here to stay, at least for analytics.
That’s not to say there isn't a basis for grumbling. I lived through the early enthusiasm for NoSQL frameworks about 15 years ago, from Bigtable clones like HBase onward, and those frameworks did not gain popularity because of an aversion to SQL or declarative programming paradigms per se. A significant factor at the time was frustration with the enterprise licensing of commercial RDBMS products. They could be expensive at large data volumes yet still technically limited, and open source frameworks that offered distributed storage of data on commodity hardware were very appealing for large-scale use cases.
Another factor was that NoSQL datastores like HBase would allow you to support the exact data access paths needed, because the developer was responsible for designing the keys. One didn’t need to argue with the query optimizer anymore. Consider this query …
SELECT /*+ LEADING(e2 e1) USE_NL(e1) INDEX(e1 emp_emp_id_pk)
           USE_MERGE(j) FULL(j) */
       e1.first_name, e1.last_name, j.job_id, SUM(e2.salary) total_sal
  FROM employees e1, employees e2, job_history j
 WHERE e1.employee_id = e2.manager_id
   AND e1.employee_id = j.employee_id
   AND e1.hire_date = j.start_date
 GROUP BY e1.first_name, e1.last_name, j.job_id
 ORDER BY total_sal;
… although this is just an example from documentation, hints this extensive are not unheard of in production solutions. If such hints are required for your SQL to perform appropriately, you are already a significant way down the NoSQL road, and such datastores might not seem that intimidating at that point. For operational systems (e.g., OLTP), such optimization might be just what is needed.
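For contrast, here is a minimal sketch of the key-design mindset, with hypothetical names throughout: the access path is baked into the row key itself, so fetching "the most recent events for a customer" becomes a simple prefix scan rather than a plan the optimizer has to be coaxed into.

import time

def event_row_key(customer_id: str, event_ts: float) -> str:
    # Composite key: customer first, then a reverse timestamp, so a prefix scan
    # on the customer returns the newest events first without a secondary index.
    reverse_ts = 2**63 - int(event_ts * 1000)
    return f"{customer_id}#{reverse_ts:020d}"

print(event_row_key("cust-0042", time.time()))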
SQL, however, is just so darn useful for analytic solutions. Appreciate SQL and underlying query engines for what they do and then handle the edge cases as they appear.
Analytics
Current analytic workloads may leverage languages like Python and distributed processing frameworks like Apache Spark, and there are plenty of other places to read about that outside this post.
What I find amazing is how dynamically provisioned and scalable workloads have become a mainstream feature. No more static provisioning of clusters! For most organizations, that type of capability 10-15 years ago was only a dream. The design goal should not just be distributed processing and analytics, but also intelligent provisioning and scaling of compute resources.
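As one hedged example of designing for elasticity, assuming Apache Spark on a cluster manager that supports dynamic allocation (the executor counts are illustrative only), the workload asks for compute as it needs it rather than being sized up front.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("autoscaling-sketch")
         # Let the cluster manager grow and shrink the executor pool with demand.
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.minExecutors", "2")
         .config("spark.dynamicAllocation.maxExecutors", "50")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())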
Dashboards & Reporting
There are a number of frameworks for this functionality, e.g., PowerBI and Tableau. Pick one.
A critical part of effective dashboarding and reporting is not only having the technology, but also delivering the information to the people that need it, in the way they can absorb it, and in the middle of whatever else they have happening in their day (or night). Consequently, don’t underestimate the value of the daily email delivered to a stakeholder. Similarly, there might be 50 amazing reports, but can anyone find the links to those reports? Human navigation matters.
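A daily email does not need much machinery. Here is a minimal sketch using Python's standard library, with made-up addresses, SMTP relay, figures, and report URL:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Daily claims dashboard summary"
msg["From"] = "reports@example.com"
msg["To"] = "stakeholder@example.com"
msg.set_content(
    "Yesterday: 12,345 claims processed; 98.7% auto-adjudicated.\n"
    "Full dashboard: https://example.com/reports/claims"
)

with smtplib.SMTP("smtp.example.com") as smtp:  # hypothetical mail relay
    smtp.send_message(msg)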
Auditing and Logging
Zero Trust Security is a set of principles, but it is also a mindset: assume you have already been compromised. How would you determine that it happened, and what happened?
Ensure this is designed, configured, and built in early.
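A minimal sketch of the kind of structured audit event worth emitting from day one, using Python's standard library; the field names and service identifiers are illustrative assumptions, and in practice these events would flow into the cloud platform's logging service.

import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("audit")

def audit_event(actor: str, action: str, resource: str, outcome: str) -> None:
    # Who did what, to which resource, when, and whether it succeeded:
    # the minimum needed to reconstruct activity after the fact.
    audit.info(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
        "outcome": outcome,
    }))

audit_event("svc-etl", "read", "s3://my-bucket/silver/claims", "success")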
Cost Management
The Cable Bill From Hell
The good thing about cloud computing is that it is flexible, and services can be provisioned and utilized with a few clicks. The bad thing about cloud computing is that it is flexible, and services can be provisioned and utilized with a few clicks, and you're going to pay for all of that. Many higher-level services charge based on data/compute utilization, and it can often be tough to know what a solution will cost—and how it will perform—until it's been built out. This is one aspect where on-premise deployments are easier to manage than cloud computing: a rack of servers costs what it costs.
Cloud computing is powerful, but it can also be the Cable Bill From Hell if not managed appropriately. Watch the bill every month and watch it closely.
Workload Costs
Additionally, start thinking early about appropriate Resource Groups and workload tagging to best determine which solution functions are accruing the most costs. Otherwise, your bill could wind up with a few high-level “storage” and “compute” buckets, and nothing else.
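As a hedged example of why tagging pays off, assuming AWS Cost Explorer via boto3 and a hypothetical cost-allocation tag named "workload" that has been activated for billing, last month's spend can be broken down by workload rather than lumped into generic buckets.

import boto3

ce = boto3.client("ce")  # AWS Cost Explorer
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-06-01", "End": "2024-07-01"},  # illustrative dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "workload"}],  # hypothetical tag key
)

for group in response["ResultsByTime"][0]["Groups"]:
    print(group["Keys"], group["Metrics"]["UnblendedCost"]["Amount"])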
Iterate
Keep going. Too many solutions have been cursed with an "MVP" (Minimum Viable Product) for an initial release…and then abandoned. This could happen for any number of reasons, such as an expansion of the solution portfolio, prioritization changes, or budget issues, but the result is bad no matter the cause. Read my BLOG@CACM post Anna Karenina on Development Methodologies for more on that.
References
- BLOG@CACM
- Anna Karenina on Development Methodologies
- Design Orientation and Optimization
- Operational and Analytic Data Cycles
- Software Quotes and Counter Quotes
- The Hadoop Ecosystem’s Continued Impact
- SQL Hints
Doug Meil is a software architect in healthcare data management and analytics. He also founded the Cleveland Big Data Meetup in 2010. More of his BLOG@CACM posts can be found at https://www.linkedin.com/pulse/publications-doug-meil