How to Overcome Obstacles in Data Lake and Warehouse Strategies: 3 Best Practices for Enterprise Architects

Kimberly Read is the enterprise architect at Faction, a multi-cloud data services provider that helps customers overcome cloud vendor lock-in, defy data gravity, and choose the best cost and performance combinations for cloud workloads. Prior to joining Faction, Kim spent 20 years with Hewlett Packard Enterprise, most recently as principal enterprise architect for BI and analytics.

Today’s enterprise architects (EAs) have full plates. They’re developing organizational, cloud, implementation, and tooling strategies; defining best practices; and producing and updating architectural artifacts to ensure sound technology choices. EAs source the appropriate software, hardware, and applications to enable business requirements. They also aim to incorporate varied types of data (unstructured, relational, etc.) into a single solution that preserves quality, integrates well with other enterprise systems, supports the required data formats and compliance guidelines, and functions as the single source of truth for the company.

Multi-cloud initiatives—drawing on services from public and private clouds—can help organizations stay ahead of the curve. As they look to create reliable data warehousing strategies, however, EAs often encounter challenges. Top among them: aggressive budget limitations (driven by mandates to decrease capital and operating expenses, organization-wide) and the need to avoid vendor lock-in (either in the form of stagnant tools or multiple-year licenses).

To support the business case for multi-cloud, enterprise architects can benefit by addressing three primary considerations.

1. Make appropriate tooling choices

Getting the right tools at the right price is always a challenge. For companies under budget pressure, it may be impossible to change tools for several years if the current choice does not work out or business needs change. The challenge of weighing budget against features applies to a range of tools and cloud services, including those for business intelligence (BI), data science, data warehouses, and data lakes.

A clear list of requirements and a scoring methodology for evaluating options will help determine the most suitable tool for your situation. Purchase considerations extend beyond price and features: Will personnel require new training? How well does the tool integrate with your other tools? Are there additional licensing requirements? Does it help or hinder architecture decisions for the next three years?
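To make the scoring methodology concrete, here is a minimal sketch of a weighted scoring matrix. The criteria, weights, and sample scores are hypothetical assumptions, not a prescribed standard; substitute the requirements your stakeholders actually care about.

```python
# Minimal sketch of a weighted tool-scoring matrix.
# Criteria, weights, and scores are illustrative assumptions only.

CRITERIA = {                     # weight: relative importance (sums to 1.0)
    "price": 0.25,
    "feature_fit": 0.30,
    "integration": 0.20,
    "training_required": 0.10,   # less training needed scores higher
    "licensing_flexibility": 0.15,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (0-10) into a single weighted total."""
    return sum(CRITERIA[name] * scores.get(name, 0.0) for name in CRITERIA)

candidates = {
    "Tool A": {"price": 6, "feature_fit": 9, "integration": 8,
               "training_required": 5, "licensing_flexibility": 7},
    "Tool B": {"price": 8, "feature_fit": 7, "integration": 6,
               "training_required": 8, "licensing_flexibility": 9},
}

# Rank candidates by weighted total, highest first.
for tool, scores in sorted(candidates.items(),
                           key=lambda kv: weighted_score(kv[1]),
                           reverse=True):
    print(f"{tool}: {weighted_score(scores):.2f}")
```

Keeping the weights explicit and agreed upon before scoring begins is what makes the evaluation empirical rather than a post-hoc justification of a favorite tool.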

A crucial step in the selection process is determining whether the tool you’re considering will meet business requirements now and for the foreseeable future. Admittedly, with new tools entering the marketplace frequently, this may be difficult to ascertain, but looking at purchases through this lens will help you move beyond current budgetary restrictions. For example, Gartner predicts that multi-cloud strategies “will reduce vendor dependency for two-thirds of organizations through 2024.” Be sure the tools you select are flexible yet support today’s business demands.

Further, data strategies need to be carefully defined and examined, since data is at the heart of business decisions. Data needs to be accessible with minimal latency so that timely business decisions can be made. Good data that is accessible to business analysts and data scientists is the key to moving to predictive (as opposed to reactive) decisions. Ensuring that data is accessible from on-premises environments and from any cloud (or multiple clouds) is essential, but this should be accomplished without moving or copying data to multiple locations. Multiple copies of data mean there is no single source of truth, and teams end up spending their time copying and migrating data instead of using it to fulfill business strategy.

2. Evaluate the build vs. buy decision

Part of any tooling choice centers on the decision either to build a solution of your own or to buy an existing tool. Some organizations have clear mandates that prioritize one option over the other. One common mandate is to avoid vendor lock-in, which can preclude purchases from certain vendors but may bolster the case for more flexible approaches that support multi-cloud initiatives.

When facing the decision to buy or build a tool, consider:

Build:

Will you be able to build the solution with development resources that are available internally? Or will you need to rely on contractors? Whether relying on in-house developers or on contractors, be sure to document and share reference architectures, establish guardrails, define frequent checkpoints and code walkthroughs, and perform checks and audits of your source repository.

Here, agile methodology is key. Carefully review all documentation and ensure it is communicated effectively to support and operations teams throughout the lifecycle so the handoff happens without surprises. While building a data lake or an entire data warehouse is certainly possible, the task can be formidable, especially when weighed against the offerings available from many data warehouse vendors.

Buy:

What are the requirements of each of your stakeholders? Gathering these, then evaluating them against an empirical scoring framework, will allow your team to assess tools without bias.

Some tools make more sense to buy: with hundreds of connectors and exceptional performance capabilities, they provide cost-effective opportunities to use public cloud services, often leading to year-over-year reductions in capital expenditures, support, and maintenance. BI tools are one such example. Data warehouses have also evolved tremendously over the last few years; massively parallel processing (MPP) database engines, which distribute work across many processing units to run operations simultaneously, deliver excellent performance for data warehouses. Advances in storage have also enabled warehouses to consume varied data types (structured and semi-structured). Data warehouses can apply business logic, offer familiar query capabilities, and perform ETL (extract, transform, load) and ELT (extract, load, transform) functions, which can sometimes alleviate the need for a separate data lake.
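As a rough illustration of the ELT pattern (raw data loaded first, business logic applied afterward by the engine’s own SQL), the sketch below uses Python’s built-in sqlite3 module as a stand-in for a cloud MPP warehouse. The table names, columns, and transformations are hypothetical.

```python
# Illustrative ELT sketch: land raw data, then transform inside the engine.
# sqlite3 stands in for a cloud MPP warehouse; names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records untransformed in a staging table.
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [("1001", "250.00", "us-east"), ("1002", "99.50", "eu-west")],
)

# Transform: apply business logic with the warehouse's own SQL engine,
# rather than in a separate ETL tier.
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount_usd,
           UPPER(region)        AS region_code
    FROM stg_orders
""")

for row in conn.execute("SELECT * FROM fct_orders"):
    print(row)
```

The design choice ELT represents is exactly the one described above: when the warehouse engine is powerful enough to do the transformation work, a separate transformation tier (or even a separate data lake) may be unnecessary.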

3. Trust and certify the data

Whatever choice an enterprise architect makes about solutions and tooling, the focus must always be on the quality of the data. Especially with data warehouses, data must be clean; validated; available in near real-time/real-time (NRT/RT) frequency; organized for queries, reporting, and analytics; and trusted.

To ensure that the data is trusted and certified:

  • Identify all data so that it can be searched and found. Catalog and tag each piece of data; ensure that data lineage details are preserved and that a defined taxonomy is in place (a minimal catalog sketch follows this list).
  • Automate the process of data discovery and keep it up to date. Tagging can be performed automatically (through auto-tagging, via a machine learning capability that improves over time) or manually by data stewards.
  • Establish security policies. Most data lakes and data warehouses can be configured with specific rules, including security policies, that specify how to process each type of data. Personally identifiable information (PII), for example, must be protected and tagged appropriately in your lake or warehouse.
  • Audit and log all access to the data throughout the system. Once a data catalog exists and security policies are defined for the various data types, tags, and lineage, the “trust in the data” component is nearly complete.
  • Show time variants. Your data lake, warehouse, and reports must all capture time-variant information so users can verify the accuracy and timeliness of the data.
  • Establish data governance. Without data governance, the solution will fail. Data governance defines the processes, policies, and procedures for all data assets and designates the parties who govern them. Its components include data quality, data stewardship, master data management, data timeliness, and data accuracy.
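
To make the cataloging and auto-tagging ideas above concrete, here is a minimal sketch of a catalog entry with tags and lineage, plus a rule-based PII auto-tagger. The field names, tags, and patterns are illustrative assumptions; a production catalog (or an ML-driven auto-tagger) would be far richer.

```python
# Minimal sketch of a data-catalog entry with tags and lineage, plus a
# rule-based auto-tagger for PII. Field names, tags, and regex patterns
# are illustrative assumptions, not a real catalog's schema.
import re
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str
    column: str
    tags: set = field(default_factory=set)
    lineage: list = field(default_factory=list)  # upstream sources

# Simple PII rules keyed on column names; data stewards can still
# add or override tags manually.
PII_RULES = {
    "pii:email": re.compile(r"email", re.IGNORECASE),
    "pii:ssn": re.compile(r"ssn|social_security", re.IGNORECASE),
    "pii:phone": re.compile(r"phone", re.IGNORECASE),
}

def auto_tag(entry: CatalogEntry) -> CatalogEntry:
    """Attach PII tags whenever a column name matches a known rule."""
    for tag, pattern in PII_RULES.items():
        if pattern.search(entry.column):
            entry.tags.add(tag)
    return entry

entry = auto_tag(CatalogEntry(
    dataset="warehouse.customers",
    column="customer_email",
    lineage=["crm.contacts", "stg.customers"],
))
print(entry.tags)  # {'pii:email'}
```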

Once your data is tagged and certified, it’s available for reporting. The reports can be certified by business subject matter experts or data stewards, then watermarked as “certified.” Data and reports must also be secured—encrypted both at rest and while in motion, with role-based security, masking, and obfuscation used to meet varied reporting or compliance needs.
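As one hedged illustration of role-based masking, the sketch below obfuscates a PII column at report time. The role names and policy are hypothetical, and real deployments typically enforce masking in the warehouse or BI layer rather than in application code.

```python
# Illustrative sketch of role-based masking for report output.
# Role names and the masking policy are hypothetical.

def mask_email(value: str) -> str:
    """Obfuscate an email, e.g. 'kim@example.com' -> 'k***@example.com'."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"

# Which columns each role may see unmasked (assumed policy).
UNMASKED_COLUMNS = {
    "data_steward": {"customer_email"},
    "analyst": set(),   # analysts always see masked PII
}

def apply_masking(row: dict, role: str) -> dict:
    """Return a copy of the row with PII masked for unauthorized roles."""
    allowed = UNMASKED_COLUMNS.get(role, set())
    masked = dict(row)
    if "customer_email" in masked and "customer_email" not in allowed:
        masked["customer_email"] = mask_email(masked["customer_email"])
    return masked

row = {"order_id": "1001", "customer_email": "kim@example.com"}
print(apply_masking(row, "analyst"))       # email masked
print(apply_masking(row, "data_steward"))  # email visible
```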

From Obstacles to Opportunities

Cloud-based data warehousing can be cost-effective, scale easily, and reduce labor demands compared to on-premises or DIY solutions. Look for options that fit your organization’s cloud (or multi-cloud) strategy to avoid data duplication, data egress fees, and vendor lock-in. By minimizing these challenges, you’ll be free to creatively address the business initiatives that will help your organization thrive.