From Data Lake Curator to Data Lake Master: Best Practices for Making the Most of Your Data Lake

I recently caught up with Jennifer Cheplick, Sr. Director of Product Marketing at Syncsort, to discuss how the concept of a data lake is now commonplace, as businesses realize the value of having a single repository to house all enterprise data in its original, unaltered format. While this is a vital step in mastering a single view of all data assets, left unchecked, it can turn into a data swamp. Jennifer has 20+ years of accomplishment in technical business-to-business marketing, particularly within the enterprise software and solutions space, and has a passion for Big Data technologies.

Daniel – Managing Editor, insideAI News

 

insideAI News: You’ve spoken to the importance of treating data as an asset within an organization. What does this mean, and why is implementing a data lake an essential first step?

Jennifer Cheplick: Everyone is familiar with the concept of data as an asset when it comes to the marketplace for selling and buying customer data, but you don’t have to be a data broker to view data as a strategic asset. Any company can turn the raw data that exists across its business into higher-value insights that can be acted upon to drive additional revenue, profitability or cost savings. At every step along the way, the data needs to be managed, secured and subject to proper usage and quality controls – just like other assets in the enterprise – particularly when dealing with large volumes. To accomplish this, most organizations create data lakes where they can bring together all critical data from across the enterprise – legacy systems, web logs, sensors, third parties, etc. – into one place and make sense of it all to ultimately create actionable insights and real business results.

insideAI News: What do organizations often overlook when they embark on creating a data lake?

Jennifer Cheplick: To have data viewed as a strategic asset in your organization, you need to treat it that way from the outset of your data lake project. Successful organizations start with defined business objectives and measures of success – whether that’s driving bigger insights or reducing costs, or both. The technical objectives and measures will flow from there. Once the strategy is defined, there are a few critical capabilities that must be addressed, regardless of the business use case. As discussed above, organizations should make sure they are including all critical enterprise data in the lake, but that is just one task often overlooked. No matter where it originates, data is not valuable if downstream users don’t have a high degree of confidence that it’s valid, accurate and complete. Nobody wants their data lake to turn into a data swamp. Organizations must also consider security, privacy and compliance needs, especially those in highly regulated industries like banking and insurance. This can impact how much data you preserve (often in its original, unaltered format to maintain data lineage for a potential audit) – and for how long. The ability to cost-effectively offload to Hadoop is a popular use case for creating a data lake. Security, privacy and compliance mandates also require a thoughtful approach to data governance.

insideAI News: How can organizations ensure their data lake doesn’t turn into a data “swamp”?

Jennifer Cheplick: Data “swamps” are a common business issue, especially when organizations do not fully understand their data. This is why a well-thought-out data quality plan is essential, and there are several best practices organizations can keep in mind to achieve this. First, address the aspects of data quality directly impacted by members of your organization by creating business rules with all key stakeholders, and having employees follow an agreed-upon method of governing the data before populating the data lake. This ensures all team members can understand the data from a non-technical perspective, and that the data is analyzed consistently across departments. It also simplifies data cleansing once the data is inside the data lake, and even allows the process to be automated for more hands-free insight. Then, turn your attention to the data itself. You want to analyze your data to assess how good, or bad, the quality is. This involves identifying the priority attributes of your data, and then profiling your data based on those attributes. Once quality issues are identified, organizations must implement a set of remediation activities to correct the defective records, as well as address the root causes of the problems. Of course, quality is not a “once and done” activity. Continuous monitoring is important to ensure data quality issues don’t get introduced with new data sources, etc.
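
Editor's illustration: the profiling step described above can be sketched in a few lines of Python. This is a minimal sketch, not a method from the interview or from any Syncsort product. The attribute names (`customer_id`, `email`, `balance`), the validity rules, and the sample records are all hypothetical, chosen only to show what "profiling on priority attributes" and "identifying defective records" might look like.

```python
from collections import Counter

# Hypothetical priority attributes and validity rules (assumptions for
# illustration; real rules would come from the business stakeholders).
RULES = {
    "customer_id": lambda v: isinstance(v, str) and v.strip() != "",
    "email": lambda v: isinstance(v, str) and "@" in v,
    "balance": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def profile(records, rules=RULES):
    """Count ok, missing, and invalid values for each priority attribute."""
    report = {name: Counter() for name in rules}
    for rec in records:
        for name, is_valid in rules.items():
            value = rec.get(name)
            if value is None:
                report[name]["missing"] += 1
            elif not is_valid(value):
                report[name]["invalid"] += 1
            else:
                report[name]["ok"] += 1
    return report


def defective(records, rules=RULES):
    """Return the records that fail at least one rule, for remediation."""
    return [
        rec
        for rec in records
        if any(
            rec.get(name) is None or not is_valid(rec.get(name))
            for name, is_valid in rules.items()
        )
    ]


# Hypothetical sample data.
sample = [
    {"customer_id": "C1", "email": "a@example.com", "balance": 10.0},
    {"customer_id": "", "email": "b@example.com", "balance": 5},
    {"customer_id": "C3", "email": None, "balance": -2},
]

report = profile(sample)
to_fix = defective(sample)
```

Running the same `profile` call each time a new source is loaded is one simple way to approach the continuous monitoring mentioned above, since a jump in the missing or invalid counts flags a problem before it spreads downstream.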

insideAI News: What benefits can organizations receive if they properly utilize their data lake?

Jennifer Cheplick: Data lakes, when created and managed effectively, can empower organizations to leverage their data as a strategic asset across the enterprise. Converging all data sources in one place – from mainframes to streaming data sources – is very powerful, particularly for advanced and predictive analytics. For example, data lake “masters” can detect and respond to fraud in real time; improve customer loyalty with targeted marketing and enhanced customer service; and strengthen regulatory compliance and credibility if the organization is audited. When these benefits are tied back to the business outcomes defined when setting up the data lake, the value of the data lake – and the data itself – will be demonstrated to all stakeholders, opening the door to future projects.

 
