From “Lake” to Insight: How Federal Agencies Can Get More from Their Big Data Platforms

Federal agencies are collecting and sharing data at an unprecedented rate. Between July 2010 and August 2017, the number of federal data centers grew by 476 percent, from 2,094 to 12,062 [1]. Meanwhile, the Digital Accountability and Transparency Act (DATA Act) has made an unprecedented amount of this information open and accessible.

For many agencies, “data lakes”—centralized repositories of structured and unstructured data—have been the next step toward leveraging this information for analysis and decision-making. But compiling information into one place is only the first step. These data lakes need to become fully functional data platforms capable of ingesting, storing, and supporting data of any format or type, in a way that enables analysts to make connections and develop insights.

How can federal agencies build such a platform without sliding into a “data swamp”? How can data analysts escape the 80-20 rule, under which they spend 80 percent of their time wrangling data and only 20 percent performing real analysis? Drawing on our work with civilian and defense agencies and on building our Open Data Platform (ODP), here are five tips for success:

Start by identifying the problem to be solved

More information doesn’t automatically equate to more insight. By some industry estimates, roughly 60 percent of big data initiatives fail, and one big reason is jumping in before asking the right questions.

What does your organization aim to do with the data in your data lake: Optimize the scheduling of equipment maintenance? Assess battlefield capabilities and spot early warning signs of potential threats? Will your data platform need to support multiple functions, such as machine learning and predictive analytics, while also spinning out applications? To what degree will it need to scale?

These questions can guide the types of data you capture, how it’s secured, your cloud infrastructure, and more. We’ve found this step to be so critical, in fact, that we’ve integrated it into our ODP. For example, the platform creates indices for data storage based on mission/business questions and requirements.
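To make this concrete, here is a minimal Python sketch of what deriving storage indices from mission questions might look like. Every name here (the questions, sources, and fields) is a hypothetical illustration, not ODP’s actual interface:

```python
# Hypothetical sketch: deriving storage-index definitions from mission
# questions. All names are illustrative; this is not Booz Allen's ODP.

MISSION_QUESTIONS = {
    "optimize_maintenance_scheduling": {
        "sources": ["work_orders", "sensor_telemetry"],
        "index_fields": ["asset_id", "failure_code", "timestamp"],
    },
    "assess_threat_indicators": {
        "sources": ["field_reports", "signal_feeds"],
        "index_fields": ["region", "threat_type", "observed_at"],
    },
}

def index_plan(question: str) -> dict:
    """Return the index definition that supports a given mission question."""
    plan = MISSION_QUESTIONS.get(question)
    if plan is None:
        raise KeyError(f"No index plan defined for question: {question}")
    return plan

print(index_plan("optimize_maintenance_scheduling")["index_fields"])
```

Starting from the question, rather than the data, keeps the platform from indexing everything indiscriminately.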

Define and secure the data  

The vast amounts of structured and unstructured data in today’s federal data lakes come from a variety of sources—often in a state that’s far from ready for advanced analytics.

“If enterprise data—with its large volumes, varied formats and types—is to be managed strategically, its metadata must be suitably defined and used,” according to global research and advisory firm Gartner [2]. For federal agencies charged with complying with strict security and privacy regulations, metadata must also be effectively secured.

Here’s where metadata and attribute authorization schemes come in. Metadata tags and unique identifiers allow organizations to quickly and easily query, process, analyze, aggregate, and present data of any variety. From a security standpoint, attribute authorization schemes protect data at the source, record, or field level.
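As a rough illustration of both ideas, the following Python sketch tags individual fields with a required attribute and filters each record against a user’s attributes before presenting it. The tag names and attributes are assumptions made for the example, not any specific agency’s scheme:

```python
# Hypothetical sketch of metadata tagging plus attribute-based access
# control (ABAC) at the field level. Tags and attributes are illustrative.
from dataclasses import dataclass, field

@dataclass
class TaggedRecord:
    record_id: str
    data: dict                                       # field name -> value
    field_tags: dict = field(default_factory=dict)   # field name -> required attribute

def redact(record: TaggedRecord, user_attributes: set) -> dict:
    """Return only the fields this user's attributes authorize."""
    visible = {}
    for name, value in record.data.items():
        required = record.field_tags.get(name)
        if required is None or required in user_attributes:
            visible[name] = value
    return visible

rec = TaggedRecord(
    record_id="r-001",
    data={"station": "Alpha", "ssn": "***-**-1234"},
    field_tags={"ssn": "pii_cleared"},       # only PII-cleared users see this field
)
print(redact(rec, {"analyst"}))                  # {'station': 'Alpha'}
print(redact(rec, {"analyst", "pii_cleared"}))   # both fields
```

Because the authorization decision hangs off the metadata rather than the application, the same record can be safely served to users with different clearances.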

Make information easier to find

Once tagged, data must be governed in a way that enables analysts to easily find and understand it. In an Erwin survey of 118 CIOs, CTOs, data center managers, IT staff, and consultants, most respondents said data governance is critical to compliance, customer satisfaction, and better decision-making. Yet nearly half (46 percent) lacked a formal data governance strategy.

Cataloging is key: the continual management of information such as data set names, formats, tagging, releasability, and retention. So is governance of data access. Our ODP includes four data zones: raw (data that has not been touched), trusted (data that has been quality checked, tagged, and enriched), analytic (data that has been indexed, stored, and tuned to run advanced machine learning algorithms), and sandbox (data that has been segmented or quarantined to enable testing, prototyping, and exploration). Moving information into these smaller, purpose-built zones enables organizations to manage data efficiently and deliver results faster.
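A minimal Python sketch of that zone pattern, assuming a toy quality gate between raw and trusted (the zone names mirror the ones above; the checks themselves are illustrative):

```python
# Hypothetical sketch of promoting records through data zones. The zone
# names follow the article; the quality gate is a toy stand-in.

def quality_check(record: dict) -> bool:
    """Toy gate: a record must carry an id and a timestamp to be trusted."""
    return bool(record.get("id")) and bool(record.get("timestamp"))

def promote(record: dict, current_zone: str) -> str:
    """Move a record one zone forward when it qualifies; otherwise hold it."""
    if current_zone == "raw" and quality_check(record):
        record["tags"] = record.get("tags", []) + ["quality_checked"]
        return "trusted"
    if current_zone == "trusted":
        # A real platform would index and tune the data at this hop.
        return "analytic"
    return current_zone  # failed or exploratory records stay put (or go to sandbox)

rec = {"id": "42", "timestamp": "2018-09-01T00:00:00Z"}
zone = "raw"
for _ in range(2):
    zone = promote(rec, zone)
print(zone, rec["tags"])  # -> analytic ['quality_checked']
```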

“Democratize” data analytics

Such data zones can also help organizations spread the analytics workload to team members beyond the data scientists, whose time ideally should be kept free for exploring new ideas and models.

Another tactic toward this goal is making certain data analytics functions self-service, such as cleaning data and running notebooks and analysis models. Booz Allen’s ODP uses an automated installation system that allows developers to provision a new data platform from a single command line, as sketched below. And because the ODP is built on open source products, it’s easy to customize for specific objectives.
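The sketch below shows only the general pattern of a single entry point driving every provisioning step in order; the step names and the `--env` flag are hypothetical, not ODP’s actual installer:

```python
# Hypothetical sketch of single-command platform provisioning. The steps
# and flags are illustrative; this is not Booz Allen's ODP installer.
import argparse

STEPS = [
    "create storage buckets",
    "deploy ingest services",
    "configure metadata catalog",
    "enable attribute-based access control",
]

def provision(env: str) -> None:
    """Run every provisioning step for the named environment, in order."""
    for step in STEPS:
        # A real installer would call cloud APIs or config-management tools here.
        print(f"[{env}] {step} ... done")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Provision a data platform.")
    parser.add_argument("--env", default="dev", help="target environment name")
    args = parser.parse_args()
    provision(args.env)
```

Saved as, say, provision.py, a single `python provision.py --env dev` walks through every step, which is the property that makes provisioning self-service.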

Use Agile to accelerate progress

For many big data objectives, such as mission readiness, challenges are multifaceted and time is of the essence. By helping teams manage uncertainty and improve teamwork, transparency, and project commitment, Agile methodology can help organizations take on these more difficult technical goals [3]. So can an Agile culture, one that encourages new ideas and quick pivots based on lessons learned.

In conclusion

Overcoming data lake obstacles involves a combination of processes, culture, and appropriate technology choices. By employing modern data management platforms and best practices from the outset, agencies can start flipping the 80-20 rule of data science—and start putting the valuable information they collect to work for their missions.

[1] https://www.gao.gov/assets/700/691959.pdf

[2] https://www.gartner.com/doc/3878879/ways-use-metadata-management-deliver

[3] https://digital.gov/2018/04/03/thinking-about-going-agile-5-benefits-your-office-will-reap-with-agile-methods/

About the Authors

Chris Brown is a Chief Technologist at Booz Allen Hamilton with over 20 years of information technology experience, providing customer expertise in big data analytics solutions. His experience spans large-scale, mission-critical applications in both the commercial and U.S. federal government markets. Chris received a BS in Business Administration from William & Mary, with a minor in Computer Science.


David Cunningham is a Principal in Booz Allen’s Strategic Innovation Group (SIG) focused on delivering cloud and data platform capabilities to our Civil, Defense, and Intelligence Community clients. He has been at the forefront of technology evolutions such as Service-Oriented Architecture, Enterprise Integration, Cloud Computing, and Big Data. David has over 18 years of professional experience in IT development and received a BS in Computing Sciences from Villanova University.

