In this special guest feature, Carole Gunst, Marketing Director at Attunity, offers four best practices for setting up your data lake in Hadoop. Carole is a marketing leader with demonstrated results at growing companies and world-class marketing organizations. She has a strong background working with B2B companies in the technology space. Carole has a B.A. in Journalism from the University of Rhode Island and an M.A. in Marketing from Emerson College.
Hadoop is on everyone’s mind in Big Data. This free, open-source framework allows companies of all sizes to perform analyses they previously could not afford. Hadoop Data Lakes are low-cost, reliable, scalable and agile – all key ingredients when deciding how you’d like to store and access your data.
Arguably the most important part of building out a Data Lake is making sure it isn’t a silo. All relevant people should be able to access and utilize this new body of information. Safeguards are necessary, but if you lock too many stakeholders out, the Data Lake can’t deliver its full value to employees. If you want to see the fastest ROI possible, be sure to give the right people access and avoid creating new data silos.
To help you avoid problems on your journey to setting up a Hadoop Data Lake, here are a few best practices:
1. Pre-define your architecture
Before you go into production, you need a concrete plan for how this aggregated data will be used and who will be using it. This will help you set up your architecture to best support those use cases. Putting extra effort into predefining your architecture will save you significant time in the long run. Your implementation team will thank you.
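One common way to predefine that architecture is to agree on a zoned directory layout up front, so every team knows where data lands and where it can be consumed. The paths and zone names below are purely illustrative assumptions, not part of the original article – adapt them to your own use cases:

```
/data/lake/raw/<source>/<yyyy>/<mm>/<dd>/   # immutable landing zone, as-ingested
/data/lake/curated/<domain>/                # cleansed data with schemas applied
/data/lake/consumer/<team>/                 # per-team views, access-controlled
```

Agreeing on a layout like this before ingestion starts is what keeps the lake from degenerating into an unstructured dumping ground.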
2. Set up a security strategy
This also relates to governance. As stated before, it’s important not to create new data silos when implementing a Hadoop Data Lake: all stakeholders who could benefit from the information should have access. Certain safeguards and permissions may still need to be applied to certain accounts, but a clear governance plan should be ironed out before the Data Lake is created.
You should also make sure that you have the right security protocols in place to deter outside threats. Depending on how sensitive the information in your Data Lake is, you’ll want to research recommended security tools and protocols. Just don’t rush your Data Lake into production before that research is complete – that is when it is most vulnerable.
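At the HDFS level, one building block for those per-account safeguards is POSIX-style ACLs, which must be enabled in the NameNode configuration before they can be applied. A minimal sketch, assuming an `hdfs-site.xml` you control and a hypothetical `analysts` group (the article does not prescribe a specific tool):

```xml
<!-- hdfs-site.xml: allow fine-grained ACLs beyond basic owner/group/other -->
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
```

With ACLs enabled, an administrator can then grant a group read access to a curated zone without opening it to everyone, e.g. `hdfs dfs -setfacl -m group:analysts:r-x /data/lake/curated` (path and group name are illustrative). Tools such as Apache Ranger or Apache Sentry layer centralized policy management on top of this.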
3. Don’t forget about disaster recovery
It’s easy to forget that Data Lakes can be breached, or that the underlying technology can simply fail. In case of an emergency, you should have a backup plan for your data. Data Lakes can often be set to update in near real-time, but a disaster recovery plan that preserves your most recent data could save your company one day. This may not be your first priority at the start of the project, but it is definitely a piece you should consider.
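A simple form of such a backup plan in Hadoop is a scheduled `distcp` replication of the lake to a second cluster. The sketch below is a hypothetical crontab entry; the cluster hostnames (`prod-nn`, `dr-nn`) and the `/data/lake` path are assumptions for illustration only:

```
# Hypothetical crontab entry: nightly incremental copy of the lake
# to a disaster-recovery cluster. -update copies only changed files;
# -delete removes files on the target that no longer exist on the source.
0 2 * * * hadoop distcp -update -delete hdfs://prod-nn:8020/data/lake hdfs://dr-nn:8020/data/lake
```

HDFS snapshots (`hdfs dfs -createSnapshot`) can complement this by providing point-in-time recovery from accidental deletes on the primary cluster itself.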
4. Have a five-year plan
Creating a Data Lake without a plan for the future is a risky endeavor. It would be wise to take into account how much your company plans to scale and how you may want to set up your environment down the road. It doesn’t necessarily have to be a full “five-year plan,” but thinking ahead during the early stages of the project can only help you later on.