In this special guest feature, Ron Bennatan of jSonar provides four practical lessons on how companies can get past the “Hadoop hangover” by using NoSQL solutions. Ron Bennatan is a co-founder at jSonar Inc., developers of powerful JSON-based Big Data analytic tools and data warehouses. He has been a “database guy” for 25 years and has worked at companies such as J.P. Morgan, Merrill Lynch, Intel, IBM and AT&T Bell Labs. He was co-founder and CTO at Guardium which was acquired by IBM where he later served as a Distinguished Engineer and the CTO for Big Data Governance. He is now focused on NoSQL Big Data analytics. He has a Ph.D. in Computer Science and has authored 11 technical books.
Big Data is no longer a stranger in the IT landscape. Most organizations have embarked on the Big Data path and are building data lakes, new forms of the Enterprise Data Warehouse, analytics as the cornerstone to everything, and more. But many of them still struggle to reap the benefits and some are stuck in the “collection phase”. Landing the data is always the first phase, and that tends to be successful; it’s the next phase, the analytics phase, that is hard. Some call this the “Hadoop Hangover”. Some never go past the ETL phase, using the Data Lake as no more than an ETL area and loading the data back into conventional data stores. Some make it and transform their business, and some give up.
When these initiatives stall, the reasons are usually complexity and skill set. Meanwhile, on the other “side” of the data management arena, the NoSQL world has perfected precisely what these initiatives lack – simple data management and an easy path to productivity for newcomers. This article explores some of the attributes that make NoSQL so useful and how lessons learned from that world can make Big Data analytics initiatives work better.
Lesson 1: Simplicity is Paramount
The first lesson Big Data can learn from the NoSQL world (and from other modern software domains like mobile, social and more) is that simplicity and ease of use are key – they are not nice-to-haves and do not take a back seat to anything else. The NoSQL world views developers as the “masters” – and the technology needs to fit the way these masters will use it. Perhaps the main reason NoSQL has been so successful is its appeal to developers, who find it easy to use and feel an order of magnitude more productive than in other environments. The same is true for ops. The result is something that makes everyone more productive – developers, ops people and more.
Some Big Data technologies are so complex that they require either armies of consultants or very highly skilled internal staff. These tools are complex enough that the team spends 80% of its time on the technology itself (making it work and keeping it running) and only 20% on the business functionality required. The NoSQL stacks, on the other hand, are small, simple to learn, simple to use and intuitive. Instead of adding more stacks, technologies and options, perhaps Big Data needs to finally start simplifying the stack by adopting a clean NoSQL-like API (or at least making it an option).
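To make the contrast concrete, here is a minimal sketch of what such a clean, NoSQL-like API feels like to a developer, using MongoDB’s Python driver (pymongo). The connection string, database and collection names, and fields are hypothetical and not drawn from any specific product mentioned in this article.

```python
# A minimal sketch of a clean NoSQL-style developer API (pymongo).
# All names here ("analytics", "orders", field names) are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["analytics"]["orders"]

# No DDL, no table definitions: a query and an aggregation pipeline
# are expressed directly against the documents.
large_orders = orders.find({"total": {"$gt": 100}}).limit(10)

revenue_by_region = orders.aggregate([
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
    {"$sort": {"revenue": -1}},
])

for row in revenue_by_region:
    print(row["_id"], row["revenue"])
```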
Lesson 2: Flexible Data is a Must
Another of these NoSQL characteristics is that NoSQL systems allow the team to adapt easily to changing requirements and data models. Flexible data in NoSQL not only increases productivity and reduces costs – it is a must-have. MetLife’s use of MongoDB is an example of how this flexibility pays off. MetLife managed to integrate data from siloed back-office systems in record time. Its key need was to bring data from many different products into one place that could be accessed by both customers and representatives. This system is known as “The Wall”. It is similar to the “Customer 360” projects that many organizations undertake – but with a NoSQL approach based upon JSON, MetLife completed the implementation in three months.
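As a hedged illustration (not MetLife’s actual system), the sketch below shows how a JSON document store can absorb records from siloed products into a single “Customer 360”-style collection without reconciling their schemas first; the collection and field names are invented for the example.

```python
# Hypothetical "Customer 360" sketch: documents from different back-office
# products carry different fields, yet land in one collection as-is.
from pymongo import MongoClient

customers = MongoClient("mongodb://localhost:27017")["wall"]["customers"]

customers.insert_many([
    {"customer_id": 42, "product": "life",
     "policy": {"face_value": 250000, "beneficiaries": ["J. Doe"]}},
    {"customer_id": 42, "product": "auto",
     "vehicles": [{"vin": "1HGCM82633A004352", "coverage": "full"}]},
])

# A customer or representative sees everything with one query.
for doc in customers.find({"customer_id": 42}):
    print(doc["product"], doc)
```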
While many of the file-system-based layers in Big Data preserve “schema-on-read”, the database-like stacks on Big Data often do not – they go back to a rigid schema. Lesson two is that flexible data is crucial. After all, isn’t one of the “V”s variety? Isn’t part of the other “V”, velocity, also about the rate at which the data and the analytics change?
Lesson 3: JSON is King
The industry has a clear winner for a modern data format – JavaScript Object Notation (JSON). JSON has become ubiquitous and is considered the lingua franca of the Web, mobile applications, social media and IoT. It is simple but not simplistic. It is flexible and yet has enough self-describing structure to make it effective. It is the fastest-growing data format on earth – by a lot. It is also the foundational data structure for almost all NoSQL technology.
But what some have yet to understand is that JSON is also an excellent data format for today’s demanding Big Data analytic applications and environments. In the schema-on-read vs. schema-on-write discussion, JSON is the perfect middle ground.
On the one hand, JSON has structure. Anyone who looks at JSON sees it – it is a very intuitive organization of data, good for both humans and machines. Data stored in JSON is not structure-less, and the advantages of schema-on-write systems can be had with JSON. This is even more so in columnar JSON, such as that used by jSonar in its NoSQL Big Data Warehouse, SonarW. For example, in a row-based JSON representation you might have to traverse all documents in order to report on the schema (or sample the documents to get an approximation). But in columnar JSON the schema is implicit and available – because the columnar organization is the schema itself. As an example, organizations concerned with regulatory requirements like the fact that the schema is defined and can even be enforced.
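A conceptual sketch of that point follows: it shreds documents into per-path columns in plain Python, so the set of column names is the schema. This is only an illustration of the idea, not a description of SonarW’s actual storage format.

```python
# Conceptual illustration: shredding JSON documents into per-field columns
# makes the schema implicit -- the column names *are* the schema.
from collections import defaultdict

docs = [
    {"name": "Alice", "address": {"city": "Boston", "zip": "02110"}},
    {"name": "Bob", "age": 31},
]

def flatten(doc, prefix=""):
    """Yield (dotted_path, value) for every leaf field in a document."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value

columns = defaultdict(dict)              # column path -> {row index: value}
for row, doc in enumerate(docs):
    for path, value in flatten(doc):
        columns[path][row] = value

# No need to traverse the raw documents to report the schema.
print(sorted(columns))  # ['address.city', 'address.zip', 'age', 'name']
```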
On the other hand, JSON is flexible and hence has all the benefits of schema-on-read. Each document can have a different structure and the schema can evolve freely when needed. This yields flexibility and very rapid development cycles. It does not require lengthy modeling cycles that force structures that will not withstand the test of time. It promotes exploration and discovery. It supports the diverse structures of modern applications. And most important – it scales; you never have to do an ALTER TABLE when you use JSON; try to do an ALTER TABLE when you have 2PB or even 50TB of data and see how long that takes.
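A tiny sketch of that kind of evolution, with invented field names: newer documents simply carry an extra field, older ones are untouched, and queries cope at read time.

```python
# Schema evolution without ALTER TABLE: old and new documents coexist,
# and analysis treats the missing field as absent at query time.
old_docs = [{"user": "alice", "plan": "basic"}]
new_docs = [{"user": "bob", "plan": "pro", "referral_code": "SPRING24"}]

all_docs = old_docs + new_docs           # no migration of existing data

referrals = [d.get("referral_code", "none") for d in all_docs]
print(referrals)                          # ['none', 'SPRING24']
```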
If JSON is the de facto data format for modern data, it is only natural to have a new warehouse that was purpose-built for JSON (e.g. SonarW). The alternative is to try to fit this data into warehousing technologies built decades ago – which is like fitting a square peg into a round hole. So lesson three is to use JSON – it makes things simple, convenient, functional and efficient.
Lesson 4: The Times They Are a-Changin’ – Revisit Old Assumptions
As technology changes, revisit assumptions and the status quo. NoSQL did that when it re-examined the nature of distributed systems, transactions and more. Big Data can do the same.
In the past, schemas were defined first and everything came later. But today’s data structures are often declared at query time rather than at load time. Data is first loaded and only interpreted as analysis occurs – the structure cannot be determined up-front and cannot be fixed. The velocity at which data changes is too great, and the size of the data is too large, to rely on systems that cannot support schema evolution. Things we have all taken for granted are no longer true. For example, we all learned that fixed schemas let query execution systems run faster. But when the data cannot be fixed, it is loaded into relational warehouses as BLOBs accessed through UDFs, and analytical processing speeds drop to a crawl. So as flexible data becomes the norm, a strict schema approach slows things down rather than speeding them up.

Similarly, requirements are no longer fixed or slowly changing; they change quickly, and new requirements are the norm rather than the exception. Just as development methodologies went from waterfall-oriented to agile-oriented, data methodologies are also evolving – introducing “schema-on-read” approaches. Data sources come online quickly, and the need to land the data and use it for correlation and analysis becomes ever more common. A system that needs to normalize data or convert it to a new format every time is naïve and can no longer be used at scale. Try to imagine an Internet of Things (IoT) application that required all sensors to have the same fields forever, or a cyber-security warehouse that brings in events from an ever-growing set of sources yet requires all the sources to have identical data. This is a complex problem for traditional database approaches and requires a new, modern data technology.
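A minimal sketch of “structure declared at query time”, assuming events land as JSON Lines from sensors with different shapes (all names are hypothetical):

```python
# Heterogeneous events are landed as-is; each analysis picks the fields
# it needs at query time, so new sources never force a schema change.
import io
import json
from collections import Counter

landed = io.StringIO(
    '{"sensor": "thermostat", "temp_c": 21.5, "room": "lab"}\n'
    '{"sensor": "door", "open": true, "badge_id": "A-17"}\n'
    '{"sensor": "thermostat", "temp_c": 19.0, "room": "lobby"}\n'
)

events = [json.loads(line) for line in landed]

# Query 1: events per source, regardless of shape.
print(Counter(e["sensor"] for e in events))

# Query 2: average temperature, using only events that carry that field.
temps = [e["temp_c"] for e in events if "temp_c" in e]
print(sum(temps) / len(temps))
```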
Since new technologies are coming onto Big Data platforms faster than ever, the capabilities are also changing. For example, in the past the assumption was that flexible data does not come hand-in-hand with performant queries. But that is no longer true – for example, the concept of compressed columnar JSON, originally pioneered by jSonar, allows these two attributes to co-exist. So lesson four is that it is OK to question decade-old assumptions – sometimes it makes a world of difference.
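As a toy illustration of why a columnar layout pairs well with compression (this is not jSonar’s actual encoding), consider a repetitive JSON field stored as its own column and collapsed with dictionary plus run-length encoding:

```python
# Toy example: a repetitive field, stored column-wise, compresses trivially
# with dictionary + run-length encoding while remaining easy to scan.
from itertools import groupby

status_column = ["shipped", "shipped", "shipped", "pending", "shipped", "shipped"]

dictionary = {v: i for i, v in enumerate(dict.fromkeys(status_column))}
encoded = [dictionary[v] for v in status_column]          # [0, 0, 0, 1, 0, 0]
runs = [(code, len(list(g))) for code, g in groupby(encoded)]

print(dictionary)   # {'shipped': 0, 'pending': 1}
print(runs)         # [(0, 3), (1, 1), (0, 2)]
```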
Summary
The two main revolutions that have been transforming data management for the past 5-10 years have been Big Data and NoSQL. They were born with different workloads in mind, different goals and different audiences. And they both evolved differently. But neither can ignore the other and each has much to learn from the other. Specifically, the Big Data world can learn much from the simplicity, convenience and productivity of NoSQL – and if it does, it will become much more successful – especially in the trenches.