MongoDB’s journey to analytics

To further strengthen our commitment to providing industry-leading coverage of data technology, VentureBeat is excited to welcome Andrew Brust and Tony Baer as regular contributors. Watch for their articles in the Data Pipeline.

About a half dozen years ago, when writing for ZDNet, we posed the question: What does MongoDB want to be when it grows up? Much of the answer has since become apparent. The company made the database more extensible to support the variety of apps that its developers were already writing. MongoDB added native search for content management; time-series data support for internet of things (IoT) use cases; and change streams for use cases such as next-best-action for ecommerce apps.

Oh, and by the way, MongoDB’s customers wanted a cloud experience that matched the ease of use of its developer tooling. The result is Atlas, the managed cloud service that now accounts for 60% of MongoDB’s business.

But there’s a major piece where the surface has barely been scratched: Analytics. It’s part of what MongoDB will be talking about this week at its annual live event.

Let’s rewind the tape. MongoDB was designed from the get-go as an operational database. It’s deployed for use cases like managing online subscriber profiles for delivering optimal gaming or entertainment experiences. It can also be used for capturing automotive telematics to track the state of operation of components; providing ready access to clinical patient data for managing healthcare delivery; or powering ecommerce applications for delivering seamless shopping experiences.

It’s not that MongoDB was focused strictly on writes, as one of its earliest enhancements was the aggregation framework to address multistep “group-by” queries that are considered checkbox requirements for transaction databases.
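For readers who have not seen one, a multistep "group-by" query in the aggregation framework is expressed as an ordered list of stages. The sketch below builds such a pipeline as plain Python data; the stage operators ($match, $group, $sort) are MongoDB's, while the function name and the field names (status, region, total) are hypothetical and chosen only for illustration.

```python
from typing import Any


def build_revenue_by_region_pipeline(min_total: float) -> list[dict[str, Any]]:
    """Return a multistep "group-by" pipeline as plain Python data.

    Field names here are illustrative assumptions, not a real schema.
    """
    return [
        # Step 1: keep only completed orders.
        {"$match": {"status": "completed"}},
        # Step 2: group by region, summing order totals and counting orders.
        {
            "$group": {
                "_id": "$region",
                "revenue": {"$sum": "$total"},
                "orders": {"$sum": 1},
            }
        },
        # Step 3: drop regions whose revenue falls below the floor.
        {"$match": {"revenue": {"$gte": min_total}}},
        # Step 4: list the highest-revenue regions first.
        {"$sort": {"revenue": -1}},
    ]


pipeline = build_revenue_by_region_pipeline(1000.0)
```

With a driver such as PyMongo, a list like this would typically be handed to a collection's aggregate method; the sketch stops short of connecting to a server so that it stays self-contained.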

But MongoDB and, in all fairness, most operational databases have until recently never been known for analytics, because the last thing that you want to do in an operational database is slow it down to process a complex query involving multiple table (or document collection) joins.

Why ask about analytics? 

The common thread behind most operational applications is that they become far more useful when you add analytics features. For instance, analytics could help automakers expedite preventive maintenance, healthcare providers pinpoint the best care regimen, or ecommerce and gaming providers improve how they engage customers and prevent churn. Analytics designed for making quick optimization decisions are logical complements to operational databases.

Pairing analytics and transaction databases is not a new idea, as reflected by the funny names that some analyst firms have added to the conversation, like HTAP, translytical or augmented transaction databases. 

Cloud-native, where compute is separated from storage, provides yet another opportunity to rethink how to piece operational data processing together with analytics without impacting performance or throughput, as shown by recent introductions of Oracle MySQL HeatWave and, more recently, Google’s AlloyDB.

Most of these hybrid databases supplemented row storage with columnar tables designed for analytics and, by the way, they all used the same common relational data structures, making the translation straightforward. Conversely, translating document models, with their hierarchical and nested data structures, has traditionally been more challenging.

So, is now the time for MongoDB to take the analytics plunge? This perhaps depends upon how we define “analytics.” As noted above, applications become far more useful when you add operational analytics that can make transactions smart. If we’re talking about the analytics that can be used for quick decisions, not complex modeling, then the answer is “yes.”

Not an overnight journey

MongoDB has been gradually dipping its toes in the water for supporting analytics. It started with visualization, where MongoDB provides its own charting capability and offers a business intelligence (BI) connector that makes it look like MySQL to the Tableaus and Qliks of the world. While pictures are worth a thousand words, when it comes to analytics, visualizations just scratch the surface. They provide snapshots of trends, but without further correlation (which typically requires more complex queries), they cannot fully answer the question of “why” something is happening.

MongoDB is starting to up its game with analytics, but won’t replace Snowflake, Redshift, Databricks or any of the other usual suspects when it comes to performing highly complex analytics. Nor does it necessarily want to do so. The company’s focus has never been data analysts, but rather application developers. Going back to the first principle of operational databases, you want to avoid tying them down with queries requiring highly complex joins and/or high concurrency. And for MongoDB to succeed, it needs to enable those developers to build better apps.

Atlas has the flexibility to set aside dedicated nodes that could be reserved for analytics. MongoDB is announcing that soon, customers will be able to choose different compute instances on those nodes that would be more appropriate for analytics. The nodes would have in-line data replication, making analytics near real-time.

That’s just a first step; with Atlas available on multiple clouds, it leaves an overly wide choice of instances on the customer’s shoulders. Nonetheless, we believe that down the road, MongoDB will introduce prescriptive guidelines and, after that, some machine learning that could help auto-select the right instance for the workload.
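To make the dedicated analytics nodes concrete: as we understand Atlas's predefined replica set tags, an application can steer its analytical reads to those nodes through connection string options. The sketch below only assembles such a string; the helper name, host name and database name are placeholders, and the nodeType:ANALYTICS tag should be checked against current Atlas documentation before relying on it.

```python
def analytics_uri(host: str, database: str) -> str:
    """Build a connection string that asks for reads from analytics nodes.

    Assumes Atlas exposes analytics nodes under the replica set tag
    nodeType:ANALYTICS; host and database are placeholders.
    """
    options = "readPreference=secondary&readPreferenceTags=nodeType:ANALYTICS"
    return f"mongodb+srv://{host}/{database}?{options}"


uri = analytics_uri("cluster0.example.mongodb.net", "shop")
```

The point of routing by tag rather than by host is that operational traffic keeps using the primary and regular secondaries, while heavier queries land only on the nodes set aside for them.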

Let’s not stop there. Atlas Serverless, announced in preview last year, is going GA this week. So, it would make logical sense to add this option for analytics, where the workloads tend to be different and more spiky than operational transactions.

What about SQL?

The idea of SQL was anathema in MongoDB’s early years. MongoDB will never become a relational database. But could cooler heads be prevailing?

This week, MongoDB is introducing a new Atlas SQL interface for reading Atlas data, a brand-new construct that takes a different tack than the BI connector. Atlas SQL will be MongoDB’s first real attempt to provide a SQL face to its data that will not simply flatten JSON to make it look like MySQL to Tableau, but provide a more granular view that will reflect the richness of the JSON document schema.

As no SQL interface is written overnight, expect that Atlas SQL will also be an evolving story in coming years as it gets enriched with more integrations with SQL tools (beyond visualizations) that are checkbox requirements for data warehouses. We would also like to see support for operations such as upserts, a core capability for analytic platforms, that can insert the equivalent of missing rows in what is surfaced as an analytic table.

Along with the Atlas SQL interface comes the preview of a new column store index that is essential for delivering performance for analytical queries. Again, this is just a start. For instance, MongoDB users would have to manually set up the column store index, specifying the fields. But in the longer run, we could see this being automated through profiling access patterns. And while our imagination is running: enriching the metadata to profile field cardinality, adding capabilities like Bloom filters that would further optimize scanning, and further optimizing the query planner should not be out of the question.

Then there’s Atlas Data Lake, which provided a federated view of JSON documents in cloud object storage. Atlas Data Lake is being refashioned into more of a general-purpose federated query capability that can target multiple Atlas clusters and cloud object stores. This is accompanied by the introduction of a new storage layer for Atlas Data Lake. The new storage layer automatically extracts Atlas cluster datasets into a combination of cloud object storage and an internal technical catalog (this is not Alation) to help speed up analytical queries.

Up with people

MongoDB has long thrived as a developer-favorite database because JavaScript and JSON are home turf to developers, not to mention the reality that JavaScript ranks number seven on the TIOBE index. JavaScript, JSON and the document model are always going to be what MongoDB is about. But MongoDB’s historical shunning of SQL kept it off limits to a very large pool of talent: SQL developers, whose numbers put SQL at number nine on the same index. It’s time to change that.

While MongoDB still believes that the document model is superior to and will replace the relational model (a point that we would debate), a fact that all can agree on is that to extend its footprint across the enterprise, it must embrace the audience it traditionally ignored. And as a win-win, appealing to both camps means that deployments often can be simplified. For some operational use cases, instead of moving and transforming data to a separate data warehouse target, teams could work within the same platform, replacing data extract with data replication.

Not the end of the data warehouse, lake or lakehouse

MongoDB will not replace the need for separate data warehouses, data lakes, or data lakehouses. The complex modeling and discovery that is becoming an essential ingredient for analytics must, of necessity, be performed separately from the operational system. More to the point, the objective for supporting analytics in an operational database is to make the process inline and as close to real time as possible.

And that shows how MongoDB and the Snowflakes or Databricks of the world would work together. The models that identify outliers would be developed in the warehouse, lake, or lakehouse and the result would be a relatively simple (from a processing standpoint) classification, predictive or prescriptive model that could be triggered when a transaction appears to be weird.

Today, pulling off such a closed-loop process in MongoDB is not impossible, but it’s complicated. You would have to cobble together change streams, triggers and functions in MongoDB to provide some sort of closed analytic feedback loop. It’s not a stretch of the imagination to believe that at some point, MongoDB would bury this complexity under the hood for a closed-loop, near-real-time analytics option. That’s just another example of why we characterize MongoDB’s move into analytics as a journey.
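As a rough illustration of what that cobbled-together loop involves, the sketch below scores incoming change events with a stand-in model. The event shape (operationType, fullDocument) follows MongoDB change stream documents; everything else is an assumption for illustration, including the function names, the amount field, and the simple threshold that stands in for a model developed in a warehouse, lake or lakehouse.

```python
from typing import Any, Callable, Iterable


def is_outlier(doc: dict[str, Any], max_amount: float = 5000.0) -> bool:
    """Stand-in for a trained model: flag documents above a fixed amount."""
    return doc.get("amount", 0.0) > max_amount


def flag_outliers(
    events: Iterable[dict[str, Any]],
    score: Callable[[dict[str, Any]], bool] = is_outlier,
) -> list[Any]:
    """Return the _id of every inserted document that the model flags."""
    flagged = []
    for event in events:
        # Only newly inserted transactions are scored in this sketch.
        if event.get("operationType") != "insert":
            continue
        doc = event.get("fullDocument") or {}
        if score(doc):
            flagged.append(doc.get("_id"))
    return flagged


# Hand-written events standing in for a live change stream.
sample_events = [
    {"operationType": "insert", "fullDocument": {"_id": 1, "amount": 120.0}},
    {"operationType": "insert", "fullDocument": {"_id": 2, "amount": 9800.0}},
    {"operationType": "delete", "documentKey": {"_id": 3}},
]
flagged = flag_outliers(sample_events)
```

In a real deployment the events would come from a change stream rather than a list, and the flagged IDs would feed a trigger or function that acts on the transaction. Wiring those pieces together by hand is the complexity we would expect MongoDB to eventually hide.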
