On Hybrid Database Development

At the recent IBM Insight conference in Las Vegas I presented a session entitled:

"My Data is Relational But My Coders Want to Use JSON! Help!"

Over 30 development managers and developers attended. When asked, most said they attended because this is a common problem that they are either dealing with already or expect to have to deal with soon.

Here, as well as I can translate a presentation to text, is what I had to say:

I assert that JSON and other semi-structured data formats present unique challenges for organizations. To take advantage of these new structures and the development paradigms they encourage, and to integrate them into our existing systems, we wrestle with three alternatives:

Keep this new data and the apps that utilize it independent of our legacy data, and keep our legacy apps independent of this new data.
Perform ETL/ELT frequently to maintain all of this data in both relational and semi-structured databases.
Develop applications that can access multiple data sources concurrently.

I maintain that none of these alternatives is acceptable, and I want to propose a better path to future application and database development. First, a digression to define terms, which you can feel free to skip.

There are many classes of data:

Structured data
Unstructured data
Semi-structured data
Time stamped data composed from any data class
Geographically located data composed from any data class
Spatially located data composed from any data class

Structured Data refers, for the most part, to the data we have in relational databases. Despite the predictions of pundits, relational databases are here to stay. There are several well-defined application segments for which relational databases are the best tool for the job.

Semi-Structured Data is data that may have a natural structure to it but where the structure may vary from one data element to another. This can also include elements where part of the information is structured while other sub-elements are unstructured.

Currently the most popular format for semi-structured data is JSON, short for JavaScript Object Notation. JSON is a key:value record or document format that is quickly becoming a new standard for data interchange, replacing CSV, XML, and other interchange formats. JSON is the most common payload format for RESTful web services. It is more space efficient than XML and easier to parse. In addition, it is supported by a binary format, BSON, that is even faster to parse and to extract individual sub-elements from. JSON is a dynamic schema format where each record, or document in JSON terminology, contains the schema details for the document, which can differ from one document to another. This promotes and supports RAD, or Rapid Application Development.
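To make the dynamic schema idea concrete, here is a minimal Python sketch. The two documents and the "customers" collection they supposedly come from are made up for illustration; the point is only that each document names its own fields, so two documents in the same collection need not share a shape.

```python
import json

# Two documents from the same hypothetical "customers" collection.
# Each carries its own field names, so their shapes can differ.
raw_docs = [
    '{"name": "Ada", "email": "ada@example.com"}',
    '{"name": "Grace", "phones": ["555-0100", "555-0101"], "address": {"city": "Arlington"}}',
]

docs = [json.loads(text) for text in raw_docs]

# The "schema" of each document is simply the set of keys it contains.
schemas = [sorted(doc.keys()) for doc in docs]

# Fields present in every document versus fields present in only some.
common_fields = set(docs[0]).intersection(*docs[1:])
optional_fields = set().union(*docs) - common_fields

for doc, schema in zip(docs, schemas):
    print(doc["name"], "->", schema)
print("common:", sorted(common_fields), "optional:", sorted(optional_fields))
```

A relational table would force both records into one column list up front; here the only field the two documents agree on is the name.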

MongoDB has become the most popular of the JSON databases or stores. It saves JSON documents in BSON format and produces JSON output to consuming clients. There are client libraries for most common languages, as well as several sets of interoperable development tools known as "stacks," the most popular of which is called the MEAN stack.

MEAN stands for MongoDB, ExpressJS, AngularJS, and Node.js.

Class time over:

On to the problem. We all have TBs of relational data. Our experienced developers already know SQL and our schema. Some new development still fits with structured data. There is no sense in moving OLTP or structured DW data out of relational systems. There is no financially supportable way to rewrite all of our existing systems from scratch just to take advantage of new paradigms. Many new applications written primarily to use semi-structured data will need to present the relational data we've been collecting for the past 30+ years.

Many of us are capturing TBs of semi-structured data already. New developers that we want to hire only want to work in the NoSQL space. Some new applications are natural fits for semi-structured data. RAD techniques require schema flexibility to succeed. Many new and existing relational based applications could take advantage of all of that semi-structured data to expose new features without a massive redevelopment effort.

How to get relational data into MEAN stack applications and JSON data into C/C++/C# applications is the problem. Traditionally when faced with data in multiple silos we would either use ETL techniques to copy the data from one silo to another or we would write data access layers using multiple libraries.

ETL is time consuming and error prone. We may have to sacrifice precision or internal relationships in the target system. Maintaining a timely, consistent view is a major, constant effort and expense. Data duplication is itself an expense. Fast storage is not cheap!

We can't just move everything into MongoDB either. MongoDB, like most NoSQL stores, does not support multi-operation transactions, inter-object relationships (references and sub-documents are not relationships), or XA transactions (so it cannot participate in transactions across multiple silos).
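For readers who have not had to think about it lately, this is what a multi-operation transaction buys you. The sketch below uses Python's built-in sqlite3 module purely as a stand-in relational engine; the accounts table, the transfer function, and the amounts are all hypothetical. Two updates either both commit or both roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint if src is overdrawn.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

first_ok = transfer(conn, 1, 2, 30)    # succeeds
second_ok = transfer(conn, 1, 2, 500)  # debit fails, so the credit is undone too
final_balances = balances(conn)
print(first_ok, second_ok, final_balances)
```

In the failed transfer the credit to account 2 had already been applied when the debit was rejected. The rollback removes it, which is exactly the guarantee a store limited to single-document atomicity cannot give across two documents.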

The solution:

The solution is Hybrid Application Development on a Hybrid Database system, using both traditional and RAD development tools. Here is what is needed from a hybrid database for this to work:

Access from traditional development stacks (C, C++, C#, Perl, etc.)
Access from MEAN and other RAD development stacks
Full ACID compliant transaction support for all data
Full relational integrity and data normalization support for all data
Ability to store structured data with a predefined schema
Ability to store semi-structured data with a dynamic schema
Ability to present both types of data to both stacks in the native format of the data

JSON as JSON
Tables as tables

Ability to present both types of data to each stack in that stack's native format

JSON as table data for traditional tools
Table data as JSON for RAD tools
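The last requirement, presenting each kind of data in the other stack's native format, can be sketched in a few lines of Python. This is only an illustration of the two mappings, not how any database implements them; the function names, column lists, and sample data are assumptions of mine.

```python
import json

def documents_to_rows(docs, columns):
    """Project documents onto a fixed column list; missing fields become None."""
    return [tuple(doc.get(col) for col in columns) for doc in docs]

def rows_to_documents(columns, rows):
    """Turn each table row into a JSON document keyed by column name."""
    return [json.dumps(dict(zip(columns, row)), sort_keys=True) for row in rows]

# JSON as table data for traditional tools
docs = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "city": "Arlington", "phones": ["555-0100"]},
]
rows = documents_to_rows(docs, ["name", "email", "city"])

# Table data as JSON for RAD tools
table_rows = [(1, "Ada"), (2, "Grace")]
json_docs = rows_to_documents(["id", "name"], table_rows)

print(rows)
print(json_docs)
```

Note what the projection costs: a field missing from a document shows up as a NULL, and a field not in the column list (phones here) is simply not visible to the SQL side unless someone asks for it.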

Some NoSQL stores can handle ACID consistency, but most can only manage "eventually consistent" transactions. Most, like MongoDB, can only guarantee consistent transactions and rollbacks for a single document, not for multiple documents in a collection nor for transactions that span multiple collections. Some NoSQL stores can support SQL or an SQL-like query language; however, they do not return data as rows and columns, which is what SQL database access code expects.

Some RDBMS systems can:

Store JSON (or a binary form of it)
Store a collection as a collection (rather than as a column type)
Manipulate JSON fields within a document
Create indexes on JSON fields within the documents in a collection
Support referential integrity between JSON document fields and relational columns
Join JSON documents to relational tables

But most cannot do it all. I believe that these capabilities are the future of database and application development. Today there is one RDBMS product that CAN do it all:

Accept connections from MEAN stack and other MongoDB clients without modifying the application code
Accept connections from relational SQL clients
Support transactions on relational tables and JSON collections that include multiple tables/collections and multiple rows/documents
Allow joins between JSON collections and other JSON collections or relational tables
Enhance MEAN applications to support issuing SQL statements and return data in JSON format for those clients
Permit SQL clients to treat fields in JSON collection documents as ordinary columns
Support JSON and BSON as first-class native data types for use as columns in relational tables or as a type to define a collection
Direct RESTful interface to expose all data as web services without middleware
Store JSON collections as collections
Permit any MongoDB-aware client to treat relational tables as collections
Support most MongoDB DDL and server management commands (including sharding)
Auto create databases and collections just as MongoDB does, on the fly
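To show what joining a JSON collection to a relational table looks like from the SQL side, here is a sketch that again uses Python's sqlite3 module as a stand-in. It is not Informix syntax. The json_field helper is a hypothetical user-defined function I register myself, and the customers and events tables are invented; the idea is only that a document field can appear in a SELECT list and a join condition like an ordinary column.

```python
import json
import sqlite3

def json_field(document, key):
    """Return one top-level field of a JSON document, or None if absent."""
    value = json.loads(document).get(key)
    # SQLite stores only scalars, so nested values are re-serialized as text.
    return json.dumps(value) if isinstance(value, (dict, list)) else value

conn = sqlite3.connect(":memory:")
conn.create_function("json_field", 2, json_field)  # hypothetical SQL helper

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE events (doc TEXT)")  # stands in for a collection
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO events VALUES (?)",
    [(json.dumps(d),) for d in [
        {"customer_id": 1, "type": "login"},
        {"customer_id": 2, "type": "purchase", "total": 19.99},
        {"customer_id": 1, "type": "logout", "device": {"os": "linux"}},
    ]],
)

# Document fields are used as columns and as the join key.
rows = conn.execute("""
    SELECT c.name,
           json_field(e.doc, 'type')  AS type,
           json_field(e.doc, 'total') AS total
    FROM events AS e
    JOIN customers AS c ON c.id = json_field(e.doc, 'customer_id')
    ORDER BY e.rowid
""").fetchall()

for row in rows:
    print(row)
```

A hybrid server does this natively and can index the document fields; the stand-in above parses every document on every call, so treat it as a picture of the query shape rather than of the performance.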

I know I'm sounding like a salesman, but I have nothing to sell you other than an idea and my help to achieve it in your organizations, so, bear with me. What is this magical database system? It is the latest incarnation of one of the first RDBMS products in the marketplace:

Informix Dynamic Server v12.10.xC4 from IBM

Informix, aka IDS, has all of the required features I mention above but it brings much more to the table:

Larger documents than MongoDB
Larger databases than MongoDB
Industry leading data replication technology
Four classes of data replication that can all work together
More capable, time-proven sharding technology than MongoDB
Data distribution and centralization capability
Support for Timeseries data
Support for Geospatial data
The only database system that can combine Timeseries and Geospatial data to track your IoT data through SpaceTime
User defined data types that perform as well as native types
World class OLTP transaction rates
Traditional and MEAN stack applications can connect and use both JSON and SQL data in their own native formats
Centralized databases up to 128PB
Distributed databases without size limits
Server fail over and load balancing fully configurable using SLA specifications
Five 9's reliability
Near Zero downtime
Upgrade server versions without downtime
Instantly bring up additional servers to load balance during peak periods
Support large numbers of concurrent transactions
Row level locking (MongoDB locks entire collections only)
Extremely tunable engine
Autonomic features for low maintenance overhead and to permit unmonitored operations
Highly embeddable
Optional compression of data and index keys
Advanced query optimizer
Multiple tenancy
Informix Warehouse Accelerator returns data up to 1200X faster than the base server for complex queries over huge data sets
Timeseries in JSON
GeoJSON support in addition to native Geospatial support
Lucene text search on JSON documents and relational tables
High speed data loading technology for faster loads of streaming data
MQTT integration to link to IoT devices

Wrap up:

The future of software development will require hybrid applications!

The future of software development will require hybrid databases!

The future is here now!

Why wait?

Wednesday, November 19, 2014