IIUG Conference 2018

Looking forward to meeting in Washington DC next October!

Wednesday, November 19, 2014

On Hybrid Database Development

At the recent IBM Insight conference in Las Vegas I presented a session entitled: 

"My Data is Relational But My Coders Want to Use JSON!  Help!"

Over 30 development managers and developers attended.  When asked most said they attended because this was a common problem that they are all either dealing with already or expect to have to deal with soon.  

Here, as well as I can translate a presentation to text, is what I had to say:

I assert that JSON and other semi-structured data formats present unique challenges for organizations. In order to take advantage of these new structures and the development paradigms that they support and encourage and integrate them into our existing systems, we wrestle with three alternatives:

  1. Keep these new data and the apps that utilize them independent of our legacy data and keep our legacy apps independent of this new data
  2. Perform ETL/ELT frequently to maintain all of this data in both relational and semi-structured databases.
  3. Develop applications that can access multiple data sources concurrently.
I maintain that none of these alternatives is acceptable and I want to propose a better path to future application and database development. First a digression to define terms which you can feel free to skip.

There are many classes of data:

  • Structure data
  • Unstructured data
  • Semi-structured data
  • Time stamped data composed from any data class
  • Geographically located data composed from any data class
  • Spatially located data composed from any data class
Structured Data refers to the data we have in relational databases for the most part. Despite the predictions of pundits relational databases are here to stay.  There are several well defined application segments for which relational databases are the best tool for the job.

Semi-Structured Data is data that may have a natural structure to it but where the structure may vary from one data element to another.  This can also include elements where part of the information is structured while other sub-elements are unstructured.

Currently the most popular format for semi-structured data is known as JSON for JavaScript Object Notation.  JSON is a key:value record or document format that is quickly becoming a new standard for data interchange replacing CSV, XML, and other interchange formats.  JSON is the basis for the RESTFUL web service protocol.  It is more space efficient than XML and easier to parse.  In addition, it is supported by a binary format, BSON, that is even faster to parse and extract individual sub-elements.  JSON is a dynamic schema format where each record, or document in JSON terminology, contains the schema details for the document which can be different from one document to another.  This promotes and supports RAD or Rapid Application Development.

MongoDB has become the most popular of the JSON databases or stores.  It saves JSON documents in BSON format and produces JSON output to consuming clients.  There are several sets of interoperable development tool sets known as "stacks" the most popular of which is call the MEAN stack as well as client libraries for most common languages.

MEAN stands for MongoDB, ExpressJS, AngularJS, and Node.js.

Class time over:


On to the problem.  We all have TBs of relational data.  Our experienced developers already know SQL and our schema. Some new development still fits with structured data. There is no sense in moving OLTP or structured DW data out of relational systems.  There is no financially supportable way to rewrite all of our existing systems from scratch just to take advantage of new paradigms.  Many new applications written primarily to use semi-structured data will need to present the relational data we've been collecting for the past 30+ years.

Many of us are capturing TBs of semi-structured data already. New developers that we want to hire only want to work in the NoSQL space.  Some new applications are natural fits for semi-structured data. RAD techniques require schema flexibility to succeed.  Many new and existing relational based applications could take advantage of all of that semi-structured data to expose new features without a massive redevelopment effort.

How to get relational data into MEAN stack applications and JSON data into C/C++/C# applications is the problem.  Traditionally when faced with data in multiple silos we would either use ETL techniques to copy the data from one silo to another or we would write data access layers using multiple libraries.  

ETL is time consuming and error prone.  We may have to sacrifice precision or internal relationships in the target system. Maintaining a timely, consistent, view is a major constant effort and expense.  Data duplication is itself an expense.  Fast storage is not cheap!

We can't just move everything into MongoDB either.  MongoDB like most NoSQL stores do not support multiple operation transactions, inter-object relationships (references and sub-documents are not relationships), XA transactions (so they cannot participate in transactions across multiple silos).  

The solution:


The solution is Hybrid Application Development using a Hybrid Database system using both traditional and RAD development tools.  Here is what is needed from a hybrid database for this to work:
  • Access from traditional development stacks (C, C++, C#, Perl, etc.)
  • Access from MEAN and other RAD development stacks
  • Full ACID compliant transaction support for all data
  • Full relational integrity and data normalization support for all data 
  • Ability to store structure data with a predefined schema
  • Ability to store semi-structured data with a dynamic schema
  • Ability to present both types of data to both stacks in the native format of the data
    • JSON as JSON
    • Tables as tables
  • Ability to present both types of data to each stack in its native format
    • JSON as table data for traditional tools
    • Table data as JSON for RAD tools
Some NoSQL stores can handle ACID consistency but most can only manage "eventually consistent" transactions.  Most, like MongoDB, can only guarantee consistent transactions and rollbacks for a single document, not for multiple documents in a collection nor for transactions that span multiple collections. Some NoSQL stores can support SQL or an SQL-like query language, however, they do not return data as rows and columns which is what is expected by SQL database access code. 

Some RDBMS systems can:
  • Store JSON (or a binary form of it)
  • Store a collection as a collection (rather than as a column type)
  • Manipulate JSON fields within a document
  • Create indexes on JSON fields within the documents in a collection
  • Support referential integrity between JSON document fields and relational columns
  • Join JSON documents to relational tables
But most cannot do it all.  I believe that these capabilities are the future of database and application development. Today there is one RDBMS product that CAN do it all:
  • Accept connections from MEAN stack and other MongoDB clients without modifying the application code
  • Accept connections from relational SQL clients
  • Support transactions on relation tables and JSON collections that include multiple tables/collections and multiple rows/documents
  • Allow joins between JSON collections and other JSON collections or relational tables
  • Enhance MEAN applications to support issuing SQL statements and return data in JSON format for those clients
  • Permit SQL clients to treat fields in JSON collection documents as ordinary columns
  • Support JSON and BSON as class 1 native data types for use as columns in relational tables or as a type to define a collection
  • Direct RESTFUL interface to expose all data as web services without middleware
  • Store JSON collections as collections
  • Permit any MongoDB aware client to treat relational tables collections
  • Support most MongoDB DDL and server management commands (including sharding)
  • Auto create databases and collections just as MongoDB does, on the fly
I know I'm sounding like a salesman, but I have nothing to sell you other than an idea and my help to achieve it in your organizations, so, bear with me.  What is this magical database system?  It is the latest incarnation of one of the first RDBMS products in the marketplace:

Informix Dynamic Server v12.10.xC4 from IBM


Informix, aka IDS, has all of the required features I mention above but it brings much more to the table:
  • Larger documents than MongoDB
  • Larger databases than MongoDB
  • Industry leading data replication technology
  • Four classes of data replication that can all work together
  • More capable time proven sharding technology than MongoDB
  • Data distribution and centralization capability
  • Support for Timeseries data
  • Support for Geospatial data
  • The only database system that can combine Timeseries and Geospatial data to track your IoT data through SpaceTime
  • User defined data types that perform as well as native types
  • World class OLTP transaction rates
  • Traditional and MEAN stack applications can connect and use both JSON and SQL data in their own native formats
  • Centralized databases up to 128PB
  • Distributed databases without size limits
  • Server fail over and load balancing fully configurable using SLA specifications
  • Five 9's reliability
  • Near Zero downtime
  • Upgrade server versions without downtime
  • Instantly bring up additional servers to load balance during peak periods
  • Support large numbers of concurrent transactions
  • Row level locking (MongoDB locks entire collections only)
  • Extremely tunable engine
  • Autonomic features for low maintenance overhead and to permit unmonitored operations
  • Highly embeddable
  • Optional compression of data and index keys
  • Advanced query optimizer
  • Multiple tenancy
  • Informix Warehouse Accelerator returns data up to 1200X faster than the base server for complex queries over huge data sets
  • Timeseries in JSON
  • GeoJSON support in addition to native GeoSpacial support
  • Lucene text search on JSON documents and relational tables
  • High speed data loading technology for faster loads of streaming data
  • MQTT integration to link to IoT devices

Wrap up:


The future of software development will require hybrid applications!
The future of software development will require hybrid databases!

The future is here now!
Why wait?


1 comment: