IIUG World 2019

IIUG 2020 Online World Conference Presentations will begin soon!

Wednesday, November 19, 2014

On Hybrid Database Development

At the recent IBM Insight conference in Las Vegas I presented a session entitled: 

"My Data is Relational But My Coders Want to Use JSON!  Help!"

Over 30 development managers and developers attended.  When asked most said they attended because this was a common problem that they are all either dealing with already or expect to have to deal with soon.  

Here, as well as I can translate a presentation to text, is what I had to say:

I assert that JSON and other semi-structured data formats present unique challenges for organizations. In order to take advantage of these new structures and the development paradigms that they support and encourage and integrate them into our existing systems, we wrestle with three alternatives:

  1. Keep these new data and the apps that utilize them independent of our legacy data and keep our legacy apps independent of this new data
  2. Perform ETL/ELT frequently to maintain all of this data in both relational and semi-structured databases.
  3. Develop applications that can access multiple data sources concurrently.
I maintain that none of these alternatives is acceptable and I want to propose a better path to future application and database development. First a digression to define terms which you can feel free to skip.

There are many classes of data:

  • Structure data
  • Unstructured data
  • Semi-structured data
  • Time stamped data composed from any data class
  • Geographically located data composed from any data class
  • Spatially located data composed from any data class
Structured Data refers to the data we have in relational databases for the most part. Despite the predictions of pundits relational databases are here to stay.  There are several well defined application segments for which relational databases are the best tool for the job.

Semi-Structured Data is data that may have a natural structure to it but where the structure may vary from one data element to another.  This can also include elements where part of the information is structured while other sub-elements are unstructured.

Currently the most popular format for semi-structured data is known as JSON for JavaScript Object Notation.  JSON is a key:value record or document format that is quickly becoming a new standard for data interchange replacing CSV, XML, and other interchange formats.  JSON is the basis for the RESTFUL web service protocol.  It is more space efficient than XML and easier to parse.  In addition, it is supported by a binary format, BSON, that is even faster to parse and extract individual sub-elements.  JSON is a dynamic schema format where each record, or document in JSON terminology, contains the schema details for the document which can be different from one document to another.  This promotes and supports RAD or Rapid Application Development.

MongoDB has become the most popular of the JSON databases or stores.  It saves JSON documents in BSON format and produces JSON output to consuming clients.  There are several sets of interoperable development tool sets known as "stacks" the most popular of which is call the MEAN stack as well as client libraries for most common languages.

MEAN stands for MongoDB, ExpressJS, AngularJS, and Node.js.

Class time over:

On to the problem.  We all have TBs of relational data.  Our experienced developers already know SQL and our schema. Some new development still fits with structured data. There is no sense in moving OLTP or structured DW data out of relational systems.  There is no financially supportable way to rewrite all of our existing systems from scratch just to take advantage of new paradigms.  Many new applications written primarily to use semi-structured data will need to present the relational data we've been collecting for the past 30+ years.

Many of us are capturing TBs of semi-structured data already. New developers that we want to hire only want to work in the NoSQL space.  Some new applications are natural fits for semi-structured data. RAD techniques require schema flexibility to succeed.  Many new and existing relational based applications could take advantage of all of that semi-structured data to expose new features without a massive redevelopment effort.

How to get relational data into MEAN stack applications and JSON data into C/C++/C# applications is the problem.  Traditionally when faced with data in multiple silos we would either use ETL techniques to copy the data from one silo to another or we would write data access layers using multiple libraries.  

ETL is time consuming and error prone.  We may have to sacrifice precision or internal relationships in the target system. Maintaining a timely, consistent, view is a major constant effort and expense.  Data duplication is itself an expense.  Fast storage is not cheap!

We can't just move everything into MongoDB either.  MongoDB like most NoSQL stores do not support multiple operation transactions, inter-object relationships (references and sub-documents are not relationships), XA transactions (so they cannot participate in transactions across multiple silos).  

The solution:

The solution is Hybrid Application Development using a Hybrid Database system using both traditional and RAD development tools.  Here is what is needed from a hybrid database for this to work:
  • Access from traditional development stacks (C, C++, C#, Perl, etc.)
  • Access from MEAN and other RAD development stacks
  • Full ACID compliant transaction support for all data
  • Full relational integrity and data normalization support for all data 
  • Ability to store structure data with a predefined schema
  • Ability to store semi-structured data with a dynamic schema
  • Ability to present both types of data to both stacks in the native format of the data
    • JSON as JSON
    • Tables as tables
  • Ability to present both types of data to each stack in its native format
    • JSON as table data for traditional tools
    • Table data as JSON for RAD tools
Some NoSQL stores can handle ACID consistency but most can only manage "eventually consistent" transactions.  Most, like MongoDB, can only guarantee consistent transactions and rollbacks for a single document, not for multiple documents in a collection nor for transactions that span multiple collections. Some NoSQL stores can support SQL or an SQL-like query language, however, they do not return data as rows and columns which is what is expected by SQL database access code. 

Some RDBMS systems can:
  • Store JSON (or a binary form of it)
  • Store a collection as a collection (rather than as a column type)
  • Manipulate JSON fields within a document
  • Create indexes on JSON fields within the documents in a collection
  • Support referential integrity between JSON document fields and relational columns
  • Join JSON documents to relational tables
But most cannot do it all.  I believe that these capabilities are the future of database and application development. Today there is one RDBMS product that CAN do it all:
  • Accept connections from MEAN stack and other MongoDB clients without modifying the application code
  • Accept connections from relational SQL clients
  • Support transactions on relation tables and JSON collections that include multiple tables/collections and multiple rows/documents
  • Allow joins between JSON collections and other JSON collections or relational tables
  • Enhance MEAN applications to support issuing SQL statements and return data in JSON format for those clients
  • Permit SQL clients to treat fields in JSON collection documents as ordinary columns
  • Support JSON and BSON as class 1 native data types for use as columns in relational tables or as a type to define a collection
  • Direct RESTFUL interface to expose all data as web services without middleware
  • Store JSON collections as collections
  • Permit any MongoDB aware client to treat relational tables collections
  • Support most MongoDB DDL and server management commands (including sharding)
  • Auto create databases and collections just as MongoDB does, on the fly
I know I'm sounding like a salesman, but I have nothing to sell you other than an idea and my help to achieve it in your organizations, so, bear with me.  What is this magical database system?  It is the latest incarnation of one of the first RDBMS products in the marketplace:

Informix Dynamic Server v12.10.xC4 from IBM

Informix, aka IDS, has all of the required features I mention above but it brings much more to the table:
  • Larger documents than MongoDB
  • Larger databases than MongoDB
  • Industry leading data replication technology
  • Four classes of data replication that can all work together
  • More capable time proven sharding technology than MongoDB
  • Data distribution and centralization capability
  • Support for Timeseries data
  • Support for Geospatial data
  • The only database system that can combine Timeseries and Geospatial data to track your IoT data through SpaceTime
  • User defined data types that perform as well as native types
  • World class OLTP transaction rates
  • Traditional and MEAN stack applications can connect and use both JSON and SQL data in their own native formats
  • Centralized databases up to 128PB
  • Distributed databases without size limits
  • Server fail over and load balancing fully configurable using SLA specifications
  • Five 9's reliability
  • Near Zero downtime
  • Upgrade server versions without downtime
  • Instantly bring up additional servers to load balance during peak periods
  • Support large numbers of concurrent transactions
  • Row level locking (MongoDB locks entire collections only)
  • Extremely tunable engine
  • Autonomic features for low maintenance overhead and to permit unmonitored operations
  • Highly embeddable
  • Optional compression of data and index keys
  • Advanced query optimizer
  • Multiple tenancy
  • Informix Warehouse Accelerator returns data up to 1200X faster than the base server for complex queries over huge data sets
  • Timeseries in JSON
  • GeoJSON support in addition to native GeoSpacial support
  • Lucene text search on JSON documents and relational tables
  • High speed data loading technology for faster loads of streaming data
  • MQTT integration to link to IoT devices

Wrap up:

The future of software development will require hybrid applications!
The future of software development will require hybrid databases!

The future is here now!
Why wait?

Monday, November 3, 2014

China is standardizing all new database applications on Informix!

Here's the "world" shaking news I promised. IBM and China have come to an agreement whereby IBM will share the code for Informix with a Chinese software house, GBASE, who will modify its security code to conform to Chinese government standards. In turn China will be building all future database projects using Informix.

The agreement allows for sharing innovations to the Informix code made by either party to be shared by both. If anyone wanted proof that Informix is here to stay and is NEVER going away, this is it. China is the world's second largest market and probably the fastest growing technology market. The government of China does not want any software or hardware used for government projects or enterprises that it does not control, fearing Western spying through the systems it uses. So, they are standardizing on Informix. Cool!

IBM Press release dated: October 29, 2014

  GBASE and IBM to collaborate on locally innovated database in China 

IBM (NYSE: IBM) and General Data Technology Co.,Ltd, known as GBASE announces an agreement today, to create a locally innovated database product for the China marketplace. In support of the China government's national technology agenda, GBASE will develop and sell this local innovation based on IBM Informix technology. The agreement extends the reach of IBM’s Informix database technology into this market by providing a localized solution to better address the rapidly evolving needs of Chinese businesses. The agreement permits GBASE to build and license their own version of IBM Informix database, which will be offered to clients in the Chinese market as a stand alone product solution. The China market for database technology is estimated to be in excess of $700m according to IDC. The partnership between IBM and GBASE can fuel the growth of this market by creating a best in class database solution tailored to the unique requirements of the China marketplace. “This agreement confirms IBM’s innovation and commitment to growth in emerging markets in general and China specifically”, states Sean Poulley, Vice President, IBM Databases and Database Management. "Our intent is to help partners and clients gain competitive edge and transform their business with innovative database technologies in China, with China, for China”. Informix is one of the world’s most widely used databases, supporting clients who range from the largest multinational corporations to many regional and local small businesses. It is widely deployed in the retail, banking, manufacturing and government segments, all of which are important growth areas in China today. Informix is well known for its innovative design, which enables a single platform that powers both OLTP and OLAP workloads for real-time analytics, scales easily for cloud environments and provides continuous availability. It's renowned for extremely high levels of performance and availability, distinctive capabilities in data replication and scalability, and minimal administrative overhead. With this partnership agreement these advantages will be made more readily available to the fast-growing Chinese market.

Sunday, November 2, 2014

News from the Informix World

And I do mean world. Well it's been a long time since my last post. Over six months. I just got back from the IBM Insight Conference in Las Vegas NV (formerly the IBM Information On Demand Conference) and there is some big news. IBM announced their new vertical cloud development stacks called BlueMix. There is a BlueMix for timeseries data, BlueMix for mobile apps, BlueMix for in-memory Warehouse data analytics, BlueMix for Watson, BlueMix for ... well you get the idea. The stacks include access to a cloud server with an application appropriate database system (Cloudant, DB2 BLU Accelerator, Informix, etc.), development tools, access to IBM and some third party analytics tools, etc. You can create a BlueMix account for free and try it out. If you like it and develop something on it for yourself, your company, or for distribution then you can work out a commercial deal with IBM for unlimited access and more resources. "Wait", you say, "did you say Informix as part of an IBM Cloud service offering? They just slapped that in there to make us noisy Informix grumps happy. Right?" Nope. If you set up a BlueMix for Timeseries Data configuration you get a fully functional Informix database server (not just timeseries support, though it's not clear if this includes IWA). While DB2 is available, you have to ask for it, the only full function RDBMS with a BlueMix configuration ready to go is Informix. Very cool! More soon when I will address the "world" teaser.

Thursday, April 17, 2014

Heading down to Miami soon!

It is the middle days of Passover week and I am trying to squeak some work out between the first Seder days and the coming Sabbath. Part of that I am making final preparations to go down to Miami to the 2014 IIUG Conference. As others have said, this is the premiere conference for users of IBM's Informix database systems. I am excited about the coming conference, not just because I get to participate in one of my favorite activities - that being talking by presenting two sessions and helping out with a third - but I am hoping to hear about new Informix features for the next release. Already this year we have gotten some neat stuff in v12.10 through the first three releases: - Support for JSON and BSON data types. - Support for developing hybrid applications that incorporate both structured and semi-structured data together. - The ability to dynamically expand the buffer pools to handle unexpected data demands. - Improved dynamic logical log management. - Automatic management of the number of CPU VPs. - Automatic expansion of the physical log to handle unexpected transaction volumes. - The ability to specify that new tables should be placed in a specific dbspace or should be automatically partitioned across a number of dbspaces instead of defaulting to the dbspace in which the database was created. - SHMADD now automatically doubles every time 16 new segments have been added to the server. - The client side environment variables INFORMIX_CONTIME and INFORMIX_CONRETRY are now also ONCONFIG parameters that set the server's default connection timeout behavior. - The CPU VP private memory cache is now dynamic using VP_MEMORY_CACHE_VP as a starting value. The optional parameter STATIC can be added to the end of the size to revert to the previous fixed size behavior. - New onstat -g bth & -g BTH reports to better monitor resource contention causing blocked sessions. - Compression for indexes. - Automated compression for new tables. I hope to see more in Miami. I hope to see you all there as well!

Thursday, January 16, 2014

The myth of the new and other fallacies

In Kohelet (aka Ecclesiates) King Solomon of ancient Israel wrote:  
What has been is what will be, and what has been done is what will be done, and there is nothing new under the sun.
No, I haven't gone all proselytizing on you, I just want to make a point.  Most of the "new" technologies that everyone is so hot to get into to save the world are not new at all.  Some are even very old.  Many were tried in the past and while some were good ideas and successful in their first incarnations, many were either not successful at all or were later supplanted by other technologies that got the job done better.  A few examples:

Object Oriented Languages

I know, '80s technology, not new anymore.  But they are still 'hot' and coders coming up today don't know their history, so here's some.  Object oriented languages like C++, C##, Java, etc. encapsulate the operations associated with a data structure together with the structure itself.  In the 1960s, two new programming paradigms emerged.  These were Structured Programming and Modular Programming. Structured programming concentrated on eliminating non-linear controls from software code.  It was characterized most simply as "No goto statements, anywhere, anytime!" and was championed by such notables as C.Bohm, G. Jacopini, Edsgar Dijkstra, Michael Jackson (no, not the singer) and Issac Nassi and Ben Schneiderman (who together developed the Structure Chart or Nassi-Schneiderman Chart sometimes called Chapin Chart).

At around the same time other notables like Niklaus Wirth (the developer of Pascal, Modula2, Modula2, Oberon, and the Lilith computer system) were proposing the concept of Modular programming.  The idea was to encapsulate distinct operations into subroutines and to gather like subroutines with similar or related functionality and/or purpose into reusable concrete units or 'modules'.  Wirth went so far as to develop a programming language that supported both structured programming and modular programming techniques (and did not include a GOTO statement at all!)  To this day, Modula2 is still my favorite of the dozen or more programming languages I have learned and used.

These two paradigms sought and succeeded to reduce the complexity and redundancy in large software projects and to permit larger groups of programmers to cooperate in the production of complex projects.  Combine Structured Programming with Modular Programming and add language support for their concepts and voila you have an Object Oriented Language.

Columnar Database

In the early 1980's I needed to write a custom database for an application.  After research I found something that matched the requirements of the project perfectly.  It was called, variously, and Inverted Index, Inversion Table (no, not the exercise/torture device), or Inversion Database.  The idea was to build indexes of all of the fields of a set of data records without having the entire record present at all, only the indexes.  this meant that as long as you only wanted the values from a subset of the fields in a record you could access the data very quickly with reduced IO which was very expensive at the time.  Can you say Columnar Database?

Key:Value Store or Database

In 1979, Ken Thompson of UNIX fame wrote a database library for UNIX called dbm which used a hash table of key values to index data records.  The records had no particular structure.  Any interpretation of the contents of the record were application dependent.  This library is the UNIX dbm database library system.

In 1986 the Berkeley UNIX team developed ndbm which added the capability to have more than one database open in an application.  Unlike dbm, ndbm stores the data value in the hash table directly which limits the size of the key:value pair and of the database as a whole.

In the late 1980's and early 90s Margo Seltzer and Ozan Yigit worked to improve on dbm and ndbm and created Berkley DB.  BDB was still a hash managed key:value store.  It's main innovation was the ability to be accessed concurrently by multiple users without corrupting the database.

This is virtually the same as the "ground breaking" technology in MongoDB, Cassandra, Virtuoso, etc.

Graph Database

In 1969 Charles Bachman published a paper describing a new database model to replace the popular herarchical model at the Conference on Data Systems Languages (CODASYL).  He named it the Network model.  Wikipedia defines the network model as database model conceived as a flexible way of representing objects and their relationships. Its distinguishing feature is that the schema, viewed as a graph in which object types are nodes and relationship types are arcs, is not restricted to being a hierarchy or lattice.

Wikipedia defines a graph database is a database that uses graph structures with nodes, edges, and properties to represent and store data. By definition, a graph database is any storage system that provides index-free adjacency. This means that every element contains a direct pointer to its adjacent element and no index look ups are necessary. 

Sounds awful familiar to me.  By the late 1980's network databases and hierarchical databases were largely replaced for general purpose database use by databases that ostensibly followed the Relational Model described by Edgar F. Codd in a paper published in 1969 because their schema were fixed in the physical implementation of the data making it virtually impossible to make changes to the schema as applications evolved.

Getting to the Point

The point of this blog post is that there were reasons that we used these technologies "back in the day" (as my favorite reality TV stars say) and there were reasons that we abandoned them and moved on.  Some of these technologies never went away because they have good and valid applications to solve specific data storage and retrieval problems.  But these were not the technologies that conquered the database world.  The Relational Model, or as several friends of mine insist, a poor implementation of something that vaguely resembles the Relational Model and is more properly called the SQL Model, won that battle and for good reason.  So, let's not all jump on the same bandwagon without doing due diligence and examining in detail what technology best fits the problem.  

One of my favorite sayings is "Use the right tool for the right job!"  I am also a big proponent of avoiding the Yankee Adjuster Problem where, as the saying goes, "When all you have is a hammer, everything starts to look like a nail!"

So, let me finish as I began, there is nothing new under the sun!  Or as I am fond of replying when asked "What's new?" - "There's nothing new, just the old stuff come back to haunt us!"