Universal Databases, Marilyn Monroe & Scarlett Johansson

Marilyn Monroe

Marilyn Monroe

Scarlett Johansson

Scarlett Johansson

Relational Databases, RDBMSs, circa 1999, such as DB2, MS SQL Server, Oracle, Informix and Sybase, implementations of E. F. Codd‘s work at IBM and built according to Codd’s 12 Rules, are effective for solving OnLine Transaction Processing, or OLTP problems. These might be termed “Blue Quadrant” problems, if you consider Stonebraker’s Matrix, 1, below.

X-Ray of broken arm,  courtesy of C. Bethel, MD, and A. Dean, MD.

Broken arm, courtesy C. Bethel, MD, and A. Dean, MD.

They are billing systems, trading systems – accounting systems.  Relational databases are useful for tracking widgets and money, as long as those widgets can be described with words or numbers.  Processing simple data, even with a high volume of transactions.

Relational databases could also be used to store images, such as those of Scarlett Johansson or Marilyn Monroe, above, or X-Rays, left. However, these are stored in the database as Binary Large Objects, aka “BLOBs.” While RDBMSs could operate on large volumes of data in various ways, there are just two operations they could do on on BLOBs: store and retrieve.  They could not query on intrinsic properties of a BLOB. They could query on data that describe a BLOB, but not the BLOB itself. There is no other processing in a traditional RDBMS. And they could not search on images, only on text describing images.

For example, to search for images of Marilyn Monroe, Scarlett Johansson, or broken arms, back in 1999, you had to search images where “Marilyn Monroe,” “Scarlett Johansson,” or “Broken Arm” is in the name or other verbal description field.

Selecting images of Ms. Monroe or Ms. Johansson is easy:

SELECT Image FROM Data WHERE Subject_Name IN (“Marilyn Monroe”, “Scarlett Johansson”);

Selecting black and white images is somewhat more complicated. Your table has to have a Boolean column called “BlackandWhite”

SELECT Image FROM Data
WHERE Subject_Name IN (“Marilyn Monroe”, “Scarlett Johansson”)
and Black_and_White is TRUE ;

A discussion on Stack Overflow, from Nov., 2011, here, recommends two tables, an image table, containing the BLOBs, and an Image Description table containing descriptive data. You would then join the tables in the query, writing, for example:

SELECT Image FROM Image_Table
Join Image_Description_Table
ON Image_Table.Image_ID = Image_Description_Table.Image_ID
WHERE Image_Description_Table.Subject_Name IN
(“Marilyn Monroe”, “Scarlett Johansson”);

And for the black and white images,

SELECT Image FROM Image_Table
Join Image_Description_Table
ON Image_Table.Image_ID = Image_Description_Table.Image_ID
WHERE Image_Description_Table.Subject_Name IN
(“Marilyn Monroe”, “Scarlett Johansson”) and
Image_Description_Table.Black_and_White is TRUE;

Note that in 2006, in “To BLOB or Not To BLOB“, Microsoft recommended storing BLOBs over 256 K outside the database in files, and storing the file name in the database. These queries are very expensive in processor time, requiring one set of disk accesses for the database queries and a subsequent set of disk accesses to retrieve the images.  And note also that high resolution images, and most images taken today with iPhones, iPads, and Androids are 1 MB, or 1000 K or larger, meaning as a rule of thumb, store in the database the file names of the images, and store the images themselves outside the database in files. In addition to the access performance, this raises the spectre of computer security.

Parenthetically, this is how document management systems work.  The Relational database application stores metadata regarding various files – owner, matter, category, operator, version number, file type, etc., but the file itself is stored as an operating system file.

Thus – Stonebraker’s Matrix:

OLTP
SIMPLE DATA, QUERY
Relational Databases
GIS, DMS
COMPLEX DATA, QUERY
Object Databases
FILE SYSTEM
SIMPLE DATA, NO QUERY
Operating Systems

MEDIA PLAYERS
OFFICE APPLICATIONS
COMPLEX DATA, NO QUERY
Persistent Languages
Stonebraker’s Database Application Matrix, 1

COBOL, the Common Business Oriented Language, invented in 1959, based on the work of Admiral Dr. Grace Hopper, is designed for blue quadrant problems solved by accounting and billing systems.  These are the databases that banks, utilities, airlines and other companies use to manage their databases.  Stonebraker describes this as “Simple Data, Query” because a table is composed of thousands, hundreds of thousands, or millions of rows of very similar data. Every check drawn on every bank is associated with a unique identifying number, an amount, a payer, who has an account number, the bank routing number, the date the check is presented to be cashed or deposited, and the recipient or payee.

“Red Quadrant” problems – simple data, no query – involve files stored in a file system,  receiving data from keyboards, modems, the network, the Internet, sending data to printers, modems, out to the network, the Internet. The file is a series of bytes. The operating system only needs to know it’s name and the sectors of disk on which it is stored.

“Yellow Quadrant” problems – complex data, no query, involve retrieving files and doing something interesting, displaying an image, playing an audio or video file, editing a document or spreadsheet, or a multi-media document such as this blog page.  There may be sophisticated editing, but there is no query.

Advantages of the RDBMSs over COBOL are that the database structure can be changed and that reports can be generated relatively easily. However, as noted above, like COBOL, RDBMSs don’t handle things like pictures, audio, video or geographic information very well.  We can store and extract pictures, audio, video, and goegraphic information, but we can’t QUERY the database on the complex data. And we can’t act on the complex data directly.

We can store the images of Ms. Johansson and Ms. Monroe in a RDBMS. But in order to process those images, to render the image of Ms. Johannson in black and white, to color her dress blue, to colorize Marilyn’s image, etc, we would have to extract the image and process it using a graphics processor, an exercise left to the reader, or the reader’s imagination.

Document Management Systems, DMS’s and E-Mail are common real-world problems that are not fully addressed by relational or object oriented database management systems.

Document Management Systems contain two parts: a collection of documents and a database of metadata about the documents; information regarding author, when it was written, who edited it, when, and to what it was relevant. The DMS might be comprised of a collection of 30 GB of documents, spreadsheets, images, etc, and a 5 GB database describing those data in terms of author, edit date, version, subject, client, matter.  We can query on “Client,” “Matter,” “Author,” date, document description, and we can query on keywords in the file if it is a text file and if the DMS has full text search, but we can’t query on images. And querying on an external file or within a BLOB is very expensive, in terms of CPU and disk access.

E-Mail has gone from being a “Red Quadrant” problem to a “Green Quadrant” problem. E-Mail began as a simple message transport system, but, on the desks of many has evolved into a personal message and document database. Microsoft Exchange is  essentially an object oriented database, designed to store, retrieve and process a collection of objects – email messages and their attachments, with limited query capability on the metadata – sender, date – for those messages.

We can’t solve “Green Quadrant” problems with relational databases. And we can’t use “Yellow Quadrant” languages, such as C++, to solved “Green Quadrant” problems. However, this is exactly what Michael Stonebraker and his teams did when they developed Illustra and, subsequently, Informix 9.0, “Universal Server,” which, after IBM acquired the technology, became DB2 Universal Server.

UNIVERSAL DATABASES
OLTP w Complex Data
Object-Oriented Data w Transactions
Online Banking w Verification
Document Management
Geographical Information
File System
Simple Data, No Query
Media Players
Complex Data, No Query
Stonebraker’s Database Application Matrix, 2

Stonebraker noticed that OLTP systems were getting more complex. Building security systems, for example, were being written to accept biometric data. This could be stored in a relational database, but it couldn’t be processed by the relational database.

They saw the need to give object oriented databases the performance of relational databases and to embed in relational databases with the ability to do more than just store and retrieve objects, but to process objects.  Thus was born the “Object Relational Database,” also known as the “Universal Database.”

With Illustra, one of the first “Universal Database Servers,” we could write functions that evaluate images stored in the database.  Where DB2, Oracle, SQL Server could process numbers as long as they were integers of floating point values, Illustra and its derivatives could process time series data. 

In the 1990s I worked on a portfolio valuations system to manage data regarding bonds and mortgage backed securities. To represent the data for securities from floating rate instruments we needed tables containing records for each month of the 30 year life of the bond. We needed functions to populate those data. Illustra could have handled the data much simpler and much faster with its time series object.

We could hack geographical data into relational databases by defining points as an x, y pair, lines as ‘the set of points connected by two points and extending infinitely in the plane, circles as a point and a radius, etc., but we couldn’t process them. Illustra could handle geographical data. High Performing Geographical Information Systems could have been built using Illustra.

Captain Scarlet

Captain Scarlet

Marilyn Manson

Marilyn Manson

While a text search for people like Scarlett Johansson in a relational database might turn up an image of Captain Scarlet, from the 1960’s British television science fiction series for children. A text search on Marilyn Monroe might turn up Marilyn Manson.

A universal database, we can write functions that operate on images, and apply them to every record in a database. The system can differentiate between images of Marilyn Monroe and Marilyn Manson or Scarlett Johansson and Captain Scarlett.

The good news, is that this describes the state of the art from 1999.  Subsequent posts will look at how database technology has evolved in the last 15 years.

Scarlett Johansson

Scarlett Johansson

Marilyn Monroe

Marilyn Monroe

how DB2, Oracle and MS SQL Server allow us to add data types and methods or functions to operation on objects of those new data types.

Larry Furman, the former Director of Information Technology for a law firm in New York City, and a Systems Engineer with Popular Logistics, holds a BS in Biology and Computer Science and an MBA in Managing for Sustainability. Available for consulting, he can be reached at “lfurman . mba” at G Mail or “lfurman” at “Furman Group . Net”