Motivation and Concepts

OpenCabinet was born out of frustration with the difficulties involved in storing data structures in relational database tables. While there are plenty of object persistance solutions out there, we do not feel comfortable with the complexities involved in using them, especially when there are requirements for efficient storage and fast access paths.

We hope that by limiting ourselves to very specific, yet versatile data structures (lists, sets and maps of known data types) we can create a system that is easy to use and still very powerful in terms of storage efficiency, scalability and query performance.

A simple database table

Imagine that we have to implement an issue tracking system. We might store the bug reports in a relational database table like the following TBL_BUGS (inspired by the schema used in Bugzilla):

FieldTypeRemarks
bug_idintThe bug ID
bug_severityvarchar(64)The bug severity level
bug_statusvarchar(64)The bug's status
product_idintForeign key into products table
assigned_tointThe current owner of the bug (user account number)
reporterintThe user who reported this (user account number)
creation_tsdatetimeThe time of the bug's creation.
descriptiontextThe bug report and follow-up discussions
TBL_BUGS

Multiple-valued attributes

As it turns out, a single bug can affect more than one product. Also, we may want to assign more than one person to it. How do we handle that? The product_id and assigned_to columns can unfortunately store only a single value. Database 101 tells us that we need child tables.

Child table with unique parents

A natural approach is to create TBL_BUGS_TO_PRODUCTS with the bug_id as a foreign key to link back into TBL_BUGS which holds one row for every every affected product per bug.

FieldTypeRemarks
bug_idintThe bug ID
product_idintForeign key into products table
TBL_BUGS_TO_PRODUCTS

The same goes for assigned_to and TBL_BUGS_ASSIGNED.

TBL_BUGS
we can remove the product_id and assigned_to columns from TBL_BUGS.
deletes
if we delete the bug from TBL_BUGS, we should cascade the delete to also clear out the associated rows in the child tables.
updates
updating bug reports becomes a little tricky. We will probably delete and re-insert the rows in all child tables every time we update TBL_BUGS. We will probably also do this when the update did not change product_ids or assigned_tos, because otherwise we have to retrieve the current values for comparation first, or keep track and depend on 'dirty flags'.
table size
assuming that every bug is assigned to at least one product and one owner, the child tables will have as many rows as there are bugs, even though the number of products and the number of developers working on them is much smaller.

Child tables with multiple parents (OpenCabinet approach)

We prefer an alternative approach: The schema of the child tables is the same, but instead of using the bug_id as a foreign key we introduce an 'artificial' foreign key (the set_id) that has nothing to do with the bug_id. Since the set_id has to be stored in the parent table TBL_BUGS we keep the columns product_id and assigned_to. The meaning of their contents has changed, though. Instead of storing a product's id, product_id now contains a set_id for TBL_PRODUCT_SETS.

FieldTypeRemarks
set_idintThe set ID
product_idintForeign key into products table
TBL_PRODUCT_SETS

Accordingly, we also have TBL_USER_SETS instead of TBL_BUGS_ASSIGNED.

never update data in child tables

For reasons that will become clear very soon, we also limit our activities with the child tables to only inserting new data. Once a row has been committed to TBL_PRODUCT_SETS we will never update it.

The main advantage of this is that a set_id (and its associated data) can be reused: All bug reports that are assigned to the same group of people can share the same value for assigned_to.

TBL_BUGS
as explained above, we have to keep the product_id and assigned_to columns.
deletes
deletes cannot be cascaded from TBL_BUGS because other bug reports may be using the data still. Even if the last bug report has been deleted, we do not delete the data, because there is no reason to limit the use of the child tables to TBL_BUGS. Any other database table that needs to store sets of product or account ids can also use them.

Depending on your application, this may lead to a garbage collection problem. On the other hand, if you are looking into implementing a system that keeps track of previous versions of its data, you never want to delete anything anyway.

updates
when updating a bug report, you obtain set_ids for assignees and products. If this is an entirely new set of products that no other bug has been filed against, that will be a new set_id, with corresponding rows being created in the child tables. If not, you will just get the set_id of existing data.

OpenCabinet makes sure that the search for existing data is fast (since this data never changes, it can also be cached easily), and that new rows in the child tables are created and committed before handing out the set_id.

table size
the child tables contain only as many rows as there are unique sets of assigned products and peoples. This will be much smaller than the number of bug reports. Smaller tables require less disk space and (potentially more importantly) are faster to query.

Encoding string constants as integers (IdString)

TBL_BUGS stores the bug_severity and bug_status in varchar columns. The strings used there are examples of what we call an IdString: They come from a fixed (or at least very finite) set of possible options (such as Critical or Minor) and it is acceptable to impose a length limit on them. In the Java programming language, we would probably use a String constant or an Enumeration to represent them. Since there are only a few possible values, a great number of bug reports share the same IdString.

The opposite scenario for a string is a free text field, such as description: The content can be anything and quite long, and it is very unlikely that two separate bug reports share the same description (Although, if we support versioning, it is very likely that two versions of the same report have descriptions that are identical or only slightly different).

In OpenCabinet, all IdStrings are encoded as integers. It also keeps track of the mapping, so that you do not have to do it in your program.

Sets, Lists and Maps

We have already seen in the example of TBL_PRODUCT_SETS how to store a set of integers. Let us investigate some other collection types.

integer lists

The difference between a list and a set is that elements can occur more than once and that there is an ordering of elements that needs to be preserved. On a relational database table, this can be implemented by adding a position column:

FieldTypeRemarks
list_idintThe list ID
positionintthe list index of this element (0-based)
product_idintForeign key into products table
TBL_PRODUCT_LISTS

integer maps

A map is like a lookup table that associates a key to a value (in this case an integer). OpenCabinet requires that the key be an IdString, which seems like a natural fit anyway, so that the key can also be representated as an integer:

FieldTypeRemarks
list_idintThe list ID
keyintthe key (an IdString)
product_idintForeign key into products table
TBL_PRODUCT_MAPS

collections for other element types

very large maps