July 31, 2010

Agile Data modeling, de-normalization, and other important factors

This subject probably deserves an entire book, but in the interest of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note that this is not meant to be exhaustive. Relational models, object-relational models, and object models are not always the right solution for the problem. There are many great solutions available in the NoSQL area, and there is nothing wrong with a hybrid solution.

Software applications compiled for specific machine architectures often store and retrieve information from database software systems. Web server platforms often interpret machine-independent scripts that do the same. Java byte code is executed by a Java Virtual Machine, and its data is likewise stored in and retrieved from database software systems.

Efficient storage and retrieval of information from database software systems certainly depends on the characteristics of the underlying hardware and operating system software. However, it also depends on the structure and efficient execution of the query statements used to store and retrieve the information. In a purely academic world, data is modeled and normalized to 3NF, 4NF, or 5NF. However, higher normal forms do not always promote the creation of optimal query statements. Therefore, certain parts of a data model are often de-normalized to improve the read performance of the database software system.

At a high level, joins between multiple large relations are generally expensive, and de-normalization reduces this cost by storing the joined data together. The saving comes at a price: de-normalization typically introduces redundancies, and with them the risk of inconsistencies, so the developers are left with the job of minimizing the amount of redundant data and keeping the copies consistent. A formal process typically alleviates this strain.
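The trade-off can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the customer, orders, and order_report tables and the helper functions are hypothetical, not part of any particular system. The normalized read needs a join, the de-normalized read does not, and the rename shows the extra write that the redundant copy requires.

```python
import sqlite3


def build_demo():
    """Create a tiny normalized schema plus one de-normalized read table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            total REAL NOT NULL
        );
        -- De-normalized: the customer name is copied onto every order row.
        CREATE TABLE order_report (
            order_id INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL,
            total REAL NOT NULL
        );
    """)
    conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(10, 1, 25.0), (11, 1, 40.0)])
    conn.execute("""
        INSERT INTO order_report
        SELECT o.id, c.name, o.total
        FROM orders o JOIN customer c ON c.id = o.customer_id
    """)
    conn.commit()
    return conn


def normalized_read(conn):
    """Read orders with customer names through a join."""
    return conn.execute("""
        SELECT o.id, c.name, o.total
        FROM orders o JOIN customer c ON c.id = o.customer_id
        ORDER BY o.id
    """).fetchall()


def denormalized_read(conn):
    """Read the same rows from the redundant table, with no join."""
    return conn.execute("""
        SELECT order_id, customer_name, total
        FROM order_report ORDER BY order_id
    """).fetchall()


def rename_customer(conn, customer_id, new_name):
    """Update the source row and every redundant copy in one transaction."""
    with conn:
        conn.execute("UPDATE customer SET name = ? WHERE id = ?",
                     (new_name, customer_id))
        conn.execute(
            "UPDATE order_report SET customer_name = ? "
            "WHERE order_id IN (SELECT id FROM orders WHERE customer_id = ?)",
            (new_name, customer_id),
        )
```

Both reads return the same rows as long as every write goes through something like rename_customer. Skipping the second UPDATE is exactly the kind of inconsistency that de-normalization invites.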

Defining the I/O characteristics of a database is an important precursor to data modeling, yet they are often left undefined before the data is modeled. Nevertheless, a few basic tasks and questions can help define them.

What is the purpose of the database, and what types of problems does it solve? What types of operations (read, write) will you be performing on the database? How many users will the database support?

Define user stories: short, succinct action statements of one to three sentences. Define as many as you need from both the developer perspective and the user perspective. The user stories will help define the conceptual data model for the domain, from which the logical data model can be created.

A clear definition of the purpose of the system will aid in defining the read/write (I/O) characteristics of the database software system, and so helps form a picture of the system as a whole. For instance, OLTP databases typically support short transactions, so it is important that write latency is minimized.

The answers to the questions above will also help determine the type of hardware storage configuration needed. Another important question when choosing a hardware configuration is this: how much downtime can we afford, and how much failover do we need?
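One rough way to turn user stories into an I/O picture is to tag each story as a read or a write and attach an estimated call rate. The sketch below assumes exactly that. The UserStory fields, the example stories, and the rates are all made up for illustration, and the result is only a back-of-the-envelope read share, not a capacity plan.

```python
from dataclasses import dataclass


@dataclass
class UserStory:
    text: str            # the short action statement
    operation: str       # "read" or "write" (assumed classification)
    calls_per_hour: int  # rough estimate supplied by the team


def io_profile(stories):
    """Summarize estimated reads and writes per hour and the read share."""
    reads = sum(s.calls_per_hour for s in stories if s.operation == "read")
    writes = sum(s.calls_per_hour for s in stories if s.operation == "write")
    total = reads + writes
    read_share = reads / total if total else 0.0
    return {"reads": reads, "writes": writes, "read_share": read_share}


stories = [
    UserStory("As a shopper, I browse the product catalog.", "read", 900),
    UserStory("As a shopper, I place an order.", "write", 80),
    UserStory("As a clerk, I correct a shipping address.", "write", 20),
]
profile = io_profile(stories)
print(profile)
```

A profile that leans heavily toward reads suggests where de-normalization might pay off, while a write-heavy profile points back toward keeping write latency low.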

At this point, we can begin to see the types of queries that will be run on the database. Heavy consideration should be given to the I/O characteristics of the database and to the types of joins and scans that we would ideally like the query optimizer to use.
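Most database systems can report the plan the optimizer chose for a query. As a small sketch, SQLite's EXPLAIN QUERY PLAN statement can be called from Python's sqlite3 module to compare a query before and after an index is added. The orders table, the index name, and the plan_details helper are assumptions for illustration, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, total REAL NOT NULL)"
)
query = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the optimizer has to scan the table.
before = plan_details(conn, query, (1,))

# With an index, the optimizer can search it instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan_details(conn, query, (1,))

print(before)
print(after)
```

Checking plans like this early, against the queries implied by the user stories, shows whether the model supports the joins and scans we were hoping for.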

Agile data modeling means continual, iterative feedback: the ability to add entities and relationships, or slice off sections of a data model in various sizes, at any point in the project. This may seem difficult, but it can be accomplished. Whether the database architecture adheres to the relational model, the object-relational model, or the object model, the proper classification of entities and attributes for a specific domain, in conjunction with a balance of normalized and de-normalized groups of data, will allow small and large chunks of relations to be added to and removed from the data model throughout the development process. Formal algorithms can be used to map entity-relationship models to relations.
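As a minimal sketch of such a mapping, the code below applies two of the textbook rules: a one-to-many relationship becomes a foreign key on the "many" side, and a many-to-many relationship becomes its own junction relation. The Entity and Relationship classes, the naming scheme, and the TEXT column type are simplifying assumptions, and a full mapping algorithm covers more cases (weak entities, one-to-one relationships, multivalued attributes).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    name: str
    attributes: List[str]  # non-key attribute names


@dataclass
class Relationship:
    left: str
    right: str
    kind: str  # "1:N" (left is the "one" side) or "M:N"


def map_to_relations(entities, relationships):
    """Turn a small ER description into CREATE TABLE statements."""
    # Each entity becomes a relation with a surrogate primary key.
    columns = {
        e.name: [f"{e.name}_id INTEGER PRIMARY KEY"]
        + [f"{a} TEXT" for a in e.attributes]
        for e in entities
    }
    junctions = []
    for r in relationships:
        if r.kind == "1:N":
            # Foreign key goes on the "many" side.
            columns[r.right].append(
                f"{r.left}_id INTEGER REFERENCES {r.left}({r.left}_id)"
            )
        elif r.kind == "M:N":
            # Many-to-many gets a junction relation keyed by both sides.
            junctions.append(
                f"CREATE TABLE {r.left}_{r.right} ("
                f"{r.left}_id INTEGER REFERENCES {r.left}({r.left}_id), "
                f"{r.right}_id INTEGER REFERENCES {r.right}({r.right}_id), "
                f"PRIMARY KEY ({r.left}_id, {r.right}_id))"
            )
        else:
            raise ValueError(f"unsupported relationship kind: {r.kind}")
    tables = [
        f"CREATE TABLE {name} ({', '.join(cols)})"
        for name, cols in columns.items()
    ]
    return tables + junctions


ddl = map_to_relations(
    [Entity("author", ["name"]), Entity("book", ["title"]), Entity("tag", ["label"])],
    [Relationship("author", "book", "1:N"), Relationship("book", "tag", "M:N")],
)
for statement in ddl:
    print(statement)
```

Because the relations are generated from the entity model, adding or removing an entity or relationship in a later iteration only changes the input lists.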

And how is this done? By continually modeling entities within the real world: abstracting classes, re-factoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one-size-fits-all approach.
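One common way to embrace change at the physical level is to apply schema changes as small, ordered steps. The sketch below assumes SQLite and uses its user_version pragma to remember how many steps have been applied. The MIGRATIONS list and the migrate helper are hypothetical, and a real project would more likely use a dedicated migration tool.

```python
import sqlite3

# Each entry is one small change to the model; the list only ever grows.
MIGRATIONS = [
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customer ADD COLUMN email TEXT",
    "CREATE TABLE note (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customer(id), body TEXT)",
    "DROP TABLE note",  # slicing a section back off the model
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply any steps not yet recorded and return the resulting version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(migrations[current:], start=current + 1):
        with conn:
            conn.execute(statement)
            # PRAGMA values cannot be bound as parameters; version is an int.
            conn.execute(f"PRAGMA user_version = {version:d}")
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Running migrate a second time is a no-op, so the same call works whether the database is new or several iterations old.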