Saturday, July 31, 2010

Agile Data modeling, de-normalization, and other important factors

This subject probably deserves an entire book, but in the interest of keeping this blog post short, I've tried to highlight some of the more important data modeling tasks related to conceptual, logical, and physical data model design. Please note, this is not meant to be exhaustive. Relational models, object-relational models, and object models are not always the right solution for the problem. There are many great solutions in the NoSQL space, and there is nothing wrong with a hybrid solution.

Whether a software application is compiled for a specific machine architecture, interpreted as machine-independent script on a web server platform, or executed as Java bytecode by a Java Virtual Machine, it typically stores and retrieves its information from a database software system.

Efficient storage and retrieval of information from a database software system certainly depends on the underlying hardware and operating system software. However, it also depends on the structure and efficient execution of the query statements used to store and retrieve the information. In a purely academic world, data is modeled and normalized into third, fourth, and fifth normal form (3NF, 4NF, 5NF). However, higher-order normal forms do not always promote optimal query statements, so certain parts of a data model are often denormalized to improve the read performance of the database software system.
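As a minimal sketch of the normalized starting point, using Python's built-in sqlite3 module (the order-entry schema and all table and column names here are hypothetical): each fact is stored exactly once, and reads reassemble facts with joins.

```python
import sqlite3

# Hypothetical order-entry schema in normalized form:
# each fact (a customer's name and city) is stored once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    total       REAL NOT NULL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'Boston')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.50)")

# A read must join the two relations to recover the customer's city.
row = conn.execute("""
    SELECT o.order_id, c.name, c.city
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
""").fetchone()
print(row)  # (10, 'Acme', 'Boston')
```

The join is trivial at this scale, but against large relations it becomes exactly the cost that denormalization targets.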

At a high level, joins between multiple large relations are generally expensive, and denormalization reduces this cost. The trade-off is redundancy: the same fact ends up stored in more than one place, and the developer(s) are left with the job of keeping the redundant copies consistent and minimizing their number. A formal process typically alleviates this strain.
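A minimal sketch of that trade-off, again with Python's sqlite3 module (all names hypothetical): copying customer columns into the order relation removes the join from the read path, but leaves redundant data that must be kept consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized variant: customer name and city are copied into
# each order row, so the hot read path needs no join.
conn.executescript("""
CREATE TABLE orders_denorm (
    order_id      INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL,
    customer_name TEXT NOT NULL,   -- redundant copy
    city          TEXT NOT NULL,   -- redundant copy
    total         REAL NOT NULL
);
""")
conn.execute("INSERT INTO orders_denorm VALUES (10, 1, 'Acme', 'Boston', 99.50)")

# Single-table read: no join cost.
row = conn.execute(
    "SELECT order_id, customer_name, city FROM orders_denorm"
).fetchone()
print(row)  # (10, 'Acme', 'Boston')

# The price: if the customer moves, every copied row must be updated,
# or the data becomes inconsistent.
conn.execute("UPDATE orders_denorm SET city = 'Austin' WHERE customer_id = 1")
```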

Defining the I/O characteristics of a database is an important precursor to data modeling, yet this step is often skipped. A few basic tasks and questions can help:

- What is the purpose of the database, and what types of problems does it solve?
- What mix of operations (read, write) will be performed on the database?
- How many users will the database support?
- How much downtime can be afforded, and how much failover is needed?

Also define user stories: short, succinct 1-3 sentence action statements, written from both the developer's perspective and the user's perspective. Define as many as needed. The user stories help define the conceptual data model for the domain, from which the logical data model can be created. A clear definition of the purpose of the system aids in defining the read/write (I/O) characteristics of the database software system, thus forming a picture of the system. For instance, OLTP databases typically support short transactions, so it is important that write latency is minimized. The answers to these questions, particularly the downtime and failover question, will also help determine the type of hardware storage configuration needed.

At this point, the types of queries that will be run on the database can be sketched out, with heavy consideration given to the I/O characteristics of the database and to the types of joins and scans the query optimizer should ideally use.
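One way to check this step is to ask the optimizer directly. A sketch using SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module (the table, index, and column names are hypothetical; other database systems expose equivalent EXPLAIN facilities):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total       REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

# Ask the optimizer how it would execute the query; an indexed SEARCH
# (rather than a full table SCAN) is what we want on the hot read path.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = ?", (1,)
).fetchall()
for step in plan:
    # The last column describes the step,
    # e.g. "SEARCH orders USING INDEX idx_orders_customer (customer_id=?)"
    print(step[-1])
```

The exact wording of the plan text varies by SQLite version, but the presence of the index in the plan confirms the query matches the access path the model was designed for.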

Agile data modeling can be defined as continual, iterative feedback with the ability to add entities and relationships, or to slice off sections of the data model in various sizes, at any point in the project. This may seem difficult, but it can be accomplished. Whether the database architecture adheres to the relational model, the object-relational model, or the object model, the proper classification of entities and attributes, in conjunction with a balance of normalized and de-normalized groups of data, will allow small and large chunks of relations to be added to or removed from the data model for a specific domain throughout the development process. Formal algorithms can be used to map entity-relationship models to relations.
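A toy sketch of one such mapping (all entity and attribute names are hypothetical, and the convention here assumes each entity's first attribute is its key): each entity becomes a relation, and each one-to-many relationship migrates the key of the "one" side into the relation of the "many" side as a foreign key.

```python
# Hypothetical mini ER model: entities with attribute lists,
# plus one-to-many relationships between them.
entities = {
    "customer": ["customer_id", "name"],
    "orders": ["order_id", "total"],
}
one_to_many = [("customer", "orders")]  # one customer has many orders

def to_ddl(entities, one_to_many):
    """Map an ER model to relation schemas (first attribute = key)."""
    tables = {name: list(attrs) for name, attrs in entities.items()}
    for one, many in one_to_many:
        # Migrate the "one" side's key into the "many" side as a foreign key.
        tables[many].append(entities[one][0])
    return [
        f"CREATE TABLE {name} ({', '.join(cols)});"
        for name, cols in tables.items()
    ]

for stmt in to_ddl(entities, one_to_many):
    print(stmt)
# CREATE TABLE customer (customer_id, name);
# CREATE TABLE orders (order_id, total, customer_id);
```

A real mapping would also carry column types and constraints, but the point is that the translation is mechanical, which is what lets chunks of the model be added or removed without hand-rewriting the schema.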

And how is this done? By continually modeling entities against the real world: abstracting classes, refactoring, keeping the hierarchy semi-flat, avoiding deep polymorphic behavior, and always embracing changes to the data model while never generalizing with a one-size-fits-all approach.

To summarize: data modeling is a crucial step in designing a database software system that efficiently stores and retrieves information. Define the I/O characteristics of the database, understand the purpose of the system, and create a conceptual data model for the domain; from there, a logical data model can be created, followed by a physical data model that accounts for the hardware storage configuration. Agile data modeling then allows continual, iterative feedback, with small and large chunks of relations added to or removed from the model throughout the development process.
