[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]

Multi-tier Application Architectural Design

A 3 Day Seminar prepared and presented by: Jim Booth

Review of Database Design


Selecting a Database Architecture

There are as many different architectures for databases as there are uses for data.  The architectures that exist vary on a number of different parameters.  One of the issues that are addressed by the various architectures is the primary use of the data.

Database architectures can be divided into two groups based on use of data, the transactional and the warehouse database.  The transactional database is one that is used to record activity that is current and ongoing while the warehouse is used to record data that is relatively static.  The next two sections discuss these designs in more detail.

A)   Transactional Database

A transactional database has constant activity in it that is recording and altering the data contained.  This is the common database we see as the “relational” model.  The design is optimized for data entry and updating with a focus on protecting the integrity of the data during these operations.  Our focus will be on this type of database.

B)   Data Warehouse

The data warehouse is used to record fairly static data.  The primary use for a data warehouse is reporting of information and for decision support.  The data in a data warehouse is not very dynamic in that it does no have active and ongoing changes being made to it.  Because of this the design can be optimized for data retrieval and the issues related to data entry and updating can be relegated to a lesser concern.

The Relational Database Model

In some things we can get away with “shooting from the hip”, that is we can create as we go without major difficulty.  Relational database design is not one of those things.  Due to the very issues that make relational databases so flexible in storing and retrieving data, an effective design process is absolutely required.  Luckily, there are some tools that can be used to assist in database design as well as some heavily tested and proven guidelines.

I have often heard references from developers about too much “theory”.  These folks will promote a pragmatic approach to doing things, well I hate to be the bearer of sad tidings but theory is the basis for scientific discovery.  Relational database theory is footed in mathematical set theory.  It is a mathematically proven approach to data handling.  All of the relational “theory” can be proven, through mathematical formulae, to be true.

Many of the “rules” in relational design are no more than guidelines created to help prevent problems in working with the data later.  Guidelines are exactly that, guides.  As such, we are free to follow them or not as we see fit.  However, think about this for a moment.  There is a whole body of experience probably numbering in the millions of person years on which these guidelines are based.  To dismiss a body of wisdom that significant without careful consideration is truly the work of a fool.

I am not saying that every guideline should be followed without question. I am saying that before you or I violate one of these guidelines we should have done a careful analysis of what the guideline is, why it exists, what it will cost us to violate it, and what benefit we get by violating it.

C)   Start by learning about the business

A relational database must record and retrieve data that is derived from information about a specific business.  Each database is designed to work for a system or systems of software applications for a business.  The object of the relational design effort is to model the business’ information into a relational database structure.

In order to model anything you must understand the thing being modeled.  If you want to build a model airplane you must know detailed information about that airplane.  You must know the dimensions so the ratios are maintained, you must know the shapes and the number of engines, etc.  Businesses are no different.  To successfully model a business you must have a clear understanding of the nature of the information that the business uses in its operations

D)   Learning the Business

One of the things that I find satisfying about the software development field is the ability, actually the requirement, that I learn detailed knowledge about many different businesses.  As with learning anything the source of your education is instrumental in the success or failure of your efforts to learn.

If I want to learn skiing, I seek out a ski instructor that has years of experience.  If I choose to learn gardening, I seek out a gardening expert.  If I need to learn a business, I look for a business domain expert.

Domain is a word you will encounter in many places in relational design.  It refers to a body values.  There is a business domain (information and knowledge about the business), a database domain (possible data that might be found in the database), an entity domain (possible records that might be found in an entity), field domain (possible values that might be found in a field), and others.

The business domain experts most readily available to you are your client and their employees.  You need to spend some time with these folks early in the project just listening to them tell you about their business and how they run it.  During this time try very hard to stop the client’s people from discussing the computer software that you will develop.  You are the computer software expert and you don’t need them to tell you about software, but you do need them to tell you about their business in great detail.

During these conversations be very wary of these two disastrous words, always and neverAlways is never always and never is a very long time.  When the client says, “We always enter the customer information before we create an invoice.”  Hear that as “We usually record the customer record before we create an invoice” or “We would like to create the customer record before we create an invoice.”

When the client says, “Our customers never have any more than three contact persons.”  Hear that as “Most of our customers have no more than three contact persons” or “It is rare that we have a customer with more than three contact persons.”

One thing to keep in mind as you learn about the client’s business is this, our clients are primarily interested in probabilities and when they use the words never and always they are using them relative to the probability of something.  We must be focused on possibilities without regard to probability.  If something can occur, no matter how improbable, our design must be capable of handling the situation.

E)   Build a Skeletal Model

A model is a miniature of something.  A database model is a description of the structure of the actual database.  We will discuss the diagramming method for a database model later, for now let’s identify the major components of a database model.

Entity: The relational database term used for a person, place, thing, or idea that we need to record information about.  Not all entities are physical things; some of them are abstract things.

F)   Identify Major Entities

Some examples of entities are:

  • Customers
  • Employees
  • Invoices
  • Inventory

In this early stage of the database design you must work with the client to identify all of the entities that the system will require.  Keep in mind that design is an iterative process, and that you will revisit this stage again to add more entities that are discovered during the rest of the design and programming of the system.

Relation: A fact about or an association between two entities.

G)   Determine Relationships

When we speak of relationships in terms of a database design we are referring to persistent relationships.  That is, the natural relationships that describe of the nature of the data being managed.  One of the flexibility features of the relational data model is the fact that the relationships between the entities are not fixed; there can be multiple different views of the data in the database that look at the entities with different relationships.

For example, taking a database that includes Customer, Invoice, Invoice Details, and Inventory entities we can look at the data from the perspective of the Customer and determine what Inventory items have been purchased, or we can reverse the relationships and view the data from the perspective of the Inventory entity and see which customers bought things.  Because of this flexible nature of relationships in an RDBMS (Relational Database Management System) we classify relationships into two categories, persistent and temporal.  Persistent relationships are defined in the database model and describe the nature of the data in the real world.  Temporal relationships are created using SQL SELECT statements or, in VFP, the SET RELATION command at runtime.

The relationships that we need to identify at this point in our database design process are the persistent relationships.  We need to identify not only the relationships that exist, but also what rules for referential integrity apply to each relationship (more on this later).

H)   The Concepts of Parent and Child

Parent and child roles are not permanent roles for entities.  That is that two entities can be related in two ways, with either of them being the parent.  For example, consider the simple Customer to Invoice relationship.  We are apt to think of Customer as the parent and Invoice as the child, and the persistent relationship in the database is probably defined this way.  However, we can also view the data with invoice being the parent and customer being the child.

The parent is a relationship is the table that controls the relationship, that is the source of the relationship.  The child is the table being controlled that is the target of the relationship.  The VFP SET RELATION command shows this quite well.

    SET RELATION TO <RelationalExpr>

                           ^     

                {The Child alias name}

    INTO <TargetAlias>

              ^

    {The expression in the parent alias}

If we want to see customers and what invoices they have; we look at the data with customer as parent and invoice as child. However if we want to see the invoices for a period of dates and see which customers bought from use; we would need to make invoice the parent and customer the child.

Why talk about this at all, it seems pretty obvious, because there is often confusion in communications regarding the use of parent and child outside of the description of a specific view of the data.  Many times I have seen someone confused because they didn’t realize that the roles of parent and child in a relationship can be dynamic and they concluded that the parent in the persistent relationship was always called the parent.

I)   Identify Attributes

Attribute: An atomic piece of data that is descriptive of some aspect of an entity.

Next we need to determine all of the attributes for each entity.  The definition of each attribute’s domain is part of this step.  Refer to the definition of domain earlier in this document.

The domain for an attribute includes:

  • Data type
  • Length
  • Format (mask)
  • Uniqueness
  • NULL support
  • Allowance of blanks
  • Legal values
  • Default value
  • Meaning
  • Anything else that describes the attribute

It is important to take the time now to document these domains, as they will play an important role later in validating your design.  In the previous step you identified the persistent relationships in this step you need to insure that all of the foreign keys (more on this later) exist to support those relationships.

J)   Add the keys to your model

Keys are the key to a relational database.  It is through the use of these keys that the data in one entity can be related to data in another entity.

K)   Determine Primary and Alternate
      (or  Candidate) keys

The purpose of a primary key (PK) in relational databases is to uniquely identify a specific record within an entity.  The primary key requirement from the relational database perspective has no other purpose or responsibility.

Primary key: An attribute or group of attributes that uniquely identifies an instance of an entity.

Candidate Key:  An attribute or group of attributes that met all of the requirements for being the primary key.

Alternate Key: A candidate key that is not being used as the primary key.

In a relational database every table MUST have a PK.  If a table exists without a PK it is not a relational table.

L)   Natural vs. Surrogate Primary Keys

There are two major types of PKs, natural and surrogate.

Natural key: A primary key comprised of attribute(s) that would still be part of the entities attributes even it(they) were not being used as the PK. 

Surrogate key: An attribute that has been added to the entity for the sole purpose of being the primary key, if it were not used as the PK then the attribute could be removed without affecting the completeness of the entity definition.

Which is better?  Well the answer can be a controversial one.  There are those folks who believe that if a natural attribute (or attributes) of an entity can meet the requirements of being the PK then it should be used for the PK. 

Then there is the school of thought that I belong to; “The sole need for a PK is rooted in the relational database requirements, that requirement is to uniquely identify a record within a table, and there are problems (more about this later) when the value of a PK changes, therefore using surrogate PKs universally is the best approach to PKs.

OK, let’s examine my opinion on this issue.  A PK identifies a record.  The PK is used to related records from other tables to allow views of meaningful information from divergent tables.  If the value of a PK changes it is necessary to assure that the value of the pointers in other tables are also changed (Referential Integrity).  If an attribute has meaning then its value cannot be random, it must be accurate in reflecting the meaningful information it contains.  People make mistakes.  People don’t always discover their mistakes right away.  Therefore, if a PK has meaning outside of being the PK there will come a time when the user needs to change the value of an existing PK in order for its non-PK purpose to be accurate.  This change requires that code be written and run to resolve all of the foreign table references.

If a PK has no meaning outside of its role as PK, then there is no need to ever show the PK to the user.  A surrogate PK is created by the computer software and is only used by the computer software.  The surrogate PK has no meaning to the user.  In this case the user will never have any desire or need to change the value of an existing PK.

What about a mixture within a database, some natural and some surrogate?  Well, I have a problem with that.  The two types of PKs are handled very differently and I find that the benefits of having every table the same is regards to how its attributes and PK is handled is of more value than any possible benefit of mixing the methods.  Since surrogate PKs work everywhere, and they are easy to implement, then I say use surrogate PKs universally.

M)   Simple Vs. Compound Primary Keys

A second categorization that can be made for PKs is simple versus compound. 

Simple PK:  A primary key that is comprised of only one attribute.

Compound PK:  A primary key is comprised of more than one attribute.

In the case of surrogate PKs there is no reason that they would ever be compound.  The compound PKs would, therefore, only be seen in natural PKs.

This is another advantage to surrogate PKs is that they are never compound PKs.  One has to keep in mind that whenever we try to relate data from two or more tables we have to write something somewhere that explains how the relation is derived, that is the Join condition.  With compound PKs these join conditions can be complex while simple PKs render simple join conditions.

For example assume two tables: Invoice and Details.  Invoice has a compound PK comprised of InvNo + CustNo + InvDate , these fields are also contained in the Details table for use it relating details to invoices.  The syntax of a WHERE clause to connect these two tables is:

    … WHERE Invoice.InvNo = Details.InvNo ;
    AND Invoice.CustNo = Details.CustNo ;
    AND Invoice.InvDate = Details.InvDate …

     

    Now contrast that with the use of a surrogate simple PK named InvID:

    … WHERE Invoice.InvID = Details
    .InvID …

This example is contrived; however it does demonstrate the effect of the type of PK that is used on how JOIN conditions need to be phrased.

N)   Determine Foreign keys

In the pervious section on primary keys we discussed relating of table to one and other.  We said the primary key plays a role in this relating.  The other half of the relation requirement is called a foreign key (FK).

Foreign Key: An attribute or group of attributes that holds the value of the primary key in a different table.

It is the combination of a primary key in one table and the foreign key in another table that allows us to combine data from the two tables and produce meaningful information as a result.

Again, there is a benefit in using surrogate simple PKs here.  To relate TableA to TableB you must provide for a field or fields in TableB to hold a reference to the records of TableA.  This is done by reproducing TableA’s PK in TableB as an FK.   With a compound PK for TableA you would need to create all of TableA’s PK member attributes in TableB to make the PK to FK relationship work.  Using a simple PK in TableA requires only one attribute be added to TableB for the FK.

O)   Define the key business rules

There are business rules and there are key business rules.

Key Business Rules:  Those data rules of a system that are related to how the PK and FK relationships are to be managed.  These rules dictate what the referential integrity constraints will be.

Relational databases make heavy use of PK to FK relationships to elicit information from data.  There are a number of issues that are raised with this approach to database design.  For example:  a Customer table and an Invoice table exist in a system and they are related on the CustID field in each table.

How do you handle the entry of an Invoice that has a value in the CustID field that does not have a corresponding record in the Customer table?  What about a change in the CustID for an existing record, what do you do about the related Invoice records?  How about deleting a Customer record, again what do you do about the related Invoice records?

As database developers we must focus on insuring that all of our relationships are valid.

Referential Integrity: The state of a database in which all of the inter-table relationships are valid according to the key business rules.

of our relationships are valid.

Persistent Relationship:  A persistent relationship is descriptive of the nature of the data entities and how they relate to one and other.  Persistent relationships are defined in the database and survive work sessions.

Temporal Relationship: A temporal relationship is description of the relationship between entities in a specific view of the data.  Temporal relationships do not survive the work session.

In order to better understand the key business rules we need to organize the types of relationship that can exist in a database system.

P)   Types of Relationships

There are two basic types of relationships:

See the diagram below for an example of how temporal relationships provide flexibility that they give us in displaying our data.

This diagram could be a persistent relationship definition in the database.  It describes how customers own addresses.

If you were asked to produce a list of cities and the names of the customers in those cities you would need to look at the data in the tables using a relationship set up like the one shown above.

The ability of a relational database to allow us to use these two different views of the data is the very root of the relational designs flexibility.

Types of Persistent Relationships

There are two types of persistent relationships that are possible.  These types divide the persistent relationships into two groups that have different requirements in areas of key business rules and referential integrity enforcement.

Defining Relationship: The Parent record defines the existence of the child records.  If the parent record is removed then the children have no meaning any longer.

Non-defining Relationship: The child records have meaning even without a corresponding parent record.

See the next two figures for examples of each type of persistent relationship.

In figure 3 you see an Invoice table being related to an Invoice Details table.  The defining nature of this relationship is that an Invoice Detail record has absolutely no purpose without a corresponding Invoice record and any given Invoice Detail record is defined by one, and only one, Invoice record.  Contrast this relationship with the one below.

The relationship between CustAddr and Address is a non-defining one.  It is non-defining because any one CustAddr record does not define the existence of any of the Address records.  That is, a given Address record may be referenced by more than one CustAddr record and therefore the removal of any particular CustAddr record does not, necessarily, negate the value of the Address record that was pointed to.

Back to the Key Business Rules

So, what are key business rules?  Let’s consider a Customer to Invoice defining persistent relationship.  The defining nature of this relationship tells us that we have a referential integrity issue here.  We must insure that there is never an Invoice record that does not refer to an existing Customer record.

The integrity issue affects two database operations; one is whenever a record is added to the Invoice table we must insure that it has a valid FK value for the Customer table.  The second issue is when deleting or changing the value of a Customer record’s PK we must insure that all Invoice records that are related to the Customer record being affected are handled.

For simplicity let’s just consider the deletion of a Customer record.  Assume we have Customer number 1001 and it has 5 Invoice records.  If we attempt to delete Customer number 1001 there are three ways to deal with the referential integrity;

  • Cascade the deletion.  Delete all of the referring invoice records.
  • Restrict the delete: Do not allow the customer record to be deleted if there are any invoice records fro that customer
  • Ignore the relationship:  Allow the customer to be deleted and do nothing about the invoice records.

The first two approaches will insure the integrity of the customer to invoice relationship.  We’ll discuss the third option a little bit later.

Focusing on the first two options.  Cascading the deletion of the customer to the invoices will prevent any invoice records that are not connected to a customer record.  So will restricting the deletion of the customer.  From a database integrity perspective these two approaches are equal, they both protect the integrity of the data.

However, I think if you talked to the financial folks at your client and said, “By the way, if a user clicks this delete button we will blow away all of the invoice records for that customer.”  You might spark a conversation.  Financial folks don’t like the idea of destroying financial records.  So, then, you could use restrict.

Well it might not be that simple.  The financial people may give you a rule something like this; “If the customer has no invoice records in the current accounting period and no invoices with an outstanding balance, then you can cascade the delete, otherwise you must restrict the delete.”

The quoted sentence is an example of a key business rule.  It is a rule of the business that affects how we must manage the integrity of the PK to FK relationships.

Q)   Determine Additional Business Rules

The official terminology for the business rules that are not related to key management is triggering operations.

Triggering Operations: The most generalized form of business rule, encompassing domains and key business rules as well as other types of attribute business rules.  The set of rule rules that make some data sets valid and others invalid.

Do not confuse triggering operations in relational design theory with triggers which are features of certain database applications that automatically fire on database events, they are not the same thing.

As the definition above states, triggering operations are that set of business rules that are not involved with the management of primary and foreign keys.  Things like credit limit rules for customers or minimum order rules for free shipping and things along those lines.

R)   Define the Domains for all attributes

We defined a domain earlier in this paper.  An attribute domain describes an attribute in a way that accounts for all of the possible values that might ever be found in that attribute. 

The domain definition is a very important step in the design process.  This is because one of our goals in relational design is to eliminate redundant data elements from the tables.  It is virtually impossible to tell if two attributes are redundant without knowing the complete domain definition for each of them.

For example, take a Customer and Invoice table each of which has a field named SalesTaxRate.  Is SalesTaxRate a redundant attribute because it appears in both tables?  The answer depends on the domains for the two fields.  If the domain is the same then they are redundant.  In this particular case they are not redundant.  This is because the domain for the SalesTaxRate field in the customer table is; “The current sales tax rate for this customer” and the field in the Invoice table has the domain of, “The sales tax rate in effect at the time of this invoice for this customer.”  So you see their domains are different, therefore they are not redundant fields.

Attribute domains come into play again when we are looking for repeating and multi-valued fields in a single table.  If two fields have the same domain they are the same field with multiple possible values.  Again consider a customer table with fields named Contact1 and Contact2.  When we consider the domain definition for the two fields we get; Contact1 “The name of a contact person at this customer” and Contact2 “The name of a contact person at this customer.”  Both fields have the same domain; therefore they are the same field even though they have different names.  The contact fields are an example of a multi-valued attribute.  We will see how to fix this problem later on.

S)   Many To Many Resolution

The major problem with Many-to-Many relationships is that they do exist in business.  However, computer databases cannot handle them.  There are some database systems that claim to handle Many-to-Many relationships, but they actually do what you are about to see automatically.

Whenever you see a two-headed arrow in an ERD you should immediately think, “There is an entity missing here.”  The missing entity is called an intersecting entity or a joining entity.

Intersecting or Joining entity: An entity that serves the purpose of storing multiple join cases for another two entities.

We can resolve a Many-to-Many relationship between the Invoice and Inventory tables by using to the design seen below.

The figure above introduces the intersecting entity, Invoice Details.  This entity will have one record for each connection of an Invoice with an Inventory item.  The introduction of the Invoice Details intersecting entity reduces the Many-to-Many relationship to two One-to-Many relationships.

In some cases, the new intersecting entity may have a connection with the reality of the business (like line items on an invoice).  In other cases the intersecting entity may be a complete fabrication solely existing to resolve the Many-to-Many relationship (like Customers with Addresses where each customer may have many addresses and any address may be used by multiple customers, in this case the intersecting entity would only exist to connect a customer with an address and would not, necessarily, add any additional information).

T)   Test the Validity of Your Model

At this point you have identified all of the components of your database and you have drawn the first pass at your ERD.  The next step is to validate your work so far.  At this point you would need to apply some type of testing on what you have done to find the problem areas.

Relational database design has just such a tool available, that tool is called Normalization.

The process of Normalization

When you create a new form or report for a system you will run that new thing and test it to find out if it works as intended.  You will test the limits and boundaries of the thing to find out where it breaks or fails.  Generally, you want to get a level of confidence in the functioning of the new component.

Databases are no different than forms or reports in this area.  After you design the database it is helpful to test that database to see if it works as intended.  The testing process is called Normalization.

Normalization: Successively examine each entity and it’s attributes for structural redundancies and/or inconsistencies.

Why Normalize?

Testing the validity of your database design is a desirable goal, but why?  What benefits does insuring a normalized structure of the database provide?  We will discuss the four major benefits of normalization (that is my four most important ones).

U)   Minimize the space required to store the data.

One of the things that normalizing a database does is to remove redundant data elements.  For example, storing the date of an invoice in both the Invoice header table and in the Invoice details table is redundant.  The date of the invoice will always be the same for every line on that invoice.  Eliminating the data redundancies reduces the space required to store the data.

V)   Minimize the possibility of data inconsistencies.

This one is interesting.  The definition of inconsistent from Webster is “not compatible with another fact or claim.”  That means that in order for a data inconsistency to exist there must be two conflicting values somewhere.  If we remove the redundancy we remove the multiple locations for values to exist.  If all values exist in one, and only one, place then there cannot be any inconsistency.

W) Minimize possible update and delete anomalies.

An anomaly is when something goes wrong.  An update or delete anomaly is when something goes wrong with the updating or deleting of information from a database.

Let’s take the Invoice and Invoice detail situation mentioned in space reduction issue above.  If we were to store the data of the invoice in both the invoice header and in the detail lines tables, then we would have a possibility of data inconsistency, that is it would be possible for an invoice to have one date while one or more of the details had a different date.  That would be an inconsistency.

However, by only recording any piece of data in only one place the possibility of inconsistency is virtually eliminated.

X) Maximize the stability of the data structure.

This is my particular favorite reason for normalization.  In this case we are not referring to protection from damage for the files involved, we are talking about a database design that allows for additional functionality to be introduced with the absolute minimum changes to the existing data structure.  That is, the data design is stable.

How can normalization do this?  Here’s an example from real life.  A few years ago there was a question posted on the CompuServe FoxPro forum asking, “I have a table that is used to record the sales for 6 retail stores.  The customer has asked me to produce a report that shows the highest sales for an hour and which store that was in. My table structure is …” the table structure described is shown below.

The questioner also said the stores were open for 6 hours per day and the owner of the stores said that would never change (there’s that word never). He also said please don’t tell me to change the structure of the table because it contains 3 months of data and that would be too much to try and convert.  This occurred in September.

I responded to him and told him first to change the structure of his data and that that would fix his problem.  But, since he had specifically asked how to do this without changing the structure of the data I advised him of how to process all of the records for a given date for all of the stores and find the highest hourly value.

There is a fundamental problem with the data structure above that can be easily seen once we investigate the business requirements.  What, exactly, does the business need recorded in this table?  Is it sales for a date or sales for an hour?

In fact, it is sales for an hour and the structure in figure 7 is recording sales for a date.  To see this ask yourself, “What does one record in the table represent?”  It represents sales on one date at one store.  The next figure shows the table design for recording hourly sales.

This table structure can answer the current question with a single line of code;

SELECT StoreNum, Date, Amount FROM StoreSales TOP 1 ORDER BY 3 …

That’s it!  No comparing values or calculating across multiple records or any of that.  The original problem is solved.

Well the questioner did not take my advice, he used the complex calculation because, “There was 3 months of data already in the existing data structure.”  A couple of months later, in late October, I saw a question about 6 retail stores and sales data that was asking how, without major changes, to allow for the stores to be open more than 6 hours a day.  I looked back in my message archives and it was the same person doing the asking.

I couldn’t resist.  I posted the suggestion to make the same change to the data structure and said, “If you had made this change a month ago you would have had only 3 months of data to convert, now you have nearly 5 months worth.”

The point is that in figure 8, when the table design matches the business’s requirements, you can have business requirement changes that don’t need a data structure change.  Why, because if you have successfully modeled the business in the data then virtually anything that can happen in the business can also happen in the database.  Keep in mind though, that this is not always the case.  If a system does require changes to adapt to a new business requirement that is not a strong indication of a poor data design in the first place.  However, the inverse is always true, a poor data design will invariably require changes to the structure to adapt to new business requirements.

[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]