Chapter 10 Keypoints: Database Design

I. Overview:

The chapter covers why normalization is important in database design AND how to normalize tables. If you don’t design your database correctly, you will have many problems working with it. Design is the most important step in database creation.

The textbook assignment provides hands-on practice for concepts covered below and includes video demonstrations for all required tasks.

II. Background Review (what we already know)

A relational database is made up of several components. The table is the most important because that is where all the data in a database is stored, and without tables, there would not be much use for relational databases.

As you know, a database consists of one or more tables. Each table is made up of rows and columns. If you think of a table as a grid, the columns are laid out from left to right across the grid and each entry of data is listed in rows.

Each row in a table is uniquely identified by a primary key. Keys can be comprised of one or more columns, as long as they are unique identifiers. Keys can also be manufactured (i.e. identity columns). The purpose of a key is to retrieve a single record (or row)

Columns are designed to hold a specific type of data (numbers, text, dates etc.) A column is defined by its name (which must be unique within a table) and data type. The name is used in SQL statements when selecting and ordering data, and the data type is used to validate information stored.

Tables can be connected (joined) based on primary key/foreign key relationships OR common fields.

III. Normalization

Database normalization is process used to organize a database into tables and columns. Tables are organized around objects or entities which are single topics.

Each column identifies a feature of the object. The features are called attributes.

The formal process for diagramming database models is called entity-relationship (ER) modeling.

Before you can create an ER diagram, you need to complete 6 design steps. The six steps for designing a data structure are:

1. Identify the data elements
2. Subdivide each element into its smallest useful component
3. Identify the tables and assign columns
4. Identify the primary and foreign keys
5. Review whether the data structure is normalized
6. Identify the indices (indexes)

1. Identify the data elements (this is typically done by looking at paperwork , spreadsheets or an existing database).

3, 4 and 5. Identify the tables and assign columns, determining primary and foreign keys and normalization

This takes some skill, practice and time. You basically look at your data and organize it into logical groupings. While you are organizing the data, you will see columns that are unique identifiers (primary keys) and you will see instances where you know tables must share data, so you can also identify foreign keys while you are creating the tables. The process of fine tuning your table design involves normalizing the tables. So steps 3,4 and 5 of the design process are often done at the same time (this is not a sequential process).

• Reviewing paperwork (this can help you with this task because forms are already in logical groupings).
• Reviewing other data sources like spreadsheets and existing database files.
• Interviewing employees who work with the data
• Job shadowing employees who work with the data to determine if improvements in process and procedure can be obtained.

Once the data has been gathered, you will need to analyze it to determine what the purpose of the information is, how the information should be stored and how it should be organized.

For example, a spreadsheet containing information about sales people and customers serves several purposes:

• Identify sales people in your organization
• List all customers your company calls upon to sell product
• Identify which sales people call on specific customers.

Forms contain a great deal of information which can typically be split into multiple tables. The example below is from the textbook:

The main idea is that a table should be about a specific topic and that only those columns which support that topic are included.

By limiting a table to one purpose you reduce the number of duplicate data that is contained within your database and that makes modifying the database much easier (duplicate data is a problem when modifying data and the structure of database tables)

Keeping the main purpose in mind, normalization is a technique you can use to achieve a table based on a single object (or entity) and columns (attributes) that only support that single object.

There are three normal forms almost all databases follow; however, there are more than three normal forms (we will be covering all the normal forms so you know what they are and how to implement them).

A. Reasons for Normalization

As we go through the various states of normalization we’ll discuss how each form addresses these issues, but to start, let’s look at some data which hasn’t been normalized and discuss some potential pitfalls. Once these are understood, i think you’ll better appreciate the reason to normalize the data.

B. Normalization Process

The normalization process can be used on existing database tables and brand-new database tables. We will work through a few examples using existing tables and then new tables. The table shown below is for sales staff. The primary key is EmployeeID

The first question to ask is what is the purpose of the table? Does it serve 1 purpose or many?

Just looking at the data, you can see that it is identifying salespeople, customers and sales office information. So, it is serving 3 purposes. Because a table should serve 1 purpose, this table needs to be restructured.

The current table has a lot of duplicate information – this creates problems updating values and it also wastes storage space which can impact database performance.

Trying to accurately maintain this table would be a nightmare. These types of problems are referred to as modification anomalies.

The table would also present problems selecting and sorting information(especially if you wanted customer information)

SELECT salesoffice
FROM salesstaff
WHERE customer1 = ‘ford’ or customer2 = ‘ford’ or customer3 = ‘ford’

There are three common forms of normalization that all database tables follow: 1st , 2nd , and 3rd normal form. The forms build upon each other, so tables that are in 3rd normal form are also in 1st and 2nd normal form.

a. Each field (i.e. column) value must be a single value only.
b. All values for a given field (column ) must be of the same type.
c. Each field name (column label) must be unique.
d. The order of fields (columns) is insignificant
e. No two records (rows) in a relation can be identical.
f. The order of the records (rows) is insignificant.
g. If you have a key defined for the relation, then you can meet the unique row requirement.

In the example, customer1,customer2, customer3 are columns with repeating values (the data in the rows isn’t being analyzed, it is the columns themselves).

Another example showing how to go from a non-1NF to a 1NF database (Note how there are 3 columns containing multiple values):

The revised table eliminates the mulitple value problem, but highlights a different problem when testing for second normal form.

A table is in second normal form (2NF) if all of its non-key fields are dependent on all of the key.

Tables that have a single field for a key are automatically in 2NF assuming that all non-key fields are dependent upon the key.

It is a little trickier with composite keys (keys that are made up from multiple fields). If the key is comprised of 2 or more fields, all the non-key fields must be dependent upon ALL of the key fields. If the non-key fields are not dependent on ALL the fields that make up the key, you have a partial-dependency and that is not allowed in 2NF.

To resolve issues, non-key fields(columns) are removed and placed into a different table.

Here’s an expanded example of the employee/project table that illustrates how to go from 1NF to 2NF (the table must be split apart because of partial dependencies) For the table to be in 2NF, everything in the table must depend upon the primary key

A table is in third normal form if it is in 2NF AND every non-key column depends only on the primary key. To make this happen, you must remove transitive dependencies (Boyce-Codd normal form). A transitive dependency occurs when the value in one field is dependent upon the value in another field, for example, the value in a city and state field are dependent upon the zipcode - technically, if you include city, state and zipcode in a single table, you have a transitive dependency.

• Eliminate fields that do not depend on the key.
• Create a new table for groups of fields that may apply to more than a single record in the table (like city, state and zipcode)

If we apply the rules from the first 3 normal forms to the table, we would have something that looks like this:

o CustomerID(PK)
o Company Name
o Contact FirstName
o Contact LastName
o Contact phone
o repID(FK)

Can we remove a rep without removing customers? Yes, if we allow null values in the customer table for repID

Can we add/remove office locations without impacting reps and customers? Yes, if we allow null values for salesoffice in the rep table

Can we change an office location without updating multiple records? Yes, it will only be entered into 1 table as 1 record.

Here’s another 3NF example (note that deptName would be removed from the table so it can conform to 3NF)
third normal form example

Now that you are familiar with the first 3 normal forms, we will take a look at the rest.

Many companies will purposefully deviate from the remaining normal forms because the result doesn’t fit their business needs.

Only 1 column within the table can have multiple values (multi-valued dependency).

In the example below, the pno and dno columns can contain multiple values. 4NF doesn’t allow more than 1 column to have multiple values.

All redundancy is eliminated by splitting tables apart. Normalizing to this level can create overhead when performing queries because of all the JOINS you would need to do to retrieve data. That is why this level of normalization is rarely done in practice.

A spurious tuple occurs when 2 tables are joined on columns that aren't primary keys or foreign keys (this is considered bad database design).

Every constraint on relationships is dependent on key constraints or column value constraints (domain refers to values allowed within a column) This normal form operates at the relationship level rather than the table level and prevents bad data from being added to tables It also prevents modifications in one table creating inaccurate data in connected tables

NOTE: The seven normal forms are: 1NF, 2NF, 3NF, BCNF (Boyce-Codd), 4NF, 5NF and 6NF

IV. Types of Relationships

1. One to Many - where one record in a table could be related to many records in the second table. In most diagrams, the infinity symbol is used to represent many

2. One to One - where one record in a table can only relate to one record in another table. In a one-to-one relationship, the primary key is the same in both tables and is used to join the tables.

3. Many to Many - where many records in one table connect to many records in another table. When this occurs, you need to break the table down into additional tables.

E-R Diagrams are used to illustrate how tables are related and to identify many-to-many relationships. In the example below, there are two many-to-many relationships which identify the need for additional tables. Tables developed to eliminate many-to-many relationships are called "bridge" tables. In the example below, a WarehouseProducts table could be developed as a bridge between products and warehouse. An OrderDetails table could be developed as a bridge between order and products

In the example below, we would need to create another table to bridge orders and products because of the many to many relationship. An order detail table would work (the table could include order# from the order table and product# from the product table - both fields would be the key)

It is important when creating tables to think about referential integrity of the tables. Referential integrity rules are designed to ensure data is accurate after changes are made.

1) If a primary key in a table is changed and that column is used as a foreign key in other tables, the foreign key is also changed so everything remains in synch. For example, if a customer number changes, the customer number in the order table would also change.

2) If a primary key is deleted, rows in related tables are also deleted. For example, if a customer is deleted, their orders are also deleted.

3) When tables are related, data must be entered in the “1” table before the “many” table. For example, you cannot add customer orders until you have customers.

4) If a foreign key is updated, their must be a matching value in the primary key table; otherwise, the update is not allowed. For example, you cannot change the customer number in the order table unless the customer exists in the customer table

Now that you understand WHY referential integrity is important, we will discuss how it is created, because it is NOT automatically created in SQL.

This is created by defining foreign key constraints that indicate how referential integrity between tables is enforced (we will be covering this in ch 11 and 12)

A trigger is a special kind of stored procedure that automatically executes when an event occurs (we will be covering these in ch 15)

V. Indexing Tables

By default, SQL server (and most other RDMS systems) will index tables by the primary key allowing quick retrieval of data.

If you ask SQL to retrieve data by a column that is not the key and is not indexed, SQL performs a table scan which involves searching through the entire table (the equivalent of looking in every page of a book for something because the index isn’t there). As you can imagine, with a lot of data, this is a slow process.

Indexes can be clustered or nonclustered. Each table can have 1 clustered index (which is typically the primary key) and up to 249 non-clustered indexes. You can change clustered indexes from the primary key to a different column if you want to (there is nothing that prevents that). Rows are stored in tables based upon the clustered index.

SQL allows composite indices made up of 2 or more columns. You should only consider doing this if the data in the columns doesn’t change very often and the columns are used in a lot of queries.

• they cannot be used in search conditions that use the LIKE operator or with patterns containing wildcards

• they should be used sparingly because they have to be updated everytime a changes is made to data the index is based upon (if a column changes a lot, it isn’t a good candidate for an index)

• the column is NOT updated frequently (frequent updates require rebuilding the index)