CIS 2210 - Database Management and Design
Chapter 12: Distributed Database Management Systems
Objectives:
This lesson discusses material from chapter 12. Objectives important
to this lesson:
- Distributed database management systems
- Advantages and disadvantages
- Distributed processing and distributed databases
- Characteristics of a DDBMS
- Components of a DDBMS
- Levels of distribution
- Transparency features
- Distributed design
- CAP theorem
Concepts:
Chapter 12
Distributed Database Management Systems
The chapter starts slowly, gradually telling us that when a database
has end users who are separated across large distances, their experience
suffers if the database is only located on one central
device. Access may be requested across a company network or across the
Internet. Access may be requested from computers, tablets, phones, or
other devices that may use wired or wireless access.
On page 556, the text provides a list of problems users encounter if
there is only one, distant copy of the data they need:
- performance degradation due to bandwidth bottlenecks
- increased costs of an owned infrastructure
- reliability problems with a single point of failure affecting the
entire system
- scalability problems, with the central database unable to keep up
with increasing access requests
- organizational rigidity, a symptom of which is reliance on a single
centralized solution
Some of the problems above affect the entire enterprise. Standardization
becomes more important the larger the organization becomes. You can't
just start out with good intentions and trust that everything will go
well. Data assets need to be examined and brought up to standards,
which includes keeping them current.
The text recommends setting up master copies of tables that are used in
multiple systems. Updates should be made to the master copy, and the other
necessary copies should then update from the master. This controls redundancy
and sets a standard that other tables should follow.
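As a rough illustration of the master-copy idea, here is a minimal
sketch, assuming a PostgreSQL-style DBMS with the postgres_fdw extension;
the PRODUCT table, server names, and credentials are hypothetical, not
from the text. A site keeps a local, read-only copy of a table whose
master lives on another server, and refreshes it on a schedule.

    -- On a local site: define a connection to the server that holds
    -- the master copy (hypothetical host and database names).
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER master_site
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'master.example.com', dbname 'erp');
    CREATE USER MAPPING FOR CURRENT_USER SERVER master_site
        OPTIONS (user 'replica_reader', password 'secret');

    -- A foreign table that reads the master copy directly.
    CREATE FOREIGN TABLE product_master (
        prod_id    INT,
        prod_name  VARCHAR(50),
        prod_price NUMERIC(9,2)
    ) SERVER master_site OPTIONS (table_name 'product');

    -- The local copy is updated only by refreshing from the master,
    -- never by local writes, which keeps the redundancy controlled.
    CREATE MATERIALIZED VIEW product_local AS
        SELECT * FROM product_master;
    REFRESH MATERIALIZED VIEW product_local;  -- run on a schedule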
Advantages and Disadvantages of a DDBMS
The text presents table 12.1 as its summary of advantages and disadvantages
of using a distributed database management system. The assumption is that,
like Google and Amazon, anyone on the Internet is faced with the reality
of dealing with a global market.
Selected advantages:
- Having distributed copies
of your data reduces the number of hits on each copy.
- Predicting where to put the copies, based on network and database
statistics, leads to better access for the end users.
- Adding new sites to the system is probably less costly than upgrading
the central site in the old model.
- There is less danger from a single failure, since replication of the
data makes continued access possible.
Selected disadvantages:
- A DDBMS is more complicated, especially if the data is just segmented
instead of fully replicated (completely copied).
- Updates to the replicas must take place rapidly.
- Security for multiple locations and devices is necessary.
- Protocols for a DDBMS vary from one vendor to another, so there may
be no common standards.
- Increased costs are incurred in training and in setup.
Distributed Processing and Distributed Databases
The text makes a distinction between the two concepts this section is
about. In the context of this chapter, distributed
processing is computer processing for a database that takes place
on two or more computers, operating at different locations. The text suggests
as examples that data retrieval might be done by one computer and report
generation by another. This should tell you that distributed processing
requires a network to connect the variously tasked computers.
The topic discussed so far in this chapter is distributed
database, which means that the database itself exists in two or
more locations, either in part or in full. The thing to know next is that
in a distributed database environment there will be processing at each
location, so if you have a distributed database, you also
have to have some distributed processing. Each location will have its
own database process. The part
or copy of the database that exists at each location is called a database
fragment.
Characteristics of a DDBMS
On page 559, the text lists a dozen characteristics of a DDBMS, as well
as eight functions that any DBMS must perform. The characteristics:
- an application interface between
the user, applications, and other DBMSs
- validation of requests for
syntax errors
- transformation from complex
requests to requests with atomicity
- query optimization to access
the fewest files and fragments, to create the least need for fragment
synchronization
- mapping to determine and retain
the locations of local and remote fragments
- I/O interface to read and
write to long term storage
- formatting to meet user needs
or application needs
- security to provide authenticated
access
- backup and recovery
features for fragments
- an administrator interface
- concurrency processes to provide
a consistent database
- transaction management to
provide a consistent state, and to synchronize fragments when necessary
The eight functions of a DBMS are mostly covered by the list above:
- Receive requests from users and applications. Covered by 1 above.
- Validate, analyze, and decompose the requests. Covered by 2 and 3
above.
- Map the request from logical to physical components. Covered by 5
above.
- Decompose the requests into atomic individual disk operations. Covered
by 3 and 6 above.
- Search for, locate, read, and validate data. Not specifically stated
above.
- Ensure consistency, security, and integrity. Covered by 8 and 11 above.
- Validate data for the conditions in the request. This just means
applying the conditions in SQL requests to the data.
- Present the data in a requested format. Covered by 7 above.
Components of a DDBMS
On page 560, the text gives you five bullet points for components of
a DDBMS. There is actually a sixth one on the next page.
- Network workstations or devices that define the network
- Network hardware and software, such as switches, NICs, and client
software
- Communications media, such as network cable, wireless access points,
or both.
- This bullet and the next are confusing. Look at the illustration on
page 561 as you read. A transaction
processor (TP) is defined
as the software component on a device that is requesting
data. All but two of the devices in the illustration are end user devices.
All of those have TPs.
- A data processor (DP)
is defined as the software component on a device that stores
and retrieves data.
- The sixth component would be the protocols
used by the DDBMS network.
In the discussion of protocols, the text brings up a new concept: a
DP in the scenario above could be an independent DBMS that processes
requests from devices on the DDBMS network. The DDBMS network
could be a network of cooperative DBMSs.
Levels of Data and Process Distribution
This section of the chapter approaches the subject gently, giving us
three scenarios to consider that go from simplest to most complex:
- Single-site Processing, Single-site Data (SPSD) - Simplest concept,
only uses one computer to store and process the database. The illustration
on page 562 shows a central computer providing this service to three
dumb terminals.
- Multiple-site Processing, Single-site Data (MPSD) - The illustration
on page 563 shows three TPs, identified as being at different sites,
accessing data from a single site. All processing in this scenario takes
place at the remote sites. Data is the only thing kept on the central
file server. This makes for heavy file transfers, since selection within
the files is taking place on the TPs.
- Multiple-site Processing, Multiple-site Data (MPMD) - There is no
illustration for this scenario, but we can use the one on page 561.
This is how the concept of DDBMS was explained in the beginning of the
chapter. It requires the most complicated hardware and software, but
since everything is shared and accessible, and processing takes place
in multiple locations, it is the best of the three concepts for a large
network.
DDBMS Transparency Features
The next section of the chapter discusses five types of transparency
that give the end user the illusion of working with a single local
DBMS. Each of them has its own feature set:
- Distribution transparency - This feature hides the distributed
nature of the data from the end user issuing commands. The
text explains that this can be implemented at one of three levels, which
are simply measures of how much is hidden:
- Fragmentation transparency - At this level, neither
a user nor a programmer would have to reference the database fragment
(by name or location) in which a particular data structure
resides. This is high transparency.
- Location transparency - At this level, a user or a programmer
would have to reference the database fragment by name but not
by location. This is medium transparency.
- Local mapping transparency - At this level, a user or a
programmer would have to reference the database fragment by name
and by location. This is low transparency, which is not
very transparent.
The text provides an example of the grammar that might be used in
each of the three cases. In those examples the word "NODE" is only
symbolic, not the actual word that would be used. A sketch in the
same spirit appears after this list.
This feature is supported through shared documentation, either a
distributed data dictionary (DDD) or a distributed
data catalog (DDC), which document the fragments at each
location.
- Transaction transparency - This feature allows a transaction
to update data at more than one location, as may be required for consistency
of the database. This involves protocols to update the database locally
and remotely, and to roll back or commit transactions in both locations.
Vocabulary for this feature:
- remote request - allows a single command to be carried
out by one remote data processor
- remote transaction - allows a set of commands (a transaction)
to be carried out by one remote data processor
- distributed request - allows one command to be carried
out by more than one remote data processor
- distributed transaction - allows a set of commands (a transaction)
to be carried out by more than one remote data processor
The text points out that concurrency is more complicated in the
scenarios above, since coordination is required across the distributed
DPs. The coordination is done by the Transaction Processor using a two-phase
commit protocol (2PC). The illustration on page 572 shows
the problem that could occur if this did not work. A sketch of the
two phases appears after this list.
Each DP has to have its own transaction log so that it can roll back
a transaction that cannot be completed on another DP. Log entries are
written with a write-ahead protocol that writes the intended
change, as well as the current state of the objects about to be changed
before an action is committed. If a commit fails, the actions that may
have been committed are rolled back with the information in the necessary
logs.
- Failure transparency - This is a redundancy feature
of the distributed service, continuing to provide service if one of
the data fragments is inaccessible.
- Performance transparency - This is a rather optimistic concept:
that the system will not be slowed down by distance, by fragmentation
itself, or by translation from one platform to another. In case of a
slow link, or an unresponsive segment, the network will access an alternative,
replicated copy of the data elsewhere. This leads to replica
transparency, which means that the data replica being used for a
transaction is not important to the user, since all replicas of the
same data will be synchronized as soon as possible.
- Heterogeneity transparency - This one is harder to spell, so
it must be important. It means that the DDBMS will appear to any user
as a single system, even when it is actually a federation of systems
of different types: network, relational, or hierarchical. In this sense,
"federate" means to give trust to and receive trust from another system.
Distributed Design
The text has been talking about database fragments for many pages. It
begins to discuss particulars about them on page 575. The first consideration
is how to create fragments. Three fragmentation methods are listed:
- Horizontal - A table is broken into multiple sets of rows,
each set being unique rows, but having the same attributes. This preserves
the horizontal nature of each row. The text points out that this is
like storing the result of multiple SELECT statements in different
locations. (Both this technique and the next are sketched after this list.)
- Vertical - A table is broken into multiple sets of attributes,
each set having those attributes and the same key attribute.
The text points out that this is like storing the result of multiple
PROJECT statements in different locations.
- Mixed - Both techniques are used, often with very large tables.
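A minimal sketch of the first two techniques, using a hypothetical
CUSTOMER table; the column names and states are invented for illustration.

    -- Horizontal fragments: same columns, different sets of rows,
    -- like the results of two SELECT statements stored at two sites.
    CREATE TABLE cust_e AS
        SELECT * FROM customer WHERE cus_state IN ('OH', 'PA');
    CREATE TABLE cust_w AS
        SELECT * FROM customer WHERE cus_state IN ('CA', 'OR');

    -- Vertical fragments: different columns, with the key attribute
    -- repeated in each, like the results of two PROJECT operations.
    CREATE TABLE cust_names   AS SELECT cus_id, cus_name    FROM customer;
    CREATE TABLE cust_balance AS SELECT cus_id, cus_balance FROM customer;

    -- The original table can be rebuilt by a UNION of the horizontal
    -- fragments, or by a JOIN of the vertical fragments on cus_id.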
The next topic is data replication. Having a single copy of a
fragment is not the best plan, as noted above in the discussion of performance.
We already know we should place a segment near the users most likely to
need it. We should also plan to place a replica far enough away to survive
a disaster, but close enough to be useful in a mere partial outage. In
the illustration on page 579, you see a system with two segments, stored
in three locations. Segment 1 is stored at site 1, and a replica is stored
at site 2. Segment 2 is stored at site 3, and a replica of it is also
stored at site 2. If any single DP or single fragment replica goes off
line, the system will failover to the other copy of the data.
The text discusses two methods of synchronization to maintain
consistency in the database:
- Push replication - When a change is made to one fragment (original
or replica), that change is pushed right away to all copies in the network.
This is more consistent, but data may be unavailable during a push.
(Both styles are sketched after this list.)
- Pull replication - Instead of pushing changes constantly, the
replica having a change sends a message to its copies that a change
has been made. The copies receiving the messages decide when to pull
the change to themselves. This makes data more available, but that data
may be out of date.
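Here is a hedged sketch of the difference, again assuming PostgreSQL
(version 11 or later for this trigger syntax) and the hypothetical
customer tables from earlier; a real DDBMS would do this internally
rather than through user-visible triggers.

    -- Push: a trigger on the primary copy forwards each new row to a
    -- copy right away (customer_copy stands in for a remote table;
    -- updates and deletes would need similar triggers).
    CREATE FUNCTION push_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO customer_copy (cus_id, cus_name, cus_state, cus_balance)
        VALUES (NEW.cus_id, NEW.cus_name, NEW.cus_state, NEW.cus_balance);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customer_push
        AFTER INSERT ON customer
        FOR EACH ROW EXECUTE FUNCTION push_change();

    -- Pull: the replica decides when to re-synchronize, accepting that
    -- its data may be stale between refreshes (customer_local stands
    -- for a local materialized copy like the one sketched earlier).
    REFRESH MATERIALIZED VIEW customer_local;  -- run when the replica chooses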
Deciding to replicate data can take different paths. The text presents
three choices that cover most variations:
- Fully replicated - All segments are replicated in multiple
locations for the most redundancy and the most reliability, but this
requires the most updating across the network. This can also be called
a replicated data allocation plan.
- Partially replicated - Some segments are replicated at multiple
locations, reducing the update traffic, but also reducing ready availability
for the segments that are not replicated. Those should be the ones that
are used less often. This is likely to be chosen in a partitioned
data allocation plan.
- Unreplicated - No duplication of fragments. This puts your
faith in no accidents, no disasters, no communications failures, and
the little elves who keep the Internet running. This would be likely
to be chosen in a centralized data allocation plan.
The CAP Theorem
The author quotes another author, Dr. Eric Brewer, who created the CAP
theorem in the year 2000. Dr. Brewer published an update to the theorem
in 2012. The original theorem says you can meet two of the following
objectives at any given time, but not all three:
- Consistency - all copies of the data agree; this is sacrificed while
an update is still propagating from one fragment to another
- Availability - every request gets a response; this is sacrificed if
a node must wait for an update before it can answer
- Partition tolerance - the system keeps working even when a network
failure splits it and some nodes cannot reach others
In his 2012 article, Dr. Brewer clarifies that you can seek all three
properties, and have them whenever the network is not actually partitioned.
He believes, as does our author, that we should pursue CAP, along
with BASE (Basically Available, Soft state, Eventually consistent)
and ACID (Atomicity, Consistency, Isolation, Durability).
A distributed database will have to be dynamic, and will have to update
its replicas as long as it is a database that is going to be used.
The chapter ends with a list of 12 commandments from C. J. Date, which is
now over thirty years old, and so important that it does not appear in
the chapter summary. Let's leave it there, at the end of the chapter.