CIS 2210 - Database Management and Design
Chapter 12: Distributed Database Management Systems
Objectives:
This lesson discusses material from chapter 12. Objectives important
to this lesson:
- Distributed database management systems
- Advantages and disadvantages
- Distributed processing and distributed databases
- Characteristics of a DDBMS
- Components of a DDBMS
- Levels of distribution
- Transparency features
- Distributed design
- CAP theorem
Concepts:
Chapter 12
Distributed Database Management Systems
The chapter starts slowly, gradually telling us that when a database
has end users who are separated across large distances, their experience
suffers if the database is only located on one central
device. Access may be requested across a company network or across the
Internet. Access may be requested from computers, tablets, phones, or
other devices that may use wired or wireless access.
On page 556, the text provides a list of problems users encounter if
there is only one, distant copy of the data they need:
- performance degradation due to bandwidth bottlenecks
- increased costs of an owned infrastructure
- reliability problems with a single point of failure affecting the
entire system
- scalability problems, with the central database unable to keep up
with increasing access requests
- organizational rigidity, a symptom of which is reliance on a single
centralized solution
Some of the problems above affect the entire enterprise. Standardization
becomes more important the larger the organization becomes. You can't
just start out with good intentions and trust that everything will go
well. Data assets need to be examined and brought up to standards,
which includes keeping them current.
The text recommends setting up master copies of tables that are used in
multiple systems. Updates should be made to the master copy, and the other
necessary copies should then update from the master. This controls redundancy
and sets a standard that other tables should follow.
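As a rough illustration of the master-copy idea, here is a minimal
sketch, assuming a PostgreSQL-style DBMS with the postgres_fdw extension;
the PRODUCT table, server names, and credentials are hypothetical, not
from the text. A site keeps a local, read-only copy of a table whose
master lives on another server, and refreshes it on a schedule.

    -- On a local site: define a connection to the server that holds
    -- the master copy (hypothetical host and database names).
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;
    CREATE SERVER master_site
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'master.example.com', dbname 'erp');
    CREATE USER MAPPING FOR CURRENT_USER SERVER master_site
        OPTIONS (user 'replica_reader', password 'secret');

    -- A foreign table that reads the master copy directly.
    CREATE FOREIGN TABLE product_master (
        prod_id    INT,
        prod_name  VARCHAR(50),
        prod_price NUMERIC(9,2)
    ) SERVER master_site OPTIONS (table_name 'product');

    -- The local copy is updated only by refreshing from the master,
    -- never by local writes, which keeps the redundancy controlled.
    CREATE MATERIALIZED VIEW product_local AS
        SELECT * FROM product_master;
    REFRESH MATERIALIZED VIEW product_local;  -- run on a schedule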
Advantages and Disadvantages of a DDBMS
The text presents table 12.1 as its summary of advantages and disadvantages
of using a distributed database management system. The assumption is that,
like Google and Amazon, anyone on the Internet is faced with the reality
of dealing with a global market.
Selected advantages:
- Having distributed copies
of your data reduces the number of hits on each copy.
- Predicting where to put the copies, based on network and database
statistics, leads to better access for the end users.
- Adding new sites to the system is probably less costly than upgrading
the central site in the old model.
- There is less danger from a single failure, since replication of the
data makes continued access possible.
Selected disadvantages:
- A DDBMS is more complicated, especially if the data is just segmented
instead of fully replicated (completely copied).
- Updates to the replicas must take place rapidly.
- Security for multiple locations and devices is necessary.
- Protocols for a DDBMS vary from one vendor to another, so there may
be no common standards.
- Increased costs are incurred in training and in setup.
Distributed Processing and Distributed Databases
The text makes a distinction between the two concepts this section is
about. In the context of this chapter, distributed
processing is computer processing for a database that takes place
on two or more computers, operating at different locations. The text suggests
as examples that data retrieval might be done by one computer and report
generation by another. This should tell you that distributed processing
requires a network to connect the variously tasked computers.
The topic discussed so far in this chapter is distributed
database, which means that the database itself exists in two or
more locations, either in part or in full. The thing to know next is that
in a distributed database environment there will be processing at each
location, so if you have a distributed database, you also
have to have some distributed processing. Each location will have its
own database process. The part
or copy of the database that exists at each location is called a database
fragment.
Characteristics of a DDBMS
On page 559, the text lists a dozen characteristics of a DDBMS, as well
as eight functions that any DBMS must perform. The characteristics:
- an application interface between
the user, applications, and other DBMSs
- validation of requests for
syntax errors
- transformation from complex
requests to requests with atomicity
- query optimization to access
the fewest files and fragments, to create the least need for fragment
synchronization
- mapping to determine and retain
the locations of local and remote fragments
- I/O interface to read and
write to long term storage
- formatting to meet user needs
or application needs
- security to provide authenticated
access
- backup and recovery
features for fragments
- an administrator interface
- concurrency processes to provide
a consistent database
- transaction management to
provide a consistent state, and to synchronize fragments when necessary
The eight functions of a DBMS are mostly covered by the list above:
- Receive requests from users and applications. Covered by 1 above.
- Validate, analyze, and decompose the requests. Covered by 2 and 3
above.
- Map the request from logical to physical components. Covered by 5
above.
- Decompose the requests into atomic individual disk operations. Covered
by 3 and 6 above.
- Search for, locate, read, and validate data. Not specifically stated
above.
- Ensure consistency, security, and integrity. Covered by 8 and 11 above.
- Validate data for the conditions in the request. This just means
applying the conditions in SQL requests to the data.
- Present the data in a requested format. Covered by 7 above.
Components of a DDBMS
On page 560, the text gives you five bullet points for components of
a DDBMS. There is actually a sixth one on the next page.
- Network workstations or devices that define the network
- Network hardware and software, such as switches, NICs, and client
software
- Communications media, such as network cable, wireless access points,
or both.
- This bullet and the next are confusing. Look at the illustration on
page 561 as you read. A transaction
processor (TP) is defined
as the software component on a device that is requesting
data. All but two of the devices in the illustration are end user devices.
All of those have TPs.
- A data processor (DP)
is defined as the software component on a device that stores
and retrieves data.
- The sixth component would be the protocols
used by the DDBMS network.
In the discussion of protocols, the text brings up a new concept: a
DP in the scenario above could be an independent DBMS that processes
requests from devices on the DDBMS network. The DDBMS network
could be a network of cooperative DBMSs.
Levels of Data and Process Distribution
This section of the chapter approaches the subject gently, giving us
three scenarios to consider that go from simplest to most complex:
- Single-site Processing, Single-site Data (SPSD) - Simplest concept,
only uses one computer to store and process the database. The illustration
on page 562 shows a central computer providing this service to three
dumb terminals.
- Multiple-site Processing, Single-site Data (MPSD) - The illustration
on page 563 shows three TPs, identified as being at different sites,
accessing data from a single site. All processing in this scenario takes
place at the remote sites. Data is the only thing kept on the central
file server. This makes for heavy file transfers, since selection within
the files is taking place on the TPs.
- Multiple-site Processing, Multiple-site Data (MPMD) - There is no
illustration for this scenario, but we can use the one on page 561.
This is how the concept of DDBMS was explained in the beginning of the
chapter. It requires the most complicated hardware and software, but
since everything is shared and accessible, and processing takes place
in multiple locations, it is the best of the three concepts for a large
network.
DDBMS Transparency Features
The next section of the chapter discusses five types of transparency
that give the end user the illusion of working with a single local
DBMS. Each of them has its own feature set:
- Distribution transparency - This feature hides the distributed
nature of the data from the end user issuing commands. The
text explains that this can be implemented at one of three levels, which
are simply measures of how much is hidden:
- Fragmentation transparency - At this level, neither
a user nor a programmer would have to reference the database fragment
(by name or location) in which a particular data structure
resides. This is high transparency.
- Location transparency - At this level, a user or a programmer
would have to reference the database fragment by name but not
by location. This is medium transparency.
- Local mapping transparency - At this level, a user or a
programmer would have to reference the database fragment by name
and by location. This is low transparency, which is not
very transparent.
The text provides an example of the grammar that might be used in
each of the three cases. In those examples the word "NODE" is only
symbolic, not the actual word that would be used. A sketch in the
same spirit appears after this list.
This feature is supported through shared documentation, either a
distributed data dictionary (DDD) or a distributed
data catalog (DDC), which document the fragments at each
location.
- Transaction transparency - This feature allows a transaction
to update data at more than one location, as may be required for consistency
of the database. This involves protocols to update the database locally
and remotely, and to roll back or commit transactions in both locations.
Vocabulary for this feature:
- remote request - allows a single command to be carried
out by one remote data processor
- remote transaction - allows a set of commands (a transaction)
to be carried out by one remote data processor
- distributed request - allows one command to be carried
out by more than one remote data processor
- distributed transaction - allows a set of commands (a transaction)
to be carried out by more than one remote data processor
The text points out that concurrency is more complicated in the
scenarios above, since coordination is required across the distributed
DPs. The coordination is done by the Transaction Processor using a two-phase
commit protocol (2PC). The illustration on page 572 shows
the problem that could occur if this did not work. A sketch of the
two phases appears after this list.
Each DP has to have its own transaction log so that it can roll back
a transaction that cannot be completed on another DP. Log entries are
written with a write-ahead protocol that writes the intended
change, as well as the current state of the objects about to be changed
before an action is committed. If a commit fails, the actions that may
have been committed are rolled back with the information in the necessary
logs.
- Failure transparency - This is a redundancy feature
of the distributed service, continuing to provide service if one of
the data fragments is inaccessible.
- Performance transparency - This is a rather optimistic concept:
that the system will not be slowed down by distance, by fragmentation
itself, or by translation from one platform to another. In case of a
slow link, or an unresponsive segment, the network will access an alternative,
replicated copy of the data elsewhere. This leads to replica
transparency, which means that the data replica being used for a
transaction is not important to the user, since all replicas of the
same data will be synchronized as soon as possible.
- Heterogeneity transparency - This one is harder to spell, so
it must be important. It means that the DDBMS will appear to any user
as a single system, even when it is actually a federation of systems
of different types: network, relational, or hierarchical. In this sense,
"federate" means to give trust to and receive trust from another system.
Distributed Design
The text has been talking about database fragments for many pages. It
begins to discuss particulars about them on page 575. The first consideration
is how to create fragments. Three fragmentation methods are listed:
- Horizontal - A table is broken into multiple sets of rows,
each set being unique rows, but having the same attributes. This preserves
the horizontal nature of each row. The text points out that this is
like storing the result of multiple SELECT statements in different
locations. (Both this technique and the next are sketched after this list.)
- Vertical - A table is broken into multiple sets of attributes,
each set having those attributes and the same key attribute.
The text points out that this is like storing the result of multiple
PROJECT statements in different locations.
- Mixed - Both techniques are used, often with very large tables.
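A minimal sketch of the first two techniques, using a hypothetical
CUSTOMER table; the column names and states are invented for illustration.

    -- Horizontal fragments: same columns, different sets of rows,
    -- like the results of two SELECT statements stored at two sites.
    CREATE TABLE cust_e AS
        SELECT * FROM customer WHERE cus_state IN ('OH', 'PA');
    CREATE TABLE cust_w AS
        SELECT * FROM customer WHERE cus_state IN ('CA', 'OR');

    -- Vertical fragments: different columns, with the key attribute
    -- repeated in each, like the results of two PROJECT operations.
    CREATE TABLE cust_names   AS SELECT cus_id, cus_name    FROM customer;
    CREATE TABLE cust_balance AS SELECT cus_id, cus_balance FROM customer;

    -- The original table can be rebuilt by a UNION of the horizontal
    -- fragments, or by a JOIN of the vertical fragments on cus_id.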
The next topic is data replication. Having a single copy of a
fragment is not the best plan, as noted above in the discussion of performance.
We already know we should place a segment near the users most likely to
need it. We should also plan to place a replica far enough away to survive
a disaster, but close enough to be useful in a mere partial outage. In
the illustration on page 579, you see a system with two segments, stored
in three locations. Segment 1 is stored at site 1, and a replica is stored
at site 2. Segment 2 is stored at site 3, and a replica of it is also
stored at site 2. If any single DP or single fragment replica goes off
line, the system will failover to the other copy of the data.
The text discusses two methods of synchronization to maintain
consistency in the database:
- Push replication - When a change is made to one fragment (original
or replica), that change is pushed right away to all copies in the network.
This is more consistent, but data may be unavailable during a push.
(Both styles are sketched after this list.)
- Pull replication - Instead of pushing changes constantly, the
replica having a change sends a message to its copies that a change
has been made. The copies receiving the messages decide when to pull
the change to themselves. This makes data more available, but that data
may be out of date.
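Here is a hedged sketch of the difference, again assuming PostgreSQL
(version 11 or later for this trigger syntax) and the hypothetical
customer tables from earlier; a real DDBMS would do this internally
rather than through user-visible triggers.

    -- Push: a trigger on the primary copy forwards each new row to a
    -- copy right away (customer_copy stands in for a remote table;
    -- updates and deletes would need similar triggers).
    CREATE FUNCTION push_change() RETURNS trigger AS $$
    BEGIN
        INSERT INTO customer_copy (cus_id, cus_name, cus_state, cus_balance)
        VALUES (NEW.cus_id, NEW.cus_name, NEW.cus_state, NEW.cus_balance);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customer_push
        AFTER INSERT ON customer
        FOR EACH ROW EXECUTE FUNCTION push_change();

    -- Pull: the replica decides when to re-synchronize, accepting that
    -- its data may be stale between refreshes (customer_local stands
    -- for a local materialized copy like the one sketched earlier).
    REFRESH MATERIALIZED VIEW customer_local;  -- run when the replica chooses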
Deciding to replicate data can take different paths. The text presents
three choices that cover most variations:
- Fully replicated - All segments are replicated in multiple
locations for the most redundancy and the most reliability, but this
requires the most updating across the network. This can also be called
a replicated data allocation plan.
- Partially replicated - Some segments are replicated at multiple
locations, reducing the update traffic, but also reducing ready availability
for the segments that are not replicated. Those should be the ones that
are used less often. This is likely to be chosen in a partitioned
data allocation plan.
- Unreplicated - No duplication of fragments. This puts your
faith in no accidents, no disasters, no communications failures, and
the little elves who keep the Internet running. This would be likely
to be chosen in a centralized data allocation plan.
The CAP Theorem
The author quotes another author, Dr. Eric Brewer, who created the CAP
theorem in the year 2000. Dr. Brewer published an update to the theorem
in 2012. The original theorem says you can meet two of the following
objectives at any given time, but not all three:
- Consistency - all copies of the data agree; this is sacrificed while
an update is still propagating from one fragment to another
- Availability - every request gets a response; this is sacrificed if
a node must wait for an update before it can answer
- Partition tolerance - the system keeps working even when a network
failure splits it and some nodes cannot reach others
In his 2012 article, Dr. Brewer clarifies that you can seek all three
properties, and have them whenever the network is not actually partitioned.
He believes, as does our author, that we should pursue CAP, along
with BASE (Basically Available, Soft state, Eventually consistent)
and ACID (Atomicity, Consistency, Isolation, Durability).
A distributed database will have to be dynamic, and will have to update
its replicas as long as it is a database that is going to be used.
The chapter ends with a list of 12 commandments from C. J. Date, which is
now over thirty years old, and so important that it does not appear in
the chapter summary. Let's leave it there, at the end of the chapter.