Last Update: October 14, 2011 6PM
Chapter 12: Client/Server Database and Distributed Database
Presenting the Chapter
This chapter discusses client/server database and distributed database arrangements. What the two have in common is that in both cases the database (or parts of it) are located at computers or sites other than the computer issuing database queries.
The first and shorter part of the chapter is on client/server database. The first major topic is the difference between the file server and database server approaches to processing queries. Next is location transparency. Last is a discussion of the two-tier and two variations of the three-tier approaches to organizing the placement of data on different arrangements of hardware.
The second and longer part of the chapter is on distributed database. The strategy in this section is to start with basic distributed database design concepts and gradually build up to more and more complex options, stressing the advantages and disadvantages that are added to the mix with each new complexity. So, the section starts by considering a centralized database on a network to level-set and for purposes of comparison with the distributed options to come. Then, the first concept in distributing data is to disperse the tables of a relational database at different sites on a network without permitting replication of the tables. Then, replication is introduced and progresses through three stages. First is the case in which every table is replicated at every site. Second is the case in which there is a copy of the entire database at one site with selected tables replicated at other selected sites on the network. Third and last is the general case in which tables are replicated and placed at sites that optimize overall performance.
The chapter continues with other distributed database issues. First is the issue of updating replicated tables. This includes the difference between asynchronous and synchronous update, strategies for asynchronous update, and the two-phase commit protocol for synchronous update. Further topics include distributed joins, partitioning, and distributed directory management. The chapter ends with a review and consolidation of the advantages and disadvantages of the different design options.
Discussion Stimulation Points
There are several interesting discussion points in this chapter. To begin with, describing the World Wide Web as a client/server system makes the client/server concept more real to the students. It gives them something they can identify with, in this regard. You can challenge them as to why the WWW is a giant client/server system and generate a good discussion.
In terms of the different variations of distributed database design, moving progressively from the simple to the more complex options and focusing on the advantages and disadvantages of each as you go, tends to keep the students attention.
An interesting and fun classroom exercise in this chapter is to execute a two-phase commit by choosing a few students in the room and assigning each to be a city on a distributed database network. Go through the steps of the two-phase commit (they are presented very clearly in the PowerPoint slides for the book) with you, the instructor, as the initiating site and the students as the sites with the replicated tables. Have each student take a piece of paper and have part of the paper represent the log and the other part the database. Have them write the update to the log at the appropriate time and then to the database at the appropriate time. Vary the situation with students at times being able and at times not being able to lock their tables in the prepare phase.
Finally, challenge your students on how best to do a distributed join. As with the two-phase commit, you can assign students to different cities. Give them tables or replicated copies of tables, specify table lengths, and then initiate queries that will require joins. Vary the nature of the queries so that different number of records will be required from different tables in the join and see how the students react to different ways of passing the data around to accomplish the join.
Answers to End-of-Chapter Questions
1. What is a client/server database system?
A client/server system is a collection of PCs on a local area network, all attached to a server and possibly other devices such as a printer. Each PC or client can communicate with the others, can access software on the server, and can use the shared peripherals. The term client/server database adds a database at the server that can be accessed by the clients.
2. Explain the database server approach to client/server database.
There is a database located at the server. The processing is split between the client and the server. A query is entered at the client and the client computer performs the initial keyboard and screen interaction processing, as well as initial syntax checking of the query. The system then ships the query over the LAN to the server where the query is actually run against the database. The server then returns the results of the query to the client.
3. What are the advantages of the database server approach to client/server database compared to the file server approach?
The network data traffic is reduced to a tolerable level, even for frequently queried databases. Also, security and concurrency control can be handled at the server in a much more contained way.
4. What is data transparency in client/server database? Why is it important?
In data or location transparency, the user issues a query at the client and the software first checks to see if the required data is on the PCs own hard drive. If it is not there, then the software automatically looks for it on the server. In the three-tier approach, if the software doesnt find the data on the client PCs hard drive or on the LAN server, it can leave the LAN through a gateway computer and look for the data on, for example, a large, mainframe computer. This is important because the user at the client PC doesnt have to know where the data is to retrieve it.
5. Compare the two-tier arrangement of client/server database to the three-tier arrangement.
There are two uses of the term three-tier client/server system. In one, the comparison between two-tier and three-tier is that in the three-tier arrangement there is a third level of data storage beyond the PCs hard drive and the LAN server, for example a distant mainframe. In the other use of the term, the servers database is attached to the LAN server in the two-tier arrangement but in the three-tier arrangement there is an application server and then a separate database server, which has the database attached to it.
6. What is a distributed database? What is a distributed database management system?
In a distributed database, different parts of the database are located at various cities on a distributed network, each of which has its own computer and data storage facilities. All of this distributed data is still considered to be a single logical database. A distributed database management system is software that manages the distributed database environment as a single logical database.
7. Why would a company be interested in moving from the centralized to the distributed database approach?
The distributed database approach has several advantages including local data autonomy, reduction of telecommunications costs, and improved data availability (if replication is allowed.)
8. What are the advantages of locating a portion of a database in the city in which it is most frequently used?
To avoid most telecommunications costs regarding the data and to have the personnel who most frequently use the data also be responsible for managing it.
9. What are the advantages and disadvantages of data replication in a distributed database?
The advantages are improved data availability and reduced telecommunications costs. The disadvantages are increased security exposure and potential data integrity problems.
10. Describe the concept of asynchronous updating of replicated data. For what kinds of applications would it work or not work?
In asynchronous updating of replicated data, one copy of the data can be updated and there can be a delay in the other copies being updated. This works if the nature of the data and of the applications that use it can tolerate retrieved data not necessarily being up-to-the-minute accurate. It does not work if all copies of the data must be up-to-date at all times.
11. Describe the two-phase commit approach to updating replicated data.
In the two-phase commit approach, each computer on the network has a special log file in addition to its database tables. When an update is to be made at one site, the distributed DBMS does several things. In the prepare phase, the initiating site sends the update out to all of the other sites with the table copies, where the update is written to each sites log file. Each site attempts to freeze its copy of the table in the database. Each site reports back to the initiating site. If all sites report back that they were able to successfully freeze the tables, then in the commit phase the initiating site sends out the order to each site involved to copy the update from its log file to its database. Otherwise the initiating site aborts the process.
12. Describe the factors used in deciding how to accomplish a particular distributed join.
The factors include the distances and telecommunications costs among the sites with the needed tables and the querying site, and the numbers and sizes of the records that are involved in the join.
13. Describe horizontal and vertical partitioning in a distributed database.
In horizontal partitioning, rows of a table are grouped and different such partitions are located at different sites on the distributed network. In vertical partitioning, the columns of a table are subdivided and different partitions, groups of columns, are located at different sites on the distributed network. In vertical partitioning, each partition must have a copy of the tables primary key.
14. What are the advantages and disadvantages of horizontal partitioning in a distributed database?
The advantages are in local autonomy and reduced telecommunications costs by having the data partitioned so that records most frequently used at particular sites are stored at those sites. The disadvantage is that when a query requires records stored at more than one site, the DBMS must collect them from those sites.
15. What are the advantages and disadvantages of vertical partitioning in a distributed database?
The advantages are in local autonomy and reduced telecommunications costs by having the data partitioned so that columns most frequently used at particular sites are stored at those sites. The disadvantage is that when a query requires columns stored at more than one site, the DBMS must collect them from those sites.
16. What is the purpose of a directory in a distributed database? Where should the directory be located?
The distributed directory tells the distributed DBMS where the tables and replicated copies of tables are stored on the distributed network. Generally, the directory should be located at every node on the network because it is not very volatile.
17. Discuss the problem of directory management for distributed database. Do you think that as an issue, it is more critical, less critical, or about the same as the distribution of the data itself? Explain.
Directory management is not as much of an issue as data management because the directory is generally non-volatile.
Solutions to End-of-Chapter Exercises
1. Australian Boomerang, Ltd., wants to design a distributed relational database. The company is headquartered in Perth and has major operations in Sydney, Melbourne, and Darwin. The database involved consists of five tables, labeled A, B, C, D, and E, with the following characteristics:
Table A consists of 500,000 records and is heavily used in Perth and Sydney.
Table B consists of 100,000 records and is frequently required in all four cities.
Table C consists of 800 records and is frequently required in all four cities.
Table D consists of 75,000 records. Records 1-30,000 are most frequently used in Sydney. Records 30,001-75,000 are most frequently used in Melbourne.
Table E consists of 20,000 records and is used almost exclusively in Perth.
Design a distributed relational database for Australian Boomerang. Justify your placement, replication, and partitioning of the tables.
Table A should be replicated with a copy in Perth and a copy in Sydney. This will minimize telecommunications costs and improve availability. Since Table B is used in all four cities, it should be replicated for availability purposes. If it is not particularly volatile we could consider placing a copy of it in each city. If it is somewhat volatile we might have just two copies, placed in any two of the cities. If it is very volatile then we have to weigh the costs of synchronous or asynchronous update against the benefits of availability when considering replication. Since Table C is so small, we might lean towards minimal replication because the costs of shipping its data from one site to another, for direct retrievals or for distributed joins, would be low. Table D would certainly be a candidate for partitioning, with one partition in Sydney and one in Melbourne, as indicated. Since Table E is used almost exclusively in Perth, it should not be replicated and should be stored only in Perth.
2. Canadian Maple Trees, Inc., has a distributed relational database with tables in computers in Halifax, Montreal, Ottawa, Toronto, and Vancouver. The database consists of twelve tables, some of which are replicated in multiple cities. Among them are tables A, B, and C, with the following characteristics.
Table A consists of 800,000 records and is located in Halifax, Montreal, and Vancouver.
Table B consists of 100,000 records and is located in both Halifax and Toronto.
Table C consists of 20,000 records and is located in Ottawa and Vancouver.
Telecommunications costs among Montreal, Ottawa, and Toronto are relatively low, while telecommunications costs between those three cities and Halifax and Vancouver are relatively high.
A query is issued from Montreal that requires a join of tables A, B, and C. The query involves a single record from table A, 20 records from table B, and an undetermined number of records from table C.
Develop and justify a plan for solving this query.
The straightforward answer to this question is to ship the one record involved from Table A, from Montreal to Ottawa; to ship the 20 records involved from Table B from Toronto to Ottawa; to perform the join in Ottawa and to send the result back to Montreal. This makes sense since the number of records needed from Table C is undetermined, the number of records from Tables A and B are very small, and the communications costs among these three cities are low compared to the costs of going to Halifax or Vancouver.
Solutions to End-of-Chapter Minicases
1. Consider the Happy Cruise Lines relational database of Minicase 5.1. The company has decided to reconfigure this database as a distributed database among its major locations: New York, which is its headquarters, and its other major U.S. ports: Miami, Los Angeles, and Houston. Distributed and replicated among these four locations, the tables have the following characteristics:
SHIP consists of 20 records and is used in all four cities.
CRUISE consists of 4,000 records. CRUISE records are used most heavily in the cities from which the cruise described in the record began.
PORT consists of 42 records. The records that describe Atlantic Ocean ports are used most heavily in New York and Miami. The records that describe Caribbean Sea ports are used most heavily in Houston and Miami. The records that describe Pacific Ocean ports are used most heavily in Los Angeles.
VISIT consists of 15,000 records and is primarily used in New York and Los Angeles.
PASSENGER consists of 230,000 records and is primarily used in New York and Los Angeles.
VOYAGE consists of 720,000 records and is used in all four cities.
Design a distributed relational database for Happy Cruise Lines. Justify your placement, replication, and partitioning of the tables.
Since SHIP is so small, is used in all four cities, and is only rarely updated (when the cruise line buys new ships) it should be replicated at each of the four sites. The CRUISE table is a candidate for partitioning since the records are used most heavily in the cities from which the cruise described in the record began. If the New York headquarters frequently needs to access the entire CRUISE table, as would seem likely, then a copy of the entire table should be kept in New York. Based on the description of the PORT table, at first glance it might be a candidate for partitioning. But, since it is so small and so non-volatile, it would probably be best just to keep a copy of the entire table at every site. The large VISIT table should clearly be replicated in the two cities in which it used heavily, New York and Los Angeles, for performance and availability purposes. The same is true of the even larger PASSENGER table. The large VOYAGE table, which is used in all four cities should be replicated for availability purposes. The number of replicated copies would depend on its volatility and response time requirements. The more volatile the fewer copies should be made.
2. Consider the Super Baseball League relational database of Minicase 5.2. The league has decided to organize its database as a distributed database with replicated tables. The nodes on the distributed database will be Chicago (the leagues headquarters), Atlanta, San Francisco (where the league personnel office is located), and Dallas. The tables have the following characteristics:
TEAM consists of 20 records and is located in Chicago and Atlanta.
COACH consists of 85 records and is located in San Francisco and Dallas.
WORKEXP consists of 20,000 records and is located in San Francisco and Dallas.
BATS consists of 800,000 records and is located in Chicago and Atlanta.
PLAYER consists of 100,000 records and is located in San Francisco and Atlanta.
AFFILIATION consists of 20,000 records and is located in Chicago and San Francisco.
STADIUM consists of 20 records and is located only in Chicago.
Assume that telecommunications costs among the cities are all about the same.
Develop and justify a plan for solving the following queries:
a. A query is issued from Chicago to get a list of all of the work experience of all of the coaches on the Dodgers.
This query requires the TEAM and WORKEXP tables. There is no city that has both of these tables so it requires a distributed join. Since the query is issued in Chicago and the TEAM table is located there, the team number can be discerned (from the team name Dodgers in the single record with the team name value Dodgers) in Chicago and shipped to either San Francisco or Dallas where the join can be completed with the WORKEXP table.
b. A query is issued from Atlanta to get a list of the names of the coaches who work for the team based at Smith Memorial Stadium.
This query requires the STADIUM and COACH tables, neither of which is at Atlanta where the query was issued. The STADIUM table is only in Chicago but the COACH table is not there. Since only one record of the STADIUM table can be involved, the way to solve this query is to send the single record needed from the STADIUM table from Chicago to Dallas or San Francisco where the join with the COACH table can take place.
c. A query is issued from Dallas to find the names of all of the players who have compiled a batting average of at least .300 while playing on the Dodgers.
This query requires the TEAM, AFFILIATION, and PLAYER tables. None of them are in Dallas where the query was issued and there is no one site that has all three tables. San Francisco, Chicago, and Atlanta each have two of the three tables, in different combinations. Since only one of the TEAM records can be involved in the query (there is only one record in the TEAM table for Dodgers) the best solution would be to start at either Chicago or Atlanta, get the TEAM record for Dodgers, and send it to San Francisco where both the PLAYER and AFFILIATION tables are located. The join can then be performed in San Francisco.
Once the order is placed, the order will be delivered to your email less than 24 hours, mostly within 4 hours.
If you have questions, you can contact us here