

Database Systems: A Practical Approach to Design, Implementation, and Management
Fourth Edition
Thomas Connolly • Carolyn Begg

Over 200,000 people have been grounded in good database design practice by reading Database Systems. The new edition of this best-seller brings it up to date with the latest developments in database technology and builds on the clear, accessible approach that has contributed to the success of previous editions.

Both Thomas Connolly and Carolyn Begg have experience of database design in industry, and now apply this in their teaching and research at the University of Paisley in Scotland. A clear introduction to design, implementation, and management issues, as well as an extensive treatment of database languages and standards, makes this book an indispensable, complete reference for database students and professionals alike.

Features
• Complex subjects are clearly explained using running case studies throughout the book.
• Database design methodology is explicitly divided into three phases: conceptual, logical, and physical. Each phase is described with an example of how it works in practice.
• SQL is comprehensively covered in three tutorial-style chapters.
• Distributed, object-oriented, and object-relational DBMSs are fully discussed.
• Check out the Web site at www.booksites.net/connbegg for full implementations of the case studies, lab guides for Access and Oracle, and additional student support.

New! For the fourth edition
• Extended treatment of XML, OLAP, and data mining.
• Coverage of updated standards including SQL:2003, W3C (XPath and XQuery), and OMG.
• Now covers Oracle9i and Microsoft Office Access 2003.

This book comes with a free six-month subscription to Database Place, an online tutorial that helps readers master the key concepts of database systems. Log on at www.aw.com/databaseplace.
A Companion Web site accompanies Database Systems, Fourth Edition, by Thomas Connolly and Carolyn Begg. Visit the Companion Web site at www.booksites.net/connbegg to find valuable learning material including, for students:
• Tutorials on selected chapters
• Sample StayHome database
• Solutions to review questions
• DreamHome web implementation
• Extended version of File Organizations and Indexes
• Access and Oracle Lab Manuals

INTERNATIONAL COMPUTER SCIENCE SERIES
Consulting Editor: A D McGettrick, University of Strathclyde

SELECTED TITLES IN THE SERIES
• Operating Systems, J Bacon and T Harris
• Programming Language Essentials, H E Bal and D Grune
• Programming in Ada 95 (2nd edn), J G P Barnes
• Java Gently (3rd edn), J Bishop
• Software Design (2nd edn), D Budgen
• Concurrent Programming, A Burns and G Davies
• Real-Time Systems and Programming Languages: Ada 95, Real-Time Java and Real-Time POSIX (3rd edn), A Burns and A Wellings
• Comparative Programming Languages (3rd edn), L B Wilson and R G Clark, updated by R G Clark
• Distributed Systems: Concepts and Design (3rd edn), G Coulouris, J Dollimore and T Kindberg
• Principles of Object-Oriented Software Development (2nd edn), A Eliëns
• Fortran 90 Programming, T M R Ellis, I R Philips and T M Lahey
• Program Verification, N Francez
• Introduction to Programming using SML, M Hansen and H Rischel
• Functional C, P Hartel and H Muller
• Algorithms and Data Structures: Design, Correctness, Analysis (2nd edn), J Kingston
• Introductory Logic and Sets for Computer Scientists, N Nissanke
• Human–Computer Interaction, J Preece et al.
• Algorithms: A Functional Programming Approach, F Rabhi and G Lapalme
• Ada 95 From the Beginning (3rd edn), J Skansholm
• C++ From the Beginning, J Skansholm
• Java From the Beginning (2nd edn), J Skansholm
• Software Engineering (6th edn), I Sommerville
• Object-Oriented Programming in Eiffel (2nd edn), P Thomas and R Weedon
• Miranda: The Craft of Functional Programming, S Thompson
• Haskell: The Craft of Functional Programming (2nd edn), S Thompson
• Discrete Mathematics for Computer Scientists (2nd edn), J K Truss
• Compiler Design, R Wilhelm and D Maurer
• Discover Delphi: Programming Principles Explained, S Williams and S Walmsley
• Software Engineering with B, J B Wordsworth

THOMAS M. CONNOLLY • CAROLYN E. BEGG
UNIVERSITY OF PAISLEY

Database Systems: A Practical Approach to Design, Implementation, and Management
Fourth Edition

Pearson Education Limited, Edinburgh Gate, Harlow, Essex CM20 2JE, England, and Associated Companies throughout the world.
Visit us on the World Wide Web at: www.pearsoned.co.uk

First published 1995
Second edition 1998
Third edition 2002
Fourth edition published 2005

© Pearson Education Limited 1995, 2005

The rights of Thomas M. Connolly and Carolyn E. Begg to be identified as authors of this work have been asserted by the authors in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP. The programs in this book have been included for their instructional value. They have been tested with care but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations nor does it accept any liabilities with respect to the programs. All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners. ISBN 0 321 21025 5 British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library Library of Congress Cataloguing-in-Publication Data A catalog record for this book is available from the Library of Congress 10 9 8 7 6 5 4 3 2 09 08 07 06 05 Typeset in 10/12pt Times by 35 Printed and bound in the United States of America To Sheena, for her patience, understanding, and love during the last few years. To our daughter, Kathryn, for her beauty and intelligence. To our happy and energetic son, Michael, for the constant joy he gives us. To our new child, Stephen, may he always be so happy. To my Mother, who died during the writing of the first edition. Thomas M. Connolly To Heather, Rowan, Calum, and David Carolyn E. Begg Brief Contents Preface Part 1 xxxiii Background 1 Chapter 1 Introduction to Databases 3 Chapter 2 Database Environment 33 The Relational Model and Languages 67 Chapter 3 The Relational Model 69 Chapter 4 Relational Algebra and Relational Calculus 88 Chapter 5 SQL: Data Manipulation 112 Chapter 6 SQL: Data Definition 157 Chapter 7 Query-By-Example 198 Chapter 8 Commercial RDBMSs: Office Access and Oracle 225 Part 2 Part 3 Chapter 9 Database Analysis and Design Techniques 279 Database Planning, Design, and Administration 281 Chapter 10 Fact-Finding Techniques 314 Chapter 11 Entity–Relationship Modeling 342 Chapter 12 Enhanced Entity–Relationship Modeling 371 Chapter 13 Normalization 387 Chapter 14 Advanced Normalization 415 viii | Brief Contents Part 4 Methodology 435 Chapter 15 Methodology – Conceptual Database Design 437 Chapter 16 Methodology – Logical Database Design for the Relational Model 461 Methodology – Physical Database Design for Relational Databases 494 Methodology – Monitoring and Tuning the Operational System 519 Chapter 17 Chapter 18 Part 5 Selected Database Issues 539 Chapter 19 Security 541 Chapter 20 Transaction Management 572 Chapter 21 Query Processing 630 Part 6 Distributed DBMSs and Replication 685 Chapter 22 Distributed DBMSs – Concepts and Design 687 Chapter 23 Distributed DBMSs – Advanced Concepts 734 Chapter 24 Replication and Mobile Databases 780 Part 7 Object DBMSs 801 Chapter 25 Introduction to Object DBMSs 803 Chapter 26 Object-Oriented DBMSs – Concepts 847 Chapter 27 Object-Oriented DBMSs – Standards and Systems 888 Chapter 28 Object-Relational DBMSs 935 Part 8 Web and DBMSs 991 Chapter 29 Web Technology and DBMSs 993 Chapter 30 Semistructured Data and XML 1065 Brief Contents Part 9 Business Intelligence | ix 1147 Chapter 
31 Data Warehousing Concepts 1149 Chapter 32 Data Warehousing Design 1181 Chapter 33 OLAP 1204 Chapter 34 Data Mining 1232 Appendices A 1247 Users’ Requirements Specification for DreamHome Case Study 1249 B Other Case Studies 1255 C File Organizations and Indexes (extended version on Web site) 1268 D When is a DBMS Relational? 1293 E Programmatic SQL (extended version on Web site) 1298 F Alternative ER Modeling Notations 1320 G Summary of the Database Design Methodology for Relational Databases 1326 H I Estimating Disk Space Requirements On Web site Example Web Scripts On Web site References 1332 Further Reading 1345 Index 1356 Contents Preface Part 1 Chapter 1 1.1 1.2 xxxiii Background 1 Introduction to Databases 3 Introduction Traditional File-Based Systems 1.2.1 File-Based Approach 1.2.2 Limitations of the File-Based Approach Database Approach 1.3.1 The Database 1.3.2 The Database Management System (DBMS) 1.3.3 (Database) Application Programs 1.3.4 Components of the DBMS Environment 1.3.5 Database Design: The Paradigm Shift 4 7 7 12 14 15 16 17 18 21 1.4 Roles in the Database Environment 1.4.1 Data and Database Administrators 1.4.2 Database Designers 1.4.3 Application Developers 1.4.4 End-Users 21 22 22 23 23 1.5 History of Database Management Systems 24 1.6 Advantages and Disadvantages of DBMSs 26 Chapter Summary Review Questions Exercises 31 32 32 Database Environment 33 The Three-Level ANSI-SPARC Architecture 2.1.1 External Level 34 35 1.3 Chapter 2 2.1 xii | Contents 2.2 2.3 2.4 2.5 2.6 Part 2 Chapter 3 3.1 3.2 3.3 3.4 2.1.2 Conceptual Level 2.1.3 Internal Level 2.1.4 Schemas, Mappings, and Instances 2.1.5 Data Independence Database Languages 2.2.1 The Data Definition Language (DDL) 2.2.2 The Data Manipulation Language (DML) 2.2.3 Fourth-Generation Languages (4GLs) Data Models and Conceptual Modeling 2.3.1 Object-Based Data Models 2.3.2 Record-Based Data Models 2.3.3 Physical Data Models 2.3.4 Conceptual Modeling Functions of a DBMS Components of a DBMS Multi-User DBMS Architectures 2.6.1 Teleprocessing 2.6.2 File-Server Architectures 2.6.3 Traditional Two-Tier Client–Server Architecture 2.6.4 Three-Tier Client–Server Architecture 2.6.5 Transaction Processing Monitors 36 36 37 38 39 40 40 42 43 44 45 47 47 48 53 56 56 56 57 60 62 Chapter Summary Review Questions Exercises 64 65 65 The Relational Model and Languages 67 The Relational Model 69 Brief History of the Relational Model Terminology 3.2.1 Relational Data Structure 3.2.2 Mathematical Relations 3.2.3 Database Relations 3.2.4 Properties of Relations 3.2.5 Relational Keys 3.2.6 Representing Relational Database Schemas Integrity Constraints 3.3.1 Nulls 3.3.2 Entity Integrity 3.3.3 Referential Integrity 3.3.4 General Constraints Views 3.4.1 Terminology 70 71 72 75 76 77 78 79 81 81 82 83 83 83 84 Contents Chapter 4 4.1 4.2 4.3 Chapter 5 5.1 5.2 5.3 | xiii 3.4.2 Purpose of Views 3.4.3 Updating Views 84 85 Chapter Summary Review Questions Exercises 86 87 87 Relational Algebra and Relational Calculus 88 The Relational Algebra 4.1.1 Unary Operations 4.1.2 Set Operations 4.1.3 Join Operations 4.1.4 Division Operation 4.1.5 Aggregation and Grouping Operations 4.1.6 Summary of the Relational Algebra Operations The Relational Calculus 4.2.1 Tuple Relational Calculus 4.2.2 Domain Relational Calculus Other Languages Chapter Summary Review Questions Exercises 89 89 92 95 99 100 102 103 103 107 109 110 110 111 SQL: Data Manipulation 112 Introduction to SQL 5.1.1 Objectives of SQL 5.1.2 History of SQL 5.1.3 Importance of SQL 5.1.4 Terminology 
Writing SQL Commands Data Manipulation 5.3.1 Simple Queries 5.3.2 Sorting Results (ORDER BY Clause) 5.3.3 Using the SQL Aggregate Functions 5.3.4 Grouping Results (GROUP BY Clause) 5.3.5 Subqueries 5.3.6 ANY and ALL 5.3.7 Multi-Table Queries 5.3.8 EXISTS and NOT EXISTS 5.3.9 Combining Result Tables (UNION, INTERSECT, EXCEPT) 5.3.10 Database Updates 113 113 114 116 116 116 117 118 127 129 131 134 138 139 146 147 149 Chapter Summary Review Questions Exercises 154 155 155 xiv | Contents Chapter 6 6.1 6.2 6.3 6.4 6.5 6.6 Chapter 7 7.1 7.2 7.3 SQL: Data Definition 157 The ISO SQL Data Types 6.1.1 SQL Identifiers 6.1.2 SQL Scalar Data Types 6.1.3 Exact Numeric Data Integrity Enhancement Feature 6.2.1 Required Data 6.2.2 Domain Constraints 6.2.3 Entity Integrity 6.2.4 Referential Integrity 6.2.5 General Constraints Data Definition 6.3.1 Creating a Database 6.3.2 Creating a Table (CREATE TABLE) 6.3.3 Changing a Table Definition (ALTER TABLE) 6.3.4 Removing a Table (DROP TABLE) 6.3.5 Creating an Index (CREATE INDEX) 6.3.6 Removing an Index (DROP INDEX) Views 6.4.1 Creating a View (CREATE VIEW) 6.4.2 Removing a View (DROP VIEW) 6.4.3 View Resolution 6.4.4 Restrictions on Views 6.4.5 View Updatability 6.4.6 WITH CHECK OPTION 6.4.7 Advantages and Disadvantages of Views 6.4.8 View Materialization Transactions 6.5.1 Immediate and Deferred Integrity Constraints Discretionary Access Control 6.6.1 Granting Privileges to Other Users (GRANT) 6.6.2 Revoking Privileges from Users (REVOKE) 158 158 159 160 164 164 164 166 166 167 168 168 169 173 174 175 176 176 177 179 180 181 181 183 184 186 187 189 189 191 192 Chapter Summary Review Questions Exercises 194 195 195 Query-By-Example 198 Introduction to Microsoft Office Access Queries Building Select Queries Using QBE 7.2.1 Specifying Criteria 7.2.2 Creating Multi-Table Queries 7.2.3 Calculating Totals Using Advanced Queries 199 201 202 204 207 208 Contents 7.4 Chapter 8 8.1 8.2 Part 3 Chapter 9 9.1 9.2 | xv 7.3.1 Parameter Query 7.3.2 Crosstab Query 7.3.3 Find Duplicates Query 7.3.4 Find Unmatched Query 7.3.5 Autolookup Query Changing the Content of Tables Using Action Queries 7.4.1 Make-Table Action Query 7.4.2 Delete Action Query 7.4.3 Update Action Query 7.4.4 Append Action Query 208 209 212 214 215 215 215 217 217 221 Exercises 224 Commercial RDBMSs: Office Access and Oracle 225 Microsoft Office Access 2003 8.1.1 Objects 8.1.2 Microsoft Office Access Architecture 8.1.3 Table Definition 8.1.4 Relationships and Referential Integrity Definition 8.1.5 General Constraint Definition 8.1.6 Forms 8.1.7 Reports 8.1.8 Macros 8.1.9 Object Dependencies Oracle9i 8.2.1 Objects 8.2.2 Oracle Architecture 8.2.3 Table Definition 8.2.4 General Constraint Definition 8.2.5 PL/SQL 8.2.6 Subprograms, Stored Procedures, Functions, and Packages 8.2.7 Triggers 8.2.8 Oracle Internet Developer Suite 8.2.9 Other Oracle Functionality 8.2.10 Oracle10g 226 226 227 228 233 234 236 238 239 242 242 244 245 252 255 255 261 263 267 271 271 Chapter Summary Review Questions 276 277 Database Analysis and Design Techniques 279 Database Planning, Design, and Administration 281 The Information Systems Lifecycle The Database System Development Lifecycle 282 283 xvi | Contents 9.3 9.4 9.9 9.10 Database Planning System Definition 9.4.1 User Views Requirements Collection and Analysis 9.5.1 Centralized Approach 9.5.2 View Integration Approach Database Design 9.6.1 Approaches to Database Design 9.6.2 Data Modeling 9.6.3 Phases of Database Design DBMS Selection 9.7.1 Selecting the DBMS Application Design 
9.8.1 Transaction Design 9.8.2 User Interface Design Guidelines Prototyping Implementation 285 286 287 288 289 289 291 291 292 293 295 296 299 300 301 303 304 9.11 Data Conversion and Loading 305 9.12 Testing 305 9.13 Operational Maintenance 306 9.14 CASE Tools 307 9.15 Data Administration and Database Administration 9.15.1 Data Administration 9.15.2 Database Administration 9.15.3 Comparison of Data and Database Administration 309 309 309 311 Chapter Summary Review Questions Exercises 311 313 313 Fact-Finding Techniques 314 When Are Fact-Finding Techniques Used? What Facts Are Collected? Fact-Finding Techniques 10.3.1 Examining Documentation 10.3.2 Interviewing 10.3.3 Observing the Enterprise in Operation 10.3.4 Research 10.3.5 Questionnaires Using Fact-Finding Techniques – A Worked Example 10.4.1 The DreamHome Case Study – An Overview 10.4.2 The DreamHome Case Study – Database Planning 315 316 317 317 317 319 319 320 321 321 326 9.5 9.6 9.7 9.8 Chapter 10 10.1 10.2 10.3 10.4 Contents | xvii 10.4.3 The DreamHome Case Study – System Definition 10.4.4 The DreamHome Case Study – Requirements Collection and Analysis 10.4.5 The DreamHome Case Study – Database Design 331 Chapter Summary Review Questions Exercises 340 341 341 Entity–Relationship Modeling 342 11.1 Entity Types 343 11.2 Relationship Types 11.2.1 Degree of Relationship Type 11.2.2 Recursive Relationship 346 347 349 11.3 Attributes 11.3.1 Simple and Composite Attributes 11.3.2 Single-Valued and Multi-Valued Attributes 11.3.3 Derived Attributes 11.3.4 Keys 350 351 351 352 352 11.4 Strong and Weak Entity Types 354 11.5 Attributes on Relationships 355 11.6 Structural Constraints 11.6.1 One-to-One (1:1) Relationships 11.6.2 One-to-Many (1:*) Relationships 11.6.3 Many-to-Many (*:*) Relationships 11.6.4 Multiplicity for Complex Relationships 11.6.5 Cardinality and Participation Constraints 356 357 358 359 361 362 11.7 Problems with ER Models 11.7.1 Fan Traps 11.7.2 Chasm Traps 364 364 365 Chapter Summary Review Questions Exercises 368 369 369 Enhanced Entity–Relationship Modeling 371 Specialization/Generalization 12.1.1 Superclasses and Subclasses 12.1.2 Superclass/Subclass Relationships 12.1.3 Attribute Inheritance 12.1.4 Specialization Process 12.1.5 Generalization Process 12.1.6 Constraints on Specialization/Generalization 372 372 373 374 374 375 378 Chapter 11 Chapter 12 12.1 332 340 xviii | Contents 12.2 12.3 Chapter 13 13.1 13.2 13.3 13.4 13.5 13.6 13.7 13.8 13.9 Chapter 14 14.1 14.2 14.3 14.4 14.5 12.1.7 Worked Example of using Specialization/Generalization to Model the Branch View of DreamHome Case Study Aggregation Composition 379 383 384 Chapter Summary Review Questions Exercises 385 386 386 Normalization 387 The Purpose of Normalization How Normalization Supports Database Design Data Redundancy and Update Anomalies 13.3.1 Insertion Anomalies 13.3.2 Deletion Anomalies 13.3.3 Modification Anomalies Functional Dependencies 13.4.1 Characteristics of Functional Dependencies 13.4.2 Identifying Functional Dependencies 13.4.3 Identifying the Primary Key for a Relation using Functional Dependencies The Process of Normalization First Normal Form (1NF) Second Normal Form (2NF) Third Normal Form (3NF) General Definitions of 2NF and 3NF 388 389 390 391 392 392 392 393 397 Chapter Summary Review Questions Exercises 412 413 413 Advanced Normalization 415 More on Functional Dependencies 14.1.1 Inference Rules for Functional Dependencies 14.1.2 Minimal Sets of Functional Dependencies Boyce–Codd Normal Form (BCNF) 14.2.1 Definition of Boyce–Codd 
Normal Form Review of Normalization up to BCNF Fourth Normal Form (4NF) 14.4.1 Multi-Valued Dependency 14.4.2 Definition of Fourth Normal Form Fifth Normal Form (5NF) 416 416 418 419 419 422 428 428 430 430 399 401 403 407 408 411 Contents Part 4 Chapter 15 15.1 15.2 15.3 Chapter 16 16.1 Chapter 17 17.1 17.2 17.3 | xix 14.5.1 Lossless-Join Dependency 14.5.2 Definition of Fifth Normal Form 430 431 Chapter Summary Review Questions Exercises 433 433 433 Methodology 435 Methodology – Conceptual Database Design 437 Introduction to the Database Design Methodology 15.1.1 What is a Design Methodology? 15.1.2 Conceptual, Logical, and Physical Database Design 15.1.3 Critical Success Factors in Database Design Overview of the Database Design Methodology Conceptual Database Design Methodology Step 1 Build Conceptual Data Model 438 438 439 440 440 442 442 Chapter Summary Review Questions Exercises 458 459 460 Methodology – Logical Database Design for the Relational Model 461 Logical Database Design Methodology for the Relational Model Step 2 Build and Validate Logical Data Model 462 462 Chapter Summary Review Questions Exercises 490 491 492 Methodology – Physical Database Design for Relational Databases 494 Comparison of Logical and Physical Database Design 495 Overview of Physical Database Design Methodology 496 The Physical Database Design Methodology for Relational Databases 497 Step 3 Translate Logical Data Model for Target DBMS 497 Step 4 Design File Organizations and Indexes 501 Step 5 Design User Views 515 Step 6 Design Security Mechanisms 516 Chapter Summary Review Questions Exercises 517 517 518 xx | Contents Chapter 18 18.1 18.2 Part 5 Chapter 19 19.1 19.2 19.3 19.4 19.5 Chapter 20 20.1 Methodology – Monitoring and Tuning the Operational System 519 Denormalizing and Introducing Controlled Redundancy Step 7 Consider the Introduction of Controlled Redundancy Monitoring the System to Improve Performance Step 8 Monitor and Tune the Operational System 519 519 532 532 Chapter Summary Review Questions Exercise 537 537 537 Selected Database Issues 539 Security 541 Database Security 19.1.1 Threats Countermeasures – Computer-Based Controls 19.2.1 Authorization 19.2.2 Access Controls 19.2.3 Views 19.2.4 Backup and Recovery 19.2.5 Integrity 19.2.6 Encryption 19.2.7 RAID (Redundant Array of Independent Disks) Security in Microsoft Office Access DBMS Security in Oracle DBMS DBMSs and Web Security 19.5.1 Proxy Servers 19.5.2 Firewalls 19.5.3 Message Digest Algorithms and Digital Signatures 19.5.4 Digital Certificates 19.5.5 Kerberos 19.5.6 Secure Sockets Layer and Secure HTTP 19.5.7 Secure Electronic Transactions and Secure Transaction Technology 19.5.8 Java Security 19.5.9 ActiveX Security 542 543 545 546 547 550 550 551 551 552 555 558 562 563 563 564 564 565 565 Chapter Summary Review Questions Exercises 570 571 571 Transaction Management 572 Transaction Support 20.1.1 Properties of Transactions 20.1.2 Database Architecture 573 575 576 566 566 569 Contents 20.2 20.3 20.4 20.5 Chapter 21 21.1 21.2 21.3 21.4 21.5 | xxi Concurrency Control 20.2.1 The Need for Concurrency Control 20.2.2 Serializability and Recoverability 20.2.3 Locking Methods 20.2.4 Deadlock 20.2.5 Timestamping Methods 20.2.6 Multiversion Timestamp Ordering 20.2.7 Optimistic Techniques 20.2.8 Granularity of Data Items Database Recovery 20.3.1 The Need for Recovery 20.3.2 Transactions and Recovery 20.3.3 Recovery Facilities 20.3.4 Recovery Techniques 20.3.5 Recovery in a Distributed DBMS Advanced Transaction Models 20.4.1 Nested Transaction 
Model 20.4.2 Sagas 20.4.3 Multilevel Transaction Model 20.4.4 Dynamic Restructuring 20.4.5 Workflow Models Concurrency Control and Recovery in Oracle 20.5.1 Oracle’s Isolation Levels 20.5.2 Multiversion Read Consistency 20.5.3 Deadlock Detection 20.5.4 Backup and Recovery 577 577 580 587 594 597 600 601 602 605 606 607 609 612 615 615 616 618 619 620 621 622 623 623 625 625 Chapter Summary Review Questions Exercises 626 627 628 Query Processing 630 Overview of Query Processing Query Decomposition Heuristical Approach to Query Optimization 21.3.1 Transformation Rules for the Relational Algebra Operations 21.3.2 Heuristical Processing Strategies Cost Estimation for the Relational Algebra Operations 21.4.1 Database Statistics 21.4.2 Selection Operation 21.4.3 Join Operation 21.4.4 Projection Operation 21.4.5 The Relational Algebra Set Operations Enumeration of Alternative Execution Strategies 631 635 639 640 645 646 646 647 654 662 664 665 xxii | Contents 21.6 21.5.1 Pipelining 21.5.2 Linear Trees 21.5.3 Physical Operators and Execution Strategies 21.5.4 Reducing the Search Space 21.5.5 Enumerating Left-Deep Trees 21.5.6 Semantic Query Optimization 21.5.7 Alternative Approaches to Query Optimization 21.5.8 Distributed Query Optimization Query Optimization in Oracle 21.6.1 Rule-Based and Cost-Based Optimization 21.6.2 Histograms 21.6.3 Viewing the Execution Plan 665 666 667 668 669 671 672 672 673 673 677 678 Chapter Summary Review Questions Exercises 680 681 681 Part 6 Distributed DBMSs and Replication 685 Chapter 22 Distributed DBMSs – Concepts and Design 687 Introduction 22.1.1 Concepts 22.1.2 Advantages and Disadvantages of DDBMSs 22.1.3 Homogeneous and Heterogeneous DDBMSs Overview of Networking Functions and Architectures of a DDBMS 22.3.1 Functions of a DDBMS 22.3.2 Reference Architecture for a DDBMS 22.3.3 Reference Architecture for a Federated MDBS 22.3.4 Component Architecture for a DDBMS Distributed Relational Database Design 22.4.1 Data Allocation 22.4.2 Fragmentation Transparencies in a DDBMS 22.5.1 Distribution Transparency 22.5.2 Transaction Transparency 22.5.3 Performance Transparency 22.5.4 DBMS Transparency 22.5.5 Summary of Transparencies in a DDBMS Date’s Twelve Rules for a DDBMS 688 689 693 697 699 703 703 704 705 706 708 709 710 719 719 722 725 728 728 729 Chapter Summary Review Questions Exercises 731 732 732 22.1 22.2 22.3 22.4 22.5 22.6 Contents Chapter 23 23.1 23.2 23.3 23.4 23.5 23.6 23.7 Chapter 24 24.1 24.2 24.3 24.4 24.5 24.6 24.7 24.8 | xxiii Distributed DBMSs – Advanced Concepts 734 Distributed Transaction Management Distributed Concurrency Control 23.2.1 Objectives 23.2.2 Distributed Serializability 23.2.3 Locking Protocols 23.2.4 Timestamp Protocols Distributed Deadlock Management Distributed Database Recovery 23.4.1 Failures in a Distributed Environment 23.4.2 How Failures Affect Recovery 23.4.3 Two-Phase Commit (2PC) 23.4.4 Three-Phase Commit (3PC) 23.4.5 Network Partitioning The X/Open Distributed Transaction Processing Model Distributed Query Optimization 23.6.1 Data Localization 23.6.2 Distributed Joins 23.6.3 Global Optimization Distribution in Oracle 23.7.1 Oracle’s DDBMS Functionality 735 736 736 737 738 740 741 744 744 745 746 752 756 758 761 762 766 767 772 772 Chapter Summary Review Questions Exercises 777 778 778 Replication and Mobile Databases 780 Introduction to Database Replication Benefits of Database Replication Applications of Replication Basic Components of Database Replication Database Replication Environments 24.5.1 Synchronous Versus 
Asynchronous Replication 24.5.2 Data Ownership Replication Servers 24.6.1 Replication Server Functionality 24.6.2 Implementation Issues Introduction to Mobile Databases 24.7.1 Mobile DBMSs Oracle Replication 24.8.1 Oracle’s Replication Functionality 781 781 783 783 784 784 784 788 788 789 792 794 794 794 xxiv | Contents Part 7 Chapter 25 25.1 25.2 25.3 25.4 25.5 25.6 25.7 Chapter 26 26.1 Chapter Summary Review Questions Exercises 799 800 800 Object DBMSs 801 Introduction to Object DBMSs 803 Advanced Database Applications Weaknesses of RDBMSs Object-Oriented Concepts 25.3.1 Abstraction, Encapsulation, and Information Hiding 25.3.2 Objects and Attributes 25.3.3 Object Identity 25.3.4 Methods and Messages 25.3.5 Classes 25.3.6 Subclasses, Superclasses, and Inheritance 25.3.7 Overriding and Overloading 25.3.8 Polymorphism and Dynamic Binding 25.3.9 Complex Objects Storing Objects in a Relational Database 25.4.1 Mapping Classes to Relations 25.4.2 Accessing Objects in the Relational Database Next-Generation Database Systems Object-Oriented Database Design 25.6.1 Comparison of Object-Oriented Data Modeling and Conceptual Data Modeling 25.6.2 Relationships and Referential Integrity 25.6.3 Behavioral Design Object-Oriented Analysis and Design with UML 25.7.1 UML Diagrams 25.7.2 Usage of UML in the Methodology for Database Design 804 809 814 814 815 816 818 819 820 822 823 824 825 826 827 828 830 Chapter Summary Review Questions Exercises 844 845 846 Object-Oriented DBMSs – Concepts 847 Introduction to Object-Oriented Data Models and OODBMSs 26.1.1 Definition of Object-Oriented DBMSs 26.1.2 Functional Data Models 26.1.3 Persistent Programming Languages 26.1.4 The Object-Oriented Database System Manifesto 26.1.5 Alternative Strategies for Developing an OODBMS 849 849 850 854 857 859 830 831 834 836 837 842 Contents 26.2 26.3 26.4 26.5 Chapter 27 27.1 27.2 27.3 | xxv OODBMS Perspectives 26.2.1 Pointer Swizzling Techniques 26.2.2 Accessing an Object Persistence 26.3.1 Persistence Schemes 26.3.2 Orthogonal Persistence Issues in OODBMSs 26.4.1 Transactions 26.4.2 Versions 26.4.3 Schema Evolution 26.4.4 Architecture 26.4.5 Benchmarking Advantages and Disadvantages of OODBMSs 26.5.1 Advantages 26.5.2 Disadvantages 860 862 865 867 868 869 871 871 872 873 876 878 881 881 883 Chapter Summary Review Questions Exercises 885 886 887 Object-Oriented DBMSs – Standards and Systems 888 Object Management Group 27.1.1 Background 27.1.2 The Common Object Request Broker Architecture 27.1.3 Other OMG Specifications 27.1.4 Model-Driven Architecture Object Data Standard ODMG 3.0, 1999 27.2.1 Object Data Management Group 27.2.2 The Object Model 27.2.3 The Object Definition Language 27.2.4 The Object Query Language 27.2.5 Other Parts of the ODMG Standard 27.2.6 Mapping the Conceptual Design to a Logical (Object-Oriented) Design ObjectStore 27.3.1 Architecture 27.3.2 Building an ObjectStore Application 27.3.3 Data Definition in ObjectStore 27.3.4 Data Manipulation in ObjectStore 889 889 891 894 897 897 897 900 908 911 917 Chapter Summary Review Questions Exercises 932 934 934 920 921 921 924 926 929 xxvi | Contents Chapter 28 Object-Relational DBMSs 935 28.1 Introduction to Object-Relational Database Systems 936 28.2 The Third-Generation Database Manifestos 28.2.1 The Third-Generation Database System Manifesto 28.2.2 The Third Manifesto 939 940 940 28.3 Postgres – An Early ORDBMS 28.3.1 Objectives of Postgres 28.3.2 Abstract Data Types 28.3.3 Relations and Inheritance 28.3.4 Object Identity 943 943 943 944 946 28.4 SQL:1999 
and SQL:2003 28.4.1 Row Types 28.4.2 User-Defined Types 28.4.3 Subtypes and Supertypes 28.4.4 User-Defined Routines 28.4.5 Polymorphism 28.4.6 Reference Types and Object Identity 28.4.7 Creating Tables 28.4.8 Querying Data 28.4.9 Collection Types 28.4.10 Typed Views 28.4.11 Persistent Stored Modules 28.4.12 Triggers 28.4.13 Large Objects 28.4.14 Recursion 946 947 948 951 953 955 956 957 960 961 965 966 967 971 972 28.5 Query Processing and Optimization 28.5.1 New Index Types 974 977 28.6 Object-Oriented Extensions in Oracle 28.6.1 User-Defined Data Types 28.6.2 Manipulating Object Tables 28.6.3 Object Views 28.6.4 Privileges Comparison of ORDBMS and OODBMS 978 978 984 985 986 986 Chapter Summary Review Questions Exercises 988 988 989 28.7 Part 8 Chapter 29 29.1 Web and DBMSs 991 Web Technology and DBMSs 993 Introduction to the Internet and Web 994 Contents 29.2 | xxvii 29.1.1 Intranets and Extranets 996 29.1.2 e-Commerce and e-Business 997 The Web 998 29.2.1 HyperText Transfer Protocol 999 29.2.2 HyperText Markup Language 1001 29.2.3 Uniform Resource Locators 1002 29.2.4 Static and Dynamic Web Pages 1004 29.2.5 Web Services 1004 29.2.6 Requirements for Web–DBMS Integration 1005 29.2.7 Advantages and Disadvantages of the Web–DBMS Approach 1006 29.2.8 Approaches to Integrating the Web and DBMSs 1011 29.3 Scripting Languages 29.3.1 JavaScript and JScript 29.3.2 VBScript 29.3.3 Perl and PHP 1011 1012 1012 1013 29.4 Common Gateway Interface 29.4.1 Passing Information to a CGI Script 29.4.2 Advantages and Disadvantages of CGI 1014 1016 1018 29.5 HTTP Cookies 1019 29.6 Extending the Web Server 29.6.1 Comparison of CGI and API 1020 1021 29.7 Java 29.7.1 JDBC 29.7.2 SQLJ 29.7.3 Comparison of JDBC and SQLJ 29.7.4 Container-Managed Persistence (CMP) 29.7.5 Java Data Objects (JDO) 29.7.6 Java Servlets 29.7.7 JavaServer Pages 29.7.8 Java Web Services Microsoft’s Web Platform 29.8.1 Universal Data Access 29.8.2 Active Server Pages and ActiveX Data Objects 29.8.3 Remote Data Services 29.8.4 Comparison of ASP and JSP 29.8.5 Microsoft .NET 29.8.6 Microsoft Web Services 29.8.7 Microsoft Office Access and Web Page Generation 1021 1025 1030 1030 1031 1035 1040 1041 1042 1043 1045 1046 1049 1049 1050 1054 1054 Oracle Internet Platform 29.9.1 Oracle Application Server (OracleAS) 1055 1056 Chapter Summary Review Questions Exercises 1062 1063 1064 29.8 29.9 xxviii | Contents Chapter 30 30.1 30.2 30.3 30.4 30.5 30.6 30.7 Part 9 Chapter 31 31.1 Semistructured Data and XML 1065 Semistructured Data 30.1.1 Object Exchange Model (OEM) 30.1.2 Lore and Lorel Introduction to XML 30.2.1 Overview of XML 30.2.2 Document Type Definitions (DTDs) XML-Related Technologies 30.3.1 DOM and SAX Interfaces 30.3.2 Namespaces 30.3.3 XSL and XSLT 30.3.4 XPath (XML Path Language) 30.3.5 XPointer (XML Pointer Language) 30.3.6 XLink (XML Linking Language) 30.3.7 XHTML 30.3.8 Simple Object Access Protocol (SOAP) 30.3.9 Web Services Description Language (WSDL) 30.3.10 Universal Discovery, Description and Integration (UDDI) XML Schema 30.4.1 Resource Description Framework (RDF) XML Query Languages 30.5.1 Extending Lore and Lorel to Handle XML 30.5.2 XML Query Working Group 30.5.3 XQuery – A Query Language for XML 30.5.4 XML Information Set 30.5.5 XQuery 1.0 and XPath 2.0 Data Model 30.5.6 Formal Semantics XML and Databases 30.6.1 Storing XML in Databases 30.6.2 XML and SQL 30.6.3 Native XML Databases XML in Oracle 1066 1068 1069 1073 1076 1078 1082 1082 1083 1084 1085 1085 1086 1087 1087 1088 1088 1091 1098 1100 1100 1101 1103 1114 1115 1121 
1128 1129 1132 1137 1139 Chapter Summary Review Questions Exercises 1142 1144 1145 Business Intelligence 1147 Data Warehousing Concepts 1149 Introduction to Data Warehousing 31.1.1 The Evolution of Data Warehousing 31.1.2 Data Warehousing Concepts 1150 1150 1151 Contents 31.2 31.3 31.4 31.5 31.6 Chapter 32 32.1 32.2 32.3 32.4 32.5 | xxix 31.1.3 Benefits of Data Warehousing 31.1.4 Comparison of OLTP Systems and Data Warehousing 31.1.5 Problems of Data Warehousing Data Warehouse Architecture 31.2.1 Operational Data 31.2.2 Operational Data Store 31.2.3 Load Manager 31.2.4 Warehouse Manager 31.2.5 Query Manager 31.2.6 Detailed Data 31.2.7 Lightly and Highly Summarized Data 31.2.8 Archive/Backup Data 31.2.9 Metadata 31.2.10 End-User Access Tools Data Warehouse Data Flows 31.3.1 Inflow 31.3.2 Upflow 31.3.3 Downflow 31.3.4 Outflow 31.3.5 Metaflow Data Warehousing Tools and Technologies 31.4.1 Extraction, Cleansing, and Transformation Tools 31.4.2 Data Warehouse DBMS 31.4.3 Data Warehouse Metadata 31.4.4 Administration and Management Tools Data Marts 31.5.1 Reasons for Creating a Data Mart 31.5.2 Data Marts Issues Data Warehousing Using Oracle 31.6.1 Oracle9i 1152 1153 1154 1156 1156 1157 1158 1158 1158 1159 1159 1159 1159 1160 1161 1162 1163 1164 1164 1165 1165 1165 1166 1169 1171 1171 1173 1173 1175 1175 Chapter Summary Review Questions Exercise 1178 1180 1180 Data Warehousing Design 1181 Designing a Data Warehouse Database Dimensionality Modeling 32.2.1 Comparison of DM and ER models Database Design Methodology for Data Warehouses Criteria for Assessing the Dimensionality of a Data Warehouse Data Warehousing Design Using Oracle 32.5.1 Oracle Warehouse Builder Components 32.5.2 Using Oracle Warehouse Builder 1182 1183 1186 1187 1195 1196 1197 1198 xxx | Contents Chapter 33 33.1 33.2 33.3 33.4 33.5 33.6 Chapter 34 34.1 34.2 34.3 34.4 34.5 34.6 Chapter Summary Review Questions Exercises 1202 1203 1203 OLAP 1204 Online Analytical Processing 33.1.1 OLAP Benchmarks OLAP Applications 33.2.1 OLAP Benefits Representation of Multi-Dimensional Data OLAP Tools 33.4.1 Codd’s Rules for OLAP Tools 33.4.2 Categories of OLAP Tools OLAP Extensions to the SQL Standard 33.5.1 Extended Grouping Capabilities 33.5.2 Elememtary OLAP Operators Oracle OLAP 33.6.1 Oracle OLAP Environment 33.6.2 Platform for Business Intelligence Applications 33.6.3 Oracle9i Database 33.6.4 Oracle OLAP 33.6.5 Performance 33.6.6 System Management 33.6.7 System Requirements 1205 1206 1206 1208 1209 1211 1211 1214 1217 1218 1222 1224 1225 1225 1226 1228 1229 1229 1230 Chapter Summary Review Questions Exercises 1230 1231 1231 Data Mining 1232 Data Mining Data Mining Techniques 34.2.1 Predictive Modeling 34.2.2 Database Segmentation 34.2.3 Link Analysis 34.2.4 Deviation Detection The Data Mining Process 34.3.1 The CRISP-DM Model Data Mining Tools Data Mining and Data Warehousing Oracle Data Mining (ODM) 34.6.1 Data Mining Capabilities 34.6.2 Enabling Data Mining Applications 1233 1233 1235 1236 1237 1238 1239 1239 1241 1242 1242 1242 1243 Contents 1243 1243 Chapter Summary Review Questions Exercises 1245 1246 1246 Users’ Requirements Specification for DreamHome Case Study A.1 A.2 B Branch User Views of DreamHome A.1.1 Data Requirements A.1.2 Transaction Requirements (Sample) Staff User Views of DreamHome A.2.1 Data Requirements A.2.2 Transaction Requirements (Sample) 1247 1249 1249 1249 1251 1252 1252 1253 Other Case Studies 1255 B.1 1255 1255 1257 1258 1258 1259 1260 1260 1266 B.2 B.3 C xxxi 34.6.3 Predictions and Insights 34.6.4 Oracle Data 
Mining Environment Appendices A | The University Accommodation Office Case Study B.1.1 Data Requirements B.1.2 Query Transactions (Sample) The EasyDrive School of Motoring Case Study B.2.1 Data Requirements B.2.2 Query Transactions (Sample) The Wellmeadows Hospital Case Study B.3.1 Data Requirements B.3.2 Transaction Requirements (Sample) File Organizations and Indexes (extended version on the Web site) C.1 C.2 C.3 C.4 Basic Concepts Unordered Files Ordered Files Hash Files C.4.1 Dynamic Hashing C.4.2 Limitations of Hashing C.5 Indexes C.5.1 Types of Index C.5.2 Indexed Sequential Files C.5.3 Secondary Indexes C.5.4 Multilevel Indexes 1268 1269 1270 1271 1272 1275 1276 1277 1277 1278 1279 1280 xxxii | Contents C.5.5 B+-trees C.5.6 Bitmap Indexes C.5.7 Join Indexes C.6 Clustered and Non-Clustered Tables C.6.1 Indexed Clusters C.6.2 Hash Clusters C.7 Guidelines for Selecting File Organizations 1280 1283 1284 1286 1286 1287 1288 Appendix Summary 1291 D When is a DBMS Relational? 1293 E Programmatic SQL (extended version on the Web site) 1298 E.1 E.2 E.3 F G H I Embedded SQL E.1.1 Simple Embedded SQL Statements E.1.2 SQL Communications Area E.1.3 Host Language Variables E.1.4 Retrieving Data Using Embedded SQL and Cursors E.1.5 Using Cursors to Modify Data E.1.6 ISO Standard for Embedded SQL Dynamic SQL The Open Database Connectivity (ODBC) Standard E.3.1 The ODBC Architecture E.3.2 ODBC Conformance Levels 1299 1299 1301 1303 1304 1310 1311 1312 1313 1314 1315 Appendix Summary Review Questions Exercises 1318 1319 1319 Alternative ER Modeling Notations 1320 F.1 F.2 ER Modeling Using the Chen Notation ER Modeling Using the Crow’s Feet Notation 1320 1320 Summary of the Database Design Methodology for Relational Databases 1326 Estimating Disk space Requirements On Web site Sample Web Scripts On Web site References 1332 Further Reading 1345 Index 1356 Preface Background The history of database research over the past 30 years is one of exceptional productivity that has led to the database system becoming arguably the most important development in the field of software engineering. The database is now the underlying framework of the information system, and has fundamentally changed the way many organizations operate. In particular, the developments in this technology over the last few years have produced systems that are more powerful and more intuitive to use. This has resulted in database systems becoming increasingly available to a wider variety of users. Unfortunately, the apparent simplicity of these systems has led to users creating databases and applications without the necessary knowledge to produce an effective and efficient system. And so the ‘software crisis’ or, as it is sometimes referred to, the ‘software depression’ continues. The original stimulus for this book came from the authors’ work in industry, providing consultancy on database design for new software systems or, as often as not, resolving inadequacies with existing systems. Added to this, the authors’ move to academia brought similar problems from different users – students. The objectives of this book, therefore, are to provide a textbook that introduces the theory behind databases as clearly as possible and, in particular, to provide a methodology for database design that can be used by both technical and non-technical readers. 
The methodology presented in this book for relational Database Management Systems (DBMSs) – the predominant system for business applications at present – has been tried and tested over the years in both industrial and academic environments. It consists of three main phases: conceptual, logical, and physical database design. The first phase starts with the production of a conceptual data model that is independent of all physical considerations. This model is then refined in the second phase into a logical data model by removing constructs that cannot be represented in relational systems. In the third phase, the logical data model is translated into a physical design for the target DBMS. The physical design phase considers the storage structures and access methods required for efficient and secure access to the database on secondary storage.

The methodology in each phase is presented as a series of steps. For the inexperienced designer, it is expected that the steps will be followed in the order described, and guidelines are provided throughout to help with this process. For the experienced designer, the methodology can be less prescriptive, acting more as a framework or checklist. To help the reader use the methodology and understand the important issues, the methodology has been described using a realistic worked example, based on an integrated case study, DreamHome. In addition, three further case studies are provided in Appendix B to allow readers to try out the methodology for themselves.

UML (Unified Modeling Language)

Increasingly, companies are standardizing the way in which they model data by selecting a particular approach to data modeling and using it throughout their database development projects. A popular high-level data model used in conceptual/logical database design, and the one we use in this book, is based on the concepts of the Entity–Relationship (ER) model. Currently there is no standard notation for an ER model. Most books that cover database design for relational DBMSs tend to use one of two conventional notations:

• Chen's notation, consisting of rectangles representing entities and diamonds representing relationships, with lines linking the rectangles and diamonds; or
• Crow's Feet notation, again consisting of rectangles representing entities and lines between entities representing relationships, with a crow's foot at one end of a line representing a one-to-many relationship.

Both notations are well supported by current CASE tools. However, they can be quite cumbersome to use and a bit difficult to explain. Prior to this edition, we used Chen's notation. However, following an extensive questionnaire carried out by Pearson Education, there was a general consensus that the notation should be changed to the latest object-oriented modeling language, UML (Unified Modeling Language). UML is a notation that combines elements from the three major strands of object-oriented design: Rumbaugh's OMT modeling, Booch's Object-Oriented Analysis and Design, and Jacobson's Objectory. There are three primary reasons for adopting a different notation: (1) UML is becoming an industry standard; for example, the Object Management Group (OMG) has adopted UML as the standard notation for object methods; (2) UML is arguably clearer and easier to use; (3) UML is now being adopted within academia for teaching object-oriented analysis and design, and using UML in database modules provides more synergy.
Therefore, in this edition we have adopted the class diagram notation from UML. We believe you will find this notation easier to understand and use. Prior to making this move to UML, we spent a considerable amount of time experimenting with UML and checking its suitability for database design. We concluded this work by publishing a book through Pearson Education called Database Solutions: A Step-by-Step Guide to Building Databases. This book uses the methodology to design and build databases for two case studies, one with Microsoft Office Access as the target DBMS and one with Oracle as the target DBMS. It also contains many other case studies with sample solutions.

What's New in the Fourth Edition

The fourth edition of the book has been revised to improve readability, to update or extend coverage of existing material, and to include new material. The major changes in the fourth edition are as follows:

• Extended treatment of normalization (the original chapter has been divided into two).
• Streamlined methodology for database design using UML notation for ER diagrams.
• New section on the use of other parts of UML within analysis and design, covering use case, sequence, collaboration, statechart, and activity diagrams.
• New section on enumeration of execution strategies within query optimization, for both centralized and distributed DBMSs.
• Coverage of OMG specifications including the Common Warehouse Metamodel (CWM) and the Model Driven Architecture (MDA).
• Object-Relational chapter updated to reflect the new SQL:2003 standard.
• Extended treatment of Web–DBMS integration, including coverage of Container-Managed Persistence (CMP), Java Data Objects (JDO), and ADO.NET.
• Extended treatment of XML, SOAP, WSDL, UDDI, XQuery 1.0 and XPath 2.0 (including the revised Data Model and Formal Semantics), the SQL:2003 SQL/XML standard, storage of XML in relational databases, and native XML databases.
• Extended treatment of OLAP and data mining, including the functionality of SQL:2003 and the CRISP-DM model.
• Coverage updated to Oracle9i (with an overview of Oracle10g) and Microsoft Office Access 2003.
• Additional Web resources, including an extended chapter on file organizations and storage structures, a full Web implementation of the DreamHome case study, a user guide for Oracle, and more examples for the appendix on Web–DBMS integration.

Intended Audience

This book is intended as a textbook for a one- or two-semester course in database management or database design at introductory undergraduate, advanced undergraduate, or graduate level. Such courses are usually required in an information systems, business IT, or computer science curriculum. The book is also intended as a reference book for IT professionals, such as systems analysts or designers, application programmers, systems programmers, database practitioners, and independent self-teachers. Owing to the widespread use of database systems nowadays, these professionals could come from any type of company that requires a database.

It would be helpful for students to have a good background in the file organization and data structures concepts covered in Appendix C before covering the material in Chapter 17 on physical database design and Chapter 21 on query processing. This background ideally will have been obtained from a prior course. If this is not possible, then the material in Appendix C can be presented near the beginning of the database course, immediately following Chapter 1.
An understanding of a high-level programming language, such as C, would be advantageous for Appendix E on embedded and dynamic SQL and Section 27.3 on ObjectStore.

Distinguishing Features

(1) An easy-to-use, step-by-step methodology for conceptual and logical database design, based on the widely accepted Entity–Relationship model, with normalization used as a validation technique. There is an integrated case study showing how to use the methodology.
(2) An easy-to-use, step-by-step methodology for physical database design, covering the mapping of the logical design to a physical implementation, the selection of file organizations and indexes appropriate for the applications, and when to introduce controlled redundancy. Again, there is an integrated case study showing how to use the methodology.
(3) There are separate chapters showing how database design fits into the overall database systems development lifecycle, how fact-finding techniques can be used to identify the system requirements, and how UML fits into the methodology.
(4) A clear and easy-to-understand presentation, with definitions clearly highlighted, chapter objectives clearly stated, and chapters summarized. Numerous examples and diagrams are provided throughout each chapter to illustrate the concepts. There is a realistic case study integrated throughout the book and further case studies that can be used as student projects.
(5) Extensive treatment of the latest formal and de facto standards: SQL (Structured Query Language), QBE (Query-By-Example), and the ODMG (Object Data Management Group) standard for object-oriented databases.
(6) Three tutorial-style chapters on the SQL standard, covering both interactive and embedded SQL (a brief illustrative sketch of this style follows this list).
(7) An overview chapter covering two of the most popular commercial DBMSs: Microsoft Office Access and Oracle. Many of the subsequent chapters examine how Microsoft Office Access and Oracle support the mechanisms that are being discussed.
(8) Comprehensive coverage of the concepts and issues relating to distributed DBMSs and replication servers.
(9) Comprehensive introduction to the concepts and issues relating to object-based DBMSs, including a review of the ODMG standard and a tutorial on the object management facilities within the latest release of the SQL standard, SQL:2003.
(10) Extensive treatment of the Web as a platform for database applications, with many code samples of accessing databases on the Web. In particular, we cover persistence through Container-Managed Persistence (CMP), Java Data Objects (JDO), JDBC, SQLJ, ActiveX Data Objects (ADO), ADO.NET, and Oracle PL/SQL Pages (PSP).
(11) An introduction to semistructured data and its relationship to XML, and extensive coverage of XML and its related technologies. In particular, we cover XML Schema, XQuery, and the XQuery Data Model and Formal Semantics. We also cover the integration of XML into databases and examine the extensions added to SQL:2003 to enable the publication of XML.
(12) Comprehensive introduction to data warehousing, Online Analytical Processing (OLAP), and data mining.
(13) Comprehensive introduction to dimensionality modeling for designing a data warehouse database. An integrated case study is used to demonstrate a methodology for data warehouse database design.
(14) Coverage of DBMS system implementation concepts, including concurrency and recovery control, security, and query processing and query optimization.
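To give a flavour of the tutorial style of the SQL chapters mentioned in feature (6) above, the following sketch shows the kind of data definition and data manipulation statements they cover. The simplified Branch table, its columns, and the sample row are illustrative assumptions loosely modeled on the DreamHome case study, not the exact schema used in the book.

-- Illustrative sketch only: a simplified Branch table loosely based on
-- the DreamHome case study; column names, types, and data are assumptions.
CREATE TABLE Branch (
    branchNo  CHAR(4)      NOT NULL,
    street    VARCHAR(25)  NOT NULL,
    city      VARCHAR(15)  NOT NULL,
    postcode  VARCHAR(8),
    CONSTRAINT pkBranch PRIMARY KEY (branchNo)
);

-- Data manipulation in the style of Chapter 5.
INSERT INTO Branch (branchNo, street, city, postcode)
VALUES ('B005', '22 Deer Rd', 'London', 'SW1 4EH');

SELECT branchNo, street
FROM   Branch
WHERE  city = 'London'
ORDER BY branchNo;

Chapter 6 then builds on definitions like this with the Integrity Enhancement Feature, views, and the GRANT and REVOKE access control statements.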
Pedagogy

Before starting to write any material for this book, one of the objectives was to produce a textbook that would be easy for the readers to follow and understand, whatever their background and experience. From the authors' experience of using textbooks, which was quite considerable before undertaking a project of this size, and also from listening to colleagues, clients, and students, there were a number of design features that readers liked and disliked. With these comments in mind, the following style and structure was adopted:

• A set of objectives, clearly identified at the start of each chapter.
• Each important concept that is introduced is clearly defined and highlighted by placing the definition in a box.
• Diagrams are liberally used throughout to support and clarify concepts.
• A very practical orientation: to this end, each chapter contains many worked examples to illustrate the concepts covered.
• A summary at the end of each chapter, covering the main concepts introduced.
• A set of review questions, the answers to which can be found in the text.
• A set of exercises that can be used by teachers or by individuals to demonstrate and test the individual's understanding of the chapter, the answers to which can be found in the accompanying Instructor's Guide.

Instructor's Guide

A comprehensive supplement containing numerous instructional resources is available for this textbook, upon request to Pearson Education. The accompanying Instructor's Guide includes:

• Course structures: suggestions for the material to be covered in a variety of courses.
• Teaching suggestions: lecture suggestions, teaching hints, and student project ideas that make use of the chapter content.
• Solutions: sample answers for all review questions and exercises.
• Examination questions: questions (similar to the questions and exercises at the end of each chapter), with solutions.
• Transparency masters: an electronic set of overhead transparencies containing the main points from each chapter, with enlarged illustrations and tables from the text, to help the instructor associate lectures and class discussion with material in the textbook.
• A User's Guide for Microsoft Office Access 2003 for student lab work.
• A User's Guide for Oracle9i for student lab work.
• An extended chapter on file organizations and storage structures.
• A Web-based implementation of the DreamHome case study.

Additional information about the Instructor's Guide and the book can be found on the Pearson Education Web site at www.booksites.net/connbegg.

Organization of this Book

Part 1 Background

Part 1 of the book serves to introduce the field of database systems and database design.

Chapter 1 introduces the field of database management, examining the problems with the precursor to the database system, the file-based system, and the advantages offered by the database approach.

Chapter 2 examines the database environment, discussing the advantages offered by the three-level ANSI-SPARC architecture, introducing the most popular data models, and outlining the functions that should be provided by a multi-user DBMS. The chapter also looks at the underlying software architecture for DBMSs, which could be omitted for a first course in database management.
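As a small, hypothetical illustration of the three-level ANSI-SPARC idea raised in the Chapter 2 description above, the sketch below defines a conceptual-level base table and an external-level view over it in SQL. The Staff table, its columns, and the view name are assumptions made for this sketch, not definitions taken from the book.

-- Conceptual level: an assumed base relation holding all staff details.
CREATE TABLE Staff (
    staffNo   CHAR(5)      NOT NULL,
    name      VARCHAR(30)  NOT NULL,
    position  VARCHAR(20),
    salary    DECIMAL(8,2),
    branchNo  CHAR(4),
    CONSTRAINT pkStaff PRIMARY KEY (staffNo)
);

-- External level: one user view that omits the salary column, so users of
-- this view see only the part of the conceptual schema relevant to them.
CREATE VIEW StaffGeneral AS
    SELECT staffNo, name, position, branchNo
    FROM   Staff;

Users granted access only to the view work with their own external schema, while changes at the internal level (to storage structures or indexes, say) can be made without disturbing either definition; that insulation is the data independence the architecture is designed to deliver.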
Part 2 The Relational Model and Languages

Part 2 of the book serves to introduce the relational model and the relational languages: the relational algebra and relational calculus, QBE (Query-By-Example), and SQL (Structured Query Language). This part also examines two highly popular commercial systems: Microsoft Office Access and Oracle.

Chapter 3 introduces the concepts behind the relational model, the most popular data model at present and the one most often chosen for standard business applications. After introducing the terminology and showing the relationship with mathematical relations, the relational integrity rules, entity integrity, and referential integrity are discussed. The chapter concludes with an overview of views, which is expanded upon in Chapter 6.

Chapter 4 introduces the relational algebra and relational calculus with examples to illustrate all the operations. This could be omitted for a first course in database management. However, relational algebra is required to understand query processing in Chapter 21 and fragmentation in Chapter 22 on distributed DBMSs. In addition, the comparative aspects of the procedural algebra and the non-procedural calculus act as a useful precursor for the study of SQL in Chapters 5 and 6, although not essential. (A small illustrative example of the algebra appears just after the Chapter 9 description below.)

Chapter 5 introduces the data manipulation statements of the SQL standard: SELECT, INSERT, UPDATE, and DELETE. The chapter is presented as a tutorial, giving a series of worked examples that demonstrate the main concepts of these statements.

Chapter 6 covers the main data definition facilities of the SQL standard. Again, the chapter is presented as a worked tutorial. The chapter introduces the SQL data types and the data definition statements, the Integrity Enhancement Feature (IEF), and the more advanced features of the data definition statements, including the access control statements GRANT and REVOKE. It also examines views and how they can be created in SQL.

Chapter 7 is another practical chapter that examines the interactive query language Query-By-Example (QBE), which has acquired the reputation of being one of the easiest ways for non-technical computer users to access information in a database. QBE is demonstrated using Microsoft Office Access.

Chapter 8 completes the second part of the book by providing introductions to two popular commercial relational DBMSs, namely Microsoft Office Access and Oracle. In subsequent chapters of the book, we examine how these systems implement various database facilities, such as security and query processing.

Part 3 Database Analysis and Design Techniques

Part 3 of the book discusses the main techniques for database analysis and design and how they can be applied in a practical way.

Chapter 9 presents an overview of the main stages of the database application lifecycle. In particular, it emphasizes the importance of database design and shows how the process can be decomposed into three phases: conceptual, logical, and physical database design. It also describes how the design of the application (the functional approach) affects database design (the data approach). A crucial stage in the database application lifecycle is the selection of an appropriate DBMS. This chapter discusses the process of DBMS selection and provides some guidelines and recommendations. The chapter concludes with a discussion of the importance of data administration and database administration.
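(This is the small example forward-referenced in the Chapter 4 description above.) Purely as an illustration of the procedural flavour of the relational algebra, and using assumed Staff and Branch relations whose attribute names (name, city, salary, branchNo) are ours rather than the book's, a request such as "list the names of highly paid staff together with the city of their branch" might be written:

\[
\Pi_{\mathrm{name},\,\mathrm{city}}\Bigl(\sigma_{\mathrm{salary} > 20000}(\mathrm{Staff}) \;\bowtie_{\mathrm{Staff.branchNo} = \mathrm{Branch.branchNo}}\; \mathrm{Branch}\Bigr)
\]

In SQL (Chapter 5) the same request would be written declaratively as a single SELECT statement with a join condition and a WHERE clause; comparing the two formulations is exactly the kind of exercise that Chapters 4 and 5 set up.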
Chapter 10 discusses when a database developer might use fact-finding techniques and what types of facts should be captured. The chapter describes the most commonly used fact-finding techniques and identifies the advantages and disadvantages of each. The chapter also demonstrates how some of these techniques may be used during the earlier stages of the database application lifecycle using the DreamHome case study.

Chapters 11 and 12 cover the concepts of the Entity–Relationship (ER) model and the Enhanced Entity–Relationship (EER) model, which allows more advanced data modeling using subclasses, superclasses, and categorization. The EER model is a popular high-level conceptual data model and is a fundamental technique of the database design methodology presented herein. The reader is also introduced to UML to represent ER diagrams.

Chapters 13 and 14 examine the concepts behind normalization, which is another important technique used in the logical database design methodology. Using a series of worked examples drawn from the integrated case study, they demonstrate how to transition a design from one normal form to another and show the advantages of having a logical database design that conforms to particular normal forms up to, and including, fifth normal form.

Part 4 Methodology

This part of the book covers a methodology for database design. The methodology is divided into three parts covering conceptual, logical, and physical database design. Each part of the methodology is illustrated using the DreamHome case study.

Chapter 15 presents a step-by-step methodology for conceptual database design. It shows how to decompose the design into more manageable areas based on individual views, and then provides guidelines for identifying entities, attributes, relationships, and keys.

Chapter 16 presents a step-by-step methodology for logical database design for the relational model. It shows how to map a conceptual data model to a logical data model and how to validate it against the required transactions using the technique of normalization. For database applications with multiple user views, this chapter shows how to merge the resulting data models together into a global data model that represents all the views of the part of the enterprise being modeled.

Chapters 17 and 18 present a step-by-step methodology for physical database design for relational systems. They show how to translate the logical data model developed during logical database design into a physical design for a relational system. The methodology addresses the performance of the resulting implementation by providing guidelines for choosing file organizations and storage structures, and when to introduce controlled redundancy.

Part 5 Selected Database Issues

Part 5 of the book examines four specific topics that the authors consider necessary for a modern course in database management.

Chapter 19 considers database security, not just in the context of DBMS security but also in the context of the security of the DBMS environment. It illustrates security provision with Microsoft Office Access and Oracle. The chapter also examines the security problems that can arise in a Web environment and presents some approaches to overcoming them.

Chapter 20 concentrates on three functions that a Database Management System should provide, namely transaction management, concurrency control, and recovery.
These functions are intended to ensure that the database is reliable and remains in a consistent state when multiple users are accessing the database and in the presence of failures of both hardware and software components. The chapter also discusses advanced transaction models that are more appropriate for transactions that may be of a long duration. The chapter concludes by examining transaction management within Oracle.

Chapter 21 examines query processing and query optimization. The chapter considers the two main techniques for query optimization: the use of heuristic rules that order the operations in a query, and the comparison of different strategies based on their relative costs to select the one that minimizes resource usage. The chapter concludes by examining query processing within Oracle.

Part 6 Distributed DBMSs and Replication

Part 6 of the book examines distributed DBMSs and replication. Distributed database management system (DDBMS) technology is one of the current major developments in the database systems area. The previous chapters of this book concentrate on centralized database systems: that is, systems with a single logical database located at one site under the control of a single DBMS.

Chapter 22 discusses the concepts and problems of distributed DBMSs, where users can access the database at their own site and also access data stored at remote sites.

Chapter 23 examines various advanced concepts associated with distributed DBMSs. In particular, it concentrates on the protocols associated with distributed transaction management, concurrency control, deadlock management, and database recovery. The chapter also examines the X/Open Distributed Transaction Processing (DTP) protocol. The chapter concludes by examining data distribution within Oracle.

Chapter 24 discusses replication servers as an alternative to distributed DBMSs and examines the issues associated with mobile databases. The chapter also examines the data replication facilities in Oracle.

Part 7 Object DBMSs

The preceding chapters of this book concentrate on the relational model and relational systems. The justification for this is that such systems are currently the predominant DBMS for traditional business database applications. However, relational systems are not without their failings, and the object-based DBMS is a major development in the database systems area that attempts to overcome these failings. Chapters 25–28 examine this development in some detail.

Chapter 25 acts as an introduction to object-based DBMSs. It first examines the types of advanced database applications that are emerging and discusses the weaknesses of the relational data model that make it unsuitable for these types of applications. The chapter then introduces the main concepts of object orientation. It also discusses the problems of storing objects in a relational database.

Chapter 26 examines the object-oriented DBMS (OODBMS), and starts by providing an introduction to object-oriented data models and persistent programming languages. The chapter discusses the difference between the two-level storage model used by conventional DBMSs and the single-level model used by OODBMSs, and how this affects data access. It also discusses the various approaches to providing persistence in programming languages and the different techniques for pointer swizzling, and examines version management, schema evolution, and OODBMS architectures.
The chapter concludes by briefly showing how the methodology presented in Part 4 of this book may be extended for object-oriented databases.

Chapter 27 addresses the object model proposed by the Object Data Management Group (ODMG), which has become a de facto standard for OODBMSs. The chapter also examines ObjectStore, a commercial OODBMS.

Chapter 28 examines the object-relational DBMS, and provides a detailed overview of the object management features that have been added to the new release of the SQL standard, SQL:2003. The chapter also discusses how query processing and query optimization need to be extended to handle data type extensibility efficiently. The chapter concludes by examining some of the object-relational features within Oracle.

Part 8 Web and DBMSs

Part 8 of the book deals with the integration of the DBMS into the Web environment, semistructured data and its relationship to XML, XML query languages, and mapping XML to databases.

Chapter 29 examines the integration of the DBMS into the Web environment. After providing a brief introduction to Internet and Web technology, the chapter examines the appropriateness of the Web as a database application platform and discusses the advantages and disadvantages of this approach. It then considers a number of the different approaches to integrating DBMSs into the Web environment, including scripting languages, CGI, server extensions, Java, ADO and ADO.NET, and Oracle's Internet Platform.

Chapter 30 examines semistructured data and then discusses XML and how XML is an emerging standard for data representation and interchange on the Web. The chapter then discusses XML-related technologies such as namespaces, XSL, XPath, XPointer, XLink, SOAP, WSDL, and UDDI. It also examines how XML Schema can be used to define the content model of an XML document and how the Resource Description Framework (RDF) provides a framework for the exchange of metadata. The chapter examines query languages for XML and, in particular, concentrates on XQuery, as proposed by W3C. It also examines the extensions added to SQL:2003 to enable the publication of XML and, more generally, the mapping and storing of XML in databases.

Part 9 Business Intelligence (or Decision Support)

The final part of the book deals with data warehousing, Online Analytical Processing (OLAP), and data mining.

Chapter 31 discusses data warehousing, what it is, how it has evolved, and describes the potential benefits and problems associated with this approach. The chapter examines the architecture, the main components, and the associated tools and technologies of a data warehouse. The chapter also discusses data marts and the issues associated with the development and management of data marts. The chapter concludes by describing the data warehousing facilities of the Oracle DBMS.

Chapter 32 provides an approach to the design of the database of a data warehouse/data mart built to support decision-making. The chapter describes the basic concepts associated with dimensionality modeling and compares this technique with traditional Entity–Relationship (ER) modeling. It also describes and demonstrates a step-by-step methodology for designing a data warehouse using worked examples taken from an extended version of the DreamHome case study. The chapter concludes by describing how to design a data warehouse using the Oracle Warehouse Builder.

Chapter 33 describes Online Analytical Processing (OLAP). It discusses what OLAP is and the main features of OLAP applications.
The chapter discusses how multi-dimensional data can be represented and the main categories of OLAP tools. It also discusses the OLAP extensions to the SQL standard and how Oracle supports OLAP.

Chapter 34 describes Data Mining (DM). It discusses what DM is and the main features of DM applications. The chapter describes the main characteristics of data mining operations and associated techniques. It describes the process of DM and the main features of DM tools, with particular coverage of Oracle DM.

Appendices

Appendix A provides a description of DreamHome, a case study that is used extensively throughout the book.

Appendix B provides three additional case studies, which can be used as student projects.

Appendix C provides some background information on file organization and storage structures that is necessary for an understanding of the physical database design methodology presented in Chapter 17 and query processing in Chapter 21.

Appendix D describes Codd's 12 rules for a relational DBMS, which form a yardstick against which the 'real' relational DBMS products can be identified.

Appendix E examines embedded and dynamic SQL, with sample programs in 'C'. The appendix also examines the Open Database Connectivity (ODBC) standard, which has emerged as a de facto industry standard for accessing heterogeneous SQL databases.

Appendix F describes two alternative data modeling notations to UML, namely Chen's notation and Crow's Foot notation.

Appendix G summarizes the steps in the methodology presented in Chapters 15–18 for conceptual, logical, and physical database design.

Appendix H (see companion Web site) discusses how to estimate the disk space requirements for an Oracle database.

Appendix I (see companion Web site) provides some sample Web scripts to complement Chapter 29 on Web technology and DBMSs.

The logical organization of the book and the suggested paths through it are illustrated in Figure P.1.

Figure P.1 Logical organization of the book and suggested paths through it.

Corrections and Suggestions

As a textbook of this size is so vulnerable to errors, disagreements, omissions, and confusion, your input is solicited for future reprints and editions. Comments, corrections, and constructive suggestions should be sent to Pearson Education, or by electronic mail to: [email protected]

Acknowledgments

This book is the outcome of many years of work by the authors in industry, research, and academia. It is therefore difficult to name all the people who have directly or indirectly helped us in our efforts; an idea here and there may have appeared insignificant at the time but may have had a significant causal effect. For those people we are about to omit, we apologize now. However, special thanks and apologies must first go to our families, who over the years have been neglected, even ignored, during our deepest concentrations.

Next, for the first edition, we should like to thank our editors, Dr Simon Plumtree and Nicky Jaeger, for their help, encouragement, and professionalism throughout this time; and our production editor Martin Tytler, and copy editor Lionel Browne. We should also like to thank the reviewers of the first edition, who contributed their comments, suggestions, and advice. In particular, we would like to mention: William H.
Gwinn, Instructor, Texas Tech University; Adrian Larner, De Montfort University, Leicester; Professor Andrew McGettrick, University of Strathclyde; Dennis McLeod, Professor of Computer Science, University of Southern California; Josephine DeGuzman Mendoza, Associate Professor, California State University; Jeff Naughton, Professor A. B. Schwarzkopf, University of Oklahoma; Junping Sun, Assistant Professor, Nova Southeastern University; Donovan Young, Associate Professor, Georgia Tech; Dr Barry Eaglestone, Lecturer in Computer Science, University of Bradford; John Wade, IBM. We would also like to acknowledge Anne Strachan for her contribution to the first edition.

For the second edition, we would first like to thank Sally Mortimore, our editor, and Martin Klopstock and Dylan Reisenberger in the production team. We should also like to thank the reviewers of the second edition, who contributed their comments, suggestions, and advice. In particular, we would like to mention: Stefano Ceri, Politecnico di Milano; Lars Gillberg, Mid Sweden University, Oestersund; Dawn Jutla, St Mary's University, Halifax, Canada; Julie McCann, City University, London; Munindar Singh, North Carolina State University; Hugh Darwen, Hursley, UK; Claude Delobel, Paris, France; Dennis Murray, Reading, UK; and, from our own department, John Kawala and Dr Peter Knaggs.

For the third and fourth editions, we would first like to thank Kate Brewin, our editor, Stuart Hay, Kay Holman, and Mary Lince in the production team, and copy editors Robert Chaundy and Ruth Freestone King. We should also like to thank the reviewers of these editions, who contributed their comments, suggestions, and advice. In particular, we would like to mention: Richard Cooper, University of Glasgow, UK; Emma Eliason, University of Orebro, Sweden; Sari Hakkarainen, Stockholm University and the Royal Institute of Technology; Nenad Jukic, Loyola University Chicago, USA; Jan Paredaens, University of Antwerp, Belgium; Stephen Priest, Daniel Webster College, USA. Many others are still anonymous to us – we thank you for the time you must have spent on the manuscript.

We should also like to thank Malcolm Bronte-Stewart for the DreamHome concept, Moira O'Donnell for ensuring the accuracy of the Wellmeadows Hospital case study, Alistair McMonnies, Richard Beeby, and Pauline Robertson for their help with material for the Web site, and special thanks to Thomas's secretary Lyndonne MacLeod and Carolyn's secretary June Blackburn, for their help and support over the years.

Thomas M. Connolly
Carolyn E. Begg
Glasgow, March 2004

Publisher's Acknowledgments

We are grateful to the following for permission to reproduce copyright material: Oracle Corporation for Figures 8.14, 8.15, 8.16, 8.22, 8.23, 8.24, 19.8, 19.9, 19.10, 30.29 and 30.30, reproduced with permission; The McGraw-Hill Companies, Inc., New York, for Figure 19.11, reproduced from BYTE Magazine, June 1997. Reproduced with permission. © by The McGraw-Hill Companies, Inc., New York, NY USA. All rights reserved; Figures 27.4 and 27.5 are diagrams from the "Common Warehouse Metamodel (CWM) Specification", March 2003, Version 1.1, Volume 1, formal/03-03-02. Reprinted with permission. Object Management Group, Inc. © OMG 2003; Screen shots reprinted by permission from Microsoft Corporation.

In some instances we have been unable to trace the owners of copyright material, and we would appreciate any information that would enable us to do so.
Features of the book

• Clearly highlighted chapter objectives.
• Each important concept is clearly defined and highlighted by placing the definition in a box.
• Diagrams are liberally used throughout to support and clarify concepts.
• A very practical orientation. Each chapter contains many worked examples to illustrate the concepts covered.
• A set of review questions, the answers to which can be found in the text.
• A summary at the end of each chapter, covering the main concepts introduced.
• A set of exercises that can be used by teachers or by individuals to demonstrate and test the individual's understanding of the chapter, the answers to which can be found in the accompanying Instructor's Guide.
• A Companion Web site accompanies the text at www.booksites.net/connbegg, with selected student resources including tutorials on selected chapters and the Access Lab Manual.

Part 1 Background
Chapter 1 Introduction to Databases
Chapter 2 Database Environment

Chapter 1 Introduction to Databases

Chapter Objectives

In this chapter you will learn:

• Some common uses of database systems.
• The characteristics of file-based systems.
• The problems with the file-based approach.
• The meaning of the term 'database'.
• The meaning of the term 'database management system' (DBMS).
• The typical functions of a DBMS.
• The major components of the DBMS environment.
• The personnel involved in the DBMS environment.
• The history of the development of DBMSs.
• The advantages and disadvantages of DBMSs.

The history of database system research is one of exceptional productivity and startling economic impact. Barely 20 years old as a basic science research field, database research has fueled an information services industry estimated at $10 billion per year in the U.S. alone. Achievements in database research underpin fundamental advances in communications systems, transportation and logistics, financial management, knowledge-based systems, accessibility to scientific literature, and a host of other civilian and defense applications. They also serve as the foundation for considerable progress in the basic science fields ranging from computing to biology. (Silberschatz et al., 1990, 1996)

This quotation comes from a workshop on database systems held at the beginning of the 1990s and expanded upon in a subsequent workshop in 1996, and it provides substantial motivation for the study of the subject of this book: the database system. Since these workshops, the importance of the database system has, if anything, increased with the significant developments in hardware capability, hardware capacity, and communications, including the emergence of the Internet, electronic commerce, business intelligence, mobile communications, and grid computing. The database system is arguably the most important development in the field of software engineering, and the database is now the underlying framework of the information system, fundamentally changing the way that many organizations operate. Database technology has been an exciting area to work in and, since its emergence, has been the catalyst for many important developments in software engineering. The workshop emphasized that the developments in database systems were not over, as some people thought. In fact, to paraphrase an old saying, it may be that we are only at the end of the beginning of the development.
The applications that will have to be handled in the future are so much more complex that we will have to rethink many of the algorithms currently being used, such as the algorithms for file storage and access, and query optimization. The development of these original algorithms has had significant ramifications in software engineering and, without doubt, the development of new algorithms will have similar effects. In this first chapter we introduce the database system.

Structure of this Chapter

In Section 1.1 we examine some uses of database systems that we find in everyday life but are not necessarily aware of. In Sections 1.2 and 1.3 we compare the early file-based approach to computerizing the manual file system with the modern, and more usable, database approach. In Section 1.4 we discuss the four types of role that people perform in the database environment, namely: data and database administrators, database designers, application developers, and the end-users. In Section 1.5 we provide a brief history of database systems, and follow that in Section 1.6 with a discussion of the advantages and disadvantages of database systems.

Throughout this book, we illustrate concepts using a case study based on a fictitious property management company called DreamHome. We provide a detailed description of this case study in Section 10.4 and Appendix A. In Appendix B we present further case studies that are intended to provide additional realistic projects for the reader. There will be exercises based on these case studies at the end of many chapters.

1.1 Introduction

The database is now such an integral part of our day-to-day life that often we are not aware we are using one. To start our discussion of databases, in this section we examine some applications of database systems. For the purposes of this discussion, we consider a database to be a collection of related data and the Database Management System (DBMS) to be the software that manages and controls access to the database. A database application is simply a program that interacts with the database at some point in its execution. We also use the more inclusive term database system to mean a collection of application programs that interact with the database, along with the DBMS and the database itself. We provide more accurate definitions in Section 1.3.

Purchases from the supermarket

When you purchase goods from your local supermarket, it is likely that a database is accessed. The checkout assistant uses a bar code reader to scan each of your purchases. This is linked to an application program that uses the bar code to find out the price of the item from a product database. The program then reduces the number of such items in stock and displays the price on the cash register. If the number in stock falls below a specified reorder level, the database system may automatically place an order to obtain more stocks of that item. If a customer telephones the supermarket, an assistant can check whether an item is in stock by running an application program that determines availability from the database.

Purchases using your credit card

When you purchase goods using your credit card, the assistant normally checks that you have sufficient credit left to make the purchase. This check may be carried out by telephone or it may be carried out automatically by a card reader linked to a computer system. In either case, there is a database somewhere that contains information about the purchases that you have made using your credit card.
To check your credit, there is a database application program that uses your credit card number to check that the price of the goods you wish to buy, together with the sum of the purchases you have already made this month, is within your credit limit. When the purchase is confirmed, the details of the purchase are added to this database. The application program also accesses the database to check that the credit card is not on the list of stolen or lost cards before authorizing the purchase. There are other application programs to send out monthly statements to each cardholder and to credit accounts when payment is received.

Booking a holiday at the travel agents

When you make inquiries about a holiday, the travel agent may access several databases containing holiday and flight details. When you book your holiday, the database system has to make all the necessary booking arrangements. In this case, the system has to ensure that two different agents do not book the same holiday or overbook the seats on the flight. For example, if there is only one seat left on the flight from London to New York and two agents try to reserve the last seat at the same time, the system has to recognize this situation, allow one booking to proceed, and inform the other agent that there are now no seats available. The travel agent may have another, usually separate, database for invoicing.

Using the local library

Your local library probably has a database containing details of the books in the library, details of the readers, reservations, and so on. There will be a computerized index that allows readers to find a book based on its title, or its authors, or its subject area. The database system handles reservations to allow a reader to reserve a book and to be informed by mail when the book is available. The system also sends reminders to borrowers who have failed to return books by the due date. Typically, the system will have a bar code reader, similar to that used by the supermarket described earlier, which is used to keep track of books coming in and going out of the library.

Taking out insurance

Whenever you wish to take out insurance, for example personal insurance, buildings and contents insurance for your house, or car insurance, your broker may access several databases containing figures for various insurance organizations. The personal details that you supply, such as name, address, age, and whether you drink or smoke, are used by the database system to determine the cost of the insurance. The broker can search several databases to find the organization that gives you the best deal.

Renting a video

When you wish to rent a video from a video rental company, you will probably find that the company maintains a database consisting of the video titles that it stocks, details on the copies it has for each title, whether the copy is available for rent or whether it is currently on loan, details of its members (the renters), and which videos they are currently renting and the dates they are due to be returned. The database may even store more detailed information on each video, such as its director and its actors. The company can use this information to monitor stock usage and predict future buying trends based on historic rental data.

Using the Internet

Many of the sites on the Internet are driven by database applications. For example, you may visit an online bookstore that allows you to browse and buy books, such as Amazon.com.
The bookstore allows you to browse books in different categories, such as computing or management, or it may allow you to browse books by author name. In either case, there is a database on the organization's Web server that consists of book details, availability, shipping information, stock levels, and on-order information. Book details include book titles, ISBNs, authors, prices, sales histories, publishers, reviews, and detailed descriptions. The database allows books to be cross-referenced: for example, a book may be listed under several categories, such as computing, programming languages, bestsellers, and recommended titles. The cross-referencing also allows Amazon to give you information on other books that are typically ordered along with the title you are interested in. As with an earlier example, you can provide your credit card details to purchase one or more books online. Amazon.com personalizes its service for customers who return to its site by keeping a record of all previous transactions, including items purchased, shipping, and credit card details. When you return to the site, you can now be greeted by name and you can be presented with a list of recommended titles based on previous purchases.

Studying at university

If you are at university, there will be a database system containing information about yourself, the course you are enrolled in, details about your grant, the modules you have taken in previous years or are taking this year, and details of all your examination results. There may also be a database containing details relating to the next year's admissions and a database containing details of the staff who work at the university, giving personal details and salary-related details for the payroll office.

1.2 Traditional File-Based Systems

It is almost a tradition that comprehensive database books introduce the database system with a review of its predecessor, the file-based system. We will not depart from this tradition. Although the file-based approach is largely obsolete, there are good reasons for studying it:

• Understanding the problems inherent in file-based systems may prevent us from repeating these problems in database systems. In other words, we should learn from our earlier mistakes. Actually, using the word 'mistakes' is derogatory and does not give any cognizance to the work that served a useful purpose for many years. However, we have learned from this work that there are better ways to handle data.
• If you wish to convert a file-based system to a database system, understanding how the file system works will be extremely useful, if not essential.

1.2.1 File-Based Approach

File-based system: A collection of application programs that perform services for the end-users such as the production of reports. Each program defines and manages its own data.

File-based systems were an early attempt to computerize the manual filing system that we are all familiar with. For example, in an organization a manual file is set up to hold all external and internal correspondence relating to a project, product, task, client, or employee. Typically, there are many such files, and for safety they are labeled and stored in one or more cabinets. For security, the cabinets may have locks or may be located in secure areas of the building. In our own home, we probably have some sort of filing system which contains receipts, guarantees, invoices, bank statements, and such like.
When we need to look something up, we go to the filing system and search through the system starting from the first entry until we find what we want. Alternatively, we may have an indexing system that helps locate what we want more quickly. For example, we may have divisions in the filing system or separate folders for different types of item that are in some way logically related.

The manual filing system works well while the number of items to be stored is small. It even works quite adequately when there are large numbers of items and we have only to store and retrieve them. However, the manual filing system breaks down when we have to cross-reference or process the information in the files. For example, a typical real estate agent's office might have a separate file for each property for sale or rent, each potential buyer and renter, and each member of staff. Consider the effort that would be required to answer the following questions:

• What three-bedroom properties do you have for sale with a garden and garage?
• What flats do you have for rent within three miles of the city center?
• What is the average rent for a two-bedroom flat?
• What is the total annual salary bill for staff?
• How does last month's turnover compare with the projected figure for this month?
• What is the expected monthly turnover for the next financial year?

Increasingly, nowadays, clients, senior managers, and staff want more and more information. In some areas there is a legal requirement to produce detailed monthly, quarterly, and annual reports. Clearly, the manual system is inadequate for this type of work. The file-based system was developed in response to the needs of industry for more efficient data access. However, rather than establish a centralized store for the organization's operational data, a decentralized approach was taken, where each department, with the assistance of Data Processing (DP) staff, stored and controlled its own data.

To understand what this means, consider the DreamHome example. The Sales Department is responsible for the selling and renting of properties. For example, whenever a client approaches the Sales Department with a view to marketing his or her property for rent, a form is completed, similar to that shown in Figure 1.1(a). This gives details of the property such as address and number of rooms together with the owner's details. The Sales Department also handles inquiries from clients, and a form similar to the one shown in Figure 1.1(b) is completed for each one. With the assistance of the DP Department, the Sales Department creates an information system to handle the renting of property. The system consists of three files containing property, owner, and client details, as illustrated in Figure 1.2. For simplicity, we omit details relating to members of staff, branch offices, and business owners.

The Contracts Department is responsible for handling the lease agreements associated with properties for rent. Whenever a client agrees to rent a property, a form is filled in by one of the Sales staff giving the client and property details, as shown in Figure 1.3. This form is passed to the Contracts Department, which allocates a lease number and completes the payment and rental period details. Again, with the assistance of the DP Department, the Contracts Department creates an information system to handle lease agreements.
The system consists of three files storing lease, property, and client details, containing similar data to that held by the Sales Department, as illustrated in Figure 1.4.

The situation is illustrated in Figure 1.5. It shows each department accessing their own files through application programs written specially for them. Each set of departmental application programs handles data entry, file maintenance, and the generation of a fixed set of specific reports. What is more important, the physical structure and storage of the data files and records are defined in the application code.

We can find similar examples in other departments. For example, the Payroll Department stores details relating to each member of staff's salary, namely:

StaffSalary(staffNo, fName, lName, sex, salary, branchNo)

The Personnel Department also stores staff details, namely:

Staff(staffNo, fName, lName, position, sex, dateOfBirth, salary, branchNo)

Figure 1.1 Sales Department forms: (a) Property for Rent Details form; (b) Client Details form.
Figure 1.2 The PropertyForRent, PrivateOwner, and Client files used by Sales.
Figure 1.3 Lease Details form used by Contracts Department.
Figure 1.4 The Lease, PropertyForRent, and Client files used by Contracts.
Figure 1.5 File-based processing.

It can be seen quite clearly that there is a significant amount of duplication of data in these departments, and this is generally true of file-based systems. Before we discuss the limitations of this approach, it may be useful to understand the terminology used in file-based systems. A file is simply a collection of records, which contains logically related data. For example, the PropertyForRent file in Figure 1.2 contains six records, one for each property. Each record contains a logically connected set of one or more fields, where each field represents some characteristic of the real-world object that is being modeled. In Figure 1.2, the fields of the PropertyForRent file represent characteristics of properties, such as address, property type, and number of rooms.

1.2.2 Limitations of the File-Based Approach

This brief description of traditional file-based systems should be sufficient to discuss the limitations of this approach. We list five problems in Table 1.1.

Separation and isolation of data

When data is isolated in separate files, it is more difficult to access data that should be available. For example, if we want to produce a list of all houses that match the requirements of clients, we first need to create a temporary file of those clients who have 'house' as the preferred type. We then search the PropertyForRent file for those properties where the property type is 'house' and the rent is less than the client's maximum rent. With file systems, such processing is difficult. The application developer must synchronize the processing of two files to ensure the correct data is extracted. This difficulty is compounded if we require data from more than two files.

Duplication of data

Owing to the decentralized approach taken by each department, the file-based approach encouraged, if not necessitated, the uncontrolled duplication of data. For example, in Figure 1.5 we can clearly see that there is duplication of both property and client details in the Sales and Contracts Departments.
Uncontrolled duplication of data is undesirable for several reasons, including:

• Duplication is wasteful. It costs time and money to enter the data more than once.
• It takes up additional storage space, again with associated costs. Often, the duplication of data can be avoided by sharing data files.
• Perhaps more importantly, duplication can lead to loss of data integrity; in other words, the data is no longer consistent.

For example, consider the duplication of data between the Payroll and Personnel Departments described above. If a member of staff moves house and the change of address is communicated only to Personnel and not to Payroll, the person's payslip will be sent to the wrong address. A more serious problem occurs if an employee is promoted with an associated increase in salary. Again, the change is notified to Personnel but the change does not filter through to Payroll. Now, the employee is receiving the wrong salary. When this error is detected, it will take time and effort to resolve. Both these examples illustrate inconsistencies that may result from the duplication of data. As there is no automatic way for Personnel to update the data in the Payroll files, it is not difficult to foresee such inconsistencies arising. Even if Payroll is notified of the changes, it is possible that the data will be entered incorrectly.

Table 1.1 Limitations of file-based systems.
• Separation and isolation of data
• Duplication of data
• Data dependence
• Incompatible file formats
• Fixed queries/proliferation of application programs

Data dependence

As we have already mentioned, the physical structure and storage of the data files and records are defined in the application code. This means that changes to an existing structure are difficult to make. For example, increasing the size of the PropertyForRent address field from 40 to 41 characters sounds like a simple change, but it requires the creation of a one-off program (that is, a program that is run only once and can then be discarded) that converts the PropertyForRent file to the new format. This program has to:

• open the original PropertyForRent file for reading;
• open a temporary file with the new structure;
• read a record from the original file, convert the data to conform to the new structure, and write it to the temporary file, repeating this step for all records in the original file;
• delete the original PropertyForRent file;
• rename the temporary file as PropertyForRent.

In addition, all programs that access the PropertyForRent file must be modified to conform to the new file structure. There might be many such programs that access the PropertyForRent file. Thus, the programmer needs to identify all the affected programs, modify them, and then retest them. Note that a program does not even have to use the address field to be affected: it has only to use the PropertyForRent file. Clearly, this could be very time-consuming and subject to error. This characteristic of file-based systems is known as program–data dependence.

Incompatible file formats

Because the structure of files is embedded in the application programs, the structures are dependent on the application programming language. For example, the structure of a file generated by a COBOL program may be different from the structure of a file generated by a 'C' program. The direct incompatibility of such files makes them difficult to process jointly.
For example, suppose that the Contracts Department wants to find the names and addresses of all owners whose property is currently rented out. Unfortunately, Contracts does not hold the details of property owners; only the Sales Department holds these. However, Contracts has the property number (propertyNo), which can be used to find the corresponding property number in the Sales Department's PropertyForRent file. This file holds the owner number (ownerNo), which can be used to find the owner details in the PrivateOwner file. The Contracts Department programs are written in COBOL and the Sales Department programs in 'C'. Therefore, to match propertyNo fields in the two PropertyForRent files requires an application developer to write software to convert the files to some common format to facilitate processing. Again, this can be time-consuming and expensive.

Fixed queries/proliferation of application programs

From the end-user's point of view, file-based systems proved to be a great improvement over manual systems. Consequently, the requirement for new or modified queries grew. However, file-based systems are very dependent upon the application developer, who has to write any queries or reports that are required. As a result, two things happened. In some organizations, the type of query or report that could be produced was fixed. There was no facility for asking unplanned (that is, spur-of-the-moment or ad hoc) queries either about the data itself or about which types of data were available. In other organizations, there was a proliferation of files and application programs. Eventually, this reached a point where the DP Department, with its current resources, could not handle all the work. This put tremendous pressure on the DP staff, resulting in programs that were inadequate or inefficient in meeting the demands of the users, documentation that was limited, and maintenance that was difficult. Often, certain types of functionality were omitted, including:

• there was no provision for security or integrity;
• recovery, in the event of a hardware or software failure, was limited or non-existent;
• access to the files was restricted to one user at a time – there was no provision for shared access by staff in the same department.

In either case, the outcome was not acceptable. Another solution was required.

1.3 Database Approach

All the above limitations of the file-based approach can be attributed to two factors:

(1) the definition of the data is embedded in the application programs, rather than being stored separately and independently;
(2) there is no control over the access and manipulation of data beyond that imposed by the application programs.

To become more effective, a new approach was required. What emerged were the database and the Database Management System (DBMS). In this section, we provide a more formal definition of these terms, and examine the components that we might expect in a DBMS environment.

1.3.1 The Database

Database: A shared collection of logically related data, and a description of this data, designed to meet the information needs of an organization.

We now examine the definition of a database to understand the concept fully. The database is a single, possibly large repository of data that can be used simultaneously by many departments and users. Instead of disconnected files with redundant data, all data items are integrated with a minimum amount of duplication.
The database is no longer owned by one department but is a shared corporate resource. The database holds not only the organization's operational data but also a description of this data. For this reason, a database is also defined as a self-describing collection of integrated records. The description of the data is known as the system catalog (or data dictionary or metadata – the 'data about data'). It is the self-describing nature of a database that provides program–data independence.

The approach taken with database systems, where the definition of data is separated from the application programs, is similar to the approach taken in modern software development, where an internal definition of an object and a separate external definition are provided. The users of an object see only the external definition and are unaware of how the object is defined and how it functions. One advantage of this approach, known as data abstraction, is that we can change the internal definition of an object without affecting the users of the object, provided the external definition remains the same. In the same way, the database approach separates the structure of the data from the application programs and stores it in the database. If new data structures are added or existing structures are modified then the application programs are unaffected, provided they do not directly depend upon what has been modified. For example, if we add a new field to a record or create a new file, existing applications are unaffected. However, if we remove a field from a file that an application program uses, then that application program is affected by this change and must be modified accordingly.

The final term in the definition of a database that we should explain is 'logically related'. When we analyze the information needs of an organization, we attempt to identify entities, attributes, and relationships. An entity is a distinct object (a person, place, thing, concept, or event) in the organization that is to be represented in the database. An attribute is a property that describes some aspect of the object that we wish to record, and a relationship is an association between entities. For example, Figure 1.6 shows an Entity–Relationship (ER) diagram for part of the DreamHome case study. It consists of:

• six entities (the rectangles): Branch, Staff, PropertyForRent, Client, PrivateOwner, and Lease;
• seven relationships (the names adjacent to the lines): Has, Offers, Oversees, Views, Owns, LeasedBy, and Holds;
• six attributes, one for each entity: branchNo, staffNo, propertyNo, clientNo, ownerNo, and leaseNo.

The database represents the entities, the attributes, and the logical relationships between the entities. In other words, the database holds data that is logically related. We discuss the Entity–Relationship model in detail in Chapters 11 and 12.

Figure 1.6 Example Entity–Relationship diagram.

1.3.2 The Database Management System (DBMS)

DBMS: A software system that enables users to define, create, maintain, and control access to the database.

The DBMS is the software that interacts with the users' application programs and the database.
Typically, a DBMS provides the following facilities:

• It allows users to define the database, usually through a Data Definition Language (DDL). The DDL allows users to specify the data types and structures and the constraints on the data to be stored in the database.
• It allows users to insert, update, delete, and retrieve data from the database, usually through a Data Manipulation Language (DML). Having a central repository for all data and data descriptions allows the DML to provide a general inquiry facility to this data, called a query language. The provision of a query language alleviates the problems with file-based systems where the user has to work with a fixed set of queries or there is a proliferation of programs, giving major software management problems. The most common query language is the Structured Query Language (SQL, pronounced 'S-Q-L', or sometimes 'See-Quel'), which is now both the formal and de facto standard language for relational DBMSs. To emphasize the importance of SQL, we devote Chapters 5 and 6, most of Chapter 28, and Appendix E to a comprehensive study of this language.
• It provides controlled access to the database. For example, it may provide:
  – a security system, which prevents unauthorized users accessing the database;
  – an integrity system, which maintains the consistency of stored data;
  – a concurrency control system, which allows shared access to the database;
  – a recovery control system, which restores the database to a previous consistent state following a hardware or software failure;
  – a user-accessible catalog, which contains descriptions of the data in the database.

1.3.3 (Database) Application Programs

Application program: A computer program that interacts with the database by issuing an appropriate request (typically an SQL statement) to the DBMS.

Users interact with the database through a number of application programs that are used to create and maintain the database and to generate information. These programs can be conventional batch applications or, more typically nowadays, they will be online applications. The application programs may be written in some programming language or in some higher-level fourth-generation language.

The database approach is illustrated in Figure 1.7, based on the file approach of Figure 1.5. It shows the Sales and Contracts Departments using their application programs to access the database through the DBMS. Each set of departmental application programs handles data entry, data maintenance, and the generation of reports. However, compared with the file-based approach, the physical structure and storage of the data are now managed by the DBMS.

Figure 1.7 Database processing.

Views

With this functionality, the DBMS is an extremely powerful and useful tool. However, as the end-users are not too interested in how complex or easy a task is for the system, it could be argued that the DBMS has made things more complex, because they now see more data than they actually need or want. For example, the details that the Contracts Department wants to see for a rental property, as shown in Figure 1.5, have changed in the database approach, shown in Figure 1.7. Now the database also holds the property type, the number of rooms, and the owner details. In recognition of this problem, a DBMS provides another facility known as a view mechanism, which allows each user to have his or her own view of the database (a view is in essence some subset of the database).
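To make these facilities concrete, the following short SQL sketch shows a data definition statement, two data manipulation statements, and a view definition. The table and column names follow the DreamHome PropertyForRent table of Figure 1.7; the data types, the sample row, and the view name are assumptions introduced purely for illustration, not the definitions used later in the book.

    -- DDL: define the structure of a table (column types are assumed for illustration)
    CREATE TABLE PropertyForRent (
        propertyNo  VARCHAR(5) NOT NULL PRIMARY KEY,
        street      VARCHAR(25),
        city        VARCHAR(15),
        postcode    VARCHAR(8),
        type        VARCHAR(10),
        rooms       SMALLINT,
        rent        DECIMAL(7,2),
        ownerNo     VARCHAR(5)
    );

    -- DML: add a row and then query the data (the address and rent are invented)
    INSERT INTO PropertyForRent
    VALUES ('PA14', '16 High Street', 'Aberdeen', 'AB7 5SU', 'House', 6, 650.00, 'CO46');

    SELECT propertyNo, city, rent
    FROM   PropertyForRent
    WHERE  type = 'House' AND rent < 700;

    -- A view exposing only the columns one group of users needs to see
    CREATE VIEW ContractsPropertyForRent AS
        SELECT propertyNo, street, city, postcode, rent
        FROM   PropertyForRent;

The view defined at the end behaves like a table when queried, but it returns only a chosen subset of the data; this is the mechanism discussed next.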
Such a view would allow the Contracts Department to see only the data that it needs for rental properties. As well as reducing complexity by letting users see the data in the way they want to see it, views have several other benefits:

• Views provide a level of security. Views can be set up to exclude data that some users should not see. For example, we could create a view that allows a branch manager and the Payroll Department to see all staff data, including salary details, and we could create a second view that other staff would use that excludes salary details.
• Views provide a mechanism to customize the appearance of the database. For example, the Contracts Department may wish to call the monthly rent field (rent) by the more obvious name, Monthly Rent.
• A view can present a consistent, unchanging picture of the structure of the database, even if the underlying database is changed (for example, fields added or removed, relationships changed, files split, restructured, or renamed). If fields are added or removed from a file, and these fields are not required by the view, the view is not affected by this change. Thus, a view helps provide the program–data independence we mentioned in the previous section.

The above discussion is general and the actual level of functionality offered by a DBMS differs from product to product. For example, a DBMS for a personal computer may not support concurrent shared access, and it may provide only limited security, integrity, and recovery control. However, modern, large multi-user DBMS products offer all the above functions and much more. Modern systems are extremely complex pieces of software consisting of millions of lines of code, with documentation comprising many volumes. This is a result of having to provide software that handles requirements of a more general nature. Furthermore, the use of DBMSs nowadays requires a system that provides almost total reliability and 24/7 availability (24 hours a day, 7 days a week), even in the presence of hardware or software failure. The DBMS is continually evolving and expanding to cope with new user requirements. For example, some applications now require the storage of graphic images, video, sound, and so on. To reach this market, the DBMS must change. It is likely that new functionality will always be required, so that the functionality of the DBMS will never become static. We discuss the basic functions provided by a DBMS in later chapters.

1.3.4 Components of the DBMS Environment

We can identify five major components in the DBMS environment: hardware, software, data, procedures, and people, as illustrated in Figure 1.8.

Figure 1.8 DBMS environment.

Hardware

The DBMS and the applications require hardware to run. The hardware can range from a single personal computer, to a single mainframe, to a network of computers. The particular hardware depends on the organization's requirements and the DBMS used. Some DBMSs run only on particular hardware or operating systems, while others run on a wide variety of hardware and operating systems. A DBMS requires a minimum amount of main memory and disk space to run, but this minimum configuration may not necessarily give acceptable performance. A simplified hardware configuration for DreamHome is illustrated in Figure 1.9.
It consists of a network of minicomputers, with a central computer located in London running the backend of the DBMS, that is, the part of the DBMS that manages and controls access to the database. It also shows several computers at various locations running the frontend of the DBMS, that is, the part of the DBMS that interfaces with the user. This is called a client–server architecture: the backend is the server and the frontends are the clients. We discuss this type of architecture in Section 2.6.

Figure 1.9 DreamHome hardware configuration.

Software

The software component comprises the DBMS software itself and the application programs, together with the operating system, including network software if the DBMS is being used over a network. Typically, application programs are written in a third-generation programming language (3GL), such as 'C', C++, Java, Visual Basic, COBOL, Fortran, Ada, or Pascal, or using a fourth-generation language (4GL), such as SQL, embedded in a third-generation language. The target DBMS may have its own fourth-generation tools that allow rapid development of applications through the provision of non-procedural query languages, report generators, forms generators, graphics generators, and application generators. The use of fourth-generation tools can improve productivity significantly and produce programs that are easier to maintain. We discuss fourth-generation tools in Section 2.2.3.

Data

Perhaps the most important component of the DBMS environment, certainly from the end-users' point of view, is the data. From Figure 1.8, we observe that the data acts as a bridge between the machine components and the human components. The database contains both the operational data and the metadata, the 'data about data'. The structure of the database is called the schema. In Figure 1.7, the schema consists of four files, or tables, namely: PropertyForRent, PrivateOwner, Client, and Lease. The PropertyForRent table has eight fields, or attributes, namely: propertyNo, street, city, postcode, type (the property type), rooms (the number of rooms), rent (the monthly rent), and ownerNo. The ownerNo attribute models the relationship between PropertyForRent and PrivateOwner: that is, an owner Owns a property for rent, as depicted in the Entity–Relationship diagram of Figure 1.6. For example, in Figure 1.2 we observe that owner CO46, Joe Keogh, owns property PA14. The data also incorporates the system catalog, which we discuss in detail in Section 2.4.

Procedures

Procedures refer to the instructions and rules that govern the design and use of the database. The users of the system and the staff that manage the database require documented procedures on how to use or run the system. These may consist of instructions on how to:

• log on to the DBMS;
• use a particular DBMS facility or application program;
• start and stop the DBMS;
• make backup copies of the database;
• handle hardware or software failures. This may include procedures on how to identify the failed component, how to fix the failed component (for example, telephone the appropriate hardware engineer) and, following the repair of the fault, how to recover the database;
• change the structure of a table, reorganize the database across multiple disks, improve performance, or archive data to secondary storage.

People

The final component is the people involved with the system. We discuss this component in Section 1.4.
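Returning briefly to the data component: the system catalog mentioned above is itself ordinary data that users can query. As a small sketch, many SQL-based DBMSs expose the catalog through the standard INFORMATION_SCHEMA views, although the exact catalog views, and the casing of names stored in them, vary between products (Oracle, for example, provides its own data dictionary views instead). The query below is therefore illustrative rather than portable as written.

    -- Ask the catalog which columns the PropertyForRent table has
    SELECT column_name, data_type
    FROM   information_schema.columns
    WHERE  table_name = 'PropertyForRent';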
Database Design: The Paradigm Shift 1.3.5 Until now, we have taken it for granted that there is a structure to the data in the database. For example, we have identified four tables in Figure 1.7: PropertyForRent, PrivateOwner, Client, and Lease. But how did we get this structure? The answer is quite simple: the structure of the database is determined during database design. However, carrying out database design can be extremely complex. To produce a system that will satisfy the organization’s information needs requires a different approach from that of file-based systems, where the work was driven by the application needs of individual departments. For the database approach to succeed, the organization now has to think of the data first and the application second. This change in approach is sometimes referred to as a paradigm shift. For the system to be acceptable to the end-users, the database design activity is crucial. A poorly designed database will generate errors that may lead to bad decisions being made, which may have serious repercussions for the organization. On the other hand, a well-designed database produces a system that provides the correct information for the decision-making process to succeed in an efficient way. The objective of this book is to help effect this paradigm shift. We devote several chapters to the presentation of a complete methodology for database design (see Chapters 15–18). It is presented as a series of simple-to-follow steps, with guidelines provided throughout. For example, in the Entity–Relationship diagram of Figure 1.6, we have identified six entities, seven relationships, and six attributes. We provide guidelines to help identify the entities, attributes, and relationships that have to be represented in the database. Unfortunately, database design methodologies are not very popular. Many organizations and individual designers rely very little on methodologies for conducting the design of databases, and this is commonly considered a major cause of failure in the development of database systems. Owing to the lack of structured approaches to database design, the time or resources required for a database project are typically underestimated, the databases developed are inadequate or inefficient in meeting the demands of applications, documentation is limited, and maintenance is difficult. Roles in the Database Environment In this section, we examine what we listed in the previous section as the fifth component of the DBMS environment: the people. We can identify four distinct types of people that participate in the DBMS environment: data and database administrators, database designers, application developers, and the end-users. 1.4 | 21 22 | Chapter 1 z Introduction to Databases 1.4.1 Data and Database Administrators The database and the DBMS are corporate resources that must be managed like any other resource. Data and database administration are the roles generally associated with the management and control of a DBMS and its data. The Data Administrator (DA) is responsible for the management of the data resource including database planning, development and maintenance of standards, policies and procedures, and conceptual/logical database design. The DA consults with and advises senior managers, ensuring that the direction of database development will ultimately support corporate objectives. 
The Database Administrator (DBA) is responsible for the physical realization of the database, including physical database design and implementation, security and integrity control, maintenance of the operational system, and ensuring satisfactory performance of the applications for users. The role of the DBA is more technically oriented than the role of the DA, requiring detailed knowledge of the target DBMS and the system environment. In some organizations there is no distinction between these two roles; in others, the importance of the corporate resources is reflected in the allocation of teams of staff dedicated to each of these roles. We discuss data and database administration in more detail in Section 9.15. 1.4.2 Database Designers In large database design projects, we can distinguish between two types of designer: logical database designers and physical database designers. The logical database designer is concerned with identifying the data (that is, the entities and attributes), the relationships between the data, and the constraints on the data that is to be stored in the database. The logical database designer must have a thorough and complete understanding of the organization’s data and any constraints on this data (the constraints are sometimes called business rules). These constraints describe the main characteristics of the data as viewed by the organization. Examples of constraints for DreamHome are: n n n a member of staff cannot manage more than 100 properties for rent or sale at the same time; a member of staff cannot handle the sale or rent of his or her own property; a solicitor cannot act for both the buyer and seller of a property. To be effective, the logical database designer must involve all prospective database users in the development of the data model, and this involvement should begin as early in the process as possible. In this book, we split the work of the logical database designer into two stages: n n conceptual database design, which is independent of implementation details such as the target DBMS, application programs, programming languages, or any other physical considerations; logical database design, which targets a specific data model, such as relational, network, hierarchical, or object-oriented. 1.4 Roles in the Database Environment The physical database designer decides how the logical database design is to be physically realized. This involves: n n n mapping the logical database design into a set of tables and integrity constraints; selecting specific storage structures and access methods for the data to achieve good performance; designing any security measures required on the data. Many parts of physical database design are highly dependent on the target DBMS, and there may be more than one way of implementing a mechanism. Consequently, the physical database designer must be fully aware of the functionality of the target DBMS and must understand the advantages and disadvantages of each alternative for a particular implementation. The physical database designer must be capable of selecting a suitable storage strategy that takes account of usage. Whereas conceptual and logical database design are concerned with the what, physical database design is concerned with the how. It requires different skills, which are often found in different people. We present a methodology for conceptual database design in Chapter 15, for logical database design in Chapter 16, and for physical database design in Chapters 17 and 18. 
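To give a flavour of the physical designer's work just described, the following sketch maps part of the logical design into a table with integrity constraints and then selects an access method. The Staff and Branch column names follow the case study; the data types, the salary check, and the choice of index are illustrative assumptions only.

-- Map part of the logical design into a table with integrity constraints
-- (column names from the DreamHome case study; types and the CHECK are assumed;
--  a Branch table is assumed to have been created already).
CREATE TABLE Staff (
    staffNo   VARCHAR(5)   NOT NULL,
    fName     VARCHAR(15),
    lName     VARCHAR(15),
    position  VARCHAR(10),
    salary    DECIMAL(9,2) CHECK (salary >= 0),         -- a simple integrity constraint
    branchNo  VARCHAR(4),
    PRIMARY KEY (staffNo),
    FOREIGN KEY (branchNo) REFERENCES Branch(branchNo)  -- staff must work at an existing branch
);

-- Select an access method to support an assumed common query pattern:
CREATE INDEX staffBranchIdx ON Staff(branchNo);

Business rules such as 'a member of staff cannot manage more than 100 properties for rent or sale at the same time' usually need triggers or application logic rather than simple column constraints.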
1.4.3 Application Developers

Once the database has been implemented, the application programs that provide the required functionality for the end-users must be implemented. This is the responsibility of the application developers. Typically, the application developers work from a specification produced by systems analysts. Each program contains statements that request the DBMS to perform some operation on the database. This includes retrieving data, inserting, updating, and deleting data. The programs may be written in a third-generation programming language or a fourth-generation language, as discussed in the previous section.

1.4.4 End-Users

The end-users are the ‘clients’ for the database, which has been designed and implemented, and is being maintained to serve their information needs. End-users can be classified according to the way they use the system:

Naïve users are typically unaware of the DBMS. They access the database through specially written application programs that attempt to make the operations as simple as possible. They invoke database operations by entering simple commands or choosing options from a menu. This means that they do not need to know anything about the database or the DBMS. For example, the checkout assistant at the local supermarket uses a bar code reader to find out the price of the item. However, there is an application program present that reads the bar code, looks up the price of the item in the database, reduces the database field containing the number of such items in stock, and displays the price on the till.

Sophisticated users. At the other end of the spectrum, the sophisticated end-user is familiar with the structure of the database and the facilities offered by the DBMS. Sophisticated end-users may use a high-level query language such as SQL to perform the required operations. Some sophisticated end-users may even write application programs for their own use.

1.5 History of Database Management Systems

We have already seen that the predecessor to the DBMS was the file-based system. However, there was never a time when the database approach began and the file-based system ceased. In fact, the file-based system still exists in specific areas. It has been suggested that the DBMS has its roots in the 1960s Apollo moon-landing project, which was initiated in response to President Kennedy’s objective of landing a man on the moon by the end of that decade. At that time there was no system available that would be able to handle and manage the vast amounts of information that the project would generate. As a result, North American Aviation (NAA, now Rockwell International), the prime contractor for the project, developed software known as GUAM (Generalized Update Access Method). GUAM was based on the concept that smaller components come together as parts of larger components, and so on, until the final product is assembled. This structure, which conforms to an upside-down tree, is also known as a hierarchical structure. In the mid-1960s, IBM joined NAA to develop GUAM into what is now known as IMS (Information Management System). The reason why IBM restricted IMS to the management of hierarchies of records was to allow the use of serial storage devices, most notably magnetic tape, which was a market requirement at that time. This restriction was subsequently dropped. Although one of the earliest commercial DBMSs, IMS is still the main hierarchical DBMS used by most large mainframe installations.
In the mid-1960s, another significant development was the emergence of IDS (Integrated Data Store) from General Electric. This work was headed by one of the early pioneers of database systems, Charles Bachmann. This development led to a new type of database system known as the network DBMS, which had a profound effect on the information systems of that generation. The network database was developed partly to address the need to represent more complex data relationships than could be modeled with hierarchical structures, and partly to impose a database standard. To help establish such standards, the Conference on Data Systems Languages (CODASYL), comprising representatives of the US government and the world of business and commerce, formed a List Processing Task Force in 1965, subsequently renamed the Data Base Task Group (DBTG) in 1967. The terms of reference for the DBTG were to define standard specifications for an environment that would allow database creation and data manipulation. A draft report was issued in 1969 and the first definitive report in 1971. The DBTG proposal identified three components: n n n the network schema – the logical organization of the entire database as seen by the DBA – which includes a definition of the database name, the type of each record, and the components of each record type; the subschema – the part of the database as seen by the user or application program; a data management language to define the data characteristics and the data structure, and to manipulate the data. 1.5 History of Database Management Systems For standardization, the DBTG specified three distinct languages: n n n a schema Data Definition Language (DDL), which enables the DBA to define the schema; a subschema DDL, which allows the application programs to define the parts of the database they require; a Data Manipulation Language (DML), to manipulate the data. Although the report was not formally adopted by the American National Standards Institute (ANSI), a number of systems were subsequently developed following the DBTG proposal. These systems are now known as CODASYL or DBTG systems. The CODASYL and hierarchical approaches represented the first-generation of DBMSs. We look more closely at these systems on the Web site for this book (see Preface for the URL). However, these two models have some fundamental disadvantages: n n n complex programs have to be written to answer even simple queries based on navigational record-oriented access; there is minimal data independence; there is no widely accepted theoretical foundation. In 1970 E. F. Codd of the IBM Research Laboratory produced his highly influential paper on the relational data model. This paper was very timely and addressed the disadvantages of the former approaches. Many experimental relational DBMSs were implemented thereafter, with the first commercial products appearing in the late 1970s and early 1980s. Of particular note is the System R project at IBM’s San José Research Laboratory in California, which was developed during the late 1970s (Astrahan et al., 1976). This project was designed to prove the practicality of the relational model by providing an implementation of its data structures and operations, and led to two major developments: n n the development of a structured query language called SQL, which has since become the standard language for relational DBMSs; the production of various commercial relational DBMS products during the 1980s, for example DB2 and SQL/DS from IBM and Oracle from Oracle Corporation. 
Now there are several hundred relational DBMSs for both mainframe and PC environments, though many are stretching the definition of the relational model. Other examples of multi-user relational DBMSs are Advantage Ingres Enterprise Relational Database from Computer Associates, and Informix from IBM. Examples of PC-based relational DBMSs are Office Access and Visual FoxPro from Microsoft, InterBase and JDataStore from Borland, and R:Base from R:Base Technologies. Relational DBMSs are referred to as second-generation DBMSs. We discuss the relational data model in Chapter 3. The relational model is not without its failings, and in particular its limited modeling capabilities. There has been much research since then attempting to address this problem. In 1976, Chen presented the Entity–Relationship model, which is now a widely accepted technique for database design and the basis for the methodology presented in Chapters 15 and 16 of this book. In 1979, Codd himself attempted to address some of the failings in his original work with an extended version of the relational model called RM/T (1979) and subsequently RM/V2 (1990). The attempts to provide a data model that represents the ‘real world’ more closely have been loosely classified as semantic data modeling. | 25 26 | Chapter 1 z Introduction to Databases In response to the increasing complexity of database applications, two ‘new’ systems have emerged: the Object-Oriented DBMS (OODBMS) and the Object-Relational DBMS (ORDBMS). However, unlike previous models, the actual composition of these models is not clear. This evolution represents third-generation DBMSs, which we discuss in Chapters 25–28. 1.6 Advantages and Disadvantages of DBMSs The database management system has promising potential advantages. Unfortunately, there are also disadvantages. In this section, we examine these advantages and disadvantages. Advantages The advantages of database management systems are listed in Table 1.2. Control of data redundancy As we discussed in Section 1.2, traditional file-based systems waste space by storing the same information in more than one file. For example, in Figure 1.5, we stored similar data for properties for rent and clients in both the Sales and Contracts Departments. In contrast, the database approach attempts to eliminate the redundancy by integrating the files so that multiple copies of the same data are not stored. However, the database approach does not eliminate redundancy entirely, but controls the amount of redundancy inherent in the database. Sometimes, it is necessary to duplicate key data items to model relationships. At other times, it is desirable to duplicate some data items to improve performance. The reasons for controlled duplication will become clearer as you read the next few chapters. Data consistency By eliminating or controlling redundancy, we reduce the risk of inconsistencies occurring. If a data item is stored only once in the database, any update to its value has to be performed only once and the new value is available immediately to all users. If a data item is stored more than once and the system is aware of this, the system can ensure that all copies Table 1.2 Advantages of DBMSs. 
Control of data redundancy Data consistency More information from the same amount of data Sharing of data Improved data integrity Improved security Enforcement of standards Economy of scale Balance of conflicting requirements Improved data accessibility and responsiveness Increased productivity Improved maintenance through data independence Increased concurrency Improved backup and recovery services 1.6 Advantages and Disadvantages of DBMSs of the item are kept consistent. Unfortunately, many of today’s DBMSs do not automatically ensure this type of consistency. More information from the same amount of data With the integration of the operational data, it may be possible for the organization to derive additional information from the same data. For example, in the file-based system illustrated in Figure 1.5, the Contracts Department does not know who owns a leased property. Similarly, the Sales Department has no knowledge of lease details. When we integrate these files, the Contracts Department has access to owner details and the Sales Department has access to lease details. We may now be able to derive more information from the same amount of data. Sharing of data Typically, files are owned by the people or departments that use them. On the other hand, the database belongs to the entire organization and can be shared by all authorized users. In this way, more users share more of the data. Furthermore, new applications can build on the existing data in the database and add only data that is not currently stored, rather than having to define all data requirements again. The new applications can also rely on the functions provided by the DBMS, such as data definition and manipulation, and concurrency and recovery control, rather than having to provide these functions themselves. Improved data integrity Database integrity refers to the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate. Constraints may apply to data items within a single record or they may apply to relationships between records. For example, an integrity constraint could state that a member of staff’s salary cannot be greater than £40,000 or that the branch number contained in a staff record, representing the branch where the member of staff works, must correspond to an existing branch office. Again, integration allows the DBA to define, and the DBMS to enforce, integrity constraints. Improved security Database security is the protection of the database from unauthorized users. Without suitable security measures, integration makes the data more vulnerable than file-based systems. However, integration allows the DBA to define, and the DBMS to enforce, database security. This may take the form of user names and passwords to identify people authorized to use the database. The access that an authorized user is allowed on the data may be restricted by the operation type (retrieval, insert, update, delete). For example, the DBA has access to all the data in the database; a branch manager may have access to all data that relates to his or her branch office; and a sales assistant may have access to all data relating to properties but no access to sensitive data such as staff salary details. Enforcement of standards Again, integration allows the DBA to define and enforce the necessary standards. 
These may include departmental, organizational, national, or international standards for such | 27 28 | Chapter 1 z Introduction to Databases things as data formats to facilitate exchange of data between systems, naming conventions, documentation standards, update procedures, and access rules. Economy of scale Combining all the organization’s operational data into one database, and creating a set of applications that work on this one source of data, can result in cost savings. In this case, the budget that would normally be allocated to each department for the development and maintenance of its file-based system can be combined, possibly resulting in a lower total cost, leading to an economy of scale. The combined budget can be used to buy a system configuration that is more suited to the organization’s needs. This may consist of one large, powerful computer or a network of smaller computers. Balance of conflicting requirements Each user or department has needs that may be in conflict with the needs of other users. Since the database is under the control of the DBA, the DBA can make decisions about the design and operational use of the database that provide the best use of resources for the organization as a whole. These decisions will provide optimal performance for important applications, possibly at the expense of less critical ones. Improved data accessibility and responsiveness Again, as a result of integration, data that crosses departmental boundaries is directly accessible to the end-users. This provides a system with potentially much more functionality that can, for example, be used to provide better services to the end-user or the organization’s clients. Many DBMSs provide query languages or report writers that allow users to ask ad hoc questions and to obtain the required information almost immediately at their terminal, without requiring a programmer to write some software to extract this information from the database. For example, a branch manager could list all flats with a monthly rent greater than £400 by entering the following SQL command at a terminal: SELECT* FROM PropertyForRent WHERE type = ‘Flat’ AND rent > 400; Increased productivity As mentioned previously, the DBMS provides many of the standard functions that the programmer would normally have to write in a file-based application. At a basic level, the DBMS provides all the low-level file-handling routines that are typical in application programs. The provision of these functions allows the programmer to concentrate on the specific functionality required by the users without having to worry about low-level implementation details. Many DBMSs also provide a fourth-generation environment consisting of tools to simplify the development of database applications. This results in increased programmer productivity and reduced development time (with associated cost savings). Improved maintenance through data independence In file-based systems, the descriptions of the data and the logic for accessing the data are built into each application program, making the programs dependent on the data. A 1.6 Advantages and Disadvantages of DBMSs change to the structure of the data, for example making an address 41 characters instead of 40 characters, or a change to the way the data is stored on disk, can require substantial alterations to the programs that are affected by the change. In contrast, a DBMS separates the data descriptions from the applications, thereby making applications immune to changes in the data descriptions. 
This is known as data independence and is discussed further in Section 2.1.5. The provision of data independence simplifies database application maintenance. Increased concurrency In some file-based systems, if two or more users are allowed to access the same file simultaneously, it is possible that the accesses will interfere with each other, resulting in loss of information or even loss of integrity. Many DBMSs manage concurrent database access and ensure such problems cannot occur. We discuss concurrency control in Chapter 20. Improved backup and recovery services Many file-based systems place the responsibility on the user to provide measures to protect the data from failures to the computer system or application program. This may involve taking a nightly backup of the data. In the event of a failure during the next day, the backup is restored and the work that has taken place since this backup is lost and has to be re-entered. In contrast, modern DBMSs provide facilities to minimize the amount of processing that is lost following a failure. We discuss database recovery in Section 20.3. Disadvantages The disadvantages of the database approach are summarized in Table 1.3. Complexity The provision of the functionality we expect of a good DBMS makes the DBMS an extremely complex piece of software. Database designers and developers, the data and database administrators, and end-users must understand this functionality to take full advantage of it. Failure to understand the system can lead to bad design decisions, which can have serious consequences for an organization. Table 1.3 Disadvantages of DBMSs. Complexity Size Cost of DBMSs Additional hardware costs Cost of conversion Performance Higher impact of a failure | 29 30 | Chapter 1 z Introduction to Databases Size The complexity and breadth of functionality makes the DBMS an extremely large piece of software, occupying many megabytes of disk space and requiring substantial amounts of memory to run efficiently. Cost of DBMSs The cost of DBMSs varies significantly, depending on the environment and functionality provided. For example, a single-user DBMS for a personal computer may only cost US$100. However, a large mainframe multi-user DBMS servicing hundreds of users can be extremely expensive, perhaps US$100,000 or even US$1,000,000. There is also the recurrent annual maintenance cost, which is typically a percentage of the list price. Additional hardware costs The disk storage requirements for the DBMS and the database may necessitate the purchase of additional storage space. Furthermore, to achieve the required performance, it may be necessary to purchase a larger machine, perhaps even a machine dedicated to running the DBMS. The procurement of additional hardware results in further expenditure. Cost of conversion In some situations, the cost of the DBMS and extra hardware may be insignificant compared with the cost of converting existing applications to run on the new DBMS and hardware. This cost also includes the cost of training staff to use these new systems, and possibly the employment of specialist staff to help with the conversion and running of the system. This cost is one of the main reasons why some organizations feel tied to their current systems and cannot switch to more modern database technology. The term legacy system is sometimes used to refer to an older, and usually inferior, system. Performance Typically, a file-based system is written for a specific application, such as invoicing. 
As a result, performance is generally very good. However, the DBMS is written to be more general, to cater for many applications rather than just one. The effect is that some applications may not run as fast as they used to. Higher impact of a failure The centralization of resources increases the vulnerability of the system. Since all users and applications rely on the availability of the DBMS, the failure of certain components can bring operations to a halt. Chapter Summary | 31 Chapter Summary n The Database Management System (DBMS) is now the underlying framework of the information system and has fundamentally changed the way that many organizations operate. The database system remains a very active research area and many significant problems have still to be satisfactorily resolved. n The predecessor to the DBMS was the file-based system, which is a collection of application programs that perform services for the end-users, usually the production of reports. Each program defines and manages its own data. Although the file-based system was a great improvement on the manual filing system, it still has significant problems, mainly the amount of data redundancy present and program–data dependence. n The database approach emerged to resolve the problems with the file-based approach. A database is a shared collection of logically related data, and a description of this data, designed to meet the information needs of an organization. A DBMS is a software system that enables users to define, create, maintain, and control access to the database. An application program is a computer program that interacts with the database by issuing an appropriate request (typically an SQL statement) to the DBMS. The more inclusive term database system is used to define a collection of application programs that interact with the database along with the DBMS and database itself. n All access to the database is through the DBMS. The DBMS provides a Data Definition Language (DDL), which allows users to define the database, and a Data Manipulation Language (DML), which allows users to insert, update, delete, and retrieve data from the database. n The DBMS provides controlled access to the database. It provides security, integrity, concurrency and recovery control, and a user-accessible catalog. It also provides a view mechanism to simplify the data that users have to deal with. n The DBMS environment consists of hardware (the computer), software (the DBMS, operating system, and applications programs), data, procedures, and people. The people include data and database administrators, database designers, application developers, and end-users. n The roots of the DBMS lie in file-based systems. The hierarchical and CODASYL systems represent the first-generation of DBMSs. The hierarchical model is typified by IMS (Information Management System) and the network or CODASYL model by IDS (Integrated Data Store), both developed in the mid-1960s. The relational model, proposed by E. F. Codd in 1970, represents the second-generation of DBMSs. It has had a fundamental effect on the DBMS community and there are now over one hundred relational DBMSs. The third-generation of DBMSs are represented by the Object-Relational DBMS and the Object-Oriented DBMS. n Some advantages of the database approach include control of data redundancy, data consistency, sharing of data, and improved security and integrity. Some disadvantages include complexity, cost, reduced performance, and higher impact of a failure. 
32 | Chapter 1 z Introduction to Databases Review Questions 1.1 List four examples of database systems other than those listed in Section 1.1. 1.2 Discuss each of the following terms: (a) data (b) database (c) database management system (d) database application program (e) data independence (f ) security (g) integrity (h) views. 1.3 Describe the approach taken to the handling of data in the early file-based systems. Discuss the disadvantages of this approach. 1.4 Describe the main characteristics of the database approach and contrast it with the file-based approach. 1.5 Describe the five components of the DBMS environment and discuss how they relate to each other. 1.6 Discuss the roles of the following personnel in the database environment: (a) data administrator (b) database administrator (c) logical database designer (d) physical database designer (e) application developer (f ) end-users. 1.7 Discuss the advantages and disadvantages of DBMSs. Exercises 1.8 Interview some users of database systems. Which DBMS features do they find most useful and why? Which DBMS facilities do they find least useful and why? What do these users perceive to be the advantages and disadvantages of the DBMS? 1.9 Write a small program (using pseudocode if necessary) that allows entry and display of client details including a client number, name, address, telephone number, preferred number of rooms, and maximum rent. The details should be stored in a file. Enter a few records and display the details. Now repeat this process but rather than writing a special program, use any DBMS that you have access to. What can you conclude from these two approaches? 1.10 Study the DreamHome case study presented in Section 10.4 and Appendix A. In what ways would a DBMS help this organization? What data can you identify that needs to be represented in the database? What relationships exist between the data items? What queries do you think are required? 1.11 Study the Wellmeadows Hospital case study presented in Appendix B.3. In what ways would a DBMS help this organization? What data can you identify that needs to be represented in the database? What relationships exist between the data items? Chapter 2 Database Environment Chapter Objectives In this chapter you will learn: n The purpose and origin of the three-level database architecture. n The contents of the external, conceptual, and internal levels. n The purpose of the external/conceptual and the conceptual/internal mappings. n The meaning of logical and physical data independence. n The distinction between a Data Definition Language (DDL) and a Data Manipulation Language (DML). n A classification of data models. n The purpose and importance of conceptual modeling. n The typical functions and services a DBMS should provide. n The function and importance of the system catalog. n The software components of a DBMS. n The meaning of the client–server architecture and the advantages of this type of architecture for a DBMS. n The function and uses of Transaction Processing (TP) Monitors. A major aim of a database system is to provide users with an abstract view of data, hiding certain details of how data is stored and manipulated. Therefore, the starting point for the design of a database must be an abstract and general description of the information requirements of the organization that is to be represented in the database. In this chapter, and throughout this book, we use the term ‘organization’ loosely, to mean the whole organization or part of the organization. 
For example, in the DreamHome case study we may be interested in modeling: n the ‘real world’ entities Staff, PropertyforRent, PrivateOwner, and Client; n attributes describing properties or qualities of each entity (for example, name, position, and salary); n relationships between these entities (for example, Staff Manages PropertyForRent). Staff have a 34 | Chapter 2 z Database Environment Furthermore, since a database is a shared resource, each user may require a different view of the data held in the database. To satisfy these needs, the architecture of most commercial DBMSs available today is based to some extent on the so-called ANSI-SPARC architecture. In this chapter, we discuss various architectural and functional characteristics of DBMSs. Structure of this Chapter In Section 2.1 we examine the three-level ANSI-SPARC architecture and its associated benefits. In Section 2.2 we consider the types of language that are used by DBMSs, and in Section 2.3 we introduce the concepts of data models and conceptual modeling, which we expand on in later parts of the book. In Section 2.4 we discuss the functions that we would expect a DBMS to provide, and in Sections 2.5 and 2.6 we examine the internal architecture of a typical DBMS. The examples in this chapter are drawn from the DreamHome case study, which we discuss more fully in Section 10.4 and Appendix A. Much of the material in this chapter provides important background information on DBMSs. However, the reader who is new to the area of database systems may find some of the material difficult to appreciate on first reading. Do not be too concerned about this, but be prepared to revisit parts of this chapter at a later date when you have read subsequent chapters of the book. 2.1 The Three-Level ANSI-SPARC Architecture An early proposal for a standard terminology and general architecture for database systems was produced in 1971 by the DBTG (Data Base Task Group) appointed by the Conference on Data Systems and Languages (CODASYL, 1971). The DBTG recognized the need for a two-level approach with a system view called the schema and user views called subschemas. The American National Standards Institute (ANSI) Standards Planning and Requirements Committee (SPARC), ANSI/X3/SPARC, produced a similar terminology and architecture in 1975 (ANSI, 1975). ANSI-SPARC recognized the need for a three-level approach with a system catalog. These proposals reflected those published by the IBM user organizations Guide and Share some years previously, and concentrated on the need for an implementation-independent layer to isolate programs from underlying representational issues (Guide/Share, 1970). Although the ANSI-SPARC model did not become a standard, it still provides a basis for understanding some of the functionality of a DBMS. For our purposes, the fundamental point of these and later reports is the identification of three levels of abstraction, that is, three distinct levels at which data items can be described. The levels form a three-level architecture comprising an external, a conceptual, and an internal level, as depicted in Figure 2.1. The way users perceive the data is called the external level. The way the DBMS and the operating system perceive the data is the internal level, where the data is actually stored using the data structures and file 2.1 The Three-Level ANSI-SPARC Architecture | 35 Figure 2.1 The ANSI-SPARC three-level architecture. organizations described in Appendix C. 
The conceptual level provides both the mapping and the desired independence between the external and internal levels. The objective of the three-level architecture is to separate each user’s view of the database from the way the database is physically represented. There are several reasons why this separation is desirable: n n n n n Each user should be able to access the same data, but have a different customized view of the data. Each user should be able to change the way he or she views the data, and this change should not affect other users. Users should not have to deal directly with physical database storage details, such as indexing or hashing (see Appendix C). In other words, a user’s interaction with the database should be independent of storage considerations. The Database Administrator (DBA) should be able to change the database storage structures without affecting the users’ views. The internal structure of the database should be unaffected by changes to the physical aspects of storage, such as the changeover to a new storage device. The DBA should be able to change the conceptual structure of the database without affecting all users. External Level External level The users’ view of the database. This level describes that part of the database that is relevant to each user. 2.1.1 36 | Chapter 2 z Database Environment The external level consists of a number of different external views of the database. Each user has a view of the ‘real world’ represented in a form that is familiar for that user. The external view includes only those entities, attributes, and relationships in the ‘real world’ that the user is interested in. Other entities, attributes, or relationships that are not of interest may be represented in the database, but the user will be unaware of them. In addition, different views may have different representations of the same data. For example, one user may view dates in the form (day, month, year), while another may view dates as (year, month, day). Some views might include derived or calculated data: data not actually stored in the database as such, but created when needed. For example, in the DreamHome case study, we may wish to view the age of a member of staff. However, it is unlikely that ages would be stored, as this data would have to be updated daily. Instead, the member of staff’s date of birth would be stored and age would be calculated by the DBMS when it is referenced. Views may even include data combined or derived from several entities. We discuss views in more detail in Sections 3.4 and 6.4. 2.1.2 Conceptual Level Conceptual level The community view of the database. This level describes what data is stored in the database and the relationships among the data. The middle level in the three-level architecture is the conceptual level. This level contains the logical structure of the entire database as seen by the DBA. It is a complete view of the data requirements of the organization that is independent of any storage considerations. The conceptual level represents: n n n n all entities, their attributes, and their relationships; the constraints on the data; semantic information about the data; security and integrity information. The conceptual level supports each external view, in that any data available to a user must be contained in, or derivable from, the conceptual level. However, this level must not contain any storage-dependent details. 
For instance, the description of an entity should contain only data types of attributes (for example, integer, real, character) and their length (such as the maximum number of digits or characters), but not any storage considerations, such as the number of bytes occupied. 2.1.3 Internal Level Internal level The physical representation of the database on the computer. This level describes how the data is stored in the database. 2.1 The Three-Level ANSI-SPARC Architecture The internal level covers the physical implementation of the database to achieve optimal runtime performance and storage space utilization. It covers the data structures and file organizations used to store data on storage devices. It interfaces with the operating system access methods (file management techniques for storing and retrieving data records) to place the data on the storage devices, build the indexes, retrieve the data, and so on. The internal level is concerned with such things as: n n n n storage space allocation for data and indexes; record descriptions for storage (with stored sizes for data items); record placement; data compression and data encryption techniques. Below the internal level there is a physical level that may be managed by the operating system under the direction of the DBMS. However, the functions of the DBMS and the operating system at the physical level are not clear-cut and vary from system to system. Some DBMSs take advantage of many of the operating system access methods, while others use only the most basic ones and create their own file organizations. The physical level below the DBMS consists of items only the operating system knows, such as exactly how the sequencing is implemented and whether the fields of internal records are stored as contiguous bytes on the disk. Schemas, Mappings, and Instances The overall description of the database is called the database schema. There are three different types of schema in the database and these are defined according to the levels of abstraction of the three-level architecture illustrated in Figure 2.1. At the highest level, we have multiple external schemas (also called subschemas) that correspond to different views of the data. At the conceptual level, we have the conceptual schema, which describes all the entities, attributes, and relationships together with integrity constraints. At the lowest level of abstraction we have the internal schema, which is a complete description of the internal model, containing the definitions of stored records, the methods of representation, the data fields, and the indexes and storage structures used. There is only one conceptual schema and one internal schema per database. The DBMS is responsible for mapping between these three types of schema. It must also check the schemas for consistency; in other words, the DBMS must check that each external schema is derivable from the conceptual schema, and it must use the information in the conceptual schema to map between each external schema and the internal schema. The conceptual schema is related to the internal schema through a conceptual/internal mapping. This enables the DBMS to find the actual record or combination of records in physical storage that constitute a logical record in the conceptual schema, together with any constraints to be enforced on the operations for that logical record. It also allows any differences in entity names, attribute names, attribute order, data types, and so on, to be resolved. 
Finally, each external schema is related to the conceptual schema by the external/conceptual mapping. This enables the DBMS to map names in the user’s view on to the relevant part of the conceptual schema. 2.1.4 | 37 38 | Chapter 2 z Database Environment Figure 2.2 Differences between the three levels. An example of the different levels is shown in Figure 2.2. Two different external views of staff details exist: one consisting of a staff number (sNo), first name (fName), last name (lName), age, and salary; a second consisting of a staff number (staffNo), last name (lName), and the number of the branch the member of staff works at (branchNo). These external views are merged into one conceptual view. In this merging process, the major difference is that the age field has been changed into a date of birth field, DOB. The DBMS maintains the external/conceptual mapping; for example, it maps the sNo field of the first external view to the staffNo field of the conceptual record. The conceptual level is then mapped to the internal level, which contains a physical description of the structure for the conceptual record. At this level, we see a definition of the structure in a high-level language. The structure contains a pointer, next, which allows the list of staff records to be physically linked together to form a chain. Note that the order of fields at the internal level is different from that at the conceptual level. Again, the DBMS maintains the conceptual/internal mapping. It is important to distinguish between the description of the database and the database itself. The description of the database is the database schema. The schema is specified during the database design process and is not expected to change frequently. However, the actual data in the database may change frequently; for example, it changes every time we insert details of a new member of staff or a new property. The data in the database at any particular point in time is called a database instance. Therefore, many database instances can correspond to the same database schema. The schema is sometimes called the intension of the database, while an instance is called an extension (or state) of the database. 2.1.5 Data Independence A major objective for the three-level architecture is to provide data independence, which means that upper levels are unaffected by changes to lower levels. There are two kinds of data independence: logical and physical. 2.2 Database Languages | 39 Figure 2.3 Data independence and the ANSISPARC three-level architecture. Logical data independence Logical data independence refers to the immunity of the external schemas to changes in the conceptual schema. Changes to the conceptual schema, such as the addition or removal of new entities, attributes, or relationships, should be possible without having to change existing external schemas or having to rewrite application programs. Clearly, the users for whom the changes have been made need to be aware of them, but what is important is that other users should not be. Physical data independence Physical data independence refers to the immunity of the conceptual schema to changes in the internal schema. Changes to the internal schema, such as using different file organizations or storage structures, using different storage devices, modifying indexes, or hashing algorithms, should be possible without having to change the conceptual or external schemas. From the users’ point of view, the only effect that may be noticed is a change in performance. 
In fact, deterioration in performance is the most common reason for internal schema changes. Figure 2.3 illustrates where each type of data independence occurs in relation to the threelevel architecture. The two-stage mapping in the ANSI-SPARC architecture may be inefficient, but provides greater data independence. However, for more efficient mapping, the ANSI-SPARC model allows the direct mapping of external schemas on to the internal schema, thus bypassing the conceptual schema. This, of course, reduces data independence, so that every time the internal schema changes, the external schema, and any dependent application programs may also have to change. Database Languages A data sublanguage consists of two parts: a Data Definition Language (DDL) and a Data Manipulation Language (DML). The DDL is used to specify the database schema 2.2 40 | Chapter 2 z Database Environment and the DML is used to both read and update the database. These languages are called data sublanguages because they do not include constructs for all computing needs such as conditional or iterative statements, which are provided by the high-level programming languages. Many DBMSs have a facility for embedding the sublanguage in a high-level programming language such as COBOL, Fortran, Pascal, Ada, ‘C’, C++, Java, or Visual Basic. In this case, the high-level language is sometimes referred to as the host language. To compile the embedded file, the commands in the data sublanguage are first removed from the host-language program and replaced by function calls. The pre-processed file is then compiled, placed in an object module, linked with a DBMS-specific library containing the replaced functions, and executed when required. Most data sublanguages also provide non-embedded, or interactive, commands that can be input directly from a terminal. 2.2.1 The Data Definition Language (DDL) DDL A language that allows the DBA or user to describe and name the entities, attributes, and relationships required for the application, together with any associated integrity and security constraints. The database schema is specified by a set of definitions expressed by means of a special language called a Data Definition Language. The DDL is used to define a schema or to modify an existing one. It cannot be used to manipulate data. The result of the compilation of the DDL statements is a set of tables stored in special files collectively called the system catalog. The system catalog integrates the metadata, that is data that describes objects in the database and makes it easier for those objects to be accessed or manipulated. The metadata contains definitions of records, data items, and other objects that are of interest to users or are required by the DBMS. The DBMS normally consults the system catalog before the actual data is accessed in the database. The terms data dictionary and data directory are also used to describe the system catalog, although the term ‘data dictionary’ usually refers to a more general software system than a catalog for a DBMS. We discuss the system catalog further in Section 2.4. At a theoretical level, we could identify different DDLs for each schema in the threelevel architecture, namely a DDL for the external schemas, a DDL for the conceptual schema, and a DDL for the internal schema. However, in practice, there is one comprehensive DDL that allows specification of at least the external and conceptual schemas. 
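To illustrate how one comprehensive DDL can cover both the conceptual and external schemas, the sketch below defines an external view over the Staff table (assumed to hold a DOB column, as in the Figure 2.2 example) and then modifies the conceptual schema. The age expression and the mobilePhone column are placeholders: date arithmetic syntax differs between products, and mobilePhone is not part of the case study.

-- External schema defined over the conceptual-level Staff table:
-- the view renames staffNo to sNo and derives an approximate age from DOB.
CREATE VIEW StaffView (sNo, fName, lName, age) AS
    SELECT staffNo, fName, lName,
           EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM DOB)   -- rough age in years
    FROM Staff;

-- Modifying an existing schema is also a DDL operation
-- (mobilePhone is a made-up column used only for illustration):
ALTER TABLE Staff ADD COLUMN mobilePhone VARCHAR(15);

-- In each case the compiled definitions are recorded as metadata in the system catalog.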
2.2.2 The Data Manipulation Language (DML) DML A language that provides a set of operations to support the basic data manipulation operations on the data held in the database. 2.2 Database Languages Data manipulation operations usually include the following: n n n n insertion of new data into the database; modification of data stored in the database; retrieval of data contained in the database; deletion of data from the database. Therefore, one of the main functions of the DBMS is to support a data manipulation language in which the user can construct statements that will cause such data manipulation to occur. Data manipulation applies to the external, conceptual, and internal levels. However, at the internal level we must define rather complex low-level procedures that allow efficient data access. In contrast, at higher levels, emphasis is placed on ease of use and effort is directed at providing efficient user interaction with the system. The part of a DML that involves data retrieval is called a query language. A query language can be defined as a high-level special-purpose language used to satisfy diverse requests for the retrieval of data held in the database. The term ‘query’ is therefore reserved to denote a retrieval statement expressed in a query language. The terms ‘query language’ and ‘DML’ are commonly used interchangeably, although this is technically incorrect. DMLs are distinguished by their underlying retrieval constructs. We can distinguish between two types of DML: procedural and non-procedural. The prime difference between these two data manipulation languages is that procedural languages specify how the output of a DML statement is to be obtained, while non-procedural DMLs describe only what output is to be obtained. Typically, procedural languages treat records individually, whereas non-procedural languages operate on sets of records. Procedural DMLs Procedural DML A language that allows the user to tell the system what data is needed and exactly how to retrieve the data. With a procedural DML, the user, or more normally the programmer, specifies what data is needed and how to obtain it. This means that the user must express all the data access operations that are to be used by calling appropriate procedures to obtain the information required. Typically, such a procedural DML retrieves a record, processes it and, based on the results obtained by this processing, retrieves another record that would be processed similarly, and so on. This process of retrievals continues until the data requested from the retrieval has been gathered. Typically, procedural DMLs are embedded in a high-level programming language that contains constructs to facilitate iteration and handle navigational logic. Network and hierarchical DMLs are normally procedural (see Section 2.3). Non-procedural DMLs Non-procedural DML A language that allows the user to state what data is needed rather than how it is to be retrieved. | 41 42 | Chapter 2 z Database Environment Non-procedural DMLs allow the required data to be specified in a single retrieval or update statement. With non-procedural DMLs, the user specifies what data is required without specifying how it is to be obtained. The DBMS translates a DML statement into one or more procedures that manipulate the required sets of records. 
This frees the user from having to know how data structures are internally implemented and what algorithms are required to retrieve and possibly transform the data, thus providing users with a considerable degree of data independence. Non-procedural languages are also called declarative languages. Relational DBMSs usually include some form of non-procedural language for data manipulation, typically SQL (Structured Query Language) or QBE (Query-ByExample). Non-procedural DMLs are normally easier to learn and use than procedural DMLs, as less work is done by the user and more by the DBMS. We examine SQL in detail in Chapters 5, 6, and Appendix E, and QBE in Chapter 7. 2.2.3 Fourth-Generation Languages (4GLs) There is no consensus about what constitutes a fourth-generation language; it is in essence a shorthand programming language. An operation that requires hundreds of lines in a third-generation language (3GL), such as COBOL, generally requires significantly fewer lines in a 4GL. Compared with a 3GL, which is procedural, a 4GL is non-procedural: the user defines what is to be done, not how. A 4GL is expected to rely largely on much higher-level components known as fourth-generation tools. The user does not define the steps that a program needs to perform a task, but instead defines parameters for the tools that use them to generate an application program. It is claimed that 4GLs can improve productivity by a factor of ten, at the cost of limiting the types of problem that can be tackled. Fourthgeneration languages encompass: n n n n presentation languages, such as query languages and report generators; speciality languages, such as spreadsheets and database languages; application generators that define, insert, update, and retrieve data from the database to build applications; very high-level languages that are used to generate application code. SQL and QBE, mentioned above, are examples of 4GLs. We now briefly discuss some of the other types of 4GL. Forms generators A forms generator is an interactive facility for rapidly creating data input and display layouts for screen forms. The forms generator allows the user to define what the screen is to look like, what information is to be displayed, and where on the screen it is to be displayed. It may also allow the definition of colors for screen elements and other characteristics, such as bold, underline, blinking, reverse video, and so on. The better forms generators allow the creation of derived attributes, perhaps using arithmetic operators or aggregates, and the specification of validation checks for data input. 2.3 Data Models and Conceptual Modeling Report generators A report generator is a facility for creating reports from data stored in the database. It is similar to a query language in that it allows the user to ask questions of the database and retrieve information from it for a report. However, in the case of a report generator, we have much greater control over what the output looks like. We can let the report generator automatically determine how the output should look or we can create our own customized output reports using special report-generator command instructions. There are two main types of report generator: language-oriented and visually oriented. In the first case, we enter a command in a sublanguage to define what data is to be included in the report and how the report is to be laid out. In the second case, we use a facility similar to a forms generator to define the same information. 
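The non-procedural style described above and the report generators just discussed fit together naturally: a single declarative SQL statement states what summary is wanted, and a report tool decides how to present it. The query below uses only the PropertyForRent columns already introduced; the column aliases are illustrative.

-- A declarative (non-procedural) request: what is wanted, not how to fetch it.
-- A report generator could take this result set and lay it out with headings and totals.
SELECT type, COUNT(*) AS numProperties, AVG(rent) AS avgRent
FROM PropertyForRent
GROUP BY type;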
Graphics generators A graphics generator is a facility to retrieve data from the database and display the data as a graph showing trends and relationships in the data. Typically, it allows the user to create bar charts, pie charts, line charts, scatter charts, and so on. Application generators An application generator is a facility for producing a program that interfaces with the database. The use of an application generator can reduce the time it takes to design an entire software application. Application generators typically consist of pre-written modules that comprise fundamental functions that most programs use. These modules, usually written in a high-level language, constitute a ‘library’ of functions to choose from. The user specifies what the program is supposed to do; the application generator determines how to perform the tasks. Data Models and Conceptual Modeling We mentioned earlier that a schema is written using a data definition language. In fact, it is written in the data definition language of a particular DBMS. Unfortunately, this type of language is too low level to describe the data requirements of an organization in a way that is readily understandable by a variety of users. What we require is a higher-level description of the schema: that is, a data model. Data model An integrated collection of concepts for describing and manipulating data, relationships between data, and constraints on the data in an organization. A model is a representation of ‘real world’ objects and events, and their associations. It is an abstraction that concentrates on the essential, inherent aspects of an organization and ignores the accidental properties. A data model represents the organization itself. It should provide the basic concepts and notations that will allow database designers and end-users 2.3 | 43 44 | Chapter 2 z Database Environment unambiguously and accurately to communicate their understanding of the organizational data. A data model can be thought of as comprising three components: (1) a structural part, consisting of a set of rules according to which databases can be constructed; (2) a manipulative part, defining the types of operation that are allowed on the data (this includes the operations that are used for updating or retrieving data from the database and for changing the structure of the database); (3) possibly a set of integrity constraints, which ensures that the data is accurate. The purpose of a data model is to represent data and to make the data understandable. If it does this, then it can be easily used to design a database. To reflect the ANSI-SPARC architecture introduced in Section 2.1, we can identify three related data models: (1) an external data model, to represent each user’s view of the organization, sometimes called the Universe of Discourse (UoD); (2) a conceptual data model, to represent the logical (or community) view that is DBMSindependent; (3) an internal data model, to represent the conceptual schema in such a way that it can be understood by the DBMS. There have been many data models proposed in the literature. They fall into three broad categories: object-based, record-based, and physical data models. The first two are used to describe data at the conceptual and external levels, the latter is used to describe data at the internal level. 2.3.1 Object-Based Data Models Object-based data models use concepts such as entities, attributes, and relationships. 
An entity is a distinct object (a person, place, thing, concept, event) in the organization that is to be represented in the database. An attribute is a property that describes some aspect of the object that we wish to record, and a relationship is an association between entities. Some of the more common types of object-based data model are: n n n n Entity–Relationship Semantic Functional Object-Oriented. The Entity–Relationship model has emerged as one of the main techniques for database design and forms the basis for the database design methodology used in this book. The object-oriented data model extends the definition of an entity to include not only the attributes that describe the state of the object but also the actions that are associated with the object, that is, its behavior. The object is said to encapsulate both state and behavior. We look at the Entity–Relationship model in depth in Chapters 11 and 12 and 2.3 Data Models and Conceptual Modeling | 45 the object-oriented model in Chapters 25–28. We also examine the functional data model in Section 26.1.2. Record-Based Data Models 2.3.2 In a record-based model, the database consists of a number of fixed-format records possibly of differing types. Each record type defines a fixed number of fields, each typically of a fixed length. There are three principal types of record-based logical data model: the relational data model, the network data model, and the hierarchical data model. The hierarchical and network data models were developed almost a decade before the relational data model, so their links to traditional file processing concepts are more evident. Relational data model The relational data model is based on the concept of mathematical relations. In the relational model, data and relationships are represented as tables, each of which has a number of columns with a unique name. Figure 2.4 is a sample instance of a relational schema for part of the DreamHome case study, showing branch and staff details. For example, it shows that employee John White is a manager with a salary of £30,000, who works at branch (branchNo) B005, which, from the first table, is at 22 Deer Rd in London. It is important to note that there is a relationship between Staff and Branch: a branch office has staff. However, there is no explicit link between these two tables; it is only by knowing that the attribute branchNo in the Staff relation is the same as the branchNo of the Branch relation that we can establish that a relationship exists. Note that the relational data model requires only that the database be perceived by the user as tables. However, this perception applies only to the logical structure of the Figure 2.4 A sample instance of a relational schema. 46 | Chapter 2 z Database Environment Figure 2.5 A sample instance of a network schema. database, that is, the external and conceptual levels of the ANSI-SPARC architecture. It does not apply to the physical structure of the database, which can be implemented using a variety of storage structures. We discuss the relational data model in Chapter 3. Network data model In the network model, data is represented as collections of records, and relationships are represented by sets. Compared with the relational model, relationships are explicitly modeled by the sets, which become pointers in the implementation. The records are organized as generalized graph structures with records appearing as nodes (also called segments) and sets as edges in the graph. 
Figure 2.5 illustrates an instance of a network schema for the same data set presented in Figure 2.4. The most popular network DBMS is Computer Associates’ IDMS/ R. We discuss the network data model in more detail on the Web site for this book (see Preface for the URL). Hierarchical data model The hierarchical model is a restricted type of network model. Again, data is represented as collections of records and relationships are represented by sets. However, the hierarchical model allows a node to have only one parent. A hierarchical model can be represented as a tree graph, with records appearing as nodes (also called segments) and sets as edges. Figure 2.6 illustrates an instance of a hierarchical schema for the same data set presented in Figure 2.4. The main hierarchical DBMS is IBM’s IMS, although IMS also provides non-hierarchical features. We discuss the hierarchical data model in more detail on the Web site for this book (see Preface for the URL). Record-based (logical) data models are used to specify the overall structure of the database and a higher-level description of the implementation. Their main drawback lies in the fact that they do not provide adequate facilities for explicitly specifying constraints on the data, whereas the object-based data models lack the means of logical structure specification but provide more semantic substance by allowing the user to specify constraints on the data. The majority of modern commercial systems are based on the relational paradigm, whereas the early database systems were based on either the network or hierarchical data 2.3 Data Models and Conceptual Modeling | 47 models. The latter two models require the user to have knowledge of the physical database being accessed, whereas the former provides a substantial amount of data independence. Hence, while relational systems adopt a declarative approach to database processing (that is, they specify what data is to be retrieved), network and hierarchical systems adopt a navigational approach (that is, they specify how the data is to be retrieved). Figure 2.6 A sample instance of a hierarchical schema. Physical Data Models 2.3.3 Physical data models describe how data is stored in the computer, representing information such as record structures, record orderings, and access paths. There are not as many physical data models as logical data models, the most common ones being the unifying model and the frame memory. Conceptual Modeling From an examination of the three-level architecture, we see that the conceptual schema is the ‘heart’ of the database. It supports all the external views and is, in turn, supported by the internal schema. However, the internal schema is merely the physical implementation of the conceptual schema. The conceptual schema should be a complete and accurate representation of the data requirements of the enterprise.† If this is not the case, some information about the enterprise will be missing or incorrectly represented and we will have difficulty fully implementing one or more of the external views. † When we are discussing the organization in the context of database design we normally refer to the business or organization as the enterprise. 2.3.4 48 | Chapter 2 z Database Environment Conceptual modeling, or conceptual database design, is the process of constructing a model of the information use in an enterprise that is independent of implementation details, such as the target DBMS, application programs, programming languages, or any other physical considerations. 
This model is called a conceptual data model. Conceptual models are also referred to as logical models in the literature. However, in this book we make a distinction between conceptual and logical data models. The conceptual model is independent of all implementation details, whereas the logical model assumes knowledge of the underlying data model of the target DBMS. In Chapters 15 and 16 we present a methodology for database design that begins by producing a conceptual data model, which is then refined into a logical model based on the relational data model. We discuss database design in more detail in Section 9.6. 2.4 Functions of a DBMS In this section we look at the types of function and service we would expect a DBMS to provide. Codd (1982) lists eight services that should be provided by any full-scale DBMS, and we have added two more that might reasonably be expected to be available. (1) Data storage, retrieval, and update A DBMS must furnish users with the ability to store, retrieve, and update data in the database. This is the fundamental function of a DBMS. From the discussion in Section 2.1, clearly in providing this functionality the DBMS should hide the internal physical implementation details (such as file organization and storage structures) from the user. (2) A user-accessible catalog A DBMS must furnish a catalog in which descriptions of data items are stored and which is accessible to users. A key feature of the ANSI-SPARC architecture is the recognition of an integrated system catalog to hold data about the schemas, users, applications, and so on. The catalog is expected to be accessible to users as well as to the DBMS. A system catalog, or data dictionary, is a repository of information describing the data in the database: it is, the ‘data about the data’ or metadata. The amount of information and the way the information is used vary with the DBMS. Typically, the system catalog stores: n n n n names, types, and sizes of data items; names of relationships; integrity constraints on the data; names of authorized users who have access to the data; 2.4 Functions of a DBMS n n n the data items that each user can access and the types of access allowed; for example, insert, update, delete, or read access; external, conceptual, and internal schemas and the mappings between the schemas, as described in Section 2.1.4; usage statistics, such as the frequencies of transactions and counts on the number of accesses made to objects in the database. The DBMS system catalog is one of the fundamental components of the system. Many of the software components that we describe in the next section rely on the system catalog for information. Some benefits of a system catalog are: n n n n n n n n n Information about data can be collected and stored centrally. This helps to maintain control over the data as a resource. The meaning of data can be defined, which will help other users understand the purpose of the data. Communication is simplified, since exact meanings are stored. The system catalog may also identify the user or users who own or access the data. Redundancy and inconsistencies can be identified more easily since the data is centralized. Changes to the database can be recorded. The impact of a change can be determined before it is implemented, since the system catalog records each data item, all its relationships, and all its users. Security can be enforced. Integrity can be ensured. Audit information can be provided. 
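In most relational DBMSs the catalog itself can be queried with ordinary SQL. The sketch below uses the INFORMATION_SCHEMA views defined by the SQL standard; the exact views, columns, and identifier case vary between products, and the user name shown is assumed for illustration only.

-- Metadata: the columns and data types recorded for a given table
SELECT table_name, column_name, data_type
FROM   information_schema.columns
WHERE  table_name = 'Staff';

-- Security metadata: the table privileges granted to a particular user
SELECT table_name, privilege_type
FROM   information_schema.table_privileges
WHERE  grantee = 'personnel_clerk';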
Some authors make a distinction between system catalog and data directory, where a data directory holds information relating to where data is stored and how it is stored. The International Organization for Standardization (ISO) has adopted a standard for data dictionaries called Information Resource Dictionary System (IRDS) (ISO, 1990, 1993). IRDS is a software tool that can be used to control and document an organization’s information sources. It provides a definition for the tables that comprise the data dictionary and the operations that can be used to access these tables. We use the term ‘system catalog’ in this book to refer to all repository information. We discuss other types of statistical information stored in the system catalog to assist with query optimization in Section 21.4.1. (3) Transaction support A DBMS must furnish a mechanism which will ensure either that all the updates corresponding to a given transaction are made or that none of them is made. A transaction is a series of actions, carried out by a single user or application program, which accesses or changes the contents of the database. For example, some simple transactions for the DreamHome case study might be to add a new member of staff to the database, to update the salary of a member of staff, or to delete a property from the register. | 49 50 | Chapter 2 z Database Environment Figure 2.7 The lost update problem. A more complicated example might be to delete a member of staff from the database and to reassign the properties that he or she managed to another member of staff. In this case, there is more than one change to be made to the database. If the transaction fails during execution, perhaps because of a computer crash, the database will be in an inconsistent state: some changes will have been made and others not. Consequently, the changes that have been made will have to be undone to return the database to a consistent state again. We discuss transaction support in Section 20.1. (4) Concurrency control services A DBMS must furnish a mechanism to ensure that the database is updated correctly when multiple users are updating the database concurrently. One major objective in using a DBMS is to enable many users to access shared data concurrently. Concurrent access is relatively easy if all users are only reading data, as there is no way that they can interfere with one another. However, when two or more users are accessing the database simultaneously and at least one of them is updating data, there may be interference that can result in inconsistencies. For example, consider two transactions T1 and T2, which are executing concurrently as illustrated in Figure 2.7. T1 is withdrawing £10 from an account (with balance balx) and T2 is depositing £100 into the same account. If these transactions were executed serially, one after the other with no interleaving of operations, the final balance would be £190 regardless of which was performed first. However, in this example transactions T1 and T2 start at nearly the same time and both read the balance as £100. T2 then increases balx by £100 to £200 and stores the update in the database. Meanwhile, transaction T1 decrements its copy of balx by £10 to £90 and stores this value in the database, overwriting the previous update and thereby ‘losing’ £100. The DBMS must ensure that, when multiple users are accessing the database, interference cannot occur. We discuss this issue fully in Section 20.2. 
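In SQL terms, the lost update arises when each transaction reads the balance into a program variable and later writes back a value computed from that stale copy. The following sketch assumes an Account table with accountNo and balance columns; it is not part of the DreamHome case study and is used only to mirror the scenario of Figure 2.7.

-- T1 (withdraw 10):
--   SELECT balance INTO :bal FROM Account WHERE accountNo = 'A1';    -- reads 100
--   UPDATE Account SET balance = :bal - 10 WHERE accountNo = 'A1';   -- writes 90
-- T2 (deposit 100), interleaved between T1's read and write:
--   SELECT balance INTO :bal FROM Account WHERE accountNo = 'A1';    -- also reads 100
--   UPDATE Account SET balance = :bal + 100 WHERE accountNo = 'A1';  -- writes 200
-- If T1 then writes 90, T2's deposit is lost; the correct final balance is 190.

-- The DBMS's concurrency control (for example, locking) prevents this interleaving.
-- Writing each change as a single set-oriented update also removes the stale read:
UPDATE Account SET balance = balance - 10  WHERE accountNo = 'A1';
UPDATE Account SET balance = balance + 100 WHERE accountNo = 'A1';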
(5) Recovery services A DBMS must furnish a mechanism for recovering the database in the event that the database is damaged in any way. 2.4 Functions of a DBMS When discussing transaction support, we mentioned that if the transaction fails then the database has to be returned to a consistent state. This may be a result of a system crash, media failure, a hardware or software error causing the DBMS to stop, or it may be the result of the user detecting an error during the transaction and aborting the transaction before it completes. In all these cases, the DBMS must provide a mechanism to recover the database to a consistent state. We discuss database recovery in Section 20.3. (6) Authorization services A DBMS must furnish a mechanism to ensure that only authorized users can access the database. It is not difficult to envisage instances where we would want to prevent some of the data stored in the database from being seen by all users. For example, we may want only branch managers to see salary-related information for staff and prevent all other users from seeing this data. Additionally, we may want to protect the database from unauthorized access. The term security refers to the protection of the database against unauthorized access, either intentional or accidental. We expect the DBMS to provide mechanisms to ensure the data is secure. We discuss security in Chapter 19. (7) Support for data communication A DBMS must be capable of integrating with communication software. Most users access the database from workstations. Sometimes these workstations are connected directly to the computer hosting the DBMS. In other cases, the workstations are at remote locations and communicate with the computer hosting the DBMS over a network. In either case, the DBMS receives requests as communications messages and responds in a similar way. All such transmissions are handled by a Data Communication Manager (DCM). Although the DCM is not part of the DBMS, it is necessary for the DBMS to be capable of being integrated with a variety of DCMs if the system is to be commercially viable. Even DBMSs for personal computers should be capable of being run on a local area network so that one centralized database can be established for users to share, rather than having a series of disparate databases, one for each user. This does not imply that the database has to be distributed across the network; rather that users should be able to access a centralized database from remote locations. We refer to this type of topology as distributed processing (see Section 22.1.1). (8) Integrity services A DBMS must furnish a means to ensure that both the data in the database and changes to the data follow certain rules. | 51 52 | Chapter 2 z Database Environment Database integrity refers to the correctness and consistency of stored data: it can be considered as another type of database protection. While integrity is related to security, it has wider implications: integrity is concerned with the quality of data itself. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate. For example, we may want to specify a constraint that no member of staff can manage more than 100 properties at any one time. Here, we would want the DBMS to check when we assign a property to a member of staff that this limit would not be exceeded and to prevent the assignment from occurring if the limit has been reached. 
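Many integrity rules can be declared to the DBMS rather than coded in every application. The sketch below is illustrative only: the salary range follows the DreamHome domain definitions given later, the PropertyForRent table belongs to the same case study, and CREATE ASSERTION, although defined in the SQL standard, is implemented by few products (a trigger is the usual substitute).

-- Column-level rule declared with the table definition
CREATE TABLE Staff (
    staffNo   VARCHAR(5)   PRIMARY KEY,
    salary    DECIMAL(7,2) CHECK (salary BETWEEN 6000.00 AND 40000.00),
    branchNo  CHAR(4)
);

-- Cross-row rule: no member of staff may manage more than 100 properties
CREATE ASSERTION StaffNotOverloaded
    CHECK (NOT EXISTS (SELECT staffNo
                       FROM   PropertyForRent
                       GROUP  BY staffNo
                       HAVING COUNT(*) > 100));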
In addition to these eight services, we could also reasonably expect the following two services to be provided by a DBMS. (9) Services to promote data independence A DBMS must include facilities to support the independence of programs from the actual structure of the database. We discussed the concept of data independence in Section 2.1.5. Data independence is normally achieved through a view or subschema mechanism. Physical data independence is easier to achieve: there are usually several types of change that can be made to the physical characteristics of the database without affecting the views. However, complete logical data independence is more difficult to achieve. The addition of a new entity, attribute, or relationship can usually be accommodated, but not their removal. In some systems, any type of change to an existing component in the logical structure is prohibited. (10) Utility services A DBMS should provide a set of utility services. Utility programs help the DBA to administer the database effectively. Some utilities work at the external level, and consequently can be produced by the DBA. Other utilities work at the internal level and can be provided only by the DBMS vendor. Examples of utilities of the latter kind are: n n n n n import facilities, to load the database from flat files, and export facilities, to unload the database to flat files; monitoring facilities, to monitor database usage and operation; statistical analysis programs, to examine performance or usage statistics; index reorganization facilities, to reorganize indexes and their overflows; garbage collection and reallocation, to remove deleted records physically from the storage devices, to consolidate the space released, and to reallocate it where it is needed. 2.5 Components of a DBMS Components of a DBMS | 53 2.5 DBMSs are highly complex and sophisticated pieces of software that aim to provide the services discussed in the previous section. It is not possible to generalize the component structure of a DBMS as it varies greatly from system to system. However, it is useful when trying to understand database systems to try to view the components and the relationships between them. In this section, we present a possible architecture for a DBMS. We examine the architecture of the Oracle DBMS in Section 8.2.2. A DBMS is partitioned into several software components (or modules), each of which is assigned a specific operation. As stated previously, some of the functions of the DBMS are supported by the underlying operating system. However, the operating system provides only basic services and the DBMS must be built on top of it. Thus, the design of a DBMS must take into account the interface between the DBMS and the operating system. The major software components in a DBMS environment are depicted in Figure 2.8. This diagram shows how the DBMS interfaces with other software components, such as user queries and access methods (file management techniques for storing and retrieving data records). We will provide an overview of file organizations and access methods in Appendix C. For a more comprehensive treatment, the interested reader is referred to Teorey and Fry (1982), Weiderhold (1983), Smith and Barnes (1987), and Ullman (1988). Figure 2.8 Major components of a DBMS. 54 | Chapter 2 z Database Environment Figure 2.9 Components of a database manager. 
Figure 2.8 shows the following components: n n n Query processor This is a major DBMS component that transforms queries into a series of low-level instructions directed to the database manager. We discuss query processing in Chapter 21. Database manager (DM) The DM interfaces with user-submitted application programs and queries. The DM accepts queries and examines the external and conceptual schemas to determine what conceptual records are required to satisfy the request. The DM then places a call to the file manager to perform the request. The components of the DM are shown in Figure 2.9. File manager The file manager manipulates the underlying storage files and manages the allocation of storage space on disk. It establishes and maintains the list of structures 2.5 Components of a DBMS n n n and indexes defined in the internal schema. If hashed files are used it calls on the hashing functions to generate record addresses. However, the file manager does not directly manage the physical input and output of data. Rather it passes the requests on to the appropriate access methods, which either read data from or write data into the system buffer (or cache). DML preprocessor This module converts DML statements embedded in an application program into standard function calls in the host language. The DML preprocessor must interact with the query processor to generate the appropriate code. DDL compiler The DDL compiler converts DDL statements into a set of tables containing metadata. These tables are then stored in the system catalog while control information is stored in data file headers. Catalog manager The catalog manager manages access to and maintains the system catalog. The system catalog is accessed by most DBMS components. The major software components for the database manager are as follows: n n n n n n n n Authorization control This module checks that the user has the necessary authorization to carry out the required operation. Command processor Once the system has checked that the user has authority to carry out the operation, control is passed to the command processor. Integrity checker For an operation that changes the database, the integrity checker checks that the requested operation satisfies all necessary integrity constraints (such as key constraints). Query optimizer This module determines an optimal strategy for the query execution. We discuss query optimization in Chapter 21. Transaction manager This module performs the required processing of operations it receives from transactions. Scheduler This module is responsible for ensuring that concurrent operations on the database proceed without conflicting with one another. It controls the relative order in which transaction operations are executed. Recovery manager This module ensures that the database remains in a consistent state in the presence of failures. It is responsible for transaction commit and abort. Buffer manager This module is responsible for the transfer of data between main memory and secondary storage, such as disk and tape. The recovery manager and the buffer manager are sometimes referred to collectively as the data manager. The buffer manager is sometimes known as the cache manager. We discuss the last four modules in Chapter 20. In addition to the above modules, several other data structures are required as part of the physical-level implementation. These structures include data and index files, and the system catalog. 
An attempt has been made to standardize DBMSs, and a reference model was proposed by the Database Architecture Framework Task Group (DAFTG, 1986). The purpose of this reference model was to define a conceptual framework aiming to divide standardization attempts into manageable pieces and to show at a very broad level how these pieces could be interrelated. | 55 56 | Chapter 2 z Database Environment 2.6 Multi-User DBMS Architectures In this section we look at the common architectures that are used to implement multi-user database management systems, namely teleprocessing, file-server, and client–server. 2.6.1 Teleprocessing The traditional architecture for multi-user systems was teleprocessing, where there is one computer with a single central processing unit (CPU) and a number of terminals, as illustrated in Figure 2.10. All processing is performed within the boundaries of the same physical computer. User terminals are typically ‘dumb’ ones, incapable of functioning on their own. They are cabled to the central computer. The terminals send messages via the communications control subsystem of the operating system to the user’s application program, which in turn uses the services of the DBMS. In the same way, messages are routed back to the user’s terminal. Unfortunately, this architecture placed a tremendous burden on the central computer, which not only had to run the application programs and the DBMS, but also had to carry out a significant amount of work on behalf of the terminals (such as formatting data for display on the screen). In recent years, there have been significant advances in the development of highperformance personal computers and networks. There is now an identifiable trend in industry towards downsizing, that is, replacing expensive mainframe computers with more cost-effective networks of personal computers that achieve the same, or even better, results. This trend has given rise to the next two architectures: file-server and client–server. 2.6.2 File-Server Architecture In a file-server environment, the processing is distributed about the network, typically a local area network (LAN). The file-server holds the files required by the applications and the DBMS. However, the applications and the DBMS run on each workstation, requesting Figure 2.10 Teleprocessing topology. 2.6 Multi-User DBMS Architectures | Figure 2.11 File-server architecture. files from the file-server when necessary, as illustrated in Figure 2.11. In this way, the file-server acts simply as a shared hard disk drive. The DBMS on each workstation sends requests to the file-server for all data that the DBMS requires that is stored on disk. This approach can generate a significant amount of network traffic, which can lead to performance problems. For example, consider a user request that requires the names of staff who work in the branch at 163 Main St. We can express this request in SQL (see Chapter 5) as: SELECT fName, lName FROM Branch b, Staff s WHERE b.branchNo = s.branchNo AND b.street = ‘163 Main St’; As the file-server has no knowledge of SQL, the DBMS has to request the files corresponding to the Branch and Staff relations from the file-server, rather than just the staff names that satisfy the query. The file-server architecture, therefore, has three main disadvantages: (1) There is a large amount of network traffic. (2) A full copy of the DBMS is required on each workstation. 
(3) Concurrency, recovery, and integrity control are more complex because there can be multiple DBMSs accessing the same files. Traditional Two-Tier Client–Server Architecture To overcome the disadvantages of the first two approaches and accommodate an increasingly decentralized business environment, the client–server architecture was developed. Client–server refers to the way in which software components interact to form a system. 2.6.3 57 58 | Chapter 2 z Database Environment Figure 2.12 Client–server architecture. As the name suggests, there is a client process, which requires some resource, and a server, which provides the resource. There is no requirement that the client and server must reside on the same machine. In practice, it is quite common to place a server at one site in a local area network and the clients at the other sites. Figure 2.12 illustrates the client–server architecture and Figure 2.13 shows some possible combinations of the client–server topology. Data-intensive business applications consist of four major components: the database, the transaction logic, the business and data application logic, and the user interface. The traditional two-tier client–server architecture provides a very basic separation of these components. The client (tier 1) is primarily responsible for the presentation of data to the user, and the server (tier 2) is primarily responsible for supplying data services to the client, as illustrated in Figure 2.14. Presentation services handle user interface actions and the main business and data application logic. Data services provide limited business application logic, typically validation that the client is unable to carry out due to lack of information, and access to the requested data, independent of its location. The data can come from relational DBMSs, object-relational DBMSs, object-oriented DBMSs, legacy DBMSs, or proprietary data access systems. Typically, the client would run on end-user desktops and interact with a centralized database server over a network. A typical interaction between client and server is as follows. The client takes the user’s request, checks the syntax and generates database requests in SQL or another database language appropriate to the application logic. It then transmits the message to the server, waits for a response, and formats the response for the end-user. The server accepts and processes the database requests, then transmits the results back to the client. The processing involves checking authorization, ensuring integrity, maintaining the system catalog, and performing query and update processing. In addition, it also provides concurrency and recovery control. The operations of client and server are summarized in Table 2.1. 2.6 Multi-User DBMS Architectures | 59 Figure 2.13 Alternative client–server topologies: (a) single client, single server; (b) multiple clients, single server; (c) multiple clients, multiple servers. There are many advantages to this type of architecture. For example: n n n n It enables wider access to existing databases. Increased performance – if the clients and server reside on different computers then different CPUs can be processing applications in parallel. It should also be easier to tune the server machine if its only task is to perform database processing. Hardware costs may be reduced – it is only the server that requires storage and processing power sufficient to store and manage the database. 
Communication costs are reduced – applications carry out part of the operations on the client and send only requests for database access across the network, resulting in less data being sent across the network. Increased consistency – the server can handle integrity checks, so that constraints need be defined and validated only in the one place, rather than having each application program perform its own checking. It maps on to open systems architecture quite naturally.

Figure 2.14 The traditional two-tier client–server architecture.

Table 2.1 Summary of client–server functions.
Client:
– Manages the user interface
– Accepts and checks syntax of user input
– Processes application logic
– Generates database requests and transmits to server
– Passes response back to user
Server:
– Accepts and processes database requests from clients
– Checks authorization
– Ensures integrity constraints not violated
– Performs query/update processing and transmits response to client
– Maintains system catalog
– Provides concurrent database access
– Provides recovery control

Some database vendors have used this architecture to indicate distributed database capability, that is, a collection of multiple, logically interrelated databases distributed over a computer network. However, although the client–server architecture can be used to provide distributed DBMSs, by itself it does not constitute a distributed DBMS. We discuss distributed DBMSs in Chapters 22 and 23.

2.6.4 Three-Tier Client–Server Architecture

The need for enterprise scalability challenged this traditional two-tier client–server model. In the mid-1990s, as applications became more complex and potentially could be deployed to hundreds or thousands of end-users, the client side presented two problems that prevented true scalability:
– a ‘fat’ client, requiring considerable resources on the client’s computer to run effectively, including disk space, RAM, and CPU power;
– a significant client-side administration overhead.

By 1995, a new variation of the traditional two-tier client–server model appeared to solve the problem of enterprise scalability. This new architecture proposed three layers, each potentially running on a different platform:
(1) The user interface layer, which runs on the end-user’s computer (the client).
(2) The business logic and data processing layer. This middle tier runs on a server and is often called the application server.
(3) A DBMS, which stores the data required by the middle tier. This tier may run on a separate server called the database server.

Figure 2.15 The three-tier architecture.

As illustrated in Figure 2.15, the client is now responsible only for the application’s user interface and perhaps performing some simple logic processing, such as input validation, thereby providing a ‘thin’ client. The core business logic of the application now resides in its own layer, physically connected to the client and database server over a local area network (LAN) or wide area network (WAN). One application server is designed to serve multiple clients. The three-tier design has many advantages over traditional two-tier or single-tier designs, which include: the need for less expensive hardware because the client is ‘thin’. Application maintenance is centralized with the transfer of the business logic for many end-users into a single application server.
This eliminates the concerns of software distribution that are problematic in the traditional two-tier client–server model. The added modularity makes it easier to modify or replace one tier without affecting the other tiers. Load balancing is easier with the separation of the core business logic from the database functions. An additional advantage is that the three-tier architecture maps quite naturally to the Web environment, with a Web browser acting as the ‘thin’ client, and a Web server acting as the application server. The three-tier architecture can be extended to n-tiers, with additional tiers added to provide more flexibility and scalability. For example, the middle tier of the three-tier architecture could be split into two, with one tier for the Web server and another for the application server. This three-tier architecture has proved more appropriate for some environments, such as the Internet and corporate intranets where a Web browser can be used as a client. It is also an important architecture for Transaction Processing Monitors, as we discuss next. 2.6.5 Transaction Processing Monitors TP Monitor A program that controls data transfer between clients and servers in order to provide a consistent environment, particularly for online transaction processing (OLTP). Complex applications are often built on top of several resource managers (such as DBMSs, operating systems, user interfaces, and messaging software). A Transaction Processing Monitor, or TP Monitor, is a middleware component that provides access to the services of a number of resource managers and provides a uniform interface for programmers who are developing transactional software. A TP Monitor forms the middle tier of a three-tier architecture, as illustrated in Figure 2.16. TP Monitors provide significant advantages, including: n n Transaction routing The TP Monitor can increase scalability by directing transactions to specific DBMSs. Managing distributed transactions The TP Monitor can manage transactions that require access to data held in multiple, possibly heterogeneous, DBMSs. For example, a transaction may require to update data items held in an Oracle DBMS at site 1, an Informix DBMS at site 2, and an IMS DBMS as site 3. TP Monitors normally control transactions using the X/Open Distributed Transaction Processing (DTP) standard. A 2.6 Multi-User DBMS Architectures | 63 Figure 2.16 Transaction Processing Monitor as the middle tier of a three-tier client–server architecture. n n n DBMS that supports this standard can function as a resource manager under the control of a TP Monitor acting as a transaction manager. We discuss distributed transactions and the DTP standard in Chapters 22 and 23. Load balancing The TP Monitor can balance client requests across multiple DBMSs on one or more computers by directing client service calls to the least loaded server. In addition, it can dynamically bring in additional DBMSs as required to provide the necessary performance. Funneling In environments with a large number of users, it may sometimes be difficult for all users to be logged on simultaneously to the DBMS. In many cases, we would find that users generally do not need continuous access to the DBMS. Instead of each user connecting to the DBMS, the TP Monitor can establish connections with the DBMSs as and when required, and can funnel user requests through these connections. 
This allows a larger number of users to access the available DBMSs with a potentially much smaller number of connections, which in turn would mean less resource usage. Increased reliability The TP Monitor acts as a transaction manager, performing the necessary actions to maintain the consistency of the database, with the DBMS acting as a resource manager. If the DBMS fails, the TP Monitor may be able to resubmit the transaction to another DBMS or can hold the transaction until the DBMS becomes available again. TP Monitors are typically used in environments with a very high volume of transactions, where the TP Monitor can be used to offload processes from the DBMS server. Prominent examples of TP Monitors include CICS and Encina from IBM (which are primarily used on IBM AIX or Windows NT and bundled now in the IBM TXSeries) and Tuxedo from BEA Systems. 64 | Chapter 2 z Database Environment Chapter Summary n n n n n n n n n n n The ANSI-SPARC database architecture uses three levels of abstraction: external, conceptual, and internal. The external level consists of the users’ views of the database. The conceptual level is the community view of the database. It specifies the information content of the entire database, independent of storage considerations. The conceptual level represents all entities, their attributes, and their relationships, as well as the constraints on the data, and security and integrity information. The internal level is the computer’s view of the database. It specifies how data is represented, how records are sequenced, what indexes and pointers exist, and so on. The external/conceptual mapping transforms requests and results between the external and conceptual levels. The conceptual/internal mapping transforms requests and results between the conceptual and internal levels. A database schema is a description of the database structure. Data independence makes each level immune to changes to lower levels. Logical data independence refers to the immunity of the external schemas to changes in the conceptual schema. Physical data independence refers to the immunity of the conceptual schema to changes in the internal schema. A data sublanguage consists of two parts: a Data Definition Language (DDL) and a Data Manipulation Language (DML). The DDL is used to specify the database schema and the DML is used to both read and update the database. The part of a DML that involves data retrieval is called a query language. A data model is a collection of concepts that can be used to describe a set of data, the operations to manipulate the data, and a set of integrity constraints for the data. They fall into three broad categories: object-based data models, record-based data models, and physical data models. The first two are used to describe data at the conceptual and external levels; the latter is used to describe data at the internal level. Object-based data models include the Entity–Relationship, semantic, functional, and object-oriented models. Record-based data models include the relational, network, and hierarchical models. Conceptual modeling is the process of constructing a detailed architecture for a database that is independent of implementation details, such as the target DBMS, application programs, programming languages, or any other physical considerations. The design of the conceptual schema is critical to the overall success of the system. It is worth spending the time and energy necessary to produce the best possible conceptual design. 
Functions and services of a multi-user DBMS include data storage, retrieval, and update; a user-accessible catalog; transaction support; concurrency control and recovery services; authorization services; support for data communication; integrity services; services to promote data independence; utility services. The system catalog is one of the fundamental components of a DBMS. It contains ‘data about the data’, or metadata. The catalog should be accessible to users. The Information Resource Dictionary System is an ISO standard that defines a set of access methods for a data dictionary. This allows dictionaries to be shared and transferred from one system to another. Client–server architecture refers to the way in which software components interact. There is a client process that requires some resource, and a server that provides the resource. In the two-tier model, the client handles the user interface and business processing logic and the server handles the database functionality. In the Web environment, the traditional two-tier model has been replaced by a three-tier model, consisting of a user interface layer (the client), a business logic and data processing layer (the application server), and a DBMS (the database server), distributed over different machines. A Transaction Processing (TP) Monitor is a program that controls data transfer between clients and servers in order to provide a consistent environment, particularly for online transaction processing (OLTP). The advantages include transaction routing, distributed transactions, load balancing, funneling, and increased reliability.

Review Questions

2.1 Discuss the concept of data independence and explain its importance in a database environment.
2.2 To address the issue of data independence, the ANSI-SPARC three-level architecture was proposed. Compare and contrast the three levels of this model.
2.3 What is a data model? Discuss the main types of data model.
2.4 Discuss the function and importance of conceptual modeling.
2.5 Describe the types of facility you would expect to be provided in a multi-user DBMS.
2.6 Of the facilities described in your answer to Question 2.5, which ones do you think would not be needed in a standalone PC DBMS? Provide justification for your answer.
2.7 Discuss the function and importance of the system catalog.
2.8 Describe the main components in a DBMS and suggest which components are responsible for each facility identified in Question 2.5.
2.9 What is meant by the term ‘client–server architecture’ and what are the advantages of this approach? Compare the client–server architecture with two other architectures.
2.10 Compare and contrast the two-tier client–server architecture for traditional DBMSs with the three-tier client–server architecture. Why is the latter architecture more appropriate for the Web?
2.11 What is a TP Monitor? What advantages does a TP Monitor bring to an OLTP environment?

Exercises

2.12 Analyze the DBMSs that you are currently using. Determine each system’s compliance with the functions that we would expect to be provided by a DBMS. What type of language does each system provide? What type of architecture does each DBMS use? Check the accessibility and extensibility of the system catalog. Is it possible to export the system catalog to another system?
2.13 Write a program that stores names and telephone numbers in a database. Write another program that stores names and addresses in a database. Modify the programs to use external, conceptual, and internal schemas.
What are the advantages and disadvantages of this modification? 2.14 Write a program that stores names and dates of birth in a database. Extend the program so that it stores the format of the data in the database: in other words, create a system catalog. Provide an interface that makes this system catalog accessible to external users. 2.15 How would you modify your program in Exercise 2.13 to conform to a client–server architecture? What would be the advantages and disadvantages of this modification? Part 2 The Relational Model and Languages Chapter 3 The Relational Model 69 Chapter 4 Relational Algebra and Relational Calculus 88 Chapter 5 SQL: Data Manipulation 112 Chapter 6 SQL: Data Definition 157 Chapter 7 Query-By-Example 198 Chapter 8 Commercial RDBMSs: Office Access and Oracle 225 Chapter 3 The Relational Model Chapter Objectives In this chapter you will learn: n The origins of the relational model. n The terminology of the relational model. n How tables are used to represent data. n The connection between mathematical relations and relations in the relational model. n Properties of database relations. n How to identify candidate, primary, alternate, and foreign keys. n The meaning of entity integrity and referential integrity. n The purpose and advantages of views in relational systems. The Relational Database Management System (RDBMS) has become the dominant data-processing software in use today, with estimated new licence sales of between US$6 billion and US$10 billion per year (US$25 billion with tools sales included). This software represents the second generation of DBMSs and is based on the relational data model proposed by E. F. Codd (1970). In the relational model, all data is logically structured within relations (tables). Each relation has a name and is made up of named attributes (columns) of data. Each tuple (row) contains one value per attribute. A great strength of the relational model is this simple logical structure. Yet, behind this simple structure is a sound theoretical foundation that is lacking in the first generation of DBMSs (the network and hierarchical DBMSs). We devote a significant amount of this book to the RDBMS, in recognition of the importance of these systems. In this chapter, we discuss the terminology and basic structural concepts of the relational data model. In the next chapter, we examine the relational languages that can be used for update and data retrieval. 70 | Chapter 3 z The Relational Model Structure of this Chapter To put our treatment of the RDBMS into perspective, in Section 3.1 we provide a brief history of the relational model. In Section 3.2 we discuss the underlying concepts and terminology of the relational model. In Section 3.3 we discuss the relational integrity rules, including entity integrity and referential integrity. In Section 3.4 we introduce the concept of views, which are important features of relational DBMSs although, strictly speaking, not a concept of the relational model per se. Looking ahead, in Chapters 5 and 6 we examine SQL (Structured Query Language), the formal and de facto standard language for RDBMSs, and in Chapter 7 we examine QBE (Query-By-Example), another highly popular visual query language for RDBMSs. In Chapters 15–18 we present a complete methodology for relational database design. In Appendix D, we examine Codd’s twelve rules, which form a yardstick against which RDBMS products can be identified. 
The examples in this chapter are drawn from the DreamHome case study, which is described in detail in Section 10.4 and Appendix A. 3.1 Brief History of the Relational Model The relational model was first proposed by E. F. Codd in his seminal paper ‘A relational model of data for large shared data banks’ (Codd, 1970). This paper is now generally accepted as a landmark in database systems, although a set-oriented model had been proposed previously (Childs, 1968). The relational model’s objectives were specified as follows: n n n To allow a high degree of data independence. Application programs must not be affected by modifications to the internal data representation, particularly by changes to file organizations, record orderings, or access paths. To provide substantial grounds for dealing with data semantics, consistency, and redundancy problems. In particular, Codd’s paper introduced the concept of normalized relations, that is, relations that have no repeating groups. (The process of normalization is discussed in Chapters 13 and 14.) To enable the expansion of set-oriented data manipulation languages. Although interest in the relational model came from several directions, the most significant research may be attributed to three projects with rather different perspectives. The first of these, at IBM’s San José Research Laboratory in California, was the prototype relational DBMS System R, which was developed during the late 1970s (Astrahan et al., 1976). This project was designed to prove the practicality of the relational model by providing an implementation of its data structures and operations. It also proved to be an excellent source of information about implementation concerns such as transaction management, concurrency control, recovery techniques, query optimization, data security and integrity, human factors, and user interfaces, and led to the publication of many research papers and to the development of other prototypes. In particular, the System R project led to two major developments: 3.2 Terminology n n the development of a structured query language called SQL (pronounced ‘S-Q-L’, or sometimes ‘See-Quel’), which has since become the formal International Organization for Standardization ( ISO) and de facto standard language for relational DBMSs; the production of various commercial relational DBMS products during the late 1970s and the 1980s: for example, DB2 and SQL/DS from IBM and Oracle from Oracle Corporation. The second project to have been significant in the development of the relational model was the INGRES (Interactive Graphics Retrieval System) project at the University of California at Berkeley, which was active at about the same time as the System R project. The INGRES project involved the development of a prototype RDBMS, with the research concentrating on the same overall objectives as the System R project. This research led to an academic version of INGRES, which contributed to the general appreciation of relational concepts, and spawned the commercial products INGRES from Relational Technology Inc. (now Advantage Ingres Enterprise Relational Database from Computer Associates) and the Intelligent Database Machine from Britton Lee Inc. The third project was the Peterlee Relational Test Vehicle at the IBM UK Scientific Centre in Peterlee (Todd, 1976). This project had a more theoretical orientation than the System R and INGRES projects and was significant, principally for research into such issues as query processing and optimization, and functional extension. 
Commercial systems based on the relational model started to appear in the late 1970s and early 1980s. Now there are several hundred RDBMSs for both mainframe and PC environments, even though many do not strictly adhere to the definition of the relational model. Examples of PC-based RDBMSs are Office Access and Visual FoxPro from Microsoft, InterBase and JDataStore from Borland, and R:Base from R:BASE Technologies. Owing to the popularity of the relational model, many non-relational systems now provide a relational user interface, irrespective of the underlying model. Computer Associates’ IDMS, the principal network DBMS, has become Advantage CA-IDMS, supporting a relational view of data. Other mainframe DBMSs that support some relational features are Computer Corporation of America’s Model 204 and Software AG’s ADABAS. Some extensions to the relational model have also been proposed; for example, extensions to: n n n capture more closely the meaning of data (for example, Codd, 1979); support object-oriented concepts (for example, Stonebraker and Rowe, 1986); support deductive capabilities (for example, Gardarin and Valduriez, 1989). We discuss some of these extensions in Chapters 25–28 on Object DBMSs. Terminology The relational model is based on the mathematical concept of a relation, which is physically represented as a table. Codd, a trained mathematician, used terminology taken from mathematics, principally set theory and predicate logic. In this section we explain the terminology and structural concepts of the relational model. 3.2 | 71 72 | Chapter 3 z The Relational Model 3.2.1 Relational Data Structure Relation A relation is a table with columns and rows. An RDBMS requires only that the database be perceived by the user as tables. Note, however, that this perception applies only to the logical structure of the database: that is, the external and conceptual levels of the ANSI-SPARC architecture discussed in Section 2.1. It does not apply to the physical structure of the database, which can be implemented using a variety of storage structures (see Appendix C). Attribute An attribute is a named column of a relation. In the relational model, relations are used to hold information about the objects to be represented in the database. A relation is represented as a two-dimensional table in which the rows of the table correspond to individual records and the table columns correspond to attributes. Attributes can appear in any order and the relation will still be the same relation, and therefore convey the same meaning. For example, the information on branch offices is represented by the Branch relation, with columns for attributes branchNo (the branch number), street, city, and postcode. Similarly, the information on staff is represented by the Staff relation, with columns for attributes staffNo (the staff number), fName, lName, position, sex, DOB (date of birth), salary, and branchNo (the number of the branch the staff member works at). Figure 3.1 shows instances of the Branch and Staff relations. As you can see from this example, a column contains values of a single attribute; for example, the branchNo columns contain only numbers of existing branch offices. Domain A domain is the set of allowable values for one or more attributes. Domains are an extremely powerful feature of the relational model. Every attribute in a relation is defined on a domain. Domains may be distinct for each attribute, or two or more attributes may be defined on the same domain. 
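The SQL standard reflects this concept directly in the CREATE DOMAIN statement, although support for it varies between products (PostgreSQL implements it, for example; many other systems do not). A minimal sketch, with definitions modelled loosely on those of Figure 3.2:

CREATE DOMAIN BranchNumbers AS CHAR(4)
    CHECK (VALUE BETWEEN 'B001' AND 'B999');

CREATE DOMAIN Sex AS CHAR(1)
    CHECK (VALUE IN ('M', 'F'));

-- Attributes are then declared on the domains rather than on raw types
CREATE TABLE Staff (
    staffNo   CHAR(5)  PRIMARY KEY,
    sex       Sex,
    branchNo  BranchNumbers
);

Where CREATE DOMAIN is not available, the same effect is usually approximated with column-level CHECK constraints or user-defined types.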
Figure 3.2 shows the domains for some of the attributes of the Branch and Staff relations. Note that, at any given time, typically there will be values in a domain that do not currently appear as values in the corresponding attribute. The domain concept is important because it allows the user to define in a central place the meaning and source of values that attributes can hold. As a result, more information is available to the system when it undertakes the execution of a relational operation, and operations that are semantically incorrect can be avoided. For example, it is not sensible to compare a street name with a telephone number, even though the domain definitions for both these attributes are character strings. On the other hand, the monthly rental on a property and the number of months a property has been leased have different domains (the first a monetary value, the second an integer value), but it is still a legal operation to multiply two values from these domains. As these two examples illustrate, a complete implementation of domains is not straightforward and, as a result, many RDBMSs do not support them fully.

Figure 3.1 Instances of the Branch and Staff relations. (In the original figure the attributes, the primary keys, the foreign key branchNo in Staff, and the degree and cardinality of each relation are labelled.)

Branch
branchNo  street        city      postcode
B005      22 Deer Rd    London    SW1 4EH
B007      16 Argyll St  Aberdeen  AB2 3SU
B003      163 Main St   Glasgow   G11 9QX
B004      32 Manse Rd   Bristol   BS99 1NZ
B002      56 Clover Dr  London    NW10 6EU

Staff
staffNo  fName  lName  position    sex  DOB        salary  branchNo
SL21     John   White  Manager     M    1-Oct-45   30000   B005
SG37     Ann    Beech  Assistant   F    10-Nov-60  12000   B003
SG14     David  Ford   Supervisor  M    24-Mar-58  18000   B003
SA9      Mary   Howe   Assistant   F    19-Feb-70  9000    B007
SG5      Susan  Brand  Manager     F    3-Jun-40   24000   B003
SL41     Julie  Lee    Assistant   F    13-Jun-65  9000    B005

Figure 3.2 Domains for some attributes of the Branch and Staff relations.

Attribute  Domain Name    Meaning                                  Domain Definition
branchNo   BranchNumbers  The set of all possible branch numbers   character: size 4, range B001–B999
street     StreetNames    The set of all street names in Britain   character: size 25
city       CityNames      The set of all city names in Britain     character: size 15
postcode   Postcodes      The set of all postcodes in Britain      character: size 8
sex        Sex            The sex of a person                      character: size 1, value M or F
DOB        DatesOfBirth   Possible values of staff birth dates     date, range from 1-Jan-20, format dd-mmm-yy
salary     Salaries       Possible values of staff salaries        monetary: 7 digits, range 6000.00–40000.00

Tuple  A tuple is a row of a relation.

The elements of a relation are the rows or tuples in the table. In the Branch relation, each row contains four values, one for each attribute. Tuples can appear in any order and the relation will still be the same relation, and therefore convey the same meaning.

The structure of a relation, together with a specification of the domains and any other restrictions on possible values, is sometimes called its intension, which is usually fixed unless the meaning of a relation is changed to include additional attributes. The tuples are called the extension (or state) of a relation, which changes over time.

Degree  The degree of a relation is the number of attributes it contains.

The Branch relation in Figure 3.1 has four attributes or degree four. This means that each row of the table is a four-tuple, containing four values. A relation with only one attribute would have degree one and be called a unary relation or one-tuple.
A relation with two attributes is called binary, one with three attributes is called ternary, and after that the term n-ary is usually used. The degree of a relation is a property of the intension of the relation.

Cardinality  The cardinality of a relation is the number of tuples it contains.

By contrast, the number of tuples is called the cardinality of the relation and this changes as tuples are added or deleted. The cardinality is a property of the extension of the relation and is determined from the particular instance of the relation at any given moment. Finally, we have the definition of a relational database.

Relational database  A collection of normalized relations with distinct relation names.

A relational database consists of relations that are appropriately structured. We refer to this appropriateness as normalization. We defer the discussion of normalization until Chapters 13 and 14.

Alternative terminology
The terminology for the relational model can be quite confusing. We have introduced two sets of terms. In fact, a third set of terms is sometimes used: a relation may be referred to as a file, the tuples as records, and the attributes as fields. This terminology stems from the fact that, physically, the RDBMS may store each relation in a file. Table 3.1 summarizes the different terms for the relational model.

Table 3.1 Alternative terminology for relational model terms.

Formal terms  Alternative 1  Alternative 2
Relation      Table          File
Tuple         Row            Record
Attribute     Column         Field

3.2.2 Mathematical Relations

To understand the true meaning of the term relation, we have to review some concepts from mathematics. Suppose that we have two sets, D1 and D2, where D1 = {2, 4} and D2 = {1, 3, 5}. The Cartesian product of these two sets, written D1 × D2, is the set of all ordered pairs such that the first element is a member of D1 and the second element is a member of D2. An alternative way of expressing this is to find all combinations of elements with the first from D1 and the second from D2. In our case, we have:

D1 × D2 = {(2, 1), (2, 3), (2, 5), (4, 1), (4, 3), (4, 5)}

Any subset of this Cartesian product is a relation. For example, we could produce a relation R such that:

R = {(2, 1), (4, 1)}

We may specify which ordered pairs will be in the relation by giving some condition for their selection. For example, if we observe that R includes all those ordered pairs in which the second element is 1, then we could write R as:

R = {(x, y) | x ∈ D1, y ∈ D2, and y = 1}

Using these same sets, we could form another relation S in which the first element is always twice the second. Thus, we could write S as:

S = {(x, y) | x ∈ D1, y ∈ D2, and x = 2y}

or, in this instance:

S = {(2, 1)}

since there is only one ordered pair in the Cartesian product that satisfies this condition. We can easily extend the notion of a relation to three sets. Let D1, D2, and D3 be three sets. The Cartesian product D1 × D2 × D3 of these three sets is the set of all ordered triples such that the first element is from D1, the second element is from D2, and the third element is from D3. Any subset of this Cartesian product is a relation. For example, suppose we have:

D1 = {1, 3}   D2 = {2, 4}   D3 = {5, 6}

D1 × D2 × D3 = {(1, 2, 5), (1, 2, 6), (1, 4, 5), (1, 4, 6), (3, 2, 5), (3, 2, 6), (3, 4, 5), (3, 4, 6)}

Any subset of these ordered triples is a relation. We can extend the three sets and define a general relation on n domains. Let D1, D2, . . . , Dn be n sets. Their Cartesian product is defined as: D1 × D2 × . . .
× Dn = {(d1, d2, . . . , dn) | d1 ∈ D1, d2 ∈ D2, . . . , dn ∈ Dn} and is usually written as: n X Di i=1 Any set of n-tuples from this Cartesian product is a relation on the n sets. Note that in defining these relations we have to specify the sets, or domains, from which we choose values. | 75 76 | Chapter 3 z The Relational Model 3.2.3 Database Relations Applying the above concepts to databases, we can define a relation schema. Relation schema A named relation defined by a set of attribute and domain name pairs. Let A1, A2, . . . , An be attributes with domains D1, D2, . . . , Dn. Then the set {A1:D1, A2:D2, . . . , An:Dn} is a relation schema. A relation R defined by a relation schema S is a set of mappings from the attribute names to their corresponding domains. Thus, relation R is a set of n-tuples: (A1:d1, A2:d2, . . . , An:dn) such that d1 ∈ D1, d2 ∈ D2, . . . , dn ∈ Dn Each element in the n-tuple consists of an attribute and a value for that attribute. Normally, when we write out a relation as a table, we list the attribute names as column headings and write out the tuples as rows having the form (d1, d2, . . . , dn), where each value is taken from the appropriate domain. In this way, we can think of a relation in the relational model as any subset of the Cartesian product of the domains of the attributes. A table is simply a physical representation of such a relation. In our example, the Branch relation shown in Figure 3.1 has attributes branchNo, street, city, and postcode, each with its corresponding domain. The Branch relation is any subset of the Cartesian product of the domains, or any set of four-tuples in which the first element is from the domain BranchNumbers, the second is from the domain StreetNames, and so on. One of the four-tuples is: {(B005, 22 Deer Rd, London, SW1 4EH)} or more correctly: {(branchNo: B005, street: 22 Deer Rd, city: London, postcode: SW1 4EH)} We refer to this as a relation instance. The Branch table is a convenient way of writing out all the four-tuples that form the relation at a specific moment in time, which explains why table rows in the relational model are called tuples. In the same way that a relation has a schema, so too does the relational database. Relational database schema A set of relation schemas, each with a distinct name. If R1 , R2, . . . , Rn are a set of relation schemas, then we can write the relational database schema, or simply relational schema, R, as: R = {R1 , R2, . . . , Rn} 3.2 Terminology Properties of Relations A relation has the following properties: n n n n n n n the relation has a name that is distinct from all other relation names in the relational schema; each cell of the relation contains exactly one atomic (single) value; each attribute has a distinct name; the values of an attribute are all from the same domain; each tuple is distinct; there are no duplicate tuples; the order of attributes has no significance; the order of tuples has no significance, theoretically. (However, in practice, the order may affect the efficiency of accessing tuples.) To illustrate what these restrictions mean, consider again the Branch relation shown in Figure 3.1. Since each cell should contain only one value, it is illegal to store two postcodes for a single branch office in a single cell. In other words, relations do not contain repeating groups. A relation that satisfies this property is said to be normalized or in first normal form. (Normal forms are discussed in Chapters 13 and 14.) 
The column names listed at the tops of columns correspond to the attributes of the relation. The values in the branchNo attribute are all from the BranchNumbers domain; we should not allow a postcode value to appear in this column. There can be no duplicate tuples in a relation. For example, the row (B005, 22 Deer Rd, London, SW1 4EH) appears only once. Provided an attribute name is moved along with the attribute values, we can interchange columns. The table would represent the same relation if we were to put the city attribute before the postcode attribute, although for readability it makes more sense to keep the address elements in the normal order. Similarly, tuples can be interchanged, so the records of branches B005 and B004 can be switched and the relation will still be the same. Most of the properties specified for relations result from the properties of mathematical relations: n n n n When we derived the Cartesian product of sets with simple, single-valued elements such as integers, each element in each tuple was single-valued. Similarly, each cell of a relation contains exactly one value. However, a mathematical relation need not be normalized. Codd chose to disallow repeating groups to simplify the relational data model. In a relation, the possible values for a given position are determined by the set, or domain, on which the position is defined. In a table, the values in each column must come from the same attribute domain. In a set, no elements are repeated. Similarly, in a relation, there are no duplicate tuples. Since a relation is a set, the order of elements has no significance. Therefore, in a relation the order of tuples is immaterial. However, in a mathematical relation, the order of elements in a tuple is important. For example, the ordered pair (1, 2) is quite different from the ordered pair (2, 1). This is not 3.2.4 | 77 78 | Chapter 3 z The Relational Model the case for relations in the relational model, which specifically requires that the order of attributes be immaterial. The reason is that the column headings define which attribute the value belongs to. This means that the order of column headings in the intension is immaterial, but once the structure of the relation is chosen, the order of elements within the tuples of the extension must match the order of attribute names. 3.2.5 Relational Keys As stated above, there are no duplicate tuples within a relation. Therefore, we need to be able to identify one or more attributes (called relational keys) that uniquely identifies each tuple in a relation. In this section, we explain the terminology used for relational keys. Superkey An attribute, or set of attributes, that uniquely identifies a tuple within a relation. A superkey uniquely identifies each tuple within a relation. However, a superkey may contain additional attributes that are not necessary for unique identification, and we are interested in identifying superkeys that contain only the minimum number of attributes necessary for unique identification. Candidate key A superkey such that no proper subset is a superkey within the relation. A candidate key, K, for a relation R has two properties: n n uniqueness – in each tuple of R, the values of K uniquely identify that tuple; irreducibility – no proper subset of K has the uniqueness property. There may be several candidate keys for a relation. When a key consists of more than one attribute, we call it a composite key. Consider the Branch relation shown in Figure 3.1. 
Given a value of city, we can determine several branch offices (for example, London has two branch offices). This attribute cannot be a candidate key. On the other hand, since DreamHome allocates each branch office a unique branch number, then given a branch number value, branchNo, we can determine at most one tuple, so that branchNo is a candidate key. Similarly, postcode is also a candidate key for this relation. Now consider a relation Viewing, which contains information relating to properties viewed by clients. The relation comprises a client number (clientNo), a property number (propertyNo), a date of viewing (viewDate) and, optionally, a comment (comment). Given a client number, clientNo, there may be several corresponding viewings for different properties. Similarly, given a property number, propertyNo, there may be several clients who viewed this property. Therefore, clientNo by itself or propertyNo by itself cannot be selected as a candidate key. However, the combination of clientNo and propertyNo identifies at most one tuple, so, for the Viewing relation, clientNo and propertyNo together form the (composite) candidate key. If we need to cater for the possibility that a client may view a property more 3.2 Terminology than once, then we could add viewDate to the composite key. However, we assume that this is not necessary. Note that an instance of a relation cannot be used to prove that an attribute or combination of attributes is a candidate key. The fact that there are no duplicates for the values that appear at a particular moment in time does not guarantee that duplicates are not possible. However, the presence of duplicates in an instance can be used to show that some attribute combination is not a candidate key. Identifying a candidate key requires that we know the ‘real world’ meaning of the attribute(s) involved so that we can decide whether duplicates are possible. Only by using this semantic information can we be certain that an attribute combination is a candidate key. For example, from the data presented in Figure 3.1, we may think that a suitable candidate key for the Staff relation would be lName, the employee’s surname. However, although there is only a single value of ‘White’ in this instance of the Staff relation, a new member of staff with the surname ‘White’ may join the company, invalidating the choice of lName as a candidate key. Primary key The candidate key that is selected to identify tuples uniquely within the relation. Since a relation has no duplicate tuples, it is always possible to identify each row uniquely. This means that a relation always has a primary key. In the worst case, the entire set of attributes could serve as the primary key, but usually some smaller subset is sufficient to distinguish the tuples. The candidate keys that are not selected to be the primary key are called alternate keys. For the Branch relation, if we choose branchNo as the primary key, postcode would then be an alternate key. For the Viewing relation, there is only one candidate key, comprising clientNo and propertyNo, so these attributes would automatically form the primary key. Foreign key An attribute, or set of attributes, within one relation that matches the candidate key of some (possibly the same) relation. When an attribute appears in more than one relation, its appearance usually represents a relationship between tuples of the two relations. 
For example, the inclusion of branchNo in both the Branch and Staff relations is quite deliberate and links each branch to the details of staff working at that branch. In the Branch relation, branchNo is the primary key. However, in the Staff relation the branchNo attribute exists to match staff to the branch office they work in. In the Staff relation, branchNo is a foreign key. We say that the attribute branchNo in the Staff relation targets the primary key attribute branchNo in the home relation, Branch. These common attributes play an important role in performing data manipulation, as we see in the next chapter.

3.2.6 Representing Relational Database Schemas

A relational database consists of any number of normalized relations. The relational schema for part of the DreamHome case study is:

Branch          (branchNo, street, city, postcode)
Staff           (staffNo, fName, lName, position, sex, DOB, salary, branchNo)
PropertyForRent (propertyNo, street, city, postcode, type, rooms, rent, ownerNo, staffNo, branchNo)
Client          (clientNo, fName, lName, telNo, prefType, maxRent)
PrivateOwner    (ownerNo, fName, lName, address, telNo)
Viewing         (clientNo, propertyNo, viewDate, comment)
Registration    (clientNo, branchNo, staffNo, dateJoined)

The common convention for representing a relation schema is to give the name of the relation followed by the attribute names in parentheses. Normally, the primary key is underlined. The conceptual model, or conceptual schema, is the set of all such schemas for the database. Figure 3.3 shows an instance of this relational schema.

Figure 3.3 Instance of the DreamHome rental database.

3.3 Integrity Constraints

In the previous section we discussed the structural part of the relational data model. As stated in Section 2.3, a data model has two other parts: a manipulative part, defining the types of operation that are allowed on the data, and a set of integrity constraints, which ensure that the data is accurate. In this section we discuss the relational integrity constraints and in the next chapter we discuss the relational manipulation operations.

We have already seen an example of an integrity constraint in Section 3.2.1: since every attribute has an associated domain, there are constraints (called domain constraints) that form restrictions on the set of values allowed for the attributes of relations. In addition, there are two important integrity rules, which are constraints or restrictions that apply to all instances of the database. The two principal rules for the relational model are known as entity integrity and referential integrity. Other types of integrity constraint are multiplicity, which we discuss in Section 11.6, and general constraints, which we introduce in Section 3.3.4. Before we define entity and referential integrity, it is necessary to understand the concept of nulls.

3.3.1 Nulls

Null  Represents a value for an attribute that is currently unknown or is not applicable for this tuple.

A null can be taken to mean the logical value ‘unknown’. It can mean that a value is not applicable to a particular tuple, or it could merely mean that no value has yet been supplied. Nulls are a way to deal with incomplete or exceptional data. However, a null is not the same as a zero numeric value or a text string filled with spaces; zeros and spaces are values, but a null represents the absence of a value. Therefore, nulls should be treated differently from other values.
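To make the distinction concrete, the following sketch shows how a null behaves in SQL (covered in Chapters 5 and 6); the queries are illustrative only and use the Viewing relation, whose comment attribute may be null until a client returns a comment. A comparison with a null yields neither true nor false, so the test must be written with IS NULL rather than = NULL.

SELECT clientNo, propertyNo
FROM   Viewing
WHERE  comment IS NULL;      -- finds viewings for which no comment has yet been supplied

SELECT clientNo, propertyNo
FROM   Viewing
WHERE  comment = NULL;       -- returns no rows: the predicate evaluates to 'unknown', not true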
Some authors use the term ‘null value’, however as a null is not a value but represents the absence of a value, the term ‘null value’ is deprecated. | 81 82 | Chapter 3 z The Relational Model For example, in the Viewing relation shown in Figure 3.3, the comment attribute may be undefined until the potential renter has visited the property and returned his or her comment to the agency. Without nulls, it becomes necessary to introduce false data to represent this state or to add additional attributes that may not be meaningful to the user. In our example, we may try to represent a null comment with the value ‘−1’. Alternatively, we may add a new attribute hasCommentBeenSupplied to the Viewing relation, which contains a Y (Yes) if a comment has been supplied, and N (No) otherwise. Both these approaches can be confusing to the user. Nulls can cause implementation problems, arising from the fact that the relational model is based on first-order predicate calculus, which is a two-valued or Boolean logic – the only values allowed are true or false. Allowing nulls means that we have to work with a higher-valued logic, such as three- or four-valued logic (Codd, 1986, 1987, 1990). The incorporation of nulls in the relational model is a contentious issue. Codd later regarded nulls as an integral part of the model (Codd, 1990). Others consider this approach to be misguided, believing that the missing information problem is not fully understood, that no fully satisfactory solution has been found and, consequently, that the incorporation of nulls in the relational model is premature (see, for example, Date, 1995). We are now in a position to define the two relational integrity rules. 3.3.2 Entity Integrity The first integrity rule applies to the primary keys of base relations. For the present, we define a base relation as a relation that corresponds to an entity in the conceptual schema (see Section 2.1). We provide a more precise definition in Section 3.4. Entity integrity In a base relation, no attribute of a primary key can be null. By definition, a primary key is a minimal identifier that is used to identify tuples uniquely. This means that no subset of the primary key is sufficient to provide unique identification of tuples. If we allow a null for any part of a primary key, we are implying that not all the attributes are needed to distinguish between tuples, which contradicts the definition of the primary key. For example, as branchNo is the primary key of the Branch relation, we should not be able to insert a tuple into the Branch relation with a null for the branchNo attribute. As a second example, consider the composite primary key of the Viewing relation, comprising the client number (clientNo) and the property number (propertyNo). We should not be able to insert a tuple into the Viewing relation with either a null for the clientNo attribute, or a null for the propertyNo attribute, or nulls for both attributes. If we were to examine this rule in detail, we would find some anomalies. First, why does the rule apply only to primary keys and not more generally to candidate keys, which also identify tuples uniquely? Secondly, why is the rule restricted to base relations? For example, using the data of the Viewing relation shown in Figure 3.3, consider the query, ‘List all comments from viewings’. This will produce a unary relation consisting of the attribute comment. 
By definition, this attribute must be a primary key, but it contains nulls 3.4 Views (corresponding to the viewings on PG36 and PG4 by client CR56). Since this relation is not a base relation, the model allows the primary key to be null. There have been several attempts to redefine this rule (see, for example, Codd, 1988; Date, 1990). Referential Integrity 3.3.3 The second integrity rule applies to foreign keys. Referential integrity If a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in its home relation or the foreign key value must be wholly null. For example, branchNo in the Staff relation is a foreign key targeting the branchNo attribute in the home relation, Branch. It should not be possible to create a staff record with branch number B025, for example, unless there is already a record for branch number B025 in the Branch relation. However, we should be able to create a new staff record with a null branch number, to cater for the situation where a new member of staff has joined the company but has not yet been assigned to a particular branch office. General Constraints General constraints 3.3.4 Additional rules specified by the users or database administrators of a database that define or constrain some aspect of the enterprise. It is also possible for users to specify additional constraints that the data must satisfy. For example, if an upper limit of 20 has been placed upon the number of staff that may work at a branch office, then the user must be able to specify this general constraint and expect the DBMS to enforce it. In this case, it should not be possible to add a new member of staff at a given branch to the Staff relation if the number of staff currently assigned to that branch is 20. Unfortunately, the level of support for general constraints varies from system to system. We discuss the implementation of relational integrity in Chapters 6 and 17. Views In the three-level ANSI-SPARC architecture presented in Chapter 2, we described an external view as the structure of the database as it appears to a particular user. In the relational model, the word ‘view’ has a slightly different meaning. Rather than being the entire external model of a user’s view, a view is a virtual or derived relation: a relation that does not necessarily exist in its own right, but may be dynamically derived from one or more base relations. Thus, an external model can consist of both base (conceptual-level) relations and views derived from the base relations. In this section, we briefly discuss 3.4 | 83 84 | Chapter 3 z The Relational Model views in relational systems. In Section 6.4 we examine views in more detail and show how they can be created and used within SQL. 3.4.1 Terminology The relations we have been dealing with so far in this chapter are known as base relations. Base relation A named relation corresponding to an entity in the conceptual schema, whose tuples are physically stored in the database. We can define views in terms of base relations: View The dynamic result of one or more relational operations operating on the base relations to produce another relation. A view is a virtual relation that does not necessarily exist in the database but can be produced upon request by a particular user, at the time of request. 
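As a brief preview of Section 6.4, where view definition in SQL is treated in detail, the following sketch shows how such a virtual relation might be declared; the view name StaffPublic and the choice of attributes are illustrative assumptions rather than definitions used elsewhere in the book.

CREATE VIEW StaffPublic AS
    SELECT staffNo, fName, lName, position, branchNo
    FROM   Staff;             -- the salary attribute is hidden from users of this view

SELECT * FROM StaffPublic;    -- evaluated against the base Staff relation at the time of the request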
A view is a relation that appears to the user to exist, can be manipulated as if it were a base relation, but does not necessarily exist in storage in the sense that the base relations do (although its definition is stored in the system catalog). The contents of a view are defined as a query on one or more base relations. Any operations on the view are automatically translated into operations on the relations from which it is derived. Views are dynamic, meaning that changes made to the base relations that affect the view are immediately reflected in the view. When users make permitted changes to the view, these changes are made to the underlying relations. In this section, we describe the purpose of views and briefly examine restrictions that apply to updates made through views. However, we defer treatment of how views are defined and processed until Section 6.4. 3.4.2 Purpose of Views The view mechanism is desirable for several reasons: n n n It provides a powerful and flexible security mechanism by hiding parts of the database from certain users. Users are not aware of the existence of any attributes or tuples that are missing from the view. It permits users to access data in a way that is customized to their needs, so that the same data can be seen by different users in different ways, at the same time. It can simplify complex operations on the base relations. For example, if a view is defined as a combination (join) of two relations (see Section 4.1), users may now perform more simple operations on the view, which will be translated by the DBMS into equivalent operations on the join. 3.4 Views A view should be designed to support the external model that the user finds familiar. For example: n A user might need Branch tuples that contain the names of managers as well as the other attributes already in Branch. This view is created by combining the Branch relation with a restricted form of the Staff relation where the staff position is ‘Manager’. n Some members of staff should see Staff tuples without the salary attribute. n Attributes may be renamed or the order of attributes changed. For example, the user accustomed to calling the branchNo attribute of branches by the full name Branch Number may see that column heading. n Some members of staff should see only property records for those properties that they manage. Although all these examples demonstrate that a view provides logical data independence (see Section 2.1.5), views allow a more significant type of logical data independence that supports the reorganization of the conceptual schema. For example, if a new attribute is added to a relation, existing users can be unaware of its existence if their views are defined to exclude it. If an existing relation is rearranged or split up, a view may be defined so that users can continue to see their original views. We will see an example of this in Section 6.4.7 when we discuss the advantages and disadvantages of views in more detail. Updating Views All updates to a base relation should be immediately reflected in all views that reference that base relation. Similarly, if a view is updated, then the underlying base relation should reflect the change. However, there are restrictions on the types of modification that can be made through views. 
We summarize below the conditions under which most systems determine whether an update is allowed through a view: n Updates are allowed through a view defined using a simple query involving a single base relation and containing either the primary key or a candidate key of the base relation. n Updates are not allowed through views involving multiple base relations. n Updates are not allowed through views involving aggregation or grouping operations. Classes of views have been defined that are theoretically not updatable, theoretically updatable, and partially updatable. A survey on updating relational views can be found in Furtado and Casanova (1985). 3.4.3 | 85 86 | Chapter 3 z The Relational Model Chapter Summary n n n n n n n n n n The Relational Database Management System (RDBMS) has become the dominant data-processing software in use today, with estimated new licence sales of between US$6 billion and US$10 billion per year (US$25 billion with tools sales included). This software represents the second generation of DBMSs and is based on the relational data model proposed by E. F. Codd. A mathematical relation is a subset of the Cartesian product of two or more sets. In database terms, a relation is any subset of the Cartesian product of the domains of the attributes. A relation is normally written as a set of n-tuples, in which each element is chosen from the appropriate domain. Relations are physically represented as tables, with the rows corresponding to individual tuples and the columns to attributes. The structure of the relation, with domain specifications and other constraints, is part of the intension of the database, while the relation with all its tuples written out represents an instance or extension of the database. Properties of database relations are: each cell contains exactly one atomic value, attribute names are distinct, attribute values come from the same domain, attribute order is immaterial, tuple order is immaterial, and there are no duplicate tuples. The degree of a relation is the number of attributes, while the cardinality is the number of tuples. A unary relation has one attribute, a binary relation has two, a ternary relation has three, and an n-ary relation has n attributes. A superkey is an attribute, or set of attributes, that identifies tuples of a relation uniquely, while a candidate key is a minimal superkey. A primary key is the candidate key chosen for use in identification of tuples. A relation must always have a primary key. A foreign key is an attribute, or set of attributes, within one relation that is the candidate key of another relation. A null represents a value for an attribute that is unknown at the present time or is not applicable for this tuple. Entity integrity is a constraint that states that in a base relation no attribute of a primary key can be null. Referential integrity states that foreign key values must match a candidate key value of some tuple in the home relation or be wholly null. Apart from relational integrity, integrity constraints include, required data, domain, and multiplicity constraints; other integrity constraints are called general constraints. A view in the relational model is a virtual or derived relation that is dynamically created from the underlying base relation(s) when required. Views provide security and allow the designer to customize a user’s model. Not all views are updatable. 
Exercises | 87 Review Questions 3.1 Discuss each of the following concepts in the context of the relational data model: (a) relation (b) attribute (c) domain (d) tuple (e) intension and extension (f) degree and cardinality. 3.2 Describe the relationship between mathematical relations and relations in the relational data model. 3.3 Describe the differences between a relation and a relation schema. What is a relational database schema? 3.4 Discuss the properties of a relation. 3.5 Discuss the differences between the candidate keys and the primary key of a relation. Explain what is meant by a foreign key. How do foreign keys of relations relate to candidate keys? Give examples to illustrate your answer. 3.6 Define the two principal integrity rules for the relational model. Discuss why it is desirable to enforce these rules. 3.7 What is a view? Discuss the difference between a view and a base relation. Exercises The following tables form part of a database held in a relational DBMS: Hotel Room Booking Guest (hotelNo, hotelName, city) (roomNo, hotelNo, type, price) (hotelNo, guestNo, dateFrom, dateTo, roomNo) (guestNo, guestName, guestAddress) where Hotel contains hotel details and hotelNo is the primary key; Room contains room details for each hotel and (roomNo, hotelNo) forms the primary key; Booking contains details of bookings and (hotelNo, guestNo, dateFrom) forms the primary key; Guest contains guest details and guestNo is the primary key. 3.8 Identify the foreign keys in this schema. Explain how the entity and referential integrity rules apply to these relations. 3.9 Produce some sample tables for these relations that observe the relational integrity rules. Suggest some general constraints that would be appropriate for this schema. 3.10 Analyze the RDBMSs that you are currently using. Determine the support the system provides for primary keys, alternate keys, foreign keys, relational integrity, and views. 3.11 Implement the above schema in one of the RDBMSs you currently use. Implement, where possible, the primary, alternate and foreign keys, and appropriate relational integrity constraints. Chapter 4 Relational Algebra and Relational Calculus Chapter Objectives In this chapter you will learn: n The meaning of the term ‘relational completeness’. n How to form queries in the relational algebra. n How to form queries in the tuple relational calculus. n How to form queries in the domain relational calculus. n The categories of relational Data Manipulation Languages (DMLs). In the previous chapter we introduced the main structural components of the relational model. As we discussed in Section 2.3, another important part of a data model is a manipulation mechanism, or query language, to allow the underlying data to be retrieved and updated. In this chapter we examine the query languages associated with the relational model. In particular, we concentrate on the relational algebra and the relational calculus as defined by Codd (1971) as the basis for relational languages. Informally, we may describe the relational algebra as a (high-level) procedural language: it can be used to tell the DBMS how to build a new relation from one or more relations in the database. Again, informally, we may describe the relational calculus as a non-procedural language: it can be used to formulate the definition of a relation in terms of one or more database relations. 
However, formally the relational algebra and relational calculus are equivalent to one another: for every expression in the algebra, there is an equivalent expression in the calculus (and vice versa). Both the algebra and the calculus are formal, non-user-friendly languages. They have been used as the basis for other, higher-level Data Manipulation Languages (DMLs) for relational databases. They are of interest because they illustrate the basic operations required of any DML and because they serve as the standard of comparison for other relational languages. The relational calculus is used to measure the selective power of relational languages. A language that can be used to produce any relation that can be derived using the relational calculus is said to be relationally complete. Most relational query languages are relationally complete but have more expressive power than the relational algebra or relational calculus because of additional operations such as calculated, summary, and ordering functions. 4.1 The Relational Algebra Structure of this Chapter In Section 4.1 we examine the relational algebra and in Section 4.2 we examine two forms of the relational calculus: tuple relational calculus and domain relational calculus. In Section 4.3 we briefly discuss some other relational languages. We use the DreamHome rental database instance shown in Figure 3.3 to illustrate the operations. In Chapters 5 and 6 we examine SQL (Structured Query Language), the formal and de facto standard language for RDBMSs, which has constructs based on the tuple relational calculus. In Chapter 7 we examine QBE (Query-By-Example), another highly popular visual query language for RDBMSs, which is in part based on the domain relational calculus. The Relational Algebra 4.1 The relational algebra is a theoretical language with operations that work on one or more relations to define another relation without changing the original relation(s). Thus, both the operands and the results are relations, and so the output from one operation can become the input to another operation. This allows expressions to be nested in the relational algebra, just as we can nest arithmetic operations. This property is called closure: relations are closed under the algebra, just as numbers are closed under arithmetic operations. The relational algebra is a relation-at-a-time (or set) language in which all tuples, possibly from several relations, are manipulated in one statement without looping. There are several variations of syntax for relational algebra commands and we use a common symbolic notation for the commands and present it informally. The interested reader is referred to Ullman (1988) for a more formal treatment. There are many variations of the operations that are included in relational algebra. Codd (1972a) originally proposed eight operations, but several others have been developed. The five fundamental operations in relational algebra, Selection, Projection, Cartesian product, Union, and Set difference, perform most of the data retrieval operations that we are interested in. In addition, there are also the Join, Intersection, and Division operations, which can be expressed in terms of the five basic operations. The function of each operation is illustrated in Figure 4.1. The Selection and Projection operations are unary operations, since they operate on one relation. The other operations work on pairs of relations and are therefore called binary operations. 
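For readers who already know a little SQL, it may help to note an informal correspondence between these fundamental operations and SQL constructs; SQL itself is covered in Chapters 5 and 6, and the queries below anticipate the worked examples in the next subsections. The correspondence is only approximate, because SQL tables, unlike relations, may contain duplicate rows, hence the DISTINCT keyword on the projection. Some systems write MINUS rather than EXCEPT.

-- Selection: staff with a salary greater than 10000
SELECT * FROM Staff WHERE salary > 10000;

-- Projection: staff numbers, names, and salaries only
SELECT DISTINCT staffNo, fName, lName, salary FROM Staff;

-- Union: cities with either a branch office or a property for rent
SELECT city FROM Branch UNION SELECT city FROM PropertyForRent;

-- Set difference: cities with a branch office but no property for rent
SELECT city FROM Branch EXCEPT SELECT city FROM PropertyForRent;

-- Cartesian product of two relations
SELECT * FROM Branch CROSS JOIN Staff;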
In the following definitions, let R and S be two relations defined over the attributes A = (a1, a2, . . . , aN) and B = (b1, b2, . . . , bM), respectively. Unary Operations We start the discussion of the relational algebra by examining the two unary operations: Selection and Projection. 4.1.1 | 89 90 | Chapter 4 z Relational Algebra and Relational Calculus Figure 4.1 Illustration showing the function of the relational algebra operations. Selection (or Restriction) spredicate(R) The Selection operation works on a single relation R and defines a relation that contains only those tuples of R that satisfy the specified condition ( predicate). 4.1 The Relational Algebra | 91 Example 4.1 Selection operation List all staff with a salary greater than £10,000. σsalary > 10000(Staff) Here, the input relation is Staff and the predicate is salary > 10000. The Selection operation defines a relation containing only those Staff tuples with a salary greater than £10,000. The result of this operation is shown in Figure 4.2. More complex predicates can be generated using the logical operators ∧ (AND), ∨ (OR) and ~ (NOT). Figure 4.2 Selecting salary > 10000 from the Staff relation. Projection Π a , . . . , a (R) 1 n The Projection operation works on a single relation R and defines a relation that contains a vertical subset of R, extracting the values of specified attributes and eliminating duplicates. Example 4.2 Projection operation Produce a list of salaries for all staff, showing only the staffNo, fName, lName, and salary details. ΠstaffNo, fName, lName, salary(Staff) In this example, the Projection operation defines a relation that contains only the designated Staff attributes staffNo, fName, lName, and salary, in the specified order. The result of this operation is shown in Figure 4.3. Figure 4.3 Projecting the Staff relation over the staffNo, fName, lName, and salary attributes. 92 | Chapter 4 z Relational Algebra and Relational Calculus 4.1.2 Set Operations The Selection and Projection operations extract information from only one relation. There are obviously cases where we would like to combine information from several relations. In the remainder of this section, we examine the binary operations of the relational algebra, starting with the set operations of Union, Set difference, Intersection, and Cartesian product. Union R∪S The union of two relations R and S defines a relation that contains all the tuples of R, or S, or both R and S, duplicate tuples being eliminated. R and S must be union-compatible. If R and S have I and J tuples, respectively, their union is obtained by concatenating them into one relation with a maximum of (I + J) tuples. Union is possible only if the schemas of the two relations match, that is, if they have the same number of attributes with each pair of corresponding attributes having the same domain. In other words, the relations must be union-compatible. Note that attributes names are not used in defining unioncompatibility. In some cases, the Projection operation may be used to make two relations union-compatible. Example 4.3 Union operation List all cities where there is either a branch office or a property for rent. Πcity(Branch) ∪ Πcity(PropertyForRent) Figure 4.4 Union based on the city attribute from the Branch and PropertyForRent relations. To produce union-compatible relations, we first use the Projection operation to project the Branch and PropertyForRent relations over the attribute city, eliminating duplicates where necessary. 
We then use the Union operation to combine these new relations to produce the result shown in Figure 4.4. Set difference R−S The Set difference operation defines a relation consisting of the tuples that are in relation R, but not in S. R and S must be union-compatible. 4.1 The Relational Algebra | 93 Example 4.4 Set difference operation List all cities where there is a branch office but no properties for rent. Πcity(Branch) − Πcity(PropertyForRent) As in the previous example, we produce union-compatible relations by projecting the Branch and PropertyForRent relations over the attribute city. We then use the Set difference operation to combine these new relations to produce the result shown in Figure 4.5. Figure 4.5 Set difference based on the city attribute from the Branch and PropertyForRent relations. Intersection R∩S The Intersection operation defines a relation consisting of the set of all tuples that are in both R and S. R and S must be union-compatible. Example 4.5 Intersection operation List all cities where there is both a branch office and at least one property for rent. Πcity(Branch) ∩ Πcity(PropertyForRent) As in the previous example, we produce union-compatible relations by projecting the Branch and PropertyForRent relations over the attribute city. We then use the Intersection operation to combine these new relations to produce the result shown in Figure 4.6. Note that we can express the Intersection operation in terms of the Set difference operation: R ∩ S = R − (R − S) Cartesian product R×S The Cartesian product operation defines a relation that is the concatenation of every tuple of relation R with every tuple of relation S. The Cartesian product operation multiplies two relations to define another relation consisting of all possible pairs of tuples from the two relations. Therefore, if one relation has I tuples and N attributes and the other has J tuples and M attributes, the Cartesian product relation will contain (I * J) tuples with (N + M) attributes. It is possible that the two relations may have attributes with the same name. In this case, the attribute names are prefixed with the relation name to maintain the uniqueness of attribute names within a relation. Figure 4.6 Intersection based on city attribute from the Branch and PropertyForRent relations. 94 | Chapter 4 z Relational Algebra and Relational Calculus Example 4.6 Cartesian product operation List the names and comments of all clients who have viewed a property for rent. The names of clients are held in the Client relation and the details of viewings are held in the Viewing relation. To obtain the list of clients and the comments on properties they have viewed, we need to combine these two relations: (ΠclientNo, fName, lName(Client)) × (ΠclientNo, propertyNo, comment(Viewing)) This result of this operation is shown in Figure 4.7. In its present form, this relation contains more information than we require. For example, the first tuple of this relation contains different clientNo values. To obtain the required list, we need to carry out a Selection operation on this relation to extract those tuples where Client.clientNo = Viewing.clientNo. The complete operation is thus: σClient.clientNo = Viewing.clientNo((ΠclientNo, fName, lName(Client)) × (ΠclientNo, propertyNo, comment(Viewing))) The result of this operation is shown in Figure 4.8. Figure 4.7 Cartesian product of reduced Client and Viewing relations. Figure 4.8 Restricted Cartesian product of reduced Client and Viewing relations. 
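The same query can be sketched in SQL (Chapters 5 and 6), although the correspondence is informal since SQL tables may contain duplicates. Listing both relations in the FROM clause corresponds to the Cartesian product, and the restricting Selection corresponds to the WHERE clause; the table and attribute names are those of the DreamHome database of Figure 3.3.

SELECT c.clientNo, c.fName, c.lName, v.propertyNo, v.comment
FROM   Client c, Viewing v              -- Cartesian product of the two relations
WHERE  c.clientNo = v.clientNo;         -- Selection restricting the product to matching clients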
4.1 The Relational Algebra Decomposing complex operations The relational algebra operations can be of arbitrary complexity. We can decompose such operations into a series of smaller relational algebra operations and give a name to the results of intermediate expressions. We use the assignment operation, denoted by ←, to name the results of a relational algebra operation. This works in a similar manner to the assignment operation in a programming language: in this case, the right-hand side of the operation is assigned to the left-hand side. For instance, in the previous example we could rewrite the operation as follows: TempViewing(clientNo, propertyNo, comment) ← ΠclientNo, propertyNo, comment(Viewing) TempClient(clientNo, fName, lName) ← ΠclientNo, fName, lName(Client) Comment(clientNo, fName, lName, vclientNo, propertyNo, comment) ← TempClient × TempViewing Result ← sclientNo = vclientNo(Comment) Alternatively, we can use the Rename operation ρ (rho), which gives a name to the result of a relational algebra operation. Rename allows an optional name for each of the attributes of the new relation to be specified. rS(E) or rS(a , a , . . . , a )(E) 1 2 n The Rename operation provides a new name S for the expression E, and optionally names the attributes as a1, a2, . . . , an. Join Operations Typically, we want only combinations of the Cartesian product that satisfy certain conditions and so we would normally use a Join operation instead of the Cartesian product operation. The Join operation, which combines two relations to form a new relation, is one of the essential operations in the relational algebra. Join is a derivative of Cartesian product, equivalent to performing a Selection operation, using the join predicate as the selection formula, over the Cartesian product of the two operand relations. Join is one of the most difficult operations to implement efficiently in an RDBMS and is one of the reasons why relational systems have intrinsic performance problems. We examine strategies for implementing the Join operation in Section 21.4.3. There are various forms of Join operation, each with subtle differences, some more useful than others: n Theta join n Equijoin (a particular type of Theta join) n Natural join n Outer join n Semijoin. 4.1.3 | 95 96 | Chapter 4 z Relational Algebra and Relational Calculus Theta join (q-join) R !F S The Theta join operation defines a relation that contains tuples satisfying the predicate F from the Cartesian product of R and S. The predicate F is of the form R.ai q S.bi where q may be one of the comparison operators (<, ≤, >, ≥, =, ≠). We can rewrite the Theta join in terms of the basic Selection and Cartesian product operations: R 1F S = σF (R × S) As with Cartesian product, the degree of a Theta join is the sum of the degrees of the operand relations R and S. In the case where the predicate F contains only equality (=), the term Equijoin is used instead. Consider again the query of Example 4.6. Example 4.7 Equijoin operation List the names and comments of all clients who have viewed a property for rent. In Example 4.6 we used the Cartesian product and Selection operations to obtain this list. However, the same result is obtained using the Equijoin operation: (ΠclientNo, fName, lName(Client)) 1 Client.clientNo = Viewing.clientNo (ΠclientNo, propertyNo, comment(Viewing)) or Result ← TempClient 1 TempClient.clientNo = TempViewing.clientNo TempViewing The result of these operations was shown in Figure 4.8. 
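Whereas the sketch following Example 4.6 spelled out the Cartesian product and the restricting Selection separately, the Equijoin corresponds to SQL's explicit join syntax; again this is an informal sketch using the DreamHome tables of Figure 3.3, not the SQL treatment deferred to Chapters 5 and 6.

SELECT c.clientNo, c.fName, c.lName, v.propertyNo, v.comment
FROM   Client c JOIN Viewing v
       ON c.clientNo = v.clientNo;      -- the ON condition plays the role of the join predicate F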
Natural join R!S The Natural join is an Equijoin of the two relations R and S over all common attributes x. One occurrence of each common attribute is eliminated from the result. The Natural join operation performs an Equijoin over all the attributes in the two relations that have the same name. The degree of a Natural join is the sum of the degrees of the relations R and S less the number of attributes in x. 4.1 The Relational Algebra | 97 Example 4.8 Natural join operation List the names and comments of all clients who have viewed a property for rent. In Example 4.7 we used the Equijoin to produce this list, but the resulting relation had two occurrences of the join attribute clientNo. We can use the Natural join to remove one occurrence of the clientNo attribute: (ΠclientNo, fName, lName(Client)) 1 (ΠclientNo, propertyNo, comment(Viewing)) or Result ← TempClient 1 TempViewing The result of this operation is shown in Figure 4.9. Figure 4.9 Natural join of restricted Client and Viewing relations. Outer join Often in joining two relations, a tuple in one relation does not have a matching tuple in the other relation; in other words, there is no matching value in the join attributes. We may want tuples from one of the relations to appear in the result even when there are no matching values in the other relation. This may be accomplished using the Outer join. R%S The (left) Outer join is a join in which tuples from R that do not have matching values in the common attributes of S are also included in the result relation. Missing values in the second relation are set to null. The Outer join is becoming more widely available in relational systems and is a specified operator in the SQL standard (see Section 5.3.7). The advantage of an Outer join is that information is preserved, that is, the Outer join preserves tuples that would have been lost by other types of join. 98 | Chapter 4 z Relational Algebra and Relational Calculus Example 4.9 Left Outer join operation Produce a status report on property viewings. In this case, we want to produce a relation consisting of the properties that have been viewed with comments and those that have not been viewed. This can be achieved using the following Outer join: (ΠpropertyNo, street, city(PropertyForRent)) 5 Viewing The resulting relation is shown in Figure 4.10. Note that properties PL94, PG21, and PG16 have no viewings, but these tuples are still contained in the result with nulls for the attributes from the Viewing relation. Figure 4.10 Left (natural) Outer join of PropertyForRent and Viewing relations. Strictly speaking, Example 4.9 is a Left (natural) Outer join as it keeps every tuple in the left-hand relation in the result. Similarly, there is a Right Outer join that keeps every tuple in the right-hand relation in the result. There is also a Full Outer join that keeps all tuples in both relations, padding tuples with nulls when no matching tuples are found. Semijoin R @F S The Semijoin operation defines a relation that contains the tuples of R that participate in the join of R with S. The Semijoin operation performs a join of the two relations and then projects over the attributes of the first operand. One advantage of a Semijoin is that it decreases the number of tuples that need to be handled to form the join. It is particularly useful for computing joins in distributed systems (see Sections 22.4.2 and 23.6.2). 
We can rewrite the Semijoin using the Projection and Join operations:

R 2F S = ΠA(R 1F S)    where A is the set of all attributes for R

This is actually a Semi-Theta join. There are variants for Semi-Equijoin and Semi-Natural join.

Example 4.10 Semijoin operation

List complete details of all staff who work at the branch in Glasgow.

If we are interested in seeing only the attributes of the Staff relation, we can use the following Semijoin operation, producing the relation shown in Figure 4.11.

Staff 2Staff.branchNo = Branch.branchNo (σcity = ‘Glasgow’(Branch))

Figure 4.11 Semijoin of Staff and Branch relations.

4.1.4 Division Operation

The Division operation is useful for a particular type of query that occurs quite frequently in database applications. Assume relation R is defined over the attribute set A and relation S is defined over the attribute set B such that B ⊆ A (B is a subset of A). Let C = A − B, that is, C is the set of attributes of R that are not attributes of S. We have the following definition of the Division operation.

R ÷ S  The Division operation defines a relation over the attributes C that consists of the set of tuples from R that match the combination of every tuple in S.

We can express the Division operation in terms of the basic operations:

T1 ← ΠC(R)
T2 ← ΠC((T1 × S) − R)
T ← T1 − T2

Example 4.11 Division operation

Identify all clients who have viewed all properties with three rooms.

We can use the Selection operation to find all properties with three rooms followed by the Projection operation to produce a relation containing only these property numbers. We can then use the following Division operation to obtain the new relation shown in Figure 4.12.

(ΠclientNo, propertyNo(Viewing)) ÷ (ΠpropertyNo(σrooms = 3(PropertyForRent)))

Figure 4.12 Result of the Division operation on the Viewing and PropertyForRent relations.

4.1.5 Aggregation and Grouping Operations

As well as simply retrieving certain tuples and attributes of one or more relations, we often want to perform some form of summation or aggregation of data, similar to the totals at the bottom of a report, or some form of grouping of data, similar to subtotals in a report. These operations cannot be performed using the basic relational algebra operations considered above. However, additional operations have been proposed, as we now discuss.

Aggregate operations

ℑAL(R)  Applies the aggregate function list, AL, to the relation R to define a relation over the aggregate list. AL contains one or more (<aggregate_function>, <attribute>) pairs.

The main aggregate functions are:

COUNT – returns the number of values in the associated attribute.
SUM – returns the sum of the values in the associated attribute.
AVG – returns the average of the values in the associated attribute.
MIN – returns the smallest value in the associated attribute.
MAX – returns the largest value in the associated attribute.

Example 4.12 Aggregate operations

(a) How many properties cost more than £350 per month to rent?

We can use the aggregate function COUNT to produce the relation R shown in Figure 4.13(a) as follows:

ρR(myCount) ℑCOUNT propertyNo (σrent > 350(PropertyForRent))

(b) Find the minimum, maximum, and average staff salary.
We can use the aggregate functions, MIN, MAX, and AVERAGE, to produce the relation R shown in Figure 4.13(b) as follows: 4.1 The Relational Algebra ρR(myMin, myMax, myAverage) ℑ MIN salary, MAX salary, AVERAGE salary (Staff) Grouping operation GA ℑAL(R) | 101 Figure 4.13 Result of the Aggregate operations: (a) finding the number of properties whose rent is greater than £350; (b) finding the minimum, maximum, and average staff salary. Groups the tuples of relation R by the grouping attributes, GA, and then applies the aggregate function list AL to define a new relation. AL contains one or more (<aggregate_function>, <attribute>) pairs. The resulting relation contains the grouping attributes, GA, along with the results of each of the aggregate functions. The general form of the grouping operation is as follows: a1, a2, . . . , an ℑ < A p a p >, <A q a q>, . . . , < A z a z> (R) where R is any relation, a1, a2, . . . , an are attributes of R on which to group, ap, aq, . . . , az are other attributes of R, and Ap, Aq, . . . , Az are aggregate functions. The tuples of R are partitioned into groups such that: n n all tuples in a group have the same value for a1, a2, . . . , an; tuples in different groups have different values for a1, a2, . . . , an. We illustrate the use of the grouping operation with the following example. Example 4.13 Grouping operation Find the number of staff working in each branch and the sum of their salaries. We first need to group tuples according to the branch number, branchNo, and then use the aggregate functions COUNT and SUM to produce the required relation. The relational algebra expression is as follows: ρR(branchNo, myCount, mySum) branchNo ℑ COUNT staffNo, SUM salary (Staff) The resulting relation is shown in Figure 4.14. Figure 4.14 Result of the grouping operation to find the number of staff working in each branch and the sum of their salaries. 102 | Chapter 4 z Relational Algebra and Relational Calculus 4.1.6 Summary of the Relational Algebra Operations The relational algebra operations are summarized in Table 4.1. Table 4.1 Operations in the relational algebra. Operation Notation Function Selection σpredicate(R) Projection Πa , . . . , a (R) Union R∪S Set difference R−S Intersection R∩S Cartesian product Theta join R×S Equijoin R 1F S Natural join R1S (Left) Outer join R5S Semijoin R 2F S Division R÷S Aggregate ℑAL( R) Produces a relation that contains only those tuples of R that satisfy the specified predicate. Produces a relation that contains a vertical subset of R, extracting the values of specified attributes and eliminating duplicates. Produces a relation that contains all the tuples of R, or S, or both R and S, duplicate tuples being eliminated. R and S must be union-compatible. Produces a relation that contains all the tuples in R that are not in S. R and S must be union-compatible. Produces a relation that contains all the tuples in both R and S. R and S must be union-compatible. Produces a relation that is the concatenation of every tuple of relation R with every tuple of relation S. Produces a relation that contains tuples satisfying the predicate F from the Cartesian product of R and S. Produces a relation that contains tuples satisfying the predicate F (which only contains equality comparisons) from the Cartesian product of R and S. An Equijoin of the two relations R and S over all common attributes x. One occurrence of each common attribute is eliminated. 
A join in which tuples from R that do not have matching values in the common attributes of S are also included in the result relation. Produces a relation that contains the tuples of R that participate in the join of R with S satisfying the predicate F. Produces a relation that consists of the set of tuples from R defined over the attributes C that match the combination of every tuple in S, where C is the set of attributes that are in R but not in S. Applies the aggregate function list, AL, to the relation R to define a relation over the aggregate list. AL contains one or more (<aggregate_function>, <attribute>) pairs. Groups the tuples of relation R by the grouping attributes, GA, and then applies the aggregate function list AL to define a new relation. AL contains one or more (<aggregate_function>, <attribute>) pairs. The resulting relation contains the grouping attributes, GA, along with the results of each of the aggregate functions. Grouping 1 n R 1F S ℑAL( R) GA 4.2 The Relational Calculus The Relational Calculus 4.2 A certain order is always explicitly specified in a relational algebra expression and a strategy for evaluating the query is implied. In the relational calculus, there is no description of how to evaluate a query; a relational calculus query specifies what is to be retrieved rather than how to retrieve it. The relational calculus is not related to differential and integral calculus in mathematics, but takes its name from a branch of symbolic logic called predicate calculus. When applied to databases, it is found in two forms: tuple relational calculus, as originally proposed by Codd (1972a), and domain relational calculus, as proposed by Lacroix and Pirotte (1977). In first-order logic or predicate calculus, a predicate is a truth-valued function with arguments. When we substitute values for the arguments, the function yields an expression, called a proposition, which can be either true or false. For example, the sentences, ‘John White is a member of staff’ and ‘John White earns more than Ann Beech’ are both propositions, since we can determine whether they are true or false. In the first case, we have a function, ‘is a member of staff’, with one argument (John White); in the second case, we have a function, ‘earns more than’, with two arguments (John White and Ann Beech). If a predicate contains a variable, as in ‘x is a member of staff’, there must be an associated range for x. When we substitute some values of this range for x, the proposition may be true; for other values, it may be false. For example, if the range is the set of all people and we replace x by John White, the proposition ‘John White is a member of staff’ is true. If we replace x by the name of a person who is not a member of staff, the proposition is false. If P is a predicate, then we can write the set of all x such that P is true for x, as: {x | P(x)} We may connect predicates by the logical connectives ∧ (AND), ∨ (OR), and ~ (NOT) to form compound predicates. Tuple Relational Calculus In the tuple relational calculus we are interested in finding tuples for which a predicate is true. The calculus is based on the use of tuple variables. A tuple variable is a variable that ‘ranges over’ a named relation: that is, a variable whose only permitted values are tuples of the relation. (The word ‘range’ here does not correspond to the mathematical use of range, but corresponds to a mathematical domain.) 
For example, to specify the range of a tuple variable S as the Staff relation, we write: Staff(S) To express the query ‘Find the set of all tuples S such that F(S) is true’, we can write: {S | F(S )} 4.2.1 | 103 104 | Chapter 4 z Relational Algebra and Relational Calculus F is called a formula (well-formed formula, or wff in mathematical logic). For example, to express the query ‘Find the staffNo, fName, lName, position, sex, DOB, salary, and branchNo of all staff earning more than £10,000’, we can write: {S | Staff(S) ∧ S.salary > 10000} means the value of the salary attribute for the tuple variable S. To retrieve a particular attribute, such as salary, we would write: S.salary {S.salary | Staff(S) ∧ S.salary > 10000} The existential and universal quantifiers There are two quantifiers we can use with formulae to tell how many instances the predicate applies to. The existential quantifier ∃ (‘there exists’) is used in formulae that must be true for at least one instance, such as: Staff(S) ∧ (∃B) (Branch(B) ∧ (B.branchNo = S.branchNo) ∧ B.city = ‘London’) This means, ‘There exists a Branch tuple that has the same branchNo as the branchNo of the current Staff tuple, S, and is located in London’. The universal quantifier ∀ (‘for all’) is used in statements about every instance, such as: (∀B) (B.city ≠ ‘Paris’) This means, ‘For all Branch tuples, the address is not in Paris’. We can apply a generalization of De Morgan’s laws to the existential and universal quantifiers. For example: (∃X)(F(X)) ≡ ~(∀X)(~(F(X))) (∀X)(F(X)) ≡ ~(∃X)(~(F(X))) (∃X)(F1(X) ∧ F2(X)) ≡ ~(∀X)(~(F1(X)) ∨ ~(F2(X))) (∀X)(F1(X) ∧ F2(X)) ≡ ~(∃X)(~(F1(X)) ∨ ~(F2(X))) Using these equivalence rules, we can rewrite the above formula as: ~(∃B) (B.city = ‘Paris’) which means, ‘There are no branches with an address in Paris’. Tuple variables that are qualified by ∀ or ∃ are called bound variables, otherwise the tuple variables are called free variables. The only free variables in a relational calculus expression should be those on the left side of the bar ( | ). For example, in the following query: {S.fName, S.lName | Staff(S) ∧ (∃B) (Branch(B) ∧ (B.branchNo = S.branchNo) ∧ B.city = ‘London’)} S is the only free variable and S is then bound successively to each tuple of Staff. 4.2 The Relational Calculus Expressions and formulae As with the English alphabet, in which some sequences of characters do not form a correctly structured sentence, so in calculus not every sequence of formulae is acceptable. The formulae should be those sequences that are unambiguous and make sense. An expression in the tuple relational calculus has the following general form: {S1.a1, S2.a2, . . . , Sn.an | F(S1, S2, . . . , Sm)} m≥n where S1, S2, . . . , Sn, . . . , Sm are tuple variables, each ai is an attribute of the relation over which Si ranges, and F is a formula. A (well-formed) formula is made out of one or more atoms, where an atom has one of the following forms: n R(Si), n n where Si is a tuple variable and R is a relation. Si.a1 θ Sj.a2, where Si and Sj are tuple variables, a1 is an attribute of the relation over which Si ranges, a2 is an attribute of the relation over which Sj ranges, and θ is one of the comparison operators (<, ≤, >, ≥ , =, ≠); the attributes a1 and a2 must have domains whose members can be compared by θ. Si.a1 θ c, where Si is a tuple variable, a1 is an attribute of the relation over which Si ranges, c is a constant from the domain of attribute a1, and θ is one of the comparison operators. 
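For example, if S is a tuple variable ranging over Staff and B is a tuple variable ranging over Branch, then Staff(S), S.branchNo = B.branchNo, and S.salary > 10000 are atoms of the first, second, and third forms, respectively.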
We recursively build up formulae from atoms using the following rules: n n n An atom is a formula. If F1 and F2 are formulae, so are their conjunction F1 ∧ F2, their disjunction F1 ∨ F2, and the negation ~F1. If F is a formula with free variable X, then (∃X )(F) and (∀X )(F) are also formulae. Example 4.14 Tuple relational calculus (a) List the names of all managers who earn more than £25,000. {S.fName, S.lName | Staff(S) ∧ S.position = ‘Manager’ ∧ S.salary > 25000} (b) List the staff who manage properties for rent in Glasgow. {S | Staff(S) ∧ (∃P) (PropertyForRent(P) ∧ (P.staffNo = S.staffNo) ∧ P.city = ‘Glasgow’)} The staffNo attribute in the PropertyForRent relation holds the staff number of the member of staff who manages the property. We could reformulate the query as: ‘For each member of staff whose details we want to list, there exists a tuple in the relation PropertyForRent for that member of staff with the value of the attribute city in that tuple being Glasgow.’ Note that in this formulation of the query, there is no indication of a strategy for executing the query – the DBMS is free to decide the operations required to fulfil the request and the execution order of these operations. On the other hand, the equivalent | 105 106 | Chapter 4 z Relational Algebra and Relational Calculus relational algebra formulation would be: ‘Select tuples from PropertyForRent such that the city is Glasgow and perform their join with the Staff relation’, which has an implied order of execution. (c) List the names of staff who currently do not manage any properties. {S.fName, S.lName | Staff(S) ∧ (~(∃P) (PropertyForRent(P) ∧ (S.staffNo = P.staffNo)))} Using the general transformation rules for quantifiers given above, we can rewrite this as: {S.fName, S.lName | Staff(S) ∧ ((∀P) (~PropertyForRent(P) ∨ ~(S.staffNo = P.staffNo)))} (d ) List the names of clients who have viewed a property for rent in Glasgow. {C.fName, C.lName | Client(C) ∧ ((∃V) (∃P) (Viewing(V) ∧ PropertyForRent(P) ∧ (C.clientNo = V.clientNo) ∧ (V.propertyNo = P.propertyNo) ∧ P.city = ‘Glasgow’))} To answer this query, note that we can rephrase ‘clients who have viewed a property in Glasgow’ as ‘clients for whom there exists some viewing of some property in Glasgow’. (e) List all cities where there is either a branch office or a property for rent. {T.city | (∃B) (Branch(B) ∧ B.city = T.city) ∨ (∃P) (PropertyForRent(P) ∧ P.city = T.city)} Compare this with the equivalent relational algebra expression given in Example 4.3. (f ) List all the cities where there is a branch office but no properties for rent. {B.city | Branch(B) ∧ (~(∃P) (PropertyForRent(P) ∧ B.city = P.city))} Compare this with the equivalent relational algebra expression given in Example 4.4. (g) List all the cities where there is both a branch office and at least one property for rent. {B.city | Branch(B) ∧ ((∃P) (PropertyForRent(P) ∧ B.city = P.city))} Compare this with the equivalent relational algebra expression given in Example 4.5. Safety of expressions Before we complete this section, we should mention that it is possible for a calculus expression to generate an infinite set. For example: 4.2 The Relational Calculus {S | ~ Staff(S)} would mean the set of all tuples that are not in the Staff relation. Such an expression is said to be unsafe. To avoid this, we have to add a restriction that all values that appear in the result must be values in the domain of the expression E, denoted dom(E). 
In other words, the domain of E is the set of all values that appear explicitly in E or that appear in one or more relations whose names appear in E. In this example, the domain of the expression is the set of all values appearing in the Staff relation. An expression is safe if all values that appear in the result are values from the domain of the expression. The above expression is not safe since it will typically include tuples from outside the Staff relation (and so outside the domain of the expression). All other examples of tuple relational calculus expressions in this section are safe. Some authors have avoided this problem by using range variables that are defined by a separate RANGE statement. The interested reader is referred to Date (2000). Domain Relational Calculus In the tuple relational calculus, we use variables that range over tuples in a relation. In the domain relational calculus, we also use variables but in this case the variables take their values from domains of attributes rather than tuples of relations. An expression in the domain relational calculus has the following general form: {d1, d2, . . . , dn | F(d1, d2, . . . , dm)} m≥n where d1, d2, . . . , dn, . . . , dm represent domain variables and F(d1, d2, . . . , dm) represents a formula composed of atoms, where each atom has one of the following forms: n R(d1, d2, . . . , dn), where R is a relation of degree n and each di is a domain variable. θ dj , where di and dj are domain variables and θ is one of the comparison operators (<, ≤, >, ≥, =, ≠); the domains di and dj must have members that can be compared by θ. di θ c, where di is a domain variable, c is a constant from the domain of di, and θ is one of the comparison operators. n di n We recursively build up formulae from atoms using the following rules: n n n An atom is a formula. If F1 and F2 are formulae, so are their conjunction F1 ∧ F2, their disjunction F1 ∨ F2, and the negation ~F1. If F is a formula with domain variable X, then (∃X )(F) and (∀X )(F) are also formulae. 4.2.2 | 107 108 | Chapter 4 z Relational Algebra and Relational Calculus Example 4.15 Domain relational calculus In the following examples, we use the following shorthand notation: (∃d1, d2, . . . , dn) in place of (∃d1), (∃d2), . . . , (∃dn) (a) Find the names of all managers who earn more than £25,000. {fN, lN | (∃sN, posn, sex, DOB, sal, bN) (Staff(sN, fN, lN, posn, sex, DOB, sal, bN) ∧ posn = ‘Manager’ ∧ sal > 25000)} If we compare this query with the equivalent tuple relational calculus query in Example 4.12(a), we see that each attribute is given a (variable) name. The condition Staff(sN, fN, . . . , bN) ensures that the domain variables are restricted to be attributes of the same tuple. Thus, we can use the formula posn = ‘Manager’, rather than Staff.position = ‘Manager’. Also note the difference in the use of the existential quantifier. In the tuple relational calculus, when we write ∃posn for some tuple variable posn, we bind the variable to the relation Staff by writing Staff(posn). On the other hand, in the domain relational calculus posn refers to a domain value and remains unconstrained until it appears in the subformula Staff(sN, fN, lN, posn, sex, DOB, sal, bN) when it becomes constrained to the position values that appear in the Staff relation. For conciseness, in the remaining examples in this section we quantify only those domain variables that actually appear in a condition (in this example, posn and sal). (b) List the staff who manage properties for rent in Glasgow. 
{sN, fN, lN, posn, sex, DOB, sal, bN | (∃sN1, cty) (Staff(sN, fN, lN, posn, sex, DOB, sal, bN) ∧ PropertyForRent(pN, st, cty, pc, typ, rms, rnt, oN, sN1, bN1) ∧ (sN = sN1) ∧ cty = ‘Glasgow’)} This query can also be written as: {sN, fN, lN, posn, sex, DOB, sal, bN | (Staff(sN, fN, lN, posn, sex, DOB, sal, bN) ∧ PropertyForRent(pN, st, ‘Glasgow’, pc, typ, rms, rnt, oN, sN, bN1))} In this version, the domain variable cty in PropertyForRent has been replaced with the constant ‘Glasgow’ and the same domain variable sN, which represents the staff number, has been repeated for Staff and PropertyForRent. (c) List the names of staff who currently do not manage any properties for rent. {fN, lN | (∃sN) (Staff(sN, fN, lN, posn, sex, DOB, sal, bN) ∧ (~(∃sN1) (PropertyForRent(pN, st, cty, pc, typ, rms, rnt, oN, sN1, bN1) ∧ (sN = sN1))))} (d) List the names of clients who have viewed a property for rent in Glasgow. {fN, lN | (∃cN, cN1, pN, pN1, cty) (Client(cN, fN, lN, tel, pT, mR) ∧ Viewing(cN1, pN1, dt, cmt) ∧ PropertyForRent(pN, st, cty, pc, typ, rms, rnt, oN, sN, bN) ∧ (cN = cN1) ∧ (pN = pN1) ∧ cty = ‘Glasgow’)} 4.3 Other Languages (e) List all cities where there is either a branch office or a property for rent. {cty | (Branch(bN, st, cty, pc) ∨ PropertyForRent(pN, st1, cty, pc1, typ, rms, rnt, oN, sN, bN1))} (f) List all the cities where there is a branch office but no properties for rent. {cty | (Branch(bN, st, cty, pc) ∧ (~(∃cty1) (PropertyForRent(pN, st1, cty1, pc1, typ, rms, rnt, oN, sN, bN1) ∧ (cty = cty1))))} (g) List all the cities where there is both a branch office and at least one property for rent. {cty | (Branch(bN, st, cty, pc) ∧ (∃cty1) (PropertyForRent(pN, st1, cty1, pc1, typ, rms, rnt, oN, sN, bN1) ∧ (cty = cty1)))} These queries are safe. When the domain relational calculus is restricted to safe expressions, it is equivalent to the tuple relational calculus restricted to safe expressions, which in turn is equivalent to the relational algebra. This means that for every relational algebra expression there is an equivalent expression in the relational calculus, and for every tuple or domain relational calculus expression there is an equivalent relational algebra expression. Other Languages Although the relational calculus is hard to understand and use, it was recognized that its non-procedural property is exceedingly desirable, and this resulted in a search for other easy-to-use non-procedural techniques. This led to another two categories of relational languages: transform-oriented and graphical. Transform-oriented languages are a class of non-procedural languages that use relations to transform input data into required outputs. These languages provide easy-to-use structures for expressing what is desired in terms of what is known. SQUARE (Boyce et al., 1975), SEQUEL (Chamberlin et al., 1976), and SEQUEL’s offspring, SQL, are all transform-oriented languages. We discuss SQL in Chapters 5 and 6. Graphical languages provide the user with a picture or illustration of the structure of the relation. The user fills in an example of what is wanted and the system returns the required data in that format. QBE (Query-By-Example) is an example of a graphical language (Zloof, 1977). We demonstrate the capabilities of QBE in Chapter 7. Another category is fourth-generation languages (4GLs), which allow a complete customized application to be created using a limited set of commands in a user-friendly, often menu-driven environment (see Section 2.2). 
Some systems accept a form of natural language, a restricted version of natural English, sometimes called a fifth-generation language (5GL), although this development is still at an early stage. 4.3 | 109 110 | Chapter 4 z Relational Algebra and Relational Calculus Chapter Summary n The relational algebra is a (high-level) procedural language: it can be used to tell the DBMS how to build a new relation from one or more relations in the database. The relational calculus is a non-procedural language: it can be used to formulate the definition of a relation in terms of one or more database relations. However, formally the relational algebra and relational calculus are equivalent to one another: for every expression in the algebra, there is an equivalent expression in the calculus (and vice versa). n The relational calculus is used to measure the selective power of relational languages. A language that can be used to produce any relation that can be derived using the relational calculus is said to be relationally complete. Most relational query languages are relationally complete but have more expressive power than the relational algebra or relational calculus because of additional operations such as calculated, summary, and ordering functions. n The five fundamental operations in relational algebra, Selection, Projection, Cartesian product, Union, and Set difference, perform most of the data retrieval operations that we are interested in. In addition, there are also the Join, Intersection, and Division operations, which can be expressed in terms of the five basic operations. n The relational calculus is a formal non-procedural language that uses predicates. There are two forms of the relational calculus: tuple relational calculus and domain relational calculus. n In the tuple relational calculus, we are interested in finding tuples for which a predicate is true. A tuple variable is a variable that ‘ranges over’ a named relation: that is, a variable whose only permitted values are tuples of the relation. n In the domain relational calculus, domain variables take their values from domains of attributes rather than tuples of relations. n The relational algebra is logically equivalent to a safe subset of the relational calculus (and vice versa). n Relational data manipulation languages are sometimes classified as procedural or non-procedural, transformoriented, graphical, fourth-generation, or fifth-generation. Review Questions 4.1 What is the difference between a procedural and a non-procedural language? How would you classify the relational algebra and relational calculus? 4.2 Explain the following terms: (a) relationally complete (b) closure of relational operations. 4.3 Define the five basic relational algebra operations. Define the Join, Intersection, and Division operations in terms of these five basic operations. 4.4 Discuss the differences between the five Join operations: Theta join, Equijoin, Natural join, Outer join, and Semijoin. Give examples to illustrate your answer. 4.5 Compare and contrast the tuple relational calculus with domain relational calculus. In particular, discuss the distinction between tuple and domain variables. 4.6 Define the structure of a (well-formed) formula in both the tuple relational calculus and domain relational calculus. 4.7 Explain how a relational calculus expression can be unsafe. Illustrate your answer with an example. Discuss how to ensure that a relational calculus expression is safe. 
Exercises | 111 Exercises For the following exercises, use the Hotel schema defined at the start of the Exercises at the end of Chapter 3. 4.8 Describe the relations that would be produced by the following relational algebra operations: (a) (b) (c) (d) (e) (f) 4.9 ΠhotelNo(σprice > 50(Room)) σHotel.hotelNo = Room.hotelNo(Hotel × Room) ΠhotelName(Hotel 1 Hotel.hotelNo = Room.hotelNo(σprice > 50(Room))) Guest 5 (σdateTo ≥ ‘1-Jan-2002’(Booking)) Hotel 2 Hotel.hotelNo = Room.hotelNo(σprice > 50(Room)) ΠguestName, hotelNo(Booking 1 Booking.guestNo = Guest.guestNo Guest) ÷ ΠhotelNo(σcity = ‘London’(Hotel)) Provide the equivalent tuple relational calculus and domain relational calculus expressions for each of the relational algebra queries given in Exercise 4.8. 4.10 Describe the relations that would be produced by the following tuple relational calculus expressions: (a) {H.hotelName | Hotel(H) ∧ H.city = ‘London’} (b) {H.hotelName | Hotel(H) ∧ (∃R) (Room(R) ∧ H.hotelNo = R.hotelNo ∧ R.price > 50)} (c) {H.hotelName | Hotel(H) ∧ (∃B) (∃G) (Booking(B) ∧ Guest(G) ∧ H.hotelNo = B.hotelNo ∧ B.guestNo = G.guestNo ∧ G.guestName = ‘John Smith’)} (d) {H.hotelName, G.guestName, B1.dateFrom, B2.dateFrom | Hotel(H) ∧ Guest(G) ∧ Booking(B1) ∧ Booking(B2) ∧ H.hotelNo = B1.hotelNo ∧ G.guestNo = B1.guestNo ∧ B2.hotelNo = B1.hotelNo ∧ B2.guestNo = B1.guestNo ∧ B2.dateFrom ≠ B1.dateFrom} 4.11 Provide the equivalent domain relational calculus and relational algebra expressions for each of the tuple relational calculus expressions given in Exercise 4.10. 4.12 Generate the relational algebra, tuple relational calculus, and domain relational calculus expressions for the following queries: (a) (b) (c) (d) (e) (f) List all hotels. List all single rooms with a price below £20 per night. List the names and cities of all guests. List the price and type of all rooms at the Grosvenor Hotel. List all guests currently staying at the Grosvenor Hotel. List the details of all rooms at the Grosvenor Hotel, including the name of the guest staying in the room, if the room is occupied. (g) List the guest details (guestNo, guestName, and guestAddress) of all guests staying at the Grosvenor Hotel. 4.13 Using relational algebra, create a view of all rooms in the Grosvenor Hotel, excluding price details. What are the advantages of this view? 4.14 Analyze the RDBMSs that you are currently using. What types of relational language does the system provide? For each of the languages provided, what are the equivalent operations for the eight relational algebra operations defined in Section 4.1? Chapter 5 SQL: Data Manipulation Chapter Objectives In this chapter you will learn: n The purpose and importance of the Structured Query Language (SQL). n The history and development of SQL. n How to write an SQL command. n How to retrieve data from the database using the SELECT statement. n How to build SQL statements that: – use the WHERE clause to retrieve rows that satisfy various conditions; – sort query results using ORDER BY; – use the aggregate functions of SQL; – group data using GROUP BY; – use subqueries; – join tables together; – perform set operations (UNION, INTERSECT, EXCEPT). n How to perform database updates using INSERT, UPDATE, and DELETE. In Chapters 3 and 4 we described the relational data model and relational languages in some detail. A particular language that has emerged from the development of the relational model is the Structured Query Language, or SQL as it is commonly called. 
Over the last few years, SQL has become the standard relational database language. In 1986, a standard for SQL was defined by the American National Standards Institute (ANSI), which was subsequently adopted in 1987 as an international standard by the International Organization for Standardization (ISO, 1987). More than one hundred Database Management Systems now support SQL, running on various hardware platforms from PCs to mainframes. Owing to the current importance of SQL, we devote three chapters of this book to examining the language in detail, providing a comprehensive treatment for both technical and non-technical users including programmers, database professionals, and managers. In these chapters we largely concentrate on the ISO definition of the SQL language. However, owing to the complexity of this standard, we do not attempt to cover all parts of the language. In this chapter, we focus on the data manipulation statements of the language. 5.1 Introduction to SQL Structure of this Chapter In Section 5.1 we introduce SQL and discuss why the language is so important to database applications. In Section 5.2 we introduce the notation used in this book to specify the structure of an SQL statement. In Section 5.3 we discuss how to retrieve data from relations using SQL, and how to insert, update, and delete data from relations. Looking ahead, in Chapter 6 we examine other features of the language, including data definition, views, transactions, and access control. In Section 28.4 we examine in some detail the features that have been added to the SQL specification to support object-oriented data management, referred to as SQL:1999 or SQL3. In Appendix E we discuss how SQL can be embedded in high-level programming languages to access constructs that were not available in SQL until very recently. The two formal languages, relational algebra and relational calculus, that we covered in Chapter 4 provide a foundation for a large part of the SQL standard and it may be useful to refer back to this chapter occasionally to see the similarities. However, our presentation of SQL is mainly independent of these languages for those readers who have omitted Chapter 4. The examples in this chapter use the DreamHome rental database instance shown in Figure 3.3. Introduction to SQL 5.1 In this section we outline the objectives of SQL, provide a short history of the language, and discuss why the language is so important to database applications. Objectives of SQL Ideally, a database language should allow a user to: n n n create the database and relation structures; perform basic data management tasks, such as the insertion, modification, and deletion of data from the relations; perform both simple and complex queries. A database language must perform these tasks with minimal user effort, and its command structure and syntax must be relatively easy to learn. Finally, the language must be portable, that is, it must conform to some recognized standard so that we can use the same command structure and syntax when we move from one DBMS to another. SQL is intended to satisfy these requirements. SQL is an example of a transform-oriented language, or a language designed to use relations to transform inputs into required outputs. As a language, the ISO SQL standard has two major components: n n a Data Definition Language (DDL) for defining the database structure and controlling access to the data; a Data Manipulation Language (DML) for retrieving and updating data. 
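To give a flavour of the two components (an illustrative sketch only, using the Staff table of the DreamHome case study; the DML is covered in Section 5.3 of this chapter and the DDL, including access control, in Chapter 6), a DDL statement granting read access to a table and a DML statement updating that table might look like this:
GRANT SELECT ON Staff TO PUBLIC;
UPDATE Staff
SET salary = salary*1.05
WHERE position = ‘Manager’;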
5.1.1 | 113 114 | Chapter 5 z SQL: Data Manipulation Until SQL:1999, SQL contained only these definitional and manipulative commands; it did not contain flow of control commands, such as IF . . . THEN . . . ELSE, GO TO, or DO . . . WHILE. These had to be implemented using a programming or job-control language, or interactively by the decisions of the user. Owing to this lack of computational completeness, SQL can be used in two ways. The first way is to use SQL interactively by entering the statements at a terminal. The second way is to embed SQL statements in a procedural language, as we discuss in Appendix E. We also discuss SQL:1999 and SQL:2003 in Chapter 28. SQL is a relatively easy language to learn: n n n n It is a non-procedural language: you specify what information you require, rather than how to get it. In other words, SQL does not require you to specify the access methods to the data. Like most modern languages, SQL is essentially free-format, which means that parts of statements do not have to be typed at particular locations on the screen. The command structure consists of standard English words such as CREATE TABLE, INSERT, SELECT. For example: – CREATE TABLE Staff (staffNo VARCHAR(5), lName VARCHAR(15), salary DECIMAL(7,2)); – INSERT INTO Staff VALUES (‘SG16’, ‘Brown’, 8300); – SELECT staffNo, lName, salary FROM Staff WHERE salary > 10000; SQL can be used by a range of users including Database Administrators (DBA), management personnel, application developers, and many other types of end-user. An international standard now exists for the SQL language making it both the formal and de facto standard language for defining and manipulating relational databases (ISO, 1992, 1999a). 5.1.2 History of SQL As stated in Chapter 3, the history of the relational model (and indirectly SQL) started with the publication of the seminal paper by E. F. Codd, while working at IBM’s Research Laboratory in San José (Codd, 1970). In 1974, D. Chamberlin, also from the IBM San José Laboratory, defined a language called the Structured English Query Language, or SEQUEL. A revised version, SEQUEL/2, was defined in 1976, but the name was subsequently changed to SQL for legal reasons (Chamberlin and Boyce, 1974; Chamberlin et al., 1976). Today, many people still pronounce SQL as ‘See-Quel’, though the official pronunciation is ‘S-Q-L’. IBM produced a prototype DBMS based on SEQUEL/2, called System R (Astrahan et al., 1976). The purpose of this prototype was to validate the feasibility of the relational model. Besides its other successes, one of the most important results that has been attributed to this project was the development of SQL. However, the roots of SQL are in the language SQUARE (Specifying Queries As Relational Expressions), which pre-dates 5.1 Introduction to SQL the System R project. SQUARE was designed as a research language to implement relational algebra with English sentences (Boyce et al., 1975). In the late 1970s, the database system Oracle was produced by what is now called the Oracle Corporation, and was probably the first commercial implementation of a relational DBMS based on SQL. INGRES followed shortly afterwards, with a query language called QUEL, which although more ‘structured’ than SQL, was less English-like. When SQL emerged as the standard database language for relational systems, INGRES was converted to an SQL-based DBMS. 
IBM produced its first commercial RDBMS, called SQL/ DS, for the DOS/VSE and VM/CMS environments in 1981 and 1982, respectively, and subsequently as DB2 for the MVS environment in 1983. In 1982, the American National Standards Institute began work on a Relational Database Language (RDL) based on a concept paper from IBM. ISO joined in this work in 1983, and together they defined a standard for SQL. (The name RDL was dropped in 1984, and the draft standard reverted to a form that was more like the existing implementations of SQL.) The initial ISO standard published in 1987 attracted a considerable degree of criticism. Date, an influential researcher in this area, claimed that important features such as referential integrity constraints and certain relational operators had been omitted. He also pointed out that the language was extremely redundant; in other words, there was more than one way to write the same query (Date, 1986, 1987a, 1990). Much of the criticism was valid, and had been recognized by the standards bodies before the standard was published. It was decided, however, that it was more important to release a standard as early as possible to establish a common base from which the language and the implementations could develop than to wait until all the features that people felt should be present could be defined and agreed. In 1989, ISO published an addendum that defined an ‘Integrity Enhancement Feature’ (ISO, 1989). In 1992, the first major revision to the ISO standard occurred, sometimes referred to as SQL2 or SQL-92 (ISO, 1992). Although some features had been defined in the standard for the first time, many of these had already been implemented, in part or in a similar form, in one or more of the many SQL implementations. It was not until 1999 that the next release of the standard was formalized, commonly referred to as SQL:1999 (ISO, 1999a). This release contains additional features to support object-oriented data management, which we examine in Section 28.4. A further release, SQL:2003, was produced in late 2003. Features that are provided on top of the standard by the vendors are called extensions. For example, the standard specifies six different data types for data in an SQL database. Many implementations supplement this list with a variety of extensions. Each implementation of SQL is called a dialect. No two dialects are exactly alike, and currently no dialect exactly matches the ISO standard. Moreover, as database vendors introduce new functionality, they are expanding their SQL dialects and moving them even further apart. However, the central core of the SQL language is showing signs of becoming more standardized. In fact, SQL:2003 has a set of features called Core SQL that a vendor must implement to claim conformance with the SQL:2003 standard. Many of the remaining features are divided into packages; for example, there are packages for object features and OLAP (OnLine Analytical Processing). Although SQL was originally an IBM concept, its importance soon motivated other vendors to create their own implementations. Today there are literally hundreds of SQLbased products available, with new products being introduced regularly. | 115 116 | Chapter 5 z SQL: Data Manipulation 5.1.3 Importance of SQL SQL is the first and, so far, only standard database language to gain wide acceptance. The only other standard database language, the Network Database Language (NDL), based on the CODASYL network model, has few followers. 
Nearly every major current vendor provides database products based on SQL or with an SQL interface, and most are represented on at least one of the standard-making bodies. There is a huge investment in the SQL language both by vendors and by users. It has become part of application architectures such as IBM’s Systems Application Architecture (SAA) and is the strategic choice of many large and influential organizations, for example, the X/OPEN consortium for UNIX standards. SQL has also become a Federal Information Processing Standard (FIPS), to which conformance is required for all sales of DBMSs to the US government. The SQL Access Group, a consortium of vendors, defined a set of enhancements to SQL that would support interoperability across disparate systems. SQL is used in other standards and even influences the development of other standards as a definitional tool. Examples include ISO’s Information Resource Dictionary System (IRDS) standard and Remote Data Access (RDA) standard. The development of the language is supported by considerable academic interest, providing both a theoretical basis for the language and the techniques needed to implement it successfully. This is especially true in query optimization, distribution of data, and security. There are now specialized implementations of SQL that are directed at new markets, such as OnLine Analytical Processing (OLAP). 5.1.4 Terminology The ISO SQL standard does not use the formal terms of relations, attributes, and tuples, instead using the terms tables, columns, and rows. In our presentation of SQL we mostly use the ISO terminology. It should also be noted that SQL does not adhere strictly to the definition of the relational model described in Chapter 3. For example, SQL allows the table produced as the result of the SELECT statement to contain duplicate rows, it imposes an ordering on the columns, and it allows the user to order the rows of a result table. 5.2 Writing SQL Commands In this section we briefly describe the structure of an SQL statement and the notation we use to define the format of the various SQL constructs. An SQL statement consists of reserved words and user-defined words. Reserved words are a fixed part of the SQL language and have a fixed meaning. They must be spelt exactly as required and cannot be split across lines. User-defined words are made up by the user (according to certain syntax rules) and represent the names of various database objects such as tables, columns, views, indexes, and so on. The words in a statement are also built according to a set of syntax rules. Although the standard does not require it, many dialects of SQL require the use of a statement terminator to mark the end of each SQL statement (usually the semicolon ‘;’ is used). 5.3 Data Manipulation Most components of an SQL statement are case insensitive, which means that letters can be typed in either upper or lower case. The one important exception to this rule is that literal character data must be typed exactly as it appears in the database. For example, if we store a person’s surname as ‘SMITH’ and then search for it using the string ‘Smith’, the row will not be found. Although SQL is free-format, an SQL statement or set of statements is more readable if indentation and lineation are used. 
For example: n n n each clause in a statement should begin on a new line; the beginning of each clause should line up with the beginning of other clauses; if a clause has several parts, they should each appear on a separate line and be indented under the start of the clause to show the relationship. Throughout this and the next chapter, we use the following extended form of the Backus Naur Form (BNF) notation to define SQL statements: n n n n n n upper-case letters are used to represent reserved words and must be spelt exactly as shown; lower-case letters are used to represent user-defined words; a vertical bar ( | ) indicates a choice among alternatives; for example, a | b | c; curly braces indicate a required element; for example, {a}; square brackets indicate an optional element; for example, [a]; an ellipsis ( . . . ) is used to indicate optional repetition of an item zero or more times. For example: {a | b} (, c . . . ) means either a or b followed by zero or more repetitions of c separated by commas. In practice, the DDL statements are used to create the database structure (that is, the tables) and the access mechanisms (that is, what each user can legally access), and then the DML statements are used to populate and query the tables. However, in this chapter we present the DML before the DDL statements to reflect the importance of DML statements to the general user. We discuss the main DDL statements in the next chapter. Data Manipulation This section looks at the SQL DML statements, namely: n n n n SELECT – to query data in the database; INSERT – to insert data into a table; UPDATE – to update data in a table; DELETE – to delete data from a table. Owing to the complexity of the SELECT statement and the relative simplicity of the other DML statements, we devote most of this section to the SELECT statement and its various formats. We begin by considering simple queries, and successively add more complexity 5.3 | 117 118 | Chapter 5 z SQL: Data Manipulation to show how more complicated queries that use sorting, grouping, aggregates, and also queries on multiple tables can be generated. We end the chapter by considering the INSERT, UPDATE, and DELETE statements. We illustrate the SQL statements using the instance of the DreamHome case study shown in Figure 3.3, which consists of the following tables: Branch Staff PropertyForRent Client PrivateOwner Viewing (branchNo, street, city, postcode) (staffNo, fName, lName, position, sex, DOB, salary, branchNo) (propertyNo, street, city, postcode, type, rooms, rent, ownerNo, branchNo) (clientNo, fName, lName, telNo, prefType, maxRent) (ownerNo, fName, lName, address, telNo) (clientNo, propertyNo, viewDate, comment) staffNo, Literals Before we discuss the SQL DML statements, it is necessary to understand the concept of literals. Literals are constants that are used in SQL statements. There are different forms of literals for every data type supported by SQL (see Section 6.1.1). However, for simplicity, we can distinguish between literals that are enclosed in single quotes and those that are not. All non-numeric data values must be enclosed in single quotes; all numeric data values must not be enclosed in single quotes. 
For example, we could use literals to insert data into a table: INSERT INTO PropertyForRent(propertyNo, street, city, postcode, type, rooms, rent, ownerNo, staffNo, branchNo) VALUES (‘PA14’, ‘16 Holhead’, ‘Aberdeen’, ‘AB7 5SU’, ‘House’, 6, 650.00, ‘CO46’, ‘SA9’, ‘B007’); The value in column rooms is an integer literal and the value in column rent is a decimal number literal; they are not enclosed in single quotes. All other columns are character strings and are enclosed in single quotes. 5.3.1 Simple Queries The purpose of the SELECT statement is to retrieve and display data from one or more database tables. It is an extremely powerful command capable of performing the equivalent of the relational algebra’s Selection, Projection, and Join operations in a single statement (see Section 4.1). SELECT is the most frequently used SQL command and has the following general form: SELECT [DISTINCT | ALL] {* | [columnExpression [AS newName]] [, . . . ]} FROM TableName [alias] [, . . . ] [WHERE condition] [GROUP BY columnList] [HAVING condition] [ORDER BY columnList] 5.3 Data Manipulation columnExpression represents a column name or an expression, TableName is the name of an existing database table or view that you have access to, and alias is an optional abbreviation for TableName. The sequence of processing in a SELECT statement is: FROM WHERE GROUP BY HAVING SELECT ORDER BY specifies the table or tables to be used filters the rows subject to some condition forms groups of rows with the same column value filters the groups subject to some condition specifies which columns are to appear in the output specifies the order of the output The order of the clauses in the SELECT statement cannot be changed. The only two mandatory clauses are the first two: SELECT and FROM; the remainder are optional. The SELECT operation is closed: the result of a query on a table is another table (see Section 4.1). There are many variations of this statement, as we now illustrate. Retrieve all rows Example 5.1 Retrieve all columns, all rows List full details of all staff. Since there are no restrictions specified in this query, the WHERE clause is unnecessary and all columns are required. We write this query as: SELECT staffNo, fName, lName, position, sex, DOB, salary, branchNo FROM Staff; Since many SQL retrievals require all columns of a table, there is a quick way of expressing ‘all columns’ in SQL, using an asterisk (*) in place of the column names. The following statement is an equivalent and shorter way of expressing this query: SELECT * FROM Staff; The result table in either case is shown in Table 5.1. Table 5.1 Result table for Example 5.1. staffNo fName lName position sex DOB salary branchNo SL21 SG37 SG14 SA9 SG5 SL41 John Ann David Mary Susan Julie White Beech Ford Howe Brand Lee Manager Assistant Supervisor Assistant Manager Assistant M F M F F F 1-Oct-45 10-Nov-60 24-Mar-58 19-Feb-70 3-Jun-40 13-Jun-65 30000.00 12000.00 18000.00 9000.00 24000.00 9000.00 B005 B003 B003 B007 B003 B005 | 119 120 | Chapter 5 z SQL: Data Manipulation Example 5.2 Retrieve specific columns, all rows Produce a list of salaries for all staff, showing only the staff number, the first and last names, and the salary details. SELECT staffNo, fName, lName, salary FROM Staff; In this example a new table is created from Staff containing only the designated columns staffNo, fName, lName, and salary, in the specified order. The result of this operation is shown in Table 5.2. 
Note that, unless specified, the rows in the result table may not be sorted. Some DBMSs do sort the result table based on one or more columns (for example, Microsoft Office Access would sort this result table based on the primary key staffNo). We describe how to sort the rows of a result table in the next section. Table 5.2 Result table for Example 5.2. staffNo fName lName salary SL21 SG37 SG14 SA9 SG5 SL41 John Ann David Mary Susan Julie White Beech Ford Howe Brand Lee 30000.00 12000.00 18000.00 9000.00 24000.00 9000.00 Example 5.3 Use of DISTINCT List the property numbers of all properties that have been viewed. SELECT propertyNo FROM Viewing; The result table is shown in Table 5.3(a). Notice that there are several duplicates because, unlike the relational algebra Projection operation (see Section 4.1.1), SELECT does not eliminate duplicates when it projects over one or more columns. To eliminate the duplicates, we use the DISTINCT keyword. Rewriting the query as: SELECT DISTINCT propertyNo FROM Viewing; we get the result table shown in Table 5.3(b) with the duplicates eliminated. 5.3 Data Manipulation Table 5.3(a) Result table for Example 5.3 with duplicates. Table 5.3(b) Result table for Example 5.3 with duplicates eliminated. propertyNo propertyNo PA14 PG4 PG4 PA14 PG36 PA14 PG4 PG36 Example 5.4 Calculated fields Produce a list of monthly salaries for all staff, showing the staff number, the first and last names, and the salary details. SELECT staffNo, fName, lName, salary/12 FROM Staff; This query is almost identical to Example 5.2, with the exception that monthly salaries are required. In this case, the desired result can be obtained by simply dividing the salary by 12, giving the result table shown in Table 5.4. This is an example of the use of a calculated field (sometimes called a computed or derived field). In general, to use a calculated field you specify an SQL expression in the SELECT list. An SQL expression can involve addition, subtraction, multiplication, and division, and parentheses can be used to build complex expressions. More than one table column can be used in a calculated column; however, the columns referenced in an arithmetic expression must have a numeric type. The fourth column of this result table has been output as col4. Normally, a column in the result table takes its name from the corresponding column of the database table from which it has been retrieved. However, in this case, SQL does not know how to label the column. Some dialects give the column a name corresponding to its position in the table Table 5.4 Result table for Example 5.4. staffNo fName lName col4 SL21 SG37 SG14 SA9 SG5 SL41 John Ann David Mary Susan Julie White Beech Ford Howe Brand Lee 2500.00 1000.00 1500.00 750.00 2000.00 750.00 | 121 122 | Chapter 5 z SQL: Data Manipulation (for example, col4); some may leave the column name blank or use the expression entered in the SELECT list. The ISO standard allows the column to be named using an AS clause. In the previous example, we could have written: SELECT staffNo, fName, lName, salary/12 AS monthlySalary FROM Staff; In this case the column heading of the result table would be monthlySalary rather than col4. Row selection (WHERE clause) The above examples show the use of the SELECT statement to retrieve all rows from a table. However, we often need to restrict the rows that are retrieved. This can be achieved with the WHERE clause, which consists of the keyword WHERE followed by a search condition that specifies the rows to be retrieved. 
The five basic search conditions (or predicates using the ISO terminology) are as follows: n n n n n Comparison Compare the value of one expression to the value of another expression. Range Test whether the value of an expression falls within a specified range of values. Set membership Test whether the value of an expression equals one of a set of values. Pattern match Test whether a string matches a specified pattern. Null Test whether a column has a null (unknown) value. The WHERE clause is equivalent to the relational algebra Selection operation discussed in Section 4.1.1. We now present examples of each of these types of search conditions. Example 5.5 Comparison search condition List all staff with a salary greater than £10,000. SELECT staffNo, fName, lName, position, salary FROM Staff WHERE salary > 10000; Here, the table is Staff and the predicate is salary > 10000. The selection creates a new table containing only those Staff rows with a salary greater than £10,000. The result of this operation is shown in Table 5.5. Table 5.5 Result table for Example 5.5. staffNo fName lName position salary SL21 SG37 SG14 SG5 John Ann David Susan White Beech Ford Brand Manager Assistant Supervisor Manager 30000.00 12000.00 18000.00 24000.00 5.3 Data Manipulation In SQL, the following simple comparison operators are available: = <> < > equals is not equal to (ISO standard) is less than is greater than ! = is not equal to (allowed in some dialects) < = is less than or equal to > = is greater than or equal to More complex predicates can be generated using the logical operators AND, OR, and NOT, with parentheses (if needed or desired) to show the order of evaluation. The rules for evaluating a conditional expression are: n n n n an expression is evaluated left to right; subexpressions in brackets are evaluated first; NOTs are evaluated before ANDs and ORs; ANDs are evaluated before ORs. The use of parentheses is always recommended in order to remove any possible ambiguities. Example 5.6 Compound comparison search condition List the addresses of all branch offices in London or Glasgow. SELECT * FROM Branch WHERE city = ‘London’ OR city = ‘Glasgow’; In this example the logical operator OR is used in the WHERE clause to find the branches in London (city = ‘London’) or in Glasgow (city = ‘Glasgow’). The result table is shown in Table 5.6. Table 5.6 Result table for Example 5.6. branchNo street city postcode B005 B003 B002 22 Deer Rd 163 Main St 56 Clover Dr London Glasgow London SW1 4EH G11 9QX NW10 6EU | 123 124 | Chapter 5 z SQL: Data Manipulation Example 5.7 Range search condition (BETWEEN/NOT BETWEEN) List all staff with a salary between £20,000 and £30,000. SELECT staffNo, fName, lName, position, salary FROM Staff WHERE salary BETWEEN 20000 AND 30000; The BETWEEN test includes the endpoints of the range, so any members of staff with a salary of £20,000 or £30,000 would be included in the result. The result table is shown in Table 5.7. Table 5.7 Result table for Example 5.7. staffNo fName lName position salary SL21 SG5 John Susan White Brand Manager Manager 30000.00 24000.00 There is also a negated version of the range test (NOT BETWEEN) that checks for values outside the range. The BETWEEN test does not add much to the expressive power of SQL, because it can be expressed equally well using two comparison tests. 
We could have expressed the above query as: SELECT staffNo, fName, lName, position, salary FROM Staff WHERE salary > = 20000 AND salary < = 30000; However, the BETWEEN test is a simpler way to express a search condition when considering a range of values. Example 5.8 Set membership search condition (IN/NOT IN) List all managers and supervisors. SELECT staffNo, fName, lName, position FROM Staff WHERE position IN (‘Manager’, ‘Supervisor’); The set membership test (IN) tests whether a data value matches one of a list of values, in this case either ‘Manager’ or ‘Supervisor’. The result table is shown in Table 5.8. There is a negated version (NOT IN) that can be used to check for data values that do not lie in a specific list of values. Like BETWEEN, the IN test does not add much to the expressive power of SQL. We could have expressed the above query as: 5.3 Data Manipulation Table 5.8 Result table for Example 5.8. staffNo fName lName position SL21 SG14 SG5 John David Susan White Ford Brand Manager Supervisor Manager SELECT staffNo, fName, lName, position FROM Staff WHERE position = ‘Manager’ OR position = ‘Supervisor’; However, the IN test provides a more efficient way of expressing the search condition, particularly if the set contains many values. Example 5.9 Pattern match search condition (LIKE/NOT LIKE) Find all owners with the string ‘Glasgow’ in their address. For this query, we must search for the string ‘Glasgow’ appearing somewhere within the address column of the PrivateOwner table. SQL has two special pattern-matching symbols: n n % percent character represents any sequence of zero or more characters (wildcard). _ underscore character represents any single character. All other characters in the pattern represent themselves. For example: LIKE ‘H%’ means the first character must be H, but the rest of the string can be anything. address LIKE ‘H_ _ _’ means that there must be exactly four characters in the string, the first of which must be an H. address LIKE ‘%e’ means any sequence of characters, of length at least 1, with the last character an e. address LIKE ‘%Glasgow%’ means a sequence of characters of any length containing Glasgow. address NOT LIKE ‘H%’ means the first character cannot be an H. n address n n n n If the search string can include the pattern-matching character itself, we can use an escape character to represent the pattern-matching character. For example, to check for the string ‘15%’, we can use the predicate: LIKE ‘15#%’ ESCAPE ‘#’ Using the pattern-matching search condition of SQL, we can find all owners with the string ‘Glasgow’ in their address using the following query, producing the result table shown in Table 5.9: | 125 126 | Chapter 5 z SQL: Data Manipulation SELECT ownerNo, fName, lName, address, telNo FROM PrivateOwner WHERE address LIKE ‘%Glasgow%’; Note, some RDBMSs, such as Microsoft Office Access, use the wildcard characters * and ? instead of % and _ . Table 5.9 Result table for Example 5.9. ownerNo fName IName address telNo CO87 CO40 CO93 Carol Tina Tony Farrel Murphy Shaw 6 Achray St, Glasgow G32 9DX 63 Well St, Glasgow G42 12 Park Pl, Glasgow G4 0QR 0141-357-7419 0141-943-1728 0141-225-7025 Example 5.10 NULL search condition (IS NULL/IS NOT NULL) List the details of all viewings on property PG4 where a comment has not been supplied. From the Viewing table of Figure 3.3, we can see that there are two viewings for property PG4: one with a comment, the other without a comment. 
In this simple example, you may think that the latter row could be accessed by using one of the search conditions: (propertyNo = ‘PG4’ AND comment = ‘ ’) or (propertyNo = ‘PG4’ AND comment < > ‘too remote’) However, neither of these conditions would work. A null comment is considered to have an unknown value, so we cannot test whether it is equal or not equal to another string. If we tried to execute the SELECT statement using either of these compound conditions, we would get an empty result table. Instead, we have to test for null explicitly using the special keyword IS NULL: SELECT clientNo, viewDate FROM Viewing WHERE propertyNo = ‘PG4’ AND comment IS NULL; The result table is shown in Table 5.10. The negated version (IS NOT NULL) can be used to test for values that are not null. Table 5.10 Result table for Example 5.10. clientNo viewDate CR56 26-May-04 5.3 Data Manipulation Sorting Results (ORDER BY Clause) In general, the rows of an SQL query result table are not arranged in any particular order (although some DBMSs may use a default ordering based, for example, on a primary key). However, we can ensure the results of a query are sorted using the ORDER BY clause in the SELECT statement. The ORDER BY clause consists of a list of column identifiers that the result is to be sorted on, separated by commas. A column identifier may be either a column name or a column number† that identifies an element of the SELECT list by its position within the list, 1 being the first (left-most) element in the list, 2 the second element in the list, and so on. Column numbers could be used if the column to be sorted on is an expression and no AS clause is specified to assign the column a name that can subsequently be referenced. The ORDER BY clause allows the retrieved rows to be ordered in ascending (ASC) or descending (DESC) order on any column or combination of columns, regardless of whether that column appears in the result. However, some dialects insist that the ORDER BY elements appear in the SELECT list. In either case, the ORDER BY clause must always be the last clause of the SELECT statement. Example 5.11 Single-column ordering Produce a list of salaries for all staff, arranged in descending order of salary. SELECT staffNo, fName, lName, salary FROM Staff ORDER BY salary DESC; This example is very similar to Example 5.2. The difference in this case is that the output is to be arranged in descending order of salary. This is achieved by adding the ORDER BY clause to the end of the SELECT statement, specifying salary as the column to be sorted, and DESC to indicate that the order is to be descending. In this case, we get the result table shown in Table 5.11. Note that we could have expressed the ORDER BY clause as: ORDER BY 4 DESC, with the 4 relating to the fourth column name in the SELECT list, namely salary. Table 5.11 † Result table for Example 5.11. staffNo fName lName salary SL21 SG5 SG14 SG37 SA9 SL41 John Susan David Ann Mary Julie White Brand Ford Beech Howe Lee 30000.00 24000.00 18000.00 12000.00 9000.00 9000.00 Column numbers are a deprecated feature of the ISO standard and should not be used. 5.3.2 | 127 128 | Chapter 5 z SQL: Data Manipulation It is possible to include more than one element in the ORDER BY clause. The major sort key determines the overall order of the result table. In Example 5.11, the major sort key is salary. If the values of the major sort key are unique, there is no need for additional keys to control the sort. 
However, if the values of the major sort key are not unique, there may be multiple rows in the result table with the same value for the major sort key. In this case, it may be desirable to order rows with the same value for the major sort key by some additional sort key. If a second element appears in the ORDER BY clause, it is called a minor sort key. Example 5.12 Multiple column ordering Produce an abbreviated list of properties arranged in order of property type. SELECT propertyNo, type, rooms, rent FROM PropertyForRent ORDER BY type; In this case we get the result table shown in Table 5.12(a). Table 5.12(a) Result table for Example 5.12 with one sort key. propertyNo type rooms rent PL94 PG4 PG36 PG16 PA14 PG21 Flat Flat Flat Flat House House 4 3 3 4 6 5 400 350 375 450 650 600 There are four flats in this list. As we did not specify any minor sort key, the system arranges these rows in any order it chooses. To arrange the properties in order of rent, we specify a minor order, as follows: SELECT propertyNo, type, rooms, rent FROM PropertyForRent ORDER BY type, rent DESC; Now, the result is ordered first by property type, in ascending alphabetic order (ASC being the default setting), and within property type, in descending order of rent. In this case, we get the result table shown in Table 5.12(b). The ISO standard specifies that nulls in a column or expression sorted with ORDER BY should be treated as either less than all non-null values or greater than all non-null values. The choice is left to the DBMS implementor. 5.3 Data Manipulation Table 5.12(b) Result table for Example 5.12 with two sort keys. propertyNo type rooms rent PG16 PL94 PG36 PG4 PA14 PG21 Flat Flat Flat Flat House House 4 4 3 3 6 5 450 400 375 350 650 600 Using the SQL Aggregate Functions As well as retrieving rows and columns from the database, we often want to perform some form of summation or aggregation of data, similar to the totals at the bottom of a report. The ISO standard defines five aggregate functions: n n n n n COUNT – returns the number of values in a specified column; SUM – returns the sum of the values in a specified column; AVG – returns the average of the values in a specified column; MIN – returns the smallest value in a specified column; MAX – returns the largest value in a specified column. These functions operate on a single column of a table and return a single value. COUNT, MIN, and MAX apply to both numeric and non-numeric fields, but SUM and AVG may be used on numeric fields only. Apart from COUNT(*), each function eliminates nulls first and operates only on the remaining non-null values. COUNT(*) is a special use of COUNT, which counts all the rows of a table, regardless of whether nulls or duplicate values occur. If we want to eliminate duplicates before the function is applied, we use the keyword DISTINCT before the column name in the function. The ISO standard allows the keyword ALL to be specified if we do not want to eliminate duplicates, although ALL is assumed if nothing is specified. DISTINCT has no effect with the MIN and MAX functions. However, it may have an effect on the result of SUM or AVG, so consideration must be given to whether duplicates should be included or excluded in the computation. In addition, DISTINCT can be specified only once in a query. It is important to note that an aggregate function can be used only in the SELECT list and in the HAVING clause (see Section 5.3.4). It is incorrect to use it elsewhere. 
If the SELECT list includes an aggregate function and no GROUP BY clause is being used to group data together (see Section 5.3.4), then no item in the SELECT list can include any reference to a column unless that column is the argument to an aggregate function. For example, the following query is illegal:

SELECT staffNo, COUNT(salary)
FROM Staff;

because the query does not have a GROUP BY clause and the column staffNo in the SELECT list is used outside an aggregate function.

Example 5.13 Use of COUNT(*)

How many properties cost more than £350 per month to rent?

SELECT COUNT(*) AS myCount
FROM PropertyForRent
WHERE rent > 350;

Restricting the query to properties that cost more than £350 per month is achieved using the WHERE clause. The total number of properties satisfying this condition can then be found by applying the aggregate function COUNT. The result table is shown in Table 5.13.

Table 5.13 Result table for Example 5.13.

myCount
5

Example 5.14 Use of COUNT(DISTINCT)

How many different properties were viewed in May 2004?

SELECT COUNT(DISTINCT propertyNo) AS myCount
FROM Viewing
WHERE viewDate BETWEEN '1-May-04' AND '31-May-04';

Again, restricting the query to viewings that occurred in May 2004 is achieved using the WHERE clause. The total number of viewings satisfying this condition can then be found by applying the aggregate function COUNT. However, as the same property may be viewed many times, we have to use the DISTINCT keyword to eliminate duplicate properties. The result table is shown in Table 5.14.

Table 5.14 Result table for Example 5.14.

myCount
2

Example 5.15 Use of COUNT and SUM

Find the total number of Managers and the sum of their salaries.

SELECT COUNT(staffNo) AS myCount, SUM(salary) AS mySum
FROM Staff
WHERE position = 'Manager';

Restricting the query to Managers is achieved using the WHERE clause. The number of Managers and the sum of their salaries can be found by applying the COUNT and the SUM functions respectively to this restricted set. The result table is shown in Table 5.15.

Table 5.15 Result table for Example 5.15.

myCount  mySum
2        54000.00

Example 5.16 Use of MIN, MAX, AVG

Find the minimum, maximum, and average staff salary.

SELECT MIN(salary) AS myMin, MAX(salary) AS myMax, AVG(salary) AS myAvg
FROM Staff;

In this example we wish to consider all staff and therefore do not require a WHERE clause. The required values can be calculated using the MIN, MAX, and AVG functions based on the salary column. The result table is shown in Table 5.16.

Table 5.16 Result table for Example 5.16.

myMin    myMax     myAvg
9000.00  30000.00  17000.00

5.3.3 Grouping Results (GROUP BY Clause)

The above summary queries are similar to the totals at the bottom of a report. They condense all the detailed data in the report into a single summary row of data. However, it is often useful to have subtotals in reports. We can use the GROUP BY clause of the SELECT statement to do this. A query that includes the GROUP BY clause is called a grouped query, because it groups the data from the SELECT table(s) and produces a single summary row for each group. The columns named in the GROUP BY clause are called the grouping columns. The ISO standard requires the SELECT clause and the GROUP BY clause to be closely integrated. When GROUP BY is used, each item in the SELECT list must be single-valued per group.
Further, the SELECT clause may contain only: 5.3.4 | 131 132 | Chapter 5 z SQL: Data Manipulation n n n n column names; aggregate functions; constants; an expression involving combinations of the above. All column names in the SELECT list must appear in the GROUP BY clause unless the name is used only in an aggregate function. The contrary is not true: there may be column names in the GROUP BY clause that do not appear in the SELECT list. When the WHERE clause is used with GROUP BY, the WHERE clause is applied first, then groups are formed from the remaining rows that satisfy the search condition. The ISO standard considers two nulls to be equal for purposes of the GROUP BY clause. If two rows have nulls in the same grouping columns and identical values in all the non-null grouping columns, they are combined into the same group. Example 5.17 Use of GROUP BY Find the number of staff working in each branch and the sum of their salaries. SELECT branchNo, COUNT(staffNo) AS myCount, SUM(salary) AS mySum FROM Staff GROUP BY branchNo ORDER BY branchNo; It is not necessary to include the column names staffNo and salary in the GROUP BY list because they appear only in the SELECT list within aggregate functions. On the other hand, branchNo is not associated with an aggregate function and so must appear in the GROUP BY list. The result table is shown in Table 5.17. Table 5.17 Result table for Example 5.17. branchNo myCount mySum B003 B005 B007 3 2 1 54000.00 39000.00 9000.00 Conceptually, SQL performs the query as follows: (1) SQL divides the staff into groups according to their respective branch numbers. Within each group, all staff have the same branch number. In this example, we get three groups: 5.3 Data Manipulation (2) For each group, SQL computes the number of staff members and calculates the sum of the values in the salary column to get the total of their salaries. SQL generates a single summary row in the query result for each group. (3) Finally, the result is sorted in ascending order of branch number, branchNo. The SQL standard allows the SELECT list to contain nested queries (see Section 5.3.5). Therefore, we could also express the above query as: SELECT branchNo, (SELECT COUNT(staffNo) AS myCount FROM Staff s WHERE s.branchNo = b.branchNo), (SELECT SUM(salary) AS mySum FROM Staff s WHERE s.branchNo = b.branchNo) FROM Branch b ORDER BY branchNo; With this version of the query, however, the two aggregate values are produced for each branch office in Branch, in some cases possibly with zero values. Restricting groupings (HAVING clause) The HAVING clause is designed for use with the GROUP BY clause to restrict the groups that appear in the final result table. Although similar in syntax, HAVING and WHERE serve different purposes. The WHERE clause filters individual rows going into the final result table, whereas HAVING filters groups going into the final result table. The ISO standard requires that column names used in the HAVING clause must also appear in the GROUP BY list or be contained within an aggregate function. In practice, the search condition in the HAVING clause always includes at least one aggregate function, otherwise the search condition could be moved to the WHERE clause and applied to individual rows. (Remember that aggregate functions cannot be used in the WHERE clause.) The HAVING clause is not a necessary part of SQL – any query expressed using a HAVING clause can always be rewritten without the HAVING clause. 
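As a sketch of this point (assuming a dialect that allows a subquery, sometimes called a derived table, in the FROM clause; the alias branchAvgs is ours, purely for illustration), a grouped query such as:

SELECT branchNo, AVG(salary) AS myAvg
FROM Staff
GROUP BY branchNo
HAVING AVG(salary) > 15000;

could instead compute the grouped rows first and then restrict them with an ordinary WHERE clause:

SELECT branchNo, myAvg
FROM (SELECT branchNo, AVG(salary) AS myAvg
      FROM Staff
      GROUP BY branchNo) AS branchAvgs
WHERE myAvg > 15000;

Both forms list the branches whose average salary exceeds £15,000.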
| 133 134 | Chapter 5 z SQL: Data Manipulation Example 5.18 Use of HAVING For each branch office with more than one member of staff, find the number of staff working in each branch and the sum of their salaries. SELECT branchNo, COUNT(staffNo) AS myCount, SUM(salary) AS mySum FROM Staff GROUP BY branchNo HAVING COUNT(staffNo) > 1 ORDER BY branchNo; This is similar to the previous example with the additional restriction that we want to consider only those groups (that is, branches) with more than one member of staff. This restriction applies to the groups and so the HAVING clause is used. The result table is shown in Table 5.18. Table 5.18 Result table for Example 5.18. branchNo myCount mySum B003 B005 3 2 54000.00 39000.00 5.3.5 Subqueries In this section we examine the use of a complete SELECT statement embedded within another SELECT statement. The results of this inner SELECT statement (or subselect) are used in the outer statement to help determine the contents of the final result. A subselect can be used in the WHERE and HAVING clauses of an outer SELECT statement, where it is called a subquery or nested query. Subselects may also appear in INSERT, UPDATE, and DELETE statements (see Section 5.3.10). There are three types of subquery: n A scalar subquery returns a single column and a single row; that is, a single value. In principle, a scalar subquery can be used whenever a single value is needed. Example 5.19 uses a scalar subquery. n A row subquery returns multiple columns, but again only a single row. A row subquery can be used whenever a row value constructor is needed, typically in predicates. 5.3 Data Manipulation n A table subquery returns one or more columns and multiple rows. A table subquery can be used whenever a table is needed, for example, as an operand for the IN predicate. Example 5.19 Using a subquery with equality List the staff who work in the branch at ‘163 Main St’. SELECT staffNo, fName, lName, position FROM Staff WHERE branchNo = (SELECT branchNo FROM Branch WHERE street = ‘163 Main St’); The inner SELECT statement (SELECT branchNo FROM Branch . . . ) finds the branch number that corresponds to the branch with street name ‘163 Main St’ (there will be only one such branch number, so this is an example of a scalar subquery). Having obtained this branch number, the outer SELECT statement then retrieves the details of all staff who work at this branch. In other words, the inner SELECT returns a result table containing a single value ‘B003’, corresponding to the branch at ‘163 Main St’, and the outer SELECT becomes: SELECT staffNo, fName, lName, position FROM Staff WHERE branchNo = ‘B003’; The result table is shown in Table 5.19. Table 5.19 Result table for Example 5.19. staffNo fName lName position SG37 SG14 SG5 Ann David Susan Beech Ford Brand Assistant Supervisor Manager We can think of the subquery as producing a temporary table with results that can be accessed and used by the outer statement. A subquery can be used immediately following a relational operator (=, <, >, <=, > =, < >) in a WHERE clause, or a HAVING clause. The subquery itself is always enclosed in parentheses. | 135 136 | Chapter 5 z SQL: Data Manipulation Example 5.20 Using a subquery with an aggregate function List all staff whose salary is greater than the average salary, and show by how much their salary is greater than the average. 
SELECT staffNo, fName, lName, position, salary – (SELECT AVG(salary) FROM Staff) AS salDiff FROM Staff WHERE salary > (SELECT AVG(salary) FROM Staff); First, note that we cannot write ‘WHERE salary > AVG(salary)’ because aggregate functions cannot be used in the WHERE clause. Instead, we use a subquery to find the average salary, and then use the outer SELECT statement to find those staff with a salary greater than this average. In other words, the subquery returns the average salary as £17,000. Note also the use of the scalar subquery in the SELECT list, to determine the difference from the average salary. The outer query is reduced then to: SELECT staffNo, fName, lName, position, salary – 17000 AS salDiff FROM Staff WHERE salary > 17000; The result table is shown in Table 5.20. Table 5.20 Result table for Example 5.20. staffNo fName lName position salDiff SL21 SG14 SG5 John David Susan White Ford Brand Manager Supervisor Manager 13000.00 1000.00 7000.00 The following rules apply to subqueries: (1) The ORDER BY clause may not be used in a subquery (although it may be used in the outermost SELECT statement). (2) The subquery SELECT list must consist of a single column name or expression, except for subqueries that use the keyword EXISTS (see Section 5.3.8). (3) By default, column names in a subquery refer to the table name in the FROM clause of the subquery. It is possible to refer to a table in a FROM clause of an outer query by qualifying the column name (see below). 5.3 Data Manipulation (4) When a subquery is one of the two operands involved in a comparison, the subquery must appear on the right-hand side of the comparison. For example, it would be incorrect to express the last example as: SELECT staffNo, fName, lName, position, salary FROM Staff WHERE (SELECT AVG(salary) FROM Staff) < salary; because the subquery appears on the left-hand side of the comparison with salary. Example 5.21 Nested subqueries: use of IN List the properties that are handled by staff who work in the branch at ‘163 Main St’. SELECT propertyNo, street, city, postcode, type, rooms, rent FROM PropertyForRent WHERE staffNo IN (SELECT staffNo FROM Staff WHERE branchNo = (SELECT branchNo FROM Branch WHERE street = ‘163 Main St’)); Working from the innermost query outwards, the first query selects the number of the branch at ‘163 Main St’. The second query then selects those staff who work at this branch number. In this case, there may be more than one such row found, and so we cannot use the equality condition (=) in the outermost query. Instead, we use the IN keyword. The outermost query then retrieves the details of the properties that are managed by each member of staff identified in the middle query. The result table is shown in Table 5.21. Table 5.21 Result table for Example 5.21. propertyNo street city postcode type rooms rent PG16 PG36 PG21 5 Novar Dr 2 Manor Rd 18 Dale Rd Glasgow Glasgow Glasgow G12 9AX G32 4QX G12 Flat Flat House 4 3 5 450 375 600 | 137 138 | Chapter 5 z SQL: Data Manipulation 5.3.6 ANY and ALL The words ANY and ALL may be used with subqueries that produce a single column of numbers. If the subquery is preceded by the keyword ALL, the condition will only be true if it is satisfied by all values produced by the subquery. If the subquery is preceded by the keyword ANY, the condition will be true if it is satisfied by any (one or more) values produced by the subquery. If the subquery is empty, the ALL condition returns true, the ANY condition returns false. 
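The empty-subquery rule is worth a small sketch of our own; it assumes a branch number, 'B999', that does not appear in the sample data, purely for illustration:

SELECT staffNo, fName, lName, salary
FROM Staff
WHERE salary > ALL (SELECT salary
                    FROM Staff
                    WHERE branchNo = 'B999');

Because the subquery returns no rows, the > ALL condition is true for every row and all staff are listed; with > ANY in its place, the condition would be false for every row and the result table would be empty.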
The ISO standard also allows the qualifier SOME to be used in place of ANY. Example 5.22 Use of ANY/SOME Find all staff whose salary is larger than the salary of at least one member of staff at branch B003. SELECT staffNo, fName, lName, position, salary FROM Staff WHERE salary > SOME (SELECT salary FROM Staff WHERE branchNo = ‘B003’); While this query can be expressed using a subquery that finds the minimum salary of the staff at branch B003, and then an outer query that finds all staff whose salary is greater than this number (see Example 5.20), an alternative approach uses the SOME/ANY keyword. The inner query produces the set {12000, 18000, 24000} and the outer query selects those staff whose salaries are greater than any of the values in this set (that is, greater than the minimum value, 12000). This alternative method may seem more natural than finding the minimum salary in a subquery. In either case, the result table is shown in Table 5.22. Table 5.22 Result table for Example 5.22. staffNo fName lName position salary SL21 SG14 SG5 John David Susan White Ford Brand Manager Supervisor Manager 30000.00 18000.00 24000.00 5.3 Data Manipulation Example 5.23 Use of ALL Find all staff whose salary is larger than the salary of every member of staff at branch B003. SELECT staffNo, fName, lName, position, salary FROM Staff WHERE salary > ALL (SELECT salary FROM Staff WHERE branchNo = ‘B003’); This is very similar to the last example. Again, we could use a subquery to find the maximum salary of staff at branch B003 and then use an outer query to find all staff whose salary is greater than this number. However, in this example we use the ALL keyword. The result table is shown in Table 5.23. Table 5.23 Result table for Example 5.23. staffNo fName lName position salary SL21 John White Manager 30000.00 Multi-Table Queries All the examples we have considered so far have a major limitation: the columns that are to appear in the result table must all come from a single table. In many cases, this is not sufficient. To combine columns from several tables into a result table we need to use a join operation. The SQL join operation combines information from two tables by forming pairs of related rows from the two tables. The row pairs that make up the joined table are those where the matching columns in each of the two tables have the same value. If we need to obtain information from more than one table, the choice is between using a subquery and using a join. If the final result table is to contain columns from different tables, then we must use a join. To perform a join, we simply include more than one table name in the FROM clause, using a comma as a separator, and typically including a WHERE clause to specify the join column(s). It is also possible to use an alias for a table named in the FROM clause. In this case, the alias is separated from the table name with a space. An alias can be used to qualify a column name whenever there is ambiguity regarding the source of the column name. It can also be used as a shorthand notation for the table name. If an alias is provided it can be used anywhere in place of the table name. 5.3.7 | 139 140 | Chapter 5 z SQL: Data Manipulation Example 5.24 Simple join List the names of all clients who have viewed a property along with any comment supplied. SELECT c.clientNo, fName, lName, propertyNo, comment FROM Client c, Viewing v WHERE c.clientNo = v.clientNo; We want to display the details from both the Client table and the Viewing table, and so we have to use a join. 
The SELECT clause lists the columns to be displayed. Note that it is necessary to qualify the client number, clientNo, in the SELECT list: clientNo could come from either table, and we have to indicate which one. (We could equally well have chosen the clientNo column from the Viewing table.) The qualification is achieved by prefixing the column name with the appropriate table name (or its alias). In this case, we have used c as the alias for the Client table. To obtain the required rows, we include those rows from both tables that have identical values in the clientNo columns, using the search condition (c.clientNo = v.clientNo). We call these two columns the matching columns for the two tables. This is equivalent to the relational algebra Equijoin operation discussed in Section 4.1.3. The result table is shown in Table 5.24. Table 5.24 Result table for Example 5.24. clientNo fName lName propertyNo CR56 CR56 CR56 CR62 CR76 Aline Aline Aline Mary John Stewart Stewart Stewart Tregear Kay PG36 PA14 PG4 PA14 PG4 comment too small no dining room too remote The most common multi-table queries involve two tables that have a one-to-many (1:*) (or a parent/child) relationship (see Section 11.6.2). The previous query involving clients and viewings is an example of such a query. Each viewing (child) has an associated client (parent), and each client (parent) can have many associated viewings (children). The pairs of rows that generate the query results are parent/child row combinations. In Section 3.2.5 we described how primary key and foreign keys create the parent/child relationship in a relational database: the table containing the primary key is the parent table and the table containing the foreign key is the child table. To use the parent/child relationship in an SQL query, we specify a search condition that compares the primary key and the foreign key. In Example 5.24, we compared the primary key in the Client table, c.clientNo, with the foreign key in the Viewing table, v.clientNo. 5.3 Data Manipulation The SQL standard provides the following alternative ways to specify this join: FROM Client c JOIN Viewing v ON c.clientNo = v.clientNo FROM Client JOIN Viewing USING clientNo FROM Client NATURAL JOIN Viewing In each case, the FROM clause replaces the original FROM and WHERE clauses. However, the first alternative produces a table with two identical clientNo columns; the remaining two produce a table with a single clientNo column. Example 5.25 Sorting a join For each branch office, list the numbers and names of staff who manage properties and the properties that they manage. SELECT s.branchNo, s.staffNo, fName, lName, propertyNo FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo ORDER BY s.branchNo, s.staffNo, propertyNo; To make the results more readable, we have ordered the output using the branch number as the major sort key and the staff number and property number as the minor keys. The result table is shown in Table 5.25. Table 5.25 Result table for Example 5.25. branchNo staffNo fName lName propertyNo B003 B003 B003 B005 B007 SG14 SG37 SG37 SL41 SA9 David Ann Ann Julie Mary Ford Beech Beech Lee Howe PG16 PG21 PG36 PL94 PA14 Example 5.26 Three-table join For each branch, list the numbers and names of staff who manage properties, including the city in which the branch is located and the properties that the staff manage. 
SELECT b.branchNo, b.city, s.staffNo, fName, lName, propertyNo FROM Branch b, Staff s, PropertyForRent p WHERE b.branchNo = s.branchNo AND s.staffNo = p.staffNo ORDER BY b.branchNo, s.staffNo, propertyNo; | 141 142 | Chapter 5 z SQL: Data Manipulation The result table requires columns from three tables: Branch, Staff, and PropertyForRent, so a join must be used. The Branch and Staff details are joined using the condition (b.branchNo = s.branchNo), to link each branch to the staff who work there. The Staff and PropertyForRent details are joined using the condition (s.staffNo = p.staffNo), to link staff to the properties they manage. The result table is shown in Table 5.26. Table 5.26 Result table for Example 5.26. branchNo city staffNo fName lName propertyNo B003 B003 B003 B005 B007 Glasgow Glasgow Glasgow London Aberdeen SG14 SG37 SG37 SL41 SA9 David Ann Ann Julie Mary Ford Beech Beech Lee Howe PG16 PG21 PG36 PL94 PA14 Note, again, that the SQL standard provides alternative formulations for the FROM and WHERE clauses, for example: FROM (Branch b JOIN Staff s USING branchNo) AS bs JOIN PropertyForRent p USING staffNo Example 5.27 Multiple grouping columns Find the number of properties handled by each staff member. SELECT s.branchNo, s.staffNo, COUNT(*) AS myCount FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo GROUP BY s.branchNo, s.staffNo ORDER BY s.branchNo, s.staffNo; To list the required numbers, we first need to find out which staff actually manage properties. This can be found by joining the Staff and PropertyForRent tables on the staffNo column, using the FROM/WHERE clauses. Next, we need to form groups consisting of the branch number and staff number, using the GROUP BY clause. Finally, we sort the output using the ORDER BY clause. The result table is shown in Table 5.27(a). Table 5.27(a) Result table for Example 5.27. branchNo staffNo myCount B003 B003 B005 B007 SG14 SG37 SL41 SA9 1 2 1 1 5.3 Data Manipulation Computing a join A join is a subset of a more general combination of two tables known as the Cartesian product (see Section 4.1.2). The Cartesian product of two tables is another table consisting of all possible pairs of rows from the two tables. The columns of the product table are all the columns of the first table followed by all the columns of the second table. If we specify a two-table query without a WHERE clause, SQL produces the Cartesian product of the two tables as the query result. In fact, the ISO standard provides a special form of the SELECT statement for the Cartesian product: SELECT FROM [DISTINCT | ALL] {* | columnList} CROSS JOIN TableName2 TableName1 Consider again Example 5.24, where we joined the Client and Viewing tables using the matching column, clientNo. Using the data from Figure 3.3, the Cartesian product of these two tables would contain 20 rows (4 clients * 5 viewings = 20 rows). It is equivalent to the query used in Example 5.24 without the WHERE clause. Conceptually, the procedure for generating the results of a SELECT with a join is as follows: (1) Form the Cartesian product of the tables named in the FROM clause. (2) If there is a WHERE clause, apply the search condition to each row of the product table, retaining those rows that satisfy the condition. In terms of the relational algebra, this operation yields a restriction of the Cartesian product. (3) For each remaining row, determine the value of each item in the SELECT list to produce a single row in the result table. 
(4) If SELECT DISTINCT has been specified, eliminate any duplicate rows from the result table. In the relational algebra, Steps 3 and 4 are equivalent to a projection of the restriction over the columns mentioned in the SELECT list. (5) If there is an ORDER BY clause, sort the result table as required. Outer joins The join operation combines data from two tables by forming pairs of related rows where the matching columns in each table have the same value. If one row of a table is unmatched, the row is omitted from the result table. This has been the case for the joins we examined above. The ISO standard provides another set of join operators called outer joins (see Section 4.1.3). The Outer join retains rows that do not satisfy the join condition. To understand the Outer join operators, consider the following two simplified Branch and PropertyForRent tables, which we refer to as Branch1 and PropertyForRent1, respectively: PropertyForRent1 Branch1 branchNo bCity propertyNo pCity B003 B004 B002 Glasgow Bristol London PA14 PL94 PG4 Aberdeen London Glasgow | 143 144 | Chapter 5 z SQL: Data Manipulation The (Inner) join of these two tables: SELECT b.*, p.* FROM Branch1 b, PropertyForRent1 p WHERE b.bCity = p.pCity; produces the result table shown in Table 5.27(b). Table 5.27( b) Result table for inner join of Branch1 and PropertyForRent1 tables. branchNo bCity propertyNo pCity B003 B002 Glasgow London PG4 PL94 Glasgow London The result table has two rows where the cities are the same. In particular, note that there is no row corresponding to the branch office in Bristol and there is no row corresponding to the property in Aberdeen. If we want to include the unmatched rows in the result table, we can use an Outer join. There are three types of Outer join: Left, Right, and Full Outer joins. We illustrate their functionality in the following examples. Example 5.28 Left Outer join List all branch offices and any properties that are in the same city. The Left Outer join of these two tables: SELECT b.*, p.* FROM Branch1 b LEFT JOIN PropertyForRent1 p ON b.bCity = p.pCity; produces the result table shown in Table 5.28. In this example the Left Outer join includes not only those rows that have the same city, but also those rows of the first (left) table that are unmatched with rows from the second (right) table. The columns from the second table are filled with NULLs. Table 5.28 Result table for Example 5.28. branchNo bCity propertyNo pCity B003 B004 B002 Glasgow Bristol London PG4 NULL PL94 Glasgow NULL London 5.3 Data Manipulation Example 5.29 Right Outer join List all properties and any branch offices that are in the same city. The Right Outer join of these two tables: SELECT b.*, p.* FROM Branch1 b RIGHT JOIN PropertyForRent1 p ON b.bCity = p.pCity; produces the result table shown in Table 5.29. In this example the Right Outer join includes not only those rows that have the same city, but also those rows of the second (right) table that are unmatched with rows from the first (left) table. The columns from the first table are filled with NULLs. Table 5.29 Result table for Example 5.29. branchNo bCity propertyNo pCity NULL B003 B002 NULL Glasgow London PA14 PG4 PL94 Aberdeen Glasgow London Example 5.30 Full Outer join List the branch offices and properties that are in the same city along with any unmatched branches or properties. The Full Outer join of these two tables: SELECT b.*, p.* FROM Branch1 b FULL JOIN PropertyForRent1 p ON b.bCity = p.pCity; produces the result table shown in Table 5.30. 
In this case, the Full Outer join includes not only those rows that have the same city, but also those rows that are unmatched in both tables. The unmatched columns are filled with NULLs. Table 5.30 Result table for Example 5.30. branchNo bCity propertyNo pCity NULL B003 B004 B002 NULL Glasgow Bristol London PA14 PG4 NULL PL94 Aberdeen Glasgow NULL London | 145 146 | Chapter 5 z SQL: Data Manipulation 5.3.8 EXISTS and NOT EXISTS The keywords EXISTS and NOT EXISTS are designed for use only with subqueries. They produce a simple true/false result. EXISTS is true if and only if there exists at least one row in the result table returned by the subquery; it is false if the subquery returns an empty result table. NOT EXISTS is the opposite of EXISTS. Since EXISTS and NOT EXISTS check only for the existence or non-existence of rows in the subquery result table, the subquery can contain any number of columns. For simplicity it is common for subqueries following one of these keywords to be of the form: (SELECT * FROM . . . ) Example 5.31 Query using EXISTS Find all staff who work in a London branch office. SELECT staffNo, fName, lName, position FROM Staff s WHERE EXISTS (SELECT * FROM Branch b WHERE s.branchNo = b.branchNo AND city = ‘London’); This query could be rephrased as ‘Find all staff such that there exists a Branch row containing his/her branch number, branchNo, and the branch city equal to London’. The test for inclusion is the existence of such a row. If it exists, the subquery evaluates to true. The result table is shown in Table 5.31. Table 5.31 Result table for Example 5.31. staffNo fName lName position SL21 SL41 John Julie White Lee Manager Assistant Note that the first part of the search condition s.branchNo = b.branchNo is necessary to ensure that we consider the correct branch row for each member of staff. If we omitted this part of the query, we would get all staff rows listed out because the subquery (SELECT * FROM Branch WHERE city = ‘London’) would always be true and the query would be reduced to: SELECT staffNo, fName, lName, position FROM Staff WHERE true; 5.3 Data Manipulation | 147 which is equivalent to: SELECT staffNo, fName, lName, position FROM Staff; We could also have written this query using the join construct: SELECT staffNo, fName, lName, position FROM Staff s, Branch b WHERE s.branchNo = b.branchNo AND city = ‘London’; Combining Result Tables (UNION, INTERSECT, EXCEPT) 5.3.9 In SQL, we can use the normal set operations of Union, Intersection, and Difference to combine the results of two or more queries into a single result table: n n n The Union of two tables, A and B, is a table containing all rows that are in either the first table A or the second table B or both. The Intersection of two tables, A and B, is a table containing all rows that are common to both tables A and B. The Difference of two tables, A and B, is a table containing all rows that are in table A but are not in table B. The set operations are illustrated in Figure 5.1. There are restrictions on the tables that can be combined using the set operations, the most important one being that the two tables have to be union-compatible; that is, they have the same structure. This implies that the two tables must contain the same number of columns, and that their corresponding columns have the same data types and lengths. It is the user’s responsibility to ensure that data values in corresponding columns come from the same domain. 
For example, it would not be sensible to combine a column containing the age of staff with the number of rooms in a property, even though both columns may have the same data type: for example, SMALLINT.

Figure 5.1 Union, intersection, and difference set operations.

The three set operators in the ISO standard are called UNION, INTERSECT, and EXCEPT. The format of the set operator clause in each case is:

operator [ALL] [CORRESPONDING [BY {column1 [, . . . ]}]]

If CORRESPONDING BY is specified, then the set operation is performed on the named column(s); if CORRESPONDING is specified but not the BY clause, the set operation is performed on the columns that are common to both tables. If ALL is specified, the result can include duplicate rows. Some dialects of SQL do not support INTERSECT and EXCEPT; others use MINUS in place of EXCEPT.

Example 5.32 Use of UNION

Construct a list of all cities where there is either a branch office or a property.

(SELECT city
 FROM Branch
 WHERE city IS NOT NULL)
UNION
(SELECT city
 FROM PropertyForRent
 WHERE city IS NOT NULL);

or

(SELECT *
 FROM Branch
 WHERE city IS NOT NULL)
UNION CORRESPONDING BY city
(SELECT *
 FROM PropertyForRent
 WHERE city IS NOT NULL);

This query is executed by producing a result table from the first query and a result table from the second query, and then merging both tables into a single result table consisting of all the rows from both result tables with the duplicate rows removed. The final result table is shown in Table 5.32.

Table 5.32 Result table for Example 5.32.

city
London
Glasgow
Aberdeen
Bristol

Example 5.33 Use of INTERSECT

Construct a list of all cities where there is both a branch office and a property.

(SELECT city
 FROM Branch)
INTERSECT
(SELECT city
 FROM PropertyForRent);

or

(SELECT *
 FROM Branch)
INTERSECT CORRESPONDING BY city
(SELECT *
 FROM PropertyForRent);

This query is executed by producing a result table from the first query and a result table from the second query, and then creating a single result table consisting of those rows that are common to both result tables. The final result table is shown in Table 5.33.

Table 5.33 Result table for Example 5.33.

city
Aberdeen
Glasgow
London

We could rewrite this query without the INTERSECT operator, for example:

SELECT DISTINCT b.city
FROM Branch b, PropertyForRent p
WHERE b.city = p.city;

or

SELECT DISTINCT city
FROM Branch b
WHERE EXISTS (SELECT *
              FROM PropertyForRent p
              WHERE b.city = p.city);

The ability to write a query in several equivalent forms illustrates one of the disadvantages of the SQL language.

Example 5.34 Use of EXCEPT

Construct a list of all cities where there is a branch office but no properties.

(SELECT city
 FROM Branch)
EXCEPT
(SELECT city
 FROM PropertyForRent);

or

(SELECT *
 FROM Branch)
EXCEPT CORRESPONDING BY city
(SELECT *
 FROM PropertyForRent);

This query is executed by producing a result table from the first query and a result table from the second query, and then creating a single result table consisting of those rows that appear in the first result table but not in the second one. The final result table is shown in Table 5.34.
We could rewrite this query without the EXCEPT operator, for example: SELECT DISTINCT city FROM Branch WHERE city NOT IN (SELECT city FROM PropertyForRent); or SELECT DISTINCT city FROM Branch b WHERE NOT EXISTS (SELECT * FROM PropertyForRent p WHERE b.city = p.city); Database Updates SQL is a complete data manipulation language that can be used for modifying the data in the database as well as querying the database. The commands for modifying the database are not as complex as the SELECT statement. In this section, we describe the three SQL statements that are available to modify the contents of the tables in the database: n n n INSERT – adds new rows of data to a table; UPDATE – modifies existing data in a table; DELETE – removes rows of data from a table. Table 5.34 Result table for Example 5.34. city Bristol 5.3.10 150 | Chapter 5 z SQL: Data Manipulation Adding data to the database (INSERT) There are two forms of the INSERT statement. The first allows a single row to be inserted into a named table and has the following format: INSERT INTO TableName [(columnList)] VALUES (dataValueList) TableName may be either a base table or an updatable view (see Section 6.4), and columnList represents a list of one or more column names separated by commas. The columnList is optional; if omitted, SQL assumes a list of all columns in their original CREATE TABLE order. If specified, then any columns that are omitted from the list must have been declared as NULL columns when the table was created, unless the DEFAULT option was used when creating the column (see Section 6.3.2). The dataValueList must match the columnList as follows: n n n the number of items in each list must be the same; there must be a direct correspondence in the position of items in the two lists, so that the first item in dataValueList applies to the first item in columnList, the second item in dataValueList applies to the second item in columnList, and so on; the data type of each item in dataValueList must be compatible with the data type of the corresponding column. Example 5.35 INSERT . . . VALUES Insert a new row into the Staff table supplying data for all columns. INSERT INTO Staff VALUES (‘SG16’, ‘Alan’, ‘Brown’, ‘Assistant’, ‘M’, DATE ‘1957-05-25’, 8300, ‘B003’); As we are inserting data into each column in the order the table was created, there is no need to specify a column list. Note that character literals such as ‘Alan’ must be enclosed in single quotes. Example 5.36 INSERT using defaults Insert a new row into the Staff table supplying data for all mandatory columns: staffNo, fName, lName, position, salary, and branchNo. INSERT INTO Staff (staffNo, fName, lName, position, salary, branchNo) VALUES (‘SG44’, ‘Anne’, ‘Jones’, ‘Assistant’, 8100, ‘B003’); 5.3 Data Manipulation As we are inserting data only into certain columns, we must specify the names of the columns that we are inserting data into. The order for the column names is not significant, but it is more normal to specify them in the order they appear in the table. We could also express the INSERT statement as: INSERT INTO Staff VALUES (‘SG44’, ‘Anne’, ‘Jones’, ‘Assistant’, NULL, NULL, 8100, ‘B003’); In this case, we have explicitly specified that the columns sex and DOB should be set to NULL. The second form of the INSERT statement allows multiple rows to be copied from one or more tables to another, and has the following format: INSERT INTO TableName [(columnList)] SELECT . . . TableName and columnList are defined as before when inserting a single row. 
The SELECT clause can be any valid SELECT statement. The rows inserted into the named table are identical to the result table produced by the subselect. The same restrictions that apply to the first form of the INSERT statement also apply here. Example 5.37 INSERT . . . SELECT Assume that there is a table StaffPropCount that contains the names of staff and the number of properties they manage: StaffPropCount(staffNo, fName, lName, propCount) Populate the StaffPropCount table using details from the Staff and PropertyForRent tables. INSERT INTO StaffPropCount (SELECT s.staffNo, fName, lName, COUNT(*) FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo GROUP BY s.staffNo, fName, lName) UNION (SELECT staffNo, fName, lName, 0 FROM Staff s WHERE NOT EXISTS (SELECT * FROM PropertyForRent p WHERE p.staffNo = s.staffNo)); This example is complex because we want to count the number of properties that staff manage. If we omit the second part of the UNION, then we get only a list of those staff who currently manage at least one property; in other words, we exclude those staff who | 151 152 | Chapter 5 z SQL: Data Manipulation currently do not manage any properties. Therefore, to include the staff who do not manage any properties, we need to use the UNION statement and include a second SELECT statement to add in such staff, using a 0 value for the count attribute. The StaffPropCount table will now be as shown in Table 5.35. Note that some dialects of SQL may not allow the use of the UNION operator within a subselect for an INSERT. Table 5.35 Result table for Example 5.37. staffNo fName lName propCount SG14 SL21 SG37 SA9 SG5 SL41 David John Ann Mary Susan Julie Ford White Beech Howe Brand Lee 1 0 2 1 0 1 Modifying data in the database (UPDATE) The UPDATE statement allows the contents of existing rows in a named table to be changed. The format of the command is: UPDATE TableName SET columnName1 = dataValue1 [, columnName2 = dataValue2 . . . ] [WHERE searchCondition] TableName can be the name of a base table or an updatable view (see Section 6.4). The SET clause specifies the names of one or more columns that are to be updated. The WHERE clause is optional; if omitted, the named columns are updated for all rows in the table. If a WHERE clause is specified, only those rows that satisfy the searchCondition are updated. The new dataValue(s) must be compatible with the data type(s) for the corresponding column(s). Example 5.38 UPDATE all rows Give all staff a 3% pay increase. UPDATE Staff SET salary = salary*1.03; As the update applies to all rows in the Staff table, the WHERE clause is omitted. 5.3 Data Manipulation Example 5.39 UPDATE specific rows Give all Managers a 5% pay increase. UPDATE Staff SET salary = salary*1.05 WHERE position = ‘Manager’; The WHERE clause finds the rows that contain data for Managers and the update salary = is applied only to these particular rows. salary*1.05 Example 5.40 UPDATE multiple columns Promote David Ford (staffNo = ‘SG14’) to Manager and change his salary to £18,000. UPDATE Staff SET position = ‘Manager’, salary = 18000 WHERE staffNo = ‘SG14’; Deleting data from the database (DELETE) The DELETE statement allows rows to be deleted from a named table. The format of the command is: DELETE FROM TableName [WHERE searchCondition] As with the INSERT and UPDATE statements, TableName can be the name of a base table or an updatable view (see Section 6.4). The searchCondition is optional; if omitted, all rows are deleted from the table. 
This does not delete the table itself – to delete the table contents and the table definition, the DROP TABLE statement must be used instead (see Section 6.3.3). If a searchCondition is specified, only those rows that satisfy the condition are deleted. Example 5.41 DELETE specific rows Delete all viewings that relate to property PG4. DELETE FROM Viewing WHERE propertyNo = ‘PG4’; The WHERE clause finds the rows for property PG4 and the delete operation is applied only to these particular rows. | 153 154 | Chapter 5 z SQL: Data Manipulation Example 5.42 DELETE all rows Delete all rows from the Viewing table. DELETE FROM Viewing; No WHERE clause has been specified, so the delete operation applies to all rows in the table. This removes all rows from the table leaving only the table definition, so that we are still able to insert data into the table at a later stage. Chapter Summary n n n n n n n n SQL is a non-procedural language, consisting of standard English words such as SELECT, INSERT, DELETE, that can be used by professionals and non-professionals alike. It is both the formal and de facto standard language for defining and manipulating relational databases. The SELECT statement is the most important statement in the language and is used to express a query. It combines the three fundamental relational algebra operations of Selection, Projection, and Join. Every SELECT statement produces a query result table consisting of one or more columns and zero or more rows. The SELECT clause identifies the columns and/or calculated data to appear in the result table. All column names that appear in the SELECT clause must have their corresponding tables or views listed in the FROM clause. The WHERE clause selects rows to be included in the result table by applying a search condition to the rows of the named table(s). The ORDER BY clause allows the result table to be sorted on the values in one or more columns. Each column can be sorted in ascending or descending order. If specified, the ORDER BY clause must be the last clause in the SELECT statement. SQL supports five aggregate functions (COUNT, SUM, AVG, MIN, and MAX) that take an entire column as an argument and compute a single value as the result. It is illegal to mix aggregate functions with column names in a SELECT clause, unless the GROUP BY clause is used. The GROUP BY clause allows summary information to be included in the result table. Rows that have the same value for one or more columns can be grouped together and treated as a unit for using the aggregate functions. In this case the aggregate functions take each group as an argument and compute a single value for each group as the result. The HAVING clause acts as a WHERE clause for groups, restricting the groups that appear in the final result table. However, unlike the WHERE clause, the HAVING clause can include aggregate functions. A subselect is a complete SELECT statement embedded in another query. A subselect may appear within the WHERE or HAVING clauses of an outer SELECT statement, where it is called a subquery or nested query. Conceptually, a subquery produces a temporary table whose contents can be accessed by the outer query. A subquery can be embedded in another subquery. There are three types of subquery: scalar, row, and table. A scalar subquery returns a single column and a single row; that is, a single value. In principle, a scalar subquery can be used whenever a single value is needed. A row subquery returns multiple columns, but again only a single row. 
A row subquery can be used whenever a row value constructor is needed, typically in predicates. A table subquery returns one or more columns and multiple rows. A table subquery can be used whenever a table is needed, for example, as an operand for the IN predicate. Exercises n n | 155 If the columns of the result table come from more than one table, a join must be used, by specifying more than one table in the FROM clause and typically including a WHERE clause to specify the join column(s). The ISO standard allows Outer joins to be defined. It also allows the set operations of Union, Intersection, and Difference to be used with the UNION, INTERSECT, and EXCEPT commands. As well as SELECT, the SQL DML includes the INSERT statement to insert a single row of data into a named table or to insert an arbitrary number of rows from one or more other tables using a subselect; the UPDATE statement to update one or more values in a specified column or columns of a named table; the DELETE statement to delete one or more rows from a named table. Review Questions 5.1 What are the two major components of SQL and what function do they serve? 5.2 What are the advantages and disadvantages of SQL? 5.3 Explain the function of each of the clauses in the SELECT statement. What restrictions are imposed on these clauses? 5.4 What restrictions apply to the use of the aggregate functions within the SELECT statement? How do nulls affect the aggregate functions? 5.5 Explain how the GROUP BY clause works. What is the difference between the WHERE and HAVING clauses? 5.6 What is the difference between a subquery and a join? Under what circumstances would you not be able to use a subquery? Exercises For Exercises 5.7–5.28, use the Hotel schema defined at the start of the Exercises at the end of Chapter 3. Simple queries 5.7 List full details of all hotels. 5.8 List full details of all hotels in London. 5.9 List the names and addresses of all guests living in London, alphabetically ordered by name. 5.10 List all double or family rooms with a price below £40.00 per night, in ascending order of price. 5.11 List the bookings for which no dateTo has been specified. Aggregate functions 5.12 How many hotels are there? 5.13 What is the average price of a room? 5.14 What is the total revenue per night from all double rooms? 5.15 How many different guests have made bookings for August? 156 | Chapter 5 z SQL: Data Manipulation Subqueries and joins 5.16 List the price and type of all rooms at the Grosvenor Hotel. 5.17 List all guests currently staying at the Grosvenor Hotel. 5.18 List the details of all rooms at the Grosvenor Hotel, including the name of the guest staying in the room, if the room is occupied. 5.19 What is the total income from bookings for the Grosvenor Hotel today? 5.20 List the rooms that are currently unoccupied at the Grosvenor Hotel. 5.21 What is the lost income from unoccupied rooms at the Grosvenor Hotel? Grouping 5.22 List the number of rooms in each hotel. 5.23 List the number of rooms in each hotel in London. 5.24 What is the average number of bookings for each hotel in August? 5.25 What is the most commonly booked room type for each hotel in London? 5.26 What is the lost income from unoccupied rooms at each hotel today? Populating tables 5.27 Insert rows into each of these tables. 5.28 Update the price of all rooms by 5%. General 5.29 Investigate the SQL dialect on any DBMS that you are currently using. Determine the system’s compliance with the DML statements of the ISO standard. 
Investigate the functionality of any extensions the DBMS supports. Are there any functions not supported? 5.30 Show that a query using the HAVING clause has an equivalent formulation without a HAVING clause. 5.31 Show that SQL is relationally complete. Chapter 6 SQL: Data Definition Chapter Objectives In this chapter you will learn: n The data types supported by the SQL standard. n The purpose of the integrity enhancement feature of SQL. n How to define integrity constraints using SQL including: – required data; – domain constraints; – entity integrity; – referential integrity; – general constraints. n How to use the integrity enhancement feature in the CREATE and ALTER TABLE statements. n The purpose of views. n How to create and delete views using SQL. n How the DBMS performs operations on views. n Under what conditions views are updatable. n The advantages and disadvantages of views. n How the ISO transaction model works. n How to use the GRANT and REVOKE statements as a level of security. In the previous chapter we discussed in some detail the Structured Query Language (SQL) and, in particular, the SQL data manipulation facilities. In this chapter we continue our presentation of SQL and examine the main SQL data definition facilities. 158 | Chapter 6 z SQL: Data Definition Structure of this Chapter In Section 6.1 we examine the ISO SQL data types. The 1989 ISO standard introduced an Integrity Enhancement Feature (IEF), which provides facilities for defining referential integrity and other constraints (ISO, 1989). Prior to this standard, it was the responsibility of each application program to ensure compliance with these constraints. The provision of an IEF greatly enhances the functionality of SQL and allows constraint checking to be centralized and standardized. We consider the Integrity Enhancement Feature in Section 6.2 and the main SQL data definition facilities in Section 6.3. In Section 6.4 we show how views can be created using SQL, and how the DBMS converts operations on views into equivalent operations on the base tables. We also discuss the restrictions that the ISO SQL standard places on views in order for them to be updatable. In Section 6.5, we briefly describe the ISO SQL transaction model. Views provide a certain degree of database security. SQL also provides a separate access control subsystem, containing facilities to allow users to share database objects or, alternatively, restrict access to database objects. We discuss the access control subsystem in Section 6.6. In Section 28.4 we examine in some detail the features that have recently been added to the SQL specification to support object-oriented data management, often covering SQL:1999 and SQL:2003. In Appendix E we discuss how SQL can be embedded in highlevel programming languages to access constructs that until recently were not available in SQL. As in the previous chapter, we present the features of SQL using examples drawn from the DreamHome case study. We use the same notation for specifying the format of SQL statements as defined in Section 5.2. 6.1 The ISO SQL Data Types In this section we introduce the data types defined in the SQL standard. We start by defining what constitutes a valid identifier in SQL. 6.1.1 SQL Identifiers SQL identifiers are used to identify objects in the database, such as table names, view names, and columns. The characters that can be used in a user-defined SQL identifier must appear in a character set. 
The ISO standard provides a default character set, which consists of the upper-case letters A . . . Z, the lower-case letters a . . . z, the digits 0 . . . 9, and the underscore (_) character. It is also possible to specify an alternative character set. The following restrictions are imposed on an identifier:

– an identifier can be no longer than 128 characters (most dialects have a much lower limit than this);
– an identifier must start with a letter;
– an identifier cannot contain spaces.

Table 6.1  ISO SQL data types.

Data type              Declarations
boolean                BOOLEAN
character              CHAR, VARCHAR
bit†                   BIT, BIT VARYING
exact numeric          NUMERIC, DECIMAL, INTEGER, SMALLINT
approximate numeric    FLOAT, REAL, DOUBLE PRECISION
datetime               DATE, TIME, TIMESTAMP
interval               INTERVAL
large objects          CHARACTER LARGE OBJECT, BINARY LARGE OBJECT

† BIT and BIT VARYING have been removed from the SQL:2003 standard.

6.1.2 SQL Scalar Data Types

Table 6.1 shows the SQL scalar data types defined in the ISO standard. Sometimes, for manipulation and conversion purposes, the data types character and bit are collectively referred to as string data types, and exact numeric and approximate numeric are referred to as numeric data types, as they share similar properties. The SQL:2003 standard also defines both character large objects and binary large objects, although we defer discussion of these data types until Section 28.4.

Boolean data

Boolean data consists of the distinct truth values TRUE and FALSE. Unless prohibited by a NOT NULL constraint, boolean data also supports the UNKNOWN truth value as the NULL value. All boolean data type values and SQL truth values are mutually comparable and assignable. The value TRUE is greater than the value FALSE, and any comparison involving the NULL value or an UNKNOWN truth value returns an UNKNOWN result.

Character data

Character data consists of a sequence of characters from an implementation-defined character set, that is, it is defined by the vendor of the particular SQL dialect. Thus, the exact characters that can appear as data values in a character type column will vary. ASCII and EBCDIC are two sets in common use today. The format for specifying a character data type is:

CHARACTER [VARYING] [length]

CHARACTER can be abbreviated to CHAR and CHARACTER VARYING to VARCHAR. When a character string column is defined, a length can be specified to indicate the maximum number of characters that the column can hold (default length is 1). A character string may be defined as having a fixed or varying length. If the string is defined to be a fixed length and we enter a string with fewer characters than this length, the string is padded with blanks on the right to make up the required size. If the string is defined to be of a varying length and we enter a string with fewer characters than this length, only those characters entered are stored, thereby using less space. For example, the branch number column branchNo of the Branch table, which has a fixed length of four characters, is declared as:

branchNo CHAR(4)

The column address of the PrivateOwner table, which has a variable number of characters up to a maximum of 30, is declared as:

address VARCHAR(30)

Bit data

The bit data type is used to define bit strings, that is, a sequence of binary digits (bits), each having either the value 0 or 1.
The format for specifying the bit data type is similar to that of the character data type: BIT [VARYING] [length] For example, to hold the fixed length binary string ‘0011’, we declare a column as: bitString bitString, BIT(4) 6.1.3 Exact Numeric Data The exact numeric data type is used to define numbers with an exact representation. The number consists of digits, an optional decimal point, and an optional sign. An exact numeric data type consists of a precision and a scale. The precision gives the total number of significant decimal digits; that is, the total number of digits, including decimal places but excluding the point itself. The scale gives the total number of decimal places. For example, the exact numeric value −12.345 has precision 5 and scale 3. A special case of exact numeric occurs with integers. There are several ways of specifying an exact numeric data type: NUMERIC [ precision [, scale] ] DECIMAL [ precision [, scale] ] INTEGER SMALLINT INTEGER can be abbreviated to INT and DECIMAL to DEC 6.1 The ISO SQL Data Types NUMERIC and DECIMAL store numbers in decimal notation. The default scale is always 0; the default precision is implementation-defined. INTEGER is used for large positive or negative whole numbers. SMALLINT is used for small positive or negative whole numbers. By specifying this data type, less storage space can be reserved for the data. For example, the maximum absolute value that can be stored with SMALLINT might be 32 767. The column rooms of the PropertyForRent table, which represents the number of rooms in a property, is obviously a small integer and can be declared as: rooms SMALLINT The column salary of the Staff table can be declared as: salary DECIMAL(7,2) which can handle a value up to 99,999.99. Approximate numeric data The approximate numeric data type is used for defining numbers that do not have an exact representation, such as real numbers. Approximate numeric, or floating point, is similar to scientific notation in which a number is written as a mantissa times some power of ten (the exponent). For example, 10E3, +5.2E6, −0.2E−4. There are several ways of specifying an approximate numeric data type: FLOAT [precision] REAL DOUBLE PRECISION The precision controls the precision of the mantissa. The precision of REAL and DOUBLE PRECISION is implementation-defined. Datetime data The datetime data type is used to define points in time to a certain degree of accuracy. Examples are dates, times, and times of day. The ISO standard subdivides the datetime data type into YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, TIMEZONE_HOUR, and TIMEZONE_MINUTE. The latter two fields specify the hour and minute part of the time zone offset from Universal Coordinated Time (which used to be called Greenwich Mean Time). Three types of datetime data type are supported: DATE TIME [timePrecision] [WITH TIME ZONE] TIMESTAMP [timePrecision] [WITH TIME ZONE] DATE is used to store calendar dates using the YEAR, MONTH, and DAY fields. TIME is used to store time using the HOUR, MINUTE, and SECOND fields. TIMESTAMP is | 161 162 | Chapter 6 z SQL: Data Definition used to store date and times. The timePrecision is the number of decimal places of accuracy to which the SECOND field is kept. If not specified, TIME defaults to a precision of 0 (that is, whole seconds), and TIMESTAMP defaults to 6 (that is, microseconds). The WITH TIME ZONE keyword controls the presence of the TIMEZONE_HOUR and TIMEZONE_MINUTE fields. 
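To make the time-zone options concrete, the following column declarations are a sketch only: the column names checkIn and lastUpdated are ours and are not part of the DreamHome schema:

checkIn      TIME(0) WITH TIME ZONE,
lastUpdated  TIMESTAMP WITH TIME ZONE

The first holds a time of day to whole seconds together with its offset from Universal Coordinated Time; the second holds a full date and time to the default precision of six decimal places of seconds, again with the time zone offset.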
For example, the column date of the Viewing table, which represents the date (year, month, day) that a client viewed a property, is declared as:

viewDate DATE

Interval data

The interval data type is used to represent periods of time. Every interval data type consists of a contiguous subset of the fields: YEAR, MONTH, DAY, HOUR, MINUTE, SECOND. There are two classes of interval data type: year–month intervals and day–time intervals. The year–month class may contain only the YEAR and/or the MONTH fields; the day–time class may contain only a contiguous selection from DAY, HOUR, MINUTE, SECOND. The format for specifying the interval data type is:

INTERVAL {{startField TO endField} | singleDatetimeField}
startField = YEAR | MONTH | DAY | HOUR | MINUTE [(intervalLeadingFieldPrecision)]
endField = YEAR | MONTH | DAY | HOUR | MINUTE | SECOND [(fractionalSecondsPrecision)]
singleDatetimeField = startField | SECOND [(intervalLeadingFieldPrecision [, fractionalSecondsPrecision])]

In all cases, startField has a leading field precision that defaults to 2. For example:

INTERVAL YEAR(2) TO MONTH

represents an interval of time with a value between 0 years 0 months and 99 years 11 months; and:

INTERVAL HOUR TO SECOND(4)

represents an interval of time with a value between 0 hours 0 minutes 0 seconds and 99 hours 59 minutes 59.9999 seconds (the fractional precision of second is 4).

Scalar operators

SQL provides a number of built-in scalar operators and functions that can be used to construct a scalar expression: that is, an expression that evaluates to a scalar value. Apart from the obvious arithmetic operators (+, −, *, /), the operators shown in Table 6.2 are available.

Table 6.2  ISO SQL scalar operators.

Operator – Meaning
BIT_LENGTH – Returns the length of a string in bits. For example, BIT_LENGTH(X‘FFFF’) returns 16.
OCTET_LENGTH – Returns the length of a string in octets (bit length divided by 8). For example, OCTET_LENGTH(X‘FFFF’) returns 2.
CHAR_LENGTH – Returns the length of a string in characters (or octets, if the string is a bit string). For example, CHAR_LENGTH(‘Beech’) returns 5.
CAST – Converts a value expression of one data type into a value in another data type. For example, CAST(5.2E6 AS INTEGER).
|| – Concatenates two character strings or bit strings. For example, fName || lName.
CURRENT_USER or USER – Returns a character string representing the current authorization identifier (informally, the current user name).
SESSION_USER – Returns a character string representing the SQL-session authorization identifier.
SYSTEM_USER – Returns a character string representing the identifier of the user who invoked the current module.
LOWER – Converts upper-case letters to lower-case. For example, LOWER(SELECT fName FROM Staff WHERE staffNo = ‘SL21’) returns ‘john’.
UPPER – Converts lower-case letters to upper-case. For example, UPPER(SELECT fName FROM Staff WHERE staffNo = ‘SL21’) returns ‘JOHN’.
TRIM – Removes leading (LEADING), trailing (TRAILING), or both leading and trailing (BOTH) characters from a string. For example, TRIM(BOTH ‘*’ FROM ‘*** Hello World ***’) returns ‘Hello World’.
POSITION – Returns the position of one string within another string. For example, POSITION(‘ee’ IN ‘Beech’) returns 2.
SUBSTRING – Returns a substring selected from within a string. For example, SUBSTRING(‘Beech’ FROM 1 FOR 3) returns the string ‘Bee’.
CASE – Returns one of a specified set of values, based on some condition. For example, CASE type WHEN ‘House’ THEN 1 WHEN ‘Flat’ THEN 2 ELSE 0 END.
CURRENT_DATE – Returns the current date in the time zone that is local to the user.
CURRENT_TIME – Returns the current time in the time zone that is the current default for the session. For example, CURRENT_TIME(6) gives time to microseconds precision.
CURRENT_TIMESTAMP – Returns the current date and time in the time zone that is the current default for the session. For example, CURRENT_TIMESTAMP(0) gives time to seconds precision.
EXTRACT – Returns the value of a specified field from a datetime or interval value. For example, EXTRACT(YEAR FROM Registration.dateJoined).
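To give a feel for how several of these operators combine in a single scalar expression, the following query is a sketch against the Staff table of the DreamHome case study (the column aliases are ours, not from the text):

SELECT staffNo,
       UPPER(fName || ' ' || lName) AS fullName,
       CHAR_LENGTH(lName) AS surnameLength,
       EXTRACT(YEAR FROM DOB) AS birthYear
FROM Staff;

Each expression in the SELECT list evaluates to a single scalar value for every row of Staff, using only operators listed in Table 6.2.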
6.2 Integrity Enhancement Feature

In this section, we examine the facilities provided by the SQL standard for integrity control. Integrity control consists of constraints that we wish to impose in order to protect the database from becoming inconsistent. We consider five types of integrity constraint (see Section 3.3):

– required data;
– domain constraints;
– entity integrity;
– referential integrity;
– general constraints.

These constraints can be defined in the CREATE and ALTER TABLE statements, as we will see shortly.

6.2.1 Required Data

Some columns must contain a valid value; they are not allowed to contain nulls. A null is distinct from blank or zero, and is used to represent data that is either not available, missing, or not applicable (see Section 3.3.1). For example, every member of staff must have an associated job position (for example, Manager, Assistant, and so on). The ISO standard provides the NOT NULL column specifier in the CREATE and ALTER TABLE statements to provide this type of constraint. When NOT NULL is specified, the system rejects any attempt to insert a null in the column. If NULL is specified, the system accepts nulls. The ISO default is NULL. For example, to specify that the column position of the Staff table cannot be null, we define the column as:

position VARCHAR(10) NOT NULL

6.2.2 Domain Constraints

Every column has a domain, in other words a set of legal values (see Section 3.2.1). For example, the sex of a member of staff is either ‘M’ or ‘F’, so the domain of the column sex of the Staff table is a single character string consisting of either ‘M’ or ‘F’. The ISO standard provides two mechanisms for specifying domains in the CREATE and ALTER TABLE statements. The first is the CHECK clause, which allows a constraint to be defined on a column or the entire table. The format of the CHECK clause is:

CHECK (searchCondition)

In a column constraint, the CHECK clause can reference only the column being defined. Thus, to ensure that the column sex can only be specified as ‘M’ or ‘F’, we could define the column as:

sex CHAR NOT NULL CHECK (sex IN (‘M’, ‘F’))

However, the ISO standard allows domains to be defined more explicitly using the CREATE DOMAIN statement:

CREATE DOMAIN DomainName [AS] dataType
  [DEFAULT defaultOption]
  [CHECK (searchCondition)]

A domain is given a name, DomainName, a data type (as described in Section 6.1.2), an optional default value, and an optional CHECK constraint. This is not the complete definition, but it is sufficient to demonstrate the basic concept. Thus, for the above example, we could define a domain for sex as:

CREATE DOMAIN SexType AS CHAR
  DEFAULT ‘M’
  CHECK (VALUE IN (‘M’, ‘F’));

This creates a domain SexType that consists of a single character with either the value ‘M’ or ‘F’.
When defining the column sex, we can now use the domain name SexType in place of the data type CHAR:

sex SexType NOT NULL

The searchCondition can involve a table lookup. For example, we can create a domain BranchNumber to ensure that the values entered correspond to an existing branch number in the Branch table, using the statement:

CREATE DOMAIN BranchNumber AS CHAR(4)
  CHECK (VALUE IN (SELECT branchNo FROM Branch));

The preferred method of defining domain constraints is using the CREATE DOMAIN statement. Domains can be removed from the database using the DROP DOMAIN statement:

DROP DOMAIN DomainName [RESTRICT | CASCADE]

The drop behavior, RESTRICT or CASCADE, specifies the action to be taken if the domain is currently being used. If RESTRICT is specified and the domain is used in an existing table, view, or assertion definition (see Section 6.2.5), the drop will fail. In the case of CASCADE, any table column that is based on the domain is automatically changed to use the domain’s underlying data type, and any constraint or default clause for the domain is replaced by a column constraint or column default clause, if appropriate.

6.2.3 Entity Integrity

The primary key of a table must contain a unique, non-null value for each row. For example, each row of the PropertyForRent table has a unique value for the property number propertyNo, which uniquely identifies the property represented by that row. The ISO standard supports entity integrity with the PRIMARY KEY clause in the CREATE and ALTER TABLE statements. For example, to define the primary key of the PropertyForRent table, we include the clause:

PRIMARY KEY(propertyNo)

To define a composite primary key, we specify multiple column names in the PRIMARY KEY clause, separating each by a comma. For example, to define the primary key of the Viewing table, which consists of the columns clientNo and propertyNo, we include the clause:

PRIMARY KEY(clientNo, propertyNo)

The PRIMARY KEY clause can be specified only once per table. However, it is still possible to ensure uniqueness for any alternate keys in the table using the keyword UNIQUE. Every column that appears in a UNIQUE clause must also be declared as NOT NULL. There may be as many UNIQUE clauses per table as required. SQL rejects any INSERT or UPDATE operation that attempts to create a duplicate value within each candidate key (that is, primary key or alternate key). For example, with the Viewing table we could also have written:

clientNo    VARCHAR(5) NOT NULL,
propertyNo  VARCHAR(5) NOT NULL,
UNIQUE (clientNo, propertyNo)

6.2.4 Referential Integrity

A foreign key is a column, or set of columns, that links each row in the child table containing the foreign key to the row of the parent table containing the matching candidate key value. Referential integrity means that, if the foreign key contains a value, that value must refer to an existing, valid row in the parent table (see Section 3.3.3). For example, the branch number column branchNo in the PropertyForRent table links the property to that row in the Branch table where the property is assigned. If the branch number is not null, it must contain a valid value from the column branchNo of the Branch table, or the property is assigned to an invalid branch office. The ISO standard supports the definition of foreign keys with the FOREIGN KEY clause in the CREATE and ALTER TABLE statements.
For example, to define the foreign key branchNo of the PropertyForRent table, we include the clause: FOREIGN KEY(branchNo) REFERENCES Branch SQL rejects any INSERT or UPDATE operation that attempts to create a foreign key value in a child table without a matching candidate key value in the parent table. The action SQL takes for any UPDATE or DELETE operation that attempts to update or delete a candidate key value in the parent table that has some matching rows in the child table is 6.2 Integrity Enhancement Feature dependent on the referential action specified using the ON UPDATE and ON DELETE subclauses of the FOREIGN KEY clause. When the user attempts to delete a row from a parent table, and there are one or more matching rows in the child table, SQL supports four options regarding the action to be taken: n n n n CASCADE Delete the row from the parent table and automatically delete the matching rows in the child table. Since these deleted rows may themselves have a candidate key that is used as a foreign key in another table, the foreign key rules for these tables are triggered, and so on in a cascading manner. SET NULL Delete the row from the parent table and set the foreign key value(s) in the child table to NULL. This is valid only if the foreign key columns do not have the NOT NULL qualifier specified. SET DEFAULT Delete the row from the parent table and set each component of the foreign key in the child table to the specified default value. This is valid only if the foreign key columns have a DEFAULT value specified (see Section 6.3.2). NO ACTION Reject the delete operation from the parent table. This is the default setting if the ON DELETE rule is omitted. SQL supports the same options when the candidate key in the parent table is updated. With CASCADE, the foreign key value(s) in the child table are set to the new value(s) of the candidate key in the parent table. In the same way, the updates cascade if the updated column(s) in the child table reference foreign keys in another table. For example, in the PropertyForRent table, the staff number staffNo is a foreign key referencing the Staff table. We can specify a deletion rule such that, if a staff record is deleted from the Staff table, the values of the corresponding staffNo column in the PropertyForRent table are set to NULL: FOREIGN KEY (staffNo) REFERENCES Staff ON DELETE SET NULL Similarly, the owner number ownerNo in the PropertyForRent table is a foreign key referencing the PrivateOwner table. We can specify an update rule such that, if an owner number is updated in the PrivateOwner table, the corresponding column(s) in the PropertyForRent table are set to the new value: FOREIGN KEY (ownerNo) REFERENCES PrivateOwner ON UPDATE CASCADE General Constraints Updates to tables may be constrained by enterprise rules governing the real-world transactions that are represented by the updates. For example, DreamHome may have a rule that prevents a member of staff from managing more than 100 properties at the same time. The ISO standard allows general constraints to be specified using the CHECK and UNIQUE clauses of the CREATE and ALTER TABLE statements and the CREATE ASSERTION statement. We have already discussed the CHECK and UNIQUE clauses earlier in this section. The CREATE ASSERTION statement is an integrity constraint that is not directly linked with a table definition. 
The format of the statement is: 6.2.5 | 167 168 | Chapter 6 z SQL: Data Definition CREATE ASSERTION AssertionName CHECK (searchCondition) This statement is very similar to the CHECK clause discussed above. However, when a general constraint involves more than one table, it may be preferable to use an ASSERTION rather than duplicate the check in each table or place the constraint in an arbitrary table. For example, to define the general constraint that prevents a member of staff from managing more than 100 properties at the same time, we could write: CREATE ASSERTION StaffNotHandlingTooMuch CHECK (NOT EXISTS (SELECT staffNo FROM PropertyForRent GROUP BY staffNo HAVING COUNT(*) > 100)) We show how to use these integrity features in the following section when we examine the CREATE and ALTER TABLE statements. 6.3 Data Definition The SQL Data Definition Language (DDL) allows database objects such as schemas, domains, tables, views, and indexes to be created and destroyed. In this section, we briefly examine how to create and destroy schemas, tables, and indexes. We discuss how to create and destroy views in the next section. The ISO standard also allows the creation of character sets, collations, and translations. However, we will not consider these database objects in this book. The interested reader is referred to Cannan and Otten (1993). The main SQL data definition language statements are: CREATE SCHEMA CREATE DOMAIN CREATE TABLE CREATE VIEW ALTER DOMAIN ALTER TABLE DROP SCHEMA DROP DOMAIN DROP TABLE DROP VIEW These statements are used to create, change, and destroy the structures that make up the conceptual schema. Although not covered by the SQL standard, the following two statements are provided by many DBMSs: CREATE INDEX DROP INDEX Additional commands are available to the DBA to specify the physical details of data storage; however, we do not discuss them here as these commands are system specific. 6.3.1 Creating a Database The process of creating a database differs significantly from product to product. In multi-user systems, the authority to create a database is usually reserved for the DBA. 6.3 Data Definition In a single-user system, a default database may be established when the system is installed and configured and others can be created by the user as and when required. The ISO standard does not specify how databases are created, and each dialect generally has a different approach. According to the ISO standard, relations and other database objects exist in an environment. Among other things, each environment consists of one or more catalogs, and each catalog consists of a set of schemas. A schema is a named collection of database objects that are in some way related to one another (all the objects in the database are described in one schema or another). The objects in a schema can be tables, views, domains, assertions, collations, translations, and character sets. All the objects in a schema have the same owner and share a number of defaults. The standard leaves the mechanism for creating and destroying catalogs as implementation-defined, but provides mechanisms for creating and destroying schemas. 
The schema definition statement has the following (simplified) form: CREATE SCHEMA [Name | AUTHORIZATION CreatorIdentifier] Therefore, if the creator of a schema SqlTests is Smith, the SQL statement is: CREATE SCHEMA SqlTests AUTHORIZATION Smith; The ISO standard also indicates that it should be possible to specify within this statement the range of facilities available to the users of the schema, but the details of how these privileges are specified are implementation-dependent. A schema can be destroyed using the DROP SCHEMA statement, which has the following form: DROP SCHEMA Name [RESTRICT | CASCADE] If RESTRICT is specified, which is the default if neither qualifier is specified, the schema must be empty or the operation fails. If CASCADE is specified, the operation cascades to drop all objects associated with the schema in the order defined above. If any of these drop operations fail, the DROP SCHEMA fails. The total effect of a DROP SCHEMA with CASCADE can be very extensive and should be carried out only with extreme caution. The CREATE and DROP SCHEMA statements are not yet widely implemented. Creating a Table (CREATE TABLE) Having created the database structure, we may now create the table structures for the base relations to be located in the database. This is achieved using the CREATE TABLE statement, which has the following basic syntax: 6.3.2 | 169 170 | Chapter 6 z SQL: Data Definition CREATE TABLE TableName {(columName dataType [NOT NULL] [UNIQUE] [DEFAULT defaultOption] [CHECK (searchCondition)] [, . . . ]} [PRIMARY KEY (listOfColumns),] {[UNIQUE (listOfColumns)] [, . . . ]} {[FOREIGN KEY (listOfForeignKeyColumns) REFERENCES ParentTableName [(listOfCandidateKeyColumns)] [MATCH {PARTIAL | FULL} [ON UPDATE referentialAction] [ON DELETE referentialAction]] [, . . . ]} {[CHECK (searchCondition)] [, . . . ]}) As we discussed in the previous section, this version of the CREATE TABLE statement incorporates facilities for defining referential integrity and other constraints. There is significant variation in the support provided by different dialects for this version of the statement. However, when it is supported, the facilities should be used. The CREATE TABLE statement creates a table called TableName consisting of one or more columns of the specified dataType. The set of permissible data types is described in Section 6.1.2. The optional DEFAULT clause can be specified to provide a default value for a particular column. SQL uses this default value whenever an INSERT statement fails to specify a value for the column. Among other values, the defaultOption includes literals. The NOT NULL, UNIQUE, and CHECK clauses were discussed in the previous section. The remaining clauses are known as table constraints and can optionally be preceded with the clause: CONSTRAINT ConstraintName which allows the constraint to be dropped by name using the ALTER TABLE statement (see below). The PRIMARY KEY clause specifies the column or columns that form the primary key for the table. If this clause is available, it should be specified for every table created. By default, NOT NULL is assumed for each column that comprises the primary key. Only one PRIMARY KEY clause is allowed per table. SQL rejects any INSERT or UPDATE operation that attempts to create a duplicate row within the PRIMARY KEY column(s). In this way, SQL guarantees the uniqueness of the primary key. The FOREIGN KEY clause specifies a foreign key in the (child) table and the relationship it has to another (parent) table. 
This clause implements referential integrity constraints. The clause specifies the following: n n A listOfForeignKeyColumns, the column or columns from the table being created that form the foreign key. A REFERENCES subclause, giving the parent table; that is, the table holding the matching candidate key. If the listOfCandidateKeyColumns is omitted, the foreign key is assumed to match the primary key of the parent table. In this case, the parent table must have a PRIMARY KEY clause in its CREATE TABLE statement. 6.3 Data Definition n An optional update rule (ON UPDATE) for the relationship that specifies the action to be taken when a candidate key is updated in the parent table that matches a foreign key in the child table. The referentialAction can be CASCADE, SET NULL, SET DEFAULT, or NO ACTION. If the ON UPDATE clause is omitted, the default NO ACTION is assumed (see Section 6.2). n An optional delete rule (ON DELETE) for the relationship that specifies the action to be taken when a row is deleted from the parent table that has a candidate key that matches a foreign key in the child table. The referentialAction is the same as for the ON UPDATE rule. n By default, the referential constraint is satisfied if any component of the foreign key is null or there is a matching row in the parent table. The MATCH option provides additional constraints relating to nulls within the foreign key. If MATCH FULL is specified, the foreign key components must all be null or must all have values. If MATCH PARTIAL is specified, the foreign key components must all be null, or there must be at least one row in the parent table that could satisfy the constraint if the other nulls were correctly substituted. Some authors argue that referential integrity should imply MATCH FULL. There can be as many FOREIGN KEY clauses as required. The CHECK and CONSTRAINT clauses allow additional constraints to be defined. If used as a column constraint, the CHECK clause can reference only the column being defined. Constraints are in effect checked after every SQL statement has been executed, although this check can be deferred until the end of the enclosing transaction (see Section 6.5). Example 6.1 demonstrates the potential of this version of the CREATE TABLE statement. Example 6.1 CREATE TABLE Create the PropertyForRent table using the available features of the CREATE TABLE statement. 
CREATE DOMAIN OwnerNumber AS VARCHAR(5) CHECK (VALUE IN (SELECT ownerNo FROM PrivateOwner)); CREATE DOMAIN StaffNumber AS VARCHAR(5) CHECK (VALUE IN (SELECT staffNo FROM Staff)); CREATE DOMAIN BranchNumber AS CHAR(4) CHECK (VALUE IN (SELECT branchNo FROM Branch)); CREATE DOMAIN PropertyNumber AS VARCHAR(5); CREATE DOMAIN Street AS VARCHAR(25); CREATE DOMAIN City AS VARCHAR(15); CREATE DOMAIN PostCode AS VARCHAR(8); CREATE DOMAIN PropertyType AS CHAR(1) CHECK(VALUE IN (‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘M’, ‘S’)); | 171 172 | Chapter 6 z SQL: Data Definition CREATE DOMAIN PropertyRooms AS SMALLINT; CHECK(VALUE BETWEEN 1 AND 15); CREATE DOMAIN PropertyRent AS DECIMAL(6,2) CHECK(VALUE BETWEEN 0 AND 9999.99); CREATE TABLE PropertyForRent( propertyNo PropertyNumber NOT NULL, street Street NOT NULL, city City NOT NULL, postcode PostCode, type PropertyType NOT NULL DEFAULT ‘F’, rooms PropertyRooms NOT NULL DEFAULT 4, rent PropertyRent NOT NULL DEFAULT 600, ownerNo OwnerNumber NOT NULL, staffNo StaffNumber CONSTRAINT StaffNotHandlingTooMuch CHECK (NOT EXISTS (SELECT staffNo FROM PropertyForRent GROUP BY staffNo HAVING COUNT(*) > 100)), branchNo BranchNumber NOT NULL, PRIMARY KEY (propertyNo), FOREIGN KEY (staffNo) REFERENCES Staff ON DELETE SET NULL ON UPDATE CASCADE, FOREIGN KEY (ownerNo) REFERENCES PrivateOwner ON DELETE NO ACTION ON UPDATE CASCADE, FOREIGN KEY (branchNo) REFERENCES Branch ON DELETE NO ACTION ON UPDATE CASCADE); A default value of ‘F’ for ‘Flat’ has been assigned to the property type column type. A CONSTRAINT for the staff number column has been specified to ensure that a member of staff does not handle too many properties. The constraint checks that the number of properties the staff member currently handles is not greater than 100. The primary key is the property number, propertyNo. SQL automatically enforces uniqueness on this column. The staff number, staffNo, is a foreign key referencing the Staff table. A deletion rule has been specified such that, if a record is deleted from the Staff table, the corresponding values of the staffNo column in the PropertyForRent table are set to NULL. Additionally, an update rule has been specified such that, if a staff number is updated in the Staff table, the corresponding values in the staffNo column in the PropertyForRent table are updated accordingly. The owner number, ownerNo, is a foreign key referencing the PrivateOwner table. A deletion rule of NO ACTION has been specified to prevent deletions from the PrivateOwner table if there are matching ownerNo values in the PropertyForRent table. An update rule of CASCADE has been specified such that, if an owner number is updated, the corresponding values in the ownerNo column in the PropertyForRent table are set to the new value. The same rules have been specified for the branchNo column. In all FOREIGN KEY constraints, because the listOfCandidateKeyColumns has been omitted, SQL assumes that the foreign keys match the primary keys of the respective parent tables. 6.3 Data Definition Note, we have not specified NOT NULL for the staff number column staffNo because there may be periods of time when there is no member of staff allocated to manage the property (for example, when the property is first registered). However, the other foreign key columns – ownerNo (the owner number) and branchNo (the branch number) – must be specified at all times. Changing a Table Definition (ALTER TABLE) The ISO standard provides an ALTER TABLE statement for changing the structure of a table once it has been created. 
The definition of the ALTER TABLE statement in the ISO standard consists of six options to: n n n n n n add a new column to a table; drop a column from a table; add a new table constraint; drop a table constraint; set a default for a column; drop a default for a column. The basic format of the statement is: ALTER TABLE TableName [ADD [COLUMN] columnName dataType [NOT NULL] [UNIQUE] [DEFAULT defaultOption] [CHECK (searchCondition)]] [DROP [COLUMN] columnName [RESTRICT | CASCADE]] [ADD [CONSTRAINT [ConstraintName]] tableConstraintDefinition] [DROP CONSTRAINT ConstraintName [RESTRICT | CASCADE]] [ALTER [COLUMN] SET DEFAULT defaultOption] [ALTER [COLUMN] DROP DEFAULT] Here the parameters are as defined for the CREATE TABLE statement in the previous section. A tableConstraintDefinition is one of the clauses: PRIMARY KEY, UNIQUE, FOREIGN KEY, or CHECK. The ADD COLUMN clause is similar to the definition of a column in the CREATE TABLE statement. The DROP COLUMN clause specifies the name of the column to be dropped from the table definition, and has an optional qualifier that specifies whether the DROP action is to cascade or not: n n RESTRICT The DROP operation is rejected if the column is referenced by another database object (for example, by a view definition). This is the default setting. CASCADE The DROP operation proceeds and automatically drops the column from any database objects it is referenced by. This operation cascades, so that if a column is dropped from a referencing object, SQL checks whether that column is referenced by any other object and drops it from there if it is, and so on. 6.3.3 | 173 174 | Chapter 6 z SQL: Data Definition Example 6.2 ALTER TABLE (a) Change the Staff table by removing the default of ‘Assistant’ for the position column and setting the default for the sex column to female (‘F’). ALTER TABLE Staff ALTER position DROP DEFAULT; ALTER TABLE Staff ALTER sex SET DEFAULT ‘F’; (b) Change the PropertyForRent table by removing the constraint that staff are not allowed to handle more than 100 properties at a time. Change the Client table by adding a new column representing the preferred number of rooms. ALTER TABLE PropertyForRent DROP CONSTRAINT StaffNotHandlingTooMuch; ALTER TABLE Client ADD prefNoRooms PropertyRooms; The ALTER TABLE statement is not available in all dialects of SQL. In some dialects, the ALTER TABLE statement cannot be used to remove an existing column from a table. In such cases, if a column is no longer required, the column could simply be ignored but kept in the table definition. If, however, you wish to remove the column from the table you must: n n n n upload all the data from the table; remove the table definition using the DROP TABLE statement; redefine the new table using the CREATE TABLE statement; reload the data back into the new table. The upload and reload steps are typically performed with special-purpose utility programs supplied with the DBMS. However, it is possible to create a temporary table and use the INSERT . . . SELECT statement to load the data from the old table into the temporary table and then from the temporary table into the new table. 6.3.4 Removing a Table (DROP TABLE) Over time, the structure of a database will change; new tables will be created and some tables will no longer be needed. 
We can remove a redundant table from the database using the DROP TABLE statement, which has the format: DROP TABLE TableName [RESTRICT | CASCADE] 6.3 Data Definition For example, to remove the PropertyForRent table we use the command: DROP TABLE PropertyForRent; Note, however, that this command removes not only the named table, but also all the rows within it. To simply remove the rows from the table but retain the table structure, use the DELETE statement instead (see Section 5.3.10). The DROP TABLE statement allows you to specify whether the DROP action is to be cascaded or not: n n RESTRICT The DROP operation is rejected if there are any other objects that depend for their existence upon the continued existence of the table to be dropped. CASCADE The DROP operation proceeds and SQL automatically drops all dependent objects (and objects dependent on these objects). The total effect of a DROP TABLE with CASCADE can be very extensive and should be carried out only with extreme caution. One common use of DROP TABLE is to correct mistakes made when creating a table. If a table is created with an incorrect structure, DROP TABLE can be used to delete the newly created table and start again. Creating an Index (CREATE INDEX) An index is a structure that provides accelerated access to the rows of a table based on the values of one or more columns (see Appendix C for a discussion of indexes and how they may be used to improve the efficiency of data retrievals). The presence of an index can significantly improve the performance of a query. However, since indexes may be updated by the system every time the underlying tables are updated, additional overheads may be incurred. Indexes are usually created to satisfy particular search criteria after the table has been in use for some time and has grown in size. The creation of indexes is not standard SQL. However, most dialects support at least the following capabilities: CREATE [UNIQUE] INDEX IndexName ON TableName (columnName [ASC | DESC] [, . . . ]) The specified columns constitute the index key and should be listed in major to minor order. Indexes can be created only on base tables not on views. If the UNIQUE clause is used, uniqueness of the indexed column or combination of columns will be enforced by the DBMS. This is certainly required for the primary key and possibly for other columns as well (for example, for alternate keys). Although indexes can be created at any time, we may have a problem if we try to create a unique index on a table with records in it, because the values stored for the indexed column(s) may already contain duplicates. Therefore, it is good practice to create unique indexes, at least for primary key columns, when the base table is created and the DBMS does not automatically enforce primary key uniqueness. For the Staff and PropertyForRent tables, we may want to create at least the following indexes: CREATE UNIQUE INDEX StaffNoInd ON Staff (staffNo); CREATE UNIQUE INDEX PropertyNoInd ON PropertyForRent (propertyNo); 6.3.5 | 175 176 | Chapter 6 z SQL: Data Definition For each column, we may specify that the order is ascending (ASC) or descending (DESC), with ASC being the default setting. For example, if we create an index on the PropertyForRent table as: CREATE INDEX RentInd ON PropertyForRent (city, rent); then an index called RentInd is created for the PropertyForRent table. Entries will be in alphabetical order by city and then by rent within each city. 
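Columns that are frequently used in join conditions or as search criteria are also common candidates for indexing. As a sketch (the index names PropertyStaffInd and BranchRentInd are ours, and these are not recommendations made elsewhere in the chapter), we might support joins from PropertyForRent to Staff, and searches for the most expensive properties at each branch, with:

CREATE INDEX PropertyStaffInd ON PropertyForRent (staffNo);
CREATE INDEX BranchRentInd ON PropertyForRent (branchNo, rent DESC);

As noted above, each extra index can speed up particular retrievals but adds maintenance overhead whenever the PropertyForRent table is updated, so indexes should be added selectively.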
6.3.6 Removing an Index (DROP INDEX) If we create an index for a base table and later decide that it is no longer needed, we can use the DROP INDEX statement to remove the index from the database. DROP INDEX has the format: DROP INDEX IndexName The following statement will remove the index created in the previous example: DROP INDEX RentInd; 6.4 Views Recall from Section 3.4 the definition of a view: View The dynamic result of one or more relational operations operating on the base relations to produce another relation. A view is a virtual relation that does not necessarily exist in the database but can be produced upon request by a particular user, at the time of request. To the database user, a view appears just like a real table, with a set of named columns and rows of data. However, unlike a base table, a view does not necessarily exist in the database as a stored set of data values. Instead, a view is defined as a query on one or more base tables or views. The DBMS stores the definition of the view in the database. When the DBMS encounters a reference to a view, one approach is to look up this definition and translate the request into an equivalent request against the source tables of the view and then perform the equivalent request. This merging process, called view resolution, is discussed in Section 6.4.3. An alternative approach, called view materialization, stores the view as a temporary table in the database and maintains the currency of the view as the underlying base tables are updated. We discuss view materialization in Section 6.4.8. First, we examine how to create and use views. 6.4 Views Creating a View (CREATE VIEW) The format of the CREATE VIEW statement is: CREATE VIEW ViewName [(newColumnName [, . . . ])] AS subselect [WITH [CASCADED | LOCAL] CHECK OPTION] A view is defined by specifying an SQL SELECT statement. A name may optionally be assigned to each column in the view. If a list of column names is specified, it must have the same number of items as the number of columns produced by the subselect. If the list of column names is omitted, each column in the view takes the name of the corresponding column in the subselect statement. The list of column names must be specified if there is any ambiguity in the name for a column. This may occur if the subselect includes calculated columns, and the AS subclause has not been used to name such columns, or it produces two columns with identical names as the result of a join. The subselect is known as the defining query. If WITH CHECK OPTION is specified, SQL ensures that if a row fails to satisfy the WHERE clause of the defining query of the view, it is not added to the underlying base table of the view (see Section 6.4.6). It should be noted that to create a view successfully, you must have SELECT privilege on all the tables referenced in the subselect and USAGE privilege on any domains used in referenced columns. These privileges are discussed further in Section 6.6. Although all views are created in the same way, in practice different types of view are used for different purposes. We illustrate the different types of view with examples. Example 6.3 Create a horizontal view Create a view so that the manager at branch B003 can see only the details for staff who work in his or her branch office. A horizontal view restricts a user’s access to selected rows of one or more tables. 
CREATE VIEW Manager3Staff AS SELECT * FROM Staff WHERE branchNo = ‘B003’; This creates a view called Manager3Staff with the same column names as the Staff table but containing only those rows where the branch number is B003. (Strictly speaking, the branchNo column is unnecessary and could have been omitted from the definition of the view, as all entries have branchNo = ‘B003’.) If we now execute the statement: SELECT * FROM Manager3Staff; we would get the result table shown in Table 6.3. To ensure that the branch manager can see only these rows, the manager should not be given access to the base table Staff. Instead, the manager should be given access permission to the view Manager3Staff. This, in effect, gives the branch manager a customized view of the Staff table, showing only the staff at his or her own branch. We discuss access permissions in Section 6.6. 6.4.1 | 177 178 | Chapter 6 z SQL: Data Definition Table 6.3 Data for view Manager3Staff. staffNo fName lName position sex DOB salary branchNo SG37 SG14 SG5 Ann David Susan Beech Ford Brand Assistant Supervisor Manager F M F 10-Nov-60 24-Mar-58 3-Jun-40 12000.00 18000.00 24000.00 B003 B003 B003 Example 6.4 Create a vertical view Create a view of the staff details at branch B003 that excludes salary information, so that only managers can access the salary details for staff who work at their branch. A vertical view restricts a user’s access to selected columns of one or more tables. CREATE VIEW Staff3 AS SELECT staffNo, fName, lName, position, sex FROM Staff WHERE branchNo = ‘B003’; Note that we could rewrite this statement to use the Manager3Staff view instead of the Staff table, thus: CREATE VIEW Staff3 AS SELECT staffNo, fName, lName, position, sex FROM Manager3Staff; Either way, this creates a view called Staff3 with the same columns as the Staff table, but excluding the salary, DOB, and branchNo columns. If we list this view we would get the result table shown in Table 6.4. To ensure that only the branch manager can see the salary details, staff at branch B003 should not be given access to the base table Staff or the view Manager3Staff. Instead, they should be given access permission to the view Staff3, thereby denying them access to sensitive salary data. Vertical views are commonly used where the data stored in a table is used by various users or groups of users. They provide a private table for these users composed only of the columns they need. Table 6.4 Data for view Staff3. staffNo fName lName position sex SG37 SG14 SG5 Ann David Susan Beech Ford Brand Assistant Supervisor Manager F M F 6.4 Views Example 6.5 Grouped and joined views Create a view of staff who manage properties for rent, which includes the branch number they work at, their staff number, and the number of properties they manage (see Example 5.27). CREATE VIEW StaffPropCnt (branchNo, staffNo, cnt) AS SELECT s.branchNo, s.staffNo, COUNT(*) FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo GROUP BY s.branchNo, s.staffNo; This gives the data shown in Table 6.5. This example illustrates the use of a subselect containing a GROUP BY clause (giving a view called a grouped view), and containing multiple tables (giving a view called a joined view). One of the most frequent reasons for using views is to simplify multi-table queries. Once a joined view has been defined, we can often use a simple single-table query against the view for queries that would otherwise require a multi-table join. 
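For instance, once StaffPropCnt exists, listing every member of staff it records together with the number of properties they manage, with the most heavily loaded staff first within each branch, needs only a single-table query (a sketch; by the restrictions discussed in Section 6.4.4, cnt may appear in the SELECT and ORDER BY clauses but not in a WHERE clause):

SELECT branchNo, staffNo, cnt
FROM StaffPropCnt
ORDER BY branchNo, cnt DESC;

Without the view, the same result would require the join and grouping of the defining query to be written out again.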
Note that we have to name the columns in the definition of the view because of the use of the unqualified aggregate function COUNT in the subselect. Table 6.5 Data for view StaffPropCnt. branchNo staffNo cnt B003 B003 B005 B007 SG14 SG37 SL41 SA9 1 2 1 1 Removing a View (DROP VIEW) A view is removed from the database with the DROP VIEW statement: DROP VIEW ViewName [RESTRICT | CASCADE] DROP VIEW causes the definition of the view to be deleted from the database. For example, we could remove the Manager3Staff view using the statement: DROP VIEW Manager3Staff; If CASCADE is specified, DROP VIEW deletes all related dependent objects, in other words, all objects that reference the view. This means that DROP VIEW also deletes any 6.4.2 | 179 180 | Chapter 6 z SQL: Data Definition views that are defined on the view being dropped. If RESTRICT is specified and there are any other objects that depend for their existence on the continued existence of the view being dropped, the command is rejected. The default setting is RESTRICT. 6.4.3 View Resolution Having considered how to create and use views, we now look more closely at how a query on a view is handled. To illustrate the process of view resolution, consider the following query that counts the number of properties managed by each member of staff at branch office B003. This query is based on the StaffPropCnt view of Example 6.5: SELECT staffNo, cnt FROM StaffPropCnt WHERE branchNo = ‘B003’ ORDER BY staffNo; View resolution merges the above query with the defining query of the as follows: StaffPropCnt view (1) The view column names in the SELECT list are translated into their corresponding column names in the defining query. This gives: SELECT s.staffNo AS staffNo, COUNT(*) AS cnt (2) View names in the FROM clause are replaced with the corresponding FROM lists of the defining query: FROM Staff s, PropertyForRent p (3) The WHERE clause from the user query is combined with the WHERE clause of the defining query using the logical operator AND, thus: WHERE s.staffNo = p.staffNo AND branchNo = ‘B003’ (4) The GROUP BY and HAVING clauses are copied from the defining query. In this example, we have only a GROUP BY clause: GROUP BY s.branchNo, s.staffNo (5) Finally, the ORDER BY clause is copied from the user query with the view column name translated into the defining query column name: ORDER BY s.staffNo (6) The final merged query becomes: SELECT s.staffNo AS staffNo, COUNT(*) AS cnt FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo AND branchNo = ‘B003’ GROUP BY s.branchNo, s.staffNo ORDER BY s.staffNo; 6.4 Views This gives the result table shown in Table 6.6. Table 6.6 Result table after view resolution. staffNo cnt SG14 SG37 1 2 Restrictions on Views 6.4.4 The ISO standard imposes several important restrictions on the creation and use of views, although there is considerable variation among dialects. n If a column in the view is based on an aggregate function, then the column may appear only in SELECT and ORDER BY clauses of queries that access the view. In particular, such a column may not be used in a WHERE clause and may not be an argument to an aggregate function in any query based on the view. For example, consider the view StaffPropCnt of Example 6.5, which has a column cnt based on the aggregate function COUNT. The following query would fail: SELECT COUNT(cnt) FROM StaffPropCnt; because we are using an aggregate function on the column cnt, which is itself based on an aggregate function. 
Similarly, the following query would also fail: SELECT * FROM StaffPropCnt WHERE cnt > 2; n because we are using the view column, cnt, derived from an aggregate function in a WHERE clause. A grouped view may never be joined with a base table or a view. For example, the StaffPropCnt view is a grouped view, so that any attempt to join this view with another table or view fails. View Updatability All updates to a base table are immediately reflected in all views that encompass that base table. Similarly, we may expect that if a view is updated then the base table(s) will reflect that change. However, consider again the view StaffPropCnt of Example 6.5. Consider what would happen if we tried to insert a record that showed that at branch B003, staff member SG5 manages two properties, using the following insert statement: 6.4.5 | 181 182 | Chapter 6 z SQL: Data Definition INSERT INTO StaffPropCnt VALUES (‘B003’, ‘SG5’, 2); We have to insert two records into the PropertyForRent table showing which properties staff member SG5 manages. However, we do not know which properties they are; all we know is that this member of staff manages two properties. In other words, we do not know the corresponding primary key values for the PropertyForRent table. If we change the definition of the view and replace the count with the actual property numbers: CREATE VIEW StaffPropList (branchNo, staffNo, propertyNo) AS SELECT s.branchNo, s.staffNo, p.propertyNo FROM Staff s, PropertyForRent p WHERE s.staffNo = p.staffNo; and we try to insert the record: INSERT INTO StaffPropList VALUES (‘B003’, ‘SG5’, ‘PG19’); then there is still a problem with this insertion, because we specified in the definition of the PropertyForRent table that all columns except postcode and staffNo were not allowed to have nulls (see Example 6.1). However, as the StaffPropList view excludes all columns from the PropertyForRent table except the property number, we have no way of providing the remaining non-null columns with values. The ISO standard specifies the views that must be updatable in a system that conforms to the standard. The definition given in the ISO standard is that a view is updatable if and only if: n n n n n DISTINCT is not specified; that is, duplicate rows must not be eliminated from the query results. Every element in the SELECT list of the defining query is a column name (rather than a constant, expression, or aggregate function) and no column name appears more than once. The FROM clause specifies only one table; that is, the view must have a single source table for which the user has the required privileges. If the source table is itself a view, then that view must satisfy these conditions. This, therefore, excludes any views based on a join, union (UNION), intersection (INTERSECT), or difference (EXCEPT). The WHERE clause does not include any nested SELECTs that reference the table in the FROM clause. There is no GROUP BY or HAVING clause in the defining query. In addition, every row that is added through the view must not violate the integrity constraints of the base table. For example, if a new row is added through a view, columns that are not included in the view are set to null, but this must not violate a NOT NULL integrity constraint in the base table. The basic concept behind these restrictions is as follows: Updatable view For a view to be updatable, the DBMS must be able to trace any row or column back to its row or column in the source table. 
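By these rules the horizontal view Manager3Staff of Example 6.3 is updatable: its defining query uses no DISTINCT, its SELECT list contains only column names, it draws on the single base table Staff, and it has no nested SELECT, GROUP BY, or HAVING clause, so every row of the view can be traced back to exactly one row of Staff. For example, the following update (a sketch, not an example from the text) is accepted and is applied directly to the corresponding Staff row:

UPDATE Manager3Staff
SET salary = salary * 1.03
WHERE staffNo = 'SG37';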
6.4.6 WITH CHECK OPTION

Rows exist in a view because they satisfy the WHERE condition of the defining query. If a row is altered such that it no longer satisfies this condition, then it will disappear from the view. Similarly, new rows will appear within the view when an insert or update on the view causes them to satisfy the WHERE condition. The rows that enter or leave a view are called migrating rows. Generally, the WITH CHECK OPTION clause of the CREATE VIEW statement prohibits a row migrating out of the view. The optional qualifiers LOCAL/CASCADED are applicable to view hierarchies: that is, a view that is derived from another view. In this case, if WITH LOCAL CHECK OPTION is specified, then any row insert or update on this view, and on any view directly or indirectly defined on this view, must not cause the row to disappear from the view, unless the row also disappears from the underlying derived view/table. If the WITH CASCADED CHECK OPTION is specified (the default setting), then any row insert or update on this view and on any view directly or indirectly defined on this view must not cause the row to disappear from the view. This feature is so useful that it can make working with views more attractive than working with the base tables. When an INSERT or UPDATE statement on the view violates the WHERE condition of the defining query, the operation is rejected. This enforces constraints on the database and helps preserve database integrity. The WITH CHECK OPTION can be specified only for an updatable view, as defined in the previous section.

Example 6.6 WITH CHECK OPTION

Consider again the view created in Example 6.3:

CREATE VIEW Manager3Staff
AS SELECT *
FROM Staff
WHERE branchNo = ‘B003’
WITH CHECK OPTION;

with the virtual table shown in Table 6.3. If we now attempt to update the branch number of one of the rows from B003 to B005, for example:

UPDATE Manager3Staff
SET branchNo = ‘B005’
WHERE staffNo = ‘SG37’;

then the specification of the WITH CHECK OPTION clause in the definition of the view prevents this from happening, as this would cause the row to migrate from this horizontal view. Similarly, if we attempt to insert the following row through the view:

INSERT INTO Manager3Staff
VALUES (‘SL15’, ‘Mary’, ‘Black’, ‘Assistant’, ‘F’, DATE ‘1967-06-21’, 8000, ‘B002’);

then the specification of WITH CHECK OPTION would prevent the row from being inserted into the underlying Staff table and immediately disappearing from this view (as branch B002 is not part of the view). Now consider the situation where Manager3Staff is defined not on Staff directly but on another view of Staff:

CREATE VIEW LowSalary
AS SELECT *
FROM Staff
WHERE salary > 9000;

CREATE VIEW HighSalary
AS SELECT *
FROM LowSalary
WHERE salary > 10000
WITH LOCAL CHECK OPTION;

CREATE VIEW Manager3Staff
AS SELECT *
FROM HighSalary
WHERE branchNo = ‘B003’;

If we now attempt the following update on Manager3Staff:

UPDATE Manager3Staff
SET salary = 9500
WHERE staffNo = ‘SG37’;

then this update would fail: although the update would cause the row to disappear from the view HighSalary, the row would not disappear from the table LowSalary that HighSalary is derived from. However, if instead the update tried to set the salary to 8000, then the update would succeed as the row would no longer be part of LowSalary.
Alternatively, if the view HighSalary had specified WITH CASCADED CHECK OPTION, then setting the salary to either 9500 or 8000 would be rejected because the row would disappear from HighSalary. Therefore, to ensure that anomalies like this do not arise, each view should normally be created using the WITH CASCADED CHECK OPTION. 6.4.7 Advantages and Disadvantages of Views Restricting some users’ access to views has potential advantages over allowing users direct access to the base tables. Unfortunately, views in SQL also have disadvantages. In this section we briefly review the advantages and disadvantages of views in SQL as summarized in Table 6.7. Table 6.7 Summary of advantages/disadvantages of views in SQL. Advantages Disadvantages Data independence Currency Improved security Reduced complexity Convenience Customization Data integrity Update restriction Structure restriction Performance 6.4 Views Advantages In the case of a DBMS running on a standalone PC, views are usually a convenience, defined to simplify database requests. However, in a multi-user DBMS, views play a central role in defining the structure of the database and enforcing security. The major advantages of views are described below. Data independence A view can present a consistent, unchanging picture of the structure of the database, even if the underlying source tables are changed (for example, columns added or removed, relationships changed, tables split, restructured, or renamed). If columns are added or removed from a table, and these columns are not required by the view, then the definition of the view need not change. If an existing table is rearranged or split up, a view may be defined so that users can continue to see the old table. In the case of splitting a table, the old table can be recreated by defining a view from the join of the new tables, provided that the split is done in such a way that the original table can be reconstructed. We can ensure that this is possible by placing the primary key in both of the new tables. Thus, if we originally had a Client table of the form: Client (clientNo, fName, lName, telNo, prefType, maxRent) we could reorganize it into two new tables: ClientDetails (clientNo, fName, lName, telNo) ClientReqts (clientNo, prefType, maxRent) Users and applications could still access the data using the old table structure, which would be recreated by defining a view called Client as the natural join of ClientDetails and ClientReqts, with clientNo as the join column: CREATE VIEW Client AS SELECT cd.clientNo, fName, lName, telNo, prefType, maxRent FROM ClientDetails cd, ClientReqts cr WHERE cd.clientNo = cr.clientNo; Currency Changes to any of the base tables in the defining query are immediately reflected in the view. Improved security Each user can be given the privilege to access the database only through a small set of views that contain the data appropriate for that user, thus restricting and controlling each user’s access to the database. Reduced complexity A view can simplify queries, by drawing data from several tables into a single table, thereby transforming multi-table queries into single-table queries. | 185 186 | Chapter 6 z SQL: Data Definition Convenience Views can provide greater convenience to users as users are presented with only that part of the database that they need to see. This also reduces the complexity from the user’s point of view. 
Customization Views provide a method to customize the appearance of the database, so that the same underlying base tables can be seen by different users in different ways. Data integrity If the WITH CHECK OPTION clause of the CREATE VIEW statement is used, then SQL ensures that no row that fails to satisfy the WHERE clause of the defining query is ever added to any of the underlying base table(s) through the view, thereby ensuring the integrity of the view. Disadvantages Although views provide many significant benefits, there are also some disadvantages with SQL views. Update restriction In Section 6.4.5 we showed that, in some cases, a view cannot be updated. Structure restriction The structure of a view is determined at the time of its creation. If the defining query was of the form SELECT * FROM . . . , then the * refers to the columns of the base table present when the view is created. If columns are subsequently added to the base table, then these columns will not appear in the view, unless the view is dropped and recreated. Performance There is a performance penalty to be paid when using a view. In some cases, this will be negligible; in other cases, it may be more problematic. For example, a view defined by a complex, multi-table query may take a long time to process as the view resolution must join the tables together every time the view is accessed. View resolution requires additional computer resources. In the next section, we briefly discuss an alternative approach to maintaining views that attempts to overcome this disadvantage. 6.4.8 View Materialization In Section 6.4.3 we discussed one approach to handling queries based on a view, where the query is modified into a query on the underlying base tables. One disadvantage with this approach is the time taken to perform the view resolution, particularly if the view is accessed frequently. An alternative approach, called view materialization, is to store 6.5 Transactions | 187 the view as a temporary table in the database when the view is first queried. Thereafter, queries based on the materialized view can be much faster than recomputing the view each time. The speed difference may be critical in applications where the query rate is high and the views are complex so that it is not practical to recompute the view for every query. Materialized views are useful in new applications such as data warehousing, replication servers, data visualization, and mobile systems. Integrity constraint checking and query optimization can also benefit from materialized views. The difficulty with this approach is maintaining the currency of the view while the base table(s) are being updated. The process of updating a materialized view in response to changes to the underlying data is called view maintenance. The basic aim of view maintenance is to apply only those changes necessary to the view to keep it current. As an indication of the issues involved, consider the following view: CREATE VIEW StaffPropRent (staffNo) AS SELECT DISTINCT staffNo FROM PropertyForRent WHERE branchNo = ‘B003’ AND rent > 400; with the data shown in Table 6.8. If we were to insert a row into the PropertyForRent table with a rent ≤ 400, then the view would be unchanged. If we were to insert the row (‘PG24’, . . . , 550, ‘CO40’, ‘SG19’, ‘B003’) into the PropertyForRent table then the row should also appear within the materialized view. However, if we were to insert the row (‘PG54’, . . . 
, 450, ‘CO89’, ‘SG37’, ‘B003’) into the PropertyForRent table, then no new row need be added to the materialized view because there is a row for SG37 already. Note that in these three cases the decision whether to insert the row into the materialized view can be made without access to the underlying PropertyForRent table. If we now wished to delete the new row (‘PG24’, . . . , 550, ‘CO40’, ‘SG19’, ‘B003’) from the PropertyForRent table then the row should also be deleted from the materialized view. However, if we wished to delete the new row (‘PG54’, . . . , 450, ‘CO89’, ‘SG37’, ‘B003’) from the PropertyForRent table then the row corresponding to SG37 should not be deleted from the materialized view, owing to the existence of the underlying base row corresponding to property PG21. In these two cases, the decision on whether to delete or retain the row in the materialized view requires access to the underlying base table PropertyForRent. For a more complete discussion of materialized views, the interested reader is referred to Gupta and Mumick (1999). Transactions The ISO standard defines a transaction model based on two SQL statements: COMMIT and ROLLBACK. Most, but not all, commercial implementations of SQL conform to this model, which is based on IBM’s DB2 DBMS. A transaction is a logical unit of work consisting of one or more SQL statements that is guaranteed to be atomic with respect to recovery. The standard specifies that an SQL transaction automatically begins with a transaction-initiating SQL statement executed by a user or program (for example, Table 6.8 Data for view StaffPropRent. staffNo SG37 SG14 6.5 188 | Chapter 6 z SQL: Data Definition SELECT, INSERT, UPDATE). Changes made by a transaction are not visible to other concurrently executing transactions until the transaction completes. A transaction can complete in one of four ways: n n n n A COMMIT statement ends the transaction successfully, making the database changes permanent. A new transaction starts after COMMIT with the next transaction-initiating statement. A ROLLBACK statement aborts the transaction, backing out any changes made by the transaction. A new transaction starts after ROLLBACK with the next transactioninitiating statement. For programmatic SQL (see Appendix E), successful program termination ends the final transaction successfully, even if a COMMIT statement has not been executed. For programmatic SQL, abnormal program termination aborts the transaction. SQL transactions cannot be nested (see Section 20.4). The SET TRANSACTION statement allows the user to configure certain aspects of the transaction. The basic format of the statement is: SET TRANSACTION [READ ONLY | READ WRITE] | [ISOLATION LEVEL READ UNCOMMITTED | READ COMMITTED | REPEATABLE READ | SERIALIZABLE] The READ ONLY and READ WRITE qualifiers indicate whether the transaction is read only or involves both read and write operations. The default is READ WRITE if neither qualifier is specified (unless the isolation level is READ UNCOMMITTED). Perhaps confusingly, READ ONLY allows a transaction to issue INSERT, UPDATE, and DELETE statements against temporary tables (but only temporary tables). The isolation level indicates the degree of interaction that is allowed from other transactions during the execution of the transaction. 
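As a brief illustration of these statements working together, the following sketch groups two updates into a single atomic unit of work. The property and staff numbers are sample values in the style of the DreamHome data, and the exact SET TRANSACTION syntax accepted varies between DBMSs:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Reassign two properties to member of staff SG37 as one logical unit of work
UPDATE PropertyForRent SET staffNo = 'SG37' WHERE propertyNo = 'PG16';
UPDATE PropertyForRent SET staffNo = 'SG37' WHERE propertyNo = 'PG36';

COMMIT;

If a problem were detected before the COMMIT (for example, if one of the properties did not exist), issuing ROLLBACK instead would back out both updates, leaving the database as it was when the transaction began.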
Table 6.9 shows the violations of serializability allowed by each isolation level against the following three preventable phenomena: n n n Dirty read A transaction reads data that has been written by another as yet uncommitted transaction. Nonrepeatable read A transaction rereads data it has previously read but another committed transaction has modified or deleted the data in the intervening period. Phantom read A transaction executes a query that retrieves a set of rows satisfying a certain search condition. When the transaction re-executes the query at a later time additional rows are returned that have been inserted by another committed transaction in the intervening period. Only the SERIALIZABLE isolation level is safe, that is generates serializable schedules. The remaining isolation levels require a mechanism to be provided by the DBMS that 6.6 Discretionary Access Control Table 6.9 Violations of serializability permitted by isolation levels. Isolation level Dirty read Nonrepeatable read Phantom read READ UNCOMMITTED READ COMMITTED REPEATABLE READ SERIALIZABLE Y N N N Y Y N N Y Y Y N can be used by the programmer to ensure serializability. Chapter 20 provides additional information on transactions and serializability. Immediate and Deferred Integrity Constraints 6.5.1 In some situations, we do not want integrity constraints to be checked immediately, that is after every SQL statement has been executed, but instead at transaction commit. A constraint may be defined as INITIALLY IMMEDIATE or INITIALLY DEFERRED, indicating which mode the constraint assumes at the start of each transaction. In the former case, it is also possible to specify whether the mode can be changed subsequently using the qualifier [NOT] DEFERRABLE. The default mode is INITIALLY IMMEDIATE. The SET CONSTRAINTS statement is used to set the mode for specified constraints for the current transaction. The format of this statement is: SET CONSTRAINTS {ALL | constraintName [, . . . ]} {DEFERRED | IMMEDIATE} Discretionary Access Control In Section 2.4 we stated that a DBMS should provide a mechanism to ensure that only authorized users can access the database. Modern DBMSs typically provide one or both of the following authorization mechanisms: n n Discretionary access control Each user is given appropriate access rights (or privileges) on specific database objects. Typically users obtain certain privileges when they create an object and can pass some or all of these privileges to other users at their discretion. Although flexible, this type of authorization mechanism can be circumvented by a devious unauthorized user tricking an authorized user into revealing sensitive data. Mandatory access control Each database object is assigned a certain classification level (for example, Top Secret, Secret, Confidential, Unclassified) and each subject (for 6.6 | 189 190 | Chapter 6 z SQL: Data Definition example, users, programs) is given a designated clearance level. The classification levels form a strict ordering (Top Secret > Secret > Confidential > Unclassified) and a subject requires the necessary clearance to read or write a database object. This type of multilevel security mechanism is important for certain government, military, and corporate applications. The most commonly used mandatory access control model is known as Bell–LaPadula (Bell and LaPadula, 1974), which we discuss further in Chapter 19. SQL supports only discretionary access control through the GRANT and REVOKE statements. 
The mechanism is based on the concepts of authorization identifiers, ownership, and privileges, as we now discuss. Authorization identifiers and ownership An authorization identifier is a normal SQL identifier that is used to establish the identity of a user. Each database user is assigned an authorization identifier by the Database Administrator (DBA). Usually, the identifier has an associated password, for obvious security reasons. Every SQL statement that is executed by the DBMS is performed on behalf of a specific user. The authorization identifier is used to determine which database objects the user may reference and what operations may be performed on those objects. Each object that is created in SQL has an owner. The owner is identified by the authorization identifier defined in the AUTHORIZATION clause of the schema to which the object belongs (see Section 6.3.1). The owner is initially the only person who may know of the existence of the object and, consequently, perform any operations on the object. Privileges Privileges are the actions that a user is permitted to carry out on a given base table or view. The privileges defined by the ISO standard are: n n n n n n SELECT – the privilege to retrieve data from a table; INSERT – the privilege to insert new rows into a table; UPDATE – the privilege to modify rows of data in a table; DELETE – the privilege to delete rows of data from a table; REFERENCES – the privilege to reference columns of a named table in integrity constraints; USAGE – the privilege to use domains, collations, character sets, and translations. We do not discuss collations, character sets, and translations in this book; the interested reader is referred to Cannan and Otten (1993). The INSERT and UPDATE privileges can be restricted to specific columns of the table, allowing changes to these columns but disallowing changes to any other column. Similarly, the REFERENCES privilege can be restricted to specific columns of the table, allowing these columns to be referenced in constraints, such as check constraints and foreign key constraints, when creating another table, but disallowing others from being referenced. 6.6 Discretionary Access Control When a user creates a table using the CREATE TABLE statement, he or she automatically becomes the owner of the table and receives full privileges for the table. Other users initially have no privileges on the newly created table. To give them access to the table, the owner must explicitly grant them the necessary privileges using the GRANT statement. When a user creates a view with the CREATE VIEW statement, he or she automatically becomes the owner of the view, but does not necessarily receive full privileges on the view. To create the view, a user must have SELECT privilege on all the tables that make up the view and REFERENCES privilege on the named columns of the view. However, the view owner gets INSERT, UPDATE, and DELETE privileges only if he or she holds these privileges for every table in the view. Granting Privileges to Other Users (GRANT) The GRANT statement is used to grant privileges on database objects to specific users. Normally the GRANT statement is used by the owner of a table to give other users access to the data. The format of the GRANT statement is: GRANT {PrivilegeList | ALL PRIVILEGES} ON ObjectName TO {AuthorizationIdList | PUBLIC} [WITH GRANT OPTION] PrivilegeList consists of one or more of the following privileges separated by commas: SELECT DELETE INSERT UPDATE REFERENCES USAGE [(columnName [, . . . 
])] [(columnName [, . . . ])] [(columnName [, . . . ])] For convenience, the GRANT statement allows the keyword ALL PRIVILEGES to be used to grant all privileges to a user instead of having to specify the six privileges individually. It also provides the keyword PUBLIC to allow access to be granted to all present and future authorized users, not just to the users currently known to the DBMS. ObjectName can be the name of a base table, view, domain, character set, collation, or translation. The WITH GRANT OPTION clause allows the user(s) in AuthorizationIdList to pass the privileges they have been given for the named object on to other users. If these users pass a privilege on specifying WITH GRANT OPTION, the users receiving the privilege may in turn grant it to still other users. If this keyword is not specified, the receiving user(s) will not be able to pass the privileges on to other users. In this way, the owner of the object maintains very tight control over who has permission to use the object and what forms of access are allowed. 6.6.1 | 191 192 | Chapter 6 z SQL: Data Definition Example 6.7 GRANT all privileges Give the user with authorization identifier Manager full privileges to the Staff table. GRANT ALL PRIVILEGES ON Staff TO Manager WITH GRANT OPTION; The user identified as Manager can now retrieve rows from the Staff table, and also insert, update, and delete data from this table. Manager can also reference the Staff table, and all the Staff columns in any table that he or she creates subsequently. We also specified the keyword WITH GRANT OPTION, so that Manager can pass these privileges on to other users. Example 6.8 GRANT specific privileges Give users Personnel and Director the privileges SELECT and UPDATE on column salary of the Staff table. GRANT SELECT, UPDATE (salary) ON Staff TO Personnel, Director; We have omitted the keyword WITH GRANT OPTION, so that users Director cannot pass either of these privileges on to other users. Personnel and Example 6.9 GRANT specific privileges to PUBLIC Give all users the privilege SELECT on the Branch table. GRANT SELECT ON Branch TO PUBLIC; The use of the keyword PUBLIC means that all users (now and in the future) are able to retrieve all the data in the Branch table. Note that it does not make sense to use WITH GRANT OPTION in this case: as every user has access to the table, there is no need to pass the privilege on to other users. 6.6.2 Revoking Privileges from Users (REVOKE) The REVOKE statement is used to take away privileges that were granted with the GRANT statement. A REVOKE statement can take away all or some of the privileges that were previously granted to a user. The format of the statement is: 6.6 Discretionary Access Control | 193 REVOKE [GRANT OPTION FOR] {PrivilegeList | ALL PRIVILEGES} ON ObjectName FROM {AuthorizationIdList | PUBLIC} [RESTRICT | CASCADE] The keyword ALL PRIVILEGES refers to all the privileges granted to a user by the user revoking the privileges. The optional GRANT OPTION FOR clause allows privileges passed on via the WITH GRANT OPTION of the GRANT statement to be revoked separately from the privileges themselves. The RESTRICT and CASCADE qualifiers operate exactly as in the DROP TABLE statement (see Section 6.3.3). Since privileges are required to create certain objects, revoking a privilege can remove the authority that allowed the object to be created (such an object is said to be abandoned). 
The REVOKE statement fails if it results in an abandoned object, such as a view, unless the CASCADE keyword has been specified. If CASCADE is specified, an appropriate DROP statement is issued for any abandoned views, domains, constraints, or assertions. The privileges that were granted to this user by other users are not affected by this REVOKE statement. Therefore, if another user has granted the user the privilege being revoked, the other user’s grant still allows the user to access the table. For example, in Figure 6.1 User A grants User B INSERT privilege on the Staff table WITH GRANT OPTION (step 1). User B passes this privilege on to User C (step 2). Subsequently, User C gets the same privilege from User E (step 3). User C then passes the privilege on to User D (step 4). When User A revokes the INSERT privilege from User B (step 5), the privilege cannot be revoked from User C, because User C has also received the privilege from User E. If User E had not given User C this privilege, the revoke would have cascaded to User C and User D. Figure 6.1 Effects of REVOKE. 194 | Chapter 6 z SQL: Data Definition Example 6.10 REVOKE specific privileges from PUBLIC Revoke the privilege SELECT on the Branch table from all users. REVOKE SELECT ON Branch FROM PUBLIC; Example 6.11 REVOKE specific privileges from named user Revoke all privileges you have given to Director on the Staff table. REVOKE ALL PRIVILEGES ON Staff FROM Director; This is equivalent to REVOKE SELECT . . . , as this was the only privilege that has been given to Director. Chapter Summary n n n n n The ISO standard provides eight base data types: boolean, character, bit, exact numeric, approximate numeric, datetime, interval, and character/binary large objects. The SQL DDL statements allow database objects to be defined. The CREATE and DROP SCHEMA statements allow schemas to be created and destroyed; the CREATE, ALTER, and DROP TABLE statements allow tables to be created, modified, and destroyed; the CREATE and DROP INDEX statements allow indexes to be created and destroyed. The ISO SQL standard provides clauses in the CREATE and ALTER TABLE statements to define integrity constraints that handle: required data, domain constraints, entity integrity, referential integrity, and general constraints. Required data can be specified using NOT NULL. Domain constraints can be specified using the CHECK clause or by defining domains using the CREATE DOMAIN statement. Primary keys should be defined using the PRIMARY KEY clause and alternate keys using the combination of NOT NULL and UNIQUE. Foreign keys should be defined using the FOREIGN KEY clause and update and delete rules using the subclauses ON UPDATE and ON DELETE. General constraints can be defined using the CHECK and UNIQUE clauses. General constraints can also be created using the CREATE ASSERTION statement. A view is a virtual table representing a subset of columns and/or rows and/or column expressions from one or more base tables or views. A view is created using the CREATE VIEW statement by specifying a defining query. It may not necessarily be a physically stored table, but may be recreated each time it is referenced. Views can be used to simplify the structure of the database and make queries easier to write. They can also be used to protect certain columns and/or rows from unauthorized access. Not all views are updatable. Exercises n n n | 195 View resolution merges the query on a view with the definition of the view producing a query on the underlying base table(s). 
This process is performed each time the DBMS has to process a query on a view. An alternative approach, called view materialization, stores the view as a temporary table in the database when the view is first queried. Thereafter, queries based on the materialized view can be much faster than recomputing the view each time. One disadvantage with materialized views is maintaining the currency of the temporary table. The COMMIT statement signals successful completion of a transaction and all changes to the database are made permanent. The ROLLBACK statement signals that the transaction should be aborted and all changes to the database are undone. SQL access control is built around the concepts of authorization identifiers, ownership, and privileges. Authorization identifiers are assigned to database users by the DBA and identify a user. Each object that is created in SQL has an owner. The owner can pass privileges on to other users using the GRANT statement and can revoke the privileges passed on using the REVOKE statement. The privileges that can be passed on are USAGE, SELECT, DELETE, INSERT, UPDATE, and REFERENCES; the latter three can be restricted to specific columns. A user can allow a receiving user to pass privileges on using the WITH GRANT OPTION clause and can revoke this privilege using the GRANT OPTION FOR clause. Review Questions 6.1 Describe the eight base data types in SQL. 6.2 Discuss the functionality and importance of the Integrity Enhancement Feature (IEF). 6.3 Discuss each of the clauses of the CREATE TABLE statement. 6.4 Discuss the advantages and disadvantages of views. 6.5 Describe how the process of view resolution works. 6.6 What restrictions are necessary to ensure that a view is updatable? 6.7 What is a materialized view and what are the advantages of a maintaining a materialized view rather than using the view resolution process? 6.8 Describe the difference between discretionary and mandatory access control. What type of control mechanism does SQL support? 6.9 Describe how the access control mechanisms of SQL work. Exercises Answer the following questions using the relational schema from the Exercises at the end of Chapter 3: 6.10 Create the Hotel table using the integrity enhancement features of SQL. 6.11 Now create the Room, Booking, and Guest tables using the integrity enhancement features of SQL with the following constraints: (a) (b) (c) (d) type must be one of Single, Double, or Family. price must be between £10 and £100. roomNo must be between 1 and 100. dateFrom and dateTo must be greater than today’s date. 196 | Chapter 6 z SQL: Data Definition (e) The same room cannot be double-booked. (f) The same guest cannot have overlapping bookings. 6.12 Create a separate table with the same structure as the Booking table to hold archive records. Using the INSERT statement, copy the records from the Booking table to the archive table relating to bookings before 1 January 2003. Delete all bookings before 1 January 2003 from the Booking table. 6.13 Create a view containing the hotel name and the names of the guests staying at the hotel. 6.14 Create a view containing the account for each guest at the Grosvenor Hotel. 6.15 Give the users Manager and Director full access to these views, with the privilege to pass the access on to other users. 6.16 Give the user Accounts SELECT access to these views. Now revoke the access from this user. 
6.17 Consider the following view defined on the Hotel schema: CREATE VIEW HotelBookingCount (hotelNo, bookingCount) AS SELECT h.hotelNo, COUNT(*) FROM Hotel h, Room r, Booking b WHERE h.hotelNo = r.hotelNo AND r.roomNo = b.roomNo GROUP BY h.hotelNo; For each of the following queries, state whether the query is valid and for the valid ones show how each of the queries would be mapped on to a query on the underlying base tables. (a) SELECT * FROM HotelBookingCount; (b) SELECT hotelNo FROM HotelBookingCount WHERE hotelNo = ‘H001’; (c) SELECT MIN(bookingCount) FROM HotelBookingCount; (d) SELECT COUNT(*) FROM HotelBookingCount; (e) SELECT hotelNo FROM HotelBookingCount WHERE bookingCount > 1000; (f) SELECT hotelNo FROM HotelBookingCount ORDER BY bookingCount; General 6.18 Consider the following table: Part (partNo, contract, partCost) which represents the cost negotiated under each contract for a part (a part may have a different price under each contract). Now consider the following view ExpensiveParts, which contains the distinct part numbers for parts that cost more than £1000: Exercises | 197 CREATE VIEW ExpensiveParts (partNo) AS SELECT DISTINCT partNo FROM Part WHERE partCost > 1000; Discuss how you would maintain this as a materialized view and under what circumstances you would be able to maintain the view without having to access the underlying base table Part. 6.19 Assume that we also have a table for suppliers: Supplier (supplierNo, partNo, price) and a view SupplierParts, which contains the distinct part numbers that are supplied by at least one supplier: CREATE VIEW SupplierParts (partNo) AS SELECT DISTINCT partNo FROM Supplier s, Part p WHERE s.partNo = p.partNo; Discuss how you would maintain this as a materialized view and under what circumstances you would be able to maintain the view without having to access the underlying base tables Part and Supplier. 6.20 Investigate the SQL dialect on any DBMS that you are currently using. Determine the system’s compliance with the DDL statements in the ISO standard. Investigate the functionality of any extensions the DBMS supports. Are there any functions not supported? 6.21 Create the DreamHome rental database schema defined in Section 3.2.6 and insert the tuples shown in Figure 3.3. 6.22 Using the schema you have created above, run the SQL queries given in the examples in Chapter 5. 6.23 Create the schema for the Hotel schema given at the start of the exercises for Chapter 3 and insert some sample tuples. Now run the SQL queries that you produced for Exercises 5.7–5.28. Chapter 7 Query-By-Example Chapter Objectives In this chapter you will learn: n The main features of Query-By-Example (QBE). n The types of query provided by the Microsoft Office Access DBMS QBE facility. n How to use QBE to build queries to select fields and records. n How to use QBE to target single or multiple tables. n How to perform calculations using QBE. n How to use advanced QBE facilities including parameter, find matched, find unmatched, crosstab, and autolookup queries. n How to use QBE action queries to change the content of tables. In this chapter, we demonstrate the major features of the Query-By-Example (QBE) facility using the Microsoft Office Access 2003 DBMS. QBE represents a visual approach for accessing data in a database through the use of query templates (Zloof, 1977). We use QBE by entering example values directly into a query template to represent what the access to the database is to achieve, such as the answer to a query. 
QBE was developed originally by IBM in the 1970s to help users in their retrieval of data from a database. Such was the success of QBE that this facility is now provided, in one form or another, by the most popular DBMSs including Microsoft Office Access. The Office Access QBE facility is easy to use and has very powerful capabilities. We can use QBE to ask questions about the data held in one or more tables and to specify the fields we want to appear in the answer. We can select records according to specific or nonspecific criteria and perform calculations on the data held in tables. We can also use QBE to perform useful operations on tables such as inserting and deleting records, modifying the values of fields, or creating new fields and tables. In this chapter we use simple examples to demonstrate these facilities. We use the sample tables shown in Figure 3.3 of the DreamHome case study, which is described in detail in Section 10.4 and Appendix A. When we create a query using QBE, in the background Microsoft Office Access constructs the equivalent SQL statement. SQL is a language used in the querying, updating, and management of relational databases. In Chapters 5 and 6 we presented a comprehensive overview of the SQL standard. We display the equivalent Microsoft Office Access 7.1 Introduction to Microsoft Office Access Queries SQL statement alongside every QBE example discussed in this chapter. However, we do not discuss the SQL statements in any detail but refer the interested reader to Chapters 5 and 6. Although this chapter uses Microsoft Office Access to demonstrate QBE, in Section 8.1 we present a general overview of the other facilities of Microsoft Office Access 2003 DBMS. Also, in Chapters 17 and 18 we illustrate by example the physical database design methodology presented in this book, using Microsoft Office Access as one of the target DBMSs. Structure of this Chapter In Section 7.1 we present an overview of the types of QBE queries provided by Microsoft Office Access 2003, and in Section 7.2, we demonstrate how to build simple select queries using the QBE grid. In Section 7.3 we illustrate the use of advanced QBE queries (such as crosstab and autolookup), and finally in Section 7.4 we examine action queries (such as update and make-table). Introduction to Microsoft Office Access Queries When we create or open a database using Microsoft Office Access, the Database window is displayed showing the objects (such as tables, forms, queries, and reports) in the database. For example, when we open the DreamHome database, we can view the tables in this database, as shown in Figure 7.1. To ask a question about data in a database, we design a query that tells Microsoft Office Access what data to retrieve. The most commonly used queries are called select queries. With select queries, we can view, analyze, or make changes to the data. We can view data from a single table or from multiple tables. When a select query is run, Microsoft Office Access collects the retrieved data in a dynaset. A dynaset is a dynamic view of the data from one or more tables, selected and sorted as specified by the query. In other words, a dynaset is an updatable set of records defined by a table or a query that we can treat as an object. As well as select queries, we can also create many other types of useful queries using Microsoft Office Access. Table 7.1 presents a summary of the types of query provided by Microsoft Office Access 2003. 
These queries are discussed in more detailed in the following sections, with the exception of SQL-specific queries. When we create a new query, Microsoft Office Access displays the New Query dialog box shown in Figure 7.2. From the options shown in the dialog box, we can start from scratch with a blank object and build the new query ourselves by choosing Design View or use one of the listed Office Access Wizards to help build the query. A Wizard is like a database expert who asks questions about the query we want and then builds the query based on our responses. As shown in Figure 7.2, we can use Wizards 7.1 | 199 Database window Database objects Figure 7.1 Microsoft Office Access Database window of the tables in the DreamHome database. Table 7.1 Summary of Microsoft Office Access 2003 query types. Query type Description Select query Asks a question or defines a set of criteria about the data in one or more tables. Performs calculations on groups of records. Displays one or more predefined dialog boxes that prompts the user for the parameter value(s). Finds duplicate records in a single table. Finds distinct records in related tables. Allows large amounts of data to be summarized and presented in a compact spreadsheet. Automatically fills in certain field values for a new record. Makes changes to many records in just one operation. Such changes include the ability to delete, append, or make changes to records in a table and also to create a new table. Used to modify the queries described above and to set the properties of forms and reports. Must be used to create SQL-specific queries such as union, data definition, subqueries (see Chapters 5 and 6), and pass-through queries. Pass-through queries send commands to a SQL database such as Microsoft or Sybase SQL Server. Totals (Aggregate) query Parameter query Find Matched query Find Unmatched query Crosstab query Autolookup query Action query (including delete, append, update, and make-table queries) SQL query (including union, pass-through, data definition, and subqueries) 7.2 Building Select Queries Using QBE | 201 Figure 7.2 Microsoft Office Access New Query dialog box. to help build simple select queries, crosstab queries, or queries that find duplicates or unmatched records within tables. Unfortunately, Query Wizards are of limited use when we want to build more complex select queries or other useful types of query such as parameter queries, autolookup queries, or action queries. Building Select Queries Using QBE A select query is the most common type of query. It retrieves data from one or more tables and displays the results in a datasheet where we can update the records (with some restrictions). A datasheet displays data from the table(s) in columns and rows, similar to a spreadsheet. A select query can also group records and calculate sums, counts, averages, and other types of total. As stated in the previous section, simple select statements can be created using the Simple Query Wizard. However, in this section we demonstrate the building of simple select queries from scratch using Design View, without the use of the Wizards. After reading this section, the interested reader may want to experiment with the available Wizards to determine their usefulness. When we begin to build the query from scratch, the Select Query window opens and displays a dialog box, which in our example lists the tables and queries in the DreamHome database. We then select the tables and/or queries that contain the data that we want to add to the query. 
The Select Query window is a graphical Query-By-Example (QBE) tool. Because of its graphical features, we can use a mouse to select, drag, or manipulate objects in the window to define an example of the records we want to see. We specify the fields and records we want to include in the query in the QBE grid. When we create a query using the QBE design grid, behind the scenes Microsoft Office Access constructs the equivalent SQL statement. We can view or edit the SQL statement in SQL view. Throughout this chapter, we display the equivalent SQL statement for 7.2 202 | Chapter 7 z Query-By-Example every query built using the QBE grid or with the help of a Wizard (as demonstrated in later sections of this chapter). Note that many of the Microsoft Office Access SQL statements displayed throughout this chapter do not comply with the SQL standard presented in Chapters 5 and 6. 7.2.1 Specifying Criteria Criteria are restrictions we place on a query to identify the specific fields or records we want to work with. For example, to view only the property number, city, type, and rent of all properties in the PropertyForRent table, we construct the QBE grid shown in Figure 7.3(a). When this select query is run, the retrieved data is displayed as a datasheet of the selected fields of the PropertyForRent table, as shown in Figure 7.3(b). The equivalent SQL statement for the QBE grid shown in Figure 7.3(a) is given in Figure 7.3(c). Note that in Figure 7.3(a) we show the complete Select Query window with the target table, namely PropertyForRent, displayed above the QBE grid. In some of the examples that follow, we show only the QBE grid where the target table(s) can be easily inferred from the fields displayed in the grid. We can add additional criteria to the query shown in Figure 7.3(a) to view only properties in Glasgow. To do this, we specify criteria that limits the results to records whose city field contains the value ‘Glasgow’ by entering this value in the Criteria cell for the city field of the QBE grid. We can enter additional criteria for the same field or different fields. When we enter expressions in more than one Criteria cell, Microsoft Office Access combines them using either: n n the And operator, if the expressions are in different cells in the same row, which means only the records that meet the criteria in all the cells will be returned; the Or operator, if the expressions are in different rows of the design grid, which means records that meet criteria in any of the cells will be returned. For example, to view properties in Glasgow with a rent between £350 and £450, we enter ‘Glasgow’ into the Criteria cell of the city field and enter the expression ‘Between 350 And 450’ in the Criteria cell of the rent field. The construction of this QBE grid is shown in Figure 7.4(a) and the resulting datasheet containing the records that satisfy the criteria is shown in Figure 7.4(b). The equivalent SQL statement for the QBE grid is shown in Figure 7.4(c). Suppose that we now want to alter this query to also view all properties in Aberdeen. We enter ‘Aberdeen’ into the or row below ‘Glasgow’ in the city field. The construction of this QBE grid is shown in Figure 7.5(a) and the resulting datasheet containing the records that satisfy the criteria is shown in Figure 7.5(b). The equivalent SQL statement for the QBE grid is given in Figure 7.5(c). 
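In essence, this grid combines the two criteria rows with And and Or as described above. A sketch of the corresponding statement (Office Access itself generates fully qualified, bracketed field names and slightly different punctuation) is:

SELECT propertyNo, city, type, rent
FROM PropertyForRent
WHERE (city = 'Glasgow' AND rent BETWEEN 350 AND 450)
OR (city = 'Aberdeen');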
Note that in this case, the records retrieved by this query satisfy the criteria ‘Glasgow’ in the city field And ‘Between 350 And 450’ in the rent field Or alternatively only ‘Aberdeen’ in the city field. We can use wildcard characters or the LIKE operator to specify a value we want to find and we either know only part of the value or want to find values that start with a specific 7.2 Building Select Queries Using QBE Figure 7.3 (a) QBE grid to retrieve the propertyNo, city, type, and rent fields of the PropertyForRent table; (b) resulting datasheet; (c) equivalent SQL statement. | 203 204 | Chapter 7 z Query-By-Example Figure 7.4 (a) QBE grid of select query to retrieve the properties in Glasgow with a rent between £350 and £450; (b) resulting datasheet; (c) equivalent SQL statement. letter or match a certain pattern. For example, if we want to search for properties in Glasgow but we are unsure of the exact spelling for ‘Glasgow’, we can enter ‘LIKE Glasgo’ into the Criteria cell of the city field. Alternatively, we can use wildcard characters to perform the same search. For example, if we were unsure about the number of characters in the correct spelling of ‘Glasgow’, we could enter ‘Glasg*’ as the criteria. The wildcard (*) specifies an unknown number of characters. On the other hand, if we did know the number of characters in the correct spelling of ‘Glasgow’, we could enter ‘Glasg??’. The wildcard (?) specifies a single unknown character. 7.2.2 Creating Multi-Table Queries In a database that is correctly normalized, related data may be stored in several tables. It is therefore essential that in answering a query, the DBMS is capable of joining related data stored in different tables. 7.2 Building Select Queries Using QBE To bring together the data that we need from multiple tables, we create a multi-table select query with the tables and/or queries that contain the data we require in the QBE grid. For example, to view the first and last names of owners and the property number and city of their properties, we construct the QBE grid shown in Figure 7.6(a). The target tables for this query, namely PrivateOwner and PropertyForRent, are displayed above the grid. The PrivateOwner table provides the fName and lName fields and the PropertyForRent table provides the propertyNo and city fields. When this query is run the resulting datasheet is displayed, as in Figure 7.6(b). The equivalent SQL statement for the QBE grid is given in Figure 7.6(c). The multi-table query shown in Figure 7.6 is an example of an Inner (natural) join, which we discussed in detail in Sections 4.1.3 and 5.3.7. When we add more than one table or query to a select query, we need to make sure that the field lists are joined to each other with a join line so that Microsoft Office Access knows how to join the tables. In Figure 7.6(a), note that Microsoft Office Access displays a ‘1’ above the join line to show which table is on the ‘one’ side of a one-to-many relationship and an infinity symbol ‘∞’ to show which table is on the ‘many’ side. In our example, ‘one’ owner has ‘many’ properties for rent. | 205 Figure 7.5 (a) QBE grid of select query to retrieve the properties in Glasgow with a rent between £350 and £450 and all properties in Aberdeen; (b) resulting datasheet; (c) equivalent SQL statement. 
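In SQL terms, the multi-table query of Figure 7.6 joins the two tables on their common ownerNo field. A sketch of the statement (Office Access generates it with bracketed, fully qualified names, but the essential form is the same) is:

SELECT o.fName, o.lName, p.propertyNo, p.city
FROM PrivateOwner AS o INNER JOIN PropertyForRent AS p
ON o.ownerNo = p.ownerNo;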
206 | Chapter 7 z Query-By-Example Figure 7.6 (a) QBE grid of multi-table query to retrieve the first and last names of owners and the property number and city of their properties; (b) resulting datasheet; (c) equivalent SQL statement. 7.2 Building Select Queries Using QBE Microsoft Office Access automatically displays a join line between tables in the QBE grid if they contain a common field. However, the join line is only shown with symbols if a relationship has been previously established between the tables. We describe how to set up relationships between tables in Chapter 8. In the example shown in Figure 7.6, the ownerNo field is the common field in the PrivateOwner and PropertyForRent tables. For the join to work, the two fields must contain matching data in related records. Microsoft Office Access will not automatically join tables if the related data is in fields with different names. However, we can identify the common fields in the two tables by joining the tables in the QBE grid when we create the query. Calculating Totals It is often useful to ask questions about groups of data such as: n n n What is the total number of properties for rent in each city? What is the average salary for staff? How many viewings has each property for rent had since the start of this year? We can perform calculations on groups of records using totals queries (also called aggregate queries). Microsoft Office Access provides various types of aggregate function including Sum, Avg, Min, Max, and Count. To access these functions, we change the query type to Totals, which results in the display of an additional row called Total in the QBE grid. When a totals query is run, the resulting datasheet is a snapshot, a set of records that is not updatable. As with other queries, we may also want to specify criteria in a query that includes totals. For example, suppose that we want to view the total number of properties for rent in each city. This requires that the query first groups the properties according to the city field using Group By and then performs the totals calculation using Count for each group. The construction of the QBE grid to perform this calculation is shown in Figure 7.7(a) and the resulting datasheet in Figure 7.7(b). The equivalent SQL statement is given in Figure 7.7(c). For some calculations it is necessary to create our own expressions. For example, suppose that we want to calculate the yearly rent for each property in the PropertyForRent table retrieving only the propertyNo, city, and type fields. The yearly rent is calculated as twelve times the monthly rent for each property. We enter ‘Yearly Rent: [rent]*12’ into a new field of the QBE grid, as shown in Figure 7.8(a). The ‘Yearly Rent:’ part of the expression provides the name for the new field and ‘[rent]*12’ calculates a yearly rent value for each property using the monthly values in the rent field. The resulting datasheet for this select query is shown in Figure 7.8(b) and the equivalent SQL statement in Figure 7.8(c). 7.2.3 | 207 208 | Chapter 7 z Query-By-Example Figure 7.7 (a) QBE grid of totals query to calculate the number of properties for rent in each city; (b) resulting datasheet; (c) equivalent SQL statement. 7.3 Using Advanced Queries Microsoft Office Access provides a range of advanced queries. In this section, we describe some of the most useful examples of those queries including: n n n n parameter queries; crosstab queries; Find Duplicates queries; Find Unmatched queries. 
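Before examining these query types in turn, it is worth pausing to note the form of SQL that lies behind the totals and calculated-field grids of Section 7.2.3. The following sketches correspond in essence to Figures 7.7(c) and 7.8(c); the alias name PropertyCount is assumed for illustration:

SELECT city, Count(propertyNo) AS PropertyCount
FROM PropertyForRent
GROUP BY city;

SELECT propertyNo, city, type, rent * 12 AS [Yearly Rent]
FROM PropertyForRent;

The GROUP BY clause implements the Group By entries of the totals grid, and the expression rent * 12 plays the role of the 'Yearly Rent: [rent]*12' entry in the QBE grid.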
7.3.1 Parameter Query A parameter query displays one or more predefined dialog boxes that prompt the user for the parameter value(s) (criteria). Parameter queries are created by entering a prompt enclosed in square brackets in the Criteria cell for each field we want to use as a parameter. For example, suppose that we want to amend the select query shown in Figure 7.6(a) to first prompt for the owner’s first and last name before retrieving the property number and city 7.3 Using Advanced Queries Figure 7.8 statement. | 209 (a) QBE grid of select query to calculate the yearly rent for each property; (b) resulting datasheet; (c) equivalent SQL of his or her properties. The QBE grid for this parameter query is shown in Figure 7.9(a). To retrieve the property details for an owner called ‘Carol Farrel’, we enter the appropriate values into the first and second dialog boxes as shown in Figure 7.9(b), which results in the display of the resulting datasheet shown in Figure 7.9(c). The equivalent SQL statement is given in Figure 7.9(d). Crosstab Query A crosstab query can be used to summarize data in a compact spreadsheet format. This format enables users of large amounts of summary data to more easily identify trends and 7.3.2 210 | Chapter 7 z Query-By-Example Figure 7.9 (a) QBE grid of example parameter query; (b) dialog boxes for first and last name of owner; (c) resulting datasheet; (d) equivalent SQL statement. to make comparisons. When a crosstab query is run, it returns a snapshot. We can create a crosstab query using the CrossTab Query Wizard or build the query from scratch using the QBE grid. Creating a crosstab query is similar to creating a query with totals, but we must specify the fields to be used as row headings, column headings, and the fields that are to supply the values. For example, suppose that we want to know for each member of staff the total number of properties that he or she manages for each type of property. For the purposes of this 7.3 Using Advanced Queries example, we have appended additional property records into the PropertyForRent table to more clearly demonstrate the value of crosstab queries. To answer this question, we first design a totals query, as shown in Figure 7.10(a), which creates the datasheet shown in Figure 7.10(b). The equivalent SQL statement for the totals query is given in Figure 7.10(c). Note that the layout of the resulting datasheet makes it difficult to make comparisons between staff. | 211 Figure 7.10 (a) QBE grid of example totals query; (b) resulting datasheet; (c) equivalent SQL statement. 212 | Chapter 7 z Query-By-Example Figure 7.11 (a) QBE grid of example crosstab query; (b) resulting datasheet; (c) equivalent SQL statement. To convert the select query into a crosstab query, we change the type of query to Crosstab, which results in the addition of the Crosstab row in the QBE grid. We then identify the fields to be used for row headings, column headings, and to supply the values, as shown in Figure 7.11(a). When we run this query, the datasheet is displayed in a more compact layout, as illustrated in Figure 7.11(b). In this format, we can easily compare figures between staff. The equivalent SQL statement for the crosstab query is given in Figure 7.11(c). The TRANSFORM statement is not supported by standard SQL but is an extension of Microsoft Office Access SQL. 
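The general shape of an Office Access crosstab statement is worth noting. A sketch of the statement behind the crosstab query of Figure 7.11 (the statement in Figure 7.11(c) may differ in detail) is:

TRANSFORM Count(propertyNo)
SELECT staffNo
FROM PropertyForRent
GROUP BY staffNo
PIVOT type;

Here the TRANSFORM clause names the aggregate that supplies the values, the GROUP BY field provides the row headings, and the PIVOT field provides the column headings.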
7.3.3 Find Duplicates Query The Find Duplicates Query Wizard shown in Figure 7.2 can be used to determine if there are duplicate records in a table or determine which records in a table share the same value. For example, it is possible to search for duplicate values in the fName and lName fields to 7.3 Using Advanced Queries determine if we have duplicate records for the same property owners, or to search for duplicate values in a city field to see which owners are in the same city. Suppose that we have inadvertently created a duplicate record for the property owner called ‘Carol Farrel’ and given this record a unique owner number. The database therefore contains two records with different owner numbers, representing the same owner. We can use the Find Duplicates Query Wizard to identify the duplicated property owner records using (for simplicity) only the values in the fName and lName fields. As discussed earlier, the Wizard simply constructs the query based on our answers. Before viewing the results of the query we can view the QBE grid for the Find Duplicates query shown in Figure 7.12(a). The resulting datasheet for the Find Duplicates query is shown in 7.12(b) displaying the two records representing the same property owner called ‘Carol Farrel’. The equivalent SQL statement is given in Figure 7.12(c). Note that this SQL statement displays in full the inner SELECT SQL statement that is partially visible in the Criteria row of the fName field shown in Figure 7.12(a). | 213 Figure 7.12 (a) QBE for example Find Duplicates query; (b) resulting datasheet; (c) equivalent SQL statement. 214 | Chapter 7 z Query-By-Example 7.3.4 Find Unmatched Query The Find Unmatched Query Wizard shown in Figure 7.2 can be used to find records in one table that do not have related records in another table. For example, we can find clients who have not viewed properties for rent by comparing the records in the Client and Viewing tables. The Wizard constructs the query based on our answers. Before viewing the results of the query, we can view the QBE grid for the Find Unmatched query, as shown in Figure 7.13(a). The resulting datasheet for the Find Unmatched query is shown in 7.13(b) indicating that there are no records in the Viewing table that relate to ‘Mike Ritchie’ in the Client table. Note that the Show box of the clientNo field in the QBE grid is not ticked Figure 7.13 (a) QBE grid of example Find Unmatched query; (b) resulting datasheet; (c) equivalent SQL statement. 7.4 Changing the Content of Tables Using Action Queries as this field is not required in the datasheet. The equivalent SQL statement for the QBE grid is given in Figure 7.13(c). The Find Unmatched query is an example of a Left Outer join, which we discussed in detail in Sections 4.1.3 and 5.3.7. Autolookup Query 7.3.5 An autolookup query can be used to automatically fill in certain field values for a new record. When we enter a value in the join field in the query or in a form based on the query, Microsoft Office Access looks up and fills in existing data related to that value. For example, if we know the value in the join field (staffNo) between the PropertyForRent table and the Staff table, we can enter the staff number and have Microsoft Office Access enter the rest of the data for that member of staff. If no matching data is found, Microsoft Office Access displays an error message. To create an autolookup query, we add two tables that have a one-to-many relationship and select fields for the query into the QBE grid. 
The join field must be selected from the ‘many’ side of the relationship. For example, in a query that includes fields from the PropertyForRent and Staff tables, we drag the staffNo field (foreign key) from the PropertyForRent table to the design grid. The QBE grid for this autolookup query is shown in Figure 7.14(a). Figure 7.14(b) displays a datasheet based on this query that allows us to enter the property number, street, and city for a new property record. When we enter the staff number of the member of staff responsible for the management of the property, for example ‘SA9’, Microsoft Office Access looks up the Staff table and automatically fills in the first and last name of the member of staff, which in this case is ‘Mary Howe’. Figure 7.14(c) displays the equivalent SQL statement for the QBE grid of the autolookup query. Changing the Content of Tables Using Action Queries 7.4 When we create a query, Microsoft Office Access creates a select query unless we choose a different type from the Query menu. When we run a select query, Microsoft Office Access displays the resulting datasheet. As the datasheet is updatable, we can make changes to the data; however, we must make the changes record by record. If we require a large number of similar changes, we can save time by using an action query. An action query allows us to make changes to many records at the same time. There are four types of action query: make-table, delete, update, and append. Make-Table Action Query The make-table action query creates a new table from all or part of the data in one or more tables. The newly created table can be saved to the currently opened database or exported to another database. Note that the data in the new table does not inherit the field properties including the primary key from the original table, which needs to be set 7.4.1 | 215 216 | Chapter 7 z Query-By-Example Figure 7.14 (a) QBE grid of example autolookup query; (b) datasheet based on autolookup query; (c) equivalent SQL statement. 7.4 Changing the Content of Tables Using Action Queries manually. Make-table queries are useful for several reasons including the ability to archive historic data, create snapshot reports, and to improve the performance of forms and reports based on multi-table queries. Suppose we want to create a new table called StaffCut, containing only the staffNo, fName, lName, position, and salary fields of the original Staff table. We first design a query to target the required fields of the Staff table. We then change the query type in Design View to Make-Table and a dialog box is displayed. The dialog box prompts for the name and location of the new table, as shown in Figure 7.15(a). Figure 7.15(b) displays the QBE grid for this make-table action query. When we run the query, a warning message asks whether we want to continue with the make-table operation, as shown in Figure 7.15(c). If we continue, the new table StaffCut is created, as shown in Figure 7.15(d). Figure 7.15(e) displays the equivalent SQL statement for this make-table action query. Delete Action Query 7.4.2 The delete action query deletes a group of records from one or more tables. We can use a single delete query to delete records from a single table, from multiple tables in a oneto-one relationship, or from multiple tables in a one-to-many relationship with referential integrity set to allow cascading deletes. For example, suppose that we want to delete all properties for rent in Glasgow and the associated viewings records. 
To perform this deletion, we first create a query that targets the appropriate records in the PropertyForRent table. We then change the query type in Design View to Delete. The QBE grid for this delete action query is shown in Figure 7.16(a). As the PropertyForRent and Viewing tables have a one-to-many relationship with referential integrity set to the Cascade Delete Related Records option, all the associated viewings records for the properties in Glasgow will also be deleted. When we run the delete action query, a warning message asks whether or not we want to continue with the deletion, as shown in Figure 7.16(b). If we continue, the selected records are deleted from the PropertyForRent table and the related records from the Viewing table, as shown in Figure 7.16(c). Figure 7.16(d) displays the equivalent SQL statement for this delete action query. Update Action Query An update action query makes global changes to a group of records in one or more tables. For example, suppose we want to increase the rent of all properties by 10%. To perform this update, we first create a query that targets the PropertyForRent table. We then change the query type in Design View to Update. We enter the expression ‘[Rent]*1.1’ in the Update To cell for the rent field, as shown in Figure 7.17(a). When we run the query, a warning message asks whether or not we want to continue with the update, as shown in Figure 7.17(b). If we continue, the rent field of PropertyForRent table is updated, as shown in Figure 7.17(c). Figure 7.17(d) displays the equivalent SQL statement for this update action query. 7.4.3 | 217 218 | Chapter 7 z Query-By-Example Figure 7.15 (a) Make-Table dialog box; (b) QBE grid of example make-table query; (c) warning message; (d) resulting datasheet; (e) equivalent SQL statement. 7.4 Changing the Content of Tables Using Action Queries | 219 Figure 7.16 (a) QBE grid of example delete action query; (b) warning message; (c) resulting PropertyForRent and Viewing datasheets with records deleted; (d) equivalent SQL statement. 220 | Chapter 7 z Query-By-Example Figure 7.17 (a) QBE grid of example update action query; (b) warning message; (c) resulting datasheet; (d) equivalent SQL statement. 7.4 Changing the Content of Tables Using Action Queries Append Action Query We use an append action query to insert records from one or more source tables into a single target table. We can append records to a table in the same database or in another database. Append queries are also useful when we want to append fields based on criteria or even when some of the fields do not exist in the other table. For example, suppose that we want to insert the details of new owners of property for rent into the PrivateOwner table. Assume that the details of these new owners are contained in a table called NewOwner with only the ownerNo, fName, lName, and the address fields. Furthermore, we want to append only new owners located in Glasgow into the PrivateOwner table. In this example, the PrivateOwner table is the target table and the NewOwner table is the source table. To create an append action query, we first design a query that targets the appropriate records of the NewOwner table. We change the type of query to Append and a dialog box is displayed, which prompts for the name and location of the target table, as shown in Figure 7.18(a). The QBE grid for this append action query is shown in Figure 7.18(b). 
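In SQL terms, the grid corresponds roughly to an INSERT . . . SELECT statement of the following form (a sketch only; the statement that Office Access actually generates is shown in Figure 7.18(e), Office Access itself uses the * wildcard rather than % in LIKE patterns, and the test on the address field is simply one way of restricting the rows to owners located in Glasgow):

-- append only the new owners located in Glasgow to the PrivateOwner table
INSERT INTO PrivateOwner (ownerNo, fName, lName, address)
SELECT ownerNo, fName, lName, address
FROM NewOwner
WHERE address LIKE '%Glasgow%';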
When we run the query, a warning message asks whether we want to continue with the append operation, as shown in Figure 7.18(c). If we continue, the two records of owners located in Glasgow in the NewOwner table are appended to the PrivateOwner table, as given in Figure 7.18(d). The equivalent SQL statement for the append action query is shown in Figure 7.18(e). 7.4.4 | 221 222 | Chapter 7 z Query-By-Example Figure 7.18 (a) Append dialog box; (b) QBE grid of example append action query; (c) warning message; (d) the NewOwner table and the PrivateOwner table with the newly appended records; (e) equivalent SQL statement. 7.4 Changing the Content of Tables Using Action Queries | 223 224 | Chapter 7 z Query-By-Example Exercises 7.1 Create the sample tables of the DreamHome case study shown in Figure 3.3 and carry out the exercises demonstrated in this chapter, using (where possible) the QBE facility of your DBMS. 7.2 Create the following additional select QBE queries for the sample tables of the DreamHome case study, using (where possible) the QBE facility of your DBMS. (a) (b) (c) (d) (e) (f) (g) 7.3 Retrieve the branch number and address for all branch offices. Retrieve the staff number, position, and salary for all members of staff working at branch office B003. Retrieve the details of all flats in Glasgow. Retrieve the details of all female members of staff who are older than 25 years old. Retrieve the full name and telephone of all clients who have viewed flats in Glasgow. Retrieve the total number of properties, according to property type. Retrieve the total number of staff working at each branch office, ordered by branch number. Create the following additional advanced QBE queries for the sample tables of the DreamHome case study, using (where possible) the QBE facility of your DBMS. (a) Create a parameter query that prompts for a property number and then displays the details of that property. (b) Create a parameter query that prompts for the first and last names of a member of staff and then displays the details of the property that the member of staff is responsible for. (c) Add several more records into the PropertyForRent tables to reflect the fact that property owners ‘Carol Farrel’ and ‘Tony Shaw’ now own many properties in several cities. Create a select query to display for each owner, the number of properties that he or she owns in each city. Now, convert the select query into a crosstab query and assess whether the display is more or less useful when comparing the number of properties owned by each owner in each city. (d) Introduce an error into your Staff table by entering an additional record for the member of staff called ‘David Ford’ with a new staff number. Use the Find Duplicates query to identify this error. (e) Use the Find Unmatched query to identify those members of staff who are not assigned to manage property. (f) Create an autolookup query that fills in the details of an owner, when a new property record is entered into the PropertyForRent table and the owner of the property already exists in the database. 7.4 Use action queries to carry out the following tasks on the sample tables of the DreamHome cases study, using (where possible) the QBE facility of your DBMS. (a) Create a cut-down version of the PropertyForRent table called PropertyGlasgow, which has the propertyNo, street, postcode, and type fields of the original table and contains only the details of properties in Glasgow. 
(b) Remove all records of property viewings that do not have an entry in the comment field. (c) Update the salary of all members of staff, except Managers, by 12.5%. (d) Create a table called NewClient that contains the details of new clients. Append this data into the original Client table. 7.5 Using the sample tables of the DreamHome case study, create equivalent QBE queries for the SQL examples given in Chapter 5. Chapter 8 Commercial RDBMSs: Office Access and Oracle Chapter Objectives In this chapter you will learn: n About Microsoft Office Access 2003: – the DBMS architecture; – how to create base tables and relationships; – how to create general constraints; – how to use forms and reports; – how to use macros. n About Oracle9i: – the DBMS architecture; – how to create base tables and relationships; – how to create general constraints; – how to use PL /SQL; – how to create and use stored procedures and functions; – how to create and use triggers; – how to create forms and reports; – support for grid computing. As we mentioned in Chapter 3, the Relational Database Management System (RDBMS) has become the dominant data-processing software in use today, with estimated new licence sales of between US$6 billion and US$10 billion per year (US$25 billion with tools sales included). There are many hundreds of RDBMSs on the market. For many users, the process of selecting the best DBMS package can be a difficult task, and in the next chapter we present a summary of the main features that should be considered when selecting a DBMS package. In this chapter, we consider two of the most widely used RDBMSs: Microsoft Office Access and Oracle. In each case, we use the terminology of the particular DBMS (which does not conform to the formal relational terminology we introduced in Chapter 3). 226 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle 8.1 Microsoft Office Access 2003 Microsoft Office Access is the mostly widely used relational DBMS for the Microsoft Windows environment. It is a typical PC-based DBMS capable of storing, sorting, and retrieving data for a variety of applications. Access provides a Graphical User Interface (GUI) to create tables, queries, forms, and reports, and tools to develop customized database applications using the Microsoft Office Access macro language or the Microsoft Visual Basic for Applications (VBA) language. In addition, Office Access provides programs, called Wizards, to simplify many of the processes of building a database application by taking the user through a series of question-and-answer dialog boxes. It also provides Builders to help the user build syntactically correct expressions, such as those required in SQL statements and macros. Office Access supports much of the SQL standard presented in Chapters 5 and 6, and the Microsoft Open Database Connectivity (ODBC) standard, which provides a common interface for accessing heterogeneous SQL databases, such as Oracle and Informix. We discuss ODBC in more detail in Appendix E. To start the presentation of Microsoft Office Access, we first introduce the objects that can be created to help develop a database application. 8.1.1 Objects The user interacts with Microsoft Office Access and develops a database application using a number of objects: n n n n n n n Tables The base tables that make up the database. Using the Microsoft terminology, a table is organized into columns (called fields) and rows (called records). Queries Allow the user to view, change, and analyze data in different ways. 
Queries can also be stored and used as the source of records for forms, reports, and data access pages. We examined queries in some detail in the previous chapter. Forms Can be used for a variety of purposes such as to create a data entry form to enter data into a table. Reports Allow data in the database to be presented in an effective way in a customized printed format. Pages A (data access) page is a special type of Web page designed for viewing and working with data (stored in a Microsoft Office Access database or a Microsoft SQL Server database) from the Internet or an intranet. The data access page may also include data from other sources, such as Microsoft Excel. Macros A set of one or more actions each of which performs a particular operation, such as opening a form or printing a report. Macros can help automate common tasks such as printing a report when a user clicks a button. Modules A collection of VBA declarations and procedures that are stored together as a unit. Before we discuss these objects in more detail, we first examine the architecture of Microsoft Office Access. 8.1 Microsoft Office Access 2003 Microsoft Office Access Architecture Microsoft Office Access can be used as a standalone system on a single PC or as a multiuser system on a PC network. Since the release of Access 2000, there is a choice of two data engines† in the product: the original Jet engine and the new Microsoft SQL Server Desktop Engine (MSDE, previously the Microsoft Data Engine), which is compatible with Microsoft’s backoffice SQL Server. The Jet engine stores all the application data, such as tables, indexes, queries, forms, and reports, in a single Microsoft database (.mdb) file, based on the ISAM (Indexed Sequential Access Method) organization (see Appendix C). MSDE is based on the same data engine as SQL Server, enabling users to write one application that scales from a PC running Windows 95 to multiprocessor clusters running Windows Server 2003. MSDE also provides a migration path to allow users to subsequently upgrade to SQL Server. However, unlike SQL Server, MSDE has a 2 gigabyte database size limit. Microsoft Office Access, like SQL Server, divides the data stored in its table structures into 2 kilobyte data pages, corresponding to the size of a conventional DOS fixed-disk file cluster. Each page contains one or more records. A record cannot span more than a single page, although Memo and OLE Object fields can be stored in pages separate from the rest of the record. Office Access uses variable-length records as the standard method of storage and allows records to be ordered by the use of an index, such as a primary key. Using variable length, each record occupies only the space required to store its actual data. A header is added to each page to create a linked list of data pages. The header contains a pointer to the page that precedes it and another pointer to the page that follows. If no indexes are in use, new data is added to the last page of the table until the page is full, and then another page is added at the end. One advantage of data pages with their own header is that a table’s data pages can be kept in ISAM order by altering the pointers in the page header, and not the structure of the file itself. Multi-user support Microsoft Office Access provides four main ways of working with a database that is shared among users on a network: n n † File-server solutions An Office Access database is placed on a network so that multiple users can share it. 
In this case, each workstation runs a copy of the Office Access application. Client–server solutions In earlier versions of Office Access, the only way to achieve this was to create linked tables that used an ODBC driver to link to a database such as SQL Server. Since Access 2000, an Access Project (.adp) File can also be created, which can store forms, reports, macros, and VBA modules locally and can connect to a remote SQL Server database using OLE DB (Object Linking and Embedding for Databases) to display and work with tables, views, relationships, and stored procedures. As mentioned above, MSDE can also be used to achieve this type of solution. A ‘data engine’ or ‘database engine’ is the core process that a DBMS uses to store and maintain data. 8.1.2 | 227 228 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle n Database replication solutions These allow data or database design changes to be shared between copies of an Office Access database in different locations without having to redistribute copies of the entire database. Replication involves producing one or more copies, called replicas, of a single original database, called the Design Master. Together, the Design Master and its replicas are called a replica set. By performing a process called synchronization, changes to objects and data are distributed to all members of the replica set. Changes to the design of objects can only be made in the Design Master, but changes to data can be made from any member of the replica set. We discuss replication in Chapter 24. n Web-based database solutions A browser displays one or more data access pages that dynamically link to a shared Office Access or SQL Server database. These pages have to be displayed by Internet Explorer 5 or later. We discuss this solution in Section 29.10.5. When a database resides on a file server, the operating system’s locking primitives are used to lock pages when a table record is being updated. In a multi-user environment, Jet uses a locking database (.ldb) file to store information on which records are locked and which user has them locked. The locking database file is created when a database is opened for shared access. We discuss locking in detail in Section 20.2. 8.1.3 Table Definition Microsoft Office Access provides five ways to create a blank (empty) table: n n n n n Use the Database Wizard to create in one operation all the tables, forms, and reports that are required for the entire database. The Database Wizard creates a new database, although this particular wizard cannot be used to add new tables, forms, or reports to an existing database. Use the Table Wizard to choose the fields for the table from a variety of predefined tables such as business contacts, household inventory, or medical records. Enter data directly into a blank table (called a datasheet). When the new datasheet is saved, Office Access will analyze the data and automatically assign the appropriate data type and format for each field. Use Design View to specify all table details from scratch. Use the CREATE TABLE statement in SQL View. Creating a blank table in Microsoft Office Access using SQL In Section 6.3.2 we examined the SQL CREATE TABLE statement that allows users to create a table. Microsoft Office Access 2003 does not fully comply with the SQL standard and the Office Access CREATE TABLE statement has no support for the DEFAULT and CHECK clauses. However, default values and certain enterprise constraints can still be specified outside SQL, as we see shortly. 
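To give a feel for the dialect, the following sketch shows roughly what a CREATE TABLE statement for the PropertyForRent table can look like when typed into SQL View (the field sizes are illustrative only, and the statement actually used in the book is the one shown in Figure 8.1):

CREATE TABLE PropertyForRent (
    propertyNo  TEXT(5)   NOT NULL,
    street      TEXT(25)  NOT NULL,
    city        TEXT(15)  NOT NULL,
    postcode    TEXT(8),
    type        TEXT(1)   NOT NULL,
    rooms       BYTE      NOT NULL,
    rent        CURRENCY  NOT NULL,
    ownerNo     TEXT(5)   NOT NULL,
    staffNo     TEXT(5),
    branchNo    TEXT(4)   NOT NULL,
    CONSTRAINT propertyForRentPK PRIMARY KEY (propertyNo));

Notice that there is no DEFAULT or CHECK clause in the statement; as noted above, these constraints have to be specified through the field and table properties instead.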
In addition, the data types are slightly different from the SQL standard, as shown in Table 8.1. In Example 6.1 in Chapter 6 we showed how to create the PropertyForRent table in SQL. Figure 8.1 shows the SQL View with the equivalent statement in Office Access.

Figure 8.1 SQL View showing creation of the PropertyForRent table.

Table 8.1 Microsoft Office Access data types.

Text: Text or text/numbers. Also numbers that do not require calculations, such as telephone numbers. Corresponds to the SQL character data type (see Section 6.1.2). Size: up to 255 characters.
Memo: Lengthy text and numbers, such as notes or descriptions. Size: up to 65,536 characters.
Number: Numeric data to be used for mathematical calculations, except calculations involving money (use Currency type). Corresponds to the SQL exact numeric and approximate numeric data type (see Section 6.1.2). Size: 1, 2, 4, or 8 bytes (16 bytes for Replication ID).
Date/Time: Dates and times. Corresponds to the SQL datetime data type (see Section 6.1.2). Size: 8 bytes.
Currency: Currency values. Use the Currency data type to prevent rounding off during calculations. Size: 8 bytes.
Autonumber: Unique sequential (incrementing by 1) or random numbers automatically inserted when a record is added. Size: 4 bytes (16 bytes for Replication ID).
Yes/No: Fields that will contain only one of two values, such as Yes/No, True/False, On/Off. Corresponds to the SQL bit data type (see Section 6.1.2). Size: 1 bit.
OLE Object: Objects (such as Microsoft Word documents, Microsoft Excel spreadsheets, pictures, sounds, or other binary data), created in other programs using the OLE protocol, which can be linked to, or embedded in, a Microsoft Office Access table. Size: up to 1 gigabyte.
Hyperlink: Field that will store hyperlinks. Size: up to 64,000 characters.
Lookup Wizard: Creates a field that allows the user to choose a value from another table or from a list of values using a combo box. Choosing this option in the data type list starts a wizard to define this. Size: same size as the primary key that forms the lookup field (typically 4 bytes).

Creating a blank table in Microsoft Office Access using Design View

Figure 8.2 Design View showing creation of the PropertyForRent table.

Figure 8.2 shows the creation of the PropertyForRent table in Design View. Regardless of which method is used to create a table, table Design View can be used at any time to customize the table further, such as adding new fields, setting default values, or creating input masks. Microsoft Office Access provides facilities for adding constraints to a table through the Field Properties section of the table Design View. Each field has a set of properties that are used to customize how data in a field is stored, managed, or displayed. For example, we can control the maximum number of characters that can be entered into a Text field by setting its Field Size property. The data type of a field determines the properties that are available for that field. Setting field properties in Design View ensures that the fields have consistent settings when used at a later stage to build forms and reports. We now briefly discuss each of the field properties.

Field Size property

The Field Size property is used to set the maximum size for data that can be stored in a field of type Text, Number, and AutoNumber.
For example, the Field Size property of the propertyNo field (Text) is set to 5 characters, and the Field Size property for the rooms field (Number) is set to Byte to store whole numbers from 0 to 255, as shown in Figure 8.2. In addition to Byte, the valid values for the Number data type are: n n Integer – 16-bit integer (values between −32,768 and 32,767); Long integer – 32 bit integer; 8.1 Microsoft Office Access 2003 n n n n Single – floating point 32-bit representation; Double – floating point 64-bit representation; Replication ID – 128-bit identifier, unique for each record, even in a distributed system; Decimal – floating point number with a precision and scale. Format property The Format property is used to customize the way that numbers, dates, times, and text are displayed and printed. Microsoft Office Access provides a range of formats for the display of different data types. For example, a field with a Date/Time data type can display dates in various formats including Short Date, Medium Date, and Long Date. The date 1st November 1933 can be displayed as 01/11/33 (Short Date), 01-Nov-33 (Medium Date), or 1 November 1933 (Long Date). Decimal Places property The Decimal Places property is used to specify the number of decimal places to be used when displaying numbers (this does not actually affect the number of decimal places used to store the number). Input Mask property Input masks assist the process of data entry by controlling the format of the data as it is entered into the table. A mask determines the type of character allowed for each position of a field. Input masks can simplify data entry by automatically entering special formatted characters when required and generating error messages when incorrect entries are attempted. Microsoft Office Access provides a range of input mask characters to control data entry. For example, the values to be entered into the propertyNo field have a specific format: the first character is ‘P’ for property, the second character is an upper-case letter and the third, fourth, and fifth characters are numeric. The fourth and fifth characters are optional and are used only when required (for example, property numbers include PA9, PG21, PL306). The input mask used in this case is ‘\P >L099’: n n n ‘\’ causes the character that follows to be displayed as the literal character (for example, \P is displayed as just P); ‘>L’ causes the letter that follows P to be converted to upper case; ‘0’ specifies that a digit must follow and ‘9’ specifies optional entry for a digit or space. Caption property The Caption property is used to provide a fuller description of a field name or useful information to the user through captions on objects in various views. For example, if we enter ‘Property Number’ into the Caption property of the propertyNo field, the column heading ‘Property Number’ will be displayed for the table in Datasheet View and not the field name, ‘propertyNo’. Default Value property To speed up and reduce possible errors in data entry, we can assign default values to specify a value that is automatically entered in a field when a new record is created. For | 231 232 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle example, the average number of rooms in a single property is four, therefore we set ‘4’ as the default value for the rooms field, as shown in Figure 8.2. Validation Rule/ Validation Text properties The Validation Rule property is used to specify constraints for data entered into a field. 
When data is entered that violates the Validation Rule setting, the Validation Text property is used to specify the warning message that is displayed. Validation rules can also be used to set a range of allowable values for numeric or date fields. This reduces the amount of errors that may occur when records are being entered into the table. For example, the number of rooms in a property ranges from a minimum of 1 to a maximum of 15. The validation rule and text for the rooms field are shown in Figure 8.2. Required property Required fields must hold a value in every record. If this property is set to ‘Yes’, we must enter a value in the required field and the value cannot be null. Therefore, setting the Required property is equivalent to the NOT NULL constraint in SQL (see Section 6.2.1). Primary key fields should always be implemented as required fields. Allow Zero Length property The Allow Zero Length property is used to specify whether a zero-length string (“”) is a valid entry in a field (for Text, Memo, and Hyperlink fields). If we want Microsoft Office Access to store a zero-length string instead of null when we leave a field blank, we set both the Allow Zero Length and Required properties to ‘Yes’. The Allow Zero Length property works independently of the Required property. The Required property determines only whether null is valid for the field. If the Allow Zero Length property is set to ‘Yes’, a zero-length string will be a valid value for the field regardless of the setting of the Required property. Indexed property The Indexed property is used to set a single-field index. An index is a structure used to help retrieve data more quickly and efficiently (just as the index in this book allows a particular section to be found more quickly). An index speeds up queries on the indexed fields as well as sorting and grouping operations. The Indexed property has the following values: No Yes (Duplicates OK) Yes (No Duplicates) no index (the default) the index allows duplicates the index does not allow duplicates For the DreamHome database, we discuss which fields to index in Step 5.3 in Chapter 17. Unicode Compression property Unicode is a character encoding standard that represents each character as two bytes, enabling almost all of the written languages in the world to be represented using a single character set. For a Latin character (a character of a western European language such as English, Spanish, or German) the first byte is 0. Thus, for Text, Memo, and Hypertext fields more storage space is required than in earlier versions of Office Access, which did not use Unicode. To overcome this, the default value of the Unicode Compression property for 8.1 Microsoft Office Access 2003 these fields is ‘Yes’ (for compression), so that any character whose first byte is 0 is compressed when it is stored and uncompressed when it is retrieved. The Unicode Compression property can also be set to ‘No’ (for no compression). Note that data in a Memo field is not compressed unless it requires 4096 bytes or less of storage space after compression. IME Mode/IME Sentence Mode properties An Input Method Editor (IME) is a program that allows entry of East Asian text (traditional Chinese, simplified Chinese, Japanese, or Korean), converting keystrokes into complex East Asian characters. In essence, the IME is treated as an alternative type of keyboard layout. The IME interprets keystrokes as characters and then gives the user an opportunity to insert the correct interpretation. 
The IME Mode property applies to all East Asian languages, and IME Sentence Mode property applies to Japanese only. Smart tags property Smart tags allow actions to be performed within Office Access that would normally require the user to open another program. Smart tags can be associated with the fields of a table or query, or with the controls of a form, report, or data access page. The Smart Tags Action button appears when the field or control is activated and the button can be clicked to see what actions are available. For example, for a person’s name the smart tag could allow an e-mail to be generated; for a date, the smart tag could allow a meeting to be scheduled. Microsoft provides some standard tags but custom smart tags can be built using any programming language that can create a Component Object Model (COM) add-in. Relationships and Referential Integrity Definition As we saw in Figure 8.1, relationships can be created in Microsoft Office Access using the SQL CREATE TABLE statement. Relationships can also be created in the Relationships window. To create a relationship, we display the tables that we want to create the relationship between, and then drag the primary key field of the parent table to the foreign key field of the child table. At this point, Office Access will display a window allowing specification of the referential integrity constraints. Figure 8.3(a) shows the referential integrity dialog box that is displayed while creating the one-to-many (1:*) relationship Staff Manages PropertyForRent, and Figure 8.3(b) shows the Relationships window after the relationship has been created. Two things to note about setting referential integrity constraints in Microsoft Office Access are: (1) A one-to-many (1:*) relationship is created if only one of the related fields is a primary key or has a unique index; a 1:1 relationship is created if both the related fields are primary keys or have unique indexes. (2) There are only two referential integrity actions for update and delete that correspond to NO ACTION and CASCADE (see Section 6.2.4). Therefore, if other actions are required, consideration must be given to modifying these constraints to fit in with the constraints available in Office Access, or to implementing these constraints in application code. 8.1.4 | 233 234 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle (a) (b) Figure 8.3 (a) Setting the referential integrity constraints for the one-to-many Staff Manages PropertyForRent relationship; (b) relationship window with the one-to-many Staff Manages PropertyForRent relationship displayed. 8.1.5 General Constraint Definition There are several ways to create general constraints in Microsoft Office Access using, for example: n n n validation rules for fields; validation rules for records; validation for forms using Visual Basic for Applications (VBA). 8.1 Microsoft Office Access 2003 | 235 Figure 8.4 Example of record validation in Microsoft Office Access. We have already seen an example of field validation in Section 8.1.3. In this section, we illustrate the other two methods with some simple examples. Validation rules for records A record validation rule controls when an entire record can be saved. Unlike field validation rules, record validation rules can refer to more than one field. This can be useful when values from different fields in a table have to be compared. For example, DreamHome has a constraint that the lease period for properties must be between 90 days and 1 year. 
We can implement this constraint at the record level in the Lease table using the validation rule: [dateFinish] – [dateStart] Between 90 and 365 Figure 8.4 shows the Table Properties box for the Lease table with this rule set. Validation for forms using VBA DreamHome also has a constraint that prevents a member of staff from managing more than 100 properties at any one time. This is a more complex constraint that requires a check on how many properties the member of staff currently manages. One way to implement this constraint in Office Access is to use an event procedure. An event is a specific action that occurs on or with a certain object. Microsoft Office Access can respond to a variety of events such as mouse clicks, changes in data, and forms opening or closing. Events are usually the result of user action. By using either an event procedure or a macro (see Section 8.1.8), we can customize a user response to an event that occurs on a form, report, or control. Figure 8.5 shows an example of a BeforeUpdate event procedure, which is triggered before a record is updated to implement this constraint. In some systems, there will be no support for some or all of the general constraints and it will be necessary to design the constraints into the application, as we have shown in Figure 8.5 that has built the constraint into the application’s VBA code. Implementing a 236 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.5 VBA code to check that a member of staff does not have more than 100 properties to manage at any one time. general constraint in application code is potentially dangerous and can lead to duplication of effort and, worse still, to inconsistencies if the constraint is not implemented everywhere that it should be. 8.1.6 Forms Microsoft Office Access Forms allow a user to view and edit the data stored in the underlying base tables, presenting the data in an organized and customized manner. Forms are constructed as a collection of individual design elements called controls or control objects. There are many types of control, such as text boxes to enter and edit data, labels to hold field names, and command buttons to initiate some user action. Controls can be easily added and removed from a form. In addition, Office Access provides a Control Wizard to help the user add controls to a form. A form is divided into a number of sections, of which the three main ones are: n n n Form Header This determines what will be displayed at the top of each form, such as a title. Detail This section usually displays a number of fields in a record. Form Footer This determines what will be displayed at the bottom of each form, such as a total. 8.1 Microsoft Office Access 2003 It is also possible for forms to contain other forms, called subforms. For example, we may want to display details relating to a branch (the master form) and the details of all staff at that branch (the subform). Normally, subforms are used when there is a relationship between two tables (in this example, we have a one-to-many relationship Branch Has Staff). Forms have three views: Design View, Form View, and Datasheet View. Figure 8.6 shows the construction of a form in Design View to display branch details; the adjacent toolbox gives access to the controls that can be added to the form. In Datasheet View, multiple records can be viewed in the conventional row and column layout and, in Form View, records are typically viewed one at a time. 
Figure 8.7 shows an example of the branch form in both Datasheet View and Form View.

Figure 8.6 Example of a form in Design View with the adjacent toolbox.
Figure 8.7 Example of the branch form: (a) Datasheet View; (b) Form View.

Office Access allows forms to be created from scratch by the experienced user. However, Office Access also provides a Form Wizard that takes the user through a series of interactive pages to determine:

the table or query that the form is to be based on;
the fields to be displayed on the form;
the layout for the form (Columnar, Tabular, Datasheet, or Justified);
the style for the form based on a predefined set of options;
the title for the form.

8.1.7 Reports

Microsoft Office Access Reports are a special type of continuous form designed specifically for printing, rather than for displaying in a Window. As such, a Report has only read-access to the underlying base table(s). Among other things, an Office Access Report allows the user to:

sort records;
group records;
calculate summary information;
control the overall layout and appearance of the report.

As with Forms, a Report's Design View is divided into a number of sections with the main ones being:

Report Header Similar to the Form Header section, this determines what will be displayed at the top of the report, such as a title.
Page Header Determines what will be displayed at the top of each page of the report, such as column headings.
Detail Constitutes the main body of the report, such as details of each record.
Page Footer Determines what will be displayed at the bottom of each page, such as a page number.
Report Footer Determines what will be displayed at the bottom of the report, such as sums or averages that summarize the information in the body of the report.

It is also possible to split the body of the report into groupings based on records that share a common value, and to calculate subtotals for the group. In this case, there are two additional sections in the report:

Group Header Determines what will be displayed at the top of each group, such as the name of the field used for grouping the data.
Group Footer Determines what will be displayed at the bottom of each group, such as a subtotal for the group.

A Report does not have a Datasheet View, only a Design View, a Print Preview, and a Layout Preview. Figure 8.8 shows the construction of a report in Design View to display property for rent details. Figure 8.9 shows an example of the report in Print Preview. Layout Preview is similar to Print Preview but is used to obtain a quick view of the layout of the report and not all records may be displayed.

Figure 8.8 Example of a report in Design View.
Figure 8.9 Example of a report for the PropertyForRent table with a grouping based on the branchNo field in Print Preview.

Office Access allows reports to be created from scratch by the experienced user. However, Office Access also provides a Report Wizard that takes the user through a series of interactive pages to determine:

the table or query the report is to be based on;
the fields to be displayed in the report;
any fields to be used for grouping data in the report along with any subtotals required for the group(s);
any fields to be used for sorting the data in the report;
the layout for the report;
the style for the report based on a predefined set of options;
the title for the report.

8.1.8 Macros

As discussed earlier, Microsoft Office Access uses an event-driven programming paradigm. Office Access can recognize certain events, such as:

mouse events, which occur when a mouse action, such as pressing down or clicking a mouse button, occurs;
keyboard events, which occur, for example, when the user types on the keyboard;
focus events, which occur when a form or form control gains or loses focus or when a form or report becomes active or inactive;
data events, which occur when data is entered, deleted, or changed in a form or control, or when the focus moves from one record to another.

Office Access allows the user to write macros and event procedures that are triggered by an event. We saw an example of an event procedure in Section 8.1.5. In this section, we briefly describe macros. Macros are very useful for automating repetitive tasks and ensuring that these tasks are performed consistently and completely each time. A macro consists of a list of actions that Office Access is to perform. Some actions duplicate menu commands such as Print, Close, and ApplyFilter. Some actions substitute for mouse actions such as the SelectObject action, which selects a database object in the same way that a database object is selected by clicking the object's name. Most actions require additional information as action arguments to determine how the action is to function. For example, to use the SetValue action, which sets the value of a field, control, or property on a form or report, we need to specify the item to be set and an expression representing the value for the specified item. Similarly, to use the MsgBox action, which displays a pop-up message box, we need to specify the text to go into the message box.

Figure 8.10 Macro to check that a member of staff currently has fewer than 100 properties to manage.

Figure 8.10 shows an example of a macro that is called when a user tries to add a new property for rent record into the database. The macro enforces the enterprise constraint that a member of staff cannot manage more than 100 properties at any one time, which we showed previously how to implement using an event procedure written in VBA (see Figure 8.5). In this example, the macro checks whether the member of staff specified on the PropertyForRent form (Forms!PropertyForRent!staffNo) is currently managing less than 100 properties. If so, the macro uses the RunCommand action with the argument Save (to save the new record) and then uses the StopMacro action to stop. Otherwise, the macro uses the MsgBox action to display an error message and uses the CancelEvent action to cancel the addition of the new record. This example also demonstrates:

use of the DCOUNT function to check the constraint instead of a SELECT COUNT(*) statement;
use of an ellipsis ( . . . ) in the Condition column to run a series of actions associated with a condition. In this case, the SetWarnings, RunCommand, and StopMacro actions are called if the condition DCOUNT("*", "PropertyForRent", "[staffNo] = Forms!PropertyForRent!staffNo") < 100 evaluates to true, otherwise the MsgBox and CancelEvent actions are called.
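For readers who prefer to think in SQL, the DCOUNT condition above performs the same check as a query of roughly the following form, where the form reference supplies the staff number at run time (a sketch only; the macro itself uses the domain aggregate function rather than a stored query):

SELECT COUNT(*)
FROM PropertyForRent
WHERE staffNo = [Forms]![PropertyForRent]![staffNo];

The macro allows the save to go ahead only while this count remains below 100.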
8.1.9 Object Dependencies

Microsoft Office Access now allows dependencies between database objects (tables, queries, forms, and reports) to be viewed. This can be particularly useful for identifying objects that are no longer required or for maintaining consistency after an object has been modified. For example, if we add a new field to the Branch table, we can use the Object Dependencies task pane shown in Figure 8.11 to identify which queries, forms, and reports may need to be modified to include the additional field. It is also possible to list the objects that are being used by a selected object.

Figure 8.11 Object Dependencies task pane showing the dependencies for the Branch table.

8.2 Oracle9i

The Oracle Corporation is the world's leading supplier of software for information management, and the world's second largest independent software company. With annual revenues of about US$10 billion, the company offers its database, tools, and application products, along with related services, in more than 145 countries around the world. Oracle is the top-selling multi-user RDBMS with 98% of Fortune 100 companies using Oracle Solutions (Oracle Corporation, 2003). Oracle's integrated suite of business applications, Oracle E-Business Suite, covers business intelligence, financials (such as accounts receivable, accounts payable, and general ledger), human resources, procurement, manufacturing, marketing, projects, sales, services, asset enterprise management, order fulfilment, product development, and treasury.

Oracle has undergone many revisions since its first release in the late 1970s, but in 1997 Oracle8 was released with extended object-relational capabilities, and improved performance and scalability features. In 1999, Oracle8i was released with added functionality supporting Internet deployment and in 2001 Oracle9i was released with additional functionality aimed at the e-Business environments. There are three main products in the Oracle9i family, as shown in Table 8.2.

Table 8.2 Oracle9i family of products.

Oracle9i Standard Edition: Oracle for low to medium volume OLTP (Online Transaction Processing) environments.
Oracle9i Enterprise Edition: Oracle for a large number of users or large database size, with advanced management, extensibility, and performance features for mission-critical OLTP environments, query intensive data warehousing applications, and demanding Internet applications.
Oracle9i Personal Edition: Single-user version of Oracle, typically for development of applications deployed on Oracle9i Standard/Enterprise Edition.

Within this family, Oracle offers a number of advanced products and options such as:

Oracle Real Application Clusters As performance demands increase and data volumes continue to grow, the use of database servers with multiple CPUs, called symmetric multiprocessing (SMP) machines, is becoming more common. The use of multiple processors and disks reduces the time to complete a given task and at the same time provides greater availability and scalability. The Oracle Real Application Clusters option supports parallelism within a single SMP server as well as parallelism across multiple nodes.

Oracle9i Application Server (Oracle9iAS) Provides a means of implementing the middle tier of a three-tier architecture for Web-based applications. The first tier is a Web browser and the third tier is the database server.
We discuss the Oracle9i Application Server in more detail in Chapter 29. Oracle9iAS Portal An HTML-based tool for developing Web-enabled applications and content-enabled Web sites. iFS Bundled now with Oracle9iAS, Oracle Internet File System (iFS) makes it possible to treat an Oracle9i database like a shared network drive, allowing users to store and retrieve files managed by the database as if they were files managed by a file server. Java support Oracle has integrated a secure Java Virtual Machine with the Oracle9i database server. Oracle JVM supports Java stored procedures and triggers, Java methods, CORBA objects, Enterprise JavaBeans (EJB), Java Servlets, and JavaServer Pages (JSPs). It also supports the Internet Inter-Object Protocol (IIOP) and the HyperText Transfer Protocol (HTTP). Oracle provides JDeveloper to help develop basic Java applications. We discuss Java support in more detail in Chapter 29. XML support Oracle includes a number of features to support XML. The XML Development Kit (XDK) allows developers to send, receive, and interpret XML data from applications written in Java, C, C++, and PL/SQL. The XML Class Generator creates Java/C++ classes from XML Schema definitions. The XML SQL utility | 243 244 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle n n n n n n n n supports reading and writing XML data to and from the database using SQL (through the DBMS–XMLGEN package). Oracle9i also includes the new XMLType data type, which allows an XML document to be stored in a character LOB column (see Table 8.3 on page 253), with built-in functions to extract individual nodes from the document and to build indexes on any node in the document. We discuss XML in Chapter 30. interMEDIA Enables Oracle9i to manage text, documents, image, audio, video, and locator data. It supports a variety of Web client interfaces, Web development tools, Web servers, and streaming media servers. Visual Information Retrieval Supports content-based queries based on visual attributes of an image, such as color, structure, and texture. Time Series Allows timestamped data to be stored in the database. Includes calendar functions and time-based analysis functions such as calculating moving averages. Spatial Optimizes the retrieval and display of data linked to spatial information. Distributed database features Allow data to be distributed across a number of database servers. Users can query and update this data as if it existed in a single database. We discuss distributed DBMSs and examine the Oracle distribution facilities in Chapters 22 and 23. Advanced Security Used in a distributed environment to provide secure access and transmission of data. Includes network data encryption using RSA Data Security’s RC4 or DES algorithm, network data integrity checking, enhanced authentication, and digital certificates (see Chapter 19). Data Warehousing Provides tools that support the extraction, transformation, and loading of organizational data sources into a single database, and tools that can then be used to analyze this data for strategic decision-making. We discuss data warehouses and examine the Oracle data warehouse facilities in Chapters 31 and 32. Oracle Internet Developer Suite A set of tools to help developers build sophisticated database applications. We discuss this suite in Section 8.2.8. 8.2.1 Objects The user interacts with Oracle and develops a database using a number of objects, the main objects being: n n Tables The base tables that make up the database. 
Using the Oracle terminology, a table is organized into columns and rows. One or more tables are stored within a tablespace (see Section 8.2.2). Oracle also supports temporary tables that exist only for the duration of a transaction or session. Objects Object types provide a way to extend Oracle’s relational data type system. As we saw in Section 6.1, SQL supports three regular data types: characters, numbers, and dates. Object types allow the user to define new data types and use them as regular relational data types would be used. We defer discussion of Oracle’s object-relational features until Chapter 28. 8.2 Oracle9i n Clusters A cluster is a set of tables physically stored together as one table that shares common columns. If data in two or more tables are frequently retrieved together based on data in the common column, using a cluster can be quite efficient. Tables can be accessed separately even though they are part of a cluster. Because of the structure of the cluster, related data requires much less input/output (I/O) overhead if accessed simultaneously. Clusters are discussed in Appendix C and we give guidelines for their use. n Indexes An index is a structure that provides accelerated access to the rows of a table based on the values in one or more columns. Oracle supports index-only tables, where the data and index are stored together. Indexes are discussed in Appendix C and guidelines for when to create indexes are provided in Step 5.3 in Chapter 17. n Views A view is a virtual table that does not necessarily exist in the database but can be produced upon request by a particular user, at the time of request (see Section 6.4). n Synonyms These are alternative names for objects in the database. n Sequences The Oracle sequence generator is used to automatically generate a unique sequence of numbers in cache. The sequence generator avoids the user having to create the sequence, for example by locking the row that has the last value of the sequence, generating a new value, and then unlocking the row. n Stored functions These are a set of SQL or PL/SQL statements used together to execute a particular function and stored in the database. PL/SQL is Oracle’s procedural extension to SQL. n Stored procedures Procedures and functions are identical except that functions always return a value (procedures do not). By processing the SQL code on the database server, the number of instructions sent across the network and returned from the SQL statements are reduced. n Packages These are a collection of procedures, functions, variables, and SQL statements that are grouped together and stored as a single program unit in the database. n Triggers Triggers are code stored in the database and invoked (triggered) by events that occur in the database. Before we discuss some of these objects in more detail, we first examine the architecture of Oracle. Oracle Architecture Oracle is based on the client–server architecture examined in Section 2.6.3. The Oracle server consists of the database (the raw data, including log and control files) and the instance (the processes and system memory on the server that provide access to the database). An instance can connect to only one database. The database consists of a logical structure, such as the database schema, and a physical structure, containing the files that make up an Oracle database. We now discuss the logical and physical structure of the database and the system processes in more detail. 
8.2.2 | 245 246 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Oracle’s logical database structure At the logical level, Oracle maintains tablespaces, schemas, and data blocks and extents/segments. Tablespaces An Oracle database is divided into logical storage units called tablespaces. A tablespace is used to group related logical structures together. For example, tablespaces commonly group all the application’s objects to simplify some administrative operations. Every Oracle database contains a tablespace named SYSTEM, which is created automatically when the database is created. The SYSTEM tablespace always contains the system catalog tables (called the data dictionary in Oracle) for the entire database. A small database might need only the SYSTEM tablespace; however, it is recommended that at least one additional tablespace is created to store user data separate from the data dictionary, thereby reducing contention among dictionary objects and schema objects for the same datafiles (see Figure 16.2 in Chapter 16). Figure 8.12 illustrates an Oracle database consisting of the SYSTEM tablespace and a USER_DATA tablespace. A new tablespace can be created using the CREATE TABLESPACE command, for example: CREATE TABLESPACE user_data DATAFILE ‘DATA3.ORA’ SIZE 100K EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO; Figure 8.12 Relationship between an Oracle database, tablespaces, and datafiles. 8.2 Oracle9i A table can then be associated with a specific tablespace using the CREATE TABLE or ALTER TABLE statement, for example: CREATE TABLE PropertyForRent (propertyNo VARCHAR2(5) NOT NULL, . . . ) TABLESPACE user_data; If no tablespace is specified when creating a new table, the default tablespace associated with the user when the user account was set up is used. We see how this default tablespace can be specified in Section 18.4. Users, schemas, and schema objects A user (sometimes called a username) is a name defined in the database that can connect to, and access, objects. A schema is a named collection of schema objects, such as tables, views, indexes, clusters, and procedures, associated with a particular user. Schemas and users help DBAs manage database security. To access a database, a user must run a database application (such as Oracle Forms or SQL*Plus) and connect using a username defined in the database. When a database user is created, a corresponding schema of the same name is created for the user. By default, once a user connects to a database, the user has access to all objects contained in the corresponding schema. As a user is associated only with the schema of the same name, the terms ‘user’ and ‘schema’ are often used interchangeably. (Note there is no relationship between a tablespace and a schema: objects in the same schema can be in different tablespaces, and a tablespace can hold objects from different schemas.) Data blocks, extents, and segments The data block is the smallest unit of storage that Oracle can use or allocate. One data block corresponds to a specific number of bytes of physical disk space. The data block size can be set for each Oracle database when it is created. This data block size should be a multiple of the operating system’s block size (within the system’s maximum operating limit) to avoid unnecessary I/O. A data block has the following structure: n n n n n Header Contains general information such as block address and type of segment. Table directory Contains information about the tables that have data in the data block. 
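One way to see this allocation at work is to query the data dictionary. The following sketch (an illustration, not taken from the text) reports the number of extents and blocks currently allocated to the segment that holds the PropertyForRent table:

SELECT segment_name, tablespace_name, extents, blocks
FROM   user_segments
WHERE  segment_name = 'PROPERTYFORRENT';
-- object names created without quoted identifiers are stored in upper case in the data dictionary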
Row directory Contains information about the rows in the data block. Row data Contains the actual rows of table data. A row can span blocks. Free space Allocated for the insertion of new rows and updates to rows that require additional space. Since Oracle8i, Oracle can manage free space automatically, although there is an option to manage it manually. We show how to estimate the size of an Oracle table using these components in Appendix G. The next level of logical database space is called an extent. An extent is a specific number of contiguous data blocks allocated for storing a specific type of information. The level above an extent is called a segment. A segment is a set of extents allocated for a certain logical structure. For example, each table’s data is stored in its own data segment, while each index’s data is stored in its own index segment. Figure 8.13 shows the relationship between data blocks, extents, and segments. Oracle dynamically allocates space when the existing extents of a segment become full. Because extents are allocated as needed, the extents of a segment may or may not be contiguous on disk. | 247 248 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.13 Relationship between Oracle data blocks, extents, and segments. Oracle’s physical database structure The main physical database structures in Oracle are datafiles, redo log files, and control files. Datafiles Every Oracle database has one or more physical datafiles. The data of logical database structures (such as tables and indexes) is physically stored in these datafiles. As shown in Figure 8.12, one or more datafiles form a tablespace. The simplest Oracle database would have one tablespace and one datafile. A more complex database might have four tablespaces, each consisting of two datafiles, giving a total of eight datafiles. Redo log files Every Oracle database has a set of two or more redo log files that record all changes made to data for recovery purposes. Should a failure prevent modified data from being permanently written to the datafiles, the changes can be obtained from the redo log, thus preventing work from being lost. We discuss recovery in detail in Section 20.3. Control files Every Oracle database has a control file that contains a list of all the other files that make up the database, such as the datafiles and redo log files. For added protection, it is recommended that the control file should be multiplexed (multiple copies may be written to multiple devices). Similarly, it may be advisable to multiplex the redo log files as well. The Oracle instance The Oracle instance consists of the Oracle processes and shared memory required to access information in the database. The instance is made up of the Oracle background processes, the user processes, and the shared memory used by these processes, as illustrated in Figure 8.14. Among other things, Oracle uses shared memory for caching data and indexes as well as storing shared program code. Shared memory is broken into various 8.2 Oracle9i Figure 8.14 The Oracle architecture (from the Oracle documentation set). | 249 250 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle memory structures, of which the basic ones are the System Global Area (SGA) and the Program Global Area (PGA). n System global area The SGA is an area of shared memory that is used to store data and control information for one Oracle instance. The SGA is allocated when the Oracle instance starts and deallocated when the Oracle instance shuts down. 
The information in the SGA consists of the following memory structures, each of which has a fixed size and is created at instance startup: – Database buffer cache This contains the most recently used data blocks from the database. These blocks can contain modified data that has not yet been written to disk (dirty blocks), blocks that have not been modified, or blocks that have been written to disk since modification (clean blocks). By storing the most recently used blocks, the most active buffers stay in memory to reduce I/O and improve performance. We discuss buffer management policies in Section 20.3.2. – Redo log buffer This contains the redo log file entries, which are used for recovery purposes (see Section 20.3). The background process LGWR writes the redo log buffer to the active online redo log file on disk. – Shared pool This contains the shared memory structures such as shared SQL areas in the library cache and internal information in the data dictionary. The shared SQL areas contain parse trees and execution plans for the SQL queries. If multiple applications issue the same SQL statement, each can access the shared SQL area to reduce the amount of memory needed and to reduce the processing time used for parsing and execution. We discuss query processing in Chapter 21. n Program global area The PGA is an area of shared memory that is used to store data and control information for the Oracle server processes. The size and content of the PGA depends on the Oracle server options installed. User processes Each user process represents the user’s connection to the Oracle server (for example, through SQL*Plus or an Oracle Forms application). The user process manipulates the user’s input, communicates with the Oracle server process, displays the information requested by the user and, if required, processes this information into a more useful form. Oracle processes Oracle (server) processes perform functions for users. Oracle processes can be split into two groups: server processes (which handle requests from connected user processes) and background processes (which perform asynchronous I/O and provide increased parallelism for improved performance and reliability). From Figure 8.14, we have the following background processes: – Database Writer (DBWR) The DBWR process is responsible for writing the modified (dirty) blocks from the buffer cache in the SGA to datafiles on disk. An Oracle instance can have up to ten DBWR processes, named DBW0 to DBW9, to handle I/O to multiple datafiles. Oracle employs a technique known as write-ahead logging (see Section 20.3.4), which means that the DBWR process performs batched writes whenever the buffers need to be freed, not necessarily at the point the transaction commits. – Log Writer (LGWR) The LGWR process is responsible for writing data from the log buffer to the redo log. n n 8.2 Oracle9i – Checkpoint (CKPT) A checkpoint is an event in which all modified database buffers are written to the datafiles by the DBWR (see Section 20.3.3). The CKPT process is responsible for telling the DBWR process to perform a checkpoint and to update all the datafiles and control files for the database to indicate the most recent checkpoint. The CKPT process is optional and, if omitted, these responsibilities are assumed by the LGWR process. – System Monitor (SMON) The SMON process is responsible for crash recovery when the instance is started following a failure. This includes recovering transactions that have died because of a system crash. 
SMON also defragments the database by merging free extents within the datafiles.
– Process Monitor (PMON) The PMON process is responsible for tracking user processes that access the database and recovering them following a crash. This includes cleaning up any resources left behind (such as memory) and releasing any locks held by the failed process.
– Archiver (ARCH) The ARCH process is responsible for copying the online redo log files to archival storage when they become full. The system can be configured to run up to ten ARCH processes, named ARC0 to ARC9. The additional archive processes are started by the LGWR when the load dictates.
– Recoverer (RECO) The RECO process is responsible for cleaning up failed or suspended distributed transactions (see Section 23.4).
– Dispatchers (Dnnn) The Dnnn processes are responsible for routing requests from the user processes to available shared server processes and back again. Dispatchers are present only when the Shared Server (previously known as the MultiThreaded Server, MTS) option is used, in which case there is at least one Dnnn process for every communications protocol in use.
– Lock Manager Server (LMS) The LMS process is responsible for inter-instance locking when the Oracle Real Application Clusters option is used.
In the foregoing descriptions we have used the term ‘process’ generically. Nowadays, some systems implement these processes as threads.
Example of how these processes interact
The following example illustrates an Oracle configuration with the server process running on one machine and a user process connecting to the server from a separate machine. Oracle uses a communication mechanism called Oracle Net Services to allow processes on different physical machines to communicate with each other. Oracle Net Services supports a variety of network protocols, such as TCP/IP. The services can also perform network protocol interchanges, allowing clients that use one protocol to interact with a database server using another protocol.
(1) The client workstation runs an application in a user process. The client application attempts to establish a connection to the server using the Oracle Net Services driver.
(2) The server detects the connection request from the application and creates a (dedicated) server process on behalf of the user process.
(3) The user executes an SQL statement to change a row of a table and commits the transaction.
(4) The server process receives the statement and checks the shared pool for a shared SQL area that contains an identical SQL statement. If a shared SQL area is found, the server process checks the user’s access privileges to the requested data and the existing shared SQL area is used to process the statement; if not, a new shared SQL area is allocated for the statement so that it can be parsed and processed.
(5) The server process retrieves any necessary data values from the actual datafile (table) or those stored in the SGA.
(6) The server process modifies data in the SGA. The DBWR process writes modified blocks permanently to disk when doing so is efficient. Because the transaction has committed, the LGWR process immediately records the transaction in the online redo log file.
(7) The server process sends a success/failure message across the network to the application.
(8) During this time, the other background processes run, watching for conditions that require intervention.
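From the client’s point of view, this whole interaction may amount to nothing more than a short SQL*Plus session like the one sketched below (the connect string, column, and row values are assumptions used purely for illustration):

SQL> CONNECT manager@dreamhome
SQL> UPDATE PropertyForRent SET rent = 450 WHERE propertyNo = 'PG16';
SQL> COMMIT;

Steps (2) and (4)–(8) all take place transparently on the server once the statement is submitted.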
In addition, the Oracle server manages other users’ transactions and prevents contention between transactions that request the same data. 8.2.3 Table Definition In Section 6.3.2, we examined the SQL CREATE TABLE statement. Oracle9i supports many of the SQL CREATE TABLE clauses, so we can define: n n n n n n primary keys, using the PRIMARY KEY clause; alternate keys, using the UNIQUE keyword; default values, using the DEFAULT clause; not null attributes, using the NOT NULL keyword; foreign keys, using the FOREIGN KEY clause; other attribute or table constraints using the CHECK and CONSTRAINT clauses. However, there is no facility to create domains, although Oracle9i does allow user-defined types to be created, as we discuss in Section 28.6. In addition, the data types are slightly different from the SQL standard, as shown in Table 8.3. Sequences In the previous section we mentioned that Microsoft Office Access has an Autonumber data type that creates a new sequential number for a column value whenever a row is inserted. Oracle does not have such a data type but it does have a similar facility through the SQL CREATE SEQUENCE statement. For example, the statement: CREATE SEQUENCE appNoSeq START WITH 1 INCREMENT BY 1 CACHE 30; creates a sequence, called appNoSeq, that starts with the initial value 1 and increases by 1 each time. The CACHE 30 clause specifies that Oracle should pre-allocate 30 sequence 8.2 Oracle9i Table 8.3 Partial list of Oracle data types Data type Use Stores fixed-length character data (default size is 1). Unicode data types that store Unicode character data. Same as char data type, except the maximum length is determined by the character set of the database (for example, American English, eastern European, or Korean). Stores variable length character data. varchar2(size) Same as varchar2 with the same caveat as for nvarchar2(size) nchar data type. Currently the same as char. However, use of varchar varchar2 is recommended as varchar might become a separate data type with different comparison semantics in a later release. Stores fixed-point or floating-point numbers, number(l, d) where l stands for length and d stands for the number of decimal digits. For example, number(5, 2) could contain nothing larger than 999.99 without an error. decimal(l, d), dec(l, d), Same as number. Provided for compatibility with SQL standard. or numeric(l, d) integer, int, or smallint Provided for compatibility with SQL standard. Converted to number(38). Stores dates from 1 Jan 4712 BC to 31 Dec 4712 AD date A binary large object. blob A character large object. clob Raw binary data, such as a sequence of graphics raw(size) characters or a digitized picture. char(size) nchar(size) Size Up to 2000 bytes Up to 4000 bytes Up to 2000 bytes ±1.0E−130 . . . ±9.99E125 (up to 38 significant digits) Up to 4 gigabytes Up to 4 gigabytes Up to 2000 bytes numbers and keep them in memory for faster access. Once a sequence has been created, its values can be accessed in SQL statements using the following pseudocolumns: n n CURRVAL NEXTVAL Returns the current value of the sequence. Increments the sequence and returns the new value. For example, the SQL statement: INSERT INTO Appointment(appNo, aDate, aTime, clientNo) VALUES (appNoSeq.nextval, SYSDATE, ‘12.00’, ‘CR76’); inserts a new row into the Appointment table with the value for column appNo (the appointment number) set to the next available number in the sequence. 
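Within the same session, the CURRVAL pseudocolumn can then be used to refer to the number just generated (CURRVAL is defined only after NEXTVAL has been referenced at least once in that session). As a sketch, assuming a hypothetical AppointmentNote table that records notes against appointments:

SELECT appNoSeq.CURRVAL FROM DUAL;

INSERT INTO AppointmentNote(appNo, note)
VALUES (appNoSeq.CURRVAL, 'Client requested a morning viewing');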
We now illustrate how to create the PropertyForRent table in Oracle with the constraints specified in Example 6.1. | 253 254 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.15 Creating the PropertyForRent table using the Oracle SQL CREATE TABLE statement in SQL*Plus. Creating a blank table in Oracle using SQL*Plus To illustrate the process of creating a blank table in Oracle, we first use SQL*Plus, which is an interactive, command-line driven, SQL interface to the Oracle database. Figure 8.15 shows the creation of the PropertyForRent table using the Oracle SQL CREATE TABLE statement. By default, Oracle enforces the referential actions ON DELETE NO ACTION and ON UPDATE NO ACTION on the named foreign keys. It also allows the additional clause ON DELETE CASCADE to be specified to allow deletions from the parent table to cascade to the child table. However, it does not support the ON UPDATE CASCADE action or the SET DEFAULT and SET NULL actions. If any of these actions are required, they have to be implemented as triggers or stored procedures, or within the application code. We see an example of a trigger to enforce this type of constraint in Section 8.2.7. 8.2 Oracle9i Creating a table using the Create Table Wizard An alternative approach in Oracle9i is to use the Create Table Wizard that is part of the Schema Manager. Using a series of interactive forms, the Create Table Wizard takes the user through the process of defining each of the columns with its associated data type, defining any constraints on the columns and/or constraints on the table that may be required, and defining the key fields. Figure 8.16 shows the final form of the Create Table Wizard used to create the PropertyForRent table. General Constraint Definition 8.2.4 There are several ways to create general constraints in Oracle using, for example: n n n n SQL, and the CHECK and CONSTRAINT clauses of the CREATE and ALTER TABLE statements; stored procedures and functions; triggers; methods. The first approach was dealt with in Section 6.1. We defer treatment of methods until Chapter 28 on Object-Relational DBMSs. Before we illustrate the remaining two approaches, we first discuss Oracle’s procedural programming language, PL/SQL. PL/SQL PL/SQL is Oracle’s procedural extension to SQL. There are two versions of PL/SQL: one is part of the Oracle server, the other is a separate engine embedded in a number of Oracle tools. They are very similar to each other and have the same programming constructs, syntax, and logic mechanisms, although PL/SQL for Oracle tools has some extensions to suit the requirements of the particular tool (for example, PL/SQL has extensions for Oracle Forms). PL/SQL has concepts similar to modern programming languages, such as variable and constant declarations, control structures, exception handling, and modularization. PL/SQL is a block-structured language: blocks can be entirely separate or nested within one another. The basic units that comprise a PL/SQL program are procedures, functions, and anonymous (unnamed) blocks. As illustrated in Figure 8.17, a PL/SQL block has up to three parts: n n n an optional declaration part in which variables, constants, cursors, and exceptions are defined and possibly initialized; a mandatory executable part, in which the variables are manipulated; an optional exception part, to handle any exceptions raised during execution. 
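As a minimal sketch of this three-part structure (the variable name and messages here are illustrative only):

DECLARE
  -- optional declaration part
  vMessage VARCHAR2(30) := 'DreamHome';
BEGIN
  -- mandatory executable part
  vMessage := 'Welcome to ' || vMessage;
  DBMS_OUTPUT.PUT_LINE(vMessage);
EXCEPTION
  -- optional exception part
  WHEN OTHERS THEN
    DBMS_OUTPUT.PUT_LINE('An error occurred');
END;
/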
8.2.5 | 255 256 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.16 Creating the PropertyForRent table using the Oracle Create Table Wizard. 8.2 Oracle9i | 257 Figure 8.17 General structure of a PL/SQL block. Declarations Variables and constant variables must be declared before they can be referenced in other statements, including other declarative statements. The types of variables are as shown in Table 8.3. Examples of declarations are: vStaffNo VARCHAR2(5); vRent NUMBER(6, 2) NOT NULL := 600; MAX_PROPERTIES CONSTANT NUMBER := 100; Note that it is possible to declare a variable as NOT NULL, although in this case an initial value must be assigned to the variable. It is also possible to declare a variable to be of the same type as a column in a specified table or another variable using the %TYPE attribute. For example, to declare that the vStaffNo variable is the same type as the staffNo column of the Staff table we could write: vStaffNo Staff.staffNo%TYPE; vStaffNo1 vStaffNo%TYPE; Similarly, we can declare a variable to be of the same type as an entire row of a table or view using the %ROWTYPE attribute. In this case, the fields in the record take their names and data types from the columns in the table or view. For example, to declare a vStaffRec variable to be a row from the Staff table we could write: vStaffRec Staff%ROWTYPE; Assignments In the executable part of a PL/SQL block, variables can be assigned in two ways: using the normal assignment statement (:=) or as the result of an SQL SELECT or FETCH statement. For example: vStaffNo := ‘SG14’; vRent := 500; SELECT COUNT (*) INTO x FROM PropertyForRent WHERE staffNo = vStaffNo; In the latter case, the variable x is set to the result of the SELECT statement (in this case, equal to the number of properties managed by staff member SG14). 258 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Control statements PL/SQL supports the usual conditional, iterative, and sequential flow-of-control mechanisms: n n n IF–THEN–ELSE–END IF; LOOP–EXIT WHEN–END LOOP; FOR–END LOOP; and WHILE–END LOOP; GOTO. We present examples using some of these structures shortly. Exceptions An exception is an identifier in PL/SQL raised during the execution of a block, which terminates its main body of actions. A block always terminates when an exception is raised although the exception handler can perform some final actions. An exception can be raised automatically by Oracle – for example, the exception NO_DATA_FOUND is raised whenever no rows are retrieved from the database in a SELECT statement. It is also possible for an exception to be raised explicitly using the RAISE statement. To handle raised exceptions, separate routines called exception handlers are specified. As mentioned earlier, a user-defined exception is defined in the declarative part of a PL/SQL block. In the executable part a check is made for the exception condition and, if found, the exception is raised. The exception handler itself is defined at the end of the PL/SQL block. An example of exception handling is given in Figure 8.18. This example also illustrates the use of the Oracle-supplied package DBMS_OUTPUT, which allows Figure 8.18 Example of exception handling in PL/SQL. 8.2 Oracle9i output from PL/SQL blocks and subprograms. The procedure put_line outputs information to a buffer in the SGA, which can be displayed by calling the procedure get_line or by setting SERVEROUTPUT ON in SQL*Plus. 
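By way of illustration, the following anonymous block raises and handles a user-defined exception and reports the outcome via DBMS_OUTPUT (a sketch only; the query, exception name, and messages are assumptions rather than the exact contents of Figure 8.18):

DECLARE
  vStaffNo              Staff.staffNo%TYPE := 'SG14';
  vPropertyCount        NUMBER;
  e_too_many_properties EXCEPTION;
BEGIN
  -- count the properties managed by this member of staff
  SELECT COUNT(*) INTO vPropertyCount
  FROM PropertyForRent
  WHERE staffNo = vStaffNo;
  IF vPropertyCount > 100 THEN
    RAISE e_too_many_properties;
  END IF;
  DBMS_OUTPUT.PUT_LINE(vStaffNo || ' manages ' || vPropertyCount || ' properties');
EXCEPTION
  WHEN e_too_many_properties THEN
    DBMS_OUTPUT.PUT_LINE(vStaffNo || ' manages too many properties');
  WHEN OTHERS THEN
    DBMS_OUTPUT.PUT_LINE('An unexpected error occurred');
END;
/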
Cursors A SELECT statement can be used if the query returns one and only one row. To handle a query that can return an arbitrary number of rows (that is, zero, one, or more rows) PL/SQL uses cursors to allow the rows of a query result to be accessed one at a time. In effect, the cursor acts as a pointer to a particular row of the query result. The cursor can be advanced by 1 to access the next row. A cursor must be declared and opened before it can be used, and it must be closed to deactivate it after it is no longer required. Once the cursor has been opened, the rows of the query result can be retrieved one at a time using a FETCH statement, as opposed to a SELECT statement. (In Appendix E we see that SQL can also be embedded in high-level programming languages and cursors are also used for handling queries that can return an arbitrary number of rows.) Figure 8.19 illustrates the use of a cursor to determine the properties managed by staff member SG14. In this case, the query can return an arbitrary number of rows and so a cursor must be used. The important points to note in this example are: n n n n n n In the DECLARE section, the cursor propertyCursor is defined. In the statements section, the cursor is first opened. Among others, this has the effect of parsing the SELECT statement specified in the CURSOR declaration, identifying the rows that satisfy the search criteria (called the active set), and positioning the pointer just before the first row in the active set. Note, if the query returns no rows, PL/SQL does not raise an exception when the cursor is open. The code then loops over each row in the active set and retrieves the current row values into output variables using the FETCH INTO statement. Each FETCH statement also advances the pointer to the next row of the active set. The code checks if the cursor did not contain a row (propertyCursor%NOTFOUND) and exits the loop if no row was found (EXIT WHEN). Otherwise, it displays the property details using the DBMS_OUTPUT package and goes round the loop again. The cursor is closed on completion of the fetches. Finally, the exception block displays any error conditions encountered. As well as %NOTFOUND, which evaluates to true if the most recent fetch does not return a row, there are some other cursor attributes that are useful: n n n %FOUND Evaluates to true if the most recent fetch returns a row (complement of %NOTFOUND). %ISOPEN Evaluates to true if the cursor is open. %ROWCOUNT Evaluates to the total number of rows returned so far. Passing parameters to cursors PL/SQL allows cursors to be parameterized, so that the same cursor definition can be reused with different criteria. For example, we could change the cursor defined in the above example to: | 259 260 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.19 Using cursors in PL/SQL to process a multi-row query. CURSOR propertyCursor (vStaffNo VARCHAR2) IS SELECT propertyNo, street, city, postcode FROM PropertyForRent WHERE staffNo = vStaffNo ORDER BY propertyNo; and we could open the cursor using the following example statements: 8.2 Oracle9i vStaffNo1 PropertyForRent.staffNo%TYPE := ‘SG14’; OPEN propertyCursor(‘SG14’); OPEN propertyCursor(‘SA9’); OPEN propertyCursor(vStaffNo1); Updating rows through a cursor It is possible to update and delete a row after it has been fetched through a cursor. 
In this case, to ensure that rows are not changed between declaring the cursor, opening it, and fetching the rows in the active set, the FOR UPDATE clause is added to the cursor declaration. This has the effect of locking the rows of the active set to prevent any update conflict when the cursor is opened (locking and update conflicts are discussed in Chapter 20). For example, we may want to reassign the properties that SG14 manages to SG37. The cursor would now be declared as: CURSOR propertyCursor IS SELECT propertyNo, street, city, postcode FROM PropertyForRent WHERE staffNo = ‘SG14’ ORDER BY propertyNo FOR UPDATE NOWAIT; By default, if the Oracle server cannot acquire the locks on the rows in the active set in a SELECT FOR UPDATE cursor, it waits indefinitely. To prevent this, the optional NOWAIT keyword can be specified and a test can be made to see if the locking has been successful. When looping over the rows in the active set, the WHERE CURRENT OF clause is added to the SQL UPDATE or DELETE statement to indicate that the update is to be applied to the current row of the active set. For example: UPDATE PropertyForRent SET staffNo = ‘SG37’ WHERE CURRENT OF propertyCursor; ... COMMIT; Subprograms, Stored Procedures, Functions, and Packages Subprograms are named PL/SQL blocks that can take parameters and be invoked. PL/SQL has two types of subprogram called (stored) procedures and functions. Procedures and functions can take a set of parameters given to them by the calling program and perform a set of actions. Both can modify and return data passed to them as a parameter. The difference between a procedure and a function is that a function will always return a single value to the caller, whereas a procedure does not. Usually, procedures are used unless only one return value is needed. Procedures and functions are very similar to those found in most high-level programming languages, and have the same advantages: they provide modularity and extensibility, 8.2.6 | 261 262 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle they promote reusability and maintainability, and they aid abstraction. A parameter has a specified name and data type but can also be designated as: n n n IN parameter is used as an input value only. OUT parameter is used as an output value only. IN OUT parameter is used as both an input and an output value. For example, we could change the anonymous PL/SQL block given in Figure 8.19 into a procedure by adding the following lines at the start: CREATE OR REPLACE PROCEDURE PropertiesForStaff (IN vStaffNo VARCHAR2) AS . . . The procedure could then be executed in SQL*Plus as: SQL> SET SERVEROUTPUT ON; SQL> EXECUTE PropertiesForStaff(‘SG14’); Packages A package is a collection of procedures, functions, variables, and SQL statements that are grouped together and stored as a single program unit. A package has two parts: a specification and a body. A package’s specification declares all public constructs of the package, and the body defines all constructs (public and private) of the package, and so implements the specification. In this way, packages provide a form of encapsulation. Oracle performs the following steps when a procedure or package is created: n n n It compiles the procedure or package. It stores the compiled code in memory. It stores the procedure or package in the database. 
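To make the earlier fragment concrete, a complete version of the PropertiesForStaff procedure might look as follows (a minimal sketch based on the cursor example of Figure 8.19; note that in PL/SQL the parameter mode keyword follows the parameter name, as in vStaffNo IN VARCHAR2):

CREATE OR REPLACE PROCEDURE PropertiesForStaff (vStaffNo IN VARCHAR2) AS
  CURSOR propertyCursor IS
    SELECT propertyNo, street, city, postcode
    FROM PropertyForRent
    WHERE staffNo = vStaffNo
    ORDER BY propertyNo;
  vPropertyRec propertyCursor%ROWTYPE;
BEGIN
  OPEN propertyCursor;
  LOOP
    -- fetch and display each property managed by the given member of staff
    FETCH propertyCursor INTO vPropertyRec;
    EXIT WHEN propertyCursor%NOTFOUND;
    DBMS_OUTPUT.PUT_LINE('Property: ' || vPropertyRec.propertyNo || ', ' ||
                         vPropertyRec.street || ', ' || vPropertyRec.city);
  END LOOP;
  CLOSE propertyCursor;
EXCEPTION
  WHEN OTHERS THEN
    IF propertyCursor%ISOPEN THEN
      CLOSE propertyCursor;
    END IF;
    DBMS_OUTPUT.PUT_LINE('An error occurred while listing properties');
END PropertiesForStaff;
/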
For the previous example, we could create a package specification as follows: CREATE OR REPLACE PACKAGE StaffPropertiesPackage AS procedure PropertiesForStaff(vStaffNo VARCHAR2); END StaffPropertiesPackage; and we could create the package body (that is, the implementation of the package) as: CREATE OR REPLACE PACKAGE BODY StaffPropertiesPackage AS ... END StaffPropertiesPackage; To reference the items declared within a package specification, we use the dot notation. For example, we could call the PropertiesForStaff procedure as follows: StaffPropertiesPackage.PropertiesForStaff(‘SG14’); 8.2 Oracle9i Triggers A trigger defines an action that the database should take when some event occurs in the application. A trigger may be used to enforce some referential integrity constraints, to enforce complex enterprise constraints, or to audit changes to data. The code within a trigger, called the trigger body, is made up of a PL/SQL block, Java program, or ‘C’ callout. Triggers are based on the Event–Condition–Action (ECA) model: n The event (or events) that trigger the rule. In Oracle, this is: – an INSERT, UPDATE, or DELETE statement on a specified table (or possibly view); – a CREATE, ALTER, or DROP statement on any schema object; – a database startup or instance shutdown, or a user logon or logoff; – a specific error message or any error message. It is also possible to specify whether the trigger should fire before the event or after the event. n The condition that determines whether the action should be executed. The condition is optional but, if specified, the action will be executed only if the condition is true. n The action to be taken. This block contains the SQL statements and code to be executed when a triggering statement is issued and the trigger condition evaluates to true. There are two types of trigger: row-level triggers that execute for each row of the table that is affected by the triggering event, and statement-level triggers that execute only once even if multiple rows are affected by the triggering event. Oracle also supports INSTEAD-OF triggers, which provide a transparent way of modifying views that cannot be modified directly through SQL DML statements (INSERT, UPDATE, and DELETE). These triggers are called INSTEAD-OF triggers because, unlike other types of trigger, Oracle fires the trigger instead of executing the original SQL statement. Triggers can also activate themselves one after the other. This can happen when the trigger action makes a change to the database that has the effect of causing another event that has a trigger associated with it. For example, DreamHome has a rule that prevents a member of staff from managing more than 100 properties at the same time. We could create the trigger shown in Figure 8.20 to enforce this enterprise constraint. This trigger is invoked before a row is inserted into the PropertyForRent table or an existing row is updated. If the member of staff currently manages 100 properties, the system displays a message and aborts the transaction. The following points should be noted: n The BEFORE keyword indicates that the trigger should be executed before an insert or update is applied to the PropertyForRent table. n The FOR EACH ROW keyword indicates that this is a row-level trigger, which executes for each row of the PropertyForRent table that is updated in the statement. n The new keyword is used to refer to the new value of the column. (Although not used in this example, the old keyword can be used to refer to the old value of a column.) 
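A trigger along these lines might be sketched as follows (a minimal sketch; the trigger name, error number, and message text are assumptions):

CREATE OR REPLACE TRIGGER PropertyLimitCheck
BEFORE INSERT OR UPDATE OF staffNo ON PropertyForRent
FOR EACH ROW
DECLARE
  vPropertyCount NUMBER;
BEGIN
  -- count the properties the member of staff already manages
  SELECT COUNT(*) INTO vPropertyCount
  FROM PropertyForRent
  WHERE staffNo = :new.staffNo;
  IF vPropertyCount >= 100 THEN
    RAISE_APPLICATION_ERROR(-20000,
      'Member of staff ' || :new.staffNo || ' already manages 100 properties');
  END IF;
END;
/

Note that a row-level trigger that queries its own table can, in some circumstances, raise Oracle’s mutating-table error, so in practice a check of this kind is sometimes restructured as a statement-level trigger or a stored procedure.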
8.2.7 | 263 264 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.20 Trigger to enforce the constraint that a member of staff cannot manage more than 100 properties at any one time. Using triggers to enforce referential integrity We mentioned in Section 8.2.3 that, by default, Oracle enforces the referential actions ON DELETE NO ACTION and ON UPDATE NO ACTION on the named foreign keys. It also allows the additional clause ON DELETE CASCADE to be specified to allow deletions from the parent table to cascade to the child table. However, it does not support the ON UPDATE CASCADE action, or the SET DEFAULT and SET NULL actions. If any of these actions are required, they will have to be implemented as triggers or stored procedures, or within the application code. For example, from Example 6.1 in Chapter 6 the foreign key staffNo in the PropertyForRent table should have the action ON UPDATE CASCADE. This action can be implemented using the triggers shown in Figure 8.21. Trigger 1 (PropertyForRent_Check_Before) The trigger in Figure 8.21(a) is fired whenever the staffNo column in the PropertyForRent table is updated. The trigger checks before the update takes place that the new value specified exists in the Staff table. If an Invalid_Staff exception is raised, the trigger issues an error message and prevents the change from occurring. Changes to support triggers on the Staff table The three triggers shown in Figure 8.21(b) are fired whenever the staffNo column in the Staff table is updated. Before the definition of the triggers, a sequence number updateSequence is created along with a public variable updateSeq (which is accessible to the three triggers through the seqPackage package). In addition, the PropertyForRent table is modified to add a column called updateId, which is used to flag whether a row has been updated, to prevent it being updated more than once during the cascade operation. Trigger 2 (Cascade_StaffNo_Update1) This (statement-level) trigger fires before the update to the staffNo column in the Staff table to set a new sequence number for the update. 8.2 Oracle9i | 265 Figure 8.21 Oracle triggers to enforce ON UPDATE CASCADE on the foreign key staffNo in the PropertyForRent table when the primary key staffNo is updated in the Staff table: (a) trigger for the PropertyForRent table. 266 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.21 (b) Triggers for the Staff table. Trigger 3 (Cascade_StaffNo_Update2) This (row-level) trigger fires to update all rows in the PropertyForRent table that have the old staffNo value (:old.staffNo) to the new value (:new.staffNo), and to flag the row as having been updated. 8.2 Oracle9i Trigger 4 (Cascade_StaffNo_Update3) The final (statement-level) trigger fires after the update to reset the flagged rows back to unflagged. Oracle Internet Developer Suite The Oracle Internet Developer Suite is a set of tools to help developers build sophisticated database applications. The suite includes: n n n n n Oracle Forms Developer, a set of tools to develop form-based applications for deployment as traditional two-tier client–server applications or as three-tier browser-based applications. Oracle Reports Developer, a set of tools for the rapid development and deployment of sophisticated paper and Web reports. 
Oracle Designer, a graphical tool for Rapid Application Development (RAD) covering the database system development lifecycle from conceptual design, to logical design (schema generation), application code generation, and deployment. Oracle Designer can also reverse engineer existing logical designs into conceptual schemas. Oracle JDeveloper, to help develop Java applications. JDeveloper includes a Data Form wizard, a Beans-Express wizard for creating JavaBeans and BeanInfo classes, and a Deployment wizard. Oracle9iAS Portal, an HTML-based tool for developing Web-enabled applications and content-driven websites. In this section we consider the first two components of the Oracle Developer Suite. We consider Web-based development in Chapter 29. Oracle9i Forms Developer Oracle9i Forms Developer is a set of tools that help developers create customized database applications. In conjunction with Oracle9iAS Forms Services (a component of the Oracle9i Application Server), developers can create and deploy Oracle Forms on the Web using Oracle Containers for J2EE (OC4J). The Oracle9iAS Forms Services component renders the application presentation as a Java applet, which can be extended using Java components, such as JavaBeans and Pluggable Java Components (PJCs), so that developers can quickly and easily deliver sophisticated interfaces. Forms are constructed as a collection of individual design elements called items. There are many types of items, such as text boxes to enter and edit data, check boxes, and buttons to initiate some user action. A form is divided into a number of sections, of which the main ones are: n Canvas This is the area on which items are placed (akin to the canvas that an artist would use). Properties such as layout and color can be changed using the Layout Editor. There are four types of canvas: a content canvas is the visual part of the application and 8.2.8 | 267 268 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle n n n must exist; a stacked canvas, which can be overlayed with other canvases to hide or show parts of some information when other data is being accessed; a tab canvas, which has a series of pages, each with a named tab at the top to indicate the nature of the page; a toolbar, which appears in all forms and can be customized. Frames A group of items which can be manipulated and changed as a single item. Data blocks The control source for the form, such as a table, view, or stored procedure. Windows A container for all visual objects that make up a Form. Each window must have a least one canvas and each canvas must be assigned to a window. Like Microsoft Office Access, Oracle Forms applications are event driven. An event may be an interface event, such as a user pressing a button, moving between fields, or opening/closing a form, or an internal processing event (a system action), such as checking the validity of an item against validation rules. The code that responds to an event is a trigger; for example, when the user presses the close button on a form the WHEN-WINDOWCLOSED trigger is fired. The code written to handle this event may, for example, close down the application or remind the user to save his/her work. Forms can be created from scratch by the experienced user. 
However, Oracle also provides a Data Block Wizard and a Layout Wizard that takes the user through a series of interactive pages to determine: n the table/view or stored procedure that the form is to be based on; n the columns to be displayed on the form; n whether to create/delete a master–detail relationship to other data blocks on the form; n the name for the new data block; n the canvas the data block is to be placed on; n the label, width, and height of each item; n the layout style (Form or Tabular); n the title for the frame, along with the number of records to be displayed and the distance between records. Figure 8.22 shows some screens from these wizards and the final form displayed through Forms Services. Oracle9i Reports Developer Oracle9i Reports Developer is a set of tools that enables the rapid development and deployment of sophisticated paper and Web reports against a variety of data sources, including the Oracle9i database itself, JDBC, XML, text files, and Oracle9i OLAP. Using J2EE technologies such as JSP and XML, reports can be published in a variety of formats, such as HTML, XML, PDF, delimited text, Postscript, PCL, and RTF, to a variety of destinations, such as e-mail, Web browser, Oracle9iAS Portal, and the file system. In conjunction with Oracle9iAS Reports Services (a component of the Oracle9i Application Server), developers can create and deploy Oracle Reports on the Web. 8.2 Oracle9i | 269 Figure 8.22 Example of a form being created in Oracle Forms Builder: (a) a page from the Data Block Wizard; (b) a page from the Layout Wizard; (c) the final form displayed through Forms Services. 270 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle The Oracle9i Reports Developer includes: n wizards that guide the user through the report design process; n pluggable data sources (PDSs), such as JDBC and XML, that provide access to data from any source for reports; n a query builder with a graphical representation of the SQL statement to obtain report data; n default report templates and layout styles that can be customized; n an editor that allows paper report layouts to be modified in WYSIWYG mode (Paper Design view); n an integrated graph builder to graphically represent report data; n the ability to execute dynamic SQL statements within PL/SQL procedures; n event-based reporting (report execution based on database events); Reports are constructed as a collection of objects, such as: n data model objects (queries, groups, database columns, links, user parameters); n layout objects (frames, repeating frames, fields, boilerplate, anchors); n parameter form objects (parameters, fields, boilerplate); n PL/SQL objects (program units, triggers). Queries provide the data for the report. Queries can select data from any data source, such as an Oracle9i database, JDBC, XML, or PDSs. Groups are created to organize the columns in the report. Groups can separate a query’s data into sets and can also filter a query’s data. A database column represents a column that is selected by the query containing the data values for a report. For each column selected in the query, the Reports Builder automatically creates a column in the report’s data model. Summaries and computations on database column values can be created manually in the Data Model view or by using the Report Wizard (for summary columns). A data link (or parent–child relationship) relates the results of multiple queries. A data link causes the child query to be executed once for each instance of its parent group. 
The child query is executed with the value of the parent’s primary key. Frames surround objects and protect them from being overwritten by other objects. For example, a frame might be used to surround all objects owned by a group, to surround column headings, or to surround summaries. Repeating frames surround all the fields that are created for a group’s columns. The repeating frame prints once for each record in the group. Repeating frames can enclose any layout object, including other repeating frames. Nested repeating frames are typically used to produce master/detail and break reports. Fields are placeholders for parameters, columns, and other data such as the page number or current date. A boilerplate object is any text, lines, or graphics that appear in a report every time it is run. A parameter is a variable whose value can be set at runtime. Like Oracle Forms, Oracle Reports Developer allows reports to be created from scratch by the experienced user and it also provides a Data Block Wizard and a Layout Wizard that take the user through a series of interactive pages to determine: 8.2 Oracle9i n n n n n n n n | the report style (for example, tabular, group left, group above, matrix, matrix with group); the data source (Express Server Query for OLAP queries, JDBC Query, SQL Query, Text Query, XML Query); the data source definition (for example, an SQL query); the fields to group on (for a grouped report); the fields to be displayed in the report; the fields for any aggregated calculations; the label, width, and height of each item; the template to be used for the report, if any. Figure 8.23 shows some screens from this wizard and the final form displayed through Reports Services Note that it is also possible to build a report using SQL*Plus. Figure 8.24 illustrates some of the commands that can be used to build a report using SQL*Plus: n n n The COLUMN command provides a title and format for a column in the report. BREAKs can be set to group the data, skip lines between attributes, or separate the report into pages. Breaks can be defined on an attribute, expression, alias, or the report itself. COMPUTE performs a computation on columns or expressions selected from a table. The BREAK command must accompany the compute command. Other Oracle Functionality 8.2.9 We will examine Oracle in more depth in later parts of this book, including: n n n n n n n n n n Oracle file organizations and indexing in Chapter 17 and Appendix C; basic Oracle security features in Chapter 19; how Oracle handles concurrency and recovery in Chapter 20; how Oracle handles query optimization in Chapter 21; Oracle’s data distribution mechanism in Chapter 23; Oracle’s data replication mechanism in Chapter 24; Oracle’s object-relational features in Chapter 28; the Oracle9i Application Server in Chapter 29; Oracle’s support for XML in Chapter 30; Oracle’s data warehousing functionality in Chapter 32. Oracle10g At the time of writing, Oracle had just announced the next version of its product, Oracle10g. While the ‘i’ in Oracle9i stands for ‘Internet’, the ‘g’ in the next release stands for ‘grid’. The product line targets grid computing, which aims to pool together low-cost 8.2.10 271 272 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.23 Example of a report being created in Oracle Reports Builder: (a)–(d) pages from the Data Block Wizard and Layout Wizard; (e) the data model for the report; (f) the final form displayed through Reports Services. 
8.2 Oracle9i | 273 (d) (e) (f) Figure 8.23 (cont’d ) 274 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Figure 8.24 Example of a report being created through SQL*Plus. modular storage and servers to create a virtual computing resource that the organization has at its disposal. The system transparently distributes workload to use capacity efficiently, at low cost, and with high availability, thus providing computing capacity ‘on demand’. In this way, computing is considered to be analogous to a utility, like an electric power grid or telephone network: a client does not care where data is stored within the grid or where the computation is performed; the client is only concerned about getting the necessary data as and when required. Oracle has announced three grid-enhanced products: n n n Oracle Database 10g; Oracle Application Server 10g; Oracle Enterprise Manager 10g Grid Control. 8.2 Oracle9i Oracle Database 10g The database component of the grid architecture is based on the Real Application Clusters feature, which was introduced in Oracle9i. Oracle Real Application Clusters enables a single database to run across multiple clustered nodes. New integrated clusterware has been added to simplify the clustering process, allowing the dynamic addition and removal of an Oracle cluster. Automatic storage management (ASM) allows a DBA to define a disk group (a set of disk devices) that Oracle manages as a single, logical unit. For example, if a disk group has been defined as the default disk group for a database, Oracle will automatically allocate the necessary storage and create/delete the associated files. Using RAID, ASM can balance I/O from multiple databases across all the devices in the disk group and improve performance and reliability with striping and mirroring (see Section 19.2.6). In addition, ASM can reassign disks from node to node and cluster to cluster. As well as dynamically allocating work across multiple nodes and data across multiple disks, Oracle can also dynamically move data or share data across multiple databases, potentially on different operating systems, using Oracle Streams. Self-managing features of the database include automatically diagnosing problems such as poor lock contention and slow SQL queries, resolving some problems and alerting the DBA to others with suggested solutions. Oracle Application Server 10g and Oracle Enterprise Manager 10g Grid Control Oracle9iAS, an integrated suite of application infrastructure software, and the Enterprise Manager have been enhanced to run enterprise applications on computing grids. 
Enhancements include: n streamlined installation and configuration of software across multiple nodes in the grid; n cloning facilities, to clone servers, their configurations, and the applications deployed on them; n facilities to automate frequent tasks across multiple servers; advanced security including Java2 security support, SSL support for all protocols, and a PKI-based security infrastructure (see Chapter 19); a Security Management Console, to create users, roles and to define user identity and access control privileges across the grid (this information is stored in the Oracle Internet Directory, an LDAP-compliant Directory Service that can be integrated with other security environments); n n n Oracle Enterprise Single Sign-On Service, to allow users to authenticate to a number of applications and services on the grid; n a set of tools to monitor and tune the performance of the system; for example, the Dynamic Monitoring Service (DMS) collects resource consumption statistics such as CPU, memory, and I/O usage; Application Performance Monitoring (APM) allows DBAs to track the resource usage of a transaction through the various infrastructure components, such as network, Web servers, application servers, and database servers. | 275 276 | Chapter 8 z Commercial RDBMSs: Office Access and Oracle Chapter Summary n n n n n n n The Relational Database Management System (RDBMS) has become the dominant data-processing software in use today, with estimated new licence sales of between US$6 billion and US$10 billion per year (US$25 billion with tools sales included). Microsoft Office Access is the mostly widely used relational DBMS for the Microsoft Windows environment. It is a typical PC-based DBMS capable of storing, sorting, and retrieving data for a variety of applications. Office Access provides a GUI to create tables, queries, forms, and reports, and tools to develop customized database applications using the Microsoft Office Access macro language or the Microsoft Visual Basic for Applications (VBA) language. The user interacts with Microsoft Office Access and develops a database and application using tables, queries, forms, reports, data access pages, macros, and modules. A table is organized into columns (called fields) and rows (called records). Queries allow the user to view, change, and analyze data in different ways. Queries can also be stored and used as the source of records for forms, reports, and data access pages. Forms can be used for a variety of purposes such as to create a data entry form to enter data into a table. Reports allow data in the database to be presented in an effective way in a customized printed format. A data access page is a special type of Web page designed for viewing and working with data (stored in a Microsoft Office Access database or a Microsoft SQL Server database) from the Internet or an intranet. Macros are a set of one or more actions that each performs a particular operation, such as opening a form or printing a report. Modules are a collection of VBA declarations and procedures that are stored together as a unit. Microsoft Office Access can be used as a standalone system on a single PC or as a multi-user system on a PC network. Since the release of Office Access 2000, there is a choice of two data engines in the product: the original Jet engine and the new Microsoft SQL Server Desktop Engine (MSDE), which is compatible with Microsoft’s backoffice SQL Server. 
The Oracle Corporation is the world’s leading supplier of software for information management, and the world’s second largest independent software company. With annual revenues of about US$10 billion, the company offers its database, tools, and application products, along with related services in more than 145 countries around the world. Oracle is the top-selling multi-user RDBMS with 98% of Fortune 100 companies using Oracle solutions. The user interacts with Oracle and develops a database using a number of objects. The main objects in Oracle are tables (a table is organized into columns and rows); objects (a way to extend Oracle’s relational data type system); clusters (a set of tables physically stored together as one table that shares a common column); indexes (a structure used to help retrieve data more quickly and efficiently); views (virtual tables); synonyms (an alternative name for an object in the database); sequences (generates a unique sequence of numbers in cache); stored functions/procedures (a set of SQL or PL/SQL statements used together to execute a particular function); packages (a collection of procedures, functions, variables, and SQL statements that are grouped together and stored as a single program unit); triggers (code stored in the database and invoked – triggered – by events that occur in the application). Oracle is based on the client–server architecture. The Oracle server consists of the database (the raw data, including log and control files) and the instance (the processes and system memory on the server that provide access to the database). An instance can connect to only one database. The database consists of a logical structure, such as the database schema, and a physical structure, containing the files that make up an Oracle database. Review Questions | Review Questions 8.1 8.2 8.3 8.4 8.5 8.6 Describe the objects that can be created within Microsoft Office Access. Discuss how Office Access can be used in a multi-user environment. Describe the main data types in Office Access and when each type would be used. Describe two ways to create tables and relationships in Office Access. Describe three ways to create enterprise constraints in Office Access. Describe the objects that can be created within Oracle. 8.7 8.8 Describe Oracle’s logical database structure. Describe Oracle’s physical database structure. 8.9 Describe the main data types in Oracle and when each type would be used. 8.10 Describe two ways to create tables and relationships in Oracle. 8.11 Describe three ways to create enterprise constraints in Oracle. 8.12 Describe the structure of a PL/SQL block. 277 Part 3 Chapter Database Analysis and Design Techniques 9 Database Planning, Design, and Administration 281 Chapter 10 Fact-Finding Techniques 314 Chapter 11 Entity–Relationship Modeling 342 Chapter 12 Enhanced Entity–Relationship Modeling 371 Chapter 13 Normalization 387 Chapter 14 Advanced Normalization 415 Chapter 9 Database Planning, Design, and Administration Chapter Objectives In this chapter you will learn: n The main components of an information system. n The main stages of the database system development lifecycle (DSDLC). n The main phases of database design: conceptual, logical, and physical design. n The benefits of Computer-Aided Software Engineering (CASE) tools. n The types of criteria used to evaluate a DBMS. n How to evaluate and select a DBMS. n The distinction between data administration and database administration. 
n The purpose and tasks associated with data administration and database administration. Software has now surpassed hardware as the key to the success of many computerbased systems. Unfortunately, the track record at developing software is not particularly impressive. The last few decades have seen the proliferation of software applications ranging from small, relatively simple applications consisting of a few lines of code, to large, complex applications consisting of millions of lines of code. Many of these applications have required constant maintenance. This involved correcting faults that had been detected, implementing new user requirements, and modifying the software to run on new or upgraded platforms. The effort spent on maintenance began to absorb resources at an alarming rate. As a result, many major software projects were late, over budget, unreliable, difficult to maintain, and performed poorly. This led to what has become known as the software crisis. Although this term was first used in the late 1960s, more than 40 years later the crisis is still with us. As a result, some authors now refer to the software crisis as the software depression. As an indication of the crisis, a study carried out in the UK by OASIG, a Special Interest Group concerned with the Organizational Aspects of IT, reached the following conclusions about software projects (OASIG, 1996): 282 | Chapter 9 z Database Planning, Design, and Administration n 80–90% do not meet their performance goals; n about 80% are delivered late and over budget; n around 40% fail or are abandoned; n under 40% fully address training and skills requirements; n less than 25% properly integrate enterprise and technology objectives; n just 10–20% meet all their success criteria. There are several major reasons for the failure of software projects including: n lack of a complete requirements specification; n lack of an appropriate development methodology; n poor decomposition of design into manageable components. As a solution to these problems, a structured approach to the development of software was proposed called the Information Systems Lifecycle (ISLC) or the Software Development Lifecycle (SDLC). However, when the software being developed is a database system the lifecycle is more specifically referred to as the Database System Development Lifecycle (DSDLC). Structure of this Chapter In Section 9.1 we briefly describe the information systems lifecycle and discuss how this lifecycle relates to the database system development lifecycle. In Section 9.2 we present an overview of the stages of the database system development lifecycle. In Sections 9.3 to 9.13 we describe each stage of the lifecycle in more detail. In Section 9.14 we discuss how Computer-Aided Software Engineering (CASE) tools can provide support for the database system development lifecycle. We conclude in Section 9.15 with a discussion on the purpose and tasks associated with data administration and database administration within an organization. 9.1 The Information Systems Lifecycle Information system The resources that enable the collection, management, control, and dissemination of information throughout an organization. Since the 1970s, database systems have been gradually replacing file-based systems as part of an organization’s Information Systems (IS) infrastructure. 
At the same time there has 9.2 The Database System Development Lifecycle been a growing recognition that data is an important corporate resource that should be treated with respect, like all other organizational resources. This resulted in many organizations establishing whole departments or functional areas called Data Administration (DA) and Database Administration (DBA), which are responsible for the management and control of the corporate data and the corporate database, respectively. A computer-based information system includes a database, database software, application software, computer hardware, and personnel using and developing the system. The database is a fundamental component of an information system, and its development and usage should be viewed from the perspective of the wider requirements of the organization. Therefore, the lifecycle of an organization’s information system is inherently linked to the lifecycle of the database system that supports it. Typically, the stages in the lifecycle of an information system include: planning, requirements collection and analysis, design, prototyping, implementation, testing, conversion, and operational maintenance. In this chapter we review these stages from the perspective of developing a database system. However, it is important to note that the development of a database system should also be viewed from the broader perspective of developing a component part of the larger organization-wide information system. Throughout this chapter we use the terms ‘functional area’ and ‘application area’ to refer to particular enterprise activities within an organization such as marketing, personnel, and stock control. The Database System Development Lifecycle As a database system is a fundamental component of the larger organization-wide information system, the database system development lifecycle is inherently associated with the lifecycle of the information system. The stages of the database system development lifecycle are shown in Figure 9.1. Below the name of each stage is the section in this chapter that describes that stage. It is important to recognize that the stages of the database system development lifecycle are not strictly sequential, but involve some amount of repetition of previous stages through feedback loops. For example, problems encountered during database design may necessitate additional requirements collection and analysis. As there are feedback loops between most stages, we show only some of the more obvious ones in Figure 9.1. A summary of the main activities associated with each stage of the database system development lifecycle is described in Table 9.1. For small database systems, with a small number of users, the lifecycle need not be very complex. However, when designing a medium to large database systems with tens to thousands of users, using hundreds of queries and application programs, the lifecycle can become extremely complex. Throughout this chapter we concentrate on activities associated with the development of medium to large database systems. In the following sections we describe the main activities associated with each stage of the database system development lifecycle in more detail. 9.2 | 283 Figure 9.1 The stages of the database system development lifecycle. 9.3 Database Planning Table 9.1 Summary of the main activities associated with each stage of the database system development lifecycle. 
Stage Main activities Database planning Planning how the stages of the lifecycle can be realized most efficiently and effectively. Specifying the scope and boundaries of the database system, including the major user views, its users, and application areas. Collection and analysis of the requirements for the new database system. Conceptual, logical, and physical design of the database. Selecting a suitable DBMS for the database system. Designing the user interface and the application programs that use and process the database. Building a working model of the database system, which allows the designers or users to visualize and evaluate how the final system will look and function. Creating the physical database definitions and the application programs. Loading data from the old system to the new system and, where possible, converting any existing applications to run on the new database. Database system is tested for errors and validated against the requirements specified by the users. Database system is fully implemented. The system is continuously monitored and maintained. When necessary, new requirements are incorporated into the database system through the preceding stages of the lifecycle. System definition Requirements collection and analysis Database design DBMS selection (optional) Application design Prototyping (optional) Implementation Data conversion and loading Testing Operational maintenance Database Planning Database planning The management activities that allow the stages of the database system development lifecycle to be realized as efficiently and effectively as possible. Database planning must be integrated with the overall IS strategy of the organization. There are three main issues involved in formulating an IS strategy, which are: n n n identification of enterprise plans and goals with subsequent determination of information systems needs; evaluation of current information systems to determine existing strengths and weaknesses; appraisal of IT opportunities that might yield competitive advantage. 9.3 | 285 286 | Chapter 9 z Database Planning, Design, and Administration The methodologies used to resolve these issues are outside the scope of this book; however, the interested reader is referred to Robson (1997) for a fuller discussion. An important first step in database planning is to clearly define the mission statement for the database system. The mission statement defines the major aims of the database system. Those driving the database project within the organization (such as the Director and/or owner) normally define the mission statement. A mission statement helps to clarify the purpose of the database system and provide a clearer path towards the efficient and effective creation of the required database system. Once the mission statement is defined, the next activity involves identifying the mission objectives. Each mission objective should identify a particular task that the database system must support. The assumption is that if the database system supports the mission objectives then the mission statement should be met. The mission statement and objectives may be accompanied with some additional information that specifies, in general terms, the work to be done, the resources with which to do it, and the money to pay for it all. We demonstrate the creation of a mission statement and mission objectives for the database system of DreamHome in Section 10.4.2. 
Database planning should also include the development of standards that govern how data will be collected, how the format should be specified, what necessary documentation will be needed, and how design and implementation should proceed. Standards can be very time-consuming to develop and maintain, requiring resources to set them up initially, and to continue maintaining them. However, a well-designed set of standards provides a basis for training staff and measuring quality control, and can ensure that work conforms to a pattern, irrespective of staff skills and experience. For example, specific rules may govern how data items can be named in the data dictionary, which in turn may prevent both redundancy and inconsistency. Any legal or enterprise requirements concerning the data should be documented, such as the stipulation that some types of data must be treated confidentially. 9.4 System Definition System definition Describes the scope and boundaries of the database application and the major user views. Before attempting to design a database system, it is essential that we first identify the boundaries of the system that we are investigating and how it interfaces with other parts of the organization’s information system. It is important that we include within our system boundaries not only the current users and application areas, but also future users and applications. We present a diagram that represents the scope and boundaries of the DreamHome database system in Figure 10.10. Included within the scope and boundary of the database system are the major user views that are to be supported by the database. 9.4 System Definition User Views User view | 287 9.4.1 Defines what is required of a database system from the perspective of a particular job role (such as Manager or Supervisor) or enterprise application area (such as marketing, personnel, or stock control). A database system may have one or more user views. Identifying user views is an important aspect of developing a database system because it helps to ensure that no major users of the database are forgotten when developing the requirements for the new database system. User views are also particularly helpful in the development of a relatively complex database system by allowing the requirements to be broken down into manageable pieces. A user view defines what is required of a database system in terms of the data to be held and the transactions to be performed on the data (in other words, what the users will do with the data). The requirements of a user view may be distinct to that view or overlap with other views. Figure 9.2 is a diagrammatic representation of a database system with multiple user views (denoted user view 1 to 6). Note that whereas user views (1, 2, and 3) and (5 and 6) have overlapping requirements (shown as hatched areas), user view 4 has distinct requirements. Figure 9.2 Representation of a database system with multiple user views: user views (1, 2, and 3) and (5 and 6) have overlapping requirements (shown as hatched areas), whereas user view 4 has distinct requirements. 288 | Chapter 9 z Database Planning, Design, and Administration 9.5 Requirements Collection and Analysis Requirements collection and analysis The process of collecting and analyzing information about the part of the organization that is to be supported by the database system, and using this information to identify the requirements for the new system. 
This stage involves the collection and analysis of information about the part of the enterprise to be served by the database. There are many techniques for gathering this information, called fact-finding techniques, which we discuss in detail in Chapter 10. Information is gathered for each major user view (that is, job role or enterprise application area), including: n n n a description of the data used or generated; the details of how data is to be used or generated; any additional requirements for the new database system. This information is then analyzed to identify the requirements (or features) to be included in the new database system. These requirements are described in documents collectively referred to as requirements specifications for the new database system. Requirements collection and analysis is a preliminary stage to database design. The amount of data gathered depends on the nature of the problem and the policies of the enterprise. Too much study too soon leads to paralysis by analysis. Too little thought can result in an unnecessary waste of both time and money due to working on the wrong solution to the wrong problem. The information collected at this stage may be poorly structured and include some informal requests, which must be converted into a more structured statement of requirements. This is achieved using requirements specification techniques, which include for example: Structured Analysis and Design (SAD) techniques, Data Flow Diagrams (DFD), and Hierarchical Input Process Output (HIPO) charts supported by documentation. As we will see shortly, Computer-Aided Software Engineering (CASE) tools may provide automated assistance to ensure that the requirements are complete and consistent. In Section 25.7 we will discuss how the Unified Modeling Language (UML) supports requirements collection and analysis. Identifying the required functionality for a database system is a critical activity, as systems with inadequate or incomplete functionality will annoy the users, which may lead to rejection or underutilization of the system. However, excessive functionality can also be problematic as it can overcomplicate a system making it difficult to implement, maintain, use, or learn. Another important activity associated with this stage is deciding how to deal with the situation where there is more than one user view for the database system. There are three main approaches to managing the requirements of a database system with multiple user views, namely: n n n the centralized approach; the view integration approach; a combination of both approaches. 9.5 Requirements Collection and Analysis Figure 9.3 The centralized approach to managing multiple user views 1 to 3. Centralized Approach Centralized approach 9.5.1 Requirements for each user view are merged into a single set of requirements for the new database system. A data model representing all user views is created during the database design stage. The centralized (or one-shot) approach involves collating the requirements for different user views into a single list of requirements. The collection of user views is given a name that provides some indication of the functional area covered by all the merged user views. In the database design stage (see Section 9.6), a global data model is created, which represents all user views. The global data model is composed of diagrams and documentation that formally describe the data requirements of the users. 
A diagram representing the management of user views 1 to 3 using the centralized approach is shown in Figure 9.3. Generally, this approach is preferred when there is a significant overlap in requirements for each user view and the database system is not overly complex. View Integration Approach View integration approach Requirements for each user view remain as separate lists. Data models representing each user view are created and then merged later during the database design stage. 9.5.2 | 289 290 | Chapter 9 z Database Planning, Design, and Administration Figure 9.4 The view integration approach to managing multiple user views 1 to 3. The view integration approach involves leaving the requirements for each user view as separate lists of requirements. In the database design stage (see Section 9.6), we first create a data model for each user view. A data model that represents a single user view (or a subset of all user views) is called a local data model. Each model is composed of diagrams and documentation that formally describes the requirements of one or more but not all user views of the database. The local data models are then merged at a later stage of database design to produce a global data model, which represents all user requirements for the database. A diagram representing the management of user views 1 to 3 using the view integration approach is shown in Figure 9.4. Generally, this approach is preferred 9.6 Database Design when there are significant differences between user views and the database system is sufficiently complex to justify dividing the work into more manageable parts. We demonstrate how to use the view integration approach in Chapter 16, Step 2.6. For some complex database systems it may be appropriate to use a combination of both the centralized and view integration approaches to manage multiple user views. For example, the requirements for two or more user views may be first merged using the centralized approach, which is used to build a local logical data model. This model can then be merged with other local logical data models using the view integration approach to produce a global logical data model. In this case, each local logical data model represents the requirements of two or more user views and the final global logical data model represents the requirements of all user views of the database system. We discuss how to manage multiple user views in more detail in Section 10.4.4 and using the methodology described in this book we demonstrate how to build a database for the DreamHome property rental case study using a combination of both the centralized and view integration approaches. Database Design Database design 9.6 The process of creating a design that will support the enterprise’s mission statement and mission objectives for the required database system. In this section we present an overview of the main approaches to database design. We also discuss the purpose and use of data modeling in database design. We then describe the three phases of database design, namely conceptual, logical, and physical design. Approaches to Database Design The two main approaches to the design of a database are referred to as ‘bottom-up’ and ‘top-down’. The bottom-up approach begins at the fundamental level of attributes (that is, properties of entities and relationships), which through analysis of the associations between attributes, are grouped into relations that represent types of entities and relationships between entities. 
In Chapters 13 and 14 we discuss the process of normalization, which represents a bottom-up approach to database design. Normalization involves the identification of the required attributes and their subsequent aggregation into normalized relations based on functional dependencies between the attributes. The bottom-up approach is appropriate for the design of simple databases with a relatively small number of attributes. However, this approach becomes difficult when applied to the design of more complex databases with a larger number of attributes, where it is difficult to establish all the functional dependencies between the attributes. As the conceptual and logical data models for complex databases may contain hundreds to thousands 9.6.1 | 291 292 | Chapter 9 z Database Planning, Design, and Administration of attributes, it is essential to establish an approach that will simplify the design process. Also, in the initial stages of establishing the data requirements for a complex database, it may be difficult to establish all the attributes to be included in the data models. A more appropriate strategy for the design of complex databases is to use the top-down approach. This approach starts with the development of data models that contain a few high-level entities and relationships and then applies successive top-down refinements to identify lower-level entities, relationships, and the associated attributes. The top-down approach is illustrated using the concepts of the Entity–Relationship (ER) model, beginning with the identification of entities and relationships between the entities, which are of interest to the organization. For example, we may begin by identifying the entities PrivateOwner and PropertyForRent, and then the relationship between these entities, PrivateOwner Owns PropertyForRent, and finally the associated attributes such as PrivateOwner (ownerNo, name, and address) and PropertyForRent (propertyNo and address). Building a highlevel data model using the concepts of the ER model is discussed in Chapters 11 and 12. There are other approaches to database design such as the inside-out approach and the mixed strategy approach. The inside-out approach is related to the bottom-up approach but differs by first identifying a set of major entities and then spreading out to consider other entities, relationships, and attributes associated with those first identified. The mixed strategy approach uses both the bottom-up and top-down approach for various parts of the model before finally combining all parts together. 9.6.2 Data Modeling The two main purposes of data modeling are to assist in the understanding of the meaning (semantics) of the data and to facilitate communication about the information requirements. Building a data model requires answering questions about entities, relationships, and attributes. In doing so, the designers discover the semantics of the enterprise’s data, which exist whether or not they happen to be recorded in a formal data model. Entities, relationships, and attributes are fundamental to all enterprises. However, their meaning may remain poorly understood until they have been correctly documented. A data model makes it easier to understand the meaning of the data, and thus we model data to ensure that we understand: n n n each user’s perspective of the data; the nature of the data itself, independent of its physical representations; the use of data across user views. 
Data models can be used to convey the designer’s understanding of the information requirements of the enterprise. Provided both parties are familiar with the notation used in the model, it will support communication between the users and designers. Increasingly, enterprises are standardizing the way that they model data by selecting a particular approach to data modeling and using it throughout their database development projects. The most popular high-level data model used in database design, and the one we use in this book, is based on the concepts of the Entity–Relationship (ER) model. We describe Entity–Relationship modeling in detail in Chapters 11 and 12. 9.6 Database Design Table 9.2 The criteria to produce an optimal data model. Structural validity Simplicity Expressibility Nonredundancy Shareability Extensibility Integrity Diagrammatic representation Consistency with the way the enterprise defines and organizes information. Ease of understanding by IS professionals and non-technical users. Ability to distinguish between different data, relationships between data, and constraints. Exclusion of extraneous information; in particular, the representation of any one piece of information exactly once. Not specific to any particular application or technology and thereby usable by many. Ability to evolve to support new requirements with minimal effect on existing users. Consistency with the way the enterprise uses and manages information. Ability to represent a model using an easily understood diagrammatic notation. Criteria for data models An optimal data model should satisfy the criteria listed in Table 9.2 (Fleming and Von Halle, 1989). However, sometimes these criteria are not compatible with each other and tradeoffs are sometimes necessary. For example, in attempting to achieve greater expressibility in a data model, we may lose simplicity. Phases of Database Design Database design is made up of three main phases, namely conceptual, logical, and physical design. Conceptual database design Conceptual database design The process of constructing a model of the data used in an enterprise, independent of all physical considerations. The first phase of database design is called conceptual database design, and involves the creation of a conceptual data model of the part of the enterprise that we are interested in modeling. The data model is built using the information documented in the users’ requirements specification. Conceptual database design is entirely independent of implementation details such as the target DBMS software, application programs, programming languages, hardware platform, or any other physical considerations. In Chapter 15, we present a practical step-by-step guide on how to perform conceptual database design. Throughout the process of developing a conceptual data model, the model is tested and validated against the users’ requirements. The conceptual data model of the enterprise is a source of information for the next phase, namely logical database design. 9.6.3 | 293 294 | Chapter 9 z Database Planning, Design, and Administration Logical database design Logical database design The process of constructing a model of the data used in an enterprise based on a specific data model, but independent of a particular DBMS and other physical considerations. The second phase of database design is called logical database design, which results in the creation of a logical data model of the part of the enterprise that we interested in modeling. 
The conceptual data model created in the previous phase is refined and mapped on to a logical data model. The logical data model is based on the target data model for the database (for example, the relational data model). Whereas a conceptual data model is independent of all physical considerations, a logical model is derived knowing the underlying data model of the target DBMS. In other words, we know that the DBMS is, for example, relational, network, hierarchical, or objectoriented. However, we ignore any other aspects of the chosen DBMS and, in particular, any physical details, such as storage structures or indexes. Throughout the process of developing a logical data model, the model is tested and validated against the users’ requirements. The technique of normalization is used to test the correctness of a logical data model. Normalization ensures that the relations derived from the data model do not display data redundancy, which can cause update anomalies when implemented. In Chapter 13 we illustrate the problems associated with data redundancy and describe the process of normalization in detail. The logical data model should also be examined to ensure that it supports the transactions specified by the users. The logical data model is a source of information for the next phase, namely physical database design, providing the physical database designer with a vehicle for making tradeoffs that are very important to efficient database design. The logical model also serves an important role during the operational maintenance stage of the database system development lifecycle. Properly maintained and kept up to date, the data model allows future changes to application programs or data to be accurately and efficiently represented by the database. In Chapter 16 we present a practical step-by-step guide for logical database design. Physical database design Physical database design The process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures. Physical database design is the third and final phase of the database design process, during which the designer decides how the database is to be implemented. The previous phase of database design involved the development of a logical structure for the database, which describes relations and enterprise constraints. Although this structure is 9.7 DBMS Selection DBMS-independent, it is developed in accordance with a particular data model such as the relational, network, or hierarchic. However, in developing the physical database design, we must first identify the target DBMS. Therefore, physical design is tailored to a specific DBMS system. There is feedback between physical and logical design, because decisions are taken during physical design for improving performance that may affect the structure of the logical data model. In general, the main aim of physical database design is to describe how we intend to physically implement the logical database design. For the relational model, this involves: n n n creating a set of relational tables and the constraints on these tables from the information presented in the logical data model; identifying the specific storage structures and access methods for the data to achieve an optimum performance for the database system; designing security protection for the system. 
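To make these three activities concrete, the following is a minimal SQL sketch rather than the full DreamHome design: it uses the PrivateOwner and PropertyForRent entities and attributes introduced earlier in this section, but the column types, the index name, and the authorization identifier are assumptions added purely for illustration.

-- Base relations with entity and referential integrity constraints
CREATE TABLE PrivateOwner (
    ownerNo     VARCHAR(5)   NOT NULL,
    name        VARCHAR(50)  NOT NULL,
    address     VARCHAR(100),
    PRIMARY KEY (ownerNo));

CREATE TABLE PropertyForRent (
    propertyNo  VARCHAR(5)   NOT NULL,
    address     VARCHAR(100),
    rent        DECIMAL(7,2) CHECK (rent >= 0),   -- simple enterprise constraint
    ownerNo     VARCHAR(5)   NOT NULL,
    PRIMARY KEY (propertyNo),
    FOREIGN KEY (ownerNo) REFERENCES PrivateOwner(ownerNo));

-- Storage/access structure chosen during physical design to support a frequent query
CREATE INDEX PropertyOwnerInd ON PropertyForRent (ownerNo);

-- Security protection: read-only access for a hypothetical authorization identifier
GRANT SELECT ON PropertyForRent TO Assistant;

The first two statements realize the logical relations and their constraints, the index is one possible access-method decision, and the GRANT statement is one possible security measure; all three correspond to the activities listed above.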
Ideally, conceptual and logical database design for larger systems should be separated from physical design for three main reasons: n n n it deals with a different subject matter – the what, not the how; it is performed at a different time – the what must be understood before the how can be determined; it requires different skills, which are often found in different people. Database design is an iterative process, which has a starting point and an almost endless procession of refinements. They should be viewed as learning processes. As the designers come to understand the workings of the enterprise and the meanings of its data, and express that understanding in the selected data models, the information gained may well necessitate changes to other parts of the design. In particular, conceptual and logical database designs are critical to the overall success of the system. If the designs are not a true representation of the enterprise, it will be difficult, if not impossible, to define all the required user views or to maintain database integrity. It may even prove difficult to define the physical implementation or to maintain acceptable system performance. On the other hand, the ability to adjust to change is one hallmark of good database design. Therefore, it is worthwhile spending the time and energy necessary to produce the best possible design. In Chapter 2, we discussed the three-level ANSI-SPARC architecture for a database system, consisting of external, conceptual, and internal schemas. Figure 9.5 illustrates the correspondence between this architecture and conceptual, logical, and physical database design. In Chapters 17 and 18 we present a step-by-step methodology for the physical database design phase. DBMS Selection DBMS selection The selection of an appropriate DBMS to support the database system. 9.7 | 295 296 | Chapter 9 z Database Planning, Design, and Administration Figure 9.5 Data modeling and the ANSI-SPARC architecture. If no DBMS exists, an appropriate part of the lifecycle in which to make a selection is between the conceptual and logical database design phases (see Figure 9.1). However, selection can be done at any time prior to logical design provided sufficient information is available regarding system requirements such as performance, ease of restructuring, security, and integrity constraints. Although DBMS selection may be infrequent, as enterprise needs expand or existing systems are replaced, it may become necessary at times to evaluate new DBMS products. In such cases the aim is to select a system that meets the current and future requirements of the enterprise, balanced against costs that include the purchase of the DBMS product, any additional software/hardware required to support the database system, and the costs associated with changeover and staff training. A simple approach to selection is to check off DBMS features against requirements. In selecting a new DBMS product, there is an opportunity to ensure that the selection process is well planned, and the system delivers real benefits to the enterprise. In the following section we describe a typical approach to selecting the ‘best’ DBMS. 9.7.1 Selecting the DBMS The main steps to selecting a DBMS are listed in Table 9.3. Table 9.3 Main steps to selecting a DBMS. 
Define Terms of Reference of study Shortlist two or three products Evaluate products Recommend selection and produce report 9.7 DBMS Selection Define Terms of Reference of study The Terms of Reference for the DBMS selection is established, stating the objectives and scope of the study, and the tasks that need to be undertaken. This document may also include a description of the criteria (based on the users’ requirements specification) to be used to evaluate the DBMS products, a preliminary list of possible products, and all necessary constraints and timescales for the study. Shortlist two or three products Criteria considered to be ‘critical’ to a successful implementation can be used to produce a preliminary list of DBMS products for evaluation. For example, the decision to include a DBMS product may depend on the budget available, level of vendor support, compatibility with other software, and whether the product runs on particular hardware. Additional useful information on a product can be gathered by contacting existing users who may provide specific details on how good the vendor support actually is, on how the product supports particular applications, and whether or not certain hardware platforms are more problematic than others. There may also be benchmarks available that compare the performance of DBMS products. Following an initial study of the functionality and features of DBMS products, a shortlist of two or three products is identified. The World Wide Web is an excellent source of information and can be used to identify potential candidate DBMSs. For example, the DBMS magazine’s website (available at www.intelligententerprise.com) provides a comprehensive index of DBMS products. Vendors’ websites can also provide valuable information on DBMS products. Evaluate products There are various features that can be used to evaluate a DBMS product. For the purposes of the evaluation, these features can be assessed as groups (for example, data definition) or individually (for example, data types available). Table 9.4 lists possible features for DBMS product evaluation grouped by data definition, physical definition, accessibility, transaction handling, utilities, development, and other features. If features are checked off simply with an indication of how good or bad each is, it may be difficult to make comparisons between DBMS products. A more useful approach is to weight features and/or groups of features with respect to their importance to the organization, and to obtain an overall weighted value that can be used to compare products. Table 9.5 illustrates this type of analysis for the ‘Physical definition’ group for a sample DBMS product. Each selected feature is given a rating out of 10, a weighting out of 1 to indicate its importance relative to other features in the group, and a calculated score based on the rating times the weighting. For example, in Table 9.5 the feature ‘Ease of reorganization’ is given a rating of 4, and a weighting of 0.25, producing a score of 1.0. This feature is given the highest weighting in this table, indicating its importance in this part of the evaluation. Further, the ‘Ease of reorganization’ feature is weighted, for example, five times higher than the feature ‘Data compression’ with the lowest weighting of 0.05. Whereas, the two features ‘Memory requirements’ and ‘Storage requirements’ are given a weighting of 0.00 and are therefore not included in this evaluation. | 297 298 Table 9.4 Features for DBMS evaluation. 
Data definition Physical definition Primary key enforcement Foreign key specification Data types available Data type extensibility Domain specification Ease of restructuring Integrity controls View mechanism Data dictionary Data independence Underlying data model Schema evolution File structures available File structure maintenance Ease of reorganization Indexing Variable length fields/records Data compression Encryption routines Memory requirements Storage requirements Accessibility Transaction handling Query language: SQL2/SQL:2003/ODMG compliant Interfacing to 3GLs Multi-user Security – Access controls – Authorization mechanism Backup and recovery routines Checkpointing facility Logging facility Granularity of concurrency Deadlock resolution strategy Advanced transaction models Parallel query processing Utilities Development Performance measuring Tuning Load/unload facilities User usage monitoring Database administration support 4GL/5GL tools CASE tools Windows capabilities Stored procedures, triggers, and rules Web development tools Other features Upgradability Vendor stability User base Training and user support Documentation Operating system required Cost Online help Standards used Version management Extensibile query optimization Scalability Support for analytical tools Interoperability with other DBMSs and other systems Web integration Replication utilities Distributed capabilities Portability Hardware required Network support Object-oriented capabilities Architecture (2- or 3-tier client/server) Performance Transaction throughput Maximum number of concurrent users XML support 9.8 Application Design Table 9.5 Analysis of features for DBMS product evaluation. DBMS: Sample product Vendor: Sample vendor Physical Definition Group Features Comments File structures available File structure maintenance Ease of reorganization Indexing Variable length fields/records Data compression Encryption routines Memory requirements Storage requirements Choice of 4 NOT self-regulating Specify with file structure Choice of 2 Totals Physical definition group Rating Weighting Score 8 6 4 6 6 7 4 0 0 0.15 0.2 0.25 0.15 0.15 0.05 0.05 0.00 0.00 1.2 1.2 1.0 0.9 0.9 0.35 0.2 0 0 41 1.0 5.75 0.25 1.44 5.75 We next sum together all the scores for each evaluated feature to produce a total score for the group. The score for the group is then itself subject to a weighting, to indicate its importance relative to other groups of features included in the evaluation. For example, in Table 9.5, the total score for the ‘Physical definition’ group is 5.75; however, this score has a weighting of 0.25. Finally, all the weighted scores for each assessed group of features are summed to produce a single score for the DBMS product, which is compared with the scores for the other products. The product with the highest score is the ‘winner’. In addition to this type of analysis, we can also evaluate products by allowing vendors to demonstrate their product or by testing the products in-house. In-house evaluation involves creating a pilot testbed using the candidate products. Each product is tested against its ability to meet the users’ requirements for the database system. Benchmarking reports published by the Transaction Processing Council can be found at www.tpc.org Recommend selection and produce report The final step of the DBMS selection is to document the process and to provide a statement of the findings and recommendations for a particular DBMS product. 
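Returning to the weighted-scoring analysis used in the 'Evaluate products' step, the calculation shown in Table 9.5 can itself be recorded and computed in a small evaluation database. The following is a minimal sketch under the assumption of a hypothetical FeatureScore table; only three of the 'Physical definition' features are shown, with the ratings and weightings taken from Table 9.5 (the full group in that table totals 5.75 before its group weighting of 0.25 is applied).

CREATE TABLE FeatureScore (
    product     VARCHAR(30),
    featureGrp  VARCHAR(30),
    feature     VARCHAR(40),
    rating      SMALLINT,          -- rating out of 10
    weighting   DECIMAL(3,2));     -- importance within the group (weightings sum to 1.0)

INSERT INTO FeatureScore VALUES
    ('Sample product', 'Physical definition', 'File structures available', 8, 0.15);
INSERT INTO FeatureScore VALUES
    ('Sample product', 'Physical definition', 'Ease of reorganization',    4, 0.25);
INSERT INTO FeatureScore VALUES
    ('Sample product', 'Physical definition', 'Data compression',          7, 0.05);

-- Each feature's score is rating * weighting; summing the scores gives the group total
SELECT product, featureGrp, SUM(rating * weighting) AS groupScore
FROM   FeatureScore
GROUP  BY product, featureGrp;

For the three rows shown, the individual scores are 1.2, 1.0, and 0.35, matching the corresponding entries in Table 9.5.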
Application Design Application design The design of the user interface and the application programs that use and process the database. 9.8 | 299 300 | Chapter 9 z Database Planning, Design, and Administration In Figure 9.1, observe that database and application design are parallel activities of the database system development lifecycle. In most cases, it is not possible to complete the application design until the design of the database itself has taken place. On the other hand, the database exists to support the applications, and so there must be a flow of information between application design and database design. We must ensure that all the functionality stated in the users’ requirements specification is present in the application design for the database system. This involves designing the application programs that access the database and designing the transactions, (that is, the database access methods). In addition to designing how the required functionality is to be achieved, we have to design an appropriate user interface to the database system. This interface should present the required information in a ‘user-friendly’ way. The importance of user interface design is sometimes ignored or left until late in the design stages. However, it should be recognized that the interface may be one of the most important components of the system. If it is easy to learn, simple to use, straightforward and forgiving, the users will be inclined to make good use of what information is presented. On the other hand, if the interface has none of these characteristics, the system will undoubtedly cause problems. In the following sections, we briefly examine two aspects of application design, namely transaction design and user interface design. 9.8.1 Transaction Design Before discussing transaction design we first describe what a transaction represents. Transaction An action, or series of actions, carried out by a single user or application program, which accesses or changes the content of the database. Transactions represent ‘real world’ events such as the registering of a property for rent, the addition of a new member of staff, the registration of a new client, and the renting out of a property. These transactions have to be applied to the database to ensure that data held by the database remains current with the ‘real world’ situation and to support the information needs of the users. A transaction may be composed of several operations, such as the transfer of money from one account to another. However, from the user’s perspective these operations still accomplish a single task. From the DBMS’s perspective, a transaction transfers the database from one consistent state to another. The DBMS ensures the consistency of the database even in the presence of a failure. The DBMS also ensures that once a transaction has completed, the changes made are permanently stored in the database and cannot be lost or undone (without running another transaction to compensate for the effect of the first transaction). If the transaction cannot complete for any reason, the DBMS should ensure that the changes made by that transaction are undone. In the example of the bank transfer, if money is debited from one account and the transaction fails before crediting the other account, the DBMS should undo the debit. 
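Expressed in SQL, such a transfer might look as follows; this is a minimal sketch in which the Account table, its columns, and the account numbers are assumptions introduced only for illustration. Because both updates are enclosed in a single transaction, a failure before COMMIT causes the DBMS to undo the debit.

START TRANSACTION;
UPDATE Account SET balance = balance - 100 WHERE accountNo = 'A123';   -- debit
UPDATE Account SET balance = balance + 100 WHERE accountNo = 'B456';   -- credit
COMMIT;                     -- both changes are made permanent together, or neither is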
If we were to define the debit and credit 9.8 Application Design operations as separate transactions, then once we had debited the first account and completed the transaction, we are not allowed to undo that change (without running another transaction to credit the debited account with the required amount). The purpose of transaction design is to define and document the high-level characteristics of the transactions required on the database, including: n n n n n data to be used by the transaction; functional characteristics of the transaction; output of the transaction; importance to the users; expected rate of usage. This activity should be carried out early in the design process to ensure that the implemented database is capable of supporting all the required transactions. There are three main types of transactions: retrieval transactions, update transactions, and mixed transactions. n n n Retrieval transactions are used to retrieve data for display on the screen or in the production of a report. For example, the operation to search for and display the details of a property (given the property number) is an example of a retrieval transaction. Update transactions are used to insert new records, delete old records, or modify existing records in the database. For example, the operation to insert the details of a new property into the database is an example of an update transaction. Mixed transactions involve both the retrieval and updating of data. For example, the operation to search for and display the details of a property (given the property number) and then update the value of the monthly rent is an example of a mixed transaction. User Interface Design Guidelines Before implementing a form or report, it is essential that we first design the layout. Useful guidelines to follow when designing forms or reports are listed in Table 9.6 (Shneiderman, 1992). Meaningful title The information conveyed by the title should clearly and unambiguously identify the purpose of the form/report. Comprehensible instructions Familiar terminology should be used to convey instructions to the user. The instructions should be brief, and, when more information is required, help screens should be made available. Instructions should be written in a consistent grammatical style using a standard format. 9.8.2 | 301 302 | Chapter 9 z Database Planning, Design, and Administration Table 9.6 Guidelines for form/report design. Meaningful title Comprehensible instructions Logical grouping and sequencing of fields Visually appealing layout of the form/report Familiar field labels Consistent terminology and abbreviations Consistent use of color Visible space and boundaries for data-entry fields Convenient cursor movement Error correction for individual characters and entire fields Error messages for unacceptable values Optional fields marked clearly Explanatory messages for fields Completion signal Logical grouping and sequencing of fields Related fields should be positioned together on the form/report. The sequencing of fields should be logical and consistent. Visually appealing layout of the form/report The form/report should present an attractive interface to the user. The form/report should appear balanced with fields or groups of fields evenly positioned throughout the form/ report. There should not be areas of the form/report that have too few or too many fields. Fields or groups of fields should be separated by a regular amount of space. Where appropriate, fields should be vertically or horizontally aligned. 
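The three types can be illustrated with short SQL sketches against the PropertyForRent columns used in the earlier sketch; the property number, address, rent values, and owner number below are invented for illustration and are not part of the DreamHome case study data.

-- Retrieval transaction: search for and display the details of a given property
SELECT * FROM PropertyForRent WHERE propertyNo = 'PR1';

-- Update transaction: insert the details of a new property
INSERT INTO PropertyForRent (propertyNo, address, rent, ownerNo)
VALUES ('PR2', '6 Example Street, Glasgow', 450.00, 'OW1');

-- Mixed transaction: retrieve a property's details, then update its monthly rent;
-- wrapping both operations in one transaction means they succeed or fail together
START TRANSACTION;
SELECT * FROM PropertyForRent WHERE propertyNo = 'PR1';
UPDATE PropertyForRent SET rent = 475.00 WHERE propertyNo = 'PR1';
COMMIT;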
In cases where a form on screen has a hardcopy equivalent, the appearance of both should be consistent. Familiar field labels Field labels should be familiar. For example, if Sex was replaced by Gender, it is possible that some users would be confused. Consistent terminology and abbreviations An agreed list of familiar terms and abbreviations should be used consistently. Consistent use of color Color should be used to improve the appearance of a form/report and to highlight important fields or important messages. To achieve this, color should be used in a consistent and 9.9 Prototyping meaningful way. For example, fields on a form with a white background may indicate data-entry fields and those with a blue background may indicate display-only fields. Visible space and boundaries for data-entry fields A user should be visually aware of the total amount of space available for each field. This allows a user to consider the appropriate format for the data before entering the values into a field. Convenient cursor movement A user should easily identify the operation required to move a cursor throughout the form/report. Simple mechanisms such as using the Tab key, arrows, or the mouse pointer should be used. Error correction for individual characters and entire fields A user should easily identify the operation required to make alterations to field values. Simple mechanisms should be available such as using the Backspace key or by overtyping. Error messages for unacceptable values If a user attempts to enter incorrect data into a field, an error message should be displayed. The message should inform the user of the error and indicate permissible values. Optional fields marked clearly Optional fields should be clearly identified for the user. This can be achieved using an appropriate field label or by displaying the field using a color that indicates the type of the field. Optional fields should be placed after required fields. Explanatory messages for fields When a user places a cursor on a field, information about the field should appear in a regular position on the screen such as a window status bar. Completion signal It should be clear to a user when the process of filling in fields on a form is complete. However, the option to complete the process should not be automatic as the user may wish to review the data entered. Prototyping At various points throughout the design process, we have the option to either fully implement the database system or build a prototype. 9.9 | 303 304 | Chapter 9 z Database Planning, Design, and Administration Prototyping Building a working model of a database system. A prototype is a working model that does not normally have all the required features or provide all the functionality of the final system. The main purpose of developing a prototype database system is to allow users to use the prototype to identify the features of the system that work well, or are inadequate, and if possible to suggest improvements or even new features to the database system. In this way, we can greatly clarify the users’ requirements for both the users and developers of the system and evaluate the feasibility of a particular system design. Prototypes should have the major advantage of being relatively inexpensive and quick to build. There are two prototyping strategies in common use today: requirements prototyping and evolutionary prototyping. 
Requirements prototyping uses a prototype to determine the requirements of a proposed database system and once the requirements are complete the prototype is discarded. While evolutionary prototyping is used for the same purposes, the important difference is that the prototype is not discarded but with further development becomes the working database system. 9.10 Implementation Implementation The physical realization of the database and application designs. On completion of the design stages (which may or may not have involved prototyping), we are now in a position to implement the database and the application programs. The database implementation is achieved using the Data Definition Language (DDL) of the selected DBMS or a Graphical User Interface (GUI), which provides the same functionality while hiding the low-level DDL statements. The DDL statements are used to create the database structures and empty database files. Any specified user views are also implemented at this stage. The application programs are implemented using the preferred third or fourth generation language (3GL or 4GL). Parts of these application programs are the database transactions, which are implemented using the Data Manipulation Language (DML) of the target DBMS, possibly embedded within a host programming language, such as Visual Basic (VB), VB.net, Python, Delphi, C, C++, C#, Java, COBOL, Fortran, Ada, or Pascal. We also implement the other components of the application design such as menu screens, data entry forms, and reports. Again, the target DBMS may have its own fourth generation tools that allow rapid development of applications through the provision of non-procedural query languages, reports generators, forms generators, and application generators. Security and integrity controls for the system are also implemented. Some of these controls are implemented using the DDL, but others may need to be defined outside the DDL using, for example, the supplied DBMS utilities or operating system controls. Note that SQL (Structured Query Language) is both a DDL and a DML as described in Chapters 5 and 6. 9.12 Testing Data Conversion and Loading Data conversion and loading | 9.11 Transferring any existing data into the new database and converting any existing applications to run on the new database. This stage is required only when a new database system is replacing an old system. Nowadays, it is common for a DBMS to have a utility that loads existing files into the new database. The utility usually requires the specification of the source file and the target database, and then automatically converts the data to the required format of the new database files. Where applicable, it may be possible for the developer to convert and use application programs from the old system for use by the new system. Whenever conversion and loading are required, the process should be properly planned to ensure a smooth transition to full operation. Testing Testing The process of running the database system with the intent of finding errors. Before going live, the newly developed database system should be thoroughly tested. This is achieved using carefully planned test strategies and realistic data so that the entire testing process is methodically and rigorously carried out. Note that in our definition of testing we have not used the commonly held view that testing is the process of demonstrating that faults are not present. In fact, testing cannot show the absence of faults; it can show only that software faults are present. 
If testing is conducted successfully, it will uncover errors with the application programs and possibly the database structure. As a secondary benefit, testing demonstrates that the database and the application programs appear to be working according to their specification and that performance requirements appear to be satisfied. In addition, metrics collected from the testing stage provide a measure of software reliability and software quality. As with database design, the users of the new system should be involved in the testing process. The ideal situation for system testing is to have a test database on a separate hardware system, but often this is not available. If real data is to be used, it is essential to have backups taken in case of error. Testing should also cover usability of the database system. Ideally, an evaluation should be conducted against a usability specification. Examples of criteria that can be used to conduct the evaluation include (Sommerville, 2002): n n n Learnability – How long does it take a new user to become productive with the system? Performance – How well does the system response match the user’s work practice? Robustness – How tolerant is the system of user error? 9.12 305 306 | Chapter 9 z Database Planning, Design, and Administration n n Recoverability – How good is the system at recovering from user errors? Adapatability – How closely is the system tied to a single model of work? Some of these criteria may be evaluated in other stages of the lifecycle. After testing is complete, the database system is ready to be ‘signed off’ and handed over to the users. 9.13 Operational Maintenance Operational maintenance The process of monitoring and maintaining the database system following installation. In the previous stages, the database system has been fully implemented and tested. The system now moves into a maintenance stage, which involves the following activities: n n Monitoring the performance of the system. If the performance falls below an acceptable level, tuning or reorganization of the database may be required. Maintaining and upgrading the database system (when required). New requirements are incorporated into the database system through the preceding stages of the lifecycle. Once the database system is fully operational, close monitoring takes place to ensure that performance remains within acceptable levels. A DBMS normally provides various utilities to aid database administration including utilities to load data into a database and to monitor the system. The utilities that allow system monitoring give information on, for example, database usage, locking efficiency (including number of deadlocks that have occurred, and so on), and query execution strategy. The Database Administrator (DBA) can use this information to tune the system to give better performance, for example, by creating additional indexes to speed up queries, by altering storage structures, or by combining or splitting tables. The monitoring process continues throughout the life of a database system and in time may lead to reorganization of the database to satisfy the changing requirements. These changes in turn provide information on the likely evolution of the system and the future resources that may be needed. This, together with knowledge of proposed new applications, enables the DBA to engage in capacity planning and to notify or alert senior staff to adjust plans accordingly. 
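As a concrete illustration of the tuning actions mentioned above (creating additional indexes, or combining tables during a reorganization), the statements below are a minimal sketch; the index name and the ArchivedProperty table are assumptions introduced purely for illustration, not part of the DreamHome design.

-- Create an additional index to speed up a query that monitoring shows is run frequently
CREATE INDEX PropertyRentInd ON PropertyForRent (rent);

-- Combine data from a little-used table into the main table as part of a reorganization
INSERT INTO PropertyForRent (propertyNo, address, rent, ownerNo)
SELECT propertyNo, address, rent, ownerNo FROM ArchivedProperty;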
If the DBMS lacks certain utilities, the DBA can either develop the required utilities in-house or purchase additional vendor tools, if available. We discuss database administration in more detail in Section 9.15. When a new database application is brought online, the users should operate it in parallel with the old system for a period of time. This safeguards current operations in case of unanticipated problems with the new system. Periodic checks on data consistency between the two systems need to be made, and only when both systems appear to be producing the same results consistently, should the old system be dropped. If the changeover is too hasty, the end-result could be disastrous. Despite the foregoing assumption that the old system may be dropped, there may be situations where both systems are maintained. 9.14 CASE Tools CASE Tools The first stage of the database system development lifecycle, namely database planning, may also involve the selection of suitable Computer-Aided Software Engineering (CASE) tools. In its widest sense, CASE can be applied to any tool that supports software engineering. Appropriate productivity tools are needed by data administration and database administration staff to permit the database development activities to be carried out as efficiently and effectively as possible. CASE support may include: n n n n a data dictionary to store information about the database system’s data; design tools to support data analysis; tools to permit development of the corporate data model, and the conceptual and logical data models; tools to enable the prototyping of applications. CASE tools may be divided into three categories: upper-CASE, lower-CASE, and integratedCASE, as illustrated in Figure 9.6. Upper-CASE tools support the initial stages of the database system development lifecycle, from planning through to database design. LowerCASE tools support the later stages of the lifecycle, from implementation through testing, to operational maintenance. Integrated-CASE tools support all stages of the lifecycle and thus provide the functionality of both upper- and lower-CASE in one tool. Benefits of CASE The use of appropriate CASE tools should improve the productivity of developing a database system. We use the term ‘productivity’ to relate both to the efficiency of the development process and to the effectiveness of the developed system. Efficiency refers to the cost, in terms of time and money, of realizing the database system. CASE tools aim to support and automate the development tasks and thus improve efficiency. Effectiveness refers to the extent to which the system satisfies the information needs of its users. In the pursuit of greater productivity, raising the effectiveness of the development process may be even more important than increasing its efficiency. For example, it would not be sensible to develop a database system extremely efficiently when the end-product is not what the users want. In this way, effectiveness is related to the quality of the final product. Since computers are better than humans at certain tasks, for example consistency checking, CASE tools can be used to increase the effectiveness of some tasks in the development process. CASE tools provide the following benefits that improve productivity: n n Standards CASE tools help to enforce standards on a software project or across the organization. They encourage the production of standard test components that can be reused, thus simplifying maintenance and increasing productivity. 
Integration CASE tools store all the information generated in a repository, or data dictionary, as discussed in Section 2.7. Thus, it should be possible to store the data gathered during all stages of the database system development lifecycle. The data then can be linked together to ensure that all parts of the system are integrated. In this way, | 9.14 307 308 | Chapter 9 z Database Planning, Design, and Administration Figure 9.6 Application of CASE tools. n n n an organization’s information system no longer has to consist of independent, unconnected components. Support for standard methods Structured techniques make significant use of diagrams, which are difficult to draw and maintain manually. CASE tools simplify this process, resulting in documentation that is correct and more current. Consistency Since all the information in the data dictionary is interrelated, CASE tools can check its consistency. Automation Some CASE tools can automatically transform parts of a design specification into executable code. This reduces the work required to produce the implemented system, and may eliminate errors that arise during the coding process. For further information on CASE tools, the interested reader is referred to Gane (1990), Batini et al. (1992), and Kendall and Kendall (1995). 9.15 Data Administration and Database Administration Data Administration and Database Administration | 9.15 The Data Administrator (DA) and Database Administrator (DBA) are responsible for managing and controlling the activities associated with the corporate data and the corporate database, respectively. The DA is more concerned with the early stages of the lifecycle, from planning through to logical database design. In contrast, the DBA is more concerned with the later stages, from application/physical database design to operational maintenance. In this final section of the chapter, we discuss the purpose and tasks associated with data and database administration. Data Administration Data administration 9.15.1 The management of the data resource, which includes database planning, development, and maintenance of standards, policies and procedures, and conceptual and logical database design. The Data Administrator (DA) is responsible for the corporate data resource, which includes non-computerized data, and in practice is often concerned with managing the shared data of users or application areas of an organization. The DA has the primary responsibility of consulting with and advising senior managers and ensuring that the application of database technologies continues to support corporate objectives. In some enterprises, data administration is a distinct functional area, in others it may be combined with database administration. The tasks associated with data administration are described in Table 9.7. Database Administration Database administration The management of the physical realization of a database system, which includes physical database design and implementation, setting security and integrity controls, monitoring system performance, and reorganizing the database, as necessary. The database administration staff are more technically oriented than the data administration staff, requiring knowledge of specific DBMSs and the operating system environment. Although the primary responsibilities are centered on developing and maintaining systems using the DBMS software to its fullest extent, DBA staff also assist DA staff in other areas, as indicated in Table 9.8. 
The number of staff assigned to the database administration functional area varies, and is often determined by the size of the organization. The tasks of database administration are described in Table 9.8.

Table 9.7 Data administration tasks.
n Selecting appropriate productivity tools.
n Assisting in the development of the corporate IT/IS and enterprise strategies.
n Undertaking feasibility studies and planning for database development.
n Developing a corporate data model.
n Determining the organization's data requirements.
n Setting data collection standards and establishing data formats.
n Estimating volumes of data and likely growth.
n Determining patterns and frequencies of data usage.
n Determining data access requirements and safeguards for both legal and enterprise requirements.
n Undertaking conceptual and logical database design.
n Liaising with database administration staff and application developers to ensure applications meet all stated requirements.
n Educating users on data standards and legal responsibilities.
n Keeping up to date with IT/IS and enterprise developments.
n Ensuring documentation is up to date and complete, including standards, policies, procedures, use of the data dictionary, and controls on end-users.
n Managing the data dictionary.
n Liaising with users to determine new requirements and to resolve difficulties over data access or performance.
n Developing a security policy.

Table 9.8 Database administration tasks.
n Evaluating and selecting DBMS products.
n Undertaking physical database design.
n Implementing a physical database design using a target DBMS.
n Defining security and integrity constraints.
n Liaising with database application developers.
n Developing test strategies.
n Training users.
n Responsible for 'signing off' the implemented database system.
n Monitoring system performance and tuning the database, as appropriate.
n Performing backups routinely.
n Ensuring recovery mechanisms and procedures are in place.
n Ensuring documentation is complete, including in-house produced material.
n Keeping up to date with software and hardware developments and costs, and installing updates as necessary.

Table 9.9 Data administration and database administration – main task differences.
Data administration: Involved in strategic IS planning; Determines long-term goals; Enforces standards, policies, and procedures; Determines data requirements; Develops conceptual and logical database design; Develops and maintains corporate data model; Coordinates system development; Managerial orientation; DBMS independent.
Database administration: Evaluates new DBMSs; Executes plans to achieve goals; Enforces standards, policies, and procedures; Implements data requirements; Develops logical and physical database design; Implements physical database design; Monitors and controls database; Technical orientation; DBMS dependent.

9.15.3 Comparison of Data and Database Administration

The preceding sections examined the purpose and tasks associated with data administration and database administration. In this final section we briefly contrast these functional areas. Table 9.9 summarizes the main task differences of data administration and database administration. Perhaps the most obvious difference lies in the nature of the work carried out. Data administration staff tend to be much more managerial, whereas the database administration staff tend to be more technical.
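To make the technical orientation of the DBA role more concrete, the following is a minimal SQL sketch of the kinds of statements a DBA might issue when carrying out the Table 9.8 tasks of implementing a physical database design, defining security and integrity constraints, and tuning for performance. The Staff and Branch tables, their columns, the salary range, and the role name SupervisorRole are illustrative assumptions only; the actual relations and controls would come from the physical design and security policy developed for a specific system.

-- Implementing part of a physical database design using a target DBMS
-- (assumes a Branch table has already been created)
CREATE TABLE Staff (
    staffNo   VARCHAR(5)   NOT NULL,
    name      VARCHAR(30)  NOT NULL,
    position  VARCHAR(10)  NOT NULL,
    salary    DECIMAL(7,2) NOT NULL,
    branchNo  VARCHAR(4)   NOT NULL,
    -- Defining integrity constraints: primary key, domain check, referential integrity
    CONSTRAINT pkStaff  PRIMARY KEY (staffNo),
    CONSTRAINT ckSalary CHECK (salary BETWEEN 6000 AND 40000),
    CONSTRAINT fkBranch FOREIGN KEY (branchNo) REFERENCES Branch(branchNo)
);

-- Supporting performance monitoring and tuning: index a frequently searched column
CREATE INDEX staffBranchInd ON Staff (branchNo);

-- Defining security controls: a view that hides salaries, granted to a supervisor role
CREATE VIEW StaffNoSalary AS
    SELECT staffNo, name, position, branchNo
    FROM Staff;
GRANT SELECT ON StaffNoSalary TO SupervisorRole;

In practice such statements are not written once and forgotten; they are refined as the database is monitored and tuned during operational maintenance.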
Chapter Summary n n n n n An information system is the resources that enable the collection, management, control, and dissemination of information throughout an organization. A computer-based information system includes the following components: database, database software, application software, computer hardware including storage media, and personnel using and developing the system. The database is a fundamental component of an information system, and its development and usage should be viewed from the perspective of the wider requirements of the organization. Therefore, the lifecycle of an organizational information system is inherently linked to the lifecycle of the database that supports it. The main stages of the database system development lifecycle include: database planning, system definition, requirements collection and analysis, database design, DBMS selection (optional), application design, prototyping (optional), implementation, data conversion and loading, testing, and operational maintenance. Database planning is the management activities that allow the stages of the database system development lifecycle to be realized as efficiently and effectively as possible. 312 | Chapter 9 z Database Planning, Design, and Administration n System definition involves identifying the scope and boundaries of the database system and user views. A user view defines what is required of a database system from the perspective of a particular job role (such as Manager or Supervisor) or enterprise application (such as marketing, personnel, or stock control). n Requirements collection and analysis is the process of collecting and analyzing information about the part of the organization that is to be supported by the database system, and using this information to identify the requirements for the new system. There are three main approaches to managing the requirements for a database system that has multiple user views, namely the centralized approach, the view integration approach, and a combination of both approaches. n The centralized approach involves merging the requirements for each user view into a single set of requirements for the new database system. A data model representing all user views is created during the database design stage. In the view integration approach, requirements for each user view remain as separate lists. Data models representing each user view are created then merged later during the database design stage. n Database design is the process of creating a design that will support the enterprise’s mission statement and mission objectives for the required database system. There are three phases of database design, namely conceptual, logical, and physical database design. n Conceptual database design is the process of constructing a model of the data used in an enterprise, independent of all physical considerations. n Logical database design is the process of constructing a model of the data used in an enterprise based on a specific data model, but independent of a particular DBMS and other physical considerations. n Physical database design is the process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures. n DBMS selection involves selecting a suitable DBMS for the database system. 
n Application design involves user interface design and transaction design, which describes the application programs that use and process the database. A database transaction is an action, or series of actions, carried out by a single user or application program, which accesses or changes the content of the database. n Prototyping involves building a working model of the database system, which allows the designers or users to visualize and evaluate the system. n Implementation is the physical realization of the database and application designs. n Data conversion and loading involves transferring any existing data into the new database and converting any existing applications to run on the new database. n Testing is the process of running the database system with the intent of finding errors. n Operational maintenance is the process of monitoring and maintaining the system following installation. n Computer-Aided Software Engineering (CASE) applies to any tool that supports software engineering and permits the database system development activities to be carried out as efficiently and effectively as possible. CASE tools may be divided into three categories: upper-CASE, lower-CASE, and integrated-CASE. n Data administration is the management of the data resource, including database planning, development and maintenance of standards, policies and procedures, and conceptual and logical database design. n Database administration is the management of the physical realization of a database system, including physical database design and implementation, setting security and integrity controls, monitoring system performance, and reorganizing the database as necessary. Exercises | 313 Review Questions 9.1 9.2 9.3 9.4 9.5 9.6 Describe the major components of an information system. Discuss the relationship between the information systems lifecycle and the database system development lifecycle. Describe the main purpose(s) and activities associated with each stage of the database system development lifecycle. Discuss what a user view represents in the context of a database system. Discuss the main approaches for managing the design of a database system that has multiple user views. Compare and contrast the three phases of database design. 9.7 What are the main purposes of data modeling and identify the criteria for an optimal data model? 9.8 Identify the stage(s) where it is appropriate to select a DBMS and describe an approach to selecting the ‘best’ DBMS. 9.9 Application design involves transaction design and user interface design. Describe the purpose and main activities associated with each. 9.10 Discuss why testing cannot show the absence of faults, only that software faults are present. 9.11 Describe the main advantages of using the prototyping approach when building a database system. 9.12 Define the purpose and tasks associated with data administration and database administration. Exercises 9.13 Assume that you are responsible for selecting a new DBMS product for a group of users in your organization. To undertake this exercise, you must first establish a set of requirements for the group and then identify a set of features that a DBMS product must provide to fulfill the requirements. Describe the process of evaluating and selecting the best DBMS product. 9.14 Describe the process of evaluating and selecting a DBMS product for each of the case studies described in Appendix B. 
9.15 Investigate whether data administration and database administration exist as distinct functional areas within your organization. If identified, describe the organization, responsibilities, and tasks associated with each functional area. Chapter 10 Fact-Finding Techniques Chapter Objectives In this chapter you will learn: n When fact-finding techniques are used in the database system development lifecycle. n The types of facts collected in each stage of the database system development lifecycle. n The types of documentation produced in each stage of the database system development lifecycle. n The most commonly used fact-finding techniques. n How to use each fact-finding technique and the advantages and disadvantages of each. n About a property rental company called DreamHome. n How to apply fact-finding techniques to the early stages of the database system development lifecycle. In Chapter 9 we introduced the stages of the database system development lifecycle. There are many occasions during these stages when it is critical that the database developer captures the necessary facts to build the required database system. The necessary facts include, for example, the terminology used within the enterprise, problems encountered using the current system, opportunities sought from the new system, necessary constraints on the data and users of the new system, and a prioritized set of requirements for the new system. These facts are captured using fact-finding techniques. Fact-finding The formal process of using techniques such as interviews and questionnaires to collect facts about systems, requirements, and preferences. In this chapter we discuss when a database developer might use fact-finding techniques and what types of facts should be captured. We present an overview of how these facts are used to generate the main types of documentation used throughout the database system development 10.1 When Are Fact-Finding Techniques Used? | lifecycle. We describe the most commonly used fact-finding techniques and identify the advantages and disadvantages of each. We finally demonstrate how some of these techniques may be used during the earlier stages of the database system development lifecycle using a property management company called DreamHome. The DreamHome case study is used throughout this book. Structure of this Chapter In Section 10.1 we discuss when a database developer might use fact-finding techniques. (Throughout this book we use the term ‘database developer’ to refer to a person or group of people responsible for the analysis, design, and implementation of a database system.) In Section 10.2 we illustrate the types of facts that should be collected and the documentation that should be produced at each stage of the database system development lifecycle. In Section 10.3 we describe the five most commonly used fact-finding techniques and identify the advantages and disadvantages of each. In Section 10.4 we demonstrate how fact-finding techniques can be used to develop a database system for a case study called DreamHome, a property management company. We begin this section by providing an overview of the DreamHome case study. We then examine the first three stages of the database system development lifecycle, namely database planning, system definition, and requirements collection and analysis. For each stage we demonstrate the process of collecting data using fact-finding techniques and describe the documentation produced. When Are Fact-Finding Techniques Used? 
There are many occasions for fact-finding during the database system development lifecycle. However, fact-finding is particularly crucial to the early stages of the lifecycle including the database planning, system definition, and requirements collection and analysis stages. It is during these early stages that the database developer captures the essential facts necessary to build the required database. Fact-finding is also used during database design and the later stages of the lifecycle, but to a lesser extent. For example, during physical database design, fact-finding becomes technical as the database developer attempts to learn more about the DBMS selected for the database system. Also, during the final stage, operational maintenance, fact-finding is used to determine whether a system requires tuning to improve performance or further development to include new requirements. Note that it is important to have a rough estimate of how much time and effort is to be spent on fact-finding for a database project. As we mentioned in Chapter 9, too much study too soon leads to paralysis by analysis. However, too little thought can result in an unnecessary waste of both time and money due to working on the wrong solution to the wrong problem. 10.1 315 316 | Chapter 10 z Fact-Finding Techniques 10.2 What Facts Are Collected? Throughout the database system development lifecycle, the database developer needs to capture facts about the current and/or future system. Table 10.1 provides examples of the sorts of data captured and the documentation produced for each stage of the lifecycle. As we mentioned in Chapter 9, the stages of the database system development lifecycle are Table 10.1 Examples of the data captured and the documentation produced for each stage of the database system development lifecycle. Stage of database system development lifecycle Examples of data captured Examples of documentation produced Database planning Aims and objectives of database project System definition Description of major user views (includes job roles or business application areas) Requirements collection and analysis Requirements for user views; systems specifications, including performance and security requirements Users’ responses to checking the logical database design; functionality provided by target DBMS Mission statement and objectives of database system Definition of scope and boundary of database application; definition of user views to be supported Users’ and system requirements specifications Database design Application design Users’ responses to checking interface design DBMS selection Functionality provided by target DBMS Users’ responses to prototype Prototyping Implementation Data conversion and loading Testing Operational maintenance Functionality provided by target DBMS Format of current data; data import capabilities of target DBMS Test results Performance testing results; new or changing user and system requirements Conceptual/logical database design (includes ER model(s), data dictionary, and relational schema); physical database design Application design (includes description of programs and user interface) DBMS evaluation and recommendations Modified users’ requirements and systems specifications Testing strategies used; analysis of test results User manual; analysis of performance results; modified users’ requirements and systems specifications 10.3 Fact-Finding Techniques | not strictly sequential, but involve some amount of repetition of previous stages through feedback loops. 
This is also true for the data captured and the documentation produced at each stage. For example, problems encountered during database design may necessitate additional data capture on the requirements for the new system. Fact-Finding Techniques 10.3 A database developer normally uses several fact-finding techniques during a single database project. There are five commonly used fact-finding techniques: n n n n n examining documentation; interviewing; observing the enterprise in operation; research; questionnaires. In the following sections we describe these fact-finding techniques and identify the advantages and disadvantages of each. Examining Documentation 10.3.1 Examining documentation can be useful when we are trying to gain some insight as to how the need for a database arose. We may also find that documentation can help to provide information on the part of the enterprise associated with the problem. If the problem relates to the current system, there should be documentation associated with that system. By examining documents, forms, reports, and files associated with the current system, we can quickly gain some understanding of the system. Examples of the types of documentation that should be examined are listed in Table 10.2. Interviewing Interviewing is the most commonly used, and normally most useful, fact-finding technique. We can interview to collect information from individuals face-to-face. There can be several objectives to using interviewing, such as finding out facts, verifying facts, clarifying facts, generating enthusiasm, getting the end-user involved, identifying requirements, and gathering ideas and opinions. However, using the interviewing technique requires good communication skills for dealing effectively with people who have different values, priorities, opinions, motivations, and personalities. As with other fact-finding techniques, interviewing is not always the best method for all situations. The advantages and disadvantages of using interviewing as a fact-finding technique are listed in Table 10.3. There are two types of interview: unstructured and structured. Unstructured interviews are conducted with only a general objective in mind and with few, if any, specific 10.3.2 317 318 | Chapter 10 z Fact-Finding Techniques Table 10.2 Examples of types of documentation that should be examined. Purpose of documentation Examples of useful sources Describes problem and need for database Internal memos, e-mails, and minutes of meetings Employee/customer complaints, and documents that describe the problem Performance reviews/reports Organizational chart, mission statement, and strategic plan of the enterprise Objectives for the part of the enterprise being studied Task/job descriptions Samples of completed manual forms and reports Samples of completed computerized forms and reports Various types of flowcharts and diagrams Data dictionary Database system design Program documentation User/training manuals Describes the part of the enterprise affected by problem Describes current system Table 10.3 Advantages and disadvantages of using interviewing as a fact-finding technique. 
Advantages:
n Allows interviewee to respond freely and openly to questions
n Allows interviewee to feel part of project
n Allows interviewer to follow up on interesting comments made by interviewee
n Allows interviewer to adapt or re-word questions during interview
n Allows interviewer to observe interviewee's body language
Disadvantages:
n Very time-consuming and costly, and therefore may be impractical
n Success is dependent on communication skills of interviewer
n Success can be dependent on willingness of interviewees to participate in interviews

Unstructured interviews are conducted with few, if any, specific questions; the interviewer counts on the interviewee to provide a framework and direction to the interview. This type of interview frequently loses focus and, for this reason, it often does not work well for database analysis and design.
In structured interviews, the interviewer has a specific set of questions to ask the interviewee. Depending on the interviewee's responses, the interviewer will direct additional questions to obtain clarification or expansion. Open-ended questions allow the interviewee to respond in any way that seems appropriate. An example of an open-ended question is: 'Why are you dissatisfied with the report on client registration?' Closed-ended questions restrict answers to either specific choices or short, direct responses. An example of such a question might be: 'Are you receiving the report on client registration on time?' or 'Does the report on client registration contain accurate information?' Both questions require only a 'Yes' or 'No' response.
Ensuring a successful interview includes selecting appropriate individuals to interview, preparing extensively for the interview, and conducting the interview in an efficient and effective manner.

10.3.3 Observing the Enterprise in Operation

Observation is one of the most effective fact-finding techniques for understanding a system. With this technique, it is possible to either participate in, or watch, a person perform activities to learn about the system. This technique is particularly useful when the validity of data collected through other methods is in question or when the complexity of certain aspects of the system prevents a clear explanation by the end-users.
As with the other fact-finding techniques, successful observation requires preparation. To ensure that the observation is successful, it is important to know as much about the individuals and the activity to be observed as possible. For example, 'When are the low, normal, and peak periods for the activity being observed?' and 'Will the individuals be upset by having someone watch and record their actions?' The advantages and disadvantages of using observation as a fact-finding technique are listed in Table 10.4.

Table 10.4 Advantages and disadvantages of using observation as a fact-finding technique.
Advantages:
n Allows the validity of facts and data to be checked
n Observer can see exactly what is being done
n Observer can also obtain data describing the physical environment of the task
n Relatively inexpensive
n Observer can do work measurements
Disadvantages:
n People may knowingly or unknowingly perform differently when being observed
n May miss observing tasks involving different levels of difficulty or volume normally experienced during that time period
n Some tasks may not always be performed in the manner in which they are observed
n May be impractical

10.3.4 Research

A useful fact-finding technique is to research the application and problem. Computer trade journals, reference books, and the Internet (including user groups and bulletin boards) are good sources of information. They can provide information on how others have solved similar problems, plus whether or not software packages exist to solve or even partially solve the problem. The advantages and disadvantages of using research as a fact-finding technique are listed in Table 10.5.

Table 10.5 Advantages and disadvantages of using research as a fact-finding technique.
Advantages:
n Can save time if solution already exists
n Researcher can see how others have solved similar problems or met similar requirements
n Keeps researcher up to date with current developments
Disadvantages:
n Requires access to appropriate sources of information
n May ultimately not help in solving problem because problem is not documented elsewhere

10.3.5 Questionnaires

Another fact-finding technique is to conduct surveys through questionnaires. Questionnaires are special-purpose documents that allow facts to be gathered from a large number of people while maintaining some control over their responses. When dealing with a large audience, no other fact-finding technique can tabulate the same facts as efficiently. The advantages and disadvantages of using questionnaires as a fact-finding technique are listed in Table 10.6.

Table 10.6 Advantages and disadvantages of using questionnaires as a fact-finding technique.
Advantages:
n People can complete and return questionnaires at their convenience
n Relatively inexpensive way to gather data from a large number of people
n People more likely to provide the real facts as responses can be kept confidential
n Responses can be tabulated and analyzed quickly
Disadvantages:
n Number of respondents can be low, possibly only 5% to 10%
n Questionnaires may be returned incomplete
n May not provide an opportunity to adapt or re-word questions that have been misinterpreted
n Cannot observe and analyze the respondent's body language

There are two types of questions that can be asked in a questionnaire, namely free-format and fixed-format. Free-format questions offer the respondent greater freedom in providing answers. A question is asked and the respondent records the answer in the space provided after the question. Examples of free-format questions are: 'What reports do you currently receive and how are they used?' and 'Are there any problems with these reports? If so, please explain.' The problems with free-format questions are that the respondent's answers may prove difficult to tabulate and, in some cases, may not match the questions asked.
Fixed-format questions require specific responses from individuals. Given any question, the respondent must choose from the available answers. This makes the results much easier to tabulate. On the other hand, the respondent cannot provide additional information that might prove valuable. An example of a fixed-format question is: 'The current format of the report on property rentals is ideal and should not be changed.' The respondent may be given the option to answer 'Yes' or 'No' to this question, or be given the option to answer from a range of responses including 'Strongly agree', 'Agree', 'No opinion', 'Disagree', and 'Strongly disagree'.
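Since the chief attraction of fixed-format questions is that they are easy to tabulate, a brief sketch may help. Assuming, purely for illustration, that completed questionnaires have been keyed into a table QuestionnaireResponse(respondentNo, questionNo, response), the answers to a given question could be summarized with a single SQL query; the table and column names are hypothetical and are not part of the case study that follows.

-- Tabulate the answers given to fixed-format question number 3
SELECT response, COUNT(*) AS numberOfRespondents
FROM QuestionnaireResponse
WHERE questionNo = 3
GROUP BY response
ORDER BY numberOfRespondents DESC;

Free-format answers, by contrast, would have to be read and categorized by hand before any such summary could be produced, which is exactly the tabulation problem noted above.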
Using Fact-Finding Techniques – A Worked Example 10.4 In this section we first present an overview of the DreamHome case study and then use this case study to illustrate how to establish a database project. In particular, we illustrate how fact-finding techniques can be used and the documentation produced in the early stages of the database system development lifecycle namely the database planning, system definition, and requirements collection and analysis stages. The DreamHome Case Study – An Overview The first branch office of DreamHome was opened in 1992 in Glasgow in the UK. Since then, the Company has grown steadily and now has several offices in most of the main cities of the UK. However, the Company is now so large that more and more administrative staff are being employed to cope with the ever-increasing amount of paperwork. Furthermore, the communication and sharing of information between offices, even in the same city, is poor. The Director of the Company, Sally Mellweadows feels that too many mistakes are being made and that the success of the Company will be short-lived if she does not do something to remedy the situation. She knows that a database could help in part to solve the problem and requests that a database system be developed to support the running of DreamHome. The Director has provided the following brief description of how DreamHome currently operates. DreamHome specializes in property management, by taking an intermediate role between owners who wish to rent out their furnished property and clients of DreamHome who require to rent furnished property for a fixed period. DreamHome currently has about 2000 staff working in 100 branches. When a member of staff joins the Company, the DreamHome staff registration form is used. The staff registration form for Susan Brand is shown in Figure 10.1. Each branch has an appropriate number and type of staff including a Manager, Supervisors, and Assistants. The Manager is responsible for the day-to-day running of a branch and each Supervisor is responsible for supervising a group of staff called Assistants. An example of the first page of a report listing the details of staff working at a branch office in Glasgow is shown in Figure 10.2. Each branch office offers a range of properties for rent. To offer property through DreamHome, a property owner normally contacts the DreamHome branch office nearest to the property for rent. The owner provides the details of the property and agrees an 10.4.1 321 322 | Chapter 10 z Fact-Finding Techniques Figure 10.1 The DreamHome staff registration form for Susan Brand. Figure 10.2 Example of the first page of a report listing the details of staff working at a DreamHome branch office in Glasgow. appropriate rent for the property with the branch Manager. The registration form for a property in Glasgow is shown in Figure 10.3. Once a property is registered, DreamHome provides services to ensure that the property is rented out for maximum return for both the property owner and, of course, DreamHome. 10.4 Using Fact-Finding Techniques – A Worked Example | 323 Figure 10.3 The DreamHome property registration form for a property in Glasgow. These services include interviewing prospective renters (called clients), organizing viewings of the property by clients, advertising the property in local or national newspapers (when necessary), and negotiating the lease. Once rented, DreamHome assumes responsibility for the property including the collection of rent. 
Members of the public interested in renting out property must first contact their nearest DreamHome branch office to register as clients of DreamHome. However, before registration is accepted, a prospective client is normally interviewed to record personal details and preferences of the client in terms of property requirements. The registration form for a client called Mike Ritchie is shown in Figure 10.4. Once registration is complete, clients are provided with weekly reports that list properties currently available for rent. An example of the first page of a report listing the properties available for rent at a branch office in Glasgow is shown in Figure 10.5. Clients may request to view one or more properties from the list and after viewing will normally provide a comment on the suitability of the property. The first page of a report describing the comments made by clients on a property in Glasgow is shown in Figure 10.6. Properties that prove difficult to rent out are normally advertised in local and national newspapers. Once a client has identified a suitable property, a member of staff draws up a lease. The lease between a client called Mike Ritchie and a property in Glasgow is shown in Figure 10.7. 324 | Chapter 10 z Fact-Finding Techniques Figure 10.4 The DreamHome client registration form for Mike Ritchie. Figure 10.5 The first page of the DreamHome property for rent report listing property available at a branch in Glasgow. 10.4 Using Fact-Finding Techniques – A Worked Example | 325 Figure 10.6 The first page of the DreamHome property viewing report for a property in Glasgow. Figure 10.7 The DreamHome lease form for a client called Mike Ritchie renting a property in Glasgow. 326 | Chapter 10 z Fact-Finding Techniques At the end of a rental period a client may request that the rental be continued; however, this requires that a new lease be drawn up. Alternatively, a client may request to view alternative properties for the purposes of renting. 10.4.2 The DreamHome Case Study – Database Planning The first step in developing a database system is to clearly define the mission statement for the database project, which defines the major aims of the database system. Once the mission statement is defined, the next activity involves identifying the mission objectives, which should identify the particular tasks that the database must support (see Section 9.3). Creating the mission statement for the DreamHome database system We begin the process of creating a mission statement for the DreamHome database system by conducting interviews with the Director and any other appropriate staff, as indicated by the Director. Open-ended questions are normally the most useful at this stage of the process. Examples of typical questions we might ask include: ‘What is the purpose of your company?’ ‘Why do you feel that you need a database?’ ‘How do you know that a database will solve your problem?’ For example, the database developer may start the interview by asking the Director of DreamHome the following questions: Database Developer Director Database Developer Director What is the purpose of your company? We offer a wide range of high quality properties for rent to clients registered at our branches throughout the UK. Our ability to offer quality properties, of course, depends upon the services we provide to property owners. We provide a highly professional service to property owners to ensure that properties are rented out for maximum return. Why do you feel that you need a database? 
To be honest we can’t cope with our own success. Over the past few years we’ve opened several branches in most of the main cities of the UK, and at each branch we now offer a larger selection of properties to a growing number of clients. However, this success has been accompanied with increasing data management problems, which means that the level of service we provide is falling. Also, there’s a lack of cooperation and sharing of information between branches, which is a very worrying development. 10.4 Using Fact-Finding Techniques – A Worked Example | 327 Figure 10.8 Mission statement for the DreamHome database system. Database Developer Director How do you know that a database will solve your problem? All I know is that we are drowning in paperwork. We need something that will speed up the way we work by automating a lot of the day-to-day tasks that seem to take for ever these days. Also, I want the branches to start working together. Databases will help to achieve this, won’t they? Responses to these types of questions should help to formulate the mission statement. An example mission statement for the DreamHome database system is shown in Figure 10.8. When we have a clear and unambiguous mission statement that the staff of DreamHome agree with, we move on to define the mission objectives. Creating the mission objectives for the DreamHome database system The process of creating mission objectives involves conducting interviews with appropriate members of staff. Again, open-ended questions are normally the most useful at this stage of the process. To obtain the complete range of mission objectives, we interview various members of staff with different roles in DreamHome. Examples of typical questions we might ask include: ‘What is your job description?’ ‘What kinds of tasks do you perform in a typical day?’ ‘What kinds of data do you work with?’ ‘What types of reports do you use?’ ‘What types of things do you need to keep track of?’ ‘What service does your company provide to your customers?’ These questions (or similar) are put to the Director of DreamHome and members of staff in the role of Manager, Supervisor, and Assistant. It may be necessary to adapt the questions as required depending on whom is being interviewed. Director Database Developer Director What role do you play for the company? I oversee the running of the company to ensure that we continue to provide the best possible property rental service to our clients and property owners. 328 | Chapter 10 z Fact-Finding Techniques Database Developer Director Database Developer Director Database Developer Director Database Developer Director Database Developer Director What kinds of tasks do you perform in a typical day? I monitor the running of each branch by our Managers. I try to ensure that the branches work well together and share important information about properties and clients. I normally try to keep a high profile with my branch Managers by calling into each branch at least once or twice a month. What kinds of data do you work with? I need to see everything, well at least a summary of the data used or generated by DreamHome. That includes data about staff at all branches, all properties and their owners, all clients, and all leases. I also like to keep an eye on the extent to which branches advertise properties in newspapers. What types of reports do you use? I need to know what’s going on at all the branches and there’s lots of them. I spend a lot of my working day going over long reports on all aspects of DreamHome. 
I need reports that are easy to access and that let me get a good overview of what’s happening at a given branch and across all branches. What types of things do you need to keep track of? As I said before, I need to have an overview of everything, I need to see the whole picture. What service does your company provide to your customers? We try to provide the best property rental service in the UK. Manager Database Developer Manager Database Developer Manager Database Developer Manager What is your job description? My job title is Manager. I oversee the day-to-day running of my branch to provide the best property rental service to our clients and property owners. What kinds of tasks do you perform in a typical day? I ensure that the branch has the appropriate number and type of staff on duty at all times. I monitor the registering of new properties and new clients, and the renting activity of our currently active clients. It’s my responsibility to ensure that we have the right number and type of properties available to offer our clients. I sometimes get involved in negotiating leases for our top-of-the-range properties, although due to my workload I often have to delegate this task to Supervisors. What kinds of data do you work with? I mostly work with data on the properties offered at my branch and the owners, clients, and leases. I also need to know when properties are proving difficult to rent out so that I can arrange for them to be advertised in newspapers. I need to keep an eye on this aspect of the business because advertising can get costly. I also need access to data about staff working at my 10.4 Using Fact-Finding Techniques – A Worked Example Database Developer Manager Database Developer Manager Database Developer Manager branch and staff at other local branches. This is because I sometimes need to contact other branches to arrange management meetings or to borrow staff from other branches on a temporary basis to cover staff shortages due to sickness or during holiday periods. This borrowing of staff between local branches is informal and thankfully doesn’t happen very often. Besides data on staff, it would be helpful to see other types of data at the other branches such as data on property, property owners, clients, and leases, you know, to compare notes. Actually, I think the Director hopes that this database project is going to help promote cooperation and sharing of information between branches. However, some of the Managers I know are not going to be too keen on this because they think we’re in competition with each other. Part of the problem is that a percentage of a Manager’s salary is made up of a bonus, which is related to the number of properties we rent out. What types of reports do you use? I need various reports on staff, property, owners, clients, and leases. I need to know at a glance which properties we need to lease out and what clients are looking for. What types of things do you need to keep track of? I need to keep track of staff salaries. I need to know how well the properties on our books are being rented out and when leases are coming up for renewal. I also need to keep eye on our expenditure on advertising in newspapers. What service does your company provide to your customers? Remember that we have two types of customers, that is clients wanting to rent property and property owners. 
We need to make sure that our clients find the property they’re looking for quickly without too much legwork and at a reasonable rent and, of course, that our property owners see good returns from renting out their properties with minimal hassle. Supervisor Database Developer Supervisor Database Developer Supervisor What is your job description? My job title is Supervisor. I spend most of my time in the office dealing directly with our customers, that is clients wanting to rent property and property owners. I’m also responsible for a small group of staff called Assistants and making sure that they are kept busy, but that’s not a problem as there’s always plenty to do, it’s never ending actually. What kinds of tasks do you perform in a typical day? I normally start the day by allocating staff to particular duties, such as dealing with clients or property owners, organizing for clients to view properties, and the filing of paperwork. When | 329 330 | Chapter 10 z Fact-Finding Techniques a client finds a suitable property, I process the drawing up of a lease, although the Manager must see the documentation before any signatures are requested. I keep client details up to date and register new clients when they want to join the Company. When a new property is registered, the Manager allocates responsibility for managing that property to me or one of the other Supervisors or Assistants. Database Developer Supervisor What kinds of data do you work with? I work with data about staff at my branch, property, property owners, clients, property viewings, and leases. Database Developer Supervisor What types of reports do you use? Reports on staff and properties for rent. Database Developer Supervisor What types of things do you need to keep track of? I need to know what properties are available for rent and when currently active leases are due to expire. I also need to know what clients are looking for. I need to keep our Manager up to date with any properties that are proving difficult to rent out. Assistant Database Developer Assistant What is your job description? My job title is Assistant. I deal directly with our clients. Database Developer Assistant What kinds of tasks do you perform in a typical day? I answer general queries from clients about properties for rent. You know what I mean: ‘Do you have such and such type of property in a particular area of Glasgow?’ I also register new clients and arrange for clients to view properties. When we’re not too busy I file paperwork but I hate this part of the job, it’s so boring. Database Developer Assistant What kinds of data do you work with? I work with data on property and property viewings by clients and sometimes leases. Database Developer Assistant What types of reports do you use? Lists of properties available for rent. These lists are updated every week. Database Developer Assistant What types of things do you need to keep track of? Whether certain properties are available for renting out and which clients are still actively looking for property. Database Developer Assistant What service does your company provide to your customers? We try to answer questions about properties available for rent such as: ‘Do you have a 2-bedroom flat in Hyndland, Glasgow?’ and ‘What should I expect to pay for a 1-bedroom flat in the city center?’ 10.4 Using Fact-Finding Techniques – A Worked Example | 331 Figure 10.9 Mission objectives for the DreamHome database system. Responses to these types of questions should help to formulate the mission objectives. 
An example of the mission objectives for the DreamHome database system is shown in Figure 10.9. The DreamHome Case Study – System Definition The purpose of the system definition stage is to define the scope and boundary of the database system and its major user views. In Section 9.4.1 we described how a user view represents the requirements that should be supported by a database system as defined by a particular job role (such as Director or Supervisor) or business application area (such as property rentals or property sales). Defining the systems boundary for the DreamHome database system During this stage of the database system development lifecycle, further interviews with users can be used to clarify or expand on data captured in the previous stage. However, additional fact-finding techniques can also be used including examining the sample 10.4.3 332 | Chapter 10 z Fact-Finding Techniques Figure 10.10 Systems boundary for the DreamHome database system. documentation shown in Section 10.4.1. The data collected so far is analyzed to define the boundary of the database system. The systems boundary for the DreamHome database system is shown in Figure 10.10. Identifying the major user views for the DreamHome database system We now analyze the data collected so far to define the main user views of the database system. The majority of data about the user views was collected during interviews with the Director and members of staff in the role of Manager, Supervisor, and Assistant. The main user views for the DreamHome database system are shown in Figure 10.11. 10.4.4 The DreamHome Case Study – Requirements Collection and Analysis During this stage, we continue to gather more details on the user views identified in the previous stage, to create a users’ requirements specification that describes in detail the data to be held in the database and how the data is to be used. While gathering more information on the user views, we also collect any general requirements for the system. The purpose of gathering this information is to create a systems specification, which describes any features to be included in the new database system such as networking and shared access requirements, performance requirements, and the levels of security required. As we collect and analyze the requirements for the new system we also learn about the most useful and most troublesome features of the current system. When building a new database system it is sensible to try to retain the good things about the old system while introducing the benefits that will be part of using the new system. An important activity associated with this stage is deciding how to deal with the situation where there is more than one user view. As we discussed in Section 9.6, there are three Figure 10.11 Major user views for the DreamHome database system. 334 | Chapter 10 z Fact-Finding Techniques major approaches to dealing with multiple user views, namely the centralized approach, the view integration approach, and a combination of both approaches. We discuss how these approaches can be used shortly. Gathering more information on the user views of the DreamHome database system To find out more about the requirements for each user view, we may again use a selection of fact-finding techniques including interviews and observing the business in operation. 
Examples of the types of questions that we may ask about the data (represented as X) required by a user view include: ‘What type of data do you need to hold on X?’ ‘What sorts of things do you do with the data on X?’ For example, we may ask a Manager the following questions: Database Developer Manager Database Developer Manager What type of data do you need to hold on staff? The types of data held on a member of staff is his or her full name, position, sex, date of birth, and salary. What sorts of things do you do with the data on staff? I need to be able to enter the details of new members of staff and delete their details when they leave. I need to keep the details of staff up to date and print reports that list the full name, position, and salary of each member of staff at my branch. I need to be able to allocate staff to Supervisors. Sometimes when I need to communicate with other branches, I need to find out the names and telephone numbers of Managers at other branches. We need to ask similar questions about all the important data to be stored in the database. Responses to these questions will help identify the necessary details for the users’ requirements specification. Gathering information on the system requirements of the DreamHome database system While conducting interviews about user views, we should also collect more general information on the system requirements. Examples of the types of questions that we may ask about the system include: ‘What transactions run frequently on the database?’ ‘What transactions are critical to the operation of the organization?’ ‘When do the critical transactions run?’ ‘When are the low, normal, and high workload periods for the critical transactions?’ ‘What type of security do you want for the database system?’ 10.4 Using Fact-Finding Techniques – A Worked Example ‘Is there any highly sensitive data that should be accessed only by certain members of staff?’ ‘What historical data do you want to hold?’ ‘What are the networking and shared access requirements for the database system?’ ‘What type of protection from failures or data loss do you want for the database system?’ For example, we may ask a Manager the following questions: Database Developer Manager Database Developer Manager Database Developer Manager Database Developer Manager What transactions run frequently on the database? We frequently get requests either by phone or by clients who call into our branch to search for a particular type of property in a particular area of the city and for a rent no higher than a particular amount. We also need up-to-date information on properties and clients so that reports can be run off that show properties currently available for rent and clients currently seeking property. What transactions are critical to the operation of the business? Again, critical transactions include being able to search for particular properties and to print out reports with up-to-date lists of properties available for rent. Our clients would go elsewhere if we couldn’t provide this basic service. When do the critical transactions run? Every day. When are the low, normal, and high workload periods for the critical transactions? We’re open six days a week. In general, we tend to be quiet in the mornings and get busier as the day progresses. However, the busiest time-slots each day for dealing with customers are between 12 and 2pm and 5 and 7pm. 
We may ask the Director the following questions: Database Developer Director Database Developer Director What type of security do you want for the database system? I don’t suppose a database holding information for a property rental company holds very sensitive data, but I wouldn’t want any of our competitors to see the data on properties, owners, clients, and leases. Staff should only see the data necessary to do their job in a form that suits what they’re doing. For example, although it’s necessary for Supervisors and Assistants to see client details, client records should only be displayed one at a time and not as a report. Is there any highly sensitive data that should be accessed only by certain members of staff? As I said before, staff should only see the data necessary to do their jobs. For example, although Supervisors need to see data on staff, salary details should not be included. | 335 336 | Chapter 10 z Fact-Finding Techniques Database Developer Director Database Developer Director Database Developer Director What historical data do you want to hold? I want to keep the details of clients and owners for a couple of years after their last dealings with us, so that we can mailshot them to tell them about our latest offers, and generally try to attract them back. I also want to be able to keep lease information for a couple of years so that we can analyze it to find out which types of properties and areas of each city are the most popular for the property rental market, and so on. What are the networking and shared access requirements for the database system? I want all the branches networked to our main branch office, here in Glasgow, so that staff can access the system from wherever and whenever they need to. At most branches, I would expect about two or three staff to be accessing the system at any one time, but remember we have about 100 branches. Most of the time the staff should be just accessing local branch data. However, I don’t really want there to be any restrictions about how often or when the system can be accessed, unless it’s got real financial implications. What type of protection from failures or data loss do you want for the database system? The best of course. All our business is going to be conducted using the database, so if it goes down, we’re sunk. To be serious for a minute, I think we probably have to back up our data every evening when the branch closes. What do you think? We need to ask similar questions about all the important aspects of the system. Responses to these questions should help identify the necessary details for the system requirements specification. Managing the user views of the DreamHome database system How do we decide whether to use the centralized or view integration approach, or a combination of both to manage multiple user views? One way to help make a decision is to examine the overlap in the data used between the user views identified during the system definition stage. Table 10.7 cross-references the Director, Manager, Supervisor, and Assistant user views with the main types of data used by each user view. We see from Table 10.7 that there is overlap in the data used by all user views. However, the Director and Manager user views and the Supervisor and Assistant user views show more similarities in terms of data requirements. For example, only the Director and Manager user views require data on branches and newspapers whereas only the Supervisor and Assistant user views require data on property viewings. 
Based on this analysis, we use the centralized approach to first merge the requirements for the Director and Manager user views (given the collective name of Branch user views) and the requirements for the Supervisor and Assistant user views (given the collective name of Staff user views). We 10.4 Using Fact-Finding Techniques – A Worked Example Table 10.7 Cross-reference of user views with the main types of data used by each. branch staff property for rent owner client property viewing lease newspaper Director Manager X X X X X X X X X X X X X X Supervisor Assistant X X X X X X X X X X X then develop data models representing the Branch and Staff user views and then use the view integration approach to merge the two data models. Of course, for a simple case study like DreamHome, we could easily use the centralized approach for all user views but we will stay with our decision to create two collective user views so that we can describe and demonstrate how the view integration approach works in practice in Chapter 16. It is difficult to give precise rules as to when it is appropriate to use the centralized or view integration approaches. The decision should be based on an assessment of the complexity of the database system and the degree of overlap between the various user views. However, whether we use the centralized or view integration approach or a mixture of both to build the underlying database, ultimately we need to re-establish the original user views (namely Director, Manager, Supervisor, and Assistant) for the working database system. We describe and demonstrate the establishment of the user views for the database system in Chapter 17. All of the information gathered so far on each user view of the database system is described in a document called a users’ requirements specification. The users’ requirements specification describes the data requirements for each user view and examples of how the data is used by the user view. For ease of reference the users’ requirements specifications for the Branch and Staff user views of the DreamHome database system are given in Appendix A. In the remainder of this chapter, we present the general systems requirements for the DreamHome database system. The systems specification for the DreamHome database system The systems specification should list all the important features for the DreamHome database system. The types of features that should be described in the systems specification include: n n n initial database size; database rate of growth; the types and average number of record searches; | 337 338 | Chapter 10 z Fact-Finding Techniques n n n n n networking and shared access requirements; performance; security; backup and recovery; legal issues. Systems Requirements for DreamHome Database System Initial database size (1) There are approximately 2000 members of staff working at over 100 branches. There is an average of 20 and a maximum of 40 members of staff at each branch. (2) There are approximately 100,000 properties available at all branches. There is an average of 1000 and a maximum of 3000 properties at each branch. (3) There are approximately 60,000 property owners. There is an average of 600 and a maximum of 1000 property owners at each branch. (4) There are approximately 100,000 clients registered across all branches. There is an average of 1000 and a maximum of 1500 clients registered at each branch. (5) There are approximately 4,000,000 viewings across all branches. 
There is an average of 40,000 and a maximum of 100,000 viewings at each branch. (6) There are approximately 400,000 leases across all branches. There are an average of 4000 and a maximum of 10,000 leases at each branch. (7) There are approximately 50,000 newspaper adverts in 100 newspapers across all branches. Database rate of growth (1) Approximately 500 new properties and 200 new property owners are added to the database each month. (2) Once a property is no longer available for renting out, the corresponding record is deleted from the database. Approximately 100 records of properties are deleted each month. (3) If a property owner does not provide properties for rent at any time within a period of two years, his or her record is deleted. Approximately 100 property owner records are deleted each month. (4) Approximately 20 members of staff join and leave the company each month. The records of staff who have left the company are deleted after one year. Approximately 20 staff records are deleted each month. (5) Approximately 1000 new clients register at branches each month. If a client does not view or rent out a property at any time within a period of two years, his or her record is deleted. Approximately 100 client records are deleted each month. 10.4 Using Fact-Finding Techniques – A Worked Example (6) Approximately 5000 new viewings are recorded across all branches each day. The details of property viewings are deleted one year after the creation of the record. (7) Approximately 1000 new leases are recorded across all branches each month. The details of property leases are deleted two years after the creation of the record. (8) Approximately 1000 newspaper adverts are placed each week. The details of newspaper adverts are deleted one year after the creation of the record. The types and average number of record searches (1) Searching for the details of a branch – approximately 10 per day. (2) Searching for the details of a member of staff at a branch – approximately 20 per day. (3) Searching for the details of a given property – approximately 5000 per day (Monday to Thursday), approximately 10,000 per day (Friday and Saturday). Peak workloads are 12.00–14.00 and 17.00–19.00 daily. (4) Searching for the details of a property owner – approximately 100 per day. (5) Searching for the details of a client – approximately 1000 per day (Monday to Thursday), approximately 2000 per day (Friday and Saturday). Peak workloads are 12.00–14.00 and 17.00–19.00 daily. (6) Searching for the details of a property viewing – approximately 2000 per day (Monday to Thursday), approximately 5000 per day (Friday and Saturday). Peak workloads are 12.00–14.00 and 17.00–19.00 daily. (7) Searching for the details of a lease – approximately 1000 per day (Monday to Thursday), approximately 2000 per day (Friday and Saturday). Peak workloads are 12.00–14.00 and 17.00–19.00 daily. Networking and shared access requirements All branches should be securely networked to a centralized database located at DreamHome’s main office in Glasgow. The system should allow for at least two to three people concurrently accessing the system from each branch. Consideration needs to be given to the licensing requirements for this number of concurrent accesses. Performance (1) During opening hours but not during peak periods expect less than 1 second response for all single record searches. During peak periods expect less than 5 second response for each search. 
(2) During opening hours but not during peak periods expect less than 5 second response for each multiple record search. During peak periods expect less than 10 second response for each multiple record search. (3) During opening hours but not during peak periods expect less than 1 second response for each update/save. During peak periods expect less than 5 second response for each update/save. | 339 340 | Chapter 10 z Fact-Finding Techniques Security (1) The database should be password-protected. (2) Each member of staff should be assigned database access privileges appropriate to a particular user view, namely Director, Manager, Supervisor, or Assistant. (3) A member of staff should only see the data necessary to do his or her job in a form that suits what he or she is doing. Backup and Recovery The database should be backed up daily at 12 midnight. Legal Issues Each country has laws that govern the way that the computerized storage of personal data is handled. As the DreamHome database holds data on staff, clients, and property owners any legal issues that must be complied with should be investigated and implemented. 10.4.5 The DreamHome Case Study – Database Design In this chapter we demonstrated the creation of the users’ requirements specification for the Branch and Staff user views and the systems specification for the DreamHome database system. These documents are the sources of information for the next stage of the lifecycle called database design. In Chapters 15 to 18 we provide a step-by-step methodology for database design and use the DreamHome case study and the documents created for the DreamHome database system in this chapter to demonstrate the methodology in practice. Chapter Summary n Fact-finding is the formal process of using techniques such as interviews and questionnaires to collect facts about systems, requirements, and preferences. n Fact-finding is particularly crucial to the early stages of the database system development lifecycle including the database planning, system definition, and requirements collection and analysis stages. n The five most common fact-finding techniques are examining documentation, interviewing, observing the enterprise in operation, conducting research, and using questionnaires. n There are two main documents created during the requirements collection and analysis stage, namely the users’ requirements specification and the systems specification. n The users’ requirements specification describes in detail the data to be held in the database and how the data is to be used. n The systems specification describes any features to be included in the database system such as the performance and security requirements. Exercises | 341 Review Questions 10.1 Briefly describe what the process of factfinding attempts to achieve for a database developer. 10.2 Describe how fact-finding is used throughout the stages of the database system development lifecycle. 10.3 For each stage of the database system development lifecycle identify examples of the facts captured and the documentation produced. 10.4 A database developer normally uses several fact-finding techniques during a single database project. The five most commonly used techniques are examining documentation, interviewing, observing the business in operation, conducting research, and using questionnaires. Describe each fact-finding 10.5 10.6 10.7 10.8 technique and identify the advantages and disadvantages of each. 
Describe the purpose of defining a mission statement and mission objectives for a database system. What is the purpose of identifying the systems boundary for a database system? How do the contents of a users’ requirements specification differ from a systems specification? Describe one method of deciding whether to use either the centralized or view integration approach, or a combination of both when developing a database system with multiple user views. Exercises 10.9 Assume that you are developing a database system for your enterprise, whether it is a university (or college) or business (or department). Consider what fact-finding techniques you would use to identify the important facts needed to develop a database system. Identify the techniques that you would use for each stage of the database system development lifecycle. 10.10 Assume that you are developing a database system for the case studies described in Appendix B. Consider what fact-finding techniques you would use to identify the important facts needed to develop a database system. 10.11 Produce mission statements and mission objectives for the database systems described in the case studies given in Appendix B. 10.12 Produce a diagram to represent the scope and boundaries for the database systems described in the case studies given in Appendix B. 10.13 Identify the major user views for the database systems described in the case studies given in Appendix B. Chapter 11 Entity–Relationship Modeling Chapter Objectives In this chapter you will learn: n How to use Entity–Relationship (ER) modeling in database design. n The basic concepts associated with the Entity–Relationship (ER) model, namely entities, relationships, and attributes. n A diagrammatic technique for displaying an ER model using the Unified Modeling Language (UML). n How to identify and resolve problems with ER models called connection traps. In Chapter 10 we described the main techniques for gathering and capturing information about what the users require of a database system. Once the requirements collection and analysis stage of the database system development lifecycle is complete and we have documented the requirements for the database system, we are ready to begin the database design stage. One of the most difficult aspects of database design is the fact that designers, programmers, and end-users tend to view data and its use in different ways. Unfortunately, unless we gain a common understanding that reflects how the enterprise operates, the design we produce will fail to meet the users’ requirements. To ensure that we get a precise understanding of the nature of the data and how it is used by the enterprise, we need to have a model for communication that is non-technical and free of ambiguities. The Entity–Relationship (ER) model is one such example. ER modeling is a top-down approach to database design that begins by identifying the important data called entities and relationships between the data that must be represented in the model. We then add more details such as the information we want to hold about the entities and relationships called attributes and any constraints on the entities, relationships, and attributes. ER modeling is an important technique for any database designer to master and forms the basis of the methodology presented in this book. In this chapter we introduce the basic concepts of the ER model. 
Although there is general agreement about what each concept means, there are a number of different notations that can be used to represent each concept diagrammatically. We have chosen a diagrammatic notation that uses an increasingly popular object-oriented modeling language 11.1 Entity Types | called the Unified Modeling Language (UML) (Booch et al., 1999). UML is the successor to a number of object-oriented analysis and design methods introduced in the 1980s and 1990s. The Object Management Group (OMG) is currently looking at the standardization of UML and it is anticipated that UML will be the de facto standard modeling language in the near future. Although we use the UML notation for drawing ER models, we continue to describe the concepts of ER models using traditional database terminology. In Section 25.7 we will provide a fuller discussion on UML. We also include a summary of two alternative diagrammatic notations for ER models in Appendix F. In the next chapter we discuss the inherent problems associated with representing complex database applications using the basic concepts of the ER model. To overcome these problems, additional ‘semantic’ concepts were added to the original ER model resulting in the development of the Enhanced Entity–Relationship (EER) model. In Chapter 12 we describe the main concepts associated with the EER model called specialization/generalization, aggregation, and composition. We also demonstrate how to convert the ER model shown in Figure 11.1 into the EER model shown in Figure 12.8. Structure of this Chapter In Sections 11.1, 11.2, and 11.3 we introduce the basic concepts of the Entity–Relationship model, namely entities, relationships, and attributes. In each section we illustrate how the basic ER concepts are represented pictorially in an ER diagram using UML. In Section 11.4 we differentiate between weak and strong entities and in Section 11.5 we discuss how attributes normally associated with entities can be assigned to relationships. In Section 11.6 we describe the structural constraints associated with relationships. Finally, in Section 11.7 we identify potential problems associated with the design of an ER model called connection traps and demonstrate how these problems can be resolved. The ER diagram shown in Figure 11.1 is an example of one of the possible endproducts of ER modeling. This model represents the relationships between data described in the requirements specification for the Branch view of the DreamHome case study given in Appendix A. This figure is presented at the start of this chapter to show the reader an example of the type of model that we can build using ER modeling. At this stage, the reader should not be concerned about fully understanding this diagram, as the concepts and notation used in this figure are discussed in detail throughout this chapter. Entity Types Entity type A group of objects with the same properties, which are identified by the enterprise as having an independent existence. The basic concept of the ER model is the entity type, which represents a group of ‘objects’ in the ‘real world’ with the same properties. An entity type has an independent existence 11.1 343 344 | Chapter 11 z Entity–Relationship Modeling Figure 11.1 An Entity–Relationship (ER) diagram of the Branch view of DreamHome. 11.1 Entity Types | 345 Figure 11.2 Example of entities with a physical or conceptual existence. 
and can be objects with a physical (or ‘real’) existence or objects with a conceptual (or ‘abstract’) existence, as listed in Figure 11.2. Note that we are only able to give a working definition of an entity type as no strict formal definition exists. This means that different designers may identify different entities. Entity occurrence A uniquely identifiable object of an entity type. Each uniquely identifiable object of an entity type is referred to simply as an entity occurrence. Throughout this book, we use the terms ‘entity type’ or ‘entity occurrence’; however, we use the more general term ‘entity’ where the meaning is obvious. We identify each entity type by a name and a list of properties. A database normally contains many different entity types. Examples of entity types shown in Figure 11.1 include: Staff, Branch, PropertyForRent, and PrivateOwner. Diagrammatic representation of entity types Each entity type is shown as a rectangle labeled with the name of the entity, which is normally a singular noun. In UML, the first letter of each word in the entity name is upper case (for example, Staff and PropertyForRent). Figure 11.3 illustrates the diagrammatic representation of the Staff and Branch entity types. Figure 11.3 Diagrammatic representation of the Staff and Branch entity types. 346 | Chapter 11 z Entity–Relationship Modeling 11.2 Relationship Types Relationship type A set of meaningful associations among entity types. A relationship type is a set of associations between one or more participating entity types. Each relationship type is given a name that describes its function. An example of a relationship type shown in Figure 11.1 is the relationship called POwns, which associates the PrivateOwner and PropertyForRent entities. As with entity types and entities, it is necessary to distinguish between the terms ‘relationship type’ and ‘relationship occurrence’. Relationship occurrence A uniquely identifiable association, which includes one occurrence from each participating entity type. A relationship occurrence indicates the particular entity occurrences that are related. Throughout this book, we use the terms ‘relationship type’ or ‘relationship occurrence’. However, as with the term ‘entity’, we use the more general term ‘relationship’ when the meaning is obvious. Consider a relationship type called Has, which represents an association between Branch and Staff entities, that is Branch Has Staff. Each occurrence of the Has relationship associates one Branch entity occurrence with one Staff entity occurrence. We can examine examples of individual occurrences of the Has relationship using a semantic net. A semantic net is an object-level model, which uses the symbol • to represent entities and the symbol to represent relationships. The semantic net in Figure 11.4 shows three examples of the Has relationships (denoted r1, r2, and r3). Each relationship describes an association of a single Branch entity occurrence with a single Staff entity occurrence. Relationships are represented by lines that join each participating Branch entity with the associated Staff entity. For example, relationship r1 represents the association between Branch entity B003 and Staff entity SG37. Figure 11.4 A semantic net showing individual occurrences of the Has relationship type. 11.2 Relationship Types | 347 Figure 11.5 A diagrammatic representation of Branch Has Staff relationship type. 
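Although the ER model is deliberately independent of any particular DBMS, it can help to glimpse where the model is heading. Purely as a hedged illustration (the book's own mapping from ER models to tables belongs to the design methodology of later chapters, and the column types shown here are assumptions rather than part of the case study), the Branch and Staff entity types and one occurrence of the Has relationship might eventually surface in SQL along the following lines, with each row corresponding to one entity occurrence identified by its primary key value:

-- Illustrative sketch only: entity occurrences become rows, identified by primary key values.
CREATE TABLE Branch (
    branchNo  VARCHAR(4) PRIMARY KEY      -- e.g. 'B003'; other Branch attributes would follow
);

CREATE TABLE Staff (
    staffNo   VARCHAR(5) PRIMARY KEY,                 -- e.g. 'SG37'
    branchNo  VARCHAR(4) REFERENCES Branch(branchNo)  -- each Staff row records one Branch Has Staff occurrence
);

-- One occurrence of Branch Has Staff, corresponding to r1 in Figure 11.4
INSERT INTO Branch (branchNo) VALUES ('B003');
INSERT INTO Staff  (staffNo, branchNo) VALUES ('SG37', 'B003');

The value of the ER notation is that such implementation decisions are deferred: at this stage we record only that Branch and Staff are entity types associated through the Has relationship type.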
Note that in Figure 11.4 we represent each Branch and Staff entity occurrence using the values of the primary key attributes, namely branchNo and staffNo. Primary key attributes uniquely identify each entity occurrence and are discussed in detail in the following section.

If we represented an enterprise using semantic nets, it would be difficult to understand due to the level of detail. We can more easily represent the relationships between entities in an enterprise using the concepts of the Entity–Relationship (ER) model. The ER model uses a higher level of abstraction than the semantic net by combining sets of entity occurrences into entity types and sets of relationship occurrences into relationship types.

Diagrammatic representation of relationship types

Each relationship type is shown as a line connecting the associated entity types, labeled with the name of the relationship. Normally, a relationship is named using a verb (for example, Supervises or Manages) or a short phrase including a verb (for example, LeasedBy). Again, the first letter of each word in the relationship name is shown in upper case. Whenever possible, a relationship name should be unique for a given ER model. A relationship is only labeled in one direction, which normally means that the name of the relationship only makes sense in one direction (for example, Branch Has Staff makes more sense than Staff Has Branch). So once the relationship name is chosen, an arrow symbol is placed beside the name indicating the correct direction for a reader to interpret the relationship name (for example, Branch Has Staff) as shown in Figure 11.5.

11.2.1 Degree of Relationship Type

Degree of a relationship type: The number of participating entity types in a relationship.

The entities involved in a particular relationship type are referred to as participants in that relationship. The number of participants in a relationship type is called the degree of that relationship. Therefore, the degree of a relationship indicates the number of entity types involved in a relationship.

Figure 11.6 An example of a binary relationship called POwns.

A relationship of degree two is called binary. An example of a binary relationship is the Has relationship shown in Figure 11.5 with two participating entity types, namely Staff and Branch. A second example of a binary relationship is the POwns relationship shown in Figure 11.6 with two participating entity types, namely PrivateOwner and PropertyForRent. The Has and POwns relationships are also shown in Figure 11.1, as well as other examples of binary relationships. In fact, the most common degree for a relationship is binary, as demonstrated in this figure.

A relationship of degree three is called ternary. An example of a ternary relationship is Registers with three participating entity types, namely Staff, Branch, and Client. This relationship represents the registration of a client by a member of staff at a branch. The term 'complex relationship' is used to describe relationships with degrees higher than binary.

Diagrammatic representation of complex relationships

The UML notation uses a diamond to represent relationships with degrees higher than binary. The name of the relationship is displayed inside the diamond and in this case the directional arrow normally associated with the name is omitted. For example, the ternary relationship called Registers is shown in Figure 11.7. This relationship is also shown in Figure 11.1.
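How a ternary relationship such as Registers is ultimately implemented is again a matter for the later design chapters. As a hedged sketch only (the table name Registration is hypothetical, the column types are assumptions, and the multiplicity constraints discussed in Section 11.6 are ignored for now), one common relational representation stores one row per relationship occurrence, with a reference to each of the three participating entity types:

-- Hypothetical sketch, assuming Staff, Branch, and Client tables keyed on staffNo, branchNo, and clientNo.
-- Each row records one Registers relationship occurrence linking one Staff, one Branch, and one Client.
CREATE TABLE Registration (
    staffNo   VARCHAR(5) NOT NULL REFERENCES Staff(staffNo),
    branchNo  VARCHAR(4) NOT NULL REFERENCES Branch(branchNo),
    clientNo  VARCHAR(5) NOT NULL REFERENCES Client(clientNo),
    PRIMARY KEY (staffNo, branchNo, clientNo)   -- no single attribute identifies an occurrence on its own
);

Each row of such a table corresponds to one relationship occurrence, which is exactly what the diamond notation in Figure 11.7 denotes at the type level.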
A relationship of degree four is called quaternary. As we do not have an example of such a relationship in Figure 11.1, we describe a quaternary relationship called Arranges with four participating entity types, namely Buyer, Solicitor, FinancialInstitution, and Bid in Figure 11.8. This relationship represents the situation where a buyer, advised by a solicitor and supported by a financial institution, places a bid. Figure 11.7 An example of a ternary relationship called Registers. 11.2 Relationship Types | 349 Figure 11.8 An example of a quaternary relationship called Arranges. Recursive Relationship Recursive relationship 11.2.2 A relationship type where the same entity type participates more than once in different roles. Consider a recursive relationship called Supervises, which represents an association of staff with a Supervisor where the Supervisor is also a member of staff. In other words, the Staff entity type participates twice in the Supervises relationship; the first participation as a Supervisor, and the second participation as a member of staff who is supervised (Supervisee). Recursive relationships are sometimes called unary relationships. Relationships may be given role names to indicate the purpose that each participating entity type plays in a relationship. Role names can be important for recursive relationships to determine the function of each participant. The use of role names to describe the Supervises recursive relationship is shown in Figure 11.9. The first participation of the Staff entity type in the Supervises relationship is given the role name ‘Supervisor’ and the second participation is given the role name ‘Supervisee’. Figure 11.9 An example of a recursive relationship called Supervises with role names Supervisor and Supervisee. 350 | Chapter 11 z Entity–Relationship Modeling Figure 11.10 An example of entities associated through two distinct relationships called Manages and Has with role names. Role names may also be used when two entities are associated through more than one relationship. For example, the Staff and Branch entity types are associated through two distinct relationships called Manages and Has. As shown in Figure 11.10, the use of role names clarifies the purpose of each relationship. For example, in the case of Staff Manages Branch, a member of staff (Staff entity) given the role name ‘Manager’ manages a branch (Branch entity) given the role name ‘Branch Office’. Similarly, for Branch Has Staff, a branch, given the role name ‘Branch Office’ has staff given the role name ‘Member of Staff’. Role names are usually not required if the function of the participating entities in a relationship is unambiguous. 11.3 Attributes Attribute A property of an entity or a relationship type. The particular properties of entity types are called attributes. For example, a Staff entity type may be described by the staffNo, name, position, and salary attributes. The attributes hold values that describe each entity occurrence and represent the main part of the data stored in the database. A relationship type that associates entities can also have attributes similar to those of an entity type but we defer discussion of this until Section 11.5. In this section, we concentrate on the general characteristics of attributes. Attribute domain The set of allowable values for one or more attributes. 11.3 Attributes | Each attribute is associated with a set of values called a domain. 
The domain defines the potential values that an attribute may hold and is similar to the domain concept in the relational model (see Section 3.2). For example, the number of rooms associated with a property is between 1 and 15 for each entity occurrence. We therefore define the set of values for the number of rooms (rooms) attribute of the PropertyForRent entity type as the set of integers between 1 and 15. Attributes may share a domain. For example, the address attributes of the Branch, PrivateOwner, and BusinessOwner entity types share the same domain of all possible addresses. Domains can also be composed of domains. For example, the domain for the address attribute of the Branch entity is made up of subdomains: street, city, and postcode. The domain of the name attribute is more difficult to define, as it consists of all possible names. It is certainly a character string, but it might consist not only of letters but also of hyphens or other special characters. A fully developed data model includes the domains of each attribute in the ER model. As we now explain, attributes can be classified as being: simple or composite; singlevalued or multi-valued; or derived. Simple and Composite Attributes Simple attribute 11.3.1 An attribute composed of a single component with an independent existence. Simple attributes cannot be further subdivided into smaller components. Examples of simple attributes include position and salary of the Staff entity. Simple attributes are sometimes called atomic attributes. Composite attribute An attribute composed of multiple components, each with an independent existence. Some attributes can be further divided to yield smaller components with an independent existence of their own. For example, the address attribute of the Branch entity with the value (163 Main St, Glasgow, G11 9QX) can be subdivided into street (163 Main St), city (Glasgow), and postcode (G11 9QX) attributes. The decision to model the address attribute as a simple attribute or to subdivide the attribute into street, city, and postcode is dependent on whether the user view of the data refers to the address attribute as a single unit or as individual components. Single-Valued and Multi-Valued Attributes Single-valued attribute An attribute that holds a single value for each occurrence of an entity type. 11.3.2 351 352 | Chapter 11 z Entity–Relationship Modeling The majority of attributes are single-valued. For example, each occurrence of the Branch entity type has a single value for the branch number (branchNo) attribute (for example B003), and therefore the branchNo attribute is referred to as being single-valued. Multi-valued attribute An attribute that holds multiple values for each occurrence of an entity type. Some attributes have multiple values for each entity occurrence. For example, each occurrence of the Branch entity type can have multiple values for the telNo attribute (for example, branch number B003 has telephone numbers 0141-339-2178 and 0141-339-4439) and therefore the telNo attribute in this case is multi-valued. A multi-valued attribute may have a set of numbers with upper and lower limits. For example, the telNo attribute of the Branch entity type has between one and three values. In other words, a branch may have a minimum of a single telephone number to a maximum of three telephone numbers. 
11.3.3 Derived Attributes Derived attribute An attribute that represents a value that is derivable from the value of a related attribute or set of attributes, not necessarily in the same entity type. The values held by some attributes may be derived. For example, the value for the duration attribute of the Lease entity is calculated from the rentStart and rentFinish attributes also of the Lease entity type. We refer to the duration attribute as a derived attribute, the value of which is derived from the rentStart and rentFinish attributes. In some cases, the value of an attribute is derived from the entity occurrences in the same entity type. For example, the total number of staff (totalStaff) attribute of the Staff entity type can be calculated by counting the total number of Staff entity occurrences. Derived attributes may also involve the association of attributes of different entity types. For example, consider an attribute called deposit of the Lease entity type. The value of the deposit attribute is calculated as twice the monthly rent for a property. Therefore, the value of the deposit attribute of the Lease entity type is derived from the rent attribute of the PropertyForRent entity type. 11.3.4 Keys Candidate key The minimal set of attributes that uniquely identifies each occurrence of an entity type. A candidate key is the minimal number of attributes, whose value(s) uniquely identify each entity occurrence. For example, the branch number (branchNo) attribute is the candidate 11.3 Attributes key for the Branch entity type, and has a distinct value for each branch entity occurrence. The candidate key must hold values that are unique for every occurrence of an entity type. This implies that a candidate key cannot contain a null (see Section 3.2). For example, each branch has a unique branch number (for example, B003), and there will never be more than one branch with the same branch number. Primary key The candidate key that is selected to uniquely identify each occurrence of an entity type. An entity type may have more than one candidate key. For the purposes of discussion consider that a member of staff has a unique company-defined staff number (staffNo) and also a unique National Insurance Number (NIN) that is used by the Government. We therefore have two candidate keys for the Staff entity, one of which must be selected as the primary key. The choice of primary key for an entity is based on considerations of attribute length, the minimal number of attributes required, and the future certainty of uniqueness. For example, the company-defined staff number contains a maximum of five characters (for example, SG14) while the NIN contains a maximum of nine characters (for example, WL220658D). Therefore, we select staffNo as the primary key of the Staff entity type and NIN is then referred to as the alternate key. Composite key A candidate key that consists of two or more attributes. In some cases, the key of an entity type is composed of several attributes, whose values together are unique for each entity occurrence but not separately. For example, consider an entity called Advert with propertyNo (property number), newspaperName, dateAdvert, and cost attributes. Many properties are advertised in many newspapers on a given date. To uniquely identify each occurrence of the Advert entity type requires values for the propertyNo, newspaperName, and dateAdvert attributes. 
Thus, the Advert entity type has a composite primary key made up of the propertyNo, newspaperName, and dateAdvert attributes. Diagrammatic representation of attributes If an entity type is to be displayed with its attributes, we divide the rectangle representing the entity in two. The upper part of the rectangle displays the name of the entity and the lower part lists the names of the attributes. For example, Figure 11.11 shows the ER diagram for the Staff and Branch entity types and their associated attributes. The first attribute(s) to be listed is the primary key for the entity type, if known. The name(s) of the primary key attribute(s) can be labeled with the tag {PK}. In UML, the name of an attribute is displayed with the first letter in lower case and, if the name has more than one word, with the first letter of each subsequent word in upper case (for example, address and telNo). Additional tags that can be used include partial primary key {PPK} when an attribute forms part of a composite primary key, and alternate key {AK}. As | 353 354 | Chapter 11 z Entity–Relationship Modeling Figure 11.11 Diagrammatic representation of Staff and Branch entities and their attributes. shown in Figure 11.11, the primary key of the Staff entity type is the staffNo attribute and the primary key of the Branch entity type is the branchNo attribute. For some simpler database systems, it is possible to show all the attributes for each entity type in the ER diagram. However, for more complex database systems, we just display the attribute, or attributes, that form the primary key of each entity type. When only the primary key attributes are shown in the ER diagram, we can omit the {PK} tag. For simple, single-valued attributes, there is no need to use tags and so we simply display the attribute names in a list below the entity name. For composite attributes, we list the name of the composite attribute followed below and indented to the right by the names of its simple component attributes. For example, in Figure 11.11 the composite attribute address of the Branch entity is shown, followed below by the names of its component attributes, street, city, and postcode. For multi-valued attributes, we label the attribute name with an indication of the range of values available for the attribute. For example, if we label the telNo attribute with the range [1..*], this means that the values for the telNo attribute is one or more. If we know the precise maximum number of values, we can label the attribute with an exact range. For example, if the telNo attribute holds one to a maximum of three values, we can label the attribute with [1..3]. For derived attributes, we prefix the attribute name with a ‘/’. For example, the derived attribute of the Staff entity type is shown in Figure 11.11 as /totalStaff. 11.4 Strong and Weak Entity Types We can classify entity types as being strong or weak. Strong entity type An entity type that is not existence-dependent on some other entity type. 11.5 Attributes on Relationships | 355 Figure 11.12 A strong entity type called Client and a weak entity type called Preference. An entity type is referred to as being strong if its existence does not depend upon the existence of another entity type. Examples of strong entities are shown in Figure 11.1 and include the Staff, Branch, PropertyForRent, and Client entities. A characteristic of a strong entity type is that each entity occurrence is uniquely identifiable using the primary key attribute(s) of that entity type. 
For example, we can uniquely identify each member of staff using the staffNo attribute, which is the primary key for the Staff entity type. Weak entity type An entity type that is existence-dependent on some other entity type. A weak entity type is dependent on the existence of another entity type. An example of a weak entity type called Preference is shown in Figure 11.12. A characteristic of a weak entity is that each entity occurrence cannot be uniquely identified using only the attributes associated with that entity type. For example, note that there is no primary key for the Preference entity. This means that we cannot identify each occurrence of the Preference entity type using only the attributes of this entity. We can only uniquely identify each preference through the relationship that a preference has with a client who is uniquely identifiable using the primary key for the Client entity type, namely clientNo. In this example, the Preference entity is described as having existence dependency for the Client entity, which is referred to as being the owner entity. Weak entity types are sometimes referred to as child, dependent, or subordinate entities and strong entity types as parent, owner, or dominant entities. Attributes on Relationships As we mentioned in Section 11.3, attributes can also be assigned to relationships. For example, consider the relationship Advertises, which associates the Newspaper and PropertyForRent entity types as shown in Figure 11.1. To record the date the property was advertised and the cost, we associate this information with the Advertises relationship as attributes called dateAdvert and cost, rather than with the Newspaper or the PropertyForRent entities. 11.5 356 | Chapter 11 z Entity–Relationship Modeling Figure 11.13 An example of a relationship called Advertises with attributes dateAdvert and cost. Diagrammatic representation of attributes on relationships We represent attributes associated with a relationship type using the same symbol as an entity type. However, to distinguish between a relationship with an attribute and an entity, the rectangle representing the attribute(s) is associated with the relationship using a dashed line. For example, Figure 11.13 shows the Advertises relationship with the attributes dateAdvert and cost. A second example shown in Figure 11.1 is the Manages relationship with the mgrStartDate and bonus attributes. The presence of one or more attributes assigned to a relationship may indicate that the relationship conceals an unidentified entity type. For example, the presence of the dateAdvert and cost attributes on the Advertises relationship indicates the presence of an entity called Advert. 11.6 Structural Constraints We now examine the constraints that may be placed on entity types that participate in a relationship. The constraints should reflect the restrictions on the relationships as perceived in the ‘real world’. Examples of such constraints include the requirements that a property for rent must have an owner and each branch must have staff. The main type of constraint on relationships is called multiplicity. Multiplicity The number (or range) of possible occurrences of an entity type that may relate to a single occurrence of an associated entity type through a particular relationship. Multiplicity constrains the way that entities are related. It is a representation of the policies (or business rules) established by the user or enterprise. 
Ensuring that all appropriate constraints are identified and represented is an important part of modeling an enterprise. As we mentioned earlier, the most common degree for relationships is binary. Binary relationships are generally referred to as being one-to-one (1:1), one-to-many (1:*), or 11.6 Structural Constraints | 357 many-to-many (*:*). We examine these three types of relationships using the following integrity constraints: n n n a member of staff manages a branch (1:1); a member of staff oversees properties for rent (1:*); newspapers advertise properties for rent (*:*). In Sections 11.6.1, 11.6.2, and 11.6.3 we demonstrate how to determine the multiplicity for each of these constraints and show how to represent each in an ER diagram. In Section 11.6.4 we examine multiplicity for relationships of degrees higher than binary. It is important to note that not all integrity constraints can be easily represented in an ER model. For example, the requirement that a member of staff receives an additional day’s holiday for every year of employment with the enterprise may be difficult to represent in an ER model. One-to-One (1:1) Relationships 11.6.1 Consider the relationship Manages, which relates the Staff and Branch entity types. Figure 11.14(a) displays two occurrences of the Manages relationship type (denoted r1 and r2) using a semantic net. Each relationship (rn) represents the association between a single Staff entity occurrence and a single Branch entity occurrence. We represent each entity occurrence using the values for the primary key attributes of the Staff and Branch entities, namely staffNo and branchNo. Determining the multiplicity Determining the multiplicity normally requires examining the precise relationships between the data given in a enterprise constraint using sample data. The sample data may be obtained by examining filled-in forms or reports and, if possible, from discussion with users. However, it is important to stress that to reach the right conclusions about a constraint requires that the sample data examined or discussed is a true representation of all the data being modeled. Figure 11.14(a) Semantic net showing two occurrences of the Staff Manages Branch relationship type. 358 | Chapter 11 z Entity–Relationship Modeling Figure 11.14(b) The multiplicity of the Staff Manages Branch one-to-one (1:1) relationship. In Figure 11.14(a) we see that staffNo SG5 manages branchNo B003 and staffNo SL21 manages branchNo B005, but staffNo SG37 does not manage any branch. In other words, a member of staff can manage zero or one branch and each branch is managed by one member of staff. As there is a maximum of one branch for each member of staff involved in this relationship and a maximum of one member of staff for each branch, we refer to this type of relationship as one-to-one, which we usually abbreviate as (1:1). Diagrammatic representation of 1:1 relationships An ER diagram of the Staff Manages Branch relationship is shown in Figure 11.14(b). To represent that a member of staff can manage zero or one branch, we place a ‘0..1’ beside the Branch entity. To represent that a branch always has one manager, we place a ‘1..1’ beside the Staff entity. (Note that for a 1:1 relationship, we may choose a relationship name that makes sense in either direction.) 11.6.2 One-to-Many (1:*) Relationships Consider the relationship Oversees, which relates the Staff and PropertyForRent entity types. 
Figure 11.15(a) displays three occurrences of the Staff Oversees PropertyForRent relationship type (denoted r1, r2, and r3) using a semantic net. Each relationship (rn) represents the association between a single Staff entity occurrence and a single PropertyForRent entity occurrence. We represent each entity occurrence using the values for the primary key attributes of the Staff and PropertyForRent entities, namely staffNo and propertyNo.

Figure 11.15(a) Semantic net showing three occurrences of the Staff Oversees PropertyForRent relationship type.
Figure 11.15(b) The multiplicity of the Staff Oversees PropertyForRent one-to-many (1:*) relationship type.

Determining the multiplicity

In Figure 11.15(a) we see that staffNo SG37 oversees propertyNos PG21 and PG36, and staffNo SA9 oversees propertyNo PA14, but staffNo SG5 does not oversee any properties for rent and propertyNo PG4 is not overseen by any member of staff. In summary, a member of staff can oversee zero or more properties for rent and a property for rent is overseen by zero or one member of staff. Therefore, for members of staff participating in this relationship there are many properties for rent, and for properties participating in this relationship there is a maximum of one member of staff. We refer to this type of relationship as one-to-many, which we usually abbreviate as (1:*).

Diagrammatic representation of 1:* relationships

An ER diagram of the Staff Oversees PropertyForRent relationship is shown in Figure 11.15(b). To represent that a member of staff can oversee zero or more properties for rent, we place a '0..*' beside the PropertyForRent entity. To represent that each property for rent is overseen by zero or one member of staff, we place a '0..1' beside the Staff entity. (Note that with 1:* relationships, we choose a relationship name that makes sense in the 1:* direction.) If we know the actual minimum and maximum values for the multiplicity, we can display these instead. For example, if a member of staff oversees a minimum of zero and a maximum of 100 properties for rent, we can replace the '0..*' with '0..100'.
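Enforcing multiplicity belongs to physical design rather than to ER modeling itself, but it is worth noting, as a hedged sketch, how the 0..1 side of Staff Oversees PropertyForRent is commonly realized: a single, optional (nullable) foreign key on the PropertyForRent table. The column names below follow the case study, while the types are assumptions and the Staff table is taken to be the one sketched earlier:

-- Illustrative sketch: each property is overseen by zero or one member of staff,
-- so PropertyForRent carries one optional (nullable) reference to Staff.
CREATE TABLE PropertyForRent (
    propertyNo  VARCHAR(5) PRIMARY KEY,               -- e.g. 'PG21'
    staffNo     VARCHAR(5) REFERENCES Staff(staffNo)  -- NULL when no member of staff oversees the property (e.g. PG4)
);

The 0..* upper bound on the Staff side needs no extra machinery, whereas a precise bound such as 0..100 would typically call for an additional check or application-level rule.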
11.6.3 Many-to-Many (*:*) Relationships

Consider the relationship Advertises, which relates the Newspaper and PropertyForRent entity types. Figure 11.16(a) displays four occurrences of the Advertises relationship (denoted r1, r2, r3, and r4) using a semantic net. Each relationship (rn) represents the association between a single Newspaper entity occurrence and a single PropertyForRent entity occurrence. We represent each entity occurrence using the values for the primary key attributes of the Newspaper and PropertyForRent entity types, namely newspaperName and propertyNo.

Figure 11.16(a) Semantic net showing four occurrences of the Newspaper Advertises PropertyForRent relationship type.

Determining the multiplicity

In Figure 11.16(a) we see that the Glasgow Daily advertises propertyNos PG21 and PG36, The West News also advertises propertyNo PG36, and the Aberdeen Express advertises propertyNo PA14. However, propertyNo PG4 is not advertised in any newspaper. In other words, one newspaper advertises one or more properties for rent and one property for rent is advertised in zero or more newspapers. Therefore, for newspapers there are many properties for rent, and for each property for rent participating in this relationship there are many newspapers. We refer to this type of relationship as many-to-many, which we usually abbreviate as (*:*).

Diagrammatic representation of *:* relationships

An ER diagram of the Newspaper Advertises PropertyForRent relationship is shown in Figure 11.16(b). To represent that each newspaper can advertise one or more properties for rent, we place a '1..*' beside the PropertyForRent entity type. To represent that each property for rent can be advertised by zero or more newspapers, we place a '0..*' beside the Newspaper entity. (Note that for a *:* relationship, we may choose a relationship name that makes sense in either direction.)

Figure 11.16(b) The multiplicity of the Newspaper Advertises PropertyForRent many-to-many (*:*) relationship.

11.6.4 Multiplicity for Complex Relationships

Multiplicity for complex relationships, that is those higher than binary, is slightly more complex.

Multiplicity (complex relationship): The number (or range) of possible occurrences of an entity type in an n-ary relationship when the other (n−1) values are fixed.

In general, the multiplicity for n-ary relationships represents the potential number of entity occurrences in the relationship when (n−1) values are fixed for the other participating entity types. For example, the multiplicity for a ternary relationship represents the potential range of entity occurrences of a particular entity in the relationship when the other two values representing the other two entities are fixed.

Consider the ternary Registers relationship between Staff, Branch, and Client shown in Figure 11.7. Figure 11.17(a) displays five occurrences of the Registers relationship (denoted r1 to r5) using a semantic net. Each relationship (rn) represents the association of a single Staff entity occurrence, a single Branch entity occurrence, and a single Client entity occurrence. We represent each entity occurrence using the values for the primary key attributes of the Staff, Branch, and Client entities, namely staffNo, branchNo, and clientNo. In Figure 11.17(a) we examine the Registers relationship when the values for the Staff and Branch entities are fixed.

Figure 11.17(a) Semantic net showing five occurrences of the ternary Registers relationship with values for Staff and Branch entity types fixed.
Figure 11.17(b) The multiplicity of the ternary Registers relationship.

Table 11.1 A summary of ways to represent multiplicity constraints.

Alternative ways to represent multiplicity constraints    Meaning
0..1                  Zero or one entity occurrence
1..1 (or just 1)      Exactly one entity occurrence
0..* (or just *)      Zero or many entity occurrences
1..*                  One or many entity occurrences
5..10                 Minimum of 5 up to a maximum of 10 entity occurrences
0, 3, 6–8             Zero or three or six, seven, or eight entity occurrences

Determining the multiplicity

In Figure 11.17(a) with the staffNo/branchNo values fixed there are zero or more clientNo values. For example, staffNo SG37 at branchNo B003 registers clientNo CR56 and CR74, and staffNo SG14 at branchNo B003 registers clientNo CR62, CR84, and CR91. However, SG5 at branchNo B003 registers no clients. In other words, when the staffNo and branchNo values are fixed the corresponding clientNo values are zero or more. Therefore, the multiplicity of the Registers relationship from the perspective of the Staff and Branch entities is 0..*, which is represented in the ER diagram by placing the 0..* beside the Client entity.
If we repeat this test we find that the multiplicity when Staff/Client values are fixed is 1..1, which is placed beside the Branch entity and the Client/Branch values are fixed is 1..1, which is placed beside the Staff entity. An ER diagram of the ternary Registers relationship showing multiplicity is in Figure 11.17(b). A summary of the possible ways that multiplicity constraints can be represented along with a description of the meaning is shown in Table 11.1. 11.6.5 Cardinality and Participation Constraints Multiplicity actually consists of two separate constraints known as cardinality and participation. 11.6 Structural Constraints | 363 Figure 11.18 Multiplicity described as cardinality and participation constraints for the Staff Manages Branch (1:1) relationship. Cardinality Describes the maximum number of possible relationship occurrences for an entity participating in a given relationship type. The cardinality of a binary relationship is what we previously referred to as a one-toone (1:1), one-to-many (1:*), and many-to-many (*:*). The cardinality of a relationship appears as the maximum values for the multiplicity ranges on either side of the relationship. For example, the Manages relationship shown in Figure 11.18 has a one-to-one (1:1) cardinality and this is represented by multiplicity ranges with a maximum value of 1 on both sides of the relationship. Participation Determines whether all or only some entity occurrences participate in a relationship. The participation constraint represents whether all entity occurrences are involved in a particular relationship (referred to as mandatory participation) or only some (referred to as optional participation). The participation of entities in a relationship appears as the minimum values for the multiplicity ranges on either side of the relationship. Optional participation is represented as a minimum value of 0 while mandatory participation is shown as a minimum value of 1. It is important to note that the participation for a given entity in a relationship is represented by the minimum value on the opposite side of the relationship; that is the minimum value for the multiplicity beside the related entity. For example, in Figure 11.18, the optional participation for the Staff entity in the Manages relationship is shown as a minimum value of 0 for the multiplicity beside the Branch entity and the mandatory participation for the Branch entity in the Manages relationship is shown as a minimum value of 1 for the multiplicity beside the Staff entity. 364 | Chapter 11 z Entity–Relationship Modeling A summary of the conventions introduced in this section to represent the basic concepts of the ER model is shown on the inside front cover of this book. 11.7 Problems with ER Models In this section we examine problems that may arise when creating an ER model. These problems are referred to as connection traps, and normally occur due to a misinterpretation of the meaning of certain relationships (Howe, 1989). We examine two main types of connection traps, called fan traps and chasm traps, and illustrate how to identify and resolve such problems in ER models. In general, to identify connection traps we must ensure that the meaning of a relationship is fully understood and clearly defined. If we do not understand the relationships we may create a model that is not a true representation of the ‘real world’. 
11.7.1 Fan Traps Fan trap Where a model represents a relationship between entity types, but the pathway between certain entity occurrences is ambiguous. A fan trap may exist where two or more 1:* relationships fan out from the same entity. A potential fan trap is illustrated in Figure 11.19(a), which shows two 1:* relationships (Has and Operates) emanating from the same entity called Division. This model represents the facts that a single division operates one or more branches and has one or more staff. However, a problem arises when we want to know which members Figure 11.19(a) An example of a fan trap. Figure 11.19(b) The semantic net of the ER model shown in Figure 11.19(a). 11.7 Problems with ER Models | 365 Figure 11.20(a) The ER model shown in Figure 11.19(a) restructured to remove the fan trap. Figure 11.20(b) The semantic net of the ER model shown in Figure 11.20(a). of staff work at a particular branch. To appreciate the problem, we examine some occurrences of the Has and Operates relationships using values for the primary key attributes of the Staff, Division, and Branch entity types as shown in Figure 11.19(b). If we attempt to answer the question: ‘At which branch does staff number SG37 work?’ we are unable to give a specific answer based on the current structure. We can only determine that staff number SG37 works at Branch B003 or B007. The inability to answer this question specifically is the result of a fan trap associated with the misrepresentation of the correct relationships between the Staff, Division, and Branch entities. We resolve this fan trap by restructuring the original ER model to represent the correct association between these entities, as shown in Figure 11.20(a). If we now examine occurrences of the Operates and Has relationships as shown in Figure 11.20(b), we are now in a position to answer the type of question posed earlier. From this semantic net model, we can determine that staff number SG37 works at branch number B003, which is part of division D1. Chasm Traps Chasm trap Where a model suggests the existence of a relationship between entity types, but the pathway does not exist between certain entity occurrences. 11.7.2 366 | Chapter 11 z Entity–Relationship Modeling Figure 11.21(a) An example of a chasm trap. Figure 11.21(b) The semantic net of the ER model shown in Figure 11.21(a). A chasm trap may occur where there are one or more relationships with a minimum multiplicity of zero (that is optional participation) forming part of the pathway between related entities. A potential chasm trap is illustrated in Figure 11.21(a), which shows relationships between the Branch, Staff, and PropertyForRent entities. This model represents the facts that a single branch has one or more staff who oversee zero or more properties for rent. We also note that not all staff oversee property, and not all properties are overseen by a member of staff. A problem arises when we want to know which properties are available at each branch. To appreciate the problem, we examine some occurrences of the Has and Oversees relationships using values for the primary key attributes of the Branch, Staff, and PropertyForRent entity types as shown in Figure 11.21(b). If we attempt to answer the question: ‘At which branch is property number PA14 available?’ we are unable to answer this question, as this property is not yet allocated to a member of staff working at a branch. 
The inability to answer this question is considered to be a loss of information (as we know a property must be available at a branch), and is the result of a chasm trap. The multiplicity of both the Staff and PropertyForRent entities in the Oversees relationship has a minimum value of zero, which means that some properties cannot be associated with a branch through a member of staff. Therefore to solve this problem, we need to identify the missing relationship, which in this case is the Offers relationship between the Branch and PropertyForRent entities. The ER model shown in Figure 11.22(a) represents the true association between these entities. This model ensures that, at all times, the properties associated with each branch are known, including properties that are not yet allocated to a member of staff. If we now examine occurrences of the Has, Oversees, and Offers relationship types, as shown in Figure 11.22(b), we are now able to determine that property number PA14 is available at branch number B007. 11.7 Problems with ER Models | 367 Figure 11.22(a) The ER model shown in Figure 11.21(a) restructured to remove the chasm trap. Figure 11.22(b) The semantic net of the ER model shown in Figure 11.22(a). 368 | Chapter 11 z Entity–Relationship Modeling Chapter Summary n n n n n n n n n n n n n n n n n n n n n An entity type is a group of objects with the same properties, which are identified by the enterprise as having an independent existence. An entity occurrence is a uniquely identifiable object of an entity type. A relationship type is a set of meaningful associations among entity types. A relationship occurrence is a uniquely identifiable association, which includes one occurrence from each participating entity type. The degree of a relationship type is the number of participating entity types in a relationship. A recursive relationship is a relationship type where the same entity type participates more than once in different roles. An attribute is a property of an entity or a relationship type. An attribute domain is the set of allowable values for one or more attributes. A simple attribute is composed of a single component with an independent existence. A composite attribute is composed of multiple components each with an independent existence. A single-valued attribute holds a single value for each occurrence of an entity type. A multi-valued attribute holds multiple values for each occurrence of an entity type. A derived attribute represents a value that is derivable from the value of a related attribute or set of attributes, not necessarily in the same entity. A candidate key is the minimal set of attributes that uniquely identifies each occurrence of an entity type. A primary key is the candidate key that is selected to uniquely identify each occurrence of an entity type. A composite key is a candidate key that consists of two or more attributes. A strong entity type is not existence-dependent on some other entity type. A weak entity type is existencedependent on some other entity type. Multiplicity is the number (or range) of possible occurrences of an entity type that may relate to a single occurrence of an associated entity type through a particular relationship. Multiplicity for a complex relationship is the number (or range) of possible occurrences of an entity type in an n-ary relationship when the other (n−1) values are fixed. Cardinality describes the maximum number of possible relationship occurrences for an entity participating in a given relationship type. 
Participation determines whether all or only some entity occurrences participate in a given relationship. A fan trap exists where a model represents a relationship between entity types, but the pathway between certain entity occurrences is ambiguous. A chasm trap exists where a model suggests the existence of a relationship between entity types, but the pathway does not exist between certain entity occurrences. Exercises | 369 Review Questions 11.1 Describe what entity types represent in an ER model and provide examples of entities with a physical or conceptual existence. 11.2 Describe what relationship types represent in an ER model and provide examples of unary, binary, ternary, and quaternary relationships. 11.3 Describe what attributes represent in an ER model and provide examples of simple, composite, single-valued, multi-valued, and derived attributes. 11.4 Describe what the multiplicity constraint represents for a relationship type. 11.5 What are integrity constraints and how does multiplicity model these constraints? 11.6 How does multiplicity represent both the cardinality and the participation constraints on a relationship type? 11.7 Provide an example of a relationship type with attributes. 11.8 Describe how strong and weak entity types differ and provide an example of each. 11.9 Describe how fan and chasm traps can occur in an ER model and how they can be resolved. Exercises 11.10 Create an ER diagram for each of the following descriptions: (a) Each company operates four departments, and each department belongs to one company. (b) Each department in part (a) employs one or more employees, and each employee works for one department. (c) Each of the employees in part (b) may or may not have one or more dependants, and each dependant belongs to one employee. (d) Each employee in part (c) may or may not have an employment history. (e) Represent all the ER diagrams described in (a), (b), (c), and (d) as a single ER diagram. 11.11 You are required to create a conceptual data model of the data requirements for a company that specializes in IT training. The company has 30 instructors and can handle up to 100 trainees per training session. The company offers five advanced technology courses, each of which is taught by a teaching team of two or more instructors. Each instructor is assigned to a maximum of two teaching teams or may be assigned to do research. Each trainee undertakes one advanced technology course per training session. (a) Identify the main entity types for the company. (b) Identify the main relationship types and specify the multiplicity for each relationship. State any assumptions you make about the data. (c) Using your answers for (a) and (b), draw a single ER diagram to represent the data requirements for the company. 11.12 Read the following case study, which describes the data requirements for a video rental company. The video rental company has several branches throughout the USA. The data held on each branch is the branch address made up of street, city, state, and zip code, and the telephone number. Each branch is given a branch number, which is unique throughout the company. Each branch is allocated staff, which includes a Manager. The Manager is responsible for the day-to-day running of a given branch. The data held on a member of staff is his or her name, position, and salary. Each member of staff is given a staff number, which is unique throughout the company. Each branch has a stock of videos. 
The data held on a video is the catalog number, video number, title, category, daily rental, cost, status, and the names of the main actors and the director. The 370 | Chapter 11 z Entity–Relationship Modeling catalog number uniquely identifies each video. However, in most cases, there are several copies of each video at a branch, and the individual copies are identified using the video number. A video is given a category such as Action, Adult, Children, Drama, Horror, or Sci-Fi. The status indicates whether a specific copy of a video is available for rent. Before hiring a video from the company, a customer must first register as a member of a local branch. The data held on a member is the first and last name, address, and the date that the member registered at a branch. Each member is given a member number, which is unique throughout all branches of the company. Once registered, a member is free to rent videos, up to a maximum of ten at any one time. The data held on each video rented is the rental number, the full name and number of the member, the video number, title, and daily rental, and the dates the video is rented out and returned. The rental number is unique throughout the company. (a) Identify the main entity types of the video rental company. (b) Identify the main relationship types between the entity types described in (a) and represent each relationship as an ER diagram. (c) Determine the multiplicity constraints for each relationships described in (b). Represent the multiplicity for each relationship in the ER diagrams created in (b). (d) Identify attributes and associate them with entity or relationship types. Represent each attribute in the ER diagrams created in (c). (e) Determine candidate and primary key attributes for each (strong) entity type. (f) Using your answers (a) to (e) attempt to represent the data requirements of the video rental company as a single ER diagram. State any assumptions necessary to support your design. Chapter 12 Enhanced Entity– Relationship Modeling Chapter Objectives In this chapter you will learn: n The limitations of the basic concepts of the Entity–Relationship (ER) model and the requirements to represent more complex applications using additional data modeling concepts. n The most useful additional data modeling concepts of the Enhanced Entity–Relationship (EER) model called specialization/generalization, aggregation, and composition. n A diagrammatic technique for displaying specialization/generalization, aggregation, and composition in an EER diagram using the Unified Modeling Language (UML). In Chapter 11 we discussed the basic concepts of the Entity–Relationship (ER) model. These basic concepts are normally adequate for building data models of traditional, administrativebased database systems such as stock control, product ordering, and customer invoicing. However, since the 1980s there has been a rapid increase in the development of many new database systems that have more demanding database requirements than those of the traditional applications. Examples of such database applications include Computer-Aided Design (CAD), Computer-Aided Manufacturing (CAM ), Computer-Aided Software Engineering (CASE) tools, Office Information Systems (OIS) and Multimedia Systems, Digital Publishing, and Geographical Information Systems (GIS). The main features of these applications are described in Chapter 25. 
As the basic concepts of ER modeling are often not sufficient to represent the requirements of the newer, more complex applications, this stimulated the need to develop additional ‘semantic’ modeling concepts. Many different semantic data models have been proposed and some of the most important semantic concepts have been successfully incorporated into the original ER model. The ER model supported with additional semantic concepts is called the Enhanced Entity–Relationship (EER) model. In this chapter we describe three of the most important and useful additional concepts of the EER model, namely specialization/generalization, aggregation, and composition. We also illustrate how specialization/generalization, aggregation, and composition are represented in an EER diagram using the Unified Modeling Language (UML) (Booch et al., 1998). In Chapter 11 we introduced UML and demonstrated how UML could be used to diagrammatically represent the basic concepts of the ER model. 372 | Chapter 12 z Enhanced Entity–Relationship Modeling Structure of this Chapter In Section 12.1 we discuss the main concepts associated with specialization/generalization and illustrate how these concepts are represented in an EER diagram using the Unified Modeling Language (UML). We conclude this section with a worked example that demonstrates how to introduce specialization/generalization into an ER model using UML. In Section 12.2 we describe the concept of aggregation and in Section 12.3 the related concept of composition. We provide examples of aggregation and composition and show how these concepts can be represented in an EER diagram using UML. 12.1 Specialization/Generalization The concept of specialization/generalization is associated with special types of entities known as superclasses and subclasses, and the process of attribute inheritance. We begin this section by defining what superclasses and subclasses are and by examining superclass/subclass relationships. We describe the process of attribute inheritance and contrast the process of specialization with the process of generalization. We then describe the two main types of constraints on superclass/subclass relationships called participation and disjoint constraints. We show how to represent specialization/generalization in an Enhanced Entity–Relationship (EER) diagram using UML. We conclude this section with a worked example of how specialization/generalization may be introduced into the Entity–Relationship (ER) model of the Branch user views of the DreamHome case study described in Appendix A and shown in Figure 11.1. 12.1.1 Superclasses and Subclasses As we discussed in Chapter 11, an entity type represents a set of entities of the same type such as Staff, Branch, and PropertyForRent. We can also form entity types into a hierarchy containing superclasses and subclasses. Superclass An entity type that includes one or more distinct subgroupings of its occurrences, which require to be represented in a data model. Subclass A distinct subgrouping of occurrences of an entity type, which require to be represented in a data model. Entity types that have distinct subclasses are called superclasses. For example, the entities that are members of the Staff entity type may be classified as Manager, SalesPersonnel, and Secretary. In other words, the Staff entity is referred to as the superclass of the Manager, SalesPersonnel, and Secretary subclasses. 
The relationship between a superclass and any 12.1 Specialization/Generalization | 373 one of its subclasses is called a superclass/subclass relationship. For example, Staff/Manager has a superclass/subclass relationship. Superclass/Subclass Relationships 12.1.2 Each member of a subclass is also a member of the superclass. In other words, the entity in the subclass is the same entity in the superclass, but has a distinct role. The relationship between a superclass and a subclass is one-to-one (1:1) and is called a superclass/subclass relationship (see Section 11.6.1). Some superclasses may contain overlapping subclasses, as illustrated by a member of staff who is both a Manager and a member of Sales Personnel. In this example, Manager and SalesPersonnel are overlapping subclasses of the Staff superclass. On the other hand, not every member of a superclass need be a member of a subclass; for example, members of staff without a distinct job role such as a Manager or a member of Sales Personnel. We can use superclasses and subclasses to avoid describing different types of staff with possibly different attributes within a single entity. For example, Sales Personnel may have special attributes such as salesArea and carAllowance. If all staff attributes and those specific to particular jobs are described by a single Staff entity, this may result in a lot of nulls for the job-specific attributes. Clearly, Sales Personnel have common attributes with other staff, such as staffNo, name, position, and salary. However, it is the unshared attributes that cause problems when we try to represent all members of staff within a single entity. We can also show relationships that are only associated with particular types of staff (subclasses) and not with staff, in general. For example, Sales Personnel may have distinct relationships that are not appropriate for all staff, such as SalesPersonnel Uses Car. To illustrate these points, consider the relation called AllStaff shown in Figure 12.1. This relation holds the details of all members of staff no matter what position they hold. A consequence of holding all staff details in one relation is that while the attributes appropriate to all staff are filled (namely, staffNo, name, position, and salary), those that are only applicable Figure 12.1 The AllStaff relation holding details of all staff. 374 | Chapter 12 z Enhanced Entity–Relationship Modeling to particular job roles are only partially filled. For example, the attributes associated with the Manager (mgrStartDate and bonus), SalesPersonnel (salesArea and carAllowance), and Secretary (typingSpeed) subclasses have values for those members in these subclasses. In other words, the attributes associated with the Manager, SalesPersonnel, and Secretary subclasses are empty for those members of staff not in these subclasses. There are two important reasons for introducing the concepts of superclasses and subclasses into an ER model. Firstly, it avoids describing similar concepts more than once, thereby saving time for the designer and making the ER diagram more readable. Secondly, it adds more semantic information to the design in a form that is familiar to many people. For example, the assertions that ‘Manager IS-A member of staff’ and ‘flat IS-A type of property’, communicates significant semantic content in a concise form. 
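To make the null problem concrete, the sketch below contrasts a single AllStaff relation with a design that separates the shared attributes from the job-specific ones. It is only an illustrative sketch: the formal mapping of superclasses and subclasses to relations is covered in the logical database design methodology later in the book, and the SQL data types shown here are assumptions rather than part of the DreamHome case study.

-- A single AllStaff relation: the job-specific columns must allow nulls,
-- and most of them are empty for any given member of staff.
CREATE TABLE AllStaff (
    staffNo       VARCHAR(5)  PRIMARY KEY,
    name          VARCHAR(30) NOT NULL,
    position      VARCHAR(20) NOT NULL,
    salary        DECIMAL(8,2),
    mgrStartDate  DATE,            -- Manager only
    bonus         DECIMAL(8,2),    -- Manager only
    salesArea     VARCHAR(20),     -- SalesPersonnel only
    carAllowance  DECIMAL(8,2),    -- SalesPersonnel only
    typingSpeed   INTEGER          -- Secretary only
);

-- An alternative design: shared attributes in a Staff table, job-specific
-- attributes in separate tables keyed on staffNo.
CREATE TABLE Staff (
    staffNo   VARCHAR(5)  PRIMARY KEY,
    name      VARCHAR(30) NOT NULL,
    position  VARCHAR(20) NOT NULL,
    salary    DECIMAL(8,2)
);

CREATE TABLE SalesPersonnel (
    staffNo       VARCHAR(5) PRIMARY KEY REFERENCES Staff(staffNo),
    salesArea     VARCHAR(20),
    carAllowance  DECIMAL(8,2)
);
-- Manager and Secretary tables would follow the same pattern.

In the first design the attributes mgrStartDate, bonus, salesArea, carAllowance, and typingSpeed are null for most rows; in the second, a row appears in SalesPersonnel only when the member of staff actually holds those attributes.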
12.1.3 Attribute Inheritance As mentioned above, an entity in a subclass represents the same ‘real world’ object as in the superclass, and may possess subclass-specific attributes, as well as those associated with the superclass. For example, a member of the SalesPersonnel subclass inherits all the attributes of the Staff superclass such as staffNo, name, position, and salary together with those specifically associated with the SalesPersonnel subclass such as salesArea and carAllowance. A subclass is an entity in its own right and so it may also have one or more subclasses. An entity and its subclasses and their subclasses, and so on, is called a type hierarchy. Type hierarchies are known by a variety of names including: specialization hierarchy (for example, Manager is a specialization of Staff), generalization hierarchy (for example, Staff is a generalization of Manager), and IS-A hierarchy (for example, Manager IS-A (member of) Staff). We describe the process of specialization and generalization in the following sections. A subclass with more than one superclass is called a shared subclass. In other words, a member of a shared subclass must be a member of the associated superclasses. As a consequence, the attributes of the superclasses are inherited by the shared subclass, which may also have its own additional attributes. This process is referred to as multiple inheritance. 12.1.4 Specialization Process Specialization The process of maximizing the differences between members of an entity by identifying their distinguishing characteristics. Specialization is a top-down approach to defining a set of superclasses and their related subclasses. The set of subclasses is defined on the basis of some distinguishing characteristics of the entities in the superclass. When we identify a set of subclasses of an entity type, we then associate attributes specific to each subclass (where necessary), and also identify any relationships between each subclass and other entity types or subclasses (where necessary). For example, consider a model where all members of staff are 12.1 Specialization/Generalization | represented as an entity called Staff. If we apply the process of specialization on the Staff entity, we attempt to identify differences between members of this entity such as members with distinctive attributes and/or relationships. As described earlier, staff with the job roles of Manager, Sales Personnel, and Secretary have distinctive attributes and therefore we identify Manager, SalesPersonnel, and Secretary as subclasses of a specialized Staff superclass. Generalization Process Generalization The process of minimizing the differences between entities by identifying their common characteristics. The process of generalization is a bottom-up approach, which results in the identification of a generalized superclass from the original entity types. For example, consider a model where Manager, SalesPersonnel, and Secretary are represented as distinct entity types. If we apply the process of generalization on these entities, we attempt to identify similarities between them such as common attributes and relationships. As stated earlier, these entities share attributes common to all staff, and therefore we identify Manager, SalesPersonnel, and Secretary as subclasses of a generalized Staff superclass. As the process of generalization can be viewed as the reverse of the specialization process, we refer to this modeling concept as ‘specialization/generalization’. 
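Whichever direction the process is approached from, the practical consequence at the data level is the attribute inheritance described in Section 12.1.3. Assuming a relational representation in which the superclass and each subclass are held in their own tables sharing the staffNo key (one possible mapping, sketched earlier in this section and not the book's formal method), the query below shows how a member of the SalesPersonnel subclass carries both the inherited Staff attributes and its own.

-- A member of SalesPersonnel "inherits" the Staff attributes by sharing
-- the same staffNo value in both tables; the join reassembles the full
-- set of attributes for that member of staff.
SELECT s.staffNo, s.name, s.position, s.salary,   -- inherited from Staff
       sp.salesArea, sp.carAllowance              -- specific to SalesPersonnel
FROM   Staff s
JOIN   SalesPersonnel sp ON sp.staffNo = s.staffNo;

A shared subclass would simply join to more than one subclass table in the same way, which is the relational counterpart of multiple inheritance.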
Diagrammatic representation of specialization/generalization UML has a special notation for representing specialization/generalization. For example, consider the specialization/generalization of the Staff entity into subclasses that represent job roles. The Staff superclass and the Manager, SalesPersonnel, and Secretary subclasses can be represented in an Enhanced Entity–Relationship (EER) diagram as illustrated in Figure 12.2. Note that the Staff superclass and the subclasses, being entities, are represented as rectangles. The subclasses are attached by lines to a triangle that points toward the superclass. The label below the specialization/generalization triangle, shown as {Optional, And}, describes the constraints on the relationship between the superclass and its subclasses. These constraints are discussed in more detail in Section 12.1.6. Attributes that are specific to a given subclass are listed in the lower section of the rectangle representing that subclass. For example, salesArea and carAllowance attributes are only associated with the SalesPersonnel subclass, and are not applicable to the Manager or Secretary subclasses. Similarly, we show attributes that are specific to the Manager (mgrStartDate and bonus) and Secretary (typingSpeed) subclasses. Attributes that are common to all subclasses are listed in the lower section of the rectangle representing the superclass. For example, staffNo, name, position, and salary attributes are common to all members of staff and are associated with the Staff superclass. Note that we can also show relationships that are only applicable to specific subclasses. For example, in Figure 12.2, the Manager subclass is related to the Branch entity through the 12.1.5 375 376 | Chapter 12 z Enhanced Entity–Relationship Modeling Figure 12.2 Specialization/ generalization of the Staff entity into subclasses representing job roles. Manages relationship, whereas the Staff superclass is related to the Branch entity through the Has relationship. We may have several specializations of the same entity based on different distinguishing characteristics. For example, another specialization of the Staff entity may produce the subclasses FullTimePermanent and PartTimeTemporary, which distinguishes between the types of employment contract for members of staff. The specialization of the Staff entity type into job role and contract of employment subclasses is shown in Figure 12.3. In this figure, we show attributes that are specific to the FullTimePermanent (salaryScale and holidayAllowance) and PartTimeTemporary (hourlyRate) subclasses. As described earlier, a superclass and its subclasses and their subclasses, and so on, is called a type hierarchy. An example of a type hierarchy is shown in Figure 12.4, where the job roles specialization/generalization shown in Figure 12.2 are expanded to show a shared subclass called SalesManager and the subclass called Secretary with its own subclass called AssistantSecretary. In other words, a member of the SalesManager shared subclass must be a member of the SalesPersonnel and Manager subclasses as well as the Staff superclass. As a consequence, the attributes of the Staff superclass (staffNo, name, position, and salary), and the attributes of the subclasses SalesPersonnel (salesArea and carAllowance) and Manager (mgrStartDate and bonus) are inherited by the SalesManager subclass, which also has its own additional attribute called salesTarget. AssistantSecretary is a subclass of Secretary, which is a subclass of Staff. 
This means that a member of the AssistantSecretary subclass must be a member of the Secretary subclass and the Staff superclass. As a consequence, the attributes of the Staff superclass (staffNo, name, position, and salary) and the attribute of the Secretary subclass (typingSpeed) are inherited by the AssistantSecretary subclass, which also has its own additional attribute called startDate. 12.1 Specialization/Generalization Figure 12.3 | 377 Specialization/generalization of the Staff entity into subclasses representing job roles and contracts of employment. Figure 12.4 Specialization/ generalization of the Staff entity into job roles including a shared subclass called SalesManager and a subclass called Secretary with its own subclass called AssistantSecretary. 378 | Chapter 12 z Enhanced Entity–Relationship Modeling 12.1.6 Constraints on Specialization/Generalization There are two constraints that may apply to a specialization/generalization called participation constraints and disjoint constraints. Participation constraints Participation constraint Determines whether every member in the superclass must participate as a member of a subclass. A participation constraint may be mandatory or optional. A superclass/subclass relationship with mandatory participation specifies that every member in the superclass must also be a member of a subclass. To represent mandatory participation, ‘Mandatory’ is placed in curly brackets below the triangle that points towards the superclass. For example, in Figure 12.3 the contract of employment specialization/generalization is mandatory participation, which means that every member of staff must have a contract of employment. A superclass/subclass relationship with optional participation specifies that a member of a superclass need not belong to any of its subclasses. To represent optional participation, ‘Optional’ is placed in curly brackets below the triangle that points towards the superclass. For example, in Figure 12.3 the job role specialization/generalization has optional participation, which means that a member of staff need not have an additional job role such as a Manager, Sales Personnel, or Secretary. Disjoint constraints Disjoint constraint Describes the relationship between members of the subclasses and indicates whether it is possible for a member of a superclass to be a member of one, or more than one, subclass. The disjoint constraint only applies when a superclass has more than one subclass. If the subclasses are disjoint, then an entity occurrence can be a member of only one of the subclasses. To represent a disjoint superclass/subclass relationship, ‘Or’ is placed next to the participation constraint within the curly brackets. For example, in Figure 12.3 the subclasses of the contract of employment specialization/generalization is disjoint, which means that a member of staff must have a full-time permanent or a part-time temporary contract, but not both. If subclasses of a specialization/generalization are not disjoint (called nondisjoint), then an entity occurrence may be a member of more than one subclass. To represent a nondisjoint superclass/subclass relationship, ‘And’ is placed next to the participation constraint within the curly brackets. For example, in Figure 12.3 the job role specialization/ generalization is nondisjoint, which means that an entity occurrence can be a member of both the Manager, SalesPersonnel, and Secretary subclasses. 
This is confirmed by the presence of the shared subclass called SalesManager shown in Figure 12.4. Note that it is not necessary to include the disjoint constraint for hierarchies that have a single subclass 12.1 Specialization/Generalization | at a given level and for this reason only the participation constraint is shown for the SalesManager and AssistantSecretary subclasses of Figure 12.4. The disjoint and participation constraints of specialization and generalization are distinct, giving rise to four categories: ‘mandatory and disjoint’, ‘optional and disjoint’, ‘mandatory and nondisjoint’, and ‘optional and nondisjoint’. Worked Example of using Specialization/ Generalization to Model the Branch View of DreamHome Case Study The database design methodology described in this book includes the use of specialization/ generalization as an optional step (Step 1.6) in building an EER model. The choice to use this step is dependent on the complexity of the enterprise (or part of the enterprise) being modeled and whether using the additional concepts of the EER model will help the process of database design. In Chapter 11 we described the basic concepts necessary to build an ER model to represent the Branch user views of the DreamHome case study. This model was shown as an ER diagram in Figure 11.1. In this section, we show how specialization/generalization may be used to convert the ER model of the Branch user views into an EER model. As a starting point, we first consider the entities shown in Figure 11.1. We examine the attributes and relationships associated with each entity to identify any similarities or differences between the entities. In the Branch user views’ requirements specification there are several instances where there is the potential to use specialization/generalization as discussed below. (a) For example, consider the Staff entity in Figure 11.1, which represents all members of staff. However, in the data requirements specification for the Branch user views of the DreamHome case study given in Appendix A, there are two key job roles mentioned namely Manager and Supervisor. We have three options as to how we may best model members of staff. The first option is to represent all members of staff as a generalized Staff entity (as in Figure 11.1), the second option is to create three distinct entities Staff, Manager, and Supervisor, and the third option is to represent the Manager and Supervisor entities as subclasses of a Staff superclass. The option we select is based on the commonality of attributes and relationships associated with each entity. For example, all attributes of the Staff entity are represented in the Manager and Supervisor entities, including the same primary key, namely staffNo. Furthermore, the Supervisor entity does not have any additional attributes representing this job role. On the other hand, the Manager entity has two additional attributes, namely mgrStartDate and bonus. In addition, both the Manager and Supervisor entities are associated with distinct relationships, namely Manager Manages Branch and Supervisor Supervises Staff. Based on this information, we select the third option and create Manager and Supervisor subclasses of the Staff superclass, as shown in Figure 12.5. Note that in this EER diagram, the subclasses are shown above the superclass. The relative positioning of the subclasses and superclass is not significant, however; what is important is that the specialization/ generalization triangle points toward the superclass. 
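As a rough indication of where such a design might lead, the sketch below shows one way the Staff superclass with its Manager and Supervisor subclasses could eventually be carried into SQL tables. This anticipates the logical database design methodology described later in the book and is not part of the worked example itself; the staffType discriminator column is an assumed device for illustration, not an attribute of the case study, and the data types are likewise assumptions.

CREATE TABLE Staff (
    staffNo   VARCHAR(5)  PRIMARY KEY,
    name      VARCHAR(30) NOT NULL,
    position  VARCHAR(20),
    salary    DECIMAL(8,2),
    -- Assumed discriminator: NULL models optional participation (a member of
    -- staff need not be a Manager or a Supervisor); allowing only one value
    -- per row reflects the disjoint constraint.
    staffType VARCHAR(10) CHECK (staffType IN ('Manager', 'Supervisor'))
);

CREATE TABLE Manager (
    staffNo      VARCHAR(5) PRIMARY KEY REFERENCES Staff(staffNo),
    mgrStartDate DATE,
    bonus        DECIMAL(8,2)
);

-- Supervisor has no additional attributes in the case study, so its table
-- holds only the key. On its own this DDL does not prevent a staffNo from
-- appearing in both subclass tables; stricter enforcement of disjointness
-- would need further constraints or triggers.
CREATE TABLE Supervisor (
    staffNo VARCHAR(5) PRIMARY KEY REFERENCES Staff(staffNo)
);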
12.1.7 379 380 | Chapter 12 z Enhanced Entity–Relationship Modeling Figure 12.5 Staff superclass with Supervisor and Manager subclasses. The specialization/generalization of the Staff entity is optional and disjoint (shown as {Optional, Or}), as not all members of staff are Managers or Supervisors, and in addition a single member of staff cannot be both a Manager and a Supervisor. This representation is particularly useful for displaying the shared attributes associated with these subclasses and the Staff superclass and also the distinct relationships associated with each subclass, namely Manager Manages Branch and Supervisor Supervises Staff. (b) Next, consider for specialization/generalization the relationship between owners of property. The data requirements specification for the Branch user views describes two types of owner, namely PrivateOwner and BusinessOwner as shown in Figure 11.1. Again, we have three options as to how we may best model owners of property. The first option is to leave PrivateOwner and BusinessOwner as two distinct entities (as shown in Figure 11.1), the second option is to represent both types of owner as a generalized Owner entity, and the third option is to represent the PrivateOwner and BusinessOwner entities as subclasses of an Owner superclass. Before we are able to reach a decision we first examine the attributes and relationships associated with these entities. PrivateOwner and BusinessOwner entities share common attributes, namely address and telNo and have a similar relationship with property for rent (namely PrivateOwner POwns PropertyForRent and BusinessOwner BOwns PropertyForRent). However, both types of owner also have different attributes; for example, PrivateOwner has distinct attributes ownerNo and name, and BusinessOwner has distinct attributes bName, bType, and contactName. In this case, we create a superclass called Owner, with PrivateOwner and BusinessOwner as subclasses as shown in Figure 12.6. The specialization/generalization of the Owner entity is mandatory and disjoint (shown as {Mandatory, Or}), as an owner must be either a private owner or a business owner, but cannot be both. Note that we choose to relate the Owner superclass to the PropertyForRent entity using the relationship called Owns. The examples of specialization/generalization described above are relatively straightforward. However, the specialization/generalization process can be taken further as illustrated in the following example. 12.1 Specialization/Generalization | 381 Figure 12.6 Owner superclass with PrivateOwner and BusinessOwner subclasses. (c) There are several persons with common characteristics described in the data requirements specification for the Branch user views of the DreamHome case study. For example, members of staff, private property owners, and clients all have number and name attributes. We could create a Person superclass with Staff (including Manager and Supervisor subclasses), PrivateOwner, and Client as subclasses, as shown in Figure 12.7. We now consider to what extent we wish to use specialization/generalization to represent the Branch user views of the DreamHome case study. We decide to use the specialization/generalization examples described in (a) and (b) above but not (c), as shown in Figure 12.8. To simplify the EER diagram only attributes associated with primary keys or Figure 12.7 Person superclass with Staff (including Supervisor and Manager subclasses), PrivateOwner, and Client subclasses. 
382 | Chapter 12 z Enhanced Entity–Relationship Modeling Figure 12.8 An Enhanced Entity–Relationship (EER) model of the Branch user views of DreamHome with specialization/generalization. 12.2 Aggregation | relationships are shown. We leave out the representation shown in Figure 12.7 from the final EER model because the use of specialization/generalization in this case places too much emphasis on the relationship between entities that are persons rather than emphasizing the relationship between these entities and some of the core entities such as Branch and PropertyForRent. The option to use specialization/generalization, and to what extent, is a subjective decision. In fact, the use of specialization/generalization is presented as an optional step in our methodology for conceptual database design discussed in Chapter 15, Step 1.6. As described in Section 2.3, the purpose of a data model is to provide the concepts and notations that allow database designers and end-users to unambiguously and accurately communicate their understanding of the enterprise data. Therefore, if we keep these goals in mind, we should only use the additional concepts of specialization/generalization when the enterprise data is too complex to easily represent using only the basic concepts of the ER model. At this stage we may consider whether the introduction of specialization/generalization to represent the Branch user views of DreamHome is a good idea. In other words, is the requirement specification for the Branch user views better represented as the ER model shown in Figure 11.1 or as the EER model shown in Figure 12.8? We leave this for the reader to consider. Aggregation Aggregation Represents a ‘has-a’ or ‘is-part-of’ relationship between entity types, where one represents the ‘whole’ and the other the ‘part’. A relationship represents an association between two entity types that are conceptually at the same level. Sometimes we want to model a ‘has-a’ or ‘is-part-of’ relationship, in which one entity represents a larger entity (the ‘whole’), consisting of smaller entities (the ‘parts’). This special kind of relationship is called an aggregation (Booch et al., 1998). Aggregation does not change the meaning of navigation across the relationship between the whole and its parts, nor does it link the lifetimes of the whole and its parts. An example of an aggregation is the Has relationship, which relates the Branch entity (the ‘whole’) to the Staff entity (the ‘part’). Diagrammatic representation of aggregation UML represents aggregation by placing an open diamond shape at one end of the relationship line, next to the entity that represents the ‘whole’. In Figure 12.9, we redraw part of the EER diagram shown in Figure 12.8 to demonstrate aggregation. This EER diagram displays two examples of aggregation, namely Branch Has Staff and Branch Offers PropertyForRent. In both relationships, the Branch entity represents the ‘whole’ and therefore the open diamond shape is placed beside this entity. 12.2 383 384 | Chapter 12 z Enhanced Entity–Relationship Modeling Figure 12.9 Examples of aggregation: Branch Has Staff and Branch Offers PropertyForRent. 12.3 Composition Composition A specific form of aggregation that represents an association between entities, where there is a strong ownership and coincidental lifetime between the ‘whole’ and the ‘part’. Aggregation is entirely conceptual and does nothing more than distinguish a ‘whole’ from a ‘part’. 
However, there is a variation of aggregation called composition that represents a strong ownership and coincidental lifetime between the ‘whole’ and the ‘part’ (Booch et al., 1998). In a composite, the ‘whole’ is responsible for the disposition of the ‘parts’, which means that the composition must manage the creation and destruction of its ‘parts’. In other words, an object may only be part of one composite at a time. There are no examples of composition in Figure 12.8. For the purposes of discussion, consider an example of a composition, namely the Displays relationship, which relates the Newspaper entity to the Advert entity. As a composition, this emphasizes the fact that an Advert entity (the ‘part’) belongs to exactly one Newspaper entity (the ‘whole’). This is in contrast to aggregation, in which a part may be shared by many wholes. For example, a Staff entity may be ‘a part of’ one or more Branches entities. Chapter Summary | 385 Figure 12.10 An example of composition: Newspaper Displays Advert. Diagrammatic representation of composition UML represents composition by placing a filled-in diamond shape at one end of the relationship line next to the entity that represents the ‘whole’ in the relationship. For example, to represent the Newspaper Displays Advert composition, the filled-in diamond shape is placed next to the Newspaper entity, which is the ‘whole’ in this relationship, as shown in Figure 12.10. As discussed with specialization/generalization, the options to use aggregation and composition, and to what extent, are again subjective decisions. Aggregation and composition should only be used when there is a requirement to emphasize special relationships between entity types such as ‘has-a’ or ‘is-part-of’, which has implications on the creation, update, and deletion of these closely related entities. We discuss how to represent such constraints between entity types in our methodology for logical database design in Chapter 16, Step 2.4. If we remember that the major aim of a data model is to unambiguously and accurately communicate an understanding of the enterprise data. We should only use the additional concepts of aggregation and composition when the enterprise data is too complex to easily represent using only the basic concepts of the ER model. Chapter Summary n A superclass is an entity type that includes one or more distinct subgroupings of its occurrences, which require to be represented in a data model. A subclass is a distinct subgrouping of occurrences of an entity type, which require to be represented in a data model. n Specialization is the process of maximizing the differences between members of an entity by identifying their distinguishing features. n Generalization is the process of minimizing the differences between entities by identifying their common features. n There are two constraints that may apply to a specialization/generalization called participation constraints and disjoint constraints. 386 | Chapter 12 z Enhanced Entity–Relationship Modeling n A participation constraint determines whether every member in the superclass must participate as a member of a subclass. n A disjoint constraint describes the relationship between members of the subclasses and indicates whether it is possible for a member of a superclass to be a member of one, or more than one, subclass. n Aggregation represents a ‘has-a’ or ‘is-part-of’ relationship between entity types, where one represents the ‘whole’ and the other the ‘part’. 
n Composition is a specific form of aggregation that represents an association between entities, where there is a strong ownership and coincidental lifetime between the ‘whole’ and the ‘part’. Review Questions 12.1 Describe what a superclass and a subclass represent. 12.2 Describe the relationship between a superclass and its subclass. 12.3 Describe and illustrate using an example the process of attribute inheritance. 12.4 What are the main reasons for introducing the concepts of superclasses and subclasses into an ER model? 12.5 Describe what a shared subclass represents and how this concept relates to multiple inheritance. 12.6 Describe and contrast the process of specialization with the process of generalization. 12.7 Describe the two main constraints that apply to a specialization/generalization relationship. 12.8 Describe and contrast the concepts of aggregation and composition and provide an example of each. Exercises 12.9 Consider whether it is appropriate to introduce the enhanced concepts of specialization/generalization, aggregation, and/or composition for the case studies described in Appendix B. 12.10 Consider whether it is appropriate to introduce the enhanced concepts of specialization/generalization, aggregation, and/or composition into the ER model for the case study described in Exercise 11.12. If appropriate, redraw the ER diagram as an EER diagram with the additional enhanced concepts. Chapter 13 Normalization Chapter Objectives In this chapter you will learn: n The purpose of normalization. n How normalization can be used when designing a relational database. n The potential problems associated with redundant data in base relations. n The concept of functional dependency, which describes the relationship between attributes. n The characteristics of functional dependencies used in normalization. n How to identify functional dependencies for a given relation. n How functional dependencies identify the primary key for a relation. n How to undertake the process of normalization. n How normalization uses functional dependencies to group attributes into relations that are in a known normal form. n How to identify the most commonly used normal forms, namely First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). n The problems associated with relations that break the rules of 1NF, 2NF, or 3NF. n How to represent attributes shown on a form as 3NF relations using normalization. When we design a database for an enterprise, the main objective is to create an accurate representation of the data, relationships between the data, and constraints on the data that is pertinent to the enterprise. To help achieve this objective, we can use one or more database design techniques. In Chapters 11 and 12 we described a technique called Entity–Relationship (ER) modeling. In this chapter and the next we describe another database design technique called normalization. Normalization is a database design technique, which begins by examining the relationships (called functional dependencies) between attributes. Attributes describe some property of the data or of the relationships between the data that is important to the enterprise. Normalization uses a series of tests (described as normal forms) to help identify the optimal grouping for these attributes to ultimately identify a set of suitable relations that supports the data requirements of the enterprise. 
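As a preview of the kind of restructuring that normalization suggests, the sketch below anticipates the DreamHome StaffBranch example used in Section 13.3: a single relation in which branch details are repeated for every member of staff, and the pair of relations it is normally decomposed into. The attribute names come from the case study; the SQL data types are illustrative assumptions.

-- Redundant design: the branch address is repeated for every member of
-- staff located at that branch.
CREATE TABLE StaffBranch (
    staffNo  VARCHAR(5) PRIMARY KEY,
    sName    VARCHAR(30),
    position VARCHAR(20),
    salary   DECIMAL(8,2),
    branchNo VARCHAR(4),
    bAddress VARCHAR(60)
);

-- After normalization: branch details are held once and referenced
-- from Staff through branchNo.
CREATE TABLE Branch (
    branchNo VARCHAR(4) PRIMARY KEY,
    bAddress VARCHAR(60)
);

CREATE TABLE Staff (
    staffNo  VARCHAR(5) PRIMARY KEY,
    sName    VARCHAR(30),
    position VARCHAR(20),
    salary   DECIMAL(8,2),
    branchNo VARCHAR(4) REFERENCES Branch(branchNo)
);

In the decomposed design each branch address is stored exactly once, which is the property the rest of this chapter sets out to achieve systematically.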
388 | Chapter 13 z Normalization While the main purpose of this chapter is to introduce the concept of functional dependencies and describe normalization up to Third Normal Form (3NF), in Chapter 14 we take a more formal look at functional dependencies and also consider later normal forms that go beyond 3NF. Structure of this Chapter In Section 13.1 we describe the purpose of normalization. In Section 13.2 we discuss how normalization can be used to support relational database design. In Section 13.3 we identify and illustrate the potential problems associated with data redundancy in a base relation that is not normalized. In Section 13.4 we describe the main concept associated with normalization called functional dependency, which describes the relationship between attributes. We also describe the characteristics of the functional dependencies that are used in normalization. In Section 13.5 we present an overview of normalization and then proceed in the following sections to describe the process involving the three most commonly used normal forms, namely First Normal Form (1NF) in Section 13.6, Second Normal Form (2NF) in Section 13.7, and Third Normal Form (3NF) in Section 13.8. The 2NF and 3NF described in these sections are based on the primary key of a relation. In Section 13.9 we present general definitions for 2NF and 3NF based on all candidate keys of a relation. Throughout this chapter we use examples taken from the DreamHome case study described in Section 10.4 and documented in Appendix A. 13.1 The Purpose of Normalization Normalization A technique for producing a set of relations with desirable properties, given the data requirements of an enterprise. The purpose of normalization is to identify a suitable set of relations that support the data requirements of an enterprise. The characteristics of a suitable set of relations include the following: n n n the minimal number of attributes necessary to support the data requirements of the enterprise; attributes with a close logical relationship (described as functional dependency) are found in the same relation; minimal redundancy with each attribute represented only once with the important exception of attributes that form all or part of foreign keys (see Section 3.2.5), which are essential for the joining of related relations. The benefits of using a database that has a suitable set of relations is that the database will be easier for the user to access and maintain the data, and take up minimal storage 13.2 How Normalization Supports Database Design | space on the computer. The problems associated with using a relation that is not appropriately normalized is described later in Section 13.3. How Normalization Supports Database Design Normalization is a formal technique that can be used at any stage of database design. However, in this section we highlight two main approaches for using normalization, as illustrated in Figure 13.1. Approach 1 shows how normalization can be used as a bottomup standalone database design technique while Approach 2 shows how normalization can be used as a validation technique to check the structure of relations, which may have been created using a top-down approach such as ER modeling. No matter which approach is used the goal is the same that of creating a set of well-designed relations that meet the data requirements of the enterprise. Figure 13.1 shows examples of data sources that can be used for database design. 
Although, the users’ requirements specification (see Section 9.5) is the preferred data source, it is possible to design a database based on the information taken directly from other data sources such as forms and reports, as illustrated in this chapter and the next. Figure 13.1 How normalization can be used to support database design. 13.2 389 390 | Chapter 13 z Normalization Figure 13.1 also shows that the same data source can be used for both approaches; however, although this is true in principle, in practice the approach taken is likely to be determined by the size, extent, and complexity of the database being described by the data sources and by the preference and expertise of the database designer. The opportunity to use normalization as a bottom-up standalone technique (Approach 1) is often limited by the level of detail that the database designer is reasonably expected to manage. However, this limitation is not applicable when normalization is used as a validation technique (Approach 2) as the database designer focuses on only part of the database, such as a single relation, at any one time. Therefore, no matter what the size or complexity of the database, normalization can be usefully applied. 13.3 Data Redundancy and Update Anomalies As stated in Section 13.1 a major aim of relational database design is to group attributes into relations to minimize data redundancy. If this aim is achieved, the potential benefits for the implemented database include the following: n n updates to the data stored in the database are achieved with a minimal number of operations thus reducing the opportunities for data inconsistencies occurring in the database; reduction in the file storage space required by the base relations thus minimizing costs. Of course, relational databases also rely on the existence of a certain amount of data redundancy. This redundancy is in the form of copies of primary keys (or candidate keys) acting as foreign keys in related relations to enable the modeling of relationships between data. In this section we illustrate the problems associated with unwanted data redundancy by comparing the Staff and Branch relations shown in Figure 13.2 with the StaffBranch relation Figure 13.2 Staff and Branch relations. 13.3 Data Redundancy and Update Anomalies | 391 Figure 13.3 StaffBranch relation. shown in Figure 13.3. The StaffBranch relation is an alternative format of the relations. The relations have the form: Staff and Branch Staff Branch StaffBranch (staffNo, sName, position, salary, branchNo) (branchNo, bAddress) (staffNo, sName, position, salary, branchNo, bAddress) Note that the primary key for each relation is underlined. In the StaffBranch relation there is redundant data; the details of a branch are repeated for every member of staff located at that branch. In contrast, the branch details appear only once for each branch in the Branch relation, and only the branch number (branchNo) is repeated in the Staff relation to represent where each member of staff is located. Relations that have redundant data may have problems called update anomalies, which are classified as insertion, deletion, or modification anomalies. Insertion Anomalies There are two main types of insertion anomaly, which we illustrate using the relation shown in Figure 13.3. n n 13.3.1 StaffBranch To insert the details of new members of staff into the StaffBranch relation, we must include the details of the branch at which the staff are to be located. 
For example, to insert the details of new staff located at branch number B007, we must enter the correct details of branch number B007 so that the branch details are consistent with values for branch B007 in other tuples of the StaffBranch relation. The relations shown in Figure 13.2 do not suffer from this potential inconsistency because we enter only the appropriate branch number for each staff member in the Staff relation. Instead, the details of branch number B007 are recorded in the database as a single tuple in the Branch relation. To insert details of a new branch that currently has no members of staff into the StaffBranch relation, it is necessary to enter nulls into the attributes for staff, such as staffNo. However, as staffNo is the primary key for the StaffBranch relation, attempting to enter nulls for staffNo violates entity integrity (see Section 3.3), and is not allowed. We therefore cannot enter a tuple for a new branch into the StaffBranch relation with a null for the staffNo. The design of the relations shown in Figure 13.2 avoids this problem 392 | Chapter 13 z Normalization because branch details are entered in the Branch relation separately from the staff details. The details of staff ultimately located at that branch are entered at a later date into the Staff relation. 13.3.2 Deletion Anomalies If we delete a tuple from the StaffBranch relation that represents the last member of staff located at a branch, the details about that branch are also lost from the database. For example, if we delete the tuple for staff number SA9 (Mary Howe) from the StaffBranch relation, the details relating to branch number B007 are lost from the database. The design of the relations in Figure 13.2 avoids this problem, because branch tuples are stored separately from staff tuples and only the attribute branchNo relates the two relations. If we delete the tuple for staff number SA9 from the Staff relation, the details on branch number B007 remain unaffected in the Branch relation. 13.3.3 Modification Anomalies If we want to change the value of one of the attributes of a particular branch in the StaffBranch relation, for example the address for branch number B003, we must update the tuples of all staff located at that branch. If this modification is not carried out on all the appropriate tuples of the StaffBranch relation, the database will become inconsistent. In this example, branch number B003 may appear to have different addresses in different staff tuples. The above examples illustrate that the Staff and Branch relations of Figure 13.2 have more desirable properties than the StaffBranch relation of Figure 13.3. This demonstrates that while the StaffBranch relation is subject to update anomalies, we can avoid these anomalies by decomposing the original relation into the Staff and Branch relations. There are two important properties associated with decomposition of a larger relation into smaller relations: n n The lossless-join property ensures that any instance of the original relation can be identified from corresponding instances in the smaller relations. The dependency preservation property ensures that a constraint on the original relation can be maintained by simply enforcing some constraint on each of the smaller relations. In other words, we do not need to perform joins on the smaller relations to check whether a constraint on the original relation is violated. Later in this chapter, we discuss how the process of normalization can be used to derive well-formed relations. 
However, we first introduce functional dependencies, which are fundamental to the process of normalization. 13.4 Functional Dependencies An important concept associated with normalization is functional dependency, which describes the relationship between attributes (Maier, 1983). In this section we describe 13.4 Functional Dependencies | functional dependencies and then focus on the particular characteristics of functional dependencies that are useful for normalization. We then discuss how functional dependencies can be identified and use to identify the primary key for a relation. Characteristics of Functional Dependencies 13.4.1 For the discussion on functional dependencies, assume that a relational schema has attributes (A, B, C, . . . , Z) and that the database is described by a single universal relation called R = (A, B, C, . . . , Z). This assumption means that every attribute in the database has a unique name. Functional dependency Describes the relationship between attributes in a relation. For example, if A and B are attributes of relation R, B is functionally dependent on A (denoted A → B), if each value of A is associated with exactly one value of B. (A and B may each consist of one or more attributes.) Functional dependency is a property of the meaning or semantics of the attributes in a relation. The semantics indicate how attributes relate to one another, and specify the functional dependencies between attributes. When a functional dependency is present, the dependency is specified as a constraint between the attributes. Consider a relation with attributes A and B, where attribute B is functionally dependent on attribute A. If we know the value of A and we examine the relation that holds this dependency, we find only one value of B in all the tuples that have a given value of A, at any moment in time. Thus, when two tuples have the same value of A, they also have the same value of B. However, for a given value of B there may be several different values of A. The dependency between attributes A and B can be represented diagrammatically, as shown Figure 13.4. An alternative way to describe the relationship between attributes A and B is to say that ‘A functionally determines B’. Some readers may prefer this description, as it more naturally follows the direction of the functional dependency arrow between the attributes. Determinant Refers to the attribute, or group of attributes, on the left-hand side of the arrow of a functional dependency. When a functional dependency exists, the attribute or group of attributes on the lefthand side of the arrow is called the determinant. For example, in Figure 13.4, A is the determinant of B. We demonstrate the identification of a functional dependency in the following example. Figure 13.4 A functional dependency diagram. 393 394 | Chapter 13 z Normalization Example 13.1 An example of a functional dependency Consider the attributes staffNo and position of the Staff relation in Figure 13.2. For a specific staffNo, for example SL21, we can determine the position of that member of staff as Manager. In other words, staffNo functionally determines position, as shown in Figure 13.5(a). However, Figure 13.5(b) illustrates that the opposite is not true, as position does not functionally determine staffNo. A member of staff holds one position; however, there may be several members of staff with the same position. The relationship between staffNo and position is one-to-one (1:1): for each staff number there is only one position. 
On the other hand, the relationship between position and staffNo is one-to-many (1:*): there are several staff numbers associated with a given position. In this example, staffNo is the determinant of this functional dependency. For the purposes of normalization we are interested in identifying functional dependencies between attributes of a relation that have a one-to-one relationship between the attribute(s) that makes up the determinant on the left-hand side and the attribute(s) on the right-hand side of a dependency. When identifying functional dependencies between attributes in a relation it is important to distinguish clearly between the values held by an attribute at a given point in time and the set of all possible values that an attribute may hold at different times. In other words, a functional dependency is a property of a relational schema (intension) and not a property of a particular instance of the schema (extension) (see Section 3.2.1). This point is illustrated in the following example. Figure 13.5 (a) staffNo functionally determines position (staffNo → position); (b) position does not functionally determine staffNo x staffNo). (position → 13.4 Functional Dependencies Example 13.2 Example of a functional dependency that holds for all time Consider the values shown in staffNo and sName attributes of the Staff relation in Figure 13.2. We see that for a specific staffNo, for example SL21, we can determine the name of that member of staff as John White. Furthermore, it appears that for a specific sName, for example, John White, we can determine the staff number for that member of staff as SL21. Can we therefore conclude that the staffNo attribute functionally determines the sName attribute and/or that the sName attribute functionally determines the staffNo attribute? If the values shown in the Staff relation of Figure 13.2 represent the set of all possible values for staffNo and sName attributes then the following functional dependencies hold: staffNo sName → sName → staffNo However, if the values shown in the Staff relation of Figure 13.2 simply represent a set of values for staffNo and sName attributes at a given moment in time, then we are not so interested in such relationships between attributes. The reason is that we want to identify functional dependencies that hold for all possible values for attributes of a relation as these represent the types of integrity constraints that we need to identify. Such constraints indicate the limitations on the values that a relation can legitimately assume. One approach to identifying the set of all possible values for attributes in a relation is to more clearly understand the purpose of each attribute in that relation. For example, the purpose of the values held in the staffNo attribute is to uniquely identify each member of staff, whereas the purpose of the values held in the sName attribute is to hold the names of members of staff. Clearly, the statement that if we know the staff number (staffNo) of a member of staff we can determine the name of the member of staff (sName) remains true. However, as it is possible for the sName attribute to hold duplicate values for members of staff with the same name, then for some members of staff in this category we would not be able to determine their staff number (staffNo). The relationship between staffNo and sName is one-to-one (1:1): for each staff number there is only one name. 
On the other hand, the relationship between sName and staffNo is one-to-many (1:*): there can be several staff numbers associated with a given name. The functional dependency that remains true after consideration of all possible values for the staffNo and sName attributes of the Staff relation is:

staffNo → sName

An additional characteristic of functional dependencies that is useful for normalization is that their determinants should have the minimal number of attributes necessary to maintain the functional dependency with the attribute(s) on the right-hand side. This requirement is called full functional dependency.

Full functional dependency: Indicates that if A and B are attributes of a relation, B is fully functionally dependent on A if B is functionally dependent on A, but not on any proper subset of A.

A functional dependency A → B is a full functional dependency if removal of any attribute from A results in the dependency no longer existing. A functional dependency A → B is a partial dependency if there is some attribute that can be removed from A and yet the dependency still holds. An example of how a full functional dependency is derived from a partial functional dependency is presented in Example 13.3.

Example 13.3 Example of a full functional dependency

Consider the following functional dependency that exists in the Staff relation of Figure 13.2:

staffNo, sName → branchNo

It is correct to say that each value of (staffNo, sName) is associated with a single value of branchNo. However, it is not a full functional dependency because branchNo is also functionally dependent on a subset of (staffNo, sName), namely staffNo. In other words, the functional dependency shown above is an example of a partial dependency. The type of functional dependency that we are interested in identifying is a full functional dependency as shown below.

staffNo → branchNo

Additional examples of partial and full functional dependencies are discussed in Section 13.7.

In summary, the functional dependencies that we use in normalization have the following characteristics:

- There is a one-to-one relationship between the attribute(s) on the left-hand side (determinant) and those on the right-hand side of a functional dependency. (Note that the relationship in the opposite direction, that is from the right- to the left-hand side attributes, can be a one-to-one relationship or one-to-many relationship.)
- They hold for all time.
- The determinant has the minimal number of attributes necessary to maintain the dependency with the attribute(s) on the right-hand side. In other words, there must be a full functional dependency between the attribute(s) on the left- and right-hand sides of the dependency.

So far we have discussed functional dependencies that we are interested in for the purposes of normalization. However, there is an additional type of functional dependency called a transitive dependency that we need to recognize because its existence in a relation can potentially cause the types of update anomaly discussed in Section 13.3. In this section we simply describe these dependencies so that we can identify them when necessary.

Transitive dependency: A condition where A, B, and C are attributes of a relation such that if A → B and B → C, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C).

An example of a transitive dependency is provided in Example 13.4. First, however, the short sketch below shows how a partial dependency such as the one in Example 13.3 can be detected mechanically.
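The sketch below is illustrative only: it reuses the fd_holds helper from the earlier sketch (repeated here so the snippet runs on its own) and hypothetical Staff tuples, and tests whether a dependency is full by re-checking it after dropping each attribute of the determinant in turn.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Same helper as in the previous sketch: each lhs value maps to exactly one rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_full_dependency(rows, lhs, rhs):
    """A dependency is full if it holds and no proper subset of lhs still determines rhs."""
    if not fd_holds(rows, lhs, rhs):
        return False
    for r in range(1, len(lhs)):
        for subset in combinations(lhs, r):
            if fd_holds(rows, list(subset), rhs):
                return False          # partial: a smaller determinant already suffices
    return True

# Hypothetical tuples in the spirit of the Staff relation (Figure 13.2 is not reproduced here)
staff = [
    {"staffNo": "SL21", "sName": "John White", "branchNo": "B005"},
    {"staffNo": "SG37", "sName": "Ann Beech",  "branchNo": "B003"},
    {"staffNo": "SG14", "sName": "David Ford", "branchNo": "B003"},
]

print(fd_holds(staff, ["staffNo", "sName"], ["branchNo"]))           # True
print(is_full_dependency(staff, ["staffNo", "sName"], ["branchNo"])) # False: staffNo alone suffices (partial)
print(is_full_dependency(staff, ["staffNo"], ["branchNo"]))          # True: a full functional dependency
```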
Example 13.4 Example of a transitive functional dependency Consider the following functional dependencies within the Figure 13.3: StaffBranch relation shown in staffNo → sName, position, salary, branchNo, bAddress branchNo → bAddress The transitive dependency branchNo → bAddress exists on staffNo via branchNo. In other words, the staffNo attribute functionally determines the bAddress via the branchNo attribute and neither branchNo nor bAddress functionally determines staffNo. An additional example of a transitive dependency is discussed in Section 13.8. In the following sections we demonstrate approaches to identifying a set of functional dependencies and then discuss how these dependencies can be used to identify a primary key for the example relations. Identifying Functional Dependencies Identifying all functional dependencies between a set of attributes should be quite simple if the meaning of each attribute and the relationships between the attributes are well understood. This type of information may be provided by the enterprise in the form of discussions with users and/or appropriate documentation such as the users’ requirements specification. However, if the users are unavailable for consultation and/or the documentation is incomplete, then, depending on the database application, it may be necessary for the database designer to use their common sense and/or experience to provide the missing information. Example 13.5 illustrates how easy it is to identify functional dependencies between attributes of a relation when the purpose of each attribute and the attributes’ relationships are well understood. Example 13.5 Identifying a set of functional dependencies for the StaffBranch relation We begin by examining the semantics of the attributes in the StaffBranch relation shown in Figure 13.3. For the purposes of discussion we assume that the position held and the branch determine a member of staff’s salary. We identify the functional dependencies based on our understanding of the attributes in the relation as: 13.4.2 397 398 | Chapter 13 z Normalization staffNo → sName, position, salary, branchNo, bAddress branchNo → bAddress bAddress → branchNo branchNo, position → salary bAddress, position → salary We identify five functional dependencies in the StaffBranch relation with staffNo, branchNo, bAddress, (branchNo, position), and (bAddress, position) as determinants. For each functional dependency, we ensure that all the attributes on the right-hand side are functionally dependent on the determinant on the left-hand side. As a contrast to this example we now consider the situation where functional dependencies are to be identified in the absence of appropriate information about the meaning of attributes and their relationships. In this case, it may be possible to identify functional dependencies if sample data is available that is a true representation of all possible data values that the database may hold. We demonstrate this approach in Example 13.6. Example 13.6 Using sample data to identify functional dependencies Consider the data for attributes denoted A, B, C, D, and E in the Sample relation of Figure 13.6. It is important first to establish that the data values shown in this relation are representative of all possible values that can be held by attributes A, B, C, D, and E. For the purposes of this example, let us assume that this is true despite the relatively small amount of data shown in this relation. 
The process of identifying the functional dependencies (denoted fd1 to fd4) that exist between the attributes of the Sample relation shown in Figure 13.6 is described below.

Figure 13.6 The Sample relation displaying data for attributes A, B, C, D, and E and the functional dependencies (fd1 to fd4) that exist between these attributes.

To identify the functional dependencies that exist between attributes A, B, C, D, and E, we examine the Sample relation shown in Figure 13.6 and identify when values in one column are consistent with the presence of particular values in other columns. We begin with the first column on the left-hand side and work our way over to the right-hand side of the relation, and then we look at combinations of columns, in other words where values in two or more columns are consistent with the appearance of values in other columns.

For example, when the value 'a' appears in column A the value 'z' appears in column C, and when 'e' appears in column A the value 'r' appears in column C. We can therefore conclude that there is a one-to-one (1:1) relationship between attributes A and C. In other words, attribute A functionally determines attribute C and this is shown as functional dependency 1 (fd1) in Figure 13.6. Furthermore, as the values in column C are consistent with the appearance of particular values in column A, we can also conclude that there is a (1:1) relationship between attributes C and A. In other words, C functionally determines A and this is shown as fd2 in Figure 13.6.

If we now consider attribute B, we can see that when 'b' or 'd' appears in column B then 'w' appears in column D, and when 'f' appears in column B then 's' appears in column D. We can therefore conclude that there is a (1:1) relationship between attributes B and D. In other words, B functionally determines D and this is shown as fd3 in Figure 13.6. However, attribute D does not functionally determine attribute B as a single unique value in column D such as 'w' is not associated with a single consistent value in column B. In other words, when 'w' appears in column D the values 'b' or 'd' appear in column B. Hence, there is a one-to-many relationship between attributes D and B. The final single attribute to consider is E, and we find that the values in this column are not associated with the consistent appearance of particular values in the other columns. In other words, attribute E does not functionally determine attributes A, B, C, or D.

We now consider combinations of attributes and the appearance of consistent values in other columns. We conclude that a unique combination of values in columns A and B such as (a, b) is associated with a single value in column E, which in this example is 'q'. In other words, attributes (A, B) functionally determine attribute E and this is shown as fd4 in Figure 13.6. However, the reverse is not true, as we have already stated that attribute E does not functionally determine any other attribute in the relation. We complete the examination of the relation shown in Figure 13.6 by considering all the remaining combinations of columns.

In summary, we describe the functional dependencies between attributes A to E in the Sample relation shown in Figure 13.6 as follows:

A → C (fd1)
C → A (fd2)
B → D (fd3)
A, B → E (fd4)

Identifying the Primary Key for a Relation using Functional Dependencies

The main purpose of identifying a set of functional dependencies for a relation is to specify the set of integrity constraints that must hold on a relation.
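The column-by-column inspection of Example 13.6 can be automated for small relations. The sketch below is illustrative only: Figure 13.6 is not reproduced above, so the sample rows are hypothetical values consistent with those quoted in the example, and, as stressed earlier, the result is only meaningful if the sample really represents all possible values.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Each lhs value maps to exactly one rhs value in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def discover_fds(rows, max_lhs=2):
    """Enumerate determinants of up to max_lhs attributes and report the dependencies the data
    supports, keeping only those whose determinant is minimal (full functional dependencies)."""
    attrs = sorted(rows[0])
    found = []
    for r in range(1, max_lhs + 1):
        for lhs in combinations(attrs, r):
            for rhs in attrs:
                if rhs in lhs:
                    continue
                if not fd_holds(rows, list(lhs), [rhs]):
                    continue
                # skip partial dependencies: a proper subset of lhs already determines rhs
                if any(set(f[0]) < set(lhs) and f[1] == rhs for f in found):
                    continue
                found.append((lhs, rhs))
    return found

# Hypothetical rows consistent with the values quoted in Example 13.6 (Figure 13.6 not shown here)
sample = [
    {"A": "a", "B": "b", "C": "z", "D": "w", "E": "q"},
    {"A": "e", "B": "b", "C": "r", "D": "w", "E": "p"},
    {"A": "a", "B": "d", "C": "z", "D": "w", "E": "p"},
    {"A": "e", "B": "d", "C": "r", "D": "w", "E": "q"},
    {"A": "a", "B": "f", "C": "z", "D": "s", "E": "p"},
    {"A": "e", "B": "f", "C": "r", "D": "s", "E": "p"},
]

for lhs, rhs in discover_fds(sample):
    print(",".join(lhs), "->", rhs)
# Prints A->C, B->D, C->A, A,B->E, and also B,C->E, which is equivalent to A,B->E because C <-> A.
```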
An important integrity 13.4.3 399 400 | Chapter 13 z Normalization constraint to consider first is the identification of candidate keys, one of which is selected to be the primary key for the relation. We demonstrate the identification of a primary key for a given relation in the following two examples. Example 13.7 Identifying the primary key for the StaffBranch relation In Example 13.5 we describe the identification of five functional dependencies for the StaffBranch relation shown in Figure 13.3. The determinants for these functional dependencies are staffNo, branchNo, bAddress, (branchNo, position), and (bAddress, position). To identify the candidate key(s) for the StaffBranch relation, we must identify the attribute (or group of attributes) that uniquely identifies each tuple in this relation. If a relation has more than one candidate key, we identify the candidate key that is to act as the primary key for the relation (see Section 3.2.5). All attributes that are not part of the primary key (non-primary-key attributes) should be functionally dependent on the key. The only candidate key of the StaffBranch relation, and therefore the primary key, is staffNo, as all other attributes of the relation are functionally dependent on staffNo. Although branchNo, bAddress, (branchNo, position), and (bAddress, position) are determinants in this relation, they are not candidate keys for the relation. Example 13.8 Identifying the primary key for the Sample relation In Example 13.6 we identified four functional dependencies for the Sample relation. We examine the determinant for each functional dependency to identify the candidate key(s) for the relation. A suitable determinant must functionally determine the other attributes in the relation. The determinants in the Sample relation are A, B, C, and (A, B). However, the only determinant that functionally determines all the other attributes of the relation is (A, B). In particular, A functionally determines C, B functionally determines D, and (A, B) functionally determines E. In other words, the attributes that make up the determinant (A, B) can determine all the other attributes in the relation either separately as A or B or together as (A, B). Hence, we see that an essential characteristic for a candidate key of a relation is that the attributes of a determinant either individually or working together must be able to functionally determine all the other attributes in the relation. This is not a characteristic of the other determinants in the Sample relation (namely A, B, or C) as in each case they can determine only one other attribute in the relation. As there are no other candidate keys for the Sample relation (A, B) is identified as the primary key for this relation. So far in this section we have discussed the types of functional dependency that are most useful in identifying important constraints on a relation and how these dependencies can be used to identify a primary key (or candidate keys) for a given relation. The concepts of functional dependencies and keys are central to the process of normalization. We continue the discussion on functional dependencies in the next chapter for readers interested in a more formal coverage of this topic. However, in this chapter, we continue by describing the process of normalization. 13.5 The Process of Normalization The Process of Normalization | 401 13.5 Normalization is a formal technique for analyzing relations based on their primary key (or candidate keys) and functional dependencies (Codd, 1972b). 
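Before moving into the normalization steps themselves, the candidate-key search of Examples 13.7 and 13.8 can also be mechanized. The sketch below is illustrative only; it repeatedly applies the given dependencies to see which attributes a determinant reaches (the same fixpoint computation is formalized as attribute closure in Chapter 14) and reports the minimal determinants that reach every attribute.

```python
from itertools import combinations

def determines(attrs, fds):
    """All attributes reachable from attrs by repeatedly applying the given dependencies
    (formalized as the attribute closure A+ in Section 14.1)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(all_attrs, fds):
    """Smallest attribute sets that functionally determine every attribute of the relation."""
    keys = []
    for r in range(1, len(all_attrs) + 1):
        for combo in combinations(sorted(all_attrs), r):
            if any(set(k) <= set(combo) for k in keys):
                continue                      # skip proper supersets of a key already found
            if determines(combo, fds) == set(all_attrs):
                keys.append(combo)
    return keys

# StaffBranch dependencies from Example 13.5, encoded as (determinant, dependents) pairs
staff_attrs = {"staffNo", "sName", "position", "salary", "branchNo", "bAddress"}
staff_fds = [
    (("staffNo",), ("sName", "position", "salary", "branchNo", "bAddress")),
    (("branchNo",), ("bAddress",)),
    (("bAddress",), ("branchNo",)),
    (("branchNo", "position"), ("salary",)),
    (("bAddress", "position"), ("salary",)),
]

print(candidate_keys(staff_attrs, staff_fds))   # [('staffNo',)]  -- as in Example 13.7
```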
The technique involves a series of rules that can be used to test individual relations so that a database can be normalized to any degree. When a requirement is not met, the relation violating the requirement must be decomposed into relations that individually meet the requirements of normalization. Three normal forms were initially proposed called First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). Subsequently, R. Boyce and E.F. Codd introduced a stronger definition of third normal form called Boyce–Codd Normal Form (BCNF) (Codd, 1974). With the exception of 1NF, all these normal forms are based on functional dependencies among the attributes of a relation (Maier, 1983). Higher normal forms that go beyond BCNF were introduced later such as Fourth Normal Form (4NF) and Fifth Normal Form (5NF) (Fagin, 1977, 1979). However, these later normal forms deal with situations that are very rare. In this chapter we describe only the first three normal forms and leave discussions on BCNF, 4NF, and 5NF to the next chapter. Normalization is often executed as a series of steps. Each step corresponds to a specific normal form that has known properties. As normalization proceeds, the relations become progressively more restricted (stronger) in format and also less vulnerable to update anomalies. For the relational data model, it is important to recognize that it is only First Normal Form (1NF) that is critical in creating relations; all subsequent normal forms are optional. However, to avoid the update anomalies discussed in Section 13.3, it is generally recommended that we proceed to at least Third Normal Form (3NF). Figure 13.7 illustrates the relationship between the various normal forms. It shows that some 1NF relations are also in 2NF and that some 2NF relations are also in 3NF, and so on. In the following sections we describe the process of normalization in detail. Figure 13.8 provides an overview of the process and highlights the main actions taken in each step of the process. The number of the section that covers each step of the process is also shown in this figure. In this chapter, we describe normalization as a bottom-up technique extracting information about attributes from sample forms that are first transformed into table format, Figure 13.7 Diagrammatic illustration of the relationship between the normal forms. 402 | Chapter 13 z Normalization Figure 13.8 Diagrammatic illustration of the process of normalization. which is described as being in Unnormalized Form (UNF). This table is then subjected progressively to the different requirements associated with each normal form until ultimately the attributes shown in the original sample forms are represented as a set of 3NF relations. Although the example used in this chapter proceeds from a given normal form to the one above, this is not necessarily the case with other examples. As shown in Figure 13.8, the resolution of a particular problem with, say, a 1NF relation may result in the relation being transformed to 2NF relations or in some cases directly into 3NF relations in one step. To simplify the description of normalization we assume that a set of functional dependencies is given for each relation in the worked examples and that each relation has a designated primary key. In other words, it is essential that the meaning of the attributes and their relationships is well understood before beginning the process of normalization. 
This information is fundamental to normalization and is used to test whether a relation is in a particular normal form. In Section 13.6 we begin by describing First Normal Form (1NF). In Sections 13.7 and 13.8 we describe Second Normal Form (2NF) and Third Normal 13.6 First Normal Form (1NF) | Forms (3NF) based on the primary key of a relation and then present a more general definition of each in Section 13.9. The more general definitions of 2NF and 3NF take into account all candidate keys of a relation rather than just the primary key. First Normal Form (1NF) Before discussing First Normal Form, we provide a definition of the state prior to First Normal Form. Unnormalized Form (UNF) First Normal Form (1NF) A table that contains one or more repeating groups. A relation in which the intersection of each row and column contains one and only one value. In this chapter, we begin the process of normalization by first transferring the data from the source (for example, a standard data entry form) into table format with rows and columns. In this format, the table is in Unnormalized Form and is referred to as an unnormalized table. To transform the unnormalized table to First Normal Form we identify and remove repeating groups within the table. A repeating group is an attribute, or group of attributes, within a table that occurs with multiple values for a single occurrence of the nominated key attribute(s) for that table. Note that in this context, the term ‘key’ refers to the attribute(s) that uniquely identify each row within the unnormalized table. There are two common approaches to removing repeating groups from unnormalized tables: (1) By entering appropriate data in the empty columns of rows containing the repeating data. In other words, we fill in the blanks by duplicating the nonrepeating data, where required. This approach is commonly referred to as ‘flattening’ the table. (2) By placing the repeating data, along with a copy of the original key attribute(s), in a separate relation. Sometimes the unnormalized table may contain more than one repeating group, or repeating groups within repeating groups. In such cases, this approach is applied repeatedly until no repeating groups remain. A set of relations is in 1NF if it contains no repeating groups. For both approaches, the resulting tables are now referred to as 1NF relations containing atomic (or single) values at the intersection of each row and column. Although both approaches are correct, approach 1 introduces more redundancy into the original UNF table as part of the ‘flattening’ process, whereas approach 2 creates two or more relations with less redundancy than in the original UNF table. In other words, approach 2 moves the original UNF table further along the normalization process than approach 1. However, no matter which initial approach is taken, the original UNF table will be normalized into the same set of 3NF relations. We demonstrate both approaches in the following worked example using the DreamHome case study. 13.6 403 404 | Chapter 13 z Normalization Example 13.9 First Normal Form (1NF) A collection of (simplified) DreamHome leases is shown in Figure 13.9. The lease on top is for a client called John Kay who is leasing a property in Glasgow, which is owned by Tina Murphy. For this worked example, we assume that a client rents a given property only once and cannot rent more than one property at any one time. 
Sample data is taken from two leases for two different clients called John Kay and Aline Stewart and is transformed into table format with rows and columns, as shown in Figure 13.10. This is an example of an unnormalized table. Figure 13.9 Collection of (simplified) DreamHome leases. Figure 13.10 ClientRental unnormalized table. 13.6 First Normal Form (1NF) | 405 We identify the key attribute for the ClientRental unnormalized table as clientNo. Next, we identify the repeating group in the unnormalized table as the property rented details, which repeats for each client. The structure of the repeating group is: Repeating Group = (propertyNo, pAddress, rentStart, rentFinish, rent, ownerNo, oName) As a consequence, there are multiple values at the intersection of certain rows and columns. For example, there are two values for propertyNo (PG4 and PG16) for the client named John Kay. To transform an unnormalized table into 1NF, we ensure that there is a single value at the intersection of each row and column. This is achieved by removing the repeating group. With the first approach, we remove the repeating group (property rented details) by entering the appropriate client data into each row. The resulting first normal form ClientRental relation is shown in Figure 13.11. In Figure 13.12, we present the functional dependencies (fd1 to fd6) for the ClientRental relation. We use the functional dependencies (as discussed in Section 13.4.3) to identify candidate keys for the ClientRental relation as being composite keys comprising (clientNo, Figure 13.11 First Normal Form ClientRental relation. Figure 13.12 Functional dependencies of the ClientRental relation. 406 | Chapter 13 z Normalization Figure 13.13 Alternative 1NF Client and PropertyRentalOwner relations. propertyNo), (clientNo, rentStart), and (propertyNo, rentStart). We select (clientNo, propertyNo) as the primary key for the relation, and for clarity we place the attributes that make up the primary key together at the left-hand side of the relation. In this example, we assume that the rentFinish attribute is not appropriate as a component of a candidate key as it may contain nulls (see Section 3.3.1). The ClientRental relation is defined as follows: ClientRental (clientNo, propertyNo, cName, pAddress, rentStart, rentFinish, rent, ownerNo, oName) The ClientRental relation is in 1NF as there is a single value at the intersection of each row and column. The relation contains data describing clients, property rented, and property owners, which is repeated several times. As a result, the ClientRental relation contains significant data redundancy. If implemented, the 1NF relation would be subject to the update anomalies described in Section 13.3. To remove some of these, we must transform the relation into Second Normal Form, which we discuss shortly. With the second approach, we remove the repeating group (property rented details) by placing the repeating data along with a copy of the original key attribute (clientNo) in a separate relation, as shown in Figure 13.13. With the help of the functional dependencies identified in Figure 13.12 we identify a primary key for the relations. The format of the resulting 1NF relations are as follows: Client PropertyRentalOwner (clientNo, cName) (clientNo, propertyNo, pAddress, rentStart, rentFinish, rent, ownerNo, oName) The Client and PropertyRentalOwner relations are both in 1NF as there is a single value at the intersection of each row and column. 
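Both approaches to removing the repeating group are straightforward to express in code. The sketch below is illustrative only: the client numbers, owner numbers, addresses, dates, and rents are invented stand-ins for Figures 13.9 and 13.10, which are not reproduced here.

```python
# Hypothetical nested (UNF) records; only the names and property numbers PG4/PG16 come from the text
unf = [
    {"clientNo": "CR76", "cName": "John Kay",
     "rentals": [
         {"propertyNo": "PG4",  "pAddress": "6 Lawrence St, Glasgow", "rentStart": "2003-07-01",
          "rentFinish": "2004-08-31", "rent": 350, "ownerNo": "CO40", "oName": "Tina Murphy"},
         {"propertyNo": "PG16", "pAddress": "5 Novar Dr, Glasgow",    "rentStart": "2004-09-01",
          "rentFinish": "2005-09-01", "rent": 450, "ownerNo": "CO93", "oName": "Tony Shaw"},
     ]},
    {"clientNo": "CR56", "cName": "Aline Stewart",
     "rentals": [
         {"propertyNo": "PG36", "pAddress": "2 Manor Rd, Glasgow",    "rentStart": "2003-10-10",
          "rentFinish": "2004-12-01", "rent": 370, "ownerNo": "CO93", "oName": "Tony Shaw"},
     ]},
]

# Approach 1: 'flatten' the table by duplicating the non-repeating client data into every row
client_rental_1nf = [
    {"clientNo": c["clientNo"], "cName": c["cName"], **r}
    for c in unf for r in c["rentals"]
]

# Approach 2: place the repeating group in a separate relation with a copy of the key attribute clientNo
client = [{"clientNo": c["clientNo"], "cName": c["cName"]} for c in unf]
property_rental_owner = [
    {"clientNo": c["clientNo"], **r} for c in unf for r in c["rentals"]
]

print(len(client_rental_1nf))   # 3 flattened 1NF rows
print(client)                   # [{'clientNo': 'CR76', 'cName': 'John Kay'}, {'clientNo': 'CR56', 'cName': 'Aline Stewart'}]
```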
The Client relation contains data describing clients and the PropertyRentalOwner relation contains data describing property rented by clients and property owners. However, as we see from Figure 13.13, this relation also contains some redundancy and as a result may suffer from similar update anomalies to those described in Section 13.3.

To demonstrate the process of normalizing relations from 1NF to 2NF, we use only the ClientRental relation shown in Figure 13.11. However, recall that both approaches are correct, and will ultimately result in the production of the same relations as we continue the process of normalization. We leave the process of completing the normalization of the Client and PropertyRentalOwner relations as an exercise for the reader, which is given at the end of this chapter.

13.7 Second Normal Form (2NF)

Second Normal Form (2NF) is based on the concept of full functional dependency, which we described in Section 13.4. Second Normal Form applies to relations with composite keys, that is, relations with a primary key composed of two or more attributes. A relation with a single-attribute primary key is automatically in at least 2NF. A relation that is not in 2NF may suffer from the update anomalies discussed in Section 13.3. For example, suppose we wish to change the rent of property number PG4. We have to update two tuples in the ClientRental relation in Figure 13.11. If only one tuple is updated with the new rent, this results in an inconsistency in the database.

Second Normal Form (2NF): A relation that is in First Normal Form and every non-primary-key attribute is fully functionally dependent on the primary key.

The normalization of 1NF relations to 2NF involves the removal of partial dependencies. If a partial dependency exists, we remove the partially dependent attribute(s) from the relation by placing them in a new relation along with a copy of their determinant. We demonstrate the process of converting 1NF relations to 2NF relations in the following example.

Example 13.10 Second Normal Form (2NF)

As shown in Figure 13.12, the ClientRental relation has the following functional dependencies:

fd1: clientNo, propertyNo → rentStart, rentFinish (Primary key)
fd2: clientNo → cName (Partial dependency)
fd3: propertyNo → pAddress, rent, ownerNo, oName (Partial dependency)
fd4: ownerNo → oName (Transitive dependency)
fd5: clientNo, rentStart → propertyNo, pAddress, rentFinish, rent, ownerNo, oName (Candidate key)
fd6: propertyNo, rentStart → clientNo, cName, rentFinish (Candidate key)

Figure 13.14 Second Normal Form relations derived from the ClientRental relation.

Using these functional dependencies, we continue the process of normalizing the ClientRental relation. We begin by testing whether the ClientRental relation is in 2NF by identifying the presence of any partial dependencies on the primary key. We note that the client attribute (cName) is partially dependent on the primary key, in other words, on only the clientNo attribute (represented as fd2). The property attributes (pAddress, rent, ownerNo, oName) are partially dependent on the primary key, that is, on only the propertyNo attribute (represented as fd3). The property rented attributes (rentStart and rentFinish) are fully dependent on the whole primary key, that is the clientNo and propertyNo attributes (represented as fd1). The identification of partial dependencies within the ClientRental relation indicates that the relation is not in 2NF.
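The 2NF test just performed — looking for dependencies whose determinant is a proper subset of the primary key — can be expressed directly. The sketch below is illustrative only; the encoding of fd1 to fd6 follows the list above.

```python
def partial_dependencies(primary_key, fds):
    """FDs whose determinant is a proper, non-empty subset of the primary key and whose
    right-hand side lies outside the key: these are the 2NF violations."""
    pk = set(primary_key)
    return [
        (lhs, rhs) for lhs, rhs in fds
        if set(lhs) < pk and not set(rhs) <= pk
    ]

# ClientRental functional dependencies fd1-fd6 (Example 13.10), primary key (clientNo, propertyNo)
fds = [
    (("clientNo", "propertyNo"), ("rentStart", "rentFinish")),                                          # fd1
    (("clientNo",), ("cName",)),                                                                        # fd2
    (("propertyNo",), ("pAddress", "rent", "ownerNo", "oName")),                                        # fd3
    (("ownerNo",), ("oName",)),                                                                         # fd4
    (("clientNo", "rentStart"), ("propertyNo", "pAddress", "rentFinish", "rent", "ownerNo", "oName")),  # fd5
    (("propertyNo", "rentStart"), ("clientNo", "cName", "rentFinish")),                                 # fd6
]

for lhs, rhs in partial_dependencies(("clientNo", "propertyNo"), fds):
    print(lhs, "->", rhs)   # fd2 and fd3: the partial dependencies that must be removed to reach 2NF
```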
To transform the ClientRental relation into 2NF requires the creation of new relations so that the non-primary-key attributes are removed along with a copy of the part of the primary key on which they are fully functionally dependent. This results in the creation of three new relations called Client, Rental, and PropertyOwner, as shown in Figure 13.14. These three relations are in Second Normal Form as every nonprimary-key attribute is fully functionally dependent on the primary key of the relation. The relations have the following form: Client Rental PropertyOwner 13.8 (clientNo, cName) (clientNo, propertyNo, rentStart, rentFinish) (propertyNo, pAddress, rent, ownerNo, oName) Third Normal Form (3NF) Although 2NF relations have less redundancy than those in 1NF, they may still suffer from update anomalies. For example, if we want to update the name of an owner, such as Tony Shaw (ownerNo CO93), we have to update two tuples in the PropertyOwner relation of Figure 13.14. If we update only one tuple and not the other, the database would be in an inconsistent state. This update anomaly is caused by a transitive dependency, which we described in Section 13.4. We need to remove such dependencies by progressing to Third Normal Form. 13.8 Third Normal Form (3NF) Third Normal Form (3NF) A relation that is in First and Second Normal Form and in which no non-primary-key attribute is transitively dependent on the primary key. The normalization of 2NF relations to 3NF involves the removal of transitive dependencies. If a transitive dependency exists, we remove the transitively dependent attribute(s) from the relation by placing the attribute(s) in a new relation along with a copy of the determinant. We demonstrate the process of converting 2NF relations to 3NF relations in the following example. Example 13.11 Third Normal Form (3NF) The functional dependencies for the Example 13.10, are as follows: Client, Rental, and PropertyOwner relations, derived in Client fd2 clientNo → cName (Primary key) Rental fd1 fd5′ fd6′ clientNo, propertyNo → rentStart, rentFinish clientNo, rentStart → propertyNo, rentFinish propertyNo, rentStart → clientNo, rentFinish PropertyOwner fd3 propertyNo → pAddress, rent, ownerNo, oName fd4 ownerNo → oName (Primary key) (Candidate key) (Candidate key) (Primary key) (Transitive dependency) All the non-primary-key attributes within the Client and Rental relations are functionally dependent on only their primary keys. The Client and Rental relations have no transitive dependencies and are therefore already in 3NF. Note that where a functional dependency (fd) is labeled with a prime (such as fd5′), this indicates that the dependency has altered compared with the original functional dependency shown in Figure 13.12. All the non-primary-key attributes within the PropertyOwner relation are functionally dependent on the primary key, with the exception of oName, which is transitively dependent on ownerNo (represented as fd4). This transitive dependency was previously identified in Figure 13.12. To transform the PropertyOwner relation into 3NF we must first remove this transitive dependency by creating two new relations called PropertyForRent and Owner, as shown in Figure 13.15. The new relations have the form: PropertyForRent Owner (propertyNo, pAddress, rent, ownerNo) (ownerNo, oName) The PropertyForRent and Owner relations are in 3NF as there are no further transitive dependencies on the primary key. 
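The removal of a transitive dependency is simply a pair of projections. The sketch below is illustrative only: the PropertyOwner tuples are invented stand-ins for Figure 13.14, with owner CO93 (Tony Shaw) deliberately appearing twice, as in the update-anomaly discussion above.

```python
def project(rows, attrs):
    """Relational-algebra projection: keep the named attributes and drop duplicate tuples."""
    seen, result = set(), []
    for row in rows:
        t = tuple(row[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            result.append(dict(zip(attrs, t)))
    return result

# Hypothetical 2NF PropertyOwner tuples (Figure 13.14 is not reproduced here)
property_owner = [
    {"propertyNo": "PG4",  "pAddress": "6 Lawrence St, Glasgow", "rent": 350, "ownerNo": "CO40", "oName": "Tina Murphy"},
    {"propertyNo": "PG16", "pAddress": "5 Novar Dr, Glasgow",    "rent": 450, "ownerNo": "CO93", "oName": "Tony Shaw"},
    {"propertyNo": "PG36", "pAddress": "2 Manor Rd, Glasgow",    "rent": 370, "ownerNo": "CO93", "oName": "Tony Shaw"},
]

# Remove the transitive dependency ownerNo -> oName by projection, as in Example 13.11
property_for_rent = project(property_owner, ["propertyNo", "pAddress", "rent", "ownerNo"])
owner = project(property_owner, ["ownerNo", "oName"])

print(len(property_for_rent))   # 3 properties, each carrying ownerNo as a foreign key
print(owner)                    # [{'ownerNo': 'CO40', 'oName': 'Tina Murphy'}, {'ownerNo': 'CO93', 'oName': 'Tony Shaw'}]
```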
| 409 410 | Chapter 13 z Normalization Figure 13.15 Third Normal Form relations derived from the PropertyOwner relation. Figure 13.16 The decomposition of the ClientRental 1NF relation into 3NF relations. The ClientRental relation shown in Figure 13.11 has been transformed by the process of normalization into four relations in 3NF. Figure 13.16 illustrates the process by which the original 1NF relation is decomposed into the 3NF relations. The resulting 3NF relations have the form: Client Rental PropertyForRent Owner (clientNo, cName) (clientNo, propertyNo, rentStart, rentFinish) (propertyNo, pAddress, rent, ownerNo) (ownerNo, oName) The original ClientRental relation shown in Figure 13.11 can be recreated by joining the Client, Rental, PropertyForRent, and Owner relations through the primary key/foreign key mechanism. For example, the ownerNo attribute is a primary key within the Owner relation and is also present within the PropertyForRent relation as a foreign key. The ownerNo attribute acting as a primary key/foreign key allows the association of the PropertyForRent and Owner relations to identify the name of property owners. The clientNo attribute is a primary key of the Client relation and is also present within the Rental relation as a foreign key. Note in this case that the clientNo attribute in the Rental relation acts both as a foreign key and as part of the primary key of this relation. Similarly, the propertyNo attribute is the primary key of the PropertyForRent relation and is also present within the Rental relation acting both as a foreign key and as part of the primary key for this relation. In other words, the normalization process has decomposed the original ClientRental relation using a series of relational algebra projections (see Section 4.1). This results in a lossless-join (also called nonloss- or nonadditive-join) decomposition, which is reversible using the natural join operation. The Client, Rental, PropertyForRent, and Owner relations are shown in Figure 13.17. 13.9 General Definitions of 2NF and 3NF | 411 Figure 13.17 A summary of the 3NF relations derived from the ClientRental relation. General Definitions of 2NF and 3NF The definitions for 2NF and 3NF given in Sections 13.7 and 13.8 disallow partial or transitive dependencies on the primary key of relations to avoid the update anomalies described in Section 13.3. However, these definitions do not take into account other candidate keys of a relation, if any exist. In this section, we present more general definitions for 2NF and 3NF that take into account candidate keys of a relation. Note that this requirement does not alter the definition for 1NF as this normal form is independent of keys and functional dependencies. For the general definitions, we define that a candidate-key attribute is part of any candidate key and that partial, full, and transitive dependencies are with respect to all candidate keys of a relation. Second Normal Form (2NF) Third Normal Form (3NF) A relation that is in First Normal Form and every non-candidatekey attribute is fully functionally dependent on any candidate key. A relation that is in First and Second Normal Form and in which no non-candidate-key attribute is transitively dependent on any candidate key. When using the general definitions of 2NF and 3NF we must be aware of partial and transitive dependencies on all candidate keys and not just the primary key. 
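A quick way to see the effect of the general definitions is to test each dependency against all candidate keys rather than only the primary key. The sketch below is illustrative only and uses a common equivalent phrasing of the general 3NF test (for each dependency A → B, either A contains a candidate key or every attribute of B belongs to some candidate key), which is not the chapter's exact wording; it is applied to the ClientRental dependencies and candidate keys listed earlier.

```python
def violates_general_3nf(fds, candidate_keys):
    """FDs A -> B that break the general 3NF definition: A does not contain a candidate key
    and some attribute of B is not part of any candidate key."""
    key_attrs = {a for key in candidate_keys for a in key}
    violations = []
    for lhs, rhs in fds:
        contains_key = any(set(key) <= set(lhs) for key in candidate_keys)
        if not contains_key and not set(rhs) <= key_attrs:
            violations.append((lhs, rhs))
    return violations

# ClientRental (Figure 13.12): candidate keys and dependencies fd1-fd6
keys = [("clientNo", "propertyNo"), ("clientNo", "rentStart"), ("propertyNo", "rentStart")]
fds = [
    (("clientNo", "propertyNo"), ("rentStart", "rentFinish")),
    (("clientNo",), ("cName",)),
    (("propertyNo",), ("pAddress", "rent", "ownerNo", "oName")),
    (("ownerNo",), ("oName",)),
    (("clientNo", "rentStart"), ("propertyNo", "pAddress", "rentFinish", "rent", "ownerNo", "oName")),
    (("propertyNo", "rentStart"), ("clientNo", "cName", "rentFinish")),
]

for lhs, rhs in violates_general_3nf(fds, keys):
    print(lhs, "->", rhs)   # fd2, fd3 and fd4: the 1NF ClientRental relation fails the general test
```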
This can make the process of normalization more complex; however, the general definitions place additional constraints on the relations and may identify hidden redundancy in relations that could be missed. The tradeoff is whether it is better to keep the process of normalization simpler by examining dependencies on primary keys only, which allows the identification of the most problematic and obvious redundancy in relations, or to use the general definitions and increase the opportunity to identify missed redundancy. In fact, it is often the case that 13.9 412 | Chapter 13 z Normalization whether we use the definitions based on primary keys or the general definitions of 2NF and 3NF, the decomposition of relations is the same. For example, if we apply the general definitions of 2NF and 3NF to Examples 13.10 and 13.11 described in Sections 13.7 and 13.8, the same decomposition of the larger relations into smaller relations results. The reader may wish to verify this fact. In the following chapter we re-examine the process of identifying functional dependencies that are useful for normalization and take the process of normalization further by discussing normal forms that go beyond 3NF such as Boyce–Codd Normal Form (BCNF). Also in this chapter we present a second worked example taken from the DreamHome case study that reviews the process of normalization from UNF through to BCNF. Chapter Summary n n n n n n n n n n n Normalization is a technique for producing a set of relations with desirable properties, given the data requirements of an enterprise. Normalization is a formal method that can be used to identify relations based on their keys and the functional dependencies among their attributes. Relations with data redundancy suffer from update anomalies, which can be classified as insertion, deletion, and modification anomalies. One of the main concepts associated with normalization is functional dependency, which describes the relationship between attributes in a relation. For example, if A and B are attributes of relation R, B is functionally dependent on A (denoted A → B), if each value of A is associated with exactly one value of B. (A and B may each consist of one or more attributes.) The determinant of a functional dependency refers to the attribute, or group of attributes, on the left-hand side of the arrow. The main characteristics of functional dependencies that we use for normalization have a one-to-one relationship between attribute(s) on the left- and right-hand sides of the dependency, hold for all time, and are fully functionally dependent. Unnormalized Form (UNF) is a table that contains one or more repeating groups. First Normal Form (1NF) is a relation in which the intersection of each row and column contains one and only one value. Second Normal Form (2NF) is a relation that is in First Normal Form and every non-primary-key attribute is fully functionally dependent on the primary key. Full functional dependency indicates that if A and B are attributes of a relation, B is fully functionally dependent on A if B is functionally dependent on A but not on any proper subset of A. Third Normal Form (3NF) is a relation that is in First and Second Normal Form in which no non-primarykey attribute is transitively dependent on the primary key. Transitive dependency is a condition where A, B, and C are attributes of a relation such that if A → B and B → C, then C is transitively dependent on A via B (provided that A is not functionally dependent on B or C). 
General definition for Second Normal Form (2NF) is a relation that is in First Normal Form and every non-candidate-key attribute is fully functionally dependent on any candidate key. In this definition, a candidate-key attribute is part of any candidate key. General definition for Third Normal Form (3NF) is a relation that is in First and Second Normal Form in which no non-candidate-key attribute is transitively dependent on any candidate key. In this definition, a candidate-key attribute is part of any candidate key. Exercises | 413 Review Questions 13.1 13.2 13.3 13.4 13.5 13.6 13.7 Describe the purpose of normalizing data. Discuss the alternative ways that normalization can be used to support database design. Describe the types of update anomaly that may occur on a relation that has redundant data. Describe the concept of functional dependency. What are the main characteristics of functional dependencies that are used for normalization? Describe how a database designer typically identifies the set of functional dependencies associated with a relation. Describe the characteristics of a table in Unnormalized Form (UNF) and describe how such a table is converted to a First Normal Form (1NF) relation. 13.8 What is the minimal normal form that a relation must satisfy? Provide a definition for this normal form. 13.9 Describe the two approaches to converting an Unnormalized Form (UNF) table to First Normal Form (1NF) relation(s). 13.10 Describe the concept of full functional dependency and describe how this concept relates to 2NF. Provide an example to illustrate your answer. 13.11 Describe the concept of transitive dependency and describe how this concept relates to 3NF. Provide an example to illustrate your answer. 13.12 Discuss how the definitions of 2NF and 3NF based on primary keys differ from the general definitions of 2NF and 3NF. Provide an example to illustrate your answer. Exercises 13.13 Continue the process of normalizing the Client and PropertyRentalOwner 1NF relations shown in Figure 13.13 to 3NF relations. At the end of this process check that the resultant 3NF relations are the same as those produced from the alternative ClientRental 1NF relation shown in Figure 13.16. 13.14 Examine the Patient Medication Form for the Wellmeadows Hospital case study shown in Figure 13.18. (a) Identify the functional dependencies represented by the attributes shown in the form in Figure 13.18. State any assumptions you make about the data and the attributes shown in this form. (b) Describe and illustrate the process of normalizing the attributes shown in Figure 13.18 to produce a set of well-designed 3NF relations. (c) Identify the primary, alternate, and foreign keys in your 3NF relations. 13.15 The table shown in Figure 13.19 lists sample dentist/patient appointment data. A patient is given an appointment at a specific time and date with a dentist located at a particular surgery. On each day of patient appointments, a dentist is allocated to a specific surgery for that day. (a) The table shown in Figure 13.19 is susceptible to update anomalies. Provide examples of insertion, deletion, and update anomalies. (b) Identify the functional dependencies represented by the attributes shown in the table of Figure 13.19. State any assumptions you make about the data and the attributes shown in this table. (c) Describe and illustrate the process of normalizing the table shown in Figure 13.19 to 3NF relations. Identify the primary, alternate, and foreign keys in your 3NF relations. 
13.16 An agency called Instant Cover supplies part-time/temporary staff to hotels within Scotland. The table shown in Figure 13.20 displays sample data, which lists the time spent by agency staff working at various hotels. The National Insurance Number (NIN) is unique for every member of staff. 414 | Chapter 13 z Normalization Figure 13.18 The Wellmeadows Hospital Patient Medication Form. Figure 13.19 Table displaying sample dentist/patient appointment data. Figure 13.20 Table displaying sample data for the Instant Cover agency. (a) The table shown in Figure 13.20 is susceptible to update anomalies. Provide examples of insertion, deletion, and update anomalies. (b) Identify the functional dependencies represented by the attributes shown in the table of Figure 13.20. State any assumptions you make about the data and the attributes shown in this table. (c) Describe and illustrate the process of normalizing the table shown in Figure 13.20 to 3NF. Identify primary, alternate and foreign keys in your relations. Chapter 14 Advanced Normalization Chapter Objectives In this chapter you will learn: n How inference rules can identify a set of all functional dependencies for a relation. n How inference rules called Armstrong’s axioms can identify a minimal set of useful functional dependencies from the set of all functional dependencies for a relation. n Normal forms that go beyond Third Normal Form (3NF), which includes Boyce–Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF). n How to identify Boyce–Codd Normal Form (BCNF). n How to represent attributes shown on a report as BCNF relations using normalization. n The concept of multi-valued dependencies and 4NF. n The problems associated with relations that break the rules of 4NF. n How to create 4NF relations from a relation which breaks the rules of 4NF. n The concept of join dependency and 5NF. n The problems associated with relations that break the rules of 5NF. n How to create 5NF relations from a relation which breaks the rules of 5NF. In the previous chapter we introduced the technique of normalization and the concept of functional dependencies between attributes. We described the benefits of using normalization to support database design and demonstrated how attributes shown on sample forms are transformed into First Normal Form (1NF), Second Normal Form (2NF), and then finally Third Normal Form (3NF) relations. In this chapter, we return to consider functional dependencies and describe normal forms that go beyond 3NF such as Boyce–Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF). Relations in 3NF are normally sufficiently well structured to prevent the problems associated with data redundancy, which was described in Section 13.3. However, later normal forms were created to identify relatively rare problems with relations that, if not corrected, may result in undesirable data redundancy. 416 | Chapter 14 z Advanced Normalization Structure of this Chapter With the exception of 1NF, all normal forms discussed in the previous chapter and in this chapter are based on functional dependencies among the attributes of a relation. In Section 14.1 we continue the discussion on the concept of functional dependency which was introduced in the previous chapter. We present a more formal and theoretical aspect of functional dependencies by discussing inference rules for functional dependencies. In the previous chapter we described the three most commonly used normal forms: 1NF, 2NF, and 3NF. 
However, R. Boyce and E.F. Codd identified a weakness with 3NF and introduced a stronger definition of 3NF called Boyce–Codd Normal Form (BCNF) (Codd, 1974), which we describe in Section 14.2. In Section 14.3 we present a worked example to demonstrate the process of normalizing attributes originally shown on a report into a set of BCNF relations. Higher normal forms that go beyond BCNF were introduced later, such as Fourth (4NF) and Fifth (5NF) Normal Forms (Fagin, 1977, 1979). However, these later normal forms deal with situations that are very rare. We describe 4NF and 5NF in Sections 14.4 and 14.5. To illustrate the process of normalization, examples are drawn from the DreamHome case study described in Section 10.4 and documented in Appendix A. 14.1 More on Functional Dependencies One of the main concepts associated with normalization is functional dependency, which describes the relationship between attributes (Maier, 1983). In the previous chapter we introduced this concept. In this section we describe this concept in a more formal and theoretical way by discussing inference rules for functional dependencies. 14.1.1 Inference Rules for Functional Dependencies In Section 13.4 we identified the characteristics of the functional dependencies that are most useful in normalization. However, even if we restrict our attention to functional dependencies with a one-to-one (1:1) relationship between attributes on the left- and righthand sides of the dependency that hold for all time and are fully functionally dependent, then the complete set of functional dependencies for a given relation can still be very large. It is important to find an approach that can reduce that set to a manageable size. Ideally, we want to identify a set of functional dependencies (represented as X) for a relation that is smaller than the complete set of functional dependencies (represented as Y) for that relation and has the property that every functional dependency in Y is implied by the functional dependencies in X. Hence, if we enforce the integrity constraints defined by the functional dependencies in X, we automatically enforce the integrity constraints defined in the larger set of functional dependencies in Y. This requirement suggests that there must 14.1 More on Functional Dependencies be functional dependencies that can be inferred from other functional dependencies. For example, functional dependencies A → B and B → C in a relation implies that the functional dependency A → C also holds in that relation. A → C is an example of a transitive functional dependency and was discussed previously in Sections 13.4 and 13.7. How do we begin to identify useful functional dependencies on a relation? Normally, the database designer starts by specifying functional dependencies that are semantically obvious; however, there are usually numerous other functional dependencies. In fact, the task of specifying all possible functional dependencies for ‘real’ database projects is more often than not, impractical. However, in this section we do consider an approach that helps identify the complete set of functional dependencies for a relation and then discuss how to achieve a minimal set of functional dependencies that can represent the complete set. The set of all functional dependencies that are implied by a given set of functional dependencies X is called the closure of X, written X + . We clearly need a set of rules to help compute X + from X. 
A set of inference rules, called Armstrong’s axioms, specifies how new functional dependencies can be inferred from given ones (Armstrong, 1974). For our discussion, let A, B, and C be subsets of the attributes of the relation R. Armstrong’s axioms are as follows: (1) Reflexivity: (2) Augmentation: (3) Transitivity: If B is a subset of A, then A → B If A → B, then A,C → B,C If A → B and B → C, then A → C Note that each of these three rules can be directly proved from the definition of functional dependency. The rules are complete in that given a set X of functional dependencies, all functional dependencies implied by X can be derived from X using these rules. The rules are also sound in that no additional functional dependencies can be derived that are not implied by X. In other words, the rules can be used to derive the closure of X +. Several further rules can be derived from the three given above that simplify the practical task of computing X +. In the following rules, let D be another subset of the attributes of relation R, then: (4) (5) (6) (7) Self-determination: Decomposition: Union: Composition: →A If A → B,C, then A → B and A → C If A → B and A → C, then A → B,C If A → B and C → D then A,C → B,D A Rule 1 Reflexivity and Rule 4 Self-determination state that a set of attributes always determines any of its subsets or itself. Because these rules generate functional dependencies that are always true, such dependencies are trivial and, as stated earlier, are generally not interesting or useful. Rule 2 Augmentation states that adding the same set of attributes to both the left- and right-hand sides of a dependency results in another valid dependency. Rule 3 Transitivity states that functional dependencies are transitive. Rule 5 Decomposition states that we can remove attributes from the right-hand side of a dependency. Applying this rule repeatedly, we can decompose A → B, C, D functional dependency into the set of dependencies A → B, A → C, and A → D. Rule 6 Union states that we can do the opposite: we can combine a set of dependencies A → B, A → C, and A → D into a single functional | 417 418 | Chapter 14 z Advanced Normalization dependency A → B, C, D. Rule 7 Composition is more general than Rule 6 and states that we can combine a set of non-overlapping dependencies to form another valid dependency. To begin to identify the set of functional dependencies F for a relation, typically we first identify the dependencies that are determined from the semantics of the attributes of the relation. Then we apply Armstrong’s axioms (Rules 1 to 3) to infer additional functional dependencies that are also true for that relation. A systematic way to determine these additional functional dependencies is to first determine each set of attributes A that appears on the left-hand side of some functional dependencies and then to determine the set of all attributes that are dependent on A. Thus, for each set of attributes A we can determine the set A+ of attributes that are functionally determined by A based on F; (A+ is called the closure of A under F). 14.1.2 Minimal Sets of Functional Dependencies In this section, we introduce what is referred to as equivalence of sets of functional dependencies. A set of functional dependencies Y is covered by a set of functional dependencies X, if every functional dependency in Y is also in X +; that is, every dependency in Y can be inferred from X. 
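The closure A+ is easy to compute as a fixpoint, and it immediately gives a test for whether one set of dependencies covers another. The sketch below is illustrative only (it is the same fixpoint computation used informally for candidate keys in Chapter 13, and the encoding of the StaffBranch dependencies follows Example 13.5).

```python
def closure(attrs, fds):
    """Attribute closure A+ under F: all attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implied(fds, lhs, rhs):
    """Is lhs -> rhs implied by fds?  Equivalent to rhs lying inside the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)

def covers(x, y):
    """Does the set of dependencies x cover y, i.e. is every dependency in y implied by x?"""
    return all(implied(x, lhs, rhs) for lhs, rhs in y)

# StaffBranch dependencies from Example 13.5, encoded as (determinant, dependents) pairs
F = [
    (("staffNo",), ("sName", "position", "salary", "branchNo", "bAddress")),
    (("branchNo",), ("bAddress",)),
    (("bAddress",), ("branchNo",)),
    (("branchNo", "position"), ("salary",)),
    (("bAddress", "position"), ("salary",)),
]

print(sorted(closure(("staffNo",), F)))                  # every attribute: staffNo is a candidate key
print(implied(F, ("staffNo", "position"), ("salary",)))  # True, by augmentation and transitivity
print(covers(F, [(("staffNo",), ("salary",))]))          # True
```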
A set of functional dependencies X is minimal if it satisfies the following conditions: n n n Every dependency in X has a single attribute on its right-hand side. We cannot replace any dependency A → B in X with dependency C → B, where C is a proper subset of A, and still have a set of dependencies that is equivalent to X. We cannot remove any dependency from X and still have a set of dependencies that is equivalent to X. A minimal set of dependencies should be in a standard form with no redundancies. A minimal cover of a set of functional dependencies X is a minimal set of dependencies X min that is equivalent to X. Unfortunately there can be several minimal covers for a set of functional dependencies. We demonstrate the identification of the minimal cover for the StaffBranch relation in the following example. Example 14.1 Identifying the minimal set of functional dependencies of the StaffBranch relation We apply the three conditions described above on the set of functional dependencies for the StaffBranch relation listed in Example 13.5 to produce the following functional dependencies: staffNo staffNo staffNo staffNo staffNo → sName → position → salary → branchNo → bAddress 14.2 Boyce–Codd Normal Form (BCNF) | branchNo → bAddress bAddress → branchNo branchNo, position → salary bAddress, position → salary These functional dependencies satisfy the three conditions for producing a minimal set of functional dependencies for the StaffBranch relation. Condition 1 ensures that every dependency is in a standard form with a single attribute on the right-hand side. Conditions 2 and 3 ensure that there are no redundancies in the dependencies either by having redundant attributes on the left-hand side of a dependency (Condition 2) or by having a dependency that can be inferred from the remaining functional dependencies in X (Condition 3). In the following section we return to consider normalization. We begin by discussing Boyce–Codd Normal Form (BCNF), a stronger normal form than 3NF. Boyce–Codd Normal Form (BCNF) 14.2 In the previous chapter we demonstrated how 2NF and 3NF disallow partial and transitive dependencies on the primary key of a relation, respectively. Relations that have these types of dependencies may suffer from the update anomalies discussed in Section 13.3. However, the definition of 2NF and 3NF discussed in Sections 13.7 and 13.8, respectively, do not consider whether such dependencies remain on other candidate keys of a relation, if any exist. In Section 13.9 we presented general definitions for 2NF and 3NF that disallow partial and transitive dependencies on any candidate key of a relation, respectively. Application of the general definitions of 2NF and 3NF may identify additional redundancy caused by dependencies that violate one or more candidate keys. However, despite these additional constraints, dependencies can still exist that will cause redundancy to be present in 3NF relations. This weakness in 3NF, resulted in the presentation of a stronger normal form called Boyce–Codd Normal Form (Codd, 1974). Definition of Boyce–Codd Normal Form Boyce–Codd Normal Form (BCNF) is based on functional dependencies that take into account all candidate keys in a relation; however, BCNF also has additional constraints compared with the general definition of 3NF given in Section 13.9. Boyce–Codd Normal Form (BCNF) A relation is in BCNF, if and only if, every determinant is a candidate key. 
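This test can be mechanized with the attribute-closure computation from Section 14.1: a determinant can only be a candidate key if its closure contains every attribute of the relation. The sketch below is illustrative only and applies the test to the StaffBranch dependencies of Examples 13.5 and 14.1.

```python
def closure(attrs, fds):
    """Attribute closure, as in Section 14.1."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(all_attrs, fds):
    """Dependencies whose determinant is not a candidate key, i.e. whose closure
    does not reach every attribute of the relation."""
    return [
        (lhs, rhs) for lhs, rhs in fds
        if closure(lhs, fds) != set(all_attrs)
    ]

# StaffBranch dependencies (Examples 13.5 and 14.1)
attrs = {"staffNo", "sName", "position", "salary", "branchNo", "bAddress"}
fds = [
    (("staffNo",), ("sName", "position", "salary", "branchNo", "bAddress")),
    (("branchNo",), ("bAddress",)),
    (("bAddress",), ("branchNo",)),
    (("branchNo", "position"), ("salary",)),
    (("bAddress", "position"), ("salary",)),
]

for lhs, rhs in bcnf_violations(attrs, fds):
    print(lhs, "->", rhs)   # every determinant except staffNo fails: StaffBranch is not in BCNF
```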
To test whether a relation is in BCNF, we identify all the determinants and make sure that they are candidate keys. Recall that a determinant is an attribute, or a group of attributes, on which some other attribute is fully functionally dependent. 14.2.1 419 420 | Chapter 14 z Advanced Normalization The difference between 3NF and BCNF is that for a functional dependency A → B, 3NF allows this dependency in a relation if B is a primary-key attribute and A is not a candidate key, whereas BCNF insists that for this dependency to remain in a relation, A must be a candidate key. Therefore, Boyce–Codd Normal Form is a stronger form of 3NF, such that every relation in BCNF is also in 3NF. However, a relation in 3NF is not necessarily in BCNF. Before considering the next example, we re-examine the Client, Rental, PropertyForRent, and Owner relations shown in Figure 13.17. The Client, PropertyForRent, and Owner relations are all in BCNF, as each relation only has a single determinant, which is the candidate key. However, recall that the Rental relation contains the three determinants (clientNo, propertyNo), (clientNo, rentStart), and (propertyNo, rentStart), originally identified in Example 13.11, as shown below: fd1 fd5′ fd6′ clientNo, propertyNo → rentStart, rentFinish clientNo, rentStart → propertyNo, rentFinish propertyNo, rentStart → clientNo, rentFinish As the three determinants of the Rental relation are also candidate keys, the Rental relation is also already in BCNF. Violation of BCNF is quite rare, since it may only happen under specific conditions. The potential to violate BCNF may occur when: n n the relation contains two (or more) composite candidate keys; or the candidate keys overlap, that is have at least one attribute in common. In the following example, we present a situation where a relation violates BCNF and demonstrate the transformation of this relation to BCNF. This example demonstrates the process of converting a 1NF relation to BCNF relations. Example 14.2 Boyce–Codd Normal Form (BCNF) In this example, we extend the DreamHome case study to include a description of client interviews by members of staff. The information relating to these interviews is in the ClientInterview relation shown in Figure 14.1. The members of staff involved in interviewing clients are allocated to a specific room on the day of interview. However, a room may be allocated to several members of staff as required throughout a working day. A client is only interviewed once on a given date, but may be requested to attend further interviews at later dates. The ClientInterview relation has three candidate keys: (clientNo, interviewDate), (staffNo, interviewDate, interviewTime), and (roomNo, interviewDate, interviewTime). Therefore the ClientInterview relation has three composite candidate keys, which overlap by sharing the Figure 14.1 ClientInterview relation. 14.2 Boyce–Codd Normal Form (BCNF) | 421 common attribute interviewDate. We select (clientNo, interviewDate) to act as the primary key for this relation. 
Violation of BCNF is quite rare, since it may only happen under specific conditions. The potential to violate BCNF may occur when:

• the relation contains two (or more) composite candidate keys; or
• the candidate keys overlap, that is, have at least one attribute in common.

In the following example, we present a situation where a relation violates BCNF and demonstrate the transformation of this relation to BCNF. This example demonstrates the process of converting a 1NF relation to BCNF relations.

Example 14.2 Boyce–Codd Normal Form (BCNF)

In this example, we extend the DreamHome case study to include a description of client interviews by members of staff. The information relating to these interviews is in the ClientInterview relation shown in Figure 14.1. The members of staff involved in interviewing clients are allocated to a specific room on the day of interview. However, a room may be allocated to several members of staff as required throughout a working day. A client is only interviewed once on a given date, but may be requested to attend further interviews at later dates.

Figure 14.1 ClientInterview relation.

The ClientInterview relation has three candidate keys: (clientNo, interviewDate), (staffNo, interviewDate, interviewTime), and (roomNo, interviewDate, interviewTime). Therefore the ClientInterview relation has three composite candidate keys, which overlap by sharing the common attribute interviewDate. We select (clientNo, interviewDate) to act as the primary key for this relation. The ClientInterview relation has the following form:

ClientInterview (clientNo, interviewDate, interviewTime, staffNo, roomNo)

The ClientInterview relation has the following functional dependencies:

fd1 clientNo, interviewDate → interviewTime, staffNo, roomNo    (Primary key)
fd2 staffNo, interviewDate, interviewTime → clientNo            (Candidate key)
fd3 roomNo, interviewDate, interviewTime → staffNo, clientNo    (Candidate key)
fd4 staffNo, interviewDate → roomNo

We examine the functional dependencies to determine the normal form of the ClientInterview relation. As the determinants of functional dependencies fd1, fd2, and fd3 are all candidate keys for this relation, none of these dependencies will cause problems for the relation. The only functional dependency that requires discussion is (staffNo, interviewDate) → roomNo (represented as fd4). Even though (staffNo, interviewDate) is not a candidate key for the ClientInterview relation, this functional dependency is allowed in 3NF because roomNo is a primary-key attribute, being part of the candidate key (roomNo, interviewDate, interviewTime). As there are no partial or transitive dependencies on the primary key (clientNo, interviewDate), and functional dependency fd4 is allowed, the ClientInterview relation is in 3NF.

However, this relation is not in BCNF (a stronger normal form than 3NF) due to the presence of the (staffNo, interviewDate) determinant, which is not a candidate key for the relation. BCNF requires that all determinants in a relation must be candidate keys for the relation. As a consequence the ClientInterview relation may suffer from update anomalies. For example, to change the room number for staff number SG5 on the 13-May-05 we must update two tuples. If only one tuple is updated with the new room number, this results in an inconsistent state for the database.

To transform the ClientInterview relation to BCNF, we must remove the violating functional dependency by creating two new relations called Interview and StaffRoom, as shown in Figure 14.2. The Interview and StaffRoom relations have the following form:

Interview (clientNo, interviewDate, interviewTime, staffNo)
StaffRoom (staffNo, interviewDate, roomNo)

Figure 14.2 The Interview and StaffRoom BCNF relations.
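The decomposition can be sketched in SQL as follows. Column types are assumed, and the foreign key shown is one plausible way of linking the two new relations rather than a requirement of the method; note also that fd3 can no longer be declared as a key of either table.

-- Sketch of the BCNF decomposition of ClientInterview (types assumed)
CREATE TABLE StaffRoom (
    staffNo        VARCHAR(5) NOT NULL,
    interviewDate  DATE       NOT NULL,
    roomNo         VARCHAR(5) NOT NULL,
    PRIMARY KEY (staffNo, interviewDate)               -- enforces fd4
);

CREATE TABLE Interview (
    clientNo       VARCHAR(5) NOT NULL,
    interviewDate  DATE       NOT NULL,
    interviewTime  TIME       NOT NULL,
    staffNo        VARCHAR(5) NOT NULL,
    PRIMARY KEY (clientNo, interviewDate),              -- enforces fd1 (without roomNo)
    UNIQUE (staffNo, interviewDate, interviewTime),     -- enforces fd2
    FOREIGN KEY (staffNo, interviewDate)
        REFERENCES StaffRoom (staffNo, interviewDate)
);
-- fd3 (roomNo, interviewDate, interviewTime → staffNo, clientNo) now spans both
-- tables; if it must be enforced, a trigger or assertion would be required.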
We can decompose any relation that is not in BCNF into BCNF as illustrated. However, it may not always be desirable to transform a relation into BCNF; for example, if there is a functional dependency that is not preserved when we perform the decomposition (that is, the determinant and the attributes it determines are placed in different relations). In this situation, it is difficult to enforce the functional dependency in the relation, and an important constraint is lost. When this occurs, it may be better to stop at 3NF, which always preserves dependencies. Note that in Example 14.2, in creating the two BCNF relations from the original ClientInterview relation, we have ‘lost’ the functional dependency roomNo, interviewDate, interviewTime → staffNo, clientNo (represented as fd3), as the determinant for this dependency is no longer in the same relation. However, we must recognize that if the functional dependency staffNo, interviewDate → roomNo (represented as fd4) is not removed, the ClientInterview relation will have data redundancy. The decision as to whether it is better to stop the normalization at 3NF or progress to BCNF is dependent on the amount of redundancy resulting from the presence of fd4 and the significance of the ‘loss’ of fd3.

For example, if it is the case that members of staff conduct only one interview per day, then the presence of fd4 in the ClientInterview relation will not cause redundancy and therefore the decomposition of this relation into two BCNF relations is not helpful or necessary. On the other hand, if members of staff conduct numerous interviews per day, then the presence of fd4 in the ClientInterview relation will cause redundancy and normalization of this relation to BCNF is recommended. However, we should also consider the significance of losing fd3; in other words, does fd3 convey important information about client interviews that must be represented in one of the resulting relations? The answer to this question will help to determine whether it is better to retain all functional dependencies or remove data redundancy.

14.3 Review of Normalization up to BCNF

The purpose of this section is to review the process of normalization described in the previous chapter and in Section 14.2. We demonstrate the process of transforming attributes displayed on a sample report from the DreamHome case study into a set of Boyce–Codd Normal Form relations. In this worked example we use the definitions of 2NF and 3NF that are based on the primary key of a relation. We leave the normalization of this worked example using the general definitions of 2NF and 3NF as an exercise for the reader.

Example 14.3 First Normal Form (1NF) to Boyce–Codd Normal Form (BCNF)

In this example we extend the DreamHome case study to include property inspection by members of staff. When staff are required to undertake these inspections, they are allocated a company car for use on the day of the inspections. However, a car may be allocated to several members of staff as required throughout the working day. A member of staff may inspect several properties on a given date, but a property is only inspected once on a given date. Examples of the DreamHome Property Inspection Report are presented in Figure 14.3. The report on top describes staff inspections of property PG4 in Glasgow.

Figure 14.3 DreamHome Property Inspection reports.

First Normal Form (1NF)

We first transfer sample data held on two property inspection reports into table format with rows and columns. This is referred to as the StaffPropertyInspection unnormalized table and is shown in Figure 14.4. We identify the key attribute for this unnormalized table as propertyNo. We identify the repeating group in the unnormalized table as the property inspection and staff details, which repeats for each property. The structure of the repeating group is:

Repeating Group = (iDate, iTime, comments, staffNo, sName, carReg)

Figure 14.4 StaffPropertyInspection unnormalized table.

As a consequence, there are multiple values at the intersection of certain rows and columns. For example, for propertyNo PG4 there are three values for iDate (18-Oct-03, 22-Apr-04, 1-Oct-04). We transform the unnormalized form to first normal form using the first approach described in Section 13.6. With this approach, we remove the repeating group (property inspection and staff details) by entering the appropriate property details (nonrepeating data) into each row. The resulting first normal form StaffPropertyInspection relation is shown in Figure 14.5.

Figure 14.5 The First Normal Form (1NF) StaffPropertyInspection relation.
In Figure 14.6 we present the functional dependencies (fd1 to fd6) for the StaffPropertyInspection relation. We use the functional dependencies (as discussed in Section 13.4.3) to identify candidate keys for the StaffPropertyInspection relation as being the composite keys (propertyNo, iDate), (staffNo, iDate, iTime), and (carReg, iDate, iTime). We select (propertyNo, iDate) as the primary key for this relation. For clarity, we place the attributes that make up the primary key together, at the left-hand side of the relation. The StaffPropertyInspection relation is defined as follows:

StaffPropertyInspection (propertyNo, iDate, iTime, pAddress, comments, staffNo, sName, carReg)

Figure 14.6 Functional dependencies of the StaffPropertyInspection relation.

The StaffPropertyInspection relation is in first normal form (1NF) as there is a single value at the intersection of each row and column. The relation contains data describing the inspection of property by members of staff, with the property and staff details repeated several times. As a result, the StaffPropertyInspection relation contains significant redundancy. If implemented, this 1NF relation would be subject to update anomalies. To remove some of these, we must transform the relation into second normal form.

Second Normal Form (2NF)

The normalization of 1NF relations to 2NF involves the removal of partial dependencies on the primary key. If a partial dependency exists, we remove the functionally dependent attributes from the relation by placing them in a new relation with a copy of their determinant. As shown in Figure 14.6, the functional dependencies (fd1 to fd6) of the StaffPropertyInspection relation are as follows:

fd1 propertyNo, iDate → iTime, comments, staffNo, sName, carReg              (Primary key)
fd2 propertyNo → pAddress                                                    (Partial dependency)
fd3 staffNo → sName                                                          (Transitive dependency)
fd4 staffNo, iDate → carReg
fd5 carReg, iDate, iTime → propertyNo, pAddress, comments, staffNo, sName    (Candidate key)
fd6 staffNo, iDate, iTime → propertyNo, pAddress, comments                   (Candidate key)

Using the functional dependencies, we continue the process of normalizing the relation. We begin by testing whether the relation is in 2NF by identifying the presence of any partial dependencies on the primary key. We note that the property attribute (pAddress) is partially dependent on part of the primary key, namely propertyNo (represented as fd2), whereas the remaining attributes (iTime, comments, staffNo, sName, and carReg) are fully dependent on the whole primary key (propertyNo and iDate) (represented as fd1). Note that although the determinant of the functional dependency staffNo, iDate → carReg (represented as fd4) only requires the iDate attribute of the primary key, we do not remove this dependency at this stage as the determinant also includes another non-primary-key attribute, namely staffNo. In other words, this dependency is not wholly dependent on part of the primary key and therefore does not violate 2NF.

The identification of the partial dependency (propertyNo → pAddress) indicates that the StaffPropertyInspection relation is not in 2NF. To transform the relation into 2NF requires the creation of new relations so that the attributes that are not fully dependent on the primary key are associated with only the appropriate part of the key.
The StaffPropertyInspection relation is transformed into second normal form by removing the partial dependency from the relation and creating two new relations called Property and PropertyInspection with the following form:

Property (propertyNo, pAddress)
PropertyInspection (propertyNo, iDate, iTime, comments, staffNo, sName, carReg)

These relations are in 2NF, as every non-primary-key attribute is functionally dependent on the primary key of the relation.

Third Normal Form (3NF)

The normalization of 2NF relations to 3NF involves the removal of transitive dependencies. If a transitive dependency exists, we remove the transitively dependent attributes from the relation by placing them in a new relation along with a copy of their determinant. The functional dependencies within the Property and PropertyInspection relations are as follows:

Property relation
fd2 propertyNo → pAddress

PropertyInspection relation
fd1  propertyNo, iDate → iTime, comments, staffNo, sName, carReg
fd3  staffNo → sName
fd4  staffNo, iDate → carReg
fd5′ carReg, iDate, iTime → propertyNo, comments, staffNo, sName
fd6′ staffNo, iDate, iTime → propertyNo, comments

As the Property relation does not have transitive dependencies on the primary key, it is already in 3NF. However, although all the non-primary-key attributes within the PropertyInspection relation are functionally dependent on the primary key, sName is also transitively dependent on staffNo (represented as fd3). We also note that the functional dependency staffNo, iDate → carReg (represented as fd4) has a non-primary-key attribute, carReg, partially dependent on a non-primary-key attribute, staffNo. We do not remove this dependency at this stage as part of the determinant for this dependency includes a primary-key attribute, namely iDate. In other words, this dependency is not wholly transitively dependent on non-primary-key attributes and therefore does not violate 3NF. (In other words, as described in Section 13.9, when considering all candidate keys of a relation, the staffNo, iDate → carReg dependency is allowed in 3NF because carReg is a primary-key attribute as it is part of the candidate key (carReg, iDate, iTime) of the original PropertyInspection relation.)

To transform the PropertyInspection relation into 3NF, we remove the transitive dependency (staffNo → sName) by creating two new relations called Staff and PropertyInspect with the form:

Staff (staffNo, sName)
PropertyInspect (propertyNo, iDate, iTime, comments, staffNo, carReg)

The Staff and PropertyInspect relations are in 3NF as no non-primary-key attribute is wholly functionally dependent on another non-primary-key attribute. Thus, the StaffPropertyInspection relation shown in Figure 14.5 has been transformed by the process of normalization into three relations in 3NF with the following form:

Property (propertyNo, pAddress)
Staff (staffNo, sName)
PropertyInspect (propertyNo, iDate, iTime, comments, staffNo, carReg)

Boyce–Codd Normal Form (BCNF)

We now examine the Property, Staff, and PropertyInspect relations to determine whether they are in BCNF. Recall that a relation is in BCNF if every determinant of a relation is a candidate key. Therefore, to test for BCNF, we simply identify all the determinants and make sure they are candidate keys.
The functional dependencies for the Property, Staff, and PropertyInspect relations are as follows:

Property relation
fd2 propertyNo → pAddress

Staff relation
fd3 staffNo → sName

PropertyInspect relation
fd1′ propertyNo, iDate → iTime, comments, staffNo, carReg
fd4  staffNo, iDate → carReg
fd5′ carReg, iDate, iTime → propertyNo, comments, staffNo
fd6′ staffNo, iDate, iTime → propertyNo, comments

We can see that the Property and Staff relations are already in BCNF as the determinant in each of these relations is also the candidate key. The only 3NF relation that is not in BCNF is PropertyInspect, because of the presence of the determinant (staffNo, iDate), which is not a candidate key (represented as fd4). As a consequence the PropertyInspect relation may suffer from update anomalies. For example, to change the car allocated to staff number SG14 on the 22-Apr-03, we must update two tuples. If only one tuple is updated with the new car registration number, this results in an inconsistent state for the database.

To transform the PropertyInspect relation into BCNF, we must remove the dependency that violates BCNF by creating two new relations called StaffCar and Inspection with the form:

StaffCar (staffNo, iDate, carReg)
Inspection (propertyNo, iDate, iTime, comments, staffNo)

The StaffCar and Inspection relations are in BCNF as the determinant in each of these relations is also a candidate key. In summary, the decomposition of the StaffPropertyInspection relation shown in Figure 14.5 into BCNF relations is shown in Figure 14.7.

Figure 14.7 Decomposition of the StaffPropertyInspection relation into BCNF relations.

In this example, the decomposition of the original StaffPropertyInspection relation into BCNF relations has resulted in the ‘loss’ of the functional dependency carReg, iDate, iTime → propertyNo, pAddress, comments, staffNo, sName (represented as fd5), as parts of the determinant are in different relations. However, we recognize that if the functional dependency staffNo, iDate → carReg (represented as fd4) is not removed, the PropertyInspect relation will have data redundancy.

The resulting BCNF relations have the following form:

Property (propertyNo, pAddress)
Staff (staffNo, sName)
Inspection (propertyNo, iDate, iTime, comments, staffNo)
StaffCar (staffNo, iDate, carReg)

The original StaffPropertyInspection relation shown in Figure 14.5 can be recreated from the Property, Staff, Inspection, and StaffCar relations using the primary key/foreign key mechanism. For example, the attribute staffNo is a primary key within the Staff relation and is also present within the Inspection relation as a foreign key. The foreign key allows the association of the Staff and Inspection relations to identify the name of the member of staff undertaking the property inspection.
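A minimal SQL sketch of the four BCNF relations follows; the column types and referential constraints are assumptions made for illustration. The comments indicate which functional dependencies each declared key enforces; note that the ‘lost’ dependency fd5′ cannot be declared within any single table.

-- Sketch of the final BCNF relations (types and foreign keys assumed)
CREATE TABLE Property (
    propertyNo  VARCHAR(5)   NOT NULL PRIMARY KEY,
    pAddress    VARCHAR(100) NOT NULL
);

CREATE TABLE Staff (
    staffNo     VARCHAR(5)   NOT NULL PRIMARY KEY,
    sName       VARCHAR(50)  NOT NULL
);

CREATE TABLE StaffCar (
    staffNo     VARCHAR(5)   NOT NULL REFERENCES Staff,
    iDate       DATE         NOT NULL,
    carReg      VARCHAR(10)  NOT NULL,
    PRIMARY KEY (staffNo, iDate)                    -- enforces fd4
);

CREATE TABLE Inspection (
    propertyNo  VARCHAR(5)   NOT NULL REFERENCES Property,
    iDate       DATE         NOT NULL,
    iTime       TIME         NOT NULL,
    comments    VARCHAR(255),
    staffNo     VARCHAR(5)   NOT NULL REFERENCES Staff,
    PRIMARY KEY (propertyNo, iDate),                -- enforces fd1′
    UNIQUE (staffNo, iDate, iTime)                  -- enforces fd6′ (candidate key)
);
-- fd5′ (carReg, iDate, iTime → ...) spans StaffCar and Inspection and cannot
-- be declared as a key of either table.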
14.4 Fourth Normal Form (4NF)

Although BCNF removes any anomalies due to functional dependencies, further research led to the identification of another type of dependency called a Multi-Valued Dependency (MVD), which can also cause data redundancy (Fagin, 1977). In this section, we briefly describe a multi-valued dependency and the association of this type of dependency with Fourth Normal Form (4NF).

14.4.1 Multi-Valued Dependency

The possible existence of multi-valued dependencies in a relation is due to First Normal Form, which disallows an attribute in a tuple from having a set of values. For example, if we have two multi-valued attributes in a relation, we have to repeat each value of one of the attributes with every value of the other attribute, to ensure that tuples of the relation are consistent. This type of constraint is referred to as a multi-valued dependency and results in data redundancy.

Consider the BranchStaffOwner relation shown in Figure 14.8(a), which displays the names of members of staff (sName) and property owners (oName) at each branch office (branchNo). In this example, assume that staff name (sName) uniquely identifies each member of staff and that the owner name (oName) uniquely identifies each owner.

Figure 14.8(a) The BranchStaffOwner relation.

In this example, members of staff called Ann Beech and David Ford work at branch B003, and property owners called Carol Farrel and Tina Murphy are registered at branch B003. However, as there is no direct relationship between members of staff and property owners at a given branch office, we must create a tuple for every combination of member of staff and owner to ensure that the relation is consistent. This constraint represents a multi-valued dependency in the BranchStaffOwner relation. In other words, an MVD exists because two independent 1:* relationships are represented in the BranchStaffOwner relation.

Multi-Valued Dependency (MVD): Represents a dependency between attributes (for example, A, B, and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the set of values for B and C are independent of each other.

We represent an MVD between attributes A, B, and C in a relation using the following notation:

A ⎯>> B
A ⎯>> C

For example, we specify the MVD in the BranchStaffOwner relation shown in Figure 14.8(a) as follows:

branchNo ⎯>> sName
branchNo ⎯>> oName

A multi-valued dependency can be further defined as being trivial or nontrivial. An MVD A ⎯>> B in relation R is defined as being trivial if (a) B is a subset of A or (b) A ∪ B = R. An MVD is defined as being nontrivial if neither (a) nor (b) is satisfied. A trivial MVD does not specify a constraint on a relation, while a nontrivial MVD does specify a constraint.

The MVD in the BranchStaffOwner relation shown in Figure 14.8(a) is nontrivial as neither condition (a) nor (b) is true for this relation. The BranchStaffOwner relation is therefore constrained by the nontrivial MVD to repeat tuples to ensure the relation remains consistent in terms of the relationship between the sName and oName attributes. For example, if we wanted to add a new property owner for branch B003 we would have to create two new tuples, one for each member of staff, to ensure that the relation remains consistent. This is an example of an update anomaly caused by the presence of the nontrivial MVD. Even though the BranchStaffOwner relation is in BCNF, the relation remains poorly structured, due to the data redundancy caused by the presence of the nontrivial MVD. We clearly require a stronger form of BCNF that prevents relational structures such as the BranchStaffOwner relation.

Figure 14.8(b) The BranchStaff and BranchOwner 4NF relations.

14.4.2 Definition of Fourth Normal Form

Fourth Normal Form (4NF): A relation that is in Boyce–Codd Normal Form and does not contain nontrivial multi-valued dependencies.
Fourth Normal Form (4NF) is a stronger normal form than BCNF as it prevents relations from containing nontrivial MVDs, and hence data redundancy (Fagin, 1977). The normalization of BCNF relations to 4NF involves the removal of the MVD from the relation by placing the attribute(s) in a new relation along with a copy of the determinant(s). For example, the BranchStaffOwner relation in Figure 14.8(a) is not in 4NF because of the presence of the nontrivial MVD. We decompose the BranchStaffOwner relation into the BranchStaff and BranchOwner relations, as shown in Figure 14.8(b). Both new relations are in 4NF because the BranchStaff relation contains the trivial MVD branchNo ⎯>> sName, and the BranchOwner relation contains the trivial MVD branchNo ⎯>> oName. Note that the 4NF relations do not display data redundancy and the potential for update anomalies is removed. For example, to add a new property owner for branch B003, we simply create a single tuple in the BranchOwner relation.

For a detailed discussion on 4NF the interested reader is referred to Date (2003), Elmasri and Navathe (2003), and Hawryszkiewycz (1994).
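A sketch of the 4NF decomposition in SQL follows; the column types and the sample owner name in the INSERT are assumed for illustration. The final query shows that a natural join over branchNo recreates the original BranchStaffOwner relation.

-- Sketch of the 4NF decomposition of BranchStaffOwner (types assumed)
CREATE TABLE BranchStaff (
    branchNo  VARCHAR(4)  NOT NULL,
    sName     VARCHAR(50) NOT NULL,
    PRIMARY KEY (branchNo, sName)
);

CREATE TABLE BranchOwner (
    branchNo  VARCHAR(4)  NOT NULL,
    oName     VARCHAR(50) NOT NULL,
    PRIMARY KEY (branchNo, oName)
);

-- Adding a new owner for branch B003 now needs only one row
-- ('Geoff Knight' is a hypothetical owner, used only for illustration):
INSERT INTO BranchOwner (branchNo, oName) VALUES ('B003', 'Geoff Knight');

-- The original BranchStaffOwner relation can be recreated with a join on branchNo:
SELECT bs.branchNo, bs.sName, bo.oName
FROM   BranchStaff bs
JOIN   BranchOwner bo ON bs.branchNo = bo.branchNo;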
14.5 Fifth Normal Form (5NF)

Whenever we decompose a relation into two relations using one of the normalization steps described so far, the resulting relations have the lossless-join property. This property refers to the fact that we can rejoin the resulting relations to produce the original relation. However, there are cases where there is a requirement to decompose a relation into more than two relations. Although rare, these cases are managed by join dependency and Fifth Normal Form (5NF). In this section we briefly describe the lossless-join dependency and its association with 5NF.

14.5.1 Lossless-Join Dependency

Lossless-join dependency: A property of decomposition, which ensures that no spurious tuples are generated when relations are reunited through a natural join operation.

In splitting relations by projection, we are very explicit about the method of decomposition. In particular, we are careful to use projections that can be reversed by joining the resulting relations, so that the original relation is reconstructed. Such a decomposition is called a lossless-join (also called a nonloss- or nonadditive-join) decomposition, because it preserves all the data in the original relation and does not result in the creation of additional spurious tuples. For example, Figures 14.8(a) and (b) show that the decomposition of the BranchStaffOwner relation into the BranchStaff and BranchOwner relations has the lossless-join property. In other words, the original BranchStaffOwner relation can be reconstructed by performing a natural join operation on the BranchStaff and BranchOwner relations. In this example, the original relation is decomposed into two relations. However, there are cases where we need to perform a lossless-join decomposition of a relation into more than two relations (Aho et al., 1979). These cases are the focus of the lossless-join dependency and Fifth Normal Form (5NF).

14.5.2 Definition of Fifth Normal Form

Fifth Normal Form (5NF): A relation that has no join dependency.

Fifth Normal Form (5NF) (also called Project-Join Normal Form (PJNF)) specifies that a 5NF relation has no join dependency (Fagin, 1979). To examine what a join dependency means, consider as an example the PropertyItemSupplier relation shown in Figure 14.9(a). This relation describes properties (propertyNo) that require certain items (itemDescription), which are supplied by suppliers (supplierNo) to the properties (propertyNo). Furthermore, whenever a property (p) requires a certain item (i), and a supplier (s) supplies that item (i), and the supplier (s) already supplies at least one item to that property (p), then the supplier (s) will also supply the required item (i) to property (p). In this example, assume that a description of an item (itemDescription) uniquely identifies each type of item.

Figure 14.9 (a) Illegal state for PropertyItemSupplier relation and (b) legal state for PropertyItemSupplier relation.

To identify the type of constraint on the PropertyItemSupplier relation in Figure 14.9(a), consider the following statement:

If    Property PG4 requires Bed               (from data in tuple 1)
      Supplier S2 supplies property PG4       (from data in tuple 2)
      Supplier S2 provides Bed                (from data in tuple 3)
Then  Supplier S2 provides Bed for property PG4

This example illustrates the cyclical nature of the constraint on the PropertyItemSupplier relation. If this constraint holds, then the tuple (PG4, Bed, S2) must exist in any legal state of the PropertyItemSupplier relation, as shown in Figure 14.9(b). This is an example of a type of update anomaly and we say that this relation contains a join dependency (JD).

Join dependency: Describes a type of dependency. For example, for a relation R with subsets of the attributes of R denoted as A, B, . . . , Z, a relation R satisfies a join dependency if and only if every legal value of R is equal to the join of its projections on A, B, . . . , Z.

As the PropertyItemSupplier relation contains a join dependency, it is therefore not in 5NF. To remove the join dependency, we decompose the PropertyItemSupplier relation into three 5NF relations, namely PropertyItem (R1), ItemSupplier (R2), and PropertySupplier (R3), as shown in Figure 14.10. We say that the PropertyItemSupplier relation with the form (A, B, C) satisfies the join dependency JD (R1(A, B), R2(B, C), R3(A, C)). It is important to note that performing a natural join on any two of these relations will produce spurious tuples; however, performing the join on all three will recreate the original PropertyItemSupplier relation.

For a detailed discussion on 5NF the interested reader is referred to Date (2003), Elmasri and Navathe (2003), and Hawryszkiewycz (1994).

Figure 14.10 PropertyItem, ItemSupplier, and PropertySupplier 5NF relations.
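The three-way 5NF decomposition, and the join of all three projections that recreates PropertyItemSupplier, can be sketched in SQL as follows; the column types are assumed for illustration.

-- Sketch of the 5NF decomposition of PropertyItemSupplier (types assumed)
CREATE TABLE PropertyItem (
    propertyNo       VARCHAR(5)  NOT NULL,
    itemDescription  VARCHAR(30) NOT NULL,
    PRIMARY KEY (propertyNo, itemDescription)
);

CREATE TABLE ItemSupplier (
    itemDescription  VARCHAR(30) NOT NULL,
    supplierNo       VARCHAR(5)  NOT NULL,
    PRIMARY KEY (itemDescription, supplierNo)
);

CREATE TABLE PropertySupplier (
    propertyNo  VARCHAR(5) NOT NULL,
    supplierNo  VARCHAR(5) NOT NULL,
    PRIMARY KEY (propertyNo, supplierNo)
);

-- Joining all three projections recreates the original relation without
-- spurious tuples; joining only two of them would not.
SELECT pi.propertyNo, pi.itemDescription, isup.supplierNo
FROM   PropertyItem pi
JOIN   ItemSupplier isup ON pi.itemDescription = isup.itemDescription
JOIN   PropertySupplier ps ON ps.propertyNo = pi.propertyNo
                          AND ps.supplierNo = isup.supplierNo;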
Chapter Summary

• Inference rules can be used to identify the set of all functional dependencies associated with a relation. This set of dependencies can be very large for a given relation.
• Inference rules called Armstrong’s axioms can be used to identify a minimal set of functional dependencies from the set of all functional dependencies for a relation.
• Boyce–Codd Normal Form (BCNF) is a relation in which every determinant is a candidate key.
• Fourth Normal Form (4NF) is a relation that is in BCNF and does not contain nontrivial multi-valued dependencies.
• A multi-valued dependency (MVD) represents a dependency between attributes (for example, A, B, and C) in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the sets of values for B and C are independent of each other.
• A lossless-join dependency is a property of decomposition, which means that no spurious tuples are generated when relations are combined through a natural join operation.
• Fifth Normal Form (5NF) is a relation that contains no join dependency. For a relation R with subsets of attributes of R denoted as A, B, . . . , Z, a relation R satisfies a join dependency if and only if every legal value of R is equal to the join of its projections on A, B, . . . , Z.

Review Questions

14.1 Describe the purpose of using inference rules to identify functional dependencies for a given relation.
14.2 Discuss the purpose of Armstrong’s axioms.
14.3 Discuss the purpose of Boyce–Codd Normal Form (BCNF) and discuss how BCNF differs from 3NF. Provide an example to illustrate your answer.
14.4 Describe the concept of multi-valued dependency and discuss how this concept relates to 4NF. Provide an example to illustrate your answer.
14.5 Describe the concept of join dependency and discuss how this concept relates to 5NF. Provide an example to illustrate your answer.

Exercises

14.6 On completion of Exercise 13.14, examine the 3NF relations created to represent the attributes shown in the Wellmeadows Hospital form shown in Figure 13.18. Determine whether these relations are also in BCNF. If not, transform the relations that do not conform into BCNF.
14.7 On completion of Exercise 13.15, examine the 3NF relations created to represent the attributes shown in the relation that displays dentist/patient appointment data in Figure 13.19. Determine whether these relations are also in BCNF. If not, transform the relations that do not conform into BCNF.
14.8 On completion of Exercise 13.16, examine the 3NF relations created to represent the attributes shown in the relation displaying employee contract data for an agency called Instant Cover in Figure 13.20. Determine whether these relations are also in BCNF. If not, transform the relations that do not conform into BCNF.
14.9 The relation shown in Figure 14.11 lists members of staff (staffName) working in a given ward (wardName) and patients (patientName) allocated to a given ward. There is no relationship between members of staff and patients in each ward. In this example assume that staff name (staffName) uniquely identifies each member of staff and that the patient name (patientName) uniquely identifies each patient.
(a) Describe why the relation shown in Figure 14.11 is not in 4NF.
(b) The relation shown in Figure 14.11 is susceptible to update anomalies. Provide examples of insertion, deletion, and update anomalies.
(c) Describe and illustrate the process of normalizing the relation shown in Figure 14.11 to 4NF.

Figure 14.11 The WardStaffPatient relation.

14.10 The relation shown in Figure 14.12 describes hospitals (hospitalName) that require certain items (itemDescription), which are supplied by suppliers (supplierNo) to the hospitals (hospitalName). Furthermore, whenever a hospital (h) requires a certain item (i), and a supplier (s) supplies that item (i), and the supplier (s) already supplies at least one item to that hospital (h), then the supplier (s) will also supply the required item (i) to the hospital (h). In this example, assume that a description of an item (itemDescription) uniquely identifies each type of item.
(a) Describe why the relation shown in Figure 14.12 is not in 5NF.
(b) Describe and illustrate the process of normalizing the relation shown in Figure 14.12 to 5NF.

Figure 14.12 The HospitalItemSupplier relation.
Part 4 Methodology Chapter 15 Methodology – Conceptual Database Design 437 Chapter 16 Methodology – Logical Database Design for the Relational Model 461 Methodology – Physical Database Design for Relational Databases 494 Methodology – Monitoring and Tuning the Operational System 519 Chapter 17 Chapter 18 Chapter 15 Methodology – Conceptual Database Design Chapter Objectives In this chapter you will learn: n The purpose of a design methodology. n Database design has three main phases: conceptual, logical, and physical design. n How to decompose the scope of the design into specific views of the enterprise. n How to use Entity–Relationship (ER) modeling to build a local conceptual data model based on the information given in a view of the enterprise. n How to validate the resultant conceptual model to ensure it is a true and accurate representation of a view of the enterprise. n How to document the process of conceptual database design. n End-users play an integral role throughout the process of conceptual database design. In Chapter 9 we described the main stages of the database system development lifecycle, one of which is database design. This stage starts only after a complete analysis of the enterprise’s requirements has been undertaken. In this chapter, and Chapters 16–18, we describe a methodology for the database design stage of the database system development lifecycle for relational databases. The methodology is presented as a step-by-step guide to the three main phases of database design, namely: conceptual, logical, and physical design (see Figure 9.1). The main aim of each phase is as follows: n n n Conceptual database design – to build the conceptual representation of the database, which includes identification of the important entities, relationships, and attributes. Logical database design – to translate the conceptual representation to the logical structure of the database, which includes designing the relations. Physical database design – to decide how the logical structure is to be physically implemented (as base relations) in the target Database Management System (DBMS). 438 | Chapter 15 z Methodology – Conceptual Database Design Structure of this Chapter In Section 15.1 we define what a database design methodology is and review the three phases of database design. In Section 15.2 we provide an overview of the methodology and briefly describe the main activities associated with each design phase. In Section 15.3 we focus on the methodology for conceptual database design and present a detailed description of the steps required to build a conceptual data model. We use the Entity– Relationship (ER) modeling technique described in Chapters 11 and 12 to create the conceptual data model. In Chapter 16 we focus on the methodology for logical database design for the relational model and present a detailed description of the steps required to convert a conceptual data model into a logical data model. This chapter also includes an optional step that describes how to merge two or more logical data models into a single logical data model for those using the view integration approach (see Section 9.5) to manage the design of a database with multiple user views. In Chapters 17 and 18 we complete the database design methodology by presenting a detailed description of the steps associated with the production of the physical database design for relational DBMSs. 
This part of the methodology illustrates that the development of the logical data model alone is insufficient to guarantee the optimum implementation of a database system. For example, we may have to consider modifying the logical model to achieve acceptable levels of performance. Appendix G presents a summary of the database design methodology for those readers who are already familiar with database design and simply require an overview of the main steps. Throughout the methodology the terms ‘entity’ and ‘relationship’ are used in place of ‘entity type’ and ‘relationship type’ where the meaning is obvious; ‘type’ is generally only added to avoid ambiguity. In this chapter we mostly use examples from the Staff user views of the DreamHome case study documented in Section 10.4 and Appendix A. 15.1 Introduction to the Database Design Methodology Before presenting the methodology, we discuss what a design methodology represents and describe the three phases of database design. Finally, we present guidelines for achieving success in database design. 15.1.1 What is a Design Methodology? Design methodology A structured approach that uses procedures, techniques, tools, and documentation aids to support and facilitate the process of design. 15.1 Introduction to the Database Design Methodology | A design methodology consists of phases each containing a number of steps, which guide the designer in the techniques appropriate at each stage of the project. A design methodology also helps the designer to plan, manage, control, and evaluate database development projects. Furthermore, it is a structured approach for analyzing and modeling a set of requirements for a database in a standardized and organized manner. Conceptual, Logical, and Physical Database Design In presenting this database design methodology, the design process is divided into three main phases: conceptual, logical, and physical database design. Conceptual database design The process of constructing a model of the data used in an enterprise, independent of all physical considerations. The conceptual database design phase begins with the creation of a conceptual data model of the enterprise, which is entirely independent of implementation details such as the target DBMS, application programs, programming languages, hardware platform, performance issues, or any other physical considerations. Logical database design The process of constructing a model of the data used in an enterprise based on a specific data model, but independent of a particular DBMS and other physical considerations. The logical database design phase maps the conceptual model on to a logical model, which is influenced by the data model for the target database (for example, the relational model). The logical data model is a source of information for the physical design phase, providing the physical database designer with a vehicle for making tradeoffs that are very important to the design of an efficient database. Physical database design The process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures. The physical database design phase allows the designer to make decisions on how the database is to be implemented. Therefore, physical design is tailored to a specific DBMS. 
There is feedback between physical and logical design, because decisions taken during physical design for improving performance may affect the logical data model. 15.1.2 439 440 | Chapter 15 z Methodology – Conceptual Database Design 15.1.3 Critical Success Factors in Database Design The following guidelines are often critical to the success of database design: n n n n n n n n n Work interactively with the users as much as possible. Follow a structured methodology throughout the data modeling process. Employ a data-driven approach. Incorporate structural and integrity considerations into the data models. Combine conceptualization, normalization, and transaction validation techniques into the data modeling methodology. Use diagrams to represent as much of the data models as possible. Use a Database Design Language (DBDL) to represent additional data semantics that cannot easily be represented in a diagram. Build a data dictionary to supplement the data model diagrams and the DBDL. Be willing to repeat steps. These factors are built into the methodology we present for database design. 15.2 Overview of the Database Design Methodology In this section, we present an overview of the database design methodology. The steps in the methodology are as follows. Conceptual database design Step 1 Build conceptual data model Step 1.1 Identify entity types Step 1.2 Identify relationship types Step 1.3 Identify and associate attributes with entity or relationship types Step 1.4 Determine attribute domains Step 1.5 Determine candidate, primary, and alternate key attributes Step 1.6 Consider use of enhanced modeling concepts (optional step) Step 1.7 Check model for redundancy Step 1.8 Validate conceptual model against user transactions Step 1.9 Review conceptual data model with user Logical database design for the relational model Step 2 Build and validate logical data model Step 2.1 Derive relations for logical data model Step 2.2 Validate relations using normalization Step 2.3 Validate relations against user transactions Step 2.4 Check integrity constraints Step 2.5 Review logical data model with user 15.2 Overview of the Database Design Methodology Step 2.6 Merge logical data models into global model (optional step) Step 2.7 Check for future growth Physical database design for relational databases Step 3 Translate logical data model for target DBMS Step 3.1 Design base relations Step 3.2 Design representation of derived data Step 3.3 Design general constraints Step 4 Design file organizations and indexes Step 4.1 Analyze transactions Step 4.2 Choose file organizations Step 4.3 Choose indexes Step 4.4 Estimate disk space requirements Step 5 Design user views Step 6 Design security mechanisms Step 7 Consider the introduction of controlled redundancy Step 8 Monitor and tune the operational system This methodology can be used to design relatively simple to highly complex database systems. Just as the database design stage of the database systems development lifecycle (see Section 9.6) has three phases, namely conceptual, logical, and physical design, so too has the methodology. Step 1 creates a conceptual database design, Step 2 creates a logical database design, and Steps 3 to 8 creates a physical database design. Depending on the complexity of the database system being built, some of the steps may be omitted. 
For example, Step 2.6 of the methodology is not required for database systems with a single user view or database systems with multiple user views being managed using the centralization approach (see Section 9.5). For this reason, we only refer to the creation of a single conceptual data model in Step 1 or single logical data model in Step 2. However, if the database designer is using the view integration approach (see Section 9.5) to manage user views for a database system then Steps 1 and 2 may be repeated as necessary to create the required number of models, which are then merged in Step 2.6. In Chapter 9, we introduced the term ‘local conceptual data model’ or ‘local logical data model’ to refer to the modeling of one or more, but not all, user views of a database system and the term ‘global logical data model’ to refer to the modeling of all user views of a database system. However, the methodology is presented using the more general terms ‘conceptual data model’ and ‘logical data model’ with the exception of the optional Step 2.6, which necessitates the use of the terms local logical data model and global logical data model as it is this step that describes the tasks necessary to merge separate local logical data models to produce a global logical data model. An important aspect of any design methodology is to ensure that the models produced are repeatedly validated so that they continue to be an accurate representation of the part of the enterprise being modeled. In this methodology the data models are validated in various ways such as by using normalization (Step 2.2), by ensuring the critical transactions are supported (Steps 1.8 and 2.3) and by involving the users as much as possible (Steps 1.9 and 2.5). | 441 442 | Chapter 15 z Methodology – Conceptual Database Design The logical model created at the end of Step 2 is then used as the source of information for physical database design described in Steps 3 to 8. Again depending on the complexity of the database systems being design and/or the functionality of the target DBMS, some steps of physical database design may be omitted. For example, Step 4.2 may not be applicable for certain PC-based DBMSs. The steps of physical database design are described in detail in Chapters 17 and 18. Database design is an iterative process, which has a starting point and an almost endless procession of refinements. Although the steps of the methodology are presented here as a procedural process, it must be emphasized that this does not imply that it should be performed in this manner. It is likely that knowledge gained in one step may alter decisions made in a previous step. Similarly, it may be useful to look briefly at a later step to help with an earlier step. Therefore, the methodology should act as a framework to help guide the designer through database design effectively. To illustrate the database design methodology we use the DreamHome case study. The DreamHome database has several user views (Director, Manager, Supervisor, and Assistant) that are managed using a combination of the centralization and view integration approaches (see Section 10.4). Applying the centralization approach resulted in the identification of two collections of user views called Staff user views and Branch user views. The user views represented by each collection are as follows: n n Staff user views – representing Supervisor and Assistant user views; Branch user views – representing Director and Manager user views. 
In this chapter, which describes Step 1 of the methodology, we use the Staff user views to illustrate the building of a conceptual data model, and then in the following chapter, which describes Step 2, we describe how this model is translated into a logical data model. As the Staff user views represent only a subset of all the user views of the DreamHome database, it is more correct to refer to the data models as local data models. However, as stated earlier when we described the methodology and the worked examples, for simplicity we use the terms conceptual data model and logical data model until the optional Step 2.6, which describes the integration of the local logical data models for the Staff user views and the Branch user views.

15.3 Conceptual Database Design Methodology

This section provides a step-by-step guide for conceptual database design.

Step 1 Build Conceptual Data Model

Objective To build a conceptual data model of the data requirements of the enterprise.

The first step in conceptual database design is to build one (or more) conceptual data models of the data requirements of the enterprise. A conceptual data model comprises:

• entity types;
• relationship types;
• attributes and attribute domains;
• primary keys and alternate keys;
• integrity constraints.

The conceptual data model is supported by documentation, including ER diagrams and a data dictionary, which is produced throughout the development of the model. We detail the types of supporting documentation that may be produced as we go through the various steps. The tasks involved in Step 1 are:

Step 1.1 Identify entity types
Step 1.2 Identify relationship types
Step 1.3 Identify and associate attributes with entity or relationship types
Step 1.4 Determine attribute domains
Step 1.5 Determine candidate, primary, and alternate key attributes
Step 1.6 Consider use of enhanced modeling concepts (optional step)
Step 1.7 Check model for redundancy
Step 1.8 Validate conceptual model against user transactions
Step 1.9 Review conceptual data model with user

Step 1.1 Identify entity types

Objective To identify the required entity types.

The first step in building a conceptual data model is to define the main objects that the users are interested in. These objects are the entity types for the model (see Section 11.1). One method of identifying entities is to examine the users’ requirements specification. From this specification, we identify nouns or noun phrases that are mentioned (for example, staff number, staff name, property number, property address, rent, number of rooms). We also look for major objects such as people, places, or concepts of interest, excluding those nouns that are merely qualities of other objects. For example, we could group staff number and staff name with an object or entity called Staff, and group property number, property address, rent, and number of rooms with an entity called PropertyForRent. An alternative way of identifying entities is to look for objects that have an existence in their own right. For example, Staff is an entity because staff exist whether or not we know their names, positions, and dates of birth. If possible, the users should assist with this activity.

It is sometimes difficult to identify entities because of the way they are presented in the users’ requirements specification. Users often talk in terms of examples or analogies. Instead of talking about staff in general, users may mention people’s names.
In some cases, users talk in terms of job roles, particularly where people or organizations are involved. These roles may be job titles or responsibilities, such as Director, Manager, Supervisor, or Assistant. To confuse matters further, users frequently use synonyms and homonyms. Two words are synonyms when they have the same meaning, for example, ‘branch’ and ‘office’. Homonyms occur when the same word can have different meanings depending on the context. For example, the word ‘program’ has several alternative meanings such as a course of study, a series of events, a plan of work, and an item on the television.

It is not always obvious whether a particular object is an entity, a relationship, or an attribute. For example, how would we classify marriage? In fact, depending on the actual requirements we could classify marriage as any or all of these. Design is subjective and different designers may produce different, but equally valid, interpretations. The activity therefore relies, to a certain extent, on judgement and experience. Database designers must take a very selective view of the world and categorize the things that they observe within the context of the enterprise. Thus, there may be no unique set of entity types deducible from a given requirements specification. However, successive iterations of the design process should lead to the choice of entities that are at least adequate for the system required. For the Staff user views of DreamHome we identify the following entities:

Staff           PropertyForRent
PrivateOwner    BusinessOwner
Client          Preference
Lease

Document entity types

As entity types are identified, assign them names that are meaningful and obvious to the user. Record the names and descriptions of entities in a data dictionary. If possible, document the expected number of occurrences of each entity. If an entity is known by different names, the names are referred to as synonyms or aliases, which are also recorded in the data dictionary. Figure 15.1 shows an extract from the data dictionary that documents the entities for the Staff user views of DreamHome.

Figure 15.1 Extract from the data dictionary for the Staff user views of DreamHome showing a description of entities.

Step 1.2 Identify relationship types

Objective To identify the important relationships that exist between the entity types.

Having identified the entities, the next step is to identify all the relationships that exist between these entities (see Section 11.2). As when we identify entities, one method is to examine the grammar of the users’ requirements specification, but this time looking for relationships rather than nouns. Typically, relationships are indicated by verbs or verbal expressions. For example:

• Staff Manages PropertyForRent
• PrivateOwner Owns PropertyForRent
• PropertyForRent AssociatedWith Lease

The fact that the requirements specification records these relationships suggests that they are important to the enterprise, and should be included in the model. We are interested only in required relationships between entities. In the above examples, we identified the Staff Manages PropertyForRent and the PrivateOwner Owns PropertyForRent relationships. We may also be inclined to include a relationship between Staff and PrivateOwner (for example, Staff Assists PrivateOwner).
However, although this is a possible relationship, from the requirements specification it is not a relationship that we are interested in modeling.

In most instances, the relationships are binary; in other words, the relationships exist between exactly two entity types. However, we should be careful to look out for complex relationships that may involve more than two entity types (see Section 11.2.1) and recursive relationships that involve only one entity type (see Section 11.2.2). Great care must be taken to ensure that all the relationships that are either explicit or implicit in the users’ requirements specification are detected. In principle, it should be possible to check each pair of entity types for a potential relationship between them, but this would be a daunting task for a large system comprising hundreds of entity types. On the other hand, it is unwise not to perform some such check, and the responsibility is often left to the analyst/designer. However, missing relationships should become apparent when we validate the model against the transactions that are to be supported (Step 1.8).

Use Entity–Relationship (ER) diagrams

It is often easier to visualize a complex system than to decipher long textual descriptions of a users’ requirements specification. We use Entity–Relationship (ER) diagrams to represent entities and how they relate to one another more easily. Throughout the database design phase, we recommend that ER diagrams should be used whenever necessary to help build up a picture of the part of the enterprise that we are modeling. In this book, we have used the latest object-oriented notation called UML (Unified Modeling Language), but other notations perform a similar function (see Appendix F).

Determine the multiplicity constraints of relationship types

Having identified the relationships to model, we next determine the multiplicity of each relationship (see Section 11.6). If specific values for the multiplicity are known, or even upper or lower limits, document these values as well. Multiplicity constraints are used to check and maintain data quality. These constraints are assertions about entity occurrences that can be applied when the database is updated to determine whether or not the updates violate the stated rules of the enterprise. A model that includes multiplicity constraints more explicitly represents the semantics of the relationships and results in a better representation of the data requirements of the enterprise.

Check for fan and chasm traps

Having identified the necessary relationships, check that each relationship in the ER model is a true representation of the ‘real world’, and that fan or chasm traps have not been created inadvertently (see Section 11.7). Figure 15.2 shows the first-cut ER diagram for the Staff user views of the DreamHome case study.

Figure 15.2 First-cut ER diagram showing entity and relationship types for the Staff user views of DreamHome.

Document relationship types

As relationship types are identified, assign them names that are meaningful and obvious to the user. Also record relationship descriptions and the multiplicity constraints in the data dictionary. Figure 15.3 shows an extract from the data dictionary that documents the relationships for the Staff user views of DreamHome.

Figure 15.3 Extract from the data dictionary for the Staff user views of DreamHome showing a description of relationships.
Step 1.3 Identify and associate attributes with entity or relationship types Objective To associate attributes with appropriate entity or relationship types. The next step in the methodology is to identify the types of facts about the entities and relationships that we have chosen to be represented in the database. In a similar way to identifying entities, we look for nouns or noun phrases in the users’ requirements specification. The attributes can be identified where the noun or noun phrase is a property, quality, identifier, or characteristic of one of these entities or relationships (see Section 11.3). By far the easiest thing to do when we have identified an entity (x) or a relationship (y) in the requirements specification is to ask ‘What information are we required to hold on x or y?’ The answer to this question should be described in the specification. However, in some cases it may be necessary to ask the users to clarify the requirements. Unfortunately, they may give answers to this question that also contain other concepts, so that the users’ responses must be carefully considered. Simple/composite attributes It is important to note whether an attribute is simple or composite (see Section 11.3.1). Composite attributes are made up of simple attributes. For example, the address attribute can be simple and hold all the details of an address as a single value, such as, ‘115 Dumbarton Road, Glasgow, G11 6YG’. However, the address attribute may also represent a composite attribute, made up of simple attributes that hold the address details as separate values in the attributes street (‘115 Dumbarton Road’), city (‘Glasgow’), and postcode (‘G11 6YG’). The option to represent address details as a simple or composite attribute is determined by the users’ requirements. If the user does not need to access the separate components of an address, we represent the address attribute as a simple attribute. On the other hand, if the user does need to access the individual components of an address, we represent the address attribute as being composite, made up of the required simple attributes. 448 | Chapter 15 z Methodology – Conceptual Database Design In this step, it is important that we identify all simple attributes to be represented in the conceptual data model including those attributes that make up a composite attribute. Single/multi-valued attributes In addition to being simple or composite, an attribute can also be single-valued or multivalued (see Section 11.3.2). Most attributes encountered will be single-valued, but occasionally a multi-valued attribute may be encountered; that is, an attribute that holds multiple values for a single entity occurrence. For example, we may identify the attribute telNo (the telephone number) of the Client entity as a multi-valued attribute. On the other hand, client telephone numbers may have been identified as a separate entity from Client. This is an alternative, and equally valid, way to model this. As we will see in Step 2.1, multi-valued attributes are mapped to relations anyway, so both approaches produce the same end-result. Derived attributes Attributes whose values are based on the values of other attributes are known as derived attributes (see Section 11.3.3). Examples of derived attributes include: n n n the age of a member of staff; the number of properties that a member of staff manages; the rental deposit (calculated as twice the monthly rent). Often, these attributes are not represented in the conceptual data model. 
However, sometimes the value of the attribute or attributes on which the derived attribute is based may be deleted or modified. In this case, the derived attribute must be shown in the data model to avoid this potential loss of information. However, if a derived attribute is shown in the model, we must indicate that it is derived. The representation of derived attributes will be considered during physical database design. Depending on how an attribute is used, new values for a derived attribute may be calculated each time it is accessed or when the value(s) it is derived from changes. However, this issue is not the concern of conceptual database design, and is discussed in more detail in Step 3.2 in Chapter 17.

Potential problems

When identifying the entities, relationships, and attributes for the view, it is not uncommon for it to become apparent that one or more entities, relationships, or attributes have been omitted from the original selection. In this case, return to the previous steps, document the new entities, relationships, or attributes and re-examine any associated relationships. As there are generally many more attributes than entities and relationships, it may be useful to first produce a list of all attributes given in the users' requirements specification. As an attribute is associated with a particular entity or relationship, remove the attribute from the list. In this way, we ensure that an attribute is associated with only one entity or relationship type and, when the list is empty, that all attributes are associated with some entity or relationship type. We must also be aware of cases where attributes appear to be associated with more than one entity or relationship type, as this can indicate the following:

(1) We have identified several entities that can be represented as a single entity. For example, we may have identified entities Assistant and Supervisor both with the attributes staffNo (the staff number), name, sex, and DOB (date of birth), which can be represented as a single entity called Staff with the attributes staffNo (the staff number), name, sex, DOB, and position (with values Assistant or Supervisor); a sketch of this generalized entity is given after this list. On the other hand, it may be that these entities share many attributes but there are also attributes or relationships that are unique to each entity. In this case, we must decide whether we want to generalize the entities into a single entity such as Staff, or leave them as specialized entities representing distinct staff roles. The consideration of whether to specialize or generalize entities was discussed in Chapter 12 and is addressed in more detail in Step 1.6.

(2) We have identified a relationship between entity types. In this case, we must associate the attribute with only one entity, namely the parent entity, and ensure that the relationship was previously identified in Step 1.2. If this is not the case, the documentation should be updated with details of the newly identified relationship. For example, we may have identified the entities Staff and PropertyForRent with the following attributes:

Staff: staffNo, name, position, sex, DOB
PropertyForRent: propertyNo, street, city, postcode, type, rooms, rent, managerName

The presence of the managerName attribute in PropertyForRent is intended to represent the relationship Staff Manages PropertyForRent. In this case, the managerName attribute should be omitted from PropertyForRent and the relationship Manages should be added to the model.
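As a minimal sketch of the generalization described in point (1), and purely for illustration, the merged Staff entity might eventually be declared as follows in SQL. The data types and lengths are assumptions, and the check on position simply restricts the attribute to the two role values mentioned above; the methodology itself records this decision in the ER model and the data dictionary, not in SQL.

-- Illustrative only: Assistant and Supervisor folded into a single Staff entity,
-- with the role recorded in a position attribute restricted to the two values above.
CREATE TABLE Staff (
    staffNo   VARCHAR(5)  NOT NULL,    -- the staff number
    name      VARCHAR(30) NOT NULL,
    sex       CHAR(1)     NOT NULL CHECK (sex IN ('M', 'F')),
    DOB       DATE        NOT NULL,
    position  VARCHAR(10) NOT NULL CHECK (position IN ('Assistant', 'Supervisor')),
    PRIMARY KEY (staffNo)
);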
DreamHome attributes for entities

For the Staff user views of DreamHome, we identify and associate attributes with entities as follows:

Staff: staffNo, name (composite: fName, lName), position, sex, DOB
PropertyForRent: propertyNo, address (composite: street, city, postcode), type, rooms, rent
PrivateOwner: ownerNo, name (composite: fName, lName), address, telNo
BusinessOwner: ownerNo, bName, bType, address, telNo, contactName
Client: clientNo, name (composite: fName, lName), telNo
Preference: prefType, maxRent
Lease: leaseNo, paymentMethod, deposit (derived as PropertyForRent.rent*2), depositPaid, rentStart, rentFinish, duration (derived as rentFinish – rentStart)

DreamHome attributes for relationships

Some attributes should not be associated with entities but instead should be associated with relationships. For the Staff user views of DreamHome, we identify and associate attributes with relationships as follows:

Views: viewDate, comment

Document attributes

As attributes are identified, assign them names that are meaningful to the user. Record the following information for each attribute:

- attribute name and description;
- data type and length;
- any aliases that the attribute is known by;
- whether the attribute is composite and, if so, the simple attributes that make up the composite attribute;
- whether the attribute is multi-valued;
- whether the attribute is derived and, if so, how it is to be computed;
- any default value for the attribute.

Figure 15.4 shows an extract from the data dictionary that documents the attributes for the Staff user views of DreamHome.

Figure 15.4 Extract from the data dictionary for the Staff user views of DreamHome showing a description of attributes.

Step 1.4 Determine attribute domains

Objective To determine domains for the attributes in the conceptual data model.

The objective of this step is to determine domains for all the attributes in the model (see Section 11.3). A domain is a pool of values from which one or more attributes draw their values. For example, we may define:

- the attribute domain of valid staff numbers (staffNo) as being a five-character variable-length string, with the first two characters as letters and the next one to three characters as digits in the range 1–999;
- the possible values for the sex attribute of the Staff entity as being either 'M' or 'F'. The domain of this attribute is a single character string consisting of the values 'M' or 'F'.

A fully developed data model specifies the domains for each attribute and includes:

- allowable set of values for the attribute;
- sizes and formats of the attribute.

Further information can be specified for a domain such as the allowable operations on an attribute, and which attributes can be compared with other attributes or used in combination with other attributes. However, implementing these characteristics of attribute domains in a DBMS is still the subject of research.

Document attribute domains

As attribute domains are identified, record their names and characteristics in the data dictionary. Update the data dictionary entries for attributes to record their domain in place of the data type and length information.

Step 1.5 Determine candidate, primary, and alternate key attributes

Objective To identify the candidate key(s) for each entity type and, if there is more than one candidate key, to choose one to be the primary key and the others as alternate keys.
This step is concerned with identifying the candidate key(s) for an entity and then selecting one to be the primary key (see Section 11.3.4). A candidate key is a minimal set of attributes of an entity that uniquely identifies each occurrence of that entity. We may identify more than one candidate key, in which case we must choose one to be the primary key; the remaining candidate keys are called alternate keys. People’s names generally do not make good candidate keys. For example, we may think that a suitable candidate key for the Staff entity would be the composite attribute name, the member of staff’s name. However, it is possible for two people with the same name to join DreamHome, which would clearly invalidate the choice of name as a candidate key. We could make a similar argument for the names of DreamHome’s owners. In such cases, rather than coming up with combinations of attributes that may provide uniqueness, it may be better to use an existing attribute that would always ensure uniqueness, such as the staffNo attribute for the Staff entity and the ownerNo attribute for the PrivateOwner entity, or define a new attribute that would provide uniqueness. When choosing a primary key from among the candidate keys, use the following guidelines to help make the selection: n n n n n the candidate key with the minimal set of attributes; the candidate key that is least likely to have its values changed; the candidate key with fewest characters (for those with textual attribute(s)); the candidate key with smallest maximum value (for those with numerical attribute(s)); the candidate key that is easiest to use from the users’ point of view. | 451 452 | Chapter 15 z Methodology – Conceptual Database Design Figure 15.5 ER diagram for the Staff user views of DreamHome with primary keys added. In the process of identifying primary keys, note whether an entity is strong or weak. If we are able to assign a primary key to an entity, the entity is referred to as being strong. On the other hand, if we are unable to identify a primary key for an entity, the entity is referred to as being weak (see Section 11.4). The primary key of a weak entity can only be identified when we map the weak entity and its relationship with its owner entity to a relation through the placement of a foreign key in that relation. The process of mapping entities and their relationships to relations is described in Step 2.1, and therefore the identification of primary keys for weak entities cannot take place until that step. DreamHome primary keys The primary keys for the Staff user views of DreamHome are shown in Figure 15.5. Note that the Preference entity is a weak entity and, as identified previously, the Views relationship has two attributes, viewDate and comment. Document primary and alternate keys Record the identification of primary and any alternate keys in the data dictionary. 15.3 Conceptual Database Design Methodology Step 1.6 Consider use of enhanced modeling concepts (optional step) Objective To consider the use of enhanced modeling concepts, such as specialization/ generalization, aggregation, and composition. In this step, we have the option to continue the development of the ER model using the advanced modeling concepts discussed in Chapter 12, namely specialization/generalization, aggregation, and composition. If we select the specialization approach, we attempt to highlight differences between entities by defining one or more subclasses of a superclass entity. 
If we select the generalization approach, we attempt to identify common features between entities to define a generalizing superclass entity. We may use aggregation to represent a ‘has-a’ or ‘is-part-of’ relationship between entity types, where one represents the ‘whole’ and the other ‘the part’. We may use composition (a special type of aggregation) to represent an association between entity types where there is a strong ownership and coincidental lifetime between the ‘whole’ and the ‘part’. For the Staff user views of DreamHome, we choose to generalize the two entities PrivateOwner and BusinessOwner to create a superclass Owner that contains the common attributes ownerNo, address, and telNo. The relationship that the Owner superclass has with its subclasses is mandatory and disjoint, denoted as {Mandatory, Or}; each member of the Owner superclass must be a member of one of the subclasses, but cannot belong to both. In addition, we identify one specialization subclass of Staff, namely Supervisor, specifically to model the Supervises relationship. The relationship that the Staff superclass has with the Supervisor subclass is optional: a member of the Staff superclass does not necessarily have to be a member of the Supervisor subclass. To keep the design simple, we decide not to use aggregation or composition. The revised ER diagram for the Staff user views of DreamHome is shown in Figure 15.6. There are no strict guidelines on when to develop the ER model using advanced modeling concepts, as the choice is often subjective and dependent on the particular characteristics of the situation that is being modeled. As a useful ‘rule of thumb’ when considering the use of these concepts, always attempt to represent the important entities and their relationships as clearly as possible in the ER diagram. Therefore, the use of advanced modeling concepts should be guided by the readability of the ER diagram and the clarity by which it models the important entities and relationships. These concepts are associated with enhanced ER modeling. However, as this step is optional, we simply use the term ‘ER diagram’ when referring to the diagrammatic representation of data models throughout the methodology. Step 1.7 Check model for redundancy Objective To check for the presence of any redundancy in the model. In this step, we examine the conceptual data model with the specific objective of identifying whether there is any redundancy present and removing any that does exist. The three activities in this step are: | 453 454 | Chapter 15 z Methodology – Conceptual Database Design Figure 15.6 Revised ER diagram for the Staff user views of DreamHome with specialization/generalization added. (1) re-examine one-to-one (1:1) relationships; (2) remove redundant relationships; (3) consider time dimension. (1) Re-examine one-to-one (1:1) relationships In the identification of entities, we may have identified two entities that represent the same object in the enterprise. For example, we may have identified the two entities Client and Renter that are actually the same; in other words, Client is a synonym for Renter. In this case, the two entities should be merged together. If the primary keys are different, choose one of them to be the primary key and leave the other as an alternate key. (2) Remove redundant relationships A relationship is redundant if the same information can be obtained via other relationships. 
We are trying to develop a minimal data model and, as redundant relationships are 15.3 Conceptual Database Design Methodology | 455 Figure 15.7 Remove the redundant relationship called Rents. unnecessary, they should be removed. It is relatively easy to identify whether there is more than one path between two entities. However, this does not necessarily imply that one of the relationships is redundant, as they may represent different associations between the entities. For example, consider the relationships between the PropertyForRent, Lease, and Client entities shown in Figure 15.7. There are two ways to find out which clients rent which properties. There is the direct route using the Rents relationship between the Client and PropertyForRent entities and there is the indirect route using the Holds and AssociatedWith relationships via the Lease entity. Before we can assess whether both routes are required, we need to establish the purpose of each relationship. The Rents relationship indicates which client rents which property. On the other hand, the Holds relationship indicates which client holds which lease, and the AssociatedWith relationship indicates which properties are associated with which leases. Although it is true that there is a relationship between clients and the properties they rent, this is not a direct relationship and the association is more accurately represented through a lease. The Rents relationship is therefore redundant and does not convey any additional information about the relationship between PropertyForRent and Client that cannot more correctly be found through the Lease entity. To ensure that we create a minimal model, the redundant Rents relationship must be removed. (3) Consider time dimension The time dimension of relationships is important when assessing redundancy. For example, consider the situation where we wish to model the relationships between the entities Man, Woman, and Child, as illustrated in Figure 15.8. Clearly, there are two paths between Man and Child: one via the direct relationship FatherOf and the other via the relationships MarriedTo and MotherOf. Consequently, we may think that the relationship FatherOf is unnecessary. However, this would be incorrect for two reasons: (1) The father may have children from a previous marriage, and we are modeling only the father’s current marriage through a 1:1 relationship. 456 | Chapter 15 z Methodology – Conceptual Database Design Figure 15.8 Example of a non-redundant relationship FatherOf. (2) The father and mother may not be married, or the father may be married to someone other than the mother (or the mother may be married to someone who is not the father). In either case, the required relationship could not be modeled without the FatherOf relationship. The message is that it is important to examine the meaning of each relationship between entities when assessing redundancy. At the end of this step, we have simplified the local conceptual data model by removing any inherent redundancy. Step 1.8 Validate conceptual model against user transactions Objective To ensure that the conceptual model supports the required transactions. We now have a conceptual data model that represents the data requirements of the enterprise. The objective of this step is to check the model to ensure that the model supports the required transactions. Using the model, we attempt to perform the operations manually. 
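For instance, the transaction 'find which clients rent which properties' can be traced through the Client, Lease, and PropertyForRent entities using the Holds and AssociatedWith relationships, even though the redundant Rents relationship has been removed. At this stage the check is performed on the model itself; purely as an illustrative sketch of where that pathway leads once the design is implemented (the relation and attribute names are assumed from the DreamHome case study), the same route corresponds to a simple join:

-- Each lease links one client to one property, so the client/property association
-- survives the removal of the direct Rents relationship.
SELECT c.clientNo, c.fName, c.lName, p.propertyNo, p.rent
FROM Client AS c
JOIN Lease AS l ON l.clientNo = c.clientNo
JOIN PropertyForRent AS p ON p.propertyNo = l.propertyNo;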
If we can resolve all transactions in this way, we have checked that the conceptual data model supports the required transactions. However, if we are unable to perform a transaction manually there must be a problem with the data model, which must be resolved. In this case, it is likely that we have omitted an entity, a relationship, or an attribute from the data model. We examine two possible approaches to ensuring that the conceptual data model supports the required transactions: (1) describing the transactions; (2) using transaction pathways. Describing the transaction Using the first approach, we check that all the information (entities, relationships, and their attributes) required by each transaction is provided by the model, by documenting a description of each transaction’s requirements. We illustrate this approach for an example DreamHome transaction listed in Appendix A from the Staff user views: 15.3 Conceptual Database Design Methodology Transaction (d) List the details of properties managed by a named member of staff at the branch The details of properties are held in the PropertyForRent entity and the details of staff who manage properties are held in the Staff entity. In this case, we can use the Staff Manages PropertyForRent relationship to produce the required list. Using transaction pathways The second approach to validating the data model against the required transactions involves diagrammatically representing the pathway taken by each transaction directly on the ER diagram. An example of this approach for the query transactions for the Staff user views listed in Appendix A is shown in Figure 15.9. Clearly, the more transactions that exist, the more complex this diagram would become, so for readability we may need several such diagrams to cover all the transactions. This approach allows the designer to visualize areas of the model that are not required by transactions and those areas that are critical to transactions. We are therefore in a Figure 15.9 Using pathways to check that the conceptual model supports the user transactions. | 457 458 | Chapter 15 z Methodology – Conceptual Database Design position to directly review the support provided by the data model for the transactions required. If there are areas of the model that do not appear to be used by any transactions, we may question the purpose of representing this information in the data model. On the other hand, if there are areas of the model that are inadequate in providing the correct pathway for a transaction, we may need to investigate the possibility that critical entities, relationships, or attributes have been missed. It may look like a lot of hard work to check every transaction that the model has to support in this way, and it certainly can be. As a result, it may be tempting to omit this step. However, it is very important that these checks are performed now rather than later when it is much more difficult and expensive to resolve any errors in the data model. Step 1.9 Review conceptual data model with user Objective To review the conceptual data model with the users to ensure that they consider the model to be a ‘true’ representation of the data requirements of the enterprise. Before completing Step 1, we review the conceptual data model with the user. The conceptual data model includes the ER diagram and the supporting documentation that describes the data model. If any anomalies are present in the data model, we must make the appropriate changes, which may require repeating the previous step(s). 
We repeat this process until the user is prepared to ‘sign off’ the model as being a ‘true’ representation of the part of the enterprise that we are modeling. The steps in this methodology are summarized in Appendix G. The next chapter describes the steps of the logical database design methodology. Chapter Summary n A design methodology is a structured approach that uses procedures, techniques, tools, and documentation aids to support and facilitate the process of design. n Database design includes three main phases: conceptual, logical, and physical database design. n Conceptual database design is the process of constructing a model of the data used in an enterprise, independent of all physical considerations. n Conceptual database design begins with the creation of a conceptual data model of the enterprise, which is entirely independent of implementation details such as the target DBMS, application programs, programming languages, hardware platform, performance issues, or any other physical considerations. n Logical database design is the process of constructing a model of the data used in an enterprise based on a specific data model (such as the relational model), but independent of a particular DBMS and other physical considerations. Logical database design translates the conceptual data model into a logical data model of the enterprise. Review Questions | 459 n Physical database design is the process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures. n The physical database design phase allows the designer to make decisions on how the database is to be implemented. Therefore, physical design is tailored to a specific DBMS. There is feedback between physical and conceptual/logical design, because decisions taken during physical design to improve performance may affect the structure of the conceptual/logical data model. n There are several critical factors for the success of the database design stage including, for example, working interactively with users and being willing to repeat steps. n The main objective of Step 1 of the methodology is to build a conceptual data model of the data requirements of the enterprise. A conceptual data model comprises: entity types, relationship types, attributes, attribute domains, primary keys, and alternate keys. n A conceptual data model is supported by documentation, such as ER diagrams and a data dictionary, which is produced throughout the development of the model. n The conceptual data model is validated to ensure it supports the required transactions. Two possible approaches to ensure that the conceptual data model supports the required transactions are: (1) checking that all the information (entities, relationships, and their attributes) required by each transaction is provided by the model by documenting a description of each transaction’s requirements; (2) diagrammatically representing the pathway taken by each transaction directly on the ER diagram. Review Questions 15.1 Describe the purpose of a design methodology. 15.2 Describe the main phases involved in database design. 15.3 Identify important factors in the success of database design. 15.4 Discuss the important role played by users in the process of database design. 15.5 Describe the main objective of conceptual database design. 
15.6 Identify the main steps associated with conceptual database design.
15.7 How would you identify entity and relationship types from a user's requirements specification?
15.8 How would you identify attributes from a user's requirements specification and then associate the attributes with entity or relationship types?
15.9 Describe the purpose of specialization/generalization of entity types, and discuss why this is an optional step in conceptual database design.
15.10 How would you check a data model for redundancy? Give an example to illustrate your answer.
15.11 Discuss why you would want to validate a conceptual data model and describe two approaches to validating a conceptual model.
15.12 Identify and describe the purpose of the documentation generated during conceptual database design.

Exercises

The DreamHome case study

15.13 Create a conceptual data model for the Branch user views of DreamHome documented in Appendix A. Compare your ER diagram with Figure 12.8 and justify any differences found.
15.14 Show that all the query transactions for the Branch user views of DreamHome listed in Appendix A are supported by your conceptual data model.

The University Accommodation Office case study

15.15 Provide a user's requirements specification for the University Accommodation Office case study documented in Appendix B.1.
15.16 Create a conceptual data model for the case study. State any assumptions necessary to support your design. Check that the conceptual data model supports the required transactions.

The EasyDrive School of Motoring case study

15.17 Provide a user's requirements specification for the EasyDrive School of Motoring case study documented in Appendix B.2.
15.18 Create a conceptual data model for the case study. State any assumptions necessary to support your design. Check that the conceptual data model supports the required transactions.

The Wellmeadows Hospital case study

15.19 Identify user views for the Medical Director and Charge Nurse in the Wellmeadows Hospital case study described in Appendix B.3.
15.20 Provide a user's requirements specification for each of these user views.
15.21 Create conceptual data models for each of the user views. State any assumptions necessary to support your design.

Chapter 16 Methodology – Logical Database Design for the Relational Model

Chapter Objectives

In this chapter you will learn:

- How to derive a set of relations from a conceptual data model.
- How to validate these relations using the technique of normalization.
- How to validate a logical data model to ensure it supports the required transactions.
- How to merge local logical data models based on one or more user views into a global logical data model that represents all user views.
- How to ensure that the final logical data model is a true and accurate representation of the data requirements of the enterprise.

In Chapter 9, we described the main stages of the database system development lifecycle, one of which is database design. This stage is made up of three phases, namely conceptual, logical, and physical database design. In the previous chapter we introduced a methodology that describes the steps that make up the three phases of database design and then presented Step 1 of this methodology for conceptual database design. In this chapter we describe Step 2 of the methodology, which translates the conceptual model produced in Step 1 into a logical data model.
The methodology for logical database design described in this book also includes an optional Step 2.6, which is required when the database has multiple user views that are managed using the view integration approach (see Section 9.5). In this case, we repeat Step 1 through Step 2.5 as necessary to create the required number of local logical data models, which are then finally merged in Step 2.6 to form a global logical data model. A local logical data model represents the data requirements of one or more but not all user views of a database and a global logical data model represents the data requirements for all user views (see Section 9.5). However, on concluding Step 2.6 we cease to use the term ‘global logical data model’ and simply refer to the final model as being a ‘logical data model’. The final step of the logical database design phase is to consider how well the model is able to support possible future developments for the database system. It is the logical data model created in Step 2 that forms the starting point for physical database design, which is described as Steps 3 to 8 in Chapters 17 and 18. Throughout the methodology the terms ‘entity’ and ‘relationship’ are used in place of ‘entity type’ and 462 | Chapter 16 z Methodology – Logical Database Design for the Relational Model ‘relationship type’ where the meaning is obvious; ‘type’ is generally only added to avoid ambiguity. 16.1 Logical Database Design Methodology for the Relational Model This section describes the steps of the logical database design methodology for the relational model. Step 2 Build and Validate Logical Data Model Objective To translate the conceptual data model into a logical data model and then to validate this model to check that it is structurally correct and able to support the required transactions. In this step, the main objective is to translate the conceptual data model created in Step 1 into a logical data model of the data requirements of the enterprise. This objective is achieved by following the activities listed below: Step 2.1 Step 2.2 Step 2.3 Step 2.4 Step 2.5 Step 2.6 Step 2.7 Derive relations for logical data model Validate relations using normalization Validate relations against user transactions Check integrity constraints Review logical data model with user Merge logical data models into global model (optional step) Check for future growth We begin by deriving a set of relations (relational schema) from the conceptual data model created in Step 1. The structure of the relational schema is validated using normalization and then checked to ensure that the relations are capable of supporting the transactions given in the users’ requirements specification. We next check that all important integrity constraints are represented by the logical data model. At this stage the logical data model is validated by the users to ensure that they consider the model to be a true representation of the data requirements of the enterprise. The methodology for Step 2 is presented so that it is applicable for the design of simple to complex database systems. For example, to create a database with a single user view or with multiple user views that are managed using the centralized approach (see Section 9.5) then Step 2.6 is omitted. If, however, the database has multiple user views that are being managed using the view integration approach (see Section 9.5) then Steps 2.1 to 2.5 are repeated for the required number of data models, each of which represents different user views of the database system. 
In Step 2.6 these data models are merged. Step 2 concludes with an assessment of the logical data model, which may or may not have involved Step 2.6, to ensure that the final model is able to support possible future developments. On completion of Step 2 we should have a single logical data model that is a correct, comprehensive, and unambiguous representation of the data requirements of the enterprise. 16.1 Logical Database Design Methodology for the Relational Model We demonstrate Step 2 using the conceptual data model created in the previous chapter for the Staff user views of the DreamHome case study and represented in Figure 16.1 as an ER diagram. We also use the Branch user views of DreamHome, which is represented in Figure 12.8 as an ER diagram to illustrate some concepts that are not present in the Staff user views and to demonstrate the merging of data models in Step 2.6. Step 2.1 Derive relations for logical data model Objective To create relations for the logical data model to represent the entities, relationships, and attributes that have been identified. In this step, we derive relations for the logical data model to represent the entities, relationships, and attributes. We describe the composition of each relation using a Database Definition Language (DBDL) for relational databases. Using the DBDL, we first specify the name of the relation followed by a list of the relation’s simple attributes enclosed in brackets. We then identify the primary key and any alternate and/or foreign key(s) of the relation. Following the identification of a foreign key, the relation containing the referenced primary key is given. Any derived attributes are also listed together with how each one is calculated. The relationship that an entity has with another entity is represented by the primary key/ foreign key mechanism. In deciding where to post (or place) the foreign key attribute(s), we must first identify the ‘parent’ and ‘child’ entities involved in the relationship. The parent entity refers to the entity that posts a copy of its primary key into the relation that represents the child entity, to act as the foreign key. We describe how relations are derived for the following structures that may occur in a conceptual data model: (1) (2) (3) (4) (5) (6) (7) (8) (9) strong entity types; weak entity types; one-to-many (1:*) binary relationship types; one-to-one (1:1) binary relationship types; one-to-one (1:1) recursive relationship types; superclass/subclass relationship types; many-to-many (*:*) binary relationship types; complex relationship types; multi-valued attributes. For most of the examples discussed below we use the conceptual data model for the Staff user views of DreamHome, which is represented as an ER diagram in Figure 16.1. (1) Strong entity types For each strong entity in the data model, create a relation that includes all the simple attributes of that entity. For composite attributes, such as name, include only the constituent | 463 464 | Chapter 16 z Methodology – Logical Database Design for the Relational Model rentFinish /deposit /duration Figure 16.1 Conceptual data model for the Staff user views showing all attributes. simple attributes, namely, fName and lName in the relation. 
For example, the composition of the Staff relation shown in Figure 16.1 is: Staff (staffNo, fName, lName, position, sex, DOB) Primary Key staffNo 16.1 Logical Database Design Methodology for the Relational Model (2) Weak entity types For each weak entity in the data model, create a relation that includes all the simple attributes of that entity. The primary key of a weak entity is partially or fully derived from each owner entity and so the identification of the primary key of a weak entity cannot be made until after all the relationships with the owner entities have been mapped. For example, the weak entity Preference in Figure 16.1 is initially mapped to the following relation: Preference (prefType, maxRent) Primary Key None (at present) In this situation, the primary key for the Preference relation cannot be identified until after the States relationship has been appropriately mapped. (3) One-to-many (1:*) binary relationship types For each 1:* binary relationship, the entity on the ‘one side’ of the relationship is designated as the parent entity and the entity on the ‘many side’ is designated as the child entity. To represent this relationship, we post a copy of the primary key attribute(s) of the parent entity into the relation representing the child entity, to act as a foreign key. For example, the Staff Registers Client relationship shown in Figure 16.1 is a 1:* relationship, as a single member of staff can register many clients. In this example Staff is on the ‘one side’ and represents the parent entity, and Client is on the ‘many side’ and represents the child entity. The relationship between these entities is established by placing a copy of the primary key of the Staff (parent) entity, staffNo, into the Client (child) relation. The composition of the Staff and Client relations is: In the case where a 1:* relationship has one or more attributes, these attributes should follow the posting of the primary key to the child relation. For example, if the Staff Registers Client relationship had an attribute called dateRegister representing when a member of staff registered the client, this attribute should also be posted to the Client relation along with the copy of the primary key of the Staff relation, namely staffNo. (4) One-to-one (1:1) binary relationship types Creating relations to represent a 1:1 relationship is slightly more complex as the cardinality cannot be used to help identify the parent and child entities in a relationship. Instead, the participation constraints (see Section 11.6.5) are used to help decide whether it is best to represent the relationship by combining the entities involved into one relation or by creating two relations and posting a copy of the primary key from one relation to the other. We consider how to create relations to represent the following participation constraints: | 465 466 | Chapter 16 z Methodology – Logical Database Design for the Relational Model (a) mandatory participation on both sides of 1:1 relationship; (b) mandatory participation on one side of 1:1 relationship; (c) optional participation on both sides of 1:1 relationship. (a) Mandatory participation on both sides of 1:1 relationship In this case we should combine the entities involved into one relation and choose one of the primary keys of the original entities to be the primary key of the new relation, while the other (if one exists) is used as an alternate key. The Client States Preference relationship is an example of a 1:1 relationship with mandatory participation on both sides. 
In this case, we choose to merge the two relations together to give the following Client relation: Client (clientNo, fName, lName, telNo, prefType, maxRent, staffNo) Primary Key clientNo Foreign Key staffNo references Staff(staffNo) In the case where a 1:1 relationship with mandatory participation on both sides has one or more attributes, these attributes should also be included in the merged relation. For example, if the States relationship had an attribute called dateStated recording the date the preferences were stated, this attribute would also appear as an attribute in the merged Client relation. Note that it is only possible to merge two entities into one relation when there are no other direct relationships between these two entities that would prevent this, such as a 1:* relationship. If this were the case, we would need to represent the States relationship using the primary key/foreign key mechanism. We discuss how to designate the parent and child entities in this type of situation in part (c) shortly. (b) Mandatory participation on one side of a 1:1 relationship In this case we are able to identify the parent and child entities for the 1:1 relationship using the participation constraints. The entity that has optional participation in the relationship is designated as the parent entity, and the entity that has mandatory participation in the relationship is designated as the child entity. As described above, a copy of the primary key of the parent entity is placed in the relation representing the child entity. If the relationship has one or more attributes, these attributes should follow the posting of the primary key to the child relation. For example, if the 1:1 Client States Preference relationship had partial participation on the Client side (in other words, not every client specifies preferences), then the Client entity would be designated as the parent entity and the Preference entity would be designated as the child entity. Therefore, a copy of the primary key of the Client (parent) entity, clientNo, would be placed in the Preference (child) relation, giving: 16.1 Logical Database Design Methodology for the Relational Model Note that the foreign key attribute of the Preference relation also forms the relation’s primary key. In this situation, the primary key for the Preference relation could not have been identified until after the foreign key had been posted from the Client relation to the Preference relation. Therefore, at the end of this step we should identify any new primary key or candidate keys that have been formed in the process, and update the data dictionary accordingly. (c) Optional participation on both sides of a 1:1 relationship In this case the designation of the parent and child entities is arbitrary unless we can find out more about the relationship that can help a decision to be made one way or the other. For example, consider how to represent a 1:1 Staff Uses Car relationship with optional participation on both sides of the relationship. (Note that the discussion that follows is also relevant for 1:1 relationships with mandatory participation for both entities where we cannot select the option to combine the entities into a single relation.) If there is no additional information to help select the parent and child entities, the choice is arbitrary. In other words, we have the choice to post a copy of the primary key of the Staff entity to the Car entity, or vice versa. 
However, assume that the majority of cars, but not all, are used by staff and only a minority of staff use cars. The Car entity, although optional, is closer to being mandatory than the Staff entity. We therefore designate Staff as the parent entity and Car as the child entity, and post a copy of the primary key of the Staff entity (staffNo) into the Car relation. (5) One-to-one (1:1) recursive relationships For a 1:1 recursive relationship, follow the rules for participation as described above for a 1:1 relationship. However, in this special case of a 1:1 relationship, the entity on both sides of the relationship is the same. For a 1:1 recursive relationship with mandatory participation on both sides, represent the recursive relationship as a single relation with two copies of the primary key. As before, one copy of the primary key represents a foreign key and should be renamed to indicate the relationship it represents. For a 1:1 recursive relationship with mandatory participation on only one side, we have the option to create a single relation with two copies of the primary key as described above, or to create a new relation to represent the relationship. The new relation would only have two attributes, both copies of the primary key. As before, the copies of the primary keys act as foreign keys and have to be renamed to indicate the purpose of each in the relation. For a 1:1 recursive relationship with optional participation on both sides, again create a new relation as described above. (6) Superclass/subclass relationship types For each superclass/subclass relationship in the conceptual data model, we identify the superclass entity as the parent entity and the subclass entity as the child entity. There are various options on how to represent such a relationship as one or more relations. The selection of the most appropriate option is dependent on a number of factors such as the disjointness and participation constraints on the superclass/subclass relationship (see Section 12.1.6), whether the subclasses are involved in distinct relationships, and the | 467 468 | Chapter 16 z Methodology – Logical Database Design for the Relational Model Table 16.1 Guidelines for the representation of a superclass/subclass relationship based on the participation and disjoint constraints. Participation constraint Disjoint constraint Relations required Mandatory Nondisjoint {And} Optional Nondisjoint {And} Mandatory Disjoint {Or} Optional Disjoint {Or} Single relation (with one or more discriminators to distinguish the type of each tuple) Two relations: one relation for superclass and one relation for all subclasses (with one or more discriminators to distinguish the type of each tuple) Many relations: one relation for each combined superclass/subclass Many relations: one relation for superclass and one for each subclass number of participants in the superclass/subclass relationship. Guidelines for the representation of a superclass/subclass relationship based only on the participation and disjoint constraints are shown in Table 16.1. For example, consider the Owner superclass/subclass relationship shown in Figure 16.1. From Table 16.1 there are various ways to represent this relationship as one or more relations, as shown in Figure 16.2. The options range from placing all the attributes into one relation with two discriminators pOwnerFlag and bOwnerFlag indicating whether a tuple belongs to a particular subclass (Option 1), to dividing the attributes into three relations (Option 4). 
In this case the most appropriate representation of the superclass/subclass relationship is determined by the constraints on this relationship. From Figure 16.1 the relationship that the Owner superclass has with its subclasses is mandatory and disjoint, as each member of the Owner superclass must be a member of one of the subclasses (PrivateOwner or BusinessOwner) but cannot belong to both. We therefore select Option 3 as the best representation of this relationship and create a separate relation to represent each subclass, including a copy of the primary key attribute(s) of the superclass in each.

It must be stressed that Table 16.1 is for guidance only and there may be other factors that influence the final choice. For example, with Option 1 (mandatory, nondisjoint) we have chosen to use two discriminators to distinguish whether the tuple is a member of a particular subclass. An equally valid way to represent this would be to have one discriminator that distinguishes whether the tuple is a member of PrivateOwner, BusinessOwner, or both. Alternatively, we could dispense with discriminators altogether and simply test whether one of the attributes unique to a particular subclass has a value present to determine whether the tuple is a member of that subclass. In this case, we would have to ensure that the attribute examined was a required attribute (and so must not allow nulls).

Figure 16.2 Various representations of the Owner superclass/subclass relationship based on the participation and disjointness constraints shown in Table 16.1.

In Figure 16.1 there is another superclass/subclass relationship between Staff and Supervisor with optional participation. However, as the Staff superclass has only one subclass (Supervisor), there is no disjoint constraint. In this case, as there are many more 'supervised staff' than supervisors, we choose to represent this relationship as a single relation:

Staff (staffNo, fName, lName, position, sex, DOB, supervisorStaffNo)
Primary Key staffNo
Foreign Key supervisorStaffNo references Staff(staffNo)

If we had left the superclass/subclass relationship as a 1:* recursive relationship, as we had it originally in Figure 15.5, with optional participation on both sides, this would have resulted in the same representation as above.

(7) Many-to-many (*:*) binary relationship types

For each *:* binary relationship, create a relation to represent the relationship and include any attributes that are part of the relationship. We post a copy of the primary key attribute(s) of the entities that participate in the relationship into the new relation, to act as foreign keys. One or both of these foreign keys will also form the primary key of the new relation, possibly in combination with one or more of the attributes of the relationship. (If one or more of the attributes that form the relationship provide uniqueness, then an entity has been omitted from the conceptual data model, although this mapping process resolves this.)

For example, consider the *:* relationship Client Views PropertyForRent shown in Figure 16.1. In this example, the Views relationship has two attributes, viewDate and comment (as documented in Step 1.3).
To represent this, we create relations for the strong entities Client and PropertyForRent and we create a relation Viewing to represent the relationship Views, to give:

(8) Complex relationship types

For each complex relationship, create a relation to represent the relationship and include any attributes that are part of the relationship. We post a copy of the primary key attribute(s) of the entities that participate in the complex relationship into the new relation, to act as foreign keys. Any foreign keys that represent a 'many' relationship (for example, 1..*, 0..*) generally will also form the primary key of this new relation, possibly in combination with some of the attributes of the relationship.

For example, the ternary Registers relationship in the Branch user views represents the association between the member of staff who registers a new client at a branch, as shown in Figure 12.8. To represent this, we create relations for the strong entities Branch, Staff, and Client, and we create a relation Registration to represent the relationship Registers, to give:

Note that the Registers relationship is shown as a binary relationship in Figure 16.1 and this is consistent with its composition in Figure 16.3. The discrepancy between how Registers is modeled in the Staff and Branch user views of DreamHome is discussed and resolved in Step 2.6.

Figure 16.3 Relations for the Staff user views of DreamHome.

(9) Multi-valued attributes

For each multi-valued attribute in an entity, create a new relation to represent the multi-valued attribute and include the primary key of the entity in the new relation, to act as a foreign key. Unless the multi-valued attribute is itself an alternate key of the entity, the primary key of the new relation is the combination of the multi-valued attribute and the primary key of the entity. For example, in the Branch user views to represent the situation where a single branch has up to three telephone numbers, the telNo attribute of the Branch entity has been defined as being a multi-valued attribute, as shown in Figure 12.8. To represent this, we create a relation for the Branch entity and we create a new relation called Telephone to represent the multi-valued attribute telNo, to give:

Table 16.2 summarizes how to map entities and relationships to relations.

Table 16.2 Summary of how to map entities and relationships to relations.

Strong entity: Create relation that includes all simple attributes.
Weak entity: Create relation that includes all simple attributes (primary key still has to be identified after the relationship with each owner entity has been mapped).
1:* binary relationship: Post primary key of entity on 'one' side to act as foreign key in relation representing entity on 'many' side. Any attributes of relationship are also posted to 'many' side.
1:1 binary relationship:
(a) Mandatory participation on both sides: Combine entities into one relation.
(b) Mandatory participation on one side: Post primary key of entity on 'optional' side to act as foreign key in relation representing entity on 'mandatory' side.
(c) Optional participation on both sides: Arbitrary without further information.
Superclass/subclass relationship: See Table 16.1.
*:* binary relationship, complex relationship: Create a relation to represent the relationship and include any attributes of the relationship. Post a copy of the primary keys from each of the owner entities into the new relation to act as foreign keys.
Multi-valued attribute: Create a relation to represent the multi-valued attribute and post a copy of the primary key of the owner entity into the new relation to act as a foreign key.

Document relations and foreign key attributes

At the end of Step 2.1, document the composition of the relations derived for the logical data model using the DBDL. The relations for the Staff user views of DreamHome are shown in Figure 16.3. Now that each relation has its full set of attributes, we are in a position to identify any new primary and/or alternate keys. This is particularly important for weak entities that rely on the posting of the primary key from the parent entity (or entities) to form a primary key of their own. For example, the weak entity Viewing now has a composite primary key made up of a copy of the primary key of the PropertyForRent entity (propertyNo) and a copy of the primary key of the Client entity (clientNo).

The DBDL syntax can be extended to show integrity constraints on the foreign keys (Step 2.5). The data dictionary should also be updated to reflect any new primary and alternate keys identified in this step. For example, following the posting of primary keys, the Lease relation has gained new alternate keys formed from the attributes (propertyNo, rentStart) and (clientNo, rentStart).

Step 2.2 Validate relations using normalization

Objective To validate the relations in the logical data model using normalization.
If, however, we identify relations that are not in 3NF, this may indicate that part of the logical data model and/or conceptual data model is incorrect, or that we have introduced an error when deriving the relations from the conceptual data model. If necessary, we must restructure the problem relation(s) and/or data model(s) to ensure a true representation of the data requirements of the enterprise. It is sometimes argued that a normalized database design does not provide maximum processing efficiency. However, the following points can be argued: n n n n A normalized design organizes the data according to its functional dependencies. Consequently, the process lies somewhere between conceptual and physical design. The logical design may not be the final design. It should represent the database designer’s best understanding of the nature and meaning of the data required by the enterprise. If there are specific performance criteria, the physical design may be different. One possibility is that some normalized relations are denormalized, and this approach is discussed in detail in Step 7 of the physical database design methodology (see Chapter 18). A normalized design is robust and free of the update anomalies discussed in Section 13.3. Modern computers are much more powerful than those that were available a few years ago. It is sometimes reasonable to implement a design that gains ease of use at the expense of additional processing. | 473 474 | Chapter 16 z Methodology – Logical Database Design for the Relational Model n n To use normalization a database designer must understand completely each attribute that is to be represented in the database. This benefit may be the most important. Normalization produces a flexible database design that can be extended easily. Step 2.3 Validate relations against user transactions Objective To ensure that the relations in the logical data model support the required transactions. The objective of this step is to validate the logical data model to ensure that the model supports the required transactions, as detailed in the users’ requirements specification. This type of check was carried out in Step 1.8 to ensure that the conceptual data model supported the required transactions. In this step, we check that the relations created in the previous step also support these transactions, and thereby ensure that no error has been introduced while creating relations. Using the relations, the primary key/foreign key links shown in the relations, the ER diagram, and the data dictionary, we attempt to perform the operations manually. If we can resolve all transactions in this way, we have validated the logical data model against the transactions. However, if we are unable to perform a transaction manually, there must be a problem with the data model, which has to be resolved. In this case, it is likely that an error has been introduced while creating the relations, and we should go back and check the areas of the data model that the transaction is accessing to identify and resolve the problem. Step 2.4 Check integrity constraints Objective To check integrity constraints are represented in the logical data model. Integrity constraints are the constraints that we wish to impose in order to protect the database from becoming incomplete, inaccurate, or inconsistent. Although DBMS controls for integrity constraints may or may not exist, this is not the question here. 
At this stage we are concerned only with high-level design, that is, specifying what integrity constraints are required, irrespective of how this might be achieved. A logical data model that includes all important integrity constraints is a ‘true’ representation of the data requirements for the enterprise. We consider the following types of integrity constraint: n n n n n n required data; attribute domain constraints; multiplicity; entity integrity; referential integrity; general constraints. 16.1 Logical Database Design Methodology for the Relational Model Required data Some attributes must always contain a valid value; in other words, they are not allowed to hold nulls. For example, every member of staff must have an associated job position (such as Supervisor or Assistant). These constraints should have been identified when we documented the attributes in the data dictionary (Step 1.3). Attribute domain constraints Every attribute has a domain, that is, a set of values that are legal. For example, the sex of a member of staff is either ‘M’ or ‘F’, so the domain of the sex attribute is a single character string consisting of ‘M’ or ‘F’. These constraints should have been identified when we chose the attribute domains for the data model (Step 1.4). Multiplicity Multiplicity represents the constraints that are placed on relationships between data in the database. Examples of such constraints include the requirements that a branch has many staff and a member of staff works at a single branch. Ensuring that all appropriate integrity constraints are identified and represented is an important part of modeling the data requirements of an enterprise. In Step 1.2 we defined the relationships between entities, and all integrity constraints that can be represented in this way were defined and documented in this step. Entity integrity The primary key of an entity cannot hold nulls. For example, each tuple of the Staff relation must have a value for the primary key attribute, staffNo. These constraints should have been considered when we identified the primary keys for each entity type (Step 1.5). Referential integrity A foreign key links each tuple in the child relation to the tuple in the parent relation containing the matching candidate key value. Referential integrity means that if the foreign key contains a value, that value must refer to an existing tuple in the parent relation. For example, consider the Staff Manages PropertyForRent relationship. The staffNo attribute in the PropertyForRent relation links the property for rent to the tuple in the Staff relation containing the member of staff who manages that property. If staffNo is not null, it must contain a valid value that exists in the staffNo attribute of the Staff relation, or the property will be assigned to a non-existent member of staff. There are two issues regarding foreign keys that must be addressed. The first considers whether nulls are allowed for the foreign key. For example, can we store the details of a property for rent without having a member of staff specified to manage it (that is, can we specify a null staffNo)? The issue is not whether the staff number exists, but whether a staff number must be specified. In general, if the participation of the child relation in the relationship is: | 475 476 | Chapter 16 z Methodology – Logical Database Design for the Relational Model n n mandatory, then nulls are not allowed; optional, then nulls are allowed. The second issue we must address is how to ensure referential integrity. 
To do this, we specify existence constraints that define conditions under which a candidate key or foreign key may be inserted, updated, or deleted. For the 1:* Staff Manages PropertyForRent relationship consider the following cases. Case 1: Insert tuple into child relation (PropertyForRent) To ensure referential integrity, check that the foreign key attribute, staffNo, of the new PropertyForRent tuple is set to null or to a value of an existing Staff tuple. Case 2: Delete tuple from child relation (PropertyForRent) If a tuple of a child relation is deleted referential integrity is unaffected. Case 3: Update foreign key of child tuple (PropertyForRent) This is similar to Case 1. To ensure referential integrity, check that the staffNo of the updated PropertyForRent tuple is set to null or to a value of an existing Staff tuple. Case 4: Insert tuple into parent relation (Staff) Inserting a tuple into the parent relation (Staff) does not affect referential integrity; it simply becomes a parent without any children: in other words, a member of staff without properties to manage. Case 5: Delete tuple from parent relation (Staff) If a tuple of a parent relation is deleted, referential integrity is lost if there exists a child tuple referencing the deleted parent tuple, in other words if the deleted member of staff currently manages one or more properties. There are several strategies we can consider: n n n NO ACTION Prevent a deletion from the parent relation if there are any referenced child tuples. In our example, ‘You cannot delete a member of staff if he or she currently manages any properties’. CASCADE When the parent tuple is deleted automatically delete any referenced child tuples. If any deleted child tuple acts as the parent in another relationship then the delete operation should be applied to the tuples in this child relation and so on in a cascading manner. In other words, deletions from the parent relation cascade to the child relation. In our example, ‘Deleting a member of staff automatically deletes all properties he or she manages’. Clearly, in this situation, this strategy would not be wise. If we have used the advanced modeling technique of composition to relate the parent and child entities, CASCADE should be specified (see Section 12.3). SET NULL When a parent tuple is deleted, the foreign key values in all corresponding child tuples are automatically set to null. In our example, ‘If a member of staff is deleted, indicate that the current assignment of those properties previously managed by that employee is unknown’. We can only consider this strategy if the attributes comprising the foreign key are able to accept nulls. 16.1 Logical Database Design Methodology for the Relational Model n n | 477 SET DEFAULT When a parent tuple is deleted, the foreign key values in all corresponding child tuples should automatically be set to their default values. In our example, ‘If a member of staff is deleted, indicate that the current assignment of some properties is being handled by another (default) member of staff such as the Manager’. We can only consider this strategy if the attributes comprising the foreign key have default values defined. NO CHECK When a parent tuple is deleted, do nothing to ensure that referential integrity is maintained. 
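As an indication of how such a strategy might eventually be declared (a sketch only, anticipating the physical design phase; the choice of SET NULL is simply one of the options listed above), the Staff Manages PropertyForRent foreign key could carry the delete rule directly:

-- Sketch: assumes the PropertyForRent and Staff relations have already been created.
ALTER TABLE PropertyForRent
    ADD CONSTRAINT PropertyStaffFK
    FOREIGN KEY (staffNo) REFERENCES Staff(staffNo)
    ON DELETE SET NULL;   -- deleting a member of staff leaves the property unassigned

If no rule is stated, most SQL implementations default to NO ACTION, which simply rejects the offending delete.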
Case 6: Update primary key of parent tuple (Staff) If the primary key value of a parent relation tuple is updated, referential integrity is lost if there exists a child tuple referencing the old primary key value; that is, if the updated member of staff currently manages one or more properties. To ensure referential integrity, the strategies described above can be used. In the case of CASCADE, the updates to the primary key of the parent tuple are reflected in any referencing child tuples, and if a referencing child tuple is itself a primary key of a parent tuple, this update will also cascade to its referencing child tuples, and so on in a cascading manner. It is normal for updates to be specified as CASCADE. The referential integrity constraints for the relations that have been created for the Staff user views of DreamHome are shown in Figure 16.4. Figure 16.4 Referential integrity constraints for the relations in the Staff user views of DreamHome. 478 | Chapter 16 z Methodology – Logical Database Design for the Relational Model General constraints Finally, we consider constraints known as general constraints. Updates to entities may be controlled by constraints governing the ‘real world’ transactions that are represented by the updates. For example, DreamHome has a rule that prevents a member of staff from managing more than 100 properties at the same time. Document all integrity constraints Document all integrity constraints in the data dictionary for consideration during physical design. Step 2.5 Review logical data model with user Objective To review the logical data model with the users to ensure that they consider the model to be a true representation of the data requirements of the enterprise. The logical data model should now be complete and fully documented. However, to confirm this is the case, users are requested to review the logical data model to ensure that they consider the model to be a true representation of the data requirements of the enterprise. If the users are dissatisfied with the model then some repetition of earlier steps in the methodology may be required. If the users are satisfied with the model then the next step taken depends on the number of user views associated with the database and, more importantly, how they are being managed. If the database system has a single user view or multiple user views that are being managed using the centralization approach (see Section 9.5) then we proceed directly to the final step of Step 2, namely Step 2.7. If the database has multiple user views that are being managed using the view integration approach (see Section 9.5) then we proceed to Step 2.6. The view integration approach results in the creation of several logical data models each of which represents one or more, but not all, user views of a database. The purpose of Step 2.6 is to merge these data models to create a single logical data model that represents all user views of a database. However, before we consider this step we discuss briefly the relationship between logical data models and data flow diagrams. Relationship between logical data model and data flow diagrams A logical data model reflects the structure of stored data for an enterprise. A Data Flow Diagram (DFD) shows data moving about the enterprise and being stored in datastores. All attributes should appear within an entity type if they are held within the enterprise, and will probably be seen flowing around the enterprise as a data flow. 
When these two techniques are being used to model the users’ requirements specification, we can use each one to check the consistency and completeness of the other. The rules that control the relationship between the two techniques are: 16.1 Logical Database Design Methodology for the Relational Model n each datastore should represent a whole number of entity types; n attributes on data flows should belong to entity types. Step 2.6 Merge logical data models into global model (optional step) Objective To merge local logical data models into a single global logical data model that represents all user views of a database. This step is only necessary for the design of a database with multiple user views that are being managed using the view integration approach. To facilitate the description of the merging process we use the terms ‘local logical data model’ and ‘global logical data model’. A local logical data model represents one or more but not all user views of a database whereas global logical data model represents all user views of a database. In this step we merge two or more local logical data models into a single global logical data model. The source of information for this step is the local data models created through Step 1 and Steps 2.1 to 2.5 of the methodology. Although each local logical data model should be correct, comprehensive, and unambiguous, each model is only a representation of one or more but not all user views of a database. In other words, each model represents only part of the complete database. This may mean that there are inconsistencies as well as overlaps when we look at the complete set of user views. Thus, when we merge the local logical data models into a single global model, we must endeavor to resolve conflicts between the views and any overlaps that exist. Therefore, on completion of the merging process, the resulting global logical data model is subjected to validations similar to those performed on the local data models. The validations are particularly necessary and should be focused on areas of the model which are subjected to most change during the merging process. The activities in this step include: n n n Step 2.6.1 Merge local logical data models into global model Step 2.6.2 Validate global logical data model Step 2.6.3 Review global logical data model with users We demonstrate this step using the local logical data model developed above for the Staff user views of the DreamHome case study and using the model developed in Chapters 11 and 12 for the Branch user views of DreamHome. Figure 16.5 shows the relations created from the ER model for the Branch user views given in Figure 12.8. We leave it as an exercise for the reader to show that this mapping is correct (see Exercise 16.6). Step 2.6.1 Merge logical data models into global model Objective To merge local logical data models into a single global logical data model. Up to this point, for each local logical data model we have produced an ER diagram, a relational schema, a data dictionary, and supporting documentation that describes the | 479 480 | Chapter 16 z Methodology – Logical Database Design for the Relational Model Figure 16.5 Relations for the Branch user views of DreamHome. constraints on the data. In this step, we use these components to identify the similarities and differences between the models and thereby help merge the models together. 
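One practical way to gather the raw material for this comparison, assuming each local logical data model has been prototyped in an SQL DBMS that exposes the standard information_schema views (an assumption; the methodology itself requires only the data dictionary), is to list every relation together with its primary and alternate keys so that the same query run against two local schemas can be compared side by side:

-- Sketch: lists each relation with its primary key and unique (alternate) keys.
SELECT tc.table_name,
       tc.constraint_type,
       kcu.column_name
FROM   information_schema.table_constraints AS tc
JOIN   information_schema.key_column_usage AS kcu
       ON  kcu.constraint_schema = tc.constraint_schema
       AND kcu.constraint_name   = tc.constraint_name
WHERE  tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
ORDER  BY tc.table_name, tc.constraint_name, kcu.ordinal_position;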
For a simple database system with a small number of user views each with a small number of entity and relationship types, it is a relatively easy task to compare the local models, merge them together, and resolve any differences that exist. However, in a large system, a more systematic approach must be taken. We present one approach that may be used to merge the local models together and resolve any inconsistencies found. For a discussion on other approaches, the interested reader is referred to the papers by Batini and Lanzerini (1986), Biskup and Convent (1986), Spaccapietra et al. (1992) and Bouguettaya et al. (1998).

Some typical tasks in this approach are as follows:

(1) Review the names and contents of entities/relations and their candidate keys.
(2) Review the names and contents of relationships/foreign keys.
(3) Merge entities/relations from the local data models.
(4) Include (without merging) entities/relations unique to each local data model.
(5) Merge relationships/foreign keys from the local data models.
(6) Include (without merging) relationships/foreign keys unique to each local data model.
(7) Check for missing entities/relations and relationships/foreign keys.
(8) Check foreign keys.
(9) Check integrity constraints.
(10) Draw the global ER/relation diagram.
(11) Update the documentation.

In some of the above tasks, we have used the terms 'entities/relations' and 'relationships/foreign keys'. This allows the designer to choose whether to examine the ER models or the relations that have been derived from the ER models in conjunction with their supporting documentation, or even to use a combination of both approaches. It may be easier to base the examination on the composition of relations as this removes many syntactic and semantic differences that may exist between different ER models possibly produced by different designers.

Perhaps the easiest way to merge several local data models together is first to merge two of the data models to produce a new model, and then successively to merge the remaining local data models until all the local models are represented in the final global data model. This may prove a simpler approach than trying to merge all the local data models at the same time.

(1) Review the names and contents of entities/relations and their candidate keys

It may be worthwhile reviewing the names and descriptions of entities/relations that appear in the local data models by inspecting the data dictionary. Problems can arise when two or more entities/relations:

n have the same name but are, in fact, different (homonyms);
n are the same but have different names (synonyms).

It may be necessary to compare the data content of each entity/relation to resolve these problems. In particular, use the candidate keys to help identify equivalent entities/relations that may be named differently across views. A comparison of the relations in the Branch and Staff user views of DreamHome is shown in Table 16.3. The relations that are common to both sets of user views (Staff, PrivateOwner, BusinessOwner, Client, PropertyForRent, and Lease) are apparent from the comparison.

(2) Review the names and contents of relationships/foreign keys

This activity is the same as described for entities/relations. A comparison of the foreign keys in the Branch and Staff user views of DreamHome is shown in Table 16.4, which again makes the foreign keys common to both views apparent.

Table 16.3 A comparison of the names of entities/relations and their candidate keys in the Branch and Staff user views.

Branch user views
Entity/Relation     Candidate keys
Branch              branchNo; postcode
Telephone           telNo
Staff               staffNo
Manager             staffNo
PrivateOwner        ownerNo
BusinessOwner       bName; telNo
Client              clientNo
PropertyForRent     propertyNo
Lease               leaseNo; (propertyNo, rentStart); (clientNo, rentStart)
Registration        clientNo
Newspaper           newspaperName; telNo
Advert              (propertyNo, newspaperName, dateAdvert)

Staff user views
Entity/Relation     Candidate keys
Staff               staffNo
PrivateOwner        ownerNo
BusinessOwner       ownerNo; bName; telNo
Client              clientNo
PropertyForRent     propertyNo
Viewing             (clientNo, propertyNo)
Lease               leaseNo; (propertyNo, rentStart); (clientNo, rentStart)

Table 16.4 A comparison of the foreign keys in the Branch and Staff user views.

Branch user views
Child relation      Foreign keys            Parent relation
Telephone (a)       branchNo →              Branch(branchNo)
Branch              mgrStaffNo →            Manager(staffNo)
Staff               supervisorStaffNo →     Staff(staffNo)
                    branchNo →              Branch(branchNo)
Manager             staffNo →               Staff(staffNo)
PropertyForRent     ownerNo →               PrivateOwner(ownerNo)
                    bName →                 BusinessOwner(bName)
                    staffNo →               Staff(staffNo)
                    branchNo →              Branch(branchNo)
Lease               clientNo →              Client(clientNo)
                    propertyNo →            PropertyForRent(propertyNo)
Registration (b)    clientNo →              Client(clientNo)
                    branchNo →              Branch(branchNo)
                    staffNo →               Staff(staffNo)
Advert (c)          newspaperName →         Newspaper(newspaperName)
                    propertyNo →            PropertyForRent(propertyNo)

Staff user views
Child relation      Foreign keys            Parent relation
Staff               supervisorStaffNo →     Staff(staffNo)
Client              staffNo →               Staff(staffNo)
PropertyForRent     ownerNo →               PrivateOwner(ownerNo)
                    ownerNo →               BusinessOwner(ownerNo)
                    staffNo →               Staff(staffNo)
Viewing             clientNo →              Client(clientNo)
                    propertyNo →            PropertyForRent(propertyNo)
Lease               clientNo →              Client(clientNo)
                    propertyNo →            PropertyForRent(propertyNo)

(a) The Telephone relation is created from the multi-valued attribute telNo.
(b) The Registration relation is created from the ternary relationship Registers.
(c) The Advert relation is created from the many-to-many (*:*) relationship Advertises.

Note, in particular, that of the relations that are common to both views, the Staff and PropertyForRent relations have an extra foreign key, branchNo, in the Branch user views. This initial comparison of the relationship names/foreign keys in each view again gives some indication of the extent to which the views overlap. However, it is important to recognize that we should not rely too heavily on the fact that entities or relationships with the same name play the same role in both views. Nevertheless, comparing the names of entities/relations and relationships/foreign keys is a good starting point when searching for overlap between the views, as long as we are aware of the pitfalls. We must be careful of entities or relationships that have the same name but in fact represent different concepts (also called homonyms). An example of this occurrence is the Staff Manages PropertyForRent relationship (Staff view) and the Manager Manages Branch relationship (Branch view). Obviously, the Manages relationship in this case means something different in each view. We must therefore ensure that entities or relationships that have the same name represent the same concept in the 'real world', and that the names that differ in each view represent different concepts.
To achieve this, we compare the attributes (and, in particular, the keys) associated with each entity and also their associated relationships with other entities. We should also be aware that entities or relationships in one view may be represented simply as attributes in another view. For example, consider the scenario where the Branch entity has an attribute called managerName in one view, which is represented as an entity called Manager in another view. (3) Merge entities/relations from the local data models Examine the name and content of each entity/relation in the models to be merged to determine whether entities/relations represent the same thing and can therefore be merged. Typical activities involved in this task include: n n n merging entities/relations with the same name and the same primary key; merging entities/relations with the same name but different primary keys; merging entities/relations with different names using the same or different primary keys. Merging entities/relations with the same name and the same primary key Generally, entities/relations with the same primary key represent the same ‘real world’ object and should be merged. The merged entity/relation includes the attributes from the original entities/relations with duplicates removed. For example, Figure 16.6 lists the attributes associated with the relation PrivateOwner defined in the Branch and Staff user views. The primary key of both relations is ownerNo. We merge these two relations together by combining their attributes, so that the merged PrivateOwner relation now has all the original attributes associated with both PrivateOwner relations. Note that there is conflict between the views on how we should represent the name of an owner. In this situation, we should (if possible) consult the users of each view to determine the final representation. Note, in this example, we use the decomposed version of the owner’s name, represented by the fName and lName attributes, in the merged global view. Figure 16.6 Merging the PrivateOwner relations from the Branch and Staff user views. 16.1 Logical Database Design Methodology for the Relational Model | 485 In a similar way, from Table 16.2 the Staff, Client, PropertyForRent, and Lease relations have the same primary keys in both views and the relations can be merged as discussed above. Merging entities/relations with the same name but different primary keys In some situations, we may find two entities/relations with the same name and similar candidate keys, but with different primary keys. In this case, the entities/relations should be merged together as described above. However, it is necessary to choose one key to be the primary key, the others becoming alternate keys. For example, Figure 16.7 lists the attributes associated with the two relations BusinessOwner defined in the two views. The primary key of the BusinessOwner relation in the Branch user views is bName and the primary key of the BusinessOwner relation in the Staff user views is ownerNo. However, the alternate key for BusinessOwner in the Staff user views is bName. Although the primary keys are different, the primary key of BusinessOwner in the Branch user views is the alternate key of BusinessOwner in the Staff user views. We merge these two relations together as shown in Figure 16.7 and include bName as an alternate key. 
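A sketch of the merged BusinessOwner relation in SQL terms is shown below. Only the key structure is taken from the discussion of Figure 16.7; the data types, lengths, and the bType attribute are illustrative assumptions. ownerNo is taken as the primary key, while bName and telNo remain candidate keys and are therefore declared as alternate keys.

-- Sketch of the merged relation; only the key structure follows the text.
CREATE TABLE BusinessOwner (
    ownerNo  VARCHAR(5)   NOT NULL,
    bName    VARCHAR(30)  NOT NULL,
    bType    VARCHAR(20),              -- illustrative non-key attribute
    telNo    VARCHAR(13)  NOT NULL,
    PRIMARY KEY (ownerNo),             -- primary key from the Staff user views
    UNIQUE (bName),                    -- alternate key (primary key in the Branch user views)
    UNIQUE (telNo)                     -- alternate key
);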
Merging entities/relations with different names using the same or different primary keys In some cases, we may identify entities/relations that have different names but appear to have the same purpose. These equivalent entities/relations may be recognized simply by: n n n their name, which indicates their similar purpose; their content and, in particular, their primary key; their association with particular relationships. An obvious example of this occurrence would be entities called Staff and Employee, which if found to be equivalent should be merged. Figure 16.7 Merging the BusinessOwner relations with different primary keys. 486 | Chapter 16 z Methodology – Logical Database Design for the Relational Model (4) Include (without merging) entities/relations unique to each local data model The previous tasks should identify all entities/relations that are the same. All remaining entities/relations are included in the global model without change. From Table 16.2, the Branch, Telephone, Manager, Registration, Newspaper, and Advert relations are unique to the Branch user views, and the Viewing relation is unique to the Staff user views. (5) Merge relationships/foreign keys from the local data models In this step we examine the name and purpose of each relationship/foreign key in the data models. Before merging relationships/foreign keys, it is important to resolve any conflicts between the relationships such as differences in multiplicity constraints. The activities in this step include: n n merging relationships/foreign keys with the same name and the same purpose; merging relationships/foreign keys with different names but the same purpose. Using Table 16.3 and the data dictionary, we can identify foreign keys with the same name and the same purpose which can be merged into the global model. Note that the Registers relationship in the two views essentially represents the same ‘event’: in the Staff user views, the Registers relationship models a member of staff registering a client; in the Branch user views, the situation is slightly more complex due to the additional modeling of branches, but the introduction of the Registration relation models a member of staff registering a client at a branch. In this case, we ignore the Registers relationship in the Staff user views and include the equivalent relationships/foreign keys from the Branch user views in the next step. (6) Include (without merging) relationships/foreign keys unique to each local data model Again, the previous task should identify relationships/foreign keys that are the same (by definition, they must be between the same entities/relations, which would have been merged together earlier). All remaining relationships/foreign keys are included in the global model without change. (7) Check for missing entities/relations and relationships/foreign keys Perhaps one of the most difficult tasks in producing the global model is identifying missing entities/relations and relationships/foreign keys between different local data models. If a corporate data model exists for the enterprise, this may reveal entities and relationships that do not appear in any local data model. Alternatively, as a preventative measure, when interviewing the users of a specific user views, ask them to pay particular attention to the entities and relationships that exist in other user views. Otherwise, examine the attributes of each entity/relation and look for references to entities/relations in other local data models. 
We may find that we have an attribute associated with an entity/relation in one local data model that corresponds to a primary key, alternate key, or even a non-key attribute of an entity/relation in another local data model. 16.1 Logical Database Design Methodology for the Relational Model (8) Check foreign keys During this step, entities/relations and relationships/foreign keys may have been merged, primary keys changed, and new relationships identified. Check that the foreign keys in child relations are still correct, and make any necessary modifications. The relations that represent the global logical data model for DreamHome are shown in Figure 16.8. Figure 16.8 Relations that represent the global logical data model for DreamHome. | 487 488 | Chapter 16 z Methodology – Logical Database Design for the Relational Model (9) Check integrity constraints Check that the integrity constraints for the global logical data model do not conflict with those originally specified for each view. For example, if any new relationships have been identified and new foreign keys have been created, ensure that appropriate referential integrity constraints are specified. Any conflicts must be resolved in consultation with the users. (10) Draw the global ER/relation diagram We now draw a final diagram that represents all the merged local logical data models. If relations have been used as the basis for merging, we call the resulting diagram a global relation diagram, which shows primary keys and foreign keys. If local ER diagrams have been used, the resulting diagram is simply a global ER diagram. The global relation diagram for DreamHome is shown in Figure 16.9. (11) Update the documentation Update the documentation to reflect any changes made during the development of the global data model. It is very important that the documentation is up to date and reflects the current data model. If changes are made to the model subsequently, either during database implementation or during maintenance, then the documentation should be updated at the same time. Out-of-date information will cause considerable confusion at a later time. Step 2.6.2 Validate global logical data model Objective To validate the relations created from the global logical data model using the technique of normalization and to ensure they support the required transactions, if necessary. This step is equivalent to Steps 2.2 and 2.3, where we validated each local logical data model. However, it is only necessary to check those areas of the model that resulted in any change during the merging process. In a large system, this will significantly reduce the amount of rechecking that needs to be performed. Step 2.6.3 Review global logical data model with users Objective To review the global logical data model with the users to ensure that they consider the model to be a true representation of the data requirements of an enterprise. The global logical data model for the enterprise should now be complete and accurate. The model and the documentation that describes the model should be reviewed with the users to ensure that it is a true representation of the enterprise. 16.1 Logical Database Design Methodology for the Relational Model Figure 16.9 Global relation diagram for DreamHome. | 489 490 | Chapter 16 z Methodology – Logical Database Design for the Relational Model To facilitate the description of the tasks associated with Step 2.6 it is necessary to use the terms ‘local logical data model’ and ‘global logical data model’. 
However, at the end of this step when the local data models have been merged into a single global data model, the distinction between the data models that refer to some or all user views of a database is no longer necessary. Therefore on completion of this step we refer to the single global data model using the simpler term of ‘logical data model’ for the remaining steps of the methodology. Step 2.7 Check for future growth Objective To determine whether there are any significant changes likely in the foreseeable future and to assess whether the logical data model can accommodate these changes. Logical database design concludes by considering whether the logical data model (which may or may not have been developed using Step 2.6) is capable of being extended to support possible future developments. If the model can sustain current requirements only, then the life of the model may be relatively short and significant reworking may be necessary to accommodate new requirements. It is important to develop a model that is extensible and has the ability to evolve to support new requirements with minimal effect on existing users. Of course, this may be very difficult to achieve, as the enterprise may not know what it wants to do in the future. Even if it does, it may be prohibitively expensive both in time and money to accommodate possible future enhancements now. Therefore, it may be necessary to be selective in what is accommodated. Consequently, it is worth examining the model to check its ability to be extended with minimal impact. However, it is not necessary to incorporate any changes into the data model unless requested by the user. At the end of Step 2 the logical data model is used as the source of information for physical database design, which is described in the following two chapters as Steps 3 to 8 of the methodology. For readers familiar with database design, a summary of the steps of the methodology is presented in Appendix G. Chapter Summary n The database design methodology includes three main phases: conceptual, logical, and physical database design. n Logical database design is the process of constructing a model of the data used in an enterprise based on a specific data model but independent of a particular DBMS and other physical considerations. A logical data model includes ER diagram(s), relational schema, and supporting documentation such as the data dictionary, which is produced throughout the development of the model. n Review Questions n n n n n n | 491 The purpose of Step 2.1 of the methodology for logical database design is to derive a relational schema from the conceptual data model created in Step 1. In Step 2.2 the relational schema is validated using the rules of normalization to ensure that each relation is structurally correct. Normalization is used to improve the model so that it satisfies various constraints that avoids unnecessary duplication of data. In Step 2.3 the relational schema is also validated to ensure it supports the transactions given in the users’ requirements specification. In Step 2.4 the integrity constraints of the logical data model are checked. Integrity constraints are the constraints that are to be imposed on the database to protect the database from becoming incomplete, inaccurate, or inconsistent. The main types of integrity constraints include: required data, attribute domain constraints, multiplicity, entity integrity, referential integrity, and general constraints. In Step 2.5 the logical data model is validated by the users. 
Step 2.6 of logical database design is an optional step and is only required if the database has multiple user views that are being managed using the view integration approach (see Section 9.5), which results in the creation of two or more local logical data models. A local logical data model represents the data requirements of one or more, but not all, user views of a database. In Step 2.6 these data models are merged into a global logical data model which represents the requirements of all user views. This logical data model is again validated using normalization, against the required transactions, and by users. Logical database design concludes with Step 2.7, which considers whether the model is capable of being extended to support possible future developments. At the end of Step 2, the logical data model, which may or may not have been developed using Step 2.6, is the source of information for physical database design described as Steps 3 to 8 in Chapters 17 and 18.

Review Questions

16.1 Discuss the purpose of logical database design.
16.2 Describe the rules for deriving relations that represent:
(a) strong entity types;
(b) weak entity types;
(c) one-to-many (1:*) binary relationship types;
(d) one-to-one (1:1) binary relationship types;
(e) one-to-one (1:1) recursive relationship types;
(f) superclass/subclass relationship types;
(g) many-to-many (*:*) binary relationship types;
(h) complex relationship types;
(i) multi-valued attributes.
Give examples to illustrate your answers.
16.3 Discuss how the technique of normalization can be used to validate the relations derived from the conceptual data model.
16.4 Discuss two approaches that can be used to validate that the relational schema is capable of supporting the required transactions.
16.5 Describe the purpose of integrity constraints and identify the main types of integrity constraints on a logical data model.
16.6 Describe the alternative strategies that can be applied if there exists a child tuple referencing a parent tuple that we wish to delete.
16.7 Identify the tasks typically associated with merging local logical data models into a global logical model.

Exercises

16.8 Derive relations from the following conceptual data model:

The DreamHome case study
16.9 Create a relational schema for the Branch user view of DreamHome based on the conceptual data model produced in Exercise 15.13 and compare your schema with the relations listed in Figure 16.5. Justify any differences found.

The University Accommodation Office case study
16.10 Create and validate a logical data model from the conceptual data model for the University Accommodation Office case study created in Exercise 15.16.

The EasyDrive School of Motoring case study
16.11 Create and validate a logical data model from the conceptual data model for the EasyDrive School of Motoring case study created in Exercise 15.18.

The Wellmeadows Hospital case study
16.12 Create and validate the local logical data models for each of the local conceptual data models of the Wellmeadows Hospital case study identified in Exercise 15.21.
16.13 Merge the local data models to create a global logical data model of the Wellmeadows Hospital case study. State any assumptions necessary to support your design.

Chapter 17 Methodology – Physical Database Design for Relational Databases

Chapter Objectives
In this chapter you will learn:
n The purpose of physical database design.
n How to map the logical database design to a physical database design. n How to design base relations for the target DBMS. n How to design general constraints for the target DBMS. n How to select appropriate file organizations based on analysis of transactions. n When to use secondary indexes to improve performance. n How to estimate the size of the database. n How to design user views. n How to design security mechanisms to satisfy user requirements. In this chapter and the next we describe and illustrate by example a physical database design methodology for relational databases. The starting point for this chapter is the logical data model and the documentation that describes the model created in the conceptual/logical database design methodology described in Chapters 15 and 16. The methodology started by producing a conceptual data model in Step 1 and then derived a set of relations to produce a logical data model in Step 2. The derived relations were validated to ensure they were correctly structured using the technique of normalization described in Chapters 13 and 14, and to ensure they supported the transactions the users require. In the third and final phase of the database design methodology, the designer must decide how to translate the logical database design (that is, the entities, attributes, relationships, and constraints) into a physical database design that can be implemented using the target DBMS. As many parts of physical database design are highly dependent on the target DBMS, there may be more than one way of implementing any given part of the database. Consequently to do this work properly, the designer must be fully aware of the functionality of the target DBMS, and must understand the advantages and disadvantages of each alternative approach for a particular implementation. For some systems the designer may also need to select a suitable storage strategy that takes account of intended database usage. 17.1 Comparison of Logical and Physical Database Design | Structure of this Chapter In Section 17.1 we provide a comparison of logical and physical database design. In Section 17.2 we provide an overview of the physical database design methodology and briefly describe the main activities associated with each design phase. In Section 17.3 we focus on the methodology for physical database design and present a detailed description of the first four steps required to build a physical data model. In these steps, we show how to convert the relations derived for the logical data model into a specific database implementation. We provide guidelines for choosing storage structures for the base relations and deciding when to create indexes. In places, we show physical implementation details to clarify the discussion. In Chapter 18 we complete our presentation of the physical database design methodology and discuss how to monitor and tune the operational system and, in particular, we consider when it is appropriate to denormalize the logical data model and introduce redundancy. Appendix G presents a summary of the database design methodology for those readers who are already familiar with database design and simply require an overview of the main steps. Comparison of Logical and Physical Database Design In presenting a database design methodology we divide the design process into three main phases: conceptual, logical, and physical database design. 
The phase prior to physical design, namely logical database design, is largely independent of implementation details, such as the specific functionality of the target DBMS and application programs, but is dependent on the target data model. The output of this process is a logical data model consisting of an ER/relation diagram, relational schema, and supporting documentation that describes this model, such as a data dictionary. Together, these represent the sources of information for the physical design process, and they provide the physical database designer with a vehicle for making tradeoffs that are so important to an efficient database design. Whereas logical database design is concerned with the what, physical database design is concerned with the how. It requires different skills that are often found in different people. In particular, the physical database designer must know how the computer system hosting the DBMS operates, and must be fully aware of the functionality of the target DBMS. As the functionality provided by current systems varies widely, physical design must be tailored to a specific DBMS. However, physical database design is not an isolated activity – there is often feedback between physical, logical, and application design. For example, decisions taken during physical design for improving performance, such as merging relations together, might affect the structure of the logical data model, which will have an associated effect on the application design. 17.1 495 496 | Chapter 17 z Methodology – Physical Database Design for Relational Databases 17.2 Overview of Physical Database Design Methodology Physical database design The process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and indexes used to achieve efficient access to the data, and any associated integrity constraints and security measures. The steps of the physical database design methodology are as follows: Step 3 Translate logical data model for target DBMS Step 3.1 Design base relations Step 3.2 Design representation of derived data Step 3.3 Design general constraints Step 4 Design file organizations and indexes Step 4.1 Analyze transactions Step 4.2 Choose file organizations Step 4.3 Choose indexes Step 4.4 Estimate disk space requirements Step 5 Design user views Step 6 Design security mechanisms Step 7 Consider the introduction of controlled redundancy Step 8 Monitor and tune the operational system The physical database design methodology presented in this book is divided into six main steps, numbered consecutively from 3 to follow the three steps of the conceptual and logical database design methodology. Step 3 of physical database design involves the design of the base relations and general constraints using the available functionality of the target DBMS. This step also considers how we should represent any derived data present in the data model. Step 4 involves choosing the file organizations and indexes for the base relations. Typically, PC DBMSs have a fixed storage structure but other DBMSs tend to provide a number of alternative file organizations for data. From the user’s viewpoint, the internal storage representation for relations should be transparent – the user should be able to access relations and tuples without having to specify where or how the tuples are stored. 
This requires that the DBMS provides physical data independence, so that users are unaffected by changes to the physical structure of the database, as discussed in Section 2.1.5. The mapping between the logical data model and physical data model is defined in the internal schema, as shown previously in Figure 2.1. The designer may have to provide the physical design details to both the DBMS and the operating system. For the DBMS, the designer may have to specify the file organizations that are to be used to represent each relation; for the operating system, the designer must specify details such as the location and protection for each file. We recommend that the reader reviews Appendix C on file organization and storage structures before reading Step 4 of the methodology. 17.3 The Physical Database Design Methodology for Relational Databases | Step 5 involves deciding how each user view should be implemented. Step 6 involves designing the security measures necessary to protect the data from unauthorized access, including the access controls that are required on the base relations. Step 7 (described in Chapter 18) considers relaxing the normalization constraints imposed on the logical data model to improve the overall performance of the system. This step should be undertaken only if necessary, because of the inherent problems involved in introducing redundancy while still maintaining consistency. Step 8 (Chapter 18) is an ongoing process of monitoring the operational system to identify and resolve any performance problems resulting from the design, and to implement new or changing requirements. Appendix G presents a summary of the methodology for those readers who are already familiar with database design and simply require an overview of the main steps. The Physical Database Design Methodology for Relational Databases This section provides a step-by-step guide to the first four steps of the physical database design methodology for relational databases. In places, we demonstrate the close association between physical database design and implementation by describing how alternative designs can be implemented using various target DBMSs. The remaining two steps are covered in the next chapter. Step 3 Translate Logical Data Model for Target DBMS Objective To produce a relational database schema from the logical data model that can be implemented in the target DBMS. The first activity of physical database design involves the translation of the relations in the logical data model into a form that can be implemented in the target relational DBMS. The first part of this process entails collating the information gathered during logical database design and documented in the data dictionary along with the information gathered during the requirements collection and analysis stage and documented in the systems specification. The second part of the process uses this information to produce the design of the base relations. This process requires intimate knowledge of the functionality offered by the target DBMS. For example, the designer will need to know: n n n n n n how to create base relations; whether the system supports the definition of primary keys, foreign keys, and alternate keys; whether the system supports the definition of required data (that is, whether the system allows attributes to be defined as NOT NULL); whether the system supports the definition of domains; whether the system supports relational integrity constraints; whether the system supports the definition of integrity constraints. 
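As an indication of what this means in practice, the following sketch shows how a base relation might look if the target DBMS supports all of the facilities listed above. The attribute names follow the DreamHome PropertyForRent relation, but the data types, domains, and default values are illustrative assumptions rather than the definitive design, which is developed in Step 3.1 using the extended DBDL.

-- Sketch only: domains, defaults, and referential actions are assumptions.
CREATE TABLE PropertyForRent (
    propertyNo  VARCHAR(5)   NOT NULL,
    street      VARCHAR(25)  NOT NULL,
    city        VARCHAR(15)  NOT NULL,
    postcode    VARCHAR(8),
    type        CHAR(1)      DEFAULT 'F' NOT NULL
                CHECK (type IN ('F', 'H')),          -- attribute domain constraint
    rooms       SMALLINT     DEFAULT 4 NOT NULL
                CHECK (rooms BETWEEN 1 AND 15),
    rent        DECIMAL(6,2) NOT NULL,
    ownerNo     VARCHAR(5)   NOT NULL,
    staffNo     VARCHAR(5),                          -- nulls allowed: optional participation
    branchNo    VARCHAR(4)   NOT NULL,
    PRIMARY KEY (propertyNo),                        -- entity integrity
    FOREIGN KEY (ownerNo)  REFERENCES PrivateOwner(ownerNo),
    FOREIGN KEY (staffNo)  REFERENCES Staff(staffNo) ON DELETE SET NULL,
    FOREIGN KEY (branchNo) REFERENCES Branch(branchNo)
);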
17.3 497 498 | Chapter 17 z Methodology – Physical Database Design for Relational Databases The three activities of Step 3 are: Step 3.1 Design base relations Step 3.2 Design representation of derived data Step 3.3 Design general constraints Step 3.1 Design base relations Objective To decide how to represent the base relations identified in the logical data model in the target DBMS. To start the physical design process, we first collate and assimilate the information about the relations produced during logical database design. The necessary information can be obtained from the data dictionary and the definition of the relations described using the Database Design Language (DBDL). For each relation identified in the logical data model, we have a definition consisting of: n n n n the name of the relation; a list of simple attributes in brackets; the primary key and, where appropriate, alternate keys (AK) and foreign keys (FK); referential integrity constraints for any foreign keys identified. From the data dictionary, we also have for each attribute: n n n n its domain, consisting of a data type, length, and any constraints on the domain; an optional default value for the attribute; whether the attribute can hold nulls; whether the attribute is derived and, if so, how it should be computed. To represent the design of the base relations, we use an extended form of the DBDL to define domains, default values, and null indicators. For example, for the PropertyForRent relation of the DreamHome case study, we may produce the design shown in Figure 17.1. Implementing base relations The next step is to decide how to implement the base relations. This decision is dependent on the target DBMS; some systems provide more facilities than others for defining base relations. We have previously demonstrated three particular ways to implement base relations using the ISO SQL standard (Section 6.1), Microsoft Office Access (Section 8.1.3), and Oracle (Section 8.2.3). Document design of base relations The design of the base relations should be fully documented along with the reasons for selecting the proposed design. In particular, document the reasons for selecting one approach where many alternatives exist. 17.3 The Physical Database Design Methodology for Relational Databases | 499 Figure 17.1 DBDL for the PropertyForRent relation. Step 3.2 Design representation of derived data Objective To decide how to represent any derived data present in the logical data model in the target DBMS. Attributes whose value can be found by examining the values of other attributes are known as derived or calculated attributes. For example, the following are all derived attributes: n n n the number of staff who work in a particular branch; the total monthly salaries of all staff; the number of properties that a member of staff handles. Often, derived attributes do not appear in the logical data model but are documented in the data dictionary. If a derived attribute is displayed in the model, a ‘/’ is used to indicate that it is derived (see Section 11.1.2). The first step is to examine the logical data model and the data dictionary, and produce a list of all derived attributes. From a physical database 500 | Chapter 17 z Methodology – Physical Database Design for Relational Databases Figure 17.2 The PropertyforRent relation and a simplified Staff relation with the derived attribute noOfProperties. design perspective, whether a derived attribute is stored in the database or calculated every time it is needed is a tradeoff. 
The designer should calculate: n n the additional cost to store the derived data and keep it consistent with operational data from which it is derived; the cost to calculate it each time it is required. The less expensive option is chosen subject to performance constraints. For the last example cited above, we could store an additional attribute in the Staff relation representing the number of properties that each member of staff currently manages. A simplified Staff relation based on the sample instance of the DreamHome database shown in Figure 3.3 with the new derived attribute noOfProperties is shown in Figure 17.2. The additional storage overhead for this new derived attribute would not be particularly significant. The attribute would need to be updated every time a member of staff was assigned to or deassigned from managing a property, or the property was removed from the list of available properties. In each case, the noOfProperties attribute for the appropriate member of staff would be incremented or decremented by 1. It would be necessary to ensure that this change is made consistently to maintain the correct count, and thereby ensure the integrity of the database. When a query accesses this attribute, the value would be immediately available and would not have to be calculated. On the other hand, if the attribute is not stored directly in the Staff relation it must be calculated each time it is required. This involves a join of the Staff and PropertyForRent relations. Thus, if this type of query is frequent or is considered to be critical for performance purposes, it may be more appropriate to store the derived attribute rather than calculate it each time. It may also be more appropriate to store derived attributes whenever the DBMS’s query language cannot easily cope with the algorithm to calculate the derived attribute. For example, SQL has a limited set of aggregate functions and cannot easily handle recursive queries, as we discussed in Chapter 5. 17.3 The Physical Database Design Methodology for Relational Databases Document design of derived data The design of derived data should be fully documented along with the reasons for selecting the proposed design. In particular, document the reasons for selecting one approach where many alternatives exist. Step 3.3 Design general constraints Objective To design the general constraints for the target DBMS. Updates to relations may be constrained by integrity constraints governing the ‘real world’ transactions that are represented by the updates. In Step 3.1 we designed a number of integrity constraints: required data, domain constraints, and entity and referential integrity. In this step we have to consider the remaining general constraints. The design of such constraints is again dependent on the choice of DBMS; some systems provide more facilities than others for defining general constraints. As in the previous step, if the system is compliant with the SQL standard, some constraints may be easy to implement. For example, DreamHome has a rule that prevents a member of staff from managing more than 100 properties at the same time. We could design this constraint into the SQL CREATE TABLE statement for PropertyForRent using the following clause: CONSTRAINT StaffNotHandlingTooMuch CHECK (NOT EXISTS (SELECT staffNo FROM PropertyForRent GROUP BY staffNo HAVING COUNT(*) > 100)) In Section 8.1.4 we demonstrated how to implement this constraint in Microsoft Office Access using an event procedure in VBA (Visual Basic for Applications). 
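Where the target DBMS implements the full ISO SQL standard, the same rule could instead be written as a free-standing assertion rather than being attached to one table. This is a sketch only, since general assertions are not supported by most current commercial DBMSs:

CREATE ASSERTION StaffNotHandlingTooMuch
    CHECK (NOT EXISTS (SELECT staffNo
                       FROM   PropertyForRent
                       GROUP  BY staffNo
                       HAVING COUNT(*) > 100));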
Alternatively, a trigger could be used to enforce some constraints as we illustrated in Section 8.2.7. In some systems there will be no support for some or all of the general constraints and it will be necessary to design the constraints into the application. For example, there are very few relational DBMSs (if any) that would be able to handle a time constraint such as ‘at 17.30 on the last working day of each year, archive the records for all properties sold that year and delete the associated records’. Document design of general constraints The design of general constraints should be fully documented. In particular, document the reasons for selecting one approach where many alternatives exist. Step 4 Design File Organizations and Indexes Objective To determine the optimal file organizations to store the base relations and the indexes that are required to achieve acceptable performance, that is, the way in which relations and tuples will be held on secondary storage. | 501 502 | Chapter 17 z Methodology – Physical Database Design for Relational Databases One of the main objectives of physical database design is to store and access data in an efficient way (see Appendix C). While some storage structures are efficient for bulk loading data into the database, they may be inefficient after that. Thus, we may have to choose to use an efficient storage structure to set up the database and then choose another for operational use. Again, the types of file organization available are dependent on the target DBMS; some systems provide more choice of storage structures than others. It is extremely important that the physical database designer fully understands the storage structures that are available, and how the target system uses these structures. This may require the designer to know how the system’s query optimizer functions. For example, there may be circumstances where the query optimizer would not use a secondary index, even if one were available. Thus, adding a secondary index would not improve the performance of the query, and the resultant overhead would be unjustified. We discuss query processing and optimization in Chapter 21. As with logical database design, physical database design must be guided by the nature of the data and its intended use. In particular, the database designer must understand the typical workload that the database must support. During the requirements collection and analysis stage there may have been requirements specified about how fast certain transactions must run or how many transactions must be processed per second. This information forms the basis for a number of decisions that will be made during this step. With these objectives in mind, we now discuss the activities in Step 4: Step 4.1 Step 4.2 Step 4.3 Step 4.4 Analyze transactions Choose file organizations Choose indexes Estimate disk space requirements Step 4.1 Analyze transactions Objective To understand the functionality of the transactions that will run on the database and to analyze the important transactions. To carry out physical database design effectively, it is necessary to have knowledge of the transactions or queries that will run on the database. This includes both qualitative and quantitative information. 
In analyzing the transactions, we attempt to identify performance criteria, such as:

- the transactions that run frequently and will have a significant impact on performance;
- the transactions that are critical to the operation of the business;
- the times during the day/week when there will be a high demand made on the database (called the peak load).

We use this information to identify the parts of the database that may cause performance problems. At the same time, we need to identify the high-level functionality of the transactions, such as the attributes that are updated in an update transaction or the criteria used to restrict the tuples that are retrieved in a query. We use this information to select appropriate file organizations and indexes.

In many situations, it is not possible to analyze all the expected transactions, so we should at least investigate the most 'important' ones. It has been suggested that the most active 20% of user queries account for 80% of the total data access (Wiederhold, 1983). This 80/20 rule may be used as a guideline in carrying out the analysis. To help identify which transactions to investigate, we can use a transaction/relation cross-reference matrix, which shows the relations that each transaction accesses, and/or a transaction usage map, which diagrammatically indicates which relations are potentially heavily used. To focus on areas that may be problematic, one way to proceed is to:

(1) map all transaction paths to relations;
(2) determine which relations are most frequently accessed by transactions;
(3) analyze the data usage of selected transactions that involve these relations.

Map all transaction paths to relations

In Steps 1.8, 2.3, and 2.6.2 of the conceptual/logical database design methodology we validated the data models to ensure they supported the transactions that the users require by mapping the transaction paths to entities/relations. If a transaction pathway diagram was used similar to the one shown in Figure 15.9, we may be able to use this diagram to determine the relations that are most frequently accessed. On the other hand, if the transactions were validated in some other way, it may be useful to create a transaction/relation cross-reference matrix. The matrix shows, in a visual way, the transactions that are required and the relations they access. For example, Table 17.1 shows a transaction/relation cross-reference matrix for the following selection of typical entry, update/delete, and query transactions for DreamHome (see Appendix A):

(A) Enter the details for a new property and the owner (such as details of property number PG4 in Glasgow owned by Tina Murphy).
(B) Update/delete the details of a property.
(C) Identify the total number of staff in each position at branches in Glasgow.
(D) List the property number, address, type, and rent of all properties in Glasgow, ordered by rent.
(E) List the details of properties for rent managed by a named member of staff.
(F) Identify the total number of properties assigned to each member of staff at a given branch.

The matrix indicates, for example, that transaction (A) reads the Staff table and also inserts tuples into the PropertyForRent and PrivateOwner/BusinessOwner relations. To be more useful, the matrix should indicate in each cell the number of accesses over some time interval (for example, hourly, daily, or weekly).
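It also helps at this point to write down the SQL for the transactions being cross-referenced, since the statements expose the predicates, ordering attributes, and join attributes that later steps will need. For example, transaction (D) might be expressed as follows (one plausible formulation, using the PropertyForRent attributes propertyNo, street, postcode, type, and rent):

SELECT propertyNo, street, postcode, type, rent
FROM   PropertyForRent
WHERE  city = 'Glasgow'
ORDER BY rent;

Counting how often each such statement is executed in a given interval would supply the per-cell access figures just mentioned.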
However, to keep the matrix simple, we do not show this information.

Table 17.1 Cross-referencing transactions and relations. (The matrix cross-references transactions (A)–(F), each broken down into Insert (I), Read (R), Update (U), and Delete (D) columns, against the relations Branch, Telephone, Staff, Manager, PrivateOwner, BusinessOwner, PropertyForRent, Viewing, Client, Registration, Lease, Newspaper, and Advert.)

This matrix shows that both the Staff and PropertyForRent relations are accessed by five of the six transactions, and so efficient access to these relations may be important to avoid performance problems. We therefore conclude that a closer inspection of these transactions and relations is necessary.

Determine frequency information

In the requirements specification for DreamHome given in Section 10.4.4, it was estimated that there are about 100,000 properties for rent and 2000 staff distributed over 100 branch offices, with an average of 1000 and a maximum of 3000 properties at each branch. Figure 17.3 shows the transaction usage map for transactions (C), (D), (E), and (F), which all access at least one of the Staff and PropertyForRent relations, with these numbers added. Due to the size of the PropertyForRent relation, it will be important that access to this relation is as efficient as possible. We may now decide that a closer analysis of transactions involving this particular relation would be useful.

Figure 17.3 Transaction usage map for some sample transactions showing expected occurrences.

In considering each transaction, it is important to know not only the average and maximum number of times it runs per hour, but also the day and time that the transaction is run, including when the peak load is likely. For example, some transactions may run at the average rate for most of the time, but have a peak loading between 14.00 and 16.00 on a Thursday prior to a meeting on Friday morning. Other transactions may run only at specific times, for example 17.00–19.00 on Fridays/Saturdays, which is also their peak loading.

Where transactions require frequent access to particular relations, their pattern of operation is very important. If these transactions operate in a mutually exclusive manner, the risk of likely performance problems is reduced. However, if their operating patterns conflict, potential problems may be alleviated by examining the transactions more closely to determine whether changes can be made to the structure of the relations to improve performance, as we discuss in Step 7 in the next chapter. Alternatively, it may be possible to reschedule some transactions so that their operating patterns do not conflict (for example, it may be possible to leave some summary transactions until a quieter time in the evening or overnight).

Analyze data usage

Having identified the important transactions, we now analyze each one in more detail. For each transaction, we should determine:

- The relations and attributes accessed by the transaction and the type of access; that is, whether it is an insert, update, delete, or retrieval (also known as a query) transaction. For an update transaction, note the attributes that are updated, as these attributes may be candidates for avoiding an access structure (such as a secondary index).
- The attributes used in any predicates (in SQL, the predicates are the conditions specified in the WHERE clause). Check whether the predicates involve:
  – pattern matching; for example: (name LIKE '%Smith%');
  – range searches; for example: (salary BETWEEN 10000 AND 20000);
  – exact-match key retrieval; for example: (salary = 30000).
  This applies not only to queries but also to update and delete transactions, which can restrict the tuples to be updated/deleted in a relation. These attributes may be candidates for access structures.
- For a query, the attributes that are involved in the join of two or more relations. Again, these attributes may be candidates for access structures.
- The expected frequency at which the transaction will run; for example, the transaction will run approximately 50 times per day.
- The performance goals for the transaction; for example, the transaction must complete within 1 second.

The attributes used in any predicates for very frequent or critical transactions should have a higher priority for access structures.

Figure 17.4 shows an example of a transaction analysis form for transaction (D). This form shows that the average frequency of this transaction is 50 times per hour, with a peak loading of 100 times per hour daily between 17.00 and 19.00. In other words, typically half the branches will run this transaction per hour and at peak time all branches will run this transaction once per hour. The form also shows the required SQL statement and the transaction usage map. At this stage, the full SQL statement may be too detailed but the types of details that are shown adjacent to the SQL statement should be identified, namely:

- any predicates that will be used;
- any attributes that will be required to join relations together (for a query transaction);
- attributes used to order results (for a query transaction);
- attributes used to group data together (for a query transaction);
- any built-in functions that may be used (such as AVG, SUM);
- any attributes that will be updated by the transaction.

This information will be used to determine the indexes that are required, as we discuss next. Below the transaction usage map, there is a detailed breakdown documenting:

- how each relation is accessed (reads in this case);
- how many tuples will be accessed each time the transaction is run;
- how many tuples will be accessed per hour on average and at peak loading times.

The frequency information will identify the relations that will need careful consideration to ensure that appropriate access structures are used. As mentioned above, the search conditions used by transactions that have time constraints become higher priority for access structures.

Figure 17.4 Example transaction analysis form.

Step 4.2 Choose file organizations

Objective  To determine an efficient file organization for each base relation.

One of the main objectives of physical database design is to store and access data in an efficient way. For example, if we want to retrieve staff tuples in alphabetical order of name, sorting the file by staff name is a good file organization. However, if we want to retrieve all staff whose salary is in a certain range, searching a file ordered by staff name would not be particularly efficient. To complicate matters, some file organizations are
efficient for bulk loading data into the database but inefficient after that. In other words, we may want to use an efficient storage structure to set up the database and then change it for normal operational use.

The objective of this step therefore is to choose an optimal file organization for each relation, if the target DBMS allows this. In many cases, a relational DBMS may give little or no choice for choosing file organizations, although some may be established as indexes are specified. However, as an aid to understanding file organizations and indexes more fully, we provide guidelines in Appendix C.7 for selecting a file organization based on the following types of file:

- Heap
- Hash
- Indexed Sequential Access Method (ISAM)
- B+-tree
- Clusters.

If the target DBMS does not allow the choice of file organizations, this step can be omitted.

Document choice of file organizations

The choice of file organizations should be fully documented, along with the reasons for the choice. In particular, document the reasons for selecting one approach where many alternatives exist.

Step 4.3 Choose indexes

Objective  To determine whether adding indexes will improve the performance of the system.

One approach to selecting an appropriate file organization for a relation is to keep the tuples unordered and create as many secondary indexes as necessary. Another approach is to order the tuples in the relation by specifying a primary or clustering index (see Appendix C.5). In this case, choose the attribute for ordering or clustering the tuples as:

- the attribute that is used most often for join operations, as this makes the join operation more efficient, or
- the attribute that is used most often to access the tuples in a relation in order of that attribute.

If the ordering attribute chosen is a key of the relation, the index will be a primary index; if the ordering attribute is not a key, the index will be a clustering index. Remember that each relation can only have either a primary index or a clustering index.

Specifying indexes

We saw in Section 6.3.4 that an index can usually be created in SQL using the CREATE INDEX statement. For example, to create a primary index on the PropertyForRent relation based on the propertyNo attribute, we might use the following SQL statement:

CREATE UNIQUE INDEX PropertyNoInd ON PropertyForRent(propertyNo);

To create a clustering index on the PropertyForRent relation based on the staffNo attribute, we might use the following SQL statement:

CREATE INDEX StaffNoInd ON PropertyForRent(staffNo) CLUSTER;

As we have already mentioned, in some systems the file organization is fixed. For example, until recently Oracle supported only B+-trees but has now added support for clusters. On the other hand, INGRES offers a wide set of different index structures that can be chosen using the following optional clause in the CREATE INDEX statement:

[STRUCTURE = BTREE | ISAM | HASH | HEAP]

Choosing secondary indexes

Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be used to retrieve data more efficiently. For example, the PropertyForRent relation may be hashed on the property number, propertyNo, the primary index. However, there may be frequent access to this relation based on the rent attribute. In this case, we may decide to add rent as a secondary index.
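A sketch of such a secondary index is shown below (the index name RentInd is our own choice; the statement follows the CREATE INDEX form shown above):

CREATE INDEX RentInd ON PropertyForRent(rent);

The optimizer can then use RentInd for range predicates such as rent BETWEEN 350 AND 450, at the cost of maintaining the index on every insert, delete, or update of rent, as discussed next.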
There is an overhead involved in the maintenance and use of secondary indexes that has to be balanced against the performance improvement gained when retrieving data. This overhead includes:

- adding an index record to every secondary index whenever a tuple is inserted into the relation;
- updating a secondary index when the corresponding tuple in the relation is updated;
- the increase in disk space needed to store the secondary index;
- possible performance degradation during query optimization, as the query optimizer may consider all secondary indexes before selecting an optimal execution strategy.

Guidelines for choosing a 'wish-list' of indexes

One approach to determining which secondary indexes are needed is to produce a 'wish-list' of attributes that we consider are candidates for indexing, and then to examine the impact of maintaining each of these indexes. We provide the following guidelines to help produce such a 'wish-list':

(1) Do not index small relations. It may be more efficient to search the relation in memory than to store an additional index structure.
(2) In general, index the primary key of a relation if it is not a key of the file organization. Although the SQL standard provides a clause for the specification of primary keys as discussed in Section 6.2.3, it should be noted that this does not guarantee that the primary key will be indexed.
(3) Add a secondary index to a foreign key if it is frequently accessed. For example, we may frequently join the PropertyForRent relation and the PrivateOwner/BusinessOwner relations on the attribute ownerNo, the owner number. Therefore, it may be more efficient to add a secondary index to the PropertyForRent relation based on the attribute ownerNo. Note, some DBMSs may automatically index foreign keys.
(4) Add a secondary index to any attribute that is heavily used as a secondary key (for example, add a secondary index to the PropertyForRent relation based on the attribute rent, as discussed above).
(5) Add a secondary index on attributes that are frequently involved in: (a) selection or join criteria; (b) ORDER BY; (c) GROUP BY; (d) other operations involving sorting (such as UNION or DISTINCT).
(6) Add a secondary index on attributes involved in built-in aggregate functions, along with any attributes used for the built-in functions. For example, to find the average staff salary at each branch, we could use the following SQL query:

    SELECT branchNo, AVG(salary)
    FROM Staff
    GROUP BY branchNo;

    From the previous guideline, we could consider adding an index to the branchNo attribute by virtue of the GROUP BY clause. However, it may be more efficient to consider an index on both the branchNo attribute and the salary attribute. This may allow the DBMS to perform the entire query from data in the index alone, without having to access the data file. This is sometimes called an index-only plan, as the required response can be produced using only data in the index.
(7) As a more general case of the previous guideline, add a secondary index on attributes that could result in an index-only plan.
(8) Avoid indexing an attribute or relation that is frequently updated.
(9) Avoid indexing an attribute if the query will retrieve a significant proportion (for example, 25%) of the tuples in the relation. In this case, it may be more efficient to search the entire relation than to search using an index.
(10) Avoid indexing attributes that consist of long character strings.

If the search criteria involve more than one predicate, and one of the terms contains an OR clause, and the term has no index/sort order, then adding indexes for the other attributes is not going to help improve the speed of the query, because a linear search of the relation will still be required. For example, assume that only the type and rent attributes of the PropertyForRent relation are indexed, and we need to use the following query:

SELECT *
FROM PropertyForRent
WHERE (type = 'Flat' OR rent > 500 OR rooms > 5);

Although the two indexes could be used to find the tuples where (type = 'Flat' OR rent > 500), the fact that the rooms attribute is not indexed will mean that these indexes cannot be used for the full WHERE clause. Thus, unless there are other queries that would benefit from having the type and rent attributes indexed, there would be no benefit gained in indexing them for this query. On the other hand, if the predicates in the WHERE clause were AND'ed together, the two indexes on the type and rent attributes could be used to optimize the query.

Removing indexes from the 'wish-list'

Having drawn up the 'wish-list' of potential indexes, we should now consider the impact of each of these on update transactions. If the maintenance of the index is likely to slow down important update transactions, then consider dropping the index from the list. Note, however, that a particular index may also make update operations more efficient. For example, if we want to update a member of staff's salary given the member's staff number, staffNo, and we have an index on staffNo, then the tuple to be updated can be found more quickly.

It is a good idea to experiment when possible to determine whether an index is improving performance, providing very little improvement, or adversely impacting performance. In the last case, clearly we should remove this index from the 'wish-list'. If there is little observed improvement with the addition of the index, further examination may be necessary to determine under what circumstances the index will be useful, and whether these circumstances are sufficiently important to warrant the implementation of the index.

Some systems allow users to inspect the optimizer's strategy for executing a particular query or update, sometimes called the Query Execution Plan (QEP). For example, Microsoft Office Access has a Performance Analyzer, Oracle has an EXPLAIN PLAN diagnostic utility (see Section 21.6.3), DB2 has an EXPLAIN utility, and INGRES has an online QEP-viewing facility. When a query runs slower than expected, it is worth using such a facility to determine the reason for the slowness, and to find an alternative strategy that may improve the performance of the query.

If a large number of tuples are being inserted into a relation with one or more indexes, it may be more efficient to drop the indexes first, perform the inserts, and then recreate the indexes afterwards. As a rule of thumb, if the insert will increase the size of the relation by at least 10%, drop the indexes temporarily.

Updating the database statistics

The query optimizer relies on database statistics held in the system catalog to select the optimal strategy. Whenever we create an index, the DBMS automatically adds the presence of the index to the system catalog.
However, we may find that the DBMS requires a utility to be run to update the statistics in the system catalog relating to the relation and the index.

Document choice of indexes

The choice of indexes should be fully documented along with the reasons for the choice. In particular, if there are performance reasons why some attributes should not be indexed, these should also be documented.

File organizations and indexes for DreamHome with Microsoft Office Access

Like most, if not all, PC DBMSs, Microsoft Office Access uses a fixed file organization, so if the target DBMS is Microsoft Office Access, Step 4.2 can be omitted. Microsoft Office Access does, however, support indexes, as we now briefly discuss. In this section we use the terminology of Office Access, which refers to a relation as a table with fields and records.

Guidelines for indexes

In Office Access, the primary key of a table is automatically indexed, but a field whose data type is Memo, Hyperlink, or OLE Object cannot be indexed. For other fields, Microsoft advise indexing a field if all the following apply:

- the field's data type is Text, Number, Currency, or Date/Time;
- the user anticipates searching for values stored in the field;
- the user anticipates sorting values in the field;
- the user anticipates storing many different values in the field. If many of the values in the field are the same, the index may not significantly speed up queries.

In addition, Microsoft advise:

- indexing fields on both sides of a join or creating a relationship between these fields, in which case Office Access will automatically create an index on the foreign key field, if one does not exist already;
- when grouping records by the values in a joined field, specifying GROUP BY for the field that is in the same table as the field the aggregate is being calculated on.

Microsoft Office Access can optimize simple and complex predicates (called expressions in Office Access). For certain types of complex expressions, Microsoft Office Access uses a data access technology called Rushmore to achieve a greater level of optimization. A complex expression is formed by combining two simple expressions with the AND or OR operator, such as:

branchNo = 'B001' AND rooms > 5
type = 'Flat' OR rent > 300

In Office Access, a complex expression is fully or partially optimizable depending on whether one or both simple expressions are optimizable, and which operator was used to combine them. A complex expression is Rushmore-optimizable if all three of the following conditions are true:

- the expression uses AND or OR to join two conditions;
- both conditions are made up of simple optimizable expressions;
- both expressions contain indexed fields. The fields can be indexed individually or they can be part of a multiple-field index.

Indexes for DreamHome

Before creating the wish-list, we exclude small tables from further consideration, as small tables can usually be processed in memory without requiring additional indexes. For DreamHome we therefore ignore the Branch, Telephone, Manager, and Newspaper tables. Based on the guidelines provided above:
(1) Create the primary key for each table, which will cause Office Access to automatically index this field.
(2) Ensure all relationships are created in the Relationships window, which will cause Office Access to automatically index the foreign key fields.

As an illustration of which other indexes to create, we consider the query transactions listed in Appendix A for the Staff user views of DreamHome. We can produce a summary of interactions between the base tables and these transactions, shown in Table 17.2. This table shows the transaction(s) that operate on the base tables, the type of access (a search based on a predicate, a join together with the join field, any ordering field, and any grouping field), and the frequency with which each transaction runs.

Table 17.2 Interactions between base tables (Staff, Client, PropertyForRent, Viewing, and Lease) and query transactions for the Staff view of DreamHome.

Transaction(s)   Field                                          Frequency (per day)
(a), (d)         predicate: fName, lName                        20
(a)              join: Staff on supervisorStaffNo               20
(b)              ordering: fName, lName                         20
(b)              predicate: position                            20
(e)              join: Staff on staffNo                         1000–2000
(j)              predicate: fName, lName                        1000
(c)              predicate: rentFinish                          5000–10,000
(k), (l)         predicate: rentFinish                          100
(c)              join: PrivateOwner/BusinessOwner on ownerNo    5000–10,000
(d)              join: Staff on staffNo                         20
(f)              predicate: city                                50
(f)              predicate: rent                                50
(g)              join: Client on clientNo                       100
(i)              join: Client on clientNo                       100
(c)              join: PropertyForRent on propertyNo            5000–10,000
(l)              join: PropertyForRent on propertyNo            100
(j)              join: Client on clientNo                       1000

Based on this information, we choose to create the additional indexes shown in Table 17.3. We leave it as an exercise for the reader to choose additional indexes to create in Microsoft Office Access for the transactions listed in Appendix A for the Branch view of DreamHome (see Exercise 17.5).

Table 17.3 Additional indexes to be created in Microsoft Office Access based on the query transactions for the Staff view of DreamHome.

Table              Index
Staff              fName, lName
                   position
Client             fName, lName
PropertyForRent    rentFinish
                   city
                   rent

File organizations and indexes for DreamHome with Oracle

In this section we repeat the above exercise of determining appropriate file organizations and indexes for the Staff user views of DreamHome. Once again, we use the terminology of the DBMS – Oracle refers to a relation as a table with columns and rows.

Oracle automatically adds an index for each primary key. In addition, Oracle recommends that UNIQUE indexes are not explicitly defined on tables but instead UNIQUE integrity constraints are defined on the desired columns. Oracle enforces UNIQUE integrity constraints by automatically defining a unique index on the unique key. Exceptions to this recommendation are usually performance related. For example, using a CREATE TABLE ... AS SELECT with a UNIQUE constraint is slower than creating the table without the constraint and then manually creating a UNIQUE index.

Assume that the tables are created with the identified primary, alternate, and foreign keys specified. We now identify whether any clusters are required and whether any additional indexes are required. To keep the design simple, we will assume that clusters are not appropriate. Again, considering just the query transactions listed in Appendix A for the Staff view of DreamHome, there may be performance benefits in adding the indexes shown in Table 17.4.
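For example, two of these indexes might be created in Oracle as follows (the index names are our own; the columns follow the predicates and joins identified in Table 17.2):

CREATE INDEX StaffNameInd ON Staff(fName, lName);
CREATE INDEX PropertyCityInd ON PropertyForRent(city);

Because Oracle does not automatically index foreign key columns, similar CREATE INDEX statements would be needed for the frequently joined foreign keys identified in the analysis above.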
Table 17.4 Additional indexes to be created in Oracle based on the query transactions for the Staff view of DreamHome. (The table lists, for the Staff, Client, PropertyForRent, Viewing, and Lease tables, the columns to be indexed: fName/lName, supervisorStaffNo, position, staffNo, ownerNo, clientNo, propertyNo, rentFinish, city, and rent.)

Again, we leave it as an exercise for the reader to choose additional indexes to create in Oracle for the transactions listed in Appendix A for the Branch view of DreamHome (see Exercise 17.6).

Step 4.4 Estimate disk space requirements

Objective  To estimate the amount of disk space that will be required by the database.

It may be a requirement that the physical database implementation can be handled by the current hardware configuration. Even if this is not the case, the designer still has to estimate the amount of disk space that is required to store the database, in the event that new hardware has to be procured. The objective of this step is to estimate the amount of disk space that is required to support the database implementation on secondary storage. As with the previous steps, estimating the disk usage is highly dependent on the target DBMS and the hardware used to support the database. In general, the estimate is based on the size of each tuple and the number of tuples in the relation. The latter estimate should be a maximum number, but it may also be worth considering how the relation will grow, and modifying the resulting disk size by this growth factor to determine the potential size of the database in the future. In Appendix H (see companion Web site) we illustrate the process for estimating the size of relations created in Oracle.

Step 5 Design User Views

Objective  To design the user views that were identified during the requirements collection and analysis stage of the database system development lifecycle.

The first phase of the database design methodology presented in Chapter 15 involved the production of a conceptual data model for either the single user view or a number of combined user views identified during the requirements collection and analysis stage. In Section 10.4.4 we identified four user views for DreamHome named Director, Manager, Supervisor, and Assistant. Following an analysis of the data requirements for these user views, we used the centralized approach to merge the requirements for the user views as follows:

- Branch, consisting of the Director and Manager user views;
- Staff, consisting of the Supervisor and Assistant user views.

In Step 2 the conceptual data model was mapped to a logical data model based on the relational model. The objective of this step is to design the user views identified previously. In a standalone DBMS on a PC, user views are usually a convenience, defined to simplify database requests. However, in a multi-user DBMS, user views play a central role in defining the structure of the database and enforcing security. In Section 6.4.7 we discussed the major advantages of user views, such as data independence, reduced complexity, and customization. We previously discussed how to create views using the ISO SQL standard (Section 6.4.10), and how to create views (stored queries) in Microsoft Office Access (Chapter 7) and in Oracle (Section 8.2.5).

Document design of user views

The design of the individual user views should be fully documented.
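As a simple illustration of the SQL mechanism referred to above, a view supporting the Staff user views might restrict the PropertyForRent columns and rows that Supervisors and Assistants can see (the view name and the chosen columns and branch are illustrative rather than part of the DreamHome specification):

CREATE VIEW StaffPropertyForRent AS
  SELECT propertyNo, street, city, postcode, type, rooms, rent, staffNo
  FROM   PropertyForRent
  WHERE  branchNo = 'B003';

Queries written by these users then reference StaffPropertyForRent rather than the base relation, which both simplifies their requests and supports the security design discussed in the next step.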
Step 6 Design Security Mechanisms

Objective  To design the security mechanisms for the database as specified by the users during the requirements collection and analysis stage of the database system development lifecycle.

A database represents an essential corporate resource and so security of this resource is extremely important. During the requirements collection and analysis stage of the database system development lifecycle, specific security requirements should have been documented in the system requirements specification (see Section 10.4.4). The objective of this step is to decide how these security requirements will be realized. Some systems offer different security facilities than others. Again, the database designer must be aware of the facilities offered by the target DBMS. As we discuss in Chapter 19, relational DBMSs generally provide two types of database security:

- system security;
- data security.

System security covers access and use of the database at the system level, such as a user name and password. Data security covers access and use of database objects (such as relations and views) and the actions that users can have on the objects. Again, the design of access rules is dependent on the target DBMS; some systems provide more facilities than others for designing access rules. We have previously discussed three particular ways to create access rules: using the discretionary GRANT and REVOKE statements of the ISO SQL standard (Section 6.6), Microsoft Office Access (Section 8.1.9), and Oracle (Section 8.2.5). We discuss security more fully in Chapter 19.

Document design of security measures

The design of the security measures should be fully documented. If the physical design affects the logical data model, this model should also be updated.

Chapter Summary

- Physical database design is the process of producing a description of the implementation of the database on secondary storage. It describes the base relations and the storage structures and access methods used to access the data effectively, along with any associated integrity constraints and security measures.
- The design of the base relations can be undertaken only once the designer is fully aware of the facilities offered by the target DBMS.
- The initial step (Step 3) of physical database design is the translation of the logical data model into a form that can be implemented in the target relational DBMS.
- The next step (Step 4) designs the file organizations and access methods that will be used to store the base relations. This involves analyzing the transactions that will run on the database, choosing suitable file organizations based on this analysis, choosing indexes and, finally, estimating the disk space that will be required by the implementation.
- Secondary indexes provide a mechanism for specifying an additional key for a base relation that can be used to retrieve data more efficiently. However, there is an overhead involved in the maintenance and use of secondary indexes that has to be balanced against the performance improvement gained when retrieving data.
- One approach to selecting an appropriate file organization for a relation is to keep the tuples unordered and create as many secondary indexes as necessary. Another approach is to order the tuples in the relation by specifying a primary or clustering index.
- One approach to determining which secondary indexes are needed is to produce a 'wish-list' of attributes that we consider are candidates for indexing, and then to examine the impact of maintaining each of these indexes.
- The objective of Step 5 is to design how to implement the user views identified during the requirements collection and analysis stage, such as using the mechanisms provided by SQL.
- A database represents an essential corporate resource and so security of this resource is extremely important. The objective of Step 6 is to design how the security mechanisms identified during the requirements collection and analysis stage will be realized.

Review Questions

17.1 Explain the difference between conceptual, logical, and physical database design. Why might these tasks be carried out by different people?
17.2 Describe the inputs and outputs of physical database design.
17.3 Describe the purpose of the main steps in the physical design methodology presented in this chapter.
17.4 Discuss when indexes may improve the efficiency of the system.

Exercises

The DreamHome case study

17.5 In Step 4.3 we chose the indexes to create in Microsoft Office Access for the query transactions listed in Appendix A for the Staff user views of DreamHome. Choose indexes to create in Microsoft Office Access for the query transactions listed in Appendix A for the Branch view of DreamHome.
17.6 Repeat Exercise 17.5 using Oracle as the target DBMS.
17.7 Create a physical database design for the logical design of the DreamHome case study (described in Chapter 16) based on the DBMS that you have access to.
17.8 Implement this physical design for DreamHome created in Exercise 17.7.

The University Accommodation Office case study

17.9 Based on the logical data model developed in Exercise 16.10, create a physical database design for the University Accommodation Office case study (described in Appendix B.1) based on the DBMS that you have access to.
17.10 Implement the University Accommodation Office database using the physical design created in Exercise 17.9.

The EasyDrive School of Motoring case study

17.11 Based on the logical data model developed in Exercise 16.11, create a physical database design for the EasyDrive School of Motoring case study (described in Appendix B.2) based on the DBMS that you have access to.
17.12 Implement the EasyDrive School of Motoring database using the physical design created in Exercise 17.11.

The Wellmeadows Hospital case study

17.13 Based on the logical data model developed in Exercise 16.13, create a physical database design for the Wellmeadows Hospital case study (described in Appendix B.3) based on the DBMS that you have access to.
17.14 Implement the Wellmeadows Hospital database using the physical design created in Exercise 17.13.

Chapter 18 Methodology – Monitoring and Tuning the Operational System

Chapter Objectives

In this chapter you will learn:

- The meaning of denormalization.
- When to denormalize to improve performance.
- The importance of monitoring and tuning the operational system.
- How to measure efficiency.
- How system resources affect performance.

In this chapter we describe and illustrate by example the final two steps of the physical database design methodology for relational databases.
We provide guidelines for determining when to denormalize the logical data model and introduce redundancy, and then discuss the importance of monitoring the operational system and continuing to tune it. In places, we show physical implementation details to clarify the discussion.

18.1 Denormalizing and Introducing Controlled Redundancy

Step 7 Consider the Introduction of Controlled Redundancy

Objective  To determine whether introducing redundancy in a controlled manner by relaxing the normalization rules will improve the performance of the system.

Normalization is a technique for deciding which attributes belong together in a relation. One of the basic aims of relational database design is to group attributes together in a relation because there is a functional dependency between them. The result of normalization is a logical database design that is structurally consistent and has minimal redundancy. However, it is sometimes argued that a normalized database design does not provide maximum processing efficiency. Consequently, there may be circumstances where it may be necessary to accept the loss of some of the benefits of a fully normalized design in favor of performance. This should be considered only when it is estimated that the system will not be able to meet its performance requirements. We are not advocating that normalization should be omitted from logical database design: normalization forces us to understand completely each attribute that has to be represented in the database. This may be the most important factor that contributes to the overall success of the system. In addition, the following factors have to be considered:

- denormalization makes implementation more complex;
- denormalization often sacrifices flexibility;
- denormalization may speed up retrievals but it slows down updates.

Formally, the term denormalization refers to a refinement to the relational schema such that the degree of normalization for a modified relation is less than the degree of at least one of the original relations. We also use the term more loosely to refer to situations where we combine two relations into one new relation, and the new relation is still normalized but contains more nulls than the original relations. Some authors refer to denormalization as usage refinement.

As a general rule of thumb, if performance is unsatisfactory and a relation has a low update rate and a very high query rate, denormalization may be a viable option. The transaction/relation cross-reference matrix that may have been produced in Step 4.1 provides useful information for this step. The matrix summarizes, in a visual way, the access patterns of the transactions that will run on the database. It can be used to highlight possible candidates for denormalization, and to assess the effects this would have on the rest of the model. More specifically, in this step we consider duplicating certain attributes or joining relations together to reduce the number of joins required to perform a query.

Indirectly, we have encountered an implicit example of denormalization when dealing with address attributes. For example, consider the definition of the Branch relation:

Branch (branchNo, street, city, postcode, mgrStaffNo)

Strictly speaking, this relation is not in third normal form: postcode (the post or zip code) functionally determines city. In other words, we can determine the value of the city attribute given a value for the postcode attribute.
Hence, the Branch relation is in Second Normal Form (2NF). To normalize the relation to Third Normal Form (3NF), it would be necessary to split the relation into two, as follows:

Branch (branchNo, street, postcode, mgrStaffNo)
Postcode (postcode, city)

However, we rarely wish to access the branch address without the city attribute. This would mean that we would have to perform a join whenever we want a complete address for a branch. As a result, we settle for second normal form and implement the original Branch relation.

Unfortunately, there are no fixed rules for determining when to denormalize relations. In this step we discuss some of the more common situations for considering denormalization. For additional information, the interested reader is referred to Rogers (1989) and Fleming and Von Halle (1989). In particular, we consider denormalization in the following situations, specifically to speed up frequent or critical transactions:

- Step 7.1 Combining one-to-one (1:1) relationships
- Step 7.2 Duplicating non-key attributes in one-to-many (1:*) relationships to reduce joins
- Step 7.3 Duplicating foreign key attributes in one-to-many (1:*) relationships to reduce joins
- Step 7.4 Duplicating attributes in many-to-many (*:*) relationships to reduce joins
- Step 7.5 Introducing repeating groups
- Step 7.6 Creating extract tables
- Step 7.7 Partitioning relations

To illustrate these steps, we use the relation diagram shown in Figure 18.1(a) and the sample data shown in Figure 18.1(b).

Figure 18.1 (a) Sample relation diagram. (b) Sample relations.

Step 7.1 Combining one-to-one (1:1) relationships

Re-examine one-to-one (1:1) relationships to determine the effects of combining the relations into a single relation. Combination should only be considered for relations that are frequently referenced together and infrequently referenced separately. Consider, for example, the 1:1 relationship between Client and Interview, as shown in Figure 18.1. The Client relation contains information on potential renters of property; the Interview relation contains the date of the interview and comments made by a member of staff about a Client. We could combine these two relations together to form a new relation ClientInterview, as shown in Figure 18.2. Since the relationship between Client and Interview is 1:1 and the participation is optional, there may be a significant number of nulls in the combined relation ClientInterview depending on the proportion of tuples involved in the participation, as shown in Figure 18.2(b). If the original Client relation is large and the proportion of tuples involved in the participation is small, there will be a significant amount of wasted space.

Figure 18.2 Combined Client and Interview: (a) revised extract from the relation diagram; (b) combined relation.

Step 7.2 Duplicating non-key attributes in one-to-many (1:*) relationships to reduce joins

With the specific aim of reducing or removing joins from frequent or critical queries, consider the benefits that may result from duplicating one or more non-key attributes of the parent relation in the child relation in a 1:* relationship. For example, whenever the PropertyForRent relation is accessed, it is very common for the owner's name to be accessed at the same time.
A typical SQL query would be:

SELECT p.*, o.lName
FROM PropertyForRent p, PrivateOwner o
WHERE p.ownerNo = o.ownerNo AND branchNo = 'B003';

based on the original relation diagram and sample relations shown in Figure 18.1. If we duplicate the lName attribute in the PropertyForRent relation, we can remove the PrivateOwner relation from the query, which in SQL becomes:

SELECT p.*
FROM PropertyForRent p
WHERE branchNo = 'B003';

based on the revised relation shown in Figure 18.3.

Figure 18.3 Revised PropertyForRent relation with duplicated lName attribute from the PrivateOwner relation.

The benefits that result from this change have to be balanced against the problems that may arise. For example, if the duplicated data is changed in the parent relation, it must be updated in the child relation. Further, for a 1:* relationship there may be multiple occurrences of each data item in the child relation (for example, the names Farrel and Shaw both appear twice in the revised PropertyForRent relation), in which case it becomes necessary to maintain consistency of multiple copies. If the update of the lName attribute in the PrivateOwner and PropertyForRent relations cannot be automated, the potential for loss of integrity is considerable. An associated problem with duplication is the additional time that is required to maintain consistency automatically every time a tuple is inserted, updated, or deleted. In our case, it is unlikely that the name of the owner of a property will change, so the duplication may be warranted.

Another problem to consider is the increase in storage space resulting from the duplication. Again, with the relatively low cost of secondary storage nowadays, this may not be so much of a problem. However, this is not a justification for arbitrary duplication.

A special case of a one-to-many (1:*) relationship is a lookup table, sometimes called a reference table or pick list. Typically, a lookup table contains a code and a description. For example, we may define a lookup (parent) table for property type and modify the PropertyForRent (child) table, as shown in Figure 18.4. The advantages of using a lookup table are:

- reduction in the size of the child relation: the type code occupies 1 byte as opposed to 5 bytes for the type description;
- if the description can change (which is not the case in this particular example), it is easier to change it once in the lookup table as opposed to changing it many times in the child relation;
- the lookup table can be used to validate user input.

Figure 18.4 Lookup table for property type: (a) relation diagram; (b) sample relations.

If the lookup table is used in frequent or critical queries, and the description is unlikely to change, consideration should be given to duplicating the description attribute in the child relation, as shown in Figure 18.5. The original lookup table is not redundant – it can still be used to validate user input. However, by duplicating the description in the child relation, we have eliminated the need to join the child relation to the lookup table.

Figure 18.5 Modified PropertyForRent relation with duplicated description attribute.
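A sketch of this lookup-table design in SQL is given below (the table and column names PropertyType, type, and description are our own labels for the structures pictured in Figures 18.4 and 18.5; the exact data types would depend on the target DBMS):

CREATE TABLE PropertyType (
    type        CHAR(1)    NOT NULL,   -- short code, e.g. 'F'
    description VARCHAR(5) NOT NULL,   -- e.g. 'Flat', 'House'
    PRIMARY KEY (type));

-- The child relation keeps only the 1-byte code (and, if the design of
-- Figure 18.5 is followed, optionally the duplicated description as well)
ALTER TABLE PropertyForRent
    ADD FOREIGN KEY (type) REFERENCES PropertyType(type);

The foreign key keeps the child's codes valid against the lookup table, which is the validation role referred to above.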
Step 7.3 Duplicating foreign key attributes in one-to-many (1:*) relationships to reduce joins

Again, with the specific aim of reducing or removing joins from frequent or critical queries, consider the benefits that may result from duplicating one or more of the foreign key attributes in a relationship. For example, a frequent query for DreamHome is to list all the private property owners at a branch, using an SQL query of the form:

SELECT o.lName
FROM PropertyForRent p, PrivateOwner o
WHERE p.ownerNo = o.ownerNo AND branchNo = 'B003';

based on the original data shown in Figure 18.1. In other words, because there is no direct relationship between PrivateOwner and Branch, to get the list of owners we have to use the PropertyForRent relation to gain access to the branch number, branchNo. We can remove the need for this join by duplicating the foreign key branchNo in the PrivateOwner relation; that is, we introduce a direct relationship between the Branch and PrivateOwner relations. In this case, we can simplify the SQL query to:

SELECT o.lName
FROM PrivateOwner o
WHERE branchNo = 'B003';

based on the revised relation diagram and PrivateOwner relation shown in Figure 18.6. If this change is made, it will be necessary to introduce additional foreign key constraints, as discussed in Step 2.2.

Figure 18.6 Duplicating the foreign key branchNo in the PrivateOwner relation: (a) revised (simplified) relation diagram with branchNo included as a foreign key; (b) revised PrivateOwner relation.

If an owner could rent properties through many branches, the above change would not work. In this case, it would be necessary to model a many-to-many (*:*) relationship between Branch and PrivateOwner. Note also that the PropertyForRent relation has the branchNo attribute because it is possible for a property not to have a member of staff allocated to it, particularly at the start when the property is first taken on by the agency. If the PropertyForRent relation did not have the branch number, it would be necessary to join the PropertyForRent relation to the Staff relation based on the staffNo attribute to get the required branch number. The original SQL query would then become:

SELECT o.lName
FROM Staff s, PropertyForRent p, PrivateOwner o
WHERE s.staffNo = p.staffNo AND p.ownerNo = o.ownerNo AND s.branchNo = 'B003';

Removing two joins from the query may provide greater justification for creating a direct relationship between PrivateOwner and Branch and thereby duplicating the foreign key branchNo in the PrivateOwner relation.

Step 7.4 Duplicating attributes in many-to-many (*:*) relationships to reduce joins

During logical database design, we mapped each *:* relationship into three relations: the two relations derived from the original entities and a new relation representing the relationship between the two entities. Now, if we wish to produce information from the *:* relationship, we have to join these three relations. In some circumstances, it may be possible to reduce the number of relations to be joined by duplicating attributes from one of the original entities in the intermediate relation. For example, the *:* relationship between Client and PropertyForRent has been decomposed by introducing the intermediate Viewing relation.
Consider the requirement that the DreamHome sales staff should contact clients who have still to make a comment on the properties they have viewed. However, the sales staff need only the street attribute of the property when talking to the clients. The required SQL query is:

SELECT p.street, c.*, v.viewDate
FROM Client c, Viewing v, PropertyForRent p
WHERE v.propertyNo = p.propertyNo AND c.clientNo = v.clientNo AND comment IS NULL;

based on the relation model and sample data shown in Figure 18.1. If we duplicate the street attribute in the intermediate Viewing relation, we can remove the PropertyForRent relation from the query, giving the SQL query:

SELECT c.*, v.street, v.viewDate
FROM Client c, Viewing v
WHERE c.clientNo = v.clientNo AND comment IS NULL;

based on the revised Viewing relation shown in Figure 18.7.

Figure 18.7 Duplicating the street attribute from the PropertyForRent relation in the Viewing relation.

Step 7.5 Introducing repeating groups

Repeating groups were eliminated from the logical data model as a result of the requirement that all entities be in first normal form. Repeating groups were separated out into a new relation, forming a 1:* relationship with the original (parent) relation. Occasionally, reintroducing repeating groups is an effective way to improve system performance. For example, each DreamHome branch office has a maximum of three telephone numbers, although not all offices necessarily have the same number of lines. In the logical data model, we created a Telephone entity with a three-to-one (3:1) relationship with Branch, resulting in two relations, as shown in Figure 18.1. If access to this information is important or frequent, it may be more efficient to combine the relations and store the telephone details in the original Branch relation, with one attribute for each telephone, as shown in Figure 18.8.

Figure 18.8 Branch incorporating repeating group: (a) revised relation diagram; (b) revised relation.

In general, this type of denormalization should be considered only in the following circumstances:

- the absolute number of items in the repeating group is known (in this example there is a maximum of three telephone numbers);
- the number is static and will not change over time (the maximum number of telephone lines is fixed and is not expected to change);
- the number is not very large, typically not greater than 10, although this is not as important as the first two conditions.

Sometimes it may be only the most recent or current value in a repeating group, or just the fact that there is a repeating group, that is needed most frequently. In the above example we may choose to store one telephone number in the Branch relation and leave the remaining numbers for the Telephone relation. This would remove the presence of nulls from the Branch relation, as each branch must have at least one telephone number.

Step 7.6 Creating extract tables

There may be situations where reports have to be run at peak times during the day. These reports access derived data and perform multi-relation joins on the same set of base relations. However, the data the report is based on may be relatively static or, in some cases, may not have to be current (that is, if the data is a few hours old, the report would be perfectly acceptable).
In this case, it may be possible to create a single, highly denormalized extract table based on the relations required by the reports, and allow the users to access the extract table directly instead of the base relations. The most common technique for producing extract tables is to create and populate the tables in an overnight batch run when the system is lightly loaded.

Step 7.7 Partitioning relations

Rather than combining relations together, an alternative approach that addresses the key problem with supporting very large relations (and indexes) is to decompose them into a number of smaller and more manageable pieces called partitions. As illustrated in Figure 18.9, there are two main types of partitioning: horizontal partitioning and vertical partitioning.

Horizontal partitioning  Distributing the tuples of a relation across a number of (smaller) relations.

Vertical partitioning  Distributing the attributes of a relation across a number of (smaller) relations (the primary key is duplicated to allow the original relation to be reconstructed).

Figure 18.9 Horizontal and vertical partitioning.

Partitions are particularly useful in applications that store and analyze large amounts of data. For example, DreamHome maintains an ArchivedPropertyForRent relation with several hundreds of thousands of tuples that are held indefinitely for analysis purposes. Searching for a particular tuple at a branch could be quite time consuming; however, we could reduce this time by horizontally partitioning the relation, with one partition for each branch. We can create a (hash) partition for this scenario in Oracle using the SQL statement shown in Figure 18.10.

Figure 18.10 Oracle SQL statement to create a hash partition.

CREATE TABLE ArchivedPropertyForRentPartition(
    propertyNo  VARCHAR2(5)  NOT NULL,
    street      VARCHAR2(25) NOT NULL,
    city        VARCHAR2(15) NOT NULL,
    postcode    VARCHAR2(8),
    type        CHAR         NOT NULL,
    rooms       SMALLINT     NOT NULL,
    rent        NUMBER(6, 2) NOT NULL,
    ownerNo     VARCHAR2(5)  NOT NULL,
    staffNo     VARCHAR2(5),
    branchNo    CHAR(4)      NOT NULL,
    PRIMARY KEY (propertyNo),
    FOREIGN KEY (ownerNo) REFERENCES PrivateOwner(ownerNo),
    FOREIGN KEY (staffNo) REFERENCES Staff(staffNo),
    FOREIGN KEY (branchNo) REFERENCES Branch(branchNo))
PARTITION BY HASH (branchNo)
    (PARTITION b1 TABLESPACE TB01,
     PARTITION b2 TABLESPACE TB02,
     PARTITION b3 TABLESPACE TB03,
     PARTITION b4 TABLESPACE TB04);

As well as hash partitioning, other common types of partitioning are range (each partition is defined by a range of values for one or more attributes) and list (each partition is defined by a list of values for an attribute). There are also composite partitions such as range–hash and list–hash (each partition is defined by a range or a list of values and then each partition is further subdivided based on a hash function). There may also be circumstances where we frequently examine particular attributes of a very large relation and it may be appropriate to vertically partition the relation into those attributes that are frequently accessed together and another vertical partition for the remaining attributes (with the primary key replicated in each partition to allow the original relation to be reconstructed using a join).
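Returning to the range style mentioned above, an archive table like the one in Figure 18.10 could instead be partitioned by the period in which each tuple was archived. The following is a sketch only: the dateArchived column, the cut-down column list, and the partition names are invented for illustration.

CREATE TABLE ArchivedPropertyForRentByYear(
    propertyNo   VARCHAR2(5) NOT NULL,
    branchNo     CHAR(4)     NOT NULL,
    dateArchived DATE        NOT NULL)
PARTITION BY RANGE (dateArchived)
    (PARTITION p2003 VALUES LESS THAN (TO_DATE('01-JAN-2004', 'DD-MON-YYYY')),
     PARTITION p2004 VALUES LESS THAN (TO_DATE('01-JAN-2005', 'DD-MON-YYYY')),
     PARTITION pMax  VALUES LESS THAN (MAXVALUE));

A range scheme like this allows queries restricted to one year to touch only one partition, and old partitions can be archived or dropped as a unit.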
Partitioning has a number of advantages:
– Improved load balancing Partitions can be allocated to different areas of secondary storage, thereby permitting parallel access while at the same time minimizing the contention for access to the same storage area that would arise if the relation were not partitioned.
– Improved performance By limiting the amount of data to be examined or processed, and by enabling parallel execution, performance can be enhanced.
– Increased availability If partitions are allocated to different storage areas and one storage area becomes unavailable, the other partitions would still be available.
– Improved recovery Smaller partitions can be recovered more efficiently (equally well, the DBA may find backing up smaller partitions easier than backing up very large relations).
– Security Data in a partition can be restricted to those users who require access to it, with different partitions having different access restrictions.

Partitioning can also have a number of disadvantages:
– Complexity Partitioning is not usually transparent to end-users and queries that utilize more than one partition become more complex to write.
– Reduced performance Queries that combine data from more than one partition may be slower than a non-partitioned approach.
– Duplication Vertical partitioning involves duplication of the primary key. This leads not only to increased storage requirements but also to potential inconsistencies arising.

Consider implications of denormalization

Consider the implications of denormalization on the previous steps in the methodology. For example, it may be necessary to reconsider the choice of indexes on the relations that have been denormalized to establish whether existing indexes should be removed or additional indexes added. In addition, it will be necessary to consider how data integrity will be maintained. Common solutions are:
– Triggers Triggers can be used to automate the updating of derived or duplicated data (a sketch is given at the end of this step).
– Transactions Build transactions into each application that make the updates to denormalized data as a single (atomic) action.
– Batch reconciliation Run batch programs at appropriate times to make the denormalized data consistent.

In terms of maintaining integrity, triggers provide the best solution, although they can cause performance problems. The advantages and disadvantages of denormalization are summarized in Table 18.1.

Table 18.1 Advantages and disadvantages of denormalization

Advantages:
– Can improve performance by: precomputing derived data; minimizing the need for joins; reducing the number of foreign keys in relations; reducing the number of indexes (thereby saving storage space); reducing the number of relations.

Disadvantages:
– May speed up retrievals but can slow down updates.
– Always application-specific and needs to be re-evaluated if the application changes.
– Can increase the size of relations.
– May simplify implementation in some cases but may make it more complex in others.
– Sacrifices flexibility.

Document introduction of redundancy

The introduction of redundancy should be fully documented, along with the reasons for introducing it. In particular, document the reasons for selecting one approach where many alternatives exist. Update the logical data model to reflect any changes made as a result of denormalization.
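As an illustration of the trigger-based approach, the following Oracle-style trigger sketches how the street value duplicated in the Viewing relation (Step 7.4) might be kept consistent when the corresponding PropertyForRent tuple changes. It is a sketch only, assuming the column names shown in Figure 18.7; the trigger name is invented for the example.

CREATE OR REPLACE TRIGGER viewingStreetSync
AFTER UPDATE OF street ON PropertyForRent
FOR EACH ROW
BEGIN
    -- Propagate the new street value to the duplicated copy held in Viewing
    UPDATE Viewing v
    SET v.street = :NEW.street
    WHERE v.propertyNo = :NEW.propertyNo;
END;

A corresponding trigger (or application code) would also be needed to copy the street value into Viewing when a new viewing is recorded.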
18.2 Monitoring the System to Improve Performance Step 8 Monitor and Tune the Operational System Objective To monitor the operational system and improve the performance of the system to correct inappropriate design decisions or reflect changing requirements. For this activity we should remember that one of the main objectives of physical database design is to store and access data in an efficient way (see Appendix C). There are a number of factors that we may use to measure efficiency: n n n Transaction throughput This is the number of transactions that can be processed in a given time interval. In some systems, such as airline reservations, high transaction throughput is critical to the overall success of the system. Response time This is the elapsed time for the completion of a single transaction. From a user’s point of view, we want to minimize response time as much as possible. However, there are some factors that influence response time that the designer may have no control over, such as system loading or communication times. Response time can be shortened by: – reducing contention and wait times, particularly disk I/O wait times; – reducing the amount of time for which resources are required; – using faster components. Disk storage This is the amount of disk space required to store the database files. The designer may wish to minimize the amount of disk storage used. However, there is no one factor that is always correct. Typically, the designer has to trade one factor off against another to achieve a reasonable balance. For example, increasing the amount of data stored may decrease the response time or transaction throughput. The initial physical database design should not be regarded as static, but should be considered as an estimate of how the operational system might perform. Once the initial design has been implemented, it will be necessary to monitor the system and tune it as a result of observed performance and changing requirements (see Step 8). Many DBMSs provide the Database Administrator (DBA) with utilities to monitor the operation of the system and tune it. 18.2 Monitoring the System to Improve Performance There are many benefits to be gained from tuning the database: n n n n n Tuning can avoid the procurement of additional hardware. It may be possible to downsize the hardware configuration. This results in less, and cheaper, hardware and consequently less expensive maintenance. A well-tuned system produces faster response times and better throughput, which in turn makes the users, and hence the organization, more productive. Improved response times can improve staff morale. Improved response times can increase customer satisfaction. These last two benefits are more intangible than the others. However, we can certainly state that slow response times demoralize staff and potentially lose customers. To tune an operational system, the physical database designer must be aware of how the various hardware components interact and affect database performance, as we now discuss. Understanding system resources Main memory Main memory accesses are significantly faster than secondary storage accesses, sometimes tens or even hundreds of thousands of times faster. In general, the more main memory available to the DBMS and the database applications, the faster the applications will run. However, it is sensible always to have a minimum of 5% of main memory available. Equally well, it is advisable not to have any more than 10% available otherwise main memory is not being used optimally. 
When there is insufficient memory to accommodate all processes, the operating system transfers pages of processes to disk to free up memory. When one of these pages is next required, the operating system has to transfer it back from disk. Sometimes it is necessary to swap entire processes from memory to disk, and back again, to free up memory. Problems occur with main memory when paging or swapping becomes excessive.

To ensure efficient usage of main memory, it is necessary to understand how the target DBMS uses main memory, what buffers it keeps in main memory, what parameters exist to allow the size of the buffers to be adjusted, and so on. For example, Oracle keeps a data dictionary cache in main memory that ideally should be large enough to handle 90% of data dictionary accesses without having to retrieve the information from disk. It is also necessary to understand the access patterns of users: an increase in the number of concurrent users accessing the database will result in an increase in the amount of memory being utilized.

CPU

The CPU controls the tasks of the other system resources and executes user processes, and is the most costly resource in the system so needs to be correctly utilized. The main objective for this component is to prevent CPU contention in which processes are waiting for the CPU. CPU bottlenecks occur when either the operating system or user processes make too many demands on the CPU. This is often a result of excessive paging.

It is necessary to understand the typical workload through a 24-hour period and ensure that sufficient resources are available for not only the normal workload but also the peak workload (for example, if the system has 90% CPU utilization and 10% idle during the normal workload then there may not be sufficient scope to handle the peak workload). One option is to ensure that during peak load no unnecessary jobs are being run and that such jobs are instead run in off-hours. Another option may be to consider multiple CPUs, which allows the processing to be distributed and operations to be performed in parallel. CPU MIPS (millions of instructions per second) can be used as a guide in comparing platforms and determining their ability to meet the enterprise's throughput requirements.

Disk I/O

With any large DBMS, there is a significant amount of disk I/O involved in storing and retrieving data. Disks usually have a recommended I/O rate and, when this rate is exceeded, I/O bottlenecks occur. While CPU clock speeds have increased dramatically in recent years, I/O speeds have not increased proportionately. The way in which data is organized on disk can have a major impact on the overall disk performance. One problem that can arise is disk contention. This occurs when multiple processes try to access the same disk simultaneously. Most disks have limits on both the number of accesses and the amount of data they can transfer per second and, when these limits are reached, processes may have to wait to access the disk. To avoid this, it is recommended that storage should be evenly distributed across available drives to reduce the likelihood of performance problems occurring.

Figure 18.11 Typical disk configuration.
Figure 18.11 illustrates the basic principles of distributing the data across disks:
– the operating system files should be separated from the database files;
– the main database files should be separated from the index files;
– the recovery log file (see Section 20.3.3) should be separated from the rest of the database.

If a disk still appears to be overloaded, one or more of its heavily accessed files can be moved to a less active disk (this is known as distributing I/O). Load balancing can be achieved by applying this principle to each of the disks until they all have approximately the same amount of I/O. Once again, the physical database designer needs to understand how the DBMS operates, the characteristics of the hardware, and the access patterns of the users.

Disk I/O has been revolutionized with the introduction of RAID (Redundant Array of Independent Disks) technology. RAID works on having a large disk array comprising an arrangement of several independent disks that are organized to increase performance and at the same time improve reliability. We discuss RAID in Section 19.2.6.

Network

When the amount of traffic on the network is too great, or when the number of network collisions is large, network bottlenecks occur.

Each of the above resources may affect other system resources. Equally well, an improvement in one resource may effect an improvement in other system resources. For example:
– procuring more main memory should result in less paging, which should help avoid CPU bottlenecks;
– more effective use of main memory may result in less disk I/O.

Summary

Tuning is an activity that is never complete. Throughout the life of the system, it will be necessary to monitor performance, particularly to account for changes in the environment and user requirements. However, making a change to one area of an operational system to improve performance may have an adverse effect on another area. For example, adding an index to a relation may improve the performance of one transaction, but it may adversely affect another, perhaps more important, transaction. Therefore, care must be taken when making changes to an operational system. If possible, test the changes either on a test database, or alternatively, when the system is not being fully used (such as out of working hours).

Document tuning activity

The mechanisms used to tune the system should be fully documented, along with the reasons for tuning it in the chosen way. In particular, document the reasons for selecting one approach where many alternatives exist.

New Requirement for DreamHome

As well as tuning the system to maintain optimal performance, it may also be necessary to handle changing requirements. For example, suppose that after some months as a fully operational database, several users of the DreamHome system raise two new requirements:

(1) Ability to hold pictures of the properties for rent, together with comments that describe the main features of the property.

In Microsoft Office Access we are able to accommodate this request using OLE (Object Linking and Embedding) fields, which are used to store data such as Microsoft Word or Microsoft Excel documents, pictures, sound, and other types of binary data created in other programs. OLE objects can be linked to, or embedded in, a field in a Microsoft Office Access table and then displayed in a form or report.
To implement this new requirement, we restructure the PropertyForRent table to include: (a) a field called picture specified as an OLE data type; this field holds graphical images of properties, created by scanning photographs of the properties for rent and saving the images as BMP (Bit Mapped) graphic files; (b) a field called comments specified as a Memo data type, capable of storing lengthy text. | 535 536 | Chapter 18 z Methodology – Monitoring and Tuning the Operational System Figure 18.12 Form based on PropertyForRent table with new picture and comments fields. A form based on some fields of the PropertyForRent table, including the new fields, is shown in Figure 18.12. The main problem associated with the storage of graphic images is the large amount of disk space required to store the image files. We would therefore need to continue to monitor the performance of the DreamHome database to ensure that satisfying this new requirement does not compromise the system’s performance. (2) Ability to publish a report describing properties available for rent on the Web. This requirement can be accommodated in both Microsoft Office Access and Oracle as both DBMSs provide many features for developing a Web application and publishing on the Internet. However, to use these features, we require a Web browser, such as Microsoft Internet Explorer or Netscape Navigator, and a modem or other network connection to access the Internet. In Chapter 29, we describe in detail the technologies used in the integration of databases and the Web. Exercises | 537 Chapter Summary n n n n n Formally, the term denormalization refers to a refinement to the relational schema such that the degree of normalization for a modified relation is less than the degree of at least one of the original relations. The term is also used more loosely to refer to situations where two relations are combined into one new relation, and the new relation is still normalized but contains more nulls than the original relations. Step 7 of physical database design considers denormalizing the relational schema to improve performance. There may be circumstances where it may be necessary to accept the loss of some of the benefits of a fully normalized design in favor of performance. This should be considered only when it is estimated that the system will not be able to meet its performance requirements. As a rule of thumb, if performance is unsatisfactory and a relation has a low update rate and a very high query rate, denormalization may be a viable option. The final step (Step 8) of physical database design is the ongoing process of monitoring and tuning the operational system to achieve maximum performance. One of the main objectives of physical database design is to store and access data in an efficient way. There are a number of factors that can be used to measure efficiency, including throughput, response time, and disk storage. To improve performance, it is necessary to be aware of how the following four basic hardware components interact and affect system performance: main memory, CPU, disk I/O, and network. Review Questions 18.1 Describe the purpose of the main steps in the physical design methodology presented in this chapter. 18.2 Under what circumstances would you want to denormalize a logical data model? Use examples to illustrate your answer. 18.3 What factors can be used to measure efficiency? 18.4 Discuss how the four basic hardware components interact and affect system performance. 
18.5 How should you distribute data across disks? Exercise 18.6 Investigate whether your DBMS can accommodate the two new requirements for the DreamHome case study given in Step 8 of this chapter. If feasible, produce a design for the two requirements and implement them in your target DBMS. Part 5 Selected Database Issues Chapter 19 Security 541 Chapter 20 Transaction Management 572 Chapter 21 Query Processing 630 Chapter 19 Security Chapter Objectives In this chapter you will learn: n The scope of database security. n Why database security is a serious concern for an organization. n The types of threat that can affect a database system. n How to protect a computer system using computer-based controls. n The security measures provided by Microsoft Office Access and Oracle DBMSs. n Approaches for securing a DBMS on the Web. Data is a valuable resource that must be strictly controlled and managed, as with any corporate resource. Part or all of the corporate data may have strategic importance to an organization and should therefore be kept secure and confidential. In Chapter 2 we discussed the database environment and, in particular, the typical functions and services of a Database Management System (DBMS). These functions and services include authorization services, such that a DBMS must furnish a mechanism to ensure that only authorized users can access the database. In other words, the DBMS must ensure that the database is secure. The term security refers to the protection of the database against unauthorized access, either intentional or accidental. Besides the services provided by the DBMS, discussions on database security could also include broader issues associated with securing the database and its environment. However, these issues are outwith the scope of this book and the interested reader is referred to Pfleeger (1997). 542 | Chapter 19 z Security Structure of this Chapter In Section 19.1 we discuss the scope of database security and examine the types of threat that may affect computer systems in general. In Section 19.2 we consider the range of computer-based controls that are available as countermeasures to these threats. In Sections 19.3 and 19.4 we describe the security measures provided by Microsoft Office Access 2003 DBMS and Oracle9i DBMS. In Section 19.5 we identify the security measures associated with DBMSs and the Web. The examples used throughout this chapter are taken from the DreamHome case study described in Section 10.4 and Appendix A. 19.1 Database Security In this section we describe the scope of database security and discuss why organizations must take potential threats to their computer systems seriously. We also identify the range of threats and their consequences on computer systems. Database security The mechanisms that protect the database against intentional or accidental threats. Security considerations apply not only to the data held in a database: breaches of security may affect other parts of the system, which may in turn affect the database. Consequently, database security encompasses hardware, software, people, and data. To effectively implement security requires appropriate controls, which are defined in specific mission objectives for the system. This need for security, while often having been neglected or overlooked in the past, is now increasingly recognized by organizations. 
The reason for this turnaround is the increasing amounts of crucial corporate data being stored on computer and the acceptance that any loss or unavailability of this data could prove to be disastrous. A database represents an essential corporate resource that should be properly secured using appropriate controls. We consider database security in relation to the following situations:
– theft and fraud;
– loss of confidentiality (secrecy);
– loss of privacy;
– loss of integrity;
– loss of availability.

These situations broadly represent areas in which the organization should seek to reduce risk, that is, the possibility of incurring loss or damage. In some situations, these areas are closely related such that an activity that leads to loss in one area may also lead to loss in another. In addition, events such as fraud or loss of privacy may arise because of either intentional or unintentional acts, and do not necessarily result in any detectable changes to the database or the computer system.

Theft and fraud affect not only the database environment but also the entire organization. As it is people who perpetrate such activities, attention should focus on reducing the opportunities for this occurring. Theft and fraud do not necessarily alter data, as is the case for activities that result in either loss of confidentiality or loss of privacy. Confidentiality refers to the need to maintain secrecy over data, usually only that which is critical to the organization, whereas privacy refers to the need to protect data about individuals. Breaches of security resulting in loss of confidentiality could, for instance, lead to loss of competitiveness, and loss of privacy could lead to legal action being taken against the organization. Loss of data integrity results in invalid or corrupted data, which may seriously affect the operation of an organization. Many organizations are now seeking virtually continuous operation, the so-called 24/7 availability (that is, 24 hours a day, 7 days a week). Loss of availability means that the data, or the system, or both cannot be accessed, which can seriously affect an organization's financial performance. In some cases, events that cause a system to be unavailable may also cause data corruption.

Database security aims to minimize losses caused by anticipated events in a cost-effective manner without unduly constraining the users. In recent times, computer-based criminal activities have significantly increased and are forecast to continue to rise over the next few years.

19.1.1 Threats

Threat Any situation or event, whether intentional or accidental, that may adversely affect a system and consequently the organization.

A threat may be caused by a situation or event involving a person, action, or circumstance that is likely to bring harm to an organization. The harm may be tangible, such as loss of hardware, software, or data, or intangible, such as loss of credibility or client confidence. The problem facing any organization is to identify all possible threats. Therefore, as a minimum an organization should invest time and effort in identifying the most serious threats.

In the previous section we identified areas of loss that may result from intentional or unintentional activities. While some types of threat can be either intentional or unintentional, the impact remains the same. Intentional threats involve people and may be perpetrated by both authorized users and unauthorized users, some of whom may be external to the organization.
Any threat must be viewed as a potential breach of security which, if successful, will have a certain impact. Table 19.1 presents examples of various types of threat, listed under the area on which they may have an impact. For example, 'viewing and disclosing unauthorized data' as a threat may result in theft and fraud, loss of confidentiality, and loss of privacy for the organization.

Table 19.1 Examples of threats. (In the original table each threat is marked against the areas it may affect: theft and fraud, loss of confidentiality, loss of privacy, loss of integrity, and loss of availability.) The threats listed are: using another person's means of access; unauthorized amendment or copying of data; program alteration; inadequate policies and procedures that allow a mix of confidential and normal output; wire tapping; illegal entry by hacker; blackmail; creating 'trapdoor' into system; theft of data, programs, and equipment; failure of security mechanisms, giving greater access than normal; staff shortages or strikes; inadequate staff training; viewing and disclosing unauthorized data; electronic interference and radiation; data corruption owing to power loss or surge; fire (electrical fault, lightning strike, arson), flood, bomb; physical damage to equipment; breaking cables or disconnection of cables; introduction of viruses.

The extent to which an organization suffers as a result of a threat's succeeding depends upon a number of factors, such as the existence of countermeasures and contingency plans. For example, if a hardware failure occurs corrupting secondary storage, all processing activity must cease until the problem is resolved. The recovery will depend upon a number of factors, which include when the last backups were taken and the time needed to restore the system.

An organization needs to identify the types of threat it may be subjected to and initiate appropriate plans and countermeasures, bearing in mind the costs of implementing them. Obviously, it may not be cost-effective to spend considerable time, effort, and money on potential threats that may result only in minor inconvenience. The organization's business may also influence the types of threat that should be considered, some of which may be rare. However, rare events should be taken into account, particularly if their impact would be significant. A summary of the potential threats to computer systems is represented in Figure 19.1.

Figure 19.1 Summary of potential threats to computer systems.

19.2 Countermeasures – Computer-Based Controls

The types of countermeasure to threats on computer systems range from physical controls to administrative procedures. Despite the range of computer-based controls that are available, it is worth noting that, generally, the security of a DBMS is only as good as that of the operating system, owing to their close association. Representation of a typical multi-user computer environment is shown in Figure 19.2.

Figure 19.2 Representation of a typical multi-user computer environment.

In this section we focus on the following computer-based security controls for a multi-user environment (some of which may not be available in the PC environment):
– authorization
– access controls
– views
– backup and recovery
– integrity
– encryption
– RAID technology.
19.2.1 Authorization

Authorization The granting of a right or privilege that enables a subject to have legitimate access to a system or a system's object.

Authorization controls can be built into the software, and govern not only what system or object a specified user can access, but also what the user may do with it. The process of authorization involves authentication of subjects requesting access to objects, where 'subject' represents a user or program and 'object' represents a database table, view, procedure, trigger, or any other object that can be created within the system.

Authentication A mechanism that determines whether a user is who he or she claims to be.

A system administrator is usually responsible for allowing users to have access to a computer system by creating individual user accounts. Each user is given a unique identifier, which is used by the operating system to determine who they are. Associated with each identifier is a password, chosen by the user and known to the operating system, which must be supplied to enable the operating system to verify (or authenticate) who the user claims to be.

This procedure allows authorized use of a computer system but does not necessarily authorize access to the DBMS or any associated application programs. A separate, similar procedure may have to be undertaken to give a user the right to use the DBMS. The responsibility to authorize use of the DBMS usually rests with the Database Administrator (DBA), who must also set up individual user accounts and passwords using the DBMS itself.

Some DBMSs maintain a list of valid user identifiers and associated passwords, which can be distinct from the operating system's list. However, other DBMSs maintain a list whose entries are validated against the operating system's list based on the current user's login identifier. This prevents a user from logging on to the DBMS with one name, having already logged on to the operating system using a different name.

19.2.2 Access Controls

The typical way to provide access controls for a database system is based on the granting and revoking of privileges. A privilege allows a user to create or access (that is, read, write, or modify) some database object (such as a relation, view, or index) or to run certain DBMS utilities. Privileges are granted to users to accomplish the tasks required for their jobs. Excessive granting of unnecessary privileges can compromise security: a privilege should be granted to a user only if that user cannot accomplish his or her work without that privilege.

A user who creates a database object such as a relation or a view automatically gets all privileges on that object. The DBMS subsequently keeps track of how these privileges are granted to other users, and possibly revoked, and ensures that at all times only users with necessary privileges can access an object.

Discretionary Access Control (DAC)

Most commercial DBMSs provide an approach to managing privileges that uses SQL, called Discretionary Access Control (DAC). The SQL standard supports DAC through the GRANT and REVOKE commands. The GRANT command gives privileges to users, and the REVOKE command takes away privileges. We discussed how the SQL standard supports discretionary access control in Section 6.6.

Discretionary access control, while effective, has certain weaknesses. In particular, an unauthorized user can trick an authorized user into disclosing sensitive data.
For example, an unauthorized user such as an Assistant in the DreamHome case study can create a relation to capture new client details and give access privileges to an authorized user such as a Manager without their knowledge. The Assistant can then alter some application programs that the Manager uses to include some hidden instruction to copy sensitive data from the Client relation, which only the Manager has access to, into the new relation created by the Assistant. The unauthorized user, namely the Assistant, now has a copy of the sensitive data, namely new clients of DreamHome, and to cover up his or her actions now modifies the altered application programs back to the original form. Clearly, an additional security approach is required to remove such loopholes, and this requirement is met in an approach called Mandatory Access Control (MAC), which we discuss in detail below. Although discretionary access control is typically provided by most commercial DBMSs, only some also provide support for mandatory access control.

Mandatory Access Control (MAC)

Mandatory Access Control (MAC) is based on system-wide policies that cannot be changed by individual users. In this approach each database object is assigned a security class and each user is assigned a clearance for a security class, and rules are imposed on the reading and writing of database objects by users. The DBMS determines whether a given user can read or write a given object based on certain rules that involve the security level of the object and the clearance of the user. These rules seek to ensure that sensitive data can never be passed on to another user without the necessary clearance. The SQL standard does not include support for MAC.

A popular model for MAC is the Bell–LaPadula model (Bell and LaPadula, 1974), which is described in terms of objects (such as relations, views, tuples, and attributes), subjects (such as users and programs), security classes, and clearances. Each database object is assigned a security class, and each subject is assigned a clearance for a security class. The security classes in a system are ordered, with a most secure class and a least secure class. For our discussion of the model, we assume that there are four classes: top secret (TS), secret (S), confidential (C), and unclassified (U), and we denote the class of an object or subject A as class(A). Therefore for this system, TS > S > C > U, where A > B means that class A data has a higher security level than class B data. The Bell–LaPadula model imposes two restrictions on all reads and writes of database objects:

1. Simple Security Property: Subject S is allowed to read object O only if class(S) >= class(O). For example, a user with TS clearance can read a relation with C clearance, but a user with C clearance cannot read a relation with TS classification.

2. *-Property (Star Property): Subject S is allowed to write object O only if class(S) <= class(O). For example, a user with S clearance can only write objects with S or TS classification.

If discretionary access controls are also specified, these rules represent additional restrictions. Thus to read or write a database object, a user must have the necessary privileges provided through the SQL GRANT command (see Section 6.6) and the security classes of the user and the object must satisfy the restrictions given above.
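To make the discretionary side of this combination concrete, the following standard SQL statements sketch how privileges on the Client relation might be granted and revoked; the user names Manager and Assistant are illustrative only and are not defined in the case study.

GRANT SELECT, UPDATE ON Client TO Manager;

-- Allow the Assistant to read, but not change, client details
GRANT SELECT ON Client TO Assistant;

-- Withdraw the update right if it is no longer required
REVOKE UPDATE ON Client FROM Manager;

Under MAC, these grants alone would not be sufficient: the user's clearance and the object's security class must also satisfy the Simple Security and *-Properties described above.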
19.2 Countermeasures – Computer-Based Controls | 549 Multilevel Relations and Polyinstantiation In order to apply mandatory access control policies in a relational DBMS, a security class must be assigned to each database object. The objects can be at the granularity of relations, tuples, or even individual attribute values. Assume that each tuple is assigned a security class. This situation leads to the concept of a multilevel relation, which is a relation that reveals different tuples to users with different security clearances. For example, the Client relation with an additional attribute displaying the security class for each tuple is shown in Figure 19.3(a). Users with S and TS clearance will see all tuples in the Client relation. However, a user with C clearance will only see the first two tuples and a user with U clearance will see no tuples at all. Assume that a user with clearance C wishes to enter a tuple (CR74, David, Sinclaire) into the Client relation, where the primary key of the relation is clientNo. This insertion is disallowed because it violates the primary key constraint (see Section 3.2.5) for this relation. However, the inability to insert this new tuple informs the user with clearance C that a tuple exists with a primary key value of CR74 at a higher security class than C. This compromises the security requirement that users should not be able to infer any information about objects that have a higher security classification. This problem of inference can be solved by including the security classification attribute as part of the primary key for a relation. In the above example, the insertion of the new tuple into the Client relation is allowed, and the relation instance is modified as shown in Figure 19.3(b). Users with clearance C see the first two tuples and the newly added tuple, but users with clearance S or TS see all five tuples. The result is a relation with two tuples with a clientNo of CR74, which can be confusing. This situation may be dealt with by assuming that the tuple with the higher classification takes priority over the other, or by only revealing a single tuple according to the user’s clearance. The presence of data objects that appear to have different values to users with different clearances is called polyinstantiation. Figure 19.3(a) The Client relation with an additional attribute displaying the security class for each tuple. Figure 19.3(b) The Client relation with two tuples displaying clientNo as CR74. The primary key for this relation is (clientNo, securityClass). 550 | Chapter 19 z Security Although mandatory access control does address a major weakness of discretionary access control, a major disadvantage of MAC is the rigidity of the MAC environment. For example, MAC policies are often established by database or systems administrators, and the classification mechanisms are sometimes considered to be inflexible. 19.2.3 Views View A view is the dynamic result of one or more relational operations operating on the base relations to produce another relation. A view is a virtual relation that does not actually exist in the database, but is produced upon request by a particular user, at the time of request. The view mechanism provides a powerful and flexible security mechanism by hiding parts of the database from certain users. The user is not aware of the existence of any attributes or rows that are missing from the view. 
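As a simple illustration of this mechanism (a sketch only: the view name and the selection predicate are invented for this example, and the Staff attributes are assumed to follow the DreamHome case study), a view could hide the salary attribute and restrict the visible rows to a single branch:

CREATE VIEW StaffBranchB003 AS
SELECT staffNo, fName, lName, position, branchNo
FROM Staff
WHERE branchNo = 'B003';

A user given access only to this view cannot see salaries or staff at other branches.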
A view can be defined over several relations with a user being granted the appropriate privilege to use it, but not to use the base relations. In this way, using a view is more restrictive than simply having certain privileges granted to a user on the base relation(s). We discussed views in detail in Sections 3.4 and 6.4. 19.2.4 Backup and Recovery Backup The process of periodically taking a copy of the database and log file (and possibly programs) on to offline storage media. A DBMS should provide backup facilities to assist with the recovery of a database following failure. It is always advisable to make backup copies of the database and log file at regular intervals and to ensure that the copies are in a secure location. In the event of a failure that renders the database unusable, the backup copy and the details captured in the log file are used to restore the database to the latest possible consistent state. A description of how a log file is used to restore a database is described in more detail in Section 20.3.3. Journaling The process of keeping and maintaining a log file (or journal) of all changes made to the database to enable recovery to be undertaken effectively in the event of a failure. A DBMS should provide logging facilities, sometimes referred to as journaling, which keep track of the current state of transactions and database changes, to provide support for recovery procedures. The advantage of journaling is that, in the event of a failure, the database can be recovered to its last known consistent state using a backup copy of the database and the information contained in the log file. If no journaling is enabled on a 19.2 Countermeasures – Computer-Based Controls | failed system, the only means of recovery is to restore the database using the latest backup version of the database. However, without a log file, any changes made after the last backup to the database will be lost. The process of journaling is discussed in more detail in Section 20.3.3. Integrity 19.2.5 Integrity constraints also contribute to maintaining a secure database system by preventing data from becoming invalid, and hence giving misleading or incorrect results. Integrity constraints were discussed in detail in Section 3.3. Encryption Encryption The encoding of the data by a special algorithm that renders the data unreadable by any program without the decryption key. If a database system holds particularly sensitive data, it may be deemed necessary to encode it as a precaution against possible external threats or attempts to access it. Some DBMSs provide an encryption facility for this purpose. The DBMS can access the data (after decoding it), although there is a degradation in performance because of the time taken to decode it. Encryption also protects data transmitted over communication lines. There are a number of techniques for encoding data to conceal the information; some are termed ‘irreversible’ and others ‘reversible’. Irreversible techniques, as the name implies, do not permit the original data to be known. However, the data can be used to obtain valid statistical information. Reversible techniques are more commonly used. 
To transmit data securely over insecure networks requires the use of a cryptosystem, which includes: n n n n an encryption key to encrypt the data (plaintext); an encryption algorithm that, with the encryption key, transforms the plaintext into ciphertext; a decryption key to decrypt the ciphertext; a decryption algorithm that, with the decryption key, transforms the ciphertext back into plaintext. One technique, called symmetric encryption, uses the same key for both encryption and decryption and relies on safe communication lines for exchanging the key. However, most users do not have access to a secure communication line and, to be really secure, the keys need to be as long as the message (Leiss, 1982). However, most working systems are based on user keys shorter than the message. One scheme used for encryption is the Data Encryption Standard (DES), which is a standard encryption algorithm developed by IBM. This scheme uses one key for both encryption and decryption, which must be kept secret, although the algorithm need not be. The algorithm transforms each 64-bit block of 19.2.6 551 552 | Chapter 19 z Security plaintext using a 56-bit key. The DES is not universally regarded as being very secure, and some authors maintain that a larger key is required. For example, a scheme called PGP (Pretty Good Privacy) uses a 128-bit symmetric algorithm for bulk encryption of the data it sends. Keys with 64 bits are now probably breakable by major governments with special hardware, albeit at substantial cost. However, this technology will be within the reach of organized criminals, major organizations, and smaller governments in a few years. While it is envisaged that keys with 80 bits will also become breakable in the future, it is probable that keys with 128 bits will remain unbeakable for the foreseeable future. The terms ‘strong authentication’ and ‘weak authentication’ are sometimes used to distinguish between algorithms that, to all intents and purposes, cannot be broken with existing technologies and knowledge (strong) from those that can be (weak). Another type of cryptosystem uses different keys for encryption and decryption, and is referred to as asymmetric encryption. One example is public key cryptosystems, which use two keys, one of which is public and the other private. The encryption algorithm may also be public, so that anyone wishing to send a user a message can use the user’s publicly known key in conjunction with the algorithm to encrypt it. Only the owner of the private key can then decipher the message. Public key cryptosystems can also be used to send a ‘digital signature’ with a message and prove that the message came from the person who claimed to have sent it. The most well known asymmetric encryption is RSA (the name is derived from the initials of the three designers of the algorithm). Generally, symmetric algorithms are much faster to execute on a computer than those that are asymmetric. However, in practice, they are often used together, so that a public key algorithm is used to encrypt a randomly generated encryption key, and the random key is used to encrypt the actual message using a symmetric algorithm. We discuss encryption in the context of the Web in Section 19.5. 19.2.7 RAID (Redundant Array of Independent Disks) The hardware that the DBMS is running on must be fault-tolerant, meaning that the DBMS should continue to operate even if one of the hardware components fails. 
This suggests having redundant components that can be seamlessly integrated into the working system whenever one or more components fail. The main hardware components that should be fault-tolerant include disk drives, disk controllers, CPU, power supplies, and cooling fans. Disk drives are the most vulnerable components, with the shortest times between failure of any of the hardware components.

One solution is the use of Redundant Array of Independent Disks (RAID) technology. RAID originally stood for Redundant Array of Inexpensive Disks, but more recently the 'I' in RAID has come to stand for Independent. RAID works on having a large disk array comprising an arrangement of several independent disks that are organized to improve reliability and at the same time increase performance.

Performance is increased through data striping: the data is segmented into equal-size partitions (the striping unit) which are transparently distributed across multiple disks. This gives the appearance of a single large, fast disk where in actual fact the data is distributed across several smaller disks. Striping improves overall I/O performance by allowing multiple I/Os to be serviced in parallel. At the same time, data striping also balances the load among disks.

Reliability is improved through storing redundant information across the disks using a parity scheme or an error-correcting scheme, such as Reed-Solomon codes (see, for example, Pless, 1989). In a parity scheme, each byte may have a parity bit associated with it that records whether the number of bits in the byte that are set to 1 is even or odd. If a bit in the byte becomes corrupted, the new parity of the byte will not match the stored parity. Similarly, if the stored parity bit becomes corrupted, it will not match the data in the byte. Error-correcting schemes store two or more additional bits, and can reconstruct the original data if a single bit becomes corrupt. These schemes can be used through striping bytes across disks.

There are a number of different disk configurations with RAID, termed RAID levels. A brief description of each RAID level is given below together with a diagrammatic representation for each of the main levels in Figure 19.4. In this figure the numbers represent sequential data blocks and the letters indicate segments of a data block.
– RAID 0 – Nonredundant This level maintains no redundant data and so has the best write performance since updates do not have to be replicated. Data striping is performed at the level of blocks. A diagrammatic representation of RAID 0 is shown in Figure 19.4(a).
– RAID 1 – Mirrored This level maintains (mirrors) two identical copies of the data across different disks. To maintain consistency in the presence of disk failure, writes may not be performed simultaneously. This is the most expensive storage solution. A diagrammatic representation of RAID 1 is shown in Figure 19.4(b).
– RAID 0+1 – Nonredundant and Mirrored This level combines striping and mirroring.
– RAID 2 – Memory-Style Error-Correcting Codes With this level, the striping unit is a single bit and Hamming codes are used as the redundancy scheme. A diagrammatic representation of RAID 2 is shown in Figure 19.4(c).
– RAID 3 – Bit-Interleaved Parity This level provides redundancy by storing parity information on a single disk in the array. This parity information can be used to recover the data on other disks should they fail.
This level uses less storage space than RAID 1 but the parity disk can become a bottleneck. A diagrammatic representation of RAID 3 is shown in Figure 19.4(d).
– RAID 4 – Block-Interleaved Parity With this level, the striping unit is a disk block – a parity block is maintained on a separate disk for corresponding blocks from a number of other disks. If one of the disks fails, the parity block can be used with the corresponding blocks from the other disks to restore the blocks of the failed disk. A diagrammatic representation of RAID 4 is shown in Figure 19.4(e).
– RAID 5 – Block-Interleaved Distributed Parity This level uses parity data for redundancy in a similar way to RAID 3 but stripes the parity data across all the disks, similar to the way in which the source data is striped. This alleviates the bottleneck on the parity disk. A diagrammatic representation of RAID 5 is shown in Figure 19.4(f).
– RAID 6 – P+Q Redundancy This level is similar to RAID 5 but additional redundant data is maintained to protect against multiple disk failures. Error-correcting codes are used instead of using parity.

Figure 19.4 RAID levels. The numbers represent sequential data blocks and the letters indicate segments of a data block.

Oracle, for example, recommends use of RAID 1 for the redo log files. For the database files, Oracle recommends RAID 5, provided the write overhead is acceptable; otherwise Oracle recommends either RAID 1 or RAID 0+1. A fuller discussion of RAID is outwith the scope of this book and the interested reader is referred to the papers by Chen and Patterson (1990) and Chen et al. (1994).

19.3 Security in Microsoft Office Access DBMS

In Section 8.1 we provided an overview of Microsoft Office Access 2003 DBMS. In this section we focus on the security measures provided by Office Access. In Section 6.6 we described the SQL GRANT and REVOKE statements; Microsoft Office Access 2003 does not support these statements but instead provides the following two methods for securing a database:
– setting a password for opening a database (referred to as system security by Microsoft Office Access);
– user-level security, which can be used to limit the parts of the database that a user can read or update (referred to as data security by Microsoft Office Access).

In this section we briefly discuss how Microsoft Office Access provides these two types of security mechanism.

Setting a Password

The simpler security method is to set a password for opening the database. Once a password has been set (from the Tools, Security menu), a dialog box requesting the password will be displayed whenever the database is opened. Only users who type the correct password will be allowed to open the database. This method is secure as Microsoft Office Access encrypts the password so that it cannot be accessed by reading the database file directly. However, once a database is open, all the objects contained within the database are available to the user. Figure 19.5(a) shows the dialog box to set the password and Figure 19.5(b) shows the dialog box requesting the password whenever the database is opened.

User-Level Security

User-level security in Microsoft Office Access is similar to methods used in most network systems. Users are required to identify themselves and type a password when they start Microsoft Office Access. Within the Microsoft Office Access workgroup information file, users are identified as members of a group.
Access provides two default groups: administrators (Admins group) and users (Users group), but additional groups can be defined. Figure 19.6 displays the dialog box used to define the security level for user and group accounts. It shows a non-default group called Assistants, and a user called Assistant who is a member of the Users and Assistants groups.

Figure 19.5 Securing the DreamHome database using a password: (a) the Set Database Password dialog box; (b) the Password Required dialog box shown at startup.

Figure 19.6 The User and Group Accounts dialog box for the DreamHome database.

Permissions are granted to groups and users to regulate how they are allowed to work with each object in the database using the User and Group Permissions dialog box. Table 19.2 shows the permissions that can be set in Microsoft Office Access. For example, Figure 19.7 shows the dialog box for a user called Assistant who has only read access to a stored query called Staff1_View. In a similar way, all access to the base table Staff would be removed so that the Assistant user could only view the data in the Staff table using this view.

Table 19.2 Microsoft Office Access permissions.
– Open/Run: open a database, form, report, or run a macro.
– Open Exclusive: open a database with exclusive access.
– Read Design: view objects in Design view.
– Modify Design: view and change database objects, and delete them.
– Administer: for databases, set database password, replicate database, and change startup properties; full access to database objects including ability to assign permissions.
– Read Data: view data.
– Update Data: view and modify data (but not insert or delete data).
– Insert Data: view and insert data (but not update or delete data).
– Delete Data: view and delete data (but not insert or update data).

Figure 19.7 User and Group Permissions dialog box showing the Assistant user has only read access to the Staff1_View query.

19.4 Security in Oracle DBMS

In Section 8.2 we provided an overview of Oracle9i DBMS. In this section, we focus on the security measures provided by Oracle. In the previous section we examined two types of security in Microsoft Office Access: system security and data security. In this section we examine how Oracle provides these two types of security. As with Office Access, one form of system security used by Oracle is the standard user name and password mechanism, whereby a user has to provide a valid user name and password before access can be gained to the database, although the responsibility to authenticate users can be devolved to the operating system. Figure 19.8 illustrates the creation of a new user called Beech with password authentication set. Whenever user Beech tries to connect to the database, this user will be presented with a Connect or Log On dialog box similar to the one illustrated in Figure 19.9, prompting for a user name and password to access the specified database.

Figure 19.8 Creation of a new user called Beech with password authentication set.

Figure 19.9 Log On dialog box requesting user name, password, and the name of the database the user wishes to connect to.

Privileges

As we discussed in Section 19.2.2, a privilege is a right to execute a particular type of SQL statement or to access another user's objects.
Some examples of Oracle privileges include the right to:
– connect to the database (create a session);
– create a table;
– select rows from another user's table.

In Oracle, there are two distinct categories of privileges:
– system privileges;
– object privileges.

System privileges

A system privilege is the right to perform a particular action or to perform an action on any schema objects of a particular type. For example, the privileges to create tablespaces and to create users in a database are system privileges. There are over eighty distinct system privileges in Oracle. System privileges are granted to, or revoked from, users and roles (discussed below) using either of the following:
– the Grant System Privileges/Roles dialog box and Revoke System Privileges/Roles dialog box of the Oracle Security Manager;
– the SQL GRANT and REVOKE statements (see Section 6.6).

However, only users who are granted a specific system privilege with the ADMIN OPTION or users with the GRANT ANY PRIVILEGE system privilege can grant or revoke system privileges.

Object privileges

An object privilege is a privilege or right to perform a particular action on a specific table, view, sequence, procedure, function, or package. Different object privileges are available for different types of object. For example, the privilege to delete rows from the Staff table is an object privilege. Some schema objects (such as clusters, indexes, and triggers) do not have associated object privileges; their use is controlled with system privileges. For example, to alter a cluster, a user must own the cluster or have the ALTER ANY CLUSTER system privilege.

A user automatically has all object privileges for schema objects contained in his or her schema. A user can grant any object privilege on any schema object he or she owns to any other user or role. If the grant includes the WITH GRANT OPTION (of the GRANT statement), the grantee can further grant the object privilege to other users; otherwise, the grantee can use the privilege but cannot grant it to other users. The object privileges for tables and views are shown in Table 19.3.

Table 19.3 What each object privilege allows a grantee to do with tables and views.
– ALTER. Table: change the table definition with the ALTER TABLE statement. View: N/A.
– DELETE. Table: remove rows from the table with the DELETE statement (SELECT privilege on the table must be granted along with the DELETE privilege). View: remove rows from the view with the DELETE statement.
– INDEX. Table: create an index on the table with the CREATE INDEX statement. View: N/A.
– INSERT. Table: add new rows to the table with the INSERT statement. View: add new rows to the view with the INSERT statement.
– REFERENCES. Table: create a constraint that refers to the table (cannot grant this privilege to a role). View: N/A.
– SELECT. Table: query the table with the SELECT statement. View: query the view with the SELECT statement.
– UPDATE. Table: change data in the table with the UPDATE statement (SELECT privilege on the table must be granted along with the UPDATE privilege). View: change data in the view with the UPDATE statement.

Roles

A user can receive a privilege in two different ways:

(1) Privileges can be granted to users explicitly. For example, a user can explicitly grant the privilege to insert rows into the PropertyForRent table to the user Beech:

GRANT INSERT ON PropertyForRent TO Beech;

(2) Privileges can also be granted to a role (a named group of privileges), and then the role granted to one or more users.
For example, a user can grant the privileges to select, insert, and update rows in the PropertyForRent table to the role named Assistant, which in turn can be granted to the user Beech. A user can have access to several roles, and several users can be assigned the same roles. Figure 19.10 illustrates the granting of these privileges to the role Assistant using the Oracle Security Manager. Because roles allow for easier and better management of privileges, privileges should normally be granted to roles rather than to specific users.

Figure 19.10 Setting the Insert, Select, and Update privileges on the PropertyForRent table to the role Assistant.

19.5 DBMSs and Web Security

In Chapter 29 we provide a general overview of DBMSs on the Web. In this section we focus on how to make a DBMS secure on the Web. Readers unfamiliar with the terms and technologies associated with DBMSs on the Web are advised to read Chapter 29 before reading this section.

Internet communication relies on TCP/IP as the underlying protocol. However, TCP/IP and HTTP were not designed with security in mind. Without special software, all Internet traffic travels 'in the clear' and anyone who monitors the traffic can read it. This form of attack is relatively easy to perpetrate using freely available 'packet sniffing' software, since the Internet has traditionally been an open network. Consider, for example, the implications of credit card numbers being intercepted by unethical parties during transmission when customers use their cards to purchase products over the Internet. The challenge is to transmit and receive information over the Internet while ensuring that:

• it is inaccessible to anyone but the sender and receiver (privacy);
• it has not been changed during transmission (integrity);
• the receiver can be sure it came from the sender (authenticity);
• the sender can be sure the receiver is genuine (non-fabrication);
• the sender cannot deny he or she sent it (non-repudiation).

However, protecting the transaction only solves part of the problem. Once the information has reached the Web server, it must also be protected there. With the three-tier architecture that is popular in a Web environment, we also have the complexity of ensuring secure access to, and security of, the database. Today, most parts of such an architecture can be secured, but doing so generally requires different products and mechanisms.

One other aspect of security that has to be addressed in the Web environment is that information transmitted to the client's machine may have executable content. For example, HTML pages may contain ActiveX controls, JavaScript/VBScript, and/or one or more Java applets. Executable content can perform the following malicious actions, and measures need to be taken to prevent them:

• corrupt data or the execution state of programs;
• reformat complete disks;
• perform a total system shutdown;
• collect and download confidential data, such as files or passwords, to another site;
• usurp identity and impersonate the user or the user's computer to attack other targets on the network;
• lock up resources, making them unavailable for legitimate users and programs;
• cause non-fatal but unwelcome effects, especially on output devices.

In earlier sections we identified general security mechanisms for database systems. However, the increasing accessibility of databases on the public Internet and private intranets requires a re-analysis and extension of these approaches.
In this section we address some of the issues associated with database security in these environments.

19.5.1 Proxy Servers

In a Web environment, a proxy server is a computer that sits between a Web browser and a Web server. It intercepts all requests to the Web server to determine whether it can fulfill them itself. If not, it forwards the requests to the Web server. Proxy servers have two main purposes: to improve performance and to filter requests.

Improve performance Since a proxy server saves the results of all requests for a certain amount of time, it can significantly improve performance for groups of users. For example, assume that user A and user B access the Web through a proxy server. First, user A requests a certain Web page and, slightly later, user B requests the same page. Instead of forwarding the request to the Web server where that page resides, the proxy server simply returns the cached page that it had already fetched for user A. Since the proxy server is often on the same network as the user, this is a much faster operation. Real proxy servers, such as those employed by CompuServe and America Online, can support thousands of users.

Filter requests Proxy servers can also be used to filter requests. For example, an organization might use a proxy server to prevent its employees from accessing a specific set of Web sites.

19.5.2 Firewalls

The standard security advice is to ensure that Web servers are unconnected to any in-house networks and are regularly backed up to recover from inevitable attacks. When the Web server has to be connected to an internal network, for example to access the company database, firewall technology can help to prevent unauthorized access, provided it has been installed and maintained correctly.

A firewall is a system designed to prevent unauthorized access to or from a private network. Firewalls can be implemented in hardware, software, or a combination of both. They are frequently used to prevent unauthorized Internet users from accessing private networks connected to the Internet, especially intranets. All messages entering or leaving the intranet pass through the firewall, which examines each message and blocks those that do not meet the specified security criteria. There are several types of firewall technique:

• Packet filter, which looks at each packet entering or leaving the network and accepts or rejects it based on user-defined rules. Packet filtering is a fairly effective mechanism and transparent to users, but it can be difficult to configure. In addition, it is susceptible to IP spoofing. (IP spoofing is a technique used to gain unauthorized access to computers, whereby the intruder sends messages to a computer with an IP address indicating that the message is coming from a trusted host.)

• Application gateway, which applies security mechanisms to specific applications, such as FTP and Telnet servers. This is a very effective mechanism, but it can degrade performance.

• Circuit-level gateway, which applies security mechanisms when a TCP or UDP (User Datagram Protocol) connection is established. Once the connection has been made, packets can flow between the hosts without further checking.

• Proxy server, which intercepts all messages entering and leaving the network. The proxy server in effect hides the true network addresses.

In practice, many firewalls provide more than one of these techniques.
A firewall is considered a first line of defense in protecting private information. For greater security, data can be encrypted, as discussed below and earlier in Section 19.2.6.

19.5.3 Message Digest Algorithms and Digital Signatures

A message digest algorithm, or one-way hash function, takes an arbitrarily sized string (the message) and generates a fixed-length string (the digest or hash). A digest has the following characteristics:

• it should be computationally infeasible to find another message that will generate the same digest;
• the digest does not reveal anything about the message.

A digital signature consists of two pieces of information: a string of bits that is computed from the data that is being 'signed', along with the private key of the individual or organization wishing the signature