Business Analysis Guidebook/Data Modeling and Data Documentation

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Business data is a key asset for any organization. The information contained within data plays a large role in the operations, reporting and measurement of how the organization is performing. An information system provides the means to capture and share data, and to monitor business performance. A sound data structure is universally required to ensure data integrity and to minimize future application maintenance costs. Creating the ‘right’ data store for an information application involves data governance, understanding data flows, and developing a reliable foundation to ensure ongoing reliability, stability, and maximal effectiveness for the investment.

Business Data[edit | edit source]

Data Governance and Security[edit | edit source]

The investments in information systems necessary to build new applications, modify existing applications, or to buy and migrate to COTS (Commercial Off-the-Shelf) solutions can be very high for any organization. In order to ensure that those investments result in maximum returns, controls may be implemented in the form of data governance and security. These controls may exist within an organizational subdivision, across subdivisions, or across an entire organization. Some controls are even global in nature, determined by international agreements and guided by generally accepted standards.

Data governance is concerned with who is responsible for the organizational data and how governance decisions and processes are executed. One or more different approaches may be used in the governance of business data. These include:

  • Top-down – decisions are authoritative and affect the entire organization
  • Bottom-up – decisions relating to data are established at low level (e.g., naming conventions)
  • Center-out – typically involves a consultant hired to consider governance needs from all stakeholders and make organization-wide recommendation (results in Top-down for approved decisions)
  • Silo-in - multiple groups set governance standards based on collective agreement [1].

Goals of data governance include ensuring the quality and security of business data, conformance of the data to organizational standards, overarching data rules and definitions, and the assurance that the data will be trustable.

Business Ownership[edit | edit source]

When creating a new information system, the application is generally sponsored by one or more business program areas or departments. Organizational data governance requires the identification of those who will ‘own’ the data. These owners will be responsible for oversight, coordination with the organization’s Governance Board, conformance with adopted standards and law, and securing the data access rights. An ‘Owner’ may be responsible for data within the domain of the application, or the responsibility may be spread out between multiple owners across a group of integrated applications[2]. Regardless of how the owner (or owners) of the data is determined within an organization, the approvals for project data requirements will come from within this ownership role.

Eliminate Redundancy[edit | edit source]

Data redundancy is caused when the same information is captured multiple times within the same or multiple data store(s). Most information systems today are supported by data stores built on the relational schema originally designed by Edgar Codd[3]. This type of schema minimizes redundancy and ensures data integrity. Issues relating to data redundancy revolve around conflicts between these multiple instances of the data and lead to higher operational costs associated with the data. These costs include those associated with data entry, retrieval and maintenance[4]. They also present a security threat as they introduce anomalies in in the data - preventing assurance that end users are viewing a single version of the truth. Proper data definition ensures that any redundancy is identified and can be corrected or mitigated before the system is actually designed, lowering the associated development and ongoing maintenance costs .

Data Documentation[edit | edit source]

Documentation of application data is critical to ensure a common understanding of the information contained within the data and to assure the reliability of that information. Business Analysts are often responsible for creating data documentation for information system projects and the alignment of this documentation with governance processes and rules. Documentation artifacts will span the entire SDLC from Initiation through Implementation.

Data Flow Diagrams[edit | edit source]

NYS Worker's Comp Claiming

Tracing the flow of data through an application includes identifying the sources, transformations and destinations of the data that will be supported. The destinations include both the data stores and the downstream processes that use the data. Data sources may be shared within and across the organization or may involve external entities. They include data entry, existing databases, computer files, and data streams from external applications. Once the sources are determined, the processes that must occur to transform or manipulate the data are defined. These processes include the conversion of data from the source format to the destination format. The destinations include databases, files, reports and external applications that receive the output from the transformation processes. A Data Flow Diagram captures this information for verification and design purposes.

There are many tutorials available on the Internet to refer to for the creation of a data flow diagram. Generally, consistency in rendering the data input(s), processing and data output(s) will accomplish the goals of defining data flows in a manner that is easily understood by those who will review and verify the diagram contents. An example of a Data Flow diagram using the Yourdon/DeMarco notation[5] is included in the image at right.

Data Definitions[edit | edit source]

Vehicle Owner Data Tables

The most common format used to capture the definition of data elements is the Data Dictionary. This artifact is critical for purposes of data governance and for ensuring that all stakeholders share a common understanding of what the data is. Typical information that is contained in the data dictionary includes identification of where the data resides (information system, relational table), the exact database field or data element, what the data element represents and the format of the element. The image at right provides an example of what a data dictionary may look like.

Common Data Matrix[edit | edit source]

Common Data Matrix - D = Define; P = Provide; U=Use

A Common Data Matrix is used during the Requirements Gathering phase of a project when there is at least some level of data integration required involving multiple applications, and /or some degree of MDM, and /or some form of organizational data governance. It is a form of a data dictionary that encompasses the scope of the governance domain. To construct this type of matrix, data elements are listed for the scope domain, with information about that data in relation to the organizational units, or applications, that interact with that data[6]. This type of diagram should be constructed to include information applicable to the problem under consideration (see image at right).

Consider an information system that uses external data elements from other organizational (internal) applications. The definitions, meanings and formats of these elements become critical for integration. Within a high governance maturity level organization, a customer first name (for example) would have the same data element definition and format across the organization. In reality, most organizations have not fully conformed all data definitions across the entire organization; data elements for integration projects should be summarized using a Common Data Matrix when determining project requirements for the integration effort. In such an instance, the information included within the matrix (the matrix cells) might contain data element formatting information to identify any transformation(s) that may be needed for integrating the data with the project information.

Modeling Data Requirements[edit | edit source]

For purposes of establishing a new application and for application enhancements, modeling the data at a high level provides guidance in the design efforts. Typically, Business Analysts are not involved in the actual development of an application, but the requirements associated with a project will determine what data is needed to support the project deliverables and vice versa. In the same way that high-level Business Requirements are refined/defined in Functional Requirements and then detailed into design specifications, the Conceptual Data model can be used as the starting point when defining Logical and then Physical Data Models.

Conceptual Model[edit | edit source]

Basic Conceptual Data Model example

According to the Merriam-Webster dictionary, a Concept describes an idea of what something is, or how it works [7]. A conceptual data model illustrates the information that an organization wants or needs to capture [8]. It is a high-level, business-oriented model, capturing operational data needs: what does the organization need to capture data about? The answer to this question identifies data ‘entities’ that need to be explored to define requirements. The conceptual data model represents the information that the business needs to support the business processes existing within the domain of the organization or project[9].

This type of model is very high-level, guiding the design of the data store for an application. It presents the different ‘things’ or entities (nouns) that information must be captured for and indicates high-level relationships between each of these identified ‘things’. The model is independent of whatever technology is used to implement the solution and does not include details such as primary keys or entity attributes [10]. A conceptual data model is most useful during the requirements gathering phase of a project, and may be discarded once a Logical Model is defined for the problem domain [11].

Logical Entity Relationship Model[edit | edit source]

Owner-Vehicle Logical Data Model

A Logical data model includes the entities that are needed to support an application, information about each of the entities (including attributes and unique identifiers), and the relationships that are needed between the entities. This type of model may be informed by the Conceptual Data Model, but the identified Entity-Relationship structure is modified to support (at a high-level) the technical solution domain.

Applications that rely on an RDBMS platform will typically use an Entity Relationship (ER) diagram format for illustrating the Logical Data model. Each distinct entity is defined with associated attributes in such a way that each attribute is specific to the entity and only to that entity. Normalization of the data in the model ensures that when it is time to create the physical data model, the data architecture design will provide assurance that data anomalies cannot occur – i.e., data stored in the database will have integrity and can be relied on by the users.

Physical Data Model[edit | edit source]

Physical Data Model - Owner-Vehicle tables

A Physical Data Model provides detailed specifications for the data structure that will support the business operations. The Logical Data Model, generally completed with base-lined project functional requirements, is expanded and modified to conform with the data architecture that will used to store the data. This could reflect any kind of architecture – relational databases or XML data stores both require this specification. The US Department of Health and Human Services defines this type of data model as “a database-specific implementation of a Logical Data Model (LDM)” [12].

The physical data model includes definitions for tables and columns (for relational databases) or schema and element (for XML databases), domain (attribute or element) values, physical data types, constraints, unique keys and indexes[13] . Relationships with their cardinality, XML extensions, specific RDBMS architectural information, nullability, and value lengths/precision are other definitions that are generally included in the model. Physical Data models may be created using tools that generate the code required to create a database. These tools generally also provide the capability to reverse-engineer a database, generating the physical data model for an existing application [14].

References[edit | edit source]

  1. Thomas, Gwen. "Choosing Governance Models". Data Governance Institute. Retrieved 02/05/2014. 
  2. Thomas, Gwen. "Assigning Data Ownership". Data Governance Institute. Retrieved 02/05/2014. 
  3. "Edgar F. Codd". IBM Research News. 2003. Retrieved 02/12/2014. 
  4. Streiff, John (2001). "Estimating the Costs of Data Redundancy?". Retrieved 02/12/2014. 
  5. Yourdon, E. (2006). "Just Enough Structured Analysis, Chapter 9". Retrieved 02/05/2014. 
  6. Seiner, Robert S. (2006). "The Data Stewardship Approach to Data Management". The Data Administration Newsletter - Retrieved 02/05/2014. 
  7. "Concept". Merriam-Webster. Retrieved 10/02/2014. 
  8. Stiglich, Pete (2007). "What are the benefits of a conceptual data model". TechTarget. Retrieved 10/02/2014. 
  9. Bowman, David (2009). "Conceptual Data Model". David Bowman’s Information Management Checklist. Retrieved 10/02/2014. 
  10. Gottesdiener, Ellen (2005). The Software Requirements Memory Jogger, 1st edition. GOAL/APC. p. 183. 
  11. Wambler, Scott. "Data Modeling 101". Retrieved 10/07/2014. 
  12. "Physical Data Model template". US Department of Health and Human Services. 2009. Retrieved 10/07/2014. 
  13. "Physical Data Model template". US Department of Health and Human Services. 2009. Retrieved 10/07/2014. 
  14. Ambler, Scott (2004). "Physical Data Model (PDM)s: An Agile Introduction (The Object Primer 3rd Edition, Chapter 12)". Retrieved 10/07/2014.