Data Science: An Introduction/Thinking Like a Data Engineer

From Wikibooks, open books for an open world
Jump to navigation Jump to search


Chapter 06: Thinking Like a Data Engineer


Note to Contributors (remove this section when the chapter is complete)[edit | edit source]

First, please register yourself with Wikibooks (and list yourself below), so that we know who our co-contributors are. Also, please abide by the Wikibooks Editing Guidelines, Manual of Style, and Policies and Guidelines. Thank you.

Secondly, we only need basic, clear, straightforward information in each chapter. We are not trying to be exhaustive or complete—the value of this book is in the simple synthesis across subjects. There are other venues in which to wax eloquent on the deepness and complexities of a particular subject. Please place yourself in a "beginner's mind" as you make contributions. Please also scope each chapter so that it can be taught in a one-hour class period. If the chapter requires more than an hour to teach, it is probably too detailed.

  • To the extent possible, please use terms and concepts in the way in which they are defined in the Wikipedia and Wiktionary. This way students can refer to the corresponding Wikipedia / Wiktionary page to get a deeper understanding of the concept.

Thirdly, this is a cross-disciplinary book. We want to help people apply data science to all fields. Therefore, we need a wide variety of simple examples and simple exercises.

Fourthly, please adhere to the simple structure of each chapter: Summary of Main Points, Discussion, More Reading, Exercises, and References. We want the More Reading section to link to on-line resources. The References section may contain off-line resources. To start a new page, you should use the wiki markup from this prototype page.

Fifthly, as with any Wikibook please feel free to make corrections, expand explanations, and make additions where necessary, even if it is not "your" chapter. Use the discussion page to explain changes that might be controversial.

Sixthly, some syntax rules:

  • Please bold key terms and phrases the student should learn.
  • Put the name of functions and code snippets using the 'code' tags: <code>lm()</code>
  • Use in-line links [[ ]] to the Wikipedia, Wiktionary, WikiCommons, Wikibooks, and other Wikimedia Foundation properties.
  • Use references (<ref> </ref>) to "external" sources—both on-line and off-line.
  • If you want to add an image or graph, you should load it into the Commons rather than uploading into Wikibooks.
    • If appropriate, add the tag {{Created with R}}) when you upload the graph.
  • If using a different package than R standard packages, put the name of the package in bold in parenthesis after each function : <code>MCMCprobit()</code> ('''MCMCpack''')
  • You can use the third chapter Definitions of Data as an example of how to craft a chapter.

Finally, thank you so much for volunteering to be part of our our team!

Chapter Summary[edit | edit source]

When a data scientist thinks like a data engineer, they think in terms of tables. The tasks are to define the rows, columns, and cells of the tables; to associated tables with one another; and create systems to ingest, store, and retrieve tables.


(in the following discussion we need more on thinking in tables (row and columns) and how several tables are related to each other (schemas). Maybe throw in Normal Forms and indexing, just for fun. talk about different data management schemes, flat CSV files, RDBMS, no-SQL, etc.)

Discussion[edit | edit source]

Data engineering is the data part of data science. According to Wikipedia,data engineering involves acquiring, ingesting, transforming, storing, and retrieving data. Data engineering is closely related to Data Collection, Information Engineering, Knowledge Engineering, Information Management and Knowledge Management.

Data engineering starts with an understanding of the general nature of the problems to be solved. A data acquisition and management plan must be formulated which specifies where the data are coming from (RSS feeds, sensor network, pre-existing data repository), the format of the incoming data (text, numbers, images, video), and how the data will be stored and retrieved (file system, database management system). Raw data is "dirty." There will be records in the raw data that do not conform to data definitions that have been agreed upon. For example, in one hospital data set, several young boys aged 7 to 11 gave birth to babies.[1] Clearly, there are mistakes in this data. Part of the data acquisition and management plan is deciding what to do with dirty data (leave it, erase it, infer corrections).

Most of the time, raw data is not in the format that the analytical tools are expecting to see. Indeed, each tool will want to see data in its own particular way. Therefore, one task of data engineering is to transform the data so that it can be consumed by the analytical tools the data science team will use. For example, a team might receive egg laying data with each observation in its own row like the following:

Chicken Day Eggs
A 1 3
A 2 4
A 3 2
B 1 1
B 2 0
B 3 2

But what the analysis the team wants to do requires all of the observations about each chicken to be in one row only as follows:

Chicken Day1 Day2 Day3
A 3 4 2
B 1 0 2

Good data engineering requires both the ability to manipulate data and an understanding of the analytic purposes to which the data are going to be used.

In the egg laying example above the first table is in a Normalised form that lends enables further analysis, the second table is formatted to present data to the user. Often the formatting makes implicit assumptions about the questions being asked of the data – such as “what are the trends in egg laying by chicken over time?”. Other questions such as “on how many occasions did a chicken lay no eggs?” are easier to answer with the data in the normalised form.

Often the sources for an analysis are outputs from another system – so for example an egg-laying database may well internally store data in the 3 column format but export a report in the “many columns” format. One of the tasks of a data engineer is to transform captured data, which may well involve re-normalising data from output reports.

Wikipedia defines database normalization as the process of organizing the fields and tables of a relational database to minimize redundancy and dependency – usually by dividing larger tables into smaller (and less redundant) tables and defining relationships between them. The main objectives of normalisation are to:

  • avoid update and deletion anomalies
  • minimize redesign when extending the database structure
  • support general-purpose querying, including future queries that are not anticipated at design time

Suppose the egg-laying data is extended to store the age and colour of each chicken. This could be represented in a table like this:

Chicken Age Colour Day Eggs
A 2 Brown 1 3
A 2 Brown 2 4
A 2 Brown 3 2
B 1 White 1 1
B 1 White 2 0
B 1 White 3 2

This table now contains redundant information, since we are storing the age and colour of each chicken 3 times. This becomes inefficient if we are storing data for hundreds of days for each chicken. Furthermore, if Chicken B turns 2 years old we would have to synchronise the changes to records 4, 5 and 6 in order to update the age data. The normalised solution would be to have a separate "chicken" table for chicken-related facts that is linked to the "egg laying" table by a unique identifier or key.

Wikipedia defines a primary key as a unique identifier for a record in a table in a relational database relational_database. Some data sets have naturally unique keys (such as employee_id for an employee table) in other cases a unique key will need to be system generated either as an internal 'one-up' counter or by combining several attributes to create one (such as Chicken_Day in the example above). Other tables can cross-reference to a table by using its primary key. For example, a 'project' table could have a column with employee_id for each team member associated with the project. This 'cross referencing' column is known as a foreign key.

Entity relationship diagrams (also known as logical data models) are used to design relational databases and can be a good way of understanding the structures in a data set. The 3 building blocks of an Entity Relationship model are entities, attributes and relationships. An entity is a discrete and recognisable 'thing', either a physical object such as a car (or a chicken), or concept such as a bank transaction or a phone call. Each entity can be physically represented as a table, where each column of the table is an attribute of the entity (such as employee_id, forename, surname, date of joining). A relationship is a verb that links two or more entities. For example, a chicken 'lays' eggs or an employee 'belongs to' a department. Importantly, relationships also have a cardinality that can be 'one to one', 'many to one', 'one to many' or 'many to many'. For example, a chicken can lay many eggs but each egg is laid by only one chicken so the 'lays' relationship is one to many. Many to many relationships are often a sign that a design needs to be further elaborated. For example, the 'teaches' relationship between teachers and students at a university would be many to many and would require the introduction of entities like class and date to fully understand the relationship. An example entity relationship diagram is shown below:

More advanced data engineering also requires knowledge of computer programming and the Structured Query Language, as well as relational and no-SQL database management systems. For the purposes of this book, we will use the R programming language for simple data engineering tasks.

Assignment/Exercise[edit | edit source]

This assignment is about reading data sets into R data frames. Assemble into groups of 3 or 4 students. Every single student must do every part of this exercise. The purpose of grouping is to help each other understand what is going on. Some of these assignments require some trial and error. Different students will do different trials and errors, thus all will learn from each other's trials and errors.

Part 1 of 3: Within R create 4 variables, each with 12 of observations.

#Create data frame
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#30 August 2012

#Remove Objects in workspace
rm(list=ls())

#Create four variables with 12 oberservations each
#Weather data for Sterling, VA from http://www.weather.com/weather/wxclimatology/monthly/USVA0735
#Retrieved 30 August 2012
#Average Temperature (Farenheit) 
#Average Precipitation (inches)
Num <- 1:12
Month <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
AveTemp <-c(32, 35, 43, 53, 62, 72, 76, 75, 67, 55, 46, 36)
AvePrcp <-c(2.85, 2.86, 3.60, 3.62, 4.72, 3.92, 3.70, 3.49, 4.00, 3.59, 3.58, 3.09)

#List out the objects that exist in the R environment
ls()

#Verify each variable
Num
Month
AveTemp
AvePrcp

#Link these four variables together into a dataset where each of the 12 observations correspond to each other
#Give the dataset a name (Wthr) using the dataframe command

Wthr <- data.frame(Num, Month, AveTemp, AvePrcp)

#List out the objects that exist in the R environment
ls()

#Notice that the 4 variables are still part of the R environment in addition to the dataframe
#The variables are now also part of the data frame 
#Verify the contents of the dataset
Wthr

#Verify the formats within the data frame using the "structure" (str) command
str(Wthr)

#Notice that as part of the data frame the variables have a dollar sign ($) as a prefix
#Compare the Month variable inside and outside the data frame
str(Month)
str(Wthr$Month)

#Whoops! What happened? When we inserted the character variable Month into the data frame, it was converted to a factor variable.
#We call the values of a Factor variable "levels"
#Factor variables are nominal variables, which means the default is that order does not matter, which is called an "unordered" factor. 
#Therefore R does two things as a default:
#  1) R prints out the levels in alphabetical order
#  2) R associates an random integer to each level, in this case 5, 4, 8, 1, 9, etc.
#For this particular problem the order of the months does matter.
#We can force an order on a factor by using the factor() function
#This is called an "ordered" factor
levels(Wthr$Month)
Wthr$Month <- factor(Wthr$Month, levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

#Note we could have also specified, levels=Month, can you explain why?
#Verify that the factor levels are now ordered properly, with the assigned integers in order
levels(Wthr$Month)
str(Wthr$Month)
Wthr

#We can now remove the redundant variables from the R workspace
rm("AvePrcp", "AveTemp", "Month", "Num")
ls()

#The dataframe is the only object left
#Now let's do some plots
plot(x=Wthr$Month, y=Wthr$AveTemp)
lines(Wthr$Month,fitted(loess(Wthr$AveTemp~Wthr$Num)))
plot(x=Wthr$Month, y=Wthr$AvePrcp)
plot(x=Wthr$AveTemp, y=Wthr$AvePrcp, pch=16, cex=1.5)
abline(lm(Wthr$AvePrcp~Wthr$AveTemp))


Part 2 of 3. Load an example data set into a data frame.


#Put Example Data into Data Frame 
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#31 August 2012

#Remove Objects in workspace
rm(list=ls())

#Find out the available datasets
data()

#Pick a dataset and get the help file
?state

#Load the dataset into the R workspace
data(state)

#Find out what got loaded
ls()

#Examine the objects that were loaded
str(state.abb)
str(state.area)
str(state.x77)

#Notice that the last object was not a simple variable with a single set of observations, but
#it is a matrix that is 50 rows long and 8 columns wide
#Inspect a summary of these data
summary(state.abb)
summary(state.x77)

#Print out the contents of these objects
state.abb
state.x77

#Now let's put these objects into a data frame called "state" and inspect it
state <- data.frame(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
ls()
str(state)

#Remove the old objects, now that we have put the data set into a data frame
rm(state.abb, state.area, state.center, state.division, state.name, state.region, state.x77)
ls()

#Print out the data frame
state

#Examine the relationships among the variables using table() and plot(), then
#Try about 10 different variations on both the table() and the plot() functions
table(state$state.region,state$state.division)
plot(state$Illiteracy,state$Murder)


Part 3 of 3 - Import an external data set.

  1. Find Fisher's Iris Data Set in the Wikipedia.
  2. Copy the data table and paste it into Microsoft Excel, Apple Numbers, or Google Docs Spreadsheet
  3. Save the dataset in Comma Separated Value (CSV) format on your desktop, with a filename of "iris.csv"
  4. Read the dataset into R
  5. Inspect the data, make sure it is all there, then look at the data using the summary(), table(), and plot() functions
#Read External Data into Data Frame 
#
#This work is licensed under a
#Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
#D. Calvin Andrus, Ph.D.
#30 August 2012

#Remove Objects in workspace
rm(list=ls())

#Findout what our default working directory is
getwd()

#Set your working directory to the "desktop" and verify
#You will need to use your own directory structure
setwd("/Users/Calvin/Desktop/")
getwd()

#Read the iris.csv into a dataframe -- and verify
#  The first line of the file should be the variable names, hence header=TRUE
#  Tell R that the separator is a comma
#  If there are other line on top of the variable names, then you will need to skip them
iris <- read.table("iris.csv", header=TRUE, sep=",", skip=0)
str(iris)
iris

#You should have gotten 150 observations on 5 variables
#Explore the data using summary(), table(), and plot()
summary(iris)
table(iris$Species)
plot(iris$Sepal.length,iris$Sepal.width)

#Create a character variable to match a color to the factor variable Species
#Note how the R code implements the follow English statement
#  If the variable "iris$species" has the value "I.setosa" then set the "iris$plotcolor" variable to the value "blue"
iris$plotcolor <- as.character("black")
iris$plotcolor [iris$Species == "I. setosa"] <- "blue"
iris$plotcolor [iris$Species == "I. versicolor"] <- "green"
iris$plotcolor [iris$Species == "I. virginica"] <- "red"

plot(
   main="Plot of Sepal Size for Three Iris Species",
   x=iris$Sepal.width, xlim=c(1,5), xlab="Sepal Width",
   y=iris$Sepal.length, ylim=c(3,8), ylab="Sepal Length",
   pch=16,
   col=iris$plotcolor
  )
legend(1.5, 3.5,"Setosa=Blue, Versicolor=Green, Virginica=Red")

#Now, plot the Petal Length and Width
#Compare Sepal Width with Petal Width
#Compare Sepal Length with Petal Length

More Reading[edit | edit source]

References[edit | edit source]

  1. "Heritage Provider Network Heath Prize". Heritage Provider Network, Inc. Retrieved 13 July 2012.

Copyright Notice[edit | edit source]

You are free:

  • to Share — to copy, distribute, display, and perform the work (pages from this wiki)
  • to Remix — to adapt or make derivative works

Under the following conditions:

  • Attribution — You must attribute this work to Wikibooks. You may not suggest that Wikibooks, in any way, endorses you or your use of this work.
  • Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
  • Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
  • Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
  • Other Rights — In no way are any of the following rights affected by the license:
  • Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
  • The author's moral rights;
  • Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
  • Notice — For any reuse or distribution, you must make clear to others the license terms of this work.The best way to do this is with a link to the following web page.
http://creativecommons.org/licenses/by-nc-sa/3.0/