Data Science: An Introduction/Single Variable Analysis
Note to Contributors (remove this section when the chapter is complete)
First, please register yourself with Wikibooks (and list yourself below), so that we know who our co-contributors are. Also, please abide by the Wikibooks Editing Guidelines, Manual of Style, and Policies and Guidelines. Thank you.
Secondly, we only need basic, clear, straightforward information in each chapter. We are not trying to be exhaustive or complete; the value of this book is in the simple synthesis across subjects. There are other venues in which to wax eloquent on the depth and complexity of a particular subject. Please place yourself in a "beginner's mind" as you make contributions. Please also scope each chapter so that it can be taught in a one-hour class period. If the chapter requires more than an hour to teach, it is probably too detailed.
 To the extent possible, please use terms and concepts in the way in which they are defined in the Wikipedia and Wiktionary. This way students can refer to the corresponding Wikipedia / Wiktionary page to get a deeper understanding of the concept.
Thirdly, this is a cross-disciplinary book. We want to help people apply data science to all fields. Therefore, we need a wide variety of simple examples and simple exercises.
Fourthly, please adhere to the simple structure of each chapter: Summary of Main Points, Discussion, More Reading, Exercises, and References. We want the More Reading section to link to online resources. The References section may contain offline resources. To start a new page, you should use the wiki markup from this prototype page.
Fifthly, as with any Wikibook, please feel free to make corrections, expand explanations, and make additions where necessary, even if it is not "your" chapter. Use the discussion page to explain changes that might be controversial.
Sixthly, some syntax rules:
 Please bold key terms and phrases the student should learn.
 Put function names and code snippets inside 'code' tags: <code>lm()</code>
 Use inline links ([[ ]]) to Wikipedia, Wiktionary, WikiCommons, Wikibooks, and other Wikimedia Foundation properties. Use references (<ref> </ref>) for "external" sources, both online and offline.
 Use the citation templates to make citations: Template:Cite book, Template:Cite web, Template:Cite journal
 If you want to add an image or graph, you should load it into the Commons rather than uploading it into Wikibooks.
 If appropriate, add the tag {{Created with R}} when you upload the graph.
 If you use a package other than the standard R packages, put the name of the package in bold in parentheses after each function: <code>MCMCprobit()</code> ('''MCMCpack''')
 You can use the third chapter Definitions of Data as an example of how to craft a chapter.
Finally, thank you so much for volunteering to be part of our team!
Chapter Summary
As discussed in chapter three, a variable is a set of values we have measured from a group of objects. For example, we can record the first name of each person in a class. Each person's recorded name is that person's value for the variable (which, in this case, we would call "FirstName"). When we put all the values of "FirstName" together in a group, we call that group of values a Distribution. In data science speak, we would say that "a variable has a distribution of values." In practice, however, many data scientists use the words distribution and variable interchangeably, as if they were synonyms.
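The idea of a variable and its distribution can be sketched in a few lines of Python. This is only an illustration: the names in the list are made up, and counting how often each value occurs is one simple way to look at a distribution, not the only one.

```python
from collections import Counter

# Hypothetical values of the variable "FirstName" for a class of six people.
first_name = ["Ana", "Ben", "Ana", "Chen", "Ben", "Ana"]

# Grouping all the values together, with how often each one occurs,
# gives a simple view of the variable's distribution.
distribution = Counter(first_name)

# List each distinct value with its count, most frequent first.
for value, count in distribution.most_common():
    print(value, count)
```

Each person contributes exactly one value, so the counts always add up to the number of people measured.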
Descriptive Statistics are calculations we perform on distributions simply to describe the variables. The two most common kinds of descriptive statistics are Measures of Central Tendency and Measures of Dispersion. Every variable, and hence every distribution, has a data type: nominal, ordinal, interval, or ratio. Each data type has its own descriptive statistics. The table below lists the names of the simple descriptive statistics for each data type.
                    Data Type
Measure             Nominal           Ordinal               Interval             Ratio
Central Tendency    Mode              Median                Arithmetic Mean      Geometric Mean
Dispersion          Variation Ratio   Interquartile Range   Standard Deviation   Coefficient of Variation
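The statistics named in the table can be computed with Python's standard `statistics` module, as in the rough sketch below. The data sets and the three helper functions (`variation_ratio`, `interquartile_range`, `coefficient_of_variation`) are hypothetical and written only for this illustration. Quartiles in particular have several competing definitions, so other software may report a slightly different interquartile range.

```python
import statistics
from collections import Counter

def variation_ratio(values):
    """Proportion of observations that fall outside the modal category."""
    counts = Counter(values)
    modal_count = max(counts.values())
    return 1 - modal_count / len(values)

def interquartile_range(values):
    """Distance between the third and first quartiles (default 'exclusive' method)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def coefficient_of_variation(values):
    """Sample standard deviation divided by the arithmetic mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical example data, one variable per data type.
eye_color = ["brown", "blue", "brown", "green", "brown"]   # nominal
satisfaction = [1, 2, 2, 3, 4, 4, 5]                        # ordinal (1 = low, 5 = high)
temperature_c = [18.0, 20.0, 21.0, 23.0, 28.0]              # interval
weight_kg = [2.0, 4.0, 8.0]                                 # ratio

# (central tendency, dispersion) for each data type, following the table.
summary = {
    "nominal": (statistics.mode(eye_color), variation_ratio(eye_color)),
    "ordinal": (statistics.median(satisfaction), interquartile_range(satisfaction)),
    "interval": (statistics.mean(temperature_c), statistics.stdev(temperature_c)),
    "ratio": (statistics.geometric_mean(weight_kg), coefficient_of_variation(weight_kg)),
}

for data_type, (center, spread) in summary.items():
    print(data_type, center, spread)
```

The sketch pairs each data type with exactly one measure of central tendency and one measure of dispersion, mirroring the columns of the table.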
Generally speaking, outside physics and chemistry, most data science projects either do not use ratio data or convert the ratio data to interval data (into what is sometimes called "lognormal" data). Thus, the Geometric Mean and the Coefficient of Variation are rarely used by data scientists. We must also be careful not to apply the descriptive statistics of one data type to another, because this often leads to a misinterpretation of the data. The exception is that we can cautiously apply the descriptive statistics of a "lower" data type to a "higher" data type. That is, we can appropriately calculate the median for interval data, but not the arithmetic mean for ordinal data.
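One way to see why the geometric mean is seldom needed after such a conversion is sketched below. It assumes, as an interpretation of the paragraph above, that the conversion is done by taking logarithms. The income figures are made up. Under that assumption, the ordinary arithmetic mean of the logged values, transformed back with the exponential function, equals the geometric mean of the original values.

```python
import math
import statistics

# Hypothetical ratio-scale measurements (all strictly positive).
income = [20_000.0, 35_000.0, 50_000.0, 400_000.0]

# Assumed conversion step: take the natural logarithm of every value.
log_income = [math.log(x) for x in income]

# Interval-style statistic on the logged values, then transformed back.
mean_of_logs = statistics.mean(log_income)
back_transformed = math.exp(mean_of_logs)

# Ratio-style statistic computed directly on the original values.
geometric = statistics.geometric_mean(income)

print(back_transformed, geometric)
```

The two printed numbers agree up to floating-point rounding, and both are smaller than the arithmetic mean of the raw incomes, which is pulled upward by the single large value.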
Discussion
Distributions
The Normal Distribution
Other Common Distributions
Nominal Variables
Central Tendency
Dispersion
Ordinal Variables
Central Tendency
Dispersion
From Ordinal to "ordered nominal"
Interval Variables
Central Tendency
Dispersion
From Interval to Ordinal
Ratio Variables
Central Tendency
Dispersion
From Ratio to Interval
Assignment/Exercise
More Reading
References
Copyright Notice
You are free:
 to Share — to copy, distribute, display, and perform the work (pages from this wiki)
 to Remix — to adapt or make derivative works
Under the following conditions:
 Attribution — You must attribute this work to Wikibooks. You may not suggest that Wikibooks, in any way, endorses you or your use of this work.
 Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
 Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
 Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
 Other Rights — In no way are any of the following rights affected by the license:

 Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
 The author's moral rights;
 Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
 Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the following web page.