Statistical Analysis: an Introduction using R/R/Logical operations

From Wikibooks, open books for an open world
When accessing elements of vectors, we saw how to use a simple logical expression involving the less than sign (<) to produce a logical vector, which could then be used to select elements less than a certain value. This type of logical operation is a very useful thing to be able to do. As well as <, there are a handful of other comparison operators. Here is the full set (see ?Comparison for more details):
  • < (less than) and <= (less than or equal to)
  • > (greater than) and >= (greater than or equal to)
  • == (equal to[1]) and != (not equal to)
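
As a minimal sketch of these operators in action, the lines below apply each one to a small made-up vector (x is a hypothetical example, not part of the states data). Each comparison is carried out element by element and returns a logical vector.

```r
# A small made-up vector, used only for illustration
x <- c(3, 7, 7, 10)

x < 7      # TRUE FALSE FALSE FALSE
x <= 7     # TRUE  TRUE  TRUE FALSE
x == 7     # FALSE TRUE  TRUE FALSE
x != 7     # TRUE FALSE FALSE  TRUE

# A logical vector can be used inside [] to select elements
x[x >= 7]  # 7 7 10
```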

Even more flexibility can be gained by combining logical vectors using and, or, and not. For example, we might want to identify which US states have an area less than 10 000 or greater than 100 000 square miles, or to identify which have an area greater than 100 000 square miles and also have a short name. The code below shows how this can be done, using the following R symbols:

  • & ("and")
  • | ("or")
  • ! ("not")
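
Before turning to the states data, a tiny sketch with two made-up logical vectors (a and b are hypothetical names) may help show what each symbol does. Together the two vectors cover all four TRUE/FALSE combinations, so the results can be read as a truth table.

```r
# Two made-up logical vectors covering every TRUE/FALSE combination
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b    # TRUE FALSE FALSE FALSE  (TRUE only where both are TRUE)
a | b    # TRUE  TRUE  TRUE FALSE  (TRUE where at least one is TRUE)
!a       # FALSE FALSE TRUE  TRUE  (each element flipped)
a & !b   # FALSE TRUE FALSE FALSE  (a is TRUE and b is not)
```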

When using logical vectors, the following functions are particularly useful, as illustrated below:

  • which() identifies which elements of a logical vector are TRUE
  • sum() can be used to give the number of elements of a logical vector which are TRUE. This is because sum() forces its input to be converted to numbers, and if TRUE and FALSE are converted to numbers, they take the values 1 and 0 respectively.
  • ifelse() returns different values depending on whether each element of a logical vector is TRUE or FALSE. Specifically, a command such as ifelse(aLogicalVector, vectorT, vectorF) takes aLogicalVector and returns, for each element that is TRUE, the corresponding element from vectorT, and for each element that is FALSE, the corresponding element from vectorF. An extra elaboration is that if vectorT or vectorF are shorter than aLogicalVector they are extended by duplication to the correct length.
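
The three functions can be tried on a vector small enough to check by hand. This is only a sketch: passed is a made-up logical vector, and the letters are arbitrary. Note how the two-element second argument and the one-element third argument are both extended to length four before elements are picked.

```r
# A made-up logical vector, used only for illustration
passed <- c(TRUE, FALSE, TRUE, TRUE)

which(passed)   # 1 3 4  (positions of the TRUE elements)
sum(passed)     # 3      (TRUE counts as 1, FALSE as 0)

# c("a", "b") is extended to "a" "b" "a" "b", and "z" to "z" "z" "z" "z",
# then each position takes its value from one or the other
ifelse(passed, c("a", "b"), "z")   # "a" "z" "a" "b"
```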
Input:
### In these examples, we'll reuse the American states data, especially the state names
### To remind yourself of them, you might want to look at the vector "state.name"

nchar(state.name)       # nchar() returns the number of characters in strings of text ...
nchar(state.name) <= 6  #so this indicates which states have names of 6 letters or fewer
ShortName <- nchar(state.name) <= 6         #store this logical vector for future use
sum(ShortName)          #With a logical vector, sum() tells us how many are TRUE (11 here)
which(ShortName)        #These are the positions of the 11 elements which have short names
state.name[ShortName]   #Use the index operator [] on the original vector to get the names
state.abb[ShortName]    #Or even on other vectors (e.g. the 2 letter state abbreviations)

isSmall <- state.area < 10000  #Store a logical vector indicating states <10000 sq. miles
isHuge  <- state.area > 100000 #And another for states >100000 square miles in area
sum(isSmall)                   #there are 8 "small" states
sum(isHuge)                    #coincidentally, there are also 8 "huge" states

state.name[isSmall | isHuge]   # | means OR. So these are states which are small OR huge
state.name[isHuge & ShortName] # & means AND. So these are huge AND with a short name
state.name[isHuge & !ShortName]# ! means NOT. So these are huge and with a longer name

### Examples of ifelse() ###

ifelse(ShortName, state.name, state.abb) #mix short names with abbreviations for long ones
# (think of this as "*if* ShortName is TRUE then use state.name *else* use state.abb")

### Many functions in R increase input vectors to the correct size by duplication ###
ifelse(ShortName, state.name, "tooBIG")   #A silly example: the 3rd argument is duplicated
size <- ifelse(isSmall, "small", "large") #A more useful example, for both 2nd & 3rd args
size                                      #might be useful as an indicator variable?             
ifelse(size=="large", ifelse(isHuge, "huge", "medium"), "small") #A more complex example
Result:
> ### In these examples, we'll reuse the American states data, especially the state names
> ### To remind yourself of them, you might want to look at the vector "state.name"
>  
> nchar(state.name)       # nchar() returns the number of characters in strings of text ...
 [1]  7  6  7  8 10  8 11  8  7  7  6  5  8  7  4  6  8  9  5  8 13  8  9 11  8  7  8  6 13
[30] 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7  8 10 13  9  7
> nchar(state.name) <= 6  #so this indicates which states have names of 6 letters or fewer
 [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[15]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[29] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[43]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> ShortName <- nchar(state.name) <= 6         #store this logical vector for future use
> sum(ShortName)          #With a logical vector, sum() tells us how many are TRUE (11 here)
[1] 11
> which(ShortName)        #These are the positions of the 11 elements which have short names
 [1]  2 11 12 15 16 19 28 35 37 43 44
> state.name[ShortName]   #Use the index operator [] on the original vector to get the names
 [1] "Alaska" "Hawaii" "Idaho"  "Iowa"   "Kansas" "Maine"  "Nevada" "Ohio"   "Oregon"
[10] "Texas"  "Utah"  
> state.abb[ShortName]    #Or even on other vectors (e.g. the 2 letter state abbreviations)
 [1] "AK" "HI" "ID" "IA" "KS" "ME" "NV" "OH" "OR" "TX" "UT"
>  
> isSmall <- state.area < 10000  #Store a logical vector indicating states <10000 sq. miles
> isHuge  <- state.area > 100000 #And another for states >100000 square miles in area
> sum(isSmall)                   #there are 8 "small" states
[1] 8
> sum(isHuge)                    #coincidentally, there are also 8 "huge" states
[1] 8
>  
> state.name[isSmall | isHuge]   # | means OR. So these are states which are small OR huge
 [1] "Alaska"        "Arizona"       "California"    "Colorado"      "Connecticut"  
 [6] "Delaware"      "Hawaii"        "Massachusetts" "Montana"       "Nevada"       
[11] "New Hampshire" "New Jersey"    "New Mexico"    "Rhode Island"  "Texas"        
[16] "Vermont"      
> state.name[isHuge & ShortName] # & means AND. So these are huge AND with a short name
[1] "Alaska" "Nevada" "Texas" 
> state.name[isHuge & !ShortName]# ! means NOT. So these are huge and with a longer name
[1] "Arizona"    "California" "Colorado"   "Montana"    "New Mexico"
>  
> ### Examples of ifelse() ###
>  
> ifelse(ShortName, state.name, state.abb) #mix short names with abbreviations for long ones
 [1] "AL"     "Alaska" "AZ"     "AR"     "CA"     "CO"     "CT"     "DE"     "FL"    
[10] "GA"     "Hawaii" "Idaho"  "IL"     "IN"     "Iowa"   "Kansas" "KY"     "LA"    
[19] "Maine"  "MD"     "MA"     "MI"     "MN"     "MS"     "MO"     "MT"     "NE"    
[28] "Nevada" "NH"     "NJ"     "NM"     "NY"     "NC"     "ND"     "Ohio"   "OK"    
[37] "Oregon" "PA"     "RI"     "SC"     "SD"     "TN"     "Texas"  "Utah"   "VT"    
[46] "VA"     "WA"     "WV"     "WI"     "WY"    
> # (think of this as "*if* ShortName is TRUE then use state.name *else* use state.abb")
>  
> ### Many functions in R increase input vectors to the correct size by duplication ###
> ifelse(ShortName, state.name, "tooBIG")   #A silly example: the 3rd argument is duplicated
 [1] "tooBIG" "Alaska" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG"
[10] "tooBIG" "Hawaii" "Idaho"  "tooBIG" "tooBIG" "Iowa"   "Kansas" "tooBIG" "tooBIG"
[19] "Maine"  "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG"
[28] "Nevada" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "Ohio"   "tooBIG"
[37] "Oregon" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG" "Texas"  "Utah"   "tooBIG"
[46] "tooBIG" "tooBIG" "tooBIG" "tooBIG" "tooBIG"
> size <- ifelse(isSmall, "small", "large") #A more useful example, for both 2nd & 3rd args
> size                                      #might be useful as an indicator variable?             
 [1] "large" "large" "large" "large" "large" "large" "small" "small" "large" "large"
[11] "small" "large" "large" "large" "large" "large" "large" "large" "large" "large"
[21] "small" "large" "large" "large" "large" "large" "large" "large" "small" "small"
[31] "large" "large" "large" "large" "large" "large" "large" "large" "small" "large"
[41] "large" "large" "large" "large" "small" "large" "large" "large" "large" "large"
> ifelse(size=="large", ifelse(isHuge, "huge", "medium"), "small") #A more complex example
 [1] "medium" "huge"   "huge"   "medium" "huge"   "huge"   "small"  "small"  "medium"
[10] "medium" "small"  "medium" "medium" "medium" "medium" "medium" "medium" "medium"
[19] "medium" "medium" "small"  "medium" "medium" "medium" "medium" "huge"   "medium"
[28] "huge"   "small"  "small"  "huge"   "medium" "medium" "medium" "medium" "medium"
[37] "medium" "medium" "small"  "medium" "medium" "medium" "huge"   "medium" "small" 
[46] "medium" "medium" "medium" "medium" "medium"
If you have done any computer programming, you may be more used to dealing with logic in the context of "if" statements. While R also has an if() statement, it is less useful when dealing with vectors. For example, the following R expression
if (aVariable == 0) print("zero") else print("not zero")
expects aVariable to be a single number: it prints "zero" if this number is 0, or "not zero" if it is any other number[2]. If aVariable is a vector of two or more values, older versions of R examined only the first element and ignored the rest; since R 4.2.0 this is an error[3]. There are also logical operators intended for single values rather than whole vectors: these are && for AND and || for OR[4]. Older versions of R let them use only the first element of a longer vector, but since R 4.3.0 that is an error too.
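
The contrast between if() and ifelse() can be sketched as follows. The vector values and the names labels, anyZero and inRange are made up for illustration, and reducing a vector with any() is just one possible way of obtaining the single TRUE or FALSE that if() needs.

```r
# if() works on a single value
aVariable <- 0
if (aVariable == 0) print("zero") else print("not zero")   # prints "zero"

# ifelse() works element by element, and passes NA through
values <- c(0, 5, NA)
labels <- ifelse(values == 0, "zero", "not zero")
labels    # "zero" "not zero" NA

# To use a vector with if(), first reduce it to one TRUE/FALSE value
anyZero <- any(values == 0, na.rm = TRUE)
if (anyZero) print("at least one zero")

# && and || combine single TRUE/FALSE values
inRange <- (aVariable >= 0) && (aVariable < 1)
inRange   # TRUE
```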


Notes

  1. Note that, when using continuous (fractional) numbers, rounding error may mean that the results of calculations are not exactly equal to each other, even if they seem as if they should be. For this reason, you should be careful when using == with continuous numbers. R provides the function all.equal() to help in this case.
  2. Unlike ifelse(), if() cannot cope with NA values: a condition that evaluates to NA causes an error.
  3. For this reason, using == in if statements may not be a good idea; see the Note in ?"==" for details.
  4. These are particularly used in more advanced computer programming in R; see ?"&&" for details.
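
The rounding problem mentioned in note 1 can be seen in a short sketch. The numbers are arbitrary, and wrapping all.equal() in isTRUE() is one common way to obtain a single TRUE or FALSE, since all.equal() returns a description of the difference rather than FALSE when the values differ.

```r
# Two quantities that look equal but differ by rounding error
0.1 + 0.2 == 0.3                    # FALSE

# all.equal() compares within a small tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
```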