A-level Mathematics/OCR/S1/Collection of data

From Wikibooks, open books for an open world

The collection of numerical data is a very important aspect of statistics - any mistakes made or bias introduced during collection will be reflected in the analysis and in the conclusions subsequently drawn from the data. If the data is not collected properly then at best the analysis will be a waste of time, and at worst it will mislead, with serious consequences.


If you collect data about TV viewing by making a rapid door-to-door survey in your area then you are NOT obtaining information about the viewing habits of your neighbourhood; you are obtaining information about the viewing habits of those people in your neighbourhood who were in, answered the door and answered your questions.

Interpretation of statistics

One of the first things that students of statistics must recognise is that the results extracted from numbers can be interpreted wrongly, or deliberately distorted by selective use of the data. To avoid errors, it is essential that the statistician:
a) Be extremely careful in selecting information to avoid biased results.
b) Make only those deductions that are strictly logical.
c) Beware of "third party" sources of correlation, where a hidden third factor drives both of the quantities being compared.

Biased sources

One of the problems facing a statistician is that the sources of information may be biased. A statistician must always ask questions such as:
a) Who says this?
b) Why does he/she say it?
c) What does he/she stand to gain from saying it?
d) How does he/she know?
e) Could he/she be lying? Or guessing?
f) Is there another explanation?

Populations

We call the group of people (or items) about which we want to obtain information the population. Defining the population is sometimes tricky, and sometimes the full extent of the population may not be known - for example, the number of people with undiagnosed AIDS is inherently unknown.

Samples

If the population is relatively small and easily surveyed we may examine every item in the population. Usually, however, the population is too big, too expensive or too inaccessible for every item to be surveyed, and so you may have to be satisfied with examining only a part, or sample, of the total population.

Sample frame

A list of the entire population from which items can be selected to form a sample is called a sample frame (also known as a sampling frame).

Costs, accuracy and samples

Although it may appear better to survey the whole population than to rely on a small sample, this is often not so. Firstly, a whole population survey may be expensive, with the costs exceeding the value of the results. Secondly, sampling only a fraction of the population can sometimes result in improved accuracy, because a small sample can be given the very careful attention that time limits or the availability of skilled surveyors would not allow for a whole population survey.

Methods of collection

Collection of statistical data is generally achieved by one of (or a combination of) the following methods:
a) Direct observation (measurement)
b) Interviewing
c) Abstraction of data from published statistics
d) Indirect questionnaire
e) Solicitation

Direct observation

This is usually the best method, as it reduces the chance of incorrect data and gives you control over the quality of the data being recorded, but it is also one of the most expensive. In some cases it is not possible - you cannot observe where people would go on holiday if they had unlimited money.

Interviewing

Interviews can be an effective technique, but only if considerable care is taken in how the questions are framed and the answers collated. The results from interviews can also be misleading because the respondent may (a) misunderstand the question, (b) have forgotten some of the information, (c) lie in an attempt to provide the "right" answer, (d) lie to hide the truth. Different standards in how interviewers record the results can also produce distortion. If the questioner asks "did you watch XYZ on TV last night - yes or no" and gets the response "part of it", is this a "yes" or a "no"?

Abstraction of data from published statistics

Data that has been collected by or directly for an investigation is primary data, and the investigator should be fully aware of the conditions and limitations of that data. Data that is extracted from information collected by another investigation is secondary data, and usually the investigator is not fully aware of its conditions and limitations. Often, however, this may be the only practical source of data (e.g. variations in coal production over the past 100 years). Because the investigator's knowledge of secondary data is inherently limited, whenever data from published sources is used, consideration must be given to the purpose of the original data collection - in particular, whether that purpose could introduce bias when the data is used for your investigation.

Indirect questionnaire

The indirect questionnaire is typified by the postal questionnaire - it arrives unsolicited, is expected to be completed and then returned by post, although the modern version may also arrive electronically. This is usually the least satisfactory method of data collection for the simple reason that only a few such questionnaires are ever returned (15% would be good), and those that are returned may show strong bias, since the return is done only by those who have a sufficiently strong interest in the subject or by those with an intent to mislead. There is an exception in most countries - and that is the statutory Census where force of law applies to its completion, although the census authorities still need to use checking interviews to verify the data.
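The effect of a low, self-selected return can be illustrated with a little arithmetic. The sketch below uses entirely invented figures: 1000 questionnaires are sent out, and recipients with strong views are assumed to be far more likely to reply than those with mild views. It also assumes that, within each group, opinion does not affect the chance of replying.

```python
# All figures are invented for illustration.
# Each group: (description, questionnaires sent, number opposed, % returned)
groups = [
    ("strong views", 100, 80, 60),
    ("mild views", 900, 300, 10),
]

sent = sum(n for _, n, _, _ in groups)
opposed = sum(k for _, _, k, _ in groups)

# Assumption: within a group, opinion does not affect the chance of returning.
returned = sum(n * pct // 100 for _, n, _, pct in groups)
opposed_returned = sum(k * pct // 100 for _, _, k, pct in groups)

response_rate = returned / sent
true_share = opposed / sent
survey_share = opposed_returned / returned

print(f"response rate: {response_rate:.0%}")
print(f"opposed in population: {true_share:.0%}")
print(f"opposed among returns: {survey_share:.0%}")
```

With these made-up numbers the overall return is 15%, yet 52% of the returned questionnaires are opposed while only 38% of the recipients actually are - the returns over-represent those with strong views.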

Solicitation

Solicitation is typified by the "suggestions box" or the "complaints box". Data collected in this way is almost worthless, because the method only produces data from those with sufficiently strong views to be bothered with the time or expense of responding. Unfortunately it is widely used by some parts of the media in its modern form, "SMS this number ...", with the "results" then given high prominence.

Design of questionnaires

Whenever information is sought from people, it is essential that the questions are carefully constructed. The questions should:
a) Be simple to understand.
b) Be unambiguous.
c) Limit the possible responses (tick the pre-printed answer).
d) Be concise and short.
e) Be relevant to the respondent.
f) Align with the goal(s), objective(s) and purpose of the study.
g) Be meaningful to the respondent.
h) Have a clear focus.
i) Not suggest the desired answer.
j) Be arranged in a logical sequence.

Methods of sampling

Bias revisited

When we take a sample from a population, it is generally not advisable to take just the easiest items - we are seeking information about the whole population, and must therefore take our data from across the whole population, without allowing any particular part of the population to have more influence than it really warrants. Unfortunately it is not just suspected or known sources of undue influence that must be avoided; we must also beware of unsuspected sources of bias.

Random samples

The possibility of taking a sample with unsuspected bias may be reduced by taking a random sample. A random sample is a sample that has been selected in such a way that every item in the population has an equal chance of being selected.

Random samples are NOT perfect samples - a random sample (especially if small) is not necessarily a good cross section of the population. A random sample of people living in the UK could result in a sample of people who all live in London. A random sample does NOT guarantee a sample free from bias; it simply guarantees that the method of selection is free from bias.
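As a minimal sketch, a random sample can be drawn from a sample frame with Python's standard `random` module. The frame below is hypothetical (1000 people, of whom 130 are labelled as living in London), and the `seed` argument is only there to make a demonstration repeatable.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Select n distinct items from the sample frame, each equally likely."""
    rng = random.Random(seed)  # seed is only for repeatable demonstrations
    return rng.sample(frame, n)

# Hypothetical sample frame: 1000 people, 130 of whom live in London.
frame = [("London" if i < 130 else "Elsewhere", i) for i in range(1000)]

sample = simple_random_sample(frame, 10, seed=1)
londoners = sum(1 for region, _ in sample if region == "London")
print(f"{londoners} of {len(sample)} sampled people live in London")
```

Running this with different seeds gives different numbers of Londoners. The method of selection is fair every time, but any one small sample may over- or under-represent London.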

Quota sampling

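If the population can be divided into different groups then it may be practical to set a quota for each of those groups, and then select items within each group until its quota is filled. In practice the selection within each group is often left to the interviewer rather than made at random, which can introduce bias.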

Multi-stage sampling

Divide the population into groups and then randomly select a number of these groups for the next stage. Each of the selected groups is divided into sub-groups and a number of these sub-groups is randomly selected for the next stage. This process is repeated until the size of the sub-groups is sufficiently small. This process is commonly used for limiting the travel involved with surveys, with each of the groupings representing different geographical areas.
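The repeated "select groups, then select sub-groups" process can be sketched as follows. The nested region/town/household structure and all the names are hypothetical, and the final step (a random pick of households within each remaining town) is an assumption about how the last stage is carried out.

```python
import random

def multi_stage_sample(population, groups_per_stage, final_size, seed=None):
    """population: nested dicts, e.g. region -> town -> list of households.
    groups_per_stage: how many groups to keep at each level of nesting.
    final_size: how many items to take from each remaining sub-group."""
    rng = random.Random(seed)
    current = [population]
    for k in groups_per_stage:
        next_level = []
        for group in current:
            # Randomly choose k of this group's sub-groups for the next stage.
            chosen = rng.sample(sorted(group), min(k, len(group)))
            next_level.extend(group[name] for name in chosen)
        current = next_level
    # Final stage: random items inside each remaining sub-group.
    sample = []
    for items in current:
        sample.extend(rng.sample(items, min(final_size, len(items))))
    return sample

# Hypothetical population: 4 regions, 6 towns each, 50 households per town.
population = {
    f"region{r}": {
        f"town{r}.{t}": [f"household{r}.{t}.{h}" for h in range(50)]
        for t in range(6)
    }
    for r in range(4)
}

sample = multi_stage_sample(population, groups_per_stage=[2, 3],
                            final_size=5, seed=0)
print(len(sample), "households selected")
```

Here the surveyors only need to visit 2 regions and 3 towns in each, rather than travelling to all 24 towns.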

Systematic sampling

This is a simple method, commonly used on production lines - if you want a 10% sample of the bolts from a production line in order to check their size, then take every 10th bolt produced.

If something in the way the production line works causes a pattern in the bolts that repeats every 10th bolt (for example, the machinist knows which bolts will be tested and takes extra care over them), then the result will be biased.
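A minimal sketch of this problem, using an invented production line in which every 10th bolt is made with extra care (10.0 mm) while all the others run slightly long (10.2 mm). Sampling every 10th bolt then lines up exactly with the pattern, while a step of 7 (an arbitrary choice that does not divide the period) does not.

```python
def systematic_sample(items, step, start=0):
    """Take every `step`-th item, beginning at index `start`."""
    return items[start::step]

def mean(values):
    return sum(values) / len(values)

# Hypothetical production line: bolt lengths in mm. Every 10th bolt is
# made with extra care (10.0 mm); the others run slightly long (10.2 mm).
bolts = [10.0 if (i + 1) % 10 == 0 else 10.2 for i in range(1000)]

sample_every_10th = systematic_sample(bolts, step=10, start=9)  # bolts 10, 20, ...
sample_every_7th = systematic_sample(bolts, step=7)

print(f"all bolts:   {mean(bolts):.3f} mm")
print(f"every 10th:  {mean(sample_every_10th):.3f} mm")
print(f"every 7th:   {mean(sample_every_7th):.3f} mm")
```

The every-10th sample contains only the carefully made bolts, so its mean is 10.0 mm against a true mean of about 10.18 mm, whereas the every-7th sample lands close to the true mean.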

Stratified sampling

This is a refinement on quota sampling. If the relative size of each of the different population groups (strata) is known then the quota for each group can be set in proportion to its share of the population, with the items within each group selected at random. If this process is performed well, then the results from stratified sampling are usually more representative of the population than those from pure random sampling.
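Setting the quotas in proportion to group size can be sketched as below. The school, its year groups and their sizes are hypothetical, and the sketch assumes random selection within each group.

```python
import random

def stratified_sample(strata, total, seed=None):
    """strata: dict mapping group name -> list of items in that group.
    Each group's quota is proportional to its share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # Rounding means the quotas may not always add up exactly to `total`.
        quota = round(total * len(items) / population_size)
        sample[name] = rng.sample(items, quota)
    return sample

# Hypothetical school: 600 pupils in three year groups.
strata = {
    "Year 7": [f"Y7-{i}" for i in range(300)],
    "Year 8": [f"Y8-{i}" for i in range(200)],
    "Year 9": [f"Y9-{i}" for i in range(100)],
}

sample = stratified_sample(strata, total=60, seed=0)
quotas = {name: len(chosen) for name, chosen in sample.items()}
print(quotas)
```

For a sample of 60 the quotas come out as 30, 20 and 10 pupils, matching the 3:2:1 ratio of the year groups.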

Convenience sampling

This is NOT a scientific/statistical method of sampling but is unfortunately used by companies. Convenience sampling is selecting items based on how easy it is to get a response, for example standing outside a specific grocery store collecting data and then making general judgements on shopping habits. Such judgements are not valid and are biased. You did not collect data about the general shopping habits of the population; you collected data on shoppers who visited that specific store during those specific hours and who took the time to fill in your questionnaire. It is important to know these limitations.

Reporting methods used

Whatever method has been used to collect your data, it is essential that you document the method, and describe the limitations on that data. If you are using secondary data (i.e. data in a published source) then it is essential that the source of the data is fully documented. Only by taking the care to describe your data sources can the results from, and limitations of, your analysis be understood.