Docsity
Understanding Data Types and Summarization Techniques: A Guide to Analyzing Raw Data, Study notes of Mathematical Statistics

An introduction to raw data, the different types of data (categorical and quantitative, discrete and continuous), and the importance of asking the right research questions. It also covers techniques for summarizing data, including calculating summary measures, identifying outliers, and creating visualizations using Minitab. Includes examples and exercises.

Typology: Study notes

Pre 2010

Uploaded on 08/16/2009

koofers-user-ay5

Slide 1: Turning Data Into Information (Chapter 2)

Slide 2: 2.1 Raw Data
• Raw data are numbers and category labels that have been collected but have not yet been processed in any way.
• When measurements are taken from a subset of a population, they represent sample data.
• When all individuals in a population are measured, the measurements represent population data.
• Descriptive statistics: summary numbers for either a population or a sample.

Slide 3 (Jamshidian, 2005): 2.2 Types of Data
Data
  Categorical: Nominal, Ordinal
  Quantitative: Continuous, Discrete

Slide 4: 2.2 Types of Data
• Raw data from categorical variables consist of group or category names that don't necessarily have a logical ordering. Examples: eye color, country of residence.
• Categorical variables for which the categories have a logical ordering are called ordinal variables. Examples: highest educational degree earned, tee shirt size (S, M, L, XL).
• Raw data from quantitative variables consist of numerical values taken on each individual. Examples: height, number of siblings.

Slide 5: 2.2 Types of Data
• Discrete variables are those whose possible values are countable. Example: number of siblings.
• Continuous variables are those that take on values in intervals. Example: height.
• Sometimes the type of a variable depends on the way it is observed. For example, the following two questions about income lead to two different types of variable:
  1. State your annual income in dollars. (This gives a quantitative variable.)
  2. Is your income (1) between $20,000 and $40,000, (2) between $40,000 and $60,000, (3) above $60,000? (This gives an ordinal categorical variable.)

Slide 6: Asking the Right Questions: One Categorical Variable
Question 1a: How many and what percentage of individuals fall into each category?
Example: What percentage of college students favor the legalization of marijuana, and what percentage of college students oppose legalization of marijuana?
Question 1b: Are individuals equally divided across categories, or do the percentages across categories follow some other interesting pattern?
Example: When individuals are asked to choose a number from 1 to 10, are all numbers equally likely to be chosen?

Slide 7: Asking the Right Questions: Two Categorical Variables
Question 2a: Is there a relationship between the two variables, so that the category into which individuals fall for one variable seems to depend on which category they are in for the other variable?
Example: In Case Study 1.6, we asked if the risk of having a heart attack was different for the physicians who took aspirin than for those who took a placebo.
Question 2b: Do some combinations of categories stand out because they provide information that is not found by examining the categories separately?
Example: The relationship between smoking and lung cancer was detected, in part, because someone noticed that the combination of being a nonsmoker and having lung cancer is unusual.

Slide 8: Asking the Right Questions: One Quantitative Variable
Question 3a: What are the interesting summary measures, like the average or the range of values, that help us understand the collection of individuals who were measured?
Example: What is the average handspan measurement, and how much variability is there in handspan measurements?
Question 3b: Are there individual data values that provide interesting information because they are unique or stand out in some way (outliers)?
Example: What is the oldest recorded age of death for a human? Are there many people who have lived nearly that long, or is the oldest recorded age a unique case?
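
Question 1a (how many and what percentage of individuals fall into each category) comes down to a tally. The following is a minimal Python sketch, not part of the original slides. It assumes the counts reported for the SQpick variable in the Minitab tally in these notes (84 Q and 106 S); the function name `tally` is an illustrative choice, not a Minitab command.

```python
from collections import Counter


def tally(labels):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    # Percent is rounded to two decimals, as in the Minitab output.
    return {
        category: (count, round(100 * count / n, 2))
        for category, count in sorted(counts.items())
    }


# Assumption: rebuild the raw labels from the reported counts (84 Q, 106 S).
responses = ["Q"] * 84 + ["S"] * 106

for category, (count, percent) in tally(responses).items():
    print(category, count, percent)
```

The same function also speaks to Question 1b: if the categories were equally likely, each percentage would be close to 100 divided by the number of categories.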
Slide 17: Using Minitab: Summarizing One Categorical Variable
Minitab output:
Results for: pennstate1.mtw
Tally for Discrete Variables: SQpick

SQpick   Count   Percent
Q           84     44.21
S          106     55.79
N = 190

Slide 18: Using Minitab: Summarizing Two Categorical Variables
Stat > Tables > Cross tabulation and …

Slide 19: Using Minitab: Summarizing Two Categorical Variables
Minitab output:
Tabulated statistics: SQpick, Form
Rows: SQpick   Columns: Form
(each cell shows the count, with the row percentage beneath it)

        Q or S   S or Q      All
Q           53       31       84
         63.10    36.90   100.00
S           45       61      106
         42.45    57.55   100.00
All         98       92      190
         51.58    48.42   100.00

Slide 20: In-class exercises from Sections 2.1 & 2.2
Pages 58 # 1, 5, 11, 16

Slide 21: Visual Summaries for Categorical Variables
• Pie charts: useful for summarizing a single categorical variable if there are not too many categories.
• Bar graphs: useful for summarizing one or two categorical variables, and particularly useful for making comparisons when there are two categorical variables.

Slide 22: Example 2.3 Humans Are Not Good Randomizers
Survey of n = 190 college students: "Randomly pick a number between 1 and 10."
Results: Most chose 7, very few chose 1 or 10.

Slide 23: Example 2.4 Revisiting Nightlights and Nearsightedness
Survey of n = 479 children.
Response: Degree of Myopia
Explanatory: Amount of Sleeptime Lighting

Slide 24: Minitab: Pie Chart
Graph > Pie Chart > Tally individual variables

Slide 25: Minitab: Pie Chart output
[Figure: Pie Chart of RandNumb, panel variable Sex (panels F and M), categories 1 to 10]

Slide 26: Minitab: Bar Graphs
Graph > Bar Chart
[Figure: Chart of RandNumb, Sex; vertical axis Count (0 to 30); bars for F and M at each RandNumb value from 1 to 10]

Slide 27: 2.4 Finding Information in Quantitative Data
A long list of numbers needs to be organized to obtain answers to questions of interest.

Slide 28: Five-Number Summaries
• Find extremes (high, low), the median, and the quartiles (medians of lower and upper halves of the values).
• Quick overview of the data values.
• Information about the center, spread, and shape of data.

Slide 37: Example 2.7 Tiny Boatsmen
Weights (in pounds) of 18 men on crew teams:
Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0
Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5
Note: The last weight in each list is unusually small. These men are the coxswains for their teams, while the others are rowers.

Slide 38: 2.4 Pictures for Quantitative Data
• Histograms: similar to bar graphs, used for any number of data values.
• Stem-and-leaf plots and dotplots: present all individual values, useful for small to moderate sized data sets.
• Boxplot or box-and-whisker plot: useful summary for comparing two or more groups.

Slide 39: Interpreting Histograms, Stemplots, and Dotplots
• Values are centered around 20 cm.
• Two possible low outliers.
• Apart from outliers, spans range from about 16 to 23 cm.

Slide 40: Creating a Histogram
Step 1: Decide how many equally spaced (same width) intervals to use for the horizontal axis. Between 6 and 15 intervals is a good number.
Step 2: Decide whether to use frequencies (counts) or relative frequencies (proportions) on the vertical axis.
Step 3: Draw equally spaced intervals on the horizontal axis covering the entire range of the data. Determine the frequency or relative frequency of data values in each interval and draw a bar with the corresponding height. Decide which rule to use for values that fall on the border between two intervals.

Slide 41: Histogram
Example: A marketing consultant observed 50 shoppers at a grocery store. One variable of interest was how much each shopper spent in the store. Here are the data (in dollars), arranged in increasing order.
 2.32   6.61   6.90   8.04   9.45  10.26  11.34  11.63  12.66  12.95
13.67  13.72  14.35  14.52  14.55  15.01  15.33  16.55  17.15  18.22
18.30  18.71  19.54  19.55  20.58  20.89  20.91  21.13  23.85  26.04
27.07  28.76  29.15  30.54  31.99  32.82  33.26  33.80  34.76  36.22
37.52  39.28  40.80  43.97  45.58  52.36  61.57  63.85  64.30  69.49

Slide 42: Histogram
A frequency table summarizes these data as follows:

Dollars Spent   Frequency   Relative Frequency
 2.32-12.32          8            0.16
12.32-22.32         20            0.40
22.32-32.32          7            0.14
32.32-42.32          8            0.16
42.32-52.32          2            0.04
52.32-62.32          2            0.04
62.32-72.32          3            0.06
Total               50            1.00

Slide 43: Histogram
A few notes:
• Each of the intervals in the first column is called a measurement class.
• Observed values that fall on the boundaries of the measurement classes should consistently go into either the lower or the upper subinterval.
• The number of measurement classes for a data set should be chosen so that the least amount of information is lost while the data are effectively summarized. Too few classes summarize the data too much; too many classes do not summarize the data effectively.

Slide 44: Minitab: Histogram
Graph > Histogram
To set the number of bins, follow these steps:
1. Select the histogram bars by clicking on one of the bars.
2. Right click on the graph and select "Edit bars".
3. Choose the "binning" tab.
4. Type in the number of bins desired.

Slide 45: Histogram: Three histograms of the shopping data
[Figure: three histograms of the shopping data (variable v1, vertical axis Count) drawn with different numbers of bins; one panel reports Mean = 25.8, Std. Dev = 16.15, N = 50]

Slide 46: Creating a Dotplot
"A dotplot displays a dot for each observation along a number line.
If there are multiple occurrences of an observation, or if observations are too close together, then dots will be stacked vertically. If there are too many points to fit vertically in the graph, then each dot may represent more than one point." (Minitab, Release 12.1, 1998)

Slide 47: Creating a Stem-and-Leaf Plot
Step 1: Determine stem values. The "stem" contains all but the last of the displayed digits of a number. Stems should define equally spaced intervals.
Step 2: For each individual, attach a "leaf" to the appropriate stem. A "leaf" is the last of the displayed digits of a number. Often leaves are ordered on each stem.
Note: There is more than one way to define stems. You can use split stems, or truncate or round the values first.

Slide 48: Stem and leaf plots
For a given number, the stem consists of all but the final (rightmost) digit. The leaf consists of the final digit. A leaf digit unit (LDU) determines the location of the decimal place.
Example:

Number   Stem   Leaf   LDU
34.75    347    5      .01
3475     347    5      1

Slide 57: 2.5 Numerical Summaries of Quantitative Data
Notation for raw data:
n = number of individuals in a data set
$x_1, x_2, x_3, \ldots, x_n$ represent individual raw data values
Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19. Then n = 6 and $x_1 = 21$, $x_2 = 19$, $x_3 = 20$, $x_4 = 20$, $x_5 = 22$, $x_6 = 19$.

Slide 58: Describing the Location of a Data Set
• Mean: the numerical average
• Median: the middle value (if n is odd) or the average of the middle two values (if n is even)
Symmetric: mean = median
Skewed left: mean < median
Skewed right: mean > median

Slide 59: Comparison of mean and median for various shapes
[Figure: four distribution shapes, labeled Mean = median, Mean < median, Mean > median, Mean < median]

Slide 60: Determining the Mean and Median
The Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\sum x_i$ means "add together all the values".
The Median (See Slide 28)

Slide 61: Example 2.9 Will "Normal" Rainfall Get Rid of Those Odors?
Data: Average rainfall (inches) for Davis, California for 47 years
Mean = 18.69 inches
Median = 16.72 inches
In 1997-98, a company with an odor problem blamed it on excessive rain. That year rainfall was 29.69 inches. More rain occurred in 4 other years.

Slide 62: The Influence of Outliers on the Mean and Median
Outliers have a larger influence on the mean than on the median. High outliers will increase the mean. Low outliers will decrease the mean.
If ages at death are 70, 72, 74, 76, and 78, then mean = median = 74 years.
If ages at death are 35, 72, 74, 76, and 78, then median = 74 but mean = 67 years.

Slide 63: Describing Spread: Range and Interquartile Range
• Range = high value - low value
• Interquartile Range (IQR) = upper quartile - lower quartile
• Standard Deviation (covered later in Section 2.7)

Slide 64: Example 2.10 Fastest Speeds Ever Driven
Five-Number Summary for 87 males
• Median = 110 mph measures the center of the data
• Two extremes describe spread over 100% of data: Range = 150 - 55 = 95 mph
• Two quartiles describe spread over middle 50% of data: Interquartile Range = 120 - 95 = 25 mph

Slide 65: Notation and Finding the Quartiles (also see slide 30)
Split the ordered values into the half that is below the median and the half that is above the median.
Q1 = lower quartile = median of data values that are below the median
Q3 = upper quartile = median of data values that are above the median

Slide 66: Example 2.10 Fastest Speeds (cont)
• Median = (87+1)/2 = 44th value in the list = 110 mph
• Q1 = median of the 43 values below the median = (43+1)/2 = 22nd value from the start of the list = 95 mph
• Q3 = median of the 43 values above the median = (43+1)/2 = 22nd value from the end of the list = 120 mph
Ordered data (in rows of 10 values) for the 87 males:

 55   60   80   80   80   80   85   85   85   85
 90   90   90   90   90   92   94   95   95   95
 95   95   95  100  100  100  100  100  100  100
100  100  101  102  105  105  105  105  105  105
105  105  109  110  110  110  110  110  110  110
110  110  110  110  110  112  115  115  115  115
115  115  120  120  120  120  120  120  120  120
120  120  124  125  125  125  125  125  125  130
130  140  140  140  140  145  150

Slide 67: Percentiles
The kth percentile is a number that has k% of the data values at or below it and (100 - k)% of the data values at or above it.
• Lower quartile = 25th percentile
• Median = 50th percentile
• Upper quartile = 75th percentile

Slide 68: In-class exercises: Sections 2.5 & 2.6
# 60, 64, 65, 71

Slide 77: Interpreting the Standard Deviation for Bell-Shaped Curves: The Empirical Rule
For any bell-shaped curve, approximately
• 68% of the values fall within 1 standard deviation of the mean in either direction
• 95% of the values fall within 2 standard deviations of the mean in either direction
• 99.7% of the values fall within 3 standard deviations of the mean in either direction

Slide 78: The Empirical Rule, the Standard Deviation, and the Range
• Empirical Rule => the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape.
• You can get a rough idea of the value of the standard deviation by dividing the range by 6: $s \approx \frac{\text{Range}}{6}$

Slide 79: Example 2.11 Women's Heights (cont)
Mean height for the 199 British women is 1602 mm and standard deviation is 62.4 mm.
• 68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm
• 95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm
• 99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm

Slide 80: Example 2.11 Women's Heights (cont)
Summary of the actual results:
Note: The minimum height = 1410 mm and the maximum height = 1760 mm, for a range of 1760 - 1410 = 350 mm. So an estimate of the standard deviation is:
$s \approx \frac{\text{Range}}{6} = \frac{350}{6} = 58.3$ mm

Slide 81: Standardized z-Scores
Standardized score or z-score:
$z = \frac{\text{Observed value} - \text{Mean}}{\text{Standard deviation}}$
Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80:
$z = \frac{80 - 70}{8} = 1.25$
A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.
A: 5 6 2 3
B: 4 4 4 4 3

Slide 82: The Empirical Rule Restated
For bell-shaped data,
• About 68% of the values have z-scores between -1 and +1.
• About 95% of the values have z-scores between -2 and +2.
• About 99.7% of the values have z-scores between -3 and +3.

Slide 83: In-class exercises: Section 2.7
Pages 65-66 # 78a, 79a, 84, 94
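
The numerical summaries in Sections 2.5 to 2.7 can be checked with a few lines of Python. This is only a sketch under stated assumptions, not part of the original slides: the function names are illustrative, the quartiles use the median-of-halves rule described in these notes (other software may use a different quartile convention), and the speed data are rebuilt from value counts read off the ordered list in Example 2.10.

```python
from statistics import mean, median


def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using the median-of-halves rule.

    Needs at least two values. When n is odd, the middle value is left
    out of both halves, as in Example 2.10.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]                   # values below the median
    upper = ordered[len(ordered) - half:]    # values above the median
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def z_score(observed, mean_value, std_dev):
    """Standardized score: (observed value - mean) / standard deviation."""
    return (observed - mean_value) / std_dev


def empirical_rule_intervals(mean_value, std_dev):
    """Intervals mean ± k*sd for k = 1, 2, 3, rounded to one decimal."""
    return {
        k: (round(mean_value - k * std_dev, 1), round(mean_value + k * std_dev, 1))
        for k in (1, 2, 3)
    }


# Assumption: {speed in mph: number of males} read off the ordered list.
speed_counts = {
    55: 1, 60: 1, 80: 4, 85: 4, 90: 5, 92: 1, 94: 1, 95: 6, 100: 9,
    101: 1, 102: 1, 105: 8, 109: 1, 110: 12, 112: 1, 115: 6, 120: 10,
    124: 1, 125: 6, 130: 2, 140: 4, 145: 1, 150: 1,
}
speeds = [speed for speed, count in speed_counts.items() for _ in range(count)]

low, q1, med, q3, high = five_number_summary(speeds)
print("five-number summary:", low, q1, med, q3, high)
print("range:", high - low, "IQR:", q3 - q1)

# Influence of a low outlier on the mean but not on the median.
ages = [35, 72, 74, 76, 78]
print("mean:", mean(ages), "median:", median(ages))

# Women's heights: empirical-rule intervals and the range / 6 estimate.
print(empirical_rule_intervals(1602, 62.4))
print("sd estimate from range:", round((1760 - 1410) / 6, 1))

# Resting pulse rate of 80 bpm with mean 70 and standard deviation 8.
print("z-score:", z_score(80, 70, 8))
```

Rebuilding the speeds from counts keeps the listing short; typing the 87 values directly would give the same results.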



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved