BINF702 FALL 2008
CHAPTER 2 - Descriptive Statistics
BINF702 - FALL08 - SOLKA -
CHAPTER 2 Descriptive Statistics
Section 2.2 - The Arithmetic Mean (the mean in R)

mean(x, ...)

> x = matrix(rnorm(100), nrow=50, ncol=2)
> x[50,1] = 60
> x[50,2] = 880
> plot(x[,1],x[,2])
> apply(x,2,mean)
[1] 1.259362 1.648327
> apply(x,2,mean,trim=0.1)
[1] 0.1475340 0.1076194

trim: the fraction of observations to be removed from each end of the data.

Section 2.2 - The Median

Def. 2.2 - The sample median is given by
(1) the $\left(\frac{n+1}{2}\right)$th largest observation if $n$ is odd;
(2) the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations if $n$ is even.

Section 2.2 - The Median in R

median(x, na.rm = FALSE)

> tbl2.2 = c(7, 35, 5, 9, 8, 3, 10, 12, 8)
> median(tbl2.2)
[1] 8

What does this do?

Section 2.2 - Comparison of the Arithmetic Mean and the Median

> negexpdata = -rexp(1000)
> hist(negexpdata, nclass = 20)
> points(mean(negexpdata), 0, pch = 'o')
> points(median(negexpdata), 0, pch = '*')

Section 2.2 - Comparison of the Arithmetic Mean and the Median (All Three Distributions)
[Figure: three panels titled "Histogram of symmdata", "Histogram of expdata", and "Histogram of negexpdata"]
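As a minimal sketch of the comparison in this section, the R code below uses three small made-up samples to show how the mean is pulled toward the long tail while the median stays near the bulk of the data. The sample values and the helper name `loc_summary` are illustrative assumptions, not data from the slides.

```r
# Three small fixed samples (hypothetical values, chosen only for illustration).
symm_x  <- c(1, 2, 3, 4, 5, 6, 7)     # symmetric about 4
right_x <- c(1, 1, 2, 2, 3, 4, 15)    # long right tail (positively skewed)
left_x  <- -right_x                   # mirror image: long left tail

# Hypothetical helper: returns the mean and the median of a numeric vector.
loc_summary <- function(v) c(mean = mean(v), median = median(v))

# One row per sample, one column per measure of location.
loc_table <- rbind(symmetric    = loc_summary(symm_x),
                   right_skewed = loc_summary(right_x),
                   left_skewed  = loc_summary(left_x))
print(loc_table)
```

For the symmetric sample the two measures coincide. For the right-skewed sample the mean exceeds the median, and for the left-skewed sample (the same shape as `negexpdata`) the mean falls below it.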
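Section 2.2 - Measures of Location (The Mode)

Def. 2.3 - The mode is the most frequently occurring value among all the observations in a sample.

Can there be more than one mode? Does there have to be a mode? Does a mode have to be a data value? What do you think a unimodal distribution is? How about a multimodal distribution? Writing a function for the mode in R would be an interesting pedagogical exercise.

2.3 - Some Properties of the Arithmetic Mean (Effect of Scaling)

Eq. 2.2 - If $y_i = c x_i$, $i = 1, \ldots, n$, then $\bar{y} = c\bar{x}$.

> x = rnorm(1000)
> mean(x)
[1] 0.02741812
> mean(2*x)
[1] 0.05483624

2.3 - Some Properties of the Arithmetic Mean (Effect of Translation and Scaling)

Eq. 2.3 - Let $x_1, \ldots, x_n$ be the original sample of data and let $y_i = c_1 x_i + c_2$, $i = 1, \ldots, n$, represent a transformed sample obtained by multiplying each original sample point by a factor $c_1$ and then shifting over by a constant $c_2$. Then $\bar{y} = c_1 \bar{x} + c_2$.

> x = rnorm(1000)
> mean(x)
[1] 0.02741812
> mean(2*x+5)
[1] 5.054836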
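The translation-and-scaling rule in Eq. 2.3 follows directly from the definition of the sample mean. A short derivation, using only the symbols already defined in Eq. 2.3:

```latex
\begin{aligned}
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i
         = \frac{1}{n}\sum_{i=1}^{n} \left(c_1 x_i + c_2\right) \\
        &= c_1 \cdot \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{n\,c_2}{n}
         = c_1 \bar{x} + c_2 .
\end{aligned}
```

Setting $c_2 = 0$ recovers Eq. 2.2. The R output agrees: with $c_1 = 2$, $c_2 = 5$ and $\bar{x} = 0.02741812$, the rule gives $2 \times 0.02741812 + 5 = 5.05483624$, which R prints as 5.054836.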
2.4 - Measures of Spread
- These two datasets have the same mean
- c(rep(10, 10))
- c(rep(0, 9), 100)
- Are they the same?
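One way to answer the question is to compute a few measures of spread for both datasets. The sketch below is illustrative only; the variable names and the helper `spread_summary` are assumptions, not part of the slides.

```r
# The two datasets from the slide.
flat_data   <- rep(10, 10)          # ten observations, all equal to 10
spiked_data <- c(rep(0, 9), 100)    # nine zeros and a single 100

# Hypothetical helper: mean plus two simple measures of spread.
spread_summary <- function(v) {
  c(mean  = mean(v),
    range = diff(range(v)),   # largest minus smallest observation
    sd    = sd(v))            # sample standard deviation (n - 1 divisor)
}

spread_table <- rbind(flat = spread_summary(flat_data),
                      spiked = spread_summary(spiked_data))
print(spread_table)
```

Both rows report a mean of 10, but the range and standard deviation are zero for the first dataset and large for the second, which is why a measure of location alone does not describe a sample.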
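2.4 - Measures of Spread (quantiles)

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, ...)

> bw = c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
+        2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)
> quantile(bw, probs = c(.1,.9), type = 2)
 10%  90%
2670 3629

2.4 - Measures of Spread (The Variance and Standard Deviation)

Consider the average deviation from the sample mean:

$d = \frac{\sum_{i=1}^{n} (x_i - \bar{x})}{n}$

Eq. 2.4 - The sum of the deviations of the individual observations of a sample about the sample mean is always zero.

Def. 2.7 - The sample variance, or variance, is defined as follows:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Def. 2.8 - The sample standard deviation, or standard deviation, is defined as follows:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

2.5 - Some Properties of the Variance and Standard Deviation (Effect of Scaling)

Eq. 2.6 - Suppose there are two samples $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$, where $y_i = c x_i$, $i = 1, \ldots, n$, $c > 0$. If the respective sample variances of the two samples are denoted by $s_x^2$ and $s_y^2$, then $s_y^2 = c^2 s_x^2$ and $s_y = c s_x$.

2.6 - The Coefficient of Variation

Def. 2.9 - The coefficient of variation (CV) is defined by $100\% \times (s / \bar{x})$.

Because it is a unitless ratio, you can compare the CV of variables expressed in different units. It only makes sense to report CV for variables where zero really means zero, such as mass or enzyme activity. Don't calculate CV for variables, such as temperature, where the definition of zero is somewhat arbitrary.

http://www.graphpad.com/articles/interpret/Analyzing_one_group/descr_stats.htm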
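The variance and standard deviation definitions (Def. 2.7, Def. 2.8), the zero-sum property (Eq. 2.4), and the scaling rule (Eq. 2.6) can be checked numerically. The following is a sketch on the `bw` birthweights from the quantile example; the variable names and the scale factor (grams to kilograms) are illustrative assumptions.

```r
# Birthweights (g) from the quantile example.
bw <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
        2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)
n <- length(bw)

# Eq. 2.4: deviations about the sample mean sum to zero (up to
# floating-point rounding), so their average cannot measure spread.
dev_sum <- sum(bw - mean(bw))

# Def. 2.7 and Def. 2.8 written out directly from the formulas.
s2_manual <- sum((bw - mean(bw))^2) / (n - 1)
s_manual  <- sqrt(s2_manual)

# Built-in equivalents: var() and sd() also use the n - 1 divisor.
s2_builtin <- var(bw)
s_builtin  <- sd(bw)

# Eq. 2.6 with an assumed scale factor c: grams to kilograms.
scale_c   <- 1 / 1000
var_ratio <- var(scale_c * bw) / var(bw)   # expected: scale_c^2
sd_ratio  <- sd(scale_c * bw) / sd(bw)     # expected: scale_c

print(c(dev_sum = dev_sum, s2_manual = s2_manual, s2_builtin = s2_builtin))
print(c(var_ratio = var_ratio, sd_ratio = sd_ratio))
```

The manual and built-in values should agree, and the two ratios should come out as $c^2$ and $c$ respectively.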
2.6 - The Coefficient of Variation and R
- There is no coefficient of variation (CV) function implemented in base R, but it would be trivial to implement one.
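A minimal sketch of such a function, following Def. 2.9 (100% times the sample standard deviation divided by the sample mean). The name `cv` and the `na.rm` argument are assumptions made for this example, not an existing base R function.

```r
# Hypothetical helper: coefficient of variation in percent (Def. 2.9).
# Only meaningful for data where zero really means zero and the mean is nonzero.
cv <- function(x, na.rm = FALSE) {
  100 * sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Birthweights (g) from the quantile example.
bw <- c(3265, 3260, 3245, 3484, 4146, 3323, 3649, 3200, 3031, 2069,
        2581, 2841, 3609, 2838, 3541, 2759, 3248, 3314, 3101, 2834)

cv_grams <- cv(bw)          # CV with weights in grams
cv_kg    <- cv(bw / 1000)   # same data in kilograms
print(c(grams = cv_grams, kilograms = cv_kg))

cv(c(1, 2, 3))   # sd = 1, mean = 2, so this returns 50
```

Because the standard deviation and the mean scale by the same factor, the CV is the same in grams and in kilograms, which is what makes it unitless.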
2.7 - Grouped Data (hist)

Arguments:

x: a vector of values for which the histogram is desired.

breaks: one of:
  * a vector giving the breakpoints between histogram cells,
  * a single number giving the number of cells for the histogram,
  * a character string naming an algorithm to compute the number of cells (see Details),
  * a function to compute the number of cells.
  In the last three cases the number is a suggestion only.

freq: logical; if 'TRUE', the histogram graphic is a representation of frequencies, the 'counts' component of the result; if 'FALSE', probability densities, component 'density', are plotted (so that the histogram has a total area of one). Defaults to 'TRUE' _iff_ 'breaks' are equidistant (and 'probability' is not specified).

probability: an _alias_ for '!freq', for S compatibility.

include.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks' value will be included in the first (or last, for 'right = FALSE') bar. This will be ignored (with a warning) unless 'breaks' is a vector.

right: logical; if 'TRUE', the histogram cells are right-closed (left open) intervals.

density: the density of shading lines, in lines per inch. The default value of 'NULL' means that no shading lines are drawn. Non-positive values of 'density' also inhibit the drawing of shading lines.

angle: the slope of shading lines, given as an angle in degrees (counter-clockwise).

col: a colour to be used to fill the bars. The default of 'NULL' yields unfilled bars.

Details:

The definition of "histogram" differs by source (with country-specific biases). R's default with equi-spaced breaks (also the default) is to plot the counts in the cells defined by 'breaks'. Thus the height of a rectangle is proportional to the number of points falling into the cell, as is the area _provided_ the breaks are equally-spaced.

The default with non-equi-spaced breaks is to give a plot of area one, in which the _area_ of the rectangles is the fraction of the data points falling in the cells.

If 'right = TRUE' (default), the histogram cells are intervals of the form '(a, b]', i.e., they include their right-hand endpoint, but not their left one, with the exception of the first cell when 'include.lowest' is 'TRUE'.

For 'right = FALSE', the intervals are of the form '[a, b)', and 'include.lowest' really has the meaning of "_include highest_".

A numerical tolerance of 1e-7 times the median bin size is applied when counting entries on the edges of bins.

The default for 'breaks' is '"Sturges"': see 'nclass.Sturges'. Other names for which algorithms are supplied are '"Scott"' and '"FD"' / '"Friedman-Diaconis"' (with corresponding functions 'nclass.scott' and 'nclass.FD'). Case is ignored and partial matching is used. Alternatively, a function can be supplied which will compute the intended number of breaks as a function of 'x'.

Value:

an object of class '"histogram"' which is a list with components:

breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).

counts: n integers; for each cell, the number of 'x[]' inside.

density: values f^(x[i]), as estimated density values. If 'all(diff(breaks) == 1)', they are the relative frequencies 'counts/n' and in general satisfy sum[i; f^(x[i]) (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.

2.8 - Graphic Methods (Stem and Leaf Plots in R)

An example (Fig. 2.6 of your text):

> stem(bw, scale = 2)

  The decimal point is 1 digit(s) to the right of the |

   3 | 2
   4 |
   5 | 8
   6 | 478
   7 |
   8 | 3556788999
   9 | 12344568889
  10 | 0123444445567888899
  11 | 00122235555556889
  12 | 01112222344445567788
  13 | 222334557888
  14 | 0146
  15 | 5
  16 | 1
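Returning to the `hist` documentation, the effect of `right` and the meaning of the `counts` and `density` components can be seen without drawing anything by passing `plot = FALSE`. This is a small sketch; the data vector and the break points are made up for illustration.

```r
# Hypothetical data with several values sitting exactly on break points.
obs    <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 10)
breaks <- c(0, 2, 4, 6, 10)   # unequal cell widths on purpose

# Default: right-closed cells (0,2], (2,4], (4,6], (6,10].
h_right <- hist(obs, breaks = breaks, plot = FALSE)

# Left-closed cells [0,2), [2,4), [4,6), [6,10]; the last cell still
# includes its upper endpoint because include.lowest defaults to TRUE.
h_left <- hist(obs, breaks = breaks, right = FALSE, plot = FALSE)

print(rbind(right_closed = h_right$counts, left_closed = h_left$counts))

# density is counts / (n * cell width), so cell areas sum to one.
total_area <- sum(h_right$density * diff(h_right$breaks))
print(total_area)
```

Observations equal to 2 or 4 move to the next cell when `right = FALSE`, while the total area under the density version stays at one even though the cells have different widths.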
2.8 - Graphics Methods (boxplot)
> boxplot(bw)
[Figure: boxplot of bw; axis ticks run from 40 to 160]
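The numbers behind a box plot can be inspected with `boxplot.stats`, which returns the five values drawn (lower whisker, lower hinge, median, upper hinge, upper whisker) and any points flagged as outliers. The values of `bw` plotted on this slide are not listed in the text, so this sketch reuses the small `tbl2.2` sample from the median example instead; that substitution is an assumption made for illustration.

```r
# Sample from the median example (tbl2.2).
tbl2.2 <- c(7, 35, 5, 9, 8, 3, 10, 12, 8)

# boxplot.stats() computes what boxplot() draws, without plotting.
bp <- boxplot.stats(tbl2.2)

bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # observations beyond 1.5 times the hinge spread from the box

# The same hinges and median come from Tukey's five-number summary,
# except that fivenum() reports the raw minimum and maximum.
fivenum(tbl2.2)
```

Here the value 35 lies more than 1.5 hinge spreads above the upper hinge, so it is reported in `bp$out` and the upper whisker stops at 12.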