Stat 133 Class Notes - Spring 2011 (lecture notes)

Class notes for the Stat 133 course taught by Phil Spector in Spring 2011. The notes cover basic concepts in computing with data; an introduction to the R language; data in R; vectors, modes and classes; matrices; data frames; working with multiple data frames; adding color to plots; using dates in R; data summaries; functions; sizes of objects; character manipulation; Unix basics (command path, basic commands, command history, editors); and CGI programming with R (web servers, CGI scripting, and a first CGI program with R).

Stat 133 Class Notes - Spring, 2011

Phil Spector

May 31, 2011

Contents

1 Introduction
  1.1 What's this course about?
  1.2 Some Basic Concepts in Computing with Data
  1.3 A Short Note on Academic Integrity
  1.4 Introduction to R

2 The R Language
  2.1 Data in R
  2.2 Vectors
  2.3 Modes and Classes
  2.4 Reading Vectors
  2.5 Missing Values
  2.6 Matrices
  2.7 Data Frames
  2.8 More on Data Frames
  2.9 Reading Data Frames from Files and URLs
  2.10 Working with Multiple Data Frames
  2.11 Adding Color to Plots
  2.12 Using Dates in R
  2.13 Data Summaries
  2.14 Functions
  2.15 Functions
  2.16 Sizes of Objects
  2.17 Character Manipulation
  2.18 Working with Characters

3 Unix
  3.1 Software for Remote Access
  3.2 Basics of Unix
  3.3 Command Path
  3.4 Basic Commands
  3.5 Command History
  3.6 Editors

13 CGI Programming with R
  13.1 Web Servers
  13.2 CGI Scripting
  13.3 A First CGI program with R
  13.4 Data
  13.5 Combo Forms
  13.6 Graphs
  13.7 Hidden Variables
  13.8 Outgoing HTTP Headers
  13.9 Creating Pretty Output
  13.10 File Upload
  13.11 Debugging CGI Programs

14 Smoothers
  14.1 Smoothers
  14.2 Kernel Smoothers
  14.3 Locally Weighted Regression Smoothers
  14.4 Spline Smoothers
  14.5 Supersmoother
  14.6 Smoothers with Lattice Plots

15 Linear Regression
  15.1 Linear Regression
  15.2 The lm command
  15.3 Using the model object
  15.4 Regression Diagnostics
  15.5 Collinearity
  15.6 Generalized Additive Models (gam)
  15.7 Recursive Partitioning
  15.8 Comparison of the 3 Methods

16 Analysis of Variance
  16.1 Analysis of Variance
  16.2 Multiple Comparisons
  16.3 Two-Way ANOVA
  16.4 Another Example
  16.5 More Complex Models
  16.6 Constructing Formulas
  16.7 Alternatives for ANOVA

Chapter 1
Introduction

1.1 What's this course about?

The goal of this course is to introduce you to a variety of concepts and methods that are useful when dealing with data.
You can think of data as any information that could potentially be studied to learn more about something. Some simple examples:

1. Sales records for a company
2. Won/Lost records for a sports team
3. Web log listings
4. Email messages
5. Demographic information found on the web

What may be surprising is that some of the data (for example, web log listings or email) consists of text, not numbers. It's increasingly important to be able to deal with text in order to look at the wide variety of information that is available.

In most statistics courses, a lot of time is spent working on the preliminaries (formulas, algorithms, statistical concepts) in order to prepare the student for the interesting part of statistics, which is studying data for a problem that matters to you. Unfortunately, by the time these preliminaries are covered, many students are bored or frustrated, and they leave the course with a distorted view of statistics. In this class, we're going to do things differently. We will concentrate on:

1. Computer languages, which will let us read in and manipulate data.
2. Graphical techniques, which will allow us to display data in a way that makes it easier to understand and easier to make decisions based on the data.
3. Technologies, so that we can present these techniques to someone who's not as knowledgeable as us without too much misery.

The main computer language that we will be using in the course is a statistical programming environment known as R (http://r-project.org). R is freely downloadable and copyable, and you are strongly encouraged to install R on your own computer for use in this course and beyond. We will use this language for data acquisition, data manipulation, and producing graphical output.
R is not the ideal language for all of the tasks we're going to do, but in the interest of efficiency, we'll try to use it for most things, and point you in the direction of other languages that you might want to explore sometime in the future.

Another tool that we'll use is UNIX shell commands, which allow you to store, copy and otherwise manipulate files which contain data, documents or programs. The computer accounts for this course will allow you to access computers running a version of the UNIX operating system.

3. The RSiteSearch() command will open a browser to a searchable database of questions and answers posted on the R-help mailing list. (See http://www.r-project.org/mail.html for more information on the R-help mailing list.)

- When you type the name of an object into R, it will display that object. This can be frustrating when you want to actually run the object. For example, if you type q at the R prompt, you'll see:

> q
function (save = "default", status = 0, runLast = TRUE)
.Internal(quit(save, status, runLast))
<environment: namespace:base>

To actually execute the q command, type

> q()

Chapter 2
The R Language

2.1 Data in R

While R can handle many types of data, the three main varieties that we'll be using are numeric, character and logical. In R, you can identify what type of object you're dealing with by using the mode function. For example:

> name = 'phil'
> number = 495
> happy = TRUE
> mode(name)
[1] "character"
> mode(number)
[1] "numeric"
> mode(happy)
[1] "logical"

Note that when we enter character data, it needs to be surrounded by quotes (either double or single), but the symbols TRUE and FALSE (without quotes) are recognized as values of a logical variable. Another important characteristic of an object in R is its class, because many functions know how to treat objects of different classes in a special way. You can find the class of an object with the class function.
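To see the distinction between mode and class, here is a minimal sketch (the integer example is my own illustration, not from the notes); an object's class can be more specific than its mode:

```r
# mode() gives the basic storage type; class() can be more specific.
num <- 495      # an ordinary (double-precision) number
int <- 495L     # the L suffix creates an integer
mode(num)       # "numeric"
class(num)      # "numeric"
mode(int)       # "numeric" -- same mode as num
class(int)      # "integer" -- but a more specific class
```

Many functions dispatch on the class, not the mode, which is why the distinction matters later when we meet classes like factor and Date.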
2.2 Vectors

Occasionally it may be useful for a variable (like name or happy in the above example) to have only a single value (like 'phil' or TRUE), but usually we'll want to store more than a single value (sometimes referred to as a scalar) in a variable. A vector is a collection of objects, all of the same mode, that can be stored in a single variable and accessed through subscripts. For example, consider the minimum temperature in Berkeley for the first 10 days of January, 2006:

50.7 52.8 48.6 53.0 49.9 47.9 54.1 47.6 43.6 45.5

We could create a variable called mintemp as follows:

> mintemp = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5)

The c function is short for catenate or combine, and it's used to put individual values together into vectors. You can find the number of elements in a vector using the length function. Once you've created a vector, you can refer to the elements of the vector using subscripts. Numerical subscripts in R start at 1, and continue up to the length of the vector. Subscripts of 0 are silently ignored. To refer to multiple elements in a vector, simply use the c function to create a vector of the indexes you're interested in. So to extract the first, third, and fifth values of the mintemp vector, we could use:

> mintemp[c(1,3,5)]
[1] 50.7 48.6 49.9

> taxrate['KS']
 KS
5.3

One of the most powerful tools in R is the ability to use logical expressions to extract or modify elements in the way that numeric subscripts are traditionally used. While there are (of course) many cases where we're interested in accessing information based on the numeric or character subscript of an object, being able to use logical expressions gives us a much wider choice in the way we can study our data. For example, suppose we want to find all of the observations in taxrate with a tax rate less than 6.
First, let's look at the result of just asking whether taxrate is less than 6:

> taxrate < 6
   AL    CA    IL    KS    NY    TN
 TRUE FALSE FALSE  TRUE  TRUE FALSE

The result is a logical vector of the same length as the vector we were asking about. If we use such a vector to extract values from the taxrate vector, it will give us all the ones that correspond to TRUE values, discarding the ones that correspond to FALSE. For example, here are the rates greater than 6:

> taxrate[taxrate > 6]
  CA   IL   TN
7.25 6.25 7.00

Another important use of logical variables is counting the number of elements of a vector that meet a particular condition. When a logical vector is passed to the sum function, each TRUE counts as 1 and each FALSE counts as 0. So we can count the number of TRUEs in a logical expression by passing it to sum:

> sum(taxrate > 6)
[1] 3

This tells us that three observations in the taxrate vector had values greater than 6. As another example, suppose we want to find which of the states we have information about has the highest sales tax. The max function will find the largest value in a vector. (Once again, note that we don't have to worry about the size of the vector or looping over individual elements.)

> max(taxrate)
[1] 7.25

We can find the state which has the highest tax rate as follows:

> taxrate[taxrate == max(taxrate)]
  CA
7.25

Notice that we use two equal signs when testing for equality, and one equal sign when we are assigning an object to a variable. Another useful tool for these kinds of queries is the which function. It converts between logical subscripts and numeric ones.
For example, if we wanted to know the index of the element in the taxrate vector that was the biggest, we could use:

> which(taxrate == max(taxrate))
CA
 2

In fact, this is such a common operation that R provides two functions called which.min and which.max which will return the index of the minimum or maximum element of a vector:

> which.max(taxrate)
CA
 2

While it's certainly not necessary to examine every function that we use in R, it might be interesting to see what which.max is doing beyond our straightforward solution. As always, we can type the name of the function to see what it does:

> which.max
function (x)
.Internal(which.max(x))
<environment: namespace:base>

.Internal means that the function that actually finds the index of the maximum value is compiled inside of R. Generally functions like this will be faster than pure R solutions like the first one we tried. We can use the system.time function to see how much faster which.max will be. Because functions use the equal sign (=) to name their arguments, we'll use the alternative assignment operator, <-, in our call to system.time:

> system.time(one <- which(taxrate == max(taxrate)))
   user  system elapsed
      0       0       0

It's not surprising to see a time of 0 when operating on such a small vector. It doesn't mean that it required no time to do the operation, just that the amount of time it required was smaller than the granularity of the system clock. (The granularity of the clock is simply the smallest interval of time that can be measured by the computer.) To get a good comparison, we'll need to create a larger vector. To do this, we'll use the rnorm function, which generates random numbers from the normal distribution with mean 0 and standard deviation 1.
To get times that we can trust, I'll use a vector with 10 million elements:

> x = rnorm(10000000)
> system.time(one <- which(x == max(x)))
   user  system elapsed
  0.276   0.016   0.292
> system.time(two <- which.max(x))
   user  system elapsed
  0.068   0.000   0.071

While the pure R solution seems pretty fast (0.292 seconds to find the index of the largest element in a vector of 10 million numbers), the compiled (internal) version is actually around 4 times faster! Of course none of this matters if they don't get the same answers:

> one
[1] 8232773
> two
[1] 8232773

The two methods do agree. If you try this example on your own computer, you'll see a different value for the index of the maximum. This is due to the way random numbers are generated in R, and we'll see how to take more control of this later in the semester.

Fortunately, these problems are fairly easy to solve. In the first case, many functions (like mean, min, max, sd, quantile, etc.) accept an na.rm=TRUE argument that tells the function to remove any missing values before performing the computation:

> mean(x,na.rm=TRUE)
[1] 12.375

In the second case, we just need to remember to always use is.na whenever we are testing to see if a value is a missing value.

> is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

By combining a call to is.na with the logical "not" operator (!), we can filter out missing values in cases where no na.rm= argument is available:

> x[!is.na(x)]
[1]  1  4  7 12 19 15 21 20

2.6 Matrices

A very common way of storing data is in a matrix, which is basically a two-way generalization of a vector. Instead of a single index, we can use two indexes, one representing a row and the second representing a column. The matrix function takes a vector and makes it into a matrix in a column-wise fashion.
For example,

> mymat = matrix(1:12,4,3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

The last two arguments to matrix tell it the number of rows and columns the matrix should have. If you use a named argument, you can specify just one dimension, and R will figure out the other:

> mymat = matrix(1:12,ncol=3)
> mymat
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

To create a matrix by rows instead of by columns, the byrow=TRUE argument can be used:

> mymat = matrix(1:12,ncol=3,byrow=TRUE)
> mymat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

When data is being read from a file, you can simply embed a call to scan into a call to matrix. Suppose we have a file called matrix.dat with the following contents:

7 12 19 4
18 7 12 3
9 5 8 42

We could create a 3×4 matrix, read in by rows, with the following command:

matrix(scan('matrix.dat'),nrow=3,byrow=TRUE)

To access a single element of a matrix, we need to specify both the row and the column we're interested in. Consider the following matrix, containing the numbers from 1 to 10:

> m = matrix(1:10,5,2)
> m
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

Now suppose we want the element in row 4 and column 1:

> m[4,1]
[1] 4

If we leave out either one of the subscripts, we'll get the entire row or column of the matrix, depending on which subscript we leave out:

> m[4,]
[1] 4 9
> m[,1]
[1] 1 2 3 4 5

2.7 Data Frames

One shortcoming of vectors and matrices is that they can only hold one mode of data; they don't allow us to mix, say, numbers and character strings. If we try to do so, it will change the mode of the other elements in the vector to conform. For example:

> c(12,9,"dog",7,5)
[1] "12"  "9"   "dog" "7"   "5"

Notice that the numbers got changed to character values so that the vector could accommodate all the elements we passed to the c function. In R, a special object known as a data frame resolves this problem.
A data frame is like a matrix in that it represents a rectangular array of data, but each column in a data frame can be of a different mode, allowing numbers, character strings and logical values to coexist in a single object in their original forms. Since most interesting data problems involve a mixture of character variables and numeric variables, data frames are usually the best way to store information in R. (It should be mentioned that if you're dealing with data of a single mode, a matrix may be more efficient than a data frame.) Data frames correspond to the traditional "observations and variables" model that most statistical software uses, and they are also similar to database tables. Each row of a data frame represents an observation; the elements in a given row represent information about that observation. Each column, taken as a whole, has all the information about a particular variable for the data set.

For small datasets, you can enter each of the columns (variables) of your data frame using the data.frame function. For example, let's extend our temperature example by creating a data frame that has the day of the month, the minimum temperature and the maximum temperature:

> temps = data.frame(day=1:10,
+    min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
+    max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))
> head(temps)
  day  min  max
1   1 50.7 59.5
2   2 52.8 55.7
3   3 48.6 57.3
4   4 53.0 71.5
5   5 49.9 69.8
6   6 47.9 68.8

Note that the names we used when we created the data frame are displayed with the data. (You can add names after the fact with the names function.) Also, instead of typing the name temps to see the data frame, we used a call to the head function instead. This will show just the first six observations (by default) of the data frame, and is very handy for checking that a large data frame really looks the way you think it does. (There's a function called tail that shows the last lines in an object as well.)
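The names and tail functions just mentioned can be sketched with the same temps data:

```r
# Rebuild the temps data frame from the example above
temps = data.frame(day=1:10,
   min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
   max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))
names(temps)     # the column names: "day" "min" "max"
tail(temps, 3)   # the last three observations (days 8, 9 and 10)
```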
2.8 More on Data Frames

1. Notice that if you want to extract more than one column of a data frame, you need to use single brackets, not double:

> temps[c('min','max')]
    min  max
1  50.7 59.5
2  52.8 55.7
3  48.6 57.3
4  53.0 71.5
5  49.9 69.8
6  47.9 68.8
7  54.1 67.5
8  47.6 66.0
9  43.6 66.1
10 45.5 61.7
> temps[[c('min','max')]]
Error in .subset2(x, i, exact = exact) : subscript out of bounds

2. If you want to work with a data frame without having to constantly retype the data frame's name, you can use the with function. Suppose we want to convert our minimum and maximum temperatures to centigrade, and then calculate the difference between them. Using with, we can write:

> with(temps,5/9*(max-32) - 5/9*(min-32))
 [1]  4.888889  1.611111  4.833333 10.277778 11.055556 11.611111  7.444444
 [8] 10.222222 12.500000  9.000000

which may be more convenient than typing out the data frame name repeatedly:

> 5/9*(temps$max-32) - 5/9*(temps$min-32)
 [1]  4.888889  1.611111  4.833333 10.277778 11.055556 11.611111  7.444444
 [8] 10.222222 12.500000  9.000000

3. Finally, if the goal is to add one or more new columns to a data frame, you can combine a few operations into one using the transform function. The first argument to transform is the name of the data frame that will be used to construct the new columns. The remaining arguments to transform are name/value pairs describing the new columns. For example, suppose we wanted to create a new variable in the temps data frame called range, representing the difference between the min and max values for each day. We could use transform as follows:

> temps = transform(temps,range = max - min)
> head(temps)
  day  min  max range
1   1 50.7 59.5   8.8
2   2 52.8 55.7   2.9
3   3 48.6 57.3   8.7
4   4 53.0 71.5  18.5
5   5 49.9 69.8  19.9
6   6 47.9 68.8  20.9

As can be seen, transform returns a new data frame like the original one, but with one or more new columns added.
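For comparison, here is a sketch of the step-by-step alternative that transform combines into one call: assigning to a new column with the $ operator. (This sketch rebuilds the temps data frame so it is self-contained.)

```r
temps = data.frame(day=1:10,
   min = c(50.7,52.8,48.6,53.0,49.9,47.9,54.1,47.6,43.6,45.5),
   max = c(59.5,55.7,57.3,71.5,69.8,68.8,67.5,66.0,66.1,61.7))
# Assigning to a column name that doesn't exist yet creates it,
# giving the same result as transform(temps, range = max - min)
temps$range = temps$max - temps$min
head(temps$range)
```

The transform version is often clearer when several columns are added at once, since the column expressions can refer to each other's source columns without the temps$ prefix.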
2.9 Reading Data Frames from Files and URLs

While creating a data frame the way we just did is very handy for quick examples, it's actually pretty rare to enter a data frame that way; usually we'll be reading data from a file or possibly a URL. In these cases, the read.table function (or one of its closely related variations described below) can be used. read.table tries to be clever about figuring out what type of data you'll be using, and automatically determines how each column of the data frame should be stored.

One problem with this scheme has to do with a special type of variable known as a factor. A factor in R is a variable that is stored as an integer, but displayed as a character string. By default, read.table will automatically turn all the character variables that it reads into factors. You can recognize factors by using either the is.factor function or by examining the object's class, using the class function. Factors are very useful for storing large data sets compactly, as well as for statistical modeling and other tasks, but when you're first working with R they'll most likely just get in the way. To prevent read.table from doing any factor conversions, pass the stringsAsFactors=FALSE argument as shown in the examples below.

By default, R expects there to be at least one space or tab between each of the data values in your input file; if you're using a different character to separate your values, you can specify it with the sep= argument. Two special versions of read.table are provided to handle two common cases: read.csv for files where the data is separated by commas, and read.delim when a tab character is used to separate values. On the other hand, if the variables in your input data occupy the same columns for every line in the file, the read.fwf function can be used to turn your data into a data frame.

If the first line of your input file contains the names of the variables in your data, separated with the same separator used for the rest of the data, you can pass the header=TRUE argument to read.table and its variants, and the variables (columns) of your data frame will be named accordingly. Otherwise, names like V1, V2, etc. will be used.

As an example of how to read data into a data frame, the URL http://www.stat.berkeley.edu/classes/s133/data/world.txt contains information about literacy, gross domestic product, income and military expenditures for about 150 countries. Here are the first few lines of the file:

country,gdp,income,literacy,military
Albania,4500,4937,98.7,56500000
Algeria,5900,6799,69.8,2.48e+09
Angola,1900,2457,66.8,183580000
Argentina,11200,12468,97.2,4.3e+09
Armenia,3900,3806,99.4,1.35e+08

(You can use your favorite browser to examine a file like this, or you can use R's download.file and file.edit functions to download a copy to your computer and examine it locally.) Since the values are separated by commas, and the variable names can be found in the first line of the file, we can read the data into a data frame as follows:

world = read.csv('http://www.stat.berkeley.edu/classes/s133/data/world.txt',header=TRUE,stringsAsFactors=FALSE)

Now that we've created the data frame, we need to look at some ways to understand what our data is like. The class and mode of objects in R is very important, but if we query them for our data frame, they're not very interesting:

> mode(world)
[1] "list"
> class(world)
[1] "data.frame"

Note that a data frame is also a list. We'll look at lists in more detail later. As we've seen, we can use the sapply function to see the modes of the individual columns.
This function will apply a function to each element of a list; for a data frame these elements represent the columns (variables), so it will do exactly what we want:

> sapply(world,mode)
    country         gdp      income    literacy    military
"character"   "numeric"   "numeric"   "numeric"   "numeric"
> sapply(world,class)
    country         gdp      income    literacy    military
"character"   "integer"   "integer"   "numeric"   "numeric"

You might want to experiment with sapply using other functions to get familiar with some strategies for dealing with data frames. You can always view the names of the variables in a data frame by using the names function, and the size (number of observations and number of variables) using the dim function:

> names(world)
[1] "country"  "gdp"      "income"   "literacy" "military"
> dim(world)
[1] 154   5

   country              gdp             income          literacy
 Mode :character   Median : 4900   Median : 5930   Median :88.55
                   Mean   : 9031   Mean   :10319   Mean   :81.05
                   3rd Qu.:11700   3rd Qu.:15066   3rd Qu.:98.42
                   Max.   :55100   Max.   :63609   Max.   :99.90
                                                   NA's   : 1
    military
 Min.   :6.500e+06
 1st Qu.:5.655e+07
 Median :2.436e+08
 Mean   :5.645e+09
 3rd Qu.:1.754e+09
 Max.   :3.707e+11

Another useful way to view the properties of a variable is with the stem function, which produces a text-based stem-and-leaf diagram. Each observation for the variable is represented by a number in the diagram showing that observation's value:

> stem(world$gdp)

  The decimal point is 4 digit(s) to the right of the |

  0 | 11111111111111111111111111112222222222222222222223333333333344444444
  0 | 55555555555666666666677777778889999
  1 | 000111111223334
  1 | 66788889
  2 | 0022234
  2 | 7778888999
  3 | 00013
  3 | 88
  4 |
  4 |
  5 |
  5 | 5

Graphical techniques are often useful when exploring a data frame. While we'll look at graphics in more detail later, the functions boxplot, hist, and plot combined with the density function are often good choices.
Here are examples:

> boxplot(world$gdp,main='Boxplot of GDP')
> hist(world$gdp,main='Histogram of GDP')
> plot(density(world$gdp),main='Density of GDP')

[Figures: "Boxplot of GDP", "Histogram of GDP" and "Density of GDP" for the world data]

The resulting plot looks like this: [figure omitted]

As we'd expect, gdp (Gross Domestic Product) and income seem to have a very consistent relationship. The relation between literacy and income appears to be interesting, so we'll examine it in more detail by making a separate plot for it:

> with(world,plot(literacy,income))

The first variable we pass to plot (literacy in this example) will be used for the x-axis, and the second (income) will be used on the y-axis. The plot looks like this: [figure omitted]

In many cases, the most interesting points on a graph are the ones that don't follow the usual relationships. In this case, there are a few points where the income is a bit higher than we'd expect based on the other countries, considering the rate of literacy. To see which countries they represent, we can use the identify function. You call identify with the same arguments as you passed to plot; then when you click on a point on the graph with the left mouse button, its row number will be printed on the graph. It's usually helpful to have more than just the row number, so identify is usually called with a labels= argument. In this case, the obvious choice is the country name. The way to stop identifying points depends on your operating system: on Windows, right click on the plot and choose "Stop"; on Unix/Linux, click on the plot window with the middle button.
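Since identify requires clicking on an open plot window, a non-interactive sketch of the same idea can use a logical subscript to find the unusual points and text to label them. The cutoff and the which/text calls here are my own illustration, not from the notes; the values are taken from the world.txt excerpt shown earlier:

```r
# A few rows from the world.txt excerpt above
ctry <- c("Armenia", "Algeria", "Angola", "Argentina")
lit  <- c(99.4, 69.8, 66.8, 97.2)       # literacy
inc  <- c(3806, 6799, 2457, 12468)      # income
# Pick out the high-income points; 10000 is an arbitrary cutoff
high <- which(inc > 10000)
ctry[high]
# On an open plot the same points could be labeled with:
# plot(lit, inc); text(lit[high], inc[high], ctry[high], pos=2)
```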
Here's the previous graph after some of the outlier points are identified: [figure omitted]

2.11 Adding Color to Plots

Color is often referred to as the third dimension of a 2-dimensional plot, because it allows us to add extra information to an ordinary scatterplot. Consider the graph of literacy and income. By examining boxplots, we can see that there are differences among the distributions of income (and literacy) for the different continents, and it would be nice to display some of that information on a scatterplot. This is one situation where factors come in very handy. Since factors are stored internally as numbers (starting at 1 and going up to the number of unique levels of the factor), it's very easy to assign different observations different colors based on the value of a factor variable.

To illustrate, let's replot the income vs. literacy graph, but this time we'll convert the continent into a factor and use it to decide on the color of the points that will be used for each country. First, consider the world1 data frame. In that data frame, the continent is stored in the column (variable) called cont. We convert this variable to a factor with the factor function. First, let's look at the mode and class of the variable before we convert it to a factor:

> mode(world1$cont)
[1] "character"
> class(world1$cont)
[1] "character"
> world1$cont = factor(world1$cont)

In many situations, the cont variable will behave the same as it did when it was a simple character variable, but notice that its mode and class have changed.

2.12 Using Dates in R

  Code  Value
  %d    Day of the month (decimal number)
  %m    Month (decimal number)
  %b    Month (abbreviated)
  %B    Month (full name)
  %y    Year (2 digit)
  %Y    Year (4 digit)

Dates on computers have been the source of much anxiety, especially at the turn of the century, when people felt that many computers wouldn't understand the new millennium.
These fears were based on the fact that certain programs would store the value of the year in just 2 digits, causing great confusion when the century “turned over”. In R, dates are stored as they have traditionally been stored on Unix computers – as the number of days from a reference date, in this case January 1, 1970, with earlier days being represented by negative numbers. When dates are stored this way, they can be manipulated like any other numeric variable (as far as it makes sense). In particular, you can compare or sort dates, take the difference between two dates, or add an increment of days, weeks, months or years to a date. The class of such dates is Date and their mode is numeric. Dates are created with as.Date, and formatted for printing with format (which will recognize dates and do the right thing.) Because dates can be written in so many different formats, R uses a standard way of providing flexibility when reading or displaying dates. A set of format codes, some of which are shown in the table below, is used to describe what the input or output form of the date looks like. The default format for as.Date is a four digit year, followed by a month, then a day, separated by either dashes or slashes. So conversions like this happen automatically: > as.Date(’1915-6-16’) [1] "1915-06-16" > as.Date(’1890/2/17’) [1] "1890-02-17" The formatting codes are as follows: (For a complete list of the format codes, see the R help page for the strptime function.) As an example of reading dates, the URL http://www.stat.berkeley.edu/classes/s133/data/movies.txt contains the names, release dates, and box office earnings for around 700 of the most popular movies of all time. 
The first few lines of the input file look like this:

rank|name|box|date
1|Avatar|$759.563|December 18, 2009
2|Titanic|$600.788|December 19, 1997
3|The Dark Knight|$533.184|July 18, 2008

As can be seen, the fields are separated by vertical bars, so we can use read.delim with the appropriate sep= argument.

> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+                     sep='|',stringsAsFactors=FALSE)
> head(movies)
  rank                               name      box              date
1    1                             Avatar $759.563 December 18, 2009
2    2                            Titanic $600.788 December 19, 1997
3    3                    The Dark Knight $533.184     July 18, 2008
4    4 Star Wars: Episode IV - A New Hope $460.998      May 25, 1977
5    5                            Shrek 2 $437.212      May 19, 2004
6    6         E.T. the Extra-Terrestrial $434.975     June 11, 1982

The first step in using a data frame is making sure that we know what we're dealing with; a good way to start is to use the sapply function to look at the mode of each of the variables:

> sapply(movies,mode)
       rank        name         box        date
  "numeric" "character" "character" "character"

Unfortunately, the box office receipts (box) are character, not numeric. That's because R doesn't recognize a dollar sign ($) as being part of a number. (R has the same problem with commas.) We can remove the dollar sign with the sub function, and then use as.numeric to make the result into a number:

> movies$box = as.numeric(sub('\\$','',movies$box))

To convert the character date values to R Date objects, we can use as.Date with the appropriate format: in this case it's the month name followed by the day of the month, a comma and the four digit year. Consulting the table of format codes, this translates to '%B %d, %Y':

> movies$date = as.Date(movies$date,'%B %d, %Y')
> head(movies$date)
[1] "2009-12-18" "1997-12-19" "2008-07-18" "1977-05-25" "2004-05-19"
[6] "1982-06-11"

The format that R now uses to print the dates is the standard Date format, letting us know that we've done the conversion correctly.
(If we wanted to recover the original format, we could use the format function with a format similar to the one we used to read the data.)

Now we can perform calculations using the date. For example, to see the difference in time between the release of Titanic and Avatar (2 very popular movies directed by James Cameron), we could use:

> movies$date[movies$name == 'Avatar'] - movies$date[movies$name == 'Titanic']
Time difference of 4382 days

Even though the result prints out as a character string, it's actually just a number which can be used any way a number could be used. Now suppose we want to see the time difference in years. To convert days to years, we can divide by 365.25. (The .25 tries to account for leap years.):

> diff = movies$date[movies$name == 'Avatar'] - movies$date[movies$name == 'Titanic']
> diff / 365.25
Time difference of 11.99726 days

We could either adjust the units attribute of this value or use as.numeric to convert it to an ordinary number. (In R, an attribute is additional information stored along with a variable.)

> diff = diff / 365.25
> attr(diff,'units') = 'years'
> diff
Time difference of 11.99726 years
> as.numeric(diff)
[1] 11.99726

Either way, it will be treated as an ordinary number when used in a calculation.

The Sys.Date function can be used to return the current date, so R can calculate the time until any date you choose. For example, the midterm for this class is March 2, 2011:

> as.Date('2011-03-02') - Sys.Date()
Time difference of 28 days

Another way to create dates is with the ISOdate function. This function accepts three numbers representing the year, month and day of the date that is desired. So to reproduce the midterm date we could use

> midterm = ISOdate(2011,3,2)
> midterm
[1] "2011-03-02 12:00:00 GMT"

Notice that, along with the date, a time is printed. That's because ISOdate returns an object of class POSIXt, not Date.
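Pulling these pieces together, here is a small self-contained sketch of parsing dates written in two different formats, taking their difference, and formatting the result. The dates are the Avatar and Titanic release dates from above; note that parsing month names with %B assumes an English locale:

```r
d1 = as.Date('December 18, 2009', format='%B %d, %Y')
d2 = as.Date('19/12/1997', format='%d/%m/%Y')

d1 - d2                  # Time difference of 4382 days
format(d1, '%m/%d/%y')   # "12/18/09"
weekdays(d1)             # the day of the week d1 fell on
```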
To make a date like this work properly with objects of class Date, you can use the as.Date function. Once we’ve created an R Date value, we can use the functions months, weekdays or quarters to extract those parts of the date. For example, to see which day of the week these very popular movies were released, we could use the table function combined with weekdays: > table(weekdays(movies$date)) Friday Monday Saturday Sunday Thursday Tuesday Wednesday 753 10 7 11 39 22 164 41 is that it if the first argument is a data frame, it will calculate the statistic for each column of the data frame. If we passed aggregate both the rank and box, we’d get two columns of summaries: > aggregate(movies[,c(’rank’,’box’)],movies[’weekday’],mean) weekday rank box > aggregate(movies[,c(’Rank’,’box’)],movies[’weekday’],mean) weekday rank box 1 Monday 354.5000 148.04620 2 Tuesday 498.9545 110.42391 3 Wednesday 423.2561 139.50965 4 Thursday 493.7692 117.89700 5 Friday 521.7384 112.24642 6 Saturday 577.5714 91.18714 7 Sunday 338.1818 140.45618 To add a column of counts to the table, we can create a data frame from the table function, and merge it with the aggregated results: > dat = aggregate(movies[,c(’rank’,’box’)],movies[’weekday’],mean) > cts = as.data.frame(table(movies$weekday)) > head(cts) Var1 Freq 1 Monday 10 2 Tuesday 22 3 Wednesday 164 4 Thursday 39 5 Friday 753 6 Saturday 7 To make the merge simpler, we rename the first column of cts to weekday. > names(cts)[1] = ’weekday’ > res = merge(cts,dat) > head(res) weekday Freq Rank box 1 Friday 753 521.7384 112.24642 2 Monday 10 354.5000 148.04620 3 Saturday 7 577.5714 91.18714 4 Sunday 11 338.1818 140.45618 5 Thursday 39 493.7692 117.89700 6 Tuesday 22 498.9545 110.42391 Notice that the default behaviour of merge is to sort the columns before merging, so that we’ve lost the order that the levels= argument prescribed. 
The sort=FALSE argument to merge can be used to prevent that:

> res = merge(cts,dat,sort=FALSE)
> head(res)
    weekday Freq     rank       box
1    Monday   10 354.5000 148.04620
2   Tuesday   22 498.9545 110.42391
3 Wednesday  164 423.2561 139.50965
4  Thursday   39 493.7692 117.89700
5    Friday  753 521.7384 112.24642
6  Saturday    7 577.5714  91.18714

2.14 Functions

As you've already noticed, functions play an important role in R. A very attractive feature of R is that you can write your own functions which work exactly the same as the ones that are part of the official R release. In fact, if you create a function with the same name as one that's already part of R, it will override the built-in function, and possibly cause problems. For that reason, it's a good idea to make sure that there's not already another function with the name you want to use. If you type the name you're thinking of, and R responds with a message like 'object "xyz" not found', you're probably safe.

There are several reasons why creating your own functions is a good idea.

1. If you find yourself writing the same code over and over again as you work on different problems, you can write a function that incorporates whatever it is you're doing and call the function, instead of rewriting the code over and over.

2. All of the functions you create are saved in your workspace along with your data. So if you put the bulk of your work into functions that you create, R will automatically save them for you (if you tell R to save your workspace when you quit.)

3. It's very easy to write "wrappers" around existing functions to make a custom version that sets the arguments to another function to be just what you want. R provides a special mechanism to "pass along" any extra arguments the other function might need.

4. You can pass your own functions to built-in R functions like aggregate, by, apply, sapply, lapply, mapply, sweep and other functions to efficiently and easily perform customized tasks.
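As an illustration of the third point, here is a hypothetical wrapper around read.table (the name read.pipes is invented for this example) with defaults suited to pipe-delimited files like movies.txt; the three dots, discussed further below, pass any extra arguments along to read.table:

```r
# a customized version of read.table for pipe-delimited files
read.pipes = function(file, ...)
    read.table(file, sep='|', header=TRUE, stringsAsFactors=FALSE, ...)

# try it on a small piece of text standing in for a file
dat = read.pipes(textConnection('rank|name\n1|Avatar\n2|Titanic'))
dat
#   rank    name
# 1    1  Avatar
# 2    2 Titanic
```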
Before getting down to the details of writing your own functions, it's a good idea to understand how functions in R work. Every function in R has a set of arguments that it accepts. You can see the arguments that built-in functions take in a number of ways: viewing the help page, typing the name of the function in the interpreter, or using the args function. When you call a function, you can simply pass it arguments, in which case they must line up exactly with the way the function is designed, or you can specifically pass particular arguments in whatever order you like by providing them with names using the name=value syntax. You also can combine the two, passing unnamed arguments (which have to match the function's definition exactly), followed by named arguments in whatever order you like. For example, consider the function read.table. We can view its argument list with the command:

> args(read.table)
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
    row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA",
    colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
    fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#", allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = default.stringsAsFactors(), encoding = "unknown")
NULL

This argument list tells us that, if we pass unnamed arguments to read.table, it will interpret the first as file, the next as header, then sep, and so on. Thus if we wanted to read the file my.data, with header set to TRUE and sep set to ',', any of the following calls would be equivalent:

read.table('my.data',TRUE,',')
read.table(sep=',',TRUE,file='my.data')
read.table(file='my.data',sep=',',header=TRUE)
read.table('my.data',sep=',',header=TRUE)

Notice that all of the arguments in the argument list for read.table have values after the name of the argument, except for the file argument.
This means that file is the only required argument to read.table; any of the other arguments are optional, and if we don't specify them the default values that appear in the argument list will be used. Most R functions are written so that the first few arguments will be the ones that will usually be used, so that their values can be entered without providing names, with the other arguments being optional. Optional arguments can be passed to a function by position, but are much more commonly passed using the name=value syntax, as in the last example of calling read.table.

Now let's take a look at the function read.csv. You may recall that this function simply calls read.table with a set of parameters that makes sense for reading comma separated files. Here's read.csv's function definition, produced by simply typing the function's name at the R prompt:

function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
    fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
    dec = dec, fill = fill, comment.char = comment.char, ...)
<environment: namespace:utils>

Pay special attention to the three periods (...) in the argument list. Notice that they also appear in the call to read.table inside the function's body. The three dots mean all the arguments passed to the function that aren't explicitly matched get passed along, in this case to read.table.

minmaxratio = edit(minmaxratio)

You may also want to consider the fix function, which automates the process slightly. To start from scratch, you can use a call to edit like this:

newfunction = edit(function(){})

Suppose we want to write a function that will allow us to calculate the mean of all the appropriate columns of a data frame, broken down by a grouping variable, and including the counts for the grouping variables in the output. When you're working on developing a function, it's usually easier to solve the problem with a sample data set, and then generalize it to a function.
We'll use the movies data frame as an example, with both weekday and month as potential grouping variables. First, let's go over the steps to create the movies data frame with both grouping variables:

> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+                     sep='|',stringsAsFactors=FALSE)
> movies$box = as.numeric(sub('\\$','',movies$box))
> movies$date = as.Date(movies$date,'%B %d, %Y')
> movies$weekday = weekdays(movies$date)
> movies$weekday = factor(weekdays(movies$date),
+    levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))
> movies$month = months(movies$date)
> movies$month = factor(months(movies$date),levels=c('January','February','March',
+    'April','May','June','July','August','September','October','November','December'))

Since I've done a fair amount of processing to this data set, and since I'm going to want to use it later for testing my function, I'm going to use the save function to write a copy of the data frame to a file. This function writes out R objects in R's internal format, just like the workspace is saved at the end of an R session. You can also transfer a file produced by save to a different computer, because R uses the same format for its saved objects on all operating systems. Since save accepts a variable number of arguments, we need to specify the file= argument when we call it:

> save(movies,file='movies.rda')

You can use whatever extension you want, but .rda or .Rdata are common choices.

It's often useful to break down the steps of a problem like this, and solve each one before going on to the next. Here are the steps we'll need to go through to create our function.

1. Find the appropriate columns of the data frame for the aggregate function.
2. Write the call to the aggregate function that will give us the mean for each group.
3. Write the call to the function to get the counts and convert it to a data frame.
4.
Merge together the results from aggregate and table to give us our result.

To find the appropriate variables, we can examine the class and mode of each column of our data frame:

> sapply(movies,class)
       rank        name         box        date     weekday       month
  "integer" "character"   "numeric"      "Date"    "factor"    "factor"
> sapply(movies,mode)
       rank        name         box        date     weekday       month
  "numeric" "character"   "numeric"   "numeric"   "numeric"   "numeric"

For this data frame, the appropriate variables for aggregation would be rank and box, so we have to come up with some logic that would select only those columns. One easy way is to select those columns whose class is either numeric or integer. We can use the | operator which represents a logical "or" to create a logical vector that will let us select the columns we want. (There's also the & operator which is used to test for a logical "and".)

> classes = sapply(movies,class)
> numcols = classes == 'integer' | classes == 'numeric'

While this will certainly work, R provides an operator that makes expressions like this easier to write. The %in% operator allows us to test for equality to more than one value at a time, without having to do multiple tests. In this example we can use it as follows:

> numcols = sapply(movies,class) %in% c('integer','numeric')

Now we need to write a call to the aggregate function that will find the means for each variable based on a grouping variable.
To develop the appropriate call, we'll use weekday as a grouping variable:

> result = aggregate(movies[,numcols],movies['weekday'],mean)
> result
    weekday     rank       box
1    Monday 354.5000 148.04620
2   Tuesday 498.9545 110.42391
3 Wednesday 427.1863 139.38540
4  Thursday 493.7692 117.89700
5    Friday 520.2413 112.44878
6  Saturday 577.5714  91.18714
7    Sunday 338.1818 140.45618

Similarly, we need to create a data frame of counts that can be merged with the result of aggregate:

> counts = as.data.frame(table(movies['weekday']))
> counts
       Var1 Freq
1    Monday   10
2   Tuesday   22
3 Wednesday  161
4  Thursday   39
5    Friday  750
6  Saturday    7
7    Sunday   11

Unfortunately, this doesn't name the first column appropriately for the merge function. The best way to solve this problem is to change the name of the first column of the counts data frame to the name of the grouping variable. Recall that using the sort=FALSE argument to merge will retain the order of the grouping variable that we specified with the levels= argument to factor:

> names(counts)[1] = 'weekday'
> merge(counts,result,sort=FALSE)
    weekday Freq     rank       box
1    Monday   10 354.5000 148.04620
2   Tuesday   22 498.9545 110.42391
3 Wednesday  161 427.1863 139.38540
4  Thursday   39 493.7692 117.89700
5    Friday  750 520.2413 112.44878
6  Saturday    7 577.5714  91.18714
7    Sunday   11 338.1818 140.45618

This gives us exactly the result we want, with the columns labeled appropriately.
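The same recipe works on any data frame. Here it is on a tiny invented one, where each step's output is easy to check by hand:

```r
scores = data.frame(group = c('a','a','b','b','b'),
                    value = c(1, 3, 2, 4, 6))

means  = aggregate(scores['value'], scores['group'], mean)   # a: 2, b: 4
counts = as.data.frame(table(scores$group))                  # a: 2, b: 3
names(counts)[1] = 'group'
merge(counts, means)
```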
To convert this to a function, let's put together all the steps we just performed:

> load('movies.rda')
> numcols = sapply(movies,class) %in% c('integer','numeric')
> result = aggregate(movies[,numcols],movies['weekday'],mean)
> counts = as.data.frame(table(movies['weekday']))
> names(counts)[1] = 'weekday'
> merge(counts,result,sort=FALSE)
    weekday Freq     rank       box
1    Monday   10 354.5000 148.04620
2   Tuesday   22 498.9545 110.42391
3 Wednesday  161 427.1863 139.38540
4  Thursday   39 493.7692 117.89700
5    Friday  750 520.2413 112.44878
6  Saturday    7 577.5714  91.18714
7    Sunday   11 338.1818 140.45618

To convert these steps into a function that we could use with any data frame, we need to identify the parts of these statements that would change with different data. In this case,

> aggall(world1,'cont')
  cont Freq       gdp    income literacy    military
1   AF   47  2723.404  3901.191 60.52979   356440000
2   AS   41  7778.049  8868.098 84.25122  5006536341
3   EU   34 19711.765 21314.324 98.40294  6311138235
4   NA   15  8946.667 10379.143 85.52000 25919931267
5   OC    4 14625.000 15547.500 87.50000  4462475000
6   SA   12  6283.333  6673.083 92.29167  2137341667

2.16 Sizes of Objects

Before we start looking at character manipulation, this is a good time to review the different functions that give us the size of an object.

1. length - returns the length of a vector, or the total number of elements in a matrix (number of rows times number of columns). For a data frame, returns the number of columns.

2. dim - for matrices and data frames, returns a vector of length 2 containing the number of rows and the number of columns. For a vector, returns NULL. The convenience functions nrow and ncol return the individual values that would be returned by dim.

3. nchar - for a character string, returns the number of characters in the string. Returns a vector of values when applied to a vector of character strings. For a numeric value, nchar returns the number of characters in the printed representation of the number.
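A few quick checks of these functions on a vector, a matrix, and some character strings:

```r
v = c(10, 20, 30)
m = matrix(1:6, nrow=2)

length(v)                  # 3
length(m)                  # 6: rows times columns
dim(m)                     # 2 3
nrow(m); ncol(m)           # 2 and 3
nchar(c('cat','horse'))    # 3 5
nchar(12345)               # 5: characters in the printed number
```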
2.17 Character Manipulation

While it's quite natural to think of data as being numbers, manipulating character strings is also an important skill when working with data. We've already seen a few simple examples, such as choosing the right format for a character variable that represents a date, or using table to tabulate the occurrences of different character values for a variable. Now we're going to look at some functions in R that let us break apart, rearrange and put together character data. One of the most important uses of character manipulation is "massaging" data into shape. Many times the data that is available to us, for example on a web page or as output from another program, isn't in a form that a program like R can easily interpret. In cases like that, we'll need to remove the parts that R can't understand, and organize the remaining parts so that R can read them efficiently.

Let's take a look at some of the functions that R offers for working with character variables:

• paste

The paste function converts its arguments to character before operating on them, so you can pass both numbers and strings to the function. It concatenates the arguments passed to it, to create new strings that are combinations of other strings. paste accepts an unlimited number of unnamed arguments, which will be pasted together, and one or both of the arguments sep= and collapse=. Depending on whether the arguments are scalars or vectors, and which of sep= and collapse= are used, a variety of different tasks can be performed.

1. If you pass a single argument to paste, it will return a character representation:

> paste('cat')
[1] "cat"
> paste(14)
[1] "14"

2. If you pass more than one scalar argument to paste, it will put them together in a single string, using the sep= argument to separate the pieces:

> paste('stat',133,'assignment')
[1] "stat 133 assignment"

3.
If you pass a vector of character values to paste, and the collapse= argument is not NULL, it pastes together the elements of the vector, using the collapse= argument as a separator:

> paste(c('stat',133,'assignment'),collapse=' ')
[1] "stat 133 assignment"

4. If you pass more than one argument to paste, and any of those arguments is a vector, paste will return a vector as long as its longest argument, produced by pasting together corresponding pieces of the arguments. (Remember the recycling rule which will be used if the vector arguments are of different lengths.) Here are a few examples:

> paste('x',1:10,sep='')
 [1] "x1"  "x2"  "x3"  "x4"  "x5"  "x6"  "x7"  "x8"  "x9"  "x10"
> paste(c('x','y'),1:10,sep='')
 [1] "x1"  "y2"  "x3"  "y4"  "x5"  "y6"  "x7"  "y8"  "x9"  "y10"

• grep

The grep function searches for patterns in text. The first argument to grep is a text string or regular expression that you're looking for, and the second argument is usually a vector of character values. grep returns the indices of those elements of the vector of character strings that contain the text string. Right now we'll limit ourselves to simple patterns, but later we'll explore the full strength of commands like this with regular expressions. grep can be used in a number of ways. Suppose we want to see the countries of the world that have the word 'United' in their names.

> grep('United',world1$country)
[1] 144 145

grep returns the indices of the observations that have 'United' in their names.
If we wanted to see the values of country that had 'United' in their names, we can use the value=TRUE argument:

> grep('United',world1$country,value=TRUE)
[1] "United Arab Emirates" "United Kingdom"

> nums = c('12553','73911','842099','203','10')
> substring(nums,3,5) = '99'
> nums
[1] "12993"  "73991"  "849999" "209"    "10"

• tolower, toupper

These functions convert their arguments to all lower-case characters or all upper-case characters, respectively.

• sub, gsub

These functions change a regular expression or text pattern to a different set of characters. They differ in that sub only changes the first occurrence of the specified pattern, while gsub changes all of the occurrences. Since numeric values in R cannot contain dollar signs or commas, one important use of gsub is to create numeric variables from text variables that represent numbers but contain commas or dollars. For example, in gathering the data for the world dataset that we've been using, I extracted the information about military spending from http://en.wikipedia.org/wiki/List_of_countries_by_military_expenditures. Here's an excerpt of some of the values from that page:

> values = c('370,700,000,000','205,326,700,000','67,490,000,000')
> as.numeric(values)
[1] NA NA NA
Warning message:
NAs introduced by coercion

The presence of the commas is preventing R from being able to convert the values into actual numbers. gsub easily solves the problem:

> as.numeric(gsub(',','',values))
[1] 370700000000 205326700000  67490000000

2.18 Working with Characters

As you probably noticed when looking at the above functions, they are very simple, and, quite frankly, it's hard to see how they could really do anything complex on their own. In fact, that's just the point of these functions: they can be combined together to do just about anything you would want to do. As an example, consider the task of capitalizing the first character of each word in a string.
The toupper function can change the case of all the characters in a string, but we'll need to do something to separate out the characters so we can get the first one. If we call strsplit with an empty string for the splitting character, we'll get back a vector of the individual characters:

> str = 'sherlock holmes'
> letters = strsplit(str,'')
> letters
[[1]]
 [1] "s" "h" "e" "r" "l" "o" "c" "k" " " "h" "o" "l" "m" "e" "s"

> theletters = letters[[1]]

Notice that strsplit always returns a list. This will be very useful later, but for now we'll extract the first element before we try to work with its output. The places that we'll need to capitalize things are the first position in the vector of letters, and any letter that comes after a blank. We can find those positions very easily:

> wh = c(1,which(theletters == ' ') + 1)
> wh
[1]  1 10

We can change the case of the letters whose indexes are in wh, then use paste to put the string back together.

> theletters[wh] = toupper(theletters[wh])
> paste(theletters,collapse='')
[1] "Sherlock Holmes"

Things have gotten complicated enough that we could probably stand to write a function:

maketitle = function(txt){
   theletters = strsplit(txt,'')[[1]]
   wh = c(1,which(theletters == ' ') + 1)
   theletters[wh] = toupper(theletters[wh])
   paste(theletters,collapse='')
}

Of course, we should always test our functions:

> maketitle('some crazy title')
[1] "Some Crazy Title"

Now suppose we have a vector of strings:

> titls = c('sherlock holmes','avatar','book of eli','up in the air')

We can always hope that we'll get the right answer if we just use our function:

> maketitle(titls)
[1] "Sherlock Holmes"

Unfortunately, it didn't work in this case. When something like that happens, we can use sapply to apply the function to each of the elements in the vector:

> sapply(titls,maketitle)
  sherlock holmes            avatar       book of eli     up in the air
"Sherlock Holmes"          "Avatar"     "Book Of Eli"   "Up In The Air"

Of course, this isn't the only way to solve the problem.
Rather than break up the string into individual letters, we can break it up into words, and capitalize the first letter of each, then combine them back together. Let's explore that approach:

> str = 'sherlock holmes'
> words = strsplit(str,' ')
> words
[[1]]
[1] "sherlock" "holmes"

Now we can use the assignment form of the substring function to change the first letter of each word to a capital. Note that we have to make sure to actually return the modified string from our call to sapply, so we ensure that the last statement in our function returns the string:

> sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
  sherlock     holmes
"Sherlock"   "Holmes"

Now we can paste the pieces back together to get our answer:

> res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
> paste(res,collapse=' ')
[1] "Sherlock Holmes"

To operate on a vector of strings, we'll need to incorporate these steps into a function, and then call sapply:

mktitl = function(str){
   words = strsplit(str,' ')
   res = sapply(words[[1]],function(w){substring(w,1,1) = toupper(substring(w,1,1));w})
   paste(res,collapse=' ')
}

3.1 Software for Remote Access

To learn about the software you'll need to access the SCF UNIX machines remotely see Accessing the SCF remotely. To see a list of all the SCF computers go to the Computer Grid page.

3.2 Basics of Unix

On a UNIX system, most commands are entered into a shell. There is a large amount (sometimes too much) of online documentation for all UNIX commands, through the man command. For example, to find out more about the ls command, which lists the files in a directory, type man ls at the UNIX prompt. Another way to use the man command is to use keywords; type either man -k keyword or apropos keyword at the UNIX prompt to find a list of commands that involve "keyword".
One attractive feature of UNIX shells is tab completion; if you type only the first few letters of a command or file name, and then hit the tab key, the shell will complete the name you started to type provided that there is only one match. If there's more than one match, hitting tab twice will list all the names that match.

A properly configured UNIX file system has permission controls to prevent unauthorized access to files, and to make sure that users do not accidentally remove or modify key files. If you want to adjust the permissions on the files you own, take a look at the man page for the chmod command.

I'm not going to make a big distinction between UNIX, Linux and the UNIX core of Mac OSX, so the commands we're going to look at here should work under any of these operating systems.

3.3 Command Path

When you type a command into the shell, it will search through a number of directories looking for a command that matches the name you typed. The directories it looks through are called the search path. You can display your search path by typing
You can display your search path by typing echo $PATH 64 Command Description Examples ls Lists files in a given directory ls /some/directory ls # with no args, lists current dir cd Change Working Directory cd /some/directory cd #with no args, cd to home dir pwd Print Working Directory pwd mkdir Create New Directory mkdir subdirectory less Display file one screen at a time less filename cp Copy files cp file1 newfile1 cp file1 file2 file3 somedirectory mv Move or rename a file mv oldfile newfile mv file1 file2 file3 somedirectory rm Remove a file rm file1 file2 rm -r dir #removes all directories and subdirectories rmdir Remove a (empty) directory rmdir mydir history Display previously typed commands history grep Find strings in files grep Error file.out head Show the first few lines of a file head myfile head -20 myfile tail Show the last few lines of a file tail myfile tail -20 myfile file Identify the type of a file file myfile To see the complete list of commands that are available on a given computer, you could look at all the files (commands) in all of the directories on your search path. (There are well over 2000 commands on most UNIX systems.) 3.4 Basic Commands The table below shows some of the most commonly used UNIX commands: Each of these commands has many options which you can learn about by viewing their man page. For example, to get more information about the ls command, type man ls 3.5 Command History Another useful feature of most UNIX shells is the ability to retrieve and review the commands you’ve previously typed into that instance of the shell. The arrow keys can be used to scroll up or down through previous commands. 
Once a command appears on the command line (where you would normally type a command), you can use the arrow and/or backspace keys to edit the command. A number of control key combinations can also be used to navigate your command history and are displayed in the table below; these same control sequences will work in the Emacs editor.

   Command    Meaning                 Command    Meaning
   control-p  Previous line           control-n  Next line
   control-f  One character forward   control-b  One character backward
   control-a  Beginning of line       control-e  End of line
   control-d  Delete one character    control-k  Delete to end of line

3.6 Editors

Most programming tasks involve a stored copy of a program in a file somewhere, along with a file which could contain necessary data. On a UNIX system, it becomes very important to be able to view and modify files, and the program that does that is known as an editor. The kind of editor you use to work with files on a UNIX system is very different than a word processor, or other document handling program, as it's not concerned with formatting, fonts, or other issues of appearance: it basically just holds the bit patterns that represent the commands in your program or the information in your data. There are many editors available on UNIX systems, but the two most popular are emacs and vi. Some of the other editors you might encounter on a UNIX system are pico, nano, vim, xemacs, gedit, and kate.

3.7 Wildcards

Along with file completion, UNIX shells offer another way to save typing when entering filenames. Certain characters (known as wildcards) have special meaning when included as a filename, and the shell will expand these characters to represent multiple filenames:

   Wildcard                       Meaning
   *                              Zero or more of any character
   ?                              Any single character
   [...]                          Any of the characters between the brackets
   [^...]                         Any characters except those between the brackets
   [x-y]                          Any character in the range x to y (e.g. [0-9], [a-z])
   {string-1,string-2,string-3}   Each of the strings in turn
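Here's a short session trying some of the wildcards on a few files in a scratch directory (the file names are invented for the example):

```shell
# create a scratch directory with a few files to match against
mkdir wildtest
cd wildtest
touch data1.txt data2.txt data10.txt notes.md

echo data?.txt       # data1.txt data2.txt  (? matches one character)
echo data*.txt       # all three data files
echo *.txt           # every file ending in .txt

# clean up
cd ..
rm -r wildtest
```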
The most commonly used wildcard is the asterisk (*), which will match anything in a file name; the other possibilities are shown in the table above. To see what will be matched for any string containing wildcards, use the UNIX echo command. In addition, the shell will recognize the tilde (~) as representing your home directory, and ~user as user's home directory.

kill pid

This is similar to control-C. To get the same effect as control-\, type

kill -9 pid

where pid is the number from the ps output.

Chapter 4

Regular Expressions

4.1 Regular Expressions

Regular expressions are a method of describing patterns in text that's far more flexible than using ordinary character strings. While an ordinary text string is only matched by an exact copy of the string, regular expressions give us the ability to describe what we want in more general terms. For example, while we couldn't search for email addresses in a text file using normal searches (unless we knew every possible email address), we can describe the general form of an email address (some characters followed by an "@" sign, followed by some more characters, a period, and a few more characters) through regular expressions, and then find all the email addresses in the document very easily. Another handy feature of regular expressions is that we can "tag" parts of an expression for extraction. If you look at the HTML source of a web page (for example, by using View->Source in a browser, or using download.file in R to make a local copy), you'll notice that all the clickable links are represented by HTML like:

<a href="http://someurl.com/somewhere">

It would be easy to search for the string href= to find the links, but what if some webmasters used something like

<a href = 'http://someurl.com/somewhere'>

Now a search for href= won't help us, but it's easy to express those sorts of choices using regular expressions.
There are a lot of different versions of regular expressions in the world of computers, and while they share the same basic concepts and much of the same syntax, there are irritating differences among the different versions. If you're looking for additional information about regular expressions in books or on the web, you should know that, in addition to basic regular expressions, recent versions of R also support perl-style regular expressions. (perl is a scripting language whose creator, Larry Wall, developed some attractive extensions to the basic regular expression syntax.) Some of the rules of regular expressions are laid out in very terse language on the R help page for regex and regexpr. Since regular expressions are a somewhat challenging topic, there are many valuable resources on the internet.

Before we start, one word of caution. We'll see that the way that regular expressions work is that they take many of the common punctuation symbols and give them special meanings. Because of this, when you want to refer to one of these symbols literally (that is, as simply a character like other characters), you need to precede those symbols with a backslash (\). But backslashes already have a special meaning in character strings; they are used to indicate control characters, like tab (\t) and newline (\n). The upshot of this is that when you want to type a backslash to keep R from misinterpreting certain symbols, you need to precede it with two backslashes in the input. By the way, the characters for which this needs to be done are:

. ^ $ + ? * ( ) [ ] { } | \

Since regular expressions in R are simply character strings, we can save typing by storing regular expressions in variables. For example, if we say:

> emailpat = '[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+'

then we can use the R variable emailpat in place of the full regular expression. (If you use this technique, be sure to modify your stored variable when you change your regular expression.)
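A small sketch of the escaping rule in action (the strings here are invented for the example):

```r
strs = c('abc', 'a.c', 'axc')
# In a regular expression an unescaped period matches any character,
# so the pattern 'a.c' matches all three strings:
grep('a.c', strs)       # 1 2 3
# To match a literal period the pattern needs \. -- which must be
# typed with two backslashes inside an R character string:
grep('a\\.c', strs)     # 2
```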
To actually get to the regular expressions, we can use the gregexpr function, which provides more information about regular expression matches. First, let's see what the output from gregexpr looks like:

> gregout = gregexpr(emailpat,chk)
> gregout
[[1]]
[1] 5
attr(,"match.length")
[1] 24

[[2]]
[1] -1
attr(,"match.length")
[1] -1

[[3]]
[1]  7 27
attr(,"match.length")
[1] 14 17

First, notice that, since there may be a different number of regular expressions found in different lines, gregexpr returns a list. In each list element is a vector of starting points where regular expressions were found in the corresponding input string. In addition, there is additional information stored as an attribute, which is part of the value, but which doesn't interfere if we try to treat the value as if it was simply a vector. The match.length attribute is another vector, of the same length as the vector of starting points, telling us how long each match was. Concentrating on the first element, we can use the substring function to extract the actual address as follows:

> substring(chk[1],gregout[[1]],gregout[[1]] + attr(gregout[[1]],'match.length') - 1)
[1] "noboby@stat.berkeley.edu"

To make it a little easier to use, let's make a function that will do the extraction for us:

getexpr = function(s,g)substring(s,g,g + attr(g,'match.length') - 1)

Now it's a little easier to get what we're looking for:

> getexpr(chk[2],gregout[[2]])
[1] ""
> getexpr(chk[3],gregout[[3]])
[1] "me@mything.com"    "you@yourspace.com"

To use the same idea on an entire vector of character strings, we could either write a loop, or use the mapply function. The mapply function will repeatedly call a function of your choice, cycling through the elements in as many vectors as you provide to the function.
To use our getexpr function with mapply to extract all of the email addresses in the chk vector, we could write:

> emails = mapply(getexpr,chk,gregout)
> emails
$"abc noboby@stat.berkeley.edu"
[1] "noboby@stat.berkeley.edu"

$"text with no email"
[1] ""

$"first me@mything.com also you@yourspace.com"
[1] "me@mything.com"    "you@yourspace.com"

Notice that mapply uses the text of the original character strings as names for the list it returns; this may or may not be useful. To remove the names, use the assignment form of the names function to set the names to NULL:

> names(emails) = NULL
> emails
[[1]]
[1] "noboby@stat.berkeley.edu"

[[2]]
[1] ""

[[3]]
[1] "me@mything.com"    "you@yourspace.com"

The value that mapply returns is a list, the same length as the vector of input strings (chk in this example), with an empty string where there were no matches. If all you wanted was a vector of all the email addresses, you could use the unlist function:

> unlist(emails)
[1] "noboby@stat.berkeley.edu" ""
[3] "me@mything.com"           "you@yourspace.com"

The empty strings can be removed in the usual way:

emails = emails[emails != '']

or

emails = subset(emails,emails != '')

Suppose we wanted to know how many emails there were in each line of the input text (chk). One idea that might make sense is to find the length of each element of the list that the getexpr function returned:

> emails = mapply(getexpr,chk,gregout)
> names(emails) = NULL
> sapply(emails,length)
[1] 1 1 2

The problem is that, in order to maintain the structure of the output list, mapply put an empty (zero-character) string in the second position of the list, so that length sees at least one string in each element of the list. The solution is to write a function that modifies the length function so that it only returns the length if there are some characters in the strings for a particular list element. (We can safely do this since we've already seen that there will always be at least one element in the list.)
We can use the if statement to do this:

> sapply(emails,function(e)if(nchar(e[1]) > 0)length(e) else 0)
[1] 1 0 2

4.3 How matches are found

Regular expressions are matched by starting at the beginning of a string and seeing if a possible match might begin there. If not, the next character in the string is examined, and so on; if the end of the string is reached, then no match is reported. Let's consider the case where there is a potential match. The regular expression program remembers where the beginning of the match was and starts checking the characters to the right of that location. As long as the expression continues to be matched, it will continue, adding more characters to the matched pattern until it reaches a point in the string where the regular expression is no longer matched. At that point, it backs up until things match again, and it checks to see if the entire regular expression has been matched. If it has, it reports a match; otherwise it reports no match. While the specifics of this mechanism will rarely concern you when you're doing regular expression matches, there is one important point that you should be aware of: the regular expression program is always going to try to find the longest match possible. This means if you use the wildcard character, ., with the "zero or more" modifier, *, you may get more than you expected.

Suppose we wish to remove HTML markup from a web page, in order to extract the information that's on the page. All HTML markup begins with a left angle bracket (<) and ends with a right angle bracket (>), and has markup information in between.

4.4 Tagging and Backreferences

Consider again the problem of looking for email addresses. The regular expression that we wrote is exactly what we want, because we don't care what's surrounding the email address. But in many cases, the only way we can find what we want is to specify the surroundings of what we're looking for.
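Before the full link-extraction example, here is a minimal sketch of how tagging works with sub (the string is made up for the example):

```r
# Parentheses tag the part of the pattern we want to keep; in the
# replacement, \\1 refers back to whatever the tagged part matched.
s = 'order number 12345 confirmed'
sub('^[^0-9]*([0-9]+).*$', '\\1', s)    # "12345"
```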
Suppose we wish to write a program that will find all of the links (URLs that can be reached by clicking some text on the page) of a web page. A line containing a link may look something like this:

<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />

Finding the links is very easy; but our goal here is to extract the links themselves. Notice that there's no regular expression that can match just the link; we need to use some information about the context in which it's found, and when we extract the matched expression there will be extra characters that we really don't want. To handle this problem, parentheses can be used to surround parts of a regular expression that we're really interested in, and tools exist to help us get those parts separated from the overall expression. In R, the only functions that can deal with these tagged expressions are sub and gsub, so to take advantage of them, you may have to first extract the regular expressions with the methods we've already seen, and then apply sub or gsub. To illustrate, let's compose a simple regular expression to find links. I don't need to worry about the case of the regular expression, because the grep, sub, gsub and gregexpr functions all support the ignore.case= argument. Notice that I've surrounded the part we want ([^"'>]+) in parentheses. This will allow me to refer to this tagged expression as \1 in a call to gsub. (Additional tagged expressions will be referred to as \2, \3, etc.) Using this pattern, we can first find all the chunks of text that have our links embedded in them, and then use gsub to change the entire piece to just the part we want:

> link = '<a href="http://www.stat.berkeley.edu">UC Berkeley Stat Dept Home Page</a><br />'
> gregout = gregexpr('href *= *["\']?([^"\'>]+)["\']? *>',link,ignore.case=TRUE)
> thematch = mapply(getexpr,link,gregout)
> answer = gsub('href *= *["\']?([^"\'>]+)["\']? *>','\\1',thematch,ignore.case=TRUE)
> names(answer) = NULL
> answer
[1] "http://www.stat.berkeley.edu"

4.5 Getting Text into R

Up until now, we've been working with text that's already been formatted to make it easy for R to read, and scan or read.table (and its associated wrapper functions) have been sufficient to take care of things. Now we want to treat our input as raw text; in other words, we don't want R to assume that the data is in any particular form. The main function for taking care of this in R is readLines. In the simplest form, you pass readLines the name of a URL or file that you want to read, and it returns a character vector with one element for each line of the file or url, containing the contents of each line. An optional argument to readLines specifies the number of lines you want to read; by default it's set to -1, which means to read all available lines. readLines removes the newline at the end of each line, but otherwise returns the text exactly the way it was found in the file or URL. readLines also accepts connections, which are objects in R that represent an alternative form of input, such as a pipe or a zipped file. Take a look at the help file for connections for more information on this capability. For moderate-sized problems that aren't too complex, using readLines in its default mode (reading all the lines of input into a vector of character strings) will usually be the best way to solve your problem, because most of the functions you'll use with such data are vectorized, and can operate on every line at once. As a simple example, suppose we wanted to get the names of all the files containing notes for this class.
A glance at the page http://www.stat.berkeley.edu/classes/s133/schedule.html indicates that all the online notes can be found on lines like this one:

<tr><td> Jan 23 </td><td> <a href="Unix.html">Introduction to UNIX</a></td></tr>

We can easily extract the names of the note files using the sub function (since there is only one link per line, we don't need to use gsub, although we could). The first step is to create a vector of character strings that will represent the lines of the URL we are trying to read. We can simply pass the URL name to readLines:

> x = readLines('http://www.stat.berkeley.edu/classes/s133/schedule.html')

Next, we can write a regular expression that can find the links. Note that the pattern that we want (i.e. the name of the file referenced in the link) has been tagged with parentheses for later extraction. By using the caret (^) and dollar sign ($) we can describe our pattern as an entire line – when we substitute the tagged expression for the pattern we'll have just what we want. I'll also add the base URL to the names of the files so that they could be, for example, entered into a browser.

> baseurl = 'http://www.stat.berkeley.edu/classes/s133'
> linkpat = '^.*<td> *<a href=["\'](.*)["\']>.*$'
> x = readLines('http://www.stat.berkeley.edu/classes/s133/schedule.html')
> y = grep(linkpat,x,value=TRUE)
> paste(baseurl,sub(linkpat,'\\1',y),sep='/')
[1] "http://www.stat.berkeley.edu/classes/s133/Intro.html"
[2] "http://www.stat.berkeley.edu/classes/s133/OS.html"
[3] "http://www.stat.berkeley.edu/classes/s133/Unix.html"
[4] "http://www.stat.berkeley.edu/classes/s133/R-1a.html"
[5] "http://www.stat.berkeley.edu/classes/s133/R-2a.html"
[6] "http://www.stat.berkeley.edu/classes/s133/R-3a.html"
. . .

4.6 Examples of Reading Web Pages with R

As an example of how to extract information from a web page, consider the task of extracting the spring baseball schedule for the Cal Bears from http://calbears.cstv.com/sports/m-basebl/sched/cal-m-basebl-sched.html .
4.7 Reading a web page into R

Read the contents of the page into a vector of character strings with the readLines function:

> thepage = readLines('http://calbears.cstv.com/sports/m-basebl/sched/cal-m-basebl-sched.html')
Warning message:
In readLines("http://calbears.cstv.com/sports/m-basebl/sched/cal-m-basebl-sched.html") :
  incomplete final line found on 'http://calbears.cstv.com/sports/m-basebl/sched/cal-m-basebl-sched.html'

The warning message simply means that the last line of the web page didn't contain a newline character. This is actually a good thing, since it usually indicates that the page was generated by a program, which generally makes it easier to extract information from it.

Note: When you're reading a web page, make a local copy for testing; as a courtesy to the owner of the web site whose pages you're using, don't overload their server by constantly rereading the page. To make a copy from inside of R, look at the download.file function. You could also save a copy of the result of using readLines, and practice on that until you've got everything working correctly.

Now we have to focus in on what we're trying to extract. The first step is finding where it is. If you look at the web page, you'll see that the title "Opponent / Event" is right above the data we want. We can locate this line using the grep function:

> grep('Opponent / Event',thepage)
[1] 513

If we look at the lines following this marker, we'll notice that the first date on the schedule can be found in line 536, with the other information following after:

> thepage[536:545]
 [1] "    <td class=\"row-text\">02/20/11</td>"
 [2] "    "
 [3] "    <td class=\"row-text\">vs.
Utah</td>"
 [4] "    "
 [5] "    <td class=\"row-text\">Berkeley, Calif.</td>"
 [6] "    "
 [7] "    <td class=\"row-text\">W, 7-0</td>"
 [8] "    "
 [9] "    </tr>"
[10] "    "

> try = try[1:30]
> try = sub('<a href="/title[^>]*>(.*)</a>.*$','\\1',try)
> head(try)
[1] "    Just Go with It"
[2] "    $30.5M"
[3] "    $30.5M"
[4] "    Justin Bieber: Never Say Never"
[5] "    $29.5M"
[6] "    $30.3M"

Once the spaces at the beginning of each line are removed, we can rearrange the data into a 3-column data frame:

> try = sub('^ *','',try)
> movies = data.frame(matrix(try,ncol=3,byrow=TRUE))
> names(movies) = c('Name','Wkend Gross','Total Gross')
> head(movies)
                            Name Wkend Gross Total Gross
1                Just Go with It      $30.5M      $30.5M
2 Justin Bieber: Never Say Never      $29.5M      $30.3M
3           Gnomeo &#x26; Juliet      $25.4M      $25.4M
4                      The Eagle      $8.68M      $8.68M
5                   The Roommate      $8.13M      $25.8M
6         The King&#x27;s Speech      $7.23M      $93.7M

We can replace the special characters with the following code:

> movies$Name = sub('&#x26;','&',movies$Name)
> movies$Name = sub('&#x27;','\'',movies$Name)

4.9 Dynamic Web Pages

While reading data from static web pages as in the previous examples can be very useful (especially if you're extracting data from many pages), the real power of techniques like this has to do with dynamic pages, which accept queries from users and return results based on those queries. For example, an enormous amount of information about genes and proteins can be found at the National Center for Biotechnology Information website (http://www.ncbi.nlm.nih.gov/), much of it available through query forms. If you're only performing a few queries, it's no problem using the web page, but for many queries, it's beneficial to automate the process. Here is a simple example that illustrates the concept of accessing dynamic information from web pages.
The page http://finance.yahoo.com provides information about stocks; if you enter a stock symbol on the page (for example aapl for Apple Computer), you will be directed to a page whose URL (as it appears in the browser address bar) is

http://finance.yahoo.com/q?s=aapl&x=0&y=0

The way that stock symbols are mapped to this URL is pretty obvious. We'll write an R function that will extract the current price of whatever stock we're interested in. The first step in working with a page like this is to download a local copy to play with, and to read the page into a vector of character strings:

> download.file('http://finance.yahoo.com/q?s=aapl&x=0&y=0','quote.html')
trying URL 'http://finance.yahoo.com/q?s=aapl&x=0&y=0'
Content type 'text/html; charset=utf-8' length unknown
opened URL
.......... .......... .......... .........
downloaded 39Kb
> x = readLines('quote.html')

To get a feel for what we're looking for, notice that the words "Last Trade:" appear before the current quote. Let's look at the line containing this string:

> grep('Last Trade:',x)
[1] 45
> nchar(x[45])
[1] 3587

Since there are over 3500 characters in the line, we don't want to view it directly. Let's use gregexpr to narrow down the search:

> gregexpr('Last Trade:',x[45])
[[1]]
[1] 3125
attr(,"match.length")
[1] 11

This shows that the string "Last Trade:" starts at character 3125. We can use substring to see the relevant part of the line:

> substring(x[45],3125,3220)
[1] "Last Trade:</th><td class=\"yfnc_tabledata1\"><big><b><span id=\"yfs_l10_aapl\">363.50</span></b></b"

There's plenty of context – we want the part surrounded by <big><b><span ... and </span>.
One easy way to grab that part is to use a tagged regular expression with gsub:

> gsub('^.*<big><b><span [^>]*>([^<]*)</span>.*$','\\1',x[45])
[1] "363.50"

This suggests the following function:

> getquote = function(sym){
+    baseurl = 'http://finance.yahoo.com/q?s='
+    myurl = paste(baseurl,sym,'&x=0&y=0',sep='')
+    x = readLines(myurl)
+    q = gsub('^.*<big><b><span [^>]*>([^<]*)</span>.*$','\\1',grep('Last Trade:',x,value=TRUE))
+    as.numeric(q)
+ }

As always, functions like this should be tested:

> getquote('aapl')
[1] 196.19
> getquote('ibm')
[1] 123.21
> getquote('nok')
[1] 13.35

These functions provide only a single quote; a little exploration of the yahoo finance website shows that we can get CSV files with historical data by using a URL of the form:

http://ichart.finance.yahoo.com/table.csv?s=xxx

where xxx is the symbol of interest. Since it's a comma-separated file, we can use read.csv to read the chart:

gethistory = function(symbol)
   read.csv(paste('http://ichart.finance.yahoo.com/table.csv?s=',symbol,sep=''))

Here's a simple test:

> aapl = gethistory('aapl')
> head(aapl)
        Date   Open   High    Low  Close   Volume Adj.Close
1 2011-02-15 359.19 359.97 357.55 359.90 10126300    359.90
2 2011-02-14 356.79 359.48 356.71 359.18 11073100    359.18
3 2011-02-11 354.75 357.80 353.54 356.85 13114400    356.85
4 2011-02-10 357.39 360.00 348.00 354.54 33126100    354.54
5 2011-02-09 355.19 359.00 354.87 358.16 17222400    358.16
6 2011-02-08 353.68 355.52 352.15 355.20 13579500    355.20

Unfortunately, if we try to use the Date column in plots, it will not work properly, since R has stored it as a factor. The format of the date is the default for the as.Date function, so we can modify our function as follows:

gethistory = function(symbol){
   data = read.csv(paste('http://ichart.finance.yahoo.com/table.csv?s=',symbol,sep=''))
   data$Date = as.Date(data$Date)
   data
}

Note that, since blanks aren't allowed in URLs, plus signs are used in place of spaces.
If we were to click on the "Next" link at the bottom of the page, the URL changes to something like

http://www.google.com/search?hl=en&safe=active&client=firefox-a&rls=com.ubuntu:en-US:unofficial&hs=xHq&q=introduction+to+r&start=10&sa=N

For our purposes, we only need to add the &start= argument to the web page. Since google displays 10 results per page, the second page will have start=10, the next page will have start=20, and so on. Let's read in the first page of this search into R:

z = readLines('http://www.google.com/search?q=introduction+to+r')
Warning message:
In readLines("http://www.google.com/search?q=introduction+to+r") :
  incomplete final line found on 'http://www.google.com/search?q=introduction+to+r'

As always, you can safely ignore the message about the incomplete final line. Since we're interested in the web links, we only want lines with "href=" in them. Let's check how many lines we've got, how long they are, and which ones contain the href string:

> length(z)
[1] 17
> nchar(z)
 [1]   369   208   284    26   505    39 40605  1590   460   291   152   248
[13]   317   513   507     5     9
> grep('href=',z)
[1] 5 7

It's pretty clear that all of the links are on the seventh line. Now we can construct a tagged regular expression to grab all the links.
> hrefpat = 'href *= *"([^"]*)"'
> getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
> gg = gregexpr(hrefpat,z[[7]])
> res = mapply(getexpr,z[[7]],gg)
> res = sub(hrefpat,'\\1',res)
> res[1:10]
 [1] "http://images.google.com/images?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wi"
 [2] "http://video.google.com/videosearch?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wv"
 [3] "http://maps.google.com/maps?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wl"
 [4] "http://news.google.com/news?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wn"
 [5] "http://www.google.com/products?q=introduction+to+r&um=1&ie=UTF-8&sa=N&hl=en&tab=wf"
 [6] "http://mail.google.com/mail/?hl=en&tab=wm"
 [7] "http://www.google.com/intl/en/options/"
 [8] "/preferences?hl=en"
 [9] "https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/search%3Fq%3Dintroduction%2Bto%2Br"
[10] "http://www.google.com/webhp?hl=en"

We don't want the internal (google) links – we want external links which will begin with "http://". Let's extract all the external links, and then eliminate the ones that just go back to google:

> refs = res[grep('^https?:',res)]
> refs = refs[-grep('google.com/',refs)]
> refs[1:3]
[1] "http://cran.r-project.org/doc/manuals/R-intro.pdf"
[2] "http://74.125.155.132/search?q=cache:d4-KmcWVA-oJ:cran.r-project.org/doc/manuals/R-intro.pdf+introduction+to+r&cd=1&hl=en&ct=clnk&gl=us&ie=UTF-8"
[3] "http://74.125.155.132/search?q=cache:d4-KmcWVA-oJ:cran.r-project.org/doc/manuals/R-intro.pdf+introduction+to+r&amp;cd=1&amp;hl=en&amp;ct=clnk&amp;gl=us&amp;ie=UTF-8"

If you're familiar with google, you may recognize these as the links to google's cached results.
We can easily eliminate them:

> refs = refs[-grep('cache:',refs)]
> length(refs)
[1] 10

We can test these same steps with some of the other pages from this query:

> z = readLines('http://www.google.com/search?q=introduction+to+r&start=10')
Warning message:
In readLines("http://www.google.com/search?q=introduction+to+r&start=10") :
  incomplete final line found on 'http://www.google.com/search?q=introduction+to+r&start=10'
> hrefpat = 'href *= *"([^"]*)"'
> getexpr = function(s,g)substring(s,g,g+attr(g,'match.length')-1)
> gg = gregexpr(hrefpat,z[[7]])
> res = mapply(getexpr,z[[7]],gg)
Error in substring(s, g, g + attr(g, "match.length") - 1) :
  invalid multibyte string at '<93>GNU S'

Unfortunately, there seems to be a problem. Fortunately, it's easy to fix. What the message is telling us is that there's a character in one of the results that doesn't make sense in the language (English) that we're using. We can solve this by typing:

> Sys.setlocale('LC_ALL','C')
> res = mapply(getexpr,z[[7]],gg)

Since we no longer get the error, we can continue:

> res = sub(hrefpat,'\\1',res)
> refs = res[grep('^https?:',res)]
> refs = refs[-grep('google.com/',refs)]
> refs = refs[-grep('cache:',refs)]
> length(refs)
[1] 10
This obviously suggests a function: googlerefs = function(term,pg=0){ getexpr = function(s,g)substring(s,g,g+attr(g,’match.length’)-1) qurl = paste(’http://www.google.com/search?q=’,term,sep=’’) if(pg > 0)qurl = paste(qurl,’&start=’,pg * 10,sep=’’) qurl = gsub(’ ’,’+’,qurl) z = readLines(qurl) hrefpat = ’href *= *"([^"]*)"’ wh = grep(hrefpat,z)[2] gg = gregexpr(hrefpat,z[[wh]]) res = mapply(getexpr,z[[wh]],gg) res = sub(hrefpat,’\\1’,res) refs = res[grep(’^https?:’,res)] refs = refs[-grep(’google.com/|cache:’,refs)] names(refs) = NULL refs[!is.na(refs)] } Now suppose that we want to retreive the links for the first ten pages of query results: > links = sapply(0:9,function(pg)googlerefs(’introduction to r’,pg)) > links = unlist(links) > head(links) [1] "http://cran.r-project.org/doc/manuals/R-intro.pdf" [2] "http://cran.r-project.org/manuals.html" [3] "http://www.biostat.wisc.edu/~kbroman/Rintro/" [4] "http://faculty.washington.edu/tlumley/Rcourse/" [5] "http://www.stat.cmu.edu/~larry/all-of-statistics/=R/Rintro.pdf" [6] "http://www.stat.berkeley.edu/~spector/R.pdf" 91 CI_Max_High_Blood_Pres Smoker CI_Min_Smoker CI_Max_Smoker Diabetes 1 39.0 26.6 19.1 34.0 14.2 2 36.6 24.6 20.3 28.8 7.2 3 -1111.1 17.7 10.2 25.1 6.6 4 -1111.1 -1111.1 -1111.1 -1111.1 13.1 5 -1111.1 23.6 16.7 30.4 8.4 6 -1111.1 -1111.1 -1111.1 -1111.1 -1111.1 CI_Min_Diabetes CI_Max_Diabetes Uninsured Elderly_Medicare Disabled_Medicare 1 9.1 19.3 5690 4762 1209 2 5.2 9.3 19798 22635 3839 3 2.0 11.3 5126 3288 1092 4 4.7 21.5 3315 2390 974 5 4.4 12.4 8131 5019 1300 6 -1111.1 -1111.1 2295 1433 504 Prim_Care_Phys_Rate Dentist_Rate Community_Health_Center_Ind HPSA_Ind 1 45.3 22.6 1 2 2 67.0 30.8 1 2 3 45.8 24.6 1 2 4 41.8 18.6 1 1 5 16.2 10.8 2 1 6 54.3 18.1 1 1 It’s clear that the value -1111.1 is being used for missing values, and we’ll need to fix that before we work with the data: > risk[risk == -1111.1] = NA Suppose we want to investigate the relationship between Diabetes and Smoking in each of the 
counties in California. We could create a data set with only California using the subset command, and then plot the two variables of interest:

> risk.ca = subset(risk, CHSI_State_Name == 'California')
> plot(risk.ca$Smoker,risk.ca$Diabetes)

Here's what the plot looks like:

Now let's say that we wanted to examine the relationship between smoking and diabetes for some other state, say Alabama. We can extract the Alabama data using subset, and then use the points command to add that data to the existing plot. (Unlike plot, points doesn't produce a new plot, it adds to the currently active plot.)

> risk.al = subset(risk, CHSI_State_Name == 'Alabama')
> points(risk.al$Smoker,risk.al$Diabetes,col='red')

The plot now looks like this:

Clearly there's a problem: some of the Alabama points are off the scale. This demonstrates that when you wish to plot multiple sets of points on the same graph, you have to make sure that the axes are big enough to accommodate all of the data. One very easy way to do this is to call the plot function with the minimums and maximums of all the data using type='n' as one of the arguments. This tells R to set up the axes, but not to actually plot anything. So a better way to plot the two sets of points would be as follows:

> plot(range(c(risk.ca$Smoker,risk.al$Smoker),na.rm=TRUE),
+      range(c(risk.ca$Diabetes,risk.al$Diabetes),na.rm=TRUE),
+      type='n',xlab='Smoking Rate',ylab='Diabetes Rate')
> points(risk.ca$Smoker,risk.ca$Diabetes)
> points(risk.al$Smoker,risk.al$Diabetes,col='red')
> legend('topright',c('California','Alabama'),col=c('black','red'),pch=1)
> title('Diabetes Rate vs. Smoking by County')

The completed plot looks like this:
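The same type='n' pattern can be tried without the CHSI data file; the numbers below are invented for the example:

```r
set.seed(1)
# Two made-up groups standing in for the two states
x1 = rnorm(20, 20, 4); y1 = rnorm(20, 8, 2)
x2 = rnorm(20, 26, 4); y2 = rnorm(20, 11, 2)
# Set up axes wide enough for both groups, then add each with points()
plot(range(c(x1, x2)), range(c(y1, y2)), type='n',
     xlab='Smoking Rate', ylab='Diabetes Rate')
points(x1, y1)
points(x2, y2, col='red')
legend('topright', c('Group 1', 'Group 2'), col=c('black', 'red'), pch=1)
```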