Economics 201
Cottrell

Manipulating data with command-line utilities

In doing “real world” data analysis, it quite often happens that one can get hold of relevant data, but not in the exact format that one needs for processing using a program such as gretl or Excel. In that case we need to edit the data first.

There are various options for doing this. You are probably used to editing stuff using a word processor (e.g. MS Word). One first, important point to notice is that in editing raw data, a word processor is generally not appropriate. The data must remain as plain text, and must not assume the format of a Word doc (which includes formatting codes and a bunch of other stuff). The Windows utility known as WordPad is OK for the purpose, so long as you take care to save your edited work as plain text.

Here, though, I will talk about another option that can be very useful when the raw data differ in some systematic way from what you want, i.e. where the editing task is to recognize some pattern in the raw data and change it in some specified way. The option I’m talking about is the use of simple command-line tools, which enable you to modify a text file non-interactively. By “non-interactively” I mean that you don’t have to go through the file searching and replacing by hand; rather you issue a single command that does all the work for you.

As a case in point, consider the wage data that we downloaded from the Bureau of Labor Statistics website, bls.gov. The file we obtained held monthly wage data from the 1960s to the present. It was just what we wanted, except that (as I discovered when the dates looked funny in gretl) after the December value (labeled M12) for each year, there was an M13 value, which represented the average value for the year as a whole.
We wanted just the monthly data, so the task in this case was to strip out all lines containing M13, something that is not easy to do with standard Search-and-Replace tools.

The smart utility for this job is a program called grep. This program scans a file and either

  • prints only those lines in the file that match a certain pattern; or
  • if you add the -v (think reVerse) option, prints only those lines that do not match the pattern.

The command for the BLS task was then

    grep -v M13 bls.txt > new.txt

Taking this apart, the first bit is grep -v M13, that is, we’re asking grep to give us lines that do not contain M13. The next bit is bls.txt, the name of the original data file we want scanned. The last bit is > new.txt, which is composed of two parts: the > symbol calls for redirection (instead of sending output to the screen, we want it sent to a file), and then we give the name of the file we want created, here new.txt. Putting it all back together, the command is: “Scan bls.txt for lines that do not contain M13 and send the output from this operation to new.txt.”

Where to get grep, how to install

How to open a console

Other similar things: sed.
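
To try the grep command from these notes without the real BLS download, here is a small sketch you can paste into a console. The file contents are invented for illustration (the column layout and the wage figures are assumptions, not the actual BLS format), and new_sed.txt is just a name chosen for this example. Since sed is mentioned as a similar tool, the sketch also shows one way sed might do the same job, using its d (delete) command.

```shell
# Build a tiny stand-in for the BLS file (layout and values are made up).
cat > bls.txt <<'EOF'
1964 M11 2.55
1964 M12 2.56
1964 M13 2.53
1965 M01 2.57
EOF

# Keep only the lines that do NOT contain M13, and save them to new.txt.
grep -v M13 bls.txt > new.txt

# The same job with sed: delete (d) every line that matches M13.
sed '/M13/d' bls.txt > new_sed.txt

# cmp says nothing when two files are identical; if so, show the result.
cmp new.txt new_sed.txt && cat new.txt
```

One point to watch: the pattern M13 is matched anywhere on the line, not just in the month column. If some other field in your file could contain the characters M13, those lines would be dropped too, so it is worth looking over new.txt before using it.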