LING 572: Homework 2
Due date: Jan 20, 2007

1 Part I: 50 points

Goal: write your first tui class, pipe classes, and type classes.

• The task: Write a script, info2vectors, which does the reverse of vectors2info. In other words, if you run the following commands:
  – vectors2info --input news3.vectors --print-matrix siw > news3.vectors.txt
  – info2vectors --input news3.vectors.txt --output foo.vectors
  then foo.vectors and news3.vectors should have the same content.
• The input file of info2vectors is in text format (the same format as news3.vectors.txt): "instanceName targetLabel FeatName1 FeatVal1 FeatName2 FeatVal2 ....". Note: the number of items on each line can vary a lot; for instance, the first line could have 5 feature pairs, and the second line could have 100 pairs. The input file also differs from an attribute-value table in that the features on each input line can appear in any order.
• The output file of info2vectors is in binary format (the same format as news3.vectors).
• Info2Vectors.java is very similar to Csv2Vectors.java. Instead of creating it from scratch, you should start with Csv2Vectors.java and make some modifications. Certain pipes used by Csv2Vectors.java need to be replaced.

You need to do the following:

(1) 5 free points: Read Csv2Vectors.java under $baseDir/classify/tui/.
(2) 15 points: Write a tui class, Info2Vectors.java, that processes the input file. In the hw2 sol, explain the major changes you have made to Csv2Vectors.java in order to create Info2Vectors.java.
(3) 15 points: In order to process each line in the input file, you need to create some new Pipe classes (see examples under $baseDir/pipe) and new data type classes (see examples under $baseDir/types). In the hw2 sol, write down the names of these new Java classes and briefly explain their functionalities.
(4) 5 points: Write a shell script, info2vectors, that calls the Info2Vectors class. Hint: take a look at any file under $malletHomeDir/bin.
(5) 5 points: Run the following commands:
      vectors2info --input news3.vectors --print-matrix siw > news3.vectors.txt
      info2vectors --input news3.vectors.txt --output foo.vectors
      vectors2info --input foo.vectors --print-matrix siw > foo.vectors.txt
    If you "diff" the two *.txt files, do they match? If not, why?
(6) 5 points: If your code works properly, the two *.txt files in (5) should have the same content. Write down the commands that show that. Hint: you need to write some simple code (in Perl, Python, or Java) to process the *.txt files first, and then run "diff".
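
The hint in (6) can be illustrated with a minimal Python sketch. It assumes that each line is whitespace-separated in the "instanceName targetLabel FeatName1 FeatVal1 ..." layout described above, and that any mismatch between the two *.txt files comes only from the order in which feature pairs are printed on a line; both assumptions should be checked against your own output. The function names and the script name normalize_info.py are hypothetical.

```python
import sys


def parse_line(line):
    """Split one line into (instance_name, target_label, [(feature, value), ...]).

    Assumes whitespace-separated tokens; values are kept as strings so that
    no numeric reformatting is introduced.
    """
    tokens = line.split()
    if len(tokens) < 2 or (len(tokens) - 2) % 2 != 0:
        raise ValueError("malformed line: %r" % line)
    name, label = tokens[0], tokens[1]
    # Pair each feature name with the value that follows it.
    pairs = list(zip(tokens[2::2], tokens[3::2]))
    return name, label, pairs


def normalize_line(line):
    """Return the line with its feature pairs sorted by feature name."""
    name, label, pairs = parse_line(line)
    pairs.sort()
    flat = [item for pair in pairs for item in pair]
    return " ".join([name, label] + flat)


def main(in_path, out_path):
    """Write a normalized copy of in_path to out_path, skipping blank lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(normalize_line(line) + "\n")


# Hypothetical usage: python normalize_info.py in.txt out.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Under these assumptions, one would run the script once on news3.vectors.txt and once on foo.vectors.txt, then "diff" the two normalized outputs. If the instances themselves appear in a different order, the lines would also need to be sorted before comparing.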