LING 572: Homework 2 - Info2Vectors, BinVectors, and Naive Bayes Algorithm in Mallet

Instructions for LING 572 Homework 2, which covers writing a script to convert the text format to the binary format, creating a simple tool to convert continuous features to binary features, and understanding the Naive Bayes algorithm and its implementation in Mallet. The homework involves writing tui classes, pipe classes, and data type classes, as well as running commands to test the code and understand the formulas used in the training and classifying functions.

Typology: Assignments

Pre 2010

Uploaded on 03/10/2009

koofers-user-6um

LING 572: Homework 2

Due date: Jan 20, 2007

1 Part I: 50 points

Goal: write your first tui class, pipe classes, and type classes.

• The task: Write a script, info2vectors, which does the reverse of vectors2info. In other words, if you run the following commands:
  – vectors2info --input news3.vectors --print-matrix siw > news3.vectors.txt
  – info2vectors --input news3.vectors.txt --output foo.vectors
  then foo.vectors and news3.vectors should have the same content.

• The input file of info2vectors is in the text format (the same format as news3.vectors.txt):
  "instanceName targetLabel FeatName1 FeatVal1 FeatName2 FeatVal2 ...".
  Note: the number of items on each line can vary a lot; for instance, the first line could have 5 feature pairs and the second line could have 100. The input file also differs from an attribute-value table in that the features on each input line can appear in any order.

• The output file of info2vectors is in the binary format (the same format as news3.vectors).

• Info2Vectors.java is very similar to Csv2Vectors.java. Instead of creating it from scratch, you should start with Csv2Vectors.java and make some modifications. Certain pipes used by Csv2Vectors.java need to be replaced.

You need to do the following:

(1) 5 free points: Read Csv2Vectors.java under $baseDir/classify/tui/.

(2) 15 points: Write a tui class, Info2Vectors.java, that processes the input file. In the hw2 sol, explain the major changes you have made to Csv2Vectors.java in order to create Info2Vectors.java.

(3) 15 points: In order to process each line in the input file, you need to create some new Pipe classes (see examples under $baseDir/pipe) and new data type classes (see examples under $baseDir/types). In the hw2 sol, write down the names of these new Java classes and briefly explain their functionalities.

(4) 5 points: Write a shell script, info2vectors, that calls the Info2Vectors class. Hint: take a look at any file under $malletHomeDir/bin.

(5) 5 points: Run the following commands:
  vectors2info --input news3.vectors --print-matrix siw > news3.vectors.txt
  info2vectors --input news3.vectors.txt --output foo.vectors
  vectors2info --input foo.vectors --print-matrix siw > foo.vectors.txt
  If you "diff" the two *.txt files, do they match? If not, why?

(6) 5 points: If your code works properly, the two *.txt files in (5) should have the same content. Write down the commands that show that. Hint: you need to write some simple code (in Perl, Python, or Java) to process the *.txt files first, and then run "diff".
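The line format described above, and the kind of preprocessing hinted at in (6), can be sketched in plain Java. This is only an illustration under stated assumptions, not the expected solution: it assumes tokens are separated by whitespace, that feature names contain no whitespace, and that feature values are numeric. The class InfoLineSketch and its helper type ParsedLine are hypothetical names; they are not part of Mallet and do not use Mallet's Pipe or Instance classes. The sketch also assumes that feature order within a line is the difference worth normalizing, which you should verify against your own output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not a Mallet class.
final class InfoLineSketch {

    /** Parsed form of one "instanceName targetLabel FeatName1 FeatVal1 ..." line. */
    public static final class ParsedLine {
        public final String name;
        public final String label;
        public final Map<String, Double> features; // TreeMap: iterates in feature-name order

        ParsedLine(String name, String label, Map<String, Double> features) {
            this.name = name;
            this.label = label;
            this.features = features;
        }
    }

    /** Splits one line into name, label, and (feature, value) pairs. */
    public static ParsedLine parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        // A valid line has a name, a label, and an even number of remaining tokens.
        if (tokens.length < 2 || (tokens.length - 2) % 2 != 0) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        Map<String, Double> features = new TreeMap<>();
        for (int i = 2; i < tokens.length; i += 2) {
            // Assumption: values are numeric; a repeated feature name keeps the last value.
            features.put(tokens[i], Double.parseDouble(tokens[i + 1]));
        }
        return new ParsedLine(tokens[0], tokens[1], features);
    }

    /** Rewrites a line with its feature pairs sorted by feature name. */
    public static String canonicalize(String line) {
        ParsedLine p = parseLine(line);
        StringBuilder sb = new StringBuilder(p.name).append(' ').append(p.label);
        for (Map.Entry<String, Double> e : p.features.entrySet()) {
            sb.append(' ').append(e.getKey()).append(' ').append(e.getValue());
        }
        return sb.toString();
    }

    /** Reads lines from standard input and prints each one in canonical form. */
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.trim().isEmpty()) { // skip blank lines
                System.out.println(canonicalize(line));
            }
        }
    }
}
```

One possible use is to pipe each *.txt file from (5) through the class (for example, java InfoLineSketch < news3.vectors.txt > news3.sorted.txt, and likewise for foo.vectors.txt) and then run diff on the two sorted files. Values are re-printed as Java doubles, so an integer such as 2 appears as 2.0 in both outputs; this does not affect the comparison because both files receive the same treatment.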