Information Retrieval Systems: Design and Implementation with MySQL and Text Files

An overview of information retrieval systems, focusing on enabling MySQL databases for text search and on the concept of indexing. It also covers building a front end for searching a MySQL database and dealing with unstructured data using text files.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-ast-1

INLS 490-154: Information Retrieval Systems Design and Implementation. Spring 2009.

2. IR with MySQL and Text Files

Chirag Shah∗
School of Information & Library Science (SILS)
UNC Chapel Hill NC 27599
chirag@unc.edu

1 Introduction

In this class we will continue (and finish) exploring purely structured data and move toward the more unstructured domain of textual information. To help orient us for the rest of our journey, we will present a model of information seeking in the following section. While the focus of this course is information retrieval, it is important to understand how it fits into the overarching issue of information seeking.

In this class we will learn how to enable MySQL databases to search through fields containing textual data. We will use SQL to search through MySQL tables and retrieve the records that match a query. At this point we will start talking about concepts such as indexing and stop words. These concepts will stay with us for the rest of our exploration of unstructured textual information.

2 A model for Information Seeking

A general model of information access and organization is depicted in Figure 1 (Shah, 2008). On the left side, four layers are labeled; on the right side, examples are given for these layers; and in the middle, a typical scenario is presented. These four layers are described below in detail.

Layer-1: Information

This layer contains information in various sources and formats (structured, semi-structured, and unstructured). The sources include digital libraries, wikis, blogs, databases, and webpages; the formats include text, images, and videos.
∗ These notes for INLS 490-154 Spring 2009 by Chirag Shah (http://www.unc.edu/~chirags) are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

Figure 1: An IR-centric model of Information Seeking (IS)

Layer-2: Tools

This layer consists of the tools and techniques a user can use to access the information of layer-1. They include search services, relevance feedback (Buckley, Salton, & Allen, 1994), and query term suggestions (Anick, 2003). In addition, since this layer also mediates between information and users, it includes a variety of user interfaces, ranging from results presented as ranked lists to touch panels with mechanisms to visualize results. A large amount of research in IR is focused on the link between layer-1 and layer-2; that is, on developing tools and services appropriate for retrieving information of various forms.

Layer-3: User

This layer consists of a user, who uses the tools in layer-2 to access the information in layer-1 and accumulate knowledge in layer-4. The focus of HCI research has been on the link between layer-2 and layer-3; that is, on presenting the information and the information access tools to the user in effective ways. Layer-3 also includes elements relating to a user, such as user profiles, which can be used for personalization (Teevan, Dumais, & Horvitz, 2005).

Layer-4: Results

The user of layer-3 accumulates the information relevant to him in layer-4. In the most basic sense, this could be a set of webpages that the user found relevant from his searches on the Web.
Extending this further, we can have bookmarks, notes, and other kinds of [...]

    if ($value3) // Radio buttons have TRUE (1) or FALSE (0) values
    {
        // Do something
    }

    // Reading list boxes
    $value4 = $_GET['item4'];
    switch ($value4) // List boxes are multiple choices
    {
        case 'option1':
            // Do something
            break;
        case 'option2':
            // Do something
            break;
    }
    ?>

Code 2: MySQL connection commands in PHP

    <?php
    // Uses mysqli; the older mysql_* functions were removed in PHP 7
    $host = "localhost";
    $username = "me";
    $password = "";
    $database = "example";

    // Connect to the server
    $dbh = mysqli_connect($host, $username, $password)
        or die("Cannot connect to the database: " . mysqli_connect_error());

    // Select the database to work with
    $db_selected = mysqli_select_db($dbh, $database)
        or die("Cannot select the database: " . mysqli_error($dbh));
    ?>

Code 3: Create and execute a SQL query as well as read the records in PHP

    <?php
    // Assumes $dbh is the connection opened in Code 2

    // Formulate the query
    $query = "SELECT * FROM table_name";

    // Execute the query, get the results
    $results = mysqli_query($dbh, $query)
        or die("Query failed: " . mysqli_error($dbh));

    // Go record by record; each record is an associative array
    while ($line = mysqli_fetch_assoc($results))
    {
        // Do something
    }
    ?>

5 Working with text files for IR

Let us now see how we could start dealing with unstructured data. To begin our exploration, which will continue for the rest of this course, we will take a very simple example. We have a text file and we want to look through it for a word. Of course, utilities such as grep in UNIX can do this easily. But let us write a simple program to achieve a similar effect.[2] Such a PHP script is listed in Code 4. It reads the query (a word) from the input, looks for that query in a document (hardcoded here), and prints 'Found!' when the query is found.
Code 4: Searching through a text file with PHP

    <?php
    // Read the query (a single word) from standard input
    $query = trim(fgets(STDIN));

    // Open the (hardcoded) document
    $fin = fopen("document.txt", "r") or die("Cannot open document.txt\n");

    // Go line by line
    while (($line = fgets($fin)) !== false)
    {
        // Strip the trailing newline so the last word on a line can match
        $words = explode(" ", trim($line));
        foreach ($words as $word)
        {
            if (strcmp($word, $query) == 0)
            {
                echo "Found!\n";
                break; // Report each matching line only once
            }
        }
    }
    fclose($fin);
    ?>

This process works, but it has several issues that are not quite visible in the example given. For instance, it is case sensitive. It assumes clean text, without punctuation or other special symbols attached to the words. It does not allow matching phrases or searching for more than one word. Above all, it is a really inefficient process, which makes it almost impractical for large-scale systems. In the next class we will see how we could take this simple process and start enhancing it so that we achieve something more practical.

[2] Note that grep is really powerful and allows one to use regular expressions in searching.

6 Summary

• Information retrieval can be seen as a part of the overarching domain of information seeking. We will focus on information retrieval, but also see how it fits with other components in an information seeking model.
• MySQL inherently supports searching with regular expressions. To achieve better performance while searching on textual information, we need to index that field.
• Indexing allows one to prepare a more sophisticated representation of the information that can help in better searching.
• Stop words are the words that we do not want to store or process since they are not very useful while searching.

References

Anick, P. (2003). Using terminological feedback for web search refinement - a log-based study. In Proceedings of ACM SIGIR (pp. 88-95).

Buckley, C., Salton, G., & Allen, J. (1994). The effect of adding relevance information in a relevance feedback environment. In Proceedings of ACM SIGIR (pp. 292-300). New York, NY: Springer-Verlag.

Dumais, S. T., Cutrell, E., Cadiz, J., Jancke, G., Sarin, R., & Robbins, D. C. (2003, August). Stuff I've Seen: A System for Personal Information Retrieval and Re-Use. In Proceedings of ACM SIGIR. ACM Press.

Shah, C. (2008, June 20). Toward Collaborative Information Seeking (CIS). In Collaborative exploratory search workshop. Pittsburgh, PA.

Teevan, J., Dumais, S. T., & Horvitz, E. (2005). Personalizing search via automated analysis of interests and activities. In Proceedings of ACM SIGIR (pp. 449-456).
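Three of the issues noted for Code 4 in Section 5 (case sensitivity, punctuation attached to words, and single-word queries) could be addressed along the following lines. This is only a minimal sketch under stated assumptions, not the approach the course itself takes next: the helper names tokenize() and line_matches() and the sample lines are hypothetical, the text is assumed to be plain ASCII, and a line is assumed to match only when it contains every query word. It does nothing about the efficiency problem.

```php
<?php
// Hypothetical helper: lowercase a string and split it into words,
// treating any run of non-alphanumeric ASCII characters as a separator.
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: true if every word of the query occurs in the line.
// Assumption: word order and adjacency are ignored (no phrase matching).
function line_matches(string $line, string $query): bool
{
    $lineWords = tokenize($line);
    foreach (tokenize($query) as $word) {
        if (!in_array($word, $lineWords, true)) {
            return false;
        }
    }
    return true;
}

// Made-up sample lines standing in for the contents of document.txt
$lines = [
    "Information Retrieval, or IR, deals with search.",
    "MySQL can index textual fields.",
];

foreach ($lines as $n => $line) {
    if (line_matches($line, "retrieval information")) {
        echo "Found on line " . ($n + 1) . "\n";
    }
}
```

Because both the line and the query go through the same tokenize() step, "Retrieval," in the text and "retrieval" in the query compare as equal. Note that an empty query matches every line in this sketch, since there is no query word that could be missing.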