CS 8803 – AIAD
Prof. Ling Liu

Project Proposal for Automated Classification of Spam Based on Textual Features

Gopal Pai
Under the supervision of Steve Webb

Motivations and Objectives

Spam, a characteristic until recently associated only with email, has spilled over to the web. Web spam is a new problem, prevalent because people look for ways around search engine algorithms with the aim of getting their websites ranked higher on search engines. In many cases the content on these sites is inferior: visiting them may be of no benefit to the user, or even harmful if the site is malicious. This definition is a hazy one, and the classification of pages as spam may vary from person to person.

In this project, we attempt to design a system that classifies a page, based on its content, into the distinct categories of spam and ham. If successfully built, this system would also provide a way to generate a heuristic for the quality of a page without having to store huge crawls and process them on multiple servers in real time. Although such a methodology might not be as reliable as a rank obtained from a regular web crawl, depending on its robustness it could be put to various other uses, which follow.

- Email spam detection. The heuristic can feed back into an email spam classifier. For every email that contains a link, a system using a heuristic such as the one specified here can assign a quality metric to that link. Depending on the quality of the links in the mail, the mail can be classified as spam or non-spam.

- Personalized ratings for web pages. The heuristic can drive an application that learns from the pages a user likes or dislikes; once the system has enough data, it can automatically assign ratings to the pages the user browses. This application depends on the user manually classifying an initial set of pages as spam or ham. A personalized search can also be built on top of it by placing an intermediate analyzing layer between the normal search results and the user, so that the results most likely to appeal to the user are reordered to the top.

- Classification into classes other than spam and ham. For example, detecting whether a page is safe for kids is another problem that can be tackled with this technique.

Related Work

This project follows the efforts of Fetterly et al. [1] and Drost et al. [2]. In [1], the authors investigated a set of metrics pertaining to web pages and showed that, for those metrics, spam pages follow a definite pattern distinct from pages that can be termed ham. They showed that it is possible to detect distinct outliers in graph plots of these metrics, and that such outliers can be labeled as spam. As metrics in this study, they used features such as the length of hostnames, host-to-machine ratios, out-links from a page, and in-links. They concluded that the graph structure of the web is not the only criterion for classifying pages; simpler features like these could also be used to build a heuristic ranking system. In [2], the authors tried to build a simple system to classify link spam.
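To make these kinds of content-based features concrete, the sketch below computes two of the metrics mentioned above, hostname length and out-link count, for a single page. It is only an illustration under simple assumptions: the class name, sample URL, and markup are hypothetical, the regular expression stands in for a real HTML parser, and the code is not taken from [1] or [2].

```java
// A minimal sketch (not from the cited work) of two simple page metrics:
// hostname length and out-link count. Sample URL and markup are hypothetical;
// a real crawler would use a proper HTML parser instead of a regex.
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimplePageMetrics {

    // Matches href attributes inside anchor tags.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Length of the hostname part of a URL, one of the metrics used in [1].
    public static int hostnameLength(String url) throws Exception {
        String host = new URI(url).getHost();
        return host == null ? 0 : host.length();
    }

    // Number of out-links found in the raw HTML of a page.
    public static int outLinkCount(String html) {
        Matcher m = HREF.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String url = "http://cheap-pills-free-casino.example.com/index.html";
        String html = "<html><body><a href=\"http://a.example\">a</a>"
                    + "<a href=\"http://b.example\">b</a></body></html>";
        System.out.println("hostname length: " + hostnameLength(url));
        System.out.println("out-links: " + outLinkCount(html));
    }
}
```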
We attempt to build a more robust system that uses a greater number of features from the web pages and, like these efforts, classifies web pages into spam and ham. Machine learning has also been used to improve search results by filtering nepotistic links [4].

[…] can extract all of the features we have mentioned above. Any extra features will have to be added to the set later by a separate process. The features will be stored in XML files, which makes them easy to access and restructure for further use. However, one feature we expect problems with is the number of in-links to a page. The APIs provided by search engines such as Google do not permit such a large number of queries (on the order of one million). Due to these constraints, we may have to exclude this metric from the final set that we evaluate.

Building the classifier and classifying new pages is the next part of the process and, as mentioned, it will be conducted using WEKA. During this critical phase, we shall evaluate each of the metrics in the set and determine which of them can be used. We will proceed in two sub-phases: in the first, we shall evaluate all the metrics apart from the tokens on the page, since including the tokens would lead to an explosion of features; in the second, we will evaluate the tokens and determine which of them can be termed useful.

A rough set of milestones for the project, at roughly two-week intervals, is as follows:

- End of February: extract the set of URLs from spam mails, obtain the ham URLs, and crawl the web pages.
- Mid-March: extract the features of the crawled pages.
- End of March: start coding the classifier and evaluate the metrics other than the text tokens on the page.
- Mid-April: evaluate the important textual tokens on the page.
- End of April: have a basic framework for classifying new pages.

Evaluation and Testing Methodology

In order to test the correctness of the various classification methods, we could perform a random sampling of the spam and ham pages and build the models on these alone. We could then test the models against the rest of the data set and check the accuracy of the classification. The size of the sample will have to be determined experimentally, and the results will feed directly into the classifier. We intend to start with smaller samples of a few thousand pages and scale upwards until we reach optimal performance; a minimal WEKA sketch of this sampling-and-evaluation step is given after the bibliography. As a second level of testing, we could build a framework that performs a random walk of the web and classifies every page it visits. Some manual verification of these pages will have to be done later to determine how effective the technique has been.

Bibliography

[1] D. Fetterly, M. Manasse, and M. Najork, "Spam, Damn Spam, and Statistics," in Proceedings of the 7th International Workshop on the Web and Databases (WebDB '04), co-located with ACM SIGMOD/PODS 2004.

[2] I. Drost and T. Scheffer, "Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam," in Proceedings of the European Conference on Machine Learning, 2005.

[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, San Francisco, 2005.

[4] B. D. Davison, "Recognizing Nepotistic Links on the Web," in Artificial Intelligence for Web Search, pp. 23–28, 2000.

[5] E. Amitay et al., "The Connectivity Sonar: Detecting Site Functionality by Structural Patterns," in Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, pp. 38–47, 2003.
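The following is the minimal WEKA sketch of the sampling-based evaluation referred to above: train on a random sample and evaluate on the remaining pages. The ARFF file name, the Naive Bayes classifier, and the sample size of 1,000 are placeholder assumptions; the proposal leaves the actual classifier and sample size to be determined experimentally.

```java
// A minimal WEKA sketch of the sampling-based evaluation described in the
// Evaluation and Testing Methodology section. Assumptions (not fixed by the
// proposal): features already converted from the XML files into "pages.arff",
// NaiveBayes as a stand-in classifier, an initial training sample of 1,000 pages.
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SpamHamEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("pages.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute = spam/ham label
        data.randomize(new Random(1));                  // random sampling of spam and ham pages

        int trainSize = 1000;                           // start small, scale up experimentally
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        Classifier cls = new NaiveBayes();              // placeholder; any WEKA classifier fits here
        cls.buildClassifier(train);                     // build the model on the sample alone

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(cls, test);                  // test against the rest of the data set
        System.out.println(eval.toSummaryString());     // accuracy of classification
    }
}
```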