Web Mining
We have truly arrived in the clichéd Information Age. There is an ever-expanding amount of information "out there". Moreover, the evolution of the Internet into the Global Information Infrastructure, coupled with the immense popularity of the Web, has enabled the ordinary citizen to become not just a consumer of information, but also its disseminator. The Web, then, is becoming the proverbial Vox Populi. Given this vast and ever-growing amount of information, how does the average user quickly find what s/he is looking for? It is a task with which present-day search engines don't seem to help much.

One possible approach is to personalize the web space -- to create a system which responds to user queries by potentially aggregating information from several sources in a manner that depends on who the user is. As a trivial example, a European querying on casinos is probably better served by URLs pointing to Monaco, whereas someone in North America should get URLs pointing to Las Vegas. Similarly, a biologist querying on cricket in all likelihood wants something other than what a sports enthusiast would.

Existing commercial systems seek to do some minimal personalization based on declarative information directly provided by the user, such as their zip code, keywords describing their interests, specific URLs, or even particular pieces of information they are interested in (e.g. the price of a particular stock). Our research aims at creating systems that (semi-)automatically tailor the content delivered to the user from a web site. We do so by mining the web -- both its contents and the users' interactions with it.

Web mining, when looked upon in data mining terms, can be said to have three operations of interest: clustering (finding natural groupings of users, pages etc.), associations (which URLs tend to be requested together), and sequential analysis (the order in which URLs tend to be accessed). As in most real-world problems, the clusters and associations in Web mining do not have crisp boundaries, and often overlap considerably. In addition, bad exemplars (outliers) and incomplete data can easily occur in the data set, for a wide variety of reasons inherent to web browsing and logging. Thus, Web mining and personalization require modeling an unknown number of overlapping sets in the presence of significant noise and outliers. Moreover, the data sets in Web mining are extremely large.
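
As a rough illustration of the association and sequential-analysis operations described above, the following minimal Python sketch counts which URLs are requested together and which requests follow one another. The session data, URL paths, and function names are all hypothetical, and the sketch assumes the access log has already been split into per-visitor sessions; it is plain counting on clean data, not the technique developed in this project.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: each is the ordered list of URLs one visitor requested.
sessions = [
    ["/home", "/papers", "/people"],
    ["/home", "/papers", "/prototypes"],
    ["/papers", "/people"],
    ["/home", "/prototypes"],
]

def co_requests(sessions):
    """Associations: count how many sessions contain both URLs of a pair."""
    counts = Counter()
    for s in sessions:
        # Sort the distinct URLs so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(s)), 2):
            counts[pair] += 1
    return counts

def transitions(sessions):
    """Sequential analysis: count consecutive (from_url, to_url) requests."""
    counts = Counter()
    for s in sessions:
        counts.update(zip(s, s[1:]))
    return counts

pair_counts = co_requests(sessions)
step_counts = transitions(sessions)

print(pair_counts.most_common(3))
print(step_counts.most_common(3))
```

Counts like these treat every session as equally reliable, so noisy or incomplete sessions skew them directly.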
The aim of our research is to develop scalable, robust fuzzy techniques to model noisy data sets containing an unknown number of overlapping categories. Specifically, in this work we are:

  • Developing new scalable robust fuzzy clustering techniques for modeling data.
  • Exploring new techniques to handle linguistic and textual features.
  • Validating our techniques by creating prototype web mining and personalization systems.
Our initial efforts have been to mine web access logs and to cluster search engine results on the fly.
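
To make the idea of overlapping (fuzzy) clusters concrete, here is a minimal sketch of how graded memberships can be computed. It uses the standard fuzzy c-means membership formula as a stand-in, not the robust techniques this project is developing; the function name, the two-dimensional feature vectors, and the cluster centers are assumptions chosen for illustration.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster.

    m > 1 is the fuzzifier; memberships are non-negative and sum to 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # Closer centers get larger weights; normalize so the weights sum to 1.
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical 2-D feature vectors for user sessions and two cluster centers.
centers = [(0.0, 0.0), (4.0, 0.0)]
near_first = fuzzy_memberships((1.0, 0.0), centers)
in_between = fuzzy_memberships((2.0, 0.0), centers)

print(near_first)
print(in_between)
```

A point midway between the two centers is shared equally rather than forced into one group, which is the overlap a crisp partition cannot express. Because these memberships must sum to 1, even a far-away outlier is assigned to the clusters with substantial weight, which is one reason a robust formulation is needed for noisy web data.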

Contact: Anupam Joshi