__ABSTRACTS__

**
Scalable Distributed Change Detection from Astronomy Data Streams using Local, Asynchronous Eigen Monitoring Algorithms
**

**Abstract**:
This paper considers the problem of change detection using local distributed eigen monitoring algorithms for next-generation petascale astronomy data pipelines such as the Large Synoptic Survey Telescope (LSST). This telescope will take repeat images of the night sky every 20 seconds, thereby generating 30 terabytes of calibrated imagery every night that will need to be co-analyzed with other astronomical data stored at different locations around the world. Change point detection and event classification in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable and distributed asynchronous algorithm for monitoring the principal components (PCs) of such dynamic data streams. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e., it converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Experiments performed on real Sloan Digital Sky Survey (SDSS) catalogue data show the effectiveness of the algorithm.

**
A Communication Efficient Probabilistic Algorithm for Mining Frequent Itemsets from a Peer-to-Peer Network
**

**Abstract**:
This paper focuses on developing a communication-efficient algorithm for discovering frequent itemsets from data distributed homogeneously in a peer-to-peer (P2P) network. Classic frequent itemset mining algorithms do not work in this setting since they need the data in its entirety to be available at one location. In a large-scale distributed environment like a P2P network, finding all the network-wide frequent itemsets with exact frequencies is computationally difficult and usually requires a large amount of communication (often the communication per peer reaches the order of the network size). Instead, this paper tries to find most of the frequent itemsets, with probabilistic guarantees, and their approximate frequencies using only bounded communication. The paper takes a sampling-based approach to identifying frequent itemsets from the entire P2P network. It first addresses the challenging problem of collecting an unbiased uniform sample from a P2P network, and shows how to take a uniform sample of nodes and a uniform sample of data from a P2P network using random walks. It then applies the proposed sampling technique to identify most of the frequent itemsets in the network. Theoretical analysis shows how to bound the sample size, irrespective of the size of the entire data set, and hence the communication necessary to compute the results. Experimental analysis shows that the proposed algorithm discovers all of the network-wide frequent itemsets using only bounded communication.
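
The sample-size bound described above can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm: the function names and the candidate-based miner are hypothetical, and the bound is the standard Hoeffding inequality for a [0,1] frequency estimate.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Sample size so that an itemset's empirical frequency is within eps
    of its true frequency with probability at least 1 - delta (Hoeffding
    bound for an indicator variable); independent of the total data size."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def frequent_itemsets_from_sample(sample, candidates, minsup, eps):
    """Report candidate itemsets whose sampled frequency clears a lowered
    support threshold (minsup - eps), so truly frequent itemsets are
    unlikely to be missed."""
    n = len(sample)
    result = []
    for items in candidates:
        freq = sum(1 for txn in sample if items <= txn) / n
        if freq >= minsup - eps:
            result.append((items, freq))
    return result

# The bound depends only on eps and delta, not on the network-wide data size.
n = hoeffding_sample_size(eps=0.05, delta=0.01)
```

The key property the abstract relies on is that `n` is a constant in the size of the network's data, which is what makes the communication bounded.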

**
Approximate K-means Clustering Over a Peer-to-peer Network
**

**Abstract**:
Data-intensive peer-to-peer (P2P) networks are finding an increasing number of applications. Data mining in such P2P environments is a natural extension. However, common monolithic data mining architectures do not fit well in such environments since they typically require centralizing the distributed data, which is usually not practical in a large P2P network. Distributed data mining algorithms that avoid large-scale synchronization or data centralization offer an alternative. This paper considers the distributed K-means clustering problem where the data and computing resources are distributed over a large P2P network. It offers two algorithms which produce an approximation of the result produced by the standard centralized K-means clustering algorithm. The first is designed to operate in a dynamic P2P network and produces a clustering using only "local" synchronization. The second algorithm uses uniformly sampled peers and provides analytical guarantees regarding the accuracy of clustering on a P2P network. Empirical results show that both algorithms perform well compared to their centralized counterparts, at little communication cost.

**
On the Privacy of Euclidean Distance Preserving Data Perturbation
**

**Abstract**:
We examine Euclidean distance preserving data perturbation as a
tool for privacy-preserving data mining. Such perturbations allow many
important data mining algorithms, with only minor modification, to be
applied to the perturbed data and produce exactly the same results as if
applied to the original data, e.g., hierarchical clustering and k-means
clustering. However, the issue of how well the original data is hidden
needs careful study. We take a step in this direction by assuming the role
of an attacker armed with two types of prior information regarding the
original data. We examine how well the attacker can recover the original
data from the perturbed data and prior information. Our results offer
insight into the vulnerabilities of Euclidean distance preserving
transformations.
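
A minimal sketch of why such perturbations leave distance-based mining unchanged, using a plain 2-D rotation as a stand-in for the (hypothetical) secret orthogonal map:

```python
import math

def rotate(point, theta):
    """A 2-D rotation: an orthogonal, hence Euclidean-distance-preserving, map."""
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# theta plays the role of the data owner's secret perturbation key.
data = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
theta = 0.83
perturbed = [rotate(p, theta) for p in data]

# Every pairwise distance survives the perturbation exactly, so any
# distance-based algorithm (k-means, hierarchical clustering) behaves
# identically on the perturbed data -- which is also why prior knowledge
# about a few original points can help an attacker invert the map.
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        assert abs(dist(data[i], data[j]) - dist(perturbed[i], perturbed[j])) < 1e-9
```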

**
A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems
**

**Abstract**:
In a large network of computers or wireless sensors, each of the components (henceforth, peers)
has some data about the global state of the system. Much of the system’s functionality such as message routing, information retrieval and load sharing relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date.
Computing global data mining models, e.g., decision trees or k-means clusterings, in large distributed systems may be very costly due to the scale of the system and the communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data, such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

**
A Scalable Local Algorithm for Distributed Multivariate Regression
**

**Abstract**:
This paper offers a local distributed algorithm for multivariate regression in large peer-to-peer
environments. The algorithm can be used for distributed inferencing, data compaction, data modeling
and classification tasks in many emerging peer-to-peer applications for bioinformatics, astronomy, social
networking, sensor networks and web mining. Computing a global regression model from data available
at the different peer nodes using a traditional centralized algorithm for regression can be very costly and
impractical because of the large number of data sources, the asynchronous nature of peer-to-peer
networks, and the dynamic nature of the data/network. This paper proposes a two-step approach to deal
with this problem. First, it offers an efficient local distributed algorithm that monitors the “quality” of
the current regression model. If the model is outdated, it uses this algorithm as a feedback mechanism
for rebuilding the model. The local nature of the monitoring algorithm guarantees low monitoring cost.
Experimental results presented in this paper strongly support the theoretical claims.

**
Distributed Decision Tree Induction in Peer-to-Peer Systems
**

**Abstract**:
This paper offers a scalable and robust distributed algorithm for decision tree induction in large
Peer-to-Peer (P2P) environments. Computing a decision tree in such large distributed systems using
standard centralized algorithms can be very communication-expensive and impractical because of the
synchronization requirements. The problem becomes even more challenging in the distributed stream
monitoring scenario where the decision tree needs to be updated in response to changes in the data
distribution. This paper presents an alternate solution that works in a completely asynchronous manner
in distributed environments and offers low communication overhead, a necessity for scalability. It also
seamlessly handles changes in data and peer failures. The paper presents extensive experimental results
to corroborate the theoretical claims.

**
An Efficient Local Algorithm for Distributed Multivariate Regression in
Peer-to-Peer Networks
**

**Abstract**:
This paper offers a local distributed algorithm for multivariate
regression in large peer-to-peer environments. The algorithm
is designed for distributed inferencing, data compaction,
data modeling and classification tasks in many
emerging peer-to-peer applications for bioinformatics, astronomy,
social networking, sensor networks and web mining.
Computing a global regression model from data available
at the different peer-nodes using a traditional centralized
algorithm for regression can be very costly and impractical
because of the large number of data sources, the asynchronous
nature of peer-to-peer networks, and the dynamic
nature of the data/network. This paper proposes a two-step
approach to deal with this problem. First, it offers an efficient
local distributed algorithm that monitors the “quality” of the
current regression model. If the model is outdated, it uses
this algorithm as a feedback mechanism for rebuilding the
model. The local nature of the monitoring algorithm guarantees
low monitoring cost. Experimental results presented in
this paper strongly support the theoretical claims.

**
Distributed Identification of Top-l Inner Product Elements and its Application in a Peer-to-Peer Network
**

**Abstract**:
The inner product measures how closely two feature
vectors are related. It is an important primitive for many popular
data mining tasks, e.g., clustering, classification, correlation computation,
and decision tree construction. If the entire data set is
available at a single site, then computing the inner product matrix
and identifying the top (in terms of magnitude) entries is trivial.
However, in many real-world scenarios, data is distributed across
many locations and transmitting the data to a central server
would be quite communication-intensive and not scalable. This
paper presents an approximate local algorithm for identifying
top-l inner products among pairs of feature vectors in a large
asynchronous distributed environment such as a peer-to-peer
(P2P) network. We develop a probabilistic algorithm for this
purpose using order statistics and the Hoeffding bound. We present
experimental results to show the effectiveness and scalability
of the algorithm. Finally, we demonstrate an application of
this technique for interest-based community formation in a P2P
environment.
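
An illustrative sketch, not the paper's algorithm: pairwise inner products are estimated from sampled coordinates and the pairs ranked, with the Hoeffding bound governing how many samples justify the ranking. All names and parameters here are hypothetical.

```python
import random

def sampled_inner_product(u, v, m, rng):
    """Estimate the normalized inner product <u, v>/n from m sampled
    coordinates; by Hoeffding, for bounded features the estimate is within
    eps of the truth with probability >= 1 - delta once
    m >= ln(2/delta) / (2 * eps**2)."""
    n = len(u)
    idx = [rng.randrange(n) for _ in range(m)]
    return sum(u[i] * v[i] for i in idx) / m

def top_l(vectors, l, m, rng):
    """Rank feature pairs by estimated inner product and keep the top l:
    a probabilistic stand-in for the exact top-l entries of the matrix."""
    est = []
    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            est.append(((a, b), sampled_inner_product(vectors[a], vectors[b], m, rng)))
    est.sort(key=lambda kv: -kv[1])
    return [pair for pair, _ in est[:l]]
```

In the P2P setting the coordinate samples would come from different peers; here the sampling is simulated locally.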

**
A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods
**

**Abstract**:
We focus primarily on the use of additive and matrix multiplicative data perturbation
techniques in privacy preserving data mining (PPDM). We survey a
recent body of research aimed at better understanding the vulnerabilities of these
techniques. These researchers assumed the role of an attacker and developed
methods for estimating the original data from the perturbed data and any available
prior knowledge. Finally, we briefly discuss research aimed at attacking
k-anonymization, another data perturbation technique in PPDM.

**
Privacy-Preserving Data Analysis on Graphs and Social Networks
**

**Abstract**:
While literature within the field of privacy-preserving data mining
(PPDM) has been around for many years, attention has mostly been given
to the perturbation and anonymization of tabular data; understanding the role
of privacy over graphs and networks is still very much in its infancy. In this
chapter, we survey a very recent body of research on privacy-preserving data
analysis over graphs and networks in an effort to allow the reader to observe
common themes and future directions.

**
Thoughts on Human Emotions, Breakthroughs in Communication, and the Next Generation of Data Mining
**

Hillol Kargupta.
2007 National Science Foundation Symposium on Next Generation of Data Mining and Cyber-Enabled Discovery for Innovation.

**Abstract**:
This paper revisits some of the breakthroughs in the field of communication that changed human civilization and tries to understand where we are going tomorrow. It argues that while we have become very good at quickly connecting one entity with another worldwide, as long as the former knows the address of the latter, current technology for taking a message or service from one individual to a large population of willing and interested individuals is still very primitive. We need technology for dealing with the new world of attention economics. The paper also argues that the currently popular centralized client-server model for Internet applications may not work well in doing so. It offers some alternate thoughts borrowed from nature and from some emerging scalable applications that may offer directions.

**
Uniform Data Sampling from a Peer-to-Peer Network
**

Souptik Datta, Hillol Kargupta (2007). 2007 IEEE International Conference on Distributed Computing Systems. Accepted for publication.

**Abstract**:
A uniform random sample is often useful in analyzing data. Taking a
uniform sample is usually not a problem if the entire data set resides
in one location. However, if the data is distributed in a peer-to-peer
(P2P) network with different amounts of data on different peers,
collecting a uniform sample of the data becomes a challenging task.
Random sampling can be performed using a random walk on the P2P
network modeled as a graph. However, due to varying degrees of
connectivity and the different sizes of data owned by each peer, a
fixed-length random walk gives a biased sample. In this paper, we
propose a random-walk-based sampling algorithm that samples data points uniformly
from a large, unstructured P2P network. We model the random walk
as a Markov chain and derive conditions to bound the length of the
random walk necessary to achieve uniformity in the sampling
procedure. A formal communication analysis shows logarithmic
communication cost to discover a uniform data sample. Experiments
verify the effectiveness of the technique in achieving uniformity
of data.
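
One standard way to de-bias a random walk toward a uniform stationary distribution is a Metropolis-Hastings degree correction. The sketch below illustrates that principle; it is not necessarily the paper's exact chain, and the fixed `steps` parameter stands in for the walk-length bound the paper derives from the chain's mixing properties.

```python
import random

def metropolis_walk(adj, start, steps, rng):
    """Random walk whose stationary distribution is uniform over nodes:
    a proposed move from u to a neighbor v is accepted with probability
    min(1, deg(u)/deg(v)), correcting the usual degree bias."""
    u = start
    for _ in range(steps):
        v = rng.choice(adj[u])
        if rng.random() < min(1.0, len(adj[u]) / len(adj[v])):
            u = v
    return u

# A small graph with very uneven degrees; an uncorrected walk would
# oversample well-connected node 0 and undersample leaf node 3.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
rng = random.Random(42)
sample = [metropolis_walk(adj, 0, 50, rng) for _ in range(4000)]
```

Sampling data points (rather than nodes) additionally requires weighting by how much data each peer holds, which the paper folds into the walk.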

**
Distributed Top-K Outlier Detection from Astronomy Catalogs using
the DEMAC System
**

Haimonti Datta, Chris Giannella, Kirk Borne and Hillol Kargupta. Proceedings of the 2007 SIAM International Conference on Data Mining (2007).

**Abstract**:
The design, implementation and archiving of large sky surveys is an important part of astronomy research. The Sloan Digital Sky Survey (SDSS) and the Two Micron All Sky Survey (2MASS) are two such surveys, producing terabytes of geographically distributed data which need to be stored, analyzed and queried to enable scientific discoveries. In this paper, we describe the architecture of a system for Distributed Exploration of Massive Astronomy Catalogs (DEMAC), which is built on top of the existing National Virtual Observatory environment. We describe distributed algorithms for Principal Component Analysis (PCA) using random-projection and sampling-based techniques. Using the approximate principal components, we develop a distributed outlier detection algorithm which enables identification of data points that deviate sharply from the "correlation structure" of the data. We provide simulation results with data obtained from the SDSS and 2MASS sky surveys.
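
As an illustrative sketch of PCA-based outlier scoring (in closed form for two dimensions; not the DEMAC implementation), a point's deviation from the "correlation structure" can be measured by its residual distance from the leading principal axis:

```python
import math

def principal_axis(points):
    """Leading eigenvector (and mean) of the 2x2 sample covariance, in closed form."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # larger eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = 0.5 * (sxx + syy) + 0.5 * math.hypot(sxx - syy, 2.0 * sxy)
    if abs(sxy) > 1e-12:
        vx, vy = sxy, lam - sxx       # eigenvector of the 2x2 matrix
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), (mx, my)

def outlier_scores(points):
    """Residual distance from the first principal axis; points that break
    the dominant correlation structure score high."""
    (vx, vy), (mx, my) = principal_axis(points)
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        t = dx * vx + dy * vy              # projection onto the axis
        rx, ry = dx - t * vx, dy - t * vy  # residual off the axis
        scores.append(math.hypot(rx, ry))
    return scores

# Four points on a line plus one that breaks the correlation structure.
scores = outlier_scores([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.5, -1.5)])
```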

**
Distributed Probabilistic Inferencing in Sensor Networks using
Variational Approximation
**

Sourav Mukherjee, Hillol Kargupta (2007).

**Abstract**:
This paper considers the problem of distributed inferencing in a sensor
network. It particularly explores the probabilistic inferencing problem
in the context of a distributed Boltzmann machine-based framework for
monitoring the network. The paper offers a variational mean-field
approach to developing a communication-efficient local algorithm for
Variational Inferencing in Distributed Environments (VIDE). It
compares the performance of the proposed approximate variational
technique with that of exact and centralized techniques. It shows
that VIDE offers a much more communication-efficient solution at
very little cost in terms of accuracy. It also offers experimental
results in order to substantiate the scalability of the proposed algorithm.

**
A Game Theoretic Approach toward Multi-Party Privacy-Preserving Distributed Data Mining
**

H. Kargupta, K. Das, K. Liu.

**Abstract**:
Analyzing privacy-sensitive data in a multi-party environment often assumes that the parties are
well-behaved and they abide by the protocols as expected. Parties compute whatever is needed,
communicate correctly following the rules, and do not collude with other parties for exposing third
party sensitive data. This paper argues that most of these nice assumptions fall apart in real-life
applications of privacy-preserving distributed data mining (PPDM). The paper offers a more
realistic formulation of the PPDM problem as a multi-party game where each party tries to maximize
its own objectives. It develops a game-theoretic framework and offers a new methodology
to develop PPDM algorithms. It also presents an equilibrium analysis of PPDM games and outlines a
game-theoretic solution based on the concept of "cheap talk" borrowed from the economics and game
theory literature.

**Clustering Distributed Data
Streams in Peer-to-peer Environments **

Sanghamitra Bandyopadhyay, Chris Giannella, Ujjwal Maulik, Hillol Kargupta, Souptik Datta, Kun Liu (2006). Information Sciences, 176(14), 1952-1985, 2006.

**Abstract**: This paper describes a technique for clustering homogeneously distributed data in a peer-to-peer environment like sensor networks. The proposed technique is based on the principles of the K-Means algorithm. It works in a localized asynchronous manner by communicating with the neighboring nodes. The paper offers extensive theoretical analysis of the algorithm that bounds the error in the distributed clustering process compared to the centralized approach that requires downloading all the observed data to a single site. Experimental results show that, in contrast to the case when all the data is transmitted to a central location for application of the conventional clustering algorithm, the communication cost (an important consideration in sensor networks, which are typically equipped with limited battery power) of the proposed approach is significantly smaller. At the same time, the accuracy of the obtained centroids is high and the number of samples which are incorrectly labeled is also small.
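
A minimal sketch of one such localized iteration, assuming each peer exchanges per-centroid (sum, count) statistics with its immediate neighbors. The data structures and merging rule here are illustrative, not the paper's exact protocol:

```python
def local_kmeans_step(local_data, centroids, neighbor_stats):
    """One localized K-means iteration: assign local points to the nearest
    centroid, then fold in (sums, counts) statistics received from
    immediate neighbors -- no central server is involved."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for x in local_data:
        j = min(range(k),
                key=lambda c: sum((x[d] - centroids[c][d]) ** 2 for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    for nsums, ncounts in neighbor_stats:   # purely local communication
        for c in range(k):
            counts[c] += ncounts[c]
            for d in range(dim):
                sums[c][d] += nsums[c][d]
    # new centroids; an empty cluster keeps its old centroid
    return [[sums[c][d] / counts[c] if counts[c] else centroids[c][d]
             for d in range(dim)] for c in range(k)]
```

Repeating such steps lets centroid estimates diffuse through the network instead of being computed at a single site.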

**On-board Vehicle Data Stream Monitoring using MineFleet and Fast Resource Constrained Monitoring of Correlation Matrices
**

Hillol Kargupta, Vasundhara Puttagunta, Martin Klein, Kakali Sarkar. New Generation Computing, invited submission for special issue on learning from data streams, volume 25, no. 1, pp. 5-32, 2007.

**Abstract**:
This paper considers the problem of monitoring vehicle data streams in a
resource-constrained environment. It particularly focuses on a monitoring
task that requires frequent computation of correlation matrices using
lightweight on-board computing devices. It motivates this problem in the
context of the {\em MineFleet Real-Time} system and offers a randomized
algorithm for fast monitoring of correlation (FMC), inner product, and
Euclidean distance matrices among others. Unlike the existing approaches
that compute all the entries of these matrices from a data set, the
proposed technique works using a divide-and-conquer approach. This paper
presents a probabilistic test for quickly detecting whether or not a
subset of coefficients contains a significant one with a magnitude greater
than a user given threshold. This test is used for quickly identifying the
portions of the space that contain significant coefficients. The proposed
algorithm is particularly suitable for monitoring correlation and related
matrices computed from continuous data streams.
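
The divide-and-conquer idea can be sketched as follows. For simplicity this illustration replaces the paper's probabilistic sampling test with a deterministic bound (max² ≤ Σc²), so a subset of coefficients is pruned as soon as the bound rules out any significant entry; the structure of the search is the point, not the particular test.

```python
def significant_entries(coeffs, lo, hi, threshold, out):
    """Divide-and-conquer search for entries with |value| >= threshold.
    A subset is discarded as soon as a cheap upper bound on its largest
    squared magnitude (here the subset's sum of squares) falls below
    threshold**2, so most entries are never examined individually."""
    ss = sum(c * c for c in coeffs[lo:hi])
    if ss < threshold * threshold:
        return                      # no entry in this subset can be significant
    if hi - lo == 1:
        out.append(lo)              # a single surviving entry is significant
        return
    mid = (lo + hi) // 2
    significant_entries(coeffs, lo, mid, threshold, out)
    significant_entries(coeffs, mid, hi, threshold, out)
```

In the monitoring setting, `coeffs` would be a flattened block of the correlation (or inner product, or distance) matrix, and only blocks flagged by the test are computed in full.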

**Abstract:**
Peer-to-peer (P2P) networks are gaining popularity in many applications such
as file sharing, e-commerce, and social networking, many of which deal with
rich, distributed data sources that can benefit from data mining. P2P networks
are, in fact, well-suited to distributed data mining
(DDM), which deals with the problem of data analysis in environments with
distributed data, computing nodes, and users. This article offers an overview of
DDM applications and algorithms for P2P environments, focusing
particularly on local algorithms that perform data analysis by using computing
primitives with limited communication overhead. The authors describe both exact
and approximate local P2P data mining algorithms that work in a decentralized
and communication-efficient manner.

**K-Means Clustering over a Large, Dynamic Network**

Souptik Datta, Chris Giannella and Hillol Kargupta (2006). *Proc. 2006 SIAM Conf. Data Mining* (SDM 06), SIAM Press, 2006, pp. 153-164.

**Abstract**: This paper presents an algorithm for K-means clustering
of data distributed over a large, dynamic network. The network is not
assumed to contain any special server nodes (i.e., it is peer-to-peer) and is not
assumed to be stable either with respect to the topology or the data held by
nodes. The algorithm requires only local communication and
synchronization at each iteration: nodes communicate and synchronize only with
their topologically neighboring nodes. Due to the growing prevalence of
peer-to-peer and mobile/wireless sensor networks, data analysis in large,
dynamic networks is likely to garner increasing importance in the near
future. To our knowledge, our algorithm represents the first K-means
algorithm (a common data analysis/mining technique) to be developed for a large
dynamic network.

We tested our algorithm in a simulated environment of up to 1000 nodes on
synthetic data. We examine its behavior in a static environment (no data
or network change) and a dynamic environment. Empirical results show the
algorithm demonstrates good accuracy (in both the static and dynamic
environments) in that the cluster labels produced are very similar to those
produced by K-means run on centralized data.

**On the Privacy
Preserving Properties of Random Data Perturbation Techniques.**

Hillol Kargupta, Souptik Datta, Qi Wang, and Krishnamoorthy Sivakumar (2003). *Proceedings of the Third IEEE International Conference on Data Mining* (ICDM 2003), pp. 99-107, Melbourne, FL, November 2003.

**Abstract**: Privacy is becoming an increasingly important issue
in many data mining applications. This has triggered the development of many
privacy-preserving data mining techniques. A large fraction of them use
randomized data distortion techniques to mask the data for preserving the
privacy of sensitive data. This methodology attempts to hide the sensitive data
by randomly modifying the data values often using additive noise. This paper
questions the utility of the random value distortion technique in privacy
preservation. The paper notes that random objects (particularly random
matrices) have “predictable” structures in the spectral domain and it develops
a random matrix-based spectral filtering technique to retrieve original data
from the dataset distorted by adding random values. The paper presents the
theoretical foundation of this filtering method and extensive experimental
results to demonstrate that in many cases random data distortion preserves very
little data privacy.
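
A hedged sketch of the spectral separation step: under an i.i.d. additive-noise model, the eigenvalues of a pure-noise covariance matrix concentrate below the Marchenko-Pastur upper edge σ²(1 + √(m/n))², so eigenvalues above it can be attributed to the data. This illustrates the principle only, not the paper's full filtering procedure.

```python
import math

def noise_eigenvalue_ceiling(sigma2, m, n):
    """Marchenko-Pastur upper edge: roughly the largest eigenvalue a
    pure-noise covariance (m attributes, n records, noise variance
    sigma2) is expected to produce."""
    return sigma2 * (1.0 + math.sqrt(m / n)) ** 2

def split_spectrum(eigenvalues, sigma2, m, n):
    """Classify covariance eigenvalues: those above the noise ceiling are
    attributed to the data; a spectral filter would then reconstruct the
    data from the corresponding eigenvectors only."""
    ceiling = noise_eigenvalue_ceiling(sigma2, m, n)
    signal = [ev for ev in eigenvalues if ev > ceiling]
    noise = [ev for ev in eigenvalues if ev <= ceiling]
    return signal, noise
```

This is why "predictable" random-matrix structure undermines additive-noise privacy: the noise's eigenvalue range is known in advance, so it can be filtered out.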

**Local L2
Thresholding Based Data Mining in Peer-to-Peer Systems**.

R. Wolff, K. Bhaduri, H. Kargupta. SIAM International Conference on Data Mining, Bethesda, Maryland, USA. 2006.

**Abstract:** In a large network of computers, wireless
sensors, or mobile devices, each of the components (hence, peers) has
some data about the global status of the system. Many of the functions
of the system, such as routing decisions, search strategies, data
cleansing, and the assignment of mutual trust, depend on the global
status. Therefore, it is essential that the system be able to detect, and
react to, changes in its global status. Computing global
predicates in such systems is usually very costly, mainly because of
their scale and, in some cases (e.g., sensor networks), because of the high
cost of communication. The cost further increases when the data changes
rapidly (due to state changes, node failure, etc.) and
computation has to follow these changes. In this paper we describe a two step
approach for dealing with these costs. First, we describe a highly
efficient *local* algorithm which detects when the L2 norm of the
average data surpasses a threshold. Then, we use this algorithm as a
feedback loop for the monitoring of complex predicates on the data --
such as the data's *k*-means clustering. The efficiency of the L2
algorithm guarantees that so long as the clustering results represent
the data (i.e., the data is stationary) few resources are required. When
the data undergoes an epoch change -- a change in the underlying
distribution -- and the model no longer represents it, the feedback
loop indicates this and the model is rebuilt. Furthermore, the existence
of a feedback loop allows using approximate and ``best-effort'' methods
for constructing the model; if an ill-fit model is built the feedback
loop would indicate so, and the model would be rebuilt.
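
The locality of such an L2 test rests on a convexity argument: the ball {x : ||x||_2 < T} is convex, so if every peer's local vector lies inside it, the global average does too. The sketch below shows only that certificate, not the full algorithm with its agreement machinery:

```python
import math

def inside_ball(v, threshold):
    """Local test: is this peer's vector inside the L2 ball of radius threshold?"""
    return math.hypot(*v) < threshold

def average(vectors):
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

# Because the ball is convex, "every local vector is inside" implies
# "the global average is inside" -- each peer can certify the
# below-threshold condition from purely local information.
local_vectors = [(0.3, 0.4), (0.1, -0.2), (-0.25, 0.3)]
T = 0.6
assert all(inside_ball(v, T) for v in local_vectors)
assert inside_ball(average(local_vectors), T)
```

When some peer's vector leaves the ball, communication is needed to decide whether the global average has actually crossed the threshold; the local certificate covers only the all-inside case.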

**Identifying
Significant Inner Product Elements in a Peer-to-Peer Network. **

K. Das, K. Bhaduri, H. Kargupta. In communication. 2006.

**Abstract: **Peer-to-peer networks are getting increasing
attention in various applications involving distributed file sharing,
sensor networks, and mobile ad hoc networks. Efficient identification of
top few inner product entries from the entire inner product matrix of
features in a distributed peer-to-peer network is a challenging problem
since centralizing the data from all the nodes in a synchronous,
communication efficient manner may not be an option. Inner product computation
is an important primitive used in many techniques for feature dependency
detection, distance computation, clustering and correlation computation
among others. This paper deals with the problem of identifying
significant inner products among features in a peer-to-peer environment
where different nodes observe a different set of data. It uses an
ordinal framework to develop probabilistic algorithms to find the top *l*
elements in the inner product matrix which can be used for many data
mining applications. It also presents experimental results demonstrating
scalable performance of this algorithm for large peer-to-peer networks.

**Client-side Web
Mining for Community Formation in Peer-to-Peer Environments**.

K. Liu, K Bhaduri, K. Das, P. Nguyen, H. Kargupta. SIGKDD workshop on web
usage and analysis (WebKDD).

Philadelphia, Pennsylvania, USA. 2006.

**Abstract: **In this paper we present a framework for forming
interest-based peer-to-peer communities using client-side web browsing
history. At the heart of this framework is the use of an order
statistics-based approach to build communities with hierarchical
structure. We have also carefully considered privacy concerns of the
peers and adopted cryptographic protocols to measure similarity between
them without disclosing their personal profiles. We evaluated our
framework on a distributed data mining platform we have developed. The
experimental results show that our framework can effectively build
interest-based communities.

**Monitoring
Any Data Model in a Large Distributed System**.

R. Wolff, K. Bhaduri, H. Kargupta. In communication. 2006.

**Abstract: **In a large network of computers, wireless
sensors, or mobile devices, each of the components (hence, peers) has
some data about the global status of the system. Many of the functions of
the system, such as routing decisions, search strategies, data
cleansing, and the assignment of mutual trust, rely on modeling the
global status, and many of these modeling techniques are akin to data
mining. In general, we refer to the outcome of the function (e.g., the
trust assigned to each peer) as the \emph{model} of the system. Since the
state of the system is in constant change, it is necessary to monitor the
suitability of the model to the state, and react in case it is
unsuitable.

Computing global predicates in large distributed systems is very costly,
mainly because of the system's scale and, in some cases (e.g.,
sensor networks), because of the high cost of communication. The cost
further increases when the data changes rapidly. In this paper we
describe a two step approach for dealing with these costs. First, we describe a
highly efficient *local* algorithm which can be used to monitor
any ordinal function of the global state, i.e., almost any data mining
model. Then, we use this algorithm as a feedback loop for the monitoring
of complex predicates on the data -- such as the data's *k*-means
clustering. The efficiency of the local algorithm guarantees that so long as
the model represents the data (i.e., the data is stationary) few
resources are required. When the data undergoes an epoch change --
a change in the underlying distribution -- and the model no longer
represents it, the feedback loop indicates this and the model is rebuilt.
Furthermore, the existence of a feedback loop allows using approximate
and ``best-effort'' methods for constructing the model; if an
ill-fit model is built the feedback loop would indicate so, and the model
would be rebuilt.

**Random projection-based
multiplicative data perturbation for privacy preserving distributed data
mining.**

K. Liu, H. Kargupta, and J. Ryan. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(1):92-106, January 2006.

**Abstract: **This paper explores the possibility of using
multiplicative random projection matrices for privacy preserving distributed
data mining. It specifically considers
the problem of computing statistical aggregates like the inner product matrix,
correlation coefficient matrix, and Euclidean distance matrix from distributed
privacy sensitive data possibly owned by multiple parties. This class of
problems is directly related to many other data-mining problems such as
clustering, principal component analysis, and classification. This paper makes
primary contributions on two different grounds. First, it explores Independent
Component Analysis as a possible tool for breaching privacy in deterministic
multiplicative perturbation-based models such as random orthogonal
transformation and random rotation. Then, it proposes an approximate random
projection-based technique to improve the level of privacy protection while
still preserving certain statistical characteristics of the data. The paper
presents extensive theoretical analysis and experimental results. Experiments
demonstrate that the proposed technique is effective and can be successfully
used for different types of privacy-preserving data mining applications.
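
A minimal sketch of the underlying principle, assuming a hypothetical k × n Gaussian projection matrix scaled by 1/√k: inner products of the projected vectors approximate the originals in expectation (Johnson-Lindenstrauss-style), which is what lets aggregate computations survive the perturbation while the raw records stay hidden.

```python
import random

def project(v, R):
    """Project an n-vector with a k x n random matrix, scaled by 1/sqrt(k)."""
    k = len(R)
    return [sum(Ri[j] * v[j] for j in range(len(v))) / k ** 0.5 for Ri in R]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(7)
n, k = 50, 500
# Entries i.i.d. N(0, 1); with the 1/sqrt(k) scaling, E[<Ru, Rv>] = <u, v>.
R = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
u = [1.0 if j % 2 == 0 else 0.0 for j in range(n)]
v = [1.0] * n
pu, pv = project(u, R), project(v, R)
# <pu, pv> concentrates around <u, v> = 25 as k grows.
```

The approximation error shrinks as k grows, which is the statistical-preservation half of the trade-off the paper analyzes; the privacy half concerns how hard it is to invert R.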

**An attacker's view of distance preserving
maps for privacy preserving data mining**.

K. Liu, C. Giannella, and H. Kargupta. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006.

**Abstract: **We examine the effectiveness of distance
preserving transformations in privacy preserving data mining. These techniques
are potentially very useful in that some important data mining algorithms can
be applied to the transformed data and produce exactly the same results as if
applied to the original data, e.g., distance-based clustering and k-nearest neighbor
classification. However, the issue of how well the original data is hidden has,
to our knowledge, not been carefully studied. We take a step in this direction
by assuming the role of an attacker armed with two types of prior information
regarding the original data. We examine how well the attacker can recover the
original data from the transformed data and prior information. Our results
offer insight into the vulnerabilities of distance preserving transformations.
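To make the threat concrete, here is a small sketch (our illustration, not the paper's exact attacker model) of a known-sample attack on a distance-preserving map: with enough known input-output pairs, the secret orthogonal transformation can be recovered by solving an orthogonal Procrustes problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Original data: columns are d-dimensional records.
d, n = 5, 100
X = rng.normal(size=(d, n))

# The owner releases Y = T X, where T is a secret random orthogonal
# transformation (a distance-preserving map).
T, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = T @ X

# Attacker's prior knowledge: a handful of original records together
# with their transformed images (d or more generic samples suffice).
known = 8
X_k, Y_k = X[:, :known], Y[:, :known]

# Orthogonal Procrustes: T_hat minimizes ||T X_k - Y_k||_F over
# orthogonal matrices, via the SVD of Y_k X_k^T.
U, _, Vt = np.linalg.svd(Y_k @ X_k.T)
T_hat = U @ Vt

# With T_hat in hand, *all* released records can be un-rotated.
X_rec = T_hat.T @ Y
err = np.linalg.norm(X_rec - X) / np.linalg.norm(X)
print(f"relative recovery error: {err:.2e}")
```

With generic data and at least d known samples, the recovery is essentially exact, which is the kind of vulnerability the abstract alludes to.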

**Communication efficient construction of
decision trees over heterogeneously distributed data.**

C. Giannella, K. Liu, T. Olsen, and H. Kargupta. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 67-74, Brighton, UK, November 2004.

**Abstract: **We present an algorithm designed to efficiently
construct a decision tree over heterogeneously distributed data without
centralizing the data. We compare our algorithm against a standard centralized
decision tree implementation in terms of accuracy as well as communication
complexity. Our experimental results show that, using only 20% of the
communication cost necessary to centralize the data, we can achieve trees with
at least 80% of the accuracy of the trees produced by the centralized version.

**Vedas:** **A mobile and
distributed data stream mining system for real-time vehicle monitoring.**

H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull,
K. Sarkar, M. Klein, M. Vasa, and D. Handy.

In Proceedings of the 2004 SIAM International Data Mining Conference (SDM'04),
Orlando, FL, April 2004.

**Abstract: **This paper presents an overview of an experimental
mobile and distributed data stream mining system that allows real-time
monitoring of vehicle diagnostic data for vehicle-health monitoring and driver
characterization. It offers the motivation behind this application, explains
the system architecture, discusses many of the challenges that the project faced,
and shares some of the adopted solutions. The main contribution of the paper is
our experience in building one of the very early distributed data stream mining
systems for wireless applications that performs most of the data analysis
related tasks using light-weight onboard computing devices.

This paper points out that distributed data mining technology can play a
key role in solving real-life problems in a mobile application environment
where computing, storage, power, and communication resources are limited. The
paper also illustrates how privacy-preserving distributed data mining can play
an important role in this type of application.

**Privacy sensitive distributed data
mining from multi-party data**.

H. Kargupta, K. Liu, and J. Ryan. In Proceedings of the First NSF/NIJ Symposium on Intelligence and Security Informatics, Lecture Notes in Computer Science, pages 336-342, Tucson, AZ, June 2003. Springer Berlin/Heidelberg.

**Abstract: **Privacy is becoming an increasingly important
issue in data mining, particularly in security and counter-terrorism-related
applications where the data is often sensitive. This paper considers the
problem of mining privacy sensitive distributed multi-party data. It
specifically considers the problem of computing statistical aggregates like the
correlation matrix from privacy sensitive data where the program for computing
the aggregates is not trusted by the owner(s) of the data. It presents a brief
overview of a random projection-based technique to compute the correlation
matrix from a single third-party data site and also multiple homogeneous sites.

**Homeland security and privacy sensitive
data mining from multi-party distributed resources.**

H. Kargupta, K. Liu, S. Datta, J. Ryan, and K. Sivakumar. In Proceedings of the 12th IEEE International Conference on Fuzzy Systems, volume 2, pages 1257-1260, St. Louis, MO, May 2003.

**Abstract: **Defending the safety of an open society from
terrorism or other similar threats requires intelligent but careful ways to
monitor different types of activities and transactions in the electronic media.
Data mining techniques are playing an increasingly important role in sifting
through large amounts of data in search of useful patterns that might help us in
securing our safety. Although the objective of this class of data mining
applications is very well justified, they also open up the possibility of misusing personal information by
malicious people with access to the sensitive data. This brings up the
following question: Can we design data mining techniques that are sensitive to
privacy? Several researchers are currently working on a class of data mining
algorithms that work without directly accessing the sensitive data in their
original form. This paper considers the problem of mining distributed data in a
privacy-sensitive manner. It first points out the problems of some of the existing privacy-sensitive data
mining techniques that make use of additive random noise to hide sensitive information.
Next it briefly reviews some new approaches that make use of random projection
matrices for computing statistical aggregates from sensitive data.

**Orthogonal Decision Trees.**

H. Kargupta, B. Park, and H. Dutta. IEEE Transactions on Knowledge and Data Engineering, 2006.

**Abstract: **This paper introduces orthogonal decision trees
that offer an effective way to construct a redundancy-free, accurate, and
meaningful representation of large decision-tree-ensembles often created by
popular techniques such as Bagging, Boosting, Random Forests, and many
distributed and data stream mining algorithms. Orthogonal decision trees are
functionally orthogonal to each other and they correspond to the principal
components of the underlying function space. This paper offers a technique to
construct such trees based on the Fourier transformation of decision trees and
eigen-analysis of the ensemble in the Fourier representation. It offers
experimental results to document the performance of orthogonal trees on the
grounds of accuracy and model complexity.
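The eigen-analysis step can be illustrated with a toy stand-in (decision stumps instead of Fourier-represented trees; the setup and names below are ours, not the paper's construction): stacking each ensemble member's predictions over the whole input space and taking an SVD yields a small set of mutually orthogonal basis functions that reproduce the ensemble without redundancy.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(5)

# A toy "ensemble": decision stumps over 6 binary features, each voting
# +1 or -1 according to a single feature (a stand-in for bagged trees).
m, T = 6, 40
feats = rng.integers(0, m, size=T)
signs = rng.choice([-1.0, 1.0], size=T)

# Evaluate every member on the complete input space (2^m points), turning
# each tree into a vector in function space.
X = np.array(list(product([0, 1], repeat=m)), dtype=float)
F = np.stack([s * (2.0 * X[:, f] - 1.0) for f, s in zip(feats, signs)])

# Eigen-analysis of the ensemble (via SVD): the right singular vectors are
# mutually orthogonal functions -- the analogue of orthogonal trees.
U, S, Vt = np.linalg.svd(F, full_matrices=False)
k = int(np.sum(S > 1e-9))
print(f"{T} redundant members span only {k} orthogonal basis functions")

# The ensemble's average prediction is reproduced exactly in this tiny basis.
avg = F.mean(axis=0)
recon = (Vt[:k] @ avg) @ Vt[:k]
print(f"reconstruction gap: {np.max(np.abs(recon - avg)):.1e}")
```

The 40-member ensemble collapses to at most 6 orthogonal components, mirroring the redundancy-removal claim of the abstract (the paper additionally converts such components back into decision trees via the Fourier representation).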

**Orthogonal Decision Trees.**

H. Kargupta and H. Dutta. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 487-490, Brighton, UK, November 2004.

**Abstract: **This paper introduces orthogonal decision trees
that offer an effective way to construct a redundancy-free, accurate, and
meaningful representation of large decision-tree ensembles often created by
popular techniques such as Bagging, Boosting, Random Forests, and many
distributed and data stream mining algorithms. Orthogonal decision trees are
functionally orthogonal to each other and they correspond to the principal
components of the underlying function space. This paper offers a technique to
construct such trees based on eigen-analysis of the ensemble and offers
experimental results to document the performance of orthogonal trees on grounds
of accuracy and model complexity.

**Orthogonal Decision Trees for
Resource-Constrained Physiological Data Stream Monitoring using Mobile Devices**.

H. Dutta, H. Kargupta, and A. Joshi. In High Performance Computing - HiPC 2005, Lecture Notes in Computer Science, volume 3769, 2005.

**Abstract: **Several challenging new applications demand the
ability to do data mining on resource-constrained devices. One such application
is that of monitoring physiological data streams obtained from wearable sensing
devices. Such monitoring has applications for pervasive healthcare management,
be it for seniors, emergency response personnel, soldiers in the battlefield, or
athletes. A key requirement is that the monitoring system be able to run on
resource-constrained handheld or wearable devices. Orthogonal decision trees
(ODTs) offer an effective way to construct a redundancy-free, accurate, and
meaningful representation of large decision-tree ensembles often created by
popular techniques such as Bagging, Boosting, Random Forests and many
distributed and data stream mining algorithms. Orthogonal decision trees are
functionally orthogonal to each other and they correspond to the principal
components of the underlying function space. This paper discusses various
properties of the ODTs and their suitability for monitoring physiological data
streams in a resource-constrained environment. It offers experimental results
to document the performance of orthogonal trees on grounds of accuracy, model
complexity, and other characteristics in a resource-constrained mobile
environment.

**Efficient Kernel Density Estimation
Over Distributed Data.**

C. Giannella, H. Dutta, S. Mukherjee, H. Kargupta. Proceedings of the 9th International Workshop on High Performance and Distributed Mining, SIAM International Data Mining Conference, Bethesda, USA. April, 2006.

**Abstract: **We consider the problem of Kernel Density Estimation (KDE) over distributed data. Our work is motivated by the fact that, in some cases, large data repositories are, in their native form, distributed over many sites. In environments where communication is limited, centralization of the data can be problematic. We examine a framework where the sites build local models which are combined at a central site to produce an approximation to the KDE over the entire dataset. We carry out experiments comparing two realizations of this framework (uniform sampling and linear binning). We compare these two techniques in terms of their accuracy at three tasks: directly approximating the centralized KDE (measured by 2-norm function distance), density-based classification, and density-based clustering.
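The exact baseline this framework approximates is easy to state: if every site uses the same kernel and bandwidth, the size-weighted sum of the local KDEs equals the centralized KDE. A minimal 1-D sketch (our illustration; the paper's realizations transmit compressed local models such as samples or binned counts instead of raw data):

```python
import numpy as np

def gauss_kde(sample, x, h):
    """Gaussian kernel density estimate of a 1-D sample, evaluated at x."""
    z = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
site_a = rng.normal(0.0, 1.0, size=300)   # data held at site A
site_b = rng.normal(4.0, 1.0, size=100)   # data held at site B
h = 0.4
x = np.linspace(-3.0, 7.0, 200)

# Centralized KDE over the pooled data ...
central = gauss_kde(np.concatenate([site_a, site_b]), x, h)

# ... equals the size-weighted combination of the local KDEs.
n_a, n_b = len(site_a), len(site_b)
combined = (n_a * gauss_kde(site_a, x, h) + n_b * gauss_kde(site_b, x, h)) / (n_a + n_b)

gap = np.max(np.abs(central - combined))
print(f"max pointwise gap: {gap:.2e}")
```

The communication savings in the paper come from each site sending an approximation of its local KDE rather than its data, which trades a small accuracy loss against this exact baseline.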

**Distributed Data Mining in
Astronomy Catalogs**.

C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. Special Issue of Concurrency and Computation: Practice and Experience, 2006 (In Communication)

**Abstract: **The design, implementation, and archiving of very large sky surveys is playing an increasingly important role in today's astronomy research. However, these data archives will necessarily be geographically distributed. To fully exploit the potential of this data lode, we believe that capabilities ought to be provided allowing users a more communication-efficient alternative to multiple archive data analysis than first downloading the archives fully to a centralized site. In this paper, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs. The system is designed to sit on top of the existing national virtual observatory environment and provide tools for distributed data mining (as web services) without requiring datasets to be fully downloaded to a centralized server. To illustrate the potential effectiveness of our system, we develop communication-efficient distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters. In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such as a reduction of a 3-dimensional data distribution (in astronomer's observed units) to a planar data distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through efficient access to distributed multi-dimensional data sets.
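The fundamental-plane use of PCA can be sketched on synthetic data (an illustration of the centralized computation that DEMAC distributes, not the paper's protocol; the generated numbers are ours): three correlated observables produced from two latent factors are, up to noise, captured by the top two principal components.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for correlated astronomical parameters: 3 observed
# quantities driven by 2 latent physical factors plus small noise, so the
# data lies close to a plane in the 3-D observed space.
n = 1000
latent = rng.normal(size=(n, 2))
mix = np.array([[1.0, 0.3],
                [0.5, 1.0],
                [0.8, -0.6]])            # maps 2 factors -> 3 observables
X = latent @ mix.T + 0.05 * rng.normal(size=(n, 3))

# PCA via eigen-decomposition of the sample covariance matrix.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))   # ascending order

# Variance captured by the top-2 principal components: a value near 1
# means the cloud is essentially planar (a "fundamental plane").
planar_frac = evals[-2:].sum() / evals.sum()
print(f"variance captured by the best-fit plane: {planar_frac:.4f}")
```

In the distributed setting, only the low-dimensional covariance statistics need to be exchanged between sites, not the raw catalogs, which is what makes the approach communication-efficient.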

**Distributed Data Mining for Astronomy Catalogs**.

C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. Proceedings of 9th Workshop on Mining Scientific and Engineering Datasets, as part of the SIAM International Conference on Data Mining (SDM), 2006.

**Abstract: **The design, implementation, and archiving of very large sky surveys is
playing an increasingly important role in today's astronomy research. However,
these data archives will necessarily be geographically distributed. To fully
exploit the potential of this data, we believe that capabilities ought to be
provided allowing users a more communication-efficient alternative to multiple
archive data analysis than first downloading the archives fully to a
centralized site.

In this paper, we describe the architecture of a system, DEMAC, for the
distributed mining of massive astronomical catalogs. The system is designed to sit
on top of the existing national virtual observatory environment and provide
tools for distributed data mining (as web services) without requiring datasets
to be fully downloaded to a centralized server. To illustrate the potential
effectiveness of DEMAC, we carry out a case study using distributed principal component analysis (PCA) for detecting
fundamental planes of astronomical parameters. In particular, PCA enables
dimensionality reduction within a set of correlated physical parameters, such
as reduction of a 3-dimensional data distribution (in astronomer's observed
units) to a planar data distribution (in fundamental physical units).
Fundamental physical insights are thereby enabled through efficient access to
distributed multi-dimensional data sets.

**Privacy Preserving Data Mining and Random
Perturbation. **

H. Kargupta, H. Dutta, S. Datta, and K. Sivakumar. In Proceedings of the Workshop on Privacy in the Electronic Society (WPES'03), Washington DC, October 2003.

**Abstract: **Privacy is becoming an increasingly important issue in many data
mining applications, particularly in the security and defense area. This has
triggered the development of many privacy-preserving data mining techniques. A
large fraction of them use randomized data distortion techniques to mask the
data for preserving privacy. They attempt to hide the sensitive data by
randomly modifying the data values using additive noise. This paper questions
the utility of such randomized data distortion techniques for preserving
privacy in many cases and urges caution. It notes that random objects
(particularly random matrices) have "predictable" structures in the spectral
domain and then offers a random matrix-based spectral filtering technique to
retrieve original data from a data set distorted by adding random values. It
extends our earlier work questioning the efficacy of random perturbation
techniques using additive noise for privacy-preserving data mining in the
continuous-valued domain and presents new results in the discrete domain. It
shows that the growing collection of random perturbation-based
"privacy-preserving" data mining techniques may need careful scrutiny in order
to prevent privacy breaches through linear transformations. The paper also
presents extensive experimental results in order to support this claim.
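The spectral-filtering observation can be illustrated with a toy reconstruction (our sketch, not the paper's exact estimator; the threshold below is a simple stand-in for the random-matrix bound): when correlated data is masked with i.i.d. additive noise of known variance, the noise eigenvalues of the sample covariance cluster near the noise variance, so keeping only the eigenvectors above that floor recovers much of the original data.

```python
import numpy as np

rng = np.random.default_rng(4)

# Correlated "sensitive" data: 20 attributes driven by only 3 latent factors.
n, d, sigma = 2000, 20, 1.0
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d))

# Published data: masked with i.i.d. additive noise of variance sigma^2.
Z = X + sigma * rng.normal(size=(n, d))

# Attacker: eigen-decompose the sample covariance of Z and keep only the
# directions whose eigenvalues clearly exceed the noise floor (~ sigma^2).
Zc = Z - Z.mean(axis=0)
evals, evecs = np.linalg.eigh(Zc.T @ Zc / (n - 1))
signal = evecs[:, evals > 2.0 * sigma**2]     # estimated signal subspace

# Project the noisy data onto the signal subspace to strip most of the noise.
X_hat = Zc @ signal @ signal.T + Z.mean(axis=0)
rmse = np.sqrt(np.mean((X_hat - X) ** 2))
print(f"noise sigma = {sigma}, per-entry reconstruction RMSE = {rmse:.3f}")
```

The per-entry reconstruction error lands well below the noise level, which is exactly the kind of privacy breach of additive perturbation that the paper warns about.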