Lecture 24
New Applications
While databases are not new, advances in areas such as communications,
storage, processors, memory size, and the Internet have led to studies of new
ways to apply data to solutions:
-
Decision Support Systems: Businesses have begun to exploit available
data to make better decisions about their activities.
-
Spatial databases: Databases are now being used to store geographic
information, such as maps and associated information, and computer-aided-design
(CAD) information, such as integrated circuit and building designs.
The complexity and volume of the data, combined with the number of users,
exceed the capabilities of simple file systems.
-
Multimedia databases: Storing images, video, and audio in databases
has many advantages, but also has special needs when retrieving video
and audio.
-
Mobile databases: As users become equipped with mobile computing
systems with sophisticated communications, issues of what data to provide
and how to update during periods when communication is not possible have
become important.
-
Information retrieval: As more and more documents become available
electronically, finding the related documents becomes more complex.
-
Distributed information retrieval: Networking, and especially the
Internet, now makes information available in a multitude of formats and
with different interfaces, so finding and combining information is both more
challenging and more important.
Decision Support Systems
Because of all of the data being collected by corporations, there is a
large corpus of data (now growing into the terabyte range) available to
decision makers. The data ranges over the entire scope of company
operations, so questions such as which items sell best in which types
of situations can now be studied in order to improve corporate performance.
Now that disk capacities have expanded to multi-gigabyte devices, operating
systems can efficiently address data on those large devices, and memory
has decreased in cost to the point where computers can have 100MB+ of memory,
these studies can be done in a timely manner. However, new issues have
been raised:
-
SQL as currently defined does not support the kinds of questions being
asked by DSS. Extensions have been proposed.
-
Query languages are not suited for statistical analyses of data. Typically,
DBMSs must be enhanced with added packages.
-
Knowledge discovery techniques that draw on artificial intelligence,
statistical rules, and pattern matching are being developed in the field of
data mining so that extremely large data stores can be explored.
-
The problems of massive amounts of data from diverse sources in a multitude
of formats have led to the development of data warehousing.
This brings together all of that data into a unified schema at a single
site.
Data Analysis
Decision makers do not need all the raw data. They need the data
summarized into information, based on complex statistical analysis
of their data. Additionally, they need that information in different
formats: bar charts, histograms, line charts, etc. SQL as currently defined
cannot provide these, nor can it provide cross-tabulations. This area is extremely
ripe for new techniques and extensions of the query language.
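To make the idea of a cross-tabulation concrete, here is a minimal sketch in plain Python that summarizes raw rows into a region-by-product table. The sales records and the cross_tab helper are made up for illustration; they do not come from any particular DBMS or statistics package.

```python
from collections import defaultdict

# Hypothetical raw sales records: (region, product, units sold).
sales = [
    ("East", "bread", 10),
    ("East", "milk", 4),
    ("West", "bread", 7),
    ("West", "milk", 9),
    ("East", "bread", 5),
]

def cross_tab(rows):
    """Summarize rows into a table: table[region][product] = total units."""
    table = defaultdict(lambda: defaultdict(int))
    for region, product, units in rows:
        table[region][product] += units
    return table

table = cross_tab(sales)
products = sorted({product for _, product, _ in sales})

# Print one row per region and one column per product.
print("region".ljust(8) + "".join(p.rjust(8) for p in products))
for region in sorted(table):
    print(region.ljust(8) + "".join(str(table[region][p]).rjust(8) for p in products))
```

The decision maker sees one number per region and product instead of every raw sale, which is the kind of summary the paragraph above asks for.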
Data Mining
Data mining is the discovery of new knowledge from relevant information.
Note that this does not address the problem of finding relevant information;
rather, it discovers knowledge that can be expressed as rules.
An example would be, "Young women with annual incomes greater than $50,000
are the most likely people to buy small sports cars." Knowledge can be
represented as rules of the form:
For every variable (with its associated range), an antecedent
implies a consequent.
For example: for every transaction T in the relation buys, if T has a tuple where
milk was bought, then T also has a tuple where bread was bought.
Now we can study whether the store will have higher sales if bread and milk
are arranged near each other or if they are placed on opposite sides of the store
(requiring the customer to be tempted by the products in between).
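How well the data back up a rule such as the milk and bread rule is commonly measured with two numbers, support and confidence. These notes do not define them, so treat the sketch below as one common formulation under stated assumptions: the transactions are invented, and the function names are only illustrative.

```python
# Hypothetical market-basket data: each transaction is the set of items bought.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, baskets):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the transactions containing antecedent, the fraction that also
    contain consequent. Assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
print(f"milk => bread: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this made-up data, three of the five transactions contain both items, and three of the four transactions with milk also contain bread, so the rule holds often but not always.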
Data mining has two important classes of problems, classification
and association. An example of classification: when UMBC receives
applications for admission, which groups of students are most likely to graduate
and, even more important, which groups are most likely to go on to be
graduate students at UMBC? Then the recruiters can focus on which
students to encourage to enter our combined Bachelors/Masters program.
If we find out that no freshman students who graduated from high school
before 1915 go on to earn a Ph.D., then it would be a waste of time and
resources to approach them for doctoral studies; we can let
them approach us if they are interested. However, if we find that
particular high schools in the area are producing highly motivated and
knowledgeable students, we can send teams of recruiters there several times
each year, so we can capture a better share of that market. An example of
association: the college bookstore might find out that students who buy computers
are very likely to buy software. As a result, it might target future
advertising of new software to those students who bought their computers
in the bookstore.
Data mining can be user-directed or automatic. User-directed data mining is
when the user poses a hypothesis and the data mining system tests whether
it is true or false. Automatic data mining is when the
user requests information, such as which credit card applicants would have
the lowest risk for the credit card company, and the system develops a classification
tree showing all the groups and what their credit ratings are.
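A classification tree is built by repeatedly choosing the attribute that best separates the groups. The sketch below shows only the first step, picking the attribute for the root of the tree by counting misclassified rows. The applicants, attribute names, and helper functions are all hypothetical, and a real data mining system would use a more refined measure and keep splitting.

```python
from collections import Counter, defaultdict

# Hypothetical past applicants: attribute values plus the known outcome.
applicants = [
    {"income": "high", "owns_home": "yes", "risk": "low"},
    {"income": "high", "owns_home": "no",  "risk": "low"},
    {"income": "low",  "owns_home": "yes", "risk": "high"},
    {"income": "low",  "owns_home": "no",  "risk": "high"},
    {"income": "high", "owns_home": "yes", "risk": "low"},
    {"income": "low",  "owns_home": "yes", "risk": "low"},
]

def split_errors(rows, attribute, label="risk"):
    """Group rows by attribute value. Return how many rows disagree with
    their group's majority label, and the majority label of each group."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attribute]][row[label]] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    majority = {value: c.most_common(1)[0][0] for value, c in groups.items()}
    return errors, majority

def best_split(rows, attributes):
    """Pick the attribute whose groups are misclassified least often."""
    return min(attributes, key=lambda a: split_errors(rows, a)[0])

root = best_split(applicants, ["income", "owns_home"])
errors, rule = split_errors(applicants, root)
print(root, rule, errors)
```

On this invented data the split on income misclassifies one applicant while the split on home ownership misclassifies two, so income becomes the root.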
Data Warehousing
A data warehouse is a repository (or archive) of information gathered
from multiple sources, stored under a unified schema at a single site.
Once gathered, the data are stored for a long time. Since there is
a single interface, queries are easier to write, which provides
a greater degree of support to the users.
Issues to consider include:
-
When and how to gather data
-
What schema to use
-
How to propagate updates
-
What data to summarize (and not store the details)
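The idea of a unified schema can be sketched with Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the two source systems, their record formats, the sales table, and the loader functions. The point is only that each source is mapped into one schema when it is gathered, after which a single query covers all of them.

```python
import sqlite3

# Two hypothetical source systems that record sales in different formats.
store_a = [("1998-11-02", "bread", 3), ("1998-11-02", "milk", 2)]  # (date, item, quantity)
store_b = [{"item": "milk", "qty": 5, "day": "1998-11-03"}]        # other field names

warehouse = sqlite3.connect(":memory:")
# The unified schema: every source is mapped into this one table.
warehouse.execute(
    "CREATE TABLE sales (source TEXT, sale_date TEXT, item TEXT, quantity INTEGER)"
)

def load_store_a(conn, rows):
    """Store A already matches the column order of the unified schema."""
    conn.executemany("INSERT INTO sales VALUES ('A', ?, ?, ?)", rows)

def load_store_b(conn, rows):
    """Store B must be renamed and reordered to fit the unified schema."""
    conn.executemany(
        "INSERT INTO sales VALUES ('B', ?, ?, ?)",
        [(r["day"], r["item"], r["qty"]) for r in rows],
    )

load_store_a(warehouse, store_a)
load_store_b(warehouse, store_b)

# One query against the unified schema now covers both sources.
totals = warehouse.execute(
    "SELECT item, SUM(quantity) FROM sales GROUP BY item ORDER BY item"
).fetchall()
print(totals)
```

The loaders are where the issues listed above show up: when they run decides how fresh the data is, and storing only the totals instead of each row would be a decision about what to summarize.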
Spatial and Geographic Databases
Both of these types of databases use data types that are different from
those of traditional databases. They are also used to answer
totally different types of questions. They require a different knowledge
representation format and a different set of operators on the data.
Both have different types of outputs as well. CAD systems might be
queried to see if a certain type of component has already been designed
and whether it could be reused in the current project. Geographic databases
might be queried to find the nearest hospital, to find the population centers
with the densest population, or to study areas that are located
within a certain distance of a lake of at least a particular size.
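Queries such as "nearest hospital" and "within a certain distance" need operators that ordinary comparisons on numbers and strings do not provide. The following is a minimal sketch assuming flat (x, y) coordinates in kilometers; the hospital names, locations, and function names are invented. It checks every site in turn, whereas a real spatial database would rely on a spatial index to avoid that.

```python
import math

# Hypothetical hospitals with planar (x, y) coordinates in kilometers.
hospitals = {
    "General": (2.0, 3.0),
    "St. Mary": (10.0, 1.0),
    "County": (5.0, 8.0),
}

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest(point, sites):
    """Name of the site closest to point."""
    return min(sites, key=lambda name: distance(point, sites[name]))

def within(point, sites, radius):
    """Sorted names of all sites no farther than radius from point."""
    return sorted(name for name, loc in sites.items() if distance(point, loc) <= radius)

here = (4.0, 4.0)
print(nearest(here, hospitals))
print(within(here, hospitals, 5.0))
```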
Multimedia Databases
Multimedia data requires data types that can hold extremely large objects
that are not as simple as characters and numbers. There are currently
projects to store music albums and movies in databases. Another area
receiving attention is imaging the records of an organization and storing
those images (and maybe the text from within the image) in the database.
Storing one minute of video can take over 75 MB (compression can reduce
it to 12.5 MB). Then comes the requirement to retrieve the video
at a constant rate so that it can be viewed at the proper speed. Cable
companies are interested in providing movies on demand to their customers.
Technology may make this possible.
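The figures above translate directly into storage and delivery requirements. This short calculation uses the 75 MB and 12.5 MB per minute from the text; the two-hour movie length is an assumption chosen for illustration.

```python
# Figures taken from the text: storage for one minute of video.
RAW_MB_PER_MINUTE = 75.0
COMPRESSED_MB_PER_MINUTE = 12.5
MOVIE_MINUTES = 120  # assumption: a two-hour movie

def rate_mb_per_second(mb_per_minute):
    """Sustained rate the database must deliver for normal-speed playback."""
    return mb_per_minute / 60

def movie_size_mb(mb_per_minute, minutes):
    """Total storage for a movie of the given length."""
    return mb_per_minute * minutes

compression_ratio = RAW_MB_PER_MINUTE / COMPRESSED_MB_PER_MINUTE

for label, per_minute in [("raw", RAW_MB_PER_MINUTE),
                          ("compressed", COMPRESSED_MB_PER_MINUTE)]:
    print(f"{label}: {rate_mb_per_second(per_minute):.2f} MB/s, "
          f"{movie_size_mb(per_minute, MOVIE_MINUTES):.0f} MB per movie")
print(f"compression ratio: {compression_ratio:.0f} to 1")
```

With these figures a two-hour movie needs about 9,000 MB uncompressed or 1,500 MB compressed, and playback needs a steady 1.25 MB per second uncompressed or roughly 0.21 MB per second compressed.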
Mobility and Personal Databases
Two technologies have opened new areas of use for large databases:
-
Notebook (palmtop) computers allow users to pack up their computing resources
and relocate constantly. These users want to continue to receive
the same support they were getting when they were chained to their desks.
-
Low-cost wireless digital communications infrastructure based on wireless
LANs and digital packet networks enables the user to stay in communication
when traveling.
The applications to make this possible are becoming increasingly available
in a cost-effective manner. However, we still have to solve issues
of how to communicate with mobile users (including recovery after a communications
drop-out), what data is to be made available to the mobile user, and how
to ensure the integrity of data on the mobile system. To raise the
complexity, users want to be mobile and to use different computers at different
times and still have everything work correctly.
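One way to think about updates during a communications drop-out is to queue them on the mobile system and replay them when the connection returns, letting the server reject any update that was based on a stale copy. The sketch below is a toy model under that assumption. The Server and MobileClient classes, the version numbers, and the reject-on-conflict policy are all hypothetical, and other policies are possible.

```python
class Server:
    """Holds the master copy of each item as key -> (value, version)."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key, (None, 0))

    def write(self, key, value, base_version):
        """Accept a write only if it was based on the current version."""
        _, current = self.read(key)
        if base_version != current:
            return False  # the item changed since the client last saw it
        self.data[key] = (value, current + 1)
        return True


class MobileClient:
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.cache = {}    # local copies: key -> (value, version)
        self.pending = {}  # queued updates: key -> (value, base version)

    def fetch(self, key):
        self.cache[key] = self.server.read(key)

    def update(self, key, value):
        """Change the local copy and queue the change for the server."""
        _, version = self.cache.get(key, (None, 0))
        self.cache[key] = (value, version)
        self.pending[key] = (value, version)
        return self.sync() if self.connected else []

    def sync(self):
        """Replay queued updates; return the keys the server rejected."""
        conflicts = []
        for key, (value, version) in self.pending.items():
            if not self.server.write(key, value, version):
                conflicts.append(key)
            self.fetch(key)  # refresh from the master copy either way
        self.pending = {}
        return conflicts


server = Server()
server.write("price", 10, 0)

client = MobileClient(server)
client.fetch("price")

client.connected = False      # communications drop out
client.update("price", 12)    # queued locally
client.update("stock", 5)     # queued locally
server.write("price", 11, 1)  # meanwhile, someone else updates price

client.connected = True       # communications return
conflicts = client.sync()
print(conflicts, server.read("price"), server.read("stock"))
```

The update to stock goes through, while the update to price is reported as a conflict because the master copy changed during the drop-out. Deciding what to do with such conflicts is exactly the integrity issue raised above.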
Information Retrieval Systems
The explosive growth of databases has forced Information Retrieval
systems to develop in parallel. If information is organized in electronic
documents, how do we locate the documents that are related to a specific
problem? The first problem is to find documents that appear to be related
somehow, and then to refine that set so the result is what the user is looking
for.
Suppose we search for a term such as "heavy metal".
Should the query be case sensitive or not?
Then, when documents containing that phrase are located,
are they referring to chemistry or music? If the search is
for "software agents", should "mobile agents", "intelligent agents" or
"software publishing" be considered related? Does the system
learn about the user and use that knowledge to refine the query?
If I only search for software issues and then search for "robot", does it
know that "searchbot" is a related term?
Documents can be located by the use of indexing (similar to the card
catalog in the library), but with the advances in networking, there are
now new problems of indexing documents that are stored on remote sites.
New tools are also available with hypertext linking of items.
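A simple form of indexing is an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal version with made-up documents and function names; it settles the case-sensitivity question by lowercasing everything, which is one choice among several.

```python
import re
from collections import defaultdict

# Hypothetical documents: id -> text.
documents = {
    1: "Heavy metal contamination in river water",
    2: "The history of heavy metal music",
    3: "Software agents and intelligent agents",
}

def tokenize(text):
    """Lowercase the text and split it into words, so search ignores case."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(documents)
print(sorted(search(index, "Heavy Metal")))
print(sorted(search(index, "metal music")))
```

Searching for "Heavy Metal" returns both the chemistry document and the music document, which shows why an index alone does not answer the question of which meaning the user had in mind.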
Distributed Information Systems
Because of these advances, it is no longer required that all data be located
on one single site. As a result, new tools are being developed to
access that remote data, such as Gopher, FTP, and telnet. However,
users are demanding that we make this transparent.
The World Wide Web
The World Wide Web (WWW) has had a profound impact on society in so
many ways. Because of the Web and easy-to-use browsers, people who
have never used a computer before now own one. Much data is available
in what is called HyperText Markup Language (HTML). This allows computers
around the world, running different kinds of operating systems, browsers,
communications systems, and everything else, to see this document in one
format. More exactly, it allows me to author this page in one format
and all systems can have equal access to the page.
This has led to many new issues:
-
Uniform Resource Locators (URLs) that allow a common method to refer
to remote locations:
http://www.csee.umbc.edu/~burt
However, we are now using up domains (such as edu) so fast that
we will have to extend this system.
-
Web servers are now expected to do more than simply transfer documents.
-
Display languages are being pushed beyond what they were designed for as
new classes of users make ever more complex demands on the system.
HTML and Java provide both static and dynamic behavior to web pages.
-
Web interfaces to databases have added to the ways users can interface with
databases, both local and remote.
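The parts of a URL can be seen by taking apart the example address above with Python's standard urllib.parse module. The variable names are only illustrative, and taking the last label of the host name as the domain is a simplification that fits this example.

```python
from urllib.parse import urlparse

url = "http://www.csee.umbc.edu/~burt"
parts = urlparse(url)

scheme = parts.scheme  # how to fetch the resource
host = parts.hostname  # which machine to contact
path = parts.path      # which document on that machine

# The last label of the host name is the domain mentioned in the text.
top_level_domain = host.rsplit(".", 1)[-1]

print(scheme, host, path, top_level_domain)
```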
Conclusion
There are many new areas within the entire industry in which to apply databases.
The advances are coming hourly, and it is difficult to become current and
stay current. The CSEE Department is very involved in the research that
is making this happen. Check out the department's web pages to see
what we are doing and the labs we have built for that research. Also
remember that we have courses in those areas.