New Applications

UMBC CMSC 461 Fall '98

Lecture 24
New Applications

While databases are not new, advances in areas such as communications, storage, processors, memory size, and the Internet have led to studies of new ways to apply data to solutions:

Decision Support Systems: Businesses have begun to exploit available data to make better decisions about their activities.
Spatial databases: Databases are now being used to store geographic information, such as maps and associated information, and computer-aided-design (CAD) information, such as integrated circuit and building designs. The complexity and volume of the data and the number of users combine to exceed the capabilities of simple file systems.
Multimedia databases: Storing images, video, and audio in databases has many advantages, but also has special requirements when retrieving video and audio.
Mobile databases: As users become equipped with mobile computing systems with sophisticated communications, issues of what data to provide and how to update it during periods when communication is not possible have become important.
Information retrieval: As more and more documents become available electronically, finding the related documents becomes more complex.
Distributed information retrieval: Networking, and especially the Internet, now makes information available in a multitude of formats and with different interfaces, so finding and combining information is both more challenging and more important.

Decision Support Systems

Because of all of the data being collected by corporations, there is a large corpus of data (now growing into the terabyte range) available to decision makers. The data ranges over the entire scope of company operations, and questions about what items sell best in what types of situations can now be studied in order to improve corporate performance. Now that disk capacities have expanded to multi-gigabyte devices, operating systems can efficiently address data on those large devices, and memory has decreased in cost to the point where computers can have 100MB+ of memory, such studies can be done in a timely manner. However, new issues have been raised:

SQL as currently defined does not support the kinds of questions being asked by DSS. Extensions have been proposed.
Query languages are not suited for statistical analysis of data. Typically, DBMSs must be enhanced with added packages.
Knowledge discovery techniques drawing on artificial intelligence, statistical rules, and pattern matching are being developed in the field of data mining so that extremely large data stores can be explored.
The problems of massive amounts of data from diverse sources in a multitude of formats have led to the development of data warehousing. This brings together all of that data into a unified schema at a single site.

Data Analysis

Decision makers do not need all the raw data. They need the data summarized into information, based on complex statistical analysis. Additionally, they need that information in different formats: bar charts, histograms, line charts, etc. SQL cannot provide these, nor can it provide cross-tabulations. This area is extremely ripe for new techniques and extensions of the query language.
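To make the idea of a cross-tabulation concrete, here is a minimal Python sketch that summarizes raw detail rows into a region-by-product grid with row and column totals. The `sales` rows and the names `cross_tab` and `render` are made up for illustration; they are not part of any DBMS or of the lecture.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, product, sales amount in dollars).
sales = [
    ("East", "bread", 120),
    ("East", "milk", 80),
    ("West", "bread", 60),
    ("West", "milk", 150),
    ("East", "bread", 30),
]

def cross_tab(rows):
    """Sum amounts into a region-by-product grid with row and column totals."""
    cells = defaultdict(int)        # (region, product) -> summed amount
    row_totals = defaultdict(int)   # region -> total across products
    col_totals = defaultdict(int)   # product -> total across regions
    for region, product, amount in rows:
        cells[(region, product)] += amount
        row_totals[region] += amount
        col_totals[product] += amount
    return cells, row_totals, col_totals

def render(cells, row_totals, col_totals):
    """Lay the summary out as tab-separated lines, one region per row."""
    products = sorted(col_totals)
    lines = ["\t".join(["region"] + products + ["total"])]
    for region in sorted(row_totals):
        amounts = [str(cells.get((region, p), 0)) for p in products]
        lines.append("\t".join([region] + amounts + [str(row_totals[region])]))
    totals = [str(col_totals[p]) for p in products]
    lines.append("\t".join(["total"] + totals + [str(sum(row_totals.values()))]))
    return lines

cells, row_totals, col_totals = cross_tab(sales)
for line in render(cells, row_totals, col_totals):
    print(line)
```

The decision maker sees four summarized cells plus totals rather than the five detail rows; with real data the reduction would be far larger.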

Data Mining

Data mining is the discovery of new knowledge from relevant information. Note that this does not address the problem of finding relevant information; rather, it discovers knowledge that can be expressed as rules. An example would be, "Young women with annual incomes greater than $50,000 are the most likely people to buy small sports cars." Knowledge represented as rules can take the form:

For every variable (with its associated range), there is an antecedent that implies a consequent. For example: for every transaction T in the relation buys, if T has a tuple where milk was bought, then T also has a tuple where bread was bought. Now we can study whether the store will have higher sales if bread and milk are arranged near each other, or if they are placed on opposite sides of the store (requiring the customer to walk past, and be tempted by, the products in between).
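A rule such as "milk implies bread" rarely holds for every transaction, so in practice its strength is measured. The sketch below uses two standard measures that the lecture does not name, support and confidence, on made-up market-basket data. The `transactions` list and the function names are assumptions for illustration only.

```python
# Hypothetical market-basket data: each transaction is the set of items bought.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent. Returns 0.0 if the antecedent never occurs."""
    matched = count(antecedent, baskets)
    if matched == 0:
        return 0.0
    return count(antecedent | consequent, baskets) / matched

milk_and_bread = support({"milk", "bread"}, transactions)
milk_implies_bread = confidence({"milk"}, {"bread"}, transactions)
print(milk_and_bread, milk_implies_bread)
```

Here three of the five baskets contain both items, and three of the four baskets with milk also contain bread, so the rule holds for most, but not all, milk buyers.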

Data mining has two important classes of problems: classification and association. As an example of classification, when UMBC receives applications for admission, we could ask which groups of students are most likely to graduate and, even more important, which groups are most likely to go on to be graduate students at UMBC. Then the recruiters can focus on which students to encourage to enter our combined Bachelor's/Master's program. If we find that no freshman students who graduated from high school before 1915 go on to earn a Ph.D., then it would be a waste of time and resources to approach them about doctoral studies; we can let them approach us if they are interested. However, if we find that particular high schools in the area are producing highly motivated and knowledgeable students, we can send teams of recruiters there several times each year, so we can capture a better share of that market.

As an example of association, the college bookstore might find that students who buy computers are very likely to buy software. As a result, it might target future advertising of new software to those students who bought their computers in the bookstore.

Data mining can be user-directed or automatic. In user-directed data mining, the user poses a question and the data mining system shows whether it is true or false. In automatic data mining, the user requests information, such as which credit card applicants would pose the lowest risk for the credit card company, and the system develops a classification tree showing all the groups and what their credit ratings are.

Data Warehousing

A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema at a single site. Once gathered, the data are stored for a long time. Since there is a single interface, queries are easier to write, which provides a greater degree of support to the users.

Issues to consider include:

When and how to gather data
What schema to use
How to propagate updates
What data to summarize (and not store the details)
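The unified-schema idea can be sketched with Python's built-in `sqlite3` module. The two source systems, their field names, and the `sales` table below are all hypothetical; the point is only that records arriving in different shapes are converted to one schema, after which a single query covers every source.

```python
import sqlite3

# Two hypothetical source systems that record the same facts differently.
store_a_rows = [                       # dicts, prices already in cents
    {"sku": "milk", "qty": 3, "price_cents": 250},
    {"sku": "bread", "qty": 1, "price_cents": 200},
]
store_b_rows = [                       # tuples: (item, units, price in dollars)
    ("bread", 2, 1.99),
    ("milk", 1, 2.50),
]

def build_warehouse(a_rows, b_rows):
    """Load both sources into one in-memory table under a single schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "source TEXT, item TEXT, quantity INTEGER, unit_price_cents INTEGER)"
    )
    unified = [("store_a", r["sku"], r["qty"], r["price_cents"]) for r in a_rows]
    # Store B reports dollars, so convert to cents to match the unified schema.
    unified += [("store_b", item, units, round(price * 100))
                for item, units, price in b_rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", unified)
    return conn

conn = build_warehouse(store_a_rows, store_b_rows)

# One query now answers a question that spans both sources.
revenue_by_item = dict(conn.execute(
    "SELECT item, SUM(quantity * unit_price_cents) FROM sales GROUP BY item"
))
print(revenue_by_item)
```

The conversion step is where the schema question above gets answered. Storing only the per-item sums instead of the individual rows would correspond to summarizing and discarding the details.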

Spatial and Geographic Databases

Both of these types of database use data types that are different from those of traditional databases. They are also used to answer totally different types of questions. They require a different knowledge representation format and a different set of operators on the data. Both have different types of outputs as well. CAD systems might be queried to see if a certain type of component has already been designed and whether it could be reused in the current project. Geographic databases might be queried to find the nearest hospital, to find the population centers with the densest populations, or to study areas that are located within a certain distance of a lake of at least a particular size.
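The "nearest hospital" and "within a certain distance" queries can be sketched in a few lines of Python. This assumes a flat map with made-up coordinates in kilometers and scans every site. A real spatial database would use specialized index structures and proper geographic coordinates, which this sketch does not attempt.

```python
import math

# Hypothetical sites on a flat map; coordinates are (x, y) in kilometers.
hospitals = {
    "General": (2.0, 3.0),
    "St. Mary": (10.0, 1.0),
    "County": (6.0, 8.0),
}

def distance(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest(point, sites):
    """Name of the site closest to point."""
    return min(sites, key=lambda name: distance(point, sites[name]))

def within(point, sites, radius):
    """Names of all sites no farther than radius from point, sorted by name."""
    return sorted(name for name, location in sites.items()
                  if distance(point, location) <= radius)

here = (5.0, 7.0)
print(nearest(here, hospitals))
print(within(here, hospitals, 6.0))
```

Neither question can be phrased naturally with the equality and range comparisons on characters and numbers that traditional data types offer, which is why these databases need their own operators.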

Multimedia Databases

Multimedia data requires data types that can hold extremely large objects that are not as simple as characters and numbers. There are currently projects to store music albums and movies in databases. Another area receiving attention is imaging the records of an organization and storing those images (and perhaps the text from within each image) in the database.

Storing one minute of video can take over 75 MB (compression can reduce it to 12.5 MB). Then comes the requirement to retrieve the video at a constant rate so that it can be viewed at the proper speed. Cable companies are interested in providing movies on demand to their customers. Technology may make this possible.
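The per-minute figures above translate directly into the sustained delivery rate and the total storage a movie would need. The following Python sketch works through that arithmetic; the two-hour movie length is an assumption chosen for illustration.

```python
# Figures from the text, in megabytes per minute of video.
UNCOMPRESSED_MB_PER_MINUTE = 75.0
COMPRESSED_MB_PER_MINUTE = 12.5

# Assumed length of a feature film, in minutes (not from the lecture).
MOVIE_MINUTES = 120

def rate_mb_per_second(mb_per_minute):
    """Constant delivery rate needed to play video at the proper speed."""
    return mb_per_minute / 60

def storage_mb(mb_per_minute, minutes):
    """Total space needed to store a video of the given length."""
    return mb_per_minute * minutes

compression_ratio = UNCOMPRESSED_MB_PER_MINUTE / COMPRESSED_MB_PER_MINUTE

print(rate_mb_per_second(UNCOMPRESSED_MB_PER_MINUTE))   # MB/s, uncompressed
print(rate_mb_per_second(COMPRESSED_MB_PER_MINUTE))     # MB/s, compressed
print(storage_mb(UNCOMPRESSED_MB_PER_MINUTE, MOVIE_MINUTES))
print(storage_mb(COMPRESSED_MB_PER_MINUTE, MOVIE_MINUTES))
print(compression_ratio)
```

Uncompressed video needs 1.25 MB every second, and the assumed two-hour movie occupies 9,000 MB. The 6:1 compression brings that down to about 0.21 MB per second and 1,500 MB, which the database must still deliver without interruption.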

Mobility and Personal Databases

Two technologies have opened new areas of use for large databases:

Notebook (palmtop) computers allow users to pack up their computing resources and relocate constantly. These users want to continue to receive the same support they were getting when they were chained to their desks.
Low-cost wireless digital communications infrastructure, based on wireless LANs and digital packet networks, enables users to stay in communication when traveling.

The applications that make this possible are becoming increasingly available and cost-effective. However, we still have to solve the issues of how to communicate with mobile users (including recovery after a communications drop-out), what data is to be made available to the mobile user, and how to ensure the integrity of data on the mobile system. To add to the complexity, users want to be mobile, use different computers at different times, and still have everything work correctly.

Information Retrieval Systems

The explosive growth of databases has forced information retrieval systems to develop in parallel. If information is organized in electronic documents, how do we locate the documents that are related to a specific problem? The first problem is to find documents that appear to be related somehow, and then to refine that set so the result is what the user is looking for.

Suppose we search for a term that is the subject of our interest, such as "heavy metal". Should the query be case sensitive or not? Then, when documents containing those eleven characters are located, are they referring to chemistry or music? If the search is for "software agents", should "mobile agents", "intelligent agents", or "software publishing" be considered related? Does the system learn about the user and use that knowledge to refine the query? If I only search for software issues and then search for "robot", does it know that "searchbot" is a related term?

Documents can be located by the use of indexing (similar to the card catalog in a library), but with the advances in networking, there are now new problems of indexing documents that are stored on remote sites. New tools are also available with hypertext linking of items.
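One common way to build such an index is an inverted index, which maps each term to the documents that contain it. The Python sketch below uses three made-up documents and resolves the case-sensitivity question by lowercasing everything. The document ids and function names are hypothetical, and the search matches documents containing all query terms anywhere, not the exact phrase.

```python
import re
from collections import defaultdict

# Hypothetical document collection: id -> text.
documents = {
    "chem101": "Lead and mercury are heavy metal pollutants.",
    "music42": "The band plays Heavy Metal and hard rock.",
    "agents7": "Software agents can search remote sites.",
}

def tokenize(text):
    """Split text into lowercase terms so that matching ignores case."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(documents)
print(sorted(search(index, "Heavy Metal")))
```

The query returns both the chemistry document and the music document, which is exactly the ambiguity described above: matching the characters is the easy part, and deciding which meaning the user intended is not.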

Distributed Information Systems

Because of these advances, it is no longer required that all data be located at a single site. As a result, new tools are being developed to access that remote data, such as Gopher, FTP, and telnet. However, users are demanding that we make this transparent.

The World Wide Web

The World Wide Web (WWW) has had a profound impact on society in many ways. Because of the Web and easy-to-use browsers, people who had never used a computer before now own one. Much data is available in what is called HyperText Markup Language (HTML). This allows computers around the world, running different kinds of operating systems, browsers, communications systems, and everything else, to see this document in one format. More exactly, it allows me to author this page in one format, and all systems have equal access to the page.

This has led to many new issues:

Uniform Resource Locators (URLs) allow a common method to refer to remote locations.

Web servers are now expected to do more than simply transfer documents.
Display languages are being pushed beyond what they were designed for as new classes of users make ever-increasing demands on the system. HTML and Java provide both static and dynamic behavior for web pages.
Web interfaces to databases have added to the ways users can interface with databases, both local and remote.
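To illustrate how a URL gives a common way to refer to a remote location, the following Python sketch splits a made-up URL into its parts using the standard library's `urllib.parse.urlsplit`. The host, port, and path are hypothetical and do not refer to a real server.

```python
from urllib.parse import urlsplit

# Hypothetical URL, used only to show the parts a URL carries.
url = "http://www.example.edu:8080/courses/lecture24.html?term=fall#mining"

parts = urlsplit(url)

print(parts.scheme)     # how to talk to the server (the protocol)
print(parts.hostname)   # which machine holds the document
print(parts.port)       # which service on that machine
print(parts.path)       # which document on that server
print(parts.query)      # extra parameters passed to the server
print(parts.fragment)   # a location within the document
```

Because every part has a fixed place, any browser on any operating system can work out which machine to contact, how to talk to it, and what to ask for.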

Conclusion

There are many new areas within the entire industry in which to apply databases. The advances are coming hourly, and it is difficult to become current and stay current. The CSEE Department is very involved in the research that is making this happen. Check out the department's web pages to see what we are doing and the labs we have built for that research. Remember that we also have courses in those areas.
