UMBC CMSC 461 Fall '98  CSEE | 461 | 461 F'98 | lectures | news | help 

Lecture 24
New Applications

While databases are not new, advances in communications, storage, processors, memory size, and the Internet have led to studies of new ways to apply data to solutions:

Decision Support Systems

Because corporations collect so much data, there is a large corpus of data (now growing into the terabyte range) available to decision makers.  The data range over the entire scope of company operations, so questions about which items sell best in which types of situations can now be studied in order to improve corporate performance.  Disk capacities have expanded to multi-gigabyte devices, operating systems can efficiently address data on those large devices, and memory has decreased in cost to the point where computers can now have 100MB+ of memory, so these studies can be done in a timely manner.  However, new issues have been raised:

Data Analysis

The decision makers do not need all the raw data.  They need the data summarized into information, based on complex statistical analysis of their data.  Additionally, they need that information in different formats: bar charts, histograms, line charts, etc.  SQL cannot provide these, nor can it provide cross-tabulations.  This area is extremely ripe for new techniques and extensions of the query language.
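As a minimal sketch of what a cross-tabulation is, the following Python summarizes raw sales rows into a region-by-product table in application code, outside the query language.  The sales records and the name cross_tab are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw records: (region, product, units sold).
sales = [
    ("East", "bread", 10),
    ("East", "milk", 4),
    ("West", "bread", 7),
    ("West", "milk", 9),
    ("East", "bread", 5),
]

def cross_tab(rows):
    """Sum units into a {region: {product: total}} table."""
    table = defaultdict(lambda: defaultdict(int))
    for region, product, units in rows:
        table[region][product] += units
    return {region: dict(cols) for region, cols in table.items()}

table = cross_tab(sales)

# Print one row per region and one column per product.
products = sorted({product for _, product, _ in sales})
print("region".ljust(8) + "".join(p.rjust(8) for p in products))
for region in sorted(table):
    cells = "".join(str(table[region].get(p, 0)).rjust(8) for p in products)
    print(region.ljust(8) + cells)
```

The decision maker sees five raw rows reduced to a two-by-two summary, which is the kind of output that could then feed a bar chart or histogram.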

Data Mining

Data Mining is the discovery of new knowledge from relevant information.  Note that this does not address the problem of finding relevant information; rather, it discovers knowledge that can be expressed as rules.  An example would be, "Young women with annual incomes greater than $50,000 are the most likely people to buy small sports cars."  Knowledge represented as rules can take the form: for any transaction T in the relation buys, if T has a tuple where milk was bought, then T also has a tuple where bread was bought.  Now we can study whether the store will have higher sales if bread and milk are arranged near each other or if they are placed on opposite sides of the store (requiring the customer to be tempted by the products in between).
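One hedged way to check how well a rule such as "milk implies bread" holds is to count transactions.  The sketch below uses two common measures, support and confidence; these measures, the sample baskets, and the name rule_stats are assumptions for illustration and are not taken from the lecture.

```python
# Hypothetical market baskets; each set is one transaction T in the relation buys.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = fraction of all baskets containing both items
    confidence = fraction of baskets with the antecedent that also
                 contain the consequent
    """
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

support, confidence = rule_stats(transactions, "milk", "bread")
print(f"milk => bread: support={support:.2f}, confidence={confidence:.2f}")
```

In this sample the rule does not hold for every transaction (one milk buyer skipped the bread), which is why real systems report how often a rule holds rather than a strict true or false.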

Data mining has two important classes of problems, classification and association.  As an example of classification, when UMBC receives applications for admission, we could ask which groups of students are most likely to graduate and, even more important, which groups are most likely to go on to be graduate students at UMBC.  Then the recruiters can focus on which students to encourage to enter our combined Bachelors/Masters program.  If we find out that no freshman students who graduated from high school before 1915 go on to earn a Ph.D., then it would be a waste of time and resources to approach them for doctoral studies; we can let them approach us if they are interested.  However, if we find that particular high schools in the area are producing highly motivated and knowledgeable students, we can send teams of recruiters there several times each year, so we can capture a better share of that market.  As an example of association, the college bookstore might find out that students who buy computers are very likely to buy software.  As a result, it might target future advertising of new software to those students who bought their computers in the bookstore.

Data mining can be user-directed or automatic.  User-directed data mining is when the user poses a question and the data mining system shows whether it is true or false.  Automatic data mining is when the user requests information, such as which credit card applicants would pose the lowest risk for the credit card company, and the system develops a classification tree showing all the groups and what their credit ratings are.
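To make the idea of a classification tree concrete, here is a minimal sketch of its smallest case: a single split on one attribute.  The applicant data, the choice of income as the only attribute, and the name best_split are all assumptions for illustration; a real system would split repeatedly on many attributes.

```python
# Hypothetical applicants: (annual income in $1000s, defaulted on credit?).
applicants = [
    (18, True), (22, True), (30, True), (41, False),
    (55, False), (72, False), (35, True), (48, False),
]

def best_split(rows):
    """Find the income threshold that misclassifies the fewest applicants
    when everyone below it is predicted high risk and everyone else low risk.
    Returns (threshold, number_of_errors)."""
    best_threshold, best_errors = None, len(rows) + 1
    for threshold, _ in rows:
        # An applicant is misclassified when the prediction (income < threshold)
        # disagrees with what actually happened (defaulted).
        errors = sum((income < threshold) != defaulted for income, defaulted in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors

threshold, errors = best_split(applicants)
print(f"split at income < {threshold}: {errors} misclassified")
```

Each side of the split is one group in the tree; repeating the search inside each group on other attributes would grow the tree further.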

Data Warehousing

A data warehouse is a repository (or archive) of information gathered from multiple sources, stored under a unified schema at a single site.  Once gathered, the data are stored for a long time.  Since there is a single interface, queries are easier to write, which provides a greater degree of support to the users.

Issues to consider become:

Spatial and Geographic Databases

Both of these types of database use data types that are different from those of traditional databases.  They are also used to answer totally different types of questions.  They require a different knowledge representation format and a different set of operators on the data.  Both have different types of output as well.  CAD systems might be queried to see if a certain type of component has already been designed and whether it could be reused in the current project.  Geographic databases might be queried to find the nearest hospital, to find the population centers with the densest population, or to study areas that are located within a certain distance of a lake of at least a particular size.
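The two geographic queries mentioned above, "nearest" and "within a certain distance", can be sketched as follows.  This is only an illustration under stated assumptions: the hospital names and coordinates are hypothetical, locations are treated as points on a flat plane measured in kilometers, and every site is scanned, where a real geographic database would use a spatial index.

```python
import math

# Hypothetical hospital locations as planar (x, y) coordinates in kilometers.
hospitals = {
    "General": (2.0, 3.0),
    "St. Mary": (8.0, 1.0),
    "County": (5.0, 9.0),
}

def nearest(point, sites):
    """Return the name of the site closest to point (straight-line distance)."""
    return min(sites, key=lambda name: math.dist(point, sites[name]))

def within(point, sites, radius):
    """Return the sorted names of all sites no farther than radius from point."""
    return sorted(name for name, loc in sites.items() if math.dist(point, loc) <= radius)

here = (4.0, 4.0)
print(nearest(here, hospitals))
print(within(here, hospitals, 5.05))
```

Neither operator has an equivalent among the comparison operators of a traditional database, which is the point the paragraph above makes about needing a different set of operators.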

Multimedia Databases

Multimedia data requires data types that can hold extremely large objects that are not as simple as characters and numbers.  There are currently projects to store music albums and movies in databases.  Another area receiving attention is imaging the records of an organization and storing those images (and maybe the text from within the images) in the database.

Storing one minute of video can take over 75 MB (compression can reduce it to 12.5 MB).  Then comes the requirement to retrieve the video at a constant rate so that it can be viewed at the proper speed.  Cable companies are interested in providing movies on demand to their customers.  Technology may make this possible.
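The figures above imply both a storage requirement and a sustained delivery rate.  The short calculation below works them out; the two-hour movie length is an assumed example, and the per-minute sizes are simply the ones quoted in the text.

```python
# Per-minute sizes quoted in the text, in megabytes.
RAW_MB_PER_MIN = 75.0
COMPRESSED_MB_PER_MIN = 12.5

def movie_size_mb(minutes, mb_per_min):
    """Total storage needed for a video of the given length."""
    return minutes * mb_per_min

def rate_mb_per_sec(mb_per_min):
    """Constant rate the database must sustain for smooth playback."""
    return mb_per_min / 60

MOVIE_MINUTES = 120  # assumed length of a feature film

raw_size = movie_size_mb(MOVIE_MINUTES, RAW_MB_PER_MIN)
compressed_size = movie_size_mb(MOVIE_MINUTES, COMPRESSED_MB_PER_MIN)
ratio = RAW_MB_PER_MIN / COMPRESSED_MB_PER_MIN

print(f"raw: {raw_size:.0f} MB, compressed: {compressed_size:.0f} MB, ratio {ratio:.0f}:1")
print(f"raw rate: {rate_mb_per_sec(RAW_MB_PER_MIN):.2f} MB/s, "
      f"compressed rate: {rate_mb_per_sec(COMPRESSED_MB_PER_MIN):.3f} MB/s")
```

Under these assumptions a single uncompressed movie would fill a multi-gigabyte disk by itself, and every concurrent viewer adds another constant-rate stream the server must sustain.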

Mobility and Personal Databases

Two technologies have proven to open new areas of use for large databases: The applications to make this possible are becoming increasingly available and in a cost-effective manner.  However, we still have to solve the issues of how to communicate with mobile users (including recovery after a communications drop-out), what data is to be made available to the mobile user, and how to ensure the integrity of data on the mobile system.  To raise the complexity, users want to be mobile and to use different computers at different times and still have everything work correctly.

Information Retrieval Systems

The explosive growth of databases has forced Information Retrieval systems to develop in parallel.  If information is organized in electronic documents, how do we locate the documents that are related to a specific problem?  The first problem is to find documents that appear to be related somehow, and then to refine that set so the result is what the user is looking for.

Suppose we search for a term that is the subject of our search, such as "heavy metal".  Should the query be case sensitive or not?  Then, when documents containing those eleven characters are located, are they referring to chemistry or music?  If the search is for "software agents", should "mobile agents", "intelligent agents" or "software publishing" be considered related?  Does the system learn about the user and use that knowledge to refine the query?  If I only search for software issues and then search for "robot", does it know that "searchbot" is a related term?

Documents can be located by the use of indexing (similar to the card catalog in the library), but with the advances in networking, there are now new problems of indexing documents that are stored on remote sites.  New tools are also available with hypertext linking of items.
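The indexing idea can be sketched as a small inverted index that maps each word to the documents containing it.  The sample documents and the names build_index and search are hypothetical, and the sketch simply assumes that queries should ignore case.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
docs = {
    1: "Heavy metal contamination in river sediment",
    2: "The history of heavy metal music",
    3: "Software agents and mobile agents",
}

def build_index(documents):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)

index = build_index(docs)
print(search(index, "Heavy Metal"))
print(search(index, "robot"))
```

The search for "Heavy Metal" returns both the chemistry document and the music document, which illustrates the ambiguity raised above: the index finds the words but knows nothing about their meaning.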

Distributed Information Systems

Because of these advances, it is no longer required that all data be located at a single site.  As a result, new tools are being developed to access that remote data, such as Gopher, ftp, and telnet.  However, the users are demanding that we make this transparent.

The World Wide Web

The World Wide Web (WWW) has had a profound impact on society in many ways.  Because of the Web and easy-to-use browsers, people who had never used a computer before now own one.  Much data is available in what is called HyperText Markup Language (HTML).  This allows computers around the world, running different kinds of operating systems, browsers, communications systems, and everything else, to see this document in one format.  More exactly, it allows me to author this page in one format, and all systems can have equal access to the page.

This has led to many new issues:

Conclusion

There are many new areas within the entire industry in which to apply databases.  The advances are coming hourly, and it is difficult to become current and stay current.  The CSEE Department is very involved in the research that is making this happen.  Check out the department's web pages to see what we are doing and the labs we have built for that research.  Remember that we also have courses in those areas.
