In December 1997, according to an announcement from the HotBot search engine [Sullivan, 1998], there were at least 175 million Web pages that could be reached from any computer connected to the Internet. The number is estimated to pass 1,000 million in the year 2000. In other words, within a couple of years there will be a BILLION Web pages to choose from when someone wants to find the information they need, fast. As of today, we do not have the tools to handle such amounts of information.
These pages are kept up to date by people with limited resources, especially when it comes to available time. Hence, the link pages will only cover a fraction of the information available on their topic. In addition, in many cases the people who need the information are not aware that these link pages exist. Therefore, while link pages like these may be the most efficient way to find information, finding the link page you need may be just as difficult as finding the actual information you need.
To cope with this problem, the most useful general search tools are search engines, directories and hybrids of the two [Sullivan, 1998]:
The main advantage of directories is that the user is directly in charge of the contents of the pages offered by the search tool. On Yahoo!, if you're looking for serious information about the White House, you will find that and nothing else once you have navigated to the sub-hierarchy
Government/Executive Branch/The White House.
If you are looking for less serious bits of information in connection with the White House, you will find that, and nothing else, in the sub-hierarchy
Entertainment/Humor/Jokes and fun/Internet Humor/Web site Parodies/The White House.
If you use a search engine, on the other hand, and tell it to look for information about "The White House", you will get all kinds of information that mentions the White House (224,506 documents matched "the white house" on Alta Vista, February 7, 1998). You may even innocently be exposed to, for example, adult material on Web pages that contain the text string "The White House", possibly placed there just to lure people into the pages.
However, most often search engines are capable of coming up with a sensible set of suggestions for where to find information of relevance to the keywords the user provides for the search. There are two different approaches to text search:
In the summer of 1996, there were estimated to be 50-60 million Web pages available on the Internet. Following the steadily increasing growth rate of the Internet, the number of Web pages has since more than tripled. Nicholas Negroponte, head of MIT's Media Lab, has on several occasions claimed that the Web is doubling in size and number of users every fifty days. Unfortunately, the search engines' indexes have not been able to keep up with this development, and the largest ones are still only able to index up to about 100 million Web pages. This has led to a situation where users are no longer guaranteed to find the page they are looking for, even if they have enough time to look through all the pages suggested by the search engine of their choice. Actually, if they are looking for one particular page, there is a more than 50% probability that they will NOT find it: with well over 175 million pages on the Web, and most search engines indexing clearly fewer than the roughly 100 million pages the very largest ones cover, the chance that any given page is in a given index is below one half.
Search engine representatives will argue that even though "some" (today roughly one half) of all Web pages cannot be found through their search engines, interesting and related Web pages will be found, and the user should be just as happy with not getting too many hits to look through. This sounds, somehow, quite sensible. There is a major catch, though: this reality brings the search engine companies into a position where they can profit from selectively choosing which pages are indexed and which are kept out of the indexes. In a way, we may come to a situation where the search engine owners decide who gets to exercise their freedom of speech.
Alta Vista changed its slogan from "We index it all!" to "We index the best!" sometime during the winter/spring of 1997. The directory of Yahoo! is still growing, but the percentage of all available pages it covers is shrinking dramatically every hour. In an increasingly important Internet marketplace, being listed in Yahoo! may come to mean life or death to a Web-based business. People have reported submitting their Web pages for review at Yahoo! up to 30 times without actually getting a review and a listing in the directory. Even the robots on the Net, covering several million Web pages every day, may take a long time to discover new sites, due to the size and complexity of the World Wide Web.
The reason for Yahoo!'s popularity is mainly its hierarchical, well-maintained, easy-to-navigate directory. To keep its position as the search tool market leader, Yahoo! will have to "keep up the good work", meaning that the directory must be manually looked after. They cannot automate the submission process, as that would mean that the directory would be garbled by misplaced links and links leading nowhere, submitted by people who either have not understood how to do it properly or who enjoy messing up systems. Some search engines allow "instant submitting", meaning that robots are sent directly to the submitted Web site to index it within a matter of hours or a few days. Directories, whose strength is their ability to classify Web pages very accurately in a hierarchy, cannot do this, as the effort they have to put into classifying a page exceeds the effort of submitting it. Hence, the directories' strength becomes their weakness as soon as there are more people submitting pages to the directory than there are people to handle the submissions.
There is a need for a solution to this problem, to be able to create
a World Wide Web where everyone can have equal possibilities to have their
Web pages found through some search mechanism. It is not a healthy thing
for the Internet community if the creators and owners of Web catalogues
and search engines are to decide who gets to present their information.
If the book is not on the bookshelves, it cannot be borrowed and read.
This is a major issue for this thesis.
Other ways of creating contextual directories than human, manual indexing have been suggested. Two projects in particular have been met with enthusiasm in the international research community: the Dublin Core and the Meta Content Framework. The general idea used as a point of departure in both of these projects is that all documents/objects should be equipped with a "tag", a container for information describing the context the document/object is created in. This kind of information is called metadata. Both concepts are meant to be used with all kinds of information, not only information that can be reached through the World Wide Web today. A meta-description has to be easily understandable, computable, and should generally demand as few resources to handle as possible.
The use of metadata is not new [Sølvberg,
1997]. In the database world different "schemes" are used to describe
information elements and relations between them. The term "metadata" has
lately been given a more specific meaning, mainly in the Digital Library
community, where it is used to denote formats for describing online information
resources. Here the concept "metadata" has been given several definitions:
Title | The name of the object |
Author/Creator | The person(s) primarily responsible for the intellectual content of the object |
Subject/Keywords | The topic of the object, or keywords, phrases or classification descriptors that describe the subject or content of the object |
Description | A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources |
Publisher | The agent or agency responsible for making the objects available |
Other Contributors | The person(s), such as editors and transcribers, who have made other significant intellectual contributions to the work |
Date | The date of publication |
ObjectType | The genre of the object, such as novel, poem, dictionary, etc. |
Format | The data representation of the object, such as PostScript file |
Identifier | String or number used to uniquely identify the object |
Relation | Relationship to other objects |
Source | Objects, either print or electronic, from which this object is derived |
Language | Language of the intellectual content |
Coverage | The spatial locations and temporal duration characteristic of the object |
Rights management | The content of this element is intended to be a link (URL or other suitable URI) to a copyright notice, a rights-management server, etc. |
All elements can be multi-valued. For example, a document may have several author elements or subject elements. Also, all elements are optional and can be modified by one or more qualifiers.
This table is basically the result of the March 1995 Dublin Metadata Workshop, and was only intended as an initial step towards defining a core descriptive metadata set. It has been criticized on several points, but it does provide a basis for further discussion concerning metadata. For now, I'd like to remark that the 15 elements, although all with a purpose, cover more meta-information than is useful for the majority of Web pages in existence today, while at the same time ignoring certain areas of metadata that are of interest for various Web pages.
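To make the element set a little more concrete, here is a minimal sketch of how a single Web page might be described with some of these elements, written as a plain Python dictionary in which every element is optional and may hold several values, as noted above. The record contents are invented for illustration only.

# A hypothetical Dublin Core-style description of a single Web page.
# Every element is optional, and any element may be multi-valued,
# so each value is kept in a list.
dublin_core_record = {
    "Title":            ["Example document"],
    "Author/Creator":   ["BC Torrissen"],               # could list several authors
    "Subject/Keywords": ["metadata", "World Wide Web"],  # a multi-valued element
    "Date":             ["1997-11-26"],
    "Format":           ["text/html"],
    "Language":         ["en"],
}

# Elements that are not present are simply left out, so a consumer
# must treat every element as optional.
for element, values in dublin_core_record.items():
    print(f"{element}: {'; '.join(values)}")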
The Warwick Workshop, building on the Dublin Core, suggests that new metadata sets will develop as the networked information infrastructure matures. As more proprietary information is made available for purchase and delivery on the Internet, the need for a suitable metadata set will push development in this area forward. Metadata that may be of special interest for Web objects in these cases are [Warwick, 1996]:
An implementation in HTML, the common formatting language used on the WWW, is among the implementations outlined in the workshop papers. To make it as easy as possible to introduce a new metadata format to the World Wide Web, it should be possible to start using it without requiring any changes to either Web browsers or HTML editors. A solution that follows this precaution and conforms to HTML 2.0 was proposed at the May 1996 W3C-sponsored Distributed Indexing/Searching Workshop in Cambridge, Massachusetts. The implementation takes advantage of two tags:
<HTML>
<HEAD>
<TITLE>Example Document with Metadata </TITLE>
<META NAME="Meta.Title" CONTENT="Example document">
<META NAME="Meta.Author" CONTENT="BC Torrissen">
<META NAME="Meta.DateCreated" CONTENT="26111997">
<LINK REL="Schema.Meta" HREF="http://meta.idi.ntnu.no/meta.html">
<LINK REL="META.FORMAT" HREF="http://meta.idi.ntnu.no/metadefinition/">
</HEAD>
<BODY>
Insert here the document contents, as described in the metadata above.
</BODY>
</HTML>
A Web spider familiar with the Warwick Framework, in addition to gathering the contents of the body of this HTML document, will also be able to index the title, author and creation date of the document. This is done according to the metadata scheme Meta, which can be found at the location meta.idi.ntnu.no. There is also a pointer to where human readers can find a description of the metadata schema used here.
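As a rough sketch of what the indexing step of such a spider involves, the following Python fragment extracts the Meta.* NAME/CONTENT pairs and the Schema.* link from the HEAD of a document like the one above. The class name MetaExtractor is invented for this example, and a real spider would of course also fetch pages, follow links and build an index.

# A minimal sketch of the metadata-extraction step of a spider, using only
# Python's standard library. The sample document is a shortened version of
# the example above.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects <META NAME="Meta.X" CONTENT="..."> pairs and Schema.* links."""
    def __init__(self):
        super().__init__()
        self.metadata = {}   # e.g. {"Meta.Title": "Example document"}
        self.schemas = {}    # e.g. {"Schema.Meta": "http://meta.idi.ntnu.no/meta.html"}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").startswith("Meta."):
            self.metadata[attrs["name"]] = attrs.get("content", "")
        elif tag == "link" and (attrs.get("rel") or "").startswith("Schema."):
            self.schemas[attrs["rel"]] = attrs.get("href", "")

page = """<HTML><HEAD>
<META NAME="Meta.Title" CONTENT="Example document">
<META NAME="Meta.Author" CONTENT="BC Torrissen">
<META NAME="Meta.DateCreated" CONTENT="26111997">
<LINK REL="Schema.Meta" HREF="http://meta.idi.ntnu.no/meta.html">
</HEAD><BODY>...</BODY></HTML>"""

extractor = MetaExtractor()
extractor.feed(page)
print(extractor.metadata)   # the Meta.* name/content pairs
print(extractor.schemas)    # where the metadata scheme is defined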
The main applications for the MCF project have so far been Apple's HotSauce
project and ProjectX, which provides a new way of visualizing and navigating
through hierarchically stored information, whether it resides on the Web
or on a single computer. Apple has officially dropped the research on MCF,
but the concept has gained many enthusiastic followers, and seems to live
on. One of the largest experiments has been to convert the Yahoo! directory
to MCF, making it possible to navigate Yahoo! by "flying" through a three-dimensional
information space, as shown in Figure 2-2. By moving in close to a category,
the category opens and sub-categories and actual documents appear.
The core of MCF is the .mcf file, which contains meta information about the contents of the documents the file is to cover. These files are generated from data manually produced by human users. MCF provides an SQL-like language for accessing and manipulating meta content descriptions, as well as a standard vocabulary of terms for describing a document's attributes, such as "author" and "fileSize". Users can choose to use their own terms if they like. If they do, however, integrating their content information with others' will be more difficult.
MCF is fully scalable, meaning that the same architecture is to be used,
whether it is for holding meta content information for a single computer
or for the whole Internet. It is also designed to minimize the up-front
cost of introducing the new technology for developers of existing applications.
MCF does not aim to replace existing formats for exchanging and storing
meta content. Instead, information in existing formats can be assimilated
into richer MCF structures, thus making it possible to combine information
from several formats into a larger MCF-based index.
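The following Python fragment is a conceptual sketch of this assimilation idea only: records described with two different sets of element names are mapped onto one shared vocabulary and merged into a single index. The record contents, the mappings and the function assimilate are invented for illustration and do not use MCF's actual syntax or standard vocabulary.

# One record described with Dublin Core-style element names ...
dublin_core_record = {"Title": "Example document", "Author/Creator": "BC Torrissen"}
# ... and one described with MARC-style field tags.
marc_record = {"245": "Developing intelligent agents for distributed systems",
               "100": "Knapik, Michael"}

# Mappings from each source format into a shared vocabulary ("title", "author").
DC_MAP = {"Title": "title", "Author/Creator": "author"}
MARC_MAP = {"245": "title", "100": "author"}

def assimilate(record, mapping):
    """Translate a record in some source format into the shared vocabulary."""
    return {mapping[field]: value for field, value in record.items() if field in mapping}

# Records from both formats end up in one combined index.
combined_index = [assimilate(dublin_core_record, DC_MAP),
                  assimilate(marc_record, MARC_MAP)]
print(combined_index)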
A number of additional dialects of MARC exist, both for national and international communities, but the basic idea remains the same in all MARCs. In USMARC, formats are defined for five types of data: Bibliographic, Authority, Holdings, Classification and Community information. Within these types a number of fields are defined, which may contain all kinds of information about the documents. For example, for bibliographic data, codes are assigned like this [MARC, 1996]:
0XX = Control information, numbers, codes
1XX = Main entry
2XX = Titles, edition, imprint
3XX = Physical description, etc.
4XX = Series statements
5XX = Notes
6XX = Subject access fields
7XX = Name, etc. added entries or series; linking
8XX = Series added entries; holdings and locations
9XX = Reserved for local implementation
As the name indicates, the MARC system is for interchange of bibliographic information between computer systems. However, creating a high-quality MARC record requires skilled personnel who are experienced in the use of cataloguing rules. The motivation for creating MARC records is that if every library keeps a list of its resources in this format, information from several libraries can be collected and an index of all available resources from all libraries in a specific region can be created. This provides a great tool for locating information from wherever in the region it is available. Below is an example of a typical MARC record, taken from the Norwegian MARC dialect, NORMARC:
*001972095632
*008 eng
*015 $alc97024364
*020 $a0-07-035011-6
*082 $c006.3
*100 $aKnapik, Michael
*245 $aDeveloping intelligent agents for distributed systems$bexploring architecture, technologies, and applications$cMichael Knapik, Jay Johnson
*260 $aNew York$bMcGraw-Hill$cc1997
*300 $ap. cm.
*650 $aIntelligent agents (Computer software)
*650 $aElectronic data processing$xDistributed processing
*650 $aComputer software$xDevelopment
*700 $aJohnson, Jay$d1957
*096c $aRMH$n97c016905
The three-digit codes indicate what information follows on the line, $ marks the start of a subfield, and then the actual information is given. In the example above, the line reading "*082 $c006.3" tells us that the book is classified under 006.3 in the Dewey classification system.
As we see, the MARC format is similar to the aforementioned Dublin Core.
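To make the structure of such a record more concrete, here is a small Python sketch that reads a few of the lines from the record above into fields and subfields. The function name parse_normarc is invented for this example, and the sketch deliberately ignores indicators and the machine-readable exchange format that real MARC processing involves.

# A simplified reading of a NORMARC record: "*" starts a field, the tag that
# follows identifies the field, and each "$" plus one letter starts a subfield.
import re

record_text = """*100 $aKnapik, Michael
*245 $aDeveloping intelligent agents for distributed systems$bexploring architecture, technologies, and applications$cMichael Knapik, Jay Johnson
*082 $c006.3"""

def parse_normarc(text):
    fields = []
    for line in text.splitlines():
        if not line.startswith("*"):
            continue
        tag, _, rest = line[1:].partition(" ")
        # Split "$aFoo$bBar" into [("a", "Foo"), ("b", "Bar")]
        subfields = re.findall(r"\$(\w)([^$]*)", rest)
        fields.append((tag, subfields))
    return fields

for tag, subfields in parse_normarc(record_text):
    print(tag, subfields)
# The line "*082 $c006.3" yields ('082', [('c', '006.3')]):
# a Dewey classification of 006.3.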
The DDC was invented and published by the American librarian Melvil Dewey in the mid-1870s [Dewey, 1994]. It was originally devised as a system for small libraries to catalogue books, but has since been used in larger settings than the local and school libraries that were the first to adopt the system extensively. The system is based on ten main classes (000-999), which in turn are further subdivided. The main classes, which are intended to cover all human knowledge, are:
000-099 Generalities
100-199 Philosophy & Psychology
200-299 Religion
300-399 Social Science
400-499 Language
500-599 Natural Sciences & Mathematics
600-699 Technology (Applied Sciences)
700-799 Arts & Entertainment
800-899 Literature
900-999 Geography & History
Each of the ten classes is further divided into ten subclasses. For example, the 700's are divided into these subclasses:
700-709 The Arts
710-719 Civic & Landscape art
720-729 Architecture
730-739 Plastic Arts, Sculpture
740-749 Drawing & Decorative arts
750-759 Painting & Paintings
760-769 Graphic Arts, Prints
770-779 Photography & Photographs
780-789 Music
790-799 Recreational & Performing Arts
The subdivision of classes continues for as long as necessary to describe very precise topics. An example is "butterflies", going into Dewey Decimal Classification 595.789, deduced along the path: Natural Sciences (500) --> Zoological Sciences (590) --> Other Invertebrates (595) --> Insects (595.7) --> Lepidoptera (595.78) --> Butterflies (595.789).
Dewey has several advantages that make it easy to use. The codes are uniformly constructed, with room for describing any particular topic. "Birds found in Italy" can be constructed as 598.0945: 598 means Aves/Birds, .09 indicates geographical treatment, and 45 is the geographical Dewey code for Italy. Adding an additional 9 at the end, making it 598.09459, narrows the topic to Sardinian birds.
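As a small illustration of this construction, the following Python sketch composes such numbers from a base class and an area notation. The function name build_dewey is invented for this example, and the notation values (598, 09, 45, 459) are the ones given in the text; building real DDC numbers of course involves the full schedules and tables.

# An illustrative sketch of the number construction described above:
# base class + ".09" for geographical treatment + area notation.
def build_dewey(base, geographic_area=None):
    """Compose a Dewey number from a base class and an optional area notation."""
    if geographic_area is None:
        return base                          # e.g. plain "598" for birds
    return f"{base}.09{geographic_area}"     # ".09" marks geographical treatment

print(build_dewey("598", "45"))     # 598.0945  -> birds found in Italy
print(build_dewey("598", "459"))    # 598.09459 -> birds of Sardinia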
Although one may at first be a bit skeptical about the idea, since it seems to require remembering a mess of numbers before the system can be used at all, the system is in fact very easy to use. The main reason is that the codes are conceptually well organized, so you do not actually have to know any codes to start searching for the information you need; knowing the codes just makes gathering information from the system faster. The codes are well maintained centrally, by Library of Congress personnel, and one code will never have more than one meaning.
As of today, DDC numbers appear in MARC records issued by countries throughout the world and are used in a multitude of national bibliographies: Australia, Botswana, Brazil, Canada, India, Italy, Norway, Pakistan, Papua New Guinea, Turkey, the United Kingdom, Venezuela and other countries. Relatively up-to-date comprehensive translations of the DDC are available in Arabic, French, Greek, Hebrew, Italian, Persian, Russian, Spanish and Turkish. Together with English, these languages are spoken by more than 1.1 billion people today [WABF, 1996]. Less comprehensive but still fairly detailed translations exist in a large number of languages. A number of tutorials and guides to the DDC are also available.
Some general advantages of the use of the Dewey Decimal Classification are:
From the nature of some of the partial problems, it is obvious that we have a situation where the job that needs to be done is too large for a relatively small group of people to handle. We also have a situation where it seems as if manual work of human quality is required. This is not a new situation for mankind; we have come up with technology before for doing things we could not do ourselves. Let us take a closer look at what it is, exactly, that we want done.