In December 1997, according to an announcement from the HotBot search engine [Sullivan, 1998], there were at least 175 million Web pages that could be reached from any computer connected to the Internet. The number is estimated to pass 1,000 million in the year 2000. In other words, within a couple of years there will be a BILLION Web pages to choose from when someone wants to find the information they need, fast. As of today, we do not have the tools to handle such amounts of information.
These pages are kept up to date by people with limited resources, especially when it comes to available time. Hence, the link pages will only cover a fraction of the information available on their topic. In addition, in many cases the people who need the information are not aware that these link pages exist. Therefore, while link pages like these may be the most efficient way to find information, finding the link page you need may be just as difficult as finding the actual information you need.
To cope with this problem, the most useful general search tools are search engines, directories and hybrids of the two [Sullivan, 1998]:
The main advantage of directories is that the user is directly in charge of the contents of the pages offered by the search tool. On Yahoo!, if you're looking for serious information about the White House, you will find that and nothing else once you have navigated to the sub-hierarchy
Government/Executive Branch/The White House.
If you are looking for less serious bits of information in connection with the White House, you will find that, and nothing else, in the sub-hierarchy
Entertainment/Humor/Jokes and fun/Internet Humor/Web site Parodies/The White House.
If you use a search engine, on the other hand, and tell it to look for information about "The White House", you will get all kinds of information that mentions the White House (224,506 documents matched "the white house" on Alta Vista, February 7, 1998). You may even innocently be exposed to, for example, adult material on Web pages that contain the text string "The White House", possibly placed there just to lure people into the pages.
However, most often search engines are capable of coming up with a sensible set of suggestions for where to find information of relevance to the keywords the user provides for the search. There are two different approaches to text search:
In the summer of 1996, there were estimated to be 50-60 million Web pages available on the Internet. Following the steadily increasing growth rate of the Internet, the number of Web pages has since more than tripled. Nicholas Negroponte, head of MIT's Media Lab, has on several occasions claimed that the Web is doubling in size and number of users every fifty days. Unfortunately, the search engines' indexes have not been able to keep up with this development, and the largest ones are still only able to index up to about 100 million Web pages. This has led to a situation where users are no longer guaranteed to find the page they are looking for, even if they have enough time to look through all the pages suggested by the search engine of their choice. Actually, if they are looking for one particular page, there is a more than 50% probability that they will NOT find it: with well over 175 million pages on the Web, and most search engines indexing clearly fewer than the roughly 100 million pages the very largest ones cover, the chance that any given page is in a given index is below one half.
Search engine representatives will argue that even though "some" (today roughly one half) of all Web pages cannot be found through their search engines, interesting and related Web pages will be found, and the user should be just as happy with not getting too many hits to look through. This sounds, somehow, quite sensible. There is a major catch, though: this reality brings the search engine companies into a position where they can profit from selectively choosing which pages are indexed and which are kept out of the indexes. In a way, we may come to a situation where the search engine owners decide who gets to exercise their freedom of speech.
Alta Vista changed its slogan from "We index it all!" to "We index the best!" sometime during the winter/spring of 1997. The directory of Yahoo! is still growing, but the percentage of all available pages it covers is shrinking dramatically every hour. In an increasingly important Internet marketplace, being listed in Yahoo! may come to mean life or death to a Web-based business. People have reported submitting their Web pages for review at Yahoo! up to 30 times without actually getting a review and a listing in the directory. Even the robots on the Net, covering several million Web pages every day, may take a long time to discover new sites, due to the size and complexity of the World Wide Web.
The reason for Yahoo!'s popularity is mainly its hierarchical, well-maintained, easy-to-navigate directory. To keep its position as the search tool market leader, Yahoo! will have to "keep up the good work", meaning that the directory must be manually looked after. They cannot automate the submission process, as that would mean that the directory would be garbled by misplaced links and links leading nowhere, submitted by people who either have not understood how to do it properly or who enjoy messing up systems. Some search engines allow "instant submitting", meaning that robots are sent directly to the submitted Web site to index it within a matter of hours or a few days. Directories, whose strength is their ability to classify Web pages very accurately in a hierarchy, cannot do this, as the effort they have to put into classifying a page exceeds the effort of submitting it. Hence, the directories' strength becomes their weakness as soon as there are more people submitting pages to the directory than there are people to handle the submissions.
There is a need for a solution to this problem, to be able to create
a World Wide Web where everyone can have equal possibilities to have their
Web pages found through some search mechanism. It is not a healthy thing
for the Internet community if the creators and owners of Web catalogues
and search engines are to decide who gets to present their information.
If the book is not on the bookshelves, it cannot be borrowed and read.
This is a major issue for this thesis.
Other ways of creating contextual directories than human, manual indexing have been suggested. Two projects in particular have been met with enthusiasm in the international research community: the Dublin Core and the Meta Content Framework. The general idea used as a point of departure in both of these projects is that all documents/objects should be equipped with a "tag", a container for information describing the context the document/object is created in. This kind of information is called metadata. Both concepts are meant to be used with all kinds of information, not only information that can be reached through the World Wide Web today. A meta-description has to be easily understandable, computable, and should generally demand as few resources to handle as possible.
The use of metadata is not new [Sølvberg,
1997]. In the database world different "schemes" are used to describe
information elements and relations between them. The term "metadata" has
lately been given a more specific meaning, mainly in the Digital Library
community, where it is used to denote formats for describing online information
resources. Here the concept "metadata" has been given several definitions:
Title | The name of the object |
Author/Creator | The person(s) primarily responsible for the intellectual content of the object |
Subject/Keywords | The topic of the object, or keywords, phrases or classification descriptors that describe the subject or content of the object |
Description | A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources |
Publisher | The agent or agency responsible for making the objects available |
Other Contributors | The person(s), such as editors and transcribers, who have made other significant intellectual contributions to the work |
Date | The date of publication |
ObjectType | The genre of the object, such as novel, poem, dictionary, etc. |
Format | The data representation of the object, such as PostScript file |
Identifier | String or number used to uniquely identify the object |
Relation | Relationship to other objects |
Source | Objects, either print or electronic, from which this object is derived |
Language | Language of the intellectual content |
Coverage | The spatial locations and temporal duration characteristic of the object |
Rights management | The content of this element is intended to be a link (URL or other suitable URI) to a copyright notice, a rights-management server, etc. |
All elements can be multi-valued. For example, a document may have several author elements or subject elements. Also, all elements are optional and can be modified by one or more qualifiers.
This table is basically the result of the March 1995 Dublin Metadata Workshop, and was only intended as an initial step towards defining a core descriptive metadata set. It has been criticized on several points, but it does provide a basis for further discussion concerning metadata. For now, I'd like to remark that the 15 elements, although all with a purpose, cover more meta-information than is useful for the majority of Web pages in existence today, while at the same time ignoring certain areas of metadata that are of interest for various Web pages.
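To make the element set a little more concrete, here is a minimal sketch of how a single Web page might be described with some of these elements, written as a plain Python dictionary in which every element is optional and may hold several values, as noted above. The record contents are invented for illustration only.

# A hypothetical Dublin Core-style description of a single Web page.
# Every element is optional, and any element may be multi-valued,
# so each value is kept in a list.
dublin_core_record = {
    "Title":            ["Example document"],
    "Author/Creator":   ["BC Torrissen"],               # could list several authors
    "Subject/Keywords": ["metadata", "World Wide Web"],  # a multi-valued element
    "Date":             ["1997-11-26"],
    "Format":           ["text/html"],
    "Language":         ["en"],
}

# Elements that are not present are simply left out, so a consumer
# must treat every element as optional.
for element, values in dublin_core_record.items():
    print(f"{element}: {'; '.join(values)}")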
The Warwick Workshop, building on the Dublin Core, suggests that new metadata sets will develop as the networked information infrastructure matures. As more proprietary information is made available for purchase and delivery on the Internet, the need for a suitable metadata set will push development in this area forward. Metadata that may be of special interest for Web objects in these cases are [Warwick, 1996]:
An implementation in HTML, the common formatting language used on the WWW, is among the implementations outlined in the workshop papers. To make it as easy as possible to introduce a new metadata format to the World Wide Web, it should be possible to start using it without requiring any changes to either Web browsers or HTML editors. A solution that follows this precaution and conforms to HTML 2.0 was proposed at the May 1996 W3C-sponsored Distributed Indexing/Searching Workshop in Cambridge, Massachusetts. The implementation takes advantage of two tags:
<HTML>
<HEAD>
<TITLE>Example Document with Metadata </TITLE>
<META NAME="Meta.Title" CONTENT="Example document">
<META NAME="Meta.Author" CONTENT="BC Torrissen">
<META NAME="Meta.DateCreated" CONTENT="26111997">
<LINK REL="Schema.Meta" HREF="http://meta.idi.ntnu.no/meta.html">
<LINK REL="META.FORMAT" HREF="http://meta.idi.ntnu.no/metadefinition/">
</HEAD>
<BODY>
Insert here the document contents, as described in the metadata above.
</BODY>
</HTML>
A Web spider familiar with the Warwick Framework, in addition to gathering the contents of the body of this HTML document, will also be able to index the title, author and creation date of the document. This is done according to the metadata scheme Meta, which can be found at the location meta.idi.ntnu.no. There is also a pointer to where human readers can find a description of the metadata schema used here.
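As a rough sketch of what the indexing step of such a spider involves, the following Python fragment extracts the Meta.* NAME/CONTENT pairs and the Schema.* link from the HEAD of a document like the one above. The class name MetaExtractor is invented for this example, and a real spider would of course also fetch pages, follow links and build an index.

# A minimal sketch of the metadata-extraction step of a spider, using only
# Python's standard library. The sample document is a shortened version of
# the example above.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects <META NAME="Meta.X" CONTENT="..."> pairs and Schema.* links."""
    def __init__(self):
        super().__init__()
        self.metadata = {}   # e.g. {"Meta.Title": "Example document"}
        self.schemas = {}    # e.g. {"Schema.Meta": "http://meta.idi.ntnu.no/meta.html"}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").startswith("Meta."):
            self.metadata[attrs["name"]] = attrs.get("content", "")
        elif tag == "link" and (attrs.get("rel") or "").startswith("Schema."):
            self.schemas[attrs["rel"]] = attrs.get("href", "")

page = """<HTML><HEAD>
<META NAME="Meta.Title" CONTENT="Example document">
<META NAME="Meta.Author" CONTENT="BC Torrissen">
<META NAME="Meta.DateCreated" CONTENT="26111997">
<LINK REL="Schema.Meta" HREF="http://meta.idi.ntnu.no/meta.html">
</HEAD><BODY>...</BODY></HTML>"""

extractor = MetaExtractor()
extractor.feed(page)
print(extractor.metadata)   # the Meta.* name/content pairs
print(extractor.schemas)    # where the metadata scheme is defined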
The main applications for the MCF project have so far been Apple's HotSauce
project and ProjectX, which provides a new way of visualizing and navigating
through hierarchically stored information, whether it resides on the Web
or on a single computer. Apple has officially dropped the research on MCF,
but the concept has gained many enthusiastic followers, and seems to live
on. One of the largest experiments has been to convert the Yahoo! directory
to MCF, making it possible to navigate Yahoo! by "flying" through a three-dimensional
information space, as shown in Figure 2-2. By moving in close to a category,
the category opens and sub-categories and actual documents appear.
The core of MCF is the .mcf file, which contains meta information about the contents of the documents the file is to cover. These files are generated from data manually produced by human users. MCF provides an SQL-like language for accessing and manipulating meta content descriptions, as well as a standard vocabulary of terms for describing a document's attributes, such as "author" and "fileSize". Users can choose to use their own terms if they like. If they do, however, integrating their content information with others' will be more difficult.
MCF is fully scalable, meaning that the same architecture is to be used,
whether it is for holding meta content information for a single computer
or for the whole Internet. It is also designed to minimize the up-front
cost of introducing the new technology for developers of existing applications.
MCF does not aim to replace existing formats for exchanging and storing
meta content. Instead, information in existing formats can be assimilated
into richer MCF structures, thus making it possible to combine information
from several formats into a larger MCF-based index.
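The following Python fragment is a conceptual sketch of this assimilation idea only: records described with two different sets of element names are mapped onto one shared vocabulary and merged into a single index. The record contents, the mappings and the function assimilate are invented for illustration and do not use MCF's actual syntax or standard vocabulary.

# One record described with Dublin Core-style element names ...
dublin_core_record = {"Title": "Example document", "Author/Creator": "BC Torrissen"}
# ... and one described with MARC-style field tags.
marc_record = {"245": "Developing intelligent agents for distributed systems",
               "100": "Knapik, Michael"}

# Mappings from each source format into a shared vocabulary ("title", "author").
DC_MAP = {"Title": "title", "Author/Creator": "author"}
MARC_MAP = {"245": "title", "100": "author"}

def assimilate(record, mapping):
    """Translate a record in some source format into the shared vocabulary."""
    return {mapping[field]: value for field, value in record.items() if field in mapping}

# Records from both formats end up in one combined index.
combined_index = [assimilate(dublin_core_record, DC_MAP),
                  assimilate(marc_record, MARC_MAP)]
print(combined_index)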
A number of additional dialects of MARC exist, both for national and international communities, but the basic idea remains the same in all MARCs. In USMARC, formats are defined for five types of data: Bibliographic, Authority, Holdings, Classification and Community information. Within these types a number of fields are defined, which may contain all kinds of information about the documents. For example, for bibliographic data, codes are assigned like this [MARC, 1996]:
0XX = Control information, numbers, codes
1XX = Main entry
2XX = Titles, edition, imprint
3XX = Physical description, etc.
4XX = Series statements
5XX = Notes
6XX = Subject access fields
7XX = Name, etc. added entries or series; linking
8XX = Series added entries; holdings and locations
9XX = Reserved for local implementation
As the name indicates, the MARC system is for interchange of bibliographic information between computer systems. However, creating a high-quality MARC record requires skilled personnel who are experienced in the use of cataloguing rules. The motivation for creating MARC records is that if every library keeps a list of its resources in this format, information from several libraries can be collected and an index of all available resources from all libraries in a specific region can be created. This provides a great tool for locating information from wherever in the region it is available. Below is an example of a typical MARC record, taken from the Norwegian MARC dialect, NORMARC:
*001972095632
*008 eng
*015 $alc97024364
*020 $a0-07-035011-6
*082 $c006.3
*100 $aKnapik, Michael
*245 $aDeveloping intelligent agents for distributed systems$bexploring architecture, technologies, and applications$cMichael Knapik, Jay Johnson
*260 $aNew York$bMcGraw-Hill$cc1997
*300 $ap. cm.
*650 $aIntelligent agents (Computer software)
*650 $aElectronic data processing$xDistributed processing
*650 $aComputer software$xDevelopment
*700 $aJohnson, Jay$d1957
*096c $aRMH$n97c016905
The three-digit codes indicate what information follows on the line, $ marks the start of a subfield, and then the actual information is given. In the example above, the line reading "*082 $c006.3" tells us that the book is classified under 006.3 in the Dewey classification system.
As we see, the MARC format is similar to the aforementioned Dublin Core.
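To make the structure of such a record more concrete, here is a small Python sketch that reads a few of the lines from the record above into fields and subfields. The function name parse_normarc is invented for this example, and the sketch deliberately ignores indicators and the machine-readable exchange format that real MARC processing involves.

# A simplified reading of a NORMARC record: "*" starts a field, the tag that
# follows identifies the field, and each "$" plus one letter starts a subfield.
import re

record_text = """*100 $aKnapik, Michael
*245 $aDeveloping intelligent agents for distributed systems$bexploring architecture, technologies, and applications$cMichael Knapik, Jay Johnson
*082 $c006.3"""

def parse_normarc(text):
    fields = []
    for line in text.splitlines():
        if not line.startswith("*"):
            continue
        tag, _, rest = line[1:].partition(" ")
        # Split "$aFoo$bBar" into [("a", "Foo"), ("b", "Bar")]
        subfields = re.findall(r"\$(\w)([^$]*)", rest)
        fields.append((tag, subfields))
    return fields

for tag, subfields in parse_normarc(record_text):
    print(tag, subfields)
# The line "*082 $c006.3" yields ('082', [('c', '006.3')]):
# a Dewey classification of 006.3.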
The DDC was invented and published by the American librarian Melvil Dewey in the mid-1870s [Dewey, 1994]. It was originally devised as a system for small libraries to catalogue books, but has since been used in larger settings than the local and school libraries that were the first to adopt the system extensively. The system is based on ten main classes (000-999), which in turn are further subdivided. The main classes, which are intended to cover all human knowledge, are:
000-099 Generalities
100-199 Philosophy & Psychology
200-299 Religion
300-399 Social Science
400-499 Language
500-599 Natural Sciences & Mathematics
600-699 Technology (Applied Sciences)
700-799 Arts & Entertainment
800-899 Literature
900-999 Geography & History
Each of the ten classes is further divided into ten subclasses. For example, the 700's are divided into these subclasses:
700-709 The Arts
710-719 Civic & Landscape art
720-729 Architecture
730-739 Plastic Arts, Sculpture
740-749 Drawing & Decorative arts
750-759 Painting & Paintings
760-769 Graphic Arts, Prints
770-779 Photography & Photographs
780-789 Music
790-799 Recreational & Performing Arts
The subdivision of classes continues for as long as necessary to describe very precise topics. An example is "butterflies", going into Dewey Decimal Classification 595.789, deduced along the path: Natural Sciences (500) --> Zoological Sciences (590) --> Other Invertebrates (595) --> Insects (595.7) --> Lepidoptera (595.78) --> Butterflies (595.789).
Dewey has several advantages that make it easy to use. The codes are uniformly constructed, with room for describing any particular topic. "Birds found in Italy" can be constructed as 598.0945: 598 means Aves/Birds, .09 indicates geographical treatment, and 45 is the geographical Dewey code for Italy. Adding an additional 9 at the end, making it 598.09459, narrows the topic to Sardinian birds.
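As a small illustration of this construction, the following Python sketch composes such numbers from a base class and an area notation. The function name build_dewey is invented for this example, and the notation values (598, 09, 45, 459) are the ones given in the text; building real DDC numbers of course involves the full schedules and tables.

# An illustrative sketch of the number construction described above:
# base class + ".09" for geographical treatment + area notation.
def build_dewey(base, geographic_area=None):
    """Compose a Dewey number from a base class and an optional area notation."""
    if geographic_area is None:
        return base                          # e.g. plain "598" for birds
    return f"{base}.09{geographic_area}"     # ".09" marks geographical treatment

print(build_dewey("598", "45"))     # 598.0945  -> birds found in Italy
print(build_dewey("598", "459"))    # 598.09459 -> birds of Sardinia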
Although one may at first be a bit skeptical about the idea, since it seems to require remembering a mess of numbers before the system can be used at all, the system is in fact very easy to use. The main reason is that the codes are conceptually well organized, so you do not actually have to know any codes to start searching for the information you need; knowing the codes just makes gathering information from the system faster. The codes are well maintained centrally, by Library of Congress personnel, and one code will never have more than one meaning.
As of today, DDC numbers appear in MARC records issued by countries throughout the world and are used in a multitude of national bibliographies: Australia, Botswana, Brazil, Canada, India, Italy, Norway, Pakistan, Papua New Guinea, Turkey, the United Kingdom, Venezuela and other countries. Relatively up-to-date comprehensive translations of the DDC are available in Arabic, French, Greek, Hebrew, Italian, Persian, Russian, Spanish and Turkish. Together with English, these languages are spoken by more than 1.1 billion people today [WABF, 1996]. Less comprehensive but still fairly detailed translations exist in a large number of languages. A number of tutorials and guides to the DDC are also available.
Some general advantages of the use of the Dewey Decimal Classification are:
From the nature of some of the partial problems, it is obvious that we have a situation where the job that needs to be done is too large for a relatively small group of people to handle. We also have a situation where it seems as if manual work of human quality is required. This is not a new situation for mankind; we have come up with technology before for doing things we could not do ourselves. Let us take a closer look at what it is, exactly, that we want done.