If we take the metaphor of the Internet as the ultimate library one
step further, it becomes obvious that we lack something important today, namely
the librarians. Just as a person cannot instantly find
any particular book on their first visit to a large library, someone new
to the Internet will not know how to go about finding information on
a specific topic they are interested in. In a library, however, it
is easy to find librarians who, in addition to going “Shhh!!!” and pointing
a finger at you if you do not follow the rules, will help you find what
you are looking for, if you ask them nicely.
While a sensible librarian can ask the user a few clarifying questions about which aspects of travelling interest them, search engines cannot do this directly to focus or refine a search. Nor do search engines usually have any judgment or past experience with a particular user that could be used to rank the Web pages they find. Instead, search engines are equipped with very simple logical rules for ranking Web pages, based mainly on the number and location of keywords found on the pages. Pages with keywords appearing in the title, for example, are assumed by the search engines to be more relevant to the topic than others.
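To make this concrete, here is a minimal sketch (in Python) of such a keyword-and-location ranking rule. The Page record and the title weight of three are illustrative assumptions, not the actual formula of any existing search engine:

    from dataclasses import dataclass

    @dataclass
    class Page:
        url: str
        title: str
        body: str

    def score(page: Page, keywords: list[str]) -> int:
        # Count keyword occurrences; title hits are weighted more heavily,
        # mirroring the location-based rule described above. The weight 3
        # is an arbitrary illustrative choice.
        total = 0
        for kw in (k.lower() for k in keywords):
            total += 3 * page.title.lower().count(kw)
            total += page.body.lower().count(kw)
        return total

    def rank(pages: list[Page], keywords: list[str]) -> list[Page]:
        # Most relevant pages first.
        return sorted(pages, key=lambda p: score(p, keywords), reverse=True)

Simple rules like this explain both the successes and the failures of keyword-based ranking: a page can score highly without being about the topic at all.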
There are two main reasons for users to employ a webrarian, that is, a World Wide Web librarian, when they want to find information on the Internet. First, there is the technical side: the webrarian can transform the user’s natural language request into a set of logical expressions that describes the desired information in a way that can be understood by various kinds of search tools. Second, the webrarian serves as a psychological abstraction, appearing to the user as “someone” who can be asked to help in locating information. While most users have little or no experience in formulating efficient search engine requests, everyone, except perhaps people with Tarzan-like backgrounds, has a lot of experience in dealing with other people. This experience should be taken advantage of when helping the user understand and deal with complex software, such as a search engine.
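As a sketch of the first, technical task, the following toy function (in Python) turns a natural language request into a boolean expression; the stopword list and synonym table are illustrative assumptions:

    # Toy translation of a natural language request into a boolean query.
    # A real webrarian would need far richer linguistic knowledge.
    STOPWORDS = {"i", "a", "the", "about", "for", "to", "want", "find",
                 "information", "on"}
    SYNONYMS = {"travelling": ["travel", "trip", "tourism"]}

    def to_boolean_query(request: str) -> str:
        terms = [w.strip(".,!?").lower() for w in request.split()]
        terms = [t for t in terms if t and t not in STOPWORDS]
        clauses = []
        for t in terms:
            alternatives = [t] + SYNONYMS.get(t, [])
            clauses.append("(" + " OR ".join(alternatives) + ")")
        return " AND ".join(clauses)

    # to_boolean_query("I want information about travelling to Norway")
    # -> '(travelling OR travel OR trip OR tourism) AND (norway)'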
Another main task for traditional librarians is to maintain the systematic
arrangement of the books on the shelves. New books acquired by the library
are catalogued and put in the right spot, so that they can easily be found
later by users familiar with the system/library or by the librarians themselves.
Since nothing is removed from the Internet, only copied
or downloaded, a webrarian will not have to put books back on the shelves.
Still, the ability to put new information in the right place would be a
valuable quality of a webrarian.
Advanced techniques for telling what category a certain Web page belongs to are necessary. It is not always possible to conclude that one Web page in the domain “some.thing.com” is a personal home page just because another Web page in the same domain has been classified as one. For .gov domains, it is usually safe to say that they contain information from governmental organizations. However, most countries have not adopted .gov as a standard, so assuming that only .gov sites contain governmental information will not cover them all.
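A minimal sketch of such domain-based cues might look as follows; the category names and confidence values are illustrative assumptions, and the weakness of the path-based cue is exactly the point made above:

    from urllib.parse import urlparse

    def guess_category(url: str) -> tuple[str, float]:
        # Return a (category, confidence) pair based on domain cues alone.
        host = urlparse(url).hostname or ""
        if host.endswith(".gov"):
            # Strong cue, but only in countries where .gov is actually used.
            return ("governmental", 0.9)
        if "~" in url or "/people/" in url or "/home/" in url:
            # Weak cue: path conventions for personal pages vary widely.
            return ("personal home page", 0.5)
        return ("unknown", 0.0)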
It is possible to identify large subgroups of Web pages within these
main categories. Many Web pages classified as personal home pages will
be, for example, curricula vitae, “fan pages” (pages dedicated
to information about some particular person), and so on. These large subgroups also
deserve specific codes, to enable further narrowing of searches using
the new system we are creating. Such automatic detection of text genre
has, until recently, been a fairly small field within computational linguistics. In
[Kessler, 1997], several cues for identifying genres
are presented. Choice of words, statistics on the occurrence of
phrases, clauses, sentences and paragraphs, and punctuation cues are among the
factors that have given encouraging results in document class detection.
This, combined with clues from the hypertext nature of the Web, should give
even better precision when it comes to identifying specific “kinds” of
Web pages such as those mentioned above.
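As an illustration, here is a sketch of the kind of surface cues such genre detection builds on; the feature set below is a simplification for illustration, not the classifier published in [Kessler, 1997]:

    import re

    def genre_features(text: str) -> dict[str, float]:
        # Surface cues: word choice, sentence statistics, punctuation.
        words = text.split()
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        n_words = max(len(words), 1)
        return {
            "avg_sentence_length": n_words / max(len(sentences), 1),
            "exclamation_rate": text.count("!") / n_words,
            "question_rate": text.count("?") / n_words,
            "first_person_rate":
                sum(w.lower() in ("i", "me", "my") for w in words) / n_words,
            "digit_rate": sum(w.isdigit() for w in words) / n_words,
        }

Features like these, fed to any standard classifier and combined with hypertext cues such as link density and anchor text, could help separate, say, curricula vitae from fan pages.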
Since we want to come up with the ultimate search tool, we must also allow users to include or exclude any of a number of languages from their searches. As it happens, there is already a standard three-letter code for languages: the Z39.53 standard from the National Information Standards Organization [NISO, 1997]. It has been used by other meta-content projects as well, such as the Dublin Core and the MARC standards, and seems to be an obvious choice for our use. Currently, almost 400 different languages have their own code in Z39.53. Languages out of daily use, such as Old Norse and Old English, are also included with their own codes.
Webrarians, or programs that can tell from a statistical analysis what language a Web page is written in, are fairly simple to implement. A working language recognition function is already in use by the Alta Vista search engine run by Digital [Alta Vista, 1998]. It is necessary to provide the webrarian with some minimal dictionary for every language the agent is to classify documents into. Since many languages share words, and in some cases share words but not their meanings, the list of “typical words” for a language should be chosen carefully. By only including words with more than six letters, most words shared by several languages will be filtered out. A mechanism for handling misspelled words must also be included in the process. If the URL of a Web page contains a geographical domain name, like .se for Sweden or .de for Germany, this should increase the probability that the page is written in the official languages of those nations.
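A minimal sketch of such a guesser, with deliberately tiny and purely illustrative word lists and an assumed domain prior, might look like this:

    from urllib.parse import urlparse

    # Only words longer than six letters, to filter out words shared
    # between languages. Real lists would be far longer.
    TYPICAL_WORDS = {
        "english": {"because", "through", "without", "something"},
        "swedish": {"eftersom", "tillsammans", "förmodligen", "någonting"},
        "german":  {"zwischen", "deshalb", "verschieden", "wahrscheinlich"},
    }
    DOMAIN_PRIOR = {".se": "swedish", ".de": "german"}

    def guess_language(text: str, url: str = "") -> str:
        words = {w.strip(".,!?;:").lower() for w in text.split()}
        scores = {lang: len(words & vocab)
                  for lang, vocab in TYPICAL_WORDS.items()}
        host = urlparse(url).hostname or ""
        for tld, lang in DOMAIN_PRIOR.items():
            if host.endswith(tld):
                # A geographical domain nudges the score toward the
                # official language of that nation.
                scores[lang] += 1
        return max(scores, key=scores.get)

A production version would also need the misspelling handling mentioned above, for instance by matching against word stems or allowing a small edit distance.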
At a time when soon just about everyone will be using the Web, even
the ability to filter out all pages written in languages
you do not know will be valuable. In the long run, offering translations
of Web pages written in other languages may also be an option, as Alta
Vista has begun to demonstrate [Babelfish, 1997]. For children who have
learned only their native language, the Web will be less confusing
if they can read and understand all the information our agents present
to them. To a somewhat lesser degree, this will also be useful for older people
who have never properly learned a second language. In addition, having
search tools that offer language-specific searching may encourage people
to use their own language on their pages. This should be welcomed at the
very least by people who fear that the Net will turn the human race into
a homogeneous, English-speaking population with no geographical cultural
differences.
Other ratings of interest could concern the display of foul language, violence, nudity, and sexually related nudity and/or acts. The level of acceptance for this kind of content varies from person to person and from culture to culture. A scale for rating these “qualities” is necessary to be able to filter or censor the links the index will provide for users. Another point of interest would be the possibility of separating recreational and entertainment pages from serious information pages.
Ratings should be optional, meaning that a page does not have to be assigned any kind of rating to be included in the index. Still, it would be in the interest of a page’s owner to have the page rated, since many users of the index will probably prefer to list only rated pages.
A framework for this latter kind of rating exists in the Platform for Internet Content Selection [PICS, 1996] project. This is a cross-industry working group whose goal is to facilitate the development of technologies that give users of interactive media, such as the Internet, control over the kinds of material to which they and their children have access.
In general, PICS specifies the technical issues that affect interoperability between rating systems. It does not specify how selection software or rating services work, just how they work together. Based on this, standard scales for ratings can be offered, and content providers can use them to voluntarily label the content they create and distribute. Independent labelers can also rate other people’s pages, and the same content may receive different labels from different labeling services. This means that any particular user can choose, for example, to use ratings from a governmental rating service in the USA or from the corresponding governmental rating service in Iran. Several of the main actors on the Internet scene [PICS, 1996] have announced PICS-compatible functionality in their products, and many serious content providers have included PICS ratings on their pages.
One of the rating systems available is the Content Advisor, managed
by the independent, non-profit RSACi, the Recreational
Software Advisory Council on the Internet. Any owner of a Web page
can follow the on-screen instructions at their Web site [RSAC, 1998]
and be given an automatically created PICS/RSACi rating label to paste
onto the page. By answering a number of questions, the owner receives
ratings for how the page rates when it comes to violence, nudity, sex
and language, based on this scale, taken from the RSACi home page:
Level | Nudity | Sex | Violence | Language
4 | Frontal nudity (qualifying as provocative display) | Explicit sexual acts or sex crimes | Rape or wanton, gratuitous violence | Crude, vulgar language or extreme hate speech
3 | Frontal nudity | Non-explicit sexual acts | Aggressive violence or death to humans | Strong language or hate speech
2 | Partial nudity | Clothed sexual touching | Destruction of realistic objects | Moderate expletives or profanity
1 | Revealing attire | Passionate kissing | Injury to human being | Mild expletives
0 | None of the above | None of the above or innocent kissing; romance | None of the above or sports related | None of the above
As an example, Disney’s home page includes this PICS label:
<META http-equiv="PICS-Label" content='(PICS-1.1
"http://www.rsac.org/ratingsv01.html" l gen true comment "RSACi
North America Server" by "webmaster@disney.com"
for "http://www.disney.com" on "1997.01.13T13:29-0500"
r (n 0 s 0 v 0 l 0))'>
and Playboy’s home page has the following PICS label:
<META http-equiv="PICS-Label" content='(PICS-1.0
"http://www.rsac.org/ratingsv01.html" l gen true comment "RSACi
North America Server" by "eileenk@playboy.com"
for "http://www.playboy.com" on "1996.04.04T08:15-0500"
r (n 4 s 3 v 0 l 4))'>
As we can see, in addition to the rating values for n(udity), s(ex), v(iolence) and l(anguage), there is information about when and by whom the rating was done, and there is a pointer to where a description of the rating system can be found. It is possible to give one rating for the front page and separate ratings for the pages “behind” the front page.
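For illustration, a webrarian could read such ratings out of a page with a few lines of code; this regex-based reader is a simplification of real PICS parsing, which handles much more of the label structure:

    import re

    def read_rsaci_ratings(html: str) -> dict[str, int]:
        # Extract the (n s v l) rating values from a PICS-Label META tag.
        m = re.search(r'r \(([^)]*)\)', html)
        if not m:
            return {}
        parts = m.group(1).split()
        # parts looks like ['n', '4', 's', '3', 'v', '0', 'l', '4']
        return {parts[i]: int(parts[i + 1])
                for i in range(0, len(parts) - 1, 2)}

    # Applied to the second label above, this yields
    # {'n': 4, 's': 3, 'v': 0, 'l': 4}.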
Our indexing system should take advantage of the standard labeling format
provided by PICS, so that ratings of various kinds can be provided
as far as possible. Rating whole sites can be more economical
than providing ratings for each individual page. In many cases, the ratings
will be easily retrievable by Web robots and agents, thanks to labels such
as those described above.
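As a sketch of how such ratings could be used for filtering, consider the following; the field names and threshold scheme are illustrative assumptions:

    def acceptable(page_ratings, user_limits, rated_only=True):
        # page_ratings: RSACi levels for the page, e.g. {'n': 4, 's': 3,
        # 'v': 0, 'l': 4}, or None if the page carries no rating.
        # user_limits: the maximum level the user accepts per category.
        if page_ratings is None:
            # Unrated pages are excluded for users who prefer rated pages.
            return not rated_only
        return all(page_ratings.get(cat, 0) <= limit
                   for cat, limit in user_limits.items())

    # acceptable({'n': 4, 's': 3, 'v': 0, 'l': 4},
    #            {'n': 1, 's': 1, 'v': 2, 'l': 2})  -> False

This also shows why page owners have an incentive to rate their pages: with rated_only in effect, an unrated page is simply never listed.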
A second way to build the library without having agents roam the Net
is to let people index their own pages in
the webrary as easily as possible. An intuitive, interactive Web form with
clear instructions and the possibility of human guidance and support would
be a good solution. A conversation-based interaction between the user and
a more or less anthropomorphized webrarian agent would be a better one.
The process of having one’s publication registered with our index should
be considered an integral part of publishing any material. As soon as a
new Web page has been registered through the user interface, its existence
and content should be checked as far as possible (preferably automatically
by software agents, as we will see later) before the new material
is linked to from the official index.
To check whether a link is dead, it is necessary to follow it and see whether it leads anywhere. Even if a link leads nowhere at one specific moment, it cannot be said for sure that the link should be removed from the index. Due to instability in servers, routers and even cables, links may seem to have ceased to exist when they are merely momentarily “out of order”. Therefore, webrarians must perform multiple checks separated in time before a link can be declared dead. Checking the existence of a page once a month will probably be often enough to prevent the index from filling with dead links. Until it can be established that a page does not change its contents, or that it changes them regularly, it is necessary to check up on the page daily. When it is detected that a page changes daily, weekly or monthly, this should be reflected in the page’s entry in the webrary index.
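A minimal sketch of this dead-link policy might look like the following; the failure threshold of three checks is an illustrative assumption:

    import urllib.request
    import urllib.error

    def is_reachable(url: str, timeout: float = 10.0) -> bool:
        # A single probe: follow the link and see if it leads anywhere.
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.URLError, OSError):
            # May be a dead link, or just a momentary outage.
            return False

    def should_remove(failure_count: int, threshold: int = 3) -> bool:
        # Only declare a link dead after repeated failures on
        # checks separated in time, never after a single probe.
        return failure_count >= threshold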
Pages whose contents change daily are likely to be distribution sites for some kind of news. Although the content changes, it is still necessary to decide what area of interest the news generally covers, and to give the news page a suitable position in the webrary index. News pages need to be distinguished from other pages, so that people can choose to search the contents of news pages only.
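One simple way to establish how often a page changes is to compare content digests from successive visits, as in this sketch; the rate thresholds are illustrative assumptions:

    import hashlib

    def content_digest(page_text: str) -> str:
        # A fingerprint of the page contents from one visit.
        return hashlib.md5(page_text.encode("utf-8")).hexdigest()

    def classify_change_rate(daily_digests: list[str]) -> str:
        # Given digests from consecutive daily checks, estimate how often
        # the page changes, so the check schedule can be relaxed.
        changes = sum(a != b for a, b in zip(daily_digests,
                                             daily_digests[1:]))
        days = max(len(daily_digests) - 1, 1)
        if changes / days > 0.9:
            return "daily"        # likely a news page
        if changes / days > 0.1:
            return "weekly"
        return "monthly or less"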
The most time-critical part of indexing new Web pages is deciding what
a page is about and where in the hierarchy it therefore belongs. Hence,
this is where “automagic” work would be most valuable. It should
come as no surprise that it is also one of the most difficult things to “teach”
software to do. In the next chapter, we will take a closer look at what
is contained in the “agent” concept, a promising new technology for automating
and supporting manual work.
The Users
A: A software engineer running her own business; uses the
Internet at work for information gathering and at home for entertainment.
B: A pupil in secondary school; uses the Internet
to talk to friends, to get information for schoolwork and to download computer
games.
C: A researcher; works in a field where the most interesting
publications are written in Polish, and has no interest in the Internet
whatsoever apart from finding information related to his research.
The Situations
Case 1:
A wants to raise her company’s profile and has created a Web page presenting
herself and her skills, and explaining how people can hire her and for what
kinds of jobs. After two months, the access log shows that there have only
been a few visitors to the page. Most of them even seem to have left
quickly, because the search engines’ lists of Web pages containing the words
“female” and “teenager” rather disappointingly include A’s curriculum
vitae.
A has registered her Web pages with Alta Vista, but it is not easy to find them there unless you already know the name of her company or some text that can be found on the page. Searching for “software engineer” gives 55,591 hits, and A’s page is not even in the top 1,000 of those, making it hard to find. A more specific search, for “software engineer” and “for hire”, gives only a few hundred hits, but surprisingly many of them are jokes about software engineers and other less interesting information. A has also tried to get her pages listed in the Yahoo! directory. Actually, she has tried 20 times, but Yahoo! does not seem to give high priority to this kind of business advertising, and it is not possible to pay anyone to ensure that she is included in their directory. She can be found through the Yellow Pages on Yahoo!, but this is just an electronic version of the phone book, so her Web pages cannot be reached from there, only her address and telephone number.
Our solution:
When our system is up and running, A can have her company’s Web page
included in the index hierarchy within a few days, thanks to
the efficient processing of new submissions made possible by webrarian-supported
classification and indexing. A finds it easy to register her Web site,
as a webrarian guides her through the process. She automatically receives
a notification telling her when her page is expected to be included in
the index, based on the size of the submission queue at the time. This way
she does not have to worry about being forgotten and resubmitting her page
several times, which would create an even longer queue.
The Web page is assigned a very accurate description code upon indexing,
so that people who know what information they are looking for can find
it without knowing anything about A’s company at all. The information about
A’s company is also located next to information about similar companies,
so that potential customers can easily find and compare several software
engineers.
Case 2:
B has been given an assignment in his English class: to write about
Hemingway’s “For Whom the Bell Tolls”. B reads the book and writes down
his thoughts, but is not quite sure he has gotten it all right. He would
like to read other people’s ideas about the book, and especially the
writings of other secondary school pupils. B decides there has got to
be something interesting “on the Net”.
First B tries his favorite tool for searching the Internet: Yahoo!. Unfortunately, the result is rather poor: just two links to pages that have nothing to do with either schoolwork or Hemingway. One link concerns the professional wrestler “The Undertaker”, and the other leads to a page with lyrics by heavy metal bands. Nor is there any category for homework or writings by youngsters to be found on Yahoo!.
“Oh well”, B sighs, and tries the next option: Alta Vista. Here the problem is the opposite of the Yahoo! problem: instead of getting very few hits, B gets 5,731 hits for the search term “for whom the bell tolls”. Most of the hits seem to be Web pages from on-line bookshops that offer the book for sale in different editions and price categories, reading lists from numerous intellectuals all over the world, or Web pages dedicated to the heavy metal band Metallica, which has a song with that title. B is overwhelmed and gives up. The pages he is looking for are probably in there somewhere, but how can he separate them from the rest? After narrowing the search by including “Hemingway” as a keyword, there are still more than a thousand links to choose from.
Our solution:
As for everything else anyone might take an interest in,
our index has a branch in its hierarchy for the type of information
B is looking for. When B visits our search site, a webrarian (W) greets
B and initiates a dialog:
W: “Hi. What can I assist you in finding information about?”
B: (types in a text field) “Hemingway”
W: “Ok. Do you want information about ‘Hemingway’ regarding Literature,
Film or Other?”
B: (clicks on one of the three choices on the screen) “Literature”
W: “Ok. Do you want commercial information, educational information
or information from personal homepages?”
B: (clicks) “educational” and “personal homepages”
W: “Ok. You can browse a list of 4,458 pages now, or I
can narrow it down further.” (A list appears on the bottom part of the
search screen.)
B: (clicks) “Narrow it down further, please”
W: “Ok. What language do you want the information in?”
B: (selects from a list of languages) “English” and “Spanish”
W: “Ok. 1,931 pages left. The largest page groups among these are ‘Fan
pages’, ‘Biographies’, ‘Book reviews and criticisms’ and ‘Other’.”
B: (clicks) “Book reviews and criticisms”
W: “Ok. 283 pages left. Do you want to further narrow it down, expand
the scope or search using keywords?”
B: (types ‘For Whom the Bell Tolls’ in a text field and clicks) “Search
using keywords”
W: “Ok. 31 pages left. I cannot narrow the search down further. Browse
the list below to see the results. Click on me if you want to expand
the search.”
Now B has a small number of pages that should all be of interest to him. The webrarian does not see the actual contents of the pages it suggests, but “knows” that the pages are indexed with the keywords “Hemingway”, “Bell” and “Tolls”, and that they carry the page codes for “English” or “Spanish” language and “educational” or “personal home page”, as well as one or both of the Dewey Decimal Classification codes 813.52, “Critical analysis”, and 809.3, “Fiction – criticism”. This corresponds to B’s responses to the webrarian’s questions.
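The narrowing dialog above can be seen as successive filters over index entries, as in this sketch; the entry fields mirror the codes mentioned in the text, but the record layout itself is an assumption:

    def narrow(entries, languages=None, genres=None, dewey=None,
               keywords=None):
        # Each webrarian question simply adds one more constraint.
        for e in entries:
            if languages and e["language"] not in languages:
                continue
            if genres and e["genre"] not in genres:
                continue
            if dewey and not any(code in dewey for code in e["dewey"]):
                continue
            if keywords and not all(k.lower() in e["keywords"]
                                    for k in keywords):
                continue
            yield e

    # B's final query, using Z39.53-style language codes:
    # narrow(index, languages={"eng", "spa"},
    #        genres={"educational", "personal home page"},
    #        dewey={"813.52", "809.3"},
    #        keywords=["Hemingway", "Bell", "Tolls"])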
If none of the 31 pages carries the information B wants, the webrarian can assist B in widening the scope and finding other pages with similar contents. If B identifies himself to the webrarian as an underage pupil, the webrarian automatically removes from the search scope any links to pages containing pornographic parodies of Hemingway’s works or other adult material, and instead puts extra emphasis on pages assumed to be of interest to secondary school pupils.
Some people may think that offering such easy access to this kind of
material will lead to pupils and students no longer doing schoolwork “properly”,
but simply copying other people’s writings instead. We must remember
that teachers can also search the Internet this easily and discover
the cheating. Most important, however, are the new possibilities that easy access
to existing relevant material provides for improving the quality of people’s
work.
Case 3:
C has taken a year off from work to travel around the world, and when
he comes back and starts working again, he first has to catch up
on what has happened in his neck of the research woods while he has been
gone. Unfortunately, the link page covering his research field that he used
before his leave has expired, and he does not know of any other
good link pages. He could mail some colleagues and ask them for pointers
to a good starting site, but he wants to get started right away and
decides to try some of the Internet search tools to see if he can find
any good pages that way.
His first attempt is with Alta Vista. It turns out that while he has been gone, the most relevant keyword he can think of to describe the research he is interested in has become Buzzword of the Year, and he gets tens of thousands of hits. The vast majority of the pages suggested by the search engine are show-off pages from a variety of consulting companies. However, by narrowing down the search with a few more specialized keywords, C reduces the number of hits to a few hundred Web pages that seem to be related to ongoing research. He starts following the links from the top of the list, only to discover that most of them are poor-quality pages that do not contain what he is looking for. Also, many of the links lead nowhere, since they used to be maintained by graduate students who have since graduated, left their universities and had their computer network accounts and Web pages deleted, something the search engine’s robots have not yet discovered. Another thing that puzzles C is that there is very little information to be found about publications by some of the Polish researchers he knows must be on the leading edge of this field.
C gives up and tries Yahoo! to see if there is anything to be found there. It appears that the research field is so new and/or small that Yahoo! has not established a place for it in its hierarchy, and the only thing Yahoo! offers is a link to a search on the Alta Vista search engine instead. C sighs loudly and goes off to make some coffee while he waits for one of his eccentric colleagues to reply to his request for Web pages suitable for his needs.
Our solution:
When visiting our search site, C can identify himself as a full-time
researcher looking for new papers and conferences. With this in “mind”,
the webrarian assisting him can concentrate its work on pages
located at Web sites the webrarian has been programmed to know have a good
reputation for high-quality information, such as the digital library of
the Association for Computing Machinery (ACM) and certain universities.
All commercial sites can be kept out of the search scope. Thanks to
an efficient, automated link-checking system, the index contains very few
links to pages that have moved to new addresses or ceased to exist completely.
C can also choose to search only for link pages about the subject, since
that is what he is most interested in right now.
Once the right branch of the topic hierarchy has been located, C will
also find the pages in Polish. This is possible because, even though the Polish
words used to describe the concepts he is interested in are very different
from the English words C normally uses, the indexing system maps documents
on the same subject to the same indexing code, no matter what language
or words are used to describe the contents.
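A sketch of this language-independent mapping: terms in different languages resolve to the same classification code, so the Polish pages are retrieved by an English query. The terms and the Dewey-style code below are illustrative assumptions:

    # Terms in any language resolve to one shared classification code;
    # the index stores pages under the code, not under the words.
    TERM_TO_CODE = {
        "machine learning": "006.31",            # English term
        "uczenie maszynowe": "006.31",           # Polish term, same code
        "apprentissage automatique": "006.31",   # French term, same code
    }

    def lookup_code(term):
        return TERM_TO_CODE.get(term.strip().lower())

    # Pages indexed under 006.31 are found regardless of which language
    # the query, or the page itself, uses.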