FIRST, Full Information Retrieval System Thesaurus Methodology

Submitted by Ted Takacs on December 27, 2004 - 10:54pm.


Juan Chamero, from Intelligent Agents Internet Corp, Miami USA, August 2001

 

 

Abstract

 

FIRST, Full Information Retrieval System Thesaurus, is a methodology to create
evolutionary HKM’s, Maps of the Human Knowledge hosted in the Web. FIRST points
towards an acceptable “kernel” of the HK, estimated at nearly 500,000 basic
documents selected from an exponentially growing universe that doubles in size
yearly and currently holds nearly 1,400 million sites. Many laudable and
enormous scientific efforts have been made along the idea of building an
accurate taxonomy of the Web and trying to define that kernel precisely. At the
moment the only tools we as users have to locate knowledge in the Web are the
search engines and directories, which deliver answer lists ranging from
hundreds to millions of documents, with the supposed “authorities” hidden in a
rather chaotic distribution within those lists. That means exhausting search
processes with thousands of “clicks” in order to locate something valuable,
let’s say an authority.

 

FIRST creates evolutionary search engines that deliver reasonably good answers
with only one click from the beginning. We use reasonability as a synonym of
mediocrity because the first kernel is only a mediocre solution, to be
optimized thereafter via its interactions with users. FIRST could also be
considered an Expert System able to learn mainly from the mismatches in those
interactions. So initially the kernels generated by FIRST could be considered
mediocre one-click solutions, for a given culture and a given language, but
able to learn, converging to a consensual kernel. To accomplish that, the only
thing FIRST kernels need is interactions with users. The more closely the users
represent the whole population, the more the kernel will tend to represent the
knowledge of that whole. For that reason, we imagine a network of HKM’s
implemented via FIRST or other equivalent evolutionary tools. As each node of
this semantic network will serve a given population (or market), we could
easily implement something like a DIAN, Distributed Intelligent Agents Network,
to coordinate the efforts made by each local staff of Intelligent Agents
(coopbots). Each node will have a kernel in a different stage of evolution
depending on its age, measured in interactivity, and on its population
profile.

 

The main difference between FIRST and most present knowledge classification
and representation projects rests on the hybrid procedure for building the
mediocre starting solution: a staff of human experts aided by IA’s and IR
algorithms. The reason for this approach is the current “state of the art” of
Artificial Intelligence, AI. The best current robots are unable to accurately
detect general authorities and are easily deceived, unfortunately, by millions
of document owners who, either unethically or out of ignorance, try to present
their sites as authorities. Another flaw is the primitiveness of even the most
advanced robots, which are unable to write a comprehensible synthesis of a
site. The human being, on the other hand, is extremely good at those tasks, by
far more accurate and more efficient.

 

The map itself consists of I-URL’s, Intelligent URL’s: brief documents of half
a page to two pages that describe the referenced sites like pieces of
tutorials, classified along a set of taxonomy variables and tagged with a set
of Intelligent Tags, some of them used to manage and track their evolutionary
process. For each Major Subject of the HK, a Tutorial, a Thesaurus, a Semantic
Network and a Logical Tree are provided and bound to the virtual evolutionary
process of the users playing a sort of “knowledge game” versus the kernel.

 

FIRST is presented here within the context of the IR-AI “state of the art”. The
methodology has been tested to build a HKM in 120 days. Time is a very
important engineering factor because of the explosive expansion of the Web and
its inherent high volatility. The task performed by the staff of human experts
is similar to the task of providing a Knowledge Expert System with the basic
knowledge to “play” a Game of Knowledge reasonably well versus average Web
users. It resembles the beginnings of Deep Blue, the machine that beat
Kasparov: initially it had to be able to beat not a master but at least a
second category chess player (with a reasonably good ELO rating), and from
there follow the evolutionary path through the higher levels: first category,
master, international master, grand master, championship.

 



Content Index

 

 

1- The Future of Cyberspace - The Noosphere

Introduction

The Web space Regions

Region Volumes Estimations

The Web space looks like the Sky at night

How the Search Engines illuminate the Resources

The Cyberspace as a Global Market

Websites are built to match users

Mismatch reasons

The solution

What does Intelligent mean

Some examples about actual general search inefficiency

Human Knowledge Shells

 

2- About a New Approach to Internet Communications

Internet is a very particular net

Information Offer versus Information Demand

What people need

Jargons Evolution

 

3- FIRST, Full Information Retrieval System Thesaurus

 

The actual Information Retrieval process in the Cyberspace

The main reasons of that uselessness

Uselessness Measure

Using Search Engines

Searching in Databases

Our Approach to this problem: FIRST

Internet Drawbacks: Internet the realm of mismatch

The solution in Theory

Are the Search Engines really useless?

WOO_1: Architecture and meaning of the first Virtual Library

Virtual Library

Volume of “sufficient” Virtual Libraries

The Two Key Components for Retrieval

The Thesaurus

The power of the right statistics

 

 

4- i-URL’s and Intelligent Databases

 

i-URL’s Databases

The inefficiency of actual Search Engines and Directories

First Step: Valuable Comments - Virtual Libraries

Second Step: How to build Virtual Libraries

The Thesaurus concept

How we combine the virtues of Thesaurus and Indexes

How an I-URL looks like

The egg-chicken problem

Advantages of our i-Virtual Libraries

Notes



5- Evolutionary Process - Some Program Analysis Considerations

 

User Track Mechanics

Tracking “zoom”

Thesaurus Evolution Mechanics

Analysis of some other types of user interactions

Another crucial events: users’ feedback

Path keyword <=> string correspondences

 

6- Noosphere Mechanics - Evolutionary Sequence

 

7- An Approach to Website Taxonomy

 

How to browse a site to measure its structure

A first raw approach

 

8- FIRST within the vast world of AI - IR

 

FIRST niche

Navigation by AI-IR Authorities

KR

DAI and KK

Key behavior of some IA’s

DAI, network

WEB scenario

HKM complementary Information

Operational Hints

Some Subtle Hints

HITS

OGS

TAPER

Web Sizing

KR - One outstanding IR “authority”

 

New and Old ideas in action now

Clustering: Vivisimo and Teoma

More about KR

 

 



1- The Future of Cyberspace

The Web space and the Noosphere[1]

Introduction

You could find 30,136 pages dealing with “noosphere” in Altavista at 2:22 PM
Eastern Time (USA and Canada) on Thursday, the 12th of April 2001. It is a
rather strange word for many people, and it has not yet earned an entry in the
Merriam Webster online dictionary. However, we know, use and enjoy the
Cyberspace, a concept that at nearly the same time had as many as 777,290
entries in the same Altavista and, by contrast, has had an entry in Merriam
Webster since 1986, with the following meaning: the on-line world of computer
networks. Web space is another neologism not yet included in that dictionary,
but it has 485,805 entries in Altavista.

 

The Web space grows at a fantastic pace, holding today nearly one and a half
billion documents, ranging from Virtual Libraries and virtual reference e-books
dealing with the Major Subjects of the human knowledge to ephemeral news and
trivial virtual flyers generated “on the fly” continuously. In the Web we may
find documents belonging to any of the three major Internet resources or
categories: Information, Knowledge and Entertainment.

 

 

The Web space Regions

 

[Figure: the Web space regions. Black crown: the Web space; gray crown: the intermediate net of intelligent summaries; green circle: the users.]

 

In the above figure the black crown represents the Web space and the green
circle the users. The gray crown represents an intermediate net to be built in
the near future with intelligent summaries of the Human Knowledge, pointing to
the basic documents and e-books of the Web. One user is shown extracting a
“cone” of what he/she needs in terms of information and knowledge. The
intelligent summaries must be engineered to be good enough as introductory
guides/tutorials with a set of essential hyperlinks inside. A user who wants
more detail then goes directly to the right sources within the black region.
Depending on the Major Subject dealt with, the user may go from summary to
summary or jump to higher level guides inside the gray region, going to the
black region only to look for specific themes. Moreover, many users will be
satisfied browsing within the gray region without even venturing into the black
region.

 

 

 

Another user goes directly to the black region guided by classical search
engines, as now. The black region will always be necessary and its size will
grow fast as time passes. The gray region, on the contrary, will fluctuate
around a medium volume, growing at a very low rate. In effect, the Human
Knowledge “kernel” of basic documents is almost bounded: its content changes,
but always around the same set of Major Subjects. The growth of the gray region
is extremely low in comparison to that of the black region. Some Major Subjects
die and others are born over time, but slowly.

 

 

Region Volumes Estimations

For more Web sizing information see our Chapter 8 about The
Vast World of AI-IR

 

As a science fiction exercise we invite you to make some calculations
resembling some of Isaac Asimov’s stories and Carl Sagan’s speculations. If the
current Human Knowledge is bounded to, let’s say, 250 Major Subjects or
Disciplines, and for each of them we define a Virtual Library with 2,000
non-redundant e-books on average, we will have a volume of 500,000 e-books. Now
we could design a methodology to synthesize an intelligent text summary of each
e-book in no more than 2,000 characters on average, totaling 1,000 MB, or 1 GB,
storing each character in a single byte. That would be the volume of the gray
region: not much, really!

 

Let’s then compare this volume to the volume of the black region and to the
volume of the resources of the Human Knowledge. Once upon a time, there was a
Web space with one and a half billion documents with an average volume
estimated at 2.5 MB (we have documents ranging from 10 KB and less to 100 MB
and more; to get that figure we supposed the arbitrary size series 1, 10, 100,
1,000, 10,000, 100,000 in KB and assigned to the terms the arbitrary weights
.64, .32, .16, .08, .04, .02 respectively, which gives a weighted sum of about
2,500 KB). Then we have a volume of nearly 3,750,000,000 MB! Within that giant
space float, dispersed, the basic e-books, the resources of the Human
Knowledge, with an estimated volume of nearly 500,000 MB, assigning 1 MB to
each one: half a million characters of text and 100 images of 5 KB on average.

 

Black Region: ~3,750,000 GB => HK ~ 500 GB => Gray Region ~ 1 GB
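
The estimates above can be reproduced with a few lines of arithmetic. The following Python sketch is only illustrative: the function and variable names are our own assumptions, decimal units are used (1 GB = 1,000 MB), and the last two weights are taken as .04 and .02, continuing the halving series, since that is the reading that yields the 2.5 MB average.

```python
# Illustrative sketch; all names are hypothetical, the figures come from the text.
SIZES_KB = [1, 10, 100, 1_000, 10_000, 100_000]
# Assumption: the weights halve at each step (.64 down to .02). They are used
# as given, without normalizing them to sum to 1.
WEIGHTS = [0.64, 0.32, 0.16, 0.08, 0.04, 0.02]


def average_document_kb(sizes_kb, weights):
    """Weighted sum of the size series, in KB."""
    return sum(size * weight for size, weight in zip(sizes_kb, weights))


def region_volumes_gb(documents=1_500_000_000, avg_doc_mb=2.5,
                      subjects=250, ebooks_per_subject=2_000,
                      ebook_mb=1.0, summary_chars=2_000):
    """Return the (black, HK, gray) volumes in GB, one byte per character."""
    ebooks = subjects * ebooks_per_subject            # 500,000 e-books
    black = documents * avg_doc_mb / 1_000            # MB -> GB
    hk = ebooks * ebook_mb / 1_000                    # MB -> GB
    gray = ebooks * summary_chars / 1_000_000_000     # bytes -> GB
    return black, hk, gray


if __name__ == "__main__":
    print(average_document_kb(SIZES_KB, WEIGHTS))  # about 2,500 KB
    print(region_volumes_gb())
```

Changing any of the default parameters shows how sensitive the ratio between the three regions is to each assumption.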

 

An incredible result, which demonstrates how easy it will be to compile a rather
stable HKIS, Human Knowledge Intelligent Summary, in relation to the unstable,
noisy, bubbling, fizzy and always growing black region. Once the effort is
done, the upgrade will be facilitated by Expert Systems and a set of
specialized Intelligent Agents that will detect and extract from the black
region only the “necessary” changes.

 

The Web space looks like the Sky at night

 

In the figure above we depict the actual Web space in black, resembling the
physical space of the Universe. No doubt the information we need as users is up
there, but where? That virtual space is really almost black for us. Some
members of the Cyberspace that provide searching services, known as Search
Engines and/or World Wide Web Directories, are like stars that radiate light
all over the space to make sites indirectly visible. Sometimes we may find a
few sites with their own light, like stars, activated by publicity in
conventional media, but the rest are only illuminated by those services at
users’ request. Let’s go a little deeper into the nature of this singular
searching process.

 

For each resource (body) located in the Web space at a URL, which stands for
Uniform Resource Locator, the robots of those lighting services prepare a brief
summary with some information extracted from it, no more than a paragraph, and
then all the information collected goes to their databases. The summaries have
attached to them some keywords extracted from the resources visited and are
consequently indexed under as many keywords as they have attached.
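
The indexing just described is, in essence, an inverted index from keywords to summaries. A minimal Python sketch follows; the data layout, the function name and the example URLs are assumptions made for illustration, not the format used by any real search engine.

```python
from collections import defaultdict


def build_keyword_index(summaries):
    """Map each keyword to the set of URLs whose summary carries it."""
    index = defaultdict(set)
    for url, entry in summaries.items():
        for keyword in entry["keywords"]:
            index[keyword.lower()].add(url)
    return index


# Hypothetical robot output: one short summary plus keywords per URL.
summaries = {
    "http://example.org/tires": {
        "summary": "A store selling tires and wheels.",
        "keywords": ["tires", "wheels"],
    },
    "http://example.org/rims": {
        "summary": "Alloy rims for sport cars.",
        "keywords": ["rims", "wheels"],
    },
}

index = build_keyword_index(summaries)
print(sorted(index["wheels"]))  # both URLs are indexed under "wheels"
```

A summary with three keywords is stored once but reachable through three index entries, which is what "indexed under as many keywords as they have attached" amounts to.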

How the Search Engines illuminate the Resources

 

The current robots are very “clever” but extremely primitive compared to human
beings. They do their best, and they also have to work fast, in fractions of a
millisecond per resource, so it would be impractical to make them more
sophisticated, because the time of “evaluation” grows exponentially with the
level of cleverness. To facilitate the robots’ work, Website programmers and
developers have wise tools at hand, but many of them overuse those facilities
so badly that they make them unwise. With those tools the programmers could
communicate to the robots some essential information the site owners wish to
be known about the site.

 

Those wise gateways are now noisy because most people try to deceive the
robots, overselling what should be the essential information. Why do they do
that? Because the Search Engines must present the sites listed hierarchically,
the best first! Something similar occurs in the Classified Section of the
newspapers: people wishing to be listed first unethically make nonsense use of
the first letter of the alphabet, so that AAAAAAA Home Services goes before,
for instance, AA Home Services. The Search Engines do not have much room to
design a “fair” methodology to rank the sites with equity, and besides,
Internet is an unpoliced realm.

 

One trivial criterion would be to count how many times a keyword is cited
within the resource, but that proved to be misleading: the robots only browse
the resource partially, so it is practically impossible for them to
differentiate a sound academic treatise from a student’s homework on the same
subject. To make things worse, programmers, developers, and content experts
know all those tricks and consequently overuse the keywords they believe are
significant.
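
Why the keyword-count criterion misleads can be seen in a small Python sketch. It is a deliberately naive ranking written for illustration only; the function names and the two sample documents are assumptions, and no real search engine is claimed to rank this way.

```python
import re


def keyword_count(text, keyword):
    """Count exact occurrences of a keyword among the words of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(keyword.lower())


def rank_by_keyword_count(documents, keyword):
    """Order document names by how often they cite the keyword, highest first."""
    return sorted(
        documents,
        key=lambda name: keyword_count(documents[name], keyword),
        reverse=True,
    )


# Hypothetical documents: a sober text versus a page stuffed with the keyword.
documents = {
    "treatise": "Tire wear depends on load, inflation pressure and road surface.",
    "stuffed_page": "tire tire tire best tire cheap tire buy tire now",
}

print(rank_by_keyword_count(documents, "tire"))
```

The stuffed page cites the keyword six times against one, so the naive count places it above the text that actually says something.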

 

The Search Engines have improved a great deal over the last two years, but the
searching process continues to be highly inefficient and tends to collapse. To
help site owners gain positions within the lists (in fact, to get more light),
ethical and unethical techniques and programs proliferate, most of them apt to
deceive the “enemy”, namely the Search Engines. Even in a “Bona Fide” utopia it
is impossible for a robot to differentiate between a complex site and a humble
site dealing with the same subject. Complex site architectures could even make
the sites invisible to the robots, because these are only well suited to
evaluate flat and simple sites. For instance, search engines like Google also
need to break even commercially and start selling pseudo forms of score
boosting to desperate site owners that need traffic to subsist.

 

We emphasize again that the “light” a Search Engine provides to each URL is
indirect, like the Moon reflecting the Sun’s light. Our conclusion, then, is
that most of the information and knowledge is hidden in the darkness of the
Cyberspace.

 

The Cyberspace as a Global Market

The Matchmaking Realm

[Figure: the Matchmaking Realm: a Website’s Offer cone and a user’s Demand cone]

Now that we know the meaning of HK, the Human Knowledge, we may define HKIS, the
Human Knowledge Intelligent Summaries, a set of summaries that we will soon
explain why we call intelligent, and NHKIS, a Network of Human Knowledge
Intelligent Summaries, which corresponds to the gray crown of the above
figures. Now we are going to enter into the problem of the languages and
jargons spoken in the Black Region, in the Gray Region and mainly in the Green
Region.

 

 

Websites are built to match users

Internet the Realm of Mismatch

 

Websites are built to match users; they are like lighthouses in the darkness,
broadcasting information, knowledge and, in the case of e-Commerce, some kind
of attracting information such as “opportunities”. What really happens is that
at present Internet is more the Realm of Mismatch than of Matching. The
lighthouse owners cannot find the users, and the users can neither find the
alleged opportunities nor understand the broadcast messages. This mismatching
scenario is dramatic in the case of Portals, huge lighthouses created to
attract as many people as possible via general interest “attractions”.

 

Something similar occurs with the databases that store millions of units of
supposedly useful information such as catalogs, services, manufacturers,
professionals, job opportunities, commercial firms, etc.: users cannot find
what they need. When we talk of mismatch we mean figures well over 95%, and in
some databases matching efficiencies lower than 0.1%.

 

In the figure above we depicted this dramatic mismatch. The yellow point is a
Website, with its offer represented by the cone emerging from it, let’s say the
Offer expressed in its language and in its particular jargon. A black point
within the green circle represents a user, and the cone emerging from it
his/her Demand, also expressed in his/her language and particular jargon.

 

 

Mismatch reasons

Websites and users speak and think differently

 

What we discovered is that both sides speak approximately the same language but
certainly different jargons and, more than that, they think differently! We
have depicted the gray crown because the portion corresponding to the site’s
Major Subject virtually exists: that is the portion in dark gray within its
cone. Websites have the “truth” expressed in their particular jargon, and
sometimes in the “official” and standard jargon. If the Website were, for
instance, a “Vertical” of the Chemical Industry, its jargon would of course be
within the Chemical Industry Standards and its menu should be technically
correct, resembling the Index of a Manual for that particular Major Subject:
Chemical Industry.

 

So the conclusion of two years of research studying the causes of mismatch was
that the lighthouses speak, or intend to speak, official jargons, certified by
the establishment of their particular Major Subjects. They are supposed to have
the truth and they think as “teachers”, expressing their truth in their menus,
which are in fact “logical trees”. They may claim to be e-books, and they
behave, think, and look pretty much the same as physical books.

 

Now let’s analyze how the users act, express themselves and behave. If a user
visits the site to learn, the convergence of the cones is forced: the user has
to think in terms of the concepts of the menu, which for him/her resembles a
program of study, and we have a match scenario. If the user visits the site to
search for something, that’s different. When we go to search for something we
tend to think in terms of keywords instead, keywords that belong to our own
jargon and, at large, to our own Thesaurus. So, either through ignorance or, on
the contrary, through expertise, the users’ cones diverge substantially from
the site’s cone. One of the main reasons for this divergence is that the site
owners do not know what their target markets need. Many of them are migrating
from conventional businesses to e-Commerce approaches and extrapolate their
market know-how as it is. They worked hard for decades to match their markets
and to establish agreed jargons, and now they have to face unknown users coming
from virtually all over the world.

 

The solution

Evidently the solution will be the evolution from mismatch to match in the most
efficient way. To accomplish that, both the Offer and the Demand have to
approach each other until both share a win-win scenario and a common jargon.

[Figure: the mismatch zones: red (idle Offer), gray (common section), blue (unmet Demand)]

In the figure above we depict a mismatch condition where we may distinguish
three zones: the red zone represents the idle and/or useless Knowledge; the
gray zone corresponds to the common section with an agreed Thesaurus
concordance; and the blue zone corresponds to what the users need and want but
apparently does not exist within the site. So the site owners and
administrators have three lines of action: a) reduce the red zones to zero, for
instance by adapting and/or eliminating supposed “attractions”; b) learn as
much as possible about the blue zone; and c) combine both strategies.

 

At this moment the dark green zones, where Offer and Demand match, are
extremely tiny, less than 5%, Internet being the Realm of Mismatch between
Users’ Demand and Sites’ Offer. The big efforts to be made consist of
minimizing costs by eliminating useless attractions and of learning from users’
unsatisfied needs. To accomplish both purposes the site owners need intelligent
tools, agents that warn them about red and blue events.
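
If the Offer and the Demand are each reduced to a set of keywords, the three zones become plain set operations. The Python sketch below rests on that simplifying assumption; the function names and the sample keywords are hypothetical, and a real site would weight keywords by usage rather than count them equally.

```python
def mismatch_zones(offer_keywords, demand_keywords):
    """Split two vocabularies into the red, gray and blue zones."""
    offer = {k.lower() for k in offer_keywords}
    demand = {k.lower() for k in demand_keywords}
    return {
        "red": offer - demand,    # offered but never asked for
        "gray": offer & demand,   # common section, agreed jargon
        "blue": demand - offer,   # asked for but not found in the site
    }


def match_ratio(zones):
    """Share of the whole vocabulary that lies in the common zone."""
    total = sum(len(zone) for zone in zones.values())
    return len(zones["gray"]) / total if total else 0.0


# Hypothetical site menu versus hypothetical user queries.
offer = ["radial tires", "alloy rims", "company history", "press releases"]
demand = ["alloy rims", "sport wheels", "cheap tires"]

zones = mismatch_zones(offer, demand)
print(zones)
print(match_ratio(zones))
```

An agent that recomputes these sets periodically could raise the red and blue warnings mentioned above: growth of the red set points to idle content, growth of the blue set to unmet demand.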

 

 

What does Intelligent mean

 

Let’s analyze the basic process of user-Internet interactions. A user visits a
site to interact in one of three forms, sometimes concurrently: investing time,
clicking on a link, or filling a form or a box with some text, for instance to
make a query to a database. Site statistics are well prepared to account for
clicks, telling which “paths” were browsed by each user, but they are not well
suited to account for textual interactions. Of course, you may record the
queries and even the answers, but that’s not enough to learn from mismatching.
To accomplish that we may create programs and/or intelligent agents that
account for the different uses of the components of each answer, but they then
have to do rather heavy accounting.

 

If we query a commercial database for tires, the answer would be a list of tire
stores. To have statistics about how frequently the users ask for this specific
keyword we need to account for it; to know about the “presence” of each store
as a potential seller we need to account for it; if we want to know about the
popularity of each store we need to go further and account for that too, and so
forth. That accounting process involves a terrific burden, even when done on the
site server’s side.

 

An intelligent approach would be to have all the counters needed to detect
document popularity and users’ behavior built into the data to be queried.
That’s the beginning of the idea: to provide, within the data to be queried by
users, a set of counters for each type of statistic. So when an item of data is
requested, a counter is activated accounting for its presence; when it is
selected by a click, another counter is activated; and when the user, after
reading the “intelligent summary” received, decides to click through to the
original site or to one of its inner hyperlinks, another counter is
activated.

 

[Figure: a typical track of user-site interaction]

 

 

Here a typical track of user-site interaction is represented. The user makes a
query for “tires”. The i-Intelligent Database answers by sending all the data
it has indexed under tire, adding a list of the synonyms and related keywords
it has for tire. Each activated i-URL accounts for its presence in that answer
by adding one to the corresponding counter in the i-Tags zone. If the user
clicks on a specific i-URL, the system presents it to the user, accounting for
this preference in another counter of the i-Tags zone.

 

Finally, if the user decides to access the commented site located in the black
crown, he/she makes a click and another counter is activated within the i-Tags
zone. At the same time the counter corresponding to the keyword tire is
increased by one, and the same happens if the user activates some synonym or
related keyword. If the answer contains no data it means a mismatch, caused
either by an error or by a resource that does not exist within the database.
In both cases the system has to activate different counters for the wrong or
non-existing keyword in order to account for the popularity of this specific
mismatch. If the popularity is high, it is a warning signal to the site’s Chief
Editor (either human or virtual) about the potential acceptance of the keyword,
either as a synonym or as a related keyword. At the same time, the system may
urge a search for additional data within the black region. From time to time
the system could suggest a revision of the database of i-URL summaries in order
to assign data to the new keywords as well. We will see how to work with a
network of these Expert Systems at different stages of evolution.
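
The track described above can be sketched in Python as data that carries its own counters. This is a minimal illustration under our own assumptions: the class names, the counter names ("presence", "selected", "visited") and the threshold for warning the Chief Editor are hypothetical, and the keyword counter is increased when the query is answered, which simplifies the sequence given in the text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IUrl:
    """An i-URL: a summary of a site plus its own counters (the i-Tags zone)."""
    url: str
    summary: str
    keywords: set
    tags: Counter = field(default_factory=Counter)


class IntelligentDatabase:
    def __init__(self, i_urls, synonyms=None):
        self.i_urls = {item.url: item for item in i_urls}
        self.synonyms = synonyms or {}   # keyword -> synonyms and related keywords
        self.keyword_hits = Counter()    # popularity of known keywords
        self.keyword_misses = Counter()  # popularity of each mismatch

    def query(self, keyword):
        """Answer a query, counting presence per i-URL or recording a mismatch."""
        keyword = keyword.lower()
        results = [i for i in self.i_urls.values() if keyword in i.keywords]
        if not results:
            self.keyword_misses[keyword] += 1
            return [], []
        self.keyword_hits[keyword] += 1
        for item in results:
            item.tags["presence"] += 1
        return results, self.synonyms.get(keyword, [])

    def select(self, url):
        """The user clicks on an i-URL to read its summary."""
        item = self.i_urls[url]
        item.tags["selected"] += 1
        return item.summary

    def visit_site(self, url):
        """The user clicks through to the commented site in the black region."""
        self.i_urls[url].tags["visited"] += 1

    def popular_misses(self, threshold):
        """Unknown keywords asked often enough to warn the Chief Editor."""
        return [k for k, n in self.keyword_misses.items() if n >= threshold]


db = IntelligentDatabase(
    [IUrl("http://example.com/a", "Tire store A", {"tire"}),
     IUrl("http://example.com/b", "Tire and rim store B", {"tire", "rim"})],
    synonyms={"tire": ["wheel", "rim"]},
)
results, related = db.query("tire")
db.select("http://example.com/a")
db.visit_site("http://example.com/a")
db.query("tyre")  # unknown spelling: recorded as a mismatch
print(db.i_urls["http://example.com/a"].tags, db.popular_misses(1))
```

Because the counters live inside each record, no separate accounting pass over the query logs is needed to obtain presence, preference and visit statistics.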

 

 

As part of the intelligent feature we consider registering the IP of the users’
interactions and the sequence of queries, normally related to something not
found. The users’ keyword strings are in turn related to specific subjects
within the Major Subject of the site. So, statistically, the analysis of the
keyword strings tells us about the popularity of the actual menu items and
suggests new items to be considered.

 

 

 

 

Some examples about actual general search inefficiency

 

Let’s try to search for something apparently trivial like “Internet
statistics”, using for instance one of the best search engines, Google: more
than 1,500,000 sites! Do not dig too deep into that list; only check what the
first 20 or 30 sites offer. Most of the content shown by the sites of that
sample is obsolete, and where it is updated you are harassed by a myriad of
sales offers for particular statistics, market research studies and the like,
priced in the thousands and up. And if this scenario occurs with supposed
authorities (Library of Congress, Cyberatlas, About.com statistics sites,
Internet Index, Data Quest, InternetStats), what then with the remaining
1,500,000?

 

What if that noisy cluster were replaced by a brief comment made by a
statistician, telling the state of the art of Internet Statistics and
suggesting alternative ways to compile statistics from free, updated
authorities that surely exist in the Web? That is very easy to do and
economical too; it should take no more than one hour of that specialist’s time.
Of course, that would be feasible as a permanent solution only if the cost of
updating that kind of report were relatively insignificant. Concerning this
problem, we estimated that the global cost of updating a given HKM is of the
order of 3% to 5% per annum of the cost of its creation. So the HKM’s will be
updated in two ways: evolutionarily, through their interaction with users, and
authoritatively, through updates by human experts.

 

Let’s see other examples with “sex” and “games”. Sex has more than 48,000,000
sites, and it is well known that the sources of sexual and pornographic content
are fewer than 100. The rest are speculators, repetitions, transfers, and
commuting sites of only one click per user, playing the ingenuous role of
useful idiots. Something similar occurs with games, with more than 35,000,000
sites, where again the world providers of game machines, solutions, and
software are no more than 100!

 

 

Human Knowledge Shells

 

For a given culture and for a given moment we have the following regions in the
Web space:

[Figure: the Human Knowledge shells]

 

 

Red: a given HKM

Black Blue: HK Virtual Library

Regular Navy Blue: Ideal HK

Blue: Ideal HK plus New Research

Light Blue: Ideal HK plus NR plus Knowledge Movements

Deep Light Blue: Ideal HK plus NR plus KM plus Information

 

Everything works within an expanding universe of Human Intellectual Activity.
It takes a great deal of time and effort for new ideas and concepts to become
part of the Ideal HK. We as humans have two kinds of memory, semantic and
episodic, and any culture at a given moment has its semantic memory, conscious
and unconscious, intuitive and rational, as well as its episodic memory.

 

Throughout human history the dominant cultures have controlled the inflow of
the Human Intellectual Activity in explicit and implicit ways, for instance by
discouraging dissent. Internet allows us as users to dissent from any form of
“established” HK and to influence the allegedly ideal HK on an equal basis.
This feature will accelerate the enrichment of the ideal HK in an unprecedented
way. For that reason we emphasize in FIRST the mismatch between the HKM and
users’ thoughts, questions and expectations, oriented to satisfy users, that
is, the human being as a whole and as a unit.

 



2- About a New Approach to Internet Communications

Linguistic Approach

 

 

Internet is a very particular net

 

We make specific reference to Internet Data Management because the “Big Net”
differs substantially from most nets. Internet deals with all possible groups
of people and all possible groups of interest. Internet users belong to all
possible markets, from kids to old people, at all possible economic, social and
political levels and cultures. This Universality makes the Internet man-machine
interactions extremely varied.

 

On the contrary, in any other network we may define a “jargon”, ethics and
rules. When we build a new Internet Website we really do not know who our
potential users will be, and consequently what they want or what they need; we
do not even know their jargons. We imagine a target market, and for that
specific market we design the site content, in fact the “Information Offer” to
that market.

 

 

 

[Figure: the matchmaking process within the Internet “noosphere”]

 

 

The figure above depicts the matchmaking process within the Internet
“noosphere”. The users, in green, express what they want and even think in
terms of “keywords”, expressed in their own jargon, and are open and flexible.
On the contrary, the Website owners believe that through their sites they have
the truth and nothing but the truth. In that sense, whether or not they are an
authority, they resemble “The Law” of the establishment of the Human Knowledge.
The law, for each Major Subject, is expressed in Indexes of the main branches
of that Major Subject, resembling a “Logical Tree”, depicted in gray over the
yellow truth. They imagine their sites as universal facilitators, but always
following the pattern of the logical tree and expressed in their jargons.

 

The Websites have their own Thesaurus, a set of “official” keywords, depicted in
white over a black background, within the darkness of the Web space. A
correspondence exists between the logical tree and the Thesaurus. The Website
owners are shown with the Truth Staff in yellow. The user-Internet interactions
are depicted as a progressive matchmaking process, going from green to black
and vice versa, each side learning from the other through match and mismatch.
Both sides strive to know each other by interchanging knowledge.

 

 

Information Offer versus Information Demand

 

Paradoxically, even though the Web is so well suited to add, generate and manage
intelligence, most people ignore this fantastic possibility. If we denote our
Information Offer by WOO, which stands for What Owners Offer, and what the
users want by WUW, which stands for What Users Want, the Web Architecture
permits the continuous match between them and, as a byproduct, the intelligence
emerging from any mismatch.

 

That possibility means the following: WUW is what users want, expressed in their
specific jargon/s, while WOO is the Website’s information offer, expressed in,
let’s say, the “official/legal” jargon, the one we choose to communicate with
our target market. The continuous mismatch between WOO and WUW would permit us
to know the following five crucial things:

 

· What the Market wants

· The Market major characteristics

· The Market homogeneity and/or its segmentation

· The Market jargon/s

· The Market needs.

 

The knowledge of the market jargon/s permits us to optimize our offer: for
instance, a negative answer to a user query could mean either that we don’t
have what he/she wants or that the name of what he/she is looking for in
his/her jargon differs from the name used in our jargon.

 

 

What people need

 

What we know directly from users’ queries is what they want, not what they
need. The difference between WUW and WUN, What Users Need, is substantial.
People generally know what they need but adjust their needs to the supposed or
alleged Website capabilities. We learn what our users need as time passes,
provided we make use of the intelligence byproducts and/or of surveys.

 

 

How the Information Offer and the Queries are normally organized

 

The IO (Information Offer) is normally presented as ordered sets in the form of
Catalogs, Indexed Lists and Indexes, but the queries, where the users express
their particular needs (WUN), are expressed by keywords. The two communication
systems are completely different, yet they can complement each other, and we
could make them work together towards the ideal match between WUN and WOO.

 

As we will see soon, the users communicate with the different Websites via
their subjective jargons, at least as many jargons as the MS, “major subjects”,
they are interested in. For instance, if I’m an entrepreneur who manufactures
sport car wheels, I’m going to query B2B sites for subjects related to sport
car wheels, expressing myself in “my” jargon, which differs from the “official”
jargons used in the B2B sites; and of course, the query outcomes will strongly
depend on the differences between those jargons.

 

 

Jargons Evolution

 

In a similar way as the official languages change from time to time, largely
under the pressure of the people’s jargons, with both coexisting at any time,
we may endow the Websites of the Cyberspace with an extremely efficient
evolutionary feature via Expert Systems that learn from the man-Internet
interactions. We dare to qualify this feature as extremely efficient because in
the Cyberspace every transaction can be easily and precisely accounted for. So,
each time a user uses a keyword belonging to his/her jargon, this event could
and should be accounted for.

 

Let’s then imagine what kind of intelligent byproduct we could extract from
this simple but astonishing feature. Within a homogeneous market the keywords
tend to be the same among its members. So in our last example, if the majority
of users make queries asking for wheels and the word-product wheel does not
exist in our database, a trivial byproduct takes the form of the following
suggestion: add wheels to the database as soon as possible. On the other hand,
if the word-product “ergaston” was never asked for over a considerable amount
of time, another trivial message should be: take ergaston out of the database.
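
The accounting just described can be sketched in a few lines of Python. This is
only an illustration: the class name KeywordLedger, its methods and the
min_hits threshold are assumptions made for the example and are not part of
FIRST; only the two rules (add what is asked for but missing, drop what is
never asked for) come from the text.

```python
from collections import Counter


class KeywordLedger:
    """Counts every keyword users type and compares it with the site offer."""

    def __init__(self, offered):
        self.offered = set(offered)   # word-products present in the database
        self.demand = Counter()       # how often each keyword was asked for

    def record_query(self, keyword):
        # Every query is an event that is accounted for.
        self.demand[keyword.lower()] += 1

    def suggestions(self, min_hits=3):
        # Asked for often but missing from the offer: candidates to add.
        to_add = sorted(k for k, n in self.demand.items()
                        if n >= min_hits and k not in self.offered)
        # Offered but never asked for in the observed period: candidates to drop.
        to_drop = sorted(k for k in self.offered if self.demand[k] == 0)
        return to_add, to_drop


# Hypothetical data: the site offers "rims" and "ergaston", users ask for wheels.
ledger = KeywordLedger(offered={"rims", "ergaston"})
for q in ["wheels", "Wheels", "rims", "wheels"]:
    ledger.record_query(q)
to_add, to_drop = ledger.suggestions()
print(to_add, to_drop)
```

In a real site the threshold and the observation period would have to be tuned
to the traffic of each market.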

 

[Figure: evolution of the matchmaking process]

 

The figure above depicts the evolution of the matchmaking process. In the
beginning, the Website owners had the oval green-gray target, where one user is
shown as a black dot. But that user really belongs to a users’ affinity market,
depicted as a dark green oval, with a cone of Internet interest that differs
too much from the ideal initial target. The Website owners need an intelligent
process to shift towards the bigger potential market, in dark green. The cone
with the yellow border depicts the final “stable” matchmaking.



 

3- FIRST, Full Information
Retrieval System Thesaurus

 

 

The actual Information Retrieval process in the Cyberspace

 

The Cyberspace currently has about 1,500 million documents, ranging from
reference to trivial, from true e-books dealing with the major subjects of the
human knowledge to daily news, and even minute-to-minute information on human
interactions, as in the case of Newsgroups, Chats and Forums with “on the fly”
page generation. This information mass grows continuously at an exponential
rate, rather chaotically, as its production rate far exceeds the human capacity
for filtering, qualifying and classifying it.

 

To help the retrieval of information from the Cyberspace we make use of Search
Engines and Directories, which are unable to attain WUN, What Users (We the
Humans) Need. From all that information mass the search engines offer us
“summaries”, telling what kind of information we could get at each location of
the Cyberspace (the URL, Uniform Resource Locator). So for each URL we as users
obtain its summary. Those summaries are normally written by the Search Engines’
robots, which try to do their best extracting pieces of “intelligence” from
each Cyberspace location.

 

[Figure: sites in the Cyberspace as seen through a Search Engine]

In the figure we depict some sites within the darkness of the Cyberspace. We
may find anything from huge sites storing millions of documents, with hundreds
of sections, to tiny sites with a flat design storing a few pages. One Search
Engine, shown as a yellow crown, sends its robots to visit existing sites from
time to time, making a brief “robotic” summary of them. As we will see soon,
those brief reports are noisy and deceive the users (green circle). The Search
Engine assigns priorities, which act in turn as a measure of the site magnitude
(like the brightness of a star). As depicted, the priorities (the navy blue
dots) have nothing to do with the real magnitude of the site (depicted as the
white circle diameter). So the yellow crown is a severe distortion of the Web.
These priorities, defined for the keyword set of a given site, resemble the
“light” that illuminates it: a high priority means a powerful beam of light
reflecting off the site, highlighting it to the users’ sight.

 

The information currently provided by the search engines is as primitive as the
map of the sky we had one thousand years ago. The robots only detect some of
the keywords the site content has, equivalent to the chemical elements of the
celestial bodies, but tell us nothing about its structure, type of body and
magnitude. Today we may have for each celestial body the following data:

 

Among many others: diameter, density, the spectral distribution of its
constitutive elements, brightness, radiation, and albedo. For each of these
variables we have site equivalents that must be known in order to say that we
have a comprehensible Cyberspace map. For instance, we need to know something
that resembles magnitude, density, elements distribution and brightness.

 

As the bodies of this cultural and intellectual space (noosphere) are
intellectual creatures, we need an intellectual summary of each, what is known
as the abstract in essays and research papers. For instance, a site could be
camouflaged to appear attractive by emphasizing the importance of a given
element, let’s say “climate”, deceiving a robot into taking it for a
specialized climate site while in reality it has no climate content. The same
happens with information: Portals’ news, for instance, are presented as content
sites, which is true only for a specific type of information resource known as
“news”, with an extremely ephemeral life of hours. On the contrary, contents of
philosophy or mathematics are by far denser and heavier, with lives lasting
centuries on average. So we could distinguish all kinds of bodies, from fizzy
(news) to rocky (academic).

 

Another complementary source of information is the Databases hosted as
collateral of the Websites, as huge stores of organized and structured data.
The content and quality of these databases are normally a subjective “bona
fide” declaration made by the Website owners. So far, for the users, the
Cyberspace, particularly the Web Cyberspace, looks like a net of information
resources with some “Indexes” to facilitate their retrieval task. Those
robot-made indexes are too noisy, to the point of being practically useless.
Below we attach a well-known graphic sample of this uselessness.

 

[Figure: finding useful information while navigating along a searching program]

 

The figure depicts the finding of useful information (black spots) while
navigating along a searching program.

 

 

The main reasons for that uselessness

 

The main
reasons are, among many:

 

Increasing Websites Complexity: Robots cannot cope with the increasing
complexity of Websites. Robots are unable to properly evaluate sites like the
ones belonging to NASA, the World Trade Organization and the Library of
Congress, to mention only some institutional ones, concerning Aerospace,
Commerce and General Knowledge respectively, and cannot differentiate them from
trivial sites dealing with similar subjects.

 

Inability to cope with Human Stratagems: Robots are unable to detect and block
some subtle overselling stratagems used by the Website owners to position
themselves high in the Search Engines’ answers to users’ queries.

 

Linguistic Problems: Robots cannot cope with the increasing number and
complexity of the languages and jargons used in the net. They do their work
using rather naïve Thesauri, modified and enriched only via the Website owners’
declarations, not, as it should be, via the users’ feedback. As a consequence
of that bias the Search Engines speak the owners’ jargons instead of the users’
jargons.

 

In brief, the shadows of content that search engines offer to the users have
almost nothing to do with the real content of the Cyberspace, presenting a
distorted vision of it. The problem is the contagious spread of this
distortion, as long as the Website owners use that summary information as a
“bona fide” vision of their world. As a corollary, Internet today speaks the
Website owners’ jargons, pointing to a globally distorted vision of the real
markets!

 

 

Uselessness Measure

Using Search Engines

 

The measurement of the mismatch between WUN, What Users Need, and WSO, What
Search-Engines Offer, should be one of the first priorities of scientific
institutions interested in the Internet health. In any case, almost everybody
is well acquainted with this abysmal mismatch, and you may check it yourself
very easily by making random queries about any subject. We, as a private
research group, made our own investigations about that global mismatch, finding
the following figures:

Mismatch of WSO versus WUN is in the order of 6,000 to 1

Meaning that we, as ordinary users searching through the Cyberspace with the
help of outstanding search engines, have to browse on average through 6,000
summaries to find 1 potentially matching our needs.

 

Searching in Databases

 

Searching information stored in Databases proved to be a tough task as well.
Students of Systems Engineering in the last year of their degree at the
Instituto Tecnológico de Monterrey, Mexico, were invited to freely query a
commercial tested (2) Database, the mismatch being greater than 99.9%, that is,
they needed on average more than 100 queries to match a product/service stored
within the database. The main reason for the mismatch was not missing
information in the database but linguistic problems. That was a warning sign,
and we investigated some other commercial databases belonging to well-known B2B
sites, with similar results.

 

Note 2: By “tested” we mean that the content was checked before the trial. The
information existed, but the students were unable to find what they were
searching for because of linguistic problems.

 

The abysmal and chaotic mismatch enables forms of e-Commerce delinquency: when
you as a user face that finding, your first reaction could be to be suspicious
of the declared content of the database. The owners, on their side, could
allege that those mismatches are due to the linguistic ignorance of the users.
Unfortunately there is not yet anything like an official audit to detect
deceptions, but we’ve found many databases that are really empty, betting on
growth via users’ membership with cynical declarations such as:

 

Come and join us! We already have one million firms like yours!

Our Approach to this problem: FIRST

 

Our methodology started as an effort to solve some Internet drawbacks that
Website owners and users experienced, mainly within the dot COM domain.
Concerning that, our Systems Engineering background warned us that the crisis
was the "Internet answer as a system" to the wrong approaches of most Internet
newcomers. At large, Internet is a Net of computers and servers obeying the
rules of IT and Communications. What happened over the last two years within
the dot COM domain must have looked like a sort of science fiction to
traditional IT and Communications companies. But in the end the waters will
find their natural courses.

 

Along that reasoning we were confident that the solutions to some of the
Internet drawbacks should be found within classical systems engineering wisdom.
Within that wisdom were classical concepts like Information Retrieval Systems,
Selective Dissemination of Information and Expert Systems. Firms like IBM have
a long history on those milestones. As far as I can remember, KWIC (Keywords In
Context), SDI (Selective Dissemination of Information), more recently the Taper
Web semantic methodology, and the Deep Blue that beat Kasparov run along these
lines of research.

 

The first two were respectively a tool and a methodology to retrieve and to
disseminate information efficiently, taking into account the different
"jargons" of the Information Offer and of the Information Demand, the latter
belonging to the users’ realm. That was a subtle differentiation that defies
the passing of time. In fact, Internet is, among many other things, an open
World Market that tries to captivate as many people as possible, who speak
different tongues and different jargons.

 

A jargon is a practical subset of a language used to communicate among people,
for instance between buyers and sellers, but it takes many years to reach a
tacit agreement concerning definitions. For instance, the equivalent of "tires"
in Spanish could be neumáticos, gomas, cubiertas, ruedas or hules, the
agreement being to consider only neumáticos as the formal equivalent of "tires"
and the rest as synonyms.

 

 

Internet Drawbacks: Internet the realm of mismatch

 

The mismatch between offer and demand could be depicted as follows:

WOO ↔ WUN

Which stands for the match/mismatch between WOO, What Owners Offer, and WUN,
What Users Need. Internet will be commercially useful as long as WOO approaches
as closely as possible the always-changing WUN.

 


 

Let’s advance a little on the user side. We may differentiate among the
following user satisfaction levels:

 

WUN > WUW > WUS > WUG > WUL

Where:

WUW stands for What Users Want, generally restricted by the users’ expectations
about the full capability of the Offering;

WUS stands for What Users Search, restricted by the explicit/intuited site
limitations;

WUG stands for What Users Get;

WUL stands for What Users Lose in terms of potentially available information.

 


 

The solution in Theory

 

So, being submerged in the mismatch, we must learn as much as possible from it!
Information Theory tells us that mismatching delivers by far more information
about the "other side" than matching, in our case information about the
markets. By carefully studying the mismatch we could attain a convergent
solution to our mismatch problem as well.

 

In order to
accomplish that aim we need systems that learn from mismatching as much as
possible. With this idea in mind the whole problem could be stated as follows:

 

If our first offer to the market is WOO_1, we must find a convergent process
such that

 

WOO_1 - WUN_1 > WOO_2 - WUN_2 > ... > WOO_i - WUN_i > ...

 

Where the differences converge to zero, exponentially if possible. That is what
an Expert System does, provided we can find a reasonable first approach to the
market needs, WOO_1, the first iteration of a continuous evolutionary process.
We have been talking about learning, but we still have to define what we are
going to learn from: we are going to learn from the users ↔ Websites
interactions. Additionally, we must create a methodology and programs able to
interpret what the minus sign (-) means in those inequalities and how we step
from one iteration to the next.
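
To make the idea of a convergent process concrete, here is a small Python
sketch. It rests on an assumption that is ours and not the methodology’s: the
minus sign is read as a set difference, so that the mismatch at iteration i is
the number of keywords users need that the offer lacks plus the number of
keywords offered that nobody needs. The function names, the batch size and the
sample keywords are all hypothetical, and WUN is held fixed for simplicity even
though in practice it keeps changing.

```python
def mismatch(offer, need):
    """Assumed reading of 'WOO_i - WUN_i': size of the symmetric difference."""
    return len(offer ^ need)


def step(offer, need, batch=2):
    """One learning iteration: add a few missing keywords, drop a few unused ones."""
    missing = sorted(need - offer)[:batch]
    unused = sorted(offer - need)[:batch]
    return (offer | set(missing)) - set(unused)


need = {"wheels", "tires", "rims", "hubcaps", "alloy"}   # WUN (held fixed here)
offer = {"ergaston", "rims", "valves", "spokes"}         # WOO_1

history = [mismatch(offer, need)]
while history[-1] > 0:
    offer = step(offer, need)
    history.append(mismatch(offer, need))
print(history)
```

Each pass shrinks the mismatch, which is the behavior the inequalities ask for;
how fast it shrinks depends on how much is learned per iteration (the batch
parameter in this sketch).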


Are the Search Engines really useless?

 

No, definitely not! The search engines are extremely useful, and this will
remain so in the future. We are going to need search engines that cover the
whole Cyberspace, as a virtual summary of the Noosphere (3) or the World Sphere
of the Human Knowledge. These World Summaries Databases will be, as they are
now, the best Indexes of the Human Knowledge in Internet, not appropriate for
direct use by ordinary users but for Website Engineers and Architects.

 

Note 3: the sphere of human
consciousness and mental activity especially in regard to its influence on the
biosphere and in relation to evolution

 

For each major subject of the Human Knowledge we are going to need specialized
Websites with almost 100% proprietary content, where ordinary users looking for
subjects within a given major subject will be able to navigate in an “Only one
click YGWYW, You Get What You Want” scenario. That is, they will find exactly
what they are looking for in only one click of their mouse. To accomplish that,
the Content Engineers must provide for each major subject a satisfactory
initial information offer WOO_1. And we have to ask ourselves: where are we
going to get those initial content locations from? The answer is trivial: from
the search engines’ databases.

 

Once we implement this satisfactory initial offer, our FIRST methodology, via
its Expert System, will start to learn from mismatching, adjusting the site
offer to the user needs and querying the Search Engines’ databases only by
exception, when new content is needed. The exceptions are triggered by
unsatisfied user demand. We will see next how to create intelligent summaries
and how we could obtain a progressive independence from the Search Engines.

 

 

WOO_1: Architecture and meaning of the first Virtual Library

 

Virtual Library

 

To start a convergent process towards our real target we need a reasonably good
starting WOO_1. To accomplish that we designed a search methodology of three
steps, described in our section devoted to how to create i-URL’s Databases,
Intelligent URL’s Databases. To understand the global methodology it is only
necessary to accept that WOO_1 is equivalent to our first VL, Virtual Library,
that is, our first credible Index of links pointing to a set of basic e-books
and documents representing our best initial approach to a given major subject.

 

Let’s suppose we are dealing with a Veterinary Portal addressed to
Professionals. Our first VL will have from 1,000 to 1,500 links pointing to the
basic e-books (most of them authorities) and documents with the “necessary and
sufficient” information veterinary professionals will presumably need.

 

 

Volume of “sufficient” Virtual Libraries

 

The task of building WOO_1 is heavy and must be either performed or controlled
by experts in the given major subject. Our strong hypothesis is that the
practical human knowledge could always be packed in finite volumes of e-books
and documents, ranging from 500 to 3,000, dealing with the basic subjects of
the major subject. You may check this yourself by asking whether you can
imagine a physical specialized library with more than 3,000 different books!
Even within the academic context it is hard to find specialized Library sectors
with more than that.

 

Another fact to be taken into account in understanding the global methodology
is that once a given major subject is considered an established discipline, it
is classified following a hierarchy like a tree, with subjects, sub-subjects,
sub-sub-subjects and so on. That’s the way we humans communicate among
ourselves; that’s the Law for that particular discipline, the established path
to learn it and to be certified as a professional as well.

 

On the contrary, we humans as users of a given discipline are evolutionary
beings: we change, we improve, and sometimes we go beyond the boundaries of our
current discipline. Concerning VL’s, we are prone to query them not by subjects
but via keywords. So for each discipline, for each niche of the human
knowledge, we may define an Index and a set of keywords, both expressed in a
jargon. The index and the content of the corresponding e-books and documents
are expressed using the keywords of the set. If the index is analytic enough,
all the keywords of the set will be used at least once.

 

 

The Two Key Components for Retrieval

 

Then we have a rather rigid Index resembling the “Law” for a particular branch
of human knowledge, and a set of keywords. The keyword set is a living thing:
some keywords become either less or more important as time passes, and some of
them could even disappear from the users’ jargon. New keywords are created
along the way and, if they are used consistently, they must eventually be
incorporated. Finally, the evolution of the keywords must suggest changes in
the “old Law” as well. With all these elements in mind we may step then to the
core of our global methodology.

 

WOO_1 ↔ VL_1 ↔ [I,K]_1

 

Where [I,K]_1 is the first (Index, Keywords) pair, namely the initial Index
presented by the site with all available documents indexed by the initial
keyword set. The keyword set has in turn three components:

 

K = (Ko, Ks, Kr)

 

Where Ko stands for the “official” keywords, for instance the standard terms
used to describe a particular product or service, Ks stands for all the
possible and accepted synonyms, and Kr stands for the related keywords, defined
to help the users’ search.
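
As an illustration of how such a keyword set might be held in memory, the
following Python sketch stores each official keyword with its synonyms and
related keywords and resolves a user’s term to the official one. The class and
method names are hypothetical, and the related keyword "rims" is invented for
the example; the Spanish words for "tires" are the ones quoted earlier in this
document.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordEntry:
    official: str                                 # Ko: the "official" keyword
    synonyms: set = field(default_factory=set)    # Ks: accepted synonyms
    related: set = field(default_factory=set)     # Kr: related keywords


class KeywordSet:
    def __init__(self, entries):
        self.entries = {e.official: e for e in entries}
        # Map every synonym back to its official keyword.
        self.alias = {s: e.official for e in entries for s in e.synonyms}

    def resolve(self, term):
        """Return (official keyword, related keywords), or None if unknown."""
        term = term.lower()
        official = term if term in self.entries else self.alias.get(term)
        if official is None:
            return None   # a true miss: neither official nor a known synonym
        return official, sorted(self.entries[official].related)


k = KeywordSet([
    KeywordEntry("neumáticos", {"gomas", "cubiertas", "ruedas", "hules"}, {"rims"}),
])
hit = k.resolve("gomas")       # a user's jargon term, resolved via Ks
miss = k.resolve("ergaston")   # unknown term: a candidate for the mismatch log
print(hit, miss)
```

A lookup of this kind helps separate the two meanings of a negative answer: the
item is really absent, or it is present under a different name.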

 

Initially the jargon of the site will be the owners’ jargon, or the best
linguistic approximation made by the owners to interpret the market, and
progressively the site jargon will approach the real market jargon.

 

 

The Thesaurus

 

The Thesaurus is K plus the corresponding keyword definitions. For
Merriam-Webster a Thesaurus is:

“a : a book of words or of information about a particular field or set of
concepts; especially : a book of words and their synonyms b : a list of subject
headings or descriptors usually with a cross-reference system for use in the
organization of a collection of documents for reference and retrieval.”

 

 

 

 

The power of the right statistics

 

As each type of information has a given statistical life (4), it’s very
important to dose them wisely in order to keep the maintenance cost low. To
offer an optimal information portfolio we must know the users’ preferences in
detail. The classical statistics tell the Website owners how their users browse
their information resources in terms of “paths” statistics. What we offer goes
a little further: it tells why the users go to a particular path.

 

Let’s suppose that users frequently go to a given path because its title
suggests too many things. People go there and find nothing. How do we detect
this natural deception? On the other side, we may have solitary paths with
powerful and useful content for the users whose titles suggest nothing. We must
realize that users think in terms of keywords in their own jargons, so we must
orient our offer in the same direction. Our i-URL’s Databases are designed with
that in mind. Each document has an editorial brief telling what the site is and
what the site offers, using a proprietary taxonomy system, a set of keywords
and a set of i-Tags, Intelligent Tags, registering its whole life. For FIRST
each query deserves maximum attention, accounting for each type of user
reaction, namely: the user ignores the answer list; browses along the list;
clicks on at least one link; communicates with the Webmaster (enabled in each
query); etc. And, once a user has selected a summary, the system records
whether or not the user goes on to select the summarized document.
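
A possible shape for this per-query accounting is sketched below in Python. The
reaction categories follow the list in the paragraph above; everything else
(the names Reaction, QueryLog and no_click_rate, and the choice of that ratio
as a warning signal) is an assumption made for illustration only.

```python
from collections import Counter, defaultdict
from enum import Enum


class Reaction(Enum):
    IGNORED = "ignored the answer list"
    BROWSED = "browsed along the list"
    CLICKED = "clicked on at least one link"
    CONTACTED = "communicated with the Webmaster"


class QueryLog:
    def __init__(self):
        # keyword -> how many times each reaction followed a query for it
        self.by_keyword = defaultdict(Counter)

    def record(self, keyword, reaction):
        self.by_keyword[keyword][reaction] += 1

    def no_click_rate(self, keyword):
        """Share of queries where the user ignored or only browsed the list."""
        counts = self.by_keyword[keyword]
        total = sum(counts.values())
        if total == 0:
            return None
        return (counts[Reaction.IGNORED] + counts[Reaction.BROWSED]) / total


log = QueryLog()
for r in [Reaction.IGNORED, Reaction.BROWSED, Reaction.CLICKED, Reaction.IGNORED]:
    log.record("wheels", r)
rate = log.no_click_rate("wheels")
print(rate)
```

A keyword that is queried often but has a high no-click rate points to the
kind of deception described above: a promising title, or answer list, with
nothing behind it.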

 

 

Note 4: We were talking about “life”, and effectively each piece of information
has a given life, following exponential functions of the elementary type
e^(-λt), where 1/λ is the mean life of the information.
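
As a small worked example of this decay law, the Python sketch below computes
the fraction of a piece of information still “alive” after a time t, and the
corresponding half-life ln(2)/λ. The function names and the 12-hour mean life
chosen for a news item are hypothetical values used only to show the
arithmetic.

```python
import math


def remaining_fraction(t, mean_life):
    """Fraction still 'alive' after time t: e^(-λt) with λ = 1 / mean_life."""
    lam = 1.0 / mean_life
    return math.exp(-lam * t)


def half_life(mean_life):
    """Time after which half of the value is gone: ln(2) / λ."""
    return math.log(2) * mean_life


# Hypothetical news item with a mean life of 12 hours, observed after 24 hours.
news_after_a_day = remaining_fraction(t=24, mean_life=12)
print(round(news_after_a_day, 4), round(half_life(12), 2))
```

With these assumed numbers, roughly 13.5% of the item’s value survives one day,
which is why ephemeral content has to be dosed and renewed far more often than
reference content.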



4- i-URLs

 

 

 

i-URL’s Databases

 

i-URL stands for Intelligent Comment about a given Website located at the URL
address. Everybody knows what those comments look like when delivered by search
engines, and everybody knows how frequently useless they are!

 

 

The inefficiency of actual Search Engines and Directories

 

In fact you may spend hours looking for something useful, even being an expert
Web navigator. Some confidential estimates of this unfruitful and heavy task
speak of efficiencies below 1:5,000, meaning that to find what we are looking
for (the 1) we have to browse over at least 5,000 of those comments on average.
Concerning databases we talk about query efficiency, namely how many queries on
average we have to perform in order to find exactly what we are looking for.
The efficiency found in commercial databases (1) was extremely low: less than
0.1%!

 

That general inefficiency is one of the big problems Internet has to overcome
in the near future. We are not going to discuss here the reasons for this
inefficiency, but only to say that it is mainly due to the Website owners’ lack
of responsibility. Most people do not respect the netiquette, the Internet
etiquette rules: they lie, exaggerate their sites’ worth and try to deceive
navigators and robots, in fact trying to oversell themselves through their
Websites.

 

To make things worse, the search engines oversimplify the process, adding their
proprietary noise to the site owners’ noise, which results in a squared-noise
medium, that is, a noisy environment raised to the power of two.

 

 

First Step: Valuable Comments – Virtual Libraries

 

One first step is then to build databases with professional and “true”
comments. For a given major subject, for instance “women’s health”, the first
milestone should be to have a credible documents database concerning that
specific subject. In that case we have to ask ourselves: how many basic
documents must that database have to deserve the title of “Virtual Library”?
The exact answer is almost impossible to give, but we could talk about
boundaries instead. When we talk about a library we mean a collection of books,
and in this case we have to locate a sort of e-books, Websites resembling
classical books. Turning then to define boundaries, we may talk about a library
with a volume ranging from 2,000 to 4,000 books (2), and in our case of a
Virtual Library, the locations and clever summaries of an equal number of
Websites resembling e-books. That’s not too much indeed (3), talking now in
terms of Cyberspace!

 

We have then to ask ourselves the next two questions: May we find those kinds
of e-books in the Web? Is it possible to efficiently select that specific
library out of the Web? And the answer is yes in both cases for most of the
major subjects of our human activity.

 

Now our problem is reduced to efficiently locating those crucial Websites.
However, once they are located we have to face another problem: how to search
fast and efficiently within a Virtual Library of, let’s say, 3,000 Websites ↔
e-books, complemented by an Auxiliary Library of 10,000 to 100,000 technical
and scientific documents (Reviews, Journals, Proceedings, Communications, etc.).

 

Second Step: How to build Virtual Libraries

 

The problem could be stated as follows: how could we efficiently build an
efficient Virtual Library? Let’s face the second problem first: how to build
efficient Virtual Libraries? Let’s suppose we have to design a Cancer Virtual
Library (Altavista found 3,709,165 pages as the search outcome for “cancer” at
6:00 PM of day 03-07-01). Of course, in our Virtual Library we are not going to
search among more than 3 million Websites but only among 3,000, yet that number
is still big enough in terms of searching time.

 

Let’s imagine ourselves within a real library with those 3,000 books filling
the space of three walls from ceiling to floor. If we are interested in finding
all the literature available for a specific query, surely we are going to need
some indexing system to locate all the books dealing to some extent and depth
with the query, and to review them afterwards. Even with an adequate indexing
system and a file of “intelligent summaries” of all the books, we will spend a
couple of hours selecting the set of books supposedly covering the whole
spectrum of the query.

 

Fine! We are getting to the point of discovering a better methodology to design
an efficient e-library:

 

  1. Select the basic 1,000 to 3,000 Websites ↔ e-books;
  2. Design an indexing system with an intelligent summary (i-Comment) of each
     e-book, depicting the main subjects dealt with in it.
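
The indexing system of step 2 could, for instance, take the form of an inverted
index built from the keywords of each summary. The Python sketch below shows
the idea under stated assumptions: the function names, the example.org URLs and
the keyword lists are all invented for illustration, and a real i-Comment would
carry much more than a list of keywords.

```python
from collections import defaultdict


def build_index(comments):
    """Map each keyword to the set of URLs whose summary mentions it."""
    index = defaultdict(set)
    for url, keywords in comments.items():
        for kw in keywords:
            index[kw.lower()].add(url)
    return index


def search(index, *keywords):
    """Return the URLs whose summaries contain every queried keyword."""
    hits = [index.get(kw.lower(), set()) for kw in keywords]
    return sorted(set.intersection(*hits)) if hits else []


# Hypothetical i-Comments: URL -> keywords taken from its summary.
comments = {
    "http://example.org/book-a": ["cancer", "diagnosis", "screening"],
    "http://example.org/book-b": ["cancer", "treatment"],
    "http://example.org/book-c": ["nutrition", "screening"],
}
index = build_index(comments)
result = search(index, "cancer", "screening")
print(result)
```

With an index of this kind a query over a few thousand summaries is a handful
of set intersections rather than a scan of every e-book.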

 

 

The Thesaurus concept

Keywords versus subjects

 

The summaries must be true and objective, and must cover all the matters dealt
with in their corresponding e-books. To be true and objective we only need
adequately trained professionals. To cover all the matters, the trained
professionals must browse the whole e-book and know what “matters” mean.

 

In that interpretation we introduce some subtle details derived from our
searching experience. People really look for “keywords”, that is, meaningful
words and sequences of words that trigger our memory and our awareness. Many of
these keywords become knowledge items within a hierarchy of concepts for a
given major subject. The importance of a keyword for us depends on the
circumstances; it does not derive from its hierarchical importance within a
given major subject.

 

When an author makes the index of his/her book, he/she thinks in terms of
rationality and as a member of the society respecting the established order.
The index resembles a conventional, sequential, step-by-step recommended
teaching and learning procedure. On the contrary, whoever is searching makes
queries looking for what he/she needs as a function of the circumstances. The
index resembles the Law.

 

So the Thesaurus that collects all the possible keywords of a given discipline
is not a hierarchical logical “tree”. Each keyword is generally associated with
many others within the Thesaurus as a transient closed system, and sometimes a
bunch of them could be matched to specific item/s of a logical tree structure.
The Thesaurus is the maximum possible order within the chaos of the
circumstances.

 

The logical trees, all the indexes we could imagine, are only “statistical” and
conventional rules at a given moment of the knowledge. The knowledge, along its
evolutionary process, takes the form of a subjective Thesaurus, because each
person has his/her own Thesaurus for each major subject of his/her interest.

How we combine the virtues of Thesaurus and Indexes

The Law and the Circumstances

 

Notwithstanding, we could make both concepts work together for the sake of
searching efficiency. The logical structures are good as starting procedures,
in the learning stages. Besides that, as the trees come out of statistics, the
use of a given Thesaurus could give rise to new and more up-to-date logical
tree indexes via man-machine interaction along an evolutionary process. The
indexes are too rigid and easily become obsolete.

 

Now we can enter into the core of our new methodology to build Intelligent
Virtual Libraries, the ones we titled i-URL’s in the sense that each URL hosts
a basic e-book, a crucial document, a hub, an authority.

 

 

e-Thesaurus: a collection of all known keywords (at a given moment in a given
place, for instance Today in the Website www.xyz.com), eminently a subjective
cyber concept.

 

i-URL’s: i-Comments of basic Websites (e-books), with the significant keywords
dealt with within the Website plus some aggregate of i-tags, intelligent tags
defining its morphology, the properties of the Web space body.

 

i-URL’s Database: the database of all the i-Comments of all the Websites
(e-books) that define the Virtual Library (at a given moment in a given place,
for instance today on the Website www.xyz.com).

 

Virtual Library Index: the indicative index of the content of the i-URL’s
Database. It is the index that appears as the “by default” menu to guide an
ordered browsing of the Virtual Library. As a matter of knowledge it is only
valid for the people who interact with the Virtual Library as a
market-as-a-whole. This index is not meant to guide searching but learning. It
should be updated from time to time as suggested by the i-URL Virtual Library
Integrator of the Expert System (4).

What an i-URL looks like

 

 

[Figure: a reference site (i-URL), the Users’ region and the Map of the Human Knowledge]

 

 

In the figure above, the yellow dot represents a reference site for a given
Major Subject of the Human Knowledge, let’s say Personal Financing. The dark
green dot within the green Users’ region represents a set of users interested
in that Major Subject, let’s say the target market. The gray crown represents
the Map of the Human Knowledge, which does not exist yet. A group of people
interested in capturing this potential market decides to build a reference site
about it, let’s say a Personal Financing Portal. So, first of all, they need
something equivalent to the Personal Financing sector of a Virtual Library of
the e-books or reference sites that currently exist on the Internet. To
accomplish that they proceed along the steps described at the beginning of this
document.

 

The i-URL Septuplet: for each reference site they create an i-URL as an
information septuplet, as follows:

 

1. i-URL: http://www.major_subject037.com, that is, an e-book dealing with
subject 037 of Personal Financing, for instance “Financial Resources”. A human
expert, with a major in Finance where possible and specially trained to
evaluate any type of site pattern, reviews this site.

See “Site quanta” and our chapter about Taxonomy of Websites.

 

2. Subject – Logical Tree: in this case “Financial Resources”, one of the
branches of the Logical Tree initially loaded into the system as a first
approach to How a Personal Financing Portal Should Be.

 

3. Strategic Information: all kinds of “coordinates” of the site and of the
evaluation done: dates, site origins (for example country and the organizations
it belongs to), evaluator references (the person doing the evaluation),
languages, jargons, etc.

4. Site quanta: all data about the structure of the site: type, importance,
“size”, “depth”, “width”, design features, architecture features, etc.

See our section about Taxonomy of Websites.

 

5. Human Comment: the core of the site evaluation, written by an expert after
reviewing the site. It may contain references (links) to other sites, shown as
yellow lines, and must be expressed as much as possible using keywords (shown
as green dots) of the Thesaurus at hand.

 

6. Keywords: a set of the most significant keywords that depict the site from
the users’ point of view, left to the personal criteria of the evaluators. Some
keywords may not even be present in the site, but the evaluator considers that
the site deals with them anyway. The system is engineered to count how many
times the i-URL was referenced by each specific keyword.

 

7. Statistic Counters: we have defined three types of counters: presence
counters, “a priori” interest counters and confirmed interest counters.
Presence counters count how many times the i-URL was queried by the system in
order to satisfy potential users’ needs. A priori interest counters count how
many times this specific i-URL was requested in full, and confirmed interest
counters count how many times users requested the site itself in full.
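
The septuplet above maps naturally onto a record type. The following Python
sketch is only an illustration under our own assumptions: the class name
IUrlRecord, the field names and the three record_* methods are hypothetical,
and keeping the counters per keyword is just one possible reading of the
counting described in items 6 and 7.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class IUrlRecord:
    """One i-URL septuplet; all field names are illustrative, not from FIRST."""
    url: str                # 1. address of the reviewed e-book
    subject: str            # 2. branch of the Logical Tree
    strategic_info: dict    # 3. dates, origin, evaluator, languages, ...
    site_quanta: dict       # 4. type, size, depth, width, ...
    human_comment: str      # 5. expert evaluation written with Thesaurus keywords
    keywords: set           # 6. significant keywords chosen by the evaluator
    # 7. statistic counters, kept per keyword (an assumption of this sketch)
    presence: Counter = field(default_factory=Counter)
    a_priori_interest: Counter = field(default_factory=Counter)
    confirmed_interest: Counter = field(default_factory=Counter)

    def record_listed(self, keyword: str) -> None:
        """The i-URL appeared in the answer list for this keyword."""
        self.presence[keyword] += 1

    def record_iurl_requested(self, keyword: str) -> None:
        """The user asked to see the full i-URL (a priori interest)."""
        self.a_priori_interest[keyword] += 1

    def record_site_requested(self, keyword: str) -> None:
        """The user went on to request the site itself (confirmed interest)."""
        self.confirmed_interest[keyword] += 1


record = IUrlRecord(
    url="http://www.major_subject037.com",
    subject="Financial Resources",
    strategic_info={"language": "en"},
    site_quanta={"type": "authority"},
    human_comment="Overview of financial resources for households.",
    keywords={"loans", "savings"},
)
record.record_listed("loans")
record.record_listed("loans")
record.record_iurl_requested("loans")
record.record_site_requested("loans")
print(record.presence["loans"], record.confirmed_interest["loans"])
```

Only the record itself is stored; the e-book stays at its own URL.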

 

 

The chicken-and-egg problem

How do we get the Initial Virtual Library

 

Fine! We have defined what an efficient Virtual Library about a specific major
subject means. It is easy to conceive that this system works, but a problem
still remains:

 

As we build Expert Systems that learn from the users’ man-machine interactions,
our main problem is how to get our first i-URL’s Database, that is, how to
locate the first 3,000 e-books. Once this problem is solved, the Expert System
will improve and tune up the Virtual Library along an evolutionary path.

 

This is a typical chicken-and-egg problem: which comes first, an initial
Thesaurus or an initial Subjects Index? As each one brings about and positively
feeds back the other, it does not matter how we start. For instance, we may
start with an initial index provided by some expert as our seed. From this
initial index we may select the first keywords to start our searching process.
Alternatively, we may start with an arbitrary collection of keywords as our
seed, also provided by an expert. In either case we must behave like
headhunters of eggs trying to catch our first e-book, let’s say the first
full-content Website authority concerning our major subject.

 

This first candidate to become an e-book will provide us with either a subject
index or a tool to improve our initial Thesaurus. Surely within this Website we
will find more reference links that will widen our panorama, driving us to find
better Websites, complementary sites, or both. This is a sort of scientific
artisan methodology, well suited to deepening our knowledge about something
with no precise rules but with general criteria. We will see that for all these
tasks we design specialized Intelligent Agents that act as general utilities to
make the process efficient.

 

One criterion is to try to cover all the items of the best index we have at
hand at any moment. That is, we investigate each milestone e-book as much as we
can until the dominant items it deals with are fully covered, and then we
continue looking for more e-books that cover the remaining items of the index
until full coverage has been attained.
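
The coverage criterion can be sketched in a few lines of Python. This is only
an illustration: the function name, the example URLs and the greedy choice of
the e-book covering the most open items are our own assumptions, since the text
only states the criterion and not a selection rule.

```python
def select_until_covered(index_items, candidates):
    """Pick e-books until every index item is covered or no candidate helps.

    index_items: iterable of index entries to cover.
    candidates: dict mapping an e-book URL to the set of items it deals with.
    Returns (chosen URLs in order, items still uncovered).
    """
    remaining = set(index_items)
    chosen = []
    while remaining and candidates:
        # Greedy step (our assumption): take the e-book covering most open items.
        best = max(candidates, key=lambda url: len(candidates[url] & remaining))
        gained = candidates[best] & remaining
        if not gained:
            break  # nothing left in the pool covers the remaining items
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


index = ["loans", "savings", "insurance", "taxes"]
pool = {
    "http://example.org/a": {"loans", "savings"},
    "http://example.org/b": {"savings"},
    "http://example.org/c": {"insurance"},
}
chosen, missing = select_until_covered(index, pool)
print(chosen, missing)
```

The items left in `missing` are the ones for which the search for more e-books
has to continue.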

 

To accomplish this task we need search experts with a high cultural level,
trained to switch quickly from intuition to a rational context and vice versa
and, within rational tracks, to switch quickly between deductive and inductive
processes as well.

 

First Round of Integration: Once we have built a Thesaurus covering all the
items of the first index (this index has probably evolved along the search with
new items and amendments), we must begin the integration of the basic e-books,
pivoting on the milestone e-books complemented with new searches using the most
“popular” keywords (the ones that have the most milestones indexed). The
“exploration” of the milestones’ neighborhood is accomplished at high speed via
pure intuition, along a process we call the “first round”. To select Websites
in this first round we follow a “new rich” criterion: if the Website looks nice
for our purpose we select it. To give some facts and figures, we are talking
about 30 to 40 milestones, mainly authorities and hubs, and from each milestone
we select from 100 to 200 Websites, totaling from 6,000 to 8,000 Websites as
the outcome of the first round. This first round works over a first raw
selection made via infobots that query and gather Websites taken from search
engines, so the human experts really work over a rather small universe.

 

Second Round of Integration: Once we have built this “Redundant” Virtual
Library we must tune it up, keeping the 1,500 to 2,000 Websites best suited to
our purposes; these will be the e-books collection of our initial Virtual
Library. To select them we use a logical template that screens the most
important Website attributes, such as type, traffic, design, Internet niche,
universality, bandwidth, depth, etc.

See in our section about Hints how we check the database completeness and
redundancy via Intelligent Agents.
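
One way to picture the second-round screening is a weighted score per Website
followed by a cut at the desired library size. The sketch below is hypothetical:
the attribute weights, the 0-to-1 attribute scale and the function names are
our own assumptions, and the actual template may well be a set of logical
filters rather than a numeric score.

```python
import heapq

# Hypothetical template: attribute -> weight. Attribute values assumed in [0, 1].
TEMPLATE = {"traffic": 0.4, "design": 0.2, "universality": 0.2, "depth": 0.2}


def template_score(attributes):
    """Weighted sum of the screened attributes; missing ones count as 0."""
    return sum(w * attributes.get(name, 0.0) for name, w in TEMPLATE.items())


def second_round(sites, keep):
    """Return the `keep` best-scoring sites from the redundant library."""
    return heapq.nlargest(keep, sites, key=lambda s: template_score(s["attributes"]))


redundant_library = [
    {"url": "http://example.org/a", "attributes": {"traffic": 0.9, "design": 0.5}},
    {"url": "http://example.org/b", "attributes": {"traffic": 0.2, "depth": 0.9}},
    {"url": "http://example.org/c", "attributes": {"traffic": 0.1}},
]
selected = second_round(redundant_library, keep=2)
print([s["url"] for s in selected])
```

In a real run `keep` would be somewhere between 1,500 and 2,000, as stated
above.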

 

With this template we proceed to build our i-URL’s, that is, the intelligent
summaries of the e-books of our initial Virtual Library. We must emphasize here
that the e-books remain at their original URL locations. The only data we
record in our i-URL’s Database are the i-URL’s.

 

 

Advantages of our i-Virtual Libraries

Versus non-intelligent Virtual Libraries and versus classical Search Engines

 

This is a rather sophisticated and heavy “only once” task, but the advantages
are enormous compared to the use of classical search engines (5):

 

  • We build proprietary Virtual Libraries versus copy-and-paste
    non-intelligent Virtual Libraries;
  • With a probability near 100%, and absolutely under control, general users
    are going to find what they want, making true our assertion YGWYW, You Get
    What You Want;
  • We build a system that evolves positively as time passes, with noise
    tending to zero, auto-generating a scenario of YGWYN, You Get What You
    Need;
  • Our Virtual Libraries generate intelligence, mainly from user
    interactivity. WYN, What You Need, and WYW, What You Want, are continuously
    matched against WWO, What We Offer, providing the site owners with
    marketing intelligence;
  • Universality: our i-concept is extensible to all types of documents. With
    an Expert System of this nature we may homogenize Web URL’s, proprietary
    documents, man-machine interactions (queries, chats, forums, e-mail,
    mailing lists, newsgroups, personal and commercial transactions) and news.

 

 

 

 

 



Notes

 

Note 1: Along this line we made a joint research study with the Mexican
university Instituto Tecnológico de Monterrey analyzing the efficiency of
e-Commerce Databases, with the following astonishing results: a group of
students of the Systems Engineering career queried an Industrial Database with
200,000 Latin American firms. They were trained in how to search by keywords,
for instance by product, and the positive matches were lower than 0.1%!

 

Note 2: We are talking about basic books. Of course this information base must
be complemented with thousands of technical and scientific publications as
well.

 

Note 3: Considering only Web documents, we are talking about one and a half
billion documents, and we have to consider the other Internet resources as
well, such as newsgroups and the millions of “pages on the fly” generated in
chats and forums.

 

Note 4: All our Expert Systems work under the control of a Virtual Integrator
that integrates the Expert System with all kinds of system extensions such as
front-ends, back-ends, Intranet, Extranet, etc.

 

Note 5: To remedy the inefficiency of search engines some sites decide to build
proprietary content, that is, a collection of critical documents trying to
answer a reasonable FAQ. This is extremely useful and necessary and we
recommend it, but it is not enough. Indeed, the sum of real knowledge dispersed
in Cyberspace is so big for any major subject that any particular effort is
like a drop of water in the ocean. Of course we may strategically design our
“drop of water” in order to demonstrate that we are alive as referents and not
mere passive Internet mediators.

 

 



5- Some Program Analysis Considerations

1- Thesaurus evolution, Keywords Popularity and something more

 

 

 

 

User Track Mechanics

 

 

[Figure: a typical user track]

 

 

 

 

The figure above depicts a typical
user track. We may define in each track the following significant events:

 

· Enter: a new user enters a query, asking for a given keyword within a given
subject (optional).

· c: means a positive HKM database answer to a query; in the figure k1,
k3,…,kn have positive answers whereas k2 does not.

· C: means that the user decides to retrieve one of the basic documents of the
Web space catalogued as belonging to the Universal HK Virtual Library. This is
a crucial instance of the tracking: the user abandons the site to dive into the
outside document.

· Error: another crucial instance: the corresponding keyword (k2 in the figure)
leads to an error; presumably the referenced site is no longer hosted at that
URL address.

· Leave: the user leaves the system, but could still make a…..

· Re-entry: the user re-enters the system, which is very important from the
point of view of HKM usage, either for another keyword string within the same
subject or for a different subject.

· Subject: the user is emphatically invited to report a subject, apart from
keywords; however, he/she is not obliged to provide it.

· r: another crucial instance: the user decides (statistically speaking) either
to return to the system or to continue browsing the Web space by his/her own
means.

· Main Subjects’ Tutorials: eventually, FIRST offers users a set of tutorials
where the main subjects of each Major Subject of the HKM are thoroughly
explained.

 

 

Warning: We are talking about existing keywords, that is, the users query the
HKM by existing keywords. Perhaps the most crucial event occurs whenever a
nonexistent keyword is queried, provided it is correctly spelled. FIRST must
investigate several things in this case: a) test whether the keyword is absent
from a specific main subject but present in the HKM database for the queried
Major Subject; b) test whether the keyword is absent from the HKM database for
the queried Major Subject but could be present in some others; c) test whether
it is absolutely outside the HKM.
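
The three tests can be read as a small classification routine. The Python
sketch below assumes, purely for illustration, that the HKM keywords are held
in a nested dictionary (Major Subject, then main subject, then a set of
keywords); the function name and the return codes are hypothetical.

```python
def classify_keyword(hkm, keyword, major, main=None):
    """Classify a queried keyword against the HKM.

    hkm: {major_subject: {main_subject: set_of_keywords}} (assumed layout).
    Returns "known", or "a", "b", "c" matching the three tests in the text.
    """
    subjects = hkm.get(major, {})
    in_major = any(keyword in kws for kws in subjects.values())
    if main is None:
        if in_major:
            return "known"            # no main subject given, keyword found
    else:
        if keyword in subjects.get(main, set()):
            return "known"
        if in_major:
            return "a"                # other main subject, same Major Subject
    for other_major, other_subjects in hkm.items():
        if other_major != major and any(
            keyword in kws for kws in other_subjects.values()
        ):
            return "b"                # present under a different Major Subject
    return "c"                        # absolutely outside the HKM


hkm = {
    "Personal Financing": {
        "Financial Resources": {"loans", "savings"},
        "Insurance": {"premium"},
    },
    "Health": {"Nutrition": {"vitamins"}},
}
print(classify_keyword(hkm, "premium", "Personal Financing", "Financial Resources"))
```

Cases "a", "b" and "c" are the ones that would be reported for editorial
review.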

 

See below the different groups of keywords. FIRST must analyze the existence or
non-existence of unrecognized keywords for all those groups. The FIRST Chief
Editor must carefully review these cases once they have been properly reported.

 

 

Tracking “zoom”

 

We could improve our insight by looking deeper into each event, namely:

 

Over c: Once a couple [keyword, subject] is keyed in and checked against all
the programmed types of consistency, FIRST answers with a hierarchical list of
either the selected i-URLs or their corresponding briefs. The latter procedure
invites the user to mark the most appropriate one with a click. The user can
even navigate within the same list, that is, within the same couple.

 

Over error: Occasionally users may get a wrong URL address (however, this kind
of error must be avoided as much as possible). The system must make the most of
these opportunities by offering the user some alternatives: similar URL’s (once
it has checked that the link works properly!) and/or advice to consult related
tutorials within the system. Independently, these events must trigger one of
the searching intelligent agents, either to locate where the URL could have
migrated (the most probable condition) or, in the extreme case, to look for new
documents. The potential documents to replace the lost one must be sent to the
FIRST Chief Editor, who finally approves or rejects the new document once the
corresponding i-URL is edited. Once finally approved, the announcement of the
new document must be emailed to the users who previously authorized the system
to notify them.

 

Note: An internal clock measures the session duration for each user: once the
user has gone out to review something, the system waits a reasonable time
before it stops treating a returning user as working within the same session.
Users may change subjects within one session.
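
The session clock can be sketched as a grouping of one user’s event timestamps.
This is a minimal illustration only: the function name is hypothetical and the
30-minute timeout is a placeholder for the unspecified “reasonable time”.

```python
def split_sessions(event_times, timeout_seconds=1800):
    """Group one user's event timestamps (in seconds) into sessions.

    A gap longer than `timeout_seconds` between consecutive events starts a
    new session. The 30-minute default is a placeholder, not a FIRST value.
    """
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= timeout_seconds:
            sessions[-1].append(t)   # user came back in time: same session
        else:
            sessions.append([t])     # first event, or the wait expired
    return sessions


sessions = split_sessions([0, 100, 5000, 5100, 9000])
print(len(sessions))
```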

 

 

Possible strings are:

[k1, c, C, k2, k3, c, k4, c, C, leave] subject i

[k1, k2, c, c, c, k3, c, C, c, C, k4, k5, leave] subject j

 

In the first string, for subject i, the user decided to click on a URL after
reviewing its i-URL, then returned and searched for k2 and k3 but just peeped,
without being interested enough to read the list of i-URL’s provided, then
tried k4, clicked on another URL and finally left the system.

 

In the second string, for subject j, the user sweeps over k1 but extensively
reviews the k2 list, makes two more searches with k3, then sweeps over k4 and
k5 and finally leaves the system.

 

As our purpose is to keep only keyword strings, those strings can be
summarized as follows:

 

[k1, k2, k3, k4] subject i

 

[k1, k2, k3, k4, k5] subject j

 

Here the colors go from a cold one (blue) to a very hot and active one (red).
For each session and for each subject the keyword strings are saved for
statistical purposes. Statistics are computed over the strings both as they are
and in alphabetical order.
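
The reduction of a user track to its keyword string, and the two kinds of
statistics, can be sketched as follows. The event marker names follow the list
of events above; the function names and the use of counters are assumptions of
this sketch, which also assumes that no keyword is spelled like an event
marker.

```python
from collections import Counter

# Tokens of a track that are events rather than keywords.
EVENT_MARKERS = {"enter", "c", "C", "error", "leave", "re-entry", "r"}


def keyword_string(track):
    """Keep only the keywords of a track, in the order they were queried."""
    return tuple(token for token in track if token not in EVENT_MARKERS)


def string_statistics(sessions):
    """Count keyword strings per subject, as they are and alphabetically sorted."""
    as_is = Counter()
    alphabetical = Counter()
    for subject, track in sessions:
        keywords = keyword_string(track)
        as_is[(subject, keywords)] += 1
        alphabetical[(subject, tuple(sorted(keywords)))] += 1
    return as_is, alphabetical


track_i = ["k1", "c", "C", "k2", "k3", "c", "k4", "c", "C", "leave"]
track_j = ["k1", "k2", "c", "c", "c", "k3", "c", "C", "c", "C", "k4", "k5", "leave"]
as_is, alphabetical = string_statistics([("i", track_i), ("j", track_j)])
print(keyword_string(track_i))
print(keyword_string(track_j))
```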

 

 

Thesaurus Evolution Mechanics

 

All keyword and i-URL traffic is statistically analyzed from time to time.
Let’s see how the Thesaurus evolves. For each keyword we have at each moment
two variables: its quantitative presence within the Logical Tree structure and
its popularity. We may define the following groups within the Thesaurus:

 

a) Regular keywords

b) Synonyms of specific keywords

c) Related keywords to specific keywords

d) Antonyms of specific keywords

 

a versus b and their respective popularities tell us how well designed the
synonymies are

a versus c and their respective popularities tell us about some semantic
irregularities

a versus d and their respective popularities tell us about searching patterns
that must be deeply investigated

 

For instance, if in politics we detect a high popularity of peace and
conversely a low popularity of war, it means that people are changing their
attitude concerning the crucial problem of peace versus war. We may also
investigate all the other possible combinations: b versus c, b versus d, and c
versus d.
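
A comparison such as peace versus war can be made concrete by tracking the
share of one keyword within the pair over two periods. The sketch below is
hypothetical throughout: the query counts are invented for illustration and the
share measure is our own choice, not a statistic prescribed by FIRST.

```python
def popularity_share(counts, keyword, counterpart):
    """Share of queries for `keyword` within the pair, or None if no queries."""
    keyword_count = counts.get(keyword, 0)
    total = keyword_count + counts.get(counterpart, 0)
    return keyword_count / total if total else None


# Invented query counts for two consecutive observation periods.
last_period = {"peace": 40, "war": 60}
this_period = {"peace": 90, "war": 30}

before = popularity_share(last_period, "peace", "war")
after = popularity_share(this_period, "peace", "war")
shift = after - before
print(before, after, shift > 0)
```

A positive shift suggests that interest is moving toward the first keyword of
the pair; the same function applies to synonym and related-keyword pairs.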



Analysis of some other types of user interactions

 

We may save
all