Link Popularity and the Bowtie Theory

  • Jun. 1, 2000

Source: Planet Ocean Communications

The Bowtie Theory Explains Link Popularity
...IWWWC-9 paper sheds light on indexing the most popular pages on the web

The Ninth International World Wide Web Conference was held in Amsterdam May 15 to 19. This year several interesting papers were published -- some of which were written by research divisions at Compaq, IBM and AltaVista. One of the most important papers was titled "Graph Structure In The Web". Most notable about this particular paper is the light it sheds on systems likely being tested at AltaVista.

Let's start by sharing with you our belief that this research involves a spider called "Mercator" -- if you've been checking your server logs, you've likely noticed it visiting your site sometime in the past and you've probably even wondered about it.

The primary focus of the paper is what is commonly known as "Link Popularity" -- an important factor which influences your rankings on many major search engines, including AltaVista. The research studies link structure: how web pages link to each other. The study showed that the web, when viewed as a network of links, looks like a Bow-Tie...

[Figure: bow-tie diagram of the web's link structure]
Image from http://www9.org/w9cdrom/160/160.html

What you see in the middle is called the "Strongly Connected Component" or SCC. This area of the link mapping project comprises approximately 56 million pages. The sites inside the SCC have links to and from other sites in the SCC.

The pages within the SCC are characterized as having links both forward and backward that "explode" into approximately 100 million other distinct URLs. Pages that didn't show this explosive forward and backward linking typically reached fewer than 90 other pages.
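To make the "explosion" idea concrete, here is a minimal sketch of following links forward and backward from a starting page. The six-page toy web and the page names are our own assumptions for illustration, not data from the paper, and this is not the crawler code used in the research.

```python
from collections import deque

def reach(graph, start):
    """Return the set of pages reachable from start by following links in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Build the backlink graph: for every link a -> b, record b -> a."""
    back = {}
    for src, targets in graph.items():
        for dst in targets:
            back.setdefault(dst, []).append(src)
    return back

# Hypothetical toy web: a, b and c link in a cycle (a tiny "core"),
# d is linked from the core, and x and y sit in a small isolated pocket.
links = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
    "x": ["y"],
    "y": [],
}

forward_from_a = reach(links, "a")            # follow forward links
backward_from_a = reach(reverse(links), "a")  # follow backlinks
forward_from_x = reach(links, "x")            # a search that dies out quickly

print(sorted(forward_from_a), sorted(backward_from_a), sorted(forward_from_x))
```

On the real web the difference is far more dramatic: a search from a well connected page keeps finding new pages, while a search from a poorly connected page stops after a handful.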

Our understanding is that sites in the SCC, or "core", must exhibit this large link behavior or they fall outside of the core. Perhaps this is one of the indications of a "not popular" site. You could easily conclude that sites like Yahoo, CNN, CNET and similar sites would be very close to the center of the core, with less well linked sites around its edges.

The "bow" section labeled as "in" on the illustration describes pages outside the SCC core that link to the SCC pages. However, these pages display a shortage of links -- i.e. they lack that "explosion" of links as described above. Conversely, pages within the "out" section are those that are linked from pages in the SCC but lack the explosion of links back to the SCC core.

The "tendrils" are web pages that link to or from pages within the "in" or "out" sections but lack significant paths to or from pages within the SCC. Pages within these sections typically have fewer than 54 links in or 20 links out before the spider stops finding new links.
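The four regions described above can be sketched in a few lines. This is a simplified illustration under our own assumptions: the page names are hypothetical, we pick one page we already know is in the core, and we lump tendrils together with fully disconnected pages as "OTHER". It is not the method used by the paper's authors.

```python
from collections import deque

def reach(graph, start):
    """Return the set of pages reachable from start by following links in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def classify(links, core_page):
    """Label every page SCC, IN, OUT or OTHER relative to a known core page."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Backlink graph: for every link a -> b, record b -> a.
    back = {}
    for src, targets in links.items():
        for dst in targets:
            back.setdefault(dst, []).append(src)
    fwd = reach(links, core_page)  # pages the core can reach
    bwd = reach(back, core_page)   # pages that can reach the core
    regions = {}
    for page in pages:
        if page in fwd and page in bwd:
            regions[page] = "SCC"
        elif page in bwd:
            regions[page] = "IN"
        elif page in fwd:
            regions[page] = "OUT"
        else:
            regions[page] = "OTHER"  # tendrils and disconnected pages
    return regions

# Hypothetical toy web for illustration only.
links = {
    "core1": ["core2"],
    "core2": ["core1", "out1"],
    "in1": ["core1", "tendril1"],
    "out1": [],
    "tendril1": [],
    "island1": ["island2"],
    "island2": [],
}

regions = classify(links, "core1")
for page in sorted(regions):
    print(page, regions[page])
```

Note that "tendril1" is linked from an "in" page yet has no path to or from the core, which is exactly the tendril behavior described above.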

We believe this pattern helps explain the link popularity approach at AltaVista and perhaps to some extent Google. Link connectivity is one indicator that many of the major engines are now, or soon will be, using to determine a page's usefulness in the index -- and thus determine how a page scores. With such a system, ideally, your pages should be found as close as possible to the center of the SCC in order to get spidered completely. Being found toward the center of the SCC would also, in theory, validate a high degree of link popularity and thus elevate the probability of achieving top relevancy scores.

If you take the time to review the research paper you'll find the engines appear to be on the verge of classifying site characteristics based on how they are linked. Currently, however, being outside the SCC does not yet seem to be significantly affecting relevancy scores on most search engines -- but it may in the future and especially at AltaVista. In any case, that is our belief based on the research and testing currently being done coupled with the recent behavior patterns of the engines themselves.

So what does this mean to you?

Well, for starters it appears to be a good idea to start creating links from your site to other topic-relevant sites whenever possible -- especially to sites in the SCC. One example: place a link to ODP (the Open Directory Project) somewhere on your pages. This would enable a spider to find ODP (an SCC site) when following forward links from your site. It's also a good idea to obtain links from sites in the SCC to pages within your site. Now is the time to seek out new links to your pages from other popular sites that are relevant. Not only will that help with additional traffic from these popular SCC sites, it should also favorably influence the soon-to-come link popularity systems.
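Here is a small sketch of why both kinds of links matter. The three-page web below is hypothetical ("odp", "news" and "mysite" are stand-in names, and "odp" is simply assumed to be a core page); it shows how a page's position in the bow-tie changes as links are added, not how any search engine actually scores pages.

```python
from collections import deque

def reach(graph, start):
    """Return the set of pages reachable from start by following links in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def region(links, page, core_page):
    """Place one page in the bow-tie relative to a page assumed to be in the core."""
    can_reach_core = core_page in reach(links, page)
    reachable_from_core = page in reach(links, core_page)
    if can_reach_core and reachable_from_core:
        return "SCC"
    if can_reach_core:
        return "IN"
    if reachable_from_core:
        return "OUT"
    return "OTHER"

# Hypothetical starting point: two core sites link to each other,
# and your site has no links in either direction.
web = {"odp": ["news"], "news": ["odp"], "mysite": []}
before = region(web, "mysite", "odp")

# Step 1: you add a link from your site to ODP.
with_outlink = dict(web, mysite=["odp"])
after_outlink = region(with_outlink, "mysite", "odp")

# Step 2: a core site also links back to you.
with_both = dict(with_outlink, news=["odp", "mysite"])
after_both = region(with_both, "mysite", "odp")

print(before, after_outlink, after_both)
```

In this toy case the outbound link alone only moves the page into the "in" section; it takes a link back from a core site as well before the page lands inside the SCC.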

We believe that AltaVista is laying plans to tie this link mapping database (called CS2) to another database (Term Vector) to assist in scoring pages. At this point everything appears to be only at the research stage although, from what we can tell, it could be incorporated into AV's main search engine at any time in the very near future.

If you would like to learn more, here is a list of sources and suggested reading...

Warning: don't read while operating heavy equipment or driving.

Graph Structure in the Web
http://www9.org/w9cdrom/160/160.html

Additional Recommended Articles -

The Term Vector database
http://www9.org/w9cdrom/159/159.html

WTMS: A System for Collecting and Analyzing Topic-Specific Web Information
http://www9.org/w9cdrom/293/293.html

Full list of papers from WWW9
http://www9.org/w9cdrom/

Are we having fun yet?
Stephen
John Heard -- Research Specialist
Planet Ocean Communications


