Google Search Ranking: A Brief History of Known Knowns, Known Unknowns, and Unknown Unknowns

There are known knowns. There are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know. – Donald Rumsfeld

In the summer of 1995, 22-year-old Larry Page visited Stanford University. Already accepted, he was considering the campus for his Ph.D. in computer science. Sergey Brin, a year younger but already in the program, showed him around. It was the first time Google’s future co-founders met. Three years later, in 1998, Page and Brin, along with two others, authored “The PageRank Citation Ranking: Bringing Order to the Web” (http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf). The academic paper contained the first Google search algorithm, and the only one ever published in full.

In subsequent years, the algorithm has grown vastly more complex, more closely guarded, and more lucrative. Search engine optimization (SEO) rests on the ability of outsiders to gain insight into Google’s algorithms and translate that knowledge into effective site design. Changes to the algorithm, when viewed over time, reflect the long arc of Google search in technical scope and philosophical intent. If the algorithmic weather remains unpredictable, the climate does not.

Known Knowns: BackRub, PageRank, and Google’s First Algorithm

To test the utility of PageRank for search, we built a web search engine called Google. – Larry Page, et al.

The Stanford students’ paper was an expansion of Page’s initial concept, BackRub. BackRub applied academic citation—the practice of citing other authors within one’s own work—to the Internet, with hyperlinks taking the place of citations. More specifically, Page was interested in backlinks, the hyperlinks that point to a given site. The quantity and quality of backlinks could estimate, according to Page, “a global ‘importance’ ranking of every Web page.” The first Google search engine used BackRub’s principles to crawl the Internet and assign a numerical weight to every site—its PageRank. PageRank didn’t constitute the entirety of the first Google algorithm, but it was the novel feature.
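The core of the paper is a simple recursive definition: a page’s rank is the sum of the ranks of the pages that link to it, each divided by the number of links the linking page casts. In the paper’s simplified notation (paraphrased here), where B_u is the set of pages pointing to page u, N_v is the number of outbound links on page v, and c is a normalization constant:

\[
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}
\]

In practice this means a single link from a heavily cited page can outweigh many links from obscure ones, which is the intuition behind weighting backlink quality as well as quantity.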

Of course, Google was not the Web’s first search engine. When the nascent Internet emerged, it was a compilation of File Transfer Protocol (FTP) sites—virtual libraries filled with stacks of loosely organized volumes, identifiable only by their filenames. McGill University student Alan Emtage created Archie, the first search engine, in 1990; Mark McCahill followed with the creation of Gopher at the University of Minnesota in 1991. Primitive engines indexed files; they didn’t crawl sites. Wandex, the first to crawl sites, was capable of cataloging only page titles. In March 1994, WebCrawler became the first engine to search full text, with an initial database of more than 4,000 sites. Between 1993 and 1995, Excite, Yahoo!, Lycos, Infoseek, and AltaVista all appeared in an increasingly crowded search marketplace. Through various buyouts and the dotcom crash at the turn of the millennium, many were subsumed or faded away.

It was in this climate that Page and Brin conducted their academic work. They recognized the perilous gap between academic knowledge and Internet knowledge: “Because the Web environment contains competing profit seeking ventures […] any evaluation strategy which counts replicable features of web pages is prone to manipulation.” This, of course, stood in contrast to peer-reviewed academic publications with defined units of work—articles and books—that helped establish a threshold of legitimacy.

Indeed, the first Google algorithm had hurdles to overcome. These included properly weighting a large number of low-quality links against a small number of high-quality ones. Other issues of “rank sink” and “dangling links” remained—how to properly value links trapped within cyclical link structures, or those teetering on the edges of cyberspace. Page and his coauthors tackled each with a new variable or equation, which was then folded into the final algorithm.
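As a purely illustrative sketch of that kind of computation, the following uses the now-standard damping factor to keep score from pooling inside link cycles (rank sink) and redistributes the score of pages with no outbound links (dangling links) evenly. The constants and data structures are assumptions for this example, not a reconstruction of Google’s code.

```python
# Minimal, illustrative PageRank iteration -- not Google's implementation.
# The damping factor d counters "rank sink" (score trapped in link cycles);
# dangling pages (no outbound links) redistribute their score uniformly.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:  # pass a share of this page's rank to each outlink
                share = rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += d * share
            else:        # dangling page: spread its rank across all pages
                for t in pages:
                    new_rank[t] += d * rank[page] / n
        rank = new_rank
    return rank

# Tiny example: "a" and "b" link in a cycle; "c" is a dangling page.
print(pagerank({"a": ["b", "c"], "b": ["a"], "c": []}))
```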

The First Known Unknowns and the Death of SEO

“In traditional information retrieval we make the assumption that the articles in the corpus originate from a reputable source and all words found in an article were intended for the reader. These assumptions do not hold on the WWW.” – Krishna Bharat

Google had been registered as a domain since 1997. Its name played on the word “googol,” the mathematical term for 1 followed by 100 zeroes, and referenced the founders’ desire to organize the seemingly infinite information on the Internet. The company started with $100,000 from Sun Microsystems co-founder Andy Bechtolsheim; the following year, a $25 million investment by Sequoia Capital and Kleiner Perkins supplemented Google’s capital. By that time, Google had already been recognized by PC Magazine as the search engine of choice. Google would achieve a market share of 70 percent by 2004.

Google released the Google Toolbar in December 2000. Google Toolbar provided users with a free plug-in to search individual pages or the entire Web from anywhere within a browser. Along with the toolbar came PageRank, a single numerical score from zero to ten that gave lay users a way to gauge page quality. (It was a superficial reflection of more complex internal numbers at Google.) Suddenly, Internet users—in particular webmasters—had a comparative tool that fluctuated over time. For them, the question became: could they change it, and how?

While webmasters began pursuing optimization, Google made changes in September 2002 that caused upheaval, epitomized by 404 pages climbing onto the first page of search engine results pages (SERPs). Confused webmasters christened it the first Google dance: the enigmatic shuffle of pages up and down SERPs. Google intended the updates to curtail anchor text–based “Googlebombing” (Microsoft was, at the time, the top result for “go to hell”), but the first major update generally yielded lower-quality results. By early 2003, Google announced plans for a monthly algorithm update and index refresh.

Throughout 2003, monthly updates, named as though they were hurricanes, reshuffled Google SERPs. They had the impact of catastrophic storms. From April to June, Cassandra, Dominic, and Esmeralda battered many of the earliest SEO practices, penalizing cross-linking among commonly owned sites and cracking down on hidden text. Page titles and link text grew in importance, as did the navigational structure of pages. After a turbulent summer, Google announced that updates would henceforth be ongoing—everflux.

Still, some updates would be bigger than others. In November, the Florida update had a major effect on Google SERPs. Central to the changes was the implementation of Hilltop. Hilltop, developed by Krishna Bharat at the University of Toronto, addressed a fatal flaw in PageRank: “Since PageRank is query-independent it cannot by itself distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.” (ftp://ftp.cs.toronto.edu/dist/reports/csrg/405/hilltop.html) PageRank assigned an absolute measure of importance that was wholly ignorant of keywords.

Hilltop bridged that gap, adding topical relevance to PageRank’s one-dimensional vision of authority. To do this, Bharat devised an algorithm dependent on “expert documents.” Expert documents were a subset of pages that served as link directories to non-affiliated sources. Bharat’s initial survey established 2.5 million expert documents within AltaVista’s 140-million-page index. Hilltop folded into the Google algorithm only when a subject had at least two expert sources; otherwise, Hilltop was ignored. For the first time, SEO had become too complicated for most webmasters to decipher.
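A loose, toy sketch of that two-stage idea, assuming a precomputed set of expert documents per topic and a query-independent authority score for every page, might look like the following. Bharat’s actual algorithm involves affiliation detection, anchor-text matching, and score propagation that this illustration omits.

```python
# Toy sketch of the Hilltop idea described above -- not Bharat's algorithm.
# Assumes a precomputed map of topics to "expert documents" and a baseline,
# query-independent authority score (e.g., PageRank) for every page.

def rank_for_query(query, experts_by_topic, expert_outlinks, authority):
    experts = experts_by_topic.get(query, [])

    # Hilltop applies only when at least two expert documents cover the topic;
    # otherwise fall back to the query-independent authority ranking alone.
    if len(experts) < 2:
        return sorted(authority, key=authority.get, reverse=True)

    # Each non-affiliated page linked from an expert earns an endorsement.
    endorsements = {}
    for expert in experts:
        for target in expert_outlinks.get(expert, []):
            endorsements[target] = endorsements.get(target, 0) + 1

    # Rank by topical endorsements first, breaking ties with global authority.
    candidates = set(authority) | set(endorsements)
    return sorted(candidates,
                  key=lambda p: (endorsements.get(p, 0), authority.get(p, 0.0)),
                  reverse=True)
```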

As came to be the pattern with many Google updates, a second wave followed to complete unfinished work. This time, it was the January 2004 Austin update. For many, Florida and Austin signaled the death of SEO, at least as they knew it. Sites that relied on link farms, stuffed meta tags, superfluous keyword density, hidden tags, and invisible text all plummeted in SERPs. Those exchanging links with off-topic sites were punished as well. Google had reinforced its point: It wanted the most relevant, not the most optimized, links at the top of SERPs.

While some webmasters reeled, Google floated its initial public offering in August. It sold 19 million shares and raised $1.67 billion in capital, giving Google a market value above $20 billion. Share prices doubled just four months later.

Jagger, Vince, and the Day the Long Tail Died

“This is an algorithmic change in Google, looking for higher quality sites to surface for long tail queries. It went through vigorous testing and isn’t going to be rolled back.” – Matt Cutts

In January 2005 Google and all other major search engines took the step of adopting the “nofollow” attribute, which devalued the efforts of link spammers in comment sections and on review sites. Effectively, webmasters could now tag links to tell crawlers, “I didn’t put that link there, someone else did.” Later that year, in June, Google unveiled personalized search results, which mined user search history and location to provide added “signals” within the algorithm.
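The mechanism itself is just an attribute on the anchor tag. As a minimal, hypothetical illustration, a comment template might emit user-submitted links like this:

```python
import html

def render_comment_link(url, text):
    # rel="nofollow" tells crawlers the site owner isn't vouching for the target.
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        html.escape(url, quote=True), html.escape(text))

print(render_comment_link("http://example.com/spam", "amazing deals"))
# -> <a href="http://example.com/spam" rel="nofollow">amazing deals</a>
```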

Shortly thereafter, the Jagger update rolled out. Jagger, which began in September and continued through November, attacked duplicate content and reciprocal linking in addition to link-mongering, blog spam, and CSS spam. It also gave added authority to news and educational sites, as well as those with a longer Web presence.

Until the spring of 2009, the algorithm remained relatively steady, even as Google revealed that it made, on average, between 300 and 500 updates each year. These included additions like the canonical tag, which allowed webmasters to tag pages as copies of a “canonical” page for the purpose of aggregating content metrics. This proved especially important for retailers, who had many duplicate or near-duplicate pages to serve vast product lines. (The canonical tag was later expanded in 2011 to cover multi-page articles.)
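As a minimal illustration (the URLs are hypothetical), every sorted or filtered variant of a product listing can carry the same tag in its head, pointing crawlers back to one preferred address:

```python
def canonical_link_tag(preferred_url):
    # Placed in each duplicate page's <head>, this tells crawlers which URL
    # should accumulate the page's ranking signals.
    return '<link rel="canonical" href="{}">'.format(preferred_url)

# e.g., /widgets?sort=price, /widgets?sort=rating, and /widgets?color=blue
# would all carry the same tag:
print(canonical_link_tag("https://www.example.com/widgets"))
# -> <link rel="canonical" href="https://www.example.com/widgets">
```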

The Vince update, however, did not go unnoticed. It had the primary effect of promoting larger brand websites over smaller, independent sites. While cynics suggested economic collusion, Google CEO Eric Schmidt argued that it was about adding trust into the algorithm: “The internet is fast becoming a ‘cesspool’ where false information thrives […] Brands are the solution, not the problem. Brands are how you sort out the cesspool.”

The MayDay update in 2010 quickly became known as “The Day the Long Tail Died.” Many sites lost between 5 and 15 percent of normal long-tail traffic. In an interview, Matt Cutts, head of Google’s webspam team, declared that Google instituted the change while looking for “higher quality sites to surface for long tail queries.” The primary casualties were large retailers that had hundreds or even thousands of product pages. These pages had limited inbound links and, because of their specificity, rarely attracted external links. Amazon fared surprisingly well due to its strong internal linking, user reviews, recommended items, and other features—all unique content Google valued. The ubiquity of Amazon as a retailer also meant that external users were more likely to link to Amazon when sharing where to buy their favorite gadget.

Long tail–dependent sites suffered another blow in September with Google Instant. (Google Instant enjoyed an earlier iteration as Google Suggest; it had been part of Google Labs since 2004.) Google Instant’s drop-down search suggestions appeared as one typed. Designed to save two to five seconds per search, Google Instant affected many businesses that depended on long-tail search. Users who once planned a search for “Charleston hotel waterfront pet friendly” now saw advertisements and results as soon as they began typing. Instead of finishing their original query, some chose businesses that had invested in expensive head and body keywords like “Charleston hotel.”

Near-instant search was possible only with the added horsepower of the Caffeine update, an infrastructure improvement implemented three months prior. Caffeine provided Google with 100 million gigabytes of storage and an additional 100,000 gigabytes daily. Caffeine didn’t alter Google’s algorithm but provided the capacity to refresh results and index pages far faster, integrating the crawling and indexation operations to provide a 50 percent fresher index.

A Panda Gets Loose

In some sense when people come to Google, that’s exactly what they’re asking for—our editorial judgment. They’re expressed via algorithms. – Matt Cutts

In January 2011 Cutts gave an ominous warning: “We’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” Panda was coming.

Panda, which appeared the following month, sought to tackle so-called content farms and scraper sites. Content farms, in Cutts’s derisive framing, asked the question, “What’s the bare minimum that I can do that’s not spam?” Typical violators were often self-help sites with how-to content that barely registered as informative. Scraper sites copied information through RSS syndication, under fair-use guidelines, or, in some cases, through outright theft. At its most egregious, syndicated content was outranking the original version. Industry watchers compiled lists of Panda casualties to memorialize the carnage (http://searchengineland.com/who-lost-in-googles-farmer-algorithm-change-66173). Overall, Panda affected nearly 12 percent of U.S. search results. Subsequent international rollouts altered another 6–9 percent.

Critical to the Panda rollout were human editors Google used to rate sites. After all, the definition of a “low quality” site was subjective. Google tried to establish editorial standards by sending out examples and asking reviewers questions like, Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids? Would it be okay if this was in a magazine? Google’s internal research found an 84 percent overlap between the conclusions of human reviewers and those of the Panda update. Dozens of Panda updates continued to appear in the following months and years. (Panda 4.0 in April 2014 affected 7.5 percent of English-language queries, again targeting low-quality content; Panda 4.1 in September tackled affiliate marketers and deceptive advertising practices.)

Not until Penguin, in April 2012, did Google unveil another major shift. The Penguin update focused on keyword stuffing, which some believed Google had foreshadowed as its whispered “over-optimization” penalty. Penguin changes affected about 3.1 percent of all search queries. Later that year, Google watchers noticed that domain diversity declined on Google, particularly within the first page of results, which had, in many cases, shrunk from ten listings to just seven. Popular brands could dominate all seven listings on the first page, leading some to lament the ability of companies to bury negative reviews.

In fall 2013 Hummingbird debuted. Though it was the most thorough rewrite of the Google algorithm since 2001, its changes were geared more toward making Google “precise and fast” than toward reshaping SERPs. Hummingbird focused on improving comprehension of conversational speech—accounting for the collective meaning of a keyword phrase, not just the sum of its parts.

Unknown Unknowns: Google Search Ranking, Tomorrow

“There are reports that there is no evidence of a direct link between Baghdad and some of these terrorist organizations,” a reporter lobbed at Secretary of Defense Donald Rumsfeld in a February 12, 2002, Defense Department briefing. Rumsfeld’s response, a delineation of intelligence information among known knowns, known unknowns, and unknown unknowns, drew an immediate chuckle, with later verdicts crediting it as an insightful if tortured turn of phrase. At the very least, it proved enduring. (Rumsfeld later truncated the quote for the title of his memoir.)

Oblique algorithm references leaked by Google have a similarly esoteric quality—precariously located along the dividing line between illuminating and inscrutable. Yet while, as history bore out, Rumsfeld misjudged the breadth of unknown unknowns, Google certainly knows its algorithm. For SEO watchers, the fault is to focus solely on the formula rather than its intent, the letter rather than the spirit of the law. That is the enduring refrain from Google’s algorithmic history: a noble call for quality in design, content, and product. Cutts often repeats the same mantra, even if it fails to concede the self-interest of a $400 billion company.

This, then, is the self-reflective question: What are my worst-practice SEO strategies? The answers are liabilities and, likely, the current projects of Google staffers. The webmasters best able to survive algorithmic storms are not those most adept at building and rebuilding their update-shattered sites. They are those who, after years of watching the skies, have come to understand the climate of their digital geography and have constructed their sites as sturdy shelter from whatever may loom on the horizon.

An occasional glance skyward is still well advised.