Google Penguin: Why You Cannot Fix Your Website

Google announced a few days ago that the Penguin algorithm is now part of the “core” (indexing and ranking) algorithms, that it is now a “signal”, and that it is updated as Googlebot crawls the Web. The algorithmic discussion below pertains to the previous iterations of Penguin. Furthermore, you don’t need to disavow links any more for Penguin (you should still do so for manual actions). Penguin 4.0 strips links it identifies as “spammy” of their ability to pass value.

Despite all that has been written about the Google Penguin algorithm, including numerous question and answer articles that Googlers have assisted with and promoted, many Web marketers and business owners continue to drag their feet about pursuing alternative solutions to recovering their Google search referral traffic.

The moral argument that Google wants people who abused link building strategies to be punished is just silly. We’re approaching two years since the last confirmed Penguin algorithm update and in that time many offending Websites have either gone offline or have struggled to remove, disavow, or outgrow their toxic links. The sentence was handed down, the punishment was served, and you don’t owe it to Google to continue to suffer from your self-imposed exile from the SERPs.

You can recover from the Penguin algorithm very quickly, as I have said before. I promise, I won’t repeat those points in this article. This article should help many of you look at the Penguin algorithm in a very different way.

Google Penguin: What Is It?

In my guide to major Google algorithms I describe Penguin thus: “The Google Penguin Algorithm is a Web spam filter that was released by Google in April 2012 after two months of assigning manual Web penalties to subscription blog networks that were used for link building. The original version of the algorithm only looked at keyword stuffing and/or spammy links on the home pages of Websites. Subsequent versions of the algorithm evaluated other pages on Websites.”

Everyone seems to agree on what Penguin is doing but I still don’t see people agreeing on what it is doing things to. Let’s talk about how Penguin works.

Google Penguin: How It Works

What follows is intended as a high-level summary of how a software system processes data, biased toward how a search system might work. This overview ignores the thousands of innovations and efficiencies that have come to light through various sources. Let’s just get from conceptual point A to conceptual point B as quickly as possible without worrying about how close this summary is to the way Google actually works. It’s NOT close.

A filter is a program that analyzes data or content and assigns some sort of grade to that content. The filter may simply delete the content from an information stream that is fed to other software. In search engineering many filters are called document classifiers. We generally assume that these special applications look for specific things and either score documents on the basis of the perceived presence or absence of those things, or maybe assign some flags, or perhaps create some meta data that is used elsewhere by the larger system that is processing the information.

A document classifier that assigns a score is indeed creating a piece of meta data but it’s numeric or in some way ordinal meta data that is combined with similar data from other sources. Think of every document as being assigned a Quality Score that might include something like PageRank but it could also include other numeric values that represent (say) Degree of Completeness, Degree of Freshness, Degree of Coherence, Degree of Spamminess, etc. I am using the word degree to imply that any document classifier-assigned score assigns a finite value from a range of possible scores and degree is the distance of that assigned score from the lowest possible value.
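To make the idea of a degree concrete, here is a minimal Python sketch. The class name, the score ranges, the weights, and the way the values are combined are all illustrative assumptions; none of this reflects how Google actually computes a quality score.

```python
from dataclasses import dataclass


@dataclass
class DegreeScore:
    """A classifier-assigned score within a finite range (hypothetical)."""
    name: str
    value: float
    lowest: float
    highest: float

    def degree(self) -> float:
        # Degree = distance of the assigned score from the lowest possible value
        return self.value - self.lowest


def quality_score(scores, weights):
    """Combine several degrees into one number using made-up weights."""
    total = 0.0
    for s in scores:
        # Scale each degree to the 0.0 .. 1.0 range so the scores are comparable
        normalized = s.degree() / (s.highest - s.lowest)
        total += weights.get(s.name, 0.0) * normalized
    return total


scores = [
    DegreeScore("freshness", 7.5, 0.0, 10.0),
    DegreeScore("spamminess", 2.5, 0.0, 10.0),
]
# A negative weight means more spamminess lowers the combined score
weights = {"freshness": 1.0, "spamminess": -1.0}
print(quality_score(scores, weights))
```

The point is only that each classifier contributes one bounded number, and some other part of the system decides how those numbers are combined.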

A document classifier that assigns a flag is not so much judging the quality of the content as setting a trigger for further evaluation or redirecting the content to a specific kind of processing sub-system. For example, a crawler may fetch an image file and an HTML file. These files might have to be processed by different sub-systems in the indexing system and so they might be flagged or queued for the appropriate sub-systems.

Some triggers might be imputed from the names of files or certain information embedded in the files. An imputed trigger is handled by conditional logic during initial processing.
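A flag-style classifier with an imputed trigger might look something like the following sketch. The sub-system names and the extension table are hypothetical, invented purely for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping of file extensions to processing sub-systems
SUBSYSTEM_BY_EXTENSION = {
    ".html": "html_indexer",
    ".htm": "html_indexer",
    ".jpg": "image_indexer",
    ".png": "image_indexer",
}


def impute_flag(url_path: str) -> str:
    """Impute a routing flag from the file name alone."""
    suffix = PurePosixPath(url_path).suffix.lower()
    # Anything unrecognized is flagged for closer inspection later
    return SUBSYSTEM_BY_EXTENSION.get(suffix, "needs_inspection")


def route(paths):
    """Queue each fetched file for the sub-system its flag points to."""
    queues = {}
    for path in paths:
        queues.setdefault(impute_flag(path), []).append(path)
    return queues


print(route(["/index.html", "/images/logo.PNG", "/about"]))
```

Notice that the flag says nothing about quality. It is simple conditional logic that decides where the file goes next.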

Non-score meta data might be a snippet or group of snippets of text that could be used to describe the document internally. I’m not talking about the text snippets you see in the search results. The system can use a meta language that classifies and describes large blocks of data. When used, this kind of meta data extraction helps to speed up bulk processing. It could be a data label used to identify one or more records in one or more tables or databases. It could be a normalized representation of critical data such as telephone numbers, addresses, names, etc. It could be tokens that some other program uses to construct and evaluate calculations. Meta data is anything the processing system generates to help itself somewhere else.
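As a tiny example of a normalized representation, here is a sketch that reduces differently formatted telephone numbers to the same string of digits. The function name is hypothetical and the rule (keep only the digits) is a deliberate oversimplification; real normalization would have to handle country codes, extensions, and much more.

```python
import re


def normalize_phone(raw: str) -> str:
    """Strip everything except digits so equivalent numbers compare equal."""
    return re.sub(r"\D", "", raw)


# Three ways of writing the same (fictional) number
variants = ["(555) 010-1234", "555.010.1234", "555 010 1234"]
print({normalize_phone(v) for v in variants})
```

Once every variant collapses to one value, some other program can match records without caring how the number was originally written.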

Well, that’s six paragraphs that don’t mention the Penguin algorithm. What Google has revealed so far suggests that the algorithm looks for suspicious keyword patterns and extracts links for evaluation. The links are evaluated on the basis of one or more data models (probably constructed by a learning algorithm). The evaluation almost certainly uses meta data of some kind since a hypertext link by itself does not indicate its own spamminess. Certain types of keywords might be used more often in spammy anchor text, for example. Some Websites that are known as depending on spammy links might be pooled in a set of meta references to compare to the link’s destination. These are just illustrative examples. Don’t assume this is exactly what Penguin is doing.
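In the same illustrative spirit, the two examples above (spammy anchor keywords and a pooled set of destinations) could be sketched like this. The keyword list, the host names, and the 0.5 increments are all made up, and a real system would presumably use a learned model rather than hand-written rules.

```python
# Made-up meta data for illustration only
SPAMMY_ANCHOR_TERMS = {"cheap", "payday", "casino"}
DESTINATIONS_KNOWN_FOR_SPAMMY_LINKS = {"money-site.example"}


def link_spam_score(anchor_text: str, destination_host: str) -> float:
    """Score one link from 0.0 (no signal) to 1.0 (both signals present)."""
    score = 0.0
    words = anchor_text.lower().split()
    # Signal 1: the anchor text uses a keyword common in spammy anchors
    if any(word in SPAMMY_ANCHOR_TERMS for word in words):
        score += 0.5
    # Signal 2: the destination is in the pooled set of meta references
    if destination_host in DESTINATIONS_KNOWN_FOR_SPAMMY_LINKS:
        score += 0.5
    return score


print(link_spam_score("cheap payday loans", "money-site.example"))
print(link_spam_score("my favorite recipe", "cooking.example"))
```

The link itself carries no score; the score only exists once the link is compared against meta data held elsewhere.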

At the end of the process Penguin either does nothing to the document or it annotates it somehow. So here we are with a document that has a probably spammy link in it. What should Penguin do about that?

The general assumption among marketers is that Penguin assigns some sort of negative score to the destination. I don’t think that’s the case. I think Penguin just assigns a score to the link itself and that score alters the link’s ability to pass PageRank to its destination. The net result is that the value being passed from the spammy linking site to the destination Website is somehow altered. Let’s call this altered value that is passed to the destination Negative PageRank. That is a euphemism, my euphemism. It’s a simple metaphor, not an attempt to reverse engineer Google’s quality scoring system.

I settled on the concept of Negative PageRank a couple of years ago because John Mueller told people they could still grow their way out of a Penguin issue by earning better links. In other words, there is nothing wrong with your Website. It’s just that the link value flowing to it hurts the site.

Where before you had a total Positive PageRank value of X, now, after Penguin has evaluated your backlink profile, you have a total Positive PageRank value of “X – [something]”. That [something] could be almost as large as X, equal to X, or greater in ordinal value than X. It could also be a tiny fraction of X.
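Here is the arithmetic of that metaphor as a short sketch. The per-link multiplier is my own stand-in for whatever score Penguin might assign to a link (1.0 leaves the link alone, 0.0 ignores it, a negative number makes it pass Negative PageRank), and all of the numbers are invented.

```python
def penguin_adjustment(links):
    """links: list of (raw_value, multiplier) pairs, one per backlink.

    Returns (x, something, net) where net == x - something.
    """
    x = sum(raw for raw, _ in links)                 # value before Penguin
    net = sum(raw * mult for raw, mult in links)     # value after Penguin
    something = x - net                              # what was taken away
    return x, something, net


backlinks = [
    (4.0, 1.0),   # a good link, untouched
    (3.0, 0.0),   # a link whose value is simply ignored
    (3.0, -1.0),  # a toxic link that passes negative value
]
x, something, net = penguin_adjustment(backlinks)
print(x, something, net)
```

Adding one more good link to the list raises the net value again, which is the growing-your-way-out scenario in miniature.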

But what about when Penguin finds what it thinks are spammy keywords on YOUR page? In that case I think your page gets some sort of spam score. This spam score might affect your page’s ability to rank and pass anchor text differently from the way Penguin computes [something] just for outbound links. Google can be granular or heavy-handed.

The end result is that your toxic links:
1. Do NOT help your site rank
2. Degrade your site’s ability to rank
3. Leave your site fully capable of improving through other links

This was always the case. Google just got better at finding links that match its profiles of suspicious links. They most likely targeted the large blog networks and semi-private blog networks that were selling home page backlinks and deep post links. It doesn’t take a genius to recognize a lot of those blog posts because the majority of them were created from spun content. The learning algorithms could have identified unusual phrases that were not used by real human writers to help score spun content. They have trillions of documents to evaluate. They probably have a very good idea of what natural language looks like.

Why Google Does Not Show Penguin Downgrades

One request people keep making (and I have raised it myself) is that Google let us know in Search Console when a site has been affected by Penguin. Googlers have only ever said or implied (to my knowledge) that this would be impractical. And that is a rather frustrating response because it leaves all of us out here asking how it can possibly be impractical to let a site owner know that their site’s ability to earn search referral traffic has been impacted by the Penguin algorithm.

So let’s step back a moment and consider some of the nasty things Web marketers do when they create spammy links.
• They engage in negative SEO, trying to intentionally hurt Websites
• They create camouflage links (pointing to “good” sites)
• They create “control” links that are intended to test the linking value of a resource
• They redirect old domains to the blog networks
• They redirect old domains to their money sites
• They drop spammy links on blogs and forums
• They create user profiles on social media sites
• They submit their RSS feeds to aggregators

Some of these behaviors fall almost solely into the arena of the typical Black Hat SEO. But some of these behaviors are quite common among people who are not trying to manipulate search results.

More importantly, a Website created for the sake of hosting spammy links can link out to many, many good Websites in an attempt to fool the algorithms. The algorithms may not actually know which destinations are supposed to benefit from Web spam.

So we have to ask: Just how many Websites have received Negative PageRank-like value from at least a few of their backlinks? My guess is millions, perhaps tens of millions.

Now picture the Search Console team having to deal with all the innocent Webmasters who receive notifications that some of their backlinks have been affected by Penguin. I think there would be mass panic on the Web. Mainstream news media would start investigating the story and yet another anti-competition complaint (or maybe a lot of them) would be filed against Google with various government authorities.

Okay, they could set a threshold so that not everyone hears about their Penguin-identified backlinks. But what kind of threshold should that be? Who should get the bad news? Do we want the heavy duty spammers to receive confirmation that their backlink profiles have been pinged on the Spam Radar? Or do we want to just tell the good guys who did nothing wrong but build a Website that is linked to from a site that is sending out spam signals?

At one end of the spectrum you create a perfect learning environment for the Black Hat community to figure out what signals the algorithm is using. At the other end of the spectrum you alarm and infuriate people who have done nothing wrong. And if you set the threshold for notifications somewhere in the middle of the spectrum do you notify the good guys or the bad guys, knowing that the people near the dividing point will all be a little bit upset and frustrated and probably motivated to try to reverse engineer the signals?

Informing Webmasters that they have a link problem has been a tricky business for Google for years. They don’t like giving out clear, concrete information in these matters because they assume the heavy duty spammers already know which links are bad. As for the good-hearted business owners who retained the wrong SEO service provider, I suspect the Googlers would like them to learn once and for all which kinds of link acquisition strategies Google does not want to see again.

That may be heavy-handed, but you are dealing with only one search engine whereas the search engine is dealing with thousands (or millions) of you. I think I understand what they are up against.

Telling Us Which Sites are Affected by Penguin Is Impractical Because …

Imagine YOUR WEBSITE reports links from 75 domains in Search Console. One day Google could send you a message like:
• You have spammy links in your profile (do nothing, you’re okay)
• You have spammy links in your profile (clean them up and wait, no reconsideration request)
• You have spammy links in your profile (clean up is probably not practical, no reconsideration request)

Should they tell you which links are hurting your site? In the first case the message will alarm some people. I can see the Google Webmaster forums filling up with “why did I get this message” and “why does it say not to worry” questions. Why cause that kind of angst if the sites are not hurting?

But the people who could take some action have three choices: do nothing (for various reasons), do the clean up and wait, or start another site and spam again.

And the third group would just start over.

Out of all three groups you would probably see sub-groups of angry people complaining in the forums, asking Googlers why, why, why at conferences and in hangouts or on social media, maybe filing some lawsuits, etc. And the longer you make people wait for their changes in the backlink profile to take effect the more frustrated they become. We have already seen this.

But how many bad links do you have to receive before they really start hurting your site? Do they hurt just one page and the effect is diminished? Should Google allow that negative value to flow from one of your pages to the next? They could probably stop the internal flow of negative value but what if that takes too many resources?

Should they report which of your pages are receiving bad link value?

Should they tell you which bad links are being ignored?

Should they tell you if some links hurt more than others?

You want to know all these things but what happens to the already fragile equilibrium of the Searchable Web Ecosystem if Google reveals all (assuming they have the resources to do this)? I don’t think you would get the resolution you’re looking for. Those few who would seek to manipulate the system for their advantage no matter what would be watching everything.

You cannot reverse engineer Google’s ranking algorithms but you CAN reverse engineer their spam signals. I think if the transparency everyone desires were easy to achieve Google would have made the effort by now. Every time the Web community moves away from spamming the search engines we do see a significant increase in the quality of the search results. It’s a Dead Spam Bounce but that effect doesn’t last long.

Spammers go right back to work as soon as they recover their wits and pick themselves up off the floor. And as I have pointed out before, most Web marketers contribute toward the next generation of Web spam by sharing and replicating each other’s “best” practices. The adversarial nature between indexer and publisher is built into the system because of the incentive for increasing profits.

And So, Timmy, You Need to Build a Damn New Website

You are not nearly as tired of hearing this from me as I am of telling you. Stop waiting for Google to “fix” Penguin. You don’t even know if you have tracked down all the links that Penguin declared to be toxic. Many of you have clearly been disavowing or removing very good links that were helping your sites.

When the Penguin rolls out again (assuming it does as Google believes it shall) some people will be very happy but I am convinced on the basis of past experience that way too many people are going to be immensely disappointed with the results.

Worse, everything that you are waiting to do until after the next Penguin release is stuff you SHOULD be doing now because it WILL HELP now.

Unless you’re just waiting to go out and spam again.
