For those of you who don't have time to read through the new patent that was released on March 31st, take a look at this summary. It covers most of the main points discussed in the patent.
Document Inception Date
The search engine will learn the creation date of a document by crawling the document, by receiving a submission of the document, or by first discovering a link to it. Other techniques to determine the age of a document might be to look at the time stamp on the server or at the date on which the document's domain was registered. They will use this ‘inception date’ to score the document accordingly. Older documents are expected to have a large number of back links, while newer documents are expected to have fewer. If a newer document has a spiky rate of growth in the number of back links, this may be seen as an attempt to spam the search engine, and the score of the document might actually be lowered. In addition, the search engine “may determine the age of each of the documents in a results set, determine the average age of the documents, and modify the scores of the documents based on a difference between the document’s age and the average age.”
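The quoted step above (compare each document's age with the average age of the result set, then modify the score) can be sketched roughly as follows. This is only an illustration: the function name, the `weight` parameter and the linear "older than average scores higher" rule are assumptions, since the patent summary does not say in which direction or by how much scores are changed.

```python
from datetime import date, timedelta

def age_adjusted_scores(docs, today, weight=0.1):
    """Adjust each score by the document's age relative to the result-set average.

    docs is a list of (base_score, inception_date) pairs. The weight and the
    linear rule are hypothetical; the source only says that scores are
    modified based on the difference from the average age.
    """
    if not docs:
        return []
    ages = [(today - inception).days for _, inception in docs]
    average_age = sum(ages) / len(ages)
    return [
        score + weight * (age - average_age) / 365.0
        for (score, _), age in zip(docs, ages)
    ]

# Three documents with the same base score but different inception dates.
today = date(2005, 4, 30)
results = [
    (1.0, today - timedelta(days=100)),
    (1.0, today - timedelta(days=200)),
    (1.0, today - timedelta(days=300)),
]
adjusted = age_adjusted_scores(results, today)
```

In this sketch the document whose age equals the average keeps its score, the younger one drops slightly and the older one rises slightly. Flipping the sign of `weight` would model the opposite preference.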
Content Updates/Changes
A document that is updated frequently will be scored differently than a document whose content remains static over time. The more the content is updated, the more ‘different’ your score will be. The search engine will also look at the number of new or unique pages associated with the document over a period of time, versus the number of total pages associated with the document. In some cases a “signature” of the document is stored to detect changes in content over time. In other cases the full documents may be stored. The search engine may “determine a date when the content of each of the documents in a result set last changed, determine the average date of change for the documents, and modify the scores of the documents based on a difference between the documents’ date-of-change and the average date-of-change.”
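One way to picture the “signature” idea is to store a hash of each crawled snapshot instead of the full text and count how often the hash changes. The sketch below makes several assumptions: SHA-256 over whitespace-normalized, lowercased text is just one possible signature, and the function names are invented for this example. The patent summary does not say how signatures are actually computed.

```python
import hashlib

def signature(text):
    """Return a compact fingerprint of a document snapshot.

    Whitespace and case are normalized first so that purely cosmetic edits
    do not count as content changes (an assumption of this sketch).
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def count_changes(snapshots):
    """Count how many consecutive snapshots differ in their signature."""
    signatures = [signature(s) for s in snapshots]
    return sum(1 for prev, cur in zip(signatures, signatures[1:]) if prev != cur)

snapshots = [
    "Hello world",
    "hello   world",   # cosmetic change only
    "Hello there",     # real content change
    "Hello there",
]
changes = count_changes(snapshots)
```

Storing only the signature keeps the history small, at the cost of knowing that something changed but not what changed.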
Query Analysis
The search engine might score documents ‘differentially’ based on how often a document is selected in a set of search results. If one document is selected relatively more often, or increasingly often, by users, then this document will score higher. Also, a significant increase in the volume of a search query could indicate a ‘hot topic’ or ‘breaking news’. Sites related to the ‘hot topic’ will be scored higher during this time. Another query factor has to do with the staleness of a document (which is determined by many factors such as creation date, anchor growth, traffic, link growth etc.). The search engine will look at how often a ‘stale’ document is selected over ‘newer’ documents and will adjust the score accordingly. The queries for which a document appears may also be monitored and used as a basis for scoring. “For example, if a particular document appears as a hit for a discordant set of queries, this may (though not necessarily) be considered a signal that the document is spam, in which case search engine 125 may score the document relatively lower.”
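As a rough illustration of the ‘hot topic’ signal, the sketch below flags a query whose latest daily volume is well above its recent average. The seven-day window, the threefold factor and the function name are all assumptions chosen for the example. The patent summary gives no thresholds.

```python
def is_hot_topic(daily_counts, window=7, factor=3.0):
    """Return True if the latest daily query volume spikes above the baseline.

    daily_counts is a list of query counts per day, oldest first. The
    baseline is the mean of the `window` days before the latest one.
    """
    if len(daily_counts) < window + 1:
        return False  # not enough history to form a baseline
    baseline = sum(daily_counts[-(window + 1):-1]) / window
    latest = daily_counts[-1]
    if baseline == 0:
        return latest > 0  # any volume on a previously unseen query
    return latest >= factor * baseline

spiking = is_hot_topic([100] * 7 + [400])
steady = is_hot_topic([100] * 7 + [120])
```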
Link-Based Criteria
The search engine will monitor the behavior of links over time, such as when links appear or disappear and the rates at which this happens. A downward trend in links over time can indicate that a document is ‘stale’, in which case the search engine might decrease its score. An upward trend may signal a ‘fresh’ document, which might be considered more relevant. Links may also be weighted by how much the linking documents are trusted (like government and education sites) and by the ‘freshness’ of the linking page (in other words, how long the link has been up). If a document is considered ‘stale’, the links contained in that document may be discounted or ignored in the document’s score.
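The upward or downward link trend described above could be estimated, for example, by fitting a straight line to periodic back link counts and looking at its slope. This is a minimal sketch under stated assumptions: least-squares fitting is one choice among many, and the `tolerance` threshold and the ‘fresh’/‘stale’/‘steady’ labels returned by `classify_link_trend` are made up for illustration.

```python
def link_trend(counts):
    """Least-squares slope of back link counts over equally spaced snapshots."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(counts))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator

def classify_link_trend(counts, tolerance=0.5):
    """Label a document by the direction of its link trend (hypothetical labels)."""
    slope = link_trend(counts)
    if slope > tolerance:
        return "fresh"
    if slope < -tolerance:
        return "stale"
    return "steady"

growing = classify_link_trend([10, 12, 14, 16])
shrinking = classify_link_trend([16, 14, 12, 10])
flat = classify_link_trend([10, 10, 10])
```

A slope alone would not separate steady growth from the sudden spikes mentioned elsewhere in the summary. A real system would presumably also look at how abruptly the counts change.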
Anchor Text
Changes in anchor text over time may indicate that there has been an update or a change in focus in the document. This may happen when an expired domain is purchased and the anchor text no longer matches the theme of the site. The search engine will try to combat this problem by estimating the date when the anchor text changed focus. All anchor text links prior to that date may be considered obsolete. The date on which linking documents change (or become fresh) can also help your page. When documents that link to your document change, this indicates that your anchor text is still relevant and on-topic.
Traffic
The search engine will use time characteristics of traffic to alter the score of a document. A large reduction in traffic may indicate that the document is becoming stale. The search engine will look at traffic pattern changes over periods of time when a document is more or less popular (such as summer, weekends, or other seasonal periods) to score the document. Advertising traffic is also used to determine the rate at which advertisements are updated over time and the quality of the advertisers (for example, amazon.com would be given more weight than advertisements that refer to low-traffic or untrustworthy documents such as porn sites).
User Behavior
The search engine will monitor the number of times a document has been selected from a search and score the document accordingly. If the document is visited for a longer time than other documents within the results, then the weight for that document will be adjusted accordingly. Shorter time periods spent surfing a page can often indicate the staleness of the document.
Domain-Related Information
Information on when and where a domain is hosted is used to score the document. Doorway domains (used to spam search engines) are often registered for only one year in advance, whereas a more legitimate domain is registered for several years in advance. Data associated with a domain’s contact information, name servers, IP addresses, etc. is stored. This is used to predict the legitimacy of a domain and the documents within the domain. Good name servers have a mix of different domains from different registrars and usually have a good history. Bad name servers might host mainly pornography or doorway domains, domains with commercial words, or bulk domains from a single registrar.
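The registration-length heuristic above can be pictured as a simple check on how far ahead a domain's expiry date lies. The sketch assumes the expiry date is already known (for instance from registration records), and the one-year cutoff simply mirrors the wording of the summary. The function names are hypothetical.

```python
from datetime import date

def registration_years_ahead(expires, today):
    """Approximate number of years between today and the domain's expiry date."""
    return (expires - today).days / 365.25

def looks_short_lived(expires, today, max_years=1.0):
    """True if the domain is paid up for no more than `max_years` ahead."""
    return registration_years_ahead(expires, today) <= max_years

today = date(2005, 4, 1)
one_year_domain = looks_short_lived(date(2006, 4, 1), today)
five_year_domain = looks_short_lived(date(2010, 4, 1), today)
```

On its own this would flag many legitimate sites that renew yearly, which is presumably why the summary lists it alongside name server, IP address and contact data, not as a standalone test.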
Ranking History
Documents will be monitored based on their time-varying rankings. A document that jumps in rankings across many queries may signal an attempt to spam the search engine. Queries that rise over a short period of time may indicate that the search is a ‘hot topic’, and related documents might be scored higher. Ranks over time, or spikes in ranks over time, will also be analyzed. Authoritative documents such as government sites might be exempt from any penalties if a large number of back links start to come in at once. On the other hand, if a document decreases steadily in rankings over time, the search engine will score it as outdated.
User Maintained / Generated Data
The search engine will monitor user data such as ‘bookmarks’, ‘favorites’ and other types of data that indicate how the document is favored. There will also be analyses of upward and downward trends in this data, such as how many times people remove the bookmarks. Other data such as a document’s ‘cache’ files may also be analyzed to view upward and downward trends.
Unique Words, Bigrams, Phrases in Anchor Text
The search engine will monitor links, graphs and their behavior in an attempt to detect spam. A large number of identical anchor text links, or a large number of deliberately different anchor text links, may indicate spam.
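Both extremes mentioned above (nearly identical anchor text everywhere, or anchor text that is different on every link) can be measured with two simple ratios. The thresholds, the minimum link count and the function names below are assumptions for illustration only. The patent summary does not define what counts as too uniform or too varied.

```python
from collections import Counter

def anchor_text_profile(anchors):
    """Return (share of the most common anchor text, ratio of unique texts)."""
    normalized = [a.strip().lower() for a in anchors]
    counts = Counter(normalized)
    total = len(normalized)
    top_share = counts.most_common(1)[0][1] / total
    unique_ratio = len(counts) / total
    return top_share, unique_ratio

def is_suspicious(anchors, max_top_share=0.8, max_unique_ratio=0.95, min_links=10):
    """Flag link profiles that are unusually uniform or unusually varied."""
    if len(anchors) < min_links:
        return False  # too few links to judge
    top_share, unique_ratio = anchor_text_profile(anchors)
    return top_share > max_top_share or unique_ratio > max_unique_ratio

identical = ["cheap widgets"] * 10
natural = ["acme"] * 4 + ["acme widgets"] * 3 + ["widgets"] * 2 + ["click here"]
all_different = [f"phrase {i}" for i in range(10)]
flags = [is_suspicious(identical), is_suspicious(natural), is_suspicious(all_different)]
```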
Linkage of Independent Peers
A sudden growth in the number of apparently independent peers, with incoming or outgoing links, may indicate spam. If the growth is happening with anchor text that is unusually coherent or discordant, this will further strengthen the search engine’s belief that the document is spam.
Document Topics
The search engine analyzes pieces of information to determine the topic of the document. Information such as categorization, URL analysis, content analysis, clustering, summarization and a set of unique low frequency words is used to determine the topic. A document with a spike in the number of topics may indicate spam. Another indication may include the disappearance of the original topics associated with the document.
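The two signals in this section, a spike in the number of topics and the disappearance of the original topics, amount to comparing topic sets from two points in time. The sketch below assumes topics have already been extracted as plain labels by some other process. The function name and the example labels are invented.

```python
def topic_drift(original_topics, current_topics):
    """Return (number of newly added topics, number of original topics lost)."""
    original = set(original_topics)
    current = set(current_topics)
    added = current - original
    lost = original - current
    return len(added), len(lost)

# A gardening site that later turns into something unrelated.
added, lost = topic_drift(
    ["gardening", "roses"],
    ["casino", "poker", "loans"],
)
```

A page that keeps its original topics while adding one related topic would show small values for both counts, which is what normal growth would be expected to look like.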
Exemplary Processing
The search engine may score documents based on how relevant they are to the search query. This is determined in part by the historical data of the document. Historical data is determined by inception dates, document updates, query analysis, link-based criteria, anchor text, traffic, user behavior, domain information, ranking history, user data, unique words in anchor text, linkage of peers and document topics.