
Oct 11, 2006 - 12:00 a.m.

Using robots.txt to explore the hidden secrets of Google and Baidu


A search engine uses a program called a robot, also known as a spider, to visit web pages on the Internet automatically and collect their contents. If a site has information that it does not want others to find through search, the site owner can create a plain-text file named robots.txt and place it in the root directory of the site. The search robot then uses the contents of this file to determine which parts of the site it is allowed to crawl and which parts the owner does not want seen.
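As a minimal sketch of how a well-behaved robot might consult such a file, the following uses Python's standard urllib.robotparser module. The robots.txt content, the robot name "ExampleBot", and the example.com URLs are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up robot name; it falls under the "*" group.
print(parser.can_fetch("ExampleBot", "http://example.com/private/plan.html"))  # False
print(parser.can_fetch("ExampleBot", "http://example.com/index.html"))         # True
```

Note that robots.txt is only a request to crawlers. It does not block access to the listed paths, and the file itself is publicly readable, which is what the rest of this article relies on.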

Interestingly, this feature is often used as a clue for guessing what new things a website is about to launch but does not yet want people to know about. For example, by analyzing changes to Google’s robots.txt, one can try to predict what kind of service Google is going to launch.

Interested readers can look at Google’s robots.txt file. Note that the first few lines include “Disallow: /search”, and that “Disallow: /base/s2” has been newly added at the end.
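Spotting a newly added rule like this can be automated by comparing two saved copies of the file. The sketch below is one possible approach, not a tool used by the author. The two snapshots are hypothetical and heavily shortened, and only the paths /search and /base/s2 come from the article.

```python
def disallowed_paths(robots_text):
    """Collect every non-empty Disallow path in a robots.txt text, ignoring comments."""
    paths = set()
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths


def newly_disallowed(old_text, new_text):
    """Return the paths disallowed in the new snapshot but not in the old one."""
    return sorted(disallowed_paths(new_text) - disallowed_paths(old_text))


# Hypothetical, shortened snapshots of the same robots.txt at two points in time.
OLD_SNAPSHOT = "User-agent: *\nDisallow: /search\n"
NEW_SNAPSHOT = "User-agent: *\nDisallow: /search\nDisallow: /base/s2\n"

print(newly_disallowed(OLD_SNAPSHOT, NEW_SNAPSHOT))  # ['/base/s2']
```

This comparison ignores which User-agent group a rule belongs to, which is enough for noticing new paths but not for deciding what a particular robot may fetch.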

Now for a test. According to this rule, the hidden address is http://www.google.com/base/s2. Opening it, we find that Google returns an error message: “The server encountered a temporary problem and could not respond to your request. Please try again in 30 seconds.”


Figure 1

But if the final digit of s2 is replaced with 1, 3, or any other number, the error message is a different one: “We don’t know why you want to access a non-existent page.”


Figure 2

Obviously “/base/s2” is a special page. Given that Google has said its main focus this year is the search engine, we can speculate a little: does “s2” stand for “search2”, that is, the legendary second-generation search engine? Out of curiosity, I also tried Baidu’s robots.txt. It is much simpler than Google’s lengthy one, only a few short lines:

User-agent: Baiduspider
Disallow: /baidu

User-agent: *
Disallow: /shifen/dqzd.html

The first group needs no explanation. The path in the second group also fails to open, just as before. According to earlier information, however, this page once held Baidu’s list of regional core agents and regional general agents for its pay-per-click (PPC) business, and for understandable reasons it has been obscured.


Figure 3
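To see how the two groups in Baidu’s file apply to different robots, the rules quoted above can be fed to Python’s standard urllib.robotparser. This is only a sketch. The robot name "OtherBot" is made up, and the host www.baidu.com in the test URLs is an assumption (the matcher looks only at the path).

```python
from urllib.robotparser import RobotFileParser

# The Baidu rules exactly as quoted in the article.
BAIDU_ROBOTS = """\
User-agent: Baiduspider
Disallow: /baidu

User-agent: *
Disallow: /shifen/dqzd.html
"""

parser = RobotFileParser()
parser.parse(BAIDU_ROBOTS.splitlines())

# The host below is an assumption; only the path is matched against the rules.
BASE = "http://www.baidu.com"

# Baiduspider has its own group, so only "/baidu" is off limits to it.
print(parser.can_fetch("Baiduspider", BASE + "/baidu"))             # False
print(parser.can_fetch("Baiduspider", BASE + "/shifen/dqzd.html"))  # True

# Any other robot ("OtherBot" is a made-up name) falls under the "*" group.
print(parser.can_fetch("OtherBot", BASE + "/shifen/dqzd.html"))     # False
print(parser.can_fetch("OtherBot", BASE + "/baidu"))                # True
```

In this parser a robot that matches a named group follows only that group, so the “*” rule for /shifen/dqzd.html does not apply to Baiduspider.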