
A Critique of Robin Li's "Hidden Web" / Baidu and its "Aladdin" project


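Recently, Baidu CEO Robin Li (李彦宏) announced the company's "Aladdin project", noting that "there is a great deal of information on the Internet that has never been turned into web pages and that we have not touched at all; this is the so-called 'Hidden Web'." He also revealed that Baidu has been preparing this project for more than a year: "If the 'visible web' that can be searched today is measured in the tens of billions, the Hidden Web is measured in the trillions."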

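When I first read the news, something felt off, so I reread it several times with the attitude of someone looking for faults. Frankly, I have never thought much of Baidu, particularly in R&D, where it is as phony as most Chinese IT companies. And now Baidu has supposedly turned over a new leaf? More than half of its 2,000 core engineers are to be assigned to a "high-tech" project that represents "next-generation search technology"? Good thing I don't wear glasses, or they would have fallen off.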

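The "Hidden Web" looked baffling, so I "Baidu-ed" it. Baidu Baike told me that the "Hidden Web" refers to "web pages that search engines currently cannot crawl and information that cannot be retrieved", and that it comes in two kinds. The first has technical causes, namely sites that are not built to standards: "This is not a problem that search engines can solve by themselves; it depends on standardizing the structure of the whole network. Baidu's 'Aladdin project' and Google's 'cloud computing' aim to solve this problem at its root." The second kind consists of sites that do not want to be crawled for copyright or privacy reasons, such as Youku: "This is even less a problem that search engines can solve. If such sites could be crawled by search engines, that would be illegal." So the "Hidden Web" turns out to mean "a problem that search engines cannot solve by themselves". Baidu is hilarious. Even its solution has to drag in Google's "cloud". It seems "Aladdin" is at most a clone of the "cloud".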

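In fact, even if the "Hidden Web" problem really could be solved, it is already not very "technical" for a search engine company to make simply enlarging its search database a core project, especially when that company is China's largest search engine with more than 70% of the Chinese market. As is well known, the larger the search database, the harder it is to find the data you want. For applications that demand high relevance, such as academic search, the search database is all but fixed. Everyday searches do not demand such high relevance, yet they still often return millions of results, while 95% of users do not have the patience to read past the first 20. So although enlarging the search database is of course necessary for a search engine, what really matters for improving the user's search experience is improving search precision.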

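Compared with its big-ticket strategy of "entering Japan", Baidu is visibly weak on the technical side. Baidu's success did not come from technology in the first place, but from business strategy and certain less-than-honorable "little maneuvers". Business strategy can create value too, but Baidu is after all an IT company, not an investment or trading company. Ironically, Baidu, which is not good at technology at all, actually said in its apology over the false medical information incident: "In order to compete with a globally leading technology company like Google, Baidu focused too much on technology and R&D and lacked strict management of and systematic investment in sales operations. Baidu has reflected deeply on this." It is truly laughable.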

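Baidu looks successful today. But relying purely on a "rogue" spirit, and facing an opponent like Google that is strong both inside and out, how much further can a Baidu so lacking in fundamentals go?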

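References


Baidu's "Aladdin" searching for the "Hidden Web" (Chinese)
http://tech.sina.com.cn/i/2008-12-19/15312672609.shtml

Exclusive: Baidu apologizes over false medical information (Chinese)
http://tech.sina.com.cn/i/2008-11-17/22162584794.shtml

Baidu Baike entry: Hidden Web (Chinese)
http://baike.baidu.com/view/2074980.html

Wikipedia entry: Baidu (English)
http://en.wikipedia.org/wiki/Baidu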


Baidu and its "Aladdin" project


Robin Li, the CEO of Baidu, recently announced what he calls Baidu's "Next Generation Search Technology", a project named "Aladdin". He stated that a great deal of information on the Internet still cannot be reached by search, and he calls this the "Hidden Web". Baidu also believes that the "Hidden Web" is hundreds of times larger than the "visible web" that can be searched today.

Baidu is a leading Chinese company that provides a search engine similar to Google's. It is the most widely used search engine in China, with more than 70% of the market. It also holds a high position in the NASDAQ index.

However, although Baidu has stated "We put too much effort into competing technically with Google", it is well known that Baidu is weak in technology, and it has often been accused of plagiarizing Google. Baidu was once considered to provide more accurate Chinese search results than Google, but the advertising scandal of November 2008 disappointed many of its supporters. Even now, Baidu insists that its advertising programs are effective and should continue.

I don't think "Aladdin" is a meaningful project. As far as I understand information retrieval, enlarging the database does not by itself improve a search engine's performance. Search fields that require high relevance, such as academic search, work with only a limited corpus or even a closed one. I don't think the search experience can be improved simply by enlarging the database, especially when Baidu itself defines the "Hidden Web" as a problem that "cannot be solved by a search engine alone".
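To make the point about database size a little more concrete, here is a minimal sketch using precision at k, a standard information retrieval measure: the fraction of the first k results that are relevant. Everything in it is a toy assumption of mine and not Baidu's or anyone's real system: the `Doc` type, the `build_index` helper, the scores, and the idea that a fixed, imperfect ranker over-scores one in ten irrelevant pages. The cutoff k = 20 stands for a user who reads only the first 20 results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    relevant: bool   # ground truth for one hypothetical query
    score: float     # score given by a hypothetical, imperfect ranker


def rank(docs):
    """Order documents by ranker score, highest first."""
    return sorted(docs, key=lambda d: d.score, reverse=True)


def precision_at_k(ranked, k):
    """Fraction of the first k results that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(d.relevant for d in top) / len(top)


def build_index(n_relevant, n_irrelevant):
    """Toy index: the ranker is unchanged, only the corpus size varies.

    Assumption: relevant pages score 0.6, and the ranker mistakenly gives
    every tenth irrelevant page a higher score of 0.7 (others get 0.1).
    """
    docs = [Doc(f"rel-{i}", True, 0.6) for i in range(n_relevant)]
    docs += [
        Doc(f"irr-{i}", False, 0.7 if i % 10 == 0 else 0.1)
        for i in range(n_irrelevant)
    ]
    return docs


if __name__ == "__main__":
    small = rank(build_index(n_relevant=15, n_irrelevant=50))
    large = rank(build_index(n_relevant=15, n_irrelevant=5000))
    print("P@20, small index:", precision_at_k(small, 20))
    print("P@20, large index:", precision_at_k(large, 20))
```

Under these assumptions the small index gives 0.75 for the first 20 results and the hundredfold larger index gives 0.0, because the extra pages that the ranker over-scores crowd out the relevant ones. A real engine would not behave this crudely, but the sketch shows why a bigger index helps the user only if ranking quality keeps up.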

In fact, the "Aladdin" project can be seen as a copy of Google's "cloud computing", only with a different and slippery definition. If "competing technically with Google" is nothing but nonsense, how much further can Baidu go?

References


Wikipedia: Baidu
http://en.wikipedia.org/wiki/Baidu

Baidu's "Aladdin" searching for the "Hidden Web" (Chinese)
http://tech.sina.com.cn/i/2008-12-19/15312672609.shtml

Exclusive: Baidu apologizes over false medical information (Chinese)
http://tech.sina.com.cn/i/2008-11-17/22162584794.shtml

Entry of Baidu Baike: Hidden Web (Chinese)
http://baike.baidu.com/view/2074980.html
