Rewrite the Technological Landscape

Chapter 40: Search Engine Algorithm

At around one o’clock in the afternoon, Meng Qian arrived in Pudong, Shanghai. This was his first trip to Shanghai since his rebirth, though he had visited the city often in his previous life.

As China’s financial center, Shanghai is the country’s calling card to the world.

However, in his previous life Meng Qian had first come to Shanghai in 2007; he had never seen the city as it was in 2000.

At this time, high-rise buildings had begun to go up in Pudong, but there were still large stretches of factories and shantytowns. As the car drove on, demolition and reconstruction could be seen everywhere.

“Ms. Zhang, you’re going to put the branch in Pudong?” After arriving at the destination, Meng Qian compared the place against his memories. If he was reading it right, this should be Zhangjiang Hi-Tech Park.

Among Pudong’s four key development zones, the financial center Lujiazui and the technology center Zhangjiang are probably the two most familiar to the outside world.

In 2000, Zhangjiang’s leading industries were integrated circuits, software, and biomedicine.

Zhang Shuxin nodded in affirmation. “Right now the most promising places in the south are Shenzhen and Shanghai’s Pudong, and Zhangjiang Hi-Tech Park is an incubator for science and technology.”

At this time, when people talked about the development potential of southern cities, especially in science and technology, no one even thought of Hangzhou.

When they arrived at the space Zhang Shuxin had newly rented, five men were waiting there; two of them were obviously foreigners.

Zhang Shuxin introduced them one by one. Of the two foreigners, one was from IBM and the other from Google, meaning they had either already been poached or were about to be. Both had previously worked on search engine project teams.

Of the three Chinese men, one was Ying Haiwei’s own technical director, and the other two had returned from Silicon Valley: one a Stanford graduate who had worked at Intel, the other a Harvard graduate who had worked at Oracle. All of them were genuine talents.

After a quick round of greetings, everyone sat down in the meeting room. Then it was Meng Qian’s time to perform; today he would demonstrate his core search engine technology.

Search engines rely on web crawling, search ranking, web page processing, big data processing, natural language processing, and so on. Of course, in 2000 natural language processing and big data processing were not yet required, and the concepts differed from those of later generations anyway.

But put simply, the core comes down to one thing: the algorithm.

Because every technology is inseparable from algorithms.

“I’m not sure how much each of you has accomplished in or understands about search engines, so I can only go at my own pace. If anyone has a question, feel free to interrupt me at any time.” Meng Qian walked to the blackboard and got straight to the point.

“Before I show my core technology, let’s look at the three current mainstream algorithms: Baidu’s hyperlink analysis, Google’s PageRank algorithm, and IBM’s HITS algorithm.

Almost everyone considers Baidu’s hyperlink analysis the most backward of the three, but we should look at things from multiple angles. To some extent, hyperlink analysis can be regarded as having laid the foundation on which search engines developed.

Some voices claim that Google actually plagiarized Baidu’s hyperlink algorithm; after all, Robin Li’s patent does predate Google’s. We’re not here to judge whether that’s true, but the claim reflects an important signal: no matter which algorithm you look at, the underlying basis is the same.

Crawl web page information, then use some mechanism to rank those pages. When a user enters keywords to search, the keywords are matched against the pages ranked by that mechanism.

So where does Baidu lose? The key is that Baidu’s current ranking method is too simple: among all the results for a given search, the more other pages link to a page, the higher its value.

In contrast, Google’s PageRank adds two important things. The first is to interpret a link from page A to page B as A casting a vote for B. Here Google evaluates A and B at the same time, and out of that evaluation a new kind of ranking takes shape.

That means every page has a PR value, and a page’s PR value in turn becomes a reference for the PR values of the pages it links to.

Then the PR of every page is calculated repeatedly. Start by assigning each page an arbitrary PR value; after enough iterations, the PR values stabilize, which is the state of convergence.
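(A minimal sketch of the iterative PageRank idea described above. The example graph, the damping factor of 0.85, and the tolerance are illustrative assumptions, not values from Meng Qian’s explanation.)

# Minimal sketch of iterative PageRank: each page "votes" for the pages
# it links to, and PR values are recomputed until they converge.
def pagerank(links, damping=0.85, tol=1e-6, max_iter=100):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from an arbitrary (uniform) PR value
    for _ in range(max_iter):
        new_pr = {}
        for p in pages:
            # pages linking to p pass on a share of their own PR
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - damping) / n + damping * incoming
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:  # convergence check
            return new_pr
        pr = new_pr
    return pr

# Example: A and B link to C, C links back to A.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))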

As for HITS, its theoretical basis is still the same. Its biggest feature, or change, is the realization that PageRank’s evenly distributed link weights don’t match how links actually behave.

So the HITS algorithm introduces another kind of page, the hub page: a page whose value comes from gathering links to authoritative pages.

Search results from HITS should therefore be more authoritative than those of the other two, but the algorithm greatly increases the computational burden, right?”
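(A minimal sketch of the hub/authority iteration HITS performs. The link graph is an illustrative assumption, and the query-focused subgraph step of real HITS is omitted for brevity.)

# Sketch of HITS: authority scores come from the hubs linking to a page,
# hub scores come from the authorities a page links to.
def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to you
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # hub score: sum of authority scores of the pages you link to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"A": ["C"], "B": ["C"], "C": ["A"]})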

Meng Qian glanced at the man from IBM, who froze for a moment and then nodded uncertainly.

“So, to briefly summarize: the algorithmic basis of search engines is hyperlink analysis, and the difference between a good and a bad algorithm lies in how to make search results more valuable, so that users get more useful information.

Of course, if you could directly understand a user’s needs and find exactly the content he wants most, that would be the ideal search engine, but everyone knows that’s impossible.

So the quality of a search engine comes down to whether, for the same keywords, relatively more people find the content they want.

If 10 users search on Google and 5 find what they want, while 6 out of 10 find it with our search engine, then in the current technical environment of this field, we are the better engine.

Based on that understanding, what I want to introduce next is my search engine algorithm: the dynamic rule hyperlink analysis algorithm.

The dynamic rule hyperlink analysis algorithm makes the following changes.

First, as mentioned earlier, a good search engine is the one whose results for the same keyword better meet users’ needs. When a user searches for something, the results he most likely wants to see are the ones more vertically related to that thing.

For example, when a user searches for cars, whether he wants to buy one or just learn about them, professional automotive pages are the ones most likely to help him.

So in my algorithm, for the links pointing to a given website, I first compute a verticality score. For example, if 10 websites link to A, the result when all 10 are automotive websites must differ from the result when none of them are.

There’s also a small psychological point here: competitors rarely link to one another, so a site that still attracts many links from vertical sites in its own field must be more professional and more reliable than one linked only by a jumble of unrelated sites.
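(A minimal sketch of the verticality-score idea, assuming each page carries a category label; the category labels and the two weights are illustrative assumptions, not part of his stated design.)

# Sketch of the verticality score: links from pages in the same vertical
# (e.g. automotive sites linking to an automotive page) count for more.
# The 1.0 / 0.2 weights and category labels are illustrative assumptions.
def vertical_score(target_category, inbound_links, page_category):
    score = 0.0
    for source in inbound_links:
        if page_category.get(source) == target_category:
            score += 1.0  # a link from a peer site in the same vertical
        else:
            score += 0.2  # a link from an unrelated site counts far less
    return score

categories = {"autoblog": "cars", "newsportal": "news", "carforum": "cars"}
print(vertical_score("cars", ["autoblog", "newsportal", "carforum"], categories))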

Second, establish a popularity ranking for the keyword database. The current search engine companies all rank web pages; I also rank keywords, and ranking keywords is very simple: it depends on search volume.

For example, if the most-searched keyword today is cars, then cars might score 10 points, and the algorithm will allocate more crawling resources to car-related information to fetch more high-quality pages.

This brings four advantages: faster feedback of information, more timely coverage of hot topics, savings in computing resources, and focus on the ultimate goal of letting more of our users get useful information.
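(A minimal sketch of how keyword popularity might steer crawl resources as described above; splitting the crawl budget in proportion to search volume is an illustrative assumption about how the idea could be realized.)

# Sketch of keyword popularity driving crawl-resource allocation:
# hot keywords receive a larger share of the crawler budget.
def allocate_crawl_budget(search_counts, total_budget):
    total = sum(search_counts.values())
    return {kw: round(total_budget * count / total)
            for kw, count in search_counts.items()}

today = {"cars": 50000, "weather": 20000, "stocks": 30000}
print(allocate_crawl_budget(today, total_budget=1000))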

Third, a user feedback mechanism, which tracks users’ clicks and browsing.

Take cars again as the example. If 100 users search for cars and 80 of them click page A, page A’s rating rises. If more users stay on page A longer, its rating rises further. And if more users follow links on page A and act on it, its rating rises again.

In other words, user feedback points are added to the overall page rating system.
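(A minimal sketch of folding click-through rate, dwell time, and follow-up actions into one feedback score; the specific weights and the one-minute dwell cap are illustrative assumptions.)

# Sketch of the user-feedback score: clicks, dwell time, and on-page
# actions all push a page's rating up. Weights are illustrative assumptions.
def feedback_score(impressions, clicks, avg_dwell_seconds, follow_up_actions):
    ctr = clicks / impressions if impressions else 0.0
    dwell = min(avg_dwell_seconds / 60.0, 1.0)  # cap dwell contribution at one minute
    action_rate = follow_up_actions / clicks if clicks else 0.0
    return 0.5 * ctr + 0.3 * dwell + 0.2 * action_rate

# 100 searches, 80 clicks on page A, ~45s average stay, 20 follow-up actions
print(feedback_score(impressions=100, clicks=80, avg_dwell_seconds=45, follow_up_actions=20))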

Fourth, a pattern-mining algorithm searches all user behavior for high-probability patterns and feeds them back to humans. For example, 60% of users who search for cars go on to look at insurance.

Rules like this can’t be predicted in advance, but we can mine them from the data with algorithms, and the results can then be used by a human analysis team to score certain pages. That is the manual score.

Combining the four points above, under my algorithm every webpage also ends up with a single score, which I call the accuracy score.

The factors that feed into the accuracy score are the page’s own score, the vertical-link score, the user feedback score, the manually set score, and the influence of external links.”
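(A minimal sketch of combining those five factors into one accuracy score; the weighted-sum form and the weights themselves are illustrative assumptions about how the derivation formulas mentioned below might look.)

# Sketch of the combined accuracy score as a weighted sum of the five
# factors listed above. The linear form and all weights are illustrative
# assumptions, not formulas given in the text.
def accuracy_score(self_score, vertical_link_score, feedback_score,
                   manual_score, external_link_influence,
                   weights=(0.3, 0.25, 0.25, 0.1, 0.1)):
    factors = (self_score, vertical_link_score, feedback_score,
               manual_score, external_link_influence)
    return sum(w * f for w, f in zip(weights, factors))

print(accuracy_score(0.8, 0.6, 0.7, 0.5, 0.4))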

After that, Meng Qian briefly laid out the algorithmic logic and derivation formulas for each branch.

But while Meng Qian was still explaining the last part, the pattern-mining algorithm, Jeff from IBM suddenly stood up and exclaimed, “OH MY GOD! Artificial intelligence?!”

Meng Qian turned his head, glanced at him, and frowned.

Jeff paused, and thinking Meng Qian hadn’t understood, repeated himself in oddly accented Chinese, “Fuck!!!”

With Jeff’s interruption, the expressions of the other four technicians, who had been immersed in Meng Qian’s presentation, also changed visibly…

