SQL Joins 翻译

Stéphane Faroult先生在SQL Joins里简单介绍了SQL各种Join的内涵,运用了一个形象的舞池比喻。


Hello. I’m Stéphane Faroult, a French database consultant, and I’ve designed this short video note about joins.

大家好,我是Stéphane Faroult,一个法国的数据库顾问,我设计了这个关于联接(join)的简短的视频笔记(希望能帮助你了解join的工作原理)。

You probably know what a join is: in a relational database, when two tables have one or several common columns, you can link them by saying that a row in one table matches a row from the other table that has the same values in the common columns.



For instance, you can store in your database the registered members of a web site; they are identified by their screen name, and a numerical id is used as a convenient reference. Your table probably contains more information, such as the date when members joined, and perhaps a number of points acquired by providing useful answers in a forum.


Another table will contain the messages posted by members on the forum. I have just shown here the member id and the subject of the post, you would obviously also have a date, the message body and possibly some other attributes.


If we want to list the messages, most recent first, we also want to display who posted the message, as well as the member’s points. We will therefore write a query such as this one, in which the ‘memberid’ column in the messages table refers to the ‘memberid’ column in the members table and is used to perform the join.


When you are taught SQL, you are often explained that the operation consists in generating all possible combinations between the rows from the members table and the rows from the messages table, to only retain the combinations for which the common column contains the same value.


This operation is called a cartesian join, after René Descartes, one of the major European philosophers and mathematicians of the 17th century. Although a database system sometimes executes cartesian joins, the most common ways to perform a join are nested loops, and hash or merge joins.

这个操作就叫做笛卡尔联接,是根据17世纪欧洲主要哲学家和数学家勒内·笛卡尔命名的。尽管数据库系统有时执行笛卡尔联接,但是执行联接的最常用方法是嵌套循环联接(nested loops)和哈希(hash)联接或合并(merge)联接。

To explain the difference between those two categories of joins, Tom Kyte, who is senior technologist with Oracle, once used an excellent analogy on his popular “Ask Tom” site, and I’ve tried to illustrate Tom’s  dance floor analogy.

为了解释一下那两种联接的区别, Oracle的高级技术人员Tom Kyte,曾经在他的Ask Tom网站上使用了一个杰出的类比,下面我将尝试阐述Tom的舞池类比。

Here is how it goes. Suppose that you have a few guys, and a few girls. Each guy is looking for a partner of about the same size as himself. The first guy moves forward, and is strongly attracted by a girl who looks like a model. Unfortunately, models are tall, and our guy is rather short. A second girl quite lovely, but a tad too tall for him. Finally,he finds a suitable partner who has both the right size and a nice smile, and invites her. If each guy repeats the same process, we have in effect the same algorithm as with a nested loop.

这是它的内容。假设有几个男士和几个女士,每个男士在寻找一个和他身高差不多的舞伴。第一个男士前进一步,并且被一个看起来像模特的女士强烈的吸引住了,不幸的是,模特很高,而那个男士相当的矮;第二个女孩很可爱,但是还是比他高一点;最后,他找到了一个合适的舞伴,她有着和他相仿的身高和甜美的微笑,男士邀请了她。如果每个男士重复这个同样的步骤,我们就有与嵌套循环联接(nested loop)实质相同的算法。

Now suppose that we have a lot of girls, and a lot of guys. If we take the same Goldilock approach to flirting as before, it is likely that dawn will be there before all couples are formed. Therefore, we are going to trade romanticism against efficiency, and ask all girls and all boys to get in rank by order of size, so that we can easily match them. This would be an overkill with very few individuals. But with a large number of people, this is probably the best way to proceed, and it is exactly the way a merge join works.

现在假定有很多的女士和很多的男士。如果我们还像之前那样接近并与一个金发女郎调情,很可能要等到黎明时才能使所有的人都找到舞伴。因此,我们将放弃浪漫而追求效率,让所有的女士和所有的男士都按身高排队,然后他们可以简单的进行配对。如果人数少的话,用这种方法就像是“用大炮打蚊子”,小题大做了。但是如果有一大堆人的话,这可能就是最好的方法了,并且它恰好和合并联接(merge joins)是一样的。

In the dry world of databases, to perform a nested loop an SQL engine will scan the smallest of the two tables, then find in turn the matching rows in the other table.

在数据库的世界里,要执行一个嵌套循环联接(nested loop),SQL引擎将扫描两个表中最小的那个表,然后依次匹配另一个表中的行。

It is very important that the column used for the join is indexed in the bigger table, because otherwise you would have to scan it each time you consider a new row in the smaller table.


We have seen how merge joins work. Hash joins are a variant that is a good compromise in terms of CPU and memory usage. In a hash join , a temporary hash table is created. A hash table is a kind of array that contains pointers to the rows of a table. A special function, called the hash function,  applied to the key associates each key to a position in the array.

我们已经看过合并联接(merge joins)是怎样运作的了。哈希联接(hash joins)是一个变体,是CPU和内存使用的方面很好的折中。在哈希联接里,一个临时的哈希表被建立。哈希表是一个数组,包含指向表中行的指针。一个特别的函数,称作哈希函数,应用于表中的每个键(主键),在数组中每个键占一个位置。

The smallest table is scanned first, and the hash table is populated with pointers to the rows that match each key.


Then the second table is scanned, only once, the function directs us to the proper slot in the hash table and from there you find the pointers to the matching rows.


Contrarily to the nested loop join, indexes are no longer so important. An SQL optimizer may choose to scan an index rather than a table because indexes are usually smaller than tables, but this is a marginal improvement.


When the query isn’t madly complicated, when data statistics allow to get a fair idea about the expected size of the result set and if existing indexes allow a choice, the query optimizer can usually decide which type of join will be fastest.  If the query is really complex, the way you write it often sends signals to the optimizer.


 But that, as Kipling used to say, is another story. Thank you for listening.


作者:Stéphane Faroult

普人特福的博客cnzz&51la for wordpress,cnzz for wordpress,51la for wordpress