SQL Joins 翻译

Stéphane Faroult先生在SQL Joins里简单介绍了SQL各种Join的内涵,运用了一个形象的舞池比喻。

以下分别是中文字幕视频和翻译文档的中英文对照。

Hello. I’m Stéphane Faroult, a French database consultant, and I’ve designed this short video note about joins.

大家好,我是Stéphane Faroult,一个法国的数据库顾问,我设计了这个关于联接(join)的简短的视频笔记(希望能帮助你了解join的工作原理)。

You probably know what a join is: in a relational database, when two tables have one or several common columns, you can link them by saying that a row in one table matches a row from the other table that has the same values in the common columns.

也许你已经知道联接(join)是什么:

在关系型数据库中,如果两个表有一个或多个公共列,我们可以连接两表中对应列里列值相同的行

For instance, you can store in your database the registered members of a web site; they are identified by their screen name, and a numerical id is used as a convenient reference. Your table probably contains more information, such as the date when members joined, and perhaps a number of points acquired by providing useful answers in a forum.

例如,您可以在您的数据库中储存一个网站的注册会员,他们由网名标识,并且用一个数字的id方便的作为参考。你的表中也许包含着更多的信息,例如会员的注册日期,以及在论坛中回帖所得的点数。

Another table will contain the messages posted by members on the forum. I have just shown here the member id and the subject of the post, you would obviously also have a date, the message body and possibly some other attributes.

另一个表将会包含论坛会员发帖的信息。我在这只显示了会员ID和帖子主题,在真实生活中,这个表中或许还应包含发帖时间,帖子内容和帖子的一些其他属性。

If we want to list the messages, most recent first, we also want to display who posted the message, as well as the member’s points. We will therefore write a query such as this one, in which the ‘memberid’ column in the messages table refers to the ‘memberid’ column in the members table and is used to perform the join.

如果我们想列出这些信息,最新的优先,同时也想要显示发帖人,以及他的点数。我们需要写一个像这样的查询,其中messages表的memberid列参考members表中的memberid列,用来完成联接。

When you are taught SQL, you are often explained that the operation consists in generating all possible combinations between the rows from the members table and the rows from the messages table, to only retain the combinations for which the common column contains the same value.

当你学习SQL时,你学习到有一个操作在于生成在members表和messages表的行和行之间所有可能的组合,只保留包含相同值的共同的列的组合。

This operation is called a cartesian join, after René Descartes, one of the major European philosophers and mathematicians of the 17th century. Although a database system sometimes executes cartesian joins, the most common ways to perform a join are nested loops, and hash or merge joins.

这个操作就叫做笛卡尔联接,是根据17世纪欧洲主要哲学家和数学家勒内·笛卡尔命名的。尽管数据库系统有时执行笛卡尔联接,但是执行联接的最常用方法是嵌套循环联接(nested loops)和哈希(hash)联接或合并(merge)联接。

To explain the difference between those two categories of joins, Tom Kyte, who is senior technologist with Oracle, once used an excellent analogy on his popular “Ask Tom” site, and I’ve tried to illustrate Tom’s  dance floor analogy.

为了解释一下那两种联接的区别, Oracle的高级技术人员Tom Kyte,曾经在他的Ask Tom网站上使用了一个杰出的类比,下面我将尝试阐述Tom的舞池类比。

Here is how it goes. Suppose that you have a few guys, and a few girls. Each guy is looking for a partner of about the same size as himself. The first guy moves forward, and is strongly attracted by a girl who looks like a model. Unfortunately, models are tall, and our guy is rather short. A second girl quite lovely, but a tad too tall for him. Finally,he finds a suitable partner who has both the right size and a nice smile, and invites her. If each guy repeats the same process, we have in effect the same algorithm as with a nested loop.

这是它的内容。假设有几个男士和几个女士,每个男士在寻找一个和他身高差不多的舞伴。第一个男士前进一步,并且被一个看起来像模特的女士强烈的吸引住了,不幸的是,模特很高,而那个男士相当的矮;第二个女孩很可爱,但是还是比他高一点;最后,他找到了一个合适的舞伴,她有着和他相仿的身高和甜美的微笑,男士邀请了她。如果每个男士重复这个同样的步骤,我们就有与嵌套循环联接(nested loop)实质相同的算法。

Now suppose that we have a lot of girls, and a lot of guys. If we take the same Goldilock approach to flirting as before, it is likely that dawn will be there before all couples are formed. Therefore, we are going to trade romanticism against efficiency, and ask all girls and all boys to get in rank by order of size, so that we can easily match them. This would be an overkill with very few individuals. But with a large number of people, this is probably the best way to proceed, and it is exactly the way a merge join works.

现在假定有很多的女士和很多的男士。如果我们还像之前那样接近并与一个金发女郎调情,很可能要等到黎明时才能使所有的人都找到舞伴。因此,我们将放弃浪漫而追求效率,让所有的女士和所有的男士都按身高排队,然后他们可以简单的进行配对。如果人数少的话,用这种方法就像是“用大炮打蚊子”,小题大做了。但是如果有一大堆人的话,这可能就是最好的方法了,并且它恰好和合并联接(merge joins)是一样的。

In the dry world of databases, to perform a nested loop an SQL engine will scan the smallest of the two tables, then find in turn the matching rows in the other table.

在数据库的世界里,要执行一个嵌套循环联接(nested loop),SQL引擎将扫描两个表中最小的那个表,然后依次匹配另一个表中的行。

It is very important that the column used for the join is indexed in the bigger table, because otherwise you would have to scan it each time you consider a new row in the smaller table.

如果在一个很大的表中,给联接要用到的列建立索引是很有必要的,否则每次根据一个小表中的行来寻找大表中相同的行,都要对大表进行完全扫描。

We have seen how merge joins work. Hash joins are a variant that is a good compromise in terms of CPU and memory usage. In a hash join , a temporary hash table is created. A hash table is a kind of array that contains pointers to the rows of a table. A special function, called the hash function,  applied to the key associates each key to a position in the array.

我们已经看过合并联接(merge joins)是怎样运作的了。哈希联接(hash joins)是一个变体,是CPU和内存使用的方面很好的折中。在哈希联接里,一个临时的哈希表被建立。哈希表是一个数组,包含指向表中行的指针。一个特别的函数,称作哈希函数,应用于表中的每个键(主键),在数组中每个键占一个位置。

The smallest table is scanned first, and the hash table is populated with pointers to the rows that match each key.

最小的表首先被扫描,然后哈希表被填入相匹配的每个键的行指针。

Then the second table is scanned, only once, the function directs us to the proper slot in the hash table and from there you find the pointers to the matching rows.

然后,第二个表被扫描,只扫描一次,那个函数引导我们到哈希表中适合的位置,并且从那里,你找到匹配的行指针。

Contrarily to the nested loop join, indexes are no longer so important. An SQL optimizer may choose to scan an index rather than a table because indexes are usually smaller than tables, but this is a marginal improvement.

与嵌套循环联接相反,索引不再是非常重要的了。一个SQL优化器会选择扫描一个索引,而不是一个表,因为索引通常比表小,但是这只是一个微小的改进。

When the query isn’t madly complicated, when data statistics allow to get a fair idea about the expected size of the result set and if existing indexes allow a choice, the query optimizer can usually decide which type of join will be fastest.  If the query is really complex, the way you write it often sends signals to the optimizer.

当查询并不是很复杂,数据统计信息对预计的结果集大小有正确的认识,并且存在的索引允许选择,那么查询优化器通常可以决定哪种类型的联接将会更快。如果查询语句真的特别复杂,那么你构造语句的方法通常会影响优化器的选择。

 But that, as Kipling used to say, is another story. Thank you for listening.

不过,正如英国作家吉卜林说(译者注:英国作家,1865-1936,1907年获诺贝尔文学奖),是另一个故事。感谢您们的聆听。

作者:Stéphane Faroult
翻译:纪晓文,苗洁明
校对:Kamus(ACOUG),仇实

普人特福的博客cnzz&51la for wordpress,cnzz for wordpress,51la for wordpress