Category Archives: 翻译

Oracle architecture 翻译

此翻译由ioracle实验室完成,旨在方便大家学习。我们在翻译的过程中对原文中的一些语言进行了更加本土化的调整。如果英文较好,建议直接读英文文档或是忽略字幕。如有任何建议,请联系我。BTW 老史说的太快了,如果感情再投入点就好了

本篇内容简介:

1. 数据库开启过程中到底发生了什么(可先参阅站内文章:数据库启动

2. 客户机如何连接数据库

3. SGA/PGA

4. 一点点锁机制的内容

5. 更新、删除数据时数据库内部探究(涉及数据文件、数据高速缓存、重做日志缓存、撤销段、SCN、CKPT、LGWR等)

(以下是中英对照翻译,打包下载和视频版本随后奉上,敬请期待。文字版比视频版多一部分内容,请各位自行对照。以下文字版本皆为从Stéphane Faroult先生处获取的原始文档)

Hello,My name is Stéphane Faroult, I am a database performance consultant. The first Oracle database I ever installed was a beta version of Oracle 5, an awful number of years ago. I have published two books with O’Reilly, « The Art of SQL » and more recently « Refactoring SQL Applications ».

大家好,我是Stéphane Faroult,一名数据库性能咨询师.

我安装第一个Oracle数据库系统已经是很久之前的事情了,那时还是Oracle5测试版.

我通过O’Reilly出版公司出版过两本书,一本是《SQL语言艺术》,另一本是最近出版的《SQL应用重构》.

For me, there is a clear link between performances and the understanding of what happens inside the database. I’ve had the opportunity to witness that for many developers «the « persistence layer » is kind of cloudy. I’d like to take you behind the scene, and explain without getting too much into details how everything works.

在我看来,数据库性能和数据库内部构造是有必然的联系的。

我曾目睹过无数开发人员为因搞不懂持久层而内牛满面。接下来,我会在不过分涉及细节的情况下,带你看看数据库在后台如何工作,帮助你解开谜团。

I’ll start by explaining what happens when you start the database, something which is most often automatically performed by scripts when a machine boots. I’ll talk more precisely about the Unix implementation of Oracle, but it’s not massively different under Windows.

首先,我来解释下数据库启动过程中到底发生了什么,有一些东西会在这一过程中通过脚本自动执行。

我的Oracle数据库是在Unix下工作的,别担心,这与Windows下的没什么太大区别。

An administrator who wants to manually start a database (called a server by some other products) must have at least two important variables set in his environment or in the registry : a variable that says where the distribution of Oracle has been installed, called ORACLE_HOME, and a variable that identifies the database, called ORACLE_SID. You can have on the same machine several different databases running, that may be running under different versions of Oracle.

作为一名数据库管理员,如果你想手动启动数据库(有些产品中称之为服务),你必须在你的运行环境或是机器注册表中配置过以下两个重要的变量:ORACLE_HOME,用来说明oracle具体安装在哪;ORACLE_SID,用于标识你要运行的数据库。这样你就可以在一台机器上运行多个数据库了,这些数据库可以是不同的Oracle版本。

The administator will run the sqlplus program, and « connect » – actually the connection is dummy at this stage, it’s just a check that the program is invoked from a Unix or Windows account with the right privileges. You can then run the magical startup command.

数据库管理员可以运行sqlplus程序或者使用connect命令,注意这个时候的连接其实是伪连接(注:对于DBA的验证是系统验证),它仅仅是对程序是否被Unix或者windows中拥有正确权限用户调用的核查.登陆sqlplus后你才能使用startup命令开启数据库.

The program starts by reading a file called spfile<ORACLE_SID>.ora that contains the parameters for the database identified by ORACLE_SID. It’s a binary file, and you may hear about the init.ora file that is a text file that historically predates the binary file.

程序启动的时候会读取一个叫做spfile<ORALCE_SID>.ORA的文件,他包含了用于唯一识别各个数据库的ORALCE_SID,它是一个2进制文件。在它之前使用的是init.ora文本文件

With a binary file, parameters are modified using sqlplus, and many of them can by dynamically changed when the database is up and running.

如果使用spfile作为参数文件启动,那么其中的很多参数可以在数据库运行的过程中,通过sqlplus进行动态的修改。

Database parameters will give the size of a shared memory area that is allocated at this stage and which is called the System Global Area, or SGA.

在这个阶段,数据库参数会决定在系统全局区中(简称SGA)共享内存的大小.

This is where Oracle works.

数据库的多数工作在SGA中完成

A number of utility processes will also be started.

一些后台进程也在这一阶段开启.

The number increases with each version of Oracle, some of them are mandatory, others are optional.

它们的数量随着版本的升级而增加,其中一些是强制性运行的,而另一些则是可选的.

Under Windows, they aren’t processes but threads. All the processes are running the same oracle program and have the SGA in their address-space.

在WINDOWS下它们不是进程而是线程.所有的进程都会运行同一个oracle程序,它们在各自的地址空间中会有自己的SGA.

It sometimes look like they are consuming a lot of memory, but actually most of this memory is shared

从表面看起来这些进程会消耗很多的内存,而实际上这些内存绝大多数是共享使用的.

When memory is allocated and processes started, SQL*Plus displays the total SGA size as well as the size of four subparts, the fixed part, the variable part, the database buffers and the redo buffers.

当内存已被分配并且进程启动之后,SQL*Plus会显示SGA的总值和4个子区域的大小:固定区、可变区、数据库缓冲区和重做缓冲区

At this point, Oracle is going to open and read a file that is so critical that it’s always duplicated : the control file, found at a location that is specified in the parameter file.

在这时,ORALCE将会读取一个非常重要的文件,重要到它总是被拷贝以防丢失:它就是控制文件,他的位置在参数文件中被具体指明了.

This is where Oracle finds the names of all the data files, temporary files and log files that together make up the database.

在控制文件中可以找到所有数据文件、临时文件和日志文件的名称,它们3者便组成了数据库

The control file will also allow to check global consistency, whether something needs to be restored and whether the database was properly shutdown or if some unfinished transactions will have to be rolled back.

控制文件也会去检查全局的一致性,去判断是否有一些东西需要被恢复(译者注:通过对比控制文件与数据文件的SCN),数据库之前是否被正确的关闭或者是否之前存在一些尚未完成的事务需要回滚.

Without a control file, your database is dead

没有了控制文件,你的数据库就杯具了,如果你搞不定,就去蹲墙角,画圈圈去吧。

After passing this stage, SQL*Plus will display that the database is mounted, which mostly means that all files are identified. The only step that remains is the opening of the various files for read/write operations.

这个阶段以后,SQL*Plus会显示数据库已被加载,这意味着所有的文件都被确认了.最后一步就是打开文件进行读写操作。

When files are opened, then the database is open and users who aren’t database administrators can connect.

当文件打开之后数据库才处于真正的打开状态,非DBA的用户才可以连接.

That is, they can connect if they are logged onto the machine that hosts the database.

也就是说,非DBA用户只有先通过装有数据库的机器的主机验证后,才能连接数据库

If you want to enable users or an application server to connect from another machine, which is the most common case, you have to launch with the lsnrctl start command another program called tnslsnr, which is the “listener” that waits for incoming connections. After displaying a lot of intimate details about its configuration, found in the listener.ora file, the program runs in the background and everything is in place.

如果您想授权一个用户或者程序从另一个主机上连接,最常用的方法是你需要通过lsnrctl命令启动另一个叫做tnslsnr程序,它是一个’监听器’,用来监听各个连接请求.在屏幕上显示了listener.ora中一些许多关于配置的隐信息并提示命令运行成功之后,tnslsnr便在开始在后台执行工作。

Let’s now contemplate a client program that runs on another machine.

接下来让我们看看在客户端应该如何配置。

If it wants to access the database, it needs to provide, first of all, three pieces of information: the name of the host on which Oracle runs, the port on which tnslsnr is listening, and the service, that is in practice the identifier of the database you want to connect to.

如果客户端想去连接数据库,它就至少要首先提供3个信息:1.运行ORALCE的主机名称2.tnslsnr监听的端口3.服务,实际上就是你想要连接的数据库标识符

There are several ways to provide these pieces of information.

有以下几种方式可提供这些信息。

You can directly provide everything in the function that performs the connection request; this is for instance what you do with JDBC and a pure java driver.

对于使用JDBC和Java驱动的实例,可以在请求连接时直接提供以上信息。

If your client is using the Oracle client libraries, you can also provide an alias, and your client will transparently fetch the associated information from a file called tnsnames.ora.

如果是您使用ORALCE的客户端,你可以提供别名,然后您的客户端会从一个叫做tnsnames.ora的文件中读取连接信息.

These two methods are by far the most common ones, but there are other possibilities, such as using an LDAP directory or system utilities such as the Network Information Services as a substitute to the tnsnames.ora file.

这就是目前最常用的两种方法,当然还有其他一些可行方法,例如使用LDAP目录或者系统应用,例如使用网络信息服务(Network Information Services)代替tnsnames.ora文件.

Obviously Oracle will not give you uncontrolled access to the data.

显然,ORALCE不会允许毫无约束的访问数据.

You must sign-on, so that you are identified and the database clearly knows what you are allowed to do. Usually, you will provide a username and a password.

您必须“签名”,这样数据库才能确认您的身份,数据库也能清楚的知道您的权限有哪些.通常需要通过用户名和密码来验证

 Authentification through an LDAP directory is also sometimes possible.

有时LDAP的直接认证也行.

In practice, your client issues a connection request that is handled by the listener.

实际上,您的客户端申请的链接是受监听器的监管的.

The listener will either spawn a sub-process that will run the oracle code and become the client’s dedicated server, or redirect the connection to an already existing shared server, which is a much less common configuration.

监听器或者会产生一个子程序用来运行ORACLE代码并生成客户端的专用服务,或者将链接改变指向,使它重新指向一个已经存在的共享服务,但这种方法很少使用.

The server process is, so to speak, the client’s proxy on the Oracle side.

可以这么说,服务端进程可以视作客户端在ORALCE上的代理.

It can access the SGA, that is shared between all processes.

它可以访问SGA,这个SGA在所有程序之间被共享.

It will also allocate a private memory area called PGA, where it stores what is private to the process and to the session.

数据库也会为服务器进程分配私有的内存空间,这块空间叫做PGA(程序全局区),那里存放着私有的进程和会话信息.

But let’s take a closer look to the various parts in the SGA. There isn’t much to say about the fixed system area.

然后让我们再仔细看看各种在SGA上的区域吧.在这里不过多涉及的那些固定的SGA区.

But the variable area contains the shared pool, which holds compiled versions of queries, as well as various buffer pools that will contain for instance data dictionary information.

动态区包含共享池,共享池中包含编译过的查询语句,除此以外,动态区中还有各种缓冲池,缓冲池中含有实例的数据字典信息.

The data buffer contains data from the tables but also index structures and various work areas.

在数据缓冲区中存有的数据包括表和索引结构的还有各种工作区域里的。.

Everything is split in pages or blocks; the block is the Oracle basic storage unit, and a typical size for a block is 8K.

所有事物都被分成页或者块.块是ORACLE中的基本存储单元,每一个块的默认大小是8K.

Whenever you query or change the content of a table, you operate in memory on blocks that are in this cache.

无论何时查询表还是修改表,所做的操作都会被以块的形式记录在数据缓冲区中.

Finally, a fourth area is used to store information before it is written to log files to ensure transactional consistency even in the occurrence of a crash.

最后来说一下第四个区域,它是用来存储被写入到日志文件之前的信息(重做信息),以保证即使出现突发性的崩溃,数据库的工作也能持续进行。

The parts that really matter to understand how Oracle works are the shared pool, the data buffer and the log buffer.

理解共享池,数据缓冲区和重做日志缓冲区对理解数据库如何工作非常有帮助。.

This is where the dedicated server process will operate, as well as the main utility processes started with the database.

因为这些(原文中This is应为These are)就是专用服务进程进行操作的位置,也是主后台进程在数据库程序中启动的地方.

Let’s see what happens when the client issues an SQL query.

我们通过一条SQL查询语句的执行过程来看看数据库到底如何工作.

The query arrives as text to the server process that will start by computing a hash value that is used as a checksum.

查询命令通过文本的方式到达服务端进程,服务进程会通过计算hash值来校验.

Then the server will check whether this query has already been recently executed, and will inspect the shared pool.

然后服务端会检查这个查询指令是否在最近的一段时间里被执行过进而会查询共享池.

If the query is found, the analysis stops here, this phase is called soft-parsing and the server process can execute the query.

如果在共享池中发现了这条指令,分析器就此终结,然后由进程服务执行这条指令, 这个过程叫做软解析.

If a prior query with exactly the same text cannot be found, then we have hard-parsing that is CPU intensive and locks some shared resources.

如果之前未运行过相同的语句,我们便会使用硬解析.它使CPU负担加重并且锁住部分共享资源(注:通过锁机制).

SQL syntax must be analyzed, the objects that are referred to in the query must be identified, some synonyms may have to be translated, and the best execution plan must be determined among sometimes a very large number of possible execution plans.

在此期间,SQL的句法被分析,在SQL语句中使用到的对象被确定,一些同义词被翻译并且要在多个执行方案中筛选出最优方案。.

It may cause recursive queries on the data dictionary and input/output operations.

这可能会引起数据字典和I/O操作的重复递归查询

This is why you shouldn’t concatenate to the text of a query identifiers that change with every call, but you should pass them as parameters, as if the query were a function.

这就是为什么您不应该在每次调用发生改变的时候都连接到查询标识符文本,而应该把它们作为参数一样传送的原因,就好像它们是一个函数一样.

Once the query is analyzed and compiled, it is going to be loaded into the shared pool and executed.

一旦查询指令被分析和编辑完毕,它会被放入共享区域中并被执行.

 It will then reference pieces of data that may or may not be in the data buffer.

然后它会查找一些数据片,这些数据片可以存在于数据缓冲中也可以不在那里.

If data blocks aren’t in memory, the server process fetches them from the data files and loads them.

如果数据块不在内存中,服务器进程会把它们从数据文件中读取并进行载入内存.

Blocks that have been modified are written asynchronously to the datafile by one or several database writer processes.

被改变的块区域会被一个或者多个数据库写进程异步的写入数据文件.

When you update data, mechanisms for analyzing the statement and uploading data blocks into memory are the same ones.

当您更新数据时,在内存中存在的分析和更新数块的机制是同一个.

But before modifying data, initial values are copied to a work area called “undo segment”.

但是在修改数据之前,先前的数值会被拷贝到一个叫做“撤销段”的工作区域中. 

Another process that wants to read data being modified isn’t blocked, but reads data from the undo segment instead of reading from the table block. Initial and new value are also copied to the redo buffer; if the redo buffer fills up, the utility process LGWR, or log writer, dumps its content to a log file.

如果在修改过程中有另一个进程想读取被修改的这一块的数据,它会从’撤销段’中读取,而不是从数据在表中对应的位置读取(读取的值为先前值).修改前的数据和最新的数据都会被拷贝到重做缓冲区中,如果重做缓冲区被填满,LGWR后台进程或者日志写入器会将重做缓冲区中的数据写入重做日志文件.

When you commit the transaction, Oracle records that the initial value is no longer required for rolling back the change, it records that the change is validated and the LGWR process writes the transaction to the log file.

当您提交事务的时候,ORALCE记录的原数据不再需要被用来做回滚,数据文件记录已生效的改变并且LGWR进程将事务写入日志中(.

This time the operation is synchronous, and the server process waits for the go ahead before the call returns to the client.

这一次的操作是同步进行的,并且服务器进程在客户端向其发出调用之前一直处于等待状态的.

That means that if you commit every single update in a loop, you will waste a lot of time just waiting for acknowledgment.

这就意味着如果我们总是每更新一次数据便提交一次,会因为确认信息而浪费许多时间.

This is a very quick introduction, but I hope it will allow you to understand better what happens.

这虽是一次快速的介绍,但是我希望这会帮助您更好的理解数据库如何工作.

Thank you for your attention.

感谢您的观看.

作者:Stéphane Faroult
翻译:纪晓文,苗洁明
校对:仇实

Database Design 翻译

国内的图书往往通过各种范式、规则来讲解数据库设计。Stéphane Faroult先生则是通过一个关于电影的例子,生动形象的说明了数据库设计过程中遇到的问题和解决方法。视频分为上下两集,时间大约11分钟。相信看完这段视频和文字之后,会对你有所启发。

Hello; my name is Stéphane Faroult. I’m a French database consultant and a database book author.
大家好,我是StéphaneFaroult,是一名法国的数据库顾问,同时还是《SQL语言艺术》这本书的作者。
I’d like to present a brief introduction to database design by showing you on a small example some issues you can encounter and some ways to solve them.
我希望通过视频中的一个小例子,及在这个例子中遇到的问题和解决方法,来向大家简要的介绍下数据库设计。
For my example, I’d like to create a website that provides information about movies.
这个例子是:建立一个提供电影信息的网站
You are probably aware that in a database, data is stored in tables.
你可能已经知道,在数据库中数据是存储于表中的。
We’ll obviously have some kind of MOVIES table inside our database.
显然我们的数据库中会有各类的MOVIES表。
Tables in a database are very different from spreadsheets, but let’s start with a spreadsheet that contains information about a small number of classic movies: title, country, year, director, and leading actors.
数据库中的表和电子表格是很不同的,但是不妨让我们先创建一些电子表格,表格中包含一部分经典电影的信息,如影片名,国家,上映年份,导演,主演。
Initially we shall model our main table after the spreadsheet.
最初,我们的主表需要模仿电子表格。
Each row brings together a number of pieces of information that are the characteristics of a movie.
每一行整合了很多关于电影属性的信息。
The order of columns doesn’t matter.
列的顺序无关紧要。
The order of rows doesn’t matter.
行的顺序也不重要。
What matters is that the different pieces of information on one row are related, and the titles of the columns define the relation. This is why we talk about relational databases.
重要的是,一行中不同的信息是相关的,列的标题就是在定义它们之间的关系。这就是我们要讨论关系型数据库的原因。
It makes no sense to have the same row several times. An important step, when designing a database, consists in identifying the characteristics that allow us to distinguish one row from another; and very often these characteristics are a subset of all the characteristics. They are collectively called “the key”. What is the smallest subset of characteristics that uniquely identifies a movie?
拥有相同行是毫无意义的。设计数据库时很重要的一步,正是标识出那些能够帮助我们区分不同行的属性,通常这些属性是所有属性的一个子集。它们被统称为“键”。而什么是唯一标识一部电影属性的最小子集呢?
The title? No, the title isn’t enough because you have remakes.
电影名称?不,只有名称是不够的,因为一部电影可以重拍多次。
For instance, there are several movies based on the “Treasure Island” novel.
比如说,有很多部电影都是改编自《金银岛》(”Threasure Island”)这部小说。
Title and director perhaps?
电影名称和导演?
It doesn’t work either, because Alfred Hitchcock for instance made two “The man who knew too much”, one in 1934 and the other one in 1956.
这同样不管用,比如阿尔弗莱德?希区柯克(Alfred Hithchcock)导演了两部《擒凶记》(“The man who knew  too much”),一部在1934年,另一部在1956年。
Title, director and year may seem a better choice – except that “director” isn’t as simple a characteristic as it may look;
电影名称,导演,上映时间似乎是一个好一点的选择了,除非“导演”这一项比较非典型。
Some movies have several directors, and it may lead to ambiguities.
比如一些电影有好几位导演,这就可能会导致表达不明确。
Title, country and year look reasonable. It is very unlikely, not least for commercial reasons, that two movies with exactly the same title will be produced the same year in the same country. It is sometimes possible to find several groups of characteristics that uniquely identify a row in table; you are free to choose the group that you find the most convenient as the main identifier, and this group of characteristics is called the “primary key”.
电影名称,国家、上映看起来比较合理。至少从商业理由而言,两个完全同名的电影在同一国家,同一年发行,是非常不可能的。有时,可能找到一些属性组去唯一标识表中的一行;你可以自由的选择你认为最方便的组作为主要标识符,这一组属性被称为“主键”。
The ultimate goal of all “data bases” is to be queried. That means that people will search the table, using some interface, to retrieve movies that satisfy some particular conditions.
数据库的最终目标是用于查询。这意味着人们将通过一些界面来搜索表,检索出符合特定条件的电影。
Behind the scene, the program will issue an SQL query in which you name the characteristics you want
后台进程中,程序将根据你键入的属性发出一个SQL查询
Followed by the name of the table
接着是表名,
Followed by filtering conditions
再接下来是设置过滤条件。
And here we have several problems. If we want to retrieve all the movies that were directed by one given director, it is essential that the condition exactly matches the contents of the table – that is that if the row has for “Director” attribute “Hitchcock, A.”, the filtering condition is exactly entered as “Hitchcock, A.”, not as r “Hitchcock”), and definitely not as “Alfred Hitchcock”.
在这里我们有一些问题,当我们想要检索出同一位导演所执导的所有影片,就必须要准确输入与表中内容一致的条件,就像是如果这一行中“导演”的属性是“Hitchcock,A”,那么筛选的信息必须按照“Hitchcock,A”这样输入,而不可以是“Hitchcook”,更不能是“Alfred Hitchcock”(这几种是阿尔弗莱德?希区柯克英文名的不同书写形式)。
There is also a consistency issue with data that will probably be entered by different people at different times and must respect the same conventions: if you say that “North by Northwest” and “Rear Window” are “Hitchcock, A.” movieswhile “Strangers on a Train” is an “Alfred Hitchcock” movie, users who query the database will have much trouble getting the full list of Hitchcock movies.
不同时间向数据库中键入数据的人也要遵守数据一致性的约定:如果你说《西北偏北》(“North by Northwest”),《后窗》(“ Rear Window”)是导演Hitchcock, A.(阿尔弗雷德?希区柯克)的电影作品,而《火车怪客》(“ Strangers on a Train”)是导演Alfred Hitchcock(阿尔弗雷德?希区柯克英文名字的另一种格式)的另一部电影。这样用户在查询导演阿尔弗雷德?希区柯克电影作品的详细列表时就会遇到麻烦。
As a general rule, character fields that are free-form text call for trouble when they are used for searches. This is true for directors, and this is true as well for the leading stars. We must therefore define some kind of standard for storing names; storing firstname and surname separately, in uppercase perhaps. Standardization is a first step in the right direction. But we still have a few numbers of problems to solve. The first one is typos. A user who mistypes a name may realize the mistake, check the spelling, and try again. But if there are some misspellings in the database, information that is misspelt will only be found by users doing the same mistake when they query; not a desirable situation.
一般来说,没有规定格式的文本在查询时会引起麻烦。对于导演一栏是这样,对于主演一栏也是如此。因此我们必须定义一些用于存储名字信息的标准,例如姓氏与名字分开存储,或者使用大写。标准化是向正确方向迈进的第一步。但我们仍然有一些问题需要解决。第一个就是拼写错误。用户查询信息,输错一个名字时可能会意识到这个错误,检查拼写,并且再试一次。但是当数据库中存储的信息本身存在一些拼写错误时,这些拼错的信息只能在用户查询并出现同样的错误时才能被查询出来,这很杯具。
Let’s ditch this model
让我们抛弃这种数据模型
And let’s try a new one. To ensure that names are correctly entered, the best solution is to enter them only once (and if they are misspelt, we’ll have only one place to correct). This can be performed by a very convenient feature of the SQL language, which is the ability to perform joins.
我们来试试另一个。确保名字准确输入的最好方法就是只输入一次(当它们被输错了,我们将只有一个地方需要去修改)。这可以通过SQL语言中一个十分便捷的方法来完成——执行连接(join)。
We have been dealing so far with a single table. But we may have several tables, with some common columns, and SQL allows to associate data from one table to data from another table by using the common columns to establish the link – this operation is called a join.
迄今为止我们一直在处理单个表。但是我们可能有一些表,这些表中有部分列相同,SQL中允许数据通过共同的列从一个表到另一个关联表建立连接,这种操作就称为连接(join)。
Let’s see how it works. Suppose that we create a table of directors. In this table, we shall store first name, surname and perhaps some biographical information.
我们来看看它是如何工作的。假设我们创建了一个关于导演的表。在这个表中,我们将存储导演的姓氏、名字,并且或许还可以加上其他一些个人信息。
The key will be firstname and surname; however, since we want to link movies and directors and since we want to avoid misspellings, we are going to use a system-generated numerical identifier that will be a synonym for the key and that we shall call ID. This will be, technically speaking, a second key and we can make ID the primary key of the DIRECTORS table while asking the DBMS to ensure that firstname and surname remain unique across the table.
键是姓氏和名字,然而,为避免在连接MOVIES表和DIRECTORS表时出现拼写错误,我们用系统生成的数字标识符代替姓名键,我们称它为ID。ID也是表的键,我们可以把它设为DIRECTORS表的主键同时可以利用数据库管理系统确保表中的名字和姓氏是唯一存在的。
Inside the MOVIES table we shall replace the identification of the director by the id; of course, we’ll do something similar for actors. When a user wants to retrieve the titles and year of the movies directed by someone whose only surname has been provided, the system will issue an SQL query that will look like this one.
在MOVIES表里我们将用ID替换导演列的数据,当然,我们也将对演员作相应的更改。当用户想要检索由一位只提供了姓的导演所执导的电影的片名及年份,系统将发出一个SQL查询。如演示。
However, we have a problem with the infrequent movies that have more than one director.
然而,我们又遇到一个问题,个别电影会有多个导演
Should we reserve two columns for storing director ids? That would mean joining on two columns, which can be done but isn’t most convenient.
我们应该为存储导演ID号保留两列吗?这意味着在两个表建立连接,这种方法可行但却不便捷。
But two isn’t enough. Some movies have three directors.
而且有时两列并不够。一些电影有三个导演。
Some even have five.
有些电影甚至有五个导演。
We have the same problem, even more acute, with actors. How many actors should we record? The three leading actors?
在记录演员方面,我们有相同甚至更加严重的问题。我们应该记录多少演员?三个主演?
What about a star-studded movie such as “The Longest Day”?
如果是像《最长的一日》(The Longest Day)这样众星云集的电影呢?
What if one day we decide to improve the database and record more actors, and perhaps the full cast? If we want to add more actor identifiers to the table we shall have to review all the existing queries to add “OR” conditions.
如果有一天,我们决定改进数据库并记录更多的演员,甚至是记录全部演员,应该怎么办?如果我们向表中添加更多的演员标识符,我们就必须回顾现存的所有查询并加入“或”(OR)条件。
Storing director and actor identifiers inside the movies table will make the evolution of the application very painful.
将导演及演员标识符存储在MOVIE表中会使更改应用变得十分困难。
This isn’t therefore a good design
因此,这并不是一个好的设计。
And we must try harder.
并且我们需要拥有一个更好的设计。
Instead of having one column per director or actor in the movies table, a DBMS allows us to have an intermediate table that, through two joins, will allow us to relate one director to several movies, or several directors to one movie if need be. We shall call this table directed_by and it implements a link between the table of movies and the table of directors.
与director和actor各占一列这种模式不同,数据库管理系统允许我们有一个中间表,通过两个JOIN,我们可以将一个导演和多个影片或者将多个导演和一个影片相关联起来。我们称这个表为directed_by,它实现了MOVIES表和DIRECTORS表的连接。
The director and the leading actors will no longer be an attribute of the movie.
而导演以及主演将不再是MOVIES表中的属性。
Since we must reference movies in another table, the real life key may be a bit difficult to work with. And in the same way as for directors and actors, we are going to ask the system to generate for us a numerical identifier movie id that will be used as a shorter alias for the title, year and country and that we’ll use to refer to one particular movie.
因为我们必须在其他表中使用MOVIES表中内容,用现实中属性做键可能有些难处理。DIRECTORS表和ACTORS表也是相同的道理,这时我们就要利用系统自动生成的数字标识符作为列title,year,country的别名,以此来查询一个特定的电影
We have the movies, we have the directors, and a new table called directed_by will record the fact that one movie was directed by one director.  The unique combination of a director id and a movie id is the key of the new table.
我们有MOVIES表,DIRECTORS表,还有一个新表directed_by,用来记录电影与导演的对应关系。一个director id 和movie id 的特定组合就是这个新表中的键。
The SQL query to retrieve the information will be hardly more complicated.
这样检索信息的SQL查询就不会那么复杂了
In the same way, we can also have a table called PLAYED_IN that links the table that records actors to the movies table.
用相同的方法,我们也可以创建一个新表PLAYED_IN,用来将关联表Actors与表Movies
Now, we still have a problem. Orson Welles directed and played in Citizen Kane, and actors turned directors, or directors who occasionally or often play in their own movies are rather common.
现在,我们仍旧有一个问题。好比奥逊?威尔斯执导并出演了《公民凯恩》,一些演员同时又是导演,一些导演也会时不时的出现在他们自己的影片中,这种现象非常普遍。
If we store separately directors and actors, some information will be duplicated, and if we update one table and forget the other one we shall have inconsistencies.
如果我们分开存放directors 和actors信息,一些信息就将出现重复,而如果我们更新其中一个表而不更新另一个,就会导致信息不一致。
What we really want is to merge the two tables and keep all the information about one person at a single location.
我们真正需要的是合并两个表并确保某一个人的信息来源于同一数据表。
Let’s ditch once again this model, and try a last one.
让我们把这种模型也抛弃,继续尝试最后一种模型。
We could still keep played_in and directed_by, and make them link the MOVIES table to a PEOPLE table.
我们仍然可以保留中间表playes_in 和directed_by ,并且通过它们将表MOVIES和PEOPLES相关联。
Another solution is to merge played_in and directed_by into a MOVIE_CREDITS table that would hold a movie id, a person id, and another attribute to tell in which capacity a person participated to the movie: actor, or director. This last design is probably better if we plan to also record one day who composed the music, who was the photography director, who designed the set and so on … which would otherwise require one additional table per type of involvement.
另一种方法是把表played_in和directed_by合并为表MOVIE_CREDIT,表中包含列movie id ,person_id 还有一些另外的属性来说明某个人是以何种身份参与到电影中:导演(D)或者演员(A)。最后一个设计或许会更好一些如果我们计划也将有一天的记录是谁谱的曲子,谁是摄影导演,谁设计的舞台等等,否则以上每种分类将会需要一个额外的表。
A good design guarantees data integrity and allows evolution.
一个好的设计既能保证信息的完整性又允许对数据库的进行升级改善。
This is a very simple example, but I hope that it will be enough to convince you of the importance of database design.
这是一个很简单的例子,但是我希望它足以使你了解数据库设计的重要意义。
Thank you for your attention.
感谢观看。

作者:Stéphane Faroult
翻译:王瑞莹,郑冉
校对:Kamus(ACOUG),仇实

SQL Joins 翻译

Stéphane Faroult先生在SQL Joins里简单介绍了SQL各种Join的内涵,运用了一个形象的舞池比喻。

以下分别是中文字幕视频和翻译文档的中英文对照。

Hello. I’m Stéphane Faroult, a French database consultant, and I’ve designed this short video note about joins.

大家好,我是Stéphane Faroult,一个法国的数据库顾问,我设计了这个关于联接(join)的简短的视频笔记(希望能帮助你了解join的工作原理)。

You probably know what a join is: in a relational database, when two tables have one or several common columns, you can link them by saying that a row in one table matches a row from the other table that has the same values in the common columns.

也许你已经知道联接(join)是什么:

在关系型数据库中,如果两个表有一个或多个公共列,我们可以连接两表中对应列里列值相同的行

For instance, you can store in your database the registered members of a web site; they are identified by their screen name, and a numerical id is used as a convenient reference. Your table probably contains more information, such as the date when members joined, and perhaps a number of points acquired by providing useful answers in a forum.

例如,您可以在您的数据库中储存一个网站的注册会员,他们由网名标识,并且用一个数字的id方便的作为参考。你的表中也许包含着更多的信息,例如会员的注册日期,以及在论坛中回帖所得的点数。

Another table will contain the messages posted by members on the forum. I have just shown here the member id and the subject of the post, you would obviously also have a date, the message body and possibly some other attributes.

另一个表将会包含论坛会员发帖的信息。我在这只显示了会员ID和帖子主题,在真实生活中,这个表中或许还应包含发帖时间,帖子内容和帖子的一些其他属性。

If we want to list the messages, most recent first, we also want to display who posted the message, as well as the member’s points. We will therefore write a query such as this one, in which the ‘memberid’ column in the messages table refers to the ‘memberid’ column in the members table and is used to perform the join.

如果我们想列出这些信息,最新的优先,同时也想要显示发帖人,以及他的点数。我们需要写一个像这样的查询,其中messages表的memberid列参考members表中的memberid列,用来完成联接。

When you are taught SQL, you are often explained that the operation consists in generating all possible combinations between the rows from the members table and the rows from the messages table, to only retain the combinations for which the common column contains the same value.

当你学习SQL时,你学习到有一个操作在于生成在members表和messages表的行和行之间所有可能的组合,只保留包含相同值的共同的列的组合。

This operation is called a cartesian join, after René Descartes, one of the major European philosophers and mathematicians of the 17th century. Although a database system sometimes executes cartesian joins, the most common ways to perform a join are nested loops, and hash or merge joins.

这个操作就叫做笛卡尔联接,是根据17世纪欧洲主要哲学家和数学家勒内·笛卡尔命名的。尽管数据库系统有时执行笛卡尔联接,但是执行联接的最常用方法是嵌套循环联接(nested loops)和哈希(hash)联接或合并(merge)联接。

To explain the difference between those two categories of joins, Tom Kyte, who is senior technologist with Oracle, once used an excellent analogy on his popular “Ask Tom” site, and I’ve tried to illustrate Tom’s  dance floor analogy.

为了解释一下那两种联接的区别, Oracle的高级技术人员Tom Kyte,曾经在他的Ask Tom网站上使用了一个杰出的类比,下面我将尝试阐述Tom的舞池类比。

Here is how it goes. Suppose that you have a few guys, and a few girls. Each guy is looking for a partner of about the same size as himself. The first guy moves forward, and is strongly attracted by a girl who looks like a model. Unfortunately, models are tall, and our guy is rather short. A second girl quite lovely, but a tad too tall for him. Finally,he finds a suitable partner who has both the right size and a nice smile, and invites her. If each guy repeats the same process, we have in effect the same algorithm as with a nested loop.

这是它的内容。假设有几个男士和几个女士,每个男士在寻找一个和他身高差不多的舞伴。第一个男士前进一步,并且被一个看起来像模特的女士强烈的吸引住了,不幸的是,模特很高,而那个男士相当的矮;第二个女孩很可爱,但是还是比他高一点;最后,他找到了一个合适的舞伴,她有着和他相仿的身高和甜美的微笑,男士邀请了她。如果每个男士重复这个同样的步骤,我们就有与嵌套循环联接(nested loop)实质相同的算法。

Now suppose that we have a lot of girls, and a lot of guys. If we take the same Goldilock approach to flirting as before, it is likely that dawn will be there before all couples are formed. Therefore, we are going to trade romanticism against efficiency, and ask all girls and all boys to get in rank by order of size, so that we can easily match them. This would be an overkill with very few individuals. But with a large number of people, this is probably the best way to proceed, and it is exactly the way a merge join works.

现在假定有很多的女士和很多的男士。如果我们还像之前那样接近并与一个金发女郎调情,很可能要等到黎明时才能使所有的人都找到舞伴。因此,我们将放弃浪漫而追求效率,让所有的女士和所有的男士都按身高排队,然后他们可以简单的进行配对。如果人数少的话,用这种方法就像是“用大炮打蚊子”,小题大做了。但是如果有一大堆人的话,这可能就是最好的方法了,并且它恰好和合并联接(merge joins)是一样的。

In the dry world of databases, to perform a nested loop an SQL engine will scan the smallest of the two tables, then find in turn the matching rows in the other table.

在数据库的世界里,要执行一个嵌套循环联接(nested loop),SQL引擎将扫描两个表中最小的那个表,然后依次匹配另一个表中的行。

It is very important that the column used for the join is indexed in the bigger table, because otherwise you would have to scan it each time you consider a new row in the smaller table.

如果在一个很大的表中,给联接要用到的列建立索引是很有必要的,否则每次根据一个小表中的行来寻找大表中相同的行,都要对大表进行完全扫描。

We have seen how merge joins work. Hash joins are a variant that is a good compromise in terms of CPU and memory usage. In a hash join , a temporary hash table is created. A hash table is a kind of array that contains pointers to the rows of a table. A special function, called the hash function,  applied to the key associates each key to a position in the array.

我们已经看过合并联接(merge joins)是怎样运作的了。哈希联接(hash joins)是一个变体,是CPU和内存使用的方面很好的折中。在哈希联接里,一个临时的哈希表被建立。哈希表是一个数组,包含指向表中行的指针。一个特别的函数,称作哈希函数,应用于表中的每个键(主键),在数组中每个键占一个位置。

The smallest table is scanned first, and the hash table is populated with pointers to the rows that match each key.

最小的表首先被扫描,然后哈希表被填入相匹配的每个键的行指针。

Then the second table is scanned, only once, the function directs us to the proper slot in the hash table and from there you find the pointers to the matching rows.

然后,第二个表被扫描,只扫描一次,那个函数引导我们到哈希表中适合的位置,并且从那里,你找到匹配的行指针。

Contrarily to the nested loop join, indexes are no longer so important. An SQL optimizer may choose to scan an index rather than a table because indexes are usually smaller than tables, but this is a marginal improvement.

与嵌套循环联接相反,索引不再是非常重要的了。一个SQL优化器会选择扫描一个索引,而不是一个表,因为索引通常比表小,但是这只是一个微小的改进。

When the query isn’t madly complicated, when data statistics allow to get a fair idea about the expected size of the result set and if existing indexes allow a choice, the query optimizer can usually decide which type of join will be fastest.  If the query is really complex, the way you write it often sends signals to the optimizer.

当查询并不是很复杂,数据统计信息对预计的结果集大小有正确的认识,并且存在的索引允许选择,那么查询优化器通常可以决定哪种类型的联接将会更快。如果查询语句真的特别复杂,那么你构造语句的方法通常会影响优化器的选择。

 But that, as Kipling used to say, is another story. Thank you for listening.

不过,正如英国作家吉卜林说(译者注:英国作家,1865-1936,1907年获诺贝尔文学奖),是另一个故事。感谢您们的聆听。

作者:Stéphane Faroult
翻译:纪晓文,苗洁明
校对:Kamus(ACOUG),仇实

翻译Stéphane Faroult的视频

Stéphane Faroult是Oracle技术畅销书籍《The Art of SQL》的作者,这本书在国内的译本是《SQL语言艺术》

Stéphane Faroult经营着RoughSea公司,从事数据库咨询业务,致力于帮助客户从数据库投资中获得最佳性能。他的SQL经验开始于1983年。Oracle法国的第一个性能及调优课程就是他1987年编写的。

他同时也维护Konagora网站,这是一个致力于网上教学的站点,在该站点中可以找到他本人录制的很多关于Oracle数据库知识的视频教学短片,同时这些短片也可以在Youtube中找到。

Stéphane的短片有别于你们以前见到的任何视频教学片,相比起来,那些以前的片子是如此枯燥而刻板。而Stephane的短片给我们的感觉是简洁、直接、形象、印象深刻

我们实验室与ACOUG合作,在他们的介绍下与Stéphane Faroult大师接触,并翻译他的视频。
视频都是实验室成员利用假期时间翻译的,并为其加上了中文字幕,将来打算放出中英文字幕的视频。

中文字幕视频链接地址:

SQL Joins中文字幕版
DB Design 1/2中文字幕版
DB Design 2/2中文字幕版

原版视频链接地址:

SQL Joins原版
DB Design 1/2原版
DB Design 2/2原版

翻译文档请见[翻译]目录

普人特福的博客cnzz&51la for wordpress,cnzz for wordpress,51la for wordpress