Database Design 翻译

国内的图书往往通过各种范式、规则来讲解数据库设计。Stéphane Faroult先生则是通过一个关于电影的例子,生动形象的说明了数据库设计过程中遇到的问题和解决方法。视频分为上下两集,时间大约11分钟。相信看完这段视频和文字之后,会对你有所启发。

Hello; my name is Stéphane Faroult. I’m a French database consultant and a database book author.
大家好,我是StéphaneFaroult,是一名法国的数据库顾问,同时还是《SQL语言艺术》这本书的作者。
I’d like to present a brief introduction to database design by showing you on a small example some issues you can encounter and some ways to solve them.
我希望通过视频中的一个小例子,及在这个例子中遇到的问题和解决方法,来向大家简要的介绍下数据库设计。
For my example, I’d like to create a website that provides information about movies.
这个例子是:建立一个提供电影信息的网站
You are probably aware that in a database, data is stored in tables.
你可能已经知道,在数据库中数据是存储于表中的。
We’ll obviously have some kind of MOVIES table inside our database.
显然我们的数据库中会有各类的MOVIES表。
Tables in a database are very different from spreadsheets, but let’s start with a spreadsheet that contains information about a small number of classic movies: title, country, year, director, and leading actors.
数据库中的表和电子表格是很不同的,但是不妨让我们先创建一些电子表格,表格中包含一部分经典电影的信息,如影片名,国家,上映年份,导演,主演。
Initially we shall model our main table after the spreadsheet.
最初,我们的主表需要模仿电子表格。
Each row brings together a number of pieces of information that are the characteristics of a movie.
每一行整合了很多关于电影属性的信息。
The order of columns doesn’t matter.
列的顺序无关紧要。
The order of rows doesn’t matter.
行的顺序也不重要。
What matters is that the different pieces of information on one row are related, and the titles of the columns define the relation. This is why we talk about relational databases.
重要的是,一行中不同的信息是相关的,列的标题就是在定义它们之间的关系。这就是我们要讨论关系型数据库的原因。
It makes no sense to have the same row several times. An important step, when designing a database, consists in identifying the characteristics that allow us to distinguish one row from another; and very often these characteristics are a subset of all the characteristics. They are collectively called “the key”. What is the smallest subset of characteristics that uniquely identifies a movie?
拥有相同行是毫无意义的。设计数据库时很重要的一步,正是标识出那些能够帮助我们区分不同行的属性,通常这些属性是所有属性的一个子集。它们被统称为“键”。而什么是唯一标识一部电影属性的最小子集呢?
The title? No, the title isn’t enough because you have remakes.
电影名称?不,只有名称是不够的,因为一部电影可以重拍多次。
For instance, there are several movies based on the “Treasure Island” novel.
比如说,有很多部电影都是改编自《金银岛》(”Threasure Island”)这部小说。
Title and director perhaps?
电影名称和导演?
It doesn’t work either, because Alfred Hitchcock for instance made two “The man who knew too much”, one in 1934 and the other one in 1956.
这同样不管用,比如阿尔弗莱德?希区柯克(Alfred Hithchcock)导演了两部《擒凶记》(“The man who knew  too much”),一部在1934年,另一部在1956年。
Title, director and year may seem a better choice – except that “director” isn’t as simple a characteristic as it may look;
电影名称,导演,上映时间似乎是一个好一点的选择了,除非“导演”这一项比较非典型。
Some movies have several directors, and it may lead to ambiguities.
比如一些电影有好几位导演,这就可能会导致表达不明确。
Title, country and year look reasonable. It is very unlikely, not least for commercial reasons, that two movies with exactly the same title will be produced the same year in the same country. It is sometimes possible to find several groups of characteristics that uniquely identify a row in table; you are free to choose the group that you find the most convenient as the main identifier, and this group of characteristics is called the “primary key”.
电影名称,国家、上映看起来比较合理。至少从商业理由而言,两个完全同名的电影在同一国家,同一年发行,是非常不可能的。有时,可能找到一些属性组去唯一标识表中的一行;你可以自由的选择你认为最方便的组作为主要标识符,这一组属性被称为“主键”。
The ultimate goal of all “data bases” is to be queried. That means that people will search the table, using some interface, to retrieve movies that satisfy some particular conditions.
数据库的最终目标是用于查询。这意味着人们将通过一些界面来搜索表,检索出符合特定条件的电影。
Behind the scene, the program will issue an SQL query in which you name the characteristics you want
后台进程中,程序将根据你键入的属性发出一个SQL查询
Followed by the name of the table
接着是表名,
Followed by filtering conditions
再接下来是设置过滤条件。
And here we have several problems. If we want to retrieve all the movies that were directed by one given director, it is essential that the condition exactly matches the contents of the table – that is that if the row has for “Director” attribute “Hitchcock, A.”, the filtering condition is exactly entered as “Hitchcock, A.”, not as r “Hitchcock”), and definitely not as “Alfred Hitchcock”.
在这里我们有一些问题,当我们想要检索出同一位导演所执导的所有影片,就必须要准确输入与表中内容一致的条件,就像是如果这一行中“导演”的属性是“Hitchcock,A”,那么筛选的信息必须按照“Hitchcock,A”这样输入,而不可以是“Hitchcook”,更不能是“Alfred Hitchcock”(这几种是阿尔弗莱德?希区柯克英文名的不同书写形式)。
There is also a consistency issue with data that will probably be entered by different people at different times and must respect the same conventions: if you say that “North by Northwest” and “Rear Window” are “Hitchcock, A.” movieswhile “Strangers on a Train” is an “Alfred Hitchcock” movie, users who query the database will have much trouble getting the full list of Hitchcock movies.
不同时间向数据库中键入数据的人也要遵守数据一致性的约定:如果你说《西北偏北》(“North by Northwest”),《后窗》(“ Rear Window”)是导演Hitchcock, A.(阿尔弗雷德?希区柯克)的电影作品,而《火车怪客》(“ Strangers on a Train”)是导演Alfred Hitchcock(阿尔弗雷德?希区柯克英文名字的另一种格式)的另一部电影。这样用户在查询导演阿尔弗雷德?希区柯克电影作品的详细列表时就会遇到麻烦。
As a general rule, character fields that are free-form text call for trouble when they are used for searches. This is true for directors, and this is true as well for the leading stars. We must therefore define some kind of standard for storing names; storing firstname and surname separately, in uppercase perhaps. Standardization is a first step in the right direction. But we still have a few numbers of problems to solve. The first one is typos. A user who mistypes a name may realize the mistake, check the spelling, and try again. But if there are some misspellings in the database, information that is misspelt will only be found by users doing the same mistake when they query; not a desirable situation.
一般来说,没有规定格式的文本在查询时会引起麻烦。对于导演一栏是这样,对于主演一栏也是如此。因此我们必须定义一些用于存储名字信息的标准,例如姓氏与名字分开存储,或者使用大写。标准化是向正确方向迈进的第一步。但我们仍然有一些问题需要解决。第一个就是拼写错误。用户查询信息,输错一个名字时可能会意识到这个错误,检查拼写,并且再试一次。但是当数据库中存储的信息本身存在一些拼写错误时,这些拼错的信息只能在用户查询并出现同样的错误时才能被查询出来,这很杯具。
Let’s ditch this model
让我们抛弃这种数据模型
And let’s try a new one. To ensure that names are correctly entered, the best solution is to enter them only once (and if they are misspelt, we’ll have only one place to correct). This can be performed by a very convenient feature of the SQL language, which is the ability to perform joins.
我们来试试另一个。确保名字准确输入的最好方法就是只输入一次(当它们被输错了,我们将只有一个地方需要去修改)。这可以通过SQL语言中一个十分便捷的方法来完成——执行连接(join)。
We have been dealing so far with a single table. But we may have several tables, with some common columns, and SQL allows to associate data from one table to data from another table by using the common columns to establish the link – this operation is called a join.
迄今为止我们一直在处理单个表。但是我们可能有一些表,这些表中有部分列相同,SQL中允许数据通过共同的列从一个表到另一个关联表建立连接,这种操作就称为连接(join)。
Let’s see how it works. Suppose that we create a table of directors. In this table, we shall store first name, surname and perhaps some biographical information.
我们来看看它是如何工作的。假设我们创建了一个关于导演的表。在这个表中,我们将存储导演的姓氏、名字,并且或许还可以加上其他一些个人信息。
The key will be firstname and surname; however, since we want to link movies and directors and since we want to avoid misspellings, we are going to use a system-generated numerical identifier that will be a synonym for the key and that we shall call ID. This will be, technically speaking, a second key and we can make ID the primary key of the DIRECTORS table while asking the DBMS to ensure that firstname and surname remain unique across the table.
键是姓氏和名字,然而,为避免在连接MOVIES表和DIRECTORS表时出现拼写错误,我们用系统生成的数字标识符代替姓名键,我们称它为ID。ID也是表的键,我们可以把它设为DIRECTORS表的主键同时可以利用数据库管理系统确保表中的名字和姓氏是唯一存在的。
Inside the MOVIES table we shall replace the identification of the director by the id; of course, we’ll do something similar for actors. When a user wants to retrieve the titles and year of the movies directed by someone whose only surname has been provided, the system will issue an SQL query that will look like this one.
在MOVIES表里我们将用ID替换导演列的数据,当然,我们也将对演员作相应的更改。当用户想要检索由一位只提供了姓的导演所执导的电影的片名及年份,系统将发出一个SQL查询。如演示。
However, we have a problem with the infrequent movies that have more than one director.
然而,我们又遇到一个问题,个别电影会有多个导演
Should we reserve two columns for storing director ids? That would mean joining on two columns, which can be done but isn’t most convenient.
我们应该为存储导演ID号保留两列吗?这意味着在两个表建立连接,这种方法可行但却不便捷。
But two isn’t enough. Some movies have three directors.
而且有时两列并不够。一些电影有三个导演。
Some even have five.
有些电影甚至有五个导演。
We have the same problem, even more acute, with actors. How many actors should we record? The three leading actors?
在记录演员方面,我们有相同甚至更加严重的问题。我们应该记录多少演员?三个主演?
What about a star-studded movie such as “The Longest Day”?
如果是像《最长的一日》(The Longest Day)这样众星云集的电影呢?
What if one day we decide to improve the database and record more actors, and perhaps the full cast? If we want to add more actor identifiers to the table we shall have to review all the existing queries to add “OR” conditions.
如果有一天,我们决定改进数据库并记录更多的演员,甚至是记录全部演员,应该怎么办?如果我们向表中添加更多的演员标识符,我们就必须回顾现存的所有查询并加入“或”(OR)条件。
Storing director and actor identifiers inside the movies table will make the evolution of the application very painful.
将导演及演员标识符存储在MOVIE表中会使更改应用变得十分困难。
This isn’t therefore a good design
因此,这并不是一个好的设计。
And we must try harder.
并且我们需要拥有一个更好的设计。
Instead of having one column per director or actor in the movies table, a DBMS allows us to have an intermediate table that, through two joins, will allow us to relate one director to several movies, or several directors to one movie if need be. We shall call this table directed_by and it implements a link between the table of movies and the table of directors.
与director和actor各占一列这种模式不同,数据库管理系统允许我们有一个中间表,通过两个JOIN,我们可以将一个导演和多个影片或者将多个导演和一个影片相关联起来。我们称这个表为directed_by,它实现了MOVIES表和DIRECTORS表的连接。
The director and the leading actors will no longer be an attribute of the movie.
而导演以及主演将不再是MOVIES表中的属性。
Since we must reference movies in another table, the real life key may be a bit difficult to work with. And in the same way as for directors and actors, we are going to ask the system to generate for us a numerical identifier movie id that will be used as a shorter alias for the title, year and country and that we’ll use to refer to one particular movie.
因为我们必须在其他表中使用MOVIES表中内容,用现实中属性做键可能有些难处理。DIRECTORS表和ACTORS表也是相同的道理,这时我们就要利用系统自动生成的数字标识符作为列title,year,country的别名,以此来查询一个特定的电影
We have the movies, we have the directors, and a new table called directed_by will record the fact that one movie was directed by one director.  The unique combination of a director id and a movie id is the key of the new table.
我们有MOVIES表,DIRECTORS表,还有一个新表directed_by,用来记录电影与导演的对应关系。一个director id 和movie id 的特定组合就是这个新表中的键。
The SQL query to retrieve the information will be hardly more complicated.
这样检索信息的SQL查询就不会那么复杂了
In the same way, we can also have a table called PLAYED_IN that links the table that records actors to the movies table.
用相同的方法,我们也可以创建一个新表PLAYED_IN,用来将关联表Actors与表Movies
Now, we still have a problem. Orson Welles directed and played in Citizen Kane, and actors turned directors, or directors who occasionally or often play in their own movies are rather common.
现在,我们仍旧有一个问题。好比奥逊?威尔斯执导并出演了《公民凯恩》,一些演员同时又是导演,一些导演也会时不时的出现在他们自己的影片中,这种现象非常普遍。
If we store separately directors and actors, some information will be duplicated, and if we update one table and forget the other one we shall have inconsistencies.
如果我们分开存放directors 和actors信息,一些信息就将出现重复,而如果我们更新其中一个表而不更新另一个,就会导致信息不一致。
What we really want is to merge the two tables and keep all the information about one person at a single location.
我们真正需要的是合并两个表并确保某一个人的信息来源于同一数据表。
Let’s ditch once again this model, and try a last one.
让我们把这种模型也抛弃,继续尝试最后一种模型。
We could still keep played_in and directed_by, and make them link the MOVIES table to a PEOPLE table.
我们仍然可以保留中间表playes_in 和directed_by ,并且通过它们将表MOVIES和PEOPLES相关联。
Another solution is to merge played_in and directed_by into a MOVIE_CREDITS table that would hold a movie id, a person id, and another attribute to tell in which capacity a person participated to the movie: actor, or director. This last design is probably better if we plan to also record one day who composed the music, who was the photography director, who designed the set and so on … which would otherwise require one additional table per type of involvement.
另一种方法是把表played_in和directed_by合并为表MOVIE_CREDIT,表中包含列movie id ,person_id 还有一些另外的属性来说明某个人是以何种身份参与到电影中:导演(D)或者演员(A)。最后一个设计或许会更好一些如果我们计划也将有一天的记录是谁谱的曲子,谁是摄影导演,谁设计的舞台等等,否则以上每种分类将会需要一个额外的表。
A good design guarantees data integrity and allows evolution.
一个好的设计既能保证信息的完整性又允许对数据库的进行升级改善。
This is a very simple example, but I hope that it will be enough to convince you of the importance of database design.
这是一个很简单的例子,但是我希望它足以使你了解数据库设计的重要意义。
Thank you for your attention.
感谢观看。

作者:Stéphane Faroult
翻译:王瑞莹,郑冉
校对:Kamus(ACOUG),仇实

普人特福的博客cnzz&51la for wordpress,cnzz for wordpress,51la for wordpress