Tag Archives: 翻译 数据库设计 Database Design 中文字幕 Stephane Faroult

Database Design 翻译

国内的图书往往通过各种范式、规则来讲解数据库设计。Stéphane Faroult先生则是通过一个关于电影的例子,生动形象的说明了数据库设计过程中遇到的问题和解决方法。视频分为上下两集,时间大约11分钟。相信看完这段视频和文字之后,会对你有所启发。

Hello; my name is Stéphane Faroult. I’m a French database consultant and a database book author.
I’d like to present a brief introduction to database design by showing you on a small example some issues you can encounter and some ways to solve them.
For my example, I’d like to create a website that provides information about movies.
You are probably aware that in a database, data is stored in tables.
We’ll obviously have some kind of MOVIES table inside our database.
Tables in a database are very different from spreadsheets, but let’s start with a spreadsheet that contains information about a small number of classic movies: title, country, year, director, and leading actors.
Initially we shall model our main table after the spreadsheet.
Each row brings together a number of pieces of information that are the characteristics of a movie.
The order of columns doesn’t matter.
The order of rows doesn’t matter.
What matters is that the different pieces of information on one row are related, and the titles of the columns define the relation. This is why we talk about relational databases.
It makes no sense to have the same row several times. An important step, when designing a database, consists in identifying the characteristics that allow us to distinguish one row from another; and very often these characteristics are a subset of all the characteristics. They are collectively called “the key”. What is the smallest subset of characteristics that uniquely identifies a movie?
The title? No, the title isn’t enough because you have remakes.
For instance, there are several movies based on the “Treasure Island” novel.
比如说,有很多部电影都是改编自《金银岛》(”Threasure Island”)这部小说。
Title and director perhaps?
It doesn’t work either, because Alfred Hitchcock for instance made two “The man who knew too much”, one in 1934 and the other one in 1956.
这同样不管用,比如阿尔弗莱德?希区柯克(Alfred Hithchcock)导演了两部《擒凶记》(“The man who knew  too much”),一部在1934年,另一部在1956年。
Title, director and year may seem a better choice – except that “director” isn’t as simple a characteristic as it may look;
Some movies have several directors, and it may lead to ambiguities.
Title, country and year look reasonable. It is very unlikely, not least for commercial reasons, that two movies with exactly the same title will be produced the same year in the same country. It is sometimes possible to find several groups of characteristics that uniquely identify a row in table; you are free to choose the group that you find the most convenient as the main identifier, and this group of characteristics is called the “primary key”.
The ultimate goal of all “data bases” is to be queried. That means that people will search the table, using some interface, to retrieve movies that satisfy some particular conditions.
Behind the scene, the program will issue an SQL query in which you name the characteristics you want
Followed by the name of the table
Followed by filtering conditions
And here we have several problems. If we want to retrieve all the movies that were directed by one given director, it is essential that the condition exactly matches the contents of the table – that is that if the row has for “Director” attribute “Hitchcock, A.”, the filtering condition is exactly entered as “Hitchcock, A.”, not as r “Hitchcock”), and definitely not as “Alfred Hitchcock”.
在这里我们有一些问题,当我们想要检索出同一位导演所执导的所有影片,就必须要准确输入与表中内容一致的条件,就像是如果这一行中“导演”的属性是“Hitchcock,A”,那么筛选的信息必须按照“Hitchcock,A”这样输入,而不可以是“Hitchcook”,更不能是“Alfred Hitchcock”(这几种是阿尔弗莱德?希区柯克英文名的不同书写形式)。
There is also a consistency issue with data that will probably be entered by different people at different times and must respect the same conventions: if you say that “North by Northwest” and “Rear Window” are “Hitchcock, A.” movieswhile “Strangers on a Train” is an “Alfred Hitchcock” movie, users who query the database will have much trouble getting the full list of Hitchcock movies.
不同时间向数据库中键入数据的人也要遵守数据一致性的约定:如果你说《西北偏北》(“North by Northwest”),《后窗》(“ Rear Window”)是导演Hitchcock, A.(阿尔弗雷德?希区柯克)的电影作品,而《火车怪客》(“ Strangers on a Train”)是导演Alfred Hitchcock(阿尔弗雷德?希区柯克英文名字的另一种格式)的另一部电影。这样用户在查询导演阿尔弗雷德?希区柯克电影作品的详细列表时就会遇到麻烦。
As a general rule, character fields that are free-form text call for trouble when they are used for searches. This is true for directors, and this is true as well for the leading stars. We must therefore define some kind of standard for storing names; storing firstname and surname separately, in uppercase perhaps. Standardization is a first step in the right direction. But we still have a few numbers of problems to solve. The first one is typos. A user who mistypes a name may realize the mistake, check the spelling, and try again. But if there are some misspellings in the database, information that is misspelt will only be found by users doing the same mistake when they query; not a desirable situation.
Let’s ditch this model
And let’s try a new one. To ensure that names are correctly entered, the best solution is to enter them only once (and if they are misspelt, we’ll have only one place to correct). This can be performed by a very convenient feature of the SQL language, which is the ability to perform joins.
We have been dealing so far with a single table. But we may have several tables, with some common columns, and SQL allows to associate data from one table to data from another table by using the common columns to establish the link – this operation is called a join.
Let’s see how it works. Suppose that we create a table of directors. In this table, we shall store first name, surname and perhaps some biographical information.
The key will be firstname and surname; however, since we want to link movies and directors and since we want to avoid misspellings, we are going to use a system-generated numerical identifier that will be a synonym for the key and that we shall call ID. This will be, technically speaking, a second key and we can make ID the primary key of the DIRECTORS table while asking the DBMS to ensure that firstname and surname remain unique across the table.
Inside the MOVIES table we shall replace the identification of the director by the id; of course, we’ll do something similar for actors. When a user wants to retrieve the titles and year of the movies directed by someone whose only surname has been provided, the system will issue an SQL query that will look like this one.
However, we have a problem with the infrequent movies that have more than one director.
Should we reserve two columns for storing director ids? That would mean joining on two columns, which can be done but isn’t most convenient.
But two isn’t enough. Some movies have three directors.
Some even have five.
We have the same problem, even more acute, with actors. How many actors should we record? The three leading actors?
What about a star-studded movie such as “The Longest Day”?
如果是像《最长的一日》(The Longest Day)这样众星云集的电影呢?
What if one day we decide to improve the database and record more actors, and perhaps the full cast? If we want to add more actor identifiers to the table we shall have to review all the existing queries to add “OR” conditions.
Storing director and actor identifiers inside the movies table will make the evolution of the application very painful.
This isn’t therefore a good design
And we must try harder.
Instead of having one column per director or actor in the movies table, a DBMS allows us to have an intermediate table that, through two joins, will allow us to relate one director to several movies, or several directors to one movie if need be. We shall call this table directed_by and it implements a link between the table of movies and the table of directors.
The director and the leading actors will no longer be an attribute of the movie.
Since we must reference movies in another table, the real life key may be a bit difficult to work with. And in the same way as for directors and actors, we are going to ask the system to generate for us a numerical identifier movie id that will be used as a shorter alias for the title, year and country and that we’ll use to refer to one particular movie.
We have the movies, we have the directors, and a new table called directed_by will record the fact that one movie was directed by one director.  The unique combination of a director id and a movie id is the key of the new table.
我们有MOVIES表,DIRECTORS表,还有一个新表directed_by,用来记录电影与导演的对应关系。一个director id 和movie id 的特定组合就是这个新表中的键。
The SQL query to retrieve the information will be hardly more complicated.
In the same way, we can also have a table called PLAYED_IN that links the table that records actors to the movies table.
Now, we still have a problem. Orson Welles directed and played in Citizen Kane, and actors turned directors, or directors who occasionally or often play in their own movies are rather common.
If we store separately directors and actors, some information will be duplicated, and if we update one table and forget the other one we shall have inconsistencies.
如果我们分开存放directors 和actors信息,一些信息就将出现重复,而如果我们更新其中一个表而不更新另一个,就会导致信息不一致。
What we really want is to merge the two tables and keep all the information about one person at a single location.
Let’s ditch once again this model, and try a last one.
We could still keep played_in and directed_by, and make them link the MOVIES table to a PEOPLE table.
我们仍然可以保留中间表playes_in 和directed_by ,并且通过它们将表MOVIES和PEOPLES相关联。
Another solution is to merge played_in and directed_by into a MOVIE_CREDITS table that would hold a movie id, a person id, and another attribute to tell in which capacity a person participated to the movie: actor, or director. This last design is probably better if we plan to also record one day who composed the music, who was the photography director, who designed the set and so on … which would otherwise require one additional table per type of involvement.
另一种方法是把表played_in和directed_by合并为表MOVIE_CREDIT,表中包含列movie id ,person_id 还有一些另外的属性来说明某个人是以何种身份参与到电影中:导演(D)或者演员(A)。最后一个设计或许会更好一些如果我们计划也将有一天的记录是谁谱的曲子,谁是摄影导演,谁设计的舞台等等,否则以上每种分类将会需要一个额外的表。
A good design guarantees data integrity and allows evolution.
This is a very simple example, but I hope that it will be enough to convince you of the importance of database design.
Thank you for your attention.

作者:Stéphane Faroult

普人特福的博客cnzz&51la for wordpress,cnzz for wordpress,51la for wordpress