Oracle Monitoring(Oracle性能监控)

We still work with Stéphane Faroult and ACOUG!

Unfortunately, we are a lit be busy these days.So the translating work may be late for your guys.But fine, we will work harder and harder,hope to tranlate more befor July.

Thanks for Ms.Liu Tina’s work,I also verified it.Maybe it still have mistakes or errors,please point them out.Thank you!

Also thanks for Stéphane Faroult and Kamus’s support!

iOracle还在继续翻译Stéphane Faroult先生的作品,我们尽量在翻译的过程中使用活泼的语言。希望这一点点努力能够帮助你了解到更多的知识。如果在翻译中发现任何错误,请随时联系我。感谢刘甜同学的辛勤劳动,作为一个女生,在学习之余抽出时间参加翻译工作让我十分感动。同时也感谢Stéphane Faroult先生和Kamus大师的关心与支持,我们将会奉献更多的翻译作品。

下面是本次翻译的内容,虽然这是中英对照文稿,我们依旧建议阅读英文为主,中文作为参考。

Hello, I’m Stéphane Faroult. I specialize in database application performance and I’m going to tell you the basis of Oracle performance monitoring.

大家好,我是Stéphane Faroult。我致力于数据库应用性能的研究,接下来,让我来告诉你Oracle数据库性能监控 (performance monitoring )的基础知识。

Oracle is heavily instrumented. I assume you know the architecture of Oracle (if not, watch my short video on the topic). There is a very large collection of pseudo-views known as “dynamic views” or “V$ views” the name of which starts with V$ and that are, for the most part, Oracle internal state variables and counters that are presented as if they were views and can be queried with SQL. Most counters are cumulative counters that are reset whenever the database is restarted.

Oracle中有很多数据都能像我们查看汽车仪表一样,通过ASH或AWR等工具进行查看。我认为你已经知道了Oracle的结构(如果不是的话,就去看看我做的关于这个主题的短片吧)。Oracle里有很多伪视图组成的一系列视图,一般被叫做“动态视图”或“V$视图”,因为这些视图的名字都是以“V$ ”开头的。另外,大部分的Oracle内部状态变量和计数器都像视图一样,可以用SQL语句直接查询。大多数计数器是累积计数器,在数据库重新启动的时候才会重置清零。

With cumulative counters, you must work by difference; you must have a “before” and an “after” pictures, a bit like all these ads that you see everywhere. A long time ago, Oracle introduced two scripts, bstat and estat, both later prefixed with “utl” for utility. Bstat copies the current values of v$ views to tables and is run before a period of interest. At the end of the period, estat is run, it computes the differences between the current values and those that were saved, and it generates a report.

由于这些累积的计数器,你必须根据这些不同的记数点来完成工作。你必须有一个“先前的”图片和一个“后来的”图片,这有点像那种随处可见的减肥广告。为此,很久之前,Oracle发布过两种脚本,bstat.sql和astat.sql,为了实用,这两个脚本后来都加上了"utl"的前缀,也就是utlbatat.sql和utlastat.sql。最后运行estat,它会计算当前值与以前保存的值的差异,然后据此生成一个报告。

Later Oracle introduced statspack, which is indeed a package the most important procedure of which is called snap.

后来Oracle又发布了statspack,这其实是一个非常给力的程序,又叫做snap。

The way it works is that you run snap at regular intervals, and it take snapshots of the dynamic views.

它工作的方式是这样的,你每隔一定间隔运行snap,它就会获取动态视图的快照。

Typically, statspack.snap will run a series of insert … select statements into performance history tables

其实通常的情况是,statspack.snap 会从某些表(如v$session_wait)中执行select获取某些数据,然后insert到某些表(如v$active_session_history)等表中。

There is a snag. All v$ views don’t directly map memory structures. They are already pretty elaborate and are built ontop of raw views. Many V$ views are joins, some are aggregates, and a select * may be a costlier operation than with a plain table. You also save the values, and writing to a database costs more than reading from it.

所有v$视图都不直接映射存储结构。它们的描述已经相当详尽,他们建立的视图是原始视图。很多动态视图都是联接的,还有一些是聚合的,对它们执行select * 可能会比对一个简单的表执行这条语句要执行更多的操作;在你保存数据、往数据库写数据时,也会比你从上面直接读取数据花费更多的时间。

What happens is that taking snapshots actually increases the load. As you often take snapshots because you have performance issues, you don’t really want to make things worse. The frequency cannot be high. Typically, you take snapshots every 30 minutes to 2 hours, so as to keep overhead to a minimum.

当获取快照的时候,事实上反而会增大负载。因为性能有问题,所以你可能通过生成快照来辅助你发现问题。但是生成快照会导致负载更高,这可能会使性能变得更差。获取快照的频率不能太高。通常情况下,你需要每30分钟至2小时获取一次快照,以便使系统开销最低。

Three issues remain.

然而还有三个问题依然存在。

Issue #1 is that statistics about the execution of a query are updated when the query terminates. If you have a query of death that is bringing the server to its knees and that out of despair the system administrator kills the process before you can probe in memory, for instance by querying V$session_longops, there is some collateral damage: a significant part of forensic evidence disappears. It will make problem-solving harder.

第一个问题是,当查询终止的时候,对查询执行所做的统计就被更新了。如果你有一个查询进程僵死了,并且影响到服务器,然而在你通过内存视图进行诊断之前(例如通过查询 v$session_longops,执行时间超过6秒的事务将会被记录在这个视图中),系统管理员却杀死了这个进程,这种情况就很悲剧了。因为这样会有一些意外的损失:一部分重要的错误信息就无声无息的消失了。这样只会使问题的解决变得更难。

Issue #2 is that if things are so bad that everything looks almost frozen, querying the dynamic views will be as bad as querying anything else. It happens rarely, but it may happen.

第二个问题是,当系统貌似被hang住的时候,查询v$视图(因为可能会join更多的表)会使情况更糟。这种情况虽然很少见,但还是有可能发生的。

Issue #3 is that when you take a snapshot memory can be full of innocent looking queries buzzing along. Then queries that are no longer executed may be replaced by new queries, some of which may be nasty, but disappear when you take the next snapshot, when everything looks again nice and peaceful. Some counters will tell you that something bad occurred in the interval, but it will be hard to find out what. Let me tell you that your real problem is that queries aren’t reused, which you have several ways to spot.

第三个问题则是,每当你获取快照时,内存就会被没用的查询迅速的占满了。然后那些查询将不会再被执行,可能会被新的查询替代,但是有一些查询是可能导致异常的,但是当你再一次获取快照时这些查询就会消失,然后一切看起来又变得正常了。一些计数器会告诉你某些时间段发生了什么情况,但若是想要找出具体是什么导致的就不容易了。那么让我来告诉你吧,其实你真正的问题在于没有重复利用查询,你可以通过一些办法来标记这些查询。

Will you increase the snap rate? Well, it will push overhead up. Besides, a snapshot every 10 minutes may be bearable, but a lot usually happens in 10 minutes, and you certainly cannot take a snapshot every tenth of second.

你会突然增加执行snap的频率么?好吧,这样会让你的系统开销一下子变得很大。另外,每10分钟执行一次快照或许还是能忍受的,但是很多时候,虽然数据库在十分钟内执行了很多操作,但是你也不能每十分之一秒执行一次快照!(这种情况建议使用ASH)。

This is why some third party suppliers came with monitors that are running on database servers, which attach to the SGA in read-only mode and read counters directly, without issuing SQL queries, and without saving values inside the database each time. It much lowers the overhead and allows to monitor the memory at a much higher frequency.

这就是为什么第三方供应商要提供他们的监控程序,这些监控程序可以直接在数据库服务器运行,依附于只读模式的SGA以及直接读取计数器数据,这样就不需要执行SQL查询,也不需要每次都往数据库里保存数据。这样就让你的系统开销减少很多,并且允许监控程序更高频率的对内存进行监控。

With version 10, Oracle shipped an improved statspack in an offer called “performance pack”. The two major components are a new monitor called ASH for “Active Session History” and a revamped statspack called AWR for “Automatic Workload Repository”. By default, ASH probes memory every second and only checks active sessions; data is saved to a memory buffer and dumped to the database only once in a while. In the meantime, AWR takes statspack-style snapshots every hour by default.

在ORACLE version 10里,自带了一个叫做“performance pack”的程序包,它是statspack的改良版,能为我们监控数据库提供更多帮助。它主要由两部分组成,一个是叫做ASH(即 Active Session History)的新的monitor,还有一个是经过改良之后的statspack,叫做AWR(即Automatic Workload Repository)。默认情况下,ASH每秒进行一次内存诊断,并且只检查活动的会话。数据被存储在一个内存缓冲区,只在偶尔被丢在数据库里。与此同时,AWR默认的每小时执行一次staspack格式的快照。

All great and good, except that the “Performance Pack” is separately licensed and that you have to pay for it.

这样看来,除了“Perfoemance Pack”是需要独立许可的,你不得不花钱购买这个许可之外,一切都完美了。

If your boss looks like this, great, don’t look for another job, and go for the performance pack.If this is a more familiar figure, don’t despair. Statspack is still available and supported, and you can actually do a lot with it. Many DBAs did a great job for years with no other tool; and there are a couple of free Internet resources that allow you to make an even greater job with it. Unfortunatly, statspack has completely disappeared from the recent documentation, most probably by mistake.

如果你的上司喜欢这个,那么就太好了,不要试图去寻找其他方法了,去找performance pack吧。如果需要更多的分析计算,不要绝望。Statspack仍然是可用的、能为你提供支持的,并且你确实可以利用它做很多事,很多有经验的DBA不用其他工具也做的很出色。网上还有一些免费的资源,帮助你更好的工作。但是很遗憾,从最近文档看来,statspack已经完全消失了。

I’m going to fill the gaps in the doc and show you how to install statspack on a small test database running Oracle 11 release 2.

如你所见,目前为止,这篇doc文档介绍的只有理论,那么接下来我将要弥补一下实践的空白,让我来向你演示如何在一个小的测试数据库(ORACLE 11 release2[11gR2安装文档可参考这里])安装statspack。

First of all you need to create the statspack package and a repository that will belong to a schema called Perfstat. You must connect to the database under the Oracle OS account as a “sysdba”, and run procedure spcreate that is located in the subdirectory rdbms/admin under your ORACLE_HOME. Use a better password than I do, because PERFSTAT is an account that all Oracle hackers have heard about. You need to specify a tablespace to store performance data and a temporary tablespace, I’m happy with the default values.

首先呢,你需要创造一个statspack包和一个属于Perfstat模式的库。安装Statspack需要在操作系统认证的情况下以“sysdba”的身份登录,并且要转到你的$ORACLE_HOME/RDBMS/ADMIN目录下,然后运行spcreate程序(SQL>@spcreate)。用个强度高的密码,因为每个oracle黑客都知道perfstat这个账户。你需要指定一个表空间来存储性能数据和一个永久表空间,我表示很乐意用默认值设置。

… lots of stuff on the screen …

。。。你会看到屏幕上出现很多东东。。。

When the script terminates, you are still connected as PERFSTAT. Reconnect as a DBA, and grant the CREATE JOB privilege to Perfstat. At this stage, you might also want to type show parameter job and check that the job_queue_processes value is greater than zero. Then you connect as PERFSTAT again and use, with Oracle 10 and above, the package DBMS_SCHEDULER to create a job that will call a PL/SQL block that will call snap at level 7 – there are several possible levels, 7 is good.

当脚本运行结束的时候,你仍然是作为perfstat这个用户连接的数据库。以DBA的身份重连,然后授予Perfstat CREATE JOB的权限。在现阶段,你也可能想要知道job的的参数,你可以查看到job_queue_processes的值是大于0的(10g中,此参数的默认值是10,可以通过show parameter job查看,job_queue_processes的值就是并行执行的job的个数,在将此参数改为0的时候,dbms_job创建的job将不能再启动,但DBMS_SCHEDULER创建的job不会受此影响,这些job可通过DBA_SCHEDULER_JOBS查看)。然后你可以再作为perfstat连接数据库并使用,在oracle 10以及以上的版本,DBMS_SCHEDULER包创建的job将会调用一个PL/SQL块来调用level 7的snap,这里有几种可能的级别,level 7完全可以满足需要。

The horrendous expression I am using for start_date just guarantees that the job will start on the hour or the half hour. I just like making things complicated. If you are running Oracle 9, this expression will be useful to you with the dbms_job package to compute the time when to run the job next. The repeat_interval parameter says that my time unit is the minute, and that I want the job to run every 30 time units. Don’t forget to set enabled to true, otherwise the job won’t start.

我用的start_date表达式,只能保证job将会开始于第几小时或第几个半个小时。我只是喜欢把事情弄得更复杂。如果你运行的是oracle9,这个表达式和dbms_job 包可以用来计算下一个job的时间。repeat_interval 参数告诉我我的时间单元是分钟,我想要每30分钟运行一次job。不要忘了把enable的参数设为true,否则job是不会启动的。

Run this statement and you’re set . You can query user_scheduler_jobs to check that statspack is ready to run.

运行完这些语句你就设置好了。你可以通过查询user_scheduler_jobs来检查statspack是不是已经在运行了。

You should monitor the size of tables in the perfstat schema and set up some purging mechanism, which is automated in AWR.

你应当监控perfstat模式的表的大小并且建立一些清除数据以释放表空间的机制,这些在AWR里是自动化执行的。

Now, to give you an idea about the monster, there are 600 dynamic views in Oracle 11, about the same number of various statistics are recorded, and over 1,000 different events are tracked. This can be a bit overwhelming.

现在,oracle 11为了让你知道哪里有问题,有600个动态视图,关于他们的各种各样的统计被记录下来,还有一千多个的不同的事件被跟踪着。这在我们看起来可能有点措手不及。

Most people generate the pre-packaged reports with both statspack and AWR. The procedure for statspack, spreport, is in the same directory as spcreate. When you run it, it displays the various snapshots that were taken, by groups. The database was shut down and restarted between two groups. Then you specify the snapshot identifier at the beginning of the interval you are interested in, and the one at the end of the interval. Both snapshots must belong to the same group. As counters are reset when the database starts up results would be meaningless otherwise.

很多人同时用statspack和AWR生成包装的报告。statspack、spreport和spcreate在同一个目录里。当你运行它的时候,它会按分组显示生成的各种各样的快照,数据库每一次关闭然后重启都会形成一个组。然后你可以在你感兴趣的时间段的开始指定快照标识,在这个时间段的最后指定另一个快照标识。此时的所有快照必须属于同一组。当数据库启动的时候计数器的结果就没有意义了,所以此时计数器会重置。

And there you go.

那么我们继续往下看。

I’m not going to delve into the details. And to be honest, it’s full of things some DBAs delight in, but that don’t necessary make sense to developers who have a hard time relating Oracle events to their code. Fortunately, there are two interesting sites that offer free services to help you.

我并不打算去钻研细节。老实说,让DBA着迷的事情太多了,但是开发者们真的没有必要绞尽脑汁的把自己的代码和Oracle事件都联系在一起来分析数据库性能。你非常幸运喔,因为有两个有趣的网站,能够提供免费服务来帮助你完成分析。

To help you make sense of one report, there is oraperf.com, with a service developed by someone who had more than a hand in Oracle’s instrumentation. You can upload reports, and you will get a diagnosis with plain English recommendations. That’s what you need to drill on a particular period.

如果想增强一个报告的可读性,去找oraperf.com帮忙吧,有一些潜心研究oracle物化的高手们给你提供服务。你可以上传你的报告,然后你会得到一个简洁的英语诊断建议报告。这就是你在这个时期需要研究的东西了。

Ondataperf.com takes a different approach. It helps you make sense of a series of report over intervals of equal duration – reports over a full day, or a week, or 4 hours, or one hour, it doesn’t matter. You take a series of statspack or AWR reports, zip them, upload them to the site, and they generate synthetic report chokefull of graphics in which every interval on the X-axis represents one of your reports. They chart almost everything they can, and add comments. It’s a great resource for monitoring over time and measuring the impact of an upgrade, for instance.

Ondataperf.com则使用了另一种不同的方法。它帮助你检测持续一段时间间隔的一系列报告,这些报告可以是一整天的、一个星期的,或者4个小时的,再或者1个小时的,都可以。你把那一系列的statspack或AWR报告用zip的格式压缩,然后把它们上传到他们的网站,然后他们就会据此生成综合报告,这个报告用坐标图展示了每个时间间隔的分析报告,每个坐标刻线代表你的一个报告(详情点我)。他们几乎把所有他们能做的都画在图表中了,并且又加上了注释。这为根据时间监控实例提供了非常好的资源,也能测定升级对实例的影响。

If you are daring, you can also directly query the PERFSTAT tables or the DBA_HIS views that are the AWR equivalent if you are licensed for it. This for instance is a HTML report I generated with a script in my toolkit, four straight SQL queries, without any PL/SQL.

如果你有能力,你甚至可以直接查询perfstat表或者DBA_HIS视图,如果你被授权允许查询,这些都是和AWR一样给力的。例如,这些可能是我用我的工具包里的脚本生成的HTML格式的报告,4个直接的SQL查询,没有任何PL/SQL。

Monitoring will give you a very precise idea of the symptoms, much more than vague reports of sluggishness. Of course, the usual suspect is the code, and surgery is a different story. But it will help you focus your efforts.

Monitoring将会给你一个关于数据库的问题的非常精确的概念,比低灵敏度、含糊的报告要好得多。当然,通常令人怀疑的是代码,这和外科手术又是一个不同的故事。但是它将会帮你找到问题并把精力都集中在解决这个问题上。

Thank you for your attention.

好的,就这些,谢谢你的关注。

  1. 这个译文里的内容很丰富呀
    还不经意上了youtube,还下了个ByteArmor监控的软件

    THNAKS!

普人特福的博客cnzz&51la for wordpress,cnzz for wordpress,51la for wordpress