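Author: Che Dong Email: chedongATbigfoot.com/chedongATchedong.com
Written: 2002/07 Last updated:
Copyright notice: this article may be freely reproduced, provided the reproduction states the original source, the author information, and this notice in the form of a hyperlink
http://www.chedong.com/tech/rotate_merge_log.html
Keywords: webalizer apache log analysis sort merge cronolog awstats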
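Abstract: you do not need to read everything below patiently, because the conclusions come down to two points:
1 Use cronolog to rotate Apache logs by day, cleanly and safely
2 Use sort -m to merge and sort multiple logs
Based on my own experience, this article:
1 first describes how to merge Apache logs;
2 then uses the problems this raises to explain why log rotation is necessary and how to do it, showing how to rotate Apache logs with cronolog.
Along the way it covers usage tips for the tools involved in designing the merge process, and some attempts that failed.
I am sure there is more than one way to solve these problems, and the approach below is certainly not the simplest or the cheapest. I hope to exchange ideas with readers.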
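More and more large web services use DNS round-robin for load balancing: several servers in the same role serve the front-end web traffic. This makes distributing and scaling the service much easier, but having the logs spread over several servers makes log analysis more troublesome. If you run a log analyzer such as webalizer separately on each machine:
1 Aggregating the data becomes tedious. For example, the total number of hits has to be computed by adding up the figures for the given month from SERVER1, SERVER2, and so on.
2 Metrics such as unique visits and unique sites are badly distorted, because they are not the simple sum of the per-machine figures.
The benefits of unified log statistics are obvious, but how do you merge the statistics from all the machines into one result?
The first idea might be: can several servers write their logs to the same remote file? We will not consider logging over a remote file system, because the trouble it brings far outweighs the convenience.
So the logs of the servers to be analyzed are still: recorded separately => synchronized to a back-end machine at regular intervals => merged => analyzed with a log analysis tool.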
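First, why the logs have to be merged: webalizer cannot combine several logs covering the same day.
If you run, one after another,
webalizer log1
webalizer log2
webalizer log3
the final result contains only the data from log3.
Can log1, log2 and log3 simply be concatenated?
No. A log analyzer does not read the whole log and then analyze it; it reads the log as a stream and saves intermediate results at certain time intervals. When the timestamps jump too far (for example, two entries more than 5 minutes apart), the algorithms of some analyzers "forget" the earlier results. So analyzing the plain concatenation of log1, log2 and log3 still gives only the statistics of log3.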
The time fields of several log files typically look like this:
log1      log2      log3
00:15:00  00:14:00  00:11:00
00:16:00  00:15:00  00:12:00
00:17:00  00:18:00  00:13:00
00:18:00  00:19:00  00:14:00
14:18:00  11:19:00  10:14:00
15:18:00  17:19:00  11:14:00
23:18:00  23:19:00  23:14:00
The merge must interleave the entries of the logs in time order. The merged log should look like this:
00:11:00 from log3
00:12:00 from log3
00:13:00 from log3
00:14:00 from log2
00:14:00 from log3
00:15:00 from log1
00:15:00 from log2
00:16:00 from log1
....
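How do you merge several log files?
The following uses the standard CLF log format (Apache) as an example.
The Apache log format looks like this:
%h %l %u %t \"%r\" %>s %b
A concrete example:
111.222.111.222 - - [03/Apr/2002:10:30:17 +0800] "GET /index.html HTTP/1.1" 200 419
The simplest idea is to read all the entries and sort them by the time field:
cat log1 log2 log3 | sort -k 4 -t " " -o log_all
Notes:
-t " ": the field separator is a space
-k 4: sort on the 4th field onward, which starts with the timestamp [03/Apr/2002:10:30:17 +0800]
-o log_all: write the output to the file log_all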
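But this is inefficient. A service that already needs load balancing often has more than ten million log entries per machine, several hundred MB in size. Sorting several logs of that size at once puts a heavy load on the machine.
There is a way to optimize this: each individual log is already sorted by time, and sort offers an optimized algorithm for merging files that are already sorted, the -m (merge) option.
So a better way to merge three logs of this format, log1 log2 log3, into log_all is:
sort -m -t " " -k 4 -o log_all log1 log2 log3
Notes:
-m: use the merge algorithm
Note: it is best to compress the merged log before handing it to webalizer.
Some systems can handle files larger than 2 GB and some cannot; some programs can and some cannot. Avoid files larger than 2 GB unless you have confirmed that every program involved and the operating system can handle them. If the merged output exceeds 2 GB, gzip it before passing it to webalizer: file system errors are more likely while analyzing a file over 2 GB, and gzip also greatly reduces I/O during the analysis.
That is how the logs are merged in time order.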
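The merge-then-compress step described above can be sketched as follows. This is only an illustration under stated assumptions: the three tiny log files and their entries are made up, the key is narrowed to -k 4,4 (the timestamp field only), and the merged output is piped straight into gzip so the uncompressed file never hits the disk.

```shell
#!/bin/sh
# Sketch: merge per-server logs that are each already sorted by time.
# The sample entries and file names below are made up for illustration.
LC_ALL=C; export LC_ALL          # byte-wise comparison, independent of locale
workdir=$(mktemp -d)
cd "$workdir"

cat > log1 <<'EOF'
10.0.0.1 - - [03/Apr/2002:00:15:00 +0800] "GET /a.html HTTP/1.1" 200 100
10.0.0.1 - - [03/Apr/2002:00:17:00 +0800] "GET /b.html HTTP/1.1" 200 100
EOF
cat > log2 <<'EOF'
10.0.0.2 - - [03/Apr/2002:00:14:00 +0800] "GET /c.html HTTP/1.1" 200 100
10.0.0.2 - - [03/Apr/2002:00:18:00 +0800] "GET /d.html HTTP/1.1" 200 100
EOF
cat > log3 <<'EOF'
10.0.0.3 - - [03/Apr/2002:00:11:00 +0800] "GET /e.html HTTP/1.1" 200 100
10.0.0.3 - - [03/Apr/2002:00:16:00 +0800] "GET /f.html HTTP/1.1" 200 100
EOF

# -m merges already-sorted inputs; -k 4,4 restricts the key to the timestamp field
sort -m -t " " -k 4,4 log1 log2 log3 | gzip -c > log_all.gz

# The requested pages in merged order, joined on one line
merged_pages=$(gzip -dc log_all.gz | cut -d " " -f 7 | tr '\n' ' ')
echo "$merged_pages"
```

The compressed file can then be given to the analyzer; the sample webalizer configuration later in this article notes that a log file ending in '.gz' is decompressed while it is read.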
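Now consider the data source. webalizer is really a monthly statistics tool that supports incremental processing, so for a large service I can merge the Apache logs day by day and feed them to webalizer. But how do you cut the web log by day (for example at 00:00:00 every night)?
If you use crontab to move the log to access_log_yesterday at exactly midnight every day:
mv /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday
then you also have to restart Apache immediately afterwards.
Otherwise Apache keeps writing through its open file handle to the renamed file, and no new access_log is created. Archiving this way means the service is disrupted by a restart every midnight.
A simpler way that does not affect the service is to copy first and then truncate:
cp /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday
: > /path/to/apache/log/access_log
A careful analyst will notice a problem with this:
cp cannot guarantee a cut at exactly midnight. Suppose the copy takes 6 seconds: access_log_yesterday will then contain entries logged up to 00:00:06. For statistics on a single log, a few hundred extra lines per day do no harm. But when several logs are merged, the day that spans a month boundary has a sorting problem: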
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
The [01/Apr/2002:00:00:00 field cannot be sorted across days. The date is written as day/month/year and the month is an English name, so an alphabetical sort is likely to produce the following, corrupting the order of the log:
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
These out-of-order entries at the day boundary are like a swallowed bug to webalizer and similar tools: the run may throw away all the data of the previous month! Such data therefore poses a real risk when processing the last day of a month.
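The ordering problem is easy to reproduce on two made-up entries. The sketch below assumes nothing beyond a standard sort; the log lines are invented for the demonstration.

```shell
#!/bin/sh
# Sketch: a plain sort on the timestamp field compares text, so "01/Apr"
# sorts before "31/Mar" although it is the later date.
LC_ALL=C; export LC_ALL
sorted=$(printf '%s\n' \
  '10.0.0.1 - - [31/Mar/2002:23:59:59 +0800] "GET /last HTTP/1.1" 200 1' \
  '10.0.0.2 - - [01/Apr/2002:00:00:00 +0800] "GET /first HTTP/1.1" 200 1' \
  | sort -t " " -k 4,4)
# Timestamp field of the line that ends up first after sorting
first_line_date=$(printf '%s\n' "$sorted" | head -n 1 | cut -d " " -f 4)
echo "$first_line_date"
```

The April entry comes out first because the comparison is made character by character, and "0" sorts before "3".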
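There are several ways to approach the problem:
1 Post-processing:
One post-processing method is to use grep on the first day of each month to drop the entries that cross the month boundary, for example:
grep -v "01/Apr" access_log_04_01 > access_log_new
That is, fix up the sorted log by removing all entries that cross the day boundary. Post-processing the log may be one way out. sort does have a special option for dates, -M (note: capital M), which sorts the given field by English month name instead of alphabetically, but with Apache logs it is awkward to get sort to isolate the month field. (I tried using "/" as the separator and sorting on the two fields "month" and "year:time".) It could certainly be done with a Perl script, but in the end I gave up: it goes against a design principle of system administration, generality.
And you should keep asking yourself: is there a simpler way?
Another option is to change the log format to use a TIMESTAMP (SQUID logs, for example, do not have this problem because they use a timestamp for the time field), but I cannot be sure that every log tool will recognize a non-standard format in the date field.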
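For readers who still want to try the post-processing route, one possible way (not the approach this article ends up recommending) is to prefix each line with a numeric key, sort on that key, and strip it again. The sketch below uses awk instead of Perl. The function name sort_clf_by_time is hypothetical, and it assumes that every entry uses the same time zone offset.

```shell
#!/bin/sh
# Sketch: sort CLF entries chronologically across day and month boundaries.
# sort_clf_by_time is a hypothetical helper name, not part of any tool.
LC_ALL=C; export LC_ALL

sort_clf_by_time() {
  awk '{
    # $4 looks like "[31/Mar/2002:23:59:59"; drop the "[" and split it up
    split(substr($4, 2), t, "[/:]")
    # position of the month name in this string gives the month number 1-12
    month = (index("JanFebMarAprMayJunJulAugSepOctNovDec", t[2]) + 2) / 3
    # prefix each line with a sortable key YYYYMMDDHHMMSS and a tab
    printf "%04d%02d%02d%s%s%s\t%s\n", t[3], month, t[1], t[4], t[5], t[6], $0
  }' "$@" | sort -s -k 1,1 | cut -f 2-
}

ordered=$(printf '%s\n' \
  '10.0.0.2 - - [01/Apr/2002:00:00:00 +0800] "GET /first HTTP/1.1" 200 1' \
  '10.0.0.1 - - [31/Mar/2002:23:59:59 +0800] "GET /last HTTP/1.1" 200 1' \
  | sort_clf_by_time | cut -d " " -f 7 | tr '\n' ' ')
echo "$ordered"
```

This performs a full sort rather than a merge, so it gives up the efficiency of sort -m, which is one more reason to prefer fixing the data source.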
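2 Optimizing the data source:
The best approach is still to fix the data source: make sure the logs are rotated by day, so that all entries in one day's log fall within that day. Then whatever tool you later use to analyze the logs (commercial or free), you will not depend on complicated preprocessing.
The first idea might be to control when the log is cut, for example starting the cut at exactly midnight. But it makes no difference whether you start one minute before or one minute after midnight: you still cannot prevent a log from containing entries from two days, and you cannot predict how long the archiving will take.
So a log rotation tool needs serious consideration. It has to:
1 Not interrupt the web service: no stop Apache => move log => restart Apache
2 Rotate strictly by day: one log per day covering 00:00:00-23:59:59
3 Be unaffected by Apache restarts: a tool that creates a new log on every restart does not qualify
4 Be simple to install and configure
I first looked at rotatelogs, the rotation tool shipped in the apache/bin directory. It basically rotates logs by time interval or by size, and gives no control over when the cut happens or how the logs are archived by day.
Then I looked at logrotate, a tool dedicated to rotating all kinds of system logs (syslogd, mail). Its configuration is rather complicated, so I dropped it; in fact it too cuts and archives logs by sending the service process a -HUP signal to restart it.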
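The Apache FAQ recommends cronolog, a tool that has matured over nearly two years of development. Installation is simple: configure => make => make install
One of its configuration examples shows how well it suits rotating logs by day. A very small change to httpd.conf is enough:
TransferLog "|/usr/sbin/cronolog /web/logs/%Y/%m/%d/access.log"
ErrorLog "|/usr/sbin/cronolog /web/logs/%Y/%m/%d/errors.log"
The logs are then written to
/web/logs/2002/12/31/access.log
/web/logs/2002/12/31/errors.log
After midnight they are written to
/web/logs/2003/01/01/access.log
/web/logs/2003/01/01/errors.log
and the directories 2003, 2003/01 and 2003/01/01 are created automatically if they do not exist.
So unless you do something like changing the system clock at midnight, the logs are stored strictly by day (00:00:00-23:59:59). In the later analysis, sorting the [31/Mar/2002:15:44:59 field within one file no longer depends on the date, only on the time.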
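To get a feel for how such a template maps to a file name, the same %Y/%m/%d specifiers can be expanded with date. This is only a preview trick under the assumption that the template sticks to specifiers that date also understands; it does not call cronolog itself.

```shell
#!/bin/sh
# Sketch: preview the file a strftime-style template would resolve to right now.
# date(1) is used here only to expand the same %Y/%m/%d specifiers.
template='/web/logs/%Y/%m/%d/access.log'
todays_log=$(date "+$template")
echo "$todays_log"
```

Run shortly before and after midnight, the printed path changes from one day's directory to the next, which is the behaviour described above.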
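Test: given the disk capacity of the system, I decided to rotate the logs by weekday.
Add to the Apache configuration:
#%w weekday
TransferLog "|/usr/sbin/cronolog /path/to/apache/logs/%w/access_log"
After restarting Apache, the original CustomLog /path/to/apache/logs/access_log kept growing, and a new directory 3/ appeared under the log directory (the test was on a Wednesday). After a while I suddenly noticed that the two logs were growing at different rates!
Running tail on both logs showed why:
My CustomLog used the combined format (which includes the extended information), while TransferLog used the default log format. The Apache manual explains that TransferLog uses the format set by the most recent LogFormat directive that does not define a nickname, and falls back to the Common Log Format otherwise. My httpd.conf read:
LogFormat ..... combined
LogFormat ... common
...
CustomLog ... combined
TransferLog ...
So the TransferLog log used the default format. According to the manual, to make the transfer log use a specific format you need:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
TransferLog "|/usr/local/sbin/cronolog /path/to/apache/logs/%w/access_log"
After a restart the two log formats matched.
With this setup two logs are written under the logs directory, access_log and %w/access_log. Can we log only under %w/?
The Apache manual gives a simpler way: have CustomLog write directly to cronolog, which also lets you specify the format.
CustomLog "|/usr/local/sbin/cronolog /path/to/apache/logs/%w/access_log" combined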
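The last issue is log synchronization.
Task: every morning, find the previous day's log and save a copy ready to be sent to the log server.
For example, to keep the previous week's logs, copy the previous day's log to a given directory every day and wait for the log server to fetch it:
/bin/cp -f /path/to/apache/logs/`date -v-1d +%w`/access_log /path/for/backup/logs/access_log_yesterday
On FreeBSD, use the following command:
date -v-1d +%w
Notes:
-v-1d: the previous day; on GNU/Linux the equivalent is date -d yesterday
+%w: weekday. Since all the tools use the standard time library, they all define the weekday the same way: 0-6 => Sunday-Saturday
Note: in a crontab, "%" must be escaped with a preceding "\". To archive the log at 00:05 every day:
5 0 * * * /bin/cp /path/to/logs/`date -v-1d +\%w`/access_log /path/to/for_sync/logs/access_yesterday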
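If the same script has to run on both FreeBSD and GNU/Linux, the two date forms can be wrapped in a small function. This is a minimal sketch; yesterday_wday is a hypothetical helper name, and it assumes that a date which does not know -v exits with an error so that the second form is tried.

```shell
#!/bin/sh
# Sketch: print yesterday's weekday number (0-6, Sunday = 0) on either platform.
# yesterday_wday is a hypothetical helper name.
yesterday_wday() {
  # BSD date understands -v-1d; GNU date rejects -v, so fall back to -d yesterday
  date -v-1d +%w 2>/dev/null || date -d yesterday +%w
}

wday=$(yesterday_wday)
echo "/path/to/apache/logs/$wday/access_log"
```

Inside a script file the "%" needs no escaping; the backslash is only required when the command is written directly on a crontab line.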
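cronolog logging first started on a Wednesday, so a week later logging rotates back to 3/access_log.
But will the entries be appended to 3/access_log, or will the file be created anew? >>access_log or >access_log?
My test showed that the entries are appended:
[01/Apr/2002:23:59:59 +0800]
[01/Apr/2002:23:59:59 +0800]
[08/Apr/2002:00:00:00 +0800]
[08/Apr/2002:00:00:00 +0800]
You certainly do not want each day's log to carry last week's data and have it counted again (even though it does not affect the result), and the logs under %w/ would keep growing.
Solution 1: change the daily cp to mv
Solution 2: after each daily copy, delete access_log files older than 6 days
find /path/to/apache/logs -name access_log -mtime +6 -exec rm -f {} \;
Keeping a few extra days of logs is still worthwhile: what if the log analysis server is down for a day?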
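Before pointing such a find command at real logs, it can be tried on throwaway files to see what "-mtime +6" actually selects. The directory layout and dates below are made up for the test.

```shell
#!/bin/sh
# Sketch: see which files "-mtime +6" selects, using throwaway files.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/3" "$demo_dir/4"
touch -t 200204030000 "$demo_dir/3/access_log"   # pretend it was last written in 2002
touch "$demo_dir/4/access_log"                   # written just now

# +6 matches files whose age, counted in whole days, is greater than 6
find "$demo_dir" -name access_log -mtime +6 -exec rm -f {} \;

remaining=$(cd "$demo_dir" && find . -name access_log)
echo "$remaining"
```

Because the age is counted in whole 24-hour periods, a file has to be at least seven full days old before +6 matches it, which is worth keeping in mind when choosing the time at which the cleanup runs.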
The following script runs the daily statistics with Apache installed under /home/apache:
#!/bin/sh
# back up yesterday's log (date -d yesterday is the GNU/Linux form)
/bin/cp -f /home/apache/logs/`date -d yesterday +%w`/access_log /home/apache/logs/access_log_yesterday
# remove logs older than 6 days
/usr/bin/find /home/apache/logs -name access_log -mtime +6 -exec rm -f {} \;
# analyze with webalizer
/usr/local/sbin/webalizer
Summary:
1 Use cronolog to rotate logs cleanly and safely
2 Use sort -m to sort and merge multiple logs
References:
Log analysis and statistics tools:
http://directory.google.com/Top/Computers/Software/Internet/Site_Management/Log_Analysis/
Apache log configuration:
http://httpd.apache.org/docs/mod/mod_log_config.html
Apache log rotation:
http://httpd.apache.org/docs/misc/FAQ.html#rotate
Cronolog
http://www.cronolog.org
Webalizer
http://www.mrunix.net/webalizer/
Webalizer for Windows
http://www.medasys-lille.com/webalizer/
#
# Webalizer sample configuration file
# Copyright 1997-2000 by Bradford L. Barrett ([email protected])
# Translation: Che Dong ([email protected])
#
# Distributed under the GNU General Public License. See the
# files "Copyright" and "COPYING" provided with the webalizer
# distribution for additional information.
#
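# This is a sample configuration file for the Webalizer (version 2.01).
# Lines starting with '#' are comments and are ignored by the program; blank lines are skipped as well. All other lines are configuration options
# in the form "ConfigOption Value", where ConfigOption is a valid configuration keyword and Value is the value assigned to it.
# Invalid keywords/values are ignored, with a warning. There must be at least one space or tab between the keyword and its value.
#
# As of version 0.98, the Webalizer looks for a default configuration file named webalizer.conf in the current directory
# and, if it is not found there, uses /etc/webalizer.conf.
# LogFile defines the web server log file to use. If it is not specified here and no file name is given on the command line,
# STDIN (standard input) is used as the input source.
# If the log file name ends in '.gz' (a gzip compressed file), it is decompressed on the fly as it is read.
LogFile /home/apache/logs/access_log_yesterday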
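# LogType defines the log type. The Webalizer is normally used on web server logs in CLF or Combined format.
# With this option you can also process FTP logs (xferlog as produced by wu-ftp) and Squid native logs.
# Values can be 'clf', 'ftp' or 'squid'; the default is 'clf'.
# JNH : the new 'iis' type is designed for IIS. IIS4 uses the standard log format by default, IIS5 uses the W3C format by default.
# webalizer detects the format from the log file name: standard format file names start with I, W3C ones with E.
# You can keep both kinds of logs in one directory; webalizer reads them all and produces a single report.
LogType iis
# OutputDir is the directory where the report is written. It should be a full path name, although a relative path might work as well.
# If not specified, the output directory is the current directory.
OutputDir /home/apache/htdocs/usage/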
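# HistoryName lets you set the name of the history file produced by webalizer.
# The history file keeps the data for the last 12 months, which is used to generate the main HTML page, index.html.
# The default file name is "webalizer.hist", stored in the specified output directory; an absolute path can be used to place it elsewhere.
#HistoryName webalizer.hist
# Incremental processing lets you process a large log that has been split into several smaller files; it is very useful for large sites that rotate logs weekly or daily.
# To continue where the last run stopped, the Webalizer saves its current data before exiting and restores that state on the next run.
# In this mode the Webalizer also scans for and ignores duplicate records. See the README file for a more detailed explanation.
# The value can be 'yes' or 'no'; the default is 'no'.
# The file 'webalizer.current' is used to store the current data and is located in the output directory set by OutputDir.
# Before enabling this option, please read at least the section on incremental processing in the README file.
Incremental yes
# IncrementalName lets you set the name of the file that stores the current data. As with HistoryName, the file is kept in the default output directory unless an absolute path is given.
# This option only has an effect when Incremental mode is enabled.
#IncrementalName webalizer.current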
# ReportTitle is the text shown as the report title. Unless the string is empty, the hostname is appended after it, separated by a space.
# The default is the English text "Usage Statistics for".
#ReportTitle Usage Statistics for
# HostName defines the hostname for the report. It is used in the report title and in the URL statistics, so that
# clicking a URL in the report goes to the proper location even when the report is for a 'virtual' web server
# or for a server different from the one the report resides on.
# If not specified here, webalizer tries to get the hostname via the uname call; if that fails, it defaults to "localhost".
HostName www.chedong.com
# HTMLExtension lets you set the file extension of the generated reports. The default is "html", but you can change it to whatever your site needs
# (for example for pages with embedded PHP).
#HTMLExtension html
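# PageType tells the Webalizer which types of URL you consider a 'page view'.
# Most people consider an html or cgi document a page,
# while images and audio embedded in a page are not. If nothing is specified, the page extensions for web logs are 'htm*' and 'cgi',
# and for ftp logs 'txt'. Requests without an extension, such as servlets, are also counted as pages.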
PageType htm*
PageType cgi
PageType asp
PageType p*
#PageType phtml
#PageType php3
#PageType pl
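# UseHTTPS: if the analyzed site runs on a secure server, the URL links should start with 'https://' instead of the default 'http://'.
# Set it to 'yes' if needed; the default is 'no'. This setting only affects the links in the 'Top URL's' table.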
#UseHTTPS no
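# DNSCache specifies the DNS cache file used for reverse DNS lookups; it is needed if you want reverse lookups performed on all IP
# addresses found in the log file.
# If no absolute path is given (the file name does not start with '/'), the file is placed in the output directory.
# See DNS.README for more details.
# JNH : if you use the ServerList option, you must specify the full path of DnsCache
#DNSCache dns_cache.db
# DNSChildren lets you set how many "child" processes are used to perform DNS lookups and update the DNS cache file.
# If a number is specified, the Webalizer creates the DNS cache file and updates it on every run; the DNS lookups are done
# by the given number of child processes before the log is analyzed. If DNS lookups are used, the DNS cache filename MUST be specified as
# well. The default value is 0, which disables the DNS cache file. The number of children can be from 1 to 100; larger values will affect system performance.
# A reasonable value is between 5 and 20. See DNS.README for more details.
#DNSChildren 0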
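# HTMLPre defines the HTML code inserted at the very beginning of each output page. The default is the DOCTYPE declaration below.
# Each line may be at most 80 characters; use several entries if you need more code.
#HTMLPre <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
# HTMLHead defines HTML code inserted between <HEAD></HEAD>, right after the <TITLE> line.
# Each line may be at most 80 characters; use several entries if you need more code.
#HTMLHead <META NAME="author" CONTENT="The Webalizer">
# HTMLBody defines the HTML code for the <BODY> tag line. The default is shown below.
# Each line may be at most 80 characters; use several entries if you need more code.
#HTMLBody <BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">
# HTMLPost defines the HTML code inserted at the first <HR> tag, just after the title
# and the "summary period"-"Generated on:" lines.
# As with HTMLHead, you can define as many of these as you want and
# they will be inserted in the output stream in order of appearance.
# Each line may be at most 80 characters; use several entries if you need more code.
#HTMLPost <BR CLEAR="all">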
# HTMLTail defines the HTML code to insert at the bottom of each
# HTML document, usually to include a link back to your home
# page or insert a small graphic. It is inserted as a table
# data element (ie: <TD> your code here </TD>) and is right
# aligned with the page. Max string size is 80 characters.
#HTMLTail <IMG SRC="msfree.png" ALT="100% Micro$oft free!">
# HTMLEnd defines the HTML code to add at the very end of the
# generated files. It defaults to what is shown below. If
# used, you MUST specify the </BODY> and </HTML> closing tags
# as the last lines. Max string length is 80 characters.
#HTMLEnd </BODY></HTML>
# The Quiet option suppresses output messages... Useful when run
# as a cron job to prevent bogus e-mails. Values can be either
# "yes" or "no". Default is "no". Note: this does not suppress
# warnings and errors (which are printed to stderr).
#Quiet no
# ReallyQuiet will suppress all messages including errors and
# warnings. Values can be 'yes' or 'no' with 'no' being the
# default. If 'yes' is used here, it cannot be overridden from
# the command line, so use with caution. A value of 'no' has
# no effect.
#ReallyQuiet no
# TimeMe allows you to force the display of timing information
# at the end of processing. A value of 'yes' will force the
# timing information to be displayed. A value of 'no' has no
# effect.
#TimeMe no
# GMTTime allows reports to show GMT (UTC) time instead of local
# time. Default is to display the time the report was generated
# in the timezone of the local machine, such as EDT or PST. This
# keyword allows you to have times displayed in UTC instead. Use
# only if you really have a good reason, since it will probably
# screw up the reporting periods by however many hours your local
# time zone is off of GMT.
#GMTTime no
# Debug prints additional information for error messages. This
# will cause webalizer to dump bad records/fields instead of just
# telling you it found a bad one. As usual, the value can be
# either "yes" or "no". The default is "no". It shouldn't be
# needed unless you start getting a lot of Warning or Error
# messages and want to see why. (Note: warning and error messages
# are printed to stderr, not stdout like normal messages).
#Debug no
# FoldSeqErr forces the Webalizer to ignore sequence errors.
# This is useful for Netscape and other web servers that cache
# the writing of log records and do not guarantee that they
# will be in chronological order. The use of the FoldSeqErr
# option will cause out of sequence log records to be treated
# as if they had the same time stamp as the last valid record.
# Default is to ignore out of sequence log records.
#FoldSeqErr no
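# VisitTimeout defines the timeout of a visitor session; the default is 30 minutes.
# Visits are determined by comparing the time of the current request with the time of the last request from the same site (IP).
# If the gap between the two exceeds the VisitTimeout value, the request is counted as a new visit and the visit count is incremented.
# The value is the timeout in seconds (default 1800 seconds = 30 minutes).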
#VisitTimeout 1800
# IgnoreHist shouldn't be used in a config file, but it is here
# just because it might be useful in certain situations. If the
# history file is ignored, the main "index.html" file will only
# report on the current log files contents. Useful only when you
# want to reproduce the reports from scratch. USE WITH CAUTION!
# Valid values are "yes" or "no". Default is "no".
#IgnoreHist no
# Country Graph allows the usage by country graph to be disabled.
# Values can be 'yes' or 'no', default is 'yes'.
#CountryGraph yes
# DailyGraph and DailyStats allows the daily statistics graph
# and statistics table to be disabled (not displayed). Values
# may be "yes" or "no". Default is "yes".
#DailyGraph yes
#DailyStats yes
# HourlyGraph and HourlyStats allows the hourly statistics graph
# and statistics table to be disabled (not displayed). Values
# may be "yes" or "no". Default is "yes".
#HourlyGraph yes
#HourlyStats yes
# GraphLegend allows the color coded legends to be turned on or off
# in the graphs. The default is for them to be displayed. This only
# toggles the color coded legends, the other legends are not changed.
# If you think they are hideous and ugly, say 'no' here :)
#GraphLegend yes
# GraphLines allows you to have index lines drawn behind the graphs.
# I personally am not crazy about them, but a lot of people requested
# them and they weren't a big deal to add. The number represents the
# number of lines you want displayed. Default is 2, you can disable
# the lines by using a value of zero ('0'). [max is 20]
# Note, due to rounding errors, some values don't work quite right.
# The lower the better, with 1,2,3,4,6 and 10 producing nice results.
#GraphLines 2
# The "Top" options below define the number of entries for each table.
# Defaults are Sites=30, URL's=30, Referrers=30 and Agents=15, and
# Countries=30. TopKSites and TopKURLs (by KByte tables) both default
# to 10, as do the top entry/exit tables (TopEntry/TopExit). The top
# search strings and usernames default to 20. Tables may be disabled
# by using zero (0) for the value.
#TopSites 30
#TopKSites 10
#TopURLs 30
#TopKURLs 10
#TopReferrers 30
#TopAgents 15
#TopCountries 30
#TopEntry 10
#TopExit 10
#TopSearch 20
#TopUsers 20
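# The All* keywords allow all URLs, sites (IPs), referrers,
# user agents, search strings and usernames to be displayed. If enabled, a separate HTML page is generated and a link to it
# is added below the corresponding table. Note two things: first, these listings are necessarily much larger than the Top tables; second, they are visible to the public.
# Values can be 'yes' or 'no', and all default to 'no'. For a publicly available site, these monthly listings
# can become very large: they need a lot of disk space and, if viewed often, generate a lot of traffic.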
#AllSites no
AllURLs yes
#AllReferrers no
#AllAgents no
AllSearchStr yes
#AllUsers no
# The Webalizer normally strips the string 'index.' off the end of
# URL's in order to consolidate URL totals. For example, the URL
# /somedir/index.html is turned into /somedir/ which is really the
# same URL. This option allows you to specify additional strings
# to treat in the same way. You don't need to specify 'index.' as
# it is always scanned for by The Webalizer, this option is just to
# specify _additional_ strings if needed. If you don't need any,
# don't specify any as each string will be scanned for in EVERY
# log record... A bunch of them will degrade performance. Also,
# the string is scanned for anywhere in the URL, so a string of
# 'home' would turn the URL /somedir/homepages/brad/home.html into
# just /somedir/ which is probably not what was intended.
#IndexAlias home.htm
#IndexAlias homepage.htm
# The Hide*, Group* and Ignore* and Include* keywords allow you to
# change the way Sites, URL's, Referrers, User Agents and Usernames
# are manipulated. The Ignore* keywords will cause The Webalizer to
# completely ignore records as if they didn't exist (and thus not
# counted in the main site totals). The Hide* keywords will prevent
# things from being displayed in the 'Top' tables, but will still be
# counted in the main totals. The Group* keywords allow grouping
# similar objects as if they were one. Grouped records are displayed
# in the 'Top' tables and can optionally be displayed in BOLD and/or
# shaded. Groups cannot be hidden, and are not counted in the main
# totals. The Group* options do not, by default, hide all the items
# that it matches. If you want to hide the records that match (so just
# the grouping record is displayed), follow with an identical Hide*
# keyword with the same value. (see example below) In addition,
# Group* keywords may have an optional label which will be displayed
# instead of the keywords value. The label should be separated from
# the value by at least one 'white-space' character, such as a space
# or tab.
#
# The value can have either a leading or trailing '*' wildcard
# character. If no wildcard is found, a match can occur anywhere
# in the string. Given a string "www.yourmama.com", the values "your",
# "*mama.com" and "www.your*" will all match.
# Your own site should be hidden
#HideSite *mrunix.net
#HideSite localhost
# Your own site gives most referrals
#HideReferrer mrunix.net/
# This one hides non-referrers ("-" Direct requests)
#HideReferrer Direct Request
# Usually you want to hide these
HideURL *.gif
HideURL *.GIF
HideURL *.jpg
HideURL *.JPG
HideURL *.png
HideURL *.PNG
HideURL *.ra
HideURL *.css
# Hiding agents is kind of futile
#HideAgent RealPlayer
# You can also hide based on authenticated username
#HideUser root
#HideUser admin
# Grouping options
#GroupURL /cgi-bin/* CGI Scripts
#GroupURL /images/* Images
#GroupSite *.aol.com
#GroupSite *.compuserve.com
#GroupReferrer yahoo.com/ Yahoo!
#GroupReferrer excite.com/ Excite
#GroupReferrer infoseek.com/ InfoSeek
#GroupReferrer webcrawler.com/ WebCrawler
#GroupUser root Admin users
#GroupUser admin Admin users
#GroupUser wheel Admin users
# The following is a great way to get an overall total
# for browsers, and not display all the detail records.
# (You should use MangleAgent to refine further...)
#GroupAgent MSIE Micro$oft Internet Exploder
#HideAgent MSIE
#GroupAgent Mozilla Netscape
#HideAgent Mozilla
#GroupAgent Lynx* Lynx
#HideAgent Lynx*
# HideAllSites allows forcing individual sites to be hidden in the
# report. This is particularly useful when used in conjunction
# with the "GroupDomain" feature, but could be useful in other
# situations as well, such as when you only want to display grouped
# sites (with the GroupSite keywords...). The value for this
# keyword can be either 'yes' or 'no', with 'no' the default,
# allowing individual sites to be displayed.
#HideAllSites no
# The GroupDomains keyword allows you to group individual hostnames
# into their respective domains. The value specifies the level of
# grouping to perform, and can be thought of as 'the number of dots'
# that will be displayed. For example, if a visiting host is named
# cust1.tnt.mia.uu.net, a domain grouping of 1 will result in just
# "uu.net" being displayed, while a 2 will result in "mia.uu.net".
# The default value of zero disable this feature. Domains will only
# be grouped if they do not match any existing "GroupSite" records,
# which allows overriding this feature with your own if desired.
#GroupDomains 0
# The GroupShading allows grouped rows to be shaded in the report.
# Useful if you have lots of groups and individual records that
# intermingle in the report, and you want to differentiate the group
# records a little more. Value can be 'yes' or 'no', with 'yes'
# being the default.
#GroupShading yes
# GroupHighlight allows the group record to be displayed in BOLD.
# Can be either 'yes' or 'no' with the default 'yes'.
#GroupHighlight yes
# The Ignore* keywords allow you to completely ignore log records based
# on hostname, URL, user agent, referrer or username. I hesitated in
# adding these, since the Webalizer was designed to generate _accurate_
# statistics about a web servers performance. By choosing to ignore
# records, the accuracy of reports become skewed, negating why I wrote
# this program in the first place. However, due to popular demand, here
# they are. Use the same as the Hide* keywords, where the value can have
# a leading or trailing wildcard '*'. Use at your own risk ;)
#IgnoreSite bad.site.net
#IgnoreURL /test*
#IgnoreReferrer file:/*
#IgnoreAgent RealPlayer
#IgnoreUser root
# The Include* keywords allow you to force the inclusion of log records
# based on hostname, URL, user agent, referrer or username. They take
# precedence over the Ignore* keywords. Note: Using Ignore/Include
# combinations to selectively process parts of a web site is _extremely
# inefficient_!!! Avoid doing so if possible (ie: grep the records to a
# separate file if you really want that kind of report).
# Example: Only show stats on Joe User's pages...
#IgnoreURL *
#IncludeURL ~joeuser*
# Or based on an authenticated username
#IgnoreUser *
#IncludeUser someuser
# The MangleAgents allows you to specify how much, if any, The Webalizer
# should mangle user agent names. This allows several levels of detail
# to be produced when reporting user agent statistics. There are six
# levels that can be specified, which define different levels of detail
# suppression. Level 5 shows only the browser name (MSIE or Mozilla)
# and the major version number. Level 4 adds the minor version number
# (single decimal place). Level 3 displays the minor version to two
# decimal places. Level 2 will add any sub-level designation (such
# as Mozilla/3.01Gold or MSIE 3.0b). Level 1 will attempt to also add
# the system type if it is specified. The default Level 0 displays the
# full user agent field without modification and produces the greatest
# amount of detail. User agent names that can't be mangled will be
# left unmodified.
#MangleAgents 0
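# The SearchEngine keywords let you specify search engines and the query format in their URLs, used to find out which search terms
# visitors used to find your site. The first value identifies the search engine in the referrer field of the web log; the second is
# the name of the URL parameter that carries the search terms.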
SearchEngine yahoo.com p=
SearchEngine altavista.com q=
SearchEngine google.com q=
SearchEngine eureka.com q=
SearchEngine lycos.com query=
SearchEngine hotbot.com MT=
SearchEngine msn.com MT=
SearchEngine infoseek.com qt=
SearchEngine webcrawler searchText=
SearchEngine excite search=
SearchEngine netscape.com search=
SearchEngine mamma.com query=
SearchEngine alltheweb.com query=
SearchEngine northernlight.com qr=
SearchEngine baidu.com word=
SearchEngine sina.com.cn word=
SearchEngine sohu.com word=
SearchEngine 163.com q=
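# The Dump* keywords export the statistics as tab-delimited text files, which makes it easy to import them into other applications
# such as databases and statistics software.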
# DumpPath specifies the path to dump the files. If not specified,
# it will default to the current output directory. Do not use a
# trailing slash ('/').
#DumpPath /var/lib/httpd/logs
# The DumpHeader keyword specifies if a header record should be
# written to the file. A header record is the first record of the
# file, and contains the labels for each field written. Normally,
# files that are intended to be imported into a database system
# will not need a header record, while spreadsheets usually do.
# Value can be either 'yes' or 'no', with 'no' being the default.
#DumpHeader no
# DumpExtension allow you to specify the dump filename extension
# to use. The default is "tab", but some programs are picky about
# the filenames they use, so you may change it here (for example,
# some people may prefer to use "csv").
#DumpExtension tab
# These control the export of each category of statistics.
# Values can be 'yes' or 'no'; the default is 'no'.
#DumpSites no
DumpURLs yes
DumpReferrers yes
#DumpAgents no
#DumpUsers no
DumpSearchStr yes
# End of configuration file... Have a nice day!
# begin of JNH modifications
# new entry for Win32 release
# NEW ENTRY for NT servers
# name of the default page on the server
# replaces the "Index" file of unix systems with another name
# IndexPage default
# Directory holding all the log files
# The number of files is limited to 250 per directory; to process more, move the files away and run again.
# FolderLog C:\JnhDev\WebAlizer32\Exemple de Logs\IIS4.0\Log Standard\
FolderLog C:\WINNT\system32\LogFiles\W3SVC3\
ExtentionLog log
# when log types are mixed in the same folder, webalizer sorts the files by
# name; if the file name prefixes are mixed and this sorting does not work,
# you can disable it
# default is no
# DisableSort yes
# Name of the file containing the list of servers to process, one per line, like:
# Name of Customer<SPACE>Folder of log<SPACE>Folder output<SPACE>Host Name1;Host Name 2
# sample (extract of a production file, which has 255 lines)
# all options in this file apply to all reports ...
# New: in this file you can use the double quote (") to delimit fields
# wA001 c:\WA001\LogIIS\ c:\wA001\stats wa001.LeRelaisInternet.com;www1.jeanlouisaubert.com
# wA002 c:\WA002\LogIIS\ c:\wA002\stats wa002.LeRelaisInternet.com;www.restotel.fr;www.nordpage.fr
# wA003 c:\WA003\LogIIS\ c:\wA003\stats Wa003.LeRelaisInternet.com;www.autobusavapeur.com
#ServerList c:\jnhdev\webalizer\listeserv.txt
# If your log names rotate daily, you can rename a file after processing it,
# to avoid unproductive work on later runs
# to use this option you need to use "HistoryName" and "Incremental"
RenameLog yes
NewExtension sav
# 2 new options to optimize DNS resolution: the time to live in the cache database
# for good DNS resolutions (default is 30 days) and for bad resolutions, like
# no reverse IP; in this case it is better to store errors in the database file
# because each day bad DNS lookups consume a lot of time (default 7 days)
#TtlDns 30
#TtlDnsError 7
# new option to convert each record date to local time before processing it ...
# Test only
# default = No
ConvertTime yes
# end of JNH .. Have a nice day !!!
Note: for IIS logs, the two fields bytes sent (sc-bytes) and referer must be enabled in the logging configuration.
Download:
http://awstats.sourceforge.net
Unpack:
tar zxf awstats-5.4.tgz
Back-end statistics: put the application in one directory
mv awstats-5.4/wwwroot/cgi-bin /home/awstats
mv awstats-5.4/tools/* /home/awstats
Output directory:
mkdir /home/apache/htdocs/awstats
mv awstats-5.4/wwwroot/icon/ /home/apache/htdocs/
Configuration:
cd /home/awstats/
cp awstats.model.conf awstats.chedong.conf
vi awstats.chedong.conf
Change:
LogFile="/home/apache/logs/access_log_yesterday"
SiteDomain="www.chedong.com"
Scheduled run: every morning
40 5 * * * (cd /home/awstats; ./awstats_buildstaticpages.pl -config=chedong -update -lang=cn -dir=/home/apache/htdocs/awstats/ -awstatsprog=./awstats.pl)
-config=chedong the configuration to use
-update process the new log entries first, then generate the output
-lang=cn Chinese interface
-dir=/home/apache/htdocs/awstats/ output directory
-awstatsprog=./awstats.pl location of the program to call
The output is at:
http://www.chedong.com/awstats/awstats.chedong.html
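Setup on Windows:
The list of IIS log fields that matches what AWStats expects:
Date date
Time time
Client IP address c-ip
User name cs-username
Method cs-method
URI stem cs-uri-stem
Protocol status sc-status
Bytes sent sc-bytes
Protocol version cs-version
User agent cs(User-Agent)
Referer cs(Referer)
Compared with the default settings, the fields removed are:
Server IP address
Server port
URI query
and the fields added are:
Bytes sent
Protocol version
Referer
Script to run automatically:
d:\Perl\bin\perl.exe d:\AWStats\awstats_buildstaticpages.pl -update -config=chedong -lang=cn -dir=c:\html\awstats\ -awstatsprog=d:\awstats\awstats.pl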