Benchmarks

Thread Starvation

March 5, 2007

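While running some benchmarks, I noticed that http_load was hitting timeouts on some connections. This showed up with every web server and with the different backends in lighttpd.

After patching http_load to handle timed-out and byte-count errors correctly, I could easily separate the timeouts from other problems. In one of the latest changesets I added infrastructure to track the time spent on a single connection in lighttpd, including the time spent in the different stages of the gthread-aio backend:

scheduling the threaded read()
starting the read() in the thread
waiting until it is finished
sending the result to the main loop
writing the buffered data to the socket

You can enable this timing by setting the define LOG_TIMING.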

network_gthread_aio.c.78: (trace) write-start: 1173101341.237616 read-queue-wait: 680 ms read-time: 0 ms write-time: 21 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.229014 read-queue-wait: 576 ms read-time: 0 ms write-time: 134 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.240815 read-queue-wait: 681 ms read-time: 6 ms write-time: 19 ms

I wrote a script that extracts these timings from the error log and uses gnuplot to turn them into images.

#!/bin/sh

## parse the errorlog for the lines from the timing_print
## - extract the numbers
## - sort it by start-time
## - make all timestamps relative to the first start-time

cat benchmark.error.log | \
    grep "network_gthread_aio.c.78:" | \
    awk '{ print $4,$6,$9,$12 }'  | \
    sort -n | \
    perl -ne '@e=split / /;if (!$start) { $start = $e[0]; } $e[0] = $e[0] - $start; print join " ", @e; ' > $1.data

cat <<EOF
set autoscale
set terminal png
set xlabel "start-time of a request"
set ylabel "ms per request"
set yrange [0:30000]

set title "$1"
set output "$1.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3) title "read-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3 + \$4) title "write-time" with dots

set title "$1 (read-queue-wait)"
set output "$1-read-queue-wait.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.8

set title "$1 (read)"
set output "$1-read.png"
plot \
"$1.data" using 1:3 title "read-time" with points ps 0.8 pt 2

set title "$1 (write)"
set output "$1-write.png"
plot \
"$1.data" using 1:4 title "write-wait-time" with points ps 0.8 pt 3

EOF
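For readers who prefer a single script over the shell pipeline, the extraction step can be sketched in Python. This is an illustrative equivalent, not the script used for the graphs. The function name `parse_timings` is made up for this example, and the regular expression assumes the log format of the sample lines shown earlier. Unlike the `grep` above, it matches any line that carries the timing fields, whatever source file it names.

```python
import re

# Matches the timing fields of lines such as
# "... write-start: 1173101341.237616 read-queue-wait: 680 ms read-time: 0 ms write-time: 21 ms"
TIMING_RE = re.compile(
    r"write-start: (?P<start>\d+\.\d+) "
    r"read-queue-wait: (?P<wait>\d+) ms "
    r"read-time: (?P<read>\d+) ms "
    r"write-time: (?P<write>\d+) ms"
)

def parse_timings(lines):
    """Return (relative_start_s, wait_ms, read_ms, write_ms) tuples sorted by start time."""
    rows = []
    for line in lines:
        m = TIMING_RE.search(line)
        if m is None:
            continue  # not a timing line
        rows.append((float(m["start"]), int(m["wait"]), int(m["read"]), int(m["write"])))
    rows.sort()
    if not rows:
        return []
    first = rows[0][0]
    # Make every timestamp relative to the earliest write-start.
    return [(start - first, wait, read, write) for start, wait, read, write in rows]

sample = [
    "network_gthread_aio.c.78: (trace) write-start: 1173101341.237616 read-queue-wait: 680 ms read-time: 0 ms write-time: 21 ms",
    "network_gthread_aio.c.78: (trace) write-start: 1173101341.229014 read-queue-wait: 576 ms read-time: 0 ms write-time: 134 ms",
    "network_gthread_aio.c.78: (trace) write-start: 1173101341.240815 read-queue-wait: 681 ms read-time: 6 ms write-time: 19 ms",
]
for row in parse_timings(sample):
    print("%.6f %d %d %d" % row)
```

The printed columns follow the same order as the `.data` file written by the shell script: relative start time, read-queue-wait, read-time, write-time.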

The first benchmark was run with

./http_load -parallel 100 -fetches 500 ./http-load.10M.urls-10G

and

server.max-read-threads = 64
## compiled with 64k read-ahead

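The time spent in read() fetching the data from disk goes up:

[Figure: 64 threads]

In more detail:

[Figure: 64 threads, read() calls]

If you reduce the number of threads to 4, you get

[Figure: 4 threads]

and the read time drops to

[Figure: 4 threads, read() calls]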

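For our timeouts, only the points beyond the 4 second range are interesting, as they are our starving read() threads. If they take too long to finish, the client closes the connection and the user gets a broken transfer.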
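A quick way to count the starved requests is to filter the extracted rows by the 4 second threshold. This is a minimal sketch, assuming rows in the `start wait read write` column order of the `.data` file. The function name and the choice to compare the sum of the three stages against the threshold are assumptions of this example, not something the original analysis prescribes.

```python
def starved_requests(rows, threshold_ms=4000):
    """Return the rows whose total time (wait + read + write) exceeds threshold_ms."""
    return [row for row in rows if row[1] + row[2] + row[3] > threshold_ms]

# Columns: relative start time (s), read-queue-wait, read-time, write-time (all ms).
# The first row reuses a sample log line; the other two are made up for illustration.
rows = [
    (0.00, 680, 0, 21),
    (0.01, 3900, 250, 40),
    (0.02, 120, 5200, 15),
]
slow = starved_requests(rows)
print(len(slow), "of", len(rows), "requests took longer than 4 s")
```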

As the graphs above show, reducing the number of threads helps to limit the impact of the problem.

threads-runnable = threads-started - threads-blocked

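As more and more threads stall, fewer and fewer threads are able to run, and the chance that a stalled thread gets CPU time again goes up. In the worst case, all available threads are waiting, and at least one of them will finish.

Rule of thumb

Keep the maximum number of threads at twice the number of disks.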
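The formula and the rule of thumb are simple enough to write down as two helpers. Both function names are hypothetical, and the count of 60 blocked threads is an invented input, not a measurement. The disk count of 2 assumes the two disks (sda and sdb) that appear in the iostat output of the raw IO section.

```python
def runnable_threads(started, blocked):
    """threads-runnable = threads-started - threads-blocked"""
    return started - blocked

def suggested_max_read_threads(disks):
    """Rule of thumb from the text: twice the number of disks."""
    return 2 * disks

# Hypothetical situation: 64 read threads started, 60 of them stuck in read().
print(runnable_threads(started=64, blocked=60))

# Assuming the two-disk test machine.
print(suggested_max_read_threads(disks=2))
```

For two disks the rule gives 4 threads, which is the reduced setting used in the second run above.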

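Buffered IO Performance

February 11, 2007

Besides the raw IO performance, which matters when you serve large sets of static files, buffered IO performance is more interesting for sites with a small set of static files that can be kept in the file system cache.

As we use a hot cache in this benchmark, the "lightness" of the server becomes important. The fewer syscalls, the better.

The test case consists of 100MB of files, in sizes of 10MB and 100KB.

Benchmark

100KB

100MB of 100KB files served from a hot cache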

lighttpd
Backend	MB/s	Requests/s	User + System
writev	82.20	802.71	90%
linux-sendfile	70.27	686.32	56%
gthread-aio	75.39	736.23	98%
posix-aio	73.10	713.88	98%
linux-aio-sendfile	31.32	305.90	35%
others
Apache 2.2.4 (event)	70.28	686.38	60%
LiteSpeed 3.0rc2	70.20	685.65	50%

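linux-aio-sendfile loses most of its performance because it has to work with O_DIRECT, which always means unbuffered reads.
Apache, LiteSpeed and linux-sendfile use the same syscall, sendfile(), and end up with the same performance values.
gthread-aio and posix-aio perform better than sendfile().
writev() performs better than threaded AIO and sendfile(). I can't explain that right now :)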

10MB

100MB of 10MB files served from a hot cache. The benchmark command has been changed and is now the same as in the other benchmarks.

$ http_load -verbose -timeout 40 -parallel 100 -fetches 500 http-load.10M.urls-100M

When we use the -seconds option, http_load does a hard cutoff, and we can lose some MB/s to incomplete transfers.

lighttpd
Backend	MB/s	Requests/s	User + System
writev	82.20	8.76	80%
linux-sendfile	53.95	5.65	40%
gthread-aio	83.02	8.66	90%
posix-aio	82.31	8.60	93%
linux-aio-sendfile	70.17	7.35	60%
others
Apache 2.2.4 (event)	50.92	5.33	40%
LiteSpeed 3.0rc2	55.58	5.80	40%

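All sendfile() implementations seem to have the same performance problem.
The writev() and threaded AIO backends utilize the network as expected.
linux-aio-sendfile is faster than buffered sendfile(), even though it has to read everything from disk ... strange.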

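Raw IO Performance

February 3, 2007

In lighttpd 1.5.0 we support several network backends. Their job is to fetch static data from disk and send it to the client.

We want to compare the performance of the different backends and find out when to use which one.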

writev
linux-sendfile
gthread-aio
posix-aio
linux-aio-sendfile

The impact of the stat threads should also be examined.

We used a minimal configuration file:

server.document-root     = "/home/jan/wwwroot/servers/grisu.home.kneschke.de/pages/"

server.port              = 1025
server.errorlog          = "/home/jan/wwwroot/logs/lighttpd.error.log"

server.network-backend   = "linux-aio-sendfile"
server.event-handler     = "linux-sysepoll"
server.use-noatime       = "enable"

server.max-stat-threads  = 2
server.max-read-threads = 64

iostat, vmstat and http_load

We use iostat and vmstat to see how the system handles the load.

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 506300  12900  20620 437968    0    0 14204    17 5492  3323  4 23  9 63
 0  1 506300  11212  20620 439888    0    0 17720     4 6713  3966  3 29  3 66
 0  1 506300  11664  20632 440356    0    0 14460     8 5416  3120  2 24  2 71
 1  0 506300  18916  20612 433168    0    0 13180    50 5505  3088  2 23  2 72
 0  1 506300  11960  20628 440188    0    0 15860     6 5485  3307  2 24  3 71

$ iostat -xm 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
       2.20    0.00   24.40   70.40    0.00    3.00

Device:    rrqm/s wrqm/s    r/s   w/s  rsec/s  wsec/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda         67.40   0.40  68.00  0.80 13600.00   14.40     6.64     0.01   197.88    12.87  176.83  11.23  77.28
sdb         82.80   0.40  84.60  0.80 17000.00   14.40     8.30     0.01   199.23    23.18  280.16  11.61  99.12
md0          0.00   0.00 302.20  0.80 30520.00    6.40    14.90     0.00   100.75     0.00    0.00   0.00   0.00

Our http_load process returned:

>http_load -verbose -timeout 40 -parallel 100 -seconds 60 urls.100k
9117 fetches, 100 max parallel, 9.33581e+008 bytes, in 60 seconds
102400 mean bytes/connection
151.95 fetches/sec, 1.55597e+007 bytes/sec
msecs/connect: 5.47226 mean, 31.25 max, 0 min
msecs/first-response: 144.433 mean, 3156.25 max, 0 min
HTTP response codes:
  code 200 -- 9117

We will use the same benchmark and the same configuration to compare the different backends.
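To turn an http_load summary into the two numbers used in the comparison tables, you can pull them out of the `fetches/sec` line. This is a small sketch. The function name is hypothetical, and 1 MB is taken as 10**6 bytes, an assumption that happens to reproduce the table values.

```python
import re

def summarize_http_load(output):
    """Extract requests/s and throughput in MB/s (1 MB = 10**6 bytes) from http_load output."""
    m = re.search(r"([\d.]+) fetches/sec, ([\d.eE+-]+) bytes/sec", output)
    if m is None:
        raise ValueError("no 'fetches/sec' line found")
    fetches_per_sec = float(m.group(1))
    bytes_per_sec = float(m.group(2))
    return fetches_per_sec, bytes_per_sec / 1e6

# Excerpt of the http_load output shown above.
output = """\
9117 fetches, 100 max parallel, 9.33581e+008 bytes, in 60 seconds
102400 mean bytes/connection
151.95 fetches/sec, 1.55597e+007 bytes/sec
"""
req_per_sec, mb_per_sec = summarize_http_load(output)
print("%.2f req/s, %.2f MB/s" % (req_per_sec, mb_per_sec))
```

For the run above this gives 151.95 req/s and 15.56 MB/s, the values listed for linux-aio-sendfile in the 100K table.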

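Comparison

For comparison, I tried to include other web servers as well. As always, benchmarks have to be taken with a grain of salt. Don't trust them; try to repeat the tests yourself.

lighttpd 1.5.0-svn, configured as above
litespeed 3.0rc2
- epoll and sendfile are enabled. All other options are at their defaults.
Apache 2.2.4 event-mpm from the OpenSUSE 10.2 package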
- MinSpareThreads = 25
- MaxSpareThreads = 75
- ThreadLimit = 64
- ThreadsPerChild = 25

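I will try to provide a stripped-down, text-based configuration file that contains only the options others need to repeat the tests.

Expectations

The benchmark is meant to show that async file IO works well for single-threaded web servers. We expect that:

the blocking network backends are slow
Apache 2.2.4 delivers the best performance, as it combines threads with an event-driven design
lighttpd + async file IO gets into the range of Apache2

The problem with blocking file IO is that a single-threaded server can do nothing else while it waits for a syscall to finish.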
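The general idea behind a threaded read backend can be sketched with a worker pool: the main thread queues the blocking read() and stays free for other work, while each worker records how long the request waited in the queue and how long the read took. This is an illustration in Python using `concurrent.futures`, not lighttpd's C implementation. The function `timed_read`, the pool size of 4 and the temporary 100KB file are all assumptions of this example.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def timed_read(path, queued_at):
    """Run a blocking read in a worker thread and record how long each stage took."""
    started_at = time.monotonic()
    with open(path, "rb") as f:
        data = f.read()
    finished_at = time.monotonic()
    return {
        "data": data,
        "queue_wait_ms": (started_at - queued_at) * 1000.0,
        "read_ms": (finished_at - started_at) * 1000.0,
    }

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100 * 1024)
    path = tmp.name

try:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Queue the reads; the main thread is free to do other work meanwhile.
        futures = [pool.submit(timed_read, path, time.monotonic()) for _ in range(8)]
        results = [f.result() for f in futures]
finally:
    os.unlink(path)

for r in results:
    print("read-queue-wait: %.1f ms read-time: %.1f ms bytes: %d"
          % (r["queue_wait_ms"], r["read_ms"], len(r["data"])))
```

With more queued requests than workers, the later requests show a growing queue wait. The measured values depend on the machine, so no sample output is given here.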

Benchmark

Running http_load against the different backends shows the impact of async IO versus sync IO.

100K files

lighttpd
Backend	Throughput	Requests/s
writev	6.11MB/s	59.77
linux-sendfile	6.50MB/s	63.62
posix-aio	12.88MB/s	125.75
gthread-aio	15.04MB/s	147.08
linux-aio-sendfile	15.56MB/s	151.95
others
litespeed 3.0rc2 (writev)	4.35MB/s	42.78
litespeed 3.0rc2 (sendfile)	5.49MB/s	53.68
apache 2.2.4	15.04MB/s	146.93

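For small files, async IO gives a throughput gain of about 140% (linux-sendfile at 6.50MB/s versus linux-aio-sendfile at 15.56MB/s).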
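The gain can be recomputed from the table. A short sketch, with a hypothetical helper name and the throughput values copied from the 100K table above:

```python
def percent_gain(baseline, improved):
    """Relative throughput gain of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0

# Throughput in MB/s for 100K files, taken from the table above.
blocking = {"writev": 6.11, "linux-sendfile": 6.50}
async_io = {"posix-aio": 12.88, "gthread-aio": 15.04, "linux-aio-sendfile": 15.56}

best_async = max(async_io.values())
for name, mbps in blocking.items():
    print("%s -> best async backend: +%.0f%%" % (name, percent_gain(mbps, best_async)))
```

Measured against the fastest async backend, the gain is roughly 155% over writev and 139% over linux-sendfile.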

Without no-atime

To show the impact of server.use-noatime = "enable", we compare the vmstat output of the gthread-aio backend with noatime enabled and disabled.

With O_NOATIME enabled:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 62 506300   9192  20500 470064    0    0 12426     5 7005  6324  3 27  7 63
 0 63 506300  10188  19768 469732    0    0 14154     2 8252  7614  3 30  0 67
 0 64 506300  10488  19124 470492    0    0 13589     0 8261  7483  3 27  0 69
 0 64 506300  10196  17952 473092    0    0 13062     8 7388  6560  3 25  8 65
 0 64 506300  10656  16836 474720    0    0 11790     0 6378  5074  2 23 11 64

With O_NOATIME disabled:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 21 506300  10408  15452 491680    0    0 10515   326 5362  1619  2 22 19 57
 3  7 506300  11116  17888 487588    0    0 11020   493 6056  2400  7 25 10 58
 0  0 506300  10200  19704 488004    0    0  8840   365 4506  1622  2 20 29 49
 2 14 506300  10460  21624 485288    0    0 12422   428 6464  1986  2 26 11 60
 0  1 506300   9436  24116 485316    0    0 12640   513 7159  2109  3 28  2 67
 0 21 506300  11864  25768 481588    0    0  8760     5 4436  1571  2 19 39 41
 0 21 506300  10352  24892 483412    0    0 11941   339 6005  1913  3 24 12 6

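You can see how bo (blocks out) goes up and bi (blocks in) goes down correspondingly. As you usually don't need atime (the file access time), you should either mount the file system with the noatime,nodiratime options or set server.use-noatime = "enable". By default this setting is disabled to stay backward compatible.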
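To put numbers on the difference, the bi and bo columns of the two vmstat runs can be averaged. The helper below is a sketch with a made-up name; the rows are copied from the two listings above, and the column order is the one from the vmstat header.

```python
def column_means(vmstat_rows, columns=("bi", "bo")):
    """Average the named vmstat columns over all sample rows."""
    header = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()
    samples = [dict(zip(header, map(int, row.split()))) for row in vmstat_rows]
    return {c: sum(s[c] for s in samples) / len(samples) for c in columns}

noatime_enabled = [
    "1 62 506300   9192  20500 470064    0    0 12426     5 7005  6324  3 27  7 63",
    "0 63 506300  10188  19768 469732    0    0 14154     2 8252  7614  3 30  0 67",
    "0 64 506300  10488  19124 470492    0    0 13589     0 8261  7483  3 27  0 69",
    "0 64 506300  10196  17952 473092    0    0 13062     8 7388  6560  3 25  8 65",
    "0 64 506300  10656  16836 474720    0    0 11790     0 6378  5074  2 23 11 64",
]
noatime_disabled = [
    "0 21 506300  10408  15452 491680    0    0 10515   326 5362  1619  2 22 19 57",
    "3  7 506300  11116  17888 487588    0    0 11020   493 6056  2400  7 25 10 58",
    "0  0 506300  10200  19704 488004    0    0  8840   365 4506  1622  2 20 29 49",
    "2 14 506300  10460  21624 485288    0    0 12422   428 6464  1986  2 26 11 60",
    "0  1 506300   9436  24116 485316    0    0 12640   513 7159  2109  3 28  2 67",
    "0 21 506300  11864  25768 481588    0    0  8760     5 4436  1571  2 19 39 41",
    "0 21 506300  10352  24892 483412    0    0 11941   339 6005  1913  3 24 12 6",
]

for label, rows in (("enabled", noatime_enabled), ("disabled", noatime_disabled)):
    means = column_means(rows)
    print("noatime %s: bi %.0f, bo %.0f" % (label, means["bi"], means["bo"]))
```

Over these samples, the average bo rises from 3 to about 353 blocks when noatime is disabled, while the average bi falls from about 13004 to about 10877.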

10MB files

lighttpd
Backend	Throughput	Requests/s	Disk util %	User + System
writev	17.59MB/s	2.35	50%	25%
linux-sendfile	33.13MB/s	3.77	70%	30%
posix-aio	50.61MB/s	5.69	98%	60%
gthread-aio	47.97MB/s	5.51	100%	50%
linux-aio-sendfile	44.15MB/s	4.95	90%	40%
others
litespeed 3.0rc2 (sendfile)	22.18MB/s	2.72	65%	35%
Apache 2.2.4	42.81MB/s	4.73	95%	40%

For larger files, the advantage of async IO is still around 50% (linux-sendfile at 33.13MB/s versus posix-aio at 50.61MB/s).

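stat() threads

For small files, performance is heavily influenced by the stat() syscall. Before a file is open()ed for reading, the server first checks whether the file exists, whether it is a regular file, and whether it can be read. This syscall is not asynchronous by itself.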
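The three checks can be illustrated in a few lines. This is a sketch of the idea in Python, not lighttpd's code; `can_serve` is a hypothetical name. The `os.stat()` call is the step that can block on disk IO, which is why moving it into a separate thread helps.

```python
import os
import stat
import tempfile

def can_serve(path):
    """Return True if path exists, is a regular file, and is readable by this process."""
    try:
        st = os.stat(path)          # one stat() call; may block on disk IO
    except OSError:
        return False                # missing file or inaccessible path
    if not stat.S_ISREG(st.st_mode):
        return False                # directories, sockets, devices, ...
    return os.access(path, os.R_OK)

with tempfile.TemporaryDirectory() as d:
    f = os.path.join(d, "index.html")
    with open(f, "w") as fh:
        fh.write("hello")
    print(can_serve(f))                              # regular, readable file
    print(can_serve(d))                              # a directory
    print(can_serve(os.path.join(d, "missing")))     # does not exist
```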

We run the benchmark again with the gthread-aio backend and the 100K file set, this time changing server.max-stat-threads from 0 to 16.

Threads	Throughput
0	8.55MB/s
1	13.60MB/s
2	14.18MB/s
4	12.33MB/s
8	12.62MB/s
12	13.10MB/s
16	12.71MB/s

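For best performance, you should set the number of stat threads equal to the number of your disks.

Read threads

You can also tune the number of read threads. Every disk read request is queued and then executed in parallel by a pool of readers. The goal is to keep the disks at 100% utilization and to hide the seek overhead of stat() and read() in the time lighttpd spends sending data to the network.

Threads	Throughput
1	6.83MB/s
2	11.61MB/s
4	13.02MB/s
8	13.61MB/s
16	13.81MB/s
32	14.04MB/s
64	14.87MB/s

It looks like 2 read threads per disk is already a good value.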
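The diminishing returns are easier to see as the gain per doubling of the thread count. A short sketch with a hypothetical helper name, using the values from the read-thread table above:

```python
# Read threads -> throughput in MB/s, taken from the table above.
throughput = {1: 6.83, 2: 11.61, 4: 13.02, 8: 13.61, 16: 13.81, 32: 14.04, 64: 14.87}

def gains_per_doubling(table):
    """Percent gain in throughput for each step to the next larger thread count."""
    counts = sorted(table)
    return [
        (prev, cur, (table[cur] / table[prev] - 1.0) * 100.0)
        for prev, cur in zip(counts, counts[1:])
    ]

for prev, cur, gain in gains_per_doubling(throughput):
    print("%2d -> %2d threads: +%.1f%%" % (prev, cur, gain))
```

Going from 1 to 2 threads adds about 70%, going from 2 to 4 adds about 12%, and every later doubling adds less than 6%.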
