Question

Antonio

Asked: 2012-09-17 01:39:32 +0800 CST

iSCSI timeouts under high load

772

I have two servers connected via Gigabit Ethernet. One is the iSCSI target, the other is the initiator. When I run mkfs.ext4 on the initiator, disk I/O on the target slows down critically after a while. On the target host I can see the following in syslog:

Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668c 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668c 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668d 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668d 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119668e 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119668e 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 1196696 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 1196696 6
Sep 14 09:40:03 sh11 tgtd: abort_task_set(1139) found 119669e 0
Sep 14 09:40:03 sh11 tgtd: abort_cmd(1115) found 119669e 6
Sep 14 09:40:04 sh11 tgtd: abort_task_set(1139) found 119669f 0
Sep 14 09:40:04 sh11 tgtd: abort_cmd(1115) found 119669f 6

And load average grows to 12 or even more:

# uptime
 12:37:00 up 23 days, 13:25,  1 user,  load average: 12.00, 7.00, 4.00

CentOS 6.3
tgtd 1.0.24
Intel Pentium 4 2.4GHz
1 GB RAM
2 TB WD Caviar Green SATA 2.0

# lspci
00:00.0 Host bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE DRAM Controller/Host-Hub Interface (rev 02)
00:01.0 PCI bridge: Intel Corporation 82845G/GL[Brookdale-G]/GE/PE Host-to-AGP Bridge (rev 02)
00:1d.0 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #3 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801DB/DBM (ICH4/ICH4-M) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 82)
00:1f.0 ISA bridge: Intel Corporation 82801DB/DBL (ICH4/ICH4-L) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801DB (ICH4) IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) SMBus Controller (rev 02)
00:1f.5 Multimedia audio controller: Intel Corporation 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) AC'97 Audio Controller (rev 02)
01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RV200 QW [Radeon 7500]
02:01.0 Ethernet controller: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11) (rev 11)
02:02.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE/SATA Controller (rev 50)
02:03.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE/SATA Controller (rev 50)
02:04.0 RAID bus controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
02:08.0 Ethernet controller: Intel Corporation 82801DB PRO/100 VE (CNR) Ethernet Controller (rev 82)

Is there a way to tune target host to avoid these timeouts?

Update: the failing disk shows the following values:

# smartctl -A /dev/sdb
smartctl 5.42 2011-10-20 r3458 [i686-linux-2.6.32-279.2.1.el6.i686] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   167   167   021    Pre-fail  Always       -       6633
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       93
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9444
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       91
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       64
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       722663
194 Temperature_Celsius     0x0022   104   092   000    Old_age   Always       -       46
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

4 Answers

pfo · Answer 1 · 2012-10-05T12:54:29+08:00

Please note that a WD Caviar Green SATA disk is allowed to take anywhere from 20 seconds to minutes (or even longer) to recover from read/write errors. WD disables "time-limited error recovery" (TLER) on its desktop drives. Enterprise drives from WD have a 7-second time limit for reads and a 0-second time limit for recovery of writes. Since you're creating a new file system with mkfs, the call will touch a lot of sectors on that drive. Some of them may need recovery, which could be the source of the timeouts.
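
Two limits interact here: the drive's own error-recovery time and the kernel's per-command SCSI timeout on the target host. Below is a minimal sketch for checking the kernel side. It assumes the backing disk is sdb, as in the question's smartctl call. The function name print_cmd_timeout and its optional second argument (an alternate sysfs root, only useful for testing) are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# print_cmd_timeout DISK [SYSFS_ROOT]
# Print the kernel's SCSI command timeout (in seconds) for DISK.
# SYSFS_ROOT defaults to /sys; the override exists only for testing.
print_cmd_timeout() {
    f="${2:-/sys}/block/$1/device/timeout"
    if [ -r "$f" ]; then
        echo "$1: kernel command timeout $(cat "$f") s"
    else
        echo "$1: no timeout file at $f" >&2
        return 1
    fi
}

# Example (on the target host):
#   print_cmd_timeout sdb
```

The drive side can be queried with `smartctl -l scterc /dev/sdb`. Many desktop drives report SCT Error Recovery Control as unsupported, in which case the only knob left is the kernel timeout. Writing a larger number of seconds to that timeout file as root raises it until the next reboot.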

To be on the safe side, consider running badblocks on the target's backing disk.
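
A rough sketch of such a scan, assuming the backing disk is /dev/sdb and that the iSCSI export has been stopped first. The wrapper name scan_readonly and the default output file name are hypothetical. badblocks runs in read-only mode unless -n or -w is given, so this form should not modify data.

```shell
#!/bin/sh
# scan_readonly DEVICE [OUTPUT_FILE]
# Run a read-only badblocks scan and save the list of bad blocks.
# Refuses to run on anything that is not a block device.
scan_readonly() {
    dev="$1"
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 2
    fi
    # -s: show progress, -v: verbose, -b: block size in bytes,
    # -o: write the list of bad blocks to a file
    badblocks -sv -b 4096 -o "${2:-badblocks.txt}" "$dev"
}

# Example (as root, with the target daemon stopped):
#   scan_readonly /dev/sdb
```

On a 2 TB drive a full pass can take many hours, and a disk that stalls on error recovery will make it slower still. The stalls themselves are a useful symptom to watch for.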

Also note that SMART may show 0 errors even if the disk is dead.

5

Falcon Momot · Answer 2 · 2012-10-01T15:38:51+08:00

If the load average is increasing on the target, you might have a disk failure (it would increase due to hardware interrupts caused by retried writes), or you may have hit a bug. If it is increasing on the initiator, there are many possible causes, such as a problem with the iSCSI daemon or a network fault.
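
One way to see what is behind a rising load average: on Linux the figure counts tasks in uninterruptible sleep (state D, usually waiting on I/O) as well as runnable ones, so a stalled disk can push it up even when the CPU is mostly idle. The helper below is a small sketch; count_blocked is a hypothetical name, and it expects ps-style lines on stdin with the process state in the first column.

```shell
#!/bin/sh
# count_blocked: read "STATE PID COMMAND" lines on stdin and print
# how many tasks are in uninterruptible sleep (state starts with D).
count_blocked() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Example (on the target host, while mkfs.ext4 runs on the initiator):
#   ps -eo state=,pid=,comm= | count_blocked
```

If the count tracks the load average and the blocked tasks include tgtd, that points at the backing disk rather than at the network.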

In the past, this error message has been caused by bugs in tgtd; consider updating if you aren't already up to date.

1

Antonio · Answer 3 · 2013-02-13T01:59:10+08:00

Best Answer

The issue was that the WD Caviar Green disk had a defect that the SMART test did not detect. After the disk was replaced, the problem went away.

1

Nikolaidis Fotis · Answer 4 · 2012-10-06T04:28:41+08:00

It seems to be writing sequentially, so, as mentioned earlier, you probably have some bad blocks on your disk. Also, which mode is your disk running in: AHCI or ATA? (You can check this in the BIOS.)
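
The controller mode can also be read from the running system instead of the BIOS. The sketch below filters `lspci -k` output down to storage controllers and the kernel driver bound to each. storage_drivers is a hypothetical name, and the match on SATA, IDE and RAID in the device description is only a rough filter.

```shell
#!/bin/sh
# storage_drivers: read "lspci -k" output on stdin and print each
# storage controller together with the kernel driver in use.
storage_drivers() {
    awk '
        # Device lines start with the bus address; detail lines are indented.
        /^[0-9a-f]/ { keep = ($0 ~ /SATA|IDE|RAID/); dev = $0 }
        keep && /Kernel driver in use:/ {
            sub(/^.*in use: */, "")
            print dev " -> " $0
        }
    '
}

# Example:
#   lspci -k | storage_drivers
```

A driver named ahci means the controller runs in AHCI mode. Any other driver name means the disk sits behind a vendor-specific or legacy driver.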

0
