My experience with ZFS has generally been that it just works, so I expect the answer will be, it’s not a problem — but I have a data pool which will ruin my January if it fubars, so I want to double-check.
This is a question that could actually come up in two different situations involving a separate data pool. Right now I’m dealing with the first, but I’ve also wondered about the second:
- The storage for the system disk (i.e., the one holding `rpool`) fails, but storage for the data pool is fine, so you want to restore the system disk from backups but just keep going with the live storage of the data pool.
- You have Solaris running in a VM and want to roll back to a snapshot the hypervisor has taken (not a ZFS snapshot of `rpool`), but the data pool is stored on disks that are in “independent” mode, RDMs, etc., so they will not be rolled back.
In both of these situations, when Solaris is booted back up, it’s going to see a data pool that it knows about but which is in a state it had never (as far as it would remember) put it into.
I’m primarily concerned with the case where the system was cleanly shut down before the system disk is rewound, and where it had also been cleanly shut down prior to the image it’s being rewound to. I’d expect switching between running states to be a bit trickier.
Note also that in my particular case, the pool’s storage geometry and paths to the storage have not changed. Again, I’d expect this to be trickier if they had.
I wouldn’t even be asking this with Windows and NTFS, because that’s a comparatively simplistic, decoupled system, so it’s hard to see why it wouldn’t work. However, Solaris seems to keep some kind of pool metadata out of band, as evidenced by the fact that you’re supposed to `zpool export` and `zpool import` when you move pools between systems (something I’ve never done in that manner, thanks to VMware). My knowledge of this metadata and its purpose is limited, so it’s hard for me to reason about the impact in this situation. (An explanation of this would be great!)
I actually still have access to the pre-rollback system. It’s sitting in a VMFS datastore backed by an HP SmartArray that threw a 1716 POST warning after an ill-fated preventive-maintenance disk change (which lost data, because SmartArray is dumber than ZFS). All the important VMs still seem fine, and scans of their filesystems found no errors, but I plan to restore the array from a very recent backup anyway: I have reason to suspect that ESXi silently zeros bad sectors instead of passing the errors through to the guests, and I don’t want some zeroed sector lurking somewhere to bite me in the butt later.
For the Solaris VM, I don’t have to worry about zeroed sectors, because ZFS would catch that, but most of the other VMs use dumb filesystems. The backup is an image of the whole VMware datastore, though, so fixing them will roll back the Solaris VM too. Actually, I did a scrub on the `rpool` of this VM and it found no errors, so, hell, if I wanted I could just stash its VMDK somewhere else and copy it back in after the rollback, and then this whole question would be moot. I guess that’s what I’ll do if nobody answers, lol. But it’s something I’ve wondered about for a while, so I’ll still ask.
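For the record, the scrub was nothing exotic, just the standard routine, watched via `zpool status` until it finished with zero errors:

```
# Scrub the root pool, then poll until it reports no errors
zpool scrub rpool
zpool status -v rpool
```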
So, the question is: can I just go ahead and roll back the system disk’s storage and be done with it? Or would I have to export the pool from the pre-rollback system, roll back, delete the pool before attaching its storage, then attach the storage and import the pool? I don’t like the sound of the latter, partly because both CIFS and iSCSI are being served from that pool and I don’t remember offhand how I set those up, or even how to do so, so if they break I’ll be mad. (Can you tell we don’t have a full-time sysadmin? lol)
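In case it helps frame an answer, here’s my best guess at what the careful route would look like. `tank` is a stand-in for my real pool name, and the `stmfadm`/`itadm` lines are my attempt at recording the COMSTAR iSCSI state on Solaris 11, which I’d want double-checked before anyone trusts them:

```
# Before touching anything: record the share/iSCSI config so it can
# be recreated if the round trip loses something
zfs get -r sharesmb,sharenfs tank   # CIFS/NFS share settings per dataset
stmfadm list-lu -v                  # COMSTAR logical units backed by the pool
itadm list-target -v                # iSCSI target definitions

# Cleanly detach the pool from the pre-rollback system
zpool export tank

# ...roll back / restore the system disk, boot Solaris...

# Reattach the pool on the rolled-back system
zpool import tank   # add -f only if it insists the pool is in use elsewhere
```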
The VM is running an older version, Solaris 11.0.
(Incidentally, it’s older partly because of this same question. I wanted to snapshot the VM prior to attempting an upgrade in case I borked it, but then I worried about how a rolled-back system would react to the independent pool, so I just left it alone. And yeah, I realize I could also snapshot the `rpool`, but that doesn’t give the same level of comfort to someone who doesn’t work with Solaris daily.)
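By “snapshot the `rpool`” I mean something like the following, where `pre-upgrade` is just a name I invented; my understanding is that a `beadm` boot environment is the more idiomatic route on Solaris 11:

```
# Create a fallback boot environment before upgrading
beadm create pre-upgrade
beadm list

# Or, cruder: a recursive snapshot of the root pool
zfs snapshot -r rpool@pre-upgrade
```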