PXEBoot 2, the revenge

Thanks very much for all of the advice. I’ve made a significant step forward in figuring out what the heck is going on, but I still don’t know why.

Despite the fact that I’m using identical versions of ISC dhcpd (down to the md5 checksum of the binary) and identical configurations, the difference is that the misbehaving server is NOT setting the “next-server” and “boot file name” fields in the DHCP OFFER packet.
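For anyone who wants to check the same thing on their own capture, here is a minimal Python sketch of where those two values live in the packet. It assumes you already have the raw UDP payload of the OFFER (from a packet capture, say). The layout is the standard 236-byte BOOTP header, where "next-server" travels as siaddr and the boot file name as file. The build_offer helper and the 192.0.2.x addresses are made up for the demonstration, and the "blank" offer is only my guess at what the misbehaving server sends.

```python
import socket
import struct

# Fixed BOOTP/DHCP header (RFC 2131): 236 bytes, followed by the options area.
# op, htype, hlen, hops, xid, secs, flags, ciaddr, yiaddr, siaddr, giaddr,
# chaddr (16 bytes), sname (64 bytes), file (128 bytes)
BOOTP_HEADER = struct.Struct("!BBBBIHH4s4s4s4s16s64s128s")


def boot_fields(payload: bytes) -> dict:
    """Pull the offered address, next-server and boot file out of a raw payload."""
    if len(payload) < BOOTP_HEADER.size:
        raise ValueError("payload is shorter than the 236-byte BOOTP header")
    fields = BOOTP_HEADER.unpack_from(payload)
    yiaddr, siaddr, boot_file = fields[8], fields[9], fields[13]
    return {
        "yiaddr": socket.inet_ntoa(yiaddr),   # address offered to the client
        "siaddr": socket.inet_ntoa(siaddr),   # what dhcpd calls next-server
        "file": boot_file.split(b"\x00", 1)[0].decode("ascii", "replace"),
    }


def build_offer(yiaddr: str, siaddr: str, filename: str) -> bytes:
    """Hypothetical helper: build a header for the demo (not a captured packet)."""
    return BOOTP_HEADER.pack(
        2, 1, 6, 0,                 # op=BOOTREPLY, Ethernet, 6-byte MAC, 0 hops
        0x12345678, 0, 0,           # xid, secs, flags (arbitrary demo values)
        socket.inet_aton("0.0.0.0"),
        socket.inet_aton(yiaddr),
        socket.inet_aton(siaddr),
        socket.inet_aton("0.0.0.0"),
        bytes.fromhex("000c292dea5a"),   # struct pads this to 16 bytes
        b"",                             # sname left empty
        filename.encode("ascii"),
    )


# Demo data only: one offer with the boot fields filled in, one with them blank.
working = boot_fields(build_offer("192.0.2.94", "192.0.2.91", "pxelinux.0"))
broken = boot_fields(build_offer("192.0.2.94", "0.0.0.0", ""))
```

Comparing the two dictionaries side by side is the same comparison I made by eye between the two sites.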

Fortunately, this narrows down the possibilities quite a bit. It is absolutely a problem with that particular machine and/or the software on it, not in the network or on the client machine.

I am currently trying to narrow down what the differences are between those machines. Here’s what they have in common:

  • Identically configured and installed CentOS 5.5 virtual machines being hosted on identically configured ESXi hosts (version 4.1.0, 260247).
  • Fully updated with the CentOS repositories
  • Identical configuration files, with only the network and MAC addresses changed to reflect the different sites

I have already removed and reinstalled the dhcp package, and I’ve rebuilt the machine from scratch with identical symptoms. My next step is going to be to copy the VM image from the site that works and try it with the other netblocks. If that works, I may cry. If it doesn’t, I suppose the only thing left is to dive into the source RPMs for dhcpd and figure out where the bug lies. I sincerely hope that I don’t have to get anywhere near that far in.

If anyone has any inkling as to why dhcpd would not put in the “next-server” and “boot file name” options, even though it does at the other site, I’d be very, very happy to hear about it. Just for the sake of reference, here’s the existing configuration file (line for line identical to my working config, just with different numbers and a different client MAC address):

ddns-update-style interim;

subnet 10.x.1.0 netmask 255.255.255.0 {

        option routers          10.x.1.1;
        option subnet-mask      255.255.255.0;
        option domain-name      "mydomain";
        option domain-name-servers      10.x.1.43;
        option time-offset      -18000;
        range dynamic-bootp     10.x.1.95 10.x.1.96;
        default-lease-time      21600;
        max-lease-time          43200;

        group {
                next-server 10.x.1.91;
                filename "pxelinux.0";

                host ops1tp {
                        hardware ethernet  00:0c:29:2d:ea:5a;
                        fixed-address   10.x.1.94;
                }
        }
}

I have tried moving the “next-server” and “filename” lines out of the group block and into the subnet block, and also tried putting them in the host block, all with no change. And again, this configuration works exactly as it is at the other site.
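Since “identical except for the numbers” is easy to claim and hard to eyeball, here is a rough Python sketch of how the two configs could be compared mechanically. It masks every IPv4 address and MAC address, collapses whitespace, and diffs what is left, so anything it reports is a structural difference rather than a site-specific number. The sample configs and the 192.0.2.x / 198.51.100.x addresses are invented for illustration, and the comment stripping is naive (it would also cut a "#" inside a quoted string).

```python
import difflib
import re

IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
MAC = re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b")


def normalize(config_text: str) -> list:
    """Mask site-specific values so only the structure of the config remains."""
    lines = []
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]        # naive comment removal
        line = MAC.sub("<MAC>", line)
        line = IPV4.sub("<IP>", line)       # also masks netmasks, which is fine here
        line = " ".join(line.split())       # collapse indentation differences
        if line:
            lines.append(line)
    return lines


def structural_diff(config_a: str, config_b: str) -> list:
    """Return unified-diff lines; an empty list means the structure matches."""
    return list(difflib.unified_diff(
        normalize(config_a), normalize(config_b),
        "site-a", "site-b", lineterm=""))


# Invented sample data, loosely modelled on the config above.
SITE_A = """
subnet 192.0.2.0 netmask 255.255.255.0 {
        group {
                next-server 192.0.2.91;
                filename "pxelinux.0";
                host ops1tp {
                        hardware ethernet 00:0c:29:2d:ea:5a;
                        fixed-address 192.0.2.94;
                }
        }
}
"""
SITE_B = SITE_A.replace("192.0.2.", "198.51.100.").replace("2d:ea:5a", "aa:bb:cc")
SITE_C = SITE_B.replace('                filename "pxelinux.0";\n', "")

same = structural_diff(SITE_A, SITE_B)      # differs only in numbers
missing = structural_diff(SITE_A, SITE_C)   # the filename line is gone
```

An empty result only says the two files have the same shape. It can’t tell you whether a masked address is wrong for its subnet.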

Thanks again, everyone. I really appreciate all of the help and suggestions. It’s always great to get new ways to look at a problem.

Issues with PXEBoot using dhcpd

I’ve only been working on this literally all day, so I thought I’d open it up for discussion. I posted it to the LOPSA tech list, but I thought I’d try here, too.

I’m building a machine that can be reinstalled automagically using a combination of PXEboot, kickstart, and magic. I’ve already done this once, and it worked great. I copied the exact configuration files, installed the same services, and basically tried my best to port the process to another network, and I’m failing miserably.

I’ve got 2 VMware ESXi guests on the same vSwitch. One of them is the server, running CentOS 5.5 and ISC dhcpd 3.0.5-RedHat, with tftpd started from xinetd. The client has no OS installed, and is configured to boot with PXE.

My dhcpd config file is as follows:

ddns-update-style interim;

subnet 10.x.1.0 netmask 255.255.255.0 {

      option routers          10.x.1.1;
      option subnet-mask      255.255.255.0;
      option domain-name      "mydomain";
      option domain-name-servers      10.x.1.43;
      option time-offset      -18000;
      range dynamic-bootp     10.x.1.95 10.x.1.96;
      default-lease-time      21600;
      max-lease-time          43200;

      group {
              next-server 10.x.1.91;
              filename "pxelinux.0";

              host ops1tp {
                      hardware ethernet 00:0c:29:2d:ea:5a;
                      fixed-address   10.x.1.94;
              }
      }
}

When the client boots, I immediately get:

Network boot from Intel E1000
Copyright (C) 2003-2008 VMware, Inc.
Copyright (C) 1997-2000 Intel Corporation

CLIENT MAC ADDR: 00 0C 29 2D EA 5A GUID: 564D48EF-5B1F-A4A3-C0A6-2493F02DEA5A
DHCP…|

The pipe at the end of the DHCP line is a spinner, and the dots slowly increase in number while the spinner goes.

At the same time, on the server, I get the following log entries in /var/log/messages:

Nov 29 16:29:55 kickstart-host dhcpd: DHCPDISCOVER from 00:0c:29:2d:ea:5a via eth0
Nov 29 16:29:55 kickstart-host dhcpd: DHCPOFFER on 10.x.1.94 to 00:0c:29:2d:ea:5a via eth0

Then 2 seconds later, I get these entries:

Nov 29 16:29:57 kickstart-host dhcpd: DHCPREQUEST for 10.x.1.94 (10.x.1.91) from 00:0c:29:2d:ea:5a via eth0
Nov 29 16:29:57 kickstart-host dhcpd: DHCPACK on 10.x.1.94 to 00:0c:29:2d:ea:5a via eth0
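
Taken together, those log entries are one full DISCOVER/OFFER/REQUEST/ACK exchange. If you’d rather count exchanges than scroll through /var/log/messages, something like the following Python sketch would do it. It assumes one log entry per line in the dhcpd format shown above. The regular expression is my own guess at a pattern that fits these four message types, not anything dhcpd documents.

```python
import re
from collections import Counter, defaultdict

# Matches e.g. "dhcpd: DHCPOFFER on 10.x.1.94 to 00:0c:29:2d:ea:5a via eth0"
ENTRY = re.compile(
    r"dhcpd: (DHCP[A-Z]+) .*?(?:from|to) ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})")

HANDSHAKE = ("DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK")


def tally(log_lines) -> dict:
    """Count DHCP message types per client MAC address."""
    counts = defaultdict(Counter)
    for line in log_lines:
        match = ENTRY.search(line)
        if match:
            message, mac = match.groups()
            counts[mac][message] += 1
    return counts


def complete_exchanges(counter: Counter) -> int:
    """Number of full four-message exchanges seen for one client."""
    return min(counter[message] for message in HANDSHAKE)


SAMPLE = [
    "Nov 29 16:29:55 kickstart-host dhcpd: DHCPDISCOVER from 00:0c:29:2d:ea:5a via eth0",
    "Nov 29 16:29:55 kickstart-host dhcpd: DHCPOFFER on 10.x.1.94 to 00:0c:29:2d:ea:5a via eth0",
    "Nov 29 16:29:57 kickstart-host dhcpd: DHCPREQUEST for 10.x.1.94 (10.x.1.91) from 00:0c:29:2d:ea:5a via eth0",
    "Nov 29 16:29:57 kickstart-host dhcpd: DHCPACK on 10.x.1.94 to 00:0c:29:2d:ea:5a via eth0",
]

counts = tally(SAMPLE)
exchanges = complete_exchanges(counts["00:0c:29:2d:ea:5a"])
```

A client that keeps completing exchanges but never boots has its address and is stuck on something else.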

Those 4 lines cycle a total of 4 times, after which the client console replaces the last “DHCP…” line with:

CLIENT IP: 10.x.1.94 MASK: 255.255.255.0 DHCP IP: 10.x.1.91
PXE-E55: ProxyDHCP service did not reply to request on port 4011.

PXE-M0F: Exiting Intel PXE ROM.
Operating System not found

Obviously, the server is seeing the request. Since the client eventually knows which IP it’s supposed to have, it’s receiving the DHCPOFFER. The problem appears to be that something in my DHCP configuration is making it expect a PXE server (listening on UDP port 4011) on the server (presumably 10.x.1.91, which is indeed the kickstart server).
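
To see what the client is actually being told, you can decode the options area of the offer. The Python sketch below walks the options that follow the 236-byte BOOTP header and picks out the ones that matter for network booting. The option numbers are the standard ones (53 message type, 60 vendor class identifier, 66 TFTP server name, 67 bootfile name). PXE ROMs are commonly reported to go looking on port 4011 when the offer carries vendor class “PXEClient” or gives them no boot file to fetch, but treat that as a working assumption, not a diagnosis. The sample payload is invented, not captured from my network.

```python
MAGIC_COOKIE = b"\x63\x82\x53\x63"   # marks the start of the DHCP options
OPTIONS_OFFSET = 236                 # size of the fixed BOOTP header

BOOT_OPTIONS = {
    53: "message-type",
    60: "vendor-class-identifier",
    66: "tftp-server-name",
    67: "bootfile-name",
}


def parse_options(payload: bytes) -> dict:
    """Return {option code: raw value} from a raw DHCP (UDP) payload."""
    if payload[OPTIONS_OFFSET:OPTIONS_OFFSET + 4] != MAGIC_COOKIE:
        raise ValueError("no DHCP magic cookie after the BOOTP header")
    options = {}
    i = OPTIONS_OFFSET + 4
    while i < len(payload):
        code = payload[i]
        if code == 0:          # pad option: a single byte, no length
            i += 1
            continue
        if code == 255:        # end option
            break
        if i + 1 >= len(payload):
            raise ValueError("option %d has no length byte" % code)
        length = payload[i + 1]
        if i + 2 + length > len(payload):
            raise ValueError("option %d is truncated" % code)
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    return options


def boot_related(options: dict) -> dict:
    """Map the boot-related option names to their values (None if absent)."""
    return {name: options.get(code) for code, name in BOOT_OPTIONS.items()}


# Invented payload: zeroed header, then an OFFER carrying option 60 and no
# bootfile name.
sample = (
    bytes(OPTIONS_OFFSET)
    + MAGIC_COOKIE
    + bytes([53, 1, 2])                  # message type 2 = DHCPOFFER
    + bytes([60, 9]) + b"PXEClient"      # vendor class identifier
    + bytes([255])
)

summary = boot_related(parse_options(sample))
```

Note that ISC dhcpd’s filename statement normally fills the fixed “file” field of the header and not option 67, so an empty option 67 here doesn’t prove the boot file is missing.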

The oddity is that the configuration is identical to the configuration that I had at the other site.

I’m pretty stuck at this point. Any advice you’d be willing to offer would be welcome.

Powershell? In my toolkit? It’s more likely than you think…

I’m primarily a Linux admin, but I’m a pragmatic admin, so I use what’s best, and it just so happens that Windows is better than Linux for me in some cases (mostly centralized authentication, but there is also some software out there that only runs on Windows). The point is that I have a heterogeneous network, and I can’t afford to have some of my machines be “second class” citizens.

I’ve also scripted in bash for years now, sometimes to the point of absurdity. I’m hesitant to switch away, just because I have so much braintrust dedicated to bash, but alas, bash doesn’t run natively on Windows without some work, and I’m generally of the opinion that if you have to work that hard, there’s probably a better way.

Since I started out on batch files in DOS days, I was aware of their…uhh…let’s see, how shall I put this politely… deficiencies. Actually managing systems with batch files is not my idea of a good time, and so for a long time, I didn’t do it, but I kept hearing rumblings of another language on the horizon from Microsoft that could be used for effectively managing a system…and those noises were referring to Powershell.

A lot of you have tried powershell, and a lot of you like it. I know, because when I said on twitter that I was using it, some people spoke up very vocally and supported the idea, and gave me hints on what to try. On the other hand, some people went the other direction and accused me of “selling out”, which I thought was interesting, considering that I have nothing particularly against Microsoft, at least in terms of stuff I use at work. Anyway, a lot of people were very much shocked and dismayed at my use of powershell, because of the company that produced it. Let’s move past that, be grown-ups, and say “if it works for what we need, then we’ll use it”, because that is what system administration is about…getting the job done.

I’ve been playing with powershell now for about 6 hours total, and let me assure you, it helps get the job done.

There are some oddities, but overall, speaking as someone who comes from a background in bash and who is lightly conversational in perl and ruby, it’s a very familiar syntax style and methodology. It was obviously written by people who scripted a lot, because plenty of things that used to be hard are now very easy.

I’m probably going to do a “how to get your feet wet with powershell” post or two in the coming days, because I don’t know enough to instruct you beyond that, but let me show you some of the resources I’ve been using, and maybe you can beat me to the punch.

Requirements:
Windows XP or newer
Windows 7 (or Server 2008) are ideal, because they come with PowerShell and PowerShell ISE (the Integrated Scripting Environment, basically an IDE) preinstalled, but you can get Powershell 1.0 on XP and Server 2003, and Powershell 2.0 on Vista *shudder*.

I’ve actually been working through a book, Windows PowerShell 2.0 Administrator’s Pocket Consultant by Microsoft Press. It’s good, if a little slow. I’ve been supplementing it with some TechNet Script Center tasks. If you get a chance, I recommend working through the date/time example to see how powerful the language can be.

So far, I’m very happy with my new scripting language, particularly when I combine it with ultrapowerful addons, like the VMware Virtual Infrastructure PowerCLI, a VMware-specific set of cmdlets that let you address vCenter and ESX(i) hosts and guests (and clusters) programmatically. Even in the 6 hours I’ve been messing around, I’ve been able to do things that aren’t possible without a lot of hacking in shell scripts.

If you run Windows and you haven’t dug into powershell yet, it’s worth your time. I hope to show you some examples in the near future to kickstart your way.

If you use powershell, drop a comment and share your favorite tip. I’m eager to learn, and I know everyone else is, too!