PXEBoot 2, the revenge

Date November 30, 2010

Thanks very much for all of the advice. I've made a significant step forward in figuring out what the heck is going on, but I still don't know why.

Despite the fact that I'm using identical versions of ISC dhcpd (down to the md5 checksum of the binary) and identical configurations, the difference is that the misbehaving server is NOT setting the "next-server" and "boot file name" fields in the DHCP OFFER packet.

Fortunately, this narrows down the possibilities quite a bit. It is absolutely a problem with that particular machine and/or the software on it, not in the network or on the client machine.
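For anyone who wants to check the same thing in their own capture, here is a minimal Python sketch of where those two values live in the packet. It assumes you already have the raw UDP payload of the OFFER as bytes (for example, exported from a packet capture); the function name `parse_bootp` and the dictionary keys are made up for illustration. In the BOOTP layout from RFC 2131, `siaddr` (the field dhcpd's next-server normally fills in) sits at byte offset 20 and the 128-byte boot file name field at offset 108. Some setups carry the same information in options 66 and 67 instead, so the sketch reports those too.

```python
import socket

# DHCP magic cookie that follows the 236-byte BOOTP fixed header (RFC 2131).
MAGIC_COOKIE = bytes([99, 130, 83, 99])


def parse_bootp(payload: bytes) -> dict:
    """Pull the PXE-relevant fields out of a raw BOOTP/DHCP payload.

    `payload` is the UDP data only (no Ethernet, IP, or UDP headers).
    The function name and the returned keys are illustrative.
    """
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP payload")

    # siaddr: 4 bytes at offset 20.
    next_server = socket.inet_ntoa(payload[20:24])
    # file: 128 bytes at offset 108, NUL-padded.
    boot_file = payload[108:236].split(b"\x00", 1)[0].decode("ascii", "replace")

    # Walk the options as code/length/value triples until the end marker.
    options = {}
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:          # end option
            break
        if code == 0:            # pad option, no length byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option, stop quietly
        length = payload[i + 1]
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length

    message_type = options[53][0] if options.get(53) else None
    return {
        "message_type": message_type,           # 2 means DHCPOFFER
        "next_server": next_server,
        "boot_file": boot_file,
        "tftp_server_option": options.get(66),  # option 66, if present
        "bootfile_option": options.get(67),     # option 67, if present
    }
```

A reply with the fields unset would show `next_server` as `0.0.0.0` and an empty `boot_file`, which makes a side-by-side comparison of the two sites fairly quick.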

I am currently trying to narrow down what the differences are between those machines. Here's what they have in common:

  • Identically configured and installed CentOS 5.5 virtual machines being hosted on identically configured ESXi hosts (version 4.1.0, 260247).
  • Fully updated with the CentOS repositories
  • Identical configuration files, with only the network and MAC addresses changed to reflect the different sites

I have already removed and reinstalled the dhcp package, and I've rebuilt the machine from scratch with identical symptoms. My next step is going to be to copy the VM image from the site that works and try it with the other netblocks. If that works, I may cry. If it doesn't, I suppose the only thing left is to dive into the source RPMs for dhcpd and figure out where the bug lies. I sincerely hope that I don't have to get anywhere near that far in.

If anyone has any inkling as to why dhcpd would not put in the "next-server" and "boot file name" options, even though it does at the other site, I'd be very, very happy to hear about it. Just for the sake of reference, here's the existing configuration file (line for line identical to my working config, just with different numbers and a different client MAC address):

ddns-update-style interim;

subnet 10.x.1.0 netmask 255.255.255.0 {

        option routers          10.x.1.1;
        option subnet-mask      255.255.255.0;
        option domain-name      "mydomain";
        option domain-name-servers      10.x.1.43;
        option time-offset      -18000;
        range dynamic-bootp     10.x.1.95 10.x.1.96;
        default-lease-time      21600;
        max-lease-time          43200;

        group {
                next-server 10.x.1.91;
                filename "pxelinux.0";

                host ops1tp {
                        hardware ethernet  00:0c:29:2d:ea:5a;
                        fixed-address   10.x.1.94;
                }
        }
}

I have tried moving the "next-server" and "filename" lines out of the group block and into the subnet block, and also tried putting them in the host block, all with no change. And again, this configuration works exactly as it is at the other site.

Thanks again, everyone. I really appreciate all of the help and suggestions. It's always great to get new ways to look at a problem.

  • http://blogs.ncl.ac.uk/paul.haldane Paul Haldane

    Other things I might consider in your situation ...

    1. Odd characters in the config file (including non-Unix line endings) - would expect dhcpd to complain but worth running through od -bc to check

    2. Disable the dhcpd service and run by hand with -d to capture debug output - something might jump out at you.

    3. Check the netmask settings on the interface of the server (and /etc/netmasks). I find it difficult to imagine why this would cause problems here but it often tripped me up when doing Sun network installs.
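
The check in item 1 can also be done with a few lines of Python instead of reading od output by eye. This is only a sketch: `find_odd_bytes` is a hypothetical helper name, and treating everything other than tab and printable ASCII as suspicious is an assumption about what counts as "odd" in a dhcpd config. A carriage return (byte 13) at the end of a line is the signature of non-Unix line endings.

```python
import sys


def find_odd_bytes(data: bytes) -> list:
    """Return (line, column, byte_value) for every suspicious byte.

    Tab and printable ASCII (0x20-0x7e) are accepted; anything else,
    including carriage returns and non-ASCII bytes, is reported.
    Line and column numbers start at 1.
    """
    findings = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte == 0x09 or 0x20 <= byte <= 0x7E:
                continue
            findings.append((lineno, col, byte))
    return findings


if __name__ == "__main__":
    # Paths are given on the command line; nothing is scanned by default.
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            for lineno, col, byte in find_odd_bytes(handle.read()):
                print(f"{path}:{lineno}:{col}: byte 0x{byte:02x}")
```

Saved under a name of your choosing, it would be run against the config on each server (on CentOS 5 that is probably /etc/dhcpd.conf, but treat the path as an assumption). No output means nothing suspicious was found.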

  • Brian Moyles

    Have you tried enabling/disabling the authoritative config option in dhcpd.conf?
    "Network administrators setting up authoritative DHCP servers for their networks should always write authoritative; at the top of their configuration file to indicate that the DHCP server should send DHCPNAK messages to misconfigured clients. If this is not done, clients will be unable to get a correct IP address after changing subnets until their old lease has expired, which could take quite a long time."

    That sounds like it could cause you some grief if you're moving boxes around at all...

    Here's a snippet of one of my configs that works just fine with pxe:
    authoritative;
    ignore client-updates;
    ddns-update-style interim;
    default-lease-time 604800; # 1 week
    max-lease-time 1209600; # 2 weeks

    # management network
    include "/etc/dhcpd/hosts/10.x.1.conf";

    10.x.1.conf:
    subnet 10.x.1.0 netmask 255.255.255.0 {
        group {
            option routers 10.x.1.1;
            option ntp-servers 10.x.1.1;
            option domain-name-servers 10.x.y.254,10.x.y.254;
            host a-host {
                option host-name "a-host.com";
                hardware ethernet 00:f0:66:1f:31:01;
                fixed-address 10.x.1.38;
                next-server 10.x.1.34;
                filename "pxelinux.0";
            }
        }
    }

  • http://saintaardvarkthecarpeted.com/blog Saint Aardvark

    Now that's odd. My two cents:

    -- Are packets getting truncated somewhere? Maybe the differences you're seeing are at the end of a packet...(is it jumbo frames? I bet it's jumbo frames. My lunch went missing today...know why? JUMBO FRAMES.)

    -- checksum the libs that the dhcpd binary depends on...maybe there's a difference somewhere?

    Yeah, I'm grasping at straws. But I'm ON YOUR SIDE.
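
The library-checksum idea above could be scripted roughly like this. It is a sketch under assumptions: it relies on a Python 3.7+ interpreter being available on both servers, on `ldd` being in the path, and on the helper names `library_paths` and `sha256_of`, which are made up here. SHA-256 is used instead of md5 only because some hardened systems disable md5; any digest works for a comparison.

```python
import hashlib
import subprocess
import sys


def library_paths(ldd_output: str) -> list:
    """Extract the absolute library paths from the text printed by ldd.

    Lines without an absolute path (such as the vdso entry) are skipped.
    """
    paths = []
    for line in ldd_output.splitlines():
        for token in line.split():
            if token.startswith("/"):
                paths.append(token)
                break
    return paths


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large libraries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__" and len(sys.argv) > 1:
    # The binary to inspect is passed explicitly, e.g. the path to dhcpd.
    result = subprocess.run(
        ["ldd", sys.argv[1]], capture_output=True, text=True, check=True
    )
    for lib in library_paths(result.stdout):
        print(sha256_of(lib), lib)
```

Run it with the dhcpd binary's path as the argument on each server, save the two outputs, and diff them; any line that differs points at a library worth a closer look.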

  • Ty_a

    I had a problem one time with a VLAN on a switch having the exact IP of my dhcp server when trying to do a PXE OS re-image. We couldn't figure out why it would work if the device was connected with a dumb switch but would fail connected through a Cisco switch until we checked the IP address assigned to VLAN 1. We could ping the server address fine from the client. Duh!

    That probably doesn't help though. :)

  • http://www.standalone-sysadmin.com Matt Simmons

    @Paul

    Thanks for the idea, but I've re-written that file so many times I think I could do it from memory now ;-)

    @Brian

    Yeah, I've gone through several different testing rounds with "authoritative" enabled, also "allow bootp" and "allow booting", all in various configurations. As for moving the machines, I'm not. I really just created a VM that boots up looking for a bootp server. Thanks for sharing your configuration, though. I do appreciate it!

    LOL @SaintAardvark - Sadly, my packets aren't being truncated. I snarfed with -s 1500, and still got padded 0s at the end. I don't have jumbo frames enabled because I'm scared of them ;-) I appreciate your support nonetheless.

    @Ty_a hey, commiseration ALWAYS helps. Misery loves company.

  • Geoff Crompton

    Have you checked the arguments that both dhcpd servers are being started with? On a debian system /etc/default/dhcp3-server has an "INTERFACES" variable used by the init script. I can't explain how a mistake with a command line argument would cause your symptoms, but I figured it wouldn't hurt to check.

  • John McGrath

    This is grasping at straws, and showing my ignorance in DHCP, but...

    Have you tried to open up the Dynamic DHCP range to include the servers?

    Yours:
    "range dynamic-bootp 10.x.1.95 10.x.1.96;"

    To:
    "range dynamic-bootp 10.x.1.90 10.x.1.96;"

    I had an issue once where DHCP did not see my fixed IP addresses since they were not in the DHCP range. I reset the range, and voilà! It worked, despite having dedicated the IPs in my config.

    Any corrections or enlightenments are willingly accepted...

  • Stephen P. Schaefer

    Is it possible that there's a "rogue" DHCP server on the network, and that the client is using the "wrong" DHCP response?