Linux machines with no rebooting…? Is this what we want?
September 1, 2010
The other day, I caught a message that KSplice was available for Fedora. I thought I’d be a wiseguy and I replied “Yeah, great. Call me in 20 years when it’s available for RHEL”. Well, as several people pointed out, it turns out the joke is on me.
As you can see, it’s actually available for many Linux-based OSes at various prices. I suppose my confusion stemmed from the fact that I misunderstood what ksplice was.
My impression from a long time ago, when it first came out on Ubuntu, was that it was essentially a kernel patch that dynamically loaded patches and provided the ability to rebootstrap a kernel that was already loaded. As it turns out, it’s a commercial product that offers the ability to not have to reboot your machine to update the kernel. Let me be frank: I’m all about that.
The part that I kind of object to is in the press release, of all things. It’s the opening line of the company profile:
Ksplice is an enterprise software company making reboots a thing of the past.
Please, let’s be honest. Reboots are inevitable. Using this product as a stop-gap for untimely reboots may be handy (at the low low price of $50 per year per server), but it can’t (and shouldn’t!) replace regular reboots.
The reasons for scheduled rebooting of machines are numerous. The primary one is that regular reboots assure that the machine is configured to boot correctly. If you’ve got a machine that’s got over 100 days of uptime, how do you know it will start correctly? You last booted it last quarter…what has happened to that machine since then? Changes in installed services, mountpoints, etc…it’s hard to tell if it’s going to be in a known-good state when it comes back up after a power failure.
Another reason to reboot occasionally is to clean up the running state of the machine. What’s that you say? Your machine is running fine? Well, sure, it may be, but how much cruft is left hanging that isn’t obvious? Have you ever used kill -9? Do you know for sure that there aren’t any memory leaks in your running services? Has any process hung while waiting on I/O, so that it is now stuck in uninterruptible sleep?
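One way to answer the uninterruptible-sleep question without rebooting is to look at process states directly. The following is a minimal sketch, assuming a Linux machine with /proc mounted; the function names are made up for illustration. A process that shows state D for an instant is normal, so only PIDs that stay in the list across repeated runs are worth worrying about.

```python
import os


def parse_state(stat_text):
    """Return the one-letter process state from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_text[stat_text.rfind(")") + 1:]
    return after_comm.split()[0]


def uninterruptible_pids(proc_root="/proc"):
    """List PIDs currently in state 'D' (uninterruptible sleep)."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux system, or /proc is not mounted
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "D":
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open().
            continue
    return pids
```

Calling uninterruptible_pids() a few times, some seconds apart, and keeping only the PIDs present every time gives a rough list of stuck processes.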
Yes, there are lots of things that happen to servers over the course of doing their jobs. A reboot fixes many of them. The only argument against it is uptime.
I’ve written about uptime before, and I still feel the same way. Modern system administration has advanced beyond a single server providing a service. Uptime needs to be measured from the outside in, and according to the availability of the service, not the individual servers comprising that pool.
Feel free to disagree. Let me know if you’ve got an uptime of a year plus and you’re proud of it, or if you would be ashamed to be in that position.
Edit
This entry is causing quite a stir on Reddit. Cxunix from Twitter also weighed in on his blog, servermanaged.it (link is in Italian, English translation here).
September 1st, 2010 at 9:52 am
I heartily agree with you. As annoying as scheduled reboots may be, they give you the confidence that an unscheduled reboot will probably work fine too.
Another major point you neglect to mention is that when you pay for support from one of the commercial Linux vendors, they will not support your kernel once you’ve added the ksplice cruft, since you’re no longer running a vendor-supplied kernel. This will result in ${VENDOR} asking you to reboot into a supported kernel to verify operation there.
September 1st, 2010 at 11:19 am
kill -9 will never leave handles lying around. I/O handles belong to the process; if you tell the kernel to kill the process, they go away with it. If you mean temp files, they might be left lying around, but a reboot won’t fix that anyway. Two things that reboots tend to fix fairly well are driver bugs and getting the system back quickly when it has swapped out too aggressively. Reboots are still required there. But, alas, I think you’re missing the point of this: it is simply to separate the problem of rebooting-for-kernel-patching from rebooting-as-resetting. It does the former fairly well (perfectly for the last 64 vulnerabilities, per Wikipedia), and makes no attempt (that I see) to lay claim to the latter.
September 1st, 2010 at 11:20 am
Rebooting can be a security risk. I have seen a server that was hacked, where the hacker made changes to a config file but didn’t have permission to restart the service to reload the config file. When the server was rebooted, his changes were read, and all hell broke loose.
I only reboot when I absolutely have to, and I schedule it assuming the worst disaster will happen when I reboot.
September 1st, 2010 at 11:32 am
I agree. And for the reasons you state.
Availability is what is important, not uptime.
That’s not to say it’s not something that one could need though. I don’t currently have a need for it.
I will say, the ksplice blog is top notch, and reflects some smarts over there.
September 1st, 2010 at 12:02 pm
[...] This post was mentioned on Twitter by Matt Simmons and Ben Cotton, Planet SysAd. Planet SysAd said: Standalone Sysadmin: Linux machines with no rebooting…? Is this what we want? http://bit.ly/9dEW0a [...]
September 1st, 2010 at 1:56 pm
I would respectfully suggest that the security risk example is focusing on the pebble to ignore the boulder. Not rebooting to prevent an already successful intrusion from having its full effect should at most be a small consideration, used only when an IDS has identified that something has happened and information needs to be gathered from the still-running system. It is far from a sufficient solution to security needs, and relying on it means only that you will face the possibility of additional problems when something forces a reboot.
September 1st, 2010 at 2:47 pm
uptime
14:46:45 up 323 days, 8:41
September 1st, 2010 at 4:00 pm
I am not completely sure if I agree or disagree, but I would like to point out a few things that have not been brought up.
First, a short background: our production environment runs the gamut. At one end, we have clustered services with multiple nodes, which can reboot without affecting the services. At the other end, we have highly important shared hosting boxes with failover that is 24 hours behind, which means a reboot does matter.
Now, some theory. I do have boxes that have been up for 700+ days. I am proud of them, and here is why. We generally place a box in service for three to five years, which equates to 1095 to 1825 days. If a server can go into service and never be rebooted during its production life cycle, then whether it can reboot becomes a contingency plan. I am not arguing this is the right goal for everyone, but I speculate it will become commonplace in the next ten years. I think three to five years may already be commonplace for mainframes, which have a 15 to 30 year life cycle.
Also, I am not sure that a warm and fuzzy is what I am looking for with a reboot. If a computer runs at 3 GHz, that is roughly 94,608,000,000,000,000 cycles per year. I am not sure that 946,080,000,000,000,000 (ten years’ worth) is a hugely different number of cycles; I think we are in the same ballpark in big-O terms. From the cruft perspective, whether you reboot after a year or seven years really doesn’t matter too much in my opinion.
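The cycle counts are easy to check with a few lines of arithmetic. This is a rough sketch that assumes a 365-day year and a clock that runs at a constant 3 GHz the whole time; at that rate one year works out to about 9.46 × 10^16 cycles.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, no leap days


def cycles(clock_hz, years):
    """Clock cycles elapsed over the given number of years at a constant clock rate."""
    return clock_hz * SECONDS_PER_YEAR * years


one_year = cycles(3_000_000_000, 1)
ten_years = cycles(3_000_000_000, 10)

print(f"{one_year:,}")        # cycles in one year at 3 GHz
print(f"{ten_years:,}")       # cycles in ten years at 3 GHz
print(ten_years // one_year)  # ratio between the two
```

Whatever the absolute numbers are, the ratio between one year and ten years is always exactly 10.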
Honestly, one of my biggest concerns was a file system error that might not be obvious until a reboot. If data were to become lost, unnoticed for too long, it might persist past our backup window. This could be a nightmare. Since we only have 30 days of daily backups, it would be wise to reboot once a week to discover this kind of error. Once a week is too much, so we made a quick risk analysis and decided to just never reboot. The time cost/benefit was just too much.
I did some research a while back, and different distributions have different guidelines on file system checks (as set in the defaults with mkfs.ext3 or mkfs.ext4). Worse, some don’t have any guidelines: the maximum mount count is set to -1 and the file system never gets checked. Again, the cost/benefit analysis said, just don’t worry about it. Either extend your backup granularity or accept the risk.
Scott M
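The periodic-check settings mentioned in the comment above can be read from the output of tune2fs -l on an ext3/ext4 device. The sketch below only parses that text; the function names and the SAMPLE text are made up for illustration, and the field labels ("Mount count", "Maximum mount count", "Check interval") are assumed to match what your e2fsprogs version prints.

```python
def fsck_settings(tune2fs_output):
    """Pick the periodic-check fields out of `tune2fs -l <device>` output."""
    wanted = {"Mount count", "Maximum mount count", "Check interval"}
    found = {}
    for line in tune2fs_output.splitlines():
        # Each line looks like "Label:   value"; split on the first colon only.
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            found[key.strip()] = value.strip()
    return found


def mount_count_check_disabled(settings):
    """True when the mount-count-based check is switched off (0 or -1)."""
    return int(settings["Maximum mount count"]) <= 0


# Hand-written sample for illustration, not captured from a real machine.
SAMPLE = """\
Filesystem volume name:   <none>
Mount count:              12
Maximum mount count:      -1
Check interval:           0 (<none>)
"""

settings = fsck_settings(SAMPLE)
print(settings)
print(mount_count_check_disabled(settings))
```

On a real system the text would come from running tune2fs -l against the device as root, for example through subprocess.run with capture_output=True and text=True.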
September 1st, 2010 at 4:07 pm
Oh jesus, you guys got me thinking with this one, so I did a quick mass ssh. Here are our top uptimes. I didn’t even realize that I already have almost five years, and guess what, no service outage.
I also have about 15 others that are above one year. By inductive reasoning, clearly you can run a business with very little rebooting. I am surprised myself, this is never something that we tracked.
=== server1.eyemg.com ===
16:02:16 up 1011 days, 7:38, 0 users, load average: 0.01, 0.00, 0.00
=== server2.eyemg.com ===
16:02:17 up 1664 days, 3:38, 0 users, load average: 0.18, 0.18, 0.16
=== server3.eyemg.com ===
16:01:17 up 229 days, 6:02, 0 users, load average: 0.02, 0.01, 0.00
=== server4.eyemg.com ===
16:02:10 up 1628 days, 17:42, 0 users, load average: 0.06, 0.06, 0.01
=== gsdev.eyemg.com ===
14:04:06 up 1122 days, 5:03, 0 users, load average: 0.00, 0.00, 0.00
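A listing like the one above can be ranked automatically once the output has been collected. This is a small sketch that assumes each host's uptime output has already been gathered as one line of text (for example by the mass ssh run); the function names are illustrative, and the regular expression only extracts the whole-day count, treating lines without a day field as zero days.

```python
import re

# Matches "up 1011 days", "up 1 day" and the Solaris-style "up 1027 day(s)".
_DAYS = re.compile(r"\bup\s+(\d+)\s+day")


def uptime_days(line):
    """Whole days of uptime in one line of `uptime` output (0 if under a day)."""
    match = _DAYS.search(line)
    return int(match.group(1)) if match else 0


def rank_by_uptime(outputs):
    """Given {host: uptime line}, return (host, days) pairs, longest uptime first."""
    pairs = [(host, uptime_days(line)) for host, line in outputs.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


collected = {
    "server1": "16:02:16 up 1011 days, 7:38, 0 users, load average: 0.01, 0.00, 0.00",
    "server2": "16:02:17 up 1664 days, 3:38, 0 users, load average: 0.18, 0.18, 0.16",
    "server3": "16:01:17 up 229 days, 6:02, 0 users, load average: 0.02, 0.01, 0.00",
}

for host, days in rank_by_uptime(collected):
    print(f"{host}: {days} days")
```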
September 1st, 2010 at 4:27 pm
[...] This post was mentioned on Twitter by Benjamin W. Smith, EYEMG LLC and Scott McCarty, EYEMG. EYEMG said: OK, I don't want to brag, well, I kind of do, but my uptime is very manly! http://bit.ly/a2B9PH #sysadmin #devops #linux [...]
September 1st, 2010 at 6:29 pm
humbly gets off Scott’s lawn….
September 1st, 2010 at 11:34 pm
If your infrastructure cannot survive reboots, you fail as a sysadmin. (or your engineering staff fails as engineers)
September 2nd, 2010 at 4:13 am
[...] Ksplice raises many considerations and criticisms, as shown by this post, in which the author points out that very often a server reboot can be a [...]
September 2nd, 2010 at 5:08 am
Scott: And now you’ve noticed those figures, you’re going to be paranoid when a machine so much as coughs :)
September 2nd, 2010 at 6:10 am
I must agree with L Green: if your systems do not survive a reboot, the failure is yours, no matter how long it has been up.
We have dozens of systems with uptimes of 300+, even 700+ days. In my last 13 years of professional system administration, having operated SuSE, RHEL, CentOS, Fedora, Ubuntu and Debian, the only reason I deliberately rebooted a system was for kernel upgrades or after doing a major distro upgrade, e.g. from RHEL4/CentOS4 to RHEL5/CentOS5 … and they always came back online.
In my experience, it is not even a problem to make major upgrades, e.g. CentOS 5.4 to 5.5, skipping kernel upgrades (as long as no 5.5 package requires the new kernel), and postpone the kernel-related stuff (including a reboot) until, let’s say, 1 year later. I have to say that my focus clearly is RHEL/CentOS, so possibly this specific situation is different with non “enterprise-grade” distributions.
For my part, I am proud of high uptimes and encourage them. I have enough self-confidence in my skills to know that any system will come up again, no matter how long it has been up.
September 2nd, 2010 at 9:17 am
I think Ksplice is aiming for the PCI DSS compliance market where you DO have to do kernel patches to be compliant. Also, coincidentally, retail is one of the verticals that are most sensitive to downtime and most cost conscious. So they are cheap and they want the world :-)
Enterprise computing (e.g. Marketing, HR, BI, Internal IT) is not so sensitive to a good ol’ reboot during a nightly maintenance window.
I agree that you should always be conscious as to whether your systems can theoretically reboot. You should never leave a boot loader broken, etc. I also have a philosophy that unless otherwise necessary, you should always have services start correctly. But you never “KNOW” whether it will reboot until it boots.
Honestly though, I am not too worried about our systems rebooting, we are ordering the system to replace our 1665 day uptime machine this week :-)
I remember having problems 10 years ago with cheap Boxx servers rebooting, but we use HP DL380s and honestly, they are as bulletproof as it gets in our industry. All of our servers have RAID; as such, they have disks fail, get replaced, and arrays rebuild automatically. The major scare for me is a software problem with the file system. Database and application data loss can’t be fixed with a reboot and is way, way, way more likely.
Finally, engineering is about producing a product that meets a requirement for a specified budget. There are plenty of scenarios, as I am sure anyone here will concur, where business owners will want more than they can afford. Often they will opt for less than was specified. For example, I have been in scenarios where I told a business owner that they needed redundant load balancers, firewalls, routers, core switches, distribution switches, and links to the dist switches, but they CAN’T pay for it. That is just the course of business.
So, you can never really know whether your engineering is good or great unless you have a competitor that you judge to be about as smart as you, but who just can’t get the same “service” uptime as you for the same or cheaper price ;-)
Scott M
September 2nd, 2010 at 9:48 am
I’m managing a compute farm where there is “traditionally” only one reboot per year for maintenance. In reality, the type of jobs we run exhaust all available RAM on a regular basis and over time the machines have issues. I have been pushing for a quarterly reboot, just so the memory leaks and inevitable hung processes get cleared in a timely and organized manner. I’m not going to get into the details of the issues, but a reboot improves simulation speed and system reliability.
Scott
September 2nd, 2010 at 8:58 pm
This seems to be talking only about uptime when dealing with services. However, it doesn’t really bring up embedded devices. There are a lot of things that run Linux and can’t have any downtime because it costs too much. In that case uptime means A LOT, because even a 50 ms ring switch is bad.
September 2nd, 2010 at 9:34 pm
Seems like everyone and his uncle has an IT product out there for the low, low price of $50/server-year.
In a 2,000-server environment, that is $100,000/yr. IT budgets tend to hemorrhage these kinds of licensing costs, which leads to IT projects failing because of the cost of licensing.
It sounds like a great way to get a gravy train going, where you scale as the customer’s business scales, but you can wind up doubling the cost of a server just through all the licensing (and don’t get me started on VMware).
September 2nd, 2010 at 9:46 pm
@lamont. Sure. But if you have an IT project/budget, you do a cost/benefit analysis, and if it doesn’t make sense, you don’t buy it.
If you have 2K servers, I hope you have an infrastructure that can have some systems reboot while others stay up and there is no loss of service provision. Or you at least have a plan to get there.
Being an SA means taking a lot of things into account.
1. What does the business want.
2. What can the business afford
3. If they are not aligned, what is the tolerance for failure.
4. Being honest about what the budget they can afford will provide.
5. Giving the customer (employer) the most bang for the buck, taking all of the above into consideration.
If it doesn’t make business sense, abandon it. If it does, buy it.
September 2nd, 2010 at 10:57 pm
Uptime is everything!
My record for a laptop is 14 months. It went through a suspend/resume cycle about once per day. Red Hat 6.2. Wish my server would stay up that long.
September 2nd, 2010 at 11:45 pm
Is the author of this article saying that Linux has needed to reboot? The only time I’ve had to reboot Linux systems is when we lost power long enough and the UPS could not run them any longer (we did not have backup generators). Look at the feedback, one mentions multiple servers being up for over 5 years!! This is the norm with Linux. You should only need to reboot Linux for a kernel upgrade. Everything else can be fixed or patched on a running system.
September 3rd, 2010 at 12:36 am
2:31pm up 363 days, 13:20, 424 users, load average: 0.18, 0.15, 0.14
This is a corporate server serving up to 1300 users daily in a 10000 employee financial business. A good, well managed and well balanced server does not need to be rebooted. Excuses such as “how do you know it’ll come up”, based on changed mountpoints etc, are the ramblings of a bad administrator.
Along with that, regular online maintenance, stopping/starting services, cleaning shared memory, etc. can all be done with the machine up. In a data centre the only machines requiring regular rebooting are the poorly configured ones, or those that have an OS that is badly written (i.e. Windows servers with nightly scheduled reboots).
Rebooting should be a thing of the past, on any well running and well managed system.
September 3rd, 2010 at 5:34 am
I strongly disagree.
“Reboot” culture is derived from Windows; it’s a shame, an evil, but now it’s considered normal :( I have Linux servers with 300, 600 and even 1200 days of uptime, interrupted only by power outages and data center moves :-). I skip kernel updates when it’s possible. A serious system like Linux/Unix doesn’t need rebooting; that’s only a Windows heritage.
September 3rd, 2010 at 9:46 am
@Homer
Thanks for your comment. I’m sorry that you feel that my blog entry consisted of ramblings.
On the other hand, I would like to learn more about how to administer machines better. How do you get around hung processes? How do you prevent memory leaks?
When your machine rebooted a year ago, why was that, and did it come up right?
These are things that I, and every other administrator that I know of, deal with. If you have found a way around these things, then you should be writing papers for the rest of us to learn by.
Certainly, I could stand up a simple NFS server and it would serve files until the disks failed. I could also stand up an IBM Z-series server and it would probably run until the power company went out of business. But the virtual machines running on the Z-series would still be subject to the day to day wear and tear of running buggy processes that use and abuse system resources.
I want to learn more about what your computing experience is like. Please share more.
September 3rd, 2010 at 1:30 pm
The cycles-per-year example has the problem of not factoring in context.
1 second vs 10 seconds might seem like a lot but that’s just 1*10^15 fs vs 1*10^16 fs. A factor of 10 is always a factor of 10. In a human lifetime of 70 years there might be a “googol” atomic transitions. In 700 years of human life, there might be a mere 10 times that, but clearly people don’t live to 700 just because they can live to 70.
All physical processes have times where the probability, say based on half-life, reaches a certain point that makes something likely to happen (>50%). Well before that point, most things might be fine. After that, you are living on borrowed time. And the transition point can occur at .001 second, 300 days, or 1 googol years. Context does matter. [Note that a half-life of 1 day means that 80 more days changes the odds by a factor of 2^80!]
Specific to server uptime, there are unknown bugs in loads of complex software and the odds that a deadly combination might be reached increases over time. That’s not to put a danger limit on server uptime generally (maybe 1000 years would be fine in some scenarios), but to say that context does matter and 10x is 10x is 10x and whether that is important or not depends on context.
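The half-life remark in the comment above can be made concrete with a simple exponential-decay model. This is only an illustration under the assumption that survival follows a fixed half-life; nothing in the discussion establishes that real servers fail this way, and the function name is made up.

```python
def survival_probability(elapsed, half_life):
    """Probability of still being 'alive' after `elapsed` time, given a fixed half-life.

    Both arguments must use the same unit (for example days).
    """
    return 0.5 ** (elapsed / half_life)


# With a half-life of 1 day, surviving 80 more days is 2**80 times less likely.
now = survival_probability(0, 1)
later = survival_probability(80, 1)
print(now / later == 2 ** 80)

# With a half-life of 1000 days, the same 80 days barely change the odds.
print(survival_probability(80, 1000))
```

The same factor of 80 days is dramatic or negligible depending entirely on the half-life, which is the point about context.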
September 3rd, 2010 at 2:13 pm
To add to the prior comment..
The reason perhaps that intuition says that if something lasts 12345678 units of time then it likely will last 123456789 units of time as well is that without knowing more about context that statement is probably correct.
10x does not mean something goes from no trouble to trouble. 10x simply means 10x, and when context is analyzed, we might find that the given 10x lands in a safe (or unsafe) zone or else across a transition from one zone to the other, although landing in the same zone (knowing nothing about context) is perhaps the more likely scenario.
As for server software, there have been enough reasons for almost anyone to change software and computer hardware within a mere decade (eg, because of new software features and hardware improvements happening so frequently), that if the natural average uptime of some hypothetical system (and the precise context of that system is important) were to be 1000 years, few people would ever wait around to actually test it or push that system to its breaking point. In practice, expectation is guided by past experiences with systems/contexts we believe to be similar. Assuming you trust the integrity of the system as a practical matter, service availability is probably much more important than the uptime of any specific server. Having high uptimes in the past, however, may help build confidence that new similar systems are likely to be robust as well.
As an aside, I don’t like/trust proprietary if there is an open alternative. Besides the ownership, educational, and other benefits of FOSS, I’d rather have every single “good guy” (including myself) able to legally and practically study the system well than have only a very limited number of good guys able to study any aspect of the system freely, while bad guys speculating they can make a few million or billion if they can find certain flaws, be the main actors working to really get at the system (eg, by bribing, stealing, social engineering, launching various systems attacks on the vendor, bending or breaking the law as necessary, spending much $$ in the process, etc). Additionally, 1,000 heads is better than 10, etc, which is probably a fair ratio to expect when comparing engineers/scientists of skill level X working at one company and having full access to sensitive proprietary code of commodity software vs. working for world+dog and having access to high visibility similarly categorized FOSS.
September 5th, 2010 at 5:52 am
From a Sun box I administer:
uptime
9:50am up 1027 day(s), 13:54, 2 users, load average: 0.11, 0.07, 0.06
They are completely stable. In fact, rebooting them has caused more problems than not.
September 5th, 2010 at 1:03 pm
fileserver:
up 1115 days, 12:25, 5 users, load average: 0.00, 0.00, 0.00
unidata server running AIX:
up 268 days, 14:02, 1 user, load average: 0.08, 0.08, 0.07
September 6th, 2010 at 1:11 pm
I find rebooting necessary to ensure the healthy state of the machine. As outlined in this article, there are many things that a reboot guarantees. Keeps you sane. With that said, my uptime:
11:09:28 up 2:01, 1 user, load average: 0.28, 0.31, 0.35
Yup, I just rebooted. Debian shipped an updated point release for Lenny. I upgraded, got a new kernel, and booted into the kernel. I then noticed that a service wasn’t starting on boot, which a quick
sysv-rc-config service on
fixed.
September 7th, 2010 at 5:41 am
I agree that a reboot is required from time to time, but who really needs it, especially a scheduled one? It is just a question of what else is installed. Normally Linux does not need a reboot, as it is not a mass of stupid binary chunks like Windows was before .NET.
I am waiting for memory that doesn’t need power to remember; then people will change their minds about reboots. And they will be surprised if they use Windows.
So, no comment on reboots; I don’t need them if I don’t do things badly.
September 7th, 2010 at 5:46 am
Who on Linux needs a reboot?
If you have good hardware and you don’t do things the bad, lame way, you don’t need to restart the whole system.
The trouble is with drivers; maybe only some devices need a reboot of the whole machine to make them work properly.
If you don’t go the bad, cheap route, you don’t need to reboot.
People who need scheduled reboots probably need a reboot too.
September 7th, 2010 at 4:41 pm
[...] kernels for Fedora which sparked off some interesting conversation about uptime over at the Standalone Sysadmin. Honestly, I ran across Ksplice a while back and I thought to myself, huh that might be useful for [...]
June 29th, 2011 at 9:09 am
[...] way to move a server, and the beginning of the video talks about a 7 year uptime on the machine. I don’t think that’s a particularly good idea either. Don’t do this at [...]
July 23rd, 2011 at 12:07 am
[...] wrote last year about KSplice. If you remember (or read the article I linked to), it’s software that took [...]