I managed to tame my crashing server quite a bit.
When I phoned the shopkeeper to ask about the terms for getting the motherboard replaced, he asked me which SATA hard drive I was using. He told me that Maxtor hard drives have big problems with nForce chipsets (there's also something about it in this forum). Of course I had a damn Maxtor DiamondMax Plus 9 hard drive mounted on that damn SATA port.
Now I have disconnected the hard drive, then ran 8 kernel compiles in parallel together with this small C program:
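#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */

#define SIZE 1000000000   /* about 1 GB */

int main(void)
{
    char *buf = malloc(SIZE);
    int i;

    if (buf == NULL)
    {
        perror("malloc");
        return 1;
    }
    printf("allocated\n");
    while (1)
    {
        /* Touch every byte so the pages really have to live in RAM or swap. */
        for (i = 0; i < SIZE; i++)
            buf[i]++;
        printf("filled\n");
        sleep(1);
    }
    return 0;
}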
The C program is there to make sure that swap gets used; otherwise even 8 kernel compiles in parallel would happily fit in 1 GB of RAM.
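To check that swap really is being hit during such a run, something like the following might help. It is only a rough sketch, assuming a Linux box where sysinfo(2) is available; the helper name swap_used_bytes is made up for illustration.

```c
#include <sys/sysinfo.h>

/* Hypothetical helper: swap currently in use, in bytes.
 * Returns -1 if sysinfo() fails. Linux-only. */
long long swap_used_bytes(void)
{
    struct sysinfo si;

    if (sysinfo(&si) != 0)
        return -1;
    /* totalswap and freeswap are counted in units of mem_unit bytes. */
    return (long long)(si.totalswap - si.freeswap) * si.mem_unit;
}
```

Calling it from the fill loop and printing the result next to "filled" (with %lld) would show whether the number grows while the compiles are running. Watching free or vmstat in another terminal gives the same information without any code.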
I ran this kind of heavy load for 3 hours and only had some timeouts in some network services (load was around 25!) and this error twice:
Jan 9 14:58:46 eddie kernel: ----------- [cut here ] --------- [please bite here ] ---------
Jan 9 14:58:46 eddie kernel: Kernel BUG at lib/radix-tree.c:372
Jan 9 14:58:46 eddie kernel: invalid operand: 0000 [1]
Jan 9 14:58:46 eddie kernel: CPU 0
[...blah blah blah...]
Since the error happened twice and in exactly the same position, and since I previously wanted to try the latest drivers and the kernel is a too-new 2.6.15, I'd blame those errors on a normal kernel bug. That code seems to be a nasty bit which has already caused problems on 64-bit systems.
Now it's time to get that hard drive replaced. This is so time-consuming, but at least things are improving.