Reply to topic  [ 4 posts ] 
Segmentation fault when sorting 
Author Message
Yorick Master

Joined: Wed Jun 01, 2005 11:34 am
Posts: 112
Post Segmentation fault when sorting
I'm running into a crash when sorting large-ish arrays that contain few unique values.

On one extreme, this runs perfectly fine and almost instantly:
Code:
> x = indgen(200000)
> list = sort(x)


At the other extreme, this runs for a relatively long time and then crashes with a segmentation fault:
Code:
> x = array(long, 200000)
> list = sort(x)
Segmentation fault


I tried this on a few different linux systems:
  • 32 bit system running a recent 2.2.00x CVS version of Yorick
  • 32 bit system running a 2.1.06x version of Yorick
  • 64 bit system running a 2.1.06x version of Yorick

All crashed with an array of size 200,000. However, all three worked when I tried with an array of size 100,000.

In my real data, I have much larger arrays with more unique values. However, the number of unique values is very, very small compared to the size of the array. For example, an array of size approximately 200,000 with three unique values results in a crash for me.


Fri Sep 24, 2010 11:12 am
Profile
Yorick Master

Joined: Mon Nov 22, 2004 9:43 am
Posts: 354
Location: Livermore, CA, USA
Post 
I reproduced this. The yorick sort() function uses the standard quicksort algorithm, which chooses an array element at random, divides the array into those elements greater than or equal to the partition element, and those less than the partition element, and -- the bit that causes your problem -- recursively calls itself to sort each of those two shorter lists. Therefore, for 200000 equal elements, quicksort will call itself 200000 times recursively. My linux machine ran out of stack space after 131051 calls, at which point the stack had grown down into the read-only data segment of the program, and it died with SIGSEGV. I haven't checked whether the system qsort function has the same problem, since it also uses quicksort.

I have a heapsort written (see math/heapsort.[ch] in the distribution) which would probably fix this misfeature, but it will take some effort to replace the current function.

The sort function scales poorly, partly because it sorts array indices rather than the values themselves, which leads to very poor cache coherence.

One workaround (partly used in the msort function) would be to add small unequal increments to all of your array elements before sorting them. For example, if you want to sort x, and you know that unequal values always differ by at least 1 (if they are integers as in your example), you could do this:

list = sort(x+span(0.,0.99,numberof(x)));

then increments are guaranteed not to change the order of any truly different values; in fact they have the side effect that the sort will not reorder any equal elements (like msort). This trick should dramatically accelerate your searches, and should prevent the call stack depth from becoming proportional to the size of x. Of course, if you don't have a way to estimate the minimum spacing, or if the increments are so small they get lost in roundoff, this workaround. For the example you gave, it should work perfectly.


Sat Oct 02, 2010 10:03 am
Profile
Yorick Master

Joined: Wed Jun 01, 2005 11:34 am
Posts: 112
Post 
I just noticed that Yeti has a heapsort function. As you suggested, heapsort works properly (and quickly) for this scenario.


Fri Oct 08, 2010 9:24 am
Profile
Yorick Master

Joined: Mon Nov 22, 2004 9:43 am
Posts: 354
Location: Livermore, CA, USA
Post 
I just committed a patch to the yorick sort function which works around this misbehavior in the quicksort algorithm. I feel a little bad to modify the textbook quicksort; the fix probably slows down the sort function slightly. It consists of putting half of any elements equal to the partition element in the "greater" partition and half in the "lesser" partition, rather than all in the "lesser" one, as the classic algorithm does. (Surely this is a well-known fix; I hope I'm not falling into some other well-known trap. If any of you is an expert on sorting, please comment.)

The distribution yorick sort function now handles the case of an arbitrary number of equal values in the input array correctly and without any speed difference relative to all unequal values. It still uses quicksort, but I don't believe any input will cause a disastrous slowdown or stack overflow any more. Please report any new misbehavior.

Thanks for giving such a nice example. I wonder if sort performance problems I have noticed in the past and called "poor scaling with sort length" were really this problem...


Mon Oct 11, 2010 5:53 pm
Profile
Display posts from previous:  Sort by  
Reply to topic   [ 4 posts ] 

Who is online

Users browsing this forum: No registered users and 1 guest


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
cron
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group.
Designed by STSoftware for PTF.