[Haskell-beginners] sorting almost sorted list
Daniel Fischer
daniel.is.fischer at googlemail.com
Wed Sep 28 00:00:06 CEST 2011
On Tuesday 27 September 2011, 04:46:29, Dennis Raddle wrote:
> I have a problem from music. I have a list of notes sorted by begin
> time. I want to sort them by end time.
>
> Notes that are sorted by begin time are usually close to sorted by end
> time, because notes tend to cluster within a small range of durations.
>
> What algorithms are available to me (hopefully in a library) that are
> good with this kind of thing?
Data.List.sort/sortBy should be fine most of the time. Most algorithms that
perform significantly better in some situations are rather specific and not
available in libraries. Nobody will implement them until they're needed,
and then the implementor may think that it's not worth publishing as a
library as it's too specialised. Thus I doubt you'll find a library
function adapted to your special needs.
Leaving that aside, which algorithms are best depends on various things.
The most important are, I think, what kind of almost-sortedness you have
and what your space and laziness requirements are.
Most sorting algorithms require O(n) space and can't lazily produce the
output[1], but in your situation, that is possible (assuming the data is
sufficiently nice, if the last note to begin is the first to end, you need
O(n) space and can't produce incremental output).
Regarding the kind of almost-sortedness, if you have long monotonic runs
with few out-of-place elements in between, like
[2 .. 1000] ++ [1] ++ [1002 .. 5000] ++ [1001] ++ [5001 .. 10000],
Data.List.sort[By] will be quite good, less so if you have some jittering
superimposed on a monotonic list, like
concat [[n+1,n] | n <- [0, 2 .. 10000]].
I suppose your situation is more like the second.
Then an insertion sort usually does rather well, it can often outperform
algorithms with lower (worst case) complexity. (If the average displacement
needed for sorting is d, insertion sort takes O(n*(d+1)) time; if d is much
smaller than log n, insertion sort is very good.)
Like Data.List's mergesort, insertion sort is generic, needs O(n) space,
and can't produce incremental output. It is easy to implement,
inSortBy cmp = foldr ins []
where
ins x [] = [x]
ins x (y:ys) = case cmp x y of
GT -> y : ins x ys
_ -> x : y : ys
but all is not roses; for long input lists, that builds a large thunk which
may blow your stack. Insertion sort works best on mutable arrays where you
don't have the problem of building large thunks. However, it also works
well on short enough lists. Benchmarking with Yitz's list generator
(version 2), it can be more than twice as fast as Yitz's sortAlmostBy when
the lists are at most a few thousand elements long and the average
displacement is small, but it is much slower if the lists are a few ten-
thousand elements long or the average displacement is large.
Better at dealing with longer lists and larger displacements is the left-
fold version of insertion sort,
linSortBy cmp = reverse . foldl' ins []
where
ins [] a = [a]
ins l@(b:bs) a =
case cmp a b of
LT -> case ins bs a of
k@(_:_) -> b : k
[] -> error "impossible"
_ -> a : l
(note: This is adapted to the case of an almost sorted list, for an almost
reverse-sorted list, you'd use the obvious modification of ins and not need
the reverse at the end, giving better performance. For a more or less
random list, no version of insertion sort does well.)
The strictness is essential, using foldl instead of foldl' leads to worse
performance than the right fold gives, making the insertion ins less strict
isn't nearly as bad, but still hurts - a bit if the displacements are
small, more if they're large.
This version of insertion sort can keep up with sortAlmostBy much longer,
but it suffers from the same problem as the right-fold insertion sort, only
less.
Moving to specifically tailored sorting algorithms, we have Yitz's nice
sortAlmostBy. It requires that you know a bound for the number of notes
beginning not before but ending before any given note, but if you do, it
doesn't suffer too badly if you overestimate (as long as your estimate is
still small relative to the length of the list). Of course, if you
underestimate, it'll probably produce wrong results.
Then, more specifically tailored to the problem, the algorithm posted by
David Fletcher earlier, taking advantage of the fact that you know the
starting times of the notes and that a note can't end before it began (and
that the list is sorted by starting time).
My (more generic) implementation of it:
sortABy :: (a -> a -> Ordering) -> (a -> a -> Ordering) -> [a] -> [a]
sortABy _ _ [] = []
sortABy cmp1 cmp2 (x:xs) = go [] x xs
where
ins a [] = [a]
ins a l@(b:bs) = case cmp1 a b of
LT -> a : l
_ -> case ins a bs of
k@(_:_) -> b : k
[] -> error "oops"
go !store y [] = y : store
go store y zzs@(z:zs) =
case cmp2 y z of
GT -> case cmp1 y z of
GT -> go (ins y store) z zs
_ -> go (ins z store) y zs
_ -> y : case store of
(u:us) -> go us u zzs
[] -> go [] z zs
The first comparison function compares the end times, the second compares
the end time of the first argument to the starting time of the second.
As for the left-fold insertion sort, making the insertion stricter gains a
bit of speed (or more than a bit for larger displacements).
It has the advantage over sortAlmostBy that you don't need to know a bound,
and under favourable circumstances (if there never are many notes beginning
during any given note's lifetime), can be much faster (much meaning
something like a factor of 1.5 or so).
However, it doesn't take larger displacements well, and in extreme cases
degrades to a badly adapted insertion sort.
You could prevent that by using a good heap/priority queue for the store
instead of the list. That would make it slower in the good cases, but
guarantee better worst-case behaviour.
Both specialised algorithms can produce incremental output and run in
constant space if the preconditions are satisfied. Which is better depends
on the nature of your data.
Cheers,
Daniel
[1] Generic sorting algorithms can at best produce incremental output after
the entire input has been partially processed, since the minimum could
occur at any position. Thus they necessarily use at least O(n) space.
[ignoring obscenities like
sort [] = []
sort xs = let (x,ct) = findMinCount xs
in replicate ct x ++ sort' x (recalculate xs)
findMinCount (x:xs) = go 1 x xs
where
go k x [] = (x,k)
go k x (y:ys)
| y < x = go 1 y ys
| y == x = go (k+1) x ys
| otherwise = go k x ys
sort' x xs =
case dropWhile (<= x) xs of
[] -> []
(y:ys) -> let (z,ct) = findMinCount1 x 1 y ys
in replicate ct z ++ sort' z (recalculate xs)
findMinCount1 x k y [] = (y,k)
findMinCount1 x k y (z:zs)
| z <= x = findMinCount1 x k y zs
| z < y = findMinCount1 x 1 z zs
| z == y = findMinCount1 x (k+1) y zs
| otherwise = findMinCount1 x k y zs
which relies on (==) identifying only truly indistinguishable values and a
hypothetical 'recalculate' which recalculates the list lazily in O(1)
space. But it was fun to come up with.]
More information about the Beginners
mailing list