Add move_quantile function #418

Status: Open, 34 commits to merge into base: master


andrii-riazanov

This PR adds move_quantile to the list of supported move functions.

Why?

Quantiles (and moving quantiles) are often useful statistics to look at, and having a fast move version of quantile would be great.

How?

Moving/rolling quantile is implemented in almost exactly the same way as moving median: via two heaps (max-heap and min-heap). The only difference is in sizes of the heaps -- for move_median they should have the same size (modulo parity nuances), while for the move_quantile sizes of the heaps should be rebalanced differently.
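
To illustrate the heap-size rebalancing described above, here is a minimal pure-Python sketch. It is not the C code from this PR: it handles an expanding stream rather than a sliding window (no eviction of old elements), ignores NaNs, and the class name `StreamingQuantile` is hypothetical. It assumes "midpoint" interpolation over the virtual index (n - 1) * q, so the max-heap is kept at floor((n - 1) * q) + 1 elements instead of roughly half of them.

```python
import heapq
import math


class StreamingQuantile:
    """Expanding-window quantile via two heaps (hypothetical sketch)."""

    def __init__(self, q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        self.q = q
        self.low = []   # max-heap (values stored negated): the smallest elements
        self.high = []  # min-heap: the remaining, larger elements

    def push(self, x):
        # Route the new value to the heap it belongs to.
        if self.low and x > -self.low[0]:
            heapq.heappush(self.high, x)
        else:
            heapq.heappush(self.low, -x)
        # Rebalance: for the median the heaps would be (nearly) equal in size;
        # for a general q the max-heap holds floor((n - 1) * q) + 1 elements.
        n = len(self.low) + len(self.high)
        target = math.floor((n - 1) * self.q) + 1
        while len(self.low) > target:
            heapq.heappush(self.high, -heapq.heappop(self.low))
        while len(self.low) < target:
            heapq.heappush(self.low, -heapq.heappop(self.high))

    def value(self):
        n = len(self.low) + len(self.high)
        if n == 0:
            return math.nan
        pos = (n - 1) * self.q
        lower = -self.low[0]
        if pos == math.floor(pos) or not self.high:
            return lower  # the virtual index falls exactly on an element
        return (lower + self.high[0]) / 2  # "midpoint" of the two neighbours


sq = StreamingQuantile(0.25)
history = []
for x in [5.0, 1.0, 4.0, 2.0, 3.0]:
    sq.push(x)
    history.append(sq.value())
print(history)  # [5.0, 3.0, 2.5, 1.5, 2.0]
```

With q=0.5 the target size reduces to the usual median split, which is why the two implementations can share almost all of their heap code.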

The changes to transform move_median into move_quantile are very minor, and were implemented in the first commit 524afbf (36++, 13--). This commit fully implemented move_quantile with fixed q=0.25 out of move_median.

  • The initial approach was to replace move_median with move_quantile completely, so that a move_median call would simply call move_quantile(q=0.5). This is implemented and tested in the commits up to de181da, where a fully working move_quantile (and move_median via move_quantile) was implemented.

    At this point, the new move_median bench was compared to the old move_median bench. The new move_median turned out to be 1-3% slower. Even though the changes were minor, the newly introduced arithmetic operations were apparently enough to cause this overhead. For a performance-oriented package, such a decrease in speed is not justifiable.

  • It was decided to implement move_quantile in parallel to move_median. This causes a lot of code repetition, but it was necessary in order not to sacrifice move_median performance (and also to avoid abusing macros) cd49b4f. A lot of the functions in move_median.c were almost duplicated, hence the large diff. At this commit, both move_quantile and move_median were fully implemented and almost fully tested.

  • When move_quantile is called with q=0., move_min is called instead, which is much faster. The same applies to q=1. (move_max) and q=0.5 (move_median).

  • Only interpolation method "midpoint" was implemented for now.
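
The special-case dispatch from the list above (q=0., q=1. and q=0.5) can be sketched in pure Python as follows. This is an illustration only, not the C implementation: `midpoint_quantile` and `move_quantile_sketch` are hypothetical names, the window statistic is recomputed naively for every position, NaNs in the input are not handled, and the first window - 1 outputs are assumed to be NaN.

```python
import math
import statistics


def midpoint_quantile(values, q):
    """Quantile with "midpoint" interpolation on a non-empty sequence."""
    s = sorted(values)
    pos = (len(s) - 1) * q  # virtual index of the quantile
    return (s[math.floor(pos)] + s[math.ceil(pos)]) / 2


def move_quantile_sketch(a, window, q):
    """Naive moving quantile that dispatches cheap special cases first."""
    if q == 0.0:
        stat = min                    # plays the role of move_min
    elif q == 1.0:
        stat = max                    # plays the role of move_max
    elif q == 0.5:
        stat = statistics.median      # plays the role of move_median
    else:
        def stat(w):
            return midpoint_quantile(w, q)
    # Assumption: positions without a full window yield NaN.
    out = [math.nan] * min(window - 1, len(a))
    for i in range(window - 1, len(a)):
        out.append(stat(a[i - window + 1:i + 1]))
    return out


data = [3, 1, 4, 1, 5]
print(move_quantile_sketch(data, 3, 0.0)[2:])   # [1, 1, 1]
print(move_quantile_sketch(data, 3, 0.25)[2:])  # [2.0, 1.0, 2.5]
```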

Other changes

  • Function parse_args in move_template.c was heavily refactored for better clarity

Technicalities

  • np.nanquantile behaves unexpectedly when there are np.inf values in the data. See, for instance, BUG: np.percentile gives unreasonable results when array contains np.inf numpy/numpy#21932 and BUG: inf in quantile has undefined behaviour (and possibly different for -inf vs +inf) numpy/numpy#21091. In particular, np.nanquantile(q=0.5) doesn't give the same result as np.nanmedian on such data, because of how arithmetic operations work on np.inf. Our move_quantile behaves as expected and agrees with move_median when q=0.5. To test properly (and to have a slow NumPy version of move_quantile), we note that the np.nanmedian behaviour can be reproduced by taking
    (np.nanquantile(a, q=0.5, method="lower") + np.nanquantile(a, q=0.5, method="higher")) / 2. This is what the slow function uses if there are np.inf values in the data. That this expression and np.nanmedian return the same result is tested in move_test.py. The issue is also discussed there in the comments (which I used pretty liberally).
  • When there are no infs in a, the usual np.nanquantile is called in bn.slow.move_quantile, so benching is "fair", since we don't consider infinite values during benching.
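
The arithmetic behind the inf problem can be shown without NumPy. The sketch below assumes that the default interpolation is computed in a lerp-like form lo + (hi - lo) * t, which is a simplification of what NumPy does internally; `lerp` and `lower_higher_average` are hypothetical helper names. When both neighbours are inf, the difference hi - lo is inf - inf, which is NaN, whereas averaging the "lower" and "higher" values keeps the inf.

```python
import math


def lerp(lo, hi, t):
    """Linear interpolation written as lo + (hi - lo) * t (assumed form)."""
    return lo + (hi - lo) * t


def lower_higher_average(lo, hi):
    """Average of the "lower" and "higher" neighbours, as in the workaround."""
    return (lo + hi) / 2


inf = math.inf
print(lerp(inf, inf, 0.5))             # nan, because inf - inf is nan
print(lower_higher_average(inf, inf))  # inf
print(lerp(1.0, 3.0, 0.5))             # 2.0, finite data is unaffected
```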

Tests

  • Extensive tests were added for move_quantile. With the constant REPEAT_FULL_QUANTILE set to 1 in test_move_quantile_with_infs_and_nans, the test covers 200k instances and takes ~7 mins to run. It was also run with more repetitions and a larger range of parameters; the current values are set so that the GitHub Actions tests run in a reasonable time.

Benches

  • bn.move_quantile is significantly faster than bn.slow.move_quantile:
    Bottleneck 1.3.5.post0.dev24; Numpy 1.23.1
    Speed is NumPy time divided by Bottleneck time
    None of the array elements are NaN

   Speed  Call                                  Array
   269.9  move_quantile(a, 1, q=0.25)           rand(1)
  2502.7  move_quantile(a, 2, q=0.25)           rand(10)
  6718.9  move_quantile(a, 20, q=0.25)          rand(100)
  5283.4  move_quantile(a, 200, q=0.25)         rand(1000)
  5747.2  move_quantile(a, 2, q=0.25)           rand(10, 10)
  3197.3  move_quantile(a, 20, q=0.25)          rand(100, 100)
  3051.9  move_quantile(a, 20, axis=0, q=0.25)  rand(100, 100, 100)
  3135.6  move_quantile(a, 20, axis=1, q=0.25)  rand(100, 100, 100)
  3232.8  move_quantile(a, 20, axis=2, q=0.25)  rand(100, 100, 100)

The increase in speed was also confirmed separately (outside of bn.bench) as a sanity check. q = 0.25 is used for all benches with move_quantile.

  • A slight complication is that these benches now take a long time to run, because np.nanquantile is so slow. bn.bench(functions=["move_quantile"]) runs for about 20 minutes:
Bottleneck performance benchmark
    Bottleneck 1.3.5.post0.dev24; Numpy 1.23.1
    Speed is NumPy time divided by Bottleneck time
    NaN means approx one-fifth NaNs; float64 used

              no NaN     no NaN      NaN       no NaN      NaN    
               (100,)  (1000,1000)(1000,1000)(1000,1000)(1000,1000)
               axis=0     axis=0     axis=0     axis=1     axis=1  
move_quantile 6276.9     1961.2     1781.8     2294.7     2255.1

Further changes

Several things that can be improved with move_quantile going further:

  • Implement more interpolation methods. The refactoring of the parse_args function made it much easier to pass additional arguments to the move functions. Changing the behavior of mq_get_quantile should not be a problem either.
  • np.quantile supports a list (iterable) of quantiles to compute. This can also be added here, and is quite easy to do if it is first implemented at the Python level.
  • I attempted to make q a required argument of move_quantile (as it should be), but ran into some complications and left it as is. If a Python wrapper is created to parse an iterable q anyway, a non-keyword q can be added in that Python layer.

Wrap-up

Thanks for considering, and sorry for a large diff. 50% of that is duplicating code in move_median.c, and another 20% is new tests. You can see in de181da how few changes were actually made for move_quantile to work, but this approach just unfortunately slowed down move_median by a bit.

andrii-riazanov and others added 29 commits September 12, 2022 21:24
Implementation of moving quantile instead of moving median, which is a partial case of quantile. For now using the fixed constant in #define
Add quantile and has_quantile parameters to the template
Added all imports
Fixed a bug when q=1. Call move_max for this case (on python layer)
Added a lot of tests for move_median and move_quantile
Fix keyword argument "method"/"interpolation" for different
numpy versions (keyword was changed after 1.22.0)
Copy over doc string for move_quantile from C layer to Python layer
This reverts commit c2a2ae3.

move_min is significantly faster than move_quantile with q = 0. So in case of
q=0 apply move_min instead. Same for q=1 and move_max.
Instead of move_quantile substituting move_median completely,
have both move_median and move_quantile implemented
separately.
Remove the wrapper for move_quantile
on python level which checked for
q = 0 or 1. Now it's fully in C.
Also check for q=0.5 as we checked
it's 3-4% faster to call move_median
Remove redundant import
This eliminates the need for a macro in move_median.c
mm_handle will just have an unused member "quantile"
for the case of move_median.
move_median and move_quantile now have all the same
functions except for the construction of mm/mq.
@andrii-riazanov
Author

andrii-riazanov commented Sep 29, 2022

Update 1

The implementation in move_median.c was refactored in 72677f8 to remove the code repetition and the use of macros completely. Now move_median and move_quantile use the same functions for managing the heaps, and differ only when they calculate the actual statistic. This makes the joint implementation of mm and mq cleaner while keeping the performance of move_median unchanged. The diff in the source code is much smaller now.

@andrii-riazanov
Author

Update 2

In 2c892db I added a very simple Python layer for move_quantile to support an iterable q argument. The argument q was also made a required (non-default) argument at the Python layer. The documentation (copied from move_template.c) was updated accordingly.
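
A thin Python layer of this kind might look roughly like the sketch below. This is not the code from 2c892db: the wrapper name `move_quantile_any_q` and the `single_q_func` parameter (standing in for the C-level function that handles one quantile) are hypothetical, and returning a plain list for an iterable q is an assumption about the output format.

```python
from collections.abc import Iterable


def move_quantile_any_q(single_q_func, a, window, q):
    """Accept a scalar or an iterable q; q is required (no default)."""
    def check(qi):
        if not 0.0 <= qi <= 1.0:
            raise ValueError(f"q must be in [0, 1], got {qi}")
        return qi

    if isinstance(q, Iterable):
        # One pass of the single-quantile function per requested quantile.
        return [single_q_func(a, window, check(qi)) for qi in q]
    return single_q_func(a, window, check(q))


# Usage with a stand-in for the single-quantile function:
def fake_move_quantile(a, window, q):
    return (window, q)

print(move_quantile_any_q(fake_move_quantile, [1, 2, 3], 2, [0.1, 0.9]))
# [(2, 0.1), (2, 0.9)]
```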

@andrii-riazanov
Author

Hi @rdbisme, I was wondering if someone could take a look or make a comment on this PR at their convenience. I know it's a large one, just want to understand what I could expect from this. Thanks :)

@rdbisme
Collaborator

rdbisme commented Apr 14, 2023

Ehi @andrii-riazanov, thanks for your contribution. I'm currently managing this package alone, mostly focusing on keeping it easily available and installable on supported Python versions.

I hope someone else from the community can step in and help review implementations and improvements of the actual business logic, as in your PR.

Otherwise, I'll try to find a bit of free time to actually give it a look, but it might take time.

Anyway, if anyone is reading this, feel free to step in this discussion and provide feedback :)

@RichieHakim

While I'm not able to help directly with the code, I'm very thankful and eager to try this out. Also:

  1. Currently, the best moving quantiles are: pandas.DataFrame.rolling.quantile + multiprocessing, as well as rolling_quantiles (https://github.com/marmarelis/rolling-quantiles)
  2. These are partially benchmarked here: (https://github.com/RichieHakim/rolling_percentile)
  3. This is a paper on how median calculation can be done faster than typical insertion sort methods (https://www.stat.cmu.edu/~ryantibs/papers/median.pdf)
