-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add move_quantile function #418
base: master
Are you sure you want to change the base?
Conversation
Implementation of moving quantile instead of moving median, which is a partial case of quantile. For now using the fixed constant in #define
Add quantile and has_quantile parameters to the template
Added all imports Fixed a bug when q=1. Call move_max for this case (on python layer) Added a lot of tests for move_median and move_quantile
Fix keyword argument "method"/"interpolation" for different numpy verisons (keyword was changed after 1.22.0) Copy over doc string for move_quantile from C layer to Python layer
This reverts commit c2a2ae3. move_min is significanlty faster than move_quantile with q = 0. So in case of q=0 apply move_min instead. Same for q=1 and move_max.
Instead of move_quantile substituting move_median completely, have both move_median and move_quantile implemented separately.
Remove the wrapper for move_quantile on python level which checked for q = 0 or 1. Now it's fully in C. Also check for q=0.5 as we checked it's 3-4% faster to call move_median
Add some more tests
Remove redundant import
Mostly get rid of macros
for versions comparison
This eliminate the need for macro in move_median.c mm_handle will just have an unused membet "quanitle" for the case of move_median.
move_median and move_quantile now have all the same functions except for the construction of mm/mq.
Update 1The implementation of move_median.c was refactored in 72677f8 to remove code repetition and usage of macro completely. Now move_median and move_quantile use the same functions for managing heaps, and only differ when they calculate the actual statistic. This makes the implementation of both mm and mq at the same time cleaner while keeping the performance of move_median unchanged. The diff in the source code is much smaller now. |
Update 2In 2c892db added a very simple python layer for move_quantile to support iterable q argument. Also argument |
Hi @rdbisme, I was wondering if someone could take a look or make a comment on this PR at their convenience. I know it's a large one, just want to understand what I could expect from this. Thanks :) |
Ehi @andrii-riazanov, thanks for your contribution. I'm currently alone managing this package, mostly focusing on keep it easily available and installable on supported Python versions. I hope someone else from the community can step in and help to review implementations and improvements of the actual business logic as for your PR. Otherwise, I'll try to find a bit of free time to actually give it a look, but it might take time. Anyway, if anyone is reading this, feel free to step in this discussion and provide feedback :) |
While I'm not able to help directly with the code, I'm very thankful and eager to try this out. Also:
|
This PR adds
move_quantile
to the list of supported move functions.Why?
Quantiles (and moving quantiles) are often useful statistics to look at, and having a fast move version of quantile would be great.
How?
Moving/rolling quantile is implemented in almost exactly the same way as moving median: via two heaps (max-heap and min-heap). The only difference is in sizes of the heaps -- for
move_median
they should have the same size (modulo parity nuances), while for themove_quantile
sizes of the heaps should be rebalanced differently.The changes to transform
move_median
intomove_quantile
are very minor, and were implemented in the first commit 524afbf (36++, 13--). This commit fully implementedmove_quantile
with fixed q=0.25 out ofmove_median
.The initial approach was to substitute
move_median
withmove_quantile
completely. Then, onmove_median
call, just callmove_quantile(q=0.5)
. This is implemented and tested in commits until de181da , where fully workingmove_quantile
(andmove_median
viamove_quantile
) was implemented.At this point, new
move_median
bench was compared to oldmove_median
bench. It was observed that the newmove_median
became slower by 1-3%. Even though the changes were minor, apparently new arithmetic operations introduced were enough to cause this overhead. For a performance-oriented package with decrease in speed is not justifiable.It was decided to implement
move_quantile
parallel tomove_median
. This causes a lot of code repetition, but this needed to be done to not sacrificymove_median
performance (and also to avoid abusing macros) cd49b4f . A lot of the functions inmove_median.c
were almost duplicated, hence a large diff. At this commit, bothmove_quantile
andmove_median
were fully implemented and almost fully tested.When
move_quantile
is called withq=0.
, insteadmove_min
is called, which is much faster. Similarly withq=1.
andmove_max
, and withq=0.5
andmove_median
.Only interpolation method "midpoint" was implemented for now.
Other changes
parse_args
inmove_template.c
was heavily refactored for better clarityTechnicalities
np.percentile
gives unreasonable results when array containsnp.inf
numpy/numpy#21932, BUG: inf in quantile has undefined behaviour (and possibly different for -inf vs +inf) numpy/numpy#21091 . In particular,np.nanquantile(q=0.5)
doesn't give the same result asnp.nanmedian
on such data, because of how arithmetic operation work on np.infs. Ourmove_quantile
behaves as expected and in agreement withmove_median
when q=0.5. To test properly (and have a numpy slow version ofmove_quantile
), we notice thatnp.nanmedian
behaviour can be achieved if one takes(np.nanquantile(a, q=0.5, method="lower") + np.nanquantile(a, q=0.5, method="higher")) / 2
. This is what we use for slow function if there are np.inf's in the data. The fact that this andnp.nanmedian
return the same is tested inmove_test.py
. This issue is also discussed in there in comments (which I used pretty liberally)a
, the usualnp.nanquantile
is called inbn.slow.move_quantile
, so benching is "fair", since we don't consider infinite values during benching.Tests
move_quantile
. With constantREPEAT_FULL_QUANTILE
set to 1 intest_move_quantile_with_infs_and_nans
, the test considers 200k instances, and takes ~7 mins to run. It was tested with more repetitions and larger range of parameter, the current values are set so that the Github Actions tests run reasonable time.Benches
bn.move_quantile
is significantly faster thanbn.slow.move_quantile
:The increase in speed was tested and confirmed separately (outside of bn.bench) for sanity check. q = 0.25 is used for all benches with
move_quantile
.np.nanquantile
is.bn.bench(functions=["move_quantile"])
runs for about 20 minutes:Further changes
Several things that can be improved with
move_quantile
going further:parse_arg
function made it much easier to pass additional arguments to functions inmove
. Changing behavior ofmq_get_quantile
should not be a problem as wellnp.quantile
supports a list (iterable) of quantiles to compute. Can also add it here, quite easy to do if implement it at the first step on python level.q
a required argument formove_quantile
(as it should be), but was met with some complications and left it as is. If will create a python wrapper to parse the iterableq
input anyway, can add non-keywordq
to that python layer.Wrap-up
Thanks for considering, and sorry for a large diff. 50% of that is duplicating code in move_median.c, and another 20% is new tests. You can see in de181da how few changes were actually made for
move_quantile
to work, but this approach just unfortunately slowed downmove_median
by a bit.