
perf: speedup date parsing using ciso8601 #590

Open
wants to merge 3 commits into base: main
Conversation

maartenbreddels

This is a cutout of running the profiler on voila, where it handles a single comm msg:
[profiler screenshot: before the change]

As can be seen, the green 'parse_date' is a significant part of the CPU usage.

With this PR, that shrinks quite a bit:
[profiler screenshot: with this PR]

This uses ciso8601 to do the datetime parsing, which is significantly faster.

Note that ciso8601 is not a requirement; it will only be used when installed. However, one test regarding timezones fails (it should fail to convert and it does not; I'm not sure how important that is).
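A minimal sketch of that optional-dependency pattern, assuming a helper named `parse_date` (the name is illustrative, not the PR's actual code): use ciso8601 when it is importable, otherwise fall back to the stdlib.

```python
import datetime

try:
    import ciso8601

    def parse_date(s):
        # ciso8601.parse_datetime is a fast C implementation of ISO 8601 parsing
        return ciso8601.parse_datetime(s)
except ImportError:
    def parse_date(s):
        # stdlib fallback; handles the common ISO 8601 forms
        return datetime.datetime.fromisoformat(s)
```

Because the import is wrapped in try/except, the package stays an optional speedup rather than a hard dependency.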

Running it in a benchmark

Before:

--------------------------------------------------------- benchmark: 1 tests --------------------------------------------------------
Name (time in us)                     Min       Max      Mean   StdDev    Median      IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     226.2670  556.7280  272.2735  58.3634  249.3170  19.2865   291;378        3.6728    2251           1
-------------------------------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in us)                    Min         Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     59.9780  4,237.0830  76.7926  60.7056  67.9120  4.8000   190;974       13.0221    5717           1
-----------------------------------------------------------------------------------------------------------------------------------

@maartenbreddels
Author

Relying on the regex gives about the same performance, but should let the tests pass:

------------------------------------------------------- benchmark: 1 tests ------------------------------------------------------
Name (time in us)                    Min       Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     57.6940  597.6290  71.3625  32.5174  61.1040  3.7240   413;661       14.0130    3586           1
---------------------------------------------------------------------------------------------------------------------------------
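The regex-gated approach can be sketched as follows. The pattern below is illustrative (jupyter_client's actual ISO 8601 regex may differ); the point is that only strings matching the full timestamp shape are handed to the parser, so non-date strings pass through untouched and the timezone test keeps its expected behavior.

```python
import re
import datetime

# illustrative pattern: date, 'T', time, optional fraction, optional offset
ISO8601_PAT = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})?$"
)

def maybe_parse_date(value):
    # only strings that look like complete ISO 8601 timestamps are parsed
    if isinstance(value, str) and ISO8601_PAT.match(value):
        return datetime.datetime.fromisoformat(value.replace("Z", "+00:00"))
    return value
```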

@maartenbreddels
Author

In fact, I think we can assume we know where the date fields are, right?

That gives us the following performance:

------------------------------------------------------- benchmark: 1 tests ------------------------------------------------------
Name (time in us)                    Min       Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     27.2420  201.6740  32.8866  11.3219  29.4945  1.3950   520;798       30.4076    5266           1
---------------------------------------------------------------------------------------------------------------------------------
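The "we know where the dates are" idea avoids walking the whole message: in the Jupyter message layout, only the `date` key of `header` and `parent_header` carries a timestamp. A hedged sketch (the helper name is illustrative):

```python
import datetime

# only these message sections carry a 'date' field in the Jupyter protocol
DATE_FIELDS = ("header", "parent_header")

def extract_known_dates(msg):
    # parse just the known date keys instead of scanning every value
    for field in DATE_FIELDS:
        section = msg.get(field)
        if section and "date" in section:
            section["date"] = datetime.datetime.fromisoformat(section["date"])
    return msg
```

This skips both the recursive walk and the per-string regex match, which is where the extra factor of ~2 in the benchmark comes from.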

And when using orjson instead of json, we can get:

-------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in us)                    Min         Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     21.9820  3,148.4180  35.3785  76.8630  23.5860  5.6985    50;395       28.2658    1939           1
-----------------------------------------------------------------------------------------------------------------------------------
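Swapping json for orjson on the unpack path can be done with the same optional-import pattern as ciso8601, so the stdlib remains the fallback (sketch; the `unpack` name is illustrative):

```python
import json

try:
    import orjson

    def unpack(buf):
        # orjson.loads accepts bytes directly and returns str keys/values
        return orjson.loads(buf)
except ImportError:
    def unpack(buf):
        # stdlib json.loads also accepts bytes (Python 3.6+)
        return json.loads(buf)
```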

This is 10x faster.

@martinRenou
Member

Will you include your changes on orjson in this PR? Or a separate one?

Thanks for working on this!

@maartenbreddels
Author

Well, jupyter_client uses `from zmq.utils import jsonapi`, which seems like a smart idea, but orjson does not support the `separators` argument, which is used in that module. I'm also not sure whether orjson supports `ensure_ascii=False` and `allow_nan=False`.
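To illustrate the `separators` point: the stdlib json module lets callers request compact output via keyword arguments, which is what that module relies on (orjson's `dumps` is always compact and does not take these keywords).

```python
import json

# stdlib json: compact separators and non-ASCII passthrough are opt-in
compact = json.dumps({"a": 1, "b": 2}, separators=(",", ":"))
# compact == '{"a":1,"b":2}'

unicode_out = json.dumps({"s": "é"}, ensure_ascii=False)
# keeps 'é' as a literal character instead of a \u00e9 escape
```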

Member

@minrk minrk left a comment


This is great! We did something similar in ipython/ipyparallel#424, where we discovered how expensive date parsing is when there are lots of messages, but we disabled it entirely, since that skips the regex match as well.

What I think would be even better is to remove date parsing entirely, but that's a backward-incompatible change. That said, I think very few things (ipyparallel may even be the only one) really rely on dates being parsed as part of deserialization, so it should be a major-version change with relatively minor disruption.

What do you think about skipping extract_dates entirely as a major revision change? I can't think of a great mechanism for gracefully deprecating this functionality. Maybe a config option on Session with a deprecation warning?
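One possible shape for the deprecation mechanism mentioned above: a flag on Session that keeps date extraction on by default but warns when the code path is exercised, so downstream users get a release cycle to adapt. This is a sketch, not jupyter_client's actual API (the real `Session` is a traitlets Configurable, and `deserialize` takes a message list, not a dict).

```python
import warnings

class Session:
    # default True for backward compatibility; flipping the default (or
    # removing the pass entirely) would be the major-version change
    extract_dates = True

    def deserialize(self, msg):
        if self.extract_dates:
            warnings.warn(
                "date extraction during deserialization is deprecated; "
                "parse dates downstream instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # ... the old extract_dates pass would run here ...
        return msg
```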

-run: py.test --cov jupyter_client jupyter_client
+run: |
+  py.test --cov jupyter_client jupyter_client
+  pip install ciso8601
Member


Can this be added to the existing pip install, one step above, instead of being part of the test stage?

@minrk
Member

minrk commented Dec 9, 2020

Well, jupyter_client uses `from zmq.utils import jsonapi`

This is a legacy from when we supported Pythons that didn't even have json in the standard library and there were fiddly issues with str vs bytes. There's no need to keep that, as long as we use something that's producing valid utf8 JSON bytes.

 message['msg_id'] = header['msg_id']
 message['msg_type'] = header['msg_type']
-message['parent_header'] = extract_dates(self.unpack(msg_list[2]))
+message['parent_header'] = self.unpack(msg_list[2])
+if 'date' in message['parent_header']:
Member


This will break current stable ipyparallel, which currently relies on the behavior of parsing any date-like string, but I can deal with that. I never should have made it find and parse any valid date objects!

I suspect that parsing just the date will break exactly as many things as disabling parsing entirely, though. Which is to say: ipyparallel and probably nothing else.

@blink1073
Contributor

Hi @maartenbreddels! We're pushing for a 7.0 release, do you want to pick this back up?
