
perf: speedup date parsing using ciso8601 #590

Open
wants to merge 3 commits into base: main
Conversation

maartenbreddels

This is a cutout of running the profiler on voila, where it handles a single comm msg:
[profiler screenshot: before the change]

As can be seen, the green 'parse_date' is a significant part of the CPU usage.

With this PR, that shrinks quite a bit:
[profiler screenshot: with this PR]

This uses ciso8601 to do the datetime parsing, which is significantly faster.

Note that ciso8601 is not a requirement; it will only be used when installed. However, one test regarding timezones fails (it should fail to convert and it does not; I'm not sure how important that is).
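A minimal sketch of that optional-dependency pattern, assuming a helper named `parse_date` (the name is illustrative, not the PR's actual code): use ciso8601 when it is importable, otherwise fall back to the stdlib.

```python
import datetime

try:
    import ciso8601

    def parse_date(s):
        # ciso8601.parse_datetime is a fast C implementation of ISO 8601 parsing
        return ciso8601.parse_datetime(s)
except ImportError:
    def parse_date(s):
        # stdlib fallback; handles the common ISO 8601 forms
        return datetime.datetime.fromisoformat(s)
```

Because the import is wrapped in try/except, the package stays an optional speedup rather than a hard dependency.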

Running it in a benchmark

Before:

--------------------------------------------------------- benchmark: 1 tests --------------------------------------------------------
Name (time in us)                     Min       Max      Mean   StdDev    Median      IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     226.2670  556.7280  272.2735  58.3634  249.3170  19.2865   291;378        3.6728    2251           1
-------------------------------------------------------------------------------------------------------------------------------------

After:

-------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in us)                    Min         Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     59.9780  4,237.0830  76.7926  60.7056  67.9120  4.8000   190;974       13.0221    5717           1
-----------------------------------------------------------------------------------------------------------------------------------

@maartenbreddels
Author

Relying on the regex gives about the same performance, but should let the tests pass:

------------------------------------------------------- benchmark: 1 tests ------------------------------------------------------
Name (time in us)                    Min       Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     57.6940  597.6290  71.3625  32.5174  61.1040  3.7240   413;661       14.0130    3586           1
---------------------------------------------------------------------------------------------------------------------------------
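The regex-gated approach can be sketched as follows. The pattern below is illustrative (jupyter_client's actual ISO 8601 regex may differ); the point is that only strings matching the full timestamp shape are handed to the parser, so non-date strings pass through untouched and the timezone test keeps its expected behavior.

```python
import re
import datetime

# illustrative pattern: date, 'T', time, optional fraction, optional offset
ISO8601_PAT = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})?$"
)

def maybe_parse_date(value):
    # only strings that look like complete ISO 8601 timestamps are parsed
    if isinstance(value, str) and ISO8601_PAT.match(value):
        return datetime.datetime.fromisoformat(value.replace("Z", "+00:00"))
    return value
```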

@maartenbreddels
Author

In fact, I think we can assume we know where the date fields are, right?

That gives us the following performance:

------------------------------------------------------- benchmark: 1 tests ------------------------------------------------------
Name (time in us)                    Min       Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     27.2420  201.6740  32.8866  11.3219  29.4945  1.3950   520;798       30.4076    5266           1
---------------------------------------------------------------------------------------------------------------------------------
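The "we know where the dates are" idea avoids walking the whole message: in the Jupyter message layout, only the `date` key of `header` and `parent_header` carries a timestamp. A hedged sketch (the helper name is illustrative):

```python
import datetime

# only these message sections carry a 'date' field in the Jupyter protocol
DATE_FIELDS = ("header", "parent_header")

def extract_known_dates(msg):
    # parse just the known date keys instead of scanning every value
    for field in DATE_FIELDS:
        section = msg.get(field)
        if section and "date" in section:
            section["date"] = datetime.datetime.fromisoformat(section["date"])
    return msg
```

This skips both the recursive walk and the per-string regex match, which is where the extra factor of ~2 in the benchmark comes from.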

And when using orjson instead of json, we can get:

-------------------------------------------------------- benchmark: 1 tests -------------------------------------------------------
Name (time in us)                    Min         Max     Mean   StdDev   Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------
test_deserialize_performance     21.9820  3,148.4180  35.3785  76.8630  23.5860  5.6985    50;395       28.2658    1939           1
-----------------------------------------------------------------------------------------------------------------------------------
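Swapping json for orjson on the unpack path can be done with the same optional-import pattern as ciso8601, so the stdlib remains the fallback (sketch; the `unpack` name is illustrative):

```python
import json

try:
    import orjson

    def unpack(buf):
        # orjson.loads accepts bytes directly and returns str keys/values
        return orjson.loads(buf)
except ImportError:
    def unpack(buf):
        # stdlib json.loads also accepts bytes (Python 3.6+)
        return json.loads(buf)
```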

This is 10x faster.

@martinRenou
Member

Will you include your changes on orjson in this PR? Or a separate one?

Thanks for working on this!

@maartenbreddels
Author

Well, jupyter_client uses `from zmq.utils import jsonapi`, which seems like a smart idea, but orjson does not support the `separators` argument, which is used in that module. I'm also not sure whether orjson supports `ensure_ascii=False` and `allow_nan=False`.
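To illustrate the `separators` point: the stdlib json module lets callers request compact output via keyword arguments, which is what that module relies on (orjson's `dumps` is always compact and does not take these keywords).

```python
import json

# stdlib json: compact separators and non-ASCII passthrough are opt-in
compact = json.dumps({"a": 1, "b": 2}, separators=(",", ":"))
# compact == '{"a":1,"b":2}'

unicode_out = json.dumps({"s": "é"}, ensure_ascii=False)
# keeps 'é' as a literal character instead of a \u00e9 escape
```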

Member

@minrk minrk left a comment


This is great! We did something similar in ipython/ipyparallel#424, where we discovered how expensive date parsing is when there are lots of messages, but we disabled it entirely, since that skips the regex match as well.

What I think would be even better is to remove date parsing entirely, but that's a backward-incompatible change. That said, I think very few things (ipyparallel may even be the only one) really rely on dates being parsed as part of deserialization, so it should be a major-version change with relatively minor disruption.

What do you think about skipping extract_dates entirely as a major revision change? I can't think of a great mechanism for gracefully deprecating this functionality. Maybe a config option on Session with a deprecation warning?
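One possible shape for the deprecation mechanism mentioned above: a flag on Session that keeps date extraction on by default but warns when the code path is exercised, so downstream users get a release cycle to adapt. This is a sketch, not jupyter_client's actual API (the real `Session` is a traitlets Configurable, and `deserialize` takes a message list, not a dict).

```python
import warnings

class Session:
    # default True for backward compatibility; flipping the default (or
    # removing the pass entirely) would be the major-version change
    extract_dates = True

    def deserialize(self, msg):
        if self.extract_dates:
            warnings.warn(
                "date extraction during deserialization is deprecated; "
                "parse dates downstream instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # ... the old extract_dates pass would run here ...
        return msg
```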

-run: py.test --cov jupyter_client jupyter_client
+run: |
+  py.test --cov jupyter_client jupyter_client
+  pip install ciso8601
Member


Can this be added to the existing pip install, one step above, instead of being part of the test stage?

@minrk
Member

minrk commented Dec 9, 2020

Well, jupyter_client uses `from zmq.utils import jsonapi`

This is a legacy from when we supported Pythons that didn't even have json in the standard library and there were fiddly issues with str vs bytes. There's no need to keep that, as long as we use something that's producing valid utf8 JSON bytes.

 message['msg_id'] = header['msg_id']
 message['msg_type'] = header['msg_type']
-message['parent_header'] = extract_dates(self.unpack(msg_list[2]))
+message['parent_header'] = self.unpack(msg_list[2])
+if 'date' in message['parent_header']:
Member


This will break current stable ipyparallel, which currently relies on the behavior of parsing any date-like string, but I can deal with that. I never should have made it find and parse any valid date objects!

I suspect that parsing just the date will break exactly as many things as disabling parsing entirely, though. Which is to say: ipyparallel and probably nothing else.

@blink1073
Contributor

Hi @maartenbreddels! We're pushing for a 7.0 release, do you want to pick this back up?
