Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Casting existing timestamp to timestamp again strips timezone information #12218

Open
jeffreyssmith2nd opened this issue Aug 28, 2024 · 23 comments
Assignees
Labels
bug Something isn't working

Comments

@jeffreyssmith2nd
Copy link
Contributor

Describe the bug

Using the ::timestamp (or CAST()) function strips timezone information.

To Reproduce

DataFusion CLI v40.0.0
> select arrow_typeof(now()) as tz;
+---------------------------------------+
| tz                                    |
+---------------------------------------+
| Timestamp(Nanosecond, Some("+00:00")) |
+---------------------------------------+

> select arrow_typeof(now()::timestamp) as tz;
+-----------------------------+
| tz                          |
+-----------------------------+
| Timestamp(Nanosecond, None) |
+-----------------------------+

> select arrow_typeof((now() AT TIME ZONE 'America/Denver')::timestamp) as tz;
+-----------------------------+
| tz                          |
+-----------------------------+
| Timestamp(Nanosecond, None) |
+-----------------------------+

Or as seen with actual values:

DataFusion CLI v40.0.0
> select now();
+-----------------------------+
| now()                       |
+-----------------------------+
| 2024-08-28T17:08:08.698528Z |
+-----------------------------+

> select (now() AT TIME ZONE 'America/Denver');
+----------------------------------+
| now()                            |
+----------------------------------+
| 2024-08-28T11:08:13.578025-06:00 |
+----------------------------------+

> select now()::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2024-08-28T17:08:17.228985 |
+----------------------------+

> select (now() AT TIME ZONE 'America/Denver')::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2024-08-28T17:08:20.042920 |
+----------------------------+

Expected behavior

I would expect that using the ::timestamp cast on a timestamp would not impact its timezone. This is how Postgres behaves.

postgres@localhost:postgres> select now()::timestamp
+----------------------------+
| now                        |
|----------------------------|
| 2024-08-28 17:05:07.779323 |
+----------------------------+

postgres@localhost:postgres> select (now() at time zone 'America/Denver')::timestamp
+----------------------------+
| timezone                   |
|----------------------------|
| 2024-08-28 11:05:09.275359 |
+----------------------------+

Additional context

No response

@devanbenz
Copy link
Contributor

I'd be happy to take this if it's not already being worked on.

@devanbenz
Copy link
Contributor

take

@findepi
Copy link
Member

findepi commented Aug 30, 2024

i'd expect ::timestamp / CAST(x AS timestamp) to cast to a timestamp without time zone type.
Is this what None zone indicates?

@devanbenz
Copy link
Contributor

@findepi It appears so

@findepi
Copy link
Member

findepi commented Aug 31, 2024

The SQL spec compliant behavior when casting timestamp with time zone to timestamp is to strip time zone information and retain year/month/day/hour/minute/second fields.
You can observe this behavior in Trino

trino> SELECT x "twtz", CAST(x AS timestamp) AS "t" FROM (VALUES (CAST('2024-08-31 11:47:58.977899 Europe/Warsaw' AS timestamp with time zone))) t(x);
                 twtz                  |            t
---------------------------------------+-------------------------
 2024-08-31 11:47:58.978 Europe/Warsaw | 2024-08-31 11:47:58.978
(1 row)

In a sense PostgreSQL's timestamp with time zone to timestamp cast follows the spec because PostgreSQL's timestamp with time zone value doesn't include the time zone information. It's "a point in time in UTC.

postgres=# SELECT x "twtz", CAST(x AS timestamp) AS "t" FROM (VALUES (CAST('2024-08-31 11:47:58.977899 Europe/Warsaw' AS timestamp with time zone))) t(x);
             twtz              |             t
-------------------------------+----------------------------
 2024-08-31 09:47:58.977899+00 | 2024-08-31 09:47:58.977899

Now, DF behavior when casting timestamp with time zone to timestamp seems to be to convert timestamp with time zone to UTC (retaining point in time) and then strip zone

> SELECT x "twtz", CAST(x AS timestamp) AS "t" FROM (VALUES (CAST('2024-08-31 11:47:58.977899 Europe/Warsaw' AS timestamp with time zone))) t(x);
+----------------------------------+----------------------------+
| twtz                             | t                          |
+----------------------------------+----------------------------+
| 2024-08-31T11:47:58.977899+02:00 | 2024-08-31T09:47:58.977899 |
+----------------------------------+----------------------------+

IMO we should change DF cast from timestamp with time zone to timestamp so that it retains year/month/day/hour/minute/second fields and strips the zone information.

cc @alamb

@findepi
Copy link
Member

findepi commented Aug 31, 2024

apache/arrow-rs#5827 seems related.

@devanbenz
Copy link
Contributor

So would the expected outcome be that timestamp will always have UTC time (which it seems to already be currently doing)?

@findepi
Copy link
Member

findepi commented Sep 2, 2024

I wouldn't say that timestamp has UTC time.
timestamp is local date/time. You can think of it as a struct with fields year/month/day/hour/minute/seconds(fractional) and it has no relationship with UTC.
However, when doing math on timestamp values using a date/time library it's often to represent it as UTC-based.

@devanbenz
Copy link
Contributor

I wouldn't say that timestamp has UTC time. timestamp is local date/time. You can think of it as a struct with fields year/month/day/hour/minute/seconds(fractional) and it has no relationship with UTC. However, when doing math on timestamp values using a date/time library it's often to represent it as UTC-based.

Makes sense, so the expectation is for the date-time calculation to be correct for that TZ but no offset?

So for the example above:

> select (now() AT TIME ZONE 'America/Denver')::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2024-08-28T17:08:20.042920 |
+----------------------------+

Should actually be:

> select (now() AT TIME ZONE 'America/Denver')::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2024-08-28T11:08:13.578025 |
+----------------------------+

@findepi
Copy link
Member

findepi commented Sep 4, 2024

@devanbenz Agreed
if now() AT TIME ZONE 'America/Denver' is 2024-08-28T11:08:13.578025-06:00 (or 2024-08-28 11:08:13.578025 America/Denver)
then (now() AT TIME ZONE 'America/Denver')::timestamp should be 2024-08-28[T]11:08:13.578025

@devanbenz
Copy link
Contributor

Okay sounds good @findepi -- I do see: https://github.com/apache/arrow-rs/pull/5831/files which appears to be relevant. I think this change would be better suited for arrow vs. datafusion I suppose?

@findepi
Copy link
Member

findepi commented Sep 6, 2024

@devanbenz i think so, yes

@devanbenz
Copy link
Contributor

devanbenz commented Sep 10, 2024

@findepi @alamb It actually appears that datafusion (and in regards apache arrow) seems to be casting as expected afaik. It looks like the now() udf in datafusion is inconsistent with how other engines do it. Taking a look at 2 other systems Postgres and Clickhouse I'm seeing that when calling now() without any adjustments it returns the local TZ. It appears that datafusion is returning Zulu time. Not sure if this is by design. But I'm recalling a thread: #12357 (comment)

It seems to me we also haven't documented anywhere the "the built in SQL dialect tries to follow postgresql semantics when possible"

Should built in functions be following PG as well?

Postgres

postgres=# SELECT now();
              now
-------------------------------
 2024-09-10 11:22:41.895841-05
(1 row)

postgres=# SELECT now()::timestamp;
            now
----------------------------
 2024-09-10 11:22:47.294941
(1 row)

postgres=# SELECT now()::timestamptz;
              now
-------------------------------
 2024-09-10 11:22:52.118688-05
(1 row)

Clickhouse

:) SELECT now();

SELECT now()

Query id: 13a227b8-f867-49ad-9941-8d8bb70b9c92

   ┌───────────────now()─┐
1. │ 2024-09-10 11:25:49 │
   └─────────────────────┘

1 row in set. Elapsed: 0.005 sec.

:) SELECT now()::timestamp;

SELECT CAST(now(), 'timestamp')

Query id: adb042b8-7ffb-4e98-bbb4-21affade7201

   ┌─CAST(now(), 'timestamp')─┐
1. │      2024-09-10 11:25:56 │
   └──────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

Datafusion

> SELECT now();
+-----------------------------+
| now()                       |
+-----------------------------+
| 2024-09-10T16:26:39.928247Z |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.

> SELECT now()::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2024-09-10T16:26:44.623198 |
+----------------------------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> SELECT now()::timestamptz;
+-----------------------------+
| now()                       |
+-----------------------------+
| 2024-09-10T16:26:52.911837Z |
+-----------------------------+
1 row(s) fetched.
Elapsed 0.006 seconds.

Note: All queries were ran at around 11:30 AM CST. The PG and CH queries are using local TZ where as DF does not and uses Zulu time instead. Clickhouse does not support timestamptz hence why it was not in the query outputs given in the examples.

Taking a look at a similar query with Clickhouse I see that casting to a timestamp will strip the timezone information -- since they are using Apache Arrow the functionality should be the same afaik:

:) SELECT toTimeZone(now(), 'America/Denver');

SELECT toTimeZone(now(), 'America/Denver')

Query id: 9a79d47a-8551-4199-8b1a-ae9682259cbc

   ┌─toTimeZone(now(), 'America/Denver')─┐
1. │                 2024-09-10 10:45:14 │
   └─────────────────────────────────────┘

1 row in set. Elapsed: 0.007 sec.

:) SELECT toTimeZone(now(), 'America/Denver')::timestamp;

SELECT CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')

Query id: b4ab2ec2-e361-4d83-8235-2ec9e95a43b5

   ┌─CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')─┐
1. │                                    2024-09-10 11:45:19 │
   └────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.003 sec.

But instead of stripping it and making it Zulu time it just sets it back to whatever the now() call would do which is what Datafusion currently is doing... We are just using Zulu instead of the local TZ.

@alamb I think that we shouldn't modify the current casting since it appears to be functioning as expected per the arrow spec apache/arrow-rs#5827 (comment) (at least for this bug specifically) but we should change the way that now() works to have it function as expected and falls in line with the postgres spec 🤔

@jeffreyssmith2nd
Copy link
Contributor Author

jeffreyssmith2nd commented Sep 10, 2024

Just for the sake of clarity, my original description of the behaviour holds with times other than now().

> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York'));
+-------------------------------------------------+
| arrow_typeof(Utf8("2020-01-01T00:00:00Z"))      |
+-------------------------------------------------+
| Timestamp(Nanosecond, Some("America/New_York")) |
+-------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York')::timestamp);
+--------------------------------------------+
| arrow_typeof(Utf8("2020-01-01T00:00:00Z")) |
+--------------------------------------------+
| Timestamp(Nanosecond, None)                |
+--------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

That being said, if this is the expected behaviour of other SQL dialects then it seems reasonable to me that it would behave the same way in DataFusion as well.

@devanbenz
Copy link
Contributor

devanbenz commented Sep 10, 2024

Just for the sake of clarity, my original description of the behaviour holds with times other than now().

> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York'));
+-------------------------------------------------+
| arrow_typeof(Utf8("2020-01-01T00:00:00Z"))      |
+-------------------------------------------------+
| Timestamp(Nanosecond, Some("America/New_York")) |
+-------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> select arrow_typeof(('2020-01-01T00:00:00Z' AT TIME ZONE 'America/New_York')::timestamp);
+--------------------------------------------+
| arrow_typeof(Utf8("2020-01-01T00:00:00Z")) |
+--------------------------------------------+
| Timestamp(Nanosecond, None)                |
+--------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

That being said, if this is the expected behaviour of other SQL dialects then it seems reasonable to me that it would be behave the same way in DataFusion as well.

Ah yeah just testing something similar out in Clickhouse I'm seeing that it strips the timezone information, which is in line with how arrow currently works.

:) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver');

SELECT toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')

Query id: 1ae2e612-51cd-47e1-9cbb-81acb5fae491

   ┌─toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver')─┐
1. │                                                    2024-09-10 10:45:19 │
   └────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

:) SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp;

SELECT CAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver'), 'timestamp')

Query id: 2463bbee-363b-4339-91b2-4c9055e064ce

   ┌─CAST(toTimeZone(CAST('2024-09-10 11:45:19', 'timestamp'), 'America/Denver'), 'timestamp')─┐
1. │                                                                       2024-09-10 11:45:19 │
   └───────────────────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

But taking a look at postgres that is not the case as you had mentioned above:

postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver');
      timezone
---------------------
 2024-09-10 10:45:19
(1 row)

postgres=# SELECT ('2024-09-10 11:45:19' AT TIME ZONE 'America/Denver')::timestamp;
      timezone
---------------------
 2024-09-10 10:45:19
(1 row)

Taking a look at MariaDB I'm seeing a similar output:

MariaDB [(none)]> SELECT TIMESTAMP(CONVERT_TZ('2024-09-10 11:45:19', '-5:00', '-6:00'));
+----------------------------------------------------------------+
| TIMESTAMP(CONVERT_TZ('2024-09-10 11:45:19', '-5:00', '-6:00')) |
+----------------------------------------------------------------+
| 2024-09-10 10:45:19                                            |
+----------------------------------------------------------------+
1 row in set (0.001 sec)

I think the now() function should probably be changed as users would most likely expect it to function similar to other engines but as for the current casting methodology I'm going to update apache/arrow-rs#5827 with a proposal thats more in line with how established SQL database systems currently operate with timestamp -> timezone update -> timestamp conversion. As well as take a look at https://github.com/apache/arrow/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen to see if there are any issues related to timezone casting in the original arrow spec. Its strange that there appears to be a discrepancy with casting to timestamp in which Arrow is stripping timezones vs how more "traditional" RDBMs' are not. Seeing as Clickhouse uses the original Apache Arrow format vs Datafusion using arrow-rs it does appear to fall in line with the spec.

@devanbenz
Copy link
Contributor

devanbenz commented Sep 12, 2024

Continuation from this thread

The TL;DR is that we change nothing since we are in line with the current Apache Arrow expectations and datafusion functions similar to a different system that uses Arrow--Clickhouse it is consistent. This proposal is under the assumption that we desire to be completely in line with the current Apache Arrow specification though.

This investigation did lead me to two other findings in which I think should be action'ed on. They are not arrow-rs related but Datafusion related.

  1. now() and the default timestamp value should function similar to how other systems do it. Unless specified it should be using the local system timestamp. Not sure if this is desired but it appears to be how other systems work.
  2. Datafusion needs to implement something similar to the following in Clickhouse for casting to timestamp with a specified timezone:
desc (SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp a);
┌─name─┬─type─────┬
│ a    │ DateTime │
└──────┴──────────┴

desc (SELECT toTimeZone('2024-09-10 11:45:19'::timestamp, 'America/Denver')::timestamp('America/Los_Angeles') a);
┌─name─┬─type────────────────────────────┬
│ a    │ DateTime('America/Los_Angeles') │
└──────┴─────────────────────────────────┴

I believe the functionality is there in the code since I see that the Timestamp datatype can have an input parameter of Option(Timezone) but I do not think the SQL parsing functionality operates as expected at this moment in time.

@alamb
Copy link
Contributor

alamb commented Sep 13, 2024

Datafusion needs to implement something similar to the following in Clickhouse for casting to timestamp with a specified timezone:

I think you can do it like

select now() AT TIME ZONE 'America/Denver';

Though it would be awesome if that was documented (it doesn't appear to be in the SQL reference): https://datafusion.apache.org/search.html?q=AT+TIMEZONE

@devanbenz
Copy link
Contributor

Datafusion needs to implement something similar to the following in Clickhouse for casting to timestamp with a specified timezone:

I think you can do it like

select now() AT TIME ZONE 'America/Denver';

Though it would be awesome if that was documented (it doesn't appear to be in the SQL reference): datafusion.apache.org/search.html?q=AT+TIMEZONE

Taking a look at this it appears that casting to a timezone does work in datafusion but it doesn't function to how I would expect (or maybe it needs to be a different type of cast similar to clickhouse syntax?):

Clickhouse:

:) SELECT toTimeZone(now(), 'America/Denver')::timestamp;

SELECT CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')

Query id: 2f766f9f-9723-41dc-9a01-901421793e65

   ┌─CAST(toTimeZone(now(), 'America/Denver'), 'timestamp')─┐
1. │                                    2024-09-13 16:00:10 │
   └────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

:) SELECT toTimeZone(now(), 'America/Denver')::timestamp('America/Denver');

SELECT CAST(toTimeZone(now(), 'America/Denver'), 'timestamp(\'America/Denver\')')

Query id: db0d0937-89b8-4ea4-93cd-d07c157733b2

   ┌─CAST(toTimeZone(now(), 'America/Denver'), 'timestamp(\'America/Denver\')')─┐
1. │                                                        2024-09-13 15:00:21 │
   └────────────────────────────────────────────────────────────────────────────┘

1 row in set. Elapsed: 0.002 sec.

Datafusion:

> select (now() at time zone 'America/Chicago');
+----------------------------------+
| now()                            |
+----------------------------------+
| 2024-09-13T15:45:08.013132-05:00 |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.

> select (now() at time zone 'America/Chicago')::timestamp;
+----------------------------+
| now()                      |
+----------------------------+
| 2024-09-13T20:45:16.085603 |
+----------------------------+
1 row(s) fetched.
Elapsed 0.007 seconds.

> select (now() at time zone 'America/Chicago')::timestamp at time zone 'America/Chicago';
+----------------------------------+
| now()                            |
+----------------------------------+
| 2024-09-13T20:45:28.485804-05:00 |
+----------------------------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

Thats more so what I mean -- that when casting to a timestamp with a given timezone the output should abide by the timezone where it appears that AT TIME ZONE casts from a timestamp to a timestamptz (in postgres terminology).

@Omega359
Copy link
Contributor

Unfortunately as it stands the now() function nor any other UDF function has no access to the timezone of the server (though it could likely be retrieved) nor the default timezone set in ExecutionOptions (datafusion.execution.time_zone). I would like to fix that at some point but I am fairly certain it would mean a breaking change to UDF's as they are currently implemented.

@findepi
Copy link
Member

findepi commented Oct 28, 2024

@Omega359 good point
we should be able to provide a UDF with some form of environment state while maintaining backwards-compatibility.

@alamb
Copy link
Contributor

alamb commented Oct 28, 2024

I think you could potentially get the server time from the SimplifyInfo passed to

https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.ScalarUDF.html#method.simplify

(aka you could rewrite now() into something like now('Americas/New_York') 🤔

@Omega359
Copy link
Contributor

I looked into that @alamb but I think what is required is access to ConfigOptions (so we can get tz from config, etc). I don't think SimplifyInfo provides that. I was toying with some approaches this weekend for this and didn't come up with anything simple. The most obvious option to me is to add an InvokeInfo that provides access to ExecutionProps and ConfigOptions to the arguments for invoke_batch, etc. That of course is a breaking change though. Beyond that I toyed with:

  • making UDF's not static which looked like a fairly large amount of work just for this
  • forcing users that want UDF's to have access to DF config/etc to be registered separately so that they could inject the config 'manually'
  • using just the system tz for date related things ... which to me might cause more issues than it would solve

Augmenting SimplyInfo to have ConfigOptions could work but it's a lot of effort to rewrite calls just to apply tz info (or anything else that a UDF might want from config) . Seems like applying a jackhammer to insert a screw.

@alamb
Copy link
Contributor

alamb commented Oct 29, 2024

Seems like applying a jackhammer to insert a screw.

Yeah I agree. There are other usecases for storing state / customized in UDFs (like pre-compiling regexps once per query rather than once per batch, for example). I keep trying to find some common API / pattern we can put in and follow

Related discussions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants