Support >= Pandas 2.0.0 #811

Open
devstein opened this issue Apr 5, 2023 · 8 comments

Comments

@devstein
Collaborator

devstein commented Apr 5, 2023

Is your feature request related to a problem? Please describe.
CI is failing due to Pandas 2.0.0

Describe the solution you'd like
Support Pandas 2.0.0

Describe alternatives you've considered
Current solution is to pin <2.0.0

Additional context
https://github.com/jupyter-incubator/sparkmagic/actions/runs/4603955122

#812

@devstein
Collaborator Author

devstein commented Sep 20, 2023

@dosu-bot Can you help debug? Here are the error logs

=================================== FAILURES ===================================
________________________ test_graphbase_get_x_y_values _________________________

self = <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fedee223e50>
how = 'mean'
values = array([['str', 'str', 'str', 'str', 'str', 'str']], dtype=object)
ndim = 2, alt = <function GroupBy.mean.<locals>.<lambda> at 0x7fedee290ea0>

    def _agg_py_fallback(
        self, how: str, values: ArrayLike, ndim: int, alt: Callable
    ) -> ArrayLike:
        """
        Fallback to pure-python aggregation if _cython_operation raises
        NotImplementedError.
        """
        # We get here with a) EADtypes and b) object dtype
        assert alt is not None
    
        if values.ndim == 1:
            # For DataFrameGroupBy we only get here with ExtensionArray
            ser = Series(values, copy=False)
        else:
            # We only get here with values.dtype == object
            # TODO: special case not needed with ArrayManager
            df = DataFrame(values.T)
            # bc we split object blocks in grouped_reduce, we have only 1 col

autovizwidget/autovizwidget/plotlygraphs/graphbase.py:138: InvalidEncodingError
=========================== short test summary info ============================
FAILED autovizwidget/autovizwidget/tests/test_plotlygraphs.py::test_graphbase_get_x_y_values - autovizwidget.widget.invalidencodingerror.InvalidEncodingError: Cannot aggregate column 'temp_diff' with aggregation function 'Avg' because the type cannot be aggregated over.
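
For readers following the log: the final error is a project-level InvalidEncodingError rather than a raw pandas exception. A minimal sketch of how a wrapper could produce an error of that shape, assuming it validates the column dtype before averaging. The class below is a hypothetical stand-in, and `average_by`, `date`, and `label` are names made up for illustration; this is not the code in graphbase.py.

```python
import pandas as pd


class InvalidEncodingError(Exception):
    """Hypothetical stand-in for autovizwidget's InvalidEncodingError."""


def average_by(frame, x_column, y_column):
    """Group by x_column and average y_column, rejecting non-numeric input."""
    if not pd.api.types.is_numeric_dtype(frame[y_column]):
        raise InvalidEncodingError(
            f"Cannot aggregate column '{y_column}' with aggregation function "
            "'Avg' because the type cannot be aggregated over."
        )
    # Select the column first so unrelated string columns are never averaged.
    return frame.groupby(x_column)[y_column].mean()


# Illustrative data only; 'temp_diff' is the column named in the log.
records = pd.DataFrame(
    {
        "date": ["6/1/13", "6/1/13", "6/2/13"],
        "temp_diff": [10, 20, 30],
        "label": ["str", "str", "str"],
    }
)

averages = average_by(records, "date", "temp_diff")
```

Calling `average_by(records, "date", "label")` raises the stand-in error, because `label` holds strings.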

@dosubot

dosubot bot commented Sep 20, 2023

🤖

Thank you for bringing this to our attention. The error message suggests that the test is failing due to an InvalidEncodingError when trying to aggregate the 'temp_diff' column with the 'Avg' function. This could be due to changes in the way Pandas 2.0.0 handles aggregation functions.

The test_graphbase_get_x_y_values test is designed to verify the functionality of the _get_x_y_values method of the GraphBase class. This method is used to extract x and y values from a DataFrame based on the provided encoding. If this test is failing, it means that the _get_x_y_values method is not correctly extracting x and y values from the DataFrame based on the provided encoding, or it is not correctly raising an InvalidEncodingError when it should.

To debug this issue, we need to understand the specific changes in Pandas 2.0.0 that could be causing this behavior. Unfortunately, I couldn't find specific details about these changes within the repository. However, I would recommend checking the release notes or changelog for Pandas 2.0.0 to see if there are any changes related to aggregation functions that could be causing this issue.

Once we have more information about the specific changes in Pandas 2.0.0, we can update the _get_x_y_values method or the test itself to ensure compatibility with the new version of Pandas.

In the meantime, pinning the version of Pandas to <2.0.0 is a good workaround to prevent the CI from failing. I understand that this is not a long-term solution, and I appreciate your patience as we work to resolve this issue.
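
The pin itself belongs in the package's dependency metadata. As a complementary sketch, a small check like the one below could make the constraint explicit in a test or startup path. `pandas_major` and `satisfies_pin` are hypothetical helper names, not part of sparkmagic, and the version strings are examples only.

```python
def pandas_major(version):
    """Return the major component of a version string such as '2.0.0'."""
    return int(version.split(".")[0])


def satisfies_pin(version, max_major_exclusive=2):
    """True when the version falls under a '<2.0.0' style pin."""
    return pandas_major(version) < max_major_exclusive


# Example version strings, chosen only to show both outcomes.
checked = {v: satisfies_pin(v) for v in ("1.5.3", "2.0.0", "2.1.4")}
```

In a real environment the installed version could be read with `importlib.metadata.version("pandas")` and passed to `satisfies_pin`.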


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@devstein
Collaborator Author

devstein commented Sep 20, 2023

@dosu-bot Here are changes from Pandas 2.0.0

Removal of prior version deprecations/changes
Removed Int64Index, UInt64Index and Float64Index. See also [here] for more information 
Removed deprecated Timestamp.freq, Timestamp.freqstr and argument freq from the [Timestamp] constructor and [Timestamp.fromordinal] 
Removed deprecated CategoricalBlock, Block.is_categorical, require datetime64 and timedelta64 values to be wrapped in DatetimeArray or TimedeltaArray before passing to Block.make_block_same_class, require DatetimeTZBlock.values to have the correct ndim when passing to the BlockManager constructor, and removed the “fastpath” keyword from the SingleBlockManager constructor 
Removed deprecated global option use_inf_as_null in favor of use_inf_as_na 
Removed deprecated module pandas.core.index 
Removed deprecated alias pandas.core.tools.datetimes.to_time, import the function directly from pandas.core.tools.times instead 
Removed deprecated alias pandas.io.json.json_normalize, import the function directly from pandas.json_normalize instead 
Removed deprecated Categorical.to_dense, use np.asarray instead 
Removed deprecated Categorical.take_nd 
Removed deprecated Categorical.mode, use Series.mode instead 
Removed deprecated Categorical.is_dtype_equal and CategoricalIndex.is_dtype_equal 
Removed deprecated CategoricalIndex.take_nd 
Removed deprecated Index.is_type_compatible 
Removed deprecated Index.is_mixed, check index.inferred_type directly instead 
Removed deprecated pandas.api.types.is_categorical; use [pandas.api.types.is_categorical_dtype] instead 
Removed deprecated Index.asi8 
Enforced deprecation changing behavior when passing datetime64[ns] dtype data and timezone-aware dtype to [Series], interpreting the values as wall-times instead of UTC times, matching [DatetimeIndex] behavior 
Enforced deprecation changing behavior when applying a numpy ufunc on multiple non-aligned [DataFrame] that will now align the inputs first
Removed deprecated DataFrame._AXIS_NUMBERS, DataFrame._AXIS_NAMES, Series._AXIS_NUMBERS, Series._AXIS_NAMES 
Removed deprecated Index.to_native_types, use obj.astype instead 
Removed deprecated Series.iteritems, DataFrame.iteritems, use obj.items instead 
Removed deprecated DataFrame.lookup 
Removed deprecated Series.append, DataFrame.append, use [concat] instead 
Removed deprecated Series.iteritems, DataFrame.iteritems and HDFStore.iteritems use obj.items instead 
Removed deprecated DatetimeIndex.union_many 
Removed deprecated weekofyear and week attributes of DatetimeArray, [DatetimeIndex] and dt accessor in favor of isocalendar.week 
Removed deprecated RangeIndex._start, RangeIndex._stop, RangeIndex._step, use start, stop, step instead 
Removed deprecated DatetimeIndex.to_perioddelta, Use dtindex - dtindex.to_period.to_timestamp instead 
Removed deprecated Styler.hide_index and Styler.hide_columns 
Removed deprecated Styler.set_na_rep and Styler.set_precision 
Removed deprecated Styler.where 
Removed deprecated Styler.render 
Removed deprecated argument col_space in [DataFrame.to_latex] 
Removed deprecated argument null_color in [Styler.highlight_null] 
Removed deprecated argument check_less_precise in testing.assert_frame_equal, testing.assert_extension_array_equal, testing.assert_series_equal, testing.assert_index_equal 
Removed deprecated null_counts argument in [DataFrame.info]. Use show_counts instead 
Removed deprecated Index.is_monotonic, and Series.is_monotonic; use obj.is_monotonic_increasing instead 
Removed deprecated Index.is_all_dates 
Enforced deprecation disallowing passing a timezone-aware [Timestamp] and dtype="datetime64[ns]" to [Series] or [DataFrame] constructors 
Enforced deprecation disallowing passing a sequence of timezone-aware values and dtype="datetime64[ns]" to [Series] or [DataFrame] constructors
Enforced deprecation disallowing numpy.ma.mrecords.MaskedRecords in the [DataFrame] constructor; pass "{name: data[name] for name in data.dtype.names}" instead
Enforced deprecation disallowing unit-less “datetime64” dtype in [Series.astype] and [DataFrame.astype] 
Enforced deprecation disallowing using .astype to convert a datetime64[ns] [Series], [DataFrame], or [DatetimeIndex] to timezone-aware dtype, use obj.tz_localize or ser.dt.tz_localize instead 
Enforced deprecation disallowing using .astype to convert a timezone-aware [Series], [DataFrame], or [DatetimeIndex] to timezone-naive datetime64[ns] dtype, use obj.tz_localize or obj.tz_convert.tz_localize instead 
Enforced deprecation disallowing passing non boolean argument to sort in [concat] 
Removed Date parser functions parse_date_time, parse_date_fields, parse_all_fields and generic_parser 
Removed argument index from the core.arrays.SparseArray constructor 
Remove argument squeeze from [DataFrame.groupby] and [Series.groupby] 
Removed deprecated apply, apply_index, __call__, onOffset, and isAnchored attributes from DateOffset 
Removed keep_tz argument in [DatetimeIndex.to_series] 
Remove arguments names and dtype from [Index.copy] and levels and codes from [MultiIndex.copy] 
Remove argument inplace from [MultiIndex.set_levels] and [MultiIndex.set_codes] 
Removed arguments verbose and encoding from [DataFrame.to_excel] and [Series.to_excel] 
Removed argument line_terminator from [DataFrame.to_csv] and [Series.to_csv], use lineterminator instead 
Removed argument inplace from [DataFrame.set_axis] and [Series.set_axis], use obj = obj.set_axis instead 
Disallow passing positional arguments to [MultiIndex.set_levels] and [MultiIndex.set_codes] 
Disallow parsing to Timedelta strings with components with units “Y”, “y”, or “M”, as these do not represent unambiguous durations 
Removed MultiIndex.is_lexsorted and MultiIndex.lexsort_depth 
Removed argument how from PeriodIndex.astype, use [PeriodIndex.to_timestamp] instead 
Removed argument try_cast from [DataFrame.mask], [DataFrame.where], [Series.mask] and [Series.where] 
Removed argument tz from [Period.to_timestamp], use obj.to_timestamp.tz_localize instead 
Removed argument sort_columns in [DataFrame.plot] and [Series.plot] 
Removed argument is_copy from [DataFrame.take] and [Series.take] 
Removed argument kind from [Index.get_slice_bound], [Index.slice_indexer] and [Index.slice_locs] 
Removed arguments prefix, squeeze, error_bad_lines and warn_bad_lines from [read_csv] 
Removed argument squeeze from [read_excel]
Removed argument datetime_is_numeric from [DataFrame.describe] and [Series.describe] as datetime data will always be summarized as numeric data 
Disallow passing list key to [Series.xs] and [DataFrame.xs], pass a tuple instead 
Disallow subclass-specific keywords in the [Index] constructor
Removed argument inplace from Categorical.remove_unused_categories 
Disallow passing non-round floats to [Timestamp] with unit="M" or unit="Y" 
Remove keywords convert_float and mangle_dupe_cols from [read_excel] 
Remove keyword mangle_dupe_cols from [read_csv] and [read_table] 
Removed errors keyword from [DataFrame.where], [Series.where], [DataFrame.mask] and [Series.mask] 
Disallow passing non-keyword arguments to [read_excel] except io and sheet_name 
Disallow passing non-keyword arguments to [DataFrame.drop] and [Series.drop] except labels 
Disallow passing non-keyword arguments to [DataFrame.fillna] and [Series.fillna] except value 
Disallow passing non-keyword arguments to StringMethods.split and StringMethods.rsplit except for pat 
Disallow passing non-keyword arguments to [DataFrame.set_index] except keys 
Disallow passing non-keyword arguments to Resampler.interpolate except method 
Disallow passing non-keyword arguments to [DataFrame.reset_index] and [Series.reset_index] except level 
Disallow passing non-keyword arguments to [DataFrame.dropna] and [Series.dropna] 
Disallow passing non-keyword arguments to ExtensionArray.argsort 
Disallow passing non-keyword arguments to Categorical.sort_values 
Disallow passing non-keyword arguments to [Index.drop_duplicates] and [Series.drop_duplicates] 
Disallow passing non-keyword arguments to [DataFrame.drop_duplicates] except for subset 
Disallow passing non-keyword arguments to [DataFrame.sort_index] and [Series.sort_index] 
Disallow passing non-keyword arguments to [DataFrame.interpolate] and [Series.interpolate] except for method 
Disallow passing non-keyword arguments to [DataFrame.any] and [Series.any] 
Disallow passing non-keyword arguments to [Index.set_names] except for names 
Disallow passing non-keyword arguments to [Index.join] except for other 
Disallow passing non-keyword arguments to [concat] except for objs 
Disallow passing non-keyword arguments to [pivot] except for data 
Disallow passing non-keyword arguments to [DataFrame.pivot] 
Disallow passing non-keyword arguments to [read_html] except for io 
Disallow passing non-keyword arguments to [read_json] except for path_or_buf 
Disallow passing non-keyword arguments to [read_sas] except for filepath_or_buffer 
Disallow passing non-keyword arguments to [read_stata] except for filepath_or_buffer 
Disallow passing non-keyword arguments to [read_csv] except filepath_or_buffer 
Disallow passing non-keyword arguments to [read_table] except filepath_or_buffer 
Disallow passing non-keyword arguments to [read_fwf] except filepath_or_buffer 
Disallow passing non-keyword arguments to [read_xml] except for path_or_buffer 
Disallow passing non-keyword arguments to [Series.mask] and [DataFrame.mask] except cond and other 
Disallow passing non-keyword arguments to [DataFrame.to_stata] except for path 
Disallow passing non-keyword arguments to [DataFrame.where] and [Series.where] except for cond and other 
Disallow passing non-keyword arguments to [Series.set_axis] and [DataFrame.set_axis] except for labels 
Disallow passing non-keyword arguments to [Series.rename_axis] and [DataFrame.rename_axis] except for mapper 
Disallow passing non-keyword arguments to [Series.clip] and [DataFrame.clip] except lower and upper 
Disallow passing non-keyword arguments to [Series.bfill], [Series.ffill], [DataFrame.bfill] and [DataFrame.ffill] 
Disallow passing non-keyword arguments to [DataFrame.replace], [Series.replace] except for to_replace and value 
Disallow passing non-keyword arguments to [DataFrame.sort_values] except for by 
Disallow passing non-keyword arguments to [Series.sort_values] 
Disallow passing non-keyword arguments to [DataFrame.reindex] except for labels 
Disallow [Index.reindex] with non-unique [Index] objects 
Disallowed constructing [Categorical] with scalar data 
Disallowed constructing [CategoricalIndex] without passing data 
Removed Rolling.validate, Expanding.validate, and ExponentialMovingWindow.validate 
Removed Rolling.win_type returning "freq" 
Removed Rolling.is_datetimelike 
Removed the level keyword in [DataFrame] and [Series] aggregations; use groupby instead 
Removed deprecated Timedelta.delta, Timedelta.is_populated, and Timedelta.freq 
Removed deprecated NaT.freq 
Removed deprecated Categorical.replace, use [Series.replace] instead 
Removed the numeric_only keyword from Categorical.min and Categorical.max in favor of skipna 
Changed behavior of [DataFrame.median] and [DataFrame.mean] with numeric_only=None to not exclude datetime-like columns THIS NOTE WILL BE IRRELEVANT ONCE numeric_only=None DEPRECATION IS ENFORCED 
Removed is_extension_type in favor of is_extension_array_dtype 
Removed ExponentialMovingWindow.vol
Removed Index.get_value and Index.set_value 
Removed Series.slice_shift and DataFrame.slice_shift 
Remove DataFrameGroupBy.pad and DataFrameGroupBy.backfill 
Remove numpy argument from [read_json] 
Disallow passing abbreviations for orient in [DataFrame.to_dict] 
Disallow partial slicing on a non-monotonic [DatetimeIndex] with keys which are not in Index. This now raises a KeyError
Removed get_offset in favor of to_offset 
Removed the warn keyword in [infer_freq] 
Removed the include_start and include_end arguments in [DataFrame.between_time] in favor of inclusive 
Removed the closed argument in [date_range] and [bdate_range] in favor of inclusive argument 
Removed the center keyword in [DataFrame.expanding] 
Removed the truediv keyword from [eval] 
Removed the method and tolerance arguments in [Index.get_loc]. Use index.get_indexer instead 
Removed the pandas.datetime submodule 
Removed the pandas.np submodule 
Removed pandas.util.testing in favor of pandas.testing 
Removed Series.str.__iter__ 
Removed pandas.SparseArray in favor of [arrays.SparseArray] 
Removed pandas.SparseSeries and pandas.SparseDataFrame, including pickle support. 
Enforced disallowing passing an integer fill_value to [DataFrame.shift] and Series.shift with datetime64, timedelta64, or period dtypes
Enforced disallowing a string column label into times in [DataFrame.ewm] 
Enforced disallowing passing True and False into inclusive in [Series.between] in favor of "both" and "neither" respectively 
Enforced disallowing using usecols with out of bounds indices for read_csv with engine="c" 
Enforced disallowing the use of **kwargs in [ExcelWriter]; use the keyword argument engine_kwargs instead 
Enforced disallowing a tuple of column labels into DataFrameGroupBy.__getitem__ 
Enforced disallowing missing labels when indexing with a sequence of labels on a level of a [MultiIndex]. This now raises a KeyError 
Enforced disallowing setting values with .loc using a positional slice. Use .loc with labels or .iloc with positions instead 
Enforced disallowing positional indexing with a float key even if that key is a round number, manually cast to integer instead 
Enforced disallowing using a [DataFrame] indexer with .iloc, use .loc instead for automatic alignment 
Enforced disallowing set or dict indexers in __getitem__ and __setitem__ methods 
Enforced disallowing indexing on a [Index] or positional indexing on a [Series] producing multi-dimensional objects e.g. obj[:, None], convert to numpy before indexing instead 
Enforced disallowing dict or set objects in suffixes in [merge] 
Enforced disallowing [merge] to produce duplicated columns through the suffixes keyword and already existing columns 
Enforced disallowing using [merge] or join on a different number of levels 
Enforced disallowing value_name argument in [DataFrame.melt] to match an element in the [DataFrame] columns 
Enforced disallowing passing showindex into **kwargs in [DataFrame.to_markdown] and [Series.to_markdown] in favor of index 
Removed setting Categorical._codes directly 
Removed setting Categorical.categories directly 
Removed argument inplace from Categorical.add_categories, Categorical.remove_categories, Categorical.set_categories, Categorical.rename_categories, Categorical.reorder_categories, Categorical.set_ordered, Categorical.as_ordered, Categorical.as_unordered 
Enforced Rolling.count with min_periods=None to default to the size of the window 
Renamed fname to path in [DataFrame.to_parquet], [DataFrame.to_stata] and [DataFrame.to_feather] 
Enforced disallowing indexing a [Series] with a single item list with a slice. Either convert the list to tuple, or pass the slice directly instead
Changed behavior indexing on a [DataFrame] with a [DatetimeIndex] index using a string indexer, previously this operated as a slice on rows, now it operates like any other column key; use frame.loc[key] for the old behavior 
Enforced the display.max_colwidth option to not accept negative integers 
Removed the display.column_space option in favor of df.to_string 
Removed the deprecated method mad from pandas classes 
Removed the deprecated method tshift from pandas classes 
Changed behavior of empty data passed into [Series]; the default dtype will be object instead of float64 
Changed the behavior of DatetimeIndex.union, DatetimeIndex.intersection, and DatetimeIndex.symmetric_difference with mismatched timezones to convert to UTC instead of casting to object dtype 
Changed the behavior of [to_datetime] with argument “now” with utc=False to match Timestamp 
Changed the behavior of indexing on a timezone-aware [DatetimeIndex] with a timezone-naive datetime object or vice-versa; these now behave like any other non-comparable type by raising KeyError 
Changed the behavior of [Index.reindex], [Series.reindex], and [DataFrame.reindex] with a datetime64 dtype and a datetime.date object for fill_value; these are no longer considered equivalent to datetime.datetime objects so the reindex casts to object dtype 
Changed behavior of SparseArray.astype when given a dtype that is not explicitly SparseDtype, cast to the exact requested dtype rather than silently using a SparseDtype instead 
Changed behavior of [Index.ravel] to return a view on the original [Index] instead of a np.ndarray 
Changed behavior of [Series.to_frame] and [Index.to_frame] with explicit name=None to use None for the column name instead of the index’s name or default 0 
Changed behavior of [concat] with one array of bool-dtype and another of integer dtype, this now returns object dtype instead of integer dtype; explicitly cast the bool object to integer before concatenating to get the old behavior 
Changed behavior of [DataFrame] constructor given floating-point data and an integer dtype, when the data cannot be cast losslessly, the floating point dtype is retained, matching [Series] behavior 
Changed behavior of [Index] constructor when given a np.ndarray with object-dtype containing numeric entries; this now retains object dtype rather than inferring a numeric dtype, consistent with [Series] behavior 
Changed behavior of Index.__and__, Index.__or__ and Index.__xor__ to behave as logical operations instead of aliases for set operations
Changed behavior of [DataFrame] constructor when passed a list whose first element is a [Categorical], this now treats the elements as rows casting to object dtype, consistent with behavior for other types 
Changed behavior of [DataFrame] constructor when passed a dtype that the data cannot be cast to; it now raises instead of silently ignoring the dtype
Changed the behavior of [Series] constructor, it will no longer infer a datetime64 or timedelta64 dtype from string entries 
Changed behavior of [Timestamp] constructor with a np.datetime64 object and a tz passed to interpret the input as a wall-time as opposed to a UTC time 
Changed behavior of [Timestamp.utcfromtimestamp] to return a timezone-aware object satisfying Timestamp.utcfromtimestamp.timestamp == val 
Changed behavior of [Index] constructor when passed a SparseArray or SparseDtype to retain that dtype instead of casting to numpy.ndarray 
Changed behavior of setitem-like operations on an object with [DatetimeTZDtype] when using a value with a non-matching timezone, the value will be cast to the object’s timezone instead of casting both to object-dtype
Changed behavior of [Index], [Series], [DataFrame] constructors with floating-dtype data and a [DatetimeTZDtype], the data are now interpreted as UTC-times instead of wall-times, consistent with how integer-dtype data are treated 
Changed behavior of [Series] and [DataFrame] constructors with integer dtype and floating-point data containing NaN, this now raises IntCastingNaNError 
Changed behavior of [Series] and [DataFrame] constructors with an integer dtype and values that are too large to losslessly cast to this dtype, this now raises ValueError 
Changed behavior of [Series] and [DataFrame] constructors with an integer dtype and values having either datetime64 or timedelta64 dtypes, this now raises TypeError, use values.view instead 
Removed the deprecated base and loffset arguments from [pandas.DataFrame.resample], [pandas.Series.resample] and [pandas.Grouper]. Use offset or origin instead 
Changed behavior of [Series.fillna] and [DataFrame.fillna] with timedelta64[ns] dtype and an incompatible fill_value; this now casts to object dtype instead of raising, consistent with the behavior with other dtypes 
Change the default argument of regex for [Series.str.replace] from True to False. Additionally, a single character pat with regex=True is now treated as a regular expression instead of a string literal. 
Changed behavior of [DataFrame.any] and [DataFrame.all] with bool_only=True; object-dtype columns with all-bool values will no longer be included, manually cast to bool dtype first 
Changed behavior of [DataFrame.max], [DataFrame.min], [DataFrame.mean], [DataFrame.median], [DataFrame.skew], [DataFrame.kurt] with axis=None to return a scalar applying the aggregation across both axes 
Changed behavior of comparison of a [Timestamp] with a datetime.date object; these now compare as un-equal and raise on inequality comparisons, matching the datetime.datetime behavior 
Changed behavior of comparison of NaT with a datetime.date object; these now raise on inequality comparisons 
Enforced deprecation of silently dropping columns that raised a TypeError in [Series.transform] and [DataFrame.transform] when used with a list or dictionary 
Changed behavior of [DataFrame.apply] with list-like so that any partial failure will raise an error 
Changed behaviour of [DataFrame.to_latex] to now use the Styler implementation via [Styler.to_latex] 
Changed behavior of Series.__setitem__ with an integer key and a Float64Index when the key is not present in the index; previously we treated the key as positional, now we treat it as a label, consistent with Series.__getitem__ behavior
Removed na_sentinel argument from [factorize], [Index.factorize], and [ExtensionArray.factorize] 
Changed behavior of [Series.diff] and [DataFrame.diff] with ExtensionDtype dtypes whose arrays do not implement diff, these now raise TypeError rather than casting to numpy 
Enforced deprecation of calling numpy “ufunc”s on [DataFrame] with method="outer"; this now raises NotImplementedError 
Enforced deprecation disallowing passing numeric_only=True to [Series] reductions with non-numeric dtype
Changed behavior of [DataFrameGroupBy.apply] and [SeriesGroupBy.apply] so that group_keys is respected even if a transformer is detected 
Comparisons between a [DataFrame] and a [Series] where the frame’s columns do not match the series’s index raise ValueError instead of automatically aligning, do left, right = left.align before comparing 
Enforced deprecation numeric_only=None in DataFrame reductions that would silently drop columns that raised; numeric_only now defaults to False
Changed default of numeric_only to False in all DataFrame methods with that argument 
Changed default of numeric_only to False in [Series.rank] 
Enforced deprecation of silently dropping nuisance columns in groupby and resample operations when numeric_only=False 
Enforced deprecation of silently dropping nuisance columns in Rolling, Expanding, and ExponentialMovingWindow ops. This will now raise a [errors.DataError] 
Changed behavior in setting values with df.loc[:, foo] = bar or df.iloc[:, foo] = bar, these now always attempt to set values inplace before falling back to casting 
Changed default of numeric_only in various DataFrameGroupBy methods; all methods now default to numeric_only=False 
Changed default of numeric_only to False in Resampler methods 
Using the method [DataFrameGroupBy.transform] with a callable that returns DataFrames will align to the input’s index 
When providing a list of columns of length one to [DataFrame.groupby], the keys that are returned by iterating over the resulting DataFrameGroupBy object will now be tuples of length one 
Removed deprecated methods ExcelWriter.write_cells, ExcelWriter.save, ExcelWriter.cur_sheet, ExcelWriter.handles, ExcelWriter.path 
The [ExcelWriter] attribute book can no longer be set; it is still available to be accessed and mutated 
Removed unused *args and **kwargs in Rolling, Expanding, and ExponentialMovingWindow ops 
Removed the deprecated argument line_terminator from [DataFrame.to_csv] 
Removed the deprecated argument label from [lreshape] 
Arguments after expr in [DataFrame.eval] and [DataFrame.query] are keyword-only 
Removed Index._get_attributes_dict 
Removed Series.__array_wrap__ 
Changed behavior of [DataFrame.value_counts] to return a [Series] with [MultiIndex] for any list-like but an [Index] for a single label 
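
Among the entries above, the numeric_only changes ("Changed default of numeric_only in various DataFrameGroupBy methods" and "Enforced deprecation of silently dropping nuisance columns in groupby and resample operations") look like the closest match for the traceback, where how = 'mean' reaches an object-dtype array of strings. Below is a minimal sketch of that behavior, assuming a frame with one string column. The column names `date` and `mystr` are made up for illustration, and this is not sparkmagic's actual code.

```python
import pandas as pd

# Illustrative frame: one numeric column and one string ("nuisance") column.
df = pd.DataFrame(
    {
        "date": ["6/1/13", "6/1/13", "6/2/13"],
        "temp_diff": [10, 20, 30],
        "mystr": ["a", "b", "c"],
    }
)


def whole_frame_mean_raises(frame):
    """Return True if averaging every column per group raises TypeError.

    Expected to be True on pandas >= 2.0, where the string column is no
    longer silently dropped; on 1.x the column was dropped instead.
    """
    try:
        frame.groupby("date").mean()
    except TypeError:
        return True
    return False


# Two ways to aggregate that do not depend on nuisance columns being dropped:
# select the column before aggregating, or ask for numeric columns only.
by_column = df.groupby("date")["temp_diff"].mean()
numeric_only = df.groupby("date").mean(numeric_only=True)
```

Selecting the column first keeps the result independent of the numeric_only default, which is why it may be the safer choice when a library has to support both pandas 1.x and 2.x.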

@dosubot

dosubot bot commented Sep 21, 2023

🤖

Hello,

Thank you for providing the detailed error logs and context. The test_graphbase_get_x_y_values test in the sparkmagic codebase is designed to test the functionality of the get_x_y_values method in the GraphBase class. This method is responsible for aggregating data based on the provided aggregation function and column names.

The error you're encountering appeared with the upgrade to Pandas 2.0.0. The 'temp_diff' column is being aggregated using the 'Avg' function, and the traceback shows pandas falling back to a pure-Python mean over object-dtype values (an array of 'str' strings). The resulting InvalidEncodingError suggests that a data type reaching the aggregation function cannot be averaged.

In the Pandas 2.0.0 release, there were several changes that could potentially affect this operation. The pandas.NA value, which represents scalar missing values, predates 2.0.0 (it was introduced in pandas 1.0), so it is unlikely to be the change at fault, but missing values in the 'temp_diff' column could still affect the aggregation.

Another change that could affect this operation is in groupby itself. According to the release notes quoted above, numeric_only now defaults to False in DataFrameGroupBy methods, and nuisance columns are no longer silently dropped in groupby operations, so a non-numeric column that was previously ignored can now cause the aggregation to raise.

To resolve this issue, you could try the following:

  1. Check if the 'temp_diff' column contains any missing values. If it does, you might need to handle these missing values before performing the aggregation operation.

  2. Try using a different aggregation function to see if the issue persists. This could help determine if the issue is specific to the 'Avg' function or if it's a more general issue with the groupby operation.

  3. If the issue persists, you might need to downgrade to a previous version of Pandas until a fix is available.
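
Steps 1 and 2 above can be sketched as a pair of small diagnostics. This is only a starting point under stated assumptions: `describe_column` and `try_aggregations` are hypothetical helper names, and the sample frame is made-up data using the 'temp_diff' column name from the log.

```python
import pandas as pd


def describe_column(frame, column):
    """Report the dtype, missing-value count, and numeric-ness of a column."""
    series = frame[column]
    return {
        "dtype": str(series.dtype),
        "missing": int(series.isna().sum()),
        "numeric": bool(pd.api.types.is_numeric_dtype(series)),
    }


def try_aggregations(frame, by, column, functions=("mean", "sum", "min", "max")):
    """Return {function name: True if the grouped aggregation succeeded}."""
    outcome = {}
    for name in functions:
        try:
            frame.groupby(by)[column].agg(name)
            outcome[name] = True
        except (TypeError, ValueError):
            outcome[name] = False
    return outcome


# Made-up sample with one missing value in 'temp_diff'.
sample = pd.DataFrame(
    {"date": ["6/1/13", "6/1/13", "6/2/13"], "temp_diff": [10.0, None, 30.0]}
)
report = describe_column(sample, "temp_diff")
results = try_aggregations(sample, "date", "temp_diff")
```

If `report["numeric"]` is False for the column being averaged, the dtype rather than missing values is the more likely culprit.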

I hope this helps! If you have any further questions or need more assistance, please don't hesitate to ask.

Best regards,
Dosu

@ektasharma4

ektasharma4 commented Apr 1, 2024

From here #861 (comment), @devstein @ljubon I'm wondering if there's any progress on supporting pandas 2.0, or if an ETA is available?

@frbelotto

pandas 2.0 has been out for more than a year now.

@tleonhardt

It looks like this was resolved on GitHub about 3 weeks ago. Would it be possible to get a new release out?

@devstein
Collaborator Author

Done in 0.22.0! Thank you everyone for helping get this over the finish line.

Please let me know if you run into issues.
