Change default string storage from "python" to "pyarrow" (if installed) for NA-variant of StringDtype #60287

Open
jorisvandenbossche opened this issue Nov 12, 2024 · 0 comments
Labels
API Design NA - MaskedArrays Related to pd.NA and nullable extension arrays Strings String extension data type and string data
@jorisvandenbossche
Member

Historically, the default value for the string storage (globally configurable through pd.options.mode.string_storage) of StringDtype was "python", and users needed to explicitly ask for "pyarrow". For example:

>>> ser = pd.Series(["a", "b"], dtype="string")
>>> ser.dtype
string[python]

and this is still the behaviour on main.

For the new NaN-variant of StringDtype, however, we implemented the default string storage option "auto", meaning "use pyarrow if installed, otherwise use python". So on a system with pyarrow installed:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"], dtype="str")
>>> ser.dtype.storage
'pyarrow'
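
The "use pyarrow if installed, otherwise use python" rule can be sketched with the standard library alone. This is only an illustration: `pyarrow_installed` and `resolve_auto` are hypothetical helpers, not pandas API, and pandas' own check may be stricter (for example, it may also require a minimum pyarrow version).

```python
import importlib.util


def pyarrow_installed() -> bool:
    """Hypothetical helper: True if a module named 'pyarrow' can be found."""
    return importlib.util.find_spec("pyarrow") is not None


def resolve_auto(has_pyarrow: bool) -> str:
    """Map the "auto" setting to a concrete storage backend."""
    return "pyarrow" if has_pyarrow else "python"


# The result depends on the environment this runs in.
print(resolve_auto(pyarrow_installed()))
```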

Essentially we interpret the default string_storage option setting of "auto" differently for the NaN vs NA variant of the string dtype, which you can see in the code here:

    if storage is None:
        if na_value is not libmissing.NA:
            storage = get_option("mode.string_storage")
            if storage == "auto":
                if HAS_PYARROW:
                    storage = "pyarrow"
                else:
                    storage = "python"
        else:
            storage = get_option("mode.string_storage")
            if storage == "auto":
                storage = "python"
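
To make the asymmetry easier to see, the branch above can be modelled as a standalone function. This is a simplified sketch, not the pandas implementation: `resolve_storage` and its parameters are hypothetical, with `option` standing in for `get_option("mode.string_storage")`, `na_is_pd_na` for the `na_value is libmissing.NA` check, and `has_pyarrow` for `HAS_PYARROW`.

```python
from typing import Optional


def resolve_storage(
    storage: Optional[str],
    option: str,
    na_is_pd_na: bool,
    has_pyarrow: bool,
) -> str:
    """Model of how a missing `storage` argument is filled in today."""
    if storage is not None:
        # An explicitly requested storage is never overridden.
        return storage
    storage = option
    if storage == "auto":
        if na_is_pd_na:
            # NA-variant: "auto" always means "python".
            storage = "python"
        else:
            # NaN-variant: "auto" prefers pyarrow when available.
            storage = "pyarrow" if has_pyarrow else "python"
    return storage


for na_is_pd_na in (False, True):
    for has_pyarrow in (False, True):
        result = resolve_storage(None, "auto", na_is_pd_na, has_pyarrow)
        variant = "NA" if na_is_pd_na else "NaN"
        print(f"{variant}-variant, pyarrow installed={has_pyarrow}: {result}")
```

In this model, only the NaN-variant ever resolves "auto" to "pyarrow"; an explicit `storage` argument or a non-"auto" option value behaves the same for both variants.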


Proposal: I think it makes sense to also switch to "pyarrow" as the default string storage (if installed) for the nullable StringDtype. This is somewhat of a breaking change (although mostly for the dtype object itself, because behaviour-wise, string operations should hardly differ between the two backends), so I would keep this for 3.0 and properly document it in the whatsnew notes.
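
As a rough way to gauge the scope of the proposed change, the sketch below compares the current and proposed meaning of "auto" for each combination of variant and pyarrow availability. Both functions are hypothetical models of the behaviour described in this issue, not pandas code, and they only cover the case where neither `storage` nor the `mode.string_storage` option has been set explicitly.

```python
from itertools import product


def current_default(na_variant: str, has_pyarrow: bool) -> str:
    """Current behaviour: only the NaN-variant picks pyarrow under "auto"."""
    if na_variant == "NaN" and has_pyarrow:
        return "pyarrow"
    return "python"


def proposed_default(na_variant: str, has_pyarrow: bool) -> str:
    """Proposed behaviour: both variants pick pyarrow when it is installed."""
    return "pyarrow" if has_pyarrow else "python"


changed = [
    (variant, has_pyarrow)
    for variant, has_pyarrow in product(("NaN", "NA"), (False, True))
    if current_default(variant, has_pyarrow) != proposed_default(variant, has_pyarrow)
]
print(changed)  # [('NA', True)]
```

Under these assumptions, the only combination whose default changes is the NA-variant on a system with pyarrow installed; code that sets the option or passes `storage` explicitly would be unaffected.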

@jorisvandenbossche jorisvandenbossche added API Design Strings String extension data type and string data NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Nov 12, 2024
@jorisvandenbossche jorisvandenbossche added this to the 3.0 milestone Nov 12, 2024