bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

NickCrews · 2024-11-02T03:28:31Z

What happened?

import ibis

t1 = ibis.table({"k": "!int32"})
t2 = ibis.table({"k": "int32"})
j = t1.anti_join(t2, "k")
j.schema()
# ibis.Schema {
#   k  int32
# }

I would expect this to give !int32, the exact same schema as t1. If I do the equivalent, t1.filter(t1.k.notin(t2.k)), then I get !int32 as expected.

I assume this is due to assumptions that after joins, any non-null columns can now be "full of holes" and thus nullable. This assumption is correct for the other other join types but not semi and anti joins.

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

cpcloud · 2024-11-02T12:01:04Z

This isn't a bug, it's a particular choice: lower maintenance overhead (because we don't have to special case any particular join types) at the cost of lower type-specificity.

Why should we write additional code to make the type more precise? What is lost and what is gained by doing that?

NickCrews · 2024-11-04T19:43:20Z

I wouldn't say nullable is lower-specificity that non-nullable, I would say they are siblings of each other, and often incompatible. eg when doing a ibis.union(), we can't combine a non-nullable and nullable column together. Consider

nullable = ibis.table(schema={"x": "int32"})
non = ibis.table(schema={"x": "!int32"})

If I could use nullable everywhere that I could use non, then I would agree that the current behavior is simply "less precise". But since I cannot do this, I find the current behavior more of a problem.

I totally see how it is going to be more maintenance overhead though. It may not be worth it, especially when the workaround is so easy.

Maybe we just close this as not planned for now, and if someone wants to take the initiative and fix it, then we can see how much complexity their PR adds, and make a decision then?

Just so you see, here is where I ran into the issue:

def insert(
    db: SQLBackend,
    table_name: str,
    obj: pd.DataFrame | ir.Table | list | dict,
    *,
    database: str | None = None,
    overwrite: bool = False,
    return_inserted: bool = False,
) -> None:
    """db.insert() with extra features.

    - If new table is missing columns, add them with NULL values.
    - Ensure correct column order to work around ibis issue #9249.
    """
    if not isinstance(obj, ir.Table):
        obj = ibis.memtable(obj)
    existing_schema = db.table(table_name).schema()
    obj = cast(obj, existing_schema)
    # ensure correct order, per https://github.com/ibis-project/ibis/issues/9249
    obj = obj.select(*existing_schema)

    def do():
        db.insert(table_name, obj, database=database, overwrite=overwrite)

    if not return_inserted:
        do()
        return None
    else:
        # TODO: ideally this would use RETURNING sql syntax,
        # but I don't want to reach into ibis guts to do that.
        pk = _get_primary_key(db, table_name)
        old = db.table(table_name).select(pk).view().cache()
        do()
        t = db.table(table_name)
        # avoid anti join: https://github.com/ibis-project/ibis/issues/10416
        return t.filter(t[pk].notin(old[pk]))

# tests:

import ibis
import pytest

from noatak import db


@pytest.fixture
def con():
    return ibis.duckdb.connect()


@pytest.fixture
def existing_table(con):
    con.raw_sql("CREATE TABLE test (id INT PRIMARY KEY, value INT)")
    con.insert("test", {"id": [1, 2], "value": [11, 12]})
    t = con.table("test")
    assert t.schema() == ibis.Schema({"id": "!int32", "value": "int32"})

def test_insert_returns(existing_table, con):
    to_insert = ibis.memtable({"value": [23, 24], "id": [3, 4]})
    inserted = db.insert(con, "test", to_insert, return_inserted=True)
    assert inserted.schema() == con.table("test").schema()
    result = sorted(inserted.value.execute())
    assert result == [23, 24]

    # Do it again to make sure we aren't using some stale value
    to_insert = ibis.memtable({"value": [25, 26], "id": [5, 6]})
    inserted = db.insert(con, "test", to_insert, return_inserted=True)
    assert inserted.schema() == con.table("test").schema()
    result = sorted(inserted.value.execute())
    assert result == [25, 26]

NickCrews added the bug Incorrect behavior inside of ibis label Nov 2, 2024

github-project-automation bot moved this to backlog in Ibis planning and roadmap Nov 2, 2024

github-project-automation bot added this to Ibis planning and roadmap Nov 2, 2024

cpcloud removed the bug Incorrect behavior inside of ibis label Nov 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

NickCrews commented Nov 2, 2024

cpcloud commented Nov 2, 2024

NickCrews commented Nov 4, 2024

bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

Comments

NickCrews commented Nov 2, 2024

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

cpcloud commented Nov 2, 2024

NickCrews commented Nov 4, 2024