Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: non-nullable columns are turned into nullable after anti_join() and semi_join() #10416

Open
1 task done
NickCrews opened this issue Nov 2, 2024 · 2 comments
Open
1 task done

Comments

@NickCrews
Copy link
Contributor

What happened?

import ibis

t1 = ibis.table({"k": "!int32"})
t2 = ibis.table({"k": "int32"})
j = t1.anti_join(t2, "k")
j.schema()
# ibis.Schema {
#   k  int32
# }

I would expect this to give !int32, the exact same schema as t1. If I do the equivalent, t1.filter(t1.k.notin(t2.k)), then I get !int32 as expected.

I assume this is due to assumptions that after joins, any non-null columns can now be "full of holes" and thus nullable. This assumption is correct for the other other join types but not semi and anti joins.

What version of ibis are you using?

main

What backend(s) are you using, if any?

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Nov 2, 2024
@cpcloud
Copy link
Member

cpcloud commented Nov 2, 2024

This isn't a bug, it's a particular choice: lower maintenance overhead (because we don't have to special case any particular join types) at the cost of lower type-specificity.

Why should we write additional code to make the type more precise? What is lost and what is gained by doing that?

@cpcloud cpcloud removed the bug Incorrect behavior inside of ibis label Nov 3, 2024
@NickCrews
Copy link
Contributor Author

I wouldn't say nullable is lower-specificity that non-nullable, I would say they are siblings of each other, and often incompatible. eg when doing a ibis.union(), we can't combine a non-nullable and nullable column together. Consider

nullable = ibis.table(schema={"x": "int32"})
non = ibis.table(schema={"x": "!int32"})

If I could use nullable everywhere that I could use non, then I would agree that the current behavior is simply "less precise". But since I cannot do this, I find the current behavior more of a problem.

I totally see how it is going to be more maintenance overhead though. It may not be worth it, especially when the workaround is so easy.

Maybe we just close this as not planned for now, and if someone wants to take the initiative and fix it, then we can see how much complexity their PR adds, and make a decision then?

Just so you see, here is where I ran into the issue:

def insert(
    db: SQLBackend,
    table_name: str,
    obj: pd.DataFrame | ir.Table | list | dict,
    *,
    database: str | None = None,
    overwrite: bool = False,
    return_inserted: bool = False,
) -> None:
    """db.insert() with extra features.

    - If new table is missing columns, add them with NULL values.
    - Ensure correct column order to work around ibis issue #9249.
    """
    if not isinstance(obj, ir.Table):
        obj = ibis.memtable(obj)
    existing_schema = db.table(table_name).schema()
    obj = cast(obj, existing_schema)
    # ensure correct order, per https://github.com/ibis-project/ibis/issues/9249
    obj = obj.select(*existing_schema)

    def do():
        db.insert(table_name, obj, database=database, overwrite=overwrite)

    if not return_inserted:
        do()
        return None
    else:
        # TODO: ideally this would use RETURNING sql syntax,
        # but I don't want to reach into ibis guts to do that.
        pk = _get_primary_key(db, table_name)
        old = db.table(table_name).select(pk).view().cache()
        do()
        t = db.table(table_name)
        # avoid anti join: https://github.com/ibis-project/ibis/issues/10416
        return t.filter(t[pk].notin(old[pk]))

# tests:

import ibis
import pytest

from noatak import db


@pytest.fixture
def con():
    return ibis.duckdb.connect()


@pytest.fixture
def existing_table(con):
    con.raw_sql("CREATE TABLE test (id INT PRIMARY KEY, value INT)")
    con.insert("test", {"id": [1, 2], "value": [11, 12]})
    t = con.table("test")
    assert t.schema() == ibis.Schema({"id": "!int32", "value": "int32"})

def test_insert_returns(existing_table, con):
    to_insert = ibis.memtable({"value": [23, 24], "id": [3, 4]})
    inserted = db.insert(con, "test", to_insert, return_inserted=True)
    assert inserted.schema() == con.table("test").schema()
    result = sorted(inserted.value.execute())
    assert result == [23, 24]

    # Do it again to make sure we aren't using some stale value
    to_insert = ibis.memtable({"value": [25, 26], "id": [5, 6]})
    inserted = db.insert(con, "test", to_insert, return_inserted=True)
    assert inserted.schema() == con.table("test").schema()
    result = sorted(inserted.value.execute())
    assert result == [25, 26]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: backlog
Development

No branches or pull requests

2 participants