Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792

Draft
wants to merge 20 commits into
base: main

Contributor

@xinlifoobar xinlifoobar commented Jun 4, 2024

Which issue does this PR close?

Closes #10720

Rationale for this change

What changes are included in this PR?

Are these changes tested?

# xinli @ arch-dev in ~/source/repos/datafusion/datafusion-cli on git:dev/xinli/hfstore o [12:16:09] 
$ ./target/debug/datafusion-cli
DataFusion CLI v38.0.0
> SELECT count(*) AS count
FROM 'hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet';
+-------+
| count |
+-------+
| 5     |
+-------+
1 row(s) fetched. 
Elapsed 2.469 seconds.

> create external table test stored as parquet location "hf://datasets/cais/mmlu/astronomy/";
0 row(s) fetched. 
Elapsed 1.398 seconds.

> select count(*) from test;
+----------+
| COUNT(*) |
+----------+
| 173      |
+----------+
1 row(s) fetched. 
Elapsed 1.199 seconds.

> select * from test limit 2
;
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| question                                                                             | subject   | choices                                                                                                                                                                                                                                                                                                                          | answer |
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| You cool a blackbody to half its original temperature. How does its spectrum change? | astronomy | [Power emitted is 1/16 times as high; peak emission wavelength is 1/2 as long., Power emitted is 1/4 times as high; peak emission wavelength is 2 times longer., Power emitted is 1/4 times as high; peak emission wavelength is 1/2 as long., Power emitted is 1/16 times as high; peak emission wavelength is 2 times longer.] | 3      |
| What drives differentiation?                                                         | astronomy | [Spontaneous emission from radioactive atoms., The minimization of gravitational potential energy., Thermally induced collisions., Plate tectonics.]                                                                                                                                                                             | 1      |
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
2 row(s) fetched. 
Elapsed 1.098 seconds.

Are there any user-facing changes?

@xinlifoobar xinlifoobar changed the title from [Draft][DONNOT MERGE] Impl for HF_Store to Feat: Implement hf:// / "hugging face" integration in datafusion-cli on Jun 7, 2024
@xinlifoobar xinlifoobar marked this pull request as ready for review June 7, 2024 04:15
@xinlifoobar
Contributor Author

Hi @alamb. I am still working on this: completing the unit tests and E2E tests and fixing code style issues. Could you please do a quick preliminary review of the ideas behind it? I did not find a simple way to implement this other than creating a wrapper ObjectStore impl on top of HttpStore.
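
To illustrate the kind of translation a wrapper over an HTTP store has to perform, here is a minimal, standard-library-only sketch that maps an `hf://` location to an HTTPS URL. It is not the code from this PR. The `HfPath` type and both function names are hypothetical, and the `resolve/<revision>/` URL layout and the `main` default revision are assumptions about how the Hub serves files; the PR may use a different endpoint.

```rust
/// Parsed pieces of an `hf://` URL (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct HfPath {
    repo_type: String, // e.g. "datasets"
    repo_id: String,   // e.g. "cais/mmlu"
    path: String,      // e.g. "astronomy/dev-00000-of-00001.parquet"
}

/// Split `hf://<repo_type>/<owner>/<name>/<path>` into its parts.
/// Returns `None` if the scheme or any repository component is missing.
fn parse_hf_url(url: &str) -> Option<HfPath> {
    let rest = url.strip_prefix("hf://")?;
    // At most four pieces: repo type, owner, repo name, and the remaining path.
    let mut parts = rest.splitn(4, '/');
    let repo_type = parts.next().filter(|s| !s.is_empty())?;
    let owner = parts.next().filter(|s| !s.is_empty())?;
    let name = parts.next().filter(|s| !s.is_empty())?;
    let path = parts.next().unwrap_or("");
    Some(HfPath {
        repo_type: repo_type.to_string(),
        repo_id: format!("{owner}/{name}"),
        path: path.to_string(),
    })
}

/// Build the HTTPS URL that a wrapped HTTP store could fetch.
/// Assumption: files live at `/<repo_type>/<repo_id>/resolve/<revision>/<path>`.
fn to_https_url(p: &HfPath, revision: &str) -> String {
    format!(
        "https://huggingface.co/{}/{}/resolve/{}/{}",
        p.repo_type, p.repo_id, revision, p.path
    )
}

fn main() {
    let p = parse_hf_url("hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet")
        .expect("valid hf:// URL");
    println!("{}", to_https_url(&p, "main"));
}
```

Keeping the parsing separate from the URL construction means the same parsed path could be pointed at a different endpoint or revision without touching the store logic.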

Contributor

@edmondop edmondop left a comment


@xinlifoobar
Contributor Author

> Hi @alamb. I am still working on this: completing the unit tests and E2E tests and fixing code style issues. Could you please do a quick preliminary review of the ideas behind it? I did not find a simple way to implement this other than creating a wrapper ObjectStore impl on top of HttpStore.

This is now ready for review. I have finished most of the refinement of the initial code.

```
)
STORED AS parquet
LOCATION "hf://datasets/cais/mmlu/astronomy/";
```
Contributor Author

This currently does not work because:

  1. The Hugging Face user access token is case-sensitive.
  2. A previous change forces every option value to lower case: https://github.com/apache/datafusion/pull/9723/files

I will look into the history of that change to see whether supporting this is feasible.
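
The conflict between the two points above can be shown with a small standard-library-only sketch. It is not DataFusion's option handling. The `normalize_options` helper, the `hf.user_access_token` key, and the token value are all made up for illustration. The sketch lower-cases only the keys and leaves the values untouched, which is one way a case-sensitive token could survive normalization.

```rust
use std::collections::HashMap;

/// Normalize `(key, value)` option pairs (hypothetical helper).
/// Keys are lower-cased so lookups are case-insensitive;
/// values are stored exactly as written.
fn normalize_options(raw: &[(&str, &str)]) -> HashMap<String, String> {
    raw.iter()
        .map(|(k, v)| (k.to_lowercase(), v.to_string()))
        .collect()
}

fn main() {
    // Made-up value; real tokens are secret and case-sensitive.
    let token = "hf_AbCdEf123";

    // Lower-casing the value produces a different string, so it is no longer the same token.
    assert_ne!(token.to_lowercase(), token);

    // Lower-casing only the key keeps the token intact.
    let opts = normalize_options(&[("HF.USER_ACCESS_TOKEN", token)]);
    assert_eq!(opts["hf.user_access_token"], token);
    println!("token preserved: {}", opts["hf.user_access_token"] == token);
}
```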

@alamb
Contributor

alamb commented Jun 11, 2024

I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)

@xinlifoobar
Contributor Author

> I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)

Yes, I first saw this on my LinkedIn timeline, and I am glad to be part of this awesome project.

I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations.

@xinlifoobar xinlifoobar reopened this Jun 13, 2024
@alamb
Contributor

alamb commented Jun 15, 2024

> I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations.

If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )

@@ -0,0 +1,9 @@
select count(*) from "hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet";
Contributor

😮 -- tests too!

Contributor Author

I saw that DuckDB builds its own organizations and datasets, instead of depending on arbitrary existing ones, to keep its CI safe. It is too early for this PR to do the same; it is still at an early stage.

@xinlifoobar
Contributor Author

> > I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations.
>
> If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )

Yes. Re-implementing the data store and its associated facilities takes a lot of code. Do you think the ObjectStore solution is the right way to go? If so, I could split part of the code out into a datafusion-contrib repo.

@alamb
Contributor

alamb commented Jun 28, 2024

Sorry for the delay -- I plan to review this tomorrow

@xinlifoobar
Contributor Author

> Sorry for the delay -- I plan to review this tomorrow

Sorry, I lost my GitHub connection for a couple of days and have just returned. Also, thanks. Please take your time.

@alamb
Contributor

alamb commented Jul 7, 2024

This is still on my list, but I am behind in my reviews

Contributor

@alamb alamb left a comment

Hi @xinlifoobar

I am sorry for the delay in responding to this PR. This is an amazing piece of software engineering. Very nice 🎩 👌

As you have noted, the challenge here is that the hf_store is non-trivial and yet somewhat specialized for HuggingFace. It is a really neat feature but somewhat hard to justify adding to the datafusion-cli in the DataFusion repo.

I feel like there is a tension between making datafusion-cli easy to use with many built-in integrations (e.g. hugging face, delta-rs, etc) and keeping the dependencies manageable.

What would you think about somehow moving this hugging face integration into another repo and making some version of datafusion-cli that had a bunch of pre-defined integrations?

For example, maybe put it in https://github.com/datafusion-contrib/datafusion-cli-plus or something

That could be like the power user version of datafusion-cli, in which we could use all the fun table providers (like what is in the connector libraries, etc), delta-rust, iceberg-rust, etc

If you think this is a reasonable idea, I will file a ticket for larger discussion

@xinlifoobar
Contributor Author

> Hi @xinlifoobar
>
> I am sorry for the delay in responding to this PR. This is an amazing piece of software engineering. Very nice 🎩 👌
>
> As you have noted, the challenge here is that the hf_store is non-trivial and yet somewhat specialized for HuggingFace. It is a really neat feature but somewhat hard to justify adding to the datafusion-cli in the DataFusion repo.
>
> I feel like there is a tension between making datafusion-cli easy to use with many built-in integrations (e.g. hugging face, delta-rs, etc) and keeping the dependencies manageable.
>
> What would you think about somehow moving this hugging face integration into another repo and making some version of datafusion-cli that had a bunch of pre-defined integrations?
>
> For example, maybe put it in https://github.com/datafusion-contrib/datafusion-cli-plus or something
>
> That could be like the power user version of datafusion-cli, in which we could use all the fun table providers (like what is in the connector libraries, etc), delta-rust, iceberg-rust, etc
>
> If you think this is a reasonable idea, I will file a ticket for larger discussion

Hey @alamb, thanks for taking the time to review this PR. It is a great honor for me.

I would be glad to have a repo like datafusion-cli-plus, but I am thinking of a broader scope for the new repo.

As you may have already observed, datafusion-cli currently supports the DataFusion SQL interface. Are you considering expanding its capabilities to encompass protocols such as the MySQL client protocol, Arrow Flight SQL, and others? This expansion would entail making it act as a full-fledged server. I've observed similar approaches being implemented in downstream projects like InfluxDB.

@Xuanwo
Member

Xuanwo commented Sep 6, 2024

Apologies for missing this PR. I wanted to share that OpenDAL has native support for Hugging Face. I'm considering whether integrating with OpenDAL would be beneficial, making it easier, more extensible, and more maintainable to add cool features like this one (as a response to #12357).

@alamb
Contributor

alamb commented Sep 6, 2024

> I'm considering whether integrating with OpenDAL would be beneficial, making it easier, more extensible

I think using the connectivity offered by OpenDAL in a tool such as #11979 would be very interesting.

Basically I think it would be valuable to distinguish between the query processing engine in DataFusion and the larger integrated applications that are built with DataFusion (and other components).

@alamb
Contributor

alamb commented Oct 18, 2024

Marking as draft, as I don't think this is waiting on review -- more like waiting on porting to another repo (like dft)

@alamb alamb marked this pull request as draft October 18, 2024 20:28
Successfully merging this pull request may close these issues.

Implement hf:// / "hugging face" integration in datafusion-cli
4 participants