Is your feature request related to a problem? Please describe.
When the PySpark kernel starts the connection to the Spark cluster (Livy), it should load the packages installed in the local environment by default (or at least offer a way to specify them), so users can use those packages in the Spark session as well.
For example, in the PySpark kernel, if I do:
%%local
import matplotlib
It loads successfully. This is expected because %%local reads the matplotlib package I have installed on the JupyterLab machine.
But if I do:
import matplotlib
Starting Spark application
ID | YARN Application ID            | Kind    | State | Spark UI | Driver log | Current session?
32 | application_1636505612345_0200 | pyspark | idle  | Link     | Link       | ✔
SparkSession available as 'spark'.
An error was encountered:
No module named 'matplotlib'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'matplotlib'
As we can see, it errors out. It can't find the package on the Spark cluster because in this case the import runs on the cluster, not on the JupyterLab machine.
Describe the solution you'd like
People may ask: why not just install those packages on the Spark cluster? Well, most of the time end users don't have permission to do that. If there were a way for the PySpark kernel to upload the packages when it starts the Spark session, that would be really helpful! For example, a config set before starting the session in which users can specify which packages to upload.
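For what it's worth, a partial workaround exists today if the dependencies can be zipped and placed somewhere the cluster can already read (for example HDFS or S3): sparkmagic's %%configure magic passes its JSON body to Livy's session-creation request, and Livy accepts a pyFiles list there. A minimal sketch (the archive path below is a placeholder, not a real location), run before the session starts:

%%configure -f
{
  "pyFiles": ["s3://<some-reachable-bucket>/my_packages.zip"]
}

This still leaves the user to build and upload my_packages.zip out of band, which is exactly the step this request asks the kernel to automate.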
I'll extend the above request slightly, though I believe the suggestion @kdzhao gave would achieve it.
In my case, I have several notebooks that I use on my EMR clusters which reuse some key functions that I would ideally like to define in only one place.
Ideally, the %run magic could be adapted to allow running a Python script that lives on the JupyterLab machine against the EMR cluster.
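Until something like that exists, one workaround is to read the script in a %%local cell, ship its source as a string with %%send_to_spark, and exec it in the remote session. A rough sketch (the shared/helpers.py path is just an example):

%%local
# Read the shared script from the JupyterLab machine into a plain string.
with open("shared/helpers.py") as f:
    helpers_src = f.read()

%%send_to_spark -i helpers_src -t str -n helpers_src

# This cell runs on the cluster: define the helpers in the remote session.
exec(helpers_src)

It works, but it is clumsy compared to a proper %run-style magic.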
I also think this is a big problem. Take for example the following scenario:
I would like to have my functions and important codebase managed by a git repository, simplified somewhat:
src/my_module.py
For example, consider the following cells in the notebook:
%%local
s = "Some string I want to send over"
%%send_to_spark -i s -t str -n s
This successfully sends the string to the Spark session.
What I want to achieve, or something like it, is to import things in my local environment and send them over:
%%local
from src.my_module import foo_function
and then send foo_function to the Spark session (a rough workaround is sketched after this comment).
This is crucial for keeping code under git management instead of the monolithic notebooks we end up with these days.
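In the meantime, a crude approximation is possible with inspect: capture the function's source locally, ship it as a string, and exec it on the cluster. A sketch only; it assumes foo_function is self-contained, since inspect.getsource captures just that one function and none of its imports or module-level dependencies:

%%local
import inspect
from src.my_module import foo_function
# Capture the function's source text so it can be shipped as a plain string.
foo_src = inspect.getsource(foo_function)

%%send_to_spark -i foo_src -t str -n foo_src

# This cell runs on the cluster: recreate foo_function in the remote session.
exec(foo_src)
# foo_function is now defined and callable cluster-side.

A first-class way to do this from a git-managed package would be much cleaner.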