Is your feature request related to a problem? Please describe.
When the PySpark kernel starts the connection to the Spark cluster (Livy), it should load the packages installed in the local environment by default (or at least offer a way to specify them), so users can use those packages in the Spark session as well.
For example, in the PySpark kernel, if I do:
%%local
import matplotlib
It loads successfully. This is expected because %%local reads the matplotlib package I have installed on the JupyterLab machine.
But if I do:
import matplotlib
Starting Spark application
ID | YARN Application ID            | Kind    | State | Spark UI | Driver log | Current session?
32 | application_1636505612345_0200 | pyspark | idle  | Link     | Link       | ✔
SparkSession available as 'spark'.
An error was encountered:
No module named 'matplotlib'
Traceback (most recent call last):
ModuleNotFoundError: No module named 'matplotlib'
As we can see, it errors out. It can't find the package on the Spark cluster because in this case the import runs on the cluster, not on the JupyterLab machine.
Describe the solution you'd like
People may ask: why not just install those packages on the Spark cluster? Well, most of the time end users don't have permission to do that. If there were a way for the PySpark kernel to upload the packages when it starts the Spark session, that would be really helpful! For example, a config set before starting the session in which users can specify which packages to upload.
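For what it's worth, a partial workaround exists today if the dependencies can be zipped and placed somewhere the cluster can already read (for example HDFS or S3): sparkmagic's %%configure magic passes its JSON body to Livy's session-creation request, and Livy accepts a pyFiles list there. A minimal sketch (the archive path below is a placeholder, not a real location), run before the session starts:

%%configure -f
{
  "pyFiles": ["s3://<some-reachable-bucket>/my_packages.zip"]
}

This still leaves the user to build and upload my_packages.zip out of band, which is exactly the step this request asks the kernel to automate.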
I'll extend the above request slightly, though I believe the suggestion @kdzhao gave would achieve it.
In my case, I have several notebooks that I use on my EMR clusters which reuse some key functions that I would ideally like to define in only one place.
Ideally, the %run magic could be adapted to allow running a Python script that lives on the JupyterLab machine against the EMR cluster.
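Until something like that exists, one workaround is to read the script in a %%local cell, ship its source as a string with %%send_to_spark, and exec it in the remote session. A rough sketch (the shared/helpers.py path is just an example):

%%local
# Read the shared script from the JupyterLab machine into a plain string.
with open("shared/helpers.py") as f:
    helpers_src = f.read()

%%send_to_spark -i helpers_src -t str -n helpers_src

# This cell runs on the cluster: define the helpers in the remote session.
exec(helpers_src)

It works, but it is clumsy compared to a proper %run-style magic.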
I also think this is a big problem. Take for example the following scenario:
I would like to have my functions and important codebase managed by a git repository, simplified somewhat:
src/my_module.py
For example, consider the following cells in the notebook:
%%local
s = "Some string I want to send over"
%%send_to_spark -i s -t str -n s
This successfully sends the string to the Spark session.
What I want to achieve, or something like it, is to import things in my local environment and send them over:
%%local
from src.my_module import foo_function
and then send foo_function to the Spark session (a rough workaround is sketched after this comment).
This is crucial for keeping code under git management instead of the monolithic notebooks we end up with these days.
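In the meantime, a crude approximation is possible with inspect: capture the function's source locally, ship it as a string, and exec it on the cluster. A sketch only; it assumes foo_function is self-contained, since inspect.getsource captures just that one function and none of its imports or module-level dependencies:

%%local
import inspect
from src.my_module import foo_function
# Capture the function's source text so it can be shipped as a plain string.
foo_src = inspect.getsource(foo_function)

%%send_to_spark -i foo_src -t str -n foo_src

# This cell runs on the cluster: recreate foo_function in the remote session.
exec(foo_src)
# foo_function is now defined and callable cluster-side.

A first-class way to do this from a git-managed package would be much cleaner.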