
LNC Receive #1342

Draft: wants to merge 9 commits into master
Conversation

@riccardobl (Member) commented Aug 29, 2024

Description

lnc-web based LNC receive attachment.
Solves #1141.

Additional Context

The lnc-web library, the lnc wasm client, and the Go wasm_exec runtime are designed for browser environments, not for backend services. As a result, I had to implement some workarounds, which are less than ideal. I've been invited to send the PR anyway, and the implementation appears to be stable overall, so I leave it to your judgment whether it's worth merging.

Workaround 1: Global context isolation and polyfill

The lnc-web and wasm_exec libraries rely heavily on standard browser APIs and on the global context to store their state. To keep the backend's global context from being polluted, and to prevent lnc calls from conflicting with each other, this PR runs every call inside an isolated node:vm instance.
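For illustration, a minimal sketch of the isolation idea (not the PR's actual code; `bundleSource` and `entryExpression` are placeholder inputs, and the list of polyfilled globals is abbreviated):

```js
// sketch: run a browser-oriented bundle (e.g. lnc-web) in a throwaway global context
const vm = require('node:vm')
const { webcrypto } = require('node:crypto')

async function runIsolated (bundleSource, entryExpression) {
  // fresh "global object" per call, polyfilled with the browser APIs the bundle expects
  const sandbox = {
    console,
    crypto: webcrypto,
    TextEncoder,
    TextDecoder,
    setTimeout,
    clearTimeout,
    performance
  }
  sandbox.globalThis = sandbox
  sandbox.self = sandbox
  const context = vm.createContext(sandbox)

  // evaluate the bundle inside the isolated context, so whatever state it
  // hangs off the global object stays in `sandbox`
  new vm.Script(bundleSource).runInContext(context)

  // run the actual call; the whole context is discarded afterwards
  return await vm.runInContext(entryExpression, context)
}
```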

Workaround 2: Force kill the go runtime

There is a bug in lnc-web that causes an infinite loop when the connection can't be established (lightninglabs/lnc-web#83). To work around this issue I made a kill() function that sets the internal exited flag, causing the Go runtime to fail all pending tasks with the "Go program has already exited" exception. This happens inside the node:vm runtime, which is thrown away after every call, so it shouldn't leave anything dangling.
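Roughly, the workaround looks like this (a sketch only; the actual property names in wasm_exec and in the PR may differ):

```js
// sketch: force the wasm_exec Go runtime to bail out of pending work
function kill (go) {
  if (!go || go.exited) return
  // flipping the flag makes any callback that tries to resume the runtime
  // throw "Go program has already exited" instead of looping forever
  go.exited = true
  // unblock anything awaiting go.run(), if the internal resolver is exposed
  if (typeof go._resolveExitPromise === 'function') go._resolveExitPromise()
}
```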

Checklist

Are your changes backwards compatible? Please answer below:
yes

On a scale of 1-10 how well and how have you QA'd this change and any features it might affect? Please answer below:
7
Tested with a local connection created with litcli sessions add --type custom --label <your label> --account_id <account_id> --uri /lnrpc.Lightning/AddInvoice
and with Voltage (but I had to disable strict permissions for that)

Did you introduce any new environment variables? If so, call them out explicitly here:
no

riccardobl marked this pull request as ready for review August 29, 2024 13:42
@huumn (Member) commented Aug 29, 2024

tested with a local connection created [...] with voltage

We have a litd container in the sndev docker cluster that works for sending and should work for receiving.

@riccardobl (Member Author) commented

tested with a local connection created [...] with voltage

We have a litd container in the sndev docker cluster that works for sending and should work for receiving.

I've been using ./sndev stacker_litcli sessions add ...
Is this what you are referring to?

@huumn (Member) commented Aug 29, 2024

Yep, but what do you need a voltage node for if you're doing that?

@riccardobl (Member Author) commented

Yep, but what do you need a voltage node for if you're doing that?

I've tested it on the dev environment and on Voltage, just to try it in another environment.

@huumn (Member) commented Aug 29, 2024

Ah I see!

@huumn (Member) commented Sep 2, 2024

Just a heads up: I'll try to review this tomorrow

@huumn (Member) commented Sep 4, 2024

Fell behind today with meetings. I'll pull this first thing tomorrow.

@huumn (Member) commented Sep 4, 2024

I've done a QA pass. It works really well! The code has lots of great workarounds for lnc's bugs - I've never used vm before.

I'll begin looking closer now, but two main things I'll be looking at:

  1. The in-memory mutex system won't work in prod because we run several application servers behind a load balancer. It's probably not worth constructing a mutex in postgres for this, but it's unclear how else we'd enforce only one active "connection" to the mailbox for a given pairing phrase.
  2. I'm curious if spawning some variety of subprocess that starts, returns the bolt11 on stdout, and then exits might allow us to achieve the same namespace isolation more simply.

@huumn (Member) commented Sep 4, 2024

So, still unsure what the right move is, but things we can do for the above:

  1. We can use an advisory lock
    • we'll probably want to grab a postgres connection independent of the pool (given lnc is slow)
    • and use a session-level lock grabbed in an abort controller so that we can time it out if we need to (unlocking/closing the session)
    • and for the lock, we'll want to use two keys (so we don't conflict with other "keyspaces"), e.g. pg_advisory_lock(<some lnc specific magic number>, walletId); a rough sketch follows this list
  2. Still unsure about the need to replace vm but the eval-like nature of it scares me and maintaining it is going to be tough through lnc-web changes, so alternatively we can:
    • call fork, which is its own process with its own memory and global namespace (containing lnc's buggy/hackiness), passing it DH keys as args, and then we send back the bolt11
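For concreteness, a rough sketch of the advisory-lock idea in 1 (LNC_LOCK_NS is a made-up magic number, and a plain timer stands in for the abort controller):

```js
// sketch: serialize lnc work per wallet via a session-level advisory lock on
// a dedicated connection (outside the app's pool)
const { Client } = require('pg')

const LNC_LOCK_NS = 1105 // hypothetical lnc-specific "keyspace" number

async function withLncLock (walletId, fn, { timeoutMs = 30_000 } = {}) {
  const client = new Client({ connectionString: process.env.DATABASE_URL })
  await client.connect()
  // crude timeout: closing the session releases the session-level lock
  const timer = setTimeout(() => client.end().catch(() => {}), timeoutMs)
  try {
    // two keys so we don't collide with other advisory-lock users
    await client.query('SELECT pg_advisory_lock($1, $2)', [LNC_LOCK_NS, walletId])
    return await fn()
  } finally {
    clearTimeout(timer)
    await client.end().catch(() => {})
  }
}
```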

If you'd like to do these revisions, I can promote this issue to difficulty:hard, bringing the bounty to 2m. Alternatively, I can pay you 1m for the current PR (it's solid work), we'll pop this in draft, and ek or I can take this the last mile. Up to you!

@riccardobl (Member Author) commented Sep 5, 2024

If you'd like to do these revisions, I can promote this issue to difficulty:hard bringing the bounty to 2m. Alternatively, I can pay you 1m for the current pr (its solid work) and we'll pop this in draft and ek or I can take this the last mile. Up to you!

Thank you for your fairness in handling this issue, but I wouldn't feel comfortable accepting a reward for incomplete work. So, unless you prefer to take over, I'd like to make these revisions myself 🫡.

We can use an advisory lock

My understanding is that we would need a new pg session for each lnc call for this to work, since session-level advisory locks do not prevent the session that holds them from acquiring them again, so async tasks sharing the same session would still run at the same time.

Instead we could use a combination of:

  • in-memory mutex: to sync the tasks in the local instance
  • a loop polling pg_try_advisory_lock: to hold back the task if another backend instance, using a different session, is running it; it's non-blocking, so we won't need a separate pool (sketched below)
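Something like this is what I have in mind (a rough sketch; `pool` stands for the backend's existing pg pool and LNC_LOCK_NS for a made-up namespace, as in the earlier sketch):

```js
// sketch: serialize locally with a promise chain, and across instances by
// polling the non-blocking advisory lock on a client checked out of the pool
const LNC_LOCK_NS = 1105 // hypothetical namespace

let localQueue = Promise.resolve() // naive in-memory mutex for this instance

function withLocalMutex (fn) {
  const run = localQueue.then(fn, fn)
  localQueue = run.catch(() => {}) // keep the chain alive after failures
  return run
}

async function withCrossInstanceLock (pool, walletId, fn, { pollMs = 250 } = {}) {
  return withLocalMutex(async () => {
    // hold one pooled client so lock and unlock happen on the same session
    const client = await pool.connect()
    try {
      for (;;) {
        const { rows } = await client.query(
          'SELECT pg_try_advisory_lock($1, $2) AS locked',
          [LNC_LOCK_NS, walletId])
        if (rows[0].locked) break
        await new Promise(resolve => setTimeout(resolve, pollMs))
      }
      try {
        return await fn()
      } finally {
        await client.query('SELECT pg_advisory_unlock($1, $2)', [LNC_LOCK_NS, walletId])
      }
    } finally {
      client.release()
    }
  })
}
```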

Does this make sense?

Still unsure about the need to replace vm but the eval-like nature of it scares me and maintaining it is going to be tough through lnc-web changes, so alternatively we can:
call fork, which is its own process with its own memory and global namespace (containing lnc's buggy/hackiness), passing it DH keys as args, and then we send back the bolt11

Not knowing much about the production environment, I went for vm because it isolates the context without booting a new Node.js instance or requiring IPC, so performance-wise it is much better. But if spawning a new runtime for every LNC call is not an issue, then child_process should work just fine. The only difference is that the global functions will be set on the process via globalThis instead of on the vm context; I can also probably remove the kill() workaround and just kill the process directly.

@huumn (Member) commented Sep 5, 2024

In general, I prefer not to be concerned with hypothetical performance problems before a feature has seen production performance problems, and while spawning subprocesses can be CPU intensive ime this will probably be manageable (our prod instances only get ~15% cpu consumption on average).

Does this make sense?

Sort of. There might be a better way to use advisory locks than I suggested, but we shouldn't need in-memory mutexes except to maybe limit us to one subprocess running at a time.

The way I imagine it: subprocess opens a db connection (session), grabs a session-level advisory lock, returns bolt11/errors via ipc, exits.
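Something along these lines (a sketch only; createInvoiceOverLnc stands in for the actual lnc-web calls, and LNC_LOCK_NS is a made-up namespace):

```js
// sketch of the subprocess (started with child_process.fork so process.send works):
// open a db session, take a session-level advisory lock, return the bolt11 over
// IPC, and exit; the lock is released when the process (and its session) dies
const { Client } = require('pg')

const LNC_LOCK_NS = 1105 // hypothetical namespace

async function main () {
  const { walletId, pairingPhrase, msats } = JSON.parse(process.argv[2])

  const db = new Client({ connectionString: process.env.DATABASE_URL })
  await db.connect()
  await db.query('SELECT pg_advisory_lock($1, $2)', [LNC_LOCK_NS, walletId])

  // placeholder for the lnc-web dance that actually creates the invoice
  const bolt11 = await createInvoiceOverLnc({ pairingPhrase, msats })

  process.send({ bolt11 })
  await db.end()
  process.exit(0)
}

main().catch(err => {
  process.send({ error: err.message })
  process.exit(1)
})
```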


To be honest, I'm uncertain about how to handle this even knowing the full production context. The only thing I'm certain about is that we'll have to use an advisory lock. (We could also use a SKIP LOCKED postgres queue, but lnc-web's namespace pollution, closures (ime we can't reuse an existing lnc instance), and bugs will probably lead us back to vm/subprocesses anyway.)

I'd like to avoid you having to guess about stuff. You seem super capable, so if you're hacking on SN I don't want stuff out of your control to be a bottleneck.

Still, if you want to keep at this problem, I'm happy to support it until we can get something merged.

@riccardobl (Member Author) commented

In general, I prefer not to be concerned with hypothetical performance problems before a feature has seen production performance problems, and while spawning subprocesses can be CPU intensive ime this will probably be manageable (our prod instances only get ~15% cpu consumption on average).

Makes sense, I'll adapt the code as you proposed 👍

The way I imagine it: subprocess opens a db connection (session), grabs a session-level advisory lock, returns bolt11/errors via ipc, exits.

Ok, I was thinking about using the advisory lock for queuing the launch of the lnc subprocesses from the main sn process.
My understanding is that what you are proposing is to make the subprocesses connect back to the database to "lock themselves".
I think your proposal is valid, but it has the potential to create up to n_lnc_wallets*backend_instances short-lived connections to the database, and I think this might not scale very well.

If you are ok with it, I will try implementing it the way I've proposed first (that should work fine with the same connection pool used by the backend). Then, once we can see the whole code, if it ends up being suboptimal for whatever reason, I can move the lock to the subprocess as you proposed, or whatever we figure makes sense.

@huumn (Member) commented Sep 5, 2024

I think your proposal is valid, but has the potential to create up to n_lnc_wallets*backend_instances short living connections to the database.

Not if we only let one subprocess run at a time per backend, which is what I think I'd do, but again, I'm not sure what will still work here without being some huge hack.

socket-security bot commented Sep 6, 2024

New and removed dependencies detected.

Package: npm/[email protected]
New capabilities: environment, filesystem
Transitives: network, shell, unsafe (+86)
Size: 17.8 MB
Publisher: google-wombot

🚮 Removed packages: npm/[email protected], npm/[email protected]


@riccardobl (Member Author) commented

I've ended up with an implementation that is somewhere in between what was discussed here.

Subprocess instead of vm

The child_process (aka worker.cjs) accepts a list of actions via IPC, runs them sequentially and then returns an output and quits.
The child_process will try to disconnect cleanly, but if it gets stuck somewhere or takes too long, the backend will send a SIGKILL to terminate it (replaces the kill() workaround).
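On the backend side, the supervision looks roughly like this (a simplified sketch, not the PR's exact code; the worker path and message shapes are illustrative):

```js
// sketch: fork the lnc worker, hand it the actions over IPC, and SIGKILL it
// if it doesn't finish before the deadline
const { fork } = require('node:child_process')
const path = require('node:path')

function runLncWorker (actions, { timeoutMs = 60_000 } = {}) {
  return new Promise((resolve, reject) => {
    const child = fork(path.join(__dirname, 'worker.cjs'))

    const timer = setTimeout(() => {
      child.kill('SIGKILL') // the wasm runtime wedged; don't wait for a clean exit
      reject(new Error('lnc worker timed out'))
    }, timeoutMs)

    child.once('message', (result) => {
      clearTimeout(timer)
      result?.error ? reject(new Error(result.error)) : resolve(result)
    })
    child.once('error', (err) => { clearTimeout(timer); reject(err) })
    child.once('exit', () => clearTimeout(timer))

    child.send({ actions })
  })
}
```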

Locking

The PARALLEL flag configures how the attachment runs:

  • PARALLEL=true: one process per wallet, requests for different wallets can run at the same time
  • PARALLEL=false: only one process per backend instance, requests for different wallets run sequentially

The SINGLE_SESSION flag makes the attachment use a single long-lived database connection per backend.

The lock is acquired with pg_advisory_lock, or by polling pg_try_advisory_lock when PARALLEL=true and SINGLE_SESSION=true.

The in-memory mutex is used:

  • with SINGLE_SESSION=true: to ensure only one task uses the session
  • with SINGLE_SESSION=false: to avoid connecting to the database multiple times from the same backend when the lock is already known to be held locally (see the sketch below)
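To make the modes concrete, roughly (illustrative only; the PR's actual wiring may differ, LNC_LOCK_NS is made up, and the flags are shown as env vars for simplicity):

```js
// sketch: how the two flags could pick the lock strategy
const LNC_LOCK_NS = 1105
const PARALLEL = process.env.PARALLEL === 'true'
const SINGLE_SESSION = process.env.SINGLE_SESSION === 'true'

async function acquireLock (db, walletId, pollMs = 250) {
  // per-wallet key when parallel, a single shared key otherwise
  const key = PARALLEL ? walletId : 0

  if (PARALLEL && SINGLE_SESSION) {
    // one shared long-lived session: poll the non-blocking variant to wait out
    // other backend instances (the in-memory mutex, not shown, serialises
    // tasks inside this instance)
    for (;;) {
      const { rows } = await db.query(
        'SELECT pg_try_advisory_lock($1, $2) AS locked', [LNC_LOCK_NS, key])
      if (rows[0].locked) return
      await new Promise(resolve => setTimeout(resolve, pollMs))
    }
  }

  // dedicated session per task: a blocking acquire is fine
  await db.query('SELECT pg_advisory_lock($1, $2)', [LNC_LOCK_NS, key])
}
```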

Not if we only let one subprocess run at a time per backend, which is what I think I'd do

This behavior is toggled with PARALLEL=false, but imo once you have a bunch of wallets triggering autowithdraw it might degrade pretty quickly: the queue gets too long and causes calls to testCreateInvoice (for wallet attachment) or createInvoice to time out or wait for a very long time.

The code is currently set to run with PARALLEL=true SINGLE_SESSION=true, which should be the most performant, imo.

We can test all the modes and decide which one to keep in the final PR, or just keep it configurable so you can easily switch if you notice performance degradation in production.

@huumn (Member) commented Sep 6, 2024

Man you turned this out fast. It's really impressive considering how hamstrung LNC is.

I'll take a closer look this weekend, but tbh, it's looking like lnc receives may be more trouble than they're worth. If LNC's only problem were this single-connection limitation, it'd be worth the complex code to give people the option of using it, but it's also extremely slow and flaky. This isn't your fault obviously - LNC is what it is.

Again, I'll take a look this weekend and pay you regardless of what we decide.

@huumn (Member) commented Sep 7, 2024

Just brainstorming ...

If LNC is built assuming a single long-lived connection on an end-user device, which seems to be the case given the way lnc and its libraries are written, perhaps the ideal solution will be wrapping https://github.com/lightninglabs/lightning-node-connect into a daemon (lncd) and running it as a microservice.

Currently, it only builds wasm and mobile libraries, but all the logic is there - we're just missing the right interface. We'd basically just write another main.go and interact with it via a REST or gRPC interface. It would hold connections to the mailbox, and we wouldn't have to worry about mutexes and subprocesses in the application server.


Another option is building https://github.com/lightninglabs/lightning-node-connect into a binary with a CLI. We'd still need mutexes, but it'd be way more nimble than creating a Node.js subprocess just to run wasm.

@riccardobl (Member Author) commented

If LNC is built assuming a single long-lived connection on an end-user device, which seems to be the case given the way lnc and its libraries are written, perhaps the ideal solution will be wrapping https://github.com/lightninglabs/lightning-node-connect into a daemon (lncd) and running it as a microservice.

Currently, it only builds wasm and mobile libraries, but all the logic is there - we're just missing the right interface. We'd basically just write another main.go and interact with it via a rest or grpc interface. It would hold connections to the mailbox and we don't have to worry about mutexes and subprocesses in the application server.

I am not 100% sure that keeping a long-lived connection to the mailbox on the server, for each lnc user, is scalable. We should at least figure out how many concurrent connections from the same host the mailbox will accept.
Probably a better approach would be to limit the active connections to a known maximum and then close the oldest connection as soon as a new one is requested.
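For example, something like this could cap the connections (a sketch; MAX_LNC_CONNS, openLncConnection, and conn.close() are placeholders):

```js
// sketch: keep at most MAX_LNC_CONNS mailbox connections, evicting the least
// recently used one when a new connection is requested
const MAX_LNC_CONNS = 10 // hypothetical limit
const conns = new Map() // walletId -> connection, kept in recency order

async function getConnection (walletId) {
  if (conns.has(walletId)) {
    const conn = conns.get(walletId)
    conns.delete(walletId)
    conns.set(walletId, conn) // re-insert to mark it most recently used
    return conn
  }
  if (conns.size >= MAX_LNC_CONNS) {
    const [oldestId, oldest] = conns.entries().next().value
    conns.delete(oldestId)
    await oldest.close() // close the stalest mailbox connection
  }
  const conn = await openLncConnection(walletId) // placeholder connect call
  conns.set(walletId, conn)
  return conn
}
```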

Another option is building https://github.com/lightninglabs/lightning-node-connect into binary with a cli. We'd still need mutexes, but it'd be way more nimble than creating a nodejs subprocess just to use wasm.

Probably would improve boot time a bit.


That said, before thinking about maintaining a fork of a library that already doesn't seem very stable, do we have an idea of how slow/heavy the current implementation is and whether an improvement is needed?
My understanding is that receive connectors are used only for autowithdraw, so as long as we don't build up a long queue (with SEQUENTIAL=true) it should probably work well enough. A delay of ~1 second per autowithdraw should not be a problem, imo.

@huumn (Member) commented Sep 8, 2024

Most of my concerns relate to:

  1. maintainability (global + local mutexes + wasm + subprocesses is heavy)
  2. reliability (lnc-web one-off connections on the client have proven terrible on that front)

So you’re right that a Go subprocess wouldn’t be much better.

Re: too many connections. I’d be surprised if more than 10 people use LNC receives to start, let alone so many it’s overwhelming. It’d be pretty trivial to create a FIFO to keep it manageable too.

I’m still chewing on all of this though. I’ve been really unhappy with our lnc sending (very slow and unreliable).

huumn marked this pull request as draft October 20, 2024 23:09
ekzyis added the feature ("new product features that weren't there before") and wallets labels Oct 24, 2024