Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP Proxy: Fix corrupted hostname from partial connection read. #10454

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

chotiwat
Copy link
Contributor

@chotiwat chotiwat commented Sep 28, 2023

What this PR does / why we need it:

We have run into intermittent "HTTP request sent to an HTTPS server" errors from our SSL passthrough ingresses. We found that the passthrough proxy would sometimes read incomplete data from the connection and cause corrupted hostname.

When this happens, the connection handler would fall back to the default server at 127.0.0.1:442 and cause the error if the nginx.ingress.kubernetes.io/backend-protocol: HTTPS annotation is not specified.

This PR fixes the issue by making sure that we fully read the Client Hello data from the connection by getting the total length from the TLS header.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • CVE Report (Scanner found CVE and adding report)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation only

Which issue/s this PR fixes

fixes #11491
fixes #11424

How Has This Been Tested?

We built a custom image to debug the issue and added some debug logs. For example, we updated

klog.V(4).InfoS("TLS Client Hello", "host", hostname)

to

klog.V(4).InfoS("TLS Client Hello", "host", hostname, "conn", fmt.Sprintf("%p", conn), "read-length", length, "total-length", int(data[3])<<8+int(data[4]))

Example log of corrupted data before the fix:

I0928 01:42:51.621260       7 tcp.go:74] "TLS Client Hello" host="\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00" conn="0xc002db0500" read-length=244 total-length=330

We can reproduce the issue by running curl in a loop. I've seen around 5% of errors from my machine with the number of requests as low as 10 enough to trigger the error. This fix eliminates the errors completely for us.

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have added unit and/or e2e tests to cover my changes.
  • All new and existing tests passed.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 28, 2023
@netlify
Copy link

netlify bot commented Sep 28, 2023

Deploy Preview for kubernetes-ingress-nginx canceled.

Name Link
🔨 Latest commit 43bc60e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-ingress-nginx/deploys/672340a25e7c6400087f246f

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 28, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @chotiwat. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 28, 2023
@chotiwat
Copy link
Contributor Author

/retest

@k8s-ci-robot
Copy link
Contributor

@chotiwat: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chotiwat
Copy link
Contributor Author

chotiwat commented Oct 5, 2023

Hi @cpanato @tao12345666333, is there anything I need to do on my end to get this PR reviewed?

Copy link
Member

@cpanato cpanato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not familiar with this part of the code

maybe @rikatz can help

@chotiwat
Copy link
Contributor Author

Hi @rikatz, would you be able to take a look at this PR when you get a chance?

@chotiwat
Copy link
Contributor Author

chotiwat commented Feb 5, 2024

Hi @rikatz, ping on this issue 🙏

@strongjz
Copy link
Member

strongjz commented Feb 9, 2024

/retest

@longwuyuan
Copy link
Contributor

longwuyuan commented Feb 9, 2024

  • Which issue number shows the corrupted hostname
  • Can the corrupted hostname problem be reproduced
  • Why does the problem NOT occur for multiple users of ssl-passthrough
  • What are details of why a backend-protocol is in play when ssl traffic is directly passed through to backend
  • Is there any tests written for this change
  • Is any e2e test suite going to be added or modified to check this change's impact

@chotiwat
Copy link
Contributor Author

Hi @longwuyuan, I've answered your questions below:

  • Which issue number shows the corrupted hostname

None. I couldn't find any issue describing the same problem.

  • Can the corrupted hostname problem be reproduced

Yes, as described in the PR body.

  • Why does the problem NOT occur for multiple users of ssl-passthrough

I wonder that myself. I can hazard a few guesses:

  • SSL passthrough might not be a widely used feature
  • The errors might have been disregarded by others if they aren't violating their SLAs
  • Retry mechanisms might have masked the errors
  • People might not care enough to report the issue

This doesn't mean the problem doesn't occur for other users of SSL passthrough though.

  • What are details of why a backend-protocol is in play when ssl traffic is directly passed through to backend

The connection handler for the proxy falls back to the default NGINX backend (p.Default) when the Client Hello hostname is corrupted, because it cannot be found in p.ServerList.

When this happens, the default backend will terminate the TLS and send a new request to the upstream per its generated NGINX configuration block, hence the HTTP-sent-to-HTTPS error.

Setting backend-protocol to HTTPS is just a hack to at least avoid the error, but it's not a real passthrough since the TLS gets terminated then re-initiated.

proxy := p.Default
hostname, err := parser.GetHostname(data)
if err == nil {
klog.V(4).InfoS("TLS Client Hello", "host", hostname)
proxy = p.Get(hostname)
}

// Get returns the TCPServer to use for a given host.
func (p *TCPProxy) Get(host string) *TCPServer {
if p.ServerList == nil {
return p.Default
}
for _, s := range p.ServerList {
if s.Hostname == host {
return s
}
}
return p.Default
}

  • Is there any tests written for this change

No, there isn't currently. I could add some though (see below).

  • Is any e2e test suite going to be added or modified to check this change's impact

I could modify the existing test suite if that sounds good to you. I didn't see it initially. It took some digging around the docs.

@longwuyuan
Copy link
Contributor

  • Can you create a issue for this and link it here. If you do, then ensuring that the issue description helps a developer of the ingress-nginx controller to reduce the work on reproducing the issue, will go a long way. I am still unclear on how I would reproduce the issue, if i wanted to
  • The hope against hope is that there is at least a few other users who face the same issue because the fact is that there is a large number of users of ssl-passthrough feature and why nobody else has reported this until now is extremely odd
  • I am not a developer so I am lost at the "connection handler" info you provided. So waiting for a developer to comment on that. My limit is failing to understand why, a new backend HTTPS connection, from the controller to backend pod, is even being stated as occuring, when the ssl-passthrough annotation implies that the termination of thc onnection from client, is not on the controller but instead is directly passed-through-and-through to the backend pod

@longwuyuan
Copy link
Contributor

  • And the e2e tests are a absolute requirement I think, based on the latest information you provided
  • thanks for the contribution

@chotiwat
Copy link
Contributor Author

Ok, I'll go ahead and try to add an e2e test for this.

  • Can you create a issue for this and link it here. If you do, then ensuring that the issue description helps a developer of the ingress-nginx controller to reduce the work on reproducing the issue, will go a long way. I am still unclear on how I would reproduce the issue, if i wanted to

I believe I've explained as clearly as I could in this PR. If you insist, I can create an issue, which would be a copy of this PR body and my previous answers to your questions, but I personally don't see any point in doing so.

The mistake in the code should also be pretty clear from the developer's point of view as well. Hopefully, I'd be able to come up with an e2e test that would fail against the current main branch but would pass on this branch. If that's still not enough for the developers, then I can try to set up a minimum reproduction steps.

  • I am not a developer so I am lost at the "connection handler" info you provided. So waiting for a developer to comment on that.

Yes, let's do that.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 16, 2024
@chotiwat chotiwat mentioned this pull request Feb 16, 2024
10 tasks
Copy link
Contributor Author

@chotiwat chotiwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the e2e tests that would have failed without this fix, and cherry-picked just the tests to #10988.

Unfortunately, it seems that this issue is harder to reproduce on a kind cluster, perhaps because the traffic doesn't go through many hops and there are no other workloads contending for I/O, so I had to introduce virtual throttling to the tests.

The throttling is done by wrapping the connection returned from the client's DialContext with a net.Conn that writes to the connection at most chunkSize bytes.

These tests would sometimes pass without the fix from this PR so I added retries with ginkgo.MustPassRepeatedly decorator as well. Alternatively, we could make the chunkSize less than len(host) but I don't know if it would make the tests significantly slower.

@@ -527,7 +527,7 @@ func newDeployment(name, namespace, image string, port int32, replicas int32, co
{
Name: name,
Image: image,
Env: []corev1.EnvVar{},
Env: env,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a bug in the e2e framework as well. This was preventing httpbun from getting the environment variables so it always ran as an HTTP server even though HTTPBUN_SSL_CERT and HTTPBUN_SSL_KEY are specified.

@@ -75,114 +75,211 @@ var _ = framework.IngressNginxDescribe("[Flag] enable-ssl-passthrough", func() {
Status(http.StatusNotFound)
})

ginkgo.It("should pass unknown traffic to default backend and handle known traffic", func() {
ginkgo.Context("when handling traffic", func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I grouped these tests with ginkgo.Context() so that we can reuse some of the common setup code.

"nginx.ingress.kubernetes.io/ssl-passthrough": "true",
}

ingressDef := framework.NewSingleIngressWithTLS(host,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When SSL passthrough is working correctly, we shouldn't need TLS settings on the NGINX side. The upstream server should be the one who terminates the TLS.


f.NewDeploymentWithOpts("echopass",
framework.HTTPBunImage,
80,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tripped me up a bit and caused me to spent some time trying to add an HTTPS port to the deployment. It turns out the httpbun image binds to port 80 by default no matter what the protocol it's listening for.

ENV HTTPBUN_BIND=0.0.0.0:80

So this sets up an HTTPS server listening on port 80 which technically works for the purpose of the tests.

ForceResolve(f.GetNginxIP(), 443).
Expect().
Status(http.StatusOK)
ginkgo.It("should handle known traffic with Host header", ginkgo.MustPassRepeatedly(tries), func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure without the fix:
Get "[https://testpassthrough.com:443/](https://testpassthrough.com/)": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not testpassthrough.com

Reason:
The proxy directs the traffic to NGINX. The configuration for the host specified in the HOST header doesn't specify any TLS certificate, so NGINX tries to terminate TLS with the default certificate.

Evidence:
https://github.com/kubernetes/ingress-nginx/actions/runs/7929254222/job/21649253282?pr=10988#step:6:826

ForceResolve(f.GetNginxIP(), 443).
Expect().
Status(http.StatusNotFound)
ginkgo.It("should handle insecure traffic with Host header", ginkgo.MustPassRepeatedly(tries), func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure without the fix:
400 Bad Request

Reason:
The proxy directs the traffic to NGINX. Unlike the should handle known traffic with Host header case above, NGINX doesn't verify certificate validity (InsecureSkipVerify: true) and is able to terminate TLS, but it tries to send the plain HTTP request to the HTTPS server because the configuration for the host specified in the HOST header doesn't specify HTTPS as the upstream protocol.

Evidence:
https://github.com/kubernetes/ingress-nginx/actions/runs/7929254222/job/21649252198?pr=10988#step:6:833

}
tries := 3

ginkgo.It("should handle known traffic without Host header", ginkgo.MustPassRepeatedly(tries), func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure without the fix:
Get "[https://testpassthrough.com:443/](https://testpassthrough.com/)": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not testpassthrough.com

Reason:
The proxy directs the traffic to NGINX. There's no HOST header so NGINX tries to terminate TLS with the default certificate because it doesn't know what server block to look for.

Evidence:
https://github.com/kubernetes/ingress-nginx/actions/runs/7929254222/job/21649250966?pr=10988#step:6:827

f.WaitForNginxServer(hostBad,
func(server string) bool {
return strings.Contains(server, "listen 442")
ginkgo.It("should handle insecure traffic without Host header", ginkgo.MustPassRepeatedly(tries), func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure without the fix:
404 Not Found

Reason:
The proxy directs the traffic to NGINX. Unlike the should handle known traffic without Host header case above, NGINX doesn't verify certificate validity (InsecureSkipVerify: true) and is able to terminate TLS, but it doesn't have any hostname for routing and returns 404.

Evidence:
There are no instances of this error in this workflow run so I ran it locally with make kind-e2e-test FOCUS='insecure traffic without Host' SKIP_INGRESS_IMAGE_CREATION=false SKIP_E2E_IMAGE_CREATION=false

  [FAILED] 
        Error Trace:    /go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/reporter.go:47
                                                /go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/chain.go:37
                                                /go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/response.go:263
                                                /go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/response.go:79
                                                /go/src/k8s.io/ingress-nginx/test/e2e/settings/ssl_passthrough.go:241
                                                /go/src/k8s.io/ingress-nginx/.modcache/github.com/onsi/ginkgo/[email protected]/internal/node.go:463
                                                /go/src/k8s.io/ingress-nginx/.modcache/github.com/onsi/ginkgo/[email protected]/internal/suite.go:889
                                                /usr/local/go/src/runtime/asm_arm64.s:1197
        Error:      
                        expected status equal to:
                         "200 OK"
                    
                        but got:
                         "404 Not Found"
        Test:           [Flag] enable-ssl-passthrough With enable-ssl-passthrough enabled when handling traffic on throttled connections should handle insecure traffic without Host header

}
}

ginkgo.It("should handle known traffic without Host header", func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f.WaitForNginxServer(hostBad,
func(server string) bool {
return strings.Contains(server, "listen 442")
ginkgo.It("should handle insecure traffic without Host header", func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ForceResolve(f.GetNginxIP(), 443).
Expect().
Status(http.StatusOK)
ginkgo.It("should handle known traffic with Host header", func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ForceResolve(f.GetNginxIP(), 443).
Expect().
Status(http.StatusNotFound)
ginkgo.It("should handle insecure traffic with Host header", func() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@longwuyuan
Copy link
Contributor

@chotiwat thanks for the contribution. I am copy/pasting my comments on the other PR here as well

  • I meant for the change and the tests to be in the same PR. Sorry i was not clear on that.
  • Can you please squash your commits
  • I am not a developer but I am extremely skeptical of the approach and assumptions here. If I can't reproduce the half-read connection using a stress tool like "hey" or "httperf" or "locust" or other such, then my opinion is that the changes you propose should not be forced on all the other users (simply because nonbody else is reporting this and we can not reproduce at will, so why fix things that are not broken)
  • Maybe your use case and your app have relative uniqueness for the incomplete read situation. Maybe your infra and networking contribute to the problem. If not then I should be able to reproduce at will with kin/minikube on any infra anywhere.
  • Your comment related to I/O makes me think that some resource starvation could have contributed to the incomplete-read hence your changes should not go to all the users (simply because bad idea to fix what is not broken for all other users)

@chotiwat
Copy link
Contributor Author

@longwuyuan

  • Can you please squash your commits

Of course. I can do that once the PR has been reviewed and I address all the comments.

  • I am not a developer but I am extremely skeptical of the approach and assumptions here. If I can't reproduce the half-read connection using a stress tool like "hey" or "httperf" or "locust" or other such, then my opinion is that the changes you propose should not be forced on all the other users (simply because nonbody else is reporting this and we can not reproduce at will, so why fix things that are not broken)
  • Maybe your use case and your app have relative uniqueness for the incomplete read situation. Maybe your infra and networking contribute to the problem. If not then I should be able to reproduce at will with kin/minikube on any infra anywhere.
  • Your comment related to I/O makes me think that some resource starvation could have contributed to the incomplete-read hence your changes should not go to all the users (simply because bad idea to fix what is not broken for all other users)

Our use case is not unique at all. In fact, I believe it's one of the main reasons this SSL passthrough feature exists.

Tools like kind and minikube are great for testing Kubernetes functionalities but by no means can they represent production clusters with traffic going though multiple network interfaces. Yes, in a perfect world, I should be able to reproduce the issue on a kind cluster so everyone can try to reproduce. I tried but I couldn't. I cannot grant you access to our cluster that we can consistently reproduce the problem on either, and it would be overkill for me to try to set up an EKS cluster just to convince you, especially when

  • the problem is evident just by reading the code
  • I've provided a definitive proof, reproducing the problem, via the added e2e tests that pass with this fix but fail without

Here are a few resources that might be helpful for more productive conversations in the future:

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chotiwat
Once this PR has been reviewed and has the lgtm label, please assign tao12345666333 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 25, 2024
@Gacko Gacko changed the title Fix corrupted hostname from partial connection read TCP Proxy: Fix corrupted hostname from partial connection read. Oct 25, 2024
@Gacko
Copy link
Member

Gacko commented Oct 25, 2024

/cherry-pick release-1.12

@Gacko
Copy link
Member

Gacko commented Oct 25, 2024

/cherry-pick release-1.11

@k8s-infra-cherrypick-robot
Copy link
Contributor

@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.12 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Gacko
Copy link
Member

Gacko commented Oct 25, 2024

/cherry-pick release-1.10

@k8s-infra-cherrypick-robot
Copy link
Contributor

@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.11

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot
Copy link
Contributor

@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.10 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 25, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Oct 25, 2024
pkg/tcpproxy/tcp.go Outdated Show resolved Hide resolved
pkg/tcpproxy/tcp.go Outdated Show resolved Hide resolved
Co-authored-by: Marco Ebert <[email protected]>
@Gacko Gacko requested review from cpanato and rikatz October 31, 2024 08:46
Copy link
Member

@Gacko Gacko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/triage accepted
/kind bug
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority labels Nov 4, 2024
@Gacko
Copy link
Member

Gacko commented Nov 4, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 4, 2024
@Gacko
Copy link
Member

Gacko commented Nov 4, 2024

@rikatz @cpanato Could you maybe review and LGTM this from a Go perspective?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
8 participants