💭 Jsoup - Not able to identify escaped/unescaped html entity in the text nodes #2206

Open
Muthukirthan opened this issue Oct 6, 2024 · 1 comment


I am not able to tell whether the input document had & or &amp; in a text node, because Jsoup normalizes both forms: the text accessors return the decoded character, and the HTML output escapes it. The same goes for other entities, such as < versus &lt;.

This gives Jsoup users no way to act on how the input was actually written. For example, we may want to remove a literal < character in a text node but preserve it when it is given as the entity &lt;.

Note: Please let me know if there is already a way to differentiate this.


An option to tell Jsoup not to modify text nodes would be super helpful. It would give users more flexibility and control.


Muthukirthan changed the title from "Jsoup - Not able to identify escaped/unescaped html entity in the text nodes 💭" to "💭 Jsoup - Not able to identify escaped/unescaped html entity in the text nodes" on Oct 6, 2024

Muthukirthan commented Oct 6, 2024

I tried several TextNode methods to get the original input text content, but none of them worked.

Example:
Input: <p> actual_lt: < || escaped_lt: &lt; </p>

// Parse the input shown above, then print each text node through every accessor.
Document doc = Jsoup.parse("<p> actual_lt: < || escaped_lt: &lt; </p>");
for (TextNode textNode : doc.selectFirst("p").textNodes()) {
    System.out.println("textNode.toString():-" + textNode.toString());
    System.out.println("textNode.text():-" + textNode.text());
    System.out.println("textNode.getWholeText():-" + textNode.getWholeText());
    System.out.println("textNode.outerHtml():-" + textNode.outerHtml());
}

Expected (from at least one of the methods): actual_lt: < || escaped_lt: &lt;
Output:

textNode.toString():-  actual_lt: &lt;   ||   escaped_lt: &lt;  
textNode.text():- actual_lt: < || escaped_lt: < 
textNode.getWholeText():-  actual_lt: <   ||   escaped_lt: <  
textNode.outerHtml():-  actual_lt: &lt;   ||   escaped_lt: &lt; 
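
Until there is a supported way to do this, one possible workaround is to mark the escaped form in the raw input before it is parsed. The sketch below is only an illustration and uses plain string handling, not any Jsoup API. The class `EntityMarker`, its two methods, and the sentinel U+E000 are hypothetical names chosen for this example. It assumes the sentinel never occurs in the real input, that the parser passes it through unchanged into the text node, and that only the exact spelling `&lt;` needs protecting (not `&#60;`, `&#x3c;`, or `&LT;`).

```java
class EntityMarker {
    // Hypothetical sentinel: a private-use code point assumed to be absent from the input.
    static final String ESCAPED_LT = "\uE000";

    /** Replaces the named entity &lt; with the sentinel so it survives entity decoding. */
    public static String protectEscapedLt(String rawHtml) {
        return rawHtml.replace("&lt;", ESCAPED_LT);
    }

    /**
     * Takes decoded text (for example, what a text accessor would return),
     * removes every literal '<', then restores the protected ones as &lt;.
     */
    public static String dropLiteralKeepEscaped(String decodedText) {
        return decodedText.replace("<", "").replace(ESCAPED_LT, "&lt;");
    }

    public static void main(String[] args) {
        String raw = " actual_lt: < || escaped_lt: &lt; ";
        String marked = protectEscapedLt(raw);
        // In real use, the marked HTML would be parsed here and the text read
        // back from the text node; this sketch skips the parser entirely.
        System.out.println(dropLiteralKeepEscaped(marked));
    }
}
```

Replacing across the whole raw document would also touch attribute values and script content, so in practice the marking step would need to be narrower than this.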
