Package org.opencms.util
Class CmsHtmlParser
- java.lang.Object
  - org.htmlparser.visitors.NodeVisitor
    - org.opencms.util.CmsHtmlParser
- All Implemented Interfaces: I_CmsHtmlNodeVisitor
- Direct Known Subclasses: CmsHtml2TextConverter, CmsHtmlDecorator, CmsLinkProcessor
public class CmsHtmlParser extends org.htmlparser.visitors.NodeVisitor implements I_CmsHtmlNodeVisitor
Base utility class for OpenCms NodeVisitor implementations, which provides some frequently used utility functions.
This base implementation is only a "pass-through" class: the content is parsed, but the generated result is identical to the input.
- Since: 6.2.0
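The "pass-through" behaviour and the echo flag can be illustrated with a small, self-contained sketch. It does not use the org.htmlparser or OpenCms classes: `PassThroughVisitor`, `UpperCaseTextVisitor`, `visitText` and `run` are hypothetical names, and the assumption that nothing is written when echo is off is a simplification made for this example only.

```java
import java.util.Locale;

public class EchoVisitorSketch {

    /** Hypothetical stand-in for a node visitor; NOT the org.htmlparser API. */
    public static class PassThroughVisitor {
        protected final boolean echo;
        protected final StringBuilder result = new StringBuilder();

        public PassThroughVisitor(boolean echo) {
            this.echo = echo;
        }

        // Each callback copies its input to the result only when echo is on.
        public void visitTag(String tagHtml) {
            if (echo) result.append(tagHtml);
        }

        public void visitText(String text) {
            if (echo) result.append(text);
        }

        public void visitEndTag(String tagHtml) {
            if (echo) result.append(tagHtml);
        }

        public String getResult() {
            return result.toString();
        }
    }

    /** A subclass overrides a single callback and inherits pass-through for the rest. */
    public static class UpperCaseTextVisitor extends PassThroughVisitor {
        public UpperCaseTextVisitor() {
            super(true);
        }

        @Override
        public void visitText(String text) {
            result.append(text.toUpperCase(Locale.ROOT));
        }
    }

    /** Feeds a fixed sequence of callbacks, as a parser would for "<p>hello</p>". */
    public static String run(PassThroughVisitor visitor) {
        visitor.visitTag("<p>");
        visitor.visitText("hello");
        visitor.visitEndTag("</p>");
        return visitor.getResult();
    }

    public static void main(String[] args) {
        System.out.println(run(new PassThroughVisitor(true)));   // <p>hello</p>
        System.out.println(run(new PassThroughVisitor(false)).isEmpty()); // true
        System.out.println(run(new UpperCaseTextVisitor()));     // <p>HELLO</p>
    }
}
```

The design point is that subclasses such as the ones listed above only need to override the callbacks whose output they want to change.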
Field Summary
- protected boolean m_echo
  Indicates whether "echo" mode is on, that is, all content is written to the result by default.
- protected java.util.List<java.lang.String> m_noAutoCloseTags
  List of upper case tag name strings of tags that should not be auto-corrected if closing tags are missing.
- protected java.lang.StringBuffer m_result
  The buffer to write the output to.
- protected static java.lang.String[] TAG_ARRAY
  The array of supported tag names.
- protected static java.util.List<java.lang.String> TAG_LIST
  The list of supported tag names.
-
Constructor Summary
- CmsHtmlParser()
  Creates a new instance of the HTML converter with echo mode set to false.
- CmsHtmlParser(boolean echo)
  Creates a new instance of the HTML converter.
-
Method Summary
- protected java.lang.String collapse(java.lang.String string)
  Collapses HTML whitespace in the given String.
- protected org.htmlparser.PrototypicalNodeFactory configureNoAutoCorrectionTags()
  Internally degrades Composite tags that do have children in the DOM tree to simple single tags.
- java.lang.String getConfiguration()
  Returns the configuration String of this visitor, or the empty String if none was provided before.
- java.util.List<java.lang.String> getNoAutoCloseTags()
  Returns a list of upper case tag names for which parsing / visiting will not correct missing closing tags.
- java.lang.String getResult()
  Returns the text extraction result.
- java.lang.String getTagHtml(org.htmlparser.Tag tag)
  Returns the HTML for the given tag itself (not the tag content).
- java.lang.String process(java.lang.String html, java.lang.String encoding)
  Extracts the text from the given HTML content, assuming the given HTML encoding.
- void setConfiguration(java.lang.String configuration)
  Sets a configuration String for this visitor.
- void setNoAutoCloseTags(java.util.List<java.lang.String> noAutoCloseTagList)
  Sets a list of upper case tag names for which parsing / visiting should not correct missing closing tags.
- void visitEndTag(org.htmlparser.Tag tag)
  Visitor method (callback) invoked when a closing Tag is encountered.
- void visitRemarkNode(org.htmlparser.Remark remark)
  Visitor method (callback) invoked when a remark Tag (HTML comment) is encountered.
- void visitStringNode(org.htmlparser.Text text)
  Visitor method (callback) invoked when a text node is encountered.
- void visitTag(org.htmlparser.Tag tag)
  Visitor method (callback) invoked when a starting Tag is encountered.
-
-
-
Field Detail
-
m_noAutoCloseTags
protected java.util.List<java.lang.String> m_noAutoCloseTags
List of upper case tag name strings of tags that should not be auto-corrected if closing tags are missing.
-
TAG_ARRAY
protected static final java.lang.String[] TAG_ARRAY
The array of supported tag names.
-
TAG_LIST
protected static final java.util.List<java.lang.String> TAG_LIST
The list of supported tag names.
-
m_echo
protected boolean m_echo
Indicates whether "echo" mode is on, that is, all content is written to the result by default.
-
m_result
protected java.lang.StringBuffer m_result
The buffer to write the output to.
-
-
Constructor Detail
-
CmsHtmlParser
public CmsHtmlParser()
Creates a new instance of the HTML converter with echo mode set to false.
-
CmsHtmlParser
public CmsHtmlParser(boolean echo)
Creates a new instance of the HTML converter.
- Parameters:
  echo - indicates whether "echo" mode is on, that is, all content is written to the result
-
-
Method Detail
-
configureNoAutoCorrectionTags
protected org.htmlparser.PrototypicalNodeFactory configureNoAutoCorrectionTags()
Internally degrades Composite tags that do have children in the DOM tree to simple single tags. This makes it possible to avoid auto-correction of unclosed HTML tags.
- Returns: a node factory that will not auto-correct open tags specified via setNoAutoCloseTags(List)
-
getConfiguration
public java.lang.String getConfiguration()
Description copied from interface: I_CmsHtmlNodeVisitor
Returns the configuration String of this visitor, or the empty String if none was provided before.
- Specified by: getConfiguration in interface I_CmsHtmlNodeVisitor
- Returns: the configuration String of this visitor; by contract never null, but an empty String if not provided
- See Also: I_CmsHtmlNodeVisitor.getConfiguration()
-
getResult
public java.lang.String getResult()
Description copied from interface: I_CmsHtmlNodeVisitor
Returns the text extraction result.
- Specified by: getResult in interface I_CmsHtmlNodeVisitor
- Returns: the text extraction result
- See Also: I_CmsHtmlNodeVisitor.getResult()
-
getTagHtml
public java.lang.String getTagHtml(org.htmlparser.Tag tag)
Returns the HTML for the given tag itself (not the tag content).
- Parameters:
  tag - the tag to create the HTML for
- Returns: the HTML for the given tag
-
process
public java.lang.String process(java.lang.String html, java.lang.String encoding) throws org.htmlparser.util.ParserException
Description copied from interface: I_CmsHtmlNodeVisitor
Extracts the text from the given HTML content, assuming the given HTML encoding.
- Specified by: process in interface I_CmsHtmlNodeVisitor
- Parameters:
  html - the content to extract the plain text from
  encoding - the encoding to use
- Returns: the text extracted from the given HTML content
- Throws: org.htmlparser.util.ParserException - if something goes wrong
- See Also: I_CmsHtmlNodeVisitor.process(java.lang.String, java.lang.String)
-
setConfiguration
public void setConfiguration(java.lang.String configuration)
Description copied from interface: I_CmsHtmlNodeVisitor
Sets a configuration String for this visitor. This will most likely be done with data from an XSD, custom JSP tag, ...
- Specified by: setConfiguration in interface I_CmsHtmlNodeVisitor
- Parameters:
  configuration - the configuration of this visitor to set
- See Also: I_CmsHtmlNodeVisitor.setConfiguration(java.lang.String)
-
visitEndTag
public void visitEndTag(org.htmlparser.Tag tag)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a closing Tag is encountered.
- Specified by: visitEndTag in interface I_CmsHtmlNodeVisitor
- Overrides: visitEndTag in class org.htmlparser.visitors.NodeVisitor
- Parameters:
  tag - the tag that is ended
- See Also: I_CmsHtmlNodeVisitor.visitEndTag(org.htmlparser.Tag)
-
visitRemarkNode
public void visitRemarkNode(org.htmlparser.Remark remark)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a remark Tag (HTML comment) is encountered.
- Specified by: visitRemarkNode in interface I_CmsHtmlNodeVisitor
- Overrides: visitRemarkNode in class org.htmlparser.visitors.NodeVisitor
- Parameters:
  remark - the remark Tag to visit
- See Also: I_CmsHtmlNodeVisitor.visitRemarkNode(org.htmlparser.Remark)
-
visitStringNode
public void visitStringNode(org.htmlparser.Text text)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a text node is encountered.
- Specified by: visitStringNode in interface I_CmsHtmlNodeVisitor
- Overrides: visitStringNode in class org.htmlparser.visitors.NodeVisitor
- Parameters:
  text - the text that is visited
- See Also: I_CmsHtmlNodeVisitor.visitStringNode(org.htmlparser.Text)
-
visitTag
public void visitTag(org.htmlparser.Tag tag)
Description copied from interface: I_CmsHtmlNodeVisitor
Visitor method (callback) invoked when a starting Tag is encountered.
- Specified by: visitTag in interface I_CmsHtmlNodeVisitor
- Overrides: visitTag in class org.htmlparser.visitors.NodeVisitor
- Parameters:
  tag - the tag that is visited
- See Also: I_CmsHtmlNodeVisitor.visitTag(org.htmlparser.Tag)
-
collapse
protected java.lang.String collapse(java.lang.String string)
Collapses HTML whitespace in the given String.
- Parameters:
  string - the string to collapse
- Returns: the input String with all HTML whitespace collapsed
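What "collapsing HTML whitespace" usually means can be sketched as follows. This is an illustrative stand-alone version, not the OpenCms implementation: the class name `CollapseSketch` is hypothetical, and the choice of whitespace characters (space, tab, line feed, form feed, carriage return) and of a single space as the replacement are assumptions.

```java
public class CollapseSketch {

    /**
     * Replaces every run of whitespace characters with one space.
     * Assumption: only space, tab, LF, FF and CR count as whitespace here.
     */
    public static String collapse(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean lastWasSpace = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean isSpace = c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
            if (isSpace) {
                // Emit a space only for the first character of a whitespace run.
                if (!lastWasSpace) {
                    out.append(' ');
                }
                lastWasSpace = true;
            } else {
                out.append(c);
                lastWasSpace = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: [Hello, world ! ]
        System.out.println("[" + collapse("Hello,\n\t  world !  ") + "]");
    }
}
```

Leading and trailing runs are reduced to one space rather than trimmed in this sketch; whether the real method trims them is not stated in this page.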
-
getNoAutoCloseTags
public java.util.List<java.lang.String> getNoAutoCloseTags()
Returns a list of upper case tag names for which parsing / visiting will not correct missing closing tags.
- Returns: a List of upper case tag names for which parsing / visiting will not correct missing closing tags
-
setNoAutoCloseTags
public void setNoAutoCloseTags(java.util.List<java.lang.String> noAutoCloseTagList)
Sets a list of upper case tag names for which parsing / visiting should not correct missing closing tags.
- Specified by: setNoAutoCloseTags in interface I_CmsHtmlNodeVisitor
- Parameters:
  noAutoCloseTagList - a list of upper case tag names for which parsing / visiting should not correct missing closing tags
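Because the list is expected to hold upper case tag names, a caller may want to normalize names before passing them in. The sketch below shows one way to do that without depending on OpenCms; `NoAutoCloseTagsSketch`, `toUpperCaseTagNames` and `isNoAutoClose` are hypothetical helpers, and the lookup by upper-cased name is an assumption about how such a list would be consulted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NoAutoCloseTagsSketch {

    /** Trims each tag name and converts it to upper case, independent of the default locale. */
    public static List<String> toUpperCaseTagNames(List<String> tagNames) {
        List<String> result = new ArrayList<>(tagNames.size());
        for (String name : tagNames) {
            result.add(name.trim().toUpperCase(Locale.ROOT));
        }
        return result;
    }

    /** Checks a tag name against an already normalized (upper case) list. */
    public static boolean isNoAutoClose(List<String> noAutoCloseTags, String tagName) {
        return noAutoCloseTags.contains(tagName.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> tags = toUpperCaseTagNames(List.of("div", " li "));
        System.out.println(tags);                       // [DIV, LI]
        System.out.println(isNoAutoClose(tags, "Div")); // true
        System.out.println(isNoAutoClose(tags, "p"));   // false
    }
}
```

The normalized list could then be handed to setNoAutoCloseTags(List) on a parser instance.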
-
-