Class A_CmsTextExtractor
- java.lang.Object
  - org.opencms.search.extractors.A_CmsTextExtractor
- All Implemented Interfaces:
I_CmsTextExtractor
- Direct Known Subclasses:
CmsExtractorHtml, CmsExtractorMsOfficeOLE2, CmsExtractorMsOfficeOOXML, CmsExtractorOpenOffice, CmsExtractorPdf, CmsExtractorRtf
public abstract class A_CmsTextExtractor extends java.lang.Object implements I_CmsTextExtractor
Base utility class that allows extraction of the indexable "plain" text from a given document format.
- Since:
- 6.0.0
-
Constructor Summary
Constructor: A_CmsTextExtractor()
-
Method Summary
Modifier and Type, Method, Description:
- protected void combineContentItem(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map<java.lang.String,java.lang.String> contentItems)
  Combines a meta information item extracted from the document with the main content buffer and also stores the individual information as an item in the Map of content items.
- I_CmsExtractionResult extractText(byte[] content)
  Extracts the text and meta information from the given binary document.
- I_CmsExtractionResult extractText(byte[] content, java.lang.String encoding)
  Extracts the text and meta information from the given binary document, using the specified content encoding.
- I_CmsExtractionResult extractText(java.io.InputStream in)
  Extracts the text and meta information from the document on the input stream.
- I_CmsExtractionResult extractText(java.io.InputStream in, java.lang.String encoding)
  Extracts the text and meta information from the document on the input stream, using the specified content encoding.
- protected CmsExtractionResult extractText(java.io.InputStream in, org.apache.tika.parser.Parser parser)
  Parses the given input stream with the provided parser and returns the result as a map of content items.
- protected java.lang.String removeControlChars(java.lang.String content)
  Removes "unwanted" control chars from the given content.
-
-
-
Constructor Detail
-
A_CmsTextExtractor
public A_CmsTextExtractor()
-
-
Method Detail
-
extractText
public I_CmsExtractionResult extractText(byte[] content) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the given binary document.
The encoding of the binary content is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided binary array automatically.
Delivers the same result as calling I_CmsTextExtractor.extractText(byte[], String) when String == null.
- Specified by:
extractText in interface I_CmsTextExtractor
- Parameters:
content - the binary content of the document to extract the text from
- Returns:
- the extracted text
- Throws:
java.lang.Exception - if the text extraction fails
- See Also:
I_CmsTextExtractor.extractText(byte[])
-
extractText
public I_CmsExtractionResult extractText(byte[] content, java.lang.String encoding) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the given binary document, using the specified content encoding.
The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.
- Specified by:
extractText in interface I_CmsTextExtractor
- Parameters:
content - the binary content of the document to extract the text from
encoding - the encoding to use
- Returns:
- the extracted text
- Throws:
java.lang.Exception - if the text extraction fails
- See Also:
I_CmsTextExtractor.extractText(byte[], java.lang.String)
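
The "encoding as a hint" contract described above can be illustrated with a minimal, self-contained sketch. This is not the OpenCms implementation: the class name `EncodingHintSketch`, the method `resolve`, the byte order mark checks, and the UTF-8 fallback are all assumptions made for illustration. Real extractors may detect the encoding in a completely different way, or not need it at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of OpenCms: resolves a charset from an optional hint. */
class EncodingHintSketch {

    /**
     * Returns the charset named by the hint, or, when the hint is null,
     * guesses from a byte order mark and falls back to UTF-8 (an assumed default).
     */
    static Charset resolve(byte[] content, String encodingHint) {
        if (encodingHint != null) {
            // The caller supplied a hint, so use it as given.
            return Charset.forName(encodingHint);
        }
        // UTF-8 byte order mark: EF BB BF
        if (content.length >= 3
            && (content[0] & 0xFF) == 0xEF
            && (content[1] & 0xFF) == 0xBB
            && (content[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // UTF-16 big endian byte order mark: FE FF
        if (content.length >= 2 && (content[0] & 0xFF) == 0xFE && (content[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // UTF-16 little endian byte order mark: FF FE
        if (content.length >= 2 && (content[0] & 0xFF) == 0xFF && (content[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // No hint and no byte order mark: assumed default for this sketch only.
        return StandardCharsets.UTF_8;
    }
}
```

With this sketch, passing `null` as the hint leaves the decision to the helper, while a non-null hint such as `"ISO-8859-1"` is used unchanged.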
-
extractText
public I_CmsExtractionResult extractText(java.io.InputStream in) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream.
The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided input stream automatically.
Delivers the same result as calling I_CmsTextExtractor.extractText(InputStream, String) when String == null.
- Specified by:
extractText in interface I_CmsTextExtractor
- Parameters:
in - the input stream for the document to extract the text from
- Returns:
- the extracted text and meta information
- Throws:
java.lang.Exception - if the text extraction fails
- See Also:
I_CmsTextExtractor.extractText(java.io.InputStream)
-
extractText
public I_CmsExtractionResult extractText(java.io.InputStream in, java.lang.String encoding) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream, using the specified content encoding.
The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.
- Specified by:
extractText in interface I_CmsTextExtractor
- Parameters:
in - the input stream for the document to extract the text from
encoding - the encoding to use
- Returns:
- the extracted text and meta information
- Throws:
java.lang.Exception - if the text extraction fails
- See Also:
I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)
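
The relationship between the four public overloads (the one-argument forms deliver the same result as the two-argument forms with a null encoding) can be sketched as a simple delegation chain. The class below is a hypothetical stand-in, not the OpenCms source: it returns a plain `String` instead of an `I_CmsExtractionResult`, it assumes the byte array variants wrap the content in a `ByteArrayInputStream`, and it assumes UTF-8 when no encoding is given. It requires Java 9 or later for `InputStream.readAllBytes()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in that mirrors the overload structure of the extractor. */
class OverloadDelegationSketch {

    /** Same result as the two-argument form with a null encoding. */
    String extractText(byte[] content) throws IOException {
        return extractText(content, null);
    }

    /** Assumption: binary content is wrapped in a stream and handled there. */
    String extractText(byte[] content, String encoding) throws IOException {
        return extractText(new ByteArrayInputStream(content), encoding);
    }

    /** Same result as the two-argument form with a null encoding. */
    String extractText(InputStream in) throws IOException {
        return extractText(in, null);
    }

    /** The single place where the work happens; here it only decodes the bytes. */
    String extractText(InputStream in, String encoding) throws IOException {
        Charset charset = (encoding == null) ? StandardCharsets.UTF_8 : Charset.forName(encoding);
        return new String(in.readAllBytes(), charset);
    }
}
```

Funnelling every overload into one method keeps format-specific logic in a single place, so a subclass in this style would only need to override the stream variant with an encoding.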
-
combineContentItem
protected void combineContentItem(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map<java.lang.String,java.lang.String> contentItems)
Combines a meta information item extracted from the document with the main content buffer and also stores the individual information as an item in the Map of content items.
- Parameters:
itemValue - the value of the item to store
itemKey - the key in the Map of content items
content - the buffer to which the content item is appended
contentItems - the Map of individual content items
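
A minimal sketch of what "combining" a meta information item might look like, given only the signature and description above. The behavior shown here is an assumption, not the documented OpenCms implementation: empty or null values are skipped, the value is stored under its key, and it is appended to the buffer on a new line. The class name `CombineContentItemSketch` is hypothetical, and the method is static only to keep the example self-contained.

```java
import java.util.Map;

/** Hypothetical illustration of the combineContentItem contract. */
class CombineContentItemSketch {

    static void combineContentItem(
        String itemValue,
        String itemKey,
        StringBuffer content,
        Map<String, String> contentItems) {

        // Assumption: null or blank values carry no indexable information.
        if (itemValue == null || itemValue.trim().isEmpty()) {
            return;
        }
        // Store the individual item so it can be looked up by key later.
        contentItems.put(itemKey, itemValue);
        // Append the value to the main content; the newline separator is an assumption.
        content.append('\n').append(itemValue);
    }
}
```

Called with a buffer holding `body`, the value `Title`, and the key `title`, this sketch leaves the buffer with `Title` appended on a new line and the map with one entry for `title`.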
-
extractText
protected CmsExtractionResult extractText(java.io.InputStream in, org.apache.tika.parser.Parser parser) throws java.lang.Exception
Parses the given input stream with the provided parser and returns the result as a map of content items.
- Parameters:
in - the input stream for the content to parse
parser - the parser to use
- Returns:
- the result of the parsing as a map of content items
- Throws:
java.lang.Exception - in case something goes wrong
-
removeControlChars
protected java.lang.String removeControlChars(java.lang.String content)
Removes "unwanted" control chars from the given content.
- Parameters:
content - the content to remove the unwanted control chars from
- Returns:
- the content with the unwanted control chars removed
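
The documentation does not say which control characters count as "unwanted". The following sketch assumes one plausible reading: every ISO control character is dropped except tab, line feed, and carriage return. Both this rule and the class name `RemoveControlCharsSketch` are assumptions for illustration; the real method may keep, replace, or collapse characters differently.

```java
/** Hypothetical illustration of stripping control characters from extracted text. */
class RemoveControlCharsSketch {

    static String removeControlChars(String content) {
        if (content == null) {
            return null;
        }
        StringBuilder result = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char ch = content.charAt(i);
            // Assumption: common whitespace controls are wanted, all other controls are not.
            boolean wanted = !Character.isISOControl(ch) || ch == '\n' || ch == '\r' || ch == '\t';
            if (wanted) {
                result.append(ch);
            }
        }
        return result.toString();
    }
}
```

Cleaning the text this way before indexing avoids storing characters such as NUL or BEL that binary document formats sometimes leave in extracted strings.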
-
-