java.lang.Object
- org.opencms.search.extractors.A_CmsTextExtractor

All Implemented Interfaces:

I_CmsTextExtractor

Direct Known Subclasses:

CmsExtractorHtml, CmsExtractorMsOfficeOLE2, CmsExtractorMsOfficeOOXML, CmsExtractorOpenOffice, CmsExtractorPdf, CmsExtractorRtf
```
public abstract class A_CmsTextExtractor
extends java.lang.Object
implements I_CmsTextExtractor
```
Base utility class that allows extraction of the indexable "plain" text from a given document format.

Since:

6.0.0

Constructor Summary

Constructors
Constructor Description

A_CmsTextExtractor()

Method Summary

Modifier and Type	Method	Description
`protected void`	`combineContentItem(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map<java.lang.String,java.lang.String> contentItems)`	Combines a meta information item extracted from the document with the main content buffer and also stores the individual information as item in the Map of content items.
`I_CmsExtractionResult`	`extractText(byte[] content)`	Extracts the text and meta information from the given binary document.
`I_CmsExtractionResult`	`extractText(byte[] content, java.lang.String encoding)`	Extracts the text and meta information from the given binary document, using the specified content encoding.
`I_CmsExtractionResult`	`extractText(java.io.InputStream in)`	Extracts the text and meta information from the document on the input stream.
`I_CmsExtractionResult`	`extractText(java.io.InputStream in, java.lang.String encoding)`	Extracts the text and meta information from the document on the input stream, using the specified content encoding.
`protected CmsExtractionResult`	`extractText(java.io.InputStream in, org.apache.tika.parser.Parser parser)`	Parses the given input stream with the provided parser and returns the result as a map of content items.
`protected java.lang.String`	`removeControlChars(java.lang.String content)`	Removes "unwanted" control chars from the given content.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - A_CmsTextExtractor
```
public A_CmsTextExtractor()
```
- Method Detail
  - extractText
```
public I_CmsExtractionResult extractText(byte[] content)
                                  throws java.lang.Exception
```
    Description copied from interface: I_CmsTextExtractor
    
    Extracts the text and meta information from the given binary document.
    The encoding of the document is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided byte array automatically.
    Delivers the same result as calling I_CmsTextExtractor.extractText(byte[], String) with a null encoding.
    
    Specified by:
    
    extractText in interface I_CmsTextExtractor
    
    Parameters:
    
    content - the binary content of the document to extract the text from
    
    Returns:
    
    the extracted text and meta information
    
    Throws:
    
    java.lang.Exception - if the text extraction fails
    
    See Also:
    
    I_CmsTextExtractor.extractText(byte[])
  - extractText
```
public I_CmsExtractionResult extractText(byte[] content,
                                         java.lang.String encoding)
                                  throws java.lang.Exception
```
    Description copied from interface: I_CmsTextExtractor
    
    Extracts the text and meta information from the given binary document, using the specified content encoding.
    The encoding is a hint for the text extractor; if the value given is null, the text extractor should try to determine the encoding itself.
    
    Specified by:
    
    extractText in interface I_CmsTextExtractor
    
    Parameters:
    
    content - the binary content of the document to extract the text from
    
    encoding - the encoding to use
    
    Returns:
    
    the extracted text and meta information
    
    Throws:
    
    java.lang.Exception - if the text extraction fails
    
    See Also:
    
    I_CmsTextExtractor.extractText(byte[], java.lang.String)
  - extractText
```
public I_CmsExtractionResult extractText(java.io.InputStream in)
                                  throws java.lang.Exception
```
    Description copied from interface: I_CmsTextExtractor
    
    Extracts the text and meta information from the document on the input stream.
    The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided input stream automatically.
    Delivers the same result as calling I_CmsTextExtractor.extractText(InputStream, String) with a null encoding.
    
    Specified by:
    
    extractText in interface I_CmsTextExtractor
    
    Parameters:
    
    in - the input stream for the document to extract the text from
    
    Returns:
    
    the extracted text and meta information
    
    Throws:
    
    java.lang.Exception - if the text extraction fails
    
    See Also:
    
    I_CmsTextExtractor.extractText(java.io.InputStream)
  - extractText
```
public I_CmsExtractionResult extractText(java.io.InputStream in,
                                         java.lang.String encoding)
                                  throws java.lang.Exception
```
    Description copied from interface: I_CmsTextExtractor
    
    Extracts the text and meta information from the document on the input stream, using the specified content encoding.
    The encoding is a hint for the text extractor; if the value given is null, the text extractor should try to determine the encoding itself.
    
    Specified by:
    
    extractText in interface I_CmsTextExtractor
    
    Parameters:
    
    in - the input stream for the document to extract the text from
    
    encoding - the encoding to use
    
    Returns:
    
    the extracted text and meta information
    
    Throws:
    
    java.lang.Exception - if the text extraction fails
    
    See Also:
    
    I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)
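    
    The relationship between the four public overloads (no encoding means a null encoding hint) can be sketched as a simple delegation chain. This is a minimal, standalone illustration and not the OpenCms implementation: the class OverloadSketchExtractor is hypothetical, it returns a plain String instead of an I_CmsExtractionResult, it throws unchecked exceptions for brevity, and the UTF-8 fallback for a null hint is an assumption.
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an extractor; NOT the OpenCms class.
class OverloadSketchExtractor {

    // No encoding given: same result as passing null as the encoding hint.
    String extractText(byte[] content) {
        return extractText(content, null);
    }

    // Wraps the byte array in a stream and forwards the encoding hint.
    String extractText(byte[] content, String encoding) {
        return extractText(new ByteArrayInputStream(content), encoding);
    }

    // No encoding given: same result as passing null as the encoding hint.
    String extractText(InputStream in) {
        return extractText(in, null);
    }

    // Assumption: fall back to UTF-8 when the hint is null. A real
    // extractor would try to detect the encoding from the document.
    String extractText(InputStream in, String encoding) {
        Charset charset = (encoding == null)
            ? StandardCharsets.UTF_8
            : Charset.forName(encoding);
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        OverloadSketchExtractor extractor = new OverloadSketchExtractor();
        byte[] data = "plain text".getBytes(StandardCharsets.UTF_8);
        // Both calls go through the same code path in this sketch.
        System.out.println(extractor.extractText(data));
        System.out.println(extractor.extractText(data, null));
    }
}
```
    In a subclass of A_CmsTextExtractor, a literal call such as extractText(in, null) would likely be ambiguous between the (InputStream, String) and the protected (InputStream, Parser) overloads, so a cast like (String) null may be needed there.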
  - combineContentItem
```
protected void combineContentItem(java.lang.String itemValue,
                                  java.lang.String itemKey,
                                  java.lang.StringBuffer content,
                                  java.util.Map<java.lang.String,java.lang.String> contentItems)
```
    Combines a meta information item extracted from the document with the main content buffer and also stores the individual information as item in the Map of content items.
    
    Parameters:
    
    itemValue - the value of the item to store
    
    itemKey - the key in the Map of content items
    
    content - a buffer where to append the content item
    
    contentItems - the Map of individual content items
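    
    The described behavior (append the item to the main buffer and also store it under its key) might look roughly like the following. This is a hedged sketch under stated assumptions, not the actual OpenCms code: the class CombineSketch is hypothetical, and both the newline separator and the skipping of empty values are assumptions.
```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; NOT the OpenCms implementation.
class CombineSketch {

    static void combineContentItem(
        String itemValue,
        String itemKey,
        StringBuffer content,
        Map<String, String> contentItems) {

        // Assumption: empty or missing values are ignored.
        if (itemValue == null || itemValue.trim().isEmpty()) {
            return;
        }
        // Assumption: items in the main buffer are separated by a newline.
        if (content.length() > 0) {
            content.append('\n');
        }
        content.append(itemValue);
        // Keep the individual item accessible under its own key.
        contentItems.put(itemKey, itemValue);
    }

    public static void main(String[] args) {
        StringBuffer content = new StringBuffer();
        Map<String, String> items = new LinkedHashMap<>();
        combineContentItem("Quarterly report", "title", content, items);
        combineContentItem("Jane Doe", "author", content, items);
        System.out.println(content);
        System.out.println(items.keySet());
    }
}
```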
  - extractText
```
protected CmsExtractionResult extractText(java.io.InputStream in,
                                          org.apache.tika.parser.Parser parser)
                                   throws java.lang.Exception
```
    Parses the given input stream with the provided parser and returns the result as a map of content items.
    
    Parameters:
    
    in - the input stream for the content to parse
    
    parser - the parser to use
    
    Returns:
    
    the result of the parsing as a map of content items
    
    Throws:
    
    java.lang.Exception - if parsing the input stream fails
  - removeControlChars
```
protected java.lang.String removeControlChars(java.lang.String content)
```
    Removes "unwanted" control chars from the given content.
    
    Parameters:
    
    content - the content to remove the unwanted control chars from
    
    Returns:
    
    the content with the unwanted control chars removed
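    
    Which characters count as "unwanted" is not specified here. As a minimal sketch, assuming that ISO control characters other than tab, line feed and carriage return are unwanted and are replaced by a space, the cleanup could be written as below. The class ControlCharSketch is hypothetical and the actual OpenCms rules may differ.
```java
// Hypothetical illustration; NOT the OpenCms implementation.
class ControlCharSketch {

    static String removeControlChars(String content) {
        if (content == null || content.isEmpty()) {
            return content;
        }
        StringBuilder result = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            // Assumption: common whitespace controls are kept as-is.
            boolean keep = c == '\n' || c == '\r' || c == '\t'
                || !Character.isISOControl(c);
            // Assumption: unwanted characters become a space so that
            // neighboring words do not run together in the index.
            result.append(keep ? c : ' ');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String raw = "page one\u000Cpage two\u0000end";
        System.out.println(removeControlChars(raw));
    }
}
```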
