Design Patterns – Parsing Documents from Different Sources

Context:

I'm working on a document classifier that assigns categories to a document based on the keywords it contains. The keywords and their corresponding categories are provided as configuration. The document to be classified can be in any format, so it first has to be converted to plain text.
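To make the keyword configuration concrete, here is a minimal sketch in Python. The category names, the keywords, and the function name `classify_text` are hypothetical placeholders, not the real configuration, and matching whole lowercase words is an assumption about how keywords are compared.

```python
import re

# Hypothetical configuration: each category maps to the keywords that trigger it.
CATEGORY_KEYWORDS = {
    "finance": ["invoice", "payment"],
    "legal": ["contract", "clause"],
}


def classify_text(text, category_keywords):
    """Return the sorted categories that have at least one keyword in the text."""
    # Split the text into lowercase words so that matching is case-insensitive.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(
        category
        for category, keywords in category_keywords.items()
        if any(keyword.lower() in words for keyword in keywords)
    )


categories = classify_text("The invoice is attached to the contract.", CATEGORY_KEYWORDS)
```

This only works on plain text, which is why every input format has to be converted first.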

The classified document must then be presented as a report (DOCX, PDF, HTML, etc.).

Problem:

Since the documents come in different formats, I decided to convert them into a common text representation, modelled by the following classes:

Document
- PDFDocument
- HTMLDocument
- etc.

Paragraph

A Document takes its file path and the classifier, and classifies itself by adding categories to its Paragraph objects, mutating them in place.

The parsing logic for each document type is very complex and differs from type to type. Currently, each document subclass is responsible for its own parsing. Reading the file from its path is also coupled to the document subclasses.
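The current design can be sketched roughly as follows, in Python. This is a simplified illustration, not the actual code: `KeywordClassifier`, the method names, and the naive regex-based HTML handling are all assumptions made to keep the example short.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Paragraph:
    text: str
    categories: list = field(default_factory=list)


class KeywordClassifier:
    """Hypothetical classifier: maps each category to its keywords."""

    def __init__(self, category_keywords):
        self.category_keywords = category_keywords

    def categories_for(self, text):
        lowered = text.lower()
        # Simple substring matching, for illustration only.
        return [c for c, kws in self.category_keywords.items()
                if any(k in lowered for k in kws)]


class Document(ABC):
    def __init__(self, path, classifier):
        self.path = Path(path)        # the document knows where its file lives
        self.classifier = classifier  # ... and how it will be classified
        self.paragraphs = []

    @abstractmethod
    def parse(self):
        """Each subclass reads its own file and fills self.paragraphs."""

    def classify(self):
        self.parse()
        for paragraph in self.paragraphs:
            # Categories are added by mutating the Paragraph objects in place.
            paragraph.categories.extend(self.classifier.categories_for(paragraph.text))


class HTMLDocument(Document):
    def parse(self):
        html = self.path.read_text(encoding="utf-8")  # file access inside the subclass
        blocks = re.findall(r"<p>(.*?)</p>", html, flags=re.S)  # naive, illustrative only
        self.paragraphs = [Paragraph(re.sub(r"<[^>]+>", "", b).strip()) for b in blocks]
```

In this sketch a single subclass handles file I/O, format-specific parsing, and triggering classification, which is the coupling described above.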

I think that the concerns are not adequately separated, and that the design could be made more intuitive and more robust.
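As a rough illustration of what a cleaner separation might look like, the sketch below keeps parsing, format selection, and classification apart. It is one possible direction under stated assumptions, not a settled design: the names `Parser`, `PARSERS`, `parse_plain_text`, `parse_html`, and `classify` are hypothetical, and the parsers are deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Paragraph:
    text: str
    categories: Tuple[str, ...] = ()


# A parser turns already-loaded content into paragraphs.
# It knows nothing about file paths or classifiers.
Parser = Callable[[str], List[Paragraph]]


def parse_plain_text(raw: str) -> List[Paragraph]:
    # Paragraphs are separated by blank lines.
    return [Paragraph(b.strip()) for b in re.split(r"\n\s*\n", raw) if b.strip()]


def parse_html(raw: str) -> List[Paragraph]:
    # Naive <p> extraction, for illustration only.
    blocks = re.findall(r"<p>(.*?)</p>", raw, flags=re.S)
    return [Paragraph(re.sub(r"<[^>]+>", "", b).strip()) for b in blocks]


# Hypothetical registry: supporting a new format means adding one entry.
PARSERS: Dict[str, Parser] = {".txt": parse_plain_text, ".html": parse_html}


def classify(paragraphs, category_keywords):
    """Return new Paragraph objects with categories; the inputs are not mutated."""
    result = []
    for paragraph in paragraphs:
        lowered = paragraph.text.lower()
        # Simple substring matching, for illustration only.
        found = tuple(c for c, kws in category_keywords.items()
                      if any(k in lowered for k in kws))
        result.append(Paragraph(paragraph.text, found))
    return result
```

Reading the file and writing the report would then live in their own components, so a parser could be tested with an in-memory string and no file system access.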