I'm working on a document classifier that assigns categories to a document based on keywords. The keywords and their categories are supplied as configuration. The input document can be in any format, so it first has to be converted to plain text.
The classified document must then be presented as a report (DOCX, PDF, HTML, etc.).
Since the documents come in different formats, I decided to convert them into a common text representation, modeled by the following classes:

Document
- PDFDocument
- HTMLDocument
- etc.

Paragraph

Document takes the file path and the classifier, and classifies the document by adding categories to its Paragraph objects, mutating them in place.
The parsing logic for each document type is complex and differs from type to type. Currently, each document type is responsible for its own parsing. Reading the file from its path is also coupled to the Document subclasses.
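To make the coupling concrete, here is a minimal Python sketch of the design as described, not my actual code. The names `KeywordClassifier`, `categories_for`, `parse` and `classify` are made up for illustration, and `PlainTextDocument` is a simplified stand-in for `PDFDocument` / `HTMLDocument` so that the example runs without third-party libraries.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    text: str
    categories: set[str] = field(default_factory=set)


class KeywordClassifier:
    """Hypothetical classifier: maps each category to its keywords."""

    def __init__(self, keywords_by_category: dict[str, list[str]]):
        self.keywords_by_category = keywords_by_category

    def categories_for(self, text: str) -> set[str]:
        # Case-insensitive substring match; an assumption, not a requirement.
        lowered = text.lower()
        return {
            category
            for category, keywords in self.keywords_by_category.items()
            if any(keyword.lower() in lowered for keyword in keywords)
        }


class Document(ABC):
    """Holds the file path AND the classifier: three concerns in one class."""

    def __init__(self, path: str, classifier: KeywordClassifier):
        self.path = path
        self.classifier = classifier
        self.paragraphs: list[Paragraph] = []

    @abstractmethod
    def parse(self) -> list[Paragraph]:
        """Each subclass opens self.path and parses its own format."""

    def classify(self) -> None:
        self.paragraphs = self.parse()
        for paragraph in self.paragraphs:
            # The Paragraph objects are mutated in place.
            paragraph.categories |= self.classifier.categories_for(paragraph.text)


class PlainTextDocument(Document):
    """Stand-in for PDFDocument / HTMLDocument; file I/O lives in the subclass."""

    def parse(self) -> list[Paragraph]:
        with open(self.path, encoding="utf-8") as handle:
            content = handle.read()
        blocks = (block.strip() for block in content.split("\n\n"))
        return [Paragraph(block) for block in blocks if block]
```

In this sketch a single `Document` object knows where the file lives, how to parse it, and how to classify it, which is the part that bothers me.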
I think the design lacks a proper separation of concerns and could be made more intuitive and robust.