Check the Firefox source code:
* grabArticle - Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
You can find lots of comments like these:
Remove DIV, SECTION, and HEADER nodes without any content (e.g. text, image, video, or iframe).
Loop through all paragraphs, and assign a score to them based on how content-y they look.
Add a point for the paragraph itself as a base.
Add points for any commas within this paragraph.
For every 100 characters in this paragraph, add another point. Up to 3 points.
After we’ve calculated scores, loop through all of the possible
candidate nodes we found and find the one with the highest score.
Scale the final candidate's score based on link density. Good content
should have a relatively small link density (5% or less) and be mostly
unaffected by this operation.
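
The scoring steps described in those comments can be sketched roughly as follows. This is only an illustration in Python, not the Firefox code itself: `Candidate`, `paragraph_score`, `link_density`, `final_score` and `top_candidate` are hypothetical names, link density is assumed to be the share of a node's text that sits inside links, and scaling is assumed to mean multiplying by `(1 - link density)`. It also scores each candidate's own text directly, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a DOM node: all of its text, and the text inside its links."""
    text: str
    link_text: str = ""


def paragraph_score(text: str) -> int:
    score = 1                          # a point for the paragraph itself as a base
    score += text.count(",")           # points for any commas within the paragraph
    score += min(len(text) // 100, 3)  # one point per 100 characters, up to 3 points
    return score


def link_density(candidate: Candidate) -> float:
    # Assumption: fraction of the node's characters that belong to link text.
    if not candidate.text:
        return 0.0
    return len(candidate.link_text) / len(candidate.text)


def final_score(candidate: Candidate) -> float:
    # Assumption: "scale based on link density" means multiply by (1 - density),
    # so content with a small link density is mostly unaffected.
    return paragraph_score(candidate.text) * (1 - link_density(candidate))


def top_candidate(candidates):
    # Loop through all candidates and keep the one with the highest final score.
    return max(candidates, key=final_score)


article = Candidate("First, the text. Then, more text, and a little more. " * 5)
nav = Candidate("Home, About, Contact, Blog", link_text="Home, About, Contact, Blog")
best = top_candidate([nav, article])
```

Here `nav` consists entirely of link text, so its score is scaled down to zero and `article` is picked as the top candidate.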
A good summary is here: https://stackoverflow.com/a/40747529/173149