html – How do search engines treat spaces that are not U+20?

A problem with writing HTML pages in agglutinative languages is that search engines generally parse them poorly. Often, a search for a correctly written compound word triggers a suggestion to rewrite it in the incorrect, split form.

Example of Google replacing correctly concatenated word form with incorrect word form

In the above example, Google suggests that splitting the word ‘solcellepanelforskning’ (‘research on solar panels’) would give the correct spelling, which of course it would not. For those writing web pages, this matters: you need to write your headings in a way that yields the most hits for your page. I am aware of only three options:

  1. Train search engines.
  2. Incorrectly space the words, but use letter-spacing to make it look right.
  3. Incorrectly space the words, but with either U+200B (zero width space) or U+FEFF (zero width no-break space).

1: Train search engines

It may not be a meme, but training Google to prefer giraffes is documented. Ads are one thing, but the number of hits a page gets is what draws more attention from Google and other search engines. This is, in other words, doable, but it requires labour, and lots of it.

2: CSS-trick letter-spacing

The trick is simple enough to execute (a span with a class, plus a CSS rule for that class performing the desired spacing), but it has two issues: you get messy HTML, for one, and, even worse, oral readers (screen readers) will incorrectly insert a pause between the words where there should be none. An example of this in English would be the difference between ‘every day’ and ‘everyday’.
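For illustration, a minimal sketch of what option 2 could look like. The class name `joiner`, the split point in the word, and the `-0.25em` value are all assumptions made for this example; the width of a space depends on the font, so the value would need tuning per typeface.

```html
<!-- Hypothetical sketch of option 2: a real U+0020 sits in the markup,
     and CSS pulls the following letters back over it. -->
<style>
  /* Assumed value: roughly the width of a space in many fonts. */
  .joiner { letter-spacing: -0.25em; }
</style>

<h1>solcellepanel<span class="joiner"> </span>forskning</h1>
```

The markup still contains an ordinary space between the two parts, which is the point of the trick. It is also why a speech renderer may read the heading as two words.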

3: Zero width spaces

These have the advantage of removing the need for a special class defined in CSS, yielding somewhat (though admittedly not much) less messy HTML. However, considering that these are not standard, they could perhaps cause rendering problems on some devices. Further, you still get the issue (I would assume) with oral readers, as stated in 2 supra.
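A minimal sketch of option 3, for comparison. The characters are written as numeric character references so that they stay visible in the source; the split point in the word is again an assumption made for this example.

```html
<!-- U+200B ZERO WIDTH SPACE: invisible, allows a line break at this point -->
<h1>solcellepanel&#x200B;forskning</h1>

<!-- U+FEFF ZERO WIDTH NO-BREAK SPACE: invisible, no line break at this point -->
<h1>solcellepanel&#xFEFF;forskning</h1>
```

Both variants render as one unbroken word. Whether a search engine or a speech renderer treats either character as a word boundary is exactly what is being asked here.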

How do search engines treat non-U+20 spaces? And for bonus points, which of options two and three would be the better choice if one desires compatibility across renderers, both visual and oral?