How to support text editing when the text has to be synchronized with fixed audio?

In my application, we record audio of people speaking, send this to a speech-to-text (STT) service, and present the text to the user for editing.

Simplifying a bit, the STT service returns results in the form of a long list of words with timings:

      "words": (
        {
          "value": "Today",
          "from": 0.34,
          "to": 0.75,
          "confidence": 0.865
        },
        {
          "value": "is",
          "from": 0.76,
          "to": 0.91,
          "confidence": 0.923
        },
        {
          "value": "Friday",
          "from": 0.92,
          "to": 1.36,
          "confidence": 0.783
        },

        ...
     )

The `from` and `to` timings are offsets in seconds from the start of the recording, so in this example the word “Today” starts at t=0.34s and ends at t=0.75s, and so on. (The `confidence` value indicates how sure the STT engine is about the word; I use it elsewhere in the app.)
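
In code, each entry in that list looks roughly like this (a TypeScript sketch; the field names just mirror the STT payload):

    // One recognized word with its timing, as returned by the STT service.
    interface TimedWord {
      value: string;      // the recognized word
      from: number;       // start offset, in seconds from the start of the recording
      to: number;         // end offset, in seconds
      confidence: number; // the STT engine's confidence in the word
    }

    // The transcript is just the ordered list of timed words.
    type Transcript = TimedWord[];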

The timings matter because the UI tracks the current position in the audio and marks it in the text. When you play the audio, the app moves the marker to keep the text location in sync; conversely, if you place the cursor anywhere in the text and hit play, it knows where in the audio to start.
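
Concretely, the sync works in both directions. A simplified sketch of the time-to-word lookup (`wordAtTime` is an illustrative name, not my real code; it assumes the list is sorted by `from`):

    // Find the word being spoken at time t: binary search for the last
    // word whose `from` is at or before t.
    function wordAtTime(words: TimedWord[], t: number): number {
      let lo = 0;
      let hi = words.length - 1;
      let best = 0;
      while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        if (words[mid].from <= t) {
          best = mid;   // candidate: starts at or before t
          lo = mid + 1; // look for a later one
        } else {
          hi = mid - 1;
        }
      }
      return best;
    }

    // The reverse direction is trivial: to play from word i,
    // seek the audio player to words[i].from.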

So far so good.

The challenge I’ve got is handling user edits, because the text then falls out of sync with the timings. If you, say, delete the space between “Today” and “is”, you now have one word, not two. What should its “time” be?

I handle this particular case by merging the two time ranges, so the joined word runs from the first word’s `from` to the second word’s `to` (sketch below). But what should happen if you select from the middle of one word to the middle of a word in another paragraph and then paste a block of text? I can maintain the list of words, but what should happen to the timings?
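
For that simple join case, the merge looks roughly like this (a sketch; what to do with the two confidence values is itself unclear, so taking the minimum is just a placeholder):

    // Join two adjacent words into one, taking the union of their time ranges.
    function mergeWords(a: TimedWord, b: TimedWord): TimedWord {
      return {
        value: a.value + b.value,                         // e.g. "Today" + "is" -> "Todayis"
        from: a.from,                                     // start of the first word
        to: b.to,                                         // end of the second word
        confidence: Math.min(a.confidence, b.confidence), // placeholder choice
      };
    }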

Is there a better way to organize my data structures so that the text can be edited while staying in sync with the audio?