performance – How do I optimize bubble sort in 8086 assembly?

I tried to implement bubble sort in 8086 assembly:

datasg      SEGMENT BYTE 'data'
array       DB 1, 3, 2, 5, 4
n           DW 5
datasg      ENDS
stacksg     SEGMENT BYTE STACK 'stack'
            DW 12 DUP(?)
stacksg     ENDS
codesg      SEGMENT PARA 'code'
            ASSUME CS:codesg, DS:datasg, SS:stacksg

MAIN        PROC FAR
; Push DS (the PSP segment on entry) and 0 so that a far return from MAIN exits to DOS.
            PUSH DS
            XOR AX, AX
            PUSH AX
            MOV AX, datasg
            MOV DS, AX
;SI = i
            XOR SI, SI
            MOV CX, n
            DEC CX
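outer_loop: PUSH CX             ; save the outer loop counter, the inner loop reuses CX
            XOR DI, DI          ; DI = j
            MOV CX, n
            DEC CX
            SUB CX, SI          ; inner count = n - 1 - i
inner_loop: MOV AH, array[DI]
            CMP AH, array[DI+1]
            JLE if_end          ; signed compare: already in order, skip the swap
            XCHG AH, array[DI+1]
            MOV array[DI], AH
if_end:     INC DI
            LOOP inner_loop
            POP CX              ; restore the outer loop counter
            INC SI
            LOOP outer_loop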
            XOR SI, SI
; Throwaway code that loads the array elements into AL one by one so they can be inspected in a debugger.
            MOV AL, array[SI]
            INC SI
            MOV AL, array[SI]
            INC SI
            MOV AL, array[SI]
            INC SI
            MOV AL, array[SI]
            INC SI
            MOV AL, array[SI]
            RETF                ; far return to the address pushed at the start
MAIN        ENDP
codesg      ENDS
            END MAIN

It seems to work for the example array in the code above. I also tried it with different arrays, and it seemed to work for all of them.
I would like to know whether there is a way to improve it, for example by changing the jump instructions to reduce the code size, or by using AX with XCHG, since that is supposed to be faster.

I also don't fully understand the idea of pushing CX onto the stack in order to nest loops. Any suggestions about that would be much appreciated.
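
To make that last question concrete, this is the pattern I mean, reduced to a minimal sketch. It is a fragment meant to sit inside a code segment like the one above; the labels outer_l and inner_l and the counts 4 and 3 are placeholders, not taken from my program. As far as I can tell, LOOP always decrements CX, so the inner loop leaves CX at 0 and the outer count has to be saved and restored around it:

```asm
; Nested LOOP pattern without the sorting logic.
; outer_l, inner_l and the counts 4 and 3 are hypothetical placeholders.
            MOV CX, 4           ; outer loop count
outer_l:    PUSH CX             ; save the outer count, LOOP only ever uses CX
            MOV CX, 3           ; inner loop count
inner_l:    NOP                 ; the inner loop body would go here
            LOOP inner_l        ; CX = CX - 1, jump back while CX <> 0
            POP CX              ; CX is 0 here, so restore the outer count
            LOOP outer_l        ; CX = CX - 1, jump back while CX <> 0
```

With these counts the inner body runs 12 times in total.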