stm32f1x: use async algorithm in flash programming routine

Let the target algorithm run in the background and feed it data
continuously through a FIFO. This reduces or removes the effect of
latency, because only a very small number of queue executions need to be
done per buffer fill. Previously, the many repeated target state changes,
register accesses (which are really inefficient) and algorithm uploads
caused flash programming to be latency-bound in many cases. Now it should
scale better with increased throughput.

Signed-off-by: Andreas Fritiofson <andreas.fritiofson@gmail.com>