On a 2020 Intel MacBook Pro, CPU-only, the original one[1], which uses PyTorch, utilized only one core. A TensorFlow implementation[2] with oneDNN support, which utilized most of the cores, ran at ~11 sec/iteration. Another, OpenVINO-based implementation[3] ran at ~6.0 sec/iteration.
You mean the Keras version? How does it compare to the original one? Currently on my 10850K I get 2.4s/iteration, which is borderline usable. I haven't managed (nor tried very hard) to get the CUDA version working on my 1070; I expect it to be a little better, but I don't want to fight with RAM issues.
I guess I should give it a try.