The thing that takes special hardware is _protecting_ one task from a bug in
another - a bug which could trash the first task's (or the system's!)
memory. One has to have memory management of some kind to do that.
Actually, a modified version of the * approach will also work. When switching processes, swap the whole process out to your fastest device (on * this was a single write to the drum) and swap in the new process. The * hardware had a bounds register, so when the incoming process was smaller it was only necessary to swap out enough of the previous process to fit the smaller one in. After a while, core started to look like an onion, with the current process at the bottom and pieces of larger non-current processes above it.
(I thought that * was MIT CTSS, but I can't confirm that.)
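
To make the onion picture concrete, here is a toy simulation of the partial-swap idea. It is only a sketch under my own assumptions, not how * actually did it: every process is loaded at address 0, the drum keeps one full image per process, and the names (OnionCore, switch_to, layers) are made up for illustration. The word-by-word loop stands in for what would have been a block transfer.

```python
class OnionCore:
    """Toy model of core memory shared by swapping, with a bounds register."""

    def __init__(self, size):
        self.words = [None] * size   # contents of core
        self.owner = [None] * size   # which process's data sits in each word
        self.drum = {}               # pid -> saved image (hypothetical drum)
        self.bounds = 0              # current process may touch words[0:bounds]
        self.current = None
        self.words_moved = 0         # drum traffic, counted in words

    def add_process(self, pid, image):
        if len(image) > len(self.words):
            raise ValueError("process does not fit in core")
        self.drum[pid] = list(image)

    def switch_to(self, pid):
        size = len(self.drum[pid])
        for addr in range(size):
            occupant = self.owner[addr]
            if occupant == pid:
                continue             # this piece of pid never left core
            if occupant is not None:
                # Swap out only the word that is about to be overwritten.
                self.drum[occupant][addr] = self.words[addr]
                self.words_moved += 1
            self.words[addr] = self.drum[pid][addr]
            self.owner[addr] = pid
            self.words_moved += 1
        self.current = pid
        self.bounds = size           # words above this are off limits

    def layers(self):
        """Return the onion as (pid, start, end) runs, bottom layer first."""
        runs = []
        for addr, pid in enumerate(self.owner):
            if runs and runs[-1][0] == pid:
                runs[-1] = (pid, runs[-1][1], addr + 1)
            else:
                runs.append((pid, addr, addr + 1))
        return runs
```

For example, with an 8-word core and processes A (8 words), B (4 words) and C (2 words), switching to A, then B, then C leaves C at the bottom, a slice of B above it, and the top half of A above that. Switching back to A then only has to bring in A's bottom four words, since the rest never left core. Note that the bounds register is what keeps the running process from touching the leftover layers above it.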