Hyperthreading does not split a large program into smaller parts. That idea is simply known as threading, which is quite simple to implement on any multitasking operating system.
A simple (though perhaps pointless, it does give a good idea of what's going on) example on the Amiga would be to write a graphics converter program, where your program does little more than run a graphical user interface. When the user selects a file to convert, say an IFF to a JPEG, the program starts another task, passes it the graphics data, and then lets that task do the number crunching. The main program is now free to let the user do something else. Once the converter task has finished, it just has to signal this fact to the main program and let the user decide what to do next.
On a normal (Amiga) CPU like the 68060, each task/program is given its turn to run (I believe, assuming both tasks are at the same priority, they switch roughly every 1/12th of a second, i.e. every 4 VBLs). This just means that each program runs a bit slower than it would if it were running on its own.
Modern CPUs have multiple execution units which can perform the same operations (actually, sometimes, as in the 68060, there is a division of abilities, but this complicates matters). Like having two calculators on your desk, this allows the CPU to run two instructions at the same time, provided neither depends on the result of the other. A decent compiler, or a good human coder, will think carefully about the order of the program's instructions to allow the CPU to do this effectively. This is known as optimisation.
Hyperthreading is simply a special scheduler inside the CPU which makes a single CPU core look like 2 CPU cores. If there is a free execution unit, i.e. one that the CPU couldn't fill with the currently executing task (or if the currently executing task is waiting for a memory operation to complete; these take forever in CPU terms), then another program is given a chance to use that free resource, hence using time on the CPU which would otherwise go to waste.
The problems are many. Firstly, both tasks have to share the same infrastructure: cache, buffers, memory bus, etc. Secondly, it's rare for a well-optimised program to leave much CPU resource free. Thirdly, and most importantly, it requires quite a bit of silicon space to implement, yet only really offers a 5 to 10 percent performance gain.
It was only useful on the "brain-dead" NetBurst architecture, and is unlikely to return in the near future. Having 2 real cores on a single die offers much greater performance gains.