A Look Back: Creating a VST 2.x Plug-In from Nothing (Part 1)

When I started with VoiceFX, my original goal was to support only VST 3.x, as it was the most modern version of the SDK, and surely by now all important software had moved to it. Unfortunately, I didn't account for the occasional big shot releasing a modern product built on a relatively ancient version of the SDK - an SDK that no longer officially exists. So what do you do in this situation?

You do what every other totally sane developer does and start a clean-room reverse engineering project for the now abandoned VST 2.x SDK, staying faithful to the law. And finally, after roughly 5 months of development, I managed to make it work. So how did I get there?

Establishing the Boundaries

When trying to stay within the law, you have to establish clear boundaries. For clean-room reverse engineering in particular, I had to read up on quite a few lawsuits to figure out what is and is not allowed. In the end, I had to ask a lawyer for advice, which cost some money, and the advice I got boiled down to this:

Any work performed must be for the purpose of interoperability.
Don't use any original material or third-party source material that is not clearly set apart from the original.
Avoid the use of reverse engineering tools where possible.

These rules sound simple, but they ended up making the project a living hell. I could only rely on information that was clearly detached from the applications used, or I could not rely on these applications at all. A single misstep and you end up with a huge amount of legal issues. But what other option did I have when the actual SDK is now intentionally hidden by its creators, yet still in use by commercial applications released today?

Where is the Entry?

I had to begin somewhere, and for VST 2.x Plug-Ins and Hosts, that is the interface to the Plug-In itself. I needed to figure out the "entry point" from which everything starts, and what that entry point actually does, which meant I had to look into what existing VST 2.x Plug-Ins export. On Windows, I could hook into GetProcAddress to see which symbol names get requested, while on Linux, I could hook into dlsym.

I ended up using the Windows function for the initial steps, and after filtering out several thousand unrelated strings, I finally found something that looked relevant: a string with the content "VSTPluginMain". This string was often accompanied by "MAIN" and "main_macho", which I assume are used only by Windows-exclusive and macOS-exclusive VST 2.x Plug-Ins, respectively.
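To make the idea concrete, here is a minimal sketch of how a host-like loader might probe a loaded library for these three names. The names come from the text above; the probing order, the helper `find_entry`, and the `FakeLibrary` stand-in are assumptions made purely for illustration.

```python
# Entry point names taken from the article; the probing order is an assumption.
CANDIDATE_ENTRY_NAMES = ("VSTPluginMain", "MAIN", "main_macho")


def find_entry(library, names=CANDIDATE_ENTRY_NAMES):
    """Return (name, symbol) for the first exported name found, or None.

    `library` is anything that exposes its exports as attributes, for example
    a ctypes.CDLL instance, which raises AttributeError for missing symbols.
    """
    for name in names:
        try:
            return name, getattr(library, name)
        except AttributeError:
            continue
    return None


class FakeLibrary:
    """Stand-in for a loaded plug-in so the sketch runs without a real binary."""

    def main_macho(self):
        return 0


found = find_entry(FakeLibrary())
print(found[0])  # main_macho
```

With a real binary, the same helper could be fed a `ctypes.CDLL` object created from the plug-in's path.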

I had the name of the entry point, but none of its details - in common terms, I knew where the door was, but not where the knob was, what the key looked like, or how it would even open. But I had a start, which is more than nothing, and it gave me the push I needed to move on after almost giving up on the mass of data.

Lockpicking the Entry

While having an entry point is great, it still being locked is a problem - it needs to be unlocked for us to actually do anything. I could not rely on anything but the most basic tools at this point, as any information other tools gave me could be wrong. The only tools I had available to unlock the entry point were the CPU registers, so the work began.

On AMD64, you have the registers (R)AX, (R)BX, (R)CX, (R)DX, (R)SI, (R)DI, (R)BP, (R)SP, R8, R9, R10, R11, R12, R13, R14, R15, (R)IP, and (R)FLAGS. All of these have the potential to hold critical information, but only some of them will actually contain something useful. I won't bore you with the details of x86/amd64 Assembly here; there are plenty of other resources out there that are easily found if you're actually into that.

The short version of the testing is that I found two clear patterns that repeated every single time: (R)CX would point into some kind of read-only executable memory, while any value I left in (R)AX when returning would result in a crash of some kind - except if the value was zero. So most likely I was looking at two pointers of some kind, one as the argument, one as the return value.

That left the calling convention, which is where it got a bit difficult. I've never had a clear resource on what each calling convention actually does, but it seems that the 64-bit world has largely settled on one convention per platform instead of creating ever more standards for something so critical. I have no idea what the 32-bit calling convention would be, but my assumption is that it is either stdcall or cdecl, with the latter sounding more sane - time will tell.
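The observations so far (one pointer in, one pointer out) can be written down as a function prototype. The sketch below uses Python's ctypes for this; the names `EntryPoint`, `entry_impl` and `vst_plugin_main` are hypothetical, and the argument value is arbitrary. `ctypes.CFUNCTYPE` uses the C (cdecl) convention, which on 64-bit targets is simply the platform's native one; on 32-bit Windows, `ctypes.WINFUNCTYPE` would be the stdcall counterpart if that guess turned out to be the right one.

```python
import ctypes

# Assumed shape: one pointer-sized argument, one pointer-sized return value.
EntryPoint = ctypes.CFUNCTYPE(ctypes.c_void_p, ctypes.c_void_p)


def entry_impl(host_pointer):
    # Return 0 (a NULL pointer), the one value that did not crash the host
    # in the experiments described above.
    return 0


# Keep a reference to the wrapper for as long as native code may call it.
vst_plugin_main = EntryPoint(entry_impl)

# Call it the way a host would, with an arbitrary non-NULL argument that is
# never dereferenced here.
result = vst_plugin_main(0x1000)
print(result)  # None, which is how ctypes reports a NULL c_void_p
```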

Magic Space

With the entry point unlocked, but without a clear idea of what it does, questions flooded into my mind. Clearly the value I'm returning has some sort of meaning, so can I affect the behavior in other ways than just returning 0? What if I have a bunch of memory which is all filled with 0, but I return the pointer to that memory? How large does that memory have to be if that works?

While it seemed simple to test, of course the reality turned out to be much harder. My initial idea of simply returning memory that shrinks every time it succeeds resulted in a size of 4 bytes - nowhere near enough to store anything. There was clearly something going on with the first four bytes, but what? It had to be a magic number of some kind to clearly identify the structure.

And so I wrote a script which would repeatedly launch a VST 2.x host application with a crafted plug-in, which on every iteration would try a new number to insert into the first four bytes. And lo and behold, after 1450406992 iterations, I found the exact string: VstP.
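The iteration count and the string are the same 32 bits seen two ways, which is easy to check. The sketch below only does the arithmetic; the constant name `ITERATIONS` is made up for the example. Note that the four characters read "VstP" when the most significant byte comes first, so on a little-endian machine a 32-bit integer with this value sits in memory in the reverse byte order.

```python
import struct

ITERATIONS = 1450406992  # the counter value reported in the article

print(hex(ITERATIONS))                # 0x56737450
print(struct.pack(">I", ITERATIONS))  # b'VstP' (most significant byte first)
print(struct.pack("<I", ITERATIONS))  # b'PtsV' (little-endian memory order)
```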

I assume that this is short for Vst Plug-In, and it should have occurred to me - before I wasted several kWh on this problem - to limit the possible values of each byte. But I now had the magic number, and thanks to it I had a rough estimate of the size of the memory to return: roughly 128 bytes or more.
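To put a number on that hindsight, here is a small sketch comparing the full 32-bit search space with two restricted ones. Restricting to printable ASCII or to ASCII letters is an assumption about what a magic tag might look like, and the generator `letter_candidates` is hypothetical, not the script that was actually used.

```python
import itertools
import string

FULL_SPACE = 2 ** 32                       # every possible 32-bit value
PRINTABLE = 95 ** 4                        # four printable ASCII characters
LETTERS = len(string.ascii_letters) ** 4   # four ASCII letters (a-z, A-Z)


def letter_candidates():
    """Yield every four-letter ASCII tag as a big-endian 32-bit integer."""
    alphabet = string.ascii_letters.encode("ascii")
    for chars in itertools.product(alphabet, repeat=4):
        yield int.from_bytes(bytes(chars), "big")


print(FULL_SPACE, PRINTABLE, LETTERS)  # 4294967296 81450625 7311616
print(FULL_SPACE // LETTERS)           # 587, the rough shrink factor
```

A letters-only search would have needed at most about 7.3 million launches instead of up to about 4.3 billion.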

Structural Inspection

With the magic number in place, it was time to delve into what the structure actually contains aside from the first four bytes clearly being a magic number. The easiest way to check data integrity is by poisoning it intentionally, and that's exactly what I did: I started inserting 1-bit changes at random locations to figure out what does and does not cause problems.

I didn't have to wait long to hit something. Precisely when I chose bytes 8 through 15, I was hitting something critical that seems to be used right after loading the VST 2.x Plug-In. The length of this looked suspiciously like a pointer, so my first attempt was to place a function pointer there, and my guess was right. I was now hitting a breakpoint in my newly created function.
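What is known at this point can be summarised as a partial layout. The following is a minimal sketch, assuming a 64-bit process; only the magic number at offset 0 and the pointer at bytes 8 through 15 come from the text, while the field names, the explicit filler and the total size of 128 bytes (the rough estimate from earlier) are placeholders.

```python
import ctypes

MAGIC = 1450406992  # 0x56737450, the value found by brute force

# Filler so that the whole structure reaches the estimated 128 bytes.
REST_SIZE = 128 - 8 - ctypes.sizeof(ctypes.c_void_p)


class PluginStruct(ctypes.Structure):
    """Hypothetical layout; only offsets 0 and 8 come from the article."""

    _fields_ = [
        ("magic", ctypes.c_int32),                   # bytes 0..3
        ("unknown_04", ctypes.c_uint8 * 4),          # bytes 4..7, meaning unknown
        ("control", ctypes.c_void_p),                # bytes 8..15 on 64-bit
        ("unknown_rest", ctypes.c_uint8 * REST_SIZE),  # everything after that
    ]


plugin = PluginStruct(magic=MAGIC)
print(PluginStruct.magic.offset, PluginStruct.control.offset)
print(ctypes.sizeof(PluginStruct))
```

Spelling out the unknown regions as byte arrays keeps every offset explicit, which makes it easy to carve out new fields as they are discovered.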

After more register investigation, I figured out that the function has 6 arguments in total: two are pointers (one pointing at the memory I returned from the entry point), one is at least 8 bits wide, another is at least 24 bits (most likely 32), another is either 16, 32 or 64 bits, and one is a 32-bit float. Going by the MSVC x64 calling convention, I guessed at the rough order of parameters, which seems to be correct so far. Time will tell if I messed up.
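One plausible reading of those six arguments, written as a ctypes prototype, might look like the sketch below. The parameter order, the exact widths, the return type and all names (`ControlFn`, `control_impl`, `calls`) are guesses for illustration only; the values passed in the sample call are arbitrary.

```python
import ctypes

# Assumed order and widths; treat every slot as a guess.
ControlFn = ctypes.CFUNCTYPE(
    ctypes.c_ssize_t,  # return value: assumed to be a pointer-sized integer
    ctypes.c_void_p,   # pointer to the structure returned by the entry point
    ctypes.c_int32,    # the "at least 8 bits" value
    ctypes.c_int32,    # the "at least 24 bits, most likely 32" value
    ctypes.c_ssize_t,  # the "16, 32 or 64 bits" value
    ctypes.c_void_p,   # the second pointer
    ctypes.c_float,    # the 32-bit float
)

calls = []


def control_impl(plugin, small, medium, wide, pointer, number):
    # Record every call so the arguments can be inspected afterwards.
    calls.append((plugin, small, medium, wide, pointer, number))
    return 0


# Keep a reference to the wrapper for as long as native code may call it.
control = ControlFn(control_impl)

# Arbitrary sample call, standing in for what a host would do.
status = control(None, 1, 2, 3, None, 0.5)
print(status, calls[-1])
```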

Difficulty Spike

At this point, the difficulty went from Easy to Dark Souls but you only have the X button. It was time to start logging every single call to the function, which I dubbed "control". It did not take long for patterns to emerge on most VST 2.x Hosts, with the most common calls having the second parameter be 0x2D, 0x2F, 0x30, 0x31, 0x3A, 0x23, 0x0A, 0x0B, and finally 0x00 - I called this parameter the "opcode".
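Spotting such patterns is mostly a matter of counting. As a rough sketch, assuming the logged opcodes end up in a plain list, a tally could look like this; the opcode values are the ones listed above, but the order and the repeat counts in `observed` are invented for the example and do not reflect any real host.

```python
from collections import Counter

# Hypothetical log of opcodes from one host session (made-up order and counts).
observed = [0x00, 0x0A, 0x0B, 0x23, 0x2D, 0x2F, 0x30, 0x31, 0x3A, 0x0A, 0x0B]


def tally(opcodes):
    """Count how often each opcode appears, most frequent first."""
    return Counter(opcodes).most_common()


for opcode, count in tally(observed):
    print(f"0x{opcode:02X}: {count}")
```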

Unfortunately, after this much time I can't remember the exact details anymore. Some of them are extremely obvious, like 0x0A, which sets the sample rate. Others were more complex, and I only recently figured out what they mean and how to use them correctly through the development of VoiceFX.

However, this part is already long enough, so I'll continue with this in a future second part.

Permalink 17 Mar 2021
