JIT Emulation?

  • Thread starter Thread starter sfx1999
  • Start date Start date
Status
Not open for further replies.
I got it now.


After looking at it, I have to question if it is really what you want to be using.
That tutorial tells you how to recompile the entire target code all at once and then just execute it.

You said you want to compile on-the-fly.
Well, the tutorial can help a little with this, particularly with emitting x86 code, but you are going to have to do heavy modifications to their base concept.

Essentially the problem is this:  They use a single thread for everything.  They read the target code and emit x86 code.  Of course while this process is happening, the registers in their main thread are changing.
Well, for them, this is no problem, because the last step will be to execute the written code all at once.
Since it is all done at once, the code will be able to set registers and retrieve them without fear of them being modified.

Now here is the problem with trying to do this instruction-by-instruction.
Your method would be to write code, execute code, write code, execute code, etc.
When you write code, your thread’s registers will change.
When you execute the code, the code might set a register to a specific value.
When you write the next set of code, that register may be overwritten because it may be needed for the writing of the code, so when the next set of code is executed, the register it wants is basically some random value, rather than what it had set previously.


You would need a separate thread for the execution of the code, and certainly you will need a way to make sure that thread can wait for new code to be written while maintaining the registers used by the emitted code.
If you use naked functions, this isn’t so difficult.
One solution off the top of my head would be to PUSH each register, then check a flag that determines if there is new code to be executed.
If there is, POP each register back into place and go directly to the new code, making sure not to change any registers on the way, and use a CALL to get to the new code.
The new code would end in a RETN, putting your thread back into its loop where it immediately PUSH’s all its registers again to keep them safe.


You could improve speed by using CMP directly on the value you are checking (to determine if new code has been written), so that nothing is loaded into registers.  This would allow you to avoid all the PUSH’ing and POP’ing.


But then there is the problem of other thread flags being modified, such as EFlags.
By using CMP or TEST, you will modify flags in your thread that the emitted code may want to use.
You would have to work around this as well.

There is the option of storing fake copies of registers and thread flags and using them instead of using the real thread’s registers and flags, but you would be interpreting code rather than recompiling it at run-time.


Since you aren’t compiling it all at once and executing it, your challenge is to get your emitted-code thread to wait for new code to be written without changing its flags/registers.


L. Spiro
 
I am not sure what you mean exactly. Aren't the registers stored when a function is called? As for making sure that the emulated registers aren't destroyed, you put them in a global variable.
 
Aren't the registers stored when a function is called?
When necessary, the ESP and EBP registers are stored on the stack. The rest can change (functions often return the values through registers).
 
To sfx1999.

There is no need to store them in globals; it is slower than simply pushing/popping them, and more cumbersome.

If you store to globals, you have to use each global name (don't use array indexes or speed will be compromised) individually; a specific global set aside for a specific register.
Then you would have to code the assembly:
mov g_rEAX, EAX
mov g_rECX, ECX
mov g_rEDX, EDX
etc.

Then before you execute your translated code, you have to put the registers back:
mov EAX, g_rEAX
mov ECX, g_rECX
mov EDX, g_rEDX
etc.

It’s easier to just do:
PUSH EAX
PUSH ECX
PUSH EDX

…
POP EDX
POP ECX
POP EAX


This is all in regards to storing them to ensure they don’t get overwritten while you are waiting for new code to process.

To answer your question…
Aren't the registers stored when a function is called?
Generally EBP (the function’s local stack pointer) and ESP (the stack pointer) are stored at the start of a function, changed in some way, then restored at the end of the function, back to whatever they were originally.
Meanwhile, other registers will be changing, and those changes last even after the function returns.
Especially functions that return values, in which case EAX will contain the return value (and possibly EDX).


Well, normally you wouldn’t need to worry about this, since it should (in theory) only be the code you are emulating that will change registers, and of course whatever changes it makes, you don’t want to get in the way.
The problem is actually trying to stay out of the way.

You have one thread that sits in a loop and waits for new code to be written.
It checks a simple flag to determine if there is new code.
When the flag is changed, it goes immediately to the location where the new code was written, executes it, and comes back to the loop.
The new code may have stored a value in EAX that is intended for the next instruction (which has yet to be parsed/executed).
There is no second copy of EAX unless you specifically make one (with a global or by PUSH’ing).
So you go back to your loop.
It checks the global.  Maybe the code looks like this:

MOV EAX, g_bCodeWritten
TEST EAX, EAX

Well now your loop has changed the same EAX that the emitted code is supposed to be using.
This is why you will definitely be required to write your loop in ASM yourself, so you know exactly what it is changing.

As I mentioned before, you don’t actually have to store backups of the registers at all, if you just write an ASM loop that uses “CMP g_bCodeWritten, 0” directly.
This will avoid changing any registers, so you don’t need to have globals or PUSH/POP.

However, it will change the EFlags, and this will cause a problem.
Registers are not actually the problem.
The thread flags used for conditional jumps now become the problem.
If you don’t understand why, read up on how CMP works in conjunction with JE, JNZ, or whatever conditional jump you like.


Luckily, there is a method for you to get around this problem.
Basically you’re going to have your own EIP pointer which keeps track of where you are in the code, so you know what instruction to translate next.
You load the instruction, translate it, write it in x86, set a flag, your second thread executes it, goes back to its loop, and you continue.
Well, you’re always saving the code to a specific location.
That means you actually can’t write CALL or JMP instructions (or conditional jumps).
Actually, these instructions would be parsed directly, and rather than translating them and executing them, you make the call or jump with your original EIP pointer, then continue from there.
To do this correctly, your original interpreter (with the EIP pointer) will be required to correctly interpret TEST and CMP instructions, and any others that modify EFlags.  It will keep its own EFlags value for storing these results (I mean it will have a DWORD m_dwEFlags member; I am not suggesting it uses its real EFlags value in its thread context).
Then when it encounters a JNZ or whatever, it will check its m_dwEFlags to determine if the jump is taken.
If it is, it will need to correctly go to the location where it jumps and continue from there.
Remember that all of this is done without the steps of translating/executing the code.
This part is all interpreted.


I know this seems confusing.
You may want to really read up on ASM and thread context members.
You are going to need to code in ASM and set up a working system (partially in ASM) that allows two threads to communicate frequently and stay synchronized perfectly.
Your EIP thread can’t extract an instruction and write it to be executed while the executing thread is still executing the last bit if code it was passed.


You are also going to have to set up your own storage method to hold the values that the emulated code will be using.
The emulated code may try to access 0x003800EC, which quite likely won’t be a valid address in your program.
So you are going to create a method of mapping memory back and forth from the real location in your program to the respective location in the emulated code.
This part SHOULD be nothing more than a simple offset.  Depends on what you are emulating.


L. Spiro
 
Aren't the registers stored when a function is called?
When necessary, the ESP and EBP registers are stored on the stack. The rest can change (functions often return the values through registers).

Aren't the registers stored when a function is called?
Generally EPB (the function’s local stack pointer) and ESP (the stack pointer) are stored at the start of a function, changed in some way, then restored at the end of the function, back to whatever they were originally.
Meanwhile, other registers will be changing, and those changes last even after the function returns.
Especially functions that return values, in which case EAX will contain the return value (and possibly EDX).
deja vu?  :o

Also, I was quite inaccurate when talking about these registers. Only EBP (or EPB as L. Spiro names it :P - it's a Base Pointer btw.) is stored (on the stack). ESP is restored from the EBP when the call is completed.

However, it will change the EFlags, and this will cause a problem.
Registers are not actually the problem.
The thread flags used for conditional jumps now become the problem.
If you don’t understand why, read up on how CMP works in conjunction with JE, JNZ, or whatever conditional jump you like.
If the CMP is used then conditional jump usually directly follows that opcode (not always, but since the instructions between those two CAN'T change EFlags, I don't see any problem). When using two threads, it's OS job not to mess up two thread's registers/flags.
 
If the CMP is used then conditional jump usually directly follows that opcode (not always, but since the instructions between those two CAN'T change EFlags, I don't see any problem). When using two threads, it's OS job not to mess up two thread's registers/flags.
Normally it wouldn’t be a problem.
If he followed that tutorial page, he would end up recompiling the whole thing and executing it all at once.

But this isn’t what he wants to do.
To emulate with Just-in-Time emulation, he executes instruction-by-instruction.
Although he can work ahead and execute several instructions at once, he would still run the risk of resetting the EFlags flag while in the waiting loop.

The risk I am trying to explain is that when he checks his flag while in the waiting loop (via CMP), that thread’s EFlags is going to change, and unfortunately that is the same thread that is executing the emitted code.


Emitted code performs its own CMP instruction, setting thread’s EFlags a certain way.
Thread goes back to its loop to wait for the next instruction it will execute.
While in the loop, it uses CMP to check a flag that determines if there is new code to execute.
This is all on the same thread, which means when it checked to determine if there is new code, it just overwrote the flags set by the emitted code.
Generally speaking, the conditional jump will immediately or almost immediately follow a CMP, but don’t let that fool you into thinking you can get around this problem 100% safely by simply executing sequentially all the instructions after the CMP up to the first conditional jump.
It IS possible that multiple conditional jumps use the same CMP instruction to determine if they jump or not.


But again, this is not actually a problem.
Your executing thread won’t actually execute CMP, TEST, CALL, JMP, or any conditional jump instructions.
These few instructions will be interpreted separately.
Since you want your emitted code to always land in the same spot, just think what would happen if you actually executed a CALL instruction!
This won’t work at all with Just-in-Time emulation, because it would require you to recompile the location where the call goes, all the way up to the return of the CALL, and thus you would also have to recompile every CALL it makes, etc.

Instead, you have one thread that runs through the target code.  This thread has two DWORD values it uses to keep track of its “emulated” EIP and EFlags values.  It also has its own fake stack which it uses to keep track of CALL and RETN instructions.
When this thread encounters a CMP, it sets the emulated EFlags value accordingly and then continues to the next instruction, without sending it to the secondary thread.
This first thread uses its EIP value to keep track of its position inside the target code.
When this first thread hits a conditional jump, it checks its emulated EFlags value and sets its emulated EIP accordingly, then continues, again without sending the jump to be executed by the secondary thread.
When it encounters a CALL, it pushes the return location into its own fake stack (this is the only purpose of this stack; all other stack operations will be executed in the secondary thread and will use that thread’s real stack) and sets its fake EIP to the location of the call.  It then continues, again, without sending the instruction to be executed.
When it encounters a RETN, it pops its fake stack and sets its fake EIP pointer to that location, exactly how a real thread would.  It then continues, again, without sending the instruction to be executed.


All other instructions are sent to be actually executed by the secondary thread.
It has to work this way for Just-in-Time emulation to even work.
The secondary thread CAN’T execute JMP or CALL instructions because it will send the thread off somewhere that isn’t valid code and you die.
If you want the secondary thread to execute JMP or CALL instruction, then you are going to have to actually recompile ahead, which in turn (by the nature of how it works) will actually force you to recompile the entire program all at once, which is not Just-in-Time emulation.


L. Spiro
 
The risk I am trying to explain is that when he checks his flag while in the waiting loop (via CMP), that thread’s EFlags is going to change, and unfortunately that is the same thread that is executing the emitted code.
Now it makes sense :) Sorry, for not-fully-understanding your previous post.
 
Using threads seams like a good idea, but won't jumping back and forth from thread to thread cause a performance hit?

Also, would using Windows' thread system be faster than using the pthreads library? I would like to keep this portable.

Another problem will be returning the new registers from one thread to the other, though. That won't necessarily be easy. Since I plan to interpret and compile at the same time, then just run from memory the next time, I will need a way to pass registers, which involves global variables. If a block is really small, your thread system won't necessarily be faster.

So, I was thinking, why not just read the registers and store them at the beginning and end of a block execution?

Also, the memory handling wouldn't be that difficult. It would be hard if the processor had a BIOS for creating pointers. I could just create pointers to system memory and that case, but self modifying code would be a pain in the ass there and I could bring the computer to its knees with a bad program that doesn't run out of memory as fast as the real deal.
 
Using threads seams like a good idea, but won't jumping back and forth from thread to thread cause a performance hit?
The purpose of using two threads together is so that you can do true execution of the emitted code with minimal overhead used to store copies of registers and flags.
Switching back and forth shouldn’t cause any performance loss; based on the setup, you should see an increase in performance.
Without two threads, you would have to back up AND restore registers for each instruction you execute.
You can create a waiting loop that does not modify registers, which alleviates the need for restoring registers before executing emitted code.  This alone decreases the actual code size by 8 instructions.
Imagine, for every instruction you execute, you add 8 more instructions to it to save the registers, then 8 more to restore the registers.
Suddenly one instruction becomes 17.
But your main thread needs a way to get the registers from the executing thread quickly, so it may be a good idea to back them up in globals (but not restore them, since they won’t be modified in your waiting loop).
This adds 8 steps to every instruction you execute, but it is still faster than performing it on one thread, where you will be required to both store and restore them.

But note that you will be at the mercy of Windows®’ thread management.
You must set each thread at the same priority or they will block each other.
But if they simply use signals to each other, there should not be a speed loss by switching back and forth.


As for PThreads, while I have never used them myself, they are popular for a reason.
On the other hand, if you are using Windows®, you have to go through their API anyway; using a wrapper is only slower.
Unless you plan to make your own device driver.


I plan to interpret and compile at the same time, then just run from memory the next time
Well that changes things a lot.
You’re going to have to map out a large space where you can put lots of code.
And create a fast method for determining if a section of code has already been executed.

Lunch time so I can’t finish my essay.
So sad!!!


L. Spiro
 
Status
Not open for further replies.
Back
Top