====================================================================
 Copyright (c) 2000 Kevin P. Lawton

 Permission is granted to copy, distribute and/or modify this document
 under the terms of the GNU Free Documentation License, Version 1.1 or
 any later version published by the Free Software Foundation; with no
 Invariant Sections, with no Front-Cover Texts, and with no Back-Cover
 Texts.
====================================================================

Running multiple operating systems concurrently on an IA32 PC using
virtualization techniques, by Kevin Lawton
(Last updated Nov 29, 1999)

The objective of this text is to bring together a number of ideas on the implementation of virtualization on the IA32 x86 PC. This is a collection of ideas from many people who have offered them to the FreeMWare discussion forum, including myself.

CONTRIBUTORS
============

Many folks have contributed to this project thus far. The following is an incomplete list of people who have contributed ideas about the implementation of virtualization on IA32. Email me if I forgot you.

  Kevin Lawton
  Ulrich Weigand
  Ramon van Handel
  Cedric Adjih
  Clem Dickey
  Nick Behnken
  Jens Nerche

THE RATIONALE FOR VIRTUALIZATION
================================

What many users and developers desire is a way to run a primary PC operating system and related software, while retaining the ability to concurrently run software engineered for a different PC operating system. For example, one might want to run Linux/x86 as a primary operating system, yet have the ability to run all of their MS Windows applications without the need for a reboot. There are several strategies that can be used to this end. It's worth describing them briefly, so it's more obvious where virtualization fits in.

Strategy #1, pure emulation: If you want a solution which runs x86 operating systems and applications on non-x86 platforms, you need to model a fairly complete x86 PC in software, since the x86 instruction set is not available to you. This is the method used by my emulation project "bochs" (http://www.bochs.com). The benefit here is portability. The tradeoff is a significant performance hit, though there are some advanced techniques, such as dynamic translation, which can improve performance.

Strategy #2, OS/API emulation: Since applications generally run in a different space than the operating system, and communicate via a set of APIs, another method of running applications designed for a different x86 OS would be to intercept and emulate the behavior of these APIs using facilities in the existing OS. This is the strategy used in the Wine project (http://www.winehq.com), though it's complicated by Windows internals issues. Application binaries can be run natively using this strategy, so one of the benefits is very good performance. Since the OS APIs are emulated, the associated OS is not needed, giving a pleasant side effect of not needing to purchase a license for that OS. On the negative side, this strategy only works for applications written for the x86 OS whose APIs you have emulated.

Strategy #3, virtualization: Let's first talk about why we can't just run two operating systems on the same PC. First, devices such as video cards, disk controllers, timers, etc. are not designed to be driven by multiple operating systems. In general, hardware is designed to be driven exclusively by one device driver. Additionally, system features of the IA32 CPU are designed to be configured and used in a coordinated effort by only one operating system.
These would include the paging unit, protection mechanisms, segmentation model, etc. Other application oriented features and instructions are not a problem and could potentially be used/executed natively. Fortunately these constitute a great deal of the workload imposed on the CPU.

We might generalize by saying that some larger fraction of the feature/instruction set would work in the context of running multiple operating systems, and some smaller fraction would not be compatible. Thus our general strategy for virtualization is to let the larger set of compatible instructions "pass through" and execute natively, while we detect the set of incompatible ones and emulate their expected behavior. In that sense, this is a quasi-emulation technique. The x86 CPU architecture doesn't give us the ability to naturally detect usage of 100% of the features which we need to virtualize, so we employ some software techniques to fill in the gaps.

As we certainly can't allow both operating systems to drive the same peripheral IO devices, we require software emulation of a reasonable set of such devices to be driven by the virtualized operating system. Fortunately, years of work on the project "bochs" (a Strategy #1 emulator) have yielded a complete set of such devices, which are known to be compatible with many x86 operating systems. These can actually be shared between both projects, and extended where necessary.

The benefits of virtualization include fairly native performance, the ability to run various x86 operating systems, and no need to depend on the publication of API standards or to keep up with them as they are extended.

POSSIBLE VIRTUALIZATION SCHEMES
===============================

The ideas and techniques of virtualizing a machine can be carried over to many different architectures, provided they have a capable feature set. Virtualization can also use various schemes depending on the architecture, the operating systems to be run, and other considerations. Some CPU architectures provide virtualization naturally. Others, like IA32, require additional software to completely virtualize the CPU.

It is conceivable that multiple OS environments could be engineered such that they communicated their low-level requirements via a common micro OS which controlled all the devices and system oriented CPU features. If all application and operating system code were well behaved, there would be no need for any additional virtualization, as it would be effected naturally. Given our real world circumstances, we have to account for OS and application code which accesses system oriented features. Our micro OS, in this case, would have to detect and emulate these accesses. Since the hardware devices are among the features which have to be emulated, a like set of devices would be offered to each OS running in this environment, though they don't have to relate to the actual set of devices which are physically part of your particular PC. The micro OS would drive those. The downside here is the effort involved in writing such a micro OS, and device drivers for the incredible number of devices out there.

Fortunately, we can just use services from a primary OS we run to implement emulation of the devices which the virtualization needs to offer to any secondary OS we run. This relieves us from having to write device drivers for any of the native hardware, and allows us to focus on the virtualization logic. Given this choice, it's useful to associate some terminology with each role an OS may play.
The OS which controls the actual hardware and provides services used by the virtualization code will be called the "host OS". This OS does not need any of the virtualization environment to boot or run. Each OS which runs inside a virtualized environment (emulated devices and virtualized CPU usage) will be called a "guest OS". There can only be one host OS, but conceivably many instances of guest OSs.

CHALLENGE ON THE IA32
=====================

A processor could be engineered to be naturally virtualizable. By this I mean that any instruction which accesses system oriented features should naturally trap out, giving a virtualization monitor the chance to emulate the expected behavior. It is important to protect reads and writes of system registers, so that poorly behaved application code will also function properly.

Unfortunately, the IA32 architecture is not 100% naturally virtualizable. There are a number of instructions for which write access to system registers is not allowed from user code (this is good; a trap is generated), but read access is allowed (not good; no trap). So we must coerce the processor into trapping out when potentially problematic instructions are executed; ones against which the processor does not offer us natural hardware protection. In a nutshell, this problem boils down to how to breakpoint on the execution of arbitrary instructions, since the IA32 CPU won't always do this for us. And we must do this without the guest OS detecting any changes; otherwise its execution path could be altered.

If you think about the last statement, it may become obvious that our virtualization objective, for the CPU component anyway, is the same as that for a non-intrusive software debugger. How can I set any number of breakpoints on arbitrary instructions, without changing code execution? With this thought in mind, hopefully the rest should fall into place.

IA32 FEATURES WHICH ARE NOT NATURALLY VIRTUALIZABLE
===================================================

I've put together a list of x86 instructions which need special consideration with respect to virtualization. This should help out in discussion of implementation. For now, let's assume we're going to use the following strategy. Guest code from all ring levels 0..3 will actually be run at ring3 (the user or application level), so that most sensitive instructions will yield a natural protection exception and thus we'll get a chance to virtualize their execution.

Following is the list, as mentioned. A 'Y' in the 2nd column means that when run at ring3, the IA32 processor naturally generates a protection exception, thus giving our virtual monitor code the chance to do something smart in the context of this virtualization environment. An 'N' means we can't make the processor give us the chance, so we need to figure out how to handle this otherwise. A '*' denotes that some special commentary is necessary, which follows after the list.

TABLE 1
-------
                       protected in ring3
  clts                 Y
  hlt                  Y
  in                   *   (IOPL and TSS permission map)
  ins                  *   (IOPL and TSS permission map)
  out                  *   (IOPL and TSS permission map)
  outs                 *   (IOPL and TSS permission map)
  lgdt                 Y
  lidt                 Y
  lldt                 Y
  lmsw                 Y
  ltr                  Y
  mov r32, CRx         Y
  mov CRx, r32         Y
  mov r32, DRx         Y
  mov DRx, r32         Y
  mov r32, TRx         Y
  mov TRx, r32         Y
  popf                 *
  pushf                *
  cli                  Y   (IOPL)
  sti                  Y   (IOPL)
  sgdt                 N
  sidt                 N
  sldt                 N
  smsw                 N
  str                  N
  verr                 N
  verw                 N
  lar                  N
  lsl                  N
  lds/les/lfs/lgs/lss  *
  mov r/m, Sreg        *
  mov Sreg, r/m        *
  push Sreg            *
  pop Sreg             *
  sysenter             *
  sysexit              Y

Also of relevance to this topic are instructions which effect a control transfer or interrupt.
The importance of these instructions will become more evident later, so for now I'll just list them:

TABLE 2
-------
  call            *
  ret             *
  enter           *
  leave           *
  jcc             *
  jmp             *
  int/into/bound  *
  iret            *
  loop            *
  loope/z         *
  loopne/nz       *
  wait            *

We can handle the instructions which have native protection ('Y') while running at user-level, by an exception handler which emulates the instruction in our virtualization context. This is exactly what a v8086 mode monitor does. Fortunately, emulation of nearly all x86 instructions is done in bochs already, so there's not much ground to break here. That leaves the set of instructions with an 'N' or '*' in column 2. In the following, I outline special considerations for each of these instructions.

LAR, LSL, VERR, VERW

The worry here is that the CPL we are running at is factored into the behavior of the checks made by these instructions, as are the fields in the segment descriptors, which we may modify according to our virtualization strategies. We don't want the guest to be able to see any modifications we make. Under the right circumstances of CPL and the descriptors in the current descriptor table, we could conceivably let these instructions execute as-is. Otherwise they will have to be virtualized.

SGDT, SIDT, SLDT

The IA32 architecture lets you store, but not load, the values of certain system registers from user code; in this case, the global, interrupt, and local descriptor table registers. This is of concern to our virtualization strategy only if our implementation uses values for these registers other than what is expected.

SMSW

Moves to/from control registers cause exceptions in user-level code. Unfortunately, SMSW can be run at any privilege level, and reads the bottom 16 bits of CR0. These bits contain the following fields:

  0: PE  protection enable
  1: MP  monitor coprocessor
  2: EM  emulate coprocessor
  3: TS  task switched
  4: ET  extension type (hardcoded to 1 on most processors)
  5: NE  numeric error

Note, I believe bit4 is hardcoded to 1 on Intel processors, though this is not true for some clones. I think the trick here is that if the virtualization monitor and environment can keep these fields consistent with what the guest OS expects, then we can let this instruction execute as-is.

STR

This stores the segment selector from the task register into a variable. This is of concern to our virtualization strategy only if our implementation uses a value for this register other than what is expected.

POPF, PUSHF

The arithmetic flags, such as CF, should not be a problem, as we're letting user instructions which manipulate them execute natively. Any exceptions generated to emulate system oriented instructions will have to save and restore them properly, but that's to be expected. That leaves us with:

   8: TF
   9: IF
  12-13: IOPL
  14: NT
  16: RF
  17: VM
  18: AC
  19: VIF
  20: VIP
  21: ID

Notes:

- PUSHF on the Pentium+ does not copy the values of RF and VM (bits 16 and 17). Instead, these fields are cleared in the flags image on the stack. So in a sense, you don't have to worry about code looking at these fields via a PUSHF for >= Pentium.

MOV r/m, Sreg
MOV Sreg, r/m
PUSH Sreg
POP Sreg

These instructions load and store segment registers. A concern is the RPL field of the selector. Since instructions which read the value of the selector are not protected against in user code, we need to do one of two things. Whenever possible, we can make sure segments are loaded with a selector which has the RPL which the guest code expects (3 for user code).
Or we can virtualize instructions which read the segment selectors, by emulating them and substituting the desired fields. The strategy we use at any one time depends on the privilege level of the guest code that is executing, as well as other factors.

IN, INS, OUT, OUTS

These instructions have the following protection check (for protected mode), to determine if the instruction will be allowed to make the IO transaction:

  if ((CPL > EFLAGS.IOPL) && (any corresponding bits in TSS are 1))
      accessible = no
  else
      accessible = yes

We generally want to reflect IO accesses to our device emulation code. (See section on that later.) Thus, we need to ensure we always receive an exception when the guest attempts an IO instruction. If the IOPL value in EFLAGS we choose to use at any one time is < CPL (i.e. IO requires more privilege than the guest code has), then the processor will generate an exception. If we choose to use a value of IOPL such that this is not the case, we can use bits in the TSS IO permission map to force an exception. We might make the choice to allow an IOPL like this if the guest is requesting it and we don't want to virtualize operations which access the EFLAGS register, like PUSHF. Modifying bits in the TSS IO map means that we must also virtualize the TSS. (See section on that later.)

SYSENTER

This instruction has the following checks:

  IF CR0.PE == 0           THEN #GP(0)
  IF SYSENTER_CS_MSR == 0  THEN #GP(0)

So I would say we should do the following. Upon startup of our VM, check what processor we're running on and see if sysenter/sysexit are supported via CPUID. If not supported, then no big deal. If supported, then whenever we warp into our VM, save the value of SYSENTER_CS_MSR and then set it to 0. Then we'll receive a fault when the guest tries to use it. We have to restore the value when warping back to the host OS.

DYNAMIC SCAN-BEFORE-EXECUTE TECHNIQUE
=====================================

As mentioned, one of the key tasks we have is to protect against the execution of that small set of instructions which do not invoke native IA32 protection mechanisms. So we do it in software. The path of execution of code can be thought of simply as starting at a well defined address (in the ROM BIOS on the PC), and passing through many branches (jumps, calls, interrupts, etc) along the way. Since we know where execution begins, we can fetch and decode a sequence of instructions up to a branch instruction, and place a breakpoint there. We could then execute the code, which will generate a breakpoint exception at the branch instruction. Our virtualization monitor would receive the exception, effect the branch in the guest code, and given the new target address after the branch, repeat the same process for the next code sequence.

Using this method, we are free to place breakpoints on arbitrary instructions, not just branch instructions. So we can force the guest code to generate exceptions on any instructions, including the ones in the set we talked about. Essentially, what we are trying to accomplish here is never to let the execution of guest code pass into unscanned code; otherwise it might run an instruction from that set.

There are some considerations we need to keep in mind when implementing this technique. The first is that if we use software breakpoints, we are physically modifying the code. Conceivably, guest code could read from itself and thus retrieve the breakpoint instruction rather than the original instruction. I'll call this Self Examining Code (SEC).
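To make the scan-and-breakpoint idea a bit more concrete, here is a minimal C sketch. The instruction decoder and classifier helpers (decode_insn, is_branch, is_unsafe, already_scanned) are hypothetical and do not exist yet; the point is only the control flow, plus the fact that we must remember each original byte we overwrite, since that is exactly the information needed both to emulate the instruction later and to cope with SEC reads.

    /*
     * Sketch only: scan from a known execution point and arm a software
     * breakpoint at the first instruction we cannot let run natively.
     * decode_insn(), is_branch(), is_unsafe() and already_scanned() are
     * hypothetical helpers, not existing code.
     */
    #include <stdint.h>

    #define OPCODE_INT3 0xCC    /* single-byte INT 3 */

    struct breakpoint {
        uint32_t offset;        /* offset of the patched byte in the page   */
        uint8_t  orig_byte;     /* original byte, needed for SEC reads and
                                   for emulating/stepping the instruction   */
    };

    extern int decode_insn(const uint8_t *p);      /* returns length 1..15  */
    extern int is_branch(const uint8_t *p);
    extern int is_unsafe(const uint8_t *p);        /* Table 1/2 membership  */
    extern int already_scanned(uint32_t offset);

    static void scan_and_arm(uint8_t *page, uint32_t start,
                             struct breakpoint *bp)
    {
        uint32_t ip = start;

        for (;;) {
            if (is_unsafe(&page[ip]) || is_branch(&page[ip])) {
                bp->offset    = ip;
                bp->orig_byte = page[ip];
                page[ip]      = OPCODE_INT3;   /* trap here, emulate, rescan */
                return;
            }
            ip += decode_insn(&page[ip]);
            if (already_scanned(ip))           /* downstream already armed   */
                return;
        }
    }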
We likewise need the ability to handle the case where guest code modifies itself, known as Self Modifying Code (SMC). You could imagine that if we placed a breakpoint at the end of a sequence of instructions, and the instructions in the middle modified the code, we could conceivably lose control of the execution path. So we need to detect writes to code which has been scanned.

And, as discussed so far, the strategy entails continuous decoding, lots of exceptions and processing, and thus very poor performance. So as an extension, wherever possible, after we've scanned code sequences, we attempt to let them run without intervention on the next pass. Ideally, we want to only scan code which exists on code paths which haven't been taken before.

The paging mechanisms are a convenient way to protect against reads/writes to small regions of memory on IA32, something we need to do to handle the considerations above. So we make use of paging protections. Let's start by thinking about some code contained completely within a linear page (and its related physical page). Let's say execution begins at instruction i0, somewhere in that page. The first revision of our strategy might be as follows. We start parsing instructions, beginning at i0, until we encounter:

- An instruction which is in our current list of those which can not be run natively
- A branch instruction
- The address of an instruction sequence which has already been analyzed

For the first 2 conditions we install a breakpoint at the beginning of the terminal instruction. For the last condition, we don't need to do anything, since the following code has been dealt with already, with breakpoints installed downstream wherever necessary. We then allow the code to execute natively. Execution continues until it reaches a breakpoint condition we set.

If we have encountered an instruction which can not be run natively, then we emulate its behavior and begin analyzing the instructions which follow it, as above. If we have encountered a branch instruction, then we could single step through its execution and begin analyzing as above, starting with the target address. Additionally, if the target address is non-computed and has been analyzed and marked OK already, then we could mark this branch instruction as OK, and let it execute natively from now on. Computed branch instructions we need to monitor. Since the target address is dynamic, we don't know if it will branch to a sequence we have scanned before. Thus, we are dynamically monitoring code, making sure we never let execution branch to code we have not yet examined.

Our strategy so far does not address the possibility that some instructions which write to memory may write into our page of code, specifically into the address of instructions which we OK'd and allowed to execute natively. This is where the paging system comes in handy. Using the page tables, we can write protect any page of memory in which we have analyzed and OK'd code. Since multiple linear addresses can be mapped to a single physical page, for perfection, all page entries pointing to the physical page with trusted code would have to be write protected. Then, upon a write-protect page fault, we have the opportunity to unprotect the page, step through the instruction, do something intelligent with respect to the meta information we store about that code page, and re-protect the page. We might, for example, just dump the meta information about that code page and start from scratch.
I outline the steps required by such a technique in more detail below, as well as some possible implementation details.

- For each new code page we encounter, allocate a page which represents attributes of the instructions within the page. Zero out the page to begin with.

- Each byte in this corresponding attribute page denotes attributes of the instruction which starts at that offset in the code page. Here's a possible layout of the bitfields in each byte (a small C sketch of these attribute bytes follows this list):

    7 6 5 4 3 2 1 0
    | | | | | | | |
    | | | | +-+-+-+---- instruction length 1..15
    | | +-+------------ available for future use
    | +---------------- 0=execute native, 1=virtualize
    +------------------ 0=not yet scanned, 1=scanned

  When bit7 is 0, all the other bits are meaningless, since we have not yet scanned the instruction.

- At first, when we encounter new local-page branch instructions (static offsets), the target address in the page may very well not have been scanned yet. We could mark this instruction as one to virtualize for now. The virtualization logic could simulate the branch until the target address has been scanned, at which point we could mark the instruction to execute natively from thereon. Lazy processing at its best.

  The next step beyond this strategy would be to use a recursive descent technique and branch out, pre-scanning the one (unconditional branch) or two (conditional branch) possible target addresses. Upon returning from the recursion, granted both are in the local code page, we would likely be able to let the branch instruction execute natively. Code downstream could generate breakpoints where necessary. Terminals in the recursion could be:

  - instructions which are already scanned
  - out-of-page branches
  - instructions which require virtualization, though we could scan right through these
  - instructions whose opcodes cross page boundaries

  We may also want to establish a maximum recursion depth. The win we get is if the code we prescan is well used before we have to dump the attribute page. We lose when this is not the case, for instance when the code page is modified frequently via self modifying code, or due to code pages which share data. We may find, through trial and error, a comfortable max depth of N which does a much better job winning than losing on average.

- When we detect a write to a code page for which we have scanned code, there are multiple actions we could take. We could simply dump all info for that page. Or we could conceivably examine the address and data size of the instruction's access in the page-fault handler, to determine if it stepped on addresses for which we have scanned instructions. Upon examination of the affected addresses, we'd look at the corresponding attributes. If they pertain to one or more instructions which we have pre-scanned, then we need to dump all the mappings for the entire attribute page. This is because the technique I have here doesn't record which instructions branch into these addresses, so there is no way to know which other instructions to invalidate. This is the tradeoff for a simple algorithm. If such writes go to areas in the page which are not yet marked as pre-scanned, then we can step through the instructions. We can consider the writes as ones to data in a shared code/data page. And as such, we don't need to dump the page attributes we have accumulated thus far. For simplicity, let's start with the first option.

- We need to cope with out-of-page branches and computed branches in a way such that we never lose control of execution.
  The simplest approach is to mark such instructions as needing to be virtualized by the monitor. We should choose this one as our first approach, for simplicity. For static offset out-of-page branches, if we wanted to go a step further and let them pass through, we would have to keep some additional page/branch information, which can add an amount of complexity to our monitor. The main issue is that we may have to invalidate code in a given page that is branched to. If we were to allow long static branches to this code, then we would have to iterate (or recurse) through all the code pages which had such branches, and do some invalidating there. This would require a very efficient "edge list" between pages, etc.

- This technique handles overlapping instructions well. Each byte in the attribute page holds info about only the instruction which *starts* there. So there can be attributes for instructions which start in the middle of a previous instruction, with no conflict. This technique also makes things simple for the code which handles writes to a code page. Given the affected addresses, you can easily scan forward and backward to see if a pre-scanned instruction was hit, and then invalidate the attribute page. Or in other words, it can handle self-modifying code well.

- This technique extends quite well to code which spans multiple pages. We have an additional boundary case where an instruction crosses a page boundary. In this case, we need to write-protect both pages, and handle modified code across both pages. And the tables we use to keep track of which instructions have been analyzed will likely be oriented (hashed) in a way that factors in page size, for performance reasons. Or if we want to take the easy way out, we could just mark any instruction which spans 2 pages as needing virtualization. We could then just step through the instruction, and resume scan/executing thereafter.

- This technique eats up memory as we encounter and monitor new pages. So we'd have to have some threshold at which we could dump old private pages, and the associated info about which instructions in them have been monitored, so they could be reused. Probably a good candidate for an LRU strategy.
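As a concrete illustration of the attribute bytes described in the list above, here is a small C sketch. The bit assignments follow the layout given earlier; the macro and function names are invented here, and the write-fault helper simply implements the "scan backward to see if a pre-scanned instruction was hit" check.

    #include <stdint.h>

    #define ATTR_LEN_MASK    0x0F   /* bits 0..3: instruction length 1..15   */
    #define ATTR_VIRTUALIZE  0x40   /* bit 6:     1 = virtualize, 0 = native */
    #define ATTR_SCANNED     0x80   /* bit 7:     1 = byte starts a scanned
                                                  instruction                */

    static inline int attr_scanned(uint8_t a)  { return a & ATTR_SCANNED; }
    static inline int attr_length(uint8_t a)   { return a & ATTR_LEN_MASK; }

    static inline uint8_t attr_make(int len, int virtualize)
    {
        return ATTR_SCANNED | (virtualize ? ATTR_VIRTUALIZE : 0)
                            | (len & ATTR_LEN_MASK);
    }

    /* After a write fault at page offset 'off': did the write land inside a
     * pre-scanned instruction?  If so, the simple policy above dumps the
     * whole attribute page.  (For a multi-byte write, call this once per
     * byte written.) */
    static int write_hits_scanned_code(const uint8_t attr[4096], unsigned off)
    {
        unsigned i = (off >= 14) ? off - 14 : 0;   /* max insn length is 15 */

        for (; i <= off; i++)
            if (attr_scanned(attr[i]) && i + attr_length(attr[i]) > off)
                return 1;
        return 0;
    }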
PROTECTING AGAINST GUEST READS TO CODE THAT WE MODIFY
=====================================================

We previously talked about how to monitor code before it is executed, in order to virtualize arbitrary instructions (usually ones that don't offer natural protection). With this technique, we make use of breakpointing. Hardware breakpoints would seem like a natural choice, since they are non-intrusive. However, there are a limited number of them (only 4), and their use by the virtualization code potentially competes with the use of hardware breakpoints in the guest OS. I'd like to offer the ability to allow guest OS hardware breakpoints. So we need to be prepared to use software breakpoints, if/when any of these factors make it necessary.

Software breakpoints give us the ability to install unlimited breakpoints in the code we monitor. However, a side effect is that we have to modify the code by inserting them. This offers a potential for incorrect execution when running any code which depends on a read of executable code being completely accurate. Access through any of the data segment registers could potentially read from a section of code we have modified, and "see" the software breakpoint instructions we have installed.

Unfortunately, the paging protection does not offer a natural differentiation between reading and executing code. If it did, we could use it to protect against reading from a page which we have modified, while at the same time allowing execution in that modified page. Then upon a read, we'd receive a page fault, and could spoon feed the read data according to the unmodified page. Thus it would never see the modifications. The next section explains a way to exploit the separate I&D TLB caches to do something similar to this. Separate caches did not exist on earlier processors like the 386 and 486.

So here's an alternate strategy. What we can do is to execute code from a private copy where the modifications are made. Reads would occur from the actual code; execution from a private modified copy. In the virtualization technique we talked about previously, we write protect pages of code which we are monitoring. Since we are notified of changes to the page via a page fault, we have the opportunity to propagate the change to the private copy as well, and take appropriate actions.

What we could do is to provide a separate code segment (CS) descriptor for the purposes of pointing into our private modified code page. When our monitor effects control transfers back to guest code, this descriptor will be fetched and loaded into CS. Other segment registers, like DS, will be loaded from descriptors which point to the normal guest data space, including possibly the original code page. A concern we would then have is that instructions which access memory may override the default DS segment and use CS (override prefix 0x2E). We don't want any reads accessing memory via our private code descriptor. This can be addressed by making sure our modified CS descriptor is marked as execute-only. Attempts to use CS for reads will then generate exceptions. Or we can virtualize all instructions which use a CS prefix opcode. (See Caveat #1)

USING THE TLB CHARACTERISTICS TO PROTECT AGAINST CODE READS/WRITES
==================================================================

If we could let code execute in a page, but disallow reads/writes to the same page, we would be notified when we have encountered self-examining or self-modifying code. As it turns out, most new IA32 CPUs from Intel and other vendors have split I&D TLB caches for better performance. That is to say, the page table values used for instruction fetches are cached in a separate instruction TLB as instructions are encountered, while the values used for data accesses are cached in a data TLB as those accesses are encountered. It is possible, under the right circumstances, to use this split cache to our advantage, so that we effectively have the ability to run code in a page which can not be read or written, when this behavior is available on the current CPU. It is generally available on anything Pentium and beyond, and on clones of such processor levels, as a test program run by many users confirmed. This technique is known _not_ to work on the following chips, likely due to a combined I&D TLB cache. We could even dynamically determine if this technique is available to the monitor, given the CPU it is running on.

TABLE
-----
  386 (doesn't even have INVLPG)
  486-DX66
  Cyrix PR150+
  Cyrix PR200+
  Cyrix MediaGX233
  Cyrix M2 PR300 MX
  Cyrix M2 333
  AMD K5 PR100
  IDT C6 200
  NexGen Nx586 100

The technique works as follows (from the monitor's perspective):

- Invalidate the code page with INVLPG
  (wipes the I&D TLB entries clean).
- Make sure the page table entry is ring3 accessible.
- Create a private mapping to the code page (a different linear address maps to the same physical page).
- Write a RET instruction into the code page (using the private mapping).
- Call the RET instruction using the normal mapping (loads the I TLB entry).
- Replace the instruction where the RET went, using the private mapping.
- Set the page table entry permissions to be only ring0 accessible.
- Switch to the ring3 code.

Essentially, we have loaded the instruction TLB cache entry for a particular code page with a ring3 accessible TLB entry. But before we transfer to the ring3 code, we have modified the page table entry to be not accessible anymore. Since we originally flushed the TLB entry, the data TLB cache entry is invalid. In the future, a data access will attempt to load the TLB from the modified page table entry, which is not ring3 accessible. Thus a data access, be it read or write, will generate a page fault - our notification and a chance to do something about the fact that the guest code is reading/writing in the same page. Yet the code is able to execute, fetching using the page translations and permissions cached in the instruction TLB.

Note also that there is no deterministic guarantee of how long the TLB entry will be maintained. The CPU is free to dump it at any time. But this doesn't pose a problem in our strategy, other than performance. The next time the TLB entry is loaded, it will read in the entry which is non-accessible from ring3 and generate an exception. At that point, we can reiterate this process.

LDT/GDT/IDT VIRTUALIZATION
==========================

Let's assume that the host OS has its own complete set of local, global, and interrupt descriptor tables, as well as page tables. For our virtualization techniques, we need the flexibility to maintain a separate set of these tables. We need to play tricks with the page tables, and to create descriptor tables, such that we can at times let guest code load descriptors using the natural x86 mechanisms. We also need to create our own interrupt descriptor table, to handle exceptions generated by the guest code as part of our virtualization, or just as part of natural exceptions/interrupts which are generated by guest code.

To have such flexibility, we need to save the host context of these tables, switch to a completely separate context whenever we run our guest code for a timeslice, and then restore back to the host context when our timeslice is done. In a sense, our monitor code is running the guest OS/code in a "parallel" context to the host OS.

[I plan to fill in more here later +++]

VIRTUALIZING DESCRIPTOR LOADING
===============================
(for protected mode guest code)

Granted that we virtualize all protection levels of code (0..3) by running them at ring3, we will have a privilege level mismatch with respect to the loading of segment registers when not running ring3 code. While running user code (ring3), the descriptor loads should execute as expected, due to the match in effective and actual privilege levels. This is provided we point the GDTR and LDTR registers at descriptor tables which contain the descriptors that the guest code expects. While running guest system code (rings 0..2), an exception would be generated whenever loading from a descriptor that has a privilege level < 3, whereas the load may have normally succeeded. One way to solve this is to protect against instructions which load the segment registers when running code effectively at a CPL of {0,1,2}.
Then we emulate them. We have to virtualize instructions which examine the segment registers at these privilege levels anyway (because they may look at the RPL field, which will not reflect the expected privilege level), so doing the same for instructions which load them is just an extension.

What I'd like to explore further is the idea of using a private GDT and LDT for the virtualization of code at CPL {0,1,2}. If we virtualized instructions which looked at the GDTR and LDTR, we could load them with values pointing to private copies. The private descriptor tables could start out empty, generating exceptions upon segment register loads. Each time, the exception handler could generate a private descriptor which would allow the next segment register load to execute natively. For example, say we are running guest code which is effectively at ring0, but really at ring3, and a segment register load occurs for a ring0 segment. The first attempt would invoke an exception, where we could build a private descriptor which actually has a descriptor privilege level of ring3. The next attempt would work natively, due to the privilege level match of our private descriptor. Each time the descriptor table registers (GDTR and LDTR) are reloaded, we could dump our private descriptor tables and start anew. Our choices are to start from empty tables, or to map all the real descriptors to private ones in one shot.

In order to implement such a technique, we would need to make use of the paging unit protection mechanisms. We need to be notified when system code changes a descriptor table entry, so we can re-adjust (or dump) our private copy. We need to page protect any pages which contain the GDT and LDT descriptors. The exception handler would recognize that a write access to descriptor table memory occurred, and do the correct thing with our private descriptor tables. We have to keep the accessed bit in each of the descriptor tables correct, so it may be better to start out our private descriptor tables with empty descriptors, and change the accessed bit in the real tables during the exception handler where we build a private entry. If we build them all at one time, then the accessed bit would only be modified in the private copies upon future loads.
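A rough C sketch of how the exception handler might lazily build such a private descriptor follows. The 8-byte descriptor layout is the standard IA32 one; the table names, the assumption that the selector indexes the GDT (ignoring the table-indicator bit), and the omission of the accessed-bit bookkeeping are simplifications made only for illustration.

    #include <stdint.h>

    struct descriptor {           /* one 8-byte IA32 segment descriptor */
        uint16_t limit_lo;
        uint16_t base_lo;
        uint8_t  base_mid;
        uint8_t  access;          /* P, DPL, S, type                    */
        uint8_t  limit_hi_flags;  /* G, D/B, AVL, limit 19:16           */
        uint8_t  base_hi;
    };

    #define DPL_SHIFT 5
    #define DPL_MASK  (3 << DPL_SHIFT)

    extern struct descriptor guest_gdt[];    /* guest's table (protected)     */
    extern struct descriptor private_gdt[];  /* table the CPU actually uses   */

    /* Called from the protection-fault handler when the faulting instruction
     * was a segment register load of 'selector' by guest system code. */
    void build_private_descriptor(uint16_t selector)
    {
        unsigned index = selector >> 3;          /* drop RPL/TI for brevity   */
        struct descriptor d = guest_gdt[index];  /* what the guest expects    */

        /* Demote the DPL to 3 so the reload succeeds from our ring3 guest. */
        d.access = (d.access & ~DPL_MASK) | (3 << DPL_SHIFT);
        private_gdt[index] = d;

        /* Re-executing the load now works natively against the private
         * table; the guest's own copy keeps its original DPL. */
    }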
OVERVIEW OF VIRTUALIZATION
==========================

Please refer to the following table. It should help visualize where the various components of our virtualization strategy lie.

TABLE
-----
          | HOST OS CONTEXT            | GUEST OS CONTEXT
  ===============================================================
  Ring0:  | host kernel/monitor module | monitor kernel
  Ring1:  |                            |
  Ring2:  |                            |
  Ring3:  | monitor app/IO emulation   | guest OS kernel + app

In this table, each "context" consists of both kernel and application code, its own set of page tables, descriptor tables, and other system mappings. We run our monitor application in the host OS, like any other program. One of its chores is to maintain the emulation of IO devices. Since we are emulating the devices at the host OS user level, we can make full use of all the host OS facilities that are available to any user program, for instance libc calls, GUI calls, etc. The host OS will natively schedule this monitor application task on/off the processor for time quanta, using its scheduling algorithm. The second chore, when not handling IO emulation, is to make sure the guest OS gets some time on the processor.

To do this, the host OS monitor application requests that the guest context be run for a time quantum, via a system call which is received by the host OS monitor module. Since this kernel module is privileged, it can store the necessary current host OS context, and switch over to the guest OS context. Over in this context, let's assume we have "pushed" guest OS kernel code from ring0 to ring3, to assist in virtualization. Other discussions explain why we do this. The context switch leaves us in the monitor kernel, which can then run the guest code.

Within the guest OS context, there will be potential exceptions generated due to virtualization of certain features. These will be handled internally by the guest OS monitor kernel. How the guest OS monitor kernel handles things can be controlled by calls from the host OS monitor application via the host OS monitor module, which can tweak things over in the guest OS context.

It is important to understand that the guest OS context does not handle managing real hardware, yet real hardware interrupts may occur during execution within the guest OS context. We must reflect these interrupts to the host OS, so it can handle them promptly. When the guest OS monitor kernel receives an interrupt that was meant for the host OS (a timer interrupt, for instance), it switches back over to the host OS context and lets it handle the interrupt. The issue here is that the next time we get a time slice from the host OS scheduler, we need to make sure it switches us to the context of the host OS monitor application, since execution of the guest OS kernel and application code (though at ring3) is not executing within the proper context. So in the guest OS monitor kernel, we need to fake a return from the system call mentioned above, such that we'd be back in the host OS monitor code and host OS context, before switching back to the host OS to handle the interrupt. In the next time slice our host OS monitor app gets, it will again make the system call, and we'll do it all over again, maintaining proper contexts without having to modify the scheduling algorithms in each host OS kernel.

SOME NOTES ABOUT HOST AND GUEST PAGING
======================================

So far we've hammered out much of the framework for virtualization, but haven't talked about an important topic: paging. Each of the OSs (host or guest) would have both a physical and virtual memory footprint if it were run on a real, unvirtualized PC. You probably have your computer configured with enough physical memory to have reasonable performance. Your OS chews up some memory for locking down the kernel, and multiplexes the rest for virtual memory.

Now comes along a second OS. Let's say we didn't have any kind of paging on the whole amount of memory needed by the guest OS context. We'd need to acquire this memory via the host OS kernel, so it doesn't use it for other purposes. It would all have to be locked down. This would tremendously decrease the amount of physical memory resources available for the host OS environment, and thus would be incredibly detrimental to its performance. One option would be to lower the amount of physical memory taken by the guest OS context and let the paging system in the guest OS handle virtual memory. This of course makes the performance of the guest OS worse, the cherry on top being the extra overhead incurred due to the framework in the guest OS monitor kernel. So we walk a tightrope here. So I'm throwing out the following idea.
If we could efficiently coordinate parts of the page tables between the host OS context and the guest OS context (remember, we're using different page tables too), then perhaps we could run the guest OS monitor kernel in locked down memory, and the guest OS kernel + app code (which is being run at ring3) in pageable memory. When a page fault is received by the guest OS monitor kernel, we could detect that it is due to a real page-not-present condition, and reflect that exception back to the host OS kernel, the same way we would when we reflect an external interrupt condition. What we are achieving here is a smaller physical memory consumption on the host OS, and use of the host OS's native paging system, so that the guest OS only hogs memory resources when it uses them. When it's idling, a large part of it could be paged out.

I'd like to get people's reaction on this stuff. Let's iron out the wrinkles in this, and then, I'm afraid, we're going to have to start writing some code... :^)

MAPPING THE MONITOR'S GDT AND IDT INTO THE GUEST LINEAR SPACE
=============================================================

The monitor is never directly invoked (called) by the guest code; in fact, the guest shouldn't even know about the monitor. The monitor is only invoked via interrupts and exceptions which gate via the monitor's IDT. And since entries in the IDT point into the GDT, we need to look at both with respect to where to map them into the current guest task's linear memory. As our IDT and GDT must occupy the same linear address domain as the guest code which is normally executing, we need to make sure there are mechanisms to allow these structures to coexist with the currently running guest task's address space. And keep in mind, there can be N different address spaces, depending on which guest task is currently running.

If we virtualize these structures, we need to maintain both the guest's copies of them, and modified working copies of such structures, which are actually used by the processor. When the guest OS accesses these structures, the monitor will receive a page fault, since we need to protect the pages which contain the guest's copy of them. Upon receiving the fault, the monitor can update the working copy used by the processor accordingly.

Another point worth noting is that the SGDT and SIDT instructions are not protected, and thus ring3 (user) code may execute them. They each return a base address and limit, the base address being a _pure_ linear address independent of the code and data segment base addresses. To offer really precise virtualization, in the sense that the user program will not detect us influencing the base linear address at which we store these structures, we could use the two following approaches.

Approach #1: If we are performing the pre-scanning technique, we could simply virtualize the SGDT and SIDT instructions, and emulate them to return the values which the guest code expects (a small sketch of this emulation appears below). In this case we can place the GDT and IDT structures anywhere in linear memory, such that they are in an area which is not currently used by either guest-OS or guest-user code. We have access to the guest page tables, so it is fairly easy to find a free area.

Approach #2: Under certain circumstances, we may be able to locate the working copies of the GDT and IDT structures at the linear addresses requested by the guest via loads of the GDTR and IDTR registers. If we can do this, then we may let execution of these instructions pass through without intervention. This is a condition that is likely to occur while running guest application code. It is common for an OS not to allow access from application code to these structures. Given this is the case, we can map our working copies into the current linear address space, right where they are expected to be.
MAPPING THE ACTUAL MONITOR INTERRUPT HANDLER CODE INTO THE GUEST LINEAR SPACE
=============================================================================

Now that we've discussed placing the GDT and IDT in linear memory, we need to map the actual interrupt handler code as well. Since we will be virtualizing the IDT and GDT, the guest OS will not see our segment descriptors and selectors, so we have some freedom here. We can place this code (by page mapping it) into an unused linear address range, again given we have access to the guest-OS page tables.

The interrupt handler code is actually just code linked with our host OS kernel module. The consideration here is that code generated by the compiler is based on offsets from the code and data segments. This code will not be calling functions in the host-OS kernel, and should be contained to accesses within its own code and data when used in the monitor/guest context. So we must set the monitor's code and data segment base addresses such that the offsets make sense, based on the linear address where we map in the code. For example, let's say our host-OS normally uses a CS segment base of 0xc0000000 (like previous Linux kernels) and our kernel module lives in the range 0xc2000000 .. 0xc200ffff. Then let's say that, based on empty areas in the guest-OS's page tables, we find a free range living at 0x62000000 .. 0x6200ffff. We would make the descriptor for our interrupt handler contain a base of 0x60000000, so that the offsets remain consistent with the kernel module code. And of course, we mark these pages as supervisor, so that in the case they are accessed by the guest OS, a fault will occur.

We will also be virtualizing the guest-OS page tables, protecting that area of memory, so we can update our strategies. Thus, we will know when the guest-OS makes updates to its page tables. This gives us a perfect opportunity to detect when an area of memory is no longer free. If the guest-OS marks a linear address range as not free anymore, and that conflicts with the range we are using for our monitor code, we can simply change the segment descriptor base addresses for code and data, and remap the handler code to another linear address range which is currently free. No memory transfers occur, only remapping of addresses. This kind of overhead will only occur once per time that we find we are no longer living in free memory. To reduce this even further, we could start out at, and use, alternate addresses which are known not to be used by particular guest OSes.

VIRTUALIZING THE TSS
====================

[I plan to fill in more here later +++]

TIMING ISSUES
=============

As we are allowing a great deal of code to run natively, emulation of the system timers (PITs) should be highly accurate as well. Keep in mind that the Programmable Interval Timers, like other IO devices, need to be emulated for use in the guest OS, since the real ones are in use by the host. Let's assume that we're not trying to multiplex the real timers, and that we do need to emulate them. We could generalize execution of guest code by saying that each time slice of guest code execution is bounded by an exception of some kind.
It may be generated by the host OS system timer, telling us our time slice within the host OS context is over. It may be due to a breakpoint or other protection installed by the virtualization, so we have a chance to emulate a behavior. Or it may be due to a true exception occurring within the guest OS. At any rate, the exception will vector through our monitor's IDT to one of our routines. We need a mechanism for measuring the time between such exceptions to facilitate an accurate timer emulation.

On Pentium+, a very good choice would be to make use of the time-stamp counter. The RDTSC (Read Time-Stamp Counter) instruction will give us an accurate reading which we can use. It is also executable from CPL==3, given that CR4.TSD==0, so we could use it efficiently in user-level monitor code if necessary. The only issue I can think of is that this would nail us down to Pentium+, as the 386 and 486 don't have this counter. I'd like to support these chips if possible, so perhaps we can conditionally sample the real PIT instead. This will give us much less resolution, but perhaps it's a reasonable alternative, since their execution speed is correspondingly lower. Even if the TSC (Time-Stamp Counter) is in use by the host OS, we should be able to multiplex its use between the host and monitor/guest environments. If so, we need to save and restore it to appropriate values as part of our warping to/from the host/monitor contexts.

We will have to build a certain amount of low-level time facilities into the monitor. Our PIT emulation, as well as the emulation of other IO devices, can make use of these high-resolution facilities. There are two main software components in our virtualization strategy. We have (1) a user program component which communicates with (2) a kernel module component through the normal kernel interfaces like ioctl(), read(), write(), etc. All or most of the device emulation (video board, hard drive, keyboard, etc) will be done at the user program level, and we'll make use of the standard C library interfaces and such to implement these. The reason I say most is that, for performance reasons, parts or all of some particular devices, such as the timer and interrupt controller chips, can likely be moved into the monitor domain. As was talked about before, this would alleviate a lot of context switching between the host/guest contexts. We don't have to do this kind of thing right away. Though, it's worth pointing out that parts of quite a few devices can be moved into the monitor. For example, the floppy controller could be done in the monitor, the floppy drive in the user program; the VGA adapter in the monitor, the CRT display in the user app; etc.

Anyway, we need some kind of accurate time reference and timer services from this virtualization framework. For example, to emulate the CMOS RTC, you need to be notified once per second so you can update the clock. Because of these needs, we need to develop such a framework. Our timer facilities have to relate very closely to the amount of real execution time the guest code has. What we don't want to use are time references based on the host OS system, as those are highly dependent on system load and other factors. Depending upon the guest code running, there may also be a considerable amount of time spent in the monitor as part of the virtualization implementation, for certain local chunks of guest OS code. We should exclude this time if at all possible, since it is not time when the guest OS is really running, and it would skew the time reference.

So our approach could go something like this. Each time, just before the monitor hands over execution to the guest code, we take a snapshot of time using the RDTSC instruction. Linux even defines an asm macro for this. :^) Upon the next invocation of our monitor code (via an interrupt or exception), we take a 2nd sample using the same instruction. Now we have an accurate measure of how long the guest code actually ran without intervention. We pass this duration to the timer framework.
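A minimal sketch of this sampling, using GCC inline assembly for RDTSC; the timer_framework_elapsed() hand-off is a placeholder for whatever interface the timer framework ends up having.

    #include <stdint.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    extern void timer_framework_elapsed(uint64_t cycles);   /* placeholder */

    static uint64_t tsc_at_entry;

    void monitor_enter_guest(void)        /* just before resuming guest code */
    {
        tsc_at_entry = rdtsc();
        /* ... switch over to the guest context here ... */
    }

    void monitor_exception_entry(void)    /* first thing on any fault/interrupt */
    {
        uint64_t now = rdtsc();
        timer_framework_elapsed(now - tsc_at_entry);
    }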
If there are requests from the device models to be notified given the elapsed time, then we call them. If they live in the user app world, then we return back to the user app, which sees this as a return from the ioctl() call, with some fields filled in, like how long we ran for, etc. If we were wicked perfectionists, we could subtract from our RDTSC values the number of cycles it takes to get the guest code started again, and for the exception to occur.

Of course, the guest code we run at any one time could conceivably not invoke the virtualization monitor before our next device model requests being notified. The next bounding event would then be caused by a hardware interrupt redirect. Each host OS may have set the IRQ0 timer to interrupt at a particular rate, but let's say it's 100Hz, or every 0.01 seconds. Let's say a device model wants to be woken at, say, 0.005 seconds, and that some guest code runs which is not naughty and doesn't invoke the monitor during the next user process time quantum. So if we wanted highly accurate timing, we need a mechanism for interrupting us in the middle. Fortunately, the built-in APIC on the Pentium has a timer based on the CPU clock speed which can do this. It can be programmed in either periodic or one-shot mode. (Thanks to one of the developers for suggesting use of this timer facility.) If we saw this condition, we could set the APIC timer to go off at the equivalent of 0.005 seconds, and our monitor will be notified right on the money. Other tricks, like temporarily reprogramming the PIT or the CMOS timer for a finer grained interrupt during that one quantum, could be used as a back-up plan for CPUs without the APIC timer capability. For these CPUs, rather than getting a time reference by way of reading the time-stamp counter with RDTSC, we could read the PIT counter register. This is not as high-resolution, but perhaps functional enough. I suppose, for starters, we could declare the resolution of our timer facilities to be, at best, the interval of the host OS's periodic interrupt rate. :^)

To tie this together with the FreeMWare code we have already, let's look at how this plays out for another contrived example. Again, the host OS uses a 0.01 second periodic interrupt. And let's say the next interrupt required is at 0.035 seconds. The user app code component would probably look something like this:

So far, all time reference has been relative to the execution of guest code. This is the accurate way to make things respond to the guest code properly. There are, however, things which are better tied to the host OS time reference. Let's pick on the VGA emulation. There really are two parts to it. The hardware adapter emulation is the first. It needs to live in the time reference of the guest code for accurate emulation.
It does not care if there is a CRT attached to it or not; or in other words, whether you actually view the output or not. The emulation of spewing the frame buffer output to your CRT can be done in any time reference. This will be implemented using a GUI library (X11 on a lot of platforms), at the user application level. You might want to refresh the output every so often, if it is updated, but not too often, otherwise you'll bog the system down. In this scenario, we are better off using the timing facilities of the host OS. It's probably better to move this function off into a separate thread/process. There are other device models which are candidates for this sort of separation. We can look into this more as we go.

TRANSITIONING FROM THE MONITOR TO THE HOST LINEAR ADDRESS SPACE
===============================================================

Yes, these are some good comments; I've been down this road and found the same. There are some hacky ways I was thinking of for utilizing tasking, but getting back to the host context is worse, since you don't have the same room to play.

The simplest way is to have a host<->monitor shim which sits within a single page of memory. The address in the host world is just the linear address where insmod places your module code. When you're in the guest and you want to get back to the host, you make a quick switch of that page mapping to make it point to the same physical address as in the host, invalidate the TLB entry for it, then jump to it. (Remember the differences in CS base values, though!) A similar process happens in reverse. After you've made the transition, you have to restore the page mapping back to the way the guest wants it. During normal operation in the guest, the page of shim code does not exist in the linear addressing world, or at least it doesn't need to. I call this the "worm hole" technique. That one page (it could be more, but you shouldn't need it) is the nexus between the host linear world and the guest linear world. You open it up, make your quantum jump into the next world, then close the worm hole behind you.

VIRTUALIZING THE GUEST OS PAGE TABLES
=====================================

To implement our virtualization framework, we need to make heavy use of the CPU's native paging system. We use it to protect pages against being accessed by the guest code, and to play other tricks. The guest OS page tables represent the way it expects to allocate, protect, and map linear to physical memory when it is running natively on your machine and at the expected privilege levels. Note that (as per our current strategy), before we even begin execution of the guest environment, our host kernel module allocates all of the physical memory for the guest virtual machine. This memory is non-paged in the host OS, to prevent conflicts. As such, we can take some liberties with the paging mechanisms, using page faults as a mechanism for the monitor to intervene when the guest is accessing data which we need to virtualize.

Since we are virtualizing the guest OS code, changing the privilege levels it runs at, changing the physical addresses at which pages reside, and using the paging protection mechanism for other tricks, we can not just use the guest OS page tables as-is. We must use a modified version of them, and be careful that these modifications are not visible to the guest OS. What better mechanism to virtualize the guest page tables than the paging unit!
Getting back to the page tables: we can use protection flags in the page table entries to be notified when the guest OS attempts an access to its page tables. At that point, we can feed it the data it expects (read), or maintain the effect that it intended (write). In this way, the guest never senses our modifications. When there is a change of a page directory or page table entry by the guest OS, we look at the requested physical page address, and can map this to the corresponding memory allocated for the guest VM by the host OS module. Our page fault handler must look at the address whose access generated the fault, and determine whether the fault is a result of our monitor virtualizing system data structures such as the page tables, descriptor tables, etc. (in which case the access is valid, but we need to feed it fake data), or whether the access was truly to a non-existent or higher privilege level page (in which case we have to effect a page fault in the guest OS).

One of the behavioral differences that results from us "pushing" ring0 guest OS code down to ring3, to be run within the VM environment, is that we no longer have the two-level page protection behavior which the guest OS expects. The paging unit classifies all code in rings {0,1,2} as supervisor code, which can essentially access all pages. Code which runs in ring3 is classified as user code and can only access user-level pages. But if we push all rings of guest code to ring3,

    guest     monitor
    0--+         0
    1  |         1
    2  |         2
    3--+-------->3

then there is no longer any access level distinction between guest OS code and guest user code. There are a couple of fundamental approaches we can take to solve this behavioral difference.

APPROACH #1

For each set of page tables which is encountered while running the guest OS, we could maintain two virtualized sets of page tables. A first set would represent the guest's page tables biased for running the guest *user* app code within our virtualized environment. Since we'd be running ring3 code at ring3, this would be fairly straight-forward. Only those pages normally accessible from ring3 would be accessible to guest user code. A second set would represent the guest's page tables biased for running the guest *system* code within our virtualized environment. Since the guest's system code expects to be able to access all pages, we can mark system and user pages all as user-level privilege in this set of page tables - except where we protect otherwise to implement various virtualization tricks.

APPROACH #2

Rather than push guest system code down to ring3, we could push it down to ring1 instead. This would yield a privilege level mapping such as the following.

    guest     monitor
    0--+         0
    1  +-------->1
    2            2
    3----------->3

As far as page protection goes, this would offer us an environment where we could use the page protection mechanisms more natively, and which would only require one private set of page tables per guest set. Since x86 paging groups all CPLs of {0,1,2} as supervisor-mode, the shift from 0 to 1 in the example above does not affect privilege levels with respect to paging. All the instructions whose ability to execute is based solely on CPL generate an exception when not run at ring0. So execution at ring1 will have the same effect for those instructions as execution at ring3. The question is, which other virtualization strategies does this interfere with? Well, an obvious point is that since we're modifying the RPL of the selector from 0 to 1, we need to virtualize instructions which can expose the RPL.
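For instance, here is a minimal sketch of how the monitor might hand back a selector value carrying the RPL the guest expects when it emulates a read of a segment register. The function and variable names are assumptions for illustration, not actual FreeMWare code.

  /* Hypothetical sketch: hide the RPL adjustment from the guest.  If
   * guest ring0 code has been pushed to ring1, selectors it loaded carry
   * RPL=1, so a virtualized segment register read should hand back the
   * value the guest originally wrote (RPL=0).  Names are invented.       */

  #define RPL_MASK 0x0003

  static unsigned short
  virtualized_selector_read(unsigned short live_selector,
                            unsigned short guest_rpl)
  {
      /* Replace the RPL we imposed with the RPL the guest expects. */
      return (unsigned short)((live_selector & ~RPL_MASK) |
                              (guest_rpl & RPL_MASK));
  }

The scan-before-execute pass (or the fault handler) would substitute this value into the destination register when it encounters such an instruction.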
Instructions such as "MOV AX, DS" and "PUSH ES" are among the ones which need to be virtualized this way. [I plan to fill in more here later +++] (See Caveat #2)

Since executing privileged instructions pushed down to ring1 will result in an exception anyway, we could choose to virtualize them instead, and run them at their original privilege level:

    guest     monitor
    0----------->0
    1----------->1
    2----------->2
    3----------->3

Perhaps we would be able to gain something, in the way of allowing certain descriptor accesses to occur naturally. We can certainly only do this when our virtualization is controlling execution, by way of the scan-and-execute technique. We must never allow execution to reach and execute a privileged instruction natively, or other non-privileged yet sensitive ones. [I plan to fill in more here later +++]

MAINTAINING THE VIRTUALIZED PAGE TABLES
=======================================

Regardless of the method chosen, we still need to define how the page tables are maintained. Within a hypothetical guest OS, there could be N active page tables, let's say one for each running task. Note that there may be one or more common regions in the linear address spaces described by these page tables. This would be the case, for instance, where each application has its own mappings for a particular linear region, and the OS code maintains a consistent set of mappings in a different linear region. When the guest OS attempts a switch of page tables (generally associated with a hardware or software task switch), the monitor will intervene and effect that change in light of our virtualization strategies. Again, we have some options to discuss. In short, we can either dump the old page mappings and start anew upon each reload of the PDBR, or try to be smart and store multiple sets of page tables.

The simplest approach is to dump the old page table mappings each time the guest requests a reload of the PDBR register. We could then mark all the entries in the virtualized page directory as not present, so that we could dynamically build the virtualized page tables. Each time a new directory entry was accessed, the page fault would be handled by the monitor, which could in turn mark all the page table entries not present except for the one it needed to build. (we of course exempt the small region where our interrupt and exception handlers have to exist) Using this technique, we could also dump our page mappings during a privilege level transition. If we choose to do this, then the issues above regarding which privilege level to push guest system code to are moot, since we just rebuild the page tables dynamically according to the effective CPL of the guest code.

A more complex approach would be to save virtualized page table information across PDBR reloads. The idea here is that when the guest OS schedules a task whose page tables are already virtualized and stored, we can save a number of page faults and the execution of associated monitor code, which would otherwise be incurred from the dynamic rebuilding of the page tables. This technique does generate some issues. One is that it requires more memory for storage of the additional page tables. It also requires additional logic to keep track of the page tables, and must properly handle situations where parts of multiple page tables are shared.
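To make the simpler "dump and rebuild" approach a little more concrete, here is a minimal sketch. The structure names and helper routines are assumptions invented for illustration; they are not actual FreeMWare code.

  /* Hypothetical sketch of dumping the shadow (virtualized) page tables
   * on a PDBR reload and rebuilding entries lazily on page faults.
   * All names and helpers below are invented for illustration.           */

  #define PDE_PRESENT 0x001UL

  extern unsigned long shadow_page_directory[1024];  /* monitor's private PD */
  extern unsigned long current_guest_cr3;

  extern unsigned long read_guest_phys(unsigned long guest_paddr);
  extern unsigned long guest_to_host_frame(unsigned long guest_pde);
  extern unsigned long adjust_protection_bits(unsigned long guest_pde);
  extern void          reflect_page_fault_to_guest(unsigned long laddr);

  /* Guest reloaded the PDBR: throw away the old virtualized mappings and
   * start with an empty shadow directory.  Every entry is not-present,
   * so the first touch of each 4MB region faults into the monitor.        */
  void monitor_pdbr_reload(unsigned long guest_cr3)
  {
      int i;
      for (i = 0; i < 1024; i++)
          shadow_page_directory[i] = 0;
      current_guest_cr3 = guest_cr3;
  }

  /* Page fault path: lazily build only the directory entry that was
   * touched, translating the guest's notion of a physical frame to the
   * real frame the host module allocated for the VM.                      */
  void monitor_build_pde(unsigned long fault_linear_addr)
  {
      unsigned int  pdi = (unsigned int)(fault_linear_addr >> 22);
      unsigned long guest_pde = read_guest_phys(current_guest_cr3 + 4 * pdi);

      if (!(guest_pde & PDE_PRESENT)) {
          /* Truly not present in the guest's tables: reflect the fault. */
          reflect_page_fault_to_guest(fault_linear_addr);
          return;
      }
      shadow_page_directory[pdi] =
          (guest_to_host_frame(guest_pde) & ~0xfffUL)
          | adjust_protection_bits(guest_pde);
  }

The same lazy construction would then happen one level down, for the individual page table entries within the faulting region.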
THE ACCESSED AND DIRTY BITS
===========================

If we maintain a private copy of the page directories and page tables, it is not enough to only monitor changes in the guest's tables and modify our private copy accordingly. The accessed and dirty bits (only the accessed bit in the directory) allow the CPU to give feedback to the OS on the use of page directory and page table entries, by updating these fields. We must therefore ensure coherence between these flags as they are updated in the private tables by the CPU, and those in the tables provided by the guest which we are virtualizing. We certainly can't, or at least don't want to, do this on an instruction-by-instruction basis. Two good places to make this update are upon the guest's access to its page tables, and at the time when we "dump" virtualized page mappings for a previous set of page tables. To allow the monitor a point of intervention when the guest accesses its page tables, we can mark these regions (where the guest actually stores its tables, not the private ones we use) with page protections such that the guest code will generate a fault; supervisor privilege will do if we are pushing all guest code to ring3, and not-present will do if we are pushing guest system code to ring1. During the fault, our monitor will have to complete the update from the A & D bits in our private tables to those in the guest's tables. (A minimal sketch of this propagation is given a bit further below.)

PASS-THROUGH IO DEVICES
=======================

The question has been asked on several occasions, whether a guest OS can make use of native hardware. In general, the hardware that the guest OS sees is limited to the device emulation which is offered to it by the virtualization environment. The device emulation, in turn, makes use of the real hardware driven by the host OS, via functionality which is exported by the host OS such as libc, GUI libs, ioctl() calls, etc. So, without any extra logic, the guest OS will not be able to use hardware which is not supported by a host OS driver and service. This is unfortunate, as it would be helpful to allow a guest OS which has a driver for a specific piece of hardware not supported by the host OS to drive that hardware. An example of this would be to allow Windows as a guest OS to drive a winmodem not supported on a Linux host OS. This is where the concept of a pass-through device comes in. If a device is not already driven by the host OS, then there is potential for the virtualization to allow the guest OS to communicate directly with the device, within the constraints of a pass-through mechanism. I believe we should explore this area at some point. We'll need to look into what is needed to accurately pass through IO reads and writes, DMA, IRQs, Plug-N-Play issues, etc. This has been done before in other projects to some degree. Anybody want to write up a section on this, and give us a background on its use in other projects?

CUSTOM GUEST OS SPECIFIC DRIVERS
================================

Due to performance considerations, lack of documentation for a specific hardware device, or development time issues, one may not want to develop emulation for a real piece of hardware. Instead, you could create a pseudo device emulation, and a guest OS specific driver to communicate with it. Since you create the hardware and the data exchange protocol, there is large potential for performance gains, at the expense of having to write a device driver for each guest OS. It makes more sense to start out development emulating a common set of devices, like the ones in bochs.
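Stepping back to the accessed and dirty bits for a moment, here is the minimal sketch of the propagation promised above. The names are assumptions for illustration only.

  /* Hypothetical sketch: fold the accessed/dirty bits that the CPU set
   * in our private (shadow) entries back into the guest's own entries.
   * We only ever OR the bits in; clearing them is the guest's business.  */

  #define PTE_ACCESSED 0x020UL
  #define PTE_DIRTY    0x040UL

  static void sync_accessed_dirty(unsigned long *shadow_entry,
                                  unsigned long *guest_entry)
  {
      *guest_entry |= *shadow_entry & (PTE_ACCESSED | PTE_DIRTY);
  }

It would be called from the fault handler when the guest touches its own tables, and again whenever a set of virtualized mappings is dumped.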
Returning to guest-specific drivers: I do think it's worth considering the benefits derived from writing custom device emulation and associated OS drivers soon after. The most notable areas where we can get some real gains are the video, disk and network devices.

USING VIRTUALIZATION ENVIRONMENT FOR OS DEBUGGING
=================================================

A secondary use of an accurate virtualization environment is for debugging a guest OS's kernel code. It is often difficult to debug kernel code, since the debugging process can be intrusive to the behaviour of the kernel. It would be useful to have an option to conditionally compile for a debug environment, where some code in the host OS would control the virtualization environment, giving the ability to debug the guest OS in a non-intrusive manner. This would be invaluable to OS and driver design teams. Given we employ a virtualization strategy which can virtualize arbitrary IA32 instructions, we would enjoy a great amount of flexibility. This would essentially allow us to place an effectively unlimited number of unintrusive instruction breakpoints in the VM. Or we could pass to the VM a set of IA32 instructions that we would like to gather instrumentation data on, and collect data whenever those instructions are executed.

IO DEVICES FOR THE GUEST OS
===========================

To virtualize a complete machine, we must virtualize both the processor and the hardware devices that constitute the machine. We've talked quite a bit about virtualizing the x86 (IA32) processor. So now, we must explore issues surrounding virtualizing the hardware (IO) devices. The nature of most devices is such that they are built to be driven by one component of an operating system, the driver. No other software may interfere with this interface, otherwise there will be conflicts. As a result, we can not let the virtualized guest OS drive the same devices as are already being driven by the host OS, lest mayhem ensue. Thus, we have no choice but to intercept all IO accesses from the CPU, and model (emulate) a complete set of devices in software. We need to pass IO reads and writes to/from the emulation, such that the IO instructions believe they are getting such information from real devices. This is a very familiar process, as it is done in many other emulation projects. Fortunately, there is emulation for a reasonable set of IO devices already implemented and functioning in bochs. There is emulation for an IDE drive, IDE ATAPI CDROM, VGA+monitor, floppy, keyboard, PICs, PITs, CMOS and RTC, a limited serial port, a limited NE2000 network card, etc. These components are compatible, at least to some degree, with DOS, Win95, WinNT, Linux, Minix and other OSes. We may be able to share other components with various other emulation projects if need be.

Referring to the section OVERVIEW OF VIRTUALIZATION, it's worth looking at the changes in context involved with virtualizing an IO operation. If the emulation of the IO device occurs in the monitor application running in the host OS context, then in order to virtualize an IO operation in the guest context, the following context transitions would occur for one instruction.

- IO instruction triggers exception; guest monitor receives the exception via a gate in its IDT.
- guest monitor warps back to host monitor module.
- host monitor module transitions back to host monitor application, passing IO port information. Operation is emulated.
- monitor app calls host monitor module.
- host monitor module warps to guest monitor context; guest monitor effects the IO operation for the guest code.
- guest monitor transitions back to guest code and resumes execution.

That's 6 transitions, some of which can be very cycle expensive. For occasional IO, this is perhaps not so much of a problem. It is more of a problem for emulation of devices which receive very frequent IO requests. For example, a disk drive in programmed IO mode would experience some serious performance penalties here. Also, the VGA (especially in planar mode) would have the same issues. The VGA frame buffer is just a memory mapped IO device. While it would be nice to just take a snapshot of the frame buffer every now and then, the latching modes make this impossible.

It's worth reiterating why we want to get back to our user mode application in the host world. Essentially, we want to have access to services provided by the host OS; access to libc, the GUI, ioctl() calls and such. However, we don't need these services for a great deal of the emulation of some of the devices. So a very logical step is to split up the emulation of each device into that which can be emulated without host services, and that which can not. For most devices, this is fortunately a very natural split. For example, let's look at the VGA emulation. The actual VGA controller emulation has no need for host services. This can easily be moved into functionality provided by the monitor kernel. Periodically, we need to update our GUI window with a snapshot obtained from the VGA emulation. That update can be done back in the host application space, and is not as time critical. Some of the devices, such as the PITs and the PICs, can be moved entirely into the monitor kernel functionality, since they require no host OS services. At any rate, given an IO operation can be serviced by code in the monitor kernel, here is the set of context transitions which would occur.

- IO instruction triggers exception; guest monitor receives the exception via a gate in its IDT, and effects the IO operation for the guest code.
- guest monitor transitions back to guest code and resumes execution.

So we have trimmed down to 2 transitions. Better yet, our context switches were only "vertical" privilege level transitions. Moving "horizontally" on the diagram (between the host and monitor worlds) is very cycle expensive, first because it involves a lot of context saving/restoring, and second because it mandates that we reload the paging register each time. Though, there may be some value in using Global pages here (persistent across page register reloads) for better performance, when this feature is available. (the Global bit was introduced on the Pentium Pro)

BIOS
====

We will need a reasonably complete BIOS which is compatible with the devices we emulate. I am donating the BIOS I wrote for bochs, which is of course compatible with the device emulation from bochs, to serve this purpose. We are free to use another BIOS as well, but it's best to start with a known quantity.

USING HARDWARE BREAKPOINTS IN THE GUEST OS
==========================================

One of the benefits of using software breakpoints as part of our virtualization strategy is that it leaves open the possibility of supporting hardware breakpointing in the guest OS, which can be very valuable. Access to the hardware breakpoint registers is always virtualized in our strategies. This gives us some flexibility. We'll always know when the guest attempts to change the debug registers, or read them.
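As a small illustration of that interception, here is a sketch of how the monitor might recognize and emulate guest accesses to the debug registers once guest system code no longer runs at ring0. The context structure and helper names are assumptions invented for illustration, not actual FreeMWare code.

  /* Hypothetical sketch: MOV to/from DR0..DR7 is privileged, so once
   * guest system code runs above ring 0 these instructions fault with a
   * general protection exception and land in the monitor.                */

  struct guest_context {
      unsigned long eip;
      unsigned long virtual_dr[8];   /* the guest's view of DR0..DR7 */
      /* ... general registers, segment state, etc. ... */
  };

  extern unsigned char *guest_code_pointer(struct guest_context *ctx);
  extern unsigned long  get_guest_gpr(struct guest_context *ctx, int reg);
  extern void           set_guest_gpr(struct guest_context *ctx, int reg,
                                      unsigned long value);

  void monitor_handle_dr_access(struct guest_context *ctx)
  {
      unsigned char *insn = guest_code_pointer(ctx);
      int dr  = (insn[2] >> 3) & 7;    /* DR number from the ModRM byte  */
      int reg = insn[2] & 7;           /* general register from ModRM    */

      if (insn[0] == 0x0f && insn[1] == 0x21) {
          /* MOV reg, DRn: hand back the guest's view of the register.   */
          set_guest_gpr(ctx, reg, ctx->virtual_dr[dr]);
          ctx->eip += 3;
      } else if (insn[0] == 0x0f && insn[1] == 0x23) {
          /* MOV DRn, reg: record the guest's value; deciding what, if
           * anything, goes into the real DRn is a separate policy step.  */
          ctx->virtual_dr[dr] = get_guest_gpr(ctx, reg);
          ctx->eip += 3;
      }
  }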
We are free to alter these values to something that fits our virtualization scheme. For instance, if we employ the split I&D TLB method, then while we're running in the private code page, if there is a hardware breakpoint which points to a linear address in the original code page, we can temporarily change it to point into the private code page. Since we monitor out-of-page control transfers, we always have a place to synchronize the debug registers.

Also an issue is the use of hardware breakpoints in the host OS. If, during a transition from host to monitor context, we find that any hardware breakpoints are enabled, we must save the debug register contents. We must then disable any enabled ones, or replace them with values from the monitor/guest environment. A restore of these values is of course necessary on the transition back to host context. If the host OS has not enabled any hardware breakpoints, and they're not currently in demand from the guest OS, we also have the option to do nothing on the context switch, and then dynamically save the debug registers upon demand from the guest OS.

PROFILE FEEDBACK DATABASE ASSOCIATED WITH EACH GUEST OS
=======================================================

Each guest OS that you might run in the VM has its own set of features and resource usage. Depending on the usage of such resources, it may be possible to tweak the virtualization to do something more efficiently. It may therefore be beneficial to store certain information about the resource usage of a given guest OS, along with its configuration files. This would give us an adaptive way to tune the VM for the next time this guest OS is booted. Alternatively, the user could provide info about the kind of guest OS to be run. This is a more coarse-grained and non-adaptive alternative, but perhaps effective.

PAGE CLUSTERS
=============

Some of the techniques, such as the split I&D TLB technique and the dynamic scan-before-execute technique, are discussed using a single-page oriented strategy. For example, let's look at the scan-before-execute technique. As mentioned, we always intervene when out-of-page branches are taken (and computed ones). Each such intervention results in an exception and the execution of additional monitor code, and of course a corresponding performance hit. We may find that code does not exhibit a high enough degree of locality within the constraints of a single page to execute as efficiently as we'd like, due to this intervention. A natural extension to the scanning technique is to look at a cluster or series of pages as an atomic region of memory. So rather than thinking about branches out-of-page, we can look at a set of N pages as one entity and then only control static branches out of the N-page region. We let intra-region branches execute without being virtualized. As a consequence, where we would completely invalidate a page using the single-page strategy, we must now invalidate the whole page cluster, since it is an atomic structure.

There are several approaches we could take in defining what a page cluster is. A simple approach might be to define a page cluster as a contiguous region of N pages, aligned on an N-page boundary. Effectively, this gives us a larger page size, extending it from 4096 to N*4096 bytes (a small sketch of this simple form is given below). Perhaps we would find that expanding the effective page size using this technique would solve a fair amount of our code locality performance concerns. Taking a more dynamic slant, we could decide to start with a single page, and annex adjacent pages, up to a certain limit.
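Here is the small sketch of the simple, aligned form of clustering referred to above. The cluster size N is an assumption and would have to be tuned experimentally.

  /* Hypothetical sketch of the simple clustering scheme: treat N
   * contiguous, N-page-aligned pages as one unit.  N=8 is an arbitrary
   * assumption; only branches that leave the cluster need to be
   * virtualized by the scan-before-execute pass.                        */

  #define PAGE_SHIFT     12
  #define CLUSTER_PAGES  8                        /* N                   */
  #define CLUSTER_SHIFT  (PAGE_SHIFT + 3)         /* log2(N) = 3         */
  #define CLUSTER_BYTES  (1UL << CLUSTER_SHIFT)   /* 8 * 4096 = 32768    */

  static inline unsigned long cluster_of(unsigned long linear_addr)
  {
      return linear_addr >> CLUSTER_SHIFT;
  }

  static inline int branch_leaves_cluster(unsigned long from, unsigned long to)
  {
      return cluster_of(from) != cluster_of(to);
  }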
With the more dynamic approach, for the cost of a little more logic, we would prevent the inclusion of extraneous pages that have no local static branches into the current cluster. The reason we don't want to include any more pages than necessary is that when we do need to invalidate pages, we'd like to invalidate as few as possible. The same reason also puts a limit on the number of pages we should let a page cluster grow to. If we were really into data-flow analysis, we could dynamically maintain edge graphs of all the code branching, and create non-contiguous page clusters accordingly. Of course, this means that part of our code page invalidation maintenance would involve an algorithm to invalidate parts of the edge graph. I'll leave this to the people with higher degrees. :^)

CONDITIONS UNDER WHICH SCAN-BEFORE-EXECUTE CAN BE ELIMINATED
============================================================

[I plan to fill in more here later +++]

USING PROTECTED MODE AND V86 MODE VIRTUAL INTERRUPTS
====================================================

[I plan to fill in more here later +++]

CAVEATS
=======

Caveat #1: If we modify fields in the guest descriptors, or introduce extra descriptors into the guest descriptor tables, then the guest can conceivably see these differences with LSL, LAR, or perhaps other segment oriented instructions. For perfection, if such modifications are done, we need to virtualize those instructions as well.

Caveat #2: If we want supervisor-level code to access pages in a write-protectable way, we have to turn on the CR0.WP flag to get this effect. Otherwise, supervisor code can stomp on any page regardless of its read/write flag. This flag will already be on for host OSes which use an efficient fork() implementation, since it is needed for an on-demand copy-on-write strategy. If not, it can easily be saved/restored during the host<->monitor switch.
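A minimal sketch of that save/restore, under the assumption that the monitor module wraps it around each host<->monitor switch (the function names are invented for illustration):

  /* Hypothetical sketch for Caveat #2: make sure CR0.WP is set while the
   * monitor/guest context runs, so that even supervisor-level accesses
   * honor the read-only bits we rely on, and put the host's original
   * value back on the way out.  Names are invented for illustration.    */

  #define CR0_WP (1UL << 16)

  static unsigned long saved_host_cr0;

  void monitor_enter(void)
  {
      unsigned long cr0;

      asm volatile("mov %%cr0, %0" : "=r"(cr0));
      saved_host_cr0 = cr0;
      if (!(cr0 & CR0_WP))
          asm volatile("mov %0, %%cr0" : : "r"(cr0 | CR0_WP));
  }

  void monitor_exit(void)
  {
      asm volatile("mov %0, %%cr0" : : "r"(saved_host_cr0));
  }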