Tuesday, September 10, 2019

Hyper-V memory internals. Guest OS memory access

Software used in this article (operating systems have the August 2019 patches installed):
Windows 10, build 1903 x64
Windows Server 2019
Windows Server 2016
WinDBG Preview
Visual Studio 2019
Process Hacker
PyKd plugin for WinDBG

The test lab runs on an Intel-based PC. Therefore, Intel-specific Hyper-V terms (hvix64.exe, the vmcall instruction, etc.) are used throughout the article.

Terms and definitions:

  • WDAG – Windows Defender Application Guard;
  • Full VM (virtual machine) – a virtual server created in Hyper-V Manager. It differs from a WDAG container, Windows Sandbox and a Docker container in Hyper-V isolation mode;
  • Root OS – the operating system where the server part of Hyper-V is installed;
  • Guest OS – an operating system that runs in the Hyper-V emulation context and uses virtual devices presented by the Hyper-V infrastructure. It can be a Full VM or a Hyper-V container;
  • TLFS – Hypervisor Top-Level Functional Specification 5.0;
  • GPA (guest physical address) – Guest OS physical memory address;
  • SPA (system physical address) – Root OS physical memory address;
  • Hypercall – a hypervisor service function, called by executing the vmcall instruction with the hypercall number specified;
  • PFN – page frame number.

Source of hvmm.sys driver on github.com:

Python-script for GPAR and MBlock objects parsing

I haven't written anything on my blog for a long time. That doesn't mean I stopped researching Hyper-V. When Microsoft released WDAG in Windows 10, build 1803, I started to investigate it, but ran into many problems. First, it was impossible to attach a debugger to the container, because it doesn't support that. WDAG is an isolated environment, and the bcdedit debugging options can't be configured. Moreover, every configuration option is reset after a reboot. Sysinternals LiveKd supports attaching to Hyper-V, but compatibility was broken in the latest OS versions. In addition, the HvReadGpa hypercall, which LiveKd uses to read guest OS memory, doesn't work with containers.
It was a stalemate, but then Matt Suiche (@msuiche), founder of Comae Technologies, shared the LiveCloudKd source code with me (many thanks to him!). That program allows attaching WinDBG to a guest OS, using the vid.dll API to read guest OS memory. The next problem was that Microsoft blocked the execution of vid.dll: its functions can be executed only from the vmwp.exe process context. Otherwise the call is blocked by the vid.sys driver, which compares the _EPROCESS object of the function's user mode caller with the _EPROCESS of the parent vmwp.exe. Additionally, some of the original LiveCloudKd techniques stopped working in Windows 10, so I had to update them too.
Adapting LiveCloudKd helped me understand Hyper-V guest memory internals better. Soon afterwards Matt published the sources on GitHub (https://github.com/comaeio/LiveCloudKd).
In 2017, Andrea Allievi gave a presentation on the Hyper-V memory management architecture (www.andrea-allievi.com/files/Recon_2017_Montreal_HyperV_public.pptx). It is good work, but the details were described quite abstractly, and it was hard to match the information from the presentation to the real vid.sys code. I believe this is because Hyper-V symbol information had not yet been published at the time of the presentation.
By the way, thanks to Andrea for pointing me to the names of some vid.sys structures.
Additionally, I need to thank Microsoft, which decided to publish symbols for many Hyper-V modules (https://docs.microsoft.com/en-us/virtualization/community/team-blog/2018/20180425-hyper-v-symbols-for-debugging). Without them it would have been hard to analyze the memory management functions of vid.sys.
At first, I planned to write an article about Hyper-V containers, but my research log grew to more than 150 pages (font size 6 to 9), and I still didn't understand the whole scheme. After that I decided to make a list of Hyper-V container components (it was later extended into a cheat sheet of all Hyper-V components, because not many files had to be added: containers and Hyper-V share a very similar component base).
Then I realized that there are too many components to describe in one article. Therefore, I decided to cover the most interesting part, guest OS memory structures, in a separate article.
Why guest OS only? The memory of the Hyper-V kernel, hvix64.exe, is already described in the TLFS, and de facto the hypervisor is involved in memory operations only at the allocation\deallocation stage. The guest OS reads and writes its memory independently of the hypervisor. Of course, the hypervisor sets memory access attributes and isolates guest OS memory from the root OS and from other guests, but this is done by hardware features such as EPT and doesn't need to involve the hypervisor in every memory read\write operation.
I describe memory access for Full VM, WDAG, Windows Sandbox and, briefly, Docker containers. The hvmm driver was created during the research. Its main function is to provide an interface for reading guest OS memory from the root OS without using the vid.sys or hvix64.exe APIs. The driver was integrated into the LiveCloudKd project.
A detailed description of Hyper-V internals will appear in part 2 of the Windows Internals book, 7th edition, which is being written by Andrea Allievi. But while the book is still in development, you can read a short description of Hyper-V guest OS memory structures in this article :)
Let's begin.
Direct memory access to Full VM and Hyper-V containers
Vmwp.exe is the main process that controls guest OS execution and provides device emulation. It is launched by vmcompute.exe, which is managed by vmms.exe for Full VM, hvsimgr.exe for WDAG, WindowsSandbox.exe for Windows Sandbox and docker.exe for Docker containers. On startup, the vmwp.exe process accesses the hypervisor interfaces (hypercalls) through the vid.dll interface. I collected hypercall usage statistics for a Windows Server 2019 VM, a Docker container in Hyper-V isolation mode (nanoserver image: 1809) and a WDAG container. The WDAG container generates so many hypercalls that, because of delays caused by the debugger writing results, the container started to shut down immediately after being turned on (the WDAG management application hvsimgr.exe enforces execution timeouts for some procedures). Therefore, the WDAG results contain only a summary figure. (I want to try DTrace, which became available on Windows relatively recently, to collect such statistics: in theory, it should reduce the cost of recording the collected data and avoid the hvsimgr.exe timeouts.) Shutdown statistics were recorded separately, so the approximate order of magnitude can be estimated. Compared to Full VM, it is quite large:
Which categories of hypercalls can be distinguished in these statistics? Partition creation, configuring partition properties, creating virtual processors and virtual ports (used to send signals and messages), setting intercepts, and various hypercalls for memory management.
Look at the winhvr.sys!WinHvMapGpaPagesFromMbpArrayScanLargePages function. Rdx contains the page number, rsi the size (in pages).
When we start Windows Server 2019 with 1500 MB of RAM, we get:
1st call: rdx=0000000000000000 rsi=000000000005dc00
2nd call: rdx=00000000000f8000 rsi=0000000000000800
3rd call: rdx=0000000000fff800 rsi=0000000000000800

When we start Windows Server 2019 with 2300 MB of RAM, we get:
1st call: rdx=0000000000000000 rsi=000000000008fc00
2nd call: rdx=00000000000f8000 rsi=0000000000000800
3rd call: rdx=0000000000fff800 rsi=000000000000024a
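
The rsi values are page counts, so they can be checked against the RAM size configured for each VM. A minimal sketch of that arithmetic, assuming 4 KB pages; the helper name pages_to_mib is hypothetical and is not part of any Hyper-V API:

```c
#include <stdint.h>

#define GUEST_PAGE_SIZE 0x1000ULL   /* assumed 4 KB page granularity */

/* Convert a page count (the rsi value in the calls above) to megabytes. */
uint64_t pages_to_mib(uint64_t page_count)
{
    return (page_count * GUEST_PAGE_SIZE) >> 20;
}
```

Under this assumption, 0x5dc00 pages give 1500 MB and 0x8fc00 pages give 2300 MB, which matches the configured RAM sizes, while a 0x800-page block is 8 MB.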

Call stack of the 1st call:
# Call Site
00 winhvr!WinHvMapGpaPagesFromMbpArrayScanLargePages
01 Vid!VsmmHvpMapGpasFromMbpArray
02 Vid!VsmmHvpMapGpasFromMemoryBlockRange
03 Vid!VsmmHvMapGpasFromMemoryBlock
04 Vid!VsmmAdjustGpaSpaceForMemoryBlockRange
05 Vid!VsmmCreateMemoryBlockGpaRange
06 Vid!VidIoControlPartition
07 Vid!VidIoControlDispatch
08 Vid!VidIoControlPreProcess
.............WDF Calls............
0d nt!IofCallDriver
0e nt!IopSynchronousServiceTail
0f nt!IopXxxControlFile
10 nt!NtDeviceIoControlFile
11 nt!KiSystemServiceCopyEnd
12 ntdll!NtDeviceIoControlFile
13 vid_7ffb4de20000!VidCreateMemoryBlockGpaRange
14 vmwp!GpaRangeMbBacked::Initialize
15 vmwp!MemoryManager::CreateGpaRangeInternal
16 vmwp!MemoryManager::CreateMemoryBlock
17 vmwp!MemoryManager::CreateRamMemoryBlocks
18 vmwp!MemoryManager::CreateRam
19 vmwp!VirtualMachine::ConstructGuestRam
1a vmwp!WorkerTaskStarting::RunCleanStartSteps
1b vmwp!WorkerTaskStarting::RunTask
1c vmwp!WorkerAsyncTask<VmPerf::Vmwp::StartingTask>::Execute
1d vmwp!VirtualMachine::DoStateChangeTask
1e vmwp!VirtualMachine::StartInternal

Call stack of the 2nd and 3rd calls:
# Call Site
00 winhvr!WinHvMapGpaPagesFromMbpArrayScanLargePages
01 Vid!VsmmHvpMapGpasFromMbpArray
02 Vid!VsmmHvpMapGpasFromMemoryBlockRange
03 Vid!VsmmHvMapGpasFromMemoryBlock
04 Vid!VsmmAdjustGpaSpaceForMemoryBlockRange
05 Vid!VsmmCreateMemoryBlockGpaRange
06 Vid!VidIoControlPartition
07 Vid!VidIoControlDispatch
08 Vid!VidIoControlPreProcess
.............WDF Calls............
0d nt!IofCallDriver
0e nt!IopSynchronousServiceTail
0f nt!IopXxxControlFile
10 nt!NtDeviceIoControlFile
11 nt!KiSystemServiceCopyEnd
12 ntdll!NtDeviceIoControlFile
13 vid_7ffb4de20000!VidCreateMemoryBlockGpaRange
14 vmwp!MemoryManager::CreateMemoryBlockGpaRange
15 vmwp!VmbComGpaRange::VmbComGpaRange
16 vmwp!Vml::VmComMultiInstanceObject<VmbComGpaRange>::CreateInstance
17 vmwp!Vml::CreateComObject<VmbComGpaRange,IMemoryManager
18 vmwp!VmbComMemoryBlock::CreateGpaRange
19 vmuidevices!VideoSynthDevice::SetupVramGpaRange
1a vmuidevices!VideoSynthDevice::SynthVidOnVramLocation
1b vmuidevices!VideoSynthDevice::OnMessageReceived
1c vmuidevices!VMBusPipeIO::OnReadCompletion
1d vmuidevices!VMBusPipeIO::ProcessCompletionList
1e vmuidevices!VMBusPipeIO::HandleCompletions
1f vmuidevices!VMBusPipeIO::OnCompletion

The last memory block is the mapped memory of the video adapter. A one-page block is used for ACPI devices.

Among other things, the hvmm.sys driver is needed to remove the vmwp.exe protection that prevents DLL injection into that process. The driver works with a partition handle that has the Prtn signature (VM_PROCESS_CONTEXT), but vid.sys supports a second type: EXO partitions. EXO partitions can be created using the Windows Hypervisor Platform API (https://docs.microsoft.com/en-us/virtualization/api/hypervisor-platform/hypervisor-platform), which allows third-party developers to make their virtualization solutions compatible with Hyper-V and run them simultaneously with native Hyper-V VMs. Currently VirtualBox, QEMU and Bochs (for example, in the applepie implementation) support it. VMware, one year after these APIs appeared in Windows 10, build 1803, finally added support to its VMware Workstation product too. A new VMware build will probably be released after the release of Windows 10, build 1909 (19H2).
However, it is still possible to use the vid.dll interface without a driver in Windows Server 2016 and earlier. vid.sys in those OS versions has no API execution lock, so the hvmm.sys driver is not needed there. But WDAG and Windows Sandbox containers are present only in Windows 10, where the API is locked.
Which structures are needed to work with guest OS memory? I tried to visualize them in a diagram. While reading the rest of the article, it should become clear how they are used.

  • Partition handle (VM_PROCESS_CONTEXT);
  • GPAR-handle (GPAR - Guest physical Address Range);
  • Array of GPAR elements (GPAR Array);
  • Array of MBlock-objects (MBlock Array. MBlock – memory block GPA range);
  • GPAR-object (GPAR_OBJECT);
  • MBlock-object (MEMORY_BLOCK).

The partition handle is the main object used by the hvmm driver. When the user mode part of the partition handle is created, its kernel mode part contains all the necessary information about the created partition. The search algorithm for the user mode component hasn't changed since Windows Server 2008 R2: it can be obtained by enumerating the handles opened by the vmwp.exe process. To do this, find all open file handles with names like \Device\000000 and try to get the partition name.
If the name can be obtained, we have found a valid partition handle. In my experience, there are 3 such objects for each Full VM or container. If we pass the obtained values to the kernel function nt!ObReferenceObjectByHandle, in two cases the returned object is NULL, which means that those objects are invalid. For the remaining handle, we get the pointer to the partition handle.
The offsets of object pointers inside the partition handle are hardcoded and differ for each version of Windows. But within the same version of Windows they don't change, so the method is quite reliable.
The partition handle contains fields that point to an array of MBlock objects (initialized in vid.sys!VsmmMemoryBlockpInitialize) and an array of GPAR objects (initialized in vid.sys!VsmmGpaRangepInitialize).
By the way, don't confuse the partition handle with the Windows 10 memory partition structure, which the !partition WinDBG command displays. That is the _MI_PARTITION structure, which contains basic information about the current state of operating system memory. This object is created whether or not the hypervisor is active.
You can read more about it in part 1 of the Windows Internals book (7th edition). I couldn't find that information in MSDN (now Microsoft Docs).
Containers and Full VMs use different memory access methods, so let's look at memory reading examples for both. Let's start with a Full VM running Windows Server 2019.
Full VM memory reading
The LiveCloudKd application passes a request to the driver to read a guest OS memory block. The data required for the request is packed into the GPA_INFO structure. This structure contains the start memory address, the number of bytes to read and service information about the virtual machine partition (vmwp PID, partition id).
First, get the partition handle. To do this, just call the nt!ObReferenceObjectByHandle function with the passed handle.
The type of the returned object is FILE_OBJECT. To gain access to the body of the partition handle, you must get the pointer stored in FsContext.
The beginning of the partition handle looks like this:

The first 0x278 bytes contain the partition signature, its name and its identifier. The structure is not small (0x3EF0 bytes for Windows Server 2019), and its size differs between operating systems. The exact size of the partition handle can be found in vid.sys!VidCreatePartition (by the amount of memory allocated for it). We don't need it in the driver.
Once we get the partition handle type (VmType), we can perform one of two procedures for reading memory blocks. There are actually quite a lot of possible VmType values, and moreover, they differ between operating system versions. For example, VmType for a Full VM has different values in Windows 10 and Windows Server 2019. Not all of them have been investigated (especially for guest operating systems such as Linux, because WinDBG, which is launched by LiveCloudKd, doesn't work with them). In the end, virtual machine partitions were divided into two categories: container partitions and Full VM partitions.
The hvmm.sys!VidGetFullVmMemoryBlock function receives the partition handle, a buffer for the data that is read, the size of the buffer in bytes and the GPA of the virtual machine.
BOOLEAN VidGetFullVmMemoryBlock(PVM_PROCESS_CONTEXT pPartitionHandle, PCHAR pBuffer, ULONG len, ULONG64 GPA)
GPA here is a page number, calculated as GPA = GpaInfo.StartAddress / PAGE_SIZE;

The start address must be aligned on a page boundary if the hvmm driver function is called directly (LiveCloudKdSdk prepares the user mode buffer for that).
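
The page number calculation and the alignment requirement can be illustrated with a short sketch. It assumes a 4 KB PAGE_SIZE; the function names are hypothetical and are not taken from the hvmm sources:

```c
#include <stdbool.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 0x1000ULL   /* assumed value of PAGE_SIZE */

/* Page number of a guest physical address: StartAddress / PAGE_SIZE. */
uint64_t gpa_to_page_number(uint64_t start_address)
{
    return start_address / GUEST_PAGE_SIZE;
}

/* True if the address lies exactly on a page boundary. */
bool is_page_aligned(uint64_t start_address)
{
    return (start_address & (GUEST_PAGE_SIZE - 1)) == 0;
}
```

A caller that talks to the driver directly could reject a request when is_page_aligned returns false, instead of silently reading from the start of the page.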
Next, we need to find the GPAR object that describes the requested GPA. Each GPA belongs to a memory block previously allocated by the hypervisor, and this memory block is described by a GPAR object. The GpaIndexStart and GpaIndexEnd fields are located at offsets 0x100 and 0x108 of the GPAR object, respectively. From the values of these fields you can tell whether the GPAR object describes the GPA or not. For example:

This GPAR object covers GPA page numbers from 0 to 0x8fbff.

The number of GPAR objects in a Full VM is much smaller than in containers. For example, a Generation 2 Full VM has 3-4 GPAR objects, while containers have about 780. The more memory a guest OS has, the more blocks are allocated for it with HvMapGpaPages* hypercalls and, correspondingly, the more GPAR objects there are. The largest GPA range described by a single GPAR object that I have seen was 0x96000 pages.

Let's get back to our driver. We can find the GPAR object using the hvmm.sys!VidGetGparObjectForGpa function, which receives the partition handle and the GPA. How does it work? As described above, each partition handle has a pointer to a GPA block descriptor. This is a structure which, among other things, contains a pointer to the partition handle itself, a pointer to an array of pointers to GPAR objects, and the number of elements in that array (see the diagram of structure relationships above).

With this information, we can loop through the GPAR objects and find the one that is responsible for the GPA. The code is quite simple. It is a simplified implementation of the VsmmLookupMemoryBlockByHandle function of the vid.sys driver.
The vid.sys driver also has an additional procedure for reading encrypted memory: VsmmpSecureReadMemoryBlockPageRangeInternal. It uses AES-XTS through the BCryptEncrypt\BCryptDecrypt functions of the ksecdd.sys driver. I couldn't find out in which cases it is used, because even for Shielded VMs with TPM enabled the standard memory reading operations work. Perhaps some special areas are encrypted, but I haven't found them yet. The guest OS evidently reads and writes its already allocated memory directly, without calling any functions from vid.sys. All exceptions must be caught and handled by the hypervisor. Accordingly, if the root OS encrypted some parts of the memory, the guest OS would not be able to access them transparently.

Let's go back to the hvmm code.
When a suitable GPAR object is found, we exit the loop.
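
The lookup loop can be sketched in plain C as follows. This is a user mode model and not the hvmm source: the GPAR_OBJECT_MODEL type and the function name are hypothetical, and only the two fields at offsets 0x100 and 0x108 mentioned above are modeled. The end index is treated as inclusive, which is an assumption based on the 0 to 0x8fbff example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a GPAR object: only the two fields named in the
   article are laid out, at offsets 0x100 and 0x108. */
typedef struct GPAR_OBJECT_MODEL {
    uint8_t  Reserved[0x100];
    uint64_t GpaIndexStart;   /* first guest page number of the range */
    uint64_t GpaIndexEnd;     /* last guest page number (assumed inclusive) */
} GPAR_OBJECT_MODEL;

/* Walk the array of GPAR pointers and return the object whose range
   contains gpa_page, or NULL if no object describes that page. */
const GPAR_OBJECT_MODEL *find_gpar_for_gpa(
    const GPAR_OBJECT_MODEL *const *gpar_array,
    size_t gpar_count,
    uint64_t gpa_page)
{
    for (size_t i = 0; i < gpar_count; i++) {
        const GPAR_OBJECT_MODEL *gpar = gpar_array[i];
        if (gpar == NULL)
            continue;   /* skip empty slots */
        if (gpa_page >= gpar->GpaIndexStart && gpa_page <= gpar->GpaIndexEnd)
            return gpar;   /* suitable object found: leave the loop */
    }
    return NULL;
}
```

A linear scan is enough here because a Full VM has only a few GPAR objects and even a container has several hundred.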

There are also GPAR objects that don't describe a GPA range but, instead of the necessary data, contain a pointer to a certain user mode structure inside the vmwp.exe process. They are tied to memory allocated for virtual Hyper-V devices. Usually there is 1 such GPAR object per partition (see the contents of that memory later, in the Docker part of this article).

We don't need those objects for memory reading operations.
Which data in the GPAR object helps to read data from the guest OS? It is another data type, the MBlock object (MEMORY_BLOCK). It contains guest PFN data and other useful information. It is a fairly large structure that begins with the signature "Mb  ".

Of all the fields, we need only the pointer to the GPA array. Each array element is 16 bytes in size: one 8-byte part contains the GPA (in the guest OS), and the other 8-byte part contains the SPA information (in the root OS).

We can calculate the SPA using the following formula:

To read the SPA, we need to map it into the root OS virtual address space. An MDL structure is used for this:

There is an array of PFNs at the end of each MDL structure. A pointer to it can be obtained using the MmGetMdlPfnArray macro. Once we have the pointer, we write the HostSPA index to it. Of course, it is possible to put more than one PFN into the MDL at a time. But there is a chance of hitting the boundary between GPAR blocks, so memory is read page by page. For a Full VM this is not very efficient, since each block is large enough, but the speed is still good.

Next, we get a virtual address using the nt!MmMapLockedPagesSpecifyCache function and use it to copy the guest OS memory block with nt!RtlCopyMemory. Accordingly, reading is performed in a loop, with 1 memory page copied per iteration. During copying, it is recommended to pause the virtual machine to avoid memory modification during the read. LiveCloudKdSdk implements the SdkControlVmState function for this. It suspends the virtual machine either with the usual PowerShell cmdlets Suspend-VM\Resume-VM, or by calling the HvWriteVpRegister hypercall for each virtual processor and setting the HvRegisterExplicitSuspend register to 0 (resume) or 1 (suspend).

Container memory reading

Let's look at reading container memory using Windows Defender Application Guard as an example (to use it, you need to install the Windows 10 component of the same name, which has been available since build 1803). Memory access for Windows Sandbox and for Docker containers in Hyper-V isolation mode works the same way.

It is done by the following function of the hvmm.sys driver:

BOOLEAN VidGetContainerMemoryBlock(PVM_PROCESS_CONTEXT pPartitionHandle, PCHAR pBuffer, ULONG len, ULONG64 GPA)

Before calling it, as for a Full VM, we must first get the partition handle. In addition, we need the vmmem process handle. This process is created when containers run, and it works in kernel mode only.

Here are its threads when a container is launched on a 4-processor PC (there are no user mode threads):

The vmmem process handle is stored in the partition handle. We can find it using the ‘scrP’ signature (see the hvmm!VidFindVmmemHandle function for details).

We get a pointer to the GPAR object in the same way as for reading Full VM memory. From here on there are differences: other fields of the GPAR structure are used to read memory blocks. VmmMemGpaOffset is the main offset, which allows us to convert a GPA to an SPA for a specific memory block. There is also an additional offset (SomeGpaOffset), which can influence the final result, but during my experiments it was always 0.

Next, we calculate the source address using the following formula and copy the data block directly from the address space of the vmmem process:

Now we can see the key difference between reading container memory and reading Full VM memory: the data is copied from the virtual memory of the vmmem process. There is no need to map memory using an MDL.

Hyper-V memory API

Direct access to memory without the corresponding exported Windows functions is interesting, but a more reliable method is to use the APIs provided by Microsoft. The price of this reliability is the restrictions Microsoft imposes on these APIs. In particular, hypercalls work only with Full VMs and always return FALSE for containers. Additionally, they read\write no more than 0x10 bytes at a time. In the latest versions of Windows, the vid.dll API can't be called from any module outside the vmwp.exe process.

Vid.dll has the following functions for reading\writing memory:

  • VidTranslateGvaToGpa
  • VidReadMemoryBlockPageRange (wrapper on vid.sys!VidReadWriteMemoryBlockPageRange)
  • VidWriteMemoryBlockPageRange (wrapper on vid.sys!VidReadWriteMemoryBlockPageRange)

And hypercalls (they must be called from ring 0):

  • HvTranslateVirtualAddress
  • HvWriteGpa
  • HvReadGpa

Let's look at them in more detail.

Reading\writing memory using hypercalls

Using HvReadGpa is quite simple, as long as you keep in mind that the memory block must not cross a page boundary. Otherwise, the read will be cut short, and the end of the block, which should have been read from the second page, will contain zero bytes. Splitting into blocks is implemented in the user mode part of LiveCloudKdSdk. The hvmm driver calls WinHvReadGPA, the HvReadGpa wrapper from the winhvr.sys driver. You can call HvReadGpa directly through vmcall, but then you have to prepare the hypercall parameters yourself.

Boundary checking for write operations is done in the hvmm.sys driver.
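
One way to split a transfer so that no single call crosses a page boundary or exceeds the 0x10-byte limit mentioned earlier is sketched below. This is only an illustration of the idea, not the LiveCloudKdSdk code: the function names are hypothetical and a 4 KB page size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE   0x1000ULL  /* assumed 4 KB pages */
#define MAX_HVCALL_BYTES  0x10ULL    /* per-call limit mentioned in the article */

/* Size of the next chunk starting at gpa: at most MAX_HVCALL_BYTES,
   never crossing a page boundary, never more than the bytes left. */
size_t next_chunk_size(uint64_t gpa, size_t bytes_left)
{
    uint64_t to_page_end = GUEST_PAGE_SIZE - (gpa & (GUEST_PAGE_SIZE - 1));
    uint64_t chunk = MAX_HVCALL_BYTES;

    if (chunk > to_page_end)
        chunk = to_page_end;
    if (chunk > bytes_left)
        chunk = bytes_left;
    return (size_t)chunk;
}

/* Number of calls needed to transfer length bytes starting at gpa. */
size_t count_chunks(uint64_t gpa, size_t length)
{
    size_t calls = 0;

    while (length > 0) {
        size_t chunk = next_chunk_size(gpa, length);
        gpa += chunk;
        length -= chunk;
        calls++;
    }
    return calls;
}
```

For example, a 0x20-byte read starting at 0x1ff8 is split into three calls of 8, 0x10 and 8 bytes, with the first call ending exactly at the page boundary. In a real reader the loop body of count_chunks would issue one read per chunk.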

An additional check is performed before reading virtual address space, using winhvr.sys!WinHvTranslateVirtualAddress. The function converts a virtual address into a physical one using the current CPU context (and, accordingly, the CR3 register).

Possible validation options (LiveCloudKd uses only HV_TRANSLATE_GVA_VALIDATE_READ and HV_TRANSLATE_GVA_VALIDATE_WRITE):

#define HV_TRANSLATE_GVA_VALIDATE_READ       (0x0001)
#define HV_TRANSLATE_GVA_CONTROL_MASK        (0x003F)

In memory dump mode WinDBG works with physical addresses only (for the debugger they are file offsets). Accordingly, it does all the work of converting virtual addresses to physical ones, so we don't need an additional hypercall to check the memory address.

Microsoft Hyper-V Virtualization Infrastructure Driver Library (vid.dll) API
First, look at vid.dll!VidReadMemoryBlockPageRange and its parameters:

    __in PT_HANDLE Partition,
    __in MB_HANDLE MemoryBlock,
    __in MB_PAGE_INDEX StartMbp,
    __in UINT64 MbpCount,
         PVOID ClientBuffer,
    __in UINT64 BufferSize
Partition is the user mode partition handle;
ClientBuffer is a pointer to the memory region where the result will be stored;
BufferSize is the size of that buffer;

Two parameters may raise questions: MemoryBlock and StartMbp. MemoryBlock is the number of the MBlock object from which data will be read. In Windows Server 2008 R2 a kernel mode handle had to be passed as this parameter (yes, the user mode application held kernel mode handle addresses, and the original version of LiveCloudKd was built on this logic):

StartMbp is an index equal to the physical memory page number. We just need to take the GPA and divide it by PAGE_SIZE (0x1000). The page size here is always 4 KB: for example, the ntoskrnl.exe image is usually mapped with 2 MB large pages, but page numbers for that region still have 4 KB granularity. A smaller buffer can be specified, in which case less data is written to it. Everything is clear, with one exception: this index is relative to the beginning of the MB_HANDLE MemoryBlock. For the first memory block, the index matches the physical memory page number. If the blocks were contiguous, the index within the second block would be equal to the page number minus the size of the first block, and the index within the third block would be equal to the page number minus the sizes of the first and second blocks. The main problem is that physical memory blocks are not contiguous. Moreover, their boundaries cannot be easily determined from user mode: Microsoft hasn't provided such APIs since the time of Windows Server 2008 R2.
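
To make the relative index concrete, here is a small sketch that maps a guest page number to a block and to the index inside that block, taking gaps between blocks into account. The BLOCK_RANGE type and the function name are hypothetical, and the position in the array merely stands in for the MemoryBlock number. It assumes the block boundaries are already known, which is exactly the information that is hard to get from user mode.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory block: where it starts in guest
   physical page numbers and how many pages it contains. */
typedef struct BLOCK_RANGE {
    uint64_t first_page;
    uint64_t page_count;
} BLOCK_RANGE;

/* Find the block that contains gpa_page and compute the page index
   relative to the start of that block (the StartMbp value).
   Returns false if the page falls into a gap between blocks. */
bool gpa_page_to_block_index(const BLOCK_RANGE *blocks, size_t block_count,
                             uint64_t gpa_page,
                             size_t *block_number, uint64_t *start_mbp)
{
    for (size_t i = 0; i < block_count; i++) {
        if (gpa_page >= blocks[i].first_page &&
            gpa_page - blocks[i].first_page < blocks[i].page_count) {
            *block_number = i;
            *start_mbp = gpa_page - blocks[i].first_page;
            return true;
        }
    }
    return false;
}
```

With the three ranges observed for the 2300 MB VM (start 0 with 0x8fc00 pages, start 0xf8000 with 0x800 pages, start 0xfff800 with 0x24a pages), page 0xf8010 resolves to the second block with a relative index of 0x10, while page 0x90000 falls into a gap.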

Matt used a separate function to search for handles in memory, but Microsoft closed this opportunity by replacing the handles with their indexes in a table located in kernel mode. Therefore I used the vid.dll!VidReadMemoryBlockPageRange function.

First, we can get the handle numbers by a simple search, reading the first memory page of each block. If the function returns TRUE, the block exists; if it returns FALSE, the block doesn't exist. Based on practical experience, I set the maximum index to 0x400. As we saw above, a large number of indexes is observed only for containers such as WDAG and Windows Sandbox, because each file is mapped in a separate block.

Once we have the array of indexes, we can determine the maximum size of each block with a slightly modified binary search.
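
The idea of the size search can be sketched as a binary search over page indexes. This is an illustration and not the LiveCloudKd implementation: the PAGE_PROBE callback is a hypothetical stand-in for a one-page read through VidReadMemoryBlockPageRange, and fake_probe exists only so the search can be tried without a VM. The search assumes that every index below the block size can be read and every index at or above it cannot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical probe: returns true if the page with this block-relative
   index can be read. */
typedef bool (*PAGE_PROBE)(uint64_t page_index, void *context);

/* Binary search for the number of readable pages in one block.
   max_pages is an upper bound chosen by the caller. */
uint64_t find_block_size(PAGE_PROBE probe, void *context, uint64_t max_pages)
{
    uint64_t lo = 0;          /* every index below lo is known to be readable */
    uint64_t hi = max_pages;  /* every index at or above hi is treated as unreadable */

    while (lo < hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        if (probe(mid, context))
            lo = mid + 1;     /* mid is readable: the size is greater than mid */
        else
            hi = mid;         /* mid is not readable: the size is at most mid */
    }
    return lo;
}

/* Stand-in probe backed by a known size, for experiments without a VM. */
bool fake_probe(uint64_t page_index, void *context)
{
    return page_index < *(const uint64_t *)context;
}
```

The number of probes grows with the logarithm of max_pages, so even a bound of 0x100000 pages needs only about 20 one-page reads per block.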

We know that a memory block is contiguous, so we can determine its boundary with the following condition: when reading a block, the subsequent block must not be read. Accordingly, we can first scan the memory and build the initial memory map. But, as I wrote above, there are gaps between the blocks. Therefore, to refine the memory layout, we have to examine the _PHYSICAL_MEMORY_DESCRIPTOR structure in the guest OS.

0: kd> dt poi(nt!MmPhysicalMemoryBlock)  nt!_PHYSICAL_MEMORY_DESCRIPTOR
   +0x000 NumberOfRuns     : 7
   +0x008 NumberOfPages    : 0xbfee1
   +0x010 Run              : [1] _PHYSICAL_MEMORY_RUN

0: kd> dq poi(nt!MmPhysicalMemoryBlock) L20
ffff8b81`91615020  00000000`00000007 00000000`000bfee1    number of blocks, total size of all blocks (in pages)
ffff8b81`91615030  00000000`00000001 00000000`0000009f    start page of the block, page count in the block
ffff8b81`91615040  00000000`00000100 00000000`0000027b
ffff8b81`91615050  00000000`0000037d 00000000`00005d86
ffff8b81`91615060  00000000`00006105 00000000`00058dc0
ffff8b81`91615070  00000000`0005ef1b 00000000`00001080
ffff8b81`91615080  00000000`0005ffff 00000000`00000001
ffff8b81`91615090  00000000`00060200 00000000`00060000
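
The dump can be cross-checked with a few lines of C: the page counts of the 7 runs should add up to NumberOfPages, and the difference between the end of one run and the start of the next gives the size of the gap. The MEMORY_RUN type below is a simplified model of _PHYSICAL_MEMORY_RUN (two 64-bit values per entry, as in the dq output), and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of one run: start page and page count. */
typedef struct MEMORY_RUN {
    uint64_t base_page;
    uint64_t page_count;
} MEMORY_RUN;

/* The 7 runs copied from the dq output above. */
const MEMORY_RUN guest_runs[] = {
    { 0x00001, 0x0009f },
    { 0x00100, 0x0027b },
    { 0x0037d, 0x05d86 },
    { 0x06105, 0x58dc0 },
    { 0x5ef1b, 0x01080 },
    { 0x5ffff, 0x00001 },
    { 0x60200, 0x60000 },
};

/* Sum of the page counts of all runs. */
uint64_t total_pages(const MEMORY_RUN *runs, size_t run_count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < run_count; i++)
        total += runs[i].page_count;
    return total;
}

/* Size in pages of the gap between run i and run i + 1.
   The caller must make sure that run i + 1 exists. */
uint64_t gap_after_run(const MEMORY_RUN *runs, size_t i)
{
    return runs[i + 1].base_page - (runs[i].base_page + runs[i].page_count);
}
```

For this dump the total is 0xbfee1, equal to NumberOfPages, and the gap between the first two runs is 0x60 pages.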

WinDBG has a command to show the PHYSICAL_MEMORY_DESCRIPTOR structure.

As you can see, some of the guest OS memory blocks fit inside one block allocated by the hypervisor. Other guest OS blocks correspond to blocks allocated by the hypervisor that have the same size but some offset. Given that the offset is small, we can adjust our table:

The first block doesn't need adjustment. Its memory is mapped 1 to 1, which allows us to read data from the first block, where ntoskrnl.exe is located, in order to calculate the values of the _PHYSICAL_MEMORY_DESCRIPTOR structure later. After that calculation, we can correct the offsets. The driver code handles the case where one guest block consists of several blocks allocated by the hypervisor, but I haven't encountered such a case in my test lab. The last block, 0x800 pages in size, is used for video memory, as explained above. In our case, the maximum physical address available for reading in the virtual machine is greater than the maximum address specified in PHYSICAL_MEMORY_DESCRIPTOR. This block is not listed in PHYSICAL_MEMORY_DESCRIPTOR, so we just assume that it follows the last guest OS block. The offset of this block can't be determined without a driver in the host OS. We can assume that this memory is used by the device, and it can be read, for example, by LiveCloudKd.

After the correction, we can read all physical guest OS memory without the driver, except for pages that were paged out to pagefile.sys.
This completes the code description. The remaining details can be found in the sources of the hvmm driver.

Additional details

I wrote the PyKD script ParsePrtnStructure.py for better visualization of GPAR and MBlock objects (the link is given at the beginning of the article). To use it, you first have to find the partition handle. To do this, run the hvmm.sys driver, which outputs the value of this handle to the debugger, and then insert this value into the script.

Script output for Windows Server 2019 guest OS:

The number of GPAR objects and memory blocks for containers is much larger:
In Hyper-V containers all MBlock objects contain zeros, like this:
0: kd> dc 0xffff958b7f0d14d0
ffff958b`7f0d14d0  00000000 00000000 00000000 00000000  ................
ffff958b`7f0d14e0  00000000 00000000 00000000 00000000  ................
ffff958b`7f0d14f0  00000000 00000000 00000000 00000000  ................

There is an additional type of block inside the vid.sys driver: the reserve bucket block (VSMM_RESERVE_BUCKET).
It is not needed for reading guest OS memory in the standard case. We see that the addresses point to themselves (0x10 alignment).
Docker container with Hyper-V isolation mode
A Docker container in Hyper-V isolation mode creates quite a lot of processes (the processes for 1 Windows Server 2019 nanoserver 1809 container):
We see 2 partition handles (one per vmwp.exe process). The name of the 1st one matches the name of the user in whose context the process is running.
However, this partition has an irrelevant table of MBlock objects:
The element count is 0x8e, but there is only one MBlock object, and it is empty.
The name of the 2nd partition matches the identifier created for the container. This partition contains the necessary NT kernel data that can be used to access the container's memory using WinDBG.
The base address is the same as the Vmmem GPA Offset parameter, which is used for reading a memory block from the context of the vmmem process.
The offset of the file mapping region in another vmmem instance is the same as the VmmemGPA offset used by the hvmm.sys driver.
Different vmmem processes load different executables. But in the process with fewer files, the number of active threads is 0.
The 2nd vmmem process of the Docker container is not critical to execution. It can be killed with Process Hacker (its memory size is several tens of kilobytes). The 1st vmmem process is also not critical for reading memory. The registers of the partition to which that process is attached have correct values, but reading kernel mode memory returns zeros.
After stopping the two aforementioned vmmem processes, you can still safely start processes inside the container through docker exec.
Call stacks of vmmem creation (PsCreateMinimalProcess is called 3 times when a container starts):
1st PsCreateMinimalProcess:
: kd> kcn
# Call Site
00 nt!PsCreateMinimalProcess
01 nt!VmCreateMemoryProcess
02 Vid!VsmmNtSlatMemoryProcessCreate
03 Vid!VsmmProcesspMicroVmSetup
14 vmwp!VidPartitionManager::Initialize
15 vmwp!VidPartitionManager::CreateInstance
2nd PsCreateMinimalProcess:
2: kd> kcn
# Call Site
00 nt!PsCreateMinimalProcess
01 nt!VmCreateMemoryProcess
02 Vid!VsmmNtSlatMemoryProcessCreate
03 Vid!VsmmClonepTemplateCreate
13 vmwp!WorkerTaskSaving::StartSave
14 vmwp!WorkerTaskSaving::RunSaveSteps
15 vmwp!WorkerTaskSaving::RunTask
3rd PsCreateMinimalProcess:
0: kd> kcn
# Call Site
00 nt!PsCreateMinimalProcess
01 nt!VmCreateMemoryProcess
02 Vid!VsmmNtSlatMemoryProcessCreate
03 Vid!VsmmCloneTemplateApply
13 vmwp!VidPartitionManager::Initialize
14 vmwp!VidPartitionManager::CreateInstance
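The three stacks share their top frames and differ only in the vid.sys routine that requests the memory process. A small sketch can make that comparison explicit. It works on text copied from kcn output, and the helper names parse_kcn and first_divergence are made up for illustration. Only frames 00 to 03 of each stack are included.

```python
def parse_kcn(text):
    """Parse `kcn` output into a {frame_number: symbol} dict.

    WinDBG prints frame numbers in hexadecimal. Prompt lines and the
    "# Call Site" header are skipped.
    """
    frames = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or "!" not in parts[1]:
            continue
        try:
            frames[int(parts[0], 16)] = parts[1]
        except ValueError:
            continue
    return frames


def first_divergence(stacks):
    """Return the lowest frame number present in all stacks where they differ."""
    common = set(stacks[0])
    for stack in stacks[1:]:
        common &= set(stack)
    for number in sorted(common):
        if len({stack[number] for stack in stacks}) > 1:
            return number
    return None


COMMON_TOP = """# Call Site
00 nt!PsCreateMinimalProcess
01 nt!VmCreateMemoryProcess
02 Vid!VsmmNtSlatMemoryProcessCreate
"""

stacks = [
    parse_kcn(COMMON_TOP + "03 Vid!VsmmProcesspMicroVmSetup"),
    parse_kcn(COMMON_TOP + "03 Vid!VsmmClonepTemplateCreate"),
    parse_kcn(COMMON_TOP + "03 Vid!VsmmCloneTemplateApply"),
]

frame = first_divergence(stacks)
print(frame, [stack[frame] for stack in stacks])
```

For the stacks above, the first differing frame is 03: VsmmProcesspMicroVmSetup, VsmmClonepTemplateCreate, and VsmmCloneTemplateApply.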
We again see a pseudo GPAR object pointing to a user-mode structure (as seen above, this block is created for interaction with virtual devices):
To read memory inside this block, we have to enter the vmwp.exe context:
The vmwp.exe process of the Docker container holds handles to the files used inside the container:
More information about Docker container internals can be found in a video from the Microsoft Ignite conference:
https://www.youtube.com/watch?time_continue=2291&v=tG8R5SQGPck (OS internals: Technical deep-dive into operating system innovations - BRK3365, starting from 38:11).

Usage examples

Which programs can make use of the ability to read and write guest OS memory?

LiveCloudKd (as an alternative to Sysinternals LiveKd with the -hvl option).
In the screenshot, one Full VM with Windows Server 2019 and one Docker container in Hyper-V isolation mode are running on the Hyper-V host server.

EXDi plugin for WinDBG: the capabilities are the same, but it uses legitimate interfaces for WinDBG integration (LiveCloudKd hooks some functions inside WinDBG). It even works with WinDBG Preview, which itself runs in a separate container (as a UWP application). At the time of writing, the EXDi plugin only works with Windows Server 2019\Windows 10 with the hvmm.sys driver loaded, since it requires write operations to the guest OS. The screenshot shows WinDBG Preview in EXDi mode with the mimilib.dll plugin, which is part of the mimikatz utility.

The plugin for MemProcFS (https://github.com/ufrisk/MemProcFS), which is integrated with pypykatz (https://github.com/skelsec/pypykatz), also allows you to scan the guest OS for hashes (in the screenshot, the guest OS is a domain controller based on Windows Server 2016).

Clearly, using this method requires access to the host server with administrator rights. So, first of all, I position the utility as a way to dig inside the OS when setting up a debugger takes too long, you are too lazy to do it, or the debugger cannot connect at all (for example, when the Secure Boot option is active).

This article described various ways of accessing the memory of Hyper-V guest partitions created in a variety of cases. I hope that working with Hyper-V memory has become a little more understandable. Hyper-V evolves very quickly and integrates more and more actively into the Windows kernel, while remaining virtually undocumented.

The information may be useful to those who want to understand the internal structure of Hyper-V and possibly get transparent access to guest OS memory, including the ability to modify it. Using LiveCloudKd requires access to the root OS where the virtual machines are located, and I don’t think that it carries any security risk. However, on Windows Server 2016 such access can be obtained using only the user-mode API, which is rather difficult to control. For protection, it is recommended either to enable the Shielded VM option (bypassing it then requires loading a driver) or to use Windows Server 2019, where Microsoft blocked calls to the vid.dll API from third-party processes and enabled for vmwp.exe a policy that prohibits loading libraries not signed by Microsoft. However, recent work on injecting code into third-party processes, presented in August 2019 at Black Hat in Las Vegas (the talk "Process Injection Techniques - Gotta Catch Them All" by Itzik Kotler and Amit Klein of SafeBreach Labs), shows that there are ways to get around these restrictions from user mode (of course, this requires local administrator rights). The only reliable protection against such access to the guest OS is Microsoft's Code Integrity in conjunction with Shielded VM.
