
Reserving Hugepages for MySQL

To enable hugepage support for MySQL, follow the steps given below:

1. Allocate hugepages for MySQL using the following command

[ashwin@wildfire]$ echo 40 > /proc/sys/vm/nr_hugepages

Another way to configure hugepages:

[ashwin@wildfire]$ sysctl -w vm.nr_hugepages=40

2. Check the Status of Allocation

[ashwin@wildfire]$ grep -i huge /proc/meminfo

HugePages_Total: 40

HugePages_Free: 40

HugePages_Rsvd: 0

Hugepagesize: 2048 kB

HugePages_Total : Total number of hugepages in the pool
HugePages_Free  : Number of free (unallocated) hugepages in the pool
HugePages_Rsvd  : Number of hugepages reserved (committed) but not yet allocated
Hugepagesize    : Size of a single hugepage

3. Get the uid/gid of the mysql user using the following command

[ashwin@wildfire]$ id mysql

uid=27(mysql) gid=27(mysql) groups=27(mysql)

4. Make the following configuration changes

[ashwin@wildfire]$ vi /etc/sysctl.conf

Add the following lines:

###########

vm.nr_hugepages=40

# gid of the mysql group, as reported by `id mysql` in step 3
vm.hugetlb_shm_group=27

# maximum shared memory segment size, in bytes
kernel.shmmax = 68719476736

# total shared memory limit, in pages
kernel.shmall = 4294967296

#############

[ashwin@wildfire]$ sysctl -p

[ashwin@wildfire]$ vi /etc/my.cnf

Add:

##########

[mysqld]

large-pages

datadir=/var/lib/mysql

socket=/var/lib/mysql/mysql.sock

user=mysql

##########

[ashwin@wildfire]$ vi /etc/security/limits.conf

Add:

#########

@mysql soft memlock unlimited

@mysql hard memlock unlimited

##########

5. Restart the MySQL server

[ashwin@wildfire]$ service mysqld restart


6. Check whether the hugepages are now reserved for MySQL.

[ashwin@wildfire]$ grep -i huge /proc/meminfo

HugePages_Total: 40

HugePages_Free: 40

HugePages_Rsvd: 40

Hugepagesize: 2048 kB

New Implementation for Allocation of Hugepages in Xen

We have implemented hugepage support for guests in the following manner.

In our implementation we added a parameter, hugepage_num, which is specified in the config file of the DomU. It is the number of hugepages that the guest is guaranteed to receive whenever the kernel asks for them, either through its boot-time parameter or by reserving them after boot (e.g. using echo XX > /proc/sys/vm/nr_hugepages). During creation of the domain we reserve MFNs for these hugepages and store them in a list. The head of this list lives inside the domain structure under the name "hugepage_list".

While the domain is booting, the memory seen by the kernel is smaller by the amount required for the hugepages. The function reserve_hugepage_range is called as an initcall. Before this function runs, xen_extra_mem_start points to this apparent end of memory. In this function we reserve the PFN range for the hugepages which are going to be allocated by the kernel, by incrementing xen_extra_mem_start. We maintain these PFNs as pages in "xen_hugepfn_list" in the kernel. Now, when the kernel requests a hugepage, it makes a hypercall to the Xen hypervisor. The hypervisor returns an MFN from the linked list, and this mapping is entered into the p2m table in the kernel. In this way the kernel can allocate hugepages in a virtualized environment.
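
The guest-side flow can be sketched roughly as follows. The hypercall wrapper HYPERVISOR_hugepage_op and struct xen_hugepage_req below are illustrative stand-ins for the names in our patched kernel and hypervisor, not an exact listing of the implementation; set_phys_to_machine is the standard PV helper for updating the p2m array.

/* Rough sketch of the guest-side allocation path described above.
 * HYPERVISOR_hugepage_op and struct xen_hugepage_req are hypothetical
 * names used only for illustration. */
struct xen_hugepage_req {
    unsigned long pfn;      /* guest PFN reserved by reserve_hugepage_range */
    unsigned long mfn;      /* filled in by the hypervisor */
};

static int xen_alloc_one_hugepage(unsigned long pfn)
{
    struct xen_hugepage_req req = { .pfn = pfn };
    int i, rc;

    /* Ask Xen for one of the MFNs reserved on the domain's hugepage_list. */
    rc = HYPERVISOR_hugepage_op(&req);      /* hypothetical hypercall */
    if (rc < 0)
        return rc;

    /* Enter the 512 4KB frames backing the 2MB page into the p2m table. */
    for (i = 0; i < 512; i++)
        set_phys_to_machine(pfn + i, req.mfn + i);

    return 0;
}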

Enabling Printk’s in Paging

OMG!!! Debugging kernel code is really difficult, but debugging the paging code in the kernel is even more difficult. Try this: before the pagetables are set up, put a printk in the buddy allocator code and see what happens. Don't waste time exploring this; the kernel will simply crash, because the pagetables are not set up yet and the supporting infrastructure is not initialized. But suppose you still want to debug code that runs after the page tables are set up. Follow the steps given below:

1. Add the following definition in one C file (and declare it extern in a common header, say mm.h, so other files can use it):
int post_printk=0;
EXPORT_SYMBOL(post_printk);

2. Define the following macro in your code

#define MY_PRINTK(print_this)  do { if (post_printk) printk("\n%s\n", print_this); } while (0)

In kernel programming, always put "\n" at the end of your printks; this flushes the message to the console immediately, so it still gets printed even if the kernel crashes right afterwards.

3. Now set post_printk = 1 at the point where you want to enable the printks.

4. Then, rather than calling printk directly, use the macro defined above:

MY_PRINTK("This is my message in printk");
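
For example, post_printk can be flipped on from code that is known to run only after the page tables are up; the spot suggested below (near the end of mm_init() in init/main.c) is just one reasonable choice for illustration, not the only one.

/* Call this from a point that runs after the page tables are set up,
 * e.g. near the end of mm_init() in init/main.c (one possible spot). */
extern int post_printk;

static void __init enable_post_paging_printks(void)
{
        post_printk = 1;                        /* turn MY_PRINTK output on */
        MY_PRINTK("post-paging printks enabled");
}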

Tried Solutions for Hugepage Allocation in Xen

1) Allocating pages during Xen boot time:
When Xen boots, we reserve a few 2MB pages (order = 9) from the alloc calls explained earlier. We figure out the MFNs of this reserved pool, which lies in the 1-to-1 domain region. When a domain requests a hugepage, it makes a hypercall so that we can allocate a page from this pool. This method failed because the phys-to-machine mapping was not done.

2) Case 1, plus mapping the pages on the kernel side:
Using set_phys_to_machine on the kernel side, the MFNs returned by our hypercall from Xen were mapped into the kernel's pseudo-physical space. This approach failed because the corresponding mapping was not done on the Xen side.

3) Make the mapping on the Xen side before sending the MFN to the kernel. This case is partially implemented and under testing.
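
For case 3, the Xen-side part amounts to recording the machine-to-phys mapping before the MFN is handed back. The sketch below uses the real Xen helpers set_gpfn_from_mfn() and page_to_mfn(), but take_reserved_hugepage() is a placeholder for however the 2MB page is pulled off the domain's reserved pool, so treat it as an illustration rather than the actual patch.

/* Sketch of case 3: record the m2p mapping on the Xen side before
 * returning the MFN to the guest. take_reserved_hugepage() is a
 * placeholder name, not an existing Xen function. */
static unsigned long hugepage_assign_to_guest(struct domain *d,
                                              unsigned long gpfn)
{
    struct page_info *pg = take_reserved_hugepage(d);   /* placeholder */
    unsigned long mfn;
    unsigned int i;

    if ( pg == NULL )
        return 0;

    mfn = page_to_mfn(pg);

    /* Record the machine-to-phys entry for every 4KB frame of the 2MB
     * page, so that later pagetable updates from the guest validate. */
    for ( i = 0; i < 512; i++ )
        set_gpfn_from_mfn(mfn + i, gpfn + i);

    return mfn;   /* handed back to the guest through the hypercall */
}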

Problem with Hugepages and Xen?

When we enable hugepages, the kernel does reserve them, but when a write operation is performed on one of them the kernel crashes. The reason is that there is no code in the Xen hypervisor that maps the kernel's hugepage PTEs. The kernel also cannot create its own PTEs without informing Xen, since the page tables in a domain's kernel are read-only.

HugePages

Hugepages are 2MB in size. They are enabled by setting the PS bit in the Page Directory Entry. When hugepages are used, one level of the page walk is eliminated, since we get a 2MB page one level earlier. This can improve system performance by up to 30-40%, and sometimes even more, if hugepages are used properly. Hugepages can also degrade performance when thrashing occurs, since the I/O time required to bring in a 2MB page is quite a bit more than for a 4KB page.
Hugepages can be reserved with a boot-time option, or alternatively:

# echo 20 > /proc/sys/vm/nr_hugepages
# cat /proc/meminfo | grep Huge
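
On x86, the "PS" check mentioned above corresponds to the _PAGE_PSE flag in the page directory (PMD) entry. The helper below mirrors the kernel's pmd_large() check and is simplified here purely for illustration.

#include <asm/pgtable_types.h>

/* A 2MB mapping is marked by the PS (_PAGE_PSE) bit in the PMD entry;
 * this mirrors the kernel's pmd_large() helper (simplified). */
static inline int is_2mb_mapping(pmd_t pmd)
{
        return pmd_flags(pmd) & _PAGE_PSE;
}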

Hugepages are not handled by the normal 4KB page allocator. For maintaining the hugepage pool, the kernel keeps its own small allocator with per-node free lists. The following data structure describes the pool:

struct hstate {
    int next_nid_to_alloc;          /* next NUMA node to allocate from */
    int next_nid_to_free;           /* next NUMA node to free from */
    unsigned int order;             /* page order of one hugepage (9 for 2MB) */
    unsigned long mask;             /* address mask for this hugepage size */
    unsigned long max_huge_pages;   /* configured pool size */
    unsigned long nr_huge_pages;    /* hugepages currently in the pool */
    unsigned long free_huge_pages;  /* hugepages not yet handed out */
    unsigned long resv_huge_pages;  /* hugepages reserved for mappings */
    unsigned long surplus_huge_pages;       /* pages allocated beyond max_huge_pages */
    unsigned long nr_overcommit_huge_pages; /* limit on surplus pages */
    struct list_head hugepage_freelists[MAX_NUMNODES];  /* per-node free lists */
    unsigned int nr_huge_pages_node[MAX_NUMNODES];
    unsigned int free_huge_pages_node[MAX_NUMNODES];
    unsigned int surplus_huge_pages_node[MAX_NUMNODES];
    char name[HSTATE_NAME_LEN];     /* e.g. "hugepages-2048kB" */
};
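
Allocation from this pool is essentially a list operation. The following is a simplified sketch of how a free hugepage is taken from one node's free list and the counters are updated; the real code lives in mm/hugetlb.c (dequeue_huge_page_node) and also handles reservations, surplus pages and NUMA policy.

/* Simplified sketch of pulling a free hugepage off one node's list;
 * modelled on dequeue_huge_page_node() in mm/hugetlb.c. */
static struct page *dequeue_huge_page_simplified(struct hstate *h, int nid)
{
        struct page *page;

        if (list_empty(&h->hugepage_freelists[nid]))
                return NULL;

        page = list_entry(h->hugepage_freelists[nid].next, struct page, lru);
        list_del(&page->lru);
        h->free_huge_pages--;
        h->free_huge_pages_node[nid]--;
        return page;
}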

PV Paging Documentation

Here is the documentation for PV-Paging in Xen.

Xen mmu operations

This file contains the various mmu fetch and update operations.
The most important job they must perform is the mapping between the
domain’s pfn and the overall machine mfns.

Xen allows guests to directly update the pagetable, in a controlled fashion. In other words, the guest modifies the same pagetable
that the CPU actually uses, which eliminates the overhead of having a separate shadow pagetable.

In order to allow this, it falls on the guest domain to map its notion of a “physical” pfn – which is just a domain-local linear
address – into a real “machine address” which the CPU’s MMU can use.

A pgd_t/pmd_t/pte_t will typically contain an mfn, and so can be inserted directly into the pagetable. When creating a new
pte/pmd/pgd, it converts the passed pfn into an mfn. Conversely, when reading the content back with __(pgd|pmd|pte)_val, it converts the mfn back into a pfn. The other constraint is that all pages which make up a pagetable
must be mapped read-only in the guest. This prevents uncontrolled guest updates to the pagetable. Xen strictly enforces this, and
will disallow any pagetable update which will end up mapping a pagetable page RW, and will disallow using any writable page as a
pagetable.

Naively, when loading %cr3 with the base of a new pagetable, Xen would need to validate the whole pagetable before going on.
Naturally, this is quite slow. The solution is to "pin" a pagetable, which enforces all the constraints on the pagetable even
when it is not actively in use. This means that Xen can be assured that it is still valid when you do load it into %cr3, and doesn't
need to revalidate it.

A note about cr3 (pagetable base) values: xen_cr3 contains the current logical cr3 value; it contains the
last set cr3. This may not be the current effective cr3, because its update may be being lazily deferred. However, a vcpu looking
at its own cr3 can use this value knowing that everything will be self-consistent.

xen_current_cr3 contains the actual vcpu cr3; it is set once the hypercall to set the vcpu cr3 is complete (so it may be a little
out of date, but it will never be set early). If one vcpu is looking at another vcpu's cr3 value, it should use this variable.

Xen leaves the responsibility for maintaining p2m mappings to the guests themselves, but it must also access and update the p2m array during suspend/resume, when all the pages are reallocated.

The p2m table is logically a flat array, but we implement it as a three-level tree to allow the address space to be sparse.

The p2m_mid_mfn pages are mapped by p2m_top_mfn_p. The p2m_top and p2m_top_mfn levels are limited to 1 page, so the
maximum representable pseudo-physical address space is:
P2M_TOP_PER_PAGE * P2M_MID_PER_PAGE * P2M_PER_PAGE pages

P2M_PER_PAGE depends on the architecture, as an mfn is always an unsigned long (8 bytes on 64-bit, 4 bytes on 32-bit), leading to
512 and 1024 entries respectively.
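
The three-level lookup described above boils down to three array indexes. The following is a simplified pfn-to-mfn lookup, modelled on get_phys_to_machine() in arch/x86/xen/p2m.c; the identity-frame and missing-page special cases are omitted here.

/* Simplified pfn -> mfn lookup over the three-level p2m tree;
 * modelled on get_phys_to_machine() in arch/x86/xen/p2m.c. */
static unsigned long p2m_lookup(unsigned long pfn)
{
        unsigned topidx = pfn / (P2M_MID_PER_PAGE * P2M_PER_PAGE);
        unsigned mididx = (pfn / P2M_PER_PAGE) % P2M_MID_PER_PAGE;
        unsigned idx    = pfn % P2M_PER_PAGE;

        if (pfn >= MAX_P2M_PFN)
                return INVALID_P2M_ENTRY;

        return p2m_top[topidx][mididx][idx];
}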

We can construct this by grafting the Xen provided pagetable into head_64.S’s preconstructed pagetables. We copy the Xen L2’s into
level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt. This means that only the kernel has a physical mapping to start with –
but that’s enough to get __va working. We need to fill in the rest of the physical mapping once some sort of allocator has been set
up.

Hypercalls used for Xen Paging

For paging, the following hypercalls are used by domains:

/* Generic memory operations (increase/decrease reservation, etc.) */
extern long
do_memory_op(
    unsigned long cmd,
    XEN_GUEST_HANDLE(void) arg);

/* Transcendent memory (tmem) operations */
extern long
do_tmem_op(
    XEN_GUEST_HANDLE(tmem_op_t) uops);

/* Batched pagetable updates (normal PTEs and machine-to-phys entries) */
extern int
do_mmu_update(
    XEN_GUEST_HANDLE(mmu_update_t) ureqs,
    unsigned int count,
    XEN_GUEST_HANDLE(uint) pdone,
    unsigned int foreigndom);

/* Update the PTE mapping a single virtual address */
extern int
do_update_va_mapping(
    unsigned long va,
    u64 val64,
    unsigned long flags);

/* Extended MMU operations: pin/unpin pagetables, TLB flushes, load a new cr3, ... */
extern int
do_mmuext_op(
    XEN_GUEST_HANDLE(mmuext_op_t) uops,
    unsigned int count,
    XEN_GUEST_HANDLE(uint) pdone,
    unsigned int foreigndom);
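
From the guest side these are reached through hypercall wrappers. For example, a PV Linux guest can ask Xen to write a single PTE by packaging it as an mmu_update request; the snippet below uses the standard Linux wrapper HYPERVISOR_mmu_update() and is a simplified illustration, not the exact in-kernel code path.

#include <asm/xen/hypercall.h>
#include <xen/interface/xen.h>

/* Simplified example: ask Xen to write one PTE via the mmu_update hypercall.
 * ptep_maddr is the machine address of the PTE, val the new PTE value. */
static int write_pte_via_xen(unsigned long ptep_maddr, unsigned long val)
{
        struct mmu_update u = {
                .ptr = ptep_maddr | MMU_NORMAL_PT_UPDATE,
                .val = val,
        };
        int success_count = 0;

        return HYPERVISOR_mmu_update(&u, 1, &success_count, DOMID_SELF);
}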

How to make a Hypercall in Xen?

A hypercall is the communication channel between domains and Xen. This ensures security between domains.

To add a new hypercall in Xen:

1. Add an entry for the hypercall in /include/xen/hypercall.h and
   /include/asm-x86/hypercall.h
2. Make an entry in /arch/x86/x86_32/entry.S (as a .long) and in
   /arch/x86/x86_64/entry.S (as a .quad)
3. Make an entry (the hypercall number) in /include/public/xen.h, as sketched below.
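
As a concrete illustration, adding a hypothetical hypercall named do_hugepage_op would touch the three places above roughly as follows; the name and the number are placeholders, not entries that exist in the Xen tree.

/* 1. Prototype in /include/xen/hypercall.h (and the asm-x86 header): */
extern long
do_hugepage_op(                          /* hypothetical new hypercall */
    XEN_GUEST_HANDLE(void) arg);

/* 2. Entries in the hypercall tables:
 *      /arch/x86/x86_32/entry.S  ->  .long do_hugepage_op
 *      /arch/x86/x86_64/entry.S  ->  .quad do_hugepage_op
 *    (placed at the slot matching the number chosen below)
 *
 * 3. Hypercall number in /include/public/xen.h: */
#define __HYPERVISOR_hugepage_op  39     /* placeholder number, must be unused */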

Paging Security in Xen

When pagetables are set up in Xen, they are marked read-only for the guest domain. This ensures security between domains, i.e. one domain should not be able to modify the pages of another.
So how does a domain perform write operations on these pages?
Answer: through a hypercall.
A hypercall is the communication channel between domains and Xen, and since Xen validates every update requested this way, isolation between domains is preserved.
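
For example, a PV Linux guest that wants to change the PTE for one virtual address goes through the update_va_mapping hypercall instead of writing the read-only pagetable directly. The snippet below uses the standard Linux wrapper HYPERVISOR_update_va_mapping() and is a simplified illustration.

#include <asm/xen/hypercall.h>

/* Simplified illustration: update the PTE for one virtual address via Xen.
 * Xen validates the new PTE before installing it, so an unsafe value is
 * rejected rather than silently corrupting another domain's memory. */
static int set_pte_via_xen(unsigned long va, pte_t new_pte)
{
        return HYPERVISOR_update_va_mapping(va, new_pte, UVMF_INVLPG);
}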