Date: Fri, 27 Jun 2014 14:23:12 -0700
From: Neel Natu <neelnatu@gmail.com>
To: Sean Bruno <sbruno@freebsd.org>
Cc: "freebsd-virtualization@freebsd.org" <freebsd-virtualization@freebsd.org>
Subject: Re: jenkins bhyve vms crashing and burning after several days of use
Message-ID: <CAFgRE9H2QLzQ3mKp1a4zfNBinhVu60F0MMovuSk4sEO0y20FeQ@mail.gmail.com>
In-Reply-To: <CAFgRE9HpA_LQStzPYpDUU0erqNp%2BKOrjwK%2B7A7RGfD7XTCi1Hg@mail.gmail.com>
References: <1403818926.2417.6.camel@bruno> <1403819194.2417.8.camel@bruno> <CAFgRE9GYHzenX7px6-Sp6BfeTVA0-jcwg=JgcGXKuBeFJXUoog@mail.gmail.com> <1403821402.2417.12.camel@bruno> <CAFgRE9HpA_LQStzPYpDUU0erqNp%2BKOrjwK%2B7A7RGfD7XTCi1Hg@mail.gmail.com>
Hi,

On Thu, Jun 26, 2014 at 3:43 PM, Neel Natu <neelnatu@gmail.com> wrote:
> Hi Sean,
>
> On Thu, Jun 26, 2014 at 3:23 PM, Sean Bruno <sbruno@ignoranthack.me> wrote:
>> On Thu, 2014-06-26 at 15:00 -0700, Neel Natu wrote:
>>> Hi Sean,
>>>
>>> On Thu, Jun 26, 2014 at 2:46 PM, Sean Bruno <sbruno@ignoranthack.me> wrote:
>>> > On Thu, 2014-06-26 at 14:42 -0700, Sean Bruno wrote:
>>> >> so, we're seeing the bhyve vms running in the freebsd cluster for
>>> >> jenkins crashing and burning after a couple of days of use.
>>> >>
>>> >> vm exit[9]
>>> >> reason VMX
>>> >> rip 0x0000000029286336
>>> >> inst_length 3
>>> >> status 0
>>> >> exit_reason 49
>>> >> qualification 0x0000000000000000
>>> >> inst_type 0
>>> >> inst_error 0
>>> >>
>>> >>
>>> >> It looks like we have an active core file on havoc.ysv if you have a
>>> >> moment to look at it:
>>> >>
>>> >> http://people.freebsd.org/~sbruno/bhyve.core
>>> >>
>>> >> FreeBSD havoc.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #2
>>> >> r267362: Wed Jun 11 14:56:34 UTC 2014
>>> >> sbruno@havoc.freebsd.org:/usr/obj/usr/src/sys/HAVOC amd64
>>> >>
>>> >
>>> > Also, from chaos.ysv
>>> >
>>> > http://people.freebsd.org/~sbruno/bhyve.core.chaos
>>> >
>>> > FreeBSD chaos.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #1
>>> > r267362: Wed Jun 11 15:50:24 UTC 2014
>>> > sbruno@chaos.ysv.freebsd.org:/usr/obj/usr/src/sys/CHAOS amd64
>>> >
>>>
>>> Can you tell us the processor and memory configuration on havoc and chaos?
>>>
>>> Also, could you execute the following commands on havoc:
>>>
>>> # bhyvectl --vm=vmname --cpu=9 --get-vmcs-guest-physical-address
>>> -- this will output the offending guest physical address that
>>> triggered the EPT misconfiguration
>>>
>>> # bhyvectl --vm=vmname --get-gpa-pmap=<gpa_from_above>
>>> -- this will output the page table entries in the EPT that map to the
>>> offending GPA
>>>
>>> Hopefully that provides us with something to work with.
>>>
>>> best
>>> Neel
>>>
>>> >
>>
>> chaos:
>> CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (2200.05-MHz K8-class CPU)
>> Origin="GenuineIntel" Id=0x206d6 Family=0x6 Model=0x2d Stepping=6
>> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>> Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
>> AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>> AMD Features2=0x1<LAHF>
>> TSC: P-state invariant, performance statistics
>> avail memory = 66298322944 (63227 MB)
>>
>> havoc:
>> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
>> CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (2400.14-MHz
>> K8-class CPU)
>> Origin="GenuineIntel" Id=0x206c2 Family=0x6 Model=0x2c Stepping=2
>> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>> Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
>> AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>> AMD Features2=0x1<LAHF>
>> TSC: P-state invariant, performance statistics
>> avail memory = 16571621376 (15803 MB)
>>
>
> Thanks, we'll see if there are relevant errata for these processors.
>

Actually, these processors have entirely different microarchitectures
(Nehalem and Sandy Bridge), so it's unlikely that this is due to
processor errata.
>>
>> There appear to be three vms running on havoc:
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
>> --get-vmcs-guest-physical-address
>> gpa[9] 0x0000000000000000
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
>> --get-vmcs-guest-physical-address
>> gpa[9] 0x0000000000000000
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
>> --get-vmcs-guest-physical-address
>> gpa[9] 0x0000000000000000
>>
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
>> --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0
>>
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
>> --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x30000286cb0007 0x300003ad105007 0x3000019b1fd007 0
>>
>> root@havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
>> --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x300002c9348007 0x300002c9339007 0
>>
>>
>> But there's no information available on chaos at the moment as there are
>> no active vms running.
>>
>
> Sorry, I should have explained a bit more.
>
> After bhyve(8) exits because of the EPT misconfiguration error there
> are breadcrumbs left over in the VMCS as well as the nested page
> tables. We can use them to diagnose what happened.
>
> The bhyvectl commands above should be executed after the VM exits but
> before it is restarted again. Once it restarts, the breadcrumbs get
> written over and are of no use.
>
> The "--vm=<vmname>" passed to the bhyvectl command should be that of
> the virtual machine that crashed.
> The "--cpu=<vcpuid>" passed to the bhyvectl command should be the
> vcpuid that detected the EPT misconfiguration. The reason I used '9'
> as an example above was because you saw this on the console:
>
> vm exit[9]
> reason VMX
> rip 0x0000000029286336
>
> Hope that helps.
>

I submitted a change in r267966 to dump this information to the
console. It is also stashed in the process memory so we can inspect it
in a coredump.

Would it be possible to upgrade chaos and/or havoc to r267966 so we can
make progress on debugging this issue?

best
Neel

> best
> Neel
>
>> sean
>>
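
The procedure described in this thread (run the two bhyvectl commands
after the crashed VM has exited and before it is restarted) could be
sketched as a small shell helper. This is only an illustration: the
function names parse_gpa and collect_breadcrumbs are hypothetical, the
two bhyvectl flags are the ones quoted in the thread, and the parsing
assumes the output looks like the "gpa[9] 0x0000000000000000" line
shown above.

```shell
#!/bin/sh
# Sketch only: collect EPT-misconfiguration breadcrumbs for a crashed VM.
# Must be run after the VM exits and before it is restarted, because a
# restart overwrites the VMCS and nested page table state.

# parse_gpa (hypothetical helper): read bhyvectl output on stdin and print
# the address from a line such as "gpa[9] 0x0000000000000000".
parse_gpa() {
    awk '/^gpa\[/ { print $2; exit }'
}

# collect_breadcrumbs <vmname> <vcpuid> (hypothetical helper):
#   vmname is the VM that crashed; vcpuid is the N from "vm exit[N]".
collect_breadcrumbs() {
    vm=$1
    cpu=$2

    # Step 1: the guest physical address that triggered the exit.
    gpa=$(bhyvectl --vm="$vm" --cpu="$cpu" \
        --get-vmcs-guest-physical-address | parse_gpa)
    if [ -z "$gpa" ]; then
        echo "no GPA reported for $vm vcpu $cpu" >&2
        return 1
    fi
    echo "offending gpa: $gpa"

    # Step 2: the EPT page table entries that map that address.
    bhyvectl --vm="$vm" --get-gpa-pmap="$gpa"
}
```

For the crash reported above, the call would be something like
"collect_breadcrumbs vm1 9", where vm1 is whichever VM actually
crashed. Keeping the parsing in its own function makes it easy to check
against captured output without a running VM.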