Feb 02 13:13:38 Bdragon28: ping
Feb 02 13:13:50 Bdragon28: I see one potential hit point on Book-E
Feb 02 13:14:07 There are *three* traps taken for page faults
Feb 02 13:14:08 jhibbits: pong
Feb 02 13:14:37 it already knows its recursion level though, isn't it just a case of adding some more scratch space?
Feb 02 13:15:00 TLB Miss -> (insert bogus TLB entry with bad permissions) -> DSI -> fault in page to page table -> TLB Miss -> insert real TLB entry
Feb 02 13:15:02 jhibbits: or do you mean regarding the abysmal performance?
Feb 02 13:15:18 Abysmal performance
Feb 02 13:15:29 I think I can get it down to 2 traps
Feb 02 13:15:38 That should improve performance considerably
Feb 02 13:15:38 ah, are you thinking of maybe doing an actual access check in the first miss?
Feb 02 13:16:08 No, I'm thinking of taking the TLB Miss -> DSI, and inserting the page into the TLB on DSI fault exit
Feb 02 13:16:32 ahh
Feb 02 13:16:47 to avoid the second tlb miss?
Feb 02 13:16:51 Yes
Feb 02 13:17:31 That should shave off several dozen clock cycles
Feb 02 13:17:39 Maybe hundreds
Feb 02 13:17:41 that's a lot
Feb 02 13:17:56 and I guess also however long interrupt setup takes
Feb 02 13:18:28 I'm just thinking of the hardware interrupt setup time
Feb 02 13:18:36 oh, ok
Feb 02 13:18:41 I thought you meant on top of that
Feb 02 13:18:54 nope
Feb 02 13:19:11 but yeah, if that's doable I imagine it would probably be good for 30% boost or something
Feb 02 13:20:31 although I guess it really depends on just how often it's faulting
Feb 02 13:20:59 seems to me that given the number of tlb entries that fit at once, stuff like compiling would be *constantly* faulting
Feb 02 13:21:16 Well, there are TLB misses and there are faults
Feb 02 13:21:27 ok, miss
Feb 02 13:22:10 still, it seems like the place where *any* optimization would do the most good
Feb 02 13:24:12 I can shave off one instruction in the TLB miss handler easily. Might be able to shave off more
Feb 02 13:24:16 btw does it bother to coalesce stuff into larger protection groups, or is it generally just pagewise for simplicity?
Feb 02 13:24:41 What do you mean by protection groups?
Feb 02 13:24:58 Maybe I'm misthinking about how it works
Feb 02 13:25:52 I mean how you can stuff a few !=4k entries in
Feb 02 13:26:19 I guess I'm confusing translation and tlbs a bit
Feb 02 13:26:36 We can have superpages via TLB1
Feb 02 13:26:46 right, that
Feb 02 13:26:49 TLB0 is 4k pages only
Feb 02 13:28:26 is any of tlb1 used for userspace or is it already being taken up by holding kernel related translations?
Feb 02 13:28:57 Only a handful of entries are taken currently
Feb 02 13:29:16 On e500v2 we have 16 TLB1 entries, of which I think 5 are available
Feb 02 13:29:31 On e5500 we have 64, of which 11 are occupied, leaving 53 to play...
Feb 02 13:29:55 yeah, I guess I did my measurements in terms of e500v2 when I was looking at the tlb in the debugger
Feb 02 13:30:17 One of my goals is to eventually add superpages
Feb 02 13:30:23 if superpages were usable that could probably be used for some serious speedup to some stuff
Feb 02 13:30:25 But I know absolutely nothing about how to do that
Feb 02 13:32:07 I guess the main trick is to notice when a range is contig and has the same protections?
Feb 02 13:32:16 Yeah
Feb 02 13:32:28 But there's some special things you can do to make that more likely
Feb 02 13:32:46 cow-like tricks?
Feb 02 13:34:02 Hmm... maybe I wouldn't save too much by moving the second TLB miss handling into the DSI handler
Feb 02 13:34:09 Interrupt latency is 8 clocks
Feb 02 13:34:26 that's 8 clocks event-to-first-instruction?
Feb 02 13:34:32 yes
Feb 02 13:34:38 that's... impressive
Feb 02 13:34:46 oh right
Feb 02 13:34:52 it doesn't even need to turn off translation first
Feb 02 13:34:55 The e500 has no pipeline
Feb 02 13:35:07 so it's practically just a branch
Feb 02 13:35:08 Rather, the pipeline is incredibly short
Feb 02 13:35:55 e5500 interrupt latency is 10 clocks
Feb 02 13:39:17 Is the storage of the larger picture MI?
Feb 02 13:39:45 What do you mean?
Feb 02 13:39:59 thinking about it in terms of mapped ranges and page protections
Feb 02 13:40:23 There are two levels of the VM system -- MI and MD
Feb 02 13:40:27 MI I have no control over
Feb 02 13:40:30 MD is the pmap layer
Feb 02 13:40:47 ah, so there's a md layer where you can implement a data structure that makes the most sense for the internals?
Feb 02 13:41:05 yes
Feb 02 13:41:50 ok, NOW I understand suddenly why pmap and vm are totally separate systems
Feb 02 13:42:19 it never clicked before
Feb 02 13:45:45 and also I guess I understand the difference between a miss and a fault too
Feb 02 13:46:06 miss means it's not in the tlb but it is "probably" in phys, and fault means it's not in phys either, right?
Feb 02 13:46:13 Hm... my A1222 has had 1 BILLION VM faults (I'm guessing that's DSI and ISI) since it started up... So 8 clocks in, 8 clocks out (guessing).. 16 billion clocks could be saved... at 2 billion clocks per second that'd save 8 seconds :P
Feb 02 13:46:20 Correct
Feb 02 13:47:33 so the speed of the system is up to whether or not pmap is implemented in a way where it can efficiently generate tlb entries on demand, assuming no memory pressure causing pages to be discarded from phys ram.
Feb 02 13:47:36 If I can save the second miss, that's still a net win regardless
Feb 02 13:47:54 Correct
Feb 02 13:52:14 jhibbits: any stats on the tlb locks?
Feb 02 13:52:36 what kind of stats?
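The back-of-the-envelope estimate at 13:46:13 (1 billion faults, 8 clocks in and 8 clocks out, at 2 billion clocks per second) can be written out as a minimal C sketch. The function names are hypothetical and nothing here is kernel code. The 8-clock exit cost is the guess made in the chat, not a measured figure.

```c
#include <stdint.h>

/*
 * Clocks spent purely on interrupt entry and exit for a given number
 * of traps. Hypothetical helper, named for illustration only.
 */
static uint64_t
trap_overhead_clocks(uint64_t traps, uint64_t entry_clocks,
    uint64_t exit_clocks)
{
	return (traps * (entry_clocks + exit_clocks));
}

/* Convert a clock count to seconds at a given clock rate. */
static double
clocks_to_seconds(uint64_t clocks, uint64_t clocks_per_second)
{
	return ((double)clocks / (double)clocks_per_second);
}
```

Calling `trap_overhead_clocks(1000000000, 8, 8)` and passing the result to `clocks_to_seconds` with a rate of 2000000000 reproduces the chat's figures of 16 billion clocks and 8 seconds. This counts only entry and exit latency, not the work done inside the handler.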
Feb 02 13:52:49 Misses are constant
Feb 02 13:53:41 Okay, in the last 7 minutes, I've had 2 million more VM faults (vm.stats.vm.v_vm_faults)
Feb 02 13:54:02 what sort of lock does it use? Seems to me that if there's lock contention, using lockless pcpu stuff in there somewhere might be a win
Feb 02 13:54:35 It's basically a spinlock
Feb 02 13:55:16 We need to protect concurrent accesses to the TLB, which is in the pcpu structure
Feb 02 13:55:27 So that's already done
Feb 02 13:56:05 oh, so it's more locking against the hardware than locking against execution?
Feb 02 13:56:15 correct
Feb 02 13:56:17 ok
Feb 02 13:56:42 Time to lurk away again
Feb 02 15:19:48 jhibbits: I wonder if there would be any win squeezing the top level page directory into the pmap page instead of the pm_pdir indirection. If you drop down MAXCPU a bit you could probably squeeze in a 768-entry directory.
Feb 02 15:20:15 otherwise it would have to be a 512 entry directory
Feb 02 15:20:48 (in terms of 32 bit)
Feb 02 15:23:07 which would mean avoiding pulling in another page for the page directory
Feb 02 15:25:31 with current settings, 1096 bytes of struct pmap are in use on 32 bits, and 1136 bytes on 64 bits, if my counting is correct
Feb 02 15:25:46 most of it being eaten up by int pm_tid[MAXCPU]
Feb 02 15:27:20 which takes 1024 bytes for MAXCPU=256
Feb 02 15:29:46 Hmm, I guess if you were to just move it to the end of the page you wouldn't need to make it smaller
Feb 02 15:30:11 it would just pull in the next page for stuff that didn't fit on the first page
Feb 02 15:30:19 err, to the end of the struct rather
Feb 02 15:33:32 I guess whether or not that's particularly helpful depends on whether or not the low directory entries tend to get used first or not.
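The 768-entry versus 512-entry remark can be checked with a little arithmetic. Below is a rough C sketch using only the 32-bit figures quoted in the chat: 1096 bytes of struct pmap at MAXCPU=256, of which 1024 bytes are `int pm_tid[MAXCPU]`. It assumes a 4 KiB page, 4-byte directory entries, and that everything other than `pm_tid` stays a fixed 72 bytes with no padding changes. The function names are hypothetical, and the byte counts are the chat's own ("if my counting is correct"), not verified against the source tree.

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES	4096u	/* assumed 4 KiB page */
#define ENTRY_BYTES	4u	/* assumed 32-bit directory entry */
#define TID_BYTES	4u	/* assumed sizeof(int) == 4 */

/* Estimated size of the 32-bit struct pmap for a given MAXCPU. */
static size_t
pmap_bytes_32(size_t maxcpu)
{
	/* 1096 total minus 1024 for pm_tid[256] leaves 72 fixed bytes. */
	const size_t fixed = 1096 - 256 * TID_BYTES;

	return (fixed + maxcpu * TID_BYTES);
}

/* Directory entries that fit in what is left of the same page. */
static size_t
pdir_entries_that_fit(size_t maxcpu)
{
	size_t used = pmap_bytes_32(maxcpu);

	if (used >= PAGE_SIZE_BYTES)
		return (0);
	return ((PAGE_SIZE_BYTES - used) / ENTRY_BYTES);
}
```

Under these assumptions, MAXCPU=256 leaves room for 750 entries, just short of 768, and 768 entries first fit once MAXCPU drops to 238 or below. That is consistent with "drop down MAXCPU a bit", but it is only an estimate.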
Feb 02 15:58:49 oh wait, it's already doing that, I wrote it down wrong
Feb 02 15:59:12 (regarding the memory for the pointers living in the pmap struct)
Feb 02 16:04:38 oh
Feb 02 16:05:03 you moved *away* from that sort of setup in r295520
Feb 02 16:12:38 oh, it was 64 bit per pte for longer than that. Maybe I'm just misunderstanding the comment.
Feb 02 22:06:35 Bdragon28: I do want to cut MAXCPU on MPC85XXSPE down to 2, which will save a ton of memory
Feb 02 22:06:49 jhibbits: you should be able to do that in the kernel conf, right?
Feb 02 22:07:12 yes
Feb 02 22:07:26 Bdragon28: pm_pdir is the top level page directory... a table of tables
Feb 02 22:07:36 yeah, ignore my rambling
Feb 02 22:07:44 I had thought it was a *pointer* to a table of tables
Feb 02 22:07:50 I misread * as **
Feb 02 22:08:00 ah okay
Feb 02 22:08:11 The 64-bit pmap needs work
Feb 02 22:08:18 It uses a hashed top level
Feb 02 22:08:21 bogus
Feb 02 22:08:32 yeah it's kinda carrying around a lot of extra baggage isn't it?
Feb 02 22:09:00 Not really
Feb 02 22:09:17 It's just not an ideal setup for top level page directory
Feb 02 22:10:50 it would be nice if the pmap struct was page aligned and fit on one page
Feb 02 22:11:24 pm_pdir is exactly a page but it's probably straddling two most of the time due to the rest of the struct etc etc
Feb 02 22:12:50 if it were a 512 entry table it would always fit but that would put more pressure on the levels below it I guess
Feb 02 22:25:21 Yeah
Feb 02 22:28:47 I don't really like the ptbl_buf list, I think it's unnecessary
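The straddling point at 22:11:24 can be illustrated with a toy layout. The struct below is a hypothetical stand-in, not the real struct pmap. It uses `uint32_t` slots to model 32-bit pointers, a made-up 72-byte block for the other fields, and a 1024-entry `pm_pdir` embedded directly in the struct, which is the "table of tables" reading rather than a pointer to one. A 4 KiB page is assumed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES	4096u	/* assumed 4 KiB page */

/* Toy model only; field sizes are illustrative assumptions. */
struct toy_pmap {
	uint32_t other_fields[18];	/* placeholder: 72 bytes */
	uint32_t pm_tid[256];		/* 1024 bytes at MAXCPU=256 */
	uint32_t pm_pdir[1024];		/* 1024 slots = exactly one page */
};

/* True if the byte range [start, start + len) touches two pages. */
static bool
spans_two_pages(uintptr_t start, size_t len)
{
	return (start / PAGE_SIZE_BYTES !=
	    (start + len - 1) / PAGE_SIZE_BYTES);
}
```

In this model `pm_pdir` starts 1096 bytes into the struct, so even a page-aligned `struct toy_pmap` has a page-sized directory that crosses a page boundary. A 512-entry (2048-byte) table at the same offset would end at byte 3144 and stay inside the first page, which matches the 22:12:50 remark.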