2008-10-21 01:09 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-10-21 01:28 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-10-21 04:20 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3
2008-10-21 08:43 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-10-21 08:57 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-21 09:10 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3
2008-10-21 09:46 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-10-21 10:01 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 10:39 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3
2008-10-21 12:37 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-21 12:47 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 13:59 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 15:30 ACTION is reading Chapter 15 from Understanding the Linux Kernel...
2008-10-21 16:09 ah, never looked at it to tell the truth
2008-10-21 16:09 mostly it just reads out the source code to us without interpretation
2008-10-21 16:10 it's modern enough to know that submit_bh calls submit_bio
2008-10-21 16:24 hey
2008-10-21 16:25 sk8 oclock
2008-10-21 16:27 enjoy :
2008-10-21 16:28 utlk is actually pretty good on the page IO life cycle
2008-10-21 16:28 is missing the recent stuff on dirty page limits and all the changes related to that
2008-10-21 16:28 which were pretty major
2008-10-21 16:30 utlk?
2008-10-21 16:31 understanding the linux kernel
2008-10-21 16:48 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 16:51 uh :D
2008-10-21 16:51 I thought it would be some sort of tool
2008-10-21 18:47 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 18:50 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-21 19:07 -!- macan(~chatzilla@159.226.41.129) has joined #tux3
2008-10-21 19:54 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3
2008-10-21 19:54 ACTION is getting ready...
2008-10-21 19:58 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3
2008-10-21 19:58 hi
2008-10-21 20:00 hi raluca
2008-10-21 20:00 how'd you like the photo gallery?
2008-10-21 20:00 http://phunq.net/sunset
2008-10-21 20:01 not in the ballpark of razvan's skillz, but...
2008-10-21 20:01 hmm, seems like "next session" was last session
2008-10-21 20:01 any maze?
2008-10-21 20:02 let's see if 2.6.27 is indexed yet
2008-10-21 20:02 yes
2008-10-21 20:02 ok it is
2008-10-21 20:03 hello
2008-10-21 20:03 hi hirofumi
2008-10-21 20:03 hi
2008-10-21 20:03 let's take a look at how fast path writepages works
2008-10-21 20:03 start with ext3_writepages
2008-10-21 20:03 remind me to ask a question about halloween afterwards... and flips, you're not really out are you ;-)?
2008-10-21 20:04 who me?
2008-10-21 20:04 flips: oh, I didn't see the gallery yet, let me check
2008-10-21 20:04 ->writepages is an address_space_operation
2008-10-21 20:05 meaning, associated with struct mapping
2008-10-21 20:05 err
2008-10-21 20:05 with struct address_space
2008-10-21 20:05 usually referenced by ->mapping
2008-10-21 20:05 are we at a specific place in the code?
2008-10-21 20:05 fun skew there
2008-10-21 20:05 we will be soon
2008-10-21 20:06 http://lxr.linux.no/linux+v2.6.26.6/fs/ext3/inode.c
2008-10-21 20:07 http://lxr.linux.no/linux+v2.6.26.6/fs/ext3/inode.c#L1769
2008-10-21 20:07 that's ext3_readpages
2008-10-21 20:07 ext3 doesn't support ->writepages
2008-10-21 20:08 I guess that's why I couldn't find it :P
2008-10-21 20:08 I should have looked in the struct from the start
2008-10-21 20:08 interesting question, why it's in ext2 and not ext3
2008-10-21 20:08 I'd guess it's the journaling
2008-10-21 20:09 makes it much harder to support
2008-10-21 20:09 because ext2 is old?
2008-10-21 20:09 because ext3 has more rules about writing I think
2008-10-21 20:09 you'll note writepage has 3 different implementations for ext3
2008-10-21 20:09 and the vfs writepages doesn't know those rules
2008-10-21 20:09 we'll get more specific about that later
2008-10-21 20:09 oh yes
2008-10-21 20:10 so ext3_readpages is just a wrapper for the library function
2008-10-21 20:10 mpage_readpages
2008-10-21 20:10 where it supplies its *_get_block function
2008-10-21 20:10 as is ext3_readpage
2008-10-21 20:11 yes
2008-10-21 20:11 and not the way tux3 is structured at the moment
2008-10-21 20:11 and possibly we will avoid creating tux3_get_block and using that whole library interface
2008-10-21 20:11 ACTION arrives fashionably late
2008-10-21 20:11 I'm leaning in that direction
2008-10-21 20:12 very fashionable
2008-10-21 20:12 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L371
2008-10-21 20:12 it builds up a bio containing a bunch of pages instead of just one
2008-10-21 20:13 wait a moment, which direction are you leaning in? (and what directions are there?)
2008-10-21 20:13 which saves merging in the block elevator among other things
2008-10-21 20:13 the direction I'm leaning in is not to have a tux3_get_block
2008-10-21 20:13 and therefore not using any library function that expects a get_block callback
2008-10-21 20:13 those library functions being ancient, crufty
2008-10-21 20:14 in practice I may find it's impractical, or it's totally practical
2008-10-21 20:14 don't know yet
2008-10-21 20:14 ah
2008-10-21 20:14 homework assignment? ;)
2008-10-21 20:14 so basically implementing readpage(s) manually?
2008-10-21 20:14 right
2008-10-21 20:15 like be beloved romfs :D
2008-10-21 20:15 be = my
2008-10-21 20:15 maybe
2008-10-21 20:15 probably worth the effort
2008-10-21 20:15 the callback mess is really a mess, and the concept of the get_block interface is kind of broken
2008-10-21 20:15 it implies some place to cache the physical address
2008-10-21 20:16 whereas that really should be the business of the fs
2008-10-21 20:16 ext2 read/write page/pages just look like wrappers too
2008-10-21 20:16 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L371 <- we're here now
2008-10-21 20:16 yes, because it's a non-journalled file system, which actually has a read/write block interface
2008-10-21 20:17 one thing we don't know from looking here, is where the list of pages we're writing came from
2008-10-21 20:17 mpage_readpage(s) are pretty much the same thing
2008-10-21 20:18 we write pages when they are dirty right?
2008-10-21 20:18 and we want to use them for something else
2008-10-21 20:18 anyway, we're going to write the whole list, and if we're lucky, the list refers to pages contiguous on disk
2008-10-21 20:18 although apparently there's no page cache lru interaction in the single page version
2008-10-21 20:18 because a single bio can only handle contiguous pages
2008-10-21 20:18 http://lxr.linux.no/linux+v2.6.26.6/fs/ext3/inode.c#L1423 eek
2008-10-21 20:19 there's no lru interaction in the multipage version either
2008-10-21 20:19 nope
2008-10-21 20:19 do_mpage_readpage can submit bios
2008-10-21 20:19 shapor, odd indeed
2008-10-21 20:20 so the data does not have to fit in a single bio
2008-10-21 20:20 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L384 <- nano optimization
2008-10-21 20:20 somebody must have measured it and determined it actually matters
2008-10-21 20:20 warming up a cache line ahead of time
2008-10-21 20:21 what do you mean there's no lru interaction?
2008-10-21 20:21 ok, so what's the add_to_page_cache_lru all about
2008-10-21 20:22 the bio also gets allocated inside do_mpage_readpage
2008-10-21 20:23 but the last submit happens outside of the main loop
2008-10-21 20:23 we peek into the page cache, if there's no page there we read it
2008-10-21 20:23 that's what that's doing
2008-10-21 20:23 why do we need that only in the multiple page case?
2008-10-21 20:23 the assumption: if we find a page, it must be either uptodate or dirty
2008-10-21 20:23 because the readpage won't be called on an uptodate page
2008-10-21 20:24 basically, no point in reading data we already have in the page cache
2008-10-21 20:24 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L168 <- here's the big rambling hack
2008-10-21 20:24 why does it matter if it's dirty?
2008-10-21 20:24 either way if it's in page cache, just use that copy, no?
2008-10-21 20:24 it's going to deal with issues like noncontiguous physical disk locations
2008-10-21 20:25 dirty or update, either way, don't have to read it
2008-10-21 20:25 dirty or uptodate I meant
2008-10-21 20:25 right
2008-10-21 20:25 right
2008-10-21 20:26 let's skim through this quickly and see if there's anything interesting
2008-10-21 20:26 which function are we skimming through?
2008-10-21 20:26 188 if (page_has_buffers(page))
2008-10-21 20:26 189 goto confused; <- now why would that be
2008-10-21 20:27 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L168
2008-10-21 20:27 do_mpage_readpage
2008-10-21 20:27 big rambling hack
2008-10-21 20:27 fast
2008-10-21 20:27 unpretty
2008-10-21 20:27 akpm coded this whole file in a couple days as I recall
2008-10-21 20:28 yes
2008-10-21 20:28 shortly before being anointed mm czar ;)
2008-10-21 20:28 because we're not expecting there to be cached in mem data for the page we're reading
2008-10-21 20:29 because we're not expecting any filesystem that monkeys with buffers to touch this mapping? I don't know
2008-10-21 20:29 oh
2008-10-21 20:29 because we just added it
2008-10-21 20:29 and therefore it shouldn't have buffers
2008-10-21 20:30 could write BUG there, it would have to be a race
2008-10-21 20:30 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L350
2008-10-21 20:30 another hack
2008-10-21 20:30 it's a shame this function gets called one page at a time
2008-10-21 20:30 must move the cpu needle
2008-10-21 20:31 that's where I think we're going to do the whole bio prep look in tux3
2008-10-21 20:31 instead of interfacing to the library
2008-10-21 20:31 it's similar to the code already in filemap.c
2008-10-21 20:32 http://lxr.linux.no/linux+v2.6.26.6/fs/mpage.c#L350 <- let's consider that issue later
2008-10-21 20:33 re tux3
2008-10-21 20:33 199 * Map blocks using the result from the previous get_blocks call first.
2008-10-21 20:34 nigh on unreadable
2008-10-21 20:35 my brain is sizzling...
2008-10-21 20:35 223 * Then do more get_blocks calls until we are done with this page. <- makes more sense
2008-10-21 20:35 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-21 20:35 from 223 we see the ->get_block calls
2008-10-21 20:35 one for each buffer on the page
2008-10-21 20:36 sorry
2008-10-21 20:36 for each block on the page
2008-10-21 20:36 because we're doing this without buffers
2008-10-21 20:36 that's the main point of it
2008-10-21 20:36 hrm
2008-10-21 20:36 avoids buffer oriented IO for most file data
2008-10-21 20:37 we're using a fake buffer
2008-10-21 20:37 called map_bh
2008-10-21 20:37 just so the ->get_block interface will work
2008-10-21 20:37 crufty? yes very
2008-10-21 20:37 hm get_block gets called a lot
2008-10-21 20:38 247 /* some filesystems will copy data into the page during
2008-10-21 20:38 248 * the get_block call, <- for example, tail packing filesystem, for example reiserfs
2008-10-21 20:39 290 * This page will go to BIO. Do we need to send this BIO off first? <- what happens if we hit discontiguous blocks
2008-10-21 20:39 the entire vfs interface/libraries really seem to be optimized/written for older 'simpler' filesystems
2008-10-21 20:39 it's because we "evolve" linux
2008-10-21 20:39 with incremental changes
2008-10-21 20:40 generally helpful for stability, but not for structure
2008-10-21 20:40 and then all the newer filesystems basically reimplement this or skip parts of it
2008-10-21 20:40 true
2008-10-21 20:40 the whole page, block, buffer seems like it should be much simpler
2008-10-21 20:40 major cut & paste culture
2008-10-21 20:40 mapping*
2008-10-21 20:40 nobody wants to read/understand this shit ;)
2008-10-21 20:40 hmm... isn't this the way MS does with their OS :P
2008-10-21 20:40 shapor, yes way simpler
2008-10-21 20:40 it's fairly obscene at the moment
2008-10-21 20:40 perhaps, but how many fs'es does MS support?
2008-10-21 20:40 I don't think akpm would argue that
2008-10-21 20:41 maze, a fraction of linux
2008-10-21 20:41 I was thinking about the OS not the FS part ;-)
2008-10-21 20:41 one thing bsders seem to tout is the fact they've done away with buffer heads
2008-10-21 20:41 I should check out how they went about that
2008-10-21 20:41 had some discussion with dilon about it
2008-10-21 20:42 but didn't follow up by reading code
2008-10-21 20:42 I got the impression... by implementing a new, xfs like layer
2008-10-21 20:42 sounds like they just hid them then
2008-10-21 20:43 199 * Map blocks using the result from the previous get_blocks call first. <- ok I think I grok this now
2008-10-21 20:43 the filesystem is free to go ahead and map more blocks than the one asked for
2008-10-21 20:45 map in what sense?
2008-10-21 20:45 interesting project is to go trace the lifetime of map_bh through this code
2008-10-21 20:45 map as in call ->get_block
2008-10-21 20:45 to get a physical mapping, store it in the bh->block
2008-10-21 20:45 bh->b_blocknr I think it was
2008-10-21 20:46 and bh->b_size is how many blocks were mapped
2008-10-21 20:47 I didn't notice that
2008-10-21 20:47 good eyes
2008-10-21 20:47 very ugly hack
2008-10-21 20:48 really pushing the buffer interface past the breaking point
2008-10-21 20:48 yes, incremental change
2008-10-21 20:48 we finished for _readpages for now?
2008-10-21 20:48 I sure hope so ;-)
2008-10-21 20:49 let's see if we can figure out why writepages is used by ext2 and not by ext3 in the next 11 minutes
2008-10-21 20:49 uhm, wild guess... journalling
2008-10-21 20:49 which would only matter for data-journalled - except
2008-10-21 20:50 for writes beyond eof and in holes
2008-10-21 20:50 of course, but that's not a sufficiently precise answer
2008-10-21 20:50 getting more precise
2008-10-21 20:50 you can't use the get_block interface?
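The readpages walk-through above (one ->get_block call per block, a filesystem that may map more blocks than asked for, and "send this BIO off first" when the next block is not contiguous) can be sketched in userspace C. This is only an illustration under stated assumptions: `struct map_result`, `toy_get_block` and `count_bios` are hypothetical names, not kernel APIs, and the sketch counts blocks for simplicity whereas the kernel's `b_size` field is a byte count. It also ignores the per-bio size limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fake buffer_head (map_bh) handed to ->get_block. */
struct map_result {
    unsigned long phys;   /* first physical block of the mapped run */
    unsigned long count;  /* how many blocks the filesystem chose to map */
};

/* Hypothetical callback type: map logical block lblk, covering at most max blocks. */
typedef void (*get_block_fn)(unsigned long lblk, unsigned long max,
                             struct map_result *out);

/* Toy layout: logical blocks 0..3 live at 100..103, everything after at 500+. */
static void toy_get_block(unsigned long lblk, unsigned long max,
                          struct map_result *out)
{
    if (lblk < 4) {
        unsigned long left = 4 - lblk;
        out->phys = 100 + lblk;
        out->count = left < max ? left : max;
    } else {
        out->phys = 500 + (lblk - 4);
        out->count = max;
    }
}

/*
 * Walk nblocks logical blocks and return how many "bios" are needed when
 * one bio can only cover a physically contiguous run. *calls reports how
 * often the callback ran, which drops when the callback maps whole runs.
 */
static unsigned count_bios(get_block_fn get_block, unsigned long nblocks,
                           unsigned *calls)
{
    unsigned bios = 0;
    unsigned long lblk = 0, next_phys = 0;

    *calls = 0;
    while (lblk < nblocks) {
        struct map_result m = { 0, 0 };

        get_block(lblk, nblocks - lblk, &m);
        (*calls)++;
        assert(m.count > 0 && m.count <= nblocks - lblk);

        /* Discontiguous with the previous run: submit and start a new bio. */
        if (bios == 0 || m.phys != next_phys)
            bios++;

        next_phys = m.phys + m.count;
        lblk += m.count;
    }
    return bios;
}
```

With the toy layout, eight logical blocks split into two physical runs, so `count_bios(toy_get_block, 8, &calls)` needs two bios and only two callback invocations.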
2008-10-21 20:50 the proposal is "it only matters for data=journalled"
2008-10-21 20:50 for jbd, i think
2008-10-21 20:50 uhm, no.
2008-10-21 20:51 ok, to start with, writepages only works on data, not metadata
2008-10-21 20:51 the proposal was, it would only matter for data=journalled, except for write past eof, and sparse files, which is why it's always needed
2008-10-21 20:51 and hence the 3 different ext3 writepage implementations
2008-10-21 20:52 and no writepages implementation, which is the interesting question
2008-10-21 20:52 let's see under what conditions vfs calls ->writepages
2008-10-21 20:52 my guess is lack of writepages, means it falls back to using writepage one at a time
2008-10-21 20:52 perhaps writepages is too complicated to journalize
2008-10-21 20:53 actually data=ordered also needs special handling because of consistency guarantees it offers
2008-10-21 20:53 perhaps it can fail in too many ways :P
2008-10-21 20:53 or no one has bothered to yet ;-)
2008-10-21 20:53 (my guess)
2008-10-21 20:53 [since with journaling writes are slow anyways...]
2008-10-21 20:53 akpm would bother if it would make ext3 go faster
2008-10-21 20:54 so I'm rejecting that theory
2008-10-21 20:54 hmm, really?
2008-10-21 20:54 http://lxr.linux.no/linux+v2.6.26.6/mm/page-writeback.c#L1003
2008-10-21 20:54 you see the lengths that have gone to already
2008-10-21 20:54 we take our slight advantage over bsd seriously ;)
2008-10-21 20:54 right, so we use generic
2008-10-21 20:54 hmm? where's the advantage?
2008-10-21 20:55 so the answer may be: generic_writepages works for ext3, not for ext2
2008-10-21 20:55 maybe
2008-10-21 20:56 don't buy that
2008-10-21 20:56 anyway, we have found our way to the main place that pages are written in linux
2008-10-21 20:56 maybe the ext2 case could be more optimized?
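The guess above, that a mapping without ->writepages falls back to a generic path calling ->writepage one page at a time, can be sketched as a small userspace C model. Everything here is a hypothetical stand-in (`toy_aops`, `toy_mapping`, `toy_do_writepages` and friends are not kernel names), and the real code also deals with locking, writeback control and error handling that this ignores.

```c
#include <stddef.h>

struct toy_mapping;

/* Hypothetical, cut-down analogue of address_space_operations. */
struct toy_aops {
    int (*writepage)(struct toy_mapping *m, int page_index);
    int (*writepages)(struct toy_mapping *m);   /* optional, may be NULL */
};

struct toy_mapping {
    const struct toy_aops *a_ops;
    int ndirty;         /* pretend pages 0..ndirty-1 are dirty */
    int single_calls;   /* how many times ->writepage ran */
    int batch_calls;    /* how many times ->writepages ran */
};

/* Generic fallback: push each dirty page through ->writepage, one at a time. */
static int toy_generic_writepages(struct toy_mapping *m)
{
    for (int i = 0; i < m->ndirty; i++) {
        int err = m->a_ops->writepage(m, i);
        if (err)
            return err;
    }
    m->ndirty = 0;
    return 0;
}

/* Dispatch: prefer the filesystem's own ->writepages, else use the fallback. */
static int toy_do_writepages(struct toy_mapping *m)
{
    if (m->a_ops->writepages)
        return m->a_ops->writepages(m);
    return toy_generic_writepages(m);
}

/* Counting hooks so the two paths can be told apart. */
static int count_writepage(struct toy_mapping *m, int page_index)
{
    (void)page_index;
    m->single_calls++;
    return 0;
}

static int count_writepages(struct toy_mapping *m)
{
    m->batch_calls++;
    m->ndirty = 0;
    return 0;
}

/* An ext3-like table has no ->writepages; an ext2-like table supplies one. */
static const struct toy_aops ext3_like = { count_writepage, NULL };
static const struct toy_aops ext2_like = { count_writepage, count_writepages };
```

With five dirty pages, the ext3-like mapping sees five ->writepage calls and the ext2-like mapping sees a single ->writepages call.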
2008-10-21 20:56 http://lxr.linux.no/linux+v2.6.26.6/fs/ext2/inode.c#L778
2008-10-21 20:56 http://lxr.linux.no/linux+v2.6.26.6/mm/page-writeback.c#L862 <- write_cache_pages
2008-10-21 20:56 let's see what is the generic one...
2008-10-21 20:56 _2copy will only get a few fringe cases, most ext3 traffic will go through here
2008-10-21 20:57 I suppose nobody got around to plugging generic_writepages into ext2
2008-10-21 20:57 block_dev.c: .writepages = generic_writepages,
2008-10-21 20:58 hmm, I don't like my latest theory either
2008-10-21 20:58 truth is, I don't know and with 2 minutes to go I'm declaring it homework
2008-10-21 20:58 that, and "read generic_writepages"
2008-10-21 20:59 :-)
2008-10-21 20:59 did we have fun today?
2008-10-21 20:59 we're certainly wading in it
2008-10-21 20:59 sinking...
2008-10-21 20:59 I feel it was shorter...
2008-10-21 20:59 one thing worth remembering: there's weird locking going on through all of this
2008-10-21 21:00 and scheduling
2008-10-21 21:00 in other words, we're taking a superficial view of it so far
2008-10-21 21:01 see, mpage_writepages is just a wrapper for write_cache_pages too
2008-10-21 21:02 for every one thing i learn in these sessions i find out about 10 more i have no clue about, makes it feel like a net loss ;)
2008-10-21 21:02 lol
2008-10-21 21:02 ACTION feels pretty much the same...
2008-10-21 21:02 there's also so much history behind how it all is...
2008-10-21 21:03 anyway what's up with halloween?
2008-10-21 21:04 and who's on these 2 photos? http://phunq.net/sunset/.1024/.html/woohoo.jpg.html and http://phunq.net/sunset/.1024/.html/wheee.jpg.html
2008-10-21 21:04 ok, write_cache_pages is for filesystems that don't supply a get_block, but do supply a ->writepage
2008-10-21 21:04 when is halloween? :D
2008-10-21 21:04 oct 31
2008-10-21 21:04 maze, we're making arrangements for something rather cool
2008-10-21 21:05 cool, but where and when?
2008-10-21 21:05 oct 31
2008-10-21 21:05 venice beach
2008-10-21 21:05 we can start early
2008-10-21 21:05 on 3rd street
2008-10-21 21:05 so Friday next week
2008-10-21 21:05 soon, yes
2008-10-21 21:05 expect email
2008-10-21 21:05 does venice beach, mean beach in venice, ca?
2008-10-21 21:06 yes, just south of santa monica
2008-10-21 21:06 caveat: you need to be on the southwest side of the 405 before late afternoon
2008-10-21 21:06 ugh, that's even farther than Malibu...
2008-10-21 21:06 before early afternoon even
2008-10-21 21:06 I'm about 10 minutes from malibu
2008-10-21 21:06 MaZe: planning on being in malibu?
2008-10-21 21:06 15 maybe
2008-10-21 21:07 no, just Malibu has somehow always seemed to be as a tropical paradise on the other end of the world (back when I lived in eu)
2008-10-21 21:07 :)
2008-10-21 21:07 oh, 26 minutes it tells me
2008-10-21 21:08 depends from where in malibu
2008-10-21 21:08 malibu is 27 miles itself
2008-10-21 21:08 I can't think of as "where movie stars get arrested for driving their suvs drunk"
2008-10-21 21:08 355 miles
2008-10-21 21:08 i think of it as the gateway to the santa monica mountains
2008-10-21 21:09 all the nice roads, vrewm
2008-10-21 21:10 start early? around when?
2008-10-21 21:11 ah, ext3 has its own kernel feature: ->write_begin, ->write_end for ordered write
2008-10-21 21:11 also used by btrfs I think
2008-10-21 21:12 btrfs doesn't have
2008-10-21 21:12 planned to be used
2008-10-21 21:13 prepare_write?
2008-10-21 21:13 not really
2008-10-21 21:13 different thing
2008-10-21 21:14 having a hard time finding the call point in vfs
2008-10-21 21:14 e.g. pagecache_write_begin?
2008-10-21 21:15 this one call: http://lxr.linux.no/linux+v2.6.26.6/drivers/block/loop.c#L769
2008-10-21 21:15 but surely that isn't the only one
2008-10-21 21:15 http://lxr.linux.no/linux+v2.6.26.6/mm/filemap.c#L1912
2008-10-21 21:15 hirofumi, right
2008-10-21 21:16 from cscope:
2008-10-21 21:16 fs/affs/file.c affs_truncate 829 res = mapping->a_ops->write_begin(NULL, mapping, size, 0, 0, &page, &fsdata);
2008-10-21 21:16 fs/ext4/inode.c ext4_page_mkwrite 4851 ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
2008-10-21 21:16 mm/filemap.c pagecache_write_begin 2020 return aops->write_begin(file, mapping, pos, len, flags,
2008-10-21 21:16 mm/filemap.c generic_perform_write 2429 status = a_ops->write_begin(file, mapping, pos, bytes, flags,
2008-10-21 21:16 well, btrfs seems to do it a much different way
2008-10-21 21:17 ok, it's a wrapper for grab_cache_page and prepare_write, or a fs hook
2008-10-21 21:17 really crufty
2008-10-21 21:18 luckily, prepare_write will go away soon
2008-10-21 21:18 and all one will use ->write_begin
2008-10-21 21:18 it will? never understood what it was for
2008-10-21 21:18 in the first place
2008-10-21 21:19 so I guess the answer is, it was always bogus
2008-10-21 21:20 three calls from buffer.c look like the only interesting ones
2008-10-21 21:21 hmm
2008-10-21 21:21 block_prepare_write()? not ->prepare_write
2008-10-21 21:21 those are fringe cases
2008-10-21 21:22 it's quite impressive how all this churn has happened and most filesystem code is barely affected
2008-10-21 21:23 recent bloatup in core is pretty scary
2008-10-21 21:23 ->prepare_write is replaced by ->write_begin
2008-10-21 21:24 ->commit_write was replaced by ->write_end
2008-10-21 21:31 flips, btw, do you already have ideas for buffer management?
2008-10-21 21:32 hirofumi, yes
2008-10-21 21:33 oh, great
2008-10-21 21:33 hirofumi, it's the main topic of the post I've been working on for the last week
2008-10-21 21:33 hopefully I'll post in about 2 hours
2008-10-21 21:34 great :)
2008-10-21 21:43 -!- FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 21:46 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-21 21:52 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 21:53 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-21 21:54 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
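The remark earlier in the session that pagecache_write_begin is "a wrapper for grab_cache_page and prepare_write, or a fs hook" can be sketched as a simple preference order: use the newer ->write_begin hook if the filesystem supplies one, otherwise fall back to the legacy ->prepare_write path. This is a rough userspace model only; `toy_write_ops`, `toy_pagecache_write_begin` and the return codes are hypothetical, and the page lookup done by the real legacy path is reduced to a comment.

```c
#include <stddef.h>

/* Hypothetical, cut-down operations table; not the kernel's structure. */
struct toy_write_ops {
    int (*write_begin)(long pos, unsigned len);    /* newer hook, may be NULL */
    int (*prepare_write)(long pos, unsigned len);  /* legacy hook, may be NULL */
};

/* Illustrative return codes that record which path was taken. */
enum { USED_WRITE_BEGIN = 1, USED_PREPARE_WRITE = 2, NO_HOOK = -1 };

/* Prefer ->write_begin; otherwise take the legacy route. */
static int toy_pagecache_write_begin(const struct toy_write_ops *ops,
                                     long pos, unsigned len)
{
    if (ops->write_begin)
        return ops->write_begin(pos, len);

    /* Legacy path: the real code would first look up or create the page. */
    if (ops->prepare_write)
        return ops->prepare_write(pos, len);

    return NO_HOOK;
}

static int new_hook(long pos, unsigned len)
{
    (void)pos;
    (void)len;
    return USED_WRITE_BEGIN;
}

static int old_hook(long pos, unsigned len)
{
    (void)pos;
    (void)len;
    return USED_PREPARE_WRITE;
}
```

A table that supplies both hooks takes the ->write_begin path, which is the sense in which ->prepare_write can be retired once every filesystem provides the newer hook.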