2008-10-23 00:27 flips: nice
2008-10-23 00:28 ACTION is reading now
2008-10-23 00:28 enjoy
2008-10-23 00:29 was doing some testing tonight against the scheduler. there's no way the current scheduler rebalancing code can guarantee determinancy since it still balances on a best-effort basis and can double/cross lock runqueues, delaying the cpu local schedule() calls from being able to reschedule
2008-10-23 00:29 I'll have to do some kind of rt based processor isolation that's possibly dynamic
2008-10-23 00:29 determinacy?
2008-10-23 00:29 flips: you and matt are a great combination, like two peas in a po
2008-10-23 00:30 flips: deterministic latency
2008-10-23 00:30 seems like
2008-10-23 00:30 the -rt patch is fully preemptible but the scheduler is a mismatch because it's largely best effort
2008-10-23 00:31 s/realtime/rubbertime/
2008-10-23 00:34 I wonder if the -rt patch still fails to do swsuspend properly
2008-10-23 00:35 don't know
2008-10-23 00:37 po=pod
2008-10-23 00:54 flips: btw, you'll have to abstract the regular file handling code with the metadata file stuff, if you didn't know that already, using routines so that the metadata files aren't...
2008-10-23 00:54 treated the same as regular files. They'll still use basic file load routines and stuff, but not have the same semantics in the fs
2008-10-23 00:54 you probably know that already
2008-10-23 00:56 you'll need some kind of atomic write barrier, I guess, as well
2008-10-23 00:59 ACTION remembers soft-updates in ffs
2008-10-23 00:59 er FreeBSD UFS
2008-10-23 01:03 flips: the email is too design heavy for most regular lkml folks, but keep on going...
2008-10-23 01:05 ACTION never liked that aspect of Linux kernel culture
2008-10-23 01:07 flips: isn't this going to require VM changes as well for the forked-buffer stuff ?
2008-10-23 01:07 or does that stuff already exist from the ext3 work ?
2008-10-23 01:09 Well, it's the best of the worse ways :\
2008-10-23 01:15 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3
2008-10-23 01:18 bh, I didn't post it to lkml
2008-10-23 01:19 bh, all the necessary interfaces are available to modules
2008-10-23 01:19 and if they weren't, I'd make them
2008-10-23 01:19 ok, what about the buffer forking ?
2008-10-23 01:19 ok
2008-10-23 01:19 makes snse
2008-10-23 01:19 sense
2008-10-23 01:19 should work out fine
2008-10-23 01:19 because it's kind of a dramatic thing for the VM
2008-10-23 01:19 we'll make it a tux3 U homework project
2008-10-23 01:20 ext3 does a similar thing
2008-10-23 01:20 yeah, figured as much
2008-10-23 01:20 which is one of the reasons the interfaces have to be exposed
2008-10-23 01:20 but it's probably not as sophisticated
2008-10-23 01:21 it's a journal, it has its own sophistication
2008-10-23 01:21 read the linked pdf for some great entertainment
2008-10-23 01:22 showing design stuff on lkml isn't entirely pointless, at least jon corbet reads it
2008-10-23 01:22 and understands
2008-10-23 01:23 that's good
2008-10-23 01:23 he kind of cares
2008-10-23 01:24 man this is going to trigger edge cases like crazy
2008-10-23 01:26 ACTION wishes he can work on this :\
2008-10-23 01:26 easy enough to get that wish granted
2008-10-23 01:28 nice speculative recovery
2008-10-23 01:28 posting to lkml won't hurt
2008-10-23 01:28 it's interesting reading
2008-10-23 01:29 flips: you store the metadata/header for extents in reverse order right ?
2008-10-23 01:30 yes
2008-10-23 01:30 nice
2008-10-23 01:31 because I'm thinking about that blob thing
2008-10-23 01:31 you can versio that metadata with a constant in a fixed location
2008-10-23 01:32 versio?
2008-10-23 01:32 version I guess
2008-10-23 01:32 extent version using a special number
2008-10-23 01:32 yes
2008-10-23 01:33 so that you know how to read that structure
2008-10-23 01:33 versioning strategy's pretty well worked out
2008-10-23 01:33 typing with one hand right now :)
2008-10-23 01:52 ok night
2008-10-23 01:52 hello
2008-10-23 03:48 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-10-23 04:30 -!- bh(~billh@ip68-107-26-122.sd.sd.cox.net) has joined #tux3
2008-10-23 07:07 -!- pgquiles_(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3
2008-10-23 07:12 -!- mlankhorst(~m@fw1.astro.rug.nl) has joined #tux3
2008-10-23 09:07 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-23 09:10 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-23 10:08 -!- mingming(~mingming@32.97.110.51) has joined #tux3
2008-10-23 10:13 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-23 10:42 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3
2008-10-23 10:59 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-23 11:41 quick q: page->lru is something that filesystem will not mess with, right?
2008-10-23 12:02 -!- pgquiles(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3
2008-10-23 12:06 IIRC, page->lru is vm stuff. it is used to track which pages are active. and if the vm wants more free memory, it may ask the fs to clean inactive pages
2008-10-23 12:07 great!
2008-10-23 12:07 I'm the VM in my case ;-)
2008-10-23 12:35 -!- pgquiles(~pgquiles@156.Red-88-25-133.staticIP.rima-tde.net) has joined #tux3
2008-10-23 12:56 -!- FelipeS(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-23 13:23 -!- FelipeS_(~Felipe@lawn-128-61-31-5.lawn.gatech.edu) has joined #tux3
2008-10-23 14:46 razvanm, correct
2008-10-23 14:46 though the filesystem may move pages to the front or back of the lru queue if it thinks it knows something the vmm doesn't
2008-10-23 15:02 my tux3.notes file is about 2,000 lines long
2008-10-23 15:02 mostly consisting of posts I haven't posted
2008-10-23 15:18 wow... big backlog :P
2008-10-23 15:59 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3
2008-10-23 16:43 folks
2008-10-23 17:16 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-23 17:30 ponk
2008-10-23 17:33 sk8 oclock
2008-10-23 19:11 hello
2008-10-23 19:11 which stage will do "physical remapping"? rollup?
2008-10-23 19:13 hirofumi, not rollup
2008-10-23 19:13 phase transition
2008-10-23 19:13 oh
2008-10-23 19:13 when a new phase is ready to commit to disk, the first thing to do is flush all dirty inodes
2008-10-23 19:14 all dirty inodes means ileaf? or whole itable btree?
2008-10-23 19:15 flushing dirty inodes in kernel would call write_pages, in tux3 userspace it calls write_buffer->map->ops->brwrite
2008-10-23 19:15 dirty inode table blocks have to be flushed too
2008-10-23 19:15 well
2008-10-23 19:15 not flushed, but committed
2008-10-23 19:16 flushing is just the process of committing cached data to writeout
2008-10-23 19:16 yes
2008-10-23 19:16 the whole btree does not have to be committed
2008-10-23 19:17 because we have the "promise" system
2008-10-23 19:17 -!- FelipeS_(~Felipe@r77h15.res.gatech.edu) has joined #tux3
2008-10-23 19:17 ACTION has to read that email more deeply
2008-10-23 19:17 we just write out the leaf nodes and "promise" to update the pointers in parents
2008-10-23 19:18 maybe promise is logical logging?
2008-10-23 19:18 yes
2008-10-23 19:18 i see
2008-10-23 19:19 I used to call it logical records in commit blocks
2008-10-23 19:19 promise is short for that
2008-10-23 19:19 i see
2008-10-23 19:20 another question is:
2008-10-23 19:20 modified buffers in active tree can't be freed, because we don't know
2008-10-23 19:20 final state of that buffer until rollup? If so, we must pin many
2008-10-23 19:20 btree-index buffers (etc.) of active tree if user modified many inodes?
2008-10-23 19:20 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-23 19:20 exactly
2008-10-23 19:21 i see
2008-10-23 19:21 that is the "dirty metadata"
2008-10-23 19:21 which must be reconstructed if we crash
2008-10-23 19:21 using the logical records in the commit blocks (promises)
2008-10-23 19:21 yes
2008-10-23 19:22 even if vm wants more memory, we can't free those buffers?
2008-10-23 19:27 how will we handle ENOSPC? we must do rollup/phase transition to make more free space?
2008-10-23 19:28 if it is possible
2008-10-23 19:39 hirofumi, yes, those buffers pin memory even if the vm is low on memory, so we need to make sure not to use too much
2008-10-23 19:40 i see
2008-10-23 19:40 the closer we get to filesystem full, the shorter a phase can be
2008-10-23 19:41 when the vmm is very low on memory, it sets the PF_MEMALLOC flag and calls ->writepage to free memory
2008-10-23 19:41 the PF_MEMALLOC flag gives the filesystem access to an emergency reserve of a few megabytes
2008-10-23 19:42 yes
2008-10-23 19:43 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3
2008-10-23 19:43 otherwise, if cache memory is low but the vmm has not called our filesystem to write out dirty pages, then we will just block in alloc_pages waiting for memory to be freed
2008-10-23 19:45 i see. I thought if we can manage with less memory, it's great
2008-10-23 19:46 it won't use much memory, it's just index blocks that are pinned
2008-10-23 19:46 each index block references up to 512 data blocks
2008-10-23 19:47 index blocks and bitmap blocks, each bitmap block covers 128 MB of filesystem blocks
2008-10-23 19:47 i see
2008-10-23 19:48 if we need to unpin some then we do a rollup
2008-10-23 19:48 being careful to always have enough memory in the emergency reserve to do the rollup
2008-10-23 19:49 i see
2008-10-23 19:50 i read someone says modern fs uses too much memory
2008-10-23 19:50 btrfs/hammer etc.
2008-10-23 19:51 zfs
2008-10-23 19:51 yes
2008-10-23 19:51 they aren't careful with memory
2008-10-23 19:51 I have been very careful
2008-10-23 19:51 i really dislike that
2008-10-23 19:52 right, it's no good to have more memory if it is just wasted
2008-10-23 19:52 how do they waste it?
2008-10-23 19:52 ZFS has 128 byte block pointers for example
2008-10-23 19:54 128 bytes?!?
2008-10-23 19:54 1024 bits?
2008-10-23 19:54 huge space
2008-10-23 19:54 why do they need them so big?
2008-10-23 19:55 good question
2008-10-23 19:55 128bits?
not bytes
2008-10-23 19:55 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3
2008-10-23 19:55 128 bits sounds better :D
2008-10-23 19:55 one thing is, when they do their raid, they like to put multiple pointers to redundant copies of the block in the block pointer
2008-10-23 19:55 hi
2008-10-23 19:55 128 bytes, yes
2008-10-23 19:56 hi raluca
2008-10-23 19:56 ACTION has bad manners :P
2008-10-23 19:57 ah
2008-10-23 19:57 in that huge size they have plenty of space to even store a sha1 or an md5
2008-10-23 19:58 yes, they do that
2008-10-23 20:00 it's tux3 oclock
2008-10-23 20:00 yup
2008-10-23 20:00 I was looking for the zfs repository
2008-10-23 20:00 ACTION is ready
2008-10-23 20:00 I've been in there before
2008-10-23 20:00 google didn't find it right away
2008-10-23 20:01 left as an exercise for the interested reader how they wasted, err, needed 128 bytes for a block pointer
2008-10-23 20:01 i'm using git for opensolaris
2008-10-23 20:01 right, there's an online repo somewhere on opensolaris.org
2008-10-23 20:02 google just didn't find it right away
2008-10-23 20:02 git://repo.or.cz/opensolaris.git
2008-10-23 20:02 ok, let's be a little selfish today and take a look at something we actually need for tux3
2008-10-23 20:02 mirror though
2008-10-23 20:02 the latest post mentions "forking" a buffer
2008-10-23 20:03 that happens when we want to change a buffer, but it is already committed to writeout
2008-10-23 20:03 hey
2008-10-23 20:03 or another way of putting it, it is not in the current phase
2008-10-23 20:03 so we can't change it any more
2008-10-23 20:04 what we do is remove the underlying page from the buffer cache, or in kernel, the page cache
2008-10-23 20:04 copy the data to another page, and put that in the page cache
2008-10-23 20:04 let's look at kernel code to see how we might make that work
2008-10-23 20:04 where should we look first?
2008-10-23 20:04 buffer.c? :D
2008-10-23 20:05 what are we looking for?
2008-10-23 20:05 a write
2008-10-23 20:05 we're looking for where the block is cached
2008-10-23 20:05 remember, in kernel, buffers are just handles for block IO
2008-10-23 20:06 as opposed to in tux3 userspace where we tend to think of them as cached blocks
2008-10-23 20:06 well, they still are, but in kernel they are not the primary unit
2008-10-23 20:06 pages are
2008-10-23 20:07 there are two kinds of places where filesystems cache blocks
2008-10-23 20:07 the "buffer cache", which is just a page cache mapped one to one to the block device
2008-10-23 20:07 and the so called "page cache" which is a page cache per inode
2008-10-23 20:08 "page cache" is actually misnamed, it sounds like one big cache, it's actually lots of caches
2008-10-23 20:08 ok, so where we should look depends on the kind of block we need to fork
2008-10-23 20:09 suppose it is a directory entry block, where do we look?
2008-10-23 20:09 dirent is cached as page cache, or buffer cache?
2008-10-23 20:09 tell me
2008-10-23 20:10 and your logic
2008-10-23 20:10 dirent is allocated using the slab allocator
2008-10-23 20:10 that was a wild guess
2008-10-23 20:10 and the content of the directory is just a file
2008-10-23 20:10 right
2008-10-23 20:10 so we should look in...
2008-10-23 20:11 ok, let's look at ext3_bread
2008-10-23 20:11 see where it goes
2008-10-23 20:12 should we use .26 or .27 today?
2008-10-23 20:12 let's try .27
2008-10-23 20:12 http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L1054
2008-10-23 20:12 keep up with mainline, more or less
2008-10-23 20:12 yes
2008-10-23 20:12 RazvanM, ok, follow it in, see where it goes
2008-10-23 20:14 found the next function in?
2008-10-23 20:14 (sorry, I was looking for the dentry_cache :P)
2008-10-23 20:15 next is ext3_getblk
2008-10-23 20:15 right
2008-10-23 20:15 and where does it go from there?
2008-10-23 20:15 the main call
2008-10-23 20:15 hint: don't worry about the ext3 handle
2008-10-23 20:15 ll_rw_block
2008-10-23 20:16 for now
2008-10-23 20:16 which looks to be deprecated :|
2008-10-23 20:16 look closer
2008-10-23 20:16 sb_getblk after get_block like op
2008-10-23 20:16 right
2008-10-23 20:16 let's see how that works
2008-10-23 20:16 sb_getblk
2008-10-23 20:16 we've looked at it before
2008-10-23 20:16 it sounds familiar :P
2008-10-23 20:17 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1403
2008-10-23 20:17 right
2008-10-23 20:18 follow the _slow path
2008-10-23 20:18 I know those
2008-10-23 20:18 this is where the kernel code is really crappy ;)
2008-10-23 20:18 the block to page mapping is done in grow_buffers :p
2008-10-23 20:18 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1119
2008-10-23 20:19 grow_buffers :D
2008-10-23 20:19 love the name
2008-10-23 20:19 I wouldn't expect anybody to guess that
2008-10-23 20:19 took me 10 minutes to figure it out last time we were in here
2008-10-23 20:19 1109 /* Create a page with the proper size buffers.. */
2008-10-23 20:20 http://lxr.linux.no/linux+v2.6.27/fs/buffer.c#L1028
2008-10-23 20:20 1048 if (!try_to_free_buffers(page))
2008-10-23 20:20 1049 goto failed;
2008-10-23 20:20 <- lovely
2008-10-23 20:21 and at failed we BUG
2008-10-23 20:21 this mechanism is in a state of transition ;)
2008-10-23 20:21 1055 bh = alloc_page_buffers(page, size, 0);
2008-10-23 20:21 flips: can you explain again the link between the page and buffer_head? :D
2008-10-23 20:21 each page has a place to attach a circular list of buffer_heads to it
2008-10-23 20:22 as many as there can be blocks on the page
2008-10-23 20:22 usually one
2008-10-23 20:22 page is 4k usually
2008-10-23 20:22 yes
2008-10-23 20:22 is the block usually 4k?
2008-10-23 20:22 yes
2008-10-23 20:22 cool
2008-10-23 20:22 default for nearly all filesystems
2008-10-23 20:22 didn't know that :D
2008-10-23 20:23 it's not very cool actually, because 4K is a bit small on modern hardware
2008-10-23 20:23 non-unix fs has 512 bytes
2008-10-23 20:23 this is a big flaw in linux
2008-10-23 20:23 can't have buffer bigger than page
2008-10-23 20:23 romfs has 1KB I think :P
2008-10-23 20:24 smaller blocks create less external fragmentation
2008-10-23 20:24 one more q: what happens when there is more than one bh in a page?
2008-10-23 20:24 that will only be the case when block size is a fraction of page size
2008-10-23 20:24 right
2008-10-23 20:24 see all that code that checks for buffers being there and puts them there if they are not
2008-10-23 20:25 sometime when you have a lot of time on your hands, go read try_to_free_buffers
2008-10-23 20:25 ...the worst function in the entire kernel
2008-10-23 20:25 so the bh in page are continuos?
2008-10-23 20:25 continous?
2008-10-23 20:25 continuous?
2008-10-23 20:26 contiguous
2008-10-23 20:26 yes
2008-10-23 20:26 what space on the disk do they cover
2008-10-23 20:26 and that's important
2008-10-23 20:26 because what it does in the case of block smaller than page is create false sharing
2008-10-23 20:26 good... I didn't know that :D
2008-10-23 20:27 not contiguous on disk
2008-10-23 20:27 contiguous in memory
2008-10-23 20:27 aaaaaa
2008-10-23 20:27 sorry
2008-10-23 20:27 grrr...
2008-10-23 20:27 I liked the other answer better :D
2008-10-23 20:27 yes, that leads to a lot of headaches
2008-10-23 20:27 exactly!!
2008-10-23 20:28 so in tux3, we want to branch a buffer, but we actually have to mess with a whole page
2008-10-23 20:28 but in tux3 the buffer will be 4k, right?
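A minimal sketch of the page sharing discussed above, assuming the usual 4K page mentioned in the log and a buffer cache mapped one to one onto the block device. The names `SKETCH_PAGE_SIZE`, `buffers_per_page`, `page_index_of`, `offset_in_page` and `share_page` are made up for illustration and are not kernel or tux3 functions. It only shows the arithmetic: when the block size is a fraction of the page size, several blocks land on one page, so touching the page touches all of them.

```c
#include <assert.h>

/* Assumed page size for illustration; the log says pages are usually 4K. */
#define SKETCH_PAGE_SIZE 4096u

/* Number of block buffers carried by one page.
 * The block size must be a nonzero divisor of the page size. */
static unsigned buffers_per_page(unsigned block_size)
{
	assert(block_size != 0 && SKETCH_PAGE_SIZE % block_size == 0);
	return SKETCH_PAGE_SIZE / block_size;
}

/* Index of the page that carries a given block number. */
static unsigned long page_index_of(unsigned long block, unsigned block_size)
{
	return block / buffers_per_page(block_size);
}

/* Byte offset of a block's data within its page. */
static unsigned offset_in_page(unsigned long block, unsigned block_size)
{
	return (unsigned)(block % buffers_per_page(block_size)) * block_size;
}

/* Two blocks are falsely shared when they sit on the same page. */
static int share_page(unsigned long a, unsigned long b, unsigned block_size)
{
	return page_index_of(a, block_size) == page_index_of(b, block_size);
}
```

With 1K blocks, blocks 8 through 11 all sit on page 2, so forking any one of them means dealing with the page that carries all four. With 4K blocks every block has a page to itself.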
2008-10-23 20:28 not necessarily
2008-10-23 20:28 tux3 can handle 256 byte blocks
2008-10-23 20:29 I think we decided to make the smallest 512
2008-10-23 20:29 linux sector size
2008-10-23 20:29 yes :)
2008-10-23 20:29 let's keep going in
2008-10-23 20:29 ok
2008-10-23 20:29 find_or_create_page
2008-10-23 20:30 http://lxr.linux.no/linux+v2.6.27/mm/filemap.c#L720
2008-10-23 20:30 add_to_page_cache_lru
2008-10-23 20:30 add_to_page_cache
2008-10-23 20:31 add_to_page_cache_locked
2008-10-23 20:31 this looks also familiar...
2008-10-23 20:31 radix_tree_insert
2008-10-23 20:32 http://lxr.linux.no/linux+v2.6.27/lib/radix-tree.c#L291
2008-10-23 20:32 there we see the nice new rcu code that got added by peterz in the last cycle
2008-10-23 20:32 lockless pagecache?
2008-10-23 20:33 wait, it was there before
2008-10-23 20:34 ok, let's poke around in _insert for a while
2008-10-23 20:34 it's good to have an idea what happens there
2008-10-23 20:35 it's a radix tree with branching factor 64
2008-10-23 20:35 meaning we have a lot of page cache pointers sitting next to each other
2008-10-23 20:35 it's tempting to use that fact when we are operating on pages that are contiguous in the page cache
2008-10-23 20:35 to avoid lookups
2008-10-23 20:36 I don't know of any kernel code that has actually done that though
2008-10-23 20:36 also haven't looked hard
2008-10-23 20:37 note: lockless page cache is due to nick piggin
2008-10-23 20:37 and it isn't completely merged yet
2008-10-23 20:37 I presume that a goal is to get rid of even the rcu locks from the radix tree
2008-10-23 20:38 oh, i thought it was done
2008-10-23 20:38 part went in
2008-10-23 20:38 _insert is actually pretty simple
2008-10-23 20:39 add levels if we're trying to insert at a high address
2008-10-23 20:39 otherwise drill down through levels masking off the index
2008-10-23 20:39 empty parts of the tree have null pointers, fill them in if in our path
2008-10-23 20:39 that's about it.
RCU strangeness to think about
2008-10-23 20:40 otherwise we're done here
2008-10-23 20:40 ok, so what is a tux3 buffer fork going to look like, based on what we just looked at?
2008-10-23 20:41 ok, lookup cache, then copy data to new cache, and insert new pos on radix tree
2008-10-23 20:42 ?
2008-10-23 20:42 basically, and we'll need to worry about locking
2008-10-23 20:42 copy data to new cache will happen in multiple steps, right?
2008-10-23 20:42 and we need to worry about false sharing
2008-10-23 20:42 there are other blocks on the same page, what happens to them?
2008-10-23 20:43 it's a per-block operation as currently conceived
2008-10-23 20:44 don't copy other blocks, because new cache may not be contiguous
2008-10-23 20:44 hirofumi, when we branch a block we don't change its position
2008-10-23 20:44 um..
2008-10-23 20:44 we just pull the page that carries the block data out of the page cache, leaving a copy in its place
2008-10-23 20:45 we don't necessarily even need buffer heads on the page we pull out of cache
2008-10-23 20:45 because nobody is going to be changing it, hence no need for per-block locking
2008-10-23 20:45 and we can do the actual transfer to disk with a bio
2008-10-23 20:46 so we will just be pulling the underlying page out and replacing it with a new page
2008-10-23 20:46 that has the effect of forking all the buffers on the same page
2008-10-23 20:46 so we must have some bits in the buffer_head flags to tell us which phase a buffer belongs to
2008-10-23 20:47 um.. one may be ileaf, and one may be dleaf etc.?
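A rough sketch of the fork as described above: pull the page out of the cache and leave a copy in its place, so the old page can go to writeout untouched. This is a userspace toy under loud assumptions, not tux3 or kernel code. The cache is a flat array instead of a radix tree, the phase is a plain field on the page instead of bits in the buffer_head flags, and there is no locking at all. All names (`sketch_page`, `sketch_cache`, `fork_page`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_CACHE_SLOTS 16u

/* Hypothetical stand-in for a cached page. */
struct sketch_page {
	unsigned phase;                        /* phase the contents belong to */
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Hypothetical stand-in for one cache: a flat array, not a radix tree. */
struct sketch_cache {
	struct sketch_page *slot[SKETCH_CACHE_SLOTS];
};

/*
 * Make the page at `index` changeable in `current_phase`.
 * Returns 0 if the cached page already belongs to the current phase,
 * 1 if it was forked (the old page is handed back through `frozen`),
 * -1 if the slot is empty, out of range, or allocation failed.
 */
static int fork_page(struct sketch_cache *cache, unsigned index,
		     unsigned current_phase, struct sketch_page **frozen)
{
	struct sketch_page *old, *copy;

	*frozen = NULL;
	if (index >= SKETCH_CACHE_SLOTS || !cache->slot[index])
		return -1;

	old = cache->slot[index];
	if (old->phase == current_phase)
		return 0;                      /* no fork needed */

	copy = malloc(sizeof *copy);
	if (!copy)
		return -1;
	memcpy(copy->data, old->data, sizeof copy->data);
	copy->phase = current_phase;

	cache->slot[index] = copy;             /* the copy takes the old page's place */
	*frozen = old;                         /* the old page leaves the cache as is */
	return 1;
}
```

Because the swap is done per page, every block that lives on that page is forked together, which is the effect noted in the log.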
2008-10-23 20:47 that is, whether it has already been forked or not
2008-10-23 20:47 here in a file page cache we will only find file data or directory data or bitmap block
2008-10-23 20:47 later atom stuff
2008-10-23 20:47 ileaf and dleaf live in the buffer cache
2008-10-23 20:48 which is direct-mapped to the block device
2008-10-23 20:48 it is handled in much the same way
2008-10-23 20:48 ext3 does not directly perform operations on the buffer cache, I think I recall
2008-10-23 20:49 but lets the vfs do it
2008-10-23 20:49 using the generic_ functions
2008-10-23 20:49 and ext3 just supplies a ->get_block function
2008-10-23 20:49 well, let's go see how ext3_get_block works
2008-10-23 20:50 at some point it obviously has to go read some metadata
2008-10-23 20:50 most likely with sb_bread
2008-10-23 20:50 yes
2008-10-23 20:51 http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L953
2008-10-23 20:51 then http://lxr.linux.no/linux+v2.6.27/fs/ext3/inode.c#L786
2008-10-23 20:51 these functions are a little oddly structured
2008-10-23 20:51 because they are lockless
2008-10-23 20:52 ext3_block_to_path just does arithmetic
2008-10-23 20:52 because caller has lock of requested page?
2008-10-23 20:53 no, it's completely lockless
2008-10-23 20:53 it uses the block pointers like locks
2008-10-23 20:53 um.. what happens if it was truncated?
2008-10-23 20:53 it checks and backs out the operation
2008-10-23 20:53 see verify_chain
2008-10-23 20:54 ok, 367 bh = sb_bread(sb, le32_to_cpu(p->key));
2008-10-23 20:54 in ext3_get_branch
2008-10-23 20:54 so we got to a function that is familiar
2008-10-23 20:55 yes
2008-10-23 20:55 one thing to watch out for: some of the blocks on the buffer cache page may be data, that is, in an inode page cache, and some may be metadata, in the buffer cache
2008-10-23 20:55 I haven't thought about what impact that might have on the forking operation
2008-10-23 20:55 it's probably ok, but needs to be thought about
2008-10-23 20:56 ok, that's enough for today
2008-10-23 20:56 how'd we do on the interesting front?
2008-10-23 20:57 do forking?
2008-10-23 20:57 I meant, was it interesting?
2008-10-23 20:58 yes
2008-10-23 20:58 this lesson was above my knowledge level :P I have to dig more to qualify for it :P
2008-10-23 20:58 razvanm, it's not above your level
2008-10-23 20:59 just read through the log once more and it will all look simple
2008-10-23 20:59 I will :D
2008-10-23 21:00 next time I think we might take a look at how we might go about doing filesystem IO without having a tux3_get_block function
2008-10-23 21:00 I don't know myself whether it's practical to avoid this
2008-10-23 21:00 it's not particularly hard to implement
2008-10-23 21:00 but I think I want to see if we can just avoid the whole block IO library and work directly with the page cache and bio
2008-10-23 21:00 IIRC, btrfs uses get_extents or something
2008-10-23 21:00 sb_bread is our friend
2008-10-23 21:01 new invention
2008-10-23 21:01 I don't want to go that far just yet
2008-10-23 21:01 the api is likely to take some time to settle
2008-10-23 21:01 well
2008-10-23 21:01 I think it is the get_ model that is broken
2008-10-23 21:01 not the block vs extents
2008-10-23 21:02 the get_ model assumes there is a place to cache the physical pointer
2008-10-23 21:02 which historically has been the buffer_head
2008-10-23 21:02 but it turns out that the cached physical pointer is rarely used
2008-10-23 21:02 i see
2008-10-23 21:02 not enough to justify the strange mechanisms that are in place to handle it
2008-10-23 21:03 ah, becase we use forking on write. physical pointer is not used much?
2008-10-23 21:03 because
2008-10-23 21:04 I am thinking that if caching physical pointers is a win, then there should be a library function to do that that all filesystems can use, so that the whole IO path does not revolve around the need for a fs to generate a physical pointer
2008-10-23 21:04 forking happens on the "top end" of filesystem and is not part of writeout
2008-10-23 21:04 it is to avoid stalls in buffer IO calls
2008-10-23 21:05 "remapping" is where we change physical pointers around
2008-10-23 21:05 which occurs in filemap.c, during an inode flush
2008-10-23 21:05 and will also need to happen when we write out inode table blocks during phase commit
2008-10-23 21:06 and index blocks during rollup
2008-10-23 21:07 um.. natural delayed write..
2008-10-23 21:08 delayed write -> delayed allocation
2008-10-23 21:09 yes
2008-10-23 21:09 I'm thinking about adding delayed inode number assignment too
2008-10-23 21:09 oh, i see
2008-10-23 21:10 could possibly confuse nfs
2008-10-23 21:13 thanks. this talk seems to clear my brain more or less
2008-10-23 21:15 you're welcome
2008-10-23 21:30 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
2008-10-23 21:30 hmm, a little late
2008-10-23 21:30 ;-)
2008-10-23 21:31 good think we have a log
2008-10-23 21:31 good thing
2008-10-23 21:35 yup catching up now
2008-10-23 21:52 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-10-23 21:52 hi tim_dimm
2008-10-23 21:53 flips...
2008-10-23 21:53 word
2008-10-23 22:02 dword
2008-10-23 22:03 but not msword
2008-10-23 22:03 ddword
2008-10-23 22:11 qword
2008-10-23 22:11 f word
2008-10-23 22:16 wor dup
2008-10-23 22:16 sword
2008-10-23 22:16 *swish*
2008-10-23 22:16 ACTION cuts a swatch through the witty banter
2008-10-23 22:16 swishy
2008-10-23 22:17 wordy
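For anyone catching up from the log: the radix tree insert walk described around 20:38 (add levels for a high index, drill down masking off the index, fill in null pointers on the path) can be sketched roughly as below. This is a simplified userspace toy, not the kernel's radix_tree_insert. It keeps the branching factor of 64 mentioned in the log, but the structure names, the return codes, and the absence of any RCU or locking are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define MAP_SHIFT 6u
#define MAP_SIZE  (1u << MAP_SHIFT)     /* branching factor 64 */
#define MAP_MASK  (MAP_SIZE - 1u)

struct sketch_node {
	void *slot[MAP_SIZE];           /* child nodes, or items at the lowest level */
};

struct sketch_root {
	unsigned height;                /* number of levels, 0 for an empty tree */
	struct sketch_node *node;
};

/* Largest index a tree of the given height can hold. */
static unsigned long max_index(unsigned height)
{
	unsigned bits = height * MAP_SHIFT;

	if (bits >= sizeof(unsigned long) * 8)
		return ~0UL;
	return (1UL << bits) - 1;
}

/* Returns 0 on success, -1 on allocation failure, -2 if the slot is taken. */
static int sketch_insert(struct sketch_root *root, unsigned long index, void *item)
{
	struct sketch_node *node;
	unsigned shift;

	/* Add levels on top until the tree is tall enough for this index. */
	while (root->height == 0 || index > max_index(root->height)) {
		node = calloc(1, sizeof *node);
		if (!node)
			return -1;
		node->slot[0] = root->node;     /* old tree becomes the first child */
		root->node = node;
		root->height++;
	}

	/* Drill down, using one 6-bit digit of the index per level. */
	node = root->node;
	for (shift = (root->height - 1) * MAP_SHIFT; shift > 0; shift -= MAP_SHIFT) {
		void **slot = &node->slot[(index >> shift) & MAP_MASK];

		if (!*slot) {                   /* empty part of the tree on our path */
			*slot = calloc(1, sizeof(struct sketch_node));
			if (!*slot)
				return -1;
		}
		node = *slot;
	}

	if (node->slot[index & MAP_MASK])
		return -2;
	node->slot[index & MAP_MASK] = item;
	return 0;
}

/* Returns the item stored at index, or NULL if there is none. */
static void *sketch_lookup(const struct sketch_root *root, unsigned long index)
{
	const struct sketch_node *node = root->node;
	unsigned shift;

	if (root->height == 0 || index > max_index(root->height))
		return NULL;
	for (shift = (root->height - 1) * MAP_SHIFT; node && shift > 0; shift -= MAP_SHIFT)
		node = node->slot[(index >> shift) & MAP_MASK];
	return node ? node->slot[index & MAP_MASK] : NULL;
}
```

Neighbouring indexes share a lowest-level node, which is the "lots of page cache pointers sitting next to each other" property noted in the log.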