2008-09-30 00:00 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3 2008-09-30 00:00 MaZe, hello 2008-09-30 00:07 hi pranith 2008-09-30 00:09 flips, no reply from shapor :( 2008-09-30 00:09 about the struct fields description... 2008-09-30 00:09 may be u can reply ? ;) 2008-09-30 00:09 right 2008-09-30 00:11 pranith, have you read the comment that begins: /* Leaf index format 2008-09-30 00:11 ? 2008-09-30 00:11 in dleaf.c 2008-09-30 00:11 i tried :) 2008-09-30 00:12 the header contains the two level index followed by the table of extents 2008-09-30 00:13 i dint understand the limit on number of versions at the same level 2008-09-30 00:13 dinner time for me 2008-09-30 00:14 ohkies 2008-09-30 00:14 it's simple: you can't have more than 255 entries in one group, therefore can't have more than 255 entries with the same logical address 2008-09-30 00:14 well 2008-09-30 00:14 actually that is probably wrong now 2008-09-30 00:15 hmm, anything changed? 2008-09-30 00:15 you can have multiple dleaf groups with the same logical address now I think 2008-09-30 00:15 sure, lots of code changed 2008-09-30 00:15 every day 2008-09-30 00:15 later... 
2008-09-30 00:15 okies 2008-09-30 01:59 folks 2008-09-30 01:59 ACTION is back from a night of goofing off 2008-09-30 01:59 feels great 2008-09-30 01:59 I was working myself into the ground much of last week 2008-09-30 02:14 -!- ajonat(~ajonat@190.48.120.169) has joined #tux3 2008-09-30 02:14 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-30 02:14 -!- hirofumi(~hirofumi@210.171.168.39) has joined #tux3 2008-09-30 02:14 -!- SEJeff(~jeff__@66.151.59.138) has joined #tux3 2008-09-30 02:14 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-30 02:14 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-30 02:14 -!- flips(~phillips@phunq.net) has joined #tux3 2008-09-30 02:14 -!- ceatinge(~ceatinge@72.232.13.50) has joined #tux3 2008-09-30 02:14 -!- nataliep(~nataliep@207.47.98.129.static.nextweb.net) has joined #tux3 2008-09-30 02:14 -!- data(~data@echo489.server4you.de) has joined #tux3 2008-09-30 02:14 -!- Bushman(~marcin@c-76-23-106-132.hsd1.sc.comcast.net) has joined #tux3 2008-09-30 02:14 -!- shapor(~shapor@yzf.shapor.com) has joined #tux3 2008-09-30 02:14 -!- vcgomes(~vcgomes@li17-238.members.linode.com) has joined #tux3 2008-09-30 02:14 -!- ChanServ changed mode/#tux3 -> -o tux3bot 2008-09-30 02:31 -!- kbingham(~kbingham@193.132.141.186) has joined #tux3 2008-09-30 03:01 -!- pgquiles(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-30 03:26 -!- Kirantpatil(~kiran@122.166.94.37) has joined #tux3 2008-09-30 03:26 hello 2008-09-30 03:26 ah, what a great night ^______^ 2008-09-30 03:35 -!- Kirantpatil(~kiran@122.166.94.37) has left #tux3 2008-09-30 03:35 -!- Kirantpatil(~kiran@122.166.94.37) has joined #tux3 2008-09-30 03:53 orgthingy, where from/ 2008-09-30 03:53 ? 2008-09-30 03:54 why everybody keeps asking where im from xD 2008-09-30 04:10 why? 
you have a problem in that/ 2008-09-30 04:17 pranith : well, not really 2008-09-30 04:17 but i dont usually say any personal info in IRC :P 2008-09-30 04:18 ohk 2008-09-30 04:18 i dint knw that ones country is personal 2008-09-30 04:20 hi pranith 2008-09-30 04:21 Kirantpatil, helo 2008-09-30 04:21 is the session over today morning ? 2008-09-30 04:22 again i failed join .. 2008-09-30 04:22 TUX 3 university session. 2008-09-30 04:23 hmm 2008-09-30 04:23 yeah, morning is a bad time for people here 2008-09-30 04:23 i too miss it 2008-09-30 04:23 usually get it from the logs 2008-09-30 04:23 i am from bengaluru.. 2008-09-30 04:24 how about you.. 2008-09-30 04:24 delhi 2008-09-30 04:24 oh cool.. 2008-09-30 04:24 what do u do? 2008-09-30 04:24 i run Freesoftware training center.. 2008-09-30 04:24 oh 2008-09-30 04:24 which place in bangy? 2008-09-30 04:25 Driver programming, Administration.. 2008-09-30 04:25 it is in rajaji nagar. 2008-09-30 04:25 hmm 2008-09-30 04:25 i knw only koramangala 2008-09-30 04:25 and marathahalli 2008-09-30 04:25 ok.. 2008-09-30 04:25 and mg road ;) 2008-09-30 04:26 i am planning for giving filesystem training. 2008-09-30 04:26 so i am preparing for it.. 2008-09-30 04:26 oh 2008-09-30 04:26 nice 2008-09-30 04:26 i have it scheduled from Nov 1. 2008-09-30 04:27 hmm 2008-09-30 04:27 too soon i guess 2008-09-30 04:27 hope for getting some contributors ... 2008-09-30 04:27 oh 2008-09-30 04:27 which level are the students here? 2008-09-30 04:28 basically they span from college grads to experienced fellows.. 2008-09-30 04:28 hmm 2008-09-30 04:28 ok 2008-09-30 04:29 my motives are to spread linux kernel programming in easy way 2008-09-30 04:30 here is our website www.turtlelinuxlabs.in 2008-09-30 04:30 nice 2008-09-30 04:31 i am still in learning phase of filesystems.. 2008-09-30 04:32 hmm 2008-09-30 04:32 i tried to apply the patch of daniels posted in lwn.net and compile the kernel.. 2008-09-30 04:32 it was showing some error. 
2008-09-30 04:34 give me some guidelines on this.. 2008-09-30 04:37 which patch? 2008-09-30 04:37 tux3 is not yet in kernel 2008-09-30 04:38 its still in userspace in fuse 2008-09-30 04:38 you dont need to compile the kernel to test this 2008-09-30 04:38 just use the fuse version 2008-09-30 04:39 please see this http://lwn.net/Articles/299740/ 2008-09-30 04:39 i followed that link.. 2008-09-30 04:41 hmm, i think i missed that 2008-09-30 04:41 :( 2008-09-30 04:41 what shall i do then.. 2008-09-30 04:41 flips, why dont u cc to tux3?? 2008-09-30 04:42 im not sure.. 2008-09-30 04:42 ive never compiled this in a kernel before.. 2008-09-30 04:43 where can i get the fuse version.. 2008-09-30 04:48 am i doing right here.. 2008-09-30 04:51 hmm 2008-09-30 04:51 use the mercurial repo 2008-09-30 04:52 hg pull http://phunq.net/tux3 2008-09-30 04:52 install mercurial 2008-09-30 04:53 ok, i will try that 2008-09-30 04:57 then cd tux3/user/test 2008-09-30 04:57 make && make debug 2008-09-30 04:58 it will mount in /tmp/ 2008-09-30 05:04 thanks pranith. 2008-09-30 05:12 welcome :) 2008-09-30 05:13 i am getting some errors, shall i paste it here 2008-09-30 05:13 i am using ubuntu gibbon. 2008-09-30 05:15 tux3.c:14:18: error: popt.h: No such file or directory 2008-09-30 05:16 sudo apt-get install libpopt-dev 2008-09-30 05:34 Kirantpatil, worked? 2008-09-30 05:42 yes it worked. 2008-09-30 05:43 i am just execting sudo make testfuse 2008-09-30 05:50 ok.. i played with testfuse and testfs 2008-09-30 05:50 they are working fine.. 2008-09-30 05:51 next what should i do ?? 2008-09-30 05:51 work with dleaf and dleaftest 2008-09-30 05:51 make dleaf 2008-09-30 05:51 make dleaftest 2008-09-30 05:51 ./dleaf 2008-09-30 05:51 there is a bug with testfuse 2008-09-30 05:51 in readdir... 
2008-09-30 05:52 ls 2008-09-30 05:52 touch hello 2008-09-30 05:52 ls 2008-09-30 05:52 rm hello 2008-09-30 05:52 ls 2008-09-30 05:53 i am now installing valgrind 2008-09-30 05:54 no need 2008-09-30 05:54 u can run it directly 2008-09-30 05:54 ./dleaf 2008-09-30 05:54 ok.. 2008-09-30 05:57 i did run ./dleaf 2008-09-30 05:57 it is showing lot of dwalk messages.. 2008-09-30 05:58 i didnt understand where should i do "touch hello" "ls" and "rm hello" 2008-09-30 06:00 make debugfs 2008-09-30 06:00 "make debug" 2008-09-30 06:00 go to /tmp/test 2008-09-30 06:00 them do touch and rm then ls 2008-09-30 06:03 yeah 2008-09-30 06:03 root@kiran-desktop:/tmp/test# ls 2008-09-30 06:03 ???@???? 2008-09-30 06:04 this is how looks after "rm hello" 2008-09-30 06:04 mode 0100666 uid 0 gid 0 root d:1 2008-09-30 06:04 ---- get attr for '/' ---- 2008-09-30 06:04 ---- get attr for '/' ---- 2008-09-30 06:04 ---- readdir '/' at 0 ---- 2008-09-30 06:04 ---- get attr for '� 2008-09-30 06:04 @�' ---- 2008-09-30 06:04 ---- get attr for '/� 2008-09-30 06:04 @�' ---- 2008-09-30 06:04 ---- readdir '/' at 1000 ---- 2008-09-30 06:05 in debug message. 2008-09-30 06:10 then i think i need to understand the code .. 2008-09-30 06:10 am i right.. 2008-09-30 06:13 yup 2008-09-30 06:13 u need to.. 2008-09-30 06:13 something wrong in readdir 2008-09-30 06:13 i dint look further 2008-09-30 06:49 -!- Kirantpatil(~kiran@122.166.94.37) has left #tux3 2008-09-30 07:33 -!- pgquiles_(~pgquiles@62.43.226.52.static.user.ono.com) has joined #tux3 2008-09-30 07:41 flips, there? 
2008-09-30 07:47 -!- pgquiles__(~pgquiles@166.Red-88-16-39.dynamicIP.rima-tde.net) has joined #tux3 2008-09-30 09:21 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-30 09:27 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 09:38 -!- RazvanM(~RazvanM@dazzler.isi.jhu.edu) has joined #tux3 2008-09-30 09:49 -!- Kirantpatil(~kiran@122.167.222.94) has joined #tux3 2008-09-30 09:49 -!- Kirantpatil(~kiran@122.167.222.94) has left #tux3 2008-09-30 09:50 -!- Kirantpatil(~kiran@122.167.222.94) has joined #tux3 2008-09-30 09:50 -!- Kirantpatil(~kiran@122.167.222.94) has left #tux3 2008-09-30 09:52 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 10:28 -!- Kirantpatil(~kiran@122.167.222.94) has joined #tux3 2008-09-30 10:28 -!- Kirantpatil(~kiran@122.167.222.94) has left #tux3 2008-09-30 10:48 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 10:51 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-30 12:01 -!- orgthingy(~orgthingy@62.150.55.188) has joined #tux3 2008-09-30 12:43 -!- MaZe(~MaZe@216-239-45-4.google.com) has joined #tux3 2008-09-30 17:10 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-30 18:58 -!- ajonat(~ajonat@190.48.107.189) has joined #tux3 2008-09-30 19:32 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-30 19:52 19:52:21 2008-09-30 19:54 that's true 2008-09-30 19:54 ah, and I missed the ping from pranith too 2008-09-30 19:54 yesterday 2008-09-30 19:55 I guess I'd better fix the readdir bug in fuse 2008-09-30 19:55 specially as a provisional fix has been offered 2008-09-30 19:55 -!- ajonat_(~ajonat@190.48.122.185) has joined #tux3 2008-09-30 19:59 t -30 & counting 2008-09-30 20:00 t -> tux3 2008-09-30 20:00 browsers running? 
2008-09-30 20:00 mayhaps 2008-09-30 20:01 we start here: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2040 2008-09-30 20:01 or maybe we should start from where this is called in _copy2 2008-09-30 20:01 _2copy 2008-09-30 20:01 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2106 2008-09-30 20:02 -!- RazvanM(~RazvanM@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-30 20:02 razvanm: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2106 2008-09-30 20:02 -!- RalucaM(~ral@pool-151-196-118-156.balt.east.verizon.net) has joined #tux3 2008-09-30 20:02 ACTION is sorry that he is late 2008-09-30 20:02 flips: razvanm: http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2106 2008-09-30 20:02 ACTION is glad you're here 2008-09-30 20:02 page = __grab_cache_page(mapping, index); 2008-09-30 20:03 ok, we're getting a cache page that user data will be copied onto 2008-09-30 20:03 later that page will be added to a bio and thrown at a device 2008-09-30 20:03 but today we're just going to look at the page cache 2008-09-30 20:04 that is, the list of pages belonging to a particular inode that have been read in via some buffer IO operation 2008-09-30 20:04 or directly created, as here 2008-09-30 20:04 since we know we're going to write to this page, normally the entire thing, there is no need to read it first 2008-09-30 20:05 we just "grab" it, and by that, viro means look into the cache and allocated a page if one is not already there 2008-09-30 20:06 so lets got to http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2040 and see how it works 2008-09-30 20:06 quick q: this "grab" is unique to this case? 
2008-09-30 20:06 pretty much 2008-09-30 20:06 page cache ops are highly non-orthogonal 2008-09-30 20:06 there may or may not be justification for that 2008-09-30 20:07 they just kind of grew from usage, like most of linux 2008-09-30 20:07 and the comment claims it's just for buffered writes 2008-09-30 20:07 possibly true 2008-09-30 20:07 should we take a look at what mapping and index? :P 2008-09-30 20:08 struct address_space * mapping 2008-09-30 20:08 we'll be looking at those, yes 2008-09-30 20:08 aaa... this was in some time ago 2008-09-30 20:08 that's what this is all about 2008-09-30 20:08 pgoff_t index; 2008-09-30 20:08 index is for practical purposes unsigned int 2008-09-30 20:08 that means 32 bit on 32 arches 2008-09-30 20:08 interesting that pgoff_t seems to be a page offset, but it would make sense to be a file offset div page_size ? 2008-09-30 20:09 maybe just a misnomer 2008-09-30 20:09 limiting the size of any file to 2^(32 + 12) 2008-09-30 20:09 oh, as in offset into file in pages 2008-09-30 20:09 exactly 2008-09-30 20:10 bad terminology 2008-09-30 20:10 -!- Kirantpatil(~kiran@122.167.219.78) has joined #tux3 2008-09-30 20:10 does this mean files can't be larger than 16TB? 2008-09-30 20:10 (on 32-bit arch) 2008-09-30 20:10 yes 2008-09-30 20:10 that's where that comes from 2008-09-30 20:10 volumes too 2008-09-30 20:10 does any linux filesystem workaround this somehow? 2008-09-30 20:10 volumes? 2008-09-30 20:10 because each volume has a page cache dedicated to non-file pages on the volume, that is, metadata 2008-09-30 20:11 there is no workaround 2008-09-30 20:11 "speed of sound in a 32 bit vacuum" 2008-09-30 20:11 :p 2008-09-30 20:11 :D 2008-09-30 20:11 ok, what does the index index? 2008-09-30 20:11 A: a radix tree 2008-09-30 20:12 let's drill down into find_lock_page, which is used in more than one place thankfully 2008-09-30 20:12 index = pos >> PAGE_CACHE_SHIFT; 2008-09-30 20:12 so how does tux3 scale beyond this? 2008-09-30 20:12 it doesn't? 
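The 16TB figure above is just arithmetic on the index width. A minimal sketch in user-space C, assuming 4 KiB pages (the 12 in 2^(32 + 12)); `SKETCH_PAGE_SHIFT`, `page_index` and `max_cached_bytes` are hypothetical names for illustration, not kernel identifiers:

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the 2^(32 + 12) figure in the log. */
#define SKETCH_PAGE_SHIFT 12

/* Page index of a byte position, mirroring index = pos >> PAGE_CACHE_SHIFT. */
static uint64_t page_index(uint64_t pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/*
 * Number of bytes one mapping can address when the page index is
 * index_bits wide. Only meaningful while index_bits + 12 < 64.
 */
static uint64_t max_cached_bytes(unsigned index_bits)
{
	return (uint64_t)1 << (index_bits + SKETCH_PAGE_SHIFT);
}
```

With a 32-bit index, `max_cached_bytes(32)` is 2^44 bytes, that is 16 TiB, which is the per-file (and per-volume) ceiling discussed here for 32-bit arches.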
2008-09-30 20:12 razvanm, good point 2008-09-30 20:13 razvanm, you will see code like that in tux3.c 2008-09-30 20:13 it does not on 32 bit 2008-09-30 20:13 fact of life 2008-09-30 20:13 ACTION sits down in the back row 2008-09-30 20:13 hey shapor 2008-09-30 20:13 hi flips 2008-09-30 20:13 ACTION throws some chalk 2008-09-30 20:13 always wanted to do that 2008-09-30 20:13 is it illegal yet? 2008-09-30 20:14 so you simply can't mount such a large tux3 fs on a 32-bit os? 2008-09-30 20:14 simply can't 2008-09-30 20:14 we'd better produce a nice error though 2008-09-30 20:14 because somebody will try 2008-09-30 20:14 hm 2008-09-30 20:14 to tell the truth, it would not be that big a deal to fix 2008-09-30 20:14 somebody who wants to is welcome 2008-09-30 20:15 pretty easy hack for a great deal of fame 2008-09-30 20:15 ACTION listens for the thundering herd of volunteers 2008-09-30 20:15 ok, let's go to find_lock_page 2008-09-30 20:15 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L661 2008-09-30 20:16 takes a mapping and index 2008-09-30 20:16 mapping is what tux3 calls "map" 2008-09-30 20:16 I called it map because that saved me about 80,000 keystrokes over the life of the project 2008-09-30 20:17 how does a page become a pagecache page? 2008-09-30 20:17 also, a tux3 userspace map maps blocks, whereas linux page cache maps pages 2008-09-30 20:17 shapor, we're looking at that right now 2008-09-30 20:17 somewhere in here we will find an alloc_pages(order 1) 2008-09-30 20:17 order 0 I mean 2008-09-30 20:18 first thing we do is try to find it already in the radix tree, but let's skip that and find out what happens when it's not there 2008-09-30 20:18 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3 2008-09-30 20:18 nothing happens :P 2008-09-30 20:18 exactly 2008-09-30 20:18 this function does not allocate pages 2008-09-30 20:18 alloc_pages(order n) allocates 2^n pages, with linear physical addresses? 
2008-09-30 20:19 Returns zero if the page was not present.
2008-09-30 20:19 ok, let's go back up to _2copy and find out where the page is really alloced
2008-09-30 20:19 if we don't find it here
2008-09-30 20:20 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2049
2008-09-30 20:20 status = -ENOMEM;
2008-09-30 20:20 break
2008-09-30 20:20 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2107
2008-09-30 20:20 2049 page = page_cache_alloc(mapping);
2008-09-30 20:21 8-)
2008-09-30 20:21 :D
2008-09-30 20:21 better ;-)
2008-09-30 20:21 right
2008-09-30 20:21 knew it was in there ;)
2008-09-30 20:22 static inline struct page *page_cache_alloc(struct address_space *x) { return __page_cache_alloc(mapping_gfp_mask(x)); }
2008-09-30 20:22 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L500
2008-09-30 20:22 which is just a call to alloc_pages
2008-09-30 20:22 as promised
2008-09-30 20:22 return alloc_pages(gfp, 0);
2008-09-30 20:22 what is the point of page_cache_alloc?
2008-09-30 20:23 some new bs about mapping_gfp_mask
2008-09-30 20:23 calling __page_cache_alloc
2008-09-30 20:23 if (cpuset_do_page_mem_spread()) {
2008-09-30 20:23 for numa
2008-09-30 20:23 shapor, probably little point if you really dig
2008-09-30 20:23 lots of accumlated cruft in there
2008-09-30 20:23 why grabing fails if adding to the lru fails?
2008-09-30 20:23 it's basically a numa-diverse alloc_pages(0)
2008-09-30 20:24 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2057
2008-09-30 20:24 goto repeat;
2008-09-30 20:24 notice it shouldn't fail
2008-09-30 20:24 razvanm, because that got a lot more complex recently
2008-09-30 20:24 let's take a look at it
2008-09-30 20:25 getting well outside the scope of vfs
2008-09-30 20:25 I guess there must be a reason why the page must be in the lru. Is there an obvious one?
:P 2008-09-30 20:25 I think every page must be 2008-09-30 20:25 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L459 <- :P 2008-09-30 20:25 otherwise how do you know what to flush on memory low condition? 2008-09-30 20:25 it's about reverse mapping 2008-09-30 20:26 lots of complexity has been added to optimize it 2008-09-30 20:26 oh sorry 2008-09-30 20:26 I was blathering 2008-09-30 20:26 O:-) 2008-09-30 20:26 truth is, it's just a wrapper for the radix tree insert 2008-09-30 20:26 MaZe: to deal with the low memory you need some of the pages to be there :P 2008-09-30 20:27 add to page cache should never, ever fail 2008-09-30 20:27 but it could 2008-09-30 20:27 RazvanM: I think that's with extremely low memory conditions 2008-09-30 20:27 if it does fail we are in deep doodoo 2008-09-30 20:27 :D 2008-09-30 20:27 flips: yeah i see it gets called a few times in that file 2008-09-30 20:27 not jsut extremely low, bug buggy in the kernel bug sense 2008-09-30 20:27 the page could already be there, probably we're not fully locked against smp, and thus could potentially hit this on 2 cpus 2008-09-30 20:27 shapor, yes, this is the main interface to the page cache 2008-09-30 20:28 maze, we are fully locked against smp 2008-09-30 20:28 necessarily 2008-09-30 20:28 so where does the EEXIST check come from? 2008-09-30 20:28 write_lock_irq does that, and turns off interrupts for good measure 2008-09-30 20:29 if the page is already there, somebody needs to tell us 2008-09-30 20:29 MaZe: the page is already in lru, right? 2008-09-30 20:29 not lru 2008-09-30 20:29 radix tree 2008-09-30 20:29 badly named function here 2008-09-30 20:29 very bad 2008-09-30 20:29 it means "add to page cache and also to lru" 2008-09-30 20:29 not add to lru 2008-09-30 20:30 aaaa 2008-09-30 20:30 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L490 2008-09-30 20:30 right, only adding to cache can fail 2008-09-30 20:30 got a rul for the EEXIST test? 2008-09-30 20:30 url? 
2008-09-30 20:30 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2055 2008-09-30 20:31 mem_cgroup_uncharge_page <- wow, 75 cent name 2008-09-30 20:31 shapor, this should get a rise out of the hotrodder in you 2008-09-30 20:31 memory accounting for containerization 2008-09-30 20:31 hah 2008-09-30 20:32 nitro chardged pages 2008-09-30 20:32 maze, thanks 2008-09-30 20:32 for? 2008-09-30 20:32 for the comment re containers 2008-09-30 20:32 oh. 2008-09-30 20:32 explains why I haven't seen the beast before 2008-09-30 20:32 crappy name 2008-09-30 20:32 therefore fits nicely ;) 2008-09-30 20:33 cgroup is the containers stuff, both cpu and mem 2008-09-30 20:33 uncharge must be in the release page path, and charge in the alloc path 2008-09-30 20:33 yeah this unfortunately all gets pretty complex 2008-09-30 20:33 because we're supporting numa and containers 2008-09-30 20:34 why does this matter for tux3 2008-09-30 20:34 all right, the EEXIST is about what happens if somebody adds the page while we are waiting to acquire the radix tree lock 2008-09-30 20:34 ok? 2008-09-30 20:34 i thoguht we were intentially avoiding the vm 2008-09-30 20:34 page cache is vfs, not vm 2008-09-30 20:34 numa = non-uniform memory access machines (multi-socket machines) and containers (good for jails/vms/isolating users/apps, etc...) 2008-09-30 20:34 flips: right, hence my comment about not having all the locks in smp 2008-09-30 20:35 shapor, we only need to know to recognize what is mm and therefore can be ignored ;) 2008-09-30 20:35 ok 2008-09-30 20:35 maze, right then 2008-09-30 20:35 as usual ;) 2008-09-30 20:35 you get to run the next class ;) 2008-09-30 20:35 well 2008-09-30 20:35 no... 2008-09-30 20:35 got to wait and see what you hack next 2008-09-30 20:35 ACTION runs away... 2008-09-30 20:35 flips: did you feel the mini quake a while ago? 
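The shape of the grab path walked through above (look the page up, allocate one if it is missing, try to add it, and go around again on EEXIST because somebody else added it first) can be sketched in user space. This is a toy under stated assumptions, not the kernel code: a flat array stands in for the radix tree, `calloc` stands in for `alloc_pages(gfp, 0)`, there is no locking or LRU, and every `toy_` name is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_SLOTS 64        /* hypothetical: flat array instead of a radix tree */
#define TOY_PAGE_SIZE 4096  /* assumption: 4 KiB pages */

struct toy_mapping {
	void *slot[TOY_SLOTS];
};

/* Lookup only: returns NULL if the page is not present, never allocates. */
static void *toy_find(struct toy_mapping *map, unsigned long index)
{
	return index < TOY_SLOTS ? map->slot[index] : NULL;
}

/* Insert: -EEXIST if some page already sits at this index. */
static int toy_add(struct toy_mapping *map, unsigned long index, void *page)
{
	if (index >= TOY_SLOTS)
		return -ENOMEM;
	if (map->slot[index])
		return -EEXIST;
	map->slot[index] = page;
	return 0;
}

/* Find the page, or allocate and insert one, retrying if we lose a race. */
static void *toy_grab(struct toy_mapping *map, unsigned long index)
{
	for (;;) {
		void *page = toy_find(map, index);
		if (page)
			return page;
		page = calloc(1, TOY_PAGE_SIZE);
		if (!page)
			return NULL;
		int err = toy_add(map, index, page);
		if (!err)
			return page;
		free(page);
		if (err != -EEXIST)
			return NULL;
		/* -EEXIST: someone inserted first, so look it up again */
	}
}
```

In a single-threaded toy the EEXIST branch of `toy_grab` is never taken; it is kept only to show where the retry sits relative to the allocation.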
2008-09-30 20:35 shapor, no, missed it 2008-09-30 20:35 didn't feel anything up here 2008-09-30 20:36 we had a great one a month or two ago 2008-09-30 20:36 got the familly up and huddled under a door jamb 2008-09-30 20:36 anyway 2008-09-30 20:36 life in paradise 2008-09-30 20:36 right 2008-09-30 20:36 duck'n'cover 2008-09-30 20:36 ;-) 2008-09-30 20:36 where were we 2008-09-30 20:37 we've nearly done everything interesting in there 2008-09-30 20:37 yes, we did get sidetracked a little. 2008-09-30 20:37 http://lxr.linux.no/linux+v2.6.26.5/mm/filemap.c#L2055 2008-09-30 20:37 sorry ;) 2008-09-30 20:37 details of radix tree aren't that interesting 2008-09-30 20:37 could we have a little more info on mapping? 2008-09-30 20:37 the channel topic says "and friends" 2008-09-30 20:37 ok, question? 2008-09-30 20:37 who fills it in and what it is for? 2008-09-30 20:37 who fills in the mapping? 2008-09-30 20:37 2066 struct address_space *mapping = file->f_mapping; 2008-09-30 20:38 so it comes from the file, so the vfs? 2008-09-30 20:38 it's just the per-inode page cache 2008-09-30 20:38 vfs usually 2008-09-30 20:38 though filesystem can too, and some do 2008-09-30 20:38 the fs has access to the whole misshapen page cache api 2008-09-30 20:38 for better or worse 2008-09-30 20:38 you will see all the functions are EXPORT()ed 2008-09-30 20:39 not even _GPL 2008-09-30 20:39 you can write evil/fringer binary modules that use this interface 2008-09-30 20:39 fringe 2008-09-30 20:39 ok, did we do who fills it in enough? 
2008-09-30 20:40 probably 2008-09-30 20:40 still don't get it ;-) but nevermind 2008-09-30 20:40 then we didn't 2008-09-30 20:40 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L499 2008-09-30 20:40 the mapping _is_ the page cache 2008-09-30 20:40 is where the struct is defined 2008-09-30 20:41 so the page cache 2008-09-30 20:41 it's basically one to one with file inodes 2008-09-30 20:41 is actually not a page cache, but rather a page cache per inode per superblock 2008-09-30 20:41 exactly 2008-09-30 20:41 just per inode 2008-09-30 20:41 as in not _a_ but _one per_ 2008-09-30 20:41 one per inode 2008-09-30 20:41 actually 2008-09-30 20:41 one per file-backed inode 2008-09-30 20:42 okay, it's always talked of as if it was one beast 2008-09-30 20:42 to be precise 2008-09-30 20:42 yes, that's just sloppy 2008-09-30 20:42 non file-backed inode being non-file/dir stuff? (sockets, pipes, symlinks, fifos, devs?) 2008-09-30 20:42 while we're looking at struct address_space (which really should have been called struct mapping) 2008-09-30 20:42 let's look at some of the fields there 2008-09-30 20:43 device inode, socket, etc 2008-09-30 20:43 so 4k per inode? 2008-09-30 20:43 everything is an inode, and not all have caches 2008-09-30 20:43 er at least 2008-09-30 20:43 at least 2008-09-30 20:43 why 4k? 2008-09-30 20:43 inodes are big bloating things, especially when they have all their decorations attached 2008-09-30 20:44 shapor, which 4k were you referring to? 2008-09-30 20:44 at least one page gets allocated, right? 2008-09-30 20:44 size of a page 2008-09-30 20:44 for what? 2008-09-30 20:44 for the struct? 2008-09-30 20:44 shapor, probably not 2008-09-30 20:44 I'm sure we use a slab-cache of some sort? 2008-09-30 20:45 some inodes could lack any page cache, right? 
2008-09-30 20:45 ok i read what flips said wrong
2008-09-30 20:45 shapor, look at the inode, you will see there's an address space embedded right in it
2008-09-30 20:45 kind of confusing
2008-09-30 20:45 hmm
2008-09-30 20:45 where is that defined/
2008-09-30 20:46 624 struct address_space i_data;
2008-09-30 20:46 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L623
2008-09-30 20:46 http://lxr.linux.no/linux+v2.6.26.5/include/linux/fs.h#L624
2008-09-30 20:46 yeah i found it
2008-09-30 20:46 cat /proc/slabinfo - 3rd number means objsize
2008-09-30 20:46 radix_tree_node 21273 21294 560 14 2 : tunables 0 0 0 : slabdata 1521 1521 0
2008-09-30 20:46 bdev_cache 43 63 768 21 4 : tunables 0 0 0 : slabdata 3 3 0
2008-09-30 20:46 sysfs_dir_cache 11921 12189 80 51 1 : tunables 0 0 0 : slabdata 239 239 0
2008-09-30 20:46 inode_cache 4952 4970 568 14 2 : tunables 0 0 0 : slabdata 355 355 0
2008-09-30 20:46 dentry 486172 486172 208 19 1 : tunables 0 0 0 : slabdata 25588 25588 0
2008-09-30 20:47 razvanm, that's the pointer to it, which points into the address space itself, immediately after
2008-09-30 20:47 why the mapping has to be a pointer isn't clear to me
2008-09-30 20:47 probably bogus
2008-09-30 20:47 flips: that was what I about to ask :P
2008-09-30 20:47 razvanm, it's the homework assignment then, to find out by thursday
2008-09-30 20:47 in case you don't want that one?
2008-09-30 20:48 maze, sure, but what use case?
2008-09-30 20:48 share across many inodes?
2008-09-30 20:48 I suspect a highly dogdy one
2008-09-30 20:48 like cow
2008-09-30 20:48 aaa... hard links? :P
2008-09-30 20:48 nah, hard links -> same inode
2008-09-30 20:48 never assume that what we do in kernel actually makes sense ;)
2008-09-30 20:48 soft link?
:-) 2008-09-30 20:48 often things are the way they are just because they are 2008-09-30 20:48 and that is eventually proven when somebody rips it out and changes it completely 2008-09-30 20:49 616 spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ 2008-09-30 20:49 how can a lock maybe protect a field? 2008-09-30 20:49 maybe not everybody follow the rules? 2008-09-30 20:49 there's a lot of weirdness around i_size locking wise 2008-09-30 20:50 maze, likely more bogosity, and the cause of thousands of hours worth of bug chasing the last few years 2008-09-30 20:50 anyway, so our inode, including the mapping, seems to use about 570 bytes 2008-09-30 20:50 jsut for starters 2008-09-30 20:50 then you get a couple of pages linked to it... dentries... it gets bloaty 2008-09-30 20:50 inode_cache 162 264 340 11 1 : tunables 54 27 8 : slabdata 24 24 0 2008-09-30 20:51 dentry per hardlink to the inode right? 2008-09-30 20:51 per path element used to open the inode 2008-09-30 20:51 + radix_tree_nodes (and actual pages) for the parts that are in memory? 2008-09-30 20:51 right, but those higher levels are shared 2008-09-30 20:51 n directory names and one file name 2008-09-30 20:51 maybe 2008-09-30 20:52 ? 2008-09-30 20:52 not necessarily 2008-09-30 20:52 except for root 2008-09-30 20:52 you can easily have lots of very unshared paths 2008-09-30 20:52 like in a java class tree 2008-09-30 20:52 bushy 2008-09-30 20:52 in what sense not necessarily, as in there may not be any other files using the same prefix? 2008-09-30 20:52 right 2008-09-30 20:52 or we can have the same prefix and still not share? 2008-09-30 20:52 ah, ok 2008-09-30 20:52 right, of course, then 2008-09-30 20:53 but dentries are pretty small (200 bytes) 2008-09-30 20:53 just by way of showing that the average pinned cache per inode can be quite large 2008-09-30 20:53 you also typically have a struct file 2008-09-30 20:53 if its open 2008-09-30 20:53 right? 
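The "it gets bloaty" estimate can be written down as back-of-envelope arithmetic. A rough sketch, assuming the object sizes from the first /proc/slabinfo paste (568-byte inode, 208-byte dentry, 560-byte radix tree node) and 4 KiB pages; these sizes vary by kernel and config, and `pinned_bytes` is a hypothetical helper, not anything in the kernel:

```c
#include <stddef.h>

/* Object sizes taken from the slabinfo paste in the log; not universal. */
#define INODE_OBJ_BYTES      568
#define DENTRY_OBJ_BYTES     208
#define RADIX_NODE_OBJ_BYTES 560
#define PAGE_BYTES           4096  /* assumption: 4 KiB pages */

/*
 * Upper-bound estimate of memory pinned by one cached file:
 * one inode, one dentry per path element, some radix tree nodes,
 * and the cached pages themselves.
 */
static size_t pinned_bytes(size_t path_depth, size_t radix_nodes,
			   size_t cached_pages)
{
	return INODE_OBJ_BYTES
	     + path_depth * DENTRY_OBJ_BYTES
	     + radix_nodes * RADIX_NODE_OBJ_BYTES
	     + cached_pages * PAGE_BYTES;
}
```

This counts every dentry on the path against the one file, so it overestimates whenever other cached files share the same directory prefix, which is the sharing caveat raised above.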
2008-09-30 20:53 so file->dentry->inode->pages 2008-09-30 20:53 yes 2008-09-30 20:54 all the dentries up to the root and the inode, and the radix tree have to be in ram, as long as we are using the page cache from that inode, right? 2008-09-30 20:54 destroyed when closed always? 2008-09-30 20:54 when stuff is just hanging in cache you have dentry->inode->pages 2008-09-30 20:54 that's with a closed file? 2008-09-30 20:54 struct file always destroyed on close, dentry not 2008-09-30 20:54 yes 2008-09-30 20:55 right, but eventually it would get evicted, since the close would flush? 2008-09-30 20:55 maze, close doesn't flush 2008-09-30 20:55 only umount evicts like that 2008-09-30 20:55 really? 2008-09-30 20:55 well 2008-09-30 20:55 close does not evict 2008-09-30 20:55 I thought close was guaranteed to give you back any error messages 2008-09-30 20:55 or flush actually 2008-09-30 20:55 in case you were to run out of disk, etc 2008-09-30 20:56 and thus close had to wait for a flush? 2008-09-30 20:56 hmm 2008-09-30 20:56 doubt that 2008-09-30 20:56 if close was equivalent to fsync performance would tank 2008-09-30 20:57 so close may be defined that way, the filesystem does not have to implement it that way 2008-09-30 20:57 hm but if dentries get purged from you dont lose the cache right? 2008-09-30 20:57 man 2 open 2008-09-30 20:57 It is quite possible that errors on a pre- 2008-09-30 20:57 vious write(2) operation are first reported at the final close(). Not 2008-09-30 20:57 checking the return value when closing the file may lead to silent loss 2008-09-30 20:57 of data. This can especially be observed with NFS and with disk quota. 2008-09-30 20:57 A successful close does not guarantee that the data has been success- 2008-09-30 20:57 fully saved to disk, as the kernel defers writes. It is not common for 2008-09-30 20:57 a filesystem to flush the buffers when the stream is closed. If you 2008-09-30 20:57 need to be sure that the data is physically stored use fsync(2). 
(It 2008-09-30 20:57 will depend on the disk hardware at this point.) 2008-09-30 20:57 shapor, dentries stay around as long as the cache does 2008-09-30 20:57 so, close can return errors, but it still doesn't flush unless you fsync 2008-09-30 20:57 right 2008-09-30 20:58 cute 2008-09-30 20:58 It is probably unwise to close file descriptors while they may be in 2008-09-30 20:58 use by system calls in other threads in the same process. Since a file 2008-09-30 20:58 descriptor may be re-used, there are some obscure race conditions that 2008-09-30 20:58 may cause unintended side effects. 2008-09-30 20:58 two minutes 2008-09-30 20:59 my girl has decided daddy needs to play with her 2008-09-30 20:59 :-) 2008-09-30 20:59 she doesn't like the linux kernel (yet)? 2008-09-30 20:59 maze, we have fixed most of those races 2008-09-30 20:59 couple were fixed this year 2008-09-30 20:59 not yet 2008-09-30 20:59 I like the sound of confidence there... 2008-09-30 20:59 working on it 2008-09-30 20:59 ...most... 2008-09-30 20:59 file table is a nasty thing 2008-09-30 20:59 race wise 2008-09-30 21:00 but yes, the known holes are closed now 2008-09-30 21:00 remember I told you fget_light was the most perverse function in the kernel? 
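The practical upshot of the close(2) excerpt, that close can report an earlier write error but does not push data to disk, and that fsync(2) is what you call when you need that, looks like this from user space. A minimal sketch using only POSIX calls; `write_durably` is a hypothetical helper name, and EINTR handling is left out to keep it short:

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write len bytes to path and ask the kernel to push them to storage.
 * Returns 0 on success, -1 if open, write, fsync or close reports an error.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	const char *p = buf;
	while (len > 0) {
		ssize_t n = write(fd, p, len);  /* may be a short write */
		if (n < 0) {
			close(fd);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* close() alone does not flush; fsync() is the explicit request. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}

	/* Deferred write errors can still surface here, so check it. */
	return close(fd) < 0 ? -1 : 0;
}
```

As the man page text notes, even a successful fsync still depends on what the disk hardware does with the data.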
2008-09-30 21:00 interesting that closing an fd drops locks on the file even if you have duped it to another fd 2008-09-30 21:01 indeed 2008-09-30 21:01 second homework is to find out why 2008-09-30 21:01 ok, first home work was why both ptr and struct address_space (*i_mapping and i_data) in struct inode 2008-09-30 21:01 right 2008-09-30 21:02 well we did grab_cache_page pretty well, did not get to the friends 2008-09-30 21:02 gives a starting point for next time if we want 2008-09-30 21:07 can lxr do regexp search 2008-09-30 21:17 coda, raw, bdev 2008-09-30 21:17 quantum electro-dynamics 2008-09-30 21:18 ACTION goes to bed 2008-09-30 21:19 lol 2008-09-30 21:23 -!- Kirantpatil(~kiran@122.167.219.78) has left #tux3 2008-09-30 21:27 yeah, I typed /nick instead of /me in '/nick says thanks for the lesson :D' 2008-09-30 21:27 I've looked at fget_light, and it doesn't seem that scary... 2008-09-30 21:27 something's wrong with me 2008-09-30 21:27 or I'm missing the point 2008-09-30 21:28 or both ;-) 2008-09-30 21:33 linux-2.6.26.5$ egrep -rn -C 2 "[-][>]i_mapping *=" . 2008-09-30 21:47 folks 2008-09-30 21:47 hmm 2008-09-30 21:49 -!- Kirantpatil(~kiran@122.167.219.78) has joined #tux3 2008-09-30 21:50 -!- Kirantpatil(~kiran@122.167.219.78) has left #tux3 2008-09-30 21:55 maze, you haven't spotted it yet 2008-09-30 21:56 hmm? the fact it uses rcu? and doesn't always increment usage counters, nor does it always call fput? 2008-09-30 21:56 oh does it use rcu now? 2008-09-30 21:56 yeah 2008-09-30 21:57 struct files_struct *files = current->files; 2008-09-30 21:57 is protected by rcu 2008-09-30 21:57 although I'm guessing in many cases the cu is partial and not full 2008-09-30 21:57 fget itself isn't though 2008-09-30 21:58 "You can use this only if it is guranteed that the current task already 2008-09-30 21:58 fget is copied verbatim into fget_light 2008-09-30 21:58 313 * holds a refcnt to that file. 
That check has to be done at fget() only 2008-09-30 21:58 314 * and a flag is returned to be passed to the corresponding fput_light()" 2008-09-30 21:58 in other words, if the current task drops its reference... well it can't 2008-09-30 21:59 and there is no way an external observer can tell that the file is held by fget_light 2008-09-30 21:59 for starters 2008-09-30 21:59 right 2008-09-30 21:59 hence the 'doesn't always increment usage counters' 2008-09-30 21:59 but since you're already holding the refcnt, it doesn't matter 2008-09-30 21:59 hence line 312 2008-09-30 22:00 and in cases where that doesn't work (threads), it falls back to using full fget 2008-09-30 22:00 oh wait, it does use rcu now 2008-09-30 22:00 used to be much worse 2008-09-30 22:01 wait again 2008-09-30 22:01 it uses rcu on the slow path 2008-09-30 22:02 right 2008-09-30 22:02 but the slow path is actually pretty common 2008-09-30 22:02 -> threads 2008-09-30 22:02 or anything that, through clone, ended up with a shared fd table 2008-09-30 22:04 if the fd table isn't shared, then there is no need for locking, since it's local to this task 2008-09-30 22:04 and we're running in this task's context 2008-09-30 22:04 /as this task/ 2008-09-30 22:05 otherwise, we need to synchronize via rcu with other tasks which share our fd table 2008-09-30 22:06 it may become shared 2008-09-30 22:06 after the fget_light 2008-09-30 22:06 nope 2008-09-30 22:06 notice the comment 2008-09-30 22:07 cannot be used if there is a clone before fput_light 2008-09-30 22:07 315 * There must not be a cloning between an fget_light/fput_light pair. 2008-09-30 22:07 that's basically the only case where you are not allowed to use fget_light 2008-09-30 22:07 starting to see the perversity? 2008-09-30 22:07 hmm? 2008-09-30 22:07 doesn't seem perverse 2008-09-30 22:07 seems pretty clean 2008-09-30 22:08 oh...
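The fget_light/fput_light pairing described above can be modelled in userland C roughly like this. It is a toy sketch and not the kernel source: every `toy_` name, the fixed-size table, and the plain integer counters are assumptions made for illustration. The fast path skips the refcount when the fd table has a single user, and the slow path takes a reference and reports that through the flag. The rcu locking and the atomic counters used on the real slow path are left out.

```c
#include <stddef.h>

#define TOY_MAX_FDS 16

/* Toy stand-in for struct file: only a reference count. */
struct toy_file {
    int refcount;
};

/* Toy stand-in for the fd table: users > 1 means the table is shared. */
struct toy_files {
    int users;
    struct toy_file *fd[TOY_MAX_FDS];
};

/*
 * Look up fd. If the table is private to the caller, borrow the reference
 * the table already holds (fast path, *fput_needed = 0). If it is shared,
 * take a real reference (slow path, *fput_needed = 1).
 */
struct toy_file *toy_fget_light(struct toy_files *files, unsigned int fd,
                                int *fput_needed)
{
    struct toy_file *file;

    *fput_needed = 0;
    if (fd >= TOY_MAX_FDS)
        return NULL;

    file = files->fd[fd];
    if (file == NULL)
        return NULL;

    if (files->users > 1) {
        /* Shared table: another task could close the fd under us. */
        file->refcount++;
        *fput_needed = 1;
    }
    return file;
}

/* Drop the reference only if toy_fget_light actually took one. */
void toy_fput_light(struct toy_file *file, int fput_needed)
{
    if (fput_needed)
        file->refcount--;
}
```

The model also shows why the table must not become shared between the two calls: on the fast path nothing records that the file is in use, so a new sharer closing the fd would leave the borrower with a dangling pointer.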
hmm 2008-09-30 22:08 I'm just worried that it may not be worth the effort with how multithreaded everything is getting nowadays 2008-09-30 22:09 (of course, threads don't necessarily share fd tables, but in most languages they probably do) 2008-09-30 22:09 isn't tux3 the works of satan? 2008-09-30 22:09 I thought rcu was supposed to be pretty efficient... I wonder how much this gains in a single-thread case 2008-09-30 22:10 ACTION reads the backlog 2008-09-30 22:10 what I was thinking 2008-09-30 22:11 flips: you saw? linux-2.6.26.5$ egrep -rn -C 2 "[-][>]i_mapping *=" . --> blockdev, raw char dev, coda -> basically my guess was right 2008-09-30 22:11 ah, didn't notice you were already doing the challenge 2008-09-30 22:12 raw char dev maps on block dev, so remaps mapping to blockdev's mapping to share page cache, coda does hackery in case localfs is exported (AFAICT) 2008-09-30 22:12 and then re-imported via coda 2008-09-30 22:13 to share page cache between the codafs import and the original export 2008-09-30 22:13 at least, that's my guess 2008-09-30 22:13 how did you guess raw char dev, coda? 2008-09-30 22:14 linux-2.6.26.5$ egrep -rn -C 2 "[-][>]i_mapping *=" . 2008-09-30 22:14 ah, a computerized guess 2008-09-30 22:14 not really a guess 2008-09-30 22:14 so the short answer is: when the cache must be shared between inodes 2008-09-30 22:14 the guess was earlier, when I said multiple inodes with the same mapping 2008-09-30 22:14 but it isn't clear whether the sharing cases are valid 2008-09-30 22:15 the coda case at least isn't clear 2008-09-30 22:15 now what about the raw char dev? 2008-09-30 22:15 why should that have a cache at all? 2008-09-30 22:15 raw char dev is basically opening block dev with O_DIRECT 2008-09-30 22:15 and is the ancient way to do it 2008-09-30 22:15 oh 2008-09-30 22:15 raw dev 2008-09-30 22:15 so the raw char dev case maps in the mapping from the block dev 2008-09-30 22:16 seems wrong somehow 2008-09-30 22:16 in what sense?
2008-09-30 22:16 why doesn't it just return the device inode? 2008-09-30 22:16 use that when you open the raw device 2008-09-30 22:16 probably because it's a raw char not a block dev 2008-09-30 22:16 and behaviour is different? 2008-09-30 22:17 so can't get there from here maybe 2008-09-30 22:17 hmm 2008-09-30 22:17 or the raw char dev should point at the device inode 2008-09-30 22:17 not at the mapping 2008-09-30 22:18 you'll note that for some reason it's not a raw block dev, but a raw char dev, so probably alignment issues and ioctls force it to have a shim-layer 2008-09-30 22:18 right, but it is not clear it can't reference the block device inode 2008-09-30 22:18 haven't looked at that thing at all 2008-09-30 22:18 some sort of ancientness, nowadays raw char devs are close to getting dropped I think 2008-09-30 22:18 always used o_direct instead 2008-09-30 22:19 exactly 2008-09-30 22:20 coda... who knows 2008-09-30 22:20 doing stacking on a vfs that wasn't designed for it is going to be fun 2008-09-30 22:24 http://lkml.org/lkml/2003/5/2/157 2008-09-30 22:24 (maze) 2008-09-30 22:33 hmm 2008-09-30 22:33 coda... 
who knows - not really 2008-09-30 22:33 it's a network filesystem with local caching and offline operation 2008-09-30 22:33 pretty obvious it needs to tie the codafs inodes with the local backing store inodes 2008-09-30 22:35 the one you found is just an inlining of fput_light 2008-09-30 22:36 or rather of the first if in it 2008-09-30 22:37 sorry, me bad 2008-09-30 22:38 earlier on in that thread 2008-09-30 22:38 and even then those comparisons are before rcu 2008-09-30 22:40 so not clear what the gain is with rcu 2008-09-30 22:40 as opposed to normal r/w locks like before 2008-09-30 23:04 maze, what is not obvious is why it can't keep references to the backing store inodes 2008-09-30 23:05 instead of sharing the inode's cache with its own inodes 2008-09-30 23:07 then there is assoc_mapping 2008-09-30 23:12 assoc_mapping is only used by sync_mapping_buffers, which is only used by brainless filesystems like ext2 2008-09-30 23:12 is used incorrectly it would seem in reiserfs 2008-09-30 23:13 ocfs2 also uses it, perhaps because mark overlooked it 2008-09-30 23:14 -!- konrad(~konrad@D-128-208-53-208.dhcp4.washington.edu) has joined #tux3 2008-09-30 23:35 -!- Kirantpatil(~kiran@122.166.169.45) has joined #tux3 2008-09-30 23:35 -!- Kirantpatil(~kiran@122.166.169.45) has left #tux3 2008-09-30 23:39 I'm guessing it does keep reference to the backing store inodes 2008-09-30 23:39 but I'm also guessing that it wants access to the coda file to hit the same pagecache as the backing file, isn't that easiest to do by having the same pagecache, by having the i_mapping pointer point to the same location?
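The homework answer reached above is that struct inode carries both an embedded address_space (i_data) and a pointer (i_mapping) so that several inodes can share one page cache. A toy userland model in C follows. The `toy_` structures and functions are hypothetical stand-ins invented for the illustration, keeping only the fields the discussion mentions, and they are not the kernel definitions.

```c
#include <stddef.h>

struct toy_inode;

/* Toy page cache: just an owner and a count of cached pages. */
struct toy_address_space {
    struct toy_inode *host;     /* inode that owns this cache */
    unsigned long nrpages;      /* how many pages are cached */
};

struct toy_inode {
    struct toy_address_space *i_mapping; /* cache actually used for I/O */
    struct toy_address_space i_data;     /* cache embedded in this inode */
};

/* Normal case: an inode's i_mapping points at its own embedded i_data. */
void toy_inode_init(struct toy_inode *inode)
{
    inode->i_data.host = inode;
    inode->i_data.nrpages = 0;
    inode->i_mapping = &inode->i_data;
}

/*
 * Sharing case (the coda / raw char dev pattern from the egrep output):
 * redirect one inode's i_mapping at another inode's cache, so that both
 * see the same pages.
 */
void toy_share_cache(struct toy_inode *front, struct toy_inode *backing)
{
    front->i_mapping = backing->i_mapping;
}

/* Pretend to cache one page through whatever mapping the inode uses. */
void toy_cache_page(struct toy_inode *inode)
{
    inode->i_mapping->nrpages++;
}
```

After `toy_share_cache(&coda_file, &backing_file)`, pages cached through either inode land in `backing_file.i_data`, and `coda_file.i_data` stays empty. That is the "same pagecache" effect guessed at in the last message, and it is something a plain reference to the backing inode would not give without extra indirection on every access.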