2008-08-21 01:27 hey
2008-08-21 01:28 flips: cow allocation is pretty important
2008-08-21 01:28 to prevent fragmentation
2008-08-21 01:30 really
2008-08-21 01:30 so there are some concepts being considered
2008-08-21 01:30 delayed writes help I guess
2008-08-21 01:30 they should
2008-08-21 01:30 ACTION is tired today
2008-08-21 01:30 there is a concept of generating function driven goals
2008-08-21 01:31 been up since about 12pm and have been in front of a computer for most of that time
2008-08-21 01:31 12 pm... 25 hours ago?
2008-08-21 01:31 or only 13?
2008-08-21 01:32 flips: 12pm != midnight ;)
2008-08-21 01:32 12am is 25+ hours ago
2008-08-21 01:32 midday
2008-08-21 01:32 anyway, the idea is when data writes do collide with other versions of the data, either for atomic commit reasons or because a snapshot is held, then the write gets bounced away to a new goal, and if it collides there, to a further away goal
2008-08-21 01:32 the thing is, related data should get bounced to a similar place
2008-08-21 01:33 how do you choose where?
2008-08-21 01:34 another concept is to avoid completely filling any given region, which would interfere with placing the small amount of metadata in the region that is needed to do certain atomic commits
2008-08-21 01:34 first bounce to a little higher, then more higher, and even more, then try a little lower, then bounce way far away
2008-08-21 01:34 generating function decides
2008-08-21 01:35 like a quadratic hash
2008-08-21 01:35 if you can keep say 4MB globs of data together, it doesn't matter much that it is stored far from its inode
2008-08-21 01:36 the seek time ends up about 10% of the transfer time, which is ok
2008-08-21 01:36 it's when you have lots of itty bitty pieces scattered around that seeking gets dominant
2008-08-21 01:36 there is also a concept of keeping an allocation density per region
2008-08-21 01:37 say, per 128 MB region
2008-08-21 01:37 so the bounce function could take that into account
2008-08-21 01:39 rewrite of a 1 GB file is not necessarily as scary as it sounds, if the truncate is committed first and synced, then the old blocks can be freed and rewritten
2008-08-21 01:39 if snapshotted, you want to take a huge bounce far away
2008-08-21 01:39 so the bounce function needs to take the size of the file into account
2008-08-21 01:39 bigger file = bigger bounces
2008-08-21 01:39 ACTION has a massive headache
2008-08-21 01:40 ACTION recommends that people with headaches not think about impossible problems too much
2008-08-21 01:40 just get flash storage
2008-08-21 01:40 right
2008-08-21 01:40 so far, no actual coding has gone into allocation strategy
2008-08-21 01:41 i'd say relying on seekless storage would be a good first cut
2008-08-21 01:41 anyway, I now need to think of a name other than "btree" for the on disk representation of a btree root
2008-08-21 01:41 well that's happening by default
2008-08-21 01:42 but I want at least some minimal allocation policy right from the start
2008-08-21 01:42 based on inode number
2008-08-21 01:43 roughly speaking, the idea is to allocate inodes in clumps, the clumps all belonging to files created in the same directory
2008-08-21 01:43 and the clumps scattered fairly far apart
2008-08-21 01:43 why is that significant?
2008-08-21 01:43 to make a tar benchmark go fast?
2008-08-21 01:43 the file data will then be targeted to the region of that clump of inodes
2008-08-21 01:43 tar is a big deal
2008-08-21 01:44 but in general, inodes should be near their directories and data blocks should be near the inodes
2008-08-21 01:44 would be better to put data in a place where the head will be when you are most likely to need it
2008-08-21 01:44 because the pattern goes: look up dirent; open inode; read data
2008-08-21 01:44 i like the idea of spraying the drive with data if it's idle
2008-08-21 01:44 files in the same directory tend to have some relationship to each other
2008-08-21 01:45 why choose when you can write it in more than one place
2008-08-21 01:45 that is kind of what hammer does
2008-08-21 01:45 oh?
2008-08-21 01:45 what does hammer do?
2008-08-21 01:46 it sprays writes into roughly the region it thinks they should go, then the reblocking process comes along later and arranges things tidily
2008-08-21 01:46 also, this is the only way space is freed in hammer
2008-08-21 01:46 free blocks are 128 MB I think
2008-08-21 01:46 i'm talking about writing the same data more than once
2008-08-21 01:46 which are obtained by compacting via reblocking
2008-08-21 01:46 that takes more time
2008-08-21 01:47 not if the drive is idle
2008-08-21 01:47 what if it isn't?
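[editor's sketch] The bounce idea from 01:32 to 01:39 (a generating function "like a quadratic hash" that tries a little higher, then higher still, then a little lower, then far away, with bigger files taking bigger bounces) could be sketched roughly as below. The log says no allocation code had been written yet, so nothing here is tux3 code: the function name `bounce_goal`, the 4 KB block size, the 1024-block minimum step, and the exact attempt schedule are all assumptions made for illustration.

```c
#include <stdint.h>

/* Assumed for illustration: 128 MB regions made of 4 KB blocks. */
#define REGION_BLOCKS 32768u
/* Assumed minimum bounce step: 1024 blocks, i.e. a 4 MB glob. */
#define MIN_STEP 1024u

/*
 * Hypothetical bounce generating function (not from tux3).
 * goal:          block address the write originally wanted
 * attempt:       0 for the first try, then 1, 2, ... after each collision
 * file_blocks:   size of the file in blocks; bigger file = bigger bounces
 * volume_blocks: size of the volume in blocks (must be nonzero)
 */
uint64_t bounce_goal(uint64_t goal, unsigned attempt, uint64_t file_blocks,
                     uint64_t volume_blocks)
{
    uint64_t step = file_blocks > MIN_STEP ? file_blocks : MIN_STEP;
    uint64_t base = goal % volume_blocks;

    if (attempt == 0)
        return base;
    if (attempt <= 3)
        /* a little higher, then more, then even more: 1, 4, 9 steps */
        return (base + step * attempt * attempt) % volume_blocks;
    if (attempt == 4)
        /* then try a little lower, wrapping below the start of the volume */
        return (base + volume_blocks - step % volume_blocks) % volume_blocks;
    /* finally bounce way far away, in whole regions, growing quadratically */
    return (base + (uint64_t)REGION_BLOCKS * attempt * attempt) % volume_blocks;
}
```

Because the sequence depends only on the original goal, the attempt number, and the file size, related writes that start from nearby goals bounce to nearby places, which is the property asked for at 01:32. A per-region allocation density (01:36) could be consulted before accepting each candidate; that part is left out here.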
2008-08-21 01:47 takes less time to move the data later
2008-08-21 01:47 since you don't have to copy it, just erase an extra copy
2008-08-21 01:48 there might be something there
2008-08-21 01:48 you don't really want the disk spinning for minutes after a big episode of writes though
2008-08-21 01:48 if you have io's waiting, you obviously don't do it
2008-08-21 01:49 although if you have buffers laying around in ram, and the drive gets idle, why not write them some more
2008-08-21 01:49 then you can trivially break it just by having a long running write, like untarring dozens of kernel trees
2008-08-21 01:49 true, the last point
2008-08-21 01:49 the drive should never be idle when there is a dirty buffer in cache
2008-08-21 01:49 big flaw in linux there
2008-08-21 01:49 or even clean!
2008-08-21 01:49 :P
2008-08-21 01:50 well
2008-08-21 01:50 not so sure that writing out clean data is a win
2008-08-21 01:50 if you know it should be migrated, sure it might be a good time to migrate
2008-08-21 01:50 opportunistic defrag
2008-08-21 01:50 but, that will be slow
2008-08-21 01:50 because?
2008-08-21 01:51 if it's in cache it's just a write
2008-08-21 01:51 because you have to seek to do it
2008-08-21 01:51 ah in cache
2008-08-21 01:51 true
2008-08-21 01:51 yeah... writing clean data is kind of a crazy idea
2008-08-21 01:51 the thing is, most of the badly fragmented stuff won't be in cache
2008-08-21 01:51 but there still might be a slight win
2008-08-21 01:52 there is something similar planned
2008-08-21 01:52 you could do it if you are reading
2008-08-21 01:52 that is the so called log rollup
2008-08-21 01:52 say you read a heavily fragmented file
2008-08-21 01:52 right
2008-08-21 01:52 it ends up in buffers
2008-08-21 01:52 good point
2008-08-21 01:52 you should never pay that high price again
2008-08-21 01:52 paint it down in some free space
2008-08-21 01:52 and update metadata
2008-08-21 01:52 you choose a new allocation goal for the whole file, then take the opportunity to migrate it, since you had to read it anyway
2008-08-21 01:52 yes
2008-08-21 01:53 however
2008-08-21 01:53 there better be some write activity going on at the same time
2008-08-21 01:53 people don't really like when writing happens when you are just reading
2008-08-21 01:53 like atime
2008-08-21 01:53 course maybe nobody will notice
2008-08-21 01:54 no one will care
2008-08-21 01:54 fragmentation is the biggest problem with cow style filesystems
2008-08-21 01:54 almost*
2008-08-21 01:54 it should be an advantage that tux3 will not rewrite nearly as much metadata
2008-08-21 01:54 i'm reading a bit about it and i like the hammer approach
2008-08-21 01:54 I like hammer too
2008-08-21 01:55 I want to get on lkml and advocate somebody start porting
2008-08-21 01:55 nice and simple really
2008-08-21 01:55 for what it does, yes
2008-08-21 01:55 see, we got a new file checked in
2008-08-21 01:55 volume.c
2008-08-21 01:56 i'm not sold on always trying to defragment files though
2008-08-21 01:56 that will suck on a lot of workloads
2008-08-21 01:56 tomorrow I will try and actually have it reference the master inode table
2008-08-21 01:56 like a log file server
2008-08-21 01:56 the allocator has to try hard to lay down the data in a reasonable place on the first try
2008-08-21 01:57 would be nice if there was some userspace interface to opportunistic readahead
2008-08-21 01:57 say you have a log file server which is append mostly
2008-08-21 01:58 then you want to grep all the logs for something
2008-08-21 01:58 drive seeks because files are all badly fragmented
2008-08-21 01:59 there is
2008-08-21 01:59 fadvise
2008-08-21 01:59 no that doesn't help
2008-08-21 01:59 I recall you explaining this before
2008-08-21 01:59 need to explain again ;-)
2008-08-21 02:00 i want to sweep the drive once and grep all the files i read
2008-08-21 02:00 ideally ;)
2008-08-21 02:00 I think I can handle the append slowly case
2008-08-21 02:00 a heuristic is triggered when the log file grows to a certain size and is opened for append
2008-08-21 02:01 then, the file will grow in chunks
2008-08-21 02:01 hm
2008-08-21 02:01 "big log file" trigger?
2008-08-21 02:01 hm
2008-08-21 02:01 the allocation goal function will choose a location to target the next chunk where there exists a fair amount of empty space
2008-08-21 02:01 and other things will be discouraged from squatting there
2008-08-21 02:01 could just profile access patterns in general
2008-08-21 02:01 could
2008-08-21 02:01 maybe should
2008-08-21 02:02 and just store that in ram
2008-08-21 02:02 but some important ones can be determined without much analysis
2008-08-21 02:02 doesn't need to be persistent
2008-08-21 02:02 unless the drive is idle of course ;)
2008-08-21 02:02 wow, zumastor built and passed tests with the mem monitor excised ;-)
2008-08-21 02:03 exactly 2.5 hrs
2008-08-21 02:03 true, and analyzing allocation pattern provides work for lazy cpus
2008-08-21 02:04 I think there may be some allocation "zones", for example, zones where 4 MB is the minimum allocation unit
2008-08-21 02:05 could profile directories too
2008-08-21 02:05 and no more than a single file is allowed in the same 4MB zone
2008-08-21 02:05 4MB chunk I mean
2008-08-21 02:05 directory x usually gets files that don't grow beyond 16kb
2008-08-21 02:05 right
2008-08-21 02:05 while directory y usually gets files that grow to 10gb
2008-08-21 02:05 and then they will be targeted to a small granularity zone
2008-08-21 02:05 yeah
2008-08-21 02:05 and new inode table blocks may be created in that zone too
2008-08-21 02:06 that is the beauty of variable attributes
2008-08-21 02:06 eventually, the original inode table blocks of a directory that was "mispredicted" might be moved to the new, more appropriate zone
2008-08-21 02:06 can just add more on the fly even
2008-08-21 02:06 yes
2008-08-21 02:07 or disable them altogether on flash
2008-08-21 02:07 there is also a concept of inode numbers "folding" over the volume
2008-08-21 02:07 so that two inode numbers very far apart can have allocation goals into the same physical region
2008-08-21 02:07 why do inode numbers matter
2008-08-21 02:08 the inode number determines the physical allocation goal
2008-08-21 02:08 the initial goal anyway
2008-08-21 02:08 so you set the allocation goal for a given file by choosing the inode number
2008-08-21 02:09 so that is saying your primary goal is to place it close to other files in the same directory?
2008-08-21 02:09 yes, and place the data near the inode
2008-08-21 02:09 that could be totally wrong
2008-08-21 02:09 example?
2008-08-21 02:09 maildirs
2008-08-21 02:10 directories full of files, one per message in your mailbox
2008-08-21 02:11 usually just add new files one at a time
2008-08-21 02:11 why is it wrong to place the data near the inode then?
2008-08-21 02:11 read them 1 or 2 at a time, never access them again
2008-08-21 02:11 that is, the file data near the file inode
2008-08-21 02:11 hm there must be a better.. er worse case
2008-08-21 02:11 what if you search your mailbox?
2008-08-21 02:12 "don't do that" ?
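[editor's sketch] The point at 02:07 to 02:08, that the inode number determines the initial allocation goal and that inode numbers "fold" over the volume, could be illustrated with a simple modulo mapping. This is a guess at the shape of the idea, not tux3's mapping: the name `inum_to_goal` and the stride of 16 blocks per inode number are assumptions.

```c
#include <stdint.h>

/* Assumed for illustration: each inode number maps 16 blocks further along. */
#define BLOCKS_PER_INUM 16u

/*
 * Hypothetical mapping from inode number to initial allocation goal.
 * The modulo is the "folding": once inum * BLOCKS_PER_INUM runs past the
 * end of the volume it wraps around, so two inode numbers that are very
 * far apart can share a goal in the same physical region.
 */
uint64_t inum_to_goal(uint64_t inum, uint64_t volume_blocks)
{
    return (inum * BLOCKS_PER_INUM) % volume_blocks;
}
```

Under this sketch, choosing an inode number for a new file is the same act as choosing where its data will initially be aimed, so giving files in one directory a clump of adjacent inode numbers aims their data at one region.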
2008-08-21 02:12 depends on how brain dead the mail server software is
2008-08-21 02:12 most are pretty brain dead
2008-08-21 02:13 some keep a keywords index db file because search is slow
2008-08-21 02:13 grep * with 20000 files is slow
2008-08-21 02:13 although if you do need to do that, it would be nice not to see
2008-08-21 02:13 seek*
2008-08-21 02:14 now that would be a cool system call
2008-08-21 02:14 "search these files for this pattern"
2008-08-21 02:14 and please don't seek
2008-08-21 02:15 for that kind of grep you want to ls -U | grep foo
2008-08-21 02:15 err
2008-08-21 02:15 well like that
2008-08-21 02:15 |xargs
2008-08-21 02:15 right
2008-08-21 02:15 hrm never thought of that
2008-08-21 02:15 smrt
2008-08-21 02:16 htree will then provide the entries in hash order
2008-08-21 02:16 no better than lexical order
2008-08-21 02:16 hm
2008-08-21 02:16 but phtree will provide them in physical order
2008-08-21 02:16 things will sing
2008-08-21 02:17 that sounds really sucky of htree
2008-08-21 02:17 btrfs guys are busy implementing the htree idea
2008-08-21 02:17 htree is very fast at most things
2008-08-21 02:17 but it's not the best solution
2008-08-21 02:17 imho
2008-08-21 02:18 htree is really good for huge volumes when nothing is in cache
2008-08-21 02:18 with the caveat that the above load will still suck
2008-08-21 02:18 no matter what allocation strategy is used the key will be benchmarking common workloads
2008-08-21 02:18 and making sure they sing
2008-08-21 02:18 right
2008-08-21 02:18 untarring kernel trees is one of the important ones
2008-08-21 02:18 and also trying to tickle worst cases
2008-08-21 02:18 then grep the kernel tree, stuff like that
2008-08-21 02:20 would be neat to hint allocation strategy with ioctls or something
2008-08-21 02:20 similar idea to fadvise
2008-08-21 02:20 "this is a log file"
2008-08-21 02:20 or "this file will never be more than 8k"
2008-08-21 02:21 struct root { u64 block:48, levels:8, unused:8; };
2008-08-21 02:21 struct btree { struct root root; u16 entries_per_leaf; };
2008-08-21 02:21 but if fadvise is any indicator, such an interface would never get used
2008-08-21 02:21 sadly
2008-08-21 02:21 s/never/very rarely/
2008-08-21 02:22 well we can make a ddlink interface and you can go crazy with hints
2008-08-21 02:22 see what works
2008-08-21 02:22 most important thing though is to act fairly reasonable in common loads
2008-08-21 02:23 yeah because all those great ideas go to shit if you are serving the volume over nfs
2008-08-21 02:23 and every write is sync too, that hurts
2008-08-21 02:23 heh
2008-08-21 02:23 sync has to be fast
2008-08-21 02:23 I think tux3 will have a really fast sync
2008-08-21 02:23 hammer would probably kick all ass as an nfs server
2008-08-21 02:23 because of the forward log thing
2008-08-21 02:24 quite possibly
2008-08-21 02:45 flips: see the mail on the list?
2008-08-21 02:45 ACTION looks
2008-08-21 02:46 so people are reading your messages after all ;)
2008-08-21 02:46 :-)
2008-08-21 02:47 so hopefully it will be less of a blog in future
2008-08-21 02:47 or at least one that gets lots of comments ;)
2008-08-21 02:51 ok, time to respond
2008-08-21 02:51 just checked in a big splat change
2008-08-21 02:51 need to restructure the way args are passed to the btree methods somewhat
2008-08-21 02:51 so that leaf methods can use fields in the struct btree
2008-08-21 02:52 anyway... microchange
2008-08-21 02:52 but macro patches to do it
2008-08-21 02:53 86 members on tux3 now
2008-08-21 02:53 just passed zumastor a little while ago
2008-08-21 02:53 we need to get to a beanery the day it passes 100
2008-08-21 04:37 -!- konrad(~konrad@c-24-16-74-109.hsd1.wa.comcast.net) has joined #tux3
2008-08-21 06:07 flips: you going to the linux plumbers conf?
2008-08-21 06:29 pgquiles, wasn't planning on it
2008-08-21 11:54 -!- tim_dimm(~timothyhu@cpe-76-90-98-247.socal.res.rr.com) has joined #tux3
2008-08-21 14:23 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3
2008-08-21 14:29 shapor: ping
2008-08-21 14:36 wow- Tux3 finally got booted off the hottest messages list on lkml. hanging in there in the 1/2 life = 1 day list though
2008-08-21 14:58 tim_dimm: pong
2008-08-21 15:00 any response to your bind post?
2008-08-21 15:01 not on the list
2008-08-21 15:01 any privately?
2008-08-21 15:02 some guy from ISC replied, thanked me for the patch and said that it wouldn't compile on all platforms due to "compiler constructs"
2008-08-21 15:02 isn't (struct in_addr){ .s_addr = htonl(hst->ip)} ANSI C?
2008-08-21 15:05 shapor, it is C99
2008-08-21 15:05 ah i've gotten used to c99 i guess
2008-08-21 15:05 rewrite as .s_addr = htonl(hst->ip);
2008-08-21 15:06 obviously
2008-08-21 15:06 yeah
2008-08-21 15:06 boneheads over there sounds like
2008-08-21 15:06 didn't tell you the error message I bet
2008-08-21 15:06 no i had to ask what construct he was talking about
2008-08-21 15:07 they are in the business of intentionally producing buggy software
2008-08-21 15:07 I am even more on the leading edge of insanity, write in gnu-c99
2008-08-21 15:07 fancy stuff
2008-08-21 15:07 the only practical difference I have noticed is, g99 has typeof
2008-08-21 15:07 it's beyond me how anybody can get by without it
2008-08-21 17:48 -!- MaZe(~MaZe@216-239-45-4.google.com) has left #tux3
2008-08-21 18:01 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has joined #tux3
2008-08-21 20:09 -!- caoliver(~oliver@75-134-208-20.dhcp.trcy.mi.charter.com) has left #tux3
2008-08-21 23:30 -!- MaZe(~MaZe@c-24-6-86-168.hsd1.ca.comcast.net) has joined #tux3
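
[editor's sketch] The 15:02 to 15:05 exchange turns on the fact that `(struct in_addr){ .s_addr = ... }` is a compound literal with a designated initializer, both of which arrived in C99, so a C89-only compiler rejects it. The two forms can be put side by side as below. Only the expression and the field name `ip` come from the log; the `struct host` type and the function names are hypothetical stand-ins, and the actual patch is not shown in the log.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's host record; only `ip` is from the log. */
struct host {
    uint32_t ip; /* IPv4 address in host byte order */
};

/* C99 form: compound literal with a designated initializer. */
struct in_addr addr_c99(const struct host *hst)
{
    return (struct in_addr){ .s_addr = htonl(hst->ip) };
}

/* C89-compatible form: declare the struct, then assign the member. */
struct in_addr addr_c89(const struct host *hst)
{
    struct in_addr addr;

    addr.s_addr = htonl(hst->ip);
    return addr;
}
```

Both functions produce the same value; the second simply spells out the assignment that the 15:05 message suggests.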