X-Git-Url: https://git.sur5r.net/?a=blobdiff_plain;f=bacula%2Fkernstodo;h=cfa76f1b9acf9ef38a113cc09c7ea50ab11b214e;hb=fc92e04201e428fbf206dbd01518a02490ba50f9;hp=6387dc40acfb3d619d7a8a7b1a292258c71acf26;hpb=a22606258480f37a2ba6ce1e20a740cd8b2fc698;p=bacula%2Fbacula

diff --git a/bacula/kernstodo b/bacula/kernstodo
index 6387dc40ac..cfa76f1b9a 100644
--- a/bacula/kernstodo
+++ b/bacula/kernstodo
@@ -1,33 +1,178 @@
 Kern's ToDo List
-                16 June 2005
+                22 February 2006

 Major development:
 Project                  Developer
 =======                  =========
-TLS                      Landon Fuller
-Unicode in Win32         Thorsten Engel (done)
-VSS                      Thorsten Engel (in beta testing)
-Version 1.37             Kern (see below)
-========================================================
-
-1.37 Major Projects:
-#3   Migration (Move, Copy, Archive Jobs)
-     (probably not this version)
-#7   Single Job Writing to Multiple Storage Devices
-     (probably not this version)
-
-## Create a new GUI chapter explaining all the GUI programs.
-
-Autochangers:
-- Make "update slots" when pointing to Autochanger, remove
-  all Volumes from other drives.  "update slots all-drives"?
-
-For 1.37:
-- Finish TLS implementation.
-- Fix PostgreSQL GROUP BY problems in restore.
-- Fix PostgreSQL sql problems in bugs.
+
+Document:
+- Document cleaning up the spool files:
+  db, pid, state, bsr, mail, conmsg, spool
+- Document the multiple-drive-changer.txt script.
+- Pruning with Admin job.
+- Does WildFile match against the full name?  Document it.
+- %d and %v are only valid on the Director, not for ClientRunBefore/After.
+
+Priority:
+
+For 1.39:
+- Fix re-read of the last block to check whether the job has actually
+  written a block, and check whether the block was written by a
+  different job (i.e. multiple simultaneous jobs writing).
+- JobStatus and Termination codes.
+- Some users report that they must run two prune commands to get a
+  Volume marked as purged.
+- Print a warning message if the LANG environment variable does not
+  specify UTF-8.
+=== Migration from David ===
+What I'd like to see:
+
+Job {
+  Name = "<poolname>-migrate"
+  Type = Migrate
+  Messages = Standard
+  Pool = Default
+  Migration Selection Type = LowestUtil | OldestVol | PoolOccupancy |
+        Client | PoolResidence | Volume | JobName | SQLquery
+  Migration Selection Pattern = "regexp"
+  Next Pool = <override>
+}
+
+There should be no need for a Level (migration is always Full, since you
+don't calculate differential/incremental differences for migration),
+Storage should be determined by the volume types in the pool, and Client
+is really a selection issue.  Migration should always occur to the
+NextPool defined in the pool definition.  If no NextPool is defined, the
+job should end with a reason of "no place to go".  If a Next Pool
+statement is present, we override the check in the pool definition and
+use the pool specified.
+
+Here's how I'd define Migration Selection Types:
+
+With Regexes:
+Client -- Migrate data from the selected clients only.  The Migration
+Selection Pattern regexp provides the pattern used to select client
+names, e.g. ^FS00* makes all client names starting with FS00 eligible
+for migration.
+
+Jobname -- Migrate all jobs matching the name.  The Migration Selection
+Pattern regexp provides the pattern used to select job names existing
+in the pool.
+
+Volume -- Migrate all data on the specified volumes.  The Migration
+Selection Pattern regexp provides the selection criteria for the
+volumes to be migrated.  Volumes must exist in the pool to be eligible
+for migration.
+
+With Regex optional:
+LowestUtil -- Identify the volume in the pool with the least data on it
+and empty it.  No Migration Selection Pattern required.
+
+OldestVol -- Identify the LRU volume with data written, and empty it.
+No Migration Selection Pattern required.
+
+PoolOccupancy -- If pool occupancy exceeds <highmig>, migrate volumes
+(starting with the fullest volumes) until pool occupancy drops below
+<lowmig>.  The pool highmig and lowmig values are in the pool
+definition; no Migration Selection Pattern required.
+
+No regex:
+SQLQuery -- Migrate all JobIds returned by the supplied SQL query.  The
+Migration Selection Pattern contains the SQL query to execute; it
+should return a list of one or more JobIds to migrate.
+
+PoolResidence -- Migrate data that has been sitting in the pool for
+longer than the PoolResidence value in the pool definition.  The
+Migration Selection Pattern is optional; if specified, it overrides the
+value in the pool definition (value in minutes).
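+
+For the regex-based selection types above, the DIR-side check could be a
+plain POSIX regex(3) match over the candidate names.  A minimal sketch,
+assuming a hypothetical helper (select_names_by_regex is not existing
+Bacula code):
+
+   #include <sys/types.h>
+   #include <regex.h>
+   #include <stdio.h>
+
+   /* Hypothetical: report and count the names that match the
+    * Migration Selection Pattern, e.g. "^FS00" for clients FS00*. */
+   static int select_names_by_regex(const char *pattern,
+                                    const char **names, int num)
+   {
+      regex_t re;
+      int i, matched = 0;
+
+      if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
+         return -1;             /* bad Migration Selection Pattern */
+      }
+      for (i = 0; i < num; i++) {
+         if (regexec(&re, names[i], 0, NULL, 0) == 0) {
+            printf("%s is eligible for migration\n", names[i]);
+            matched++;
+         }
+      }
+      regfree(&re);
+      return matched;
+   }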
+
+[ possibly a Python event -- kes ]
+===
+- run_cmd() returns int; it should return JobId_t.
+- get_next_jobid_from_list() returns int; it should return JobId_t.
+- Document export LDFLAGS=-L/usr/lib64
+- Don't attempt to restore from "Disabled" Volumes.
+- A network error on Win32 should set the Win32 error code.
+- What happens when you rename a Disk Volume?
+- Job retention period in a Pool (and hence Volume).  The job would
+  then be migrated.
+- Detect resource deadlock in Migrate when the same job wants to read
+  and write the same device.
+- Make the hardlink code at line 240 of find_one.c use a binary search.
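+  A minimal sketch of the idea; the entry struct and names below are
+  illustrative, not the actual find_one.c types.  Keep the known links
+  sorted by (st_dev, st_ino) and look them up with bsearch():
+
+   #include <stdlib.h>
+   #include <sys/types.h>
+   #include <sys/stat.h>
+
+   struct hl_entry {
+      dev_t dev;
+      ino_t ino;
+      char *name;               /* first file seen with this inode */
+   };
+
+   static int hl_compare(const void *a, const void *b)
+   {
+      const struct hl_entry *x = (const struct hl_entry *)a;
+      const struct hl_entry *y = (const struct hl_entry *)b;
+      if (x->dev != y->dev) return x->dev < y->dev ? -1 : 1;
+      if (x->ino != y->ino) return x->ino < y->ino ? -1 : 1;
+      return 0;
+   }
+
+   /* Table kept sorted on insert; each lookup is then O(log n)
+    * instead of the current linear scan of the link list. */
+   static struct hl_entry *hl_lookup(struct hl_entry *tab, size_t n,
+                                     const struct stat *st)
+   {
+      struct hl_entry key = { st->st_dev, st->st_ino, NULL };
+      return (struct hl_entry *)bsearch(&key, tab, n, sizeof(key),
+                                        hl_compare);
+   }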
+- Queue warning/error messages during restore so that they
+  are reported at the end of the report rather than being
+  hidden in the file listing ...
+- Look at -D_FORTIFY_SOURCE=2
+- Add a Win32 FileSet definition somewhere.
+- Look at fixing the restore status stats in the SD.
+- Make the selection of the database used in restore correspond to
+  the client.
+- Implement a mode in which, when a hard read error is encountered,
+  the device is read many times (as it currently does); if the block
+  still cannot be read, skip to the next block and try again; if that
+  fails, skip to the next file and try again, ...
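+  A sketch of the escalation, assuming hypothetical SD primitives
+  (read_block(), next_block() and next_file() below stand in for the
+  real device operations):
+
+   #define MAX_READ_RETRIES 10    /* illustrative value */
+
+   typedef struct device DEVICE;  /* opaque stand-ins */
+   typedef struct block  BLOCK;
+
+   extern int read_block(DEVICE *dev, BLOCK *blk); /* 1 on success */
+   extern int next_block(DEVICE *dev);  /* reposition past bad block */
+   extern int next_file(DEVICE *dev);   /* space to the next file mark */
+
+   /* Retry the block, then sacrifice a block, then a tape file. */
+   static int read_with_recovery(DEVICE *dev, BLOCK *blk)
+   {
+      int tries;
+      for (tries = 0; tries < MAX_READ_RETRIES; tries++) {
+         if (read_block(dev, blk)) {
+            return 1;
+         }
+      }
+      if (next_block(dev) && read_block(dev, blk)) {
+         return 1;                /* lost one block, keep going */
+      }
+      if (next_file(dev) && read_block(dev, blk)) {
+         return 1;                /* lost the rest of this tape file */
+      }
+      return 0;                   /* unrecoverable */
+   }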
+- Add a level table:
+  create table LevelType (LevelType binary(1), LevelTypeLong tinyblob);
+  insert into LevelType (LevelType,LevelTypeLong) values
+  ("F","Full"),
+  ("D","Diff"),
+  ("I","Inc");
+- Add an ACL to restore only to the original location.
+- Add a recursive mark command (rmark) to restore.
+- "Minimum Job Interval = nnn" sets the minimum interval between Jobs
+  of the same level and does not permit multiple simultaneous
+  running of that Job (i.e. lets any previous invocation finish
+  before doing Interval testing).
+- Look at simplifying File exclusions.
+- New directive "Delete purged Volumes".
+- Create a new pool XXX with ScratchPoolId = MyScratchPool's PoolId and
+  let it fill itself, and RecyclePoolId = XXX's PoolId so I can
+  see whether it becomes stable and I just have to supervise
+  MyScratchPool.
+- If I want to remove this pool, I set RecyclePoolId = MyScratchPool's
+  PoolId, and when it is empty I remove it.
+- Figure out how to recycle Scratch volumes back to the Scratch Pool.
+- Add Volume=SCRTCH
+- Allow Check Labels to be used with Bacula labels.
+- "Resuming" a failed backup (lost line for example) by using the
+  failed backup as a sort of "base" job.
+- Look at NDMP.
+- Email the user when the tape is about to need changing, x
+  days before it needs changing.
+- Command to show the next tape that will be used for a job, even
+  if the job is not scheduled.
+- From: Arunav Mandal
+  1. When jobs are running and Bacula crashes for some reason, or I do
+  a restart, it should remember the jobs it was running before it
+  crashed or was restarted; as of now I lose all jobs if I restart it.
+
+  2. When spooling, if the client is disconnected midway (a laptop,
+  for instance), Bacula completely discards the spool.  It would be
+  nice if it could write that spool to tape, so there would be some
+  backups for that client, if not all.
+
+  3. We have around 150 client machines; it would be nice to have an
+  option to upgrade the Bacula version on all the client machines
+  automatically.
+
+  4. At least one connection should be reserved for bconsole, so that
+  under heavy load I can still connect to the Director via bconsole,
+  which at times I currently can't.
+
+  5. Another important missing feature: say at 10am I manually started
+  a backup of client abc, and it was a full backup since client abc
+  had no backup history, and at 10.30am Bacula automatically started a
+  backup of client abc again because that was in the schedule.  Now we
+  have two Full backups of the same client, and if we try to start yet
+  another full backup of client abc, Bacula won't complain.  That
+  should be fixed.
+
+- Fix bpipe.c so that it does not modify the results pointer.
+  ***FIXME*** the calling sequence should be changed.
+- For Windows disaster recovery see http://unattended.sf.net/
+- Regardless of the retention period, Bacula will not prune the
+  last Full, Diff, or Inc File data until a month after the
+  retention period for the last Full backup that was done.
+- update volume=xxx --- add status=Full
+- Remove old spool files on startup.
+- Exclude the SD spool/working directory.
 - Refuse to prune last valid Full backup. Same goes for Catalog.
-- --without-openssl breaks at least on Solaris.
 - Python:
   - Make a callback when Rerun failed levels is called.
   - Give Python program access to Scheduled jobs.
@@ -46,35 +191,7 @@ For 1.37:
   resources were locked.
 - The last part is left in the spool dir.
 
-Document:
-- Port limiting -m in iptables to prevent DoS attacks
-  could cause broken pipes on Bacula.
-- Document that Bootstrap files can be written with cataloging
-  turned off.
-- Pruning with Admin job.
-- Add better documentation on how restores can be done
-- OS linux 2.4
-  1) ADIC, DLT, FastStor 4000, 7*20GB
-  2) Sun, DDS, (Suns name unknown - Archive Python DDS drive), 1.2GB
-  3) Wangtek, QIC, 6525ES, 525MB (fixed block size 1k, block size etc.
-     driver dependent - aic7xxx works, ncr53c8xx with problems)
-  4) HP, DDS-2, C1553A, 6*4GB
-- Doc the following
-  to activate, check or disable the hardware compression feature on my
-  exb-8900 i use the exabyte "MammothTool" you can get it here:
-  http://www.exabyte.com/support/online/downloads/index.cfm
-  There is a solaris version of this tool. With option -C 0 or 1 you can
-  disable or activate compression. Start this tool without any options for
-  a small reference.
-- Linux Sony LIB-D81, AIT-3 library works.
-- Document PostgreSQL performance problems bug 131.
-- Document testing
-- Document that ChangerDevice is used for Alert command.
-- Document new CDROM directory.
-- Document Heartbeat Interval in the dealing with firewalls section.
-- Document the multiple-drive-changer.txt script.
-Maybe in 1.37:
 - In restore don't compare byte count on a raw device -- directory
   entry does not contain bytes.
 - To mark files as deleted, run essentially a Verify to disk, and
@@ -129,6 +246,105 @@ Maybe in 1.37:
 - Bug: if a job is manually scheduled to run later, it does not appear
   in any status report and cannot be cancelled.
 
+==== Keeping track of deleted files ====
+  My "trick" for keeping track of deletions is the following.
+  Assuming the user turns on this option, after all the files
+  have been backed up, but before the job has terminated, the
+  FD will make a pass through all the files and send their
+  names to the DIR (*exactly* the same as what a Verify job
+  currently does).  This will probably be done at the same
+  time the files are being sent to the SD, avoiding a second
+  pass.  The DIR will then compare that list to what is stored
+  in the catalog.  Any files in the catalog but not in what the
+  FD sent will receive a catalog File entry that indicates
+  that at that point in time the file was deleted.
+
+  During a restore, any file initially picked up by some
+  backup (Full, ...) and subsequently having a File entry
+  marked "delete" will be removed from the tree, so it will not
+  be restored.  If a file with the same name is later OK, it
+  will be inserted in the tree -- this already happens.  All
+  will be consistent except for possible changes during the
+  running of the FD.
+
+  Since I'm on the subject, some of you may be wondering what
+  the utility of the in-memory tree is if you are going to
+  restore everything (at least it comes up from time to time
+  on the list).  Well, it is still *very* useful because it
+  allows only the last item found for a particular filename
+  (full path) to be entered into the tree; thus if a file
+  is backed up 10 times, only the last copy will be restored.
+  I recently (last Friday) restored a complete directory, and
+  the Full and all the Differential and Incremental backups
+  spanned 3 Volumes.  The first Volume was not even mounted
+  because all the files had been updated and hence backed up
+  since the Full backup was made.  In this case, the tree
+  saved me a *lot* of time.
+
+  Make sure this information is stored on the tape too so
+  that it can be restored directly from the tape.
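+
+  The catalog comparison above is a set difference over two name
+  lists.  A minimal sketch, assuming both lists arrive sorted by full
+  path (the helper below is illustrative, not the real DIR code, which
+  would work against the catalog):
+
+   #include <stdio.h>
+   #include <string.h>
+
+   /* Print every name in the catalog that the FD did not report,
+    * i.e. the files to be marked "deleted", in one merge pass. */
+   static void find_deleted(const char **catalog, int nc,
+                            const char **fd_list, int nf)
+   {
+      int i = 0, j = 0;
+      while (i < nc) {
+         int cmp = (j < nf) ? strcmp(catalog[i], fd_list[j]) : -1;
+         if (cmp == 0) {
+            i++; j++;              /* still present on the client */
+         } else if (cmp < 0) {
+            printf("deleted: %s\n", catalog[i]); /* add "delete" entry */
+            i++;
+         } else {
+            j++;                   /* new file; the backup handles it */
+         }
+      }
+   }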
+
+  Comments from Martin Simmons (I think they are all covered):
+  Ok, that should cover the basics.  There are a few issues though:
+
+  - Restore will depend on the catalog.  I think it is better to include
+    the extra data in the backup as well, so it can be seen by bscan and
+    bextract.
+
+  - I'm not sure if it will preserve multiple hard links to the same
+    inode.  Or maybe adding or removing links will cause the data to be
+    dumped again?
+
+  - I'm not sure if it will handle renamed directories.  Possibly it
+    will work by dumping the whole tree under a renamed directory?
+
+  - It remains to be seen how the backup performance of the DIR will be
+    affected when comparing the catalog for a large filesystem.
+
+====
+From David:
+How about introducing a Type = MgmtPolicy job type?  That job type would
+be responsible for scanning the Bacula environment looking for specific
+conditions, and submitting the appropriate jobs for implementing said
+policy, e.g.:
+
+Job {
+   Name = "Migration-Policy"
+   Type = MgmtPolicy
+   Policy Selection Job Type = Migrate
+   Scope = "<keyword> <operator> <regexp>"
+   Threshold = "<keyword> <operator> <regexp>"
+   Job Template = <template name>
+}
+
+Where <keyword> is any legal job keyword, <operator> is a comparison
+operator (=, <, >, !=, logical operators AND/OR/NOT) and <regexp> is an
+appropriate regexp.  I could see an argument for Scope and Threshold
+being SQL queries if we want to support full flexibility.  The
+Migration-Policy job would then get scheduled as frequently as a site
+felt necessary (suggested default: every 15 minutes).
+
+Example:
+
+Job {
+   Name = "Migration-Policy"
+   Type = MgmtPolicy
+   Policy Selection Job Type = Migration
+   Scope = "Pool=*"
+   Threshold = "Migration Selection Type = LowestUtil"
+   Job Template = "MigrationTemplate"
+}
+
+would select all pools for examination and generate a job based on
+MigrationTemplate to automatically select the volume with the lowest
+usage and migrate its contents to the NextPool defined for that pool.
+
+This policy abstraction would be really handy for adjusting the behavior
+of Bacula according to site-selectable criteria (one thing that pops
+into mind is Amanda's ability to automatically adjust backup levels
+depending on various criteria).
+
+=====
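+
+What the scan itself might look like, purely as a pseudo-implementation
+of the proposal (every name here -- POLICY, num_pools(),
+pool_matches_scope(), threshold_exceeded(), start_job_from_template()
+-- is hypothetical, not existing Bacula code):
+
+   typedef struct policy {
+      const char *scope;        /* e.g. "Pool=*" */
+      const char *threshold;    /* e.g. "Migration Selection Type = ..." */
+      const char *job_template; /* e.g. "MigrationTemplate" */
+   } POLICY;
+
+   extern int num_pools(void);
+   extern const char *pool_name(int i);
+   extern int pool_matches_scope(const char *pool, const char *scope);
+   extern int threshold_exceeded(const char *pool, const char *thresh);
+   extern void start_job_from_template(const char *tmpl, const char *pool);
+
+   /* Run on the MgmtPolicy job's schedule (e.g. every 15 minutes). */
+   static void run_mgmt_policy(const POLICY *p)
+   {
+      int i;
+      for (i = 0; i < num_pools(); i++) {
+         const char *pool = pool_name(i);
+         if (pool_matches_scope(pool, p->scope) &&
+             threshold_exceeded(pool, p->threshold)) {
+            start_job_from_template(p->job_template, pool);
+         }
+      }
+   }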
+
 Regression tests:
 - Add Pool/Storage override regression test.
 - Add delete JobId to regression.
@@ -1146,155 +1362,45 @@ Block Position: 0
 
 === Done
-- Save mount point for directories not traversed with onefs=yes.
-- Add seconds to start and end times in the Job report output.
-- if 2 concurrent backups are attempted on the same tape
-  drive (autoloader) into different tape pools, one of them will exit
-  fatally instead of halting until the drive is idle
-- Update StartTime if job held in Job Queue.
-- Look at www.nu2.nu/pebuilder as a helper for full windows
-  bare metal restore. (done by Scott)
-- Fix orphaned buffers:
-  Orphaned buffer: 24 bytes allocated at line 808 of rufus-dir job.c
-  Orphaned buffer: 40 bytes allocated at line 45 of rufus-dir alist.c
-- Implement Preben's suggestion to add
-  File System Types = ext2, ext3
-  to FileSets, thus simplifying backup of *all* local partitions.
-- Try to open a device on each Job if it was not opened
-  when the SD started.
-- Add dump of VolSessionId/Time and FileIndex with bls.
-- If Bacula does not find the right tape in the Autochanger,
-  then mark the tape in error and move on rather than asking
-  for operator intervention.
-- Cancel command should include JobId in list of Jobs.
-- Add performance testing hooks
-- Bootstrap from JobMedia records.
-- Implement WildFile and WildDir to solve problem of
-  saving only *.doc files.
-- Fix
-  Please use the "label" command to create a new Volume for:
-      Storage:      DDS-4-changer
-      Media type:
-      Pool:         Default
-  label
-  The defined Storage resources are:
-- Copy Changer Device and Changer Command from Autochanger
-  to Device resource in SD if none given in Device resource.
-- 1. Automatic use of more than one drive in an autochanger (done)
-- 2. Automatic selection of the correct drive for each Job (i.e.
-  selects a drive with an appropriate Volume for the Job) (done)
-- 6. Allow multiple simultaneous Jobs referencing the same pool write
-  to several tapes (some new directive(s) are probably needed for
-  this) (done)
-- Locking (done)
-- Key on Storage rather than Pool (done)
-- Allow multiple drives to use same Pool (change jobq.c DIR) (done).
-- Synchronize multiple drives so that no more
-  than one loads a tape at any time (done)
-- 4. Use Changer Device and Changer Command specified in the
-  Autochanger resource, if none is found in the Device resource.
-  You can continue to specify them in the Device resource if you want
-  or need them to be different for each device.
-- 5. Implement a new Device directive (perhaps "Autoselect = yes/no")
-  that can allow a Device to be part of an Autochanger, and hence the
-  changer script protected, but if set to no, will prevent the Device
-  from being automatically selected from the changer.  This allows the
-  device to be directly accessed through its Device name, but not
-  through the AutoChanger name.
-#6 Select one from among Multiple Storage Devices for Job
-#5 Events that call a Python program
-   (Implemented in Dir/SD)
-- Make sure the Device name is in the Query packet returned.
-- Don't start a second file job if one is already running.
-- Implement EOF/EOV labels for ANSI labels
-- Implement IBM labels.
-- When Python creates a new label, the tape is immediately
-  recycled and no label created.  This happens when using
-  autolabeling -- even when Python doesn't generate the name.
-- Scratch Pool where the volumes can be re-assigned to any Pool.
-- 28-Mar 23:19 rufus-sd: acquire.c:379 Device "DDS-4" (/dev/nst0)
-  is busy reading. Job 6 canceled.
-- Remove separate thread for opening devices in SD.  On the other
-  hand, don't block waiting for open() for devices.
-- Fix code to either handle updating NumVol or to calculate it in
-  Dir next_vol.c
-- Ensure that you cannot exclude a directory or a file explicitly
-  Included with File.
-#4 Embedded Python Scripting
-   (Implemented in Dir/SD/FD)
-- Add Python writable variable for changing the Priority,
-  Client, Storage, JobStatus (error), ...
-- SD Python
-  - Solicit Events
-- Add disk seeking on restore; turn off seek on tapes.
-  stored/match_bsr.c
-- Look at dird_conf.c:1000: warning: `int size'
-  might be used uninitialized in this function
-- Indicate when a Job is purged/pruned during restore.
-- Implement some way to turn off automatic pruning in Jobs.
-- Implement a way an Admin Job can prune, possibly multiple
-  clients -- Python script?
-- Look at Preben's acl.c error handling code.
-- SD crashes after a tape restore then doing a backup.
-- If drive is opened read/write, close it and re-open
-  read-only if doing a restore, and vice-versa.
-- Windows restore:
-  data-fd: RestoreFiles.2004-12-07_15.56.42 Error:
-  > ..\findlib\../../findlib/create_file.c:275 Could not open e:/: ERR=Der
-  > Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen
-  > Prozess verwendet wird.
-  Restore restores all files, but then fails at the end trying
-  to set the attributes of e:
-  from failed jobs.
-- Resolve the problem between Device name and Archive name,
-  and fix SD messages.
-- Tell the "restore" user when browsing is no longer possible.
-- Add a restore directory-x
-- Write non-optimized bsrs from the JobMedia and Media records,
-  even after Files are pruned.
-- Delete Stripe and Copy from VolParams to save space.
-- Fix option 2 of restore -- list where file is backed up -- require
-  Client, then list last 20 backups.
-- Finish implementation of passing all Storage and Device needs to
-  the SD.
-- Move test for max wait time exceeded in job.c up -- Peter's idea.
-## Consider moving docs to their own project.
-## Move rescue to its own project.
-- Add client version to the Client name line that prints in
-  the Job report.
-- Fix the Rescue CDROM.
-- By the way: on page http://www.bacula.org/?page=tapedrives , at the
-  bottom, the link to "Tape Testing Chapter" is broken.  It goes to
-  /html-manual/... while the others point to /rel-manual/...
-- Device resource needs the "name" of the SD.
-- Specify a single directory to restore.
-- Implement MediaType keyword in bsr?
-- Add a date and time stamp at the beginning of every line in the
-  Job report (Volker Sauer).
-- Add level to estimate command.
-- Add "limit=n" for "list jobs"
-- Make bootstrap filename unique.
-- Make Dmsg look at global before calling subroutine.
-- From Chris Hull:
-  it seems to be complaining about 12:00pm which should be a valid 12
-  hour time.  I changed the time to 11:59am and everything works fine.
-  Also 12:00am works fine.  0:00pm also works (which I don't think
-  should).  None of the values 12:00pm - 12:59pm work for that matter.
-- Require restore via the restore command or make a restore Job
-  get the bootstrap file.
-- Implement Maximum Job Spool Size
-- Fix 3993 error in SD.  It forgets to look at autochanger
-  resource for device command, ...
-- 3. Prevent two drives requesting the same Volume in any given
-  autochanger, by checking if a Volume is mounted on another drive
-  in an Autochanger.
-- Upgrade to MySQL 4.1.12 See:
-  http://dev.mysql.com/doc/mysql/en/Server_SQL_mode.html
-- Add # Job Level date to bsr file
-- Implement "PreferMountedVolumes = yes|no" in Job resource.
-## Integrate web-bacula into a new Bacula project with
-   bimagemgr.
-- Cleaning tapes should have Status "Cleaning" rather than append.
-- Make sure that Python has access to Client address/port so that
-  it can check if Clients are alive.
-- Review all items in "restore".
-
+- Make sure that all do_prompt() calls in Dir check for
+  -1 (error) and -2 (cancel) returns.
+- Fix foreach_jcr() to have free_jcr() inside next(), e.g.:
+    jcr = jcr_walk_start();
+    for ( ; jcr; jcr = jcr_walk_next(jcr)) {
+       ...
+    }
+    jcr_walk_end(jcr);
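+  One way the walk could be wrapped so that callers cannot leak a JCR
+  (a sketch, not the final implementation -- it assumes jcr_walk_next()
+  calls free_jcr() on the JCR it was handed before returning the next
+  one, and jcr_walk_end() releases the JCR a caller stopped on):
+
+   #define foreach_jcr(jcr) \
+      for (jcr = jcr_walk_start(); jcr; jcr = jcr_walk_next(jcr))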
+- A Volume taken from Scratch should take on the retention period
+  of the new pool.
+- Correct the doc for Maximum Changer Wait (and others) accepting only
+  integers.
+- Implement a status that shows why a job is being held in reserve, or
+  rather why none of the drives is suitable.
+- Implement a way to disable a drive (so you can use the second
+  drive of an autochanger, and the first one will not be used or
+  even defined).
+- Make sure Maximum Volumes is respected in Pools when adding
+  Volumes (e.g. when pulling a Scratch volume).
+- Keep the same dcr when switching devices ...
+- Implement code that makes the Dir aware that a drive is an
+  autochanger (so the user doesn't need to use the Autochanger = yes
+  directive).
+- Make the catalog respect ACLs.
+- Add a recycle count to the Media record.
+- Add the initial write date to the Media record.
+- Fix store_yesno to be store_bitmask.
+--- create_file.c.orig Fri Jul  8 12:13:05 2005
++++ create_file.c      Fri Jul  8 12:13:07 2005
+@@ -195,6 +195,8 @@
+                 attr->ofname, be.strerror());
+           return CF_ERROR;
+        }
++    } else if(S_ISSOCK(attr->statp.st_mode)) {
++       Dmsg1(200, "Skipping socket: %s\n", attr->ofname);
+     } else {
+        Dmsg1(200, "Restore node: %s\n", attr->ofname);
+        if (mknod(attr->ofname, attr->statp.st_mode, attr->statp.st_rdev) != 0 && errno != EEXIST) {
+- Add true/false to conf, same as yes/no.
+- Reserve blocks other restore jobs when the first cannot connect to
+  the SD.
+- Fix Maximum Changer Wait, Maximum Open Wait, and Maximum Rewind Wait
+  to accept time qualifiers.
+- Does ClientRunAfterJob fail the job on a bad return code?