   1 <section id="kernel">
   2   <title>The interface between kernel and image</title>
   3
   4   <para>
   5     The initial boot image is supposed to load enough modules to let
   6     the real root device be mounted cleanly.  It starts up in a
   7     <emphasis>very</emphasis> bare environment and it has to do tricky
   8     stuff like juggling root filesystems; to pull that off successfully
   9     it makes sense to take a close look at the environment that the
  10     kernel creates for the image and what the kernel expects it to do.
  11     This section contains raw design notes based on kernel 2.6.8.
  12   </para>
  13
  14   <para>
  15     The processing of the image starts even before the kernel is
  16     activated.  The bootloader, grub or lilo for example, reads two
  17     files from the boot file system into ram: the kernel and image.
  18     The bootloader somehow manages to set two variables in the kernel:
  19     <code>initrd_start</code> and <code>initrd_end</code>; these variables
  20     point to the copy of the image in ram.  The bootloader now
  21     hands over control to the kernel.
  22   </para>
  23
  24   <para>
  25     During setup, the kernel creates a special file system, rootfs.
  26     This mostly reuses ramfs code, but there are a few twists: it can
  27     never be mounted from userspace, there's only one copy, and it's not
  28     mounted on top of anything else.  The existence of rootfs means that
  29     the rest of the kernel can always assume there's a place to mount
  30     other file systems.  It is also a place where temporary files can
  31     be created during the boot sequence.
  32   </para>
  33
  34     <para>
  35       In <code>initramfs.c:populate_rootfs()</code>, there are two
  36       possibilities.  If the image looks like a cpio.gz file, it is
  37       unpacked into rootfs.  If the file <filename>/init</filename> is
  38       among the files unpacked from the cpio file, the initramfs model
  39       is used; otherwise we get a more complex interaction between kernel
  40       and initrd, discussed in <xref linkend="initrd"/>.
  41     </para>
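    <para>
      As a rough userspace approximation of that first decision, the
      sketch below classifies an image file by its leading magic bytes only.
      The helper name <code>image_kind</code> is hypothetical, and the check
      is far less thorough than what the kernel does: it looks for the gzip
      magic (1f 8b) and, on the assumption that an uncompressed archive is
      also of interest, for the "070701" magic of a newc cpio archive.
      It does not look inside the archive for <filename>/init</filename>.
      <programlisting>
```shell
# image_kind is a hypothetical helper, not part of yaird or the kernel.
# It prints "gzip", "cpio" or "other" based on the first bytes of a file.
image_kind() {
    # Dump the first six bytes as hex and strip od's spacing.
    magic=$(od -An -tx1 -N6 "$1" | tr -d ' \n')
    case "$magic" in
        1f8b*)        echo "gzip" ;;   # gzip header, possibly a cpio.gz
        303730373031) echo "cpio" ;;   # ASCII "070701", newc cpio header
        *)            echo "other" ;;  # e.g. a file system image
    esac
}

# Usage, with an assumed path to the image:
# image_kind /boot/initrd.img
```
      </programlisting>
    </para>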
  42
  43   <simplesect>
  44     <title>Booting with Initramfs</title>
  45     <para>
  46       If the image was a cpio file, and it contains a file
  47       <filename>/init</filename>, the initramfs model is used.
  48       The kernel does some basic setup and hands over control to
  49       <filename>/init</filename>; it is then up to
  50       <filename>/init</filename> to make a real root available and to
  51       transfer control to the <filename>/sbin/init</filename> command
  52       on the real root.
  53     </para>
  54
  55     <para>
  56       The tricky part is to do that in such a way that there
  57       is no way for user processes to gain access to the rootfs
  58       filesystem; and in such a way that rootfs remains empty and
  59       hidden under the user root file system.  This is best done
  60       using some C code; <application>yaird</application> uses
  61       <application>run_init</application>, a small tool based on
  62       <application>klibc</application>.
  63       <programlisting>
  64         # invoked as last command in /init, with no other processes running,
  65         # as follows:
  66         # exec run_init /newroot /sbin/init "$@"
  67         - chdir /newroot
  68         # following after lots of sanity checks and not across mounts:
  69         - rm -rf /*
  70         - mount --move . /
  71         - chroot .
  72         - chdir /
  73         - open /dev/console
  74         - exec /sbin/init "$@"
  75       </programlisting>
  76     </para>
  77
  78   </simplesect>
  79
  80   <simplesect id="initrd">
  81     <title>Booting with initrd</title>
  82     <para>
  83       If the image was not a cpio file, the kernel copies the
  84       initrd image from wherever the boot loader left it to
  85       <filename>rootfs:/initrd.image</filename>, and frees the ram used
  86       by the bootloader for the initrd image.
  87     </para>
  88
  89     <para>
  90       After reading initrd, the kernel does more setup to the point where
  91       we have:
  92       <itemizedlist>
  93
  94         <listitem>
  95           <para>
  96               working CPU and memory management
  97           </para>
  98         </listitem>
  99
 100         <listitem>
 101           <para>
 102               working process management
 103           </para>
 104         </listitem>
 105
 106         <listitem>
 107           <para>
 108               compiled in drivers activated
 109           </para>
 110         </listitem>
 111
 112         <listitem>
 113           <para>
 114               a number of support processes such as ksoftirqd are created.
 115               (These processes have the rootfs as root; they can get a new
 116               root when the <code>pivot_root()</code> system call is used.)
 117           </para>
 118         </listitem>
 119
 120         <listitem>
 121           <para>
 122               something like a console.  <code>console_init()</code> is
 123               called before PCI or USB probes, so expect only compiled-in
 124               console devices to work.
 125           </para>
 126         </listitem>
 127
 128       </itemizedlist>
 129     </para>
 130
 131     <para>
 132       At this point, in <code>do_mounts.c:prepare_namespace()</code>,
 133       the kernel looks for a root filesystem to mount.  That root file
 134       system can come from a number of places: NFS, a raid device, a plain
 135       disk or an initrd.  If it's an initrd, the sequence is as follows
 136       (the devfs mounts can fail if devfs is not compiled into the kernel):
 137
 138       <programlisting>
 139       - mount -t devfs devfs /dev
 140       - md_run_setup()
 141       - process initrd
 142       - umount /dev
 143       - mount --move . /
 144       - chroot .
 145       - mount -t devfs devfs /dev
 146       </programlisting>
 147
 148     </para>
 149
 150     <para>
 151       Once that returns, in <code>init/main.c:init()</code>,
 152       initialisation memory is freed and <filename>/sbin/init</filename>
 153       is executed with <code>/dev/console</code> as file descriptors 0, 1
 154       and 2.  <filename>/sbin/init</filename> can be overridden with
 155       an <code>init=/usr/bin/firefox</code> parameter passed to the
 156       boot loader; if <filename>/sbin/init</filename> is not found,
 157       <filename>/etc/init</filename> and a number of other fallbacks
 158       are tried.  We're in business.
 159     </para>
 160
 161     <para>
 162       The processing of initrd starts in
 163       <code>do_mounts_initrd.c:initrd_load()</code>.  It creates
 164       <filename>rootfs:/dev/ram</filename>, then copies
 165       <filename>rootfs:/initrd.image</filename> there and unlinks
 166       <filename>rootfs:/initrd.image</filename>.  Now we have the initrd
 167       image in a block device, which is good for mounting.  It calls
 168       <code>handle_initrd()</code>, which does:
 169
 170       <programlisting>
 171       # make another block special file for ram0
 172       - mknod /dev/root.old b 1 0
 173       # try mounting initrd with all known file systems,
 174       # optionally read-only
 175       - mount -t xxx /dev/root.old /root
 176       - mkdir rootfs:/old
 177       - cd /root
 178       - mount --move . /
 179       - chroot .
 180       - mount -t devfs devfs /dev
 181       - system ("/linuxrc");
 182       - cd rootfs:/old
 183       - mount --move / .
 184       - cd rootfs:/
 185       - chroot .
 186       - umount rootfs:/old/dev
 187       - ... more ...
 188       </programlisting>
 189
 190     </para>
 191
 192     <para>
 193       So <filename>initrd:/linuxrc</filename> runs in an environment where
 194       initrd is the root, with devfs mounted if available, and rootfs is
 195       invisible (except that there are open file handles to directories
 196       in rootfs, needed to change back to the old environment).
 197     </para>
 198
 199     <para>
 200       Now the idea seems to have been that <filename>/linuxrc</filename>
 201       would mount the real root and <code>pivot_root</code> into it, then start
 202       <filename>/sbin/init</filename>.  Thus, linuxrc would never return.
 203       However, <code>main.c:init()</code> does some useful stuff only
 204       after linuxrc returns: freeing init memory segments and starting NUMA
 205       policy, so in Debian and Fedora, for example, <filename>/linuxrc</filename>
 206       will end, and <filename>/sbin/init</filename>
 207       is started by <code>main.c:init()</code>.
 208     </para>
 209
 210     <para>
 211       After linuxrc returns, the variable <code>real_root_dev</code>
 212       determines what happens.  This variable can be read and written
 213       via <filename>/proc/sys/kernel/real-root-dev</filename>.  If it
 214       is 0x0100 (the device number of <filename>/dev/ram0</filename>)
 215       or something equivalent, <code>handle_initrd()</code> will change
 216       directory to <filename>/old</filename> and return.  If it is
 217       something else, <code>handle_initrd()</code> will decode it, mount
 218       it as root, mount initrd as <filename>/root/initrd</filename>,
 219       and again start <filename>/sbin/init</filename>.  (If mounting as
 220       <filename>/root/initrd</filename> fails, the block device is freed.)
 221     </para>
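    <para>
      To make the device number concrete, the sketch below splits such a
      value into major and minor numbers.  The helper name
      <code>decode_dev</code> is hypothetical, and the code assumes the
      classic layout with an 8-bit major and an 8-bit minor number; larger
      device numbers use a different encoding that is not handled here.
      <programlisting>
```shell
# decode_dev is a hypothetical helper, not part of yaird.
# Assumes the classic 16-bit layout: major in the high byte, minor in the low.
decode_dev() {
    dev=$(( $1 ))                  # accepts decimal (256) or hex (0x0100)
    echo "$(( dev / 256 % 256 )):$(( dev % 256 ))"
}

decode_dev 0x0100   # prints 1:0, the ram0 device
decode_dev 256      # prints 1:0, same value written in decimal
decode_dev 0x0301   # prints 3:1, which is /dev/hda1
```
      </programlisting>
      This also shows why writing 256 and writing 0x0100 to
      <filename>/proc/sys/kernel/real-root-dev</filename> mean the same thing.
    </para>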
 222
 223     <para>
 224       Remember <code>handle_initrd()</code> was called via
 225       <code>initrd_load()</code> from <code>prepare_namespace()</code>,
 226       and <code>prepare_namespace()</code> ends by chrooting into the
 227       current directory: <filename>rootfs:/old</filename>.
 228     </para>
 229
 230     <para>
 231       Note that <filename>rootfs:/old</filename> was move-mounted
 232       from '/' after <filename>/linuxrc</filename> returned.
 233       When <filename>/linuxrc</filename> started, the root was
 234       initrd, but <filename>/linuxrc</filename> may have done a
 235       <code>pivot_root()</code>, replacing the root with a real root,
 236       say <filename>/dev/hda1</filename>.
 237     </para>
 238
 239     <para>
 240       Thus:
 241       <itemizedlist>
 242
 243         <listitem>
 244           <para>
 245               <filename>/linuxrc</filename> is started with initrd
 246               mounted as root.
 247           </para>
 248         </listitem>
 249
 250         <listitem>
 251           <para>
 252               There is working memory management, processes, compiled
 253               in drivers, and stdin/out/err are connected to a console,
 254               if the relevant drivers are compiled in.
 255           </para>
 256         </listitem>
 257
 258         <listitem>
 259           <para>
 260               Devfs may be mounted on <filename>/dev</filename>.
 261           </para>
 262         </listitem>
 263
 264         <listitem>
 265           <para>
 266               <filename>/linuxrc</filename> can <code>pivot_root</code>.
 267           </para>
 268         </listitem>
 269
 270         <listitem>
 271           <para>
 272               If you echo 0x0100 to
 273               <filename>/proc/sys/kernel/real-root-dev</filename>,
 274               the <code>pivot_root</code> will remain in effect after
 275               <filename>/linuxrc</filename> ends.
 276           </para>
 277         </listitem>
 278
 279         <listitem>
 280           <para>
 281               After <filename>/linuxrc</filename> returns,
 282               <filename>/dev</filename> may be unmounted and replaced
 283               with devfs.
 284           </para>
 285         </listitem>
 286
 287       </itemizedlist>
 288     </para>
 289
 290     <para>
 291       Thus a good strategy for <filename>/linuxrc</filename> is to
 292       do as little as possible, and defer the real initialisation
 293       to <filename>/sbin/init</filename> on the initrd; this
 294       <filename>/sbin/init</filename> can then <code>pivot_root</code>
 295       into the real root device.
 296       <programlisting>
 297         #!/bin/dash
 298         set -x
 299         mount -nt proc proc /proc
 300         # root=$(cat /proc/sys/kernel/real-root-dev)
 301         echo 256 > /proc/sys/kernel/real-root-dev   # 256 == 0x0100, /dev/ram0
 302         umount -n /proc
 303       </programlisting>
 304     </para>
 305
 306   </simplesect>
 307
 308   <simplesect>
 309     <title>Kernel command line parameters</title>
 310     <para>
 311       The kernel passes more information than just an initial file system
 312       to the initrd or initramfs image; there also are the kernel boot
 313       parameters.  The bootloader passes these to the kernel, and the kernel
 314       in turn passes them on via <filename>/proc/cmdline</filename>.
 315     </para>
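    <para>
      A script on the image could pick the root-related options out of
      that command line along the following lines.  This is only a sketch:
      the helper name <code>parse_cmdline</code> and its output format are
      hypothetical, and the defaults (<filename>/sbin/init</filename>,
      read-only root) are assumptions taken from the option list in this
      section rather than from any existing template.
      <programlisting>
```shell
# parse_cmdline is a hypothetical helper, not part of yaird.
# It scans a kernel command line and reports root, init and ro/rw.
parse_cmdline() {
    root=""
    init="/sbin/init"      # assumed default when no init= is given
    rootmode="ro"          # assumed default: mount root read-only
    for word in $1; do     # unquoted on purpose: split on whitespace
        case "$word" in
            root=*) root="${word#root=}" ;;
            init=*) init="${word#init=}" ;;
            ro)     rootmode="ro" ;;
            rw)     rootmode="rw" ;;
        esac
    done
    echo "root=$root init=$init mode=$rootmode"
}

# On a running system the argument would be "$(cat /proc/cmdline)".
parse_cmdline "root=/dev/hda1 rw init=/bin/sh noapic"
# prints: root=/dev/hda1 init=/bin/sh mode=rw
```
      </programlisting>
      Unknown words such as <code>noapic</code> are simply skipped, which
      matches the idea that most parameters are of no interest to the image.
    </para>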
 316
 317     <para>
 318       An old version of these parameters is documented in the
 319       <citerefentry>
 320         <!-- Sometimes I think docbook is overdoing this markup thing -->
 321         <refentrytitle>bootparam</refentrytitle>
 322         <manvolnum>7</manvolnum>
 323       </citerefentry> manual page; more recent information is in the kernel
 324       documentation file <citetitle>kernel-parameters.txt</citetitle>.
 325       Mostly, these parameters are used to configure non-modular drivers,
 326       and are thus not very interesting to <application>yaird</application>.
 327       Then there are parameters such as <code>noapic</code>, which are
 328       interpreted by the kernel core and also irrelevant to
 329       <application>yaird</application>.
 330       Finally there are a few parameters which are used by the kernel
 331       to determine how to mount the root file system.
 332     </para>
 333
 334     <para>
 335       Whether the initial image should emulate these options or ignore them
 336       is open to discussion; you can make a case that the flexibility these
 337       options offer has become irrelevant now that initrd/initramfs offers
 338       far more fine-grained control over the way in which the system
 339       is booted.
 340       Support for these options is mostly a matter of tuning the
 341       distribution specific templates, but it is possible that the
 342       templates need an occasional hint from the planner.
 343       To find out just how much "mostly" is, we'll try to implement
 344       full support for these options and see where we run into
 345       limitations.
 346       An inventory of the relevant options follows.
 347       <variablelist>
 348
 349         <varlistentry>
 350           <term>
 351             ide
 352           </term>
 353           <listitem>
 354             <para>
 355               These are options for the modular ide-core driver.
 356               This could be supported by adding an attribute
 357               "isIdeCore" to insmod actions, and expanding the ide
 358               kernel options only for insmod actions where that
 359               attribute is true.
 360               It seems cleaner to support the options from
 361               <filename>/etc/modprobe.conf</filename>.
 362               Unsupported for now.
 363             </para>
 364           </listitem>
 365         </varlistentry>
 366
 367         <varlistentry>
 368           <term>
 369             init
 370           </term>
 371           <listitem>
 372             <para>
 373               The first program to be started on the definitive root device,
 374               default <filename>/sbin/init</filename>.  Supported.
 375             </para>
 376           </listitem>
 377         </varlistentry>
 378
 379         <varlistentry>
 380           <term>
 381             ro
 382           </term>
 383           <listitem>
 384             <para>
 385               Mount the definitive root device read only,
 386               so that it can be submitted to <application>fsck</application>.
 387               Supported; this is the default behaviour.
 388             </para>
 389           </listitem>
 390         </varlistentry>
 391
 392         <varlistentry>
 393           <term>
 394             rw
 395           </term>
 396           <listitem>
 397             <para>
 398               Mount the definitive root device read-write.  Supported.
 399             </para>
 400           </listitem>
 401         </varlistentry>
 402
 403         <varlistentry>
 404           <term>
 405             resume, noresume
 406           </term>
 407           <listitem>
 408             <para>
 409               Which device (not) to use for software suspend.
 410               To be done.
 411             </para>
 412           </listitem>
 413         </varlistentry>
 414
 415         <varlistentry>
 416           <term>
 417             root
 418           </term>
 419           <listitem>
 420             <para>
 421               The device to mount as root.  This is a nasty one:
 422               the planner by default only creates device nodes
 423               that are needed to mount the root device, and even
 424               if you were to put hotplug on the initial image
 425               to create all possible device nodes, there's still
 426               the matter of putting support for the proper file system
 427               on the initial image.
 428               We could make an option to
 429               <application>yaird</application> to specify a list
 430               of possible root devices and load the necessary
 431               modules for all of them.
 432               Unsupported until there's a clear need for it.
 433             </para>
 434           </listitem>
 435         </varlistentry>
 436
 437         <varlistentry>
 438           <term>
 439             rootflags
 440           </term>
 441           <listitem>
 442             <para>
 443               Flags to use while mounting root file system.
 444               Implement together with root option.
 445             </para>
 446           </listitem>
 447         </varlistentry>
 448
 449         <varlistentry>
 450           <term>
 451             rootfstype
 452           </term>
 453           <listitem>
 454             <para>
 455               File system type for root file system.
 456               Implement together with root option.
 457             </para>
 458           </listitem>
 459         </varlistentry>
 460
 461         <varlistentry>
 462           <term>
 463             nfsaddrs
 464           </term>
 465           <listitem>
 466             <para>
 467               For diskless booting.
 468               Unclear whether we need this.  NFS booting is desirable,
 469               but I guess that will mostly be done under control of
 470               DHCP.  Unsupported for now.
 471             </para>
 472           </listitem>
 473         </varlistentry>
 474
 475         <varlistentry>
 476           <term>
 477             nfsroot
 478           </term>
 479           <listitem>
 480             <para>
 481               More diskless booting.
 482               Unsupported for now.
 483             </para>
 484           </listitem>
 485         </varlistentry>
 486
 487       </variablelist>
 488
 489     </para>
 490   </simplesect>
 491 </section>