request_standard_resources reserves the I/O resources used by memory and peripherals.

arch/arm/kernel/setup.c

static void __init
request_standard_resources(struct meminfo *mi, struct machine_desc *mdesc)
{
	struct resource *res;
	int i;

	kernel_code.start   = virt_to_phys(&_text);
	kernel_code.end     = virt_to_phys(&_etext - 1);
	kernel_data.start   = virt_to_phys(&__data_start);
	kernel_data.end     = virt_to_phys(&_end - 1);

The mi parameter records every memory bank in the system. It is passed to the kernel by the bootloader through the ATAG mechanism and stored in a struct meminfo descriptor, itself named meminfo. If the system provides a single 256MB bank, the descriptor prints as follows:

mi->nr_banks: 1
bank[0]: start: 0x50000000, size: 0x10000000, node: 0

The kernel defines three standard memory resources: video RAM, kernel code and kernel data.

static struct resource mem_res[] = {
	{
		.name  = "Video RAM",
		.start = 0,
		.end   = 0,
		.flags = IORESOURCE_MEM
	},
	{
		.name  = "Kernel text",
		.start = 0,
		.end   = 0,
		.flags = IORESOURCE_MEM
	},
	{
		.name  = "Kernel data",
		.start = 0,
		.end   = 0,
		.flags = IORESOURCE_MEM
	}
};

Three macros name these standard resources, and the function refers to them through the macros. Because mem_res is static it is never referenced elsewhere; it simply records where the kernel's code and data segments sit in memory.

#define video_ram   mem_res[0]
#define kernel_code mem_res[1]
#define kernel_data mem_res[2]

For each bank, alloc_bootmem_low first allocates a struct resource descriptor; request_resource then claims the region and registers it in the kernel's resource tree, declaring that the RAM device owns this I/O address range.

	for (i = 0; i < mi->nr_banks; i++) {
		if (mi->bank[i].size == 0)
			continue;

		res = alloc_bootmem_low(sizeof(*res));
		res->name  = "System RAM";
		res->start = mi->bank[i].start;
		res->end   = mi->bank[i].start + mi->bank[i].size - 1;
		res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;

		request_resource(&iomem_resource, res);

		if (kernel_code.start >= res->start &&
		    kernel_code.end <= res->end)
			request_resource(res, &kernel_code);
		if (kernel_data.start >= res->start &&
		    kernel_data.end <= res->end)
			request_resource(res, &kernel_data);
	}
IORESOURCE_BUSY marks the resource as in use and unavailable for further allocation; IORESOURCE_MEM marks it as a memory-mapped resource. The kernel_code and kernel_data regions are then reserved as children of each bank, so that no peripheral can map its I/O addresses over the kernel's code or data. This is also why System RAM is always the first peripheral I/O resource to be claimed.

	if (mdesc->video_start) {
		video_ram.start = mdesc->video_start;
		video_ram.end   = mdesc->video_end;
		request_resource(&iomem_resource, &video_ram);
	}

	/*
	 * Some machines don't have the possibility of ever
	 * possessing lp0, lp1 or lp2
	 */
	if (mdesc->reserve_lp0)
		request_resource(&ioport_resource, &lp0);
	if (mdesc->reserve_lp1)
		request_resource(&ioport_resource, &lp1);
	if (mdesc->reserve_lp2)
		request_resource(&ioport_resource, &lp2);
}

If the system has a video device, the machine descriptor's video_start/video_end fields record the memory region the device occupies, and the video RAM resource is claimed here. lp0/1/2 are legacy parallel printer ports; they work through I/O-port mapping, so they are registered under ioport_resource.

cpu_init sets up the stack pointer for each of the CPU's exception modes. Every CPU has its own copy of the banked stack registers, which is what lets each one handle exceptions independently.

struct stack {
	u32 irq[3];
	u32 abt[3];
	u32 und[3];
} ____cacheline_aligned;

static struct stack stacks[NR_CPUS];

void cpu_init(void)
{
	unsigned int cpu = smp_processor_id();
	struct stack *stk = &stacks[cpu];

	__asm__ (
	"msr	cpsr_c, %1\n\t"
	"add	sp, %0, %2\n\t"
	"msr	cpsr_c, %3\n\t"
	"add	sp, %0, %4\n\t"
	"msr	cpsr_c, %5\n\t"
	"add	sp, %0, %6\n\t"
	"msr	cpsr_c, %7"
	    :
	    : "r" (stk),
	      "I" (PSR_F_BIT | PSR_I_BIT | IRQ_MODE),
	      "I" (offsetof(struct stack, irq[0])),
	      "I" (PSR_F_BIT | PSR_I_BIT | ABT_MODE),
	      "I" (offsetof(struct stack, abt[0])),
	      "I" (PSR_F_BIT | PSR_I_BIT | UND_MODE),
	      "I" (offsetof(struct stack, und[0])),
	      "I" (PSR_F_BIT | PSR_I_BIT | SVC_MODE)
	    : "r14");
}

The compiled assembly looks like this:

c002d190: e1a0c00d 	mov	ip, sp
c002d194: e92dd800 	push	{fp, ip, lr, pc}
......
c002d1c0: e59f3024 	ldr	r3, [pc, #36]	; c002d1ec <cpu_init+0x5c>

The ldr loads the address of stacks into r3, which is also the address of stacks[cpu].irq[0].

/*
 * The _c suffix in CPSR_c means only the control field cpsr[7:0] is written.
 * 0xd2 selects IRQ mode with both IRQ and FIQ interrupts disabled.
 * The add stores r3+0 into sp; since every mode has its own banked sp,
 * the stacks for the IRQ, ABT and UND modes are set up in turn.
 */
c002d1c4: e321f0d2 	msr	CPSR_c, #210	; 0xd2
c002d1c8: e283d000 	add	sp, r3, #0	; 0x0
/* ABT mode, IRQ and FIQ disabled; sp = r3 + 0xc */
c002d1cc: e321f0d7 	msr	CPSR_c, #215	; 0xd7
c002d1d0: e283d00c 	add	sp, r3, #12	; 0xc
/* UND mode, IRQ and FIQ disabled; sp = r3 + 0x18 */
c002d1d4: e321f0db 	msr	CPSR_c, #219	; 0xdb
c002d1d8: e283d018 	add	sp, r3, #24	; 0x18
/* back to SVC mode */
c002d1dc: e321f0d3 	msr	CPSR_c, #211	; 0xd3
/* ldm pops the values saved at function entry back into fp, sp and pc,
   matching the push on entry */
c002d1e0: e89da800 	ldm	sp, {fp, sp, pc}

This code sets the stack addresses for the IRQ/ABT/UND modes, and those addresses are determined by the static variable stacks: the start of stacks is irq[0], offset 0xc is abt[0], and offset 0x18 is und[0].
__switch_data:
	......
	.long	init_thread_union + THREAD_START_SP	@ sp
	......

__mmap_switched:
	......
	ldmia	r3, {r4, r5, r6, r7, sp}
	......

Here sp is assigned init_thread_union + THREAD_START_SP. init_thread_union is the start address of the init task's struct thread_info, and it usually equals __data_start.

#define THREAD_SIZE_ORDER	1
#define THREAD_SIZE		8192
#define THREAD_START_SP		(THREAD_SIZE - 8)
The area holding a kernel thread's struct thread_info is THREAD_SIZE bytes: the thread_info itself sits at the very bottom, and the remaining memory serves as the thread's stack. THREAD_SIZE is added to init_thread_union because the stack grows downward, and 8 bytes are subtracted to leave room at the top of the stack for exception handling.
By default the kernel does not enable FIQ; the fast-interrupt code is compiled in only when CONFIG_FIQ is configured.

arch/arm/kernel/Makefile

obj-$(CONFIG_FIQ) += fiq.o

The related functions are defined in fiq.c. The function that sets up the FIQ registers, including its stack pointer, is:

void __attribute__((naked)) set_fiq_regs(struct pt_regs *regs)
{
	register unsigned long tmp;
	asm volatile (
	"mov	ip, sp\n\
	stmfd	sp!, {fp, ip, lr, pc}\n\
	sub	fp, ip, #4\n\
	mrs	%0, cpsr\n\
	msr	cpsr_c, %2	@ select FIQ mode\n\
	mov	r0, r0\n\
	ldmia	%1, {r8 - r14}\n\
	msr	cpsr_c, %0	@ return to SVC mode\n\
	mov	r0, r0\n\
	ldmfd	sp, {fp, sp, pc}"
	: "=&r" (tmp)
	: "r" (&regs->ARM_r8), "I" (PSR_I_BIT | PSR_F_BIT | FIQ_MODE));
}

Here the ldmia loads the FIQ-mode banked registers, and with them the FIQ stack, from regs. With all of the above in place, the kernel's per-mode stacks are as shown below:

Table 17. Kernel stack setup
	/*
	 * Set up various architecture-specific pointers
	 */
	init_arch_irq = mdesc->init_irq;
	system_timer  = mdesc->timer;
	init_machine  = mdesc->init_machine;

	early_trap_init();
}

The final part of setup_arch records several architecture-specific pointers, which are called during the initialization steps that follow.

arch/arm/kernel/traps.c

void __init early_trap_init(void)
{
	unsigned long vectors = CONFIG_VECTORS_BASE;
	extern char __stubs_start[], __stubs_end[];
	extern char __vectors_start[], __vectors_end[];
	extern char __kuser_helper_start[], __kuser_helper_end[];
	int kuser_sz = __kuser_helper_end - __kuser_helper_start;

	/*
	 * Copy the vectors, stubs and kuser helpers (in entry-armv.S)
	 * into the vector page, mapped at 0xffff0000, and ensure these
	 * are visible to the instruction stream.
	 */
	memcpy((void *)vectors, __vectors_start,
	       __vectors_end - __vectors_start);
	memcpy((void *)vectors + 0x200, __stubs_start,
	       __stubs_end - __stubs_start);
	memcpy((void *)vectors + 0x1000 - kuser_sz,
	       __kuser_helper_start, kuser_sz);

	/*
	 * Copy signal return handlers into the vector page, and
	 * set sigreturn to be a pointer to these.
	 */
	memcpy((void *)KERN_SIGRETURN_CODE, sigreturn_codes,
	       sizeof(sigreturn_codes));

	flush_icache_range(vectors, vectors + PAGE_SIZE);
	modify_domain(DOMAIN_USER, DOMAIN_CLIENT);
}

early_trap_init initializes the ARM Linux exception vectors; it completely replaces the generic trap_init. First, memcpy copies the exception entry code starting at __vectors_start to vectors. vectors is assigned CONFIG_VECTORS_BASE, which .config sets to 0xffff0000 here; this is the vector base address established in the page tables. __vectors_start and the rest of the vector table are defined in arch/arm/kernel/entry-armv.S:

	.globl	__vectors_start
__vectors_start:
	swi	SYS_ERROR0			@ reset exception
	b	vector_und + stubs_offset
	ldr	pc, .LCvswi + stubs_offset
	b	vector_pabt + stubs_offset
	b	vector_dabt + stubs_offset
	b	vector_addrexcptn + stubs_offset
	b	vector_irq + stubs_offset
	b	vector_fiq + stubs_offset

	.globl	__vectors_end
__vectors_end:

The region from __vectors_start to __vectors_end is the exception vector table. When an exception occurs, the processor fetches from the corresponding vector starting at 0xffff0000 and branches to the handler with a b instruction. Because the ARM b instruction is a relative branch with only a +/-32MB range, the handler code between __stubs_start and __stubs_end is copied to 0xffff0200, where a b instruction can reach it directly; this is more efficient than an absolute jump through ldr. That stub code also lives in arch/arm/kernel/entry-armv.S. The vector_und and similar entries are expanded from the vector_stub macro:

	.macro	vector_stub, name, mode, correction=0
	.align	5

vector_\name:
	.if \correction
	sub	lr, lr, #\correction
	.endif

	@
	@ Save r0, lr_<exception> (parent PC) and spsr_<exception>
	@ (parent CPSR)
	@
	stmia	sp, {r0, lr}		@ save r0, lr
	mrs	lr, spsr
	str	lr, [sp, #8]		@ save spsr

	@
	@ Prepare for SVC32 mode.  IRQs remain disabled.
	@
	mrs	r0, cpsr
	eor	r0, r0, #(\mode ^ SVC_MODE)
	msr	spsr_cxsf, r0

	@
	@ the branch table must immediately follow this code
	@
	and	lr, lr, #0x0f
	mov	r0, sp
	ldr	lr, [pc, lr, lsl #2]
	movs	pc, lr			@ branch to handler in SVC mode
ENDPROC(vector_\name)
	.endm

How is stubs_offset determined? When the assembler sees a b instruction, it converts the target label into an offset relative to the current PC (within +/-32MB) and writes it into the instruction encoding. As shown above, both the vector table and the stubs are relocated, so if the vector table still said plainly "b vector_irq", execution could not reach the relocated vector_irq: the instruction would still encode the original offset. The encoded offset therefore has to account for the relocation. Write the pre-relocation address of the IRQ entry in the vector table as irq_PC. Its offset within the vector table is irq_PC - vectors_start, and vector_irq's offset within the stubs is vector_irq - stubs_start; neither offset changes during the copy. After relocation, vectors_start sits at 0xffff0000 and stubs_start at 0xffff0200, so the relocated vector_irq lies, relative to the relocated IRQ entry, at 0x200 plus vector_irq's offset within the stubs, minus the entry's offset within the vector table:

0x200 + vector_irq - stubs_start - irq_PC + vectors_start
    = (vector_irq - irq_PC) + (vectors_start + 0x200 - stubs_start)

The parenthesized first part is just what writing "b vector_irq" produces (subtracting irq_PC is the assembler's job), so the remainder, vectors_start + 0x200 - stubs_start, must be stubs_offset. That is exactly how entry-armv.S defines it:

	.globl	__stubs_start
__stubs_start:
/*
 * Interrupt dispatcher
 */
	vector_stub	irq, IRQ_MODE, 4

	.long	__irq_usr			@  0  (USR_26 / USR_32)
	.long	__irq_invalid			@  1  (FIQ_26 / FIQ_32)
	.long	__irq_invalid			@  2  (IRQ_26 / IRQ_32)
	.long	__irq_svc			@  3  (SVC_26 / SVC_32)
	.long	__irq_invalid			@  4
	.long	__irq_invalid			@  5
	.long	__irq_invalid			@  6
	......

.LCvswi:
	.word	vector_swi

	.globl	__stubs_end
__stubs_end:

	.equ	stubs_offset, __vectors_start + 0x200 - __stubs_start

	sched_init();
	/*
	 * Disable preemption - early bootup scheduling is extremely
	 * fragile until we cpu_idle() for the first time.
	 */
	preempt_disable();

sched_init initializes the process scheduler. preempt_disable then forbids preemption so that the kernel can initialize its remaining subsystems undisturbed; the first schedule happens through cpu_idle when the last function, rest_init, runs.
The command_line variable is defined in start_kernel and given its value in setup_arch. setup_arch first parses the parameters passed via CONFIG_CMDLINE with parse_cmdline, which handles every parameter declared with the __early_param macro:
./mm/mmu.c:125:__early_param("cachepolicy=", early_cachepolicy);
./mm/mmu.c:133:__early_param("nocache", early_nocache);
./mm/mmu.c:141:__early_param("nowb", early_nowrite);
./mm/mmu.c:153:__early_param("ecc=", early_ecc);
./mm/mmu.c:655:__early_param("vmalloc=", early_vmalloc);
./mm/init.c:46:__early_param("initrd=", early_initrd);
./kernel/setup.c:415:__early_param("mem=", early_mem);

All parameters not covered by __early_param are passed on to setup_command_line.

/*
 * We need to store the untouched command line for future reference.
 * We also need to store the touched command line since the parameter
 * parsing is performed in place, and we should allow a component to
 * store reference of name/value for future reference.
 */
static void __init setup_command_line(char *command_line)
{
	saved_command_line = alloc_bootmem(strlen(boot_command_line) + 1);
	static_command_line = alloc_bootmem(strlen(command_line) + 1);
	strcpy(saved_command_line, boot_command_line);
	strcpy(static_command_line, command_line);
}

saved_command_line and static_command_line are first allocated from bootmem; they then hold, respectively, the untouched original command line and the command line after parse_cmdline has processed it. A concrete example:

CONFIG_CMDLINE="root=/dev/mtdblock2 rootfstype=cramfs init=/linuxrc console=ttySAC0,115200 mem=256M"

saved_command_line:  root=/dev/mtdblock2 rootfstype=cramfs console=ttySAC0,115200 mem=256M bootmem_debug=1
static_command_line: root=/dev/mtdblock2 rootfstype=cramfs console=ttySAC0,115200 bootmem_debug=1

mm/page_alloc.c

void build_all_zonelists(void)
{
	set_zonelist_order();

The behavior of set_zonelist_order depends on whether CONFIG_NUMA is defined; when it is not, the function looks like this:

/*
 * zonelist_order:
 * 0 = automatic detection of better ordering.
 * 1 = order by ([node] distance, -zonetype)
 * 2 = order by (-zonetype, [node] distance)
 *
 * If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
 * the same zonelist. So only NUMA can configure this param.
 */
#define ZONELIST_ORDER_DEFAULT	0
#define ZONELIST_ORDER_NODE	1
#define ZONELIST_ORDER_ZONE	2

static void set_zonelist_order(void)
{
	current_zonelist_order = ZONELIST_ORDER_ZONE;
}

include/linux/kernel.h

/* Values used for system_state */
extern enum system_states {
	SYSTEM_BOOTING,
	SYSTEM_RUNNING,
	SYSTEM_HALT,
	SYSTEM_POWER_OFF,
	SYSTEM_RESTART,
	SYSTEM_SUSPEND_DISK,
} system_state;

	if (system_state == SYSTEM_BOOTING) {
		__build_all_zonelists(NULL);
		mminit_verify_zonelist();
		cpuset_init_current_mems_allowed();
	} else {
		/* we have to stop all cpus to guarantee there is
		   no user of zonelist */
		stop_machine(__build_all_zonelists, NULL, NULL);
		/* cpuset refresh routine should be here */
	}

system_state is initialized to SYSTEM_BOOTING by default and only changes to SYSTEM_RUNNING in rest_init. __build_all_zonelists fills in each zone's allocation fallback list, the zonelists member, linking the different zones onto the list in order of allocation preference. The ordering really only matters when CONFIG_NUMA is configured; otherwise the zones are simply appended in the index order of the current node's node_zones array.

	vm_total_pages = nr_free_pagecache_pages();
	/*
	 * Disable grouping by mobility if the number of pages in the
	 * system is too low to allow the mechanism to work. It would be
	 * more accurate, but expensive to check per-zone. This check is
	 * made on memory-hotadd so a system can start with mobility
	 * disabled and enable it later
	 */
	if (vm_total_pages < (pageblock_nr_pages * MIGRATE_TYPES))
		page_group_by_mobility_disabled = 1;
	else
		page_group_by_mobility_disabled = 0;

	printk("Built %i zonelists in %s order, mobility grouping %s.  "
		"Total pages: %ld\n",
			num_online_nodes(),
			zonelist_order_name[current_zonelist_order],
			page_group_by_mobility_disabled ?
"off" : "on", vm_total_pages); #ifdef CONFIG_NUMA printk("Policy zone: %s\n", zone_names[policy_zone]); #endif根據(jù)當(dāng)前系統(tǒng)中的物理內(nèi)存大小,來決定是否啟用流動(dòng)分組(Mobility Grouping)機(jī)制,這種機(jī)制可以在分配大內(nèi)存塊時(shí)減少內(nèi)存碎片。顯然只有內(nèi)存足夠大時(shí)才會(huì)啟用該功能。 void __init page_alloc_init(void) { hotcpu_notifier(page_alloc_cpu_notify, 0); } hotcpu_notifier是一個(gè)宏,在編譯選項(xiàng)CONFIG_HOTPLUG_CPU(CPU熱插拔)起作用時(shí),它才有效。否則不做任何事情。 include/linux/cpu.h /* old style is :hotcpu_notifier(fn, pri) do { } while (0) */ #define hotcpu_notifier(fn, pri) do { (void)(fn); } while (0) 舊代碼的定義要容易理解,也即未使能CONFIG_HOTPLUG_CPU時(shí),什么也不做。(void)(fn)擴(kuò)展開就是(void)(page_alloc_cpu_notify),這并不是對(duì)函數(shù)page_alloc_cpu_notify的引用,而是類似于int a; (int)(a);的一種使用方式,GCC會(huì)將這種代碼優(yōu)化掉。為什么要這樣做呢?這是為了在沒有配置CPU熱插拔功能的系統(tǒng)上避免GCC類似的抱怨: mm/page_alloc.c:4152: warning: 'page_alloc_cpu_notify' defined but not used 總而言之,page_alloc_init函數(shù)只有在開啟CONFIG_HOTPLUG_CPU時(shí),才有作用,此時(shí)完成對(duì)每個(gè)CPU的通告功能。 parse_early_param(); parse_args("Booting kernel", static_command_line, __start___param, __stop___param - __start___param, &unknown_bootoption);parse_early_param參數(shù)解析函數(shù)主要針對(duì)__setup_param聲明的參數(shù)進(jìn)行解析。parse_args在這里主要針對(duì)編譯進(jìn)內(nèi)核的模塊中的參數(shù)進(jìn)行解析。 /* Sort the kernel's built-in exception table */ void __init sort_main_extable(void) { sort_extable(__start___ex_table, __stop___ex_table); }sort_main_extable對(duì)__start___ex_table和__stop___ex_table之間的異常表struct exception_table_entry元素進(jìn)行排序,以加快對(duì)異常的處理。 arch/arm/kernel/vmlinux.lds.S __start___ex_table = .; #ifdef CONFIG_MMU *(__ex_table) #endif __stop___ex_table = .;定義到__ex_table節(jié)中的代碼均由匯編或者內(nèi)嵌語言寫成,例如: arch/arm/kernel/entry-armv.S .section .fixup, "ax" 4: mov pc, r9 .previous .section __ex_table,"a" .long 1b, 4b__ex_table節(jié)中的"a"是指該節(jié)中的代碼需要分配內(nèi)存。.long 1b, 
4b則分別前面的標(biāo)號(hào)1和4,其中如果標(biāo)號(hào)1對(duì)應(yīng)的子程序如果處理中出現(xiàn)問題,則繼續(xù)繼續(xù)標(biāo)號(hào)4的子程序,以確保可以在異常處理中返回。接下來的trap_init()是一個(gè)空函數(shù),它被各個(gè)系統(tǒng)架構(gòu)下的trap初始化代碼取代。
The Read-Copy-Update mechanism is initialized at this point. RCU works by modifying a copy: readers access the resource without taking any lock, and once no reader still references the old data, the copy replaces it and the old version is freed at a suitable time. It is very efficient when reads greatly outnumber writes and the resource is reached through a pointer. RCU initialization happens in rcu_init, called from start_kernel:
kernel/rcupdate.c

void __init rcu_init(void)
{
	__rcu_init();
}

kernel/rcuclassic.c

void __init __rcu_init(void)
{
	rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE,
		       (void *)(long)smp_processor_id());
	/* Register notifier for non-boot CPUs */
	register_cpu_notifier(&rcu_nb);
}

The core data structure here is the notifier block rcu_nb, of type struct notifier_block. A notifier chain node looks as follows: notifier_call is the handler invoked for an event, and priority is the node's priority.

struct notifier_block {
	int (*notifier_call)(struct notifier_block *, unsigned long, void *);
	struct notifier_block *next;
	int priority;
};

static struct notifier_block __cpuinitdata rcu_nb = {
	.notifier_call	= rcu_cpu_notify,
};

rcu_nb's handler is rcu_cpu_notify. It handles the events CPU_UP_PREPARE/CPU_UP_PREPARE_FROZEN and CPU_DEAD/CPU_DEAD_FROZEN, which correspond to a CPU coming online and going offline.

static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
				    unsigned long action, void *hcpu)
{
	long cpu = (long)hcpu;

	switch (action) {
	case CPU_UP_PREPARE:
	case CPU_UP_PREPARE_FROZEN:
		rcu_online_cpu(cpu);
		break;
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		rcu_offline_cpu(cpu);
		break;
	default:
		break;
	}
	return NOTIFY_OK;
}

During system initialization, __rcu_init delivers a CPU_UP_PREPARE event for the current CPU; the corresponding handler is rcu_online_cpu.

static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
				 struct rcu_data *rdp)
{
	unsigned long flags;

	spin_lock_irqsave(&rcp->lock, flags);
	memset(rdp, 0, sizeof(*rdp));
	rdp->nxttail[0] = rdp->nxttail[1] = rdp->nxttail[2] = &rdp->nxtlist;
	rdp->donetail = &rdp->donelist;
	rdp->quiescbatch = rcp->completed;
	rdp->qs_pending = 0;
	rdp->cpu = cpu;
	rdp->blimit = blimit;
	spin_unlock_irqrestore(&rcp->lock, flags);
}

static void __cpuinit rcu_online_cpu(int cpu)
{
	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
	struct rcu_data *bh_rdp = &per_cpu(rcu_bh_data, cpu);

	rcu_init_percpu_data(cpu, &rcu_ctrlblk, rdp);
	rcu_init_percpu_data(cpu, &rcu_bh_ctrlblk, bh_rdp);
	open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
}

Each CPU has its own rcu_data, and rcu_ctrlblk is the control block. The last line shows that the RCU machinery is driven by the RCU_SOFTIRQ softirq: when RCU_SOFTIRQ is raised, the handler rcu_process_callbacks processes the rcu_data and rcu_bh_data structures, and the ksoftirqd kernel thread handles pending softirqs centrally.

static void rcu_process_callbacks(struct softirq_action *unused)
{
	smp_mb();	/* See above block comment. */
	__rcu_process_callbacks(&rcu_ctrlblk, &__get_cpu_var(rcu_data));
	__rcu_process_callbacks(&rcu_bh_ctrlblk, &__get_cpu_var(rcu_bh_data));
	smp_mb();	/* See above block comment. */
}

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}
init_IRQ initializes the status field of every interrupt descriptor, while init_arch_irq, supplied by the architecture-specific code, initializes the platform's interrupt hardware.
void __init init_IRQ(void)
{
	int irq;

	for (irq = 0; irq < NR_IRQS; irq++)
		irq_desc[irq].status |= IRQ_NOREQUEST | IRQ_NOPROBE;

#ifdef CONFIG_SMP
	bad_irq_desc.affinity = CPU_MASK_ALL;
	bad_irq_desc.cpu = smp_processor_id();
#endif
	init_arch_irq();
}

NR_IRQS is set according to the CPU architecture.

arch/arm/plat-s3c64xx/include/plat/irqs.h

/* Set the default NR_IRQS */
#define NR_IRQS	(IRQ_EINT_GROUP9_BASE + IRQ_EINT_GROUP9_NR + 1)

init_arch_irq is assigned the machine descriptor's init_irq in the architecture setup code; here that is s3c6410_init_irq.

arch/arm/kernel/setup.c

void __init setup_arch(char **cmdline_p)
{
	......
	init_arch_irq = mdesc->init_irq;
	......
}

arch/arm/mach-s3c6410/cpu.c

void __init s3c6410_init_irq(void)
{
	/* VIC0 is missing IRQ7, VIC1 is fully populated. */
	s3c64xx_init_irq(~0 & ~(1 << 7), ~0);
}

arch/arm/plat-s3c64xx/irq.c

void __init s3c64xx_init_irq(u32 vic0_valid, u32 vic1_valid)
{
	int uart, irq;

	printk(KERN_DEBUG "%s: initialising interrupts\n", __func__);

	/* initialise the pair of VICs */
	vic_init(S3C_VA_VIC0, S3C_VIC0_BASE, vic0_valid);
	vic_init(S3C_VA_VIC1, S3C_VIC1_BASE, vic1_valid);

	/* add the timer sub-irqs */
	set_irq_chained_handler(IRQ_TIMER0_VIC, s3c_irq_demux_timer0);
	set_irq_chained_handler(IRQ_TIMER1_VIC, s3c_irq_demux_timer1);
	set_irq_chained_handler(IRQ_TIMER2_VIC, s3c_irq_demux_timer2);
	set_irq_chained_handler(IRQ_TIMER3_VIC, s3c_irq_demux_timer3);
	set_irq_chained_handler(IRQ_TIMER4_VIC, s3c_irq_demux_timer4);

	for (irq = IRQ_TIMER0; irq <= IRQ_TIMER4; irq++) {
		set_irq_chip(irq, &s3c_irq_timer);
		set_irq_handler(irq, handle_level_irq);
		set_irq_flags(irq, IRQF_VALID);
	}

	for (uart = 0; uart < ARRAY_SIZE(uart_irqs); uart++)
		s3c64xx_uart_irq(&uart_irqs[uart]);
}

Note that all of this operates directly on hardware registers. The S3C_VA_VIC0 virtual address was mapped earlier, by the mdesc->map_io step of setup_arch:

arch/arm/plat-s3c64xx/cpu.c

static struct map_desc s3c_iodesc[] __initdata = {
	{
		.virtual = (unsigned long)S3C_VA_SYS,
		.pfn	 = __phys_to_pfn(S3C64XX_PA_SYSCON),
		.length	 = SZ_4K,
		.type	 = MT_DEVICE,
	}, {
		.virtual = (unsigned long)(S3C_VA_UART + UART_OFFS),
		.pfn	 = __phys_to_pfn(S3C_PA_UART),
		.length	 = SZ_4K,
		.type	 = MT_DEVICE,
	}, {
		.virtual = (unsigned long)S3C_VA_VIC0,
		.pfn	 = __phys_to_pfn(S3C64XX_PA_VIC0),
		.length	 = SZ_16K,
		.type	 = MT_DEVICE,
	}, {
		.virtual = (unsigned long)S3C_VA_VIC1,
		.pfn	 = __phys_to_pfn(S3C64XX_PA_VIC1),
		.length	 = SZ_16K,
		.type	 = MT_DEVICE,
	}, {
		.virtual = (unsigned long)S3C_VA_TIMER,
		.pfn	 = __phys_to_pfn(S3C_PA_TIMER),
		.length	 = SZ_16K,
		.type	 = MT_DEVICE,
	}, {
		.virtual = (unsigned long)S3C64XX_VA_GPIO,
		.pfn	 = __phys_to_pfn(S3C64XX_PA_GPIO),
		.length	 = SZ_4K,
		.type	 = MT_DEVICE,
	},
};
Linux processes are always assigned a number that uniquely identifies them within their namespace. This number is called the process ID, or PID. Every process created with fork or clone is automatically given a new, unique PID by the kernel. To manage PIDs efficiently, the system defines a hash array.
kernel/pid.c

static struct hlist_head *pid_hash;

hlist_head is the kernel's standard type for building doubly linked hash lists. pid_hash is used as an array of hlist_head; the number of elements depends on how much memory the machine has, and is a power of two between 16 and 4096.

void __init pidhash_init(void)
{
	int i, pidhash_size;
	unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);

	pidhash_shift = max(4, fls(megabytes * 4));
	pidhash_shift = min(12, pidhash_shift);
	pidhash_size = 1 << pidhash_shift;

	printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
		pidhash_size, pidhash_shift,
		pidhash_size * sizeof(struct hlist_head));

	pid_hash = alloc_bootmem(pidhash_size * sizeof(*(pid_hash)));
	if (!pid_hash)
		panic("Could not alloc pidhash!\n");
	for (i = 0; i < pidhash_size; i++)
		INIT_HLIST_HEAD(&pid_hash[i]);
}

pidhash_init runs during system startup. It sizes the table from the amount of low memory, allocates the array from bootmem, and initializes every hash-list head.
void __init init_timers(void)
{
	int err = timer_cpu_notify(&timers_nb, (unsigned long)CPU_UP_PREPARE,
				   (void *)(long)smp_processor_id());

	init_timer_stats();

	BUG_ON(err == NOTIFY_BAD);
	register_cpu_notifier(&timers_nb);
	open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
}

init_timers mainly initializes the local software timers:
struct tvec_base {
	spinlock_t lock;
	struct timer_list *running_timer;
	unsigned long timer_jiffies;
	struct tvec_root tv1;
	struct tvec tv2;
	struct tvec tv3;
	struct tvec tv4;
	struct tvec tv5;
} ____cacheline_aligned;

kernel/timer.c

struct tvec_base boot_tvec_bases;

The kernel timer base is structured as shown above. It indirectly defines five arrays: the first is of type tvec_root, the other four of type tvec, and all of them are simply arrays wrapped around list heads. To dispatch timers quickly and on time, the kernel groups them by expiry time: the soonest go into tv1, the next soonest into tv2, and so on.

struct tvec {
	struct list_head vec[TVN_SIZE];
};

struct tvec_root {
	struct list_head vec[TVR_SIZE];
};

Each group is itself a set of doubly linked lists.

Table 18. Timer expiry intervals
TVR_SIZE and TVN_SIZE determine how many lists tvec_root and tvec contain. CONFIG_BASE_SMALL can be set to 1 to save memory; otherwise TVN_BITS is 6 and TVR_BITS is 8, so tvec and tvec_root hold 64 and 256 lists respectively. For tv1, each list holds exactly those timers whose expiry time matches the list's index.

#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)

During initialization timer_cpu_notify is called with action set to CPU_UP_PREPARE, so it invokes init_timers_cpu to initialize the tvec_base of the current CPU; on a non-SMP system that is boot_tvec_bases. The initialization code:

kernel/timer.c (init_timers_cpu)

	spin_lock_init(&base->lock);
	for (j = 0; j < TVN_SIZE; j++) {
		INIT_LIST_HEAD(base->tv5.vec + j);
		INIT_LIST_HEAD(base->tv4.vec + j);
		INIT_LIST_HEAD(base->tv3.vec + j);
		INIT_LIST_HEAD(base->tv2.vec + j);
	}
	for (j = 0; j < TVR_SIZE; j++)
		INIT_LIST_HEAD(base->tv1.vec + j);

	base->timer_jiffies = jiffies;

init_timer_stats does real work only when CONFIG_TIMER_STATS is configured, and register_cpu_notifier only on SMP; otherwise both are empty functions. open_softirq registers run_timer_softirq as the handler for the TIMER_SOFTIRQ softirq; run_timer_softirq walks the timer lists in the softirq bottom half.
hrtimers_init is similar to init_timers, but it initializes the per-CPU variable hrtimer_bases, which implements high-resolution timers.
<figure><title>Kernel RAM layout</title><graphic fileref="images/kernelmap.gif"/></figure>