[kernel] 带着问题看源码 —— 脚本是如何被 execve 调用的,[kernel] 带着问题看源码 —— 进程 ID 是如何分配的,[apue] 进程环境那些事儿

2025-08-22 15:57:08,

前言

在《[apue] 进程控制那些事儿》一文的"进程创建-> exec -> 解释器文件"一节中，曾提到脚本文件的识别是由内核作为 exec 系统调用处理的一部分来完成的，并且有以下特性：

指定解释器的以 #! (shebang) 开头的第一行长度不得超过 128
shebang 最多只能指定一个参数
shebang 指定的命令与参数会成为新进程的前 2 个参数，用户提供的其它参数依次往后排

这些特性是如何实现的？带着这个疑问，找出系统对应的内核源码看个究竟。

源码定位

和《[kernel] 带着问题看源码 —— 进程 ID 是如何分配的》一样，这里使用 bootlin 查看内核 3.10.0 版本源码，脚本文件是在 execve 时解析的，所以先搜索 sys_ execve：

整个调用链如下：

sys_execve -> do_execve -> do_execve_common -> search_binary_handler-> load_binary -> load_script (binfmt_script.c)

为了快速进入主题，前面咱们就不一一细看了，主要解释一下 search_binary_handler。

Linux 中加载不同文件格式的方式是可扩展的，这主要是通过内核模块来实现的，每个模块实现一个格式，新的格式可通过编写内核模块实现快速支持，而无需修改内核源码。刚才浏览代码的时候已经初窥门径：

这是目前内核内置的几个模块

binfmt_elf：最常用的 Linux 二进制可执行文件
binfmt_elf_fdpic：缺失 MMU 架构的二进制可执行文件
binfmt_em86：在 Aplha 机器上运行 Intel 的 Linux 二进制文件
binfmt_aout：Linux 老的可执行文件
binfmt_script：脚本文件
binfmt_misc：一种机制，支持运行期文件格式与应用对应关系的绑定
...

基本可以归纳为三大类：

可执行文件
脚本文件 (script)
机制拓展 (misc)

其中 misc 机制用来实现文件与应用的绑定，这一点类似于 Windows 通过后缀直接唤起相关应用，不过它更加强大：除了通过后缀，还可以通过检测文件中的 Magic 字段，作为判断文件类型的依据。

目前主要应用的方向是跨架构运行，例如在 x86 机器上运行 arm64、甚至 Windows 程序 (wine)，相比编写内核模块，便利性又提升了一个等级。

本文主要关注脚本文件的处理过程。

binfmt

内核模块本身并不难实现，以 script 为例：

static struct linux_binfmt script_format = {
	.module		= THIS_MODULE，
	.load_binary	= load_script，
};

static int __init init_script_binfmt(void)
{
	register_binfmt(&script_format);
	return 0;
}

static void __exit exit_script_binfmt(void)
{
	unregister_binfmt(&script_format);
}

core_initcall(init_script_binfmt);
module_exit(exit_script_binfmt);

主要是通过 register_binfmt / unregister_binfmt 来插入、删除 Linux_binfmt 信息节点。

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};

linux_binfmt 的内容不多，而且回调函数不用全部实现，没有用到的留空就完事了。下面看下插入节点过程：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}

利用 linux_binfmt.lh 字段 (list_head) 实现链表插入，链表头为全局变量 formats。

search_binary_handler

再看 search_binary_handler 利用 formats 遍历链表的过程：

retval = -ENOENT;
for (try=0; try<2; try++) {

最多尝试 2 次

	read_lock(&binfmt_lock);
	list_for_each_entry(fmt， &formats， lh) {

加锁；通过 formats 遍历整个链表

		int (*fn)(struct linux_binprm *) = fmt->load_binary;
		if (!fn)
			continue;
		if (!try_module_get(fmt->module))
			continue;
		read_unlock(&binfmt_lock);
		bprm->recursion_depth = depth + 1;

try_module_get 检查内核模块是否 alive；执行 load_binary 前解锁 formats 链表以便嵌套；更新嵌套深度

		retval = fn(bprm);
		bprm->recursion_depth = depth;
		if (retval >= 0) {
			if (depth == 0) {
				trace_sched_process_exec(current， old_pid， bprm);
				ptrace_event(PTRACE_EVENT_EXEC， old_vpid);
			}
			put_binfmt(fmt);
			allow_write_access(bprm->file);
			if (bprm->file)
				fput(bprm->file);
			bprm->file = NULL;
			current->did_exec = 1;
			proc_exec_connector(current);
			return retval;
		}

恢复嵌套深度；若执行成功，提前返回

		read_lock(&binfmt_lock);
		put_binfmt(fmt);
		if (retval != -ENOEXEC || bprm->mm == NULL)
			break;
		if (!bprm->file) {
			read_unlock(&binfmt_lock);
			return retval;
		}
	}

执行失败，重新加锁；如果非 ENOEXEC 错误，继续尝试下个 fmt

	read_unlock(&binfmt_lock);
	break;
}

遍历完毕，解锁，退出

其中 list_for_each_entry 是 Linux 对 list 遍历的封装宏：

/**
 * list_for_each_entry	-	iterate over list of given type
 * @pos:	the type * to use as a loop cursor.
 * @head:	the head for your list.
 * @member:	the name of the list_struct within the struct.
 */
#define list_for_each_entry(pos， head， member)				\
	for (pos = list_entry((head)->next， typeof(*pos)， member);	\
	     &pos->member != (head); 	\
	     pos = list_entry(pos->member.next， typeof(*pos)， member))

本质是个 for 循环。另外，之前的 for (try < 2) 其实并不生效，因为总会被末尾的 break 打断。

不过这里揭示了一点 load_binary 返回值的含义：当接口返回 -ENOEXEC 时，表示这个文件“不合胃口”，请继续遍历 formats 列表尝试。关于这一点，在下面解读 load_script 时可多加留意。

另外 binfmt 是可以嵌套的，假设启动一个脚本，它使用 awk 作为解释器，那么整个执行过程看起来像下面这样：

execve (xxx.awk) -> load_script (binfmt_script) -> load_elf_binary (binfmt_elf)

这是因为 awk 作为可执行文件，本身也需要 binfmt 的处理，稍等就可以在 load_script 中看到这一点。

目前 Linux 没有对嵌套深度施加限制。

源码分析

经过一番背景知识铺垫，终于可以进入 binfmt_script 好好看看啦：

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};0

脚本不以 #! 开头的，忽略；注意 interp 数组的长度：#define BINPRM_BUF_SIZE 128，这是 shebang 不能超过 128 的根源

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};1

系统已读取文件头部的一部分字节到内存，脚本文件用完了，释放

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};2

最多截取前 127 个字符，并向前搜索 shebang 结尾 (\n)，若有，则设置新的结尾到那里

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};3

前后 trim 空白字符，如果没有任何内容，忽略；注意初始时 cp 指向字符串尾部，结束时，cp 指向有效命令名的开始

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};4

跳过命令名；忽略空白字符；剩下的若不为空全部作为一个参数，这是只能解析一个参数的根源；命令名复制到 interp 数组中备用

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};5

删除 argv 的第一个参数，分别将命令名 (i_name)、参数 (i_arg)、脚本文件名 (bprm->interp) 放置到 argv 前三位。

注意调用的顺序是倒序的：bprm->interp、i_arg、i_name，这是由于 argv 在进程中特殊存放方式导致的，参考后面的解说；最后更新 bprm 中的命令名

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};6

通过命令名指定的路径打开文件，并设置到当前进程， prepare_binprm 准备加载前的各种信息，包括预读文件的头部的一些内容

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};7

使用新命令的信息继续搜索 binfmt 模块并加载之，这里是真正加载命令的地方

这里主要补充一点，对于 shebang 中的命令名字段，中间不能包含空格，否则会被提前截断，即使使用引号包围也不行，下面是个例子：

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};8

通读上面的源码就能了解到，这里根本未对引号做任何处理。

文件头预读

这里主要解释两点，一是 prepare_binprm 会预读文件头部的一些数据，供后面 binfmt 判断使用：

/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};9

最后一句 kernel_read 就是啦。目前这个 BINPRM_BUF_SIZE 的长度也是 128：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}0

在 do_execve_common 中也会调用这个接口来为第一次 binfmt 识别做准备：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}1

没错，就是这里了

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}2

argv 调整

另外一点是 argv 在内存中的布局，参考之前写的《[apue] 进程环境那些事儿》，这里直接贴图：

命令行参数与环境变量是放在进程高地址空间的末尾，以 \0 为间隔的字符串。由于有高地址“天花板”在存在，这里必需先计算出所有字符串的长度，定位到起始位置，再复制整个字符串，此外为了保证 argv[0] 地址小于 argv[1]，整个数组也需要从后向前遍历。这里借用之前写的一个例子证明这一点：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}3

随便给一些参数让它跑个输出：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}4

重点看下 argv 与 envp 的地址，envp 高于 argv；再看各个数组内部的情况，索引低的地址也低。结合之前的内存布局图，需要这样排布各个参数：

先排布 envp，envp 内部从后向前遍历
后排布 argv，argv 内部从后向前遍历

代码也确实是这样写的：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}5

上面这段之前在 do_execve_common 中展示过，先排布 envp 后排布 argv，再看数组内部的处理：

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}6

倒序遍历数组

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}7

计算当前字符串长度并预留位置，注意复制时可能存在跨页情况，字符串也是从尾向头分割为一块块复制的

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}8

复制单个字符串，字符串可能非常大，一个就好几页，copy_from_user 就是那个具体干活儿的

static LIST_HEAD(formats);
static DEFINE_RWLOCK(binfmt_lock);

void __register_binfmt(struct linux_binfmt * fmt， int insert)
{
	BUG_ON(!fmt);
	write_lock(&binfmt_lock);
	insert ? list_add(&fmt->lh， &formats) :
		 list_add_tail(&fmt->lh， &formats);
	write_unlock(&binfmt_lock);
}

/* Registration of default binfmt handlers */
static inline void register_binfmt(struct linux_binfmt *fmt)
{
	__register_binfmt(fmt， 0);
}9

出错处理

了解了 argv 与 envp 的布局后，突然发现在 argv 数组前面插入元素反而简单了，因为它是在最低地址，只要继续向下拓展即可。

不过这里需要先将原先指向脚本路径的第一个元素删除，这里 Linux 使用了一个 trick：直接移动 argv 指针 (bprm->p) 略过第一个参数：

retval = -ENOENT;
for (try=0; try<2; try++) {0

其实关键的就是下面两句：

retval = -ENOENT;
for (try=0; try<2; try++) {1

do...while 循环用来释放第一个元素占用的页面，如果它特别大的话。

经过更新后，bprm->p 指向了第二个参数，argc 减少了 1，后面新参数插入时，会自动覆盖它：

retval = -ENOENT;
for (try=0; try<2; try++) {2

上面这段代码源自 load_script，其中 copy_string_kernel 最终会调用 copy_strings，因此一切又回到了前面的逻辑，这也是为何这里要倒序 copy 参数的缘由，这回大家看明白了吗？

与其说先有倒排这个“鸡”，后有在 argv 数组前面插入元素的“蛋”，倒不如说是先有”蛋“后有”鸡“，换句话说：加载 script 有在 argv 前面插入元素的需求，催生了 argv 参数的倒排。

之前说的 argv[1] 地址大于 argv[0] 这种要求反而是无所谓的，如果要求在 argv 之后插入元素，我估计 &argv[1] < &argv[0] Linux 也能做的出来。

总结

开头提出的三个问题：

指定解释器的以 shebang 开头的第一行长度不得超过 128
shebang 最多只能指定一个参数
shebang 指定的命令与参数会成为新进程的前 2 个参数，用户提供参数依次后排

都一一得到了解答，其中 shebang 长度 128 这个限制，和整个 execve 预读长度 ()BINPRM_BUF_SIZE) 息息相关，也与 binfmt_misc 规定的格式相关，看起来不好随便突破。

另外通过通读源码，得到了以下额外的知识：

解释器文件可嵌套，且没有深度限制
解释器第一行中的命令名不能包含空白字符
命令行参数在内存空间是倒排的

最后对于 shebang 支持多个 arguments 这一点，目前看只要修改 binfmt_scrpts，应该是可以实现的，这个课题就留给感兴趣的读者作为作业吧，哈哈~

参考

[1]. linux下使用binfmt_misc设定不同二进制的打开程序

[2]. Linux中的binfmt-misc原理分析

[3]. binfmt.d 中文手册

[4]. Linux 的 binfmt_misc (binfmt) module 介紹

[5]. Linux系统的可执行文件格式详细解析

[6]. Kernel Support for miscellaneous Binary Formats (binfmt_misc)