An In-Depth Analysis of Sockets and System Calls

Posted myguaiguai


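As one would expect, when an application calls the socket() interface to request a service from the operating system, it must issue a system call. The kernel uses the system call number passed in to decide which routine to run; if the number is the one assigned to socket, it runs the corresponding service routine. Inside that routine, the kernel dispatches again according to the specific service requested. When the handler finishes, control returns layer by layer: from the service routine back to the instruction that entered the kernel (int 0x80 on 32-bit x86), and then back to the user program, at which point socket() has completed.
This article looks at three questions:
1. How does an application issue a system call, in other words, how does it enter kernel mode?
2. How do the service routines call one another, and how does the kernel reach the routine we need?
3. What does the socket subsystem do at initialization time so that our call can succeed?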

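How the application calls socket

We reuse the hello/hi chat program written earlier and debug the client to see how socket() is executed.

Preparation:

To step into libc during debugging, a few pieces need to be installed first.
1. Install the glibc debug symbols:
sudo apt-get install libc6-dbg
2. Debugging libc requires the matching source files. Because glibc is open source, we can download the source so that gdb can show where execution is:
sudo apt-get source libc6-dev
Note the path where the source is downloaded; it is needed later in the debugging session.
3. Source file: client.c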


#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#define MAX_len 1024
int sock_fd;
struct sockaddr_in add;
int main()
{
        int ret;
        char buf[MAX_len]={0};
        char buf_rec[MAX_len]={0};
        char buf_p[5]={"0"};
        memset(&add,0,sizeof(add));
        add.sin_family=AF_INET;
        add.sin_port=htons(8000);
        add.sin_addr.s_addr=inet_addr("127.0.0.1");

        if((sock_fd=socket(PF_INET,SOCK_STREAM,0))<=0)
        {

                perror("socket");
                return 1;
        }
        if((ret=connect(sock_fd,(struct sockaddr*)&add,sizeof(add)))<0)
        {
                perror("connect");
                return 1;
        }
        /* send the initial "0" message; the length must come from buf_p, not the empty buf */
        if((ret=send(sock_fd,(void*)buf_p,strlen(buf_p),0))<0)
        {
                perror("send");
                return 1;
        }
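        while (1)
        {
                /* bound the read to the buffer size and stop on end of input */
                if(scanf("%1023s",buf)!=1)
                        break;
                if((ret=send(sock_fd,(void*)buf,sizeof(buf),0))<0)
                {
                        perror("send");
                        return 1;
                }
                /* leave room for the terminating NUL before printing */
                if((ret=recv(sock_fd,(void*)buf_rec,sizeof(buf_rec)-1,0))<0)
                {
                        perror("recv");
                        return 1;
                }
                printf("%s\n",buf_rec);
        }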
        return 0;
}

Start debugging:

1. Compile the file with debug information:
gcc -g -o client client.c
2. Start gdb on the program: gdb client

(gdb) file client
Load new symbol table from "client"? (y or n) y
Reading symbols from client...done.
(gdb) b 23
Breakpoint 1 at 0x40091d: file client.c, line 23.
(gdb) c
The program is not being run.
(gdb) run 
Starting program: /home/netlab/netlab/systemcall/client 

Breakpoint 1, main () at client.c:23
23          if((sock_fd=socket(PF_INET,SOCK_STREAM,0))<=0)

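The breakpoint is set at line 23, the line where socket() is first called. Running the program stops it there; we then use the step command to enter socket() and see how it is implemented.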

(gdb) s
socket () at ../sysdeps/unix/syscall-template.S:84
84  ../sysdeps/unix/syscall-template.S: No such file or directory.

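gdb reports that the source file it wants to show does not exist, because the installed libc ships without source code; this is why we downloaded the source during preparation. Following the path in the message, we use the command directory glibc-2.23/sysdeps/unix/ to add the downloaded glibc source to gdb's search path, and step again: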

(gdb) directory glibc-2.23/sysdeps/unix/
Source directories searched: /home/netlab/netlab/systemcall/glibc-2.23/sysdeps/unix:$cdir:$cwd
(gdb) s
socket () at ../sysdeps/unix/syscall-template.S:85
85      ret
(gdb) l
80  
81  /* This is a "normal" system call stub: if there is an error,
82     it returns -1 and sets errno.  */
83  
84  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
85      ret
86  T_PSEUDO_END (SYSCALL_SYMBOL)
87  
88  #endif
89  
(gdb) 

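Execution enters syscall-template.S and returns immediately. Both lines are macros, so little can be read from them directly. As the file name suggests, this is the template from which system call stubs are generated, and it defines the shape of an ordinary system call.
So stepping through the system call itself with gdb is not practical here, and the situation is the same for 32-bit and 64-bit builds. We therefore leave the debugger and read the glibc source instead.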

The glibc implementation of socket:

First, a set of macro definitions ties the internal name __socket to the public name socket:

#define __socket socket
#define __recvmsg recvmsg
#define __bind bind
#define __sendto sendto

The library then implements __socket():

int __socket (int fd, int type, int domain)
{
    #ifdef __ASSUME_SOCKET_SYSCALL
      return INLINE_SYSCALL (socket, 3, fd, type, domain);
    #else
      return SOCKETCALL (socket, fd, type, domain);
    #endif
}
libc_hidden_def (__socket)
weak_alias (__socket, socket)

Inside __socket(), either SOCKETCALL or INLINE_SYSCALL is invoked; both eventually expand to INLINE_SYSCALL. INLINE_SYSCALL is architecture-specific; for x86_64 it is implemented as follows:

# define INLINE_SYSCALL(name, nr, args...) \
  ({                                                                    \
    unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args);  \
    if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, )))      \
      {                                                                 \
        __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));             \
        resultvar = (unsigned long int) -1;                             \
      }                                                                 \
    (long int) resultvar; })
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...) \
    internal_syscall##nr (SYS_ify (name), err, args)

Depending on the number of arguments, this in turn expands to:

#define internal_syscall3(number, err, arg1, arg2, arg3)        \
({                                                              \
    unsigned long int resultvar;                                \
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);                      \
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);                      \
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);                      \
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;           \
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;           \
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;           \
    asm volatile (                                              \
    "syscall\n\t"                                               \
    : "=a" (resultvar)                                          \
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3)             \
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);                \
    (long int) resultvar;                                       \
})

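Inline assembly is used here: the three arguments are placed in rdi, rsi and rdx, and the system call number is placed in rax (the "0" constraint ties it to the same register as the result). On x86_64 the syscall instruction then traps into the kernel, whereas 32-bit code traditionally uses the int 0x80 software interrupt. The kernel responds and enters its system call handler; this is where the application side ends.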
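The same request can be issued from user space without the generated stub, which makes the role of the system call number visible. The following is a minimal sketch, assuming Linux with glibc and a platform whose headers define SYS_socket; demo_raw_socket is a hypothetical helper name, not part of glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__)
/* On x86_64 the socket system call is number 41. */
static_assert(SYS_socket == 41, "unexpected socket syscall number");
#endif

/* Hypothetical helper: create a socket without calling glibc's socket() stub. */
int demo_raw_socket(int domain, int type, int protocol)
{
    /* syscall() places the number and the arguments in registers and traps
     * into the kernel; on failure it returns -1 and sets errno, like the stub. */
    return (int)syscall(SYS_socket, domain, type, protocol);
}
```

Calling demo_raw_socket(AF_INET, SOCK_STREAM, 0) should behave like socket(PF_INET, SOCK_STREAM, 0) in the client above; the generic syscall() wrapper is used here only to keep the sketch portable, whereas glibc emits the instruction inline.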

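How the kernel responds:

To see how the kernel handles the application's socket request, we debug the kernel with qemu + gdb and watch what happens when the request arrives.
1. Run menuos under the debugger; make sure the client program has been added to it so that it can be run during debugging.
[Figure 1]
2. Choose where to put the breakpoint. To observe the kernel's response to socket, the breakpoint should clearly sit somewhere on the call path of the handler. It should not be placed at the common entry point of all system calls, because then it would be hard to isolate the call we care about. The best place is the entry of the socket system call handler, which only socket requests can reach, so we can analyze it directly. How do we find that entry?
The tables in the kernel's arch/x86/entry/syscalls directory describe every system call entry for x86. For backward compatibility there are separate 32-bit and 64-bit tables:
32-bit: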

99  i386    statfs          sys_statfs          __ia32_compat_sys_statfs
100 i386    fstatfs         sys_fstatfs         __ia32_compat_sys_fstatfs
101 i386    ioperm          sys_ioperm          __ia32_sys_ioperm
102 i386    socketcall      sys_socketcall          __ia32_compat_sys_socketcall
103 i386    syslog          sys_syslog          __ia32_sys_syslog
104 i386    setitimer       sys_setitimer           __ia32_compat_sys_setitimer
105 i386    getitimer       sys_getitimer           __ia32_compat_sys_getitimer

64-bit:

......
40  common  sendfile        __x64_sys_sendfile64
41  common  socket          __x64_sys_socket
42  common  connect         __x64_sys_connect
43  common  accept          __x64_sys_accept
44  common  sendto          __x64_sys_sendto

......

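A 32-bit program should be traced at the 32-bit system call entry, and a 64-bit program at the 64-bit entry.
We start with the 32-bit case: set a breakpoint at __ia32_compat_sys_socketcall, boot menuos and run the client program; gdb stops on entry to __ia32_compat_sys_socketcall: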

(gdb) b __ia32_compat_sys_socketcall
Breakpoint 3 at 0xffffffff818474b0: file net/compat.c, line 718.
(gdb) c
Continuing.

Breakpoint 3, __ia32_compat_sys_socketcall (regs=0xffffc900001eff58)
    at net/compat.c:718
718 COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)

[Figure 2]
Here is the body of this function:

718 COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
719 {
720     u32 a[AUDITSC_ARGS];
721     unsigned int len;
722     u32 a0, a1;
(gdb) 
723     int ret;
724 
725     if (call < SYS_SOCKET || call > SYS_SENDMMSG)
726         return -EINVAL;
727     len = nas[call];
728     if (len > sizeof(a))
729         return -EINVAL;
730 
731     if (copy_from_user(a, args, len))
732         return -EFAULT;
733 
734     ret = audit_socketcall_compat(len / sizeof(a[0]), a);
735     if (ret)
736         return ret;
737 
738     a0 = a[0];
739     a1 = a[1];
740 
741     switch (call) {
742     case SYS_SOCKET:
(gdb) 
743         ret = __sys_socket(a0, a1, a[2]);
744         break;
745     case SYS_BIND:
746         ret = __sys_bind(a0, compat_ptr(a1), a[2]);
747         break;
748     case SYS_CONNECT:
749         ret = __sys_connect(a0, compat_ptr(a1), a[2]);
750         break;
751     case SYS_LISTEN:
752         ret = __sys_listen(a0, a1);

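Clearly this handler is the common entry point for the whole family of socket operations. It first copies the system call arguments in from user space and then, depending on the kind of service requested, jumps to a different handler, which is how dispatch is implemented. The next breakpoint, on __sys_socket, follows the SYS_SOCKET case: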
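The validate-then-switch shape of this dispatcher can be modelled in user space. This is only a sketch under stated assumptions: the demo_ names are hypothetical, the call numbers are copied by hand from the values used by the socketcall interface, and the bind and connect cases are left out because they take a user pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

/* Call numbers used by the socketcall multiplexer (same values as <linux/net.h>). */
enum { DEMO_SYS_SOCKET = 1, DEMO_SYS_BIND = 2, DEMO_SYS_CONNECT = 3, DEMO_SYS_LISTEN = 4 };

/* Argument count per call; index 0 is unused. */
static const unsigned char demo_nargs[] = { 0, 3, 3, 3, 2 };
static_assert(sizeof demo_nargs / sizeof demo_nargs[0] == DEMO_SYS_LISTEN + 1,
              "one entry per call number");

/* Convert the libc convention (-1 plus errno) to the kernel's -errno convention. */
static long demo_ret(int r)
{
    return r < 0 ? -(long)errno : r;
}

/* Hypothetical model of the dispatcher: range check, argument check, switch. */
long demo_socketcall(int call, const unsigned long *args, size_t nargs)
{
    if (call < DEMO_SYS_SOCKET || call > DEMO_SYS_LISTEN)
        return -EINVAL;
    if (args == NULL || nargs < demo_nargs[call])
        return -EINVAL;

    switch (call) {
    case DEMO_SYS_SOCKET:
        return demo_ret(socket((int)args[0], (int)args[1], (int)args[2]));
    case DEMO_SYS_LISTEN:
        return demo_ret(listen((int)args[0], (int)args[1]));
    default:
        /* bind and connect take a user pointer; omitted from this sketch. */
        return -ENOSYS;
    }
}
```

The argument-count table plays the role of nas[] in the kernel listing: the dispatcher learns how many words to fetch before it knows anything else about the call.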

(gdb) b __sys_socket
Breakpoint 4 at 0xffffffff817eea40: file net/socket.c, line 1498.
(gdb) c
Continuing.

Breakpoint 4, __sys_socket (family=2, type=1, protocol=0) at net/socket.c:1498
1498    {
(gdb) l
1493        return __sock_create(net, family, type, protocol, res, 1);
1494    }
1495    EXPORT_SYMBOL(sock_create_kern);
1496    
1497    int __sys_socket(int family, int type, int protocol)
1498    {
1499        int retval;
1500        struct socket *sock;
1501        int flags;
1502    
(gdb) 
1503        /* Check the SOCK_* constants for consistency.  */
1504        BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
1505        BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
1506        BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
1507        BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
1508    
1509        flags = type & ~SOCK_TYPE_MASK;
1510        if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
1511            return -EINVAL;
1512        type &= SOCK_TYPE_MASK;
(gdb) 
1513    
1514        if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
1515            flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
1516    
1517        retval = sock_create(family, type, protocol, &sock);
1518        if (retval < 0)
1519            return retval;
1520    
1521        return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
1522    }

__sys_socket() only validates its arguments and separates the flag bits from the socket type before handing over to sock_create().
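The flag handling at lines 1509 to 1512 of the listing can be reproduced in a few lines. This is a sketch, assuming Linux headers that define SOCK_CLOEXEC and SOCK_NONBLOCK; DEMO_SOCK_TYPE_MASK stands in for the kernel-internal SOCK_TYPE_MASK, whose value 0xf is an assumption here, and demo_split_type is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>

/* Assumption: stands in for the kernel-internal SOCK_TYPE_MASK. */
#define DEMO_SOCK_TYPE_MASK 0xf

/* Same idea as the BUILD_BUG_ON checks: flags must not overlap the type bits. */
static_assert((SOCK_CLOEXEC & DEMO_SOCK_TYPE_MASK) == 0, "flag overlaps type bits");
static_assert((SOCK_NONBLOCK & DEMO_SOCK_TYPE_MASK) == 0, "flag overlaps type bits");

struct demo_split {
    int type;   /* base socket type, e.g. SOCK_STREAM */
    int flags;  /* everything outside the type mask */
    int valid;  /* nonzero if only known flags are set */
};

/* Hypothetical helper mirroring the masking done in __sys_socket(). */
struct demo_split demo_split_type(int type)
{
    struct demo_split out;

    out.flags = type & ~DEMO_SOCK_TYPE_MASK;
    out.valid = (out.flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK)) == 0;
    out.type = type & DEMO_SOCK_TYPE_MASK;
    return out;
}
```

For the client in this article the type argument is plain SOCK_STREAM, so the flags part is empty and the type passes through unchanged.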

(gdb) b __sock_create
Breakpoint 7 at 0xffffffff817ec9a0: file net/socket.c, line 1363.
(gdb) c
Continuing.

Breakpoint 5, __sys_socket (family=2, type=1, protocol=0) at net/socket.c:1517
1517        retval = sock_create(family, type, protocol, &sock);
(gdb) c
Continuing.

Breakpoint 7, __sock_create (net=0xffffffff824e94c0 <init_net>, family=2, type=1, 
    protocol=0, res=0xffffc90000047e98, kern=0) at net/socket.c:1363
1363        if (family < 0 || family >= NPROTO)
(gdb) l
1358        const struct net_proto_family *pf;
1359    
1360        /*
1361         *      Check protocol is in range
1362         */
1363        if (family < 0 || family >= NPROTO)
1364            return -EAFNOSUPPORT;
1365        if (type < 0 || type >= SOCK_MAX)
1366            return -EINVAL;
1367    
(gdb) l
1368        /* Compatibility.
1369    
1370           This uglymoron is moved from INET layer to here to avoid
1371           deadlock in module load.
1372         */
1373        if (family == PF_INET && type == SOCK_PACKET) {
1374            pr_info_once("%s uses obsolete (PF_INET,SOCK_PACKET)\n",
1375                     current->comm);
1376            family = PF_PACKET;
1377        }
(gdb) l
1378    
1379        err = security_socket_create(family, type, protocol, kern);
1380        if (err)
1381            return err;
1382    
1383        /*
1384         *  Allocate the socket and allow the family to set things up. if
1385         *  the protocol is 0, the family is instructed to select an appropriate
1386         *  default.
1387         */
(gdb) l
1388        sock = sock_alloc();
1389        if (!sock) {
1390            net_warn_ratelimited("socket: no more sockets\n");
1391            return -ENFILE; /* Not exactly a match, but its the
1392                       closest posix thing */
1393        }
1394    
1395        sock->type = type;
1396    
1397    #ifdef CONFIG_MODULES
(gdb) l
1398        /* Attempt to load a protocol module if the find failed.
1399         *
1400         * 12/09/1996 Marcin: But! this makes REALLY only sense, if the user
1401         * requested real, full-featured networking support upon configuration.
1402         * Otherwise module support will break!
1403         */
1404        if (rcu_access_pointer(net_families[family]) == NULL)
1405            request_module("net-pf-%d", family);
1406    #endif
1407    
(gdb) l
1408        rcu_read_lock();
1409        pf = rcu_dereference(net_families[family]);
1410        err = -EAFNOSUPPORT;
1411        if (!pf)
1412            goto out_release;
1413    
1414        /*
1415         * We will call the ->create function, that possibly is in a loadable
1416         * module, so we have to bump that loadable module refcnt first.
1417         */
(gdb) l
1418        if (!try_module_get(pf->owner))
1419            goto out_release;
1420    
1421        /* Now protected by module ref count */
1422        rcu_read_unlock();
1423    
1424        err = pf->create(net, sock, protocol, kern);
1425        if (err < 0)
1426            goto out_module_put;
1427    
(gdb) l
1428        /*
1429         * Now to bump the refcnt of the [loadable] module that owns this
1430         * socket at sock_release time we decrement its refcnt.
1431         */
1432        if (!try_module_get(sock->ops->owner))
1433            goto out_module_busy;
1434    
1435        /*
1436         * Now that we're done with the ->create function, the [loadable]
1437         * module can have its refcnt decremented
(gdb) l
1438         */
1439        module_put(pf->owner);
1440        err = security_socket_post_create(sock, family, type, protocol, kern);
1441        if (err)
1442            goto out_sock_release;
1443        *res = sock;
1444    
1445        return 0;
1446    
1447    out_module_busy:
(gdb) l
1448        err = -EAFNOSUPPORT;
1449    out_module_put:
1450        sock->ops = NULL;
1451        module_put(pf->owner);
1452    out_sock_release:
1453        sock_release(sock);
1454        return err;
1455    
1456    out_release:
1457        rcu_read_unlock();
(gdb) l
1458        goto out_sock_release;
1459    }
1460    EXPORT_SYMBOL(__sock_create);
1461    
1462    /**
1463     *  sock_create - creates a socket
1464     *  @family: protocol family (AF_INET, ...)
1465     *  @type: communication type (SOCK_STREAM, ...)
1466     *  @protocol: protocol (0, ...)
1467     *  @res: new socket
(gdb) l
1468     *
1469     *  A wrapper around __sock_create().
1470     *  Returns 0 or an error. This function internally uses GFP_KERNEL.
1471     */
1472    
1473    int sock_create(int family, int type, int protocol, struct socket **res)
1474    {
1475        return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
1476    }

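A breakpoint set on sock_create was never hit; the kernel source shows that the breakpoint has to go on __sock_create instead, presumably because sock_create is a thin wrapper that the compiler inlined. Continuing with __sock_create:
err = security_socket_create(family, type, protocol, kern); first checks whether the request is permitted, and then the key function is called:
sock = sock_alloc();
To understand this line we need to know that sock is a variable of type struct socket. This structure is the core of a socket, and it is defined as follows: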

struct socket {
    socket_state        state;

    short           type;

    unsigned long       flags;

    struct socket_wq    *wq;

    struct file     *file;
    struct sock     *sk;
    const struct proto_ops  *ops;
};

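state is the current state of the socket, indicating for example whether it is connected. type is the kind of service, such as SOCK_STREAM for TCP. flags holds flag bits such as SOCK_ASYNC_NOSPACE. wq is the wait queue, since several requests may be waiting on one socket. file points to the associated file: a socket can also be treated as a file, so this pointer is kept for compatibility with file operations. sk is very important and also very large; it records the protocol-specific state. Keeping that state behind a pointer makes struct socket itself protocol independent and therefore general. ops points to the basic protocol-related operations of the socket. This is a common Linux idiom: the operations of an object are collected in a structure of function pointers, here struct proto_ops: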

struct proto_ops {
    int     family;
    struct module   *owner;
    int     (*release)   (struct socket *sock);
    int     (*bind)      (struct socket *sock,
                      struct sockaddr *myaddr,
                      int sockaddr_len);
    int     (*connect)   (struct socket *sock,
                      struct sockaddr *vaddr,
                      int sockaddr_len, int flags);
    int     (*socketpair)(struct socket *sock1,
                      struct socket *sock2);
    int     (*accept)    (struct socket *sock,
                      struct socket *newsock, int flags, bool kern);
    int     (*getname)   (struct socket *sock,
                      struct sockaddr *addr,
                      int peer);
    __poll_t    (*poll)      (struct file *file, struct socket *sock,
                      struct poll_table_struct *wait);
    int     (*ioctl)     (struct socket *sock, unsigned int cmd,
                      unsigned long arg);
#ifdef CONFIG_COMPAT
    int     (*compat_ioctl) (struct socket *sock, unsigned int cmd,
                      unsigned long arg);
#endif
    int     (*listen)    (struct socket *sock, int len);
    int     (*shutdown)  (struct socket *sock, int flags);
    int     (*setsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, unsigned int optlen);
    int     (*getsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, int __user *optlen);
#ifdef CONFIG_COMPAT
    int     (*compat_setsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, unsigned int optlen);
    int     (*compat_getsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, int __user *optlen);
#endif
    int     (*sendmsg)   (struct socket *sock, struct msghdr *m,
                      size_t total_len);
    /* Notes for implementing recvmsg:
     * ===============================
     * msg->msg_namelen should get updated by the recvmsg handlers
     * iff msg_name != NULL. It is by default 0 to prevent
     * returning uninitialized memory to user space.  The recvfrom
     * handlers can assume that msg.msg_name is either NULL or has
     * a minimum size of sizeof(struct sockaddr_storage).
     */
    int     (*recvmsg)   (struct socket *sock, struct msghdr *m,
                      size_t total_len, int flags);
    int     (*mmap)      (struct file *file, struct socket *sock,
                      struct vm_area_struct * vma);
    ssize_t     (*sendpage)  (struct socket *sock, struct page *page,
                      int offset, size_t size, int flags);
    ssize_t     (*splice_read)(struct socket *sock,  loff_t *ppos,
                       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
    int     (*set_peek_off)(struct sock *sk, int val);
    int     (*peek_len)(struct socket *sock);

    /* The following functions are called internally by kernel with
     * sock lock already held.
     */
    int     (*read_sock)(struct sock *sk, read_descriptor_t *desc,
                     sk_read_actor_t recv_actor);
    int     (*sendpage_locked)(struct sock *sk, struct page *page,
                       int offset, size_t size, int flags);
    int     (*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
                      size_t size);
    int     (*set_rcvlowat)(struct sock *sk, int val);
};
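The idiom of collecting an object's operations in a structure of function pointers can be shown with a much smaller table. This is an illustrative sketch only: every demo_ type and function is hypothetical and far simpler than struct proto_ops.

```c
#include <assert.h>
#include <stddef.h>

struct demo_socket;

/* A reduced operations table: one function pointer per supported action. */
struct demo_proto_ops {
    const char *name;
    int (*listen)(struct demo_socket *sock, int backlog);
    int (*shutdown)(struct demo_socket *sock, int how);
};

struct demo_socket {
    int state;                          /* 0 = idle, 1 = listening */
    const struct demo_proto_ops *ops;   /* chosen when the socket is created */
};

static int demo_stream_listen(struct demo_socket *sock, int backlog)
{
    if (backlog <= 0)
        return -1;
    sock->state = 1;
    return 0;
}

static int demo_stream_shutdown(struct demo_socket *sock, int how)
{
    (void)how;
    sock->state = 0;
    return 0;
}

const struct demo_proto_ops demo_stream_ops = {
    .name = "stream",
    .listen = demo_stream_listen,
    .shutdown = demo_stream_shutdown,
};

/* Generic entry point: it never needs to know which protocol is behind ops. */
int demo_listen(struct demo_socket *sock, int backlog)
{
    assert(sock != NULL && sock->ops != NULL);
    return sock->ops->listen(sock, backlog);
}
```

The generic layer calls through sock->ops, so supporting another protocol means supplying another table rather than changing demo_listen.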

Back in the debugging session: sock_alloc() allocates a socket structure, but how does it do so internally? We set another breakpoint:

(gdb) b sock_alloc
Breakpoint 8 at 0xffffffff817ec230: file net/socket.c, line 569.
(gdb) c
Continuing.

Breakpoint 8, sock_alloc () at net/socket.c:569
569     inode = new_inode_pseudo(sock_mnt->mnt_sb);
(gdb) l
564 struct socket *sock_alloc(void)
565 {
566     struct inode *inode;
567     struct socket *sock;
568 
569     inode = new_inode_pseudo(sock_mnt->mnt_sb);
570     if (!inode)
571         return NULL;
572 
573     sock = SOCKET_I(inode);
(gdb) l
574 
575     inode->i_ino = get_next_ino();
576     inode->i_mode = S_IFSOCK | S_IRWXUGO;
577     inode->i_uid = current_fsuid();
578     inode->i_gid = current_fsgid();
579     inode->i_op = &sockfs_inode_ops;
580 
581     return sock;
582 }

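sock_alloc sets up two structures: an inode, and the socket structure that is returned; it then fills in the fields of the inode. The question now centers on SOCKET_I(). From this code it looks as though SOCKET_I is where the socket comes from, yet passing it an inode is puzzling. Unfortunately SOCKET_I is an inline function, so gdb cannot step into it, and we have to read the source: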

static inline struct socket *SOCKET_I(struct inode *inode)
{
    return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

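container_of() is a classic kernel macro. container_of(ptr, type, member) takes a pointer ptr to the field named member inside a structure of type type and returns a pointer to the enclosing structure, by subtracting the offset of member from ptr. Here the inode is the vfs_inode field of a struct socket_alloc, so container_of recovers the enclosing socket_alloc, and SOCKET_I returns the address of its socket field. In other words, the inode and the socket live side by side in one socket_alloc, which is defined as follows: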

struct socket_alloc {
    struct socket socket;
    struct inode vfs_inode;
};
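The pointer arithmetic behind SOCKET_I can be tried in user space. This is a minimal sketch, assuming simplified stand-in structures; demo_container_of is written with offsetof and omits the type checking that the kernel macro performs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for the kernel structures. */
struct demo_inode  { unsigned long i_ino; };
struct demo_socket { short type; };

struct demo_socket_alloc {
    struct demo_socket socket;
    struct demo_inode  vfs_inode;
};

/* Same shape as SOCKET_I(): given the embedded inode, return its sibling socket. */
struct demo_socket *demo_SOCKET_I(struct demo_inode *inode)
{
    assert(inode != NULL);
    return &demo_container_of(inode, struct demo_socket_alloc, vfs_inode)->socket;
}
```

Because both objects are allocated as one block, no lookup table is needed to get from the inode to the socket; the offset within the structure is enough.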

How is this inode created? sock_alloc calls new_inode_pseudo, which is implemented as follows:

struct inode *new_inode_pseudo(struct super_block *sb)
{
    struct inode *inode = alloc_inode(sb);

    if (inode) {
        spin_lock(&inode->i_lock);
        inode->i_state = 0;
        spin_unlock(&inode->i_lock);
        INIT_LIST_HEAD(&inode->i_sb_list);
    }
    return inode;
}

This in turn calls alloc_inode:

static struct inode *alloc_inode(struct super_block *sb)
{
    struct inode *inode;

    if (sb->s_op->alloc_inode)
        inode = sb->s_op->alloc_inode(sb);
    else
        inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);

    if (!inode)
        return NULL;

    if (unlikely(inode_init_always(sb, inode))) {
        if (inode->i_sb->s_op->destroy_inode)
            inode->i_sb->s_op->destroy_inode(inode);
        else
            kmem_cache_free(inode_cachep, inode);
        return NULL;
    }

    return inode;
}

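Now things are much clearer. The inode is allocated through s_op->alloc_inode of the superblock, which brings in the file system layer. In Linux a superblock represents a file system, and each file system provides methods for creating and destroying its inodes. Evidently sockets are treated as a file system of their own, so the function called here is the socket file system's inode allocation function. That function does not create a bare inode; it creates a socket_alloc structure that contains both a struct inode and a struct socket, and the socket is then initialized and returned. One detail remains: the socket system call returns a socket descriptor (a file descriptor), yet no file descriptor has appeared so far. It comes from the call sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK)); in __sys_socket. As mentioned, a socket is treated as a file, and struct socket contains a struct file pointer; sock_map_fd is the function that binds a file descriptor to the struct socket and its struct file. It is implemented as follows: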

static int sock_map_fd(struct socket *sock, int flags)
{
    struct file *newfile;
    int fd = get_unused_fd_flags(flags);
    if (unlikely(fd < 0)) {
        sock_release(sock);
        return fd;
    }

    newfile = sock_alloc_file(sock, flags, NULL);
    if (likely(!IS_ERR(newfile))) {
        fd_install(fd, newfile);
        return fd;
    }

    put_unused_fd(fd);
    return PTR_ERR(newfile);
}

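Here get_unused_fd_flags(flags) reserves the file descriptor fd that will be returned, sock_alloc_file(sock, flags, NULL) creates the struct file object, and fd_install(fd, newfile) binds the two. The struct file is created as follows: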

struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
{
    struct file *file;

    if (!dname)
        dname = sock->sk ? sock->sk->sk_prot_creator->name : "";

    file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
                O_RDWR | (flags & O_NONBLOCK),
                &socket_file_ops);
    if (IS_ERR(file)) {
        sock_release(sock);
        return file;
    }

    sock->file = file;
    file->private_data = sock;
    return file;
}

This calls alloc_file_pseudo. Note the key structure socket_file_ops: it defines the basic file operations for a socket, so this step binds those operations to the file. It is defined as follows:

static const struct file_operations socket_file_ops = {
    .owner =    THIS_MODULE,
    .llseek =   no_llseek,
    .read_iter =    sock_read_iter,
    .write_iter =   sock_write_iter,
    .poll =     sock_poll,
    .unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = compat_sock_ioctl,
#endif
    .mmap =     sock_mmap,
    .release =  sock_close,
    .fasync =   sock_fasync,
    .sendpage = sock_sendpage,
    .splice_write = generic_splice_sendpage,
    .splice_read =  sock_splice_read,
};
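The effect of binding file operations to a socket is visible from user space: the descriptor returned by the socket calls is an ordinary file descriptor whose inode has the S_IFSOCK mode set in sock_alloc, and read() and write() work on it. The sketch below assumes Linux or another POSIX system; the demo_ helpers are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if fd refers to a socket, 0 for any other file, -1 on error. */
int demo_fd_is_socket(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISSOCK(st.st_mode) ? 1 : 0;
}

/* Creates a connected pair of UNIX-domain sockets; returns 0 on success. */
int demo_make_pair(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
}

/* Pushes one byte through the generic file API; returns 0 if it arrives intact. */
int demo_write_read(int wfd, int rfd)
{
    char in = 0;

    assert(wfd >= 0 && rfd >= 0);
    if (write(wfd, "x", 1) != 1)
        return -1;
    if (read(rfd, &in, 1) != 1)
        return -1;
    return in == 'x' ? 0 : -1;
}
```

demo_write_read works unchanged on a pipe and on a socket pair, which is the point of routing socket I/O through a file_operations table.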

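This completes the first two questions. Because the relationships are intricate, here is a summary of the call chain:
__ia32_compat_sys_socketcall -> __sys_socket -> sock_create -> __sock_create -> sock_alloc -> new_inode_pseudo -> alloc_inode -> sb->s_op->alloc_inode, followed by __sys_socket -> sock_map_fd -> sock_alloc_file -> alloc_file_pseudo.
The heart of this flow is the creation of a socket_alloc structure, which contains both the socket and the inode.
One question remains: socket initialization. In the description of struct socket above, the proto_ops field holds the function pointers for the various services, but we have not yet examined when and how those pointers are assigned. That is the subject of the next part.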

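Socket initialization

Again we use gdb, this time to watch the Linux kernel boot and see in what order and in what way the socket layer is initialized.
Restart qemu, load menuos, and debug the kernel boot with gdb.
First set a breakpoint on start_kernel and look for network initialization code: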

(gdb) target remote: 1234
Remote debugging using : 1234
0x0000000000000000 in fixed_percpu_data ()
(gdb) b start_kernel 
Breakpoint 1 at 0xffffffff82997b05: file init/main.c, line 552.
(gdb) c
Continuing.

Breakpoint 1, start_kernel () at init/main.c:552
warning: Source file is more recent than executable.
552 asmlinkage __visible void __init start_kernel(void)
553 {
554     char *command_line;
555     char *after_dashes;
556 
(gdb) l
557     set_task_stack_end_magic(&init_task);
558     smp_setup_processor_id();
559     debug_objects_early_init();
560 
561     cgroup_init_early();
562 
563     local_irq_disable();
564     early_boot_irqs_disabled = true;
565 
566     /*
(gdb) l
567      * Interrupts are still disabled. Do necessary setups, then
568      * enable them.
569      */
570     boot_cpu_init();
571     page_address_init();
572     pr_notice("%s", linux_banner);
573     setup_arch(&command_line);
574     mm_init_cpumask(&init_mm);
575     setup_command_line(command_line);
576     setup_nr_cpu_ids();
 

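No network initialization code is visible here. However, arch_call_rest_init() appears to perform the remaining initialization beyond what is listed here, so we set a breakpoint on arch_call_rest_init():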

Breakpoint 2, arch_call_rest_init () at init/main.c:548
546 
547 void __init __weak arch_call_rest_init(void)
548 {
549     rest_init();
550 }
551 
552 asmlinkage __visible void __init start_kernel(void)
(gdb) b rest_init
Breakpoint 4, rest_init () at init/main.c:411
411 
(gdb) l
406 
407 noinline void __ref rest_init(void)
408 {
409     struct task_struct *tsk;
410     int pid;
411 
412     rcu_scheduler_starting();
413     /*
414      * We need to spawn init first so that it obtains pid 1, however
415      * the init task will end up wanting to create kthreads, which, if
(gdb) 

arch_call_rest_init() has a single line, a call to rest_init(). Following it gives the full code of rest_init:

    noinline void __ref rest_init(void)
408 {
409     struct task_struct *tsk;
410     int pid;
411 
412     rcu_scheduler_starting();
413     /*
414      * We need to spawn init first so that it obtains pid 1, however
415      * the init task will end up wanting to create kthreads, which, if
(gdb) 
416      * we schedule it before we create kthreadd, will OOPS.
417      */
418     pid = kernel_thread(kernel_init, NULL, CLONE_FS);
419     /*
420      * Pin init on the boot CPU. Task migration is not properly working
421      * until sched_init_smp() has been run. It will set the allowed
422      * CPUs for init to the non isolated CPUs.
423      */
424     rcu_read_lock();
425     tsk = find_task_by_pid_ns(pid, &init_pid_ns);
(gdb) 
426     set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
427     rcu_read_unlock();
428 
429     numa_default_policy();
430     pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
431     rcu_read_lock();
432     kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
433     rcu_read_unlock();
434 
435     /*
(gdb) 
436      * Enable might_sleep() and smp_processor_id() checks.
437      * They cannot be enabled earlier because with CONFIG_PREEMPT=y
438      * kernel_thread() would trigger might_sleep() splats. With
439      * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
440      * already, but it's stuck on the kthreadd_done completion.
441      */
442     system_state = SYSTEM_SCHEDULING;
443 
444     complete(&kthreadd_done);
445 
(gdb) 
446     /*
447      * The boot idle thread must execute schedule()
448      * at least once to get things moving:
449      */
450     schedule_preempt_disabled();
451     /* Call into cpu_idle with preempt disabled */
452     cpu_startup_entry(CPUHP_ONLINE);
453 }

Two threads are created here, kernel_init and kthreadd, and the actual initialization is done by them. We set breakpoints on both functions:

1086    static int __ref kernel_init(void *unused)
1087    {
1088        int ret;
1089    
1090        kernel_init_freeable();
1091        /* need to finish all async __init code before freeing the memory */
(gdb) 
1092        async_synchronize_full();
1093        ftrace_free_init_mem();
1094        free_initmem();
1095        mark_readonly();
1096    
1097        /*
1098         * Kernel mappings are now finalized - update the userspace page-table
1099         * to finalize PTI.
1100         */
1101        pti_finalize();
(gdb) 
1102    
1103        system_state = SYSTEM_RUNNING;
1104        numa_default_policy();
1105    
1106        rcu_end_inkernel_boot();
1107    
1108        if (ramdisk_execute_command) {
1109            ret = run_init_process(ramdisk_execute_command);
1110            if (!ret)
1111                return 0;
(gdb) 
1112            pr_err("Failed to execute %s (error %d)\n",
1113                   ramdisk_execute_command, ret);
1114        }
1115    
1116        /*
1117         * We try each of these until one succeeds.
1118         *
1119         * The Bourne shell can be used instead of init if we are
1120         * trying to recover a really broken machine.
1121         */
(gdb) 
1122        if (execute_command) {
1123            ret = run_init_process(execute_command);
1124            if (!ret)
1125                return 0;
1126            panic("Requested init %s failed (error %d).",
1127                  execute_command, ret);
1128        }
1129        if (!try_to_run_init_process("/sbin/init") ||
1130            !try_to_run_init_process("/etc/init") ||
1131            !try_to_run_init_process("/bin/init") ||
(gdb) 
1132            !try_to_run_init_process("/bin/sh"))
1133            return 0;
1134    
1135        panic("No working init found.  Try passing init= option to kernel. "
1136              "See Linux Documentation/admin-guide/init.rst for guidance.");
1137    }

kernel_init runs first. It decides which init program to execute and eventually jumps to it, but before loading the user-space init it calls kernel_init_freeable to do some further initialization, so we follow kernel_init_freeable().

Apart from do_basic_setup, that function performs no initialization of interest. We set a breakpoint on do_basic_setup, but execution reaches kthreadd first:

568 int kthreadd(void *unused)
569 {
570     struct task_struct *tsk = current;
571 
572     /* Setup a clean context for our children to inherit. */
573     set_task_comm(tsk, "kthreadd");
(gdb) l
574     ignore_signals(tsk);
575     set_cpus_allowed_ptr(tsk, cpu_all_mask);
576     set_mems_allowed(node_states[N_MEMORY]);
577 
578     current->flags |= PF_NOFREEZE;
579     cgroup_init_kthreadd();
580 
581     for (;;) {
582         set_current_state(TASK_INTERRUPTIBLE);
583         if (list_empty(&kthread_create_list))
(gdb) 
584             schedule();
585         __set_current_state(TASK_RUNNING);
586 
587         spin_lock(&kthread_create_lock);
588         while (!list_empty(&kthread_create_list)) {
589             struct kthread_create_info *create;
590 
591             create = list_entry(kthread_create_list.next,
592                         struct kthread_create_info, list);
593             list_del_init(&create->list);
(gdb) 
594             spin_unlock(&kthread_create_lock);
595 
596             create_kthread(create);
597 
598             spin_lock(&kthread_create_lock);
599         }
600         spin_unlock(&kthread_create_lock);
601     }
602 
603     return 0;
(gdb) 
604 }

kthreadd creates threads according to kthread_create_list, which is clearly unrelated to the network initialization we are after, so we go on to do_basic_setup:

static void __init do_basic_setup(void)
{
    cpuset_init_smp();
    shmem_init();
    driver_init();
    init_irq_proc();
    do_ctors();
    usermodehelper_enable();
    do_initcalls();
}

static void __init do_initcalls(void)
{
    int level;

    for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
        do_initcall_level(level);
}

do_initcalls calls do_initcall_level(level) for each level in initcall_levels, so we first need to see what do_initcall_level does:

static void __init do_initcall_level(int level)
{
    initcall_entry_t *fn;

    strcpy(initcall_command_line, saved_command_line);
    parse_args(initcall_level_names[level],
           initcall_command_line, __start___param,
           __stop___param - __start___param,
           level, level,
           NULL, &repair_env_string);

    trace_initcall_level(initcall_level_names[level]);
    for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
        do_one_initcall(initcall_from_entry(fn));
}

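initcall_levels is a table, which makes it possible to run every registered initialization routine. initcall_from_entry returns the function address stored in fn, and do_one_initcall then calls it. How the table gets populated can be seen from the network initialization routine inet_init:

(Much unrelated code has been omitted.)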

static int __init inet_init(void)
{
    struct inet_protosw *q;
    struct list_head *r;
    int rc = -EINVAL;

    sock_skb_cb_check_size(sizeof(struct inet_skb_parm));

    rc = proto_register(&tcp_prot, 1);
    if (rc)
        goto out;

    rc = proto_register(&udp_prot, 1);
    if (rc)
        goto out_unregister_tcp_proto;

    rc = proto_register(&raw_prot, 1);
    if (rc)
        goto out_unregister_udp_proto;

    rc = proto_register(&ping_prot, 1);
    if (rc)
        goto out_unregister_raw_proto;

    (void)sock_register(&inet_family_ops);
    ip_static_sysctl_init();


    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);

    /* Register the socket-side information for inet_create. */
    for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
        INIT_LIST_HEAD(r);

    for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
        inet_register_protosw(q);

    arp_init();

    ip_init();
    tcp_init();
    udp_init();
    udplite4_register();
    raw_init();
    ping_init();


    ipv4_proc_init();

    ipfrag_init();

    dev_add_pack(&ip_packet_type);

    ip_tunnel_core_init();

    rc = 0;
out:
    return rc;
out_unregister_raw_proto:
    proto_unregister(&raw_prot);
out_unregister_udp_proto:
    proto_unregister(&udp_prot);
out_unregister_tcp_proto:
    proto_unregister(&tcp_prot);
    goto out;
}

fs_initcall(inet_init);

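So fs_initcall(inet_init) registers inet_init in one of the initcall_levels, and it is eventually run from there. The simplest way to verify this is to reboot, set a breakpoint on inet_init, and see whether the function is called: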

(gdb) b inet_init
Breakpoint 1 at 0xffffffff829f49fe: file net/ipv4/af_inet.c, line 1906.
(gdb) c
Continuing.

Breakpoint 1, inet_init () at net/ipv4/af_inet.c:1906
1906    {
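The level-by-level walk performed by do_initcalls and do_initcall_level can be modelled with two small arrays. This is a sketch under stated assumptions: the routines and their level assignments are hypothetical, and the table is written out by hand here, whereas in the kernel it is assembled by the linker from the *_initcall macros.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_initcall_t)(void);

/* Records the order in which the hypothetical init routines ran. */
static int demo_log[8];
static size_t demo_ran;

static int demo_record(int id)
{
    assert(demo_ran < sizeof demo_log / sizeof demo_log[0]);
    demo_log[demo_ran++] = id;
    return 0;
}

static int demo_core_a(void) { return demo_record(10); }
static int demo_fs_a(void)   { return demo_record(50); }
static int demo_fs_b(void)   { return demo_record(51); }
static int demo_late_a(void) { return demo_record(70); }

/* All routines in one flat array, already sorted by level. */
static demo_initcall_t demo_calls[] = {
    demo_core_a, demo_fs_a, demo_fs_b, demo_late_a,
};

/* Start of each level inside demo_calls, plus one end marker. */
static demo_initcall_t *demo_levels[] = {
    demo_calls + 0, demo_calls + 1, demo_calls + 3, demo_calls + 4,
};

/* Run every routine between this level's start and the next level's start. */
static void demo_do_initcall_level(size_t level)
{
    for (demo_initcall_t *fn = demo_levels[level]; fn < demo_levels[level + 1]; fn++)
        (*fn)();
}

void demo_do_initcalls(void)
{
    size_t n = sizeof demo_levels / sizeof demo_levels[0];

    for (size_t level = 0; level + 1 < n; level++)
        demo_do_initcall_level(level);
}
```

The last entry of demo_levels only marks the end of the final level, which is why the outer loop stops one short of the array size, just as the kernel loop uses ARRAY_SIZE(initcall_levels) - 1.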

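Looking more closely at inet_init: it covers most of the core IPv4 protocols, including TCP, UDP and ICMP. The flow is to register the protocol descriptors with proto_register, register the address family with sock_register, add the protocol handlers with inet_add_protocol, and finally call each protocol's init function. The trace ends here, but we have not seen where basic socket file system operations such as alloc_inode are set up. They are not part of the initialization flow; they are filled in statically where the socket superblock operations are defined: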

static const struct super_operations sockfs_ops = {
    .alloc_inode    = sock_alloc_inode,
    .destroy_inode  = sock_destroy_inode,
    .statfs     = simple_statfs,
};

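Although the trace ends here, the third question is still open: we did not find where the ops field of struct socket (of type struct proto_ops), the protocol-specific table of function pointers, is initialized. Hopefully further study will locate it.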
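One place worth checking, going by the code already shown, is the pf->create(net, sock, protocol, kern) call in __sock_create together with the inet_register_protosw loop in inet_init. In mainline kernels the AF_INET create handler, inet_create in net/ipv4/af_inet.c, looks up the entry registered for the requested type and protocol and copies that entry's ops pointer into sock->ops; this should be confirmed against the kernel version being debugged. The following is only a toy user-space model of such a register-then-look-up scheme, and every demo_ name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ops { const char *name; };

struct demo_socket {
    int type;
    const struct demo_ops *ops;   /* unset until demo_create() runs */
};

/* One registered (type, protocol) combination and the ops table it implies. */
struct demo_protosw {
    int type;
    int protocol;
    const struct demo_ops *ops;
};

const struct demo_ops demo_stream_ops = { "stream" };
const struct demo_ops demo_dgram_ops  = { "dgram" };

#define DEMO_MAX_PROTOSW 8
static struct demo_protosw demo_table[DEMO_MAX_PROTOSW];
static size_t demo_registered;

/* Called at "boot" time, once per supported combination. */
int demo_register_protosw(int type, int protocol, const struct demo_ops *ops)
{
    assert(ops != NULL);
    if (demo_registered == DEMO_MAX_PROTOSW)
        return -1;
    demo_table[demo_registered++] = (struct demo_protosw){ type, protocol, ops };
    return 0;
}

/* Called at socket creation: protocol 0 means "use the default for this type". */
int demo_create(struct demo_socket *sock, int protocol)
{
    for (size_t i = 0; i < demo_registered; i++) {
        const struct demo_protosw *e = &demo_table[i];

        if (e->type != sock->type)
            continue;
        if (protocol != 0 && protocol != e->protocol)
            continue;
        sock->ops = e->ops;   /* this is where the function table gets attached */
        return 0;
    }
    return -1;
}
```

In this model the ops pointer is not set during boot at all; boot only fills the registry, and each socket picks up its table when it is created.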
