The epoll Implementation in Linux

Posted by 笨拙的菜鸟




/*
 * fs/eventpoll.c (Efficient event retrieval implementation)
 * Copyright (C) 2001,...,2009 Davide Libenzi
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * Davide Libenzi <[email protected]>
 *
 */
/*
 * Before diving into the epoll implementation, three pieces of kernel background:
 * 1. Wait queues (waitqueue)
 *    In short: the queue head (wait_queue_head_t) usually belongs to the resource
 *    producer, and the queue entries (wait_queue_t) to the resource consumers.
 *    When the head's resource becomes ready, the callback registered by each entry
 *    is invoked in turn to notify it. That is roughly all a wait queue is.
 *    (A short illustrative sketch follows this comment block.)
 * 2. The kernel poll mechanism
 *    An fd that is to be polled must support the kernel's poll interface: if the
 *    fd is, say, a character device or a socket, it must implement the poll
 *    operation in its file_operations and own a wait queue head.
 *    A process that actively polls the fd allocates a wait queue entry, adds it to
 *    the fd's wait queue, and specifies the callback to run when the resource
 *    becomes ready.
 *    Take a socket as an example: it must implement a poll operation, which the
 *    code initiating the poll calls explicitly; that poll operation must call
 *    poll_wait(), which adds the caller, as a wait queue entry, to the socket's
 *    wait queue. This way, when the socket's state changes, every interested
 *    process can be notified through the queue head.
 *    This point must be understood clearly, otherwise it is impossible to see how
 *    epoll learns that an fd's state has changed.
 * 3. An epollfd is itself an fd, so it can itself be epolled; you may wonder
 *    whether epoll can be nested indefinitely this way...
 *
 * epoll is essentially built from points 1 and 2 above.
 * So epoll does not introduce anything particularly complicated or deep into the
 * kernel; it merely recombines existing facilities to outperform select.
 */
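To make point 1 concrete, here is a minimal sketch of that producer/consumer wait-queue pattern, as an ordinary driver might use it. This is not code from eventpoll.c; all the names (demo_wq, demo_ready, demo_consumer, demo_producer) are made up for illustration.

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wq); /* queue head, owned by the "producer" */
static int demo_ready;                   /* the condition consumers wait on */

/* Consumer: sleep until demo_ready becomes true (woken by the producer). */
static int demo_consumer(void)
{
    if (wait_event_interruptible(demo_wq, demo_ready))
        return -ERESTARTSYS; /* interrupted by a signal */
    return 0;
}

/* Producer: mark the resource ready and walk the queue, invoking the callback
 * each waiting entry registered (the default callback simply wakes the task). */
static void demo_producer(void)
{
    demo_ready = 1;
    wake_up_interruptible(&demo_wq);
}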
/*
 * Other relevant kernel background:
 * 1. An fd is a file descriptor; in kernel space its counterpart is a struct file,
 *    which can be viewed as the kernel-side file descriptor.
 * 2. spinlock: a spin lock, which must be used very carefully, especially via
 *    spin_lock_irqsave(): interrupts are disabled, no scheduling can happen, and
 *    other CPUs cannot touch the protected resource either. It is a very heavy
 *    hammer, so it should only protect very lightweight operations.
 * 3. Reference counting is a crucial concept in the kernel. Release/free functions
 *    in kernel code often take almost no locks, because they run only once an
 *    object's reference count has dropped to zero: if no process is using the
 *    object, there is nothing to lock against.
 *    struct file is reference counted.
 */
/* --- epoll-related data structures --- */
/*
 * This structure is stored inside the "private_data" member of the file
 * structure and rapresent the main data sructure for the eventpoll
 * interface.
 */
/* For every epollfd created, the kernel allocates one eventpoll; it can be seen
 * as the kernel-side epollfd. */
struct eventpoll {
    /* Protect the this structure access */
    spinlock_t lock;
    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    /* This mutex is held when adding, modifying or removing a monitored fd, and
     * when epoll_wait returns and copies data to user space, so user space may
     * safely call the epoll APIs from multiple threads at once; the kernel already
     * protects itself. */
    struct mutex mtx;
    /* Wait queue used by sys_epoll_wait() */
    /* When we call epoll_wait(), this is the wait queue we "sleep" on... */
    wait_queue_head_t wq;
    /* Wait queue used by file->poll() */
    /* This one is used when the epollfd itself is being polled... */
    wait_queue_head_t poll_wait;
    /* List of ready file descriptors */
    /* Every epitem that is already ready sits on this list */
    struct list_head rdllist;
    /* RB tree root used to store monitored fd structs */
    /* Every epitem being monitored lives here */
    struct rb_root rbr;
    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transfering ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;
    /* The user that created the eventpoll descriptor */
    /* Per-user bookkeeping, e.g. the maximum number of fds this user may watch */
    struct user_struct *user;
};
/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 */
/* An epitem represents one monitored fd */
struct epitem {
    /* RB tree node used to link this structure to the eventpoll RB tree */
    /* rb_node: when epoll_ctl() adds fds to an epollfd, the kernel allocates one
     * epitem per fd, and the epitems are organized as an rb-tree whose root is
     * stored in the epollfd, i.e. in struct eventpoll.
     * I believe the rb-tree is used here to speed up lookup, insertion and
     * deletion; all three are O(log N) on an rb-tree. */
    struct rb_node rbn;
    /* List header used to link this structure to the eventpoll ready list */
    /* List node: every epitem that becomes ready is linked onto the eventpoll's
     * rdllist */
    struct list_head rdllink;
    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    /* Explained later in the code... */
    struct epitem *next;
    /* The file descriptor information this item refers to */
    /* The fd and struct file this epitem corresponds to */
    struct epoll_filefd ffd;
    /* Number of active wait queue attached to poll operations */
    int nwait;
    /* List containing poll wait queues */
    struct list_head pwqlist;
    /* The "container" of this item */
    /* Which eventpoll this epitem belongs to */
    struct eventpoll *ep;
    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;
    /* The structure that describe the interested events and the source fd */
    /* Which events this epitem cares about; this data comes from user space via
     * epoll_ctl */
    struct epoll_event event;
};
struct epoll_filefd {
    struct file *file;
    int fd;
};
/* Wait structure used by the poll hooks */
struct eppoll_entry {
    /* List header used to link this structure to the "struct epitem" */
    struct list_head llink;
    /* The "base" pointer is set to the container "struct epitem" */
    struct epitem *base;
    /*
     * Wait queue item that will be linked to the target file wait
     * queue head.
     */
    wait_queue_t wait;
    /* The wait queue head that linked the "wait" wait queue item */
    wait_queue_head_t *whead;
};
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
    poll_table pt;
    struct epitem *epi;
};
/* Used by the ep_send_events() function as callback private data */
struct ep_send_events_data {
    int maxevents;
    struct epoll_event __user *events;
};
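For reference, the user-space counterparts of these structures, roughly as glibc declares them in <sys/epoll.h> (modulo packing attributes), look like this; epoll_event is exactly what epoll_ctl copies in and epoll_wait copies back out in the syscalls annotated below.

#include <stdint.h>

typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events; /* EPOLLIN, EPOLLOUT, EPOLLET, ... */
    epoll_data_t data;   /* opaque user data, returned untouched by epoll_wait */
};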
/* --- Annotated code --- */
/* You read that right: this is the real epoll_create(). It does essentially
 * nothing and simply calls epoll_create1. Note also that the size argument serves
 * no purpose at all... */
SYSCALL_DEFINE1(epoll_create, int, size)
{
    if (size <= 0)
        return -EINVAL;
    return sys_epoll_create1(0);
}
/* And this is the real epoll_create~~ */
SYSCALL_DEFINE1(epoll_create1, int, flags)
{
    int error;
    struct eventpoll *ep = NULL; /* the main descriptor */
    /* Check the EPOLL_* constant for consistency. */
    /* This statement has no effect at runtime... */
    BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);
    /* For epoll, the only valid flag at the moment is CLOEXEC */
    if (flags & ~EPOLL_CLOEXEC)
        return -EINVAL;
    /*
     * Create the internal data structure ("struct eventpoll").
     */
    /* Allocate a struct eventpoll; the allocation and initialization details are
     * discussed further down~ */
    error = ep_alloc(&ep);
    if (error < 0)
        return error;
    /*
     * Creates all the items needed to setup an eventpoll file. That is,
     * a file structure and a free file descriptor.
     */
    /* This creates an anonymous fd. Long story short:
     * an epollfd has no real file behind it, so the kernel creates a "virtual"
     * file for it, allocates a real struct file, and hands out a real fd.
     * Two arguments matter here:
     * eventpoll_fops: fops means file operations; when you operate on this
     * (virtual) file, say read it, the function pointers in fops point at the real
     * implementations, much like virtual functions and subclasses in C++.
     * epoll only implements the poll and release (i.e. close) operations; every
     * other file operation is handled by the VFS.
     * ep: this is the struct eventpoll; it is stashed as private data in the
     * struct file's private_data pointer.
     * In short, this lets us go from fd to struct file, and from struct file to
     * the eventpoll structure.
     * If you have done a little Linux character device driver development, this
     * should be easy to follow; "Linux Device Drivers, 3rd edition" is recommended.
     */
    error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
                             O_RDWR | (flags & O_CLOEXEC));
    if (error < 0)
        ep_free(ep);
    return error;
}
/*
 * With the epollfd created, the next step is to add fds to it.
 * Here is epoll_ctl:
 * epfd   - the epollfd
 * op     - ADD, MOD or DEL
 * fd     - the descriptor to monitor
 * event  - the events we care about
 */
SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
                struct epoll_event __user *, event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    error = -EFAULT;
    /*
     * Error handling, plus copying the epoll_event structure from user space into
     * kernel space.
     */
    if (ep_op_has_event(op) &&
        copy_from_user(&epds, event, sizeof(struct epoll_event)))
        goto error_return;
    /* Get the "struct file *" for the eventpoll file */
    /* Fetch the struct file: epfd is a real fd, so the kernel has a corresponding
     * struct file for it, allocated in epoll_create1() by anon_inode_getfd() */
    error = -EBADF;
    file = fget(epfd);
    if (!file)
        goto error_return;
    /* Get the "struct file *" for the target file */
    /* The fd we want to monitor also has its own struct file; do not confuse the
     * two */
    tfile = fget(fd);
    if (!tfile)
        goto error_fput;
    /* The target file descriptor must support poll */
    error = -EPERM;
    /* If the monitored file does not support poll, we are out of luck.
     * Do you know in which cases a file does not support poll?
     */
    if (!tfile->f_op || !tfile->f_op->poll)
        goto error_tgt_fput;
    /*
     * We have to check that the file structure underneath the file descriptor
     * the user passed to us _is_ an eventpoll file. And also we do not permit
     * adding an epoll file descriptor inside itself.
     */
    error = -EINVAL;
    /* epoll cannot monitor itself... */
    if (file == tfile || !is_file_epoll(file))
        goto error_tgt_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Grab our eventpoll structure, the one allocated in epoll_create1() */
    ep = file->private_data;
    /* What follows may modify the data structures, so lock them~ */
    mutex_lock(&ep->mtx);
    /*
     * Try to lookup the file inside our RB tree, Since we grabbed "mtx"
     * above, we can be sure to be able to use the item looked up by
     * ep_find() till we release the mutex.
     */
    /* The kernel allocates one epitem per monitored fd, and, as we know, epoll
     * does not allow the same fd to be added twice, so first look up whether this
     * fd already exists.
     * ep_find() is just an rb-tree lookup, much like the C++ STL map: O(log n).
     */
    epi = ep_find(ep, tfile, fd);
    error = -EINVAL;
    switch (op) {
    /* Addition first */
    case EPOLL_CTL_ADD:
        if (!epi) {
            /* The lookup found no existing epitem, so this is the first insert:
             * accepted! Note that the kernel always monitors POLLERR and POLLHUP.
             */
            epds.events |= POLLERR | POLLHUP;
            /* rb-tree insert; see the analysis of ep_insert().
             * Personally I think that, since the insert is done here anyway, the
             * ep_find() above could have been skipped... */
            error = ep_insert(ep, &epds, tfile, fd);
        } else
            /* Found one!? Duplicate add! */
            error = -EEXIST;
        break;
    /* Deletion and modification are straightforward */
    case EPOLL_CTL_DEL:
        if (epi)
            error = ep_remove(ep, epi);
        else
            error = -ENOENT;
        break;
    case EPOLL_CTL_MOD:
        if (epi) {
            epds.events |= POLLERR | POLLHUP;
            error = ep_modify(ep, epi, &epds);
        } else
            error = -ENOENT;
        break;
    }
    mutex_unlock(&ep->mtx);
error_tgt_fput:
    fput(tfile);
error_fput:
    fput(file);
error_return:
    return error;
}
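Seen from user space, the path through this syscall looks roughly like the sketch below. Monitoring stdin is an arbitrary choice for illustration; error handling is minimal.

#include <sys/epoll.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct epoll_event ev;
    int epfd = epoll_create1(0);   /* ends up in SYSCALL_DEFINE1(epoll_create1) */
    if (epfd < 0) {
        perror("epoll_create1");
        return EXIT_FAILURE;
    }
    ev.events = EPOLLIN;           /* the kernel adds POLLERR | POLLHUP itself */
    ev.data.fd = STDIN_FILENO;
    /* EPOLL_CTL_ADD takes the ep_find()/ep_insert() branch annotated above */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0) {
        perror("epoll_ctl");
        return EXIT_FAILURE;
    }
    close(epfd);                   /* release on the epollfd ends in ep_free() */
    return EXIT_SUCCESS;
}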
/* Allocate an eventpoll structure */
static int ep_alloc(struct eventpoll **pep)
{
    int error;
    struct user_struct *user;
    struct eventpoll *ep;
    /* Fetch information about the current user, e.g. whether it is root and the
     * maximum number of fds it may watch */
    user = get_current_user();
    error = -ENOMEM;
    ep = kzalloc(sizeof(*ep), GFP_KERNEL);
    if (unlikely(!ep))
        goto free_uid;
    /* Plain initialization from here on */
    spin_lock_init(&ep->lock);
    mutex_init(&ep->mtx);
    init_waitqueue_head(&ep->wq);
    init_waitqueue_head(&ep->poll_wait);
    INIT_LIST_HEAD(&ep->rdllist);
    ep->rbr = RB_ROOT;
    ep->ovflist = EP_UNACTIVE_PTR;
    ep->user = user;
    *pep = ep;
    return 0;
free_uid:
    free_uid(user);
    return error;
}
/*
 * Must be called with "mtx" held.
 */
/*
 * ep_insert() is called from epoll_ctl() and does the work of adding one
 * monitored fd to the epollfd.
 * tfile is the fd's kernel-side struct file.
 */
static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
                     struct file *tfile, int fd)
{
    int error, revents, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    struct ep_pqueue epq;
    /* Check whether the current user has reached its watch limit */
    if (unlikely(atomic_read(&ep->user->epoll_watches) >=
                 max_user_watches))
        return -ENOSPC;
    /* Allocate an epitem from the famous slab allocator */
    if (!(epi = kmem_cache_alloc(epi_cache, GFP_KERNEL)))
        return -ENOMEM;
    /* Item initialization follow here ... */
    /* Initialization of the various members... */
    INIT_LIST_HEAD(&epi->rdllink);
    INIT_LIST_HEAD(&epi->fllink);
    INIT_LIST_HEAD(&epi->pwqlist);
    epi->ep = ep;
    /* Store the fd we want to monitor and its struct file */
    ep_set_ffd(&epi->ffd, tfile, fd);
    epi->event = *event;
    epi->nwait = 0;
    /* Note that this pointer's initial value is not NULL... */
    epi->next = EP_UNACTIVE_PTR;
    /* Initialize the poll table using the queue callback */
    /* Now we finally get to the heart of poll */
    epq.epi = epi;
    /* Initialize a poll_table.
     * This really just specifies the callback to run on poll_wait (note: not
     * epoll_wait!) and which events we care about. ep_ptable_queue_proc() is our
     * callback; the initial value is "interested in every event". */
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    /*
     * Attach the item to the poll hooks and get current event bits.
     * We can safely use the file* here because its usage count has
     * been increased by the caller of this function. Note that after
     * this operation completes, the poll callback can start hitting
     * the new item.
     */
    /* This step is crucial and a little hard to follow; it is entirely a
     * consequence of the kernel poll mechanism...
     * First, f_op->poll() is usually just a wrapper that calls the real poll
     * implementation. Take a UDP socket as an example; the call chain is:
     * f_op->poll(), sock_poll(), udp_poll(), datagram_poll(), sock_poll_wait(),
     * and finally the ep_ptable_queue_proc() callback we registered above
     * (quite a deep call path...).
     * Once this completes, our epitem is hooked up to the socket: when its state
     * changes, we are notified through ep_poll_callback().
     * Finally, this call also queries whether any event on the fd is already
     * ready and, if so, returns it.
     * (A sketch of a typical driver poll method follows this function.) */
    revents = tfile->f_op->poll(tfile, &epq.pt);
    /*
     * We have to check if something went wrong during the poll wait queue
     * install process. Namely an allocation for a wait queue failed due
     * high memory pressure.
     */
    error = -ENOMEM;
    if (epi->nwait < 0)
        goto error_unregister;
    /* Add the current item to the list of active epoll hook for this file */
    /* Every file chains together all the epitems that are monitoring it */
    spin_lock(&tfile->f_lock);
    list_add_tail(&epi->fllink, &tfile->f_ep_links);
    spin_unlock(&tfile->f_lock);
    /*
     * Add the current item to the RB tree. All RB tree operations are
     * protected by "mtx", and ep_insert() is called with "mtx" held.
     */
    /* With everything set up, insert the epitem into its eventpoll */
    ep_rbtree_insert(ep, epi);
    /* We have to drop the new item inside our item list to keep track of it */
    spin_lock_irqsave(&ep->lock, flags);
    /* If the file is already "ready" we drop it inside the ready list */
    /* At this point, if the monitored fd already has pending events, handle them */
    if ((revents & event->events) && !ep_is_linked(&epi->rdllink)) {
        /* Put the current epitem onto the ready list */
        list_add_tail(&epi->rdllink, &ep->rdllist);
        /* Notify waiting tasks that events are available */
        /* Whoever is blocked in epoll_wait gets woken up... */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        /* Whoever is epolling this epollfd itself gets woken up too... */
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    atomic_inc(&ep->user->epoll_watches);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 0;
error_unregister:
    ep_unregister_pollwait(ep, epi);
    /*
     * We need to do this because an event could have been arrived on some
     * allocated wait queue. Note that we don't care about the ep->ovflist
     * list, since that is used/cleaned only inside a section bound by "mtx".
     * And ep_insert() is called with "mtx" held.
     */
    spin_lock_irqsave(&ep->lock, flags);
    if (ep_is_linked(&epi->rdllink))
        list_del_init(&epi->rdllink);
    spin_unlock_irqrestore(&ep->lock, flags);
    kmem_cache_free(epi_cache, epi);
    return error;
}
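To show the other end of the f_op->poll() call chain described above, here is a minimal sketch of how a driver's poll method typically looks. It is not from eventpoll.c; it reuses the hypothetical demo_wq/demo_ready names from the earlier wait-queue sketch.

#include <linux/poll.h>

/* Hypothetical driver poll method: poll_wait() is what eventually invokes the
 * registered queue-proc callback (ep_ptable_queue_proc for epoll), adding the
 * caller to demo_wq; the return value is the set of events that are ready right
 * now, which is exactly what ep_insert() receives in "revents". */
static unsigned int demo_poll(struct file *file, poll_table *wait)
{
    unsigned int mask = 0;

    poll_wait(file, &demo_wq, wait);
    if (demo_ready)
        mask |= POLLIN | POLLRDNORM;
    return mask;
}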
/*
 * This is the callback that is used to add our wait queue to the
 * target file wakeup lists.
 */
/*
 * This function is invoked during the f_op->poll() call, i.e. when epoll actively
 * polls an fd, and its job is to tie the epitem to that fd.
 * The link is made through a wait queue (waitqueue).
 */
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
                                 poll_table *pt)
{
    struct epitem *epi = ep_item_from_epqueue(pt);
    struct eppoll_entry *pwq;
    if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
        /* Initialize the wait queue entry, setting ep_poll_callback as the wake-up
         * callback: when the monitored fd changes state, i.e. when the queue head
         * is woken, this callback is invoked. */
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        /* Add the freshly allocated wait queue entry to the head, which is owned
         * by the fd */
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        /* nwait counts how many wait queues this epitem has been added to;
         * I believe it can be at most 1... */
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}
/*
 * This is the callback that is passed to the wait queue wakeup
 * machanism. It is called by the stored file descriptors when they
 * have events to report.
 */
/*
 * This is the key callback: it runs whenever a monitored fd changes state.
 * The key parameter is used as an unsigned long carrying the events.
 */
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    unsigned long flags;
    struct epitem *epi = ep_item_from_wait(wait); /* recover the epitem from the wait queue entry */
    struct eventpoll *ep = epi->ep;               /* and its owning eventpoll */
    spin_lock_irqsave(&ep->lock, flags);
    /*
     * If the event mask does not contain any poll(2) event, we consider the
     * descriptor to be disabled. This condition is likely the effect of the
     * EPOLLONESHOT bit that disables the descriptor when an event is received,
     * until the next EPOLL_CTL_MOD will be issued.
     */
    if (!(epi->event.events & ~EP_PRIVATE_BITS))
        goto out_unlock;
    /*
     * Check the events coming with the callback. At this stage, not
     * every device reports the events in the "key" parameter of the
     * callback. We need to be able to handle both cases here, hence the
     * test for "key" != NULL before the event match test.
     */
    /* None of the events we care about... */
    if (key && !((unsigned long) key & epi->event.events))
        goto out_unlock;
    /*
     * If we are trasfering events to userspace, we can hold no locks
     * (because we're accessing user memory, and because of linux f_op->poll()
     * semantics). All the events that happens during that period of time are
     * chained in ep->ovflist and requeued later on.
     */
    /*
     * This may look puzzling, but what it does is simple:
     * if this callback fires while epoll_wait() has already returned, i.e. the
     * application may at this very moment be iterating over the events, the kernel
     * chains the epitems whose events fire during that window onto a separate
     * list; they are neither handed to the application now nor dropped, but are
     * returned on the next epoll_wait().
     */
    if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
        if (epi->next == EP_UNACTIVE_PTR) {
            epi->next = ep->ovflist;
            ep->ovflist = epi;
        }
        goto out_unlock;
    }
    /* If this file is already in the ready list we exit soon */
    /* Put the current epitem onto the ready list */
    if (!ep_is_linked(&epi->rdllink))
        list_add_tail(&epi->rdllink, &ep->rdllist);
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    /* Wake up epoll_wait... */
    if (waitqueue_active(&ep->wq))
        wake_up_locked(&ep->wq);
    /* If the epollfd itself is being polled, wake everyone on that queue too. */
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
out_unlock:
    spin_unlock_irqrestore(&ep->lock, flags);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return 1;
}
/*
 * Implement the event wait interface for the eventpoll file. It is the kernel
 * part of the user space epoll_wait(2).
 */
SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
                int, maxevents, int, timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;
    /* The maximum number of event must be greater than zero */
    if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
        return -EINVAL;
    /* Verify that the area passed by the user is writeable */
    /* Worth a short explanation:
     * the kernel's policy towards applications is "trust nothing", so data
     * exchanged between kernel and application is almost always copied; passing
     * raw pointers is not allowed (and sometimes not even possible...).
     * epoll_wait() has to return data to user space in memory supplied by the
     * application, so the kernel verifies that this memory range is actually
     * valid.
     */
    if (!access_ok(VERIFY_WRITE, events, maxevents * sizeof(struct epoll_event))) {
        error = -EFAULT;
        goto error_return;
    }
    /* Get the "struct file *" for the eventpoll file */
    error = -EBADF;
    /* Grab the epollfd's struct file; an epollfd is a file too, after all */
    file = fget(epfd);
    if (!file)
        goto error_return;
    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file.
     */
    error = -EINVAL;
    /* Check that it really is an epollfd... */
    if (!is_file_epoll(file))
        goto error_fput;
    /*
     * At this point it is safe to assume that the "private_data" contains
     * our own data structure.
     */
    /* Fetch the eventpoll structure */
    ep = file->private_data;
    /* Time to fish for events ... */
    /* OK, go to sleep and wait for events to arrive~~ */
    error = ep_poll(ep, events, maxevents, timeout);
error_fput:
    fput(file);
error_return:
    return error;
}
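On the user-space side, this syscall is usually driven by a loop like the hedged sketch below; epfd is assumed to be an epollfd that already has descriptors registered, as in the earlier epoll_ctl example, and wait_once is a made-up helper name.

#include <sys/epoll.h>
#include <errno.h>
#include <stdio.h>

#define MAX_EVENTS 64

/* Wait up to one second for events on epfd and report them; returns the number
 * of events, 0 on timeout or signal, -1 on error. */
static int wait_once(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    int i, n;

    n = epoll_wait(epfd, events, MAX_EVENTS, 1000); /* ms, converted to jiffies in ep_poll() */
    if (n < 0) {
        if (errno == EINTR)       /* ep_poll() returned -EINTR: interrupted by a signal */
            return 0;
        perror("epoll_wait");
        return -1;
    }
    for (i = 0; i < n; i++)       /* this array was filled in by ep_send_events_proc() */
        printf("fd %d ready, events 0x%x\n",
               events[i].data.fd, (unsigned)events[i].events);
    return n;
}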
/* This is the function that actually puts the process calling epoll_wait to
 * sleep... */
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
                   int maxevents, long timeout)
{
    int res, eavail;
    unsigned long flags;
    long jtimeout;
    wait_queue_t wait; /* the wait queue entry */
    /*
     * Calculate the timeout by checking for the "infinite" value (-1)
     * and the overflow condition. The passed timeout is in milliseconds,
     * that why (t * HZ) / 1000.
     */
    /* Compute how long to sleep; milliseconds must be converted to jiffies (HZ) */
    jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
        MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
retry:
    spin_lock_irqsave(&ep->lock, flags);
    res = 0;
    /* If the ready list is not empty, skip the sleep and get straight to work... */
    if (list_empty(&ep->rdllist)) {
        /*
         * We don't have any available event to return to the caller.
         * We need to sleep here, and we will be wake up by
         * ep_poll_callback() when events will become available.
         */
        /* OK, initialize a wait queue entry and get ready to hang ourselves on it;
         * note that current is a macro referring to the current process */
        init_waitqueue_entry(&wait, current);       /* wait represents the current process */
        __add_wait_queue_exclusive(&ep->wq, &wait); /* hook it onto the eventpoll's wait queue */
        for (;;) {
            /*
             * We don't want to sleep if the ep_poll_callback() sends us
             * a wakeup in between. That's why we set the task state
             * to TASK_INTERRUPTIBLE before doing the checks.
             */
            /* Mark the current process as sleeping, but wakeable by signals.
             * Note this is "future tense": we are not asleep yet! */
            set_current_state(TASK_INTERRUPTIBLE);
            /* If by now the ready list has members, or the sleep time has run out,
             * do not sleep at all... */
            if (!list_empty(&ep->rdllist) || !jtimeout)
                break;
            /* If a signal is pending, get up as well... */
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }
            /* Nothing going on: unlock and go to sleep... */
            spin_unlock_irqrestore(&ep->lock, flags);
            /* We will be woken after jtimeout, or immediately if ep_poll_callback()
             * is invoked in the meantime, without waiting out the timer.
             * To stress the point again: when ep_poll_callback() runs is decided by
             * the concrete implementation behind the monitored fd, e.g. the socket
             * layer or some device driver, because they own the wait queue head;
             * epoll and the current process merely wait...
             */
            jtimeout = schedule_timeout(jtimeout); /* sleep */
            spin_lock_irqsave(&ep->lock, flags);
        }
        __remove_wait_queue(&ep->wq, &wait);
        /* OK, we are awake... */
        set_current_state(TASK_RUNNING);
    }
    /* Is it worth to try to dig for events ? */
    eavail = !list_empty(&ep->rdllist) || ep->ovflist != EP_UNACTIVE_PTR;
    spin_unlock_irqrestore(&ep->lock, flags);
    /*
     * Try to transfer events to user space. In case we get 0 events and
     * there's still timeout left over, we go trying again in search of
     * more luck.
     */
    /* If all is well and events occurred, prepare to copy the data to user space... */
    if (!res && eavail &&
        !(res = ep_send_events(ep, events, maxevents)) && jtimeout)
        goto retry;
    return res;
}
/* This one is simple; straight on to the next... */
static int ep_send_events(struct eventpoll *ep,
                          struct epoll_event __user *events, int maxevents)
{
    struct ep_send_events_data esed;
    esed.maxevents = maxevents;
    esed.events = events;
    return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
}
/**
 * ep_scan_ready_list - Scans the ready list in a way that makes possible for
 *                      the scan code, to call f_op->poll(). Also allows for
 *                      O(NumReady) performance.
 *
 * @ep: Pointer to the epoll private data structure.
 * @sproc: Pointer to the scan callback.
 * @priv: Private opaque data passed to the @sproc callback.
 *
 * Returns: The same integer error code returned by the @sproc callback.
 */
static int ep_scan_ready_list(struct eventpoll *ep,
                              int (*sproc)(struct eventpoll *,
                                           struct list_head *, void *),
                              void *priv)
{
    int error, pwake = 0;
    unsigned long flags;
    struct epitem *epi, *nepi;
    LIST_HEAD(txlist);
    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() and epoll_ctl().
     */
    mutex_lock(&ep->mtx);
    /*
     * Steal the ready list, and re-init the original one to the
     * empty list. Also, set ep->ovflist to NULL so that events
     * happening while looping w/out locks, are not lost. We cannot
     * have the poll callback to queue directly on ep->rdllist,
     * because we want the "sproc" callback to be able to do it
     * in a lockless way.
     */
    spin_lock_irqsave(&ep->lock, flags);
    /* Pay attention to this step: up to now, every epitem that saw events was
     * linked on rdllist, but after this step they all move onto txlist and
     * rdllist is emptied. Note well: rdllist is now empty! */
    list_splice_init(&ep->rdllist, &txlist);
    /* ovflist, as explained in ep_poll_callback(): right now we do not want new
     * events added to the ready list; save them and handle them next time... */
    ep->ovflist = NULL;
    spin_unlock_irqrestore(&ep->lock, flags);
    /*
     * Now call the callback function.
     */
    /* The callback processes each epitem;
     * sproc is ep_send_events_proc, annotated below. */
    error = (*sproc)(ep, &txlist, priv);
    spin_lock_irqsave(&ep->lock, flags);
    /*
     * During the time we spent inside the "sproc" callback, some
     * other events might have been queued by the poll callback.
     * We re-insert them inside the main ready-list here.
     */
    /* Now handle ovflist: these epitems saw events while we were handing data
     * over to user space. */
    for (nepi = ep->ovflist; (epi = nepi) != NULL;
         nepi = epi->next, epi->next = EP_UNACTIVE_PTR) {
        /*
         * We need to check if the item is already in the list.
         * During the "sproc" callback execution time, items are
         * queued into ->ovflist but the "txlist" might already
         * contain them, and the list_splice() below takes care of them.
         */
        /* Put them straight onto the ready list */
        if (!ep_is_linked(&epi->rdllink))
            list_add_tail(&epi->rdllink, &ep->rdllist);
    }
    /*
     * We need to set back ep->ovflist to EP_UNACTIVE_PTR, so that after
     * releasing the lock, events will be queued in the normal way inside
     * ep->rdllist.
     */
    ep->ovflist = EP_UNACTIVE_PTR;
    /*
     * Quickly re-inject items left on "txlist".
     */
    /* The epitems that were not fully processed this round go back onto the ready
     * list */
    list_splice(&txlist, &ep->rdllist);
    /* If the ready list is not empty, wake the waiters right away... */
    if (!list_empty(&ep->rdllist)) {
        /*
         * Wake up (if active) both the eventpoll wait list and
         * the ->poll() wait list (delayed after we release the lock).
         */
        if (waitqueue_active(&ep->wq))
            wake_up_locked(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    spin_unlock_irqrestore(&ep->lock, flags);
    mutex_unlock(&ep->mtx);
    /* We have to call this outside the lock */
    if (pwake)
        ep_poll_safewake(&ep->poll_wait);
    return error;
}
/* This function is called as a callback from ep_scan_ready_list().
 * head is a list of the epitems that are already ready; it is not the
 * eventpoll's ready list but the txlist from the function above.
 */
static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
                               void *priv)
{
    struct ep_send_events_data *esed = priv;
    int eventcnt;
    unsigned int revents;
    struct epitem *epi;
    struct epoll_event __user *uevent;
    /*
     * We can loop without lock because we are passed a task private list.
     * Items cannot vanish during the loop because ep_scan_ready_list() is
     * holding "mtx" during this call.
     */
    /* Walk the whole list... */
    for (eventcnt = 0, uevent = esed->events;
         !list_empty(head) && eventcnt < esed->maxevents;) {
        /* Take the first entry */
        epi = list_first_entry(head, struct epitem, rdllink);
        /* and remove it from the list */
        list_del_init(&epi->rdllink);
        /* Re-read the events.
         * Note that we already fetched the events in ep_poll_callback(); why read
         * them again?
         * 1. We obviously want the freshest data at this moment: events change~
         * 2. Not every poll implementation passes the events through the wait
         *    queue; some drivers simply do not, so we must query them actively. */
        revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
            epi->event.events;
        /*
         * If the event mask intersect the caller-requested one,
         * deliver the event to userspace. Again, ep_scan_ready_list()
         * is holding "mtx", so no operations coming from userspace
         * can change the item.
         */
        if (revents) {
            /* Copy both the current events and the user-supplied data to user
             * space; this is the pile of data the application reads after
             * epoll_wait() returns. */
            if (__put_user(revents, &uevent->events) ||
                __put_user(epi->event.data, &uevent->data)) {
                /* If the copy fails, the list scan is aborted and the epitem that
                 * failed is put back onto the ready list. The remaining unprocessed
                 * epitems are not dropped either; ep_scan_ready_list() re-inserts
                 * them into the ready list as well. */
                list_add(&epi->rdllink, head);
                return eventcnt ? eventcnt : -EFAULT;
            }
            eventcnt++;
            uevent++;
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            else if (!(epi->event.events & EPOLLET)) {
                /*
                 * If this file has been added with Level
                 * Trigger mode, we need to insert back inside
                 * the ready list, so that the next call to
                 * epoll_wait() will check again the events
                 * availability. At this point, noone can insert
                 * into ep->rdllist besides us. The epoll_ctl()
                 * callers are locked out by
                 * ep_scan_ready_list() holding "mtx" and the
                 * poll callback will queue them in ep->ovflist.
                 */
                /* Right here lies the entire difference between EPOLLET and
                 * level-triggered mode:
                 * with ET, the epitem does not go back onto the ready list unless
                 * the fd changes state again and ep_poll_callback fires;
                 * without ET (level-triggered), whether or not there is still a
                 * valid event or data, the epitem is re-inserted into the ready
                 * list, so the next epoll_wait returns immediately and notifies
                 * user space. Of course, if the monitored fds really have no events
                 * and no data left, that epoll_wait returns 0: one wasted spin.
                 * (A user-space sketch of the ET drain pattern follows this
                 * function.)
                 */
                list_add_tail(&epi->rdllink, &ep->rdllist);
            }
        }
    }
    return eventcnt;
}
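The practical consequence of that ET/LT branch, seen from user space, is the familiar rule that a descriptor registered with EPOLLIN | EPOLLET must be drained until EAGAIN before going back to epoll_wait(). A hedged sketch; handle_input is a made-up helper and the fd is assumed to have been set O_NONBLOCK.

#include <sys/epoll.h>
#include <errno.h>
#include <unistd.h>

/* With EPOLLET the epitem is not re-queued by ep_send_events_proc(), so we only
 * hear about this fd again after a fresh ep_poll_callback(); therefore consume
 * everything that is currently buffered before returning to epoll_wait(). */
static void handle_input(int fd)
{
    char buf[4096];
    ssize_t n;

    for (;;) {
        n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                                        /* consume and keep reading */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                                           /* drained; wait for the next edge */
        break;                                               /* EOF or a real error */
    }
}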
/* ep_free is called when the epollfd is closed;
 * it just releases resources and is fairly simple */
static void ep_free(struct eventpoll *ep)
{
    struct rb_node *rbp;
    struct epitem *epi;
    /* We need to release all tasks waiting for these file */
    if (waitqueue_active(&ep->poll_wait))
        ep_poll_safewake(&ep->poll_wait);
    /*
     * We need to lock this because we could be hit by
     * eventpoll_release_file() while we're freeing the "struct eventpoll".
     * We do not need to hold "ep->mtx" here because the epoll file
     * is on the way to be removed and no one has references to it
     * anymore. The only hit might come from eventpoll_release_file() but
     * holding "epmutex" is sufficent here.
     */
    mutex_lock(&epmutex);
    /*
     * Walks through the whole tree by unregistering poll callbacks.
     */
    for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
        epi = rb_entry(rbp, struct epitem, rbn);
        ep_unregister_pollwait(ep, epi);
    }
    /*
     * Walks through the whole tree by freeing each "struct epitem". At this
     * point we are sure no poll callbacks will be lingering around, and also by
     * holding "epmutex" we can be sure that no file cleanup code will hit
     * us during this operation. So we can avoid the lock on "ep->lock".
     */
    /* The reason you need not call epoll_ctl to remove every added fd before
     * closing an epollfd is that it is done right here... */
    while ((rbp = rb_first(&ep->rbr)) != NULL) {
        epi = rb_entry(rbp, struct epitem, rbn);
        ep_remove(ep, epi);
    }
    mutex_unlock(&epmutex);
    mutex_destroy(&ep->mtx);
    free_uid(ep->user);
    kfree(ep);
}
/* File callbacks that implement the eventpoll file behaviour */
static const struct file_operations eventpoll_fops = {
    .release = ep_eventpoll_release,
    .poll    = ep_eventpoll_poll
};
/* Fast test to see if the file is an eventpoll file */
static inline int is_file_epoll(struct file *f)
{
    return f->f_op == &eventpoll_fops;
}
/* That concludes the annotations of what I consider the important eventpoll
 * functions... */











