nova VirtualInterfaceCreateException (by quqi99)

Posted quqi99

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了nova VirtualInterfaceCreateException (by quqi99)相关的知识,希望对你有一定的参考价值。

作者:张华 发表于:2022-09-01
版权声明:可以任意转载,转载时请务必以超链接形式标明文章原始出处和作者信息及本版权声明

问题

虚机有时候会报下列错误:

nova.exception.VirtualInterfaceCreateException: Virtual Interface creation failed
...
2022-07-19 13:50:32.084 147039 WARNING nova.virt.libvirt.driver [req-7b6da117-d40c-4bac-9f82-50c4266e1617 66d3188e9f24466f8d9c3905f178d12a ca6332100f1d42e4aa94aa2c37f243e4 - d476d9f579154d49961c29942a76d1c0 d476d9f579154d49961c29942a76d1c0] [instance: 6fb0fc5d-4fa3-435c-9f60-b9953d11adb7] Timeout waiting for [('network-vif-plugged', '864b25d9-276b-4644-916a-7637762b93e2')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds

初步分析

没问题的虚机有下列日志:

Provisioning for port 2e96ade5-3a66-4045-b631-597995c07d5b completed by entity DHCP. provisioning_complete /usr/lib/python3/dist-packages/neutron/db/provisioning_blocks.py:133
Provisioning for port 2e96ade5-3a66-4045-b631-597995c07d5b completed by entity L2. provisioning_complete /usr/lib/python3/dist-packages/neutron/db/provisioning_blocks.py:133

有问题的虚机有下列日志, 它比上面的少了’completed by entity L2. provisioning_complete '.

For every failed VM (due to the vif plugged event timeout) neutron server log contain only following
Provisioning for port 864b25d9-276b-4644-916a-7637762b93e2 completed by entity DHCP. provisioning_complete /usr/lib/python3/dist-packages/neutron/db/provisioning_blocks.py:133

这样它就不会像下列这样将port设置为ACTIVE(https://docs.openstack.org/neutron/latest/contributor/internals/provisioning_blocks.html), 然后会引起nova端报的超时问题.

Transition to ACTIVE for port object 864b25d9-276b-4644-916a-7637762b93e2 will not be triggered until provisioned by entity L2. add_provisioning_component /usr/lib/python3/dist-packages/neutron/db/provisioning_blocks.py:73

neutron-l2-agent有时会花47分钟来完成一个循环.

2022-08-04 19:24:53.758 95691 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5184b270-3ac4-4777-a974-8d629599cd88 - - - - -] Agent rpc_loop - iteration:2935733 started
2022-08-04 20:11:40.496 95691 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5184b270-3ac4-4777-a974-8d629599cd88 - - - - -] Agent rpc_loop - iteration:2935733 completed. Processed ports statistics: 'regular': 'added': 0, 'updated': 2, 'removed': 2. Elapsed:2806.738

下列脚本能将所有neutron-l2-agent的循环时间给列出来(下面只贴出了部分输出, 它显示有时候循环花很长时间):

$ for i in `seq 2 22`; do echo "neutron-openvswitch-agent.log.$i.gz: " ; zgrep "iteration.* completed" neutron-openvswitch-agent.log.$i.gz | awk ' print $(NF) ' | cut -d ":" -f2 | sort -gr | head -n 6; done
neutron-openvswitch-agent.log.2.gz:
...
neutron-openvswitch-agent.log.5.gz:
2806.738
2803.673
1410.152
1410.006
925.405
925.316

在上面2022-08-04 19:24:53 - 2022-08-04 20:11:40这段时间也都能看到所报的错误,从而验证所报的错误应该就是这个循环造成的.

$zgrep "Timeout waiting for \\[('network-vif-plugged'" nova-compute.log.5.gz                 
2022-08-04 19:30:54.076 147039 WARNING nova.virt.libvirt.driver [req-acfb3bd2-565f-48e7-91e8-f109446895c1 f7ca0fcc30cf47c0b60b260bab9dd7bd e14e421176884fb492a71e1c9be13fa3 - 6113214bdbed491c880c23149c25b7cb 6113214bdbed491c880c23149c25b7cb] [instance: 94d4ab46-799d-44fc-9991-9a57ec18e950] Timeout waiting for [('network-vif-plugged', '43b7f269-0eae-44c4-990f-53c98743e12b')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds
...
2022-08-04 20:10:06.211 147039 WARNING nova.virt.libvirt.driver [req-5b035412-a797-453a-b3e8-098dead294dd f7ca0fcc30cf47c0b60b260bab9dd7bd e14e421176884fb492a71e1c9be13fa3 - 6113214bdbed491c880c23149c25b7cb 6113214bdbed491c880c23149c25b7cb] [instance: 19da6ffc-dcc1-4620-b3b2-2018a3355b8b] Timeout waiting for [('network-vif-plugged', '71190281-53ce-4558-82cb-0240e71b9245')] for instance with vm_state building and task_state spawning.: eventlet.timeout.Timeout: 300 seconds

进一步分析 - focus在neutron-l2-agent端

这段时间neutron-l2-agent在做什么呢?它在更新SG.

2022-07-28 21:10:34.988 95691 INFO neutron.agent.securitygroups_rpc [req-0be98a3b-1674-486c-aeb4-67d49e413757 f7ca0fcc30cf47c0b60b260bab9dd7bd e14e421176884fb492a71e1c9be13fa3 - - -] Security group member upd
ated '1f4da925-7e78-4e7c-ac07-8eae09fc2fda', '4de2b2bf-5302-4cf4-96de-949ea78ca8eb'

通过下列方式要来neutron DB.

pass=`juju run --unit mysql/0 'leader-get root-password'`
juju run --unit mysql/0 "mysqldump -u root --password=$pass --single-transaction --skip-lock-tables --set-gtid-purged=OFF --databases neutron --quick --result-file=/tmp/neutron.sql"

分析到有的SG中979上的LB, 有高达1994个active ports (This is for security group:4d15a8fa-3adc-4750-87da-c3d26902dbca network:4a57f171-1e6b-4ae4-9ec5-2203144817b7 (lb-mgmt-net).):

select sgb.security_group_id, p.network_id, n.name, count(sgb.port_id) from securitygroupportbindings sgb join ports p on p.id=sgb.port_id join networks n on n.id=p.network_id group by 1,2 order by 4 desc limit 10;
+--------------------------------------+--------------------------------------+-------------------+--------------------+
| security_group_id                    | network_id                           | name              | count(sgb.port_id) |
+--------------------------------------+--------------------------------------+-------------------+--------------------+
| 4d15a8fa-3adc-4750-87da-c3d26902dbca | 4a57f171-1e6b-4ae4-9ec5-2203144817b7 | lb-mgmt-net       |               1994 |
| 7f9e022e-f12d-4526-8e11-b09f5ed2e0e5 | 4a8c719c-04e7-490c-b7d2-b99e79e7e79f | network_lb_az3    |                241 |

select count(id) from load_balancer where provisioning_status='ACTIVE';
+-----------+
| count(id) |
+-----------+
|       979 |
+-----------+
...

相关代码如下:

#https://github.com/openstack/neutron/blob/stable/ussuri/neutron/agent/securitygroups_rpc.py#L214
    def security_groups_member_updated(self, security_groups):
        LOG.info("Security group "
                 "member updated %r", security_groups)
        self._security_group_updated(
            security_groups,
            'security_group_source_groups',
            'sg_member')
            
#https://github.com/openstack/neutron/blob/stable/ussuri/neutron/agent/securitygroups_rpc.py#L206
    def security_groups_rule_updated(self, security_groups):
        LOG.info("Security group "
                 "rule updated %r", security_groups)
        self._security_group_updated(
            security_groups,
            'security_groups',
            'sg_rule')

这个是FW rpc的实现机制:
https://github.com/openstack/neutron/blob/stable/ussuri/doc/source/contributor/internals/openvswitch_firewall.rst#firewall-api-calls
https://github.com/openstack/neutron/blob/stable/ussuri/doc/source/contributor/internals/rpc_callbacks.rst

There are two main calls performed by the firewall driver in order to either create or update a port with security groups - prepare_port_filter and update_port_filter. Both methods rely on the security group objects that are already defined in the driver and work similarly to their iptables counterparts. The definition of the objects will be described later in this document. prepare_port_filter must be called only once during port creation, and it defines the initial rules for the port. When the port is updated, all filtering rules are removed, and new rules are generated based on the available information about security groups in the driver.

Security group rules can be defined in the firewall driver by calling update_security_group_rules, which rewrites all the rules for a given security group. If a remote security group is changed, then update_security_group_members is called to determine the set of IP addresses that should be allowed for this remote security group. Calling this method will not have any effect on existing instance ports. In other words, if the port is using security groups and its rules are changed by calling one of the above methods, then no new rules are generated for this port. update_port_filter must be called for the changes to take effect.

All the machinery above is controlled by security group RPC methods, which mean the firewall driver doesn't have any logic of which port should be updated based on the provided changes, it only accomplishes actions when called from the controller.

推荐应该call了下列代码路径:

security_groups_rule_updated ->
and
security_groups_member_updated ->
_security_group_updated ->
refresh_firewall ->
_apply_port_filter(device_ids, update_filter=True) ->
firewall.update_port_filter(device) ->
_set_port_filters(of_port) ->
add_flows_from_rules ->
conj_ip_manager.update_flows_for_vlan ->
_update_flows_for_vlan_subr ->
delete_flows_for_flow_state ->
delete_flow_for_ip

是有可能触发了下列lp bug吗?
https://bugs.launchpad.net/neutron/+bug/1975674
https://bugs.launchpad.net/neutron/+bug/1813703

以上是关于nova VirtualInterfaceCreateException (by quqi99)的主要内容,如果未能解决你的问题,请参考以下文章

华为nova10和nova10pro的区别 华为nova10和华为nova10pro参数配置

荣耀nova7是虚拟机吗?

OpenStack 运维 - 部署Nova组件 [T版]

OpenStack 运维 - 部署Nova组件 [T版]

OpenStack部署之Nova项目(7-5)

OpenStack部署之Nova项目(7-5)