• ovn metadata (by quqi99)


    Author: Zhang Hua  Published: 2022-08-25
    Copyright: feel free to repost, but please keep a hyperlink to the original article, the author information and this copyright notice.

    Problem

    The customer reports that the metadata service for their VMs occasionally breaks, and they suspect: https://bugs.launchpad.net/networking-ovn/+bug/1953591
    They also say that, once the problem occurs, restarting the neutron-ovn-metadata-agent service on the host where the affected VM runs restores the service; restarting the service re-runs:

    /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2 haproxy -f /var/lib/neutron/ovn-metadata-proxy/63e2c276-60dd-40e3-baa1-c16342eacce2.conf
    

    How OVN metadata works

    I am quite familiar with neutron metadata; I have worked with OVN metadata much less, but the underlying principle should be similar.

    • For an SR-IOV port there is a corresponding external port that provides the OVN metadata service, so in that case metadata is served on the GW chassis, not on the same node as the compute chassis. See the flow in: 20220607 - sriov ovn metadata - https://blog.csdn.net/quqi99/article/details/116893909 ,
    • For an ordinary VM, metadata is served by the ovnmeta-xxx namespace on the compute chassis where the VM runs.

    1, Inside the VM (192.168.21.75), DHCP (192.168.21.2) provides a host route that forwards metadata traffic directly to the metadata proxy neutron-ovn-metadata-agent (192.168.21.2); if DHCP is not enabled, the GW forwards it to neutron-ovn-metadata-agent instead (the NB-side view of this host route is sketched after the listings below).

    $ ssh -i ~/testkey.priv ubuntu@10.5.150.219 -- ip route list |grep 169
    169.254.169.254 via 192.168.21.2 dev ens2 proto dhcp src 192.168.21.75 metric 100
    
    $ openstack port list
    +--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
    | ID                                   | Name | MAC Address       | Fixed IP Addresses                                                           | Status |
    +--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
    | 48244606-524d-4c24-bc54-0143869d4267 |      | fa:16:3e:43:81:23 | ip_address='192.168.21.2', subnet_id='7a3260aa-e2a6-4721-8f0a-5b06aec71b53'  | DOWN   |
    | 5043ab8b-184d-4d83-abf8-fa044aec9a6d |      | fa:16:3e:e2:8c:a0 | ip_address='10.5.152.84', subnet_id='a0f40836-715c-4dcc-9df1-fb4093aa0f79'   | DOWN   |
    | 544e4423-d3d8-418e-8b6c-ea294ff0d1b3 |      | fa:16:3e:a2:83:76 |                                                                              | DOWN   |
    | 70b96954-32a2-488b-b67b-4bf039189b02 |      | fa:16:3e:42:29:d4 | ip_address='192.168.21.1', subnet_id='7a3260aa-e2a6-4721-8f0a-5b06aec71b53'  | ACTIVE |
    | c99885ae-1778-46cb-af07-29cf32113ef7 |      | fa:16:3e:4d:03:e6 | ip_address='10.5.150.219', subnet_id='a0f40836-715c-4dcc-9df1-fb4093aa0f79'  | N/A    |
    | dd6bd055-3460-459e-b9b2-8da7211dacdc |      | fa:16:3e:53:b7:77 | ip_address='192.168.21.75', subnet_id='7a3260aa-e2a6-4721-8f0a-5b06aec71b53' | ACTIVE |
    +--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
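
    As a quick cross-check (a hedged sketch, run on an ovn-central unit; the metadata port IP 192.168.21.2 is from my test environment), the 169.254.169.254 host route delivered by OVN native DHCP can also be seen in the NB DB DHCP_Options table:

    $ sudo ovn-nbctl list DHCP_Options | grep -B2 -A2 '169.254.169.254'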
    

    2, Each compute node runs one neutron-ovn-metadata-agent; the agent creates an ovnmeta-xxx namespace per network and starts a haproxy instance in it.
      If ovnmeta-xxx does not exist it gets created, and a veth pair connects the namespace side (eg: tap8ea7d54e-51) with the host side (eg: tap8ea7d54e-50, which is added to br-int).
    The haproxy instance listens on 169.254.169.254:80, adds the X-OVN-Network-ID header, and finally forwards to the backend (unix socket: /var/lib/neutron/metadata_proxy). A quick way to verify this plumbing is sketched after the haproxy config below.

    #nova-compute/0 has 3 processes about metadata
    /usr/bin/python3 /usr/bin/neutron-ovn-metadata-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/neutron_ovn_metadata_agent.ini --log-file=/var/log/neutron/neutron-ovn-metadata-agent.log
    /usr/bin/python3 /usr/bin/nova-api-metadata --config-file=/etc/nova/nova.conf --log-file=/var/log/nova/nova-api-metadata.log
    haproxy -f /var/lib/neutron/ovn-metadata-proxy/8ea7d54e-5fe0-408d-9de8-5ccbed8ede84.conf
    
    $ sudo ip netns exec ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 ip addr show tap8ea7d54e-51 |grep -E 'link|inet'
        link/ether fa:16:3e:43:81:23 brd ff:ff:ff:ff:ff:ff link-netnsid 0
        inet 169.254.169.254/32 brd 169.254.169.254 scope global tap8ea7d54e-51
        inet 192.168.21.2/24 brd 192.168.21.255 scope global tap8ea7d54e-51
    $ ip addr show tap8ea7d54e-50 |grep link
        link/ether ca:cb:5c:af:dd:22 brd ff:ff:ff:ff:ff:ff link-netns ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84
    $ sudo ovs-vsctl show |grep 8ea7d54e
            Port tap8ea7d54e-50
                Interface tap8ea7d54e-50
    
    $ sudo ip netns exec ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 route -n
    Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
    192.168.21.0    0.0.0.0         255.255.255.0   U     0      0        0 tap8ea7d54e-51
    
    $ cat /var/lib/neutron/ovn-metadata-proxy/8ea7d54e-5fe0-408d-9de8-5ccbed8ede84.conf
    global
        log         /dev/log local0 info
        user        root
        group       root
        maxconn     1024
        pidfile     /var/lib/neutron/external/pids/8ea7d54e-5fe0-408d-9de8-5ccbed8ede84.pid.haproxy
        daemon
    defaults
        log global
        mode http
        option httplog
        option dontlognull
        option http-server-close
        option forwardfor
        retries                 3
        timeout http-request    30s
        timeout connect         30s
        timeout client          32s
        timeout server          32s
        timeout http-keep-alive 30s
    listen listener
        bind 169.254.169.254:80
        server metadata /var/lib/neutron/metadata_proxy
        http-request add-header X-OVN-Network-ID 8ea7d54e-5fe0-408d-9de8-5ccbed8ede84
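
    To double-check that this plumbing is actually in place on a suspect compute node (a hedged sketch; the namespace name and socket path are the ones from my test environment shown above):

    # is haproxy really listening on 169.254.169.254:80 inside the namespace?
    $ sudo ip netns exec ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 ss -ltnp | grep ':80'
    # which process holds the unix socket that haproxy forwards to?
    $ sudo ss -lxp | grep metadata_proxy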
    

    3, neutron-ovn-metadata-agent listens on the unix socket (/var/lib/neutron/metadata_proxy) and forwards the request to the local nova-api-metadata (in DVR mode every compute node runs its own nova-api-metadata).
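
    So the full path is VM -> haproxy in ovnmeta-xxx -> unix socket -> neutron-ovn-metadata-agent -> nova-api-metadata. A hedged way to exercise it piece by piece (the VM IP, network UUID and socket path are from my test environment; the two headers are exactly the ones used by the haproxy config above and by server.py quoted further down):

    # end to end, from inside the VM
    $ curl http://169.254.169.254/openstack/latest/meta_data.json
    # bypass haproxy and talk to neutron-ovn-metadata-agent directly on the compute node
    $ sudo curl -s --unix-socket /var/lib/neutron/metadata_proxy \
          -H 'X-Forwarded-For: 192.168.21.75' \
          -H 'X-OVN-Network-ID: 8ea7d54e-5fe0-408d-9de8-5ccbed8ede84' \
          http://169.254.169.254/openstack/latest/meta_data.json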

    Analysis of the available data

    Unfortunately, the customer hit the problem and restarted neutron-ovn-metadata-agent at 07-21 17:15, and the sosreport they collected was taken after running the workaround, so its value as evidence is limited.
    They mentioned the namespace ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2, so I searched for the information below, but because the sosreport was taken after the workaround it does not show much.

    $ grep -r 'tap63e2c276' sos_commands/networking/namespaces/ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2/ip_netns_exec_ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2_ip_-d_address_show
    2: tap63e2c276-61@if20:  mtu 1500 qdisc noqueue state UP group default qlen 1000
        inet 100.94.98.4/23 brd 100.94.99.255 scope global tap63e2c276-61
        inet 169.254.169.254/32 brd 169.254.169.254 scope global tap63e2c276-61
    

    However, searching by that IP (100.94.98.4) turns up nothing (find var/log/neutron/* |xargs zgrep -i -E '100.94.98.4,'), while a log line about 100.94.98.40 does show up:

    var/log/neutron/neutron-ovn-metadata-agent.log.1:2022-07-20 21:15:56.758 3872 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.40
    

    This log line comes from ./neutron/agent/ovn/metadata/server.py#_get_instance_and_project_id: the Port_Binding query against SB_DB returned 0 rows.

    def _get_instance_and_project_id(self, req):
        remote_address = req.headers.get('X-Forwarded-For')
        network_id = req.headers.get('X-OVN-Network-ID')

        ports = self.sb_idl.get_network_port_bindings_by_ip(network_id,
                                                            remote_address)
        num_ports = len(ports)
        if num_ports == 1:
            external_ids = ports[0].external_ids
            return (external_ids[ovn_const.OVN_DEVID_EXT_ID_KEY],
                    external_ids[ovn_const.OVN_PROJID_EXT_ID_KEY])
        elif num_ports == 0:
            LOG.error("No port found in network %s with IP address %s",
                      network_id, remote_address)

    def get_network_port_bindings_by_ip(self, network, ip_address):
        rows = self.db_list_rows('Port_Binding').execute(check_error=True)
        # TODO(twilson) It would be useful to have a db_find that takes a
        # comparison function

        def check_net_and_ip(port):
            # If the port is not bound to any chassis it is not relevant
            if not port.chassis:
                return False

            # TODO(dalvarez): Remove the comparison to port.datapath.uuid in Y
            # cycle when we are sure that all namespaces will be created with
            # the Neutron network UUID and not anymore with the OVN datapath
            # UUID.
            is_in_network = lambda port: (
                str(port.datapath.uuid) == network or
                utils.get_network_name_from_datapath(port.datapath) == network)

            return port.mac and is_in_network(port) and (
                    ip_address in port.mac[0].split(' '))

        return [r for r in rows if check_net_and_ip(r)]
    

    However, there is nothing related to get_network_port_bindings_by_ip either (find var/log/neutron/* |xargs zgrep -i -E 'get_network_port_bindings_by_ip'), so this is probably not a failure to connect to SB_DB (if the SB_DB connection were down, get_network_port_bindings_by_ip would raise and the traceback would contain that name). So most likely the Port_Binding query really did return 0 rows. On my test environment 'ovn-sbctl list Port_Binding' shows the record below (the metadata port in the test environment is 192.168.21.2); a quick way to check for a specific VM IP is sketched after it:

    _uuid               : a011fdcd-15ba-48c3-aec0-2ef590aee864
    chassis             : []
    datapath            : 0f509be3-9b1f-4c66-9c59-ac4211cf5ef4
    encap               : []
    external_ids        : {"neutron:cidrs"="192.168.21.2/24", "neutron:device_id"=ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84, "neutron:device_owner"="network:distributed", "neutron:network_name"=neutron-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84, "neutron:port_name"="", "neutron:project_id"="728f2ae8aa8443b5a255574a5e236746", "neutron:revision_number"="2", "neutron:security_group_ids"=""}
    gateway_chassis     : []
    ha_chassis_group    : []
    logical_port        : "48244606-524d-4c24-bc54-0143869d4267"
    mac                 : ["fa:16:3e:43:81:23 192.168.21.2"]
    nat_addresses       : []
    options             : {requested-chassis=""}
    parent_port         : []
    tag                 : []
    tunnel_key          : 1
    type                : localport
    up                  : false
    virtual_parent      : []
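
    A hedged sketch for checking whether SB_DB has a Port_Binding for a given VM IP (the IP below is the one from the customer log; on my test environment it would be 192.168.21.75 instead):

    $ sudo ovn-sbctl list Port_Binding | grep -B 9 -A 9 '100.94.98.40'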
    

    Who creates the Port_Binding records, then?
    Also, nova-compute has lots of log lines like the following; it looks as if it cannot reach neutron-api (9696):

    var/log/nova/nova-api-metadata.log:2022-07-21 13:41:24.086 1790270 ERROR nova.api.metadata.handler [req-42f38810-e2c3-4df0-8de6-754d1f6d5472 - - - - -] Failed to get metadata for instance id: 80f1bb26-f5db-4919-abb2-6b61d13f9267
    

    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:21:46.371 3516 ERROR oslo.messaging._drivers.impl_rabbit [-] [e96b9581-fa77-4f0f-a580-05a13b18bfad] AMQP server on 192.168.0.124:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds.: amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
    ...
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.928 3514 ERROR nova.api.metadata.handler [req-c4d5a1b4-1c36-4864-b729-d68de13ef59d - - - - -] Failed to get metadata for instance id: cc1c7466-b5d4-47d0-8e23-e91c785fafb3: neutronclient.common.exceptions.NeutronClientException: 
    

    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler neutronclient.common.exceptions.NeutronClientException:
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler 502 Proxy Error
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler Proxy Error
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler The proxy server received an invalid response from an upstream server.
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler The proxy server could not handle the request
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler Reason: Error reading from remote server
    var/log/nova/nova-api-metadata.log.20.gz:2022-07-01 20:59:04.849 3514 ERROR nova.api.metadata.handler Apache/2.4.41 (Ubuntu) Server at neutron.internal.api.xxx.infra.xxx.net Port 9696

    There are also lots of logs like this:

    2022-07-20 23:59:18.828 1790159 INFO nova.compute.resource_tracker [req-25cb1333-d268-4870-877d-8465aea2b451 - - - - -] Instance 4d35b4ff-177c-4659-b6a4-d4d1d5ddecab has allocations against this compute host but is not found in the database.
    

    And logs like this were also found:

    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/backend/ovs_idl/command.py", line 42, in execute
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     t.add(self)
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.8/contextlib.py", line 120, in __exit__
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     next(self.gen)
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/api.py", line 110, in transaction
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     del self._nested_txns_map[cur_thread_id]
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/api.py", line 61, in __exit__
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 57, in commit
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command     raise exceptions.TimeoutException(
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command
    2022-07-11 16:37:25.780 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connected
    2022-07-11 16:37:25.817 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: clustered database server is not cluster leader; trying another server
    2022-07-11 16:37:25.818 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connection closed by client
    2022-07-11 16:37:33.831 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.204:6642: connected
    2022-07-11 16:37:34.566 2361 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync
    

    So the metadata-agent timed out talking to SB_DB, which is consistent with the Port_Binding query returning 0. I found lp bug https://bugs.launchpad.net/neutron/+bug/1951841, which says that an error is raised when the system_id is not a UUID, and we also saw the following log:

    2022-07-11 18:57:45.926 798838 WARNING neutron.agent.ovn.metadata.agent [-] Can't read ovn-bridge external-id from OVSDB. Using br-int instead.
    2022-07-11 18:57:45.926 798838 INFO neutron.agent.ovn.metadata.agent [-] Loaded chassis name xxx.maas (UUID: 4fe633c2-0c52-5c86-92b2-0ee1f5e71a90) and ovn bridge br-int.
    2022-07-11 18:57:45.927 798838 INFO oslo_service.service [-] Starting 2 workers
    
    def _get_ovn_bridge(self):
        """Return the external_ids:ovn-bridge value of the Open_vSwitch table.

        This is the OVS bridge used to plug the metadata ports to.
        If the key doesn't exist, this method will return 'br-int' as default.
        """
        ext_ids = self.ovs_idl.db_get(
            'Open_vSwitch', '.', 'external_ids').execute()
        try:
            return ext_ids['ovn-bridge']
        except KeyError:
            LOG.warning("Can't read ovn-bridge external-id from OVSDB. Using "
                        "br-int instead.")
            return 'br-int'
    

    But isn't the failure to read external_ids, again, because the Port_Binding does not exist?

    Ruling out lp bug 1951841

    https://bugs.launchpad.net/neutron/+bug/1951841
    Besides adding a Chassis record for each compute node (ovn-sbctl list Chassis), OVN also has to add a Chassis_Private record for the metadata-agent (the agent-id for Chassis_Private is derived from the Chassis system-id). The code below assumes self.chassis_id is a UUID, but a charm deployment does not necessarily use a UUID; this produces that the neutron-ovn-metadata-agent daemon stays running but disconnected, and newly launched VMs won't have access to the metadata service (a quick illustration of the UUID requirement is sketched after the listings below).

        def register_metadata_agent(self):
            # NOTE(lucasagomes): db_add() will not overwrite the UUID if
            # it's already set.
            table = ('Chassis_Private' if self.has_chassis_private else 'Chassis')
            # Generate unique, but consistent metadata id for chassis name
            agent_id = uuid.uuid5(self.chassis_id, 'metadata_agent')
            ext_ids = {ovn_const.OVN_AGENT_METADATA_ID_KEY: str(agent_id)}
            self.sb_idl.db_add(table, self.chassis, 'external_ids',
                               ext_ids).execute(check_error=True)
    
    # ovn-sbctl list Chassis_Private
    _uuid               : 45524f21-0ca2-4736-a851-e5db69dad805
    chassis             : 4e2fabf4-3834-4137-8454-3525e541d652
    external_ids        : {"neutron:ovn-metadata-id"="26e2474c-75fc-5977-a311-828b9ad92180", "neutron:ovn-metadata-sb-cfg"="5"}
    name                : juju-3df241-ovn-8.cloud.sts
    nb_cfg              : 5
    nb_cfg_timestamp    : 1661337217145
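
    A quick, hypothetical illustration of why a hostname-style system-id trips up the UUID-based agent-id generation (the values are examples from my test environment, not from the customer environment):

    $ python3 -c "import uuid; print(uuid.UUID('4fe633c2-0c52-5c86-92b2-0ee1f5e71a90'))"
    4fe633c2-0c52-5c86-92b2-0ee1f5e71a90
    $ python3 -c "import uuid; uuid.UUID('juju-3df241-ovn-8.cloud.sts')"
    Traceback (most recent call last):
      ...
    ValueError: badly formed hexadecimal UUID string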
    

    I ran the following experiment but could not reproduce the problem; maybe a single attempt is not enough to reproduce it reliably?

    $ juju ssh nova-compute/0 sudo ip netns | grep ovnmeta
    ovnmeta-0211506b-233e-4773-a034-3950dfefe23d
    ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 (id: 0)
    
    $ openstack server delete focal-103100
    $ juju ssh nova-compute/0 sudo ip netns | grep ovnmeta
    
    
    $ juju status ovn-central | grep leader
    ovn-central  21.09.0  active      3  ovn-central  21.09/stable   24  no       Unit is ready (leader: ovnnb_db)
    ovn-central/0   active    idle   9        10.5.1.22       6641/tcp,6642/tcp  Unit is ready (leader: ovnnb_db)
    ovn-central/2*  active    idle   11       10.5.3.101      6641/tcp,6642/tcp  Unit is ready (leader: ovnsb_db)
    $ juju ssh $(juju status ovn-central | grep leader | tail -n 1 | awk '{print $1}' | tr -d '*') sudo reboot
    #wait a few minutes
    $ juju status ovn-central | grep leader
    ovn-central  21.09.0  active      3  ovn-central  21.09/stable   24  no       Unit is ready (leader: ovnnb_db)
    ovn-central/0   active    idle   9        10.5.1.22       6641/tcp,6642/tcp  Unit is ready (leader: ovnnb_db)
    ovn-central/1   active    idle   10       10.5.0.186      6641/tcp,6642/tcp  Unit is ready (leader: ovnsb_db northd: active)
    
    $ ./tools/instance_launch.sh 1 focal
    $ ./tools/float_all.sh
    $ juju ssh nova-compute/0 sudo ip netns | grep ovnmeta
    ovnmeta-8ea7d54e-5fe0-408d-9de8-5ccbed8ede84 (id: 0)
    

    When the Port_Binding table in OVN is updated, it triggers the PortBindingUpdateVirtualPortsEvent event in neutron, and then neutron's update_virtual_port_host creates the PortBinding record on the neutron side.

    #https://bugs.launchpad.net/neutron/+bug/1951841/comments/28
    neutron-ovn-metadata-agent used to create Chassis_Private with a random agent-id on every start; this patch (https://github.com/openstack/neutron/commit/da3ce7319866e8dc874d405e91f9af160e2c3d31) changed it to stop using a random one, but it uses (chassis_id = uuid.UUID(self._get_own_chassis_name()) && agent_id = uuid.uuid5(chassis_id, 'metadata_agent')), which requires _get_own_chassis_name to return a UUID, otherwise an exception is thrown.
    Why doesn't the metadata-agent go back to SB_DB and register Chassis_Private again after the exception is thrown? Because when ChassisCreateEventBase#run (which calls "self.agent.register_metadata_agent") runs for the first time, first_time=true, so it does not call self.agent.register_metadata_agent (a way to check whether registration ever succeeded is sketched after the snippet below).

        def run(self, event, row, old):
            if self.first_time:
                self.first_time = False
            else:
                # NOTE(lucasagomes): Re-register the ovn metadata agent
                # with the local chassis in case its entry was re-created
                # (happens when restarting the ovn-controller)
                self.agent.register_metadata_agent()
                LOG.info("Connection to OVSDB established, doing a full sync")
                self.agent.sync()
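
    A hedged way to check, without restarting anything, whether the agent ever managed to register itself (run on an ovn-central unit and on a cloud client respectively; the table and key names are the ones shown above):

    $ sudo ovn-sbctl --columns=name,external_ids list Chassis_Private
    $ openstack network agent list   # the 'OVN Metadata agent' entry for the host should be alive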
    

    The fix was to support a non-UUID system-id - https://opendev.org/openstack/neutron/commit/6da4432fed255f3bcf3831f5d0520ab389ce36e5

    But that code is already in 19.1.0 and the customer runs 19.3.0, so this is clearly not that bug.

    Ruling out lp bug 1953510

    #https://bugs.launchpad.net/neutron/+bug/1953510/comments/2
    On the ovn-central side a leader switch always happens when writing a snapshot - https://github.com/openvswitch/ovs/commit/3c2d6274bceecb65ec8f2f93f2aac26897a7ddfe

    On the SB_DB side we see logs like:

    $ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'leadership'
    sosreport-juju-2752e1-6-lxd-24-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-6-lxd-24-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80967|raft|INFO|received leadership transfer from 9b15 in term 295
    sosreport-juju-2752e1-6-lxd-24-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|92695|raft|INFO|received leadership transfer from 0f3c in term 294
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
    

    The neutron side also shows logs; does this mean the leader switch happened exactly while Chassis_Private was being written?

    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command
    2022-07-11 16:37:25.780 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connected
    2022-07-11 16:37:25.817 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: clustered database server is not cluster leader; trying another server
    

    Further flow analysis

    1, chassis_run in ovn-controller reports the chassis information and creates the Chassis record in SB_DB.
    
    2, sbrec_port_binding_insert in ovn-northd is responsible for inserting the ports from NB_DB (both lsp and lrp) into the SB_DB Port_Binding table, see: ovn-northd source code analysis - http://events.jianshu.io/p/e2139ecd0463
    
    ovn-northd -vconsole:emer -vsyslog:err -vfile:info --ovnnb-db=ssl:100.94.0.158:6641,ssl:100.94.0.99:6641,ssl:100.94.0.204:6641 --ovnsb-db=ssl:100.94.0.158:16642,ssl:100.94.0.99:16642,ssl:100.94.0.204:16642 -c /etc/ovn/cert_host -C /etc/ovn/ovn-central.crt -p /etc/ovn/key_host --no-chdir --log-file=/var/log/ovn/ovn-northd.log --pidfile=/var/run/ovn/ovn-northd.pid --detach
    
    2022-08-18T09:15:46.193Z|28332|ovsdb_cs|INFO|ssl:100.94.0.204:16642: clustered database server is not cluster leader; trying another server
    2022-08-18T09:15:46.193Z|28333|reconnect|INFO|ssl:100.94.0.204:16642: connection attempt timed out
    2022-08-18T09:15:46.194Z|28334|reconnect|INFO|ssl:100.94.0.99:16642: connecting...
    2022-08-18T09:15:46.199Z|28335|reconnect|INFO|ssl:100.94.0.99:16642: connected
    2022-08-18T09:15:46.201Z|28336|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
    2022-08-18T09:15:46.202Z|28337|northd|WARN|Dropped 3 log messages in last 86113 seconds (most recently, 86113 seconds ago) due to excessive rate
    2022-08-18T09:15:46.202Z|28338|northd|WARN|Chassis does not exist for Chassis_Private record, name: juju-2752e1-5-lxd-24.maas
    2022-08-18T09:15:46.202Z|28339|ovsdb_cs|INFO|ssl:100.94.0.99:16642: clustered database server is not cluster leader; trying another server
    2022-08-18T09:15:46.203Z|28340|reconnect|INFO|ssl:100.94.0.99:16642: connection attempt timed out
    2022-08-18T09:15:46.203Z|28341|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
    2022-08-18T09:15:46.203Z|28342|reconnect|INFO|ssl:100.94.0.158:16642: connecting...
    2022-08-18T09:15:46.207Z|28343|reconnect|INFO|ssl:100.94.0.158:16642: connected
    2022-08-18T09:15:46.209Z|28344|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
    2022-08-18T09:15:46.210Z|28345|ovsdb_cs|INFO|ssl:100.94.0.158:16642: clustered database server is not cluster leader; trying another server
    2022-08-18T09:15:46.210Z|28346|reconnect|INFO|ssl:100.94.0.158:16642: connection attempt timed out
    2022-08-18T09:15:46.210Z|28347|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
    2022-08-18T09:15:46.210Z|28348|reconnect|INFO|ssl:100.94.0.204:16642: connecting...
    2022-08-18T09:15:46.215Z|28349|reconnect|INFO|ssl:100.94.0.204:16642: connected
    
    3, ovn-northd keeps producing the logs above showing it cannot stay connected to SB_DB, so it cannot insert the Port_Binding; then neutron never receives the PortBindingUpdateVirtualPortsEvent event, and so neutron never creates the PortBinding record via update_virtual_port_host either.
    
    4, As a result, _get_instance_and_project_id in neutron-ovn-metadata-agent finds 0 Port_Binding rows, hence:
    var/log/neutron/neutron-ovn-metadata-agent.log.1:2022-07-20 21:15:56.758 3872 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.40
    
    5, The SB_DB leader keeps switching; if a VM happens to be created during a switch, the following error may be reported.
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
    
    6, This kind of log is also because the Port_Binding does not exist, so the external_ids in Port_Binding cannot be read - Can't read ovn-bridge external-id from OVSDB
    
    7, The Port_Binding does not exist, which brings us back to step 2 above (Port_Binding is created by ovn-northd, and ovn-northd cannot reach SB_DB), so this is an SB_DB problem.
    
    8, Why is SB_DB having problems?
    
    $ grep -r 'leader' sosreport-juju-2752e1-*/var/log/ovn/ovsdb-server-sb.log
    sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.863Z|31963|raft|INFO|server 0f3c is leader for term 294
    sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|32086|raft|INFO|server 9b15 is leader for term 295
    sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.613Z|32088|raft|INFO|server 0f3c is leader for term 296
    sosreport-juju-2752e1-5-lxd-23--2022-08-18-bnsdhsj/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.379Z|32327|raft|INFO|server 9b15 is leader for term 297
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.862Z|80388|raft|INFO|term 294: elected leader by 2+ of 3 servers
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.324Z|80963|raft|INFO|rejected append_reply (not leader)
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.609Z|80964|raft|INFO|rejected append_reply (not leader)
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.609Z|80965|raft|INFO|rejected append_reply (not leader)
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80966|raft|INFO|server 9b15 is leader for term 295
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80967|raft|INFO|received leadership transfer from 9b15 in term 295
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.613Z|80969|raft|INFO|term 296: elected leader by 2+ of 3 servers
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.367Z|82383|raft|INFO|rejected append_reply (not leader)
    sosreport-juju-2752e1-6-lxd-24--2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|82384|raft|INFO|server 9b15 is leader for term 297
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:51.200Z|92562|raft|INFO|server 0f3c is leader for term 294
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|92695|raft|INFO|received leadership transfer from 0f3c in term 294
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92697|raft|INFO|term 295: elected leader by 2+ of 3 servers
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.334Z|92699|raft|INFO|rejected append_reply (not leader)
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.728Z|92700|raft|INFO|server 0f3c is leader for term 296
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
    sosreport-juju-2752e1-7-lxd-27--2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.378Z|92895|raft|INFO|term 297: elected leader by 2+ of 3 servers
    

    And then this turned up:

    $ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'inactivity'
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-nb.log:2022-08-18T09:15:49.882Z|657323|reconnect|ERR|ssl:100.94.0.99:40308: no response to inactivity probe after 5 seconds, disconnecting
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-nb.log:2022-08-18T09:15:49.883Z|657324|reconnect|ERR|ssl:100.94.0.158:51592: no response to inactivity probe after 5 seconds, disconnecting
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:47.313Z|92550|reconnect|ERR|ssl:100.94.0.158:6644: no response to inactivity probe after 5 seconds, disconnecting
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.815Z|92555|reconnect|ERR|ssl:100.94.0.99:45744: no response to inactivity probe after 5 seconds, disconnecting
    sosreport-juju-2752e1-7-lxd-27-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:49.816Z|92556|reconnect|ERR|ssl:100.94.0.158:44928: no response to inactivity probe after 5 seconds, disconnecting
    

    See my earlier article '20220427 - ovn performance issue' - https://blog.csdn.net/quqi99/article/details/103194137 ; the following is worth trying (a verification loop is sketched after the commands below):

    # on compute chassis
    ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=0
    ovs-vsctl set open . external_ids:ovn-monitor-all=true
    ovs-vsctl set open . external_ids:ovn-remote-probe-interval=100000
    $ sudo ovs-vsctl get open . external_ids
    {hostname=juju-3df241-ovn-8.cloud.sts, ovn-bridge-mappings="physnet1:br-data", ovn-encap-ip="10.5.0.110", ovn-encap-type=geneve, ovn-match-northd-version="true", ovn-remote="ssl:10.5.1.22:6642,ssl:10.5.0.186:6642,ssl:10.5.3.101:6642", rundir="/var/run/openvswitch", system-id=juju-3df241-ovn-8.cloud.sts}
    
    #on SB_DB and NB_DB
    ovn-nbctl set connection . inactivity_probe='180000'
    ovn-sbctl set connection . inactivity_probe='180000'
    ovn-nbctl get connection . inactivity_probe
    
    #juju
    juju config ovn-central ovsdb-server-inactivity-probe
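
    A hedged follow-up to confirm the probe value actually landed on every ovn-central unit (unit numbers are from my test environment; --no-leader-only lets the read go to a follower as well):

    $ for u in 0 1 2; do juju ssh ovn-central/$u sudo ovn-sbctl --no-leader-only get connection . inactivity_probe; done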
    

    Confirming the SB_DB problem

    You can run "ovn-appctl -t ovn-northd status" on every ovn-central unit; it tells you whether that unit's ovn-northd is active or standby, and if it hangs then ovn-northd is in a bad state and may need to be restarted. This helps rule out whether all ovn-central units are affected (see the small loop sketched below).
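
    A hedged sketch of that check across all units (unit numbers are from my test environment):

    $ for u in 0 1 2; do juju ssh ovn-central/$u sudo ovn-appctl -t ovn-northd status; done
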
    There are also the following commands:

    #https://github.com/sosreport/sos/commit/499b1ebd1341b524eca294762332939fe0408416
    #cat sos_commands/ovn_central/ovs-appctl_-t_.var.run.ovn.ovnnb_db.ctl_cluster.status_OVN_Northbound
    $ ls /var/run/ovn/*.ctl
    $ sudo ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
    2394
    Name: OVN_Northbound
    Cluster ID: 74bc (74bcea93-e506-4c88-97bf-e0094b9ca634)
    Server ID: 2394 (23940573-b5be-465a-9e0f-b3ab9941335d)
    Address: ssl:10.5.3.101:6643
    Status: cluster member
    Role: follower
    Term: 7
    Leader: ad8a
    Vote: ad8a
    Last Election started 5567642 ms ago, reason: timeout
    Election timer: 4000
    Log: [93, 100]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections: ->f38b ->ad8a <-f38b <-ad8a
    Disconnections: 0
    Servers:
        f38b (f38b at ssl:10.5.1.22:6643) last msg 5564502 ms ago
        ad8a (ad8a at ssl:10.5.0.186:6643) last msg 197 ms ago
        2394 (2394 at ssl:10.5.3.101:6643) (self)
    $ sudo ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
    3dd6
    Name: OVN_Southbound
    Cluster ID: 0993 (09934a9f-1948-48ec-9cba-da7c6efba970)
    Server ID: 3dd6 (3dd60330-5607-4817-bc53-1e08ad972c02)
    Address: ssl:10.5.3.101:6644
    Status: cluster member
    Role: follower
    Term: 10
    Leader: 9c8f
    Vote: self
    Last Election started 5620622 ms ago, reason: timeout
    Election timer: 4000
    Log: [114, 125]
    Entries not yet committed: 0
    Entries not yet applied: 0
    Connections: ->8ff3 ->9c8f <-9c8f <-8ff3
    Disconnections: 0
    Servers:
        3dd6 (3dd6 at ssl:10.5.3.101:6644) (self)
        8ff3 (8ff3 at ssl:10.5.0.186:6644) last msg 5619344 ms ago
        9c8f (9c8f at ssl:10.5.1.22:6644) last msg 222 ms ago
    

    Event Timeline

    1, on 2022-07-11, SB_DB must have been bad, so neutron-ovn-metadata-agent reported an error 'exceeded timeout 180 seconds, cause: Result queue is empty'
    
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.exceptions.TimeoutException: Commands [DbSetCommand(table=Chassis_Private, record=xxx.maas, col_values=(('external_ids', {'neutron:ovn-metadata-sb-cfg': '31205'}),))] exceeded timeout 180 seconds, cause: Result queue is empty
    2022-07-11 16:37:25.769 2361 ERROR ovsdbapp.backend.ovs_idl.command
    2022-07-11 16:37:25.780 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connected
    2022-07-11 16:37:25.817 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: clustered database server is not cluster leader; trying another server
    2022-07-11 16:37:25.818 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.158:6642: connection closed by client
    2022-07-11 16:37:33.831 2361 INFO ovsdbapp.backend.ovs_idl.vlog [-] ssl:100.94.0.204:6642: connected
    2022-07-11 16:37:34.566 2361 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync
    
    This error should have nothing to do with this issue because it happened on 2022-07-11, not on 2022-08-18
    
    2, on 2022-08-18, SB_DB was good, because I only found the error 'exceeded timeout 180 seconds, cause: Result queue is empty' in sosreport-xxx-2022-07-21-baldxuf, not in sosreport-xxx-2022-08-18-bgnovqu
    
    3, at 2022-08-18T09:15:47, one inactivity probe error happened
    
    var/log/ovn/ovsdb-server-sb.log:2022-08-18T09:15:47.313Z|92550|reconnect|ERR|ssl:100.94.0.158:6644: no response to inactivity probe after 5 seconds, disconnecting
    
    The sosreport shows 100.94.0.204 was the leader at that time, so this inactivity probe means the leader 100.94.0.204 didn't receive a reply from 100.94.0.158:6644 within 5 seconds and disconnected the TCP connection between 100.94.0.204 and 100.94.0.158 to avoid wasting resources. So it should not be the key to the problem.
    
        5-lxd-23: f801 100.94.0.99:6644 follower term=297
        6-lxd-24: 0f3c 100.94.0.158:6644 follower term=297
        7-lxd-27: 9b15 100.94.0.204:6644 leader term=297  leader
    
    4, at 2022-08-18 09:15:50, neutron-ovn-metadata-agent started to call self.agent.register_metadata_agent() to create Chassis_Private in SB_DB
    
    $ find sosreport-srv1ibm00*/var/log/neutron/* |xargs zgrep -i -E 'Connection to OVSDB established, doing a full'
    xxx/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 09:15:50.726 75556 INFO neutron.agent.ovn.metadata.agent [-] Connection to OVSDB established, doing a full sync
        def run(self, event, row, old):
            if self.first_time:
                self.first_time = False
            else:
                # NOTE(lucasagomes): Re-register the ovn metadata agent
                # with the local chassis in case its entry was re-created
                # (happens when restarting the ovn-controller)
                self.agent.register_metadata_agent()
                LOG.info("Connection to OVSDB established, doing a full sync")
                self.agent.sync()
    
    5, at 2022-08-18T12:30:35, re-election took place because of the snapshot-triggered leadership transfer
    
    $ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'received leadership transfer from'
    sosreport-juju-2752e1-6-lxd-24-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.610Z|80967|raft|INFO|received leadership transfer from 9b15 in term 295
    sosreport-juju-2752e1-7-lxd-27-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|92695|raft|INFO|received leadership transfer from 0f3c in term 294
    sosreport-juju-2752e1-7-lxd-27-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.025Z|92893|raft|INFO|received leadership transfer from 0f3c in term 296
    
    # reelection was caused by snapshot factor according to timestamp
    $ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'Transferring leadership'
    sosreport-juju-2752e1-6-lxd-24-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-6-lxd-24-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
    sosreport-juju-2752e1-7-lxd-27-xxx/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.
    
    6, at 2022-08-18 12:30:36, neutron-ovn-metadata-agent couldn't query ports from SB_DB
    
    $ find sosreport-srv1ibm00*/var/log/neutron/* |xargs zgrep -i -E 'OVSDB Error'
    sosreport-srv1ibm002d-xxx/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.103 75556 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
    sosreport-srv1ibm006d-xxx/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.104 2171 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
    
    7, at 2022-08-18 12:57:53, metadata port can't be created
    
    2022-08-18 12:57:53.800 3730 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.135
    
    8, at 2022-08-18 18:30:01, the workaround was run to restart neutron-ovn-metadata-agent.service
    

    So the problem is that neutron was creating the metadata port exactly while SB_DB was transferring leadership before writing a snapshot. We can of course lower the snapshot frequency (https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1990978 - this feature is available in ovs 2.17.3, while the customer runs focal-xena: 2.16.0-0ubuntu2.1~cloud0). But even while SB_DB is transferring leadership, creating the metadata port should not simply fail; lp bug https://bugs.launchpad.net/neutron/+bug/1961013 lets neutron connect to a read-only non-leader node to reduce the failure rate.
    But creating the metadata port is a write in neutron, isn't it?

  • Original article: https://blog.csdn.net/quqi99/article/details/126520382