H2O XGBoost crashes with local server died or hung (?)

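I will need to do some work to build a smaller test case, and I have to get permission to release the data (after anonymizing it), but H2O consistently crashes on me with this data and these parameters. (Training usually succeeds with other combinations of input features and parameters, but it seems to fail every time with the features and parameters below.)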
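Since only some feature combinations trigger the crash, one way to arrive at a smaller test case is to drop features one at a time and keep each removal only if the crash still reproduces. The sketch below is a generic greedy reduction, not anything H2O provides. The `still_fails` callback is hypothetical: it would have to start a fresh H2O session, train on the given columns, and report whether the server died.

```python
from typing import Callable, List, Sequence


def shrink_failing_features(
    features: Sequence[str],
    still_fails: Callable[[List[str]], bool],
) -> List[str]:
    """Greedily drop features while the failure keeps reproducing.

    `still_fails` is a hypothetical callback: it should train on the given
    columns in a fresh session and return True when the crash occurs.
    """
    kept = list(features)
    for name in list(features):
        candidate = [f for f in kept if f != name]
        # Keep the smaller set only if it is non-empty and still crashes.
        if candidate and still_fails(candidate):
            kept = candidate
    return kept


def fake_predicate(cols: List[str]) -> bool:
    """Stand-in for illustration: pretend the crash needs both c1 and c3."""
    return {"c1", "c3"} <= set(cols)


print(shrink_failing_features(["c0", "c1", "c2", "c3", "c4"], fake_predicate))
```

Each trial costs one full training run plus a server restart, so with 14 input columns this is 14 runs at most. The same loop could be applied to row subsets once the column set has been reduced.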

The data has 12,847,393 rows (could this be what causes the problem?)
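To judge whether the row count alone is plausible as a cause, a back-of-the-envelope size estimate helps. The sketch below assumes a dense matrix with one cell per row and input column and tries two cell widths (8-byte doubles and 4-byte floats). Both widths are assumptions for illustration; the real in-memory layout used by H2O and XGBoost may differ.

```python
ROWS = 12_847_393   # row count from the question
COLS = 14           # input columns from the question


def dense_size_gib(rows: int, cols: int, bytes_per_value: int) -> float:
    """Size of a dense rows x cols matrix in GiB."""
    return rows * cols * bytes_per_value / 1024 ** 3


# Assumption: 8-byte doubles and 4-byte floats as two plausible cell widths.
as_float64 = dense_size_gib(ROWS, COLS, 8)
as_float32 = dense_size_gib(ROWS, COLS, 4)
print(f"float64: {as_float64:.2f} GiB, float32: {as_float32:.2f} GiB")
```

Under these assumptions the raw inputs come to roughly 1.3 GiB, which is below the 4.445 Gb of free memory shown in the cluster summary further down. The estimate ignores any widening of the 5 categorical columns: if they are one-hot encoded, the column count grows with their cardinalities, which the question does not state.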

Here is the ugly stack trace I get. (It seems to be reproducible.)

Filed as a bug: https://0xdata.atlassian.net/projects/PUBDEV/issues/PUBDEV-7321

---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599                                                   body=body, headers=headers,
--> 600                                                   chunked=chunked)
    601 

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    383                     # otherwise it looks like a programming error was the cause.
--> 384                     six.raise_from(e, None)
    385         except (SocketTimeout, BaseSSLError, SocketError) as e:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    379                 try:
--> 380                     httplib_response = conn.getresponse()
    381                 except Exception as e:

/usr/lib/python3.6/http/client.py in getresponse(self)
   1345             try:
-> 1346                 response.begin()
   1347             except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self)
    306         while True:
--> 307             version, status, reason = self._read_status()
    308             if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self)
    267     def _read_status(self):
--> 268         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    269         if len(line) > _MAXLINE:

/usr/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    637             retries = retries.increment(method, url, error=e, _pool=self,
--> 638                                         _stacktrace=sys.exc_info()[2])
    639             retries.sleep()

/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    366             if read is False or not self._is_method_retryable(method):
--> 367                 raise six.reraise(type(error), error, _stacktrace)
    368             elif read is not None:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    684         if value.__traceback__ is not tb:
--> 685             raise value.with_traceback(tb)
    686         raise value

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599                                                   body=body, headers=headers,
--> 600                                                   chunked=chunked)
    601 

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    383                     # otherwise it looks like a programming error was the cause.
--> 384                     six.raise_from(e, None)
    385         except (SocketTimeout, BaseSSLError, SocketError) as e:

/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)

/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    379                 try:
--> 380                     httplib_response = conn.getresponse()
    381                 except Exception as e:

/usr/lib/python3.6/http/client.py in getresponse(self)
   1345             try:
-> 1346                 response.begin()
   1347             except ConnectionError:

/usr/lib/python3.6/http/client.py in begin(self)
    306         while True:
--> 307             version, status, reason = self._read_status()
    308             if status != CONTINUE:

/usr/lib/python3.6/http/client.py in _read_status(self)
    267     def _read_status(self):
--> 268         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    269         if len(line) > _MAXLINE:

/usr/lib/python3.6/socket.py in readinto(self, b)
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:

ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    473                                     headers=headers, timeout=self._timeout, stream=stream,
--> 474                                     auth=self._auth, verify=verify, proxies=self._proxies)
    475             if isinstance(save_to, types.FunctionType):

/usr/local/lib/python3.6/dist-packages/requests/api.py in request(method, url, **kwargs)
     59     with sessions.Session() as session:
---> 60         return session.request(method=method, url=url, **kwargs)
     61 

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    532         send_kwargs.update(settings)
--> 533         resp = self.send(prep, **send_kwargs)
    534 

/usr/local/lib/python3.6/dist-packages/requests/sessions.py in send(self, request, **kwargs)
    645         # Send the request
--> 646         r = adapter.send(request, **kwargs)
    647 

/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    497         except (ProtocolError, socket.error) as err:
--> 498             raise ConnectionError(err, request=request)
    499 

ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

H2OConnectionError                        Traceback (most recent call last)
<ipython-input-1-56b68eefa416> in <module>
     87 start_time = time.time()
     88 model  = H2OXGBoostEstimator(**param)
---> 89 model.train(x=x, y="y", training_frame=hdf6)
     90 elapsed_time = time.time() - start_time

/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
    110         self._train(x=x, y=y, training_frame=training_frame, offset_column=offset_column, fold_column=fold_column,
    111                     weights_column=weights_column, validation_frame=validation_frame, max_runtime_secs=max_runtime_secs,
--> 112                     ignored_columns=ignored_columns, model_id=model_id, verbose=verbose)
    113 
    114 

/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in _train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose, extend_parms_fn)
    263             return
    264 
--> 265         model.poll(poll_updates=self._print_model_scoring_history if verbose else None)
    266         model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
    267         self._resolve_model(model.dest_key, model_json)

/usr/local/lib/python3.6/dist-packages/h2o/job.py in poll(self, poll_updates)
     58                 pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self))
     59             else:
---> 60                 pb.execute(self._refresh_job_status)
     61         except StopIteration as e:
     62             if str(e) == "cancelled":

/usr/local/lib/python3.6/dist-packages/h2o/utils/progressbar.py in execute(self, progress_fn, print_verbose_info)
    169                 # Query the progress level, but only if it's time already
    170                 if self._next_poll_time <= now:
--> 171                     res = progress_fn()  # may raise StopIteration
    172                     assert_is_type(res, (numeric, numeric), numeric)
    173                     if not isinstance(res, tuple):

/usr/local/lib/python3.6/dist-packages/h2o/job.py in _refresh_job_status(self)
     96     def _refresh_job_status(self):
     97         if self._poll_count <= 0: raise StopIteration("")
---> 98         jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
     99         self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
    100         self.status = self.job["status"]

/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
    121     # type checks are performed in H2OConnection class
    122     _check_connection()
--> 123     return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
    124 
    125 

/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
    481             if self._local_server and not self._local_server.is_running():
    482                 self._log_end_exception("Local server has died.")
--> 483                 raise H2OConnectionError("Local server has died unexpectedly. RIP.")
    484             else:
    485                 self._log_end_exception(e)

H2OConnectionError: Local server has died unexpectedly. RIP.

Parameters passed:

param = {
      "ntrees" : 15
    , "min_rows" : 5
    , "max_depth" : 5
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "seed": 42
    , "score_tree_interval": 100
}

There are 14 input columns, 5 of which are categorical features.

I initialize H2O like this:

h2o.init(
    strict_version_check=False,
#    nthreads=1,   # Crashes either with 1 or 4 threads.
    log_dir="/tmp/clem-h2o/",
    log_level='TRACE'
)



Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_242"; OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08); OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
  Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpkeo9aau1
  JVM stdout: /tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: 01 secs
H2O cluster timezone:   Etc/UTC
H2O data parsing timezone:  UTC
H2O cluster version:    3.28.0.3
H2O cluster version age:    14 days, 3 hours and 57 minutes
H2O cluster name:   H2O_from_python_unknownUser_xuimzh
H2O cluster total nodes:    1
H2O cluster free memory:    4.445 Gb
H2O cluster total cores:    4
H2O cluster allowed cores:  4
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy:   'http': None, 'https': None
H2O internal security:  False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.6.9 final
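The summary above shows about 4.4 Gb of free memory, and the `h2o.init` call does not set `max_mem_size`, so the JVM heap size was presumably chosen by default. As far as I know, H2O's XGBoost allocates much of its working memory natively, outside the JVM heap, so one experiment worth trying is to cap the heap explicitly and leave the remainder of the RAM for native allocations. The helper below only builds the keyword arguments. `heap_fraction` and the 16 GB machine size are made-up values for illustration, not H2O recommendations.

```python
def h2o_init_kwargs(total_ram_gb: int, heap_fraction: float = 0.5) -> dict:
    """Build keyword arguments for h2o.init() that cap the JVM heap.

    heap_fraction is an illustrative assumption, not an H2O recommendation:
    the rest of the RAM is left for memory allocated outside the JVM heap.
    """
    if not 0.0 < heap_fraction < 1.0:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_gb = max(1, int(total_ram_gb * heap_fraction))
    return {
        "max_mem_size": f"{heap_gb}G",   # JVM heap cap, e.g. "8G"
        "strict_version_check": False,   # as in the question
        "log_dir": "/tmp/clem-h2o/",     # as in the question
        "log_level": "TRACE",            # as in the question
    }


kwargs = h2o_init_kwargs(total_ram_gb=16)   # 16 GB is a made-up machine size
print(kwargs["max_mem_size"])
# With H2O installed: import h2o; h2o.init(**kwargs)
```

This is a diagnostic experiment, not a known fix: if the crash persists across very different heap sizes, memory pressure becomes a less likely explanation.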

Unfortunately, the warn, error, and fatal log files are empty. I cannot release the other log files until I have anonymized the feature names...
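Anonymizing the feature names in the remaining log files can be scripted. The sketch below replaces whole-word occurrences of each name with a neutral alias (`f0`, `f1`, ...). The feature names in the example are hypothetical, and the whole-word matching assumes the names consist of letters, digits, and underscores.

```python
import re
from typing import Dict, Iterable


def build_alias_map(feature_names: Iterable[str]) -> Dict[str, str]:
    """Map each real feature name to a neutral alias such as 'f0', 'f1'."""
    return {name: f"f{i}" for i, name in enumerate(feature_names)}


def anonymize(text: str, aliases: Dict[str, str]) -> str:
    """Replace whole-word occurrences of feature names with their aliases."""
    if not aliases:
        return text
    # Longest names first so 'age_group' is tried before 'age'.
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: aliases[m.group(1)], text)


aliases = build_alias_map(["age", "age_group"])   # hypothetical names
print(anonymize("split on age_group, then age", aliases))
```

Keeping the alias map private makes it possible to translate any findings in the shared logs back to the real columns. Categorical level values may also appear in logs and would need the same treatment.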

I don't know whether there are other debug switches that could help diagnose the problem.
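The Python traceback only shows the client noticing that the server is gone, so the cause of death is more likely to show up in the JVM stdout and stderr files listed in the startup output. The sketch below scans such a file for strings that commonly accompany a JVM crash or memory exhaustion. The marker list is an assumption chosen for illustration, not an exhaustive catalogue.

```python
from pathlib import Path
from typing import List, Tuple

# Strings that commonly accompany a JVM crash or memory exhaustion.
# The list is an illustrative assumption, not an exhaustive catalogue.
CRASH_MARKERS = (
    "OutOfMemoryError",
    "A fatal error has been detected by the Java Runtime Environment",
    "SIGSEGV",
    "hs_err_pid",
)


def find_crash_lines(path: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that contain a crash marker."""
    hits = []
    text = Path(path).read_text(errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        if any(marker in line for marker in CRASH_MARKERS):
            hits.append((number, line.strip()))
    return hits


# Paths taken from the startup output in the question; skipped if absent.
for candidate in (
    "/tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.out",
    "/tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.err",
):
    if Path(candidate).exists():
        for number, line in find_crash_lines(candidate):
            print(f"{candidate}:{number}: {line}")
```

If the JVM itself crashed in native code, it normally also writes an `hs_err_pid<pid>.log` file to its working directory. If neither file shows anything, the process may have been killed from outside, for example by the kernel's out-of-memory killer, which would be recorded in the system log and not in H2O's logs.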

Here is the version information:

H2O Version:  3.28.0.3
Python 3.6.9
Ubuntu 18.04.3 LTS
Answer

I ran into exactly the same problem, even though my machine has 256 GB of RAM and a 48-core CPU.
