H2O XGBoost crashes with the local server dying or hanging (?)
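I will need to do some work to build a smaller test case, and I also have to get permission to release the data (after anonymizing it), but H2O crashes on me consistently with this data and these parameters. (Training often succeeds with other combinations of input features and parameters, but it seems to always fail with the features and parameters below.)
The data has 12,847,393 rows (could that be what causes the problem?).
Here is the ugly stack trace I get. (It seems to be reproducible.)
Filed as a bug: https://0xdata.atlassian.net/projects/PUBDEV/issues/PUBDEV-7321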
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
599 body=body, headers=headers,
--> 600 chunked=chunked)
601
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
383 # otherwise it looks like a programming error was the cause.
--> 384 six.raise_from(e, None)
385 except (SocketTimeout, BaseSSLError, SocketError) as e:
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
379 try:
--> 380 httplib_response = conn.getresponse()
381 except Exception as e:
/usr/lib/python3.6/http/client.py in getresponse(self)
1345 try:
-> 1346 response.begin()
1347 except ConnectionError:
/usr/lib/python3.6/http/client.py in begin(self)
306 while True:
--> 307 version, status, reason = self._read_status()
308 if status != CONTINUE:
/usr/lib/python3.6/http/client.py in _read_status(self)
267 def _read_status(self):
--> 268 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
269 if len(line) > _MAXLINE:
/usr/lib/python3.6/socket.py in readinto(self, b)
585 try:
--> 586 return self._sock.recv_into(b)
587 except timeout:
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
ProtocolError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
637 retries = retries.increment(method, url, error=e, _pool=self,
--> 638 _stacktrace=sys.exc_info()[2])
639 retries.sleep()
/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
366 if read is False or not self._is_method_retryable(method):
--> 367 raise six.reraise(type(error), error, _stacktrace)
368 elif read is not None:
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in reraise(tp, value, tb)
684 if value.__traceback__ is not tb:
--> 685 raise value.with_traceback(tb)
686 raise value
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
599 body=body, headers=headers,
--> 600 chunked=chunked)
601
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
383 # otherwise it looks like a programming error was the cause.
--> 384 six.raise_from(e, None)
385 except (SocketTimeout, BaseSSLError, SocketError) as e:
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
379 try:
--> 380 httplib_response = conn.getresponse()
381 except Exception as e:
/usr/lib/python3.6/http/client.py in getresponse(self)
1345 try:
-> 1346 response.begin()
1347 except ConnectionError:
/usr/lib/python3.6/http/client.py in begin(self)
306 while True:
--> 307 version, status, reason = self._read_status()
308 if status != CONTINUE:
/usr/lib/python3.6/http/client.py in _read_status(self)
267 def _read_status(self):
--> 268 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
269 if len(line) > _MAXLINE:
/usr/lib/python3.6/socket.py in readinto(self, b)
585 try:
--> 586 return self._sock.recv_into(b)
587 except timeout:
ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
473 headers=headers, timeout=self._timeout, stream=stream,
--> 474 auth=self._auth, verify=verify, proxies=self._proxies)
475 if isinstance(save_to, types.FunctionType):
/usr/local/lib/python3.6/dist-packages/requests/api.py in request(method, url, **kwargs)
59 with sessions.Session() as session:
---> 60 return session.request(method=method, url=url, **kwargs)
61
/usr/local/lib/python3.6/dist-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
534
/usr/local/lib/python3.6/dist-packages/requests/sessions.py in send(self, request, **kwargs)
645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
647
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
497 except (ProtocolError, socket.error) as err:
--> 498 raise ConnectionError(err, request=request)
499
ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
H2OConnectionError Traceback (most recent call last)
<ipython-input-1-56b68eefa416> in <module>
87 start_time = time.time()
88 model = H2OXGBoostEstimator(**param)
---> 89 model.train(x=x, y="y", training_frame=hdf6)
90 elapsed_time = time.time() - start_time
/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
110 self._train(x=x, y=y, training_frame=training_frame, offset_column=offset_column, fold_column=fold_column,
111 weights_column=weights_column, validation_frame=validation_frame, max_runtime_secs=max_runtime_secs,
--> 112 ignored_columns=ignored_columns, model_id=model_id, verbose=verbose)
113
114
/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in _train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose, extend_parms_fn)
263 return
264
--> 265 model.poll(poll_updates=self._print_model_scoring_history if verbose else None)
266 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
267 self._resolve_model(model.dest_key, model_json)
/usr/local/lib/python3.6/dist-packages/h2o/job.py in poll(self, poll_updates)
58 pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self))
59 else:
---> 60 pb.execute(self._refresh_job_status)
61 except StopIteration as e:
62 if str(e) == "cancelled":
/usr/local/lib/python3.6/dist-packages/h2o/utils/progressbar.py in execute(self, progress_fn, print_verbose_info)
169 # Query the progress level, but only if it's time already
170 if self._next_poll_time <= now:
--> 171 res = progress_fn() # may raise StopIteration
172 assert_is_type(res, (numeric, numeric), numeric)
173 if not isinstance(res, tuple):
/usr/local/lib/python3.6/dist-packages/h2o/job.py in _refresh_job_status(self)
96 def _refresh_job_status(self):
97 if self._poll_count <= 0: raise StopIteration("")
---> 98 jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
99 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
100 self.status = self.job["status"]
/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
121 # type checks are performed in H2OConnection class
122 _check_connection()
--> 123 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
124
125
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
481 if self._local_server and not self._local_server.is_running():
482 self._log_end_exception("Local server has died.")
--> 483 raise H2OConnectionError("Local server has died unexpectedly. RIP.")
484 else:
485 self._log_end_exception(e)
H2OConnectionError: Local server has died unexpectedly. RIP.
Parameters passed:
param = {
    "ntrees": 15,
    "min_rows": 5,
    "max_depth": 5,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "seed": 42,
    "score_tree_interval": 100,
}
There are 14 input columns, 5 of which are categorical features.
I initialize H2O like this:
h2o.init(
    strict_version_check=False,
    # nthreads=1,  # Crashes with either 1 or 4 threads.
    log_dir="/tmp/clem-h2o/",
    log_level='TRACE'
)
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_242"; OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08); OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpkeo9aau1
JVM stdout: /tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: 01 secs
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.3
H2O cluster version age: 14 days, 3 hours and 57 minutes
H2O cluster name: H2O_from_python_unknownUser_xuimzh
H2O cluster total nodes: 1
H2O cluster free memory: 4.445 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.6.9 final
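A rough sizing check may help put the row count next to the 4.445 Gb of free memory reported above. The sketch below only does the arithmetic for a dense matrix of the stated shape. It assumes 4-byte floats per value and no expansion of the 5 categorical columns; both are assumptions, not measurements, and one-hot encoding of high-cardinality categoricals could make the real figure much larger. The function name `dense_matrix_gib` is hypothetical.

```python
def dense_matrix_gib(rows, cols, bytes_per_value=4):
    """Size in GiB of a dense rows x cols matrix of fixed-width values."""
    return rows * cols * bytes_per_value / 2**30


ROWS = 12_847_393  # row count stated in the question
COLS = 14          # input columns stated in the question

# Assumption: one 4-byte float per cell, categoricals not expanded.
estimate = dense_matrix_gib(ROWS, COLS)
print(f"{estimate:.2f} GiB")
```

As far as I understand, the native XGBoost library allocates its working memory outside the Java heap, so the relevant limit would be the memory left on the machine after the JVM has taken its share, not only the heap figure in the banner.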
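Unfortunately, the warning, error, and fatal log files are empty. I can't release the other log files until I have anonymized the feature names...
I don't know whether there are other debugging switches that would help diagnose the problem.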
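One place that might still hold a clue is the pair of JVM stdout/stderr files named in the startup banner, since they capture what the Java process printed just before it died. The sketch below is a minimal, stdlib-only way to look at the end of those files. The helpers `tail` and `find_crash_reports` are hypothetical names, and the search for `hs_err_pid*.log` assumes the HotSpot JVM wrote a fatal error log into the directory being searched, which may not be the case here.

```python
from collections import deque
from pathlib import Path


def tail(path, n=50):
    """Return the last n lines of a text file, or [] if it does not exist."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open(errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


def find_crash_reports(directory):
    """List HotSpot fatal error logs (hs_err_pid*.log) in a directory."""
    return sorted(Path(directory).glob("hs_err_pid*.log"))


# Paths copied from the startup banner above; they change on every run.
ICE_ROOT = "/tmp/tmpkeo9aau1"
for name in ("h2o_unknownUser_started_from_python.out",
             "h2o_unknownUser_started_from_python.err"):
    for line in tail(Path(ICE_ROOT) / name, n=30):
        print(line)

for report in find_crash_reports("."):
    print("possible JVM crash report:", report)
```

If the process was killed from outside (for example by the kernel's out-of-memory killer), these files may end abruptly with no error at all, in which case the system log would be the next place to check.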
Here is the version information:
H2O Version: 3.28.0.3
Python 3.6.9
Ubuntu 18.04.3 LTS
Answer
I ran into exactly the same problem, even on a machine with 256 GB of RAM and a 48-core CPU.