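TensorFlow distributed training process terminated abruptly (Killed)
Problem
This is the training function for the workers. Each worker receives gradients from the server and, after computing the weights and biases, sends its updated gradients back to the server. The code is as follows: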
def train():
    """Train CIFAR-10 for a number of steps."""
    g1 = tf.Graph()
    with g1.as_default():
        global_step = tf.contrib.framework.get_or_create_global_step()

        # Get images and labels for CIFAR-10.
        images, labels = cifar10.distorted_inputs()

        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)

        # Calculate loss.
        loss = cifar10.loss(logits, labels)

        grads = cifar10.train_part1(loss, global_step)

        only_gradients = [g for g, _ in grads]
        only_vars = [v for _, v in grads]

        placeholder_gradients = []

        # with tf.device("/gpu:0"):
        for grad_var in grads:
            placeholder_gradients.append(
                (tf.placeholder('float', shape=grad_var[0].get_shape()), grad_var[1]))

        feed_dict = {}
        for i, grad_var in enumerate(grads):
            feed_dict[placeholder_gradients[i][0]] = np.zeros(placeholder_gradients[i][0].shape)

        # Build a Graph that trains the model with one batch of examples and
        # updates the model parameters.
        train_op = cifar10.train_part2(global_step, placeholder_gradients)

        class _LoggerHook(tf.train.SessionRunHook):
            """Logs loss and runtime."""

            def begin(self):
                self._step = -1
                self._start_time = time.time()

            def before_run(self, run_context):
                self._step += 1
                return tf.train.SessionRunArgs(loss)  # Asks for loss value.

            def after_run(self, run_context, run_values):
                if self._step % FLAGS.log_frequency == 0:
                    current_time = time.time()
                    duration = current_time - self._start_time
                    self._start_time = current_time

                    loss_value = run_values.results
                    examples_per_sec = FLAGS.log_frequency * FLAGS.batch_size / duration
                    sec_per_batch = float(duration / FLAGS.log_frequency)

                    format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f '
                                  'sec/batch)')
                    print(format_str % (datetime.now(), self._step, loss_value,
                                        examples_per_sec, sec_per_batch))

        with tf.train.MonitoredTrainingSession(
                checkpoint_dir=FLAGS.train_dir,
                hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
                       tf.train.NanTensorHook(loss),
                       _LoggerHook()],
                config=tf.ConfigProto(
                    log_device_placement=FLAGS.log_device_placement,
                    gpu_options=gpu_options)) as mon_sess:
            global port
            while not mon_sess.should_stop():
                gradients = mon_sess.run(only_gradients, feed_dict=feed_dict)

                # pickling the gradients
                send_data = pickle.dumps(gradients, pickle.HIGHEST_PROTOCOL)
                # finding size of pickled gradients
                to_send_size = len(send_data)
                # Sending the size of the gradients first
                send_size = pickle.dumps(to_send_size, pickle.HIGHEST_PROTOCOL)
                s.sendall(send_size)
                # sending the gradients
                s.sendall(send_data)

                recv_size = safe_recv(17, s)
                recv_size = pickle.loads(recv_size)
                recv_data = safe_recv(recv_size, s)
                gradients2 = pickle.loads(recv_data)
                # print("Received gradients of size: ", len(recv_data))

                feed_dict = {}
                for i, grad_var in enumerate(gradients2):
                    feed_dict[placeholder_gradients[i][0]] = gradients2[i]
                res = mon_sess.run(train_op, feed_dict=feed_dict)
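The loop above frames each message as a pickled size followed by the pickled gradients, and reads a fixed 17 bytes for the incoming size. The pickled length of an integer is not the same for every value, so that fixed read is fragile. The following is a minimal sketch of a fixed-width length prefix built with `struct`; the helper names `send_msg`, `recv_msg` and `recv_exact` are hypothetical and are not taken from the original script, whose `safe_recv` is not shown.

```python
import socket
import struct

# Assumption: an 8-byte unsigned length in network byte order as the header.
HEADER = struct.Struct("!Q")


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes the socket early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)


def send_msg(sock, payload):
    """Send the payload length as a fixed-size header, then the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock):
    """Read the fixed-size header, then exactly that many payload bytes."""
    (size,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, size)


# Local round trip over a connected socket pair (no server needed).
a, b = socket.socketpair()
send_msg(a, b"x" * 1000)
echoed = recv_msg(b)
a.close()
b.close()
```

With this framing the receiver always knows how many header bytes to read, whatever the payload size is. In the worker, the payload would be the bytes returned by `pickle.dumps(gradients, ...)`.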
But when I run this code, the worker process is mysteriously terminated (Killed) after running for a few steps. This is the output:
Connecting to port 16001
Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
WARNING:tensorflow:From cifar10_train2.py:217: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.
W1206 14:32:34.514134 140128687093568 deprecation_wrapper.py:119] From cifar10_train2.py:217: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.
WARNING:tensorflow:From cifar10_train2.py:218: The name tf.gfile.DeleteRecursively is deprecated. Please use tf.io.gfile.rmtree instead.
W1206 14:32:34.515784 140128687093568 deprecation_wrapper.py:119] From cifar10_train2.py:218: The name tf.gfile.DeleteRecursively is deprecated. Please use tf.io.gfile.rmtree instead.
WARNING:tensorflow:From cifar10_train2.py:219: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.
W1206 14:32:34.521829 140128687093568 deprecation_wrapper.py:119] From cifar10_train2.py:219: The name tf.gfile.MakeDirs is deprecated. Please use tf.io.gfile.makedirs instead.
WARNING:tensorflow:From cifar10_train2.py:108: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step
W1206 14:32:35.968658 140128687093568 deprecation.py:323] From cifar10_train2.py:108: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:346: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit the .shuffle(...).
W1206 14:32:35.973685 140128687093568 deprecation.py:323] From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:346: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit the .shuffle(...).
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:278: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit the .shuffle(...).
W1206 14:32:35.978900 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:278: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs). If shuffle=False, omit the .shuffle(...).
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:190: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.from_tensors(tensor).repeat(num_epochs).
W1206 14:32:35.979965 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:190: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.from_tensors(tensor).repeat(num_epochs).
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
W1206 14:32:35.981818 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
W1206 14:32:35.983106 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/input.py:199: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:79: FixedLengthRecordReader.__init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.FixedLengthRecordDataset.
W1206 14:32:35.987178 140128687093568 deprecation.py:323] From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:79: FixedLengthRecordReader.__init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.FixedLengthRecordDataset.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:359: The name tf.random_crop is deprecated. Please use tf.image.random_crop instead.
W1206 14:32:36.005118 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:359: The name tf.random_crop is deprecated. Please use tf.image.random_crop instead.
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/image_ops_impl.py:1514: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Deprecated in favor of operator or tf.math.divide.
W1206 14:32:36.057567 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/image_ops_impl.py:1514: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version. Instructions for updating: Deprecated in favor of operator or tf.math.divide.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:126: shuffle_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size).
W1206 14:32:36.059510 140128687093568 deprecation.py:323] From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:126: shuffle_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version. Instructions for updating: Queue-based input pipelines have been replaced by tf.data. Use tf.data.Dataset.shuffle(min_after_dequeue).batch(batch_size).
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:135: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
W1206 14:32:36.075527 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10_input.py:135: The name tf.summary.image is deprecated. Please use tf.compat.v1.summary.image instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:374: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
W1206 14:32:36.079024 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:374: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:135: calling TruncatedNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor
W1206 14:32:36.079606 140128687093568 deprecation.py:506] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:135: calling TruncatedNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:111: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
W1206 14:32:36.080104 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:111: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:138: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.
W1206 14:32:36.089883 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:138: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:386: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
W1206 14:32:36.096186 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:386: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:572: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.
W1206 14:32:36.164064 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:572: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:577: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
W1206 14:32:36.170583 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:577: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
INFO:tensorflow:Summary name conv1/weight_loss (raw) is illegal; using conv1/weight_loss__raw_ instead.
I1206 14:32:36.224623 140128687093568 summary_op_util.py:66] Summary name conv1/weight_loss (raw) is illegal; using conv1/weight_loss__raw_ instead.
INFO:tensorflow:Summary name conv2/weight_loss (raw) is illegal; using conv2/weight_loss__raw_ instead.
I1206 14:32:36.230782 140128687093568 summary_op_util.py:66] Summary name conv2/weight_loss (raw) is illegal; using conv2/weight_loss__raw_ instead.
INFO:tensorflow:Summary name local3/weight_loss (raw) is illegal; using local3/weight_loss__raw_ instead.
I1206 14:32:36.234036 140128687093568 summary_op_util.py:66] Summary name local3/weight_loss (raw) is illegal; using local3/weight_loss__raw_ instead.
INFO:tensorflow:Summary name local4/weight_loss (raw) is illegal; using local4/weight_loss__raw_ instead.
I1206 14:32:36.236655 140128687093568 summary_op_util.py:66] Summary name local4/weight_loss (raw) is illegal; using local4/weight_loss__raw_ instead.
INFO:tensorflow:Summary name softmax_linear/weight_loss (raw) is illegal; using softmax_linear/weight_loss__raw_ instead.
I1206 14:32:36.239247 140128687093568 summary_op_util.py:66] Summary name softmax_linear/weight_loss (raw) is illegal; using softmax_linear/weight_loss__raw_ instead.
INFO:tensorflow:Summary name cross_entropy (raw) is illegal; using cross_entropy__raw_ instead.
I1206 14:32:36.241847 140128687093568 summary_op_util.py:66] Summary name cross_entropy (raw) is illegal; using cross_entropy__raw_ instead.
INFO:tensorflow:Summary name total_loss (raw) is illegal; using total_loss__raw_ instead.
I1206 14:32:36.244410 140128687093568 summary_op_util.py:66] Summary name total_loss (raw) is illegal; using total_loss__raw_ instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:584: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
W1206 14:32:36.247661 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:584: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
WARNING:tensorflow:From /home/ubuntu/aws_share/DistNet-master/cifar10.py:609: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.
W1206 14:32:36.383297 140128687093568 deprecation_wrapper.py:119] From /home/ubuntu/aws_share/DistNet-master/cifar10.py:609: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py:433: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W1206 14:32:36.406855 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/moving_averages.py:433: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From cifar10_train2.py:144: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
W1206 14:32:36.700819 140128687093568 deprecation_wrapper.py:119] From cifar10_train2.py:144: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
WARNING:tensorflow:From cifar10_train2.py:171: The name tf.train.MonitoredTrainingSession is deprecated. Please use tf.compat.v1.train.MonitoredTrainingSession instead.
W1206 14:32:36.701856 140128687093568 deprecation_wrapper.py:119] From cifar10_train2.py:171: The name tf.train.MonitoredTrainingSession is deprecated. Please use tf.compat.v1.train.MonitoredTrainingSession instead.
WARNING:tensorflow:From cifar10_train2.py:173: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.
W1206 14:32:36.702125 140128687093568 deprecation_wrapper.py:119] From cifar10_train2.py:173: The name tf.train.StopAtStepHook is deprecated. Please use tf.estimator.StopAtStepHook instead.
INFO:tensorflow:Create CheckpointSaverHook.
I1206 14:32:36.702475 140128687093568 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1354: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where
W1206 14:32:36.849380 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py:1354: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Graph was finalized.
I1206 14:32:36.947587 140128687093568 monitored_session.py:240] Graph was finalized.
2019-12-06 14:32:36.948710: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-06 14:32:36.957928: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400045000 Hz
2019-12-06 14:32:36.958092: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x405e200 executing computations on platform Host. Devices:
2019-12-06 14:32:36.958116: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2019-12-06 14:32:37.060010: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
INFO:tensorflow:Running local_init_op.
I1206 14:32:37.153052 140128687093568 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1206 14:32:37.165080 140128687093568 session_manager.py:502] Done running local_init_op.
WARNING:tensorflow:From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:875: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
W1206 14:32:37.200688 140128687093568 deprecation.py:323] From /home/ubuntu/geo-distNet/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:875: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module.
INFO:tensorflow:Saving checkpoints for 0 into /home/ubuntu/cifar10_train/model.ckpt.
I1206 14:32:38.378327 140128687093568 basic_session_run_hooks.py:606] Saving checkpoints for 0 into /home/ubuntu/cifar10_train/model.ckpt.
2019-12-06 14:32:42.361013: W tensorflow/core/framework/allocator.cc:107] Allocation of 18874368 exceeds 10% of system memory.
2019-12-06 14:32:42.897646: W tensorflow/core/framework/allocator.cc:107] Allocation of 21196800 exceeds 10% of system memory.
2019-12-06 14:32:43.161692: W tensorflow/core/framework/allocator.cc:107] Allocation of 9437184 exceeds 10% of system memory.
2019-12-06 14:32:43.169073: W tensorflow/core/framework/allocator.cc:107] Allocation of 18874368 exceeds 10% of system memory.
2019-12-06 14:32:43.231100: W tensorflow/core/framework/allocator.cc:107] Allocation of 16070400 exceeds 10% of system memory.
2019-12-06 14:32:43.285419: step 0, loss = 4.67 (197.6 examples/sec; 0.648 sec/batch)
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W1206 14:32:43.988662 140128687093568 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 1 vs previous value: 1. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W1206 14:32:45.315430 140128687093568 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 1 vs previous value: 1. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 2 vs previous value: 2. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W1206 14:32:46.506114 140128687093568 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 2 vs previous value: 2. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 3 vs previous value: 3. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W1206 14:32:47.667660 140128687093568 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 3 vs previous value: 3. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 4 vs previous value: 4. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
W1206 14:32:48.857112 140128687093568 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 4 vs previous value: 4. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
2019-12-06 14:32:50.199463: step 10, loss = 4.66 (185.1 examples/sec; 0.691 sec/batch)
2019-12-06 14:32:56.430896: step 20, loss = 4.64 (205.5 examples/sec; 0.623 sec/batch)
Killed
I searched for a solution and found the following command:
dmesg -T| grep -E -i -B100 'killed process'
It shows why the process was killed. The output was:
[Fri Dec 6 14:33:06 2019] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Fri Dec 6 14:33:06 2019] [   8384]  1000  8384    64817      627         266240        0             0 (sd-pam)
[Fri Dec 6 14:33:06 2019] [   8513]  1000  8513    27163      427         253952        0             0 sshd
[Fri Dec 6 14:33:06 2019] [   8519]  1000  8519     5803      437          86016        0             0 bash
[Fri Dec 6 14:33:06 2019] [   8558]  1000  8558   118072    11795         430080        0             0 jupyter-noteboo
[Fri Dec 6 14:33:06 2019] [   8567]  1000  8567   155530     8753         421888        0             0 python3
[Fri Dec 6 14:33:06 2019] [   8588]  1000  8588   156391     9638         430080        0             0 python3
[Fri Dec 6 14:33:06 2019] [   8826]  1000  8826     1157       16          49152        0             0 sh
[Fri Dec 6 14:33:06 2019] [   8827]  1000  8827   462846   175833        2416640        0             0 cifar10_train2.
[Fri Dec 6 14:33:06 2019] Out of memory: Kill process 8827 (cifar10_train2.) score 700 or sacrifice child
[Fri Dec 6 14:33:06 2019] Killed process 8827 (cifar10_train2.) total-vm:1851384kB, anon-rss:703332kB, file-rss:0kB, shmem-rss:0kB
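The total_vm and rss columns in the task table are page counts, while the final "Killed process" line reports kilobytes. As a small cross-check, the sketch below converts the figures for pid 8827, assuming 4 KiB pages (the usual size on x86-64 Linux; the dmesg output itself does not state the page size). The name `pages_to_kb` is only illustrative.

```python
# Assumption: 4 KiB memory pages; the dmesg output does not state this.
PAGE_SIZE_KB = 4


def pages_to_kb(pages, page_size_kb=PAGE_SIZE_KB):
    """Convert a page count from the OOM task table to kilobytes."""
    return pages * page_size_kb


total_vm_kb = pages_to_kb(462846)  # total_vm column for pid 8827
rss_kb = pages_to_kb(175833)       # rss column for pid 8827

print(total_vm_kb)  # 1851384, the same as total-vm:1851384kB
print(rss_kb)       # 703332, the same as anon-rss:703332kB
```

Both values agree with the last line of the dmesg output, so the training process was holding roughly 700 MB of resident memory when it was killed.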
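This means the kernel's OOM killer terminated the training process (pid 8827, cifar10_train2.) because the machine ran out of memory!
Solution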
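To see how memory grows before the kernel steps in, the worker can log its own peak resident memory every few steps. The following is only a sketch using the standard-library `resource` module (available on Unix-like systems); `peak_rss_mb` and `log_memory` are hypothetical helper names, and the plain loop stands in for the real training loop.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step, every=10):
    """Print peak memory every `every` steps; return True if a line was printed."""
    if step % every != 0:
        return False
    print("step %d: peak RSS %.1f MiB" % (step, peak_rss_mb()))
    return True


# Stand-in for the training loop: the call would go next to mon_sess.run(...).
for step in range(21):
    log_memory(step)
```

If the reported peak climbs steadily from step to step, the process is accumulating memory; if it is already close to the machine's RAM at step 0, the model and input queue simply do not fit on that worker.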
[cRPD] Syslog Message: ‘Memory cgroup out of memory: Kill process (rpd) score or sacrifice child’
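When the container is spun up, RPD can use more memory + swap than has been allocated to it. For example, in the following case the container is spawned with a memory + swap limit of 2G: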
docker run --rm --detach --name cRR1 -h cRR1 --privileged -v cRR1-config:/config -v cRR1-Log:/var/log --memory="2g" --memory-swap="2g" -it crpd:20.3X75-D5.5
As the docker stats output below shows, the container is using all of the memory allocated to it (2G):
/home/regress# ps aux | grep "/usr/sbin/rpd -N"
The task memory output shows that the container reached a peak of about 2G of memory at "22/05/30 09:56:20", which is exactly when it ran out of memory and RPD crashed.
"Memory cgroup out of memory" is a message from the kernel, logged when it kills something because of a memory cgroup restriction, such as the limits placed on containers when they are spun up.
One major cause of this event is allocating the container too little memory to handle the route scale.
Another factor that can lead to memory contention is heavy route churn, which pushes memory usage to its limit.
For RIB/FIB scale versus the minimum memory required to support it, refer to the documentation on cRPD Resource Requirements.
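A memory limit such as the 2G set with `--memory` in the docker run example can be read back from inside the container through the cgroup filesystem. The sketch below tries the standard cgroup v2 and v1 locations; which of them exists depends on the host, and `parse_limit` and `container_memory_limit` are hypothetical helper names.

```python
from pathlib import Path

# Standard cgroup mount points; which one exists depends on the host setup.
CGROUP_V2 = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1 = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(text):
    """Return the limit in bytes, or None when cgroup v2 reports 'max'."""
    value = text.strip()
    return None if value == "max" else int(value)


def container_memory_limit():
    """Return the memory limit in bytes, or None if no limit file is readable."""
    for path in (CGROUP_V2, CGROUP_V1):
        if path.exists():
            return parse_limit(path.read_text())
    return None


limit = container_memory_limit()
if limit is None:
    print("no memory limit found")
else:
    # Note: cgroup v1 reports a very large number when no limit is set.
    print("memory limit: %.2f GiB" % (limit / 2**30))
```

Comparing this limit with the process's resident memory shows how much headroom is left before the kernel logs "Memory cgroup out of memory" and kills the process.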