Caffe Training Note

Content
  1. ImageNet
    1. Preparing the dataset
    2. Generating the LMDB
  2. ERROR & REASON
    1. Error 1: Out of memory
    2. Reason

ImageNet

This part is a simplified guide to training Caffe on your own data.

1. Preparing the dataset

The final dataset should be organized like this:

data
-- train
   -- train1.jpg
   -- train2.jpg
   -- train3.jpg
   ......
-- val
   -- val1.jpg
   -- val2.jpg
   -- val3.jpg
   ......
-- train.txt
-- val.txt

Suppose the data looks like this in Python:

X_train.shape = (5000, 1, 256, 256)
X_val.shape = (2000, 1, 256, 256)
X_test.shape = (500, 1, 256, 256)
y_train.shape = (5000,)
y_val.shape = (2000,)
y_test.shape = (500,)

The following code can be used to generate this dataset structure:

import os
import numpy as np
from PIL import Image

if not os.path.exists('data'):
    os.makedirs('data')

os.chdir('data')

## save images
if not os.path.exists('train'):
    os.makedirs('train')

for i in xrange(X_train.shape[0]):
    name = 'train' + str(i) + '.jpg'
    xx = X_train[i]
    xx = xx.transpose(1, 2, 0)
    ## rescale each image to [0, 255]
    xx = (xx - np.min(xx)) / (np.max(xx) - np.min(xx)) * 255.0
    ## Notice that Image.fromarray() can only save images of 'uint8' type!!!
    ## The data here has a single channel, so squeeze it and save in 'L'
    ## (grayscale) mode; use 'RGB' for 3-channel data.
    img = Image.fromarray(xx.astype(np.uint8).squeeze(), 'L')
    img.save('train/' + name)

............
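create_imagenet.sh (used in the next step) also needs train.txt and val.txt, each containing one "<filename> <label>" pair per line, with filenames relative to TRAIN_DATA_ROOT / VAL_DATA_ROOT. A minimal sketch for generating them, assuming y_train and y_val hold integer class labels:

with open('train.txt', 'w') as f:
    for i in xrange(X_train.shape[0]):
        f.write('train%d.jpg %d\n' % (i, y_train[i]))

with open('val.txt', 'w') as f:
    for i in xrange(X_val.shape[0]):
        f.write('val%d.jpg %d\n' % (i, y_val[i]))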

2. Generating the LMDB

Create a new folder named "cell" under the "examples/imagenet" path and put the data folder into it.
Then copy the file "examples/imagenet/create_imagenet.sh" into the newly created folder "cell" and change it as follows:

EXAMPLE=examples/imagenet/cell
DATA=examples/imagenet/cell
TOOLS=build/tools

TRAIN_DATA_ROOT=examples/imagenet/cell/data/train
VAL_DATA_ROOT=examples/imagenet/cell/data/val

RESIZE=true # set this to "false" if the images do not need resizing

Run the command: ./examples/imagenet/cell/create_imagenet.sh
The lmdb files will then be generated under the "cell" folder.
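The train_val.prototxt below also references a mean file, imagenet_mean.binaryproto. A minimal sketch of generating it with Caffe's standard compute_image_mean tool, assuming the train lmdb is named train_lmdb as in the prototxt below (this mirrors examples/imagenet/make_imagenet_mean.sh):

./build/tools/compute_image_mean examples/imagenet/cell/train_lmdb \
    examples/imagenet/cell/imagenet_mean.binaryproto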

ERROR & REASON

Error 1: Out of memory

F0420 13:29:52.527748 10836 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
@ 0x7f6cb7656dbd google::LogMessage::Fail()
@ 0x7f6cb7658cf8 google::LogMessage::SendToLog()
@ 0x7f6cb7656953 google::LogMessage::Flush()
@ 0x7f6cb765962e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6cb7d5b021 caffe::SyncedMemory::to_gpu()
@ 0x7f6cb7d5a389 caffe::SyncedMemory::mutable_gpu_data()
@ 0x7f6cb7d3fdf2 caffe::Blob<>::mutable_gpu_data()
@ 0x7f6cb7dca57f caffe::CuDNNConvolutionLayer<>::Forward_gpu()
@ 0x7f6cb7d71ec5 caffe::Net<>::ForwardFromTo()
@ 0x7f6cb7d72237 caffe::Net<>::Forward()
@ 0x7f6cb7d552c7 caffe::Solver<>::Step()
@ 0x7f6cb7d55b89 caffe::Solver<>::Solve()
@ 0x40806e train()
@ 0x40594c main
@ 0x7f6cb6969ec5 __libc_start_main
@ 0x406081 (unknown)
Aborted (core dumped)

Reason

Reason for the error: the batch_size is too large for the GPU's memory!!! You should reduce the batch_size in the train_val.prototxt file.
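To confirm this is the problem, you can watch GPU memory while training starts (nvidia-smi ships with the NVIDIA driver), then lower batch_size until the model fits:

nvidia-smi    # shows per-GPU and per-process memory usage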

Notice: when setting the configuration of a Caffe model, keep the following in mind:

"batch_size": should not be larger than 100 if you use a single GPU to train the network.

"stepsize": determines after how many iterations "base_lr" is reduced (multiplied by "gamma" when lr_policy is "step"); it should be smaller than "max_iter" and preferably a divisor of it.

"test_iter": should be (val_size / batch_size), or a multiple of it, so that each test pass covers the whole validation set [val_size is the size of the val data, and batch_size here is the TEST-phase batch size].

"test_interval": determines after how many training iterations the model is tested. It should preferably be a multiple of (train_size / batch_size), i.e. a whole number of epochs, since results are better when the model is tested after full passes over the training set [train_size is the size of your train data, and batch_size here is the TRAIN-phase batch size].

"snapshot": determines after how many iterations the intermediate results are stored.

"snapshot_prefix": determines the path prefix for storing the snapshot results.

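For the dataset above, these rules give exactly the values used in the example solver below; a quick sanity check (batch sizes taken from the train_val.prototxt below):

train_size, val_size = 5000, 2000            # dataset sizes from section 1
train_batch, val_batch = 250, 50             # TRAIN / TEST batch_size in train_val.prototxt

test_iter = val_size // val_batch            # 2000 / 50  = 40: one full pass over the val set
iters_per_epoch = train_size // train_batch  # 5000 / 250 = 20 iterations per epoch
test_interval = 2 * iters_per_epoch          # 40: test every 2 epochs (a multiple of 20)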
An example prototxt configuration:

#####################
## solver.prototxt ##
#####################

net: "examples/imagenet/cell/train_val.prototxt"

test_iter: 40
test_interval: 40
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 10000
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "examples/imagenet/cell/model/caffenet_train"
solver_mode: GPU


########################
## train_val.prototxt ##
########################

name: "CaffeNet"

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "examples/imagenet/cell/imagenet_mean.binaryproto"
  }
  # mean pixel / channel-wise mean instead of mean image
  # transform_param {
  #   crop_size: 227
  #   mean_value: 104
  #   mean_value: 117
  #   mean_value: 123
  #   mirror: true
  # }
  data_param {
    source: "examples/imagenet/cell/train_lmdb"
    batch_size: 250
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 227
    mean_file: "examples/imagenet/cell/imagenet_mean.binaryproto"
  }
  # mean pixel / channel-wise mean instead of mean image
  # transform_param {
  #   crop_size: 227
  #   mean_value: 104
  #   mean_value: 117
  #   mean_value: 123
  #   mirror: false
  # }
  data_param {
    source: "examples/imagenet/cell/val_lmdb"
    batch_size: 50
    backend: LMDB
  }
}
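With both files in place, training can be launched with the standard caffe binary (assuming the solver above is saved as examples/imagenet/cell/solver.prototxt):

./build/tools/caffe train --solver=examples/imagenet/cell/solver.prototxt

Snapshots will then appear under the path given by snapshot_prefix every 1000 iterations.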