Caffe Training Note

Content
  1. ImageNet
    1. Preparing the dataset
    2. Generating the LMDB
  2. ERROR & REASON
    1. Error 1: Out of memory
    2. Reason

ImageNet

This part is a simplified guide to training Caffe on your own data.

1. Preparing the dataset

The final dataset should be organized like this:

data
-- train
   -- train1.jpg
   -- train2.jpg
   -- train3.jpg
   ......
-- val
   -- val1.jpg
   -- val2.jpg
   -- val3.jpg
   ......
-- train.txt
-- val.txt

Suppose the data looks like this in Python:

X_train.shape = (5000, 1, 256, 256)
X_val.shape = (2000, 1, 256, 256)
X_test.shape = (500, 1, 256, 256)
y_train.shape = (5000,)
y_val.shape = (2000,)
y_test.shape = (500,)

The following code can be used to generate this dataset structure:

import os
import numpy as np
from PIL import Image

if not os.path.exists('data'):
    os.makedirs('data')

os.chdir('data')

## save images
if not os.path.exists('train'):
    os.makedirs('train')

for i in xrange(X_train.shape[0]):
    name = 'train' + str(i) + '.jpg'
    xx = X_train[i]
    xx = xx.transpose(1, 2, 0)
    ## rescale each image to [0, 255]
    xx = (xx - np.min(xx)) / (np.max(xx) - np.min(xx)) * 255.0
    ## Notice that Image.fromarray() can only save images of 'uint8' type!!!
    ## The data here has a single channel, so squeeze it and save in 'L'
    ## (grayscale) mode; use 'RGB' for 3-channel data.
    img = Image.fromarray(xx.astype(np.uint8).squeeze(), 'L')
    img.save('train/' + name)

............
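create_imagenet.sh (used in the next step) also needs train.txt and val.txt, each containing one "<filename> <label>" pair per line, with filenames relative to TRAIN_DATA_ROOT / VAL_DATA_ROOT. A minimal sketch for generating them, assuming y_train and y_val hold integer class labels:

with open('train.txt', 'w') as f:
    for i in xrange(X_train.shape[0]):
        f.write('train%d.jpg %d\n' % (i, y_train[i]))

with open('val.txt', 'w') as f:
    for i in xrange(X_val.shape[0]):
        f.write('val%d.jpg %d\n' % (i, y_val[i]))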

2. Generating the LMDB

Create a new folder named "cell" under the "examples/imagenet" path and put the data folder into it.
Then copy the file "examples/imagenet/create_imagenet.sh" into the newly created folder "cell" and change it as follows:

EXAMPLE=examples/imagenet/cell
DATA=examples/imagenet/cell
TOOLS=build/tools

TRAIN_DATA_ROOT=examples/imagenet/cell/data/train
VAL_DATA_ROOT=examples/imagenet/cell/data/val

RESIZE=true # set this to "false" if the images do not need resizing

Run the command: ./examples/imagenet/cell/create_imagenet.sh
The lmdb files will then be generated under the "cell" folder.
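The train_val.prototxt below also references a mean file, imagenet_mean.binaryproto. A minimal sketch of generating it with Caffe's standard compute_image_mean tool, assuming the train lmdb is named train_lmdb as in the prototxt below (this mirrors examples/imagenet/make_imagenet_mean.sh):

./build/tools/compute_image_mean examples/imagenet/cell/train_lmdb \
    examples/imagenet/cell/imagenet_mean.binaryproto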

ERROR & REASON

Error 1: Out of memory

F0420 13:29:52.527748 10836 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
@ 0x7f6cb7656dbd google::LogMessage::Fail()
@ 0x7f6cb7658cf8 google::LogMessage::SendToLog()
@ 0x7f6cb7656953 google::LogMessage::Flush()
@ 0x7f6cb765962e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f6cb7d5b021 caffe::SyncedMemory::to_gpu()
@ 0x7f6cb7d5a389 caffe::SyncedMemory::mutable_gpu_data()
@ 0x7f6cb7d3fdf2 caffe::Blob<>::mutable_gpu_data()
@ 0x7f6cb7dca57f caffe::CuDNNConvolutionLayer<>::Forward_gpu()
@ 0x7f6cb7d71ec5 caffe::Net<>::ForwardFromTo()
@ 0x7f6cb7d72237 caffe::Net<>::Forward()
@ 0x7f6cb7d552c7 caffe::Solver<>::Step()
@ 0x7f6cb7d55b89 caffe::Solver<>::Solve()
@ 0x40806e train()
@ 0x40594c main
@ 0x7f6cb6969ec5 __libc_start_main
@ 0x406081 (unknown)
Aborted (core dumped)

Reason

Reason for the error: the batch_size is too large for the GPU's memory!!! You should reduce the batch_size in the train_val.prototxt file.
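To confirm this is the problem, you can watch GPU memory while training starts (nvidia-smi ships with the NVIDIA driver), then lower batch_size until the model fits:

nvidia-smi    # shows per-GPU and per-process memory usage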

Notice: when setting the configuration of a Caffe model, keep the following in mind:

"batch_size": should not be larger than 100 if you use a single GPU to train the network.

"stepsize": determines after how many iterations "base_lr" is reduced (multiplied by "gamma" when lr_policy is "step"); it should be smaller than "max_iter" and preferably a divisor of it.

"test_iter": should be (val_size / batch_size), or a multiple of it, so that each test pass covers the whole validation set [val_size is the size of the val data, and batch_size here is the TEST-phase batch size].

"test_interval": determines after how many training iterations the model is tested. It should preferably be a multiple of (train_size / batch_size), i.e. a whole number of epochs, since results are better when the model is tested after full passes over the training set [train_size is the size of your train data, and batch_size here is the TRAIN-phase batch size].

"snapshot": determines after how many iterations the intermediate results are stored.

"snapshot_prefix": determines the path prefix for storing the snapshot results.

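For the dataset above, these rules give exactly the values used in the example solver below; a quick sanity check (batch sizes taken from the train_val.prototxt below):

train_size, val_size = 5000, 2000            # dataset sizes from section 1
train_batch, val_batch = 250, 50             # TRAIN / TEST batch_size in train_val.prototxt

test_iter = val_size // val_batch            # 2000 / 50  = 40: one full pass over the val set
iters_per_epoch = train_size // train_batch  # 5000 / 250 = 20 iterations per epoch
test_interval = 2 * iters_per_epoch          # 40: test every 2 epochs (a multiple of 20)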
An example prototxt configuration:

#####################
## solver.prototxt ##
#####################

net: "examples/imagenet/cell/train_val.prototxt"

test_iter: 40
test_interval: 40
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 10000
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "examples/imagenet/cell/model/caffenet_train"
solver_mode: GPU


########################
## train_val.prototxt ##
########################

name: "CaffeNet"

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "examples/imagenet/cell/imagenet_mean.binaryproto"
  }
  # mean pixel / channel-wise mean instead of mean image
  # transform_param {
  #   crop_size: 227
  #   mean_value: 104
  #   mean_value: 117
  #   mean_value: 123
  #   mirror: true
  # }
  data_param {
    source: "examples/imagenet/cell/train_lmdb"
    batch_size: 250
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 227
    mean_file: "examples/imagenet/cell/imagenet_mean.binaryproto"
  }
  # mean pixel / channel-wise mean instead of mean image
  # transform_param {
  #   crop_size: 227
  #   mean_value: 104
  #   mean_value: 117
  #   mean_value: 123
  #   mirror: false
  # }
  data_param {
    source: "examples/imagenet/cell/val_lmdb"
    batch_size: 50
    backend: LMDB
  }
}
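With both files in place, training can be launched with the standard caffe binary (assuming the solver above is saved as examples/imagenet/cell/solver.prototxt):

./build/tools/caffe train --solver=examples/imagenet/cell/solver.prototxt

Snapshots will then appear under the path given by snapshot_prefix every 1000 iterations.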