I almost gave up on implementing my own deep learning library. On one hand, I feel uncomfortable using other libraries without understanding their internals; on the other hand, while it is relatively simple to implement a neural network, making a generic neural network library that is performant and easy to use is difficult. Plus, I felt I was falling behind the latest research: without a familiar framework, I can’t easily reproduce others’ results. So I thought I should get familiar with some frameworks first.
I started with Caffe, because I suck at Python, and most other frameworks seem to use Python as their primary interface. Caffe, however, seemed C++ friendly. But my first impression of Caffe was kinda bad: it doesn’t feel well designed, and it lacks documentation for C++ users. Some of Caffe’s problems are shared with other frameworks, so I guess I should keep working on my own neural network library.
Basically, the target users of most frameworks are data scientists who don’t need to embed a framework into another program, but just run it as a tool. They have models and data defined in files, and they interact with the framework by tweaking the training parameters. But the deep learning problem I’m interested in is reinforcement learning, where training data is generated in real time and the training process is kinda interactive. And to check the training results, since most reinforcement learning models are used for game AI or robotics, I have to embed the framework into a game or a robot control program. I found it unclear how to do so with Caffe.
For example, to deliver generated data to my model for training in a timely manner, I want to be able to feed data directly to and from memory. While Caffe has the MemoryData layer, there is almost no documentation for it. I found many people asking about MemoryData, and no good answers. In the end, I realized that the most effective way of learning Caffe is debugging its source code. Moreover, the only C++ example provided by Caffe is sketchy and covers only inference, not training.
Also, most frameworks use AlexNet, or at least an MNIST model, as their hello-world example. While those may be more exciting to look at, they are too complex to serve as a first example, and they hide the basics and fundamentals.
So for this blog post, I’d like to provide a simple example of using Caffe with a MemoryData layer as the input to solve the XOR problem. Hopefully, this is a better hello world for those who want to learn Caffe.
Neural networks are known to solve the XOR problem well, even though the XOR operation is nonlinear. The XOR operation can be summarized in the following table. So our goal is to create a neural network with two binary numbers a and b as the inputs and one binary number c as the output; if the network works as expected, c should be equal to a xor b.
a | b | a xor b
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
There are many materials explaining this problem in more detail. The model I’m using is similar to one of them:
[Diagram: a fully connected 2-2-1 network with two inputs, two hidden neurons (yellow), and one output neuron (green)]
The only difference is that my model has biases; that one doesn’t.
A Caffe model, or a Caffe neural network, is formed by connecting a set of blobs and layers. A blob is a chunk of data, and a layer is an operation applied to blobs. A layer itself can hold blobs too, which store its weights. So a Caffe model looks like a chain of alternating blobs and layers, because a layer takes blobs as its input and generates new blobs that become the inputs for the next layer.
Overall, my model looks like this (model.prototxt):
name: "XOR"
layer {
  name: "inputdata"
  type: "MemoryData"
  top: "fulldata"
  top: "fakelabel"
  include {
    phase: TRAIN
  }
  memory_data_param {
    batch_size: 64
    channels: 1
    height: 1
    width: 2
  }
}
layer {
  name: "test_inputdata"
  type: "MemoryData"
  top: "fulldata"
  top: "fakelabel"
  include {
    phase: TEST
  }
  memory_data_param {
    batch_size: 4
    channels: 1
    height: 1
    width: 2
  }
}
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "fulldata"
  top: "fc6"
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "fc6sig"
  bottom: "fc6"
  top: "fc6"
  type: "Sigmoid"
}
layer {
  name: "fc7"
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "output"
  bottom: "fc7"
  top: "output"
  type: "Sigmoid"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SigmoidCrossEntropyLoss"
  bottom: "fc7"
  bottom: "fakelabel"
  top: "loss"
}
The first two layers are the input layers. I use two input layers because one is for training and the other is for inference. As you can see, they are MemoryData layers, because we want to provide the training and testing data directly from memory as they are generated. The only difference between these two layers is the batch size: 64 for training and 4 for testing, because I only need to test these four cases: 0 xor 0, 0 xor 1, 1 xor 0, and 1 xor 1.
Notice that this MemoryData layer doesn’t allow you to specify the size of your labels; it has to be 1. I think this is another shitty thing about Caffe (what’s worse is that it isn’t documented; you have to debug Caffe’s source code to find out). While 1 is indeed the label size for the XOR problem, for other problems you will have to put all your data and labels into the same piece of memory and use a Slice layer to cut them back into data and labels; a rough sketch follows the snippet below.
layer {
  name: "inputdata"
  type: "MemoryData"
  top: "fulldata"
  top: "fakelabel"
  include {
    phase: TRAIN
  }
  memory_data_param {
    batch_size: 64
    channels: 1
    height: 1
    width: 2
  }
}
layer {
  name: "test_inputdata"
  type: "MemoryData"
  top: "fulldata"
  top: "fakelabel"
  include {
    phase: TEST
  }
  memory_data_param {
    batch_size: 4
    channels: 1
    height: 1
    width: 2
  }
}
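As promised above, here is a rough sketch of what slicing could look like. This is my own untested guess, assuming a hypothetical combined blob named alldata of width 3, where the first two columns are the data and the last column is the label:
layer {
  name: "slicer"
  type: "Slice"
  bottom: "alldata"
  top: "data"
  top: "label"
  slice_param {
    # axis 3 is the width dimension of an N x C x H x W blob
    axis: 3
    # columns [0, 2) go to "data", the remaining column to "label"
    slice_point: 2
  }
}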
The next layer is the first hidden layer, corresponding to the yellow neurons in the diagram above. The fillers, according to the Caffe documentation, randomize the initial network; otherwise the initial weights would all be zeros. Since the model is a fully connected network, the layer type here is InnerProduct. In my previous experiments with my own neural network implementation, sigmoid activations gave good results, so here I’m just using sigmoid.
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "fulldata"
  top: "fc6"
  inner_product_param {
    num_output: 2
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "fc6sig"
  bottom: "fc6"
  top: "fc6"
  type: "Sigmoid"
}
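As an aside, my understanding of the xavier filler (from reading Caffe’s filler header, so treat the exact formula as an assumption) is that each weight is drawn uniformly from a range scaled by the layer’s fan-in n:
w ~ Uniform(-sqrt(3 / n), sqrt(3 / n))
This breaks the symmetry that all-zero weights would have, while keeping the activation variance roughly constant across layers.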
The next layer corresponds to the green output neuron in the diagram, and is also an InnerProduct layer:
layer {
  name: "fc7"
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
But for activation, I created two layers, one for training and one for testing.
layer {
  name: "output"
  bottom: "fc7"
  top: "output"
  type: "Sigmoid"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SigmoidCrossEntropyLoss"
  bottom: "fc7"
  bottom: "fakelabel"
  top: "loss"
}
The layer for training is a SigmoidCrossEntropyLoss layer, where sigmoid is the activation and cross entropy is the cost function. The reason for combining sigmoid and cross entropy into a single layer is that calculating their derivative is easier this way. Since testing has no need to produce the loss, using just a Sigmoid layer there is fine.
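To see why, write the sigmoid as s(z) = 1 / (1 + exp(-z)) and the cross entropy loss for a target label y as:
L = -[y * log(s(z)) + (1 - y) * log(1 - s(z))]
Differentiating through both at once, the gradient with respect to the pre-activation z collapses to:
dL/dz = s(z) - y
which is trivial to compute and numerically stable, since the sigmoid’s own derivative cancels out of the expression.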
And then, we have the solver config file (solver.prototxt):
net: "model.prototxt"
base_lr: 0.02
lr_policy: "step"
gamma: 0.5
stepsize: 500000
display: 2000
max_iter: 5000000
snapshot: 1000000
snapshot_prefix: "XOR"
solver_mode: CPU
The learning rate starts at 0.02 and decreases by 50% every 500,000 steps. The overall iteration count is 5,000,000.
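For reference, Caffe’s step policy computes the effective learning rate at iteration i as:
lr(i) = base_lr * gamma ^ floor(i / stepsize)
so with these settings the rate goes 0.02, 0.01, 0.005, and so on, halving every 500,000 iterations.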
Now comes the C++ program.
First, I generate 400 batches of training data, each with a batch size of 64, for 25,600 samples in total.
float *data = new float[64*1*1*2*400];   // 400 batches of 64 samples, each sample is 1x1x2
float *label = new float[64*1*1*1*400];  // one label per sample

for(int i = 0; i < 64*1*1*400; ++i)
{
    int a = rand() % 2;
    int b = rand() % 2;
    int c = a ^ b;          // ground truth
    data[i*2 + 0] = a;
    data[i*2 + 1] = b;
    label[i] = c;
}
Basically, I just generate two random binary numbers a and b and calculate their xor value c. Then a and b are saved together as the input data, and c is saved into a separate array as the label.
And then I create a solver parameter object and load solver.prototxt into it:
caffe::SolverParameter solver_param;
caffe::ReadSolverParamsFromTextFileOrDie("./solver.prototxt", &solver_param);
Next, I create the solver out of the solver parameter:
std::shared_ptr<caffe::Solver<float> > solver(caffe::SolverRegistry<float>::CreateSolver(solver_param));
Then I need to obtain the input MemoryData layer from the solver’s neural network and feed in my training data:
caffe::MemoryDataLayer<float> *dataLayer_trainnet = (caffe::MemoryDataLayer<float> *) (solver->net()->layer_by_name("inputdata").get());
dataLayer_trainnet->Reset(data, label, 25600);
The Reset function of MemoryDataLayer allows you to provide pointers to the memory holding the data and the labels. Again, the size of each label can only be 1, whereas the size of each datum is specified in the model.prototxt file. 25600 is the count of training samples; it has to be a multiple of 64, the batch size, and 25600 is exactly 400 * 64, since we generated 400 batches of 64 samples.
Now call this one line, and the network will be trained:
solver->Solve();
Once it is trained, we need to test it. Create another network with the same model, but pass TEST as the phase, and load the trained weights cached in XOR_iter_5000000.caffemodel:
std::shared_ptr<caffe::Net<float> > testnet;
testnet.reset(new caffe::Net<float>("./model.prototxt", caffe::TEST));
testnet->CopyTrainedLayersFrom("XOR_iter_5000000.caffemodel");
Similar to training, we need to obtain the input MemoryData layer and pass the input to it for testing:
float testab[] = {0, 0, 0, 1, 1, 0, 1, 1};
float testc[] = {0, 1, 1, 0};

caffe::MemoryDataLayer<float> *dataLayer_testnet = (caffe::MemoryDataLayer<float> *)(testnet->layer_by_name("test_inputdata").get());
dataLayer_testnet->Reset(testab, testc, 4);
Notice that the name of this input layer is test_inputdata, whereas the input layer for training is inputdata. Remember that we created two input layers in the model file; these names correspond to those two layers, whose only difference is the batch size.
Then we do the following to calculate the neural network output:
testnet->Forward();
Once this is done, we need to obtain the result by accessing the output blob:
boost::shared_ptr<caffe::Blob<float> > output_layer = testnet->blob_by_name("output");
const float* begin = output_layer->cpu_data();
const float* end = begin + 4;
std::vector<float> result(begin, end);
We know the output size is 4, and we save the outputs into the result vector.
In the end we just print the results:
for(size_t i = 0; i < result.size(); ++i)
{
    printf("input: %d xor %d, truth: %f result by nn: %f\n",
           (int)testab[i*2 + 0], (int)testab[i*2 + 1], testc[i], result[i]);
}
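If you would rather have hard 0/1 answers instead of probabilities, a simple threshold at 0.5 works. This loop is my own addition, not part of the original program:
// My own post-processing sketch: threshold the sigmoid outputs at 0.5.
for(size_t i = 0; i < result.size(); ++i)
{
    int predicted = result[i] > 0.5f ? 1 : 0;
    printf("%d xor %d -> %d\n", (int)testab[i*2 + 0], (int)testab[i*2 + 1], predicted);
}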
The complete source code is here:
Here are the test results for this simple neural network:
input: 0 xor 0, truth: 0.000000 result by nn: 0.000550
input: 0 xor 1, truth: 1.000000 result by nn: 0.999368
input: 1 xor 0, truth: 1.000000 result by nn: 0.999368
input: 1 xor 1, truth: 0.000000 result by nn: 0.000626
So given 0 and 0, the expected output is 0, and the neural network produced 0.00055, which is very close. Given 0 and 1, while the expected output is 1, the neural network gave 0.999368, which is also good enough.