1 Star 0 Fork 1

hahan2020 / torch7-cunn

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
# CUDA backend for the Neural Network Package #

This package provides a CUDA implementation for many of the modules in the base nn package: nn

  • Modules: There are also additional GPU-related modules not found in the nn package.

Installing from source

git clone https://github.com/torch/cunn
cd cunn
luarocks make rocks/cunn-scm-1.rockspec

To use

Simply convert your network model to CUDA by calling :cuda():

local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())

model:cuda()  -- convert model to CUDA

... and similarly for your tensors:

local input = torch.Tensor(32,2):uniform()
input = input:cuda()
local output = model:forward(input)

... or create them directly as CudaTensors:

local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)

To run unit-tests

luajit -l cunn -e 'cunn.test()'

GPU Training Concepts

Performance

  • data should be transferred between main memory and gpu in batches, otherwise the transfer time will be dominated by latency associated with speed of light, and execution overheads, rather than by bandwidth
  • therefore, train and predict using mini-batches
  • allocating GPU memory causes a sync-point, which will noticeably affect performance
    • therefore try to allocate any CudaTensors once, at the start of the program, and then simply copy data backwards and forwards between main memory and existing CudaTensors
  • similarly, try to avoid any operations that implicitly allocate new tensors. For example, if you write:
require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  local b = torch.add(a, 1)
end

... this will allocate one thousand new CudaTensors, one for each call to torch.add(a, 1).

Use instead this form:

require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
local b = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  b:add(a, 1)
end

In this form, b is allocated only once, before the loop. Then the b:add(a,1) operation will perform the add inside the GPU kernel, and store the result into the original b CudaTensor. This will run noticeably faster, in general. It's also a lot less likely to eat up arbitrary amounts of memory, and less likely to need frequent calls to collectgarbage(); collectgarbage().

Benchmarking

  • GPU operations will typically continue after an instruction has been issued
  • eg, if you do:
require 'cutorch'
local a = torch.CudaTensor(1000,1000):uniform()
a:add(1)

... the GPU kernel to add 1 will only be scheduled for launch by a:add(1). It might not have completed yet, or even have reached the GPU, at the time that the a:add(1) returns

  • therefore for running wall-clock timings, you should call cutorch.synchronize() before each timecheck point:
require 'cutorch'
require 'sys'

local a = torch.CudaTensor(1000,1000):uniform()
cutorch.synchronize()
start = sys.tic()
a:add(1)
cutorch.synchronize()
print(sys.toc())

空文件

简介

暂无描述 展开 收起
Cuda 等 6 种语言
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
1
https://gitee.com/hahan2020/torch7-cunn.git
git@gitee.com:hahan2020/torch7-cunn.git
hahan2020
torch7-cunn
torch7-cunn
master

搜索帮助