2021-10-28
Linux&Jetson Nano下编译安装ncnn
1.下载ncnn源码项目地址:https://github.com/Tencent/ncnngit clone https://github.com/Tencent/ncnn.git cd ncnn git submodule update --init2.安装依赖2.1 通用依赖gitg++cmakeprotocol buffer (protobuf) headers files and protobuf compilerglslangopencv(用于编译案列)sudo apt install build-essential git cmake libprotobuf-dev protobuf-compiler libvulkan-dev vulkan-utils libopencv-dev2.2 vulkan header files and loader library (用于调用GPU,只用CPU的可以不用安装)2.2.1 X86版本安装# 为GPU安装Vulkan驱动 sudo apt install mesa-vulkan-drivers # 安装vulkansdk wget https://sdk.lunarg.com/sdk/download/1.2.189.0/linux/vulkansdk-linux-x86_64-1.2.189.0.tar.gz?Human=true -O vulkansdk-linux-x86_64-1.2.189.0.tar.gz tar -xvf vulkansdk-linux-x86_64-1.2.189.0.tar.gz export VULKAN_SDK=$(pwd)/1.2.189.0/x86_642.2.2 Jetson Nano安装确认vulkan驱动是否安装正常nvidia@xavier:/$ vulkaninfo Xlib: extension "NV-GLX" missing on display "localhost:10.0". Xlib: extension "NV-GLX" missing on display "localhost:10.0". Xlib: extension "NV-GLX" missing on display "localhost:10.0". /build/vulkan-tools-WR7ZBj/vulkan-tools-1.1.126.0+dfsg1/vulkaninfo/vulkaninfo.h:399: failed with ERROR_INITIALIZATION_FAILED异常原因查找通过vnc远程连接到图形界面后运行vulkaninfonano@nano:~$ vulkaninfo =========== VULKAN INFO =========== Vulkan Instance Version: 1.2.70 Instance Extensions: ==================== Instance Extensions count = 16 VK_KHR_device_group_creation : extension revision 1 ······ ========================= minImageCount = 2 maxImageCount = 8 currentExtent: width = 256 height = 256 minImageExtent: width = 256 height = 256 maxImageExtent: width = 256 height = 256 maxImageArrayLayers = 1 ······安装vulkansdk# 编译安装vulkansdk sudo apt-get update && sudo apt-get install git build-essential libx11-xcb-dev libxkbcommon-dev libwayland-dev libxrandr-dev cmake git clone https://github.com/KhronosGroup/Vulkan-Loader.git cd Vulkan-Loader && mkdir build && cd build ../scripts/update_deps.py cmake -DCMAKE_BUILD_TYPE=Release -DVULKAN_HEADERS_INSTALL_DIR=$(pwd)/Vulkan-Headers/build/install .. make -j$(nproc) export LD_LIBRARY_PATH=$(pwd)/loader cd Vulkan-Headers ln -s ../loader lib export VULKAN_SDK=$(pwd)3. 开始编译CPU 版# 没安VULKAN运行这个 cd ncnn mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DNCNN_VULKAN=OFF -DNCNN_SYSTEM_GLSLANG=ON -DNCNN_BUILD_EXAMPLES=ON .. make -j$(nproc)GPU-X86# 有GPU安了VULKAN运行这个 cd ncnn mkdir -p build cd build cmake -DCMAKE_BUILD_TYPE=Release -DNCNN_VULKAN=ON -DNCNN_SYSTEM_GLSLANG=ON -DNCNN_BUILD_EXAMPLES=ON .. make -j$(nproc)GPU- Jetson Nano# Jetson Nano用这个 cd ncnn mkdir -p build cd build cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/jetson.toolchain.cmake -DNCNN_VULKAN=ON -DCMAKE_BUILD_TYPE=Release -DNCNN_BUILD_EXAMPLES=ON .. 
make -j$(nproc)4.验证安装4.1 验证squeezenetcd ../examples ../build/examples/squeezenet ../images/256-ncnn.pngnano@nano:/software/ncnn/examples$ ../build/examples/squeezenet ../images/256-ncnn.png [0 NVIDIA Tegra X1 (nvgpu)] queueC=0[16] queueG=0[16] queueT=0[16] [0 NVIDIA Tegra X1 (nvgpu)] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0 [0 NVIDIA Tegra X1 (nvgpu)] fp16-p/s/a=1/1/1 int8-p/s/a=1/1/1 [0 NVIDIA Tegra X1 (nvgpu)] subgroup=32 basic=1 vote=1 ballot=1 shuffle=1 532 = 0.168945 920 = 0.093323 716 = 0.063110 nvdc: start nvdcEventThread nvdc: exit nvdcEventThread4.1 验证benchncnncd ../benchmark ../build/benchmark/benchncnn 10 $(nproc) 0 0nano@nano:/software/ncnn/benchmark$ ../build/benchmark/benchncnn 10 $(nproc) 0 0[0 NVIDIA Tegra X1 (nvgpu)] queueC=0[16] queueG=0[16] queueT=0[16] [0 NVIDIA Tegra X1 (nvgpu)] bugsbn1=0 bugbilz=0 bugcopc=0 bugihfa=0 [0 NVIDIA Tegra X1 (nvgpu)] fp16-p/s/a=1/1/1 int8-p/s/a=1/1/1 [0 NVIDIA Tegra X1 (nvgpu)] subgroup=32 basic=1 vote=1 ballot=1 shuffle=1 loop_count = 10 num_threads = 4 powersave = 0 gpu_device = 0 cooling_down = 1 squeezenet min = 19.90 max = 22.82 avg = 20.82 squeezenet_int8 min = 36.58 max = 236.35 avg = 66.89 mobilenet min = 24.75 max = 41.05 avg = 28.83 mobilenet_int8 min = 42.95 max = 70.39 avg = 52.08 mobilenet_v2 min = 31.84 max = 38.09 avg = 35.59 mobilenet_v3 min = 29.77 max = 38.48 avg = 33.56 shufflenet min = 25.98 max = 36.90 avg = 30.86 shufflenet_v2 min = 18.46 max = 27.65 avg = 20.49 mnasnet min = 22.63 max = 35.37 avg = 24.88 proxylessnasnet min = 27.85 max = 33.44 avg = 30.52 efficientnet_b0 min = 34.85 max = 48.31 avg = 38.46 efficientnetv2_b0 min = 56.62 max = 76.70 avg = 61.99 regnety_400m min = 28.31 max = 35.59 avg = 31.92 blazeface min = 14.40 max = 34.70 avg = 23.63 googlenet min = 55.01 max = 75.36 avg = 60.89 googlenet_int8 min = 111.53 max = 315.94 avg = 167.58 resnet18 min = 51.45 max = 77.21 avg = 59.26 resnet18_int8 min = 81.99 max = 207.09 avg = 117.43 alexnet min = 69.98 max = 102.26 avg = 83.27 vgg16 min = 302.14 max = 337.56 avg = 320.55 vgg16_int8 min = 464.06 max = 601.92 avg = 540.28 resnet50 min = 140.36 max = 176.66 avg = 159.53 resnet50_int8 min = 299.16 max = 554.05 avg = 453.26 squeezenet_ssd min = 53.43 max = 78.75 avg = 63.67 squeezenet_ssd_int8 min = 91.45 max = 215.14 avg = 123.13 mobilenet_ssd min = 66.30 max = 90.77 avg = 76.86 mobilenet_ssd_int8 min = 89.05 max = 261.33 avg = 119.18 mobilenet_yolo min = 142.24 max = 182.72 avg = 154.48 mobilenetv2_yolov3 min = 81.96 max = 107.17 avg = 91.93 yolov4-tiny min = 103.76 max = 138.15 avg = 115.43 nanodet_m min = 27.15 max = 36.88 avg = 32.00 yolo-fastest-1.1 min = 33.21 max = 40.95 avg = 35.84 yolo-fastestv2 min = 17.51 max = 29.54 avg = 21.32 vision_transformer min = 4981.82 max = 5576.98 avg = 5198.79 nvdc: start nvdcEventThread nvdc: exit nvdcEventThread参考资料how to buildVulkan Support on L4TNVIDIA vulkan driver的安装和Jetson平台上vulkan sdk的制作vulkaninfo failed with VK_ERROR_INITIALIZATION_FAILED
2021-09-20
快速使用Faster RCNN训练VOC格式的数据集
训练步骤

STEP1:下载代码并配置环境

```bash
git clone https://github.com/bubbliiiing/faster-rcnn-pytorch.git
cd faster-rcnn-pytorch
pip install -r requirements.txt
```

STEP2:根据文件结构填充VOC格式的数据集

数据放置格式（只需完成 #TODO 部分即可）：

```
├──VOCdevkit/VOC2007/
    ├── Annotations
        ├──放置xml文件 #TODO
    ├── JPEGImages
        ├──放置img文件 #TODO
    ├──ImageSets/Main
        ├──放置训练索引文件（无需手动完成，自动生成）
    ├── voc2frcnn.py  #数据分割脚本，用于生成训练索引文件
```

编辑voc2frcnn.py，设置train/val/test数据分割比例：

```python
#----------------------------------------------------------------------#
#   想要增加测试集修改trainval_percent
#   train_percent不需要修改
#----------------------------------------------------------------------#
trainval_percent=1
train_percent=1
```

生成训练索引文件：

```bash
python voc2frcnn.py
```

STEP3:生成最终训练所需的txt文件

编辑根目录下的voc_annotation.py，将classes改成你自己的classes（注意不要使用中文标签，文件夹中不要有空格！）：

```python
classes = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]
```

然后运行voc_annotation.py：

```bash
python voc_annotation.py
```

此时会生成对应的2007_train.txt，每一行对应其图片位置及其真实框的位置。

STEP4:编辑model_data/voc_classes.txt

将其中的类别改为自己的，文件内容为：

```
cat
dog
...
```

STEP5:修改train.py的NUM_CLASSSES

将train.py的NUM_CLASSSES修改成所需要分的类的个数（不需要+1）。

STEP6:开始训练

```bash
python train.py
```

STEP7:模型效果评估

评估过程可参考视频 https://www.bilibili.com/video/BV1zE411u7Vw

参考资料

https://github.com/bubbliiiing/faster-rcnn-pytorch
2021-09-15
快速使用YOLOv5训练VOC格式的数据集
训练步骤STEP1:下载官方YOLOv5的代码并配置环境git clone https://github.com/ultralytics/yolov5 cd yolov5 pip install -r requirements.txtSTEP2:准备VOC格式的数据集数据放置格式├──train_data_VOC ├── Annotations ├──放置xml文件 ├── JPEGImages ├──防止img文件STEP3:将数据集转为YOLOv5所需要的COCO格式mkdir train_data_COCO vim VOC2COCO.pyimport os import shutil import random import xmltodict from progressbar import * #================================================================================================================ # 函数定义区 # 函数-将voc xml中的object转化为对应的一条yolo数据 def get_yolo_data(obj,img_width,img_height): # 获取voc格式的数据信息 name = obj['name'] xmin = float(obj['bndbox']['xmin']) xmax = float(obj['bndbox']['xmax']) ymin = float(obj['bndbox']['ymin']) ymax = float(obj['bndbox']['ymax']) # 计算yolo格式的数据信息 class_idx = class_names.index(name) x_center,y_center = (xmin+xmax)/2,(ymin+ymax)/2 box_width = xmax - xmin box_height = ymax - ymin yolo_data = "{} {} {} {} {}\n".format(class_idx,x_center/img_width,y_center/img_height,box_width/img_width,box_height/img_height) return yolo_data # 函数-将xml文件转为txt文件 def convert_annotations(image_name): in_file = xml_file_path + image_name + '.xml' out_file = txt_file_path + image_name + '.txt' yolo_data = "" with open(in_file) as f: xml_str = f.read() # 转为字典 xml_dic = xmltodict.parse(xml_str) # 获取图片的width、height img_width = float(xml_dic["annotation"]["size"]["width"]) img_height = float(xml_dic["annotation"]["size"]["height"]) # 获取xml文件中的object objects = xml_dic["annotation"]["object"] if isinstance(objects,list): # xml文件中包含多个object for obj in objects: yolo_data += get_yolo_data(obj,img_width,img_height) else: # xml文件中包含1个object obj = objects yolo_data += get_yolo_data(obj,img_width,img_height) with open(out_file,'w') as f: f.write(yolo_data) # 函数-创建最终用于训练的COCO格式数据集的文件夹 def create_dir(): if not os.path.exists('train_data_COCO/images/'): os.makedirs('train_data_COCO/images/') if not os.path.exists('train_data_COCO/labels/'): os.makedirs('train_data_COCO/labels/') if not os.path.exists('train_data_COCO/images/train/'): os.makedirs('train_data_COCO/images/train') if not os.path.exists('train_data_COCO/images/val/'): os.makedirs('train_data_COCO/images/val/') if not os.path.exists('train_data_COCO/images/test/'): os.makedirs('train_data_COCO/images/test/') if not os.path.exists('train_data_COCO/labels/train/'): os.makedirs('train_data_COCO/labels/train/') if not os.path.exists('train_data_COCO/labels/val/'): os.makedirs('train_data_COCO/labels/val/') if not os.path.exists('train_data_COCO/labels/test/'): os.makedirs('train_data_COCO/labels/test/') return #================================================================================================================ # 功能实现区 """ STEP1:准备工作:数据准备+创建各种所需的文件夹 """ # 对应的VOC数据集的路径参数+类别参数 xml_file_path = './train_data_VOC/Annotations/' # 检查和自己的xml文件夹名称是否一致 images_file_path = './train_data_VOC/JPEGImages/' # 检查和自己的图像文件夹名称是否一致 class_names = ['Person', 'BridgeVehicle', 'LuggageVehicle', 'Plane', 'RefuelVehicle', 'FoodVehicle', 'RubbishVehicle', 'WaterVehicle', 'PlatformVehicle', 'TractorVehicle'] # 创一个临时文件夹用来存放xml文件转换出来的对应的txt文件 if not os.path.exists('train_data_COCO/temp_labels/'): os.makedirs('train_data_COCO/temp_labels/') txt_file_path = 'train_data_COCO/temp_labels/' # 执行xml到txt的转换,存储到一个临时文件夹 total_xml = os.listdir(xml_file_path) num_xml = len(total_xml) # XML文件总数 for i in range(num_xml): name = total_xml[i][:-4] convert_annotations(name) # 创建COCO格式的数据所需要的各种文件夹 create_dir() # 读取所有的txt文件 total_txt = os.listdir(txt_file_path) print("数据准备工作完成,开始进行数据分配") """ STEP2:数据分配:按比例对数据集进行划分 
""" # 设置数据集划分比例,训练集75%,验证集15%,测试集15% train_percent = 0.8 val_percent = 0.15 test_percent = 0.05 # 计算train,val,test每一类的数据数量 num_txt = len(total_txt) num_train = int(num_txt * train_percent) num_val = int(num_txt * val_percent) num_test = num_txt - num_train - num_val # 根据计算出的每类的数据数量计算出进行数据分配的索引 list_all_txt = range(num_txt) # 范围 range(0, num) train = random.sample(list_all_txt, num_train)# train从list_all_txt取出num_train个元素 val_test = [i for i in list_all_txt if not i in train]# 所以list_all_txt列表只剩下了这些元素:val_test val = random.sample(val_test, num_val)# 再从val_test取出num_val个元素,val_test剩下的元素就是test # 根据采样的索引结果进行文件分配工作 print("训练集数目:{}, 验证集数目:{},测试集数目:{}".format(len(train), len(val), len(val_test) - len(val))) #进度条功能 widgets = ['VOC2COCO: ',Percentage(), ' ', Bar('#'),' ', Timer(),' ', ETA()] pbar = ProgressBar(widgets=widgets, maxval=num_txt).start() count = 0 for i in list_all_txt: name = total_txt[i][:-4] srcImage = images_file_path + name + '.jpg' srcLabel = txt_file_path + name + '.txt' if i in train: dst_train_Image = 'train_data_COCO/images/train/' + name + '.jpg' dst_train_Label = 'train_data_COCO/labels/train/' + name + '.txt' shutil.copyfile(srcImage, dst_train_Image) shutil.copyfile(srcLabel, dst_train_Label) elif i in val: dst_val_Image = 'train_data_COCO/images/val/' + name + '.jpg' dst_val_Label = 'train_data_COCO/labels/val/' + name + '.txt' shutil.copyfile(srcImage, dst_val_Image) shutil.copyfile(srcLabel, dst_val_Label) else: dst_test_Image = 'train_data_COCO/images/test/' + name + '.jpg' dst_test_Label = 'train_data_COCO/labels/test/' + name + '.txt' shutil.copyfile(srcImage, dst_test_Image) shutil.copyfile(srcLabel, dst_test_Label) #更新进度条 count += 1 pbar.update(count) #释放进度条 pbar.finish() print("数据分配工作完成,开始释放临时文") """ STEP3:释放临时文件 """ shutil.rmtree(txt_file_path) print("临时文件释放完成,VOC2COCO执行结束")python VOC2COCO.pySTEP4:在data下创建与数据对应的data.yaml文件文件内容按照数据的数据情况填写path: train_data_COCO # root train: # train images (relative to 'path') - images/train val: # val images (relative to 'path') - images/val test: # test images (optional) - images/test # Classes nc: 10 # number of classes names: ['Person', 'BridgeVehicle', 'LuggageVehicle', 'Plane', 'RefuelVehicle', 'FoodVehicle', 'RubbishVehicle', 'WaterVehicle', 'PlatformVehicle', 'TractorVehicle'] # class namesSTEP5:下载预训练模型mkdir weights cd weights wget https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5s.ptSTEP6:开始训练python train.py --data data/data.yaml --cfg models/yolov5s.yaml --weights weights/yolov5s.pt --batch-size 64 --epochs 60 参考资料https://github.com/ultralytics/yolov5
2021-09-14
深度学习中的FLOPs介绍及计算(注意区分FLOPS)
FLOPS与FLOPsFLOPS:注意全大写,是floating point operations per second的缩写,意指每秒浮点运算次数,理解为计算速度。是一个衡量硬件性能的指标。FLOPs:注意s小写,是floating point operations的缩写(s表复数),意指浮点运算数,理解为计算量。可以用来衡量算法/模型的复杂度。全连接网络中FLOPs的计算推导以4个输入神经元和3个输出神经元为例计算一个输出神经元的的计算过程为$$ y1 = w_{11}*x_1+w_{21}*x_2+w_{31}*x_3+w_{41}*x_4 $$所需的计算次数为4次乘法3次加法共需4+3=7计算。推广到I个输入神经元O个输出神经元后则计算一个输出神经元所需要的计算次数为$I+(I-1)=2I-1$,则总的计算次数为$$ FLOPs = (2I-1)*O $$考虑bias则为$$ y1 = w_{11}*x_1+w_{21}*x_2+w_{31}*x_3+w_{41}*x_4+b1 $$总的计算次数为$$ FLOPs = 2I*O $$结果FC(full connected)层FLOPs的计算公式如下(不考虑bias时有-1,有bias时没有-1):$$ FLOPs = (2 \times I - 1) \times O $$其中:I = input neuron numbers(输入神经元的数量)O = output neuron numbers(输出神经元的数量)CNN中FLOPs的计算以下答案不考虑activation function的运算推导对于输入通道数为$C_{in}$,卷积核的大小为K,输出通道数为$C_{out}$,输出特征图的尺寸为$H*W$进行一次卷积运算的计算次数为乘法$C_{in}K^2$次加法$C_{in}K^2-1$次共计$C_{in}K^2+C_{in}K^2-1=2C_{in}K^2-1$次,若考虑bias则再加1次得到一个channel的特征图所需的卷积次数为$H*W$次共计需得到$C_{out}$个特征图因此对于CNN中的一个卷积层来说总的计算次数为(不考虑bias时有-1,考虑bias时没有-1):$$ FLOPs = (2C_{in}K^2-1)HWC_{out} $$结果卷积层FLOPs的计算公式如下(不考虑bias时有-1,有bias时没有-1):$$ FLOPs = (2C_{in}K^2-1)HWC_{out} $$其中:$C_{in}$ = input channelK= kernel sizeH,W = output feature map size$C_{out}$ = output channel计算FLOPs的代码或包torchstatfrom torchstat import stat import torchvision.models as models model = models.vgg16() stat(model, (3, 224, 224)) module name input shape output shape params memory(MB) MAdd Flops MemRead(B) MemWrite(B) duration[%] MemR+W(B) 0 features.0 3 224 224 64 224 224 1792.0 12.25 173,408,256.0 89,915,392.0 609280.0 12845056.0 3.67% 13454336.0 1 features.1 64 224 224 64 224 224 0.0 12.25 3,211,264.0 3,211,264.0 12845056.0 12845056.0 1.83% 25690112.0 2 features.2 64 224 224 64 224 224 36928.0 12.25 3,699,376,128.0 1,852,899,328.0 12992768.0 12845056.0 8.43% 25837824.0 3 features.3 64 224 224 64 224 224 0.0 12.25 3,211,264.0 3,211,264.0 12845056.0 12845056.0 1.45% 25690112.0 4 features.4 64 224 224 64 112 112 0.0 3.06 2,408,448.0 3,211,264.0 12845056.0 3211264.0 11.37% 16056320.0 5 features.5 64 112 112 128 112 112 73856.0 6.12 1,849,688,064.0 926,449,664.0 3506688.0 6422528.0 4.03% 9929216.0 6 features.6 128 112 112 128 112 112 0.0 6.12 1,605,632.0 1,605,632.0 6422528.0 6422528.0 0.73% 12845056.0 7 features.7 128 112 112 128 112 112 147584.0 6.12 3,699,376,128.0 1,851,293,696.0 7012864.0 6422528.0 5.86% 13435392.0 8 features.8 128 112 112 128 112 112 0.0 6.12 1,605,632.0 1,605,632.0 6422528.0 6422528.0 0.37% 12845056.0 9 features.9 128 112 112 128 56 56 0.0 1.53 1,204,224.0 1,605,632.0 6422528.0 1605632.0 7.32% 8028160.0 10 features.10 128 56 56 256 56 56 295168.0 3.06 1,849,688,064.0 925,646,848.0 2786304.0 3211264.0 3.30% 5997568.0 11 features.11 256 56 56 256 56 56 0.0 3.06 802,816.0 802,816.0 3211264.0 3211264.0 0.00% 6422528.0 12 features.12 256 56 56 256 56 56 590080.0 3.06 3,699,376,128.0 1,850,490,880.0 5571584.0 3211264.0 5.13% 8782848.0 13 features.13 256 56 56 256 56 56 0.0 3.06 802,816.0 802,816.0 3211264.0 3211264.0 0.37% 6422528.0 14 features.14 256 56 56 256 56 56 590080.0 3.06 3,699,376,128.0 1,850,490,880.0 5571584.0 3211264.0 4.76% 8782848.0 15 features.15 256 56 56 256 56 56 0.0 3.06 802,816.0 802,816.0 3211264.0 3211264.0 0.37% 6422528.0 16 features.16 256 56 56 256 28 28 0.0 0.77 602,112.0 802,816.0 3211264.0 802816.0 2.56% 4014080.0 17 features.17 256 28 28 512 28 28 1180160.0 1.53 1,849,688,064.0 925,245,440.0 5523456.0 1605632.0 3.66% 7129088.0 18 features.18 512 28 28 512 28 28 0.0 1.53 401,408.0 401,408.0 1605632.0 1605632.0 0.00% 3211264.0 19 features.19 512 28 28 512 28 28 2359808.0 1.53 3,699,376,128.0 1,850,089,472.0 
11044864.0 1605632.0 5.50% 12650496.0 20 features.20 512 28 28 512 28 28 0.0 1.53 401,408.0 401,408.0 1605632.0 1605632.0 0.00% 3211264.0 21 features.21 512 28 28 512 28 28 2359808.0 1.53 3,699,376,128.0 1,850,089,472.0 11044864.0 1605632.0 5.49% 12650496.0 22 features.22 512 28 28 512 28 28 0.0 1.53 401,408.0 401,408.0 1605632.0 1605632.0 0.00% 3211264.0 23 features.23 512 28 28 512 14 14 0.0 0.38 301,056.0 401,408.0 1605632.0 401408.0 1.10% 2007040.0 24 features.24 512 14 14 512 14 14 2359808.0 0.38 924,844,032.0 462,522,368.0 9840640.0 401408.0 2.94% 10242048.0 25 features.25 512 14 14 512 14 14 0.0 0.38 100,352.0 100,352.0 401408.0 401408.0 0.00% 802816.0 26 features.26 512 14 14 512 14 14 2359808.0 0.38 924,844,032.0 462,522,368.0 9840640.0 401408.0 2.57% 10242048.0 27 features.27 512 14 14 512 14 14 0.0 0.38 100,352.0 100,352.0 401408.0 401408.0 0.00% 802816.0 28 features.28 512 14 14 512 14 14 2359808.0 0.38 924,844,032.0 462,522,368.0 9840640.0 401408.0 2.19% 10242048.0 29 features.29 512 14 14 512 14 14 0.0 0.38 100,352.0 100,352.0 401408.0 401408.0 0.37% 802816.0 30 features.30 512 14 14 512 7 7 0.0 0.10 75,264.0 100,352.0 401408.0 100352.0 0.37% 501760.0 31 avgpool 512 7 7 512 7 7 0.0 0.10 0.0 0.0 0.0 0.0 0.00% 0.0 32 classifier.0 25088 4096 102764544.0 0.02 205,516,800.0 102,760,448.0 411158528.0 16384.0 10.62% 411174912.0 33 classifier.1 4096 4096 0.0 0.02 4,096.0 4,096.0 16384.0 16384.0 0.00% 32768.0 34 classifier.2 4096 4096 0.0 0.02 0.0 0.0 0.0 0.0 0.37% 0.0 35 classifier.3 4096 4096 16781312.0 0.02 33,550,336.0 16,777,216.0 67141632.0 16384.0 2.20% 67158016.0 36 classifier.4 4096 4096 0.0 0.02 4,096.0 4,096.0 16384.0 16384.0 0.00% 32768.0 37 classifier.5 4096 4096 0.0 0.02 0.0 0.0 0.0 0.0 0.37% 0.0 38 classifier.6 4096 1000 4097000.0 0.00 8,191,000.0 4,096,000.0 16404384.0 4000.0 0.73% 16408384.0 total 138357544.0 109.39 30,958,666,264.0 15,503,489,024.0 16404384.0 4000.0 100.00% 783170624.0 ============================================================================================================================================================ Total params: 138,357,544 ------------------------------------------------------------------------------------------------------------------------------------------------------------ Total memory: 109.39MB Total MAdd: 30.96GMAdd Total Flops: 15.5GFlops Total MemR+W: 746.89MB 参考资料CNN 模型所需的计算力(flops)和参数(parameters)数量是怎么计算的?分享一个FLOPs计算神器CNN Explainer[Molchanov P , Tyree S , Karras T , et al. Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning[J]. 2016.](https://arxiv.org/pdf/1611.06440.pdf)
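
补充示例（非原文内容）：按上文给出的全连接层与卷积层FLOPs公式，可以写一个最小的Python函数做粗略估算；函数名与示例数值仅作演示。按这种"乘、加分开计数"的定义，算出的结果对应torchstat输出中的MAdd列。

```python
def fc_flops(I, O, bias=True):
    """全连接层FLOPs：有bias时为 2*I*O，无bias时为 (2*I-1)*O"""
    return (2 * I if bias else 2 * I - 1) * O

def conv_flops(C_in, K, H, W, C_out, bias=False):
    """卷积层FLOPs：输出特征图每个元素需要 2*C_in*K^2-1 次运算（有bias时再+1）"""
    per_element = 2 * C_in * K * K - (0 if bias else 1)
    return per_element * H * W * C_out

# 例：VGG16第一个卷积层，3通道输入、3×3卷积核、64通道输出、输出特征图224×224
print(conv_flops(3, 3, 224, 224, 64, bias=True))   # 173408256，与上文torchstat输出中features.0的MAdd一致
# 例：VGG16的classifier.0全连接层，输入25088，输出4096
print(fc_flops(25088, 4096, bias=False))           # 205516800，与上文classifier.0的MAdd一致
```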
2021-09-13
卷积神经网络中的深度可分离卷积(Depthwise Separable Convolution)
一些轻量级的网络（如MobileNet）中会使用深度可分离卷积（depthwise separable convolution），它由逐通道卷积depthwise(DW)和逐点卷积pointwise(PW)两个部分结合而成，用来提取特征feature map。相比常规的卷积操作，其参数数量和运算成本都比较低。

1.常规卷积操作

对于一张5×5像素、三通道的输入图像（shape为5×5×3），经过一个3×3卷积核的卷积层（假设输出通道数为4，则卷积核shape为3×3×3×4），最终输出4个Feature Map：如果有same padding则尺寸与输入层相同（5×5），如果没有则尺寸变为3×3。

卷积层共4个Filter，每个Filter包含了3个Kernel，每个Kernel的大小为3×3。因此卷积层的参数数量可以用如下公式来计算：

$$ N_{std} = 4 × 3 × 3 × 3 = 108 $$

2.深度可分离卷积

2.1逐通道卷积

Depthwise Convolution的一个卷积核负责一个通道，一个通道只被一个卷积核卷积。

一张5×5像素、三通道彩色输入图片（shape为5×5×3），Depthwise Convolution首先经过第一次卷积运算，DW完全是在二维平面内进行。卷积核的数量与上一层的通道数相同（通道和卷积核一一对应），所以一个三通道的图像经过运算后生成了3个Feature map（如果有same padding则尺寸与输入层相同，为5×5）。

其中一个Filter只包含一个大小为3×3的Kernel，卷积部分的参数个数计算如下：

$$ N_{depthwise} = 3 × 3 × 3 = 27 $$

Depthwise Convolution完成后的Feature map数量与输入层的通道数相同，无法扩展Feature map。而且这种运算对输入层的每个通道独立进行卷积运算，没有有效利用不同通道在相同空间位置上的feature信息。因此需要Pointwise Convolution将这些Feature map进行组合，生成新的Feature map。

2.2逐点卷积

Pointwise Convolution的运算与常规卷积运算非常相似，它的卷积核尺寸为1×1×M，M为上一层的通道数。所以这里的卷积运算会将上一步的map在深度方向上进行加权组合，生成新的Feature map，有几个卷积核就有几个输出Feature map。

由于采用的是1×1卷积的方式，此步中卷积涉及到的参数个数可以计算为：

$$ N_{pointwise} = 1 × 1 × 3 × 4 = 12 $$

经过Pointwise Convolution之后，同样输出了4张Feature map，与常规卷积的输出维度相同。

3.参数对比

回顾一下，常规卷积的参数个数为：

$$ N_{std} = 4 × 3 × 3 × 3 = 108 $$

Separable Convolution的参数由两部分相加得到：

$$ N_{depthwise} = 3 × 3 × 3 = 27 \\ N_{pointwise} = 1 × 1 × 3 × 4 = 12 \\ N_{separable} = N_{depthwise} + N_{pointwise} = 39 $$

相同的输入，同样是得到4张Feature map，Separable Convolution的参数个数约为常规卷积的1/3。因此，在参数量相同的前提下，采用Separable Convolution的神经网络层数可以做得更深。（文末附一段用PyTorch验证上述参数量的小示例。）

参考资料

卷积神经网络中的Separable Convolution：https://yinguobing.com/separable-convolution/#fn2

深度可分离卷积：https://zhuanlan.zhihu.com/p/92134485
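
补充示例（非原文内容）：用PyTorch的nn.Conv2d可以直接验证上面的参数量，其中逐通道卷积通过groups=输入通道数实现；各层的通道数、卷积核大小沿用正文中的例子。

```python
import torch.nn as nn

def count_params(m):
    """统计模块的参数总数"""
    return sum(p.numel() for p in m.parameters())

# 常规卷积：3通道输入 -> 4通道输出，3×3卷积核（不含bias）
std_conv = nn.Conv2d(3, 4, kernel_size=3, padding=1, bias=False)

# 深度可分离卷积 = 逐通道卷积(groups=in_channels) + 1×1逐点卷积
depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False)
pointwise = nn.Conv2d(3, 4, kernel_size=1, bias=False)

print(count_params(std_conv))                             # 108
print(count_params(depthwise), count_params(pointwise))   # 27 12，合计39
```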
2021-09-13
光流法简介及实现
1.光流法简介1.1光流光流(optical flow)是空间运动物体在观察成像平面上的像素运动的瞬时速度。通常将二维图像平面特定坐标点上的灰度瞬时变化率定义为光流矢量。一言以概之:所谓光流就是瞬时速率,在时间间隔很小(比如视频的连续前后两帧之间)时,也等同于目标点的位移三言以概之:所谓光流场就是很多光流的集合。 当我们计算出了一幅图片中每个图像的光流,就能形成光流场。 构建光流场是试图重现现实世界中的运动场,用以运动分析。 1.2光流法光流法是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到上一帧跟当前帧之间存在的对应关系,从而计算出相邻帧之间物体的运动信息的一种方法。1.3光流法的基本假设条件亮度恒定不变。即同一目标在不同帧间运动时,其亮度不会发生改变。这是基本光流法的假定(所有光流法变种都必须满足),用于得到光流法基本方程;时间连续或运动是“小运动”。即时间的变化不会引起目标位置的剧烈变化,相邻帧之间位移要比较小。同样也是光流法不可或缺的假定。1.4光流场在空间中,运动可以用运动场描述,而在一个图像平面上,物体的运动往往是通过图像序列中不同图像灰度分布的不同体现的,从而,空间中的运动场转移到图像上就表示为光流场(optical flow field)。光流场是一个二维矢量场,它反映了图像上每一点灰度的变化趋势,可看成是带有灰度的像素点在图像平面上运动而产生的瞬时速度场。它包含的信息即是各像点的瞬时运动速度矢量信息。研究光流场的目的就是为了从序列图像中近似计算不能直接得到的运动场。光流场在理想情况下,光流场对应于运动场。1.5稠密光流与稀疏光流稠密光流稠密光流是一种针对图像或指定的某一片区域进行逐点匹配的图像配准方法,它计算图像上所有的点的偏移量,从而形成一个稠密的光流场。通过这个稠密的光流场,可以进行像素级别的图像配准。稀疏光流与稠密光流相反,稀疏光流并不对图像的每个像素点进行逐点计算。它通常需要指定一组点进行跟踪,这组点最好具有某种明显的特性,例如Harris角点等,那么跟踪就会相对稳定和可靠。稀疏跟踪的计算开销比稠密跟踪小得多。1.6光流法的优缺点优点光流法的优点在于它无须了解场景的信息,就可以准确地检测识别运动日标位置,且在摄像机处于运动的情况下仍然适用。而且光流不仅携带了运动物体的运动信息,而且还携带了有关景物三维结构的丰富信息,它能够在不知道场景的任何信息的情况下,检测出运动对象。缺点光流法的适用条件,即两个基本假设,在现实情况下均不容易满足。假设一:亮度恒定不变。但是实际情况是光流场并不一定反映了目标的实际运动情况,如图,所示。图中,光源不动,而物体表面均一,且产生了自传运动,却并没有产生光流图中,物体并没有运动,但是光源与物体发生相对运动,却有光流产生。因此可以说光流法法对光线敏感, 光线变化极易影响识别效果。假设二:小运动。现实情况下较大距离的运动也是普遍存在的。因此当需要检测的目标运动速度过快是,传统光流法也不适用。对稀疏光流算法而言存在着孔径问题,对稠密光流算法而言存在着计算量大的问题。观察上图(a)我们可以看到目标是在向右移动,但是由于“观察窗口”过小我们无法观测到边缘也在下降。LK算法中选区的小邻域就如同上图的观察窗口,邻域大小的选取会影响到最终的效果。当然,这是针对于一部分稀疏光流算法而言,属于稠密光流范畴的算法一般不存在这个问题。但是稠密光流法的显著缺点主要体现在,计算量大,耗时长,在对实时性要求苛刻的情况下并不适用。2.算法实现2.1稀疏光流-只跟踪某些角点(角点检测使用Shi-Tomasi检测算法)代码""" 稀疏光流,只跟踪某些角点 """ import numpy as np import cv2 import matplotlib.pyplot as plt video_path = "./test.mp4" cap = cv2.VideoCapture(video_path) # 打开视频 # ShiTomasi 角点检测参数 feature_params = dict( maxCorners = 100, qualityLevel = 0.3, minDistance = 7, blockSize = 7 ) # lucas kanade光流法参数 lk_params = dict( winSize = (15,15), maxLevel = 2, criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03)) # 创建随机颜色 color = np.random.randint(0,255,(100,3)) # 获取第一帧的灰度图像及其角点 ret, old_frame = cap.read() #获取第一帧 old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY) #找到原始灰度图 #获取第一帧的灰度图中的角点p0 p0 = cv2.goodFeaturesToTrack(old_gray, mask = None, **feature_params) #创建一个蒙版用来画轨迹,i.e.和每帧图像大小相同的全0张量 mask = np.zeros_like(old_frame) # 对每帧图像计算光流并绘制光流轨迹 while(True): ret,frame = cap.read() if not ret: print("This video has been processed.") break frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) # 计算每帧的光流 p1, st, err = cv2.calcOpticalFlowPyrLK(old_gray, frame_gray, p0, None, **lk_params) # 选取好的跟踪点 good_new = p1[st==1] good_old = p0[st==1] # 画出轨迹 for i,(new,old) in enumerate(zip(good_new,good_old)): a,b = new.ravel() c,d = old.ravel() mask = cv2.line(mask, (int(a),int(b)),(int(c),int(d)), color[i].tolist(), 2) #添加了该帧光流的轨迹图 frame = cv2.circle(frame,(int(a),int(b)),5,color[i].tolist(),-1) # 效果可视化 img = cv2.add(frame,mask) #将该图和轨迹图合并 img_show = np.hstack((img,mask)) cv2.imshow('frame',img_show) if cv2.waitKey(50)&0xFF==ord("q"): break # 更新"上一帧图像"和追踪点 old_gray = frame_gray.copy() p0 = good_new.reshape(-1,1,2) cv2.destroyAllWindows() cap.release() 实现效果2.2稠密光流-跟踪所有的像素点代码# 稠密光流 import numpy as np import cv2 video_path = "./test.mp4" cap = cv2.VideoCapture(video_path) # 打开视频 # 获取第一帧 ret, frame1 = cap.read() prvs = cv2.cvtColor(frame1,cv2.COLOR_BGR2GRAY) # 创建光流矢量绘制蒙版 hsv = np.zeros_like(frame1) # 遍历每一行的第1列 hsv[...,1] = 255 while(1): ret, frame2 = cap.read() next = cv2.cvtColor(frame2,cv2.COLOR_BGR2GRAY) # 返回一个两通道的光流向量,实际上是每个点的像素位移值 flow = cv2.calcOpticalFlowFarneback(prvs,next, None, 0.5, 3, 15, 3, 5, 1.2, 0) # 笛卡尔坐标转换为极坐标,获得极轴和极角 mag, ang = 
cv2.cartToPolar(flow[...,0], flow[...,1]) hsv[...,0] = ang*180/np.pi/2 hsv[...,2] = cv2.normalize(mag,None,0,255,cv2.NORM_MINMAX) rgb = cv2.cvtColor(hsv,cv2.COLOR_HSV2BGR) # 光流法结果可视化 img_show = np.hstack((frame2,rgb)) cv2.imshow('frame',img_show) if cv2.waitKey(50)&0xFF==ord("q"): break prvs = next cap.release() cv2.destroyAllWindows()实现效果参考资料计算机视觉--光流法(optical flow)简介:https://blog.csdn.net/qq_41368247/article/details/82562165光流法简述(重庆邮电大学.孔令上):https://wenku.baidu.com/view/7a2cb968ff00bed5b8f31d6c.htmlOPENCV对光流法的实现(PYTHON3):https://www.freesion.com/article/32791433918/角点检测:Harris 与 Shi-Tomasi:https://zhuanlan.zhihu.com/p/83064609python opencv入门 光流法(41):https://blog.csdn.net/tengfei461807914/article/details/80978947
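
补充说明（非原文内容）：由上文1.3节的两条基本假设（亮度恒定、小运动）可以推出经典的光流约束方程。设图像灰度为 $I(x,y,t)$，对 $I(x+dx,\,y+dy,\,t+dt)=I(x,y,t)$ 作一阶泰勒展开并整理，可得

$$ I_x u + I_y v + I_t = 0 $$

其中 $u=dx/dt$、$v=dy/dt$ 为该点光流在x、y方向的分量，$I_x$、$I_y$、$I_t$ 为灰度对空间和时间的偏导数。该方程对每个像素只提供一个约束却有两个未知数，Lucas-Kanade方法因此假设邻域窗口内光流一致、联立窗口内多个像素求最小二乘解，这也对应上文提到的邻域（窗口）大小选取与"孔径问题"。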
2021-09-13
帧差法+三帧差法原理与实现
帧差法原理移动侦测即是根据视频每帧或者几帧之间像素的差异,对差异值设置阈值,筛选大于阈值的像素点,做掩模图即可选出视频中存在变化的桢。帧差法较为简单的视频中物体移动侦测,帧差法分为:单帧差和三桢差。随着帧数的增加是防止检测结果的重影。单帧差法算法原理以视频为例进行单帧差法移动侦测算法实现import cv2 import pandas as pd import numpy as np video_path = "./test.mp4" cam = cv2.VideoCapture(video_path) # 打开一个视频 input_fps = cam.get(cv2.CAP_PROP_FPS) # 获取视频帧率 ret_val, input_image = cam.read() # 读取视频第一帧 gray_lwpCV = cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY) # 将第一帧转为灰度 gray_lwpCV = cv2.GaussianBlur(gray_lwpCV, (21, 21), 0) # 对转换后的灰度图进行高斯模糊 background=gray_lwpCV # 将高斯模糊后的第一帧作为初始化背景 area_threh = 100 # 物体bbox面积阈值 while(cam.isOpened()) and ret_val == True: ret_val, input_image = cam.read() # 继续读取视频帧 gray_lwpCV = cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY) gray_lwpCV = cv2.GaussianBlur(gray_lwpCV, (21, 21), 0) # 对读取到的视频帧进行灰度处理+高斯模糊 diff = cv2.absdiff(background, gray_lwpCV) # 将最新读取的视频帧和背景做差 #跟着图像变换背景,如果背景变化区域小于20%或者75%,则将当前帧作为新得背景区域 tem_diff=diff.flatten() tem_ds=pd.Series(tem_diff) tem_per=1-len(tem_ds[tem_ds==0])/len(tem_ds) if (tem_per <0.2 )| (tem_per>0.75): background=gray_lwpCV else: ret,diff_binary = cv2.threshold(diff, 10, 255, cv2.THRESH_BINARY)# 对差值diff进行二值化 contours, hierarchy = cv2.findContours(diff_binary,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE) # 对二值化之后得结果进行轮廓提取 for c in contours: if (cv2.contourArea(c) < area_threh): # 对于矩形区域,只显示大于给定阈值的轮廓(去除微小的变化等噪点) continue (x, y, w, h) = cv2.boundingRect(c) # 该函数计算矩形的边界框 cv2.rectangle(input_image, (x, y), (x+w, y+h), (0, 255, 0), 2) cv2.imshow('frame diff', np.hstack((input_image,cv2.cvtColor(diff,cv2.COLOR_GRAY2BGR)))) if cv2.waitKey(50)&0xFF==ord("q"): break cam.release() cv2.destroyAllWindows()实现效果算法分析优点实现简单,运行速度快缺点存在"鬼影"问题(指在物体原来得位置和现在得位置都出现了该物体),三帧差法算法原理连续三帧,12相减,23相减,结果做与运算。相减公式:其中阈值T需要手动调整。结果得到一个二值图,对二值图进行形态学处理,再进行轮廓提取。算法实现 import cv2 import numpy as np video_path = "./test.mp4" cap = cv2.VideoCapture(video_path) width =int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) height =int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) # 初始化第1.2.3帧 one_frame = np.zeros((height,width),dtype=np.uint8) two_frame = np.zeros((height,width),dtype=np.uint8) three_frame = np.zeros((height,width),dtype=np.uint8) area_threh = 100 # 物体bbox面积阈值 while cap.isOpened(): ret,frame = cap.read() frame_gray =cv2.cvtColor(frame,cv2.COLOR_BGR2GRAY) if not ret: break one_frame,two_frame,three_frame = two_frame,three_frame,frame_gray # 1.2帧做差 abs1 = cv2.absdiff(one_frame,two_frame)#相减 _,thresh1 = cv2.threshold(abs1,15,255,cv2.THRESH_BINARY)#二值,大于40的为255,小于0 # 2.3帧做差 abs2 =cv2.absdiff(two_frame,three_frame) _,thresh2 =cv2.threshold(abs2,15,255,cv2.THRESH_BINARY) binary =cv2.bitwise_and(thresh1,thresh2)#与运算 kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(5,5)) # erode = cv2.erode(binary,kernel)#腐蚀 # dilate =cv2.dilate(binary,kernel)#膨胀 # dilate =cv2.dilate(dilate,kernel)#膨胀 # 轮廓提取 contours, hierarchy = cv2.findContours(binary.copy(),mode=cv2.RETR_EXTERNAL,method=cv2.CHAIN_APPROX_SIMPLE)#寻找轮廓 for contour in contours: if cv2.contourArea(contour)>area_threh: x,y,w,h =cv2.boundingRect(contour)#找方框 cv2.rectangle(frame,(x,y),(x+w,y+h),(0,255,0), 2) img_show = np.hstack((frame,cv2.cvtColor(binary,cv2.COLOR_GRAY2BGR))) cv2.imshow('three frame diff',img_show) if cv2.waitKey(50)&0xFF==ord("q"): break cap.release() cv2.destroyAllWindows()实现效果不进行形态学处理膨胀一次膨胀两次先腐蚀一次,再膨胀两次算法分析优点实现简单,运行速度快解决了帧差法存在的“鬼影”问题能大致检测出物体的运动区域缺点不进行膨胀会存在“空洞”问题进行膨胀之后会存在着多个物体的”牵连“问题对物体的运动区域的检测不够全面eg:对于部分人运动区域的检测会存在着只检测出半个人的情况参考资料python+opencv实现移动侦测(帧差法):https://www.jb51.net/article/183203.htmopencv python 
三帧差法实现运动目标区域检测与完整代码:https://blog.csdn.net/pengpengloveqiaoqiao/article/details/89487049
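
补充说明（非原文内容）：上文"三帧差法"小节中的相减公式在原文以图片形式给出，这里按通用写法补充。设连续三帧为 $f_{k-1}$、$f_k$、$f_{k+1}$，阈值为 $T$：

$$ D_1(x,y)=|f_k(x,y)-f_{k-1}(x,y)|,\qquad D_2(x,y)=|f_{k+1}(x,y)-f_k(x,y)| $$

$$ B(x,y)=\begin{cases}255, & D_1(x,y)>T \ \text{且}\ D_2(x,y)>T \\ 0, & \text{其他}\end{cases} $$

即两幅差分图分别以阈值 $T$ 二值化后再做与运算，对应上面代码中的 cv2.absdiff、cv2.threshold 和 cv2.bitwise_and。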
2021-08-15
快速调用Yolov5模型检测图片
前提：未修改模型结构

1.快速调用官方的Yolov5预训练模型

```python
import torch

# 使用torch.hub加载yolov5的预训练模型
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # or yolov5m, yolov5x, custom

# 进行模型调用测试
img_path = './6800.jpg'    # or file, PIL, OpenCV, numpy, multiple
results = model(img_path)  # 得到预测结果
print(results.xyxy)        # 输出预测出的bbox_list
results.show()             # 预测结果展示
```

2.快速调用自己训练好的Yolov5模型（有pt文件即可）

```python
import torch

# 使用torch.hub加载yolov5的预训练模型
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # or yolov5m, yolov5x, custom

# 加载自己训练好的模型及相关参数
cpkt = torch.load("./best.pt", map_location=torch.device("cuda:0"))

# 将预训练模型的骨干替换成自己训练好的
yolov5_load = model
yolov5_load.model = cpkt["model"]

# 进行模型调用测试
img_path = './6800.jpg'           # or file, PIL, OpenCV, numpy, multiple
results = yolov5_load(img_path)   # 得到预测结果
print(results.xyxy)               # 输出预测出的bbox_list
results.show()                    # 预测结果展示
```

参考资料

https://github.com/ultralytics/yolov5
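
另一种写法（非原文内容，仅供参考）：较新版本的YOLOv5在hubconf中提供了custom入口，可以直接用自己训练得到的pt文件加载模型，省去手动替换权重的步骤；具体参数以官方文档为准。

```python
import torch

# path为自己训练得到的权重文件路径（此处路径仅为示例）
model = torch.hub.load('ultralytics/yolov5', 'custom', path='./best.pt')

results = model('./6800.jpg')  # 对一张图片进行推理
print(results.xyxy)            # 输出预测出的bbox_list
results.show()                 # 预测结果展示
```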
2021-07-28
YOLOv5项目目录结构
YOLOv5项目目录结构| detect.py #检测脚本 | hubconf.py #PyTorch Hub相关代码 | LICENSE #版权文件 | README.md #README markdown文件 | requirements.txt #项目所需的安装包列表 | sotabench.py #COCO数据集测试脚本 | test.py #模型测试脚本 | train.py #模型训练脚本 | tutorial.ipynb #Jupyter Notebook演示代码 |---data | | coco.yaml #COCO数据集配置文件 | | coco128.yaml #COCO128数据集配置文件 | | hyp.finetune.yaml #超参数微调配置文件 | | hyp.scratch.yaml #超参数起始配置文件 | | voc.yaml #VOC数据集配置文件 | |---scripts | | | get_coco.sh #下载COCO数据集shell命令 | | | get_voc.sh #下载VOC数据集shell命令 |---inference | |---images #示例图片文件夹 | | | bus.jpg | | | zidane.jpg |---models | | common.py #模型组件定义代码 | | experimental.py #实验性质的代码 | | export.py #模型导出脚本 | | yolo.py #Detect及Model构建代码 | | yolov5l.yaml #yolov51网络模型配置文件 | | yolov5m.yaml #yolov5m网络模型配置文件 | | yolov5s.yaml #yolov5s网络模型配置文件 | | yolov5x.yaml #yolov5x网络模型配置文件 | | __init__.py | |---hub | | | yolov3-spp.yaml | | | yolov5-fpn.yaml | | | yolov5-panet.yaml |---runs #训练结果 | |---exp0 | | | events.out.tfevents.1604835533.PC-201807230204.26148.0 | | | hyp.yaml | | | labels.png | | | opt.yaml | | | orecision-recall_curve.png | | | results.png | | | results.txt | | | test_batch0_gt.jpg | | | test_batch0_pred.jpg | | | train_batch0.jpg | | | train_batch1.jpg | | | train_batch2.jpg | | |---weights | | | | best.pt #最好权重 | | | | last.pt #最近权重 |---utils | | activations.py #激活函数定义代码 | | datasets.py #Dataset及Dataloader定义代码 | | evolve.sh #超参数进化命令 | | general.py #项目通用函数代码 | | google_utils.py #谷歌云使用相关代码 | | torch_utils.py #辅助程序代码 | | __init_.py | |---google_app_engine | | | additional_requirements.txt | | | app.yaml | | | Dockerfile |---VOC #数据集目录 | |---images #数据集图片目录 | | |---train #训练集图片文件夹 | | | | 1000005.jpg | | | | 000007.jpg | | | | 000009.jpg | | | | 000012.jpg | | | | 000016.jpg | | | | ...... | | |---val #验证集图片文件夹 | | | | 000001.jpg | | | | 000002.jpg | | | | 000003.jpg | | | | 000004.jpg | | | | 000006.jpg | | | | ...... | |---labels #数据集标签目录 | | | train.cache | | | val.cache | | |---train #训练集标签文件夹 | | | | 000005.txt | | | | 000007.txt | | | | 000009.txt | | | | 000012.txt | | | | 000016.txt | | | | ...... | | |---val #测试集标签文件夹 | | | | 000001.txt | | | | 000002.txt | | | | 000003.txt | | | | 000004.txt | | | | 000006.txt | | | | ...... |---weights | | download weights.sh #下载权重文件命令 | | yolov5l.pt #yolov5l权重文件 | | yolov5m.pt #yolov5m权重文件 | | yolov5s.mlmodel #yolov5s权重文件(Core ML格式) | | yolov5s.onnx #yolov5s权重文件(onnx格式) | | yolov5s.pt #yolov5s权重文件 | | yolov5s.torchscript.pt #yolov5s权重文件(torchscript格式) | | yolov5x.pt #yolov5x权重文件参考资料1.https://www.bilibili.com/video/BV19K4y197u8?p=14
2021-03-30
Scene Text Detection Resources（场景文字检测资源汇总）[转载] [翻译]
1. 数据集1.1 水平文字数据集ICDAR 2003(IC03):Introduction: 它总共包含509张图像,258张用于训练和251张用于测试。 具体来说,它在训练集中包含1110个文本实例,而在测试集中包含1156个文本实例。 它具有单词级注释。 IC03仅考虑英文文本实例。Link: IC03-downloadICDAR 2011(IC11):Introduction: IC11是用于文本检测的英语数据集。 它包含484张图像,229张用于训练和255张用于测试。 该数据集中有1564个文本实例。 它提供单词级和字符级注释。Link:11-downloadICDAR 2013(IC13):Introduction: IC13与IC11几乎相同。 它总共包含462张图像,用于训练的229张图像和用于测试的233张图像。 具体来说,它在训练集中包含849个文本实例,而在测试集中包含1095个文本实例。Link: IC13-download1.2 任意四边形文本数据集USTB-SV1K:Introduction:USTB-SV1K是英语数据集。 它包含来自Google街景视图的1000张街道图像,总共2955个文本实例。 它仅提供单词级注释。Link: USTB-SV1K-downloadSVT:Introduction:它包含350张图像,总共725个英文文本实例。 SVT具有字符级别和单词级别的注释。 DVT的图像是从Google街景视图中获取的,分辨率较低。Link: SVT-downloadSVT-P:Introduction: 它包含639个裁剪的单词图像以进行测试。 从Google街景视图的侧面快照中选择了图像。 因此,大多数图像会因非正面视角而严重失真。 它是SVT的改进数据集。Link: SVT-P-download (Password : vnis)ICDAR 2015(IC15):Introduction: 它总共包含1500张图像,1000张用于训练和500张用于测试。 具体来说,它包含17548个文本实例。 它提供单词级别的注释。 IC15是第一个附带场景文本数据集,并且仅考虑英语单词。Link: IC15-downloadCOCO-Text:Introduction: 它总共包含63686张图像,用于训练的43686张图像,用于验证的10000张图像和用于测试的10000张图像。 具体来说,它包含145859个裁剪的单词图像以进行测试,包括手写和打印,清晰和模糊,英语和非英语。Link: COCO-Text-downloadMSRA-TD500:Introduction: 它总共包含500张图像。 它提供文本行级别的注释而不是单词,并提供多边形框而不是轴对齐的矩形来进行文本区域注释。 它包含英文和中文文本实例。Link: MSRA-TD500-downloadMLT 2017:Introduction:它总共包含10000个自然图像。 它提供单词级别的注释。 MLT有9种语言。 它是用于场景文本检测和识别的更真实和复杂的数据集。Link: MLT-downloadMLT 2019:Introduction: 它总共包含18000张图像。 它提供单词级别的注释。 与MLT相比,此数据集有10种语言。 它是用于场景文本检测和识别的更真实和复杂的数据集。Link: MLT-2019-downloadCTW:Introduction:它包含32285个中文文本的高分辨率街景图像,总共包含1018402个字符实例。 所有图像都在字符级别进行注释,包括其基础字符类型,绑定框和其他6个属性。 这些属性指示其背景是否复杂,是否凸起,是否为手写或印刷,是否被遮挡,是否扭曲,是否使用艺术字。Link: CTW-downloadRCTW-17:Introduction:它总共包含12514张图像,用于训练的11514张图像和用于测试的1000张图像。 RCTW-17中的图像大部分是通过照相机或手机收集的,其他则是生成的图像。 文本实例用平行四边形注释。 它是第一个大规模的中文数据集,也是当时发布的最大的数据集。Link: RCTW-17-downloadReCTS:Introduction:该数据集是大规模的中国街景商标数据集。 它基于中文单词和中文文本行级标签。 标记方法是任意四边形标记。 它总共包含20000张图像。Link: ReCTS-download1.3 不规则文本数据集CUTE80:Introduction: 它包含在自然场景中拍摄的80张高分辨率图像。 具体来说,它包含288个裁剪的单词图像以进行测试。 数据集集中在弯曲的文本上。 没有提供词典。Link: CUTE80-downloadTotal-Text:Introduction: 它总共包含1,555张图像。 具体来说,它包含11459个经裁剪的单词图像,这些图像具有三种以上不同的文本方向:水平,多方向和弯曲。Link: Total-Text-downloadSCUT-CTW1500:Introduction: 它总共包含1500张图像,1000张用于训练和500张用于测试。 具体来说,它包含10751个裁剪的单词图像以进行测试。 CTW-1500中的注释是具有14个顶点的多边形。 数据集主要由中文和英文组成。Link: CTW-1500-downloadLSVT:Introduction: LSVT由20,000个测试数据,30,000个完整注释的训练数据和400,000个弱注释的训练数据组成,这些数据称为部分标签。 带标签的文本区域展示了文本的多样性:水平,多向和弯曲。Link: LSVT-downloadArTs:Introduction: ArT包含10,166张图像,5,603张用于训练和4,563张用于测试。 收集它们时会考虑到文本形状的多样性,并且所有文本形状在ArT中都有大量存在。Link: ArT-download1.4 合成数据集Synth80k :Introduction:它包含80万幅图像,其中包含约800万个合成词实例。 每个文本实例都用其文本字符串,单词级和字符级的边界框进行注释。Link: Synth80k-downloadSynthText :Introduction:它包含600万个裁剪的单词图像。 生成过程与Synth90k相似。 它也以水平样式进行注释。Link: SynthText-download1.5 数据集对比 Comparison of Datasets Datasets Language Image Text instance Text Shape Annotation level Total Train Test Total Train Test Horizontal Arbitrary-Quadrilateral Multi-oriented Char Word Text-Line IC03 English 509 258 251 2266 1110 1156 ✓ ✕ ✕ ✕ ✓ ✕ IC11 English 484 229 255 1564 ~ ~ ✓ ✕ ✕ ✓ ✓ ✕ IC13 English 462 229 233 1944 849 1095 ✓ ✕ ✕ ✓ ✓ ✕ USTB-SV1K English 1000 500 500 2955 ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ SVT English 350 100 250 725 211 514 ✓ ✓ ✕ ✓ ✓ ✕ SVT-P English 238 ~ ~ 639 ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ IC15 English 1500 1000 500 17548 122318 5230 ✓ ✓ ✕ ✕ ✓ ✕ COCO-Text English 63686 43686 20000 145859 118309 27550 ✓ ✓ ✕ ✕ ✓ ✕ MSRA-TD500 English/Chinese 500 300 200 ~ ~ ~ ✓ ✓ ✕ ✕ ✕ ✓ MLT 2017 Multi-lingual 18000 7200 10800 ~ ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ MLT 2019 Multi-lingual 20000 10000 10000 ~ ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ CTW Chinese 32285 25887 6398 1018402 812872 
205530 ✓ ✓ ✕ ✓ ✓ ✕ RCTW-17 English/Chinese 12514 15114 1000 ~ ~ ~ ✓ ✓ ✕ ✕ ✕ ✓ ReCTS Chinese 20000 ~ ~ ~ ~ ~ ✓ ✓ ✕ ✓ ✓ ✕ CUTE80 English 80 ~ ~ ~ ~ ~ ✕ ✕ ✓ ✕ ✓ ✓ Total-Text English 1525 1225 300 9330 ~ ~ ✓ ✓ ✓ ✕ ✓ ✓ CTW-1500 English/Chinese 1500 1000 500 10751 ~ ~ ✓ ✓ ✓ ✕ ✓ ✓ LSVT English/Chinese 450000 430000 20000 ~ ~ ~ ✓ ✓ ✓ ✕ ✓ ✓ ArT English/Chinese 10166 5603 4563 ~ ~ ~ ✓ ✓ ✓ ✕ ✓ ✕ Synth80k English 80k ~ ~ 8m ~ ~ ✓ ✕ ✕ ✓ ✓ ✕ SynthText English 800k ~ ~ 6m ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ 2. 场景文本检测资源总结2.1 方法对比场景文本检测方法可以分为四个部分:(a) 传统方法; (b) 基于分割的方法;(c) 基于回归的方法;(d) 混合方法.注意:(1)“ Hori”代表水平场景文本数据集。 (2)“ Quad”代表任意四边形文本数据集。(3)“ Irreg”代表不规则场景文本数据集。 (4)“传统方法”代表不依赖深度学习的方法。2.1.1 传统方法 Method Model Code Hori Quad Irreg Source Time Highlight Yao et al. [1] TD-Mixture ✕ ✓ ✓ ✕ CVPR 2012 1) A new dataset MSRA-TD500 and protocol for evaluation. 2) Equipped a two-level classification scheme and two sets of features extractor. Yin et al. [2] ✕ ✓ ✕ ✕ TPAMI 2013 Extract Maximally Stable Extremal Regions (MSERs) as character candidates and group them together. Le et al. [5] HOCC ✕ ✓ ✓ ✕ CVPR 2014 HOCC + MSERs Yin et al. [7] ✕ ✓ ✓ ✕ TPAMI 2015 Presenting a unified distance metric learning framework for adaptive hierarchical clustering. Wu et al. [9] ✕ ✓ ✓ ✕ TMM 2015 Exploring gradient directional symmetry at component level for smoothing edge components before text detection. Tian et al. [17] ✕ ✓ ✕ ✕ IJCAI 2016 Scene text is first detected locally in individual frames and finally linked by an optimal tracking trajectory. Yang et al. [33] ✕ ✓ ✓ ✕ TIP 2017 A text detector will locate character candidates and extract text regions. Then they will linked by an optimal tracking trajectory. Liang et al. [8] ✕ ✓ ✓ ✓ TIP 2015 Exploring maxima stable extreme regions along with stroke width transform for detecting candidate text regions. Michal et al.[12] FASText ✕ ✓ ✓ ✕ ICCV 2015 Stroke keypoints are efficiently detected and then exploited to obtain stroke segmentations. 2.1.2基于分割的方法 Method Model Code Hori Quad Irreg Source Time Highlight Li et al. [3] ✕ ✓ ✓ ✕ TIP 2014 (1)develop three novel cues that are tailored for character detection and a Bayesian method for their integration; (2)design a Markov random field model to exploit the inherent dependencies between characters. Zhang et al. [14] ✕ ✓ ✓ ✕ CVPR 2016 Utilizing FCN for salient map detection and centroid of each character prediction. Zhu et al. [16] ✕ ✓ ✓ ✕ CVPR 2016 Performs a graph-based segmentation of connected components into words (Word-Graph). He et al. [18] Text-CNN ✕ ✓ ✓ ✕ TIP 2016 Developing a new learning mechanism to train the Text-CNN with multi-level and rich supervised information. Yao et al. [21] ✕ ✓ ✓ ✕ arXiv 2016 Proposing to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. Hu et al. [27] WordSup ✕ ✓ ✓ ✕ ICCV 2017 Proposing a weakly supervised framework that can utilize word annotations. Then the detected characters are fed to a text structure analysis module. Wu et al. [28] ✕ ✓ ✓ ✕ ICCV 2017 Introducing the border class to the text detection problem for the first time, and validate that the decoding process is largely simplified with the help of text border. Tang et al.[32] ✕ ✓ ✕ ✕ TIP 2017 A text-aware candidate text region(CTR) extraction model + CTR refinement model. Dai et al. [35] FTSN ✕ ✓ ✓ ✕ arXiv 2017 Detecting and segmenting the text instance jointly and simultaneously, leveraging merits from both semantic segmentation task and region proposal based object detection task. Wang et al. 
[38] ✕ ✓ ✕ ✕ ICDAR 2017 This paper proposes a novel character candidate extraction method based on super-pixel segmentation and hierarchical clustering. Deng et al. [40] PixelLink ✓ ✓ ✓ ✕ AAAI 2018 Text instances are first segmented out by linking pixels wthin the same instance together. Liu et al. [42] MCN ✕ ✓ ✓ ✕ CVPR 2018 Stochastic Flow Graph (SFG) + Markov Clustering. Lyu et al. [43] ✕ ✓ ✓ ✕ CVPR 2018 Detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions. Chu et al. [45] Border ✕ ✓ ✓ ✕ ECCV 2018 The paper presents a novel scene text detection technique that makes use of semantics-aware text borders and bootstrapping based text segment augmentation. Long et al. [46] TextSnake ✕ ✓ ✓ ✓ ECCV 2018 The paper proposes TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms based on symmetry axis. Yang et al. [47] IncepText ✕ ✓ ✓ ✕ IJCAI 2018 Designing a novel Inception-Text module and introduce deformable PSROI pooling to deal with multi-oriented text detection. Yue et al. [48] ✕ ✓ ✓ ✕ BMVC 2018 Proposing a general framework for text detection called Guided CNN to achieve the two goals simultaneously. Zhong et al. [53] AF-RPN ✕ ✓ ✓ ✕ arXiv 2018 Presenting AF-RPN(anchor-free) as an anchor-free and scale-friendly region proposal network for the Faster R-CNN framework. Wang et al. [54] PSENet ✓ ✓ ✓ ✓ CVPR 2019 Proposing a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance. Xu et al.[57] TextField ✕ ✓ ✓ ✓ arXiv 2018 Presenting a novel direction field which can represent scene texts of arbitrary shapes. Tian et al. [58] FTDN ✕ ✓ ✓ ✕ ICIP 2018 FTDN is able to segment text region and simultaneously regress text box at pixel-level. Tian et al. [83] ✕ ✓ ✓ ✓ CVPR 2019 Constraining embedding feature of pixels inside the same text region to share similar properties. Huang et al. [4] MSERs-CNN ✕ ✓ ✕ ✕ ECCV 2014 Combining MSERs with CNN Sun et al. [6] ✕ ✓ ✕ ✕ PR 2015 Presenting a robust text detection approach based on color-enhanced CER and neural networks. Baek et al. [62] CRAFT ✕ ✓ ✓ ✓ CVPR 2019 Proposing CRAFT effectively detect text area by exploring each character and affinity between characters. Richardson et al. [87] ✕ ✓ ✓ ✕ WACV 2019 Presenting an additional scale predictor the estimate the better scale of text regions for testing. Wang et al. [88] SAST ✕ ✓ ✓ ✓ ACMM 2019 Presenting a context attended multi-task learning framework for scene text detection. Wang et al. [90] PAN ✕ ✓ ✓ ✓ ICCV 2019 Proposing an efficient and accurate arbitrary-shaped text detector called Pixel Aggregation Network(PAN), 2.1.3 基于回归的方法 Method Model Code Hori Quad Irreg Source Time Highlight Gupta et al. [15] FCRN ✓ ✓ ✕ ✕ CVPR 2016 (a) Proposing a fast and scalable engine to generate synthetic images of text in clutter; (b) FCRN. Zhong et al. [20] DeepText ✕ ✓ ✕ ✕ arXiv 2016 (a) Inception-RPN; (b) Utilize ambiguous text category (ATC) information and multilevel region-of-interest pooling (MLRP). Liao et al. [22] TextBoxes ✓ ✓ ✕ ✕ AAAI 2017 Mainly basing SSD object detection framework. Liu et al. [25] DMPNet ✕ ✓ ✓ ✕ CVPR 2017 Quadrilateral sliding windows + shared Monte-Carlo method for fast and accurate computing of the polygonal areas + a sequential protocol for relative regression. He et al. 
[26] DDR ✕ ✓ ✓ ✕ ICCV 2017 Proposing an FCN that has bi-task outputs where one is pixel-wise classification between text and non-text, and the other is direct regression to determine the vertex coordinates of quadrilateral text boundaries. Jiang et al. [36] R2CNN ✕ ✓ ✓ ✕ arXiv 2017 Using the Region Proposal Network (RPN) to generate axis-aligned bounding boxes that enclose the texts with different orientations. Xing et al. [37] ArbiText ✕ ✓ ✓ ✕ arXiv 2017 Adopting the circle anchors and incorporating a pyramid pooling module into the Single Shot MultiBox Detector framework. Zhang et al. [39] FEN ✕ ✓ ✕ ✕ AAAI 2018 Proposing a refined scene text detector with a novel Feature Enhancement Network (FEN) for Region Proposal and Text Detection Refinement. Wang et al. [41] ITN ✕ ✓ ✓ ✕ CVPR 2018 ITN is presented to learn the geometry-aware representation encoding the unique geometric configurations of scene text instances with in-network transformation embedding. Liao et al. [44] RRD ✕ ✓ ✓ ✕ CVPR 2018 The regression branch extracts rotation-sensitive features, while the classification branch extracts rotation-invariant features by pooling the rotation sensitive features. Liao et al. [49] TextBoxes++ ✓ ✓ ✓ ✕ TIP 2018 Mainly basing SSD object detection framework and it replaces the rectangular box representation in conventional object detector by a quadrilateral or oriented rectangle representation. He et al. [50] ✕ ✓ ✓ ✕ TIP 2018 Proposing a scene text detection framework based on fully convolutional network with a bi-task prediction module. Ma et al. [51] RRPN ✓ ✓ ✓ ✕ TMM 2018 RRPN + RRoI Pooling. Zhu et al. [55] SLPR ✕ ✓ ✓ ✓ arXiv 2018 SLPR regresses multiple points on the edge of text line and then utilizes these points to sketch the outlines of the text. Deng et al. [56] ✓ ✓ ✓ ✕ arXiv 2018 CRPN employs corners to estimate the possible locations of text instances. And it also designs a embedded data augmentation module inside region-wise subnetwork. Cai et al. [59] FFN ✕ ✓ ✕ ✕ ICIP 2018 Proposing a Feature Fusion Network to deal with text regions differing in enormous sizes. Sabyasachi et al. [60] RGC ✕ ✓ ✓ ✕ ICIP 2018 Proposing a novel recurrent architecture to improve the learnings of a feature map at a given time. Liu et al. [63] CTD ✓ ✓ ✓ ✓ PR 2019 CTD + TLOC + PNMS Xie et al. [79] DeRPN ✓ ✓ ✕ ✕ AAAI 2019 DeRPN utilizes anchor string mechanism instead of anchor box in RPN. Wang et al. [82] ✕ ✓ ✓ ✓ CVPR 2019 Text-RPN + RNN Liu et al. [84] ✕ ✓ ✓ ✓ CVPR 2019 CSE mechanism He et al. [29] SSTD ✓ ✓ ✓ ✕ ICCV 2017 Proposing an attention mechanism. Then developing a hierarchical inception module which efficiently aggregates multi-scale inception features. Tian et al. [11] ✕ ✓ ✕ ✕ ICCV 2015 Cascade boosting detects character candidates, and the min-cost flow network model get the final result. Tian et al. [13] CTPN ✓ ✓ ✕ ✕ ECCV 2016 1) RPN + LSTM. 2) RPN incorporate a new vertical anchor mechanism and LSTM connects the region to get the final result. He et al. [19] ✕ ✓ ✓ ✕ ACCV 2016 ER detetctor detects regions to get coarse prediction of text regions. Then the local context is aggregated to classify the remaining regions to obtain a final prediction. Shi et al. [23] SegLink ✓ ✓ ✓ ✕ CVPR 2017 Decomposing text into segments and links. A link connects two adjacent segments. Tian et al. [30] WeText ✕ ✓ ✕ ✕ ICCV 2017 Proposing a weakly supervised scene text detection method (WeText). Zhu et al. [31] RTN ✕ ✓ ✕ ✕ ICDAR 2017 Mainly basing CTPN vertical vertical proposal mechanism. Ren et al. 
[34] ✕ ✓ ✕ ✕ TMM 2017 Proposing a CNN-based detector. It contains a text structure component detector layer, a spatial pyramid layer, and a multi-input-layer deep belief network (DBN). Zhang et al. [10] ✕ ✓ ✕ ✕ CVPR 2015 The proposed algorithm exploits the symmetry property of character groups and allows for direct extraction of text lines from natural images. Wang et al. [86] DSRN ✕ ✓ ✓ ✕ IJCAI 2019 Presenting a scale-transfer module and scale relationship module to handle the problem of scale variation. Tang et al.[89] Seglink++ ✕ ✓ ✓ ✓ PR 2019 Presenting instance aware component grouping (ICG) for arbitrary-shape text detection. Wang et al.[92] ContourNet ✓ ✓ ✓ ✓ CVPR 2020 1.A scale-insensitive Adaptive Region Proposal Network (AdaptiveRPN); 2. Local Orthogonal Texture-aware Module (LOTM). 2.1.4 混合方法 Method Model Code Hori Quad Irreg Source Time Highlight Tang et al. [52] SSFT ✕ ✓ ✕ ✕ TMM 2018 Proposing a novel scene text detection method that involves superpixel-based stroke feature transform (SSFT) and deep learning based region classification (DLRC). Xie et al.[61] SPCNet ✕ ✓ ✓ ✓ AAAI 2019 Text Context module + Re-Score mechanism. Liu et al. [64] PMTD ✓ ✓ ✓ ✕ arXiv 2019 Perform “soft” semantic segmentation. It assigns a soft pyramid label (i.e., a real value between 0 and 1) for each pixel within text instance. Liu et al. [80] BDN ✓ ✓ ✓ ✕ IJCAI 2019 Discretizing bouding boxes into key edges to address label confusion for text detection. Zhang et al. [81] LOMO ✕ ✓ ✓ ✓ CVPR 2019 DR + IRM + SEM Zhou et al. [24] EAST ✓ ✓ ✓ ✕ CVPR 2017 The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images with instance segmentation. Yue et al. [48] ✕ ✓ ✓ ✕ BMVC 2018 Proposing a general framework for text detection called Guided CNN to achieve the two goals simultaneously. Zhong et al. [53] AF-RPN ✕ ✓ ✓ ✕ arXiv 2018 Presenting AF-RPN(anchor-free) as an anchor-free and scale-friendly region proposal network for the Faster R-CNN framework. Xue et al.[85] MSR ✕ ✓ ✓ ✓ IJCAI 2019 Presenting a noval multi-scale regression network. Liao et al. [91] DB ✓ ✓ ✓ ✓ AAAI 2020 Presenting differentiable binarization module to adaptively set the thresholds for binarization, which simplifies the post-processing. Xiao et al. [93] SDM ✕ ✓ ✓ ✓ ECCV 2020 1. A novel sequential deformation method; 2. auxiliary character counting supervision. 2.2 检测结果2.2.1 水平文本数据集的检测结果 Method Model Source Time Method Category IC11[68] IC13 [69] IC05[67] P R F P R F P R F Yao et al. [1] TD-Mixture CVPR 2012 Traditional ~ ~ ~ 0.69 0.66 0.67 ~ ~ ~ Yin et al. [2] TPAMI 2013 0.86 0.68 0.76 ~ ~ ~ ~ ~ ~ Yin et al. [7] TPAMI 2015 0.838 0.66 0.738 ~ ~ ~ ~ ~ ~ Wu et al. [9] TMM 2015 ~ ~ ~ 0.76 0.70 0.73 ~ ~ ~ Liang et al. [8] TIP 2015 0.77 0.68 0.71 0.76 0.68 0.72 ~ ~ ~ Michal et al.[12] FASText ICCV 2015 ~ ~ ~ 0.84 0.69 0.77 ~ ~ ~ Li et al. [3] TIP 2014 Segmentation 0.80 0.62 0.70 ~ ~ ~ ~ ~ ~ Zhang et al. [14] CVPR 2016 ~ ~ ~ 0.88 0.78 0.83 ~ ~ ~ He et al. [18] Text-CNN TIP 2016 0.91 0.74 0.82 0.93 0.73 0.82 0.87 0.73 0.79 Yao et al. [21] arXiv 2016 ~ ~ ~ 0.889 0.802 0.843 ~ ~ ~ Hu et al. [27] WordSup ICCV 2017 ~ ~ ~ 0.933 0.875 0.903 ~ ~ ~ Tang et al.[32] TIP 2017 0.90 0.86 0.88 0.92 0.87 0.89 ~ ~ ~ Wang et al. [38] ICDAR 2017 0.87 0.78 0.82 0.87 0.82 0.84 ~ ~ ~ Deng et al. [40] PixelLink AAAI 2018 ~ ~ ~ 0.886 0.875 0.881 ~ ~ ~ Liu et al. [42] MCN CVPR 2018 ~ ~ ~ 0.88 0.87 0.88 ~ ~ ~ Lyu et al. [43] CVPR 2018 ~ ~ ~ 0.92 0.844 0.880 ~ ~ ~ Chu et al. 
[45] Border ECCV 2018 ~ ~ ~ 0.915 0.871 0.892 ~ ~ ~ Wang et al. [54] PSENet CVPR 2019 ~ ~ ~ 0.94 0.90 0.92 ~ ~ ~ Huang et al. [4] MSERs-CNN ECCV 2014 0.88 0.71 0.78 ~ ~ ~ 0.84 0.67 0.75 Sun et al. [6] PR 2015 0.92 0.91 0.91 0.94 0.92 0.93 ~ ~ ~ Gupta et al. [15] FCRN CVPR 2016 Regression 0.94 0.77 0.85 0.938 0.764 0.842 ~ ~ ~ Zhong et al. [20] DeepText arXiv 2016 0.87 0.83 0.85 0.85 0.81 0.83 ~ ~ ~ Liao et al. [22] TextBoxes AAAI 2017 0.89 0.82 0.86 0.89 0.83 0.86 ~ ~ ~ Liu et al. [25] DMPNet CVPR 2017 ~ ~ ~ 0.93 0.83 0.870 ~ ~ ~ Jiang et al. [36] R2CNN arXiv 2017 ~ ~ ~ 0.92 0.81 0.86 ~ ~ ~ Xing et al. [37] ArbiText arXiv 2017 ~ ~ ~ 0.826 0.936 0.877 ~ ~ ~ Wang et al. [41] ITN CVPR 2018 0.896 0.889 0.892 0.941 0.893 0.916 ~ ~ ~ Liao et al. [49] TextBoxes++ TIP 2018 ~ ~ ~ 0.92 0.86 0.89 ~ ~ ~ He et al. [50] TIP 2018 ~ ~ ~ 0.91 0.84 0.88 ~ ~ ~ Ma et al. [51] RRPN TMM 2018 ~ ~ ~ 0.95 0.89 0.91 ~ ~ ~ Zhu et al. [55] SLPR arXiv 2018 ~ ~ ~ 0.90 0.72 0.80 ~ ~ ~ Cai et al. [59] FFN ICIP 2018 ~ ~ ~ 0.92 0.84 0.876 ~ ~ ~ Sabyasachi et al. [60] RGC ICIP 2018 ~ ~ ~ 0.89 0.77 0.83 ~ ~ ~ Wang et al. [82] CVPR 2019 ~ ~ ~ 0.937 0.878 0.907 ~ ~ ~ Liu et al. [84] CVPR 2019 ~ ~ ~ 0.937 0.897 0.917 ~ ~ ~ He et al. [29] SSTD ICCV 2017 ~ ~ ~ 0.89 0.86 0.88 ~ ~ ~ Tian et al. [11] ICCV 2015 0.86 0.76 0.81 0.852 0.759 0.802 ~ ~ ~ Tian et al. [13] CTPN ECCV 2016 ~ ~ ~ 0.93 0.83 0.88 ~ ~ ~ He et al. [19] ACCV 2016 ~ ~ ~ 0.90 0.75 0.81 ~ ~ ~ Shi et al. [23] SegLink CVPR 2017 ~ ~ ~ 0.877 0.83 0.853 ~ ~ ~ Tian et al. [30] WeText ICCV 2017 ~ ~ ~ 0.911 0.831 0.869 ~ ~ ~ Zhu et al. [31] RTN ICDAR 2017 ~ ~ ~ 0.94 0.89 0.91 ~ ~ ~ Ren et al. [34] TMM 2017 0.78 0.67 0.72 0.81 0.67 0.73 ~ ~ ~ Zhang et al. [10] CVPR 2015 0.84 0.76 0.80 0.88 0.74 0.80 ~ ~ ~ Tang et al. [52] SSFT TMM 2018 Hybrid 0.906 0.847 0.876 0.911 0.861 0.885 ~ ~ ~ Xie et al.[61] SPCNet AAAI 2019 ~ ~ ~ 0.94 0.91 0.92 ~ ~ ~ Liu et al. [80] BDN IJCAI 2019 ~ ~ ~ 0.887 0.894 0.89 ~ ~ ~ Zhou et al. [24] EAST CVPR 2017 ~ ~ ~ 0.93 0.83 0.870 ~ ~ ~ Yue et al. [48] BMVC 2018 ~ ~ ~ 0.885 0.846 0.870 ~ ~ ~ Zhong et al. [53] AF-RPN arXiv 2018 ~ ~ ~ 0.94 0.90 0.92 ~ ~ ~ Xue et al.[85] MSR IJCAI 2019 ~ ~ ~ 0.918 0.885 0.901 ~ ~ ~ 2.2.2 任意四边形文本数据集的检测结果 Method Model Source Time Method Category IC15 [70] MSRA-TD500 [71] USTB-SV1K [65] SVT [66] P R F P R F P R F P R F Le et al. [5] HOCC CVPR 2014 Traditional ~ ~ ~ 0.71 0.62 0.66 ~ ~ ~ ~ ~ ~ Yin et al. [7] TPAMI 2015 ~ ~ ~ 0.81 0.63 0.71 0.499 0.454 0.475 ~ ~ ~ Wu et al. [9] TMM 2015 ~ ~ ~ 0.63 0.70 0.66 ~ ~ ~ ~ ~ ~ Tian et al. [17] IJCAI 2016 ~ ~ ~ 0.95 0.58 0.721 0.537 0.488 0.51 ~ ~ ~ Yang et al. [33] TIP 2017 ~ ~ ~ 0.95 0.58 0.72 0.54 0.49 0.51 ~ ~ ~ Liang et al. [8] TIP 2015 ~ ~ ~ 0.74 0.66 0.70 ~ ~ ~ ~ ~ ~ Zhang et al. [14] CVPR 2016 Segmentation 0.71 0.43 0.54 0.83 0.67 0.74 ~ ~ ~ ~ ~ ~ Zhu et al. [16] CVPR 2016 0.81 0.91 0.85 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [18] Text-CNN TIP 2016 ~ ~ ~ 0.76 0.61 0.69 ~ ~ ~ ~ ~ ~ Yao et al. [21] arXiv 2016 0.723 0.587 0.648 0.765 0.753 0.759 ~ ~ ~ ~ ~ ~ Hu et al. [27] WordSup ICCV 2017 0.793 0.77 0.782 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wu et al. [28] ICCV 2017 0.91 0.78 0.84 0.77 0.78 0.77 ~ ~ ~ ~ ~ ~ Dai et al. [35] FTSN arXiv 2017 0.886 0.80 0.841 0.876 0.771 0.82 ~ ~ ~ ~ ~ ~ Deng et al. [40] PixelLink AAAI 2018 0.855 0.820 0.837 0.830 0.732 0.778 ~ ~ ~ ~ ~ ~ Liu et al. [42] MCN CVPR 2018 0.72 0.80 0.76 0.88 0.79 0.83 ~ ~ ~ ~ ~ ~ Lyu et al. [43] CVPR 2018 0.895 0.797 0.843 0.876 0.762 0.815 ~ ~ ~ ~ ~ ~ Chu et al. [45] Border ECCV 2018 ~ ~ ~ 0.830 0.774 0.801 ~ ~ ~ ~ ~ ~ Long et al. 
[46] TextSnake ECCV 2018 0.849 0.804 0.826 0.832 0.739 0.783 ~ ~ ~ ~ ~ ~ Yang et al. [47] IncepText IJCAI 2018 0.938 0.873 0.905 0.875 0.790 0.830 ~ ~ ~ ~ ~ ~ Wang et al. [54] PSENet CVPR 2019 0.8692 0.845 0.8569 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xu et al.[57] TextField arXiv 2018 0.843 0.805 0.824 0.874 0.759 0.813 ~ ~ ~ ~ ~ ~ Tian et al. [58] FTDN ICIP 2018 0.847 0.773 0.809 ~ ~ ~ ~ ~ ~ ~ ~ ~ Tian et al. [83] CVPR 2019 0.883 0.850 0.866 0.842 0.817 0.829 ~ ~ ~ ~ ~ ~ Baek et al. [62] CRAFT CVPR 2019 0.898 0.843 0.869 0.882 0.782 0.829 ~ ~ ~ ~ ~ ~ Richardson et al. [87] IJCAI 2019 0.853 0.83 0.827 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wang et al. [88] SAST ACMM 2019 0.8755 0.8734 0.8744 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wang et al. [90] PAN ICCV 2019 0.84 0.819 0.829 0.844 0.838 0.821 ~ ~ ~ ~ ~ ~ Gupta et al. [15] FCRN CVPR 2016 Regression ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.651 0.599 0.624 Liu et al. [25] DMPNet CVPR 2017 0.732 0.682 0.706 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [26] DDR ICCV 2017 0.82 0.80 0.81 0.77 0.70 0.74 ~ ~ ~ ~ ~ ~ Jiang et al. [36] R2CNN arXiv 2017 0.856 0.797 0.825 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xing et al. [37] ArbiText arXiv 2017 0.792 0.735 0.759 0.78 0.72 0.75 ~ ~ ~ ~ ~ ~ Wang et al. [41] ITN CVPR 2018 0.857 0.741 0.795 0.903 0.723 0.803 ~ ~ ~ ~ ~ ~ Liao et al. [44] RRD CVPR 2018 0.88 0.8 0.838 0.876 0.73 0.79 ~ ~ ~ ~ ~ ~ Liao et al. [49] TextBoxes++ TIP 2018 0.878 0.785 0.829 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [50] TIP 2018 0.85 0.80 0.82 0.91 0.81 0.86 ~ ~ ~ ~ ~ ~ Ma et al. [51] RRPN TMM 2018 0.822 0.732 0.774 0.821 0.677 0.742 ~ ~ ~ ~ ~ ~ Zhu et al. [55] SLPR arXiv 2018 0.855 0.836 0.845 ~ ~ ~ ~ ~ ~ ~ ~ ~ Deng et al. [56] arXiv 2018 0.89 0.81 0.845 ~ ~ ~ ~ ~ ~ ~ ~ ~ Sabyasachi et al. [60] RGC ICIP 2018 0.83 0.81 0.82 0.85 0.76 0.80 ~ ~ ~ ~ ~ ~ Wang et al. [82] CVPR 2019 0.892 0.86 0.876 0.852 0.821 0.836 ~ ~ ~ ~ ~ ~ He et al. [29] SSTD ICCV 2017 0.80 0.73 0.77 ~ ~ ~ ~ ~ ~ ~ ~ ~ Tian et al. [13] CTPN ECCV 2016 0.74 0.52 0.61 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [19] ACCV 2016 ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.87 0.73 0.79 Shi et al. [23] SegLink CVPR 2017 0.731 0.768 0.75 0.86 0.70 0.77 ~ ~ ~ ~ ~ ~ Wang et al. [86] DSRN IJCAI 2019 0.832 0.796 0.814 0.876 0.712 0.785 ~ ~ ~ ~ ~ ~ Tang et al.[89] Seglink++ PR 2019 0.837 0.803 0.820 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wang et al. [92] ContourNet CVPR 2020 0.876 0.861 0.869 ~ ~ ~ ~ ~ ~ ~ ~ ~ Tang et al. [52] SSFT TMM 2018 Hybrid ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.541 0.758 0.631 Xie et al.[61] SPCNet AAAI 2019 0.89 0.86 0.87 ~ ~ ~ ~ ~ ~ ~ ~ ~ Liu et al. [64] PMTD arXiv 2019 0.913 0.874 0.893 ~ ~ ~ ~ ~ ~ ~ ~ ~ Liu et al. [80] BDN IJCAI 2019 0.881 0.846 0.863 0.87 0.815 0.842 ~ ~ ~ ~ ~ ~ Zhang et al. [81] LOMO CVPR 2019 0.878 0.876 0.877 ~ ~ ~ ~ ~ ~ ~ ~ ~ Zhou et al. [24] EAST CVPR 2017 0.833 0.783 0.807 0.873 0.674 0.761 ~ ~ ~ ~ ~ ~ Yue et al. [48] BMVC 2018 0.866 0.789 0.823 ~ ~ ~ ~ ~ ~ 0.691 0.660 0.675 Zhong et al. [53] AF-RPN arXiv 2018 0.89 0.83 0.86 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xue et al.[85] MSR IJCAI 2019 ~ ~ ~ 0.874 0.767 0.817 ~ ~ ~ ~ ~ ~ Liao et al. [91] DB AAAI 2020 0.918 0.832 0.873 0.915 0.792 0.849 ~ ~ ~ ~ ~ ~ Xiao et al. [93] SDM ECCV 2020 0.9196 0.8922 0.9057 ~ ~ ~ ~ ~ ~ ~ ~ ~ Method Model Source Time Method Category IC15 [70] MSRA-TD500 [71] USTB-SV1K [65] SVT [66] P R F P R F P R F P R F Le et al. [5] HOCC CVPR 2014 Traditional ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.80 0.73 0.76 Yao et al. [21] arXiv 2016 Segmentation 0.432 0.27 0.333 ~ ~ ~ ~ ~ ~ ~ ~ ~ Hu et al. [27] WordSup ICCV 2017 0.452 0.309 0.368 ~ ~ ~ ~ ~ ~ ~ ~ ~ Lyu et al. [43] CVPR 2018 0.351 0.348 0.349 ~ ~ ~ 0.743 0.706 0.724 ~ ~ ~ Chu et al. 
[45] Border ECCV 2018 ~ ~ ~ 0.782 0.588 0.671 0.777 0.621 0.690 ~ ~ ~ Yang et al. [47] IncepText IJCAI 2018 ~ ~ ~ 0.785 0.569 0.660 ~ ~ ~ ~ ~ ~ Wang et al. [54] PSENet CVPR 2019 ~ ~ ~ ~ ~ ~ 0.7535 0.6918 0.7213 ~ ~ ~ Baek et al. [62] CRAFT CVPR 2019 ~ ~ ~ ~ ~ ~ 0.806 0.682 0.739 ~ ~ ~ He et al. [29] SSTD ICCV 2017 Regression 0.46 0.31 0.37 ~ ~ ~ ~ ~ ~ ~ ~ ~ Gupta et al. [15] FCRN CVPR 2016 ~ ~ ~ ~ ~ ~ 0.844 0.763 0.801 ~ ~ ~ Liao et al. [49] TextBoxes++ TIP 2018 0.61 0.57 0.59 ~ ~ ~ ~ ~ ~ ~ ~ ~ Ma et al. [51] RRPN TMM 2018 ~ ~ ~ ~ ~ ~ 0.7669 0.5794 0.6601 ~ ~ ~ Deng et al. [56] arXiv 2018 0.555 0.633 0.591 ~ ~ ~ ~ ~ ~ ~ ~ ~ Cai et al. [59] FFN ICIP 2018 0.43 0.35 0.39 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xie et al. [79] DeRPN AAAI 2019 0.586 0.557 0.571 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [29] SSTD ICCV 2017 0.46 0.31 0.37 ~ ~ ~ ~ ~ ~ ~ ~ ~ Liao et al. [44] RRD CVPR 2018 ~ ~ ~ 0.591 0.775 0.670 ~ ~ ~ ~ ~ ~ Richardson et al. [87] IJCAI 2019 ~ ~ ~ ~ ~ ~ 0.729 0.618 0.669 ~ ~ ~ Wang et al. [88] SAST ACMM 2019 ~ ~ ~ ~ ~ ~ 0.7935 0.6653 0.7237 ~ ~ ~ Xie et al.[61] SPCNet AAAI 2019 Hybrid ~ ~ ~ ~ ~ ~ 0.806 0.686 0.741 ~ ~ ~ Liu et al. [64] PMTD arXiv 2019 ~ ~ ~ ~ ~ ~ 0.844 0.763 0.801 ~ ~ ~ Liu et al. [80] BDN IJCAI 2019 ~ ~ ~ ~ ~ ~ 0.791 0.698 0.742 ~ ~ ~ Zhang et al. [81] LOMO CVPR 2019 ~ ~ ~ 0.791 0.602 0.684 0.802 0.672 0.731 ~ ~ ~ Zhou et al. [24] EAST CVPR 2017 0.504 0.324 0.395 ~ ~ ~ ~ ~ ~ ~ ~ ~ Zhong et al. [53] AF-RPN arXiv 2018 ~ ~ ~ ~ ~ ~ 0.75 0.66 0.70 ~ ~ ~ Liao et al. [91] DB AAAI 2020 ~ ~ ~ ~ ~ ~ 0.831 0.679 0.747 ~ ~ ~ Xiao et al. [93] SDM ECCV 2020 ~ ~ ~ ~ ~ ~ 0.8679 0.7526 0.8061 ~ ~ ~ 2.2.3 不规则文本数据集的检测结果在本节中,我们仅选择适用于不规则文本检测的那些方法。 Method Model Source Time Method Category Total-text [74] SCUT-CTW1500 [75] P R F P R F Baek et al. [62] CRAFT CVPR 2019 Segmentation 0.876 0.799 0.836 0.860 0.811 0.835 Long et al. [46] TextSnake ECCV 2018 0.827 0.745 0.784 0.679 0.853 0.756 Tian et al. [83] CVPR 2019 ~ ~ ~ 81.7 84.2 80.1 Wang et al. [54] PSENet CVPR 2019 0.840 0.779 0.809 0.848 0.797 0.822 Wang et al. [88] SAST ACMM 2019 0.8557 0.7549 0.802 0.8119 0.8171 0.8145 Wang et al. [90] PAN ICCV 2019 0.893 0.81 0.85 0.864 0.812 0.837 Zhu et al. [55] SLPR arXiv 2018 Regression ~ ~ ~ 0.801 0.701 0.748 Liu et al. [63] CTD+TLOC PR 2019 ~ ~ ~ 0.774 0.698 0.734 Wang et al. [82] CVPR 2019 ~ ~ ~ 80.1 80.2 80.1 Liu et al. [84] CVPR 2019 0.814 0.791 0.802 0.787 0.761 0.774 Tang et al.[89] Seglink++ PR 2019 0.829 0.809 0.815 0.828 0.798 0.813 Wang et al. [92] ContourNet CVPR 2020 0.869 0.839 0.854 0.837 0.841 0.839 Zhang et al. [81] LOMO CVPR 2019 Hybrid 0.876 0.793 0.833 0.857 0.765 0.808 Xie et al.[61] SPCNet AAAI 2019 0.83 0.83 0.83 ~ ~ ~ Xue et al.[85] MSR IJCAI 2019 0.852 0.73 0.768 0.838 0.778 0.807 Liao et al. [91] DB AAAI 2020 0.871 0.825 0.847 0.869 0.802 0.834 Xiao et al.[93] SDM ECCV 2020 0.9085 0.8603 0.8837 0.884 0.8442 0.8636 3. 综述[A] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper[B] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper[C] [arXiv-2018] Long S, He X, Ya C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper4. Evaluation如果您有兴趣开发更好的场景文本检测指标,那么这里推荐的一些参考可能会有用:[A] Wolf, Christian, and Jean-Michel Jolion. 
"Object count/area graphs for the evaluation of object detection and segmentation algorithms." International Journal of Document Analysis and Recognition (IJDAR) 8.4 (2006): 280-296. paper[B] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D.Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. paper[C] Calarasanu, Stefania, Jonathan Fabrizio, and Severine Dubuisson. "What is a good evaluation protocol for text localization systems? Concerns, arguments, comparisons and solutions." Image and Vision Computing 46 (2016): 1-17. paper[D] Shi, Baoguang, et al. "ICDAR2017 competition on reading chinese text in the wild (RCTW-17)." 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017. paper[E] Nayef, N; Yin, F; Bizid, I; et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, 1454–1459. IEEE.paper[F] Dangla, Aliona, et al. "A first step toward a fair comparison of evaluation protocols for text detection algorithms." 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018. paper[G] He,Mengchao and Liu, Yuliang, et al. ICPR2018 Contest on Robust Reading for Multi-Type Web images. ICPR 2018. paper[H] Liu, Yuliang and Jin, Lianwen, et al. "Tightness-aware Evaluation Protocol for Scene Text Detection" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019. paper code5. OCR ServiceOCRAPIFreeTesseract OCR Engine×√Azure√√ABBYY√√OCR Space√√SODA PDF OCR√√Free Online OCR√√Online OCR√√Super Tools√√Online Chinese Recognition√√Calamari OCR×√Tencent OCR√×6. References and Code [1] Yao C, Bai X, Liu W, et al. Detecting texts of arbitrary orientations in natural images. 2012 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2012: 1083-1090. Paper[2] Yin X C, Yin X, Huang K, et al. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013, 36(5): 970-83. Paper[3] Li Y, Jia W, Shen C, et al. Characterness: An indicator of text in the wild. IEEE transactions on image processing, 2014, 23(4): 1666-1677. Paper[4] Huang W, Qiao Y, Tang X. Robust scene text detection with convolution neural network induced mser trees. European Conference on Computer Vision(ECCV), 2014: 497-511. Paper[5] Kang L, Li Y, Doermann D. Orientation robust text line detection in natural images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 4034-4041. Paper[6] Sun L, Huo Q, Jia W, et al. A robust approach for text detection from natural scene images. Pattern Recognition, 2015, 48(9): 2906-2920. Paper[7] Yin X C, Pei W Y, Zhang J, et al. Multi-orientation scene text detection with adaptive clustering. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015 (9): 1930-1937. Paper[8] Liang G, Shivakumara P, Lu T, et al. Multi-spectral fusion based approach for arbitrarily oriented scene text detection in video images. IEEE Transactions on Image Processing, 2015, 24(11): 4488-4501. Paper[9] Wu L, Shivakumara P, Lu T, et al. A New Technique for Multi-Oriented Scene Text Line Detection and Tracking in Video. IEEE Trans. Multimedia, 2015, 17(8): 1137-1152. Paper[10] Zheng Z, Wei S, et al. 
Symmetry-based text line detection in natural scenes. IEEE Conference on Computer Vision & Pattern Recognition(CVPR), 2015. Paper[11] Tian S, Pan Y, Huang C, et al. Text flow: A unified text detection system in natural scene images. Proceedings of the IEEE international conference on computer vision(ICCV). 2015: 4651-4659. Paper[12] Buta M, et al. FASText: Efficient unconstrained scene text detector. 2015 IEEE International Conference on Computer Vision (ICCV). 2015: 1206-1214. Paper[13] Tian Z, Huang W, He T, et al. Detecting text in natural image with connectionist text proposal network. European conference on computer vision(ECCV), 2016: 56-72. Paper Code[14] Zhang Z, Zhang C, Shen W, et al. Multi-oriented text detection with fully convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR). 2016: 4159-4167. Paper[15] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR). 2016: 2315-2324. Paper Code[16] S. Zhu and R. Zanibbi, A Text Detection System for Natural Scenes with Convolutional Feature Learning and Cascaded Classification, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 625-632. Paper[17] Tian S, Pei W Y, Zuo Z Y, et al. Scene Text Detection in Video by Learning Locally and Globally. IJCAI. 2016: 2647-2653. Paper[18] He T, Huang W, Qiao Y, et al. Text-attentional convolutional neural network for scene text detection. IEEE transactions on image processing, 2016, 25(6): 2529-2541. Paper[19] He, Dafang and Yang, Xiao and Huang, Wenyi and Zhou, Zihan and Kifer, Daniel and Giles, C Lee. Aggregating local context for accurate scene text detection. ACCV, 2016. Paper[20] Zhong Z, Jin L, Zhang S, et al. Deeptext: A unified framework for text proposal generation and text detection in natural images. arXiv preprint arXiv:1605.07314, 2016. Paper[21] Yao C, Bai X, Sang N, et al. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016. Paper[22] Liao M, Shi B, Bai X, et al. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. AAAI. 2017: 4161-4167. Paper Code[23] Shi B, Bai X, Belongie S. Detecting Oriented Text in Natural Images by Linking Segments. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3482-3490. Paper Code[24] Zhou X, Yao C, Wen H, et al. EAST: an efficient and accurate scene text detector. CVPR, 2017: 2642-2651. Paper Code[25] Liu Y, Jin L. Deep matching prior network: Toward tighter multi-oriented text detection. CVPR, 2017: 3454-3461. Paper[26] He W, Zhang X Y, Yin F, et al. Deep Direct Regression for Multi-Oriented Scene Text Detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017: 745-753. Paper[27] Hu H, Zhang C, Luo Y, et al. Wordsup: Exploiting word annotations for character based text detection. ICCV, 2017. Paper[28] Wu Y, Natarajan P. Self-organized text detection with minimal post-processing via border learning. ICCV, 2017. Paper[29] He P, Huang W, He T, et al. Single shot text detector with regional attention. The IEEE International Conference on Computer Vision (ICCV). 2017, 6(7). Paper Code[30] Tian S, Lu S, Li C. Wetext: Scene text detection under weak supervision. ICCV, 2017. Paper[31] Zhu, Xiangyu and Jiang, Yingying et al. Deep Residual Text Detection Network for Scene Text. ICDAR, 2017. Paper[32] Tang Y , Wu X. 
Scene Text Detection and Segmentation Based on Cascaded Convolution Neural Networks. IEEE Transactions on Image Processing, 2017, 26(3):1509-1520. Paper[33] Yang C, Yin X C, Pei W Y, et al. Tracking Based Multi-Orientation Scene Text Detection: A Unified Framework with Dynamic Programming. IEEE Transactions on Image Processing, 2017. Paper[34] X. Ren, Y. Zhou, J. He, K. Chen, X. Yang and J. Sun, A Convolutional Neural Network-Based Chinese Text Detection Algorithm via Text Structure Modeling. in IEEE Transactions on Multimedia, vol. 19, no. 3, pp. 506-518, March 2017. Paper[35] Dai Y, Huang Z, Gao Y, et al. Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272, 2017. Paper[36] Jiang Y, Zhu X, Wang X, et al. R2CNN: rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017. Paper[37] Xing D, Li Z, Chen X, et al. ArbiText: Arbitrary-Oriented Text Detection in Unconstrained Scene. arXiv preprint arXiv:1711.11249, 2017. Paper[38] C. Wang, F. Yin and C. Liu, Scene Text Detection with Novel Superpixel Based Character Candidate Extraction. in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 929-934. Paper[39] Sheng Zhang, Yuliang Liu, Lianwen Jin et al. Feature Enhancement Network: A Refined Scene Text Detector. In AAAI 2018. Paper[40] Dan Deng et al. PixelLink: Detecting Scene Text via Instance Segmentation. In AAAI 2018. Paper Code[41] Fangfang Wang, Liming Zhao, Xi L et al. Geometry-Aware Scene Text Detection with Instance Transformation Network. In CVPR 2018. Paper[42] Zichuan Liu, Guosheng Lin, Sheng Yang et al. Learning Markov Clustering Networks for Scene Text Detection. In CVPR 2018. Paper[43] Pengyuan Lyu, Cong Yao, Wenhao Wu et al. Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation. In CVPR 2018. Paper[44] Minghui L, Zhen Z, Baoguang S. Rotation-Sensitive Regression for Oriented Scene Text Detection. In CVPR 2018. Paper[45] Chuhui Xue et al. Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping. In ECCV 2018. Paper[46] Long, Shangbang and Ruan, Jiaqiang, et al. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. In ECCV, 2018. Paper[47] Qiangpeng Yang, Mengli Cheng et al. IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection. In IJCAI 2018. Paper[48] Xiaoyu Yue et al. Boosting up Scene Text Detectors with Guided CNN. In BMVC 2018. Paper[49] Liao M, Shi B , Bai X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Transactions on Image Processing, 2018, 27(8):3676-3690. Paper Code[50] W. He, X. Zhang, F. Yin and C. Liu, Multi-Oriented and Multi-Lingual Scene Text Detection With Direct Regression, in IEEE Transactions on Image Processing, vol. 27, no. 11, pp.5406-5419, 2018. Paper[51] Ma J, Shao W, Ye H, et al. Arbitrary-oriented scene text detection via rotation proposals.in IEEE Transactions on Multimedia, 2018. Paper Code[52] Youbao Tang and Xiangqian Wu. Scene Text Detection Using Superpixel-Based Stroke Feature Transform and Deep Learning Based Region Classification. In TMM, 2018. Paper[53] Zhuoyao Zhong, Lei Sun and Qiang Huo. An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches. arXiv preprint arXiv:1804.09003. 2018. Paper[54] Wenhai W, Enze X, et al. Shape Robust Text Detection with Progressive Scale Expansion Network. In CVPR 2019. 
Paper Code[55] Zhu Y, Du J. Sliding Line Point Regression for Shape Robust Scene Text Detection. arXiv preprint arXiv:1801.09969, 2018. Paper[56] Linjie D, Yanxiang Gong, et al. Detecting Multi-Oriented Text with Corner-based Region Proposals. arXiv preprint arXiv: 1804.02690, 2018. Paper Code[57] Yongchao Xu, Yukang Wang, Wei Zhou, et al. TextField: Learning A Deep Direction Field for Irregular Scene Text Detection. arXiv preprint arXiv: 1812.01393, 2018. Paper[58] Xiaowei Tian, Dao Wu, Rui Wang, Xiaochun Cao. Focal Text: an Accurate Text Detection with Focal Loss. In ICIP 2018. Paper[59] Chenqin C, Pin L, Bing S. Feature Fusion Network for Scene Text Detection. In ICIP, 2018. Paper[60] Sabyasachi Mohanty et al. Recurrent Global Convolutional Network for Scene Text Detection. In ICIP 2018. Paper[61] Enze Xie, et al. Scene Text Detection with Supervised Pyramid Context Network. In AAAI 2019. Paper[62] Youngmin Baek, Bado Lee, et al. Character Region Awareness for Text Detection. In CVPR 2019. Paper[63] Yuliang L, Lianwen J, Shuaitao Z, et al. Curved Scene Text Detection via Transverse and Longitudinal Sequence Connection. Pattern Recognition, 2019. Paper Code[64] Jingchao Liu, Xuebo Liu, et al, Pyramid Mask Text Detector. arXiv preprint arXiv:1903.11800, 2019. Paper Code[79] Lele Xie, Yuliang Liu, Lianwen Jin, Zecheng Xie, DeRPN: Taking a further step toward more general object detection. In AAAI, 2019. Paper Code[80] Yuliang Liu, Lianwen Jin, et al, Omnidirectional Scene Text Detction with Sequential-free Box Discretization. In IJCAI, 2019.Paper Code[81] Chengquan Zhang, Borong Liang, et al, Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes. In CVPR, 2019.Paper[82] Xiaobing Wang, Yingying Jiang, et al, Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation. In CVPR, 2019. Paper[83] Zhuotao Tian, Michelle Shu, et al, Learning Shape-Aware Embedding for Scene Text Detection. In CVPR, 2019. Paper[84] Zichuan Liu, Guosheng Lin, et al, Towards Robust Curve Text Detection with Conditional Spatial Expansion. In CVPR, 2019. Paper[85] Xue C, Lu S, Zhang W. MSR: multi-scale shape regression for scene text detection. In IJCAI, 2019. Paper[86] Wang Y, Xie H, Fu Z, et al. DSRN: a deep scale relationship network for scene text detection. In IJCAI, 2019: 947-953. Paper[87] Elad Richardson, et al, It's All About The Scale -- Efficient Text Detection Using Adaptive Scaling. In WACV, 2020. Paper[88] Pengfei Wang, et al, A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning. In ACMM, 2019. Paper[89] Jun Tang, et al, SegLink ++: Detecting Dense and Arbitrary-shaped Scene Text by Instance-aware Component Grouping. In PR, 2019. Paper[90] Wenhai Wang, et al, Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network. In ICCV, 2019. Paper[91] Minghui Liao, et al, Real-time Scene Text Detection with Differentiable Binarization. In AAAI, 2020. PaperCode[92] Wang, Yuxin, et al. ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection. CVPR. 2020. PaperCode[93] Xiao, et al, Sequential Deformation for Accurate Scene Text Detection. In ECCV, 2020. Paper DatasetsUSTB-SV1K[65]:Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao, Robust text detection in natural scene images, IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), priprint, 2013. PaperSVT[66]: Wang,Kai, and S. Belongie. Word Spotting in the Wild. 
European Conference on Computer Vision(ECCV), 2010: 591-604. PaperICDAR2005[67]: Lucas, S: ICDAR 2005 text locating competition results. In: ICDAR ,2005. PaperICDAR2011[68]: Shahab, A, Shafait, F, Dengel, A: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: ICDAR, 2011. PaperICDAR2013[69]:D. Karatzas, F. Shafait, S. Uchida, et al. ICDAR 2013 robust reading competition. In ICDAR, 2013. PaperICDAR2015[70]:D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D.Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. PaperMSRA-TD500[71]:C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, Detecting texts of arbitrary orientations in natural images. in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp.1083–1090.PaperCOCO-Text[72]:Veit A, Matera T, Neumann L, et al. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016. PaperRCTW-17[73]:Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading chinese text in the wild (RCTW-17). Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 1429-1434. PaperTotal-Text[74]:Chee C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition.Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942.PaperSCUT-CTW1500[75]:Yuliang L, Lianwen J, Shuaitao Z, et al. Curved Scene Text Detection via Transverse and Longitudinal Sequence Connection. Pattern Recognition, 2019.PaperMLT 2017[76]: Nayef, N; Yin, F; Bizid, I; et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, 1454–1459. IEEE. PaperOSTD[77]: Chucai Yi and YingLi Tian, Text string detection from natural scenes by structure-based partition and grouping, In IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2594–2605, 2011. PaperCTW[78]: Yuan T L, Zhu Z, Xu K, et al. Chinese Text in the Wild. arXiv preprint arXiv:1803.00085, 2018. Paper如果您发现我们的资源中有任何问题,或者我们错过了任何好的论文/代码,请通过liuchongyu1996@gmail.com通知我们。 感谢您的贡献。CopyrightCopyright © 2019 SCUT-DLVC. All Rights Reserved.
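A note on the P / R / F columns used throughout the result tables above: under the common IoU-based protocols (e.g. the ICDAR 2015 evaluation), a detection counts as a true positive when it overlaps a not-yet-matched ground-truth region by more than a threshold (typically IoU 0.5); precision is TP / #detections, recall is TP / #ground-truths, and F is their harmonic mean. The sketch below is only a simplified reading aid under assumed axis-aligned boxes and greedy one-to-one matching; the official protocols evaluate polygons and treat one-to-many matches with extra care, so it is not the official scoring code.
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f(detections, ground_truths, iou_thr=0.5):
    # Greedy one-to-one matching: each ground-truth box can be claimed at most once.
    matched, tp = set(), 0
    for det in detections:
        best_j, best_iou = None, iou_thr
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            s = iou(det, gt)
            if s >= best_iou:
                best_j, best_iou = j, s
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    p = tp / len(detections) if detections else 0.0
    r = tp / len(ground_truths) if ground_truths else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f
For example, 8 matched detections out of 10 predictions against 12 ground-truth words give P = 0.80, R ≈ 0.67 and F ≈ 0.73, which is the scale on which every entry in the tables above is reported (per dataset, aggregated over the whole test set).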
2021-03-30
Scene Text Detection Resources(场景文字识别资源汇总)[Repost]
1. Datasets1.1 Horizontal-Text DatasetsICDAR 2003(IC03):Introduction: It contains 509 images in total, 258 for training and 251 for testing. Specifically, it contains 1110 text instance in training set, while 1156 in testing set. It has word-level annotation. IC03 only consider English text instance.Link: IC03-downloadICDAR 2011(IC11):Introduction: IC11 is an English dataset for text detection. It contains 484 images, 229 for training and 255 for testing. There are 1564 text instance in this dataset. It provides both word-level and character-level annotation.Link: IC11-downloadICDAR 2013(IC13):Introduction: IC13 is almost the same as IC11. It contains 462 images in total, 229 for training and 233 for testing. Specifically, it contains 849 text instance in training set, while 1095 in testing set.Link: IC13-download1.2 Arbitrary-Quadrilateral-Text DatasetsUSTB-SV1K:Introduction: USTB-SV1K is an English dataset. It contains 1000 street images from Google Street View with 2955 text instance in total. It only provides word-level annotations.Link: USTB-SV1K-downloadSVT:Introduction: It contains 350 images with 725 English text intance in total. SVT has both character-level and word-level annotations. The images of SVT are harvested from Google Street View and have low resolution.Link: SVT-downloadSVT-P:Introduction: It contains 639 cropped word images for testing. Images were selected from the side-view angle snapshots in Google Street View. Therefore, most images are heavily distorted by the non-frontal view angle. It is the imporved datasets of SVT.Link: SVT-P-download (Password : vnis)ICDAR 2015(IC15):Introduction: It contains 1500 images in total, 1000 for training and 500 for testing. Specifically, it contains 17548 text instance. It provides word-level annotations. IC15 is the first incidental scene text dataset and it only considers English words.Link: IC15-downloadCOCO-Text:Introduction: It contains 63686 images in total, 43686 for training, 10000 for validating and 10000 for testing. Specifically, it contains 145859 cropped word images for testing, including handwritten and printed, clear and blur, English and non-English.Link: COCO-Text-downloadMSRA-TD500:Introduction: It contains 500 images in total. It provides text-line-level annotation rather than word, and polygon boxes rather than axis-aligned rectangles for text region annootation. It contains both English and Chinese text instance.Link: MSRA-TD500-downloadMLT 2017:Introduction: It contains 10000 natural images in total. It provides word-level annotation. There are 9 languages for MLT. It is a more real and complex datasets for scene text detection and recognition..Link: MLT-downloadMLT 2019:Introduction: It contains 18000 images in total. It provides word-level annotation. Compared to MLT, this dataset has 10 languages. It is a more real and complex datasets for scene text detection and recognition..Link: MLT-2019-downloadCTW:Introduction: It contains 32285 high resolution street view images of Chinese text, with 1018402 character instances in total. All images are annotated at the character level, including its underlying character type, bouding box, and 6 other attributes. These attributes indicate whether its background is complex, whether it’s raised, whether it’s hand-written or printed, whether it’s occluded, whether it’s distorted, whether it uses word-art.Link: CTW-downloadRCTW-17:Introduction: It contains 12514 images in total, 11514 for training and 1000 for testing. 
Images in RCTW-17 were mostly collected by camera or mobile phone, and others were generated images. Text instances are annotated with parallelograms. It is the first large scale Chinese dataset, and was also the largest published one by then.Link: RCTW-17-downloadReCTS:Introduction: This data set is a large-scale Chinese Street View Trademark Data Set. It is based on Chinese words and Chinese text line-level labeling. The labeling method is arbitrary quadrilateral labeling. It contains 20000 images in total.Link: ReCTS-download1.3 Irregular-Text DatasetsCUTE80:Introduction: It contains 80 high-resolution images taken in natural scenes. Specifically, it contains 288 cropped word images for testing. The dataset focuses on curved text. No lexicon is provided.Link: CUTE80-downloadTotal-Text:Introduction: It contains 1,555 images in total. Specifically, it contains 11,459 cropped word images with more than three different text orientations: horizontal, multi-oriented and curved.Link: Total-Text-downloadSCUT-CTW1500:Introduction: It contains 1500 images in total, 1000 for training and 500 for testing. Specifically, it contains 10751 cropped word images for testing. Annotations in CTW-1500 are polygons with 14 vertexes. The dataset mainly consists of Chinese and English.Link: CTW-1500-downloadLSVT:Introduction: LSVT consists of 20,000 testing data, 30,000 training data in full annotations and 400,000 training data in weak annotations, which are referred to as partial labels. The labeled text regions demonstrate the diversity of text: horizontal, multi-oriented and curved.Link: LSVT-downloadArTs:Introduction: ArT consists of 10,166 images, 5,603 for training and 4,563 for testing. They were collected with text shape diversity in mind and all text shapes have high number of existence in ArT.Link: ArT-download1.4 Synthetic DatasetsSynth80k :Introduction: It contains 800 thousands images with approximately 8 million synthetic word instances. Each text instance is annotated with its text-string, word-level and character-level bounding-boxes.Link: Synth80k-downloadSynthText :Introduction: It contains 6 million cropped word images. The generation process is similar to that of Synth90k. 
It is also annotated in horizontal-style.Link: SynthText-download1.5 Comparison of Datasets Comparison of Datasets Datasets Language Image Text instance Text Shape Annotation level Total Train Test Total Train Test Horizontal Arbitrary-Quadrilateral Multi-oriented Char Word Text-Line IC03 English 509 258 251 2266 1110 1156 ✓ ✕ ✕ ✕ ✓ ✕ IC11 English 484 229 255 1564 ~ ~ ✓ ✕ ✕ ✓ ✓ ✕ IC13 English 462 229 233 1944 849 1095 ✓ ✕ ✕ ✓ ✓ ✕ USTB-SV1K English 1000 500 500 2955 ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ SVT English 350 100 250 725 211 514 ✓ ✓ ✕ ✓ ✓ ✕ SVT-P English 238 ~ ~ 639 ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ IC15 English 1500 1000 500 17548 122318 5230 ✓ ✓ ✕ ✕ ✓ ✕ COCO-Text English 63686 43686 20000 145859 118309 27550 ✓ ✓ ✕ ✕ ✓ ✕ MSRA-TD500 English/Chinese 500 300 200 ~ ~ ~ ✓ ✓ ✕ ✕ ✕ ✓ MLT 2017 Multi-lingual 18000 7200 10800 ~ ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ MLT 2019 Multi-lingual 20000 10000 10000 ~ ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ CTW Chinese 32285 25887 6398 1018402 812872 205530 ✓ ✓ ✕ ✓ ✓ ✕ RCTW-17 English/Chinese 12514 15114 1000 ~ ~ ~ ✓ ✓ ✕ ✕ ✕ ✓ ReCTS Chinese 20000 ~ ~ ~ ~ ~ ✓ ✓ ✕ ✓ ✓ ✕ CUTE80 English 80 ~ ~ ~ ~ ~ ✕ ✕ ✓ ✕ ✓ ✓ Total-Text English 1525 1225 300 9330 ~ ~ ✓ ✓ ✓ ✕ ✓ ✓ CTW-1500 English/Chinese 1500 1000 500 10751 ~ ~ ✓ ✓ ✓ ✕ ✓ ✓ LSVT English/Chinese 450000 430000 20000 ~ ~ ~ ✓ ✓ ✓ ✕ ✓ ✓ ArT English/Chinese 10166 5603 4563 ~ ~ ~ ✓ ✓ ✓ ✕ ✓ ✕ Synth80k English 80k ~ ~ 8m ~ ~ ✓ ✕ ✕ ✓ ✓ ✕ SynthText English 800k ~ ~ 6m ~ ~ ✓ ✓ ✕ ✕ ✓ ✕ 2. Summary of Scene Text Detection Resources2.1 Comparison of MethodsScene text detection methods can be devided into four parts:(a) Traditional methods;(b) Segmentation-based methods;(c) Regression-based methods;(d) Hybrid methods.It is important to notice that: (1) "Hori" stands for horizontal scene text datasets. (2) "Quad" stands for arbitrary-quadrilateral-text datasets. (3) "Irreg" stands for irregular scence text datasets. (4) "Traditional method" stands for the methods that don't rely on deep learning.2.1.1 Traditional Methods Method Model Code Hori Quad Irreg Source Time Highlight Yao et al. [1] TD-Mixture ✕ ✓ ✓ ✕ CVPR 2012 1) A new dataset MSRA-TD500 and protocol for evaluation. 2) Equipped a two-level classification scheme and two sets of features extractor. Yin et al. [2] ✕ ✓ ✕ ✕ TPAMI 2013 Extract Maximally Stable Extremal Regions (MSERs) as character candidates and group them together. Le et al. [5] HOCC ✕ ✓ ✓ ✕ CVPR 2014 HOCC + MSERs Yin et al. [7] ✕ ✓ ✓ ✕ TPAMI 2015 Presenting a unified distance metric learning framework for adaptive hierarchical clustering. Wu et al. [9] ✕ ✓ ✓ ✕ TMM 2015 Exploring gradient directional symmetry at component level for smoothing edge components before text detection. Tian et al. [17] ✕ ✓ ✕ ✕ IJCAI 2016 Scene text is first detected locally in individual frames and finally linked by an optimal tracking trajectory. Yang et al. [33] ✕ ✓ ✓ ✕ TIP 2017 A text detector will locate character candidates and extract text regions. Then they will linked by an optimal tracking trajectory. Liang et al. [8] ✕ ✓ ✓ ✓ TIP 2015 Exploring maxima stable extreme regions along with stroke width transform for detecting candidate text regions. Michal et al.[12] FASText ✕ ✓ ✓ ✕ ICCV 2015 Stroke keypoints are efficiently detected and then exploited to obtain stroke segmentations. 2.1.2 Segmentation-based Methods Method Model Code Hori Quad Irreg Source Time Highlight Li et al. 
[3] ✕ ✓ ✓ ✕ TIP 2014 (1)develop three novel cues that are tailored for character detection and a Bayesian method for their integration; (2)design a Markov random field model to exploit the inherent dependencies between characters. Zhang et al. [14] ✕ ✓ ✓ ✕ CVPR 2016 Utilizing FCN for salient map detection and centroid of each character prediction. Zhu et al. [16] ✕ ✓ ✓ ✕ CVPR 2016 Performs a graph-based segmentation of connected components into words (Word-Graph). He et al. [18] Text-CNN ✕ ✓ ✓ ✕ TIP 2016 Developing a new learning mechanism to train the Text-CNN with multi-level and rich supervised information. Yao et al. [21] ✕ ✓ ✓ ✕ arXiv 2016 Proposing to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. Hu et al. [27] WordSup ✕ ✓ ✓ ✕ ICCV 2017 Proposing a weakly supervised framework that can utilize word annotations. Then the detected characters are fed to a text structure analysis module. Wu et al. [28] ✕ ✓ ✓ ✕ ICCV 2017 Introducing the border class to the text detection problem for the first time, and validate that the decoding process is largely simplified with the help of text border. Tang et al.[32] ✕ ✓ ✕ ✕ TIP 2017 A text-aware candidate text region(CTR) extraction model + CTR refinement model. Dai et al. [35] FTSN ✕ ✓ ✓ ✕ arXiv 2017 Detecting and segmenting the text instance jointly and simultaneously, leveraging merits from both semantic segmentation task and region proposal based object detection task. Wang et al. [38] ✕ ✓ ✕ ✕ ICDAR 2017 This paper proposes a novel character candidate extraction method based on super-pixel segmentation and hierarchical clustering. Deng et al. [40] PixelLink ✓ ✓ ✓ ✕ AAAI 2018 Text instances are first segmented out by linking pixels wthin the same instance together. Liu et al. [42] MCN ✕ ✓ ✓ ✕ CVPR 2018 Stochastic Flow Graph (SFG) + Markov Clustering. Lyu et al. [43] ✕ ✓ ✓ ✕ CVPR 2018 Detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions. Chu et al. [45] Border ✕ ✓ ✓ ✕ ECCV 2018 The paper presents a novel scene text detection technique that makes use of semantics-aware text borders and bootstrapping based text segment augmentation. Long et al. [46] TextSnake ✕ ✓ ✓ ✓ ECCV 2018 The paper proposes TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms based on symmetry axis. Yang et al. [47] IncepText ✕ ✓ ✓ ✕ IJCAI 2018 Designing a novel Inception-Text module and introduce deformable PSROI pooling to deal with multi-oriented text detection. Yue et al. [48] ✕ ✓ ✓ ✕ BMVC 2018 Proposing a general framework for text detection called Guided CNN to achieve the two goals simultaneously. Zhong et al. [53] AF-RPN ✕ ✓ ✓ ✕ arXiv 2018 Presenting AF-RPN(anchor-free) as an anchor-free and scale-friendly region proposal network for the Faster R-CNN framework. Wang et al. [54] PSENet ✓ ✓ ✓ ✓ CVPR 2019 Proposing a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance. Xu et al.[57] TextField ✕ ✓ ✓ ✓ arXiv 2018 Presenting a novel direction field which can represent scene texts of arbitrary shapes. Tian et al. [58] FTDN ✕ ✓ ✓ ✕ ICIP 2018 FTDN is able to segment text region and simultaneously regress text box at pixel-level. Tian et al. [83] ✕ ✓ ✓ ✓ CVPR 2019 Constraining embedding feature of pixels inside the same text region to share similar properties. Huang et al. 
[4] MSERs-CNN ✕ ✓ ✕ ✕ ECCV 2014 Combining MSERs with CNN Sun et al. [6] ✕ ✓ ✕ ✕ PR 2015 Presenting a robust text detection approach based on color-enhanced CER and neural networks. Baek et al. [62] CRAFT ✕ ✓ ✓ ✓ CVPR 2019 Proposing CRAFT effectively detect text area by exploring each character and affinity between characters. Richardson et al. [87] ✕ ✓ ✓ ✕ WACV 2019 Presenting an additional scale predictor the estimate the better scale of text regions for testing. Wang et al. [88] SAST ✕ ✓ ✓ ✓ ACMM 2019 Presenting a context attended multi-task learning framework for scene text detection. Wang et al. [90] PAN ✕ ✓ ✓ ✓ ICCV 2019 Proposing an efficient and accurate arbitrary-shaped text detector called Pixel Aggregation Network(PAN), 2.1.3 Regression-based Methods Method Model Code Hori Quad Irreg Source Time Highlight Gupta et al. [15] FCRN ✓ ✓ ✕ ✕ CVPR 2016 (a) Proposing a fast and scalable engine to generate synthetic images of text in clutter; (b) FCRN. Zhong et al. [20] DeepText ✕ ✓ ✕ ✕ arXiv 2016 (a) Inception-RPN; (b) Utilize ambiguous text category (ATC) information and multilevel region-of-interest pooling (MLRP). Liao et al. [22] TextBoxes ✓ ✓ ✕ ✕ AAAI 2017 Mainly basing SSD object detection framework. Liu et al. [25] DMPNet ✕ ✓ ✓ ✕ CVPR 2017 Quadrilateral sliding windows + shared Monte-Carlo method for fast and accurate computing of the polygonal areas + a sequential protocol for relative regression. He et al. [26] DDR ✕ ✓ ✓ ✕ ICCV 2017 Proposing an FCN that has bi-task outputs where one is pixel-wise classification between text and non-text, and the other is direct regression to determine the vertex coordinates of quadrilateral text boundaries. Jiang et al. [36] R2CNN ✕ ✓ ✓ ✕ arXiv 2017 Using the Region Proposal Network (RPN) to generate axis-aligned bounding boxes that enclose the texts with different orientations. Xing et al. [37] ArbiText ✕ ✓ ✓ ✕ arXiv 2017 Adopting the circle anchors and incorporating a pyramid pooling module into the Single Shot MultiBox Detector framework. Zhang et al. [39] FEN ✕ ✓ ✕ ✕ AAAI 2018 Proposing a refined scene text detector with a novel Feature Enhancement Network (FEN) for Region Proposal and Text Detection Refinement. Wang et al. [41] ITN ✕ ✓ ✓ ✕ CVPR 2018 ITN is presented to learn the geometry-aware representation encoding the unique geometric configurations of scene text instances with in-network transformation embedding. Liao et al. [44] RRD ✕ ✓ ✓ ✕ CVPR 2018 The regression branch extracts rotation-sensitive features, while the classification branch extracts rotation-invariant features by pooling the rotation sensitive features. Liao et al. [49] TextBoxes++ ✓ ✓ ✓ ✕ TIP 2018 Mainly basing SSD object detection framework and it replaces the rectangular box representation in conventional object detector by a quadrilateral or oriented rectangle representation. He et al. [50] ✕ ✓ ✓ ✕ TIP 2018 Proposing a scene text detection framework based on fully convolutional network with a bi-task prediction module. Ma et al. [51] RRPN ✓ ✓ ✓ ✕ TMM 2018 RRPN + RRoI Pooling. Zhu et al. [55] SLPR ✕ ✓ ✓ ✓ arXiv 2018 SLPR regresses multiple points on the edge of text line and then utilizes these points to sketch the outlines of the text. Deng et al. [56] ✓ ✓ ✓ ✕ arXiv 2018 CRPN employs corners to estimate the possible locations of text instances. And it also designs a embedded data augmentation module inside region-wise subnetwork. Cai et al. [59] FFN ✕ ✓ ✕ ✕ ICIP 2018 Proposing a Feature Fusion Network to deal with text regions differing in enormous sizes. 
Sabyasachi et al. [60] RGC ✕ ✓ ✓ ✕ ICIP 2018 Proposing a novel recurrent architecture to improve the learnings of a feature map at a given time. Liu et al. [63] CTD ✓ ✓ ✓ ✓ PR 2019 CTD + TLOC + PNMS Xie et al. [79] DeRPN ✓ ✓ ✕ ✕ AAAI 2019 DeRPN utilizes anchor string mechanism instead of anchor box in RPN. Wang et al. [82] ✕ ✓ ✓ ✓ CVPR 2019 Text-RPN + RNN Liu et al. [84] ✕ ✓ ✓ ✓ CVPR 2019 CSE mechanism He et al. [29] SSTD ✓ ✓ ✓ ✕ ICCV 2017 Proposing an attention mechanism. Then developing a hierarchical inception module which efficiently aggregates multi-scale inception features. Tian et al. [11] ✕ ✓ ✕ ✕ ICCV 2015 Cascade boosting detects character candidates, and the min-cost flow network model get the final result. Tian et al. [13] CTPN ✓ ✓ ✕ ✕ ECCV 2016 1) RPN + LSTM. 2) RPN incorporate a new vertical anchor mechanism and LSTM connects the region to get the final result. He et al. [19] ✕ ✓ ✓ ✕ ACCV 2016 ER detetctor detects regions to get coarse prediction of text regions. Then the local context is aggregated to classify the remaining regions to obtain a final prediction. Shi et al. [23] SegLink ✓ ✓ ✓ ✕ CVPR 2017 Decomposing text into segments and links. A link connects two adjacent segments. Tian et al. [30] WeText ✕ ✓ ✕ ✕ ICCV 2017 Proposing a weakly supervised scene text detection method (WeText). Zhu et al. [31] RTN ✕ ✓ ✕ ✕ ICDAR 2017 Mainly basing CTPN vertical vertical proposal mechanism. Ren et al. [34] ✕ ✓ ✕ ✕ TMM 2017 Proposing a CNN-based detector. It contains a text structure component detector layer, a spatial pyramid layer, and a multi-input-layer deep belief network (DBN). Zhang et al. [10] ✕ ✓ ✕ ✕ CVPR 2015 The proposed algorithm exploits the symmetry property of character groups and allows for direct extraction of text lines from natural images. Wang et al. [86] DSRN ✕ ✓ ✓ ✕ IJCAI 2019 Presenting a scale-transfer module and scale relationship module to handle the problem of scale variation. Tang et al.[89] Seglink++ ✕ ✓ ✓ ✓ PR 2019 Presenting instance aware component grouping (ICG) for arbitrary-shape text detection. Wang et al.[92] ContourNet ✓ ✓ ✓ ✓ CVPR 2020 1.A scale-insensitive Adaptive Region Proposal Network (AdaptiveRPN); 2. Local Orthogonal Texture-aware Module (LOTM). 2.1.4 Hybrid Methods Method Model Code Hori Quad Irreg Source Time Highlight Tang et al. [52] SSFT ✕ ✓ ✕ ✕ TMM 2018 Proposing a novel scene text detection method that involves superpixel-based stroke feature transform (SSFT) and deep learning based region classification (DLRC). Xie et al.[61] SPCNet ✕ ✓ ✓ ✓ AAAI 2019 Text Context module + Re-Score mechanism. Liu et al. [64] PMTD ✓ ✓ ✓ ✕ arXiv 2019 Perform “soft” semantic segmentation. It assigns a soft pyramid label (i.e., a real value between 0 and 1) for each pixel within text instance. Liu et al. [80] BDN ✓ ✓ ✓ ✕ IJCAI 2019 Discretizing bouding boxes into key edges to address label confusion for text detection. Zhang et al. [81] LOMO ✕ ✓ ✓ ✓ CVPR 2019 DR + IRM + SEM Zhou et al. [24] EAST ✓ ✓ ✓ ✕ CVPR 2017 The pipeline directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images with instance segmentation. Yue et al. [48] ✕ ✓ ✓ ✕ BMVC 2018 Proposing a general framework for text detection called Guided CNN to achieve the two goals simultaneously. Zhong et al. [53] AF-RPN ✕ ✓ ✓ ✕ arXiv 2018 Presenting AF-RPN(anchor-free) as an anchor-free and scale-friendly region proposal network for the Faster R-CNN framework. Xue et al.[85] MSR ✕ ✓ ✓ ✓ IJCAI 2019 Presenting a noval multi-scale regression network. 
Liao et al. [91] DB ✓ ✓ ✓ ✓ AAAI 2020 Presenting differentiable binarization module to adaptively set the thresholds for binarization, which simplifies the post-processing. Xiao et al. [93] SDM ✕ ✓ ✓ ✓ ECCV 2020 1. A novel sequential deformation method; 2. auxiliary character counting supervision. 2.2 Detection Results2.2.1 Detection Results on Horizontal-Text Datasets Method Model Source Time Method Category IC11[68] IC13 [69] IC05[67] P R F P R F P R F Yao et al. [1] TD-Mixture CVPR 2012 Traditional ~ ~ ~ 0.69 0.66 0.67 ~ ~ ~ Yin et al. [2] TPAMI 2013 0.86 0.68 0.76 ~ ~ ~ ~ ~ ~ Yin et al. [7] TPAMI 2015 0.838 0.66 0.738 ~ ~ ~ ~ ~ ~ Wu et al. [9] TMM 2015 ~ ~ ~ 0.76 0.70 0.73 ~ ~ ~ Liang et al. [8] TIP 2015 0.77 0.68 0.71 0.76 0.68 0.72 ~ ~ ~ Michal et al.[12] FASText ICCV 2015 ~ ~ ~ 0.84 0.69 0.77 ~ ~ ~ Li et al. [3] TIP 2014 Segmentation 0.80 0.62 0.70 ~ ~ ~ ~ ~ ~ Zhang et al. [14] CVPR 2016 ~ ~ ~ 0.88 0.78 0.83 ~ ~ ~ He et al. [18] Text-CNN TIP 2016 0.91 0.74 0.82 0.93 0.73 0.82 0.87 0.73 0.79 Yao et al. [21] arXiv 2016 ~ ~ ~ 0.889 0.802 0.843 ~ ~ ~ Hu et al. [27] WordSup ICCV 2017 ~ ~ ~ 0.933 0.875 0.903 ~ ~ ~ Tang et al.[32] TIP 2017 0.90 0.86 0.88 0.92 0.87 0.89 ~ ~ ~ Wang et al. [38] ICDAR 2017 0.87 0.78 0.82 0.87 0.82 0.84 ~ ~ ~ Deng et al. [40] PixelLink AAAI 2018 ~ ~ ~ 0.886 0.875 0.881 ~ ~ ~ Liu et al. [42] MCN CVPR 2018 ~ ~ ~ 0.88 0.87 0.88 ~ ~ ~ Lyu et al. [43] CVPR 2018 ~ ~ ~ 0.92 0.844 0.880 ~ ~ ~ Chu et al. [45] Border ECCV 2018 ~ ~ ~ 0.915 0.871 0.892 ~ ~ ~ Wang et al. [54] PSENet CVPR 2019 ~ ~ ~ 0.94 0.90 0.92 ~ ~ ~ Huang et al. [4] MSERs-CNN ECCV 2014 0.88 0.71 0.78 ~ ~ ~ 0.84 0.67 0.75 Sun et al. [6] PR 2015 0.92 0.91 0.91 0.94 0.92 0.93 ~ ~ ~ Gupta et al. [15] FCRN CVPR 2016 Regression 0.94 0.77 0.85 0.938 0.764 0.842 ~ ~ ~ Zhong et al. [20] DeepText arXiv 2016 0.87 0.83 0.85 0.85 0.81 0.83 ~ ~ ~ Liao et al. [22] TextBoxes AAAI 2017 0.89 0.82 0.86 0.89 0.83 0.86 ~ ~ ~ Liu et al. [25] DMPNet CVPR 2017 ~ ~ ~ 0.93 0.83 0.870 ~ ~ ~ Jiang et al. [36] R2CNN arXiv 2017 ~ ~ ~ 0.92 0.81 0.86 ~ ~ ~ Xing et al. [37] ArbiText arXiv 2017 ~ ~ ~ 0.826 0.936 0.877 ~ ~ ~ Wang et al. [41] ITN CVPR 2018 0.896 0.889 0.892 0.941 0.893 0.916 ~ ~ ~ Liao et al. [49] TextBoxes++ TIP 2018 ~ ~ ~ 0.92 0.86 0.89 ~ ~ ~ He et al. [50] TIP 2018 ~ ~ ~ 0.91 0.84 0.88 ~ ~ ~ Ma et al. [51] RRPN TMM 2018 ~ ~ ~ 0.95 0.89 0.91 ~ ~ ~ Zhu et al. [55] SLPR arXiv 2018 ~ ~ ~ 0.90 0.72 0.80 ~ ~ ~ Cai et al. [59] FFN ICIP 2018 ~ ~ ~ 0.92 0.84 0.876 ~ ~ ~ Sabyasachi et al. [60] RGC ICIP 2018 ~ ~ ~ 0.89 0.77 0.83 ~ ~ ~ Wang et al. [82] CVPR 2019 ~ ~ ~ 0.937 0.878 0.907 ~ ~ ~ Liu et al. [84] CVPR 2019 ~ ~ ~ 0.937 0.897 0.917 ~ ~ ~ He et al. [29] SSTD ICCV 2017 ~ ~ ~ 0.89 0.86 0.88 ~ ~ ~ Tian et al. [11] ICCV 2015 0.86 0.76 0.81 0.852 0.759 0.802 ~ ~ ~ Tian et al. [13] CTPN ECCV 2016 ~ ~ ~ 0.93 0.83 0.88 ~ ~ ~ He et al. [19] ACCV 2016 ~ ~ ~ 0.90 0.75 0.81 ~ ~ ~ Shi et al. [23] SegLink CVPR 2017 ~ ~ ~ 0.877 0.83 0.853 ~ ~ ~ Tian et al. [30] WeText ICCV 2017 ~ ~ ~ 0.911 0.831 0.869 ~ ~ ~ Zhu et al. [31] RTN ICDAR 2017 ~ ~ ~ 0.94 0.89 0.91 ~ ~ ~ Ren et al. [34] TMM 2017 0.78 0.67 0.72 0.81 0.67 0.73 ~ ~ ~ Zhang et al. [10] CVPR 2015 0.84 0.76 0.80 0.88 0.74 0.80 ~ ~ ~ Tang et al. [52] SSFT TMM 2018 Hybrid 0.906 0.847 0.876 0.911 0.861 0.885 ~ ~ ~ Xie et al.[61] SPCNet AAAI 2019 ~ ~ ~ 0.94 0.91 0.92 ~ ~ ~ Liu et al. [80] BDN IJCAI 2019 ~ ~ ~ 0.887 0.894 0.89 ~ ~ ~ Zhou et al. [24] EAST CVPR 2017 ~ ~ ~ 0.93 0.83 0.870 ~ ~ ~ Yue et al. [48] BMVC 2018 ~ ~ ~ 0.885 0.846 0.870 ~ ~ ~ Zhong et al. 
[53] AF-RPN arXiv 2018 ~ ~ ~ 0.94 0.90 0.92 ~ ~ ~ Xue et al.[85] MSR IJCAI 2019 ~ ~ ~ 0.918 0.885 0.901 ~ ~ ~ 2.2.2 Detection Results on Arbitrary-Quadrilateral-Text Datasets Method Model Source Time Method Category IC15 [70] MSRA-TD500 [71] USTB-SV1K [65] SVT [66] P R F P R F P R F P R F Le et al. [5] HOCC CVPR 2014 Traditional ~ ~ ~ 0.71 0.62 0.66 ~ ~ ~ ~ ~ ~ Yin et al. [7] TPAMI 2015 ~ ~ ~ 0.81 0.63 0.71 0.499 0.454 0.475 ~ ~ ~ Wu et al. [9] TMM 2015 ~ ~ ~ 0.63 0.70 0.66 ~ ~ ~ ~ ~ ~ Tian et al. [17] IJCAI 2016 ~ ~ ~ 0.95 0.58 0.721 0.537 0.488 0.51 ~ ~ ~ Yang et al. [33] TIP 2017 ~ ~ ~ 0.95 0.58 0.72 0.54 0.49 0.51 ~ ~ ~ Liang et al. [8] TIP 2015 ~ ~ ~ 0.74 0.66 0.70 ~ ~ ~ ~ ~ ~ Zhang et al. [14] CVPR 2016 Segmentation 0.71 0.43 0.54 0.83 0.67 0.74 ~ ~ ~ ~ ~ ~ Zhu et al. [16] CVPR 2016 0.81 0.91 0.85 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [18] Text-CNN TIP 2016 ~ ~ ~ 0.76 0.61 0.69 ~ ~ ~ ~ ~ ~ Yao et al. [21] arXiv 2016 0.723 0.587 0.648 0.765 0.753 0.759 ~ ~ ~ ~ ~ ~ Hu et al. [27] WordSup ICCV 2017 0.793 0.77 0.782 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wu et al. [28] ICCV 2017 0.91 0.78 0.84 0.77 0.78 0.77 ~ ~ ~ ~ ~ ~ Dai et al. [35] FTSN arXiv 2017 0.886 0.80 0.841 0.876 0.771 0.82 ~ ~ ~ ~ ~ ~ Deng et al. [40] PixelLink AAAI 2018 0.855 0.820 0.837 0.830 0.732 0.778 ~ ~ ~ ~ ~ ~ Liu et al. [42] MCN CVPR 2018 0.72 0.80 0.76 0.88 0.79 0.83 ~ ~ ~ ~ ~ ~ Lyu et al. [43] CVPR 2018 0.895 0.797 0.843 0.876 0.762 0.815 ~ ~ ~ ~ ~ ~ Chu et al. [45] Border ECCV 2018 ~ ~ ~ 0.830 0.774 0.801 ~ ~ ~ ~ ~ ~ Long et al. [46] TextSnake ECCV 2018 0.849 0.804 0.826 0.832 0.739 0.783 ~ ~ ~ ~ ~ ~ Yang et al. [47] IncepText IJCAI 2018 0.938 0.873 0.905 0.875 0.790 0.830 ~ ~ ~ ~ ~ ~ Wang et al. [54] PSENet CVPR 2019 0.8692 0.845 0.8569 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xu et al.[57] TextField arXiv 2018 0.843 0.805 0.824 0.874 0.759 0.813 ~ ~ ~ ~ ~ ~ Tian et al. [58] FTDN ICIP 2018 0.847 0.773 0.809 ~ ~ ~ ~ ~ ~ ~ ~ ~ Tian et al. [83] CVPR 2019 0.883 0.850 0.866 0.842 0.817 0.829 ~ ~ ~ ~ ~ ~ Baek et al. [62] CRAFT CVPR 2019 0.898 0.843 0.869 0.882 0.782 0.829 ~ ~ ~ ~ ~ ~ Richardson et al. [87] IJCAI 2019 0.853 0.83 0.827 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wang et al. [88] SAST ACMM 2019 0.8755 0.8734 0.8744 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wang et al. [90] PAN ICCV 2019 0.84 0.819 0.829 0.844 0.838 0.821 ~ ~ ~ ~ ~ ~ Gupta et al. [15] FCRN CVPR 2016 Regression ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.651 0.599 0.624 Liu et al. [25] DMPNet CVPR 2017 0.732 0.682 0.706 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [26] DDR ICCV 2017 0.82 0.80 0.81 0.77 0.70 0.74 ~ ~ ~ ~ ~ ~ Jiang et al. [36] R2CNN arXiv 2017 0.856 0.797 0.825 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xing et al. [37] ArbiText arXiv 2017 0.792 0.735 0.759 0.78 0.72 0.75 ~ ~ ~ ~ ~ ~ Wang et al. [41] ITN CVPR 2018 0.857 0.741 0.795 0.903 0.723 0.803 ~ ~ ~ ~ ~ ~ Liao et al. [44] RRD CVPR 2018 0.88 0.8 0.838 0.876 0.73 0.79 ~ ~ ~ ~ ~ ~ Liao et al. [49] TextBoxes++ TIP 2018 0.878 0.785 0.829 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [50] TIP 2018 0.85 0.80 0.82 0.91 0.81 0.86 ~ ~ ~ ~ ~ ~ Ma et al. [51] RRPN TMM 2018 0.822 0.732 0.774 0.821 0.677 0.742 ~ ~ ~ ~ ~ ~ Zhu et al. [55] SLPR arXiv 2018 0.855 0.836 0.845 ~ ~ ~ ~ ~ ~ ~ ~ ~ Deng et al. [56] arXiv 2018 0.89 0.81 0.845 ~ ~ ~ ~ ~ ~ ~ ~ ~ Sabyasachi et al. [60] RGC ICIP 2018 0.83 0.81 0.82 0.85 0.76 0.80 ~ ~ ~ ~ ~ ~ Wang et al. [82] CVPR 2019 0.892 0.86 0.876 0.852 0.821 0.836 ~ ~ ~ ~ ~ ~ He et al. [29] SSTD ICCV 2017 0.80 0.73 0.77 ~ ~ ~ ~ ~ ~ ~ ~ ~ Tian et al. [13] CTPN ECCV 2016 0.74 0.52 0.61 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [19] ACCV 2016 ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.87 0.73 0.79 Shi et al. 
[23] SegLink CVPR 2017 0.731 0.768 0.75 0.86 0.70 0.77 ~ ~ ~ ~ ~ ~ Wang et al. [86] DSRN IJCAI 2019 0.832 0.796 0.814 0.876 0.712 0.785 ~ ~ ~ ~ ~ ~ Tang et al.[89] Seglink++ PR 2019 0.837 0.803 0.820 ~ ~ ~ ~ ~ ~ ~ ~ ~ Wang et al. [92] ContourNet CVPR 2020 0.876 0.861 0.869 ~ ~ ~ ~ ~ ~ ~ ~ ~ Tang et al. [52] SSFT TMM 2018 Hybrid ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.541 0.758 0.631 Xie et al.[61] SPCNet AAAI 2019 0.89 0.86 0.87 ~ ~ ~ ~ ~ ~ ~ ~ ~ Liu et al. [64] PMTD arXiv 2019 0.913 0.874 0.893 ~ ~ ~ ~ ~ ~ ~ ~ ~ Liu et al. [80] BDN IJCAI 2019 0.881 0.846 0.863 0.87 0.815 0.842 ~ ~ ~ ~ ~ ~ Zhang et al. [81] LOMO CVPR 2019 0.878 0.876 0.877 ~ ~ ~ ~ ~ ~ ~ ~ ~ Zhou et al. [24] EAST CVPR 2017 0.833 0.783 0.807 0.873 0.674 0.761 ~ ~ ~ ~ ~ ~ Yue et al. [48] BMVC 2018 0.866 0.789 0.823 ~ ~ ~ ~ ~ ~ 0.691 0.660 0.675 Zhong et al. [53] AF-RPN arXiv 2018 0.89 0.83 0.86 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xue et al.[85] MSR IJCAI 2019 ~ ~ ~ 0.874 0.767 0.817 ~ ~ ~ ~ ~ ~ Liao et al. [91] DB AAAI 2020 0.918 0.832 0.873 0.915 0.792 0.849 ~ ~ ~ ~ ~ ~ Xiao et al. [93] SDM ECCV 2020 0.9196 0.8922 0.9057 ~ ~ ~ ~ ~ ~ ~ ~ ~ Method Model Source Time Method Category IC15 [70] MSRA-TD500 [71] USTB-SV1K [65] SVT [66] P R F P R F P R F P R F Le et al. [5] HOCC CVPR 2014 Traditional ~ ~ ~ ~ ~ ~ ~ ~ ~ 0.80 0.73 0.76 Yao et al. [21] arXiv 2016 Segmentation 0.432 0.27 0.333 ~ ~ ~ ~ ~ ~ ~ ~ ~ Hu et al. [27] WordSup ICCV 2017 0.452 0.309 0.368 ~ ~ ~ ~ ~ ~ ~ ~ ~ Lyu et al. [43] CVPR 2018 0.351 0.348 0.349 ~ ~ ~ 0.743 0.706 0.724 ~ ~ ~ Chu et al. [45] Border ECCV 2018 ~ ~ ~ 0.782 0.588 0.671 0.777 0.621 0.690 ~ ~ ~ Yang et al. [47] IncepText IJCAI 2018 ~ ~ ~ 0.785 0.569 0.660 ~ ~ ~ ~ ~ ~ Wang et al. [54] PSENet CVPR 2019 ~ ~ ~ ~ ~ ~ 0.7535 0.6918 0.7213 ~ ~ ~ Baek et al. [62] CRAFT CVPR 2019 ~ ~ ~ ~ ~ ~ 0.806 0.682 0.739 ~ ~ ~ He et al. [29] SSTD ICCV 2017 Regression 0.46 0.31 0.37 ~ ~ ~ ~ ~ ~ ~ ~ ~ Gupta et al. [15] FCRN CVPR 2016 ~ ~ ~ ~ ~ ~ 0.844 0.763 0.801 ~ ~ ~ Liao et al. [49] TextBoxes++ TIP 2018 0.61 0.57 0.59 ~ ~ ~ ~ ~ ~ ~ ~ ~ Ma et al. [51] RRPN TMM 2018 ~ ~ ~ ~ ~ ~ 0.7669 0.5794 0.6601 ~ ~ ~ Deng et al. [56] arXiv 2018 0.555 0.633 0.591 ~ ~ ~ ~ ~ ~ ~ ~ ~ Cai et al. [59] FFN ICIP 2018 0.43 0.35 0.39 ~ ~ ~ ~ ~ ~ ~ ~ ~ Xie et al. [79] DeRPN AAAI 2019 0.586 0.557 0.571 ~ ~ ~ ~ ~ ~ ~ ~ ~ He et al. [29] SSTD ICCV 2017 0.46 0.31 0.37 ~ ~ ~ ~ ~ ~ ~ ~ ~ Liao et al. [44] RRD CVPR 2018 ~ ~ ~ 0.591 0.775 0.670 ~ ~ ~ ~ ~ ~ Richardson et al. [87] IJCAI 2019 ~ ~ ~ ~ ~ ~ 0.729 0.618 0.669 ~ ~ ~ Wang et al. [88] SAST ACMM 2019 ~ ~ ~ ~ ~ ~ 0.7935 0.6653 0.7237 ~ ~ ~ Xie et al.[61] SPCNet AAAI 2019 Hybrid ~ ~ ~ ~ ~ ~ 0.806 0.686 0.741 ~ ~ ~ Liu et al. [64] PMTD arXiv 2019 ~ ~ ~ ~ ~ ~ 0.844 0.763 0.801 ~ ~ ~ Liu et al. [80] BDN IJCAI 2019 ~ ~ ~ ~ ~ ~ 0.791 0.698 0.742 ~ ~ ~ Zhang et al. [81] LOMO CVPR 2019 ~ ~ ~ 0.791 0.602 0.684 0.802 0.672 0.731 ~ ~ ~ Zhou et al. [24] EAST CVPR 2017 0.504 0.324 0.395 ~ ~ ~ ~ ~ ~ ~ ~ ~ Zhong et al. [53] AF-RPN arXiv 2018 ~ ~ ~ ~ ~ ~ 0.75 0.66 0.70 ~ ~ ~ Liao et al. [91] DB AAAI 2020 ~ ~ ~ ~ ~ ~ 0.831 0.679 0.747 ~ ~ ~ Xiao et al. [93] SDM ECCV 2020 ~ ~ ~ ~ ~ ~ 0.8679 0.7526 0.8061 ~ ~ ~ 2.2.3 Detection Results on Irregular-Text DatasetsIn this section, we only select those methods suitable for irregular text detection. Method Model Source Time Method Category Total-text [74] SCUT-CTW1500 [75] P R F P R F Baek et al. [62] CRAFT CVPR 2019 Segmentation 0.876 0.799 0.836 0.860 0.811 0.835 Long et al. [46] TextSnake ECCV 2018 0.827 0.745 0.784 0.679 0.853 0.756 Tian et al. [83] CVPR 2019 ~ ~ ~ 81.7 84.2 80.1 Wang et al. 
[54] PSENet CVPR 2019 0.840 0.779 0.809 0.848 0.797 0.822 Wang et al. [88] SAST ACMM 2019 0.8557 0.7549 0.802 0.8119 0.8171 0.8145 Wang et al. [90] PAN ICCV 2019 0.893 0.81 0.85 0.864 0.812 0.837 Zhu et al. [55] SLPR arXiv 2018 Regression ~ ~ ~ 0.801 0.701 0.748 Liu et al. [63] CTD+TLOC PR 2019 ~ ~ ~ 0.774 0.698 0.734 Wang et al. [82] CVPR 2019 ~ ~ ~ 80.1 80.2 80.1 Liu et al. [84] CVPR 2019 0.814 0.791 0.802 0.787 0.761 0.774 Tang et al.[89] Seglink++ PR 2019 0.829 0.809 0.815 0.828 0.798 0.813 Wang et al. [92] ContourNet CVPR 2020 0.869 0.839 0.854 0.837 0.841 0.839 Zhang et al. [81] LOMO CVPR 2019 Hybrid 0.876 0.793 0.833 0.857 0.765 0.808 Xie et al.[61] SPCNet AAAI 2019 0.83 0.83 0.83 ~ ~ ~ Xue et al.[85] MSR IJCAI 2019 0.852 0.73 0.768 0.838 0.778 0.807 Liao et al. [91] DB AAAI 2020 0.871 0.825 0.847 0.869 0.802 0.834 Xiao et al.[93] SDM ECCV 2020 0.9085 0.8603 0.8837 0.884 0.8442 0.8636 3. Survey[A] [TPAMI-2015] Ye Q, Doermann D. Text detection and recognition in imagery: A survey[J]. IEEE transactions on pattern analysis and machine intelligence, 2015, 37(7): 1480-1500. paper[B] [Frontiers-Comput. Sci-2016] Zhu Y, Yao C, Bai X. Scene text detection and recognition: Recent advances and future trends[J]. Frontiers of Computer Science, 2016, 10(1): 19-36. paper[C] [arXiv-2018] Long S, He X, Ya C. Scene Text Detection and Recognition: The Deep Learning Era[J]. arXiv preprint arXiv:1811.04256, 2018. paper4. EvaluationIf you are insterested in developing better scene text detection metrics, some references recommended here might be useful.[A] Wolf, Christian, and Jean-Michel Jolion. "Object count/area graphs for the evaluation of object detection and segmentation algorithms." International Journal of Document Analysis and Recognition (IJDAR) 8.4 (2006): 280-296. paper[B] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D.Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. paper[C] Calarasanu, Stefania, Jonathan Fabrizio, and Severine Dubuisson. "What is a good evaluation protocol for text localization systems? Concerns, arguments, comparisons and solutions." Image and Vision Computing 46 (2016): 1-17. paper[D] Shi, Baoguang, et al. "ICDAR2017 competition on reading chinese text in the wild (RCTW-17)." 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, 2017. paper[E] Nayef, N; Yin, F; Bizid, I; et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, 1454–1459. IEEE.paper[F] Dangla, Aliona, et al. "A first step toward a fair comparison of evaluation protocols for text detection algorithms." 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018. paper[G] He,Mengchao and Liu, Yuliang, et al. ICPR2018 Contest on Robust Reading for Multi-Type Web images. ICPR 2018. paper[H] Liu, Yuliang and Jin, Lianwen, et al. "Tightness-aware Evaluation Protocol for Scene Text Detection" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019. paper code5. OCR ServiceOCRAPIFreeTesseract OCR Engine×√Azure√√ABBYY√√OCR Space√√SODA PDF OCR√√Free Online OCR√√Online OCR√√Super Tools√√Online Chinese Recognition√√Calamari OCR×√Tencent OCR√×6. References and Code [1] Yao C, Bai X, Liu W, et al. 
Detecting texts of arbitrary orientations in natural images. 2012 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2012: 1083-1090. Paper[2] Yin X C, Yin X, Huang K, et al. Robust text detection in natural scene images. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2013, 36(5): 970-83. Paper[3] Li Y, Jia W, Shen C, et al. Characterness: An indicator of text in the wild. IEEE transactions on image processing, 2014, 23(4): 1666-1677. Paper[4] Huang W, Qiao Y, Tang X. Robust scene text detection with convolution neural network induced mser trees. European Conference on Computer Vision(ECCV), 2014: 497-511. Paper[5] Kang L, Li Y, Doermann D. Orientation robust text line detection in natural images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 4034-4041. Paper[6] Sun L, Huo Q, Jia W, et al. A robust approach for text detection from natural scene images. Pattern Recognition, 2015, 48(9): 2906-2920. Paper[7] Yin X C, Pei W Y, Zhang J, et al. Multi-orientation scene text detection with adaptive clustering. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015 (9): 1930-1937. Paper[8] Liang G, Shivakumara P, Lu T, et al. Multi-spectral fusion based approach for arbitrarily oriented scene text detection in video images. IEEE Transactions on Image Processing, 2015, 24(11): 4488-4501. Paper[9] Wu L, Shivakumara P, Lu T, et al. A New Technique for Multi-Oriented Scene Text Line Detection and Tracking in Video. IEEE Trans. Multimedia, 2015, 17(8): 1137-1152. Paper[10] Zheng Z, Wei S, et al. Symmetry-based text line detection in natural scenes. IEEE Conference on Computer Vision & Pattern Recognition(CVPR), 2015. Paper[11] Tian S, Pan Y, Huang C, et al. Text flow: A unified text detection system in natural scene images. Proceedings of the IEEE international conference on computer vision(ICCV). 2015: 4651-4659. Paper[12] Buta M, et al. FASText: Efficient unconstrained scene text detector. 2015 IEEE International Conference on Computer Vision (ICCV). 2015: 1206-1214. Paper[13] Tian Z, Huang W, He T, et al. Detecting text in natural image with connectionist text proposal network. European conference on computer vision(ECCV), 2016: 56-72. Paper Code[14] Zhang Z, Zhang C, Shen W, et al. Multi-oriented text detection with fully convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR). 2016: 4159-4167. Paper[15] Gupta A, Vedaldi A, Zisserman A. Synthetic data for text localisation in natural images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR). 2016: 2315-2324. Paper Code[16] S. Zhu and R. Zanibbi, A Text Detection System for Natural Scenes with Convolutional Feature Learning and Cascaded Classification, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 625-632. Paper[17] Tian S, Pei W Y, Zuo Z Y, et al. Scene Text Detection in Video by Learning Locally and Globally. IJCAI. 2016: 2647-2653. Paper[18] He T, Huang W, Qiao Y, et al. Text-attentional convolutional neural network for scene text detection. IEEE transactions on image processing, 2016, 25(6): 2529-2541. Paper[19] He, Dafang and Yang, Xiao and Huang, Wenyi and Zhou, Zihan and Kifer, Daniel and Giles, C Lee. Aggregating local context for accurate scene text detection. ACCV, 2016. Paper[20] Zhong Z, Jin L, Zhang S, et al. Deeptext: A unified framework for text proposal generation and text detection in natural images. arXiv preprint arXiv:1605.07314, 2016. 
Paper[21] Yao C, Bai X, Sang N, et al. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016. Paper[22] Liao M, Shi B, Bai X, et al. TextBoxes: A Fast Text Detector with a Single Deep Neural Network. AAAI. 2017: 4161-4167. Paper Code[23] Shi B, Bai X, Belongie S. Detecting Oriented Text in Natural Images by Linking Segments. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017: 3482-3490. Paper Code[24] Zhou X, Yao C, Wen H, et al. EAST: an efficient and accurate scene text detector. CVPR, 2017: 2642-2651. Paper Code[25] Liu Y, Jin L. Deep matching prior network: Toward tighter multi-oriented text detection. CVPR, 2017: 3454-3461. Paper[26] He W, Zhang X Y, Yin F, et al. Deep Direct Regression for Multi-Oriented Scene Text Detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2017: 745-753. Paper[27] Hu H, Zhang C, Luo Y, et al. Wordsup: Exploiting word annotations for character based text detection. ICCV, 2017. Paper[28] Wu Y, Natarajan P. Self-organized text detection with minimal post-processing via border learning. ICCV, 2017. Paper[29] He P, Huang W, He T, et al. Single shot text detector with regional attention. The IEEE International Conference on Computer Vision (ICCV). 2017, 6(7). Paper Code[30] Tian S, Lu S, Li C. Wetext: Scene text detection under weak supervision. ICCV, 2017. Paper[31] Zhu, Xiangyu and Jiang, Yingying et al. Deep Residual Text Detection Network for Scene Text. ICDAR, 2017. Paper[32] Tang Y , Wu X. Scene Text Detection and Segmentation Based on Cascaded Convolution Neural Networks. IEEE Transactions on Image Processing, 2017, 26(3):1509-1520. Paper[33] Yang C, Yin X C, Pei W Y, et al. Tracking Based Multi-Orientation Scene Text Detection: A Unified Framework with Dynamic Programming. IEEE Transactions on Image Processing, 2017. Paper[34] X. Ren, Y. Zhou, J. He, K. Chen, X. Yang and J. Sun, A Convolutional Neural Network-Based Chinese Text Detection Algorithm via Text Structure Modeling. in IEEE Transactions on Multimedia, vol. 19, no. 3, pp. 506-518, March 2017. Paper[35] Dai Y, Huang Z, Gao Y, et al. Fused text segmentation networks for multi-oriented scene text detection. arXiv preprint arXiv:1709.03272, 2017. Paper[36] Jiang Y, Zhu X, Wang X, et al. R2CNN: rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579, 2017. Paper[37] Xing D, Li Z, Chen X, et al. ArbiText: Arbitrary-Oriented Text Detection in Unconstrained Scene. arXiv preprint arXiv:1711.11249, 2017. Paper[38] C. Wang, F. Yin and C. Liu, Scene Text Detection with Novel Superpixel Based Character Candidate Extraction. in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 929-934. Paper[39] Sheng Zhang, Yuliang Liu, Lianwen Jin et al. Feature Enhancement Network: A Refined Scene Text Detector. In AAAI 2018. Paper[40] Dan Deng et al. PixelLink: Detecting Scene Text via Instance Segmentation. In AAAI 2018. Paper Code[41] Fangfang Wang, Liming Zhao, Xi L et al. Geometry-Aware Scene Text Detection with Instance Transformation Network. In CVPR 2018. Paper[42] Zichuan Liu, Guosheng Lin, Sheng Yang et al. Learning Markov Clustering Networks for Scene Text Detection. In CVPR 2018. Paper[43] Pengyuan Lyu, Cong Yao, Wenhao Wu et al. Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation. In CVPR 2018. Paper[44] Minghui L, Zhen Z, Baoguang S. 
Rotation-Sensitive Regression for Oriented Scene Text Detection. In CVPR 2018. Paper[45] Chuhui Xue et al. Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping. In ECCV 2018. Paper[46] Long, Shangbang and Ruan, Jiaqiang, et al. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. In ECCV, 2018. Paper[47] Qiangpeng Yang, Mengli Cheng et al. IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection. In IJCAI 2018. Paper[48] Xiaoyu Yue et al. Boosting up Scene Text Detectors with Guided CNN. In BMVC 2018. Paper[49] Liao M, Shi B , Bai X. TextBoxes++: A Single-Shot Oriented Scene Text Detector. IEEE Transactions on Image Processing, 2018, 27(8):3676-3690. Paper Code[50] W. He, X. Zhang, F. Yin and C. Liu, Multi-Oriented and Multi-Lingual Scene Text Detection With Direct Regression, in IEEE Transactions on Image Processing, vol. 27, no. 11, pp.5406-5419, 2018. Paper[51] Ma J, Shao W, Ye H, et al. Arbitrary-oriented scene text detection via rotation proposals.in IEEE Transactions on Multimedia, 2018. Paper Code[52] Youbao Tang and Xiangqian Wu. Scene Text Detection Using Superpixel-Based Stroke Feature Transform and Deep Learning Based Region Classification. In TMM, 2018. Paper[53] Zhuoyao Zhong, Lei Sun and Qiang Huo. An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches. arXiv preprint arXiv:1804.09003. 2018. Paper[54] Wenhai W, Enze X, et al. Shape Robust Text Detection with Progressive Scale Expansion Network. In CVPR 2019. Paper Code[55] Zhu Y, Du J. Sliding Line Point Regression for Shape Robust Scene Text Detection. arXiv preprint arXiv:1801.09969, 2018. Paper[56] Linjie D, Yanxiang Gong, et al. Detecting Multi-Oriented Text with Corner-based Region Proposals. arXiv preprint arXiv: 1804.02690, 2018. Paper Code[57] Yongchao Xu, Yukang Wang, Wei Zhou, et al. TextField: Learning A Deep Direction Field for Irregular Scene Text Detection. arXiv preprint arXiv: 1812.01393, 2018. Paper[58] Xiaowei Tian, Dao Wu, Rui Wang, Xiaochun Cao. Focal Text: an Accurate Text Detection with Focal Loss. In ICIP 2018. Paper[59] Chenqin C, Pin L, Bing S. Feature Fusion Network for Scene Text Detection. In ICIP, 2018. Paper[60] Sabyasachi Mohanty et al. Recurrent Global Convolutional Network for Scene Text Detection. In ICIP 2018. Paper[61] Enze Xie, et al. Scene Text Detection with Supervised Pyramid Context Network. In AAAI 2019. Paper[62] Youngmin Baek, Bado Lee, et al. Character Region Awareness for Text Detection. In CVPR 2019. Paper[63] Yuliang L, Lianwen J, Shuaitao Z, et al. Curved Scene Text Detection via Transverse and Longitudinal Sequence Connection. Pattern Recognition, 2019. Paper Code[64] Jingchao Liu, Xuebo Liu, et al, Pyramid Mask Text Detector. arXiv preprint arXiv:1903.11800, 2019. Paper Code[79] Lele Xie, Yuliang Liu, Lianwen Jin, Zecheng Xie, DeRPN: Taking a further step toward more general object detection. In AAAI, 2019. Paper Code[80] Yuliang Liu, Lianwen Jin, et al, Omnidirectional Scene Text Detction with Sequential-free Box Discretization. In IJCAI, 2019.Paper Code[81] Chengquan Zhang, Borong Liang, et al, Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes. In CVPR, 2019.Paper[82] Xiaobing Wang, Yingying Jiang, et al, Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation. In CVPR, 2019. Paper[83] Zhuotao Tian, Michelle Shu, et al, Learning Shape-Aware Embedding for Scene Text Detection. In CVPR, 2019. 
Paper[84] Zichuan Liu, Guosheng Lin, et al, Towards Robust Curve Text Detection with Conditional Spatial Expansion. In CVPR, 2019. Paper[85] Xue C, Lu S, Zhang W. MSR: multi-scale shape regression for scene text detection. In IJCAI, 2019. Paper[86] Wang Y, Xie H, Fu Z, et al. DSRN: a deep scale relationship network for scene text detection. In IJCAI, 2019: 947-953. Paper[87] Elad Richardson, et al, It's All About The Scale -- Efficient Text Detection Using Adaptive Scaling. In WACV, 2020. Paper[88] Pengfei Wang, et al, A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning. In ACMM, 2019. Paper[89] Jun Tang, et al, SegLink ++: Detecting Dense and Arbitrary-shaped Scene Text by Instance-aware Component Grouping. In PR, 2019. Paper[90] Wenhai Wang, et al, Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network. In ICCV, 2019. Paper[91] Minghui Liao, et al, Real-time Scene Text Detection with Differentiable Binarization. In AAAI, 2020. PaperCode[92] Wang, Yuxin, et al. ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection. CVPR. 2020. PaperCode[93] Xiao, et al, Sequential Deformation for Accurate Scene Text Detection. In ECCV, 2020. Paper DatasetsUSTB-SV1K[65]:Xu-Cheng Yin, Xuwang Yin, Kaizhu Huang, and Hong-Wei Hao, Robust text detection in natural scene images, IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), priprint, 2013. PaperSVT[66]: Wang,Kai, and S. Belongie. Word Spotting in the Wild. European Conference on Computer Vision(ECCV), 2010: 591-604. PaperICDAR2005[67]: Lucas, S: ICDAR 2005 text locating competition results. In: ICDAR ,2005. PaperICDAR2011[68]: Shahab, A, Shafait, F, Dengel, A: ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In: ICDAR, 2011. PaperICDAR2013[69]:D. Karatzas, F. Shafait, S. Uchida, et al. ICDAR 2013 robust reading competition. In ICDAR, 2013. PaperICDAR2015[70]:D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D.Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. PaperMSRA-TD500[71]:C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, Detecting texts of arbitrary orientations in natural images. in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp.1083–1090.PaperCOCO-Text[72]:Veit A, Matera T, Neumann L, et al. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140, 2016. PaperRCTW-17[73]:Shi B, Yao C, Liao M, et al. ICDAR2017 competition on reading chinese text in the wild (RCTW-17). Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 1429-1434. PaperTotal-Text[74]:Chee C K, Chan C S. Total-text: A comprehensive dataset for scene text detection and recognition.Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on. IEEE, 2017, 1: 935-942.PaperSCUT-CTW1500[75]:Yuliang L, Lianwen J, Shuaitao Z, et al. Curved Scene Text Detection via Transverse and Longitudinal Sequence Connection. Pattern Recognition, 2019.PaperMLT 2017[76]: Nayef, N; Yin, F; Bizid, I; et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-rrc-mlt. In Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, volume 1, 1454–1459. IEEE. 
PaperOSTD[77]: Chucai Yi and YingLi Tian, Text string detection from natural scenes by structure-based partition and grouping, In IEEE Transactions on Image Processing, vol. 20, no. 9, pp. 2594–2605, 2011. PaperCTW[78]: Yuan T L, Zhu Z, Xu K, et al. Chinese Text in the Wild. arXiv preprint arXiv:1803.00085, 2018. PaperIf you find any problems in our resources, or any good papers/codes we have missed, please inform us at liuchongyu1996@gmail.com. Thank you for your contribution.CopyrightCopyright © 2019 SCUT-DLVC. All Rights Reserved.
2021年03月30日
709 阅读
0 评论
0 点赞
2021-02-19
IOU计算
IOU计算&判断两个矩形相交以及求出相交的区域

求解图示一

$$ IOU=area/(area1+area2-area) $$

求解图示二

Analysis: determining whether two rectangles intersect and computing the intersection region

Problem: given two rectangles A and B, where A has top-left corner (Xa1,Ya1) and bottom-right corner (Xa2,Ya2), and B has top-left corner (Xb1,Yb1) and bottom-right corner (Xb2,Yb2).

1. Design an algorithm that decides whether the two rectangles intersect (i.e. have overlapping area).

The obvious approach is to test whether any of the four vertices of one rectangle lies inside the other. It is the simplest idea, but it is inefficient and, more importantly, incorrect. As the figures above show, rectangle overlap can be split into three cases (other partitions are possible); in the third case the two rectangles intersect even though neither rectangle has a vertex inside the other, so the vertex test misses it.

A better idea is to compare the horizontal and vertical distances between the two rectangle centers; the rectangles intersect exactly when both distances are small enough.

Width of A: Wa = Xa2-Xa1, height of A: Ha = Ya2-Ya1
Width of B: Wb = Xb2-Xb1, height of B: Hb = Yb2-Yb1
Center of A: (Xa3,Ya3) = ( (Xa2+Xa1)/2 , (Ya2+Ya1)/2 )
Center of B: (Xb3,Yb3) = ( (Xb2+Xb1)/2 , (Yb2+Yb1)/2 )

The two rectangles intersect if and only if both of the following hold:

1) | Xb3-Xa3 | <= Wa/2 + Wb/2
2) | Yb3-Ya3 | <= Ha/2 + Hb/2

Equivalently:

| Xb2+Xb1-Xa2-Xa1 | <= (Xa2-Xa1) + (Xb2-Xb1)
| Yb2+Yb1-Ya2-Ya1 | <= (Ya2-Ya1) + (Yb2-Yb1)

2. If the two rectangles intersect, compute the intersection rectangle.

Xc1 = max(Xa1,Xb1)
Yc1 = max(Ya1,Yb1)
Xc2 = min(Xa2,Xb2)
Yc2 = min(Ya2,Yb2)

These four values define the intersection rectangle. Note that (Xc1,Yc1) and (Xc2,Yc2) are well defined even without assuming the rectangles intersect, which gives an alternative intersection test: the rectangles intersect if and only if

3) Xc1 <= Xc2
4) Yc1 <= Yc2

that is:

max(Xa1,Xb1) <= min(Xa2,Xb2)
max(Ya1,Yb1) <= min(Ya2,Yb2)

Implementation

```python
"""
IoU computation
+ input
  + box1: [box1_x1, box1_y1, box1_x2, box1_y2]
  + box2: [box2_x1, box2_y1, box2_x2, box2_y2]
+ output
  + IoU value
"""
def cal_iou(box1, box2):
    # Check whether the two boxes intersect at all (center-distance test)
    if abs(box2[2]+box2[0]-box1[2]-box1[0]) > box2[2]-box2[0]+box1[2]-box1[0]:
        return 0
    if abs(box2[3]+box2[1]-box1[3]-box1[1]) > box2[3]-box2[1]+box1[3]-box1[1]:
        return 0
    # Top-left and bottom-right corners of the intersection rectangle
    box_intersect_x1 = max(box1[0], box2[0])
    box_intersect_y1 = max(box1[1], box2[1])
    box_intersect_x2 = min(box1[2], box2[2])
    box_intersect_y2 = min(box1[3], box2[3])
    # Intersection area
    area_intersect = (box_intersect_y2 - box_intersect_y1) * (box_intersect_x2 - box_intersect_x1)
    # Areas of box1 and box2
    area_box1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area_box2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    # Union area
    area_union = area_box1 + area_box2 - area_intersect
    # IoU (intersection over union)
    iou = area_intersect / area_union
    return iou
```

Verification

```python
box1 = [0, 0, 500, 500]
box2 = [250, 250, 750, 750]
iou = cal_iou(box1, box2)
print(iou)
```

0.14285714285714285

Visual check

```python
import matplotlib.pyplot as plt

fig1 = plt.figure()
ax1 = fig1.add_subplot(111, aspect='equal')
ax1.add_patch(plt.Rectangle((0, 0), 500, 500, color='b', alpha=0.5))
ax1.add_patch(plt.Rectangle((250, 250), 500, 500, color='b', alpha=0.5))
ax1.add_patch(plt.Rectangle((250, 250), 250, 250, color='r', alpha=0.5))
plt.xlim(0, 750)
plt.ylim(0, 750)
plt.show()
```

From the figure:
area_box1 = 250000
area_box2 = 250000
area_intersect = 62500
area_union = 437500
Therefore: iou = 62500 / 437500 = 0.14285714285714285

A vectorized NumPy version of the same computation is sketched after the references.

参考资料
yolo 算法中的IOU算法程序与原理解读: https://blog.csdn.net/caokaifa/article/details/80724842
IOU的计算: https://www.cnblogs.com/darkknightzh/p/9043395.html
判断两个矩形相交以及求出相交的区域: https://www.cnblogs.com/zhoug2020/p/7451340.html
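As a small extension of the post above (not part of the original), here is a minimal vectorized sketch of the same max/min IoU formula using NumPy, assuming boxes arrive as (N, 4) arrays in (x1, y1, x2, y2) format; the function name pairwise_iou is illustrative only.

```python
import numpy as np

def pairwise_iou(boxes1, boxes2):
    """Return the (N, M) IoU matrix between two sets of boxes in (x1, y1, x2, y2) format."""
    # Intersection corners for every pair: max of top-lefts, min of bottom-rights
    x1 = np.maximum(boxes1[:, None, 0], boxes2[None, :, 0])
    y1 = np.maximum(boxes1[:, None, 1], boxes2[None, :, 1])
    x2 = np.minimum(boxes1[:, None, 2], boxes2[None, :, 2])
    y2 = np.minimum(boxes1[:, None, 3], boxes2[None, :, 3])
    # Clamp at 0 so non-overlapping pairs contribute zero intersection area
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    union = area1[:, None] + area2[None, :] - inter
    return inter / union

# Reproduces the example above: IoU of [0,0,500,500] and [250,250,750,750] is ~0.1429
print(pairwise_iou(np.array([[0, 0, 500, 500]]), np.array([[250, 250, 750, 750]])))
```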
2021年02月19日
1,046 阅读
0 评论
0 点赞
2021-02-18
深度学习常用数据集[转载]
深度学习常用数据集[转载]1、迁移学习(传统神经网络)1、猫狗数据集:链接:https://pan.baidu.com/s/1TqmdkJBY49ftg19tRK2Ngg 提取码: htxf2、目标检测1、VOC2007+2012训练集链接: https://pan.baidu.com/s/1u4YUyWJqs5bD38A6Hvs-og 提取码: xzde3、实例分割1、shape数据集(圆形、三角形、正方形):链接: https://pan.baidu.com/s/14dBd1Lbjw0FCnwKryf9taQ 提取码: 94574、语义分割(旧版)1、斑马线数据集:链接:https://pan.baidu.com/s/1uzwqLaCXcWe06xEXk1ROWw 提取码:pp6w2、VOC数据集:链接:https://pan.baidu.com/s/1Urh9W7XPNMF8yR67SDjAAQ 提取码: cvy25、语义分割(新版)1、VOC拓展数据集及其验证集链接: https://pan.baidu.com/s/1BrR7AUM1XJvPWjKMIy2uEw 提取码: vszf6、人脸识别人脸识别数据集包含在对应权值的百度网盘里。1、retinaface链接: https://pan.baidu.com/s/1t7-BNsZzHj2isCekc_PVtw 提取码: 2qrs2、retinaface-pytorch链接: https://pan.baidu.com/s/1q2E6uWs0R5GU_PFs9_vglg 提取码: z7es参考资料1.神经网络学习小记录44——训练资源汇总贴:https://blog.csdn.net/weixin_44791964/article/details/105123842
2021年02月18日
887 阅读
0 评论
0 点赞
2021-02-18
YOLOv3学习:(三)模型输出解码
YOLOv3学习:(三)模型输出解码

YOLOv3 model output

Decoding the model output - theory (using the 13*13 branch as the example)

Decoding target: the raw output shape is [batch_size, 255, 13, 13], where 255 = 3 (number of anchors) * (x_offset + y_offset + w_scale + h_scale + objectness + class confidences). The model divides the image into a 13*13 grid; each cell predicts 3 boxes from its anchors, each predicted box uses the top-left corner of its cell as the reference point and the anchor's w and h as the size reference:

$$ pred\_w = anchor\_w \times e^{w\_scale} $$

$$ pred\_h = anchor\_h \times e^{h\_scale} $$

The goal of decoding is to convert the x_offset + y_offset + w_scale + h_scale part of the output so that every box is expressed relative to the top-left corner (0,0) of the whole image, with its w and h corrected by the corresponding anchor. This yields 3*13*13 predicted boxes, i.e. a decoded output of shape [batch_size, 3*13*13, 85], where 85 = x + y + w + h + objectness + 80 class confidences.

Decoding the model output - code

```python
import torch
import torch.nn as nn

# YOLOv3 hyper-parameters
from easydict import EasyDict
super_param = \
{
    "anchors": [[[116, 90], [156, 198], [373, 326]],
                [[30, 61], [62, 45], [59, 119]],
                [[10, 13], [16, 30], [33, 23]]],
    "num_classes": 80,
    "img_size": (416, 416),
}
super_param = EasyDict(super_param)
print(super_param.img_size)

# YOLOv3 output decoder
"""
Interpretation of the raw output, taking [batch_size, 255, 13, 13] as an example:
255 = 3 (number of anchors) * (x_offset + y_offset + w + h + objectness + class confidences)
The image is divided into a 13*13 grid and each cell predicts 3 boxes.
The center of each box is (cell top-left x + x_offset, cell top-left y + y_offset);
its width and height are torch.exp(w.data) * anchor_w and torch.exp(h.data) * anchor_h.

Interpretation of the decoded output:
For this example the output shape is [batch_size, 3*13*13, 85], i.e. 3*13*13 boxes in total,
each described by (x + y + w + h + objectness + 80 class confidences) = 85 values.
"""
class DecodeBox(nn.Module):
    def __init__(self, anchors=super_param.anchors[0],
                 num_classes=super_param.num_classes,
                 img_size=super_param.img_size):
        super(DecodeBox, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.img_size = img_size

    def forward(self, input):
        # Shape of this YOLOv3 output branch
        batch_size, input_height, input_width = input.size(0), input.size(2), input.size(3)

        # Stride of this branch
        stride_h, stride_w = self.img_size[1] / input_height, self.img_size[0] / input_width

        # Rescale the anchors onto the feature map,
        # e.g. [116, 90], [156, 198], [373, 326] --> [116/32, 90/32], [156/32, 198/32], [373/32, 326/32]
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h)
                          for anchor_width, anchor_height in self.anchors]

        # Reshape the prediction:
        # [batch_size, 255, 13, 13] --> [batch_size, num_anchors, input_height, input_width, 5 + num_classes]
        # The last dimension (85) holds x_offset, y_offset, w and h, objectness and the class scores.
        prediction = input.view(batch_size, self.num_anchors, 5 + self.num_classes,
                                input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        # Adjustments of the anchor centers
        x_offset, y_offset = torch.sigmoid(prediction[..., 0]), torch.sigmoid(prediction[..., 1])
        # Adjustments of the anchor width and height
        w, h = prediction[..., 2], prediction[..., 3]  # Width, Height
        # Objectness confidence
        conf = torch.sigmoid(prediction[..., 4])
        # Class confidences
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        FloatTensor = torch.cuda.FloatTensor if x_offset.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x_offset.is_cuda else torch.LongTensor

        # Build the grid of cell top-left corners (the anchor reference points)
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_width, 1).repeat(
            batch_size * self.num_anchors, 1, 1).view(x_offset.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_height, 1).t().repeat(
            batch_size * self.num_anchors, 1, 1).view(y_offset.shape).type(FloatTensor)

        # Broadcast the anchor widths and heights
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        # Apply the adjustments to obtain box centers and sizes on the feature map
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x_offset.data + grid_x
        pred_boxes[..., 1] = y_offset.data + grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

        # Scale the boxes back to the 416x416 input image
        _scale = torch.Tensor([stride_w, stride_h] * 2).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) * _scale,
                            conf.view(batch_size, -1, 1),
                            pred_cls.view(batch_size, -1, self.num_classes)), -1)
        return output.data
```

Test

```python
fake_out1 = torch.zeros((1, 255, 13, 13))
print(fake_out1.shape)
decoder = DecodeBox()
out1_decode = decoder(fake_out1)
print(out1_decode.shape)
```

torch.Size([1, 255, 13, 13])
torch.Size([1, 507, 85])

A sketch of decoding all three scales with this class appears after the references.

参考资料
Pytorch 搭建自己的YOLO3目标检测平台(Bubbliiiing 深度学习 教程):https://www.bilibili.com/video/BV1Hp4y1y788?p=11&spm_id_from=pageDriver
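The post only decodes the 13*13 head; in a full YOLOv3 pipeline the 26*26 and 52*52 heads are decoded the same way and the results concatenated. Below is a minimal sketch under that assumption, reusing the DecodeBox class and super_param dict defined above; the fake tensors and variable names are illustrative, not from the original post.

```python
import torch

# One decoder per scale, each with the anchor group that matches its stride
decoders = [DecodeBox(anchors=super_param.anchors[i]) for i in range(3)]

# Fake network outputs for the three heads of a 416x416 input: 13x13, 26x26 and 52x52
fake_outputs = [torch.zeros((1, 255, s, s)) for s in (13, 26, 52)]

# Decode every head and stack the boxes along dim 1:
# (1, 3*13*13 + 3*26*26 + 3*52*52, 85) = (1, 10647, 85)
all_boxes = torch.cat([decoder(out) for decoder, out in zip(decoders, fake_outputs)], dim=1)
print(all_boxes.shape)  # torch.Size([1, 10647, 85])
```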
2021年02月18日
776 阅读
0 评论
0 点赞
2021-02-07
【YOLOv3论文翻译】:YOLOv3:增量式的改进
【YOLOv3论文翻译】:YOLOv3:增量式的改进论文原文:YOLOv3: An Incremental Improvement摘要我们对YOLO进行了一系列更新!它包含一堆小设计,可以使系统的性能得到更新。我们也训练了一个新的、比较大的神经网络。虽然比上一版更大一些,但是精度也提高了。不用担心,它的速度依然很快。YOLOv3在320×320输入图像上运行时只需22ms,并能达到28.2mAP,其精度和SSD相当,但速度要快上3倍。使用之前0.5 IOU mAP的检测指标,YOLOv3的效果是相当不错。YOLOv3使用Titan X GPU,其耗时51ms检测精度达到57.9 AP50,与RetinaNet相比,其精度只有57.5 AP50,但却耗时198ms,相同性能的条件下YOLOv3速度比RetinaNet快3.8倍。与之前一样,所有代码在网址:https://pjreddie.com/yolo/。1. 引言有时候,一年内你主要都在玩手机,你知道吗?今年我没有做很多研究。我在Twitter上花了很多时间。研究了一下GAN。去年我留下了一点点的精力[12] [1];我设法对YOLO进行了一些改进。但是,实话实说,除了仅仅一些小的改变使得它变得更好之外,没有什么超级有趣的事情。我也稍微帮助了其他人的一些研究。其实,这就是今天我要讲的内容。我们有一篇论文快截稿了,并且我们还缺一篇关于YOLO更新内容的文章作为引用,但是我们没有引用来源。因此准备写一篇技术报告!技术报告的好处是他们不需要引言,你们都知道我为什么写这个。所以引言的结尾可以作为阅读本文剩余内容的一个指引。首先我们会告诉你YOLOv3的方案。其次我们会告诉你我们是如何实现的。我们也会告诉你我们尝试过但并不奏效的一些事情。最后我们将探讨这些的意义。2. 方案这节主要介绍YOLOv3的方案:我们主要从其他人的研究工作里获得了一些好思路、好想法。我们还训练了一个新的、比其他网络更好的分类网络。为了方便您理解,我们将带您从头到尾贯穿整个模型系统。![图1.这个图来自Focal Loss论文[9]。YOLOv3的运行速度明显快于其他具有可比性能的检测方法。检测时间基于M40或Titan X(这两个基本上是相同的GPU)。](/usr/uploads/auto_save_image/f1b7a2d2167837f377fafa85701fb668.png)2.1 边界框预测按照YOLO9000,我们的系统也使用维度聚类得到的anchor框来预测边界框[15]。网络为每个边界框预测的4个坐标:tx、ty、tw、th。假设格子距离图像的左上角偏移量为(cx,cy),先验边界框宽度和高度分别为:pw、ph,则预测结果对应为:训练时我们使用误差平方和损失。如果某个预测坐标的真值是$\hat{t^*}$,那么梯度就是真值(从真值框计算而得)和预测值之差:$\hat{t^*}-t^*$。真实值可以很容易地通过变换上述公式得到。YOLOv3使用逻辑回归预测每个边界框是目标的分数。如果真实标签框与某个边界框重叠的面积比与其他任何边界框都大,那么这个先验边界框得分为1。按照[17]的做法,如果先验边界框不是最好的,但是确实与目标的真实标签框重叠的面积大于阈值,我们就会忽略这个预测。我们使用阈值为0.5。与[17]不同,我们的系统只为每个真实目标分配一个边界框。如果先验边界框未分配到真实目标,则不会产生坐标或类别预测的损失,只会产生是否是目标的损失。![图2.维度先验和位置预测的边界框。我们使用聚类质心的偏移量预测框的宽度和高度。我们使用sigmoid函数预测相对于滤波器应用位置的框的中心坐标。这个图公然引用于自己的论文[15]。](/usr/uploads/auto_save_image/cf8bd0eecaa2aefdb8f1e86fbe6a4961.png)2.2 分类预测每个边界框都会使用多标签分类来预测框中可能包含的类。我们不用softmax,而是用单独的逻辑分类器,因为我们发现前者对于提升网络性能没什么作用。在训练过程中,我们用binary cross-entropy(二元交叉熵)损失来预测类别。当我们转向更复杂的领域,例如Open Images Dataset [7],上面的这种改变将变得很有用。这个数据集中有许多重叠的标签(例如女性和人)。使用softmax会强加这样一个假设——即每个框恰好只有一个类别,但通常情况并非如此。多标签的方式可以更好地模拟数据。2.3 跨尺度预测YOLOv3预测3种不同尺度的框。我们的系统使用类似特征金字塔网络的相似概念,并从这些尺度中提取特征[8]。在我们的基础特征提取器上添加了几个卷积层。其中最后一个卷积层预测了一个编码边界框、是否是目标和类别预测结果的三维张量。在我们的COCO实验[8]中,我们为每个尺度预测3个框,所以对于每个边界框的4个偏移量、1个目标预测和80个类别预测,最终的张量大小为N×N×[3×(4+1+80)]。接下来,我们从前面的2个层中取得特征图,并将其上采样2倍。我们还从网络中的较前的层中获取特征图,并将其与我们的上采样特征图进行拼接。这种方法使我们能够从上采样的特征图中获得更有意义的语义信息,同时可以从更前的层中获取更细粒度的信息。然后,我们添加几个卷积层来处理这个特征映射组合,并最终预测出一个相似的、大小是原先两倍的张量。我们再次使用相同的设计来预测最终尺寸的边界框。因此,第三个尺寸的预测将既能从所有先前的计算,又能从网络前面的层中的细粒度的特征中获益。我们仍然使用k-means聚类来确定我们的先验边界框。我们只是选择了9个类和3个尺度,然后在所有尺度上将聚类均匀地分开。在COCO数据集上,9个聚类分别为(10×13)、(16×30)、(33×23)、(30×61)、(62×45)、(59×119)、(116 × 90)、(156 × 198)、(373 × 326)。2.4 特征提取器我们使用一个新的网络来进行特征提取。我们的新网络融合了YOLOv2、Darknet-19和新发明的残差网络的思想。我们的网络使用连续的3×3和1×1卷积层,而且现在多了一些快捷连接(shortcut connetction),而且规模更大。它有53个卷积层,所以我们称之为... 
Darknet-53!这个新网络比Darknet-19功能强大很多,并且仍然比ResNet-101或ResNet-152更高效。以下是一些ImageNet上的结果:每个网络都使用相同的设置进行训练,并在256×256的图像上进行单精度测试。运行时间是在Titan X上用256×256图像进行测量的。因此,Darknet-53可与最先进的分类器相媲美,但浮点运算更少,速度更快。Darknet-53比ResNet-101更好,且速度快1.5倍。Darknet-53与ResNet-152相比性能差不多,但速度快比其2倍。Darknet-53也实现了最高的每秒浮点运算测量。这意味着网络结构可以更好地利用GPU,使它的评测更加高效、更快。这主要是因为ResNets的层数太多,效率不高。2.5 训练我们仍然在完整的图像上进行训练,没有使用难负样本挖掘(hard negative mining)或其他类似的方法。我们使用多尺度训练,使用大量的数据增强、批量标准化等标准的操作。我们使用Darknet神经网络框架进行训练和测试[12]。3 我们是如何做的YOLOv3表现非常好!请看表3。就COCO的平均AP指标而言,它与SSD类的模型相当,但速度提高了3倍。尽管如此,它仍然在这个指标上比像RetinaNet这样的其他模型差些。![表3.我很认真地从[9]中“窃取”了所有这些表格,他们花了很长时间才从头开始制作。好的,YOLOv3没问题。请记住,RetinaNet处理图像的时间要长3.8倍。YOLOv3比SSD变体要好得多,可与AP50指标上的最新模型相媲美。](/usr/uploads/auto_save_image/b315b290b4c82ed2f24a0538afbbfbd4.png)然而,当我们使用“旧的”检测指标——在IOU=0.5的mAP(或图表中的AP50)时,YOLOv3非常强大。其性能几乎与RetinaNet相当,并且远强于SSD。这表明YOLOv3是一个非常强大的检测器,擅长为目标生成恰当的框。然而,随着IOU阈值增加,性能显著下降,这表明YOLOv3预测的边界框与目标不能完美对齐。之前的YOLO不擅长检测小物体。但是,现在我们看到了这种趋势的逆转。随着新的多尺度预测,我们看到YOLOv3具有相对较高的APS性能。但是,它在中等和更大尺寸的物体上的表现相对较差。需要更多的研究来深入了解这一点。当我们在AP50指标上绘制准确度和速度关系图时(见图3),我们看到YOLOv3与其他检测系统相比具有显着的优势。也就是说,速度更快、性能更好。![图3. 再次改编自[9],这次显示的是在0.5 IOU指标上速度/准确度的折衷。你可以说YOLOv3是好的,因为它非常高并且在左边很远。 你能引用你自己的论文吗?猜猜谁会去尝试,这个人→[16]。哦,我忘了,我们还修复了YOLOv2中的数据加载bug,该bug的修复提升了2 mAP。将YOLOv3结果潜入这幅图中而没有改变原始布局。](/usr/uploads/auto_save_image/d381f8d42ff1a78d2af931002d8d9127.png)4 失败的尝试我们在研究YOLOv3时尝试了很多东西,但很多都不起作用。下面是我们要记住的血的教训。Anchor框的x、y偏移预测。我们尝试使用常规的Anchor框预测机制,比如利用线性激活将坐标x、y的偏移程度预测为边界框宽度或高度的倍数。但我们发现这种方法降低了模型的稳定性,并且效果不佳。用线性激活代替逻辑激活函数进行x、y预测。我们尝试使用线性激活代替逻辑激活来直接预测x、y偏移。这个改变导致MAP下降了几个点。focal loss。我们尝试使用focal loss。它使得mAP下降2个点。YOLOv3可能已经对focal loss试图解决的问题具有鲁棒性,因为它具有单独的目标预测和条件类别预测。因此,对于大多数样本来说,类别预测没有损失?或者有一些?我们并不完全确定。双IOU阈值和真值分配。Faster R-CNN在训练期间使用两个IOU阈值。如果一个预测与真实标签框重叠超过0.7,它就是一个正样本,若重叠为[0.3,0.7]之间,那么它会被忽略,若它与所有的真实标签框的IOU小于0.3,那么一个负样本。我们尝试了类似的策略,但无法取得好的结果。我们非常喜欢目前的更新,它似乎至少在局部达到了最佳。有些方法可能最终会产生好的结果,也许他们只是需要一些调整来稳定训练。5 这一切意味着什么YOLOv3是一个很好的检测器。速度很快、很准确。它在COCO平均AP介于0.5和0.95 IOU之间的指标的上并不理想。但是,对于旧的0.5 IOU检测指标上效果非常好。为什么我们要改变指标?COCO的原论文只是有这样一句含糊不清的句子:“一旦评估服务器完成,就会生成全面评测指标”。Russakovsky等人的报告说,人们很难区分0.3和0.5的IOU。“训练人类用视觉检查0.3 IOU的边界框,并且与0.5 IOU的框区别开来是非常困难的。“[16]如果人类很难说出差异,那么它也没有多重要吧?但是也许更好的问题是:“现在我们有了这些检测器,我们要做什么?”很多做关于这方面的研究的人都受聘于Google和Facebook。我想至少我们知道这项技术在好人的手中,绝对不会被用来收集您的个人信息并将其出售给......等等,您是说这正是它的用途?oh。其他花大钱资助视觉研究的人还有军方,他们从来没有做过任何可怕的事情,例如用新技术杀死很多人,等等.....(脚注:作者由the Office of Naval Research and Google资助支持。)我强烈地希望,大多数使用计算机视觉的人都用它来做一些快乐且有益的事情,比如计算一个国家公园里斑马的数量[11],或者追踪在附近徘徊的猫[17]。但是计算机视觉已经有很多可疑的用途,作为研究人员,我们有责任考虑我们的工作可能造成的损害,并思考如何减轻它的影响。我们欠这个世界太多。最后,不要再@我了。(因为哥已经退出Twitter这个是非之地了)。参考文献[1] Analogy. Wikipedia, Mar 2018. 1[2] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010. 6[3] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659, 2017. 3[4] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi. Iqa: Visual question answering in interactive environments. arXiv preprint arXiv:1712.03316, 2017. 1[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 3[6] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z.Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. 3[7] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. 
Uijlings, S. Popov, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, and K. Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available fromhttps://github.com/openimages, 2017. 2[8] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017. 2, 3[9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll´ar. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017. 1, 3, 4[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll´ar, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 2[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.- Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 3[12] I. Newton. Philosophiae naturalis principia mathematica. William Dawson & Sons Ltd., London, 1687. 1[13] J. Parham, J. Crall, C. Stewart, T. Berger-Wolf, and D. Rubenstein. Animal population censusing at scale with citizen science and photographic identification. 2017. 4[14] J. Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016. 3[15] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6517–6525. IEEE, 2017. 1, 2, 3[16] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv, 2018. 4[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015. 2[18] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: human-machine collaboration for object annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2121–2131, 2015. 4[19] M. Scott. Smart camera gimbal bot scanlime:027, Dec 2017. 4[20] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016. 3[21] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. 2017. 3参考资料目标检测经典论文——YOLOv3论文翻译(纯中文版):YOLOv3:增量式的改进(YOLOv3: An Incremental Improvement):https://blog.csdn.net/Jwenxue/article/details/107749323?ops_request_misc=%25257B%252522request%25255Fid%252522%25253A%252522161268258716780274122037%252522%25252C%252522scm%252522%25253A%25252220140713.130102334.pc%25255Fblog.%252522%25257D&request_id=161268258716780274122037&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_v1~rank_blog_v1-12-107749323.pc_v1_rank_blog_v1&utm_term=YOLO
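As a quick reference for the box-prediction description in Section 2.1 above (the equation figure is not reproduced in the text), the decoding defined in the YOLOv3 paper, with cell offset (cx, cy), prior (anchor) size (pw, ph) and raw predictions tx, ty, tw, th, is:

$$ b_x = \sigma(t_x) + c_x $$

$$ b_y = \sigma(t_y) + c_y $$

$$ b_w = p_w e^{t_w} $$

$$ b_h = p_h e^{t_h} $$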
2021年02月07日
638 阅读
0 评论
0 点赞