A Multi-task Convolutional Neural Network for Autonomous Robotic Grasping in Object Stacking Scenes

Hanbo Zhang, Xuguang Lan, Site Bai, Lipeng Wan, Chenjie Yang, and Nanning Zheng

Abstract— Autonomous robotic grasping plays an important role in intelligent robotics. However, helping a robot grasp specific objects in object stacking scenes remains an open problem, because autonomous robots face two main challenges: (1) knowing what and how to grasp is a comprehensive task; (2) it is hard to deal with situations in which the target is hidden or covered by other objects. In this paper, we propose a multi-task convolutional neural network for autonomous robotic grasping, which helps the robot find the target, make the plan for grasping and finally grasp the target step by step in object stacking scenes. We integrate vision-based robotic grasp detection and visual manipulation relationship reasoning in one single deep network and build an autonomous robotic grasping system. Experimental results demonstrate that with our model, the Baxter robot can autonomously grasp the target with a success rate of 90.6%, 71.9% and 59.4% in object cluttered scenes, familiar stacking scenes and complex stacking scenes respectively.

I. INTRODUCTION

In the research of intelligent robotics [1], autonomous robotic grasping is a very challenging task [2]. For use in daily-life scenes, autonomous robotic grasping should satisfy the following conditions:
- Grasping should be robust and efficient.
- The desired object can be grasped in a multi-object scene without potential damage to other objects.
- The correct decision can be made when the target is not visible or is covered by other things.

For human beings, grasping can be done naturally with high efficiency even if the target is unseen or grotesque. However, robotic grasping involves many difficult steps including perception, planning and control. Moreover, for complex scenes (e.g.
the target is occluded or covered by other objects), robots also need a certain reasoning ability to grasp the target in the right order. For example, as shown in Fig. 1, in order to prevent potential damage to other objects, the robot has to plan the grasping order through reasoning, perform multiple grasps in sequence to complete the task and finally get the target. These difficulties make autonomous robotic grasping more challenging in complex scenes.

Therefore, in this paper, we propose a new vision-based multi-task convolutional neural network (CNN) to solve the mentioned problems for autonomous robotic grasping, which helps the robot complete grasping tasks in complex scenes (e.g. grasp an occluded or covered target). To achieve this, three functions should be implemented: grasping the desired object in multi-object scenes, reasoning the correct order for grasping, and executing the grasp sequence to get the target.

Fig. 1. Grasping task in a complex scene. The target is the tape, which is placed under several things and nearly invisible. The complete manipulation relationship tree indicates the correct grasping order, and grasping in this order will avoid damage to other objects. The correct decision made by a human being would be: if the target is visible, make the correct grasping plan and execute a sequence of grasps to get the target; if the target is invisible, move the visible things away in a correct order to find the target.

(Hanbo Zhang and Xuguang Lan are with the Institute of Artificial Intelligence and Robotics, the National Engineering Laboratory for Visual Information Processing and Applications, School of Electronic and Information Engineering, Xi'an Jiaotong University, No.28 Xianning Road, Xi'an, Shaanxi, China. zhanghanbo163, xglan)
To help the robot grasp the desired object in multi-object scenes, we design the Perception part of our network, which simultaneously detects objects and their grasp candidates. The grasps are detected in the area of each object instead of the whole scene. To deal with situations in which the target is hidden or covered by other objects, we design the Reasoning part to obtain visual manipulation relationships between objects and enable the robot to reason the correct order for grasping, preventing potential damage to other objects. To transfer network outputs into configurations for grasp execution, we design the Grasping part of our network. For the perception and reasoning process, RGB images are taken as input of the neural network, while for the execution of grasping, depth information is needed for approaching-vector computation and coordinate transformation.

Though there are some previous works that try to complete grasping in dense clutter [3]-[6], to our knowledge, our proposed algorithm is the first to combine perception, reasoning and grasp planning in one neural network, and attempts to realize autonomous robotic grasping in complex scenarios. To evaluate our proposed algorithm, we validate the performance of our model on the VMRD dataset [7]. For robotic experiments, the Baxter robot is used as the executor to complete grasping tasks, in which the robot is required to find the target, make the plan for grasping and grasp the target step by step.

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019. 978-1-7281-4003-2/19/$31.00 ©2019 IEEE

The rest of this paper is organized as follows: related work is reviewed in Section II; our proposed algorithm is detailed in Section III; experimental results including validation on the VMRD dataset and robotic experiments are shown in Section IV; and finally the conclusion and discussion are given in Section V.

II. RELATED WORK

A. Robotic Grasp Detection

With the development of deep learning, robotic grasp detection based on convolutional neural networks (CNNs) achieves state-of-the-art performance on several datasets such as the Cornell dataset [8]-[13] and the CMU grasp dataset [14]. These methods are suitable for grasp detection in single-object scenes. Some works have been proposed for grasping in dense clutter [3]-[6]. A deep network is used by Guo et al. [3] to simultaneously detect the most exposed object and its best grasp, trained on a fruit dataset including 352 RGB images. However, their model can only output the grasp affiliated with the most exposed object, without perception and understanding of the overall environment or reasoning about the relationships between objects, which limits the use of the algorithm. The algorithms proposed in [4], [5] and [6] only focus on the detection of grasps in scenes where objects are densely cluttered, rather than on what the grasped objects are. Therefore, the existing algorithms detect grasps on features of the whole image, and can only be used to grasp an unspecified object instead of a pointed one in stacking scenes.

B. Visual Manipulation Relationship Reasoning

Recent works prove that CNNs achieve advanced performance on visual relationship reasoning [15]-[17]. Different from visual relationships, the visual manipulation relationship [7] is proposed to solve the problem of grasping order in object stacking scenes with consideration of the safety and stability of objects. However, when this algorithm is directly combined with a grasp detection network to solve the grasping problem in object stacking scenes, there are two main difficulties: 1) it is difficult to correctly match the detected grasps and the detected objects in object stacking scenes; 2) the cascade structure causes a lot of redundant computation (e.g. the extraction of scene features), which makes the speed slow.
Therefore, in this paper, we propose a new CNN architecture that combines object detection, grasp detection and visual manipulation relationship reasoning, and we build a robotic autonomous grasping system. Different from previous works, grasps are detected on object features instead of the whole scene. Visual manipulation relationships are applied to decide which object should be grasped first. The proposed network helps our robot grasp the target following the correct grasping order in complex scenes.

Fig. 2. Our task is to grasp the target (the tape) in complex scenes. The target is covered by several other objects and almost invisible. The robot needs to find the target, plan the grasping order and execute the grasp sequence to get the target.

III. TASK DEFINITION

In this paper, we focus on grasping tasks in scenes where the target and several other objects are cluttered or piled up, which means there can be occlusions and overlaps between objects, or the target is hidden under other objects and cannot be observed by the robot. Therefore, we set up an environment with several different objects each time. In each experiment, we test whether the robot can find and grasp the specific target. The target of the task is input manually. The desired robot behavior is that the final target is grasped step by step following the correct manipulation relationships predicted by the proposed neural network.

In detail, we focus on grasping tasks in realistic and challenging scenes as follows: each experimental scene includes 6-9 objects, where objects are piled up and there are severe occlusions and overlaps. At the beginning of each experiment, the target is difficult to detect in most cases. Following this setting, we can test whether the robot makes correct decisions to find the target and successfully grasp it.

IV. PROPOSED APPROACH

A. Architecture

The proposed architecture of our approach is shown in Fig. 3.
The input of our network is an RGB image of the working scene. First, we use a CNN (e.g. ResNet-101 [18]) to extract image features. As shown in Fig. 3, the convolutional features are shared among the region proposal network (RPN [19]), object detection, grasp detection and visual manipulation relationship reasoning. The shared feature extractor used in our work consists of the ResNet-101 layers before conv4, including 30 ResBlocks; therefore, the stride of the shared features is 16. The RPN follows the feature extractor to output regions of interest (ROIs). It includes three 3x3 convolutional layers: an intermediate convolutional layer, an ROI regressor and an ROI classifier. The ROI regressor and classifier are both cascaded after the intermediate convolutional layer to output the locations of ROIs and the probability of each ROI being a candidate object bounding box.

Fig. 3. Architecture of our proposed approach. The input is an RGB image of the working scene. The solid arrows indicate forward propagation while the dotted arrows indicate backward propagation. In each iteration, the neural network produces one robotic grasp configuration and the robot moves one object. The iteration does not terminate until the desired target is grasped. (a): Network architecture; (b): Perception part with object detector and grasp detector; (c): Reasoning part with visual manipulation relationship predictor; (d): Expected results.

The main body of our approach includes three parts: Perception, Reasoning and Grasping. "Perception" is used to obtain object and grasp detection results with the affiliations between them. "Reasoning" takes the object bounding boxes output by "Perception" and image features as input to predict the manipulation relationship between each pair of objects. "Grasping" uses the perception results to transform grasp rectangles into robotic grasp configurations to be executed by the robot.
Each detection produces one robotic grasp configuration, and the iteration terminates when the desired target is grasped.

B. Perception

In the "Perception" part, the network simultaneously detects objects and their grasps. The convolutional features and the ROIs output by the RPN are first fed into an ROI pooling layer, where the features are cropped by the ROIs and adaptively pooled into the same size W x H (in our work, 7 x 7). The purpose of ROI pooling is to enable the corresponding features of all ROIs to form a batch for network training.

1) Object Detector: The object detector takes a mini-batch of ROI-pooled features as input. As in [18], a ResNet conv5 layer including 9 convolutional layers is adopted as the header for final regression and classification, taking ROI-pooled features as input. The header's output is then averaged over each feature map. The regressor and classifier are both fully connected layers with 2048-d input and no hidden layer, outputting the locations of refined object bounding boxes and classification results respectively.

2) Grasp Detector: The grasp detector also takes ROI-pooled features as input to detect grasps on each ROI. Each ROI is first divided into W x H grid cells, each corresponding to one pixel of the ROI-pooled feature maps. Inspired by our previous work [13], the grasp detector outputs k (in this paper, k = 4) grasp candidates on each grid cell with oriented anchor boxes as priors. Different oriented anchor sizes are explored during the experiments, including 12x12 and 24x24 pixels. A header including 3 ResBlocks is cascaded after ROI pooling in order to enlarge the receptive field of the features used for grasp detection; a large receptive field prevents the grasp detector from being confused by grasps that belong to different ROIs. Then, similar to [13], the grasp regressor and grasp classifier follow the grasp header and output 5k and 2k values for grasp rectangles and graspable confidence scores respectively.
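The adaptive ROI pooling step described above can be sketched in plain NumPy. This is an illustrative sketch, not the paper's implementation: the function name is ours, the ROI is assumed to be given at feature-map resolution, and mean pooling per bin is an assumption (the paper only specifies adaptive pooling to W x H = 7 x 7).

```python
import numpy as np

def roi_adaptive_pool(features, roi, out_size=7):
    """Crop a feature map by an ROI (feature-map coordinates) and
    adaptively pool the crop to out_size x out_size.
    features: (C, H, W) array; roi: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = roi
    crop = features[:, y1:y2, x1:x2]
    C, h, w = crop.shape
    # Split rows/cols into out_size nearly equal bins and average each bin.
    ys = np.linspace(0, h, out_size + 1, dtype=int)
    xs = np.linspace(0, w, out_size + 1, dtype=int)
    pooled = np.empty((C, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            bin_ = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = bin_.mean(axis=(1, 2))
    return pooled

# A 256-channel stride-16 feature map and one ROI pooled to 7 x 7:
feat = np.random.rand(256, 38, 50)
batch = roi_adaptive_pool(feat, (4, 6, 30, 25))
print(batch.shape)  # (256, 7, 7)
```

Because the output size is fixed regardless of the ROI shape, the pooled features of all ROIs can be stacked into one mini-batch, which is exactly the purpose stated above.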
Therefore, the output for each grasp candidate is a 7-dimensional vector: 5 values for the location of the grasp (x_g, y_g, w_g, h_g, θ_g) and 2 for the graspable and ungraspable confidence scores (c_g, c_ug). The output of the "Perception" part for each object thus contains two parts: the object detection result O and the grasp detection result G. O is a 5-dimensional vector (x_min, y_min, x_max, y_max, cls) representing the location and category of an object, and G is a 5-dimensional vector (x_g, y_g, w_g, h_g, θ_g) representing the best grasp.

Fig. 4. When all manipulation relationships are obtained, the manipulation relationship tree can be built by combining all manipulation relationships. The leaf nodes represent the objects that should be grasped first. (The figure shows an input image, the pairwise manipulation relationships and the resulting manipulation relationship tree for a scene containing a toothpaste, a box, a tape, a wrist developer and a plier.)

(x_g, y_g, w_g, h_g, θ_g) is computed from the regressed offsets (x̂_g, ŷ_g, ŵ_g, ĥ_g, θ̂_g) by Eq. (1):

    x_g = x̂_g · w_a + x_a
    y_g = ŷ_g · h_a + y_a
    w_g = exp(ŵ_g) · w_a
    h_g = exp(ĥ_g) · h_a
    θ_g = θ̂_g · (90/k) + θ_a        (1)

where (x_a, y_a, w_a, h_a, θ_a) is the corresponding oriented anchor.

C. Reasoning

Inspired by our previous work [7], we integrate visual manipulation relationship reasoning into our network to help the robot reason the grasping order without causing potential damage to other objects.

Manipulation Relationship Predictor: To predict the manipulation relationships of object pairs, we adopt the Object Pairing Pooling Layer (OP2L) to obtain features of object pairs. As shown in Fig. 3(c), the input of the manipulation relationship predictor is the features of object pairs. The features of each object pair (O1, O2) include the features of O1, of O2 and of their union bounding box. Similar to the object detector and grasp detector, the features of each object are also adaptively pooled into the same size W x H. The difference is that the convolutional features are cropped by object bounding boxes instead of ROIs.
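Eq. (1) can be checked with a small helper that decodes one regressed offset tuple against its oriented anchor. The function and variable names are ours; the arithmetic follows Eq. (1) term by term, with anchor orientations assumed to be spaced 90/k degrees apart.

```python
import math

def decode_grasp(offsets, anchor, k=4):
    """Decode regressed grasp offsets against an oriented anchor, per Eq. (1).
    offsets: (dx, dy, dw, dh, dtheta) from the grasp regressor;
    anchor:  (x_a, y_a, w_a, h_a, theta_a) oriented anchor box prior."""
    dx, dy, dw, dh, dt = offsets
    xa, ya, wa, ha, ta = anchor
    return (dx * wa + xa,          # x_g
            dy * ha + ya,          # y_g
            math.exp(dw) * wa,     # w_g
            math.exp(dh) * ha,     # h_g
            dt * (90 / k) + ta)    # theta_g

# Zero offsets reproduce the anchor itself:
print(decode_grasp((0.0, 0.0, 0.0, 0.0, 0.0), (100, 80, 24, 24, 45)))
# -> (100.0, 80.0, 24.0, 24.0, 45.0)
```

The log-space width/height encoding keeps w_g and h_g positive for any regressed value, the usual motivation for this anchor-offset parameterization.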
Note that (O1, O2) is different from (O2, O1) because the manipulation relationship is not commutative. If there are n detected objects, the number of object pairs is n x (n - 1), and n x (n - 1) manipulation relationships are predicted. In the manipulation relationship predictor, the features of the two objects and of the union bounding box are first passed through several convolutional layers respectively (in this work, ResNet conv5 layers are applied), and finally the manipulation relationships are classified by a fully connected network containing two 2048-d hidden layers.

After obtaining all manipulation relationships, we can build a manipulation relationship tree that describes the correct grasping order in the whole scene, as shown in Fig. 4. Leaf nodes of the manipulation relationship tree should be grasped before the other nodes. Therefore, the most important part of the manipulation relationship tree is its leaf nodes: if we can make sure that the leaf nodes are detected correctly in each step, the grasping order will be correct regardless of the other nodes.

D. Grasping

The "Grasping" part is used to complete inference on the outputs of the network. In other words, the input of this part is the object and grasp detection results, and the output is the corresponding robotic configuration to grasp each object. Note that there are no trainable weights in the "Grasping" part.

1) Grasp Selection: As described above, the grasp detector outputs a large set of grasp candidates for each ROI, so the best grasp candidate should first be selected for each object. According to [12], there are two methods to find the best grasp: (1) choose the grasp with the highest graspable score; (2) choose the one closest to the object center among the Top-N candidates. The second is proved to be better in [12] and is used to get the grasp of each object in our paper. In our experiments, N is set to 3.
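The leaf-first grasping order can be sketched as a simple loop over the predicted pairwise relations. This is our own illustrative encoding, not the paper's data structures: relations are pairs (a, b) meaning "a must be grasped before b" (e.g. a rests on b), and at each step an object with no remaining predecessor is a leaf of the manipulation relationship tree.

```python
def grasp_order(objects, before):
    """Return a feasible grasping order given pairwise constraints.
    before: set of (a, b) pairs meaning 'a must be grasped before b',
    as predicted for the n*(n-1) ordered object pairs."""
    remaining = list(objects)
    order = []
    while remaining:
        # Leaf objects: no object still in the scene must precede them.
        free = [o for o in remaining
                if not any(a in remaining and b == o for (a, b) in before)]
        if not free:
            # Cyclic predictions admit no consistent grasping order.
            raise ValueError("inconsistent manipulation relationships")
        order.append(free[0])
        remaining.remove(free[0])
    return order

# Toothpaste on the box, box on the tape (scene of Fig. 4, simplified):
print(grasp_order(["toothpaste", "box", "tape"],
                  {("toothpaste", "box"), ("box", "tape")}))
# -> ['toothpaste', 'box', 'tape']
```

This mirrors the observation above: only the current leaves matter, so as long as they are detected correctly at each step, the produced order is safe regardless of errors deeper in the tree.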
2) Coordinate Transformation: The purpose of the coordinate transformation is to map the detected grasps in the image to an approaching vector and a grasp point in the robot coordinate system. In this paper, an affine transformation is used as an approximation of this mapping. The affine transformation is obtained through four reference points with their coordinates in both the image and the robot coordinate system. The grasp point is defined as the point in the grasp rectangle with minimum depth, while the approaching vector
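The image-to-robot affine map can be estimated from the reference points by least squares. This NumPy sketch is ours, with made-up calibration coordinates; it fits the 2-D affine parameters from four image/robot point pairs and applies them to a detected grasp point.

```python
import numpy as np

def fit_affine(img_pts, robot_pts):
    """Fit x_robot = A @ x_img + t from >= 3 point correspondences
    (four reference points are used in the paper). Returns (A, t)."""
    P = np.hstack([np.asarray(img_pts, float), np.ones((len(img_pts), 1))])
    M, *_ = np.linalg.lstsq(P, np.asarray(robot_pts, float), rcond=None)
    return M[:2].T, M[2]  # A is 2x2, t is length-2

# Hypothetical calibration: image corners and their robot-frame positions (m).
img = [(0, 0), (640, 0), (640, 480), (0, 480)]
rob = [(0.60, -0.30), (0.60, 0.30), (0.20, 0.30), (0.20, -0.30)]
A, t = fit_affine(img, rob)
print(np.round(A @ np.array([320, 240]) + t, 3))  # image center -> robot frame
```

An affine map cannot capture perspective distortion or lens effects, which is why the paper treats it as an approximation; with more than the minimum number of reference points, the least-squares fit averages out small calibration errors.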