{"id":13134,"date":"2017-09-22T14:42:29","date_gmt":"2017-09-22T18:42:29","guid":{"rendered":"http:\/\/n2value.com\/blog\/?p=13134"},"modified":"2017-11-29T21:45:23","modified_gmt":"2017-11-30T02:45:23","slug":"building-a-high-performance-gpu-computing-workstation-for-deep-learning-part-i","status":"publish","type":"post","link":"https:\/\/n2value.com\/blog\/building-a-high-performance-gpu-computing-workstation-for-deep-learning-part-i\/","title":{"rendered":"Building a high-performance GPU computing workstation for deep learning  \u2013 part I"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-13131 size-large\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/gear-1024x537.jpg\" alt=\"\" width=\"768\" height=\"403\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/gear-1024x537.jpg 1024w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/gear-300x157.jpg 300w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/gear-768x402.jpg 768w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/gear.jpg 1996w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><\/p>\n<p><em>This post is cross-posted to <a href=\"http:\/\/www.ai-imaging.org\">www.ai-imaging.org<\/a>.\u00a0 For machine learning and AI issues, please visit the <a href=\"http:\/\/www.ai-imaging.org\">new site<\/a>!<\/em><\/p>\n<p>With TensorFlow released to the public, the NVidia Pascal Titan X GPU available, and (relatively) cheap storage and memory, the time was right to take the leap from CPU-based computing to GPU-accelerated machine learning.<\/p>\n<p>My venerable Xeon W3550 8GB T3500 running a 2GB Quadro 600 was outdated. Since a DGX-1 was out of the question ($129,000), I decided to follow other pioneers building their own deep learning workstations. 
I could have ended up with a multi-thousand-dollar doorstop \u2013 fortunately, I did not.<\/p>\n<p style=\"text-align: left;\">Criteria:<\/p>\n<ol>\n<li style=\"text-align: left;\">Reasonably fast CPU<\/li>\n<li style=\"text-align: left;\">Current &#8216;Best&#8217; NVidia GPU with large GDDR5X memory<\/li>\n<li style=\"text-align: left;\">Multi-GPU potential<\/li>\n<li style=\"text-align: left;\">32GB or more stable RAM<\/li>\n<li style=\"text-align: left;\">SSD for OS<\/li>\n<li style=\"text-align: left;\">Minimize internal bottlenecks<\/li>\n<li style=\"text-align: left;\">Stable &amp; Reliable &#8211; minimize hardware bugs<\/li>\n<li style=\"text-align: left;\">Dual Boot Windows 10 Pro &amp; Ubuntu 16.04LTS<\/li>\n<li style=\"text-align: left;\">Can run: R, RStudio, PyCharm, Python 3.5, TensorFlow<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<p><a href=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/component-costs.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-13141\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/component-costs.png\" alt=\"\" width=\"322\" height=\"223\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/component-costs.png 322w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/component-costs-300x208.png 300w\" sizes=\"auto, (max-width: 322px) 100vw, 322px\" \/><\/a><\/p>\n<p style=\"padding-left: 540px;\">Total: \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 $3725<\/p>\n<p>&nbsp;<\/p>\n<h4>Asus X99 E 10G WS Motherboard. Retail $699<\/h4>\n<p>A motherboard sets the capabilities and configuration of your system. 
While newer Intel Skylake and Kaby Lake CPU architectures &amp; chipsets beckon, reliability is important in a computationally intensive build, and <a href=\"https:\/\/www.extremetech.com\/computing\/220953-skylake-bug-causes-intel-chips-to-freeze-in-complex-workloads\">their documented freeze bug under complex workloads<\/a> makes me uneasy. Also, both architectures remain PCIe 3.0 at this time.<\/p>\n<p>Therefore, I chose the ASUS X99 motherboard. The board implements 40 PCIe 3.0 lanes, which will support three 16X PCIe 3.0 cards (i.e. GPUs) and one 8X card. <a href=\"https:\/\/www.nextplatform.com\/2017\/07\/14\/system-bottleneck-shifts-pci-express\/\">The PCIe 3.0 lanes to the CPU are the largest bottleneck in the system, so making these 16X helps the most<\/a>.\u00a0 It also has a 10G Ethernet jack, somewhat future-proofing it, as I anticipate using large datasets in the terabyte range. It supports up to 128GB of DDR4. Previous versions of the ASUS X99 WS have been well reviewed.<\/p>\n<p>&nbsp;<\/p>\n<h4>Intel Core i7 6850K Broadwell-E CPU. Retail $649<\/h4>\n<p>Socket LGA2011-v3 on the motherboard guides the CPU choice \u2013 the sweet spot in the Broadwell-E lineup is the overclockable 3.6GHz 6850K with 6 cores and 15MB of L3 cache, supporting all 40 PCIe lanes. $359 discounted is attractive compared to the 6900K, reviewed as offering minimal to no improvement at a $600 price premium. The 6950X is $1200 more for 4 extra cores, unnecessary for our purposes. Avoid the $650 6800K \u2013 pricier and slower, with fewer (28) lanes. A stable overclock to 4.0GHz is easily achievable on the 6850K.<\/p>\n<h4>NVidia GeForce 1080Ti 11GB \u2013 EVGA FTW3 edition. Retail: $800<\/h4>\n<p>Last year, choosing a GPU was easy \u2013 the Titan X Pascal, a 12GB 3584 CUDA-core monster. 
However, by spring 2017 there were two choices: the Titan Xp, with slightly faster memory speed &amp; internal bus and 256 more CUDA cores; and the 1080Ti, the prosumer enthusiast version of the Titan X Pascal, with 3584 cores. The 1080Ti differs in its memory architecture \u2013 11GB of GDDR5X on a slightly slower, slightly narrower memory bus vs. the Xp.<\/p>\n<p>The 1080Ti currently wins on price\/performance. You can buy two 1080Tis for the price of one Titan Xp. Also, at the time of purchase, the Volta architecture had been announced. As the PCIe bus is the bottleneck, and will remain so for a few years, batch size into GPU memory &amp; CUDA cores will be where performance is gained. A 16GB Volta processor would be a significant performance gain over a 12GB Pascal for deep learning. Conversely, dropping from a 12GB Pascal to an 11GB Pascal is a comparatively minor performance hit. As I am later in the upgrade cycle, I\u2019ll upgrade to the 16GB Volta and resell my 1080Ti in the future \u2013 I anticipate taking a loss of only $250 per 1080Ti on resale.<\/p>\n<p>The FTW3 edition was chosen because it is a true 2-slot card (not 2.5) with better cooling than the Founders Edition 1080Ti. This will allow three to fit physically onto this motherboard.<\/p>\n<h4>64GB DDR4-2666 DRAM \u2013 Corsair Vengeance low profile. Retail: $600<\/h4>\n<p>DDR4 runs at 2133MHz unless overclocked. Attention must be paid to the size of the DRAM units to ensure they fit under the CPU cooler, which these do. From my research, DRAM speeds over 3000MHz lose stability, and for Broadwell there\u2019s not much evidence that speeds above 2666MHz improve performance. I chose 64GB because 1) I use R, which is memory-resident, so the more GB the better, and 2) there is a controversial rule of thumb that your RAM should equal 2x the size of your GPU memory to prevent bottlenecks. Implementing three 1080Tis, 3 x 11GB = 33GB. 
Implementing two 16GB Voltas would be 32GB.<\/p>\n<p>&nbsp;<\/p>\n<h4>Samsung 1TB 960 EVO M2 NVMe SSD. Retail $500<\/h4>\n<p>The ASUS motherboard has a fast M2 interface which, while using PCIe lanes, does not compete with the GPU slots. The 1TB size is probably sufficient for anything I will throw at it (all apps\/programs, OSes, and frequently used data and packages); everything else can go on other storage. I was unnecessarily concerned about SSD heat throttling &#8211; on this motherboard, the slot is well placed, allowing great airflow over it. The speed in booting Windows 10 or Ubuntu 16.04 LTS is noticeable.<\/p>\n<p>&nbsp;<\/p>\n<h4>EVGA Titanium 1200 power supply. Retail $350<\/h4>\n<p>One of the more boring parts of the computer, but for a multi-GPU build you need a strong 1200 or 1600W power supply. The high Titanium rating will both save on electricity and promote stability over long compute sessions.<\/p>\n<p>&nbsp;<\/p>\n<h4>Barracuda 8TB Hard Drive. Retail $299<\/h4>\n<p>I like to control my data, so I\u2019m still not wild about the cloud, although it is a necessity for very large data sets. So here is a large, cheap drive for on-site data storage. For an extra $260, I can RAID 1 the drive and sleep well at night.<\/p>\n<h4><\/h4>\n<h4>Scythe FUMA CPU Cooler. Retail $60<\/h4>\n<p>This was actually one of the hardest decisions in building the system \u2013 would the memory fit under the fans? The answer is a firm yes. This dual-fan tower cooler was well-rated, quiet, attractive, fit properly, and half the price of other options, and my overclocked CPU runs extremely cool \u2013 35C at full fan RPM, average operating temperature 42C, and even under a high-stress test I have difficulty getting the temperature over 58C. Notably, the fans never even reach full speed under system fan control.<\/p>\n<p>&nbsp;<\/p>\n<h4>Corsair 750 D Airflow Edition Case. 
Retail $250<\/h4>\n<p>After hearing the horror stories of water leaks, I decided at this level of build not to go with water cooling. The 750D has plenty of space (enough for a server) for air circulation, and comes with three fans installed \u2013 two intakes at the front and one exhaust at the upper rear. It is a really nice, sturdy, large case. My front panel was defective \u2013 the grating kept falling off \u2013 so Corsair shipped me a replacement quickly and without fuss.<\/p>\n<h4><\/h4>\n<h4>Cougar Vortex 14\u201d fans \u2013 Retail $20 ea.<\/h4>\n<p>Two extra Cougar Vortex 14\u201d fans were purchased, one as an intake fan at the bottom of the case and one as a second exhaust fan at the top. Together these create excellent airflow at noise levels I can barely hear. Two fans on the CPU heat sink, three on the GPU, five on the case, and one in the power supply = 11 fans total! More airflow at lower RPM = silence.<\/p>\n<p>&nbsp;<\/p>\n<h4>Windows 10 Pro USB edition. Retail $199<\/h4>\n<p>This is a dual-boot system, so there you go.<\/p>\n<p><a href=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/opencase-e1505920071826.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-13132 size-large\" src=\"http:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/opencase-e1505920071826-768x1024.jpg\" alt=\"\" width=\"768\" height=\"1024\" srcset=\"https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/opencase-e1505920071826-768x1024.jpg 768w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/opencase-e1505920071826-225x300.jpg 225w, https:\/\/n2value.com\/blog\/wp-content\/uploads\/2017\/09\/opencase-e1505920071826.jpg 1512w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><\/a><\/p>\n<p>Specific limitations with this system are as follows. 
While it will physically accept four GPUs, the slots are limited to 16X\/16X\/16X\/8X with the M2 drive installed, which may affect performance of the 4th GPU (&amp; therefore deep learning model training performance). Additionally, the CPU upgrade path is limited &#8211; without going to a Xeon, the only reasonable upgrade from the 6850K\u2019s PassMark score of 14,378 is the 6950X, with a PassMark of 20,021. If more than 128GB of DDR4 is ever required, that will be a problem for this build.<\/p>\n<p>Finally, inherent bandwidth limitations exist in the PCIe 3.0 protocol and aren\u2019t easily circumvented.\u00a0PCIe 3.0 x16 throughput is roughly 16GB\/s per direction. Compare this to NVidia\u2019s proprietary NVlink, which allows throughput of 20-25GB\/s per link (Pascal vs. Volta). Current NVlink speeds will not be surpassed until PCIe 4.0, at roughly 32GB\/s for a 16X slot, is widely implemented. NVidia\u2019s CUDA doesn\u2019t implement SLI, either, so at present that is not a solution. PCIe 4.0 has just been released with only IBM adopting it, doubling transfer speed vs. 3.0, and PCIe 5.0 has been proposed, doubling it yet again. However, these faster protocols may be difficult and\/or expensive to implement. A 4-slot PCIe 5.0 bus will probably not be seen until well into the 2020s. This means that for now, dedicated NVlink 2.0 systems will outperform similar PCIe systems.<\/p>\n<p>With that said, this system approaches the best possible build considering price and reliability, and should give a few years of good service, especially if the GPUs are upgraded periodically. Precursor systems based upon the Z97 chipset are still viable for deep learning, albeit at slower speeds, and have been matched to older NVidia 8GB 1070 GPUs, which are again about half the price of the 1080Ti.<\/p>\n<p>In part II, I will describe how I configured the system for dual boot and set up deep learning with Ubuntu 16.04LTS. 
Surprisingly, this was far more difficult than the actual build itself, for several reasons I will detail along with their solutions.\u00a0 And yes, it booted up.\u00a0 On the first try.<\/p>\n<p>If you liked this post, head over to our sister site, ai-imaging.org, where <a href=\"http:\/\/ai-imaging.org\/building-a-high-performance-gpu-computing-workstation-for-deep-learning-part-ii\/\" target=\"_blank\" rel=\"noopener\">part 2<\/a>, <a href=\"http:\/\/ai-imaging.org\/building-a-high-performance-gpu-computing-workstation-for-deep-learning-part-iii\/\">part 3<\/a>, and <a href=\"http:\/\/ai-imaging.org\/building-a-high-performance-gpu-computing-workstation-part-iv\/\" target=\"_blank\" rel=\"noopener\">part 4<\/a> of this post are located.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post is cross posted to www.ai-imaging.org .\u00a0 For machine learning and AI issues, please visit the new site! With Tensorflow released to the public, the NVidia Pascal Titan X GPU, along with (relatively) cheap storage and memory, the time was right to take the leap from CPU-based computing to GPU accelerated machine learning. 
My [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[22,4],"tags":[],"class_list":["post-13134","post","type-post","status-publish","format-standard","hentry","category-computer-vision","category-data-science"],"jetpack_publicize_connections":[],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4mtfP-3pQ","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts\/13134","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/comments?post=13134"}],"version-history":[{"count":12,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts\/13134\/revisions"}],"predecessor-version":[{"id":13597,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/posts\/13134\/revisions\/13597"}],"wp:attachment":[{"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/media?parent=13134"}],"wp:ter
m":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/categories?post=13134"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/n2value.com\/blog\/wp-json\/wp\/v2\/tags?post=13134"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}