{"id":2217,"date":"2016-05-19T12:06:08","date_gmt":"2016-05-19T15:06:08","guid":{"rendered":"http:\/\/roberval.com.br\/roberval\/?p=2217"},"modified":"2016-05-19T12:08:59","modified_gmt":"2016-05-19T15:08:59","slug":"a-swarm-of-sparks","status":"publish","type":"post","link":"https:\/\/roberval.com.br\/roberval\/2016\/05\/a-swarm-of-sparks\/","title":{"rendered":"A Swarm of Sparks"},"content":{"rendered":"<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>At <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/www.worldsense.com\/\" rel=\"nofollow\" data-href=\"http:\/\/www.worldsense.com\/\">WorldSense<\/a> we build predictors for the best links you could add in your content by creating large language models from the World Wide Web. In the open source world, no tool is better suited for that kind of mass (hyper)text analysis than <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/spark.apache.org\/\" rel=\"nofollow\" data-href=\"http:\/\/spark.apache.org\/\">Apache Spark<\/a>, and I wanted to share how we set it up and run it on the cloud, so you can give it a try.<\/p>\n<p><a href=\"http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm.png\" rel=\"attachment wp-att-2668\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2668 size-medium alignright\" src=\"http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-300x212.png\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" srcset=\"http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-300x212.png 300w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-768x542.png 768w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-1024x723.png 1024w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-300x212@2x.png 600w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-960x678.png 960w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-770x544.png 770w, 
http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-1170x826.png 1170w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-768x542@2x.png 1536w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-600x424@2x.png 1200w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-960x678@2x.png 1920w, http:\/\/www.worldsense.com\/site\/wp-content\/uploads\/2016\/01\/swarm-770x544@2x.png 1540w\" alt=\"Web scale computing has never been so easy\" width=\"300\" height=\"212\" \/><\/a><\/p>\n<p id=\"4e8c\" class=\"graf--p graf-after--figure\">Spark is a distributed system and, like any similar system, it has a somewhat demanding configuration. There are many ways to run Spark, but I will describe the one that I think offers the best trade-off nowadays: a standalone cluster running (mostly) on bare-bones Amazon EC2 spot instances, configured using the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/docker\/toolbox\/releases\/tag\/v1.10.0-rc1\" rel=\"nofollow\" data-href=\"https:\/\/github.com\/docker\/toolbox\/releases\/tag\/v1.10.0-rc1\">newest Docker orchestration tools<\/a>.<\/p>\n<p id=\"8033\" class=\"graf--p graf-after--p\">Before we start, let us double-check what we need:<\/p>\n<ul class=\"postList\">\n<li id=\"ff9e\" class=\"graf--li graf-after--p\">The hardware, in the form of some machines in the cloud.<\/li>\n<li id=\"851b\" class=\"graf--li graf-after--li\">The software, Apache Spark, installed on each of them.<\/li>\n<li id=\"64da\" class=\"graf--li graf-after--li\">An abstraction layer to create a cluster from those machines.<\/li>\n<li id=\"9836\" class=\"graf--li graf-after--li\">Some coordination point through which all of this comes to life.<\/li>\n<\/ul>\n<p id=\"bd52\" class=\"graf--p graf-after--li\">We will move backwards through this list, as it makes it easier to present the different systems involved. 
We allocate our machines with <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/machine\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/machine\/\">Docker Machine<\/a>, using the <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/sirile.github.io\/2015\/07\/02\/using-docker-18-experimental-with-docker-machine-and-virtualbox-driver-boot2docker.html\" rel=\"nofollow\" data-href=\"http:\/\/sirile.github.io\/2015\/07\/02\/using-docker-18-experimental-with-docker-machine-and-virtualbox-driver-boot2docker.html\">very latest Docker Engine version<\/a>, which contains all the functionality we need. Let us start with a very small machine:<\/p>\n<div id=\"crayon-573dd63038f15679577966\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f15679577966-1\">1<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f15679577966-2\">2<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f15679577966-1\" class=\"crayon-line\"><span class=\"crayon-v\">DRIVER_OPTIONS<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;--driver amazonec2 --amazonec2-security-group=default --engine-install-url https:\/\/test.docker.com&quot;<\/span><\/div>\n<div id=\"crayon-573dd63038f15679577966-2\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-i\">create<\/span> <span class=\"crayon-v\">$DRIVER_OPTIONS<\/span> <span 
class=\"crayon-o\">--<\/span><span class=\"crayon-v\">amazonec2<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">instance<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-r\">type<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">t2<\/span><span class=\"crayon-e\">.nano<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-v\">CLUSTER_PREFIX<\/span><span class=\"crayon-sy\">}<\/span><span class=\"crayon-v\">ks<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"7ef2\" class=\"graf--p graf-after--pre\">We will use that machine for <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.consul.io\/\" rel=\"nofollow\" data-href=\"https:\/\/www.consul.io\/\">Consul<\/a>, an atomic distributed key-value store, <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.consul.io\/docs\/internals\/sessions.html\" rel=\"nofollow\" data-href=\"https:\/\/www.consul.io\/docs\/internals\/sessions.html\">inspired by Google&#8217;s Chubby<\/a>. Consul will be responsible for keeping track of who is part of our cluster, among other things. 
Installing it is trivial, since <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/hub.docker.com\/r\/progrium\/consul\/\" rel=\"nofollow\" data-href=\"https:\/\/hub.docker.com\/r\/progrium\/consul\/\">someone on the internet<\/a> already packaged it as a Docker container for us:<\/p>\n<div id=\"crayon-573dd63038f22710091195\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f22710091195-1\">1<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f22710091195-1\" class=\"crayon-line\"><span class=\"crayon-e\">docker<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine<\/span> <span class=\"crayon-e\">config<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-v\">CLUSTER_PREFIX<\/span><span class=\"crayon-sy\">}<\/span><span class=\"crayon-v\">ks<\/span><span class=\"crayon-sy\">)<\/span> <span class=\"crayon-v\">run<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">d<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">p<\/span> <span class=\"crayon-s\">&quot;8500:8500&quot;<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">h<\/span> <span class=\"crayon-s\">&quot;consul&quot;<\/span> <span class=\"crayon-v\">progrium<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">consul<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">server<\/span> <span 
class=\"crayon-o\">-<\/span><span class=\"crayon-v\">bootstrap<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"7ef1\" class=\"graf--p graf-after--pre\">This takes a few minutes to start, but you should only really need to do that once per cluster&sup1;. Every time you bring the cluster up you can point to that same Consul instance, and <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/aws.amazon.com\/blogs\/aws\/ec2-update-t2-nano-instances-now-available\/\" rel=\"nofollow\" data-href=\"https:\/\/aws.amazon.com\/blogs\/aws\/ec2-update-t2-nano-instances-now-available\/\">keeping a t2.nano running will cost you less than five bucks a month<\/a>.<\/p>\n<p id=\"e8a8\" class=\"graf--p graf-after--p\">Now we can instantiate the cluster&#8217;s master machine. The core responsibility of this machine is coordinating the workers. It will be both the Spark master machine and the manager for our <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/swarm\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/swarm\/\">Docker Swarm<\/a>, the system responsible for presenting the machines and containers as a cluster.<\/p>\n<div id=\"crayon-573dd63038f27780991148\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f27780991148-1\">1<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f27780991148-2\">2<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f27780991148-3\">3<\/div>\n<div class=\"crayon-num crayon-striped-num\" 
data-line=\"crayon-573dd63038f27780991148-4\">4<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f27780991148-5\">5<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f27780991148-6\">6<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f27780991148-1\" class=\"crayon-line\"><span class=\"crayon-v\">NET_ETH<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-e\">eth0<\/span><\/div>\n<div id=\"crayon-573dd63038f27780991148-2\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">KEYSTORE_IP<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">aws <\/span><span class=\"crayon-e\">ec2 <\/span><span class=\"crayon-v\">describe<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">instances<\/span> <span class=\"crayon-o\">|<\/span> <span class=\"crayon-v\">jq<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">r<\/span> <span class=\"crayon-s\">&quot;.Reservations[].Instances[] | select(.KeyName==\\&quot;${CLUSTER_PREFIX}ks\\&quot; and .State.Name==\\&quot;running\\&quot;) | .PrivateIpAddress&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/div>\n<div id=\"crayon-573dd63038f27780991148-3\" class=\"crayon-line\"><span class=\"crayon-v\">SWARM_OPTIONS<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;--swarm --swarm-discovery=consul:\/\/$KEYSTORE_IP:8500 --engine-opt=cluster-store=consul:\/\/$KEYSTORE_IP:8500 --engine-opt=cluster-advertise=$NET_ETH:2376&quot;<\/span><\/div>\n<div id=\"crayon-573dd63038f27780991148-4\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">MASTER_OPTIONS<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;$DRIVER_OPTIONS $SWARM_OPTIONS --swarm-master --engine-label role=master --amazonec2-instance-type=m4.large&quot;<\/span><\/div>\n<div id=\"crayon-573dd63038f27780991148-5\" class=\"crayon-line\"><span class=\"crayon-v\">MASTER<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-v\">CLUSTER_PREFIX<\/span><span class=\"crayon-sy\">}<\/span><span class=\"crayon-e\">n0<\/span><\/div>\n<div id=\"crayon-573dd63038f27780991148-6\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-i\">create<\/span> <span class=\"crayon-v\">$MASTER_OPTIONS<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-v\">amazonec2<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">instance<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-r\">type<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-r\">m4<\/span><span class=\"crayon-e\">.large<\/span> <span class=\"crayon-v\">$MASTER<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"7ef3\" class=\"graf--p graf-after--pre\">There are a few interesting things going on here. First, we used some shell-fu to find the IP address of our Consul machine inside the Amazon network. Then we fed that to the swarm-discovery and cluster-store options so Docker can keep track of the nodes in our cluster and the network layout of the containers running on each of them. With the configs in place, we proceeded to create an <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/blog.cloudability.com\/aws-m4-performance-and-cost-analysis\/\" rel=\"nofollow\" data-href=\"https:\/\/blog.cloudability.com\/aws-m4-performance-and-cost-analysis\/\">m4.large<\/a> machine, and labeled it as our master. We now have a fully functional one-machine cluster, and can run jobs on it. 
Just point to the Docker Swarm manager and treat it as a regular Docker daemon.<\/p>\n<div id=\"crayon-573dd63038f2b833529163\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2b833529163-1\">1<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f2b833529163-1\" class=\"crayon-line\"><span class=\"crayon-i\">docker<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-v\">config<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-i\">swarm<\/span> <span class=\"crayon-v\">$MASTER<\/span><span class=\"crayon-sy\">)<\/span> <span class=\"crayon-e\">run <\/span><span class=\"crayon-v\">hello<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">world<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>To install Spark on our cluster, we will use <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/compose\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/compose\/\">Docker Compose<\/a>, another tool from the Docker family. With Compose we can describe how to install and configure a set of containers. 
Starting from scratch is easy, but we will take a shortcut by using an existing image, <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/gettyimages\/docker-spark\" rel=\"nofollow\" data-href=\"https:\/\/github.com\/gettyimages\/docker-spark\">gettyimages\/spark<\/a>, and only focus on the configuration part. Here is the result, which you should save in a <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/docker\/compose\/blob\/1.6.0-rc1\/docs\/compose-file.md\" rel=\"nofollow\" data-href=\"https:\/\/github.com\/docker\/compose\/blob\/1.6.0-rc1\/docs\/compose-file.md\">docker-compose.yml<\/a> file in the local directory.<\/p>\n<div id=\"crayon-573dd63038f2f462731533\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-1\">1<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-2\">2<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-3\">3<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-4\">4<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-5\">5<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-6\">6<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-7\">7<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-8\">8<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-9\">9<\/div>\n<div class=\"crayon-num crayon-striped-num\" 
data-line=\"crayon-573dd63038f2f462731533-10\">10<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-11\">11<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-12\">12<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-13\">13<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-14\">14<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-15\">15<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-16\">16<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-17\">17<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-18\">18<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-19\">19<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-20\">20<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-21\">21<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-22\">22<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-23\">23<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-24\">24<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-25\">25<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-26\">26<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f2f462731533-27\">27<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f2f462731533-28\">28<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f2f462731533-1\" class=\"crayon-line\"><span class=\"crayon-s \">version<\/span><span class=\"\">: 2<\/span><\/div>\n<div 
id=\"crayon-573dd63038f2f462731533-2\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-s \">services<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-3\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;<\/span><span class=\"crayon-s \">master<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-4\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">container_name<\/span><span class=\"\">: master<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-5\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">image<\/span><span class=\"\">: gettyimages\/spark<\/span><span class=\"\">:1.6.0-hadoop-2.6<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-6\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">command<\/span><span class=\"\">: \/usr\/spark\/bin\/spark-class org.apache.spark.deploy.master.Master -h master<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-7\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">hostname<\/span><span class=\"\">: master<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-8\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">environment<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-9\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- constraint<\/span><span class=\"\">:role==master<\/span><\/div>\n<div 
id=\"crayon-573dd63038f2f462731533-10\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">ports<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-11\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- 4040<\/span><span class=\"\">:4040<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-12\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- 6066<\/span><span class=\"\">:6066<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-13\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- 7077<\/span><span class=\"\">:7077<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-14\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- 8080<\/span><span class=\"\">:8080<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-15\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">expose<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-16\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>- <span class=\"crayon-i \">&quot;8081-8095&quot;<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-17\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;<\/span><span class=\"crayon-s \">worker<\/span><span class=\"\">:<\/span><\/div>\n<div 
id=\"crayon-573dd63038f2f462731533-18\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">image<\/span><span class=\"\">: gettyimages\/spark<\/span><span class=\"\">:1.6.0-hadoop-2.6<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-19\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">command<\/span><span class=\"\">: \/usr\/spark\/bin\/spark-class org.apache.spark.deploy.worker.Worker spark<\/span><span class=\"\">:\/\/master<\/span><span class=\"\">:7077<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-20\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">environment<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-21\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- constraint<\/span><span class=\"\">:role!=master<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-22\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">ports<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-23\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">- 8081<\/span><span class=\"\">:8081<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-24\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">expose<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-25\" class=\"crayon-line\"><span 
class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<\/span>- <span class=\"crayon-i \">&quot;8081-8095&quot;<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-26\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-s \">networks<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-27\" class=\"crayon-line\"><span class=\"crayon-h\">&nbsp;&nbsp;<\/span><span class=\"crayon-s \">default<\/span><span class=\"\">:<\/span><\/div>\n<div id=\"crayon-573dd63038f2f462731533-28\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-h\">&nbsp;&nbsp;&nbsp;&nbsp;<\/span><span class=\"crayon-s \">driver<\/span><span class=\"\">: overlay<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"7ef4\" class=\"graf--p graf-after--pre\">There are a lot of knobs in Spark, and they can all be controlled through that file. You can even <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/spark.apache.org\/docs\/latest\/building-spark.html\" rel=\"nofollow\" data-href=\"http:\/\/spark.apache.org\/docs\/latest\/building-spark.html\">customize the Spark distribution itself<\/a> using a <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/engine\/reference\/builder\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/engine\/reference\/builder\/\">Dockerfile<\/a> and custom base images, as we do at WorldSense to get Scala 2.11 and a lot of heavy libraries&sup2;. 
In this example, we are doing the bare minimum, which is just opening the operational ports to the world, plus the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/spark.apache.org\/docs\/latest\/security.html#configuring-ports-for-network-security\" rel=\"nofollow\" data-href=\"https:\/\/spark.apache.org\/docs\/latest\/security.html#configuring-ports-for-network-security\">Spark internal ports<\/a> to the rest of the cluster (the expose directive).<\/p>\n<p id=\"f9b9\" class=\"graf--p graf-after--p\">Also note the parts of the config referring to the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/engine\/userguide\/networking\/get-started-overlay\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/engine\/userguide\/networking\/get-started-overlay\/\">overlay network<\/a>. The default network is where all services defined in the config file will run, which means they can communicate with each other using the container name as the target hostname. The <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/swarm\/scheduler\/strategy\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/swarm\/scheduler\/strategy\/\">Swarm scheduler<\/a> will decide for us which machine each container goes on, respecting the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/docs.docker.com\/swarm\/scheduler\/filter\/\" rel=\"nofollow\" data-href=\"https:\/\/docs.docker.com\/swarm\/scheduler\/filter\/\">constraints<\/a> we have put in place. In our config file, we have one that pins the master service to the master machine (which is not very powerful) and another that keeps the workers off that machine. 
Let us try bringing up the master:<\/p>\n<div id=\"crayon-573dd63038f34512407850\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f34512407850-1\">1<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f34512407850-2\">2<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f34512407850-3\">3<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f34512407850-1\" class=\"crayon-line\"><span class=\"crayon-i\">eval<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-r\">env<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-i\">swarm<\/span> <span class=\"crayon-v\">$MASTER<\/span><span class=\"crayon-sy\">)<\/span><\/div>\n<div id=\"crayon-573dd63038f34512407850-2\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">compose <\/span><span class=\"crayon-v\">up<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">d<\/span> <span class=\"crayon-e\">master<\/span><\/div>\n<div id=\"crayon-573dd63038f34512407850-3\" class=\"crayon-line\"><span class=\"crayon-e\">lynx <\/span><span class=\"crayon-v\">http<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e\">aws 
<\/span><span class=\"crayon-e\">ec2 <\/span><span class=\"crayon-v\">describe<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">instances<\/span> <span class=\"crayon-o\">|<\/span> <span class=\"crayon-v\">jq<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">r<\/span> <span class=\"crayon-s\">&quot;.Reservations[].Instances[] | select(.KeyName==\\&quot;$MASTER\\&quot; and .State.Name==\\&quot;running\\&quot;) | .PublicDnsName&quot;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-cn\">8080<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"0ba4\" class=\"graf--p graf-after--pre\">So far we have bootstrapped our architecture with Consul, defined our cluster with Docker Swarm and delineated our Spark installation with Docker Compose. The last remaining step is to add the bulk of the machines, which will do the heavy work.<\/p>\n<p id=\"fe7f\" class=\"graf--p graf-after--p\">The worker machines should be more powerful, and you don&#8217;t have to care too much about the stability of the individual instances. These properties make workers a perfect candidate for Amazon EC2 <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-spot-instances.html\" rel=\"nofollow\" data-href=\"http:\/\/docs.aws.amazon.com\/AWSEC2\/latest\/UserGuide\/using-spot-instances.html\">spot instances<\/a>. They often cost less than one fourth of the price of a reserved machine, a bargain you can&#8217;t get elsewhere. 
Let us bring a few of them up, using docker-machine&sup3; and the very helpful <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/www.gnu.org\/software\/parallel\/\" rel=\"nofollow\" data-href=\"http:\/\/www.gnu.org\/software\/parallel\/\">GNU Parallel<\/a>&#8308; script.<\/p>\n<div id=\"crayon-573dd63038f38575704469\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f38575704469-1\">1<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f38575704469-2\">2<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f38575704469-3\">3<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f38575704469-1\" class=\"crayon-line\"><span class=\"crayon-v\">WORKER_OPTIONS<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;$DRIVER_OPTIONS $SWARM_OPTIONS --amazonec2-request-spot-instance --amazonec2-spot-price=0.074 --amazonec2-instance-type=m4.2xlarge&quot;<\/span><\/div>\n<div id=\"crayon-573dd63038f38575704469-2\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">CLUSTER_NUM_NODES<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">11<\/span><\/div>\n<div id=\"crayon-573dd63038f38575704469-3\" class=\"crayon-line\"><span class=\"crayon-v\">parallel<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">j0<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-v\">no<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">run<\/span><span 
class=\"crayon-o\">-<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">empty<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-v\">line<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">buffer <\/span><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-i\">create<\/span> <span class=\"crayon-v\">$WORKER_OPTIONS<\/span> <span class=\"crayon-o\">&lt;<\/span> <span class=\"crayon-o\">&lt;<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-st\">for<\/span> <span class=\"crayon-i\">n<\/span> <span class=\"crayon-st\">in<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">seq<\/span> <span class=\"crayon-cn\">1<\/span> <span class=\"crayon-v\">$CLUSTER_NUM_NODES<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">;<\/span> <span class=\"crayon-st\">do<\/span> <span class=\"crayon-r\">echo<\/span> <span class=\"crayon-s\">&quot;${CLUSTER_PREFIX}n$n&quot;<\/span><span class=\"crayon-sy\">;<\/span> <span class=\"crayon-st\">done<\/span><span class=\"crayon-sy\">)<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"7ef5\" class=\"graf--p graf-after--pre\">You now have over 300 cores available in your cluster, for less than a dollar an hour. Last month at WorldSense we used a similar cluster to process over 2 billion web pages from the <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/commoncrawl.org\/\" rel=\"nofollow\" data-href=\"https:\/\/commoncrawl.org\/\">Common Crawl repository<\/a> over a few days. 
For now, let us bring up everything and compute the value of pi:<\/p>\n<div id=\"crayon-573dd63038f3c568077956\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f3c568077956-1\">1<\/div>\n<div class=\"crayon-num crayon-striped-num\" data-line=\"crayon-573dd63038f3c568077956-2\">2<\/div>\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f3c568077956-3\">3<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f3c568077956-1\" class=\"crayon-line\"><span class=\"crayon-i\">eval<\/span> <span class=\"crayon-sy\">$<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-r\">env<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-i\">swarm<\/span> <span class=\"crayon-v\">$MASTER<\/span><span class=\"crayon-sy\">)<\/span><\/div>\n<div id=\"crayon-573dd63038f3c568077956-2\" class=\"crayon-line crayon-striped-line\"><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">compose <\/span><span class=\"crayon-e\">scale <\/span><span class=\"crayon-v\">master<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1<\/span> <span class=\"crayon-v\">worker<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">10<\/span><\/div>\n<div id=\"crayon-573dd63038f3c568077956-3\" class=\"crayon-line\"><span class=\"crayon-e\">docker <\/span><span class=\"crayon-v\">run<\/span> <span class=\"crayon-o\">--<\/span><span 
class=\"crayon-v\">net<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">container<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-v\">master<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-e\">entrypoint <\/span><span class=\"crayon-v\">spark<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">submit <\/span><span class=\"crayon-v\">gettyimages<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">spark<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-cn\">1.6.0<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">hadoop<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-cn\">2.6<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-e\">master <\/span><span class=\"crayon-v\">spark<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">master<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-cn\">7077<\/span> <span class=\"crayon-o\">--<\/span><span class=\"crayon-t\">class<\/span> <span class=\"crayon-v\">org<\/span><span class=\"crayon-e\">.apache<\/span><span class=\"crayon-e\">.spark<\/span><span class=\"crayon-e\">.examples<\/span><span class=\"crayon-e\">.SparkPi<\/span> <span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">usr<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">spark<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">lib<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">spark<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">examples<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-cn\">1.6.0<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">hadoop2<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-cn\">6.0.jar<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p id=\"7ef1\" class=\"graf--p graf-after--pre\">In a more realistic scenario one would use something like <a class=\"markup--anchor markup--p-anchor\" href=\"http:\/\/www.tecmint.com\/rsync-local-remote-file-synchronization-commands\/\" rel=\"nofollow\" data-href=\"http:\/\/www.tecmint.com\/rsync-local-remote-file-synchronization-commands\/\">rsync<\/a> to push locally developed jars to the master machine, and then use Docker volume support to expose those to the driver. That is how we do it at WorldSense&#8309;.<\/p>\n<p id=\"f2c6\" class=\"graf--p graf-after--p\">I think this is a powerful setup, with the great advantage that it is also easy to debug and replicate locally. I can simply tweak the flags&#8310; in these scripts to get virtually the same environment on my laptop. This flexibility has been helpful countless times.<\/p>\n<p id=\"edbc\" class=\"graf--p graf-after--p\">Many companies offer <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/databricks.com\/product\/databricks\" rel=\"nofollow\" data-href=\"https:\/\/databricks.com\/product\/databricks\">hosted solutions for running code in Spark<\/a>, and I highly recommend giving them a try. In our case, we had both budget restrictions and flexibility requirements that forced us into a custom deployment. 
It hasn&#8217;t come without its costs, but we are sure having some fun.<\/p>\n<p id=\"a4f9\" class=\"graf--p graf-after--p\">Ah, speaking of costs, do not forget to bring your cluster down!<\/p>\n<div id=\"crayon-573dd63038f40529730790\" class=\"crayon-syntax crayon-theme-xcode crayon-font-monaco crayon-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover\">\n<div class=\"crayon-plain-wrap\"><\/div>\n<div class=\"crayon-main\">\n<table class=\"crayon-table\">\n<tbody>\n<tr class=\"crayon-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"crayon-nums-content\">\n<div class=\"crayon-num\" data-line=\"crayon-573dd63038f40529730790-1\">1<\/div>\n<\/div>\n<\/td>\n<td class=\"crayon-code\">\n<div class=\"crayon-pre\">\n<div id=\"crayon-573dd63038f40529730790-1\" class=\"crayon-line\"><span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-r\">ls<\/span> <span class=\"crayon-o\">|<\/span> <span class=\"crayon-r\">grep<\/span> <span class=\"crayon-s\">&quot;^${CLUSTER_PREFIX}&quot;<\/span> <span class=\"crayon-o\">|<\/span> <span class=\"crayon-r\">cut<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">d<\/span><span class=\"crayon-s\">' '<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">f1<\/span> <span class=\"crayon-o\">|<\/span> <span class=\"crayon-r\">xargs<\/span> <span class=\"crayon-v\">docker<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">machine <\/span><span class=\"crayon-r\">rm<\/span> <span class=\"crayon-o\">-<\/span><span class=\"crayon-v\">y<\/span><\/div>\n<\/div>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<hr \/>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>Footnotes<\/p>\n<ol class=\"postList\">\n<li id=\"b7c0\" class=\"graf--li graf-after--p\">The need for serialized creation of the cluster-store <a 
class=\"markup--anchor markup--li-anchor\" href=\"https:\/\/github.com\/docker\/machine\/issues\/2303\" rel=\"nofollow\" data-href=\"https:\/\/github.com\/docker\/machine\/issues\/2303\">should improve<\/a> at some point.<\/li>\n<li id=\"a8ff\" class=\"graf--li graf-after--li\">Spark runs jobs in its workers&#8217; JVMs, and sometimes it is really hard to avoid <a class=\"markup--anchor markup--li-anchor\" href=\"https:\/\/dzone.com\/articles\/what-is-jar-hell\" rel=\"nofollow\" data-href=\"https:\/\/dzone.com\/articles\/what-is-jar-hell\">jar-hell<\/a> when your code depends on one version of a library and the Spark workers already ship a different one. In some cases, the only solution is to modify the pom.xml that generates the workers&#8217; jar itself, and we have done that to fix incompatibilities with logback, dropwizard, and jackson, among others. If you find yourself in the same position, don&#8217;t be afraid to try that. It works.<\/li>\n<li id=\"5113\" class=\"graf--li graf-after--li\">Machine allocation with docker-machine is very simple, but not super reliable. I often have some workers that do not install correctly, and I simply kill them in a shell loop that checks whether docker-machine env succeeds.<\/li>\n<li id=\"17f1\" class=\"graf--li graf-after--li\">GNU Parallel requires a citation, and I have to say that I do it happily. Before the advent of Docker Swarm, most of the setup we used was powered by GNU Parallel alone&nbsp;:-).<br \/>\nO. 
Tange (2011): GNU Parallel - The Command-Line Power Tool,<br \/>\n;login: The USENIX Magazine, February 2011:42-47.<\/li>\n<li id=\"b302\" class=\"graf--li graf-after--li\">By splitting our jars into rarely changed dependencies and our own code, most of the time running fresh code in the cluster is just a matter of uploading a couple of megabytes.<\/li>\n<li id=\"c9b9\" class=\"graf--li graf-after--li graf--last\">On my laptop, I need the following changes: DRIVER_OPTIONS=--driver virtualbox, NET_ETH=eth1 and KEYSTORE_IP=$(docker-machine ip keystore).<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At WorldSense we build predictors for the best links you could add in your content by creating large language models from the World Wide Web. In the open source world, no tool is better suited for that kind of mass (hyper)text analysis than Apache Spark, and I wanted to share how we set it up and run it on the cloud, so you can give it a 
try.<\/p>\n<p>http:\/\/www.worldsense.com\/a-swarm-of-sparks\/<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-2217","post","type-post","status-publish","format-standard","hentry","category-great-stuff"],"_links":{"self":[{"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/posts\/2217","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/comments?post=2217"}],"version-history":[{"count":2,"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/posts\/2217\/revisions"}],"predecessor-version":[{"id":2220,"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/posts\/2217\/revisions\/2220"}],"wp:attachment":[{"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/media?parent=2217"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/categories?post=2217"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/roberval.com.br\/roberval\/wp-json\/wp\/v2\/tags?post=2217"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}