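Background

I wanted to learn Kubernetes step by step, from building a cluster through operating one, so I tried setting up an environment on EC2. I got stuck quite badly along the way, so I'm sharing what happened. Few people probably build a cluster by hand on EC2 these days, but it makes good practice.

Environment

Environment where k8s is built
- Three AWS EC2 instances running Ubuntu 18.04
  - Their IPs are assumed to be x.x.x.1, x.x.x.2, and x.x.x.3
- The private key is assumed to be ~/.ssh/private.pem on the local machine
Local machine that runs the ansible commands
- macOS Mojave
- I also tried WSL on Windows, but that does not work yet (I'm even more stuck there than on the Mac, so I'll cover it in a separate article once it is solved)

Steps

The trial and error from here is fairly long, so I'll split the explanation into two parts: getting ansible-playbook to run, and getting it to run to completion.

Until ansible-playbook runs

First, I simply followed the Getting Started steps without thinking too hard.

Getting Started begins by installing the modules listed in requirements.txt. ansible is among them, so to be safe I first removed the ansible I had installed with Homebrew.

Then I ran the following.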


$ sudo pip install -r requirements.txt
$ cp -rfp inventory/sample inventory/mycluster
$ declare -a IPS=(x.x.x.1 x.x.x.2 x.x.x.3)
$ CONFIG_FILE=inventory/mycluster/hosts.ini python3 contrib/inventory_builder/inventory.py ${IPS[@]}
Traceback (most recent call last):
  File "contrib/inventory_builder/inventory.py", line 36, in <module>
    from ruamel.yaml import YAML
ModuleNotFoundError: No module named 'ruamel'

It complains that ruamel is missing. It is odd that requirements.txt does not list it, but let's install it by hand.

$ sudo pip install ruamel.yaml
$ CONFIG_FILE=inventory/mycluster/hosts.ini python3 contrib/inventory_builder/inventory.py ${IPS[@]}
Traceback (most recent call last):
  File "contrib/inventory_builder/inventory.py", line 36, in <module>
    from ruamel.yaml import YAML
ModuleNotFoundError: No module named 'ruamel'

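The same error again. Looking closely, the command runs python3, while the modules from requirements.txt were installed with plain pip (which on this machine means Python 2.7). I don't know why Getting Started is written this way, but let's install with pip3.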
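Before reinstalling, a quick check like the following can confirm this kind of mismatch. It is only a sketch: has_module is a helper name made up for this example, and it simply asks a given interpreter whether it can import a given module.

```shell
# has_module is a hypothetical helper, not part of kubespray.
# It succeeds if interpreter $1 can import module $2.
has_module() {
  "$1" -c "import $2" 2>/dev/null
}

# inventory.py is run with python3, so that is the interpreter to ask.
if has_module python3 ruamel.yaml; then
  echo "ruamel.yaml is importable from python3"
else
  echo "ruamel.yaml is missing for python3; try: python3 -m pip install ruamel.yaml"
fi
```

Installing with python3 -m pip instead of a bare pip ties the install to the interpreter that will actually run the script.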

$ sudo pip3 install -r requirements.txt
$ sudo pip3 install ruamel.yaml
$ CONFIG_FILE=inventory/mycluster/hosts.ini python3 contrib/inventory_builder/inventory.py ${IPS[@]}
Traceback (most recent call last):
  File "contrib/inventory_builder/inventory.py", line 391, in <module>
    sys.exit(main())
  File "contrib/inventory_builder/inventory.py", line 388, in main
    KubesprayInventory(argv, CONFIG_FILE)
  File "contrib/inventory_builder/inventory.py", line 77, in __init__
    self.yaml_config = yaml.load(self.hosts_file)
  File "/usr/local/lib/python3.7/site-packages/ruamel/yaml/main.py", line 331, in load
    return constructor.get_single_data()
  File "/usr/local/lib/python3.7/site-packages/ruamel/yaml/constructor.py", line 109, in get_single_data
    node = self.composer.get_single_node()
  File "/usr/local/lib/python3.7/site-packages/ruamel/yaml/composer.py", line 87, in get_single_node
    event.start_mark,
ruamel.yaml.composer.ComposerError: expected a single document in the stream
  in "inventory/mycluster/hosts.ini", line 4, column 1
but found another document
  in "inventory/mycluster/hosts.ini", line 15, column 1

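This time the last command failed.

This command appears to write the IPs of the VMs where k8s will be built into the hosts file, but it reports that hosts.ini cannot be parsed. What stands out is ruamel.yaml.composer.ComposerError. Judging from the message, the script expects YAML, so pointing it at hosts.ini is probably the cause.

So let's change hosts.ini to hosts.yml.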

$ CONFIG_FILE=inventory/mycluster/hosts.yml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
DEBUG: Adding group all
DEBUG: Adding group kube-master
DEBUG: Adding group kube-node
DEBUG: Adding group etcd
DEBUG: Adding group k8s-cluster
DEBUG: Adding group calico-rr
DEBUG: adding host node1 to group all
DEBUG: adding host node2 to group all
DEBUG: adding host node3 to group all
DEBUG: adding host node1 to group etcd
DEBUG: adding host node2 to group etcd
DEBUG: adding host node3 to group etcd
DEBUG: adding host node1 to group kube-master
DEBUG: adding host node2 to group kube-master
DEBUG: adding host node1 to group kube-node
DEBUG: adding host node2 to group kube-node
DEBUG: adding host node3 to group kube-node

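That worked better than expected.

All that remains is the last command, ansible-playbook.

Getting ansible-playbook to run to completion

There are two things to watch before running ansible-playbook. First, Getting Started appears to assume password authentication, but EC2 uses private-key authentication, so extra options are needed. Specify the EC2 Ubuntu user name with -u (ubuntu by default) and the path to the private key with --private-key.

Second, because the hosts file was generated as YAML earlier, the inventory passed to the command must be hosts.yml, not hosts.ini.

With that in mind, let's run the command.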

$ ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml -u ubuntu --private-key=~/.ssh/private.pem

(long output omitted)

TASK [kubernetes/preinstall : Stop if ip var does not match local ips] *************************************************************************************************************
Sunday 07 April 2019  19:18:33 +0900 (0:00:00.110)       0:00:57.115 **********
fatal: [node1]: FAILED! => {
    "assertion": "ip in ansible_all_ipv4_addresses",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}
fatal: [node2]: FAILED! => {
    "assertion": "ip in ansible_all_ipv4_addresses",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}
fatal: [node3]: FAILED! => {
    "assertion": "ip in ansible_all_ipv4_addresses",
    "changed": false,
    "evaluated_to": false,
    "msg": "Assertion failed"
}

It stopped partway with an error.

Searching for ip in ansible_all_ipv4_addresses turned up a hint in the following Stack Overflow thread.

The check is if ip is actually a local ip address. It's not a bug. You can't tell etcd to bind to the floating IP address. You should set access_ip instead to specify the floating IP

In other words, the check verifies that ip is an address actually assigned to the node itself, and the failure is not a bug. etcd cannot bind to a floating IP address, so the floating IP belongs in access_ip instead.

github.com

A floating IP address is the address OpenStack assigns to an instance for access from outside. On EC2 it corresponds to the public IP.
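The failing assertion, ip in ansible_all_ipv4_addresses, amounts to asking whether the value of ip is one of the addresses the node itself holds. A rough way to reproduce that check by hand on a node is sketched below. ip_in_list is a helper name invented for this example, and it assumes a Linux host where hostname -I prints the addresses assigned to the interfaces.

```shell
# ip_in_list is a hypothetical helper, not part of kubespray or Ansible.
# It succeeds if address $1 appears in the whitespace-separated list $2.
ip_in_list() {
  for addr in $2; do
    [ "$addr" = "$1" ] && return 0
  done
  return 1
}

# Placeholder: the value written as ip in hosts.yml for this node.
candidate="x.x.x.1"

# An EC2 public IP is not assigned to any interface on the instance,
# so it is expected to fail this check, just like the playbook assertion.
if ip_in_list "$candidate" "$(hostname -I 2>/dev/null)"; then
  echo "$candidate is assigned to a local interface"
else
  echo "$candidate is not a local address; ip should be the private IP"
fi
```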

Let's check the current hosts.yml.

$ cat inventory/mycluster/hosts.yml
all:
  hosts:
    node1:
      ansible_host: x.x.x.1
      ip: x.x.x.1
      access_ip: x.x.x.1
    node2:
      ansible_host: x.x.x.2
      ip: x.x.x.2
      access_ip: x.x.x.2
    node3:
      ansible_host: x.x.x.3
      ip: x.x.x.3
      access_ip: x.x.x.3
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}

As described further down in that thread, I renamed the ansible_host key to ansible_ssh_host and set the value of ip to the local (private) IP.
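To make the change concrete, here is a sketch of the same edit applied to a scratch copy containing only node1. The address y.y.y.1 is a placeholder I introduce for node1's private IP; it does not appear in the real inventory. I made the actual edit by hand, since each node needs its own private IP.

```shell
# Work on a scratch copy so the sketch is safe to run anywhere.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
all:
  hosts:
    node1:
      ansible_host: x.x.x.1
      ip: x.x.x.1
      access_ip: x.x.x.1
EOF

# 1st expression: rename the ansible_host key to ansible_ssh_host.
# 2nd expression: replace the value of ip (but not access_ip) with the
# placeholder private IP y.y.y.1.
sed -e 's/ansible_host:/ansible_ssh_host:/' \
    -e 's/^\( *ip:\) .*/\1 y.y.y.1/' "$tmp" > "$tmp.new"

cat "$tmp.new"
```

At this stage access_ip still holds the public IP; only the connection key and ip have changed.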

Now let's run the same ansible-playbook command again.

$ ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml -u ubuntu --private-key=~/.ssh/private.pem

(long output omitted)

TASK [kubernetes/preinstall : Stop if access_ip is not pingable] *******************************************************************************************************************
Sunday 07 April 2019  21:05:52 +0900 (0:00:00.111)       0:00:11.392 **********
fatal: [node1]: FAILED! => {"changed": true, "cmd": ["ping", "-c1", "x.x.x.1"], "delta": "0:00:10.002630", "end": "2019-04-07 12:06:03.175649", "msg": "non-zero return code", "rc": 1, "start": "2019-04-07 12:05:53.173019", "stderr": "", "stderr_lines": [], "stdout": "PING x.x.x.1 (x.x.x.1) 56(84) bytes of data.\n\n--- x.x.x.1 ping statistics ---\n1 packets transmitted, 0 received, 100% packet loss, time 0ms", "stdout_lines": ["PING x.x.x.1 (x.x.x.1) 56(84) bytes of data.", "", "--- x.x.x.1 ping statistics ---", "1 packets transmitted, 0 received, 100% packet loss, time 0ms"]}
fatal: [node2]: FAILED! => {"changed": true, "cmd": ["ping", "-c1", "x.x.x.2"], "delta": "0:00:10.002654", "end": "2019-04-07 12:06:03.184854", "msg": "non-zero return code", "rc": 1, "start": "2019-04-07 12:05:53.182200", "stderr": "", "stderr_lines": [], "stdout": "PING x.x.x.2 (x.x.x.2) 56(84) bytes of data.\n\n--- x.x.x.2 ping statistics ---\n1 packets transmitted, 0 received, 100% packet loss, time 0ms", "stdout_lines": ["PING x.x.x.2 (x.x.x.2) 56(84) bytes of data.", "", "--- x.x.x.2 ping statistics ---", "1 packets transmitted, 0 received, 100% packet loss, time 0ms"]}
fatal: [node3]: FAILED! => {"changed": true, "cmd": ["ping", "-c1", "x.x.x.3"], "delta": "0:00:10.003199", "end": "2019-04-07 12:06:03.188290", "msg": "non-zero return code", "rc": 1, "start": "2019-04-07 12:05:53.185091", "stderr": "", "stderr_lines": [], "stdout": "PING x.x.x.3 (x.x.x.3) 56(84) bytes of data.\n\n--- x.x.x.3 ping statistics ---\n1 packets transmitted, 0 received, 100% packet loss, time 0ms", "stdout_lines": ["PING x.x.x.3 (x.x.x.3) 56(84) bytes of data.", "", "--- x.x.x.3 ping statistics ---", "1 packets transmitted, 0 received, 100% packet loss, time 0ms"]}

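That got past the previous error, but now there is an error saying that ping does not get through.

What stands out is the phrase access_ip is not pingable. It seems that access_ip is the address used for communication between nodes. So I went back to hosts.yml and set access_ip to the local IP as well.

It still stopped at the same place.

So, does ping really fail? To check, I logged in to one instance over SSH and pinged another instance by its private IP. Sure enough, there was no reply.

The cause turned out to be anticlimactic: the security group was not configured correctly. The inbound rules allowed only the IP of my home network, where I was testing, so even ping between the nodes was blocked. I had wrongly assumed that traffic over private IPs would get through without any security group rule.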
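One common way to open traffic between nodes is a rule whose source is the security group itself, so that instances in the group can reach each other. The sketch below shows what that might look like with the AWS CLI. It assumes all three instances share one security group, the group ID is a made-up placeholder, and allow_intra_group is a helper name invented here. With DRY_RUN set, it only prints the command instead of calling AWS.

```shell
# Placeholder ID: replace with the security group shared by the instances.
SG_ID="sg-0123456789abcdef0"

# allow_intra_group is a hypothetical helper for this example.
# It adds an inbound rule allowing all protocols (IpProtocol=-1)
# from members of the same security group.
allow_intra_group() {
  sg="$1"
  # When DRY_RUN is non-empty, "echo" is prepended and nothing is changed.
  ${DRY_RUN:+echo} aws ec2 authorize-security-group-ingress \
    --group-id "$sg" \
    --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$sg}]"
}

# Print the command for review; drop DRY_RUN=1 to apply it for real.
DRY_RUN=1 allow_intra_group "$SG_ID"
```

Allowing all protocols inside the group is broad. It suits a practice cluster, but a tighter rule set would be better anywhere else.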

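I fixed the security group settings and ran ansible-playbook once more.

This time it looked much more successful, and the tasks kept moving forward.

Partway through, though, the following log appeared. It says ignoring, so I left it alone for now.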

TASK [etcd : Configure | Check if etcd cluster is healthy] *************************************************************************************************************************
Sunday 07 April 2019  23:44:50 +0900 (0:00:00.131)       0:04:42.813 **********
fatal: [node2]: FAILED! => {"changed": false, "cmd": "/usr/local/bin/etcdctl --endpoints=https://(local IP 1):2379,https://(local IP 2):2379,https://(local IP 3):2379 cluster-health | grep -q 'cluster is healthy'", "delta": "0:00:00.012281", "end": "2019-04-07 14:44:50.868068", "msg": "non-zero return code", "rc": 1, "start": "2019-04-07 14:44:50.855787", "stderr": "Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp (local IP 3):2379: getsockopt: connection refused\n; error #1: dial tcp (local IP 2):2379: getsockopt: connection refused\n; error #2: dial tcp (local IP 1):2379: getsockopt: connection refused\n\nerror #0: dial tcp (local IP 3):2379: getsockopt: connection refused\nerror #1: dial tcp (local IP 2):2379: getsockopt: connection refused\nerror #2: dial tcp (local IP 1):2379: getsockopt: connection refused", "stderr_lines": ["Error:  client: etcd cluster is unavailable or misconfigured; error #0: dial tcp (local IP 3):2379: getsockopt: connection refused", "; error #1: dial tcp (local IP 2):2379: getsockopt: connection refused", "; error #2: dial tcp (local IP 1):2379: getsockopt: connection refused", "", "error #0: dial tcp (local IP 3):2379: getsockopt: connection refused", "error #1: dial tcp (local IP 2):2379: getsockopt: connection refused", "error #2: dial tcp (local IP 1):2379: getsockopt: connection refused"], "stdout": "", "stdout_lines": []}
...ignoring

In the end it finished like this, so it appears to have succeeded.

PLAY RECAP *************************************************************************************************************************************************************************
localhost                  : ok=1    changed=0    unreachable=0    failed=0
node1                      : ok=396  changed=117  unreachable=0    failed=0
node2                      : ok=333  changed=101  unreachable=0    failed=0
node3                      : ok=298  changed=88   unreachable=0    failed=0

As a test, let's check the state of the k8s environment with kubectl.

$ ssh -i ~/.ssh/private.pem ubuntu@x.x.x.1
$ sudo su
# kubectl get nodes
NAME    STATUS   ROLES         AGE   VERSION
node1   Ready    master,node   23m   v1.13.5
node2   Ready    master,node   22m   v1.13.5
node3   Ready    node          22m   v1.13.5

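It seems to be working.

Summary

Several things in Kubespray's README felt off to me, but I can't tell whether they are mistakes or whether my environment is unusual, so I'm unsure whether to send a pull request. Do other people get through this without any trouble?

Gaps in my own knowledge made this a struggle too, but I managed to get my own k8s environment set up over the weekend. I'm looking forward to enjoying life with k8s.

I'll add the WSL results later.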

nkty blog

Building a k8s environment on AWS EC2 with kubespray