Donnerstag, 18. September 2014

CLI Ansible vs ClusterShell: running commands on multiple hosts.

Let us start with CentOS 6.5:
1) enable EPEL:  yum localinstall -y  http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
2) yum install clustershell -y
3) yum install ansible

Let us run the w command on multiple hosts:
clush -w localhost,anotherhost -B -b w
ansible all -i 'localhost,anotherhost,' -c local -m command -a "w"

Then analyze the output...
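Both tools take a host list on the command line; clush additionally understands node ranges like node[1-3]. A tiny pure-shell sketch of what that range syntax expands to (expand_range is a hypothetical helper for illustration, not part of clustershell):

```shell
#!/bin/sh
# expand_range: illustrate the node[START-END] range syntax that
# clush -w understands (hypothetical helper, not part of clustershell).
expand_range() {
    prefix=${1%%\[*}                    # text before the bracket, e.g. "node"
    range=${1##*\[}; range=${range%\]}  # the "START-END" part, e.g. "1-3"
    start=${range%-*}; end=${range#*-}
    i=$start
    while [ "$i" -le "$end" ]; do
        printf '%s%s\n' "$prefix" "$i"
        i=$((i + 1))
    done
}

expand_range 'node[1-3]'   # prints node1, node2, node3, one per line
```

So `clush -w node[1-3] -b w` targets the same three hosts you would otherwise list by hand.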

Mittwoch, 17. September 2014

OpenSM lid and more...

On CentOS 6.5:
yum groupinstall "Infiniband Support"
yum install infiniband-diags

InfiniBand diagnostic tools contain nice "weapons" for eliminating network bugs.
One of them is:
 saquery - query InfiniBand subnet administration attributes



Its output looks like this:

NodeRecord dump:
                lid.....................0x7D
                reserved................0x0
                base_version............0x1
                class_version...........0x1
                node_type...............Switch
                num_ports...............36
                sys_guid................0xf452140300365230
                node_guid...............0xf452140300365230
                port_guid...............0xf452140300365230
                partition_cap...........0x8
                device_id...............0xC738
                revision................0xA2
                port_num................0
                vendor_id...............0x2C9
                NodeDescription.........MF0;switch-cf2826:SX6036/U1
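The dotted-field dump is easy to post-process; here is a small awk sketch (assuming the exact field layout shown above) that reduces each NodeRecord to one "LID  type  description" line, fed from a here-document for illustration:

```shell
# Summarize saquery NodeRecord dumps into "LID TYPE DESCRIPTION" lines.
# A sketch assuming the dotted "name.....value" layout shown above;
# on a live fabric you would pipe "saquery" output into it instead.
summarize_nodes() {
    awk -F'[.]+' '
        $1 ~ /lid/             { lid  = $2 }
        $1 ~ /node_type/       { type = $2 }
        $1 ~ /NodeDescription/ { print lid, type, $2 }
    '
}

summarize_nodes <<'EOF'
NodeRecord dump:
                lid.....................0x7D
                node_type...............Switch
                NodeDescription.........MF0;switch-cf2826:SX6036/U1
EOF
# prints: 0x7D Switch MF0;switch-cf2826:SX6036/U1
```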

Fix infiniband QDR or FDR troubles on fat memory machines.

Fix memory trouble on FAT machines:
The formula to compute the maximum pagepool value when using RDMA is:
2^log_num_mtt x 2^log_mtts_per_seg x PAGE_SIZE > 2 x pagepool

For example: 2^20 x 2^4 x 4 KiB = 64 GiB
Add to /etc/modprobe.d/mlx4_core.conf:
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
Check the changes:
more /sys/module/mlx4_core/parameters/log_num_mtt
more /sys/module/mlx4_core/parameters/log_mtts_per_seg
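The left-hand side of the formula can be computed directly from the parameter values; a sketch, assuming 4 KiB pages (on a live system you would read the two values from the /sys paths above instead of passing them in):

```shell
# Compute the maximum RDMA-registerable memory from the mlx4_core
# MTT parameters: 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE.
# PAGE_SIZE is assumed to be 4 KiB here.
mtt_max_mem() {
    log_num_mtt=${1:-20}
    log_mtts_per_seg=${2:-4}
    page_size=${3:-4096}
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
}

mtt_max_mem 20 4   # prints 68719476736, i.e. 64 GiB
```

Since the formula requires this value to exceed 2 x pagepool, the settings above cover a pagepool of up to 32 GiB.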

CentOS 7 after install:

CentOS 7 is managed differently from CentOS 5.x and 6.x:
it uses systemctl to manage services.

systemctl enable sshd
systemctl list-unit-files
systemctl get-default
systemctl set-default multi-user.target
systemctl disable firstboot-graphical.service
systemctl disable bluetooth.service
systemctl enable network.service
systemctl show
systemd-analyze
systemd-analyze blame
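For old service/chkconfig muscle memory the mapping is mechanical; a tiny hypothetical translator function just to show the correspondence (CentOS 7 actually ships compatibility wrappers that redirect the old commands for you):

```shell
# Translate old "service NAME ACTION" / "chkconfig NAME on|off" habits
# into their systemctl equivalents. Hypothetical helper for illustration.
sysv2systemd() {
    case "$1" in
        service)
            echo "systemctl $3 $2" ;;   # service sshd restart -> systemctl restart sshd
        chkconfig)
            case "$3" in
                on)  echo "systemctl enable $2" ;;
                off) echo "systemctl disable $2" ;;
            esac ;;
    esac
}

sysv2systemd service sshd restart    # prints: systemctl restart sshd
sysv2systemd chkconfig network on    # prints: systemctl enable network
```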

Tune other stuff as you need.

Lustre 2.1.6 server and 2.5.3 client.

Are they compatible?
The answer is yes!
Upgrading the Lustre client using yum on CentOS 6.5:

vim /etc/yum.repos.d/lustre.repo
[hpddLustreserver]
name=CentOS-$releasever - Lustre
baseurl=https://downloads.hpdd.intel.com/public/lustre/latest-maintenance-release/el6/server/
gpgcheck=0
enabled=1
 
[e2fsprogs]
name=CentOS-$releasever - Ldiskfs
baseurl=https://downloads.hpdd.intel.com/public/e2fsprogs/latest/el6/RPMS/
gpgcheck=0
enabled=1

[hpddLustreclient]
name=CentOS-$releasever - Lustre
baseurl=https://downloads.hpdd.intel.com/public/lustre/latest-maintenance-release/el6/client/
gpgcheck=0
enabled=1


On client:
1) yum update lustre-client -y
2) and finally a one-liner to restart the Lustre mount point:

umount /lustre; lustre_rmmod; service lnet stop; service lnet start; mount /lustre
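A quick way to convince yourself the pairing is sane is to compare the two version strings with a version-aware sort; a sketch using the versions from this post (on live nodes you would read them from "lctl get_param version" instead):

```shell
# Check that the Lustre client version is >= the server version,
# using sort -V (GNU coreutils) for version-aware comparison.
# A sketch with the versions from this post.
version_ge() {
    # succeeds if $1 >= $2 in version order
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

server=2.1.6
client=2.5.3
if version_ge "$client" "$server"; then
    echo "client $client is newer than or equal to server $server"
fi
```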

oVirt 3.4.3-1: recover a VM from an "unknown" state.

After an iSCSI (iSER) storage failure and local disk corruption on one host, one of the HA VMs was stuck in "?" = status unknown. Restarting ovirt-engine and the hosts did not help much. The host which owned the VM was gone, but the web portal was still showing unknown status for it, and I was not able to reboot or stop it; all services died along with the bad disk on the host. It looks like the main problem was the missing iSCSI disk storage: it was hanging in a "locked" state. I found a simple solution in 3 steps:
  1. Find the hanging disk ID in the web interface; it looks something like this: 324f9089-0a40-4744-aa33-5c5a108f7f43
  2. On the ovirt-engine server: su - postgres
  3. psql -U postgres engine -c "select fn_db_unlock_disk('324f9089-0a40-4744-aa33-5c5a108f7f43');"
    After these steps, take down the hanging host from the web interface. The HA VM will come up on another healthy node.
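Since a typo in that ID would be fed straight into the engine database, it is worth sanity-checking it first; a sketch where the psql call is the one from step 3 and is_uuid/unlock_disk are hypothetical helpers of mine:

```shell
# Sanity-check an oVirt disk ID before unlocking it in the engine DB.
# is_uuid and unlock_disk are hypothetical helpers; the psql command
# is the one from step 3, guarded so it only runs on a well-formed UUID.
is_uuid() {
    echo "$1" | grep -Eq '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$'
}

unlock_disk() {
    if is_uuid "$1"; then
        psql -U postgres engine -c "select fn_db_unlock_disk('$1');"
    else
        echo "not a valid disk ID: $1" >&2
        return 1
    fi
}
```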