We're running into an interesting conundrum that I'd appreciate some help troubleshooting. We have a service made up of several process types, and to distribute load we can start up n instances of most of them. For example, if we expect 200,000 connections and know that one instance of a given process type can handle around 5,000 connections before pegging at 100% CPU, we know we need at least 40 instances of that type running to handle the load.
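For concreteness, the sizing math is nothing fancier than this (a trivial sketch using the numbers from the example above):

import math

expected_connections = 200_000   # peak load we plan for
conns_per_process = 5_000        # one instance pegs ~100% CPU around here

# minimum number of instances of this process type needed to absorb the load
min_processes = math.ceil(expected_connections / conns_per_process)
print(min_processes)             # -> 40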
Recently, we've started consolidating our services to make better use of our hardware. During load testing, though, we've seen that changing nothing other than the number of instances of a given process type on a single box doubles the CPU% of each of those processes.
Here's a screenshot of the process CPU%:
Here's a screenshot of the host CPU%:
The earlier test had about 12 instances of this process on the box; the later test doubled that count. This would make sense if the box simply couldn't handle the load, but from what I can see that doesn't appear to be the case.
top - 14:55:08 up 54 days, 18:30, 1 user, load average: 22.26, 22.39, 22.03
Tasks: 581 total, 1 running, 580 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.8 us, 3.1 sy, 0.0 ni, 62.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
KiB Mem : 26385841+total, 16612808+free, 20537016 used, 77193320 buff/cache
KiB Swap: 4194300 total, 4194300 free, 0 used. 24167782+avail Mem
Load average is within range (this is a 28-core server with 256GB of memory), disk I/O wait (wa) is 0.0, and the %Cpu(s) line is over 60% idle, so only around 10 of the 28 cores' worth of CPU is actually in use. I'm not sure what's causing the increased CPU%. Any ideas on what else to look for? Why does doubling the number of processes also double the amount of CPU time each process needs, if the CPU (according to top) is actually underutilized?
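In case it matters, the next thing I'm planning to capture is per-process CPU% alongside per-core utilization, to see whether a handful of cores are actually saturated even though the aggregate looks idle. A rough psutil sketch of that sampling (the process name 'myservice' is a placeholder for our actual process):

import time
import psutil  # assumption: psutil is installed; 'myservice' is a placeholder

procs = [p for p in psutil.process_iter(['name']) if p.info['name'] == 'myservice']

for p in procs:
    p.cpu_percent(None)              # prime the per-process counters

psutil.cpu_percent(percpu=True)      # prime the per-core counters
time.sleep(10)                       # sample window

per_core = psutil.cpu_percent(percpu=True)
print('per-core CPU%:', per_core)    # are any individual cores pegged near 100?

for p in procs:
    print(p.pid, p.cpu_percent(None))  # % of one core, same scale as top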