
Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service
Closed, ResolvedPublic

Description

Some testing needs resources that are not easily available on WMCS. Having test servers in the production network limits what can be tested (those servers would be subject to production constraints).

As a real-world use case: we need to test alternatives to Blazegraph as a backend for Wikidata Query Service. We have test servers in the production network (wdqs1009 and wdqs1010). Those servers are fine for testing beta versions of the current Blazegraph-based WDQS, but are not suited to testing arbitrary alternatives (the current puppetization conflicts, installing random unpackaged software is forbidden, ...).

This sounds very much like "labs on real hardware". There were conversations about this already, but I don't think there is anything concrete yet.

I think there is a valid use case here that is unlikely to go away, so we need to address it.

More formally, those points should be addressed:

  • high level of resources (CPU, disk space, IO, ...)
  • ability to experiment (full root for users, the ability to break things, to not play it safe)
  • easy iteration (no direct dependency on other teams to use the machine)
  • ...

This needs refinement and discussion, let's get this discussion started (again).

Event Timeline


Hello @Gehel! We're unlikely to support bare metal on OpenStack in the near future, largely because our pilot program for this, years ago, determined that we were able to meet almost every actual use case with VMs. Can you tell me a bit more about what you need that a VM can't provide?

The main contention point for WDQS (or investigating alternatives) seems to be IOPS. We tried setting up a WDQS test instance on WMCS, but IO contention meant that we were not able to keep up with the update flow. Our production instances consume ~3-4K IOPS just for updates. If there is a way to get this kind of throughput on our VMs, then that would be great!
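(As an aside, a figure like that can be derived from the kernel's per-device counters; the sketch below reads /proc/diskstats directly. The device name "sda" is a placeholder, and iostat reports the same counters more conveniently.)

import time

# Rough sketch: count completed read + write operations for one block
# device from /proc/diskstats (fields 4 and 8 of each line) and report
# the delta over a short window.
def io_ops(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3]) + int(fields[7])
    raise ValueError("device %s not found" % device)

WINDOW = 10  # seconds
before = io_ops("sda")
time.sleep(WINDOW)
after = io_ops("sda")
print("~%d IOPS over the last %ds" % ((after - before) // WINDOW, WINDOW))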

The more general concern here is that we are investigating alternatives to Blazegraph as a graph database. Part of that investigation is getting performance numbers from which we can extrapolate whether any solution we investigate is fit for the load we expect in production. Obviously, testing on hardware similar to what we have in production makes this easier.

If getting >4K IOPS on a VM is possible, I'd be happy to do another test with WDQS.
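(For reference, here is a minimal way such a number could be sanity-checked on a test VM - a single-threaded Python sketch using O_DIRECT so the page cache doesn't mask the device. A real benchmark such as fio with a higher queue depth gives more representative figures; the file name and sizes below are arbitrary.)

import mmap
import os
import random
import time

BLOCK = 4096            # 4 KiB writes, roughly a DB page
FILE_SIZE = 1 << 30     # 1 GiB test file
DURATION = 30           # seconds

# O_DIRECT bypasses the page cache, so we measure the device itself.
# Writing into a sparse file keeps the sketch short but means the
# result is only a rough lower bound.
fd = os.open("iops-probe.dat", os.O_RDWR | os.O_CREAT | os.O_DIRECT, 0o600)
os.ftruncate(fd, FILE_SIZE)

# O_DIRECT requires an aligned buffer; an anonymous mmap is page-aligned.
buf = mmap.mmap(-1, BLOCK)
buf.write(b"x" * BLOCK)

blocks = FILE_SIZE // BLOCK
ops = 0
deadline = time.monotonic() + DURATION
while time.monotonic() < deadline:
    os.pwrite(fd, buf, random.randrange(blocks) * BLOCK)
    ops += 1
os.close(fd)

print("~%d random-write IOPS (queue depth 1)" % (ops // DURATION))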

Our production instances consume ~3-4K IOPS just for updates.

To clarify: that figure would just allow us to barely run the updater. The IOPS figures under production load - i.e. something we might need for performance evaluation and testing - are significantly higher, with peaks reaching 60K and routine load in the 5-10K range.

See e.g. https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&panelId=16&fullscreen

We also need a lot of disk space - the current DB is over 500G, and to work with it comfortably we probably need somewhere around 1T of disk capacity.

We're unlikely to support bare metal in the near future. I would like to try a couple of things, though...

  1. Ensure that your tests are running on an SSD host (the cloud is currently a mix of SSDs and spinning disks)
  2. Try giving you a private cloudvirt host so your use is uncontested. We have a couple of reserve systems currently, so if you tell me the specs you need for a test VM I can create one for you to try out. If you're satisfied with the results then we can discuss the expense of getting you your own host to run your tests.

@Andrew the specs should be as close to production as we can get, namely:

  • CPU: dual Intel(R) Xeon(R) CPU E5-2620 v3
  • Disk: 1600GB raw RAIDed SSD space
  • RAM: 128GB

The exact disk figure and CPU model are not that important, of course, provided we have 32 cores and enough space to comfortably host a 600G database (plus the data needed to reload it, which can be an additional ~100G).

How long would you need to test? I can allocate this space in a VM for a little while but most existing space is already allocated in the coming weeks.

How long would you need to test?

Well, it depends. Generally we have a number of tests that I'd like to run in the future, and I'd like to have a permanent facility for them. It takes a while to set everything up, and an ad-hoc solution for a week or two is not going to solve it. We need something that we can use whenever the need to test scenarios arises, and something that is a first-class solution, not something that sits on the fringes and will be wiped out each time something changes or other projects come in and demand resources. It doesn't mean it has to be up 100% of the time, and most likely we won't be using it 24/7 either, but we need something permanent that we can use once the need arises, without seeking a special exception each time.

Sorry, my question was poorly phrased. How long would you need to test a given VM box in order to determine whether or not it's an adequate solution for your long-term needs? I can't offer you a permanent setup now but we need to figure out what will work before we talk about dedicating specific hardware.

How long would you need to test a given VM box in order to determine whether or not it's an adequate solution for your long-term needs?

I can probably test it in a couple of days.

I've created a temporary VM for this: t206636.wikidata-query.eqiad.wmflabs -- it's on a new and currently unused host, cloudvirt1023. This test box should be largely uncontested for resources until at least Friday.

Let me know if it seems like it might do the trick; if so we can try to refine this a bit more.

Legoktm renamed this task from Provide a way to have test servers on real hardware, isolated from production to Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service.Oct 15 2018, 10:58 PM

@Andrew thanks, I'll start testing on it tomorrow.

@Andrew I logged in there and I see this set of disks:

Filesystem                         Size  Used Avail Use% Mounted on
udev                                65G     0   65G   0% /dev
tmpfs                               13G   19M   13G   1% /run
/dev/vda3                           19G  1.5G   17G   9% /
tmpfs                               65G     0   65G   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
tmpfs                               65G     0   65G   0% /sys/fs/cgroup
labstore1003.eqiad.wmnet:/scratch  3.0T  1.5T  1.4T  51% /mnt/nfs/labstore1003-scratch
labstore1006.wikimedia.org:/dumps  916G  2.6G  867G   1% /mnt/nfs/dumps-labstore1006.wikimedia.org
labstore1007.wikimedia.org:/dumps  916G  203G  667G  24% /mnt/nfs/dumps-labstore1007.wikimedia.org
tmpfs                               13G     0   13G   0% /run/user/10977

I am not sure which one is the SSD space - is it connected? Also, it looks like /mnt/nfs/dumps-labstore1007.wikimedia.org is empty. It'd be nice to get access to the public dumps there, like on wdqs-test.eqiad.wmflabs.

Also, how could I apply puppet roles to this machine? I do not see it in Horizon. I need to add role::wdqs::labs.

All of the storage for that instance is hosted on SSDs. I can partition the extra space for you if you want -- you would just want to add the puppet class profile::labs::lvm::srv.

I definitely see role::wdqs::labs as an available class in Horizon, in the 'Puppet Configuration' tab.

I'll have a look at what's happening with the dumps mount.

@Smalyshev, dumps should be available on that host now.

I definitely see role::wdqs::labs as an available class in Horizon, in the 'Puppet Configuration' tab.

I found it; I was looking at the wrong region. Dumps are OK now too, so I think I've got enough to start testing.

Note to self: I can merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/468377/ after Stas releases this VM (or at least stops caring about resource contention).

I did a short evaluation on the provided VM and it looks like it behaves well enough to run Blazegraph tests on it. So if we get something like that, we could probably do the tests there. The t206636 VM can be shut down now, if necessary.

Thanks, Stas. There are two ways I think we can go forward with this:

  1. Run a second set of tests on a similar VM that shares a host with other active cloud instances (to see if we can meet your needs as a standard user), or
  2. Start talking about buying you a private virt host so you can do tests like this that are uncontested by anything except for other tests of yours

Number 1 seems like a good option to eliminate, but of course it means you'll need to repeat your tests. Unfortunately, to get a good test in the near future I'd need to build you a new VM in the old region (where there's actually some busy traffic happening). If you're interested in giving that a shot, let me know; otherwise we'll need to pass this issue on to some budget-focused people to discuss #2.

Run a second set of tests on a similar VM that shares a host with other active cloud instances (to see if we can meet your needs as a standard user), or

Sure. This would take a bit more time, though - we'll need to load a new dump (probably several days), then run the updater for a few days and queries on top of it. So overall I estimate it will take 1.5-2 weeks of clock time (though mostly it'd be waiting for stuff to complete). I am interested in such a test, so let's try it.

I've created a new VM, t206636-2.wikidata-query.eqiad.wmflabs. This is in the older region, on a host that is not super busy but is supporting quite a few other VMs. If your tests look good there too then we're probably in good shape and can avoid needing special hardware just for you.

I've created a new VM, t206636-2.wikidata-query.eqiad.wmflabs

I see that t206636-2 is listed as having 4 VCPUs and 24G RAM. That sounds too small for what I'd need - is this accurate?

@Andrew Also looks like there is some puppet issue there:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item lvs::configuration::lvs_service_ips in any Hiera data file and no default supplied at /etc/puppet/modules/lvs/manifests/configuration.pp:71:20 on node t206636-2.wikidata-query.eqiad.wmflabs

Judging from the names of the settings, it's not one of mine, so could you please look into what is going on there?

@Andrew Also looks like there is some puppet issue there:

Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item lvs::configuration::lvs_service_ips in any Hiera data file and no default supplied at /etc/puppet/modules/lvs/manifests/configuration.pp:71:20 on node t206636-2.wikidata-query.eqiad.wmflabs

Judging from the names of the settings, it's not one of mine, so could you please look into what is going on there?

Looks like you are using role::wdqs, which has some production-specific things (like LVS configuration). For labs, we have role::wdqs::labs, which should work fine.

I've created a new VM, t206636-2.wikidata-query.eqiad.wmflabs

I see that t206636-2 is listed as having 4 VCPUs and 24G RAM. That sounds too small for what I'd need - is this accurate?

That's my mistake -- I've built a new VM, t206636-3, that should have the same specs as your original test VM. Is it OK if I delete -2 now?

I've built a new VM, t206636-3, that should have the same specs as your original test VM. Is it OK if I delete -2 now?

Yes, sure. I'll use t206636-3.

The data loading test has been launched on t206636-3 and should take several days to complete. Ideally, it should not be interrupted. Please ping me if the VM needs to be rebooted, etc. - it's possible to recover, but to have an accurate record of timings I need to know when it happened.

Looks like this VM is substantially slower - I started the data load on Nov 9, it's now the end of Nov 13, and the load is only 75% done, which means it'll take about a week to load all the data. That is significantly slower than production (where it was done in 63 hours).

However, the real test would be the update/query speed, so I'll wait until the loading is finished and see how the Updater is doing and how querying works.

Loading finished; overall it took 8 days and 9 hours, i.e. 201 hours (8×24 + 9), or roughly 3x production (201 / 63 ≈ 3.2). Launching the updater next to see the update speed.

The updater seems to get about 4-5 updates per second, which is about 2x slower than production. In summary, it looks like this setup may be fit for functionality testing, but it is decidedly unfit for any performance testing, as it is 2-3x slower than production.

The final test I want to do is to see whether the machine is capable of dealing with the incoming update stream (without having to catch up). After that, it can be shut down.

Tried with the incoming stream: the machine can't keep up - by now it's 5 hours behind, and it's processing only about half of the necessary updates.
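(For context, "behind" can be read off the instance itself: Blazegraph stores the timestamp of the last processed change as schema:dateModified on the <http://www.wikidata.org> node, so lag is just "now" minus that value. A small sketch, assuming a stock local WDQS endpoint on port 9999 and second-precision UTC timestamps:)

import datetime
import requests

# Query the last-update timestamp from the local Blazegraph instance.
# The endpoint path is the default for a stock WDQS install; adjust if
# yours differs.
ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"
QUERY = """
PREFIX schema: <http://schema.org/>
SELECT ?updated WHERE { <http://www.wikidata.org> schema:dateModified ?updated }
"""

resp = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"})
resp.raise_for_status()
value = resp.json()["results"]["bindings"][0]["updated"]["value"]

# Timestamp format is assumed to be second-precision UTC with a "Z".
updated = datetime.datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
print("updater lag:", datetime.datetime.utcnow() - updated)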

I think the conclusion is mostly clear - this VM as such is not suitable for any performance testing or any workloads that are close to what we have in production, and can only be used for feature testing.

@Andrew is there any way to improve the performance?

So it sounds like you will need dedicated hardware to make this work (assuming the original test with the t206636 VM was adequate). The analytics team is currently experimenting with this model, so we'll see how it goes. You would need to figure out budgeting/procurement/specs/etc. to get your own server installed within the cloud, but that's not especially complicated.

In the meantime, can I delete t206636-3?

In the meantime, can I delete t206636-3?

Yes.

The next step here is to determine the hardware requirements and then find out if there is current fiscal year budget to cover procuring the required hardware, or if we can allocate a dedicated server from the existing cloudvirt pool and backfill that capacity.

The wdqs10{09,10} prod hosts each have 8 cores, 128GB RAM, and 4x800GB SSD (1.6TB RAID10). Our most recent cloudvirt servers (cloudvirt10[25-30]) have 36 physical cores, 512GB RAM, and 6x1.92TB SSD (5.7TB RAID10). I think this means that a single cloudvirt would be more than enough hardware if it were not shared with other arbitrary workloads.

Yes, judging from our preliminary test, if we get uncontested use of the server - or maybe even a dedicated chunk of it (not sure if that's possible?) - it would be enough. Note that an interesting scenario we want to test in the foreseeable future involves a cluster setup, so we'd want at least 2 hosts (not sure whether they have to be on 2 separate hardware machines) with requirements close to what the wdqs hosts have. It could be that splitting a cloudvirt host into two VMs used exclusively by these test hosts would be OK. Not sure if the virtualization we do now allows this kind of fixed resource allocation (I/O resources probably also need to be taken care of?).

Not sure if the virtualization we do now allows this kind of fixed resource allocation (I/O resources probably also need to be taken care of?)

We have a few special instances that we do this for today. It is something that we are experimenting with for the virtualization of shared Cloud Services systems such as the ToolsDB databases. The process requires cloud-root assistance to create the initial instances using CLI magic, but it is possible. Basically, we mark the cloudvirt as unavailable for selection by the normal OpenStack scheduler and then force-create the desired instances there manually.
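(For the curious, that flow looks roughly like the sketch below using openstacksdk with admin credentials. Every name in it - cloud profile, flavor, image, network, hypervisor - is a placeholder, and the actual cloud-root procedure on WMCS may differ in its details.)

import openstack

# Assumes an "admin" profile in clouds.yaml with admin credentials.
conn = openstack.connect(cloud="admin")

# Step 1 (done out of band): the target cloudvirt is removed from the
# normal scheduler pool, e.g. by disabling its nova-compute service or
# placing it in a dedicated host aggregate, so arbitrary VMs stop
# landing on it.

# Step 2: force-create the instance on that specific hypervisor via the
# admin-only "zone:host" availability-zone syntax.
server = conn.compute.create_server(
    name="wdqs-perf-test-01",
    flavor_id=conn.compute.find_flavor("bigdisk").id,
    image_id=conn.compute.find_image("debian-9.0-stretch").id,
    networks=[{"uuid": conn.network.find_network("instances-net").id}],
    availability_zone="nova:cloudvirt1023",
)
conn.compute.wait_for_server(server)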

Gehel claimed this task.

This is addressed by T221631, closing.

I would like to delete the flavors named t206636 and t206636-testing and the VM named wcqs-beta-01 which uses the latter flavor. Is that ok?

t206636 and wcqs-beta-01 are behind the service https://wcqs-beta.wmflabs.org/ and cannot be deleted.