
Scalability and deployment strategies for LAVA

2011-11-01 12:00..13:00 in Grand Sierra G

Right now we're just using a single server and deploying packages that line up with our monthly releases. As LAVA is growing quickly and becoming increasingly important within Linaro, we should explore how to make it more scalable and how we can support a more agile cycle of development, testing, and deployment of LAVA components.

Session Notes:

* Current performance is "decent"
* LMC uses flock to serialize runs (see the sketch after these notes)
  - ACTION: investigate doing this on a ramdisk
  - Offloading this to another machine that is not handling other dispatcher activities would be better
  - Offload jenkins too
  - The main host should just be for interactive things
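The flock-based serialization mentioned above could look roughly like the sketch below. This is illustrative only: the lock file path (including the idea of putting it on a ramdisk/tmpfs mount) and the run_lmc_serialized helper are assumptions, not taken from the actual LMC or dispatcher code.

    import fcntl
    import subprocess

    # Hypothetical lock file; putting it on a tmpfs/ramdisk mount is the
    # idea raised in the session, not current behaviour.
    LOCK_PATH = "/run/lava/lmc.lock"

    def run_lmc_serialized(argv):
        """Run a linaro-media-create command, one instance at a time."""
        with open(LOCK_PATH, "w") as lock_file:
            # Blocks until any other holder of the lock releases it.
            fcntl.flock(lock_file, fcntl.LOCK_EX)
            try:
                return subprocess.call(argv)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)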

* Celery needed to help us spawn workers - maybe can use the cloud (see the Celery sketch after these notes):
  - Some worker tasks with low transfer requirements could be used there right away, if we have anything like that
  - Eventually, deploying web server nodes, the database, etc. to the cloud would be possible, keeping the dispatcher local
  - Use juju for easy deployment
* Measurements
  - collectd is running, but not terribly useful
  - Database transactions/min would be useful
  - A google-analytics-style hit rate counter
  - ACTION: investigate graphite and statsd
  - Sentry monitoring for django apps
* Should we run in the Canonical datacenter for things other than the dispatcher?
  - Deployment might be an issue there - submitting RTs for changes
  - Could we experiment with running a secondary staging server there?
* ACTION: Launchpad has a script that measures how long transactions run for. This should help us avoid making db transactions that take too long.
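A hedged sketch of how a Celery task could take image-building work off the main host. It uses the modern Celery application API; the app name, broker URL, and task name are assumptions for illustration, not anything in LAVA today.

    # tasks.py - minimal sketch only
    import subprocess
    from celery import Celery

    # Broker URL is a placeholder; a RabbitMQ instance is assumed.
    app = Celery("lava_offload", broker="amqp://guest@localhost//")

    @app.task
    def run_media_create(argv):
        """Run a linaro-media-create command line on a worker node."""
        return subprocess.call(argv)

A dispatcher-side caller would enqueue work with run_media_create.delay([...]) and a worker started on another machine with "celery -A tasks worker" would pick it up, keeping the main host free for interactive use.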

* Postgres schema migrations are disruptive; distributed postgres might have issues with this
* Many processes have to talk to the database (scheduler, dashboard, ...) - could they talk through the queue?
* Things to monitor:
  - Web server performance (responsiveness)
  - Database performance
  - System load (with a notification) - investigate using nagios or new relic for this
  - Memory/swap usage
* uwsgi has a feature to let you know and/or take action if a request takes longer than a given threshold
  - ACTION: Make sure we're making use of this when we do the new deployment
* ACTION: Does postgres have a slow queries log?
* Caching (see the sketch after these notes)
  - Global enablement is not going to work; enabling it globally for anonymous users and dropping the timeout to a few minutes would be better - measurements first
  - Cache reports at the API level
  - Tests need to be cache-aware because they will see stale data - need to figure out how to turn caching off for testing
  - Wall clock time is ok for checking whether this is improving things
  - Is there a way to measure the cache hit/miss rate? memcached could probably help measure this
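One way the "enable for anonymous users with a short timeout" idea could look in Django is sketched below. The decorator name, the two-minute timeout, and the report view it decorates are assumptions used purely for illustration.

    from functools import wraps
    from django.views.decorators.cache import cache_page

    def cache_for_anonymous(timeout):
        """Apply Django's per-view cache only for anonymous users."""
        def decorator(view):
            cached_view = cache_page(timeout)(view)
            @wraps(view)
            def wrapper(request, *args, **kwargs):
                # is_authenticated() is a method in the Django versions
                # current at the time of this session.
                if request.user.is_authenticated():
                    # Logged-in users always see fresh data.
                    return view(request, *args, **kwargs)
                return cached_view(request, *args, **kwargs)
            return wrapper
        return decorator

    # Hypothetical usage on a dashboard report view:
    # @cache_for_anonymous(timeout=120)   # "a few minutes", as discussed
    # def report_detail(request, name):
    #     ...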

Sentry could be used to track lots of different kinds of errors, including deploy failures.
- ACTION: Investigate sentry
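A minimal sketch of hooking a Django project up to Sentry so unhandled exceptions are reported centrally. It uses the present-day sentry_sdk client rather than the client available at the time of this session, and the DSN is a placeholder.

    # settings.py fragment - sketch only; placeholder DSN
    import sentry_sdk
    from sentry_sdk.integrations.django import DjangoIntegration

    sentry_sdk.init(
        dsn="https://examplePublicKey@example.invalid/0",
        integrations=[DjangoIntegration()],
    )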

- Zyga: look at getting caching enabled, celery, sentry
- Michael: statsd/graphite
- Dave/Paul: other monitoring things