Archive for April, 2010

4/25/2010: 9:05 pm: Conference, MySQL

Replication Tips – slides

  • Single master can typically handle a few dozen slaves
  • If you have a lot of slaves and want to do filtering for all, maybe use blackhole engine to do filtering in one place. Especially if running blackhole slave on same server to limit network bandwidth.
  • Replicate over WAN – slave_net_timeout, master_connect_retry, slave_compressed_protocol

Gearman – slides

  • UDFs allow you to run functions in a wide range of locations on servers other than the database server
  • Synch and asynch jobs
  • gman_sum() provides simple example of map-reduce with gearman
  • - Asynch queues, a.k.a., background tasks
  • e.g., email notifications (avoid blocking on sendmail), full text indexing, stat counters, etc.
  • Can enable you to do batch operations, when you don’t want to block on a large batch
  • Maybe use trigger to call a UDF that does a background task. Allows you to implement much more complex triggers with external side effects (e.g., sending notifications).
  • Putting Gearman workers between app and MySQL can allow you to parallelize queries (scatter-gather).
  • Persistent queues – if job server crashes, it can replay the queue on startup or you start another job server and point it at the queue
  • Queues can be stored in Drizzle, MySQL, PostgresQL, SQLite and even in Memcached.

Mastering the Art of Indexing – slides

  • Lock contention and indexing
    • update/delete by full scan is sequential rw. By index scan is random. So, full scan can be better in some cases.
    • In 5.1 can disable next key locking. use binlog-format=row and transaction-isolation=read-committed. But read committed can perform worse with high concurrency.
  • Deadlock caused by indexing
    • Sometimes single query update or delete has to update secondary index as well as PK table in InnoDB.
    • If 1 statement accesses and modifies sec index before accessing table records while another statement modifies same table record and needs to update same row in secondary index, you can get deadlock.
  • Covering index with range, limit, count
    • Range queries over large ranges often slow, especially if looking in secondary index before using range on table records. Can result in many random reads. Covering index can improve perf by eliminating random reads.
    • Covering index can also help a lot with LIMIT (especially when skipping) if the query would otherwise have to randomly access table records to apply other where clause conditions.
    • Similarly, covering index can help with COUNT, since you don’t have to randomly access table records.
  • Covering index with blob
    • Assuming you have a table with a blob that is rarely selected (e.g., < 10% of time). To avoid blob being loaded into buffer pool, typical optimization is to move blob into another table with 1:1 relationship, though this breaks normalization.
    • For his tests, performance with 2 tables is better until percentage of queries requesting blob (thus requiring two SELECTs) get up to around 45%.
    • Instead, can use covering index to include all desired columns (except the blob, of course) for most queries. To get good perf, initial column should be chronologically order, e.g., auto inc or a timestamp.
    • May have to use FORCE INDEX to get optimizer to use your index.
  • Sorting, indexing and query execution plans
    • If query does ORDER BY on a secondary index that is used for the query, can skip filesort, which is O(nlogn).
    • If not, optimizer chooses btw full scan and using index for sort vs. index scan and a filesort.
    • Index merge can be used to filter records, but not to sort.

Understanding the Role of IO as Bottleneck – slides

  • 7200 RPM drive can generally do about 100 IOPS.
  • HDD disk seek about 10 ms. Reading 1 MB sequentially from HDD is about 20 ms.
  • When you do a read on InnoDB, rows are loaded into buffer pool.
  • Writes to InnoDB go to buffer pool, but also are flushed to log file. Sequential IO to log file is fast. Buffer pool pages (16 KB/page) are marked as dirty until flushed to tablespace.
  • Writes to tablespace are sorted by innodb to be sequential as much as possible.
  • iostat is more useful than vmstat for IO analysis
  • If svctm large (maybe more than a few milliseconds), may have large competing IO requests, e.g., a busy shared SAN.
  • Queue size tells you how much data could be lost if server crashes. Wait time indicates how long ops may be blocking and also how much risk of lost data from power loss with no battery back up (or non-catastropic disk crash).
  • Indexes and archiving old data can reduce your working set. If your server is read-heavy, lots of RAM can really help.
  • RAID 5 good for reads, but worse for writes. RAID 10 good for both.
  • RAID controller with BBWC can reorder some random writes to sequential writes without risk of a power loss losing data.
  • Fusion-IO card can theoretically do up to 100,000 IOPS. Another presenter suggested that InnoDB is currently limited to around 15,000 on these cards.
  • Often cheaper to buy more memory, though that doesn’t help as much with writes.
  • Write options – mount files sytem noatime, compress blobs, reduce number of filesorts on disk.
  • If RAID with no BBWC, use innodb_flush_log_at_trx_commit of 2 (but be aware of possible data loss) or group statements into short transactions.
  • Referred to Facebook talk – they said maximum desirable transaction length for their workload and servers is about 4 seconds.

Mysteries of the Binary Log – slides

  • Safer to purge binary logs by using PURGE BINARY LOG {TO, BEFORE}, since it updates index file.
  • If you use expire_log_days, removal happens at server restart or log flush (which happens during a file rotate).
  • RESET MASTER will delete everything and create new binlog file. Disconnect any slaves before doing this.
  • Common header about 19 bytes. 4 byte timestamp, type, server id, length, file position, flags. File position is the end of the event.
  • SYSDATE() is dangerous, because it actually runs on the slave, so the generated timestamp will be different than on the master.
  • Context events (user defined vars, rand seed, auto_increment, last_insert_id()) are grouped with the query into a binlog event.
  • Since replication user can read binary logs, it can read all changes. Can be security issue.
  • For setting passwords, better to set passwords by setting them into a user variable as a hash and then setting hash directly as the password. Only the already hashed value in the SET statement is replicated.
  • Statements in transaction go into a cache. It is flushed to binary log on commit. It is emptied if a transaction with only transactional statements is rolled back. Complicated if you mix in MyISAM changes.
  • Transaction cache spills to a file if it gets too big. This is one of the reasons large transactions can cause bad performance.
  • DDL statements like CREATE and ALTER cause implicit commits. Changes to mysql databse and LOAD DATA INFILE also are implicitly committed.
4/19/2010: 9:01 am: Conference, MySQL

What’s New in MySQL 5.5? Performance Unleashed! - slides

  • Default storage engine is InnoDB (what had been called the InnoDB plugin, but now is the only version).
  • Many locks and mutexes held back scalability in 5.1.
  • Biggest win was fixing uses of LOCK_open. It was used for many things, some of which made no sense.
  • Introduced MetaData Locking (MDL). Added new mutex to protect MDL hash, but hash is calculated outside mutex.
  • Moved some things to 3 new mutexes, made some things atomic variables and removed useless uses of LOCK_open.
  • LOCK_thread_count used for too many things. Still used to protect list of connections, but changed other uses to atomic variables.
  • LOCK_open and the new MDL hash mutex are the biggest bottlenecks, but they are a problem at 32 cores now, instead of 8 cores.
  • InnoDB had many mutex issues for large number of cores. Buffer pool and log mutexes are worst. Adaptive hash index mutex is global and can be a problem, but can be disabled.
  • InnoDB Kernel mutex started to become a problem again, since other fixes allow you to reach higher number of concurrent transactions.
  • Log mutex held about 75% of time (for Sysbench RW) and buffer pool mutex about 50%.
  • Split the Log mutex into separate mutexes for independent functionality. Need to grab 2nd while holding 1st, but then can release 1st.
  • 1st was hottest, so separating acquisition of it from acquisition of buffer pool index was a big win.
  • Buffer pool, either split mutex functionality into multiple mutexes or split buffer pool into multiple pools.
  • InnoDB plugin with compression made having multiple mutexes much more complex, so decided to split buffer pool. Need 3 locks to get buffer pool page. Adding page hash mutex would have made it 4.
  • New config setting. Defaults to 1. Also split the flush list from buffer pool mutex.
  • Initial testing suggests 8-16 buffer pools gives best performance, but need lots more testing before making recommendations.
  • Rollback segment mutex problem with many connections. Can even perform IO while holding it. Split into 128 mutexes.
  • Purge was done as part of master thread. HIgh transaction rate mean lots of purging. Master thread can end up spending most of its time purging. Moved purging to separate thread.
  • Huge improvement in recovery (e. g., 7 hours to 14 minutes). Reduced algorithmic complexity of 2 impt activities.
  • Also huge improvement to delete performance.
  • Still looking into query cache issues.

MySQL Cluster: an Introduction – slides

  • NDBCluster Engine does not store to local disk. It pushes data to data nodes, which don’t need to be running MySQL.
  • Data nodes are ndbd processes and not necessarily bound to a single server. For testing, you can run multiple on the same server.
  • Data usually stored in 2 data nodes. Storing in 3 is experimental feature, requires more servers and performs worse. Storing in 1 obviously reduces availability, but gives great performance.
  • Storage is transactional.
  • You can have many mysqld instances all accessing the same data nodes. You can also use NDB API to access data nodes.
  • A hash is used to partition data across data nodes.
  • Always at least 1 replica of stored data. Node groups allows you to group replicas. Group selection is done according to order the server is listed in config file.
  • If a node goes down, its replica takes over as primary for the data. When node comes back, if not too far behind, it will sync the changes. If too far behind, it will sync a dump of all data.
  • Or you can bring up a replacement data node server with same IP address. Config is by IP.
  • If 2 data nodes in a group goes down, cluster must halt. Consistency over availability, CAP-wise.
  • SQL Node is a node running MySQL server.
  • Up to 48 data nodes. Very rarely more than 10 in practice, due to network and maintenance issues. Usually just a few.
  • Management node started first and it reads in the config. All data nodes need it in order to start. Also used for monitoring and arbitration.
  • If 2 data nodes can’t communicate, but 1 can still reach management node, that data node will take over. The one that can’t reach the mgmt node will shut down.
  • Mgmt node should be on different physical machine than the data nodes. You generally want two management nodes.
  • Only one will be the arbitrator. If primary fails, the data nodes will elect the 2nd mgmt node as the arbitrator.
  • There are also API nodes. A SQL node is a special case of an API node. API nodes use C++ library. Can also use Cluster/J in 7.1 for Java apps.
  • Only have to create tables on one MySQL server. The table creation will be auto detected by other servers.
  • User and grant info is independent, so that must be done on each server.
  • But, you can set up replication between MySQL servers. This allows you to more easily mix in non-cluster engine tables.
  • Need to handle retries in app. If data node dies during transaction, app must retry so transaction goes to other data node.
  • Table limits: 8052 bytes per row (can use blobs, but perf not good), 128 columns per table.
  • Indexed data limited to available RAM. Non-indexed data can be stored on disk.
  • Features: Online backup, global replication (can use MySQL to replicate clusters), NDB & MGM API, online additions of new data nodes.

EXPLAIN Demystified

  • Estimated query plan. Only for SELECT. Does not generate byte code. Unfortunately, you can’t ask MySQL for the plan it actually used for an executed query.
  • MySQL does everything as nested loop JOINs. Even a single table query is a single iteration loop. Always a left-deep tree, though it may reorder joins within the tree.
  • EXPLAIN output starts with deepest JOIN.
  • If no subquery or UNION, id for each row will be 1.
  • For select_type, ‘simple’ means only one SELECT keyword in entire query.
  • Derived means executed as a temp table, e.g., result of a subquery. If the table column is , the n is the subquery’s id column.
  • Union – rows spooled into temp table then read out with null in a row that says UNION RESULT
  • Be aware that EXPLAIN actually executes subqueries and unions, which can make it run very slowly. Also, can cause side effects if you update variables.
  • For derived, the higher id row is executed first. For UNION, the output tells you the execution order, e.g., .
  • IN and NOT IN queries are poorly optimized. They are converted into a dependent subquery.
  • UNCACHEABLE refers to the item cache. Not well documented. A few non-deterministic things like RAND() or a user variable can cause this.
  • Type is the JOIN type. Really more of an access type. Worst to best is ALL, index, range, ref, eq_ref, const, SYSTEM, null.
  • eq_ref means if there is a match, there will only be one row, perhaps due to a unique constraint on an index
  • Some non-range clauses get labeled as range because of details of the optimizer implementation.
  • ref tells you which columns or constants from preceding tables are used for lookups in the index named in the key column
  • rows is number of estimated rows that must be read. 5.1 tries to use LIMIT clause, but it is still sometimes inaccurate.
  • filtered: # of rows satisfying a condition. Added in 5.1. Not very useful.
  • extras: useful extra info (using index -> covering index; using where -> server post filters rows after using index; using temporary -> implicit temp table but not specified if memory or disk, using filesort -> means some kind of sort, but could be in memory)
  • filesort is an algorithm name. It doesn’t mean a disk-based file was used.
  • Use “pager mk-visual-explain” to pipe output through mk-visual-explain, assuming you installed Maatkit.

Connecting MySQL and Python – slides

  • Standard driver is MySQLdb. Another option is oursql.
  • If using MySQLdb library – db.connect(use_unicode=True, charset=’UTF8′)
  • The oursql driver has same API. Uses mysqlclient. Also implemented in C.
  • Both depend on MySQL client already being installed.
  • MySQL Connector/Python doesn’t depend on MySQL client libraries. Written by Geert.
  • Connector/J can be used with Jython. from java.sql import DriverManager
  • Current version is 0.1.3, but 0.1.4 coming soon.
  • Performance for INSERTs is about twice as slow as for MySQLdb or oursql.
  • Need more benchmark tests, since not clear whether that was including the network roundtrip time, since that should be roughly the same.
  • After the talk I talked to Chromakode about drivers and he recommended oursql. Much better code and documentation. I’ve switched to oursql and I like it.

Better Database debugging for shorter downtimes – slides

  • Tracking changes over time
    • Slowly Changing Dimension type-2. Start with a baseline and only track changes.
    • Need PK of original object, maybe from information_schema. Select attributes to track.
    • Add new row for each new property and when properties change. Use current_flag column to indicate which version is current. Also use effective_date and end_date, which can also tell you the current version.
  • Job logging via stored procedures
    • When they run a job (e.g., crontab entries), they use stored procedures to track when and how long it ran.
    • Really need to review slides and the stored procedures to get much value from this.

Linux Performance Tuning and Stabilization Tips – slides

  • To minimize mem usage, use compact datatypes. Remove unused indexes. Archive data when possible to minimize your working set.
  • Use direct IO -> innodb_flush_method=O_DIRECT.
  • Don’t alloc too much memory to mysqld. Makes filesystem cache too small and causes swapping.
  • If you disable swap & run out of memory, OOM killer will likely kill mysqld. It takes a long time to kill it and it aborts the process, which triggers crash recovery.
  • MyISAM, backup tools & admin tools use the filesystem cache.
  • Copying large files can cause swap. This can happen during backup.
  • set vm.swappiness=0 in /etc/sysctl.conf. Default is 60. This makes OS prefer making filesystem cache small over swapping.
  • tcmalloc faster than malloc. Can config InnoDB to use tcmalloc. Mostly matters for CPU bound loads with large buffer pool.
  • Don’t use too much session memory This can happen if you make per session memory variables to big. read_buffer_size can cause large memory blocks to be allocated.
  • Use BBWC on raid cards. Then, disable write cache on disks.
  • Don’t set write barrier if using BBWC (in ext3, barrier = 0)
  • Appending+fsync much slower than updating+fsync
  • If you use sync-binlog=1, fsync is called on append. Very slow.
  • Set innodb_autoextend_increment to higher than default of 8, maybe use 20.
  • pdflush takes care of async writes for MyISAM, file copy, mysqldump, etc.
  • Facebook moved from ext3 to xfs. ext3 very slow to delete large files. Makes dropping large table slow.
  • Writing to file is serialized.
  • Use dir_index to speed up searching and disable write barrier.
  • With xfs, disable write barrier with “nobarrier”. Concurrent writes possible with O_DIRECT.
  • Fusion SSD cards are 2-3 times faster than Intel SSD cards and 10-20 times faster than HDD. Fusion cards really benefit from xfs.
  • I/O Scheduler default is cfq, but noop or deadline are often better. InnoDB is already doing I/O scheduling itself, so cfq results in useless overhead.
  • Big difference in write performance between cfq and noop when lots of random reads mixed with writes. Recommend noop.
  • I/O scheduler queue size. Default is 128. Setting large like 100000 can greatly improve perf for MyISAM, since it doesn’t sort writes. Doesn’t help InnoDB, since it already sorts them.
  • iostat -x
  • Check r/s, w/s, svctm, %util, though svctm more reliable than %util
  • mpstat gives per core stats, vmstat averages across cores
  • mpstat -P ALL 1
  • SystemTap can be used with MySQL binary builds with DWARF symbols
  • iotop provides per process io stats
4/18/2010: 10:15 pm: MySQL

This year’s MySQL Conference was once again jam packed with very technical presentations. Best of all was the news about all the performance and feature enhancements in MySQL 5.5. The benchmark results for the performance improvements were very impressive. After attending this year’s conference, I’m more confident about MySQL’s future than I’ve been for a very long time. I’ll post my notes from the three days I attended.

It was also a great time to meet up with the friends I’ve made in the MySQL community over the last few years. Some other open source communities have a really poisonous atmosphere of spending a lot of time attacking competing open source products. Fortunately, I see very little of that time-wasting crap within the MySQL community. Well, except for the occasional fork sniping, but that’s easy for me to ignore.

Update: Finally managed to post my notes from all 3 days.


Fork me on GitHub