In this presentation, we will describe a scalable architecture that provides quick access to the huge volumes of data typical of Internet services (e.g., email, social networks) that experience exponential growth in users and access traffic. We will discuss the motivations, design, and adopted solution, and analyze them through benchmark results. The technique used to scale out, optimize, and administer the current and historical data of several million users is based on a data layout that maps all the data for each user to its own table, distributing the tables across multiple databases and filesystems. This talk offers an insightful look at our performance analysis results and optimizations, obtained with four Linux filesystems (ext3, jfs, reiserfs, and xfs) combined with several MySQL storage engines (MyISAM, InnoDB, Falcon, PBXT) and MySQL's horizontal partitioning feature.
While most advanced relational database management systems (RDBMS) have evolved some means of determining the physical location used for storing segments of data on distinct filesystems, and MySQL's horizontal partitioning feature allows portions of individual tables to be distributed across filesystems, they still have limited capabilities when it comes to physically distributing large data sets for millions of users.
To achieve the desired scalability, we propose an alternative solution based on a data layout that maps all the data for each individual user to its own table, organizing the tables across multiple databases and filesystems. With this approach, the burden shifts to the MySQL storage engine, the filesystem, and a new strategy for handling millions of files.
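As a minimal sketch of the idea (the naming scheme and the database/filesystem counts below are illustrative assumptions, not details from the talk), a routing function can map a user id to its dedicated table, the database that holds it, and the filesystem that database lives on:

```python
NUM_DATABASES = 64    # illustrative: databases per node
NUM_FILESYSTEMS = 8   # illustrative: mounted filesystems per node

def locate_user(user_id: int) -> dict:
    """Map a user to its own table, a database, and a filesystem.

    Each user's data lives in a dedicated table; tables are spread
    across databases, and databases across filesystems, so no single
    directory accumulates millions of files.
    """
    db_index = user_id % NUM_DATABASES
    fs_index = db_index % NUM_FILESYSTEMS
    return {
        "database": f"userdb_{db_index:03d}",
        "table": f"user_{user_id}",
        "filesystem": f"/mnt/data{fs_index}",
    }
```

Because the mapping is a pure function of the user id, any frontend can compute a user's location without a central lookup service.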
Initially, two data layout scenarios are evaluated:
In the first scenario, the total number of tables is the total number of users divided by the maximum number of partitions allowed on a partitioned table (1024), so each table holds up to 1024 users, one per partition. Even though this solution maps to fewer files and tables, it requires more management than the second scenario when determining which table holds a given user's data. Moreover, it rules out using the horizontal partitioning feature to further distribute segments of an individual user's data (sub-partitioning). The second scenario, one table per user, generates the highest number of files, but lowers concurrent access contention on each table.
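The arithmetic behind the two scenarios can be sketched as follows (a simple illustration under the stated 1024-partition limit; the table-naming convention is an assumption):

```python
MAX_PARTITIONS = 1024  # maximum partitions allowed on a partitioned table

def scenario_one(user_id: int) -> tuple:
    """Shared tables: each table holds MAX_PARTITIONS users, one per
    partition. Returns (table index, partition index) for a user."""
    return user_id // MAX_PARTITIONS, user_id % MAX_PARTITIONS

def scenario_two(user_id: int) -> str:
    """One dedicated table per user: far more files on disk, but no
    shared-table lookup, and partitioning stays free for sub-partitioning
    a single user's data."""
    return f"user_{user_id}"
```

Note that in scenario one, every partition slot is already consumed by a user, which is why per-user sub-partitioning is no longer available.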
Next we will discuss the database and filesystem organizations.
In both cases, the number of files associated with a table varies with the specific MySQL storage engine. The tables can be organized in a single database or across multiple databases. In the former case, all the files belonging to a table are located under the same directory, and Linux filesystems (like other filesystem types) have constraints when it comes to storing millions of files in a single directory. However, MySQL makes it possible to store each table in a different location, filesystem, or piece of hardware.
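MySQL exposes this placement capability through the per-table DATA DIRECTORY and INDEX DIRECTORY options (honored by MyISAM). As a hedged sketch, a helper might emit the DDL that pins one user's table to a chosen filesystem (the schema, paths, and naming below are illustrative assumptions):

```python
def create_table_ddl(user_id: int, fs_index: int) -> str:
    """Emit DDL placing one user's MyISAM table on a chosen filesystem.

    DATA DIRECTORY / INDEX DIRECTORY are MySQL's per-table placement
    options; the column layout and mount paths here are hypothetical.
    """
    path = f"/mnt/data{fs_index}/users"
    return (
        f"CREATE TABLE user_{user_id} (\n"
        "  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,\n"
        "  payload BLOB\n"
        ") ENGINE=MyISAM\n"
        f"  DATA DIRECTORY='{path}'\n"
        f"  INDEX DIRECTORY='{path}';"
    )
```

Separating data and index directories also allows the data and index files of the same table to live on different spindles when that helps I/O balance.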
An alternative strategy is to distribute the tables across several databases and spread the databases across several filesystems. This also makes it possible to better exploit MySQL's partitioning feature on each individual user table. The latter solution is the scalable architecture we chose to provide quick access to billions of rows distributed among over 5 million users.
We will present our performance analysis results, derived from benchmarks on the following topics:
The benchmarks compare four Linux filesystems (ext3, jfs, reiserfs, and xfs) with four MySQL storage engines (MyISAM, InnoDB, PBXT, and Falcon), with and without the partitioning feature, covering their limitations, differences, and optimizations.
As an example of storage engine differences, it is interesting how the memory footprint used to store a table dictionary varies with the storage engine; this becomes much more visible when dealing with millions of tables.
We will complement the previous benchmarks with specific low-level optimizations of the computing platform, such as the impact of write-back versus write-through disk caching policies, and architectural tests comparing AMD and Intel Xeon platforms.
Finally, we will close with an examination of the chosen filesystem and MySQL storage engine, and results obtained with the described architecture in a master-master replication scenario. We believe the proposed solution scales to an unlimited number of users given enough commodity hardware resources. Each node (system) contains the data for “n” million users distributed among “k” databases; when the maximum number of users is reached, a new node is added, and its “k” databases host the data for the next “n” million users.
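The scale-out step described above can be sketched as pure arithmetic (the concrete values of “n” and “k” below are illustrative assumptions, not figures from the talk):

```python
USERS_PER_NODE = 5_000_000  # illustrative "n" million users per node
DBS_PER_NODE = 100          # illustrative "k" databases per node

def route(user_id: int) -> tuple:
    """Return (node index, database index) for a user.

    When user_id exceeds the capacity of the existing nodes, the node
    index simply grows by one: adding a node is all that is needed to
    absorb the next USERS_PER_NODE users.
    """
    node = user_id // USERS_PER_NODE
    database = (user_id % USERS_PER_NODE) % DBS_PER_NODE
    return node, database
```

Because routing is stateless, capacity grows linearly: each new system immediately serves its slice of the user id space without any data reshuffling on existing nodes.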
Mariella Di Giacomo is a member of the scalable database team at The Hive. Her interests include distributed database systems, cluster computing, and large-I/O systems. She received her Laurea degree in computer science from the University of Pisa, Italy. Prior to joining The Hive, she worked for the Pacific Northwest National Laboratory, the Los Alamos National Laboratory, and Alcatel.
I’ve been spending a lot of my time recently working on large-scale systems: high availability, automatic/seamless process failover, high-speed network shared memory (using multicast), database replication, multi-homed and multi-master syncing, and clustering.
I’ve worked on lots of stuff, including but not limited to: the Linux kernel, small (a portable C-like scripting language), XFree86 (now branched off to be called x.org or some such nonsense), GNOME, the Gimp, a bunch of image processing libraries, file management, disk notification, plug and play monitor support, AGP gart acceleration, and of course the Enlightenment Window Manager.
I’ve also worked on speech synthesis, speech recognition, computer vision systems, large-scale AI systems – mainly deriving a new type of neural network based on six-dimensional matrix modelling – speech-to-speech translation systems (English <> Arabic, English <> Pashto), workflow management systems for large multi-user development processes, speech annotation software, source revision control systems for speech annotation, and other general audio and image processing software.