Project “700″ [En]

This page is about a project of mine, called “Project 700″, made for educational purposes (software engineering class). The idea of this project is pretty straight forward. I had to create a database with 700 million records, that would store 700 million names, each one with their own user and password. We had the liberty to choose what kind of DBMS (Data Base Management System) we wanted. OS was at our own choice too.

In the next session I’ll present you the tutorial for doing this, including some optimizations and recommendations.

First of all, you should need a pretty good machine, and some free space (I used 300 GB). I would recommend a Dual Core or even Quad Core processor. Using a RAID system isn’t mandatory, but it would greatly improve the speed.

The configuration of my machine on which I made the database:

Motherboard:
CPU Type DualCore Intel Core 2 Duo E6600, 2400 MHz (9 x 267)
Motherboard Name DFI LANParty UT P35-T2R (3 PCI, 1 PCI-E x1, 3 PCI-E x16, 4 DDR2 DIMM, Audio, Dual Gigabit LAN, IEEE-1394)
Motherboard Chipset Intel Bearlake P35
System Memory 2048 MB (DDR2-800 DDR2 SDRAM)
BIOS Type Award (12/24/07)
Storage:
IDE Controller Intel(R) ICH9 2 port Serial ATA Storage Controller 2 – 2926
IDE Controller Intel(R) ICH9 4 port Serial ATA Storage Controller 1 – 2920
Disk Drive ST3500630AS (500 GB, 7200 RPM, SATA-II)
Disk Drive ST3500630AS (500 GB, 7200 RPM, SATA-II)
Disk Drive WDC WD75 00AYYS-01RCA0 USB Device (698 GB, USB)
SMART Hard Disks Status OK
Memory Properties:
Error Detection Method 8-bit Parity
Error Correction None
Supported Memory Interleave 1-Way
Current Memory Interleave 1-Way
Supported Memory Types DIMM
Supported Memory Voltages 5V
Maximum Memory Module Size 2048 MB

The solution I chose for this project was Ubuntu 7.10 Server Edition and PostgreSQL 8.2. These aren’t the latest versions anymore. I strongly recommend using latest stable releases.

I also recommend installing phpsysinfo and phppgAdmin. These tools are good for simple monitoring of the system and the database. Also, this is a good way to remotely query the database, as I did when I showed the database to my teacher.

In this tutorial I will show you how to build a database containing two tables, one for storing names, and other for storing the username and the passwords. The two tables will be in a 1 to 1 relation.

Before we start, I suggest some optimisations to the database server. Locate your postgresql.conf and open it with your favourite text editor ;) . Locate the lines :

#work_mem = 1MB replace it with work_mem = 80MB (the memory allocated for a postgres process)

shared_buffers = 24MB replace it with shared_buffers = 32MB

#fsync = on replace it with fsync = off (turns forced synchronization on or off)

Save it, and restart the postgres server. If you encounter problems, type thise two commands :

sorinello@Homeworld:sysctl -w kernel.shmmax=134217728

sorinello@Homeworld:sysctl -w kernel.shmall=2097152

Now, for generating the list of 700 million names I had made a script in perl (tried to aviod PHP because perl is faster).

The script will generate in parallel 2 CSV files, called file_names.txt and file_users.txt , first one for storing the data needed to populate the names table, and the other file for populating the users table. The password field will be a md5 hash over the username.

file_names.txt will contain entries in this form : id;name;surname

file_users.txt will contain entries in this form : id;username;password

Create a file called generate.pl, and put the script below in this file.

There is the script you (it can also be downloaded from here):

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex md5_base64);
my @names_list = (name1, name2, and so on .. you must put here the names);
#I store this so in case i'll use this again, or adding some names, i won't have to keep the count
my $array_length = $#names_list + 1;
print "Array length is " . $array_length . "\n";
my $file_names = 'file_names.txt'; #used to store names and surnames
my $file_users = 'file_users.txt'; #used to store users and passwords
open(INFO1, ">$file_names");
open(INFO2, ">$file_users");
my $has_two_surnames;
my $i;
my $name;
my $surname; #because i have only one field in the db, i have to concat the 2 surnames
my $surname1;#one of the surname
my $surname2;#one of the surnames
my $username;
my $password;
for( $i=0 ; $i < 70000000; ++$i)
{
#15825911 Marton's Magical Number
if ($i % 15825911 eq 0)
{
print "Step " . $i . "\n";
}
$has_two_surnames = int(rand(10));
#let's say 4 of 10 people have two surnames
if ($has_two_surnames < 4 )
{
$name = $names_list[int(rand($array_length))];
$surname1 = $names_list[int(rand($array_length))];
$surname2 = $names_list[int(rand($array_length))];
$surname = $surname1 . " " . $surname2;
$username = $name . $surname1 . $surname2 . $i;
$password = md5_hex $username;
#print $i . " " .$name . " " . $surname . " " . $username . "\n";
#print to file user ; as a separator
print INFO1 $i . ";" .$name . ";" . $surname . "\n";
print INFO2 $i . ";" . $username . ";" . $password . "\n";
}
else
{
$name = $names_list[int(rand($array_length))];
$surname = $names_list[(int(rand($array_length)))];
$username = $name . $surname . $i;
$password = md5_hex $username;
#print $i . " " .$name . " " . $surname . " " . $username . " " . "\n";
#print to file used ; as a separator
print INFO1 $i . ";" .$name . ";" . $surname . "\n";
print INFO2 $i . ";" . $username . ";" . $password . "\n";
}
}
close(INFO1);
close(INFO2);

[del.icio.us] [Digg] [Facebook] [MySpace] [Technorati] [Twitter] [Email]

Popularity: 62% [?]


Pages: 1 2