Project “700″ [En]

Name the script : generate.pl (pl comes from perl).

To run the script, prepare some free space, put the script in a directory, and run it from terminal :

sorinello@Homeworld: $./generate.pl

The script should output from time to time at what value has come generating. This is only a checking method to see if it still works. Obviously, you can check the hard drive LED too.

If you intend to modify the script, or modify the rate some output is shown, beware that too much output actually slows the generation process. That’s why I used a number found by a friend of mine and co-partner at this project. This number is a prime number that has the lowest rate in dividing a number from 0 to 700 million.

My generation took about 3-4 hours, and the output was something like this :

sorinello@Homeworld:/media/tmp# ls -lah
total 63G
drwxrwxrwx 1 root root 4.0K 2008-05-05 01:42 .
drwxr-xr-x 4 root root 4.0K 2008-05-05 03:22 ..
-rwxrwxrwx 1 root root 18G 2008-04-23 10:46 file_names.txt
-rwxrwxrwx 1 root root 45G 2008-04-23 12:21 file_users.txt

Now you should see some data like this.

sorinello@Homeworld:/media/tmp# tail file_users.txt
699999997;JohnDoe699999997;54c51b468a430edc6049df8ef34ab8b2
699999998;JackRipper699999998;1cc0388178a72b116799e50af869b4d4
699999999;FedoraCore699999999;0536192688993eeb541ed5f1aba16353

Now that we have the two files created, we need to create a database, and the two tables. Assuming you have installed and configured PostgreSql, login with the postgres using psql, and create the database.

Login using the postgres system account. You must create a user for yourself. Template1 is an existing database. You should see something like this :

postgres@Homeworld:/media/tmp$ psql -U postgres -d template1
Welcome to psql 8.2.5, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
template1=#

Now that you are here, you must create your user. CREATEDB means that this user will be allowed to create databases;

template1=# CREATE USER sorinello WITH PASSWORD 'jw8s0F4' CREATEDB;

You should see confirmation :

CREATE ROLE

Now, type \q and quit. Re-login with your newly created user. You will be prompted to type your password :

sorinello@Homeworld:/media/tmp# psql -U sorinello -d template1
Welcome to psql 8.2.5, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
template1=>

Now, let’s create the database. Type :

template1=> CREATE DATABASE engineering;

You should see confirmation :

CREATE DATABASE

Now, type \q and quit. Re-login with your user. This time, you will connect to the engineering database :

sorinello@Homeworld:/media/tmp# psql -U sorinello -d engineering

Now that the database is created, we need to create the two tables. Keep in mind, that you must chose the right dimension for the data fields, otherwise you’ll get out of space.

Tips : Check the longest name you entered in the script file. Remember it’s length. That will be your aid in choosing the right data types and data length.

We will try to optimize this, to gain space. We’ll approach the Worst Case scenario. We want to be sure that all possible combinations will fit, right?

For example : if the longest name from the array is 11 this is how you should do :

  • name field as VARCHAR(11)
  • surname (worst case scenario – the surname will be randomed twice with the longest name : 11 + ‘ ‘ + 11) VARCHAR(23)
  • username (same here, and 9 is the number of digits of any number from the 100 millions to 700 millions : 11 + 11 + 11 + 9) VARCHAR(42)
  • password (the hash function will have always 32 bytes length) CHAR(32)

Doing this, we assured ourselves that the worst scenario will fit.

Now, for creating the two tables, login to postgres console, and type :

engineering=#create table names (id INT PRIMARY KEY, name VARCHAR(12), surname VARCHAR(23));
engineering=#create table users (id INT PRIMARY KEY, username VARCHAR(42) NOT NULL, password CHAR(32) NOT NULL);

Now, we can populate the 2 tables, with the following two commands :

engineering=#COPY names FROM'/path_to_file/filename_names.txt' WITH DELIMITER ';';

engineering=#COPY users FROM'/path'_to_file/filename_users.txt WITH DELIMITER ';';

This should take several hours each. After it is complete, you should get a confirmation, of how many rows have been inserted. It should look something like this :

COPY 700000000

Great, now let’s create an index after usernames. This is the most time consuming process of all the project. It took me about 20 hours to create this index :

engineering=#CREATE INDEX username_users ON users(username);

After this step, all it’s left is to create the relation between the two tables :

engineering=#ALTER TABLE users ADD FOREIGN KEY (id) REFERENCES names(id) ON UPDATE CASCADE ON DELETE RESTRICT;

After it is ready, you’re done. You can start querying the database, and have a little fun with it. Test the query time, or even try some complex queries.

Another method to create this database is using INHERITANCE. More exactly, you will create 26 tables for each of the two tables, and each table will store the users beginning with a letter. This technique is called partitioning. The only problem was that I realized too late is that the COPY command ignores rules created with partitioning. If you want to do it this way, modify my script and instead a CVS file, you’ll get an SQL command file. And you can insert all the data, in one transaction. I will not go in further details, I only want to mention this is another way of doing it.

Estimated time of this project :

  • creating the perl generation script : 2 hours
  • running the generation script : 5 hours
  • populating database, both tables : 12 hours
  • indexing usernames : 21 hours
  • creating relations : 8 hours.
  • TOTAL : 48 hours = 2 days
  • TOTAL database : 179 GB

Personal recommendations :

  1. Make sure you have at least 250 GB for the database
  2. Make sure you have at least 60 GB for the text files
  3. Use latest Kernel
  4. Try optimizing postgres.conf, according to your ammount of RAM.
[del.icio.us] [Digg] [Facebook] [MySpace] [Technorati] [Twitter] [Email]

Popularity: 15% [?]


Pages: 1 2