Tue, Sep 27, 2011
...PVCS!
My workplace uses a variety of software written in a variety of languages. Sadly, this includes some legacy code. Written in Cobol.
Over two million lines of Cobol. In nearly five thousand separate files.
It's version-controlled, of course. The records of all the changes to these files go all the way back to 1995. All faithfully recorded in PVCS.
PVCS differs from the more modern VCSs by being based around single files: one file, one record. You can't say "Show me what the files were like on this date", only "Show me what THAT file was like on this date".
And it's slow. I thought SVN was slow (because it is, compared to git) but it's a neutrino from the LHC compared to PVCS (Wow, topical geek jokes! Go me!)
We're talking the better part of a minute to get a single commit for some of the bigger files.
A while ago, I had to migrate another repo - an RCS one - across to git. This was simple enough, as we didn't care so much about the fine detail of the history, so I could just dump the commits in file by file.
But we need to have a good history of all this legacy Cobol stuff. So I came up with a two-step solution: firstly, get each file's complete commit history into a local database; then use the database to get all the commits in date order and add them to git in that order. This would give us a complete git history, as if we had actually been using git the entire time, along with a massive performance boost and a host of other improvements.
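To make the second step concrete, here is a minimal sketch of the "collect per file, replay by date" idea using plain Perl data structures instead of the database. The file names, versions and timestamps are invented for illustration, and `replay_order` is a hypothetical helper, not part of the real scripts.

```perl
use strict;
use warnings;

# Hypothetical per-file histories, roughly what step one collects.
# File names, versions and epoch timestamps are made up.
my %history = (
    'payroll.cbl' => [
        { version => '1.0', epoch => 800_000_000 },
        { version => '1.1', epoch => 900_000_000 },
    ],
    'ledger.cbl' => [
        { version => '1.0', epoch => 850_000_000 },
    ],
);

# Flatten every file's revisions into one list, then sort by timestamp
# (falling back to the filename so that ties come out in a stable order).
sub replay_order {
    my ($history) = @_;
    my @commits;
    for my $file (sort keys %$history) {
        push @commits, map { +{ %$_, file => $file } } @{ $history->{$file} };
    }
    my @sorted = sort {
        $a->{epoch} <=> $b->{epoch} || $a->{file} cmp $b->{file}
    } @commits;
    return @sorted;
}

printf "%s %s\n", $_->{file}, $_->{version} for replay_order(\%history);
```

With this sample data the `ledger.cbl` commit lands between the two `payroll.cbl` ones, which is exactly the interleaving a per-file tool like PVCS can't show you.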
The main problem was in finding a git equivalent for the PVCS branching model. Let's imagine we have a file, 'file.txt'. We put it into PVCS as version 1.0. We make a minor change, we get 1.1; we make a major change, we go up to 2.0.
Now let's imagine that we need to apply a change to 1.1, but not 2.0. So we check out 1.1 and we branch it. This gives us 1.1.1.0 - when it comes to PVCS branch numbers, as Yoda would put it, "Always two there are" - a branch number and a version number.
Thing is, if you now want ANOTHER branch, you get ANOTHER two digits: 1.1.1.0.1.0
Somehow, I had to make git handle files doing this kind of thing. And to crazy levels: I kid you not, we've had 13.9.1.1.1.0.1.0.1.0.1.0.1.0.1.0.1.0.1.0 as a version number!
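As a rough sketch of the mapping I ended up with: a two-part version stays on master, and anything longer goes on a branch named after the file plus the version minus its last number, forked from the tag of the revision it branched off. `branch_plan` is a hypothetical standalone helper written for this illustration; the naming scheme mirrors the import script further down, but this is not the code that script actually runs.

```perl
use strict;
use warnings;

# Work out where a PVCS revision should live in git.
# Returns a hashref with the branch name and, for branch revisions,
# the tag to fork from and the branch that tag sits on.
sub branch_plan {
    my ($file, $version) = @_;
    my @parts = split /\./, $version;

    # Two components (1.1, 2.0, ...) means the revision is on the trunk
    return +{ branch => 'master' } if @parts <= 2;

    # Drop the last number to get the branch, the last two to get
    # the revision the branch was forked from
    my $branch   = join '.', @parts[0 .. $#parts - 1];
    my $fork_rev = join '.', @parts[0 .. $#parts - 2];

    return +{
        branch   => "${file}_$branch",
        fork_tag => "${file}_$fork_rev",
        parent   => branch_plan($file, $fork_rev)->{branch},
    };
}

for my $version ('1.1', '1.1.1.0', '1.1.1.0.1.0') {
    my $plan = branch_plan('file.txt', $version);
    print "$version -> $plan->{branch}\n";
}
```

So 1.1.1.0 ends up on a branch called `file.txt_1.1.1`, forked from the tag `file.txt_1.1` on master, and each extra pair of numbers just nests one level deeper.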
But I figured it all out, and after a quarter hour to open the log of every file in the repo & parse it into the DB I started the import into git. Six hours later, and we're nearly six thousand commits in!
That's not bad... but there *are* over eighty thousand commits to get through. This is going to take a while...
In the meantime, since it's working surprisingly well, I thought I'd put the scripts that are doing the work online. PVCS isn't heavily-used any more but I've seen a few people asking for ways to get it into git online, so it might be of use to some...
I make no apologies for how ugly some of it is: It works, and that's what counts :P
#!/usr/bin/perl
# Script to migrate PVCS history into DB
use strict;
use warnings;
use local::lib;
use lib "lib";
use DBIx::Class;
use DateTime;
use DB::Schema;
# Create DB if it doesn't already exist
my $db_file = 'files.db';
my $db = get_db();
sub get_db {
    DB::Schema->connect("dbi:SQLite:$db_file");
}
unless (-e $db_file){
    $db->deploy();
    print "Deployed DB from schema\n";
}

# Let's get the files we're after!
my $count;
my @errors;
opendir(my $source, "source") or die "Can't open the 'source' directory: $!";
while (my $file = readdir $source) {
    # Crop out a few unwanted types of file
    next if $file =~ m#^\.+$#; ## Skip . and ..
    next if $file =~ m#swp#;

    # Get the history
    my @history = `nice -10 vlog $file 2>&1`;

    # If this isn't a PVCS-managed file, skip it
    $count++;
    print "$count $file\n";
    unless ($history[2]){
        # Too little output to be a real log: note the file and move on
        push @errors, $file;
        next;
    }
    next if $history[2] =~ m#^vlog: warning, can't locate archive#;

    # Check if this file is in the database yet, create it if not
    my $files_rs = $db->resultset('Files')->find_or_create({ filename => $file });
    my $file_id = $files_rs->id;

    # Process the history
    my $i;
    my $commit_count;
    for my $line (@history){
        $i++;
        next unless $line =~ m#-----------------------------------#;
        # Guard against duplicate dotted-lines
        next if $history[$i] =~ m#-----------------------------------#;

        # Entries almost always have the same format:
        #-----------------------------------
        #Rev NNNN
        #Checked in: Ddd Mmm xx hh:mm:ss yyyy
        #Last modified: Ddd Mmm xx hh:mm:ss yyyy
        #Author id: AAAA lines deleted/added/moved: x/x/x
        #MESSAGE
        # So we can just grab what we want from the next few lines:
        # However, they may be locked, which we have to account for:
        my $j=0;
        $j = 1 if $history[$i+1] =~ m#Locked by#;

        # And onto the content:
        my ($rev) = $history[$i] =~ m#^\s*Rev\s*(.*)\n#;
        my ($date) = $history[$i+1+$j] =~ m#^\s*Checked in:\s*\w{3}\s(.*)\n#;
        my ($author) = $history[$i+3+$j] =~ m#^\s*Author id:\s*(\w*)#;
        my ($message) = $history[$i+4+$j] =~ m#^\s*(.*)\n#;

        # Convert date into DateTime object
        # (\s+ rather than \s, so a run of spaces can't produce an empty field)
        $date =~ s#\s+#:#g;
        my ($month, $day, $hour, $minute, $second, $year) = split(':', $date);
        my %months = (
            Jan => 1,
            Feb => 2,
            Mar => 3,
            Apr => 4,
            May => 5,
            Jun => 6,
            Jul => 7,
            Aug => 8,
            Sep => 9,
            Oct => 10,
            Nov => 11,
            Dec => 12,
        );
        $month = $months{$month};
        $date = DateTime->new(
            year => $year,
            month => $month,
            day => $day,
            hour => $hour,
            minute => $minute,
            second => $second,
        );

        # We have all the data needed for the commit. Put it into the database
        eval {
            $db->resultset('Commits')->find_or_create({
                file => $file_id,
                author => $author,
                message => $message,
                version => $rev,
                date_time => $date->epoch,
            })->update;
            $commit_count++;
        };
        if ($@ || !$commit_count){
            print "Error: ".$@."\n";
            print @history;
            die;
        }
    }
}
print "Unsuccessful files: ".@errors."\n";
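The date handling in that script is the fiddliest bit, so here is a standalone sketch of the same conversion using only core modules. It assumes the `Ddd Mmm xx hh:mm:ss yyyy` layout shown in the comments above and treats the time as UTC, which is what calling `epoch` on a DateTime built without a time zone does. `pvcs_date_to_epoch` is a hypothetical helper and the sample timestamp is made up.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# timegm wants months numbered 0-11
my %MONTH = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Turn "Ddd Mmm xx hh:mm:ss yyyy" into epoch seconds (UTC)
sub pvcs_date_to_epoch {
    my ($stamp) = @_;

    # split ' ' skips leading whitespace and collapses runs of spaces,
    # so a space-padded single-digit day would still parse
    my ($dow, $mon, $day, $time, $year) = split ' ', $stamp;
    die "Can't parse date '$stamp'\n"
        unless defined $year && exists $MONTH{$mon};

    my ($hour, $minute, $second) = split /:/, $time;
    return timegm($second, $minute, $hour, $day, $MONTH{$mon}, $year);
}

# Made-up sample timestamp in the vlog layout
my $epoch = pvcs_date_to_epoch('Tue Sep 27 14:05:09 2011');
print scalar(gmtime $epoch), "\n";
```

Whether vlog really pads single-digit days with an extra space is something I haven't checked; splitting on runs of whitespace just means it wouldn't matter either way.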

#!/usr/bin/perl
# Script to replay the commit history from the DB into git
use strict;
use warnings;
use local::lib;
use lib "lib";
use DBIx::Class;
use DateTime;
use Git::Repository;
use DB::Schema;

# Connect to DB
my $db_file = 'files.db';
my $db = get_db();
sub get_db {
    DB::Schema->connect("dbi:SQLite:$db_file");
}

# Open logfile
open (my $log, ">", "log.txt") or die "Can't open log.txt: $!";
print $log "Start: ".`date`;

# Get a resultset from the DB in chronological order
my $rows = $db->resultset('Commits')->search(undef,{ order_by => 'date_time' });

# Create a git history for all files in date order
# Initialise git (on first run only!)
#`rm -rf files/.git`;
#Git::Repository->run( init => '/home/djh/git/files' );
my $repo = Git::Repository->new( work_tree => '/home/djh/git/files' );

# Get existing branches in case we're re-entrant
my @branches = $repo->run('branch');
my %branches;
for my $branch (@branches){
    $branch =~ s#^..(.+)#$1#;
    print "Existing branch: $branch\n";
    $branches{$branch} = 1;
}

while (my $row = $rows->next){
    # Get file details
    print $row->file->filename." ".$row->version."\n";
    next if $row->committed;
    my $file = $row->file->filename;
    my $author = $row->author;
    my $message = $row->message;
    my $version = $row->version;
    my $date_time = DateTime->from_epoch( epoch => $row->date_time);

    # Log details
    my $now = `date`;
    chomp $now;
    print $log "$now - Committing $file version $version to branch ";

    # Switch branches if necessary
    my $branch = 'master';
    if ($version =~ m#\..*\.#){
        $branch = $file."_".$version;
        # Crop out the last digit to get the branchname we need
        $branch =~ s#(.*)\.\d+#$1#;
        # Have we already created this branch?
        if ($branches{$branch}){
            # Yes - check it out
            $repo->run( checkout => $branch );
        }
        else {
            # No - better create it then!
            # We *must* already have the parent branch - no need to check
            my ($parent_branch) = $branch =~ m#(.*)\.\d+\.\d+#;
            # Parent branch is either master or contains version number(s)
            $parent_branch = 'master' unless $parent_branch =~ m#\..*\.#;
            # Check we're not already on the right branch
            my $current_branch = `git symbolic-ref HEAD`;
            $current_branch =~ s#refs/heads/(\S)#$1#;
            chomp $current_branch;
            $repo->run( checkout => $parent_branch ) unless $current_branch eq $parent_branch;
            # First, store the current HEAD sha so we can get back to it
            my $head = $repo->run( 'rev-parse' => 'HEAD' );
            # Next get back to the commit we want to base the branch off
            my ($parent_commit) = $branch =~ m#(.*)\.\d+#;
            $repo->run( 'reset', '--hard', $parent_commit );
            # Create the branch
            $repo->run( 'checkout', '-b', $branch );
            $branches{$branch} = 1;
            # Go back to the parent branch, restore its head, then go back to current branch
            $repo->run( checkout => $parent_branch );
            $repo->run( 'reset', '--hard', $head );
            $repo->run( checkout => $branch );
        }
    }

    # Log the branch we're going to commit to
    print $log "$branch\n";

    # Get the right version of the file
    # (chdir to the repo's work tree, so this still works on every pass of the loop)
    chdir $repo->work_tree or die "Can't chdir to the work tree: $!";
    system("rm -f $file") if -e $file;
    # Handle failed pgets by looping (a sane number of times)
    my $pget_attempt = 0;
    while (!-e $file && $pget_attempt < 10){
        $pget_attempt++;
        # Use local copy of repo file if available
        if (-e "$file.v"){
            my $pget = `nice -10 pget -r$version ./$file.v 2>&1`;
        }
        else {
            my $pget = `nice -10 pget -r$version $file 2>&1`;
        }
    }
    system("chmod 777 $file");

    # Get it into git
    eval {
        $repo->run( add => $file );
        my $resp = $repo->run(
            commit => $file,
            '-m' => "$file $version - $message",
            '--author' => "$author <$author\@example.com>",
            '--date' => $date_time->ymd."T".$date_time->hms,
        );
        # Log result
        print $log "Result: $resp\n";
        $repo->run(
            'tag' => $file."_".$version,
            '-m' => "$file $version",
        );
        my $current_branch = `git symbolic-ref HEAD`;
        $current_branch =~ s#refs/heads/(\S)#$1#;
        chomp $current_branch;
        $repo->run( checkout => 'master' ) unless $current_branch eq 'master';
        # End of this log entry
        print $log '-'x80;
        print $log "\n";
        $row->committed(1);
        $row->update;
    };
    print $log "Error with adding to git: $@\n" if $@;
}
print $log `date`;
close $log;