Weird compiler problem

I wanted to write about a really weird problem I recently had while debugging in C++ (technically, it’s all C). Unfortunately, I was doing this in kernel debugging mode, which made life a bit harder, but it would have happened the same in userland.

I had an .hpp file (we’ll call it process_internal.hpp) that was originally an internal file meant to be included only from one .cpp file (we’ll call it process.cpp), so it contained definitions of global variables. I ended up needing to include this process_internal.hpp file elsewhere (for testing; we’ll call it test.cpp). Because of this, the same symbols were defined in multiple files, so the separate .o builds were not properly interacting. I ended up using “#ifdef”s to only include the parts I needed in the test.cpp file, and declaring the global variables as “extern” for it. It looked something like the following:

enum { FT_Inbound, FT_Outbound };
typedef struct FilteringLayer {
	int FilterTypeNum, OriginalID;
	const char *Name;
} FilteringLayer;
const int FT_NumTypes=2;

#ifdef _PROCESS_INTERNAL
	FilteringLayer FilterTypes[FT_NumTypes]={
		{FT_Inbound,  5, "Inbound" },
		{FT_Outbound, 8, "Outbound"},
	};
#else
	extern "C" FilteringLayer *FilterTypes;
#endif

So I was accessing this variable in test.cpp and getting a really weird problem. The code looked something like this:

struct foo { int a, b; };
foo Stuff[]={...};
void FunctionBar()
{
	for(int i=0;i<FT_NumTypes;i++)
		Stuff[FilterTypes[i].OriginalID].b=1;
}

This was causing an access exception, which blue screened my debug VM. I tried running the exact same statements in the Visual Studio debugger, and things were working just as they were supposed to! So I decided to go down to the assembly level. It looked something like this (I included descriptions):

L#  Code                               Description                   Combined description
    for(int i=0;i<FT_NumTypes;i++)
1   mov qword ptr [rsp+58h],0          int i=0
2   jmp MODULENAME!FunctionBar+0xef    JUMP TO #LINE@6
3   mov rax,qword ptr [rsp+58h]        RAX=i
4   inc rax                            RAX++                         i++
5   mov qword ptr [rsp+58h],rax        i=RAX
6   cmp qword ptr [rsp+58h],02h        CMP=(i-FT_NumTypes)
7   jae MODULENAME!FunctionBar+0x11e   IF(CMP>=0) GOTO #LINE@15      if(i>=FT_NumTypes) GOTO #LINE@15
    Stuff[FilterTypes[i].OriginalID].b=1;
8   imul rax,qword ptr [rsp+58h],10h   RAX=i*sizeof(FilteringLayer)
9   mov rcx,[MODULENAME!FilterTypes]   RCX=*(void**)&FilterTypes
10  movzx eax,word ptr [rcx+rax+4]     RAX=*(UINT16*)(RCX+RAX+4)     RAX=((FilteringLayer*)&FilterTypes)[i].OriginalID
11  imul rax,rax,30h                   RAX*=sizeof(foo)
12  lea rcx,[MODULENAME!Stuff]         RCX=(void*)&Stuff
13  mov dword ptr [rcx+rax+04h],1      *(UINT32*)(RCX+RAX+4)=1       Stuff[FilterTypes[i].OriginalID].b=1
14  jmp MODULENAME!FunctionBar+0xe2    GOTO #LINE@3
15  ...

I noticed that line #9 was putting 0x0000000C`00000000 into RCX instead of &FilterTypes, and I knew the instruction should have been an “lea” instead of a “mov” to fix it. My first thought was a compiler bug, but as many programming mantras say, that is very, very rarely the case. If you want to guess what the problem is, now is the time; I’ve given you all the information (and more) needed to make the guess.



The answer: extern "C" FilteringLayer *FilterTypes; should have been extern "C" FilteringLayer FilterTypes[];. Oops! The debugger was getting it right because it had the extra information of the real definition of the FilterTypes variable.
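
To illustrate why the distinction matters: an array declaration tells the compiler the symbol’s address is the start of the data (hence the “lea”), while a pointer declaration tells it to load the pointer’s value from the symbol’s address (hence the “mov”). A minimal sketch with hypothetical names, split across two translation units:

// process.cpp - the real definition; the symbol's address IS the data
int FilterIDs[2]={5, 8};

// test.cpp - the two possible declarations
//extern int *FilterIDs; // WRONG: code loads the first 8 bytes of the array data
//                       // (0x00000008`00000005 on x64) and dereferences THAT
extern int FilterIDs[];  // RIGHT: code uses the symbol's address directly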

Monitoring PHP calls

I recently had a Linux client that was, for whatever odd reason, making infinite recursive HTTP calls to a single script, which was making the server’s process count skyrocket. I decided to use the same module as I did in my Painless migration from PHP MySQL to MySQLi post, which is to say, overriding base functions for fun and profit using the PHP runkit extension. I did this so I could gather, for debugging, logs of when and where the offending calls were occurring.


The below code overrides all functions listed on the line that says “List of functions to intercept” [Line 9]. It works by first renaming these built-in functions to “OVERRIDE_$FuncName” [Line 12], and replacing them with a call to “GlobalRunFunc()” [Line 13], which receives the original function name and argument list. The GlobalRunFunc():

  1. Checks to see if it is interested in logging the call
    • In the case of this example, it will log the call if [Line 20]:
      • Line 21: curl_setopt is called with the CURLOPT_URL parameter (enum=10002)
      • Line 22: curl_init is called with a first parameter, which would be a URL
      • Line 23: file_get_contents or fopen is called and is not an absolute path
        (Wordpress calls everything absolutely. Normally I would have only checked for http[s] calls).
    • If it does want to log the call, it stores it in a global array (which holds all the calls we will want to log).
      The logged data includes [Line 25]:
      • The function name
      • The function parameters
      • 2 frames of backtrace (full backtraces can often get quite large when stored in the log file)
  2. It then calls the original function, with parameters intact, and passes through the return [Line 27].

The “GlobalShutdown()” [Line 30] is then called when the script is closing [Line 38] and saves all the logs, if any exist, to “$GlobalLogDir/$DATETIME.srl”.

I have it using “serialize()” to encode the log data [Line 25], as opposed to “json_encode()” or “print_r()” calls, as the latter were getting too large for the logs. You may want to have it use one of these other encoding functions for easier log perusal, if running out of space is not a concern.

<?php
//The log data to save is stored here
global $GlobalLogArr, $GlobalLogDir;
$GlobalLogArr=Array();
$GlobalLogDir='./LOG_DIRECTORY_NAME';

//Override the functions here to instead have them call to GlobalRunFunc, which will in turn call the original functions
foreach(Array(
        'fopen', 'file_get_contents', 'curl_init', 'curl_setopt', //List of functions to intercept
) as $FuncName)
{
        runkit_function_rename($FuncName, "OVERRIDE_$FuncName");
        runkit_function_add($FuncName, '', "return GlobalRunFunc('$FuncName', func_get_args());");
}

//Called in place of each intercepted function: logs calls of interest, then forwards to the original
function GlobalRunFunc($FuncName, $Args)
{
        global $GlobalLogArr;
        if(
                ($FuncName=='curl_setopt' && $Args[1]==10002) || //CURLOPT enumeration can be found at https://curl.haxx.se/mail/archive-2004-07/0100.html
                ($FuncName=='curl_init' && isset($Args[0])) ||
                (($FuncName=='file_get_contents' || $FuncName=='fopen') && $Args[0][0]!='/')
        )
                $GlobalLogArr[]=serialize(Array('FuncName'=>$FuncName, 'Args'=>$Args, 'Trace'=>array_slice(debug_backtrace(), 1, 2)));

        return call_user_func_array("OVERRIDE_$FuncName", $Args);
}

function GlobalShutdown()
{
        global $GlobalLogArr, $GlobalLogDir;
        $Time=microtime(true);
        if(count($GlobalLogArr))
                file_put_contents($GlobalLogDir.'/'.date('Y-m-d_H:i:s.'.substr($Time-floor($Time), 2, 3), floor($Time)).'.srl', implode("\n", $GlobalLogArr));

}
register_shutdown_function('GlobalShutdown');
?>
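
For perusing the saved logs later, a quick companion script might look like the following (hypothetical, not part of the original setup; it assumes the .srl files were written by GlobalShutdown() above):

<?php
//Hypothetical log dumper: prints every saved call record for review
$GlobalLogDir='./LOG_DIRECTORY_NAME';
foreach(glob("$GlobalLogDir/*.srl") as $LogFile)
{
        print "=== $LogFile ===\n";
        //Each line is one serialize()d record (this assumes no raw newlines inside the logged data)
        foreach(file($LogFile, FILE_IGNORE_NEW_LINES) as $Line)
                print_r(unserialize($Line));
}
?>
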
XCode Compiler Fail
Now this is just ridiculous

While I’ve been encountering more bugs than I can count on both hands while working with XCode, this one takes the cake. Clang (the compiler) was throwing the following errors while trying to compile one of the project’s Objective-C source files (.m extension).


clang: error: unable to execute command: Segmentation fault: 11
clang: error: clang frontend command failed due to signal (use -v to see invocation)
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: arm-apple-darwin14.5.0
Thread model: posix
clang: note: diagnostic msg: PLEASE submit a bug report to http://developer.apple.com/bugreporter/ and include the crash backtrace, preprocessed source, and associated run script.
clang: error: unable to execute command: Segmentation fault: 11
clang: note: diagnostic msg: Error generating preprocessed source(s).

The fix... was to keep the specific source file open in an XCode window ~.~ . How the heck do you integrate the [CLI] compiler so deeply into the IDE that this could happen? Or is this simply a weird file system thing? I should note that my XCode project directory, with all files, is located on a VMware shared volume.

iGoogle Security Problems
For a company that stresses security...

I’ve recently been having problems using the Google Reader widget in iGoogle. Normally, when I clicked on an RSS title, a “bubble” popped up with the post’s content. Recently, however, clicking on a title opened the original post’s source in a new tab. I confirmed the settings for the widget were correct, so I tried to remember the last change I had made in Firefox that could have triggered the problem, as the problem did not seem widespread, having only hit a few other people, with no solution found. I then remembered that a little while back I had installed the HTTPS Everywhere Firefox plugin. As described on the EFF’s site, “HTTPS Everywhere is a Firefox extension ... [that] encrypts your communications with a number of major websites”.

Once I disabled the plugin and found the problem went away, I started digging through Google’s JavaScript code with FireBug. It turns out the root of the problem was that the widgets in iGoogle run in their own IFrames (which is a very secure way of doing a widget system like this). However, the Google Reader contents were being pulled in over secure HTTPS channels (as they should be, thanks to HTTPS Everywhere), while the iGoogle page itself was pulled in over a normal HTTP channel! Separate windows/frames/tabs cannot interact with each other through JavaScript if they are not part of the same domain and protocol (HTTP/HTTPS); this restriction prevents cross-site scripting attacks.

I was wondering why HTTPS Everywhere was not running iGoogle itself through an HTTPS channel, so I tried it myself and found out Google automatically redirects HTTPS iGoogle requests back to non-secure HTTP channels! So much for having a proper security model in place...

So I did a lot more digging and modifying of Google’s code to see if I could find exactly where the problem was occurring and whether it could be fixed with a hack. It seems the code that handles the RSS title clicking is injected during the “onload” event of the widget’s IFrame. I believe this was the code hitting the security privilege error that made things not work. I attempted to hijack the Google Reader widget’s onload function and add special privileges using “netscape.security.PrivilegeManager.enablePrivilege”, but it didn’t seem to help. I think with some more prodding I could have gotten it working, but I didn’t want to waste any more time than I already had on the problem.

The code that would normally be loaded into the widget’s IFrame window hooks the “onclick” event of all RSS Title links to both perform the bubble action and cancel the normal “click” action. Since the normal click action for the anchor links was not being canceled, the browser action of following the link occurred. In this case, the links also had a “target” set to open a new window/tab.


There is however a “fix” for this problem, though I don’t find it ideal. If you edit the “extensions\https-everywhere@eff.org\chrome\content\rules\GoogleServices.xml” file in your Firefox profile directory (most likely at “C:\Users\USERNAME\AppData\Roaming\Mozilla\Firefox\Profiles\PROFILENAME\” if running Windows 7), you can comment out or delete the following rule so Google Reader is no longer run through secure HTTPS channels:

<rule from="^http://(www\.)?google\.com/reader/" 
to="https://www.google.com/reader/"/>

That being said, I’ve been having a plethora of problems with Facebook and HTTPS Everywhere too :-\ (which it actually mentions might happen in its options dialog). You’d think the largest sites on the Internet could figure out how to get their security right, but either they don’t care (the more likely option), or they don’t want the encryption overhead. Alas.

CSharp error failure
Why does Microsoft always have to make everything so hard?

I was running into a rather nasty .NET crash today in C#, in a rather large project I have been continuing development on for a handheld device that runs Windows CE 6. When I called a callback function pointer (called a Delegate in .NET land) from a module, I was getting a TypeLoadException error with no further information. I started out making the incorrect assumption that I was doing something wrong with Delegates, as C# is not exactly my primary language ;-). The symptoms pointed to the delegate call itself being the problem: the program crashed when the code reached the call, and never made it into the callback function. After doing the normal debugging-thing, I found that the program crashed in the same manner every time the specific callback function was called, before it started executing, even when it was called in a normal fashion from the same class.

After further poking around, I realized there was one line of code that, if included in any function, would cause the program to fail out on calling said function. Basically, resources were somehow missing from the compilation, and there were no warnings anywhere telling me this. If I tried to access the resource normally, I got an easily traceable MissingManifestResourceException error. The weird situation arose because the missing resource was being accessed from a static member of another class. So here is some example code that was causing the problem:

public class ClassA
{
	public void PlaySuccess()
	{
		//Execution DOES NOT reach here
		Sound.Play(Sound.Success);
	}
}

public class Sound
{
	public static byte[] Success=MyResource.Success; //This resource is somehow missing from the executable
	public static byte[] Failure=MyResource.Failure;
	public static void Play(byte[] TheSound) { sndPlaySound(TheSound, SND_ASYNC|SND_MEMORY); }
}

ClassA Foo=new ClassA();
//Execution reaches here
Foo.PlaySuccess();
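
For what it’s worth, on desktop .NET a failing static initializer normally surfaces as a TypeInitializationException that wraps the real error, so a guard like the following sketch (hypothetical; the Compact Framework clearly behaved differently here, giving the bare TypeLoadException) might have exposed the MissingManifestResourceException directly:

try
{
	Foo.PlaySuccess(); //The first use of Sound runs its static field initializers
}
catch (TypeInitializationException E)
{
	//The InnerException would hold the MissingManifestResourceException
	Console.WriteLine(E.InnerException);
}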

Oh well, at least it wasn’t an array overrun, those are fun to track down :-).

Legacy Geocities Login Problems
Big corporations refusing to acknowledge that they have problems, let alone fix them

I have a friend that has a legacy Geocities (the MySpace of the 1990s for free web hosting) account (one from who knows how long before GeoCities was bought by Yahoo). The control panel (at geocities.yahoo.com/gcp) won’t allow logging in to his legacy account because it gets stuck in an infinite redirect loop, redirecting right back to itself.

My guess is that the problem has to do with cookies (on Geocities’ servers’ side, not the client’s!), but I didn’t get that far, as I found a roundabout solution to his problem. After logging in, the user can go to http://geocities.yahoo.com/filemanager or http://geocities.yahoo.com/v/fm.html to manage their files. While the rest of the control panel is still not accessible, this was enough of a solution for him.

Reports are that Yahoo refuses to respond about this problem with their servers.

XML Problems in PHP
I hate debugging other peoples’ libraries :-\

We recently moved one of our important web server clients to a newly acquired server (our 12th server at ThePlanet [used to be called EV1Servers, and before that RackShack], one of the largest, if not the largest, server-farm hosting companies in the States). A bad problem cropped up on his site in the form of a PHP script (CaRP) that deals with parsing XML.

The problem was that whenever XML was parsed and then returned, all XML entities (escaped XML characters like “&gt;”, “&lt;”, and “&quot;”) were removed/deleted. I figured the problem had to do with a bad library, as the code worked perfectly on our old server and the PHP settings on both were almost identical, but I wasn’t sure which library. After an hour or two of manipulating the code and debugging, I narrowed down which XML function calls had the problem, and confirmed it was definitely not the scripts themselves. The following code demonstrates the problem:

$MyXMLData='<?xml version="1.0" encoding="iso-8859-1"?><description>&lt;img test=&quot;a&quot;</description>';
$MyXml=xml_parser_create(strtoupper('ISO-8859-1'));
xml_parser_set_option($MyXml,XML_OPTION_TARGET_ENCODING,'ISO-8859-1');
xml_parse_into_struct($MyXml, $MyXMLData, $MyData);
print htmlentities($MyData[0]['value']);
On the server with the problem, the following was outputted:
img test=a
while it should have outputted the following:
<img test="a"

I went with a hunch at this point and figured it might be the system’s LibXML libraries, so I repointed them away from version 2.7.1, which appears to be buggy, to an older version that was also on the system, 2.6.32. And lo and behold, things were working again, yay :-D.

Technical data (this is a cPanel install): in “/opt/xml2/lib/”, delete the symbolic links “libxml2.so” & “libxml2.so.2” and recreate them as symbolic links to “libxml2.so.2.6.32” instead of “libxml2.so.2.7.1”.
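
In other words, something along these lines (a sketch; double-check the actual file names on your system before deleting anything):

cd /opt/xml2/lib
rm libxml2.so libxml2.so.2
ln -s libxml2.so.2.6.32 libxml2.so
ln -s libxml2.so.2.6.32 libxml2.so.2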

Linux Runlevels
“Safe Mode” for Linux

I am still, very unfortunately, looking into the problem I talked about way back here :-( [not a lot, but it still persists]. This time I decided to try and boot the OS into a “safe mode” with nothing running that could hinder performance tests (like hundreds of HTTP and MySQL sessions). Fortunately, my friend who is a Linux server admin for a tech firm was able to point me in the right direction after my own researching of the topic proved frustratingly fruitless.


Linux has “runlevels” it can run at, which are listed in “/etc/inittab” as follows:

# Default runlevel. The runlevels used by RHS are:
#   0 - halt (Do NOT set initdefault to this)
#   1 - Single user mode
#   2 - Multiuser, without NFS (The same as 3, if you do not have networking)
#   3 - Full multiuser mode
#   4 - unused
#   5 - X11
#   6 - reboot (Do NOT set initdefault to this)

So I needed to get into “Single user mode” to run the tests, which could be done two ways. Before I tell you how though, it is important to note that if you are trying to do something like this remotely, normal SSH/Telnet will not be accessible, so you will need either physical access to the computer, or something like a serial console connection, which can be routed through networks.

So the two ways are:
  • Through the “init” command. Running “init #” at the console, where # is the runlevel number, will bring you into that runlevel. However, this might not kill all currently unneeded running processes when going to a lower level, but it should get the majority of them, I believe.
  • Append “s” (for single user mode) to the grub configuration file (/boot/grub/grub.conf on my system) at the end of the line starting with “kernel”, then reboot. I am told appending a runlevel number may also work.
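
For example, a grub.conf kernel line with the appended “s” might look like the following (the kernel version and root device here are made up; only the trailing “s” is the addition):

kernel /vmlinuz-2.6.18-92.el5 ro root=/dev/VolGroup00/LogVol00 s
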
Core Dump Files
Not all OSs crash in the same way :-)

If you ever find a file named “core.#” when running Linux, where # is replaced by a number, it means something crashed at some point. Most of the time, you will probably just want to delete the file, but sometimes you may wonder what crashed. To find out, you use gdb (the GNU debugger), a very powerful tool, to analyze the core dump file.

gdb --core=COREFILENAME

Near the very bottom of the blob of outputted text after running this command, you should see a line that says “Core was generated by `...'.”. This tells you the command line of what crashed. To exit gdb, enter “quit”. You can also use gdb to find out what actually happened and troubleshoot/debug the problem, but that’s a very long and complex topic.
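
For example, a quick session to identify the culprit might look like this (the core file name is hypothetical; “bt” prints the backtrace of the crashed process):

gdb --core=core.12345
(gdb) bt
(gdb) quit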


Recently, I started seeing hundreds of core dump files taking up gigabytes of space in “/usr/local/cpanel/whostmgr/docroot/” on several of our web servers. According to several online sources, it seems cPanel (web hosting made easy!) likes to dump many, if not all, of its programs' core files into this directory. In our case, it has been “dnsadmin” doing the crashing. We’ve been having some pretty major DNS problems lately, this time on the name server level, so I may have to rebuild our DNS cluster in the next few days. Joy.

Diagnosing DNS Problems
Digging until you find the root

Yesterday I wrote a bit about the DNS system being rather fussy, so I thought today I’d go a bit more into how DNS works, and some good tools for problem solving in this area.


First, some technical background on the subject is required.
  • A network is simply a group of computers hooked together to communicate with each other. In the old days, all networking was done through physical wires (called the medium), but nowadays much of it is done through wireless connections. Wired networking is still required for the fastest communications, and is especially important for major backbones (the super highly utilized lines that connect networks together across the world).
  • A LAN is a local network of all computers connected together in one physical location, whether it be a single room, a building, or a city. Technically, a LAN doesn’t have to be localized in one area, but it is preferred, and we will just assume it is so for argument’s sake :-).
  • A WAN is a Wide (Area) Network that connects multiple LANs together. This is what the Internet is.
  • The way one computer finds another computer on a network is through its IP Address [hereby referred to as IPs in this post only]. There are other protocols, but this (TCP/IP) is by far the most widely utilized and is the true backbone of the Internet. IPs are like a house’s address (123 Fake Street, Theoretical City, Made Up Country). To explain it in a very simplified manner (this isn’t even remotely accurate, as networking is a complicated topic, but it is a good generalization), IPs have 4 sections of numbers ranging from 0-255 (1 byte each). For example, 67.45.32.28 is a (version 4) IP. Each number in that address is a broader location, so the “28” is like a street address, “32” is the street, “45” is the city, and “67” is the country. When you send a packet from your computer, it goes to your local (street) router, which then passes it to the city router and so on until it reaches its destination. If you are in the same city as the packet’s final destination, then it wouldn’t have to go up to the country level.
  • The final important part of networking (for this post) is the domain name system (DNS) itself. A domain is a label for an IP address, like referring to “1600 Pennsylvania Avenue” as “The White House”. As an example, “www.castledragmire.com” just maps to my web server at “209.85.115.128” (this is the current IP; it will change if the site is ever moved to a new server).

Next is a brief lesson on how DNS itself works:
  • The root DNS servers (a.root-servers.net through m.root-servers.net) point to the servers that hold top-level-domain information (.com, .org, .net, .jp, etc.)
    Examples of these servers are as follows:
    .au          ns1.audns.net.au
    .biz         E.GTLD.biz
    .ca          CA04.CIRA.ca
    .cn          A.DNS.cn
    .com & .net  A.GTLD-SERVERS.NET
    .de          Z.NIC.de
    .eu          U.NIC.eu
    .info        B9.INFO.AFILIAS-NST.ORG
    .org         TLD1.ULTRADNS.NET
    .tv          C5.NSTLD.COM
  • Next, these top-level-domain servers (like A.GTLD-SERVERS.NET through M.GTLD-SERVERS.NET for .com) hold two main pieces of information for ALL domains under their top-level-domain jurisdiction:
    • The registrar where the domain was registered
    • The name server(s) that are responsible for the domain
    Only registrars can talk to these top-level-domain servers, so you have to go through your registrar to change the name server information.
  • The final lowest rung in the DNS hierarchy is name servers. Name servers hold all the actual addressing information for a domain and can be run by anyone. The 2 most important (or maybe relevant is a better word...) types of DNS records are:
    • A: There should be many of these, each pointing a domain or subdomain (castledragmire.com, www.castledragmire.com, info.castledragmire.com, ...) to a specific IP address (version 4)
    • SOA: Start of Authority - There is only one of these records per domain, and it specifies authoritative information including the primary name server, the domain administrator’s email, the domain serial number, and several timeout values relating to refreshing domain information.
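
As an illustration of those two record types, here is roughly how they might look in a BIND-style zone file (a hypothetical sketch assembled from this domain’s actual SOA values, which appear in the nslookup output below):

castledragmire.com.      IN  SOA  ns3.deltaarc.com. admins.deltaarc.net. (
                                  2007022713 ; serial
                                  14400      ; refresh (4 hours)
                                  7200       ; retry (2 hours)
                                  3600000    ; expire (41 days 16 hours)
                                  86400 )    ; default TTL (1 day)
www.castledragmire.com.  IN  A    209.85.115.128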

Now that we have all the basics down, on to the actual reason for this post. It’s really a nuisance trying to explain to people why their domain isn’t working, or is pointing to the wrong place. So here’s why it happens!

Back in the old days, it often took days for DNS propagation to happen after you made changes at your registrar or elsewhere, but fortunately, this problem is mostly a thing of the past. The reason for the old delays is that ISPs and routers cached domain lookups and only refreshed them according to the metrics in the SOA record mentioned above, as they were supposed to. This was done for network speed reasons, as ISPs didn’t want to look up the address for a domain every time it was requested (and, I speculate, older OSs might not have cached domains themselves). Now, though, I rarely see caching on any level except the local computer: not only does the OS cache, but even some programs cache domains themselves, like Firefox.

So the answer for when a person is getting the wrong address for a domain, and you know it is set correctly, is usually to just reboot. Clearing the DNS cache works too (for the OS level), but explaining how to do that is harder than saying “just reboot” ^_^;.

To clear the DNS cache in XP, enter the following into your “run” menu or at the command prompt: “ipconfig /flushdns”. This does not ALWAYS work, but it usually does.


If your domain is still resolving to the wrong address after your DNS cache is cleared, the next step is to see which name servers are providing the information. You can do a whois on your domain to get the information directly from the registrar who controls the domain, but be careful where you do this, as you never know what people are doing with the information. For a quick and secure whois, you can use “whois” from your Linux command line, which I have patched through to a web script here. This script gives both normal and extended information, FYI.

Whois just tells you the name servers that you SHOULD be contacting; it doesn’t mean these are the ones you are actually asking, as the root DNS servers may not have updated the information yet. This is where our command line programs come into play.

In XP, you can use “nslookup -query=hinfo DOMAINNAME” and “nslookup -query=soa DOMAINNAME” to get a domain’s name servers, and then “nslookup NAMESERVER DOMAINNAME” to get the IP the name server points to. For example:

C:\>nslookup -query=hinfo castledragmire.com
Server:  dns-redirect-lb-01.texas.rr.com
Address:  24.93.41.127

castledragmire.com
        primary name server = ns3.deltaarc.com
        responsible mail addr = admins.deltaarc.net
        serial  = 2007022713
        refresh = 14400 (4 hours)
        retry   = 7200 (2 hours)
        expire  = 3600000 (41 days 16 hours)
        default TTL = 86400 (1 day)

C:\>nslookup -query=soa castledragmire.com
Server:  dns-redirect-lb-01.texas.rr.com
Address:  24.93.41.127

Non-authoritative answer:
castledragmire.com
        primary name server = ns3.deltaarc.com
        responsible mail addr = admins.deltaarc.net
        serial  = 2007022713
        refresh = 14400 (4 hours)
        retry   = 7200 (2 hours)
        expire  = 3600000 (41 days 16 hours)
        default TTL = 86400 (1 day)

castledragmire.com      nameserver = ns4.deltaarc.com
castledragmire.com      nameserver = ns3.deltaarc.com
ns3.deltaarc.com        internet address = 216.127.92.71

C:\>nslookup ns3.deltaarc.com castledragmire.com
Server:  ev1s-209-85-115-128.theplanet.com
Address:  209.85.115.128

Name:    ns3.deltaarc.com
Address:  216.127.92.71

Nslookup is also available in Linux, but Linux has a better tool for this: dig. nslookup itself doesn’t always seem to give the correct answers, for some reason, so I recommend you use dig if you have it or Linux available to you. With dig, we just start at the root name servers and work our way up to the SOA name server to get the real information on where the domain is resolving to and why.

root@www [~]# dig @a.root-servers.net castledragmire.com

; <<>> DiG 9.2.4 <<>> @a.root-servers.net castledragmire.com
; (2 servers found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5587
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 14

;; QUESTION SECTION:
;castledragmire.com.            IN      A

;; AUTHORITY SECTION:
com.                    172800  IN      NS      H.GTLD-SERVERS.NET.
com.                    172800  IN      NS      I.GTLD-SERVERS.NET.
com.                    172800  IN      NS      J.GTLD-SERVERS.NET.
com.                    172800  IN      NS      K.GTLD-SERVERS.NET.
com.                    172800  IN      NS      L.GTLD-SERVERS.NET.
com.                    172800  IN      NS      M.GTLD-SERVERS.NET.
com.                    172800  IN      NS      A.GTLD-SERVERS.NET.
com.                    172800  IN      NS      B.GTLD-SERVERS.NET.
com.                    172800  IN      NS      C.GTLD-SERVERS.NET.
com.                    172800  IN      NS      D.GTLD-SERVERS.NET.
com.                    172800  IN      NS      E.GTLD-SERVERS.NET.
com.                    172800  IN      NS      F.GTLD-SERVERS.NET.
com.                    172800  IN      NS      G.GTLD-SERVERS.NET.

;; ADDITIONAL SECTION:
A.GTLD-SERVERS.NET.     172800  IN      A       192.5.6.30
A.GTLD-SERVERS.NET.     172800  IN      AAAA    2001:503:a83e::2:30
B.GTLD-SERVERS.NET.     172800  IN      A       192.33.14.30
B.GTLD-SERVERS.NET.     172800  IN      AAAA    2001:503:231d::2:30
C.GTLD-SERVERS.NET.     172800  IN      A       192.26.92.30
D.GTLD-SERVERS.NET.     172800  IN      A       192.31.80.30
E.GTLD-SERVERS.NET.     172800  IN      A       192.12.94.30
F.GTLD-SERVERS.NET.     172800  IN      A       192.35.51.30
G.GTLD-SERVERS.NET.     172800  IN      A       192.42.93.30
H.GTLD-SERVERS.NET.     172800  IN      A       192.54.112.30
I.GTLD-SERVERS.NET.     172800  IN      A       192.43.172.30
J.GTLD-SERVERS.NET.     172800  IN      A       192.48.79.30
K.GTLD-SERVERS.NET.     172800  IN      A       192.52.178.30
L.GTLD-SERVERS.NET.     172800  IN      A       192.41.162.30

;; Query time: 240 msec
;; SERVER: 198.41.0.4#53(198.41.0.4)
;; WHEN: Sat Aug 23 04:15:28 2008
;; MSG SIZE  rcvd: 508

root@www [~]# dig @a.gtld-servers.net castledragmire.com

; <<>> DiG 9.2.4 <<>> @a.gtld-servers.net castledragmire.com
; (2 servers found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35586
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;castledragmire.com.            IN      A

;; AUTHORITY SECTION:
castledragmire.com.     172800  IN      NS      ns3.deltaarc.com.
castledragmire.com.     172800  IN      NS      ns4.deltaarc.com.

;; ADDITIONAL SECTION:
ns3.deltaarc.com.       172800  IN      A       216.127.92.71
ns4.deltaarc.com.       172800  IN      A       209.85.115.181

;; Query time: 58 msec
;; SERVER: 192.5.6.30#53(192.5.6.30)
;; WHEN: Sat Aug 23 04:15:42 2008
;; MSG SIZE  rcvd: 113

root@www [~]# dig @ns3.deltaarc.com castledragmire.com

; <<>> DiG 9.2.4 <<>> @ns3.deltaarc.com castledragmire.com
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 26198
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 0

;; QUESTION SECTION:
;castledragmire.com.            IN      A

;; ANSWER SECTION:
castledragmire.com.     14400   IN      A       209.85.115.128

;; AUTHORITY SECTION:
castledragmire.com.     14400   IN      NS      ns4.deltaarc.com.
castledragmire.com.     14400   IN      NS      ns3.deltaarc.com.

;; Query time: 1 msec
;; SERVER: 216.127.92.71#53(216.127.92.71)
;; WHEN: Sat Aug 23 04:15:52 2008
;; MSG SIZE  rcvd: 97

Linux also has the “host” command, but I prefer and recommend “dig”.


And that’s how you diagnose DNS problems! :-). For reference, two common DNS configuration problems are not having your SOA and NS records properly set for the domain on your name server.


I also went ahead and added dig to the “Useful Bash commands and scripts” post.

Windows Hosts File
When DNS decides to be finicky

Another of my favorite XP hacks is modifying domain addresses through XP’s Hosts file. You can remap where a domain points on your local computer by adding an IP address followed by a domain in the “c:\windows\system32\drivers\etc\hosts” file.
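
For example, adding a line like the following (format: IP, whitespace, domain) forces www.castledragmire.com to resolve to the IP from the previous post, regardless of what DNS actually says:

209.85.115.128    www.castledragmire.com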

Domain names are locally controlled, looked up, and cached on your computer at the OS level, so there are simple hacks like this for other OSs too.

I often utilize this solution as a server admin who controls a lot of domains (Over 100, and I control most of them at the registrar level too ^_^). The domain system itself across the web is incredibly fastidious and prone to problems if not perfectly configured, so this hack is a wonderful time saver and diagnostic tool until things resolve and work properly.

Computers are Evil
Setting up new computers can be quite the hassle

The new home server for the new entertainment center I recently set up has made itself out to be quite a nuisance. I am unsure as to whether I will keep using it or not, but fortunately, I have not yet taken down my old home server, as I wanted to do some break in testing on the new one first.

Setting up new computers is almost always a pain in the ass, what with installing and configuring all the software from scratch (which always includes a format and new OS), making sure all the hardware works properly, and finding drivers for it (sometimes without even having proper information on what that hardware is). But sometimes, computers can go above and beyond the normal setup nuances and annoyances and be downright evil. I have long proclaimed to people that computers have personalities and minds of their own, and that they decide when and where they want to be accommodating or uncooperative. Besides all the normal computer setup problems (including not knowing what the hardware was and having to figure that out), this one also had a few more doozies.

The first big problem started with the fact that I wanted to use this computer for video output, and it does not have an AGP slot. As I contemplated in the previous post on this topic, I went ahead and bought a PCI Geforce 5200 for $27.79, including shipping. The card did not fit properly in the new case, so I had to unscrew a few things, which were fortunately designed for just that reason. Then the big problem came up: video outputted from the s-video port on the card showed up on the TV zoomed in by about 50%, so I couldn’t see half the screen. I couldn’t test the monitor output port either, because it is DVI and I have no DVI monitors, alas. After 2 or 3 hours of tinkering and throwing everything plus the kitchen sink at the problem, including trying a different s-video cable, I finally stumbled on the solution and got it working, yay. That is... until I rebooted, and it wasn’t working again x.x;. Another 20 or so minutes of tinkering got it fixed again, and I was able to quickly hone in on a procedure to fix the problem on the next reboot, optimizing it with each successive reboot over the next few days. The procedure is as follows (the TV over s-video starts as the primary monitor, and I have a second monitor connected to the VGA port of the onboard graphics card):

  • Open “Display Properties” [Right click Desktop > Properties] > Settings
  • Attach second monitor so I can see what I’m doing
  • Open NVidia Control Panel
  • Rotate screen to 90 degrees. It only wants to rotate the screen at 1024x768, which is too high a resolution for the TV, so it kicks the resolution down to 640x480 while rotating
  • Keep setting the screen to no rotation (0 degrees) until the scaling is correct [usually twice]. The NVidia control panel doesn’t want to allow going back to normal rotation now due to the 1024x768 required resolution thing, and will keep the setting set as 90 degrees, so the process can easily be repeated until it works.
  • Now that the screen is at the correct scale (at 640x480), all that’s left is to get the rotation back to normal. To do this, immediately after accepting the rotation process in the NVidia Control Panel, it has to be closed out (alt+f4) so that it saves the rotation setting at 0 degrees but doesn’t try to set it back after all the resolution changes.
  • Raise the resolution back to 800x600
  • Detach secondary monitor now that it is no longer needed

The screen still unfortunately has about 100-200 “pixels” (monitors don’t have pixels, technically) on the top and bottom that are unused, but eh, NBD. At least this graphics card lets me properly pan and scan (zoom/scale and move) the s-video output around, unlike my Geforce4 Ti 4600! The next problem with the video card is that some video outputted from it is just too slow. Most content is watchable, but the rest is unbearably choppy. The problem might just be that the PCI bus doesn’t have the required throughput, which is why most video cards are used over AGP (or nowadays PCI Express).

There are two more final problems with it, one a possible deal killer, the other rather insignificant. The insignificant problem is that XP refuses to install updates; I believe this to be a problem with SP3. The final problem is that the computer seems to randomly, completely freeze up every now and then for no particular reason, requiring a reboot. This has happened 2 or 3 times so far, so I’m waiting to see how often it happens, if any more. I know it’s not overheating, as I currently have the case open; and I see no blown capacitors... hmmmm...



<frustration>Computers!</frustration>
FoxPro Table Memo Corruption
Data integrity loss is such a drag :-(

My father’s optometric practice has been using an old DOS database called “Eyecare” since (I believe) the early 80’s. For many years, he has been programming a new, very customized database from scratch in Microsoft Access, backwards compatible with “Eyecare” (which uses a minor variant of FoxPro databases). I’ve been helping him with minor things on it for a number of years, and more recently I’ve been giving a lot more help in getting it secured and migrated from Microsoft Access databases (.mdb) into MySQL.

A recent problem cropped up: one of the primary tables started crashing Microsoft Access when it was opened (through a FoxPro ODBC driver). Through some tinkering, he discovered that the memo file (.fpt) for the table was corrupted, as trying to view any memo field is what crashed Access. He asked me to see if I could help recover the file, which fortunately I can do at my leisure, as he keeps paper backups of everything for just such circumstances. He keeps daily backups of everything too… but for some reason that’s not an option.


I went about trying to recover it through the easiest means first, namely, opening and exporting the database through FoxPro, which recovered only 187 of the ~9000 memo records. Next, I tried finding a utility online that did the job; the first one I found that I thought should work was called “FoxFix”, but it failed miserably. There are a number of other shareware utilities I could try, but I decided to first see how hard it would be to fix myself.


I opened the memo file up in a HEX editor, and after some very quick perusing and calculations, it was quite easy to determine the format (as reflected in the recovery code below):
  • The file starts with a 512-byte header; the record block size is stored as a WORD at offset 6 within it.
  • After the header, records are aligned on 32-byte block boundaries.
  • Each record begins with two big-endian DWORDs, the record type (1 = memo) and the data length, followed by the data itself.

So I continued on the path of seeing what I could do to fix the file.
  • First, I had it jump to the header of each record and just get the record data length, and I very quickly found multiple invalid record lengths.
  • Next, I had it attempt to fix each of these by determining the real length of the memo, searching for the first null terminator (“\0”) character, but I quickly discovered an oddity. There are weird sections in many of the memo fields in the format BYTE{0,0,0,1,0,0,0,1,x}, which is 2 big-endian DWORDs that equal 1, plus a final byte character (usually 0).
  • I added to the algorithm to include these as part of a memo record, and many more original memo lengths then agreed with my calculated memo lengths.
  • The final thing I did was determine how many invalid (non-keyboard) characters there were in the memo data fields. There were ~3500 0x8D characters, which were usually followed by 0x0A, so I assume these were supposed to be line breaks (Windows line breaks are denoted by [0x0D/carriage return/\r],[0x0A/line feed/\n]). There were only 5 other invalid characters, so I just changed those to question marks ‘?’.

Unfortunately, Microsoft Access still crashed when I tried to access the comments fields, so I will next try to just recover the data, tie it to its primary keys (which I will need to determine through the table file [.dbf]), and then rebuild the table. I should be making another post when I get around to doing this.


The following code, which “fixes” the table’s memo file, took about 2 hours to code up.
//Usually included in windows.h
typedef unsigned long DWORD;
typedef unsigned char BYTE;

//Includes
#include <iostream.h> //cout
#include <stdio.h> //file io
#include <conio.h> //getch
#include <ctype.h> //isprint

//Memo file structure
#pragma warning(disable: 4200) //Remove zero-sized array warning
const int MemoFileHeadLength=512;
const int RecordBlockLength=32; //This is actually found in the header at (WORD*)(Start+6)
struct MemoRecord //Full structure must be padded at end with \0 to RecordBlockLength
{
	DWORD Type; //Record type, stored big endian in the file (1=Memo)
	DWORD Length; //Data length, stored big endian in the file
	BYTE Data[0];
};
#pragma warning(default: 4200)

//Input and output files
const char *InFile="EXAM.Fpt.old", *OutFile="EXAM.Fpt";

//Assembly functions
__forceinline DWORD BSWAP(DWORD n) //Swaps endianness
{
	_asm mov eax,n
	_asm bswap eax
	_asm mov n, eax
	return n;
}

//Main function
void main()
{
	//Read in file
	const int FileSize=6966592; //This should actually be found when the file is opened...
	FILE* MyFile=fopen(InFile, "rb");
	BYTE *MyData=new BYTE[FileSize];
	fread(MyData, FileSize, 1, MyFile);
	fclose(MyFile);

	//Start checking file integrity
	DWORD FilePosition=MemoFileHeadLength; //Where we currently are in the file
	DWORD RecordNum=0, BadRecords=0, BadBreaks=0, BadChars=0; //Data Counters
	const DWORD OneInLE=0x01000000; //The file bytes {0,0,0,1} (big endian 1) as read into a little endian DWORD
	while(FilePosition<FileSize) //Loop until EOF
	{
		FilePosition+=sizeof(((MemoRecord*)NULL)->Type); //Advance past the record type (1=memo)
		DWORD CurRecordLength=BSWAP(*(DWORD*)(MyData+FilePosition)); //Pull in the big endian record size and convert it
		cout << "Record #" << RecordNum++ << " reports " << CurRecordLength << " characters long. (Starts at offset " << FilePosition << ")" << endl; //Output record information

		//Determine actual record length
		FilePosition+=sizeof(((MemoRecord*)NULL)->Length); //Advance past the record length
		DWORD RealRecordLength=0; //Actual record length
		while(true)
		{
			for(;MyData[FilePosition+RealRecordLength]!=0 && FilePosition+RealRecordLength<FileSize;RealRecordLength++) //Loop until \0 is encountered
			{
#if 1 //**Check for valid characters might not be needed
				if(!isprint(MyData[FilePosition+RealRecordLength])) //Makes sure all characters are valid
					if(MyData[FilePosition+RealRecordLength]==0x8D) //**0x8D maybe should be in ValidCharacters string? - If 0x8D is encountered, replace with 0xD
					{
						MyData[FilePosition+RealRecordLength]=0x0D;
						BadBreaks++;
					}
					else //Otherwise, replace with a "?"
					{
						MyData[FilePosition+RealRecordLength]='?';
						BadChars++;
					}
#endif
			}

			//Check for inner record memo - I'm not really sure why these are here as they don't really fit into the proper memo record format.... Format is DWORD(1), DWORD(1), BYTE(0)
			if(((MemoRecord*)(MyData+FilePosition+RealRecordLength))->Type==OneInLE && ((MemoRecord*)(MyData+FilePosition+RealRecordLength))->Length==OneInLE /*&& ((MemoRecord*)(MyData+FilePosition+RealRecordLength))->Data[0]==0*/) //**The last byte seems to be able to be anything, so I removed its check
			{ //If inner record memo, current memo must continue
				((MemoRecord*)(MyData+FilePosition+RealRecordLength))->Data[0]=0; //**This might need to be taken out - Force last byte back to 0
				RealRecordLength+=sizeof(MemoRecord)+1;
			}
			else //Otherwise, current memo is finished
				break;
		}
		if(RealRecordLength!=CurRecordLength) //If given length != found length
		{
			//Tell the user a bad record was found
			cout << "   Real Length=" << RealRecordLength << endl;
			CurRecordLength=RealRecordLength;
			BadRecords++;
			//getch();

			//Update little endian bad record length
			((MemoRecord*)(MyData+FilePosition-sizeof(MemoRecord)))->Length=BSWAP(RealRecordLength);
		}

		//Move to next record - Each record, including RecordLength is padded to RecordBlockLength
		DWORD RealRecordSize=sizeof(MemoRecord)+CurRecordLength;
		FilePosition+=CurRecordLength+(RealRecordSize%RecordBlockLength==0 ? 0 : RecordBlockLength-RealRecordSize%RecordBlockLength);
	}

	//Tell the user file statistics
	cout << "Total bad records=" << BadRecords << endl << "Total bad breaks=" << BadBreaks << endl << "Total bad chars=" << BadChars << endl;

	//Output fixed data to new file
	MyFile=fopen(OutFile, "wb");
	fwrite(MyData, FileSize, 1, MyFile);
	fclose(MyFile);

	//Cleanup and wait for user keystroke to end
	delete[] MyData;
	getch();
}
Always Confirm Potentially Hazardous Actions
Also treat what others tell you with discretion

So I have been having major speed issues with one of our servers. After countless hours of diagnosis, I determined the bottleneck was always I/O (input/output, accessing the hard drive). For example, when running an MD5 hash on a 600MB file, load would jump up to 31 (with 4 logical CPUs) and the hash would take 5-10 minutes to complete. When performing the same test on a second drive in the same machine, it finished within seconds.

Replacing the hard drive itself is a last resort for a live production server, and a friend suggested the drive controller could be the problem, so I confirmed that the drive controller for our server was not on-board (it was on its own card), and I attempted to convince the company hosting our server of the problem so they would replace the drive controller. I ran my own tests first with an iostat check while doing a read of the main hard drive (cat /dev/sda > /dev/null). This produced steadily worsening results the longer the test went on, and always much worse than our secondary drive. I passed these results on to the hosting company, and they replied that a “badblocks -vv” produced results that showed things looked fine.
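
For reference, that kind of read test can be reproduced with something like the following (assuming sda is the suspect drive, and that the sysstat package provides iostat; both commands are read-only):

# Terminal 1: sequentially read the whole drive
cat /dev/sda > /dev/null
# Terminal 2: watch extended per-device statistics every 5 seconds
iostat -x 5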

So I was about to run his test to confirm his findings, but decided to check the parameters first, as I always like to do before running new Linux commands. Thank Thor I did. The admin had meant to write “badblocks -v” (verbose) and typoed with a double keystroke. The two v’s looked like a w due to the font, and had I run a “badblocks -w” (write-mode test), I would have wiped out the entire hard drive.

Anyways, the test output the same basic results as my iostat test, with throughput very quickly decreasing from a remotely acceptable level to almost nil. Of course, the admin had only taken the best results of the test, ignoring the rest.

I had them swap out the drive controller anyways, and it hasn’t fixed things, so a hard drive replacement will probably be needed soon. This kind of problem would be trivial if I had access to the server and could just test the hardware myself, but that is the price to pay for proper security at a server farm.