Pedro Vale Estrela - NS2 Debug / BugFix Tutorial (OTCL + C++)

This tutorial shows how to use the oTcl and C++ debugger tools to find a bug in NS 2.28 or earlier.
(Recent CVS snapshots and the future 2.29 version will have it corrected, as the patch has already been applied to the CVS tree.)
The tutorial thus guides you through a typical debugging process that is useful in a variety of situations.

Files and Patches (contains the scripts and patches mentioned in these pages)
Contact: pedro.estrela@inesc.pt


--------------------------------------------------------------------------------


NOTE 1: The tutorial mentions several scripts. These are available in the files directory. I've also made a compressed file that has everything you'll need for this guide: test-suite-hier-routing-bug.zip. The scripts also depend on a recent version of my ns2_shared_procs.tcl file.
NOTE 2: This tutorial gives greater detail on the otcl debugging part. However, an experienced NS developer could jump directly to the C++ debugging part by closely studying the call stack dump information.
NOTE 3: A different way to modify the built-in tcl functions would be to modify them directly in the tcl source files and recompile NS. However, the method outlined below is preferable for beginners, as it doesn't require recompilation and doesn't change the existing code (so backing the changes out, if necessary, is trivial).
NOTE 4: Like in my other tutorial, I'll present a complete script and an image for each step. However, you should try to make the required modifications by hand, to get a much better understanding of how to modify NS.


--------------------------------------------------------------------------------

Step 0 - Prologue
The bug investigated in this tutorial appeared when I tried to add dynamic routing capabilities (i.e., the possibility to simulate recovery from link failures) to a fairly complex script that featured a large topology of wired links, coupled with several wireless links (both base stations and pure mobile nodes). Another relevant point is that the script used hierarchical routing in all the nodes, for Mobile IP usage.
The script worked perfectly until I added instructions to simulate a link failure on one of the wired links. As explained in this section of Marc Greis's tutorial, all that is required is to add "$ns rtmodel-at time up|down node1 node2" commands to the script.
The problem appeared when I enabled the "Session" dynamic routing protocol so that the alternative paths of the wired topology would be used (i.e., with "$ns rtproto Session" at the start of the script).
At run time, the simulator crashed in the middle of the simulation with the following error: test-suite-hier-routing.error1.txt.

Step 1 - Choosing a simpler scenario that is known to be correct
To isolate a bug, a common heuristic is to simplify the scenario by removing parts that (hopefully) are unrelated to it. In the above example, I suspected that the bug was somewhere in the interaction between dynamic routing and hierarchical routing (as the simulator crashed when the wired link went down). In that case, the wireless nodes and the complex topology only obscured the real problem. (As will be shown later, this supposition was correct.)
On the other hand, I also wanted to validate my own script, as I could have been doing something in it that was corrupting the simulator.

Thus, one good approach to finding the bug is to start from a scenario that is known to be correct, and slowly introduce minimal features until the bug appears. The best starting points are the standard test suites included in NS2, which are used to validate the simulator itself against the most recent modifications and patches.
(Note: as explained here, these test suites are the only scripts that are guaranteed to use the latest APIs; in contrast, the examples in "ns/tcl/ex" and Marc Greis's tutorial are known to be out of date, especially the wireless examples.)

Searching in "ns/tcl/test", I found that the only test suite that used hierarchical routing was "test-suite-hier-routing". This test uses a simple non-redundant topology (i.e., only direct paths) with regular static routing, which produces a topology with 9 nodes.

Script: test-suite-hier-routing_1.tcl
Result: hier_step1.gif

To run the simulation: "ns test-suite-hier-routing.tcl hier-simple"
To view the simulation in nam: "nam temp.rands.nam"

Step 2 - Making the bug appear in the simpler scenario
Now let's try adding a new link between nodes 5 and 7, and make it go down at time 2. This is achieved by adding these 2 lines to the script, in "TestSuite instproc init".
$ns_ duplex-link $n_(5) $n_(7) 5Mb 2ms DropTail
$ns_ rtmodel-at 2 down $n_(5) $n_(7)

Run the new script with the same commands as before. The traffic first goes through the new link; at time 2 the link goes down and all packets sent to it are lost, making nodes 7, 8 and 9 unreachable.
Script: test-suite-hier-routing_2.tcl
Result: hier_step2.gif

Now, let's use dynamic routing to correct this, choosing type Session. Just add "$ns_ rtproto Session" after the simulator object creation, in init-simulator {}.
If you now run the new script, it will crash with the exact same error as before. Good work! Now we have a much simpler scenario that is sufficient to trigger the bug, and that will be much easier to debug!
Note that only now should you ask on the NS2 mailing lists about the bug that you've found, to find out whether somebody has already worked on a fix. It is fairly important to use a simple scenario like the one explained here. As an example, check the email I sent to the NS developers' mailing list for this very bug: Bug report
Script: test-suite-hier-routing_3.tcl
Error (Call Trace): test-suite-hier-routing.error2.txt

Step 3 - Getting to know what is going on at the beginning of the simulation
In this section we'll take an inside look at the TCL objects that are created by the script, to get an insight into the inner workings of the simulator. I assume that you've followed and experimented with my tutorial on otcl debugging.

The idea is to stop the simulator immediately before the simulation starts (i.e., before "$ns run"). For this:
a) modify the script to include a new MashInspector object in "Test/hier-simple instproc run", before "$ns run";
b) make it stop before "$ns run" with "debug 1";
c) modify it to run the "hier-simple" test, ignoring command line parameters (check runtest() of the resulting script if it's too difficult);

Then use the resulting script as follows:

a) start nstk without parameters. It should open the tkcon console.
b) start the script ("source test-suite-hier-routing_4.tcl").

You'll now see Mash's Object Inspector, which you can use to peek into the otcl objects that are created at the start of the simulation. In the following figure, I'm inspecting the main "simulator" object, which is created in the script by "set ns [new Simulator]". For this, I've selected the "Simulator" class in the first column, and its unique instance in the fourth column (in my case, it was object _o5).
At this stage you'll find _o5's private variables in the last column, namely the node information (array Node_[]), each link (link_[]), and private variables that hold references to other core objects, such as the scheduler in use, the type of trace in use, etc. Another important column is the second, as it lists the procs available to the selected object. If you click on one, you'll see its source code (this will be very important in the next step).

By following the references, you can now inspect each object in succession. For example, clicking on the private variable "routingTable_" takes you to an instance of "RoutingLogic", which contains a private variable rtprotos_(Session). This is enough to confirm that you are correctly using the "Session" type of dynamic routing.
This technique is useful for checking the inner state of the objects created by your script before the simulation, to make sure that they start as intended.
However, our specific error occurs at run time, at 2.0 seconds, when the link goes down. Thus, our next step is to stop the simulator at exactly this event.

Script: test-suite-hier-routing_4.tcl
Image: hier_step3.gif

Step 4 - Getting to know what is going on immediately before the crash at runtime
The idea for debugging at run time is to insert "debug 1" commands at interesting points of the code, to break the execution there. To find those points, we'll check the tcl call stack that the simulator dumps when it crashes; it starts with the innermost tcl procedure that crashed, then the function that called it, and so on, up to the first tcl function that started the call chain.
In our case, the outermost function (i.e., the first one called) is the "runq" proc. Note that the dump gives no easy reference to the tcl source file that contains this function; to find it, do a recursive grep for the string "runq" in the whole ns/tcl source tree:

ns/tcl> grep -d recurse "runq" *

You'll find that this procedure resides in the file "rtglib/dynamics.tcl". Next, you could simply modify proc "runq" and recompile NS. However, TCL allows any proc to be replaced at run time; thus, to avoid modifying the ns2 core files, we'll copy the "runq" proc into our script and insert the "debug 1" instruction in our private copy.

Script: test-suite-hier-routing_5.tcl
Image: hier_step4.gif

The next image shows the actual interaction at run time. Notice how I've confirmed the current simulation time when the debugger breaks in (i.e., 2 seconds); for that, I've just called the "now" proc of the simulator object on the bottom evaluation line (also notice that I'm showing the actual code of the "now" procedure).

Now you can locate the currently running object in order to inspect it. For this, run "puts $self" in the debugger window and find the object name in the list of all instances. Then open the "runq" procedure - see the following image.
Image: hier_step5.gif

You are now on an rtQueue object, which has a list of events (see array rtq_[]). You can now trace step by step in the debugger window (using the enter key in the debugger console), and check the code about to be executed in the MashInspector window at each step. This will take you, step by step, through all the procs that are mentioned in the call trace after the crash. At any time you can also check the internal state of the objects, to look for logical bugs.

Using these techniques, and "debug 1" statements placed progressively closer to the crash, you'll eventually reach the "simulator compute-hier-routes" function, and conclude that the bug is triggered when the "$r hier-reset $srcID $dstID" line is called. (compute-hier-routes is in ns/tcl/lib/ns-route; use a recursive grep to find its location.)
The following script has debug code immediately before this function call, to produce the corresponding screenshot:
Script: test-suite-hier-routing_6.tcl
Image: hier_step6.gif

Here, I'm checking the values of the parameters for the link (_o12), the source node (1.1.0) and the destination node (1.0.0). As all these values are correct, let's now check the proc itself (hier-reset).

Step 5 - Understanding Shared C++ / TCL procs
For this, we'll go to object _o12 and check its procs. However, as can be seen in image 7, this proc doesn't appear in the list. This happens because of a powerful (but confusing to beginners) mechanism that simplifies calling C++ procedures from TCL.

When a nonexistent procedure is called on an otcl object, the tclcl library (part of the core ns modules) calls the "command(argc, argv)" method of the corresponding C++ object, with all the parameters passed as strings.
This function inspects the command name in the arguments and, if it is known, executes it; if not, an error is returned.

This way, the procs that an object can execute are:
- defined in its own oTCL class;
- inherited from parent otcl super classes;
- contained in the C++ command() of the corresponding C++ class;

However, only the first type appears directly in the object inspector; the inherited otcl procedures are visible if one chooses the parent classes in the inheritance column (3rd column). As the hier-reset proc isn't present in the otcl class or its super classes, it has to be in the C++ code.
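The lookup order described above can be sketched in C++. This is only a toy model under stated assumptions, not tclcl's actual implementation: the types OtclObjectSketch, CppShadowSketch and RouteLogicSketch, the proc names "own-proc" and "parent-proc", and the SKETCH_* return codes are all hypothetical; only the command(argc, argv) shape and the "hier-reset" name come from the text.

```cpp
#include <cassert>
#include <cstring>
#include <set>
#include <string>

// Illustrative return codes; the real ones are TCL_OK / TCL_ERROR.
enum SketchCode { SKETCH_OK = 0, SKETCH_ERROR = 1 };

// Hypothetical C++ half of an object (plays the role of the base class).
struct CppShadowSketch {
    virtual ~CppShadowSketch() = default;
    // Every argument arrives as a string; argv[1] holds the command name.
    virtual int command(int argc, const char* const* argv) {
        (void)argc;
        (void)argv;
        return SKETCH_ERROR;  // unknown command name
    }
};

// Hypothetical subclass that knows one extra command.
struct RouteLogicSketch : CppShadowSketch {
    int command(int argc, const char* const* argv) override {
        if (argc >= 2 && std::strcmp(argv[1], "hier-reset") == 0) {
            return SKETCH_OK;  // the real work would happen here
        }
        return CppShadowSketch::command(argc, argv);
    }
};

// Hypothetical OTcl half: procs of its own class, procs inherited from
// super classes, and a pointer to the C++ half.
struct OtclObjectSketch {
    std::set<std::string> own_procs;
    std::set<std::string> inherited_procs;
    CppShadowSketch* shadow = nullptr;

    // Reports which of the three places handles the given proc name.
    // Note that the last step does not just test: it runs command().
    std::string resolve(const std::string& name) const {
        assert(shadow != nullptr);
        if (own_procs.count(name)) return "otcl class";
        if (inherited_procs.count(name)) return "otcl super class";
        const char* argv[] = {"obj", name.c_str()};
        if (shadow->command(2, argv) == SKETCH_OK) return "C++ command()";
        return "error";
    }
};

// Small driver: builds one object and resolves a single proc name.
std::string demo_resolve(const std::string& name) {
    RouteLogicSketch cpp_half;
    OtclObjectSketch obj;
    obj.own_procs.insert("own-proc");
    obj.inherited_procs.insert("parent-proc");
    obj.shadow = &cpp_half;
    return obj.resolve(name);
}
```

In this sketch, demo_resolve("hier-reset") ends up in the C++ command(), which mirrors why the proc is invisible in the object inspector: only the two otcl sets would be listed there.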

For this, make a recursive grep from the base of the ns2 tree:
ns2> grep -d recurse "hier-reset" *

(NOTE: a much faster way is to check only the .cc files, for example:
grep -d recurse "hier-reset" *.cc
grep -d recurse "hier-reset" */*.cc)

The recursive grep tells us that the command is handled inside RouteLogic::command(argc, argv), in routing/route.cc.

The relevant part is:
...
} else if (strcmp(argv[1], "hier-reset") == 0) {
    int i;
    int src_addr[SMALL_LEN], dst_addr[SMALL_LEN];

    str2address(argv, src_addr, dst_addr);
    // assuming node-node addresses (instead of
    // node-cluster or node-domain pair)
    // are sent for hier_reset
    for (i=0; i < level_; i++)
        if (src_addr[i]<=0 || dst_addr[i]<=0){
            tcl.result ("negative node number");
            return (TCL_ERROR);
        }
    hier_reset(src_addr, dst_addr);
} else if (strcmp(argv[1], "hier-lookup") == 0) {
...

We'll now proceed to C++ level debugging. Before that, you should comment out the lines that called the otcl debugger.
Script: test-suite-hier-routing_7.tcl

Step 6 - Move into C++ debugging
Fortunately, C++ level debugging has much better tools available. I suggest using ddd, which is a front end to gdb (check details and tutorials here).

Start ddd, open the ns executable (menu file / open program), then put a breakpoint in route.o's RouteLogic::command()
(menu file / open source / route.cc).
Image: hier_step8.gif

Now let's run the program (menu program / run / arguments: test-suite-hier-routing_7.tcl).
Notice how you'll have a source-level debugger window that is stopped at the breakpoint.
Now step through the code (F5), and notice how the arguments are processed; then the hier_reset() function is called to perform the actual work.
Now notice that after hier_reset(), control falls through to the end of the command() function, reaching return (TclObject::command(argc, argv));
Image: hier_step9.gif

This line passes control to the standard TCL command processor, which doesn't know anything about link failures, hierarchical resets, etc. Thus, this function returns an error, and the simulator crashes at run time.

Looking at the other commands processed by this function, a simple pattern is easy to spot:
- each "if" verifies the command name (strcmp == 0);
- arguments are collected from the argv/argc array;
- a function is called that does the actual work;
- the branch returns either TCL_OK or TCL_ERROR.

However, this is not the case in our hier-reset branch, as there is no return (TCL_OK) anywhere in it.
Thus, control falls through to the default behaviour, which subsequently leads to the simulator crash.

As hier_reset() is a void function, it has nothing to return; thus, we'll decide that the command() function should return TCL_OK, to tell tcl that it has processed the hier-reset call just fine.
As such, just insert a "return (TCL_OK);" immediately after the existing "hier_reset(src_addr, dst_addr);" line. Then recompile the simulator and rerun the script.
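The difference between the buggy and the fixed branch can be reproduced in a small standalone C++ sketch. Everything here is a hypothetical stand-in (base_command_sketch for TclObject::command(), hier_reset_stub for hier_reset(), the SKETCH_* codes for TCL_OK / TCL_ERROR, and the run_command helper), so it illustrates the control flow only, not the real ns-2 code.

```cpp
#include <cassert>
#include <cstring>

// Illustrative return codes; the real ones are TCL_OK / TCL_ERROR.
enum { SKETCH_OK = 0, SKETCH_ERROR = 1 };

// Counts how often the "actual work" ran.
static int reset_calls = 0;

// Stand-in for hier_reset(): a void function with nothing to return.
void hier_reset_stub() { ++reset_calls; }

// Stand-in for the base class command(): it knows none of our commands.
int base_command_sketch(int, const char* const*) { return SKETCH_ERROR; }

// Buggy dispatcher: the hier-reset branch does its work but never returns.
int command_buggy(int argc, const char* const* argv) {
    if (argc >= 2 && std::strcmp(argv[1], "hier-reset") == 0) {
        hier_reset_stub();
        // Missing return: control falls through to the base class below.
    } else if (argc >= 2 && std::strcmp(argv[1], "hier-lookup") == 0) {
        return SKETCH_OK;
    }
    return base_command_sketch(argc, argv);
}

// Fixed dispatcher: identical, except for the added return.
int command_fixed(int argc, const char* const* argv) {
    if (argc >= 2 && std::strcmp(argv[1], "hier-reset") == 0) {
        hier_reset_stub();
        return SKETCH_OK;  // the one-line fix
    } else if (argc >= 2 && std::strcmp(argv[1], "hier-lookup") == 0) {
        return SKETCH_OK;
    }
    return base_command_sketch(argc, argv);
}

// Hypothetical helper: builds a two-element argv and calls a dispatcher.
int run_command(int (*cmd)(int, const char* const*), const char* name) {
    assert(name != nullptr);
    const char* argv[] = {"obj", name};
    return cmd(2, argv);
}
```

With these definitions, run_command(command_buggy, "hier-reset") reports an error even though the reset itself ran, while run_command(command_fixed, "hier-reset") reports success.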

You'll then see that it no longer crashes at run time, and is able to complete the whole simulation without problems. Then use nam to check that the original problem (i.e., Hierarchical + Dynamic Routing) has been corrected. As you can see in the following image, the link failure at time 2 is instantly "healed" by the Session routing.

Image: hier_step10.gif

Now it's time to go outside and celebrate the bugfix that you've achieved!

Step 7 - Contribute a patch to the NS developers with your newest bug discovery
Er, actually not so fast. :-)
That celebration idea should be delayed until the WHOLE work is done. And no bug is fixed until a patch is submitted to the NS developers.

This will enable the bug to be corrected in the following version of NS2, benefiting the whole community at once; it also saves fellow researchers the time needed to fix the same bug over and over, so that actual research work can be done instead.

For this, I recommend the use of CVS, to keep track of your own modifications and bugfixes to the simulator.
A simpler way to make a patch is to compare the original and modified source files. For this, try the following line:

diff -C3 <original unmodified source file> <your modified source file>

...and send the result in a SHORT but CLEAR email as a bug fix to the developers.

As an example, check the patch report on this very bug: Contributed Patch


--------------------------------------------------------------------------------

Check the files, patches, etc in this directory

Go back to my NS2 page
