Date:      Tue, 2 Sep 2008 22:02:10 +0100 (BST)
From:      Robert Watson <rwatson@FreeBSD.org>
To:        Luigi Rizzo <rizzo@iet.unipi.it>
Cc:        FreeBSD networking and TCP/IP list <freebsd-net@freebsd.org>
Subject:   Re: how to read dynamic data structures from the kernel (was Re: reading routing table)
Message-ID:  <alpine.BSF.1.10.0809022157340.84006@fledge.watson.org>
In-Reply-To: <20080902105124.GA22832@onelab2.iet.unipi.it>
References:  <3170f42f0809010507q6c37a9d5q19649bc261d7656d@mail.gmail.com> <48BBE7B2.4050409@FreeBSD.org> <48BCE4AA.6050807@elischer.org> <3170f42f0809020017k643180efte155a5b5701a40cf@mail.gmail.com> <alpine.BSF.1.10.0809021017500.1150@fledge.watson.org> <20080902105124.GA22832@onelab2.iet.unipi.it>

On Tue, 2 Sep 2008, Luigi Rizzo wrote:

> The real problem is that these data structures are dynamic and potentially 
> large, so the following approach (used e.g. in ipfw)
>
> 	enter kernel;
> 	get shared lock on the structure;
> 	navigate through the structure and make a linearized copy;
> 	unlock;
> 	copyout the linearized copy;
>
> is extremely expensive and has the potential to block other activities for a 
> long time.
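
Concretely, that pattern tends to look something like the sketch below --
the foo_* names, FOO_RLOCK(), and foo_externalize() are invented for
illustration, not real ipfw code:

    /*
     * Hypothetical getsockopt-style handler: snapshot the whole structure
     * under a shared lock, then copy out after unlocking.  Assumes
     * <sys/param.h>, <sys/malloc.h>, <sys/queue.h>, <sys/socketvar.h> and
     * a "foo" subsystem with a shared lock, a list head foo_head, an entry
     * count foo_count, and an externalized struct foo_export.
     */
    static int
    foo_dump(struct sockopt *sopt)
    {
        struct foo_export *buf;
        struct foo_entry *fe;
        int error, i = 0;

        FOO_RLOCK();
        /* M_NOWAIT: we cannot sleep while holding the shared lock. */
        buf = malloc(foo_count * sizeof(*buf), M_TEMP, M_NOWAIT | M_ZERO);
        if (buf == NULL) {
            FOO_RUNLOCK();
            return (ENOMEM);
        }
        LIST_FOREACH(fe, &foo_head, fe_next)    /* linearize */
            foo_externalize(fe, &buf[i++]);
        FOO_RUNLOCK();

        /* The copyout happens with no kernel locks held. */
        error = sooptcopyout(sopt, buf, i * sizeof(*buf));
        free(buf, M_TEMP);
        return (error);
    }

The entire walk happens with the lock held, which is exactly where the cost
you mention comes from.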

Sockets, sysctl, kmem, etc., are all really just I/O mechanisms, at
varying levels of abstraction, for pushing data out of the kernel, and
they all fundamentally suffer from the same problem: the lack of a
general export abstraction.

> What we'd need is some internal representation of the data structure that 
> could give us individual entries of the data structure on each call, 
> together with extra info (a pointer if we can guarantee that it doesn't get 
> stale, something more if we cannot make the guarantee) to allow the 
> navigation to occur.

I think there are necessarily implementation-specific details to all of
these steps for any given kernel subsystem -- we have data structures,
synchronization models, etc, that are all tuned to their common use
requirements, and monitoring is very much an edge case.  I don't think
this is bad -- this is an OS kernel, after all -- but it does make things
a bit more tricky.  Even if we can't share code, sharing approaches
across subsystems is a good idea.
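
That said, one way to phrase your per-entry idea so that it doesn't depend
on saved pointers staying valid is a key-plus-generation cursor, roughly
as follows (again, the foo_* names are made up):

    /* Cursor handed back and forth across calls; no kernel pointers. */
    struct foo_cursor {
        uint64_t    fc_gen;     /* generation seen at last call; lets the
                                 * caller notice the table changed mid-walk */
        uint32_t    fc_lastkey; /* key of the last entry returned */
    };

    /*
     * Return the next entry after the cursor, or ENOENT when the walk is
     * done.  Assumes the list is kept sorted by fe_key; if the structure
     * changed underneath us, we simply resume from fc_lastkey instead of
     * trusting a stale pointer.
     */
    static int
    foo_next(struct foo_cursor *fc, struct foo_export *out)
    {
        struct foo_entry *fe;
        int error = ENOENT;

        FOO_RLOCK();
        fc->fc_gen = foo_gen;   /* recorded for the caller; resumption is by key */
        LIST_FOREACH(fe, &foo_head, fe_next) {
            if (fe->fe_key > fc->fc_lastkey) {
                foo_externalize(fe, out);
                fc->fc_lastkey = fe->fe_key;
                error = 0;
                break;
            }
        }
        FOO_RUNLOCK();
        return (error);
    }

The price is re-taking the lock for every entry (or batch of entries),
which is the trade-off against the linearized snapshot above.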

For an example of what you have in mind, take a look at the sysctl
monitoring for UNIX domain sockets.  First, we allocate an array of
pointers sized to the number of unpcb's we have.  Then we walk the list,
bumping the references and adding pointers to the array.  Then we release
the global locks, and proceed to lock, externalize, unlock, and copy out
each individual entry, using a generation number to manage staleness.
Finally, we walk the array dropping the refcounts and free it.  This
avoids holding global locks for a long time, as well as the stale data
issue.  It's not ideal in other ways -- long lists, reference count
complexity, etc. -- but as I mentioned, monitoring is very much an edge
case, and much of that mechanism (especially the refcounts) is something
we need anyway for any moderately complex kernel data structure.
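
In outline, with hypothetical foo_* names standing in for the unpcb
machinery and 'req' being the sysctl request we are answering, it goes
something like this:

    struct foo **list, *fp;
    struct xfoo xf;             /* externalized, ABI-stable form */
    uint64_t gen;
    int error, i, j, n, valid;

    /* 1. Allocate the pointer array up front (may sleep; no locks held). */
    n = foo_count;
    list = malloc(n * sizeof(*list), M_TEMP, M_WAITOK);

    /* 2. Under the global lock, record pointers and bump refcounts. */
    FOO_LIST_RLOCK();
    gen = foo_gencnt;
    for (i = 0, fp = LIST_FIRST(&foo_head); fp != NULL && i < n;
        fp = LIST_NEXT(fp, f_link)) {
        foo_ref(fp);
        list[i++] = fp;
    }
    FOO_LIST_RUNLOCK();

    /* 3. Per entry: lock, externalize, unlock, copy out -- no global lock. */
    for (error = 0, j = 0; j < i && error == 0; j++) {
        fp = list[j];
        FOO_LOCK(fp);
        valid = (fp->f_gencnt <= gen);  /* skip entries newer than the snapshot */
        if (valid)
            foo_externalize(fp, &xf);
        FOO_UNLOCK(fp);
        if (valid)
            error = SYSCTL_OUT(req, &xf, sizeof(xf));
    }

    /* 4. Drop the references and free the array. */
    for (j = 0; j < i; j++)
        foo_rele(list[j]);
    free(list, M_TEMP);

The global lock is held only for the pointer walk in step 2; the slow
parts (externalization and copyout) run against individual entry locks.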

Robert N M Watson
Computer Laboratory
University of Cambridge


> Accessing /dev/kmem and following pointers there probably carries the risk
> that you cannot lock the kernel data structure while you navigate it,
> so you are likely to follow stale pointers.
>
> I believe this is a very old and common problem, so my question is:
>
> do you know if any of the *BSD kernels implements some good mechanism
> to access a dynamic kernel data structure (e.g. the routing tree/trie,
> or even a list or hash table) without the flaws of the two approaches
> I indicated above?
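
Not that I know of.  For comparison, the kmem path is essentially
unsynchronized pointer-chasing from userland via libkvm, along these lines
(the symbol and structure names below are placeholders):

    #include <sys/types.h>
    #include <err.h>
    #include <fcntl.h>
    #include <kvm.h>
    #include <limits.h>
    #include <nlist.h>
    #include <stdio.h>

    /* Placeholder for whatever the kernel list entry really looks like. */
    struct foo_entry {
        struct foo_entry *fe_next;
        int fe_value;
    };

    int
    main(void)
    {
        char errbuf[_POSIX2_LINE_MAX];
        struct nlist nl[] = { { "_foo_head" }, { NULL } };
        struct foo_entry fe, *p;
        kvm_t *kd;

        kd = kvm_openfiles(NULL, NULL, NULL, O_RDONLY, errbuf);
        if (kd == NULL || kvm_nlist(kd, nl) != 0)
            errx(1, "kvm: %s", errbuf);
        /* Read the list head, then chase fe_next pointers; nothing stops
         * the kernel from freeing an entry between two reads. */
        if (kvm_read(kd, nl[0].n_value, &p, sizeof(p)) != sizeof(p))
            errx(1, "kvm_read: %s", kvm_geterr(kd));
        while (p != NULL) {
            if (kvm_read(kd, (unsigned long)p, &fe, sizeof(fe)) != sizeof(fe))
                break;
            printf("%d\n", fe.fe_value);
            p = fe.fe_next;
        }
        kvm_close(kd);
        return (0);
    }

Exactly as you say, there is no way to hold the kernel's lock across those
reads, which is why the pointers can go stale underneath you.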
>
> 	cheers
> 	luigi
>
> [original thread below just for reference, but I believe I made a
> fair summary above]
>
> On Tue, Sep 02, 2008 at 10:19:55AM +0100, Robert Watson wrote:
>> On Tue, 2 Sep 2008, Debarshi Ray wrote:
>>
>>>> unfortunately netstat -rn uses /dev/kmem
>>>
>>> Yes. I also found that FreeBSD's route(8) implementation does not have an
>>> equivalent of 'netstat -r'. NetBSD and GNU/Linux implementations have such
>>> an option. Any reason for this? Is it because you did not want to muck
>>> with /dev/kmem in route(8) and wanted it to work with PF_ROUTE only? I
>>> have not yet gone through NetBSD's route(8) code though.
>>
>> Usually the "reason" for things like this is that no one has written the
>> code to do otherwise :-).  PF_ROUTE is probably not a good mechanism for
>> any bulk data transfer due to the constraints of being a datagram socket,
>> although doing it via an iterated dump rather than a simple dump operation
>> would probably work.  Sysctl is generally a better interface for monitoring
>> for various reasons, although it also has limitations.  Maintaining
>> historic kmem support is important, since it is also the code used for
>> interpreting core dumps, and we don't want to lose support for that.
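
To make the sysctl route concrete: a userland dump of the routing table
via the documented NET_RT_DUMP interface looks roughly like this:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/sysctl.h>
    #include <net/if.h>
    #include <net/route.h>
    #include <err.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        int mib[6] = { CTL_NET, PF_ROUTE, 0, 0 /* all AFs */, NET_RT_DUMP, 0 };
        size_t needed;
        char *buf, *next;
        struct rt_msghdr *rtm;

        /* Ask for the size, then fetch the linearized table. */
        if (sysctl(mib, 6, NULL, &needed, NULL, 0) == -1)
            err(1, "sysctl: estimate");
        if ((buf = malloc(needed)) == NULL)
            err(1, "malloc");
        if (sysctl(mib, 6, buf, &needed, NULL, 0) == -1)
            err(1, "sysctl: dump");
        for (next = buf; next < buf + needed; next += rtm->rtm_msglen) {
            rtm = (struct rt_msghdr *)(void *)next;
            /* sockaddrs follow rtm; which ones is in rtm->rtm_addrs. */
            printf("route msg: len %d flags %#x\n",
                (int)rtm->rtm_msglen, (unsigned)rtm->rtm_flags);
        }
        free(buf);
        return (0);
    }

It is still a single linearized snapshot, of course, so it shares the
copy-it-all-at-once cost discussed above.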
>>
>> Robert N M Watson
>> Computer Laboratory
>> University of Cambridge
>


