Date:      Tue, 22 Aug 2017 11:10:29 +0200
From:      "Ronald Klop" <ronald-lists@klop.ws>
To:        "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, "Rick Macklem" <rmacklem@uoguelph.ca>
Subject:   Re: when has a pNFS data server failed?
Message-ID:  <op.y5c7rrupkndu52@klop.ws>
In-Reply-To: <YTXPR01MB018952E64C3026F95165B45FDD800@YTXPR01MB0189.CANPRD01.PROD.OUTLOOK.COM>
References:  <YTXPR01MB018952E64C3026F95165B45FDD800@YTXPR01MB0189.CANPRD01.PROD.OUTLOOK.COM>

On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem <rmacklem@uoguelph.ca>  
wrote:

> This is kind of a "big picture" question that I thought I'd throw out.
>
> As a brief background, I now have the code for running mirrored pNFS Data Servers
> working for normal operation. You can look at:
> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
> if you are interested in details related to the pNFS server code/testing.
>
> So, now I am facing the interesting part:
> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has failed at some
>       point. Once that happens, it stops using the DS, etc.
> --> This brings me to the question of "when should the MDS decide that the DS has
>       failed and should be taken offline?".
>       - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long it will take for the
>         TCP connection to decide that a DS server is no longer working and fail the TCP
>         connection. I think it takes a fair amount of time, so I'm not sure if TCP connection
>         loss is a good indicator of DS server failure or not?
>       - It seems to me that the MDS should wait a fairly long time before failing the DS,
>         since this will have a major impact on the pNFS server, requiring repair/resilvering
>         by a sysadmin once it happens.
> So, any comments or thoughts on this? rick

Hi,

This is a common problem for all clustered or connected systems. I think
there is no general answer, and a lot of papers have been written about it.
For example, NFS has the 'soft' mount option, and the usual recommendation
is not to use it. I can imagine that advice making sense if your home-dir
or /usr is mounted over NFS, but at work I want my HTTP servers not to hang;
they should just get an I/O error when the backend file server holding the
data is gone.
The decision here involves a similar trade-off.
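
To make the trade-off concrete, here is a minimal Python sketch of one
possible policy: do not fail the DS on the first error, only once it has
been failing continuously for longer than a grace period. This is only an
illustration, not how the FreeBSD pNFS server works. The class name, the
method names and the 300-second grace period are all made up for the
example.

```python
class DSHealth:
    """Tracks one data server (DS). All names here are hypothetical."""

    def __init__(self, grace_seconds):
        # How long a DS may fail continuously before it is taken offline.
        self.grace_seconds = grace_seconds
        # Time of the first failure in the current run of failures,
        # or None while the DS is answering normally.
        self.first_failure = None

    def record_success(self):
        # Any successful reply ends the current run of failures.
        self.first_failure = None

    def record_failure(self, now):
        # Only the start of a run matters; later failures do not move it.
        if self.first_failure is None:
            self.first_failure = now

    def should_take_offline(self, now):
        # True only after the DS has been failing for the whole grace period.
        return (self.first_failure is not None
                and now - self.first_failure >= self.grace_seconds)


ds = DSHealth(grace_seconds=300)   # assumed value, pick per site
ds.record_failure(now=10)          # first RPC error at t=10
early = ds.should_take_offline(now=100)   # still inside the grace period
late = ds.should_take_offline(now=310)    # failing for 300 seconds
```

A short grace period behaves like a 'soft' mount (fail fast, repair often),
a long one like a 'hard' mount (clients may stall, but fewer resilvers).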

Doesn't the protocol definition say something about this? Or what do other
implementations do?

Regards,
Ronald.


