D32693.diff
diff --git a/UPDATING b/UPDATING
--- a/UPDATING
+++ b/UPDATING
@@ -27,6 +27,19 @@
world, or to merely disable the most expensive debugging functionality
at runtime, run "ln -s 'abort:false,junk:false' /etc/malloc.conf".)
+20211110:
+ Commit xxxxxx changed the TCP congestion control framework so
+ that any of the included congestion control modules can be
+ the single module built into the kernel. Previously newreno
+ was automatically built in through a direct reference. As of
+ this commit you are required to declare at least one congestion
+ control module (e.g. 'options CC_NEWRENO') and to also declare a
+ default using the CC_DEFAULT option (e.g. 'options CC_DEFAULT=\"newreno\"').
+ The GENERIC configuration includes CC_NEWRENO and defines newreno
+ as the default. If no congestion control option is built into the
+ kernel and networking is included, the kernel compile will fail;
+ likewise, if no default is declared the kernel compile will fail.
+
20211106:
Commit f0c9847a6c47 changed the arguments for VOP_ALLOCATE.
The NFS modules must be rebuilt from sources and any out
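The requirement described in the UPDATING note boils down to two config(5) lines. This fragment mirrors what the patch adds to GENERIC; any other CC module name added by this patch could be substituted for newreno:

```
options 	CC_NEWRENO		# at least one CC algorithm must be built in
options 	CC_DEFAULT=\"newreno\"	# and one of them must be named the default
```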
diff --git a/share/man/man4/cc_newreno.4 b/share/man/man4/cc_newreno.4
--- a/share/man/man4/cc_newreno.4
+++ b/share/man/man4/cc_newreno.4
@@ -75,7 +75,33 @@
.Va net.inet.tcp.cc.abe=1
per: cwnd = (cwnd * CC_NEWRENO_BETA_ECN) / 100.
Default is 80.
+.It Va CC_NEWRENO_ENABLE_HYSTART
+Enables or disables the application of Hystart++.
+The current implementation allows the values 0, 1, 2 and 3.
+A value of 0 (the default) disables the use of Hystart++.
+Setting the value to 1 enables Hystart++.
+Setting the value to 2 enables Hystart++, but on exit from Hystart++'s CSS
+the cwnd is also set to the value at which the increase in RTT first began,
+and ssthresh is set to the flight at send at the time CSS is exited.
+Setting a value of 3 keeps the cwnd setting of 2, but sets ssthresh
+to the average of the fas value at the lowest RTT (the value cwnd is
+set to) and the fas value at exit of CSS.
+.Pp
+Note that currently the only way to enable
+Hystart++ is via this socket option.
+A value of 1 enables precise internet-draft behavior
+(subject to any MIB variable settings); the other settings (2 and 3) are experimental.
.El
+.Pp
+Note that Hystart++ requires that the TCP stack be able to call into the
+congestion controller with both the
+.Va newround
+function and the
+.Va rttsample
+function.
+Currently the only TCP stack that provides this feedback to the
+congestion controller is rack.
+.Pp
.Sh MIB Variables
The algorithm exposes these variables in the
.Va net.inet.tcp.cc.newreno
@@ -94,6 +120,32 @@
.Va net.inet.tcp.cc.abe=1
per: cwnd = (cwnd * beta_ecn) / 100.
Default is 80.
+.It Va hystartplusplus.bblogs
+This boolean controls whether black box logging is done for Hystart++ events.
+If set to zero (the default) no logging is performed.
+If set to one then black box logs will be generated on all Hystart++ events.
+.It Va hystartplusplus.css_rounds
+This value controls the number of rounds that CSS runs for.
+The default value of 5 matches the current internet-draft.
+.It Va hystartplusplus.css_growth_div
+This value controls the divisor applied to the slow start increase during CSS.
+The default value of 4 matches the current internet-draft.
+.It Va hystartplusplus.n_rttsamples
+This value controls how many RTT samples must be collected in each round for
+Hystart++ to be active.
+The default value of 8 matches the current internet-draft.
+.It Va hystartplusplus.maxrtt_thresh
+This value controls the maximum RTT variance clamp when considering if CSS is needed.
+The default value of 16000 (in microseconds) matches the current internet-draft.
+For further explanation please see the internet-draft.
+.It Va hystartplusplus.minrtt_thresh
+This value controls the minimum RTT variance clamp when considering if CSS is needed.
+The default value of 4000 (in microseconds) matches the current internet-draft.
+For further explanation please see the internet-draft.
+.It Va hystartplusplus.lowcwnd
+This value controls the lowest congestion window that the TCP
+stack must reach before Hystart++ engages.
+The default value of 16 matches the current internet-draft.
.El
.Sh SEE ALSO
.Xr cc_cdg 4 ,
diff --git a/share/man/man4/mod_cc.4 b/share/man/man4/mod_cc.4
--- a/share/man/man4/mod_cc.4
+++ b/share/man/man4/mod_cc.4
@@ -67,6 +67,16 @@
for details).
Callers must pass a pointer to an algorithm specific data, and specify
its size.
+.Pp
+Unloading a congestion control module will fail if it is used as the
+default by any vnet.
+When unloading a module, the vnet default is
+used to switch each connection to an alternate congestion control.
+Note that the new congestion control module may fail to initialize its
+internal memory; if so, the module unload fails.
+If this occurs, retrying the unload will often succeed, since the
+memory shortage that prevented the switch (while the new CC module
+allocates memory) is usually transient.
.Sh MIB Variables
The framework exposes the following variables in the
.Va net.inet.tcp.cc
@@ -93,6 +103,44 @@
If non-zero, apply standard beta instead of ABE-beta during ECN-signalled
congestion recovery episodes if loss also needs to be repaired.
.El
+.Pp
+Each congestion control module may also expose other MIB variables
+to control their behaviour.
+.Sh Kernel Configuration
+All of the available congestion control modules may also be built
+into the kernel via kernel configuration options.
+A kernel configuration is required to have at least one congestion
+control algorithm built into it via a kernel option, and to specify
+a system default.
+Compilation of the kernel will fail if these two conditions are not met.
+.Sh Kernel Configuration Options
+The framework exposes the following kernel configuration options.
+.Bl -tag -width ".Va CC_NEWRENO"
+.It Va CC_NEWRENO
+This directive loads the newreno congestion control algorithm and is included
+in GENERIC by default.
+.It Va CC_CUBIC
+This directive loads the cubic congestion control algorithm.
+.It Va CC_VEGAS
+This directive loads the vegas congestion control algorithm; note that
+this algorithm also requires the TCP_HHOOK option.
+.It Va CC_CDG
+This directive loads the cdg congestion control algorithm; note that
+this algorithm also requires the TCP_HHOOK option.
+.It Va CC_DCTCP
+This directive loads the dctcp congestion control algorithm.
+.It Va CC_HD
+This directive loads the hd congestion control algorithm; note that
+this algorithm also requires the TCP_HHOOK option.
+.It Va CC_CHD
+This directive loads the chd congestion control algorithm; note that
+this algorithm also requires the TCP_HHOOK option.
+.It Va CC_HTCP
+This directive loads the htcp congestion control algorithm.
+.It Va CC_DEFAULT
+This directive specifies, as a string, the name of the system default
+algorithm; the GENERIC kernel defaults this to newreno.
+.El
.Sh SEE ALSO
.Xr cc_cdg 4 ,
.Xr cc_chd 4 ,
@@ -103,6 +151,8 @@
.Xr cc_newreno 4 ,
.Xr cc_vegas 4 ,
.Xr tcp 4 ,
+.Xr config 5 ,
+.Xr config 8 ,
.Xr mod_cc 9
.Sh ACKNOWLEDGEMENTS
Development and testing of this software were made possible in part by grants
diff --git a/share/man/man9/mod_cc.9 b/share/man/man9/mod_cc.9
--- a/share/man/man9/mod_cc.9
+++ b/share/man/man9/mod_cc.9
@@ -68,7 +68,8 @@
char name[TCP_CA_NAME_MAX];
int (*mod_init) (void);
int (*mod_destroy) (void);
- int (*cb_init) (struct cc_var *ccv);
+ size_t (*cc_data_sz)(void);
+ int (*cb_init) (struct cc_var *ccv, void *ptr);
void (*cb_destroy) (struct cc_var *ccv);
void (*conn_init) (struct cc_var *ccv);
void (*ack_received) (struct cc_var *ccv, uint16_t type);
@@ -76,6 +77,8 @@
void (*post_recovery) (struct cc_var *ccv);
void (*after_idle) (struct cc_var *ccv);
int (*ctl_output)(struct cc_var *, struct sockopt *, void *);
+ void (*rttsample)(struct cc_var *, uint32_t, uint32_t, uint32_t);
+ void (*newround)(struct cc_var *, uint32_t);
};
.Ed
.Pp
@@ -104,6 +107,17 @@
The return value is currently ignored.
.Pp
The
+.Va cc_data_sz
+function is called by the socket option code to get the size of
+data that the
+.Va cb_init
+function needs.
+The socket option code then preallocates the module's memory so that the
+.Va cb_init
+function will not fail (the socket option code uses M_WAITOK with
+no locks held to do this).
+.Pp
+The
.Va cb_init
function is called when a TCP control block
.Vt struct tcpcb
@@ -114,6 +128,9 @@
.Va cb_init
will cause the connection set up to be aborted, terminating the connection as a
result.
+Note that the ptr argument passed to the function should be checked:
+if it is non-NULL, it is preallocated memory that the cb_init function
+must use instead of calling malloc itself.
.Pp
The
.Va cb_destroy
@@ -182,6 +199,30 @@
pointer to algorithm specific argument.
.Pp
The
+.Va rttsample
+function is called to pass round trip time information to the
+congestion controller.
+The additional arguments to the function include the microsecond RTT
+that is being noted, the number of times that the data being
+acknowledged was retransmitted as well as the flightsize at send.
+For transports that do not track flightsize at send, this variable
+will be the current cwnd at the time of the call.
+.Pp
+The
+.Va newround
+function is called each time a new round trip begins.
+The monotonically increasing round number is also passed to the
+congestion controller.
+This can be used for various purposes by the congestion controller (e.g. Hystart++).
+.Pp
+Note that currently not all TCP stacks call the
+.Va rttsample
+and
+.Va newround
+functions, so dependency on these functions is also
+dependent upon which TCP stack is in use.
+.Pp
+The
.Fn DECLARE_CC_MODULE
macro provides a convenient wrapper around the
.Xr DECLARE_MODULE 9
@@ -203,8 +244,23 @@
.Vt struct cc_algo ,
but are only required to set the name field, and optionally any of the function
pointers.
+Note that if a module defines the
+.Va cb_init
+function it must also define a
+.Va cc_data_sz
+function.
+This is because when switching from one congestion control
+module to another, the socket option code will preallocate memory for the
+.Va cb_init
+function.
+If no memory is allocated by the module's
+.Va cb_init
+then the
+.Va cc_data_sz
+function should return 0.
+.Pp
The stack will skip calling any function pointer which is NULL, so there is no
-requirement to implement any of the function pointers.
+requirement to implement any of the function pointers (with the exception of
+the cb_init/cc_data_sz dependency noted above).
Using the C99 designated initialiser feature to set fields is encouraged.
.Pp
Each function pointer which deals with congestion control state is passed a
@@ -222,6 +278,8 @@
struct tcpcb *tcp;
struct sctp_nets *sctp;
} ccvc;
+ uint16_t nsegs;
+ uint8_t labc;
};
.Ed
.Pp
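The cb_init/cc_data_sz contract described above can be sketched in userspace C (the stand-in struct cc_var and the example_* names are ours, not the kernel API): a module uses the caller's preallocated buffer when one is supplied and allocates only when passed NULL.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the kernel's struct cc_var, for illustration only. */
struct cc_var {
	void *cc_data;
};

/* Hypothetical per-connection state for an example module. */
struct example_cc_data {
	unsigned rounds;
};

/* cc_data_sz(): report how much memory cb_init needs preallocated. */
static size_t
example_cc_data_sz(void)
{
	return (sizeof(struct example_cc_data));
}

/*
 * cb_init(): honor preallocated memory when ptr is non-NULL (the
 * socket-option path, where allocation cannot fail); otherwise
 * allocate the state ourselves and report ENOMEM on failure.
 */
static int
example_cb_init(struct cc_var *ccv, void *ptr)
{
	struct example_cc_data *data;

	if (ptr == NULL) {
		data = malloc(sizeof(struct example_cc_data));
		if (data == NULL)
			return (ENOMEM);
	} else
		data = ptr;
	memset(data, 0, sizeof(*data));
	ccv->cc_data = data;
	return (0);
}
```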
@@ -305,6 +363,19 @@
by the value of the congestion window.
Algorithms should use the absence of this flag being set to avoid accumulating
a large difference between the congestion window and send window.
+.Pp
+The
+.Va nsegs
+variable is used to pass in how much compression was done by the local
+LRO system.
+For example, if LRO pushed three in-order acknowledgements into
+one acknowledgement, the variable would be set to three.
+.Pp
+The
+.Va labc
+variable is used in conjunction with the CCF_USE_LOCAL_ABC flag
+to override the labc value that the congestion controller will use
+for this particular acknowledgement.
.Sh SEE ALSO
.Xr cc_cdg 4 ,
.Xr cc_chd 4 ,
diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
--- a/sys/amd64/conf/GENERIC
+++ b/sys/amd64/conf/GENERIC
@@ -30,6 +30,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET # InterNETworking
options INET6 # IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
options ROUTE_MPATH # Multipath routing support
options FIB_ALGO # Modular fib lookups
diff --git a/sys/arm/conf/std.armv6 b/sys/arm/conf/std.armv6
--- a/sys/arm/conf/std.armv6
+++ b/sys/arm/conf/std.armv6
@@ -8,6 +8,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET # InterNETworking
options INET6 # IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options TCP_HHOOK # hhook(9) framework for TCP
device crypto # core crypto support
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
diff --git a/sys/arm/conf/std.armv7 b/sys/arm/conf/std.armv7
--- a/sys/arm/conf/std.armv7
+++ b/sys/arm/conf/std.armv7
@@ -8,6 +8,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET # InterNETworking
options INET6 # IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options TCP_HHOOK # hhook(9) framework for TCP
device crypto # core crypto support
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
diff --git a/sys/arm64/conf/std.arm64 b/sys/arm64/conf/std.arm64
--- a/sys/arm64/conf/std.arm64
+++ b/sys/arm64/conf/std.arm64
@@ -11,6 +11,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET # InterNETworking
options INET6 # IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
options ROUTE_MPATH # Multipath routing support
options FIB_ALGO # Modular fib lookups
diff --git a/sys/conf/NOTES b/sys/conf/NOTES
--- a/sys/conf/NOTES
+++ b/sys/conf/NOTES
@@ -646,7 +646,26 @@
#
options INET #Internet communications protocols
options INET6 #IPv6 communications protocols
-
+#
+# Note that if you include the INET and/or INET6 options
+# you *must* define at least one of the congestion control
+# options or the compile will fail.  GENERIC defines
+# options CC_NEWRENO.  You will also need to specify
+# a default, or the compile of your kernel will fail
+# as well.  The string in the default is the name of the
+# CC module as it would appear in the sysctl for
+# setting the default.  GENERIC defines newreno
+# as the default, as shown below.
+#
+options CC_CDG
+options CC_CHD
+options CC_CUBIC
+options CC_DCTCP
+options CC_HD
+options CC_HTCP
+options CC_NEWRENO
+options CC_VEGAS
+options CC_DEFAULT=\"newreno\"
options RATELIMIT # TX rate limiting support
options ROUTETABLES=2 # allocated fibs up to 65536. default is 1.
diff --git a/sys/conf/files b/sys/conf/files
--- a/sys/conf/files
+++ b/sys/conf/files
@@ -4351,8 +4351,20 @@
netinet/ip_output.c optional inet
netinet/ip_reass.c optional inet
netinet/raw_ip.c optional inet | inet6
-netinet/cc/cc.c optional inet | inet6
-netinet/cc/cc_newreno.c optional inet | inet6
+netinet/cc/cc.c optional cc_newreno inet | cc_vegas inet | \
+ cc_htcp inet | cc_hd inet | cc_dctcp inet | cc_cubic inet | \
+ cc_chd inet | cc_cdg inet | cc_newreno inet6 | cc_vegas inet6 | \
+ cc_htcp inet6 | cc_hd inet6 | cc_dctcp inet6 | cc_cubic inet6 | \
+ cc_chd inet6 | cc_cdg inet6
+netinet/cc/cc_cdg.c optional inet cc_cdg tcp_hhook
+netinet/cc/cc_chd.c optional inet cc_chd tcp_hhook
+netinet/cc/cc_cubic.c optional inet cc_cubic | inet6 cc_cubic
+netinet/cc/cc_dctcp.c optional inet cc_dctcp | inet6 cc_dctcp
+netinet/cc/cc_hd.c optional inet cc_hd tcp_hhook
+netinet/cc/cc_htcp.c optional inet cc_htcp | inet6 cc_htcp
+netinet/cc/cc_newreno.c optional inet cc_newreno | inet6 cc_newreno
+netinet/cc/cc_vegas.c optional inet cc_vegas tcp_hhook
+netinet/khelp/h_ertt.c optional inet tcp_hhook
netinet/sctp_asconf.c optional inet sctp | inet6 sctp
netinet/sctp_auth.c optional inet sctp | inet6 sctp
netinet/sctp_bsd_addr.c optional inet sctp | inet6 sctp
diff --git a/sys/conf/options b/sys/conf/options
--- a/sys/conf/options
+++ b/sys/conf/options
@@ -81,6 +81,15 @@
CALLOUT_PROFILING
CAPABILITIES opt_capsicum.h
CAPABILITY_MODE opt_capsicum.h
+CC_CDG opt_global.h
+CC_CHD opt_global.h
+CC_CUBIC opt_global.h
+CC_DEFAULT opt_cc.h
+CC_DCTCP opt_global.h
+CC_HD opt_global.h
+CC_HTCP opt_global.h
+CC_NEWRENO opt_global.h
+CC_VEGAS opt_global.h
COMPAT_43 opt_global.h
COMPAT_43TTY opt_global.h
COMPAT_FREEBSD4 opt_global.h
diff --git a/sys/i386/conf/GENERIC b/sys/i386/conf/GENERIC
--- a/sys/i386/conf/GENERIC
+++ b/sys/i386/conf/GENERIC
@@ -31,6 +31,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET # InterNETworking
options INET6 # IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
options ROUTE_MPATH # Multipath routing support
options TCP_HHOOK # hhook(9) framework for TCP
diff --git a/sys/modules/cc/Makefile b/sys/modules/cc/Makefile
--- a/sys/modules/cc/Makefile
+++ b/sys/modules/cc/Makefile
@@ -1,6 +1,7 @@
# $FreeBSD$
-SUBDIR= cc_cubic \
+SUBDIR= cc_newreno \
+ cc_cubic \
cc_dctcp \
cc_htcp
diff --git a/sys/modules/cc/cc_newreno/Makefile b/sys/modules/cc/cc_newreno/Makefile
new file mode 100644
--- /dev/null
+++ b/sys/modules/cc/cc_newreno/Makefile
@@ -0,0 +1,7 @@
+# $FreeBSD$
+
+.PATH: ${SRCTOP}/sys/netinet/cc
+KMOD= cc_newreno
+SRCS= cc_newreno.c
+
+.include <bsd.kmod.mk>
diff --git a/sys/netinet/cc/cc.h b/sys/netinet/cc/cc.h
--- a/sys/netinet/cc/cc.h
+++ b/sys/netinet/cc/cc.h
@@ -53,10 +53,11 @@
#ifdef _KERNEL
+MALLOC_DECLARE(M_CC_MEM);
+
/* Global CC vars. */
extern STAILQ_HEAD(cc_head, cc_algo) cc_list;
extern const int tcprexmtthresh;
-extern struct cc_algo newreno_cc_algo;
/* Per-netstack bits. */
VNET_DECLARE(struct cc_algo *, default_cc_ptr);
@@ -139,8 +140,19 @@
/* Cleanup global module state on kldunload. */
int (*mod_destroy)(void);
- /* Init CC state for a new control block. */
- int (*cb_init)(struct cc_var *ccv);
+ /* Return the size of the void pointer the CC needs for state */
+ size_t (*cc_data_sz)(void);
+
+ /*
+ * Init CC state for a new control block.  The CC
+ * module may be passed a NULL ptr indicating that
+ * it must allocate the memory.  If it is passed a
+ * non-NULL pointer, that is memory pre-allocated by
+ * the caller, and cb_init is expected to use it.
+ * cb_init is not expected to fail when memory is
+ * passed in, and no currently defined module does.
+ */
+ int (*cb_init)(struct cc_var *ccv, void *ptr);
/* Cleanup CC state for a terminating control block. */
void (*cb_destroy)(struct cc_var *ccv);
@@ -176,8 +188,11 @@
int (*ctl_output)(struct cc_var *, struct sockopt *, void *);
STAILQ_ENTRY (cc_algo) entries;
+ uint8_t flags;
};
+#define CC_MODULE_BEING_REMOVED 0x01 /* The module is being removed */
+
/* Macro to obtain the CC algo's struct ptr. */
#define CC_ALGO(tp) ((tp)->cc_algo)
@@ -185,7 +200,7 @@
#define CC_DATA(tp) ((tp)->ccv->cc_data)
/* Macro to obtain the system default CC algo's struct ptr. */
-#define CC_DEFAULT() V_default_cc_ptr
+#define CC_DEFAULT_ALGO() V_default_cc_ptr
extern struct rwlock cc_list_lock;
#define CC_LIST_LOCK_INIT() rw_init(&cc_list_lock, "cc_list")
@@ -198,5 +213,16 @@
#define CC_ALGOOPT_LIMIT 2048
+/*
+ * These routines give NewReno behavior to the caller.
+ * They require no state and can be used by any other CC
+ * module that wishes to use NewReno-type behaviour (along
+ * with anything else it may add on, pre or post call).
+ */
+void newreno_cc_post_recovery(struct cc_var *);
+void newreno_cc_after_idle(struct cc_var *);
+void newreno_cc_cong_signal(struct cc_var *, uint32_t);
+void newreno_cc_ack_received(struct cc_var *, uint16_t);
+
#endif /* _KERNEL */
#endif /* _NETINET_CC_CC_H_ */
diff --git a/sys/netinet/cc/cc.c b/sys/netinet/cc/cc.c
--- a/sys/netinet/cc/cc.c
+++ b/sys/netinet/cc/cc.c
@@ -50,7 +50,7 @@
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");
-
+#include <opt_cc.h>
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/libkern.h>
@@ -70,11 +70,15 @@
#include <netinet/in.h>
#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
+#include <netinet/tcp_seq.h>
#include <netinet/tcp_var.h>
+#include <netinet/tcp_log_buf.h>
+#include <netinet/tcp_hpts.h>
#include <netinet/cc/cc.h>
-
#include <netinet/cc/cc_module.h>
+MALLOC_DEFINE(M_CC_MEM, "CC Mem", "Congestion Control State memory");
+
/*
* List of available cc algorithms on the current system. First element
* is used as the system default CC algorithm.
@@ -84,7 +88,10 @@
/* Protects the cc_list TAILQ. */
struct rwlock cc_list_lock;
-VNET_DEFINE(struct cc_algo *, default_cc_ptr) = &newreno_cc_algo;
+VNET_DEFINE(struct cc_algo *, default_cc_ptr) = NULL;
+
+VNET_DEFINE(uint32_t, newreno_beta) = 50;
+#define V_newreno_beta VNET(newreno_beta)
/*
* Sysctl handler to show and change the default CC algorithm.
@@ -98,7 +105,10 @@
/* Get the current default: */
CC_LIST_RLOCK();
- strlcpy(default_cc, CC_DEFAULT()->name, sizeof(default_cc));
+ if (CC_DEFAULT_ALGO() != NULL)
+ strlcpy(default_cc, CC_DEFAULT_ALGO()->name, sizeof(default_cc));
+ else
+ memset(default_cc, 0, TCP_CA_NAME_MAX);
CC_LIST_RUNLOCK();
error = sysctl_handle_string(oidp, default_cc, sizeof(default_cc), req);
@@ -108,7 +118,6 @@
goto done;
error = ESRCH;
-
/* Find algo with specified name and set it to default. */
CC_LIST_RLOCK();
STAILQ_FOREACH(funcs, &cc_list, entries) {
@@ -141,7 +150,9 @@
nalgos++;
}
CC_LIST_RUNLOCK();
-
+ if (nalgos == 0) {
+ return (ENOENT);
+ }
s = sbuf_new(NULL, NULL, nalgos * TCP_CA_NAME_MAX, SBUF_FIXEDLEN);
if (s == NULL)
@@ -176,12 +187,13 @@
}
/*
- * Reset the default CC algo to NewReno for any netstack which is using the algo
- * that is about to go away as its default.
+ * Return the number of vnets that are using the proposed
+ * remove_cc as their default congestion control algorithm.
*/
-static void
-cc_checkreset_default(struct cc_algo *remove_cc)
+static int
+cc_check_default(struct cc_algo *remove_cc)
{
+ int cnt = 0;
VNET_ITERATOR_DECL(vnet_iter);
CC_LIST_LOCK_ASSERT();
@@ -189,12 +201,16 @@
VNET_LIST_RLOCK_NOSLEEP();
VNET_FOREACH(vnet_iter) {
CURVNET_SET(vnet_iter);
- if (strncmp(CC_DEFAULT()->name, remove_cc->name,
- TCP_CA_NAME_MAX) == 0)
- V_default_cc_ptr = &newreno_cc_algo;
+ if ((CC_DEFAULT_ALGO() != NULL) &&
+ strncmp(CC_DEFAULT_ALGO()->name,
+ remove_cc->name,
+ TCP_CA_NAME_MAX) == 0) {
+ cnt++;
+ }
CURVNET_RESTORE();
}
VNET_LIST_RUNLOCK_NOSLEEP();
+ return (cnt);
}
/*
@@ -218,31 +234,36 @@
err = ENOENT;
- /* Never allow newreno to be deregistered. */
- if (&newreno_cc_algo == remove_cc)
- return (EPERM);
-
/* Remove algo from cc_list so that new connections can't use it. */
CC_LIST_WLOCK();
STAILQ_FOREACH_SAFE(funcs, &cc_list, entries, tmpfuncs) {
if (funcs == remove_cc) {
- cc_checkreset_default(remove_cc);
- STAILQ_REMOVE(&cc_list, funcs, cc_algo, entries);
- err = 0;
+ if (cc_check_default(remove_cc)) {
+ err = EBUSY;
+ break;
+ }
+ /* Add a temp flag to stop new adds to it */
+ funcs->flags |= CC_MODULE_BEING_REMOVED;
+ break;
+ }
+ }
+ CC_LIST_WUNLOCK();
+ err = tcp_ccalgounload(remove_cc);
+ /*
+ * Now back through and we either remove the temp flag
+ * or pull the registration.
+ */
+ CC_LIST_WLOCK();
+ STAILQ_FOREACH_SAFE(funcs, &cc_list, entries, tmpfuncs) {
+ if (funcs == remove_cc) {
+ if (err == 0)
+ STAILQ_REMOVE(&cc_list, funcs, cc_algo, entries);
+ else
+ funcs->flags &= ~CC_MODULE_BEING_REMOVED;
break;
}
}
CC_LIST_WUNLOCK();
-
- if (!err)
- /*
- * XXXLAS:
- * - We may need to handle non-zero return values in future.
- * - If we add CC framework support for protocols other than
- * TCP, we may want a more generic way to handle this step.
- */
- tcp_ccalgounload(remove_cc);
-
return (err);
}
@@ -263,19 +284,218 @@
*/
CC_LIST_WLOCK();
STAILQ_FOREACH(funcs, &cc_list, entries) {
- if (funcs == add_cc || strncmp(funcs->name, add_cc->name,
- TCP_CA_NAME_MAX) == 0)
+ if (funcs == add_cc ||
+ strncmp(funcs->name, add_cc->name,
+ TCP_CA_NAME_MAX) == 0) {
err = EEXIST;
+ break;
+ }
}
-
- if (!err)
+ /*
+ * The first loaded congestion control module will become
+ * the default until we find the "CC_DEFAULT" defined in
+ * the config (if we do).
+ */
+ if (!err) {
STAILQ_INSERT_TAIL(&cc_list, add_cc, entries);
-
+ if (strcmp(add_cc->name, CC_DEFAULT) == 0) {
+ V_default_cc_ptr = add_cc;
+ } else if (V_default_cc_ptr == NULL) {
+ V_default_cc_ptr = add_cc;
+ }
+ }
CC_LIST_WUNLOCK();
return (err);
}
+/*
+ * Perform any necessary tasks before we exit congestion recovery.
+ */
+void
+newreno_cc_post_recovery(struct cc_var *ccv)
+{
+ int pipe;
+
+ if (IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+ /*
+ * Fast recovery will conclude after returning from this
+ * function. Window inflation should have left us with
+ * approximately snd_ssthresh outstanding data. But in case we
+ * would be inclined to send a burst, better to do it via the
+ * slow start mechanism.
+ *
+ * XXXLAS: Find a way to do this without needing curack
+ */
+ if (V_tcp_do_newsack)
+ pipe = tcp_compute_pipe(ccv->ccvc.tcp);
+ else
+ pipe = CCV(ccv, snd_max) - ccv->curack;
+ if (pipe < CCV(ccv, snd_ssthresh))
+ /*
+ * Ensure that cwnd does not collapse to 1 MSS under
+ * adverse conditions. Implements RFC6582
+ */
+ CCV(ccv, snd_cwnd) = max(pipe, CCV(ccv, t_maxseg)) +
+ CCV(ccv, t_maxseg);
+ else
+ CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh);
+ }
+}
+
+void
+newreno_cc_after_idle(struct cc_var *ccv)
+{
+ uint32_t rw;
+ /*
+ * If we've been idle for more than one retransmit timeout the old
+ * congestion window is no longer current and we have to reduce it to
+ * the restart window before we can transmit again.
+ *
+ * The restart window is the initial window or the last CWND, whichever
+ * is smaller.
+ *
+ * This is done to prevent us from flooding the path with a full CWND at
+ * wirespeed, overloading router and switch buffers along the way.
+ *
+ * See RFC5681 Section 4.1. "Restarting Idle Connections".
+ *
+ * In addition, per RFC2861 Section 2, the ssthresh is set to the
+ * maximum of the former ssthresh or 3/4 of the old cwnd, to
+ * not exit slow-start prematurely.
+ */
+ rw = tcp_compute_initwnd(tcp_maxseg(ccv->ccvc.tcp));
+
+ CCV(ccv, snd_ssthresh) = max(CCV(ccv, snd_ssthresh),
+ CCV(ccv, snd_cwnd)-(CCV(ccv, snd_cwnd)>>2));
+
+ CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
+}
+
+/*
+ * Perform any necessary tasks before we enter congestion recovery.
+ */
+void
+newreno_cc_cong_signal(struct cc_var *ccv, uint32_t type)
+{
+ uint32_t cwin, factor;
+ u_int mss;
+
+ cwin = CCV(ccv, snd_cwnd);
+ mss = tcp_fixed_maxseg(ccv->ccvc.tcp);
+ /*
+ * Other TCP congestion controls use newreno_cong_signal(), but
+ * with their own private cc_data. Make sure the cc_data is used
+ * correctly.
+ */
+ factor = V_newreno_beta;
+
+ /* Catch algos which mistakenly leak private signal types. */
+ KASSERT((type & CC_SIGPRIVMASK) == 0,
+ ("%s: congestion signal type 0x%08x is private\n", __func__, type));
+
+ cwin = max(((uint64_t)cwin * (uint64_t)factor) / (100ULL * (uint64_t)mss),
+ 2) * mss;
+
+ switch (type) {
+ case CC_NDUPACK:
+ if (!IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+ if (!IN_CONGRECOVERY(CCV(ccv, t_flags)))
+ CCV(ccv, snd_ssthresh) = cwin;
+ ENTER_RECOVERY(CCV(ccv, t_flags));
+ }
+ break;
+ case CC_ECN:
+ if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+ CCV(ccv, snd_ssthresh) = cwin;
+ CCV(ccv, snd_cwnd) = cwin;
+ ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+ }
+ break;
+ case CC_RTO:
+ CCV(ccv, snd_ssthresh) = max(min(CCV(ccv, snd_wnd),
+ CCV(ccv, snd_cwnd)) / 2 / mss,
+ 2) * mss;
+ CCV(ccv, snd_cwnd) = mss;
+ break;
+ }
+}
+
+void
+newreno_cc_ack_received(struct cc_var *ccv, uint16_t type)
+{
+ if (type == CC_ACK && !IN_RECOVERY(CCV(ccv, t_flags)) &&
+ (ccv->flags & CCF_CWND_LIMITED)) {
+ u_int cw = CCV(ccv, snd_cwnd);
+ u_int incr = CCV(ccv, t_maxseg);
+
+ /*
+ * Regular in-order ACK, open the congestion window.
+ * Method depends on which congestion control state we're
+ * in (slow start or cong avoid) and if ABC (RFC 3465) is
+ * enabled.
+ *
+ * slow start: cwnd <= ssthresh
+ * cong avoid: cwnd > ssthresh
+ *
+ * slow start and ABC (RFC 3465):
+ * Grow cwnd exponentially by the amount of data
+ * ACKed capping the max increment per ACK to
+ * (abc_l_var * maxseg) bytes.
+ *
+ * slow start without ABC (RFC 5681):
+ * Grow cwnd exponentially by maxseg per ACK.
+ *
+ * cong avoid and ABC (RFC 3465):
+ * Grow cwnd linearly by maxseg per RTT for each
+ * cwnd worth of ACKed data.
+ *
+ * cong avoid without ABC (RFC 5681):
+ * Grow cwnd linearly by approximately maxseg per RTT using
+ * maxseg^2 / cwnd per ACK as the increment.
+ * If cwnd > maxseg^2, fix the cwnd increment at 1 byte to
+ * avoid capping cwnd.
+ */
+ if (cw > CCV(ccv, snd_ssthresh)) {
+ if (V_tcp_do_rfc3465) {
+ if (ccv->flags & CCF_ABC_SENTAWND)
+ ccv->flags &= ~CCF_ABC_SENTAWND;
+ else
+ incr = 0;
+ } else
+ incr = max((incr * incr / cw), 1);
+ } else if (V_tcp_do_rfc3465) {
+ /*
+ * In slow-start with ABC enabled and no RTO in sight?
+ * (Must not use abc_l_var > 1 if slow starting after
+ * an RTO. On RTO, snd_nxt = snd_una, so the
+ * snd_nxt == snd_max check is sufficient to
+ * handle this).
+ *
+ * XXXLAS: Find a way to signal SS after RTO that
+ * doesn't rely on tcpcb vars.
+ */
+ uint16_t abc_val;
+
+ if (ccv->flags & CCF_USE_LOCAL_ABC)
+ abc_val = ccv->labc;
+ else
+ abc_val = V_tcp_abc_l_var;
+ if (CCV(ccv, snd_nxt) == CCV(ccv, snd_max))
+ incr = min(ccv->bytes_this_ack,
+ ccv->nsegs * abc_val *
+ CCV(ccv, t_maxseg));
+ else
+ incr = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg));
+
+ }
+ /* ABC is on by default, so incr equals 0 frequently. */
+ if (incr > 0)
+ CCV(ccv, snd_cwnd) = min(cw + incr,
+ TCP_MAXWIN << CCV(ccv, snd_scale));
+ }
+}
+
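The congestion-avoidance arm of newreno_cc_ack_received() above can be isolated into a small pure helper for experimentation (a userspace sketch with our own naming, not kernel code): without ABC, cwnd grows by about maxseg per RTT via maxseg^2/cwnd per ACK, floored at one byte once cwnd exceeds maxseg^2.

```c
#include <stdint.h>

/*
 * RFC 5681 congestion-avoidance increment, as computed in
 * newreno_cc_ack_received() when ABC (RFC 3465) is disabled:
 * roughly maxseg bytes per RTT via maxseg^2 / cwnd per ACK,
 * with a 1-byte floor so cwnd never stops growing entirely.
 */
static uint32_t
congavoid_incr(uint32_t cwnd, uint32_t maxseg)
{
	/* maxseg * maxseg fits easily in 32 bits for MSS-sized inputs. */
	uint32_t incr = (maxseg * maxseg) / cwnd;

	return (incr > 0 ? incr : 1);
}
```

For a 10-segment window with a 1460-byte MSS this yields 146 bytes per ACK, i.e. about one MSS of growth per window of ACKs.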
/*
* Handles kld related events. Returns 0 on success, non-zero on failure.
*/
@@ -290,6 +510,15 @@
switch(event_type) {
case MOD_LOAD:
+ if ((algo->cc_data_sz == NULL) && (algo->cb_init != NULL)) {
+ /*
+ * A module must have a cc_data_sz function;
+ * even if it has no data it should return 0.
+ */
+ printf("Module load fails: it lacks a cc_data_sz() function but has a cb_init()!\n");
+ err = EINVAL;
+ break;
+ }
if (algo->mod_init != NULL)
err = algo->mod_init();
if (!err)
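The multiplicative decrease computed at the top of newreno_cc_cong_signal() in the hunk above can likewise be sketched as a pure helper (our naming, userspace sketch): the window is scaled by the beta percentage (50 by default, i.e. halved), rounded down to whole segments, and clamped to a floor of two segments.

```c
#include <stdint.h>

/*
 * The cwnd reduction from newreno_cc_cong_signal(): scale the
 * window by the beta percentage, round down to a whole number
 * of segments, and never drop below two segments.
 */
static uint32_t
newreno_reduced_cwnd(uint32_t cwin, uint32_t factor, uint32_t mss)
{
	/* 64-bit intermediate avoids overflow for large windows. */
	uint64_t segs = ((uint64_t)cwin * factor) / (100ULL * mss);

	if (segs < 2)
		segs = 2;
	return ((uint32_t)(segs * mss));
}
```

With the default beta of 50 a 20-segment window halves to 10 segments, while a 1-segment window is clamped up to the 2-segment floor.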
diff --git a/sys/netinet/cc/cc_cdg.c b/sys/netinet/cc/cc_cdg.c
--- a/sys/netinet/cc/cc_cdg.c
+++ b/sys/netinet/cc/cc_cdg.c
@@ -67,6 +67,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
#include <netinet/tcp_seq.h>
#include <netinet/tcp_timer.h>
@@ -197,10 +201,6 @@
32531,32533,32535,32537,32538,32540,32542,32544,32545,32547};
static uma_zone_t qdiffsample_zone;
-
-static MALLOC_DEFINE(M_CDG, "cdg data",
- "Per connection data required for the CDG congestion control algorithm");
-
static int ertt_id;
VNET_DEFINE_STATIC(uint32_t, cdg_alpha_inc);
@@ -222,10 +222,11 @@
static int cdg_mod_init(void);
static int cdg_mod_destroy(void);
static void cdg_conn_init(struct cc_var *ccv);
-static int cdg_cb_init(struct cc_var *ccv);
+static int cdg_cb_init(struct cc_var *ccv, void *ptr);
static void cdg_cb_destroy(struct cc_var *ccv);
static void cdg_cong_signal(struct cc_var *ccv, uint32_t signal_type);
static void cdg_ack_received(struct cc_var *ccv, uint16_t ack_type);
+static size_t cdg_data_sz(void);
struct cc_algo cdg_cc_algo = {
.name = "cdg",
@@ -235,7 +236,10 @@
.cb_init = cdg_cb_init,
.conn_init = cdg_conn_init,
.cong_signal = cdg_cong_signal,
- .mod_destroy = cdg_mod_destroy
+ .mod_destroy = cdg_mod_destroy,
+ .cc_data_sz = cdg_data_sz,
+ .post_recovery = newreno_cc_post_recovery,
+ .after_idle = newreno_cc_after_idle,
};
/* Vnet created and being initialised. */
@@ -271,10 +275,6 @@
CURVNET_RESTORE();
}
VNET_LIST_RUNLOCK();
-
- cdg_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
- cdg_cc_algo.after_idle = newreno_cc_algo.after_idle;
-
return (0);
}
@@ -286,15 +286,25 @@
return (0);
}
+static size_t
+cdg_data_sz(void)
+{
+ return (sizeof(struct cdg));
+}
+
static int
-cdg_cb_init(struct cc_var *ccv)
+cdg_cb_init(struct cc_var *ccv, void *ptr)
{
struct cdg *cdg_data;
- cdg_data = malloc(sizeof(struct cdg), M_CDG, M_NOWAIT);
- if (cdg_data == NULL)
- return (ENOMEM);
-
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ cdg_data = malloc(sizeof(struct cdg), M_CC_MEM, M_NOWAIT);
+ if (cdg_data == NULL)
+ return (ENOMEM);
+ } else {
+ cdg_data = ptr;
+ }
cdg_data->shadow_w = 0;
cdg_data->max_qtrend = 0;
cdg_data->min_qtrend = 0;
@@ -350,7 +360,7 @@
qds = qds_n;
}
- free(ccv->cc_data, M_CDG);
+ free(ccv->cc_data, M_CC_MEM);
}
static int
@@ -484,7 +494,7 @@
ENTER_RECOVERY(CCV(ccv, t_flags));
break;
default:
- newreno_cc_algo.cong_signal(ccv, signal_type);
+ newreno_cc_cong_signal(ccv, signal_type);
break;
}
}
@@ -714,5 +724,5 @@
"the window backoff for loss based CC compatibility");
DECLARE_CC_MODULE(cdg, &cdg_cc_algo);
-MODULE_VERSION(cdg, 1);
+MODULE_VERSION(cdg, 2);
MODULE_DEPEND(cdg, ertt, 1, 1, 1);
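All of the converted modules (cdg, chd, cubic, dctcp, newreno, vegas) now follow the same two-mode `cb_init` pattern: use caller-preallocated memory when `ptr` is non-NULL, and only fall back to `malloc` when it is NULL. A simplified userspace sketch of that pattern (hypothetical names; the kernel versions also assert the inpcb write lock):

```c
#include <stdlib.h>
#include <errno.h>

/* Hypothetical per-connection state, standing in for struct cdg/chd/etc. */
struct cc_state {
	long shadow_w;
	long max_qtrend;
};

struct cc_var_sketch {
	void *cc_data;
};

/* Preferred path: the caller preallocates cc_data_sz() bytes and passes
 * them in; only when ptr is NULL do we allocate here (and may fail). */
static int
cc_cb_init(struct cc_var_sketch *ccv, void *ptr)
{
	struct cc_state *st;

	if (ptr == NULL) {
		st = malloc(sizeof(struct cc_state));
		if (st == NULL)
			return (ENOMEM);
	} else
		st = ptr;
	st->shadow_w = 0;
	st->max_qtrend = 0;
	ccv->cc_data = st;
	return (0);
}
```

The point of the preallocated path is that init cannot fail, which is what lets `tcp_congestion()` allocate with `M_WAITOK` up front and then switch algorithms without an error path.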
diff --git a/sys/netinet/cc/cc_chd.c b/sys/netinet/cc/cc_chd.c
--- a/sys/netinet/cc/cc_chd.c
+++ b/sys/netinet/cc/cc_chd.c
@@ -69,6 +69,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
#include <netinet/tcp_seq.h>
#include <netinet/tcp_timer.h>
@@ -89,10 +93,11 @@
static void chd_ack_received(struct cc_var *ccv, uint16_t ack_type);
static void chd_cb_destroy(struct cc_var *ccv);
-static int chd_cb_init(struct cc_var *ccv);
+static int chd_cb_init(struct cc_var *ccv, void *ptr);
static void chd_cong_signal(struct cc_var *ccv, uint32_t signal_type);
static void chd_conn_init(struct cc_var *ccv);
static int chd_mod_init(void);
+static size_t chd_data_sz(void);
struct chd {
/*
@@ -126,8 +131,6 @@
#define V_chd_loss_fair VNET(chd_loss_fair)
#define V_chd_use_max VNET(chd_use_max)
-static MALLOC_DEFINE(M_CHD, "chd data",
- "Per connection data required for the CHD congestion control algorithm");
struct cc_algo chd_cc_algo = {
.name = "chd",
@@ -136,7 +139,10 @@
.cb_init = chd_cb_init,
.cong_signal = chd_cong_signal,
.conn_init = chd_conn_init,
- .mod_init = chd_mod_init
+ .mod_init = chd_mod_init,
+ .cc_data_sz = chd_data_sz,
+ .after_idle = newreno_cc_after_idle,
+ .post_recovery = newreno_cc_post_recovery,
};
static __inline void
@@ -304,18 +310,27 @@
static void
chd_cb_destroy(struct cc_var *ccv)
{
+ free(ccv->cc_data, M_CC_MEM);
+}
- free(ccv->cc_data, M_CHD);
+static size_t
+chd_data_sz(void)
+{
+ return (sizeof(struct chd));
}
static int
-chd_cb_init(struct cc_var *ccv)
+chd_cb_init(struct cc_var *ccv, void *ptr)
{
struct chd *chd_data;
- chd_data = malloc(sizeof(struct chd), M_CHD, M_NOWAIT);
- if (chd_data == NULL)
- return (ENOMEM);
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ chd_data = malloc(sizeof(struct chd), M_CC_MEM, M_NOWAIT);
+ if (chd_data == NULL)
+ return (ENOMEM);
+ } else
+ chd_data = ptr;
chd_data->shadow_w = 0;
ccv->cc_data = chd_data;
@@ -374,7 +389,7 @@
break;
default:
- newreno_cc_algo.cong_signal(ccv, signal_type);
+ newreno_cc_cong_signal(ccv, signal_type);
}
}
@@ -403,10 +418,6 @@
printf("%s: h_ertt module not found\n", __func__);
return (ENOENT);
}
-
- chd_cc_algo.after_idle = newreno_cc_algo.after_idle;
- chd_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
-
return (0);
}
@@ -493,5 +504,5 @@
"as the basic delay measurement for the algorithm.");
DECLARE_CC_MODULE(chd, &chd_cc_algo);
-MODULE_VERSION(chd, 1);
+MODULE_VERSION(chd, 2);
MODULE_DEPEND(chd, ertt, 1, 1, 1);
diff --git a/sys/netinet/cc/cc_cubic.c b/sys/netinet/cc/cc_cubic.c
--- a/sys/netinet/cc/cc_cubic.c
+++ b/sys/netinet/cc/cc_cubic.c
@@ -62,6 +62,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
#include <netinet/tcp_seq.h>
#include <netinet/tcp_timer.h>
@@ -72,7 +76,7 @@
static void cubic_ack_received(struct cc_var *ccv, uint16_t type);
static void cubic_cb_destroy(struct cc_var *ccv);
-static int cubic_cb_init(struct cc_var *ccv);
+static int cubic_cb_init(struct cc_var *ccv, void *ptr);
static void cubic_cong_signal(struct cc_var *ccv, uint32_t type);
static void cubic_conn_init(struct cc_var *ccv);
static int cubic_mod_init(void);
@@ -80,6 +84,7 @@
static void cubic_record_rtt(struct cc_var *ccv);
static void cubic_ssthresh_update(struct cc_var *ccv, uint32_t maxseg);
static void cubic_after_idle(struct cc_var *ccv);
+static size_t cubic_data_sz(void);
struct cubic {
/* Cubic K in fixed point form with CUBIC_SHIFT worth of precision. */
@@ -114,9 +119,6 @@
int t_last_cong_prev;
};
-static MALLOC_DEFINE(M_CUBIC, "cubic data",
- "Per connection data required for the CUBIC congestion control algorithm");
-
struct cc_algo cubic_cc_algo = {
.name = "cubic",
.ack_received = cubic_ack_received,
@@ -127,6 +129,7 @@
.mod_init = cubic_mod_init,
.post_recovery = cubic_post_recovery,
.after_idle = cubic_after_idle,
+ .cc_data_sz = cubic_data_sz
};
static void
@@ -149,7 +152,7 @@
if (CCV(ccv, snd_cwnd) <= CCV(ccv, snd_ssthresh) ||
cubic_data->min_rtt_ticks == TCPTV_SRTTBASE) {
cubic_data->flags |= CUBICFLAG_IN_SLOWSTART;
- newreno_cc_algo.ack_received(ccv, type);
+ newreno_cc_ack_received(ccv, type);
} else {
if ((cubic_data->flags & CUBICFLAG_RTO_EVENT) &&
(cubic_data->flags & CUBICFLAG_IN_SLOWSTART)) {
@@ -243,25 +246,34 @@
cubic_data->max_cwnd = ulmax(cubic_data->max_cwnd, CCV(ccv, snd_cwnd));
cubic_data->K = cubic_k(cubic_data->max_cwnd / CCV(ccv, t_maxseg));
- newreno_cc_algo.after_idle(ccv);
+ newreno_cc_after_idle(ccv);
cubic_data->t_last_cong = ticks;
}
static void
cubic_cb_destroy(struct cc_var *ccv)
{
- free(ccv->cc_data, M_CUBIC);
+ free(ccv->cc_data, M_CC_MEM);
+}
+
+static size_t
+cubic_data_sz(void)
+{
+ return (sizeof(struct cubic));
}
static int
-cubic_cb_init(struct cc_var *ccv)
+cubic_cb_init(struct cc_var *ccv, void *ptr)
{
struct cubic *cubic_data;
- cubic_data = malloc(sizeof(struct cubic), M_CUBIC, M_NOWAIT|M_ZERO);
-
- if (cubic_data == NULL)
- return (ENOMEM);
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ cubic_data = malloc(sizeof(struct cubic), M_CC_MEM, M_NOWAIT|M_ZERO);
+ if (cubic_data == NULL)
+ return (ENOMEM);
+ } else
+ cubic_data = ptr;
/* Init some key variables with sensible defaults. */
cubic_data->t_last_cong = ticks;
@@ -484,4 +496,4 @@
}
DECLARE_CC_MODULE(cubic, &cubic_cc_algo);
-MODULE_VERSION(cubic, 1);
+MODULE_VERSION(cubic, 2);
diff --git a/sys/netinet/cc/cc_dctcp.c b/sys/netinet/cc/cc_dctcp.c
--- a/sys/netinet/cc/cc_dctcp.c
+++ b/sys/netinet/cc/cc_dctcp.c
@@ -50,6 +50,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
#include <netinet/tcp_seq.h>
#include <netinet/tcp_var.h>
@@ -76,18 +80,16 @@
uint32_t num_cong_events; /* # of congestion events */
};
-static MALLOC_DEFINE(M_dctcp, "dctcp data",
- "Per connection data required for the dctcp algorithm");
-
static void dctcp_ack_received(struct cc_var *ccv, uint16_t type);
static void dctcp_after_idle(struct cc_var *ccv);
static void dctcp_cb_destroy(struct cc_var *ccv);
-static int dctcp_cb_init(struct cc_var *ccv);
+static int dctcp_cb_init(struct cc_var *ccv, void *ptr);
static void dctcp_cong_signal(struct cc_var *ccv, uint32_t type);
static void dctcp_conn_init(struct cc_var *ccv);
static void dctcp_post_recovery(struct cc_var *ccv);
static void dctcp_ecnpkt_handler(struct cc_var *ccv);
static void dctcp_update_alpha(struct cc_var *ccv);
+static size_t dctcp_data_sz(void);
struct cc_algo dctcp_cc_algo = {
.name = "dctcp",
@@ -99,6 +101,7 @@
.post_recovery = dctcp_post_recovery,
.ecnpkt_handler = dctcp_ecnpkt_handler,
.after_idle = dctcp_after_idle,
+ .cc_data_sz = dctcp_data_sz,
};
static void
@@ -117,10 +120,10 @@
*/
if (IN_CONGRECOVERY(CCV(ccv, t_flags))) {
EXIT_CONGRECOVERY(CCV(ccv, t_flags));
- newreno_cc_algo.ack_received(ccv, type);
+ newreno_cc_ack_received(ccv, type);
ENTER_CONGRECOVERY(CCV(ccv, t_flags));
} else
- newreno_cc_algo.ack_received(ccv, type);
+ newreno_cc_ack_received(ccv, type);
if (type == CC_DUPACK)
bytes_acked = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg));
@@ -158,7 +161,13 @@
SEQ_GT(ccv->curack, dctcp_data->save_sndnxt))
dctcp_update_alpha(ccv);
} else
- newreno_cc_algo.ack_received(ccv, type);
+ newreno_cc_ack_received(ccv, type);
+}
+
+static size_t
+dctcp_data_sz(void)
+{
+ return (sizeof(struct dctcp));
}
static void
@@ -179,25 +188,27 @@
dctcp_data->num_cong_events = 0;
}
- newreno_cc_algo.after_idle(ccv);
+ newreno_cc_after_idle(ccv);
}
static void
dctcp_cb_destroy(struct cc_var *ccv)
{
- free(ccv->cc_data, M_dctcp);
+ free(ccv->cc_data, M_CC_MEM);
}
static int
-dctcp_cb_init(struct cc_var *ccv)
+dctcp_cb_init(struct cc_var *ccv, void *ptr)
{
struct dctcp *dctcp_data;
- dctcp_data = malloc(sizeof(struct dctcp), M_dctcp, M_NOWAIT|M_ZERO);
-
- if (dctcp_data == NULL)
- return (ENOMEM);
-
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ dctcp_data = malloc(sizeof(struct dctcp), M_CC_MEM, M_NOWAIT|M_ZERO);
+ if (dctcp_data == NULL)
+ return (ENOMEM);
+ } else
+ dctcp_data = ptr;
/* Initialize some key variables with sensible defaults. */
dctcp_data->bytes_ecn = 0;
dctcp_data->bytes_total = 0;
@@ -292,7 +303,7 @@
break;
}
} else
- newreno_cc_algo.cong_signal(ccv, type);
+ newreno_cc_cong_signal(ccv, type);
}
static void
@@ -312,7 +323,7 @@
static void
dctcp_post_recovery(struct cc_var *ccv)
{
- newreno_cc_algo.post_recovery(ccv);
+ newreno_cc_post_recovery(ccv);
if (CCV(ccv, t_flags2) & TF2_ECN_PERMIT)
dctcp_update_alpha(ccv);
@@ -468,4 +479,4 @@
"half CWND reduction after the first slow start");
DECLARE_CC_MODULE(dctcp, &dctcp_cc_algo);
-MODULE_VERSION(dctcp, 1);
+MODULE_VERSION(dctcp, 2);
diff --git a/sys/netinet/cc/cc_hd.c b/sys/netinet/cc/cc_hd.c
--- a/sys/netinet/cc/cc_hd.c
+++ b/sys/netinet/cc/cc_hd.c
@@ -84,6 +84,7 @@
static void hd_ack_received(struct cc_var *ccv, uint16_t ack_type);
static int hd_mod_init(void);
+static size_t hd_data_sz(void);
static int ertt_id;
@@ -97,9 +98,19 @@
struct cc_algo hd_cc_algo = {
.name = "hd",
.ack_received = hd_ack_received,
- .mod_init = hd_mod_init
+ .mod_init = hd_mod_init,
+ .cc_data_sz = hd_data_sz,
+ .after_idle = newreno_cc_after_idle,
+ .cong_signal = newreno_cc_cong_signal,
+ .post_recovery = newreno_cc_post_recovery,
};
+static size_t
+hd_data_sz(void)
+{
+ return (0);
+}
+
/*
* Hamilton backoff function. Returns 1 if we should backoff or 0 otherwise.
*/
@@ -150,14 +161,14 @@
* half cwnd and behave like an ECN (ie
* not a packet loss).
*/
- newreno_cc_algo.cong_signal(ccv,
+ newreno_cc_cong_signal(ccv,
CC_ECN);
return;
}
}
}
}
- newreno_cc_algo.ack_received(ccv, ack_type); /* As for NewReno. */
+ newreno_cc_ack_received(ccv, ack_type);
}
static int
@@ -169,11 +180,6 @@
printf("%s: h_ertt module not found\n", __func__);
return (ENOENT);
}
-
- hd_cc_algo.after_idle = newreno_cc_algo.after_idle;
- hd_cc_algo.cong_signal = newreno_cc_algo.cong_signal;
- hd_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
-
return (0);
}
@@ -251,5 +257,5 @@
"minimum queueing delay threshold (qmin) in ticks");
DECLARE_CC_MODULE(hd, &hd_cc_algo);
-MODULE_VERSION(hd, 1);
+MODULE_VERSION(hd, 2);
MODULE_DEPEND(hd, ertt, 1, 1, 1);
diff --git a/sys/netinet/cc/cc_htcp.c b/sys/netinet/cc/cc_htcp.c
--- a/sys/netinet/cc/cc_htcp.c
+++ b/sys/netinet/cc/cc_htcp.c
@@ -64,6 +64,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
#include <netinet/tcp_seq.h>
#include <netinet/tcp_timer.h>
@@ -137,7 +141,7 @@
static void htcp_ack_received(struct cc_var *ccv, uint16_t type);
static void htcp_cb_destroy(struct cc_var *ccv);
-static int htcp_cb_init(struct cc_var *ccv);
+static int htcp_cb_init(struct cc_var *ccv, void *ptr);
static void htcp_cong_signal(struct cc_var *ccv, uint32_t type);
static int htcp_mod_init(void);
static void htcp_post_recovery(struct cc_var *ccv);
@@ -145,6 +149,7 @@
static void htcp_recalc_beta(struct cc_var *ccv);
static void htcp_record_rtt(struct cc_var *ccv);
static void htcp_ssthresh_update(struct cc_var *ccv);
+static size_t htcp_data_sz(void);
struct htcp {
/* cwnd before entering cong recovery. */
@@ -175,9 +180,6 @@
#define V_htcp_adaptive_backoff VNET(htcp_adaptive_backoff)
#define V_htcp_rtt_scaling VNET(htcp_rtt_scaling)
-static MALLOC_DEFINE(M_HTCP, "htcp data",
- "Per connection data required for the HTCP congestion control algorithm");
-
struct cc_algo htcp_cc_algo = {
.name = "htcp",
.ack_received = htcp_ack_received,
@@ -186,6 +188,8 @@
.cong_signal = htcp_cong_signal,
.mod_init = htcp_mod_init,
.post_recovery = htcp_post_recovery,
+ .cc_data_sz = htcp_data_sz,
+ .after_idle = newreno_cc_after_idle,
};
static void
@@ -214,7 +218,7 @@
*/
if (htcp_data->alpha == 1 ||
CCV(ccv, snd_cwnd) <= CCV(ccv, snd_ssthresh))
- newreno_cc_algo.ack_received(ccv, type);
+ newreno_cc_ack_received(ccv, type);
else {
if (V_tcp_do_rfc3465) {
/* Increment cwnd by alpha segments. */
@@ -238,18 +242,27 @@
static void
htcp_cb_destroy(struct cc_var *ccv)
{
- free(ccv->cc_data, M_HTCP);
+ free(ccv->cc_data, M_CC_MEM);
+}
+
+static size_t
+htcp_data_sz(void)
+{
+ return (sizeof(struct htcp));
}
static int
-htcp_cb_init(struct cc_var *ccv)
+htcp_cb_init(struct cc_var *ccv, void *ptr)
{
struct htcp *htcp_data;
- htcp_data = malloc(sizeof(struct htcp), M_HTCP, M_NOWAIT);
-
- if (htcp_data == NULL)
- return (ENOMEM);
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ htcp_data = malloc(sizeof(struct htcp), M_CC_MEM, M_NOWAIT);
+ if (htcp_data == NULL)
+ return (ENOMEM);
+ } else
+ htcp_data = ptr;
/* Init some key variables with sensible defaults. */
htcp_data->alpha = HTCP_INIT_ALPHA;
@@ -333,16 +346,12 @@
static int
htcp_mod_init(void)
{
-
- htcp_cc_algo.after_idle = newreno_cc_algo.after_idle;
-
/*
* HTCP_RTT_REF is defined in ms, and t_srtt in the tcpcb is stored in
* units of TCP_RTT_SCALE*hz. Scale HTCP_RTT_REF to be in the same units
* as t_srtt.
*/
htcp_rtt_ref = (HTCP_RTT_REF * TCP_RTT_SCALE * hz) / 1000;
-
return (0);
}
@@ -535,4 +544,4 @@
"enable H-TCP RTT scaling");
DECLARE_CC_MODULE(htcp, &htcp_cc_algo);
-MODULE_VERSION(htcp, 1);
+MODULE_VERSION(htcp, 2);
diff --git a/sys/netinet/cc/cc_newreno.c b/sys/netinet/cc/cc_newreno.c
--- a/sys/netinet/cc/cc_newreno.c
+++ b/sys/netinet/cc/cc_newreno.c
@@ -71,6 +71,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
@@ -82,22 +86,20 @@
#include <netinet/cc/cc_module.h>
#include <netinet/cc/cc_newreno.h>
-static MALLOC_DEFINE(M_NEWRENO, "newreno data",
- "newreno beta values");
-
static void newreno_cb_destroy(struct cc_var *ccv);
static void newreno_ack_received(struct cc_var *ccv, uint16_t type);
static void newreno_after_idle(struct cc_var *ccv);
static void newreno_cong_signal(struct cc_var *ccv, uint32_t type);
-static void newreno_post_recovery(struct cc_var *ccv);
static int newreno_ctl_output(struct cc_var *ccv, struct sockopt *sopt, void *buf);
static void newreno_newround(struct cc_var *ccv, uint32_t round_cnt);
static void newreno_rttsample(struct cc_var *ccv, uint32_t usec_rtt, uint32_t rxtcnt, uint32_t fas);
-static int newreno_cb_init(struct cc_var *ccv);
+static int newreno_cb_init(struct cc_var *ccv, void *);
+static size_t newreno_data_sz(void);
-VNET_DEFINE(uint32_t, newreno_beta) = 50;
-VNET_DEFINE(uint32_t, newreno_beta_ecn) = 80;
+
+VNET_DECLARE(uint32_t, newreno_beta);
#define V_newreno_beta VNET(newreno_beta)
+VNET_DEFINE(uint32_t, newreno_beta_ecn) = 80;
#define V_newreno_beta_ecn VNET(newreno_beta_ecn)
struct cc_algo newreno_cc_algo = {
@@ -106,11 +108,12 @@
.ack_received = newreno_ack_received,
.after_idle = newreno_after_idle,
.cong_signal = newreno_cong_signal,
- .post_recovery = newreno_post_recovery,
+ .post_recovery = newreno_cc_post_recovery,
.ctl_output = newreno_ctl_output,
.newround = newreno_newround,
.rttsample = newreno_rttsample,
.cb_init = newreno_cb_init,
+ .cc_data_sz = newreno_data_sz,
};
static uint32_t hystart_lowcwnd = 16;
@@ -167,14 +170,24 @@
}
}
+static size_t
+newreno_data_sz(void)
+{
+ return (sizeof(struct newreno));
+}
+
static int
-newreno_cb_init(struct cc_var *ccv)
+newreno_cb_init(struct cc_var *ccv, void *ptr)
{
struct newreno *nreno;
- ccv->cc_data = malloc(sizeof(struct newreno), M_NEWRENO, M_NOWAIT);
- if (ccv->cc_data == NULL)
- return (ENOMEM);
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ ccv->cc_data = malloc(sizeof(struct newreno), M_CC_MEM, M_NOWAIT);
+ if (ccv->cc_data == NULL)
+ return (ENOMEM);
+ } else
+ ccv->cc_data = ptr;
nreno = (struct newreno *)ccv->cc_data;
/* NB: nreno is not zeroed, so initialise all fields. */
nreno->beta = V_newreno_beta;
@@ -201,7 +214,7 @@
static void
newreno_cb_destroy(struct cc_var *ccv)
{
- free(ccv->cc_data, M_NEWRENO);
+ free(ccv->cc_data, M_CC_MEM);
}
static void
@@ -209,13 +222,7 @@
{
struct newreno *nreno;
- /*
- * Other TCP congestion controls use newreno_ack_received(), but
- * with their own private cc_data. Make sure the cc_data is used
- * correctly.
- */
- nreno = (CC_ALGO(ccv->ccvc.tcp) == &newreno_cc_algo) ? ccv->cc_data : NULL;
-
+ nreno = ccv->cc_data;
if (type == CC_ACK && !IN_RECOVERY(CCV(ccv, t_flags)) &&
(ccv->flags & CCF_CWND_LIMITED)) {
u_int cw = CCV(ccv, snd_cwnd);
@@ -249,8 +256,7 @@
* avoid capping cwnd.
*/
if (cw > CCV(ccv, snd_ssthresh)) {
- if ((nreno != NULL) &&
- (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS)) {
+ if (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS) {
/*
* We have slipped into CA with
* CSS active. Deactivate all.
@@ -284,8 +290,7 @@
abc_val = ccv->labc;
else
abc_val = V_tcp_abc_l_var;
- if ((nreno != NULL) &&
- (nreno->newreno_flags & CC_NEWRENO_HYSTART_ALLOWED) &&
+ if ((nreno->newreno_flags & CC_NEWRENO_HYSTART_ALLOWED) &&
(nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) &&
((nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS) == 0)) {
/*
@@ -323,8 +328,7 @@
incr = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg));
/* Only if Hystart is enabled will the flag get set */
- if ((nreno != NULL) &&
- (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS)) {
+ if (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS) {
incr /= hystart_css_growth_div;
newreno_log_hystart_event(ccv, nreno, 3, incr);
}
@@ -340,39 +344,10 @@
newreno_after_idle(struct cc_var *ccv)
{
struct newreno *nreno;
- uint32_t rw;
-
- /*
- * Other TCP congestion controls use newreno_after_idle(), but
- * with their own private cc_data. Make sure the cc_data is used
- * correctly.
- */
- nreno = (CC_ALGO(ccv->ccvc.tcp) == &newreno_cc_algo) ? ccv->cc_data : NULL;
- /*
- * If we've been idle for more than one retransmit timeout the old
- * congestion window is no longer current and we have to reduce it to
- * the restart window before we can transmit again.
- *
- * The restart window is the initial window or the last CWND, whichever
- * is smaller.
- *
- * This is done to prevent us from flooding the path with a full CWND at
- * wirespeed, overloading router and switch buffers along the way.
- *
- * See RFC5681 Section 4.1. "Restarting Idle Connections".
- *
- * In addition, per RFC2861 Section 2, the ssthresh is set to the
- * maximum of the former ssthresh or 3/4 of the old cwnd, to
- * not exit slow-start prematurely.
- */
- rw = tcp_compute_initwnd(tcp_maxseg(ccv->ccvc.tcp));
-
- CCV(ccv, snd_ssthresh) = max(CCV(ccv, snd_ssthresh),
- CCV(ccv, snd_cwnd)-(CCV(ccv, snd_cwnd)>>2));
- CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
- if ((nreno != NULL) &&
- (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) == 0) {
+ nreno = ccv->cc_data;
+ newreno_cc_after_idle(ccv);
+ if ((nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) == 0) {
if (CCV(ccv, snd_cwnd) <= (hystart_lowcwnd * tcp_fixed_maxseg(ccv->ccvc.tcp))) {
/*
* Re-enable hystart if our cwnd has fallen below
@@ -396,12 +371,7 @@
cwin = CCV(ccv, snd_cwnd);
mss = tcp_fixed_maxseg(ccv->ccvc.tcp);
- /*
- * Other TCP congestion controls use newreno_cong_signal(), but
- * with their own private cc_data. Make sure the cc_data is used
- * correctly.
- */
- nreno = (CC_ALGO(ccv->ccvc.tcp) == &newreno_cc_algo) ? ccv->cc_data : NULL;
+ nreno = ccv->cc_data;
beta = (nreno == NULL) ? V_newreno_beta : nreno->beta;
beta_ecn = (nreno == NULL) ? V_newreno_beta_ecn : nreno->beta_ecn;
/*
@@ -426,8 +396,7 @@
switch (type) {
case CC_NDUPACK:
- if ((nreno != NULL) &&
- (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED)) {
+ if (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) {
/* Make sure the flags are all off we had a loss */
nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_ENABLED;
nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_IN_CSS;
@@ -445,8 +414,7 @@
}
break;
case CC_ECN:
- if ((nreno != NULL) &&
- (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED)) {
+ if (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) {
/* Make sure the flags are all off we had a loss */
nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_ENABLED;
nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_IN_CSS;
@@ -466,41 +434,6 @@
}
}
-/*
- * Perform any necessary tasks before we exit congestion recovery.
- */
-static void
-newreno_post_recovery(struct cc_var *ccv)
-{
- int pipe;
-
- if (IN_FASTRECOVERY(CCV(ccv, t_flags))) {
- /*
- * Fast recovery will conclude after returning from this
- * function. Window inflation should have left us with
- * approximately snd_ssthresh outstanding data. But in case we
- * would be inclined to send a burst, better to do it via the
- * slow start mechanism.
- *
- * XXXLAS: Find a way to do this without needing curack
- */
- if (V_tcp_do_newsack)
- pipe = tcp_compute_pipe(ccv->ccvc.tcp);
- else
- pipe = CCV(ccv, snd_max) - ccv->curack;
-
- if (pipe < CCV(ccv, snd_ssthresh))
- /*
- * Ensure that cwnd does not collapse to 1 MSS under
- * adverse conditons. Implements RFC6582
- */
- CCV(ccv, snd_cwnd) = max(pipe, CCV(ccv, t_maxseg)) +
- CCV(ccv, t_maxseg);
- else
- CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh);
- }
-}
-
static int
newreno_ctl_output(struct cc_var *ccv, struct sockopt *sopt, void *buf)
{
@@ -723,4 +656,4 @@
DECLARE_CC_MODULE(newreno, &newreno_cc_algo);
-MODULE_VERSION(newreno, 1);
+MODULE_VERSION(newreno, 2);
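The `newreno_post_recovery()` body deleted above (presumably now provided to all modules as the shared `newreno_cc_post_recovery()`) implements the RFC 6582 exit-from-recovery window: clamp cwnd to ssthresh, but never collapse it to a single MSS when little data is in flight. A sketch of just that computation, with the tcpcb fields flattened into plain parameters for illustration:

```c
/* RFC 6582-style cwnd on fast-recovery exit: if the amount of data in
 * flight ("pipe") is below ssthresh, restart from roughly pipe plus one
 * segment (at least two segments total) via slow start; otherwise use
 * ssthresh directly. */
static unsigned long
post_recovery_cwnd(unsigned long pipe, unsigned long ssthresh,
    unsigned long maxseg)
{
	if (pipe < ssthresh) {
		/* Ensure cwnd does not collapse to 1 MSS under
		 * adverse conditions. */
		return ((pipe > maxseg ? pipe : maxseg) + maxseg);
	}
	return (ssthresh);
}
```

With `maxseg` = 1460: a fully drained pipe restarts at two segments, a half-full pipe at pipe + 1 MSS, and a pipe at or above ssthresh pins cwnd to ssthresh.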
diff --git a/sys/netinet/cc/cc_vegas.c b/sys/netinet/cc/cc_vegas.c
--- a/sys/netinet/cc/cc_vegas.c
+++ b/sys/netinet/cc/cc_vegas.c
@@ -71,6 +71,10 @@
#include <net/vnet.h>
+#include <net/route.h>
+#include <net/route/nhop.h>
+
+#include <netinet/in_pcb.h>
#include <netinet/tcp.h>
#include <netinet/tcp_timer.h>
#include <netinet/tcp_var.h>
@@ -87,10 +91,11 @@
static void vegas_ack_received(struct cc_var *ccv, uint16_t ack_type);
static void vegas_cb_destroy(struct cc_var *ccv);
-static int vegas_cb_init(struct cc_var *ccv);
+static int vegas_cb_init(struct cc_var *ccv, void *ptr);
static void vegas_cong_signal(struct cc_var *ccv, uint32_t signal_type);
static void vegas_conn_init(struct cc_var *ccv);
static int vegas_mod_init(void);
+static size_t vegas_data_sz(void);
struct vegas {
int slow_start_toggle;
@@ -103,9 +108,6 @@
#define V_vegas_alpha VNET(vegas_alpha)
#define V_vegas_beta VNET(vegas_beta)
-static MALLOC_DEFINE(M_VEGAS, "vegas data",
- "Per connection data required for the Vegas congestion control algorithm");
-
struct cc_algo vegas_cc_algo = {
.name = "vegas",
.ack_received = vegas_ack_received,
@@ -113,7 +115,10 @@
.cb_init = vegas_cb_init,
.cong_signal = vegas_cong_signal,
.conn_init = vegas_conn_init,
- .mod_init = vegas_mod_init
+ .mod_init = vegas_mod_init,
+ .cc_data_sz = vegas_data_sz,
+ .after_idle = newreno_cc_after_idle,
+ .post_recovery = newreno_cc_post_recovery,
};
/*
@@ -162,24 +167,33 @@
}
if (vegas_data->slow_start_toggle)
- newreno_cc_algo.ack_received(ccv, ack_type);
+ newreno_cc_ack_received(ccv, ack_type);
}
static void
vegas_cb_destroy(struct cc_var *ccv)
{
- free(ccv->cc_data, M_VEGAS);
+ free(ccv->cc_data, M_CC_MEM);
+}
+
+static size_t
+vegas_data_sz(void)
+{
+ return (sizeof(struct vegas));
}
static int
-vegas_cb_init(struct cc_var *ccv)
+vegas_cb_init(struct cc_var *ccv, void *ptr)
{
struct vegas *vegas_data;
- vegas_data = malloc(sizeof(struct vegas), M_VEGAS, M_NOWAIT);
-
- if (vegas_data == NULL)
- return (ENOMEM);
+ INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+ if (ptr == NULL) {
+ vegas_data = malloc(sizeof(struct vegas), M_CC_MEM, M_NOWAIT);
+ if (vegas_data == NULL)
+ return (ENOMEM);
+ } else
+ vegas_data = ptr;
vegas_data->slow_start_toggle = 1;
ccv->cc_data = vegas_data;
@@ -216,7 +230,7 @@
break;
default:
- newreno_cc_algo.cong_signal(ccv, signal_type);
+ newreno_cc_cong_signal(ccv, signal_type);
}
if (IN_RECOVERY(CCV(ccv, t_flags)) && !presignalrecov)
@@ -236,16 +250,11 @@
static int
vegas_mod_init(void)
{
-
ertt_id = khelp_get_id("ertt");
if (ertt_id <= 0) {
printf("%s: h_ertt module not found\n", __func__);
return (ENOENT);
}
-
- vegas_cc_algo.after_idle = newreno_cc_algo.after_idle;
- vegas_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
-
return (0);
}
@@ -301,5 +310,5 @@
"vegas beta, specified as number of \"buffers\" (0 < alpha < beta)");
DECLARE_CC_MODULE(vegas, &vegas_cc_algo);
-MODULE_VERSION(vegas, 1);
+MODULE_VERSION(vegas, 2);
MODULE_DEPEND(vegas, ertt, 1, 1, 1);
diff --git a/sys/netinet/tcp_subr.c b/sys/netinet/tcp_subr.c
--- a/sys/netinet/tcp_subr.c
+++ b/sys/netinet/tcp_subr.c
@@ -2137,8 +2137,9 @@
*/
CC_LIST_RLOCK();
KASSERT(!STAILQ_EMPTY(&cc_list), ("cc_list is empty!"));
- CC_ALGO(tp) = CC_DEFAULT();
+ CC_ALGO(tp) = CC_DEFAULT_ALGO();
CC_LIST_RUNLOCK();
+
/*
* The tcpcb will hold a reference on its inpcb until tcp_discardcb()
* is called.
@@ -2147,7 +2148,7 @@
tp->t_inpcb = inp;
if (CC_ALGO(tp)->cb_init != NULL)
- if (CC_ALGO(tp)->cb_init(tp->ccv) > 0) {
+ if (CC_ALGO(tp)->cb_init(tp->ccv, NULL) > 0) {
if (tp->t_fb->tfb_tcp_fb_fini)
(*tp->t_fb->tfb_tcp_fb_fini)(tp, 1);
in_pcbrele_wlocked(inp);
@@ -2240,25 +2241,23 @@
}
/*
- * Switch the congestion control algorithm back to NewReno for any active
- * control blocks using an algorithm which is about to go away.
- * This ensures the CC framework can allow the unload to proceed without leaving
- * any dangling pointers which would trigger a panic.
- * Returning non-zero would inform the CC framework that something went wrong
- * and it would be unsafe to allow the unload to proceed. However, there is no
- * way for this to occur with this implementation so we always return zero.
+ * Switch the congestion control algorithm back to the vnet default for any
+ * active control blocks using an algorithm which is about to go away. If the
+ * new algorithm has a cb_init function and it fails (e.g. no memory), then
+ * the operation fails and the unload will not succeed.
*/
int
tcp_ccalgounload(struct cc_algo *unload_algo)
{
- struct cc_algo *tmpalgo;
+ struct cc_algo *oldalgo, *newalgo;
struct inpcb *inp;
struct tcpcb *tp;
VNET_ITERATOR_DECL(vnet_iter);
/*
* Check all active control blocks across all network stacks and change
- * any that are using "unload_algo" back to NewReno. If "unload_algo"
- * any that are using "unload_algo" back to the vnet default. If "unload_algo"
* requires cleanup code to be run, call it.
*/
VNET_LIST_RLOCK();
@@ -2272,6 +2271,7 @@
* therefore don't enter the loop below until the connection
* list has stabilised.
*/
+ newalgo = CC_DEFAULT_ALGO();
CK_LIST_FOREACH(inp, &V_tcb, inp_list) {
INP_WLOCK(inp);
/* Important to skip tcptw structs. */
@@ -2280,24 +2280,48 @@
/*
* By holding INP_WLOCK here, we are assured
* that the connection is not currently
- * executing inside the CC module's functions
- * i.e. it is safe to make the switch back to
- * NewReno.
+ * executing inside the CC module's functions.
+ * We attempt to switch to the vnet default;
+ * if its init fails then we fail the whole
+ * operation and the module unload will fail.
*/
if (CC_ALGO(tp) == unload_algo) {
- tmpalgo = CC_ALGO(tp);
- if (tmpalgo->cb_destroy != NULL)
- tmpalgo->cb_destroy(tp->ccv);
- CC_DATA(tp) = NULL;
- /*
- * NewReno may allocate memory on
- * demand for certain stateful
- * configuration as needed, but is
- * coded to never fail on memory
- * allocation failure so it is a safe
- * fallback.
- */
- CC_ALGO(tp) = &newreno_cc_algo;
+ struct cc_var cc_mem;
+ int err;
+
+ oldalgo = CC_ALGO(tp);
+ memset(&cc_mem, 0, sizeof(cc_mem));
+ cc_mem.ccvc.tcp = tp;
+ if (newalgo->cb_init == NULL) {
+ /*
+ * No cb_init, so we can skip the
+ * dance around a possible failure.
+ */
+ CC_DATA(tp) = NULL;
+ goto proceed;
+ }
+ err = (newalgo->cb_init)(&cc_mem, NULL);
+ if (err) {
+ /*
+ * Presumably no memory; the caller will
+ * need to try again.
+ */
+ INP_WUNLOCK(inp);
+ INP_INFO_WUNLOCK(&V_tcbinfo);
+ CURVNET_RESTORE();
+ VNET_LIST_RUNLOCK();
+ return (err);
+ }
+proceed:
+ if (oldalgo->cb_destroy != NULL)
+ oldalgo->cb_destroy(tp->ccv);
+ CC_ALGO(tp) = newalgo;
+ memcpy(tp->ccv, &cc_mem, sizeof(struct cc_var));
+ if (TCPS_HAVEESTABLISHED(tp->t_state) &&
+ (CC_ALGO(tp)->conn_init != NULL)) {
+ /* Yep run the connection init for the new CC */
+ CC_ALGO(tp)->conn_init(tp->ccv);
+ }
}
}
INP_WUNLOCK(inp);
@@ -2306,7 +2330,6 @@
CURVNET_RESTORE();
}
VNET_LIST_RUNLOCK();
-
return (0);
}
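The unload path above initialises the replacement module's state into a stack `cc_var` first and only commits (destroy the old module's state, then `memcpy` the stack copy in) once the init has succeeded, so a failed allocation leaves the connection untouched. A hypothetical, lock-free sketch of that commit ordering:

```c
#include <errno.h>
#include <string.h>

struct cc_var_sk {
	void *cc_data;
};

struct cc_algo_sk {
	int  (*cb_init)(struct cc_var_sk *, void *);
	void (*cb_destroy)(struct cc_var_sk *);
};

/* Tiny stand-ins for exercising cc_switch(). */
static int
init_ok(struct cc_var_sk *v, void *p)
{
	(void)p;
	v->cc_data = (void *)&v->cc_data;	/* any non-NULL marker */
	return (0);
}

static int
init_fail(struct cc_var_sk *v, void *p)
{
	(void)v;
	(void)p;
	return (ENOMEM);
}

/* Switch ccv from oldalgo to newalgo without ever leaving it in a
 * half-initialised state: init into stack memory, then commit by copy. */
static int
cc_switch(struct cc_var_sk *ccv, const struct cc_algo_sk *oldalgo,
    const struct cc_algo_sk *newalgo)
{
	struct cc_var_sk cc_mem;
	int err;

	memset(&cc_mem, 0, sizeof(cc_mem));
	if (newalgo->cb_init != NULL) {
		err = newalgo->cb_init(&cc_mem, NULL);
		if (err != 0)
			return (err);	/* old state still intact */
	}
	if (oldalgo->cb_destroy != NULL)
		oldalgo->cb_destroy(ccv);
	memcpy(ccv, &cc_mem, sizeof(cc_mem));
	return (0);
}
```

On failure the caller (here, the unload loop) backs the whole operation out, which is why `tcp_ccalgounload()` now returns the error instead of always succeeding.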
diff --git a/sys/netinet/tcp_usrreq.c b/sys/netinet/tcp_usrreq.c
--- a/sys/netinet/tcp_usrreq.c
+++ b/sys/netinet/tcp_usrreq.c
@@ -2007,6 +2007,115 @@
}
#endif
+extern struct cc_algo newreno_cc_algo;
+
+static int
+tcp_congestion(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp)
+{
+ struct cc_algo *algo;
+ void *ptr = NULL;
+ struct cc_var cc_mem;
+ char buf[TCP_CA_NAME_MAX];
+ size_t mem_sz;
+ int error;
+
+ INP_WUNLOCK(inp);
+ error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX - 1, 1);
+ if (error)
+ return (error);
+ buf[sopt->sopt_valsize] = '\0';
+ CC_LIST_RLOCK();
+ STAILQ_FOREACH(algo, &cc_list, entries)
+ if (strncmp(buf, algo->name,
+ TCP_CA_NAME_MAX) == 0) {
+ if (algo->flags & CC_MODULE_BEING_REMOVED) {
+ /* We can't "see" modules being unloaded */
+ continue;
+ }
+ break;
+ }
+ if (algo == NULL) {
+ CC_LIST_RUNLOCK();
+ return (ESRCH);
+ }
+do_over:
+ if (algo->cb_init != NULL) {
+ /* We can now pre-get the memory for the CC */
+ mem_sz = (*algo->cc_data_sz)();
+ if (mem_sz == 0) {
+ goto no_mem_needed;
+ }
+ CC_LIST_RUNLOCK();
+ ptr = malloc(mem_sz, M_CC_MEM, M_WAITOK);
+ CC_LIST_RLOCK();
+ STAILQ_FOREACH(algo, &cc_list, entries)
+ if (strncmp(buf, algo->name,
+ TCP_CA_NAME_MAX) == 0)
+ break;
+ if (algo == NULL) {
+ if (ptr)
+ free(ptr, M_CC_MEM);
+ CC_LIST_RUNLOCK();
+ return (ESRCH);
+ }
+ } else {
+no_mem_needed:
+ mem_sz = 0;
+ ptr = NULL;
+ }
+ /*
+ * Make sure it's all clean and zeroed and also get
+ * back the inp lock.
+ */
+ memset(&cc_mem, 0, sizeof(cc_mem));
+ if (mem_sz != (*algo->cc_data_sz)()) {
+ if (ptr)
+ free(ptr, M_CC_MEM);
+ goto do_over;
+ }
+ if (ptr) {
+ memset(ptr, 0, mem_sz);
+ INP_WLOCK_RECHECK_CLEANUP(inp, free(ptr, M_CC_MEM));
+ } else
+ INP_WLOCK_RECHECK(inp);
+ CC_LIST_RUNLOCK();
+ cc_mem.ccvc.tcp = tp;
+ /*
+ * We once again hold a write lock over the tcb so it's
+ * safe to do these things without ordering concerns.
+ * Note here we init into stack memory.
+ */
+ if (algo->cb_init != NULL)
+ error = algo->cb_init(&cc_mem, ptr);
+ else
+ error = 0;
+ /*
+ * The CC algorithms, when given their memory,
+ * should not fail; we could in theory have a
+ * KASSERT here.
+ */
+ if (error == 0) {
+ /*
+ * Touchdown, let's go ahead and move the
+ * connection to the new CC module by
+ * copying in the cc_mem after we call
+ * the old one's cleanup (if any).
+ */
+ if (CC_ALGO(tp)->cb_destroy != NULL)
+ CC_ALGO(tp)->cb_destroy(tp->ccv);
+ memcpy(tp->ccv, &cc_mem, sizeof(struct cc_var));
+ tp->cc_algo = algo;
+ /* Are we already past where conn_init would have run? */
+ if (TCPS_HAVEESTABLISHED(tp->t_state) && (CC_ALGO(tp)->conn_init != NULL)) {
+ /* Yep run the connection init for the new CC */
+ CC_ALGO(tp)->conn_init(tp->ccv);
+ }
+ } else if (ptr)
+ free(ptr, M_CC_MEM);
+ INP_WUNLOCK(inp);
+ return (error);
+}
+
int
tcp_default_ctloutput(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp)
{
@@ -2016,7 +2125,6 @@
#ifdef KERN_TLS
struct tls_enable tls;
#endif
- struct cc_algo *algo;
char *pbuf, buf[TCP_LOG_ID_LEN];
#ifdef STATS
struct statsblob *sbp;
@@ -2223,46 +2331,7 @@
break;
case TCP_CONGESTION:
- INP_WUNLOCK(inp);
- error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX - 1, 1);
- if (error)
- break;
- buf[sopt->sopt_valsize] = '\0';
- INP_WLOCK_RECHECK(inp);
- CC_LIST_RLOCK();
- STAILQ_FOREACH(algo, &cc_list, entries)
- if (strncmp(buf, algo->name,
- TCP_CA_NAME_MAX) == 0)
- break;
- CC_LIST_RUNLOCK();
- if (algo == NULL) {
- INP_WUNLOCK(inp);
- error = EINVAL;
- break;
- }
- /*
- * We hold a write lock over the tcb so it's safe to
- * do these things without ordering concerns.
- */
- if (CC_ALGO(tp)->cb_destroy != NULL)
- CC_ALGO(tp)->cb_destroy(tp->ccv);
- CC_DATA(tp) = NULL;
- CC_ALGO(tp) = algo;
- /*
- * If something goes pear shaped initialising the new
- * algo, fall back to newreno (which does not
- * require initialisation).
- */
- if (algo->cb_init != NULL &&
- algo->cb_init(tp->ccv) != 0) {
- CC_ALGO(tp) = &newreno_cc_algo;
- /*
- * The only reason init should fail is
- * because of malloc.
- */
- error = ENOMEM;
- }
- INP_WUNLOCK(inp);
+ error = tcp_congestion(so, sopt, inp, tp);
break;
case TCP_REUSPORT_LB_NUMA:
diff --git a/sys/powerpc/conf/GENERIC b/sys/powerpc/conf/GENERIC
--- a/sys/powerpc/conf/GENERIC
+++ b/sys/powerpc/conf/GENERIC
@@ -38,6 +38,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET #InterNETworking
options INET6 #IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
options TCP_HHOOK # hhook(9) framework for TCP
options TCP_RFC7413 # TCP Fast Open
diff --git a/sys/riscv/conf/GENERIC b/sys/riscv/conf/GENERIC
--- a/sys/riscv/conf/GENERIC
+++ b/sys/riscv/conf/GENERIC
@@ -29,6 +29,8 @@
options VIMAGE # Subsystem virtualization, e.g. VNET
options INET # InterNETworking
options INET6 # IPv6 communications protocols
+options CC_NEWRENO # include newreno congestion control
+options CC_DEFAULT=\"newreno\" # define the default CC module; it must be compiled in.
options TCP_HHOOK # hhook(9) framework for TCP
options IPSEC_SUPPORT # Allow kldload of ipsec and tcpmd5
options ROUTE_MPATH # Multipath routing support
