summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2016-04-05nl80211: add feature for BSS selection supportArend van Spriel2-0/+118
Introducing a new feature that the driver can use to indicate the driver/firmware supports configuration of BSS selection criteria upon CONNECT command. This can be useful when multiple BSS-es are found belonging to the same ESS, ie. Infra-BSS with same SSID. The criteria can then be used to offload selection of a preferred BSS. Reviewed-by: Hante Meuleman <meuleman@broadcom.com> Reviewed-by: Franky (Zhenhui) Lin <frankyl@broadcom.com> Reviewed-by: Pieter-Paul Giesberts <pieterpg@broadcom.com> Reviewed-by: Lei Zhang <leizh@broadcom.com> Signed-off-by: Arend van Spriel <arend@broadcom.com> [move wiphy support check into parse_bss_select()] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: synchronize driver rx queues before removing a stationSara Sharon3-1/+35
Some devices, like iwlwifi, have RSS queues. This may cause a situation where a disassociation is handled in control path and results in station removal while there are prior RX frames that were still not processed in other queues. When they will be processed the station will be gone, and the frames will be dropped. Add a synchronization interface to avoid that. When driver returns from the synchronization mac80211 may remove the station. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: mesh: convert path table to rhashtableBob Copeland4-575/+259
In the time since the mesh path table was implemented as an RCU-traversable, dynamically growing hash table, a generic RCU hashtable implementation was added to the kernel. Switch the mesh path table over to rhashtable to remove some code and also gain some features like automatic shrinking. Cc: Thomas Graf <tgraf@suug.ch> Cc: netdev@vger.kernel.org Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05rhashtable: accept GFP flags in rhashtable_walk_initBob Copeland4-5/+8
In certain cases, the 802.11 mesh pathtable code wants to iterate over all of the entries in the forwarding table from the receive path, which is inside an RCU read-side critical section. Enable walks inside atomic sections by allowing GFP_ATOMIC allocations for the walker state. Change all existing callsites to pass in GFP_KERNEL. Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Bob Copeland <me@bobcopeland.com> [also adjust gfs2/glock.c and rhashtable tests] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: mesh: embed known gates list in struct mesh_pathBob Copeland2-56/+45
The mesh path table uses a struct mesh_node in its hlists in order to support a resizable hash table: the mesh_node provides an indirection to the actual mesh path so that two different bucket lists can point to the same path entry. However, for the known gates list, we don't need this indirection because there is ever only one list. So we can just embed the hlist_node in the mesh path itself, which simplifies things a bit and saves a linear search whenever we need to find an item in the list. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: mesh: factor out common mesh path allocation codeBob Copeland1-23/+28
Remove duplicate code to allocate and initialize a mesh path or mesh proxy path. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: mesh: don't hash sdata in mpath tablesBob Copeland1-10/+8
Now that the sdata pointer is the same for all entries of a path table, hashing it is pointless, so hash only the address. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: mesh: move path tables into if_meshBob Copeland6-115/+104
The mesh path and mesh gate hashtables are global, containing all of the mpaths for every mesh interface, but the paths are all tied logically to a single interface. The common case is just a single mesh interface, so optimize for that by moving the global hashtable into the per-interface struct. Doing so allows us to drop sdata pointer comparisons inside the lookups and also saves a few bytes of BSS and data. Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: Support a scan request for a specific BSSIDJouni Malinen1-1/+3
If the cfg80211 scan trigger operation specifies a single BSSID, use that value instead of the wildcard BSSID in the Probe Request frames. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05cfg80211: Allow a scan request for a specific BSSIDJouni Malinen3-0/+10
This allows scans for a specific BSSID to be optimized by the user space application by requesting the driver to set the Probe Request frame BSSID field (Address 3) to the specified BSSID instead of the wildcard BSSID. This prevents other APs from replying which reduces airtime need and latency in getting the response from the target AP through. This is an optimization and as such, it is acceptable for some of the drivers not to support the mechanism. If not supported, the wildcard BSSID will be used and more responses may be received. Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: allow not sending MIC up from driver for HW cryptoSara Sharon2-14/+17
When HW crypto is used, there's no need for the CCMP/GCMP MIC to be available to mac80211, and the hardware might have removed it already after checking. The MIC is also useless to have when the frame is already decrypted, so allow indicating that it's not present. Since we are running out of bits in mac80211_rx_flags, make the flags field a u64. Signed-off-by: Sara Sharon <sara.sharon@intel.com> Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: allow drivers to report CLOCK_BOOTTIME for scan resultsJohannes Berg1-1/+3
This was requested by Android, and the appropriate cfg80211 API had been added by Dmitry. Support it in mac80211, allowing drivers to provide the timestamp. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05mac80211: parse VHT info in injected framesLorenzo Bianconi1-0/+31
Add VHT radiotap parsing support to ieee80211_parse_tx_radiotap(). That capability has been tested using a d-link dir-860l rev b1 running OpenWrt trunk and mt76 driver Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi83@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05rfkill: Use switch to demux userspace operationsJoão Paulo Rechi Vita1-14/+22
Using a switch to handle different ev.op values in rfkill_fop_write() makes the code easier to extend, as out-of-range values can always be handled by the default case. Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com> [roll in fix for RFKILL_OP_CHANGE from Jouni] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-05wext: unregister_pernet_subsys() on notifier registration failureJohannes Berg1-1/+4
If register_netdevice_notifier() fails (which in practice it can't right now), we should call unregister_pernet_subsys(). Do that. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2016-04-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds23-146/+224
Pull networking fixes from David Miller: 1) Missing device reference in IPSEC input path results in crashes during device unregistration. From Subash Abhinov Kasiviswanathan. 2) Per-queue ISR register writes not being done properly in macb driver, from Cyrille Pitchen. 3) Stats accounting bugs in bcmgenet, from Patri Gynther. 4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from Quentin Armitage. 5) SXGBE driver has off-by-one in probe error paths, from Rasmus Villemoes. 6) Fix race in save/swap/delete options in netfilter ipset, from Vishwanath Pai. 7) Ageing time of bridge not set properly when not operating over a switchdev device. Fix from Haishuang Yan. 8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander Duyck. 9) IPV6 UDP code bumps wrong stats, from Eric Dumazet. 10) FEC driver should only access registers that actually exist on the given chipset, fix from Fabio Estevam. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits) net: mvneta: fix changing MTU when using per-cpu processing stmmac: fix MDIO settings Revert "stmmac: Fix 'eth0: No PHY found' regression" stmmac: fix TX normal DESC net: mvneta: use cache_line_size() to get cacheline size net: mvpp2: use cache_line_size() to get cacheline size net: mvpp2: fix maybe-uninitialized warning tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card rtnl: fix msg size calculation in if_nlmsg_size() fec: Do not access unexisting register in Coldfire net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES net: dsa: mv88e6xxx: Clear the PDOWN bit on setup net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write} bpf: make padding in bpf_tunnel_key explicit ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates bnxt_en: Fix ethtool -a reporting. bnxt_en: Fix typo in bnxt_hwrm_set_pause_common(). bnxt_en: Implement proper firmware message padding. ...
2016-04-01tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filterDaniel Borkmann1-12/+21
Sasha Levin reported a suspicious rcu_dereference_protected() warning found while fuzzing with trinity that is similar to this one: [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage! [ 52.765688] other info that might help us debug this: [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1 [ 52.765701] 1 lock held by a.out/1525: [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20 [ 52.765721] stack backtrace: [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264 [...] [ 52.765768] Call Trace: [ 52.765775] [<ffffffff813e488d>] dump_stack+0x85/0xc8 [ 52.765784] [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110 [ 52.765792] [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90 [ 52.765801] [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun] [ 52.765810] [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun] [ 52.765818] [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210 [ 52.765827] [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun] [ 52.765834] [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690 [ 52.765843] [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60 [ 52.765850] [<ffffffff81261519>] SyS_ioctl+0x79/0x90 [ 52.765858] [<ffffffff81003ba2>] do_syscall_64+0x62/0x140 [ 52.765866] [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25 Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH, DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices. Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu fixes") sk_attach_filter()/sk_detach_filter() now dereferences the filter with rcu_dereference_protected(), checking whether socket lock is held in control path. Since its introduction in 994051625981 ("tun: socket filter support"), tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the sock_owned_by_user(sk) doesn't apply in this specific case and therefore triggers the false positive. Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair that is used by tap filters and pass in lockdep_rtnl_is_held() for the rcu_dereference_protected() checks instead. Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-31rtnl: fix msg size calculation in if_nlmsg_size()Nicolas Dichtel1-0/+1
Size of the attribute IFLA_PHYS_PORT_NAME was missing. Fixes: db24a9044ee1 ("net: add support for phys_port_name") CC: David Ahern <dsahern@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30bpf: make padding in bpf_tunnel_key explicitDaniel Borkmann1-1/+4
Make the 2 byte padding in struct bpf_tunnel_key between tunnel_ttl and tunnel_label members explicit. No issue has been observed, and gcc/llvm does padding for the old struct already, where tunnel_label was not yet present, so the current code works, but since it's part of uapi, make sure we don't introduce holes in structs. Therefore, add tunnel_ext that we can use generically in future (f.e. to flag OAM messages for backends, etc). Also add the offset to the compat tests to be sure should some compilers not padd the tail of the old version of bpf_tunnel_key. Fixes: 4018ab1875e0 ("bpf: support flow label for bpf_skb_{set, get}_tunnel_key") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30ipv6: udp: fix UDP_MIB_IGNOREDMULTI updatesEric Dumazet1-2/+2
IPv6 counters updates use a different macro than IPv4. Fixes: 36cbb2452cbaf ("udp: Increment UDP_MIB_IGNOREDMULTI for arriving unmatched multicasts") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Rick Jones <rick.jones2@hp.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30gro: Allow tunnel stacking in the case of FOU/GUEAlexander Duyck1-0/+16
This patch should fix the issues seen with a recent fix to prevent tunnel-in-tunnel frames from being generated with GRO. The fix itself is correct for now as long as we do not add any devices that support NETIF_F_GSO_GRE_CSUM. When such a device is added it could have the potential to mess things up due to the fact that the outer transport header points to the outer UDP header and not the GRE header as would be expected. Fixes: fac8e0f579695 ("tunnels: Don't apply GRO to multiple layers of encapsulation.") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30sctp: really allow using GFP_KERNEL on sctp_packet_transmitMarcelo Ricardo Leitner1-3/+3
Somehow my patch for commit cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp") missed two important chunks, which are now added. Fixes: cea8768f333e ("sctp: allow sctp_transmit_packet and others to use gfp") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-By: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-30bridge: Allow set bridge ageing time when switchdev disabledHaishuang Yan1-1/+1
When NET_SWITCHDEV=n, switchdev_port_attr_set will return -EOPNOTSUPP, we should ignore this error code and continue to set the ageing time. Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev") Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller13-122/+166
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for you net tree, they are: 1) There was a race condition between parallel save/swap and delete, which resulted a kernel crash due to the increase ref for save, swap, wrong ref decrease operations. Reported and fixed by Vishwanath Pai. 2) OVS should call into CT NAT for packets of new expected connections only when the conntrack state is persisted with the 'commit' option to the OVS CT action. From Jarno Rajahalme. 3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann. 4) Early validation of entry->target_offset to make sure it doesn't take us out from the blob, from Florian Westphal. 5) Again early validation of entry->next_offset to make sure it doesn't take out from the blob, also from Florian. 6) Check that entry->target_offset is always of of sizeof(struct xt_entry) for unconditional entries, when checking both from check_underflow() and when checking for loops in mark_source_chains(), again from Florian. 7) Fix inconsistent behaviour in nfnetlink_queue when NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer overrun, we have to reinject the packet as the user expects. 8) Enforce nul-terminated table names from getsockopt GET_ENTRIES requests. 9) Don't assume skb->sk is set from nft_bridge_reject and synproxy, this fixes a recent update of the code to namespaceify ip_default_ttl, patch from Liping Zhang. This batch comes with four patches to validate x_tables blobs coming from userspace. CONFIG_USERNS exposes the x_tables interface to unpriviledged users and to be honest this interface never received the attention for this move away from the CAP_NET_ADMIN domain. Florian is working on another round with more patches with more sanity checks, so expect a bit more Netfilter fixes in this development cycle than usual. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-28netfilter: ipv4: fix NULL dereferenceLiping Zhang2-36/+38
Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob") use sock_net(skb->sk) to get the net namespace, but we can't assume that sk_buff->sk is always exist, so when it is NULL, oops will happen. Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: enforce nul-terminated table name from getsockopt ↵Pablo Neira Ayuso4-0/+10
GET_ENTRIES Make sure the table names via getsockopt GET_ENTRIES is nul-terminated in ebtables and all the x_tables variants and their respective compat code. Uncovered by KASAN. Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast ↵Pablo Neira Ayuso1-1/+6
fails When netlink unicast fails to deliver the message to userspace, we should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject the packet back to the stack. I think the user expects no packet drops when this flag is set due to queueing to userspace errors, no matter if related to the internal queue or when sending the netlink message to userspace. The userspace application will still get the ENOBUFS error via recvmsg() so the user still knows that, with the current configuration that is in place, the userspace application is not consuming the messages at the pace that the kernel needs. Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
2016-03-28netfilter: x_tables: fix unconditional helperFlorian Westphal3-33/+31
Ben Hawkes says: In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it is possible for a user-supplied ipt_entry structure to have a large next_offset field. This field is not bounds checked prior to writing a counter value at the supplied offset. Problem is that mark_source_chains should not have been called -- the rule doesn't have a next entry, so its supposed to return an absolute verdict of either ACCEPT or DROP. However, the function conditional() doesn't work as the name implies. It only checks that the rule is using wildcard address matching. However, an unconditional rule must also not be using any matches (no -m args). The underflow validator only checked the addresses, therefore passing the 'unconditional absolute verdict' test, while mark_source_chains also tested for presence of matches, and thus proceeeded to the next (not-existent) rule. Unify this so that all the callers have same idea of 'unconditional rule'. Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: make sure e->next_offset covers remaining blob sizeFlorian Westphal3-6/+12
Otherwise this function may read data beyond the ruleset blob. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: x_tables: validate e->target_offset earlyFlorian Westphal3-27/+24
We should check that e->target_offset is sane before mark_source_chains gets called since it will fetch the target entry for loop detection. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28openvswitch: call only into reachable nf-nat codeArnd Bergmann2-9/+11
The openvswitch code has gained support for calling into the nf-nat-ipv4/ipv6 modules, however those can be loadable modules in a configuration in which openvswitch is built-in, leading to link errors: net/built-in.o: In function `__ovs_ct_lookup': :(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation' :(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation' The dependency on (!NF_NAT || NF_NAT) prevents similar issues, but NF_NAT is set to 'y' if any of the symbols selecting it are built-in, but the link error happens when any of them are modular. A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in, CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely to be useful in practice, but the driver currently only handles IPv6 being optional. This patch improves the Kconfig dependency so that openvswitch cannot be built-in if either of the two other symbols are set to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute() with two "if (IS_ENABLED())" checks that should catch all corner cases also make the code more readable. The same #ifdef exists ovs_ct_nat_to_attr(), where it does not cause a link error, but for consistency I'm changing it the same way. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 05752523e565 ("openvswitch: Interface with NAT.") Acked-by: Joe Stringer <joe@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28openvswitch: Fix checking for new expected connections.Jarno Rajahalme1-2/+3
OVS should call into CT NAT for packets of new expected connections only when the conntrack state is persisted with the 'commit' option to the OVS CT action. The test for this condition is doubly wrong, as the CT status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather than the mask (IPS_EXPECTED), and due to the wrong assumption that the expected bit would apply only for the first (i.e., 'new') packet of a connection, while in fact the expected bit remains on for the lifetime of an expected connection. The 'ctinfo' value IP_CT_RELATED derived from the ct status can be used instead, as it is only ever applicable to the 'new' packets of the expected connection. Fixes: 05752523e565 ('openvswitch: Interface with NAT.') Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28netfilter: ipset: fix race condition in ipset save, swap and deleteVishwanath Pai4-8/+31
This fix adds a new reference counter (ref_netlink) for the struct ip_set. The other reference counter (ref) can be swapped out by ip_set_swap and we need a separate counter to keep track of references for netlink events like dump. Using the same ref counter for dump causes a race condition which can be demonstrated by the following script: ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \ counters ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \ counters ipset save & ipset swap hash_ip3 hash_ip2 ipset destroy hash_ip3 /* will crash the machine */ Swap will exchange the values of ref so destroy will see ref = 0 instead of ref = 1. With this fix in place swap will not succeed because ipset save still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink). Both delete and swap will error out if ref_netlink != 0 on the set. Note: The changes to *_head functions is because previously we would increment ref whenever we called these functions, we don't do that anymore. Reviewed-by: Joshua Hunt <johunt@akamai.com> Signed-off-by: Vishwanath Pai <vpai@akamai.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-03-28openvswitch: Use proper buffer size in nla_memcpyHaishuang Yan1-1/+2
For the input parameter count, it's better to use the size of destination buffer size, as nla_memcpy would take into account the length of the source netlink attribute when a data is copied from an attribute. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-27Fix returned tc and hoplimit values for route with IPv6 encapsulationQuentin Armitage1-2/+2
For a route with IPv6 encapsulation, the traffic class and hop limit values are interchanged when returned to userspace by the kernel. For example, see below. ># ip route add 192.168.0.1 dev eth0.2 encap ip6 dst 0x50 tc 0x50 hoplimit 100 table 1000 ># ip route show table 1000 192.168.0.1 encap ip6 id 0 src :: dst fe83::1 hoplimit 80 tc 100 dev eth0.2 scope link Signed-off-by: Quentin Armitage <quentin@armitage.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-26Merge branch 'for-linus' of ↵Linus Torvalds5-272/+344
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "There is quite a bit here, including some overdue refactoring and cleanup on the mon_client and osd_client code from Ilya, scattered writeback support for CephFS and a pile of bug fixes from Zheng, and a few random cleanups and fixes from others" [ I already decided not to pull this because of it having been rebased recently, but ended up changing my mind after all. Next time I'll really hold people to it. Oh well. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits) libceph: use KMEM_CACHE macro ceph: use kmem_cache_zalloc rbd: use KMEM_CACHE macro ceph: use lookup request to revalidate dentry ceph: kill ceph_get_dentry_parent_inode() ceph: fix security xattr deadlock ceph: don't request vxattrs from MDS ceph: fix mounting same fs multiple times ceph: remove unnecessary NULL check ceph: avoid updating directory inode's i_size accidentally ceph: fix race during filling readdir cache libceph: use sizeof_footer() more ceph: kill ceph_empty_snapc ceph: fix a wrong comparison ceph: replace CURRENT_TIME by current_fs_time() ceph: scattered page writeback libceph: add helper that duplicates last extent operation libceph: enable large, variable-sized OSD requests libceph: osdc->req_mempool should be backed by a slab pool libceph: make r_request msg_size calculation clearer ...
2016-03-25libceph: use KMEM_CACHE macroGeliang Tang1-8/+2
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: use sizeof_footer() moreIlya Dryomov1-16/+3
Don't open-code sizeof_footer() in read_partial_message() and ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now possible to use con_out_kvec_add() in prepare_write_message_footer(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-03-25libceph: add helper that duplicates last extent operationYan, Zheng1-0/+22
This helper duplicates last extent operation in OSD request, then adjusts the new extent operation's offset and length. The helper is for scatterd page writeback, which adds nonconsecutive dirty pages to single OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: enable large, variable-sized OSD requestsIlya Dryomov1-15/+28
Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: osdc->req_mempool should be backed by a slab poolIlya Dryomov1-2/+2
ceph_osd_request_cache was introduced a long time ago. Also, osd_req is about to get a flexible array member, which ceph_osd_request_cache is going to be aware of. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: make r_request msg_size calculation clearerIlya Dryomov1-10/+11
Although msg_size is calculated correctly, the terms are grouped in a misleading way - snaps appears to not have room for a u32 length. Move calculation closer to its use and regroup terms. No functional change. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: move r_reply_op_{len,result} into struct ceph_osd_req_opYan, Zheng1-2/+2
This avoids defining large array of r_reply_op_{len,result} in in struct ceph_osd_request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: rename ceph_osd_req_op::payload_len to indata_lenIlya Dryomov1-6/+6
Follow userspace nomenclature on this - the next commit adds outdata_len. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: behave in mon_fault() if cur_mon < 0Ilya Dryomov1-14/+9
This can happen if __close_session() in ceph_monc_stop() races with a connection reset. We need to ignore such faults, otherwise it's likely we would take !hunting, call __schedule_delayed() and end up with delayed_work() executing on invalid memory, among other things. The (two!) con->private tests are useless, as nothing ever clears con->private. Nuke them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: reschedule tick in mon_fault()Ilya Dryomov1-4/+4
Doing __schedule_delayed() in the hunting branch is pointless, as the tick will have already been scheduled by then. What we need to do instead is *reschedule* it in the !hunting branch, after reopen_session() changes hunt_mult, which affects the delay. This helps with spacing out connection attempts and avoiding things like two back-to-back attempts followed by a longer period of waiting around. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: introduce and switch to reopen_session()Ilya Dryomov1-17/+16
hunting is now set in __open_session() and cleared in finish_hunting(), instead of all around. The "session lost" message is printed not only on connection resets, but also on keepalive timeouts. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: monc hunt rate is 3s with backoff up to 30sIlya Dryomov1-9/+16
Unless we are in the process of setting up a client (i.e. connecting to the monitor cluster for the first time), apply a backoff: every time we want to reopen a session, increase our timeout by a multiple (currently 2); when we complete the connection, reduce that multipler by 50%. Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: monc ping rate is 10sIlya Dryomov2-7/+2
Split ping interval and ping timeout: ping interval is 10s; keepalive timeout is 30s. Make monc_ping_timeout a constant while at it - it's not actually exported as a mount option (and the rest of tick-related settings won't be either), so it's got no place in ceph_options. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: pick a different monitor when reconnectingIlya Dryomov1-28/+57
Don't try to reconnect to the same monitor when we fail to establish a session within a timeout or it's lost. For that, pick_new_mon() needs to see the old value of cur_mon, so don't clear it in __close_session() - all calls to __close_session() but one are followed by __open_session() anyway. __open_session() is only called when a new session needs to be established, so the "already open?" branch, which is now in the way, is simply dropped. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>