So you think you understand IP fragmentation? (lwn.net)
195 points by kevincox on Feb 17, 2024 | 119 comments


> Worse, fragments are more likely to be lost. Many routers and firewalls treat fragments as a security risk because they don't include the information from higher-level protocols like TCP or UDP and can't be filtered based on port, so they drop all IP fragments.

I've seen worse than that. A firewall dropping the first fragment based on the UDP port number (which is available in the first fragment), but allowing further fragments.

I'd love to see their new discovery algorithm get widely distributed, it's 2024, and a lot of stuff still breaks or suffers terrible delays if I don't apply the proper settings with my 1492 MTU.


That could be an old BSD bug. When there was no ARP entry, the first fragment was dropped before the ARP entry was cached. That bug made it into many BSD-based network software stacks before it was fixed.

I just use 1024 since nobody seems to use SLIP anymore. It should fit under 1200 with headers, and the logic is similar to how DEC arrived at the initial 512+64 limit. All the PMTU detection algos suffer when something lowers the MTU along the route of a long-lasting connection.
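A back-of-the-envelope check of that headroom claim (the header sizes are mine, for illustration - the comment doesn't specify which headers it's budgeting for):

```python
# A 1024-byte payload plus worst-case IPv6 + UDP headers still clears
# the 1200-byte target with room to spare for tunnel overhead.
IPV6_HEADER = 40
UDP_HEADER = 8

total = 1024 + IPV6_HEADER + UDP_HEADER
print(total)         # 1072
print(1200 - total)  # 128 bytes of slack left
```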


This was chargen servers on random Windows boxes hitting me with tons of 64k UDP packets, so they would get fragmented, but I almost never got the first fragment.

This is, of course, nonsense on so many levels, but it was very effective, because my servers at the time had a ridiculously large IP fragment reassembly buffer (autoscaled based on system RAM size, with a formula written decades ago) and reassembly did a linear search of the buffer. No big deal in normal conditions, when we got one or two fragmented packets a minute; immensely terrible when getting a couple Gbps of fragmented chargen reflection that could never be reassembled. Setting the reassembly buffer to the minimum solved the problem; during DDoS attacks, legitimate fragments wouldn't be reassembled, but oh well - so many other services never reassemble fragments, we did the best we could.


That seems like a great way to bypass such a firewall.


it's still usable because most people just don't care


Is there any place for IP fragmentation anymore? At a surface level, the main motivation seems to have been fitting oversized DNSSEC responses in a single datagram, but with ecdsa/eddsa is that still a relevant concern? I realize some TXT etc. data might be big, but presumably those could also be queried over TCP/HTTP/QUIC/etc; I see UDP more as the fast path for common A/AAAA queries, and those should fit in unfragmented packets just fine?

What are other cases where fragmentation truly makes sense?


There has never really been a place for it; "Fragmentation Considered Harmful" is one of the original and most famous "Considered Harmfuls", and it's from the late 1980s. A lot of protocol engineering, from MSS options to EDNS0, goes into ensuring that you never hit the fragmentation case in the first place. It's been kept around as a mechanism of last resort for weirdo hops with bizarro MTUs.


I’m surprised by the willful disregard of the feature, e.g., with some routers just dropping fragmented packets. Sending packets bigger than the MTU seems like a reasonable thing to want to do—especially given that IP by design won’t guarantee a stable path between endpoints. Was the reasoning that it’s always better handled at a higher protocol layer?


Yes, that's the reasoning. Fragmentation is kind of a performative half-measure. You're not really enabling delivery across varying network links, in that the performance is so bad (necessarily!) that it alters the service model for many protocols.

IPv6 moves to an end-to-end fragmentation model, but even then the right answer is to negotiate in a higher level protocol a maximum segment size for the path you're talking on, and then just avoid fragmentation entirely. Fragmentation is an absolutely wretched stream transport protocol!


the messy challenge here is that the trade-offs have shifted, but the contrarian arguments persist. Packets are too small to be efficient without excessively tight loop optimizations, which is what makes fragmentation "slow" (it's not such a big slowdown in a simple software implementation; it's a disaster in a device optimized for the common case). On the other side, in order to move away from those excessively optimized systems while still delivering on user demand for higher throughput, we need larger payloads - but we can't get to larger payloads without transparent solutions for fixup (ICMP, fragmentation, etc.). Sprinkle in some stuff that has never been formally fixed (the bad parts of ICMP), and it's an ongoing recipe for ossified non-progress.


Fragmentation isn't slow because routers are bad at fragmenting (though: they are); it's slow because the loss of any one fragment forces the discarding of the whole packet. Because you can't possibly make a transport protocol as dumb as fragmentation reassembly fast, forwarders don't optimize it.


The picture in the case of losses is pretty complicated with delayed ACKs, SACK, congestion control, and so on. If loss is high enough that this is a main concern, TCP performance is generally shot already.


I'm not a network engineer, but somehow I imagine that for at least some links fragmentation could be also handled transparently at lower layer, i.e. limit fragmentation to single hops and always reassemble packets before passing them on.


It is. In my opinion that's what you should do if you can't provide 1500 MTU to your users directly.


How is fragmenting and reassembling a better option than just communicating what the real MTU is so higher-layer protocols can adjust to it? What is the extra mechanism buying you?


At this point lots of things don't work right otherwise, and assuming your users want to run wireguard over this link that takes them down to 1420, and if you're already at 1420 it takes them down to 1340... Anyway, you can probably use a lot less header overhead if you do it at a lower layer too.
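The nesting arithmetic can be sketched out (the uniform 80-byte step that matches the 1500 → 1420 → 1340 chain is an assumption about WireGuard-over-IPv6 overhead, used here purely for illustration):

```python
# Each tunnel layer eats a fixed chunk of the MTU: outer IP header + UDP
# header + WireGuard framing. 80 bytes is the assumed per-layer cost.
WG_OVERHEAD = 80

def inner_mtu(outer_mtu: int, layers: int = 1) -> int:
    """MTU left for the innermost payload after nesting `layers` tunnels."""
    return outer_mtu - layers * WG_OVERHEAD

print(inner_mtu(1500, 1))  # 1420
print(inner_mtu(1500, 2))  # 1340
```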


Automotive ethernet, particularly for high-data-rate sensors like LiDAR or radar in autonomous vehicles, has a predefined 1500-byte fragment size and then expects fragmentation of larger datagrams to be done in sensor firmware. This keeps huge packets from clogging the routers and prevents time-critical messages from waiting in a queue too long while a large message sends.


But why do it on IP level instead of higher level like everyone else is doing it?


Why do it at the high level if the protocol supports it? Especially in an automotive system where you don't need to worry about rogue configurations as much.


quite the opposite:

CAN Injection: keyless car theft

https://kentindell.github.io/2023/04/03/can-injection/


I don't follow.

I meant why not leverage fragmentation at the routing+transport protocol layer if configs can be controlled to support it and not drop fragmented packets.

What advantage is there doing it at the application layer?

I'm aware that all the layers can be messed with by rogue malicious actors.

Though IIRC CAN doesn't really have a transport or routing layer; the ARB IDs are baked into the physical protocol, which is very cool, but it's fundamentally a bus architecture, and I'm not sure how it would apply to current thread.


that concern with these numbers is very surprising, do you know if there’s somewhere with writing about this?


No, unfortunately I only learned about it while debugging a very annoying issue that resulted from mixing automotive modules with off-the-shelf, data-center-type managed switches. I had to consult one of the system design technical leads, so I only got what detail he provided, but the intuition matches some of what you can get from understanding CAN.


I’m just a little surprised as economies of scale have pushed support for larger frames into most of the hardware you can purchase. I can absolutely believe somebody saying this, but intuition is telling me that it’s a misdiagnosed problem - if it isn’t there’s something interesting I’d love to hear about!


That's for throughput, not latency; in automotive/realtime systems you care about the latter and only slightly about the former. The real piece is that the modules themselves have to do the fragmentation, so their send queue can reorder things since each message is "pre"-broken up. Something that has a lot of sensor data plus some keep-alive signals and some safety-critical signals can interleave the safety and keep-alive messages with the fragmented raw data by prioritizing at the send-queue level, whereas if the data were fragmented at the protocol level, that could clog the queue when something even more time-critical should be going out.
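A toy model of that send-queue interleaving (the priorities and message names are invented for illustration):

```python
import heapq

# Because the sensor data is pre-fragmented by the firmware, a
# safety-critical message only waits behind at most one small fragment,
# not behind one huge un-split datagram.
q = []
for i in range(3):
    heapq.heappush(q, (10, f"lidar-frag-{i}"))  # low priority, many fragments
heapq.heappush(q, (0, "safety-critical"))       # highest priority
heapq.heappush(q, (5, "keepalive"))

order = [heapq.heappop(q)[1] for _ in range(len(q))]
print(order)  # safety-critical drains first, then keepalive, then fragments
```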


in a lot of these systems I interact with they're limited by packet rate more than packet size. I'm not disputing latency sensitivity in the application, I'm surprised that frame size in the normal range has measured to introduce the kinds of problems you're describing. The cost of per-packet parsing and reassembly is typically a higher proportion than would lead to a need to strictly pin to 1500 byte ethernet frames.

now if you have one system trying to send really oversized frames constantly, creating lots of fragmentation, and you have limited sized buffers, I could see a situation where the fragments are too regularly eating up buffer space, but that's even a slightly different problem too. I would expect inexpensive hardware to be able to handle 9k packets without introducing queue delay.

I heard upcoming networks are heading over 10 Gbps, which means their packet processing capabilities are necessarily well over 1 Mpps, more than sufficient for latency concerns in the kinds of control systems hanging off of this.

Looking up some basic (possibly inaccurate) specs for the kinds of lidar used, frame samples won't fit in a packet, so we're talking in the range of 500 packets at 1500 bytes, or less than 100 packets at 9k, at around 30hz per sensor?

Clearly the deployed system works, we're dealing with 1500 bytes in a lot of places where it being the standard introduces a lot of inefficiency/hard optimization work, so it's hardly abnormal. As I said originally, I'm curious about what the factors are, they seem surprising.


I suspect, but cannot confirm or even really look up, that they cheaped out on some of the switch infrastructure in the middle. Also, at least some of the radars I dealt with produced object data and that meant that they had a defined "frame" size of 65536, even though it would often be mostly zeros, so it had to do fragmentation no matter how you set up jumbo frames and that needed to be handled at the sensor firmware level. There's also some tradeoffs inherent in gigabit speed T1 automotive ethernet and some stuff that was defined at the mac layer rather than the packet layer. Happy to chat more about it, feel free to look up my email in my profile.


> with ecdsa/eddsa is that still a relevant concern

Perhaps not, but with quantum-safe signature algorithms on the horizon, signatures are about to get much longer again.


WireGuard!

I had multiple issues with WireGuard related to MTU/fragmentation. Once, a service reported that a system was pingable, but the management website didn't load. I was able to SSH into it, but similarly, the connection broke every time I tried to run `ip addr` or view logs. Commands with short output worked, however - apparently, somewhere on the path the packets got fragmented and/or dropped without notice.


This is going to be true for any VPN - even if it were purely IP in IP encapsulation, you'd need an additional IP header to get the packet to the VPN endpoint, so your payload is smaller. But fragmentation probably isn't the right answer (especially since any packets marked DF will be dropped). If the MTU for your Wireguard interface is set correctly then anything trying to push a 1500 byte packet via a Wireguard interface should get back an ICMP packet telling it that that won't fit and adjust its packet size downwards appropriately.

I actually hit this recently, where I wasn't able to stream videos from certain sites over Wireguard. I spent a while with tcpdump and figured out that they were using Fastly as a CDN: Fastly was sending 1500-byte packets marked with Don't Fragment, my Wireguard endpoint was returning a message saying that the maximum packet size for the link was 1460 bytes, and Fastly was then… sending another 1500-byte packet marked Don't Fragment. To their credit, when I was able to provide their engineering with logs showing this was clearly a Fastly problem, they fixed it fairly promptly.


In my case, the application was using the Wireguard MTU correctly, but the Wireguard packet itself was too big for something on the path. So, the Wireguard MTU wasn't set correctly, basically. But Wireguard packets aren't marked DF, so one would expect a worse, but still working connection in that case. However, that something on the path seemingly just dropped the packets without ICMP notice.


Did they fix it just for you or for everyone? A proposed project at work involves using Fastly as CDN and I'm curious if this is something we need to keep in mind.


Everyone, as far as I know


Very interesting, I bet this had come up before but remained undiagnosed, yay for packet captures


> To their credit when I was able to provide their engineering with logs showing this was clearly a Fastly problem, they fixed it fairly promptly

regarding credit; did you get paid for doing their homework?


If your path requires fragmentation to get to the wider world, there's a good chance it simply won't work. There's a small chance that it will work with some destinations. I don't think there's really a place for it in that sense.

Maybe for UDP for protocols where the designers were really optimistic. IMHO, for DNS if you want it to work as much as possible, your queries and responses should fit in 512 bytes, or some people are going to have problems on some networks. DNS over TCP doesn't work everywhere, even if it's supposed to.

It's more palatable to require DNS over TCP or working UDP over fragmented IP if it's for big txt records like for mail, but if your A and/or AAAA records don't fit in 512 bytes, expect lingering problems that are hard to track down.
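A rough budget check of that 512-byte claim (the per-record sizes are my approximations assuming DNS name compression, not figures from the comment):

```python
# Will a response with N address records fit in the classic 512-byte
# UDP DNS limit? Sizes are approximate: each record uses a 2-byte
# compressed-name pointer plus type/class/TTL/rdlength and the address.
HEADER = 12
QUESTION = 32   # assumed qname plus type/class
A_RR = 16       # 2 + 10 + 4-byte IPv4 address
AAAA_RR = 28    # 2 + 10 + 16-byte IPv6 address

def fits(n_a: int, n_aaaa: int) -> bool:
    return HEADER + QUESTION + n_a * A_RR + n_aaaa * AAAA_RR <= 512

print(fits(8, 8))    # True - plenty of headroom
print(fits(20, 10))  # False - over the 512-byte limit
```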


well, if you read the article, the issue isn't really fragmentation as a mechanism, it's finding the appropriate (dynamic) path MTU. That's kinda gonna be a problem forever given the stateless nature of IP. It would be fine if people hadn't decided that ICMP was .. undesirable.

I guess you could just say '1480 for all time' and be done with it


If I could go back in time, I would replace fragmentation with in-flight truncation. The receiver sees the exact MTU and can communicate it back to the sender as needed.


Say the active L2 path between two routers was:

  router A <-> switch B <-> switch C <-> router D
Router A still needs to have a reliable, actively updating, way to detect the path MTU to Router D because something could change and the path turns into:

  router A <-> switch E <-> switch C <-> router D
where switch E has a different MTU than the other equipment. Now you could go even further back and say the same happens for anything IP transits on top of but then the folks at Xerox PARC would probably wonder why the hell you're trying to make their hubs so damn smart and expensive and the people after them would probably wonder why the hell you're telling them to throw backwards compatibility with all the previous gear out the window in the name of making MTU slightly simpler.

It's a decent solution; I think you just have the time backwards. It's something we should look to do as we move towards the future, where the overhead of such logic is tiny to add to networking, rather than something to wish we had in the past, when even dumb hubs were expensively complicated.

This does leave one case unhandled though: you end up with a low MTU path at some point, things update so your traffic is going over a high MTU path, you still need to have some algorithm to discover this or the connection never recovers the performance.


This discussion is about truncating at L3, not L2; it assumes that the L2 MTU is fixed and known by every directly connected router, but a L3 packet transits over several L2 networks, each with its own MTU. A L2 network with variable MTU (or a MTU too small for Internet packets, like ATM) has its own independent segmentation and reassembly logic.


Well that's the thing, for the real world the L3's MTU is limited by the L2 MTU. If you can guarantee L2 MTU will be fixed and known for every device along the path then there is not really a need for a variable L3 MTU anymore and you've solved the problem by convincing everyone in the world they'll never need or want a different MTU again. Then someone gets the bright idea they can do something with tunnels and it starts back up.

If you want a world where you can rely on an L2 to have a way to handle the MTU mismatch, then nothing is being solved by this other than the ability to say "fragmentation is not in the current abstraction layer anymore, it's now an annoying problem on the layer below it instead" :).


> If you can guarantee L2 MTU will be fixed and known for every device along the path then there is not really a need for a variable L3 MTU anymore

The L2 MTU is fixed and known for every device on that same L2 network. The Internet is a network of networks; the reason we need to find the path MTU on the L3 network is that a single packet transits over several L2 networks, and while each L2 network has a single MTU, the L3 path can vary.


The different L3 networks are connected by L2 networks in-between them, not just behind them. L2 doesn't disappear once you hit the router, you've still got to reach your peer's IP over L1/L2 somehow at some point. A great deal of internet peering is not even a L3 box <- direct L1+L2 connection -> L3 box connection, it's dumb L2 transport which provides the IP connectivity path for the BGP session to a more centralized router. Sometimes the path is a pseudowire doing the same with a functional MTU lower than the switched/routed path it rides on too.

These underlying paths are not always static either, just because you have 9000 today doesn't mean when the path fails to a backup alternate tomorrow the MTU will be 9000. You have to get everyone involved on the internet to agree all links should now be 9000, now you can reliably set your router's outbound link to 9000 in all cases and rely on L3 fragmentation. Until someone wants to set their MTU to 12000 :). Even when I've had paid WAN transport with contracted MTU the MTU has lowered during carrier maintenance like firmware upgrades or unit replacements and I've had to call them up saying the MTU is broken because something like NFS servers will think they can statically set a known MTU on the path and it'll stay that way, a routed neighbor on the path would make the same error in this example.


At the time of sending the L2 packet, the router must know the MTU of the link it's sending it on. This is the case with router-based fragmentation, otherwise it wouldn't fragment. This is also the case where the router drops packets that are too big - otherwise what would it do? (Assuming you don't build routers where all MTUs on all interfaces are required to be the same.)

The proposal is that in the case that the router is forwarding a packet which won't fit in the egress MTU, it truncates it, rather than drops it. Hopefully maybe with a truncated bit (maybe we can reuse the evil bit for this). If a peer receives a truncated packet, which it should know because either the evil bit is set, or the received length doesn't match the length in the IP header, that would be processed depending on the protocol. For TCP, the receiver should send a duplicate ack with a newly specified option that indicates the length of the truncated packet it received; then the other peer would know the maximum length packet it could send. For UDP, the truncated packet would probably need to be sent to the application and it could figure out how to notify the peer to send shorter packets (and perhaps process the received truncated packet).

Had this been specified in the before times, it would be reasonable to implement, and path MTU detection would work a lot better now. Some routers may still have been built that drop rather than truncate, but it's a lot more reasonable to send a smaller packet in some circumstances than to have the router fragment things (needs to send two packets sometimes instead of one) or to send a return packet (needs to create a packet and start it back through the routing process again).

Additionally, all of the signalling is in-band, which makes it harder to lose the MTU signals when (small) data is otherwise flowing.


I already described the problem with assuming "the router must know the MTU of the link it's sending it on" implies guaranteed delivery across a switched path in a comment higher in the chain:

router A <-> switch E <-> switch C <-> router D

Say router A and switch E have a 9000-byte MTU but switch C has an MTU of 1500 bytes. Router A gets a 12000-byte IP packet and truncates it to 9000 bytes so it can egress to switch E, the designated path to its routed peer D. Switch E accepts this 9000-byte packet and bridges it to switch C. Switch C only knows how to accept 1500-byte packets, so it drops the 9000-byte packet. No truncated IP packet is ever delivered and the MTU is never learned.

This scenario, and a few others, are already noted in the RFC mentioned in the article https://datatracker.ietf.org/doc/html/rfc8899 as problems classically run into when trying to do PMTUD. Your browser based PMTUD test relies on these principles when it sends packets of various size, that's the only foolproof PMTUD method there is - truncating at egress on only routed hops does not provide foolproof IP delivery across the internet so cannot provide foolproof PMTUD.

Even when you can take the liberty to 100% guarantee A, C, E, and D will all support an MTU of 9000 bytes you can't guarantee nowhere else on your path across the internet will have the above issue on their switched paths between routers. Or that your packet won't go through a non-IP protocol tunnel as it goes through the internet and get dropped as too big in that encapsulation. Truncation is only 100% reliable when you're already 100% certain the real physical path across the internet can handle that size, which in practice requires knowing the real dynamic PMTU across multiple parties i.e. having PMTUD already.


> Switch E accepts this 9000 byte packet and bridges to switch C. Switch C only knows how to accept 1500 byte packets and so drops the 9000 byte packet.

I guess if you have this situation, you'd have the same problem we have today, which is that a too-big packet gets dropped with no notification. But systems today don't tend to handle that well either. In practice today, people usually don't set up a mixed-MTU broadcast domain, so I would expect a router between different networks to know the MTU of the next link. MTU (or really MRU) could have been put into ARP, if mixed-MTU broadcast domains were expected, as well.

Otoh, there's a lot of situations where the router knows a route has a lower MTU, and could easily truncate, even if it can't easily fragment or drop and send an ICMP backwards. Truncation would be better than those two options.

And it might be easier to convince people to properly configure next hop MTU when it would enable truncation that would actually work. Instead of today where properly setting next hop MTU enables slightly earlier dropping of oversized packets because DF is almost (but not quite) universal and router fragmentation is disabled or rate limited and ICMP sending is disabled, rate limited, firewalled, or the router doesn't have a reasonable sending address.

In conclusion, while I agree truncation wouldn't handle every case, it would handle every case that router-based fragmentation could handle, and some more cases, and I think it would have put us in a better place than we are today.


Why would that be preferable over having routers just drop oversized packets and send icmp message to sender?


1. ICMP messages often don't reliably make it back to the sender. This is especially true with low level load balancers.

2. You at least get some information. Maybe half a packet is better than no packet.

3. You get an exact size limit of the entire path the packet took. Rather than just a "too big" of the first hop that wouldn't pass it.

Overall I think it is a pretty decent solution. Assuming that the IP header says the original size or similar so that the truncation can be easily detected.

I do see some flaws though:

1. Layered protocols would have to be more complex to deal with packet truncation. (Although I guess they could simply treat truncation the same as dropped to avoid extra complexity.)

2. Checksums at the end of a packet being dropped likely make the whole packet useless.

3. It may be better to get a quick rejection from a local router rather than the possibly far away peer.
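The detection assumption above - the IP header carrying the original size - can be sketched for IPv4, where the total-length field already does this (the example assumes a plain 20-byte header and no link-layer padding):

```python
import struct

def truncated_by(packet: bytes) -> int:
    """How many bytes were cut off, per the IPv4 total-length field
    (bytes 2-3 of the header, network byte order)."""
    total_length = struct.unpack_from("!H", packet, 2)[0]
    return total_length - len(packet)

# A packet claiming 1500 bytes of which only 1000 actually arrived:
header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 1500, 0, 0, 64, 17, 0,
                     b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02")
packet = header + b"\x00" * (1000 - len(header))
print(truncated_by(packet))  # 500
```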


I'd add a fourth potential positive aspect of truncation: it could be implemented at full speed in the data path, instead of a separate process running on another CPU to generate, route, and send the ICMP packet.

All these advantages would come from truncation being "in-band", following the same path as the data packets, instead of being a special "out-of-band" message like ICMP packets are.

As for the flaws you listed:

1. The main complexity would come not from dealing with truncation itself (as you said, just treat it the same as a dropped packet), but from each protocol having to reflect the "truncated size" back to the original sender (like how ECN deals with marked packets).

2. The whole packet is not useless, the reason for truncating instead of turning it into an empty packet is to keep as much as possible of the higher layer headers. That is, most of the packet might be useless, but the higher layer headers at the start can be useful.

3. On the other hand, you might need several of these "quick" rejections before arriving at the ideal size.


I'll add an extra bonus: every now and then, an endpoint could append one extra bonus byte to a transmitted packet. That would probe for a potential increase in PMTU at very little cost.


Yeah, maybe a "contains MTU dummy data" bit at the start of the header which lets the receiving client know that this packet 1) May contain additional data after the real payload 2) If it does, send back an updated max packet size based on the amount that made it through.

IETF folks wouldn't like that much state in the IP protocol, and it doesn't solve intermediate L2 loss (the real reason you have IP MTU differences) when switches are involved in the path, though. That said, it might make an interesting ICMP probing method.


In TCP, for packets without the urgent bit set, you could include the real size of the data in the urgent pointer field. The sender could occasionally send a local max MTU packet with at most the detected path mtu of data. The receiver could send a return indicator to confirm the received length if the urgent pointer field was set.


> 3. You get an exact size limit of the entire path the packet took. Rather than just a "too big" of the first hop that wouldn't pass it.

ICMP does send the hop MTU, it's not just a "too big" signal. (But I do acknowledge that you're going for the PMTU in a single shot, and that's slightly neater.)


Both approaches seem like they wouldn't work with one-way routes? It's relatively rare, but I've worked on IP-based telemetry systems with only one-way, or highly asymmetric, routes. Probably the most fun was tunneling IP from a space capsule, through the International Space Station's MIL-STD-1553 data bus, through Houston, to California.


One way communication naturally makes it impossible to get any information back whatsoever so no dynamic solution is possible. Asymmetric should be able to be handled though, send and receive don't actually have to have the same MTU in this scenario.

The fallback for anything (one way, lazy, filtered connection, truly anything) is you always have your minimum MTU where the protocol is guaranteed to work at a given size and if it's not the protocol knows the solution will need to lie outside itself. IPv4/IPv6 already have this today.


My point is that fragmentation is a dynamic solution that works for one way links. It's really a very nice feature all things considered. As an embedded engineer that typically deals with controlled networks rather than the Internet, it means I can send up to a 64K UDP packet without overthinking it.


I think there is some confusion on what's happening in your example case. At minimum we know for sure this will happen:

- Local UDP socket bind receives a 65k payload to deliver

- Local IP stack looks at where this is going to go, sees it'll be going out an interface with 1.5k MTU

- Local IP stack creates 44 IP fragments of 1.5k and a final smaller fragment (Note: For IPv6 the IP stack must instead default to the safe-by-standard value of 1280 byte MTU since dynamic discovery could not be performed)

- 1.5k packets get sent to the network

I can guarantee these steps because no NIC or router uses 64k L2 frames for this to be able to reach the gateway unfragmented.
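Those steps pencil out for the IPv4 case (assuming a plain 20-byte IP header; fragment data sizes must be multiples of 8 bytes):

```python
IP_HEADER = 20
UDP_HEADER = 8
MTU = 1500

payload = 65535 - IP_HEADER - UDP_HEADER    # 65507: max UDP payload over IPv4
per_fragment = (MTU - IP_HEADER) // 8 * 8   # 1480 bytes of data per fragment

udp_datagram = UDP_HEADER + payload         # 65515 bytes of IP payload
full, last = divmod(udp_datagram, per_fragment)
print(full, last)  # 44 full 1480-byte fragments plus a 395-byte tail
```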

From here we run into a bit of a real-world problem vs a theoretical how-fragmentation-could-be-done problem. The article touches on why a little bit: intermediate routers can fragment if they want, but they aren't required to (overwhelmingly often they actually won't), and even fewer routers will bother re-fragmenting (what would need to be done here, as you already have 1.5k fragments), as it's significantly messier still. Even Linux won't re-fragment IP packets when doing pure software forwarding!


That all sounds about right to me. What's the confusion?


That's not a dynamic solution, the information is all static and local without any active MTU probing. Nothing about the other two probing methods changes the above steps in the one-way scenario, so why do you say they wouldn't work with one-way routes?


So the confusion is around the intended definition of the word "dynamic"? Sure, you can't rely on any feedback in a one way route situation.

My point is that IP fragmentation, as a method, gets packets through a network that would otherwise be truncated or dropped by other methods. It can also do it reasonably efficiently and conveniently, moreso than always assuming a worst case MTU at the application protocol layer.

I think what changes between methods is the use of fragmented packets to begin with? The whole premise is that IP fragmentation is bad because real world routers and firewalls don't support fragmentation properly. If fragmentation is deprecated, then wouldn't network stacks need to also start dropping oversized packets coming from user space applications?


Wow, so no way to run TCP or otherwise get ACKs or NACK like packets back? Yikes.


Yeah. You wouldn't want to run TCP on those links anyway as the high latency and non-negligible loss rates would make TCP run like crap. This is where protocols like DTN are helpful, so long as you can interface your application with DTN. Fine for email, annoying for browsing, and painful for chat.


Truncation and encryption are an unfavorable mix.


serious question: how would this be different from sending a packet with the don't-fragment bit set?


The DF bit will cause the packet to get dropped & an ICMP response to maybe get generated.

If your route looks like this,

  E_a -- MTU 1500 --> R_1 -- MTU 1200 --> R_2 -- MTU 1000 --> E_b
Where E_a & E_b are the endpoints, with two routers. The PMTU is 1000; if you first emit a 1500 B packet with DF, R_1 will drop it and emit an ICMP with the next-hop MTU of 1200. You then have to emit another packet, and get another ICMP response from R_2 with the MTU of 1000, and now you know the PMTU, but it takes two failures to get there.

OP's suggestion is a single round trip: you emit a 1500 B packet, and E_b receives a truncated 1000 B packet. E_b responds to E_a saying "that got truncated to 1000 B" and now E_a knows the PMTU in a single shot. In a simple design, we can call the truncated packet "worthless" and force E_a to repeat it with the right PMTU, but it still takes only a single packet from E_a to elicit the right PMTU from the network.
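To make the round-trip difference concrete, here is a toy simulation (hypothetical helper names; the link MTUs match the E_a -> R_1 -> R_2 -> E_b example above) contrasting classic DF-and-ICMP probing with the truncation idea:

```python
# Toy model of the path above; each entry is the MTU of one hop toward E_b.
LINK_MTUS = [1500, 1200, 1000]

def classic_pmtud(size):
    """Classic PMTUD: send with DF set; the first hop whose MTU is too
    small drops the packet and reports its next-hop MTU via ICMP."""
    attempts = 0
    while True:
        attempts += 1
        for mtu in LINK_MTUS:
            if size > mtu:
                size = mtu      # shrink to the MTU from the ICMP error
                break
        else:
            return size, attempts   # packet got through: size is the PMTU

def truncation_pmtud(size):
    """OP's idea: every router truncates instead of dropping, and E_b
    reports the size it actually received. One round trip, always."""
    return min([size] + LINK_MTUS), 1

print(classic_pmtud(1500))     # (1000, 3): two failures before success
print(truncation_pmtud(1500))  # (1000, 1): PMTU learned in a single shot
```

The classic scheme needs one failed attempt per hop that shrinks the MTU, while truncation always converges in one round trip regardless of how many hops narrow the path.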


The behavior of DF is that you either forward the whole thing or drop the whole thing.


I grok IP-fragmentation because IDS/IPS/NDS/XDS all have to splice packets together to get to the payload data ... at 5Gbps.

At that nosebleed speed, an FPGA is designed to handle IP reassembly as well as UDP reassembly and TCP desegmentation.

Very fast.

Including a bulk of FPGA logic dedicated to the overlapping-payload tricks that hackers often play on such IDSes, et al.

That is probably the ONLY beef I have with RFCs covering IP/TCP/UDP: did not detail what to do in event of overlapping packets (first overlapped takes precedence or last overlapped overrides old data?)


> That is probably the ONLY beef I have with RFCs covering IP/TCP/UDP: did not detail what to do in event of overlapping packets (first overlapped takes precedence or last overlapped overrides old data?)

For a correctly operating sender, it shouldn't matter, since the data should be identical. It would be the same as asking what should happen if the sender changes the data when retransmitting a packet (the only difference from overlapping fragments is that a retransmitted packet has 100% overlap). It's Undefined Behavior, the receiver is allowed to do anything it wants, and the sender can't complain since it's its own fault.

(Also consider that packets are always allowed to be reordered in transit, so what you thought was the first packet might become the last packet on the next hop; even if the standard constrained the reassembly ordering, you might still not get the result you expected.)


I believe egberts1's point is that leaving it unspecified invites different implementations to implement it differently. Some might choose to use the bytes from the first fragment, others might choose to use the bytes from the newer fragment. You could have a situation where the IDS / firewall / misc security appliance think the packets contain a benign request but the application server interprets them in a malicious way. Things like HTTP request smuggling rely on the same kind of mismatch at the application protocol layer, for example.
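A tiny sketch (hypothetical, byte-granular rather than the real 8-byte fragment units) of how two reassembly policies can disagree on the very same overlapping fragments:

```python
def reassemble(fragments, policy):
    """fragments: list of (offset, data) in arrival order.
    'first' keeps bytes from whichever fragment arrived first;
    'last' lets later fragments overwrite earlier ones."""
    buf = {}
    for offset, data in fragments:
        for i, byte in enumerate(data):
            pos = offset + i
            if policy == "last" or pos not in buf:
                buf[pos] = byte
    return bytes(buf[i] for i in sorted(buf))

# Overlapping fragments crafted so the two policies diverge:
frags = [(0, b"GET /admin"), (4, b"/index.htm")]
print(reassemble(frags, "first"))  # b'GET /admin.htm'
print(reassemble(frags, "last"))   # b'GET /index.htm'
```

If the security appliance uses one policy and the end host the other, they literally inspect two different requests, which is the smuggling-style mismatch described above.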


The problem is intractable unless you enforce lossless perfect ordering as a strict requirement. There's no way to define "first" as written in the premise without it.


Yeah, when programmers think they understand networking, it's like the long rants about their understanding of timezones, and names, and email addresses.

Any order is a valid case that someone got as their "correct" order.


Case in point, MS Windows does it one way, UNIX does it another way.


You're missing the point: because of packet reordering, any order can be and should be valid.


5 Gbps is pretty slow. Internet carrier routers typically connect to each other at 100 Gbps (or 400 Gbps these days), and a single router will have many such connections, so Tbps of bandwidth, often costing about the same as that IPS FPGA hardware (well, the routers go up to rack size, in which case you'd probably want to compare to a rack-size IPS box cost).


There is a 25Gbps XNS in the works.

With even more live updating of newly added Aho-Corasick and regex algorithms into the FPGA.


Oh yeah, things can still scale up, but the gap remains the same since the same scaling drives both. Fortinet does some really insane stuff scaling its custom ASICs (plus some FPGA, but that's not fast enough for everything, so most is in ASIC instead) on e.g. the FG-7121F chassis to get ~500+ Gbps IPS/NGFW throughput. I haven't had an excuse to go past their "smaller" fixed platforms, which do 40+, though. Naturally you're paying more for that single ~500 Gbps total (with elephant flows limited, since it's achieved via parallel flow processing) than you would for something like a Nokia 7750 SR-s with ~500x that throughput (216 Tbps), but such is life when you need to act as an endpoint instead of a router.


Oops, maybe massively parallel FPGA processors should be in there somewhere before the SSL stage.

Goal is to do proper reduction of each ASIC/FPGA stage without a packet drop.

Meanwhile, your website is dropping SSL connections, zamadatix.


Yes, analyzing hostile traffic is much more expensive than routing.


Well yes, you could gather that, but I'm also highlighting that what inspection boxes do with fragmentation and reassembly doesn't necessarily relate to the problems covered in the article. That work has to happen extremely quickly across the internet, which often rules out anything that depends on how a box in the middle sees the traffic, and instead favors what endpoints can do independently via things like DPLPMTUD.

The boxes do run into very interesting problems and knowledge space though, just more on the inverse of the problem at hand (how to properly act like a client receiving already fragmented data).


Or properly behave as all-client allowable rather than A client-permitted


One of the biggest misses with IP fragmentation was not requiring each fragment to carry the higher protocol header. Or at least do that for UDP.

That decision alone would’ve made fragments so much simpler on network devices and appliances, and much less likely for them to get dropped.


That would be a layering violation. IP routers don't necessarily know about higher protocols.


You could implement it as a generic 'application metadata' field in the IP header. From the perspective of IP, it's one more length-prefixed field in the header. Routers may interpret it in conjunction with the value of the protocol field; otherwise they are just required to leave it unchanged in the header (including in all fragments).

For packets that don't want to use it, this is just 1 byte of overhead to set the size to 0.
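A rough sketch of what such a length-prefixed, opaque metadata field could look like on the wire (entirely hypothetical; this is not a real IP option):

```python
import struct

def build_metadata(metadata: bytes) -> bytes:
    """One-byte length prefix followed by opaque bytes; routers that
    don't understand the contents just copy the field into every
    fragment unchanged."""
    if len(metadata) > 255:
        raise ValueError("metadata too long for a 1-byte length prefix")
    return struct.pack("!B", len(metadata)) + metadata

def parse_metadata(buf: bytes):
    """Split off the metadata field; returns (metadata, rest_of_header)."""
    length = buf[0]
    return buf[1:1 + length], buf[1 + length:]

field = build_metadata(b"\x01\xbb")    # e.g. a dst-port hint for filters
assert parse_metadata(field + b"tail") == (b"\x01\xbb", b"tail")
assert build_metadata(b"") == b"\x00"  # opting out costs exactly 1 byte
```

The point of the length prefix is that a router never has to understand the contents to forward or copy the field, which keeps the layering intact.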


You could design a network protocol that fragments by capturing a variable number of bytes from the next header, and ICMP already does something like that.

(None of this would fix the real problem with fragmentation, which is that you can't efficiently segment out a large frame without having some kind of reliability layer).


If I was revisiting, I'd probably eradicate the layer and pick a fixed number of flow types with distinct headers and state machines. The layers were a reasonable choice given the understanding of the time, but in hindsight I think you can make a strong case they're cut at the wrong places.


It's just a dumb mistake. All it takes is a "next layer header length" field. It would have been very simple.

You don't even really need that; as proof, ICMP (which was designed as part of IP) actually does do this. Routers are already required to copy and include the header of the packet that triggered an ICMP error.


The IP layer doesn't have to know what is in those upper layers to include 50 or 100 bytes of it in a little chunk.


If you always chop off 100 bytes and add 100 bytes, it's even more massively inefficient than the problem it solves. The router would at least need every protocol to start with a header-length value. Otherwise, if you just take the first 100 bytes and stick them in the front of each packet and the header was only 57 bytes, then you've suddenly got 43 bytes of garbage in the next layer's payload when you reassemble.

Keep in mind, most routers don't even bother supporting existing fragmentation because it's costly to implement in high-speed hardware. So while you could theoretically have that dynamic next-protocol header-length field, it'd only be complicating something hardware makers already think is too complicated to be worth it. Making things unappealingly complex is one of the common results of layering violations.


There are no strict rules about layers; most routers can and do read info in TCP/UDP headers.


And that's how we got forever stuck with those 2 and now have to build every new protocol on top of UDP.


Actually, that's not a bad thing. UDP is small enough to have nearly no overhead, but complex enough to let firewalls do their job. Six of the eight bytes in its header would probably be in the header of any transport layer protocol anyways (only the checksum might be unnecessary).

Wikipedia lists over 100 assigned IP protocol numbers [1], and while it would break existing firewalls, adding a new protocol would certainly require less work than the transition from IPv4 to IPv6. But UDP is already simple enough that there's very little benefit in not just building on that.

[1] https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers
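For reference, the entire UDP header really is just those four 16-bit fields (RFC 768), which is trivial to parse:

```python
import struct

def parse_udp_header(datagram: bytes):
    """Unpack the 8-byte UDP header: source port, destination port,
    length (header + payload), and checksum, all big-endian."""
    src, dst, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {"src": src, "dst": dst, "length": length, "checksum": checksum}

# A made-up DNS-query-style header: port 12345 -> 53, 8 + 20 bytes total.
header = struct.pack("!HHHH", 12345, 53, 28, 0)
print(parse_udp_header(header))
```

Both port fields sit in the first 4 bytes, which is exactly what lets firewalls do their job from the first fragment alone.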


No it isn't. That fault lies with nat and idiots who only open http on their firewalls.


They can read higher layers, but they (currently) don't have to in order to implement IP correctly


> most routers can and do read info in tcp/udp headers.

Do most routers really do that, or just the ones which are also trying to act as a firewall?


For example, IP routers often peek at UDP/TCP port numbers to calculate ECMP flow hashing. This is technically naughty but it's read-only and it's only an optimization that isn't required for correct forwarding.
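A minimal sketch of that kind of flow hashing (the CRC32 hash here is an arbitrary stand-in; real routers use vendor-specific hardware hashes):

```python
import zlib

def ecmp_path(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Hash the 5-tuple so every packet of one flow picks the same
    equal-cost path, avoiding reordering within the flow."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths

path = ecmp_path("10.0.0.1", "192.0.2.7", 6, 49152, 443, n_paths=4)
assert 0 <= path < 4
# Deterministic: the same flow always hashes to the same path.
assert path == ecmp_path("10.0.0.1", "192.0.2.7", 6, 49152, 443, 4)
```

Note that fragments break this optimization: only the first fragment carries the ports, so later fragments of the same flow may hash differently unless the router falls back to the 3-tuple, which ties back to the layering problem discussed above.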


Yes. I doubt you can find one that is not capable.


Almost every modern router in a multipath network peeks at the next layer to implement flow hashing correctly.


> In my experience, it is rare for a network to correctly generate Time Exceeded messages for both IPv4 and IPv6.

Doesn't that make it more one of those situations where the non-documented behavior has become the de facto standard, rather than "wrong" exactly? (I guess it depends on whether that decision is being made consciously by the implementors or just for lack of knowledge of the standards.)


People who filter out all ICMP are probably unaware of the standard, but router implementors that limit ICMP rates are balancing transparent observability with the need to keep the equipment running.

I guess you could provision the router cpus so they could send ICMPs for line rate incoming packets that must be dropped, but that doesn't seem like a good cost tradeoff.


Fragmentation/MTU is right up there with verifying delivery as one of those things that can sound like a really easy problem to solve until you start trying to solve it.


What I'm gathering from the quiz is that the documentation for IP_MTU_DISCOVER is terribly written.

> IP_PMTUDISC_WANT will fragment a datagram if needed according to the path MTU, or will set the don't-fragment flag otherwise.

And then the quiz asks for the DF bit in the former case. The docs don't say; one could illogically assume the DF is unset in the former case … and in fact, that seems to be the case, though this leaves me wondering why it would do that, since with DF unset, it won't elicit the needed ICMP responses. There still remain some cases, of course, where we could do that (the latter case in the docs) … but this seems inefficient. In the former case, why wouldn't we fragment according to the PMTU and just set DF in all cases?
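For concreteness, here is roughly how those modes are selected on Linux (the constants come from linux/in.h; Python's socket module doesn't always export them, so they are hard-coded here, and this sketch is Linux-only):

```python
import socket
import sys

# Values from <linux/in.h> (Linux-specific):
IP_MTU_DISCOVER = 10
IP_MTU = 14
IP_PMTUDISC_DONT = 0   # never set DF; rely on in-network fragmentation
IP_PMTUDISC_WANT = 1   # fragment locally to the cached path MTU
IP_PMTUDISC_DO = 2     # always set DF; oversized sends fail with EMSGSIZE

if sys.platform == "linux":
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("127.0.0.1", 9))  # connect() only picks a route; no packet is sent
    print(s.getsockopt(socket.IPPROTO_IP, IP_MTU))  # cached path MTU for the route
    s.close()
```

Reading back IP_MTU after a connect() is the usual way to inspect what path MTU the kernel currently believes for that destination.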


Is it permissible to set DF on a datagram which is already fragmented? Or if not formally impermissible, does it trigger some interesting bugs in something that assumes DF set = not fragmented?


The IPv4 RFC doesn't have anything precluding DF on a fragment. Since each fragment is a datagram in its own right, adding a DF just means that it can't be fragmented even further. The example fragmentation algorithm only considers DF on a per-datagram basis, regardless of whether or not it is a fragment of some original datagram. (It does take care to forward the original offsets and MF flag correctly, to account for the datagram being a fragment.)
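The per-datagram treatment is easy to see in a simplified version of the RFC 791 example algorithm (a sketch only; real code also copies IP options and recomputes checksums):

```python
def fragment(payload: bytes, mtu: int, header_len: int = 20, df: bool = False):
    """Split one datagram's payload to fit mtu. Fragment offsets are in
    8-byte units, so every fragment but the last carries a multiple of
    8 payload bytes. DF is checked per datagram, whether or not the
    datagram is itself already a fragment."""
    if header_len + len(payload) <= mtu:
        return [{"offset": 0, "mf": False, "data": payload}]
    if df:
        raise ValueError("DF set: drop and emit ICMP Fragmentation Needed")
    step = (mtu - header_len) // 8 * 8
    return [
        {"offset": off // 8, "mf": off + step < len(payload),
         "data": payload[off:off + step]}
        for off in range(0, len(payload), step)
    ]

frags = fragment(b"x" * 3000, mtu=1500)
assert [f["offset"] for f in frags] == [0, 185, 370]
assert [f["mf"] for f in frags] == [True, True, False]
```

(A faithful version would also add the incoming datagram's own fragment offset to each new offset and preserve its MF flag, which is the "forward the original offsets and MF flag correctly" detail mentioned above.)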


I'm amazed by this; call it "PMTUD Broadside" or some such. The idea is really a bit Egg of Columbus. Anyway, it tickles my geek senses.


Haha, I like broadside. I had to lookup egg of columbus, TIL


It was fun to see a lot of flak towards fragments, when really just about everything you access today is delivered in fragments called TCP segments.

Sure, the delivery mechanisms are quite a bit different - but not that much if we were to speak of a case of a packet fitting into two fragments…


Does it even make sense to probe the path MTU on a packet-switched network such as the Internet? In general, the packets of a message could each take a different path with vastly different MTUs.


I wonder if IP fragmentation would/could have any effect on the ability for a Nation State's (i.e., China, Russia, North Korea, ?, ???) firewall to prevent packets from entering/leaving a blocked country...

Why or why not?

In fact, now that I think about it... what happens if a TCP/IP connection is initiated from another country to a TCP/IP address inside of an outwardly blocked country?

Would that act similar to a NAT punchthrough, but on a Nation-State's firewall?

Think about it this way... let's take an arbitrary example, China...

And let's suppose for the purposes of discussion that what the western Mainstream Media says about China's Internet is completely true -- that outgoing connections to the West from China -- are blocked.

OK, but there still is Alibaba -- the equivalent of China's Amazon.com.

Connections FROM the West (e-commerce connections, "we buy from you", "you are a merchant to us", etc., etc.) TO Alibaba.com (in China) -- are not blocked, and would NOT be blocked by the Chinese government!

Why?

Because anywhere there's trade -- if the trade is beneficial to the government in question (anyone heard of taxes, anyone? They benefit the government you know!) -- there will be internal political pressure NOT to prevent that trade from occurring!

So let's say that a TCP/IP connection that looked like a legitimate e-commerce connection -- was initiated FROM the West, to an IP address in China for a legitimate Chinese e-commerce site...

Now, couldn't that connection, once established (since it is, after all bi-directional) -- be used to send data outside of China?

Yes!

But to only one outside/external IP address!

But what if that IP address on the Western side (being in a non-firewalled country) -- now could somehow route that data to any other IP address around the world?

Sort of like a VPN -- but the connection is opened up from an INCOMING connection, not an outgoing one...

Would that be the equivalent of NAT punchthrough -- for a Nation State's firewall?

?

Before you answer, I'm guessing that there are Deep State bots (from all countries!) that will try to derail this conversation...

That's a testament to "always think for yourself" (let logic be your guide!) -- stay away from so called "expert" opinions, especially one or two sentence "no it can't be done" or "it would never work" replies by fake posters with fake names from fake accounts!

But, I guess we'll open up the floor to the AI powered agenda-driven bots... foreign and domestic! :-)

Also, note that I don't suggest/condone that anyone actually do any of the above -- this discussion is theoretical in scope only!


What does any of that have to do with IP fragmentation?

VPNs work pretty much the same regardless of which "direction" the tunnel is opened in; it's still just a normal VPN. Heck, you don't even need an open connection for a VPN; IP itself is stateless and connectionless.


> Connections FROM the West (e-commerce connections, "we buy from you", "you are a merchant to us", etc., etc.) TO Alibaba.com (in China) -- are not blocked, and would NOT be blocked by the Chinese government!

Um, conspiracy theories aside, Alibaba has servers in the US.


Which would have two-way data paths (inbound/outbound) to the Alibaba company (or other e-commerce company in the case of a different e-commerce company) in mainland China, or, more broadly, to the interior of other countries whose Nation-State firewall blocks outgoing connections...

Point is, if there's a way out, then there's a way in...

How does the data on the Alibaba servers in the U.S. get updated? Probably not by carrier pigeon... Someone (or a group of people) in the Alibaba offices in China updates that server when prices change, when new products are added, etc., etc.

And how does that happen, if there isn't an outgoing Internet connection?

And if so, then that's proof of a non-blocked outgoing connection from China.

And if that's true, then we need to raise the question of which types of outgoing connections are selectively permitted from China (because obviously the Alibaba update on U.S. servers is permitted!)

Which would in turn start to crumble the western mainstream media narrative that China blocks all outgoing connections...

In other words, if China doesn't block all outgoing connections, then we need to know (if we care at all about global Internet censorship) which types of connections are blocked, which ones aren't, and what the exact criteria for a blocked or permitted connection is...

Because if you are correct, then it would seem that connections from mainland China to Alibaba's U.S. servers are NOT blocked...

Conspiracy theories aside!


Obviously not all connections from China are blocked! That is not news.

Try reading the basic Wikipedia articles for starters:

https://en.wikipedia.org/wiki/Internet_censorship_in_China

https://en.wikipedia.org/wiki/Great_Firewall

https://en.wikipedia.org/wiki/List_of_websites_blocked_in_ma...


>we need to raise the question of which types of outgoing connections are selectively permitted from China (because obviously the Alibaba update on U.S. servers is permitted!)

That question was raised decades ago, and the answer is already known: traffic which sufficiently disagrees with the current government of China is blocked.

>Which would in turn start to crumble the western mainstream media narrative that China blocks all outgoing connections

I have never seen anyone (other than you) claim that China blocks all outgoing connections. That seems to be your narrative, rather than some mythical, conspiratorial, "western mainstream media narrative". I don't think it does us any good to invent divisive narratives like this. You could instead invent narratives that unite people and ideas and nations.

The reality that the world realizes, is that the current government of China blocks traffic containing content telling narratives which sufficiently disagree with them, using the tool they developed specifically for that purpose.


>"the answer is already known: traffic which sufficiently disagrees with the current government of China is blocked."

"Sufficiently disagrees" is not an exact understanding of exactly when/where/why and how the Chinese (or more broadly, any other Nation-State's) government blocks connections, or even if they indeed block connections (I have not personally tested Internet connectivity to the world at large from within China, so I'm going by secondhand information, western news reports -- that make the general claim that at least some aspects of China's Internet are censored and/or blocked: https://www.google.com/search?q=great+firewall+of+china )

>I have never seen anyone (other than you) claim that China blocks all outgoing connections.

I did not claim that...

Did I not say in the first message in this chain: "And let's suppose for the purposes of discussion that what the western Mainstream Media says about China's Internet is completely true -- that outgoing connections to the West from China -- are blocked." ?

The purpose of this message chain was not about China nor western mainstream media!

The purpose of this message chain was to discuss the following:

>"what happens if a TCP/IP connection is initiated from another country to a TCP/IP address inside of an outwardly blocked country?"

?

???

China, in this context, was only used as an arbitrary choice for the purposes of a philosophical discussion...

We could substitute China with any other country for the purposes of this same discussion, if it pleases you!

Related:

https://en.wikipedia.org/wiki/Great_Firewall

https://en.wikipedia.org/wiki/NAT_traversal

https://en.wikipedia.org/wiki/Hole_punching_(networking)

https://en.wikipedia.org/wiki/TCP_hole_punching

https://en.wikipedia.org/wiki/UDP_hole_punching

https://en.wikipedia.org/wiki/ICMP_hole_punching


> >"the answer is already known: traffic which sufficiently disagrees with the current government of China is blocked."

>"Sufficiently disagrees" is not an exact understanding of exactly when/where/why and how the Chinese...government blocks connections

An exact understanding isn't necessary to determine whether it happens (it does). Nonetheless, it seems like an interesting topic: how exactly do Chinese government decisions on what to block, flow through their government system, before getting to the technical level where they actually do the blocking? I look forward to the results of your research on this, as you have good questions to ask the Chinese government.

>or even if they indeed block connections

They do. It's in the lede of the article you linked:

> The Great Firewall (GFW; simplified Chinese: 防火长城; traditional Chinese: 防火長城; pinyin: Fánghuǒ Chángchéng) is the combination of legislative actions and technologies enforced by the People's Republic of China... to block access to selected foreign websites [0]

I know it sounds obvious, but to avoid confusion, it's worth being absolutely clear here: blocking access to foreign websites means blocking connections to them. Honestly, it's quite confusing to me how someone can doubt the very existence of China's connection-blocking Great Firewall. Is this some attempted jedi mind trick? Of course it exists.

0: https://en.wikipedia.org/wiki/Great_Firewall


>Nonetheless, it seems like an interesting topic: how exactly do Chinese [or more broadly, any] government [make] decisions on what to block, flow through their government system, before getting to the technical level where they actually do the blocking?

Indeed, that is an excellent question, applied to any Nation-State that blocks connections for any reason!

Now, this brings up a highly interesting question, which looks something like this:

How could someone in Nation-State X know exactly what is blocked by a different Nation-State, Nation-State Y -- without physically being there to perform tests on their Internet as if they were a Citizen/National/Resident/Physically Located -- in Nation-State Y?

?

There's no easy answer for that question... at least not without going to Nation-State Y and physically testing their Internet...

But if I were to guess at a hypothetical way to do it -- if someone had a Google-size cache snapshot of all websites and their contents on the global Internet, then if that someone also had a Google-style multi-threaded multi-computer web spider like the Googlebot -- then if one could somehow get VPN connections to outgoing IP addresses in Nation-State Y, then one could attempt running that, the effect being similar to being physically in Nation-State Y...

From that, one could compare ALL of the results of those connection requests and corresponding web pages -- to see what connections and web pages worked and which ones didn't...

For the connections and pages which didn't work, those might have been caused by a website being updated, being taken offline, random Internet failure, etc., etc., that is, one couldn't immediately ascribe responsibility to Nation-State Y intentionally blocking the site or connection without additional debugging/fact-finding/triangulation of the root cause...

So, that wouldn't solve the problem... but it might be a good starting point...

Something like that, if it would exist, if it could exist (Google + a VPN provider in Nation-State Y would have all of the necessary infrastructure, incidentally!) -- would be a good first step in that direction...

In fact, a smaller, one-site-at-a-time version of it could be implemented by remote P2P nodes...

In fact, here's an even weirder idea(!), now that I think about it -- use Javascript running on remote P2P nodes (with permission of the node owners!) in other countries to test Internet connectivity to Web Sites (or more broadly, any kind of Internet connectivity, since the Internet is not only the web!):

In other words, use something like the following (in conjunction with custom JavaScript code which would test websites/Internet connectivity):

https://news.ycombinator.com/item?id=39373960

https://pears.com/news/holepunch-unveils-groundbreaking-open...

Anyway, some excellent questions! Thank you for them! You made me think!


> So you think you understand IP fragmentation?

Titles like these sound unnecessarily arrogant to me and are off-putting.


Perhaps. I would have phrased it "Nobody understands IP fragmentation", but sometimes when delivering to a room of experts you need to increase the volume a little bit.

Of course, it really depends on whether you can actually back this up! And in this case Val does, quite clearly.


> Of course, it really depends on whether you can actually back this up! And in this case Val does, quite clearly.

Yes indeed; and reading the article, she does not sound that arrogant:

"Many networking experts think they know when IP fragmentation will happen, and I thought I did too—until I had to implement an algorithm for a VPN"



