Wiznet please reply: Data corruption

[size=150]Edit as of Oct 02 2015: the issue was solved.[/size]
Explanation can be found here: [From QnA] WIZ108SR RS485.
If you experience issues like this you can contact me for tips.

Problem: I receive the correct amount of data, but the data is corrupt.

I use TCP/HTTP communication driven by hardware interrupts. The server sends exactly 128 KBytes of data plus the HTTP header. At approximately the 40th KByte, the image I receive starts to differ from the original in its simple 16-bit checksum (the sum of all bytes, kept in a 16-bit word).
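
For reference, the simple checksum described above could look roughly like this in C. This is only a sketch of "sum all bytes into a 16-bit word"; the name `sum16` is made up here, and the original implementation is in assembler and may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add every byte of buf into a 16-bit accumulator.
   Overflow simply wraps modulo 65536. */
uint16_t sum16(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint16_t)(sum + buf[i]);
    }
    return sum;
}
```

Note that a plain sum like this detects changed byte values but not bytes that merely swapped places, so a matching sum is weaker evidence than a byte-by-byte compare.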

I do the following:

  1. Set up all W5100 registers, install the interrupt handler, open a TCP socket and establish the connection.
  2. Interrupt requests start arriving.
  3. On entry to the interrupt routine I check the socket IR flags and reset them all immediately by writing 0FFh to the respective IR.
  4. I then read the byte count from the RSR register, read that many bytes from the W5100’s RX buffer, update the RXRD register and write the RECV command to the socket's command register.
  5. This continues until the remote host closes the established connection.

As I said, the size of the data is correct, but the contents are corrupt.

Another issue is that communication is very slow. The receive/send LEDs flash about once per second, and the transfer takes 20 seconds, which is about 6 KBytes per second. Given that my algorithm and CPU are not the fastest, let’s say the “real” speed of the service is 10 KBytes per second. A nearby computer running Windows XP achieves 500 KBytes per second from the same web server on the same subnet.

If my algorithm is correct, I suspect the W5100 experiences errors during communication and is busy with retransmissions; retry is set to 400 ms and 8 attempts. Is it possible to get such retransmission/line quality statistics from the W5100?
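
As a small aside on the retry setting mentioned above: assuming the retry time register (RTR) counts in units of 100 µs, which is how I read the W5100 datasheet, 400 ms corresponds to a register value of 4000 (0FA0h). The helper name `rtr_from_ms` is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define RTR_UNIT_US 100u  /* assumed: one RTR count = 100 microseconds */

/* Convert a retry timeout in milliseconds to an RTR register value. */
uint16_t rtr_from_ms(uint32_t ms)
{
    assert(ms <= 6553u);  /* larger values would not fit in 16 bits */
    return (uint16_t)(ms * 1000u / RTR_UNIT_US);
}
```

With this, `rtr_from_ms(400)` gives 0FA0h; the attempt count of 8 would go into the separate retry count register unchanged.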

Some time ago I sent several emails to Joachim Wuelbeck @ Wiznet asking for the correct model of RJ45 jack with built-in magnetics. He never replied. As I had to proceed with the project, I chose the J1006F01P, which has a symmetrical transformer but a slightly different circuit (compared to the sample Wiznet uses), without the 75 ohm resistors and the 1000pF capacitor to ground. Can this be the cause? Again, how can I prove it? Are there any statistics in the W5100 on the quality of communication?

Overall I find Wiznet's hardware documentation, and the documentation in general, to be of poor quality: some areas are covered in just a few words (e.g. RJ connector and transformer types), and some contain serious mistakes (such as the socket IR register documented as read-only that turned out to be writable, in both the W5100 and W5200 documentation). Support is also inadequate for the complexity of the device.

Eugeny Brychkov, Ph.D.
[url]https://www.linkedin.com/in/brychkov[/url]

Hi,
Regarding the 1st question:
Is the checksum in the TCP header wrong, or does the received data differ from the sent data?

If the latter, capture the sent data with a packet capture program such as Wireshark and compare it with the data at the sender.

2nd:
Set bit 5 of Sn_MR to enable the No Delayed ACK option.
If you do, the transfer speed should go up.
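
The suggestion above amounts to OR-ing one bit into the socket mode value. A minimal sketch, assuming the bit layout I remember from the W5100 datasheet (low nibble selects the protocol, 01h = TCP; bit 5 = No Delayed ACK in TCP mode); the macro and function names are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define SN_MR_PROTO_MASK 0x0Fu
#define SN_MR_TCP        0x01u
#define SN_MR_ND         0x20u  /* bit 5: No Delayed ACK (TCP only) */

/* Return the Sn_MR value with the No Delayed ACK bit set.
   Only meaningful for a socket configured for TCP. */
uint8_t sn_mr_enable_nd(uint8_t mode)
{
    assert((mode & SN_MR_PROTO_MASK) == SN_MR_TCP);
    return (uint8_t)(mode | SN_MR_ND);
}
```

For a plain TCP socket this yields 21h, which, as far as I understand, is written to Sn_MR before the socket is opened.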

And the RJ45 looks fine to use.

Thanks.

Hello Tony, thank you very much for the reply.

The time decreased to 7 seconds (from 20); the data transfer speed (from the server to final memory in the machine) is 18.7 KBytes/s. Good result.

I am sorry, I did not understand your statement. Do you mean the J1006F01P is fine to use?

It is a data difference. The image on the server is correct, and the server sends it properly to another host (a PC downloads the image with the correct checksum).
I will run more tests and let you know.

Hi,

Sorry, I was looking at a different datasheet (J1011).
The J1006 is not suitable for the W5100.
The magjack must have the resistors and the capacitor.

If the sent data is correct, I think either your address access to the buffer is wrong or the way you store the buffer data to memory is wrong.

Please try debugging further.

Tony, thank you very much. I will order J1011F21PNL.

Please advise whether the following situation is valid for the W5100.

I get an interrupt with packet size 0B68h and RXRD=0. I process 300h bytes, update the RXRD register to 300h, and write the RECV command, allowing the W5100 to receive further bytes until it fills its receive buffer, including the 300h bytes of space that were just “freed”.

I am asking because the datasheet only gives an example of flushing the whole receive buffer and only then updating RXRD. But can I do it partially? I also guess that, if such a partial update is valid for the W5100, it can increase the speed of communication.

Edit: Related question: is TCP data loaded contiguously into the W5100?

Is it valid to calculate the RX read pointer once (using RXRD), and then flush data contiguously until RSR becomes 0 and the connection is terminated by the remote host? RXRD will loop through the 0000-FFFF space.

Such handling would simplify the algorithm and make it much faster. I use a CPU of below 1 MIPS and program in assembler, and each additional relatively complex calculation adds delay and complexity.
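
The pointer arithmetic behind this question can be written down without touching the chip. The sketch below only shows the bookkeeping under the assumption that RXRD is a free-running 16-bit pointer whose low bits select the buffer location; whether the W5100 accepts partial updates is exactly what is being asked and is not settled by this code. The type and function names are hypothetical, and `unread` is a local copy, not the chip's RSR.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t rd;      /* local copy of RXRD (free-running pointer) */
    uint16_t unread;  /* bytes reported by RSR and not yet processed */
} rx_state;

/* Physical chip address for a pointer value; mask = buffer size - 1. */
uint16_t rx_phys(uint16_t base, uint16_t mask, uint16_t ptr)
{
    return (uint16_t)(base + (ptr & mask));
}

/* Account for n processed bytes: s.rd is the value to write back to RXRD. */
rx_state rx_consume(rx_state s, uint16_t n)
{
    assert(n <= s.unread);
    s.rd = (uint16_t)(s.rd + n);  /* wraps from FFFFh to 0000h by itself */
    s.unread = (uint16_t)(s.unread - n);
    return s;
}
```

With the numbers from the post (RSR = 0B68h, RXRD = 0, 300h bytes processed) this gives RXRD = 300h and 868h bytes still pending; with a 4K buffer at 6000h the next byte would be read from 6300h.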

Ok, I took some troubleshooting steps; here’s the report:
00003F63: FE 59
00015F61: FF A4
The first column is the address in the image, the second is the byte from the original image, and the third is the byte from the image received by the W5100. So we have 2 corrupt bytes out of 131072 bytes received. I will wait for the new jack and see.
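
A report in the format above can be produced by a plain byte-by-byte compare. This is a generic sketch (the tool actually used for the report is not named at this point in the thread); `report_diffs` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Print "address: original received" for every differing byte.
   Returns the number of differences found. */
size_t report_diffs(const uint8_t *orig, const uint8_t *recv,
                    size_t len, FILE *out)
{
    assert(orig != NULL && recv != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (orig[i] != recv[i]) {
            fprintf(out, "%08zX: %02X %02X\n", i, orig[i], recv[i]);
            count++;
        }
    }
    return count;
}
```

Unlike the 16-bit sum, this shows where each corrupt byte sits, which is what makes the address patterns later in the thread visible.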

[size=150]Tony[/size], do the termination resistors have to be exactly 49.9 Ohm? I was unable to find that nominal value and used 51 Ohm instead (CAY16-510J4LF). What is the nominal tolerance (1%, I guess)? Again, the datasheet does not state anything about PHY connections, nominal values and tolerances…

I replaced the J1006 with a J1011F21PNL, replaced the 51 Ohm resistors with 49.9 Ohm resistors, and simplified the algorithms as much as possible to minimize errors in data processing. Calculating the checksum over 8192-byte chunks of the 128 KByte image gives differences in three pages. The difference is slight (about one corrupt byte per page), but in the case of executable code it is fatal.

What’s next?
I will go to different forums asking people to perform data checks on their devices, and we will see if they have the same issue as I do.

Edit: so far here
Data corruption with W5100 (and other Wiznet products) - Networking, Protocols, and Devices - Arduino Forum
[url]https://plus.google.com/communities/109883731954182375648/stream/62e02ee7-7311-4788-8cb3-78d10c315c75[/url]

Ok, we have some pattern. Here are several sample downloads and the differences I found:
In each pair, the first byte is the one received by the W5100 and the second is the original byte. The HTTP header size is 01A5h.

000034EB: C4 FE 000150E8: C5 3F 000176EA: 3B 1F

000034EB: C4 FE 0000C9EA: 0F FF 000150E8: 5E 3F

00000EE8: A0 FF 000030EA: 03 C7 000030EB: C8 AF 000034EB: C4 FE 000150E8: 5E 3F

000030EA: 03 C7 000030EB: C8 AF 0000C9EA: 00 FF 000176EA: 3B 1F

000034EB: C4 FE 0000C9EA: 0F FF 000150E8: 5E 3F 000176EA: 3B 1F

0000C467: F0 BF 0000C663: AA BF 00015F61: 33 FF

0000C663: AA BF 00015F61: 33 FF 00018C67: 39 7F 0001D462: 8F FF

00003F63: 59 FE 0000C467: 0C BF 00015F61: A4 FF 00018C67: 39 7F 00019462: 0F 7F 0001D462: 8F FF

0000B161: 24 FF 0000B163: 94 FF 0000C663: AA BF 0001C063: 98 FF

I think you have a charset conversion problem (UTF-8 / Unicode) between the different systems.
The file on the server is UTF-8 and contains Unicode characters. To illustrate:
the ASCII character “%” has ASCII code H25; in Unicode it is HFF H05 (2 bytes), but translated into UTF-8 it becomes H85 HBC HEF (3 bytes).
All could be fine until there are accented characters, which depend on the country code and which some PC software on Windows treats in windows-1252 encoding.
When you download a file over HTTP, it is the server that transcodes it while sending.
Any test starts from a file that the server has processed and that may therefore differ from the original … so please attach a zip of the source file to the post.
It would help to better understand the operations you perform to transfer the data between the systems.
It seems that something similar was suggested here: Data corruption with W5100 (and other Wiznet products) - #16 by SurferTim - Networking, Protocols, and Devices - Arduino Forum

Unfortunately that is NOT the case. I use cURL to query the file from my PC, and the server responds with

HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Sun, 17 May 2015 11:51:38 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 131072
Connection: close
Accept-Ranges: bytes
Expires: Mon, 18 May 2015 11:51:38 GMT
Cache-Control: max-age=86400
Vary: User-Agent, Accept-Encoding

and FC /b (a utility that compares files in binary mode) reports no differences. The file arrives without modification. I would be astonished if the W5100 parsed HTTP headers and changed the contents according to the content type; I am 99.(9)% sure it does not.
If there were a conversion, I would get the wrong file size. I get exactly the size of the original file, with just several bytes corrupt.
In contrast to Arduino, I use parallel access to the W5100. I checked that the access timing is met. I checked all the soldering and it is fine. I will replace the chip.
It will be the second W5100 (in my life) that I replace out of the 7 I bought. 28% is a very high DOA rate.

I replaced the chip. Same situation. With each transaction the checksum changes slightly, between 5C00 and 5E11 (which corresponds to 0 to 4 corrupt bytes).
Wiznet, please assign me an FAE/support engineer to figure out what is wrong. I need one person to dig into what I am doing; I will supply all the needed information, including source code.

Edit: differences (first is original, second - received)

0000B163: FF 94 0000C663: BF AA 0001C063: FF 84

0000B163: FF 24 00019466: 7F 0D 0001C063: FF FE
Notice that the original bytes that were received incorrectly have nearly all bits set - 7F, FF, BF - each with 7 or 8 of its 8 bits set.
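
The observation about set bits can be checked mechanically over a whole list of affected bytes. A minimal sketch, with made-up helper names, that counts how many bytes in a list have at least a given number of bits set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 1 bits in a byte. */
static int bits_set(uint8_t v)
{
    int n = 0;
    while (v != 0) {
        n += v & 1;
        v = (uint8_t)(v >> 1);
    }
    return n;
}

/* Count how many of the given bytes have at least min_bits bits set. */
size_t count_dense(const uint8_t *bytes, size_t len, int min_bits)
{
    assert(bytes != NULL || len == 0);
    size_t hits = 0;
    for (size_t i = 0; i < len; i++) {
        if (bits_set(bytes[i]) >= min_bits) {
            hits++;
        }
    }
    return hits;
}
```

Feeding it the original values of all corrupt bytes collected so far, rather than only the three quoted above, would show how strongly the pattern actually holds.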

I have made two test files, the first full of FF and the second with alternating FF 00; both transferred correctly.
A file filled with the following data

00000000: AA 0A 00 00-00 00 00 BB-A0 BB 00 FF-F0 FB A0 FF
00000010: FF FF A0 BF-FF FF A0 0B-BF FF A0 F0-0F FA A0 A0
also arrived without corruption.

Ok, I cut the A000-C000 page from the file that arrives corrupt, filled 128K with this 8K of data, and got the following:

00003163: FF 24 00005163: FF F0 00007161: FF 44 0000B161: FF 44 0000D161: FF D0 0000F163: FF 94 00011163: FF 94 00013163: FF 00 00015161: FF 44 0001B161: FF 44 0001D161: FF 4D 0001F163: FF 94
Non-corrupt pages are: 11xx, 31xx, 91xx, 17xx, 19xx - 5 pages in contrast with 11 pages with corrupt byte in approximately same location.
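
For anyone wanting to repeat this test, the image described above (one 8K page cut from the original and repeated to fill 128K) can be generated as sketched below. The function name and parameters are hypothetical; for the test above they would be a page offset of A000h, a page length of 2000h and an output length of 20000h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill out[0..out_len) with repeated copies of the page that starts
   at image[page_off] and is page_len bytes long. */
void tile_page(const uint8_t *image, size_t page_off, size_t page_len,
               uint8_t *out, size_t out_len)
{
    assert(image != NULL && out != NULL);
    assert(page_len > 0 && out_len % page_len == 0);
    for (size_t i = 0; i < out_len; i++) {
        out[i] = image[page_off + (i % page_len)];
    }
}
```

Because every page of the output is identical, a corrupt byte that appears at the same offset in several pages points at the data pattern, while one that appears in only one page points at something position- or time-dependent.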

Now I cut a 2000h page from A800-C800 to see if the corrupt locations change (in other words, whether the corruption is related to a byte's offset or to its value), and here’s what I get

00001F5F: FF F9 00003F5F: FF F0 00004961: FF DD 00005264: FF CC 00007264: FF BB 00007E5F: FF 00 00008961: FF F0 0000A961: FF DD 0000B264: FF CC 0000BE5F: FF AF 0000C961: FF 44 0000F260: FF 00 0000FF5F: FF 00 00011264: FF 99 00011F5F: FF 44 00013264: FF 99 00013E5F: FF AF 00015260: FF 00 00015E5F: FF AF 00016961: FF F0 00017F5F: FF 29 00019264: FF CC 0001A961: FF F0 0001DE5F: FF AF 0001F260: FF 00 0001FF5F: FF 0B
Note that in the previous test the addresses were x161 and x163; here we have a number of x961 addresses, which is exactly the +800h offset I applied to the data.

[size=150]From this I conclude, with 60% probability, that it is NOT a problem of the W5100 internal SRAM which holds the data, it is NOT a problem of my receiving code (which I reworked several times with the same corruption result), and it is NOT related to occasional corruption of a memory cell (e.g. incorrect interrupt handling in my machine); it is rather related to the data pattern being transferred.[/size]

However, the next test confuses me: I shifted the window by 15h bytes, to A015-C015, and got the following:

00002263: FF 0A 0000E263: FF 0A 00012263: FF 0A 00014263: FF 0A 00018263: FF 0A
All FFs at offset x263 are changed to 0A. Interestingly, looking at all the numbers, the corrupt locations are in the xx5F-xx66 range.

Here’s a more advanced test: I set up an Apache web server on my PC and used Wireshark to capture data on its LAN interface.
I downloaded the file in question from this web server, and I see that the corruption on the receiving machine is (received-sent):

0001D0C6: 1B FF

and Wireshark confirms that at this location + header size 0140h the web server sent “FF”. Thus the web server sends correct data to the W5100.

Small final note: I have other executables; the 16 KByte ones transfer properly. I have another 128 KByte executable, and it also gets corrupted (its checksum changes with every download).

By the way, looking at Wireshark I see the line:
16 3.641003000 192.168.1.40 192.168.1.35 TCP 60 aibkup > http [RST] Seq=3556851805 Win=0 Len=0
where 40 is the W5100 machine and 35 is the local web server. It says that the W5100 breaks the TCP protocol…


I was tracing my algorithm in the hope of finding that the data corruption occurs at some “edge” of a packet or data transfer… no luck. It happens inside the transfer, in the middle of a packet. E.g. location 0B163h is somewhere between the transfer boundaries 0B000 and 0B680, not at an edge or any other “remarkable” location.

0000B163: FF 24 0001C063: FF 00 0001D462: FF C0

I changed the W5100 memory allocation from 0AAh (4K buffers for sockets 0 and 1) to the standard 055h (2K for all four sockets) - no luck, same corruption.

0000B863: 9B 09 0000C467: BF 90 0000C663: BF 0D 00015F61: FF 01
As you can see, the same xx6x addresses.

I filed a complaint with the Korean Trade Commission in the category “Unfair Trade”.

Major distributors were sent the following notice:
“There’s problem with WIZnet products, please be careful as you may start getting returns. See here: Wiznet please reply: Data corruption. Good luck with your sales.”

[size=200]Wiznet - any reaction? Don’t know what to do? Please support me in proving that your product behaves properly. If the problem is caused by my mistake, I will happily apologize and revoke my claims. But time works against you, not against me.[/size]

Hi, Eugeny

I would like to review your code, if you don’t mind.
I think it is the best way to reduce debugging time.

Thanks.
Tony.

And we recommend 49.9 ohm termination resistors, but 51 ohm may also work correctly.

Refer to Ethernet termination resistor

Tony, thank you. The jack was replaced with a J1011P21, and the resistors were replaced with 49.9 ohm 1% parts. The chip was replaced once. Same corruption result, so it should not be related to a single defective chip, to the jack or to the resistors. There must be something else. I sent you an email with all the details. Please refer here for sample corruption data. It should be easy for you to check the hardware design; please contact me when you review the software design.

Hi Eugeny.

Your circuit seems to have no problem, but there is a problem in your artwork (PCB layout).

The TX +/- and RX +/- signals are high-frequency signals of 100MHz.

But your design does not take this into account at all. The signal timing is wrong.

  1. Remove the test pads from the TX +/- and RX +/- signals.
  2. Route TX+ and TX- side by side with the same length (and likewise RX +/-).
  3. Change the position of the matching resistors.
    • Your design: W5100 -> RJ45 -> matching resistors (X)
    • The right design: W5100 -> matching resistors -> RJ45 (O)
  4. Do not use a resistor array for the matching resistors.
  5. Do not place anything underneath the RJ45: no parts, GND or signals.
  6. Isolate the GND of the RJ45 from the digital GND.
  7. Use 49.9 ohm resistors.

Thank you.

Scott.