Mangled content in parsed tweet #5

@joewiz


Tweet ID 768603772968919040 was parsed as follows. Note the mangled/redundant data in the <html> element.

<tweet>
    <id>768603772968919040</id>
    <date>2016-08-25T00:19:54</date>
    <screen-name>HistoryAtState</screen-name>
    <url>https://twitter.com/HistoryAtState/status/768603772968919040</url>
    <text>RT @CIA: 2,500 intel docs, no longer for just the President’s eyes only.

The #PDB: Delivering Intel to Nixon and Ford:

https://t.co/HgvyO…</text>
    <html>RT <a href="https://twitter.com/RT @CIA: 2,500 intel docs, no longer for just the President’s eyes only.&#xA;&#xA;The #PDB: Delivering Intel to Nixon and Ford:&#xA;&#xA;https://t.co/HgvyO…">@RT @CIA: 2,500 intel docs, no longer for just the President’s eyes only.

The #PDB: Delivering Intel to Nixon and Ford:

https://t.co/HgvyO…</a>:  2,500 intel docs, no longer for just the President’s eyes only.

The <a href="https://twitter.com/search?q=%23PDB&amp;src=hash">#PDB</a>: Delivering Intel to Nixon and Ford:

<a href="http://bit.ly/2biPIEP">bit.ly/2biPIEP</a>
    </html>
</tweet>

Should've been simply:

    <html>RT <a href="https://twitter.com/CIA">@CIA</a>: 2,500 intel docs, no longer for just the President’s eyes only.

The <a href="https://twitter.com/search?q=%23PDB&amp;src=hash">#PDB</a>: Delivering Intel to Nixon and Ford:

<a href="http://bit.ly/2biPIEP">bit.ly/2biPIEP</a>
    </html>

I bet this comes down to truncation: when the original was retweeted, the text had to be truncated, and I recall that truncation throwing off the Twitter API's entity indices. I thought my old code had addressed this, but perhaps an edge case remains.
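To illustrate the kind of guard that might catch this edge case, here is a minimal sketch in Python (not this project's own code). It assumes mention entities are dicts with `screen_name` and `indices` keys, as in the Twitter API's entity objects. The function names `mention_is_valid` and `link_mentions` are hypothetical. The idea is to check that each mention's indices really point at `@screen_name` in the text before building a link, and to leave the text alone otherwise.

```python
from html import escape


def mention_is_valid(text, mention):
    """Return True if the mention's indices point at '@screen_name' in text."""
    start, end = mention["indices"]
    expected = "@" + mention["screen_name"]
    return text[start:end].lower() == expected.lower()


def link_mentions(text, mentions):
    """Wrap each valid mention in an <a> element (hypothetical helper).

    Mentions whose indices do not match the text are skipped, so a bad
    offset leaves plain text instead of a mangled link.
    """
    out = []
    pos = 0
    for m in sorted(mentions, key=lambda m: m["indices"][0]):
        start, end = m["indices"]
        if start < pos or not mention_is_valid(text, m):
            continue  # overlapping or off indices: skip rather than mangle
        out.append(escape(text[pos:start], quote=False))
        out.append('<a href="https://twitter.com/{0}">@{0}</a>'.format(m["screen_name"]))
        pos = end
    out.append(escape(text[pos:], quote=False))
    return "".join(out)


text = "RT @CIA: 2,500 intel docs"
print(link_mentions(text, [{"screen_name": "CIA", "indices": [3, 7]}]))
print(link_mentions(text, [{"screen_name": "CIA", "indices": [0, 25]}]))
```

The first call links `@CIA`; the second, with indices spanning the whole text, returns the text unlinked. Another option, if the API response includes a `retweeted_status` object, might be to build the HTML from the original tweet's untruncated text and entities instead.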
