NLP Research Based on Transformer Model


Video generation models as world simulators

We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.


This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, [^1] [^2] [^3] generative adversarial networks, [^4] [^5] [^6] [^7] autoregressive transformers, [^8] [^9] and diffusion models. [^10] [^11] [^12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data. [^13] [^14] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. [^15] [^16] [^17] [^18] We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.

[Figure: turning visual data into spacetime patches]

At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, [^19] and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data. [^20] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
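
The report gives no architectural details for this network. As a rough illustration of the idea only, here is a minimal spatiotemporal autoencoder sketch in PyTorch; the 3D-convolutional layout, strides, and channel counts are assumptions chosen for the example, not Sora's actual design.

```python
# Minimal sketch of a video compression network: encode a clip into a smaller
# latent volume (2x temporal, 4x spatial downsampling here) and decode it back.
# All architectural choices below are illustrative assumptions.
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        # Encoder: compress both temporally and spatially into a latent volume.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, padding=1),
        )
        # Decoder: map generated latents back to pixel space.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 128, kernel_size=3,
                               stride=(1, 2, 2), padding=1, output_padding=(0, 1, 1)),
            nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=3,
                               stride=(2, 2, 2), padding=1, output_padding=1),
            nn.SiLU(),
            nn.Conv3d(64, 3, kernel_size=3, padding=1),
        )

    def encode(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 3, frames, height, width) in pixel space.
        return self.encoder(video)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.decoder(latents)

if __name__ == "__main__":
    model = VideoAutoencoder()
    clip = torch.randn(1, 3, 16, 64, 64)             # a tiny random "video"
    latents = model.encode(clip)                      # (1, 8, 8, 16, 16)
    print(latents.shape, model.decode(latents).shape)
```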

Spacetime latent patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
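
As a concrete sketch of this step, the snippet below cuts a latent volume into flattened spacetime-patch tokens; the patch sizes and dimensions are assumptions for illustration, since the report does not specify them.

```python
# Minimal sketch: cut a latent video volume into a sequence of spacetime patches
# that can be fed to a transformer as tokens. Patch sizes here are assumptions.
import torch

def to_spacetime_patches(latents: torch.Tensor,
                         pt: int = 1, ph: int = 2, pw: int = 2) -> torch.Tensor:
    """Turn a latent volume (B, C, T, H, W) into tokens of shape
    (B, num_patches, C * pt * ph * pw)."""
    b, c, t, h, w = latents.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0
    x = latents.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)            # group the patch-grid dims first
    return x.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)

latents = torch.randn(1, 8, 8, 16, 16)               # e.g. output of the encoder sketch above
tokens = to_spacetime_patches(latents)
print(tokens.shape)                                   # torch.Size([1, 512, 32])
```

Because an image is handled as a one-frame video, the same function produces image tokens unchanged, and the number of tokens (and hence the size of the generated output) can be varied simply by changing the grid.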

Scaling transformers for video generation

Sora is a diffusion model [^21] [^22] [^23] [^24] [^25]; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer. [^26] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, [^13] [^14] computer vision, [^15] [^16] [^17] [^18] and image generation. [^27] [^28] [^29]
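
To make the training objective concrete, here is a hedged sketch of one diffusion training step over patch tokens. The toy transformer, the cosine noising schedule, and all dimensions are stand-ins chosen for illustration; the actual diffusion transformer's details are not published.

```python
# Sketch of one diffusion training step: corrupt clean patch tokens with noise,
# then train a transformer to predict the original "clean" patches. The tiny
# backbone and cosine schedule below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiffusionTransformer(nn.Module):
    """Toy stand-in for the diffusion transformer backbone."""
    def __init__(self, dim: int = 32, cond_dim: int = 16):
        super().__init__()
        self.time_proj = nn.Linear(1, dim)            # embeds the noise level
        self.cond_proj = nn.Linear(cond_dim, dim)     # embeds e.g. a text embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens, t, cond):
        # Add noise-level and conditioning information to every token.
        h = tokens + self.time_proj(t.unsqueeze(-1)).unsqueeze(1) \
                   + self.cond_proj(cond).unsqueeze(1)
        return self.out(self.blocks(h))

def diffusion_training_step(denoiser, clean_patches, cond):
    b = clean_patches.shape[0]
    t = torch.rand(b, 1, 1)                           # noise level per example in [0, 1)
    alpha = torch.cos(t * torch.pi / 2)               # simple variance-preserving schedule
    sigma = torch.sin(t * torch.pi / 2)
    noisy = alpha * clean_patches + sigma * torch.randn_like(clean_patches)
    pred = denoiser(noisy, t.reshape(b), cond)        # model predicts the clean patches
    return F.mse_loss(pred, clean_patches)

if __name__ == "__main__":
    model = TinyDiffusionTransformer()
    tokens = torch.randn(2, 512, 32)                  # spacetime patch tokens
    text_cond = torch.randn(2, 16)                    # stand-in text conditioning
    loss = diffusion_training_step(model, tokens, text_cond)
    loss.backward()
    print(float(loss))
```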

[Figure: video sample quality improves as training compute increases]

In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size—e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.

Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 [^30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.
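
As a rough sketch of this step, the snippet below rewrites a short user prompt into a detailed caption via a language model. The `ask_llm` callable and the instruction text are hypothetical placeholders; OpenAI's actual prompt-expansion instructions are not published.

```python
# Sketch of prompt expansion: a language model turns a short user prompt into a
# long, detailed caption before it is sent to the video model. `ask_llm` is a
# hypothetical callable wrapping whatever chat-model client you use.
from typing import Callable

EXPANSION_INSTRUCTIONS = (
    "Rewrite the user's short video idea as one detailed caption. Describe the "
    "subject, setting, camera motion, lighting, and visual style explicitly."
)

def expand_prompt(user_prompt: str, ask_llm: Callable[[str, str], str]) -> str:
    return ask_llm(EXPANSION_INSTRUCTIONS, user_prompt)

# Trivial stand-in "LLM" so the sketch runs on its own:
detailed = expand_prompt(
    "a corgi surfing at sunset",
    ask_llm=lambda system, user: f"({system}) Detailed caption for: {user}",
)
print(detailed)
```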

Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 [^31] and DALL·E 3 [^30] images.


Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts differently from the others, yet all four videos lead to the same ending.

We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit, [^32] to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
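
The core idea of SDEdit is to partially noise the source and then denoise it under new conditioning. The loop below is a crude illustrative sampler for a model that predicts clean tokens (matching the training sketch above), not Sora's actual sampling procedure; the schedule and step count are assumptions.

```python
# Sketch of SDEdit-style editing: partially noise the source tokens, then
# denoise them under a new text prompt. The sampler is deliberately simple.
import torch

@torch.no_grad()
def sdedit(denoiser, source_tokens, new_cond, edit_strength=0.6, steps=30):
    # Higher edit_strength (in (0, 1]) means more noise and a stronger edit.
    levels = torch.linspace(edit_strength, 0.0, steps)
    a0 = torch.cos(levels[0] * torch.pi / 2)
    s0 = torch.sin(levels[0] * torch.pi / 2)
    x = a0 * source_tokens + s0 * torch.randn_like(source_tokens)
    for i in range(len(levels) - 1):
        t_now = torch.full((x.shape[0],), float(levels[i]))
        x0_pred = denoiser(x, t_now, new_cond)        # predicted clean tokens
        a = torch.cos(levels[i + 1] * torch.pi / 2)   # re-noise to the next, lower level
        s = torch.sin(levels[i + 1] * torch.pi / 2)
        x = a * x0_pred + s * torch.randn_like(x0_pred)
    return x

if __name__ == "__main__":
    # Trivial stand-in denoiser so the sketch runs on its own; a real model
    # (e.g. the toy diffusion transformer above) would be plugged in here.
    identity_denoiser = lambda tokens, t, cond: tokens
    edited = sdedit(identity_denoiser, torch.randn(2, 512, 32), new_cond=None)
    print(edited.shape)
```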

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
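
A small sketch of that setup, with assumed latent and patch sizes: the starting point for an image is simply a grid of Gaussian-noise patch tokens whose temporal extent is one frame, which the usual denoising loop then turns into a latent image to decode.

```python
# Sketch: image generation is video generation with a single frame. Lay out
# Gaussian-noise patches on a spatial grid with temporal extent 1 and denoise.
# Latent and patch sizes below are assumptions for illustration.
import torch

frames, lat_h, lat_w = 1, 32, 32                     # one-frame latent grid
channels, pt, ph, pw = 8, 1, 2, 2                    # latent channels / patch sizes
num_patches = (frames // pt) * (lat_h // ph) * (lat_w // pw)
noise_tokens = torch.randn(1, num_patches, channels * pt * ph * pw)
print(noise_tokens.shape)                            # torch.Size([1, 256, 32])
```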


Emerging simulation capabilities

We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.

3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.

Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.

Simulating digital worlds. Sora is also able to simulate artificial processes; one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”

These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.

Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long-duration samples or spontaneous appearances of objects—on our landing page.

We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.

  • Bill Peebles
  • Connor Holmes
  • David Schnurr
  • Troy Luhman
  • Eric Luhman
  • Clarence Wing Yin Ng
  • Aditya Ramesh

Acknowledgments

Please cite as Brooks, Peebles, et al., and use the following BibTeX for citation: https://openai.com/bibtex/videoworldsimulators2024.bib


Attention Is All You Need

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.



IEEE Paper Format | Template & Guidelines

Published on August 24, 2022 by Jack Caulfield. Revised on April 6, 2023.

IEEE provides guidelines for formatting your paper. These guidelines must be followed when you’re submitting a manuscript for publication in an IEEE journal. Some of the key guidelines are:

  • Formatting the text as two columns, in Times New Roman, 10 pt.
  • Including a byline, an abstract, and a set of keywords at the start of the research paper
  • Placing any figures, tables, and equations at the top or bottom of a column, not in the middle
  • Following the appropriate heading styles for any headings you use
  • Including a full list of IEEE references at the end
  • Not including page numbers

IEEE example paper

To learn more about the specifics of IEEE paper format, check out the free template below. Note that you may not need to follow these rules if you’ve only been told to use IEEE citation format for a student paper. But you do need to follow them to submit to IEEE publications.


IEEE format template

The template below can be used to make sure that your paper follows IEEE format. It’s set up with custom Word styles for all the different parts of the text, with the right fonts and formatting and with further explanation of key points.

Make sure to remove all the explanatory text in the template when you insert your own.

Download IEEE paper format template


IEEE heading styles

IEEE recommends specific heading styles to distinguish the title and different levels of heading in your paper from each other. Styles for each of these are built into the template.

The paper title is written in 24 pt. Times New Roman, centered at the top of the first page. Other headings are all written in 10 pt. Times New Roman:

  • Level 1 text headings begin with a roman numeral followed by a period. They are written in small caps, in title case, and centered.
  • Level 2 text headings begin with a capital letter followed by a period. They are italicized, left-aligned, and written in title case.
  • Level 3 text headings begin with a number followed by a closing parenthesis. They are italicized, written in sentence case, and indented like a regular paragraph. The text of the section follows the heading immediately, after a colon.
  • Level 4 text headings begin with a lowercase letter followed by a closing parenthesis. They are italicized, written in sentence case, and indented slightly further than a normal paragraph. The text of the section follows the heading immediately, after a colon.
  • Component headings are used for the different components of your paper outside of the main text, such as the acknowledgments and references. They are written in small caps, in title case, centered, and without any numbering.
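
If you write the paper in LaTeX rather than Word, the standard IEEEtran class produces these heading styles automatically. The minimal sketch below only illustrates where each level comes from; the downloadable template above remains the authoritative reference.

```latex
% Minimal IEEEtran sketch: each sectioning command maps to one heading level.
\documentclass[conference]{IEEEtran}
\begin{document}

\title{Paper Title Here}
\author{\IEEEauthorblockN{Author Name}
\IEEEauthorblockA{Affiliation, City, Country \\ email@example.com}}
\maketitle

\begin{abstract}
This paper discusses \ldots
\end{abstract}

\begin{IEEEkeywords}
transformer, attention, natural language processing
\end{IEEEkeywords}

\section{Introduction}        % Level 1: Roman numeral, small caps, centered
\subsection{Background}       % Level 2: capital letter, italic, left-aligned
\subsubsection{Prior work}    % Level 3: number and closing parenthesis, run-in
\paragraph{Details}           % Level 4: lowercase letter and parenthesis, run-in
Text of the section follows the heading.

\section*{Acknowledgment}     % Component heading: centered, no numbering
\end{document}
```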

Frequently asked questions about IEEE

You should use 10 pt. Times New Roman font in your IEEE format paper.

For the paper title, 26 pt. Times New Roman is used. For some other paper elements like table footnotes, the font can be slightly smaller. All the correct stylings are available in our free IEEE format template.

No, page numbers are not included in an IEEE format paper. If you’re submitting to an IEEE publication, page numbers will be added in the final publication but aren’t needed in the manuscript.

IEEE paper format requires you to include an abstract summarizing the content of your paper. It appears at the start of the paper, right after you list your name and affiliation.

The abstract begins with the word “Abstract,” italicized and followed by an em dash. The abstract itself follows immediately on the same line. The entire section is written in bold font. For example: “ Abstract —This paper discusses … ”

You can find the correct format for your IEEE abstract and other parts of the paper in our free IEEE paper format template.



OpenAI teases an amazing new generative video model called Sora

The firm is sharing Sora with a small group of safety testers, but the rest of us will have to wait to learn more.

By Will Douglas Heaven

OpenAI has built a striking new generative video model called Sora that can take a short text description and turn it into a detailed, high-definition film clip up to a minute long.

Based on four sample videos that OpenAI shared with MIT Technology Review ahead of today’s announcement, the San Francisco–based firm has pushed the envelope of what’s possible with text-to-video generation (a hot new research direction that we flagged as a trend to watch in 2024).

“We think building models that can understand video, and understand all these very complex interactions of our world, is an important step for all future AI systems,” says Tim Brooks, a scientist at OpenAI.

But there’s a disclaimer. OpenAI gave us a preview of Sora (which means sky in Japanese) under conditions of strict secrecy. In an unusual move, the firm would only share information about Sora if we agreed to wait until after news of the model was made public to seek the opinions of outside experts. [Editor’s note: We’ve updated this story with outside comment below.] OpenAI has not yet released a technical report or demonstrated the model actually working. And it says it won’t be releasing Sora anytime soon. [Update: OpenAI has now shared more technical details on its website.]

The first generative models that could produce video from snippets of text appeared in late 2022. But early examples from Meta, Google, and a startup called Runway were glitchy and grainy. Since then, the tech has been getting better fast. Runway’s Gen-2 model, released last year, can produce short clips that come close to matching big-studio animation in their quality. But most of these examples are still only a few seconds long.

The sample videos from OpenAI’s Sora are high-definition and full of detail. OpenAI also says it can generate videos up to a minute long. One video of a Tokyo street scene shows that Sora has learned how objects fit together in 3D: the camera swoops into the scene to follow a couple as they walk past a row of shops.

OpenAI also claims that Sora handles occlusion well. One problem with existing models is that they can fail to keep track of objects when they drop out of view. For example, if a truck passes in front of a street sign, the sign might not reappear afterward.  

In a video of a papercraft underwater scene, Sora has added what look like cuts between different pieces of footage, and the model has maintained a consistent style between them.

It’s not perfect. In the Tokyo video, cars to the left look smaller than the people walking beside them. They also pop in and out between the tree branches. “There’s definitely some work to be done in terms of long-term coherence,” says Brooks. “For example, if someone goes out of view for a long time, they won’t come back. The model kind of forgets that they were supposed to be there.”

Impressive as they are, the sample videos shown here were no doubt cherry-picked to show Sora at its best. Without more information, it is hard to know how representative they are of the model’s typical output.   

It may be some time before we find out. OpenAI’s announcement of Sora today is a tech tease, and the company says it has no current plans to release it to the public. Instead, OpenAI will today begin sharing the model with third-party safety testers for the first time.

In particular, the firm is worried about the potential misuses of fake but photorealistic video. “We’re being careful about deployment here and making sure we have all our bases covered before we put this in the hands of the general public,” says Aditya Ramesh, a scientist at OpenAI, who created the firm’s text-to-image model DALL-E.

But OpenAI is eyeing a product launch sometime in the future. As well as safety testers, the company is also sharing the model with a select group of video makers and artists to get feedback on how to make Sora as useful as possible to creative professionals. “The other goal is to show everyone what is on the horizon, to give a preview of what these models will be capable of,” says Ramesh.

To build Sora, the team adapted the tech behind DALL-E 3, the latest version of OpenAI’s flagship text-to-image model. Like most text-to-image models, DALL-E 3 uses what’s known as a diffusion model. These are trained to turn a fuzz of random pixels into a picture.

Sora takes this approach and applies it to videos rather than still images. But the researchers also added another technique to the mix. Unlike DALL-E or most other generative video models, Sora combines its diffusion model with a type of neural network called a transformer.

Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models like OpenAI’s GPT-4 and Google DeepMind’s Gemini. But videos are not made of words. Instead, the researchers had to find a way to cut videos into chunks that could be treated as if they were. The approach they came up with was to dice videos up across both space and time. “It’s like if you were to have a stack of all the video frames and you cut little cubes from it,” says Brooks.

The transformer inside Sora can then process these chunks of video data in much the same way that the transformer inside a large language model processes words in a block of text. The researchers say that this let them train Sora on many more types of video than other text-to-video models, varied in terms of resolution, duration, aspect ratio, and orientation. “It really helps the model,” says Brooks. “That is something that we’re not aware of any existing work on.”

“From a technical perspective it seems like a very significant leap forward,” says Sam Gregory, executive director at Witness, a human rights organization that specializes in the use and misuse of video technology. “But there are two sides to the coin,” he says. “The expressive capabilities offer the potential for many more people to be storytellers using video. And there are also real potential avenues for misuse.” 

OpenAI is well aware of the risks that come with a generative video model. We are already seeing the large-scale misuse of deepfake images. Photorealistic video takes this to another level.

Gregory notes that you could use technology like this to misinform people about conflict zones or protests. The range of styles is also interesting, he says. If you could generate shaky footage that looked like something shot with a phone, it would come across as more authentic.

The tech is not there yet, but generative video has gone from zero to Sora in just 18 months. “We’re going to be entering a universe where there will be fully synthetic content, human-generated content and a mix of the two,” says Gregory.

The OpenAI team plans to draw on the safety testing it did last year for DALL-E 3. Sora already includes a filter that runs on all prompts sent to the model that will block requests for violent, sexual, or hateful images, as well as images of known people. Another filter will look at frames of generated videos and block material that violates OpenAI’s safety policies.

OpenAI says it is also adapting a fake-image detector developed for DALL-E 3 to use with Sora. And the company will embed industry-standard C2PA tags, metadata that states how an image was generated, into all of Sora’s output. But these steps are far from foolproof. Fake-image detectors are hit-or-miss. Metadata is easy to remove, and most social media sites strip it from uploaded images by default.

“We’ll definitely need to get more feedback and learn more about the types of risks that need to be addressed with video before it would make sense for us to release this,” says Ramesh.

Brooks agrees. “Part of the reason that we’re talking about this research now is so that we can start getting the input that we need to do the work necessary to figure out how it could be safely deployed,” he says.

Update 2/15: Comments from Sam Gregory were added.



Related papers and resources

  1. Transformer Design and Optimization: A Literature Survey

    This paper conducts a literature survey and reveals general backgrounds of research and developments in the field of transformer design and optimization for the past 35 years, based on more than 420 published articles, 50 transformer books, and 65 standards. Published in: IEEE Transactions on Power Delivery (Volume: 24, Issue: 4, October 2009)

  2. Transformer basics

    Abstract: A transformer is a device that transfers electrical energy from one circuit to another by magnetic coupling without requiring relative motion between its parts. It usually comprises two or more coupled windings, and, in most cases, a core to concentrate magnetic flux.

  3. Multimodal Learning With Transformers: A Survey

    Abstract: Transformer is a promising neural network learner, and has achieved great success in various machine learning tasks. Thanks to the recent prevalence of multimodal applications and Big Data, Transformer-based multimodal learning has become a hot topic in AI research.

  4. A Survey on Vision Transformer

    Abstract: Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks.

  5. PDF Transformer Design & Design Parameters

    (ANSI) IEEE C57.12.90-2010, standard test code for liquid-immersed distribution, power and regulating transformers and guide for short-circuit testing of distribution and power transformers NEMA standards publication no. TR1-2013; transformers, regulators and reactors Canada

  6. An Efficient Convolutional Multi-Scale Vision Transformer ...

    This paper introduces an innovative and efficient multi-scale Vision Transformer (ViT) for the task of image classification. The proposed model leverages the inherent power of transformer architecture and combines it with the concept of multi-scale processing generally used in convolutional neural networks (CNNs). The work aims to address the limitations of conventional ViTs which typically ...

  7. Machine Learning Approaches for Transformer Modeling

    Abstract: In this paper, several machine learning modeling methodologies are applied to accurately and efficiently model transformers, which are still a bottleneck in millimeter-wave circuit design. In order to compare the models, a statistical validation is performed against electromagnetic simulations using hundreds of passive structures.

  8. Vision Transformers: A Review of Architecture ...

    In recent years, the development of deep learning has revolutionized the field of computer vision, especially the convolutional neural networks (CNNs), which become the preferred approach for numerous tasks handling images. However, CNNs have difficulty interpreting massive and complicated datasets, which has led to the creation of alternative architectures such as vision transformers. The ...

  9. Transformer Approaches in Image Captioning: A Literature Review

    This paper presents a literature review of image captioning using transformer methods. The literature is reviewed from reputable journals and conferences. Our review focus on transformer approaches in order to improve the model performance in image captioning. We also explore the existing public datasets that are used in image captioning.

  10. [2106.04554] A Survey of Transformers

    Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu. Transformers have achieved great success in many artificial intelligence fields, such as natural language processing, computer vision, and audio processing. Therefore, it is natural to attract lots of interest from academic and industry researchers.

  11. [2103.14030] Swin Transformer: Hierarchical Vision Transformer using

    This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

  12. [2111.06091] A Survey of Visual Transformers

    Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He. Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works ...

  13. IS-GGT: Iterative Scene Graph Generation with Generative Transformers

    IS-GGT: Iterative Scene Graph Generation with Generative Transformers. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Retrieved from https://par.nsf.gov/biblio ...

  14. A comprehensive survey on applications of transformers for deep

    The paper follows the organization depicted in the visual abstract shown in Fig. 1. To provide context, the motivation behind this paper has been discussed in the current section, and Section 2 explains the preliminary concepts essential for the rest of the paper. A comprehensive account of the systematic methodology used to search for relevant research articles is detailed in Section 3.

  15. Dual-Transformer-Based DAB Converter with Controllable ...

    A dual-transformer-based dual active bridge (DT DAB) converter is proposed in this paper. The proposed DT DAB has small input current ripple and a wide voltage range because the magnetizing inductances of the dual transformers can be used as the input filter inductances at the same time. In addition, the zero-voltage-switching (ZVS) of all switches in the full load range under wide range input ...

  16. Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent

    The search for a general model that can operate seamlessly across multiple domains remains a key goal in machine learning research. The prevailing methodology in Reinforcement Learning (RL) typically limits models to a single task within a unimodal framework, a limitation that contrasts with the broader vision of a versatile, multi-domain model. In this paper, we present Jack of All Trades ...

  17. [2010.11929] An Image is Worth 16x16 Words: Transformers for Image

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain ...

  18. [2402.08132] On the Resurgence of Recurrent Models for Long Sequences

    A longstanding challenge for the Machine Learning community is the one of developing models that are capable of processing and learning from very long sequences of data. The outstanding results of Transformers-based networks (e.g., Large Language Models) promotes the idea of parallel attention as the key to succeed in such a challenge, obfuscating the role of classic sequential processing of ...

  19. PDF A Survey of Transformers

    Transformer [137] is a prominent deep learning model that has been widely adopted in various fields, such as natural language processing (NLP), computer vision (CV) and speech processing. Transformer was originally proposed as a sequence-to-sequence model [130] for machine translation.

  20. NLP Research Based on Transformer Model

    Abstract: Natural language processing technology is an important research area in artificial intelligence which occupies a pivotal position in deep learning. This paper describes in detail the research of NLP based on Transformer structure, thus showing its ultra-high performance and development prospects.

  21. Transformer Design and Optimization: A Literature Survey

    This paper conducts a literature survey and reveals general backgrounds of research and developments in the field of transformer design and optimization for the past 35 years, based on more...

  22. [2206.06488] Multimodal Learning with Transformers: A Survey

    The main contents of this survey include: (1) a background of multimodal learning, Transformer ecosystem, and the multimodal big data era, (2) a theoretical review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective, (3) a review of multimodal Transformer applications, via two imp...

  23. [2402.08975] Research and application of Transformer based anomaly

    Transformer, as one of the most advanced neural network models in Natural Language Processing (NLP), exhibits diverse applications in the field of anomaly detection. To inspire research on Transformer-based anomaly detection, this review offers a fresh perspective on the concept of anomaly detection. We explore the current challenges of anomaly detection and provide detailed insights into the ...

  24. SCGFormer: Semantic Chebyshev Graph Convolution Transformer for ...


  25. Video generation models as world simulators

    We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video ...

  26. Free IEEE Citation Generator

    Search for your source by title, URL, DOI, ISBN, and more to retrieve the relevant information automatically. Scribbr's Citation Generator supports a variety of other styles in addition to IEEE: try citing in APA, MLA, Chicago, and more. Export to BibTeX/BibLaTeX is also supported.

  27. [1706.03762] Attention Is All You Need

    The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely ...

  28. IEEE Paper Format

    Published on August 24, 2022 by Jack Caulfield. Revised on April 6, 2023. IEEE provides guidelines for formatting your paper. These guidelines must be followed when you're submitting a manuscript for publication in an IEEE journal. Some of the key guidelines are: Formatting the text as two columns, in Times New Roman, 10 pt.

  29. OpenAI teases an amazing new generative video model called Sora

    Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models like OpenAI's GPT-4 and Google DeepMind's Gemini ...